
Texts in Statistical Science

Statistical Methods
for Spatial Data Analysis
CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Series Editors
Bradley P. Carlin, University of Minnesota, USA
Chris Chatfield, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada

Analysis of Failure and Survival Data (Peter J. Smith)
The Analysis and Interpretation of Multivariate Data for Social Scientists (David J. Bartholomew, Fiona Steele, Irini Moustaki, and Jane Galbraith)
The Analysis of Time Series — An Introduction, Sixth Edition (Chris Chatfield)
Applied Bayesian Forecasting and Time Series Analysis (A. Pole, M. West and J. Harrison)
Applied Nonparametric Statistical Methods, Third Edition (P. Sprent and N.C. Smeeton)
Applied Statistics — Handbook of GENSTAT Analysis (E.J. Snell and H. Simpson)
Applied Statistics — Principles and Examples (D.R. Cox and E.J. Snell)
Bayes and Empirical Bayes Methods for Data Analysis, Second Edition (Bradley P. Carlin and Thomas A. Louis)
Bayesian Data Analysis, Second Edition (Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin)
Beyond ANOVA — Basics of Applied Statistics (R.G. Miller, Jr.)
Computer-Aided Multivariate Analysis, Fourth Edition (A.A. Afifi and V.A. Clark)
A Course in Categorical Data Analysis (T. Leonard)
A Course in Large Sample Theory (T.S. Ferguson)
Data Driven Statistical Methods (P. Sprent)
Decision Analysis — A Bayesian Approach (J.Q. Smith)
Elementary Applications of Probability Theory, Second Edition (H.C. Tuckwell)
Elements of Simulation (B.J.T. Morgan)
Epidemiology — Study Design and Data Analysis, Second Edition (M. Woodward)
Essential Statistics, Fourth Edition (D.A.G. Rees)
A First Course in Linear Model Theory (Nalini Ravishanker and Dipak K. Dey)
Interpreting Data — A First Course in Statistics (A.J.B. Anderson)
An Introduction to Generalized Linear Models, Second Edition (A.J. Dobson)
Introduction to Multivariate Analysis (C. Chatfield and A.J. Collins)
Introduction to Optimization Methods and Their Applications in Statistics (B.S. Everitt)
Large Sample Methods in Statistics (P.K. Sen and J. da Motta Singer)
Linear Models with R (Julian J. Faraway)
Markov Chain Monte Carlo — Stochastic Simulation for Bayesian Inference (D. Gamerman)
Mathematical Statistics (K. Knight)
Modeling and Analysis of Stochastic Systems (V. Kulkarni)
Modelling Binary Data, Second Edition (D. Collett)
Modelling Survival Data in Medical Research, Second Edition (D. Collett)
Multivariate Analysis of Variance and Repeated Measures — A Practical Approach for Behavioural Scientists (D.J. Hand and C.C. Taylor)
Multivariate Statistics — A Practical Approach (B. Flury and H. Riedwyl)
Practical Data Analysis for Designed Experiments (B.S. Yandell)
Texts in Statistical Science

Statistical Methods
for Spatial Data Analysis

Oliver Schabenberger
Carol A. Gotway

Boca Raton London New York


Published in 2005 by
Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2005 by Taylor & Francis Group, LLC


Chapman & Hall/CRC is an imprint of Taylor & Francis Group

No claim to original U.S. Government works


Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2

International Standard Book Number-10: 1-58488-322-7 (Hardcover)


International Standard Book Number-13: 978-1-58488-322-7 (Hardcover)

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with
permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish
reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials
or for the consequences of their use.
No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or
other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Catalog record is available from the Library of Congress

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the CRC Press Web site at http://www.crcpress.com. Taylor & Francis Group is the Academic Division of T&F Informa plc.
To Lisa and Charlie
Contents

Preface xv

1 Introduction 1
1.1 The Need for Spatial Analysis 1
1.2 Types of Spatial Data 6
1.2.1 Geostatistical Data 7
1.2.2 Lattice Data, Regional Data 8
1.2.3 Point Patterns 11
1.3 Autocorrelation—Concept and Elementary Measures 14
1.3.1 Mantel’s Tests for Clustering 14
1.3.2 Measures on Lattices 18
1.3.3 Localized Indicators of Spatial Autocorrelation 23
1.4 Autocorrelation Functions 25
1.4.1 The Autocorrelation Function of a Time Series 25
1.4.2 Autocorrelation Functions in Space—Covariance and
Semivariogram 26
1.4.3 From Mantel’s Statistic to the Semivariogram 29
1.5 The Effects of Autocorrelation on Statistical Inference 31
1.5.1 Effects on Prediction 32
1.5.2 Effects on Precision of Estimators 34
1.6 Chapter Problems 37

2 Some Theory on Random Fields 41


2.1 Stochastic Processes and Samples of Size One 41
2.2 Stationarity, Isotropy, and Heterogeneity 42
2.3 Spatial Continuity and Differentiability 48
2.4 Random Fields in the Spatial Domain 52
2.4.1 Model Representation 53
2.4.2 Convolution Representation 57
2.5 Random Fields in the Frequency Domain 62
2.5.1 Spectral Representation of Deterministic Functions 62
2.5.2 Spectral Representation of Random Processes 65
2.5.3 Covariance and Spectral Density Function 66
2.5.4 Properties of Spectral Distribution Functions 70
2.5.5 Continuous and Discrete Spectra 72
2.5.6 Linear Location-Invariant Filters 74

2.5.7 Importance of Spectral Analysis 77


2.6 Chapter Problems 78

3 Mapped Point Patterns 81


3.1 Random, Aggregated, and Regular Patterns 81
3.2 Binomial and Poisson Processes 83
3.2.1 Bernoulli and Binomial Processes 83
3.2.2 Poisson Processes 84
3.2.3 Process Equivalence 85
3.3 Testing for Complete Spatial Randomness 86
3.3.1 Monte Carlo Tests 87
3.3.2 Simulation Envelopes 88
3.3.3 Tests Based on Quadrat Counts 90
3.3.4 Tests Based on Distances 97
3.4 Second-Order Properties of Point Patterns 99
3.4.1 The Reduced Second Moment Measure—
The K-Function 101
3.4.2 Estimation of K- and L-Functions 102
3.4.3 Assessing the Relationship between Two Patterns 103
3.5 The Inhomogeneous Poisson Process 107
3.5.1 Estimation of the Intensity Function 110
3.5.2 Estimating the Ratio of Intensity Functions 112
3.5.3 Clustering and Cluster Detection 114
3.6 Marked and Multivariate Point Patterns 118
3.6.1 Extensions 118
3.6.2 Intensities and Moment Measures for Multivariate
Point Patterns 120
3.7 Point Process Models 122
3.7.1 Thinning and Clustering 123
3.7.2 Clustered Processes 125
3.7.3 Regular Processes 128
3.8 Chapter Problems 129

4 Semivariogram and Covariance Function Analysis and Estimation 133
4.1 Introduction 133
4.2 Semivariogram and Covariogram 135
4.2.1 Definition and Empirical Counterparts 135
4.2.2 Interpretation as Structural Tools 138
4.3 Covariance and Semivariogram Models 141
4.3.1 Model Validity 141
4.3.2 The Matérn Class of Covariance Functions 143
4.3.3 The Spherical Family of Covariance Functions 145
4.3.4 Isotropic Models Allowing Negative Correlations 146
4.3.5 Basic Models Not Second-Order Stationary 149
4.3.6 Models with Nugget Effects and Nested Models 150

4.3.7 Accommodating Anisotropy 151


4.4 Estimating the Semivariogram 153
4.4.1 Matheron’s Estimator 153
4.4.2 The Cressie-Hawkins Robust Estimator 159
4.4.3 Estimators Based on Order Statistics and Quantiles 161
4.5 Parametric Modeling 163
4.5.1 Least Squares and the Semivariogram 164
4.5.2 Maximum and Restricted Maximum Likelihood 166
4.5.3 Composite Likelihood and Generalized Estimating
Equations 169
4.5.4 Comparisons 172
4.6 Nonparametric Estimation and Modeling 178
4.6.1 The Spectral Approach 179
4.6.2 The Moving-Average Approach 183
4.6.3 Incorporating a Nugget Effect 186
4.7 Estimation and Inference in the Frequency Domain 188
4.7.1 The Periodogram on a Rectangular Lattice 190
4.7.2 Spectral Density Functions 198
4.7.3 Analysis of Point Patterns 200
4.8 On the Use of Non-Euclidean Distances in Geostatistics 204
4.8.1 Distance Metrics and Isotropic Covariance Functions 205
4.8.2 Multidimensional Scaling 206
4.9 Supplement: Bessel Functions 210
4.9.1 Bessel Function of the First Kind 210
4.9.2 Modified Bessel Functions of the First and Second Kind 210
4.10 Chapter Problems 211

5 Spatial Prediction and Kriging 215


5.1 Optimal Prediction in Random Fields 215
5.2 Linear Prediction—Simple and Ordinary Kriging 221
5.2.1 The Mean Is Known—Simple Kriging 223
5.2.2 The Mean Is Unknown and Constant—Ordinary Kriging 226
5.2.3 Effects of Nugget, Sill, and Range 228
5.3 Linear Prediction with a Spatially Varying Mean 232
5.3.1 Trend Surface Models 234
5.3.2 Localized Estimation 238
5.3.3 Universal Kriging 241
5.4 Kriging in Practice 243
5.4.1 On the Uniqueness of the Decomposition 243
5.4.2 Local Versus Global Kriging 244
5.4.3 Filtering and Smoothing 248
5.5 Estimating Covariance Parameters 254
5.5.1 Least Squares Estimation 256
5.5.2 Maximum Likelihood 259
5.5.3 Restricted Maximum Likelihood 261

5.5.4 Prediction Errors When Covariance Parameters Are Estimated 263
5.6 Nonlinear Prediction 267
5.6.1 Lognormal Kriging 267
5.6.2 Trans-Gaussian Kriging 270
5.6.3 Indicator Kriging 278
5.6.4 Disjunctive Kriging 279
5.7 Change of Support 284
5.7.1 Block Kriging 285
5.7.2 The Multi-Gaussian Approach 289
5.7.3 The Use of Indicator Data 290
5.7.4 Disjunctive Kriging and Isofactorial Models 290
5.7.5 Constrained Kriging 291
5.8 On the Popularity of the Multivariate Gaussian Distribution 292
5.9 Chapter Problems 295

6 Spatial Regression Models 299


6.1 Linear Models with Uncorrelated Errors 301
6.1.1 Ordinary Least Squares—Inference and Diagnostics 303
6.1.2 Working with OLS Residuals 307
6.1.3 Spatially Explicit Models 316
6.2 Linear Models with Correlated Errors 321
6.2.1 Mixed Models 325
6.2.2 Spatial Autoregressive Models 335
6.2.3 Generalized Least Squares—Inference and Diagnostics 341
6.3 Generalized Linear Models 352
6.3.1 Background 352
6.3.2 Fixed Effects and the Marginal Specification 354
6.3.3 A Caveat 355
6.3.4 Mixed Models and the Conditional Specification 356
6.3.5 Estimation in Spatial GLMs and GLMMs 359
6.3.6 Spatial Prediction in GLMs 369
6.4 Bayesian Hierarchical Models 383
6.4.1 Prior Distributions 385
6.4.2 Fitting Bayesian Models 386
6.4.3 Selected Spatial Models 390
6.5 Chapter Problems 400

7 Simulation of Random Fields 405


7.1 Unconditional Simulation of Gaussian Random Fields 406
7.1.1 Cholesky (LU) Decomposition 407
7.1.2 Spectral Decomposition 407
7.2 Conditional Simulation of Gaussian Random Fields 407
7.2.1 Sequential Simulation 408
7.2.2 Conditioning a Simulation by Kriging 409
7.3 Simulated Annealing 409

7.4 Simulating from Convolutions 413


7.5 Simulating Point Processes 418
7.5.1 Homogeneous Poisson Process on the Rectangle (0, 0) ×
(a, b) with Intensity λ 418
7.5.2 Inhomogeneous Poisson Process with Intensity λ(s) 419
7.6 Chapter Problems 419

8 Non-Stationary Covariance 421


8.1 Types of Non-Stationarity 421
8.2 Global Modeling Approaches 422
8.2.1 Parametric Models 422
8.2.2 Space Deformation 423
8.3 Local Stationarity 425
8.3.1 Moving Windows 425
8.3.2 Convolution Methods 426
8.3.3 Weighted Stationary Processes 428

9 Spatio-Temporal Processes 431


9.1 A New Dimension 431
9.2 Separable Covariance Functions 434
9.3 Non-Separable Covariance Functions 435
9.3.1 Monotone Function Approach 436
9.3.2 Spectral Approach 436
9.3.3 Mixture Approach 438
9.3.4 Differential Equation Approach 439
9.4 The Spatio-Temporal Semivariogram 440
9.5 Spatio-Temporal Point Processes 442
9.5.1 Types of Processes 442
9.5.2 Intensity Measures 443
9.5.3 Stationarity and Complete Randomness 444

References 447

Author Index 463

Subject Index 467


Preface

The study of statistical methods for spatial data analysis presents challenges
that are fairly unique within the statistical sciences. Like few other areas,
spatial statistics draws on and brings together philosophies, methodologies,
and techniques that are typically taught separately in statistical curricula.
Understanding spatial statistics requires tools from applied statistics, mathe-
matical statistics, linear model theory, regression, time series, and stochastic
processes. It also requires a different mindset, one focused on the unique char-
acteristics of spatial data, and additional analytical tools designed explicitly
for spatial data analysis.
When preparing graduate level courses in spatial statistics for the first time,
we each struggled to pull together all the ingredients necessary to present the
material in a cogent manner at an accessible, practical level that did not tread
too lightly on theoretical foundations. This book ultimately began with our
efforts to resolve this struggle. It has its foundations in our own experience,
almost 30 years combined, with the analysis of spatial data in a variety of
disciplines, and in our efforts to keep pace with the new tools and techniques
in this diverse and rapidly evolving field.
The methods and techniques discussed in this text by no means provide
a complete accounting of statistical approaches in the analysis of spatial data.
Weighty monographs are available on any one of the main chapters. Instead,
our goal is a comprehensive and illustrative treatment of the basic statistical
theory and methods for spatial data analysis. Our approach is mostly model-
based and frequentist in nature, with an emphasis on models in the spatial,
and not the spectral, domain. Geostatistical methods that developed largely
outside of the statistical mainstream, e.g., kriging methods, can be cast easily
in terms of prediction theory based on statistical regression models. Focusing
on a model formulation allows us to discuss prediction and estimation in the
same general framework. But many derivations and results in spatial statistics
either arise from representations in the spectral domain or are best tackled
in this domain, so spectral representations appear throughout. We added a
section on spectral domain estimation (§4.7) that can be incorporated in a
course together with the background material in §2.5. While we concentrate
on frequentist methods for spatial data analysis, we also recognize the utility
of Bayesian hierarchical models. However, since these models are complex
and intricate, we leave their discussion until Chapter 6, after much of the
foundation of spatial statistics necessary to understand and interpret them has been developed.
The tools and approaches we consider essential comprise Chapters 1–7.
Chapter 1, while introductory, also provides a first description of the basic
measures of spatial autocorrelation and their role in the analysis of spatial
data. Chapter 2 provides the background and theoretical framework of ran-
dom fields necessary for subsequent chapters, particularly Chapters 4 and
5. We begin the heart of statistical methods for spatial data analysis with
mapped point patterns in Chapter 3. Since a good understanding of spatial
autocorrelation is necessary for spatial analysis, estimation and modeling of
the covariance function and semivariogram are treated in detail in Chapter
4. This leads easily into spatial prediction and kriging in Chapter 5. One of
the most important chapters is Chapter 6 on spatial regression. It is unique
in its comprehensiveness, beginning with linear models with uncorrelated er-
rors and ending with a succinct discussion of Bayesian hierarchical models
for spatial data. It also provides a discussion of model diagnostics for linear
and generalized linear spatial models. Chapter 7 is devoted to the simulation
of spatial data since we believe simulation is an essential component of sta-
tistical methods for spatial data analysis and one that is often overlooked in
most other textbooks in spatial statistics. The chapters on non-stationary co-
variance (Chapter 8) and spatio-temporal models (Chapter 9) are primarily a
review of a rapidly evolving and emerging field. These chapters are supplemen-
tal to the core of the text, but will be useful to Ph.D. students in statistics
and others who desire a brief, concise, but relatively thorough overview of
these topics.
The book is intended as a text for a one-semester graduate level course in
spatial statistics. It is assumed that the reader/student is familiar with linear
model theory, the requisite linear algebra, and a good, working knowledge of
matrix algebra. A strong foundation in fundamental statistical theory typ-
ically taught in a first course in probability and inference (e.g., probability
distributions, conditional expectation, maximum likelihood estimation) is es-
sential. Stochastic process theory is not assumed; the needed ingredients are
discussed in Chapter 2. The text does not require prior exposure to spectral
methods. There are problems at the end of the main chapters that can be
worked into course material.
It is difficult to write a book of this scope without the help and input of
many people. First and foremost, we need to express our deepest appreciation
to our spouses, Lisa Schabenberger and Charlie Crawford, who graciously
allowed us ample time to work on this project. We are also fortunate to have
many friends within the greater statistical community and many of them
undoubtedly influenced our work. In particular, we are grateful for the insights
of Noel Cressie, Linda Young, Konstantin Krivoruchko, Lance Waller, Jay Ver
Hoef, and Montserrat Fuentes. We are indebted to colleagues who shared their
data with us. Thanks to Thomas Mueller for the C/N ratio data, to Jeffrey
Walters for the woodpecker data, to Vaisala Inc. for the lightning strikes data,
to Victor De Oliveira for the rainfall data, and to Felix Rogers for the low
birth weight data. Thanks to Charlie for being there when shape files go bad;
his knowledge of Virginia and his patience with the “dry as toast” material
in the book are much appreciated.
Finding time to write was always a challenge, and we are indebted to David
Olson for supporting Carol through her struggles with government bureau-
cracy. Finally, we learned to value our close collaboration and critical thinking
ability, enjoying the time we spent discussing and debating recent develop-
ments in spatial statistics, and combining our different ideas like pieces of a
big puzzle. We hope the book reflects the integration of our knowledge and
our common philosophy about statistics and spatial data analysis. A special
thanks to our editor at CRC Press, Bob Stern, for his vision, support, and
patience.
The material in the book will be supplemented with additional material
provided through the CRC Press Web site (www.crcpress.com). The site will
provide many of the data sets used as examples in the text, software code
that can be used to implement many of the principal methods described and
illustrated in the text, as well as updates and corrections to the text itself.
We welcome additions, corrections, and discussions for this Web page so that
it can make statistical methods for spatial data analysis useful to scientists
across many disciplines.

Oliver Schabenberger, Cary, North Carolina
Carol A. Gotway Crawford, Atlanta, Georgia
CHAPTER 1

Introduction

1.1 The Need for Spatial Analysis

Statistical methods for spatial data analysis play an ever increasing role in
the toolbox of the statistician, scientist, and practitioner. Over the years,
these methods have evolved into a self-contained discipline which continues to
grow and develop and has produced a specific vocabulary. Characteristic of
spatial statistics is its immense methodological diversity. In part, this is due
to its many origins. Some of the methods developed outside of mainstream
statistics in geology, geography, meteorology, and other subject matter areas.
Some are rooted in traditional statistical areas such as linear models and
response surface theory. Others are derived from time series approaches or
stochastic process theory. Many methods have undergone specific adaptations
to cope with the specific challenges presented by, for example, the fact that
spatial processes are not equivalent to two-dimensional time series processes.
The novice studying spatial statistics is thus challenged to absorb and combine
varied tools and concepts, revisit notions of randomness and data generating
mechanisms, and to befriend a new vernacular.
Perhaps the foremost reason for studying spatial statistics is that we are
often not only interested in answering the “how much” question, but the “how
much is where” question. Many empirical data contain not only information
about the attribute of interest—the response being studied—but also other
variables that denote the geographic location where the particular response
was observed. In certain instances, the data may consist of location infor-
mation only. A plant ecologist, for example, records the locations within a
particular habitat where a rare plant species can be found. It behooves us to
utilize this information in statistical inference provided it contributes mean-
ingfully to the analysis.
Most authors writing about statistical methods for spatial data will argue
that one of the key features of spatial data is the autocorrelation of observa-
tions in space. Observations in close spatial proximity tend to be more similar
than is expected for observations that are more spatially separated. While
correlations between observations are not a defining feature of spatial data,
there are many instances in which characterizing spatial correlation is of pri-
mary analytical interest. It would also be shortsighted to draw a line between
“classical” statistical modeling and spatial modeling because of the existence
of correlations. Many elementary models exhibit correlations.

Example 1.1 A drug company is interested in testing the uniformity of its buffered aspirin tablets for quality control purposes. A random sample
of three batches of aspirin tablets is collected from each of the company’s
two manufacturing facilities. Five tablets are then randomly selected from
each batch and the diameter of each tablet is measured. This experiment is a
nested design with t = 2 manufacturing facilities, r = 3 batches per facility,
rt = 6 experimental units (the batches of aspirin), and five sub-samples per
experimental unit (n = 5), the diameter measurements made on the tablets
within each batch. The experimental design can be expressed by the linear
model
$$Y_{ijk} = \mu + \tau_i + e_{ij} + \epsilon_{ijk}, \qquad i = 1, \ldots, t = 2,\; j = 1, \ldots, r = 3,\; k = 1, \ldots, n = 5.$$
The eij represent the variability among the batches within a manufacturing facility and the εijk terms represent the variability in the diameters of the tablets within the batches, i.e., the sub-sampling errors. By virtue of the random sampling of the batches, the eij are uncorrelated. The εijk are uncorrelated by virtue of randomly sampling the tablets from each batch. How does this experiment induce correlations? Measurements made on different experimental units are uncorrelated, but for two tablets from the same batch:
$$\mathrm{Cov}[Y_{ijk}, Y_{ijk'} \mid e_{ij}] = \mathrm{Cov}[e_{ij} + \epsilon_{ijk},\, e_{ij} + \epsilon_{ijk'} \mid e_{ij}] = \mathrm{Cov}[\epsilon_{ijk}, \epsilon_{ijk'}] = 0$$
$$\begin{aligned}
\mathrm{Cov}[Y_{ijk}, Y_{ijk'}] &= \mathrm{Cov}[e_{ij} + \epsilon_{ijk},\, e_{ij} + \epsilon_{ijk'}] \\
&= \mathrm{Cov}[e_{ij}, e_{ij}] + \mathrm{Cov}[e_{ij}, \epsilon_{ijk'}] + \mathrm{Cov}[\epsilon_{ijk}, e_{ij}] + \mathrm{Cov}[\epsilon_{ijk}, \epsilon_{ijk'}] \\
&= \mathrm{Var}[e_{ij}] + 0 + 0 + 0.
\end{aligned}$$
The data measured on the same batch are correlated because they share the
same random effect, the experimental error eij associated with that unit.
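
The induced covariance can be checked by simulation. The following minimal sketch is not from the text: the variance components, the omission of the fixed facility effects τi, and the use of Python with NumPy are all assumptions made purely for illustration. It generates many independent replicates of two tablets from the same batch and compares their empirical covariance with Var[eij]:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma_e, sigma_eps = 5.0, 0.4, 0.2   # assumed values for illustration only
n_sim = 20000                            # independent replicates of one batch

# Y_ijk = mu + e_ij + eps_ijk (tau_i omitted); both tablets share the batch effect e_ij
e = rng.normal(0.0, sigma_e, size=n_sim)         # batch effect e_ij
eps1 = rng.normal(0.0, sigma_eps, size=n_sim)    # sub-sampling error of tablet 1
eps2 = rng.normal(0.0, sigma_eps, size=n_sim)    # sub-sampling error of tablet 2
y1 = mu + e + eps1
y2 = mu + e + eps2

print("empirical Cov[Y_ij1, Y_ij2]:", np.cov(y1, y2)[0, 1])  # close to sigma_e^2
print("Var[e_ij] = sigma_e^2:", sigma_e ** 2)
```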

During the past two decades we witnessed tremendous progress in the anal-
ysis of another type of correlated data: longitudinal data and repeated mea-
sures. It is commonly assumed that correlations exist among the observations
repeatedly collected for the same unit or subject. An important aspect of such
data is that the repeated measures are made according to some metric such
as time, length, depth, etc. This metric typically plays some role in expressing
just how the correlations evolve with time or distance. Models for longitudinal
and repeated measures data bring us thus closer to models for spatial data,
but important differences remain. Consider, for example, a longitudinal study
of high blood pressure patients, where s subjects are selected at random from
some (large) population. At certain time intervals t1 , · · · , tni , (i = 1, · · · , s)
the blood pressure of the ith patient is measured along with other variables,
e.g., smoking status, exercise and diet habits, medication. Thus, we observe a
value of Yi (tj ), j = 1, · · · , ni , the blood pressure of patient i at time tj . Al-
though the observations taken on a particular patient are most likely serially
correlated, these data fall within the realm of traditional random sampling.
We consider the vectors Yi = [Yi(t1), · · · , Yi(tni)]′ as independent random
vectors because patients were selected at random. A statistical model for the
data from the ith patient might be
$$Y_i = X_i \beta + e_i, \qquad (1.1)$$
where Xi is a known (ni ×p) matrix of regressor and design variables associated
with subject i. The coefficient vector β can be fixed, a random vector, or have
elements of both types. Assume for this example that β is a vector of fixed
effects coefficients and that ei ∼ (0, Vi (θ l )). The matrix Vi (θ l ) contains the
variances of and covariances between the observations from the same patient.
These are functions of the elements of the vector θ l . We call θ the vector of
covariance parameters in this text, because of its relationship to the covariance
matrix of the observations.
The analyst might be concerned with
1. estimating the parameters β and θ l of the data-generating mechanism,
2. testing hypotheses about these parameters,
3. estimating the mean vector E[Yi ] = Xi β,
4. predicting the future blood pressure value of a patient at time t.
To contrast this example with a spatial one, consider collecting n observa-
tions on the yield of a particular crop from an agricultural field. The analyst
may raise questions similar to those in the longitudinal study:
1. about the parameters describing the data-generating mechanism,
2. about the average yield in the northern half of the field compared to the
southern-most area where the field surface is sloped,
3. about the average yield on the field,
4. about the crop yield at an unobserved location.
Since the questions are so similar in their reference to estimates, parame-
ters, averages, hypotheses, and unobserved events, maybe a statistical model
similar to the one in the longitudinal study can form the basis of inference?
One suggestion is to model the yield Z at spatial location s = [x, y]′ as
$$Z(s) = x_s' \alpha + \nu,$$
where xs is a vector of known covariates, α is the coefficient vector, and ν ∼ (0, σ²). A point s in the agricultural field is identified by its x and y coordinate in the plane. If we collect the n observations into a single vector, Z(s) = [Z(s1), Z(s2), · · · , Z(sn)]′, this spatial model becomes
$$Z(s) = X_s \alpha + \nu, \qquad \nu \sim (0, \Sigma(\theta_s)), \qquad (1.2)$$
where Σ is the covariance matrix of the vector Z(s). The subscript s is used
to identify a component of the spatial model; the subscript l is used for a
component of the longitudinal model to avoid confusion.

Similarly, in the longitudinal case, the observations from different subjects can be collected into a single vector, Y = [Y1′, Y2′, · · · , Ys′]′, and model (1.1) can be written as
$$Y = X_l \beta + e, \qquad (1.3)$$
where Xl is a stacking of X1, X2, · · · , Xs, and e is a random vector with mean 0 and covariance matrix
$$\mathrm{Var}[e] = V(\theta_l) = \begin{bmatrix} V_1(\theta_l) & 0 & \cdots & 0 \\ 0 & V_2(\theta_l) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & V_s(\theta_l) \end{bmatrix}.$$

The models for Y and Z(s) are rather similar. Both suggest some form of
generalized least squares (GLS) estimation for the fixed effects coefficients,
$$\hat{\beta} = (X_l' V(\theta_l)^{-1} X_l)^{-1} X_l' V(\theta_l)^{-1} Y$$
$$\hat{\alpha} = (X_s' \Sigma(\theta_s)^{-1} X_s)^{-1} X_s' \Sigma(\theta_s)^{-1} Z(s).$$
Also common to both models is that θ is unknown and must be estimated. The estimates β̂ and α̂ are efficient, but unattainable. In Chapters 4–6 the
issue of estimating covariance parameters is visited repeatedly.
However, there are important differences between models (1.3) and (1.2)
that imperil the transition from the longitudinal to the spatial application.
The differences have technical, statistical, and computational implications.
Arguably the most important difference between the longitudinal and the
spatial model is the sampling mechanism. In the longitudinal study, we sam-
ple s independent random vectors, and each realization can be thought of as
that of a temporal process. The sampling of the patients provides the repli-
cation mechanism that leads to inferential brawn. The technical implication
is that the variance-covariance matrix V(θ) is block-diagonal. The statistical
implication is that standard multivariate limiting arguments can be applied,
for example, to ascertain the asymptotic distribution of β̂, because the estimator and its associated estimating equation can be expressed in terms of sums of independent contributions. Solving
$$\sum_{i=1}^{s} X_i' V_i(\theta_l)^{-1} \left( Y_i - X_i \beta \right) = 0$$
leads to
$$\hat{\beta} = \left( \sum_{i=1}^{s} X_i' V_i(\theta_l)^{-1} X_i \right)^{-1} \sum_{i=1}^{s} X_i' V_i(\theta_l)^{-1} Y_i.$$
The computational implication is that data processing can proceed on a
subject-by-subject basis. The key components of estimators and measures of
their precision can be accumulated one subject at a time, allowing fitting of
models to large data sets (with many subjects) to be computationally feasible.
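
To make the computational point concrete, the following is a brief sketch, not from the text: the data, the compound-symmetric form chosen for Vi(θl), and all numerical values are assumptions for illustration. The GLS estimator is built from running sums of Xi′Vi(θl)−1Xi and Xi′Vi(θl)−1Yi, so no matrix larger than max{ni} is ever inverted.

```python
import numpy as np

rng = np.random.default_rng(0)
s, p = 50, 3                                   # subjects and coefficients (assumed)
beta_true = np.array([1.0, -0.5, 2.0])         # assumed fixed effects

XtVX = np.zeros((p, p))                        # running sum of X_i' V_i^{-1} X_i
XtVY = np.zeros(p)                             # running sum of X_i' V_i^{-1} Y_i
for i in range(s):
    ni = int(rng.integers(3, 8))               # n_i repeated measures on subject i
    Xi = rng.normal(size=(ni, p))
    # an assumed compound-symmetric V_i(theta_l), purely for illustration
    Vi = 0.5 * np.eye(ni) + 0.3 * np.ones((ni, ni))
    Yi = Xi @ beta_true + rng.multivariate_normal(np.zeros(ni), Vi)
    Vinv = np.linalg.inv(Vi)                   # only an (n_i x n_i) inverse is needed
    XtVX += Xi.T @ Vinv @ Xi
    XtVY += Xi.T @ Vinv @ Yi

beta_hat = np.linalg.solve(XtVX, XtVY)         # GLS estimate accumulated subject by subject
print("GLS estimate:", beta_hat)
```
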
In the spatial case, Σ is not block-diagonal and there is usually no repli-
cation mechanism. The computational ramifications of a dense matrix Σ are formidable in their own right. For example, computing β̂ in the subject-by-subject form of a longitudinal model requires inversion of Vi, a matrix of order at most max{ni}. In the spatial case, computing α̂ requires inversion of
an (n × n) matrix. The implications and ramifications of lacking a replication
mechanism are even more daunting. We suggested earlier that longitudinal
data may be considered as the sampling of s temporal processes. Making the
connection to the spatial case suggests a view of spatial data as a single sam-
ple (s ≡ 1) of a two-dimensional process. This is indeed one way of viewing
many types of spatial data (see §2), and it begs the question as to how we can
make any progress with statistical inference based on a sample of size one.
While there are models, data structures, and analyses in the more tradi-
tional, classical tool set of the statistician that are portable to the spatial case,
the analysis of spatial data provides some similar and many unique challenges
and tasks. These tasks and challenges stimulate the need for spatial mod-
els and analysis. Location information is collected in many applications, and
failing to use this information can obstruct important characteristics of the
data-generating mechanism. The importance of maintaining the spatial con-
text in the analysis of georeferenced data can be illustrated with the following
example.

Example 1.2 Data were generated on a 10×10 lattice by drawing iid variates
from a Gaussian distribution with mean 5 and variance 1, denoted here as
G(5, 1). The observations were assigned completely at random to the lattice
coordinates (Figure 1.1a). Using a simulated annealing algorithm (§7.3), the
data points were then rearranged such that a particular value is surrounded
by more and more similar values. We define a nearest-neighbor of site si ,
(i = 1, · · · , 100), to be a lattice site that could be reached from si with a
one-site move of a queen piece on a chess board. Then, let z i denote the
average of the neighboring sites of si . The arrangements shown in Figure
1.1b–d exhibit increasing correlations between Z(si ), the observation at site
si , and the average of its nearest-neighbors.
Since the same 100 data values are assigned to the four lattice arrange-
ments, any exploratory statistical method that does not utilize the coordinate
information will lead to identical results for the four arrangements. For ex-
ample, histograms, stem-and-leaf plots, Q-Q plots, and sample moments will
be identical. However, the spatial pattern depicted by the four arrangements
is considerably different. This is reflected in the degree to which the four
arrangements exhibit spatial autocorrelation (Figure 1.2).

Figure 1.1 Simulated spatial arrangements on a 10 × 10 lattice. Independent draws from a G(5, 1) distribution assigned completely at random are shown in panel a. The same data were then rearranged to create arrangements (b–d) that differ in their degree of spatial autocorrelation.
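
A sketch of the diagnostic underlying Example 1.2 follows; the lattice values are simulated here rather than taken from Figure 1.1, and the use of Python with NumPy is an assumption for illustration. For each site of a 10 × 10 lattice the average of its queen-move neighbors is computed and correlated with the value at the site; for a completely random arrangement this correlation is near zero, whereas arrangements like those in panels b–d produce increasingly positive values.

```python
import numpy as np

rng = np.random.default_rng(42)
Z = rng.normal(5.0, 1.0, size=(10, 10))     # iid G(5,1) values on the lattice (simulated)

zbar = np.empty_like(Z)                     # neighbor averages z_bar_i
nr, nc = Z.shape
for i in range(nr):
    for j in range(nc):
        # queen neighbors: sites reachable by a one-site move of a queen on a chess board
        neighbors = [Z[a, b]
                     for a in range(max(0, i - 1), min(nr, i + 2))
                     for b in range(max(0, j - 1), min(nc, j + 2))
                     if (a, b) != (i, j)]
        zbar[i, j] = np.mean(neighbors)

r = np.corrcoef(Z.ravel(), zbar.ravel())[0, 1]
print("corr(Z(s_i), neighbor average):", r)   # near 0 for a random arrangement
```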

1.2 Types of Spatial Data

Because spatial data arises in a myriad of fields and applications, there is a myriad of spatial data types, structures, and scenarios. A classification of
spatial data invariably is either somewhat coarse or overly detailed. We err
on the side of a coarse classification, which allows you, the reader, to subse-
quently equate a data type with certain important features of the underlying
spatial process. These features, in turn, lead to particular classes of models,
inferential questions, and applications. For example, for spatial data that is
typically observed exhaustively, e.g., statewide data collected for each county,
the question of predicting the attribute of interest for an unobserved county
in the state is senseless. There are no unobserved counties.
The classification we adopt here follows Cressie (1993) and distinguishes
data types by the nature of the spatial domain. To make these matters more
precise, we denote a spatial process in d dimensions as
{Z(s) : s ∈ D ⊂ Rd }.

Figure 1.2 Correlations between lattice observations Z(si) and the average of the nearest neighbors for lattice arrangements shown in Figure 1.1. The panels plot Z(si) against the neighbor average; panels (b), (c), and (d) show correlations of approximately 0.3, 0.6, and 0.9, respectively.

Here, Z denotes the attribute we observe, for example, yield, concentration,


or the number of sudden infant deaths. The location at which Z is observed is
s, a (d×1) vector of coordinates. Most of the spatial processes in this book are
processes in two-dimensional space, d = 2, and s = [x, y]! are the Cartesian
coordinates. The spatial data types are distinguished through characteristics
of the domain D.

1.2.1 Geostatistical Data

The domain D is a continuous, fixed set. By continuous we mean that Z(s) can be observed everywhere within D, i.e., between any two sample locations
si and sj you can theoretically place an infinite number of other samples.
By fixed we mean that the points in D are non-stochastic. Because of the
continuity of D, geostatistical data is also referred to as “spatial data with
continuous variation.” It is important to associate the continuity with the
domain, not with the attribute being measured. Whether the attribute Z is
continuous or discrete has no bearing on whether the data are geostatistical or not.

Example 1.3 Consider measuring air temperature. Air temperature could, at least theoretically, be recorded at any location in the U.S. Practically,
however, air temperature values cannot be recorded exhaustively. It is usually
recorded at a finite number of specified locations such as those designated as
U.S. weather stations, depicted in Figure 1.3.

Figure 1.3 U.S. Weather Stations. Source: National Climatic Data Center.

It is reasonable to treat these data as geostatistical, defining a continuous temperature surface across the U.S. Our assessment of these data as geosta-
tistical would not change if, instead of the air temperature, we determine an
indicator variable, namely whether the air exceeds a specified threshold limit,
at each weather station. How we select the locations at which Z is observed
also has no bearing on whether the process is geostatistical or not. If instead
of a specified network of weather stations, we had measured air temperature
at the geographic centroid of each state, we would still observe the same tem-
perature surface, just at different points.
Since the spatial domain D (in this case the entire U.S.) is continuous,
it cannot be sampled exhaustively, and an important task in the analysis of
geostatistical data is the reconstruction of the surface of the attribute Z over
the entire domain, i.e., mapping of Z(s). Thus, we are interested in how the
temperature values vary geographically, as in Figure 1.4.

Figure 1.4 Temperature variations in the U.S.

1.2.2 Lattice Data, Regional Data

Lattice data are spatial data where the domain D is fixed and discrete, in other
words, non-random and countable. The number of locations can be infinite;
what is important is that they can be enumerated. Examples of lattice data
are attributes collected by ZIP code, census tract, or remotely sensed data
reported by pixels. Spatial locations with lattice data are often referred to as
sites and these sites usually represent not points in space, but areal regions.
It is often mathematically convenient or necessary to assign to each site one
precise spatial coordinate, a “representative” location. If we refer to Wake
County in North Carolina, then this site is a geographic region, a polygon
on a map. To proceed statistically, we need to spatially index the sites so we
can make measurements between them. For example, to measure the distance
between the Wake County site and another county in North Carolina, we need
to adopt some convention for measuring the distance between two regions. One
possibility would be the Euclidean distance between a representative location
within each county, for example, the county centroid, or the seat of the county
government.
Unfortunately, the mathematical notation typically used with lattice data,
and the term lattice data itself, are both misleading. It seems natural to refer
to the observation at the ith site as Zi and then to use si to denote the rep-
resentative point coordinate within the site. In order to emphasize the spatial
nature of the observation, the same notation, Z(si ), used for geostatistical
data is also routinely used for lattice data. The subscript indexes the lattice
site and si denotes its representative location. Unfortunately, this notation
and the idea of a “lattice” in general, has encouraged most scientists to treat
lattice data as if they were simply a collection of measurements recorded at a
finite set of point locations. In reality, most, if not all, lattice data are spatially
aggregated over areal regions. For example,
• yield is measured on an agricultural plot,
• remotely sensed observations are associated with pixels that correspond to
areas on the ground,
• event counts, e.g., number of deaths, crime statistics, are reported for coun-
ties or ZIP codes,
• U.S. Census Bureau data is made available by census tract.
The aggregation relates to integration of a continuous spatial attribute (e.g.,
average yield per plot), or enumeration (integration with respect to a counting measure). Because the areal units on which the aggregation takes place can
be irregularly shaped regions, the name regional data is more appropriate in
our opinion. If it is important to emphasize the areal nature of a lattice site,
we use notation such as Z(Ai ), where Ai denotes the ith areal unit. Here, Ai
is called the spatial support of the data Z(Ai ) (see §5.7).
Another interesting feature of lattice data is exhaustive observation in many
applications. Many spatial data sets with lattice data provide information
about every site in the domain. If, for example, state-wide data provide in-
formation on cancer mortality rates for every county, the issue of predicting
cancer mortality in any county does not arise. The rates are known. An im-
portant goal in statistical modeling lattice data is smoothing the observed
mortality rates and assessing the relationship between cancer rates and other
variables.

Example 1.4 Blood lead levels in children, Virginia 2000. The mis-
sion of the Lead-Safe Virginia Program sponsored by the Virginia Department
of Health is to eradicate childhood lead poisoning. As part of this program,
children under the age of 72 months were tested for elevated blood lead levels.
The number of children tested each year is reported for each county in Vir-
ginia, along with the number that had elevated blood lead levels. The percent
of children with elevated blood lead levels for each Virginia county in 2000 is
shown in Figure 1.5.

Figure 1.5 Percent of children under the age of 72 months with elevated blood lead levels in Virginia in 2000 (legend classes: 0.00–2.62, 2.63–5.23, 5.24–10.46, 10.47–20.93, and 20.94–71.43 percent). Source: Lead-Safe Virginia Program, Virginia Department of Health.

This is a very typical example of lattice or regional data. The domain D is the set of 133 counties (the sites) in Virginia which is clearly fixed (not
random), discrete, and easily enumerated. The attribute of interest, the per-
centage of children with elevated blood lead levels, is an aggregate variable,
a ratio of two values summed over each county. A common, simple, probabil-
ity model for this variable is based on the binomial distribution: assume the
number of children per county with elevated blood lead levels, Y (si ), follows
a binomial distribution with mean πni, where ni is the number of children tested in county i and π is the probability of having an elevated blood lead level.
If Z(si ) = Y (si )/ni , then Var[Z(si )] = π(1 − π)/ni , which may change con-
siderably over the domain. Unfortunately, many of the statistical methods
used to analyze lattice data assume that the data have a constant mean and
a constant variance. This is one of the main reasons we find the term lattice
data misleading. It is just too easy to forget the aggregate nature of the data
and the heterogeneity (in mean and in variance) that can result when data
are aggregated geographically. A plethora of misleading analyses can easily be
found in the literature, in statistical textbooks and journals as well as in those
within specific subject-matter areas such as geography and epidemiology.
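
As a small numerical illustration of this heterogeneity (the value of π and the counts ni below are assumed, not taken from the Virginia data), the variance π(1 − π)/ni of an observed county percentage changes by more than an order of magnitude across plausible numbers of children tested:

```python
pi = 0.05                      # assumed probability of an elevated blood lead level
for n_i in (20, 100, 500):     # assumed numbers of children tested in a county
    print(n_i, pi * (1 - pi) / n_i)
# n_i = 20 -> 0.002375, n_i = 500 -> 0.000095: far from a constant variance
```
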
In regional data analysis, counties in close proximity to one another with
similar values produce a spatial pattern indicative of positive spatial autocor-
relation. Identifying groups of counties in close proximity to one another with
high values is often of particular interest, suggesting a “cluster” of elevated
risk with perhaps a common source. Another goal in regional data analysis
is identification of the spatial risk factors for the response of interest. For
example, the primary source of elevated blood lead levels in children is dust
from lead-based paint in older homes in impoverished areas. Thus, we might
seek to correlate the map in Figure 1.5, to one produced from Census data
showing the median housing value per county, a surrogate for housing age and
maintenance quality (Figure 1.6).

Figure 1.6 Median housing value per county in Virginia in 2000 (legend classes: 34,400–48,400; 48,401–55,700; 55,701–72,200; 72,201–89,300; 89,301–231,000). Source: U.S. Census Bureau.

1.2.3 Point Patterns

Geostatistical and lattice data have in common the fixed, non-stochastic do-
main. A domain D is fixed if it does not change from one realization of the
spatial process to the next. Consider pouring sand out of a bucket onto a
desk and let Z(s) denote the depth of the poured sand at location s. The set
D, which comprises all possible locations in this experiment, is the desktop.


After having taken the depth measurements at some sample locations on the
desk, return the sand to the bucket and pour it again. This creates another
realization of the random process {Z(s) : s ∈ D ⊂ R2 }. We now have two
realizations of a geostatistical process; the domain has not changed between
the first and second pouring, unless someone moved the desk. The objective
of a statistical inquiry with such data is obviously the information collected
on the attribute Z, for example, the construction of an elevation map of the
sand on the desk. The objective is not to study the desk, the domain is known
and unchanging.
Can we conceive of a random domain in this example, that is, a set of points
that changes with each realization of the random process? If Z(s) is the depth
of the sand at location s, apply an indicator transform such that
$$I(s) = \begin{cases} 1 & Z(s) > c \\ 0 & \text{otherwise.} \end{cases}$$
If Z(s) is geostatistical data, so is the indicator I(s), which returns 1 whenever
the depth of the sand exceeds the threshold c. Now remove all locations in
D for which I(s) = 0 and define the remaining set of points as D∗ . As the
process of pouring the sand is repeated, a different realization of the random
set D∗ is obtained. Both the number of points in the realized set, as well
as their spatial configuration, is now the outcome of a random process. The
attribute “observed” at every point in D∗ is rather un-interesting, I(s) ≡ 1,
if s ∈ D∗. The focus of a statistical investigation is now the set D∗ itself; we
study properties of the random domain. This is the realm of point pattern
analysis. The collection of points I(s), s ∈ D∗ , is termed a point pattern.
The important feature of point pattern data is the random domain, not the
degenerate nature of the attribute at the points of D. Many point patterns
are of this variety, however. They are referred to as unmarked patterns, or
simply as point patterns. The locations at which weeds emerge in a garden,
the location of lunar craters, the points at which nerve fibers intersect tissue
are examples of spatial point patterns. If along with the location of an event
of interest we observe a stochastic attribute Z, the point pattern is termed a
marked pattern. For example, the location and depth of lunar craters, the
location of weed emergence and the light intensity at that location can be
considered as marked point patterns.

Example 1.5 Lightning strikes. Once again we turn to the weather for
an interesting example. Prompt and reliable information about the location of
lightning strikes is critical for any business or industry that may be adversely
affected by the weather. Weather forecasters use information about lightning
strikes to alert them (and then the public at large) to potentially dangerous
storms. Air traffic controllers use such information to re-route airline traffic,
and major power companies and forestry officials use it to make efficient use
of human resources, anticipating the possibility of power outages and fires.
The National Lightning Detection Network locates lightning strikes across the U.S. instantaneously by detecting the electromagnetic signals given off
when lightning strikes the earth’s surface. Figure 1.7 shows the lightning flash
events within approximately 200 miles of the East coast of the U.S. between
April 17 and April 20, 2003.

Figure 1.7 Locations of lightning strikes within approximately 200 miles of the east
coast of the U.S. between April 17 and April 20, 2003. Data courtesy of Vaisala Inc.,
Tucson, AZ.

This is an example of a spatial point pattern. It consists of the locations of the lightning strikes (often called event locations) recorded within a boundary
that is the set of all potential locations within 200 miles of the East Coast of
the U.S. It is called a pattern, but this is in fact one of the questions that may
be of interest: Do the lightning strikes occur in a spatially random fashion or
is there some sort of pattern to their spatial distribution?

In addition to the locations of the lightning strikes, the National Lightning Detection Network also records information about the polarity (a negative or


positive charge), and peak amplitude of each strike. Thus, these attributes,
together with the locations of the lightning strikes, are an example of a marked
point pattern. With marked point patterns, we are interested in the spatial
relationships among the values of the marking attribute variable, above and
beyond any induced by the spatial distribution of the strikes. We will treat
such analyses in more detail in Chapter 3.

1.3 Autocorrelation—Concept and Elementary Measures

As the name suggests, autocorrelation is the correlation of a variable with itself. If Z(s) is the attribute Z observed in the plane at spatial location s = [x, y]′, then the term spatial autocorrelation refers to the correlation
between Z(si ) and Z(sj ). It is a correlation between the same attribute at two
locations. In the absence of spatial autocorrelation, the fact that two points si
and sj are close or far has no bearing on the relationship between the values
Z(si ) and Z(sj ) at those locations. Conversely, if there is positive spatial
autocorrelation, proximity in space should couple with similarity in attribute
values. Another way to visualize the concept is to consider a three-dimensional
coordinate system in which the (x, y)-axes represent the spatial coordinates
and the vertical axis the magnitude of Z. Under positive autocorrelation,
points that are close in the (x–y)-plane tend to have similar values of Z. In
the three-dimensional coordinate system this creates clusters of points.

1.3.1 Mantel’s Tests for Clustering

Some elementary statistical measures of the degree to which data are au-
tocorrelated can be motivated precisely in this way; as methods for detect-
ing clusters in three-dimensional space. Mantel (1967) considered a general
procedure to test for disease clustering in a spatio-temporal point process
{Z(s, t) : s ∈ D ⊂ R2 , t ∈ T ∗ ⊂ R}. For example, (s, t) is a coordinate
in space and time at which a leukemia case occurred. This is an unmarked
spatio-temporal point pattern, Z is a degenerate random variable. To draw
the parallel with studying autocorrelation in a three-dimensional coordinate
system, we could also consider the data-generating mechanism as a spatial
point process with a mark variable T , the time at which the event occurs.
Denote this process as T (s) and the observed data as T (s1 ), T (s2 ), · · · , T (sn ).
The disease process is said to be clustered if cases that occur close together
in space also occur close together in time. In order to develop a statistical
measure for this tendency to group in time and space, let Wij denote a measure
of spatial proximity between si and sj and let Uij denote a measure of the
temporal proximity of the cases. For example, we can take for leukemia cases
at si and sj
$$W_{ij} = \|s_i - s_j\|, \qquad U_{ij} = |T(s_i) - T(s_j)|.$$
Mantel (1967) suggested to test for clustering by examining the test statistics
$$M_1 = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} W_{ij} U_{ij} \qquad (1.4)$$
$$M_2 = \sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij} U_{ij}, \qquad W_{ii} = U_{ii} = 0. \qquad (1.5)$$
Because of the restriction Wii = Uii = 0, (1.4) sums the product Wij Uij for the n(n − 1)/2 unique pairs and M2 is a sum over all n(n − 1) pairs. Collect the spatial and temporal distance measures in (n × n) matrices W and U whose diagonal elements are 0. Then $M_2 = \sum_{i=1}^{n} \mathbf{u}_i' \mathbf{w}_i$, where ui′ is the ith row of U and wi′ is the ith row of W.
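
A minimal sketch of computing (1.4) and (1.5) follows; the coordinates and attribute values are simulated, and the choices Wij = ||si − sj|| and Uij = |Z(si) − Z(sj)| follow the suggestions above. The use of Python with NumPy is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 30
coords = rng.uniform(0, 10, size=(n, 2))     # site coordinates s_i (simulated)
z = rng.normal(5, 1, size=n)                 # attribute Z(s_i) (simulated)

# proximity matrices with zero diagonals
W = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)   # ||s_i - s_j||
U = np.abs(z[:, None] - z[None, :])                                   # |Z(s_i) - Z(s_j)|

M2 = np.sum(W * U)       # sum over all n(n-1) pairs (diagonal terms are zero)
M1 = M2 / 2.0            # W and U are symmetric here, so the unique-pair sum is M2/2
print("M1 =", M1, " M2 =", M2)
```
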
Although Mantel (1967) focused on diagnosing spatio-temporal disease clus-
tering, the statistics (1.4) and (1.5) can be applied for any spatial attribute
and are not restricted to marked point patterns. Define Wij as a measure
of closeness between si and sj and let Uij be a measure of closeness of the
observed values; for example, Uij = |Z(si ) − Z(sj )| or Uij = {Z(si ) − Z(sj )}2 .
Notice that for a geostatistical or lattice data structure the Wij are fixed
because the domain is fixed and the Uij are random. We can imagine a re-
gression through the origin with response Uij and regressor Wij . If the process
exhibits positive autocorrelation, then small Wij should pair with small Uij .
Points will be distributed randomly beyond some critical Wij . The Mantel
statistics are related to the slope estimator of the regression Uij = βWij + eij
(Figure 1.8):
$$\hat{\beta} = \frac{M_2}{\sum_{i=1}^{n} \sum_{j=1}^{n} W_{ij}^2}.$$

The general approach to inference about the clustering, based on statistics such as (1.4) and (1.5), is to test whether the magnitude of the observed
value of M is unusual in the absence of spatial aggregation and to reject
the hypothesis of no clustering if the M statistic is sufficiently extreme. Four
approaches are commonly used to perform this significance test. We focus on
M2 here and denote as M2(obs) the observed value of Mantel’s statistic based
on the data Z(s1 ), · · · , Z(sn ).
• Permutation Test. Under the hypothesis of no clustering, the observed
values Z(s1 ), · · · , Z(sn ) can be thought of as a random assignment of val-
ues to spatial locations. Alternatively, we can imagine assigning the site
coordinates s1 , · · · , sn completely at random to the observed values. With
n sites, the number of possible assignments is finite: n!. Enumerating M2
for the n! arrangements yields its null distribution and the probability to
exceed M2(obs) can be ascertained. Let Er [M2 ] and Varr [M2 ] denote the
mean and variance of M2 under this randomization approach.

Figure 1.8 Scatter plots of uij = |Z(si) − Z(sj)| vs. wij = ||si − sj|| for the simulated lattice of Figure 1.1d. The bottom panel shows the average uij against wij.

• Monte Carlo Test. Even for moderately large n, the number of possible
random assignments of values to sites is large. Instead of a complete enu-
meration one can sample independently k random assignments to construct
the empirical null distribution of M2 . For a 5%-level test at least k = 99
samples are warranted and k = 999 is recommended for a 1%-level test.
The larger k, the better the empirical distribution approximates the null
distribution. Then, M2(obs) is again combined with the k values for M2 from
the simulation and the relative rank of M2(obs) is computed. If this rank is
sufficiently extreme, the hypothesis of no autocorrelation is rejected (see the code sketch following this list).
• Asymptotic Z-Test With Gaussian Assumption. The distribution of
M2 can also be determined—at least asymptotically—if the distribution of
the Z(s) is known. The typical assumption is that of a Gaussian distribution
for the Z(s1 ), · · · , Z(sn ) with common mean and common variance. Under
the null distribution, Cov[Z(si ), Z(sj )] = 0 ∀i (= j, and the mean and
variance of M2 can be derived. Denote these as Eg [M2 ] and Varg [M2 ],
since these moments may differ from those obtained under randomization.
A test statistic
$$Z_{obs} = \frac{M_{2(obs)} - \mathrm{E}_g[M_2]}{\sqrt{\mathrm{Var}_g[M_2]}}$$
is formed and, appealing to the large-sample distribution of M2 , Zobs is
compared against cutoffs from the G(0, 1) distribution. The spatial prox-
imity weights Wij are considered fixed in this approach.
• Asymptotic Z-Test Under Randomization. The mean and variance
of M2 under randomization can also be used to formulate a test statistic
$$Z_{obs} = \frac{M_{2(obs)} - \mathrm{E}_r[M_2]}{\sqrt{\mathrm{Var}_r[M_2]}},$$
which follows approximately (for n large) a G(0, 1) distribution under the
null hypothesis. The advantage over the previous Z-test lies in the absence
of a distributional assumption for the Z(si ). The disadvantage of either
Z-test is the reliance on an asymptotic distribution for Zobs .
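The Monte Carlo version of this test is straightforward to implement. The following sketch (Python with NumPy; the function names and the choice Uij = {Z(si) − Z(sj)}² are illustrative assumptions, not prescriptions from the text) reassigns the observed values to the sites k times and computes the relative rank of M2(obs):

```python
import numpy as np

def mantel_m2(z, w):
    # M2 = sum_i sum_j W_ij * U_ij, here with U_ij = (z_i - z_j)^2
    u = (z[:, None] - z[None, :]) ** 2
    return np.sum(w * u)

def mantel_monte_carlo(z, w, k=999, seed=None):
    """Monte Carlo test of no spatial autocorrelation based on M2.

    The attribute values are randomly reassigned to the sites k times and
    the relative rank of the observed statistic among the simulated values
    yields the Monte Carlo p-value.
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=float)
    m2_obs = mantel_m2(z, w)
    m2_sim = np.array([mantel_m2(rng.permutation(z), w) for _ in range(k)])
    # upper-tail p-value; which tail is of interest depends on how W and U
    # are defined (compare the regression interpretation of Figure 1.8)
    p_upper = (1 + np.sum(m2_sim >= m2_obs)) / (k + 1)
    return m2_obs, p_upper
```

With k = 999 the smallest attainable p-value is 0.001, in line with the recommendation above.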
The Mantel statistics (1.4) and (1.5) are usually not applied in their raw
form in practice. Most measures of spatial autocorrelation that are used in
exploratory analysis are special cases of M1 and M2 . These are obtained by
considering particular structures for Wij and Uij and by scaling the statistics.

Example 1.6 Knox (1964) performed a binary classification of the spatial


distances ||hij || = ||si − sj || and the temporal separation |ti − tj | between
leukemia cases in the study of the space-time clustering of childhood leukemia,
W_{ij} = \begin{cases} 1 & \text{if } \|s_i - s_j\| < 1\text{ km} \\ 0 & \text{otherwise} \end{cases}
\qquad
U_{ij} = \begin{cases} 1 & \text{if } |t_i - t_j| < 59\text{ days} \\ 0 & \text{otherwise.} \end{cases}
Let W∗ denote the (n × n) matrix consisting of the upper triangular part of
W with all other elements set to zero (and similarly for U∗ ). The number
of “close” pairs in space and time is then given by M1 , the number of pairs
close in space is given by S_w^* = 1'W^*1, and the number of pairs close in time
by S_u^* = 1'U^*1. A (2 × 2) contingency table can be constructed (Table 1.1).
The expected cell frequencies under the hypothesis of no autocorrelation can
be computed from the marginal totals and the usual Chi-square test can be
deployed.

Table 1.1 (2 × 2) Contingency table in Knox’s double dichotomy

                              Proximity in Time
  In Space        Close             Far                                  Total
  Close           M1                S_w^* − M1                           S_w^*
  Far             S_u^* − M1        n(n − 1)/2 − S_u^* − S_w^* + M1      n(n − 1)/2 − S_w^*
  Total           S_u^*             n(n − 1)/2 − S_u^*                   n(n − 1)/2
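As a computational sketch of Knox’s procedure (Python; the function name, the inputs, and the use of SciPy’s chi-square distribution are our assumptions; the thresholds are those of Example 1.6), the table can be assembled and the usual chi-square test applied as follows:

```python
import numpy as np
from scipy.stats import chi2

def knox_test(coords, times, d_space=1.0, d_time=59.0):
    """Knox's double dichotomy: pairs close/far in space and in time."""
    coords = np.asarray(coords, dtype=float)   # (n, 2) site coordinates
    times = np.asarray(times, dtype=float)     # (n,) event times
    n = len(times)
    iu = np.triu_indices(n, k=1)               # each pair counted once
    dsp = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))[iu]
    dti = np.abs(times[:, None] - times[None, :])[iu]
    w = (dsp < d_space).astype(int)            # close in space
    u = (dti < d_time).astype(int)             # close in time
    m1, sw, su, npairs = np.sum(w * u), w.sum(), u.sum(), len(w)
    observed = np.array([[m1, sw - m1],
                         [su - m1, npairs - sw - su + m1]])   # Table 1.1
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / npairs
    x2 = np.sum((observed - expected) ** 2 / expected)
    return observed, x2, chi2.sf(x2, df=1)
```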

Other special cases of the Mantel statistics are the Black-White join count,
Moran’s I, and Geary’s c statistic. We now discuss these in the context of
lattice data and then provide extensions to a continuous domain.

1.3.2 Measures on Lattices

Lattice or regional data are in some sense the coarsest of the three spatial
data types because they can be obtained from other types by spatial accumu-
lation (integration). Counting the number of events in non-overlapping sets
A1 , · · · , Am of the domain D in a point process creates a lattice structure.
A lattice process can be created from a geostatistical process by integrating
Z(s) over the sets A1 , · · · , Am .
Key to analyzing lattice structures is the concept of spatial connectivity. Let
i and j index two members of the lattice and imagine that si and sj are point
locations with which the lattice members are identified. For example, i and j
may index two counties and si and sj are the spatial locations of the county
centroid or the seat of the county government. It is not necessary that each
lattice member is associated with a point location, but spatial connectivity
between sites is often expressed in terms of distances between “representative”
points. With each pair of sites we associate a weight wij which is zero if i = j
or if the two sites are not spatially connected. Otherwise, wij takes on a
non-zero value. (We use lowercase notation for the spatial weights because
the domain is fixed for lattice data.) The simplest connectivity structure is
obtained if the lattice consists of regular units. It is then natural to consider
binary weights

w_{ij} = \begin{cases} 1 & \text{if sites } i \text{ and } j \text{ are connected} \\ 0 & \text{if sites } i \text{ and } j \text{ are not connected.} \end{cases} \qquad (1.6)

Sites that are connected are considered spatial neighbors and you determine
what constitutes connectedness. For regular lattices it is customary to draw
on the moves that a respective chess piece can perform on a chess board
(Figure 1.9a–c). For irregularly shaped areal units spatial neighborhoods can
be defined in a number of ways. Two common approaches are shown in Figure
1.10 for counties of North Carolina. Counties are considered connected if they
share a common border or if representative points within the county are less
than a certain critical distance apart. The weight wij assigned to county j, if
it is a neighbor of county i, may be a function of other features of the lattice
sites; for example, the length of the shared border, the relative sizes of the
counties, etc. Symmetry of the weights is not a requirement. If housing prices
are being studied and a small, rural county abuts a large, urban county, it is
reasonable to assume that changes in the urban county have different effects
on the rural county than changes in the rural environment have on the urban
situation.

Figure 1.9 Possible definitions of spatial connectedness (contiguity) for a regular


lattice. Edges abut in the rook definition (a), corners touch in the bishop definition
(b), edges and corners touch in the queen definition (c). Adapted from Figure 9.26
in Schabenberger and Pierce (2002).
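The chess-move definitions in Figure 1.9 translate directly into binary weight matrices. A minimal sketch for a regular lattice (Python/NumPy; the function name and the row-major ordering of the sites are our conventions):

```python
import numpy as np

def lattice_weights(nrow, ncol, scheme="rook"):
    """Binary contiguity weights w_ij for a regular nrow x ncol lattice.

    scheme: 'rook' (shared edges), 'bishop' (shared corners), or
            'queen' (edges or corners), as in Figure 1.9.
    Sites are numbered row by row; the weight matrix has a zero diagonal.
    """
    moves = {"rook": [(-1, 0), (1, 0), (0, -1), (0, 1)],
             "bishop": [(-1, -1), (-1, 1), (1, -1), (1, 1)]}
    moves["queen"] = moves["rook"] + moves["bishop"]
    n = nrow * ncol
    w = np.zeros((n, n))
    for i in range(nrow):
        for j in range(ncol):
            for di, dj in moves[scheme]:
                ii, jj = i + di, j + dj
                if 0 <= ii < nrow and 0 <= jj < ncol:
                    w[i * ncol + j, ii * ncol + jj] = 1.0
    return w
```

For instance, lattice_weights(10, 10, "rook") produces the binary rook weights of the kind used for the 10 × 10 lattice in Example 1.7 below.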

Figure 1.10 Two definitions of spatial connectedness (contiguity) for an irregular


lattice. The figure shows counties of North Carolina and two possible neighborhood
definitions. The left neighborhood shows counties with borders adjoining that of the
central, cross-hatched county. The neighborhood in the center of the map depicts
counties whose centroid is within 35 miles of the central cross-hatched county.

1.3.2.1 Black-White Join Counts

Measures of spatial autocorrelation that are of historical significance—and re-


main important exploratory tools in spatial data analysis to date—originated
from the consideration of data on lattices. If the attribute Z is binary, that
is,
Z(s_i) = \begin{cases} 1 & \text{if an event of interest occurred at site } i \\ 0 & \text{if the event did not occur at site } i, \end{cases}

a generalization of Mantel’s statistic M2 is


BB = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\, Z(s_i)Z(s_j), \qquad (1.7)

known as the Black-Black join count statistic. The moniker stems from considering
sites with Z(si ) = 1 as colored black and sites where the event of
interest did not occur as colored white. By the same token, the Black-White
join count statistic is given by
BW = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(Z(s_i) - Z(s_j))^2. \qquad (1.8)

A permutation approach to test the hypothesis of no spatial autocorrelation
is possible by permuting the assignment of n1 black and n − n1 white cells
to the lattice sites. Typically, inference based on BB or BW proceeds by
assuming an asymptotic Gaussian distribution of
\frac{BB - \mathrm{E}[BB]}{\sqrt{\mathrm{Var}[BB]}} \quad \text{or} \quad \frac{BW - \mathrm{E}[BW]}{\sqrt{\mathrm{Var}[BW]}},
where the means and variances are derived for a particular sampling model.
Under binomial sampling it is assumed that Pr(Z(si ) = 1) = π and that the
number of black cells out of n is a Binomial(n, π) random variable. If the con-
straint is added that the number of black cells equals n1 , the observed number,
then hypergeometric sampling is appropriate. Cliff and Ord (1981, Ch. 2) de-
rive the means and variances for BB and BW under the two assumptions.
For the case of binomial sampling one obtains
\mathrm{E}[BB] = \frac{1}{2}\pi^2 \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}

\mathrm{Var}[BB] = \frac{1}{4}\pi^2(1-\pi)\{S_1(1-\pi) + S_2\pi\}

\mathrm{E}[BW] = \pi(1-\pi)\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}

\mathrm{Var}[BW] = S_1\{\pi(1-\pi)\}^2 + \frac{1}{4}S_2\,\pi(1-\pi)\{1 - 4\pi(1-\pi)\}

S_1 = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(w_{ij} + w_{ji})^2

S_2 = \sum_{i=1}^{n}\left(\sum_{j=1}^{n} w_{ij} + \sum_{j=1}^{n} w_{ji}\right)^2.

Whereas for binomial sampling we have Pr(Z(si ) = 1) = π and Pr(Z(si ) =
1, Z(sj ) = 1) = π² under the null hypothesis of no autocorrelation, for
hypergeometric sampling with n1 black cells the corresponding results are
Pr(Z(si ) = 1) = n1 /n and Pr(Z(si ) = 1, Z(sj ) = 1) = n1 (n1 − 1)/{n(n − 1)}.


Using the notation
P_{(k)} = \frac{n_1 \times (n_1-1) \times \cdots \times (n_1-k+1)}{n \times (n-1) \times \cdots \times (n-k+1)},
and putting w_{..} = \sum_{i,j} w_{ij}, the mean and variance of the BB statistic under
hypergeometric sampling are (Cliff and Ord, 1981, (2.27), p. 39)
\mathrm{E}[BB] = \frac{1}{2}P_{(2)}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}

\mathrm{Var}[BB] = \frac{S_1}{4}\left\{P_{(2)} - 2P_{(3)} + P_{(4)}\right\} + \frac{S_2}{4}\left\{P_{(3)} - P_{(4)}\right\}
+ \frac{w_{..}^2}{4}P_{(4)} - \frac{1}{4}\left\{w_{..}P_{(2)}\right\}^2.
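The join counts and their binomial-sampling moments are easily computed. A sketch (Python/NumPy; the function names are ours; the weight matrix is assumed to have a zero diagonal):

```python
import numpy as np

def join_counts(z, w):
    """Black-Black and Black-White join counts, equations (1.7) and (1.8)."""
    z = np.asarray(z, dtype=float)
    bb = 0.5 * np.sum(w * np.outer(z, z))
    bw = 0.5 * np.sum(w * (z[:, None] - z[None, :]) ** 2)
    return bb, bw

def bb_moments_binomial(w, pi):
    """Mean and variance of BB under binomial sampling (formulas above)."""
    s1 = 0.5 * np.sum((w + w.T) ** 2)
    s2 = np.sum((w.sum(axis=1) + w.sum(axis=0)) ** 2)
    e_bb = 0.5 * pi ** 2 * np.sum(w)
    var_bb = 0.25 * pi ** 2 * (1 - pi) * (s1 * (1 - pi) + s2 * pi)
    return e_bb, var_bb
```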

1.3.2.2 The Geary and Moran Statistics

The BB statistic is a special case of the Mantel statistic with Z(si ) binary and
Uij = Z(si )Z(sj ). If Z is a continuous attribute whose mean is not spatially
varying, i.e., E[Z(s)] = µ, then closeness of attributes at sites si and sj can
be expressed by various measures:
Uij = (Z(si ) − µ)(Z(sj ) − µ) (1.9)
Uij = (Z(si ) − Z)(Z(sj ) − Z) (1.10)
Uij = |Z(si ) − Z(sj )| (1.11)
Uij = (Z(si ) − Z(sj ))2 . (1.12)
The practically important measures are (1.10) and (1.12) because of mathematical
tractability and interpretation. If Z̄ is consistent for µ, then (Z(s_i) − Z̄)(Z(s_j) − Z̄)
is a consistent estimator of Cov[Z(si ), Z(sj )]. It is tempting to
scale this measure by an estimate of σ². If the Z(si ) were a random sample,
then S^2 = (n-1)^{-1}\sum_{i=1}^{n}(Z(s_i) - \bar{Z})^2 would be the appropriate estimator.
In case of (1.10) this leads us to consider
(n-1)\,\frac{(Z(s_i) - \bar{Z})(Z(s_j) - \bar{Z})}{\sum_{i=1}^{n}(Z(s_i) - \bar{Z})^2}, \qquad (1.13)

as an “estimator” of the correlation between Z(si ) and Z(sj ). Based on a single
pair of observations it is not a sensible estimator; its statistical properties are poor.
Because it combines information about covariation (numerator) with information about
variation (denominator), however, it is a reasonable building block. Combining the
contributions (1.13) over all pairs of locations, introducing the spatial connectivity
weights wij , and scaling, yields the statistic known as Moran’s I, first discussed in
Moran (1950):
I = \frac{n}{(n-1)S^2 w_{..}} \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(Z(s_i) - \bar{Z})(Z(s_j) - \bar{Z}). \qquad (1.14)

An autocorrelation measure based on (1.12) was constructed by Royaltey,


Astrachan, and Sokal (1975). Note that if Var[Z(s)] = σ 2 , then E[(Z(si ) −
Z(sj ))2 ] = Var[Z(si )−Z(sj )] = 2σ 2 −2Cov[Z(si ), Z(sj )]. Scaling and combin-
ing local contributions based on (1.12) leads to the statistic known as Geary’s
c (Geary, 1954):
c = \frac{1}{2S^2 w_{..}} \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(Z(s_i) - Z(s_j))^2. \qquad (1.15)

As with the general Mantel statistic, inference for the I and c statistic
can proceed via permutation tests, Monte-Carlo tests, or approximate tests
based on the asymptotic distribution of I and c. To derive the mean and
variance of I and c one either proceeds under the assumption that the Z(si )
are Gaussian random variables with mean µ and variance σ 2 or derives the
mean and variance under the assumption of randomizing the attribute values
to the lattice sites. Either assumption yields the same mean values,
\mathrm{E}_g[I] = \mathrm{E}_r[I] = -\frac{1}{n-1}, \qquad \mathrm{E}_g[c] = \mathrm{E}_r[c] = 1,
but expressions for the variances under Gaussianity and randomization differ.
The reader is referred to the formulas in Cliff and Ord (1981, p. 21) and to
Chapter problem 1.8.
The interpretation of the Moran and Geary statistic is as follows. If I > E[I],
then a site tends to be connected to sites that have similar attribute values.
The spatial autocorrelation is positive and increases in strength with |I −E[I]|.
If I < E[I], attribute values of sites connected to a particular site tend to be
dissimilar. The interpretation of the Geary coefficient is opposite. If c > E[c],
sites are connected to sites with dissimilar values and vice versa for c < E[c].
The assumptions of constant mean and constant variance of the Z(si ) must
not be taken lightly when testing for spatial autocorrelation with these statis-
tics. Values in close spatial proximity may be similar not because of spatial
autocorrelation but because the values are independent realizations from dis-
tributions with similar mean. Values separated in space may appear dissimilar
if the mean of the random field changes.
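The global statistics are simple to compute from a weight matrix and an attribute vector. A sketch of (1.14) and (1.15) (Python/NumPy; the function names are ours):

```python
import numpy as np

def morans_i(z, w):
    """Global Moran's I, equation (1.14)."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    zc = z - z.mean()
    s2 = np.sum(zc ** 2) / (n - 1)
    return n / ((n - 1) * s2 * np.sum(w)) * np.sum(w * np.outer(zc, zc))

def gearys_c(z, w):
    """Global Geary's c, equation (1.15)."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    s2 = np.sum((z - z.mean()) ** 2) / (n - 1)
    return np.sum(w * (z[:, None] - z[None, :]) ** 2) / (2 * s2 * np.sum(w))
```

A permutation or Monte Carlo test for either statistic proceeds exactly as for Mantel’s M2, since both are special cases of it.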

Example 1.7 Independent draws from a G(µ(x, y), σ 2 ) distribution were


assigned to the sites of a 10 × 10 regular lattice. The mean function of the
lattice sites is µ(x, y) = 1.4 + 0.1x + 0.2y + 0.002x2 . These data do not exhibit
any spatial autocorrelation. They are equi-dispersed but not mean-stationary.
Calculating the moments of the Moran’s I statistic with a rook definition for
the spatial contiguity weights wij leads to the following results.
Assumption              I        E[I]      √Var[I]   Zobs    p-value
Z(si ) ∼ G(µ, σ²)       0.2597   −0.0101   0.0731    3.690   0.00011
Randomization           0.2597   −0.0101   0.0732    3.686   0.00011

The statistic is sensitive to changes in the mean function; I “detected”
spurious autocorrelation.
While this simple example serves to make the point, the impact of heteroge-
neous means and variances on the interpretation of Moran’s I is both widely
ignored and completely confused throughout the literature. McMillen (2003)
offers perhaps the first correct assessment of this problem (calling it model
misspecification). Waller and Gotway (2004) discuss this problem at length
and provide several practical illustrations in the context of spatial epidemiol-
ogy.

Because it is often not reasonable to assume constancy of the mean over a


larger domain, two courses of action come to mind.
• Fit a mean model to the data and examine whether the residuals from
the fit exhibit spatial autocorrelation. That is, postulate a linear model
Z(s) = X(s)β + e and obtain the least squares estimate of the (p × 1)
vector β. Since testing for spatial autocorrelation is usually part of the
exploratory stages of spatial data analysis, one has to rely on ordinary
least squares estimation at this stage. Then compute the Moran or Geary
statistic of the residuals ê_i = Z(s_i) − x′(s_i)β̂. In matrix-vector form the I
statistic can be written as
I_{res} = \frac{n}{w_{..}}\,\frac{\hat{e}'\mathbf{W}\hat{e}}{\hat{e}'\hat{e}}, \qquad (1.16)
where ê = Me and M = I − X(X′X)⁻¹X′. The mean of Moran’s I based
on OLS residuals is no longer −(n − 1)⁻¹. Instead, we have
\mathrm{E}_g[I_{res}] = \frac{n}{(n-k)\,w_{..}}\,\mathrm{tr}[\mathbf{M}\mathbf{W}]
(a code sketch follows this list).
• Even if the mean changes globally throughout the domain it may be reason-
able to assume that it is locally constant. The calculation of autocorrelation
measures can then be localized. This approach gives rise to so-called LISAs,
local indicators of spatial autocorrelation.
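A sketch of the residual statistic (1.16) and its Gaussian-case expectation (Python/NumPy; the function name is ours; x denotes the n × k design matrix of the postulated mean model):

```python
import numpy as np

def morans_i_residuals(z, x, w):
    """Moran's I from OLS residuals, equation (1.16), and E_g[I_res]."""
    z = np.asarray(z, dtype=float)
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    m = np.eye(n) - x @ np.linalg.solve(x.T @ x, x.T)  # residual projector M
    e = m @ z                                          # OLS residuals
    w_dot = np.sum(w)
    i_res = n / w_dot * (e @ w @ e) / (e @ e)
    e_i_res = n * np.trace(m @ w) / ((n - k) * w_dot)
    return i_res, e_i_res
```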

1.3.3 Localized Indicators of Spatial Autocorrelation

Taking another look at Mantel’s M2 statistic for matrix association in §1.3.1


reveals that the statistic can be written as a sum of contributions of individual
data points:
M_2 = \sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}U_{ij} = \sum_{i=1}^{n} M_i.

The contribution M_i = \sum_{j=1}^{n} W_{ij}U_{ij} can be thought of as the autocorrelation
measure associated with the ith site, a local measure. Since the Black-White
join count, Moran’s I and Geary’s c statistics are special cases of M2 , they can
be readily localized. The concept of local indicators of spatial autocorrelation

(LISAs) was advanced by Anselin (1995), who defined a statistic to be a LISA if
1. it measures the extent of spatial autocorrelation around a particular obser-
vation,
2. it can be calculated for each observation in the data, and
3. its sum is proportional to a global measure of spatial autocorrelation such
as BB, c, or I.
A local version of Moran’s I that satisfies these requirements is
I(s_i) = \frac{n}{(n-1)S^2}\,(Z(s_i) - \bar{Z})\sum_{j=1}^{n} w_{ij}(Z(s_j) - \bar{Z}), \qquad (1.17)
so that \sum_{i=1}^{n} I(s_i) = w_{..}\,I. Anselin (1995) derives the mean and variance of
I(si ) under the randomization assumption. For example,
\mathrm{E}_r[I(s_i)] = \frac{-1}{n-1}\sum_{j=1}^{n} w_{ij}.

The interpretation of I(si ) derives directly from the interpretation of the


global statistic. If I(si ) > E[I(si )], then sites connected to the ith site ex-
hibit attribute values similar to Z(si ). There is locally positive autocorrelation
among the values; large (small) values tend to be surrounded by large (small)
values. If I(si ) < E[I(si )], then sites connected to the ith site have dissimilar
values. Large (small) values tend to be surrounded by small (large) values.
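A sketch of the local statistic (1.17) and its randomization mean (Python/NumPy; the function name is ours):

```python
import numpy as np

def local_morans_i(z, w):
    """Local Moran's I, equation (1.17); sums to w.. times the global I."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    zc = z - z.mean()
    s2 = np.sum(zc ** 2) / (n - 1)
    i_local = n / ((n - 1) * s2) * zc * (w @ zc)   # one value per site
    e_local = -w.sum(axis=1) / (n - 1)             # E_r[I(s_i)]
    return i_local, e_local
```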
Local indicators of spatial autocorrelation are useful exploratory tools to:
(i) Detect regions in which the autocorrelation pattern is markedly different
from other areas in the domain.
(ii) Detect local spatial clusters (hot spots). There are many definitions of
hot spots and cold spots. One such definition focuses on the histogram of
the data and identifies those sites as hot (cold) spots where Z(si ) is extreme.
LISAs enable hot spot definitions that take into account the surrounding
values of a site. A hot spot can be defined in this context as one or several
contiguous sites where the local indicators are unusually large or small.
(iii) Distinguish spatial outliers from distributional outliers. A value un-
usual in a box plot of the Z(si ) is a distributional outlier. Data points that
are near the center of the sample distribution do not raise flags or concerns
in this box plot. But these values can be unusual relative to the neighboring
sites. Anselin (1995) suggests comparing the distribution of I(si ) to I/n to
determine outliers and/or leverage points in the data.
The exact distributional properties of the autocorrelation statistics are elu-
sive, even in the case of a Gaussian random field. The Gaussian approximation
tends to work well, but the same cannot necessarily be said for the local statis-
tics. Anselin (1995) recommends randomization inference as for the global
statistic with the added feature that the randomization be conditional; the

attribute value at site i is not subject to permutation. The implementation


of the permutation approach can be accelerated in the local approach, since
only values in the neighborhood need to be permuted. However, Besag and
Newell (1991) and Waller and Gotway (2004) note that when the data have
heterogeneous means or variances, a common occurrence with count data and
proportions, the randomization assumption is inappropriate. In some cases,
the entire concept of permutation makes little sense. Instead they recommend
the use of Monte Carlo testing.
While localized measures of autocorrelation can be excellent exploratory
tools, they can also be difficult to interpret. Moreover, they simply cannot
be used as confirmatory tools. On a lattice with n sites one obtains n local
measures, and could perform n tests of significant autocorrelation. This is a
formidable multiplicity problem. Even in the absence of autocorrelation, the
I(si ) are correlated if they involve the same sites, and it is not clear how to
adjust individual Type-I error levels to maintain a desired overall level.

1.4 Autocorrelation Functions

1.4.1 The Autocorrelation Function of a Time Series

Autocorrelation is the correlation among the family of random variables that


make up a stochastic process. In time series, this form of correlation is often
referred to as serial correlation. Consider a (weakly) stationary time series
Z(t1 ), · · · , Z(tn ) with E[Z(ti )] = 0 and Var[Z(ti )] = σ 2 , i = 1, · · · , n. Rather
than a single measure, autocorrelation in a time series is measured by a func-
tion of the time points. The covariance function of the series at points ti and
tj is given by
Cov[Z(ti ), Z(tj )] = E[Z(ti )Z(tj )] = C(tj − ti ), (1.18)
and the (auto)correlation function is then
R(t_j - t_i) = \frac{\mathrm{Cov}[Z(t_i), Z(t_j)]}{\sqrt{\mathrm{Var}[Z(t_i)]\,\mathrm{Var}[Z(t_j)]}} = \frac{C(t_j - t_i)}{C(0)}. \qquad (1.19)
Figure 1.11 shows the realizations of two stochastic processes. Open circles
represent an independent sequence of random variables with mean zero and
variance 0.3. Closed circles represent a sequence of random variables with
mean zero, variance 0.3, and autocorrelation function R(tj − ti ) = ρ|tj −ti | , ρ >
0. The positive autocorrelation is reflected in the fact that runs of positive
residuals alternate with runs of negative residuals. In other words, if an obser-
vation at time t is above (below) average, it is very likely that an observation
in the immediate past was also above (below) average. Positive autocorre-
lation in time series or spatial data is much more common than negative
autocorrelation. The latter is often an indication of an improperly specified
mean function, e.g., the process exhibits deterministic periodicity which has
not been properly accounted for.

Figure 1.11 A sequence of independent observations (open circles) and of correlated


data (AR(1) process, closed circles), σ 2 = 0.3.
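A process of the kind shown as closed circles in Figure 1.11 can be simulated, and its empirical correlogram computed, as in the following sketch (Python/NumPy; the function names are ours; the innovation standard deviation is chosen so that the marginal variance stays at σ², consistent with R(tj − ti) = ρ^|tj − ti|):

```python
import numpy as np

def simulate_ar1(n, rho, sigma2, seed=None):
    """Zero-mean AR(1) series with Var[Z(t)] = sigma2 and R(h) = rho**|h|."""
    rng = np.random.default_rng(seed)
    z = np.empty(n)
    z[0] = rng.normal(0.0, np.sqrt(sigma2))
    innov_sd = np.sqrt(sigma2 * (1.0 - rho ** 2))  # keeps the variance constant
    for t in range(1, n):
        z[t] = rho * z[t - 1] + rng.normal(0.0, innov_sd)
    return z

def empirical_acf(z, max_lag):
    """Empirical autocorrelation R(h) for h = 0, ..., max_lag."""
    zc = z - z.mean()
    c0 = np.sum(zc * zc) / len(zc)
    return np.array([np.sum(zc[h:] * zc[:len(zc) - h]) / len(zc) / c0
                     for h in range(max_lag + 1)])
```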

1.4.2 Autocorrelation Functions in Space—Covariance and Semivariogram

The autocovariance or autocorrelation function of a time series reflects the


notion that values in a time series are not unrelated. If the serial correlation
is positive, a high value at time t is likely to be surrounded by high values
at times t − 1 and t + 1. In the spatial case, autocorrelation is reflected by
the fact that values at locations si and sj are stochastically dependent. If this
correlation is positive, we expect high (low) values to be surrounded by high
(low) values. If it is negative, high (low) values should be surrounded by low
(high) values. Tobler’s first law of geography states that “everything is related
to everything else, but near things are more related than distant things” (To-
bler, 1970). It reflects an additional fact common to many statistical models
for spatial data: correlations decrease with increasing spatial separation.
The covariance function of a spatial process is defined as
C(s, h) = Cov[Z(s), Z(s + h)] = E[{Z(s) − µ(s)}{Z(s + h) − µ(s + h)}]
and the autocorrelation function (or correlogram) is
R(s, h) = \frac{C(s, h)}{\sqrt{\mathrm{Var}[Z(s)]\,\mathrm{Var}[Z(s + h)]}}.
These definitions are obvious extensions of the covariance and correlation
function for time series (equations (1.18) & (1.19)). The differences between
temporal and spatial data are worth noting, however, because they are in
part responsible for the facts that (i) many statistical methods for spatial

data analysis have developed independently of developments in time series


analysis, (ii) that spatial adaptations of time series models face particular
challenges. First, time is directed; it flows from the past through the present
into the future. Space is not directed in the same sense. There is no equivalent
to past, present, or future. Second, edge effects are important for time series
models, but much more so for spatial models. A time series has a starting
and an end point. A configuration in a spatial domain has a bounding volume
with potentially many points located close to the boundary. In time series
applications it is customary to focus on serial correlations between the past
and current values. In spatial applications we have to be concerned with cor-
relations in all directions of the domain. In addition, the spatial dependencies
may develop differently in various directions. This issue of anisotropy of a
spatial process has no parallel in time series analysis.
Like time series analysis, however, many statistical methods of spatial data
analysis make certain assumptions about the behavior of the covariance func-
tion C(h), known as stationarity assumptions. We will visit various types of
stationarity and their implications for the spatial process in detail in §2.2
and the reasons for the importance of stationarity are enunciated in §2.1. For
now, it is sufficient to note that a commonly assumed type of stationarity,
second-order stationarity, implies that the spatial process Z(s) has a con-
stant mean µ and a covariance function which is a function of the spatial
separation between points only,
Cov[Z(s), Z(s + h)] = C(h).
Second-order stationarity implies the lack of importance of absolute coordi-
nates. In the time series context it means that the covariance between two
values separated by two days is the same, whether the first day is a Wednes-
day, or a Saturday. Under second-order stationarity, it follows directly that
Cov[Z(s), Z(s + 0)] = C(0) does not depend on the spatial location. Conse-
quently, Var[Z(s)] = σ 2 is not a function of spatial location. The variability
of the spatial process is the same everywhere. While we need to be careful
to understand what random mechanism the expectation and variance of Z(s)
refer to (see §2.1), it emerges that second-order stationarity is the spatial
equivalent of the iid random sample assumption in classical statistics.
Under second-order stationarity, the covariance and correlation function of
the spatial process simplify; they no longer depend on s:
Cov[Z(s), Z(s + h)] = C(h), \qquad R(h) = \frac{C(h)}{\sigma^2}.
Expressing the second-order structure of a spatial process in terms of the
covariance or the correlation function seems to be equivalent. This is not
necessarily so; qualitatively different processes can have the same correlation
functions (see Chapter problems).
Statisticians are accustomed to describing stochastic dependence between ran-

dom variables through covariances or correlations. Originating in geostatistical


applications that developed outside the mainstream of statistics, a common
tool to capture the second-moment structure in spatial data is the semivari-
ogram. The reasons to prefer the semivariogram over the covariance function
in statistical applications are explored in §4.2. At this point we focus on the
definition of the semivariogram, namely
\gamma(s, h) = \frac{1}{2}\mathrm{Var}[Z(s) - Z(s + h)] \qquad (1.20)
= \frac{1}{2}\left\{\mathrm{Var}[Z(s)] + \mathrm{Var}[Z(s + h)] - 2\mathrm{Cov}[Z(s), Z(s + h)]\right\}.

If the process is second-order stationary, then it follows immediately that


\gamma(h) = \frac{1}{2}\left\{2\sigma^2 - 2C(h)\right\} = C(0) - C(h). \qquad (1.21)
On these grounds the semivariogram and the covariance function contain the
same information, and can be used interchangeably. For second-order station-
ary spatial processes with positive spatial correlation, the semivariogram has
a characteristic shape. Since γ(0) = C(0) − C(0) = 0, the semivariogram
passes through the origin and must attain γ(h∗ ) = σ 2 if the lag exceeds the
distance h∗ for which data points are correlated. Under Tobler’s first law of
geography, we expect C(h) to decrease as h increases. The semivariogram
then transitions smoothly from the origin to the sill σ 2 (Figure 1.12). The
manner in which this transition takes place is important for the smoothness
and continuity properties of the spatial process (see §2.3). The lag distance
h∗ = ||h∗ ||, for which C(h∗ ) = 0, is termed the range of the spatial process.
If the semivariogram achieves the sill only asymptotically, then the practical
range is defined as the lag distance at which the semivariogram achieves 95%
of the sill.
Geostatisticians are also concerned with semivariograms that do not pass
through the origin. The magnitude of the discontinuity of the origin is termed
the nugget effect after one of the possible explanations of the phenomenon:
small nuggets of ore are distributed throughout a larger body of rock. This
microscale variation creates a spatial process with sill θ0 . The spatial struc-
ture of the microscale process cannot be observed unless the spacing of the
sample observations is made smaller. Whereas a nugget effect is commonly
observed in sample semivariograms calculated from data, its very existence
must be called into question if the spatial process is believed to be mean
square continuous (§2.3). The second source for a possible nugget effect in a
sample semivariogram is measurement error.
The semivariogram is not only a descriptive tool, it is also a structural tool
through which the second-order properties of a spatial process can be studied.
Because of its importance in spatial statistics, in particular for methods of
spatial prediction, Chapter 4 is dedicated in its entirety to semivariogram
analysis and estimation.
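As a concrete illustration of sill, practical range, and nugget effect (compare Figure 1.12), here is a sketch of one common parameterization of the exponential semivariogram; the factor 3 in the exponent is an assumption that makes the supplied practical range the lag at which 95% of the sill is attained (the model itself is taken up in Chapter 4):

```python
import numpy as np

def exponential_semivariogram(h, sill, practical_range, nugget=0.0):
    """gamma(h) = nugget + sill * (1 - exp(-3h/practical_range)) for h > 0,
    and gamma(0) = 0; `sill` here is the part of the sill above the nugget."""
    h = np.asarray(h, dtype=float)
    gamma = nugget + sill * (1.0 - np.exp(-3.0 * h / practical_range))
    return np.where(h > 0, gamma, 0.0)
```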

Figure 1.12 Exponential semivariogram with sill 10 and practical range 15. Semi-
variogram not passing through the origin has a nugget effect of θ0 = 4.

1.4.3 From Mantel’s Statistic to the Semivariogram

The measures of spatial association in §1.3.2 define spatial similarity (con-


tiguity) through the weight matrix W and result in a single statistic that
describes the extent to which data are spatially autocorrelated for the entire
domain. The notion of spatial closeness comes into play through the wij , but
the actual distances between sites are not directly used. First, this requires
that each lattice site is identified with a representative point si in Rd . Second,
there is a finite set of distances for which the degree of spatial dependence can
be investigated if the data are on a lattice. Consider a regular, rectangular
lattice. We could define more than one set of neighborhood weights. For ex-
ample, let w_{ij}^{(1)} be first-order weights based on a queen definition as in Figure
1.9c) and let w_{ij}^{(2)} be the second-order weights based on a queen’s move. Then
w_{ij}^{(2)} = 1 if site j can be reached from site i by a queen’s move that passes
over at least one additional tile on the chess board. The two statistics

M_1^{(1)} = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} w_{ij}^{(1)} U_{ij}

M_1^{(2)} = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} w_{ij}^{(2)} U_{ij}

measure the extent of the first-order-neighbor and second-order-neighbor spa-


tial dependence, respectively. Continuing in this manner by extending the
neighborhood, a set of statistics is obtained that describes the degree of spa-
tial dependence with increasing distance from a given site. Alternatively, we
can involve the distances between sites directly and define weights that yield
a measure of autocorrelation at lag distance h:
w_{ij}(h) = \begin{cases} 1 & \text{if } \|s_i - s_j\| = h \\ 0 & \text{otherwise.} \end{cases}

For irregular (or small) lattices we can incorporate a lag tolerance ε > 0 to
ensure a sufficient number of pairs in each lag class:
w_{ij}(h, \epsilon) = \begin{cases} 1 & \text{if } h - \epsilon \le \|s_i - s_j\| < h + \epsilon \\ 0 & \text{otherwise.} \end{cases}
These weight definitions are meaningful whenever an observation is associated
with a point in Rd , hence they can be applied to geostatistical data. Two com-
mon choices for Uij are squared differences and cross-products. The similarity
statistics can be re-written as
M_1^{*} = \sum_{N(h)} (Z(s_i) - Z(s_j))^2

M_1^{**} = \sum_{N(h)} (Z(s_i) - \bar{Z})(Z(s_j) - \bar{Z}),

where N(h) is the set of site pairs that are lag distance h (or h ± ε) apart. The
cardinality of this set is |N(h)| = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} w_{ij}(h) (or \sum\sum w_{ij}(h, \epsilon)). If
the random field has constant mean µ, then (Z(s_i) − Z(s_j))² is an unbiased
estimator of the variogram 2γ(si , sj − si ). If the random field has furthermore
constant variance σ², then (Z(s_i) − Z(s_j))² estimates 2(σ² − C(si − sj )). This
suggests estimating the semivariogram at lag vector h as
\hat{\gamma}(h) = \frac{1}{2|N(h)|}\sum_{N(h)} \{Z(s_i) - Z(s_j)\}^2.

This estimator, due to Matheron (1962, 1963), is termed the classical semi-
variogram estimator and a plot of γ̂(h) versus ||h|| is known as the empirical
semivariogram. Similarly,
\hat{C}(h) = \frac{1}{|N(h)|}\sum_{N(h)} (Z(s_i) - \bar{Z})(Z(s_j) - \bar{Z})

is known as the empirical covariance estimator. The properties of these esti-


mators, their respective merits and demerits, and their utility for describing
and modeling geostatistical data are explored in detail in Chapter 4.
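A sketch of the classical (Matheron) estimator on isotropic distance classes (Python/NumPy; the function name is ours; the lag classes are supplied as bin edges and play the role of the tolerance intervals h ± ε):

```python
import numpy as np

def empirical_semivariogram(coords, z, bin_edges):
    """Matheron's estimator: for each lag class, the average lag,
    gamma_hat(h), and the number of pairs |N(h)|."""
    coords = np.asarray(coords, dtype=float)
    z = np.asarray(z, dtype=float)
    iu = np.triu_indices(len(z), k=1)
    d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))[iu]
    sqdiff = ((z[:, None] - z[None, :]) ** 2)[iu]
    lags, gammas, counts = [], [], []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (d >= lo) & (d < hi)
        if np.any(in_bin):
            lags.append(d[in_bin].mean())
            gammas.append(0.5 * sqdiff[in_bin].mean())
            counts.append(int(in_bin.sum()))
    return np.array(lags), np.array(gammas), np.array(counts)
```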

1.5 The Effects of Autocorrelation on Statistical Inference

Since much of classical applied statistics rests on the assumption of iid ob-
servations, you might be tempted to consider correlation in the data as a
nuisance at first. It turns out, however, (i) that ignoring autocorrelation has
serious implications for the ensuing statistical inference, and (ii) that correla-
tion in the data can be employed to your benefit. To demonstrate these points,
we consider the following simple problem.
Assume that observations Y1 , · · · , Yn are Gaussian distributed with mean
µy , variance σ² and covariances Cov[Yi , Yj ] = σ²ρ (i ≠ j). A second sample
of size n for variable X has similar properties: Xi ∼ G(µx , σ²), (i = 1, · · · , n),
Cov[Xi , Xj ] = σ²ρ (i ≠ j). The samples for Y and X are independent,
Cov[Yi , Xj ] = 0 (∀i, j). Ignoring the fact that the Yi are correlated, one might
consider the sample mean Ȳ = n⁻¹(Y1 + · · · + Yn) as the “natural” estimator of µy . Some
straightforward manipulations yield
\mathrm{Var}[\bar{Y}] = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\mathrm{Cov}[Y_i, Y_j]
= \frac{1}{n^2}\left\{n\sigma^2 + n(n-1)\sigma^2\rho\right\}
= \frac{\sigma^2}{n}\{1 + (n-1)\rho\}.
Assume that ρ > 0, so that Var[Y ] > σ 2 /n; the sample mean is more dis-
persed than in a random sample. More importantly, we note that E[Y ] = µy ,
regardless of the correlations, but that
\lim_{n\to\infty} \mathrm{Var}[\bar{Y}] = \sigma^2\rho.

The sample mean is not a consistent estimator of the population mean µ.


That is bad news.
A test of the hypothesis H0 : µx = µy that proceeds as if the data were
uncorrelated would use test statistic
Z^{*}_{obs} = \frac{\bar{Y} - \bar{X}}{\sqrt{\sigma^2\, 2/n}},
whereas the correct test statistic would be
Z_{obs} = \frac{\bar{Y} - \bar{X}}{\sqrt{\sigma^2\, 2\{1 + (n-1)\rho\}/n}} < Z^{*}_{obs}.
The test statistic Z^{*}_{obs} does not account for the autocorrelation and is too large;
p-values are too small; the evidence in the data against the null hypothesis is
overstated. The test rejects more often than it should.
The effect of positive autocorrelation is that n correlated observations do
not provide the same amount of information as n uncorrelated observa-
tions. Cressie (1993, p. 15) approaches this problem by asking “How many

samples of the uncorrelated kind provide the same precision as a sample of


correlated observations?” If n denotes the number of correlated samples and n′
the number of uncorrelated samples, the effective sample size is calculated as
n' = \frac{n}{1 + (n-1)\rho}. \qquad (1.22)
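Equation (1.22) is trivial to evaluate; a small sketch with an illustrative calculation:

```python
def effective_sample_size(n, rho):
    """Effective sample size (1.22) under equicorrelation rho."""
    return n / (1 + (n - 1) * rho)

# For example, 100 observations with pairwise correlation 0.2 carry about as
# much information about the mean as roughly 4.8 uncorrelated observations:
# effective_sample_size(100, 0.2) = 100 / 20.8 = 4.807...
```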
To draw the conclusion from this demonstration that autocorrelation of data
is detrimental would be incorrect. What the exercise conveys is that ignoring
correlations and relying on the statistics known to perform well in iid samples
is detrimental. How can the apparent loss of power in testing H0 : µx = µy be
recovered? For one, X and Y are not the most efficient estimators of µx and
µy in this problem. The generalized least squares estimator
\hat{\mu}_y = (\mathbf{1}'\Sigma^{-1}\mathbf{Y})/(\mathbf{1}'\Sigma^{-1}\mathbf{1}),
where Σ = σ²{(1 − ρ)I + ρJ}, should be used instead of Ȳ. Test statistics
should be derived based on µ̂y − µ̂x .

1.5.1 Effects on Prediction

Autocorrelations must be accounted for to achieve viable inferences. In other


situations, the very presence of autocorrelations strengthens statistical abili-
ties. An important case in point is the prediction of random variables. Consider
again the simple model
Y = 1µ + e,
with Y = [Y1 , · · · , Yn ]′, 1 = [1, · · · , 1]′, e = [e1 , · · · , en ]′, E[e] = 0, and
\mathrm{Var}[\mathbf{e}] = \Sigma = \sigma^2\begin{bmatrix}
1 & \rho & \rho & \cdots & \rho \\
\rho & 1 & \rho & \cdots & \rho \\
\rho & \rho & 1 & \cdots & \rho \\
\vdots & & & \ddots & \vdots \\
\rho & \cdots & \cdots & \rho & 1
\end{bmatrix} = \sigma^2\{(1-\rho)\mathbf{I} + \rho\mathbf{J}\}.
This structure is termed the equicorrelation, compound symmetry, or ex-
changeable correlation structure. It arises naturally in situations with hier-
archical random effects, e.g., models for split-plot designs or experiments in-
volving sub-sampling. The compound symmetry structure is not commonly
used in spatial statistics to model autocorrelations. It is not a reasonable cor-
relation model for most spatial data since it does not take into account the
spatial configuration. We select it here because the simple form of Σ enables
us to carry out the manipulations that follow in closed form; Σ−1 is easily
obtained.
Imagine that the prediction of a new observation Y0 is of interest. Since the
observed data are correlated, it is reasonable to assume that the new obser-
vation is also correlated with Y; Cov[Y, Y0 ] = c = σ 2 ρ1. To find a suitable
predictor p(Y0 ), certain restrictions are imposed. We want the predictor to

be linear in the observed data, i.e., p(Y0 ) = λ! Y. Furthermore, the predictor


should be unbiased in the sense that E[p(Y0 )] = E[Y0 ]. This constraint implies
that λ′1 = 1. If the measure of prediction loss is squared-error, we are led to
the minimization of
E[(p(Y0 ) − Y0 )²] subject to λ′1 = 1 and p(Y0 ) = λ′Y.
This can be rewritten as an unconstrained minimization problem using the
Lagrange multiplier m,
min_λ E[(p(Y0 ) − Y0 )²] subject to λ′1 = 1
⇔ min_{λ,m} Var[p(Y0 )] + Var[Y0 ] − 2Cov[p(Y0 ), Y0 ] − 2m(λ′1 − 1)
⇔ min_{λ,m} λ′Σλ + σ² − 2λ′c − 2m(λ′1 − 1).
Taking derivatives with respect to λ and m and setting these to zero leads to
the system of equations
0 = Σλ − c − m1
0 = λ′1 − 1.
The solutions turn out to be
m = (1 - \mathbf{1}'\Sigma^{-1}\mathbf{c})(\mathbf{1}'\Sigma^{-1}\mathbf{1})^{-1} \qquad (1.23)
\boldsymbol{\lambda}' = \left(\mathbf{c} + \mathbf{1}\,\frac{1 - \mathbf{1}'\Sigma^{-1}\mathbf{c}}{\mathbf{1}'\Sigma^{-1}\mathbf{1}}\right)'\Sigma^{-1}. \qquad (1.24)
After some algebraic manipulations, the best linear unbiased predictor (BLUP)
p(Y0 ) = λ′Y can be expressed as
p(Y_0) = \hat{\mu} + \mathbf{c}'\Sigma^{-1}(\mathbf{Y} - \mathbf{1}\hat{\mu}), \qquad (1.25)
where µ̂ is the generalized least squares estimator: µ̂ = (1′Σ⁻¹1)⁻¹1′Σ⁻¹Y.
The minimized mean-squared error, the prediction variance, is
\sigma^2_{pred} = \sigma^2 - \mathbf{c}'\Sigma^{-1}\mathbf{c} + (1 - \mathbf{1}'\Sigma^{-1}\mathbf{c})^2(\mathbf{1}'\Sigma^{-1}\mathbf{1})^{-1}. \qquad (1.26)
We note in passing that expression (1.25) is known in the geostatistical vernac-
ular as the ordinary kriging predictor and (1.26) is called the ordinary kriging
variance. Spatial prediction and kriging are covered in detail in Chapters 5 and
6. In the current section we continue to explore the effects of autocorrelation
on statistical inference without defining any one particular method.
If the data were uncorrelated (ρ = 0), the BLUP for Y0 would be p*(Y0 ) = Ȳ
and the prediction variance would be σ²_pred = σ²(1 + 1/n), a familiar result.
If we were to repeat the exercise with the goal to estimate µ, rather than
to predict Y0 , subject to the same linearity and unbiasedness constraints, we
would find p(µ) = (1′Σ⁻¹1)⁻¹1′Σ⁻¹Y = µ̂ as the best predictor for µ. In the
case of uncorrelated data, the best predictor of the random variable Y0 and
the best estimator of its mean E[Y0 ] are identical. When the correlations are
taken into account, the predictor of Y0 and the estimator of E[Y0 ] differ,
p(Y0 ) = p(µ) + c′Σ⁻¹(Y − 1p(µ)).
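A numerical sketch of the predictor (1.25) and its variance (1.26) under the equicorrelation model, with Σ and c built exactly as defined above (Python/NumPy; the function name is ours):

```python
import numpy as np

def blup_compound_symmetry(y, rho, sigma2):
    """BLUP of a new observation Y0 and the prediction variance,
    equations (1.25) and (1.26), with Cov[Y, Y0] = sigma2 * rho * 1."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    ones = np.ones(n)
    sigma = sigma2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))
    c = sigma2 * rho * ones
    sig_inv_1 = np.linalg.solve(sigma, ones)
    sig_inv_c = np.linalg.solve(sigma, c)
    mu_hat = (ones @ np.linalg.solve(sigma, y)) / (ones @ sig_inv_1)  # GLS mean
    p_y0 = mu_hat + c @ np.linalg.solve(sigma, y - mu_hat * ones)     # (1.25)
    var_pred = (sigma2 - c @ sig_inv_c
                + (1 - ones @ sig_inv_c) ** 2 / (ones @ sig_inv_1))   # (1.26)
    return p_y0, var_pred
```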

To investigate how the precision of the predicted value is affected by incorporating
the correlations, we derive a scalar expression for σ²_pred which can be
compared against σ²(1 + 1/n), the prediction error when data are not correlated.
The inverse of Σ can be found by applying Theorem 8.3.4 in Graybill
(1983, p. 190).

Theorem 1.1 If the k × k matrix C can be written as C = (a − b)I + bJ,
then C⁻¹ exists if and only if a ≠ b and a ≠ −(k − 1)b, and is given by
\mathbf{C}^{-1} = \frac{1}{a-b}\left(\mathbf{I} - \frac{b}{a + (k-1)b}\,\mathbf{J}\right).

The proof is given on p. 191 of Graybill (1983).

The theorem can be applied to our situation since σ⁻²Σ = (1 − ρ)I + ρJ. The
condition that ρ ≠ −1/(n − 1) is met. In fact, from Var[Yi ] = σ² > 0 it follows
that Var[\sum_{i=1}^{n} Y_i] = nσ²(1 + (n − 1)ρ) > 0, which implies ρ > −1/(n − 1). That
the correlation coefficient is bounded from below is a simple consequence of
equicorrelation. Applying Theorem 1.1 leads to 1′Σ⁻¹1 = nσ⁻²/[1 + (n − 1)ρ].
Finally, after some (tedious) algebra we obtain the prediction variance in the
compound symmetry model:
\sigma^2_{pred} = \sigma^2\left(1 + \frac{1}{n}\,\frac{(\rho n)^2 + (1-\rho)^2}{1 + (n-1)\rho}\right).
In order for the term {(ρn)² + (1 − ρ)²}/{1 + (n − 1)ρ} to be less than one, we
must have (provided ρ > 0)
\rho < \frac{n+1}{n^2+1}. \qquad (1.27)
If the condition (1.27) is met, predictions in the compound symmetry model
will be more precise than in the independence model. As the strength of the
correlation increases predictions in the compound symmetry model can be
less precise, however, because the effective sample size shrinks quickly.

1.5.2 Effects on Precision of Estimators

The effect of ignoring autocorrelation in the data and proceeding with in-
ference as if the data points were uncorrelated was discussed in §1.5. The
effective sample size formula (1.22) allows a comparison of the precision of
the arithmetic sample mean for the compound symmetry model and the case
of uncorrelated data. The intuitive consequence of this expression is that posi-
tive autocorrelation results in a “loss of information.” A sample of independent
observations of size n contains more information than a sample of autocorrelated
observations of the same size. As noted, the arithmetic sample mean is not the
appropriate estimator of the population mean in the case of correlated data.
To further our understanding of the consequences of positive autocorrelation

and the idea of “information loss,” consider the following setting.


Yi = µ + ei
E[ei ] = 0
Cov[Yi , Yj ] = σ 2 ρ|i−j| (1.28)
i = 1, · · · , n.

This model is known as the autoregressive model of first order, or simply the
AR(1) model. Let Y = [Y1 , · · · , Yn ]! and Var[Y] = Σ. As for the compound
symmetry model, an expression for Σ−1 is readily available. Graybill (1983,
pp. 198–201) establishes that Σ−1 is a diagonal matrix of type 2, that is,
σij (= 0 if |i − j| ≤ 1, σij = 0 if |i − j| > 1 and that
 
1 −ρ 0 0 ··· 0
 −ρ 1 + ρ2 −ρ 0 ··· 0 
 
 0 −ρ 1 + ρ2
−ρ ··· 0 
1  .. 
Σ−1 = 2  
σ (1 − ρ )  0
2 0 −ρ 1+ρ 2
··· . 
 . .. .. .. 
 . . 
. . . ··· −ρ
0 0 ··· 0 −ρ 1
The generalized least squares estimator of µ is
\hat{\mu} = (\mathbf{1}'\Sigma^{-1}\mathbf{1})^{-1}(\mathbf{1}'\Sigma^{-1}\mathbf{Y}) \qquad (1.29)
and some algebra leads to (see Chapter problems)
\mathrm{Var}[\hat{\mu}] = \sigma^2\,\frac{1+\rho}{(n-2)(1-\rho)+2}. \qquad (1.30)
Two special cases are of particular interest. If ρ = 0, the variance is equal
to σ²/n as it should, since µ̂ is then the arithmetic mean. If the data points
are perfectly correlated (ρ = 1) then Var[µ̂] = σ², the variance of a single
observation. The variance of the estimator then does not depend on sample
size. Having observed one observation, no additional information is accrued
by sampling additional values; if ρ = 1, all further values would be identical
to the first.
When you estimate µ in the autoregressive model by (1.29), the correlations
are not ignored, the best possible linear estimator is being used. Yet, compared
to a set of independent data of the same sample size, the precision of the
estimator is reduced. The values in the body of Table 1.2 express how many
times more variable µ̂ is compared to σ²/n, that is,
\left(\frac{1-\rho}{1+\rho} + \frac{2}{n}\,\frac{\rho}{1+\rho}\right)^{-1}.
This is not a relative efficiency in the usual sense. It does not set into relation
the mean square errors of two competing estimators for the same set of data. It
is a comparison of mean-squared errors for two estimators under two different

data scenarios. We can think of the values in Table 1.2 as the relative excess
variability (REV) incurred by correlation in the data.
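The entries of Table 1.2 can be reproduced by evaluating (1.30) and comparing it to σ²/n, either through the closed form or directly from the AR(1) covariance matrix; a sketch (Python/NumPy; the function names are ours):

```python
import numpy as np

def gls_variance_ar1(n, rho, sigma2=1.0):
    """Variance (1.30) of the GLS estimator of mu under the AR(1) model,
    computed from Sigma directly and from the closed-form expression."""
    idx = np.arange(n)
    sigma = sigma2 * rho ** np.abs(idx[:, None] - idx[None, :])
    ones = np.ones(n)
    var_direct = 1.0 / (ones @ np.linalg.solve(sigma, ones))
    var_closed = sigma2 * (1 + rho) / ((n - 2) * (1 - rho) + 2)
    return var_direct, var_closed

def relative_excess_variability(n, rho):
    """REV: how many times more variable the GLS mean is than sigma^2/n."""
    return gls_variance_ar1(n, rho)[1] / (1.0 / n)

# e.g. relative_excess_variability(5, 0.9) is about 4.13, matching Table 1.2.
```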

Table 1.2 Relative excess variability of the generalized least squares estimator (1.29)
in AR(1) model.

Autocorrelation Parameter ρ                                          Sample Size

0.0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9                n
1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1
1.00 1.14 1.29 1.44 1.62 1.80 2.00 2.22 2.45 2.71 3
1.00 1.17 1.36 1.59 1.84 2.14 2.50 2.93 3.46 4.13 5
1.00 1.18 1.40 1.65 1.96 2.33 2.80 3.40 4.20 5.32 7
1.00 1.19 1.42 1.70 2.03 2.45 3.00 3.73 4.76 6.33 9
1.00 1.20 1.43 1.72 2.08 2.54 3.14 3.98 5.21 7.21 11
1.00 1.20 1.44 1.74 2.12 2.60 3.25 4.17 5.57 7.97 13
1.00 1.20 1.45 1.76 2.14 2.65 3.33 4.32 5.87 8.64 15
1.00 1.21 1.46 1.77 2.16 2.68 3.40 4.45 6.12 9.23 17
1.00 1.21 1.46 1.78 2.18 2.71 3.45 4.55 6.33 9.76 19
1.00 1.21 1.47 1.78 2.19 2.74 3.50 4.64 6.52 10.2 21
1.00 1.21 1.47 1.79 2.21 2.76 3.54 4.71 6.68 10.7 23
1.00 1.21 1.47 1.80 2.22 2.78 3.57 4.78 6.82 11.0 25
1.00 1.21 1.47 1.80 2.22 2.79 3.60 4.83 6.94 11.4 27
1.00 1.21 1.47 1.80 2.23 2.81 3.63 4.88 7.05 11.7 29
1.00 1.21 1.48 1.81 2.24 2.82 3.65 4.93 7.15 12.0 31
1.00 1.21 1.48 1.81 2.24 2.83 3.67 4.96 7.24 12.3 33
1.00 1.21 1.48 1.81 2.25 2.84 3.68 5.00 7.33 12.5 35
1.00 1.21 1.48 1.82 2.25 2.85 3.70 5.03 7.40 12.8 37
1.00 1.22 1.48 1.82 2.26 2.85 3.71 5.06 7.47 13.0 39
1.00 1.22 1.48 1.82 2.26 2.86 3.73 5.09 7.53 13.2 41
1.00 1.22 1.48 1.82 2.26 2.87 3.74 5.11 7.59 13.4 43
1.00 1.22 1.48 1.82 2.27 2.87 3.75 5.13 7.64 13.6 45
1.00 1.22 1.48 1.82 2.27 2.88 3.76 5.15 7.69 13.7 47
1.00 1.22 1.48 1.83 2.27 2.88 3.77 5.17 7.74 13.9 49

For any given sample size n > 1 the REV of (1.29) increases with ρ. No
loss is incurred if only a single observation is collected, since then µ̂ = Y1 .
The REV increases with sample size and this effect is more pronounced for
ρ large than for ρ small. As is seen from (1.30) and the fact that E[µ̂] = µ,
the GLS estimator is consistent for µ. Its precision increases with sample
size for any given value ρ < 1. An important message that you can glean
from these computations is that the most efficient estimator when data are
(positively) correlated can be (much) more variable than the most efficient
estimator for independent data. In designing simulation studies with corre-
lated data these effects are often overlooked. The number of simulation runs

sufficient to achieve small simulation variability for independent data is less


than the needed number of runs for correlated data.

1.6 Chapter Problems

Problem 1.1 Categorize the following examples of spatial data as to their


data type:
(i) Elevations in the foothills of the Allegheny mountains;
(ii) Highest elevation within each state in the United States;
(iii) Concentration of a mineral in soil;
(iv) Plot yields in a uniformity trial;
(v) Crime statistics giving names of subdivisions where break-ins occurred
in the previous year and property loss values;
(vi) Same as (v), but instead of the subdivision, the individual dwelling is
identified;
(vii) Distribution of oaks and pines in a forest stand;
(viii) Number of squirrel nests in the pines of the stand in (vii).

Problem 1.2 Consider Y1 , · · · , Yn , Gaussian random variables with mean µ,
variance σ², and Cov[Yi , Yj ] = σ²ρ, ∀i ≠ j. Is S^2 = (n-1)^{-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2
an unbiased estimator of σ²?

Problem 1.3 Derive the mean and variance of the BB join count statistic
(1.7) under the assumption of binomial sampling. Notice that E[Z(si )k ] =
π, ∀k, because Z(si ) is an indicator variable. Also, under the null hypothesis
of no autocorrelation, Var[Z(si )Z(sj )] = π 2 − π 4 .

Problem 1.4 Is the variance of the BB join count statistic larger under
binomial sampling or under hypergeometric sampling? Imagine that you are
studying the retail volume of grocery stores in a municipality. The data are
coded such that Z(si ) = 1 if the retail volume of the store at site si exceeds
20 million dollars per year, Z(si ) = 0 otherwise. The BB join count statistic
with suitably chosen weights is used to test for spatial autocorrelation in the
sale volumes. Discuss a situation when you would rely on the assumption of
binomial sampling and one where hypergeometric sampling is appropriate.

Problem 1.5 Let W = [wij ] be a spatial contiguity matrix and let ui =
Z(si ) − Z̄. Collect the ui into a vector u = [u1 , · · · , un ]′ and standardize the
weights such that \sum_{i,j} w_{ij} = n. Let Y = Wu and consider the regression
through the origin Y = βu + e, e ∼ (0, σ²I). What is measured by the slope
β?

Problem 1.6 Given a lattice of sites, assume that Z(si ) is a binary variable.
Compare the Black-White join count statistic to Moran’s I in this case (using
the same contiguity definition). Are they the same? If not, are they very
different? What advantages would one have in using the BB (or BW ) statistic
in this problem that I does not offer?

Problem 1.7 Show that Moran’s I is a scale-free statistic, i.e., Z(s) and
λZ(s) yield the same I statistic.

Problem 1.8 Consider the simple 2 × 3 lattice with observations Z(s1 ) =


5, Z(s2 ) = −3, Z(s3 ) = −6, etc. For problems (i)–(iii) that follow, assume the
rook definition of spatial connectivity.
Column 1 Column 2 Column 3
Row 1 5 −3 −6
Row 2 2 4 −2

(i) Derive the mean and variance of Moran’s I under randomization em-
pirically by enumerating the 6! permutations of the lattice. Compare your
answer to the formulas for E[I] and Var[I].
(ii) Calculate the empirical p-value for the hypothesis of no spatial autocor-
relation. Compare it against the p-value based on the Gaussian approxima-
tion under randomization. For this problem you need to know the variance
of I under randomization. It can be obtained from the following expression,
given in Cliff and Ord (1981, Ch. 2):
\mathrm{E}_r[I^2] = \frac{n\left[(n^2 - 3n + 3)S_1 - nS_2 + 3w_{..}^2\right] - b\left[(n^2 - n)S_1 - 2nS_2 + 6w_{..}^2\right]}{(n-3)(n-2)(n-1)\,w_{..}^2}

S_1 = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(w_{ij} + w_{ji})^2

S_2 = \sum_{i=1}^{n}\left(\sum_{j=1}^{n} w_{ij} + \sum_{j=1}^{n} w_{ji}\right)^2

b = n\,\frac{\sum_{i=1}^{n}(Z(s_i) - \bar{Z})^4}{\left\{\sum_{i=1}^{n}(Z(s_i) - \bar{Z})^2\right\}^2}

(iii) Prepare a histogram of the 6! I values. Do they appear Gaussian?

Problem 1.9 Consider Y1 , · · · , Yn , Gaussian random variables with mean µ,


variance σ 2 , and Cov[Yi , Yj ] = σ 2 ρ(|i − j| = 1). Assume that ρ > 0 and find
the power function of the test for H0 : µ = µ0 versus H1 : µ > µ0 , where the
test statistic is given by
Z_{obs} = \frac{\bar{Y} - \mu_0}{\sqrt{\mathrm{Var}[\bar{Y}]}}.

Compare this power function to the power function that is obtained for ρ = 0.

Problem 1.10 Using the same setup as in the previous problem, find the
best linear unbiased predictor p(Y0 ) for a new observation based on Y1 , · · · , Yn .
Compare its prediction variance σ²_pred to that of the predictor Ȳ.

Problem 1.11 Show that (1.23) and (1.24) are the solutions to the con-
strained minimization problem that yields the best (under squared-error loss)
linear unbiased predictor in §1.5.1. Establish that the solution is indeed a
minimum.

Problem 1.12 Establish algebraically the equivalence of p(Y0 ) = λ! Z(s) and


(1.25), where λ is given by (1.24).

Problem 1.13 Consider a random field Z(s) ∼ G(µ, σ 2 ) with covariance


function Cov[Z(s), Z(s + h)] = C(h) and let s0 be an unobserved location. If
[Z(s0 ), Z(s)]! are jointly Gaussian distributed, find E[Z(s0 )|Z(s)] and compare
your finding to (1.25).

Problem 1.14 For second-order stationary processes it is common—particu-


larly in time series analysis—to describe the second-order properties through
the auto-correlation function (the correlogram) R(h) = C(h)/C(0). The ad-
vantage of R(h) is ease of interpretation since covariances are scale-dependent.
It turns out, however, that the correlogram is an incomplete description of the
second-order properties. Qualitatively different processes can have the same
correlogram. Consider the autoregressive time series
Z1 (t) = ρZ(t − 1) + e(t)
Z2 (t) = ρZ(t − 1) + U (t),
where E[Z1 (t)] = E[Z2 (t)] = 0, e(t) ∼ iid G(0, σ 2 ), and
U(t) \sim \text{iid} \begin{cases} 0 & \text{with probability } p \\ G(0, \sigma^2) & \text{with probability } 1 - p. \end{cases}

(i) Find the covariance function, the correlogram, and the semivariogram
for the processes Z1 (t) and Z2 (t).
(ii) Set ρ = 0.5 and simulate the two processes with p = 0.5, 0.8, 0.9. For
each of the simulations generate a sequence for t = 0, 1, · · · , 200 and graph
the realization against t. Choose σ 2 = 0.2 throughout.

Problem 1.15 Using the expression for Σ−1 in §1.5.2, derive the variance of
the generalized least squares estimator in (1.30).

Problem 1.16 Table 1.2 gives relative excess variabilities (REV) for the GLS
estimator with variance (1.30) for several values of ρ ≥ 0. Derive a table akin
to Table 1.2 and discuss the REV if −1 < ρ ≤ 0.
CHAPTER 2

Some Theory on Random Fields

2.1 Stochastic Processes and Samples of Size One

A stochastic process is a family or collection of random variables, the members


of which can be identified or located (indexed) according to some metric. For
example, a time series Y (t), t = t1 , · · · , tn , is indexed by the time points
t1 , · · · , tn at which the series is observed. Similarly, a spatial process is a
collection of random variables that are indexed by some set D ⊂ Rd containing
spatial coordinates s = [s1 , s2 , · · · , sd ]! . For a process in the plane, d = 2, and
the longitude and latitude coordinates are often identified as s = [x, y]! . If
the dimension d of the index set of the stochastic process is greater than
one, the stochastic process is often referred to as a random field. In this
text we are mostly concerned with spatial processes in R2 , although higher-
dimensional spaces are implicit in many derivations; spatio-temporal processes
are addressed in Chapter 9. The name random “field” should not connote a
two-dimensional plane or even an agricultural application. It is much more
general.
In classical, applied statistics, stochastic process formulations of random
experiments are uncommon. Its basis is steeped in the notion of random sam-
pling, i.e., iid observations. To view the time series Y (t) or the spatial data
Z(s) as a stochastic process is not only important because the observations
might be correlated. It is the random mechanism that generates the data
which is viewed differently from what you might be used to. To be more pre-
cise, think of Z(s), the value of the attribute Z at location s, as the outcome
of a random experiment ω. Extending the notation slightly for the purpose of
this discussion, we put Z(s, ω) to make the dependency on the random exper-
iment explicit. A particular realization ω produces a surface Z(·, ω). Because
the surface from which the samples are drawn is the result of this random ex-
periment, Z(s) is also referred to as a random function. As a consequence,
the collection of n georeferenced observations that make up the spatial data
set do not represent a sample of size n. They represent the incomplete observa-
tion of a single realization of a random experiment; a sample of size one from
an n-dimensional distribution. This raises another, important question: if we
put statements such as E[Z(s)] = µ(s), with respect to what distribution is
the expectation being taken? The expectation represents the long-run average
of the attribute at location s over the distribution of the possible realizations

ω of the random experiment,


E[Z(s)] = EΩ [Z(s, ω)]. (2.1)

Imagine pouring sand from a bucket onto a surface. The sand distributes
on the surface according to the laws of physics. We could—given enough
resources—develop a model that predicts with certainty how the grains come
to lie on the surface. By the same token, we can develop a deterministic model
that predicts with certainty whether a coin will land on heads or tails, taking
into account the angle and force at which the coin is released, the conditions
of the air through which it travels, the conditions of the surface on which
it lands, and so on. It is accepted, however, to consider the result of the
coin-flip as the outcome of a random experiment. This probabilistic model is
more parsimonious and economic than the deterministic model, and enables
us to address important questions; e.g., whether the coin is fair. Considering
the precise placement of the sand on the surface as the result of a random
experiment is appropriate by similar reasoning. At issue is not that we consider
the placement of the sand a random event. What is at issue is that the sand
was poured only once; no matter at how many locations the depth of the sand
is measured. If we are interested in the long-run average depth of the sand at
a particular location s0 , the expectation (2.1) tells us that we need to repeat
the process of pouring the sand over and over again and consider the expected
value with respect to the probability distribution of all surfaces so generated.
The implications are formidable. How are we to learn anything about the
variability of a random process if only a single realization is available? In
practical applications there is usually no replication in spatial data in the
sense of observing several, independent realizations of the process. Are infer-
ences about the long-run average really that important then? Are we then
not more interested to model and predict the realized surface rather than
some average surface? How are we to make progress with statistical inference
based on a sample of size one? Fortunately, we can, provided that the random
process has certain stationarity properties. The assumption of stationarity in
random fields is often criticized, and sometimes justifiably so. Analyzing ob-
servations from a stochastic process as if the process were stationary—when it
is not—can lead to erroneous inferences and conclusions. Without a good un-
derstanding of stationarity (and isotropy) issues, little progress can be made
in the study of non-stationary processes. And in the words of Whittle (1954),
The processes we mentioned can only as a first approximation be regarded
as stationary, if they can be so regarded at all. However, the approximation is
satisfactory sufficiently often to make the study of the stationary type of process
worth while.

2.2 Stationarity, Isotropy, and Heterogeneity

A random field
{Z(s) : s ∈ D ⊂ Rd } (2.2)

is called a strict (or strong) stationary field if the spatial distribution is


invariant under translation of the coordinates, i.e.,
Pr(Z(s_1) < z_1, Z(s_2) < z_2, \cdots, Z(s_k) < z_k) = Pr(Z(s_1 + h) < z_1, Z(s_2 + h) < z_2, \cdots, Z(s_k + h) < z_k),
for all k and h. A strictly stationary random field repeats itself throughout
the domain.
As the name suggests, strict stationarity is a stringent condition; most statistical
methods for spatial data analysis are satisfied with stationarity conditions based
on the moments of the spatial distribution rather than the
distribution itself. This is akin to statistical estimation proceeding on the
basis of the mean and the variance of a random variable only; least squares
methods, for example, do not require knowledge of the data’s joint distri-
bution. Second-order (or weak) stationarity of a random field implies that
E[Z(s)] = µ and Cov[Z(s), Z(s + h)] = C(h). The mean of a second-order
stationary random field is constant and the covariance between attributes at
different locations is only a function of their spatial separation (the lag-vector)
h. Stationarity reflects the lack of importance of absolute coordinates. The
covariance of observations spaced two days apart in a second-order stationary
time series will be the same, whether the first day is a Monday or a Friday.
The function C(h) is called the covariance function of the spatial process
and plays an important role in statistical modeling of spatial data. Strict sta-
tionarity implies second-order stationarity but the reverse is not true by the
same token by which we cannot infer the distribution of a random variable
from its mean and variance alone.
The existence of the covariance function C(h) in a second-order stationary
random field has important consequences. Since C(h) does not depend on
absolute coordinates and Cov[Z(s), Z(s + 0)] = Var[Z(s)] = C(0), it follows
that the variability of a second-order stationary random field is the same
everywhere. A second-order stationary spatial process has a constant mean,
constant variance, and a covariance function that does not depend on absolute
coordinates. Such a process is the spatial equivalent of a random sample in
classical statistics in which observations have the same mean and dispersion
(but are uncorrelated).
The covariance function C(h) of a second-order stationary random field has
several other properties. In particular,

(i) C(0) ≥ 0;
(ii) C(h) = C(−h), i.e., C is an even function;
(iii) C(0) ≥ |C(h)|;
(iv) C(h) = Cov[Z(s), Z(s + h)] = Cov[Z(0), Z(h)];
(v) If Cj(h) are valid covariance functions, j = 1, · · · , k, then Σ_{j=1}^{k} bj Cj(h)
    is a valid covariance function, if bj ≥ 0 ∀j;
(vi) If Cj(h) are valid covariance functions, j = 1, · · · , k, then ∏_{j=1}^{k} Cj(h)
    is a valid covariance function;
(vii) If C(h) is a valid covariance function in Rd, then it is also a valid
    covariance function in Rp, p < d.

Properties (i) and (ii) are immediate, since C(h) = Cov[Z(s), Z(s + h)]. At
lag h = 0 this yields the variance of the process. Since C(h) does not de-
pend on spatial location s—otherwise the process would not be second-order
stationary—we have C(h) = Cov[Z(s), Z(s + h)] = Cov[Z(t − h), Z(t)] =
C(−h) for t = s + h. Since R(h) = C(h)/C(0) is the autocorrelation function
and is bounded −1 ≤ R(h) ≤ 1, (iii) follows from the Cauchy-Schwarz in-
equality. The lack of importance of absolute coordinates that is characteristic
of a stationary random field is the reason behind (iv). This particular prop-
erty will be helpful later to construct covariance functions for spatio-temporal
data. Properties (i) and (iii) together suggest that the covariance function has
a true maximum at the origin. In §2.3 it is shown formally that this is indeed
the case.
Property (v) is useful to construct covariance models as linear combinations
of basic covariance models. It is also an important mechanism in nonparamet-
ric modeling of covariances. The proof of this property is simple, once we
have established what valid means. For a covariance function C(si − sj ) of a
second-order stationary spatial random field in Rd to be valid, C must satisfy
the positive-definiteness condition

Σ_{i=1}^{k} Σ_{j=1}^{k} ai aj C(si − sj) ≥ 0,    (2.3)

for any set of locations and real numbers. This is an obvious requirement,
since (2.3) is the variance of the linear combination a′[Z(s1), · · · , Z(sk)].
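To make the validity requirement (2.3) concrete, the following sketch (assuming NumPy is available; the exponential covariance C(h) = σ² exp(−||h||/α) and the particular parameter values are illustrative choices, not taken from the text) builds the covariance matrix implied by a valid model at a set of arbitrary locations and checks numerically that quadratic forms a′Ca are nonnegative, equivalently that the smallest eigenvalue of the matrix is nonnegative.

```python
import numpy as np

rng = np.random.default_rng(42)

def exponential_cov(d, sigma2=1.0, alpha=0.5):
    """Exponential covariance C(h) = sigma2 * exp(-||h|| / alpha)."""
    return sigma2 * np.exp(-d / alpha)

# k arbitrary locations in the unit square
k = 50
s = rng.uniform(0, 1, size=(k, 2))

# matrix of Euclidean distances ||s_i - s_j|| and the implied covariance matrix
dist = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
C = exponential_cov(dist)

# condition (2.3): a'Ca >= 0 for every real vector a,
# equivalently all eigenvalues of C are nonnegative
eigvals = np.linalg.eigvalsh(C)
print("smallest eigenvalue:", eigvals.min())      # >= 0 (up to rounding error)

a = rng.normal(size=k)
print("quadratic form a'Ca:", a @ C @ a)          # nonnegative
```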
In time series analysis, stationarity is just as important as with spatial
data. A frequent device employed there to turn a non-stationary series into a
stationary one is differencing of the series. Let Y (t) denote an observation in
the series at time t and consider the random walk
Y (t) = Y (t − 1) + e(t),
where the e(t) are independent random variables with mean 0 and variance σ².
Although we have E[Y(t)] = E[Y(t − k)], the variance is not constant,

Var[Y(t)] = tσ²,

and the covariance does depend on the origin, Cov[Y(t), Y(t − k)] = (t − k)σ².
The random walk is not second-order stationary. However, the first differences
Y (t) − Y (t − 1) are second-order stationary (see Chapter problems). A similar
device is used in spatial statistics; even if Z(s) is not second-order stationary,
the increments Z(s)−Z(s + h) might be. A process that has this characteristic
is said to have intrinsic stationarity. It is often defined as follows: the process
{Z(s) : s ∈ D ⊂ Rd} is said to be intrinsically stationary if E[Z(s)] = µ and

(1/2)Var[Z(s) − Z(s + h)] = γ(h).
The function γ(h) is called the semivariogram of the spatial process. It can
be shown that the class of intrinsically stationary processes is larger than the class
of second-order stationary processes (Cressie, 1993, Ch. 2.5.2). To see that a
second-order stationary process is also intrinsically stationary, it is sufficient
to examine

Var[Z(s) − Z(s + h)] = Var[Z(s)] + Var[Z(s + h)] − 2Cov[Z(s), Z(s + h)]
                     = 2{Var[Z(s)] − C(h)}
                     = 2{C(0) − C(h)} = 2γ(h).

On the other hand, intrinsic stationarity does not imply second-order stationarity.
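The random walk above is a convenient test case: it is intrinsically stationary (its increments have a variance that depends only on the lag) but not second-order stationary. The following simulation sketch (assuming NumPy; purely an empirical illustration, not part of the formal development) checks both features by computing variances across many independent realizations.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
n_time, n_reps = 200, 5000

# n_reps independent realizations of the random walk Y(t) = Y(t-1) + e(t), Y(0) = 0
e = rng.normal(0.0, sigma, size=(n_reps, n_time))
Y = np.cumsum(e, axis=1)

# variance across realizations grows (approximately) linearly in t ...
print("Var[Y(10)], Var[Y(100)]:", Y[:, 9].var(), Y[:, 99].var())     # ~10, ~100

# ... whereas the first differences Y(t) - Y(t-1) = e(t) have constant variance
D = np.diff(Y, axis=1)
print("Var of differences at t=10, t=100:", D[:, 9].var(), D[:, 99].var())  # ~1, ~1
```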
Because of the relationship γ(h) = C(0) − C(h), statistical methods for
second-order stationary random fields can be cast in terms of the semivar-
iogram or the covariance function. Whereas statisticians are more familiar
with variances and covariances, many geostatisticians prefer the semivari-
ogram. Working with γ(h) compared to C(h) has definite advantages, in par-
ticular when estimating these functions based on observed data (see §4.2).
As parameters of the stochastic process under study, they can be viewed as
re-parameterizations of the second-order structure of the process and thus
as “equivalent.” If the process is intrinsically but not second-order stationary,
however, C(h) does not exist; one should then work with the
semivariogram γ(h).
The second moment structure of a weakly stationary random field is a
function of the spatial separation h, but the covariance function can depend
on the direction. In the absence of this direction-dependence, that is, when
the covariance function or the semivariogram depends only on the absolute
distance between points, the function is termed isotropic. If the random
field is second-order stationary with an isotropic covariance function, then
C(h) = C ∗ (||h||), where ||h|| is the Euclidean norm of the lag vector,

||(s + h) − s|| = ||h|| = √(h1² + h2²).

Similarly, if the semivariogram of an intrinsic (or weakly) stationary process
is isotropic, then γ(h) = γ ∗ (||h||). Note that C and C ∗ are two different
functions (as are γ and γ ∗ ). In the sequel we will not use the “star” notation,
however. It will be obvious from the context whether a covariance function or
semivariogram is isotropic or anisotropic.

Example 2.1 Realizations of an isotropic and anisotropic random field are
shown in Figure 2.1. The random field in the left-hand panel of the figure
exhibits geometric anisotropy, a particular kind of direction dependence of
the covariance structure (see §4.3.7). Under geometric anisotropy the variance
of the process is the same in all directions, but the strength of the spatial
autocorrelation is not. The realization in the left-hand panel was generated
with autocorrelations that are stronger in the East-West direction than in the
North-South direction. The realization in the right-hand panel has the same
covariance structure in all directions as the anisotropic model has in the
East-West direction.


Figure 2.1 Anisotropic (left) and isotropic (right) second-order stationary random
fields. Adapted from Schabenberger and Pierce (2002, Figure 9.11).

The weaker correlations in the North-South direction of the anisotropic field
create a realization that is less smooth in this direction. The values of the
random field change more slowly in the East-West direction. In the isotropic
field, the same smoothness of the process is achieved in all directions of the
domain. The concept of the smoothness of a spatial process is discussed in
greater detail in §2.3.
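The following sketch (assuming NumPy; the exponential correlation model and the anisotropy-ratio parameterization are illustrative choices, not the exact models used to generate Figure 2.1) evaluates a geometrically anisotropic correlation function by rescaling the lag vector before applying an isotropic model, so that correlation decays more slowly East–West than North–South while the variance is the same in all directions.

```python
import numpy as np

def exp_corr(d, alpha=3.0):
    """Isotropic exponential correlation R(d) = exp(-d / alpha)."""
    return np.exp(-d / alpha)

def geo_aniso_corr(h, alpha=3.0, ratio=3.0):
    """Geometric anisotropy: stretch the N-S (second) lag coordinate by `ratio`
    before computing the Euclidean norm, then apply the isotropic model."""
    h = np.asarray(h, dtype=float)
    d = np.sqrt(h[..., 0] ** 2 + (ratio * h[..., 1]) ** 2)
    return exp_corr(d, alpha)

h_ew = np.array([2.0, 0.0])   # lag of length 2 in the East-West direction
h_ns = np.array([0.0, 2.0])   # lag of length 2 in the North-South direction

print("isotropic:  ", exp_corr(2.0), exp_corr(2.0))                  # same in both directions
print("anisotropic:", geo_aniso_corr(h_ew), geo_aniso_corr(h_ns))    # weaker N-S correlation
```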

We close this section with an interesting and important question: “What
is a Gaussian random field?” The assumption of Gaussian (i.e., Normal) distributed
data is commonly made in the analysis of random samples in classical
statistics. First, we prefer the name Gaussian over Normal, since there
is really nothing normal about this distribution. Few data really follow this
distribution law. The central position of the Gaussian distribution for statis-
tical modeling and inference derives from such powerful results as the Central
Limit Theorem and its mathematical simplicity and elegance. But what does
it mean if the Gaussian label is attached to a spatial random field?
Example 2.2 A famous data set used in many spatial statistics texts is the
uniformity trial of Mercer and Hall (1911). On an area consisting of 20 × 25
experimental units a uniform wheat variety was planted and the grain yield
was recorded for each of the units. These are lattice data since a field plot is
a discrete spatial unit. We can identify a particular unit with a unique spatial
location, however, e.g., the center of the field plot. From the histogram and the
Q-Q plot of the data one might conclude that the data came from a Gaussian
distribution (Figure 2.2). That statement would ignore the spatial context of
the data and the random mechanism. From a random field perspective, the
500 observations represent a single observation from a 500-dimensional spatial
distribution.
Figure 2.2 Histogram of Mercer and Hall grain yields and normal QQ-plot.

A random field {Z(s) : s ∈ D ⊂ Rd} is a Gaussian random field (GRF), if
the cumulative distribution function

Pr(Z(s1) < z1, Z(s2) < z2, · · · , Z(sk) < zk)
is that of a k-variate Gaussian random variable for all k. By the properties
of the multivariate Gaussian distribution this implies that each Z(si ) is a
univariate Gaussian random variable. The reverse, unfortunately, is not true.
Even if Z(si ) ∼ G(µ(si ), σ 2 (si )), this does not imply that the joint distribution
of Z(s1 ), · · · , Z(sn ) is multivariate Gaussian. This leap of faith is all too often
made, however.
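A small numerical illustration of this point (assuming NumPy and SciPy; the sign-flip construction below is a standard counterexample, not taken from this text): each variable is marginally standard Gaussian, yet the pair is not jointly Gaussian, as the behavior of the sum reveals.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
c = 1.0

x = rng.normal(size=200_000)
y = np.where(np.abs(x) <= c, x, -x)     # flip the sign of x whenever |x| > c

# the margin of y is exactly standard Gaussian ...
print("KS test p-value:", stats.kstest(y, "norm").pvalue)   # typically large

# ... but (x, y) is not bivariate Gaussian: if it were, x + y would be Gaussian,
# yet x + y equals 0 exactly whenever |x| > c (a point mass at zero)
s = x + y
print("P(x + y == 0) ≈", np.mean(s == 0.0))   # about 2*(1 - Phi(1)) ≈ 0.32
```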
Just as the univariate Gaussian distribution is the magic ingredient of many
classical statistical methods, spatial analysis for Gaussian random fields is
more straightforward than for other cases. For example, the best linear unbi-
ased predictor for the attribute Z(s0) at an unobserved location s0 is, in general,
best only within this restricted class of predictors. If the random field is a GRF,
then these linear predictors turn out to be the best predictors (under squared
error loss) among all possible functions of the data (more on this in §5.2).
Second-order stationarity does not imply strict stationarity of a random field.
In a Gaussian random field, this implication holds.
The Gaussian distribution is often the default population model for con-
tinuous random variables in classical statistics. If the data are clearly non-
Gaussian, practitioners tend to go to great lengths to invoke transformations
that make the data look more Gaussian-like. In spatial statistics we need to
make the important distinction between the type of spatial data—connected
to the characteristics of the domain D—and the distributional properties of
the attribute Z being studied. The fact that the domain D is continuous, i.e.,
the data are geostatistical, has no bearing on the nature of the attribute as
discrete or continuous. One can observe the presence or absence of a disease in
a spatially continuous domain. The fact that D is discrete does not prevent
the attribute Z(s) at location s from following the Gaussian law.
Nor should continuity of the domain be construed as suggesting a Gaussian
random field.

2.3 Spatial Continuity and Differentiability

The partial derivatives of the random field {Z(s) : s ∈ D ⊂ Rd},

Ż(s) = ∂Z(s)/∂sj,
are random variables whose stochastic behavior contains important informa-
tion about the nature of the process; in particular its continuity. The more
continuous a spatial process, the smoother and the more spatially structured
are its realizations. Figure 2.3 shows realizations of four processes in R1 . The
processes have the same variance but increase in the degree of continuity from
top to bottom. For a given lag h, the correlation function R(h) of a highly
continuous process will be larger than the correlation function of a less contin-
uous process. As a consequence, neighboring values will change more slowly
(Figure 2.4).
The modeler of spatial data needs to understand the differences in con-
tinuity between autocorrelation models and the implications for statistical
inference. Some correlation models, such as the gaussian correlation model in
Figure 2.4d, are smoother than can be supported by physical or biolog-
ical processes. With increasing smoothness of the spatial correlation model,
statistical inference tends to be more sensitive to model mis-specification.
Figure 2.3 Realizations of four processes on a transect of length 50 that differ in the
degree of spatial continuity. More continuous processes have smoother realizations,
their successive values are more similar. This is indicative of higher spatial auto-
correlation for the same distance lag. Realization a) is that of a white noise process
(uncorrelated data), b)–d) are those of Gaussian processes with exponential, spheri-
cal, and gaussian covariance function, respectively. Adapted from Schabenberger and
Pierce (2002).

Practitioners might argue whether to model a particular realization with an
exponential (Figure 2.4b) or a spherical (Figure 2.4c) correlation model. The

estimates of the parameters governing the two models will differ for a partic-
ular set of data. The fitted correlation models may imply the same degree of
continuity.
We have focused on inferring the degree of continuity from the behavior of
the correlation (or covariance) model near the origin. This is intuitive, since
the near-origin behavior governs the lag interval for which correlations are
high. The theoretical reason behind this focus lies in mean square continuity
and differentiability of the random field. Consider a sequence of random variables
{Xn}. We say that {Xn} converges in mean square if there exists
a random variable X with E[X²] < ∞, such that E[(Xn − X)²] → 0. For a
spatial random field {Z(s) : s ∈ D ⊂ Rd} with constant mean and constant
variance, mean square continuity at s implies that

lim_{h→0} E[(Z(s) − Z(s + h))²] = 0.

Figure 2.4 Correlation functions of the spatial processes shown in Figure 2.3. The
more sharply the correlation function decreases from the origin, the less continuous
is the process.

Since E[(Z(s) − Z(s + h))²] = 2Var[Z(s)] − 2C(h) = 2(C(0) − C(h)) = 2γ(h),

we conclude from

lim_{h→0} E[(Z(s) − Z(s + h))²] = lim_{h→0} 2(C(0) − C(h))

that unless C(h) → C(0) as h → 0, the random field cannot be mean square
continuous at s. The random field will be mean square continuous if and only
if the covariance function is continuous at the origin. Mean square continuity
can thus be verified through the behavior of the covariance function near 0.
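The contrast between correlation models near the origin can be made concrete numerically. The sketch below (assuming NumPy; the exponential, spherical, and gaussian forms are common parameterizations chosen for illustration) shows that the gaussian model is flat at h = 0, consistent with mean square differentiability, whereas the exponential and spherical models fall off with a nonzero slope.

```python
import numpy as np

def exponential(h, a=3.0):
    return np.exp(-h / a)

def gaussian(h, a=3.0):
    return np.exp(-(h / a) ** 2)

def spherical(h, a=3.0):
    h = np.asarray(h, dtype=float)
    return np.where(h < a, 1.0 - 1.5 * h / a + 0.5 * (h / a) ** 3, 0.0)

models = [("exponential", exponential), ("spherical", spherical), ("gaussian", gaussian)]

# 1 - R(h) at small lags: how quickly correlation is lost near the origin
h = np.array([0.01, 0.1, 0.5])
for name, R in models:
    print(name, 1.0 - R(h))

# numerical slope at the origin: nonzero for exponential/spherical, ~0 for gaussian
eps = 1e-6
for name, R in models:
    print(name, "slope at 0 ≈", (R(eps) - R(0.0)) / eps)
```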
As we will see in §4.2.2, some processes appear to have a semivariogram
for which γ(h) → c ≡ const., as h → 0. While empirical data may sug-
gest that the semivariogram does not pass through the origin, explaining this
phenomenon, known as the nugget effect, to the purists’ satisfaction is an
entirely different matter. A process that exhibits a discontinuity at the ori-
gin cannot be mean square continuous. Mean square continuity is important
because some methods for spatial data analysis require the covariance func-
tion to be continuous. Examples are the spectral representation (§2.5) and
some “nonparametric” methods of covariance function estimation (§4.6). It
is argued on occasion that functions should not be considered as covariance
models for stochastic processes unless they are continuous. This excludes
models with nugget effect from consideration and reflects the sentiment that
only the study of mean square continuous processes is worthwhile.
Mean square continuity by itself does not convey much about the smooth-
ness of the process and how it is related to the covariance function. The
smoothness concept is brought into focus by studying the partial derivatives
of the random field. First, consider the special case of a weakly stationary
spatial process on a transect, {Z(s) : s ∈ D ⊂ R}, with mean µ and variance
σ 2 . Furthermore assume that data are collected at equally spaced intervals δ.
The gradient between successive observations is then

Ż = {Z(s + δ) − Z(s)}/δ,

with E[Ż] = 0 and variance

Var[Ż] = δ⁻²{Var[Z(s + δ)] + Var[Z(s)] − 2Cov[Z(s + δ), Z(s)]}
       = 2δ⁻²{σ² − C(δ)} ≡ σ̇².    (2.4)
For a second-order stationary random field, we know that C(0) = σ² and
hence [dC(δ)/dδ]_{δ=0} = 0. Additional details can be garnered from (2.4), because

C(δ) = σ² − (δ²/2)σ̇²    (2.5)
can be the limiting form (as δ → 0) of C(δ) only if σ̇ 2 is finite. As a consequence
of (2.5), the negative of the second derivative of C(δ) is the mean square
derivative σ̇ 2 ; the covariance function has a true maximum at the origin.
Notice that (2.4) can be written as σ̇ 2 = 2δ −2 {C(0) − C(δ)} = 2δ −2 γ(δ),
where γ(δ) is the semivariogram of the Z process. For the mean square deriva-
tive to be finite, the semivariogram cannot rise more quickly in δ than δ 2 .
This condition is known as the intrinsic hypothesis. It is, in fact, slightly
stronger than 2γ(δ)/δ² → const. as δ → ∞. A valid semivariogram must
satisfy 2γ(δ)/δ² → 0 as δ → ∞. For example, the power semivariogram model

γ(h; θ) = 0 for h = 0,    γ(h; θ) = θ1 + θ2||h||^θ3 for h ≠ 0,

is a valid semivariogram for an intrinsically stationary process only if 0 ≤
θ3 < 2.
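As a quick numerical check of this growth condition (assuming NumPy; θ1 = 0 and θ2 = 1 are arbitrary illustrative values), the ratio 2γ(h)/h² tends to zero for θ3 < 2 but not for θ3 = 2:

```python
import numpy as np

def power_semivariogram(h, theta1=0.0, theta2=1.0, theta3=1.5):
    """Power semivariogram: gamma(h) = theta1 + theta2 * |h|**theta3 for h != 0."""
    h = np.asarray(h, dtype=float)
    return np.where(h == 0, 0.0, theta1 + theta2 * np.abs(h) ** theta3)

h = np.array([1e1, 1e3, 1e5])
for theta3 in (0.5, 1.5, 2.0):
    ratio = 2 * power_semivariogram(h, theta3=theta3) / h**2
    print(f"theta3 = {theta3}: 2*gamma(h)/h^2 =", ratio)
# theta3 < 2: the ratio tends to 0 as h grows (admissible);
# theta3 = 2: the ratio stays at 2 (not admissible)
```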
For a general process Z(s) on R with covariance function C, define

Żh = {Z(s + h) − Z(s)}/h.

Stein (1999, Ch. 2.6) proves that Z(s) is mean square differentiable if and
only if the second derivative of C(h) evaluated at h = 0 exists and is finite.
In general, Z(s) is m-times mean square differentiable if and only if

[d^{2m}C(h)/dh^{2m}]_{h=0}

exists and is finite. The covariance function of d^m Z(s)/ds^m is

(−1)^m d^{2m}C(h)/dh^{2m}.

The smoothness of a spatial random field increases with the number of times
it is mean square differentiable. The gaussian covariance model

C(si − sj) = σ² exp{−3||si − sj||²/α²},    (2.6)

for example, is infinitely differentiable. A spatial random field with covariance
(2.6) is infinitely smooth. Stein (1999, p. 30) argues that such smoothness is
unrealistic for physical processes under normal circumstances.

2.4 Random Fields in the Spatial Domain

The representation

{ Z(s) : s ∈ D ⊂ Rd } (2.7)
is very general and reveals little about the structure of the random field under
study. To be applicable, the formulation must be cast within a framework
through which (i) statistical methods of analysis and inference can be derived,
and (ii) the properties of statistical estimators as well as the properties of the
random field itself can be studied. For second-order stationary random fields,
the core components of any formulation are the mean function E[Z(s)] = µ(s),
the covariance function C(h) = Cov[Z(s), Z(s + h)], and the properties of
the index set D (fixed continuous, fixed discrete, or random). Of the many
possible formulations that add structure to (2.7), we present two that structure
the random field in the spatial domain (§2.4.1 and §2.4.2), and the spectral
representation in the frequency domain (§2.5). The distinction between spatial
and spectral representation is coarsely whether Z(s) is expressed in terms of
functions of the observed coordinates s, or in terms of a random field X(ω)
that lives in a space consisting of frequencies.
Readers accustomed to traditional statistical modeling techniques such as
linear, nonlinear, and generalized linear models will find the model repre-
sentation in §2.4.1 most illustrative. Readers trained in the analysis of time
series data in the spectral domain might prefer the representation in §2.5. The
following discussion enunciates the relationships and correspondence between
the three formulations. They have specific advantages and disadvantages. The
model formulation will be the central representation for most of the remainder
of this text. We invoke the spectral representation when it is mathematically
more convenient to address an issue in the frequency domain, compared to
the spatial domain.
2.4.1 Model Representation

A statistical model is the mathematical representation of a data-generating
mechanism. It is an abstraction of the physical, biological, chemical, etc. pro-
cesses that generated the data; emphasizing those aspects of the process that
matter for the analysis, and ignoring (or down-weighing) the inconsequential
aspects. The most generic statistical models are a decomposition of a random
response variable into a mathematical structure describing the mean and an
additive stochastic structure describing variation and covariation among the
responses. This simple decomposition is often expressed symbolically as
Data = Structure + Error.
The decomposition is immediately applicable to random fields, in particular,
where we are concerned with their first- and second-moment structure. To
motivate, recall that intrinsic and weak stationarity require constancy of the
mean, E[Z(s)] = µ. If the mean of the random field changes with location,
then µ(s) is called the large-scale trend of the random field. It is, of course,
common to observe large-scale structure in data. By definition, the random
field will be non-stationary and much of our preceding discussion seems to
be called into question. What is the point of assuming stationarity, if its first
requirement—constancy of the mean—is typically not met? The idea is then
to not associate stationarity properties with the attribute Z(s), but with its
de-trended version. We can put
Z(s) = f(X, s, β) + e(s), (2.8)
where Z(s) = [Z(s1 ), · · · , Z(sn )]! , X is an (n × p) matrix of covariates, β is a
vector of parameters, and e(s) is a random vector with mean 0 and variance
Var[e(s)] = Σ(θ). The function f may be nonlinear, hence we need to define
what the vector f represents. The elements of f are stacked as follows:

f(X, s, β) = [f(x1, s1, β), f(x2, s2, β), · · · , f(xn, sn, β)]′.

It follows from (2.8) that E[Z(s)] = f(X, s, β) represents the large-scale trend,
the mean structure, of the spatial model. The variation and covariation of
Z(s) is represented through the stochastic properties of e(s). The stationarity
assumption is made for the error terms e(s) of the model, not for the attribute
Z(s). The zero mean assumption of the model errors is a reflection of our
belief that the model is correct on average. When modeling spatial data it is
important to recognize the random mechanism this averaging process appeals
to (see §2.1). The stationarity properties of the random field are reflected
by the structure of Var[e(s)] = Σ(θ). The entries of this covariance matrix
can be built from the covariance function C(h) of a second-order stationary
process. The dependence of Var[e(s)] on the vector θ is added because in
many applications the analyst explicitly parameterizes C(h) ≡ C(h, θ).
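To make the ingredients of (2.8) concrete, the following sketch (assuming NumPy; the planar trend and the exponential covariance are illustrative choices only) sets up a linear mean X(s)β in the coordinates, builds Σ(θ) from an exponential covariance function, and draws one realization of Z(s) = X(s)β + e(s).

```python
import numpy as np

rng = np.random.default_rng(3)

# n locations in the unit square and a planar trend mu(s) = b0 + b1*x + b2*y
n = 100
s = rng.uniform(0, 1, size=(n, 2))
X = np.column_stack([np.ones(n), s[:, 0], s[:, 1]])   # design matrix X(s)
beta = np.array([10.0, 2.0, -1.0])

# covariance matrix Sigma(theta) built from an exponential covariance function
sigma2, alpha = 1.5, 0.2                               # theta = (sigma2, alpha)
dist = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
Sigma = sigma2 * np.exp(-dist / alpha)

# one realization of Z(s) = X(s) beta + e(s), with e(s) having covariance Sigma(theta)
L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(n))      # small jitter for numerical stability
e = L @ rng.normal(size=n)
Z = X @ beta + e
print(Z[:5])
```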
We make various simplifications and alterations to the basic structure (2.8)
along the way. The large-scale structure will often be expressed as a linear
function of the spatial coordinates, E[Z(s)] = X(s)β. The design or regressor
matrix X is then a response surface model or other polynomial in the coor-
dinates, hence the dependence of X on s. The matrix X(s) can involve other
variables apart from location information as is the case in spatial regression
models (Chapter 6). The function f is often a monotonic, invertible function,
which enables us to extend generalized linear models theory to the spatial
context. In that case we may choose to model
Z(s) = f(x′(s)β) + e(s)
f⁻¹(E[Z(s)]) = x′(s)β.    (2.9)

The model formulation of a spatial process is useful, because it is reminiscent
of traditional linear or nonlinear statistical structures. It does not reveal, however,
which parameter component is most important to the modeler. Rarely is equal
weight given to the vector β of mean parameters and the vector θ of covariance
parameters. In spatial regression or analysis of variance models, more empha-
sis is placed on inferences about the mean function and θ is often considered a
nuisance parameter. In applications of spatial prediction, the covariance struc-
ture and the parameter values θ are critical, they “drive” the mean square
prediction error.
The model formulation (2.8) is sometimes termed a direct formulation be-
cause the covariance function of the random field is explicitly defined through
Σ(θ). In linear mixed model theory for clustered correlated data, for example,
you can distinguish correlation structures that are induced through hierarchi-
cal random effects from correlation structures that are parameterized explic-
itly (Schabenberger and Pierce, 2002, Ch. 7). For random field models, the
distinction between a direct specification of the covariance structure in (2.8)
and formulations in which the covariance function is incidental (induced) can
also be made. Induced covariance functions are typical in statistical models
for lattice data, in hierarchical models, and in some bivariate smoothing tech-
niques. Direct modeling of the covariance function is a frequent device for
geostatistical data.

2.4.1.1 Modeling Covariation in Geostatistical Data Directly

Consider a statistical model for the random field Z(s) with additive error
structure,
Z(s) = µ(s) + e(s),
where the errors have covariance function Cov[e(s), e(s + h)] = Ce (h). As
with other statistical models, the errors can contain more than a single com-
ponent. It is thus helpful to consider the following decomposition of the process
(Cressie, 1993, Ch. 3.1):

Z(s) = µ(s) + W(s) + η(s) + ε(s).    (2.10)
Although akin to a decomposition into sources of variability in an analysis of
variance model, (2.10) is largely operational. The individual components may
not be identifiable and/or separable. The mean function µ(s) is the large-
scale trend of the random field. In terms of (2.8), we parameterize µ(s) =
f (x, s, β). The remaining components on the right-hand side of (2.10) are
random processes. W (s) is termed smooth-scale variation; it is a stationary
process with covariance function CW (h) or semivariogram γW (h). The range
rW of this process, i.e., the lag distance beyond which points on the surface are
(practically) uncorrelated, is larger than the smallest lag distance observed in
the data. That is, rW > min{||si − sj||}, ∀ i ≠ j. The spatial autocorrelation
structure of the W (s) process can be modeled and estimated from the data.
The second process, η(s), is termed micro-scale variation by Cressie (1993,
p. 112). It is also a stationary process, but its range is less than min{||si −sj ||}.
The spatial structure of the process cannot be estimated from the observed
data. We can only hope to estimate Var[η(s)] = ση², but even this is not
without difficulties. The final random component, ε(s), denotes white noise
measurement error with variance Var[ε(s)] = σε². Unless there are true repli-
cations in the data—which is usually not the case—ση² and σε² cannot be
estimated separately. Many modelers thus consider spatial models of the form

Z(s) = µ(s) + W(s) + e*(s),

where e*(s) = η(s) + ε(s). Since the three random components in (2.10) are
typically independent, the variance-covariance matrix of Z(s) decomposes as

Var[Z(s)] = Σ(θ) = ΣW(θW) + Ση(θη) + Σε(θε).

With the decomposition (2.10) in place, we now define two types of models.
1. Signal Model. Let S(s) = µ(s) + W (s) + η(s) denote the signal of the
process. It contains all components which are spatially structured, either
through deterministic or stochastic sources. The decomposition Z(s) =
S(s) + ε(s) plays an important role in applications of spatial prediction.
There, it is the signal S(s) that is of interest to the modeler, not the noisy
version Z(s).
2. Mean Model. Let e(s) = W(s) + η(s) + ε(s) denote the error process of
the model and consider Z(s) = µ(s) + e(s). If the modeler focuses on the
mean function but needs to account for autocorrelation structure in the
data, the mean model is often the entry point for analysis. It is noteworthy
that e(s) contains different spatial error processes, some more structured
than others. The idea of W (s) and η(s) is to describe small- and microscale
stochastic fluctuations of the process. If one allows the mean function µ(s)
to be flexible, then a locally varying mean function can absorb some of
this random variation. In other words, “one modeler’s mean function is
another modeler’s covariance structure.” The early approaches to cope with
spatial autocorrelation in field experiments, such as trend surface models
and nearest-neighbor adjustments, attempted to model the mean structure
by adding local deterministic variation to the treatment structure to justify
a model with uncorrelated errors (see §6.1).

Cliff and Ord (1981, Ch. 6) distinguish between reaction and interac-
tion models. In the former, sites react to outside influences, e.g., plants react
to the availability of nutrients in the root zone. Since this availability varies
spatially, plant size or biomass will exhibit a regression-like dependence on nu-
trient availability. By this reasoning, nutrient-related covariates are included
as regressors in the mean function f (x, s, β). In an interaction model, sites
react not to outside influences, but react with each other. Neighboring plants,
for example, compete with each other for resources. Schabenberger and Pierce
(2002, p. 601) conclude that “when the dominant spatial effects are caused by
sites reacting to external forces, these effects should be part of the mean func-
tion [f (x, s, β)]. Interactive effects [. . .] call for modeling spatial variability
through the autocorrelation structure of the error process.”
The distinction between reactive and interactive models is not cut-and-
dried. Significant autocorrelation does not imply an interactive model over
a reactive one. Spatial autocorrelation can be spurious—if caused by large-
scale trends—or real—if caused by cumulative small-scale, spatially varying
components.

2.4.1.2 Modeling Covariation in Lattice Data Indirectly

When the spatial domain is discrete, the decomposition (2.10) is not directly
applicable, since the random processes W (s) and η(s) are now defined on a
fixed, discrete domain. They no longer represent continuous spatial variation.
As before, reactive effects can be modeled directly through the mean function
µ(s). To incorporate interactive effects, the covariance structure of the model
must be modified, however. One such modification gives rise to the simul-
taneous spatial autoregressive (SSAR) model. Let µ(si ) denote the mean of
the random field at location si . Then Z(si ) is thought to consist of a mean
contribution, contributions of neighboring sites, and random noise:
Z(si) = µ(si) + e(si)
      = µ(si) + Σ_{j=1}^{n} bij{Z(sj) − µ(sj)} + ε(si).    (2.11)

The coefficients bij in (2.11) describe the spatial connectivity of the sites.
If all bij = 0, the model reduces to a standard model with mean µ(si ) and
uncorrelated errors ε(si). The coefficients govern the spatial autocorrelation
structure, but not directly. The responses at sites si and sj can be correlated,
even if bij = 0. To see this, consider a linear mean function µ(si) = x(si)′β,
collect the coefficients into a matrix B = [bij], and write the model as

Z(s) = X(s)β + B(Z(s) − X(s)β) + ε(s)
ε(s) = (I − B)(Z(s) − X(s)β).
It follows that Var[Z(s)] = (I − B)⁻¹Var[ε(s)](I − B′)⁻¹. If Var[ε(s)] = σ²I,
then Var[Z(s)] = σ²(I − B)⁻¹(I − B′)⁻¹. Although we may have bij = 0, we
can have Cov[Z(si), Z(sj)] ≠ 0. Since B is a parameter of this model and con-
tains many unknowns, the modeler typically parameterizes the neighborhood
structure of the lattice model to facilitate estimation. A common choice is to
put B = ρW, where W is a user-defined spatial connectivity matrix and ρ is
a parameter to be estimated.
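A small numerical sketch of this induced covariance (assuming NumPy; the rook-neighbor weight matrix on a 4 × 4 grid and ρ = 0.2 are illustrative choices): although many entries of B = ρW are zero, the implied Var[Z(s)] has nonzero covariances between the corresponding sites.

```python
import numpy as np

# a 4 x 4 lattice with rook (edge-sharing) neighbors
nrow = ncol = 4
n = nrow * ncol
W = np.zeros((n, n))
for i in range(nrow):
    for j in range(ncol):
        k = i * ncol + j
        for di, dj in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            ii, jj = i + di, j + dj
            if 0 <= ii < nrow and 0 <= jj < ncol:
                W[k, ii * ncol + jj] = 1.0

rho, sigma2 = 0.2, 1.0
B = rho * W
A = np.linalg.inv(np.eye(n) - B)
V = sigma2 * A @ A.T            # Var[Z(s)] = sigma^2 (I - B)^{-1} (I - B')^{-1}

# sites 0 and 5 are not rook neighbors (b_{0,5} = 0), yet their covariance is nonzero
print("b_{0,5} =", B[0, 5], "  Cov[Z(s_0), Z(s_5)] =", V[0, 5])
```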
The spatial covariance structure of the lattice model is induced by the choice
and parameterization of the B matrix. It is thus modeled indirectly. A random
field representation that can be applied to discrete and continuous spatial
domains and also induces covariances is based on convolutions of random
noise with kernel functions.

2.4.2 Convolution Representation

The most important characteristic of a second-order stationary process to be
studied is its covariance structure. The mean, by definition of second-order
stationarity, is constant and the magnitude of the mean may be of interest to
the analyst. Beyond the simple problem to estimate the mean of the process,
the second-order properties of the process are of primary interest. For geo-
statistical data, these properties can be described by the covariance function,
the correlation function, or the semivariogram. For lattice data, the second-
order structure is modeled through a neighborhood connectivity matrix and
a parameterization of conditional or joint distributions of the data (see the
previous subsection for introductory remarks and §6.2.2 for details). For point
patterns, the first- and second-order properties of the process are described
by the first- and second-order intensity functions (see §3.4).
The statistical model representation (2.8) is useful for geostatistical and lat-
tice data. The second-order properties of the random field are explicit in the
variance-covariance matrix Σ(θ) of the model errors in geostatistical models
or implied by the connectivity matrix B in lattice models. The model repre-
sentation is convenient, because it has a familiar structure. It acknowledges
the presence of spatial autocorrelation, but not how the autocorrelation orig-
inated.
Autocorrelation is the result of small-scale, stochastically dependent ran-
dom innovations. Whereas random innovations at different locations are inde-
pendent, the attribute being finally observed is the result of a mixing process
that combines these innovations. This is the general idea behind the con-
volution representation of a stochastic process. It essentially relies on the
idea that correlated data can be expressed as linear combinations of uncor-
related data. Consider iid Bernoulli(π) random variables X1 , · · · , Xn . Then
.k .k+m
U = i=1 Xi is a Binomial(k,π) random variable and V = i=1 Xi is a
Binomial(k + m,π) random variable. Obviously, U and V are correlated be-
cause they share k observations, Cov[U, V ] = min(k, k + m)π(1 − π). This idea
58 SOME THEORY ON RANDOM FIELDS

can be generalized. Let Xi , (i = 1, · · · n), denote a sequence of independent
random variables with common mean µ and common variance σ². Define a
weight function K(i, j). For simplicity we choose a weight function such that
Σ_{i=1}^{n} K(i, j) = 1, but that is not a requirement. Correlated random variables
can be created by considering the weighted averages Yj = Σ_{i=1}^{n} K(i, j)Xi and
Yk = Σ_{i=1}^{n} K(i, k)Xi. Then

Cov[Yj, Yk] = Cov[Σ_{i=1}^{n} K(i, j)Xi, Σ_{m=1}^{n} K(m, k)Xm]
            = Σ_{i=1}^{n} Σ_{m=1}^{n} K(i, j)K(m, k) Cov[Xi, Xm]
            = σ² Σ_{i=1}^{n} K(i, j)K(i, k).    (2.12)

The covariance between two weighted averages is governed by the weight
functions.
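The following sketch (assuming NumPy; the particular weight function is arbitrary) verifies (2.12) by simulation: two weighted averages of the same independent variables are correlated, and their sample covariance agrees with σ² Σ_i K(i, j)K(i, k).

```python
import numpy as np

rng = np.random.default_rng(11)
n, sigma2 = 20, 1.0

# an arbitrary weight function K(i, j), standardized so each column sums to one
K = np.exp(-0.1 * (np.arange(n)[:, None] - np.arange(n)[None, :]) ** 2)
K = K / K.sum(axis=0)

j, k = 5, 8
theoretical = sigma2 * np.sum(K[:, j] * K[:, k])   # sigma^2 * sum_i K(i,j) K(i,k)

# Monte Carlo check: many replicates of X_1, ..., X_n and the two weighted averages
X = rng.normal(0.0, np.sqrt(sigma2), size=(100_000, n))
Yj = X @ K[:, j]
Yk = X @ K[:, k]
empirical = np.cov(Yj, Yk)[0, 1]

print("theoretical:", theoretical, " empirical:", empirical)
```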
To generate or represent a spatial random field {Z(s) : s ∈ D ⊂ Rd }, we
consider a white noise process X(s) such that E[X(s)] = µx, Var[X(s)] = σx²,
and Cov[X(s), X(s + h)] = Cx(h) = 0, h ≠ 0. The random field X(s) is
referred to as the excitation field. For geostatistical data with continuous
D, the random field Z(s) can be written in terms of the excitation field as

Z(s) = ∫_{all u} K(s − u)X(u) du = ∫_{all v} K(v)X(s + v) dv.    (2.13)

For a lattice process, integration is replaced with summation:

Z(s) = Σ_u K(s − u)X(u) = Σ_v K(v)X(s + v).    (2.14)

Formulas (2.13) and (2.14) resemble kernel smoothers in nonparametric
statistics. To evoke this parallel consider uncorrelated data Y1 , · · · , Yn , ob-
served at design points x1 < x2 < · · · < xn . The data are generated according
to some model
Yi = f (xi ) + ei ,
where the ei are uncorrelated errors with mean zero and variance σ 2 . A predic-
tion of the mean of Y at a particular point x0 can be obtained as a weighted
average of the Yi . Since the mean of Y changes with x it is reasonable to
give those observations at design points close to x0 more weight than obser-
vations at design points for which |x0 − xi | is large. Hence, a weight function
w(xi, x0, h) is considered, for example,

w(xi, x0, h) = exp{−(xi − x0)²/h²}.

The weights are often standardized to sum to one, i.e.,

K(xi, x0, h) = w(xi, x0, h) / Σ_{i=1}^{n} w(xi, x0, h),
and the predicted value at point x0 is

f̂(x0) = Σ_{i=1}^{n} K(xi, x0, h)Yi.    (2.15)

The predictor (2.15) is known as the Nadaraya-Watson estimator (Nadaraya,
1964; Watson, 1964). It is a convolution between a kernel function K(·) and
the data. The parameter h, known as the bandwidth, governs the smoothness
of the process. For larger h, the weights are distributed more evenly among
the observations, and the prediction function fˆ(x) will be smoother than for
small h. Comparing (2.15) with (2.13) and (2.14), the function K(s−u) can be
viewed as the kernel smoothing the white noise process X(s). Several results
follow immediately from the convolution representation (2.13).
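Before stating these results (Lemma 2.1 below), a minimal implementation of the kernel smoother (2.15) may help fix ideas (assuming NumPy; the test function, noise level, and bandwidths are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(5)

def nadaraya_watson(x0, x, y, h):
    """Kernel-weighted average (2.15) with a gaussian weight function."""
    w = np.exp(-((x - x0) ** 2) / h**2)
    return np.sum(w * y) / np.sum(w)

# noisy observations Y_i = f(x_i) + e_i on [0, 1]
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=x.size)

for h in (0.25, 0.05):
    fhat = np.array([nadaraya_watson(x0, x, y, h) for x0 in x])
    rough = np.mean(np.diff(fhat) ** 2)
    print(f"h = {h}: mean squared increment of the fitted curve = {rough:.5f}")
# larger h spreads weight more evenly and produces a smoother fitted function
```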

Lemma 2.1 If X(s) in (2.13) is a white noise process with mean µx and
variance σx², then under some mild regularity conditions,

(i) E[Z(s)] = µx ∫_u K(u) du;
(ii) Cov[Z(s), Z(s + h)] = σx² ∫_u K(u)K(u + h) du;
(iii) Z(s) is a weakly stationary random field.
Proof. The proof is straightforward and requires only standard calculus but it
is dependent on being able to exchange the order of integration (the regularity
conditions that permit application of Fubini’s theorem). To show (iii), it is
sufficient to establish that E[Z(s)] and Cov[Z(s), Z(s + h)] do not depend on
s. Provided the order of integration can be exchanged, tackling (i) yields

E[Z(s)] = ∫_X ∫_v K(v)X(s − v) dv F(dx)
        = ∫_v K(v) ∫_X X(s − v) F(dx) dv
        = µx ∫_v K(v) dv.

To show (ii) assume that µx = 0. The result holds in general for other values
of µx. Then,

Cov[Z(s), Z(s + h)] = ∫_v ∫_t K(v)K(t) E[X(s − v)X(s + h − t)] dv dt
                    = ∫_v ∫_t K(v)K(t) Cx(h + v − t) dv dt.

Since X(s) is a white noise random field, only those terms for which h + v − t =
0 need to be considered and the double integral reduces to

Cov[Z(s), Z(s + h)] = σx² ∫_v K(v)K(h + v) dv.    (2.16)
Since the mean and covariance function of Z(s) do not depend on spatial
location, (iii) follows. This completes the proof of the lemma. Similar results,
replacing integration with summation, can be established in the case of a
discrete domain (2.14).

Example 2.3
To demonstrate the effect of convolving white noise we consider an excita-
tion process X(s) on the line and two kernel functions: a gaussian kernel, for
which the bandwidth h corresponds to the standard deviation of the kernel, and
a uniform kernel, whose width was chosen so that its standard deviation also
equals h. Figure 2.5 shows the realization of the excitation field and the
realizations of the convolution (2.13) for bandwidths h = 0.1, 0.05, 0.025. For a
given bandwidth, convolving with the uniform kernel produces realizations less
smooth than those with the gaussian kernel; the uniform kernel distributes
weights evenly. For a particular kernel function, the smoothness of the process
increases with the bandwidth.


Figure 2.5 Convolutions of Gaussian white noise with gaussian (solid line) and uni-
form (dotted line) kernel functions. The bandwidth h corresponds to the standard
deviation of the gaussian kernel. The width of the uniform kernel was chosen to
have the same standard deviation as the gaussian kernel. The jagged line represents
the realization of the excitation field.

The two kernel functions lead to distinctly different autocorrelation functions
(Figure 2.6). For the uniform kernel, autocorrelations decrease linearly
with the lag. Once the lag exceeds the window width, the correlation
function is exactly zero. The gaussian kernel leads to a correlation
function known as the gaussian model,

C(h) = exp{−h²/α²},
which is infinitely smooth (see §2.3). The slow decline of the correlation func-
tion with increasing lag—as compared to the correlation function under the
uniform kernel—leads to smooth realizations (see Figure 2.5).


Figure 2.6 Correlation functions determined according to (2.16) for the convolu-
tions shown in Figure 2.5. The bandwidth h corresponds to the standard deviation
of the gaussian kernel. The width of the uniform kernel was chosen to have standard
deviation equal to that of the gaussian kernel.
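The correlation functions in Figure 2.6 can be reproduced in principle by evaluating (2.16) numerically. The sketch below (assuming NumPy; simple gridded summation replaces the exact integrals) computes R(h) = ∫K(u)K(u + h) du / ∫K(u)² du for a gaussian and a uniform kernel with matching standard deviation h = 0.10.

```python
import numpy as np

def induced_corr(kernel, lags, v):
    """Numerically evaluate R(h) = int K(u)K(u+h)du / int K(u)^2 du, cf. (2.16)."""
    dv = v[1] - v[0]
    Kv = kernel(v)
    c0 = np.sum(Kv * Kv) * dv
    return np.array([np.sum(Kv * kernel(v + h)) * dv / c0 for h in lags])

bw = 0.10                                    # standard deviation of both kernels
gauss = lambda u: np.exp(-u**2 / (2 * bw**2))
half_width = np.sqrt(3.0) * bw               # uniform kernel with the same standard deviation
unif = lambda u: (np.abs(u) <= half_width).astype(float)

v = np.linspace(-1.0, 1.0, 4001)             # integration grid
lags = np.array([0.0, 0.1, 0.2, 0.35, 0.5])

print("gaussian kernel:", induced_corr(gauss, lags, v))
print("uniform kernel: ", induced_corr(unif, lags, v))
# the gaussian kernel yields a gaussian-type correlation function; the uniform
# kernel yields a correlation that decays linearly and is exactly zero once the
# lag exceeds the full support of the kernel
```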

The convolution representation of a spatial process is useful in several regards.
• It describes a correlated process in terms of the excitation of a latent white
noise process and defines the covariance structure of the process indirectly.
Instead of asking what parametric model Σ(θ) describes the spatial depen-
dency structure, one can reformulate the question and ask which convolu-
tion kernel K(s − u, γ) could have given rise to the data. The estimation of
the covariance parameter θ has been supplanted by the estimation of the
kernel parameters γ.
• By choosing the convolution kernel, a wide range of covariance structures
can be modeled.
• Convolution techniques can be used to derive classes of covariance functions
  and semivariogram models based on moving averages that provide more
  flexibility than parametric models (see §4.6.2).
• Convolution techniques are appealing to model non-stationary processes
(see §8.3.2). Two basic approaches are to
– let the convolution kernel K(u) depend on spatial location, i.e., K(u) =
Ks (u). For example, the kernel function for a process in R2 may be the
bivariate Gaussian density, where the correlation parameter and vari-
ances are a function of s (Higdon, 1998; Higdon, Swall, and Kern, 1999).
– use a location-invariant kernel function to convolve processes X(s) that
exhibit spatial correlation (Fuentes, 2001).
• The representation (2.13) suggests a method for simulating spatial random
fields based on convolving white noise. By choosing kernels with

∫_u K(u) du = 1,    ∫_u uK(u) du = 0,
the mean and variance of the generated random field Z(s) can be directed
from the mean and variance of the excitation field X(s). Such a method
of simulating spatial data holds promise for generating discrete attributes
Z for which valid correlation models are not necessarily obvious, and for
which valid joint distributions—from which to sample otherwise—are often
intractable (see §7.4).

2.5 Random Fields in the Frequency Domain

Although common in the study of time series, frequency domain methods
are not (yet) widely used in spatial statistics. One obstacle is probably the
superficially more formidable mathematical treatment; another is unfamil-
iarity with the interpretation of key quantities such as the spectral density
and the periodogram. Hopefully, the exposition that follows will allay these
perceptions. The text in the following two subsections draws heavily on the
excellent discussion in Priestley (1981, Ch. 4) and can only give the very ba-
sics of spectral analysis. Other texts that provide most readable introductions
into spectral analysis (with focus on time series) are Bloomfield (1976) and
Chatfield (1996). Both frequency-domain and space-domain representations
of random fields are discussed in detail in the text by Vanmarcke (1983). Our
exposition is an amalgam of these resources.

2.5.1 Spectral Representation of Deterministic Functions

Representing functions through their frequency content has a long history in
physical and engineering sciences. To see how the ideas connect to our study
of random processes in the plane, we need to step back for a moment and
consider deterministic functions. A (deterministic) function f (s) is periodic if
f (s) = f (s + ip) for any s and i = · · · , −2, −1, 0, 1, 2, · · ·. If no p exists for
which this relationship holds for all s, then f (s) is said to be non-periodic.
Otherwise, the smallest p for which f (s) = f (s + ip), ∀s is called the period
of the function. For example, cos(x) is periodic with period 2π. Provided that
the periodic function f (s) with period 2p is absolutely integrable over [−p, p],
f(s) can be written as a Fourier series

f(s) = (1/2)a0 + Σ_{n=1}^{∞} {an cos(πns/p) + bn sin(πns/p)}.    (2.17)

The condition of absolute integrability guarantees that the Fourier coefficients
a0 , a1 , · · · , b0 , b1 , · · · are finite. These coefficients are given by (Priestley, 1981,
p. 194)

an = (1/p) ∫_{−p}^{p} f(s) cos(πns/p) ds
bn = (1/p) ∫_{−p}^{p} f(s) sin(πns/p) ds.
In physical applications the energy dissipated by a process in a particular
time interval is of importance. For the deterministic, periodic function with
period 2p, the total energy dissipated over the (time) interval (−p, p) is a
function of the Fourier coefficients. Following Priestley (1981, Ch. 4.4) define
c0 = a0/2 and ci = {0.5(ai² + bi²)}^{1/2}, i = 1, 2, · · ·. Then the total energy
over (−p, p) is 2p Σ_{i=0}^{∞} ci². The power is defined as the energy per unit time,
Σ_{i=0}^{∞} ci². The importance of the spectral representation for such functions lies
in the fact that the coefficients ci² yield the contribution to the power from
the term in the Fourier series at frequency i/(2p). The power spectrum is
obtained by graphing ci² against i/(2p). For periodic functions this spectrum
is discrete.
Most (deterministic) functions are not periodic, and adjustments must be
made. We are still interested in the energy and power distribution, but the
power spectrum will no longer be discrete; there will be power at all fre-
quencies. In a sense, moving from Fourier analysis of periodic to non-periodic
functions is somewhat akin to the changes incurred in switching the study of
probability from discrete to continuous random variables. The switch is made
by representing the function not as a Fourier series, but a Fourier integral,
provided the function satisfies some additional regularity conditions. If g(s) is
a non-periodic, deterministic function, then, provided g(s) decays to zero as
s → ∞ and s → −∞, and
∫_{−∞}^{∞} |g(s)| ds < ∞,

it can be represented as (Priestley, 1981, Ch. 4.5)

g(s) = ∫_{−∞}^{∞} {p(f) cos(2πf s) + q(f) sin(2πf s)} df,    (2.18)
where f denotes frequencies. It is convenient to change to angular frequencies


ω = 2πf and to express (2.18) as a function of ω. Notice that there is no
consistency in the literature in this regard. Some authors prefer frequencies,
others prefer angular frequencies. In the expressions that follow, the latter is
more appealing in our opinion, since the terms in complex exponentials such
as iωs are easier to read than i2πf s.
The functions p(f) and q(f) in (2.18) are defined, unfortunately, through
the inverse transformations,

p(ω) = ∫_{−∞}^{∞} g(s) cos(ωs) ds,    q(ω) = ∫_{−∞}^{∞} g(s) sin(ωs) ds.

A more convenient formulation of these relationships is through a Fourier pair
using complex exponentials (i ≡ √−1):

g(s) = (2π)^{−1/2} ∫_{−∞}^{∞} G(ω) exp{iωs} dω    (2.19)

G(ω) = (2π)^{−1/2} ∫_{−∞}^{∞} g(s) exp{−iωs} ds.    (2.20)

An important identity can now be established, namely

∫_{−∞}^{∞} g²(s) ds = ∫_{−∞}^{∞} |G(ω)|² dω.    (2.21)

This shows that |G(ω)|² represents the density of energy contributed by components
of g(s) in the vicinity of ω. This interpretation is akin to the coefficients
ci² in the case of periodic functions, but now we are concerned with
a continuous distribution of energy over frequencies. Again, this makes the
point that studying frequency properties of non-periodic functions vs. peri-
odic functions bears resemblance to the comparison of studying probability
mass and density functions for discrete and continuous random variables.
Once the evolution from deterministic-periodic to deterministic-nonperiodic
functions to realizations of stochastic processes is completed, the function
G(ω) will be a random function itself and its average will be related to the
density of power across frequencies.
One important special case of (2.19) occurs when the function g(s) is real-
valued and even, g(s) = g(−s), ∀s, since then the complex exponential can
be replaced with a cosine function and does not require complex mathemat-
ics. The even deterministic function of greatest importance in this text is the
covariance function C(h) of a spatial process. But this mathematical conve-
nience is not the only reason for our interest in (2.19) when g(s) is a covariance
function. It is the function that takes the place of G(ω) when we consider ran-
dom processes rather than deterministic functions that is of such importance
then, the spectral density function.
2.5.2 Spectral Representation of Random Processes

At first glance, the extension of Fourier theory from deterministic functions
to random processes appears to be immediate. After all, it was argued in §2.1
that the random processes of concern generate random functions. A single
realization is then simply a function of temporal and/or spatial coordinates
to which Fourier methods can be applied. This function—because it is the
outcome of a random experiment—is, of course, most likely not periodic and
one would have to consider Fourier integrals rather than Fourier series. The
conditions of the previous subsection that permit the representation (2.19)
are not necessarily met, however. First, it is required that the random process
be second-order stationary. Consider a process Z(s) on the line (s ∈ D ⊂ R1 ).
The notion of a “self-replicating” process implied by stationarity is clearly
at odds with the requirement that Z(s) → 0 as s → ∞ and s → −∞. How
can the process decay to zero for large and small s while maintaining the
same behavior in-between? The second problem is that we observe only a
single realization of the stochastic process. While we can treat this as a single
function in the usual sense, our interest lies in describing the behavior of
the process, not only that of the single realization. The transition from non-
periodic deterministic functions to realizations of stationary random processes
thus requires two important changes: (i) to recast the notion of energy and
power spectra, (ii) to consider expectation operations in order to move from
properties of a single realization to properties of the underlying process.
The solution to the first dilemma is to truncate the realization Z(s) and to
consider

Z̃(s) = Z(s) for −S ≤ s ≤ S,    Z̃(s) = 0 for s > S or s < −S,

instead. Now Z̃(s) “decays” to zero as |s| → ∞, and a Fourier integral can be
applied, provided the condition of absolute integrability is also met:

Z̃(s) = (2π)^{−1/2} ∫_{−∞}^{∞} G̃(ω) exp{iωs} dω    (2.22)

G̃(ω) = (2π)^{−1/2} ∫_{−S}^{S} Z(s) exp{−iωs} ds.    (2.23)
Unfortunately, |G̃(ω)|2 cannot be viewed as the energy density of Z(s), since
(i) the truncation points ±S were arbitrary and (ii) consideration of the limit
of |G̃(ω)|2 as S → ∞ is not helpful. In that case we could have started with
a Fourier pair based on Z(s) in the first place, which is a problem because
the stationary process does not decay to 0. Although the process has infinite
energy on (−∞, ∞) it is possible that the power (energy per unit length) limit

lim_{S→∞} (2S)⁻¹|G̃(ω)|²    (2.24)

is finite (∀ω). The problem of a non-decaying realization is consequently tack-
led by focusing on the power of the function, rather than its energy. The
conditions under which the limit (2.24) is indeed finite are surprisingly re-
lated to the rate at which the covariance function C(h) = Cov[Z(s), Z(s + h)]
decays with increasing h; the continuity of the process. The problem of infer-
ring properties of the process from a single realization is tackled by considering
the expectation of (2.24). If it exists, the function

s(ω) = lim_{S→∞} E[(2S)⁻¹|G̃(ω)|²]    (2.25)
is called the (power) spectral density function of the random process. We
now establish the relationship between spectral density and covariance func-
tion for a process on the line because it yields a more accessible formulation
than (2.25). The discussion will then extend to processes in Rd .

2.5.3 Covariance and Spectral Density Function

Two basic approaches that connect the covariance function


C(h) = Cov[Z(s), Z(s + h)]
to the spectral density function s(ω) are as follows.
(i) The realization Z(s) is represented as a linear combination of sinusoids
with random amplitudes and random phase angles. The covariance C(h)
at lag h can then be expressed as a linear combination of variances at
discrete frequencies and cosine terms. The variances are approximated as
a rectangle, the width of which corresponds to a frequency interval, the
height corresponds to the spectral mass. Upon taking the limit the spectral
density function emerges.
(ii) The energy density |G̃(ω)|² is expressed in terms of the convolution

∫_{−∞}^{∞} Z̃(s)Z̃(s − h) ds,

which leads to (2S)⁻¹|G̃(ω)|² as the Fourier transform of the sample covari-
ance function. Taking expected values and the limit in (2.25) establishes
the relationship between s(ω) and C(h). This is the approach considered
in Priestley (1981, pp. 212–213).
The development that follows considers approach (i) and is adapted from
the excellent discussion in Vanmarcke (1983, Ch. 3.2–3.4). We commence by
focusing on random processes in R1 . The extensions to processes in Rd are
immediate, the algebra more tedious. The extensions are provided at the end
of this subsection.

2.5.3.1 Spectral Representation for Processes in R1

Consider the second-order, real-valued stationary process {Z(s) : s ∈ D ⊂ R1 }
with E[Z(s)] = µ, Var[Z(s)] = σ 2 . The random function Z(s) can be expressed
as a sum of 2K sinusoids

Z(s) = µ + Σ_{j=−K}^{K} Yj(s) = µ + Σ_{j=−K}^{K} Aj cos(ωj s + φj).    (2.26)

The Aj are random amplitudes and the φj are random phase angles dis-
tributed uniformly on (0, 2π); j = 1, · · · , K. All Aj and φj are mutually inde-
pendent. The Yj (s) are thus zero mean random variables because
   (2π)^{−1} ∫_0^{2π} cos(a + φ) dφ = 0

and φ_j ⊥ A_j. The frequencies are defined as ω_j = ±[∆ω (2j − 1)/2].


To determine the covariance function of the Z process from (2.26) we use in-
dependence, simple geometry, and make use of property (iv) of the covariance
function in §2.2: Cov[Z(s), Z(s + h)] = Cov[Z(0), Z(h)]. This leads to
   C(h) = Σ_{j=−K}^{K} E[A_j cos(φ_j) A_j cos(ω_j h + φ_j)]

        = Σ_{j=−K}^{K} (1/2) E[A_j²] cos(ω_j h) = Σ_{j=−K}^{K} σ_j² cos(ω_j h).        (2.27)
It follows from (2.27) that Var[Z(s)] = C(0) = Σ_{j=−K}^{K} σ_j². The variance σ_j²
at each discrete frequency equals one-half of the average squared amplitude,
and the variability of the Z process is being distributed over the discrete
frequencies ω_j, (j = −K, · · · , −1, 1, · · · , K). The variance σ_j² can be thought
of as the area of a rectangle centered at ω_j, having width ∆ω and height s(ω_j):

   σ_j² = s(ω_j) ∆ω.

We think of s(ω_j) as the spectral mass at frequency ω_j. In the limit, as ∆ω → 0
and K → ∞, (2.27) becomes
   C(h) = ∫_{−∞}^{∞} cos(ωh) s(ω) dω.

Using the Euler relationship exp{iωh} = cos(ωh) + i sin(ωh), the covariance
can also be written as

   C(h) = ∫_{−∞}^{∞} exp{iωh} s(ω) dω,        (2.28)

and the spectral density emerges as the Fourier transformation of C(h):

   s(ω) = (1/(2π)) ∫_{−∞}^{∞} C(h) exp{−iωh} dh.        (2.29)

The relationship between spectral density and covariance function as a


Fourier pair is significant on a number of levels.
• In §2.5.2 we introduced s(ω) as a limit of a scaled and averaged energy
density and claimed that this is the appropriate function through which to
study the power/energy properties of a random process in the frequency
domain. Provided s(ω) exists! Since C(h) is a non-periodic deterministic
function, s(ω) exists when C(h) satisfies the conditions in §2.5.1 for (2.20):
C(h) must decay to zero as h → ∞ and must be absolutely integrable,
∫_{−∞}^{∞} |C(h)| dh < ∞.
• Bochner’s theorem states that for every continuous nonnegative function
C(h) with finite C(0) there corresponds a nondecreasing function dS(ω)
such that (2.28) holds. If dS(ω) is absolutely continuous then dS(ω) =
s(ω)dω (see §2.5.4 for the implications if S(ω) is a step-function). It is
thus necessary that the covariance function is continuous which disallows
a discontinuity at the origin (see §2.3). Bochner’s theorem further tells us
that C(h) is positive-definite if and only if it has representation (2.28). This
provides a method for constructing valid covariance functions for stochastic
processes. If, for example, a function g(h) is a candidate for describing the
covariance in a stochastic process, then, if g(h) can be expressed as (2.28)
it is a valid model. If g(h) does not have this representation it should
not be considered. The construction of valid covariance functions from the
spectral representation is discussed (for the isotropic case) in §4.3.1 and for
spatio-temporal models in §9.3.
• Since the spectral density function is the Fourier transform of the covariance
function, this suggests a simple method for estimating s(ω) from data.
Calculate the sample covariance function at a set of lags and perform a
Fourier transform. This is indeed one method to calculate an estimate of
s(ω) known as the periodogram (see §4.7.1).
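To make this estimation route concrete, the following Python sketch (an illustration only; the simulated process, lag range, and variable names are our assumptions, and the formal periodogram of §4.7.1 is treated later) computes sample autocovariances on a regularly spaced transect and applies the discrete analogue of (2.29).

import numpy as np

rng = np.random.default_rng(42)

# Simulate a stationary process on a transect of n equally spaced sites as a
# moving average of white noise (an arbitrary choice for this illustration).
n = 512
noise = rng.normal(size=n + 10)
z = np.convolve(noise, np.ones(11) / 11, mode="valid")      # length n

# Biased (divide-by-n) sample autocovariances C_hat(h) at lags h = 0, ..., max_lag
zc = z - z.mean()
max_lag = 50
c_hat = np.array([np.sum(zc[:n - h] * zc[h:]) / n for h in range(max_lag + 1)])

# Discrete analogue of (2.29), folding negative lags into a cosine sum because
# C(h) is even: s_hat(w) = (1/(2 pi)) [C_hat(0) + 2 sum_h C_hat(h) cos(w h)]
omega = np.linspace(0.0, np.pi, 200)
lags = np.arange(1, max_lag + 1)
s_hat = np.array([c_hat[0] + 2.0 * np.sum(c_hat[1:] * np.cos(w * lags))
                  for w in omega]) / (2.0 * np.pi)

print(s_hat[:5])    # estimated spectral mass near the zero frequency

In practice the raw autocovariances (or the resulting ordinates) are usually smoothed; the sketch only illustrates the Fourier-pair relationship.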

2.5.3.2 Extensions to Rd

The extensions of spectral representations to random fields of higher di-


mension are now given. To ease the transition, one additional representa-
tion of a stochastic process is introduced: in terms of a linear combination
of a complex-valued white noise random field in the frequency domain. Let
{Z(s) : s ∈ D ⊂ R1 } be a (real-valued) second-order stationary random field
with variance Var[Z(s)] = σ 2 . Without loss of generality the mean of Z(s) is
assumed to be zero.
By the spectral representation theorem a real-valued random process Z(s)
with mean 0 can be represented as
   Z(s) = ∫ exp{iωs} dX(ω).

Where is the connection of this representation to (2.26)? The components


exp{iωs} = cos(ωs) + i sin(ωs) play the same role as the cosine terms in
(2.26), except now we include sine components. These components are as-
sociated with the infinitesimal frequency interval dω and are amplified by a
complex random amplitude dX(ω). These random components have very spe-

cial properties. Their mean is zero and they are uncorrelated, representing a
white noise process in the frequency domain. Because of their uncorrelatedness
the covariance function can now be written as (X̄ denotes the complex conjugate)

   C(h) = Cov[Z(0), Z(h)] = E[ ∫_{Ω1} dX̄(ω1) ∫_{Ω2} exp{iω2h} dX(ω2) ]

        = ∫ exp{iωh} E[|dX(ω)|²].

Because of (2.28) it follows that E[|dX(ω)|2 ] = s(ω)dω.


This setup allows a direct extension to processes in Rd . Let {Z(s) : s ∈
D ⊂ Rd } and let X(ω) be an orthogonal process—a process with zero mean
and independent increments, possibly complex-valued—with E[|dX(ω)|2 ] =
s(ω)dω. Then the process Z(s) can be expressed in terms of X(ω) as

   Z(s) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} exp{iω′s} dX(ω)        (2.30)
and has covariance function

   Cov[Z(s), Z(s + h)] = Cov[Z(0), Z(h)] = E[Z(0)Z(h)]

      = E[ ∫_{−∞}^{∞} exp{iω′0} dX̄(ω) ∫_{−∞}^{∞} exp{iω′h} dX(ω) ]

      = E[ ∫_{−∞}^{∞} dX̄(ω) ∫_{−∞}^{∞} exp{iω′h} dX(ω) ]

      = ∫_{−∞}^{∞} exp{iω′h} E[|dX(ω)|²]

      = ∫_{−∞}^{∞} exp{iω′h} s(ω) dω,
−∞

where the integrals are understood to be d-dimensional. Since C(h) is an even


function, C(h) = C(−h), if Z(s) is real-valued we can also write

   C(h) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} cos(ω′h) s(ω) dω.

The covariance function C(h) and the spectral density function s(ω) form a
Fourier transform pair,

   C(h) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} exp{iω′h} s(ω) dω        (2.31)

   s(ω) = (1/(2π)^d) ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} exp{−iω′h} C(h) dh.        (2.32)

Notice that this is an application of the inversion formula for characteristic


functions. C(h) is the characteristic function of a d-dimensional random vari-
able whose cumulative distribution function is S(ω). The continuity of the
characteristic function disallows spectral representations of stochastic pro-

cesses whose covariance functions are discontinuous at the origin, whether the
process is in R1 or in Rd .
Also notice the similarity of (2.30) to the convolution representation (2.13).
Both integrate over a stochastic process of independent increments. The con-
volution representation operates in the spatial domain, (2.30) operates in
the frequency domain. This correspondence between convolution and spectral
representation can be made more precise through linear filtering techniques
(§2.5.6). But first we consider further properties of spectral density functions.
Up to this point we have tacitly assumed that the domain D is continuous.
Many stochastic processes have a discrete domain and lags are restricted to
an enumerable set. For example, consider a process on a rectangular r ×c row-
column lattice. The elements of the lag vectors h = [h1, h2]′ consist of the set
{(h1, h2) : h1, h2 = 0, ±1, ±2, · · ·}. The first modification to the previous formu-
las is that integration in the expression for s(ω) is replaced by summation.
Let ω = [ω1, ω2]′. Then,

   s(ω) = (1/(2π)²) Σ_{h1=−∞}^{∞} Σ_{h2=−∞}^{∞} C(h) cos(ω1h1 + ω2h2).

The second change is the restriction of the frequency domain to [−π, π], hence
   C(h) = ∫_{−π}^{π} ∫_{−π}^{π} cos(ω′h) s(ω) dω.

Note that we still assume that the spectral density is continuous, even if the
spatial domain is discrete. Continuity of the domain and continuity of the
spectral distribution function dS(ω) are different concepts.

2.5.4 Properties of Spectral Distribution Functions

Assume that the stochastic process of concern is a real-valued, second-order


stationary spatial process in R2 with continuous domain. Both lags and fre-
quencies range over the real line and the relevant Fourier pair is
   C(h) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} cos(ω′h) s(ω) dω

   s(ω) = (1/(2π)²) ∫_{−∞}^{∞} ∫_{−∞}^{∞} cos(ω′h) C(h) dh.

If s(ω) exists for all ω we can introduce the integrated spectrum

   S(ω) = S(ω1, ω2) = ∫_{−∞}^{ω1} ∫_{−∞}^{ω2} s(ϑ1, ϑ2) dϑ1 dϑ2.        (2.33)

Several interesting properties of the integrated spectrum can be derived


• S(∞, ∞) = Var[Z(s)]. This follows directly because C(0) = Var[Z(s)] ≡
σ². The variance of the process thus represents the total power contributed
by all frequency components, and s(ω)dω is interpreted as the contribution
to the total power (variance) of the process from components in Z(s) with
frequencies in (ω1, ω1 + dω1) × (ω2, ω2 + dω2).
• S(−∞, −∞) = 0.
• S(ω) is a non-decreasing function of ω. That is, if ω2 > ω1 and ϑ2 > ϑ1 ,
then
S(ω2 , ϑ2 ) − S(ω2 , ϑ1 ) − S(ω1 , ϑ2 ) + S(ω1 , ϑ1 ) ≥ 0.
In the one-dimensional case this simply states that if ω2 > ω1 then S(ω2 ) ≥
S(ω1 ). This property follows from the fact that s(ω1 , ω2 ) ≥ 0 ∀ω1 , ω2 , which
in turn is easily seen from (2.25).
The integrated spectrum behaves somewhat like a bivariate distribution
function, except it does not integrate to one. This is easily remedied by in-
troducing the normalized integrated spectrum F (ω) = σ −2 S(ω). This
function shares several important properties with the cumulative distribution
function of a bivariate random vector:

• 0 ≤ F(ω) ≤ 1.
• F(−∞, −∞) = 0 and F(∞, ∞) = 1.
• F(ω) is non-decreasing (in the sense established above).

Because of these properties, F(ω) is also called the spectral distribution func-
tion and the label spectral density function is sometimes attached to

   f(ω1, ω2) = ∂²F(ω1, ω2) / (∂ω1 ∂ω2),
if f (ω1 , ω2 ) exists. Since we have already termed s(ω) the spectral density
function (sdf) and f (ω) = s(ω)/σ 2 , we call f (ω) the normalized sdf (nsdf) to
avoid confusion. The sdf and covariance function form a Fourier pair; similarly,
the nsdf and the correlation function R(h) = C(h)/C(0) = Corr[Z(s), Z(s +
h)] form a Fourier pair:
   R(h) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} cos(ω′h) f(ω) dω        (2.34)

   f(ω) = (1/(2π)²) ∫_{−∞}^{∞} ∫_{−∞}^{∞} cos(ω′h) R(h) dh.        (2.35)

The study of spectral distributions is convenient because (i), the integrated


functions S(ω) and F (ω) have properties akin to cumulative distribution func-
tions, and (ii), properties of the spectral functions teach us about properties
of the covariance (correlation) function and vice versa. Concerning (i), con-
sider integrating F(ω) over individual coordinates in the frequency domain;
this will marginalize the spectral distribution function. In R², for example,
F1(ω) = ∫_{−∞}^{∞} F(ω, ϑ2) dϑ2 is the spectral distribution function of the ran-
dom field Z(s1 , s2 ) for a fixed coordinate s2 . The variance of the process in

the other coordinate can be obtained from the marginal spectral density as
F1(∞) = ∫_{−∞}^{∞} F(∞, ϑ2) dϑ2. Since

   f(ω) = ∂^d F(ω1, · · · , ωd) / (∂ω1 · · · ∂ωd),

the spectral densities can also be marginalized,

   f1(ω) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f(ω, ω2, · · · , ωd) dω2 · · · dωd.

Concerning (ii), it is easy to establish that


• for a real valued process C(h) is an even function, and as a result, the sdf
is also an even function, s(ω) = s(−ω);
• if C(h) is reflection symmetric so that C(h1 , h2 ) = C(−h1 , h2 ), the spectral
density function is also reflection symmetric, s(ω1 , ω2 ) = s(−ω1 , ω2 ). Con-
versely, reflection symmetry of s(ω) implies reflection symmetry of C(h).
The proof is straightforward and relies on the evenness of C(h), s(ω), and
cos(x);
• a random field in R2 is said to have a separable covariance function C(h),
if C(h) = C(h1 , h2 ) ≡ C1 (h1 )C2 (h2 ) for all h1 , h2 , where C1 (·) and C2 (·)
are valid covariance functions. In this case the sdf also factors, s(ω1 , ω2 ) =
s1 (ω1 )s2 (ω2 ). Similarly, if the spectral density function can be factored as
s(ω1 , ω2 ) = g(ω1 )h(ω2 ), then the covariance function is separable.

2.5.5 Continuous and Discrete Spectra

The relationship between F (ω) and f (ω) is reminiscent of the relationship


between a cumulative distribution function and a probability density function
(pdf) for a continuous random variable. In the study of random variables,
F (y) = Pr(Y ≤ y) always exists, whereas the existence of the pdf requires
absolute continuity of F (y). Otherwise we are led to a probability mass func-
tion p(y) = Pr(Y = y) of a discrete random variable (unless the random
variable has a mixed distribution). A similar dividing line presents itself in
the study of spectral properties. If F (ω) is absolutely continuous, then f (ω)
exists. This in turn implies that the covariance function is absolutely integrable,
i.e., C(h) must decrease quickly enough to 0 as the elements of h grow in absolute value.
If the covariance function does not diminish quickly enough, F (ω) exists but
f (ω) may not. The apparent difficulty this presents in the Fourier expressions
can be overcome by representing the transforms as Fourier-Stieltjes integrals,
made possible by the celebrated Wiener-Khintchine theorem,
   R(h) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} exp{iω′h} dF(ω).

Priestley (1981, pp. 219–222) outlines the proof of the theorem and establishes
the connection to Bochner’s theorem.

Although this representation accommodates both continuous and non-con-


tinuous normalized integrated spectra, it does not convey readily the impli-
cations of these cases. Consider a process in R1 . If F (ω) is a step function
with jumps pi at an enumerable set of frequencies Ω = {ω1 , ω2 , · · ·}, where
Σ_i p_i = 1, then the spectrum is said to be discrete. In this case

   dF(ω)/dω = { ∞,  ω ∈ Ω
              { 0,   ω ∉ Ω.

We can still conceive of a normalized spectral “density” function by using a
Dirac delta function:

   f(ω) = Σ_i p_i δ(ω − ω_i),

where

   δ(x) = { ∞,  x = 0
          { 0,   x ≠ 0.
The spectral density of a process with discrete spectrum has infinite peaks at
the frequencies ω1 , ω2 , · · ·, and is zero elsewhere. Processes with discrete spec-
trum have strictly periodic components and the spikes of infinite magnitude
correspond to the frequencies of periodicity.
If dF (ω)/dω = f (ω) exists for all ω, then F (ω) is continuous and so is
the nsdf. This type of spectrum is encountered for non-periodic functions. If a
continuous spectrum exhibits a large—but finite—spike at some frequency ω ∗ ,
this indicates that the process has near-periodic behavior at that frequency.
When examining continuous spectral densities it is helpful to remember
that when a function g(x) is relatively flat, its Fourier transform will be highly
compressed and vice versa. If the autocorrelation function drops off sharply
with increasing lag, the spectral density function will be rather flat. If high
autocorrelations persist over a long range of lags, the spectral density function
will be narrow and compressed near the zero frequency.
Consider a process in R1 with correlation function
Corr[Z(s), Z(s + h)] = exp{−3|h|/α} α > 0.
This correlation function is known as the exponential correlation model, its
genesis is more fully discussed in Chapter 4. Here we note that the parameter
α represents the practical range of the process, i.e., the lag distance at which
the correlations have diminished to (approximately) 0.05. The larger α, the
higher the correlations at a particular lag distance (Figure 2.7a). Substituting
into (2.35) the nsdf can be derived as

   f(ω) = (1/(2π)) ∫_{−∞}^{∞} cos(ωh) exp{−3|h|/α} dh

        = (1/π) ∫_0^{∞} cos(ωh) exp{−3h/α} dh

        = (3/π) α / (9 + ω²α²).

The spectral density function for a process with α small is flatter than the
sdf for a process with α large (Figure 2.7b).

[Figure 2.7: panel (a) shows the correlation function R(h) against lag h; panel (b) shows f(ω)/f(0) against ω, each for α = 1, 2, 5, 10.]

Figure 2.7 Autocorrelation function and spectral density functions for processes with
exponential correlation structure and different ranges. Panel b displays f (ω)/f (0) to
amplify differences in shape, rather than scale.
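The closed form above is easy to check numerically. The short script below is a sketch (the function names and quadrature settings are ours, not from the text): it evaluates the derived nsdf, confirms the integral (1/π)∫_0^∞ cos(ωh) exp{−3h/α} dh by a simple Riemann sum, and prints the ratio f(ω)/f(0) = 9/(9 + ω²α²) that Figure 2.7b displays.

import numpy as np

def f_exp(omega, alpha):
    # Closed form derived above: f(w) = (3/pi) * alpha / (9 + w^2 alpha^2)
    return (3.0 / np.pi) * alpha / (9.0 + omega**2 * alpha**2)

def f_numeric(omega, alpha, h_max=200.0, m=200_000):
    # Riemann-sum check of (1/pi) * int_0^inf cos(w h) exp(-3 h / alpha) dh
    h = np.linspace(0.0, h_max, m)
    dh = h[1] - h[0]
    return np.sum(np.cos(omega * h) * np.exp(-3.0 * h / alpha)) * dh / np.pi

w = 2.0
for alpha in (1.0, 2.0, 5.0, 10.0):
    print(alpha, f_exp(w, alpha), f_numeric(w, alpha),
          f_exp(w, alpha) / f_exp(0.0, alpha))   # last value equals 9/(9 + w^2 alpha^2)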

2.5.6 Linear Location-Invariant Filters

At the beginning of §2.5 we noted the similarity between the spectral repre-
sentation (2.30) of the field Z(s) and the convolution representation (2.13).
The correspondence can be made more precise by considering linear filter-
ing techniques. Wonderful expositions of linear filtering and its applications
to spectral analysis of random fields can be found in Thiébaux and Pedder
(1987, Ch. 5.3) and in Percival and Walden (1993, Ch.5). Our comments are
based on these works.
A linear filter is defined as a linear, location-invariant transformation of
one variable to another. The input variable is the one to which the linear
transformation is applied, the resulting variable is called the output of the
filter. We are interested in applying filtering techniques to random fields, and
denote the linear filtering operation as
Z(s) = L{Y (s)}, (2.36)

to depict that Z(s) is the output when the linear filter L is applied to the
process Y(s). A linear filter is defined through the properties
(i) For any constant a, L{aY(s)} = aL{Y(s)};
(ii) If Y1 (s) and Y2 (s) are any two processes, then L{Y1 (s) + Y2 (s)} =
L{Y1 (s)} + L{Y2 (s)};
(iii) If Z(s) = L{Y (s)} for all s, then Z(s + h) = L{Y (s + h)}.

The linear filter is a scale-preserving, linear operator [(i) and (ii)] and is not
affected by shifts in the origin (iii). It is easy to establish that the convolution
(2.13) is a linear filter with input X(s) (Chapter problems). It would be more
appropriate to write the linear filter as L{Y (·)} = Z(·) since it associates a
function with another function. The notation we choose here suggests that
the filter associates a point with a point and should be understood to imply
that the filter maps the function defined on a point-by-point basis by Y (s) to
the function defined on a point-by-point basis by Z(s).
The excitation field of a convolution has a spectral representation, provided
it is mean square continuous. Hence,
   X(s) = ∫ exp{iω′s} U(dω),        (2.37)

where U (ω) is an orthogonal process in the frequency domain and dω is an


infinitesimal region in that space. Combining (2.37) with (2.13), we obtain
   Z(s) = ∫_v K(v) X(s − v) dv = ∫_v K(v) dv ∫_ω exp{iω′(s − v)} U(dω)

        = ∫_ω [ ∫_u K(u) exp{iω′u} du ] exp{iω′s} U(dω)

        = ∫_ω H(ω) exp{iω′s} U(dω) = L{X(s)}.        (2.38)


The complex function H(ω) is called the frequency response function


or the transfer function of the linear system. As the covariance function
and the spectral density function, the kernel function K(s) and the transfer
function form a Fourier transform pair:

   H(ω) = ∫_u K(u) exp{iω′u} du

   K(s) = (1/(2π)^d) ∫_ω H(ω) exp{−iω′s} dω.

Since X(s) has a spectral representation, so does Z(s). Given an orthogonal


process P(ω) for which

   Z(s) = ∫_ω exp{iω′s} P(dω),

we see from (2.38) that the two frequency domain processes must be related,
P (dω) = H(ω)U (dω).
As a consequence, the spectral densities of the Z and X processes are also
related. Since E[|P (dω)|2 ] = sz (ω)dω and E[|U (dω)|2 ] = sx (ω)dω, we have
sz (ω) = |H(ω)|2 sx (ω). (2.39)
The spectral density function of the Z process is the product of the squared
modulus of the transfer function and the spectral density function of the input
field X(s). This result is important because it enables us to construct one
spectral density from another, provided the transfer function and the spectral
density function of either Z or X are known, for example

   sx(ω) = |H(ω)|^{−2} sz(ω).
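A small numerical illustration of (2.39): for a discrete three-point moving-average kernel (an assumed example, with the transfer function written as a sum in place of the integral and with the sign convention for H(ω) used above), the output spectrum is the squared modulus of the transfer function times a flat white-noise input spectrum.

import numpy as np

k = np.array([0.25, 0.5, 0.25])      # kernel weights at shifts u = -1, 0, 1 (assumed)
u = np.array([-1, 0, 1])

omega = np.linspace(0.0, np.pi, 201)
# Discrete analogue of H(w) = sum_u K(u) exp{i w u}
H = np.array([np.sum(k * np.exp(1j * w * u)) for w in omega])

s_x = np.ones_like(omega)            # flat input spectrum (white noise, scaled to 1)
s_z = np.abs(H)**2 * s_x             # output spectrum by (2.39)

print(s_z[0], s_z[-1])               # full power at w = 0, complete attenuation at w = pi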

The transfer function relates the spectra of the input to a linear filter in
a simple fashion. It is noteworthy that the relationship is independent of
location. We can learn more about the transfer function H(ω). Consider the
one-dimensional case for the moment. Since spectral representations involve
complex exponentials, consider the function εω (s) ≡ exp {iωs} for a particular
frequency ω. When εω (s) is processed by a linear location-invariant filter, we
can write the output yω (s + h) as
yω (s + h) = L{εω (s + h)} = L{exp{iωh}εω (s)}
= exp{iωh}L{εω (s)} = exp{iωh}yω (s),
for any shift h. We have made use here of the scale preservation and the
location invariance properties of the linear filter. Since the result holds for
any s, it also holds for s = 0 and we can write for the output of the filter
yω (h) = exp{iωh}yω (0).
Since the shift h can take on any value we also have
yω (s) = exp{iωs}yω (0) = εω (s)H(ω).
Is it justified to call the coefficient of exp{iωs} the transfer function? Let a
realization of X(s) be given by
   x(s) = ∫_{−∞}^{∞} exp{iωs} u(dω).

Then L{x(s)} = ∫_{−∞}^{∞} exp{iωs} H(ω) u(dω) and

   L{X(s)} = ∫_{−∞}^{∞} exp{iωs} H(ω) U(dω),

which is (2.38). This result is important, because it allows us to find the


transfer function of a filter by inputting the sequence defined by exp{iωs}.

Example 2.4 Consider a discrete autoregressive time-series of first order,


the AR(1) process,
Xt = αXt−1 + et ,
where the et are independent innovations with mean 0 and variance σ 2 , and α
is the autoregressive parameter, a constant. Define the linear filter L{Xt } = et ,
so that L{ut } = ut − αut−1 for a sequence {ut }. To find the transfer function
of the filter, input the sequence {exp{iωt}}:
L{exp{iωt}} = exp{iωt} − α exp{iω(t − 1)} = exp{iωt}(1 − α exp{−iω}).
The transfer function of the filter is H(ω) = 1 − α exp{−iω} since this is the
coefficient of exp{iωt} on the filter output. But the output of the filter was et ,
a sequence of independent, homoscedastic random innovations. The spectral
densities se (ω) and sX (ω) are thus related by (2.39) as
se (ω) = |H(ω)|2 sX (ω),
and thus

   sX(ω) = σ² / |1 − α exp{−iω}|².
Linear filtering and the specific results for filtering complex exponentials made
it easy to find the spectral density function of a stochastic process, here, the
AR(1) process.
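A minimal numerical check of this example (the parameter value and variable names are assumptions): feed the sequence exp{iωt} through the filter L{u_t} = u_t − αu_{t−1} and recover the coefficient of exp{iωt}, which should equal 1 − α exp{−iω}.

import numpy as np

alpha, omega = 0.6, 1.3                  # assumed AR parameter and frequency
t = np.arange(1, 200)
e_seq = np.exp(1j * omega * t)           # input sequence exp{i w t}

# Apply the filter L{u_t} = u_t - alpha * u_{t-1}
out = e_seq[1:] - alpha * e_seq[:-1]

H_empirical = out / e_seq[1:]            # coefficient of exp{i w t} in the output
H_theory = 1.0 - alpha * np.exp(-1j * omega)
print(np.allclose(H_empirical, H_theory), H_theory)

# Spectral density of the AR(1) process at this frequency (with sigma^2 = 1)
print(1.0 / np.abs(H_theory)**2)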

2.5.7 Importance of Spectral Analysis

The spectral representation of random fields may appear cumbersome at first;


mathematics in the frequency domain require operations with complex-valued
(random) variables and the interpretation of the spectrum is not (yet) entirely
clear. That the spectral density function distributes the variability of the
process over the frequency domain is appealing, but “so what?” The following
are some reasons for the increasing importance of spectral methods in spatial
data analysis.
• Mathematical proofs and derivations are often simpler in the frequency
domain. The skillful statistician working with stochastic processes switches
back and forth between the spatial and the frequency domain, depending
on which “space” holds greater promise for simplicity of argument and
derivation.
• The spectral density function and the covariance function of a stationary
stochastic process are closely related; they form a Fourier transform pair.
On this ground, studying the second-order properties of a random field via
the covariance function or the spectral density can be viewed as equivalent.
However,
– the spectral density and the covariance are two different but comple-
mentary representations of the second-order properties of a stochastic

process. The covariance function emphasizes spatial dependency as a


function of coordinate separation. The spectral density function empha-
sizes the association of components of variability with frequencies. From
the covariance function we glean the degree of continuity and the decay
of spatial autocorrelation with increasing point separation. From the
spectral density we glean periodicity in the process;
– it is often difficult in practice to recognize the implications of different
covariance structures for statistical inference from their mathematical
form or from a graph of the functions alone. Processes that differ sub-
stantially in their stochastic properties can have covariance functions
that appear rather similar when graphed. The spectral density function
can amplify and highlight subtle differences in the second-order structure
more so than the covariance functions.
• The spectral density function—as the covariance function—can be esti-
mated from data via the periodogram (see §4.7.1). Computationally this
does not provide any particular challenges beyond computing the sample
covariances, at least if data are observed on a grid. Summary statistics
calculated from data in the spatial domain are usually correlated. This
correlation stems either from the fact that the same data point Z(si ) is
repeatedly used in multiple summaries, and/or from the spatial autocorre-
lation. The ordinates of the periodogram, the data-based estimate of the
spectral density function, are—at least asymptotically—independent and
have simple distributional properties. This enables you to construct test
statistics with standard properties.
• The derivation of the spectral representation and its ensuing results requires
mean-square continuous, second-order stationary random fields. Studying
second-order properties of random fields in the spatial domain often re-
quires, in addition, isotropy of the process. An example is the study of
spatial dependence in point patterns (see §3.4). The K-function due to
Ripley (1976) is a useful device to study stochastic dependence between
random events in space. Many arguments favor the K-function approach,
probably most of all, interpretability. It does, however, require isotropy. Es-
tablishing whether a point pattern is isotropic or anisotropic in the spatial
domain is tricky. A spectral analysis requires only second-order stationar-
ity and the stochastic dependence among events can be gleaned from an
analysis of the pattern’s spectra. In addition, the spectral analysis allows
a simple test for anisotropy.

2.6 Chapter Problems

Problem 2.1 Let Y (t) = Y (t − 1) + e(t) be a random walk with indepen-


dent innovations e(t) ∼ (0, σ²). Show that the process is not second-order
stationary, but that the differences D(t) = Y (t) − Y (t − 1) are second-order
stationary.

Problem 2.2 For γ(si − sj ) to be a valid semivariogram for a spatial process


Z(s), it must be conditionally negative-definite, i.e.,
   Σ_{i=1}^{m} Σ_{j=1}^{m} a_i a_j γ(s_i − s_j) ≤ 0,

for any number of sites s1, · · · , sm and real numbers a1, · · · , am with Σ_{i=1}^{m} a_i = 0.
Why is that?

Problem 2.3 A multivariate Gamma random field can be constructed as


follows. Let Xi , i = 1, · · · , n, be independent Gamma(αi , β) random variables
so that E[Xi ] = αi β and Var[Xi ] = αi β 2 . Imagine a regularly spaced transect
consisting of Z(si ) = X0 + Xi , i = 1, · · · , n. Find the covariance function
C(i, j) = Cov[Z(si ), Z(sj )] and the correlation function. Is this a stationary
process? What is the distribution of the Z(si )?

Problem 2.4 Consider the convolution representation (2.13) of a white noise


excitation field for a random process on the line. If the convolution kernel is
the Gaussian kernel

   K(s − u, h) = (1/√(2πh²)) exp{ −(1/2)(s − u)²/h² },
where h denotes the bandwidth of the kernel, find an expression for the au-
tocorrelation function.

Problem 2.5 The convolution representation (2.13) of a random field can


be used to simulate random fields. Assume you want to generate a random
field Z(s) in which the data points are autocorrelated, Cov[Z(s), Z(s + h)] =
C(h), but marginally, the Z(s) should be “Poisson-like.” That is, you want
E[Z(s)] = λ, Var[Z(s)] = λ, and the Z(s) should be non-negative. Find a
random field X(s) that can be used as the excitation field. Hint: Consider X
such that Var[X] = E[X]θ, for example, a negative binomial random variable.

Problem 2.6 For a non-periodic, deterministic function g(s), prove (2.21).

Problem 2.7 Consider a random field in the plane with covariance func-
tion C(h) = C(h1 , h2 ), where h1 and h2 are the coordinate shifts in the x-
and y-directions. The covariance function is called separable if C(h1 , h2 ) =
C1 (h1 )C2 (h2 ). Here, C1 is the covariance function of the random process Z(x)
along lines parallel to the x-axis. Show that separability of the covariance
function implies separability of the spectral density.

Problem 2.8 Establish that the convolution (2.13) is a linear, location-invariant
filter.

Problem 2.9 Establish that the derivative L{Z(s)} = dZ(s)/ds = W (s) is


a linear filter. Use this result to find the spectral density of W in terms of the
spectral density of Z.
CHAPTER 3

Mapped Point Patterns

3.1 Random, Aggregated, and Regular Patterns

The realization of a point process {Z(s) : s ∈ D ⊂ R2 } consists of an ar-


rangement (pattern) of points in the random set D. These points are termed
the events of the point process. If the events are only partially observed, the
recorded pattern is called a sampled point pattern. When all events of a real-
ization are recorded, the point pattern is said to be mapped. In this chapter
we consider mapped point patterns in R1 (line processes) and R2 (spatial
processes).
Since D is a random set, the experiment that generates a particular real-
ization can be viewed as a random draw of locations in D at which events are
observed. From this vantage point, all mapped point patterns are realizations
of random experiments and the coarse distinction of patterns into (completely)
random, clustered (spatially aggregated), and regular ones should not lead to
the false impression that the latter two types of patterns are void of a random
mechanism. A point pattern is called a completely random pattern if the
following criteria are met. The average number of events per unit area—the
intensity λ(s)—is homogeneous throughout D, the number of events in two
non-overlapping subregions (Borel sets) A1 and A2 are independent, and the
number of events in any subregion is Poisson distributed. Thus, events dis-
tribute uniformly and independently throughout the domain. The mathemat-
ical manifestation of complete spatial randomness is the homogeneous Poisson
process (§3.2.2). It is a process void of any spatial structure and serves as the
null hypothesis for many statistical investigations into point patterns. Ob-
served point patterns are tested initially against the hypothesis of a complete
spatial random (CSR) pattern. If the CSR hypothesis is rejected, then the
investigator often follows up with more specific analyses that shed additional
light on the nature of the spatial point pattern.
Diggle (1983) calls the homogeneous Poisson process an “unattainable stan-
dard.” Most processes deviate from complete spatial randomness in some fash-
ion. Events may be independent in non-overlapping subregions, but the in-
tensity λ(s) with which they occur is not homogeneous throughout D. More
events will then be located in regions where the intensity is large, fewer events
will be located in regions where λ(s) is small. Events may occur with a con-
stant (average) intensity λ(s) ≡ λ but exhibit some form of interaction. The
presence of an event can attract or repel other events nearby. Accordingly,

deviations from the CSR pattern are coarsely distinguished as aggregated


(clustered) or regular patterns (Figure 3.1). Since the reason for this devi-
ation may be deterministic spatial variation of the intensity function λ(s)
and/or the result of a stochastic element, we do not associate any mecha-
nism with clustering or regularity yet. Different point process models achieve
a certain deviation from the CSR pattern in different ways (see §3.7). At this
point we distinguish random, regular, and clustered patterns simply on the
basis of their relative patterns. In a clustered pattern, the average distance
between an event si and its nearest-neighbor event is smaller than the same
average distance in a CSR pattern. Similarly, in a regular pattern the average
distance between an event and its nearest neighbor is larger than expected
under complete spatial randomness.

[Figure 3.1: three point patterns of n = 100 events on the unit square, panels (a), (b), and (c).]

Figure 3.1 Realizations of a completely random pattern (a), a Poisson cluster process
(b), and a process with regularity (sequential inhibition, c). All patterns have n = 100
events on the unit square.

Most observed patterns are snapshot realizations of a process that evolves in


time. The temporal evolution affects the degree to which events are aggregated
or clustered. A naturally regenerating oak stand, for example, will exhibit a
highly clustered arrangement as newly established trees are close to the old
trees from which acorns fell. Over time, differences in micro-site conditions
lead to a (random) thinning of the trees. The spatial distribution appears less
clustered. As trees grow older, competition leads to more regular patterns
when trees have established the needed growing space.

3.2 Binomial and Poisson Processes

The homogeneous Poisson process (HPP) is the stochastic representation of


complete spatial randomness, the point process equivalence of signal-free white
noise. Statistical tests for comparing an observed point pattern against that
expected from a HPP thus play a central role in point pattern analysis. In
practical implementations of these tests, observed patterns are usually com-
pared against a Binomial process. The connection between the Poisson and
the Binomial process lies in the conditioning on the number of events. In sim-
ulation studies the number of events in simulated patterns is typically held
fixed to equal the number of events in the observed pattern.

3.2.1 Bernoulli and Binomial Processes

Let νd(D) denote the Lebesgue measure of D ⊂ Rd. If d = 1, then νd(D) is the


length of an interval, in R2 it measures the area of D and in R3 the volume.
We will refer to ν(A) simply as the volume of the Borel set A. If a single event
s is distributed in D such that Pr(s ∈ A) = ν(A)/ν(D) for all sets A ⊂ D,
this process containing a single point is termed a Bernoulli process. It is a
rather uninteresting process but if n Bernoulli processes are superposed to
form a process of n events in D the resulting process is much more interesting
and is termed a Binomial point process. Notice that we are following the
same logic as in classical statistics, where the Binomial experiment is defined
as n independent and identical Bernoulli experiments with common success
probability.
Point processes can be studied through either the stochastic properties of
the event locations, or through a counting measure. The latter is often more
intuitive, but the former is frequently the representation from which methods
of simulating realizations of a point process model can be devised. If Z(s) is
a Binomial point process, then
   Pr(s1 ∈ A1, · · · , sn ∈ An) = ν(A1) · . . . · ν(An) / ν(D)^n
for subregions A1 , · · · , An in D. In terms of a counting measure, let N (A)
denote the number of events in the (Borel) set A ⊂ D. In a Binomial process
N (A) is a Binomial random variable with sample size n = N (D) and success
probability π(A) = ν(A)/ν(D).
The (first-order) intensity λ(s) of a spatial point process measures the
average number of events per unit area (volume). The intensity is defined as
a limit since it is considered a function of points in D on an area basis. Let
ds denote an infinitesimal area (disc) in Rd centered at s. Then the limit

   λ(s) = lim_{ν(ds)→0} E[N(ds)] / ν(ds)
is the first-order intensity of the point process Z(s). With the Binomial pro-
cess, the average number of events in region A is simply nπ(A), and for any
Borel subset A of D

   λ(s) = lim_{ν(ds)→0} nπ(ds)/ν(ds) = lim_{ν(ds)→0} [nν(ds)/ν(D)]/ν(ds) = n/ν(D) ≡ λ.

Since the first-order intensity does not change with spatial location, the Bi-
nomial process is a homogeneous (or uniform) process.
Points in non-overlapping subregions are not independent, however. Since
the total number of events in D is fixed, m events in A necessarily implies
n−m events in D\A. Because of the correlation between the number of events
in disjoint subregions, a Binomial process is not a completely spatial random
process. It is a very important point process, however, for testing observed
patterns against the CSR hypothesis. Whereas a CSR pattern is the result of
a homogeneous Poisson process, in Monte Carlo tests of the CSR hypothesis
one usually conditions the simulations to have the same number of events
as the observed pattern. Conditioning a homogeneous Poisson process on the
number of events yields a Binomial process.
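Simulating a Binomial process is straightforward: place n independent uniform locations in D. The sketch below (the domain, subregion, and names are assumed for illustration) also checks empirically that N(A) behaves like a Binomial(n, ν(A)/ν(D)) count.

import numpy as np

rng = np.random.default_rng(1)
n = 100                                    # number of events (as in Figure 3.1)

def binomial_process(n, rng, a=1.0, b=1.0):
    # n independent uniform event locations in the rectangle [0, a] x [0, b]
    return np.column_stack((rng.uniform(0, a, n), rng.uniform(0, b, n)))

# Subregion A = [0, 0.5] x [0, 0.5], so pi(A) = nu(A)/nu(D) = 0.25
counts = np.array([np.sum((s[:, 0] <= 0.5) & (s[:, 1] <= 0.5))
                   for s in (binomial_process(n, rng) for _ in range(2000))])

print(counts.mean(), n * 0.25)             # close to n * pi(A)
print(counts.var(), n * 0.25 * 0.75)       # close to n * pi(A) * (1 - pi(A))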

3.2.2 Poisson Processes

There are many types of Poisson processes with relevance to spatial statis-
tics. Among them are the homogeneous Poisson process, the inhomogeneous
Poisson process, the Poisson cluster process, and the compound Poisson pro-
cess. A process is referred to as the Poisson process if it has the following two
properties:
(i) If N(A) denotes the number of events in subregion A ⊂ D, then N(A) ∼
Poisson(λν(A)), where 0 < λ < ∞ denotes the constant intensity function
of the process;
(ii) If A1 and A2 are two disjoint subregions of D, then N (A1 ) and N (A2 )
are independent.
Stoyan, Kendall, and Mecke (1995, p. 33) call (ii) the “completely random”
property. It is noteworthy that property (ii) follows from (i) but that the
reverse is not true. The number of events in A can be distributed as a Poisson
variable with a spatially varying intensity, but events can remain independent
in disjoint subsets. We consider the combination of (i) and (ii) as the definition
of complete spatial randomness. A point process that satisfies properties (i)
and (ii) is called a homogeneous Poisson (or CSR) process.
If the intensity function λ(s) varies spatially, property (i) is not met, but (ii)
may still hold. A process of this kind is the inhomogeneous Poisson process
(IPP). It is characterized by the following properties.
(i) If N(A) denotes the number of events in subregion A ⊂ D, then N(A) ∼
Poisson(λ(A)), where 0 < λ(s) < ∞ is the intensity at location s and
λ(A) = ∫_A λ(s) ds;

(ii) If A1 and A2 are two disjoint subregions of D, then N (A1 ) and N (A2 )
are independent.
The HPP is obviously a special case of the IPP where the intensity is constant.
Stoyan et al. (1995) refer to the HPP as the stationary Poisson process and
label the IPP the general Poisson process. Stationarity of point processes is
explored in greater detail in §3.4. We note here that stationarity implies (at
least) that the first-order intensity of the process is translation invariant which
requires that λ(s) ≡ λ. The inhomogeneous Poisson process is a non-stationary
point process.
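Both Poisson processes are easy to simulate. The sketch below uses only assumed names and an assumed intensity surface; thinning a dominating homogeneous process (retaining an event at s with probability λ(s)/λ_max) is a standard simulation device and is used here purely for illustration.

import numpy as np

rng = np.random.default_rng(7)

def hpp(lam, a, b, rng):
    # Homogeneous Poisson process on [0, a] x [0, b]:
    # draw N ~ Poisson(lam * a * b), then place N uniform locations
    n = rng.poisson(lam * a * b)
    return np.column_stack((rng.uniform(0, a, n), rng.uniform(0, b, n)))

def ipp(lam_fun, lam_max, a, b, rng):
    # Inhomogeneous Poisson process by thinning a dominating homogeneous process
    s = hpp(lam_max, a, b, rng)
    keep = rng.uniform(size=s.shape[0]) < lam_fun(s) / lam_max
    return s[keep]

lam_fun = lambda s: 50.0 + 150.0 * s[:, 0]    # intensity increasing in x (assumed form)
print(hpp(100.0, 1.0, 1.0, rng).shape, ipp(lam_fun, 200.0, 1.0, 1.0, rng).shape)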

3.2.3 Process Equivalence

The first-order intensity λ(s) and the yet to be introduced second-order inten-
sity λ2 (si , sj ) (§3.4) capture the mean and dependence structure in a spatial
point pattern. As the mean and covariance of two random variables X and
Y provide an incomplete description of the bivariate distribution, these two
intensity measures describe a point process incompletely. Quite different pro-
cesses can have the same intensity measures λ(s) and λ2 (si , sj ) (for an exam-
ple, see Baddeley and Silverman, 1984). In order to establish the equivalence
of two point processes, their distributional properties must be studied. This
investigation can focus on the distribution of the n-tuple {s1 , · · · , sn } by con-
sidering the process as random sets of discrete points, or through distributions
defined for random measures counting the number of points. We focus on the
second approach. Let N (A) denote the number of events in region (Borel set)
A with volume ν(A). The finite-dimensional distributions are probabilities of
the form

   Pr(N(A1) = n1, · · · , N(Ak) = nk),

where n1, · · · , nk ≥ 0 and A1, · · · , Ak are Borel sets. The distribution of
the counting measure is determined by the system of these probabilities for
k = 1, 2, · · ·. It is convenient to focus on regions A1 , · · · , Ak that are mutu-
ally disjoint (non-overlapping). A straightforward system of probabilities that
determines the distribution of a simple point process consists of the zero-
probability functionals

   P_N^0(A) = Pr(N(A) = 0)
for Borel sets A. Stoyan et al. (1995, Ch. 4.1) refer to these functionals as void-
probabilities since they give the probability that region A is void of events.
Notice that using zero-probability functionals for point process identification
requires simple processes; no two events can occur at the same location.
Cressie (1993, p. 625) sketches the proof of the equivalence theorem,
which states that two simple point processes with counting measures N1 and
N2 are identically distributed if and only if their finite-dimensional distribu-
tions coincide for all integers k and sets A1 , · · · , Ak and if and only if their
void-probabilities are the same: P_{N1}^0(A) = P_{N2}^0(A) ∀A.

Example 3.1 The equivalence theorem can be applied to establish the equiv-
alence of a Binomial process and a homogeneous Poisson process on D that is
conditioned on the number of events. First note that for the Binomial process
we have N (A) ∼ Binomial(n, π(A)), where π(A) = ν(A)/ν(D). Hence,
   P_N^0(A) = {1 − π(A)}^n = [ (ν(D) − ν(A)) / ν(D) ]^n        (3.1)

   Pr(N(A1) = n1, · · · , N(Ak) = nk) = [ n! / (n1! · . . . · nk!) ]
                                       × [ ν(A1)^{n1} · . . . · ν(Ak)^{nk} / ν(D)^n ],        (3.2)
for A1 , · · · , Ak disjoint regions such that A1 ∪ · · · ∪ Ak = D, n1 + · · · + nk = n.
Let M (A) denote the counting measure in a homogeneous Poisson process
with intensity λ. The void-probability in region A is then given by
   P_M^0(A) = exp{−λν(A)}.
Conditioning on the number of events M (D) = n, the void-probability of the
conditioned process becomes

   P_{M|M(D)=n}^0(A) = Pr(M(A) = 0 | M(D) = n)

      = Pr(M(A) = 0) Pr(M(D \ A) = n) / Pr(M(D) = n)

      = e^{−λν(A)} [λν(D \ A)]^n e^{−λν(D\A)} / ( [λν(D)]^n e^{−λν(D)} )

      = ν(D \ A)^n / ν(D)^n = [ (ν(D) − ν(A)) / ν(D) ]^n,
which is (3.1). To establish that the Poisson process M (A), given M (D) = n,
is a Binomial process through the finite-dimensional distributions is the topic
of Chapter problem 3.1.

3.3 Testing for Complete Spatial Randomness

A test for complete spatial randomness addresses whether or not the observed
point pattern could possibly be the realization of a homogeneous Poisson
process (or a Binomial process for fixed n). Just as the stochastic properties
of a point process can be described through random sets of points or counting
measures, statistical tests of the CSR hypothesis can be based on counts of
events in regions (so-called quadrats), or distance-based measures using the
event locations. Accordingly, we distinguish between quadrat count methods
(§3.3.3) and distance-based methods (§3.3.4).
With a homogeneous Poisson process, the number of events in region A
is a Poisson variate and counts in non-overlapping regions are independent.
The distributional properties of quadrat counts are thus easy to establish, in

particular for point patterns on rectangles. The distribution of test statistics


based on quadrat counts is known at least asymptotically and allows closed-
form tests. For irregularly shaped spatial domains, when considering edge
effects, and for rare events (small quadrat counts), these approximations may
not perform well. The sampling distribution of statistics based on distances
between events or distances between sampling locations and events are much
less understood, even in the case of a Poisson process. Although nearest-
neighbor distributions can be derived for many processes, edge-effects and
irregularly shaped domains are difficult to account for.
When sampling distributions are intractable or asymptotic results not reli-
able, one may rely on simulation methods. For point pattern analysis simula-
tion methods are very common, if not the norm. Two of the basic tools are
the Monte Carlo test and the examination of simulation envelopes.

3.3.1 Monte Carlo Tests

A Monte Carlo test for CSR is a special case of a simulation test. The hy-
pothesis is that an observed pattern Z(s) could be the realization of a point
process model Ψ. A test statistic Q is chosen which can be evaluated for the
observed pattern and for any realization simulated under the model Ψ. Let
q0 denote the realized value of the test statistic for the observed pattern.
Then generate g realizations of Ψ and calculate their respective test statis-
tics: q1 = q(ψ1 ), · · · , qg = q(ψg ). The statistic q0 is combined with these and
the set of g + 1 values is ordered (ranked). Depending on the hypothesis and
the choice of Q, either small or large values of Q will be inconsistent with
the model Ψ. For example, if Q is the average distance between events and
their nearest neighbors, then under aggregation one would expect q0 to be
small when Ψ is a homogeneous Poisson process. Under regularity, q0 should
be large. If Ψ is rejected as a data-generating mechanism for the observed
pattern when q0 ≤ q(k) or q0 ≥ q(g+1−k) , where q(k) denotes the kth smallest
value, this is a two-sided test with significance level α = 2k/(g + 1).
Monte Carlo tests have numerous advantages. The p-values of the tests
are exact in the sense that no approximation of the distribution of the test
statistic is required. The p-values are inexact in the sense that the number
of possible realizations under Ψ is typically infinite. At least the number of
realizations will be so large that enumeration is not possible. The number g
of simulations must be chosen sufficiently large. For a 5% level test g = 99
and for a 1% level test g = 999 have been recommended. As long as the
model Ψ can be simulated, the observed pattern can be compared against
complex point processes by essentially the same procedure. Simulation tests
thus provide great flexibility.
A disadvantage of simulation tests is that several critical choices are left
to the user, for example, the number of simulations and the test statistic.
Diggle (1983) cautions of “data dredging,” the selection of non-sensible test

statistics for the sake of rejecting a particular hypothesis. Even if sensible test
statistics are chosen, the results of simulation tests may not agree. The power
of this procedure is also difficult to establish, in particular, when applied to
tests for point patterns. The alternative hypothesis for which the power is to
be determined is not at all clear.
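The sketch below implements the recipe just described for the CSR hypothesis, with Q chosen as the average nearest-neighbor distance; the domain is taken to be the unit square, edge effects are ignored, the simulated patterns condition on n, and all names and the clustered "observed" pattern are our assumptions.

import numpy as np

rng = np.random.default_rng(11)

def mean_nn_distance(s):
    # Average distance from each event to its nearest other event
    d = np.sqrt(((s[:, None, :] - s[None, :, :])**2).sum(-1))
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

def csr_monte_carlo_test(s_obs, g=99):
    n = s_obs.shape[0]
    q0 = mean_nn_distance(s_obs)
    q_sim = np.array([mean_nn_distance(rng.uniform(size=(n, 2))) for _ in range(g)])
    # one-sided p-value against clustering (small q0 indicates aggregation)
    p_clustered = (1 + np.sum(q_sim <= q0)) / (g + 1)
    return q0, p_clustered

# Hypothetical "observed" pattern: offspring scattered around a few parent points
parents = rng.uniform(size=(5, 2))
s_obs = np.clip(parents[rng.integers(0, 5, 100)] + 0.03 * rng.normal(size=(100, 2)), 0, 1)
print(csr_monte_carlo_test(s_obs))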

3.3.2 Simulation Envelopes

A Monte Carlo test calculates a single test statistic for the observed pattern
and each of the simulated patterns. Often, it is illustrative to examine not
point statistics but functions of the point pattern. For example, let hi denote
the distance from event si to the nearest other event and let I(hi ≤ h) denote
the indicator function which returns 1 if hi ≤ h. Then
   Ĝ(h) = (1/n) Σ_{i=1}^{n} I(hi ≤ h)
is an estimate of the distribution function of nearest-neighbor event distances
and can be calculated for any value of h. With a clustered pattern, we expect
an excess number of short nearest-neighbor distances (compared to a CSR
pattern). The method for obtaining simulation envelopes is similar to that
used for a Monte Carlo test, but instead of evaluating a single test statistic
for each simulation, a function such as Ĝ(h) is computed. Let Ĝ0(h) denote
the empirical distribution function based on the observed point pattern. Cal-
culate Ĝ1(h), · · · , Ĝg(h) from g point patterns simulated under CSR (or any
other hypothesis of interest). Calculate the percentiles of the investigated func-
tion from the g simulations. For example, upper and lower 100% simulation
envelopes are given by

   Ĝl(h) = min_{i=1,···,g} {Ĝi(h)}   and   Ĝu(h) = max_{i=1,···,g} {Ĝi(h)}.

Finally, a graph is produced which plots Ĝ0(h), Ĝl(h), and Ĝu(h) against the
theoretical distribution function G(h), or, if G(h) is not attainable, against
the average empirical distribution function from the simulation,

   Ḡ(h) = (1/g) Σ_{i=1}^{g} Ĝi(h).
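The following sketch computes Ĝ(h) and 100% simulation envelopes from g CSR patterns on the unit square; no edge correction is applied, and the distance grid, number of simulations, and names are assumptions made for the illustration.

import numpy as np

rng = np.random.default_rng(3)

def nn_distances(s):
    d = np.sqrt(((s[:, None, :] - s[None, :, :])**2).sum(-1))
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1)

def G_hat(s, h_grid):
    # Empirical distribution function of nearest-neighbor event distances
    h_i = nn_distances(s)
    return np.array([np.mean(h_i <= h) for h in h_grid])

n, g = 100, 99
h_grid = np.linspace(0.0, 0.15, 50)

s_obs = rng.uniform(size=(n, 2))                # stand-in for an observed pattern
G0 = G_hat(s_obs, h_grid)

G_sim = np.array([G_hat(rng.uniform(size=(n, 2)), h_grid) for _ in range(g)])
G_lower, G_upper = G_sim.min(axis=0), G_sim.max(axis=0)
G_bar = G_sim.mean(axis=0)                      # average simulated G, used for plotting

print(np.any((G0 < G_lower) | (G0 > G_upper)))  # True suggests a departure from CSR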

Example 1.5 (Lightning strikes. Continued) Recall the lightning data


from §1.2.3 (p. 11). The pattern comprises 2,927 lightning flashes recorded
by the National Lightning Detection Network within approximately 200 miles
of the East coast of the United States during a span of four days in April
2003. Figure 3.2 displays the observed pattern and two bounding domains, the
bounding box and the convex hull. Obviously, the pattern appears clustered,
and this agrees with our intuition. Lightning strikes do not occur completely
at random, they are associated with storms and changes in the electric charges
of the atmosphere.

[Figure 3.2: map of the lightning strike locations, with longitude on the horizontal axis and latitude on the vertical axis.]

Figure 3.2 Locations of lightning strikes, bounding rectangle, and convex hull.

In order to perform a Monte Carlo-based analysis of these data we need


to define the applicable domain. State boundaries are not adequate because
the strikes were observed within a certain distance of the East coast (see
Figure 1.7 on p. 13). Two common approaches are to choose the bounding
rectangle and the convex hull of the observed pattern. Because the data were
collected within a distance of approximately 200 miles within the coast line,
the bounding box is not a very good representation of the domain. It adds
too much “white space” and thus enhances the degree of clustering.
Figure 3.3 displays Ĝ0(h) based on the bounding box and the convex hull
along with simulation envelopes from 500 simulations. The extent of clustering is
evident; the estimated G functions step outside of the simulation envelopes
immediately. The exaggerated impression of clustering one obtains by using
the bounding rectangle for these data is also evident.

Simulation envelopes can be used for confirmatory inference in different


ways.
• While generating the g simulations one can calculate a test statistic along
with the function of interest and thereby obtain all necessary ingredients
for a Monte Carlo test. For example, one may use h̄, the average nearest-
neighbor distance.

[Figure 3.3: Ĝ(h) and simulation envelopes plotted against the average simulated G(h), for the bounding box and the convex hull.]

Figure 3.3 G-function and simulation envelopes from 500 simulations on bounding
box and convex hull.

• If the null hypothesis is reasonable, the observed function G(h) should fall
within the simulation envelopes. When G(h) and its envelopes are graphed
against h and a 95% upper simulation envelope is exceeded at a given
small distance h0 , a Monte Carlo test with test statistic G(h0 ) would have
rejected the null hypothesis in a one-sided test at the 5% level. It is thus
common to calculate 95% simulation envelopes and examine whether Ĝ(h)
crosses the envelopes. It must be noted, however, that simulation envelopes
are typically plotted against the theoretical G(h) or Ḡ(h), not distance.
Furthermore, unless the value of h0 is set in advance, the Type-I error of
this method is not protected.

3.3.3 Tests Based on Quadrat Counts

3.3.3.1 Goodness-of-Fit Test

The most elementary test of CSR based on counting events in regions is based
on dividing the domain D into non-overlapping regions (quadrats) A1 , · · · , Ak
of equal size such that A1 ∪ · · · ∪ Ak = D. Typically, the domain is assumed to

be bounded by a rectangle and partitioned into r rows and c columns. If nij


is the number of events in quadrat ij, and n̄ = n/(rc) is the expected number
of events in any quadrat under CSR, then the standard Pearson Chi-square
statistic is

   X² = Σ_{i=1}^{r} Σ_{j=1}^{c} (nij − n̄)² / n̄.        (3.3)

This test statistic is that of a Chi-square goodness-of-fit test of the hypothesis


that the n points are distributed uniformly and independently in D, or, in
other words, whether the quadrat counts are independent Poisson variates
with common mean. Fisher, Thornton, and MacKenzie (1922) used (3.3) in
the latter sense to test whether bacterial density on plates can be described
by the Poisson distribution. Note that the reference distribution for (3.3) is
χ²_{rc−1}. Although (3.3) is written as a Chi-square statistic for a contingency
table, the double summation is used to emphasize the row-column partition
of the domain. Furthermore, no additional degree of freedom is lost to the
estimation of n̄, since n is known in a mapped point pattern. An alternative
expression for (3.3) is X² = (rc − 1)s²/n̄, where s² is the sample variance of the
r × c quadrat counts. If the pattern is CSR, then the ratio of sample variance
and sample mean should be approximately 1. X² is thus also referred to as the
index of dispersion. Note that Diggle (1983, p. 33) terms I = X²/(rc − 1)
as the index of dispersion.
The goodness-of-fit test based on quadrat counts is simple and the Chi-
square approximation performs well provided that the expected number of
events per quadrat exceeds 1 and rc > 6 (Diggle, 1983, p. 33). It is, however,
very much influenced by the choice of the quadrat size.
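A direct implementation of (3.3) is sketched below (the simulated counts are only a stand-in for real quadrat data). It returns X² together with the left- and right-tail probabilities of the χ²_{rc−1} reference distribution, so that regularity and clustering can each be assessed in the appropriate tail.

import numpy as np
from scipy.stats import chi2

def dispersion_test(counts):
    # Index of dispersion test, eq. (3.3), from an r x c array of quadrat counts
    counts = np.asarray(counts, dtype=float)
    r, c = counts.shape
    n_bar = counts.sum() / (r * c)
    x2 = np.sum((counts - n_bar)**2 / n_bar)
    df = r * c - 1
    return x2, chi2.cdf(x2, df), chi2.sf(x2, df)   # statistic, left tail, right tail

rng = np.random.default_rng(5)
counts = rng.poisson(4.0, size=(5, 5))             # hypothetical 5 x 5 quadrat counts
print(dispersion_test(counts))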

Example 3.2 For the three point patterns on the unit square shown in
Figure 3.1, quadrat counts were calculated on a r = 5 × c = 5 grid. Since each
pattern contains n = 100 points, n̄ = 4 is common to the three realizations.
The quadrat counts are shown in Table 3.1.
With a CSR process, the counts distribute evenly across the quadrats,
whereas in the clustered pattern counts concentrate in certain areas of the
domain. Consequently, the variability of quadrat counts, if events aggregate,
exceeds the variability of the Poisson process. Clustered processes exhibit large
values of X 2 . The reverse holds for regular processes whose counts are under-
dispersed relative to the homogeneous Poisson process. The CSR hypothesis
based on the index of dispersion is thus rejected in the right tail against the
clustered alternative and the left tail against the regular alternative.
The sensitivity of the goodness-of-fit test to the choice of quadrat size is
evident when the processes are divided into r = 3 × c = 3 quadrats. The
left tail X 2 probabilities for 9 quadrats are 0.08, 0.99, and 0.37 for the CSR,
clustered, and regular process, respectively. The scale on which the point
pattern appears random, clustered, or regular, depends on the scale on which
the counts are aggregated. This is a special case of what is known in spatial
data analysis as the change of support problem (see §5.7).

Table 3.1 Quadrat Counts in Simulated Point Patterns of Figure 3.1

          CSR Process, c =      Cluster Process, c =     Regular Process, c =
  r =     1  2  3  4  5         1  2  3  4  5            1  2  3  4  5
  5       6  3  7  4  5         2  1  7  5 10            7  5  3  2  6
  4       5  3  1  3  4         3  6  2  3 10            3  2  4  5  7
  3       2  3  8  3  5         8  4  1  6  5            4  5  3  5  3
  2       5  5  5  5  4         1  6  0  2  1            2  4  7  3  3
  1       1  4  2  3  4         1  2  8  3  3            4  4  4  2  3

          X² = 17.0             X² = 45.9                X² = 14.5
          Pr(χ²₂₄ ≤ X²) = 0.15  Pr(χ²₂₄ ≤ X²) = 0.99     Pr(χ²₂₄ ≤ X²) = 0.06

3.3.3.2 Analysis of Contiguous Quadrats

The fact that quadrat counts can indicate randomness, regularity, and clus-
tering depending on the scale of aggregation was the idea behind the method
of contiguous quadrat aggregation proposed in an influential paper by Greig-
Smith (1952). Whereas the use of randomly placed quadrats of differing size
had been common a the time, Greig-Smith (1952) proposed a method of ag-
gregating events at different scales by counting events in successively larger
quadrats which form a grid in the domain. Initially, the domain is divided into
a grid of 2^q × 2^q quadrats. Common choices for q are 4 or 5 leading to a basic
aggregation into 256 or 1024 quadrats. Then the quadrats are successively
combined into blocks consisting of 2 × 1, 2 × 2, 2 × 4, 4 × 4 quadrats and so
forth. Consider a division of the domain into 16 × 16 = 256 quadrats as shown
in Figure 3.4.
Depending on whether the rectangular blocks that occur at every second
level of aggregation are oriented horizontally or vertically, two modes of ag-
gregation are distinguished. The entire pattern contains two blocks of 128
quadrats. Each of these contains two blocks of 64 quadrats. Each of these
contains two blocks of 32 quadrats, and so forth. Let Nr,i denote the number
of events in the ith block of size r. The sum of squares between blocks of size
r is given by
m m/2
( (
SSr = 2 2
Nr,i − 2
N2r,j .
i=1 j=1

The term sum of squares reveals the connection of the method with a (nested)
analysis of variance. Blocks of size 1 are nested within blocks of size 2, these

Figure 3.4 Aggregation of quadrat counts into blocks of successively larger size for
analysis of contiguous quadrats according to Greig-Smith (1952). Horizontal aggre-
gation is shown in the left-hand panel, vertical aggregation in the right-hand panel.
Dashed line shows the division of one block of size 128 into two blocks of size 64.

are nested within blocks of size 4, and so forth. The mean square associated
with blocks of size r is then simply MS_r = SS_r/2^{2q}. The original Greig-Smith
analysis consisted of plotting MS_r against the block size r. Peaks or troughs
in this graph are interpreted as indicative of clustered or regular patch sizes.
Since Var[MS_r] increases with the quadrat area, care must be exercised not
to over-interpret the fluctuations in MS_r, in particular for larger block sizes.
The Greig-Smith analysis thus provides a good application where simulation
envelopes should be considered. The peaks and troughs in the MS_r plot can
then be interpreted relative to the variation that should be expected at that
block size. To calculate simulation envelopes, the quadrat counts for the finest
gridding are randomly permuted s times among the grid locations.
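The following sketch illustrates how such a permutation analysis could be carried out. It is our illustration rather than code from the text; it assumes NumPy, the function names are our own, and it implements the reconstruction SS_r = 2 Σ N²_{r,i} − Σ N²_{2r,j} with MS_r = SS_r/2^{2q} given above, for one aggregation orientation at a time.

```python
import numpy as np

def greig_smith_ms(counts, first_axis=0):
    """Nested mean squares MS_r for contiguous quadrats, one aggregation orientation.

    counts: 2**q x 2**q array of quadrat counts on the finest grid.
    first_axis: axis along which quadrats are joined first (0 or 1).
    Returns {block size r: MS_r} with SS_r = 2*sum(N_r**2) - sum(N_2r**2)
    and MS_r = SS_r / 2**(2q).
    """
    n_quadrats = counts.size
    blocks, axis, ms, r = counts.astype(float), first_axis, {}, 1
    while blocks.size > 1:
        if axis == 0:                                    # join pairs of rows
            merged = blocks[0::2, :] + blocks[1::2, :]
        else:                                            # join pairs of columns
            merged = blocks[:, 0::2] + blocks[:, 1::2]
        ms[r] = (2 * np.sum(blocks**2) - np.sum(merged**2)) / n_quadrats
        blocks, axis, r = merged, 1 - axis, 2 * r
    return ms

def permutation_envelopes(counts, s=200, rng=None):
    """Min/max simulation envelopes of the orientation-averaged MS_r under permutation."""
    rng = np.random.default_rng(rng)
    sims = []
    for _ in range(s):
        perm = rng.permutation(counts.ravel()).reshape(counts.shape)
        v, h = greig_smith_ms(perm, 0), greig_smith_ms(perm, 1)
        sims.append({r: 0.5 * (v[r] + h[r]) for r in v})
    lower = {r: min(d[r] for d in sims) for r in sims[0]}
    upper = {r: max(d[r] for d in sims) for r in sims[0]}
    return lower, upper
```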

Example 3.3 Red-cockaded woodpecker. The red-cockaded woodpecker
is a federally endangered species sensitive to disruptions of its habitat because
of its breeding habits. The species is known as a cooperative breeder where
young male offspring remain with a family as nonbreeding helpers, sometimes
for several years. The species builds its nest cavities in live pine trees. Because
of the resistance exerted by live trees, building a cavity can take a long time
and loss of nesting trees is more damaging than for species that can inhabit
natural cavities or cavities vacated by other animals. Decline of the bird’s
population has been attributed to the reduction of natural habitat, the lack
of old-growth, and the suppression of seasonal fires (Walters, 1990). Thirty
clusters (=families) of birds were followed in the Fort Bragg area of North
Carolina over a 23-month period (December 1994 to October 1996). For seventeen
of the twenty-three months, one bird from each cluster was selected and
observed for an entire day. At eight-minute intervals during that observation
period, the location of the bird was geo-referenced. One of the goals of the
study was to obtain an estimate of the home range of the cluster, that is, the
area in which the animals perform normal activities (Burt, 1943). Figure 3.5
shows the counts obtained from partitioning the 4,706 ft × 4,706 ft bounding
square of the pattern into 32 × 32 quadrats of equal size. A concentration of
the birds near the center of the study area is obvious, the data appear highly
clustered.


Figure 3.5 Quadrat counts for woodpecker data. The bounding square was divided
into 32 × 32 square quadrats of equal size. The width of a quadrat is approximately
147 ft. A total of 675 locations were recorded. Data kindly provided by Professor Jeff
Walters, Department of Biology, Virginia Tech.

The nested analysis of variance shows the dependence of MS_r on the
direction of aggregation. In particular for large block sizes, the differences for
vertical and horizontal aggregation are considerable. Let MS_r^{(v)} denote the
mean square at block size r from a vertical aggregation pattern and MS_r^{(h)}
the corresponding mean square from a horizontal pattern (see Figure 3.4).
We recommend eliminating the effects of the orientation of aggregation by
considering the average \bar{MS}_r = 0.5(MS_r^{(v)} + MS_r^{(h)}). Next, s = 200 ran-

Table 3.2 Nested analysis of variance for vertical and horizontal aggregation into
contiguous quadrats

  Block   No. of       SS_r                MS_r           \bar{MS}_r     Envelope
  Size    Blocks    Vert.    Horiz.     Vert.   Horiz.                 Min     Max
  1        1024      2299     2299       0.72    0.70       0.71       1.64    1.94
  2         512      3869     3877       0.69    0.71       0.70       1.62    1.99
  4         256      7025     7025       1.32    1.67       1.49       1.41    2.32
  8         128     12701    12343       2.54    1.84       2.19       1.18    2.41
  16         64     22803    22803       5.69    2.95       4.32       0.89    2.83
  32         32     39779    42581       3.84    9.31       6.57       0.83    3.34
  64         16     75631    75631      35.39   46.91      41.15       0.28    5.50
  128         8    115013   103231      73.87   50.86      62.36       0.29    6.00
  256         4    154383   154383      72.59    8.92      40.76       0.02    8.53
  512         2    234425   299633      12.92  140.27      76.59       0.00    7.85
  1024        1    455625   455625          .       .          .

dom permutations of the quadrat counts in Figure 3.5 were generated and
the nested analysis of variance was repeated for each permutation. For the
smallest block sizes the woodpecker distribution exhibits some regularity, but
the effects of strong clustering are apparent for the larger block sizes (Table
3.2). Since Var[MS_r] increases with block size, peaks and troughs must be
compared to the simulation envelopes to avoid over-emphasizing spikes in the
MS_r. Figure 3.6 shows significant clustering for block sizes of 32 and more
quadrats. The spike of \bar{MS}_r at r = 128 represents the mean square for blocks
of 16 × 8 quadrats. This corresponds to a “patch size” of 2,352 ft × 1,176 ft.
The Woodpecker quadrat count data are vertically aggregated. Observa-
tions collected over time were accumulated into discrete spatial units. The
analysis shows that both the size of the unit of aggregation and the
spatial configuration (orientation) have an effect on the analysis. This is an ex-
ample of a particular aspect of the change of support problem, the modifiable
areal unit problem (MAUP, see §5.7).

The graph of the mean squares by block size in the Greig-Smith analysis
is an easily interpretable, exploratory tool. The analysis can attain a confir-
matory character if it is combined with simulation envelopes or Monte Carlo
tests. Closed-form significance tests have been suggested based on sums of
squares and mean squares and the Chi-square and F -distributions (see, for
example, Thompson, 1955, 1958; Zahl, 1977). Upton and Fingleton (1985,
p. 53) conclude that “virtually all the significance tests are suspect.” The
analysis of contiguous quadrats also conveys information only about “scales
of pattern” that coincide with blocks of size 2^k, k = 0, . . . , 2q − 1. A peak or
trough in the M Sr plot for a particular value of r could be induced by a patch

Figure 3.6 Mean squares for nested, contiguous quadrat counts as a function of
block size. Solid lines without symbols denote 100% simulation envelopes for
\bar{MS}_r = 0.5(MS_r^{(v)} + MS_r^{(h)}).

size for which the mean squares cannot be calculated. The effects of orienta-
tion of the grid on the analysis are formidable. Despite these problems, the
Greig-Smith analysis remains a popular tool.

3.3.3.3 Application of Other Methods to Quadrat Counts

The grouping of events into quadrats creates a lattice structure to which


other statistical methods for spatial data can be applied. Since the counts are
independent under CSR, finding significant spatial autocorrelation among the
counts leads to a rejection of the CSR hypothesis. One approach is thus to
compute Moran’s I or Geary’s c statistic and to apply the tests in §1.3.2.
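As an illustration of this approach (ours, not the text's), the snippet below computes Moran's I for a lattice of quadrat counts under a queen neighborhood and obtains a one-sided Monte Carlo p-value by permuting the counts. NumPy and the function names are assumptions of the sketch.

```python
import numpy as np

def queen_weights(nrow, ncol):
    """Binary queen-contiguity weight matrix for an nrow x ncol lattice of quadrats."""
    w = np.zeros((nrow * ncol, nrow * ncol))
    for r in range(nrow):
        for c in range(ncol):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if (dr, dc) != (0, 0) and 0 <= rr < nrow and 0 <= cc < ncol:
                        w[r * ncol + c, rr * ncol + cc] = 1.0
    return w

def morans_i(x, w):
    """Moran's I for the flattened quadrat counts x and weight matrix w."""
    z = x - x.mean()
    return len(x) / w.sum() * (z @ w @ z) / (z @ z)

def monte_carlo_i(x, w, s=200, rng=None):
    """One-sided Monte Carlo p-value against clustering (large I), permuting counts."""
    rng = np.random.default_rng(rng)
    i_obs = morans_i(x, w)
    i_sim = np.array([morans_i(rng.permutation(x), w) for _ in range(s)])
    return i_obs, (1 + np.sum(i_sim >= i_obs)) / (s + 1)
```

For a 10 × 10 array of counts such as Table 3.3, morans_i(counts.ravel(), queen_weights(10, 10)) corresponds to the type of statistic reported below, with the permutation p-value playing the role of the Monte Carlo test.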

Example 3.3 (Red cockaded woodpecker. Continued) Grouping the


woodpecker events into a lattice of 10 × 10 quadrats leads to Table 3.3. The
clustering of events is clearly evident. Both statistics provide strong evidence
of clustering, based on a “queen” definition of the lattice neighborhood (Table
3.4). In Monte Carlo tests based on 200 permutations of the quadrat counts,
no arrangement produced an I or c statistic more extreme than the observed
value.
TESTING FOR COMPLETE SPATIAL RANDOMNESS 97

Table 3.3 Quadrat counts for woodpecker data based on 10 × 10 square quadrats

Column
Row 1 2 3 4 5 6 7 8 9 10
10 0 0 0 5 9 4 5 0 0 0
9 0 2 2 6 3 4 1 0 0 0
8 0 0 6 3 3 0 1 0 2 2
7 1 0 2 1 1 3 5 4 2 2
6 2 1 4 19 15 14 6 4 3 1
5 3 4 17 42 41 8 9 5 1 0
4 2 14 21 34 31 31 22 7 4 4
3 5 5 11 25 31 45 33 3 0 1
2 2 0 3 5 6 10 19 8 0 0
1 0 0 3 5 0 3 2 2 0 0

Table 3.4 Results for Moran's I and Geary's c analysis based on quadrat counts in
Table 3.3

  Method          Statistic   Observed Value   Expected Value   Standard Error   P-Value
  Normality           I           0.7012          −0.0101           0.0519       < .0001
  Randomization       I           0.7012          −0.0101           0.0509       < .0001
  Normality           c           0.4143           1                0.0611       < .0001
  Randomization       c           0.4143           1                0.0764       < .0001

3.3.4 Tests Based on Distances

The choice of shape and number of quadrats in CSR tests based on areal
counts is a subjective element that can influence the outcome. Test statistics
that are based on distances between events or between sample points and
events eliminate this subjectivity, but are more computationally involved.
In this subsection tests based on distances between events are considered.
Let hij denote the inter-event distance between events at locations si and sj ,
hij = ||si − sj ||. The distance between event si and the nearest other event is
called the nearest-neighbor distance and denoted hi .
Sampling distributions of test statistics based on inter-event or nearest-
neighbor distances are elusive, even under the CSR assumption. Ripley and
Silverman (1978) described a closed-form quick test that is based on the first
ordered inter-event distances. For example, if t_1 = min{h_ij}, then t_1² has an
exponential distribution under CSR. The consequences of recording locations
inaccurately loom large for such a test, and Ripley and Silverman recommend
using tests based on the third-smallest inter-event distance.
To circumvent the problem of determining the sampling distribution and
to account for irregularly shaped spatial domains, simulation based methods
are common in CSR tests based on distances. Whereas there are n(n − 1)/2
inter-event distances, there are only n nearest-neighbor distances and using
a tessellation or triangulation the nearest-neighbor distances can be quickly
determined. Enumerating all inter-event distances and finding the smallest for
each point in order to determine nearest-neighbor distances is only acceptable
for small problems. Tests based on inter-event distances are thus computa-
tionally more intensive. The accuracy with which the point locations are de-
termined intuitively appears to be more important when only the distances
between points and their nearest neighbors are considered.
Reasonable test statistics for Monte Carlo tests are h̄, the average nearest-neighbor
distance, t̄, the average inter-event distance,

Ĝ(y_0) = #(h_i ≤ y_0)/n,

the empirical estimate of the probability that the nearest-neighbor distance
is at most y_0,

Ĥ(h_0) = 2 #(h_ij ≤ h_0)/{n(n − 1)},

the empirical estimate of the probability that the inter-event distance is at
most h_0, and so forth. There are many other sensible choices provided that the
test statistic is interpretable in the context of testing the CSR hypothesis. In
a clustered pattern, for example, h̄ tends to be smaller than in a CSR pattern
and tends to be larger in a regular pattern (Figure 3.7). If y_0 is chosen small,
Ĝ(y_0) will be larger than expected under CSR in a clustered pattern and
smaller in a regular pattern.
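As a concrete illustration (ours, not the text's), the sketch below carries out a Monte Carlo CSR test based on h̄ for a pattern on the unit square; for other domains, the CSR patterns would be simulated on that region instead. NumPy and the function names are assumptions.

```python
import numpy as np

def mean_nn_distance(pts):
    """Average nearest-neighbor distance h-bar for an n x 2 array of event coordinates."""
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-distances
    return d.min(axis=1).mean()

def csr_test_hbar(pts, s=200, rng=None):
    """Monte Carlo CSR test on the unit square; small h-bar points to clustering."""
    rng = np.random.default_rng(rng)
    h_obs = mean_nn_distance(pts)
    h_sim = np.array([mean_nn_distance(rng.random(pts.shape)) for _ in range(s)])
    return h_obs, (1 + np.sum(h_sim <= h_obs)) / (s + 1)
```

With s = 200 and no simulated value as small as the observed one, the p-value is 1/201 ≈ 0.005, which is the value quoted in the example that follows.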

Example 3.3 (Red cockaded woodpecker. Continued) The average


nearest-neighbor distance among the red cockaded woodpecker event locations
is h̄ = 0.0148. In 200 simulations of CSR processes with the same number of
events as the observed pattern, none of the average nearest-neighbor distances
was less than 0.0148. The CSR hypothesis is rejected against the clustered al-
ternative with p-value 0.00498. Figure 3.8 shows the empirical distribution
function of nearest-neighbor distance and the upper and lower simulation en-
velopes. The upper envelope is exceeded for very short distances, evidence of
strong clustering in the data.

Figure 3.7 Distribution of nearest-neighbor distances in the three point patterns of
Figure 3.1. Completely random pattern (a), clustered pattern (b), and regular pattern
(c).

3.4 Second-Order Properties of Point Patterns

If λ(s) plays a role in point pattern analysis akin to the mean function, what
function of event locations expresses dependency of events? In order to capture
spatial interaction, more than one event needs to be considered. The second-
order intensity function sets into relationship the expected cross-product of
event counts in infinitesimal disks and the volumes of the disks as the disks
are shrunk,

λ_2(s_i, s_j) = lim_{|ds_i|→0, |ds_j|→0} E[N(ds_i)N(ds_j)] / (|ds_i| |ds_j|).   (3.4)
Stoyan et al. (1995, p. 112) refer to (3.4) as the second-order product density
since it is the density of the second-order factorial moment measure.
A point process is homogeneous (uniform) if λ(s) = λ. A process is stationary
if the second-order intensity depends only on event location differences,
λ_2(s_i, s_j) = λ*_2(s_i − s_j). If the process is furthermore isotropic, the second-order
intensity depends only on distance, λ_2(s_i, s_j) = λ*_2(||s_i − s_j||) = λ*_2(h).

Figure 3.8 EDF of nearest-neighbor distance (solid line) for woodpecker data and
simulation envelopes (dashed lines) based on 200 simulations.

In other words, the distribution of a stationary point process is translation


invariant; the distribution of an isotropic pattern is rotation invariant. As
previously, the star notation is dropped in what follows.
The covariance between event counts in regions A and B can be expressed
in terms of the two intensities. Assuming a stationary process,

Cov[N(A), N(B)] = ∫_A ∫_B λ_2(s_i − s_j) ds_i ds_j + λ|A ∩ B| − λ²|A||B|.   (3.5)

The definition of λ_2(s_i, s_j) resembles a cross-product expectation. Since

Cov[X, Y] = E[XY] − E[X]E[Y],

the random variables X and Y are uncorrelated if E[XY] = E[X]E[Y]. The
covariance density function of the point process is similarly defined as

C(s_i − s_j) = λ_2(s_i − s_j) − λ(s_i)λ(s_j).   (3.6)

If λ_2(s_i − s_j) = λ(s_i)λ(s_j), then C(s_i − s_j) = 0 and Cov[N(A), N(B)] = 0.
From (3.5) or (3.6) it is clear that a stationary, isotropic process does not
exhibit spatial dependency if λ_2(h) = λ². Beyond these simple relationships,
interpretation of the second-order intensity is difficult. The study of the dependence
among events in a point pattern typically rests on functions of λ_2
that have a more accessible interpretation.

3.4.1 The Reduced Second Moment Measure—The K-Function

The K-function

K(h) = (2π/λ²) ∫_0^h x λ_2(x) dx   (3.7)

of Ripley (1976) is a function of λ_2 for stationary and isotropic processes.
It is also known as the reduced second moment measure (Cressie, 1993), as
the second reduced moment function (Stoyan et al., 1995), and as the second
order reduced moment measure (Møller and Waagepetersen, 2003). Studying
the second-order properties of a point pattern via the K-function is popular
because the function has very appealing properties and interpretation.
• If the process is simple, λK(h) represents the expected number of extra
events within distance h from an arbitrary event. In the HPP with intensity
λ this expected number is λπh² and the K-function for the HPP is simply
K(h) = πh².
• If K(h) is known for a particular point process, the second-order intensity
is easily derived from (3.7),
λ_2(h) = (λ²/(2πh)) dK(h)/dh.
• The definition for simple processes suggests a method of estimating K(h)
from an observed pattern as a function of the average number of events
less than distance h apart.
• In a clustered pattern an event is likely to be surrounded by events from
the same cluster. The number of extra events within small distances will
be large. In regular patterns the number of extra events for short distances
will be small.
• K(h) is not affected by events that are missing completely at random
(MCAR). If not all events have been recorded—the pattern is not mapped—
and the missing data process is MCAR, the observed pattern is a subset of
the complete process whose events are retained or deleted in a sequence of
iid Bernoulli trials. Such random thinning, also called p-thinning, reduces
the intensity and the number of extra events by the same factor. The orig-
inal process and the pattern which results from p-thinning have the same
K-function (see §3.7.1).
• Other functions of λ2 used in the study of dependence in point patterns
are easily related to K(h). For example, the pair-correlation function
R(h) = (1/(2πh)) dK(h)/dh,

the radial distribution function

F(h) = λ dK(h)/dh,

and the L-function

L(h) = √(K(h)/π).

3.4.2 Estimation of K- and L-Functions

The first-order intensity of a homogeneous process does not depend on spatial
location, λ(s) = λ, and the natural estimator of the intensity within region A
is

λ̂ = N(A)/ν(A).   (3.8)
Recall that the K-function (3.7) is defined for stationary, isotropic point pat-
terns and that λK(h) ≡ E(h) is the expected number of extra events within
distance h. If h_ij is the distance between events s_i and s_j, a naïve moment
estimator of E(h) is

Ẽ(h) = (1/n) Σ_{i=1}^{n} Σ_{j≠i} I(h_ij ≤ h).

The inner sum yields the number of observed extra events within distance
h of event si . The outer sum accumulates these counts. Since the process is
stationary, the intensity is estimated with (3.8) and K̃(h) = λ̂^{-1} Ẽ(h).
Because events outside the study region are not observed, this estimator
is negatively biased. If one calculates the extra events for an event near the
boundary of the region, counts will be low because events outside the region
are not taken into account. To adjust for these edge effects, various corrections
have been applied. If one considers only those events for the computation of
K(h) whose distance di from the nearest boundary exceeds h, one obtains
E*(h) = [ Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} I(h_ij ≤ h and d_j > h) ] / [ Σ_{j=1}^{n} I(d_j > h) ],

Ê_d(h) = { E*(h)   if Σ_{j=1}^{n} I(d_j > h) > 0,
         { 0       otherwise.

Ripley’s estimator (Ripley, 1976) applies weights w(si , sj ) to each pair of


observations that correspond to the proportion of the circumference of a circle
that is within the study region, centered at si , and with radius hij = ||si −sj ||.
The estimator for E(h) applying this edge correction is
Ê(h) = (1/n) Σ_{i=1}^{n} Σ_{j≠i} w(s_i, s_j)^{-1} I(h_ij ≤ h).

In either case,

K̂(h) = λ̂^{-1} Ê(h)   or   K̂(h) = λ̂^{-1} Ê_d(h).
Cressie (1993, p. 616) discusses related estimators of K(h).
In statistical analyses one commonly computes K̂(h) for a set of distances
and compares the estimate against the K-function of the CSR process (πh²).
Unfortunately, important deviations between empirical and theoretical second-order
behavior are often difficult to determine when K(h) and K̂(h) are overlaid
in a plot. In addition, the variance of the estimated K-function increases
quickly with h and for large distances the behavior can appear erratic. Using
a plug-in estimate, the estimated L-function

L̂(h) = √(K̂(h)/π)

has better statistical properties. For graphical comparisons of empirical and
theoretical second-order behavior under CSR we recommend a plot of L̂(h) − h
versus h. The CSR model is the horizontal reference line at 0. Clustering of
events manifests itself as positive values at short distances. Significance is
assessed through Monte Carlo testing as described in §3.3.1, and in practice we
consider a plot of L̂(h) − h versus h together with the corresponding simulation
envelopes computed under CSR as described in §3.3.2.
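The following sketch (our own illustration, not code from the text) computes L̂(h) − h using the border edge correction Ê_d(h) described above for a pattern on the unit square, together with pointwise CSR simulation envelopes. The unit-square geometry, NumPy, and the function names are assumptions of the sketch.

```python
import numpy as np

def l_minus_h(pts, area, h_values):
    """L-hat(h) - h with the border ("distance to boundary") correction, unit square."""
    lam = len(pts) / area                                   # lambda-hat, equation (3.8)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    d_bound = np.minimum(pts, 1 - pts).min(axis=1)          # distance to nearest edge
    out = []
    for h in h_values:
        keep = d_bound > h                                  # reference events away from edges
        if not keep.any():
            out.append(np.nan)
            continue
        e_d = np.sum(dist[keep, :] <= h) / keep.sum()       # extra events per retained event
        out.append(np.sqrt((e_d / lam) / np.pi) - h)
    return np.array(out)

def csr_envelopes(n, area, h_values, s=200, rng=None):
    """Pointwise min/max envelopes of L-hat(h) - h from s CSR patterns of n events."""
    rng = np.random.default_rng(rng)
    sims = np.array([l_minus_h(rng.random((n, 2)), area, h_values) for _ in range(s)])
    return np.nanmin(sims, axis=0), np.nanmax(sims, axis=0)
```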

Example 1.5 (Lightning strikes. Continued) Based on the empirical


distribution function of nearest neighbor distances we concluded earlier that
the lightning data are highly clustered. If clustering is not the result of an
inhomogeneous lightning intensity, but due to dependence of the events, a
second-order analysis with K- or L-functions is appropriate. Figure 3.9 shows
the observed L-functions for these data and simulation envelopes based on
s = 200. The quick rise of the L-functions above the reference line for small
distances is evidence of clustering. Whereas the simulation envelopes do not
differ between an analysis on the bounding box and the convex hull, the
empirical L-function in the former case overstates the degree of clustering
because the bounding rectangle adds too much empty white space.

3.4.3 Assessing the Relationship between Two Patterns

The K-function considers only the location of events; it ignores any attribute
values (marks) associated with the events. However, many point patterns in-
clude some other information about the events and this information is often
binary in nature, e.g., which of two competing species of trees occurred at a
particular location, whether or not an individual with a certain disease at a
particular location is male or female, or whether or not a plant at a location
was diseased. Diggle and Chetwynd (1991) refer to such processes as labeled.
In cases such as these, we may wonder whether the nature of the spatial pat-
tern is different for the two types of events. We discuss marked point patterns
and multivariate spatial point processes in more generality in §3.6. In this
section, we focus on the simple, yet common, case of a bivariate process with
binary marks.
One generalization of K(h) to a bivariate spatial point process is (Ripley,
1981; Diggle, 1983, p. 91)
K_ij(h) = λ_j^{-1} E[# of type j events within distance h
of a randomly chosen type i event].
Suppose the type i events in A are observed with intensity λi at locations

Figure 3.9 L-functions and simulation envelopes from 200 simulations on bounding
box and convex hull for lightning data. Edge correction is based on Ê_d(h).

referenced by s, and the type j events in A are observed with intensity λj


at locations referenced by u. Then, an edge-corrected estimator of Kij (h) is
(Ripley, 1981)
K̂_ij(h) = [λ̂_i λ̂_j ν(A)]^{-1} Σ_k Σ_l w(s_k, u_l)^{-1} I(h_kl ≤ h),   (3.9)

where hkl = ||sk − ul ||, and w(sk , ul ) is the proportion of the circumference
of a circle centered at location sk with radius hkl that lies inside A.
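A minimal sketch of (3.9), ours rather than the text's: for brevity the edge-correction weights w(s_k, u_l) are set to 1, so the estimate is uncorrected; si and uj are the coordinate arrays of the type i and type j events and area is ν(A). NumPy is assumed.

```python
import numpy as np

def cross_k(si, uj, area, h_values):
    """Cross-K estimate in the spirit of (3.9) with all weights w(s_k, u_l) = 1."""
    lam_i, lam_j = len(si) / area, len(uj) / area                   # lambda-hat i and j
    d = np.linalg.norm(si[:, None, :] - uj[None, :, :], axis=-1)    # h_kl = ||s_k - u_l||
    return np.array([np.sum(d <= h) / (lam_i * lam_j * area) for h in h_values])
```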
If the bivariate spatial process is stationary, the cross-K-functions are symmetric,
i.e., K_12 = K_21. However, K̂_12 ≠ K̂_21, so Lotwick and Silverman
(1982) suggest using the more efficient estimator

K*_ij(h) = (λ̂_j K̂_ij(h) + λ̂_i K̂_ji(h)) / (λ̂_j + λ̂_i).

Under a null hypothesis of independence between the two spatial point


processes, K_ij(h) = πh², regardless of the nature of the pattern of either type
of event. Thus, again we work with the corresponding L-function, L*_ij(h) =
(K*_ij(h)/π)^{1/2}, and under independence L*_ij(h) = h. Values of L*_ij(h) − h > 0
indicate attraction between the two processes at distance h. Values of L*_ij(h) −
h < 0 indicate repulsion. Unfortunately, hypothesis tests are more difficult in
this situation since a complete bivariate model must be specified.
Diggle (1983) provides an alternative philosophy that is not based on CSR.
Another way to define the null hypothesis of “no association” between the
two processes is that each event is equally likely to be a type i (or type j)
event. This is known as the random labeling hypothesis. This hypothesis
is subtly different from the independence hypothesis. The two scenarios
arise from different random mechanisms. Under independence, the locations
and associated marks are determined simultaneously. Under the random la-
beling hypothesis, locations arise from a univariate spatial point process and a
second random mechanism determines the marks. Thus, the marks are deter-
mined independently of the locations. Diggle notes that the random labeling
hypothesis neither implies nor is implied by the stricter notion of statistical
independence between the two spatial point processes, and confusing the two
scenarios can lead to “the analysis of data by methods which are largely ir-
relevant to the problem in hand” (Diggle, 1983, p. 93). The random labeling
hypothesis always conditions on the set of locations of all observed events,
and, under this hypothesis
K11 = K22 = K12 (3.10)
(Diggle and Chetwynd, 1991). In contrast, the independence approach con-
ditions on the marginal structure of each process. Thus, the two approaches
lead to different expected values for K12 (h), to different tests, and to different
interpretation.
Diggle and Chetwynd use the relationships in (3.10) to construct a test
based on the difference of the K-functions
D(h) = Kii (h) − Kjj (h).
They suggest estimating D(h) by plugging in estimates of Kii and Kjj ob-
tained from (3.9) (adjusted so that D(h) is unbiased, see Diggle and Chet-
wynd, 1991). Under the random labeling hypothesis, the expected value of
D(h) is zero for any distance h. Positive values of D(h) suggest spatial clus-
tering of type i events over and above any clustering observed in the type j
events.
Diggle and Chetwynd (1991) derive the variance-covariance structure of
D̂(h) and give an approximate test based on the standard Gaussian distri-
bution. However, Monte Carlo simulation is much easier. To test the random
labeling hypothesis, we condition on the set of all n1 + n2 locations, draw a
sample of n1 from these (if we want to test for clustering in type i events),
assign the locations not selected to be of type j, and then compute D̂(h)
for each sampling. Under the random labeling hypothesis, the n1 sampled
locations reflect a random “thinning” (see §3.7.1) of the set of all locations.
Also, Monte Carlo simulation enables us to consider different statistics for the
comparison of the patterns, for example D(h) = Lii (h) − Ljj (h).
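To make the random labeling procedure concrete, the sketch below (ours, not from the text) estimates D(h) with a simple, uncorrected K-function estimator and builds 5% and 95% pointwise envelopes by relabeling the fixed set of locations. NumPy and the function names are assumptions, and edge correction is omitted for brevity.

```python
import numpy as np

def k_hat(pts, area, h_values):
    """K-function estimate without edge correction (adequate for a difference statistic)."""
    n = len(pts)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.array([area / (n * n) * np.sum(d <= h) for h in h_values])

def random_labeling_envelopes(pts, is_case, area, h_values, s=200, rng=None):
    """Observed D(h) = K_case(h) - K_control(h) and 5%/95% random-labeling envelopes."""
    rng = np.random.default_rng(rng)
    is_case = np.asarray(is_case, bool)
    n1, n = int(is_case.sum()), len(pts)
    def d_of(lab):
        return k_hat(pts[lab], area, h_values) - k_hat(pts[~lab], area, h_values)
    d_obs = d_of(is_case)
    sims = []
    for _ in range(s):
        lab = np.zeros(n, bool)
        lab[rng.choice(n, n1, replace=False)] = True        # relabel, locations held fixed
        sims.append(d_of(lab))
    sims = np.array(sims)
    return d_obs, np.percentile(sims, 5, axis=0), np.percentile(sims, 95, axis=0)
```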

Example 1.5 (Lightning strikes. Continued) There are two types of


lightning flashes between the earth surface (or objects on it) and a storm cloud.
Most frequent are flashes where a channel of air with negative charge travels
from the bottom of the cloud. Flashes with positive polarity commence from
the upper regions of the cloud, occur less frequently, are more scattered, but
also more destructive. We extracted from the lightning strikes data set those
flashes that occurred off the coast of Florida, Georgia, and South Carolina.
Of these 820 flashes, np = 67 were positive flashes.

32

31
Latitude

30

29

-80.5 -80.0 -79.5 -79.0 -78.5


Longitude

Figure 3.10 Lightning flashes off the coast of Florida, Georgia, and South Carolina.
Empty circles depict flashes with negative charge (nn = 753), closed circles depict
flashes with positive charge (np = 76).

Figure 3.11 displays the L-functions for the two types of events, their differ-
ence, and 5% and 95% simulation envelopes for D(h) = L_p(h) − L_n(h) based on
200 random assignments of labels to events. There is no evidence that lightning
strikes of different polarity differ in their degree of clustering. In carrying out
the simulations, the same bounding shape is assumed for all patterns, that
based on the data for both event types combined.

Figure 3.11 Observed L-functions for flashes with positive and negative charge in
pattern of Figure 3.10 and their difference. Dotted lines depict 5 and 95 percentile
envelopes from 200 random labelings of polarity.

3.5 The Inhomogeneous Poisson Process

Most processes deviate from complete spatial randomness in some fashion.


For example, events may be independent in non-overlapping subregions, but
the intensity λ(s) with which they occur is not homogeneous throughout D.
More events will then be located in regions where the intensity is large, and
fewer events will be located in regions where λ(s) is small. Thus, the resulting
point patterns often appear clustered. In fact, this is typical for geographi-
cal processes based on human populations. Residences where people live are
clustered into cities, towns, school districts, and neighborhoods. In analyzing
events that relate to people, we have to adjust for this geographical variation
in our spatial analyses.

Example 3.4 GHCD 9 infant birth weights. As an example, consider


the case-control study of Rogers, Thompson, Addy, McKeown, Cowen, and
DeCoulfé (2000) for which the study area comprised 25 contiguous counties in
southeastern Georgia, collectively referred to as Georgia Health Care District

9 (GHCD9) (Figure 3.12). One of the purposes of the study was to examine
geographic risk factors associated with the risk of having a very low birth
weight (VLBW) baby, one weighing less than 1,500 grams at birth. Cases
were identified from all live-born, singleton infants born between April 1, 1986
and March 30, 1988 and the locations of the mothers’ addresses are shown in
Figure 3.13.


Figure 3.12 Georgia Health Care District 9.

Notice how the aggregated pattern in the locations of the cases corresponds
to the locations of the cities and towns in Georgia Health Care District 9. A
formal test of CSR will probably not tell us anything we did not already know.
How much of this clustering can be attributed to a geographical pattern in
cases of very low birth weight infants and how much is simply due to clustering
in residences is unclear. To separate out these two confounding issues, we
need to compare the geographic pattern in the cases to that based on a set
of controls that represent the geographic pattern in infants who were born
with normal birth weights. Controls were selected for this study by drawing
a 3% random sample of all live-born infants weighing more than 2,499 grams
at birth. This sampling was constrained so that the controls met the same
residency and time frame requirements as the case subjects. Their geographic
distribution is shown in Figure 3.14. Notice that the locations of both the
cases and the controls appear to be clustered. We can use the controls to
quantify a background geographical variation in infant birth weights and then
assess whether there are differences in the observed spatial pattern for babies
born with very low birth weights. In order to do this, we need a different
null hypothesis than the one provided by CSR. One that is often used is
called the constant risk hypothesis (Waller and Gotway, 2004). Under the

Figure 3.13 Cases of very low birth weight babies in Georgia Health Care District
9 from Rogers et al. (2000). The locations have been randomly relocated to protect
confidentiality.

constant risk model, the probability of being an event is the same, regardless
of location (e.g., each baby born to a mother residing in Georgia Health Care
District 9 has the same risk of being born with a very low birth weight).
Under the constant risk hypothesis, we expect more events in areas with more
individuals. Clusters of cases in high population areas could violate CSR but
would not necessarily violate the constant risk hypothesis. Thus, choosing
the constant risk hypothesis as a null model allows us to refine the question
of interest from “are the cases clustered?” (the answer to which we already
know is probably “yes”) to the question “are the cases more clustered than
we would expect under the constant risk hypothesis?” Answering this latter
question allows adjustment for any patterns that might occur among all the
individuals within a domain of interest.

Statistical methods that allow us to use the constant risk hypothesis as a


null model are often based on an inhomogeneous intensity function. When
point processes are studied through counting measures, the number of events
in region A, N (A), is a random variable and the usual expectations can be
constructed from the mass function of N (A). Since N (A) is an aggregation
of events in region A, there is a function λ(s) whose integration over A yields
its expected value. Earlier, this function was termed the (first-order) intensity
and defined as a limit,

λ(s) = lim_{|ds|→0} E[N(ds)]/|ds|.   (3.11)

Figure 3.14 Cases of very low birth weight babies in Georgia Health Care District 9
and Controls from Rogers et al. (2000). The locations have been randomly relocated
to protect confidentiality.

The connection with the average number of events in region A is simply

E[N(A)] = µ(A) = ∫_A λ(s) ds.

Studying point patterns through λ(s) rather than through E[N (A)] is often
mathematically advantageous because it eliminates the dependency on the
size (and shape) of the area A. In practical applications, when an estimate of
the intensity function is sought, an area context is required.

3.5.1 Estimation of the Intensity Function

Even for homogeneous processes it is useful to study the intensity of events


more locally, for example, to determine whether to proceed with an analysis of
'
the second-order behavior. In practice, spatially variable estimates λ(s) of the
intensity at location s are obtained by nonparametric smoothing of quadrat
counts or by methods of density estimation.
To see the close relationship between density estimation and intensity esti-
mation consider a random sample y1 , · · · , yn from the distribution of random
variable Y . An estimate of the density function f (y) at y0 can be found from
the number of sample realizations within a certain distance h from y0 ,
f̂(y_0) = (1/(nh)) Σ_{i=1}^{n} k((y_i − y_0)/h),   (3.12)

where k(t) is the uniform density on −1 ≤ t ≤ 1,

k(t) = { 0   if |y_i − y_0| > h,
       { 1   otherwise.

If the neighborhood h is small, f̂(y_0) is a nearly unbiased estimate of f(y_0)


but suffers from large variability. With increasing window width h, the es-
timate becomes smoother, less variable, and more biased. A mean-squared-
error-based procedure such as cross-validation can be used to determine an
appropriate value for the parameter h.
Rather than choosing a uniform kernel function that gives equal weight to
all points within the window y_0 ± h, we use modal kernel functions. Popular
kernel functions are the Gaussian kernel

k(t) = (1/√(2π)) exp{−t²/2},

the quadratic kernel

k(t) = { 0.75(1 − t²)   |t| ≤ 1,
       { 0              otherwise,

or the minimum variance kernel

k(t) = { (3/8)(3 − 5t²)   |t| ≤ 1,
       { 0                otherwise.

The choice of the kernel function is less important in practice than the choice of
the bandwidth. A function k(u) can serve as a kernel provided that ∫ k(u) du = 1
and ∫ u k(u) du = 0.
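Written out as code (our illustration; NumPy is assumed), the three kernels are:

```python
import numpy as np

def gaussian(t):
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def quadratic(t):
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t**2), 0.0)

def minimum_variance(t):
    return np.where(np.abs(t) <= 1, 3 / 8 * (3 - 5 * t**2), 0.0)
```

Each integrates to one and is symmetric about zero, so the two conditions above are satisfied.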
For a spatial point pattern, density estimation produces an estimate of the
probability of observing an event at location s and integrates to one over the
domain A. The relationship between the density fA (s) on A and the intensity
λ(s) is

λ(s) = f_A(s) ∫_A λ(u) du = f_A(s) µ(A).
The intensity and density are proportional. For a process in R¹, we modify
the density estimator (3.12) as

λ̂(s_0) = (1/(ν(A)h)) Σ_{i=1}^{n} k((s_i − s_0)/h),   (3.13)

to obtain a kernel estimator of the first-order intensity.


In the two-dimensional case, the univariate kernel function needs to be
replaced by a function that can accommodate two coordinates. A product-
kernel function is obtained by multiplying two univariate kernel functions.
This choice is frequently made for convenience; it implies the absence of interaction
between the coordinates. The bandwidths can be chosen differently
for the two dimensions, but kernels with spherical contours are common. If
x_i and y_i denote the coordinates of location s_i, then the product-kernel
approach leads to the intensity estimator

λ̂(s_0) = (1/(ν(A) h_x h_y)) Σ_{i=1}^{n} k((x_i − x_0)/h_x) k((y_i − y_0)/h_y),   (3.14)

where h_x and h_y are the bandwidths in the respective directions of the co-
ordinate system. The independence of the coordinates can be overcome with
bivariate kernel functions. For example, elliptical contours can be achieved
with a bivariate Gaussian kernel function with unequal variances. A non-zero
covariance of the coordinates introduces a rotation.
The expressions above do not account for edge effects, which can be sub-
stantial. Diggle (1985) suggested an edge-corrected kernel intensity estimator
with a single bandwidth,

λ̂(s) = (1/p_h(s)) Σ_{i=1}^{n} (1/h²) k((s − s_i)/h).

The denominator p_h(s) = ∫_A h^{-2} k((s − u)/h) du serves as the edge correction.
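As a sketch of how (3.14) might be evaluated on a grid of locations (our own illustration, not the text's code; NumPy, the Gaussian kernel, and the function names are assumptions, and no edge correction is applied):

```python
import numpy as np

def gaussian(t):
    return np.exp(-t**2 / 2) / np.sqrt(2 * np.pi)

def product_kernel_intensity(events, grid, area, hx, hy, k=gaussian):
    """Evaluate the product-kernel intensity estimator (3.14) at the rows of `grid`.

    events: n x 2 event coordinates; grid: m x 2 evaluation locations;
    area: nu(A); hx, hy: bandwidths for the x and y directions.
    """
    u = (grid[:, None, 0] - events[None, :, 0]) / hx        # (m, n) scaled x-differences
    v = (grid[:, None, 1] - events[None, :, 1]) / hy        # (m, n) scaled y-differences
    return np.sum(k(u) * k(v), axis=1) / (area * hx * hy)
```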

3.5.2 Estimating the Ratio of Intensity Functions

In many applications, the goal of the analysis is the comparison of spatial


patterns between two groups (e.g., between males and females, between cases
and controls). Suppose we have n1 events of one type and n2 events of an-
other type and let λ1 (s) and λ2 (s) be their corresponding intensity functions.
It seems natural to estimate the ratio λ_1(s)/λ_2(s) by the ratio of the corresponding
kernel intensity estimates λ̂_1(s)/λ̂_2(s), where λ̂_i(s) is given in (3.14).
Since the intensity function is proportional to the density function, Kelsall
and Diggle (1995) suggest inference (conditional on n_1 and n_2) based on

r̂(s) = log{f̂_1(s)/f̂_2(s)},

where f_1 and f_2 are the densities of the two processes and f̂_1 and f̂_2 are
their corresponding kernel density estimators. Mapping r̂(s) provides a spatial
picture of the (logarithm of the) probability of observing an event of one type
rather than an event of the other type at location s in D.
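A compact sketch of this estimator (ours, not the text's implementation) with a common bandwidth h for both groups and a bivariate Gaussian kernel, whose infinite tails keep the denominator positive; NumPy and the function names are assumptions.

```python
import numpy as np

def kernel_density(events, grid, h):
    """Kernel density estimate f-hat(s) at the rows of `grid` (bivariate Gaussian kernel)."""
    u = (grid[:, None, 0] - events[None, :, 0]) / h
    v = (grid[:, None, 1] - events[None, :, 1]) / h
    return np.sum(np.exp(-(u**2 + v**2) / 2) / (2 * np.pi), axis=1) / (len(events) * h**2)

def log_relative_risk(cases, controls, grid, h):
    """r-hat(s) = log{f1-hat(s)/f2-hat(s)} with the same bandwidth for both groups."""
    return np.log(kernel_density(cases, grid, h) / kernel_density(controls, grid, h))
```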

Example 3.4 (Low birth weights. Continued) Applying this procedure to


the case-control data considered in the previous section we obtain the surface
shown in Figure 3.15. This shows the relative risk of a VLBW birth at every
location within Georgia Health Care District 9. Naturally, the eye is drawn to
areas with the highest risk, but care must be taken in interpreting the results.
First, in drawing such a map, we implicitly assume that r(s) is a continuous
function of location s, which is somewhat unappealing. There are probably
many locations where it is impossible for people to live and for which r(s) is

Figure 3.15 Relative risk of very low birth weight babies in Georgia Health Care Dis-
trict 9. Conclusions are not epidemiologically valid since locations and case/control
status were altered to preserve confidentiality.

inherently zero. Second, zero estimates for f2 (s) are clearly problematic for
computation of r'(s), but certainly can occur and do have meaning as part of
the estimation of f2 .
An advantage to choosing a kernel with infinite tails such as the bivariate
Gaussian kernel, as we have done here, is that the estimate of f2 (s) is non-zero
for all locations. Third, the choice for the bandwidths is critical and different
choices can have a dramatic effect on the resulting surface. In constructing
the map in Figure 3.15 we experimented with a variety of choices for the
bandwidth and the underlying grid for which the kernel density estimates are
obtained. The combination of the two (bandwidth and grid spacing) reflects
the tradeoff between resolution and stability. A fine grid with a small band-
width will allow map detail, but the resulting estimates may be unstable. A
coarse grid with a large bandwidth will produce more stable estimates, but
much of the spatial variation in the data will be smoothed away. Also, there
are large regions within Georgia Health Care District 9 without controls. This
leads to unstable estimates for certain bandwidths. We began by choosing the
bandwidths according to automatic selection criteria (e.g., cross validation,
Wand and Jones, 1995), but found the results were visually uninteresting; the
resulting map appeared far too smooth. Because of the large gaps where there
are no controls, we took the same bandwidth for the cases as for the controls
and then increased it systematically until we began to lose stability in the
estimates. We may have actually crossed the threshold here: the area with a
high relative risk on the western edge of the domain may be artificially high,
114 MAPPED POINT PATTERNS

reflecting estimate instability and edge effects, rather than a high relative risk.
This illustrates the importance of careful estimation and interpretation of the
results, particularly if formal inference (e.g., hypothesis tests, see Kelsall and
Diggle, 1995 and Waller and Gotway, 2004) will be conducted using the re-
sulting estimates. However, even with a few potential anomalies, and the odd
contours that result from the kernel, Figure 3.15 does allow us to visualize
the spatial variation in the risk of very low birth weight.

3.5.3 Clustering and Cluster Detection

While the K-function can be used to assess clustering in events that arise from
a homogeneous Poisson process, the assumption of stationarity upon which it
is based precludes its use for inhomogeneous Poisson processes. Thus, Cuzick
and Edwards (1990) adapted methods based on nearest neighbor distances
(described in §3.3) for use with inhomogeneous Poisson processes. Instead of
assuming events occur uniformly in the absence of clustering, a group of con-
trols is used to define the baseline distribution and nearest neighbor statistics
are based on whether the nearest neighbor to each case is another case or a
control. The null hypothesis of no clustering is that each event is equally likely
to have been a case or a control, i.e., the random labeling hypothesis.
Let {s_1, . . . , s_n} denote the locations of all events and assume n_1 of these
are cases and n_2 are controls. Let

δ_i = { 1   if s_i is a case,
      { 0   if s_i is a control,

and

d_i = { 1   if the nearest neighbor to s_i is a case,
      { 0   if the nearest neighbor to s_i is a control.

The test statistic represents the number of the q nearest neighbors of cases
that are also cases,

T_q = Σ_{i=1}^{n} δ_i d_i^q,

where d_i^q denotes the number of cases among the q nearest neighbors of s_i
(so that d_i^1 = d_i), and q is specified by the user. For inference, Cuzick and Edwards (1990)
derive an asymptotic test based on the Gaussian distribution. A Monte Carlo
test based on the random labeling hypothesis is also applicable.
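The sketch below (our illustration, not the authors' code) computes T_q by brute force and a Monte Carlo p-value under random labeling; NumPy and the function names are assumptions.

```python
import numpy as np

def t_q(pts, is_case, q):
    """Cuzick-Edwards statistic: cases among the q nearest neighbors of each case."""
    is_case = np.asarray(is_case, bool)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :q]                   # indices of the q nearest neighbors
    return int(np.sum(is_case[:, None] & is_case[nn]))

def random_labeling_pvalue(pts, is_case, q, s=999, rng=None):
    """Monte Carlo p-value for T_q under random labeling (locations held fixed)."""
    rng = np.random.default_rng(rng)
    n1, n = int(np.sum(is_case)), len(pts)
    t_obs = t_q(pts, is_case, q)
    t_sim = np.empty(s)
    for b in range(s):
        lab = np.zeros(n, bool)
        lab[rng.choice(n, n1, replace=False)] = True
        t_sim[b] = t_q(pts, lab, q)
    return t_obs, (1 + np.sum(t_sim >= t_obs)) / (s + 1)
```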

Example 3.4 (Low birth weights. Continued) We use Cuzick and Ed-
wards’ NN test to assess whether there is clustering in locations of babies born
with very low birth weights in Georgia Health Care District 9. This test is
not entirely applicable to this situation in that it assumes each event location
must be either a case or a control. However, because people live in apart-
ment buildings, there can be multiple cases and/or controls at any location;
we cannot usually measure a person’s location so specifically. This situation

is common when addresses in urban areas have been geocoded. In Georgia


Health Care District 9, there were 7 instances of duplicate locations, each
containing 2–3 records. Thus, in order to use Cuzick and Edwards’ NN
test, we randomly selected one record from each of these groups.
In order to perform the test we need to specify q, the number of nearest
neighbors. Small choices for q tend to make the test focus more locally, while
larger values of q allow a more regional assessment of clustering. Thus, to some
degree, different values of q indicate the scale of any observed clustering. With
this in mind, we chose q = 1, 5, 10, 20 and the results are summarized in Table
3.5.

Table 3.5 Results from Cuzick and Edward’s NN Test Applied to Case/Control Data
in Georgia Health Care District 9. The p-values were obtained from Monte Carlo sim-
ulation. Conclusions are not epidemiologically valid since locations and case/control
status were altered to preserve confidentiality.

q Tq p-value
1 81 0.0170
5 388 0.0910
10 759 0.2190
20 1464 0.3740

The results in Table 3.5 seem to indicate that there is some clustering
among the cases at very local levels. As q is increased, the test statistics are
not significant, indicating that, when considering Georgia Health Care District
9 overall, there is no strong evidence for clustering among locations of babies
born with very low birth weights. Note, however, that T_{q2} is correlated with
T_{q1} for q1 < q2 since the q2 nearest neighbors include the q1 nearest neighbors.
Ord (1990) suggests using contrasts between statistics (e.g., T_{q2} − T_{q1}) since
they exhibit considerably less correlation and can be interpreted as excess
cases between the q1 and the q2 nearest neighbors of cases.

As Stroup (1990) notes, if we obtain a significant result from Cuzick and


Edwards’ test, the first response should be “Where’s the cluster?” However,
answering this question is not what this, or other methods for detecting spatial
clustering, are designed to do. Besag and Newell (1991) stress the need to
clearly distinguish between the concepts of detecting clustering and detecting
clusters. Clustering is a global tendency for events to occur near other events. A
cluster is a local collection of events that is inconsistent with the hypothesis of
no clustering, either CSR or constant risk (Waller and Gotway, 2004). Cuzick
and Edwards’ test, and tests based on the K-function are tests of clustering.
We need different methods to detect clusters.
Interest in cluster detection has grown in recent years because of the pub-
lic health and policy issues that surround their interpretation. For example,

a cluster of people with a rare disease could indicate a common, local envi-
ronmental contaminant. A cluster of burglarized residences can alert police
to “hot spots” of crime that warrant increased surveillance. Thus, what is
needed in such situations is a test that will: 1) detect clusters; 2) determine
whether they contain a significantly higher or lower number of events than we
would expect; and 3) identify the location and extent of the cluster. There are
several methods for cluster detection (see, e.g., Waller and Gotway, 2004, for
a comprehensive discussion and illustration), but the most popular is the spa-
tial scan statistic developed by Kulldorff and Nagarwalla (1995) and Kulldorff
(1997) and popularized by the SatScan software (Kulldorff and International
Management Services, Inc., 2003).
Scan statistics use moving windows to compare a value (e.g., a count of
events or a proportion) within the window to the value outside of the window.
Kulldorff (1997) uses circular windows with variable radii ranging from the
smallest inter-event distance to a user-defined upper bound (usually one half
the width of the study area). The spatial scan statistic may be applied to
circles centered at specified grid locations or centered on the set of observed
event locations.
The spatial scan statistic developed by Kulldorff (1997) considers local like-
lihood ratio statistics that compare the likelihood under the constant risk
hypothesis to various alternatives where the proportion of cases within the
window is greater than that outside the window. Let C denote the total num-
ber of cases and let c be the total number of cases within a window. Under
the assumption that the number of cases follows a Poisson distribution, the
likelihood function for a given window is proportional to
E c Fc C C − c DC−c
I(·),
n C −n
where n is the number expected assuming constant risk assumption over the
study domain. I(·) denotes the indicator function which, when high propor-
tions are of interest, is equal to 1 when the window has more cases than
expected.
The likelihood function is maximized over all windows and the window with
the maximum likelihood function is called “the most likely cluster.” Signifi-
cance is determined by using Monte Carlo simulation. Using random labeling,
cases are randomly assigned to event locations, the likelihood function is com-
puted for each window, and the maximum value of this function is determined.
In this way the distribution of the maximum likelihood function is simulated
(Turnbull, Iwano, Burnett, Howe, and Clark, 1990). As a result, the spatial
scan statistic provides a single p-value for the study area, and avoids the
multiple testing problem that plagues many other approaches.
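To illustrate the mechanics (our own sketch, not the SatScan implementation), the code below scans circles centered at the observed events, takes the population at risk to be the combined set of cases and controls, caps the window at half of all events for simplicity, and obtains a Monte Carlo p-value by random labeling. NumPy and the function names are assumptions, and the brute-force loops are intended only for small examples.

```python
import numpy as np

def window_llr(c, n, C):
    """Log of (c/n)^c ((C-c)/(C-n))^(C-c), the window statistic, for c > n."""
    llr = c * np.log(c / n)
    if C > c:
        llr += (C - c) * np.log((C - c) / (C - n))
    return llr

def most_likely_cluster(pts, is_case):
    """Best circular window: center event, radius, observed and expected cases."""
    is_case = np.asarray(is_case, float)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    C, N = is_case.sum(), len(pts)
    best_llr, best = -np.inf, None
    for i in range(N):                                   # circles centered at event i
        order = np.argsort(d[i])
        cases_in = np.cumsum(is_case[order])             # cases within the m closest events
        for m in range(1, N // 2 + 1):                   # window capped at half the events
            c, n = cases_in[m - 1], C * m / N            # observed and expected cases
            if c > n:
                llr = window_llr(c, n, C)
                if llr > best_llr:
                    best_llr = llr
                    best = dict(center=i, radius=d[i, order[m - 1]],
                                observed=int(c), expected=n)
    return best_llr, best

def scan_pvalue(pts, is_case, s=99, rng=None):
    """Monte Carlo p-value for the maximum window statistic under random labeling."""
    rng = np.random.default_rng(rng)
    n1, N = int(np.sum(is_case)), len(pts)
    obs = most_likely_cluster(pts, is_case)[0]
    sims = []
    for _ in range(s):
        lab = np.zeros(N, bool)
        lab[rng.choice(N, n1, replace=False)] = True
        sims.append(most_likely_cluster(pts, lab)[0])
    return (1 + np.sum(np.array(sims) >= obs)) / (s + 1)
```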

Example 3.4 (Low birth weights. Continued) We use the spatial scan
statistic developed by Kulldorff (1997) and Kulldorff and International Man-
agement Services, Inc. (2003) to find the most likely cluster among locations
THE INHOMOGENEOUS POISSON PROCESS 117

of babies born with very low birth weights in Georgia Health Care District 9.
We note that this “most likely” cluster may not be at all “likely,” and thus
we rely on the p-value from Monte Carlo testing to determine its significance.
We assumed a Poisson model and allowed the circle radii to vary from the
smallest inter-event distance to one half of the largest inter-event distance.
The results from the scan give the location and radius of the circular win-
dow that constitutes the most likely cluster and a p-value from Monte Carlo
testing. The results are shown in Figure 3.16.


Figure 3.16 Results from the spatial scan statistic. Conclusions are not epidemio-
logically valid since locations and case/control status were altered to preserve confi-
dentiality.

The most likely cluster is approximately 12 km in diameter and is located


just north of the city of Savannah. Under the constant risk hypothesis, 37.1
cases were expected here, but 64 were observed, leading to a relative risk of
1.72. The p-value of 0.002 indicates that the proportion of cases observed in
this window is significantly higher than we would expect under the constant
risk hypothesis. Note that these results are somewhat consistent with those
from Cuzick and Edwards’ test (Table 3.5) that indicated local clustering
among the cases. Note, however, that the power of spatial statistics can also
vary spatially. We have greater power to detect abnormalities in regions where
more events occurred. Since such regions are often urban, the results of the
spatial scan should be interpreted in this context.

3.6 Marked and Multivariate Point Patterns

3.6.1 Extensions

Up to this point we have focused on the random distribution of events through-


out a spatial domain. Little attention was paid to whether some additional
attribute is observable at the event location or whether the events were all
of the same type. Consider, for example, the distribution of trees through-
out a forest stand. Most foresters would not be satisfied with knowing where
the trees are. Tree attributes such as breast height diameter, age, height, and
species are important, as well as understanding whether these attributes are
related to the spatial configuration of trees. It is well established, for example,
that in managed forests a tree’s diameter is highly influenced by its available
horizontal growing area, whereas its height is primarily a function of soil qual-
ity. This suggests that the distribution of tree diameters in a forest stand is
related to the intensity of a point process that governs the distribution of
trees.
To make the connection between events and attributes observed at event
locations more precise, we recall the notation Z(s) for the attribute Z ob-
served at location s. In the first two chapters the notation Z(s) was present
throughout and it appears that we lost it somehow in discussing point pat-
terns. It never really left us, but up to this “point” patterns were just that:
points. The focus was on studying the distribution of the events itself. The
“unmarked” point pattern of previous sections is a special case of the marked
pattern, where the distribution of Z is degenerate (a mark space with a single
value).
In the vernacular of point process theory, Z is termed the mark variable.
It is a random variable; its support is called the mark space. The mark space
can be continuous or discrete; the diameter or height of a tree growing at s,
the depth of the lunar crater with center s, the value of goods stolen during
a burglary, are examples of marked processes with continuous mark variable.
The number of eggs in a bird’s nest at s or the grass species growing at s are
cases of discrete mark variables. Figure 3.17 is an example of a point process
with a binary mark variable. The spatial events represent tree locations in
a forest in Lansing, MI. The mark variable associated with each location
indicates whether the tree is a hickory or a maple.
So why are we treating marked point processes separately from, say, geo-
statistical data? Well, we are and we are not. The “big” difference between
the two types of data is the randomness of the spatial domain, of course. In
geostatistical data, the domain is continuous and observations are collected
at a finite number of points. The sample locations can be determined by a
random mechanism, such as stratified sampling, or be chosen by a determin-
istic method. In either case, the samples represent an incomplete observation
of a random function and the random choosing of the sample locations does
not enter into a geostatistical analysis as a source of randomness. A mapped,

Figure 3.17 Marked point pattern. Distribution of hickories (open circles) and
maples (closed circles) in a forest in Lansing, MI (Gerrard, 1969; Diggle, 1983).
The mark variable is discrete with two levels.

marked point pattern, on the other hand, represents the complete observation
of all event locations. There are no other locations at which the attribute Z
could have been observed. Consequently, the notion of a continuous random
surface expanding over the domain does not arise and predicting the value of
Z at an unobserved location appears objectionable. Why would one want to
predict the diameter of trees that do not exist? For planning purposes, for
example. In order to do so, one views the marked point process conditional
and treats the event locations as if they were non-stochastic.
Because there are potentially two sources of randomness in a marked point
pattern, the randomness of the mark variable at s given that an event occurred
at that location and the distribution of events according to a stochastic pro-
cess, one can choose to study either conditional on the other, or to study
them jointly. If the mark variable is not stochastic as in §3.1–3.4, one can
still ask questions about the distribution of events. When the mark variable
is stochastic, we are also interested in studying the distributional properties
of Z. In the tree example we may inquire about

• the degree of randomness, clustering, or regularity in the distribution of


trees;
• the mean and variance of tree heights;
• the correlation between heights of trees at different locations;
• whether the distribution of tree heights depends on the location of trees.
A second extension of the unmarked point processes leads to multivariate
point patterns, which are collections of patterns for events of different types.
Møller and Waagepetersen (2003) refer to them as multitype patterns. Let
$s_1^m, \cdots, s_{n_m}^m$ denote the locations at which events of type m = 1, · · · , M occur
and assume that a multivariate process generates the events in D ⊂ R2. The
counting measure Nm(A) represents the number of events of type m in the
Borel set A and the event counts for the entire pattern are collected in the
(M × 1) vector N(A) = [N1(A), · · · , NM(A)]. Basic questions that arise with multivariate
point patterns concern
• the multinomial distribution of events of each type;
• the spatial distribution of events;
• whether the proportions with which events of different types occur depend
on location.
The connection between multivariate and marked patterns is transparent.
Rather than counting the number of events in each of the M patterns, one
could combine the patterns into a single pattern of $\sum_{m=1}^{M} n_m$ events and
associate with each event location a mark variable that indicates the pattern
type. A particularly important case is that of a two-level mark variable, the
bivariate point process (Figure 3.18).
A unified treatment of marked, unmarked, and multivariate point processes
is possible by viewing the process on the product space formed by the mark
space and the spatial domain. For more details along
these lines the reader is referred to the texts by Stoyan, Kendall, and Mecke
(1995) and by Møller and Waagepetersen (2003).

3.6.2 Intensities and Moment Measures for Multivariate Point Patterns

Recall that for a univariate point pattern the first- and second-order intensities
are defined as
$$\lambda(\mathbf{s}) = \lim_{|d\mathbf{s}| \to 0} \frac{E[N(d\mathbf{s})]}{|d\mathbf{s}|}$$
$$\lambda_2(\mathbf{s}_i, \mathbf{s}_j) = \lim_{|d\mathbf{s}_i| \to 0,\,|d\mathbf{s}_j| \to 0} \frac{E[N(d\mathbf{s}_i)\,N(d\mathbf{s}_j)]}{|d\mathbf{s}_i|\,|d\mathbf{s}_j|}.$$

For a multivariate pattern Nm(A) is the count of events of type m in the
region (Borel set) A of the mth pattern.

Figure 3.18 Lansing tree data as a bivariate point pattern. The bivariate pattern is
the collection of the two univariate patterns. Their superposition is a marked process
(Figure 3.17).

The intensities for the component patterns are similarly defined as


$$\lambda_m(\mathbf{s}) = \lim_{|d\mathbf{s}| \to 0} \frac{E[N_m(d\mathbf{s})]}{|d\mathbf{s}|}$$
$$\lambda_{m,2}(\mathbf{s}_i, \mathbf{s}_j) = \lim_{|d\mathbf{s}_i| \to 0,\,|d\mathbf{s}_j| \to 0} \frac{E[N_m(d\mathbf{s}_i)\,N_m(d\mathbf{s}_j)]}{|d\mathbf{s}_i|\,|d\mathbf{s}_j|}.$$

Summary statistics and estimates of the first- and second-order properties
can be computed as in the univariate case for each of the M pattern types.
In addition, one can now draw on measures that relate properties between
pattern types. For example, the (second-order) cross-pattern intensity
$$\lambda_{ml,2}(\mathbf{s}_i, \mathbf{s}_j) = \lim_{|d\mathbf{s}_i^m| \to 0,\,|d\mathbf{s}_j^l| \to 0} \frac{E[N_m(d\mathbf{s}_i^m)\,N_l(d\mathbf{s}_j^l)]}{|d\mathbf{s}_i^m|\,|d\mathbf{s}_j^l|}$$
is an obvious extension of λm,2.
As in the univariate case, the second-order intensities are difficult to inter-
pret and reduced moment measures are used instead. The K-functions for the
individual component patterns are obtained as in the univariate case under
the usual assumptions of stationarity and isotropy. To study the interaction
between pattern types, recall that for a univariate pattern λK(h) represents
the expected number of additional events within distance h of an arbitrary
event. Considering in a multivariate case the expected number of events of
type m within distance h of an event of type l under stationarity and isotropy
of N in R2 leads to the cross K-function
$$K_{ml}(h) = \frac{2\pi}{\lambda_m \lambda_l} \int_0^h x\,\lambda_{ml,2}(x)\,dx, \qquad (3.15)$$

where λml,2 is the isotropic cross-pattern intensity (Hanisch and Stoyan, 1979;
Cressie, 1993). For the univariate case, Ripley’s edge corrected estimator of
the K-function is
$$\hat{K}(h) = \frac{\nu(A)}{n^2} \sum_{i=1}^{n} \sum_{j \neq i} w(\mathbf{s}_i, \mathbf{s}_j)^{-1}\, I(h_{ij} \le h),$$
where w(si, sj) is the proportion of the circumference of a circle that is within
the study region, passes through sj and is centered at si.
In the multivariate case let $h_{ij}^{ml} = ||\mathbf{s}_i^m - \mathbf{s}_j^l||$ denote the distance between
the ith point of type m and the jth point of type l. An edge corrected estimator
of the cross K-function between the mth and lth pattern is
$$\hat{K}_{ml}(h) = \frac{\nu(A)}{n_m n_l} \sum_{i=1}^{n_m} \sum_{j=1}^{n_l} w(\mathbf{s}_i^m - \mathbf{s}_j^l)^{-1}\, I(h_{ij}^{ml} \le h),$$
where $w(\mathbf{s}_i^m - \mathbf{s}_j^l)$ is the proportion of the circumference of a circle within
A that passes through $\mathbf{s}_j^l$ and is centered at $\mathbf{s}_i^m$ (Hanisch and Stoyan, 1979;
Cressie, 1993, p. 698).
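To make the estimator concrete, the following is a minimal Python sketch of an
estimate of Kml(h) on a window of known area ν(A). It is not the book's implementation:
the edge-correction weights w(·) are simply set to 1 (so the estimate is biased near the
boundary), and the names cross_k, pts_m, and pts_l are hypothetical.

import numpy as np

def cross_k(pts_m, pts_l, h_values, area=1.0):
    # Naive estimator of the cross K-function K_ml(h): nu(A)/(n_m*n_l) times
    # the number of (type-m, type-l) pairs within distance h. The edge-correction
    # weights w(s_i^m - s_j^l) are omitted (set to 1) in this sketch.
    pts_m = np.asarray(pts_m, dtype=float)   # (n_m, 2) coordinates of type-m events
    pts_l = np.asarray(pts_l, dtype=float)   # (n_l, 2) coordinates of type-l events
    n_m, n_l = len(pts_m), len(pts_l)
    # pairwise distances h_ij^{ml} = ||s_i^m - s_j^l||
    d = np.sqrt(((pts_m[:, None, :] - pts_l[None, :, :]) ** 2).sum(axis=2))
    return np.array([area / (n_m * n_l) * np.sum(d <= h) for h in h_values])

For a pattern such as the Lansing trees one would pass the hickory and maple
coordinates as the two patterns and ν(A) = 1 for the unit square.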

3.7 Point Process Models

The homogeneous Poisson process provides the natural starting point for a
statistical investigation of an observed point pattern. Rejection of the CSR
hypothesis does not come as a great surprise in many applications and you are
naturally confronted with the question “What kind of pattern is it?” If the
CSR test suggests a clustered pattern, one may want to compare, for example,
the observed K-function to simulated K-functions from a cluster process.
We can only skim the surface of point process models in this chapter. A
large number of models have been developed and described for clustered and
regular alternatives; details can be found in, e.g., Diggle (1983), Cressie (1993),
Stoyan, Kendall, and Mecke (1995), and Møller and Waagepetersen (2004).
The remainder of this chapter draws on these sources as well as on Appendix
A9.9.11 in Schabenberger and Pierce (2002). The models were chosen for their
representativeness for a particular data-generating mechanism, and because
of their importance in theoretical and applied statistics. When you analyze
an observed spatial point pattern, keep in mind that based on a single re-
alization of the process unambiguous identification of the event-generating
point process model may not be possible. For example, an inhomogeneous

Poisson process and a Cox Process (see below) lead to clustering of events.
The mechanisms are entirely different, however. In the case of the IPP, events in
non-overlapping regions are independent and clustering arises because the in-
tensity function varies spatially. In the Cox process, clustering occurs because
events are dependent; the (average) intensity may be homogeneous. Certain
Poisson cluster processes, where one point process generates parent events
and a second process places offspring events around the locations of the par-
ent events, can be made equivalent to a Poisson process with a randomly
varying intensity.
Processes that are indistinguishable based on a single realization can have
generating mechanisms that suggest very different biological and physical in-
terpretations. It behooves the analyst to consider process models whose genesis
is congruent with the subject-matter theory. Understanding the genesis of
the process models also holds important clues about how to simulate realiza-
tions from the model.

3.7.1 Thinning and Clustering

One method of deriving a point process model is to apply a defined operation
to an existing process. Among the basic operations discussed by Stoyan
et al. (1995, Ch. 5) are superpositioning, thinning, and clustering. If Z1 (s),
Z2(s), · · ·, Zk(s) are point processes, then their superposition
$$Z(\mathbf{s}) = \bigcup_{i=1}^{k} Z_i(\mathbf{s})$$

is also a point process. If the Zi(s) are mutually independent homogeneous
Poisson processes with intensities λ1, · · ·, λk, then Z(s) is a homogeneous Pois-
son process with intensity $\sum_{i=1}^{k} \lambda_i$.
More important than the combining of processes is the operation by which
events in one process are eliminated based on some probability p; thinning.
Stoyan et al. (1995) distinguish the following types of thinning
• p-thinning. Each point in the pattern is retained with probability p and
eliminated with probability 1−p. The retention decisions can be represented
as N (A) independent Bernoulli trials with common success probability p.
• p(s)-thinning. The retention probabilities are given by the deterministic
function 0 ≤ p(s) ≤ 1.
• π-thinning. The thinning function is stochastic, a random field. A thinning
is obtained by drawing a realization p(s) of the random function π(s) and
applying p(s)-thinning.
These types of thinning are obvious generalizations, with p-thinning being
the most special case. They are important operations, because the properties
of the resultant process can relate quite easily to the properties of the original
124 MAPPED POINT PATTERNS

process to which thinning is applied. For this reason, one often applies thinning
operations to a homogeneous Poisson process, because its properties are well
understood.

Example 3.5 An inhomogeneous Poisson process can be constructed as the


p(s)-thinning of a homogeneous Poisson process. The basic result is the fol-
lowing. If a Poisson process Z(s) with intensity λ is subject to p(s)-thinning,
then the resulting process Z ∗ (s) is also a Poisson process with intensity λp(s).
To generate an inhomogeneous Poisson process on A with intensity α(s),
commence by generating a homogeneous Poisson process on A with inten-
sity λ = maxA {α(s)}. The thinning rule is to retain points from the original
pattern with probability p(s) = α(s)/λ. This is the idea behind the Lewis-
Shedler algorithm for simulating Poisson processes with heterogeneous inten-
sity (Lewis and Shedler, 1979).
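A minimal sketch of the Lewis-Shedler construction on a rectangle, assuming the
target intensity is supplied as a function and that a dominating constant
lam_max ≥ α(s) is available (the function and argument names are hypothetical):

import numpy as np

rng = np.random.default_rng(1)

def lewis_shedler(alpha, lam_max, width=1.0, height=1.0):
    # Simulate an inhomogeneous Poisson process on [0,width] x [0,height] by
    # p(s)-thinning a homogeneous Poisson process with dominating intensity
    # lam_max >= alpha(x, y) everywhere on the region.
    n = rng.poisson(lam_max * width * height)            # events of the dominating HPP
    x = rng.uniform(0.0, width, n)
    y = rng.uniform(0.0, height, n)
    keep = rng.uniform(size=n) < alpha(x, y) / lam_max   # retention probability p(s)
    return x[keep], y[keep]

# e.g., the intensity of Chapter problem 3.7: lambda(x, y) = exp(5x + 2y)
x, y = lewis_shedler(lambda x, y: np.exp(5 * x + 2 * y), lam_max=np.exp(7.0))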

Further properties of thinned processes are as follows.


1. If $\mu(A) = \int_A \lambda(\mathbf{s})\,d\mathbf{s}$ is the intensity measure of the original process and
µ∗(A) is the measure of the thinned process, then under p(s)-thinning
$\mu^*(A) = \int_A p(\mathbf{s})\,\mu(d\mathbf{s})$.

2. If the original process has intensity λ, then the thinned process has intensity
λ∗ (s) = λp(s). For π-thinning of a process with intensity λ, the resulting
intensity is λE[π(s)].
3. If Z(s) is a Poisson process subject to p(s)-thinning, then the thinned
process Z ∗ and the process Z \ Z ∗ of the removed points are independent
Poisson processes with intensities λp(s) and λ(1−p(s)), respectively (Møller
and Waagepetersen, 2003, p. 23).
4. The p-thinning of a stationary process yields a stationary process. p(s)-
thinning does not retain stationarity. The π-thinning of a stationary process
is a stationary process, provided that the random field π(s) is stationary.
5. The K-function of a point process is not affected by p-thinning. The in-
tensity and the expected number of extra events within distance h from an
arbitrary event are reduced by the same factor.
6. The K-function of a π-thinned process can be constructed from the K-
function of the original process and the mean and covariance of the π(s)
process. If the random field π(s) has mean ξ and covariance function
C(||h||) = E[π(0)π(h)] − ξ², then
$$K^*(h) = \frac{1}{\xi^2}\int_0^h \left(C(u) + \xi^2\right) dK(u).$$

7. The π-thinning of a Poisson process yields a Cox process (see below).



In an inhomogeneous Poisson process regions where the intensity is higher


receive more events per unit area than regions in which the intensity is
low. The result is a clustered appearance of the point pattern. Thinning
with a location-dependent probability function, whether it is deterministic or
stochastic, thus leads to clustered patterns. Areas with high retention proba-
bility have greater density of events. Although you can achieve aggregation of
events by thinning, clustering as a point process operation refers to a different
technique: the event at si is replaced with the realization of a separate point
process that has ni events. Each realization of the second process is referred
to as a cluster. The final process consists of the union (superposition) of the
events in the clusters. A convenient framework in which to envision clustering
operations is that of “parent” and “offspring” processes. First, a point pro-
cess generates k events, call these the parent process and parent events. At
each parent event, ni offspring events are generated according to a bivariate
distribution function which determines the coordinates of the offspring. A bi-
variate density with small dispersion groups offspring close to the parents,
forming distinct clusters. The Poisson cluster process (see below) arises when
the parent process is a Poisson process.

3.7.2 Clustered Processes

3.7.2.1 Cox Process

An inhomogeneous Poisson process creates aggregated patterns. Regions where


λ(s) is high receive a greater density of points compared to regions with low
intensity. If the intensity function is itself the realization of a stochastic pro-
cess, the resulting point process model is known as a doubly stochastic pro-
cess, or Cox process. The random intensity is denoted Λ(s) and λ(s) is a
particular realization. Since conditional on Λ(s) = λ(s) we obtain an inho-
mogeneous Poisson process, the (conditional) realizations of a Cox process
are non-stationary, yet the process may still be stationary, since properties of
the process are reckoned also with respect to the distribution of the random
intensity measure.
Writing $\mu(A) = \int_A \Lambda(\mathbf{s})\,d\mathbf{s}$, then
$$\Pr(N(A) = n) = E_\Lambda\left[\frac{1}{n!}\,\mu(A)^n \exp\{-\mu(A)\}\right],$$
where µ(A) is a random variable. On the contrary, in the inhomogeneous
Poisson process we have $\Pr(N(A) = n) = \frac{1}{n!}\mu(A)^n \exp\{-\mu(A)\}$ with µ(A)
a constant. Similarly, the first- and second-order intensities of a Cox process
are determined as expected values:
λ = E [Λ(s)]
λ2 (s1 , s2 ) = E [Λ(s1 )Λ(s2 )] .

If the random intensity measure Λ(s) is stationary, so is the Cox process and

similarly for isotropy. The intensity function Λ(s) = W λ, where W is a non-
negative random variable, for example, enables one to model clustered data
without violating first-order stationarity, whereas the inhomogeneous Poisson
process is a non-stationary process. Furthermore, events in disjoint regions in
a Cox process are generally not independent.
An interesting case is where Λ(s) = Λ is a Gamma(α, β) random variable
with probability density function, mean, and variance
$$f(\Lambda) = \frac{1}{\beta^\alpha \Gamma(\alpha)}\,\Lambda^{\alpha - 1}\exp\{-\Lambda/\beta\}, \qquad \Lambda > 0$$
$$E[\Lambda] = \alpha\beta \qquad\qquad \text{Var}[\Lambda] = \alpha\beta^2.$$
The number of events in region A is then a negative binomial random variable.
Also, since the variance of a negative binomial random variable exceeds that of
a Poisson variable with the same mean, it is seen how the additional variability
in the process that stems from the uncertainty in Λ increases the dispersion
of the data. As Stoyan et al. (1995, p. 156) put it, Cox processes are “super-
Poissonian.” The variance of the counts exceeds that of a Poisson process
with the same intensity; the counts are overdispersed. Also, since clustered processes
exhibit more variation in the location of events than the CSR process, the Cox
process is a point process model that implies clustering.
A “brute-force” method for simulating a Cox process is a two-step pro-
cedure that first generates a realization of the intensity process Λ(s). In the
second step, an inhomogeneous Poisson process is generated based on the
realization λ(s).
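For the simple case Λ(s) = Λ with the gamma mixing distribution above, the two
steps reduce to drawing Λ and then generating a homogeneous Poisson pattern given
that draw. The sketch below assumes a rectangular region; the function name
gamma_cox is hypothetical.

import numpy as np

rng = np.random.default_rng(2)

def gamma_cox(alpha, beta, width=1.0, height=1.0):
    # Step 1: draw the random intensity Lambda ~ Gamma(alpha, beta)
    #         (beta is the scale parameter, matching f(Lambda) above).
    lam = rng.gamma(shape=alpha, scale=beta)
    # Step 2: conditional on Lambda, generate a homogeneous Poisson process.
    n = rng.poisson(lam * width * height)
    return rng.uniform(0.0, width, n), rng.uniform(0.0, height, n)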

3.7.2.2 Poisson Cluster Process

The Poisson cluster process derives by applying a clustering operation to the


events in a Poisson process. Because of this generating mechanism, it is an
appealing point process model for investigations that involve “parents” and
“offspring.” The original process and the clustering are motivated as follows:

(i) The parent process is the realization of an inhomogeneous Poisson pro-


cess with intensity λ(s).
(ii) Each parent produces a random number N of offspring.
(iii) The positions of the offspring relative to their parents are distributed
according to a bivariate distribution function f(s).
(iv) The final process consists of the locations of the offspring only.

The Neyman-Scott process (Neyman and Scott, 1972) is a special case of


the Poisson cluster process. It is characterized by the following simplifications
of (ii) and (iii):

(ii∗) The number of offspring are realized independently and identically for
each parent according to a discrete probability mass function Pr(N = k) = $p_k$.
(iii∗ ) The positions of the offspring relative to their parents are distributed
independently and identically.
We note that some authors consider as Neyman-Scott processes those Pois-
son cluster processes in which postulate (i) is also modified to allow only
stationary parent processes, see, e.g., Stoyan et al. (1995, p. 157) for this
view, and Cressie (1993, p. 662) for the more general case. When the parent
process is stationary with intensity ρ, further simplifications arise. For exam-
ple, if additionally, f (s) is radially symmetric, the Neyman-Scott process is
also stationary. The intensity of the resulting cluster process is λ = ρE[N ].
If the parent process is a homogeneous Poisson process with intensity ρ and
f (s) is radially symmetric, it can be shown that the second-order intensity is
$$\lambda_2(h) = \rho^2 E[N]^2 + \rho\,E[N(N-1)]\,f(h).$$
In this expression h denotes the Euclidean distance between two events in
the same cluster (offspring from the same parent). Invoking the relationship
between the second-order intensity and the K-function, the latter can be
derived as
$$K(h) = \pi h^2 + \frac{1}{\rho}\,\frac{E[N(N-1)]}{E[N]^2}\,F(h),$$
where F (h) is the cumulative distribution function of h.
Suppose, for example, that the parent process is a homogeneous Poisson
process with intensity ρ, f (s) is a bivariate Gaussian density with mean 0 and
variance σ 2 I, and the number of offspring per parent has a Poisson distribution
with mean E[N ] = µ. You can then show (see Chapter problems) that
$$K(h) = \pi h^2 + \frac{1}{\rho}\left[1 - \exp\left\{-\frac{h^2}{4\sigma^2}\right\}\right]. \qquad (3.16)$$
Notice that the K-function does not depend on the mean number of offspring
per parent (µ). The explicit formula for the K-function of a Neyman-Scott pro-
cess allows fitting a theoretical K-function to an observed K-function (Cressie,
1993, p. 666; Diggle, 1983, p. 74).
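One way to exploit (3.16) is a least-squares (“minimum contrast”) fit of (ρ, σ) to an
empirical K-function. The following sketch assumes the empirical values have already
been computed on a grid of lags; the names k_thomas and fit_k, and the starting
values, are hypothetical.

import numpy as np
from scipy.optimize import least_squares

def k_thomas(h, rho, sigma):
    # theoretical K-function (3.16) of this Neyman-Scott process
    return np.pi * h ** 2 + (1.0 - np.exp(-h ** 2 / (4.0 * sigma ** 2))) / rho

def fit_k(h, k_hat, start=(1.0, 0.1)):
    # least-squares ("minimum contrast") fit of (rho, sigma) to an empirical
    # K-function k_hat evaluated at the lags h
    resid = lambda p: k_thomas(h, p[0], p[1]) - k_hat
    out = least_squares(resid, x0=list(start),
                        bounds=([1e-8, 1e-8], [np.inf, np.inf]))
    return out.x  # estimated (rho, sigma)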
The mechanisms of aggregating events by way of a random intensity func-
tion (Cox process) and by way of a clustering operation (Poisson cluster pro-
cess) seem very different, yet for a particular realization of a point pattern
they may not be distinguishable. In fact, the theoretical results by Bartlett
(1964) show the equivalence of Neyman-Scott processes to certain Cox pro-
cesses (the equivalence of the more general Poisson cluster processes to Cox
processes has not been established (Cressie, 1993, p. 664)). Mathematically,
the equivalence exists between Cox processes whose random intensity function
is of the form
$$\Lambda(\mathbf{s}) \propto \sum_{i=1}^{\infty} k(\mathbf{s} - \mathbf{s}_i), \qquad (3.17)$$

and Neyman-Scott processes with inhomogeneous Poisson parent process and


N ∼ Poisson(µ). Think of the intensity function (3.17) as placing bivariate
densities at locations s1 , s2 , · · ·, for example, bivariate densities of independent
Gaussian variables. A realization of Λ(s) determines where the densities are
centered (the parent locations in the Neyman-Scott process). The densities
themselves determine the intensity of events near that center (the offspring in
the Neyman-Scott process). The realization of this Cox process is no different
from the Neyman-Scott process that places the parents at the centers of the
densities and generates offspring with the same density about them.
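A realization of such a process (Poisson parents, Poisson offspring counts, Gaussian
displacements; often called a Thomas process) can be simulated directly from the
parent-offspring description. The sketch below ignores edge effects from parents
falling just outside the window; the function name thomas_process is hypothetical.

import numpy as np

rng = np.random.default_rng(3)

def thomas_process(rho, mu, sigma, width=1.0, height=1.0):
    # parents: homogeneous Poisson process with intensity rho
    n_par = rng.poisson(rho * width * height)
    px = rng.uniform(0.0, width, n_par)
    py = rng.uniform(0.0, height, n_par)
    # offspring counts: N_i ~ Poisson(mu), independently for each parent
    n_off = rng.poisson(mu, n_par)
    # offspring positions: parent location plus N(0, sigma^2 I) displacement
    ox = np.repeat(px, n_off) + rng.normal(0.0, sigma, n_off.sum())
    oy = np.repeat(py, n_off) + rng.normal(0.0, sigma, n_off.sum())
    return ox, oy  # the final pattern consists of the offspring only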

3.7.3 Regular Processes

In order to generate events with greater regularity than the homogeneous


Poisson process, we can invoke a very simple requirement: no two events can
be closer than some minimum permissible distance δ. For example, start with
a homogeneous Poisson process and apply a thinning that retains points with
the probability that there are no points within distance δ. Such processes are
referred to as hard-core processes. The Matérn models I and II and Diggle’s
simple sequential inhibition process are variations of this theme.
Matérn (1960) constructed regular point processes of two types which are
termed the Matérn models I and II, respectively (see also Matérn 1986).
Model I starts with a homogeneous Poisson process Z0 with intensity ρ.
Then all pairs of events that are separated by a distance of less than δ
are thinned. The remaining events form the more regular spatial point pro-
cess Z1 . If si is an event of Z0 , the probability of its retention in Z1 is
Pr(no other point within distance δ of si ). Since the process is CSR, this leads
to
$$\Pr(s_i \text{ is retained}) = \Pr(N(\pi\delta^2) = 0) = \exp\{-\rho\pi\delta^2\},$$
and the intensity of the resulting process is
$$\lambda = \rho\exp\{-\rho\pi\delta^2\} < \rho.$$
The second-order intensity is given by λ2 (h) = ρ2 p(h), where p(h) is the
probability that two events distance h apart are retained. The function p(h)
is given by (Diggle, 1983, p. 61; Cressie, 1993, p. 670)
$$p(h) = \begin{cases} 0 & h < \delta \\ \exp\{-\rho U_\delta(h)\} & h \ge \delta, \end{cases}$$
where Uδ (h) is the area of the union of two circles with radius δ and distance
h apart. The K-function can be obtained by integration,
$$K(h) = \frac{2\pi}{\lambda^2}\int_0^h x\,\lambda_2(x)\,dx.$$

Matérn’s second model also commences with a homogeneous Poisson pro-


cess Z0 with intensity ρ and marks each event s independently with a random

variable M (s) from a continuous distribution function. Often M (s) is taken


to be uniform on (0, 1). The event s is deleted if another event u lies within
the minimum permissible distance δ and the mark of u is less than the mark
M(s). Put differently, the event s is retained if there is no other point within
distance δ with a mark less than M (s). Diggle (1983, p. 61) refers to the mark
variable M (s) as the “time of birth” of the event s. An event of Z0 is then
removed, if it lies within a distance δ of an older event. You keep the “oldest”
events.
The intensity of the resulting process, which comprises all points not thinned, is
$$\lambda = \frac{1}{\pi\delta^2}\left[1 - \exp\{-\rho\pi\delta^2\}\right].$$
Cressie (1993, p. 670) gives an expression for the second-order intensity.
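The dependent thinning of model II is easy to simulate from its definition. The sketch
below draws uniform “birth time” marks and deletes every event that has an older
neighbor within δ; the function name matern_ii is hypothetical and no attempt is made
at computational efficiency.

import numpy as np

rng = np.random.default_rng(4)

def matern_ii(rho, delta, width=1.0, height=1.0):
    # initial pattern Z0: homogeneous Poisson process with intensity rho
    n = rng.poisson(rho * width * height)
    xy = np.column_stack([rng.uniform(0.0, width, n), rng.uniform(0.0, height, n)])
    marks = rng.uniform(size=n)                       # "time of birth" M(s)
    d = np.sqrt(((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=2))
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        # delete event i if an older event (smaller mark) lies within delta
        keep[i] = not np.any((d[i] < delta) & (marks < marks[i]))
    return xy[keep]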
Diggle et al. (1976) consider the following procedure that leads to a regular
process termed simple sequential inhibition. Place a disk of radius δ at random
in the region A. Determine the remaining points in A for which you could place
a disk of radius δ that would not overlap with the first disk. Select the center
point of the next disk at random from a uniform distribution of these points.
Continue in this fashion, choosing at each stage the disk center at random
from the points at which the next disk would not overlap with any of the
previous disks. The process stops when a pre-specified number of disks has
been placed or no additional disk can be placed without overlapping previously
placed disks. This model is appealing for regular patterns where events have
an inhibition distance such as cell nuclei that are surrounded by cell mass.
The simple sequential inhibition process is a Matérn model II conditioned on
the total number of points (Ripley, 1977).

3.8 Chapter Problems

Problem 3.1 Let N (A) denote the counting measure in region A ⊂ D of


a homogeneous Poisson process. The finite-dimensional distribution of the
process for non-overlapping intervals A1 , · · · , Ak is given by
$$\Pr(N(A_1) = n_1, \cdots, N(A_k) = n_k) = \frac{\lambda^n\,\nu(A_1)^{n_1} \cdots \nu(A_k)^{n_k}}{n_1! \cdots n_k!}\exp\left\{-\sum_{i=1}^{k}\lambda\,\nu(A_i)\right\}.$$

Show that by conditioning on N (D) = n1 +· · ·+nk ≡ n, the finite-dimensional


distribution equals that of a Binomial process (i.e., (3.2)).

Problem 3.2 For the homogeneous Poisson process, find unbiased estimators
of λ and λ2 .

Problem 3.3 Data are sampled independently from two respective Gaussian

populations with variance σ 2 = 1.52 . Five observations are drawn from each
population. The realized values are
Sampled Values
Population yi1 yi2 yi3 yi4 yi5
i=1 9.7 10.2 10.9 8.6 10.3
i=2 8.2 6.1 9.6 8.2 8.9

Calculate the p-value for H0 : µ1 = µ2 against the two-sided alternative by


using the appropriate standard test. Then, perform Monte Carlo tests with s =
19, 49, 99, and s = 999 repetitions. Compare the simulated p-values against
the p-value from the standard test.

Problem 3.4 The random intensity function of the Cox process induces spa-
tial dependency between the events. You can think of this mechanism as
creating stochastic dependency through shared random effects. To better un-
derstand and appreciate this important mechanism consider the following two
statistical scenarios
1. A one-way random effects model can be written as Yij = µ + αi + εij where
the εij are iid random variables with zero mean and variance $\sigma_\epsilon^2$ and the
αi are independent random variables with mean 0 and variance $\sigma_\alpha^2$. Also,
Cov[αi, εij] = 0. Determine Cov[Yij, Ykl]. How does “sharing” of random
effects relate to the covariance of observations for i = k and i ≠ k?
2. Let Y1 |λ, · · · , Yn |λ be a random sample from a Poisson distribution with
mean λ. If λ ∼ Gamma(α, β), find the mean and variance of Yi as well as
Cov[Yi , Yj ].

Problem 3.5 (Stein, 1999, Ch. 1.4) Consider a Poisson process on the line
with intensity λ. Let A denote an interval on the line with length |A|, and
N (A) the number of events in A. Define Z(t) = N ((t − 1, t + 1]). Find the
mean and covariance function of Z(t).

Problem 3.6 Let {Z(s) : s ∈ D ⊂ R2 } be a Cox process with random


intensity function Λ(s) = Λ. The intensity is distributed as a Gamma(α, β)
random variable,
$$f(\lambda) = \frac{1}{\beta^\alpha\Gamma(\alpha)}\,\lambda^{\alpha-1}\exp\{-\lambda/\beta\}.$$
Find the distribution of N (A), the number of events in region A. How does
Var[N (A)] relate to the variance of N (A) in a homogeneous Poisson process
with intensity λ = αβ?

Problem 3.7 Generate an inhomogeneous Poisson process on the unit square


with intensity function λ(s) = λ(x, y) = exp{5x + 2y}. Divide the unit square
into a 4 × 4 grid and perform a Chi-square goodness-of-fit test for CSR based
on quadrat counts. If you want to generate this process conditional on having
exactly n = 100 events, how must the generating algorithm be adjusted?
CHAPTER PROBLEMS 131

Problem 3.8 For an inhomogeneous Poisson process with intensity function


λ(s), show that the density function of {s1 , · · · , sn } is given by
$$f(\{s_1, \cdots, s_n\}) = \begin{cases} \exp\{-\mu(A)\} & n = 0 \\ \exp\{-\mu(A)\}\prod_{i=1}^{n}\lambda(s_i)/n! & n \ge 1. \end{cases}$$

Problem 3.9 A special case of the Neyman-Scott process is as follows. A


homogeneous Poisson process with intensity λ generates parent events. Each
parent produces a random number N of offspring, N ∼ Poisson(µ), and the
positions of the offspring relative to their parents are iid with a radial symmet-
ric, standard bivariate Gaussian density. That is, the offspring event s = [x, y]′
has distribution G2(0, σ²I). The K-function of this process in R2 is given by
$$K(h) = \pi h^2 + \frac{E[N(N-1)]}{\lambda\mu^2}\,F(h),$$
where F(h) is the cumulative distribution function of the distance
$$\left\{(x_1 - x_2)^2 + (y_1 - y_2)^2\right\}^{1/2}$$
between two events in the same cluster (Cressie, 1993, p. 665).
(i) Show that the mean of the squared distance to an offspring from its
parent is 2σ 2 .
(ii) Show that the density of H, the distance between two events in the
same cluster, is f(h) = h/(2σ²) exp{−h²/(4σ²)}.

(iii) Show that the K-function is K(h) = πh² + λ⁻¹(1 − exp{−h²/(4σ²)}).
(iv) Explain why the K-function does not depend on the mean number of
offspring per parent (µ).
(v) Under what condition do this process and the homogeneous Poisson
process have the same second-order properties?
CHAPTER 4

Semivariogram and Covariance Function Analysis and Estimation

4.1 Introduction

Two important features of a random field are its mean and covariance struc-
ture. The former represents the large-scale changes of Z(s), the latter the
variability due to small- and micro-scale stochastic sources. In Chapter 2, we
gave several different representations of the stochastic dependence (second-
order structure) between spatial observations: direct and indirect specifica-
tions based on model representations (§2.4.1), representations based on con-
volutions (§2.4.2), and spectral decompositions (§2.5). In the case of a spatial
point process, the second-order structure is represented by the second-order
intensity, and by the K-function in the isotropic case (Chapter 3). If a spatial
random field has model representation Z(s) = µ(s)+e(s), where e(s) ∼ (0, Σ),
the spatial dependence structure is expressed through the variance-covariance
matrix Σ. The semivariogram and covariance function of a spatial process
with fixed, continuous domain were introduced in §2.2, since these parameters
require that certain stationarity conditions be met. The variance-covariance
matrix of e(s) is not bound by any stationarity requirements; it simply cap-
tures the variances and covariances of the process. In addition, the model
representation does not confine e(s) to geostatistical applications; the domain
may be a lattice, for example. In practical applications, Σ is unknown and
must be estimated from the data. Unstructured variance-covariance matrices
that are common in multivariate statistical methods are uncommon in spatial
statistics. There is typically structure to the spatial covariances, for exam-
ple, they may decrease with increasing lag. And without true replications,
there is no hope to estimate the entries in an unspecified variance-covariance
matrix. Parametric forms are thus assumed so that Σ ≡ Σ(θ) and θ is esti-
mated from the data. The techniques employed to parameterize Σ vary with
circumstances. In a lattice model Σ is defined indirectly by the choice of a
neighborhood matrix and an autoregressive structure. For geostatistical data,
Σ is constructed directly from a model for the continuous spatial autocorre-
lation among observations. The importance of choosing the correct model for
Σ(θ) also depends on the application. Consider a spatial model
Z(s) = Xβ + e(s), e(s) ∼ (0, Σ(θ)),
where the primary interest is inference about β, for example, confidence in-
tervals and hypothesis tests about the mean. When θ is estimated from data,

the estimated generalized least squares (EGLS) estimator is given by
$$\hat{\boldsymbol{\beta}}_{egls} = (\mathbf{X}'\boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})^{-1}\mathbf{Z}(\mathbf{s}).$$
Provided that $\hat{\boldsymbol{\theta}}$ is a consistent estimator and that $\boldsymbol{\Sigma}(\hat{\boldsymbol{\theta}})$ satisfies some reg-
ularity conditions as n → ∞, the EGLS estimator of β is consistent (see
§6.2). A judicious choice of the covariance model for Σ(θ) will suffice for this
analysis. Contrariwise, assume that our interest is in predicting the value of a
new observation Z(s0) at the unobserved location s0, and that the underlying
spatial model is
$$\mathbf{Z}(\mathbf{s}) = \mu\mathbf{1} + \mathbf{e}(\mathbf{s}), \qquad \mathbf{e}(\mathbf{s}) \sim (\mathbf{0}, \boldsymbol{\Sigma}(\boldsymbol{\theta})).$$
The best linear unbiased predictor (BLUP) under squared-error loss for this
problem is derived in §5.2 as
$$p_{ok}(\mathbf{Z}; \mathbf{s}_0) = \hat{\mu} + \mathbf{c}(\boldsymbol{\theta})'\boldsymbol{\Sigma}(\boldsymbol{\theta})^{-1}(\mathbf{Z}(\mathbf{s}) - \mathbf{1}\hat{\mu}).$$
Here, $\hat{\mu}$ is the generalized least squares estimator of the (spatially constant)
mean µ and $\mathbf{c}(\boldsymbol{\theta}) = \text{Cov}[Z(\mathbf{s}_0), \mathbf{Z}(\mathbf{s})]'$. The correctness of the covariance model
is important in this situation, since it “drives” the spatial predictor through
c(θ) and Σ(θ). Much more attention needs to be paid to the selection of the
covariance model compared to situations where the parameters of the mean
function are of primary interest.
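As an aside on computation: once Σ(θ̂) has been formed, the EGLS estimator is a few
lines of matrix code. A minimal numpy sketch (hypothetical function name egls; a
Cholesky factor is used instead of an explicit inverse) is

import numpy as np

def egls(X, Z, Sigma_hat):
    # estimated GLS: plug Sigma(theta_hat) into the GLS formula via "whitening"
    L = np.linalg.cholesky(Sigma_hat)      # Sigma(theta_hat) = L L'
    Xw = np.linalg.solve(L, X)             # L^{-1} X
    Zw = np.linalg.solve(L, Z)             # L^{-1} Z
    return np.linalg.solve(Xw.T @ Xw, Xw.T @ Zw)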
The predictor pok (Z; s0 ) cannot be computed unless θ is known, which
typically it is not. The optimal predictor pok (Z; s0 ) is thus inaccessible just as
the generalized least squares estimator
$$\hat{\boldsymbol{\beta}}_{gls} = (\mathbf{X}'\boldsymbol{\Sigma}(\boldsymbol{\theta})^{-1}\mathbf{X})^{-1}\mathbf{X}'\boldsymbol{\Sigma}(\boldsymbol{\theta})^{-1}\mathbf{Z}(\mathbf{s})$$

is inaccessible. Besides issues concerning the selection of appropriate mod-


els for Σ, the analyst is concerned with finding a good estimator of θ. In
this chapter we discuss various estimation strategies for θ under the assump-
tion of stationarity of the Z(s) process. In Chapter 5 we extend the problem
of estimating θ to the situation where E[Z(s)] is not constant, that is, the
random field exhibits a large-scale trend. Furthermore, we assess there the
ramifications of working with estimated, rather than known, quantities (using
$\hat{\theta}$ instead of θ).
Predicting the random field at unobserved locations is of particular impor-
tance for geostatistical data. By definition, a complete sampling of the random
field surface is impossible and the user may be interested in predicting the
amount Z at an arbitrary location s0 . For lattice data, complete observa-
tions of the domain is common, since sites are associated with discrete spatial
units. In this chapter we discuss methods for representing spatial dependence
in geostatistical data (§4.2), common models for parameterizing this depen-
dence (§4.3), and statistical approaches to estimation (§4.4 and §4.5). Because
modeling the spatial dependence is so important for spatial prediction, these
issues have received much attention and have given rise to a considerable array
of models and estimation methods. The primary tool that we investigate for

estimating the spatial dependence is the semivariogram. In what follows, the


reader is reminded that stationarity assumptions are implicit and much dam-
age can be done by applying semivariogram estimators and semivariogram
models to data from non-stationary spatial processes.

4.2 Semivariogram and Covariogram

4.2.1 Definition and Empirical Counterparts

Let {Z(s) : s ∈ D ⊂ Rd } be a spatial process and define


$$\gamma^*(\mathbf{s}_i, \mathbf{s}_j) = \frac{1}{2}\text{Var}[Z(\mathbf{s}_i) - Z(\mathbf{s}_j)] = \frac{1}{2}\left\{\text{Var}[Z(\mathbf{s}_i)] + \text{Var}[Z(\mathbf{s}_j)] - 2\,\text{Cov}[Z(\mathbf{s}_i), Z(\mathbf{s}_j)]\right\}. \qquad (4.1)$$
If γ ∗ (si , sj ) ≡ γ(si − sj ), a function of the coordinate difference si − sj
only, then we call γ(si − sj ) the semivariogram of the spatial process. If
Z(s) is intrinsically stationary (§2.2), then γ(si − sj ) is a parameter of the
stochastic process. In the absence of stationarity, γ ∗ remains a valid func-
tion from which the variance-covariance matrix Var[Z(s)] = Σ can be con-
structed, but it should not be referred to as the semivariogram. The function
2γ(si − sj ) is referred to as the variogram although the literature is not con-
sistent in this regard. Some authors define γ(si − sj ) through (4.1) and refer
to it as the variogram. Chilès and Delfiner (1999, p. 31), for example, define
γ(h) = ½ Var[Z(s) − Z(s + h)], term it the variogram because it “tends to be-
come established for its simplicity” and acknowledge that γ(h) is also called
the semivariogram. There is nothing “established” about being off the mark
by factor 2. For clarity, we refer throughout to γ as the semivariogram and to
2γ as the variogram. The savings in ink are disproportionate to the confusion
created when “semi” is dropped.
The name variogram is most often associated with the work by Matheron
(1962, 1963). Jowett (1955a,b) used the term variogram sparingly and called
the equivalent of γ(h) in the time series context the serial variation func-
tion. (We wish he would have used the term semivariogram sparingly.) Jowett
(1955c) termed what is now known as the empirical semivariogram (§4.4)
the serial variation curve. Statistical computing packages are also notori-
ous for calculating a semivariogram but labeling it the variogram. In the
S+SpatialStats® manual, for example, Kaluzny et al. (1998, p. 68) define the
semivariogram as in (4.1), but refer to it as the variogram for “conciseness.”
The VARIOGRAM procedure in SAS/STAT® computes the semivariogram
and uses the variogram label in the output data set.
If the spatial process is not only intrinsic, but second-order stationary, the
semivariogram can be expressed in terms of the covariance function C(si −
sj ) = Cov[Z(si ), Z(sj )] as
γ(si − sj ) = C(0) − C(si − sj ), (4.2)

making use of the fact that Var[Z(s)] = C(0) under second-order stationarity.
Note that the name semivariogram is used both for the function γ(si − sj ) as
well as the graph of γ(h) against h. When working with covariances, C(si −sj )
is the covariance function, the graph of C(h) against h is referred to as the
covariogram. Similarly, a graph of the correlation function R(h) against h
is termed the correlogram. In the spirit of parallel language, we sometimes
will use the term covariogram even if technically the term covariance function
may be more appropriate.
Because of the simple relationship between semivariogram and covariance
function, it seems immaterial which function is used to study the spatial de-
pendence of a process. Since the class of intrinsic stationary processes contains
the class of second-order stationary processes, the semivariogram of a second-
order stationary process can be constructed from the covariance function by
(4.2). If the process is intrinsic but not second-order stationary, the covari-
ance function is not a parameter of the process. Our preference to work with
the semivariogram γ and not the variogram 2γ is partly due to the fact that
γ(si − sj ) → C(0) provided C(si − sj ) → 0. An unbiased estimate of the
semivariogram for lag distances at which data are (practically) uncorrelated,
is an unbiased estimate of the variance of the process.
In geostatistical applications, it is common to work with the semivariogram,
rather than the covariance function. Statisticians, on the other hand, are
trained in expressing dependency between random variables in terms of co-
variances. The reasons are not just convenience and training, and a nice inter-
pretation of the semivariogram sill in a second-order stationary process. When
you are estimating the spatial dependence from data, the ambivalence between
covariance function and semivariogram gives way to differences in statistical
properties of the empirical estimators. Details on empirical semivariogram
estimators are given in §4.4. Here we address briefly the issue of bias when
working with (semi-)variograms and with covariances. Let Z(s1 ), · · · , Z(sn )
denote the observations from a spatial process with constant but unknown
mean. Since then
$$\gamma(\mathbf{s}_i - \mathbf{s}_j) = \frac{1}{2}E\left[(Z(\mathbf{s}_i) - Z(\mathbf{s}_j))^2\right],$$
a simple, moment-based estimator due to Matheron (1962, 1963) is
$$\hat{\gamma}(\mathbf{s}_i - \mathbf{s}_j) = \frac{1}{2|N(\mathbf{s}_i - \mathbf{s}_j)|} \sum_{N(\mathbf{s}_i - \mathbf{s}_j)} \{Z(\mathbf{s}_i) - Z(\mathbf{s}_j)\}^2,$$
where N(si − sj) is the set of location pairs with coordinate difference si − sj
and |N(si − sj)| is the number of distinct pairs in this set. The corresponding
natural estimator of the covariance function C(h), the empirical covariogram, is
$$\hat{C}(\mathbf{s}_i - \mathbf{s}_j) = \frac{1}{|N(\mathbf{s}_i - \mathbf{s}_j)|} \sum_{N(\mathbf{s}_i - \mathbf{s}_j)} (Z(\mathbf{s}_i) - \bar{Z})(Z(\mathbf{s}_j) - \bar{Z}), \qquad (4.3)$$
where $\bar{Z} = n^{-1}\sum_{i=1}^{n} Z(\mathbf{s}_i)$. Let si − sj = h. The estimator γ̂(h) is unbiased

for γ(h) if Z(s) is intrinsically stationary. If the mean is estimated from the
data, then Ĉ(h) is a biased estimator of the covariance function at lag h.
Furthermore,
$$\hat{\gamma}(h) = \frac{1}{2|N(h)|}\sum_{N(h)}\left\{Z(\mathbf{s}_i) - \bar{Z} - Z(\mathbf{s}_j) + \bar{Z}\right\}^2 = \frac{1}{|N(h)|}\sum_{N(h)}\left\{Z(\mathbf{s}_i) - \bar{Z}\right\}^2 - \hat{C}(h),$$
but $\hat{C}(0) = n^{-1}\sum_{i=1}^{n}\{Z(\mathbf{s}_i) - \bar{Z}\}^2$. As a consequence, $\hat{C}(0) - \hat{C}(h) \neq \hat{\gamma}(h)$
and a semivariogram estimate constructed from (4.3) will also be biased. As
|N(h)|/n → 1, the bias disappears.
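A minimal sketch of the Matheron estimator, pooling pairs into isotropic lag classes as
is done in practice (the function and argument names are hypothetical; no allowance is
made for anisotropy or trend):

import numpy as np

def empirical_semivariogram(coords, z, bins):
    # Matheron moment estimator, pooled into lag classes [bins[k], bins[k+1])
    coords = np.asarray(coords, dtype=float)   # (n, 2) sampling locations
    z = np.asarray(z, dtype=float)             # observed values Z(s_i)
    bins = np.asarray(bins, dtype=float)       # increasing lag-class boundaries
    i, j = np.triu_indices(len(z), k=1)        # all distinct location pairs
    h = np.sqrt(((coords[i] - coords[j]) ** 2).sum(axis=1))
    sqdiff = (z[i] - z[j]) ** 2
    gamma_hat = np.full(len(bins) - 1, np.nan)
    for k in range(len(bins) - 1):
        in_bin = (h >= bins[k]) & (h < bins[k + 1])
        if in_bin.any():                       # (1/(2|N(h)|)) * sum of squared differences
            gamma_hat[k] = 0.5 * sqdiff[in_bin].mean()
    return 0.5 * (bins[:-1] + bins[1:]), gamma_hat   # bin midpoints and estimates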
The semivariogram estimator γ̂(h) has other appealing properties. For ex-
ample, γ̂(0) = 0 = γ(0) and γ̂(h) = γ̂(−h), sharing properties of the semi-
variogram γ(h). If the data contain a linear large-scale trend,
Z(s) = X(s)β + e(s),
then the spatial dependency in the model errors e(s) is often estimated from
the least squares residuals. Since Var[e(s)] = Σ is unknown—otherwise there
is no need for semivariogram or covariance function estimation—the ordinary
least squares residuals
$$\hat{\mathbf{e}}(\mathbf{s}) = \left[\mathbf{I} - \mathbf{X}(\mathbf{s})(\mathbf{X}(\mathbf{s})'\mathbf{X}(\mathbf{s}))^{-1}\mathbf{X}(\mathbf{s})'\right]\mathbf{Z}(\mathbf{s})$$
are often used. Although the semivariogram estimated from ê(s) is a biased
estimator for the semivariogram of e(s), this bias is less than the bias of the
covariance function estimator based on ê(s) for C(h) (Cressie and Grondona,
1992; Cressie, 1993, p. 71 and §3.4.3).
The advantages of the classical semivariogram estimator over the covariance
function estimator (4.3) stem from the fact that the unknown—but constant—
mean is not important for estimation of γ(h). The semivariogram filters the
mean. This must not be interpreted as robustness of variography to arbitrary
mean. First, the semivariogram is a parameter of a spatial process only un-
der intrinsic or second-order stationarity, both of which require E[Z(s)] = µ.
Second, the semivariogram reacts rather poorly to changes in the mean with
spatial locations. Let Z(s) = µ(s) + e(s), where E[e(s)] = 0, γe (h) = γz (h).
It is easy to show (Chapter problem 4.1) that
$$E[\hat{\gamma}_z(h)] = \gamma_e(h) + \frac{1}{2|N(h)|}\sum_{N(h)}\{\mu(\mathbf{s}_i) - \mu(\mathbf{s}_j)\}^2, \qquad (4.4)$$

that is, the empirical semivariogram is positively biased.


It is important to keep separate the question of whether a statistical method
is expressed in terms of the semivariogram or the covariance function and the
question of whether one should prefer the empirical semivariogram or the em-
pirical covariogram. Because the empirical covariogram is a biased estimator

and the empirical semivariogram is unbiased, is not justification to favor sta-


tistical techniques that express dependence in terms of the semivariogram.
The bias would only be of concern, if the method of estimating the parame-
ters in θ actually drew on the empirical covariogram. For example, (restricted)
maximum likelihood techniques (see §4.5.2) express the likelihood in terms of
covariances, but the expression (4.3) is never formed. Finally, the empirical
semivariogram or covariogram is hardly ever the end result of a statistical
analysis. In a confirmatory analysis you need to estimate the parameter vec-
tor θ so that large-scale trends can be estimated efficiently, and for spatial
predictions. If you fit a theoretical semivariogram model γ(θ) to the empiri-
cal semivariogram by nonlinear least squares, for example, the least squares
estimates of θ will be biased; most nonlinear least squares estimates are.

4.2.2 Interpretation as Structural Tools

The behavior of the covariance function near the origin and its differentiability
were studied in §2.3 to learn about the continuity and smoothness of a second-
order stationary random field. Recall that a mean square continuous random
field must be continuous everywhere, and that a random field cannot be mean
square continuous unless it is continuous at the origin. Hence, C(h) → C(0)
as h → 0 which implies that γ(h) → 0 as h → 0. Furthermore, we must have
γ(0) = 0, of course. Mean square continuity of a random field implies that
the semivariogram is continuous at the origin. The notion of smoothness of
a random field was then brought into focus in §2.3 by studying the partial
derivatives of the process. The more often a random field is mean square
differentiable, the higher its degree of smoothness.
The semivariogram is not only a device to derive the spatial dependency
structure in a random field and to build the variance-covariance matrix of
Z(s), which is needed for model-based statistical inferences. It is a structural
tool which in itself conveys much information about the behavior of a random
field. For example, semivariograms that increase slowly from the origin and/or
exhibit quadratic behavior near the origin, imply processes more smooth than
those whose semivariogram behaves linearly near the origin.
For a second-order stationary random field, the (isotropic) semivariogram
γ(||h||) ≡ γ(h) has a very typical form (Figure 1.12, page 29). It rises from
the origin and if C(h) decreases monotonically with increasing h, then γ(h)
will approach Var[Z(s)] = σ² either asymptotically or exactly at a particular
lag h∗. The asymptote itself is termed the sill of the semivariogram and the
lag h∗ at which the sill is reached is called its range. Observations Z(si) and
Z(sj ) for which ||si − sj || ≥ h∗ are uncorrelated. If the semivariogram reaches
the sill asymptotically, the practical range is defined as the lag h∗ at which
γ(h) = 0.95 × σ 2 . Semivariograms that do not reach a sill occur frequently.
This could be due to

• Non-stationarity of the process, e.g., the mean of Z(s) is not constant across
the domain;
• An intrinsically stationary process. The intrinsic hypothesis states that a
variogram must satisfy
$$\frac{2\gamma(h)}{||\mathbf{h}||^2} \to 0 \qquad \text{as } ||\mathbf{h}|| \to \infty.$$
• The process is second-order stationary, but the largest lag for which the
semivariogram can be estimated is shorter than the range of the process.
The lag distance at which the semivariogram would flatten has not been
observed.
In practice, empirical semivariograms γ '(h) calculated from a set of data
often suggest that the semivariogram does not pass through the origin. This
intercept of the semivariogram has been termed the nugget effect c0, c0 =
$\lim_{h\to 0}\gamma(h) \neq 0$. If the random field under study is mean square continuous,
such a discontinuity at the origin must not exist. Following Matérn (1986,
Ch. 2.2), define Qd as the class of all functions that are valid covariance func-
tions in Rd, Q′d as the subclass of functions which are continuous everywhere
except possibly at the origin, and Q″d as the subclass of covariance functions
continuous everywhere. Matérn shows that if C(h) ∈ Q′d it can be written as
C(h) = aC0(h) + bC1(h), where a, b ≥ 0, C1(h) ∈ Q″d, and
$$C_0(h) = \begin{cases} 1 & \text{if } h = 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (4.5)$$
It follows that if Z(s) has a covariance function in Q′d, it can be decomposed
as Z(s) = U(s) + ν(s), where U(s) is a process with covariance function in
Q″d and ν(s) has covariance function (4.5). Matérn (1986, p. 12) calls U(s)
the continuous component and ν(s) the chaotic component in the de-
composition. The variance of the latter component is the nugget effect of
the semivariogram. The chaotic component is not necessarily completely spa-
tially unstructured; it can be further decomposed. Recall the decomposition
Z(s) = µ(s) + W(s) + η(s) + ε(s) from §2.4, where W(s) depicts smooth-scale
spatial variation, η(s) micro-scale variation, and ε(s) is pure measurement
error. The micro-scale process η(s) is a stationary spatial process whose semi-
variogram has sill Var[η(s)] = ση2 . It represents spatial structure but cannot
be observed unless data points are collected at lag distances smaller than the
range of the η(s) process. The measurement error component has variance
Var[ε(s)] = $\sigma_\epsilon^2$ and the nugget effect of a semivariogram is
$$c_0 = \sigma_\eta^2 + \sigma_\epsilon^2.$$
The name was coined by Matheron (1962) in reference to small nuggets of
ore distributed throughout a larger body of rock. The small nuggets consti-
tute a microscale process with spatial structure. Matheron’s definition thus
appeals to the micro-scale process and Matérn’s definition to the measure-
ment error process. In practice, η(s) and #(s) cannot be distinguished unless

there are replicate observations at the same spatial locations. The modeler
who encounters a nugget effect in a semivariogram thus needs to determine
on non-statistical grounds whether the effect is due to micro-scale variation
or measurement error. The choice matters for deriving best spatial predictors
and measures of their precision (§5). Software packages are not consistent in
this regard.
In the presence of a nugget effect, the variance of a second-order stationary
process is Var[Z(s)] = c0 + σ0², where σ0² is the partial sill. Since the nugget
reduces the smoothness of the process, a common measure for the degree of
spatial structure is the relative structured variability
$$RSV = \left(\frac{\sigma_0^2}{\sigma_0^2 + c_0}\right)\times 100\%. \qquad (4.6)$$

This is a rather crude measure for the degree of structure (or smoothness) of a
random field. Besides the relative magnitude of the discontinuity at the origin
of the semivariogram it does not incorporate other features of the process that
represents the continuous component, e.g., its mean square differentiability.
The range of the semivariogram is often considered an important parameter.
In ecological applications it has been related to the size of patches that form af-
ter human intervention. It is not clear why the distance at which observations
are no longer spatially correlated should be equal to the diameter of patches.
Consider observations Z(s1 ), Z(s2 ), and Z(s3 ). If ||s1 − s2 || < h∗ , where h∗
is the range, but ||s1 − s3 || > h∗ , ||s2 − s3 || < h∗ then Cov[Z(s1 ), Z(s2 )] (= 0,
Cov[Z(s1 ), Z(s3 )] = 0, but Z(s3 ) and Z(s2 ) are correlated. In spatial predic-
tion Z(s3 ) can impact Z(s1 ) through its correlation with Z(s2 ). Chilès and
Delfiner (1999, p. 205) call this the relay effect of spatial autocorrelation.
The relative structured variability measures that component of spatial conti-
nuity that is reflected by the nugget effect. Other measures, which incorporate
the shape of the semivariogram (and the range) have been proposed. Russo
and Bresler (1981) and Russo and Jury (1987) consider integral scales. If
R(h) = C(h)/C(0) is the autocorrelation function of an isotropic process,
then the integral scales for processes in R1 and R2 are
$$I_1 = \int_0^\infty R(h)\,dh \qquad\qquad I_2 = \left\{2\int_0^\infty R(h)\,h\,dh\right\}^{1/2}.$$

The idea of an integral scale is to consider distances over which observations


are highly correlated, rather than the distance at which observations are no
longer correlated. Processes with greater continuity have larger integral scales,
correlations wear off more slowly. Solie, Raun, and Stone (1999) argue that
integral scales provide objective measures for the distance at which soil and
plant variables are highly correlated and are useful when this distance cannot
be determined based on subject matter alone.

4.3 Covariance and Semivariogram Models

4.3.1 Model Validity

In §4.3.1–4.3.5 we consider isotropic models for the covariance function and the
semivariogram of a spatial process (accommodating anisotropy is discussed in
§4.3.7). We start from models for covariance functions because valid semivar-
iograms for second-order stationary processes can be constructed from valid
covariance functions. For example, if C(h) is the covariance function of an
isotropic process with variance σ 2 and no nugget effect, then
$$\gamma(h) = \begin{cases} 0 & h = 0 \\ \sigma^2(1 - C(h)) & h > 0. \end{cases}$$

Not every mathematical function can serve as a model for the spatial de-
pendency in a random field, however. Let C(h) be the isotropic covariance
function of a second-order stationary field and γ(h) the isotropic semivari-
ogram of a second-order or intrinsically stationary field. Then the following
hold:
• If C(h) is valid in Rd , then it is also valid in Rs , s < d (Matérn, 1986, Ch.
2.3). If γ(h) is valid in Rd , it is also valid in Rs , s < d.
• If C1 (h) and C2 (h) are valid covariance functions, then aC1 (h) + bC2 (h),
a, b ≥ 0, is a valid covariance function.
• If γ1 (h) and γ2 (h) are valid semivariograms, then aγ1 (h) + bγ2 (h), a, b ≥ 0,
is a valid semivariogram.
• A valid covariance function C(h) is a positive-definite function, that is,
$$\sum_{i=1}^{k}\sum_{j=1}^{k} a_i a_j\,C(\mathbf{s}_i - \mathbf{s}_j) \ge 0,$$
for any set of real numbers a1, · · ·, ak and sites. By Bochner’s theorem this
implies that C(h) has spectral representation (§2.5)
$$C(\mathbf{h}) = \int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\exp\{i\boldsymbol{\omega}'\mathbf{h}\}\,dS(\boldsymbol{\omega}).$$

In the isotropic case, the spectral representation of the covariance function
in Rd becomes (Matérn, 1986, p. 14; Yaglom, 1987, p. 106; Cressie, 1993,
p. 85; Stein, 1999, Ch. 2.10)
$$C(h) = \int_0^\infty \Omega_d(h\omega)\,dH(\omega) \qquad (4.7)$$
$$\Omega_d(t) = \left(\frac{2}{t}\right)^{v}\Gamma(d/2)\,J_v(t), \qquad (4.8)$$
where v = d/2 − 1, Jv is the Bessel function of the first kind of order v
(§4.9.1), and H is a non-decreasing function on [0, ∞) with $\int_0^\infty dH(\omega) < \infty$.
This is known as the Hankel transform of H(·) of order v. The function

H is related to the spectral distribution function of the process through
$H(u) = \int_{||\boldsymbol{\omega}|| < u} dS(\boldsymbol{\omega})$ (Stein, 1999, p. 43).
We call Ωd the basis function of the covariance model in Rd. For processes
in Rd, d ≤ 3, Ω1(t) = cos(t), Ω2(t) = J0(t), Ω3(t) = sin(t)/t. Also, for
d → ∞, Ωd(t) → exp{−t²} (Figure 4.1).

Figure 4.1 Basis functions for processes in R2 , R3 , and R∞ . With increasing di-
mension, the basis functions have fewer sign changes. Covariance functions are non-
increasing unless the basis function permits at least one sign change. A process in
R∞ does not permit negative autocorrelation at any lag distance.

• A valid semivariogram γ(h) is conditionally negative definite, that is,


$$2\sum_{i=1}^{m}\sum_{j=1}^{m} a_i a_j\,\gamma(\mathbf{s}_i - \mathbf{s}_j) \le 0,$$
for any real numbers a1, · · ·, am such that $\sum_{i=1}^{m} a_i = 0$ and any finite
number of sites. A valid (isotropic) semivariogram also has a spectral rep-
resentation, namely
$$\gamma(h) = \frac{1}{2}\int_0^\infty \omega^{-2}\left(1 - \Omega_d(\omega h)\right)dH(\omega),$$
with $\int_0^\infty (1 + \omega^2)^{-1}\,dH(\omega) < \infty$.
• A necessary condition for γ(h) to be a valid semivariogram is that 2γ(h)

grows more slowly than ||h||2 . This is often referred to as the intrinsic
hypothesis.

4.3.2 The Matérn Class of Covariance Functions

Based on the spectral representation (4.7)–(4.8) of isotropic covariance func-
tions, Matérn (1986) constructed a flexible class of covariance functions,
$$C(h) = \frac{\sigma^2}{\Gamma(\nu)}\left(\frac{\theta h}{2}\right)^{\nu} 2K_\nu(\theta h) \qquad \nu > 0,\ \theta > 0, \qquad (4.9)$$
where Kν is the modified Bessel function of the second kind of order ν > 0.
The parameter θ governs the range of the spatial dependence; the smoothness
of the process increases with ν. Properties of the Bessel functions are given in
§4.9.2; it is seen there that for fixed ν and t → 0
$$K_\nu(t) \approx \frac{\Gamma(\nu)}{2}\left(\frac{t}{2}\right)^{-\nu}.$$
Hence σ² is the variance of the process. Expression (4.9) is only one of several
possible parameterizations of this family of covariance functions. Others are
given in §4.7.2.
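For reference, (4.9) is straightforward to evaluate with a library Bessel function. The
sketch below (hypothetical function name matern_cov) returns σ² at h = 0, where the
product in (4.9) is numerically indeterminate.

import numpy as np
from scipy.special import gamma as gamma_fn, kv

def matern_cov(h, sigma2, theta, nu):
    # Matern covariance (4.9): sigma2/Gamma(nu) * (theta*h/2)^nu * 2*K_nu(theta*h)
    h = np.atleast_1d(np.asarray(h, dtype=float))
    c = np.full(h.shape, sigma2, dtype=float)      # limit as h -> 0 is sigma2
    pos = h > 0
    t = theta * h[pos]
    c[pos] = sigma2 / gamma_fn(nu) * (t / 2.0) ** nu * 2.0 * kv(nu, t)
    return c

# e.g., nu = 0.5 reproduces the exponential model sigma2 * exp(-theta * h)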
We commence the list of isotropic models with the Matérn class because of
its generality. C(h) given by (4.9) is valid in Rd and its smoothness increases
with ν. Although θ is related to the (practical) range of the process, the
range is itself a function of ν. For particular values of ν, the range is easily
determined however, as (4.9) takes on simple forms. As ν → ∞ the limiting
covariance model is known as the gaussian model
$$C(h) = \sigma^2\exp\{-\theta h^2\} = \sigma^2\exp\left\{-3\frac{h^2}{\alpha^2}\right\}. \qquad (4.10)$$
The second parameterization is common in geostatistical applications where
α is the practical range, the distance at which the correlations have decreased
to ≈0.05 or less (exp{−3} = 0.04978, to be more exact). Other important
cases in the Matérn class of covariance functions are obtained for ν = 1/2 and
ν = 1. In the former case, the resulting model is known as the exponential
model. Using the following results regarding Bessel functions (§4.9.2),
$$K_\nu(t) = \frac{\pi}{2}\,\frac{I_{-\nu}(t) - I_\nu(t)}{\sin(\pi\nu)}$$
$$I_{-1/2}(t) = \sqrt{\frac{2}{\pi t}}\cosh(t) \qquad\qquad I_{1/2}(t) = \sqrt{\frac{2}{\pi t}}\sinh(t)$$
$$\sinh(t) = \frac{1}{2}\left(e^t - e^{-t}\right) \qquad\qquad \cosh(t) = e^{-t} + \sinh(t),$$
one obtains
$$K_{1/2}(t) = \sqrt{\frac{\pi}{2t}}\,e^{-t}$$


and substitution into (4.9) yields (recall that Γ(1/2) = √π)
$$C(h) = \sigma^2\exp\{-\theta h\} = \sigma^2\exp\left\{-3\frac{h}{\alpha}\right\}. \qquad (4.11)$$
The second parameterization is again common in geostatistical applications
where α denotes the practical range. The exponential model is the continuous-
time analog of the first-order autoregressive time series covariance structure.
It enjoys popularity not only in spatial applications, but also in modeling
longitudinal and repeated measures data (see, e.g., Jones, 1993; Schabenberger
and Pierce, 2002, Ch. 7). The model for ν = 1,
$$C(h) = \sigma^2\,\theta h\,K_1(\theta h), \qquad (4.12)$$
was suggested by Whittle (1954). He considered the exponential model as the
“elementary” covariance function in R1 and (4.12) as the “elementary” model
in R2 . A process Z(t) in R1 with exponential correlation can be represented
by the stochastic differential equation
$$\left(\frac{d}{dt} + \theta\right)Z(t) = \epsilon(t),$$
where ε(t) is a white noise process. Whittle (1954) and Jones and Zhang (1997)
consider this the elementary stochastic differential equation in R1. In R2, with
coordinates x and y, Whittle awards this distinction to the stochastic Laplace
equation
$$\left(\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} - \theta^2\right)Z(x, y) = \epsilon(x, y).$$
A process represented by this equation has correlation function
R(h) = θhK1 (θh),
a Whittle model. Whittle (1954) concludes that “the exponential function has
no divine right in two dimensions” and calls processes in R2 with exponen-
tial covariance function “artificial”; finding it “difficult to visualize a physical
mechanism” that has covariance function (4.11).
We strongly feel that the exponential model has earned its place among the
isotropic covariance models for modeling spatial data. In fitting these models
to data, the exponential model has a definite advantage over Whittle’s model.
It does not require evaluation of infinite series (§4.9.2).
If there is an “artificial” model for the spatial dependence, it is the gaus-
sian model (4.10). Because it is the limiting model for ν → ∞ it is infinitely
differentiable. Physical and biological processes with this type of smoothness
are truly artificial. The name is unfortunately somewhat misleading. Covari-
ance model (4.10) is called the “gaussian” model because of the functional
similarity of the spectral density of a process with that covariance function to
the Gaussian probability density function (§4.7.2). It does not command the
same respect as the Gaussian distribution. We choose lowercase notation to
distinguish the covariance model (4.10) from the Gaussian distribution.
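These special cases are simple to evaluate numerically. The sketch below is our own minimal illustration (the function names are ours; the practical-range parameterizations with the factor 3 follow (4.10) and (4.11)), using scipy for the modified Bessel function K₁ needed by the Whittle model (4.12).

```python
import numpy as np
from scipy.special import kv  # modified Bessel function of the second kind, K_nu

def cov_exponential(h, sigma2=1.0, alpha=10.0):
    # Exponential model (4.11) in the practical-range parameterization
    return sigma2 * np.exp(-3.0 * np.asarray(h, dtype=float) / alpha)

def cov_gaussian(h, sigma2=1.0, alpha=10.0):
    # gaussian model (4.10) in the practical-range parameterization
    return sigma2 * np.exp(-3.0 * np.asarray(h, dtype=float)**2 / alpha**2)

def cov_whittle(h, sigma2=1.0, theta=0.25):
    # Whittle model (4.12): C(h) = sigma^2 * theta*h * K_1(theta*h), with C(0) = sigma^2
    h = np.atleast_1d(np.asarray(h, dtype=float))
    th = theta * h
    with np.errstate(invalid="ignore"):
        c = sigma2 * th * kv(1, th)
    return np.where(h > 0, c, sigma2)

# Semivariograms of these second-order stationary models: gamma(h) = C(0) - C(h)
h = np.linspace(0.0, 20.0, 201)
gamma_whittle = cov_whittle(0.0) - cov_whittle(h)   # cf. Figure 4.2
```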
Figure 4.2 shows semivariograms derived from several different covariance
functions in the Matérn class (4.9). For the same value of θ, the semivariogram
rises more quickly from the origin as ν decreases. The Whittle model with
ν = 1 is slightly quadratic near the origin but much less so than the gaussian
model (ν → ∞). All models in this class are for second-order stationary
processes with positive spatial autocorrelation that decreases with distance.
Figure 4.2 Semivariograms constructed from covariance functions in the Matérn
class for different values of the smoothness parameter ν, θ = 0.25, and σ² = 1. The
model for ν → ∞ is the gaussian model and was chosen to have the same (practical)
range as the Whittle model with ν = 1. Vertical lines indicate practical ranges and
the horizontal line denotes 95% of the sill.
4.3.3 The Spherical Family of Covariance Functions

A second-order stationary random field can be represented by the convolution
of a kernel function and a white noise random field (§2.4.2). The covariance
of the resulting random field is then simply the convolution of the kernels,

    Cov[Z(s), Z(s + h)] = σ_x² ∫_u K(u) K(u + h) du.
Chilès and Delfiner (1999, p. 81) present an interesting family of isotropic
covariance functions by choosing as kernel function the indicator function of
the sphere of R^d with diameter α,

    K_d(u) = 1   if ||u|| ≤ α/2,   and 0 otherwise.

The family of isotropic covariance functions so constructed can be written as

    C_d(h) ∝ ∫_{h/α}^{1} (1 − u²)^{(d−1)/2} du   for h ≤ α,   and 0 otherwise.
The autocorrelation functions for d = 1, 2, 3 are

• Tent Model, d = 1:

    R₁(h) = 1 − h/α   for h ≤ α,   and 0 otherwise.

  Note that the convolutions in R^1 with a uniform kernel in Figures 2.5 and
  2.6 (page 61) yielded a tent correlation function.

• Circular Model, d = 2:

    R₂(h) = (2/π) [arccos{h/α} − (h/α) √(1 − h²/α²)]   for h ≤ α,   and 0 otherwise.

• Spherical Model, d = 3:

    R₃(h) = 1 − 3h/(2α) + (1/2)(h/α)³   for h ≤ α,   and 0 otherwise.    (4.13)

In spherical models the correlation is exactly zero at lag h = α, hence
these models have a true range and often exhibit a visible kink at h = α.
The near-origin behavior of semivariograms in the spherical family is linear or
close-to-linear (Figure 4.3). Because (4.13) is valid in R3 , it is often considered
the spherical model. The popularity of the spherical covariance function and
its semivariogram

    C(h) = σ² [1 − 3h/(2α) + (1/2)(h/α)³]    (4.14)

    γ(h) = σ² [3h/(2α) − (1/2)(h/α)³]    (4.15)
are a mystery to Stein (1999, p. 52), who argues that perhaps “there is a
mistaken belief that there is some statistical advantage in having the auto-
correlation function being exactly zero beyond some finite distance.”

4.3.4 Isotropic Models Allowing Negative Correlations

The second-order stationary models discussed so far permit only positive au-
tocorrelation; the semivariogram is a non-decreasing function (the covariance
Figure 4.3 Semivariograms constructed from covariance functions in the spherical
family with α = 12 and sill σ² = 1.

function is non-increasing). Negative spatial association can be built into mod-
els for the spatial dependency, but positive and negative autocorrelations cannot
change arbitrarily in a second-order stationary process.

4.3.4.1 The Nonparametric Approach

The spectral representation (4.7) of the isotropic covariance function suggests
a method to construct covariance functions that allow for negative and positive
spatial association. The basis function Ωd (t) is an oscillating function unless
d = ∞. In R2 , for example, Ωd (t) = J0 (t), the Bessel function of the first kind
of zero order (Figure 4.1 and §4.9.1). Choose dH(ω) = f (ω)dω and
    f(ω) = σ²   if ω = 1,   and 0 otherwise.
Then C(h) = σ²Ω(h) and it is seen that J0(h) is a valid correlation function in
R². Since linear combinations of valid covariance functions are valid—provided
the coefficients in the linear combination are positive—a flexible class of co-
variance models for second-order stationary processes can be constructed as

    C(h, t, w) = Σ_{i=1}^{p} w_i Ω_d(h t_i),    (4.16)
where t = [t1, · · · , tp]′ is a set of nodes and w = [w1, · · · , wp]′ is an asso-
ciated set of weights. This is a discrete representation of (4.7) where the
non-decreasing function F(ω) has been replaced with a step function that has
positive mass w1, · · · , wp at nodes t1, · · · , tp. Notice that C(0) = Σ_{i=1}^{p} w_i and

    γ(h, t, w) = Σ_{i=1}^{p} w_i (1 − Ω_d(h t_i)).

The models discussed previously contain various parameters, for example,
the spherical model (4.14) contains a sill parameter σ² and a range parameter
α. In the representation (4.16) the “parameters” of the model consist of the
placement of the nodes and the masses associated with the nodes. Models of
form (4.16) are often referred to as nonparametric because of their flexibil-
ity. Their formulation does involve unknown constants (parameters!) that are
estimated from the data. To achieve sufficient flexibility, nonparametric mod-
els may actually require a fair number of parameters (see §4.6). Parameter
proliferation and overfitting are potential problems for this and other non-
parametric families of semivariogram models. There are some candidates for
parametric models that allow for negative correlations with a small number
of parameters.
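The construction (4.16) is straightforward to evaluate in R², where Ω₂(t) = J₀(t). The sketch below is our own illustration; the node placements and weights are arbitrary demonstration values, not recommendations from the text.

```python
import numpy as np
from scipy.special import j0  # Bessel function of the first kind, order zero

def cov_nonparametric(h, nodes, weights):
    # Discrete representation (4.16): C(h) = sum_i w_i * J0(h * t_i), valid in R^2
    h = np.atleast_1d(np.asarray(h, dtype=float))
    nodes = np.asarray(nodes, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0):
        raise ValueError("weights must be nonnegative for a valid covariance")
    return np.sum(weights * j0(np.outer(h, nodes)), axis=1)

# Hypothetical nodes and weights; the sill is C(0) = sum of the weights = 1
nodes = np.array([0.1, 0.4, 1.2])
weights = np.array([0.5, 0.3, 0.2])
h = np.linspace(0.0, 30.0, 301)
gamma = cov_nonparametric(0.0, nodes, weights) - cov_nonparametric(h, nodes, weights)
```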

4.3.4.2 Hole Models

Any isotropic covariance model in R^d has representation (4.7) and a paramet-
ric model can be constructed from (4.8) as

    C(h) = 2^v Γ(d/2) (h/α)^{−v} J_v(h/α),    (4.17)
where α > 0 and v = d/2 − 1. Matérn (1986, p. 16) shows that permissible
correlations in isotropic models are bounded from below. If R(h) denotes the auto-
correlation function, then R(h) > −0.403 in R2 and R(h) > −0.218 in R3 .
Models that allow negative correlations in excess of these values are thus not
permissible as isotropic covariance models. Since
    J_{1/2}(t) = √(2/(πt)) sin{t},    (4.18)

evaluating (4.17) for v = 0.5 using (4.18) yields a model valid in R³ known as
the cardinal-sine model:

    C(h) = (α/h) sin{h/α}.    (4.19)
This model is also known as the hole-effect or wave model. The correspond-
ing semivariogram model with sill σ² is

    γ(h) = σ² [1 − (α/h) sin{h/α}].    (4.20)
The “practical” range for this model is defined as the lag distance at which the
first peak is no greater than 1.05σ² or the first valley is no less than 0.95σ².
It is approximately 6.5 × πα (Figure 4.4b).
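A quick numerical check (our own sketch, not part of the text) illustrates both the oscillation of the cardinal-sine model and Matérn's lower bound for isotropic correlations in R³: the most negative value of (α/h) sin{h/α} is about −0.217, just inside the permissible −0.218.

```python
import numpy as np

def semivariogram_cardinal_sine(h, sigma2=1.0, alpha=1.0):
    # Hole-effect (cardinal-sine) semivariogram (4.20); gamma(0) = 0 by convention
    h = np.atleast_1d(np.asarray(h, dtype=float))
    ratio = np.ones_like(h)
    nz = h > 0
    ratio[nz] = (alpha / h[nz]) * np.sin(h[nz] / alpha)
    return sigma2 * (1.0 - ratio)

# Correlation R(h) = (alpha/h) sin(h/alpha) dips below zero but never below about -0.217
h = np.linspace(1e-6, 50.0, 200001)
R = 1.0 - semivariogram_cardinal_sine(h)          # sigma2 = 1, alpha = 1
print(round(R.min(), 4))                          # approximately -0.2172 > -0.218
```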

4.3.5 Basic Models Not Second-Order Stationary

The two basic isotropic models for processes that are not second-order sta-
tionary are the linear and the power model. The former is a special case of
the latter. The power model is given in terms of the semivariogram

    γ(h) = θ h^λ,    (4.21)

with θ ≥ 0 and 0 ≤ λ < 2. If λ ≥ 2, the model violates the intrinsic hy-
pothesis. For λ = 1, the linear semivariogram model results. As mentioned
earlier, a linear semivariogram can be indicative of an intrinsically, but not
second-order stationary process. It could also be an indication of a second-
order stationary process whose semivariogram behaves linearly near the origin
(spherical model, for example), but whose sill has not been reached across the
observed lag distances (Figure 4.4a).
Figure 4.4 Power semivariogram (a) and hole (cardinal-sine) models (b).
4.3.6 Models with Nugget Effects and Nested Models

The basic second-order stationary parametric covariance functions and semi-
variograms of the previous subsections are often considered too restrictive
to model the complexities of spatial dependence in geostatistical data. None
of the models presented there incorporates a nugget effect, for example. One
device to introduce such an effect into the semivariogram is through nesting of
models. Recall from §4.3.1 that aC1 (h) + bC2 (h) is a valid covariance function
for a second-order stationary process if a, b ≥ 0 and C1 (h) and C2 (h) are valid
covariance functions.
Assume that the random field Z(s) with E[Z(s)] = µ consists of orthogonal,
zero-mean components U1 (s), · · · , Up (s) and can be decomposed as
    Z(s) = µ + Σ_{j=1}^{p} a_j U_j(s).    (4.22)

In the geostatistical literature this decomposition is termed the linear model
of regionalization. Let Cz (h) denote the covariance function of the Z(s)
process. Then, because Cov[U_j(s), U_k(s)] = 0 for all j ≠ k,

    C_z(h) = Cov[Z(s), Z(s + h)]
           = Σ_{j=1}^{p} Σ_{k=1}^{p} a_j a_k Cov[U_j(s), U_k(s + h)]
           = Σ_{j=1}^{p} a_j² C_j(h).    (4.23)

Similarly, we obtain the semivariogram of Z(s) as γ_z(h) = Σ_{j=1}^{p} a_j² γ_j(h).
A nugget effect can be incorporated into any semivariogram as follows. Let
Z(s) = √c0 U1(s) + σ0 U2(s), where U1(s) is a white noise process with mean
0 and variance 1. U2(s) is a second-order stationary process whose semivari-
ogram γ2(h) has sill 1. Then, C2(h) = 1 − γ2(h), and

    γz(h) = c0 + σ0² γ2(h) = c0 + σ0² (1 − C2(h)),    Var[Z(s)] = c0 + σ0².
The quantity c0 is the nugget effect. Any nugget effect model can be thought of
as a nested model where one model component is white noise. This suggests
that the nugget effect is due to measurement error. If it is due to micro-
scale variation, then the corresponding component U1(s) has a no-nugget
semivariogram γ1 (h) and c0 represents its sill.
Nesting semivariograms is popular in geostatistical applications to add flex-
ibility to models. When the process is believed to consist of several compo-
nents that operate on different spatial scales, then nesting models with dif-
ferent ranges is attractive to estimate the scales of the respective processes.
For example, spatial variation in a soil nutrient may be driven by micro-
environmental conditions on a small scale, land-use and soil-type on a medium
spatial scale, and geology on a large scale. Nesting three semivariograms to
estimate the respective ranges of the three processes has appeal. However, it
is usually difficult to justify why these processes should be orthogonal (inde-
pendent). Without it, (4.23) does not hold. The orthogonality assumption is
more tenable if a white noise measurement error process is nested with one
other model to create a nugget effect.
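To make the nesting device concrete, the following sketch (our own, with arbitrary parameter values) builds a nested semivariogram from a nugget component and two exponential components with different practical ranges, in the spirit of (4.22) and (4.23) with orthogonal components.

```python
import numpy as np

def sv_exponential(h, sill=1.0, alpha=1.0):
    # Exponential semivariogram with practical range alpha (cf. (4.11))
    return sill * (1.0 - np.exp(-3.0 * np.asarray(h, dtype=float) / alpha))

def sv_nugget(h, c0=1.0):
    # White-noise (nugget) component: 0 at h = 0, c0 for h > 0
    return np.where(np.asarray(h) > 0, c0, 0.0)

def sv_nested(h, c0=0.1, sills=(0.6, 0.3), ranges=(15.0, 80.0)):
    # gamma_z(h) = c0*1(h > 0) + sum_j sigma_j^2 * gamma_j(h), orthogonal components assumed
    gamma = sv_nugget(h, c0)
    for s2, a in zip(sills, ranges):
        gamma = gamma + sv_exponential(h, s2, a)
    return gamma

h = np.linspace(0.0, 200.0, 401)
print(sv_nested(0.0), sv_nested(200.0))  # 0.0 and approximately c0 + sum(sills) = 1.0
```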

4.3.7 Accommodating Anisotropy

If the covariance function of a second-order stationary process is anisotropic,
the spatial structure is direction dependent. Whereas in the isotropic case, iso-
correlation contours are spherical, a particular case of anisotropy gives rise to
elliptical contours (Figure 4.5). This case is known as geometric anisotropy
and can be corrected by a linear transformation of the coordinate system.
Following Matérn (1986, p. 19), let Z1 (s) be a stationary process in Rd with
covariance function C1 (h), mean µ, and variance σ 2 . Let B(d×d) be a real
matrix and consider the stochastic process Z(s) = Z1 (Bs). Because Z1 (s) is
stationary, we have E[Z(s)] = µ and Var[Z(s)] = σ 2 . Furthermore,
Cov[Z(s), Z(s + h)] = C(h) = Cov[Z1 (Bs), Z1 (B(s + h))]
= C1 (Bs − B(s + h)) = C1 (−Bh)
= C1 (Bh).
Hence, if C1 (h) is isotropic, then C(h) = C1 (||Bh||) is a geometrically aniso-
tropic covariance function.
To correct for geometric anisotropy this transformation of the coordinate
system can be reversed. If s = [x, y]′ is a coordinate in R² such that the
process Z(s) is geometrically anisotropic, then Z(s∗ ) = Z(As) has isotropic
covariance function if A = B−1 . A linear transformation s∗ = As of Euclidean
space provides the appropriate space to express the covariance. The geometric
anisotropy shown in Figure 4.5 is corrected by (i) a rotation of the coordinate
system to align the major and minor axes of the elliptical contours and (ii),
a compression of the major axis to make contours spherical. Hence,
    A = [ 1   0 ] [ cos θ   −sin θ ]
        [ 0   λ ] [ sin θ    cos θ ] ,
where λ is the anisotropy ratio. A geometric anisotropy manifests itself in
semivariograms that have the same shape and sill in the direction of the major
and minor axes, but different ranges. The parameter λ equals the ratio of the
ranges in these two directions. Geometric anisotropy is common for processes
that evolve along particular directions. For example, airborne pollution will
likely exhibit anisotropy in the prevailing wind direction and perpendicular
to it.
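Carrying out the correction is a short coordinate transformation. The sketch below is our own illustration; the 45-degree angle and the ratio λ = 0.5 mirror Figure 4.5 but are otherwise arbitrary. Isotropic semivariogram models are then applied to lags computed in the transformed coordinates.

```python
import numpy as np

def anisotropy_correction(coords, angle_deg, ratio):
    # s* = A s with A = diag(1, ratio) @ rotation(angle), cf. the matrix A above
    phi = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(phi), -np.sin(phi)],
                    [np.sin(phi),  np.cos(phi)]])
    A = np.diag([1.0, ratio]) @ rot
    return np.asarray(coords, dtype=float) @ A.T

# Hypothetical example: 45 degree rotation, anisotropy ratio lambda = 0.5 (cf. Figure 4.5)
coords = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
coords_iso = anisotropy_correction(coords, angle_deg=45.0, ratio=0.5)
# Isotropic semivariogram models can now be applied to distances in the corrected system
lags = np.linalg.norm(coords_iso[None, :, :] - coords_iso[:, None, :], axis=-1)
```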
Figure 4.5 Contours of iso-correlation (0.8, 0.5, 0.3, 0.1) for two processes with
exponential correlation function. The isotropic model (a) has spherical correlation
contours. The elliptic contours in panel b) correspond to a 45 degree rotation and a
ratio of λ = 0.5 between the two major axes. The axes depict lag distances in the
(x, y) (a) and (x∗ , y ∗ ) coordinate systems (b).

In a process exhibiting zonal anisotropy the covariance function depends
on only some components of the lag vector. A case of zonal anisotropy exists
if the semivariogram sills vary with direction. Typically, the range in the
direction with the shorter range also has the smaller sill. Zonal anisotropy can
then be modeled by nesting an isotropic model and a model which depends
only on the lag-distance in the direction θ of the greater sill (Goovaerts, 1997,
p. 93),

γ(h) = γ1 (||h||) + γ2 (hθ ).

Chilès and Delfiner (1999, p. 96) warn about zonal models that partition the
coordinates, because certain linear combinations can have zero variance. In R2
let Z(s) = Z1 (x) + Z2 (y). If the components Z1 (x) and Z2 (y) are orthogonal,
then, by §4.3.6, γz(h) = γ1(hx) + γ2(hy). Let hu = [u, 0]′ and hv = [0, v]′ be
two vectors shifting the coordinates. Then

Var[Z(s) − Z(s + hv ) − Z(s + hu ) + Z(s + hu + hv )] = Var[X(s)] = 0,

because

X(s) = Z(s) − Z(s + hv ) − Z(s + hu ) + Z(s + hu + hv )


= Z1 (x) + Z2 (y) − Z1 (x) − Z2 (y + v) − Z1 (x + u) − Z2 (y)
+ Z1 (x + u) + Z2 (y + v) = 0.
4.4 Estimating the Semivariogram

4.4.1 Matheron’s Estimator

To learn about the semivariogram from a set of observed data, Z(s1 ), · · · ,
Z(sn ), one could plot the squared differences {Z(si ) − Z(sj )}2 against the lag
distance h (or ||h||). Such a graph is appropriately termed the empirical semi-
variogram cloud because it is usually not very informative and “clouds” the big
picture. The number of pairwise differences can be very large, lag distances
may be unique for irregularly spaced data, and extreme observations cause
many “outliers” in the cloud. Since {Z(si ) − Z(sj )}2 estimates unbiasedly the
variogram at lag si − sj , provided the mean of the random field is constant,
a more useful estimator is obtained by summarizing the squared differences.
The semivariogram estimator that averages the squared differences of points
that are distance si − sj = h apart is known commonly as the classical or
Matheron estimator since it was proposed by Matheron (1962):

    γ̂(h) = (1 / (2|N(h)|)) Σ_{N(h)} {Z(si) − Z(sj)}².    (4.24)

The set N (h) consists of location pairs (si , sj ) such that si − sj = h and
|N (h)| denotes the number of distinct pairs in N (h). When data are sparse or
irregularly shaped, the number of distinct pairs in N (h) may not be sufficient
to obtain a stable estimate at lag h. Typical recommendations are that at
least 30 (better 50) pairs of locations should be available at each lag. If the
number of pairs is smaller, lags are grouped into lag classes so that γ̂(h) is
the average squared difference of site pairs that satisfy si − sj = h ± ε. The
choice of the tolerance ε is left to the user. A graph of γ̂(h) against ||h|| is
called the Matheron semivariogram or the empirical semivariogram.
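A direct implementation of (4.24) with lag classes is straightforward. The sketch below is our own minimal version (the function and its arguments are ours, not from the text): it bins the pairwise distances into classes of a chosen width and averages the squared differences within each class.

```python
import numpy as np
from scipy.spatial.distance import pdist

def matheron_semivariogram(coords, z, lag_width, max_lag=None):
    """Classical (Matheron) estimator (4.24) for isotropic lags, binned into classes."""
    coords = np.asarray(coords, dtype=float)
    z = np.asarray(z, dtype=float)
    d = pdist(coords)                                   # pairwise distances ||si - sj||
    sqdiff = pdist(z[:, None], metric="sqeuclidean")    # (Z(si) - Z(sj))^2, same pair order
    if max_lag is None:
        max_lag = d.max()
    edges = np.arange(0.0, max_lag + lag_width, lag_width)
    centers, gamma_hat, npairs = [], [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_class = (d > lo) & (d <= hi)
        if in_class.any():
            centers.append(d[in_class].mean())
            gamma_hat.append(0.5 * sqdiff[in_class].mean())
            npairs.append(int(in_class.sum()))
    return np.array(centers), np.array(gamma_hat), np.array(npairs)
```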
Among the appealing properties of the Matheron estimator—which are
partly responsible for its widespread use—are simple computation, unbiased-
ness, evenness, and attaining zero at zero lag: E[γ̂(h)] = γ(h), γ̂(h) = γ̂(−h),
γ̂(0) = 0. It is difficult in general to determine distributional properties and
moments of semivariogram estimators without further assumptions. The es-
timators at two different lag values are usually correlated because (i) obser-
vations at that lag class are spatially correlated, and (ii) the same points are
used in estimating the semivariogram at the two lags. Because the Matheron
estimator is based on squared differences, more progress has been made in
establishing (approximate) moments and distributions than for some of its
competitors.
Consider Z(s) to be a Gaussian random field so that (2γ(h))⁻¹{Z(s) −
Z(s + h)}² ∼ χ²₁ and

    Var[{Z(s) − Z(s + h)}²] = 2 × 4γ(h)².

Cressie (1985) shows that the variance of (4.24) at lag hi can be approximated
as

    Var[γ̂(hi)] ≈ 2 γ(hi)² / |N(hi)|.    (4.25)
The approximation ignores the correlations between Z(si )−Z(sj ) and Z(sk )−
Z(sl ) (see §4.5.1). If it holds, consistency of the Matheron estimator is easily
ascertained from (4.25), since γ̂(h) is unbiased. The expression (4.25) also
tells us what to expect for large lag values. In practice empirical semivari-
ograms are common that appear ill-behaved and erratic for large lags. Since
the semivariogram γ(hi ) of a second-order stationary process rises until it
reaches the sill, the numerator of (4.25) increases sharply in hi as long as the
semivariogram has not reached the sill. Even then, the variance of the Math-
eron estimator does not remain constant. The number of pairs from which
'(h) can be computed decreases sharply with h.
γ

Example 4.1 Table 4.1 demonstrates these effects for data on a 10 × 10
lattice with an exponential semivariogram (practical range 7 and sill 10). The
lag distances were grouped in unit lag classes with a lag tolerance of 0.5
units. The number of distance pairs increases until the fourth lag class and
decreases afterwards. The approximate variance of the Matheron estimator
increases slowly for small lags because the increase in γ(h) is almost offset
by an increase in the number of pairs. For distances exceeding the practical
range, γ(h) increases slowly but the variance of γ̂(h) increases sharply due to
lack of observation pairs. The recommendation to use γ̂(hi) only if at least
30 (or 50) pairs are available in lag class i is based on the idea to keep the
variation of γ̂(h) at bay. It does not lead to a homoscedastic set of empirical
semivariogram values.

One could, however, group the pairs into lag classes such that each class
contains a number of observations that makes the empirical semivariogram
values homoscedastic under a particular semivariogram model. This leads to
a concentration of lag classes at small lags and sparsity at large lags. The
overall shape of the semivariogram may be difficult to determine. Since one
would have to know the form and parameters of the true semivariogram, this
is an impractical proposition in any case. But even choosing lag classes that
have the same number of points may inappropriately group the lag distances.

Example 4.2 C/N ratios. Figure 4.6 displays the 195 locations on an agri-
cultural field at which the total soil carbon and total soil nitrogen percentages
were measured. These data were kindly provided by Dr. Thomas G. Mueller,
Department of Agronomy, University of Kentucky, and represent a subset of
the data used in Chapter 9 of Schabenberger and Pierce (2002). The field had
been in no-tillage management for more than ten years when strips of the
field were chisel-plowed. The data considered here correspond to these plowed
parts of the field.
The data are geostatistical and irregularly spaced. The Euclidean distances
Table 4.1 Approximate variance of Matheron estimator (4.24) on a 10 × 10 lattice
for one unit lag classes with a tolerance of 0.5 units. Exponential semivariogram
with practical range 7 and sill 10.

  Lag Class   Number of Distances   Average Lag   Semivariogram   Variance according
    Value         in Class           Distance        Value            to (4.25)
      1             342               1.196          4.011             0.094
      2             448               2.152          6.023             0.162
      3             520               3.036          7.278             0.203
      4             850               4.062          8.246             0.160
      5             608               5.131          8.891             0.260
      6             684               6.078          9.261             0.251
      7             522               7.049          9.512             0.347
      8             444               7.984          9.673             0.422
      9             368               8.996          9.788             0.521
     10              94              10.005          9.862             2.069
     11              60              10.925          9.907             3.272
     12               8              12.042          9.942            24.714
     13               2              12.728          9.957            99.147

between observations range between 5 and 565.8 feet. In order to obtain lag
classes with at least 50 observations per class, we decided on 35 lag classes of
width 6 feet. The resulting Matheron estimator of the empirical semivariogram
is shown in Figure 4.7.
It is customary not to compute the empirical semivariogram up to the
largest possible lag class. The number of available pairs shrinks quickly for
larger lags and the variability of the empirical semivariogram increases. A
common recommendation is to compute the empirical semivariogram up to
about one half of the maximum separation distance in the data, although this
is only a general guideline. It is important to extend the empirical semivar-
iogram far enough so that the important features of the spatial dependency
structure can be discerned but not so far as to hinder model selection and
interpretation due to lack of reliability. The empirical semivariogram of the
C/N ratios appears quite “well-behaved.” It rises from what appears to be
the origin up to a distance of 100 feet and has a sill between 0.25 and 0.30. A
spherical or exponential model may fit this empirical semivariogram well. We
will return to these data throughout the chapter.
The question of possible anisotropy can be investigated by computing the
empirical semivariogram surface or by constructing directional empirical semi-
variograms. To compute the semivariogram surface you divide the domain into
non-overlapping regions of equal size, typically rectangles or squares. If δx and
δy are the dimensions of the rectangles in the two main directions, we com-
pute a point on the surface at location h by averaging the pairs separated
Figure 4.6 Sample locations in chisel-plowed components of an agricultural field. At
each location a measurement of soil carbon (%) and soil nitrogen (%) was obtained.

by h ± [δx, δy]′. The surface is then typically smoothed or contoured and the
display is centered at h = [0, 0]′. For the C/N ratio data, Figure 4.8 shows
the result of smoothing the semivariance surface.
There does not appear to be a discernible pattern in any direction that
would differ substantively in another direction. Based on this surface we con-
sider isotropic covariance models for the C/N ratio henceforth.
One problem with the semivariogram surface is that the division of the
domain needs to be fine enough to convey the semivariogram detail in all di-
rections while at the same time to provide sufficient pairs of points to obtain
a reliable estimator of the semivariogram. The second approach of exploring
anisotropy, the construction of directional semivariograms, suffers in general
from the same problem. Because the number of directions for which the empir-
ical semivariogram is computed is relatively small, the issue of data sparsity
is less pronounced. Figure 4.9 shows empirical semivariograms for the C/N
ratio data computed at 0, 45, 90, and 135 degrees (from N) with angle toler-
ances of ±22.5 degrees. The angle tolerance was chosen so that segments do
not overlap. For example, the first semivariogram collects all pairs in a ray
between −22.5 and 22.5 degrees from N. The conclusion for these data based
on Figure 4.9 is the same as with the semivariance surface. There does not
Figure 4.7 Empirical semivariogram based on the Matheron (classical) estimator for
C/N ratios. Numbers across the top denote the number of pairs within the lag class.

appear to be direction dependence in the spatial covariance structure of the
C/N ratios.

Considerations of heteroscedasticity arise with all semivariogram estima-
tors; they are not unique to the Matheron estimator. Because of the simple
form of the estimator’s approximate variance, the concepts are highlighted
most clearly for this estimator. A problem that is typical for the Matheron es-
timator, however, is its sensitivity to outlying observations. In developing the
other estimators in this section, robustness or resistance of the estimator was
an important consideration. Even a single extreme observation can negatively
affect the empirical semivariogram estimate, because the squared differences
{Z(si ) − Z(sj )}2 magnify the deviation between the outlier Z(sj ) and other
values. In addition, an observation contributes to γ '(hi ) at several lags and
outliers “spread” contamination.

Example 4.3 Four point semivariogram. Consider the simplified exam-
ple from Schabenberger and Pierce (2002, p. 588) of a spatial data set with
five locations. Let Z([x, y]) denote the observed value at coordinate s = [x, y].
Figure 4.8 Smoothed semivariance surface for C/N ratio data.

The data are

    Z([1, 1]) = 1    Z([1, 4]) = 4
    Z([2, 2]) = 2    Z([3, 1]) = 3
    Z([3, 4]) = 20

To see the impact of Z([3, 4]) on the computation of the Matheron estimator
notice that there are two pairs for each of five lag classes. The respective
estimates are
    γ̂(√2)  = (1/4) [(1 − 2)² + (2 − 3)²]   = 1/2
    γ̂(2)   = (1/4) [(1 − 3)² + (4 − 20)²]  = 65
    γ̂(√5)  = (1/4) [(4 − 2)² + (20 − 2)²]  = 82
    γ̂(3)   = (1/4) [(4 − 1)² + (20 − 3)²]  = 74.5
    γ̂(√13) = (1/4) [(3 − 4)² + (20 − 1)²]  = 90.5

The extreme observation Z([3, 4]) contributes to estimates at four of the


five lag values. In each case it produces the largest squared difference. If the
Figure 4.9 Directional empirical semivariograms for C/N ratio.

observation is removed from the data, the estimates are γ̂(2) = 2, γ̂(√5) = 2,
γ̂(3) = 4.5, γ̂(√13) = 1/2.
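The hand calculations of this example are easy to verify with a few lines of code that mirror (4.24) directly (our own sketch, independent of any particular package):

```python
import numpy as np
from itertools import combinations

pts = {(1, 1): 1.0, (1, 4): 4.0, (2, 2): 2.0, (3, 1): 3.0, (3, 4): 20.0}
sums, counts = {}, {}
for (s1, z1), (s2, z2) in combinations(pts.items(), 2):
    lag = round(np.hypot(s1[0] - s2[0], s1[1] - s2[1]), 6)   # ||si - sj||
    sums[lag] = sums.get(lag, 0.0) + (z1 - z2) ** 2
    counts[lag] = counts.get(lag, 0) + 1
gamma_hat = {lag: sums[lag] / (2 * counts[lag]) for lag in sorted(sums)}
print(gamma_hat)  # {1.414214: 0.5, 2.0: 65.0, 2.236068: 82.0, 3.0: 74.5, 3.605551: 90.5}
```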

Unless a measurement has been obtained in error, removing extreme ob-
servations is not the correct course of action. This also reduces the number
of pairs available at the lag classes. In order to retain observations but re-
duce their negative influence, one can downweight the observation or choose a
statistic that is less affected by its value.

4.4.2 The Cressie-Hawkins Robust Estimator

Cressie and Hawkins (1980) suggested an estimator that alleviates the nega-
tive impact of outlying observations by eliminating squared differences from
the calculation. It is often referred to as the robust semivariogram estimator;
we refer to it as the Cressie-Hawkins (CH) estimator. Its genesis is as follows.
In a Gaussian random field all bivariate distributions of [Z(si ), Z(sj )] are
Gaussian and
    (Z(si) − Z(sj)) / √(2γ(si − sj)) ∼ G(0, 1),

    (Z(si) − Z(sj))² / (2γ(si − sj)) ∼ χ²₁.
Cressie and Hawkins (1980) note that the fourth root transformation of (Z(si )−
Z(sj ))2 yields an approximately Gaussian random variable with mean
    E[|Z(si) − Z(sj)|^{1/2}] ≈ 2^{1/2} π^{−1/2} Γ(0.75) × γ(si − sj)^{1/4}.

Furthermore, the expected value of the fourth power of

    (1/|N(h)|) Σ_{N(h)} |Z(si) − Z(sj)|^{1/2}

turns out to be (approximately)

    2γ(h) (0.457 + 0.494/|N(h)| + 0.045/|N(h)|²).

This suggests the variogram estimator

    {(1/|N(h)|) Σ_{N(h)} |Z(si) − Z(sj)|^{1/2}}⁴ / (0.457 + 0.494/|N(h)| + 0.045/|N(h)|²).

The term 0.045/|N (h)|2 contributes very little to the bias correction, particu-
larly if |N (h)| is large. The (robust) Cressie-Hawkins semivariogram estimator
is finally given by

    γ̄(h) = (1/2) {(1/|N(h)|) Σ_{N(h)} |Z(si) − Z(sj)|^{1/2}}⁴ / (0.457 + 0.494/|N(h)|).    (4.26)

Because the square root differences are averaged first and the resulting
average is then raised to the fourth power, the first term in (4.26) is much
less affected by extreme values than the average of the squared differences in
the Matheron estimator. The robust estimator is not unbiased, but the term
in the denominator serves to achieve approximate unbiasedness.
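A sketch of (4.26), analogous to the Matheron function above and again our own illustration, averages the square-root absolute differences within each lag class before raising the average to the fourth power and applying the bias correction.

```python
import numpy as np
from scipy.spatial.distance import pdist

def cressie_hawkins_semivariogram(coords, z, lag_width, max_lag=None):
    """Robust Cressie-Hawkins estimator (4.26), binned into lag classes."""
    coords = np.asarray(coords, dtype=float)
    z = np.asarray(z, dtype=float)
    d = pdist(coords)                                          # pairwise distances
    rootdiff = np.sqrt(pdist(z[:, None], metric="cityblock"))  # |Z(si) - Z(sj)|^(1/2)
    if max_lag is None:
        max_lag = d.max()
    edges = np.arange(0.0, max_lag + lag_width, lag_width)
    centers, gamma_bar = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_class = (d > lo) & (d <= hi)
        n = in_class.sum()
        if n > 0:
            mean_root = rootdiff[in_class].mean()
            gamma_bar.append(0.5 * mean_root**4 / (0.457 + 0.494 / n))
            centers.append(d[in_class].mean())
    return np.array(centers), np.array(gamma_bar)
```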
The attribute robust of the CH estimator refers to small amounts of con-
tamination in a Gaussian process. It is under this premise that Hawkins and
Cressie (1984) investigated the robustness of (4.26): white noise ε(s) was added
to an intrinsically stationary process such that ε(s) is G(0, σ0²) with probability
1 − p and G(0, kσ0²) with probability p. To be more specific, let

    Z(s) = µ + S(s) + ε(s),

where S(s) is a Gaussian random field with semivariogram γ_S(h). For some
value h it is assumed that γ_S(h) = mσ0². One could thus think of S(s) as
second-order stationary with sill mσ0². The particular model investigated was

    ε(s) ∼ G(0, σ0²) with probability 0.95,    ε(s) ∼ G(0, 9σ0²) with probability 0.05.

Under this Gaussian contamination model γ̂(h) is no longer an unbiased
estimator of the semivariogram of the uncontaminated process (the process
with p = 0). It has positive bias; its values are too large. From Hawkins and
Cressie (1984) and Cressie (1993, p. 82, Table 2.2) it is seen that γ̄(h) is less
biased than γ̂(h) if the relative nugget effect is small. Similarly, if the nugget
σ0² is small relative to the semivariogram of the intrinsically stationary process,
then the variability of γ̄(h) is less than that of the Matheron estimator. The
CH estimator will typically show less variation at small lags and also result
in generally smaller values than (4.24).
However, at m = 1 the variability of γ̂(h) and γ̄(h) is approximately
the same and the robust estimator is more variable for m > 1. In that case
the contamination of the data plays a minor role compared to the stochastic
variation in S(s). As Hawkins and Cressie (1984) put it: “The loss of efficiency
as m → ∞ may be thought of as a premium paid by the robust estimators on
normal data to insure against the effects of possible outliers.”
As shown by Hawkins (1981), the |Z(si ) − Z(sj )|0.5 are less correlated than
the squared differences (Z(si ) − Z(sj ))2 . This is a reason to prefer the CH
estimator over the Matheron estimator when fitting a semivariogram model by
weighted (instead of generalized) least squares to the empirical semivariogram
(see §4.5).

Example 4.3 (Four point semivariogram. Continued) For the five lag
distances in this simple example the estimates according to equation (4.26)
are
    γ̄(√2)  = (1/2) {(1/2) (√|1 − 2| + √|2 − 3|)}⁴ / 0.704 = 0.71
    γ̄(2)   = (1/2) {(1/2) (√|1 − 3| + √|4 − 20|)}⁴ / 0.704 = 38.14
    γ̄(√5)  = (1/2) {(1/2) (√|4 − 2| + √|20 − 2|)}⁴ / 0.704 = 45.5
    γ̄(3)   = (1/2) {(1/2) (√|4 − 1| + √|20 − 3|)}⁴ / 0.704 = 52.2
    γ̄(√13) = (1/2) {(1/2) (√|3 − 4| + √|20 − 1|)}⁴ / 0.704 = 36.6

The influence of the extreme observation Z([3, 4]) is clearly suppressed.

4.4.3 Estimators Based on Order Statistics and Quantiles

The robustness attribute of the CH estimator refers to small amounts of con-
tamination in a Gaussian process. It is not a resistant estimator, because it is
not stable under gross contamination of the data. Furthermore, the CH and
the Matheron estimators have unbounded influence functions and a break-
down point of 0%. The influence function of an estimator measures the effect
of infinitesimal contamination of the data on the statistical properties of the
estimator (Hampel et al., 1986). The breakdown point is the percentage of
data that can be replaced by arbitrary values without explosion (implosion)
of the estimator.
The median absolute deviation (MAD), for example, is an estimator of scale
with a 50% breakdown point and a smooth influence function. For a set of
numbers x1 , · · · , xn , the MAD is

MAD = b mediani {|xi − medianj (xj )|}, (4.27)

where mediani (xi ) denotes the median of the xi . The factor b is chosen to
yield approximate unbiasedness and consistency. If x1 , · · · , xn are independent
realizations from a G(µ, σ 2 ), for example, the MAD will be consistent for σ
for b = 1.4826.
Rousseeuw and Croux (1993) suggested a robust estimator of scale which
also has a 50% breakdown point but a smooth influence function. Their Qn
estimator is given by the kth order statistic of the n(n − 1)/2 inter-point
distances. Let h = ⌊n/2⌋ + 1 and k = h(h − 1)/2. Then,

    Qn = c {|xi − xj|; i < j}_(k).    (4.28)

For Gaussian data, the multiplicative factor that gives consistency for the
standard deviation is c = 2.2191. The Qn estimator has positive small-sample
bias (see Table 1 for n ≤ 40 in their paper) which can be corrected (Croux
and Rousseeuw, 1992).
Genton (1998a, 2001) considers the modification that leads from (4.24) to
(4.26) not sufficient to impart robustness and develops a robust estimator
of the semivariogram based on Qn . If spatial data Z(s1 ), · · · , Z(sn ) are ob-
served, let N (h) denote the set of pairwise differences Ti = Z(si ) − Z(si + h),
i = 1, · · · , n(n − 1)/2. Next, calculate Q_{|N(h)|} for the Ti and return as the
Genton semivariogram estimator at lag h

    γ(h) = (1/2) Q²_{|N(h)|}.    (4.29)
Since Qn has a 50% breakdown point, γ(h) has a 50% breakdown point in
terms of the process of differences Ti , but not necessarily in terms of the Z(si ).
Genton (2001) establishes through simulation that (4.29) will be resistant to
roughly 30% of outliers among the Z(si ).
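Both Qn and the estimator (4.29) are easy to prototype for small data sets. The sketch below is our own naive implementation (a production version would use the efficient algorithm of Croux and Rousseeuw, 1992) and omits the small-sample bias correction.

```python
import numpy as np
from itertools import combinations

def qn_scale(x, c=2.2191):
    # Qn of Rousseeuw and Croux (1993): c times the kth order statistic of |xi - xj|, i < j
    x = np.asarray(x, dtype=float)
    n = len(x)
    h = n // 2 + 1
    k = h * (h - 1) // 2
    diffs = np.sort([abs(a - b) for a, b in combinations(x, 2)])
    return c * diffs[k - 1]          # kth order statistic (1-based)

def genton_semivariogram(z_pair_diffs):
    # (4.29): half the squared Qn of the pairwise differences T_i collected at one lag
    return 0.5 * qn_scale(z_pair_diffs) ** 2
```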
Another approach of “robustifying” the empirical semivariogram estimator
is to consider quantiles of the distribution of {Z(si )−Z(sj )}2 or |Z(si )−Z(sj )|
instead of arithmetic averages (as in (4.24) and (4.26)). If [Z(s), Z(s + h)]′ are
bivariate Gaussian with common mean, then

    (1/2) {Z(s) − Z(s + h)}² ∼ γ(h) χ²₁,

    (1/√2) |Z(s) − Z(s + h)| ∼ √γ(h) |U|,    U ∼ G(0, 1).
Let q^{(p)}_{|N(h)|} denote the pth quantile. Then

    γ̂_p(h) = q^{(p)}_{|N(h)|} { (1/2) [Z(s) − Z(s + h)]² }

estimates γ(h) × χ²_{p,1}. A median-based estimator (p = 0.5) would be

    γ̂_p(h) = (1/2) median_{|N(h)|} {[Z(s) − Z(s + h)]²} / 0.455
            = (1/2) (median_{|N(h)|} {|Z(s) − Z(s + h)|^{1/2}})⁴ / 0.455.
The latter expression is (2.4.13) in Cressie (1993, p. 75).

4.5 Parametric Modeling

The empirical semivariogram γ̂(h) is an unbiased estimator of γ(h), but it
provides estimates only at a finite set of lags or lag classes. In order to obtain
estimates of γ(h) at any arbitrary lag the empirical semivariogram must be
smoothed. A nonparametric kernel smoother will not suffice since it is not
guaranteed that the resulting fit is a conditionally negative-definite function.
The common approach is to fit one of the parametric semivariogram models
of §4.3 or to apply the “nonparametric” semivariogram representation (§4.6).
Although fitting a parametric semivariogram model to the empirical semivar-
iogram by a least squares method is by far the most common approach, it is
not the only parametric technique.
Modeling techniques that fit a parametric model to the observed data
Z(s1 ), · · · , Z(sn ) are distinguished from those approaches that fit a model
to pseudo-data. In a pseudo-data approach the response being modeled is
derived from Z(s1 ), · · · , Z(sn ) and the construction of the pseudo-data often
involves subjective choices; for example, the semivariogram cloud consists of
pseudo-data Tij = Z(si ) − Z(sj ). The Matheron and the Cressie-Hawkins
estimators
    γ̂(h) = average(Tij²),    γ̄(h) = (average(|Tij|^{1/2}))⁴,
are functions of the semivariogram cloud values that depend on the number
and width of lag classes, the maximum lag for which the empirical semivari-
ogram is calculated, the minimum number of pairs per lag class, and so forth.
Although the subjectivity inherent in the empirical semivariogram estimators
is allayed if the Tij are not averaged, the user must decide whether to model
(Z(si ) − Z(sj ))2 , Z(si )Z(sj ), or some other form of pseudo-response.
The least squares methods fit a semivariogram model to γ̂(h) or γ̄(h). Max-
imum likelihood (ML) and restricted (residual) maximum likelihood (REML)
estimation use the observed data directly, usually assuming a Gaussian ran-
dom field. Other estimating-function-based methods such as generalized esti-
mating equations (GEE) and composite likelihood (CL) also utilize pseudo-
data. No single method can claim uniform superiority. In the following sub-
sections we discuss the various approaches and their respective merits and de-
merits. To distinguish the empirical semivariogram γ(h) and its estimate γ̂(h)
from the semivariogram model being fit, we introduce the notation γ(h, θ) for
the latter. The vector θ contains all unknown parameters to be estimated from
the data. The model may be a single, isotropic semivariogram function as in
§4.3.2–4.3.5, a model with nugget effect, an anisotropic, or a nested model.

4.5.1 Least Squares and the Semivariogram

The geometric least squares principle enables us to estimate the parameters in
a model describing the mean of a random vector, taking into account the vari-
ation and covariation of the vector elements. To apply least squares estimation
to semivariogram modeling, the mean of the “response” being modeled must
be (a function of) the semivariogram. Hence, the empirical semivariogram
estimators of §4.4 serve as the data for this process. Consider an empiri-
cal semivariogram estimator at k lags. For example, a semivariogram model
γ(h, θ) can be fit to the pseudo-data
    γ̂(h) = [γ̂(h1), · · · , γ̂(hk)]′,

or

    γ̄(h) = [γ̄(h1), · · · , γ̄(hk)]′,
or another empirical estimator. We concentrate in this section on the Math-
eron estimator. The necessary steps in the derivation can be repeated for the
other estimators.
Least squares methods do not make distributional assumptions about γ̂(h)
apart from the first two moments. They consider a statistical model of the
form

    γ̂(h) = γ(h, θ) + e(h),    (4.30)

where γ(h, θ) = [γ(h1, θ), · · · , γ(hk, θ)]′. It is assumed that the (k × 1) vector
of errors in this model has mean 0. The variance-covariance matrix of the
errors, Var[e(h)] = R, typically depends on θ also. We shall write R(θ) if it is
necessary to make this dependence explicit. The appropriate course of action
is then to minimize the generalized sum of squares

    (γ̂(h) − γ(h, θ))′ R(θ)⁻¹ (γ̂(h) − γ(h, θ)).    (4.31)
If R does not depend on θ, this is a standard nonlinear generalized least
squares problem; it is solved iteratively. Otherwise, an iterative re-weighting
scheme is employed since updates to θ̂ should be followed by updates to R(θ̂).
The difficulty of minimizing the generalized sum of squares does not lie with
the presence of a weight matrix. It lies in obtaining R. Following Cressie (1985,
1993), the basic ingredients are derived as follows.
To shorten notation let Tij = Z(si) − Z(sj), hij = si − sj, and assume that
Tij ∼ G(0, 2γ(hij, θ)). Hence, E[Tij²] = 2γ(hij, θ) and Var[Tij²] = 8γ(hij, θ)².
To find Cov[Tij², Tkl²], it is helpful to rely on the following result (Chapter
problem 4.3): if X ∼ G(0, 1) and Y ∼ G(0, 1) with Corr[X, Y] = ρ, then
Corr[X², Y²] = ρ². Hence,

    Cov[Tij², Tkl²] = √(Var[Tij²] Var[Tkl²]) Corr[Tij, Tkl]².

Since

    Cov[Tij, Tkl] = E[Tij Tkl]
                  = E[Z(si)Z(sk) − Z(si)Z(sl) − Z(sj)Z(sk) + Z(sj)Z(sl)]

and E[Z(si)Z(sj)] = C(0) − γ(hij, θ) + µ², we have

    Corr[Tij, Tkl]² = {γ(hil, θ) + γ(hjk, θ) − γ(hjl, θ) − γ(hik, θ)}² / (4γ(hij)γ(hkl)),

which is (2.6.10) in Cressie (1993, p. 96). Finally,

    Cov[Tij², Tkl²] = 2 {γ(hil, θ) + γ(hjk, θ) − γ(hjl, θ) − γ(hik, θ)}².    (4.32)
If i = k and j = l, (4.32) reduces to 8γ(hij, θ)², of course. The variance of
the Matheron estimator at lag hm is now obtained as

    Var[2γ̂(hm)] = (1/|N(hm)|²) Var[ Σ_{N(hm)} Tij² ]
                = (1/|N(hm)|²) Σ_{i,j} Σ_{k,l} Cov[Tij², Tkl²].

Cressie (1985) suggests approximating the diagonal entries of R(θ) as

    Var[γ̂(hm)] ≈ 2 γ(hm, θ)² / |N(hm)|.    (4.33)

This is the appropriate variance formula if the Tij² are uncorrelated, and if
the Gaussian assumption holds. The weighted least squares (WLS) approach
to semivariogram fitting replaces R(θ) by the diagonal matrix W(θ) whose
entries are given by (4.33). Instead of the generalized sum of squares (4.31),
this approach minimizes the weighted sum of squares

    (γ̂(h) − γ(h, θ))′ W(θ)⁻¹ (γ̂(h) − γ(h, θ))
        = Σ_{m=1}^{k} [ |N(hm)| / (2γ(hm, θ)²) ] {γ̂(hm) − γ(hm, θ)}².    (4.34)

Because the off-diagonal entries of R(θ) are appreciable, the WLS criterion
is a poor approximation of (4.31). Since (4.34) can be written as a weighted
sum of squares over the k lag classes, it is a simple matter to fit a semivari-
ogram model with a nonlinear statistics package, provided it can accommodate
weights. A further “simplification” is possible if one assumes that R = φI.
This ordinary least squares (OLS) approach ignores the correlation and the
unequal dispersion among the γ̂(hm). Zimmerman and Zimmerman (1991)
found that the ordinary least squares and weighted least squares estimators
of the semivariogram performed more or less equally well. One does not lose
much by assuming that the γ̂(hm) have equal variance. The greatest loss of ef-
ficiency is not incurred by employing OLS over WLS, but by not incorporating
the correlations among the γ̂(hm).
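In practice, (4.34) is minimized numerically. The sketch below is our own illustration (the exponential model and the scipy optimizer are choices made here, not prescriptions of the text); it consumes the lag means, empirical semivariogram values, and pair counts returned, for example, by the matheron_semivariogram function given earlier.

```python
import numpy as np
from scipy.optimize import minimize

def exp_semivariogram(h, theta):
    # theta = (sill, practical range); exponential semivariogram, cf. (4.11)
    sill, alpha = theta
    return sill * (1.0 - np.exp(-3.0 * h / alpha))

def wls_objective(theta, lags, gamma_hat, npairs):
    # Cressie's weighted least squares criterion (4.34)
    g = exp_semivariogram(lags, theta)
    return np.sum(npairs / (2.0 * g**2) * (gamma_hat - g) ** 2)

def fit_wls(lags, gamma_hat, npairs, start=(1.0, 10.0)):
    res = minimize(wls_objective, x0=np.asarray(start, dtype=float),
                   args=(lags, gamma_hat, npairs),
                   method="L-BFGS-B", bounds=[(1e-6, None), (1e-6, None)])
    return res.x   # estimated (sill, practical range)

# lags, gamma_hat, npairs would come from matheron_semivariogram(coords, z, lag_width)
```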
The covariance and correlation structure of 2γ̂(h) has been studied by Gen-
ton (1998b) under the assumption that Z(s) is Gaussian and by Genton (2000)
for elliptically contoured distributions (see also Genton, He, and Liu, 2001).
The derivations rest on writing the Matheron estimator as

    2γ̂(h) = Z(s)′ A(h) Z(s),

where A(h) is a spatial design matrix of the data at lag h. Applying known
results for quadratic forms in Gaussian random variables, Z(s) ∼ G(µ, Σ(θ)),
yields

    E[2γ̂(h)] = tr[A(h)Σ(θ)]
    Var[2γ̂(h)] = 2 tr[A(h)Σ(θ)A(h)Σ(θ)]
    Cov[2γ̂(hi), 2γ̂(hj)] = 2 tr[A(hi)Σ(θ)A(hj)Σ(θ)],
where tr is the trace operator.
As is the case in (4.33), these expressions depend on the unknown param-
eters. Genton (1998b) assumes that the data are only “slightly correlated”
and puts Σ(θ) ∝ I. It seems rather strange to assume that the data are
uncorrelated in order to model the parameters of the data dependence. Gen-
ton (2000) shows that if the distribution of the data is elliptically contoured,
Σ = φI + 1a′ + a1′, φ ∈ R, and Σ is positive definite, the correlation structure
of the Matheron estimator is

    Corr[2γ̂(hi), 2γ̂(hj)] = tr[A(hi)A(hj)] / √(tr[A²(hi)] tr[A²(hj)]).

4.5.2 Maximum and Restricted Maximum Likelihood

Estimating the parameters of a spatial random field by likelihood methods re-
quires that the spatial distribution (§2.2) be known and is only developed for
the case of the Gaussian random field. We consider here the case of a constant
mean, E[Z(s)] = µ, congruent with a second-order or intrinsic stationarity as-
sumption. Likelihood estimation does not impose this restriction, however,
and we will relax the constant mean assumption in §5.5.2 in the context of
spatial prediction with spatially dependent mean function and unknown co-
variance function. In the meantime, let Z = [Z(s1), · · · , Z(sn)]′ denote the vec-
tor of observations and assume Z(s) ∼ G(µ1, Σ(θ)). The variance-covariance
matrix of Z(s) has been parameterized so that for any estimate θ̂ the variances
and covariances can be estimated by Σ(θ̂). The negative of twice the Gaussian
log likelihood is

    ϕ(µ; θ; Z(s)) = ln{|Σ(θ)|} + n ln{2π} + (Z(s) − 1µ)′ Σ(θ)⁻¹ (Z(s) − 1µ),    (4.35)
and is minimized with respect to µ and θ. If θ is known, the minimum of
(4.35) can be expressed in closed form,

    µ̃ = (1′ Σ(θ)⁻¹ 1)⁻¹ 1′ Σ(θ)⁻¹ Z(s),    (4.36)
the generalized least squares estimator.

Example 4.4 Notice that the first term in (4.36) is a scalar, the inverse
of the sum of the elements of the inverse variance-covariance matrix. In the
special case where Σ(θ) = θI, we obtain 1′ Σ(θ)⁻¹ 1 = n/θ and the generalized
least squares estimator is simply the sample mean,

    µ̃ = (θ/n) 1′ Σ(θ)⁻¹ Z(s) = (θ/n) (1/θ) Σ_{i=1}^{n} Z(si) = Z̄.

Substituting (4.36) into (4.35) yields a (negative) log likelihood function
that depends on θ only. We say that µ has been profiled from the objective
function and term the resulting function as (twice the negative) profile log-
likelihood. This function is then minimized, typically by numerical, iterative
methods because the log likelihood is usually a nonlinear function of the co-
variance parameters. Once the maximum likelihood estimates θ̂ml have been
obtained, their value is substituted into the expressions for the profiled param-
eters to obtain their likelihood estimates. The maximum likelihood estimate
(MLE) of µ is simply

    µ̂ml = (1′ Σ(θ̂ml)⁻¹ 1)⁻¹ 1′ Σ(θ̂ml)⁻¹ Z(s).    (4.37)

This is known as an estimated generalized least squares estimator (EGLSE).


The case of independent and homoscedastic observations again surfaces im-
mediately as a special case. Then we can write Σ(θ) = θI and the EGLS
estimator does not depend on θ̂. Furthermore, there is a closed-form expres-
sion for θ in this case too,

    θ̂ml = (1/n) Σ_{i=1}^{n} (Z(si) − Z̄)².    (4.38)
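For a general Σ(θ) the profiled objective has no closed form and must be minimized numerically. The following sketch is our own illustration (the exponential-plus-nugget covariance and the log-parameterization are assumptions made for the demonstration): it evaluates minus twice the log likelihood (4.35) with µ replaced by its generalized least squares estimate (4.36), and hands the result to an optimizer.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import squareform, pdist

def exp_cov_matrix(coords, sill, alpha, nugget):
    # Assumed covariance model: nugget plus exponential with practical range alpha
    d = squareform(pdist(np.asarray(coords, dtype=float)))
    return sill * np.exp(-3.0 * d / alpha) + nugget * np.eye(len(coords))

def neg2_profile_loglik(log_theta, coords, z):
    sill, alpha, nugget = np.exp(log_theta)      # log scale keeps the parameters positive
    sigma = exp_cov_matrix(coords, sill, alpha, nugget)
    sign, logdet = np.linalg.slogdet(sigma)
    if sign <= 0:
        return np.inf
    sigma_inv = np.linalg.inv(sigma)
    one = np.ones(len(z))
    mu_gls = (one @ sigma_inv @ z) / (one @ sigma_inv @ one)      # (4.36)
    r = z - mu_gls
    n = len(z)
    return logdet + n * np.log(2.0 * np.pi) + r @ sigma_inv @ r   # (4.35) with mu profiled

def fit_ml(coords, z, start=(1.0, 10.0, 0.1)):
    res = minimize(neg2_profile_loglik, x0=np.log(start), args=(coords, z),
                   method="Nelder-Mead")
    return np.exp(res.x)   # (sill, practical range, nugget) at the ML solution
```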

Maximum likelihood estimators have appealing statistical properties. For
example, under standard regularity conditions they are asymptotically Gaus-
sian and efficient. In finite samples they are often biased, however, and the
bias of covariance and variance parameters is typically negative. To illustrate,
consider again the case of independent observations with equal variance θ and
µ unknown. The MLE of θ is given by (4.38) and has bias E[θ̂ml − θ] = −θ/n.
If µ were known, the MLE would be θ̂ml = n⁻¹ Σ_{i=1}^{n} (Z(si) − µ)², which is
an unbiased estimator of θ. The bias of MLEs for variances and covariances
is due to the fact that the method makes no allowance for the loss of degrees
of freedom that is incurred when mean parameters are estimated. Restricted
maximum likelihood (also known as residual maximum likelihood; REML) es-
timation mitigates the bias in MLEs. In certain balanced situations, the bias
is removed completely (Patterson and Thompson, 1971).
The idea of REML estimation is to estimate variance and covariance pa-
rameters by maximizing the likelihood of KZ(s) instead of maximizing the
likelihood of Z(s), where the matrix K is chosen so that E[KZ(s)] = 0. Be-
cause of this property, K is called a matrix of error contrasts and its function
to “take out the mean” explains the name residual maximum likelihood.
REML estimation is developed only for the case of a linear mean function,
otherwise it is not clear how to construct the matrix of error contrasts K.
The more general case of a linear regression structure is discussed in §5.5.3.
The matrix K is not unique, but fortunately, this does not matter for param-
eter estimation and inference (Harville, 1974). In the case considered in this
chapter, namely E[Z(s)] = µ, a simple choice is the (n − 1) × n matrix
    K = [ 1 − 1/n    −1/n     · · ·    −1/n   ]
        [  −1/n     1 − 1/n   · · ·    −1/n   ]
        [   ...       ...      . . .    ...   ]
        [  −1/n      −1/n     · · ·   1 − 1/n ] .
Then KZ(s) is the (n − 1) × 1 vector of differences from the sample mean.
Notice that differences are taken for all but one observation, this is consistent
with the fact that the estimation of µ by Z amounts to the loss of a single
degree of freedom.
Now, KZ(s) ∼ G(0, KΣ(θ)K′) and the restricted maximum likelihood es-
timates are those values of θ that minimize

    ϕ_R(θ; KZ(s)) = ln{|KΣ(θ)K′|} + (n − 1) ln{2π} + Z(s)′K′ (KΣ(θ)K′)⁻¹ KZ(s).    (4.39)

Akin to the profiled maximum likelihood, (4.39) does not contain infor-
mation about the mean µ. In ML estimation, µ was profiled out of the log
likelihood; in REML estimation the likelihood being maximized is that of a
different set of data, KZ(s) instead of Z(s). Hence, there is no restricted max-
imum likelihood estimator of µ. Instead, the estimator obtained by evaluating
(4.36) at the REML estimates θ̂reml is an estimated generalized least squares
estimator:

    µ̂reml = (1′ Σ(θ̂reml)⁻¹ 1)⁻¹ 1′ Σ(θ̂reml)⁻¹ Z(s).    (4.40)

This distinction between ML and REML estimation is important, because
the log likelihood (4.35) can be used to test hypotheses about µ and θ with a
likelihood-ratio test. Likelihood ratio comparisons based on (4.39) are mean-
ingful only when they relate to the covariance parameters in θ and the models
have the same mean structure. This issue gains importance when the mean is
modeled with a more general regression structure, for example, µ = x′(s)β,
see §5.3.3 and Chapter 6.
4.5.3 Composite Likelihood and Generalized Estimating Equations

4.5.3.1 Generalized Estimating Equations

The idea of using generalized estimating equations (GEE) for the estimation of
parameters in statistical models was made popular by Liang and Zeger (1986)
and Zeger and Liang (1986) in the context of longitudinal data analysis. The
technique is an application of estimating function theory and quasi-likelihood.
Let T denote a random vector whose mean depends on some parameter vector
θ, E[T] = f (θ). Furthermore, denote as D the matrix of first derivatives of
the mean function with respect to the elements of θ. If Var[T] = Σ, then
    U(θ; T) = D′ Σ⁻¹ (T − f(θ)),
is an unbiased estimating function for θ in the sense that E[U (θ; T)] = 0
and an estimate θ̂ can be obtained by solving U(θ; t) = 0 (Heyde, 1997).
The optimal estimating function in the sense of Godambe (1960) is the (like-
lihood) score function. In estimating problems where the score is unaccessible
or intractable—as is often the case when data are correlated—U (θ; T) nev-
ertheless implies a consistent estimator of θ. The efficiency of this estima-
tor increases with closeness of U to the score function. For correlated data,
where Σ is unknown or contains unknown parameters, Liang and Zeger (1986)
and Zeger and Liang (1986) proposed to substitute a “working” variance-
covariance matrix W(α) for Σ and to solve instead the estimating equation
    U*(θ; T) = D′ W(α)⁻¹ (T − f(θ)) ≡ 0.
If the parameter vector α can be estimated and α̂ is a consistent estimator,
then for any particular value of α̂

    Ugee(θ; T) = D′ W(α̂)⁻¹ (T − f(θ)) ≡ 0    (4.41)
is an unbiased estimating equation. The root is a consistent estimator of θ,
provided that W(α̂) satisfies certain properties; for example, if W is block-
diagonal, or has specific mixing properties (see Fuller and Battese, 1973; Zeger,
1988).
Initially, the GEE methodology was applied to the estimation of parameters
that model the mean of the observed responses. Later, it was extended to the
estimation of association parameters, variances, and covariances (Prentice,
1988; Zhao and Prentice, 1990). This process commences with the construc-
tion of a vector of pseudo-data. For example, let Tij = (Yi − µi )(Yj − µj ),
then E[Tij ] = Cov[Yi , Yj ] and after parameterizing the covariances, the GEE
methodology can be applied. Now assume that the data comprise the in-
complete sampling of a geostatistical process, Z(s) = [Z(s1), · · · , Z(sn)]′ and
consider the pseudo-data

    Tij⁽¹⁾ = (Z(si) − µ(si))(Z(sj) − µ(sj))
    Tij⁽²⁾ = Z(si) − Z(sj)
    Tij⁽³⁾ = (Z(si) − Z(sj))².

In a stationary process E[Tij⁽¹⁾] = C(si − sj) and Var[Tij⁽²⁾] = E[Tij⁽³⁾] = 2γ(si −
sj). Focus on the n(n − 1)/2 squared differences Tij⁽³⁾ of the unique pairs for the
moment and let hij = si − sj. Parameterizing the mean as E[Tij⁽³⁾] = 2γ(hij, θ),
a generalized estimating equation for θ is

    Ugee(θ; T⁽³⁾) = 2 (∂γ(h, θ)/∂θ′) W⁻¹ (T⁽³⁾ − 2γ(h, θ)) ≡ 0,    (4.42)

where the hij were collected into vector h and

    γ(h, θ) = [γ(h12, θ), · · · , γ(h_{n−1,n}, θ)]′.
In cases where T is pseudo-data and θ models covariances or correlations, it
is common to consider the identity matrix as the working structure W (Pren-
tice, 1988; Zhao and Prentice, 1990; McShane, Albert, and Palmatier, 1997).
Otherwise, third and fourth moments of the distribution of Z(s) are necessary.
For this choice of a working variance-covariance matrix, the estimation prob-
lem has the structure of a nonlinear least squares problem. The generalized
estimating equations can be written as

    Ugee(θ; T⁽³⁾) = 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} (∂γ(hij, θ)/∂θ′) (Tij⁽³⁾ − 2γ(hij, θ)) ≡ 0.    (4.43)

GEE estimates can thus be calculated as the ordinary (nonlinear) least squares
estimates in the model
    Tij⁽³⁾ = 2γ(hij, θ) + δij,    δij ∼ iid (0, φ),
with a Gauss-Newton algorithm. Notice that this is the same as fitting the
semivariogram model by OLS to the semivariogram cloud consisting of {Z(si )−
Z(sj )}2 , instead of fitting the model to the Matheron semivariogram estima-
tor.

4.5.3.2 Composite Likelihood

In the GEE approach only the model for the mean of the data (or pseudo-
data) is required, the variance-covariance matrix is supplanted by a “working”
structure. The efficiency of GEE estimators increases with the closeness of the
working structure to Var[T]. The essential problems that lead to the considera-
tion of generalized estimating equations are the intractability of the likelihood
function and the difficulties in modeling Var[T]. No likelihood is used at any
stage of the estimation problem. A different—but as we will show, related—
strategy is to consider the likelihood of components of T, rather than the
likelihood of T. This is an application of the composite likelihood (CL) idea
(Lindsay, 1988; Lele, 1997; Heagerty and Lele, 1998).
Let Yi , (i = 1, · · · , n) denote random variables with known (marginal)
distribution and let ℓ(θ; yi) denote the log likelihood function for Yi. Then
∂ℓ(θ; yi)/∂θ′ is a true score function and

    S(θ; yi) = ∂ℓ(θ; yi)/∂θ′ = 0
is an unbiased estimating function for θ. Unless the Yi are mutually indepen-
dent, the sum of the component scores S(θ; yi ) is not the score function for
the entire data. Nevertheless,
    Ucl(θ; Y) = Σ_{i=1}^{n} S(θ; Yi)

remains an unbiased estimating function. Setting Ucl(θ; y) ≡ 0 and solving
for θ leads to the composite likelihood estimate (CLE) θ̂cl.
Several approaches exist to apply the composite likelihood idea to the es-
timation of semivariogram parameters for spatial data. We focus here on the
pseudo-data Tij⁽²⁾ = Z(si) − Z(sj) from the previous section (notice that the
GEEs were developed in terms of Tij⁽³⁾). Other setups are considered in Lele
(1997), Curriero and Lele (1999), and the Chapter problems. To perform CL
inference, a distribution must be stipulated. Assuming that the Tij⁽²⁾ are Gaus-
sian and the random field is (at least intrinsically) stationary, we obtain the
component score function

    S(θ; Tij⁽²⁾) = ∂ℓ(θ; Tij⁽²⁾)/∂θ′
                = (∂γ(hij, θ)/∂θ′) [1/(4γ(hij, θ)²)] (Tij⁽³⁾ − 2γ(hij, θ)).

The composite likelihood score function is then

    CS(θ; T⁽²⁾) = 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} (∂γ(hij, θ)/∂θ′) [1/(8γ(hij, θ)²)] (Tij⁽³⁾ − 2γ(hij, θ)).    (4.44)

Comparing (4.44) to (4.43), it is seen that the respective estimating equa-
tions differ only by the factor 1/(8γ(hij, θ)²). This factor is easily explained.
Under the distributional assumption made here for the Tij⁽²⁾ it follows that
Tij⁽³⁾/(2γ(hij, θ)) ∼ χ²₁ and thus

    Var[Tij⁽³⁾] = 8γ(hij, θ)².
The composite likelihood estimating equation (4.44) is a variance-weighted
version of the generalized estimating equation (4.43). Put differently, GEE
estimates for the semivariogram based on pseudo-data Tij⁽³⁾ are identical to
composite likelihood estimates if the GEE working structure takes into account
the unequal dispersion of the pseudo-data. Composite likelihood estimates can
thus be calculated by (nonlinear) weighted least squares in the model

    Tij⁽³⁾ = 2γ(hij, θ) + δij,    δij ∼ (0, 8γ(hij, θ)²),

with a Gauss-Newton algorithm.
The derivation of the composite likelihood estimates commenced with the
pseudo-data $T^{(2)}_{ij}$. One could have also started by defining the pseudo-data
for CL estimation as $T^{(3)}_{ij}$ as in the case of the GEEs. It is left as an exercise
(Chapter problem 4.6) to show that $CS(\theta; \mathbf{t}^{(2)}) = CS(\theta; \mathbf{t}^{(3)})$.

4.5.4 Comparisons

4.5.4.1 Semivariogram or Semivariogram Cloud. What Are the Data?

The composite likelihood and generalized estimating equation approaches can


be viewed as generalizations of the least squares methods in §4.5.1, where the
data consist of the empirical semivariogram cloud, rather than the empiri-
cal semivariogram. The process of averaging the semivariogram cloud into lag
classes has important consequences for the statistical properties of the estima-
tors, as well as for the practical implementation. Let the number of lag classes
be denoted by K. Fitting a semivariogram model to the empirical semivari-
ogram can be viewed as the case of K fixed, even if n → ∞. If the sample size
grows, so does the number of “lag classes” when the empirical semivariogram
cloud are the least-squares data. Hence, if the data for fitting consist of the
$T^{(3)}_{ij}$, $K \to \infty$ as $n \to \infty$.
Important results on the consistency and asymptotic efficiency of least-
squares estimators can be found in Lahiri, Lee, and Cressie (2002); see also
Cressie (1985). If certain regularity conditions on the semivariogram model
are met, and if K is fixed, then OLS, WLS, and GLS estimators are con-
sistent and asymptotically Gaussian distributed under an increasing-domain
asymptotic model. If the asymptotic model is what Lahiri et al. (2002) term
a mixed-increasing-domain asymptotic structure, the least-squares estimators
remain consistent and asymptotic Gaussian, but their rate of convergence is
slower than in the pure increasing-domain asymptotic model. Under the mixed
structure, an increasing domain is simultaneously filled in with additional ob-
servations. The resultant retention of small lag distances with high spatial
dependence reduces the rate of convergence. Nevertheless, the consistency
and asymptotic normality are achieved under either asymptotic structure. A
further interesting result in Lahiri et al. (2002), is the asymptotic efficiency
of the OLS, WLS, and GLS estimators when the number of lag classes equals
the number of semivariogram parameters.
A very different picture emerges when the data for semivariogram estima-
tion by least-squares methods are given by the empirical semivariogram cloud
$(T^{(3)}_{ij}, ||s_i - s_j||)$. Each lag class then contains only a single observation, and
the asymptotic structure assumes that $K \to \infty$ as the number of sampling

sites increases. At first glance, one might conjecture in this case that com-
posite likelihood estimators are more efficient than their GEE counterparts,
because the GEE approach with working independence structure does not
take into account the unequal dispersion of the pseudo-data. Recall that CL
estimation as described above entails WLS fitting of the semivariogram model
to the empirical semivariogram cloud, while GEE estimation with working in-
dependence structure is OLS estimation. Write the two non-linear models as
$$
\begin{aligned}
T^{(3)}_{ij} &= 2\gamma(h_{ij},\theta) + \epsilon_{ij}, \qquad \epsilon_{ij}\ \text{iid}\ (0,\,\phi) \\
T^{(3)}_{ij} &= 2\gamma(h_{ij},\theta) + \epsilon_{ij}, \qquad \epsilon_{ij}\ \text{iid}\ (0,\, 8\gamma(h_{ij},\theta)^2).
\end{aligned}
$$

The mean and variance are functionally dependent in the CL approach. As a


consequence, the WLS weights depend on the parameter vector which biases
the estimating function. Fedorov (1974) established in the case of uncorre-
lated data, that WLS estimators are inconsistent if the weights depend on the
model parameters. Fedorov’s results were applied to semivariogram estimation
by Müller (1999). To demonstrate the consequence of parameter-dependent
weights in the asymptotic structure where n → ∞, K → ∞, we apply the
derivations in Müller (1999) to our CL estimator of the semivariogram. Recall
that this estimator of θ is
$$
\hat{\theta}_{cl} = \arg\min_{\theta}\ \sum_{i=1}^{n-1}\sum_{j=i+1}^{n}
\frac{1}{8\gamma(h_{ij},\theta)^2}\left[T^{(3)}_{ij} - 2\gamma(h_{ij},\theta)\right]^2.
$$

Now replace site indices with lag classes and let $K \to \infty$. If $\theta^*$ denotes the
true parameter vector, then the limit of the WLS criterion is
$$
\sum_{k} \frac{1}{8\gamma(h_k,\theta)^2}\,
\left(2\gamma(h_k,\theta^*) - 2\gamma(h_k,\theta) + \epsilon_k\right)^2.
$$

Applying the law of large numbers and rearranging terms, we find that this
is equivalent to the minimization of
$$
\frac{3}{2}\sum_{k=1}^{\infty}\left[\left(\frac{\gamma(h_k,\theta^*)}{\gamma(h_k,\theta)}\right)^2
+ \frac{1}{3} - \frac{2}{3}\,\frac{\gamma(h_k,\theta^*)}{\gamma(h_k,\theta)}\right],
$$

which, in turn, is equivalent to the minimization of
$$
\sum_{k=1}^{\infty}\left[\frac{\gamma(h_k,\theta^*)}{\gamma(h_k,\theta)} - \frac{1}{3}\right]^2.
$$

The particular relationship between the mean and variance of $T^{(3)}_{ij}$ and
the dependence of the variance on model parameters has created a situation
where the semivariogram evaluated at the CL estimator is not consistent for
γ(hk , θ ∗ ). Instead, it consistently estimates 3γ(hk , θ ∗ ). The “bias correction”
for the CL estimator is remarkably simple.
4.5.4.2 Ordinary, Weighted, or Generalized Least Squares

If correlated data are fit by least squares methods, we would like to take into
account the variation and covariation of the observations. In general, GLS esti-
mation is more efficient than WLS estimation, and it, in turn, is more efficient
than OLS estimation. We are tacitly implying here that the covariance matrix
of the data is known for GLS estimation, and that the weights are known for
WLS estimation. The preceding discussion shows, however, that if one works
with the semivariogram cloud, one may be better off fitting the semivariogram
model by OLS, than by weighted least squares without bias correction, be-
cause the weights depend on the semivariogram parameters. Since the bias
correction is so simple (multiply by 1/3), it is difficult to argue in favor of
OLS. Müller (1999) also proposed an iterative scheme to obtain consistent
estimates of the semivariogram based on WLS. In the iteratively re-weighted
algorithm the weights are computed for current estimates of θ and held fixed.
Based on the non-linear WLS estimates of θ, the weights are re-computed and
the semivariogram model is fit again. The process continues until changes in
the parameter estimates from two consecutive fits are sufficiently small. The
performance of the iteratively re-weighted estimators was nearly as good as
that of the iterated GLS estimator in the simulation study of Müller (1999).
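A minimal sketch of such an iteratively re-weighted fit is given below (Python; exponential semivariogram, the $1/(8\gamma(h,\theta)^2)$ weights discussed above, and a simple convergence rule are illustrative assumptions, not Müller's published code).

```python
# Sketch: iteratively re-weighted WLS for the semivariogram cloud. Weights are
# computed at the current estimate, held fixed within a WLS step, then updated.
import numpy as np
from scipy.optimize import least_squares

def gamma_exp(h, theta):
    sill, rng = theta
    return sill * (1.0 - np.exp(-h / rng))

def irwls_semivariogram(h, t3, theta0=(1.0, 1.0), max_iter=20, tol=1e-6):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        w = 1.0 / (8.0 * gamma_exp(h, theta) ** 2 + 1e-12)   # weights held fixed
        resid = lambda th: np.sqrt(w) * (t3 - 2.0 * gamma_exp(h, th))
        theta_new = least_squares(resid, x0=theta,
                                  bounds=([1e-8, 1e-8], [np.inf, np.inf])).x
        if np.max(np.abs(theta_new - theta)) < tol:
            return theta_new
        theta = theta_new
    return theta
```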
The difficulty of a pure or iterated GLS approach lies in the determination
of the full covariance structure of the pseudo-data and the possible size of the
covariance matrix. For Gaussian random fields, variances and covariances of
the $T^{(3)}_{ij}$ are easy to ascertain and can be expressed as a function of semivar-
iogram values. For the empirical semivariogram in a Gaussian random field,
the covariance matrix of the Matheron estimator $\hat{\gamma}(h)$ is derived in Genton
(1998b). In either case, the covariance matrix depends on θ and an iterative
approach seems prudent. Based on a starting value θ0, compute Var[T(3)] or
Var[$\hat{\gamma}(h)$] and estimate the first update θ1 by (estimated) generalized least
squares. Recompute the variance-covariance matrix and repeat the GLS step.
This process continues until changes in subsequent estimates of θ are minor.
The difficulty with applying GLS estimation to the semivariogram cloud is the
size of the data vector. Since the set of pseudo-data contains up to n(n − 1)/2
points, compared to K ≪ n if you work with the empirical semivariogram,
building and inverting Var[T(3) ] quickly becomes computationally prohibitive
as n grows.
If the distribution of the data is elliptically contoured, the simplification
described in Genton (2000) can be put in place, provided the covariance matrix
of the data is of the form described there (see also §4.5.1 in this text). This
eliminates the covariance parameters θ from the GLS weight matrix.

4.5.4.3 Binning Versus Not Binning

Unless the data are regularly spaced, least-squares fitting of semivariogram


models invariably entails the grouping of squared differences (or functions
thereof) into lag classes. Even for regularly spaced data, binning is often
necessary to achieve a recommended number of pairs in each lag class. The
process of binning itself is not without controversy. The choice (number and
spacing) of lag classes affects the resulting semivariogram cloud. The choice
of the largest lag class for which to calculate the empirical semivariogram can
eliminate values with large variability. The user who has a particular semi-
variogram model in mind may be tempted to change the width and number
of lag classes so that the empirical semivariogram resembles the theoretical
model. Combined with trimming values for larger lags, the process is slanted
towards creating a set of data to which a model fits well. The process of fitting
a statistical model entails the development of a model that supports the data;
not the development of a set of data that supports a model.
The CL and GEE estimation methods are based on the semivariogram cloud
and avoid the binning process. As discussed above, the choice of parameter-
dependent weights can negatively affect the consistency of the estimates, how-
ever, and a correction may be necessary. Estimation based on the semivari-
ogram cloud can also “trim” values at large lags, akin to the determination of
a largest lag class in least-squares fitting. Specifically, let wij denote a weight
associated with {Z(si ) − Z(sj )}2 . For example, take
$$
w_{ij} = \begin{cases} 1 & \text{if } ||s_i - s_j|| \le c \\ 0 & \text{if } ||s_i - s_j|| > c, \end{cases}
$$
and modify the composite likelihood score equation as
$$
CS(\theta; \mathbf{T}^{(2)}) = 2\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} w_{ij}\,
\frac{\partial\gamma(h_{ij},\theta)}{\partial\theta'}\,
\frac{1}{8\gamma(h_{ij},\theta)^2}\left[T^{(3)}_{ij} - 2\gamma(h_{ij},\theta)\right],
$$

to exclude pairs from estimation whose distance exceeds c. One might also
use weights that depend on the magnitude of the residual $(t^{(3)}_{ij} - 2\gamma(h_{ij},\theta))$
to “robustify” the estimator.
The ML and REML estimators also avoid the binning process altogether.
In fact, squared differences between observed values do not play a role in
likelihood estimation. The objective functions (4.35) and (4.39) involve gen-
eralized sums of squares between observations and their means, not squared
differences between the Z(si ). It is thus not correct to cite as a disadvan-
tage of likelihood methods that one cannot eliminate from estimation pairs
at large lags. The lag structure figures into the structure of the covariance
matrix Σ(θ), which reflects the variation and covariation of the data. Likeli-
hood methods use an objective function built on the data, not an objective
function built on pseudo-data that was crafted by the analyst based on the
spatial configuration.
Comparing CL with ML (or REML) estimation, it is obvious that composite
likelihood methods are less efficient, since CS(θ; T(2) ) is not the score function
of Z(s). Computationally, CL and GEE estimation are more efficient, however.
Minimizing the maximum or restricted maximum likelihood score function is
an iterative process which involves inversion of the (n × n) matrix Σ(θ) at


every iteration. For n large, this is a time-consuming process. In CL/GEE
estimation the task of inverting an (n × n) matrix once at each iteration
is replaced with the manipulation of small (q × q) matrices, where q is the
number of elements in θ. On the other hand, CL/GEE estimation manipulates
(n − 1)/2 times as many “data” points.

Example 4.2 (C/N ratios. Continued) For the C/N ratio data we applied
the previously discussed estimation methods assuming an exponential covari-
ance structure or semivariogram. Table 4.2 displays the parameter estimates
for five estimation methods with and without a nugget effect and Figure 4.10
displays the fitted semivariograms. The least-squares fits used the classical
empirical semivariogram (Matheron estimator).
It is noteworthy in Table 4.2 that the inclusion of a nugget effect tends to
raise the estimate of the range, a common phenomenon. In other words, the
decrease in spatial continuity due to measurement error is compensated to
some degree by an increase in the range which counteracts the decline in the
spatial autocorrelations on short distances. Unfortunately, the OLS, WLS,
CL, and GEE estimation methods do not produce reliable standard errors
for the parameter estimates and the necessity for inclusion of a nugget effect
must be determined on non-statistical grounds. These methods estimate the
semivariogram parameters from pseudo-data and do not account properly for
the covariation among the data points.

Table 4.2 Estimated parameters for C/N ratios with exponential covariance struc-
ture; see also Figure 4.10.

                                Estimates of
  Method      Nugget      (Partial) Sill      Practical Range

  OLS          0.             0.279                 45.0
  OLS          0.111          0.174                 85.3
  WLS          0.             0.278                 31.4
  WLS          0.117          0.166                 79.4
  CL           0.             0.278                 32.6
  CL           0.107          0.176                 75.1
  GEE          0.             0.281                 45.9
  GEE          0.110          0.173                 77.2
  REML         0.             0.318                 41.6
  REML         0.118          0.215                171.1

Only likelihood-based methods produce a reliable basis for model com-


parisons on statistical grounds. The negative of two times the restricted log
likelihoods for the nugget and no-nugget models are 264.83 and 277.79, re-
spectively. The likelihood ratio statistic to test whether the presence of the
nugget effect significantly improves the model fit is 277.79 − 264.83 = 12.96
and is significant; Pr(χ21 > 12.96) < 0.00032. Based on this test, the model
should contain a nugget effect. On the other hand, the REML method pro-
duces by far the largest estimate of the variance of the process (0.318 and
0.215 + 0.118 = 0.333), and the REML estimate of the range in the no-nugget
model (171.1) appears large compared to other methods. Recall that OLS,
WLS, CL, and GEE estimates are obtained from a data set in which the
largest lag does not coincide with the largest distance in the data. Pairs at
large lags are often excluded from the analysis. In our case, only data pairs
with lags less than 6 × 35 = 210 feet were used in the OLS/WLS/CL/GEE
analyses. The ML and REML methods cannot curtail the data.
The consequences of using the empirical semivariogram cloud (GEE/CL)
versus the empirical semivariogram (OLS/WLS) are minor for these data. The
OLS and GEE estimates are quite close, as are the WLS and CL estimates.
This is further amplified in a graph of the fitted semivariograms (Figure 4.10).
The CL and WLS fits are nearly indistinguishable. The same holds for the
OLS and GEE fits.
Performing a weighted analysis does, however, affect the estimate of the
(practical) range in the no-nugget models. Maybe surprisingly, the CL esti-
mates do not exhibit the large bias that was mentioned earlier. Based on the
previous derivations, one would have expected the CL estimator of the prac-
tical range to be much larger on average than the consistent WLS estimator.
Recall that the lack of consistency of the CL estimator—which is a weighted
version of the GEE estimator—was established based on an asymptotic model
in which the number of observations as well as the number of lag classes grows
to infinity. In this application, we curtailed the max lag class (to 6 × 35 = 210
feet), a practice we generally recommend for composite likelihood and gen-
eralized estimating equation estimation. An asymptotic model under which
the domain and the number of observations increase is not meaningful in this
application. The experimental field cannot be arbitrarily increased. We accept
the CL estimators without “bias” correction.
When choosing between pseudo-data based estimates and ML/REML esti-
mates, the fitted semivariograms are sometimes displayed together with the
empirical semivariograms. The CL/GEE and ML/REML estimates in general
will fare visually less favorably compared to OLS/WLS estimates. CL/GEE
and ML/REML estimates do not minimize a (weighted) sum of squares be-
tween the model and the empirical semivariogram. The least squares estimates
obviously fit the empirical semivariogram best; that is their job. This does not
imply that least squares yields the best estimates from which to reconstruct
the second-order structure of the spatial process.
[Figure: panels “No-nugget Models” and “Nugget Models”; curves labeled REML,
OLS/GEE, and CL/WLS; x-axis Distance.]
Figure 4.10 Fitted semivariograms for C/N data with and without nugget effect con-
structed from parameter estimates in Table 4.2.

4.6 Nonparametric Estimation and Modeling

Choosing a valid parametric semivariogram or covariogram model and fit-


ting it to the semivariogram cloud or the empirical semivariogram ensures
that the predicted variogram or covariogram has the needed properties: con-
ditional negative definiteness of the semivariogram and positive definiteness of
the covariance function. On the downside, however, one is restricted to a rela-
tively small number of semivariogram models. Few empirical semivariograms
exhibit in practical applications the textbook behavior that makes choice of
a semivariogram model a simple matter. Often, the empirical semivariogram
appears erratic or wavy. The large sample variance of $\hat{\gamma}(h)$ and related esti-
mators for large lags and small number of pairs makes separating noise from
structure sometimes difficult (see §4.4).
To achieve greater flexibility, the modeler can resort to more complicated
semivariogram models that accommodate waviness and rely on nesting of
models (§4.3.6). Few parametric models incorporate positive and negative au-
tocorrelation and nesting of semivariograms is not without controversy. Stein
(1999, p. 14), for example, cautions about the common practice of nesting
spherical semivariograms. Since linear combinations of valid covariance func-
tions yield valid covariance functions, we should not give up on nested models
too quickly. The trick, presumably, is in taking linear combinations of the
right models.
What is termed the “nonparametric” approach to semivariogram modeling
consists of choosing a family of semivariogram models that is sufficiently flex-
ible to accommodate a wider range of shapes than the models described in


§4.3. The moniker ought not connote a rank-based approach. The resulting
models are parametric in the sense that they depend on a fixed number of un-
known quantities that are estimated from the data. The three approaches we
describe in this section have in common that the fitting process can be viewed
as a weighted nonlinear regression problem. The attribute “nonparametric”
is deserved, because, as in the case of nonparametric regression models, cer-
tain model parameters govern the smoothness of the resulting semivariogram
estimate and need to be chosen by some mechanism.
Whereas in the development of nonparametric (local) methods much energy
has been spent on the problem of determining the appropriate smoothness of
a fit based on data, nonparametric semivariogram modeling has not reached
this stage yet. Not having solved all issues related to the determination of the
degree of smoothness should not deter from exploring the intriguing ideas un-
derpinning these models. A word of caution is nevertheless in order, because
the nonparametric approach might enable a blackbox approach to geostatis-
tical analysis. It is not difficult to envision a scenario where a nonparametric
semivariogram is derived by completely data driven methods with little or
no interference from the analyst. The fitted semivariogram is then used to
construct and solve the kriging equations to produce spatial predictions. The
result of this blackbox analysis is a map of observed and predicted values which
forms the basis of decisions. If the estimated semivariogram drives the results
of spatial prediction, one should not develop this important determinant in a
black box. It is OK to look.
Nonparametric semivariogram fitting makes use of the linear combination
property of covariance functions. The “linear combination” often resolves to
integration and the basic model components being integrated or combined are
typically more elementary than the parametric models in §4.3. We distinguish
two basic approaches in this section; one based on the spectral representation
of a spatial random field, the other on a moving average representation of the
semivariogram itself.

4.6.1 The Spectral Approach

In §2.5.2 it was shown that the class of valid covariance functions in Rd can
be expressed as
$$
C(\mathbf{h}) = \int_{-\infty}^{\infty}\!\!\cdots\!\int_{-\infty}^{\infty} \cos(\boldsymbol{\omega}'\mathbf{h})\,S(d\boldsymbol{\omega}),
$$
and in the isotropic case we have
$$
C(h) = \int_{0}^{\infty} \Omega_d(h\omega)\,F(d\omega), \qquad (4.45)
$$

where Ωd (t) is defined through (4.8) in §4.3.1.


From (4.45) it is seen that this representation is a special case of a nested
model, where a linear combination is taken of the basic covariance functions Ωd(hω).

4.6.1.1 Using Step Functions

A convenient representation of the covariogram can be achieved by considering


step functions for F (ω) in (4.45). If F has positive mass at nodes t1 , · · · , tp
with respective masses w1 , · · · , wp , then the covariance function of the process
can be represented as
$$
C(h, \mathbf{t}, \mathbf{w}) = \sum_{i=1}^{p} w_i\,\Omega_d(h t_i). \qquad (4.46)
$$
Notice that $C(0) = \text{Var}[Z(s)] = \int_0^\infty F(d\omega)$. In the discrete representation
(4.46), the corresponding result is $C(0) = \sum_{i=1}^{p} w_i$, so that the semivariogram
is given by
$$
\gamma(h, \mathbf{t}, \mathbf{w}) = \sum_{i=1}^{p} w_i\left(1 - \Omega_d(h t_i)\right). \qquad (4.47)
$$
In general, an estimate of γ(h) can be obtained by fitting γ(h, t, w) to the
semivariogram cloud or the empirical semivariogram by a constrained, iter-
ative (reweighted) least squares approach. Details are given by Shapiro and
Botha (1991) and Cherry, Banfield, and Quimby (1996). Gorsich and Gen-
ton (2000) use a least-squares objective function to find an estimator for the
derivative of the semivariogram.
As part of this procedure the user must select the placement of the nodes
ti and the estimation is constrained because only nodes with wi ≥ 0 are
permissible. Ecker and Gelfand (1997) use a Bayesian approach where either
the nodes ti or the weights wi are random which is appealing because the
number of nodes can be substantially reduced.
The number of nodes varies widely between authors, from five to twenty,
to several hundred. Shapiro and Botha (1991) choose p to be one less than
the number of points in the empirical semivariogram cloud and spread the
nodes out evenly, the spacing being determined on an ad-hoc basis. Because
of the large number of nodes relative to the number of points in the empirical
semivariogram, these authors impose additional constraints to prevent over-
fitting and noisy behavior. For example, by bounding the slope of the fitted
function, or by ensuring monotonicity or convexity.
Gorsich and Genton (2000) point out the importance of choosing the small-
est and largest node wisely to capture the low frequency components of the
semivariogram and to avoid unnecessary oscillations. Genton and Gorsich
(2002) argue that the number of nodes should be smaller than the number of
unique lags and the nodes should coincide with zeros of the Bessel functions.
If k denotes the number of lags, these authors recommend choosing t1, · · · , tp
such that p < k and Jν(ti) = 0.
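The sketch below illustrates one way to carry out such a fit when Ωd(t) = J0(t) (processes in R2): with the nodes held fixed (equally spaced here, an ad-hoc choice), the nonnegative weights solve a constrained linear least squares problem against the empirical semivariogram. Function names and defaults are illustrative, not the Shapiro–Botha algorithm verbatim.

```python
# Sketch: fit gamma(h, t, w) = sum_i w_i (1 - J_0(h t_i)) with w_i >= 0
# to the empirical semivariogram by nonnegative least squares.
import numpy as np
from scipy.optimize import nnls
from scipy.special import j0

def fit_spectral_semivariogram(h_lags, gamma_hat, p=10, t_max=1.0):
    t = np.linspace(t_max / p, t_max, p)        # fixed nodes t_1, ..., t_p
    X = 1.0 - j0(np.outer(h_lags, t))           # basis values (1 - J_0(h t_i))
    w, _ = nnls(X, gamma_hat)                   # weights constrained to w_i >= 0
    return t, w, X @ w                          # nodes, weights, fitted values
```

The sill of the fitted model is the sum of the estimated weights, and the semivariogram at a new lag h is obtained by evaluating the weighted sum of basis functions at that lag.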

4.6.1.2 Using Parametric Kernel Functions

As with other nonparametric methods it is of concern to determine the appro-


priate degree of smoothing that should be applied to the data. The selection
of number and placement of nodes and the addition of constraints on the
derivative of the fitted semivariogram affect the smoothness of the resulting
fit. The smoothness of the basis function Ωd also plays an important role.
The nodes t1 , · · · , tp , together with the largest node tp and the lag distances
for which the covariogram is to be estimated determine the number of sign
changes of the basis function. The more sign changes are permitted, the less
smooth the resulting covariogram. The nonparametric approach discussed so
far rests on replacing F (ω) with a step function and estimating the height of
the steps given the nodes. This reduces integration to summation. An alterna-
tive approach that circumvents the step function idea uses parametric kernel
functions and can be implemented if dF (ω) = f (ω)dω.
Let f (θ, ω) denote a function chosen by the user. For a particular value of
θ the covariance of two observations spaced ||h|| = h distance units apart is
$$
C(\theta, h) = \int_0^\infty \Omega_d(h\omega)\,f(\theta,\omega)\,d\omega. \qquad (4.48)
$$

If θ can be estimated from the data, an estimate of the covariogram at lag h
can be obtained as
$$
\hat{C}(\theta, h) = C(\hat{\theta}, h) = \int_0^\infty \Omega_d(h\omega)\,f(\hat{\theta},\omega)\,d\omega.
$$

The main distinction between this and the previous approach lies in the
solution of the integration problem. In the step function approach F (ω) is
discretized to change integration to summation. The “unknowns” in the step
function approach are the number of nodes, their placement, and the weights.
Some of these unknowns are fixed a priori and the remaining unknowns are
estimated from the data. In the kernel approach we fix a priori only the class
of functions f (θ, ω). Unknowns in the estimation phase are the parameters
that index the kernel function f (θ, ω). If the integral in (4.48) cannot be
solved in closed form, we can resort to a quadrature or trapezoidal rule to
calculate $C(\hat{\theta}, h)$ numerically. The parametric kernel function approach for
semivariogram estimation is appealing because of its flexibility and parsimony.
Valid semivariograms can be constructed with a surprisingly small number of
parameters. The process of fitting the semivariogram can typically be carried
out by nonlinear least squares, often without constraints.
The need for F to be non-decreasing and $\int_0^\infty f(\theta,\omega)\,d\omega < \infty$ suggests drawing
on probability density functions in the construction of f(θ, ω). A particu-
larly simple, but powerful, choice is as follows. Suppose G(θ) is the cumulative
distribution function (cdf) of a U(θl, θu) random variable so that θ = [θl, θu]′.
Then define
$$
F(\theta, \omega) = \begin{cases} 0 & \omega < 0 \\ G(\theta) & 0 \le \omega \le 1 \\ 1 & \omega > 1. \end{cases} \qquad (4.49)
$$

The kernel f (θ, ω) is positive only for values of ω between 0 and 1. As a con-
sequence, the largest value for which the basis function Ωd (hω) is evaluated,
corresponds to the largest semivariogram lag. This largest lag may imply too
many or too few sign changes of the basis function (see Figure 4.1 on page
142). In the latter case, you can increase the bounds on ω in (4.49) and model
$$
F(\theta, \omega) = \begin{cases} 0 & \omega < 0 \\ G(\theta) & 0 \le \omega \le b \\ 1 & \omega > b. \end{cases}
$$

The weighting of the basis functions is controlled by the shape of the kernel
between 0 and b. Suppose that θl = 0 and G(θ) is a uniform cdf. Then all
values of Ω(hω) receive equal weight 1/θu for 0 ≤ ω ≤ θu , and weight 0
everywhere else. By shifting the lower and upper bound of the uniform cdf,
different parts of the basis functions are weighted. For θl small, the product
hω is small for values where f(θ, ω) ≠ 0 and the basis functions will have few
sign changes on the interval (0, hω) (Figure 4.11).

[Figure: x-axis Lag.]

Figure 4.11 Semivariograms with sill 1.0 constructed from covariance functions
(4.48) for processes in R2 with uniform kernels of width θu − θl = 0.2 for vari-
ous values of θl = −0.1, 0, 0.1, 0.2, 0.4, 0.6, 0.8. As θl increases the semivariogram
becomes more and more wavy. The basis function is Ωd(t) = J0(t).

To adjust for the variance of the process, a sill parameter is added and we
model
$$
C(\theta, h) = \sigma^2 \int_0^b \Omega(h\omega)\,f(\theta,\omega)\,d\omega. \qquad (4.50)
$$

The fitting process starts with the calculation of an empirical semivariogram


at k lag classes. Any of the estimators in §4.4 can be used. It is important,
however, to ensure that a sufficient number of pairs are available at all lag
classes. Nonparametric semivariogram estimators allow for positive and nega-
tive autocorrelations and hence can have a wavy appearance. If the fit follows
a wavy empirical semivariogram estimator, you want to be confident that this
behavior is not spurious, caused by large variability in $\hat{\gamma}(h_k)$ due to an in-
sufficient number of pairs. Our fitting criterion is a nonlinear, ordinary least
squares criterion,
$$
Q(\theta) = \sum_{i=1}^{k}\left[\hat{\gamma}(h_i) - \left(\sigma^2 - C(\theta, h_i)\right)\right]^2, \qquad (4.51)
$$

where C(θ, h) is given in (4.50). The parameter σ 2 is akin to the sill of classi-
cal semivariogram models in the sense that the nonparametric semivariogram
oscillates about σ 2 and will approach it asymptotically. In practical imple-
mentation, the integral in (4.50) can often be approximated with satisfactory
accuracy by a sum, applying a trapezoidal or quadrature rule. This is helpful if
the fitting procedure allows array processing, such as the NLIN or NLMIXED
procedures of SAS/STAT®. An example of fitting a semivariogram with the
parametric kernel approach is presented at the conclusion of §4.6.3.
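For readers who do not use SAS, the sketch below shows the same idea in Python: the integral in (4.50) is approximated by the trapezoidal rule with a uniform kernel and Ωd = J0, and (4.51) is minimized by nonlinear least squares. The parameterization (θl plus a positive width), grid size, and starting values are illustrative assumptions.

```python
# Sketch: parametric kernel estimator (4.50)-(4.51) with a uniform kernel on
# [theta_l, theta_u] and Omega_d = J_0; the omega-integral uses the trapezoidal rule.
import numpy as np
from scipy.optimize import least_squares
from scipy.special import j0

def cov_kernel(h, sigma2, theta_l, theta_u, n_grid=200):
    omega = np.linspace(theta_l, theta_u, n_grid)
    f = np.full_like(omega, 1.0 / (theta_u - theta_l))       # uniform kernel density
    return sigma2 * np.trapz(j0(np.outer(h, omega)) * f, omega, axis=1)

def fit_parametric_kernel(h_lags, gamma_hat, start=(1.0, 0.0, 0.5)):
    # parameters: sigma2, theta_l, width, with theta_u = theta_l + width
    def resid(par):
        sigma2, theta_l, width = par
        return gamma_hat - (sigma2 - cov_kernel(h_lags, sigma2, theta_l, theta_l + width))
    return least_squares(resid, x0=start,
                         bounds=([1e-8, 0.0, 1e-3], [np.inf, 2.0, 2.0])).x
```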
The uniform kernel is simple to work with but you may not want to weigh
the basis functions equally for values of ω where f (θ, ω) is nonzero. Kernel
functions with unequal weighing can be constructed easily by drawing on other
probability densities. For example,
$$
f(\mu, \xi, \omega) = \begin{cases} \exp\{-0.5(\omega-\mu)^2/\xi\} & 0 \le \omega \le b \\ 0 & \text{otherwise} \end{cases} \qquad (4.52)
$$
is a two-parameter kernel derived from the Gaussian density. The kernel can
be scaled on 0 ≤ ω ≤ b so that it integrates to one, for example,
$$
f(\mu, \xi, \omega) = \frac{\exp\{-0.5(\omega-\mu)^2/\xi\}}{\int_0^b \exp\{-0.5(u-\mu)^2/\xi\}\,du}.
$$

4.6.2 The Moving-Average Approach

The family of semivariogram (or covariogram) models in the previous subsec-


tion is derived by starting from a spectral representation of C(h). The search
for flexible models can also commence from a representation in the spatial
domain, the convolution representation. Recall from page 58 that a random
field Z(s) can be expressed in terms of a kernel function K(u) and a white
noise excitation field X(s) as
$$
Z(s) = \int_{\mathbf{u}} K(s - u)\,X(u)\,du.
$$
It was shown in §2.4.2 that
$$
\text{Cov}[Z(s), Z(s+h)] = \sigma^2_x \int_{\mathbf{u}} K(u)\,K(h+u)\,du,
$$
and thus
$$
\text{Var}[Z(s)] = \sigma^2_x \int_{\mathbf{u}} K(u)^2\,du. \qquad (4.53)
$$
These results are put to use in finding the semivariogram of the convolved
process:
$$
\begin{aligned}
\gamma(h) &= \text{Var}[Z(s) - Z(s+h)] \\
&= \text{Var}\left[\int_{\mathbf{u}} K(s-u)X(u)\,du - \int_{\mathbf{u}} K(s+h-u)X(u)\,du\right] \\
&= \text{Var}\left[\int_{\mathbf{u}} \left(K(s-u) - K(s+h-u)\right) X(u)\,du\right] \\
&= \text{Var}\left[\int_{\mathbf{u}} P(s-u, h)\,X(u)\,du\right].
\end{aligned}
$$
The last expression is the variance of a random field U(s) with kernel P(u, h).
Applying (4.53) one finds
$$
\gamma(h) = \text{Var}[Z(s) - Z(s+h)] = \int_{\mathbf{u}} P(s-u, h)^2\,du
= \int_{\mathbf{u}} \left(K(s-u) - K(s+h-u)\right)^2\,du.
$$
Because s is arbitrary and C(h) is an even function, the expression simplifies to
$$
\gamma(h) = \int_{\mathbf{u}} \left(K(u) - K(u-h)\right)^2\,du, \qquad (4.54)
$$
the moving average formulation of the semivariogram.

Example 4.5 Recall Example 2.3, where white noise was convolved with a
uniform and a Gaussian kernel function. The resulting correlation functions in
Figure 2.5 on page 60 show a linear decline of the correlation for the uniform
kernel up to some range r. The correlation remains zero afterward. Obviously,
this correlation function corresponds to a linear isotropic semivariogram model
$$
\gamma(h) = \begin{cases} \theta|h| & |h| \le r \\ \theta r & |h| > r \end{cases}
$$
with sill θr (θ, r > 0). This is easily verified with (4.54). For a process in R1
define $K(u) = \sqrt{\theta/2}$ if 0 ≤ u ≤ r and 0 elsewhere. Then $(K(u)-K(u-h))^2$ is
0 where the rectangles overlap and θ/2 elsewhere. For example, if 0 ≤ h ≤ r,


we obtain
$$
\gamma(h) = \int_0^h \tfrac{1}{2}\theta\,du + \int_r^{h+r} \tfrac{1}{2}\theta\,du
= \tfrac{1}{2}\theta h + \tfrac{1}{2}\theta h = \theta h,
$$
and $\gamma(h) = 2\int_0^r \theta/2\,du = \theta r$ for $h > r$ or $h < -r$.

Barry and Ver Hoef (1996) have drawn on these ideas and extended the
basic moving average procedure in more than one dimension, also allowing for
anisotropy. Their families of variogram models are based on moving averages
using piecewise linear components. From the previous discussion it is seen that
any valid—square integrable—kernel function K(u) can be used to construct
nonparametric semivariograms from moving averages.
The approach of Barry and Ver Hoef (1996) uses linear structures which
yields explicit expressions for the integral in (4.54). For a one-dimensional
process you choose a range c > 0 and divide the interval (0, c] into k equal
subintervals of width w = c/k. Let f (u, θi ) denote the rectangular function
with height θi on the ith interval,
$$
f(u, \theta_i) = \begin{cases} \theta_i & (i-1) < u/w \le i \\ 0 & \text{otherwise}. \end{cases} \qquad (4.55)
$$
The moving average function K(u) is a step function with steps θ1, · · · , θk,
$$
K(u\,|\,c, k) = \sum_{i=1}^{k} f(u, \theta_i).
$$

When the lag distance h for which the semivariogram is to be calculated is an


integer multiple of the width, h = jw, j = 1, 2, · · ·, the semivariogram becomes
$$
\gamma(h) = w\left[\sum_{i=1}^{k} \theta_i^2 - \sum_{i=j+1}^{k} \theta_i\,\theta_{i-j}\right].
$$
When the lag distance |h| equals or exceeds the range c, that is, when j ≥ k,
the second sum vanishes and the semivariogram remains flat at $\sigma^2 = w\sum_{i=1}^{k}\theta_i^2$,
the sill value. When h is not an integer multiple of the width w, the semivariogram
value is obtained by interpolation.
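A sketch of evaluating this one-dimensional moving-average semivariogram from given step heights is shown below; lags between grid points are handled by linear interpolation as just described. Names and defaults are illustrative.

```python
# Sketch: one-dimensional moving-average semivariogram evaluated from step
# heights theta_1, ..., theta_k on (0, c].
import numpy as np

def ma_semivariogram(h, theta, c):
    theta = np.asarray(theta, dtype=float)
    k = theta.size
    w = c / k                                         # sub-interval width
    j_grid = np.arange(k + 1)                         # j = 0, 1, ..., k
    gamma_grid = np.array([w * (np.sum(theta**2) - np.sum(theta[j:] * theta[:k - j]))
                           for j in j_grid])          # gamma at h = j*w
    sill = w * np.sum(theta**2)                       # flat beyond the range c
    return np.where(np.abs(h) >= c, sill,
                    np.interp(np.abs(h), j_grid * w, gamma_grid))
```

Estimating the θi then amounts to (weighted) nonlinear least squares of this function against the empirical semivariogram or semivariogram cloud.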
For processes in R2 , the approach uses piecewise planar functions that are
constant on the rectangles of a grid. The grid is formed by choosing ranges
c and d in two directions and then dividing the (0, 0) × (c, d) rectangle into
k × l rectangles. Instead of the constant function on the line (4.55), the piece-
wise planar functions assign height θi,j whenever a point falls inside the ith,
jth sub-rectangle (see Barry and Ver Hoef, 1996, for details). Note that the
piecewise linear model in (4.55) is not valid for processes in R2 .
The piecewise linear (planar) formulation of Barry and Ver Hoef combines
the flexibility of other nonparametric techniques to provide flexible and valid


forms with a true range beyond which the semivariogram does not change.
Whereas in parametric modeling the unknown range parameter is estimated
from the data, here it is chosen a-priori by the user (the constant c). The
empirical semivariogram cloud can aid in determining that constant, or it can
be chosen as the distance beyond which interpretation of the semivariogram
is not desired (or needed), for example, one half of the largest point distance.
Based on results from simulation studies, Barry and Ver Hoef (1996) recom-
mend k = 15 subdivisions of (0, c] in one dimension and use 20 sub-rectangles
(k = 5, l = 4) for modeling the semivariogram of the Wolfcamp Aquifer data
(Cressie, 1993, p. 214) in two dimensions. The nonparametric semivariogram
is fit to the empirical semivariogram or semivariogram cloud by (weighted)
nonlinear least squares to estimate the θ parameters.

4.6.3 Incorporating a Nugget Effect

Any semivariogram model can be furnished with a nugget effect using the
method in §4.3.6; nonparametric models are no exception. There is, however,
a trade-off between estimating the parameters of a nonparametric model that
govern the smoothness of the semivariogram and estimating the nugget effect.
The sum of the weights $\sum_{i=1}^{p} w_i$ in the spectral approach and the sum of the
squared step heights $\sum_{i=1}^{k} \theta_i^2$ in the moving average approach represent the
partial sill $\sigma_0^2$ in the presence of a nugget effect $c_0$. The nonlinear least squares
objective function is adjusted accordingly. When the nugget effect is large, the
process contains a lot of background noise and the nonparametric semivari-
ogram estimate tends not to be smooth; the nonparametric coefficients wi, θi²
tend to be large. Since the sill σ² = c0 + σ0² is fixed, the nugget effect
will be underestimated. A large estimate of the nugget effect, on the other
hand, leads to an artificially smooth nonparametric semivariogram that is not
sufficiently flexible because of small weights.
Barry and Ver Hoef (1996) recommend estimating the nugget effect sepa-
rately from the nonparametric coefficients, for example, by fitting a line to
the first few lags of the empirical semivariogram cloud, obtaining the nugget
estimate $\hat{c}_0$ as the intercept. The data used in fitting the nonparametric semi-
variogram are then shifted by that amount, provided $\hat{c}_0 > 0$.
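A small sketch of this separate nugget estimate follows: a straight line is fit to the first few lags of the empirical semivariogram (here by weighted least squares with the pair counts as weights, an illustrative choice) and the intercept, truncated at zero, is taken as the nugget estimate.

```python
# Sketch: nugget estimate as the intercept of a weighted line fit to the first
# few lags of the empirical semivariogram.
import numpy as np

def nugget_from_first_lags(h_lags, gamma_hat, n_pairs, n_use=5):
    h, g, n = h_lags[:n_use], gamma_hat[:n_use], n_pairs[:n_use].astype(float)
    X = np.column_stack([np.ones(n_use), h])
    beta = np.linalg.solve(X.T @ (n[:, None] * X), X.T @ (n * g))   # weighted LS
    return max(beta[0], 0.0)                                        # truncate at zero
```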

Example 4.2 (C/N ratios. Continued) We applied the parametric kernel


approach introduced on page 181 to the C/N ratio data to fit a semivari-
ogram with nugget effect. The empirical semivariogram (Matheron estimator)
is shown in Figure 4.7 on page 157. Based on the empirical semivariogram
and our previous analyses of these data, we decided to model a nugget effect
in two ways. The nugget effect was first estimated based on fitting a linear
model to the first five lag classes by weighted least squares. The resulting
estimate was $\hat{c}_0 = 0.1169$ and it was held fixed in the subsequent fitting of the
semivariogram. In the second case, the nugget was estimated simultaneously


with the parameters of the kernel function.
For the function f (θ, ω) in (4.50), we chose a very simple kernel,
$$
f(\theta_u, \omega) = \begin{cases} 1/\theta_u & 0 \le \omega \le b \\ 0 & \text{otherwise}. \end{cases}
$$
This is a uniform kernel with lower bound fixed at 0. The Bessel function
of the first kind serves as the basis function and the covariance model is the
two-parameter model
$$
C(\sigma^2, \theta_u, h) = \sigma^2 \int_0^{\theta_u} J_0(h\omega)\,\frac{1}{\theta_u}\,d\omega.
$$

The nugget effect was held fixed at the same value as in the moving average
approach. Figure 4.12 shows fitted semivariograms for b = 1 and b = 2 for the
case of an externally estimated nugget effect ($\hat{c}_0 = 0.1169$). The parameter
estimates were $\hat{\theta}_u = 0.1499$ and $\hat{\sigma}^2 = 0.166$ for b = 1, and $\hat{\theta}_u = 0.1378$ and
$\hat{\sigma}^2 = 0.166$ for b = 2. Increasing the limit of integration had the effect of
reducing the upper limit of the kernel, but not proportionally to the increase.
As a result, the Bessel functions are evaluated up to a larger abscissa, and the
fitted semivariogram for b = 2 is less smooth.

[Figure: y-axis Semivariance, x-axis Distance; curves labeled b=1 and b=2.]

Figure 4.12 Fitted semivariograms for C/N ratios with uniform kernel functions and
b = 1, 2. Nugget effect estimated separately from kernel parameters.

When the nugget effect is estimated simultaneously with the kernel param-
eter θu , we observe a phenomenon similar to that reported by Barry and Ver
Hoef (1996). The simultaneous estimate of the nugget effect is smaller than
the externally obtained estimate. The trade-off between the nugget effect and
the sill is resolved in the optimization by decreasing the smoothness of the
fit, at the nugget effect’s expense. The estimate of the nugget effect for the
semivariograms in Figure 4.13 is 0.039 for b = 1 (0.056 for b = 2) and the
estimate of the kernel parameter increased to 0.335 for b = 1 (0.291 for b = 2).

[Figure: y-axis Semivariance, x-axis Distance; curves labeled b=1 and b=2.]

Figure 4.13 Fitted semivariograms for C/N ratios with uniform kernel functions and
b = 1, 2. Nugget effect estimated simultaneously.

4.7 Estimation and Inference in the Frequency Domain

The study of the second-order properties of a stochastic process can be carried


out in the spatial domain based on the semivariogram, covariance function, or
correlation function, or in the frequency (spectral) domain based on the spec-
tral density function. The empirical estimators of the semivariogram and/or
the covariance function, such as (4.24) or (4.3), estimate the corresponding
process parameter. An analysis in the spatial domain may proceed as follows:
1. Compute the empirical semivariogram based on the Matheron estimator, $\hat{\gamma}(h)$;
2. Select a theoretical semivariogram model γ(h; θ);
3. Estimate θ by least squares fitting of γ(h; θ) to the empirical semivari-
ogram;
4. Use the estimated model $\gamma(h; \hat{\theta})$ in further calculations, for example, to
solve a spatial prediction problem (Chapter 5).

In the frequency domain, similar steps are performed. Instead of the semi-
variogram, we work with the spectral density, however. The first step, then,
is to compute an empirical estimator of s(ω), the periodogram. Having se-
lected a theoretical model for the spatial dependency, the spectral density
model s(ω; θ), the parameter vector θ is estimated. Further calculations are
then based on the estimated spectral density $s(\omega; \hat{\theta})$. For example, one could
construct the semivariances or covariances of the process from the estimated
spectral density function in order to solve a prediction problem.
Recall that the covariance function and the spectral density function of
a second-order stationary stochastic process form a Fourier transform pair
(§2.5),
A
1
s(ω) = C(u) exp{−iω ! u} du.
(2π)d Rd
Because you can switch between the spectral density and the covariance func-
tion by means of a (inverse) Fourier transform, it is sometimes noted that
the two approaches are “equivalent.” This is not correct in the sense that
space-domain and frequency-domain methods for studying the second-order
properties of a random field represent different aspects of the process. The
space-domain analysis expresses spatial dependence as a function of separa-
tion in spatial coordinates. It is a second-order method because it informs us
about the covariation of points in the process and its dependence on spatial
separation. Spectral methods do not study covariation of points but the man-
ner in which a function dissipates energy (or power, which is energy per unit
interval) at certain frequencies. The second step in the process, the selection of
an appropriate spectral density function, is thus arguably more difficult than
in the spatial domain, where the shape of the empirical semivariogram suggests
the model. By choosing from a sufficiently flexible family of processes—which
leads to a flexible family of spectral densities—this issue can be somewhat
defused. The Matérn class is particularly suited in this respect.
A further, important, difference between estimating the spectral density
function from the periodogram compared to estimating the semivariogram
from its empirical estimator, lies in the distributional properties. We discussed
in §4.5.1 that the appropriate least-squares methodology in fitting a semivar-
iogram model is generalized least squares, because the “data” $\hat{\gamma}(h_i)$ are not
sharing of data points. GLS estimation is nearly impractical, however, be-
cause of the difficulty of computing a dense variance matrix for the empirical
semivariogram values. Weighted or ordinary least squares are used instead.
The spectral approach has a considerable advantage in that the periodogram
values are—at least asymptotically—independent. A weighted least squares
approach based on the periodogram is more justifiable than a weighted least
squares approach based on the empirical semivariogram.

4.7.1 The Periodogram on a Rectangular Lattice

The spectral density and the covariance function are related to each other
through a Fourier transform. Considering that the asymptotic properties of
the empirical estimators are vastly different, it may come as a surprise that
the estimates are related in the same fashion as the process quantities; the
periodogram turns out to be the Fourier transform of the sample covariance
function.
In what follows we focus on the case where the domain D is discrete and
Z is real-valued. Specifically, we assume that the data are observed on a
rectangular r×c row-column lattice. Letting u and v denote a row and column
position, respectively, Z(u, v) represents the attribute in row u and column v.
The covariance function can then be expressed as Cov[Z(u, v), Z(u+j, v+k)] =
C(j, k) and the integral in the spectral density function can be replaced in this case
by a double summation:
$$
\begin{aligned}
s(\omega_1, \omega_2) &= \frac{1}{(2\pi)^2}\sum_{j=-\infty}^{\infty}\sum_{k=-\infty}^{\infty} C(j,k)\exp\{-i(\omega_1 j + \omega_2 k)\} \\
&= \frac{1}{(2\pi)^2}\sum_{j=-\infty}^{\infty}\sum_{k=-\infty}^{\infty} C(j,k)\cos\{\omega_1 j + \omega_2 k\}, \qquad (4.56) \\
&\qquad -\pi < \omega_1 < \pi, \quad -\pi < \omega_2 < \pi.
\end{aligned}
$$

The periodogram is the sample-based estimate of s(ω1, ω2). It is defined as
$$
I(\omega_1, \omega_2) = \frac{1}{(2\pi)^2}\,\frac{1}{rc}\left|\sum_{u=1}^{r}\sum_{v=1}^{c} Z(u,v)\exp\{-i(\omega_1 u + \omega_2 v)\}\right|^2, \qquad (4.57)
$$
and is computed for the set of frequencies
$$
S = \left\{(\omega_1, \omega_2): \omega_1 = -\frac{2\pi}{r}\left\lfloor\frac{r-1}{2}\right\rfloor, \cdots, \frac{2\pi}{r}\left\lfloor\frac{r}{2}\right\rfloor;\ \
\omega_2 = -\frac{2\pi}{c}\left\lfloor\frac{c-1}{2}\right\rfloor, \cdots, \frac{2\pi}{c}\left\lfloor\frac{c}{2}\right\rfloor\right\},
$$

where ⌊·⌋ is the greatest integer (floor) function. These frequencies, which
are multiples of 2π/r and 2π/c, are known as the Fourier frequencies. The
connection between (4.57) and the spectral density as the Fourier transform
of the covariance function (4.56) is not obvious in this formulation. We now
establish this connection between the periodogram and the sample covariance
function for the case of a one-dimensional process Z(1), Z(2), · · · , Z(r). The
operations are similar for the two-dimensional case, the algebra more tedious,
however (see Chapter problems).

4.7.1.1 Periodogram and Sample Covariances in R1

In the one-dimensional case, the Fourier frequencies can be written as ωj =


$2\pi j/r$, $j = -\lfloor(r-1)/2\rfloor, \cdots, \lfloor r/2\rfloor$, and the periodogram becomes
$$
(2\pi)I(\omega_j) = \frac{1}{r}\left|\sum_{u=1}^{r} Z(u)\exp\{-i\omega_j u\}\right|^2
= \frac{1}{r}\left(\sum_{u=1}^{r} Z(u)\exp\{-i\omega_j u\}\right)\left(\sum_{u=1}^{r} Z(u)\exp\{i\omega_j u\}\right).
$$

Using the Euler relation exp{ix} = cos(x) + i sin(x), the periodogram can be
expressed in terms of trigonometric functions:
$$
\begin{aligned}
2\pi I(\omega_j) &= \frac{1}{r}\sum_{u=1}^{r} Z(u)\{\cos(\omega_j u) - i\sin(\omega_j u)\}\sum_{u=1}^{r} Z(u)\{\cos(\omega_j u) + i\sin(\omega_j u)\} \\
&= \frac{1}{r}\left[\sum_{u=1}^{r}\sum_{p=1}^{r} Z(u)Z(p)\cos(\omega_j u)\cos(\omega_j p)
+ \sum_{u=1}^{r}\sum_{p=1}^{r} Z(u)Z(p)\sin(\omega_j u)\sin(\omega_j p)\right].
\end{aligned}
$$

At this point we make use of the fact that, by definition of the Fourier frequen-
cies, $\sum_u \cos(\omega_j u) = 0$, and hence we can subtract any value from Z without
altering the sums. For example,
$$
\sum_{u=1}^{r} Z(u)\cos(\omega_j u) = \sum_{u=1}^{r} (Z(u) - \bar{Z})\cos(\omega_j u).
$$
Using the further fact that cos(a) cos(b) + sin(a) sin(b) = cos(a − b), we arrive at
$$
\begin{aligned}
2r\pi I(\omega_j) &= \sum_{u=1}^{r}\sum_{p=1}^{r} (Z(u)-\bar{Z})(Z(p)-\bar{Z})\cos(\omega_j u)\cos(\omega_j p) \\
&\quad + \sum_{u=1}^{r}\sum_{p=1}^{r} (Z(u)-\bar{Z})(Z(p)-\bar{Z})\sin(\omega_j u)\sin(\omega_j p) \\
&= \sum_{u=1}^{r}\sum_{p=1}^{r} (Z(u)-\bar{Z})(Z(p)-\bar{Z})\cos(\omega_j(u-p)). \qquad (4.58)
\end{aligned}
$$

This expression involves cross-products of deviations from the sample mean


and must be related to a covariance estimate. The sample autocovariance
function at lag k for this one-dimensional process can be written as
$$
\hat{C}(k) = \begin{cases}
r^{-1}\sum_{u=k+1}^{r} (Z(u-k) - \bar{Z})(Z(u) - \bar{Z}) & k \ge 0 \\
r^{-1}\sum_{u=-k+1}^{r} (Z(u+k) - \bar{Z})(Z(u) - \bar{Z}) & k < 0.
\end{cases}
$$

Since cos(x) = cos(−x), (4.58) can be written as

$$
2\pi I(\omega_j) = \hat{C}(0) + 2\sum_{k=1}^{r-1}\cos(\omega_j k)\hat{C}(k) = \sum_{k=-r+1}^{r-1}\cos(\omega_j k)\hat{C}(k).
$$
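The identity can be verified numerically; the short check below (illustrative, with simulated data) compares the two sides at the non-zero Fourier frequencies.

```python
# Sketch: numerical check that 2*pi*I(omega_j) equals the cosine transform of the
# sample autocovariance function at the non-zero Fourier frequencies.
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=64)
r = z.size
zc = z - z.mean()
lags = np.arange(-(r - 1), r)
C_hat = np.array([np.sum(zc[abs(k):] * zc[:r - abs(k)]) / r for k in lags])
for j in range(1, r // 2):
    w = 2 * np.pi * j / r
    lhs = np.abs(np.sum(z * np.exp(-1j * w * np.arange(1, r + 1)))) ** 2 / r
    rhs = np.sum(np.cos(w * lags) * C_hat)
    assert np.isclose(lhs, rhs)
```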

4.7.1.2 The Periodogram in R2

Similar operations as in the previous paragraphs can be carried out for a two-
dimensional lattice process. It can be established that for ω1 ≠ 0 and ω2 ≠ 0
the periodogram is the Fourier transform of the sample covariance function,
$$
\begin{aligned}
I(\omega_1, \omega_2) &= \frac{1}{(2\pi)^2}\,\frac{1}{rc}\left|\sum_{u=1}^{r}\sum_{v=1}^{c} Z(u,v)\exp\{-i(\omega_1 u + \omega_2 v)\}\right|^2 \\
&= \frac{1}{(2\pi)^2}\sum_{j=-r+1}^{r-1}\sum_{k=-c+1}^{c-1} \hat{C}(j,k)\exp\{-i(\omega_1 j + \omega_2 k)\} \\
&= \frac{1}{(2\pi)^2}\sum_{j=-r+1}^{r-1}\sum_{k=-c+1}^{c-1} \hat{C}(j,k)\cos\{\omega_1 j + \omega_2 k\}, \qquad (4.59)
\end{aligned}
$$

and I(0, 0) = 0. The expression (4.59) suggests a simple method to obtain


the periodogram. Compute the sample covariance function $\hat{C}(j,k)$ for the
combination of lags j = −r + 1, · · · , r − 1, k = −c + 1, · · · , c − 1. Once the
sample covariance has been obtained for all relevant lags, cycle through the
set of frequencies S and compute (4.59).
The covariance function at lag j in the row direction and lag k in the column
direction is estimated from sample moments as
$$
\hat{C}(j,k) = \frac{1}{rc}\sum_{m}^{M}\sum_{l}^{L}(Z(l,m) - \bar{Z})(Z(l+j, m+k) - \bar{Z}), \qquad (4.60)
$$

where l = max{1, 1 − j}, L = min{r, r − j}, m = max{1, 1 − k}, M =


min{c, c − k} (Ripley, 1981, p. 79). The sample covariance function is an even
function, $\hat{C}(j,k) = \hat{C}(-j,-k)$, a property of the covariance function C(j, k).
Computational savings are realized by utilizing the evenness property of the
covariance function and its estimate. Evenness is also a property of the spec-
tral density and the periodogram (4.59). It is thus sufficient to compute the
periodogram only for the set of frequencies S2 , which removes from S the
points with ω2 < 0 and the set {(ω1 , ω2 ) : ω1 < 0, ω2 = 0}. Since periodogram
analyses do not require isotropy of the process, no such assumption is made
in computing (4.60). In fact, the periodogram can be used to test for isotropy
(C(j, k) = C(k, j)∀(j, k)) and other forms of second-order symmetry, for ex-
ample, reflection symmetry (C(j, k) = C(−j, k) ∀(j, k)).
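A direct (and deliberately unoptimized) sketch of this recipe is given below: the sample covariance function (4.60) is computed for all lags, then (4.59) is evaluated over the Fourier frequencies, with I(0, 0) set to zero as noted above. Names are illustrative.

```python
# Sketch: periodogram of a lattice of responses Z via the sample covariance
# function (4.60) and the cosine sum (4.59). Written for clarity, not speed.
import numpy as np

def sample_cov(Z):
    r, c = Z.shape
    Zc = Z - Z.mean()
    C = {}
    for j in range(-(r - 1), r):
        for k in range(-(c - 1), c):
            lo_l, hi_l = max(1, 1 - j), min(r, r - j)
            lo_m, hi_m = max(1, 1 - k), min(c, c - k)
            s = sum(Zc[l - 1, m - 1] * Zc[l + j - 1, m + k - 1]
                    for l in range(lo_l, hi_l + 1) for m in range(lo_m, hi_m + 1))
            C[(j, k)] = s / (r * c)
    return C

def periodogram(Z):
    r, c = Z.shape
    C = sample_cov(Z)
    w1 = 2 * np.pi / r * np.arange(-((r - 1) // 2), r // 2 + 1)
    w2 = 2 * np.pi / c * np.arange(-((c - 1) // 2), c // 2 + 1)
    I = {(a, b): sum(C[jk] * np.cos(a * jk[0] + b * jk[1]) for jk in C) / (2 * np.pi) ** 2
         for a in w1 for b in w2}
    I[(0.0, 0.0)] = 0.0                 # by convention, see text
    return I
```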

4.7.1.3 Interpretation

To illustrate the interpretation of the periodogram, we consider by way of


example three situations. Recall the simulated data displayed in panels a and
d of Figure 1.1 (page 6). Data were assigned independently to the positions of
a 10 × 10 lattice in Figure 1.1a and were rearranged to achieve a high degree
of spatial autocorrelation in Figure 1.1d. The data and their row/column
positions in the latter panel are shown in Table 4.3.

Table 4.3 Data for 10 × 10 lattice in Figure 1.1d.

Column
Row 1 2 3 4 5 6 7 8 9 10
1 3.55 3.52 4.41 4.04 4.43 5.01 4.81 4.97 5.02 4.63
2 3.56 3.91 3.39 3.22 4.29 5.77 4.61 5.02 4.14 4.24
3 4.26 4.26 4.32 3.99 5.40 5.83 4.92 3.55 2.68 3.22
4 4.29 5.84 5.24 5.30 5.27 5.48 5.02 2.98 3.15 2.71
5 5.75 4.80 5.50 4.51 4.71 4.62 5.26 4.15 3.10 3.97
6 6.20 5.81 4.45 5.14 4.62 5.38 5.31 4.50 4.84 3.94
7 6.87 6.23 6.18 5.61 6.00 5.75 5.67 5.32 5.03 4.68
8 6.73 7.24 6.53 5.20 4.65 5.19 4.83 5.11 5.35 5.82
9 6.42 7.27 6.14 5.95 5.52 4.78 5.40 5.32 5.36 6.17
10 7.52 6.49 6.18 5.46 5.12 5.55 5.09 5.69 5.99 5.91

Figures 4.14 and 4.15 show the sample covariance functions (4.60) for the
spatially uncorrelated data in Figure 1.1a and the highly correlated data in
Figure 1.1d. The covariances are close to zero everywhere, except for (j, k) =
(0, 0), where the sample covariance function estimates the variance of the
process. In the correlated case, covariances are substantial for small j and k.
Notice the evenness of the sample covariance function, $\hat{C}(j,k) = \hat{C}(-j,-k)$,
and the absence of reflection symmetry, $\hat{C}(j,k) \ne \hat{C}(-j,k)$. The periodograms
are displayed in Figures 4.16 and 4.17. For spatially uncorrelated data, the
periodogram is more or less evenly distributed. High and low ordinates occur
for large and small frequencies (note that I(0, 0) = 0). Strong positive spatial
association among the responses is reflected in large periodogram ordinates
for small frequencies.

4.7.1.4 Properties

The important result from which we derive properties of the periodogram


(4.59) and related quantities is as follows. Under an increasing domain asymp-
totic model, where r → ∞, c → ∞, and r/c tends to a non-zero constant,
then, at the non-zero Fourier frequencies,
$$
2\,\frac{I(\omega_1, \omega_2)}{s(\omega_1, \omega_2)} \;\xrightarrow{d}\; \chi^2_2.
$$

Figure 4.14 Sample covariance function $\hat{C}(j,k)$ for the spatially uncorrelated data
of Figure 1.1a.

Hence, asymptotically, E[I(ω1 , ω2 )] = s(ω1 , ω2 ), Var[I(ω1 , ω2 )] = s(ω1 , ω2 )2 .


The periodogram is an asymptotically unbiased, inconsistent estimator of the
spectral density function. The inconsistency of the periodogram—the variance
does not depend on r or c—is not a problem when a spectral density is fit to
the periodogram by least squares. Other techniques, for example, confirma-
tory inference about the spectral density, may require a consistent estimate
of s(ω). In this case it is customary to smooth or average the periodogram
ordinates to achieve consistency. This is the case in spectral analyses of point
patterns (see §4.7.3).
A further property of the periodogram ordinates is their asymptotic inde-
pendence. Under the asymptotic model stated above,
$$
\text{Cov}[I(\boldsymbol{\omega}_k), I(\boldsymbol{\omega}_l)] = \begin{cases} s(\boldsymbol{\omega}_k)^2 & \boldsymbol{\omega}_k = \boldsymbol{\omega}_l \\ 0 & \boldsymbol{\omega}_k \ne \boldsymbol{\omega}_l. \end{cases}
$$
It is this result that favors OLS/WLS estimation in the frequency domain
over the spatial domain.
Figure 4.15 Sample covariance function $\hat{C}(j,k)$ for the highly spatially correlated
data of Figure 1.1d and Table 4.3.

The conditions under which these asymptotic results hold are detailed in
Pagano (1971). Specifically, it is required that the random field is second-
order stationary with finite variance and has a spectral density. Then we
can write Z(u, v) as the discrete convolution of a white noise excitation field
{X(s, t) : (s, t) ∈ I × I},

$$
Z(u, v) = \sum_{j=-\infty}^{\infty}\sum_{k=-\infty}^{\infty} a(j,k)\,X(u-j, v-k),
$$

for a sequence of constants a(·, ·). It is important to note that it is not required
that Z(u, v) is a Gaussian random field (as is sometimes stated). Instead,
it is only necessary that the excitation field X(s, t) satisfies a central limit
condition,
$$
\frac{1}{\sqrt{rc}}\sum_{s=1}^{r}\sum_{t=1}^{c} X(s,t) \;\longrightarrow\; G(0, 1).
$$

The asymptotic results are very appealing. For a finite sample size, however,
the periodogram is a biased estimator of the spectral density function. Fuentes
(2001) gives the following expression for the expected value of the periodogram
at ω in R2:
$$
\text{E}[I(\boldsymbol{\omega})] = \frac{1}{rc(2\pi)^2}\int s(\boldsymbol{\alpha})\,W(\boldsymbol{\alpha} - \boldsymbol{\omega})\,d\boldsymbol{\alpha}, \qquad (4.61)
$$
with
$$
W(\boldsymbol{\alpha}) = \left(\frac{\sin^2(r\alpha_1/2)}{\sin^2(\alpha_1/2)}\right)\left(\frac{\sin^2(c\alpha_2/2)}{\sin^2(\alpha_2/2)}\right),
$$
and α is a non-zero Fourier frequency.

Figure 4.16 Periodogram I(ω1, ω2) for the spatially uncorrelated data of Figure 1.1a.

The bias comes about because the
function W has subsidiary peaks (sidelobes), large values away from ω. This
allows contributions to the integral from parts of s(·) far away from ω; the
phenomenon is termed leakage. The function W (·) operates as a kernel func-
tion in equation (4.61), it is known as Fejér’s kernel. Leakage occurs when
the kernel has substantive sidelobes. The kernel then transfers power from
other regions of the spectral density to ω. The bias can be substantial when
the process has high dynamic range. This quantity is defined by Percival and
Walden (1993, p. 201) as
$$
10\log_{10}\left(\frac{\max\{s(\boldsymbol{\omega})\}}{\min\{s(\boldsymbol{\omega})\}}\right). \qquad (4.62)
$$
The bias of I(ω) in processes with high dynamic range particularly affects
those frequencies where the spectral density is small.
Figure 4.17 Periodogram I(ω1, ω2) for the highly spatially correlated data of Figure
1.1d and Table 4.3.

The two common methods to combat leakage in periodogram estimation
are tapering and pre-whitening of the data. We are mentioning these tech-
niques only in passing, they are not without controversy. The interested reader
is referred to the monograph by Percival and Walden (1993) for the theory
as well as the various arguments for and against tapering and pre-whitening.
Data tapering replaces Z(s) with h(s)Z(s), where the function h(s) is termed
a taper function or data taper. The tapering controversy is ignited by the
fact that the operation in the spatial domain, creating the product h(s)Z(s),
is a weighing of the observations by h(s). Data tapers typically give smaller
weights to observations near the boundary of the domain, so it can be viewed
as a method of adjusting for edge effects. The weighing is not one that gives
more weight to observations with small variance, as is customary in statis-
tics. Negative sentiments range from “losing information” and “throwing away
data” to likening tapering to tampering. The real effect of tapering is best
seen in the frequency domain. Its upshot is to replace in (4.61) the Fejér kernel
with a kernel function that has smaller sidelobes, hence reducing leakage. Ta-
pering applies weights to data in order to change the resulting kernel function
in the frequency domain.
Pre-whitening is a filtering technique where the data are processed with a
linear filter. The underlying idea is that (i) the spectral densities of the original
and the filtered data are related (see §2.5.6), and that (ii) the filtered process
has a smaller dynamic range. The spectral density of the transformed process
(the filter output) is the product of the spectral density of the filter input and
the filter transfer function H(ω), see equation (2.39) on page 76. Then, the
periodogram is estimated from the filtered data and this is used to construct
the periodogram of the original process. Ideally, filtering would create a white
noise process, since it has the smallest dynamic range (zero). But in order to
do this we need to know either the variance matrix Var[Z(s)] = Σ, or the
spectral density function of the process. This is the very parameter we are
trying to estimate.

4.7.2 Spectral Density Functions

We now give the spectral densities that correspond to the second-order sta-
tionary and isotropic models discussed previously. For some models, e.g., the
Matérn class, several parameterizations are presented and their advantages
and disadvantages are briefly discussed.
Since processes in R1 are necessarily isotropic, the spectral density for a process with continuous domain can be constructed via

s(ω) = (1/(2π)) ∫_{−∞}^{∞} cos(ωh) C(h) dh.

In Rd, the “brute-force” method to derive the spectral density, when the covariance function is isotropic, is to compute

s(ω) = (1/(2π)^d) ∫_{Rd} cos(ω'h) C(||h||) dh.

Another device by which to represent the spectral density as a function of a scalar frequency argument is to consider the one-dimensional Fourier transform of the covariance function on the line. Let h = [h1, h2, ..., hd]' and h = ||h||. By isotropy,

C(h) = C(h1, h2, ..., hd) = C(||h||, 0, ..., 0) = Cr(h).

Vanmarcke (1983, p. 103) calls Cr(h) the radial covariance function. The spectral density function is then obtained as

s(ω) = (1/(2π)) ∫_{−∞}^{∞} cos(ωh) Cr(h) dh.

A related, but not identical, approach is to derive the d-dimensional spectral density s(ω1, ω2, ..., ωd) and to reduce it to a radial function,

s(ω1, ω2, ..., ωd) ≡ sr([ω1² + ω2² + ... + ωd²]^{1/2}).

For d = 3 the line and radial spectra are related through

sr(ω) = −ω ds(ω)/dω.

Table 4.4 gives expressions for the spectral densities in R1 for some common
covariance models in different parameterizations.

Table 4.4 Covariance and spectral density functions for second-order stationary pro-
cesses in R1 . The parameter α is a function of the range and σ 2 is the variance of
the process.

Tent model
  C(h) = σ²{1 − h/α} I(h ≤ α)
  s(ω) = (σ²α/π) (1 − cos(ωα)) / (ω²α²)

Exponential
  C(h) = σ² exp{−h/α}
  s(ω) = σ²α / (π(1 + ω²α²))

Exponential
  C(h) = σ² exp{−3h/α}
  s(ω) = 3σ²α / (π(9 + ω²α²))

“G”aussian
  C(h) = σ² exp{−h²/α²}
  s(ω) = (ασ²/(2√π)) e^{−(αω)²/4}

“G”aussian
  C(h) = σ² exp{−3h²/α²}
  s(ω) = (σ²α/(2√(3π))) e^{−(αω)²/12}

Spherical
  C(h) = σ²{1 − 3h/(2α) + h³/(2α³)} I(h ≤ α)
  s(ω) = (3σ²α/(2π)) [(1 − cos(ωα))² + (ωα − sin(ωα))²] / (ωα)⁴

Matérn class
  C(h) = (σ²/Γ(ν)) (αh/2)^ν 2Kν(αh)
  s(ω) = σ² α^{2ν} Γ(ν + 1/2) / (Γ(ν)Γ(1/2)) (α² + ω²)^{−(ν+1/2)}

Matérn class
  C(h) = (π^{1/2}φ / (Γ(ν + 1/2) α^{2ν})) (αh/2)^ν 2Kν(αh)
  s(ω) = φ(α² + ω²)^{−(ν+1/2)}

Matérn class
  C(h) = (σ²/Γ(ν)) (h√ν/ρ)^ν 2Kν(2h√ν/ρ)
  s(ω) = σ² g(ρ, ν) / [1 + (ρω/(2√ν))²]^{ν+1/2}

Notice the functional similarity of s(ω) for the gaussian models and the
Gaussian probability density function from which these models derive their
name.
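To make the dynamic range (4.62) concrete, the short sketch below evaluates two of the spectral densities of Table 4.4 over a band of frequencies and computes (4.62) for each; the band, the parameter values, and the function names are illustrative choices of ours, not anything prescribed in the text.

import numpy as np

def s_exponential(omega, sigma2=1.0, alpha=1.0):
    # Table 4.4: s(omega) = sigma^2 * alpha / (pi * (1 + omega^2 alpha^2))
    return sigma2 * alpha / (np.pi * (1.0 + (omega * alpha) ** 2))

def s_gaussian(omega, sigma2=1.0, alpha=1.0):
    # Table 4.4: s(omega) = alpha * sigma^2 / (2 sqrt(pi)) * exp(-(alpha*omega)^2 / 4)
    return alpha * sigma2 / (2.0 * np.sqrt(np.pi)) * np.exp(-(alpha * omega) ** 2 / 4.0)

def dynamic_range_db(s_values):
    # equation (4.62): 10 log10( max s / min s ) over the evaluated band
    return 10.0 * np.log10(s_values.max() / s_values.min())

omega = np.linspace(0.0, 3.0 * np.pi, 300)     # an arbitrary finite band of frequencies
print("exponential:", round(dynamic_range_db(s_exponential(omega)), 1))
print("gaussian:   ", round(dynamic_range_db(s_gaussian(omega)), 1))

Over this band the rapidly decaying gaussian density produces a far larger dynamic range than the exponential density, which is one way to see why such smooth processes are more susceptible to leakage; a white noise process, whose spectral density is flat, has dynamic range zero.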
The second parameterization of the Matérn class in Table 4.4 is given by Stein (1999, p. 31). In R1, it is related to the first parameterization through

φ = σ² α^{2ν} Γ(ν + 1/2) / (Γ(ν)Γ(1/2)).

In general, for a second-order stationary process in Rd, the covariance function in the Matérn class and the associated spectral density based on the Stein parameterization are

C(h) = [π^{d/2} φ / (2^{ν−1} Γ(ν + d/2) α^{2ν})] (α||h||)^ν Kν(α||h||)    (4.63)
s(ω) = φ(α² + ||ω||²)^{−(ν+d/2)}.    (4.64)
Whereas this parameterization leads to a simple expression for the spectral
density function, the covariance function unfortunately depends on the di-
mension of the domain (d). Furthermore, the range of the process is a func-
tion of α−1 but depends strongly on ν (Stein, 1999, p. 50). Fuentes (2001)
argues—in the context of spatial prediction (kriging, Chapter 5)—that if the
degree of smoothness, ν, has been correctly determined, the kriging predic-
tions are asymptotically optimal, provided ν and φ are not spatially varying.
The low-frequency values then have little effect on the predicted values. As a
consequence, she focuses on the high frequency values and uses (in Rd)

s(ω) = φ(||ω||²)^{−(ν+d/2)},

an approximation to (4.64) as ||ω|| → ∞.
The third parameterization of the Matérn class was used in Handcock and
Stein (1993) and Handcock and Wallis (1994). The range parameter ρ is less
dependent on the smoothness parameter ν than in the other two parame-
terizations. For the exponential model (ν = 1/2) the practical range in the Handcock-Stein-Wallis parameterization is 3ρ/√2. In Rd, the (isotropic) spectral density is
s(ω) = σ² g(ρ, ν, d) / [1 + (ρω/(2√ν))²]^{ν+d/2}

with

g(ρ, ν, d) = Γ(ν + d/2) ρ^d / (Γ(ν)(4πν)^{d/2}).

The function g(ρ, ν) in Table 4.4 is g(ρ, ν, 1).

4.7.3 Analysis of Point Patterns

Among the advantages of spectral methods for spatial data is that the process is required only to be second-order stationary; it does not have to be isotropic
(§2.5.7). This is particularly important for point pattern analysis, because the
tool most commonly used for second-order analysis in the spatial domain,
the K-function, requires stationarity and isotropy. The previous discussion of
periodogram analysis focused on the case of equally spaced, gridded data. This
is not a requirement of spectral analysis but offers computational advantages
in making available the Fast Fourier Transform (FFT). Spatial locations in a
point pattern are irregularly spaced by the very nature of the process and Z is

a degenerate attribute unless the process is marked. The spectral approach to


point pattern analysis proceeds as follows. An estimate of the spectral density
of the observed point process is compared against the spectral density function
of a known second-order stationary point process and tested for agreement.
To obtain the spectral density function, the covariance function of the process
is required, however. The covariance functions discussed so far are not directly
applicable to point processes since the domain is random and not fixed as for
geostatistical or lattice data.

4.7.3.1 Bartlett’s Covariance Function

Bartlett (1964) approached the development of a covariance function in a


point process by first assuming that the process is orderly, that is,
lim_{|ds|→0} Pr(N(ds) > 1)/|ds| = 0.

This eliminates the possibility of multiple events at location s. Thus, in the


limit, E[N(ds)²] = E[N(ds)], and

λ = lim_{|ds|→0} E[N(ds)]/|ds| = lim_{|ds|→0} E[N(ds)²]/|ds|.    (4.65)

Bartlett (1964) then defined the complete covariance density function as

lim_{|dsi|,|dsj|→0} Cov[N(dsi), N(dsj)] / (|dsi||dsj|) = C(si − sj) + λδ(si − sj),    (4.66)

where C(si − sj) = λ2(si − sj) − λ² is the autocovariance density function (§3.3.4) and δ(u) is the Dirac delta function

δ(u) = ∞ if u = 0;  0 if u ≠ 0.

Substituting into (4.66) and using (4.65) yields

lim_{|dsi|,|dsj|→0} Cov[N(dsi), N(dsj)] / (|dsi||dsj|) = C(si − sj) if si ≠ sj;  lim_{|dsi|→0} λ/|dsi| − λ² if si = sj.

The Dirac delta function is involved in (4.66) because the variance of N (ds)/|ds|
should go to infinity in the limit as |ds| is shrunk.

4.7.3.2 Spectral Density and Periodogram

We now can take the Fourier transform of (4.66) to obtain the spectral density function

s(ω) = (1/(2π)²) ∫ e^{−iω'u} {v(u) + λδ(u)} du
     = (1/(2π)²) { λ + ∫ e^{−iω'u} v(u) du }.

As before, s(ω) is estimated through the periodogram I(ω) = J(ω)J̄(ω), where J̄(ω) denotes the complex conjugate of J(ω). Following Renshaw and Ford (1983), we put

J(ω) = (1/(2π)) Σ_{k=1}^{n} Z(sk) exp{iω'L^{−1}sk},    (4.67)

where the Fourier frequencies are {ω_pq} = {[2πp, 2πq]'} for p = 0, 1, 2, ... and q = 0, ±1, ±2, .... The matrix L is diagonal with entries Lx and Ly such that |D| = Lx Ly. It is thus assumed that the bounding shape of the point process is a rectangle. The term L^{−1}s scales the process to the unit square and the intensity λ is estimated by the number of events n. The
periodogram is then given by
I(ω) = J(ω)J̄(ω) = (1/(2π)²) Σ_{j=1}^{n} Σ_{k=1}^{n} exp{−iω'L^{−1}(sj − sk)}.    (4.68)

Mugglestone and Renshaw (1996a) recommend that the periodogram be


computed for frequencies constructed from p = 0, · · · , 16 and q = −15, · · · , 16
if n < 100.
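As a computational illustration, the following sketch evaluates (4.67)–(4.68) for a point pattern in a rectangular region, over the frequency set recommended above. The function and variable names, and the toy pattern, are our own and not part of the text.

import numpy as np

def point_periodogram(events, Lx, Ly, pmax=16, qmin=-15, qmax=16):
    # I(omega_pq), equation (4.68), on omega = [2*pi*p, 2*pi*q]'
    u = events / np.array([Lx, Ly])              # L^{-1} s: scale locations to the unit square
    p = np.arange(0, pmax + 1)
    q = np.arange(qmin, qmax + 1)
    I = np.empty((p.size, q.size))
    for i, pp in enumerate(p):
        for j, qq in enumerate(q):
            w = 2.0 * np.pi * np.array([pp, qq])
            J = np.exp(1j * (u @ w)).sum() / (2.0 * np.pi)   # (4.67), Z(s) degenerate for a point pattern
            I[i, j] = (J * np.conj(J)).real                  # I = J * conj(J)
    return p, q, I

# toy pattern: 100 uniformly scattered events on a 10 x 10 region
rng = np.random.default_rng(1)
events = rng.uniform(0.0, 10.0, size=(100, 2))
p, q, I = point_periodogram(events, Lx=10.0, Ly=10.0)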
Renshaw and Ford (1983) give a polar representation of I(ω). Note that I(ω) is calculated for the frequencies ωp = 2πp and ωq = 2πq for some integer values of p = 0, 1, ... and q = −c, −c + 1, ..., c, c + 1. Let ρ = √(p² + q²), the Euclidean distance on the p–q lattice, and θ = tan^{−1}(q/p). If P(αρθ) is the periodogram in polar coordinates that corresponds to I(ωp, ωq), then Renshaw and Ford (1983) define the R- and Θ-spectra by averaging periodogram ordinates with similar values of ρ or θ:

SR(ρ) = (1/nρ) Σ_θ Σ_τ P(ατθ)
SΘ(θ) = (1/nθ) Σ_ρ Σ_τ P(αρτ).

The numbers nρ and nθ denote the number of ordinates for which τ falls
within a specified tolerance. The result is that ordinates are averaged in arcs
around the origin in the R-spectrum and in rays emanating from the origin
in the Θ-spectrum (Figure 4.18).
The R-spectrum gives insight about clustering or regularity of events, the
Θ-spectrum about the isotropy of the process. The R-spectrum is interpreted
along the same lines as the K-function. If SR (ρ) takes on large values for small
ρ, the process is clustered. Small values of SR(ρ) for small ρ imply regularity.
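A sketch of this polar averaging is given below. The radial tolerance and the angular bins are our own simple choices, and the toy periodogram grid merely stands in for ordinates computed from (4.68).

import numpy as np

def polar_spectra(p, q, I, r_tol=0.5, theta_bins=12):
    # average periodogram ordinates in arcs (R-spectrum) and in rays (Theta-spectrum)
    P, Q = np.meshgrid(p, q, indexing="ij")
    keep = ~((P == 0) & (Q == 0))                # exclude the zero frequency
    rho = np.sqrt(P ** 2 + Q ** 2)[keep]
    theta = np.arctan2(Q, P)[keep]               # p >= 0, so theta lies in [-pi/2, pi/2]
    ords = I[keep]

    r_levels = np.arange(1, int(rho.max()) + 1)
    SR = np.array([ords[np.abs(rho - r) <= r_tol].mean()
                   if np.any(np.abs(rho - r) <= r_tol) else np.nan
                   for r in r_levels])

    edges = np.linspace(-np.pi / 2, np.pi / 2, theta_bins + 1)
    ST = np.array([ords[(theta >= lo) & (theta < hi)].mean()
                   if np.any((theta >= lo) & (theta < hi)) else np.nan
                   for lo, hi in zip(edges[:-1], edges[1:])])
    return r_levels, SR, 0.5 * (edges[:-1] + edges[1:]), ST

# toy periodogram grid on the p-q lattice used above
p, q = np.arange(0, 17), np.arange(-15, 17)
rng = np.random.default_rng(2)
I = rng.chisquare(2, size=(p.size, q.size)) / 2.0    # CSR-like ordinates with unit mean
r, SR, th, ST = polar_spectra(p, q, I)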

4.7.3.3 Inference Based on the Periodogram

Mugglestone and Renshaw (1996b) establish that (4.68) is an unbiased esti-


mator of s(ω) as n → ∞. These authors also provide an extension of a proof


Figure 4.18 Polar spectra according to Renshaw and Ford (1983).

by Brillinger (1972) for point processes in R1 to the two-dimensional case (Mugglestone and Renshaw, 1996a). If the process is second-order stationary and the spectral density is continuous, then

2 I(ω)/s(ω) →d χ²₂   if ω ≠ 0    (4.69)
2 {I(ω) − (2π)^{−2}λ}/s(0) →d χ²₁   if ω = 0    (4.70)
Cov[I(ω1), I(ω2)] → 0,   if ω1 ≠ ω2.    (4.71)

Furthermore, the CSR process has spectral density s(ω) = λ/(2π)2 . Com-
bining this result with (4.69) and (4.71) enables us to derive test statistics
for the CSR hypothesis for a test without simulations. In addition, we can
make use of the polar spectra and the fact that the spectral analysis does not
require isotropy of the process, to develop a test for isotropy (and other forms
of dependence symmetry).
First, asymptotically, any sum of periodogram ordinates is a sum of independent scaled Chi-square random variables. For example, under the CSR hypothesis,

(8π²/λ) Σ_{p,q≠0} I(ωp, ωq) ∼ χ²_{2m},

where m is the number of periodogram ordinates in the sum. Bartlett (1978)


constructs a goodness-of-fit test of the CSR hypothesis based on this result.
The frequency plane is gridded, and ordinates are totaled in each cell.
Under CSR, the R- and Θ-spectra have simple (asymptotic) Chi-square distributions, namely

((2π)²/λ) SR(ρ) ∼ (1/(2nρ)) χ²_{2nρ}
((2π)²/λ) SΘ(θ) ∼ (1/(2nθ)) χ²_{2nθ}.

A test of the CSR hypothesis can be carried out by comparing (2π)² SR(ρ)/λ̂ against the confidence bounds

(1/(2nρ)) × [χ²_{α/2, 2nρ}, χ²_{1−α/2, 2nρ}].
Values of the R-spectrum outside of these bounds suggest a departure from
CSR (clustering if the upper, regularity if the lower boundary is crossed).
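The bounds themselves are chi-square quantiles and are easy to tabulate. The sketch below computes them and flags departures, with λ estimated by the number of events as above; the helper names and the illustrative values are ours.

import numpy as np
from scipy.stats import chi2

def csr_bounds(n_rho, alpha=0.05):
    # bounds for (2*pi)^2 S_R(rho) / lambda_hat under CSR
    return (chi2.ppf(alpha / 2.0, 2 * n_rho) / (2 * n_rho),
            chi2.ppf(1.0 - alpha / 2.0, 2 * n_rho) / (2 * n_rho))

def csr_flags(SR, n_rho, lam_hat, alpha=0.05):
    scaled = (2.0 * np.pi) ** 2 * SR / lam_hat
    lo, hi = csr_bounds(n_rho, alpha)
    return scaled < lo, scaled > hi      # (suggests regularity, suggests clustering)

# illustrative values: 5 radii, each averaging n_rho = 10 ordinates, n = 100 events
SR = np.array([0.9, 1.1, 1.0, 0.8, 1.2]) * 100.0 / (2.0 * np.pi) ** 2
regular, clustered = csr_flags(SR, n_rho=10, lam_hat=100.0)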
By construction of the Θ-spectrum, an asymmetric spectrum is a sign of
anisotropy in the spatial point process. An F -test is easily constructed, by
comparing periodogram ordinates reflected about the origin. In an isotropic
pattern we expect SΘ (θ) = SΘ (−θ). A test for isotropy in the direction θ uses
test statistic

F = SΘ(θ) / SΘ(−θ)

and null distribution F_{2nθ, 2n_{−θ}}.

4.8 On the Use of Non-Euclidean Distances in Geostatistics

In §2.2 we defined isotropic semivariograms and covariance functions in terms of the Euclidean norm, ||si − sj|| = √((xi − xj)² + (yi − yj)²), for si = [xi, yi]'.
The use of this norm was implicit in all isotropic models described in §4.3.
Even more implicit was the use of this norm in proving that these models are
“valid,” i.e., the isotropic semivariogram models are conditionally negative
definite and the isotropic covariance models are positive definite. However, in
many practical applications this distance measure may not be realistic. For ex-
ample, mountains, irregularly-shaped domains, water bodies (e.g., lakes, bays,
estuaries) and partitions in a building can present barriers to movement. Al-
though two points on either side of a barrier may be physically close, it may
be unrealistic to assume that they are related. For example, Little et al. (1997)
note that for measurements made in estuarine streams, distances should be
measured “as the fish swims” and compute water distance (the shortest
path between two sites that may be traversed entirely over water) using a
GIS. In some contexts, water distance may be calculated along an irregu-
lar one-dimensional transect (Cressie and Majure, 1997; Rathbun, 1998), but

branching estuaries or large barriers prohibit this approach. For more com-
plex applications such as these, Kern and Higdon (2000) define an algorithm
to compute polygonal distance that compensates for irregularities in the
spatial domain. Krivoruchko and Gribov (2004) solve similar problems using
cost weighted distance, a common raster function in GIS. Unfortunately,
as we illustrate below, not all isotropic covariance function and semivariogram
models remain valid when based on non-Euclidean distances.

4.8.1 Distance Metrics and Isotropic Covariance Functions

A metric space is a nonempty set S together with a real valued function


d : S × S → [0, ∞) which satisfies the following conditions:
1. d(si , sj ) ≥ 0, with equality holding if and only if si = sj .
2. d(si , sj ) = d(sj , si ) (symmetry).
3. d(si , sj ) ≤ d(si , sk ) + d(sk , sj ) (triangle inequality).
The function d is called a metric on S. In our context, S is the collection of
spatial locations s ∈ D ⊂ Rn , and d(si , sj ) is considered to be the distance
from si to sj . Common examples in S = R2 include:
d(si, sj) = |xi − xj| + |yi − yj|    (Rectangular or City Block distance);
d(si, sj) = √((xi − xj)² + (yi − yj)²)    (Euclidean distance);
d(si, sj) = max{|xi − xj|, |yi − yj|}.
Curriero (1996, 2004) gives a simple example that clearly demonstrates the
problem with using non-Euclidean distances with isotropic models for covari-
ance functions. He considers a regular two dimensional lattice of four points
with unit spacing. The matrix of the distances among all four points based
on the city block distance definition is
 
0 1 1 2
 1 0 2 1 
 
 1 2 0 1 .
2 1 1 0
Using these distances in an isotropic gaussian model for the covariance function (4.10) with σ² = 20 and α = 2√3 gives the following covariance matrix
 
20.00 15.58 15.58 7.36
 15.58 20.00 7.36 15.58 
 
 15.58 7.36 20.00 15.58  .
7.36 15.58 15.58 20.00
Unfortunately, one of the characteristic roots of this matrix is negative. Con-
sequently, this matrix is not positive definite and so the gaussian model based
on city block distances cannot be a valid covariance function in R2 . However,
the exponential model remains valid when used with the city block metric
(Curriero, 2004).
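The example is easy to reproduce numerically. The sketch below assumes that the gaussian model (4.10) is the practical-range form σ² exp{−3h²/α²}, which reproduces the entries 15.58 and 7.36 printed above, and then shows that one eigenvalue of the resulting matrix is negative.

import numpy as np

pts = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
D = np.abs(pts[:, None, :] - pts[None, :, :]).sum(axis=2)   # city block distances

sigma2, alpha = 20.0, 2.0 * np.sqrt(3.0)
C = sigma2 * np.exp(-3.0 * D ** 2 / alpha ** 2)             # gaussian covariance model

print(np.round(C, 2))              # diagonal 20, off-diagonals 15.58 and 7.36
print(np.linalg.eigvalsh(C))       # one characteristic root is negative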

It is fairly straightforward to show that a particular model based on a


specified distance metric is not a valid covariance function. However, proving
that a particular model based on a specified distance metric is valid is much
more difficult. It is not sufficient to show that the corresponding covariance
matrix is positive definite. While this may suffice for a particular problem with
specified locations, the validity may change when based on a different spatial
configuration, including that which may be considered for spatial prediction.
Other than checking the validity of a particular model with respect to a
chosen distance matrix as described above (and then hoping for the best),
the most common way of ensuring valid models for covariance functions and
semivariograms based on non-Euclidean distance metrics uses transformations
of the spatial coordinates. This is called isometric embedding (Curriero,
1996, 2004). The metric space (S, d) is isometrically embedded in a Euclidean
space of dimension k, if there exist points s∗i, s∗j ∈ Rk and a function φ : S →
Rk such that
d(si , sj ) = ||s∗i − s∗j ||,
where φ(si ) = s∗i and ||s∗i − s∗j || is the Euclidean distance metric in Rk . Exact
isometric embedding can be difficult to accomplish theoretically. However,
it can be done approximately, using a technique known as multidimensional
scaling.

4.8.2 Multidimensional Scaling

Consider a (n × n) symmetric matrix D with typical element dij ≥ 0 and


dii = 0. Then D is said to be a matrix of distances or dissimilarities. Note
that the dij are not necessarily Euclidean distances. They can represent some
other measure of how “far” two items are apart. In the space deformation approach of Sampson and Guttorp (1992) for modeling non-stationary spatial covariances, for example, the d²ij are estimates of Var[Z(si) − Z(sj)]; see §8.2.2.

The technique of multidimensional scaling (MDS; Mardia, Kent, and Bibby,


1979, Ch. 14) is applied when the matrix D is known, but the point locations
that produced the distances are unknown. The objective is to find points
s1 , · · · , sn in Rk such that
d̂ij = ||si − sj|| ≈ dij.
In other words, find a configuration of points in k dimensional space such
that distances among these points are close to the measures of dissimilarity
which were given. One aspect of MDS is to determine the dimension k. For
interpretability of the solution you want k to be small. When it is known that
D is constructed from points in Rd , then k ≤ d.
A feature of solutions to the MDS problem is their invariance to rotation
and translation. If D was constructed from known coordinates, a solution can
be transformed so that its coordinates are comparable to those of the original

configuration (see §4.8.2.2). An application is the translation of non-Euclidean


distances (e.g., city block distances) into Euclidean distances.

4.8.2.1 The Classical, Metric Solution

Mardia, Kent, and Bibby (1979, Ch. 14) distinguish the classical solution—
also called a metric solution—from the non-metric solution that is based on
ranks of distances and iterative optimization. A metric solution determines the
point configuration directly and can serve as the solution of the MDS prob-
lem or as the starting configuration of a non-metric technique. The classical
solution consists of choosing as point configuration the k (scaled) eigenvectors that correspond to the k largest (positive) eigenvalues of the matrix

B = −(1/2)(I − J/n)' D^{[2]} (I − J/n).

The notation A^{[p]} stands for a matrix whose elements are a^p_ij. The eigenvectors are scaled so that if s*_i is the ith eigenvector of B and λi is the ith (positive) eigenvalue, then s*_i's*_i = λi.
Denote the solution so obtained as Ŝ and let d̂ij = ||si − sj||. The discrepancy between D and the fit can be measured by

ψ = trace{(B − B̂)²}.

The classical solution minimizes this trace among all configurations that have distance matrix D̂, for a given value of k. If λ = [λ(1), ..., λ(n)]' denotes the vector of ordered eigenvalues of B, then

r = Σ_{i=1}^{k} λ(i) / Σ_{i=1}^{n} |λi|

is a measure of the agreement between the metric solution and the distance matrix D.
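A compact implementation of the classical solution is sketched below. The function name is ours, and clipping negative eigenvalues before scaling is one simple convention for non-Euclidean D.

import numpy as np

def classical_mds(D, k=2):
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix I - J/n, with J a matrix of ones
    B = -0.5 * H @ (D ** 2) @ H                  # B = -(1/2)(I - J/n)' D^[2] (I - J/n)
    eigval, eigvec = np.linalg.eigh(B)
    order = np.argsort(eigval)[::-1]             # order eigenvalues from largest to smallest
    eigval, eigvec = eigval[order], eigvec[:, order]
    lam = np.clip(eigval[:k], 0.0, None)
    X = eigvec[:, :k] * np.sqrt(lam)             # scale so that s_i*' s_i* = lambda_i
    r = lam.sum() / np.abs(eigval).sum()         # agreement measure r
    return X, eigval, r

# city-block distance matrix D2 of Example 4.6 below; the text reports r = 0.818 for it
D2 = np.array([[  0.0, 130.0, 280.0, 175.0, 182.5,  70.0],
               [130.0,   0.0, 180.0,  75.0,  82.5, 140.0],
               [280.0, 180.0,   0.0, 105.0, 122.5, 210.0],
               [175.0,  75.0, 105.0,   0.0, 127.5, 185.0],
               [182.5,  82.5, 122.5, 127.5,   0.0, 112.5],
               [ 70.0, 140.0, 210.0, 185.0, 112.5,   0.0]])
X2, eigval2, r2 = classical_mds(D2, k=2)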

Example 4.6 Classical MDS. Assume the following six point locations in
R2 : s1 = [0, 65], s2 = [115, 50], s3 = [225, 120], s4 = [175, 65], s5 = [115, 132.5],
s6 = [30, 105]. The matrix of Euclidean distances is
 
D1 = [   0.0  115.9  231.6  175.0  133.3   50.0
       115.9    0.0  130.4   61.8   82.5  101.2
       231.6  130.3    0.0   74.3  110.7  195.6
       175.0   61.8   74.3    0.0   90.3  150.4
       133.3   82.5  110.7   90.3    0.0   89.3
        50.0  101.2  195.6  150.4   89.3    0.0 ]

and the matrix of city-block distances is


 
0.0 130.0 280.0 175.0 182.5 70.0
 130.0 0.0 180.0 75.0 82.5 140.0 
 
 280.0 180.0 0.0 105.0 122.5 210.0 
D2 =   175.0
.

 75.0 105.0 0.0 127.5 185.0 
 182.5 82.5 122.5 127.5 0.0 112.5 
70.0 140.0 210.0 185.0 112.5 0.0
The (ordered) eigenvalues of B1 are λ1 = [36375.07, 5405.14, 0, 0, 0, 0]! and
those of B2 are λ2 = [47754.86, 14646.25, 0, −2196.03, −5775.36]! . For k = 2,
r1 = 1 and r2 = 0.818. A classical solution with k = 2 will reproduce perfectly
the distance matrix D1 , but not the distance matrix D2 . The fitted distances
are

D̂1 = D1 = [   0.0  115.9  231.6  175.0  133.3   50.0
            115.9    0.0  130.4   61.8   82.5  101.2
            231.6  130.3    0.0   74.3  110.7  195.6
            175.0   61.8   74.3    0.0   90.3  150.4
            133.3   82.5  110.7   90.3    0.0   89.3
             50.0  101.2  195.6  150.4   89.3    0.0 ]

and

D̂2 = [   0.0  125.8  280.9  192.6  177.1  107.7
       125.8    0.0  169.9   67.2   99.5  131.7
       280.9  169.9    0.0  130.7  113.7  221.7
       192.6   67.2  130.7    0.0  114.8  187.2
       177.1   99.5  113.7  114.8    0.0  108.1
       107.7  131.7  221.7  187.2  108.1    0.0 ]
The MDS solutions do not “match” the original coordinates in the sense
that they are centered at 0 and are not yet rotated and scaled (Table 4.5).
This can be accomplished with a Procrustes rotation (see below).

Table 4.5 Classical Solutions to MDS for D1 and D2


x y MDS for D1 MDS for D2
0 65.0 −112.04 −12.33 −134.88 22.99
115 50.0 0.61 −39.89 −11.41 46.98
225 120.0 117.65 17.57 141.48 −27.03
175 65.0 61.90 −31.59 52.22 68.49
115 132.5 9.69 42.10 29.04 −43.94
30 105.0 −77.82 24.13 −76.45 −67.51

The non-metric solution to the MDS problem is a rank-based method in



which the “stress” criterion

Σ_{i<j} (f(dij) − d̂ij)² / Σ_{i<j} d̂²ij

is minimized subject to a monotonicity requirement,

dij < dkl ⇒ f(dij) ≤ f(dkl)   ∀ i < j, k < l.
The process is iterative, starting from an initial configuration which can be
the geographical locations (if known, as in Sampson and Guttorp, 1992), or
the result of a classical MDS solution. The algorithm is known as the Shepard-
Kruskal algorithm.

4.8.2.2 Procrustes Rotation With Scaling

Consider two point configurations S1 and S2 . For example, S1 is the original


point configuration from which D was computed based on city block distances
and S2 is the classical MDS solution. Or, S1 and S2 correspond to the classical
and non-metric MDS solutions. We would like to translate the configuration
S2 so that its coordinates match that of S1 and compute a measure of closeness
between the two configurations. The technique that accomplishes these two
goals is termed a Procrustes rotation of S2 relative to S1 (Mardia, Kent, and
Bibby, 1979, Ch. 14.7).
Denote the points in Si as s1^{(i)}, ..., sn^{(i)}. The Procrustes rotation (with scaling) of S2 relative to S1 is obtained by minimizing

SS(A, b) = Σ_{i=1}^{n} (si^{(1)} − si^{(2)*})' (si^{(1)} − si^{(2)*})

subject to si^{(2)*} = cA'si^{(2)} + b. The solutions to this least squares problem are

b = s̄^{(1)} − cA's̄^{(2)}
A = UV'
c = tr{Δ} / tr{S2S2'},

and UΔV' is the singular value decomposition of S2'S1. A Procrustes rotation without scaling has c = 1.0.
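A numerical sketch of this least squares matching, following the solution quoted above, is given below; the function name and the orientation of the configurations (points stored as rows) are our own conventions. Applied to an MDS solution and the original coordinates, this is the operation used in the continuation of Example 4.6 below.

import numpy as np

def procrustes(S1, S2, scaling=True):
    # rotate, scale, and translate configuration S2 (rows = points) onto S1
    U, delta, Vt = np.linalg.svd(S2.T @ S1)          # S2' S1 = U diag(delta) V'
    A = U @ Vt                                        # A = U V'
    c = delta.sum() / np.trace(S2 @ S2.T) if scaling else 1.0
    b = S1.mean(axis=0) - c * (A.T @ S2.mean(axis=0))
    S2_star = c * (S2 @ A) + b                        # row-wise version of c A' s + b
    return S2_star, A, b, c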

Example 4.6 (Classical MDS. Continued) Applying a Procrustes rota-


tion of the MDS solutions in Table 4.5 with respect to the original configu-
ration yields the coordinates in Table 4.6. As expected, the solution for D1
maps perfectly in the observed coordinates; recall that r1 = 1. The coordi-
nates for the D2 solution do not agree with the original point location because
D2 is based on city block distances. The configuration S2 is the arrangement
of points that will yield a matrix of Euclidean distances closest to D2 .

Table 4.6 Procrustes rotations for MDS solutions


x y S∗1 S∗2
0 65.0 0 65.0 9.11 47.72
115 50.0 115 50.0 109.47 50.99
225 120.0 225 120.0 215.30 135.75
175 65.0 175 65.0 162.81 45.56
115 132.5 115 132.5 124.79 128.94
30 105.0 30 105.0 38.51 128.54

4.9 Supplement: Bessel Functions

4.9.1 Bessel Function of the First Kind

The Bessel function of the first kind of order ν is defined by the series

Jν(t) = (t/2)^ν Σ_{i=0}^{∞} (−¼t²)^i / (i! Γ(ν + i + 1)).    (4.72)

We use the notation Jn (t) if the Bessel function has integer order. A spe-
cial case is J0 (t), the Bessel function of the first kind of order 0. It appears
as the basis function in spectral representations of isotropic covariance func-
tions in R2 (§4.3.1). Bessel functions of the first kind of integer order satisfy
(Abramowitz and Stegun, 1964)

J_{n+1}(t) = (2n/t) Jn(t) − J_{n−1}(t)
J'n(t) = ½ (J_{n−1}(t) − J_{n+1}(t)) = J_{n−1}(t) − (n/t) Jn(t)
J_{−n}(t) = (−1)^n Jn(t)

4.9.2 Modified Bessel Functions of the First and Second Kind

There are two types of modified Bessel functions. Of particular importance for
spatial modeling are the modified Bessel functions of the second kind Kν (t) of
(real) order ν. They appear as components of the Matérn class of covariance
functions for second-order stationary processes (see §4.3.2):

Kν(t) = (π/2) · (I_{−ν}(t) − Iν(t)) / sin(πν).    (4.73)

The function Iν(t) in (4.73) is the modified Bessel function of the first kind, defined by

Iν(t) = (t/2)^ν Σ_{i=0}^{∞} (¼t²)^i / (i! Γ(ν + i + 1)).

Since computation of these functions can be numerically expensive, approximations can be used for t → 0:

K0(t) ≈ −ln{t};    Kν(t) ≈ (Γ(ν)/2) (t/2)^{−ν}  for ν > 0.
Other important results regarding modified Bessel functions (Abramowitz and
Stegun, 1964; Whittaker and Watson, 1927) are (n denoting integer and ν
denoting real order)
Kν(t) = K_{−ν}(t)
K_{n+1}(t) = K_{n−1}(t) + (2n/t) Kn(t)
K'n(t) = (n/t) Kn(t) − K_{n+1}(t)  ⇒  K'0(t) = −K1(t)
Kν(t) = ((2t)^ν Γ(ν + 1/2) / √π) ∫_0^{∞} (u² + t²)^{−ν−1/2} cos{u} du
Iν(0) = 1 if ν = 0;  0 if ν > 0
In(t) = (1/π) ∫_0^{π} e^{t cos θ} cos{nθ} dθ
In(t) = I_{−n}(t)
I_{1/2}(t) = √(2/(πt)) sinh{t},    I_{−1/2}(t) = √(2/(πt)) cosh{t}
I'n(t) = (n/t) In(t) + I_{n+1}(t)  ⇒  I'0(t) = I1(t)
Some of these properties have been used in §4.3.2 to establish that the Matérn model for ν = 1/2 reduces to the exponential covariance function.
A Fortran program (rkbesl) to calculate Kn+α (t) for non-negative t and
non-negative order n + α is distributed as part of the SPECFUN package (Cody,
1987). It is available at www.netlib.org.
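For routine numerical work one rarely needs SPECFUN directly; for instance, scipy exposes Kν as scipy.special.kv. The short check below, with arbitrary parameter values of our choosing, verifies the ν = 1/2 reduction mentioned above and the small-argument approximation for Kν.

import numpy as np
from scipy.special import kv, gamma

alpha, nu = 1.5, 0.5
h = np.linspace(0.05, 3.0, 10)

# Matern covariance (first parameterization of Table 4.4, sigma^2 = 1) at nu = 1/2
matern = (1.0 / gamma(nu)) * (alpha * h / 2.0) ** nu * 2.0 * kv(nu, alpha * h)
print(np.allclose(matern, np.exp(-alpha * h)))        # reduces to the exponential model

# small-argument approximation K_nu(t) ~ (Gamma(nu)/2) (t/2)^(-nu) for nu > 0
t, nu2 = 1e-4, 1.2
print(kv(nu2, t), (gamma(nu2) / 2.0) * (t / 2.0) ** (-nu2))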

4.10 Chapter Problems

Problem 4.1 Verify


(i) that the empirical semivariogram (use the Matheron estimator) is a
biased estimator of γ(h) under trend contamination of the random field;
(ii) that the mean of γ̂(h) is given by (4.4).
Simulate data from a simple linear regression Yi = α + βxi + ei , where the
errors are independently and identically distributed with mean 0 and variance

σ 2 . Let xi = i and consider xi as the location on a transect at which Yi was


observed. Calculate the empirical semivariogram estimator for different values
of α, β, and σ 2 . How does the choice of α affect the semivariogram?

Problem 4.2 (Schabenberger and Pierce, 2002, pp. 431–433) Let Yi = β +ei ,
(i = 1, · · · , n), where ei ∼ iid G(0, σ 2 ).
(i) Find the maximum likelihood estimators of β and σ 2 .
(ii) Define a random vector

U_{(n−1)×1} = [Y1 − Ȳ, Y2 − Ȳ, ..., Y_{n−1} − Ȳ]'
and find the maximum likelihood estimator of σ 2 based on U. Is it possible
to estimate β from the likelihood of U?
(iii) Show that the estimator of σ 2 found in (ii) is the restricted maximum
likelihood estimator for σ 2 based on the random vector Y = [Y1 , · · · , Yn ]! .

Problem 4.3 Consider random variables X ∼ G(0, 1) and Y ∼ G(0, 1) with


Corr[X, Y] = ρ. Show that Corr[X², Y²] = ρ². Hint: Consider X, U ∼ G(0, 1) with X ⊥ U and define Y = ρX + √(1 − ρ²) U.

Problem 4.4 The Matheron estimator can be written as a quadratic form


in the observed data,

γ̂(h) = ½ Z(s)'A(h)Z(s).

The matrix A is called a spatial design matrix (Genton, 1998b).
(i) Imagine n regularly spaced data points in R1. Give the matrix A(h) for this case.
(ii) Give the matrix A(h) when the data are on a 3 × 3 lattice.

Problem 4.5 Derive the composite likelihood score equation (4.44) under
the assumption that Z(si ) − Z(sj ) are zero mean Gaussian random variables.

Problem 4.6 The derivation of the composite likelihood estimator of the semivariogram in §4.5.3 commenced by assuming that T_ij^{(2)} = Z(si) − Z(sj) ∼ G(0, 2γ(hij, θ)) and then built a composite score function from the score contributions of the T_ij^{(2)}. With this distributional assumption for T_ij^{(2)}, derive the composite likelihood score function from the distribution of T_ij^{(3)}.

Problem 4.7 Establish the connection between the periodogram (4.57) and
the covariance function (4.60) for data on a rectangular r × c lattice. That is,
show that the periodogram is the Fourier transform of the sample covariance
function.

Problem 4.8 Determine the dynamic range, see equation (4.62) on page 196, for some of the spectral densities shown in Table 4.4. Which models are susceptible to bias in periodogram estimation due to leakage? What is the dynamic range of a white noise process?

Problem 4.9 Show that a homogeneous Poisson process has complete covariance density function λδ(si − sj) and spectral density s(ω) = (2π)^{−2}λ.

Problem 4.10 One of the appealing features of the method of “nonpara-


metric” semivariogram fitting that relies on kernel functions rather than step
functions (see §4.6.1), is to allow estimation by unconstrained optimization.
This is only true if the function f (θ, ω) can be parameterized without con-
straints. Take, for example, the uniform kernel model (4.49) on page 182.
The parameters must satisfy θ > 0, 0 ≤ θl < θu . Give a parameterization
that satisfies these constraints. That is, give functions g, gl , and gu such that
θ = g(ψ), θl = gl (ψl ), and θu = gu (ψu , ψl ) where [ψ, ψl , ψu ] ∈ R3 .
CHAPTER 5

Spatial Prediction and Kriging

5.1 Optimal Prediction in Random Fields

Consider the random field { Z(s) : s ∈ D ⊂ Rd } observed at locations


s1 , · · · , sn , and the corresponding data vector Z(s) = [Z(s1 ), · · · , Z(sn )]! . The
domain D is fixed and continuous, hence we are dealing with geostatistical
data. Our sample is an incomplete observation of the surface Z(s, ω) that is
the outcome of a random experiment with realization ω. One of the pervasive
problems in spatial statistics is the prediction of Z at some specified location
s0 ∈ D. This can be a location that is part of the set of locations where Z(s, ω)
has been observed, or a new (= unobserved) location.
Before we elaborate on general details of the spatial prediction problem, a
few comments about the distinction of prediction and estimation are in order.
These terms are often used interchangeably. For example, in a simple linear
regression model, Yi = β0 + β1 xi + #i , where the errors are uncorrelated, the
regression coefficients β0 and β1 are estimated as β'0 and β'1 and then used
to calculate predicted values Y'0 = β'0 + β'1 x0 . It is not clear in this situation
whether Y'0 is supposed to be a predictor of Y0 , the response at x0 , or an es-
timator of E[Y0 ]. The fuzziness of the distinction in standard situations with
uncorrelated errors stems from the fact that Y'0 is the best (linear unbiased)
estimator of the fixed, non-random quantity E[Y0 ] and also the best (linear
unbiased) predictor of the random variable Y0 . Although the distinction be-
tween estimating a fixed quantity and predicting a random quantity may seem
overly pedantic, the importance of the distinction becomes clear when we con-
sider the uncertainty associated with the two quantities. The prediction error
associated with using Y'0 as a predictor of Y0 is larger due to the variability
incurred from predicting a new observation.
In the case of a spatial random field, we can also focus on either prediction
or estimation. Assume that a random field for geostatistical data (§2.4.1) has
model representation
Z(s) = X(s)β + e(s), e(s) ∼ (0, Σ).
We may be interested in estimating E[Z(s)] = X(s)β or in predicting Z(s).
In geostatistical applications prediction is often more important than mean
estimation. First, the expectation E[Z(s)] is reckoned with respect to the
distribution of the possible realizations ω at location s. Appealing to this
distribution may not be helpful. Second, one is often interested in the actual
amount Z(s) that is there, not some conceptual average amount.

Example 5.1 A public shooting range has been in operation for seven years
in a national forest, operated by the U.S. Forest Service. The lead concentra-
tion on the range is observed at sampling locations s1 , · · · , sn by collecting the
soil on a 50 × 50 cm square at each location, sieving the non-lead materials
and weighing the lead content. The investigators are interested in determining
the lead concentration at all locations on the shooting range. Estimating the
mean lead concentration at location s appeals to a universe of similar shooting
ranges. There may be no other shooting ranges like the one under considera-
tion. What matters to the investigators is not how much lead is at location s0
on average across many other—conceptual—shooting ranges. What matters
is to determine the amount of lead on the shooting range that was sampled.

Example 5.2 An oil company plans to install an oil well in a particular


field. Sample wells are installed at locations s1 , · · · , sn . From this sample it is
to be determined where on the field the actual oil well is to be installed. The
average yield on similar fields is not of importance. What matters is where on
this field the well should be placed.

Even if estimation of the mean, the average amount at location s, is not of


primary interest, the pursuit of spatial prediction cannot escape matters of
estimation. Both the mean and covariance structure of a random field are typ-
ically unknown. Unless the mean is known or constant, it must be estimated
from the data. In the model above, estimates for β must be found before a
predictor can be computed. And estimates of the mean will in most cases
depend on the covariance structure, which requires estimates of θ.
Geostatistical prediction methods are statistical tools for predicting g(Z(s0 ))
from a set of observed data. They are typically known as methods of kriging,
a term coined by G. Matheron in honor of the South African mining engineer
D.G. Krige, whose work on ore-grade estimation in the Witwatersrand gold
mines laid preliminary groundwork for the field of geostatistics (Krige, 1951;
Matheron, 1963). Much of the same methodology that forms the basis of the
field of geostatistics was developed around the same time in meteorology by L.
S. Gandin, where it is often referred to as objective analysis (Gandin, 1963).
Chilès and Delfiner (1999, p. ix) note that “it is from Matheron that geostatis-
tics emerged as a discipline in its own right,” and the widespread adoption of
the term “kriging” is undoubtedly a testament to his influence. An interesting
discussion on the origins of kriging and the development of optimal spatial
prediction can be found in Cressie (1990).
One of the earliest books on geostatistics is Journel and Huijbregts (1978)
and this remains an excellent source for the theory of kriging. More intro-
ductory texts include Clark (1979), Isaaks and Srivastava (1989), Armstrong
(1999), Olea (1999), and Webster and Oliver (2001). In particular, Olea (1999)
and Webster and Oliver (2001) are fairly comprehensive with many practical

examples. A more advanced treatment of geostatistics and spatial prediction


methods is given in Deutsch and Journel (1992) and the extension thereof by
Goovaerts (1997), and the books of Cressie (1993) and Chilès and Delfiner
(1999). Our writing in this chapter borrows considerably from Cressie (1993)
and Chilès and Delfiner (1999). In particular, we follow the notation set forth
in Cressie (1993).
The function g(Z(s0 )) of interest can be varied. For example, we may be
interested in predicting the Z process itself, g(Z(s0 )) = Z(s0 ). Or, we may be
interested in the average amount in a particular region B (block) of volume
|B|,

g(Z(B)) = (1/|B|) ∫_B Z(u) du.
Or, we may be interested in binary variables indicating whether Z(s) exceeds
a certain environmentally-safe threshold level c,

g(Z(s0)) = 1 if Z(s0) > c;  0 if Z(s0) ≤ c.
In what follows, the notation p(Z; g(Z(s0 ))) denotes the predictor of g(Z(s0 ))
at location s0 based on the observed data vector Z(s). If g(Z(s0 )) ≡ Z(s0 ), a
case on which we shall focus now, the “shorthand” p(Z; s0 ) will be used.
The derivation of a predictor commences with the choice of a function
L(Z(s0 ), p(Z; s0 )) that measures the loss incurred by using p(Z; s0 ) as a pre-
dictor of Z(s0). The squared-error loss function

L(Z(s0), p(Z; s0)) = (Z(s0) − p(Z; s0))²
is most commonly used and it is the only loss function considered in this text.
It is not necessarily the most reasonable loss function, for example, because
it is symmetric. In some applications, under- and over-predictions need to be
penalized to different degrees. See the work by Zellner (1986), for example, on
estimation and prediction using asymmetric loss functions. Having said that,
the compelling reasons for choosing squared-error loss outweigh competing
functions in most instances. Among these reasons are the following.

1. Statistical properties of predictors are fairly easily examined under squared-


error loss. It essentially requires knowing only the first two moments of the
process.
2. The loss function L depends on the data and is thus a random variable. This
leads us to the use of statistical measures of risk, e.g., finding predictors
that minimize the average loss E[L(Z(s0 ), p(Z; s0 ))]. Under squared-error
loss, the average loss is simply the mean-squared prediction error (MSPE)
of using p(Z; s0) as a predictor of Z(s0):

E[(Z(s0) − p(Z; s0))²] = MSE[p(Z; s0); Z(s0)].

3. The predictor that minimizes the mean-squared error has a particularly



appealing form; it is the conditional expectation of Z(s0 ) given the observed


data (see Chapter problems),
p0 (Z; s0 ) = E[Z(s0 )| Z(s)]. (5.1)
By the law of iterated expectations, p0 (Z; s0 ) is unbiased in the sense that
E[p0 (Z; s0 )] = E[Z(s0 )].
4. Finally, in the Gaussian case, finding the conditional expectation function
is simple. Consider a (n × 1) random vector W. Partition W = [U, V]! ,
where U is of dimension (u × 1) and V has dimension (v = (n − u) × 1).
Partition the mean vector and variance-covariance matrix accordingly:
: ; : ; : ;
U µu Σu Σuv
E[W] = E = Var[W] = .
V µv Σ!uv Σv
If W is Gaussian, then
E[U| V] = µu + Σuv Σ−1
v (V − µv ) (5.2)
Var[U| V] = Σu − Σuv Σ−1 !
v Σuv .

In terms of a Gaussian spatial random field, where Z(s0 ) and Z(s) are
jointly multivariate Gaussian with E[Z(s0 )] = µ(s0 ), Cov[Z(s), Z(s0 )] = σ,
Var[Z(s)] = Σ, and E[Z(s)] = µ(s), (5.2) becomes

E[Z(s0) | Z(s)] = µ(s0) + σ'Σ^{−1}(Z(s) − µ(s)).    (5.3)
This conditional expectation is linear in the observed data and establishing
its statistical properties is comparatively straightforward.

Under squared-error loss the conditional mean is the best predictor and the
mean-squared prediction error can be written as
E[(Z(s0) − p0(Z; s0))²] = Var[Z(s0)] − Var[p0(Z; s0)].    (5.4)

In the case of the Gaussian random field, where p0(Z; s0) is given by (5.3), we obtain

E[(Z(s0) − p0(Z; s0))²] = Var[Z(s0)] − σ'Σ^{−1}σ.    (5.5)
The result (5.4) is rather stunning. The term on the left-hand side must be
positive and variances are typically not subtracted. Somehow, the variation
of the best predictor under squared-error loss must be guaranteed to be less
than the variation of the random field itself (establishing (5.4) is a Chapter
problem). More importantly, this relationship conveys the behavior one should
expect from a predictor that performs well, that is, a predictor with small
mean-squared prediction error. It is a predictor that varies a lot. At first, this
seems contrary to the results learned in classical statistics where one searches
for those estimators of unknown quantities that have small mean square error.
In the search for UMVU estimators, this means finding the estimator with
least dispersion. There is a heuristic explanation for the fact that variable
predictors will perform well in this situation, however. Figure 5.1 shows a
realization Z(t) of a temporal process. In order to predict the value of the

series at time t = 20, three prediction functions are shown. The sample mean
Z and two kernel smoothers. As the smoothness of the prediction function
decreases, the variability of the predictor increases. The prediction function
that will be close on average to the value of the series is one that is allowed to
vary a lot. Based on these arguments one would expect the “best” predictor
to follow the data even more closely than the prediction functions in Figure
5.1. One would expect the best predictor to interpolate the time series at the
observed points. Kriging predictors in mean square continuous random fields
have this property, they honor the data.


Figure 5.1 Realization of a time series Z(t). The goal is to predict the value of the
series at t = 20. Three possible predictors are shown: the sample mean (horizontal
line) and kernel estimators with different bandwidth. Adapted from Schabenberger
and Pierce (2002).

Example 5.3 Simple linear regression. We close this introductory sec-


tion into spatial prediction by re-visiting a familiar (non-spatial) case, the
simple linear regression model for uncorrelated, homoscedastic data. Denote
the model as Yi = α + βxi + #i , i = 1, · · · , n, Cov[#i , #j ] = 0 if i (= j, and
Cov[#i , #i ] = σ 2 . In a first course in statistics, much ado is made about the
difference between a ξ × 100% confidence interval for E[Y0 ] and a ξ × 100%
prediction interval for Y0 = α + βx0 + #0 , a new observation. The familiar

formulas, if the errors are Gaussian, are

ŷ0 ± t_{1−ξ/2, n−2} σ̂ √(1/n + (x0 − x̄)²/Sxx)
ŷ0 ± t_{1−ξ/2, n−2} σ̂ √(1 + 1/n + (x0 − x̄)²/Sxx)
ŷ0 = α̂ + β̂x0
Sxx = Σ_{i=1}^{n} (xi − x̄)².

The “only” difference seems to be the additional “1+ ” under the square
root, and a common explanation for the distinction is that in one case we
consider Var['y0 ] and for the prediction interval we consider the variance of
the difference Var['y0 − y0 ]. We can examine the distinction now in terms of
the problem of finding the best predictor under squared-error loss.
First, we need to settle the issue whether the new observation is depen-
dent on the data Y = [Y1 , · · · , Yn ]! . Since the fitted model assumes that
the observed data are uncorrelated, there is no need to assume that a de-
pendency would exist with any new observation; hence, Cov[Yi , Y0 ] = 0, ∀i.
Under squared error loss, the best predictor of Y0 is E[Y0 |Y]; in our case the
conditional expectation is equal to the unconditional expectation because of
the independence. So, E[Y0 ] = α + βx0 is the best predictor. Since α and β
are unknown, we turn to the Gauss-Markov theorem, which instructs us that
ŷ0 = α̂ + β̂x0 is the best linear unbiased predictor, where α̂ and β̂ are the ordinary least squares estimators.
We have now arrived at the familiar result, that the best predictor of the
random quantity Y0 and the best estimator of the fixed quantity E[Y0 ] are
the same. But are the mean-squared prediction errors also the same? In order
to answer this question based on what we know up to now, we cannot draw
on equation (5.4), because y'0 is not the conditional expectation. Instead, we
draw on the following first principle: the mean-squared error M SE[U ; f (Y)]
for estimating (predicting) U based on some function f (Y) is
Var[U − f (Y)] = Var[U ] + Var[f (Y)] − 2Cov[U, f (Y)],
provided E[U ] = E[f (Y)].
In the simple linear regression example, we can apply this as follows, taking
note that

Var[Ŷ0] = σ² (1/n + (x0 − x̄)²/Sxx).
Now, consider both the estimation problem and the prediction problem in this
context.

(i) Estimation of E[Y0]: U = E[Y0], f(Y) = Ŷ0, Var[U] = 0, Cov[U, f(Y)] = 0. As a consequence,

MSE[U; f(Y)] = Var[U − f(Y)] = Var[f(Y)] = σ² (1/n + (x0 − x̄)²/Sxx).

(ii) Prediction of Y0: U = Y0, f(Y) = Ŷ0, Var[U] = σ², Cov[U, f(Y)] = 0. As a consequence,

MSE[U; f(Y)] = Var[U − f(Y)] = σ² (1 + 1/n + (x0 − x̄)²/Sxx).
The additional factor “1+ ” represents Var[Y0 ]. It is thus not quite correct to
say that in the estimation (confidence interval) case we consider the variance
of Y'0 and in the prediction case the variance of the difference Y0 − Y'0 . In both
cases we consider the variance of a difference between the target U and the
prediction function. In the case of estimation, U is a constant and for that
reason we drop Var[U ] and the covariance between U and f (Y). In the case
of prediction, U is a random variable. The covariance term is eliminated now
for a different reason, because of the independence assumption.

As we apply prediction theory to spatial data in the sections that follow,


keep in mind that we started the previous example by asking how the new
observation relates to the observed data. If Z(s) = [Z(s1 ), · · · , Z(sn )]! are spa-
tially correlated, then it is only reasonable to assume that a new observation
Z(s0 ) is part of the same process, and hence correlated with the observed data.
In this case the best estimator of the mean E[Z(s0 )] and the best predictor of
the random variable Z(s0 ) will differ. In other applications of prediction for
correlated data, you may need to revisit such assumptions.

Example 5.4 Imagine an observational clinical study in which patients are


repeatedly measured over time. Statistical models for such data often assume
that data from a particular subject are serially correlated and that data from
different patients are uncorrelated. To predict a future observation of a patient
who participated in the study, you would assume that the future observation is
correlated with the observed values for that patient and determine the corre-
lation according to the covariance structure that was applied to the observed
data. To predict the response of a patient who has not participated in the
study, an assumption of dependence with any of the observed data points is
not reasonable, if you assumed in the fitted model that patients’ responses
were independent.

5.2 Linear Prediction—Simple and Ordinary Kriging

Under squared-error loss one should utilize the conditional mean function for
predictions. Not only does p0(Z; s0) minimize the Bayes risk

E[L(Z(s0), p(Z; s0))],

it is also “unbiased” in the sense that E[p0 (Z; s0 )] = E[Z(s0 )]. This fol-
lows directly from the law of iterated expectations; E[E[Y |X]] = E[Y ]. If
{ Z(s) : s ∈ D ⊂ Rd } is a Gaussian random field, p0 (Z; s0 ) is also linear in
Z(s), the observed data. In general, however, p0 (Z; s0 ) is not a linear function
of the data and establishing the statistical properties of the best predictor
under squared-error loss can be difficult; even intractable. Thus, in statisti-
cal practice the search for good estimators is restricted to particular classes
of estimators. The properties of linearity in the observed data and unbiased-
ness are commonly imposed because of mathematical tractability and the
mistaken impression that unbiasedness is an intrinsically “good” feature. Not
surprisingly, similar constraints are imposed on prediction functions. In the
Gaussian case no additional restrictions are called for, since p0 (Z; s0 ) already
is linear and unbiased. For the general case, this new consideration of what
constitutes a best predictor leads to what are called Best Linear Unbiased
Predictors (BLUPs).
Consider random variables X and Y with joint density function f (x, y) (or
mass function p(x, y)). We are given the value of X and wish to predict Y from
it, subject to the condition that the predictor p(X) satisfies E[p(X)] = E[Y ] ≡
µy and subject to a linearity condition, p(X) = α + βX. Under squared-error
loss this amounts to finding the function p(X) that minimizes
6 7
M SE[p(X); Y ] = E (Y − p(X))2 subject to
p(X) = α + βX β = (µx − µy )/α.
The solutions to this minimization problem are (Chapter problem 5.4)
σxy σxy
α = µy − 2 µx β= 2
σx σx
σxy
p(X) = BLU P (Y |X) = µy − 2 (µx − X), (5.6)
σx
where σxy = Cov[X, Y ], σx2 = Var[X].

Example 5.3 (Simple linear regression. Continued) How do these re-


sults relate to the simple linear regression (SLR) model? In order to make the
connection between (5.6) and prediction in the SLR model, we assume that
the conditional mean function is given by
Yi | x = α + βxi + εi,   εi ∼ iid (0, σ²),   i = 1, ..., n.

The ordinary least squares estimators of α and β and the predictor of Y | x, based on the sample, are

α̂ = Ȳ − β̂x̄,    β̂ = s²xy / s²xx
Ŷ = Ȳ − β̂(x̄ − x) = α̂ + β̂x,    (5.7)

where s²xy = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)Yi and s²xx = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)².
These expressions are based on the sample estimators of the unknown quan-

tities in (5.6). Based on the discussion that follows, you will be able to show
that (5.7) is indeed the best linear unbiased predictor of Y |x if the means,
variance of X, and covariance are unknown. By the Gauss-Markov theorem we
know that (5.7) is also the best linear unbiased estimator (BLUE) of E[Y |x].
The differences in their mean-squared prediction errors were established pre-
viously.

5.2.1 The Mean Is Known—Simple Kriging

Consider spatial data Z(s) = [Z(s1 ), · · · , Z(sn )]! and assume


Z(s) = µ(s) + e(s), e(s) ∼ (0, Σ). (5.8)
Thus, E[Z(s)] = µ(s) and Var[Z(s)] = Σ, and we initially assume both µ(s)
and Σ are known. The goal is to find the predictor of Z(s0 ), p(Z; s0 ), that min-
imizes E[(p(Z; s0 ) − Z(s0 ))2 ]. This is a formidable problem, so we refine it by
considering linear predictors of the form p(Z; s0 ) = λ0 + λ! Z(s), where λ0 and
the elements of λ = [λ1 , · · · λn ]! are unknown coefficients to be determined.
Then E[(p(Z; s0 ) − Z(s0 ))2 ] can be written as
E[(p(Z; s0 ) − Z(s0 ))2 ] = E[(λ0 + λ! Z(s) − Z(s0 ))2 ].
Adding and subtracting the term E[p(Z; s0 ) − Z(s0 )] ≡ λ0 + λ! µ(s) − µ(s0 ),
where µ(s0 ) = E[Z(s0 )], we obtain
E[(p(Z; s0 ) − Z(s0 ))2 ] = Var[λ! Z(s) − Z(s0 )]
+ (λ0 + λ! µ(s) − µ(s0 ))2 .
Since both terms are nonnegative, the expected mean-squared prediction error
will be minimized when each term is as small as possible. Clearly, the second
term is minimized by taking λ0 = µ(s0) − λ'µ(s). If Var[Z(s0)] = σ² and σ = Cov[Z(s), Z(s0)], then the first term can be written as

Var[λ'Z(s) − Z(s0)] = σ² + λ'Σλ − 2σ'λ.    (5.9)
Calculus can be used to determine the minimum of this objective function.
Differentiating with respect to λ and equating to zero gives λ! Σ = σ ! . From
matrix algebra theory, these equations have a unique solution if Σ is nonsin-
gular (so that duplicate measurements at the same location are excluded).
Thus, the optimal choices for λ0 and λ are
λ0 = µ(s0 ) − λ! µ(s);
λ = Σ−1 σ
and the optimal linear predictor is

psk(Z; s0) = λ0 + λ'Z(s) = µ(s0) + σ'Σ^{−1}(Z(s) − µ(s)).    (5.10)
In geostatistical parlance, (5.10) is called the simple kriging predictor. It is
the best linear predictor under squared-error loss. The predictor is equivalent

to the conditional mean in a Gaussian random field, (5.3). It is thus the best
predictor (linear or not) under squared-error loss if Z(s) is a GRF. If the
joint distribution of the data is not Gaussian, then (5.10) is the best predictor
among those that are linear in Z(s). It is also unbiased, but since we did not
impose an unbiasedness constraint as part of the minimization, the simple
kriging predictor is best in the class of all linear predictors. Kriging is often
referred to as optimal spatial prediction, but optimality considerations are
confined to this sub-class of predictors unless the random field is Gaussian.
Substitution of λ! = σ ! Σ−1 into the expression for the mean-squared error
in (5.9) yields the minimized mean-squared prediction error, also called the (simple) kriging variance

σ²sk(s0) = σ² − σ'Σ^{−1}σ,    (5.11)
which agrees with Var[Z(s0 )| Z(s)] in (5.5) in the Gaussian case. The simple
kriging variance depends on the prediction location through the vector σ of
covariances between Z(s0 ) and the observed data.
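The predictor and its variance are short to compute once Σ and σ are available. The sketch below uses the exponential covariance model and a known constant mean; the data, parameter values, and function names are illustrative only.

import numpy as np

def exp_cov(h, sigma2=1.0, alpha=2.0):
    # exponential covariance C(h) = sigma^2 exp(-h/alpha)
    return sigma2 * np.exp(-h / alpha)

def simple_krige(s, z, s0, mu, sigma2=1.0, alpha=2.0):
    H = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
    Sigma = exp_cov(H, sigma2, alpha)                              # Var[Z(s)]
    sig0 = exp_cov(np.linalg.norm(s - s0, axis=1), sigma2, alpha)  # Cov[Z(s), Z(s0)]
    lam = np.linalg.solve(Sigma, sig0)                             # lambda = Sigma^{-1} sigma
    pred = mu + lam @ (z - mu)                                     # (5.10) with constant mean
    var = sigma2 - sig0 @ lam                                      # (5.11)
    return pred, var

s = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.5, 1.5]])
z = np.array([1.2, 0.7, 0.4, 1.0])
pred, var = simple_krige(s, z, s0=np.array([0.5, 0.5]), mu=0.8)

Setting s0 equal to one of the data locations returns the observed value there with zero kriging variance, illustrating the interpolation property discussed next.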
It was noted in §5.1 on heuristic grounds, that a good predictor should
be variable in the sense that it follows the observed data closely. The simple
kriging predictor has an interesting property that it shares with many other
types of kriging predictors. Consider predicting at locations where data are
actually observed. Thus, the predictor psk (Z; s0 ) becomes psk (Z; [s1 , · · · , sn ]! ),
and in (5.10) we replace Cov[Z(s0 ), Z(s)] = σ ! with Cov[Z(s), Z(s)] = Σ and
µ(s0 ) with µ(s) to obtain
psk (Z; [s1 , · · · , sn ]! ) = µ(s) + ΣΣ−1 (Z(s) − µ(s))
= Z(s).
Thus, the simple kriging predictor interpolates the observed data. It is an
“exact” interpolator or said to “honor the data.” Historically, many disci-
plines have considered this to be a very desirable property; one that should
be asked of any spatial predictor. However, in some situations, smoothing, as
is typically done in most regression situations, may be more appealing. For
example, when the semivariogram of the spatial process contains a nugget ef-
fect, it is not necessarily desirable to interpolate the data. If the nugget effect
consists of micro-scale variability only, then a structured portion of the spatial
variability has not been observed and honoring the observed data is reason-
able. If the nugget effect contains a measurement error component, that is,
Z(s) = S(s) + #(s), where #(s) is the measurement error at location s, then we
do not want the predictor to interpolate the data. We are then not interested
in the amount that has been erroneously measured, but the amount S(s) that
is actually there. The predictor should be an interpolator of the signal S(s),
not the observed amount Z(s). In §5.4.3 these issues regarding kriging with
and without a measurement error will be revisited in more detail.
The term simple kriging is unfortunate on another ground. There is nothing
simple or common about the requirement that the mean µ(s) of the random
field be known. An exception is best linear unbiased prediction of residuals

from a regression fit. If Z(s) = X(s)β + ε(s), then the vector of ordinary least squares (OLS) residuals

ε̂(s) = Z(s) − X(s)(X(s)'X(s))^{−1}X(s)'Z(s)    (5.12)

has (known) mean 0, provided that the mean model X(s)β has been speci-
has (known) mean 0—provided that the mean model X(s)β has been speci-
fied correctly.∗ One application of simple kriging is thus the following method
intended to cope with the problem of non-stationarity that arises from large-
scale trends in the mean of Z(s). It is sometimes incorrectly labelled as “Uni-
versal Kriging,” which it is not (see §5.3.3).

1. Specify a linear spatial model Z(s) = X(s)β + e(s).
2. Fit the model by OLS to obtain β̂_ols = (X(s)'X(s))^{−1}X(s)'Z(s).
3. Perform simple kriging on the OLS residuals (5.12) to obtain psk(ε̂; ε̂(s0)).
4. Obtain the kriging predictor of Z(s0) as x(s0)'β̂_ols + psk(ε̂; ε̂(s0)).

Although it appears to be a reasonable thing to do, there are many problems


with this approach:
• Since Var[e(s)] = Σ, ordinary least squares estimation is not most efficient. The large-scale trend parameters β should be estimated by generalized least squares (GLS):

β̂_gls = (X(s)'Σ^{−1}X(s))^{−1}X(s)'Σ^{−1}Z(s).

• GLS requires knowledge of Σ. If Σ is unknown, one can resort to estimated GLS (EGLS), where Σ is replaced with an estimate Σ̂:

β̂_egls = (X(s)'Σ̂^{−1}X(s))^{−1}X(s)'Σ̂^{−1}Z(s).

How to obtain this estimate is not clear. Using the OLS residuals (or even the GLS residuals) and estimating the covariance function or semivariogram of ε(s) from ε̂(s) leads to biased estimators. The fitted residuals comply with constraints that the unobservable model errors ε(s) are not subject to, e.g., 1'ε̂_ols(s) = 0 if X(s) contains an intercept. The fitted residuals will exhibit more negative correlations than e(s).
• x(s0)'β̂ + psk(ε̂(s); ε̂(s0)) is an unbiased predictor of Z(s0), provided the
model was correctly specified, but it is not the BLUP. The practical rami-
fication is that the simple kriging variance inferred from the residuals does
not incorporate the variability associated with estimating β, regardless of
whether OLS or GLS is used. Thus, the uncertainty associated with pre-
dictions from this approach is estimated too low.
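For concreteness, a minimal sketch of steps 1–4 follows, keeping the caveats just listed in mind. It is written in Python with NumPy and assumes an exponential covariance model with known sill and practical range (no nugget); the function name, coordinates, design matrix, and data are hypothetical, and neither EGLS nor the understated prediction uncertainty is addressed.

```python
import numpy as np

def exp_cov(h, sill=1.0, prac_range=10.0):
    """Exponential covariance C(h) = sill * exp(-3h/range); assumed known."""
    return sill * np.exp(-3.0 * h / prac_range)

def krige_ols_residuals(coords, X, z, x0, s0, sill=1.0, prac_range=10.0):
    # Step 2: OLS fit of the mean model Z(s) = X(s)beta + e(s)
    beta_ols, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta_ols                        # OLS residuals, eq. (5.12)

    # Step 3: simple kriging of the residuals (known mean 0)
    D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    Sigma = exp_cov(D, sill, prac_range)            # Var[e(s)]
    sigma0 = exp_cov(np.linalg.norm(coords - s0, axis=1), sill, prac_range)
    lam = np.linalg.solve(Sigma, sigma0)            # simple kriging weights
    p_sk = lam @ resid                              # predicted residual at s0

    # Step 4: add back the estimated trend at s0
    return x0 @ beta_ols + p_sk

# hypothetical data: five locations on a transect with a linear trend in x
coords = np.array([[0., 0.], [1., 0.], [2., 0.], [3., 0.], [4., 0.]])
X = np.column_stack([np.ones(5), coords[:, 0]])
z = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
s0 = np.array([2.5, 0.0]); x0 = np.array([1.0, 2.5])
print(krige_ols_residuals(coords, X, z, x0, s0))
```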
∗ Expressions involving inverses of functions of the regression matrix X(s) are written in
terms of regular inverses in this and the following chapter. The X(s) matrix is an (n × p)
matrix of rank k and k can be less than p. In this case it is safe to substitute general-
ized inverses in expressions such as (X(s)′X(s))−1 and (X(s)′Σ−1 X(s))−1. However, we
always assume that the variance matrix Σ is positive definite and hence non-singular.
5.2.2 The Mean Is Unknown and Constant—Ordinary Kriging

The simple kriging predictor is used when the mean µ(s) in model (5.8) is
known. With this model, the mean can change with spatial location. If E[Z(s)]
is unknown but constant across locations, E[Z(s)] ≡ µ1, best linear unbiased
prediction under squared-error loss is known as ordinary kriging.
We need to find the predictor p(Z; s0 ) of Z(s0 ) that minimizes E[(p(Z; s0 )
−Z(s0 ))2 ], when the data follow the model
Z(s) = µ1 + e(s), e(s) ∼ (0, Σ).
Thus, E[Z(s)] = µ1 and Var[Z(s)] = Σ, where µ is an unknown constant and
Σ is known.
As in the development of the simple kriging predictor, we consider linear
predictors of the form p(Z; s0) = λ0 + λ′Z(s), where λ0 and the elements
of the vector λ = [λ1, · · · , λn]′ are unknown coefficients to be determined.
Repeating the development in §5.2.1 gives λ0 = µ − λ′µ1. However, this does
not determine the value of λ0 since µ is unknown. When the mean is unknown,
there is no best linear predictor in the class of all linear predictors. Thus, we
refine the problem by further restricting the class of linear predictors to those
that are also unbiased. Since the mean of Z(s) does not depend on s, it is
reasonable to posit also that E[Z(s0)] = µ. Then we require for unbiasedness
that E[p(Z; s0)] = E[Z(s0)] or, equivalently, E[λ0 + λ′Z(s)] = E[Z(s0)], which
implies that λ0 + µ(λ′1 − 1) = 0. Since this must hold for every µ, it must
hold for µ = 0 and so the unbiasedness constraint requires that λ0 = 0 and
λ′1 = 1.
Now our problem is to choose weights λ = [λ1, · · · , λn]′ that minimize
E[(λ′Z(s) − Z(s0))²] subject to λ′1 = 1.
This can be accomplished as an unconstrained minimization problem by intro-
ducing the Lagrange multiplier m:

    arg min_{λ,m} Q = arg min_{λ,m} E[(λ′Z(s) − Z(s0))²] − 2m(λ′1 − 1).    (5.13)

5.2.2.1 Ordinary Kriging in Terms of the Covariance Function

Expanding the expectation in (5.13), putting Var[Z(s0)] = σ² = C(0), and
assuming Z(s) is second-order stationary, we obtain

    Q = C(0) + λ′Σλ − 2λ′σ − 2m(λ′1 − 1).

Taking derivatives with respect to λ and m and setting to zero yields the
system of equations

    ∂Q/∂λ = 2Σλ − 2σ − 2m1 ≡ 0
    ∂Q/∂m = −2(λ′1 − 1) ≡ 0.
The factor 2 in front of the Lagrange multiplier was chosen to allow cancella-
tion. It is left as an exercise (Chapter problem 5.8) to show that the solutions
to this problem, the ordinary kriging weights, are (Cressie, 1993, p. 123)

    λ′ = { σ + 1 [ (1 − 1′Σ−1σ) / (1′Σ−1 1) ] }′ Σ−1                      (5.14)
    m  = (1 − 1′Σ−1σ) / (1′Σ−1 1),                                        (5.15)

and that the minimized mean-squared prediction error, the ordinary kriging
variance, is (Cressie, 1993, p. 123)

    σ²ok(s0) = C(0) − λ′σ + m
             = C(0) − σ′Σ−1σ + (1 − 1′Σ−1σ)² / (1′Σ−1 1).                 (5.16)

Notice that σ²sk < σ²ok, since the last term on the right-hand side of (5.16)
is positive. Not knowing the mean of the random field increases the mean-
squared prediction error. The expression pok(Z; s0) = λ′Z(s), where λ is given
by (5.14), hides the fact that the unknown mean of the random field is actually
estimated implicitly. The formulation of the ordinary kriging predictor we
prefer is the GLS form

    pok(Z; s0) = µ̂ + σ′Σ−1(Z(s) − 1µ̂)                                     (5.17)

(Toutenburg, 1982, p. 141; Cressie, 1993, p. 173; Gotway and Cressie, 1993).
This formulation shows the correspondence between ordinary and simple krig-
ing, as well as the connection to the best predictor in the Gaussian random
field, more clearly. Comparing (5.17) and (5.10), it appears that "all" that is
required to accommodate an unknown mean is to replace µ with an estimate
µ̂. It is important to note that not just any estimate will do. The algebraic
manipulations leading from λ′Z(s) to (5.17) reveal that µ must be estimated
by its best linear unbiased estimator, which in this case is the generalized least
squares estimator (Chapter problem 5.9; Goldberger, 1962):

    µ̂ = (1′Σ−1 1)−1 1′Σ−1 Z(s).                                           (5.18)
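The GLS form (5.17)–(5.18), together with the kriging variance (5.16), translates directly into code. The following is a minimal sketch in Python/NumPy; it takes Σ, the covariance vector σ, and C(0) as given, i.e., the covariance parameters are treated as known, and the function name and arguments are illustrative.

```python
import numpy as np

def ordinary_krige(Sigma, sigma0, C0, z):
    """Ordinary kriging via the GLS form (5.17)-(5.18).

    Sigma  : (n, n) covariance matrix of the data, Var[Z(s)]
    sigma0 : (n,)   covariances Cov[Z(s), Z(s0)]
    C0     : scalar variance C(0) = Var[Z(s0)]
    z      : (n,)   observed data Z(s)
    """
    ones = np.ones_like(z)
    Sigma_inv = np.linalg.inv(Sigma)
    # GLS estimate of the constant mean, eq. (5.18)
    mu_hat = (ones @ Sigma_inv @ z) / (ones @ Sigma_inv @ ones)
    # ordinary kriging predictor, eq. (5.17)
    p_ok = mu_hat + sigma0 @ Sigma_inv @ (z - mu_hat * ones)
    # ordinary kriging variance, eq. (5.16)
    var_ok = (C0 - sigma0 @ Sigma_inv @ sigma0
              + (1.0 - ones @ Sigma_inv @ sigma0) ** 2
              / (ones @ Sigma_inv @ ones))
    return p_ok, var_ok
```

In practice one would solve the linear systems (e.g., via a Cholesky factorization) rather than form the explicit inverse; the inverse is kept here only to mirror the formulas above.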

In the derivation of pok (Z; s0 ) presented above, the assumption of second-


order stationarity of the process was implicit; e.g., by putting Var[Z(s0 )] =
C(0). If the process is not second-order stationary, then C(h) is not defined.
The variance-covariance matrix of the model errors as well as the covariance
vector between Z(s0 ) and Z(s) can be constructed, however. In the second-
order stationary case the (i, j)th element of Σ is given by C(si − sj ), which is
C(||si − sj ||) if the process is isotropic. For a non-stationary process we con-
struct Σ = [C(si , sj )] and σ = Cov[Z(s), Z(s0 )] = [C(s0 , s1 ), · · · , C(s0 , sn )]!
and calculate λ and m in the usual way (equations (5.14) and (5.15)). The
ordinary kriging variance for the case of a non-stationary covariance function
becomes
2
σok (s0 ) = C(s0 , s0 ) − λ! σ + m.
Thus, it is not necessary to assume C(si, sj) = C(si − sj) to solve the kriging
problem. The actual process of best linear unbiased prediction does not in-
volve any type of stationarity. But because modeling the covariance function
or semivariogram in the absence of stationarity is typically difficult, issues
of stationarity cannot be separated from best linear prediction. In practical
applications Var[Z(s)] is not known.

5.2.2.2 Ordinary Kriging in Terms of the Semivariogram

The kriging equations can also be expressed in terms of the semivariogram
of the process. This is helpful for spatial prediction with processes that are
intrinsically, but not second-order, stationary. The derivation of the kriging
weights λ in terms of γ(h) follows along the same lines as before. The criterion
(5.13) is now expanded in terms of the matrix Γ = [γ(si − sj)] and the vector
γ(s0) = [γ(s0 − s1), · · · , γ(s0 − sn)]′:

    Q = −λ′Γλ + 2λ′γ(s0) − 2m(λ′1 − 1).

Differentiating with respect to λ and m, setting to zero, and solving the two
simultaneous equations yields (Cressie, 1993, p. 122)

    λ′ = { γ(s0) + 1 [ (1 − 1′Γ−1γ(s0)) / (1′Γ−1 1) ] }′ Γ−1              (5.19)
    m  = − (1 − 1′Γ−1γ(s0)) / (1′Γ−1 1)                                    (5.20)
    pok(Z; s0) = λ′Z(s)                                                    (5.21)
    σ²ok(s0) = λ′γ(s0) + m = 2λ′γ(s0) − λ′Γλ.                              (5.22)

5.2.3 Effects of Nugget, Sill, and Range

The ordinary kriging predictor at location s0 is pok(Z; s0) = λ′Z(s), where
the kriging weights λ are given by (5.14). The two important components,
the "driving forces" behind the ordinary kriging weights, are the vector σ =
Cov[Z(s0), Z(s)] and the variance-covariance matrix Σ = Var[Z(s)]. To be
more precise, it is the inverse covariance matrix and the vector of covariances
between attributes at prediction locations and observed locations that drive
the kriging weights. The important point we wish to make is that the spa-
tial dependency structure has great impact on ordinary (and simple) kriging
predictions.
We are not concerned with the issue that Σ is unknown, that its elements
(parameters) need to be estimated, and that the variability of the estimates
is not reflected in the standard expressions for the kriging error, such as equa-
tion (5.16). These issues are considered in §5.5. Of concern now is the fact
that Σ is unknown, and that you adopt a parametric model to capture the spatial
dependence, for example, one of the semivariogram models of §4.3. Obviously,
the choice of model has an impact on the kriging predictions. In general,
however, the precise implications are difficult to describe, because different
features of competing models can enhance or suppress a particular effect in
model comparisons. For example, predictions based on highly continuous co-
variance models tend to be more smooth than predictions using a model of
lesser continuity (see §2.3). A gaussian semivariogram model will generally
produce smoother predicted surfaces than an exponential model (with the
same practical range and sill). Intuitively, it is the near-origin behavior of the
semivariogram or covariance function that has an important impact on krig-
ing predictions. Since the nugget-only model is the model of least continuity,
and a model with nugget effect can be thought of as a nested model, an ex-
ponential model without nugget effect can produce predictions that are more
smooth than those from a gaussian model with nugget effect. Statements and
conclusions about the effect of semivariogram parameters on ordinary krig-
ing predictions thus need to be made with care and appropriate caveats. We
consider here a simple example that draws on Isaaks and Srivastava (1989) to
illustrate some important points, ceteris paribus.

Example 5.5 Isaaks and Srivastava (1989, pp. 291, 301–307) use a small
data set of seven observations and one prediction location to examine the
effect of semivariogram parameters on ordinary kriging predictions. We use
a data set of the same size; the observed data locations and their attribute
values are as follows:

i si Z(si )
1 [5,20] 100
2 [20,2] 70
3 [25,32] 60
4 [8,39] 90
5 [10,17] 50
6 [35,20] 80
7 [38,10] 40

The prediction location is s0 = [20, 20], and the sample mean of the observed
data is Z = 70.0. The observed locations surround the prediction location.
Notice that, in contrast to Isaaks and Srivastava (1989), two locations are
equidistant from the prediction location (s1 and s6 , Figure 5.2).
The kriging weights (5.14), predictions, and kriging variance (5.16), are
computed for the following series of semivariogram models.
Figure 5.2 Seven observed locations and target prediction location. The dotted rays
show Euclidean distance between the observed location and the target location.

Model   Practical Range   Sill   Nugget   γ(h)                         Type
A       20                10     0        10(1 − e^{−3h/20})           Exponential
B       10                10     0        10(1 − e^{−3h/10})           Exponential
C       20                10     5        5 + 5(1 − e^{−3h/20})        Exp. + nugget
D       –                 –      10       10                           Nugget only
E       20                20     0        20(1 − e^{−3h/20})           Exponential
F       20                10     0        10(1 − e^{−3(h/20)²})        Gaussian

Models A and B differ in the range, models A and C in the relative nugget
effect. Model D is a nugget-only model in which data are not spatially corre-
lated. A comparison of models A and E highlights the effect of the variability
of the random field. The final model has the same variability and practical
range as the exponential model A, but a much higher degree of spatial conti-
nuity. Model F exhibits large short-range correlations.
The kriging weights sum to 1.0 in all cases (within roundoff error) as needed
(Table 5.1); recall that ordinary kriging weights are derived subject to the con-
straint λ′1 = 1. Interestingly, as the degree of spatial continuity decreases, so
does the “variation” among the kriging weights λi . Model D, the nugget-only
model, assigns the same weight λi = 1/n to each observation. The resulting
Table 5.1 Kriging weights for predicting Z(s0) with semivariograms A–F. λi denotes
the kriging weight for the attribute Z(si). sλ is the "standard deviation" of the seven
kriging weights for a particular model.

Model        pok      σ²ok     λ1      λ2     λ3     λ4     λ5     λ6     λ7
A            66.23     9.74    0.08    0.13   0.20   0.10   0.24   0.15   0.09
B            69.04    11.25    0.12    0.14   0.16   0.14   0.16   0.14   0.13
C            68.64    10.63    0.12    0.14   0.17   0.12   0.18   0.14   0.12
D            70.00    11.43    0.14    0.14   0.14   0.14   0.14   0.14   0.14
E            66.23    19.48    0.08    0.13   0.20   0.10   0.24   0.15   0.09
F            44.52     6.67   −0.35    0.08   0.28   0.06   0.75   0.18   0.01
||s0 − si||                    15.0    18.0   13.0   22.5   10.4   15.0   20.6

Model    A      B      C      D      E      F
sλ       0.06   0.01   0.03   0.00   0.06   0.33

predictor is the sample mean. Recall that models A and B are identical, ex-
cept for the practical range. In the model with larger range (A), short-distance
correlations are higher, creating greater heterogeneity in the weights.
The results in Table 5.1 also demonstrate that points that are separated
by more than the range do not have zero kriging weights. Also, the kriging
weight of these points is not 1/n, unless all observations are uncorrelated (as
in model D). For example, s1 is further from the prediction location than the
(practical) range in model B, yet its kriging weight is 0.12. Points more distant
than the range are spatially correlated with other points that are less distant
from the target location than the range. This is called the relay effect (Chilès
and Delfiner, 1999, p. 205).
Models A and E are identical, except for the sill of the semivariogram. The
variability of the random field is twice as large for E as for A. This has
no effect on the kriging weights, and hence the kriging predictor is the same
under the two models. The kriging variance, however, increases accordingly
with the variability of the random field.
Finally, model F is highly continuous, with large short-distance correlations.
Since the range and sill of model F are identical to those of model A, the
gaussian model’s short-distance correlations exceed those of the exponential
model. The kriging weights show the most “variation” of the models in Table
5.1 and the value closest to the prediction location, Z(s5 ), receives the most
weight. It contributes 3/4 of its own value to pok (Z; s0 ), accounting for more
than half of the predicted amount. Maybe surprisingly, this model yields a
negative kriging weight for the observation at s1 . A similar effect can be noted
for model A, but it is less pronounced. The weight for Z(s1 ) is considerably
less than that for observation Z(s5 ), although they occupy very similar points
in the spatial configuration. It is exactly because they occupy similar positions
that Z(s1 ) receives small weight, and even a negative weight in model F. The
effect of Z(s1 ) is screened by the observation at s5 , because it lies “behind”
it relative to the prediction location.
In the derivation of the kriging weights only a "sum-to-one" constraint was
imposed, but not a positivity constraint. At first glance,
a negative kriging weight may seem undesirable. If weights can be negative,
then so, possibly, can the predicted values. Spatial attributes are often positive, how-
ever, e.g., yields, concentrations, counts. When the weights are restricted to
be positive, then all predicted values lie between the minimum and maximum
observed value. Szidarovsky et al. (1987) derive a version of kriging with only
positive weights. While this predictor has attractive advantages for obtaining
predicted values of nonnegative processes, the extra constraint may lead to
unacceptably large kriging standard errors (Cressie, 1993, p. 143).

5.3 Linear Prediction with a Spatially Varying Mean

In §5.2 we considered models of the form (5.8):

    Z(s) = µ(s) + e(s),    e(s) ∼ (0, Σ).

With simple kriging in §5.2.1, we assume that the mean µ(s) is known. With
ordinary kriging in §5.2.2, we assume µ(s) is unknown but not spatially vary-
ing, µ(s) ≡ µ1. Neither assumption is necessarily met, although there
are cases where it is easy to verify a constant mean, for example, when we
work with fitted residuals. Recall from §2.4.1 the operational decomposition
of variability in a random field into large-scale trend µ(s), smooth, small-scale
variation W(s), micro-scale variation η(s), and measurement error ε(s):

    Z(s) = µ(s) + e(s) = µ(s) + W(s) + η(s) + ε(s).

The model underlying ordinary kriging predictions assumes that µ(s) = µ
and that all variation in the data is associated with the spatial dependency
structure W(s) + η(s) plus some white noise ε(s). On the other hand, the
model

    Z(s) = X(s)β + ε,    ε ∼ (0, σ²I)

assumes that apart from white noise all variability is associated with changes
in the mean function. The fact that we consider such very different models
for modeling (and predicting) spatial data is due to the adage that "one
modeler's fixed effect (regressor variable) is another modeler's random effect
(spatial dependency)." Historically, estimation and prediction in models for
spatial data started at the two extremes: regression models with uncorrelated
errors (statistics) and correlated errors with a constant mean (geostatistics).
Both approaches provide considerable simplifications over the case that is
probably most relevant: the combination of a spatially varying mean and
spatial dependency, which gives rise to models of the form

    Z(s) = µ(s) + e(s),    e(s) ∼ (0, Σ).

In this section we make the assumption that µ(s) is linear in some covariates,
so that

    Z(s) = X(s)β + e(s),    e(s) = W(s) + η(s) + ε(s) ∼ (0, Σ).          (5.23)

Any one or several of the random components of e(s) may be zero (or as-
sumed to be zero) at times. The common thread of models of form (5.23) is
linearity of the mean function and a spatially correlated error process. The
notation X(s) is used to emphasize that the (n × p) matrix X depends on
spatial coordinates. The columns of this regressor matrix are usually com-
prised of spatial variables, although there are applications where the columns
of X do not depend on spatial coordinates at all. Furthermore, X may contain
dummy (design) variables and can be of less than full column rank. This situ-
ation arises when experimental data with design and treatment structure are
modeled spatially. Occasionally, the dependence of X(s) on s will be omitted
for brevity.
The random processes W(s), η(s), and ε(s) have mean zero and the measure-
ment errors are uncorrelated,

    Cov[ε(si), ε(sj)] = 0  for si ≠ sj,    Cov[ε(si), ε(si)] = σ²ε.

As a result, the error process of (5.23) is a zero-mean stochastic process with

    Var[e(s)] = ΣW + Ση + σ²ε I ≡ Σ.

The spatial stochastic (second-order) structure is represented by the var-
iance-covariance matrix components of Σ. The models of interest in the anal-
ysis of spatial data can coarsely be classified into two categories. Models for
uncorrelated errors assume that W(s) = η(s) ≡ 0 and hence only the mea-
surement error process ε(s) remains. (One can generalize this process by al-
lowing the variances to vary with spatial location, Var[ε(si)] = σ²ε,i.) Models
with correlated errors assume that Σ is not diagonal. Based on the discus-
sion in previous chapters it seems somewhat odd to analyze spatial data as
if random disturbances were uncorrelated, since we know that spatial data
at nearby locations usually exhibit similarities. There is (some) justification
for this approach, however—beyond the desire for an analysis that is simple;
such methods are the focus of §5.3.1 and §5.3.2. Statistical techniques for lin-
ear models with uncorrelated errors are simple, well-known, and commercial
software packages to perform the numerical chores of estimation, prediction,
and hypothesis testing are readily available. By comparison, statistical soft-
ware for modeling correlated data is a much more recent development and
the mathematical-statistical theory of correlated error models is more com-
plicated and still evolving. The temptation to bring models for spatial data
into the classical linear model (regression) framework is understandable.
If the process contains a smooth-scale spatial component, W (s), then the
smooth fluctuations in the spatial signal are handled in an uncorrelated error
model by allowing the mean function X(s)β to be sufficiently flexible. In other
words, the mean function is parameterized to capture local behavior. With
geostatistical data this can be accomplished parametrically by expressing the
mean as a polynomial function of the spatial coordinates. As the local fluctu-
ations of the spatial signal become more pronounced, higher-order terms must
be included (§5.3.1). A non-parametric alternative is to model local behavior
by applying d-dimensional smoothing or to localize estimation (§5.3.2). The
degree of smoothness is then governed by a smoothing parameter (bandwidth).
The contemporary approach, however, is to assume that some spatial sto-
chastic structure is present, which conveys itself through W(s) and η(s);
hence Σ will be a non-diagonal matrix. (The argument that Var[W(s)] = σ²W I,
Var[η(s)] = σ²η I, and hence Σ is diagonal, is vacuous. The individual random
components would not be identifiable.) Modeling then comprises parameteri-
zation of the mean function, i.e., proper choice of the columns of X(s), as well
as parameterization of Var[Z(s)]. Unstructured variance-covariance matrices
for spatial data are not a sensible option because of the large number of pa-
rameters that would have to be estimated. We usually place some parametric
structure on the variance-covariance matrix. To make the parametric nature
of the covariance matrix more explicit, model (5.23) should be written as the
spatial linear model

    Z(s) = X(s)β + e(s),    e(s) ∼ (0, Σ(θ)).                            (5.24)

The (q × 1) vector θ contains the unknown parameters of the error process.
This is the model for which prediction theory is presented in §5.3.3.

5.3.1 Trend Surface Models

The idea of the trend surface approach is to model the mean function in (5.24)
with a highly parameterized fixed effects structure, comprised of functions of
the spatial coordinates, si = [xi, yi]′. For example, a first-degree trend surface
model is

    Z(si) = β0 + β1 xi + β2 yi + εi,    εi ∼ iid (0, σ²).

If E[Z(si)] = µ, and the Z(si) are correlated, then this model is incorrect in
several places: β0 + β1 xi + β2 yi is not the model for the mean and the errors are
not iid. By over-parameterizing the mean, the model accounts for variability
that is associated with the spatial random structure. The approach pretends
that the models

    Z(s) = 1µ + e(s),    e(s) ∼ (0, Σ)

and

    Z(s) = X(s)β + ε,    ε ∼ (0, σ²I)
are equivalent representations of the spatial variability seen in the realiza-
tion of a random field. If e(s) contains a smooth, small-scale component,
then X(s)β captures the local behavior of the random field, not its mean. If
e(s) contains a nugget effect, its variance should equal σ 2 if the fixed effects
structure X(s)β removed all smooth, small-scale variation. In practice, it is
impossible to separate out all the potential components of a spatial process
from a single set of spatial data, and both versions of the model may fit the
data equally well, although their interpretations will be dramatically different.
The name trend surface model stems from the parameterization of X(s)β in
terms of polynomial trend surfaces in d coordinates of degree p. For a process
in R², we write the trend surface model as

    Z(si) = x(si)β + εi
    x(si)β = Σ_{k=0}^{p} Σ_{m=0}^{p} βkm x_i^k y_i^m,    k + m ≤ p        (5.25)
    εi ∼ (0, σ²)
    Cov[εi, εj] = 0  ∀ i ≠ j.

Estimation and prediction theory is straightforward in models with uncor-
related (and homoscedastic) errors. Specifically, the regression coefficients are
estimated by OLS as

    β̂ols = (X(s)′X(s))−1 X(s)′Z(s),

and the estimate of the residual variance is

    σ̂² = (1/(n − k)) Σ_{i=1}^{n} { Z(si) − Ê[Z(si)] }².

The BLUE of E[Z(si)] and the BLUP of Z(si) are the same; the best estimator
and the best predictor are

    Ẑ(s0) = Ê[Z(s0)] = x′(s0)β̂ols.

The mean-squared prediction error for Z(s0) based on Ẑ(s0) is

    MSE[Z(s0); Ẑ(s0)] = σ² { 1 + x′(s0)(X(s)′X(s))−1 x(s0) },

where we assumed that the new data point Z(s0) is uncorrelated with the
observed data. This is consistent with the uncorrelated error assumption of
model (5.25).
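As an illustration, the following Python/NumPy sketch builds the design matrix of a degree-p trend surface, fits it by OLS, and evaluates the prediction variance above at a new location. The coordinates, data, and prediction point are hypothetical; for serious use one would center or orthogonalize the polynomial terms to avoid the ill-conditioning noted later in this section.

```python
import numpy as np

def trend_surface_design(x, y, p):
    """Columns x^k * y^m for all k + m <= p (intercept included)."""
    cols = [x**k * y**m for k in range(p + 1) for m in range(p + 1 - k)]
    return np.column_stack(cols)

def fit_trend_surface(coords, z, p):
    X = trend_surface_design(coords[:, 0], coords[:, 1], p)
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    sigma2 = resid @ resid / (len(z) - X.shape[1])   # assumes full column rank
    return X, beta, sigma2

def predict_with_mse(coords, z, p, s0):
    X, beta, sigma2 = fit_trend_surface(coords, z, p)
    x0 = trend_surface_design(np.array([s0[0]]), np.array([s0[1]]), p)[0]
    z0_hat = x0 @ beta
    mse = sigma2 * (1.0 + x0 @ np.linalg.inv(X.T @ X) @ x0)   # MSE formula above
    return z0_hat, mse

# hypothetical example: 30 random locations on the unit square, quadratic surface
rng = np.random.default_rng(1)
coords = rng.uniform(0, 1, size=(30, 2))
z = 1 + coords[:, 0] + 0.5 * coords[:, 1]**2 + rng.normal(0, 0.1, 30)
print(predict_with_mse(coords, z, p=2, s0=(0.5, 0.5)))
```

The nested comprehension produces (p + 1)(p + 2)/2 columns, matching the coefficient count given in the next paragraph.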
The number of regression coefficients in a trend surface model increases
quickly with the degree of the polynomial; β is a vector of length (p + 1)(p +
2)/2. To examine whether the mean function has been made sufficiently flex-
ible, the residuals from the fit can be examined for residual spatial autocor-
relation.
In the model

    Z(s) = X(s)β + ε

it is of interest to test H0: Var[ε] ∝ I. Since ε is unobservable, we draw instead
on

    ε̂ = Z(s) − X(s)(X(s)′X(s))−1 X(s)′Z(s) = MZ(s).

If Var[ε] = σ²I, then Var[ε̂] = σ²M. When the data are on a lattice, a test
for autocorrelation can be carried out based on Moran's I statistic for regres-
sion residuals ((1.16), p. 23 and §6.1.2). For geostatistical data, the empirical
semivariogram of the ordinary least squares residuals can be examined, al-
though this will yield a biased estimator of the semivariogram of ε. The bias
is typically negligible for small lags. More details on residual analysis in mod-
els with uncorrelated errors and models with correlated errors as well as the
implications for semivariogram estimation are forthcoming in Chapter 6.
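A residual-based Moran's I computation can be sketched as follows, assuming a binary rook-contiguity weight matrix on a regular grid. This computes only the statistic I = (n/S0) · (ε̂′Wε̂)/(ε̂′ε̂) from the OLS residuals; the residual-adjusted null moments used for the formal test referenced above are not reproduced here, and the function names are illustrative.

```python
import numpy as np

def rook_weights(nrow, ncol):
    """Binary rook-contiguity weight matrix for an nrow x ncol grid."""
    n = nrow * ncol
    W = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrow and 0 <= cc < ncol:
                    W[i, rr * ncol + cc] = 1.0
    return W

def morans_i_residuals(X, z, W):
    """Moran's I computed from the OLS residuals e_hat = M z."""
    M = np.eye(len(z)) - X @ np.linalg.inv(X.T @ X) @ X.T
    e = M @ z
    return (len(z) / W.sum()) * (e @ W @ e) / (e @ e)
```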
Judging fitted residuals based on autocorrelation statistics such as Moran’s
I, or by interpreting the empirical semivariogram can lead to very different
conclusions about the degree of smoothness. Rook, queen, and bishop defini-
tions of spatial connectivity only examine immediate neighbors on the lattice
and thus focus on very small-scale, localized variation. The empirical semi-
variogram displays information on numerous lag classes and gives a picture
of the small-, medium-, and long-range spatial variation. We examine the two
approaches with a simulated data example.

Example 5.6 Figure 5.3a displays data simulated on a 20 × 20 grid with
exponential covariance function, range 9, nugget 0.25, and sill 0.75. The spatial
autocorrelations in this process are not very pronounced due to the medium
range, the presence of a nugget, and the comparatively low continuity of the
exponential covariance function.
A test of autocorrelation using Moran’s I indicates spatial dependence in
these data (Table 5.2, p = 0). The “beyond-independent-data” dependence in
the residuals is gradually decreased as the order of the trend surface increases
(Table 5.2). Although the value of Ires decreases with increasing p, E[Ires ]
depends on X(s) and Ires has the familiar expectation of −1/(n − 1) only
for p = 0. For p = 14, the residual autocorrelation due to the autocorrelation
in the process has been substantially reduced. But note that this results in a
regression model with 120 coefficients. Further increases in the surface degree
lead to models that are hopelessly overfit and ill-conditioned. Figure 5.3b
shows the predicted surface from the polynomial trend surface with p = 14.
A different perspective can be gained by choosing a goodness-of-fit criterion
that penalizes for the number of parameters. The last column of Table 5.2 gives
values of Akaike’s AIC criterion for maximum likelihood estimation (Akaike,
1974). The smallest value is achieved for p = 11, yielding a fit less smooth
than model selection based on Ires .
To determine an appropriate trend surface degree we can also examine the
empirical semivariogram of the ordinary least squares residuals for different
values of p. Since the empirical semivariogram estimator based on fitted resid-
uals is biased, in particular for larger lags, we focus on the behavior near the
Figure 5.3 Trend surface modeling. a: simulated spatial arrangement on a 20 × 20
grid with exponential covariance function, range 9, nugget 0.25, and sill 0.75. Panels
b) and c) show predicted mean values from polynomial trend surfaces with degrees
p = 14 and p = 8.

origin. The semivariogram of the original data shows the spatial structure
clearly as well as the nugget effect (Figure 5.4). A trend surface of fifth de-
gree still exhibits some near-origin spatial structure in the semivariogram. For
p = 8 this structure has disappeared and the empirical semivariogram appears
to decline with increasing lag distance. This apparent decline is a combination
of a biased estimator and a statistical model that overfits the mean function.
Notice that the nugget effect remains present in all semivariograms; it is the
non-spatially structured source of variability in the data.
Based on the results displayed in Figure 5.4, a trend surface model of degree
p = 8 seems adequate. The predicted surface for p = 8 is shown in Figure 5.3c.
It is considerably more smooth than the surface with p = 14. Note that with
the rook definition of spatial neighborhood, the Ires statistic focuses entirely
on the relationships between residuals of nearest neighbors, emphasizing, and
perhaps over-emphasizing, local behavior. A trend surface with p = 11, as
selected based on the AIC criterion, appears to be a good compromise. Note
Table 5.2 Fit statistics for trend surface models of different order fit to data shown
in Figure 5.3a. Zobs and p-values refer to the test of residual spatial autocorrelation
based on the I∗ statistic with a rook definition of spatial connectivity. The second
column gives the number of regression coefficients (intercept included).

p    (p+1)(p+2)/2    σ̂²      Ires      Zobs     p-value      AIC
0      1             0.733    0.473    13.17    < 0.0001     1012.1
1      3             0.662    0.403    11.46    < 0.0001      973.4
2      6             0.643    0.375    10.92    < 0.0001      964.3
3     10             0.579    0.297     9.03    < 0.0001      926.5
4     15             0.552    0.266     8.51    < 0.0001      912.8
5     21             0.489    0.181     6.50    < 0.0001      869.6
6     28             0.443    0.095     4.46    < 0.0001      837.3
7     36             0.428    0.052     3.69      0.0001      830.4
8     45             0.410    0.006     2.84      0.0022      821.3
9     55             0.398   −0.019     2.64      0.0041      817.6
10    66             0.386   −0.046     2.42      0.0077      814.3
11    78             0.360   −0.098     1.49      0.0674      795.7
12    91             0.362   −0.117     1.57      0.0577      808.9
13   105             0.351   −0.147     1.36      0.0871      805.7
14   120             0.347   −0.183     1.05      0.1466      809.6
15   136             0.343   −0.207     1.08      0.1400      813.0

that p = 11 is the largest value in Table 5.2 for which the Ires statistic is not
significant at the 0.05 level.

5.3.2 Localized Estimation

Trend surface models require a large number of regression coefficients to cap-
ture rather simple spatial variation. A high polynomial degree is needed in
order for the predictions to be locally adequate everywhere. The flexibility
of Ẑ(s0) = x(s0)′β̂ is achieved through having many regressors in x. As the
prediction location is changed, the elements of x(s0) change considerably to
produce good predictions although the same vector β̂ is used for predictions.
Could we not achieve the same (or greater) flexibility by keeping the order of
the trend surface low, but allowing β to vary spatially? This naturally leads
to the idea of fitting a model locally, instead of the trend surface model (5.25),
which is a global model. A global model has one set of parameters that apply
everywhere, regardless of spatial location.
If the order of the trend surface is reduced, for example to p = 1, we do
not expect the resulting plane to be a good fit over the entire domain. We
can expect, however, for the plane to fit well at a given point s0 and in its
immediate neighborhood. The idea of localized estimation is to assign weights
Figure 5.4 Empirical Cressie-Hawkins semivariogram estimators for the original
data (p = 0) and the ordinary least squares residuals for different trend surface
degrees. The vertical axis shows the robust empirical semivariogram, the horizontal
axis the lag distance.

to data Z(s1 ), · · · , Z(sn ) that reflect the extent to which the low-rank model
applied at s0 is expected to fit at other locations. The weights control the
influence of observations on the estimation of the model at s0 . The benefits
of a parsimonious model x(s0 )! β for any one site is traded against estimation
of local regression coefficients and the determination of the weight function.
Local estimation in this spirit is often referred to as non-parametric re-
gression. Kernel estimation, for example, moves a window over the data and
estimates the mean at a particular location as the weighted average of the
data points within the window. This window may extend over the entire data
range if the weight function reaches zero only asymptotically, or consist of a
set of nearest neighbors. Local polynomial regression applies the same idea
but fits a polynomial model at each prediction location rather than a constant
mean. An equivalent representation of local estimation—which we prefer—is
in terms of weighted linear regression models.
Assume that a prediction is desired at s0 and that the behavior of the
realized surface near s0 can be expressed as a polynomial of first degree. An
ordinary least squares fit at s0 is obtained by minimizing

    Q(s0, λ) = Σ_{i=1}^{n} W(||si − s0||, λ) (Z(si) − β00 − β01 xi − β02 yi)².

The (kernel) weight function W(||si − s0||, λ) depends on the distance between
observed locations and the prediction location and the smoothing parameter
(bandwidth) λ. Collecting the W(||si − s0||, λ) into a diagonal matrix W(s0, λ),
the objective function can be written more clearly as

    Q(s0, λ) = (Z(s) − X(s)β0)′ W(s0, λ) (Z(s) − X(s)β0).

This is a weighted least squares objective function in the model

    Z(s) = X(s)β0 + e0,    e0 ∼ (0, W(s0, λ)−1).                          (5.26)

The assignment of small weight to Z(sj) compared to Z(si), say, is equivalent
to pretending that the variance of Z(sj) exceeds that of Z(si). If W(||si −
s0||, λ) = 0, then the data point Z(si) can be removed entirely to allow the
inversion in (5.26). This representation of local estimation as a weighted esti-
mation problem is entirely general. Any statistical model that can be written
in terms of means and variances can be localized by this device (see Chapter
problems).
A special case of local polynomial estimation is LOESS regression (Cleve-
land, 1979), where the (tri-cube) weight function achieves exactly zero and
the estimation uses robust re-weighting of residuals. In general, the choice of
the weight function is less important than the choice of the bandwidth λ; a
reasonable choice of W will lead to reasonable results. Common choices in R¹
are the Epanechnikov kernel

    We(d, λ) = (3/(4λ)) (1 − (d/λ)²)   for −λ ≤ d ≤ λ,  and 0 otherwise,

and the Gaussian kernel

    Wg(d, λ) = (1/(λ√(2π))) exp{ −(1/2)(d/λ)² }.

To construct a kernel weight function in Rd, one can either draw on a mul-
tivariate kernel or compute the kernel as the product of d univariate kernel
functions. The two approaches are not identical, of course. For example, a bi-
variate Gaussian kernel function can be constructed from the bivariate Gaus-
sian distribution function, allowing an elliptical contour of the weight function.
The Gaussian product kernel

    Wg(si − sj, λ) = (1/(2πλ²)) exp{ −(1/2)((xi − xj)/λ)² } exp{ −(1/2)((yi − yj)/λ)² }
                   = (1/(2πλ²)) exp{ −||si − sj||² / (2λ²) }

has a common bandwidth for the major axes of the coordinate system and
spherical weight contours.
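A minimal sketch of a locally weighted fit at a single location follows, using the Gaussian product kernel above and a local first-degree polynomial; the bandwidth, coordinates, and data are hypothetical, and no bandwidth selection or edge-bias correction is attempted. The kernel's normalizing constant cancels in weighted least squares and is omitted.

```python
import numpy as np

def gaussian_product_kernel(coords, s0, lam):
    """W_g(s_i - s_0, lambda), up to the constant 1/(2*pi*lambda^2)."""
    d2 = np.sum((coords - s0) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * lam ** 2))

def local_linear_predict(coords, z, s0, lam):
    """Weighted least squares fit of a local plane at s0, in the spirit of (5.26)."""
    w = gaussian_product_kernel(coords, s0, lam)
    X = np.column_stack([np.ones(len(z)), coords[:, 0], coords[:, 1]])
    beta0 = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * z))   # local coefficients
    return np.array([1.0, s0[0], s0[1]]) @ beta0                     # fitted plane at s0

# hypothetical data on a small grid
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.arange(10.0), np.arange(10.0))
coords = np.column_stack([gx.ravel(), gy.ravel()])
z = np.sin(coords[:, 0] / 3.0) + 0.1 * coords[:, 1] + rng.normal(0, 0.1, len(coords))
print(local_linear_predict(coords, z, s0=np.array([4.5, 4.5]), lam=2.0))
```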
The two important choices made in local estimation are the degree of the
local polynomial and the bandwidth. Locally constant means lead to estimates
which suffer from edge bias. For spatial data this is an important consideration
because many data points fall near the bounding box or the convex hull of a
set of points.

5.3.3 Universal Kriging

Suppose we have data Z(s1), · · · , Z(sn) at spatial locations s1, · · · , sn, and we
want to predict Z(s0) at location s0 where we do not have an observation.
Further suppose that the form of the general linear model holds for both the
data and the unobservables:

    Z(s) = X(s)β + e(s),
    Z(s0) = x(s0)′β + e(s0),

where x(s0) is the (p × 1) vector of explanatory values associated with lo-
cation s0. As before, we assume a general variance-covariance matrix for the
data, Var[Z(s)] = Σ, but we also assume the data and the unobservables
are spatially correlated, so that Cov[Z(s), Z(s0)] = σ, an (n × 1) vector, and
Var[Z(s0)] = σ0.
The goal is to find the optimal linear predictor, one that is unbiased and has
minimum variance in the class of linear, unbiased predictors. Thus, we consider
predictors of the form a′Z(s), and find the vector a so that a′Z(s) is the best
linear unbiased predictor of Z(s0). Statistically, this problem becomes: find
the vector a that minimizes the mean-squared prediction error

    E[(a′Z(s) − Z(s0))²] = Var[a′Z(s)] + Var[Z(s0)] − 2Cov[a′Z(s), Z(s0)]
                         = a′Σa + σ0 − 2a′σ,                              (5.27)

subject to E[a′Z(s)] = E[Z(s0)]. This unbiasedness condition implies

    a′X(s)β = x(s0)′β    ∀ β,

which gives a′X(s) = x(s0)′.
To minimize this function subject to the constraint, we use the method of
Lagrange multipliers. The Lagrangian is

    L = a′Σa + σ0 − 2a′σ + 2m′(X(s)′a − x(s0)),

where m is a p × 1 vector of Lagrange multipliers. Differentiating with respect
to a and m gives

    ∂L/∂a = 2Σa − 2σ + 2X(s)m
    ∂L/∂m = 2(X(s)′a − x(s0)).
Equating each term to zero and rearranging gives

    Σa + X(s)m = σ
    X(s)′a = x(s0).

Solving these equations for a we obtain:

    a = Σ−1( σ − X(s)(X(s)′Σ−1X(s))−1(X(s)′Σ−1σ − x(s0)) )                (5.28)
      = Σ⁻X σ + Σ−1X(s)(X(s)′Σ−1X(s))−1x(s0)

with

    Σ⁻X = Σ−1 − Σ−1X(s)(X(s)′Σ−1X(s))−1X(s)′Σ−1.

Note that (X(s)′Σ−1X(s))−1X(s)′Σ−1Z(s) = β̂gls, and the best linear unbi-
ased predictor of Z(s0) can be written as

    a′Z(s) = puk(Z; s0) = x(s0)′β̂gls + σ′Σ−1(Z(s) − X(s)β̂gls).            (5.29)

We use the subscript "uk" to denote this as the universal kriging predictor,
to distinguish it from the simple and ordinary kriging predictors of §5.2.1
and §5.2.2. Obviously, if we assume a constant mean E[Z(s)] = 1µ, then the
universal kriging predictor reduces to the ordinary kriging predictor (compare
to equations (5.17) and (5.18))

    pok(Z; s0) = 1β̂gls + σ′Σ−1(Z(s) − 1β̂gls),

with

    β̂gls = (1′Σ−11)−1 1′Σ−1Z(s).

The minimized mean-squared prediction error provides a measure of un-
certainty associated with the universal kriging predictor. This is obtained by
substituting the optimal weights, a, given in equation (5.28), into equation
(5.27), yielding the kriging variance

    σ²uk(s0) = a′Σa + σ0 − 2a′σ
             = σ0 − σ′Σ−1σ
               + (x(s0)′ − σ′Σ−1X(s)) (X(s)′Σ−1X(s))−1
               × (x(s0)′ − σ′Σ−1X(s))′.                                    (5.30)

As with ordinary kriging, universal kriging can also be done in terms of the
semivariogram (see Cressie, 1993, pp. 153–154).
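Expressed in code, the universal kriging predictor (5.29) and variance (5.30) might look as follows. This Python/NumPy sketch takes Σ, σ, σ0, X(s), and x(s0) as given, i.e., the covariance parameters are treated as known, which is rarely the case in practice (see §5.5); the function name and argument layout are illustrative.

```python
import numpy as np

def universal_krige(Sigma, sigma, sigma0, X, x0, z):
    """Universal kriging predictor (5.29) and kriging variance (5.30)."""
    Sigma_inv = np.linalg.inv(Sigma)
    XtSi = X.T @ Sigma_inv
    # GLS estimate of the trend coefficients
    beta_gls = np.linalg.solve(XtSi @ X, XtSi @ z)
    # predictor: trend at s0 plus kriged GLS residual
    p_uk = x0 @ beta_gls + sigma @ Sigma_inv @ (z - X @ beta_gls)
    # kriging variance, eq. (5.30)
    b = x0 - XtSi @ sigma                      # (x(s0)' - sigma' Sigma^{-1} X(s))'
    var_uk = (sigma0 - sigma @ Sigma_inv @ sigma
              + b @ np.linalg.solve(XtSi @ X, b))
    return p_uk, var_uk
```

With X a single column of ones and x0 = [1], this reduces to the ordinary kriging predictor given above.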
To predict r variables, Z(s0), simultaneously, we extend the model above
to

    E[ Z(s)  ]   [ X(s)β  ]
     [ Z(s0) ] = [ X(s0)β ]                                                (5.31)

    Var[ Z(s)  ]   [ ΣZZ   ΣZ0 ]
       [ Z(s0) ] = [ Σ0Z   Σ00 ].                                          (5.32)

Here, ΣZZ is the n × n variance-covariance matrix of the data, Σ00 is the r × r
variance-covariance matrix among the unobservables, and Σ0Z is the r × n
variance-covariance matrix between the data and the unobservables. With this
model, the best linear unbiased predictor (BLUP) (Goldberger, 1962; Gotway
and Cressie, 1993) is

    Ẑ(s0) = X(s0)β̂gls + Σ0Z ΣZZ−1 (Z(s) − X(s)β̂gls),                      (5.33)

and the associated mean-squared prediction error is given by

    E[(Ẑ(s0) − Z(s0))²] = Σ00 − Σ0Z ΣZZ−1 ΣZ0
                          + (X(s0) − Σ0Z ΣZZ−1 X(s)) (X(s)′ΣZZ−1 X(s))−1
                          × (X(s0) − Σ0Z ΣZZ−1 X(s))′.                     (5.34)

Equations (5.33) and (5.34) are obvious extensions of (5.29) and (5.30) to
the multi-predictor case.

5.4 Kriging in Practice

5.4.1 On the Uniqueness of the Decomposition

For the purpose of predictions, we can model the spatial variation entirely
through the covariates, entirely as small-scale variation characterized by the
semivariogram or Σ(θ), or through some combination of covariates and resid-
ual spatial autocorrelation. Thus, the decomposition of the data into covari-
ates plus spatially correlated error as depicted through equation (5.24) is not
unique. However, our choice impacts both the interpretation of our model and
the magnitude of the prediction standard errors.
For example, suppose we accidentally left out an important spatially-varying
covariate (say xp+1 ) when we defined X(s). If we do a good job of fitting both
models, the model omitting xp+1 may fit as well as the model including xp+1 .
So we could have two competing models defined by parameters (β 1 , e(s)1 )
and (β2, e(s)2) with comparable fit. If X(s)1β1 ≠ X(s)2β2, then the inter-
pretations in the two models could be very different, although both models
are valid representations of the spatial variation in the data. The predicted
surfaces based on these two models will be similar, but the standard errors
and the interpretation of covariates effects will be substantially different (see
e.g., Cressie, 1993, pp. 212–224, and Gotway and Hergert, 1997).
As you will see in §5.5, the question of how to estimate the unknown parame-
ters of the spatial correlation structure—when the mean is spatially varying—
is an important aspect of spatial prediction. If the mean is constant, then the
techniques of §4.4 and §4.5 can be applied to obtain estimators of the co-
variance and/or semivariogram parameters. It is tempting from this vantage
point to adopt an “ordinary-kriging-at-all-cost” attitude and to model spatial
variation entirely through the small-scale variation. For example, because the
semivariogram filters the (unknown but) constant mean, not knowing µ is
of no consequence in semivariogram estimation. An incorrect assumption of a
constant large-scale mean can be dangerous for your spatial analysis, however.
Schabenberger and Pierce (2002, p. 614) give the following example, where
data are generated on a transect according to the deterministic functions

    Z1(t) = 1 + 0.5t
    Z2(t) = 1 + 0.22t + 0.022t² − 0.0013t³.

Note that "data" so generated are deterministic; there is no random variation.
If one computes the Matheron estimator of the empirical semivariogram from
these data, the graphs in Figure 5.5 result. A power semivariogram model was
fit to the empirical semivariogram in the left-hand panel. A gaussian semivar-
iogram fits the empirical semivariogram in the right-hand panel well. Not
accounting for the large-scale structure may lead you to attribute determinis-
tic spatial variation—because the large-scale trend is non-random—to random
sources. The spatial “dependency” one is inclined to infer from Figure 5.5 is
entirely spurious.
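This effect is easy to reproduce. The sketch below generates the two deterministic transects and computes the Matheron (method-of-moments) estimator, γ̂(h) = (1/(2|N(h)|)) Σ_{N(h)} (Z(ti) − Z(tj))², for integer lags. The transect length of 100 unit-spaced points is an assumption for illustration, not taken from the figure.

```python
import numpy as np

def matheron_semivariogram(z, max_lag):
    """Method-of-moments semivariogram estimator for equally spaced transect data."""
    gamma = np.empty(max_lag)
    for h in range(1, max_lag + 1):
        diffs = z[h:] - z[:-h]                    # all pairs separated by lag h
        gamma[h - 1] = 0.5 * np.mean(diffs ** 2)
    return gamma

t = np.arange(100.0)                               # assumed transect of 100 points
z1 = 1 + 0.5 * t                                   # linear trend
z2 = 1 + 0.22 * t + 0.022 * t**2 - 0.0013 * t**3   # cubic trend
print(matheron_semivariogram(z1, 15))
print(matheron_semivariogram(z2, 15))
# Any structure seen in these estimates is induced by the deterministic trend;
# the data contain no random component at all.
```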

Figure 5.5 Empirical semivariograms (dots) and fitted models for data from deter-
ministic trend. Left panel is for Z1 (t), right panel is for Z2 (t). From Schabenberger
and Pierce (2002, p. 614).

5.4.2 Local Versus Global Kriging

Kriging is a statistical method for interpolation, very similar to interpolating
splines and inverse-distance squared interpolation. The most common appli-
cation of kriging is the construction of smooth contour maps for visualization.
In this type of application, variables are predicted on a regular grid—hence
this process is often called gridding—and the values are then displayed using a
surface or contour map. Thus, since many predictions are required, kriging is
usually done using only some of the “neighboring” data in order to speed the
computations. The search neighborhood is usually taken to be a circle cen-
tered on the grid node unless the study domain is elongated (e.g., elliptical)
or the spatial process is anisotropic. If there are many points in the neigh-
borhood, we may further restrict the calculations and use just the closest m
points. (In many computer programs, defaults for m vary from 6 to 24). These
parameters (search neighborhood size, shape, and number of points used for
interpolation) can affect the nature of the interpolated surface.
The use of local search neighborhoods can result in great computational
savings when working with large data sets. However, the search neighborhood
should be selected with care as it will affect the characteristics of the predicted
surface. Global kriging using all the data produces a relatively smooth surface.
Small search neighborhoods with few data points for prediction will produce
surfaces that show more detail, but this detail may be misleading if predictions
are unstable. In general, at least 6–12 points should be used for each predicted
value; kriging with more than 25 points is often unnecessary. If the data are
sparse, or the relative nugget effect is large, distant points may have important
information and the search neighborhood should be increased to include them.
When the data are evenly distributed within the domain, a simple search that
defines the neighborhood by the closest points is usually adequate. However,
when observations are clustered, very irregularly spaced, or located on widely-
spaced transects, quadrant or octant searches can be useful. Here, one divides
the neighborhood around each target node into quadrants or octants and uses
the nearest 2–3 points from each quadrant or octant in the interpolation. This
ensures neighbors from several different directions, and not just the closest
points, will be used for prediction. It is also possible to change the search
strategy node by node, thus using an adaptive search technique.
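The quadrant search just described can be sketched as follows: the function returns the indices of up to m nearest neighbors per quadrant around a target node, within a given search radius. Function and parameter names are illustrative only.

```python
import numpy as np

def quadrant_search(coords, s0, radius, m_per_quadrant=3):
    """Indices of up to m nearest points per quadrant around s0, within radius."""
    offsets = coords - s0
    d = np.linalg.norm(offsets, axis=1)
    # quadrant label 0..3 from the signs of the coordinate offsets
    quad = (offsets[:, 0] >= 0).astype(int) + 2 * (offsets[:, 1] >= 0).astype(int)
    selected = []
    for q in range(4):
        idx = np.where((quad == q) & (d <= radius) & (d > 0))[0]
        idx = idx[np.argsort(d[idx])][:m_per_quadrant]   # closest points first
        selected.extend(idx.tolist())
    return np.array(selected, dtype=int)

# usage sketch: krige at each grid node using only the selected neighbors,
# e.g., with the ok_weights sketch given earlier:
# nbrs = quadrant_search(coords, node, radius=range_estimate, m_per_quadrant=5)
# lam, ok_var = ok_weights(coords[nbrs], node, gamma_fn)
```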
Kriging with a local neighborhood as just described is not entirely a method
of local estimation in the sense that the local models in §5.3.2 were estimat-
ing all parameters of a linear model locally. In the local kriging approach
the parameters of the spatial dependency structure are not spatially varying;
the semivariogram is not re-estimated within each local neighborhood. When
working with covariates and trend surfaces, universal kriging is also often
done globally, since the model assumes that the relationship between the data
and the covariates holds across the entire domain. If the domain is very large
(e.g., the entire United States) this assumption may not be reasonable. In
such cases, the covariance function and any trend surface functions or covari-
ates can be determined and estimated within local moving windows and then
kriging (either ordinary or universal) can also be done locally within these
moving windows (see e.g., Haas, 1990). This is one approach to modeling a
non-stationary covariance function; see Chapter 8.
Example 4.2 (C/N ratios. Continued) For the C/N ratio data, first intro-
duced in §4.4.1 on page 154, we modeled an exponential semivariogram para-
metrically in §4.5 and presented estimates for nugget and no-nugget models
in Table 4.2 on page 176. Now we compute ordinary kriging predictions on
a regular 10 feet × 10 feet grid of points for the REML and OLS parameter
estimates. To solve the kriging problem for the 1560 prediction locations ef-
ficiently, we define a local quadrant search neighborhood by dividing a circle
with radius equal to the range of the semivariogram into 4 sections and choose
the 5 points nearest to the prediction location in each quadrant.
Figure 5.6 displays the predictions obtained using the OLS and REML es-
timates of the exponential semivariogram model with and without the nugget
effect. Also shown (in the third panel from the top) are the predictions ob-
tained using the REML estimates in the nugget model but using a search
radius equal to that of the range estimated using OLS.

[Figure 5.6 shows four panels of predicted C/N ratios, one for each fitted model:
REML with nugget (c0 = 0.12, c1 = 0.22, range = 171.1); REML without nugget
(c0 = 0.0, c1 = 0.32, range = 41.6); REML with nugget and search radius 85.3;
and OLS with nugget (c0 = 0.11, c1 = 0.17, range = 85.3). The legend shows
predicted C/N ratio classes ranging from about 9.4 to 12.3.]

Figure 5.6 Ordinary kriging predictions for C/N ratio data.
The predictions from the nugget models are generally similar, showing an
area of high ratios in the south-east portion of the field and various pockets
of low C/N ratios. The predictions based on the OLS-fitted semivariogram
model shown in the bottom panel are less smooth than those obtained using
the corresponding REML-fitted semivariogram model. The estimates of the
relative structured variability are about the same (RSVols = 39%, RSVreml =
35%), but the smaller OLS estimate of the range (85.3 versus 171.1) creates
a process with less continuity. You can glean this from the more “frizzled”
contour edges in the lower panel. Overall, the predictions based on the two
sets of estimates are close (Figure 5.7).

[Figure 5.7 is a scatter plot of predictions from the REML nugget model against
those from the OLS nugget model, with both axes ranging from 9.50 to 12.50.]

Figure 5.7 Comparison of predictions in nugget models based on REML and OLS
estimates.

When using statistical software to perform these (any) calculations it is
important to be familiar with the software features and defaults. For exam-
ple, whether local or global kriging predictions are calculated, how kriging
ple, whether local or global kriging predictions are calculated, how kriging
neighborhoods are determined, etc. When the kriging neighborhood depends
on the semivariogram parameters, as is the case here, then changes in the
parameter estimates can have spurious effects on the predicted surface and
the prediction standard errors. The comparison between the top and bottom
panels in Figure 5.6 is possibly not only affected by the large REML estimate
of the range, but also by a local search radius that is twice as large. While
this effect appears to be minor in this application (comparing the first and
third panel in Figure 5.6), it can have a larger impact in others.
The least smooth predictions are those from the REML no-nugget model.
This may be surprising at first, since adding a nugget effect reduces the con-
tinuity of the process. However, the lack of a nugget effect is more than offset
in this case by a small range. Recall that earlier we rejected the hypothesis of
a zero nugget effect based on the REML analysis (see page 177). It would be
incorrect to attribute the more erratic appearance of the predicted C/N ratios
in the second panel to an analysis that reveals more “detail” about the C/N
surface and so would be preferable on those grounds. Since statistical infer-
ence is conditional on the selected model, the predictions in the second panel
must be dismissed if we accept the necessity of a nugget effect, regardless of
how informative the resulting map appears to be.
Figure 5.8 displays contour maps of the standard errors corresponding to
the kriging predictions in Figure 5.6. The standard errors are small near the
location of the observed data (compare to Figure 4.6 on page 156). At the data
locations the standard errors are exactly zero, since the predictions honor the
data. The standard error maps basically trace the observed locations.

5.4.3 Filtering and Smoothing

The universal kriging predictor “honors” the data. The predicted surface
passes through the data points, i.e., the predicted values at locations where
data are measured are identical to the observed values. Thus, while kriging
produces a smooth surface, it does not smooth the data like least squares or
loess regression. In some applications such smoothing may be desirable, how-
ever. For example, when the data are measured with error, it would be better
to predict a less noisy version of the data that removes the measurement error
instead of requiring the prediction surface to pass through the noisy data.

Example 5.7 Particle concentrations are measured daily at several moni-
toring sites throughout a state. Based on these measurements, a daily map
of state-wide particle concentrations is produced. Which value should be dis-
played on the map for the monitoring sites? If the particle concentrations were
measured without error, then the recorded values represent the concentrations
actually present. If the measurements are contaminated with error, one would
be interested not in the amounts that had been measured, but rather in the
actual concentrations.

Following the ideas in Cressie (1993), suppose we really want to make infer-
ences about a spatial process, S(s), but instead can only measure the process
Z(s), where

    Z(s) = S(s) + ε(s),    s ∈ D,

with E[ε(s)] = 0, Var[ε(s)] = σ²ε, Cov[ε(si), ε(sj)] = 0 for all i ≠ j, and S(s)
and ε(s) are independent. Further suppose that S(s) can be described with
[Figure 5.8 shows four panels of prediction standard errors, one for each model in
Figure 5.6: REML with nugget (c0 = 0.12, c1 = 0.22, range = 171.1); REML without
nugget (c0 = 0.0, c1 = 0.32, range = 41.6); REML with nugget and search radius
85.3; and OLS with nugget (c0 = 0.11, c1 = 0.17, range = 85.3). The legend classes
range from 0 to about 0.62.]

Figure 5.8 Standard error maps corresponding to kriging predictions in Figure 5.6.
the general linear model with autocorrelated errors

    S(s) = x(s)′β + υ(s),

where E[υ(s)] = 0, and Cov[υ(si), υ(sj)] = Cov[S(si), S(sj)] = CS(si, sj).
Thus,

    Z(s) = x(s)′β + υ(s) + ε(s),

and E[Z(s)] = x(s)′β. In terms of the operational decomposition given in
equation (2.10) on page 54, this is a signal model and S(s) corresponds to
a random process comprised of µ(s) + W(s) + η(s). The best linear unbi-
ased predictor of S(s) remains of the form λ∗′Z(s), where the weights, {λ∗i},
are derived by minimizing E[(λ∗′Z(s) − S(s0))²] subject to the unbiasedness
criterion E[λ∗′Z(s)] = E[S(s0)].
If we first assume that E[Z(s)] = x(s)′β ≡ µ, the ordinary kriging predictor
of S(s0) is derived in a manner analogous to that given in §5.2.2. Using the
expressions in terms of the semivariogram, this predictor is

    pofk(Z; S(s0)) = λ∗′Z(s)

with optimal weights satisfying (Cressie, 1993, p. 128)

    λ∗′ = { γ∗(s0) + 1 [ (1 − 1′Γ−1γ∗(s0)) / (1′Γ−1 1) ] }′ Γ−1.

The matrix Γ is the same as in the ordinary kriging equations (5.19)–(5.22),
with elements γZ(si − sj). The vector γ∗(s0) = [γ∗(s0 − s1), · · · , γ∗(s0 − sn)]′
is slightly different since, in the minimization, its elements are derived from
the semivariance (1/2)E[(Z(s0) − Z(si))²] = γS(s0 − si) + σ²ε ≡ γ∗(s0 − si). At
prediction locations, s0 ≠ si, and γ∗(s0 − si) = γZ(s0 − si), i = 1, · · · , n. At
data locations, s0 = si, and γ∗(s0 − si) = σ²ε.
The minimized mean-squared prediction error, the filtered kriging variance,
is given by

    σ²ofk(s0) = λ∗′γ∗(s0) + m − σ²ε,

where m is the Lagrange multiplier from the constrained minimization. Note
that this is not equal to the ordinary kriging variance, σ²ok(s0), defined in
equation (5.22), unless σ²ε = 0. Prediction standard errors associated with
filtered kriging are smaller than those associated with ordinary kriging (except
at data locations) since S(·) is less variable than Z(·).
If the mean is spatially varying, i.e., if we consider the more general case
where E[Z(s)] = x(s)′β, a derivation analogous to that for universal kriging
(in terms of the covariance matrix) gives the optimal weights as solutions to

    Σa + X(s)m = σ∗
    X(s)′a = x(s0),

where σ∗ = Cov[Z(s), S(s0)] and thus has elements CS(si, s0). However, we
cannot estimate CS(si, sj) directly; we obtain this quantity only through
CZ(si, sj), which is equal to

    CZ(si, sj) = Cov[Z(si), Z(sj)] = CS(si, sj) + σ²ε  if si = sj,  and  CS(si, sj)  if si ≠ sj.

Thus, at prediction locations, where s0 differs from every si, CZ(si, s0) =
CS(si, s0), and σ∗ = σ. However, at data locations, where s0 coincides with
one of the si, CZ(si, s0) = CS(si, s0) + σ²ε, so we use σ∗ = CZ(si, s0) − σ²ε.
The modification to the kriging predictor comes into play when we want
to predict at locations where we already have data. This is called filtering,
since we are removing the error from our observed data through a prediction
process. It is also smoothing, since the filtered kriging predictor smooths the
data, with larger values of σ²ε resulting in more smoothing.
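The following sketch contrasts ordinary kriging of Z(s0) with filtered kriging of the signal S(s0) at a data location, using the covariance form of the ordinary kriging system: at a data location the right-hand-side covariance is reduced by the measurement-error (nugget) variance, exactly as described above. The function names are illustrative; the four observations and the model B parameters (partial sill 0.5, nugget 0.5, practical range 8.5) are taken from Example 5.8 below, the entire nugget is assumed to be measurement error, and the kriging variance is omitted for brevity.

```python
import numpy as np

def exp_cov(h, psill, prac_range):
    """Exponential covariance of the signal S: psill * exp(-3h/range)."""
    return psill * np.exp(-3.0 * h / prac_range)

def ok_predict(coords, z, s0, psill, prac_range, nugget, filter_error=False):
    """Ordinary kriging of Z(s0), or of the signal S(s0) if filter_error=True."""
    n = len(z)
    D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    Sigma = exp_cov(D, psill, prac_range) + nugget * np.eye(n)   # Var[Z(s)]
    d0 = np.linalg.norm(coords - s0, axis=1)
    c0 = exp_cov(d0, psill, prac_range)                          # Cov[Z(si), S(s0)]
    if not filter_error:
        # predicting Z(s0): add the nugget back where s0 coincides exactly with a datum
        c0 = np.where(d0 == 0, c0 + nugget, c0)
    # ordinary kriging system with the sum-to-one constraint
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = Sigma
    A[:n, n] = 1.0
    A[n, :n] = 1.0
    lam = np.linalg.solve(A, np.append(c0, 1.0))[:n]
    return lam @ z, lam

# four observations and a prediction at an observed location (Example 5.8, model B)
coords = np.array([[0., 0.], [4., 0.], [0., 8.], [4., 8.]])
z = np.array([5., 10., 15., 6.])
s0 = np.array([4., 0.])
for filt in (False, True):
    print(ok_predict(coords, z, s0, psill=0.5, prac_range=8.5, nugget=0.5,
                     filter_error=filt))
```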
Historically, the different terms (filtering and smoothing) arise from the time
series literature which identifies three distinct types of prediction problems.
The first is forecasting: prediction of future data. The second is filtering:
prediction of data at the present time. The third is smoothing: prediction of
past data. In spatial statistics, we cannot distinguish past, present, and future
and so we only have two types of prediction problems: prediction of data at
new locations where no measurements were made, and prediction of data
at observed locations where measurements were recorded. Thus, in spatial
statistics, filtering and smoothing refer to the same prediction problem.

Example 5.8 Schabenberger and Pierce (2002, Ch. A9.9.5) give a small ex-
ample to demonstrate the effects of measurement error on kriging predictions
and their standard errors, which we adapt here. Figure 5.9 shows four observed locations on a grid, s1 = [0, 0]', s2 = [4, 0]', s3 = [0, 8]', and s4 = [4, 8]'. The observed values are Z(s1) = 5, Z(s2) = 10, Z(s3) = 15, and Z(s4) = 6, with arithmetic average Z̄ = 9. The locations s0(1)–s0(4) are prediction locations. Notice that one of the prediction locations is also an observed location, s0(3) = s2.

Figure 5.9 Observed locations (crosses) and prediction locations (dots). Adapted from Schabenberger and Pierce (2002).

Predictions at the four locations are obtained for two exponential semivari-
ogram models with practical range 8.5. Model A has a sill of 1.0 and no nugget
effect. Model B has a nugget effect of 0.5 and a partial sill of 0.5. We assume
that the entire nugget effect is due to measurement error.
In the absence of a nugget effect the predictions of Z(s0 ) and S(s0 ) agree
in value and precision (Table 5.3). In the presence of a nugget effect (model
B), predictions of Z(s0 ) and S(s0 ) agree in value but predictions of the signal
are more precise than those of the error-contaminated process. The difference
between the two kriging variances is the magnitude of the nugget effect.
At the observed location s2 = s0 (3) , the kriging predictor under model A
honors the data. Notice that the kriging weights are zero except for λ2 =
1. Since the predicted value is identical to the observed value, the kriging
variance is zero. In the presence of a nugget effect (model B) the predictor of
the signal does not reproduce the observed value and the kriging variance is
not zero.

Table 5.3 Kriging predictions of Z(s0 ) and S(s0 ) under semivariogram models A
and B. Adapted from Schabenberger and Pierce (2002, Ch. 9.9.5).

s0       Model   Target    Prediction Value   Prediction Variance   λ1     λ2     λ3     λ4
[2, 4] A Z(s0 ) 9.0 0.92 0.25 0.25 0.25 0.25
[2, 4] A S(s0 ) 9.0 0.92 0.25 0.25 0.25 0.25
[2, 4] B Z(s0 ) 9.0 1.09 0.25 0.25 0.25 0.25
[2, 4] B S(s0 ) 9.0 0.59 0.25 0.25 0.25 0.25
[3, 2] A Z(s0 ) 8.77 0.78 0.25 0.48 0.12 0.15
[3, 2] A S(s0 ) 8.77 0.78 0.25 0.48 0.12 0.15
[3, 2] B Z(s0 ) 8.83 1.04 0.26 0.36 0.18 0.20
[3, 2] B S(s0 ) 8.83 0.54 0.26 0.36 0.18 0.20
[4, 0] A Z(s0 ) 10.0 0. 0 1 0 0
[4, 0] A S(s0 ) 10.0 0. 0 1 0 0
[4, 0] B Z(s0 ) 10.0 0. 0 1 0 0
[4, 0] B S(s0 ) 9.25 0.30 0.17 0.60 0.11 0.12
[5, −2] A Z(s0 ) 9.26 0.88 0.17 0.57 0.13 0.13
[5, −2] A S(s0 ) 9.26 0.88 0.17 0.57 0.13 0.13
[5, −2] B Z(s0 ) 9.03 1.09 0.23 0.40 0.18 0.19
[5, −2] B S(s0 ) 9.03 0.59 0.23 0.40 0.18 0.19
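The calculations in this example are easy to reproduce numerically. The following minimal sketch (in Python, with illustrative function and variable names; the practical-range form C_S(h) = c_e exp{−3h/8.5} is assumed for the spatial component) sets up the ordinary kriging system for the error-contaminated process Z and for the signal S, so that the model B rows of Table 5.3 can be checked up to rounding.

    import numpy as np

    # Sketch of Example 5.8: ordinary kriging of Z(s0) vs. filtered kriging of S(s0)
    # for four observations; exponential structure with practical range 8.5.
    coords = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 8.0], [4.0, 8.0]])
    z = np.array([5.0, 10.0, 15.0, 6.0])

    def exp_cov(h, psill, prange=8.5):
        # spatial (signal) covariance C_S(h) in the practical-range form
        return psill * np.exp(-3.0 * h / prange)

    def krige(s0, nugget, psill, target="Z"):
        n = len(z)
        h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
        h0 = np.linalg.norm(coords - s0, axis=1)
        Sigma = exp_cov(h, psill) + nugget * np.eye(n)       # Var[Z(s)]
        sigma0 = exp_cov(h0, psill)                          # Cov[Z(s_i), S(s0)]
        if target == "Z":
            sigma0 = sigma0 + nugget * np.isclose(h0, 0.0)   # add nugget at coinciding locations
        # ordinary kriging system with the unbiasedness constraint
        A = np.zeros((n + 1, n + 1))
        A[:n, :n], A[:n, n], A[n, :n] = Sigma, 1.0, 1.0
        b = np.append(sigma0, 1.0)
        sol = np.linalg.solve(A, b)
        lam, m = sol[:n], sol[n]
        c0 = psill + (nugget if target == "Z" else 0.0)      # variance of the target at s0
        return lam @ z, c0 - lam @ sigma0 - m, lam

    # Model B (sill 1.0 = nugget 0.5 + partial sill 0.5) at the observed location s2 = [4, 0]
    print(krige(np.array([4.0, 0.0]), nugget=0.5, psill=0.5, target="Z"))  # honors the data
    print(krige(np.array([4.0, 0.0]), nugget=0.5, psill=0.5, target="S"))  # filtered prediction

Under model A (no nugget) both calls return identical results, which is the sense in which the predictions of Z(s0) and S(s0) agree in value and precision.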

Example 4.2 (C/N ratios. Continued) For the exponential semivariogram


model with a nugget effect fitted using REML (Table 4.2 on page 176), we
computed the filtered (ordinary) kriging predictions on the same grid as in
Figure 5.6, assuming that the entire nugget effect is due to measurement er-
ror. The map of filtered predictions in Figure 5.10 is practically identical to
the unfiltered predictions (Figure 5.6 on page 246) because only very few
grid points are also observed locations. The prediction standard errors no longer drop to zero at the observed locations, because there the prediction variance of the signal is nonzero. At unobserved locations, the prediction standard errors
are smaller compared to those obtained without filtering of the Z(s) process.
A scatterplot of the standard errors of the filtered and ordinary kriging pre-
dictions emphasizes these findings (Figure 5.11). Ordinary kriging standard
errors are zero at the data locations. At unobserved locations they are larger
than the standard errors from filtered kriging.

Figure 5.10 Prediction and standard error maps for filtered kriging (classed maps of the predicted C/N ratio and of the prediction standard errors). Compare to top panels in Figures 5.6 and 5.8.

Figure 5.11 Prediction standard errors for filtered and ordinary kriging (standard errors from filtered OK on the horizontal axis, standard errors from OK with nugget on the vertical axis).

We may not always be concerned solely with spatial prediction. We may


also want to estimate β and make inferences about the effect of the covariates
on our outcome of interest. To use these methods, we often specify the form of
Var[Z(s)] and model Var[Z(s)] parametrically in order to reduce the number of
parameters. In this case, Var[Z(s)] = σ²_ε I + Σ_S, thereby focusing modeling on the elements of Σ_S. For example, suppose we assume that S(s) is an isotropic second-order stationary process with exponential covariance function
C_S(h; θ_S) = c_0 + c_e            for h = 0,
C_S(h; θ_S) = c_e exp{−h/a_e}      for h > 0.
Then Var[Z(s)] = Σ(θ) = σ²_ε I + Σ_S(θ_S), with θ = [σ²_ε, c_0, c_e, a_e]'. As a result, we assume Z(s) is a spatial process whose behavior at the origin is due to both micro-scale variation characterized by c_0, and to measurement error characterized by σ²_ε, i.e., Var[Z(si)] = σ²_ε + C_S(0; θ_S) = σ²_ε + c_0 + c_e. We then estimate β and θ using the methods described in §5.5.1–§5.5.2. It will probably be very difficult to separate all three variance components σ²_ε, c_0, and c_e using just a single realization of Z(s). Thus, we either need more information, such as an estimate of σ²_ε from repeated measurements at the same location, or we need to make some assumptions concerning the amount of the overall nugget effect that is due to measurement error and the amount that is due to micro-scale variation. For example, if we assume the overall nugget effect, c_0^Z, is comprised of 75% measurement error and 25% micro-scale variation, then σ²_ε = 0.75 c_0^Z and c_0 = 0.25 c_0^Z. In this case, only one parameter needs to be estimated, c_0^Z.
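As a small illustration of this parameterization, the following sketch builds Var[Z(s)] = σ²_ε I + Σ_S(θ_S) under the 75%/25% nugget split; the function name and the arguments c0Z, ce, and ae are placeholders for the overall nugget, partial sill, and range, not quantities from the text.

    import numpy as np

    def var_Z(coords, c0Z, ce, ae):
        # Var[Z(s)] = sigma_eps^2 I + Sigma_S(theta_S), with the nugget c0Z split
        # into 75% measurement error and 25% micro-scale variation (illustrative)
        n = len(coords)
        h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
        sigma2_eps = 0.75 * c0Z              # measurement error variance
        c0 = 0.25 * c0Z                      # micro-scale variation
        Sigma_S = ce * np.exp(-h / ae)       # C_S(h) = c_e exp{-h/a_e}, h > 0
        np.fill_diagonal(Sigma_S, c0 + ce)   # C_S(0) = c_0 + c_e
        return sigma2_eps * np.eye(n) + Sigma_S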

5.5 Estimating Covariance Parameters

So far in this chapter we have tacitly assumed that quantities such as Σ, Γ, σ,


and γ, which describe the spatial dependence among the observed data and between new and observed data, are known. But of course, they are not. The usual approach
is to parameterize the semivariogram or covariance function and to estimate
the parameters of this model. Then, the relevant expressions for the various
predictors and their precision are evaluated by substituting the estimated
parameters, a process aptly termed “plug-in” estimation.

Example 5.9 Suppose you want to perform ordinary kriging in terms of covariances and you choose the exponential model γ(h; θ) = σ²(1 − exp{−3h/α}) as the semivariogram model. Once you have obtained estimates θ̂ = [σ̂², α̂]', you estimate the semivariogram as
γ̂(h) = γ(h; θ̂) = σ̂²(1 − exp{−3h/α̂})
by plugging the estimates into the expression for the model. In order to estimate covariances under this model, you can invoke the relationship C(h) = C(0) − γ(h) and estimate
Ĉ(h) = Ĉ(0) − γ̂(h)
     = γ(∞; θ̂) − γ(h; θ̂)
     = σ̂² − σ̂²(1 − exp{−3h/α̂})
     = σ̂² exp{−3h/α̂} = C(h; θ̂).
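In code, the plug-in step of this example is a one-line substitution; the fitted values below are hypothetical and stand in for whatever σ̂² and α̂ were obtained from the semivariogram fit.

    import numpy as np

    sigma2_hat, alpha_hat = 2.3, 10.0   # hypothetical fitted values

    def gamma_hat(h):
        # fitted exponential semivariogram
        return sigma2_hat * (1.0 - np.exp(-3.0 * h / alpha_hat))

    def C_hat(h):
        # plug-in covariance via C(h) = C(0) - gamma(h)
        return sigma2_hat - gamma_hat(h)   # equals sigma2_hat * exp(-3h/alpha_hat)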

Several important issues are connected with this practice:


• how to select a model for the covariance function or semivariogram;
• how to estimate the parameters in θ;
• how to adjust inference in light of the fact that θ is a constant, yet θ̂ is a random variable.
Our efforts in this section focus on the last two points. We assume that the
model has been chosen correctly. Also, we will generically refer to parameters
of the spatial dependence as covariance parameters, regardless of whether
modeling is based on the semivariogram or the covariance function.
Fortunately, when the mean of the random field is constant, the techniques
in Chapter 4, in particular the parametric techniques in §4.5, can be used to
estimate the covariance parameters. If the mean is spatially varying, things are
considerably more complicated. One complication is that the GLS estimate of
β also depends on the covariance parameters; β ≡ β(θ). It seems natural then to use plug-in estimation for β: β̂ = β(θ̂). If you consider estimation of θ by least squares fitting of a semivariogram model based on the empirical semivariogram, for example, this requires an assumption of intrinsic stationarity (for semivariograms) and second-order stationarity (for covariance functions); see §4.5.1. These assumptions are of course not met in the general linear model where µ(s) = x(s)'β ≠ µ. To see this, consider estimating γ(·) using the classical semivariogram estimator given in equation (4.24). Under the model defined by equation (5.24),
E[(Z(si) − Z(sj))²] = Var[Z(si) − Z(sj)] + {µ(si) − µ(sj)}²
                    = 2γ(si − sj) + {Σ_{k=1}^{p} β_k (x_k(si) − x_k(sj))}²   (5.35)

with x1 (si ) ≡ 1 for all i (Cressie, 1993, p. 165). The empirical semivariogram
no longer estimates the true, theoretical semivariogram. If the covariates are
trend surface functions, the trend often manifests itself in practice by an
empirical semivariogram that increases rapidly with ||h|| (often quadratically).
Analogous problems occur when estimating the covariance function. In light
of this problem, we need to re-examine the techniques from §4.5 for the case
of a spatially varying mean. This is the topic of §5.5.1–§5.5.3.
The third issue mentioned above, regarding the variability of θ̂, is of importance whether the mean is constant or spatially varying. The implications
here are whether a predictor that is best in some sense retains this property

when it is evaluated at estimates of the covariance parameters and how to


adjust the estimate of the prediction error. Take, for example, the ordinary kriging predictor p_ok(Z; s0) and its kriging variance, written in terms of θ to emphasize the dependence on a particular model:
p_ok(Z; s0) = [σ(θ) + 1 (1 − 1'Σ(θ)^{-1}σ(θ)) / (1'Σ(θ)^{-1}1)]' Σ(θ)^{-1} Z(s),
σ²_ok(s0) = C(0) − σ(θ)'Σ(θ)^{-1}σ(θ) + (1 − 1'Σ(θ)^{-1}σ(θ))² / (1'Σ(θ)^{-1}1).
The plug-in predictor
p̂_ok(Z; s0) = [σ(θ̂) + 1 (1 − 1'Σ(θ̂)^{-1}σ(θ̂)) / (1'Σ(θ̂)^{-1}1)]' Σ(θ̂)^{-1} Z(s)   (5.36)
is no longer the best linear unbiased predictor of Z(s0). It is an estimate of the BLUP, a so-called EBLUP. Also, this EBLUP will not be invariant to your choice of θ̂. Different estimation methods yield different estimates of the covariance parameters, which affects the predictions (unless you are predicting at observed locations without measurement error; all predictors honor the data, regardless of “how” you obtained θ̂). Not only is (5.36) no longer best, we do not know its prediction error. The common practice of evaluating σ²_ok(s0) at θ̂ does not yield an estimate of the prediction error of p̂_ok(Z; s0). It yields an estimate of the prediction error of p_ok(Z; s0). In other words, by substituting estimated covariance parameters into the expression for the predictor, we obtain an estimate of the predictor. By substituting into the expression for the prediction variance we get an estimate of the prediction error of a different predictor, not of the one we are using. How to determine, or at least approximate, the prediction error of plug-in predictors is the topic of §5.5.4.

5.5.1 Least Squares Estimation

Consider again the spatial linear model


Z(s) = X(s)β + e(s), e(s) ∼ (0, Σ(θ)).
The universal kriging predictor, written in the GLS form, is
p_uk(Z; s0) = x(s0)'β̂_gls + σ(θ)'Σ(θ)^{-1}(Z(s) − X(s)β̂_gls),
where
β̂_gls = (X(s)'Σ(θ)^{-1}X(s))^{-1} X(s)'Σ(θ)^{-1}Z(s)   (5.37)
is the generalized least squares estimator of the fixed effects. How are we
going to respond to the fact that θ is unknown? In order to estimate θ by
least squares fitting of a semivariogram model, we cannot use the empirical
semivariogram based on the observed data Z(s), because the mean of Z(s) is
not constant. From equation (5.35) we see that the resulting semivariogram

would be biased, and there is little hope to find reasonable estimates for θ
this way. What we need is the semivariogram of e(s), not that of Z(s). If β
were known, then we could construct “data” Z(s) − X(s)β and unbiasedly
estimate the semivariogram, because the error process would be observable.
But β is not known, otherwise we would not find ourselves in the situation
to contemplate a linear model for a spatially varying mean. To compute the
GLS estimate of β, (5.37), requires knowledge of the covariance parameters.
If instead, we use a plug-in estimator for β, the estimated generalized least squares (EGLS) estimate
β̂_egls = (X(s)'Σ(θ̂)^{-1}X(s))^{-1} X(s)'Σ(θ̂)^{-1}Z(s),   (5.38)

we are still left with the problem of having to find a reasonable estimator
for θ. And we just established that this is not likely by way of least squares
techniques without knowing the mean. Schabenberger and Pierce (2002, pp.
613–615) refer to this circular argument as the “cat and mouse game of uni-
versal kriging.” In order to estimate the mean you need to have an estimate
of the covariance parameters, which you can not get by least squares without
knowing the mean. Percival and Walden (1993, p. 219) refer to a similar prob-
lem in spectral analysis of time series—where in order to derive a filter for
pre-whitening of a spectral density estimate one needs to know the spectral
density of the series—as a “cart and horse” problem.
The approach that is taken to facilitate least squares estimation of covari-
ance parameters in the case of a spatially varying mean is to compute initially
an estimate of the mean that does not depend on θ. Then use this estimate
to detrend the data and estimate the semivariogram by least squares based
on the empirical semivariogram of the residuals. Since the estimate θ̂ depends on how we estimated β initially (we know it was not by GLS or EGLS), the
process is often repeated (iterated). The resulting “simultaneous” estimation
scheme for θ and β is based on the methods described in §4.5.1 and is termed iteratively re-weighted generalized least squares (IRWGLS):
1. Obtain a starting estimate of β, say β̂;
2. Compute residuals r = Z(s) − X(s)β̂;
3. Estimate and model the semivariogram of the residuals using techniques described in §4.4 and §4.5, obtaining θ̂ by minimizing either equation (4.31) or equation (4.34);
4. Obtain a new estimate of β using equation (5.38);
5. Repeat steps 2–4 until the relative or absolute change in the estimates of β and θ is small.
The starting value in step 1 is almost always obtained by ordinary least
squares. In the first go-around of the IRWGLS algorithm you are working with
OLS residuals; once past step 4, you are working with (E)GLS residuals. The semivariogram estimator based on the residuals r = Z(s) − X(s)β̂ (step 3 above) is biased. It is important to understand that this bias is different from the bias
mentioned earlier that is due to not accounting for the change in the mean,
namely equation (5.35). The bias in the semivariogram based on OLS residuals
stems from the fact that the residuals fail to share important properties of
the model errors e(s). For example, their variance is not Σ, they are rank-deficient and heteroscedastic. Frankly, the only thing e(s) and Z(s) − X(s)β̂_ols have in common is a zero mean. The semivariogram of Z(s) − X(s)β̂_ols is not
the semivariogram of e(s). A detailed discussion of the properties of OLS and
GLS residuals, and the diagnosis of the covariance model choice based on
residuals is deferred until Chapter 6. At this point it suffices to note that the
bias in the estimated semivariogram based on OLS residuals increases with
the lag. Cressie (1993, p. 166) thus argues that the bias can be controlled
by fitting the semivariogram model by weighted nonlinear least squares as
described in §4.5.1. Because the weights are proportional to the approximate
variance of the empirical semivariogram, empirical semivariogram values at
large lags—where the bias is greater—are down-weighted.
By iterating the above steps, that is, by repeating steps 2–4 in the IRWGLS
algorithm, the bias problem is not solved. The empirical semivariogram of the
generalized least squares residuals is also a biased estimator of the semivari-
ogram. Since both β̂_ols and β̂_egls are unbiased estimators of β, most of the
trend is removed in the very first step of the algorithm, and you may pick up
comparably little additional structure in subsequent iterations. What matters
for efficient estimation of the large-scale trend, and for spatial prediction, is
that the estimator of θ is efficient and has as little bias as possible. If you
estimate the covariogram instead of the semivariogram in step 3, then a single
OLS fit and OLS residuals may be preferable over an iterated algorithm.
The final IRWGLS estimator of β is an EGLS estimator. If we denote the estimator of θ obtained from IRWGLS by θ̂_IRWGLS, the IRWGLS estimator of β is
β̂_IRWGLS = (X(s)'Σ(θ̂_IRWGLS)^{-1}X(s))^{-1} X(s)'Σ(θ̂_IRWGLS)^{-1}Z(s),   (5.39)
and its variance-covariance matrix is usually estimated as
V̂ar[β̂_IRWGLS] = (X(s)'Σ(θ̂_IRWGLS)^{-1}X(s))^{-1}.   (5.40)

Since steps 2–4 are repeated, the process is iterative, but in contrast to
a numerical optimization technique, such as the Newton-Raphson algorithm,
it is difficult to study the overall behavior of the IRWGLS procedure. There
is, for example, no guarantee that the process “converges” in the usual sense
or that some “extremum” has been found when the process stops. When
the algorithm comes to a halt, think of it as lack of progress, rather than
convergence. Any continuance would lead to the same estimates of θ and these
would lead to the same estimates of β. We recommend not to monitor the
absolute change in parameter estimates alone, but to use a relative criterion in
step 5. For example, if β̂^(u) and β̂^(u+1) are the estimates from two successive
iterations, compute
δ_j^(β) = |β̂_j^(u) − β̂_j^(u+1)| / {0.5(|β̂_j^(u)| + |β̂_j^(u+1)|)},   j = 1, · · · , p,
δ_k^(θ) = |θ̂_k^(u) − θ̂_k^(u+1)| / {0.5(|θ̂_k^(u)| + |θ̂_k^(u+1)|)},   k = 1, · · · , q,
and stop when max{δ_k^(θ), δ_j^(β)} < 10^{-6}.
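A compact sketch of the IRWGLS loop, including the relative stopping rule above, might look as follows; fit_semivariogram (returning the vector θ̂ from a weighted least squares fit to the residual semivariogram) and build_Sigma (evaluating Σ(θ̂) at the data locations) are assumed helper functions standing in for the §4.4–§4.5 machinery.

    import numpy as np

    def irwgls(X, z, coords, fit_semivariogram, build_Sigma, tol=1e-6, max_iter=20):
        # Step 1: ordinary least squares starting value for beta
        beta = np.linalg.lstsq(X, z, rcond=None)[0]
        theta = None
        for _ in range(max_iter):
            r = z - X @ beta                                        # Step 2: residuals
            theta_new = np.asarray(fit_semivariogram(r, coords))    # Step 3: covariance parameters
            Sigma = build_Sigma(theta_new, coords)
            Si = np.linalg.inv(Sigma)
            beta_new = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ z)  # Step 4: EGLS, equation (5.38)
            if theta is not None:                                   # Step 5: relative-change rule
                old = np.concatenate([beta, theta])
                new = np.concatenate([beta_new, theta_new])
                delta = np.abs(old - new) / (0.5 * (np.abs(old) + np.abs(new)))
                if np.max(delta) < tol:
                    beta, theta = beta_new, theta_new
                    break
            beta, theta = beta_new, theta_new
        return beta, theta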

5.5.2 Maximum Likelihood

In the context of estimating parameters of the covariance function, we had


briefly discussed maximum likelihood (ML) estimation in §4.5.2. In order to
proceed with ML estimation, we must make a distributional assumption for
Z(s). It is not sufficient to specify the first two moments only, as was previously
the case. ML estimation for spatial models is developed only for the Gaussian
case (Mardia and Marshall, 1984); we assume that Z(s) ∼ G(X(s)β, Σ(θ)).
In contrast to the IRWGLS approach, ML estimation is truly simultaneous
estimation of mean and covariance parameters. This fact may be obstructed
by profiling of β (see below), but the important point is that the ML estimates
are the simultaneous solution to the problem of minimizing the negative of
twice the Gaussian log likelihood
ϕ(β; θ; Z(s)) = ln{|Σ(θ)|} + n ln{2π} + (Z(s) − X(s)β)'Σ(θ)^{-1}(Z(s) − X(s)β).   (5.41)

If X is an (n × p) matrix of rank k, this optimization problem involves


k + q parameters. Because the elements of Σ are usually nonlinear functions
of the elements of θ, the process is typically iterative. From some starting
values [θ (0) , β (0) ], one computes successive updates according to a nonlinear
optimization technique; for example, by way of the Newton-Raphson, Quasi-
Newton, or some other suitable algorithm.
Fortunately, the size of the optimization problem can be substantially re-
duced. First, note that usually a scalar parameter can be factored from the
variance-covariance matrix. We write Σ(θ) = σ 2 Σ(θ ∗ ) where θ ∗ is a ((q −1)×
1) vector with its elements possibly adjusted to reflect the factoring of σ 2 . For
example, if the process has an exponential covariance structure, then Σ(θ ∗ ) is
the autocorrelation matrix with exponential correlation structure. Second, a
closed form expression can be obtained for the parameters β and σ 2 given θ.
This enables us to remove these parameters from the optimization, a process
termed profiling. To profile β, take derivatives of (5.41) with respect to β
and solve. The result is the GLS estimator
β̂ = (X(s)'Σ(θ)^{-1}X(s))^{-1} X(s)'Σ(θ)^{-1}Z(s).   (5.42)
Substituting this expression into (5.41) yields an objective function for minimization profiled for β,
ϕ_β(θ; Z(s)) = ln{|σ²Σ(θ*)|} + n ln{2π} + σ^{-2} r'Σ(θ*)^{-1}r,   (5.43)
where
r = Z(s) − X(s)(X(s)'Σ(θ)^{-1}X(s))^{-1}X(s)'Σ(θ)^{-1}Z(s)
is the GLS residual. To profile σ² from the objective function, note that its MLE is
σ̂²_ml = (1/n) r'Σ(θ*)^{-1}r.
Substituting again yields the negative of twice the profiled log likelihood,
ϕ_{β,σ}(θ*; Z(s)) = ln{|Σ(θ*)|} + n ln{σ̂²_ml} + n(ln{2π} + 1).   (5.44)
Minimizing (5.44) is an optimization problem with only q − 1 parameters. Upon convergence you obtain θ̂_ml from σ̂²_ml and θ̂*_ml, and β̂_ml by evaluating (5.42) at the maximum likelihood estimate θ̂_ml of θ:
β̂_ml = (X(s)'Σ(θ̂_ml)^{-1}X(s))^{-1} X(s)'Σ(θ̂_ml)^{-1}Z(s).   (5.45)

The savings in computing time are substantial when parameters can be


profiled from the optimization. For example, in a model in R² where the large-
scale trend (mean function) is modeled as a trend surface of second degree
and Σ(θ) has a spherical no-nugget covariance structure, there are eight pa-
rameters (q = 2, k = 6). The iterative optimization is carried out for a single
parameter, the range of the spatial process. Given an estimate of the range
parameter, the correlation matrix Σ(θ ∗ ) can be computed which yields the
MLE of σ² and that of β. Note that (5.45) could also be evaluated at Σ(θ̂*_ml).
The profiled parameter σ 2 does not affect the point estimate of the regression
parameters. Additional computational savings can be gained by using other
algorithms that avoid repeated inversion of large, unstructured matrices (Mar-
dia and Marshall, 1984; Zimmerman, 1989).
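The profiled objective (5.44) is simple to evaluate numerically. The following sketch assumes a user-supplied corr_fn that returns the correlation matrix Σ(θ*) from a matrix of pairwise distances; the function and argument names are ours, and the result can be handed to a general-purpose optimizer such as scipy.optimize.minimize.

    import numpy as np

    def neg2_profiled_ml(theta_star, X, z, coords, corr_fn):
        # Profiled -2 log likelihood, equation (5.44): beta is profiled by GLS,
        # sigma^2 by its MLE; corr_fn(distances, theta_star) returns Sigma(theta*).
        n = len(z)
        h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
        R = corr_fn(h, theta_star)
        Ri = np.linalg.inv(R)
        beta = np.linalg.solve(X.T @ Ri @ X, X.T @ Ri @ z)   # equation (5.42)
        r = z - X @ beta
        sigma2 = (r @ Ri @ r) / n
        _, logdetR = np.linalg.slogdet(R)
        return logdetR + n * np.log(sigma2) + n * (np.log(2.0 * np.pi) + 1.0)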
One of the advantages of likelihood estimation is the ability to estimate
the variance-covariance matrix of the parameter estimates based on the ob-
served or expected information matrix. The observed information matrix
equals 0.5H, where H is the Hessian (second derivative) matrix of (5.41).
Standard errors of the maximum likelihood estimators are obtained as the
diagonal elements of 2H−1 or 2E[H]−1 . Because
E[∂²ϕ(β; θ; Z(s)) / ∂β ∂θ'] = 0,
the information matrix is block-diagonal. Even if β is profiled from the opti-
mization, the leading block of the inverse information matrix is
(X(s)'Σ(θ)^{-1}X(s))^{-1}
and the variance-covariance matrix of the regression effects is estimated as
V̂ar(β̂_ml) = (X(s)'Σ(θ̂_ml)^{-1}X(s))^{-1}.   (5.46)
Thus, the variance-covariance matrix of β̂_ml has the same form as that of equation (5.40), but with θ̂_IRWGLS replaced with θ̂_ml.
For full ML estimation without profiling, the inverse of the information matrix for ω = [β', θ']' can be written as (see Breusch, 1980 and Judge et al., 1985, p. 182)
I(ω)^{-1} = [ (X(s)'Σ(θ)^{-1}X(s))^{-1}                    0
              0                    2{∆'(Σ(θ)^{-1} ⊗ Σ(θ)^{-1})∆}^{-1} ],   (5.47)
where 0 denotes the (p × q) zero matrix, and ∆ = ∆(θ) = ∂(vec(Σ(θ)))/∂θ' is an (n² × q) matrix that contains the partial derivatives of each element of Σ(θ) with respect to each element in θ. The matrix operator vec(·) stacks the columns of a matrix into a single vector, so vec(Σ(θ)) is an (n² × 1) vector. The symbol ⊗ denotes the matrix direct (Kronecker) product, multiplying each element in the first matrix by every element in the second matrix and producing an n² × n² matrix.
Detailed expressions for the Hessian matrix with respect to the covariance
parameters for the θ and θ ∗ parameterizations are given in Wolfinger, Tobias,
and Sall (1994) for ML and restricted maximum likelihood estimation. These
authors also present expressions to convert the (q − 1) × (q − 1) Hessian in
terms of θ ∗ into the q × q Hessian for the θ parameterization.

5.5.3 Restricted Maximum Likelihood

Restricted (or residual) maximum likelihood (REML) estimates are often pre-
ferred over MLEs because the latter exhibit greater negative bias for estimates
of covariance parameters. The culprit of this bias—roughly—lies in the failure
of ML estimation to account for the number of mean parameters in the estima-
tion of the covariance parameters. The most famous—and simplest—example
is that of an iid sample from a G(µ, σ 2 ) distribution, where µ is unknown.
The MLE for σ² is
σ̂²_ml = (1/n) Σ_{i=1}^{n} (Y_i − Ȳ)²,
which has bias −σ²/n. The REML estimator of σ² is unbiased:
σ̂²_reml = (1/(n − 1)) Σ_{i=1}^{n} (Y_i − Ȳ)².
Similarly, in a regression model with Gaussian, homoscedastic, uncorrelated
errors, the ML and REML estimators for the residual variance are
σ̂²_ml = (1/n)(Y − Xβ̂)'(Y − Xβ̂),
σ̂²_reml = (1/(n − k))(Y − Xβ̂)'(Y − Xβ̂),
respectively. The ML estimator is again a biased estimator.
For the spatial model Z(s) ∼ G(X(s)β, Σ(θ)), the REML adjustment con-
sists of performing ML estimation not for Z(s), but for KZ(s), where the
((n − k) × n) matrix K is chosen so that E[KZ(s)] = 0 and rank[K] = n − k.
Because of these properties, the matrix K is called a matrix of error con-
trast, which supports the alternate name as residual maximum likelihood.
Although REML estimation is well-established in statistical theory and ap-
plications, in the geostatistical arena it appeared first in work by P. Kitanidis
and co-workers in the mid-1980’s (Kitanidis, 1983; Kitanidis and Vomvoris,
1983; Kitanidis and Lane, 1985).
An important aspect of REML estimation lies in the handling of β. In
§5.5.2, the β vector was profiled from the log likelihood to reduce the size of
the optimization problem. This led to the objective function ϕβ (θ; Z(s)) in
(5.43). When you consider ML estimation for the n − k vector KZ(s), the
fixed effects β have seemingly disappeared from the objective function. Minus
twice the log likelihood of KZ(s) is
ϕ_R(θ; KZ(s)) = ln{|KΣ(θ)K'|} + (n − k) ln{2π} + Z(s)'K'(KΣ(θ)K')^{-1}KZ(s).   (5.48)
This is an objective function about θ only. It is possible to write ϕ_R(θ; KZ(s)) in terms of an estimate β̂, and we will do so shortly to more clearly show the relationship between the ML and REML objective functions. You need to keep in mind, however, that there is no “REML estimator of β.” Minimizing ϕ_R(θ; KZ(s)) yields θ̂_reml. What is termed β̂_reml and is computed as
β̂_reml = (X'Σ(θ̂_reml)^{-1}X)^{-1} X'Σ(θ̂_reml)^{-1}Z(s)
is simply an EGLS estimator evaluated at θ̂_reml.
We now rewrite (5.48) to eliminate the matrix K from the expression. First
notice that if E[KZ(s)] = 0, then KX(s) = 0. If Σ(θ) is positive definite,
Searle et al. (1992, pp. 451–452) show that
K'(KΣ(θ)K')^{-1}K = Σ(θ)^{-1} − Σ(θ)^{-1}X(s)Ω(θ)X(s)'Σ(θ)^{-1},
where Ω(θ) = (X(s)'Σ(θ)^{-1}X(s))^{-1}. This identity and Ω(θ)X(s)'Σ(θ)^{-1}Z(s) = β̂ yield
Z(s)'K'(KΣ(θ)K')^{-1}KZ(s) = r'Σ(θ)^{-1}r.
In fundamental work on REML estimation, Harville (1974, 1977) established important further results. For example, he shows that if
K'K = I − X(s)(X(s)'X(s))^{-1}X(s)'
and KK' = I, then minus twice the log likelihood of KZ(s) can be written as
ϕ_R(θ; KZ(s)) = ln{|Σ(θ)|} + ln{|X(s)'Σ(θ)^{-1}X(s)|} − ln{|X(s)'X(s)|} + r'Σ(θ)^{-1}r + (n − k) ln{2π}.
Harville (1977) points out that (n − k) × n matrices whose rows are linearly independent rows of I − X(s)(X(s)'X(s))^{-1}X(s)' will lead to REML objective functions that differ by a constant amount. The amount does not depend on θ or β. The obvious choice as a REML objective function for minimization is
ϕ_R(θ; KZ(s)) = ln{|Σ(θ)|} + ln{|X(s)'Σ(θ)^{-1}X(s)|} + r'Σ(θ)^{-1}r + (n − k) ln{2π}.
In this form, minus twice the REML log likelihood differs from (5.41) by the terms ln{|X(s)'Σ(θ)^{-1}X(s)|} and k ln{2π}. As with ML estimation, a scale parameter can be profiled from Σ(θ). The REML estimator of this parameter is
σ̂²_reml = (1/(n − k)) r'Σ(θ*)^{-1}r,
and upon substitution one obtains minus twice the profiled REML log likelihood
ϕ_{R,σ}(θ*; KZ(s)) = ln{|Σ(θ*)|} + ln{|X(s)'Σ(θ*)^{-1}X(s)|} + (n − k) ln{σ̂²_reml} + (n − k)(ln{2π} + 1).   (5.49)
Wolfinger, Tobias, and Sall (1994) give expressions for the gradient and Hes-
sian of the REML log likelihood with and without profiling of σ 2 .
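For comparison with the ML sketch given earlier, the profiled REML objective (5.49) differs only in the divisor used for σ̂² and the additional log-determinant term; corr_fn is again an assumed helper returning Σ(θ*) from pairwise distances.

    import numpy as np

    def neg2_profiled_reml(theta_star, X, z, coords, corr_fn):
        # Profiled REML objective, equation (5.49): sigma^2 uses divisor n - k and
        # the term ln|X' Sigma(theta*)^{-1} X| is added relative to the ML objective.
        n, k = X.shape
        h = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
        R = corr_fn(h, theta_star)
        Ri = np.linalg.inv(R)
        XtRiX = X.T @ Ri @ X
        beta = np.linalg.solve(XtRiX, X.T @ Ri @ z)
        r = z - X @ beta
        sigma2 = (r @ Ri @ r) / (n - k)
        _, logdetR = np.linalg.slogdet(R)
        _, logdetXRX = np.linalg.slogdet(XtRiX)
        return (logdetR + logdetXRX + (n - k) * np.log(sigma2)
                + (n - k) * (np.log(2.0 * np.pi) + 1.0))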
There is a large literature on the use of ML and REML for spatial modeling
and this is an area of active research in statistics. Searle, Casella and McCul-
loch (1992) provide an introduction to REML estimation in linear models
and Littell, Milliken, Stroup and Wolfinger (1996) adapt some of these re-
sults to the spatial case. Cressie and Lahiri (1996) provide the distributional
properties of REML estimators in a spatial setting.

5.5.4 Prediction Errors When Covariance Parameters Are Estimated

A common theme of statistical modeling when errors are correlated is the


plug-in form of expressions involving the unknown covariance parameters.
The “plugging” occurs when we compute predictions of new observations,
estimates of the mean, and estimates of precision. This has important con-
sequences for the statistical properties of the estimates (predictions) and our
ability to truthfully report their uncertainty. From the introductory remarks
of this section recall the case of the ordinary kriging predictor and its variance.
p_ok(Z; s0) = [σ(θ) + 1 (1 − 1'Σ(θ)^{-1}σ(θ)) / (1'Σ(θ)^{-1}1)]' Σ(θ)^{-1} Z(s),
σ²_ok(s0) = C(0) − σ(θ)'Σ(θ)^{-1}σ(θ) + (1 − 1'Σ(θ)^{-1}σ(θ))² / (1'Σ(θ)^{-1}1).
Applying “plugging” to both yields
p̂_ok(Z; s0) = [σ(θ̂) + 1 (1 − 1'Σ(θ̂)^{-1}σ(θ̂)) / (1'Σ(θ̂)^{-1}1)]' Σ(θ̂)^{-1} Z(s),
σ̂²_ok(s0) = Ĉ(0) − σ(θ̂)'Σ(θ̂)^{-1}σ(θ̂) + (1 − 1'Σ(θ̂)^{-1}σ(θ̂))² / (1'Σ(θ̂)^{-1}1).
p̂_ok(Z; s0) is not the ordinary kriging predictor, it is an estimate thereof. Similarly, σ̂²_ok(s0) is not the variance of the ordinary kriging predictor, it is an estimate thereof; and because of the nonlinear involvement of θ̂, it is a biased estimate of σ²_ok(s0). More importantly, σ²_ok(s0) is not the prediction error of p̂_ok(Z; s0), but that of p_ok(Z; s0). Intuitively, one would expect the prediction error of p̂_ok(Z; s0) to exceed that of the ordinary kriging predictor, because not knowing θ has introduced additional variability into the system; θ̂ is a random vector. So, if σ²_ok(s0) is not the prediction error we should compute if p̂_ok(Z; s0) is our predictor, are we not making things even worse by evaluating σ²_ok(s0) at θ̂? Is that not a biased estimate of the wrong quantity?
The consequences of plug-in estimation for estimating variability are not specific to spatial models. To address the issues in more generality, we con-
sider in this section a more generic model and notation. The model at issue
is
Z = Xβ + e, e ∼ (0, Σ(θ)),
a basic correlated error model with a linear mean and parameterized variance
matrix. In the expressions that follow, an overline (¯) denotes a general estimator or predictor, a tilde (˜) denotes a quantity evaluated if θ is known, and a hat (ˆ) is used when the quantity is evaluated at the estimate θ̂.
Following Harville and Jeske (1992), our interest is in predicting a quantity ω with the properties
E[ω] = x0'β,   Var[ω] = σ²(θ),   Cov[Z, ω] = σ(θ),
where x0'β is estimable. If ω̄ is any predictor of ω, then the quality of our prediction under squared-error loss is measured by the mean-squared error
mse[ω̄, ω] = E[(ω̄ − ω)²].
A special case is that of Var[ω] = σ²(θ) = 0. The problem is then one of estimating x0'β and
mse[ω̄, ω] = mse[x0'β̄, x0'β] = x0'E[(β̄ − β)(β̄ − β)']x0.
If β̄ is unbiased for β, then this mean-squared error equals the variance of x0'β̄.
Under the stated conditions, the GLS estimator of β and the BLUP of ω are
β̃ = (X'Σ(θ)^{-1}X)^{-1}X'Σ(θ)^{-1}Z,
ω̃ = x0'β̃ + σ(θ)'Σ(θ)^{-1}(Z − Xβ̃).
It is easy to establish that E[β̃] = β, and that ω̃ is unbiased in the sense that E[ω̃] = E[ω] = x0'β. The respective mean-squared errors are given by
mse[x0'β̃, x0'β] = (X'Σ(θ)^{-1}X)^{-1} = Ω(θ),   (5.50)
mse[ω̃, ω] = E[(ω̃ − ω)²] = p'Σ(θ)p + σ²(θ) − 2p'σ(θ),   (5.51)
p' = (x0' − σ(θ)'Σ(θ)^{-1}X)(X'Σ(θ)^{-1}X)^{-1}X'Σ(θ)^{-1} + σ(θ)'Σ(θ)^{-1}.

If θ is unknown, we plug in the estimate θ̂ and obtain the EGLS estimator and EBLUP
β̂ = (X'Σ(θ̂)^{-1}X)^{-1}X'Σ(θ̂)^{-1}Z = Ω(θ̂)X'Σ(θ̂)^{-1}Z,
ω̂ = x0'β̂ + σ(θ̂)'Σ(θ̂)^{-1}(Z − Xβ̂).
The estimator β̂ and the predictor ω̂ remain unbiased. The question, then, is how to compute the variance of β̂ and the mean square error mse[ω̂, ω]?
Before looking into the details, let us consider the plug-in quantity
Ω(θ̂) = (X'Σ(θ̂)^{-1}X)^{-1},
which is commonly used as an estimate of Var[β̂]. There are two major problems. First, it is not unbiased for Ω(θ); E[Ω(θ̂)] ≠ Ω(θ). We could use it as a biased estimator of (5.50), however. Second, even if it were unbiased, Ω(θ) is not the variance of β̂. We need
• an estimator of the mean-squared error that takes into account the fact that θ was estimated and hence that θ̂ is a random variable;
• a computationally feasible method for evaluating the mean-squared error.
Now let us return to the more general case of predicting ω with ω̂, keeping in mind that finding the variance of β̂ is a special case of determining mse[ω̂, ω]. Progress can be made by considering only those estimators θ̂ that have certain properties. For example, Kackar and Harville (1984) and Harville and Jeske (1992) consider even, translation invariant estimators. Christensen (1991, Ch. VI.5) considers “residual-type” statistics; see also Eaton (1985). Suffice it to say that ML and REML estimators have the needed properties. Kackar and Harville (1984) decompose the prediction error into
ω̂ − ω = (ω̃ − ω) + (ω̂ − ω̃) = e1(θ) + e2(θ).
If θ̂ is translation invariant, then e1(θ) and e2(θ) are distributed independently, and
mse[ω̂, ω] = mse[ω̃, ω] + Var[ω̂ − ω̃].
By choosing to ignore the fact that θ was estimated, the mean-squared prediction error is underestimated by the amount Var[ω̂ − ω̃]. Hence, mse[ω̂, ω] ≥ mse[ω̃, ω]. If you follow the practice of plugging θ̂ into expressions that apply if θ were known, you would estimate mse[ω̂, ω] by evaluating (5.51) at θ̂. This yields the (estimated) mean-squared error of the wrong quantity, and can be substantially biased.
Kackar and Harville (1984) propose a correction term, and Harville and Jeske (1992) provide details about estimation. First, using a Taylor series, Var[ω̂ − ω̃] is approximated as tr{A(θ)B(θ)}, where
A = Var[∂ω̃/∂θ],
B = mse[θ̂, θ] = E[(θ̂ − θ)(θ̂ − θ)'].

Although we now have an expression—at least an approximate one—for the


'
mean-squared prediction error based on a translation-invariant estimator θ,
we do not seem to have helped the cause very much. We now have the improved
mean-squared error expression
.
mse['
ω , ω] = mse[R
ω , ω] + tr{A(θ)B(θ)}, (5.52)
but mse[Rω , ω], A(θ) and B(θ) depend on θ. To evaluate the matrices, we are
going to plug in θ' again and, no surprise here, incur yet another source of bias.
This does appear like a vicious cycle. Our initial estimate of prediction error
was not reliable, because θ ' was estimated and we simply plugged it into the
expression for the unknown θ. Which prompts us to derive an approximation
of the actual prediction error, (5.52). Unfortunately, it also depends on θ,
and to evaluate it we plug in θ ' again, which biases the result. Fortunately,
we are already working with the correct prediction error, and the bias can be
managed.
Now we introduce θ into the symbolic mean squared error representation to underline the fact that the mse[·, ·] expressions depend on θ. First, Prasad and Rao (1990) show that the plug-in estimator tr{A(θ̂)B(θ̂)} is approximately unbiased for tr{A(θ)B(θ)}. Then, Harville and Jeske (1992) derive that
E[mse[ω̃, ω, θ̂] − tr{A(θ̂)B(θ̂)}] ≈ mse[ω̃, ω, θ].
Evaluating the improved estimator (5.52) at θ̂, we still fall short of the prediction error on average, but things have improved. Half of the bias is gone. An approximately unbiased estimator of the mean-squared prediction error mse[ω̂, ω] (the Prasad-Rao MSE estimator) is then
mse[ω̃, ω, θ̂] + 2 tr{A(θ̂)B(θ̂)}.   (5.53)
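The correction term in (5.53) can be approximated numerically. The rough sketch below obtains A(θ̂) by finite differences of the kriging weight vector; cov_fn is an assumed helper returning Σ(θ), σ(θ), and C(0) for the data and prediction locations, and B_hat is an estimate of mse[θ̂, θ], for example from the inverse information matrix in (5.47). This is illustrative only and not the authors' implementation.

    import numpy as np

    def kh_correction(theta_hat, B_hat, cov_fn, X, x0, eps=1e-5):
        # Approximates 2 tr{A(theta_hat) B(theta_hat)} in (5.53);
        # A(theta) = Var[d omega_tilde / d theta] via finite differences of the weights.
        theta_hat = np.asarray(theta_hat, dtype=float)

        def weights(theta):
            Sigma, sigma0, _ = cov_fn(theta)
            Si = np.linalg.inv(Sigma)
            Omega = np.linalg.inv(X.T @ Si @ X)
            # weights of the BLUP omega_tilde = lambda' Z (universal kriging form)
            return Si @ (sigma0 + X @ Omega @ (x0 - X.T @ Si @ sigma0))

        base = weights(theta_hat)
        grads = []
        for k in range(len(theta_hat)):
            th = theta_hat.copy()
            th[k] += eps
            grads.append((weights(th) - base) / eps)   # d lambda / d theta_k
        G = np.array(grads)                            # (q x n)
        Sigma, _, _ = cov_fn(theta_hat)
        A_hat = G @ Sigma @ G.T                        # Var[d omega_tilde / d theta]
        return 2.0 * np.trace(A_hat @ B_hat)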
The bias of the plug-in mean squared error estimator can be substantial, in particular when the sample size is small. However, as Zimmerman and Cressie (1992) note, the magnitude of this bias and the improvement offered by (5.53) depend on the validity of the Gaussian assumption, whether or not θ̂ is unbiased, the nature of the covariance model used for Σ(θ̂), the spatial configuration of the data, and the strength of spatial autocorrelation. Based on examples in Zimmerman and Zimmerman (1991) and Zimmerman and Cressie (1992), Zimmerman and Cressie (1992) offer the following general guidelines. The performance of the plug-in mean-squared error estimator (mse[ω̃, ω, θ̂], the estimated kriging variance) as an estimator of the true prediction mean-squared error can often be improved upon when the spatial correlation is weak, but it is often adequate and sometimes superior to alternative estimators such as (5.53) when the spatial correlation is strong. Zimmerman and Cressie (1992) suggest that corrections of the type used in (5.53) should only be used when θ̂ is unbiased, or Σ(θ̂) is negatively biased, and the spatial correlation is weak. In other words, the use of a plug-in estimator of the kriging variance is fine for most spatial problems with moderate to strong spatial autocorrelation.

5.6 Nonlinear Prediction

The development of the simple and ordinary kriging predictors requires no dis-
tributional assumptions other than those pertaining to the first two moments
of the random field. Thus, simple kriging is always the best linear predictor
and ordinary kriging is always the best linear unbiased predictor, regardless
of the underlying distribution of the data. The best predictor, i.e., the one
that minimizes the mean-squared prediction error, is given in (5.1), the con-
ditional expectation of Z(s0 ) given the observed data. When the data follow
a multivariate Gaussian distribution, this expectation is linear in the data
and is equivalent to the simple kriging predictor (5.10). For other distribu-
tions, this conditional expectation may not be linear and so linear predictors
may be poor approximations to this optimal conditional expectation. Statis-
ticians often cope with such problems by transforming the data, so that the
transformed data follow a Gaussian distribution and then performing analyses
with the transformed data. In this section, we discuss several approaches to
constructing nonlinear predictors based on transformations of the data.

5.6.1 Lognormal Kriging

Suppose the logarithm of the random function Z(s) is a Gaussian random


field so that Y (s) = log{Z(s)} follows a multivariate Gaussian distribu-
tion. For the development here, assume that Y (·) is intrinsically stationary
with mean µY and semivariogram γY (h). Simple kriging of Y (s0 ) using data
Y(s1), · · · , Y(sn) gives p_sk(Y; s0) from (5.10) and σ²_sk(Y; s0) from (5.11). This
suggests using p(Z; s0 ) = exp{psk (Y; s0 )} as a predictor of Z(s0 ). Unfortu-
nately, this predictor is biased for Z(s0 ). However, David (1988, pp. 117–118)
and Cressie (1993, pp. 135–136) show how the properties of the lognormal
distribution can be used to construct an unbiased predictor. First we draw on
the results in Aitchison and Brown (1957). If
Y = [Y1, Y2]' ∼ G(µ, Σ),   µ = [µ1, µ2]',   Σ = {σij}, i, j = 1, 2,
then [exp{Y1}, exp{Y2}]' has mean ν and covariance matrix T, where
ν = (ν1, ν2)' = [exp{µ1 + σ11/2}, exp{µ2 + σ22/2}]'
and
T = [ ν1²(exp{σ11} − 1)      ν1ν2(exp{σ12} − 1)
      ν1ν2(exp{σ21} − 1)     ν2²(exp{σ22} − 1) ].
To appreciate this result and its implications for prediction/estimation con-
sider the following, simple case.

Example 5.10 Let Y1, · · · , Yn be independent and identically distributed Gaussian variables with unknown mean µ and known variance σ². We are interested in estimating
Q = E[exp{Y}] = exp{µ + σ²/2}.
We first consider how to estimate E[Yi], and then how to involve this estimator to derive an unbiased estimator of Q. The natural estimator for µ is the arithmetic average Ȳ = (1/n) Σ_{i=1}^{n} Yi. So, consider using Q̂1 = exp{Ȳ} as your estimator of Q. Because the Yi are Gaussian, so is Ȳ, Ȳ ∼ G(µ, σ²/n). Applying the result of Aitchison and Brown yields
E[exp{Ȳ}] = exp{µ + Var[Ȳ]/2} = exp{µ + σ²/(2n)}.
Q̂1 is biased for Q, but we can correct it. Some simple manipulations lead to
E[exp{Ȳ}] = exp{µ + σ²/(2n)}
          = exp{µ + σ²/2 − σ²/2 + σ²/(2n)}
          = Q exp{−σ²/2 + σ²/(2n)}.
Consequently,
exp{Ȳ} exp{σ²/2 − σ²/(2n)} = exp{Ȳ + 0.5(Var[Yi] − Var[Ȳ])}
is an unbiased estimator of Q.
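A quick simulation confirms the direction of the bias and the effect of the correction (the parameter values are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma2, n = 1.0, 0.5, 10
    Q = np.exp(mu + sigma2 / 2.0)                          # target quantity

    Y = rng.normal(mu, np.sqrt(sigma2), size=(200_000, n))
    Ybar = Y.mean(axis=1)
    Q1 = np.exp(Ybar)                                      # naive estimator, biased low
    Q2 = np.exp(Ybar + 0.5 * (sigma2 - sigma2 / n))        # bias-corrected estimator

    print(Q, Q1.mean(), Q2.mean())   # Q1 underestimates Q; Q2 is centered on Q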

Returning to the problem of finding a predictor p(Z; s0 ) based on data Y(s),


we ask that the predictor be unbiased in the sense that previous predictors
have been unbiased; its mean should equal E[Z(s)] = µZ . Applying the result
of Aitchison and Brown (1957) twice, as in the example, first to p(Z; s0 ) =
exp{psk (Y; s0 )}, and then inversely to µY gives
E[p(Z; s0)] = E[exp{p_sk(Y; s0)}] = µ_Z exp{−σ²_Y/2 + Var[p_sk(Y; s0)]/2},
where σ²_Y = Var[Y(si)]. The bias-corrected predictor of Z(s0) is then
p_slk(Z; s0) = exp{p_sk(Y; s0) + σ²_Y/2 − Var[p_sk(Y; s0)]/2}
             = exp{p_sk(Y; s0) + σ²_sk(Y; s0)/2}.   (5.54)
Since Y(s) is a Gaussian random field, p_sk(Y; s0) = E[Y(s0)|Y] and so
p_slk(Z; s0) = E[exp{Y(s0)}|Y] = E[Z(s0)|Z].
Thus, p_slk(Z; s0) is the optimal predictor of Z(s0). The corresponding conditional variance, which is also the minimized mean-squared prediction error (MSPE), is (Chilès and Delfiner, 1999, p. 191)
Var[(p_slk(Z; s0) − Z(s0))|Z] = (p_slk(Z; s0))² [exp{σ²_sk(Y; s0)} − 1].
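Given simple kriging output on the log scale, the back-transformation in (5.54) and the conditional MSPE above amount to two lines of code:

    import numpy as np

    def lognormal_sk(psk_Y, s2sk_Y):
        # back-transform simple kriging results on the log scale, equation (5.54),
        # returning the predictor of Z(s0) and its conditional MSPE
        pred = np.exp(psk_Y + 0.5 * s2sk_Y)
        mspe = pred**2 * (np.exp(s2sk_Y) - 1.0)
        return pred, mspe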

When ordinary kriging is used to predict Y(s0), the properties of the lognormal distribution can be used as before (see Cressie, 1993, pp. 135–136) and in this case the bias-corrected (ordinary lognormal kriging) predictor of Z(s0) is
p_olk(Z; s0) = exp{p_ok(Y; s0) + σ²_Y/2 − Var[p_ok(Y; s0)]/2}
             = exp{p_ok(Y; s0) + σ²_ok(Y; s0)/2 − m_Y},   (5.55)
where m_Y is the Lagrange multiplier obtained with ordinary kriging of Y(s0) based on data Y. The bias-corrected MSPE (see e.g., Journel, 1980; David, 1988, p. 118) is
E[(p_olk(Z; s0) − Z(s0))²] = exp{2µ_Y + σ²_Y} exp{σ²_Y} × {1 + exp{−σ²_ok(Y; s0) + m_Y}(exp{m_Y} − 2)}.
Thus, unlike ordinary kriging, we need to estimate µ_Y and σ²_Y as well
as γY (·) in order to use lognormal kriging. Moreover, the optimality prop-
erties of polk (Z; s0 ) are at best unclear. Finding a predictor that minimizes
Var[p(Y; s0 ) − Y (s0 )] within the class of linear, unbiased predictors of Y (s0 )
(which in this case is pok (Y; s0 )), does not imply that polk (Z; s0 ) minimizes
Var[(p(Z; s0 ) − Z(s0 ))|Z] within the class of linear, unbiased predictors of
Z(s0 ).
The bias correction makes the ordinary lognormal kriging predictor sensi-
tive to departures from the lognormality assumption and to fluctuations in the
semivariogram (a criticism that applies to many nonlinear prediction meth-
ods and not just to lognormal kriging). Thus, some authors (e.g., Journel,
1980) have recommended calibration of polk (Z; s0 ), forcing the mean of kriged
predictions to equal the mean of the original Z data. This may be a useful
technique, but it is difficult to determine the properties of the resulting predic-
tor. Others (e.g., Chilès and Delfiner, 1999, p. 191) seem to regard mean unbi-
asedness as unnecessary, noting that exp{pok (Y; s0 )} is median unbiased (i.e.,
Pr(exp{pok (Y; s0 )} > Z(s0 )) = Pr(exp{pok (Y; s0 )} < Z(s0 )) = 0.5). Mar-
cotte and Groleau (1997) propose an interesting approach that works around
these problems. Instead of transforming the data, predicting Y (s0 ), and then
transforming back, they suggest predicting Z(s0 ) using the original data Z(s)
to obtain pok (Z; s0 ), and then transforming via p(Y; s0 ) = log{pok (Z; s0 )}.
Marcotte and Groleau (1997) suggest using E[Z(s0 )|p(Y; s0 )] as a predictor
of Z(s0 ). Using the properties of the lognormal distribution from Aitchison

and Brown (1957) given above, Marcotte and Groleau (1997) derive a compu-
tational expression for this conditional expectation that depends on µZ and
γZ (h), and is relatively robust to departures from the lognormality assump-
tion and to mis-specification of the semivariogram.
Although the theory of lognormal kriging has been developed and revis-
ited by many authors including Rendu (1979), Journel (1980), Dowd (1982),
and David (1988), problems with its practical implementation persist. David
(1988) gives several examples that provide some advice on how to detect and
correct problems with lognormal kriging and more modifications are provided
in Chilès and Delfiner (1999). Nonlinear spatial prediction is an area of active
research in geostatistics, and the last paragraph in Boufassa and Armstrong
(1989) seems to summarize the problems and the frustration: “The user of
geostatistics therefore is faced with the difficult task of choosing the most ap-
propriate stationary model for their data. This choice is difficult to make given
only information from a single realization. It would be helpful if statisticians
could devise a way of testing this.”

5.6.2 Trans-Gaussian Kriging

The lognormal distribution is nice, mathematically speaking, since its mo-


ments can be written in terms of the moments of an underlying Gaussian dis-
tribution. In many other applications, the transformation required to achieve
normality may not be the natural logarithm, and in such instances, it is dif-
ficult to obtain exact expressions relating the moments of the original data
to those of the transformed variable. Trans-Gaussian kriging, suggested by
Cressie (1993), is a more general approach to developing optimal predictors
for non-Gaussian spatial data.
Assume Z(s) = ϕ(Y (s)), where Y (s) follows a multivariate Gaussian distri-
bution, and the function ϕ is known. Again we assume that Y (·) is intrinsically
stationary with mean µY and semivariogram γY (h). We assume that µY is
unknown and use pok (Y; s0 ) in (5.21) as the predictor of Y (s0 ), although anal-
ogous derivations can be done using simple kriging. In this context, a natural
predictor of Z(s0 ) is p(Z; s0 ) = ϕ(pok (Y; s0 )), but we need to determine its
expected value in order to correct for any bias, and then derive the variance
of the resulting bias-corrected predictor. Cressie (1993, pp. 137–138) uses the
delta method to derive these properties. In what follows, we provide the details
underlying this development.
Expand ϕ(p_ok(Y; s0)) ≡ ϕ(Ŷ0) around µ_Y in a second-order Taylor series:
ϕ(p_ok(Y; s0)) ≈ ϕ(µ_Y) + ϕ'(µ_Y)(Ŷ0 − µ_Y) + (ϕ''(µ_Y)/2)(Ŷ0 − µ_Y)².
Taking expectations we find
E[p(Z; s0)] = E[ϕ(p_ok(Y; s0))] ≈ ϕ(µ_Y) + (ϕ''(µ_Y)/2) E[(Ŷ0 − µ_Y)²].   (5.56)
For unbiasedness, this should be equal to E[Z(s0)], which can be obtained by applying the same type of expansion to ϕ(Y(s0)), giving
E[Z(s0)] = E[ϕ(Y(s0))] ≈ ϕ(µ_Y) + (ϕ''(µ_Y)/2) E[(Y(s0) − µ_Y)²].   (5.57)
To make (5.56) equal to (5.57), we need to add
(ϕ''(µ_Y)/2) E[(Y(s0) − µ_Y)²] − (ϕ''(µ_Y)/2) E[(Ŷ0 − µ_Y)²] = (ϕ''(µ_Y)/2)(σ²_ok(Y; s0) − 2m_Y)
to p(Z; s0). Thus, the trans-Gaussian predictor of Z(s0) is
p_tg(Z; s0) = ϕ(p_ok(Y; s0)) + (ϕ''(µ_Y)/2)(σ²_ok(Y; s0) − 2m_Y).   (5.58)
The mean-squared prediction error of p_tg(Z; s0), based on just a first-order Taylor series expansion, is
E[(p_tg(Z; s0) − Z(s0))²] ≈ [ϕ'(µ_Y)]² σ²_ok(Y; s0).   (5.59)
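Assuming ordinary kriging has already been carried out on the transformed scale, the trans-Gaussian back-transformation is a direct application of (5.58) and (5.59); the derivative functions d1_phi and d2_phi (ϕ' and ϕ'') must be supplied by the user and the names below are ours:

    def trans_gaussian(pok_Y, s2ok_Y, m_Y, mu_Y, phi, d1_phi, d2_phi):
        # pok_Y, s2ok_Y, m_Y: ordinary kriging prediction, kriging variance, and
        # Lagrange multiplier on the transformed scale; mu_Y: mean of Y
        pred = phi(pok_Y) + 0.5 * d2_phi(mu_Y) * (s2ok_Y - 2.0 * m_Y)   # (5.58)
        mspe = d1_phi(mu_Y) ** 2 * s2ok_Y                               # (5.59)
        return pred, mspe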

Statisticians are well aware of many different families of transformations


that can be used to transform a given set of data to a Gaussian distribution.
Common examples include variance stabilizing transformations such as the
square-root for the Poisson distribution and the arcsine for the Binomial dis-
tribution (see e.g., Draper and Smith, 1981, pp. 237–238), the Box-Cox family
of power transformations (Box and Cox, 1964), the Freeman-Tukey transfor-
mation (Freeman and Tukey, 1950), and monotone mappings such as the logit
or probit for Binomial proportions and the log-transform for counts. There is
also an empirical way to infer ϕ that is commonly-used in geostatistics. This
transformation, called the anamorphosis function (Rivoirard, 1994, pp. 46–
48), is obtained by matching the percentiles of the data to those of a standard
Gaussian distribution. If we consider obtaining ϕ−1 instead of ϕ (arguably,
the more natural procedure), this transformation is called the normal scores
transformation (Deutsch and Journel, 1992, p. 138).
Let Φ(y) be the standard Gaussian cumulative distribution function and
suppose F (z) is the cumulative distribution function of the data. Then, the
p-quantile of F (z) is that value zp such that F (zp ) = p. Given p, the corre-
sponding p-quantile from the standard Gaussian distribution is yp , defined as
Φ(yp ) = p. Thus, equating these for each p gives
zp = F −1 (Φ(yp )) ≡ ϕ(yp ) anamorphosis
yp = Φ−1 (F (zp )) ≡ ϕ−1 (zp ) normal scores. (5.60)
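A simple empirical version of the normal scores transformation in (5.60), using averaged ranks and the plotting position (rank − 0.5)/n as the estimate of F(z_p), is sketched below:

    import numpy as np
    from scipy import stats

    def normal_scores(z):
        # empirical normal scores: y_p = Phi^{-1}(F(z_p)), with F estimated by
        # the plotting position (rank - 0.5)/n and ties handled by average ranks
        z = np.asarray(z, dtype=float)
        p = (stats.rankdata(z) - 0.5) / len(z)
        return stats.norm.ppf(p)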

Example 5.11 Trans-Gaussian kriging of rainfall amounts. In this


example, we investigate the effects of departures from a Gaussian assumption
and the effects of plug-in estimators of the semivariogram model parameters


on the quality of predictions obtained from kriging. We begin by following the
example given in De Oliveira, Kedem, and Short (1997) concerning rainfall
measurements near Darwin, Australia. The data are rainfall amounts mea-
sured in millimeters (mm) accumulated over a 7-day period from the 76th –
82nd day of 1991 at 24 rainfall collection stations (Figure 5.12). A histogram

Figure 5.12 Locations of the 24 rainfall monitoring stations and associated weekly rainfall amounts (mm). Data kindly provided by Dr. Victor De Oliveira, Department of Mathematical Sciences, University of Arkansas.

of the weekly rainfall amounts (Figure 5.13) suggests that linear methods of
spatial prediction may not be the best choice for interpolating the rainfall
amounts. De Oliveira et al. (1997) consider a Box-Cox transformation
g_λ(z) = (z^λ − 1)/λ    if λ ≠ 0
g_λ(z) = log(z)         if λ = 0,
and estimate λ using maximum likelihood. The distribution of the Box-Cox
transformed data using λ̂ = −0.486 is shown in Figure 5.14. This transfor-
mation appears to have over-compensated for the skewness, and so we also
consider the normal scores transformation given in (5.60). A histogram of
the normal scores is shown in Figure 5.15. To use Trans-Gaussian kriging de-
Figure 5.13 Histogram of weekly rainfall amounts (mm).

Figure 5.14 Histogram of transformed weekly rainfall amounts using the Box-Cox transformation.

scribed in §5.6.2, we need to compute the empirical semivariogram of the data


on the transformed scale. The empirical semivariogram of the normal scores
is shown in Figure 5.16. The empirical semivariograms of the Box-Cox trans-
formed values and the original rainfall values (not shown here) have similar
shapes, but different sills reflecting their differing scales. While it is difficult to
Figure 5.15 Histogram of transformed weekly rainfall amounts using the normal scores transformation.

adequately estimate a semivariogram with only 24 data values, the empirical


semivariogram suggests there might be spatial autocorrelation in the rainfall
amounts and their transformed values (Figure 5.16).
Predictions of rainfall amounts at locations without a collection station can
be obtained in a variety of ways. We consider Trans-Gaussian kriging using the
Box-Cox transformed values, both with and without the bias correction factor
in (5.58), since obtaining the Lagrange multiplier needed for this correction is
not routinely provided by most software programs that implement kriging. We
also consider a normal scores version of kriging in which the data are trans-
formed and back-transformed as with Trans-Gaussian kriging, but given the
nature of ϕ−1 (z), back-transformation is more involved and bias corrections
cannot be easily made (if at all). We also consider ordinary kriging since this
is the easiest spatial prediction method to implement.
We chose a parametric exponential semivariogram model (without a nugget
effect) in the parameterization
C(σ², ρ; ||si − sj||) = σ² ρ^{||si − sj||}

to describe the shape of the spatial autocorrelation depicted in Figure 5.16.


Since using plug-in estimates of the autocorrelation parameters may introduce
bias in the kriging standard errors, we adjust the predictions using the Prasad-
Rao MSE estimator in (5.53). To provide some insight into how the different
methods perform, we considered a cross validation study as in De Oliveira et
al. (1997). For each data value (either a transformed value or a given rainfall
amount), we delete this value and use the remaining 23 values to predict it and
NONLINEAR PREDICTION 275

20 21 31 25 30 19 21

2.0

1.5
Semivariance

1.0

0.5

0.0
2 4 6 8
Distance (km)

Figure 5.16 Empirical semivariogram of normal score transformed values. Values


across top represent number of pairs in lag class.

construct a prediction interval (PI) for the omitted, true value. Thus, we can
compare each predicted value to the true value omitted from the analysis and
obtain measures of bias and the percentage of prediction intervals containing
the true values. The covariance parameters and the overall mean were re-
estimated for each of the 24 cross validation data sets.
With ordinary kriging, we describe the spatial autocorrelation in the orig-
inal rainfall amounts using both an exponential semivariogram in the above
parameterization and a spherical semivariogram model to examine the impact
of the semivariogram model.
Predictions for Trans-Gaussian kriging based on the Box-Cox transformed
scores were obtained using (5.58) and the standard errors were obtained us-
ing (5.59). We also obtained a biased Trans-Gaussian predictor obtained by
ignoring the second term in (5.58) that depends on the Lagrange multiplier,
mY . Note that ϕ(·) in §5.6.2 pertains to the original rainfall amounts, Z(si ).
The transformed amounts are Y (s) = ϕ−1 (Z(s)) so that gλ (z) = ϕ−1 (z).
The normal scores transformation matches the cumulative probabilities that
define the distribution function of Z(s) to those of a standard normal distri-
bution. Thus, the transformed scores are just the corresponding percentiles
of the associated standard normal distribution. However, transforming back
after prediction at a new location is tricky, since the predicted values will
not coincide with actual transformed scores for which there is a direct link
back to the original rainfall amounts. Thus, for predictions that lie within
two consecutively-ranked values, predictions on the original scale are obtained
through interpolation. Rather than back-transforming the standard errors and


creating prediction intervals on the original scale as with Trans-Gaussian krig-
ing, we back-transformed the prediction intervals constructed from kriging
with normal scores. When back-transforming prediction intervals based on
normal scores to obtain prediction intervals on the original scale, the predic-
tion interval bounds may be less than the minimum and exceed the maximum
of the normal score transformed values. In such cases, we used linear extrapo-
lation in the tails. Deutsch and Journel (1992, pp. 131–135) provide a detailed
discussion of this issue and suggest more sophisticated approaches. We delib-
erately chose a conservative approach that does not assume long tails at either
end of the distribution of the original rainfall amounts.
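
To make the normal scores mechanics concrete, the following sketch shows one way
the transformation and its interpolation-based back-transformation could be coded.
It is written in Python with NumPy/SciPy; the function names are ours and are not
tied to any particular kriging software. Note that np.interp holds the endpoints
constant outside the range of the observed scores, a flat (and even more
conservative) rule than the linear tail extrapolation used in the analysis above.

    import numpy as np
    from scipy.stats import norm, rankdata

    def normal_scores(z):
        # match empirical cumulative probabilities to standard normal percentiles
        p = (rankdata(z) - 0.5) / len(z)
        return norm.ppf(p)

    def back_transform(y_new, z, y):
        # map predicted scores back to the original scale by interpolation
        # between consecutively ranked transformed values
        order = np.argsort(y)
        return np.interp(y_new, y[order], z[order])

    # Usage with the 24 rainfall amounts stored in a NumPy array `rain`:
    # scores = normal_scores(rain)
    # z_pred = back_transform(0.3, rain, scores)
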
Restricted maximum likelihood was used to estimate the spatial autocorre-
lation parameters. The mean, µY , needed for Trans-Gaussian kriging was es-
timated using generalized least squares. Prediction intervals (PIs) were based
on the (1−α/2) percentage point from the standard normal distribution (1.96
for a 95% PI). Since there are only 24 measurements available for estimation
and prediction, we also used percentage points from a t-distribution on 22
degrees of freedom to construct the prediction intervals. In addition, since
the use of plug-in estimates for the variance and the spatial autocorrelation
parameters can affect these intervals, we also combined the Prasad-Rao MSE
estimator with adjusted degrees of freedom according to Kenward and Roger
(1997). The Kenward-Roger adjustment is described in §6.2.3.1.
To assess the performance of each prediction method, we computed the
following (a short computational sketch follows this list):
• The average relative bias:

    \mathrm{avg}\left[ \frac{\hat{Z}(s_0) − Z(s_0)}{Z(s_0)} \right] × 100%.

  This should be close to zero for methods that are unbiased.
• The average relative absolute error:

    \mathrm{avg}\left[ \frac{|\hat{Z}(s_0) − Z(s_0)|}{Z(s_0)} \right] × 100%.

  This should be relatively small for methods that are accurate (precise as
  well as unbiased).
• The average percentage increase in prediction standard errors due to the
Prasad-Rao adjustment;
• The percentage of 95% prediction intervals containing the true value. This
should be close to 95%.
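
The sketch below, in Python with NumPy, computes these cross-validation summaries
from vectors of true values, leave-one-out predictions, and prediction standard
errors. The function name and arguments are ours, not part of any particular
software.

    import numpy as np

    def cv_summary(z_true, z_pred, se_pred, crit=1.96):
        # average relative bias and average relative absolute error (in %)
        rel = (z_pred - z_true) / z_true
        bias = 100.0 * np.mean(rel)
        abs_err = 100.0 * np.mean(np.abs(rel))
        # empirical coverage of nominal 95% prediction intervals (in %)
        covered = np.abs(z_true - z_pred) <= crit * se_pred
        coverage = 100.0 * np.mean(covered)
        return bias, abs_err, coverage

    # cv_summary(rain, rain_pred, rain_se)               # Gaussian quantile
    # cv_summary(rain, rain_pred, rain_se, crit=2.074)   # t quantile, 22 df
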
The results of the cross validation study are shown in Table 5.4.
Table 5.4 Results from cross-validation of predictions of rainfall amounts.

                    Average    Average     % Increase in       Coverage of
                    Relative   Absolute    Std. Error from     Gaussian
 Method             Bias       Error       PR-adjustment       95% PI
 OK (EXP)            3.53      15.57        0.04               95.8
 OK (SPH)            3.42      15.12        1.75               95.8
 TGK (no b.c.)       0.57      15.27        0.55               95.8
 TGK (b.c.)          5.13      16.73        0.55               95.8
 NS                 −0.94      13.36        N/A                91.7

From Table 5.4 we see that all empirical PI coverages were surprisingly
close to nominal. Also, surprisingly, the bias-corrected version of the Trans-
Gaussian kriging predictor has the largest bias and the largest absolute rela-
tive error. Predictions from the bias-corrected version of Trans-Gaussian krig-
ing are all higher than those obtained without this bias correction. Cressie
(1993, p. 137) notes that the approximations underlying the derivation of the
bias-corrected Trans-Gaussian predictor rely on the kriging variance of the
transformed variables being small. On the average, for the cross-validation
study presented here, this variance was approximately 0.035, so perhaps this
is not “small” enough. Also, the Box-Cox transformation tends to over-correct
the skewness in the distribution of the rainfall amounts and this may be im-
pacting the accuracy of the predictions from Trans-Gaussian kriging (Figure
5.14). The predictions from the normal scores transformation were relatively
(and surprisingly) accurate, considering no effort was made to adjust for bias
and the amount of information lost during back-transformation. The Prasad-
Rao/Kenward-Roger adjustment did change the standard errors slightly, but
not enough to result in different inferences. However, in this example, the
spatial autocorrelation is very strong ('ρ is between 0.92 and 0.98). Follow-
ing the recommendation of Zimmerman and Cressie (1992), on page 267, the
adjustment may not be needed in this case.
The empirical probability of coverage from the normal scores was below
nominal, but we used a fairly conservative method for back transformation
in the tails of the distribution that may be affecting these results. Using a
percentage point from a t-distribution to construct these intervals had little
impact on their probability of coverage. Ordinary kriging seems to do as well
as the other techniques and certainly requires far less computational (and
cerebral) effort. However, the distribution of the rainfall amounts (Figure 5.13)
is not tremendously skewed or long-tailed, so the relative accuracy of ordinary
kriging in this example may be somewhat misleading. On the other hand,
perhaps the effort spent in correcting departures from assumptions should
be directly proportional to their relative magnitude and our greatest efforts
should be spent on situations that show gross departures from assumptions.

5.6.3 Indicator Kriging

The spatial prediction techniques described in previous sections are geared


towards finding good approximations to E[Z(s0 )| Z(s)]. Alternatively, if we
can estimate Pr(Z(s0 ) ≤ z|Z(s1 ), . . . , Z(sn )) ≡ G(s0 , z |Z(s)), the conditional
probability distribution at each location, we can also obtain E[Z(s0 )|Z(s)]
and E[g(Z(s0 ))|Z(s)]. Switzer (1977) proposed the idea of using indicator
functions of the data to estimate a stationary, univariate distribution function,
F (z) = Pr(Z(s) ≤ z). This idea was then extended by Journel (1983) for
nonparametric estimation and mapping of Pr(Z(s0 ) ≤ z|Z(s1 ), · · · , Z(sn ))
and is now known as indicator kriging.
Assume that the process {Z(s) : s ∈ D ⊂ R^d} is strictly stationary and
consider an indicator transform of Z(s)

    I(s, z) = 1 if Z(s) ≤ z, and 0 otherwise.    (5.61)
This function transforms Z(s) into a binary process whose values are de-
termined by the threshold z. Since E[I(s0 , z)] = Pr(Z(s0 ) ≤ z) = F (z) is
unknown, we can use ordinary kriging to predict I(s0 , z) from the indicator
data I(s, z) = [I(s1 , z), · · · , I(sn , z)]! . This gives
pik (I(s, z); I(s0 , z)) = λ(z)! I(s, z), (5.62)
where the weight vector λ(z) is obtained from (5.19) using the indicator semi-
variogram, γ_I(h) = ½Var[I(s + h, z) − I(s, z)]. Since ordinary kriging of Z(s_0)
provides an approximation to E[Z(s0 )|Z(s1 ), · · · , Z(sn )], indicator kriging pro-
vides an approximation to
E[I(s0 , z)|I(s1 , z) · · · , I(sn , z)] = Pr(Z(s0 ) ≤ z|I(s1 , z) · · · , I(sn , z)).
Note, however, that Pr(Z(s0 ) ≤ z|I(s1 , z) · · · , I(sn , z)) is not the same as
Pr(Z(s0 ) ≤ z|Z(s1 ), · · · , Z(sn )); much information is lost by making the in-
dicator transform. Thus, in theory, the indicator kriging predictor in (5.62)
is a crude estimator of the conditional probability of interest. However, in-
dicator semivariograms contain more information than may be apparent at
first glance. First, they contain information about the bivariate distributions
since Var[I(s + h, z) − I(s, z)] = 2{F(z) − Pr(Z(s + h) ≤ z, Z(s) ≤ z)}. Second, the
indicator functions do contain some information from one threshold to the
next since they are defined cumulatively. Finally, Goovaerts (1994, 1997) has
shown that, in practice, more complex indicator methods that more directly
account for information available across all thresholds (e.g., indicator cokrig-
ing, see Goovaerts, 1997, pp. 297–299) offer little improvement over the more
simple indicator kriging approach.
In many applications, such as environmental remediation and risk analysis,
we are interested in exceedance probabilities, e.g., Pr(Z(s0 ) > z|Z(s)). This
probability can be estimated using indicator kriging based on the complement
of (5.61), I c (s, z) = 1 − I(s, z). In other applications, the data may already be

binary (e.g., presence/absence records). In applications such as these, map-


ping the estimates of Pr(Z(s0 ) > z|Z(s)) is often the ultimate inferential goal.
However, indicator kriging is also used to provide nonparametric predictions
of any functional g(Z(s0 )) by using K different indicator variables defined at
thresholds zk , k = 1, · · · , K. This produces an estimate of the entire condi-
tional distribution at each location s0 . Thus, for each threshold zk , indicator
kriging based on the corresponding indicator data I(s1 , zk ), · · · , I(sn , zk ) gives
an approximation to Pr(Z(s0 ) ≤ zk |Z(s1 ), · · · , Z(sn )). Given this approxi-
mate conditional distribution, denoted here as F̂ (s0 , z |Z(s)), a predictor of
g(Z(s_0)) is

    p(Z; g(Z(s_0))) = \int g(z) \, d\hat{F}(s_0, z \,|\, Z(s)).

When g(Z(s_0)) = Z(s_0), this is called the “E-type estimate” of Z(s_0) (Deutsch
and Journel, 1992, p. 76). A measure of uncertainty is given by

    σ²(s_0) = \int \left[ p(Z; g(Z(s_0))) − g(z) \right]² d\hat{F}(s_0, z \,|\, Z(s)).

Computation of this nonparametric predictor of g(Z(s0 )) requires that K


semivariograms be estimated and modeled. Median indicator kriging (Journel,
1983) alleviates this tedious chore by using a common semivariogram model
based on the median threshold value, Pr(Z(s) < zM ) = 0.5. However, this
is only valid if all the indicator semivariograms are proportional to one an-
other (Matheron, 1982; Goovaerts, 1997, p. 304). A more troublesome issue
is the fact that F'(s0 , z |Z(s)) need not satisfy the theoretical properties of a
cumulative distribution function: it may be negative, exceed one, and is not
necessarily monotonic. These “order-relation problems” are circumvented in
practice by using a variety of “fix-ups” ranging from clever modeling strate-
gies to brute-force alteration of any offending estimates (Deutsch and Journel,
1992, p. 77–81). While the cause of these problems is often blamed on nega-
tive indicator kriging weights or the lack of data between two thresholds, the
basic problem is two-fold. First, the estimator is not constrained to satisfy
these properties, and second, there is no guarantee that any joint probability
distribution exists with the specified marginal and bivariate distributions (see
§5.8).
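
As a rough computational illustration, suppose indicator kriging has already
produced estimates F̂(s_0, z_k | Z(s)) at K thresholds. The Python sketch below
applies one simple order-relation correction (clipping to [0, 1] and averaging an
upward and a downward monotone pass) and then forms an E-type estimate from the
corrected distribution. The function names, and the use of class midpoints as
representative values for each probability class, are illustrative choices of ours
rather than a standard recipe.

    import numpy as np

    def order_relation_fix(F_hat):
        # clip to [0,1], then average a nondecreasing envelope from below
        # (running maximum) and one from above (running minimum)
        F = np.clip(F_hat, 0.0, 1.0)
        up = np.maximum.accumulate(F)
        down = np.minimum.accumulate(F[::-1])[::-1]
        return 0.5 * (up + down)

    def e_type_estimate(thresholds, F_hat):
        # discrete approximation to the integral of z with respect to F(s0, z | Z(s))
        F = order_relation_fix(np.asarray(F_hat, dtype=float))
        z = np.asarray(thresholds, dtype=float)
        probs = np.diff(np.concatenate(([0.0], F, [1.0])))
        reps = np.concatenate(([z[0]], 0.5 * (z[:-1] + z[1:]), [z[-1]]))
        return float(np.sum(probs * reps))
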

5.6.4 Disjunctive Kriging

Disjunctive kriging is a method for nonlinear spatial prediction proposed by


Matheron (1976) to make more use of the information contained in indicator
variables. The method is based on a technique known as disjunctive coding.
Let {R_k} be a partition of ℝ, i.e., R_i ∩ R_j = ∅ for i ≠ j and ∪_k R_k = ℝ. Now
define indicator variables

    I_k(s) = 1 if Z(s) ∈ R_k, and 0 otherwise.

If the partition is sufficiently fine, i.e., there are many subsets Rk , then any
function g(Z(s0 )) can be approximated by a linear combination of these indi-
cator functions
g(Z(s0 )) = g1 I1 (s0 ) + g2 I2 (s0 ) + g3 I3 (s0 ) + · · · .

In the situation we describe here, each Ik (s0 ) is unknown, but we can obtain
a predictor of any Ik (s0 ) using indicator kriging of the data associated with
the k th set {Ik (si ), i = 1, · · · , n}. However, as discussed above, this does not
make optimal use of all the indicator information. Another approach is to
obtain a predictor of Ik (s0 ) using not only the data associated with the k th
set, but also the data associated with all of the other sets, i.e., use data
{I1 (si ), i = 1, · · · , n}, · · · , {Ik (si ), i = 1, · · · , n}, · · ·. Thus, if we use a linear
combination of all available indicator data to predict each Ik (s0 ), the predictor
can be written as

    \hat{I}_k(s_0) = \sum_i \sum_k λ_{ik} I_k(s_i)

and a predictor of g(Z(s_0)) is then

    p(Z; g(Z(s_0))) = \sum_i \sum_k g_{ki} I_k(s_i) ≡ \sum_i g_i(Z(s_i)).    (5.63)

This is the general form of the disjunctive kriging predictor.


The key to disjunctive kriging is the determination of the functions gi (Z(si )).
These can be obtained using all the indicator information as described above,
but in any practical application, this requires estimation and modeling of co-
variances and cross-covariances of all the indicator variables. Thus, in practice,
disjunctive kriging relies on models that expand a given function in terms of
other functions that are uncorrelated. In geostatistics, this is called factoriz-
ing (Rivoirard, 1994, p. 11). Vector space theory, and Hilbert space theory in
particular, gives us a very elegant way to obtain such functions through the
use of orthogonal polynomials.

5.6.4.1 Orthonormal Polynomials for the Gaussian Distribution

Consider a system of functions {χ_p(x), p = 0, 1, · · ·} that forms an orthonormal
basis in the space L²_w associated with the nonnegative weight function w(x).
Then, any measurable function g(x) for which \int g(x)² w(x) \, dx < ∞ can be
expressed as a linear combination of these basis functions, i.e.,

    g(x) = \sum_{p=0}^{∞} a_p χ_p(x).

The basis functions have the property

    \int χ_p(x) χ_m(x) w(x) \, dx = δ_{pm},

where δ_{pm} = 1 if p = m and is equal to 0 otherwise.



The weight function determines the system {χ_p(x)}. For example, taking
w(x) = f(x) = \frac{1}{\sqrt{2π}} e^{−x²/2} over the interval (−∞, ∞), the standard Gaussian
density function, gives the system of Chebyshev-Hermite polynomials (Stuart
and Ord, 1994, pp. 226–228)

    H_p(x) = (−1)^p e^{x²/2} \frac{d^p}{dx^p} e^{−x²/2}.    (5.64)

They are called polynomials since H_p(x) is in fact a polynomial of degree p:

    H_0(x) = 1
    H_1(x) = x
    H_2(x) = x² − 1
      ⋮
    H_{p+1}(x) = x H_p(x) − p H_{p−1}(x).
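
The recursion makes these polynomials easy to generate numerically. The short
Python sketch below (function names are ours) builds H_0, ..., H_P and their
standardized versions η_p(x) = H_p(x)/√(p!); a quick Monte Carlo check of the
orthonormality relations given next is included as a comment.

    import numpy as np
    from math import factorial

    def hermite(x, P):
        # H_0, ..., H_P via H_{p+1}(x) = x H_p(x) - p H_{p-1}(x)
        x = np.asarray(x, dtype=float)
        H = [np.ones_like(x), x.copy()]
        for p in range(1, P):
            H.append(x * H[p] - p * H[p - 1])
        return H[:P + 1]

    def eta(x, P):
        # standardized polynomials eta_p(x) = H_p(x) / sqrt(p!)
        return [h / np.sqrt(factorial(p)) for p, h in enumerate(hermite(x, P))]

    # x = np.random.default_rng(1).standard_normal(200000)
    # e = eta(x, 4)
    # np.mean(e[2] * e[3]), np.mean(e[3] ** 2)   # approximately 0 and 1
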

Chebyshev-Hermite polynomials are orthogonal, but not orthonormal:

    \frac{1}{\sqrt{2π}} \int_{−∞}^{∞} H_p(x) H_m(x) e^{−x²/2} \, dx = 0 for m ≠ p, and p! for m = p.

In terms of statistical concepts such as expectation and variance, the stan-
dardized Chebyshev-Hermite polynomials, η_p(x) = H_p(x)/\sqrt{p!}, satisfy

    E[η_p(x)] = \int η_p(x) f(x) \, dx = 0

    Var[η_p(x)] = \int (η_p(x))² f(x) \, dx = 1

    E[η_p(x) η_m(x)] = 0 for m ≠ p, and 1 for m = p.
Thus, the polynomials ηp (x) form an orthonormal basis on L2 with respect to
the standard Gaussian density and we can expand any measurable function
g(x) as

    g(x) = \sum_{p=0}^{∞} b_p η_p(x).

The coefficients {b_p} can be obtained from the orthogonality properties of the
polynomials. Since E[g(x) η_m(x)] = \sum_{p=0}^{∞} b_p E[η_p(x) η_m(x)] ≡ b_m,

    b_p = \int g(x) η_p(x) f(x) \, dx.    (5.65)

5.6.4.2 Disjunctive Kriging with Gaussian Data

Suppose that we now want to predict g(Z(s0 )) using the predictor given in
(5.63). Then, from the above development,

(
g(Z(s0 )) = b0p ηp (Z(s0 )), (5.66)
p=0

and

(
gi (Z(si )) = bip ηp (Z(si )).
p=0
If we could predict η_p(Z(s_0)) from the available data, then we would have a
predictor of g(Z(s_0)). Predicting η_p(Z(s_0)) from η_p(Z(s_i)) is now an easier
task since η_p(Z(s)) and η_m(Z(s)) are uncorrelated for p ≠ m. Thus, we can use
ordinary kriging based on the data η_p(Z(s_1)), · · · , η_p(Z(s_n)) to predict
η_p(Z(s_0)). To obtain the kriging equations, we need the covariance between
η_p(Z(s + h)) and η_p(Z(s)). Matheron (1976) showed that if Z(s + h) and Z(s) are
bivariate Gaussian with correlation function ρ(h), then for p ≥ 1 this covariance is

    Cov[η_p(Z(s + h)), η_p(Z(s))] = [ρ(h)]^p.

Thus, the optimal predictor of η_p(Z(s_0)) is (Chilès and Delfiner, 1999, p. 393)

    \hat{η}_p(Z(s_0)) = p({η_p(Z(s_i))}; η_p(Z(s_0))) = \sum_{i=1}^{n} λ_{pi} η_p(Z(s_i)),  p = 1, 2, · · · ,    (5.67)

where the λ_{pi} satisfy

    Rλ = ρ.    (5.68)

Here, R is an n × n matrix with elements [ρ_{ij}(h)]^p and ρ is an n × 1 vector with
elements [ρ_{0i}(h)]^p. The kriging variance associated with \hat{η}_p(Z(s_0)) is (Chilès
and Delfiner, 1999, p. 394)

    σ²_η(s_0) = 1 − λ′ρ.    (5.69)
The disjunctive kriging predictor of g(Z(s0 )) is obtained by substituting for
ηp (Z(s0 )) in (5.66) η'p (Z(s0 )) from (5.67) using weights from (5.68):

(
pdk (Z; g(Z(s0 ))) = b0p η'p (Z(s0 )). (5.70)
p=0

The kriging variance associated with the disjunctive kriging predictor is



(
2
σdk (Z; g(Z(s0 ))) = b20p ση2 (s0 ). (5.71)
p=1

Often, the coefficients b_p will be zero in the Hermitian expansion (5.66). Also,
the correlation function [ρ(h)]p tends to that of an uncorrelated white noise
process as p becomes large. Thus, in practice only a few (usually less than a
dozen, Rivoirard, 1994, p. 43) Hermite polynomials need to be predicted for
disjunctive kriging.
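
The steps in (5.66)–(5.71) translate directly into code. The Python sketch below
assumes the data have already been transformed to standard normal scores y, that
R0 and r0 hold the fitted correlations ρ(s_i, s_j) and ρ(s_0, s_i), that b
contains the Hermite coefficients b_{0p} of the target functional from (5.65),
and that eta() is the helper defined earlier in this section. All names are ours;
this illustrates the mechanics and is not a reference implementation.

    import numpy as np

    def disjunctive_krige(y, R0, r0, b, eta):
        # Gaussian disjunctive kriging of g(Z(s0)) as in (5.70)-(5.71)
        P = len(b) - 1
        e = eta(y, P)                    # eta_p evaluated at the data
        pred, var = b[0], 0.0            # eta_0 is identically one
        for p in range(1, P + 1):
            Rp, rp = R0 ** p, r0 ** p    # element-wise powers [rho]^p
            lam = np.linalg.solve(Rp, rp)          # solves (5.68)
            pred += b[p] * (lam @ e[p])            # predicted eta_p(Z(s0)) term
            var += b[p] ** 2 * (1.0 - lam @ rp)    # (5.69) weighted by b_{0p}^2
        return pred, var
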

Example 5.12 The key to disjunctive kriging is obtaining the expansion


in (5.66). This is easier than it first appears. For example, suppose we are
interested in g(Z(s_0)) = Z(s_0). The Hermitian expansion of Z(s_0) is

    Z(s_0) = \sum_{p=0}^{∞} b_p \frac{H_p(Z(s_0))}{\sqrt{p!}},

where the coefficients satisfy (5.65). Thus b_0 = \int z H_0(z) f(z) \, dz = 0, since
H_0(x) = 1 and E[Z(s_0)] = 0. Similarly, b_1 = \int z H_1(z) f(z) \, dz = Var[Z(s)] = 1.
For p ≥ 2,

    b_p = \int z H_p(z) f(z) \, dz = \int H_1(z) H_p(z) f(z) \, dz = 0.

Thus, the Hermitian expansion of Z(s_0) is just H_1(Z(s_0)).


Now consider g(Z(s_0)) = I(s_0, z_k). It is easy to determine the first coefficient as

    b_0 = \int_{−∞}^{z_k} H_0(z) f(z) \, dz = F(z_k).

To determine the other coefficients, note that from (5.64), the Hermite poly-
nomials also satisfy

    H_p(x) f(x) = (−1)^p \frac{d}{dx} \left( H_{p−1}(x) f(x) \right).

Using this to determine b_p gives (Chilès and Delfiner, 1999, p. 641)

    b_p = \int_{−∞}^{z_k} \frac{H_p(z)}{\sqrt{p!}} f(z) \, dz
        = \int_{−∞}^{z_k} \frac{(−1)^p}{\sqrt{p!}} \frac{d}{dx} \left( H_{p−1}(x) f(x) \right) dx
        = (−1)^p \frac{H_{p−1}(z_k) f(z_k)}{\sqrt{p!}}.

Thus,

    I(s_0, z_k) = F(z_k) + f(z_k) \sum_{p=1}^{∞} (−1)^p \frac{H_{p−1}(z_k) H_p(Z(s_0))}{p!}.    (5.72)

Rivoirard (1994) gives some excellent elementary examples that can be per-
formed with a calculator to show how these expansions and disjunctive kriging
work in practice with actual data.

5.6.4.3 Disjunctive Kriging with Non-Gaussian Data

If the data do not follow a Gaussian distribution, they can be transformed


using the anamorphosis or normal scores transformations described in §5.6.2.
Unfortunately, not all data can be transformed to have a Gaussian distribu-
tion. For example, if the data are discrete or their distribution is very skewed,

a Gaussian assumption even on the transformed scale may not be realistic.


Thus, we may want to consider other probability measures, not just the stan-
dard Gaussian, and thus, other orthonormal polynomials. For example, if our
weight function is w(x) ∝ e^{−x} x^{α} on (0, ∞), we may use the generalized La-
guerre polynomials that form an orthogonal basis in L² with respect to the
Gamma distribution. More generally, Matheron (1984) developed isofactorial
models that have the following general form

    F_{i,j}(dx_i, dx_j) = \sum_{m=0}^{∞} T_m(i, j) χ_m(x_i) χ_m(x_j) F(dx_i) F(dx_j),

where Fi,j (dxi , dxj ) is a bivariate distribution with marginals F (dxi ) and
F (dxj ), and the χm (z) are orthonormal polynomials with respect to some
probability measure F (dx). In kriging the polynomials, the covariances needed
for the kriging equations are given by the Tm (i, j). These are inferred from as-
sumptions pertaining to the bivariate distribution of the pairs (Z(si ), Z(sj )).
For example, as we noted above, if (Z(si ), Z(sj )) is bivariate Gaussian with
correlation function ρ(h), then Tm (i, j) = [ρ(||i − j||)]m . However, to actually
predict the factors, we need to know (and parametrically model) Tm (h). The
general form of Tm (h) has been worked out in special cases (see Chilès and
Delfiner, 1999, pp. 398–413), but many of the models seem contrived, or there
are undesirable constraints on the form of the Tm (i, j) = [ρ(||i − j||)]m needed
to ensure a valid bivariate distribution. Thus, Gaussian disjunctive kriging
remains the isofactorial model that is most commonly used in practice.

5.7 Change of Support

In the previous sections we have assumed that the data were located at
“points” within a spatial domain D and that the inferential goal was pre-
diction at another “point” in D. However, spatial data come in many forms.
Instead of measurements associated with point locations, we could have mea-
surements associated with lines, areal regions, surfaces, or volumes. In geology
and mining, observations often pertain to rocks, stratigraphic units, or blocks
of ore that are three dimensional. The inferential goal may also not be lim-
ited to point predictions. We may want to predict the grade of a volume of
ore or estimate the probability of contamination in a volume of soil. Data
associated with areal regions are particularly common in geographical studies
where counts or rates are obtained as aggregate measures over geopolitical re-
gions such as counties, Census tracts, and voting districts. In many instances,
spatial aggregation is necessary to create meaningful units for analysis. This
latter aspect was perhaps best described by Yule and Kendall (1950, p. 312),
when they stated “... geographical areas chosen for the calculation of crop
yields are modifiable units and necessarily so. Since it is impossible (or at
any rate agriculturally impractical) to grow wheat and potatoes on the same
piece of ground simultaneously we must, to give our investigation any mean-
ing, consider an area containing both wheat and potatoes and this area is

modifiable at choice.” This problem is now known as the modifiable areal


unit problem (MAUP) (Openshaw and Taylor, 1979).
The MAUP is comprised of two interrelated problems. The first occurs when
different inferences are obtained when the same set of data is grouped into
increasingly larger areal units. This is often referred to as the scale effect or
aggregation effect. The second, often termed the grouping effect or the
zoning effect, arises from the variability in results due to alternative forma-
tions of the areal units that produce units of different shape or orientation
at the same or similar scales (Openshaw and Taylor, 1979; Openshaw, 1984;
Wong, 1996). We illustrated the implications of the zoning effect using the
Woodpecker data in §3.3.4. These situations, and the MAUP in general, are
special cases of what is known as the change of support problem (COSP)
in geostatistics. The term support includes the geometrical size, shape, and
spatial orientation of the units or regions associated with the measurements
(see e.g., Olea, 1991). Changing the support of a variable (typically by av-
eraging or aggregation) creates a new variable. This new variable is related
to the original one, but has different statistical and spatial properties. For
example, average values are not as variable as point measurements. When we
have statistically independent data, deriving the variance of their average is
relatively easy. When the data are spatially dependent, inferring this variance
is more difficult. It depends on both the block itself and on the variability in
the point measurements. The problem of how the spatial variation in one vari-
able associated with a given support relates to that of the other variable with
a different support is called the change of support problem. Many of the
statistical solutions to the change of support problem can be traced back to
Krige’s “regression effect” and subsequent corrections used in mining blocks
of ore in the 1950’s (Krige, 1951). From the beginning, the field of geostatistics
has incorporated solutions to change of support problems, beginning with the
early work of Matheron (1963). In the following sections, we describe some
common change of support problems and their solutions. A more detailed de-
scription of the change of support problem and recent statistical solutions can
be found in Gotway and Young (2002).

5.7.1 Block Kriging

Consider the process {Z(s) : s ∈ D ⊂ Rd }, where Z(s) has mean µ and


covariance function Cov[Z(u), Z(v)] = C(u, v) for u, v in D. Suppose that
instead of predicting Z(s0 ) from data Z(s1 ), · · · , Z(sn ), we are interested in
predicting the average value in a particular region B (block) of volume |B|,
    g(Z(s_0)) = Z(B) = \frac{1}{|B|} \int_B Z(s) \, ds.    (5.73)

The spatial region or block associated with the data, B, is called the support
of the variable Z(B). To adapt the ideas of ordinary kriging to the predic-
tion of Z(B), we consider predictors of the form p(Z; Z(B)) = \sum_{i=1}^{n} λ_i Z(s_i),

where the weights are chosen to minimize the mean-squared prediction error
E[(p(Z; Z(B)) − Z(B))2 ]. Since E[Z(s)] = µ, E[Z(B)] = µ, and the same ideas
used in the development of the ordinary kriging predictor in §5.2.2 can be ap-
plied to the prediction of Z(B). This leads to the block kriging predictor

    p(Z; Z(B)) = \sum_{i=1}^{n} λ_i Z(s_i),

where the optimal weights {λ_i} are obtained by solving (Journel and Huijbregts,
1978; Chilès and Delfiner, 1999)

    \sum_{k=1}^{n} λ_k C(s_i, s_k) − m = Cov[Z(B), Z(s_i)],  i = 1, · · · , n;
    \sum_{i=1}^{n} λ_i = 1.    (5.74)

In matrix terms, the optimal weights satisfy

    λ′ = \left( σ(B, s) + 1 \frac{1 − 1′Σ^{−1}σ(B, s)}{1′Σ^{−1}1} \right)′ Σ^{−1}

    m = \frac{1 − 1′Σ^{−1}σ(B, s)}{1′Σ^{−1}1},
where the elements of the vector σ(B, s) are Cov[Z(B), Z(si )] (see Cressie,
1993). These “point-to-block” covariances can be derived from the covariance
function C(·) of the underlying Z(s) process as

    Cov[Z(B), Z(s)] = \frac{1}{|B|} \int_B C(u, s) \, du.    (5.75)
(see Journel and Huijbregts, 1978; Chilès and Delfiner, 1999). Thus, the
weights needed for block kriging are the same as those used in ordinary kriging
with σ replaced by σ(B, s).
The minimized mean-squared prediction error, the block kriging variance, is

    σ²_{ok}(B) = σ(B, B) − λ′σ(B, s) + m
              = σ(B, B) − σ(B, s)′Σ^{−1}σ(B, s) + \frac{\left( 1 − 1′Σ^{−1}σ(B, s) \right)²}{1′Σ^{−1}1},

where

    σ(B, B) = Cov[Z(B), Z(B)] = \frac{1}{|B|²} \int_B \int_B C(u, v) \, du \, dv.    (5.76)

The covariance function, C(u, v) (here a point-to-point covariance), is as-


sumed known for theoretical derivations, but is then estimated and modeled
with a valid positive definite function based on the data as in ordinary point
kriging. In practice, the integrals in (5.75) and (5.76) are computed by dis-

cretizing B into points, {u′_j}, so that (5.75) is approximated using

    Cov[Z(B), Z(s)] ≈ \frac{1}{N} \sum_{j=1}^{N} C(u′_j, s),

and (5.76) is approximated using

    Cov[Z(B), Z(B)] ≈ \frac{1}{N²} \sum_{i=1}^{N} \sum_{j=1}^{N} C(u′_i, u′_j).
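
These discretized covariances, and the block kriging weights built from them, can
be sketched in a few lines of Python. The covariance function cov_fun, the site
coordinates, and the block discretization points below are generic placeholders;
the assembly follows the matrix form of (5.74) and the block kriging variance
given above.

    import numpy as np

    def point_block_cov(cov_fun, s, block_pts):
        # (1/N) sum_j C(u'_j, s)
        return np.mean([cov_fun(u, s) for u in block_pts])

    def block_block_cov(cov_fun, block_pts):
        # (1/N^2) sum_i sum_j C(u'_i, u'_j)
        N = len(block_pts)
        return sum(cov_fun(u, v) for u in block_pts for v in block_pts) / N ** 2

    def block_krige(cov_fun, sites, z, block_pts):
        Sigma = np.array([[cov_fun(a, b) for b in sites] for a in sites])
        sigB = np.array([point_block_cov(cov_fun, a, block_pts) for a in sites])
        one = np.ones(len(sites))
        Sinv = np.linalg.inv(Sigma)
        m = (1.0 - one @ Sinv @ sigB) / (one @ Sinv @ one)
        lam = Sinv @ (sigB + m * one)                   # block kriging weights
        pred = lam @ np.asarray(z, dtype=float)
        var = block_block_cov(cov_fun, block_pts) - lam @ sigB + m
        return pred, var
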

Block kriging can also be carried out using the semivariogram. The relation-
ship between the semivariogram associated with Z(B) and that associated
with the underlying process of point support Z(s) is given by (Cressie, 1993,
p. 16),
A A
1
2γ(Bi , Bj ) = − γ(u − v) dudv
|Bi ||Bi | Bi Bi
A A
1
− γ(u − v) dudv
|Bj ||Bj | Bj Bj
A A
2
+ γ(u − v) dudv , (5.77)
|Bi ||Bj | Bi Bj
where 2γ(u − v) = Var[Z(u) − Z(v)] is the variogram of the point-support
process {Z(s)}.

Example 5.13 It is illustrative to examine the effects of integrating a co-


variance function, although in practice approximations are invoked. Imagine
a one-dimensional (transect) process with point-to-point semivariogram
- >
3|si − sj |
γ(si , sj ) = 1 − exp − .
5
We wish to predict Z(B), where B is the line segment from s = 2 to s = 4.
The point-to-block semivariogram is then
A - >
1 4 3|u − sj |
γ(si , B) = 1 − exp − du.
2 2 5
Figure 5.17 shows the point-to-point and the point-to-block semivariances in
the neighborhood of the prediction location s = 3. Whereas γ(si , s0 = 3)
approaches 0 as si nears the prediction location, the same is not true for the
point-to-block semivariances. The semivariogram γ(s_i, B) is one half of the variance
between a point datum, Z(s_i), and the aggregate Z(B) = |B|^{−1} \int_B Z(u) \, du. This
difference is not zero at the prediction location.
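
The point-to-block semivariance in this example is easy to evaluate by numerical
integration, for instance as in the Python sketch below; the constants 2, 4, 3,
and 5 come from the example, and scipy.integrate.quad does the one-dimensional
quadrature.

    import numpy as np
    from scipy.integrate import quad

    def gamma_pt(si, sj):
        # exponential semivariogram of Example 5.13
        return 1.0 - np.exp(-3.0 * abs(si - sj) / 5.0)

    def gamma_pt_block(si, a=2.0, b=4.0):
        # gamma(s_i, B) = (1/|B|) * integral over B of gamma(s_i, u) du
        val, _ = quad(lambda u: gamma_pt(si, u), a, b)
        return val / (b - a)

    # at the block midpoint s = 3 the point-to-point semivariance is 0,
    # but the point-to-block semivariance is not:
    # gamma_pt(3.0, 3.0) -> 0.0;  gamma_pt_block(3.0) -> about 0.25
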

The point-to-block change of support problem is one that routinely arises


in geostatistics, but the same idea can be used to solve other change of sup-
port problems. Suppose the data are Z(A1 ), · · · , Z(An ), with Z(Ai ) defined
Figure 5.17 Point-to-point (dashed line) and point-to-block semivariances (solid line)
near the prediction location at s = 3.

through (5.73) and prediction of Z(B) is desired. The optimal linear predictor
of Z(B) based on data {Z(A_i)} is \hat{Z}(B) = \sum_{i=1}^{n} λ_i Z(A_i), where the
optimal weights {λ_i} are solutions to the equations obtained by replacing
the point-to-point covariances C(s_i, s_k) and the point-to-block covariances
Cov[Z(B), Z(s_i)] in (5.74) with

    Cov[Z(A_i), Z(A_k)] = \frac{1}{|A_i||A_k|} \int_{A_k} \int_{A_i} C(u, v) \, du \, dv,    (5.78)

and

    Cov[Z(B), Z(A_i)] = \frac{1}{|B||A_i|} \int_{B} \int_{A_i} C(u, v) \, du \, dv.

Because data on any support can be built from data with point-support,
these relationships can be used for the case when |Ai | < |B| (aggregation),
the case when |B| < |Ai | (disaggregation), and also the case of overlapping
units on essentially the same scale. However, unlike the previous situation,
where we observed point-support data and could easily estimate the point-
support covariance function C(u, v), in practice this function is more difficult
to infer from aggregate data. If we assume a parametric model, γ(u − v; θ),
for γ(u − v), a generalized estimating equations (GEE) approach can be used
to estimate θ (see McShane et al., 1997). Consider the squared differences

    Y^{(1)}_{ij} = (Z(B_i) − Z(B_j))².    (5.79)

Note that E[Z(B_i) − Z(B_j)] = 0 and E[Y^{(1)}_{ij}] = 2γ(B_i, B_j; θ). Taking an identity
working variance-covariance matrix for the Y^{(1)}_{ij}, the generalized estimating
equations (see §4.5.3 in Chapter 4) can be written as

    U(θ; {Y^{(1)}_{ij}}) = 2 \sum_{i=1}^{n−1} \sum_{j=i+1}^{n} \frac{∂γ(B_i, B_j; θ)}{∂θ} \left\{ Y^{(1)}_{ij} − 2γ(B_i, B_j; θ) \right\} ≡ 0.    (5.80)

Mockus (1998) considers a very similar approach based on least-squares fitting


of a parametric covariance function.
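
With an identity working covariance, the estimating equations (5.80) are the
normal equations of a nonlinear least-squares fit of 2γ(B_i, B_j; θ) to the
squared block differences, so a solution can be sketched with a generic optimizer.
In the Python code below the exponential point-support semivariogram and the block
discretizations are illustrative assumptions of ours; γ(B_i, B_j; θ) is evaluated
through the discretized form of (5.77).

    import numpy as np
    from scipy.optimize import least_squares

    def gamma_pt(h, theta):
        # illustrative exponential point-support semivariogram
        sill, rng_ = theta
        return sill * (1.0 - np.exp(-3.0 * h / rng_))

    def gamma_block(theta, pts_i, pts_j):
        # discretized version of (5.77); pts are (N, d) arrays of block points
        def avg(pa, pb):
            d = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=2)
            return gamma_pt(d, theta).mean()
        return avg(pts_i, pts_j) - 0.5 * avg(pts_i, pts_i) - 0.5 * avg(pts_j, pts_j)

    def fit_theta(z_blocks, block_pts, theta0):
        pairs = [(i, j) for i in range(len(z_blocks))
                 for j in range(i + 1, len(z_blocks))]
        def resid(theta):
            return np.array([(z_blocks[i] - z_blocks[j]) ** 2
                             - 2.0 * gamma_block(theta, block_pts[i], block_pts[j])
                             for i, j in pairs])
        return least_squares(resid, theta0).x
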

5.7.2 The Multi-Gaussian Approach

In many cases, E[Z(B)|Z(s)] is not linear in the data Z(s) and, in others,
prediction of a nonlinear function of Z(B) is of interest. These problems re-
quire more information about the conditional distribution of Z(B) given the
data, FB (z|Z(s)) = Pr(Z(B) ≤ z|Z(s)), than that used for linear prediction.
Moreover, in many cases, such as mining and environmental remediation, the
quantity Pr(Z(B) > z|Z(s)) has meaning in its own right (e.g., proportion
of high-grade blocks available in mining evaluation or the risk of contamina-
tion in a volume of soil). Nonlinear geostatistics offers solutions to COSPs
that arise in this context. The multi-Gaussian approach (Verly, 1983) to non-
linear prediction in the point-to-block COSP assumes that available point
data Z(s1 ), · · · , Z(sn ) can be transformed to Gaussian variables, {Y (s)}, by
Z(s) = ϕ(Y (s)). The block B is discretized into points {u!j , j = 1, · · · , N },
and Z(B) is approximated as
N
1 (
Z(B) ≈ Z(u!j ). (5.81)
N j=1

Then
 
(N
1
FB (z|Z(s)) ≈ Pr  Z(u!j ) < z|Z(s)
N j=1
 
N
(
= Pr  φ(Y (u!j )) < N z|Y (s1 ), Y (s2 ), · · · , Y (sn ) .
j=1

This probability is estimated through simulation (see Chapter 7). The vector
Y(u) = [Y (u1 ), · · · , Y (uN )]! is simulated from the conditional distribution of
Y(u)|Y(s). Since Y is Gaussian, this conditional distribution can be obtained
by kriging and simulation is straightforward. Then, FB (z|Z(s)) is estimated
as the proportion of vectors satisfying \sum_{j=1}^{N} ϕ(Y(u′_j)) < Nz.
If, instead of point support data, data Z(A1 ), · · · , Z(An ), |Ai | < |B|, are
available, this approach can still be used provided an approximation similar
to that of equation (5.81) remains valid. More general COSP models based on
the multi-Gaussian approximation may be possible by building models from
data based on point support as described in §5.7.1.
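
A bare-bones Monte Carlo version of this calculation is sketched below in Python.
It assumes a zero-mean, unit-variance Gaussian Y field with correlation function
rho, observed normal scores y at the data sites, a back-transformation phi, and a
discretization block_pts of B; the conditional distribution of Y at the block
points is obtained from the Gaussian conditional mean and covariance (simple
kriging). All names are placeholders of ours.

    import numpy as np

    def block_cdf(z, y, sites, block_pts, rho, phi, n_sims=1000, seed=0):
        rng = np.random.default_rng(seed)
        C_ss = np.array([[rho(a, b) for b in sites] for a in sites])
        C_uu = np.array([[rho(a, b) for b in block_pts] for a in block_pts])
        C_us = np.array([[rho(a, b) for b in sites] for a in block_pts])
        A = C_us @ np.linalg.inv(C_ss)
        mean_c = A @ y                          # conditional mean of Y(u) | Y(s)
        cov_c = C_uu - A @ C_us.T               # conditional covariance
        L = np.linalg.cholesky(cov_c + 1e-10 * np.eye(len(block_pts)))
        hits = 0
        for _ in range(n_sims):
            y_sim = mean_c + L @ rng.standard_normal(len(block_pts))
            hits += np.mean(phi(y_sim)) < z     # block average below threshold?
        return hits / n_sims
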

5.7.3 The Use of Indicator Data

Consider again indicator data I(s, z) = [I(s1 , z), · · · , I(sn , z)]! , derived from
the indicator transform in (5.61). From §5.6.3, indicator kriging provides an
estimate of Fs0 (z |Z(s)) = Pr(Z(s0 ) < z|Z(s)). For nonlinear prediction in the
point-to-block COSP, it is tempting to use block kriging, described in §5.7.1,
with the indicator data. However, this will yield a predictor of

    \frac{1}{|B|} \int_B I(Z(s) ≤ z) \, ds,

which is the proportion of B consisting of points where Z(s) is at or below z.
This quantity is clearly not the same as

    I(B) = 1 if Z(B) ≤ z, and 0 otherwise,    (5.82)

which would provide an estimate of Pr(Z(B) ≤ z|Z(s)), the probability that


the average value of Z(·) is at or below z. This latter quantity is the one of
interest in COSPs. The problem arises with any nonlinear function of Z(s),
because the mean of block-support data will not be the same as the block
average of the point support data. This is also true in the more general COSP
based on data with supports Ai that differ from support B.
Goovaerts (1997) suggests a solution to nonlinear block prediction based on
simulation. The block is discretized and data Z(u!j ) are simulated at each dis-
cretized node. Simulated block values are then obtained via equation (5.81).
Based on these simulated block values, block indicator values are constructed
using equation (5.82), and Pr(Z(B) ≤ z|Z(s)) is then estimated as the aver-
age of these block indicator values. Goovaerts (1997) recommends LU decom-
position for the simulation of the Z-values, but any conditional simulation
technique (i.e., one that forces the realizations to honor the available data)
could be used (see §7.2.2).

5.7.4 Disjunctive Kriging and Isofactorial Models

The ideas underlying the development of the disjunctive kriging predictor


and the use of isofactorial models can also be used for COSPs. For example,
suppose all pairs (Z(s), Z(u)) are bivariate Gaussian and we want to predict
I(B) in equation (5.82). This function can be expanded in terms of Hermite
polynomials using (5.64):

    I(B) = F(z) + f(z) \sum_{p=1}^{∞} (−1)^p \frac{H_{p−1}(z) H_p(Z(B))}{p!}.

Then, the disjunctive kriging predictor of I(B) is obtained by replacing each


H_p(Z(B)) with its predictor obtained by kriging based on the equations

    \sum_{i=1}^{n} λ_{pi} \left[ Cov[Z(s_i), Z(s_j)] \right]^p = \left[ Cov[Z(s_j), Z(B)] \right]^p,  j = 1, · · · , n.

These are analogous to those in §5.6.4, but adapted to the point-to-block COSP
through the term [Cov[Z(s_j), Z(B)]]^p. They also have a more general form in
the case of isofactorial models (§5.6.4.3):

    \sum_{i=1}^{n} λ_{pi} T_p(i, j) = T_p(B, j),  j = 1, · · · , n.

The covariances Tp (i, j) (point-to-point), Tp (B, j) (point-to-block), and Tp (B,


B) (block-to-block) needed for disjunctive kriging and for calculation of the
prediction standard errors must be derived for the particular system of or-
thonormal polynomials being used. In practice, this has only been done in spe-
cial cases, e.g., using what is called the discrete Gaussian model (Rivoirard,
1994; Chilès and Delfiner, 1999).

5.7.5 Constrained Kriging

Dissatisfied by the solutions to the change of support problem described above,


Cressie (1993b) proposes constrained kriging, which uses g(λ! Z(s)) to pre-
dict g(Z(B)). If ordinary kriging is used to obtain the weights, λ, the corre-
sponding predictor, g(λ! Z(s)), will be too smooth. In constrained kriging, a
variance constraint is added to compensate for this undesirable smoothness.
Thus, the weights are chosen to minimize the mean-squared prediction error of
λ! Z(s) as a predictor of Z(B), subject to both an unbiasedness constraint as in
ordinary kriging and a variance constraint Var[λ! Z(s)] = Var[Z(B)]. Thus, we
choose λ by minimizing E[(Z(B) − λ′Z(s))²], subject to E[λ′Z(s)] = E[Z(B)] = µ
and

    Var[λ′Z(s)] = σ(B, B) = \frac{1}{|B|²} \int_B \int_B C(u, v) \, du \, dv.

Cressie (1993b) shows that the optimal weights are given by

    λ′ = (1/m_2) σ(B, s)′Σ^{−1} − (m_1/m_2) 1′Σ^{−1}

    m_1 = − \frac{m_2}{1′Σ^{−1}1} + \frac{σ(B, s)′Σ^{−1}1}{1′Σ^{−1}1}

    m_2 = \frac{\left[ (σ(B, s)′Σ^{−1}σ(B, s))(1′Σ^{−1}1) − (σ(B, s)′Σ^{−1}1)² \right]^{1/2}}{\left[ (1′Σ^{−1}1) σ(B, B) − 1 \right]^{1/2}},
where m1 and m2 are Lagrange multipliers from the constrained minimization,
the vector σ(B, s) has elements Cov[Z(B), Z(si )] given in (5.75) and Σ has
elements C(u, v).
The weights obtained from this constrained minimization are optimal for
λ′Z(s) as a predictor of Z(B), i.e., for linear prediction. Thus, g(λ′Z(s)) will not
be optimal for g(Z(B)), but the advantage of constrained kriging is that the
weights depend only on C(u, v), the point-to-point covariance, and that the range of
g(λ′Z(s)) exactly matches that of g(Z(B)). Simulations in Cressie (1993b)
and Aldworth and Cressie (1999) indicate that accurate nonlinear predictions
of aggregate data can be made using this approach. An extension of this,
covariance-matching constrained kriging, has been shown to have even
better mean-squared prediction error properties (Aldworth and Cressie, 2003).
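
For completeness, the constrained kriging weights above can be computed directly
from the data covariance matrix Σ, the point-to-block covariance vector σ(B, s),
and σ(B, B). The Python sketch below does exactly that; it implicitly assumes the
quantities under the square roots are positive, and the argument names are ours.

    import numpy as np

    def constrained_kriging_weights(Sigma, sig_Bs, sig_BB):
        Si = np.linalg.inv(Sigma)
        one = np.ones(Sigma.shape[0])
        a = one @ Si @ one          # 1' Sigma^{-1} 1
        b = sig_Bs @ Si @ one       # sigma(B,s)' Sigma^{-1} 1
        c = sig_Bs @ Si @ sig_Bs    # sigma(B,s)' Sigma^{-1} sigma(B,s)
        m2 = np.sqrt(c * a - b ** 2) / np.sqrt(a * sig_BB - 1.0)
        m1 = -m2 / a + b / a
        lam = (Si @ sig_Bs) / m2 - (m1 / m2) * (Si @ one)
        return lam   # lam sums to one and lam' Sigma lam equals sig_BB
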

5.8 On the Popularity of the Multivariate Gaussian Distribution

We showed in sections §5.1 and §5.2 that it was possible to determine the best
predictor, E[Z(s0 )| Z(s)], when the data follow a multivariate Gaussian distri-
bution. When the assumption of a Gaussian distribution is relaxed, we must
either re-define the notion of an optimal predictor by imposing additional cri-
teria such as linearity and unbiasedness, or transform the data to a Gaussian
distribution in order to make use of its nice theoretical properties. This led
us to ask the question: Why is there such a dependence on the multivari-
ate Gaussian distribution in spatial statistics? This section explores several
answers to this crucial question. Additional discussion is given in §6.3.3 and
§7.4.
For simplicity, we begin by considering bivariate distributions. Let Z1 and
Z2 be two random variables with bivariate distribution function F12 (z1 , z2 ) =
Pr(Z1 ≤ z1 , Z2 ≤ z2 ). The marginal distributions F (z1 ) and F (z2 ) can be
obtained from the bivariate distribution F (z1 , z2 ) as
F1 (z1 ) = F12 (z1 , ∞); F2 (z2 ) = F12 (∞, z2 ).
A well-known example is that of the bivariate Gaussian distribution where

    F_{12}(z_1, z_2) = \frac{1}{2πσ_1σ_2\sqrt{1 − ρ²}} \int_{−∞}^{z_1} \int_{−∞}^{z_2}
        \exp\left\{ \frac{−1}{2(1 − ρ²)} \left[ \left( \frac{u_1 − µ_1}{σ_1} \right)²
        − 2ρ \left( \frac{u_1 − µ_1}{σ_1} \right)\left( \frac{u_2 − µ_2}{σ_2} \right)
        + \left( \frac{u_2 − µ_2}{σ_2} \right)² \right] \right\} du_2 \, du_1,

with −1 < ρ < 1, F_1(z_1) ∼ G(µ_1, σ_1²) and F_2(z_2) ∼ G(µ_2, σ_2²). The question
of interest in this section is: Can we go the other way, i.e., given F1 (z1 ) and
F2 (z2 ) can we construct F12 (z1 , z2 ) such that its marginals are F1 (z1 ) and
F2 (z2 ) and Corr[Z1 , Z2 ] = ρ? The answer is “yes” but as we might expect,
there may be some caveats depending on the particular case of interest.
There are several different ways to construct bivariate (and multivariate)
distributions (Johnson and Kotz, 1972; Johnson, 1987):

1. through generalizations of Pearson’s system of distributions derived from


differential equations;

2. by series expansions like the isofactorial models described in §5.6.4.3 or the


well-known Edgeworth expansions;
3. by transformation;
4. through the use of sequential conditional distributions.
However, it is important to consider the characteristics required of the result-
ing bivariate (and multivariate) distribution. Several important considerations
include the nature of the marginals, the nature of the conditional distribu-
tions, the permissible correlation structure, whether the distribution factors
appropriately under independence, and the limiting distribution. We consider
several examples, focusing on bivariate distributions for simplicity.
Consider the bivariate Cauchy distribution (see, e.g., Mardia, 1970, p. 86)

    F(z_1, z_2) = (2π)^{−1} c \left[ c² + z_1² + z_2² \right]^{−3/2},  −∞ < z_1, z_2 < ∞,  c > 0.
The marginal distributions of Z1 and Z2 are both Cauchy, but E[Z2 |Z1 ] does
not exist. When working with Poisson distributions, the result is more severe:
no bivariate distribution exists having both marginal and conditional distri-
butions of Poisson form (Mardia, 1970). Even if the conditional moments
do exist, other constraints arise. For example, consider the Farlie-Gumbel-
Morgenstern system of bivariate distributions (Johnson and Kotz, 1975)
F12 (z1 , z2 ) = F1 (z1 )F2 (z2 ){1 + α[1 − F1 (z1 )][1 − F2 (z2 )]},
constructed from specified marginals F1 (z1 ) and F2 (z2 ). Here, α is a real num-
ber so that F12 (z1 , z2 ) satisfies the theoretical properties of a distribution func-
tion. The bivariate exponential distribution is a member of this class, obtained
by taking both F1 (z1 ) and F2 (z2 ) to be exponential. With this system, the
conditional moments exist, but Corr[Z1 , Z2 ] ≤ 1/3 (Huang and Kotz, 1984).
Many bivariate distributions that are not constructed from an underlying
bivariate Gaussian distribution have a similar constraint on the correlation;
some like the bivariate Binomial permit only negative correlations (Mardia,
1970). Plackett’s family of bivariate distribution functions (Plackett, 1965)
overcomes this difficulty. This family is given by the function F12 (z1 , z2 ) that
satisfies
    ψ = \frac{F_{12}(1 − F_1 − F_2 + F_{12})}{(F_1 − F_{12})(F_2 − F_{12})},  ψ ∈ (0, ∞).
If ψ = 1, then Z1 and Z2 are independent. By varying ψ, we can generate
bivariate distributions with different strengths of dependence. Mardia (1967)
shows that Corr[Z1 , Z2 ] = [(ψ − 1)(1 + ψ) − 2ψ log(ψ)]/(1 − ψ)2 , and that 0 <
Corr[Z1 , Z2 ] < 1 if ψ > 1 and Corr[Z1 , Z2 ] < 0 if 0 < ψ < 1. Unfortunately,
Plackett’s family does not seem to have a nice multivariate generalization. As
a final example, consider the class of isofactorial models described in §5.6.4.3.
This is a factorization of an existing bivariate distribution (Lancaster, 1958).
While it can be used to build a bivariate distribution from specified marginals,
we must verify that the resulting bivariate distribution is a valid distribution
function and then we must derive the range of permissible correlations.

These difficulties are compounded when generalizing these results to mul-


tivariate distributions. Perhaps the most commonly-used non-Gaussian mul-
tivariate distributions are the elliptically contoured distributions (see, e.g.,
Johnson, 1987, pp. 106–124). These are defined through affine transforma-
tions of the class of spherically symmetric distributions with a density of the
form ) *
F (dz) = kn |Σ|−1/2 h (z − µ)! Σ−1 (z − µ) ,
where kn is a proportionality constant and h is a one-dimensional real-valued
function (Johnson, 1987, p. 107). Clearly, the multivariate Gaussian distribu-
tion is in the class, obtained by taking kn = (2π)−n/2 and h(z) = exp{−z/2}.
These distributions share many of their properties with the multivariate Gaus-
sian distribution, making them almost equally as nice from a theoretical view-
point. However, this also means we are not completely free of some sort of
multivariate Gaussian assumption, and many members of this family have lim-
iting marginal distributions that are normal (e.g., Pearson’s Type II, Johnson,
1987, p. 114). Generating multivariate distributions sequentially from spec-
ified conditionals overcomes some of these difficulties. In this approach we
specify Z1 ∼ F1 , then Z2 |Z1 ∼ F2|1 , and Z3 |Z1 , Z2 ∼ F3|21 , and so on. This
approach is routinely used in Bayesian hierarchical modeling and it can al-
low us to generate fairly complex multivariate distributions. However, we may
not always be certain of some of the properties of the resulting distribution.
Consider the following example. Suppose Z1 (s) is a second-order stationary
process with E[Z1 (s)] = 1 and Cov[Z1 (u), Z1 (u + h)] = σ 2 ρ1 (h). Conditional
on Z1 (s), suppose Z2 (s) is a white noise process with mean and variance given
by
E[Z2 (s)|Z1 (s)] = exp{x(s)! β}Z1 (s) ≡ µ(s), Var[Z2 (s)|Z1 (s)] = µ(s).
This is a simplified version of a common model used for modeling and inference
with count data (see., e.g., Zeger, 1988 and McShane et al., 1997). This model
is attractive since the marginal mean E[Z2 (s)] = exp{x(s)! β}, depends only
on the unknown parameter β, and the marginal variance, Var[Z2 (s)] = µ(s) +
σ 2 µ(s)2 , allows overdispersion in the data Z2 (s). Now consider the marginal
correlation of Z2 (s)
    Corr[Z_2(s), Z_2(s + h)] = \frac{ρ_1(h)}{\left[ \left( 1 + \frac{1}{σ²µ(s)} \right)\left( 1 + \frac{1}{σ²µ(s + h)} \right) \right]^{1/2}}.

If σ 2 , µ(s), and µ(s+h) are small, Corr[Z2 (s), Z2 (s+h)] << ρ1 (h). For exam-
ple, taking σ 2 = 1 and µ(s) = µ(s + h) = 1, Corr[Z2 (s), Z2 (s + h)] = ρ1 (h)/2.
Thus, while the conditioning induces both overdispersion and autocorrelation
in the Z2 process, the marginal correlation has a definite upper bound and so
may not be a good model for highly correlated data.
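
The attenuation is easy to quantify numerically; the short Python check below
reproduces the halving of the correlation quoted in the text for σ² = 1 and
µ(s) = µ(s + h) = 1.

    import numpy as np

    def marginal_corr(rho1, sigma2, mu_s, mu_sh):
        # attenuation of the point-level correlation rho1(h)
        return rho1 / np.sqrt((1 + 1 / (sigma2 * mu_s)) * (1 + 1 / (sigma2 * mu_sh)))

    # marginal_corr(rho1, 1.0, 1.0, 1.0) returns rho1 / 2
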
Given all of this discussion (and entire books reflecting almost 50 years of
research in this area), it is now easy to see why the multivariate Gaussian dis-
tribution is so popular: it has a closed form expression, permits pairwise cor-

relations in (−1, 1), each (Zi , Zj ) has a bivariate Gaussian distribution whose
moments can be easily derived, all marginal distributions are Gaussian, and all
conditional distributions are Gaussian. Moreover, (almost) equally tractable
multivariate distributions can be derived from the multivariate Gaussian (e.g.,
the multivariate lognormal and the multivariate t-distribution) and these play
key roles in classical multivariate analysis. Thus, the multivariate Gaussian
distribution has earned its truly unique place in statistical theory.
Note that in geostatistical modeling, we are working with multivariate
data, i.e., rather than just considering Fij (zi , zj ) we must be concerned with
F1,2,···,n (z1 , z2 , · · · , zn ) and the relationships permitted under this multivari-
ate distribution. Herein lies the problem with the nonparametric indicator
approaches and non-Gaussian disjunctive kriging models: they attempt to
build a multivariate distribution from bivariate distributions. With indicator
kriging this is done through indicator semivariograms, and with disjunctive
kriging it is done through isofactorial models. From the above discussion, we
have to wonder if there is indeed a multivariate distribution that gives rise to
these bivariate distributions. Sometimes, this consideration may seem like just
a theoretical nuisance. However, in some practical applications it can cause
difficulties, e.g., “covariance” matrices that are not positive definite, numer-
ical instability, and order-relations problems. These ideas are important to
keep in mind as we go on to consider more complex models for spatial data
in subsequent chapters.

5.9 Chapter Problems

Problem 5.1 The prediction theorem states that for any random vector U
and any random variable Y we have either of the following
• For every function g, E[(Y − g(U))²] = ∞;
• E[(Y − E[Y | U])²] ≤ E[(Y − g(U))²] for every g, with equality only if g(U) =
E[Y | U].
Hence, the conditional expectation is the best predictor under squared error
loss. Prove this theorem.

Problem 5.2 Consider prediction under squared error loss. Let p0 (Z; s0 ) =
E[Z(s0 )|Z(s)]. Establish that
E[(Z(s0 ) − p0 (Z; s0 ))2 ] = Var[Z(s0 )] − Var[p0 (Z; s0 )].

Problem 5.3 Consider a standard linear model with uncorrelated errors,


Y = Xβ + e, e ∼ (0, σ 2 I). It is easy to establish that the ordinary least
squares estimator of β requires only assumption about the first two moments
of Y and is identical to the maximum likelihood estimator of β when e is
Gaussian. Use this correspondence to revisit the link between best prediction
under squared-error loss for Gaussian data and best linear unbiased prediction
under squared-error loss.

Problem 5.4 Let the random variables X and Y have joint density f (x, y).
Show that the best linear unbiased predictor of Y based on X is given by the
linear regression function
BLU P (Y |X) = α + βX,
where β = Cov[X, Y ]/Var[X] and α = E[Y ] − βE[X].

Problem 5.5 (adapted from Goldberger, 1991, p. 55) Consider the joint mass
function p(x, y) in the following table.
X
Y x=1 x=2 x=3
y=0 0.15 0.10 0.30
y=1 0.15 0.30 0.00

(i) Find the conditional expectation function E[Y |X].


(ii) Find the BLU P (Y |X).
(iii) Compare the conditional expectation and the BLUP for x = 1, 2, 3.
(iv) Calculate E[(Y − E[Y |X])2 ] and E[(Y − BLU P (Y |X))2 ]. Which one is
smaller?

Problem 5.6 Let random variables X and Y have joint probability density
function
    f(x, y) = \frac{6}{7}(x + y)²,  0 ≤ x ≤ 1;  0 ≤ y ≤ 1.
Find E[Y |X], BLU P (Y |X), and compare the mean-squared prediction errors
for the two predictors.

Problem 5.7 The simple kriging predictor (5.10) on page 223 is the solu-
tion to the problem of best linear prediction with known mean. Show that
psk (Z; s0 ) not only is an extremum of the mean squared prediction error
E[(p(Z; s0 ) − Z(s0 ))2 ], but that it is a minimum. Assume that Var[Z(s)] is
positive definite.

Problem 5.8 Show that (5.14) and (5.15) are the solution to the ordinary
kriging problem in §5.2 (page 227). Verify the formula (5.16) for σ²_{ok}(s_0).

Problem 5.9 Verify that (5.17) is the ordinary kriging predictor, provided
µ̂ is the generalized least squares estimate of µ.

Problem 5.10 Verify (5.19)–(5.22).

Problem 5.11 Refer to the seven point ordinary kriging Example 5.5, p. 229.
Repeat the example with a prediction point that falls outside the hull of the
observed points, for example, s0 = [50, 20] or s0 = [60, 10].

• Do the same conclusions as in Example 5.5 regarding the effects of nugget,


sill, and range still hold?
• Which points are screening?
• What happens to the kriging variance?
• In which models does the predicted value change, in which models does it
remain the same?

Problem 5.12 Assume that you observe in the domain A a spatial random
field {Z(s) : s ∈ A ⊂ R2 } with covariance function C(h). Further let Bi and
Bj be some regions in A with volumes |Bi | and |Bj |, respectively. Show that
    Cov[Z(B_i), Z(B_j)] = \frac{1}{|B_i||B_j|} \int_{B_i} \int_{B_j} C(u, v) \, du \, dv.

Problem 5.13 The table below shows 31 observations of two variables, a


response variable y and a regressor variable x.
• Prepare a scatter plot of these data and suggest a regression to model the
relationship between y and x.
• Obtain the ordinary least squares fit of your model of choice and of a simple
linear regression.
• Compute predicted values for a localized simple linear regression model,
where the weights are given by the Gaussian kernel function (page 240)
with bandwidths λ = 0.1, 0.25, 0.5, 1, 2, and 5. How would you choose the
value of λ?

x y x y x y x y x y
3.03 0.03 3.10 −0.20 3.23 −0.44 3.34 0.48 3.42 −0.79
3.51 −0.35 3.62 −0.21 3.72 −1.02 3.80 −1.49 3.91 −1.78
4.04 −0.43 4.15 −0.91 4.23 −1.33 4.32 −1.41 4.44 −1.14
4.53 −1.67 4.65 −1.36 4.73 −0.96 4.85 −1.15 4.92 −1.27
5.02 −0.37 5.12 −1.25 5.24 −0.96 5.31 −0.44 5.44 −2.08
5.51 −0.68 5.63 −1.23 5.72 −0.82 5.81 −0.51 5.92 −0.45
6.01 −0.40
CHAPTER 6

Spatial Regression Models

In §2.4.1 we introduced the operational decomposition

    Z(s) = µ(s) + e(s)
    e(s) = W(s) + η(s) + ε(s)

of data from a random field process into large-scale trend µ(s), smooth, small-
scale variation W(s), micro-scale variation η(s), and measurement error ε(s).
This decomposition was also used to formulate statistical models for spatial
prediction in the previous chapter. For example, the ordinary kriging predictor
was obtained for µ(s) = µ, the universal kriging predictor for µ(s) = x! (s)β.
The focus in the previous chapter was on spatial prediction; predicting Z(s)
or the noiseless S(s) = µ(s) + W (s) + η(s) at observed or unobserved lo-
cations. Developing best linear unbiased predictors ultimately required best
linear unbiased estimators of µ and β. The fixed effects β were important in
that they need to be properly estimated to account for a spatially varying
mean and to avoid bias. The fixed effects were not the primary focus of the
analysis, however. They were essentially nuisance parameters. The covariance
parameters θ were arguably of greater importance than the parameters of the
mean function, as θ drives the various prediction equations and the precision
of the predictors along with the model chosen for Σ = Var[e(s)].
Statistical practitioners are accustomed to the exploration of relationships
among variables, modeling these relationships with regression and classifica-
tion (ANOVA) models, testing hypotheses about regression and treatment
effects, developing meaningful contrasts, and so forth. When first exposed to
spatial statistics, the practitioner often appears to abandon these classical
lines of data inquiry—that focus on aspects of the mean function—in favor
of spatial prediction and the production of colorful maps. What happened?
When you analyze a field experiment with spatially arranged experimental
units, for example, you can rely on randomization theory or on a spatial model
as the framework for statistical inference (more on the distinction below). In
either case, the goal is to make decisions about the effects of the treatments
applied in the experiment. And since the treatment structure is captured in
the mean function—unless treatment levels are selected at random—we can
not treat µ(s) as a nuisance. It is central to the inquiry.
In this chapter we discuss models for spatial data analysis where the focus is
on modeling and understanding the mean function. In a reversal from Chapter
5, the covariance parameters may, at times, take on the role of the nuisance

parameters. The models differ in the degree to which spatial dependence is


incorporated and whether it is modeled directly or indirectly. For example,
just as there are methods of spatial prediction that assume uncorrelated errors
(see §5.3.1 and §5.3.2), some spatial regression models account for all spatial
variation through the mean function. In practice you will find that your needs
with respect to prediction and mean function inference are varied. Often we
are interested in both estimation of fixed effects and covariance parameters.
Neither β nor θ are nuisance parameters; you need to pay attention to both.
In §6.1–6.2 we assume that the spatial model has a linear mean structure,

Z(s) = X(s)β + e(s), (6.1)


e(s) ∼ (0, Σ(θ)).

In addition, we may assume e(s) ∼ G(0, Σ(θ)), when necessary to obtain


confidence intervals or to construct hypothesis tests. In §6.3 we consider a
more general formulation based on generalized linear models (GLMs) and
relax this distributional assumption and the linearity of the mean function.
All models in this chapter have in common that spatial structure is in-
corporated, either as part of the mean function, the error process, or both.
None of the models ignore spatial structure, they accommodate it in dif-
ferent ways. Naturally, this begs the question about the respective merits
and demerits of the various approaches. An interesting case in point is the
comparison of models for analyzing data from designed (field) experiments.
Recently, random-field methods based on model (6.1), with spatial autocorre-
lation modeled through Σ(θ), have received considerable attention (see, for ex-
ample, Zimmerman and Harville, 1991; Brownie, Bowman, and Burton, 1993;
Stroup, Baenziger, and Mulitze, 1994; Brownie and Gumpertz, 1997; Gotway
and Stroup, 1997). It has generally been reported that modeling such data
by including spatial autocorrelation yields a more powerful analysis than the
traditional design-based analysis of variance (based on randomization theory,
e.g., the CRD or RCBD analysis) when the spatial structure is pronounced.
In the cases examined, it is typically the case that the design structure in-
volves some form of blocking scheme with a fairly large number of treatments,
e.g., variety or plant breeding trials. A short-sighted comparison of the two
approaches that considers only the precision of parameter estimates is danger-
ous since the methods can represent completely different philosophies. From a
randomization perspective (Kempthorne, 1955), the error-control and treat-
ment design lead to a linear model that usually has the form of an analy-
sis of variance model with independent errors. The statistical model is not
something hypothesized or assumed, it is the result of executing a particu-
lar design. Spatial autocorrelation is addressed under randomization at two
stages. Large-scale trends are ostensibly eliminated through blocking, small-
scale trends that operate at the scale of the experimental units are neutralized
through randomization. Situations where spatial random field models with
correlated errors have seemingly out-performed design-based analyses are ba-
sically dramatic examples of inefficient blocking that failed to eliminate spatial

trends. Making poor design choices does not affect the validity of cause-and-
effect inferences in design-based analyses under randomization. It only makes
it difficult to detect treatment differences because of a large experimental
error variance. When experimental data are subjected to modeling, it is pos-
sible to increase the statistical precision of treatment contrasts. The ability
to draw cause-and-effect conclusions has been lost, however, unless it can be
established that the model is correct.
Some statisticians take general exception with the modeling of experimen-
tal data, whether its focus is on the mean or the covariance structure of the
data, because it is not consistent with randomization inference. Any devi-
ation from the statistical model that reflects the execution of the particular
design is detrimental in their view. We agree that you should “analyze ’em the
way you randomize ’em,” whenever possible; this is the beauty of designed-
based inference. Nevertheless, we also know from experience that things can
go wrong and that scientists want to make the most of the data they have
worked so hard to collect. Thus, modeling of experimental data should also
be a choice, provided we attach the important caveat that modeling experi-
mental data does not lend itself to cause-and-effect inferences. If, for example,
blocking has been carried out too coarsely to provide a reduction in experi-
mental error variance substantial enough to yield smaller standard errors of
treatment contrasts than an analysis that accounts for heterogeneity outside
of the error-control design, why not proceed down that road?

6.1 Linear Models with Uncorrelated Errors

When model errors are uncorrelated, the standard linear model battery can
be brought to bear and analyses are particularly simple. In designed experi-
ments, design-based analysis based on randomization theory does not need to
explicitly model spatial dependencies, since these are neutralized through ran-
domization at the spatial scale of the experimental unit. If spatial structure is
present at scales smaller or larger than the unit, or if data are observational,
or if one does not want to adopt a randomization framework for inference,
uncorrelated errors are justified if all of the spatial variation is captured by, accounted for, or explained by the mean function. This leads to the addition of
terms in the mean function to account for spatial configuration and structure,
or to the transformation of the regressor space.
The significance of regression or ANCOVA models with uncorrelated errors
for spatial data is twofold. First, we want to discuss their place in spatial
analysis. Many statisticians have been led to believe that these models are in-
adequate for use with spatial data and that ordinary least squares estimates
of fixed effects are biased or inefficient. This is not always true. Second, these
models provide an excellent introduction to more complex models with cor-
related errors that follow later. For example, models with nearest neighbor
adjustments, such as the Papadakis analysis (Papadakis, 1937), are forerun-
ners of spatial autoregressive models. The danger of assuming that the salient
spatial structure can be captured through the mean function alone is that
there is “little room for error.” In the words of Zimmerman and Harville
(1991), a correlated error structure can “soak up” spatial heterogeneity. In
longitudinal data analyses it is not uncommon to model completely unstruc-
tured covariance matrices, in part to protect against omitted covariates. In
the spatial case unstructured covariance matrices are impractical, but even
a highly parameterized covariance structure can provide an important safety
net, protecting you against the danger of a misspecified mean function.

Example 6.1 Soil carbon regression. This application is a continuation


of Example 4.2, which considered the spatial variation of soil C/N ratios on
an agricultural field. In contrast to the analyses in Chapter 4 we are now
concerned with modeling the relationship between soil carbon percentage and
the soil nitrogen percentage.

Figure 6.1 Scatterplot of soil carbon (%) versus soil nitrogen (%). Spatial configu-
ration of samples is shown in Figure 4.6 on page 156. Data kindly provided by Dr.
Thomas G. Mueller, Department of Agronomy, University of Kentucky.

Previously we established that the empirical semivariogram of the C/N ra-


tios can be modeled with an exponential semivariogram; there appears to be
spatial autocorrelation in the C/N measurements. This finding does not tell
us anything about the relationship between C and N, however. For exam-
ple, it is possible that the C/N ratios exhibit spatial dependency because C%
depends on N% and it is the latter attribute that varies spatially. Could spa-
tial autocorrelation in the N-process induce the spatial variation in C? Also,
an exponential semivariogram may not be appropriate to model the spatial


dependency in soil carbon adjusted for soil nitrogen. A scatterplot of the two variables certainly suggests a linear relationship between the two soil attributes (Figure 6.1). A regression analysis with mean
function E[C(s)|N (s)] = β0 + β1 N (s) seems appropriate. An ordinary least
squares analysis yields a highly significant relationship with R2 = 0.89. The
important question is, however, whether the simple linear regression structure
is sufficient to explain the (co-)variation in soil C. In other words, how do we
model the errors in the model
C(si )|N (si ) = β0 + β1 N (si ) + e(si )?

6.1.1 Ordinary Least Squares—Inference and Diagnostics

If we are satisfied with our choice of a particular model, parameter estimation


is followed by confirmatory inference, that is, the testing of hypotheses, the computation of confidence intervals, and so forth. While we are in the process
of model building, we need to raise questions about the model-data agreement,
that is, diagnose the extent to which the data conform to model assumptions,
the extent to which observations are influential in the analysis, and the extent
to which model components inter-relate. In the case of linear models with
uncorrelated errors, there is a large battery of tools to choose from to perform
these diagnostic tasks as well as confirmatory inference. In this subsection
we briefly re-iterate some of these well-known tools. We will see in the next
section how considerably more complicated diagnostic and inferential tasks
can become when correlated errors are introduced. It is particularly important
to us to address the properties of OLS residuals and their use in diagnosing
the fit of a spatial regression model.

6.1.1.1 Estimators and Their Properties

Assume that you are fitting a linear model with uncorrelated, homoscedastic
errors to spatial data,
Z(s) = X(s)β + e(s), e(s) ∼ (0, σ 2 I). (6.2)
The ordinary least squares estimator of the fixed effects and the customary
estimator of the residual variance are
\hat{\beta}_{ols} = \left(X(s)'X(s)\right)^{-1} X(s)'Z(s)

\hat{\sigma}^2 = \frac{1}{n - \mathrm{rank}\{X(s)\}} \left(Z(s) - X(s)\hat{\beta}_{ols}\right)'\left(Z(s) - X(s)\hat{\beta}_{ols}\right).

These estimators are based on the least squares criterion: find β that minimizes the residual sum of squares

\left(Z(s) - X(s)\beta\right)'\left(Z(s) - X(s)\beta\right).
Using standard ideas from calculus (differentiate, equate to zero, solve) leads to the normal equations and then to the OLS estimator of β.
The minimized error variance, corrected for bias, is then the OLS estimator of
σ 2 . If we are willing to make a distributional assumption about Z(s), then we
can also use maximum likelihood estimation. Assume that Z(s) ∼ G(X(s)β, σ²I).
The maximum likelihood (ML) estimators of β and σ 2 maximize the joint
likelihood of the data, or, equivalently, minimize the negative of twice the
Gaussian log likelihood given by
\varphi(\beta, \sigma^2; Z(s)) = n\ln\{2\pi\} + n\ln\{\sigma^2\} + \frac{\left(Z(s) - X(s)\beta\right)'\left(Z(s) - X(s)\beta\right)}{\sigma^2}.
Here, ideas from calculus lead to score equations, which are then solved to obtain the ML estimators. For linear models, least squares and ML produce
equivalent estimators. Thus, the ML estimator of β and the bias-corrected ML
estimator of σ 2 (the REML estimator) are equivalent to the OLS estimators
given above.
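For readers who wish to reproduce these computations, the following minimal sketch (Python with numpy; the simulated data and all variable names are ours, not part of the example) carries out the OLS fit and the bias-corrected variance estimate described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical spatial regression data: n locations, one regressor
n = 100
x = rng.uniform(0.03, 0.17, n)                 # e.g., a soil attribute
X = np.column_stack([np.ones(n), x])           # X(s): intercept plus regressor
z = 0.0 + 10.9 * x + rng.normal(0, 0.04, n)    # simulated response Z(s)

# OLS estimator of the fixed effects: (X'X)^{-1} X'Z
beta_hat = np.linalg.solve(X.T @ X, X.T @ z)

# Bias-corrected estimator of sigma^2 (residual mean square)
resid = z - X @ beta_hat
k = np.linalg.matrix_rank(X)
sigma2_hat = resid @ resid / (n - k)
```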
To establish a few important properties of β̂_ols, let us define the “hat” matrix H = X(s)(X(s)'X(s))^{-1}X(s)' and M = I − H. Obviously, H and M are projection matrices (symmetric and idempotent), and estimates of the mean are obtained as Ẑ(s) = X(s)β̂_ols = HZ(s). The OLS residuals are then simply ê_ols(s) = Z(s) − Ẑ(s) = MZ(s). Because H is a projector, H and M project onto orthogonal subspaces. The expression Z(s) = HZ(s) + MZ(s) states that the (n × 1) vector Z(s) is decomposed into a component projected onto the subspace of R^n generated by the columns of X(s), and a component projected onto the orthogonal complement of this space.
If model (6.2) is correct, then

E[β̂_ols] = β,   E[ê_ols(s)] = 0,
Var[β̂_ols] = σ²(X(s)'X(s))^{-1},   Var[ê_ols(s)] = σ²M,
1'ê_ols(s) = 0.

The first and second properties follow from the assumption that the model
errors e(s) have zero mean. The third and fourth properties are based on the
assumption that the variance of the model errors is σ 2 I. The last property
of the OLS residuals (sum-to-zero) applies when the X(s) matrix contains
a constant column (an intercept). (We typically assume in this text that an
intercept is present.)
We can also derive the expected value and variance of σ̂². One of the advantages of maximum likelihood estimation is that the inverse of the information
matrix provides the variance-covariance matrix of the maximum likelihood
estimators. For the linear model considered here, this matrix can be obtained
as a special case of that given in (5.47).
6.1.1.2 Testing Linear Hypotheses

A linear hypothesis is a hypothesis of the form H0 : Lβ = l0 , where L is an


l × p matrix of coefficients and l0 is a specified l × 1 vector. For example,
H0 : βj = 0 can be obtained by choosing L to be a p vector of zeros with a one
in the jth position, and taking l0 = 0. If X(s) is deficient in rank, then we
need to worry about whether the hypothesis Lβ = l0 is testable. This is the case if α_i(X(s)'X(s))^{-}X(s)'X(s) = α_i, where α_i is the ith row of L.
The standard test statistic for the testable hypothesis H0 is

F = \frac{\left(L\hat{\beta}_{ols} - l_0\right)' \left[L\left(X(s)'X(s)\right)^{-1}L'\right]^{-1} \left(L\hat{\beta}_{ols} - l_0\right)}{\hat{\sigma}^2\,\mathrm{rank}\{L\}}.   (6.3)

If e(s) is Gaussian distributed, then F follows an F-distribution with rank{L} numerator and n − rank{X(s)} denominator degrees of freedom. The customary t-test for H0 : βj = 0 is a special case of (6.3). In this case, F = β̂_j²/ese(β̂_j)², where ese denotes the estimated standard error. Since F_{α,1,ν} = t²_{α/2,ν}, the statistic for the t-test is

t = \mathrm{sign}(\hat{\beta}_j) \times \sqrt{F},

and a (1 − α) × 100% confidence interval for βj is given by

\hat{\beta}_j \pm t_{\alpha/2,\, n-\mathrm{rank}\{X(s)\}} \times \mathrm{ese}(\hat{\beta}_j).

In a model with uncorrelated errors, the test statistic (6.3) has another, appealing interpretation in terms of reductions of sums of squares. Let

SSR = \left(Z(s) - X(s)\hat{\beta}_{ols}\right)'\left(Z(s) - X(s)\hat{\beta}_{ols}\right)

denote the residual sum of squares of the model and let SSR_r denote the same sum of squares subject to the linear constraints H0 : Lβ = l0. Then

F = \frac{SSR_r - SSR}{\mathrm{rank}\{L\}\,\hat{\sigma}^2}.
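The test statistic (6.3) is straightforward to compute from the quantities already defined. The sketch below (Python with numpy/scipy; the function name and example usage are ours) assembles F and its p-value for a user-supplied L and l0.

```python
import numpy as np
from scipy import stats

def linear_hypothesis_F(X, z, L, l0):
    """F statistic (6.3) for the testable hypothesis H0: L beta = l0
    in a linear model with uncorrelated, homoscedastic errors."""
    n = X.shape[0]
    k = np.linalg.matrix_rank(X)
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ z
    resid = z - X @ beta_hat
    sigma2_hat = resid @ resid / (n - k)
    diff = L @ beta_hat - l0
    rank_L = np.linalg.matrix_rank(L)
    F = diff @ np.linalg.inv(L @ XtX_inv @ L.T) @ diff / (sigma2_hat * rank_L)
    p_value = stats.f.sf(F, rank_L, n - k)      # upper-tail probability
    return F, p_value
```

For H0: βj = 0, L is a single row with a one in the jth position and l0 = 0; the signed square root of the resulting F is the customary t statistic.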

6.1.1.3 Residual and Influence Diagnostics

The ordinary least squares residuals ê_ols(s) have zero mean and variance σ²M = σ²(I − H). If h_ii denotes the ith diagonal element of H, then the ith residual can be standardized as

\frac{\hat{e}_{ols}(s_i)}{\sqrt{\mathrm{Var}[\hat{e}_{ols}(s_i)]}} = \frac{\hat{e}_{ols}(s_i)}{\sigma\sqrt{1 - h_{ii}}}.

Since σ is unknown, one can compute instead the studentized residual

r_i = \frac{\hat{e}_{ols}(s_i)}{\sqrt{\widehat{\mathrm{Var}}[\hat{e}_{ols}(s_i)]}} = \frac{\hat{e}_{ols}(s_i)}{\hat{\sigma}\sqrt{1 - h_{ii}}}.
The studentized residual is important in diagnostic work because (i) it avoids


some of the pitfalls of raw residuals, and (ii) many other diagnostic measures
can be expressed in terms of the ri . Among the pitfalls of raw residuals are that
they are correlated (M is not a diagonal matrix), they are not homoscedastic
(the diagonals of M are not of the same value), and they are rank-deficient (M is an (n × n) matrix of rank n − rank{X(s)}). Studentization at least takes
care of one of these issues, namely the unequal variance of the residuals (see
the following sub-section).
The quantity r_i is also referred to as an internally studentized residual, because the estimate σ̂ is based on all n data points. If σ̂²_{−i} is the estimator of σ² obtained with the ith data point removed from the analysis, then

t_i = \frac{\hat{e}_{ols}(s_i)}{\hat{\sigma}_{-i}\sqrt{1 - h_{ii}}}

is called the externally studentized residual.
The quantity h_ii is also called the leverage; it expresses how unusual an observation is in the “X”-space. Data points with high leverage have the potential to be influential on the analysis, but are not necessarily so. In a linear model with uncorrelated errors and an intercept, the leverages are bounded, 1/n ≤ h_ii ≤ 1, and the sum of the leverages equals the rank of X(s). Note that

H = \frac{\partial \hat{Z}(s)}{\partial Z(s)}.
Diagnostic measures based on the sequential removal of data points are particularly simple to compute for linear models with uncorrelated errors. At the heart of the matter is the following powerful result. If X(s)_{−i} is the ((n − 1) × p) regressor matrix with the ith row removed, then the estimates of the fixed effects—if the ith data point is not part of the analysis—are

\hat{\beta}_{ols,-i} = \left(X(s)_{-i}'X(s)_{-i}\right)^{-1} X(s)_{-i}'Z(s)_{-i}.

The difficult component in this calculation is the matrix inversion. Fortunately, we can compute this inverse easily because of the following result (known as the Sherman-Morrison-Woodbury theorem):

\left(X(s)_{-i}'X(s)_{-i}\right)^{-1} = \left(X(s)'X(s)\right)^{-1} + \frac{(X(s)'X(s))^{-1}x(s_i)x(s_i)'(X(s)'X(s))^{-1}}{1 - h_{ii}}.

After some algebra, we are led to a simple expression for the “leave-one-out” estimate of the fixed effects,

\hat{\beta}_{ols,-i} = \hat{\beta}_{ols} - \frac{(X(s)'X(s))^{-1}x(s_i)\hat{e}(s_i)}{1 - h_{ii}}
                     = \hat{\beta}_{ols} - (X(s)'X(s))^{-1}x(s_i)\hat{e}(s_i)_{-i}.

The quantity ê(s_i)_{−i} = Z(s_i) − Ẑ(s_i)_{−i} is known as the PRESS residual (Allen,
1974). The importance of these results for diagnosing the fit of linear models
is that statistics can be computed efficiently based only on the fit of the model
to the full data and that many statistics depend on only a fairly small number
of elementary measures such as leverages and raw residuals. For example, a
PRESS residual is simply

\hat{e}(s_i)_{-i} = Z(s_i) - \hat{Z}(s_i)_{-i} = \frac{\hat{e}(s_i)}{1 - h_{ii}},

and Cook’s D (Cook, 1977, 1979), a measure of the influence of an observation on the estimate β̂, can be written as

D_i = \frac{r_i^2\, h_{ii}}{k(1 - h_{ii})},

where k = rank{X(s)}. The DFFITS statistic of Belsley, Kuh, and Welsch (1980) measures the change in fit in terms of standard error units, and can be written as

DFFITS_i = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}}.

These and many other influence statistics are discussed in the monographs by Belsley, Kuh, and Welsch (1980) and Cook and Weisberg (1982).
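Because each of these diagnostics depends only on the raw residuals and the leverages from the single full-data fit, they can all be computed in a few lines. The following sketch (Python/numpy; all names are ours) does so; the leave-one-out variance estimate σ̂²_{−i} is obtained from the standard closed-form identity rather than by refitting.

```python
import numpy as np

def ols_influence(X, z):
    """Leverages, studentized residuals, PRESS residuals, Cook's D, and
    DFFITS from a single OLS fit, using the closed-form results above."""
    n = X.shape[0]
    k = np.linalg.matrix_rank(X)
    H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix
    h = np.diag(H)                              # leverages h_ii
    e = z - H @ z                               # raw OLS residuals
    sigma2 = e @ e / (n - k)
    r = e / np.sqrt(sigma2 * (1 - h))           # internally studentized
    # leave-one-out variance estimates, without refitting the model
    sigma2_i = ((n - k) * sigma2 - e**2 / (1 - h)) / (n - k - 1)
    t = e / np.sqrt(sigma2_i * (1 - h))         # externally studentized
    press = e / (1 - h)                         # PRESS residuals
    cooks_d = r**2 * h / (k * (1 - h))          # Cook's D
    dffits = t * np.sqrt(h / (1 - h))           # DFFITS
    return {"leverage": h, "studentized": r, "ext_studentized": t,
            "press": press, "cooks_d": cooks_d, "dffits": dffits}
```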

6.1.2 Working with OLS Residuals

Fitted residuals in a statistical model are commonly used to examine the un-
derlying assumptions about the model. For example, a QQ-plot or histogram of the ê_ols(s) is used to check whether it is reasonable to assume a Gaussian distribution; scatter plots of the residuals are used to assess the constant variance assumption or the appropriateness of the mean function. In spatial
models, whether the errors are assumed to be correlated or uncorrelated, an
important question is whether the covariance structure of the model has been
chosen properly. It seems natural, then, to use the fitted residuals to judge
whether the assumed model Var[e(s)] = Σ(θ) appears adequate. When, as
in this section, it is assumed that Σ(θ) = σ 2 I and the model is fit by ordi-
nary least squares, one would use the ' eols (s) to inquire whether there is any
residual spatial autocorrelation. Common devices are a test for autocorrela-
tion based on Moran’s I with regional data and estimation of semivariograms
of the residuals with geostatistical data. To proceed with such analyses in a
meaningful way, the properties of residuals need to be understood.
Recall from the previous section that the “raw” residuals from an OLS fit
are
\hat{e}_{ols}(s) = Z(s) - \hat{Z}(s) = Z(s) - HZ(s) = MZ(s),   (6.4)

where H = X(s)(X(s)'X(s))^{-1}X(s)' is the “hat” (leverage) matrix. Since
we aim to use ê_ols(s) to learn about the unobservable e(s), let us compare their features. First, the elements of e(s) have zero mean, are non-redundant, uncorrelated, and homoscedastic. By comparison, the elements of ê_ols(s) are
• rank deficient: If X(s) is (n × p) of rank k, then H = X(s)(X(s)'X(s))^{-1}X(s)' is of rank k. The fitted values Ẑ(s) are a projection of Z(s) onto a k-dimensional subspace of R^n. As a consequence, only n − k of the OLS residuals carry information about the model disturbances e(s). The remaining k residuals are redundant.
• correlated: The variance of the OLS residual vector is Var[ê_ols(s)] = σ²M, a non-diagonal matrix. The residuals are correlated because they result from fitting a model to data and thus obey certain constraints. For example, X(s)'ê_ols(s) = 0. Fitted residuals exhibit more negative correlations than the model errors.
• heteroscedastic: The leverage matrix H is the gradient of the fitted values with respect to the observed data. A diagonal element h_ii reflects the weight of an observation in determining its predicted value. With the exception of some balanced classification models, the h_ii are not of equal value and the residuals are thus not equi-dispersed. A large residual in a plot of the ê_ols(s_i) against the fitted values may not convey a model breakdown or an outlying observation. A large residual is more likely if Var[ê_ols(s_i)] is large (h_ii small). Furthermore, σ²(1 − h_ii) < σ², since in an OLS model with intercept 1/n ≤ h_ii ≤ 1. Thus, the sill of a semivariogram computed from the ê_ols(s_i) does not reflect the variability of the process (σ²).
The one (only) property the residuals have in common with the model er-
rors is a zero mean. Note that these discrepancies between e'ols (s) and e(s)
have nothing to do with fitting a model to spatial data or with spatial autocor-
relation. They are a mere consequence of fitting a model to data by ordinary
least squares. It is now becoming clear that computing a semivariogram from
residuals in order to understand the autocorrelation pattern of the data can
be a problematic undertaking. This empirical semivariogram estimates the
semivariogram of the residuals, not the semivariogram of e(s). Furthermore,
the “data” for this analysis, e'ols (s), fail to meet an important condition for
a semivariogram analysis: the residual process is not second-order stationary.
Moreover, negative correlations among the residuals may confound spatial au-
tocorrelation. We should have similar reservations about a Moran’s I statistic
based on the e'ols (si ).
At this point, three courses of action emerge:
(i) proceed with the raw residuals but understand the implications and (try
to) interpret the results accordingly;
(ii) manipulate (adjust, transform) the raw residuals in such a way that an
autocorrelation analysis is more meaningful;
(iii) use spatial statistics that take into account the fact that you are work-
ing with quantities that result from a model fit, rather than actual, observed
data.

Most practitioners adopt (i), usually without the caveat and understand-
ably so: it is difficult to interpret the results once we realize all the problems
that can arise when working with raw residuals. One approach to (ii) is to
derive a set of n − k “new” quantities that overcome the problems inherent in
working with raw residuals. This approach is termed error recovery. As for
(iii), spatial statistics typically used for spatial autocorrelation analysis can
be modified for use with residuals. We describe error recovery and some mod-
ifications to spatial autocorrelation statistics for use in OLS residual analysis
in subsequent paragraphs.

6.1.2.1 Error Recovery

Error recovery is the process of generating “variance-covariance” standard-


ized residuals that have zero mean, unit variance, and are uncorrelated. Two
methods can be distinguished for recovering uncorrelated errors, depending
on whether they draw on the projection properties in the fitted model or
on the sequential forecasting of observations. The latter approach gives rise
to recursive or sequential residuals (Brown et al., 1975; Kianifard and Swal-
low, 1996). The process of residual recursion starts with fitting the model
to k = rank(X) data points. The remaining n − k observations are entered
sequentially and the jth recursive residual is the scaled difference between
Z(sj ) and Z '−j (sj ), the predicted value based on the previous observations.
The n − k scaled differences so obtained are the recursive residuals and are
useful in detecting outliers, changes in regression coefficients, heteroscedastic-
ity, and serial correlation (Galpin and Hawkins, 1984; Kianifard and Swallow,
1996).
Error recovery based on projections is also known as Linearly Unbiased Scaled (LUS) estimation and is due to Theil (1971). If ê(s) is an (n × 1) residual vector with variance Var[ê(s)] = A, then to transform the residuals so that they are uncorrelated with unit variance we need to find an (n × n) matrix Q such that

\mathrm{Var}[Q'\hat{e}(s)] = Q'AQ = \begin{bmatrix} I_{(n-k)} & 0_{((n-k)\times k)} \\ 0_{(k\times(n-k))} & 0_{(k\times k)} \end{bmatrix}.

The first n − k elements of Q'ê(s) are the LUS estimates of e(s). Jensen and Ramirez (1999) call these elements the linearly recovered errors (which we refer to as LREs).
In the case of OLS residuals based on a linear model with uncorrelated, homoscedastic errors, A = σ²M, with M defined on page 305. Since M is a projection matrix (symmetric and idempotent), it has a spectral decomposition M = P∆P', where P is orthogonal and ∆ is a diagonal matrix containing the eigenvalues of M. Because of these properties,

\Delta = \begin{bmatrix} I_{(n-k)} & 0_{((n-k)\times k)} \\ 0_{(k\times(n-k))} & 0_{(k\times k)} \end{bmatrix}.

Hence, the matrix Q = P/σ will transform the OLS residuals into LREs that have variance-covariance matrix ∆.
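A minimal sketch of this projection-based recovery (Python/numpy; the function name is ours, and the ordering of the retained eigenvectors is arbitrary, as noted below) is:

```python
import numpy as np

def linearly_recovered_errors(X, z):
    """Recover n - k uncorrelated, unit-variance errors from an OLS fit by
    a spectral decomposition of M = I - H (one of many valid choices of Q)."""
    n = X.shape[0]
    k = np.linalg.matrix_rank(X)
    M = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T
    e = M @ z                                   # raw OLS residuals
    sigma_hat = np.sqrt(e @ e / (n - k))
    evals, P = np.linalg.eigh(M)                # M = P diag(evals) P'
    keep = evals > 0.5                          # eigenvalues of M are 0 or 1
    Q = P[:, keep] / sigma_hat                  # columns of Q = P / sigma
    return Q.T @ e                              # the n - k recovered errors
```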
The process of error recovery by the aforementioned techniques produces
n−k recovered residuals with identity variance-covariance matrix and k “resid-
uals” whose variance is zero. If n is not much larger than k, only a few inde-
pendent residuals can be recovered. In spatial regression models this is usually
not of concern, since the number of observations usually far exceeds the rank
of the X(s) matrix. The linearly recovered errors also do not correspond to
any particular one of the OLS residuals, they are linear combinations of them.
Linearly recovered errors are thus not suited for the identification of outliers
(recursive residuals do not have this drawback). Furthermore, the recovered
errors are not unique. Any orthogonal rotation of an LUS estimate is also an
LUS estimate. And residuals recovered by means of a Cholesky decomposi-
tion depend on the ordering of the data. Residuals recovered by spectral or
singular value decomposition depend on the ordering of the eigenvalues and
corresponding eigenvectors. Note that in the OLS case, where M is a projec-
tion matrix, all non-zero eigenvalues of M are equal to one and the ordering
of the columns of Q corresponding to the non-zero eigenvalues is arbitrary.
In the correlated error case an ordering of the eigenvalues by decreasing size
is common. But there is no reason why recovered errors computed with this
convention in mind should be best in any sense. Jensen and Ramirez (1999)
use linearly recovered errors to test for normality of the model disturbances.
These authors settle the uniqueness issue of the LUS estimates by finding an orthogonal rotation BQ' that maximizes the kurtosis of the recovered errors, making them as “ugly as possible,” to slant diagnostic tests away from nor-
mality. Very little is known about using an analogous approach for assessing
whether there is spatial autocorrelation in the recovered errors.

6.1.2.2 Semivariogram Estimation using OLS Residuals

When working with spatial data, we need to assess whether there is any spatial
variation that has not been accounted for by the model. Such variation can be
due to an omitted spatially-varying covariate or to spatial autocorrelation in
the data, or both. The only real information we have for this assessment comes
from the OLS residuals. If the OLS residuals exhibit spatial patterns, then the
OLS regression model is not adequate to describe the spatial variation in the
data. We note in passing that a simple map of the residuals can be extremely
informative as can other spatial visualization techniques.
One of the most common tools for assessing spatial autocorrelation is the
empirical semivariogram. Unfortunately, the empirical semivariogram com-
puted from the OLS residuals (which we refer to as the residual semivari-
ogram) may not be a good estimate of the semivariogram of the error pro-
cess, e(s). As noted previously, the statistical properties of the two processes
are very different. It is difficult to determine whether any structure (or lack
thereof) apparent in the residual semivariogram is due to spatial autocor-
relation in the error process or to artifacts induced by the rank deficiency,
correlation, and heteroscedasticity among the residuals.

Example 6.1 (Soil carbon regression. Continued) Figure 6.2 displays


the empirical semivariograms of the soil carbon percentages and of the OLS
residuals in a simple linear regression of C% on N%. Both panels display some
structure. In the left panel we do not know whether the spatial structure could
be due to the fact that the average C percentage changes systematically with
soil N%. In other words, the degree to which the empirical semivariogram is
determined by large-scale trend is not discernible. In the right-hand panel it
is not clear to what extent the structure is induced by the properties of OLS
residuals, and to what extent it reflects spatial autocorrelation in the (unob-
servable) model errors. Note the smaller “sill” of the residual semivariogram; the decrease in the sill reflects the systematic variation in C% removed by the regression on N%.


Figure 6.2 Empirical semivariograms of soil carbon (%) (left panel) and OLS resid-
uals of simple linear regression of C% on N%.

There are several options for reducing the potential for such artifacts in
the residual semivariogram. First, we could use the empirical semivariogram
312 SPATIAL REGRESSION MODELS

computed from the studentized residuals to assess residual spatial autocorre-


lation. Although studentized residuals remain correlated and rank-deficient,
as far as computing a semivariogram is concerned, studentized residuals at least meet the needed stationarity properties since they have mean zero and constant variance. This approach is fairly simple since most statistical soft-
ware packages compute studentized residuals. Another, more comprehensive,
solution is to compute the empirical semivariogram of the recursive residuals.
Recursive residuals remove all the problems inherent in the OLS residuals.
Moreover, Grondona (1989) and Grondona and Cressie (1995) show that for
the linear model with uncorrelated errors, the covariance function computed
from recursive residuals is an unbiased estimator of the white noise (or nugget)
covariance function C(h) = σ 2 I[h = 0]. Thus, the empirical semivariogram
computed from recursive residuals should be that of a white noise process
unless there is residual spatial variation. If k = rank(X) is small, it may be
difficult to fit the model to the first k data points needed to begin the recur-
sive process. Thus, alternatively, we can compute the empirical semivariogram
based on the LREs described above. As with the recursive residuals, the em-
pirical semivariogram of the LREs should also resemble that of a white noise
process.
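For illustration, a small sketch (Python with numpy/scipy; the equal-width binning rule is our choice) of the empirical Matheron semivariogram computed from a vector of residuals (studentized, recursive, or recovered) follows.

```python
import numpy as np
from scipy.spatial.distance import pdist

def residual_semivariogram(coords, resid, n_bins=12):
    """Empirical semivariogram of residuals: average of 0.5*(r_i - r_j)^2
    within equal-width distance bins, up to half the maximum distance."""
    resid = np.asarray(resid, dtype=float)
    d = pdist(coords)                                        # pairwise distances
    g = 0.5 * pdist(resid[:, None], metric="sqeuclidean")    # 0.5 (r_i - r_j)^2
    edges = np.linspace(0.0, d.max() / 2.0, n_bins + 1)
    idx = np.digitize(d, edges) - 1
    lags, gamma = [], []
    for b in range(n_bins):
        in_bin = idx == b
        if in_bin.any():
            lags.append(d[in_bin].mean())
            gamma.append(g[in_bin].mean())
    return np.array(lags), np.array(gamma)
```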

Example 6.1 (Soil carbon regression. Continued) For the C/N regres-
sion model we computed the empirical semivariograms of the Best Linearly
Unbiased Scaled estimates (Theil, 1971), the recursive residuals, the studen-
tized OLS residuals, and the raw OLS residuals (Figure 6.3). All four types
of residuals exhibit considerable structure. Based on the semivariograms of
the BLUS and recursive residuals we conclude that the model errors should
be considered spatially correlated. A simple linear regression model is not
adequate for these data. The empirical semivariogram for the OLS residuals
does not appear too different from the other semivariograms. Typically, the
bias in the OLS semivariogram is larger for models with more regression co-
efficients, as the residuals become more constrained. With a single regressor
and an intercept, there are only two redundant residuals and the leverages are
fairly homogeneous. The empirical semivariogram of the studentized residuals
bears a striking resemblance to the semivariograms of the recovered errors in
the top panels of Figure 6.3. Again, this is partly helped by the fact that the
X(s) matrix contains only two columns. On the other hand, it is encouraging
that the simple process of scaling the residuals to equal variance provides as
clear a picture of the spatial autocorrelation as the recovered errors, whose
construction is more involved.
A possible disadvantage of recovered errors is their dependence on data or-
der. What is important is that the residual semivariogram conveys the pres-
ence of spatial structure and the need for a spatial analysis. Figure 6.4 displays
empirical semivariograms of the recursive residuals for 10 random permuta-
tions of the data set. To show the variation among the semivariograms, they
are shown as series plots rather than as scatter plots. Any of the sets is equally
well suited to address the question of residual spatial dependency.
[Figure 6.3 appears here: four panels of empirical semivariograms (semivariance against distance, 0–200) labeled Best Linear Unbiased Scales, Recursive Residuals, Studentized Residuals, and OLS Residuals.]

Figure 6.3 Empirical semivariograms of various residual measures in C–N simple


linear regression.

Relatively little practical work has been done on residual diagnostics for spatial models. In particular, the practical impacts of rank deficiency, corre-
lation and heteroscedasticity among the OLS residuals on inferences drawn
from residual variography are not clearly understood. Also, it is not clear to
what extent the solutions described above are truly effective, and whether or
not they may introduce other problems (e.g., lack of uniqueness in LREs).
Some additional research has been done for more general linear models with
correlated errors and this is discussed in §6.2.3.
Earlier we mentioned three courses of action when working with residuals:
to proceed with an analysis of raw residuals, to compute transformed resid-
uals, and to use adjusted spatial statistics. The careful interpretation of the
semivariogram of OLS residuals is an example of the first action. The semi-
variograms for the recursive, studentized, and BLUS residuals in Figure 6.3
represent the second course. An example of taking into account the fact that statistics are computed from fitted, rather than observed, quantities is the test for autocorrelation in lattice data based on Moran’s I for OLS residuals.

Figure 6.4 Empirical semivariograms of recursive residuals for 10 random orderings


of the data set.

6.1.2.3 Moran’s I with OLS Residuals

In the case of lattice data, it is tempting to compute measures of spatial


autocorrelation—such as Moran’s I or Geary’s c—from the OLS residuals.
The Moran statistic based on an OLS fit is

I_{res} = \frac{n\, \hat{e}_{ols}(s)'W\hat{e}_{ols}(s)}{w_{..}\, \hat{e}_{ols}(s)'\hat{e}_{ols}(s)},   (6.5)

where w_{..} = 1'W1 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij} and W is the spatial connectivity matrix (see §1.3.2). Note that I_{res} can also be written as

I_{res} = \frac{n\, e(s)'MWMe(s)}{w_{..}\, e(s)'Me(s)}.
So far, these expressions are a simple application of the Moran’s I statistic
to ê_ols(s); no adjustment has been made yet for the fact that the “data” are
fitted quantities. We noted earlier that one path to OLS residual analysis
is to apply methods that have been specifically devised to account for the
properties of fitted residuals. This accounting takes place in the next step.
Since the error terms, e(si ), are assumed to be iid Gaussian—under the null
hypothesis of no spatial autocorrelation—Cliff and Ord (1981, pp. 202–203)
show that I_{res} is asymptotically Gaussian (for n → ∞) with mean

E[I_{res}] = -\frac{n}{(n-k)w_{..}}\, \mathrm{tr}\left\{(X(s)'X(s))^{-1}X(s)'WX(s)\right\}
           = \frac{n}{(n-k)w_{..}}\, \mathrm{tr}\{MW\}   (6.6)

and variance

\mathrm{Var}[I_{res}] = \left\{\frac{n^2}{w_{..}^2(n-k)(n-k+2)}\right\} \times \left\{S_1 + 2\,\mathrm{tr}\{G^2\} - \mathrm{tr}\{F\} - \frac{2[\mathrm{tr}\{G\}]^2}{n-k}\right\},   (6.7)

where

S_1 = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(w_{ij} + w_{ji})^2,
F = (X(s)'X(s))^{-1}X(s)'(W + W')^2X(s), and
G = (X(s)'X(s))^{-1}X(s)'WX(s).

Cliff and Ord (1981, p. 200) note that randomizing the residuals does not
provide the appropriate reference set for a permutation test of autocorrelation
based on OLS residuals. They consider only the asymptotic test based on the
asymptotic results in equations (6.6) and (6.7), and an assumption that the
data are Gaussian. Thus, an approximate test of the null hypothesis of no
spatial autocorrelation can be made by comparing the observed value of
z = \frac{I_{res} - E[I_{res}]}{\sqrt{\mathrm{Var}[I_{res}]}}   (6.8)
to the appropriate percentage point of the standard Gaussian distribution.
Note that this test is different from the one described in §1.3.2. For example,
the comparable test for Moran’s I described in (§1.3.2) uses a mean of −(n −
1)−1 to construct the test statistic. If this test statistic is used with OLS
residuals, which can be easily accomplished by importing the residuals into a
software program that computes Moran’s I, the resulting test will be incorrect;
the correct mean is given in (6.6). The same argument applies to the variance.
The computer will not know that you are working with residuals and so cannot
adjust the mean and variance to give the proper test statistic.
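The adjusted test is easy to implement once W is available. The sketch below (Python with numpy/scipy; the function name is ours) computes Ires from (6.5) together with the residual-adjusted moments (6.6) and (6.7) and the z statistic (6.8).

```python
import numpy as np
from scipy import stats

def moran_test_ols_residuals(X, z, W):
    """Moran's I for OLS residuals, (6.5), with the adjusted mean (6.6),
    variance (6.7), and asymptotic z statistic (6.8); W is the spatial
    connectivity matrix of Section 1.3.2."""
    n = X.shape[0]
    k = np.linalg.matrix_rank(X)
    XtX_inv = np.linalg.inv(X.T @ X)
    M = np.eye(n) - X @ XtX_inv @ X.T
    e = M @ z                                        # OLS residuals
    w_dd = W.sum()                                   # w..
    I_res = n * (e @ W @ e) / (w_dd * (e @ e))       # (6.5)
    G = XtX_inv @ X.T @ W @ X
    F = XtX_inv @ X.T @ (W + W.T) @ (W + W.T) @ X
    S1 = 0.5 * ((W + W.T) ** 2).sum()
    E_I = n * np.trace(M @ W) / ((n - k) * w_dd)     # (6.6)
    V_I = (n**2 / (w_dd**2 * (n - k) * (n - k + 2))) * (
        S1 + 2.0 * np.trace(G @ G) - np.trace(F)
        - 2.0 * np.trace(G) ** 2 / (n - k))          # variance as in (6.7)
    z_stat = (I_res - E_I) / np.sqrt(V_I)            # (6.8)
    p_value = 2.0 * stats.norm.sf(abs(z_stat))
    return I_res, z_stat, p_value
```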
As with the semivariogram, we can also compute Moran’s I using the re-
cursive residuals or the LREs from LUS estimation described above. Since
the recursive residuals and the LREs are free of the problems inherent in OLS
residuals, a permutation test can be used. In addition, Cliff and Ord (1981, p.
204) give the moments of I computed from the LREs so that an approximate
z-test can be constructed. However, Cliff and Ord (1981, p. 204) note the
problems with determining which n − k recovered errors to use and the same
concern applies to the use of recursive residuals in this context. They tentatively suggest using the most well-connected observations, those for which \sum_j w_{ij} is largest. Their discussion and recommendation clearly imply
the potential inferential perils that can occur when using recovered errors to
assess spatial autocorrelation.
6.1.3 Spatially Explicit Models

Consider again the linear regression model with uncorrelated errors given in
(6.2). It is a spatial regression model since the dependent variable Z(s), and
the independent variables comprising X(s), are recorded at spatial locations
s1 , . . . , sn . However, for most independent variables, the spatial aspect of the
problem serves only to link Z(s) and X(s). Once the dependent and indepen-
dent variables are linked through location, there is nothing in the analysis
that explicitly considers spatial pattern or spatial relationships. In fact, if we
give you Z(s) and X(s), but simply refer to them as Z and X, you could apply
any and all tools from regression analysis to understand the effect of X on Z.
Moreover, you could move them around in space and still get the same results
(provided you move Z(s) and its corresponding covariates together). Fother-
ingham, Brunsdon, and Charlton (2002) refer to such analyses as aspatial,
a term we find informative. The field of spatial statistics is far from aspatial,
and even in the simple linear model case, there is more that can be done to
use spatial information more explicitly.

6.1.3.1 Local Polynomial and Geographically Weighted Regression

One of the easiest ways to make more use of spatial information and rela-
tionships is to use covariates that are polynomial functions of the spatial
coordinates si = [xi , yi ]! . The trend surface models described in §5.3.1 are
an example of this approach. For example, a linear trend surface uses a first
degree polynomial in [x,y] to describe the spatial variation in the response,
e.g.,
Z(s_i) = \beta_0 + \beta_1 x_i + \beta_2 y_i + \epsilon_i,   \epsilon_i \sim \text{iid } (0, \sigma^2).
Ordinary least squares estimation and inference can be used for the β
parameters. However, such an analysis is not aspatial; X is clearly completely
tied to the spatial locations. Although in §5.3.1, the parameter estimates were
simply a means of obtaining a response surface, the β coefficients themselves
have a spatial interpretation, measuring the strength of large-scale spatial
trends in Z(si ).
The parameter estimates from a trend surface analysis provide a fairly
broad, large-scale interpretation of the spatial variation in Z(si ). However,
they are essentially aspatial, since the model has one set of parameters that
apply everywhere, regardless of spatial location. As discussed in §5.3.2, we
can adapt traditional local polynomial regression to the spatial case by fitting
a polynomial model at any specified spatial location s0 . This model is essen-
tially a spatial version of the local estimation procedures commonly referred
to as LOESS or nonparametric regression, where the covariates are polyno-
mial functions of the spatial coordinates. In traditional applications of LOESS
and nonparametric regression methods where general covariates form X, the
term local refers to the attribute or X-space and not to spatial location. The
weights are functions of xi − x0 , differences in covariate values, and the anal-
ysis is aspatial (Fotheringham et al., 2002, pp. 3–4). What is really needed is
a model that is fit locally in the spatial sense, but allows general covariates
that are not necessarily polynomial functions of the spatial coordinates. The
same general model as that described in §5.3.2 can be used, but with general
covariates

Z(s) = X(s)β(s0 ) + e0 , e0 ∼ (0, σ 2 W(s0 )−1 ), (6.9)


where the weights forming the elements of W(s0 ), W (si , s0 ), determine the
degree to which Z(si ) is allowed to affect the estimate of the relationship
between X(s) and Z(s) at the point of interest s0 . The covariates in X(s) do
not have to be polynomial functions of spatial coordinates. They can be, e.g.,
price, income, elevation, race, etc. However, in contrast to traditional kernel
regression approaches, the local aspect of the model is based on the spatial
locations, not on the covariate values. In contrast to the models for local
estimation described in §5.3.2, the interest lies not in a smoothed response
surface, but in smoothed maps of the βj that describe the spatial variation in
the relationship between xj (s) and Z(s). In geography, this model is known
as geographically weighted regression (Fotheringham et al., 2002).
To avoid unnatural discontinuities in the parameter estimates, the spatial
weight function has to be chosen in such a way that the influence of an obser-
vation decreases with increasing distance from s0 . The kernel weight functions
given in §5.3.2 are some examples of weights that satisfy this condition. Cressie
(1998) and Fotheringham et al. (2002) provide more general discussions on
the choice of spatial function and additional alternatives.
Once the spatial weight function is selected, β(s0 ) can be estimated by
weighted least squares
\hat{\beta}(s_0) = \left(X(s)'W(s_0)X(s)\right)^{-1} X(s)'W(s_0)Z(s),

and the variance of the parameter estimates is

\mathrm{Var}[\hat{\beta}(s_0)] = \sigma^2 CC',   C = \left(X(s)'W(s_0)X(s)\right)^{-1} X(s)'W(s_0).

If β is estimated at the sample locations s_i, then the fitted values are given by Ẑ(s) = LZ(s), where the ith row of L is x(s_i)'(X(s)'W(s_i)X(s))^{-1}X(s)'W(s_i). The variance component, σ², can be estimated from the residuals of this fit using (Cressie, 1998)

\hat{\sigma}^2 = \frac{(Z(s) - \hat{Z}(s))'(Z(s) - \hat{Z}(s))}{\mathrm{tr}\{(I - L)(I - L)'\}}.

Cressie (1998) gives more details on how this local model can be used in
geostatistical analysis and Fotheringham et al. (2002) provide many practical
examples of how this model, and various extensions of it, can be used in
geographical analysis.
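As a computational illustration, the sketch below (Python/numpy) obtains the local estimate at a single location s0 with a Gaussian kernel weight; the kernel, bandwidth, and function names are our choices and are not prescribed by the method. Mapping the coefficients amounts to calling the function at each sample location and plotting the components of β̂(si) over space.

```python
import numpy as np

def gwr_at(s0, coords, X, z, bandwidth):
    """Geographically weighted regression estimate at location s0 using a
    Gaussian kernel weight W(s_i, s0) = exp(-0.5 * (d_i / bandwidth)^2)."""
    d = np.sqrt(((coords - s0) ** 2).sum(axis=1))   # distances ||s_i - s0||
    w = np.exp(-0.5 * (d / bandwidth) ** 2)         # diagonal of W(s0)
    XtW = X.T * w                                   # X(s)'W(s0)
    A = XtW @ X                                     # X(s)'W(s0)X(s)
    beta_s0 = np.linalg.solve(A, XtW @ z)           # local WLS estimate
    C = np.linalg.solve(A, XtW)                     # Var[beta(s0)] = sigma^2 C C'
    return beta_s0, C
```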
6.1.3.2 Models with Neighbor Adjustments—Papadakis Analysis

Models with neighbor adjustments originated in agricultural experimentation to cope with the problem of spatial trends at scales different from the size of blocks, in particular to obtain valid inferences in the presence of soil fertility trends in large variety trials. Starting from a classification model that cap-
tures the treatment and design structure—often the design-based model—the
model is changed to an analysis of covariance. The new covariates are func-
tions of residuals between neighboring sites, assumed to capture the salient
spatial and treatment effects, so that unbiased (and precise) comparisons of
treatments can be made. The process can be illustrated nicely with an impor-
tant representative in this class of models, the Papadakis model (Papadakis,
1937).
Suppose an experimental area is a two-dimensional layout of r rows and c
columns. The rc experimental units are grouped into b blocks. The treatments
T1 , · · · , Tt are assigned at random to the units in each block so that treatment
Ti is replicated ni times within a block. Most common is the randomized
complete block design for which ni = 1, ∀i. The design-based linear model for
the analysis of this experiment can be written as

Z(i, j)_{kl} = \mu + \rho_k + \tau_l + \epsilon_{kl},   (6.10)
i = 1, \ldots, r;\; j = 1, \ldots, c;\; k = 1, \ldots, b;\; l = 1, \ldots, t,

where ρk is the effect of the kth block and τl is the effect of the lth treatment
(Hinkelmann and Kempthorne, 1994). The somewhat unusual notation is used
to identify blocks and treatments (indices k and l), as well as lattice positions.
The Papadakis analysis is essentially an analysis of covariance where the
block effects in (6.10) are replaced by functions of OLS residuals in the model
Z(i, j)_{kl} = µ + τ_l + ε_{kl}. Because the residuals are based on a different model,
the analysis involves several steps. In the first step the model with only treat-
ment effects is fit and the residuals

\hat{\epsilon}(i, j) = Z(i, j)_{kl} - \hat{E}[Z(i, j)_{kl}] = Z(i, j)_{kl} - \bar{Z}_k

are computed. Here, Z̄_k denotes the arithmetic average of treatment k,


the treatment that was assigned to the plot in row i, column j. Based on
these residuals, and a definition of which plots constitute spatial neighbors,
covariates are computed. For example, Stroup, Baenziger, and Mulitze (1994)
define the East-West and North-South differences
x_1(i, j) = \frac{1}{2}\left(\hat{\epsilon}(i, j-1) + \hat{\epsilon}(i, j+1)\right)
x_2(i, j) = \frac{1}{2}\left(\hat{\epsilon}(i-1, j) + \hat{\epsilon}(i+1, j)\right).
These adjustments then replace the block effects in (6.10). A Papadakis model
for these data could be


Z(i, j)_{kl} = \beta_0 + \tau_l + \beta_1 x_1(i, j) + \beta_2 x_2(i, j) + \epsilon^*_{kl}.

There are many ambiguities in this analysis that the user needs to resolve. In a two-dimensional layout, you can define a single covariate adjusting in both directions, or separate covariates in each direction. You need to decide how to handle edge effects, that is, experimental units near the boundary of the experimental area. You need to decide whether to include only immediate neighbors in the adjustments or to extend to second- or higher-order differences. The analysis can be non-iterative (as described above) or iterative; in the iterative case the residuals from the analysis of covariance model are used to recompute the covariates and the process continues until changes are sufficiently small.
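To make the construction concrete, the following sketch (Python/numpy; non-iterative, with edge plots simply using whichever neighbors exist, which is only one of several possible edge conventions) computes the two covariates from the treatment-only residuals arranged on the r × c grid. The covariates x1 and x2 would then enter the analysis of covariance in place of the block effects, as in the Papadakis model above.

```python
import numpy as np

def papadakis_covariates(resid_grid):
    """East-West (x1) and North-South (x2) Papadakis covariates from an
    (r x c) grid of residuals of the treatment-only model."""
    padded = np.pad(resid_grid.astype(float), 1,
                    mode="constant", constant_values=np.nan)
    west, east = padded[1:-1, :-2], padded[1:-1, 2:]
    north, south = padded[:-2, 1:-1], padded[2:, 1:-1]
    # averages of the available neighboring residuals (edge NaNs are ignored)
    x1 = np.nanmean(np.stack([west, east]), axis=0)
    x2 = np.nanmean(np.stack([north, south]), axis=0)
    return x1, x2
```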
Since the covariates in the Papadakis analysis are linear functions of the
responses Z(s), the analysis essentially accounts for local linear trends. This is
one of the reasons why mean adjustments with this analysis tend to be more
moderate than in models that add trend surface components, which tend to
have a higher degree of the polynomial. Overfit trend surface models can
produce large local adjustments to the treatment means (Brownie, Bowman,
and Burton, 1993).
It is instructive to view the Papadakis analysis in a slightly different light. The OLS residuals from which the covariates are constructed are linear functions of the data in the model Z(s) = Xτ + ε, namely ε̂ = MZ(s), M = I − X(X'X)^{-1}X'. Consider the case of a single Papadakis covariate. Because τ̂_ols = (X'X)^{-1}X'Z(s), the resulting model can be written as

Z(s) = X\tau + \beta A\hat{\epsilon} + \epsilon^*
     = X\tau + \beta AMZ(s) + \epsilon^*
     = X\tau + \beta A(Z(s) - X\hat{\tau}_{ols}) + \epsilon^*.
The matrix A determines how the neighboring residuals are averaged. One of the shortcomings of the Papadakis analysis is that the covariates MZ(s) are assumed fixed; only ε* is treated as a random component on the right-hand side. Furthermore, the elements of ε* are considered uncorrelated and homoscedastic. If the variation in Z(s) were taken into account on the right-hand side, a very different correlation model would result. But if we can accommodate Z(s) on the right-hand side as a random variable, then there is no need to rely on the OLS fit τ̂_ols in the first place. We could then fit a model of the form
Z(s) = X\tau + \beta W(Z(s) - X\tau) + \epsilon^*.   (6.11)
The matrix A has been replaced by the matrix W, because how you in-
volve the model errors may be different from how you use fitted residuals
for adjustments. The important point is that in equation (6.11) the response
is regressed on its own residual. This model has autoregressive form; it is a simultaneous autoregressive model (§6.2.2.1), a special case of a correlated error model. Studies such as those by Zimmerman and Harville (1991), Brownie,
Bowman, and Burton (1993), Stroup, Baenziger, and Mulitze (1994), and
Brownie and Gumpertz (1997) have shown that correlated error models typ-
ically outperform trend surface models and neighbor-adjusted models with
uncorrelated errors. The importance of models with neighbor adjustments
lies in their connection to autoregressive models, and thus as stepping stones
to other correlated error models.

6.1.3.3 First Difference Model

The idea of adding polynomial trends in spatial coordinates or nearest neigh-


bor residuals in the Papadakis analyses is that suitable adjustments to the
mean function justify a model with uncorrelated errors. A different route,
also leading to a model with uncorrelated errors, was taken by Besag and
Kempton (1986), who described a model for spatially arranged experimental
designs. They consider field experiments arranged in adjacent columns, where
the number of rows is much larger than the number of columns (which may be
1). In other words, these layouts are rectangular and elongated. In this case,
adjustments for neighboring plots are likely to be made within the row only,
partly because plants within a row are often planted more closely than plants
across rows. We focus on a single column for the time being; the extension to multiple rows and columns is straightforward. The first difference model is
formed from
Z(s) = Xτ + e(s), e(s) ∼ (0, σ 2 Σ). (6.12)
If Z(i) is the response value in row i, then it is assumed that Var[Z(i) −
Z(i + 1)] = σ² and Cov[Z(i) − Z(i + 1), Z(k) − Z(k + 1)] = 0, i ≠ k. In
other words, the first differences between observations in the same column are
homoscedastic and uncorrelated. The model (6.12) can then be transformed
to a first-difference model by multiplying both sides of the model (from the
left) with the matrix

\Delta = \begin{bmatrix} 1 & -1 & 0 & \cdots & 0 \\ 0 & 1 & -1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & -1 \end{bmatrix},   (6.13)
The resulting model is an OLS model,

\Delta Z(s) = \Delta X\tau + \Delta e(s)
Z(s)^* = X^*\tau + e(s)^*,   e(s)^* \sim (0, \sigma^2 I),

from which the treatment effects and their precision can be estimated as

\hat{\tau} = (X^{*\prime}X^*)^{-1} X^{*\prime} Z(s)^*
\widehat{\mathrm{Var}}[\hat{\tau}] = \hat{\sigma}^2 (X^{*\prime}X^*)^{-1}
\hat{\sigma}^2 = \frac{1}{n - \mathrm{rank}\{X\}} (Z(s)^* - X^*\hat{\tau})'(Z(s)^* - X^*\hat{\tau}).
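In code, the differencing and the subsequent OLS fit are immediate. The sketch below (Python/numpy; the function name is ours) constructs ∆ as in (6.13) and estimates the treatment effects from the differenced data; a least-squares solver that tolerates rank deficiency is used because differencing annihilates the intercept column.

```python
import numpy as np

def first_difference_fit(X, z):
    """Besag-Kempton first-difference analysis for a single column of plots:
    premultiply by Delta of (6.13), then apply OLS to the transformed data."""
    n = len(z)
    Delta = np.eye(n - 1, n) - np.eye(n - 1, n, k=1)   # rows (1, -1, 0, ...), ...
    X_star, z_star = Delta @ X, Delta @ z
    tau_hat, *_ = np.linalg.lstsq(X_star, z_star, rcond=None)
    resid = z_star - X_star @ tau_hat
    df = (n - 1) - np.linalg.matrix_rank(X_star)
    sigma2_hat = resid @ resid / df
    cov_tau = sigma2_hat * np.linalg.pinv(X_star.T @ X_star)
    return tau_hat, sigma2_hat, cov_tau
```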
The importance of the Besag-Kempton model lies in its connection to gen-


eralized least squares estimation in a correlated error model. In the initial
model (6.12) the errors are correlated, but a matrix ∆ is known a priori such that ∆Σ∆' = I. To bring out the connection more clearly, take a correlated
error model with general linear mean function,
Z(s) = X(s)β + e(s), e(s) ∼ (0, σ 2 Σ).
If the (correlation) matrix Σ is known, the generalized least squares estimates
of β can be obtained by applying ordinary least squares to a suitably trans-
formed model. Let L be a matrix such that LL' = Σ. If Σ is positive definite, for example, L can be chosen as its lower triangular Cholesky root, or L can be constructed based on a singular value decomposition. Then L^{-1}ΣL'^{-1} = I, and the model can be transformed to
L^{-1}Z(s) = L^{-1}X(s)\beta + L^{-1}e(s)
Z(s)^* = X(s)^*\beta + e(s)^*,   e(s)^* \sim (0, \sigma^2 I).
The ordinary least squares estimates in the transformed model are the gener-
alized least squares estimates in the correlated error model:
\hat{\beta}_{gls} = (X(s)^{*\prime}X(s)^*)^{-1} X(s)^{*\prime}Z(s)^*
               = (X(s)'L'^{-1}L^{-1}X(s))^{-1} X(s)'L'^{-1}L^{-1}Z(s)
               = (X(s)'\Sigma^{-1}X(s))^{-1} X(s)'\Sigma^{-1}Z(s).

The first difference matrix ∆, when applied to model (6.12), plays the role
of the inverse “square root” matrix L−1 . It transforms the model into one
with uncorrelated errors. In the correlated error model, the transformation
matrix L−1 is known because Σ is known. In the first-difference approach
we presume knowledge about the transformation directly, at least up to a
multiplicative constant. Note also that the differencing process produces a model for n − 1, rather than n, observations. The reality of fitting models with correlated errors is that Σ is unknown, or at least known only up to some parameter vector θ. Thus, the square root matrix is also unknown.
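When Σ is known, the transformation argument translates directly into a computation: whiten the data with the inverse Cholesky root and apply OLS. A brief sketch (Python/numpy; names ours):

```python
import numpy as np

def gls_by_whitening(X, z, Sigma):
    """GLS as OLS on transformed data: with L the lower Cholesky root of
    Sigma, premultiplication by L^{-1} yields uncorrelated errors."""
    L = np.linalg.cholesky(Sigma)          # Sigma = L L'
    X_star = np.linalg.solve(L, X)         # L^{-1} X(s)
    z_star = np.linalg.solve(L, z)         # L^{-1} Z(s)
    beta_gls, *_ = np.linalg.lstsq(X_star, z_star, rcond=None)
    return beta_gls
```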

6.2 Linear Models with Correlated Errors

In this section, we continue to assume a linear relationship between the out-


come variable and the covariates, but in addition to the spatially-varying
covariates, the components of the error term may now be spatially-correlated.
Thus, we assume the more general model of equation (6.1).
In addition to there being numerous estimation techniques for the covari-
ance parameters θ, there are different modeling approaches that can be used
to describe the dependence of Σ(θ) on θ. We discuss two approaches in this
chapter. The first is based on the models for the covariance function described
in §4.3 and applies to spatially-continuous (geostatistical) data. The second
approach is based on the spatial proximity measures described in §1.3 and ap-
plies to regional data. In what follows, we first focus on geostatistical models
for Σ(θ) and their use in linear regression models with spatially correlated
errors. The approach for regional data will be described in §6.2.2. Our con-
cern now is primarily in estimates of θ for inference about β, rather than in
prediction of the Z(s) process.
If θ is known, the generalized least squares estimator

\hat{\beta}_{gls} = \left(X(s)'\Sigma(\theta)^{-1}X(s)\right)^{-1} X(s)'\Sigma(\theta)^{-1}Z(s)   (6.14)

can be used to estimate β. In order to use this estimator for statistical infer-
ence, we need to be sure it is a consistent estimator of β and then determine
its distributional properties. If we assume the errors, e(s), follow a Gaussian
distribution, then the maximum likelihood estimator of β is equivalent to the
generalized least squares estimator. Thus, the generalized least squares estimator is consistent for β and β̂_gls ∼ G(β, (X(s)'Σ(θ)^{-1}X(s))^{-1}). However, if the errors are not Gaussian, these properties are not guaranteed. Consistency of β̂_gls depends on the X(s) matrix and the covariance matrix Σ(θ). One condition that will ensure the consistency of β̂_gls for β is

\lim_{n \to \infty} \frac{X(s)'\Sigma(\theta)^{-1}X(s)}{n} = Q,   (6.15)
where Q is a finite, nonsingular matrix (see, e.g., Judge et al., 1985, p. 175).
The asymptotic properties of β̂ gls are derived from the asymptotic properties
of
\sqrt{n}(\hat{\beta}_{gls} - \beta) = \left[\frac{X(s)'\Sigma(\theta)^{-1}X(s)}{n}\right]^{-1} \frac{X(s)'\Sigma(\theta)^{-1}e(s)}{\sqrt{n}}.
Most central limit theorems given in basic statistics books are not directly relevant to this problem since X(s)'Σ(θ)^{-1}e(s) is not a sum of independent and identically distributed random variables. One result that can be applied here is the Lindeberg-Feller central limit theorem, which can be used to show that (Schmidt, 1976)

\sqrt{n}(\hat{\beta}_{gls} - \beta) \xrightarrow{d} G(0, Q^{-1}).

Thus, if the condition in (6.15) and some added regularity conditions for
central limit theorems are satisfied, then
\hat{\beta}_{gls} \overset{\cdot}{\sim} G(\beta, (X(s)'\Sigma(\theta)^{-1}X(s))^{-1}).

If θ is unknown, iteratively reweighted generalized least squares can be used


to estimate β and θ in the general model of equation (6.1) (see §5.5.1). This
gives an estimated generalized least squares estimator of the fixed effects β,

\hat{\beta}_{egls} = \left(X(s)'\Sigma(\hat{\theta})^{-1}X(s)\right)^{-1} X(s)'\Sigma(\hat{\theta})^{-1}Z(s),   (6.16)

where θ̂ is an estimator of the covariance parameters, obtained by the methods
in §5.5 in the case of a spatially varying mean (and by the methods in §4.5–§4.6
in the case of a constant mean). As in Chapter 5, we need to be concerned with
the effects of using estimated covariance parameters in plug-in expressions
such as (6.16).
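As an illustration of the plug-in step, the sketch below (Python with numpy/scipy; the function name is ours) evaluates (6.16) under an exponential covariance model of the kind used later for the soil carbon example, assuming σ̂² and θ̂ have already been obtained by the methods referenced above.

```python
import numpy as np
from scipy.spatial.distance import cdist

def egls_fit(coords, X, z, sigma2_hat, theta_hat):
    """Plug-in estimated GLS estimator (6.16) under an exponential covariance,
    Cov[e(s_i), e(s_j)] = sigma2 * exp(-||s_i - s_j|| / theta). The covariance
    parameters are assumed to have been estimated beforehand (e.g., Sec. 5.5)."""
    D = cdist(coords, coords)                            # pairwise distances
    Sigma_hat = sigma2_hat * np.exp(-D / theta_hat)      # Sigma(theta_hat)
    Si_X = np.linalg.solve(Sigma_hat, X)                 # Sigma^{-1} X(s)
    Si_z = np.linalg.solve(Sigma_hat, z)                 # Sigma^{-1} Z(s)
    beta_egls = np.linalg.solve(X.T @ Si_X, X.T @ Si_z)
    cov_beta = np.linalg.inv(X.T @ Si_X)                 # plug-in Var(beta_egls)
    return beta_egls, cov_beta
```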
Derivation of the distributional properties of β̂_egls is difficult because Σ(θ̂) and e(s) will be correlated. Thus, we again turn to asymptotic results. If, in addition to the condition in (6.15), the conditions

\lim_{n \to \infty} n^{-1} X(s)'[\Sigma(\hat{\theta}) - \Sigma(\theta)]X(s) \xrightarrow{p} 0

and

\lim_{n \to \infty} n^{-1/2} X(s)'[\Sigma(\hat{\theta}) - \Sigma(\theta)]e(s) \xrightarrow{p} 0

hold, then β̂_egls has the same limiting distribution as β̂_gls and so

\sqrt{n}(\hat{\beta}_{egls} - \beta) \xrightarrow{d} G(0, Q^{-1})

(Theil, 1971; Schmidt, 1976; Judge et al., 1985, p. 176). Thus, if these condi-
tions hold,
\hat{\beta}_{egls} \overset{\cdot}{\sim} G(\beta, (X(s)'\Sigma(\theta)^{-1}X(s))^{-1}).

In many cases, these conditions are met if θ̂ is a consistent estimator of θ. However, the consistency of θ̂ is not sufficient; sufficient conditions are given
in Fuller and Battese (1973). Fuller and Battese show that these conditions
are met for some commonly nested error structure models and Theil (1971)
gives conditions for the AR(1) time series process.
We can bypass the checking of these conditions by using likelihood methods
to estimate β and θ simultaneously. We discussed ML and REML estimation
in §5.5.2, and the variance-covariance matrix of these estimators can be ob-
tained from the information matrix given in (5.47). However, even if we use ML or REML, or if we are satisfied that the asymptotic properties of the estimated generalized least squares estimator are valid, in practice inference
will require estimation of

\mathrm{Var}(\hat{\beta}) = (X(s)'\Sigma(\theta)^{-1}X(s))^{-1}.

The consequences of using the plug-in expression (X(s)'Σ(θ̂)^{-1}X(s))^{-1} as an estimator of Var(β̂) are discussed in §6.2.3.1.

Example 6.1 (Soil carbon regression. Continued) By studying empirical


semivariograms of various types of residuals, we convinced ourselves that the
model errors in the regression of soil C% on soil N% are correlated (see Figure
6.3 on page 313). Based on these semivariograms we now propose the following
model:
C(si )|N (si ) = β0 + β1 N (si ) + e(si )
e(si ) ∼ G(0, σ 2 )
Cov[e(si ), e(sj )] = σ 2 exp{−||si − sj ||/θ}, (6.17)
a spatial regression model with soil N% as the regressor and an exponential


covariance structure. Later in this chapter we will diagnose whether this is the
appropriate spatial covariance structure for these data. Table 6.1 displays the
estimates for the fixed effects and covariance parameters under the assumption
that the errors are uncorrelated and in the spatial models. At first glance the
estimates of the fixed effects and their standard errors do not appear to differ
much between a model with uncorrelated errors (OLS in Table 6.1) and the
REML/ML fits of the correlated error model. The estimates of the mean soil
carbon percentage will be similar. This should not be too surprising, since the
OLS estimates are unbiased as long as the model errors have zero mean. You
do not incur bias in OLS estimation because of correlations of the model errors.
However, the OLS estimates can be highly inefficient if data are correlated.

Table 6.1 Estimated parameters for C–N regression. Covariance structure for REML
and ML estimation is exponential. Independent error assumption for OLS estimation.

                   OLS                  REML                  ML
Parameter    Est.      Std. Err.   Est.      Std. Err.   Est.      Std. Err.
β0          −0.0078    0.0202      0.0064    0.0228      0.0058    0.0226
β1          10.900     0.2652      10.733    0.2960      10.738    0.2936
θ              —          —        14.128    3.0459      13.661    2.9172
σ²           0.0016    0.0001      0.0017    0.0002      0.0016    0.0002

Furthermore, a cursory glance at parameter estimates for the fixed effects


and their standard errors does not tell us whether the inclusion of correlated
errors improved the model. We need a formal method to compare the models.
We are revisiting the data in Table 6.1 in §6.2.3.1 where these formal methods
are discussed.
Note that the standard errors of the OLS estimates in Table 6.1 are smaller
than those for the correlated error model. This is one indication of the bias in
estimating the precision of the estimates. OLS tends to overstate the precision
of the estimates.
An important difference between the model with uncorrelated errors and the
model with correlated errors lies in the prediction of soil carbon values. In the
spatial model a prediction of soil carbon at one of the sampled locations reproduces
the observed value. In the OLS model the prediction equals the estimate of the
mean. Predictions of soil C% are spatially sensitive in the correlated error
model. Predictions in the OLS model are spatially sensitive only to the extent
that the regressor, soil N%, varies spatially.

6.2.1 Mixed Models

In the previous section, any variation not explained by the parametric mean
function, X(s)β, was assumed to be unstructured, random, spatial variation.
However, in many applications, the variation reflected in e(s) may have a
systematic component. For example, in a randomized complete block design,
it can be advantageous to separate out the variation explained by the blocking
and not simply lump this variation into a general error term. Thus, we can
consider mixed models that contain both fixed and random effects.
The general form of a linear mixed model (LMM) is

Z(s) = X(s)β + U(s)α + ε(s),   (6.18)

where α is a (K × 1) vector of random effects with mean 0 and variance G. The
vector of model errors ε(s) is independent of α and has mean 0 and variance
R. Our inferential goal is now more complicated. In addition to estimators of
the fixed effects, β, and any parameters characterizing R, we will also need
a predictor of the random effects, α, as well as estimators of any parameters
characterizing G.
The mixed model (6.18) can be related to a signal model (see §2.4.1),

Z(s) = S(s) + ε(s)
S(s) = X(s)β + W(s) + η(s),

so that U(s)α corresponds to W(s) + η(s), the smooth-scale and micro-scale
components. For the subsequent discussion, we combine these two components
into υ(s), so that (6.18) is a special case of Z(s) = X(s)β + υ(s) + ε(s). The
various approaches to spatial modeling that draw on linear mixed model tech-
nology differ in how U(s)α is constructed, and in their assumptions regarding
G and R.
But first, let us return to the general case and assume that G and R are
known. The mixed model equations of Henderson (1950) are a system of equations
whose solution yields β̂ and α̂, the estimates of the fixed effects and
predictors of the random effects. The mixed model equations can be derived
using a least squares criterion and augmenting the traditional β vector with
the random effects vector α. Another derivation of the mixed model equations,
which we present here, commences by specifying the joint likelihood of [α, ε(s)]
and maximizing it with respect to β and α. Under a Gaussian assumption for
both random components, this joint density is

f(α, ε(s)) = (2π)^{-(n+K)/2} \left| \begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix} \right|^{-1/2}
\exp\left\{ -\frac{1}{2} \begin{bmatrix} α \\ Z(s) − X(s)β − U(s)α \end{bmatrix}'
\begin{bmatrix} G & 0 \\ 0 & R \end{bmatrix}^{-1}
\begin{bmatrix} α \\ Z(s) − X(s)β − U(s)α \end{bmatrix} \right\},

and it is maximized when

Q(β, α) = (Z(s) − X(s)β − U(s)α)'R^{-1}(Z(s) − X(s)β − U(s)α) + α'G^{-1}α   (6.19)

is minimized. The resulting system of equations is known as Henderson's
mixed model equations,

\begin{bmatrix} X(s)'R^{-1}X(s) & X(s)'R^{-1}U(s) \\ U(s)'R^{-1}X(s) & U(s)'R^{-1}U(s) + G^{-1} \end{bmatrix}
\begin{bmatrix} β̂ \\ α̂ \end{bmatrix} =
\begin{bmatrix} X(s)'R^{-1}Z(s) \\ U(s)'R^{-1}Z(s) \end{bmatrix}.

The solutions are

β̂ = (X(s)'V^{-1}X(s))^{-1} X(s)'V^{-1}Z(s)   (6.20)
α̂ = GU(s)'V^{-1}(Z(s) − X(s)β̂)   (6.21)
V = U(s)GU(s)' + R.

The estimator for the fixed effects given in equation (6.20) is just the generalized
least squares estimator of β obtained using Var[Z(s)] = V. Also, it
can be shown that α̂ is the best linear unbiased predictor (BLUP) of the random
effects, α, under squared-error loss. Note that when there are no random
effects, the linear mixed model reduces to that given in equation (6.1) and
discussed earlier.
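A compact Python sketch of the computations in (6.20) and (6.21) for given G, R, X(s), U(s), and Z(s) follows; forming V explicitly and solving linear systems directly is an illustrative choice made for clarity rather than efficiency, and the simulated inputs are assumptions for demonstration only.

```python
import numpy as np

def mixed_model_solutions(X, U, z, G, R):
    """Fixed-effect estimate (6.20) and random-effect BLUP (6.21)."""
    V = U @ G @ U.T + R                                   # Var[Z(s)] = U G U' + R
    Vi_X = np.linalg.solve(V, X)
    Vi_z = np.linalg.solve(V, z)
    beta_hat = np.linalg.solve(X.T @ Vi_X, X.T @ Vi_z)    # GLS estimator of beta
    alpha_hat = G @ U.T @ np.linalg.solve(V, z - X @ beta_hat)  # BLUP of alpha
    return beta_hat, alpha_hat

# Illustrative dimensions only; G and R would come from the assumed model
rng = np.random.default_rng(4)
n, K = 60, 5
X = np.column_stack([np.ones(n), rng.normal(size=n)])
U = rng.normal(size=(n, K))
G, R = 0.5 * np.eye(K), 1.0 * np.eye(n)
z = X @ np.array([1.0, 2.0]) + U @ rng.normal(scale=np.sqrt(0.5), size=K) \
    + rng.normal(size=n)
beta_hat, alpha_hat = mixed_model_solutions(X, U, z, G, R)
```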
In practical applications, V is parameterized as V(θ) and the elements of
θ are estimated by one of the methods discussed previously. It is common
in mixed model analyses to follow the assumption of Gaussian distributed
errors and a Gaussian distribution for α and to estimate θ by maximum or
restricted maximum likelihood. The estimators for the fixed effects then take
on the EGLS form

β̂_ml = Ω(θ̂_ml) X(s)'V(θ̂_ml)^{-1} Z(s)
β̂_reml = Ω(θ̂_reml) X(s)'V(θ̂_reml)^{-1} Z(s)

and the predictors become

α̂_ml = Σ_S(θ̂_ml) V(θ̂_ml)^{-1} (Z(s) − X(s)β̂_ml)
α̂_reml = Σ_S(θ̂_reml) V(θ̂_reml)^{-1} (Z(s) − X(s)β̂_reml).
However, it is much easier to write this than it is to accomplish this mod-
eling in practice. Typically, statisticians think of modeling G and R, rather
than modeling V. In the spatial case, it will generally be difficult to model
both of these as parametric functions of a smaller number of parameters; it
is difficult for any statistical model fitting algorithm to distinguish between
spatially structured random effects (U(s)α, with Var[α] ≡ G(θ_α)) and spa-
tially autocorrelated errors (Var[ε(s)] = R(θ_ε)). In subsequent sections, we
describe several special cases of the general linear mixed model that can be
used for modeling spatial data.

6.2.1.1 Penalized Splines and Low-rank Smoothers

An important special case of the linear mixed model arises when R = σ_ε²I
and G = σ²I, a variance component model. This is a particularly simple
mixed model, and fast algorithms are available to estimate the parameters
σ² and σ_ε²; for example, the modified W-transformation of Goodnight and
Hemmerle (1979). The linear mixed model criterion (6.19) now becomes

Q(β, α) = σ_ε^{-2}(Z(s) − X(s)β − U(s)α)'(Z(s) − X(s)β − U(s)α) + σ^{-2}α'α
        = σ_ε^{-2}||Z(s) − X(s)β − U(s)α||² + σ^{-2}||α||².   (6.22)
This expression is very closely related to the objective function minimized
in spline smoothing. In this subsection we elicit this connection and provide
details on the construction of the spatial design or regressor matrix U(s). The
interested reader is referred to the text by Ruppert, Wand, and Carroll (2003),
on which the exposition regarding splines and radial smoothers is based.
The problem of modeling the mean function in Yi = f(xi) + εi has many
statistical solutions. Among the nonparametric ones are scatterplot smoothers
based on splines. A spline model essentially decomposes f(x) into an "overall"
mean component and a linear combination of piecewise functions. For
example, define the truncated line function

(x − t)_+ = \begin{cases} x − t & x > t \\ 0 & \text{otherwise.} \end{cases}

The functions 1, x, (x − t1)_+, · · · , (x − tK)_+ are linear spline basis functions, and
their linear combinations are called splines (Ruppert, Wand, and Carroll,
2003, p. 62). The points t1, · · · , tK are the knots of the spline. Ruppert et al.
(2003) term a smoother low-rank if the number of knots is considerably
less than the number of data points; if K ≈ n, the smoother is termed
full-rank.
A linear spline model, for example, uses the linear spline basis functions

f(x) = β0 + β1 x + Σ_{j=1}^{K} αj (x − tj)_+.   (6.23)

This model and its basis functions are a special case of the more general power
spline of degree p,

f(x) = β0 + β1 x + · · · + βp x^p + Σ_{j=1}^{K} αj ((x − tj)_+)^p.

We have used the symbol α for the spline coefficients to initiate the connection
between spline smoothers and the mixed model in equation (6.18).
We have not yet decided whether to treat the coefficients as random quantities, however.
Whether α is fixed or random, we can write the observational model for (6.23)
as

y = Xβ + Uα + ε,

where

X = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}
\qquad
U = \begin{bmatrix} (x_1 − t_1)_+ & \cdots & (x_1 − t_K)_+ \\ \vdots & & \vdots \\ (x_n − t_1)_+ & \cdots & (x_n − t_K)_+ \end{bmatrix}.

In order to fit a spline model to data, penalties are imposed that prevent
the fit from being too variable. A penalty criterion that restricts the variation
of the spline coefficients leads to the minimization of

Q*(β, α) = ||y − Xβ − Uα||² + λ²||α||²,   (6.24)

where λ > 0 is the smoothing parameter.
The connection between spline smoothing and linear mixed models is becoming
clear. If β̂, α̂ minimize (6.24), then they also minimize

Q*(β, α)/σ_ε² = Q(β, α),

the mixed model criterion (6.19) that led to Henderson's mixed model equations,
with λ² = σ_ε²/σ².
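To illustrate the correspondence, the following Python sketch builds the truncated-line basis for a small one-dimensional example and computes the penalized least squares fit (6.24) by augmenting the design matrix, which is the same ridge-type system solved by the mixed model equations when λ² = σ_ε²/σ². The knot placement, data, and value of λ are illustrative assumptions.

```python
import numpy as np

def linear_spline_basis(x, knots):
    """Columns (x - t_k)_+ of the random-effects regressor matrix U."""
    return np.maximum(x[:, None] - knots[None, :], 0.0)

def penalized_spline_fit(x, y, knots, lam):
    """Minimize ||y - X b - U a||^2 + lam^2 ||a||^2, criterion (6.24)."""
    n, K = len(x), len(knots)
    X = np.column_stack([np.ones(n), x])
    U = linear_spline_basis(x, knots)
    # Augmented least squares: append sqrt(penalty) rows for the alpha block
    C = np.column_stack([X, U])
    C_aug = np.vstack([C, np.hstack([np.zeros((K, 2)), lam * np.eye(K)])])
    y_aug = np.concatenate([y, np.zeros(K)])
    coef, *_ = np.linalg.lstsq(C_aug, y_aug, rcond=None)
    return coef[:2], coef[2:]            # beta-hat, alpha-hat

# Illustrative one-dimensional example
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 80))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
knots = np.quantile(x, np.linspace(0.1, 0.9, 9))
beta_hat, alpha_hat = penalized_spline_fit(x, y, knots, lam=1.0)
```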
You can use mixed model software to perform spline smoothing: construct
the random effects regressor matrix U from the spline basis functions, and
assume a variance component model for their dispersion. This correspondence
has another, far reaching, consequence. The smoothing parameter does not
have to be determined by cross-validation or a comparison of information
criteria. If σ̂_ε² and σ̂² are the (restricted) maximum likelihood estimates of the
variance components, then the smoothing parameter is determined as

λ = (σ̂_ε² / σ̂²)^{1/2}

in a linear spline model. In a power spline of degree p, this expression is taken
to the pth power (Ruppert, Wand, and Carroll, 2003, p. 113). This automated,
model-driven estimation of the smoothing parameter replaces the smoothing
parameter selection that is a matter of much controversy in nonparametric
modeling. In fact, the actual value of the smoothing parameter in these models
may not even be of interest. It is implied by the estimates of the variance
components; it has been "de-mystified."
There is another important consequence of adopting the mixed model framework
and ML/REML estimation of the smoothing parameter. In order for
there to be a variance component σ², the spline coefficients α have to be
random variables. If you use the "mixed model crank" to obtain the solutions
β̂ and α̂, this is one thing. If you are interested in the precision of predicted
values such as x'β̂ + u'α̂, then the appropriate estimate of precision depends
on whether α is fixed or random. We argue that α is a random component
and should be treated accordingly. Recall the correspondence U(s)α = υ(s)
in the initial formulation of the linear mixed model. The spline coefficients are
part of the spatial process υ(s), a random process. The randomness of υ(s)
cannot be induced by U(s), which is a matrix of constants. The randomness must
be induced by α; hence the coefficients are random variables.
So far, the spline model has the form of a scatterplot smoother in a single
dimension. In order to cope with multiple dimensions, e.g., two spatial coordi-
nates, we need to make some modifications. Again, we follow Ruppert, Wand,
and Carroll (2003, Ch. 13.4–13.5). For a process in Rd , the first modification
is to use radial basis functions as the spline basis. Define hik as the Euclidean
distance between the location si and the spline knot tk , hik = ||si − tk ||.
Similarly, define ckl = ||tk − tl ||, the distance between the knots tk and tl ,
and p = 2m − d, p > 0. Then, the (n × K) matrix U∗ (s) and (K × K) matrix
Ω(t) are defined to have typical elements
U*(s) = \begin{cases} [h_{ik}^p] & d \text{ odd} \\ [h_{ik}^p \log\{h_{ik}\}] & d \text{ even} \end{cases}
\qquad
Ω(t) = \begin{cases} [c_{kl}^p] & d \text{ odd} \\ [c_{kl}^p \log\{c_{kl}\}] & d \text{ even.} \end{cases}

From a singular value decomposition of Ω(t) you can obtain the square root
matrix Ω(t)^{1/2}. Define the mixed linear model

Z(s) = X(s)β + U(s)α + ε   (6.25)
U(s) = U*(s)Ω(t)^{-1/2}
α ∼ (0, σ²I)
ε ∼ (0, σ_ε²I)
Cov[α, ε] = 0.

The transformation of U*(s) to U(s) is made so that this low-rank radial
smoother approximates a (full-rank) thin-plate spline; see Ruppert et al.
(2003, p. 253). This is also the reason behind the logarithmic factor in the case
of d even. The number m in the exponent relates to the order of the derivative
penalized in the spline. For one- and two-dimensional processes, m = 2 is an
appropriate choice.
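A small Python sketch of the construction of U(s) = U*(s)Ω(t)^{-1/2} for a two-dimensional process (d = 2, m = 2, so p = 2) is given below. The knot grid, the simulated locations, and the use of a symmetric square root based on the singular values are illustrative choices, not prescriptions from the text.

```python
import numpy as np

def radial_basis(coords, knots, d=2, m=2):
    """U(s) = U*(s) Omega(t)^{-1/2} for the low-rank radial smoother (6.25)."""
    p = 2 * m - d
    def basis(h):
        # thin-plate-type radial function; the log factor applies for even d
        with np.errstate(divide="ignore", invalid="ignore"):
            out = h**p * np.log(h) if d % 2 == 0 else h**p
        return np.where(h > 0, out, 0.0)
    h = np.linalg.norm(coords[:, None, :] - knots[None, :, :], axis=-1)
    c = np.linalg.norm(knots[:, None, :] - knots[None, :, :], axis=-1)
    U_star, Omega = basis(h), basis(c)
    # square root from the singular values (|eigenvalues|), since Omega
    # need not be positive definite
    w, V = np.linalg.eigh(Omega)
    Omega_inv_half = V @ np.diag(1.0 / np.sqrt(np.abs(w))) @ V.T
    return U_star @ Omega_inv_half

# Illustrative: a 7 x 7 knot grid on the unit square, 200 random locations
rng = np.random.default_rng(2)
coords = rng.uniform(0, 1, size=(200, 2))
g = np.linspace(0.1, 0.9, 7)
knots = np.array([(u, v) for u in g for v in g])
U = radial_basis(coords, knots)
print(U.shape)    # (200, 49)
```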
Model (6.25), the low-rank radial smoother for spatial data, has several remarkable
aspects. For example, you can fit a spatial model by numerical
optimization of a single parameter, σ². As discussed in §5.5.2 and §5.5.3,
in (restricted) maximum likelihood estimation the fixed effects as well as
the residual variance component can be profiled from the objective function.
The profiled (restricted) log-likelihood is a function of the remaining
covariance parameters. In model (6.25), only σ², the variance of the spline
coefficients, remains. Estimates of the fixed effects and predictions of the
spline coefficients are computed after σ̂² has been determined; either directly
from (6.20), (6.21), or from sweeping the augmented mixed model equations
(Goodnight, 1979). Also, the mixed model has an "independent error" structure,
Var[Z(s)|α] = σ_ε²I, permitting the computationally fast modified W-
transformation of Goodnight and Hemmerle (1979).
The degree of smoothing is determined again from the relative size of the
two variance components. Note that σ_ε² is the variance of observations about
the noise-free random function. Its magnitude scales with the response vari-
able Z(s). On the other hand, the magnitude of σ 2 depends on the smoothness
of the process, as well as on the coordinate metric. Rescaling the spatial coor-
dinates will affect σ 2 . When this variance component is near zero, convergence
difficulties may arise. Dividing the coordinates by a constant factor will move
the solution away from the boundary of the parameter space.
The computational simplicity of model (6.25) has been achieved at some
cost. Recall the more general model Z(s) = X(s)β + υ(s) + ε(s). Assuming
Var[ε(s)] = σ_ε²I, the spatial dependency in the error-free signal is expressed
through Var[υ(s)] = Σ_υ. In contrast, the low-rank radial smoother replaces
υ(s) with U(s)α and assumes a very simple covariance structure for
α, Var[α] = σ²I. Where did the spatial information go? It is contained, of
course, in the "spline basis" U(s) = U*(s)Ω(t)^{-1/2}, which depends on the
spatial configuration of data and knots. Because both U∗ (s) and Ω(t) are
known, given the data and knot locations, the matrix U(s) does not depend
on any unknowns. If, however, U(s) contains additional spatial dependence
parameters, for example, because Ω(t) is chosen as a spatial covariance model,
then computational demand increases quickly. In the general model, the spa-
tial information is contained in the random process itself, that is, U(s) ≡ I,
α ≡ υ(s).
Radial smoothing based on mixed models takes an interesting intermediate
position between correlated error models for geostatistical data, in which
the spatial correlation structure is specified directly (§6.2.1.3), and correlated
error models for lattice (regional) data, in which the error structure is determined
indirectly (§6.2.2). Conditional on the random spline coefficients α, the
radial smoothing model is a model with uncorrelated errors. Likelihood-based
estimation proceeds, however, from a marginal distribution. The marginal
variance is that of a correlated error model, Var[Z(s)] = σ²U(s)U(s)' + σ_ε²I.
The marginal correlations are a consequence of U(s) not being a diagonal
matrix and are determined by the structure of U(s). The number of spline
knots and their location in the domain indirectly determine the correlation
“structure” of the resulting model. This makes the selection of the knots an
important aspect of radial smoothing, not least because the computational burden
(the size of the mixed model equations) increases with the number of knots.
Limiting the number of knots to a fraction of the total sample size—a feature
of low-rank smoothers—brings with it tremendous computational savings. If
the number of knots and their placement is selected properly, these savings
do not come at a loss in quality of fit. It does, however, raise the important
question as to how knots should be selected, and how many one should choose
(K). In the one-dimensional case (d = 1), Ruppert et al. (2003, p. 126) recommend
that automatic smoothers choose K as one fourth of the number of (unique)
coordinate locations, but not more than 35. The issue of uniqueness rarely
comes into play with spatial data, but for other applications of scatterplot
smoothing, replications may be common. The tth knot is then placed at the
(t + 1)/(K + 2)th sample quantile of the (unique) coordinates. In the two-
dimensional, spatial case, selecting the number of knots and their placement
is a much more complicated problem. A regular grid of values is not necessar-
ily a good choice. The spatial domain may be irregularly shaped, and there
may be pockets with higher and lower data frequency. Ruppert et al. (2003, p.
257) select no less than 20 knots, but no more than n/4, with an upper limit of
K = 150 and recommend placement of the knots according to a space-filling
design. These designs are obtained by optimizing a stress criterion, selecting
either K of the observed data points (discrete spatial design) or K arbitrary
locations in the domain (continuous spatial design). For large data sets, the
optimization involved is numerically challenging and may turn out to be the
computationally costliest aspect of the entire analysis. This, to some extent,
defeats the formulation of the low-rank smoother as a variance component
model for which fast estimation algorithms are available.

6.2.1.2 Filtering Measurement Error

As another special case, consider again the general linear model (see also
§5.4.3)
Z(s) = S(s) + ε(s),   s ∈ D,

where E[ε(s)] = 0, Var[ε(s)] = σ_ε², Cov[ε(si), ε(sj)] = 0 for all i ≠ j, and S(s)
and ε(s) are independent. We again assume that the unobserved signal S(s)
can be described by a general linear model with autocorrelated errors
S(s) = x(s)'β + υ(s),
where E[υ(s)] = 0, and Cov[υ(si ), υ(sj )] = Cov[S(si ), S(sj )] = CS (si , sj ).
Note that this model has a linear mixed model formulation,

Z(s) = X(s)β + υ(s) + ε(s)
υ ∼ (0, Σ_S)
ε(s) ∼ (0, σ_ε²I)
Cov[υ, ε(s)] = 0,

and (6.18) is a special case, obtained by taking U(s)α ≡ υ(s), R ≡ σ_ε²I, and
G ≡ Σ_S. The matrix Σ_S contains the covariance terms C_S(si, sj). In practice,
Σ_S is modeled parametrically as Σ(θ_S). Note that the moments are

E[Z(s)] = X(s)β   (6.26)
Var[Z(s)] = Σ_S + σ_ε²I = V,   (6.27)
which are those of a linear model with correlated errors of the form of (6.1).
Thus, as far as estimation of the fixed effects is concerned, the two models are

equivalent. Because of this equivalency, we can also obtain solutions based on
Henderson's mixed model equations:

\begin{bmatrix} β̂ \\ υ̂(s) \end{bmatrix} =
\begin{bmatrix} σ_ε^{-2}X(s)'X(s) & σ_ε^{-2}X(s)' \\ σ_ε^{-2}X(s) & σ_ε^{-2}I + Σ_S^{-1} \end{bmatrix}^{-}
\begin{bmatrix} σ_ε^{-2}X(s)'Z(s) \\ σ_ε^{-2}Z(s) \end{bmatrix}
= C \begin{bmatrix} σ_ε^{-2}X(s)'Z(s) \\ σ_ε^{-2}Z(s) \end{bmatrix}.

It is left as an exercise to verify that

C = \begin{bmatrix} Ω & −ΩX(s)'V^{-1}Σ_S \\ −Σ_S V^{-1}X(s)Ω & (σ_ε^{-2}I + Σ_S^{-1})^{-1} + Σ_S V^{-1}X(s)ΩX(s)'V^{-1}Σ_S \end{bmatrix},   (6.28)

where Ω = (X(s)'V^{-1}X(s))^{-}. The solutions to the mixed model equations
are then

β̂ = ΩX(s)'V^{-1}Z(s)   (6.29)
υ̂(s) = Σ_S V^{-1}(Z(s) − X(s)β̂).   (6.30)

Note that (6.29) is the GLS estimator of β and that (6.30) is the best linear
unbiased predictor of υ(s) under squared-error loss.
Since Z(s) = S(s) + ε(s) = X(s)β + υ(s) + ε(s), it is natural to consider as
a predictor of Z(si) the quantity

Ẑ(si) = x(si)'β̂ + υ̂(si)
      = x(si)'β̂ + σ(si)'V^{-1}(Z(s) − X(s)β̂),   (6.31)

where σ(si) = Cov[υ(si), υ(s)]. This is equivalent to the universal kriging
predictor with filtered measurement error (see Chapter problems and §5.3.3).
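The following Python sketch computes the mixed-model solutions (6.29)–(6.30) and the filtered predictor (6.31) at the observed locations; the exponential form assumed for Σ_S, the simulated data, and the parameter values are illustrative assumptions only.

```python
import numpy as np

def filtered_predictions(X, z, Sigma_S, sigma2_eps):
    """GLS estimate (6.29), BLUP of the signal (6.30), and predictor (6.31)."""
    V = Sigma_S + sigma2_eps * np.eye(len(z))        # marginal variance (6.27)
    Vi_X = np.linalg.solve(V, X)
    Vi_z = np.linalg.solve(V, z)
    beta_hat = np.linalg.solve(X.T @ Vi_X, X.T @ Vi_z)       # (6.29)
    resid = z - X @ beta_hat
    upsilon_hat = Sigma_S @ np.linalg.solve(V, resid)        # (6.30)
    z_hat = X @ beta_hat + upsilon_hat                       # (6.31) at data sites
    return beta_hat, upsilon_hat, z_hat

# Illustrative inputs: exponential covariance for the signal process
rng = np.random.default_rng(3)
n = 150
coords = rng.uniform(0, 20, size=(n, 2))
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
Sigma_S = 0.002 * np.exp(-d / 5.0)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
z = X @ np.array([0.0, 10.0]) + rng.multivariate_normal(np.zeros(n), Sigma_S) \
    + rng.normal(scale=np.sqrt(0.0005), size=n)
beta_hat, upsilon_hat, z_hat = filtered_predictions(X, z, Sigma_S, 0.0005)
```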

Example 6.1 (Soil carbon regression. Continued) For the C/N data we
fitted two conditional models. A mixed model
Z(s) = X(s)β + υ(s) + ε(s),
where υ(s) is a zero mean random field with exponential covariance structure,
and a low-rank radial smoother with 49 knots, placed on an evenly spaced grid
throughout the domain. The large-scale trend structure is the same in the two
models; as before, C% is modeled as a linear function of total soil nitrogen
(N%). The low-rank smoother has two covariance parameters, the variance of
the random spline coefficients, and the residual variance, σ_ε² = Var[ε(s)]. The
spatial mixed model has three covariance parameters, the variance and range
of the exponential covariance structure Σ(θ) = Var[υ(s)] and the nugget
variance σ_ε² = Var[ε(s)].
Because of the presence of the variance component σ_ε², predictions at the
observed locations are filtered and do not honor the data. The left-hand panel
of Figure 6.5 shows the adjustments that are made to the estimate of the
mean to predict the C% at the observed locations. The right-hand panel of

the figure compares the C% predictions in the two conditional models. They
are generally very close. The computational effort to fit the radial smoother
model is considerably smaller, however. The low-rank smoother has a simpler
variance structure, σ 2 U(s)! U(s) + σ$2 I compared to Σ(θ) + σ$2 I, that does not
have second derivatives. Representing the spatial random structure requires
K = 49 random components in the smoothing model and n = 195 components
in the spatial mixed model.

[Figure 6.5 appears here: two scatter panels. Left panel: prediction of C% plotted against the estimate of E[C%]. Right panel: prediction of C% from the conditional mixed model plotted against the prediction of C% from the radial smoother.]

Figure 6.5 Comparison of fitted and predicted values in two conditional models for
the C–N spatial regression. Left-hand panel compares predictions of the C-process
and estimates of the mean carbon percentage in the radial smoother model. The
right-hand panel compares the C-predictions in the two conditional formulations.

6.2.1.3 The Marginal and the Conditional Specifications

It is often convenient to specify parametric models by just considering their
first two moments. For example, we might assume that the mean and variance
under consideration are

E[Z(s)] = X(s)β
Var[Z(s)] = σ_ε²I + Σ(θ),   (6.32)

where σ_ε² represents the measurement error variance and θ includes other
variance components as well as spatial autocorrelation parameters. Estimation
of β and θ can be done using iteratively reweighted generalized least
squares (§5.5.1), and universal kriging (§5.3.3) can be used to make predic-

tions of Z(s0 ). There are no random effects and all the spatial autocorrelation
is modeled through what is essentially R = σ_ε²I + Σ(θ).
Modeling these moments directly can be a difficult task and it is often
easier to hypothesize a spatially varying latent process, S(s), and to construct
models around the moments of this process. For example, we could assume
that, given an unobservable spatial process S(s),

E[Z(s)|S] ≡ S(s),
Var[Z(s)|S] = σ_ε²,
Cov[Z(si), Z(sj)|S] = 0 for si ≠ sj.

To complete the specification we assume

S(s) = x(s)'β + υ(s),

where υ(s) is a zero mean, second-order stationary process with covariance
function C_υ(si, sj) = C_S(si, sj) = Cov[υ(si), υ(sj)] = Cov[S(si), S(sj)], and
corresponding variance-covariance matrix Σ(θ_S). This is an example of a
conditionally-specified or hierarchical model. At the first stage of the
hierarchy, we describe how the data depend on the random process S(s). At
the second stage of the hierarchy, we model the moments of the random process.
Estimation of β and θ can also be done using iteratively reweighted
least squares, or by using likelihood methods if we assume S(s) is Gaussian
(as described earlier as part of the theory of mixed models). All of the spatial
autocorrelation is modeled through what is essentially G = Σ(θ_S).
Note that the marginal moments of Z(s) are then easily obtained as
E[Z(s)] = E_S[E[Z(s)|S(s)]] = E_S[E[Z(s)|υ(s)]] = X(s)β
Var[Z(s)] = Var_S[E[Z(s)|S(s)]] + E_S[Var[Z(s)|S(s)]]
          = Var_S[S(s)] + E_S[σ_ε²I]
          = Σ(θ_S) + σ_ε²I = V.
Thus, our conditionally-specified, or hierarchical, model is a linear mixed
model and is marginally equivalent to a general linear model with autocorre-
lated errors where the moments are given in (6.32). Because Var[Z(s)|S(s)] =
σ_ε², this variance component represents the nugget effect of the marginal spa-
tial covariance structure. For identifiability of the variance components, the
covariance structure of υ(s) is free of a nugget effect, or it is assumed that
the overall nugget effect can be decomposed into micro-scale and measurement
error variation in known proportions.
With a linear model and only two levels in the hierarchy, the choice of the
conditional specification over the marginal specification is basically a matter
of personal preference and ease of interpretation. The two approaches give equivalent
inference if the marginal moments coincide. However, if we want to consider
additional levels in the hierarchy or nonlinear models, the conditional specification
can offer some advantages, and we revisit this formulation later
in this chapter.

6.2.2 Spatial Autoregressive Models

With regional data, a direct specification of the variance-covariance matrix Σ


limits our measure of spatial proximity to the distances among point locations
assumed to represent each region (e.g., intercentroid distances). In this sub-
section, we describe other approaches to modeling autocorrelation in spatial
regression models that can incorporate the neighborhood structures often used
when modeling regional data.
In time series, autoregressive models represent the data at time t as a linear
combination of past values. The spatial analog represents the data at location
s as a linear combination of neighboring values. This autoregression induces
spatial dependence in the data. Thus, instead of specifying the spatial au-
tocorrelation structure directly, spatial autoregressive models induce spatial
autocorrelation in the data through the autoregression and the spatial prox-
imity measure used to define the neighborhood structure among the data.

6.2.2.1 Simultaneous Autoregressive (SAR) Models

We begin by applying the idea of spatial autoregression to the vector of residual
errors, e(s), in the spatial linear regression model with Gaussian data.
That is, we regress e(si) on all the other error terms, giving

Z(s) = X(s)β + e(s)
e(s) = Be(s) + υ,   (6.33)

where B is a matrix of spatial dependence parameters with bii = 0 (so we do
not regress e(si) on itself). We assume the residual errors from the autoregression,
υi, i = 1, · · · , n, have mean zero and a diagonal variance-covariance
matrix Σ_υ = diag[σ1², . . . , σn²]. If all the bij are zero, there is no autoregression
and the model reduces to the traditional linear regression model with
uncorrelated errors.
We can also express this autoregressive model as
(I − B)(Z(s) − X(s)β) = υ, (6.34)
and from this expression we can derive the variance-covariance matrix of Z(s)
as
Σ_SAR = Var[Z(s)] = (I − B)^{-1}Σ_υ(I − B')^{-1},   (6.35)
assuming (I−B)−1 exists. Thus, the autoregression induces a particular model
for the general covariance structure of the data Z(s) that is defined by the
parameters bij ; the covariance structure is determined indirectly through B
and our choice of Συ .
The model in (6.34) was introduced by Whittle (1954), and often appears
in the literature as the simultaneous autoregressive (SAR) model, where the
adjective “simultaneous” describes the n autoregressions that occur simultane-
ously at each data location in this formulation. It further serves to distinguish

this type of spatial model from the class of conditional autoregressive models
defined in §6.2.2.2.
Clearly, the matrix of spatial dependence parameters, B, plays an important
role in SAR models. To make progress with estimation and inference, we will
need to reduce the number of spatial dependence parameters through the use
of a parametric model for the {bij } and, for interpretation, we would like to
relate them to the ideas of proximity and autocorrelation we have described
previously. One way to do this is to take B = ρW, where W is one of the
spatial proximity matrices discussed in §1.3.2. For example, if the spatial
locations form a regular lattice, W may be a matrix of 0’s and 1’s based on
a rook, bishop, or queen move. In geographical analysis, W may be based on
the length of shared borders, centroid distances, or other measures of regional
proximity. With this parameterization of B, the SAR model can be written
as
Simultaneous
Auto- Z(s) = X(s)β + e(s)
regressive e(s) = ρWe(s) + υ, (6.36)
Model
(SAR) where B = ρW. We can manipulate this model in a variety of ways, and it is
often intuitive to write it as
Z(s) = X(s)β + (I − ρW)−1 υ (6.37)
= X(s)β − ρWX(s)β + ρWZ(s) + υ. (6.38)
From equation (6.37) we can see how the autoregression induces spatial au-
tocorrelation in the linear regression model through the term (I − ρW)−1 υ.
From equation (6.38) we obtain a better appreciation for what this means in
terms of a linear regression model with uncorrelated errors: we now have two
additional terms in the regression model: ρWXβ and ρWZ(s). These terms
are called spatially lagged variables.
For a well-defined model, we require (I−ρW) to be non-singular (invertible).
This restriction imposes conditions on W and also on ρ, best summarized
through the eigenvalues of the matrix W. If ϑmax and ϑmin are the largest
and smallest eigenvalues of W, and if ϑmin < 0 and ϑmax > 0, then 1/ϑmin <
ρ < 1/ϑmax (Haining, 1990, p. 82). For a large set of identical square regions,
these extreme eigenvalues approach −4 and 4, respectively, as the number of
regions increases, implying |ρ| < 0.25, but actual constraints on ρ may be
more severe, especially when the sites are irregularly spaced. Often, the row
sums of W are standardized to 1 by dividing each entry in W by its row sum,
Σ_j wij. Then, ϑmax = 1 and ϑmin ≤ −1, so ρ < 1 but may be less than −1
(see Haining, 1990, §3.2.2).
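The admissible range for ρ can be computed directly from the eigenvalues of W, as in the following Python sketch; the rook-neighbor lattice used to build W is an illustrative assumption.

```python
import numpy as np

def rook_weights(nrow, ncol):
    """Binary rook-move contiguity matrix W for a regular nrow x ncol lattice."""
    n = nrow * ncol
    W = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            i = r * ncol + c
            for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrow and 0 <= cc < ncol:
                    W[i, rr * ncol + cc] = 1.0
    return W

W = rook_weights(10, 10)
evals = np.linalg.eigvals(W).real
print("admissible rho:", 1 / evals.min(), "to", 1 / evals.max())

# Row standardization: each entry of W divided by its row sum
W_std = W / W.sum(axis=1, keepdims=True)
print("largest eigenvalue after standardization:",
      np.linalg.eigvals(W_std).real.max())   # equals 1
```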
Estimation and Inference in SAR Models. The one-parameter SAR model
described by equations (6.36), (6.37) and (6.38), with Συ = σ 2 I, is by far
the most commonly-used SAR model in practical applications. Probably the
main reason for the popularity of this model is the difficulty of actually fitting
models with more parameters. In what follows we describe both least squares

and maximum likelihood methods for estimation and inference with this one-
parameter SAR model.
If ρ is known, then generalized least squares can be used to estimate β and
to obtain an estimate of σ². Thus,

β̂_gls = (X(s)'Σ_SAR^{-1}X(s))^{-1} X(s)'Σ_SAR^{-1}Z(s),

and

σ̂² = (Z(s) − X(s)β̂_gls)'Σ_SAR^{-1}(Z(s) − X(s)β̂_gls) / (n − k),

with Σ_SAR given in equation (6.35) using B = ρW and Σ_υ = σ²I. Of course
ρ is not known, and we might be tempted to use iteratively re-weighted least
squares to estimate both ρ and β simultaneously through iteration. However,
Z(s) and υ are, in general, not independent, making the least squares estimator
of ρ inconsistent (Whittle, 1954; Ord, 1975). Based on the work of Ord
(1975) and Cliff and Ord (1981, p. 160), Haining (1990, p. 130) suggested
the use of a modified least squares estimator of ρ that is consistent, although
inefficient,

ρ̂ = Z(s)'W'WZ(s) / Z(s)'W'W²Z(s).
However, given this problem with least squares estimation and the unifying
theory underlying maximum likelihood, parameters of SAR models are usually
estimated by maximum likelihood.
Suppose the data are multivariate Gaussian with the general SAR model
defined in (6.36) and (6.35), i.e., the data are multivariate Gaussian with mean
X(s)β and variance-covariance matrix given in equation (6.35),
) *
Z(s) ∼ G X(s)β, (I − B)−1 Συ (I − B! )−1 .

We consider a more general variance-covariance structure for υ, by reparameterizing
the diagonal matrix Σ_υ as Σ_υ = σ²V_υ, so the variance-covariance
matrix of a SAR model can be written as

Σ_SAR(θ) = σ²(I − B)^{-1}V_υ(I − B')^{-1} = σ²V_SAR(θ),   (6.39)

where the vector θ contains the spatial dependence parameters bij and the
parameters of V_υ. Under this model, the maximum likelihood procedure described
in §5.5.2 can be used to estimate all unknown parameters. Specifically,
estimates of β and θ are obtained by minimizing the negative of twice the
Gaussian likelihood, ϕ(β; θ; Z(s)), given in equation (5.41) with Σ(θ) replaced
with Σ_SAR(θ). The parameters σ² and β can be profiled from the log likelihood
as in §5.5.2. If θ̂_ml is the maximum likelihood estimator of θ, then

β̂_ml = [X(s)'Σ_SAR(θ̂_ml)^{-1}X(s)]^{-1} X(s)'Σ_SAR(θ̂_ml)^{-1}Z(s),   (6.40)

and

σ̂²_ml = (Z(s) − X(s)β̂_ml)'V_SAR(θ̂_ml)^{-1}(Z(s) − X(s)β̂_ml) / n.   (6.41)

In most cases, there is no closed form expression for θ̂_ml and it is obtained
using numerical methods.
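For the one-parameter SAR model with B = ρW and V_υ = I, the profiled objective can be evaluated cheaply because log|I − ρW| follows from the eigenvalues of W. The Python sketch below scans a grid of ρ values and returns the profile ML estimates; the grid search (rather than a formal optimizer) and the data inputs are illustrative simplifications.

```python
import numpy as np

def profile_sar(Z, X, W, rho_grid):
    """Profile -2 log likelihood over rho for the SAR error model (6.36)."""
    n = len(Z)
    evals = np.linalg.eigvals(W).real          # for log|I - rho W|
    best = None
    for rho in rho_grid:
        A = np.eye(n) - rho * W                # I - rho W
        # GLS with Sigma^{-1} proportional to A'A is OLS on the A-filtered data
        Xf, Zf = A @ X, A @ Z
        beta, *_ = np.linalg.lstsq(Xf, Zf, rcond=None)
        r = Zf - Xf @ beta
        sigma2 = (r @ r) / n                   # profiled sigma^2, cf. (6.41)
        # -2 log L up to an additive constant
        neg2ll = n * np.log(sigma2) \
            - 2.0 * np.sum(np.log(np.abs(1.0 - rho * evals))) + n
        if best is None or neg2ll < best[0]:
            best = (neg2ll, rho, beta, sigma2)
    return best

# Illustrative use with a rook-neighbor W (see the earlier sketch) and fake data:
# neg2ll, rho_hat, beta_hat, s2_hat = profile_sar(Z, X, W, np.linspace(-0.2, 0.24, 45))
```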
In this general development, there are many model parameters contained
in the vector θ, and minimizing ϕ(β; θ; Z(s)) may be difficult or impossible.
Thus, in practice, we usually assume the parameters in V_υ are known (e.g., the
(i, i)th element is equal to 1 or equal to 1/ni to account for differing population
sizes in a geographical analysis) and parameterize B as a parametric function
of a spatial proximity matrix W. For example, if we take B = ρW and
V_υ = I, the only unknown parameter is ρ, and in this case, minimizing
ϕ(β; θ; Z(s)) is typically straightforward. The information matrix (assuming
β is not profiled) then has a much simpler form (Ord, 1975; Cliff and Ord,
1981, p. 242):

I(β, σ², ρ) = \begin{bmatrix} σ^{-2}(X(s)'A(ρ)^{-1}X(s)) & 0 \\ 0' & I(σ², ρ) \end{bmatrix}   (6.42)

I(σ², ρ) = \begin{bmatrix} nσ^{-4}/2 & σ^{-2}tr\{G\} \\ σ^{-2}tr\{G\} & α + tr\{G'G\} \end{bmatrix}

A(ρ) = (I − ρW)^{-1}(I − ρW')^{-1}
G = W(I − ρW)^{-1}
α = Σ_i ϑi²/(1 − ρϑi)²,

where ϑi are the eigenvalues of W. The inverse of the information matrix


provides the asymptotic standard errors of the estimators.
We note, without providing further details, that restricted maximum like-
lihood estimation is of course a possible alternative in this setup. The needed
adjustments to the foregoing formulas are straightforward, see §5.5.3.

6.2.2.2 Conditional Autoregressive (CAR) Models

The simultaneous approach to spatial autoregression provides one multivariate


model that describes the spatial interactions among the data. We may find it
more intuitive to follow the approach adopted in the analysis of time series
and specify models for the set of conditional probability distributions of each
observation, Z(si ), given the observed values of all of the other observations.
That is, we model f (Z(si )|Z(s)−i ), where Z(s)−i denotes the vector of all
observations except Z(si ), and we do this for each observation in turn.
In time series, a sequence of random variables Y1 , Y2 , · · · , YT is said to have
the Markov property if the conditional distribution of Yt+1 given Y1, Y2, · · · , Yt
is the same as the conditional distribution of Yt+1 given Yt. That is, the value
at time t + 1 depends only on the previous value. A sequence of random
variables having the Markov property is also called a Markov process. We can
extend these ideas to the spatial domain by assuming Z(si ) depends only on
a set of neighbors, i.e., Z(si ) depends on Z(sj ) only if location sj is in the
neighborhood set, Ni , of si . In this special case, the process Z(s) is called

a Markov random field. Thus, with the conditional autoregressive approach,
we construct models for f(Z(si)|Z(sj), sj ∈ Ni). For example, if we assume
each of these conditional distributions is Gaussian, then we might model them
using:

E[Z(si)|Z(s)_{−i}] = x(si)'β + Σ_{j=1}^{n} cij (Z(sj) − x(sj)'β),   (6.43)

Var[Z(si)|Z(s)_{−i}] = σi²,   i = 1, · · · , n,   (6.44)

where the cij denote spatial dependence parameters that are nonzero only if
sj ∈ Ni , and cii = 0.
For estimation and inference, we need to ensure the existence of the joint
distribution. Building a valid joint distribution from a collection of marginal
or conditional distributions is difficult and this type of “upwards” construction
almost always results in constraints on the interactions among the data (see
also §5.8 and §6.3.3). In the case we consider here, the Hammersley-Clifford
theorem (first proved in Besag (1974)), describes the conditions necessary
for a set of conditional distributions, {f (Z(si )|Z(sj ), sj ∈ Ni )}, to define
a valid joint distribution, f(Z(s1), Z(s2), . . . , Z(sn)). In order to contrast CAR
models with SAR models (the latter of which are really only defined for a
multivariate Gaussian distribution), we begin with Gaussian CAR models
and address the implications resulting from the Hammersley-Clifford theorem
for non-Gaussian models in the continuation of Example 1.4 on page 377.
If we assume that the conditional distributions are Gaussian, with condi-
tional means and variances given by equations (6.43) and (6.44), respectively,
the conditions required by the Hammersley-Clifford theorem are not too re-
strictive and Besag (1974) shows (see also Cliff and Ord, 1981, p. 180) that
these conditional distributions generate a valid joint multivariate Gaussian
distribution with mean X(s)β and variance
Σ_CAR = (I − C)^{-1}Σ_c,   (6.45)

where Σ_c = diag[σ1², · · · , σn²]. To ensure that this variance-covariance matrix is
symmetric, the constraints

σj²cij = σi²cji   (6.46)
must be imposed.
Clearly, this model is similar to a SAR model, but with a different variance-
covariance matrix. In fact, if we set Σc = σ 2 I in the development of the
CAR models and Συ = σ 2 I in the development of the SAR models, then a
comparison of the variances in equations (6.35) and (6.45) shows that any
SAR model with spatial dependence matrix B can be expressed as a CAR
model with spatial dependence matrix C = B + B' − BB'. Any CAR model
can also be expressed as a SAR model, but the relationships between B and
C are somewhat contrived (see Haining, 1990, p. 81), and the neighborhood
structure of the two models may not be the same (Cressie, 1993, p. 409).
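The SAR-to-CAR correspondence can be checked numerically: with Σ_c = Σ_υ = σ²I and a symmetric proximity structure, the CAR covariance built from C = B + B' − BB' reproduces the SAR covariance. The Python sketch below does this for a small symmetric B = ρW; the value of ρ and the path-graph W are illustrative assumptions.

```python
import numpy as np

def sar_cov(B, sigma2=1.0):
    """SAR covariance (6.35) with Sigma_upsilon = sigma2 * I."""
    n = B.shape[0]
    A_inv = np.linalg.inv(np.eye(n) - B)
    return sigma2 * A_inv @ A_inv.T

def car_cov(C, sigma2=1.0):
    """CAR covariance (6.45) with Sigma_c = sigma2 * I."""
    n = C.shape[0]
    return sigma2 * np.linalg.inv(np.eye(n) - C)

# Illustrative W: binary contiguity on a path of 5 sites; rho inside the valid range
W = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
B = 0.2 * W
C = B + B.T - B @ B.T                        # induced CAR dependence matrix
print(np.allclose(sar_cov(B), car_cov(C)))   # True for this symmetric B
```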

Estimation and Inference with Gaussian CAR models. Following the ideas
regarding SAR models, we consider the case where Σc = σ 2 I and the spa-
tial dependence parameters can be written as functions of a single spatial
autocorrelation parameter, e.g., C = ρW.
Unlike the SAR model, the least squares estimator of the autocorrelation
parameter ρ in a one-parameter CAR model is consistent. Thus, iteratively
re-weighted generalized least squares (described in §5.5.1) can be used to
estimate all of the parameters of this CAR model. As before, β is estimated
using generalized least squares with Σ(θ) = ΣCAR = ΣCAR (ρ), and ρ is
estimated using (Haining, 1990, p. 130)
ρ̂_OLS = ε̂'Wε̂ / ε̂'W²ε̂,

where ε̂ is the residual vector from OLS regression.
To perform maximum likelihood estimation, we consider a slightly more
general formulation for Σc , as in the case of the SAR models. Specifically, we
reparameterize it as Σc = σ 2 Vc , with Vc known. Thus, with C = ρW, the
variance-covariance matrix of this CAR model can be written as
Σ_CAR = σ²(I − C)^{-1}V_c = σ²V_CAR(ρ).   (6.47)
Maximizing the Gaussian likelihood based on this variance-covariance struc-
ture is usually straightforward, and the information matrix has a form very
similar to that associated with the one-parameter SAR model (see (6.42))
(Cliff and Ord, 1981, p. 242):

I(β, σ², ρ) = \begin{bmatrix} σ^{-2}(X(s)'A(ρ)^{-1}X(s)) & 0 & 0 \\ 0' & nσ^{-4}/2 & σ^{-2}tr(G)/2 \\ 0' & σ^{-2}tr(G)/2 & α/2 \end{bmatrix},   (6.48)

where A(ρ) = (I − ρW)^{-1}, G = W(I − ρW)^{-1}, α = Σ_i ϑi²/(1 − ρϑi)², and
ϑi are the eigenvalues of W.

6.2.2.3 Spatial Prediction with Autoregressive Models

Spatial autoregressive models were developed for use with regional data and
in most applications the data are aggregated over a set of finite areal regions.
Thus, prediction at a new location (which is actually a new region) is usually
not of interest unless there is missing data. Assume that in the Gaussian case
the SAR model Z(s) ∼ G(X(s)β, ΣSAR ), with ΣSAR given in (6.35), or the
CAR model Z(s) ∼ G(X(s)β, ΣCAR ) with ΣCAR given in (6.45), hold for
the observed data as well at the new location. Then universal kriging can
be used to predict at the new location by jointly modeling the data and the

unobservables as in equations (5.31) and (5.32) and using the predictor in


equation (5.33).
While prediction at a new location is usually not of interest in spatial autore-
gressive models, predictions from the models for the given set of regions can
be important. If predictions are made using the same covariate values, these
predictions represent smoothed values of the original data that are adjusted
for the covariate effects. The adjustment also accounts for spatial autocorre-
lation as measured by the spatial dependence parameters in the model and
these are usually assumed to be determined by the neighborhood structure
imposed on the lattice system. Recall that universal kriging honors the data,
and so one way to obtain these smoothed values is to use the filtered version
described in §5.3.3.

6.2.3 Generalized Least Squares—Inference and Diagnostics

6.2.3.1 Hypothesis Testing

Linear Hypotheses About Fixed Effects. The correlated error models of this
section, whether marginal models, mixed models, or autoregressive models,
have an estimated generalized least squares solution for the fixed effects. As
for models with uncorrelated errors, we consider linear hypotheses involving
β of the form

H0: Lβ = l0
H1: Lβ ≠ l0,   (6.49)

where L is an l × p matrix of contrast coefficients and l0 is a specified l × 1
vector. The equivalent statistic to (6.3) is the Wald F statistic

F̂ = (Lβ̂ − l0)'[L(X(s)'Σ(θ̂)^{-1}X(s))^{-1}L']^{-1}(Lβ̂ − l0) / rank(L),   (6.50)

taking β̂ as the EGLS estimator based on θ̂, the latter being either the IRWGLS,
ML, or REML estimator. For uncorrelated, Gaussian distributed
errors, the regular F statistic ((6.3), page 305) followed an F distribution.
In the correlated error case, the distributional properties of F̂ are less clear-cut.
If θ̂ is a consistent estimator, then F̂ has an approximate Chi-square distribution
with rank{L} degrees of freedom. This is also the distribution of F̂
if θ is known and Z(s) is Gaussian. Consequently, p-values computed from the
Chi-square approximation tend to be too small; the test tends to be liberal,
and Type-I error rates tend to exceed the nominal level. A better approximation
to the nominal Type-I error level is achieved when p-values for F̂ are computed
from an F distribution with rank{L} numerator and n − rank{X(s)}
denominator degrees of freedom.
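A small Python sketch of the Wald F statistic (6.50) and its F-distribution p-value is given below; it relies on scipy for the F distribution and assumes that the plug-in covariance (X(s)'Σ(θ̂)^{-1}X(s))^{-1} has already been computed, for example by the EGLS sketch given earlier. The commented usage is illustrative only.

```python
import numpy as np
from scipy import stats

def wald_f_test(L, beta_hat, l0, cov_beta, n, k):
    """Wald F statistic (6.50) with an F reference distribution."""
    r = np.linalg.matrix_rank(L)
    diff = L @ beta_hat - l0
    F = diff @ np.linalg.solve(L @ cov_beta @ L.T, diff) / r
    # F distribution with rank(L) numerator and n - rank(X) denominator df
    p_value = stats.f.sf(F, r, n - k)
    return F, p_value

# Illustrative: test H0: beta_1 = 0 in a two-parameter mean model
# F, p = wald_f_test(np.array([[0.0, 1.0]]), beta_hat, np.zeros(1), cov_beta, n=195, k=2)
```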

Example 6.2 A similar phenomenon can be observed in iid random samples.



If Y1, · · · , Yn is a random sample of size n from a distribution with mean µ and
finite variance σ², and Ȳ, s² denote the sample mean and sample variance,
respectively, then the distribution of

T = (Ȳ − µ) / (s/√n)   (6.51)

approaches a G(0, 1) distribution as n → ∞ (by the Central Limit Theorem).
Because s² is a consistent estimator of σ², the asymptotic distribution is the
same as that of

T* = (Ȳ − µ) / (σ/√n).
In order to test the hypothesis H0 : µ = µ0 with test statistic T , one can draw
on the standard Gaussian distribution to compute p-values, critical values,
etc. Since n is finite, one may obtain better adherence to nominal Type-I
error levels, if the distribution applied to T is one that takes “into account”
the fact that s is a random variable for any finite n. This is the rationale for
using a t-distribution with n − 1 degrees of freedom in tests based on (6.51),
although T ∼ tn−1 only if the Yi are Gaussian distributed.

When L is a 1 × p vector whose only nonzero entry is a 1 for the ith
element and l0 = 0, the test based on F̂ reduces to the familiar t-test of
βi = 0. Denoting the ith diagonal element of (X(s)'Σ(θ̂)^{-1}X(s))^{-1} by s_ii,
the estimated standard error for a single β̂i is √s_ii, and a 100(1 − α)% confidence
interval for βi is

β̂i ± tα/2,n−k √s_ii,   (6.52)

where tα/2,n−k is the α/2 percentage point from a t-distribution with n − k
degrees of freedom.
Using an F-distribution (t-distribution) instead of the asymptotic Chi-square
(Gaussian) distribution improves the properties of the test of H0: Lβ = l0,
but it does not yet address the essential problem that

Ω(θ̂) = [X(s)'Σ(θ̂)^{-1}X(s)]^{-1}

is not the variance-covariance matrix of β̂. It is an estimate of Var[β̂], but
not necessarily a good one at that. Indeed, it is an estimate of the variance
of the GLS estimator. We saw in §5.5.4 that plugging in covariance parameter
estimates can affect the prediction standard errors. By the same token,
plugging in affects the variance of the EGLS estimator relative to the GLS
estimator. The test statistic F̂ should be adjusted in possibly two ways. First,

L[X(s)'Σ(θ̂)^{-1}X(s)]^{-1}L'

is supposed to be an estimate of Var[Lβ̂]. This requires a different estimate
of Var[β̂], one that accounts for the fact that uncertainty is associated with
θ̂ (regardless of how θ was estimated). If this adjusted test statistic is denoted
F̂*, then the appropriate null distribution may not be the same as
that used in a test based on F̂. The adjustment for the variance estimator
of the EGLS estimator flows directly from the work of Kackar and Harville
(1984), Prasad and Rao (1990), and Harville and Jeske (1992), summarized
in §5.5.4. Kenward and Roger (1997) gave computationally simple expressions
for a bias-corrected estimator of the variance of β̂ based on (5.53) and REML
estimation. They furthermore gave expressions for adjusting the degrees of
freedom of the reference F-distribution.
Let θ̂ denote the REML estimator of θ. The adjusted Kenward-Roger variance
estimator then takes the form

Υ(θ̂) = Ω(θ̂) + 2Ω(θ̂) \left\{ \sum_{i=1}^{q} \sum_{j=1}^{q} a_{ij} D_{ij} \right\} Ω(θ̂),   (6.53)

where a_ij is the (i, j)th element of Var[θ̂], q is the dimension of θ, and

D_ij = X' \frac{∂Σ(θ)^{-1}}{∂θ_i} Σ(θ̂) \frac{∂Σ(θ)^{-1}}{∂θ_j} X
     − X' \frac{∂Σ(θ)^{-1}}{∂θ_i} X Ω(θ̂) X' \frac{∂Σ(θ)^{-1}}{∂θ_j} X
     − \frac{1}{4} X' Σ(θ̂)^{-1} \frac{∂²Σ(θ)}{∂θ_i ∂θ_j} Σ(θ̂)^{-1} X.

The derivatives in this expression are evaluated at θ̂. When Σ depends linearly
on θ, the last term vanishes. This is the case in certain mixed models, but
usually not when the covariance structure has spatial components.
In the next step, Kenward and Roger (1997) replace Ω(θ̂) in (6.50) with
(6.53) and adjust the test statistic and its degrees of freedom. The Kenward-Roger
F-test has test statistic

F̂* = λ (Lβ̂ − l0)'[LΥ(θ̂)L']^{-1}(Lβ̂ − l0) / rank(L);   (6.54)

its distribution under H0: Lβ = l0 is taken to be F with rank{L} numerator
and m denominator degrees of freedom. The interested reader is referred to
Kenward and Roger (1997) for computational details regarding λ and m.

Testing Hypotheses About Covariance Parameters. Since maximum likeli-


hood estimators are asymptotically Gaussian, we can obtain approximate
confidence intervals or associated hypothesis tests for the elements of θ using
θ̂_i(ml) ± zα/2 √(Var[θ̂_i(ml)])   (6.55)
θ̂_i(reml) ± zα/2 √(Var[θ̂_i(reml)])   (6.56)

where the variance of the estimators derives from the inverse of the observed
or expected information matrix, substituting (plugging-in) ML (REML) esti-
mates for any unknown parameters. Alternatively, we can use likelihood ratio
tests of hypotheses about θ and compare two nested parametric models to see
whether a subset of the covariance parameters fits the data as well as the full
set. Consider comparing two models of the same form, one based on parame-
ters θ 1 and a larger model based on θ 2 , with dim(θ 2 ) > dim(θ 1 ). That is, θ 1 is
obtained by constraining some parameters in θ 2 , usually setting them to zero,
and dim(θ) denotes the number of free parameters. Then a test of H0 : θ = θ 1
against the alternative H1: θ = θ2 can be carried out by comparing

ϕ(β; θ1; Z(s)) − ϕ(β; θ2; Z(s))   (6.57)
to a χ2 distribution with dim(θ 2 ) − dim(θ 1 ) degrees of freedom. (Recall that
ϕ(β; θ; Z(s)) denotes twice the negative of the log-likelihood, so ϕ(β; θ 1 ; Z(s))
> ϕ(β; θ 2 ; Z(s)).)

Example 6.3 Consider a linear spatial regression model Z(s) = X(s)β+e(s)


with
Cov[e(si ), e(sj )] = c0 + σ 2 exp{−||si − sj ||/α},
an exponential model with practical range 3α and nugget c0. Let θ2 = [c0, σ², α]'.
The alternative covariance functions are
1. Cov[e(si), e(sj)] = σ²exp{−||si − sj||/α}, with θ1 = [σ², α]'.
2. Cov[e(si), e(sj)] = σ²exp{−||si − sj||²/α²}.
Model 1 is an exponential model without nugget effect. Since it can be ob-
tained from the original model by setting c0 = 0, the two models are nested. A
likelihood ratio test of H0 : c0 = 0 is possible. The test statistic is ϕ(β; θ 1 ; Z(s))
− ϕ(β; θ 2 ; Z(s)). Model 2 is a gaussian covariance structure and is not of the
same form as the original model or model 1. Constraining parameters of the
other models does not yield model 2. A likelihood ratio test comparing the
two models is not possible.

An important exception to the distributional properties of likelihood ratio


statistics occurs when parameters lie on the boundary of the parameter space.
In that case, the test statistic in equation (6.57) is a mixture of χ2 distribu-
tions. Such boundary exceptions arise in practice when variance components
(e.g., the nugget or the variance of the low-rank smoother’s spline coefficients)
are tested against zero. If we are testing only one of these variance components
against 0, the test statistic has a distribution that is a mixture of a degenerate
distribution giving probability 1 to the value zero, and a χ2 distribution with
dim(θ 2 ) − dim(θ 1 ) degrees of freedom (Self and Liang, 1987; Littell, Milliken,
Stroup and Wolfinger, 1996). Thus, to make the test, simply divide the p-value
obtained from a χ2 with dim(θ 2 ) − dim(θ 1 ) degrees of freedom by 2.
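The following Python helper sketches the likelihood ratio comparison (6.57) with the boundary adjustment described above (halving the χ² p-value when a single variance component is tested against zero); scipy supplies the χ² tail probability, and the values used in the usage line are those reported in Table 6.2.

```python
from scipy import stats

def lrt_pvalue(neg2ll_reduced, neg2ll_full, df, on_boundary=False):
    """Likelihood ratio test (6.57): compare -2 log likelihoods of nested models."""
    lr_stat = neg2ll_reduced - neg2ll_full        # phi(reduced) - phi(full) >= 0
    p = stats.chi2.sf(lr_stat, df)
    if on_boundary and df == 1:
        # 50:50 mixture of a point mass at zero and a chi-square(1)
        p = 0.5 * p
    return lr_stat, p

# Illustrative: values from Table 6.2, testing the exponential structure (df = 1)
print(lrt_pvalue(-706.0, -747.0, df=1, on_boundary=True))
```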
The likelihood ratio test can also be carried out for REML estimation,
provided the hypothesis is about the covariance parameters. In that case we

are comparing
ϕ_R(θ1; KZ(s)) − ϕ_R(θ2; KZ(s))   (6.58)

to a χ² distribution with dim(θ2) − dim(θ1) degrees of freedom. While like-

lihood ratio tests can be formulated for models that are nested with re-
spect to the mean structure and/or the covariance structure, tests based on
ϕR (θ; KZ(s)) can only be carried out for models that are nested with respect
to the covariance parameters and have the same fixed effects structure (same
X(s) matrix).

Example 6.4 Consider a partitioned matrix X(s) = [X(s)1 X(s)2 ] and the
following models:
1. Z(s) = X(s)[β1', β2']' + e(s) with Cov[e(si), e(sj)] = c0 + σ²exp{−||si − sj||/α}
2. Z(s) = X(s)2β2 + e(s) with Cov[e(si), e(sj)] = c0 + σ²exp{−||si − sj||/α}
3. Z(s) = X(s)[β1', β2']' + e(s) with Cov[e(si), e(sj)] = σ²exp{−||si − sj||/α}
4. Z(s) = X(s)2β2 + e(s) with Cov[e(si), e(sj)] = σ²exp{−||si − sj||/α}
To test the hypothesis H0 : β 2 = 0, a likelihood ratio test is possible, based on
the log likelihoods from models 1 and 2 (or models 3 and 4). The REML log
likelihoods from these two models cannot be compared because K1Z(s) and
K2 Z(s) are two different sets of “data.”
To test the hypothesis H0 : c0 = 0, models 1 and 3 (or models 2 and 4) can
be compared, since they are nested. The test can be based on either the −2 log
likelihood (ϕ(β; θ; Z(s))) or on the −2 residual log likelihood (ϕR (θ; KZ(s))).
The test based on ϕR is possible because both models use the same matrix of
error contrasts.
Finally, a test of H0 : c0 = 0, β 2 = 0 is possible by comparing models 1 and
4. Since the models differ in their regressor matrices (mean model), only a
test based on the log likelihood is possible.

Example 6.1 (Soil carbon regression. Continued) Early on in §6.2 we


compared OLS estimates in the soil carbon regression model to maximum
likelihood and restricted maximum likelihood estimates in Table 6.1 on page
324. At that point we deferred a formal test of significance of the spatial
autocorrelation.
We can compare the models
C(si )|N (si ) = β0 + β1 N (si ) + e(si )
e(si ) ∼ G(0, σ 2 )
Cov[e(si ), e(sj )] = σ 2 exp{−||si − sj ||/θ}
and
C(si )|N (si ) = β0 + β1 N (si ) + e(si )
e(si ) ∼ G(0, σ 2 )
Cov[e(si ), e(sj )] = 0

Table 6.2 Minus two times (restricted) log likelihood values for C–N regression
analysis. See Table 6.1 on page 324 for parameter estimates.

                          Covariance Structure
Fit Statistic           Independent    Exponential
ϕ(β; θ; C(s))              −706           −747
ϕR(θ; KC(s))               −695           −738

with a likelihood ratio test, provided that the models are nested; that is,
the independence model can be derived from the correlated error model by
a restriction of the parameter space. The exponential covariance structure
exp{−||si − sj ||/θ} approaches the independence model only asymptotically,
as θ → 0. For θ = 0 the likelihood cannot be computed, so it appears that the
models cannot be compared. There are two simple ways out of this “dilemma.”
We can reparameterize the covariance model so that the OLS model is nested.
For example,
exp{−||si − sj||/θ} = ρ^{||si − sj||}

for ρ = exp{−1/θ}. The likelihood ratio statistic for H0: ρ = 0 is −706 −
(−747) = 41 with p-value Pr(χ²1 > 41) < 0.0001. The addition of the exponen-
tial covariance structure significantly improves the model. The restricted
likelihood ratio test yields the same conclusion (−695 − (−738) =
43). It should be obvious that −747 is also the −2 log likelihood for the model
in the exp{−||si −sj ||/θ} parameterization, so we could have used the log like-
lihood from the exponential model directly. The second approach is to evaluate
the log likelihood at a value for the range parameter far enough away from
zero to prevent floating point exceptions and inaccuracies, but small enough
so that the correlation matrix is essentially the identity.
For a stationary process, ρ = 0 falls on the boundary of the parameter
space. The p-value of the test could be adjusted, but it would not alter our
conclusion (the adjustment shrinks the p-value). The correlated error model
fits significantly better than a model assuming independence.

To compare models that are not nested, we can draw on likelihood-based in-
formation criteria. For example, Akaike’s Information Criterion (AIC, Akaike,
1974), a penalized-log-likelihood ratio, is defined by

AIC(β, θ) = ϕ(β; θ; Z(s)) + 2(k + q) (6.59)


AICR (θ) = ϕR (θ; Z(s)) + 2q. (6.60)

A finite-sample corrected version of this information criterion is AICC (Hurvich
and Tsai, 1989; Burnham and Anderson, 1998)

AICC(β, θ) = ϕ(β; θ; Z(s)) + 2 n(k + q)/(n − k − q − 1)   (6.61)
AICCR(θ) = ϕR(θ; Z(s)) + 2 (n − k)q/(n − k − q − 1).   (6.62)
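A small helper for computing these criteria from a fitted model's −2 log likelihood is sketched below in Python; the argument names (k fixed effects, q covariance parameters) follow the notation of (6.59)–(6.62), and the values in the usage line are illustrative, taken from the exponential ML fit in Table 6.2.

```python
def aic_aicc(neg2ll, n, k, q, reml=False):
    """AIC and finite-sample corrected AICC, equations (6.59)-(6.62)."""
    if reml:
        aic = neg2ll + 2 * q                                  # (6.60)
        aicc = neg2ll + 2 * (n - k) * q / (n - k - q - 1)     # (6.62)
    else:
        aic = neg2ll + 2 * (k + q)                            # (6.59)
        aicc = neg2ll + 2 * n * (k + q) / (n - k - q - 1)     # (6.61)
    return aic, aicc

# Illustrative: ML fit of the exponential model in Table 6.2 (k = 2, q = 2)
print(aic_aicc(-747.0, n=195, k=2, q=2))
```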
In theory, we select the model with the smallest AIC or AICC value. In
practice, one should observe the following guidelines, however.
• When fitting several competing models, for example, models that differ
in their covariance structure, one often finds that the information criteria
group models into sets of models that are clearly inferior and a set of
models with fairly similar values. One should then not necessarily adopt the
model with the smallest information criterion, but a model in the group of
plausible models that is parsimonious and interpretable.
• The presence of unusual observations (outliers, influential data points) can
substantively affect the covariance model. For example, in a simple one-way
classification model, a single extreme value in one of the groups may suggest
that a heterogeneous variance model may be appropriate. Strict reliance on
information criteria alone may then lead to a model that is more complex
than necessary. The complexity of the model tries to adjust to a few unusual
observations.
• In the case of residual maximum likelihood estimation, these information
criteria should not be used for model comparisons, if the models have dif-
ferent mean structures (whether the mean models are nested or not). The
criteria are based on residual log likelihoods that apply to different sets
of data when the mean model changes. Comparisons of non-nested models
based on information criteria that are derived from ϕR (θ; Z(s)) should be
restricted to models that are non-nested only with respect to the covariance
structure.

Hypothesis Testing in Autoregressive Models. Since the SAR model can be


viewed as a linear model with spatially autocorrelated errors, all of the meth-
ods of hypothesis testing involving β or θ described previously can be used
with SAR models. Specifically, hypotheses involving β can be tested using the
Wald test in (6.50) or the Kenward-Roger test in (6.54). However, one of the
most common uses of the one-parameter SAR model is to provide an alter-
native model for a test of residual spatial autocorrelation in OLS residuals.
That is, we consider the following two models, a traditional linear regression
model with independent errors,
  Z(s) = X(s)β + ε,   Σ = σ²I,          (6.63)
and the one parameter SAR model:
  Z(s) = X(s)β + (I − ρW)⁻¹υ,   Σ = σ²(I − ρW)⁻¹(I − ρW′)⁻¹.          (6.64)

Setting ρ = 0 in the SAR model (6.64), we obtain the traditional lin-


ear regression model with independent errors (6.63). This nesting of the two
models, together with a traditional null model for which estimation is rather
straightforward (i.e., OLS), provides several approaches for constructing tests
of H0: ρ = 0 vs. H1: ρ ≠ 0 (or a corresponding one-sided alternative). Two
of the most common are the likelihood ratio test described earlier in §6.2.3.1
and the use of Moran’s I with OLS residuals (see §6.1.2). In order to carry
out the likelihood ratio test for H0: ρ = 0 in a SAR model, we are comparing the alternative model in equation (6.64) with parameters θ2 = [β′, σ², ρ]′ to the null model defined by equation (6.63) with parameters θ1 = [β′, σ²]′, with dim(θ2) − dim(θ1) = 1. A test of H0: θ = θ1 against the alternative H1: θ = θ2 is also a test of H0: ρ = 0 vs. H1: ρ ≠ 0 and it can be done by comparing

  ϕ(β; θ1; Z(s)) − ϕ(β; θ2; Z(s))          (6.65)
to a χ2 distribution with a single degree of freedom. The χ2 distribution is a
large sample approximation to the likelihood ratio statistic, and thus will give
approximately valid inference only for large n (the number of points on the
lattice). The size of the lattice necessary for the approximation to work well
depends on several factors including the particular structure of the spatial
proximity matrix W. In general, for small lattices, the χ21 distribution can
provide a very poor approximation to the distribution of the likelihood ratio
statistic.
The primary difference between hypothesis testing for CAR and SAR models based on Gaussian data is the definition of Σ(θ). In the simplest one-parameter cases, Σ(θ) = σ²(I − ρW)⁻¹ for a CAR model and Σ(θ) = σ²(I − ρW)⁻¹(I − ρW′)⁻¹ for a SAR model. Thus, the methods for testing H0: ρ = 0 vs. H1: ρ ≠ 0 apply to SAR and CAR models alike.

6.2.3.2 GLS Residual Analysis

The goals of residual analysis in the correlated error model


Z(s) = X(s)β + e(s), e(s) ∼ (0, Σ(θ))
are the same as in standard linear models. We would like to derive quantities
whose properties instruct us about the behavior of e(s). We want to use such
residual-type quantities to detect, for example, whether important terms have
been omitted from X(s) and whether the assumed model for Σ(θ) is correct.
Except, we are no longer working with ordinary least squares residuals, but
with (E)GLS residuals,

  êegls(s) = Z(s) − X(s)β̂egls.

To contrast residual analysis in the correlated error model with the analysis
of OLS residuals (§6.1.2), it is important to understand the consequences of
engaging a non-diagonal variance-covariance matrix in estimation. We will
focus on this aspect here, and therefore concentrate on GLS estimation. A

second layer of complications can be added on top of these differences because


θ is being estimated.
The first important result is that raw GLS residuals are about as useful as
raw OLS residuals in diagnosing properties of the model errors e(s). We can
write
  êgls(s) = Z(s) − Ẑ(s)
          = Z(s) − X(s)[X(s)′Σ(θ)⁻¹X(s)]⁻¹X(s)′Σ(θ)⁻¹Z(s)
          = {I − X(s)[X(s)′Σ(θ)⁻¹X(s)]⁻¹X(s)′Σ(θ)⁻¹} Z(s)
          = (I − H(θ)) Z(s) = M(θ)Z(s).          (6.66)

The matrix H(θ) is the gradient of the fitted values with respect to the observed data,

  H(θ) = ∂Ẑ(s)/∂Z(s),
and it is thus reasonable to consider it the “leverage” matrix of the model. As
with the model with uncorrelated errors, the leverage matrix is not diagonal, and its diagonal entries are not necessarily the same. So, as with OLS residuals, GLS residuals are correlated and heteroscedastic, and since rank{H(θ)} = k, the GLS residuals are also rank-deficient. These properties stem from fitting the model to data, and have nothing to do with whether the application is spatial or not. Since êgls(s) is afflicted with the same problems as êols(s), the same caveats apply. For example, a semivariogram based on the GLS residuals is a biased estimator of the semivariogram of e(s). In fact, the semivariogram of the GLS residuals may not be any more reliable or useful in diagnosing the correctness of the covariance structure than the semivariogram of the OLS residuals. From (6.66) it follows immediately that
  Var[êgls(s)] = M(θ)Σ(θ)M(θ)′,          (6.67)
which can be quite different from Σ(θ). Consequently, the residuals will not
be uncorrelated, even if the model fits well. If you fit a model by OLS but the
true variance-covariance matrix is Var[e(s)] = Σ(θ), then the variance of the
OLS residuals is
  Var[êols(s)] = Mols Σ(θ) Mols,

where Mols = I − X(s)(X(s)′X(s))⁻¹X(s)′.
There is more. Like the OLS residuals, the GLS residuals have zero mean, provided that E[e(s)] = 0. The OLS residuals also satisfy very simple constraints, for example X(s)′êols(s) = 0. Thus, if X(s) contains an intercept column, the OLS residuals not only have zero mean in expectation; they also sum to zero in every sample. The GLS residuals do not satisfy X(s)′êgls(s) = 0.
The question, then, is how to diagnose whether the covariance model has

been chosen properly? Brownie and Gumpertz (1997) recommend a procedure


that uses OLS residuals and the fit of a correlated error model in tandem.
Their procedure is to
1. Fit the models Z(s) = X(s)β + e(s), e(s) ∼ (0, Σ(θ)), and Z(s) = X(s)β + ε, ε ∼ (0, σ²I). The first model can be fit by REML, for example, yielding the estimate θ̂reml. The second model is fit by ordinary least squares.
2. Compute γ̂e(h), the empirical semivariogram of the OLS residuals êols(s).
3. Compute an estimate of Var[êols(s)] under the assumption that Σ(θ) is the correct model for the covariance structure. The “natural” estimate is the plug-in estimate

     A = Mols Σ(θ̂reml) Mols.

   Construct the “semivariogram” of the OLS residuals under the assumed model for Σ from the matrix A: for all pairs of observations in a particular lag class h, let ge(h) be the average value of ½([A]ii + [A]jj) − [A]ij.
4. Graphically compare the empirical semivariogram γ̂e(h) and the function ge(h). If Σ(θ) is the correct covariance function, the two should agree quite well; γ̂e(h) should scatter about ge(h) in a non-systematic fashion. (A small code sketch of this comparison follows the list.)

Example 6.1 (Soil carbon regression. Continued) Based on Figure 6.3


on page 313 we assumed that the model errors in the spatial linear regression
of C% on N% are spatially correlated. REML and ML parameter estimates for
a model with exponential covariance structure were then presented in Table
6.1 on page 324. Was the exponential model a reasonable choice? Following
the steps of the Brownie and Gumpertz (1997) method, Figure 6.6 displays the
comparisons of the empirical semivariograms of OLS residuals and ge (h). The
exponential model (left-hand panel) is not appropriate. The corresponding
plot is shown in the right-hand panel for an anisotropic covariance structure,
  Cov[Z(si), Z(sj)] = σ² exp{−θx|xi − xj|^φx − θy|yi − yj|^φy},
that provides a much better fit. The estimates of the covariance parameters for this model are θ̂x = 0.73, θ̂y = 0.40, φ̂x = 0.2, φ̂y = 0.2, σ̂² = 0.0016.
Because this model is not nested within the isotropic exponential model, a
quantitative comparison can be made based on information criteria. The bias-
corrected AIC criteria for the models are −733.9 for the isotropic model and
−755.3 for the anisotropic model. The latter is clearly preferable.

Notice that the matrix M(θ) is transposed on the right hand side of equa-
tion (6.67). The leverage matrix H(θ) is an oblique projector onto the column
space of X(s) (Christensen, 1991). It is not a projection matrix in the true
sense, since it is not symmetric. It is idempotent, however:
  H(θ)H(θ) = X(s)[X(s)′Σ(θ)⁻¹X(s)]⁻¹X(s)′Σ(θ)⁻¹ × X(s)[X(s)′Σ(θ)⁻¹X(s)]⁻¹X(s)′Σ(θ)⁻¹
            = X(s)[X(s)′Σ(θ)⁻¹X(s)]⁻¹X(s)′Σ(θ)⁻¹.

[Figure 6.6: two panels, “Isotropic Exponential Model” and “Anisotropic Model,” plotting semivariance (0.0007 to 0.0019) against distance (10 to 190).]

Figure 6.6 Comparison of ge(h) (open circles) and the empirical semivariogram of the OLS residuals for exponential models with and without anisotropy.

It also shares other properties of the leverage matrix in an OLS model (see
Chapter problems), and has the familiar interpretation as a gradient. Because
diagonal elements of H(θ) can be negative, other forms of leverage matrices in correlated error models have been proposed, for example, Σ(θ̂)⁻¹H(θ) and L⁻¹H(θ)L, where L is the lower-triangular Cholesky root of Σ(θ) (Martin, 1992; Haining, 1994).
In order to remedy shortcomings of the GLS residuals as diagnostic tools,
we can apply similar techniques as in §6.1.2. For example, the plug-in estimate
of the variance of the EGLS residuals is

  V̂ar[êegls(s)] = (I − H(θ̂))Σ(θ̂)(I − H(θ̂)′) = Σ(θ̂) − H(θ̂)Σ(θ̂),          (6.68)
and the ith studentized EGLS residual is computed by dividing êegls(si) by the square root of the ith diagonal element of (6.68). The Kenward-Roger variance estimator (6.53) can be used to improve this studentization.
We can also apply ideas of error recovery to derive a set of n − k variance-
covariance standardized, uncorrelated, and homoscedastic residuals. Recall
from §6.1.2 that if the variance of the residuals is A, then we wish to find an
(n × n) matrix Q such that

  Var[Q′êegls(s)] = Q′AQ = ⎡ I(n−k)        0((n−k)×k) ⎤
                           ⎣ 0(k×(n−k))    0(k×k)     ⎦ .

The first n − k elements of Q′êegls(s) are the linearly recovered errors (LREs) of e(s). Since A = Var[êegls(s)] = Σ(θ̂) − H(θ̂)Σ(θ̂) is real, symmetric, and positive semi-definite, it has a spectral decomposition P∆P′ = A. Let D denote the diagonal matrix whose ith diagonal element is 1/√δi if δi > 0 and zero otherwise. The needed matrix to transform the residuals êegls(s) is Q = PD.
You can also compute a matrix Q with the needed properties by way of a Cholesky decomposition for positive semi-definite matrices. This decomposition yields a lower triangular matrix L such that LL′ = A. The decomposition is obtained row-wise, and elements of L corresponding to singular rows are replaced with zeros. Then choose Q = L⁻, where the superscript − denotes a generalized inverse, obtained, for example, by applying the sweep operator to all rows of L (Goodnight, 1979).
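A small sketch of the spectral-decomposition route, assuming the matrix A has been formed as above:

import numpy as np

def lre_transform(A, tol=1e-10):
    # spectral decomposition A = P diag(delta) P'; D holds 1/sqrt(delta_i) or 0
    delta, P = np.linalg.eigh(A)
    order = np.argsort(delta)[::-1]          # largest eigenvalues first
    delta, P = delta[order], P[:, order]
    d = np.where(delta > tol, 1.0 / np.sqrt(np.clip(delta, tol, None)), 0.0)
    return P @ np.diag(d)                    # Q = PD

# usage sketch: the first n - k entries of Q' e_egls are the recovered errors
# Q = lre_transform(A); lre = (Q.T @ e_egls)[: n - k]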
Other types of residuals can be considered in correlated error models. Haslett
and Hayes (1998), for example, define marginal and conditional (prediction)
residuals. Houseman, Ryan, and Coull (2004) define rotated residuals based
on the Cholesky root of the inverse variance matrix of the data, rather than
a root constructed from the variance of the residuals.

6.3 Generalized Linear Models

6.3.1 Background

A linear model may not always be appropriate, particularly for discrete data
that might be assumed to follow a Poisson or Binomial distribution. General-
ized linear models (comprehensively described and illustrated in the treatise
by McCullagh and Nelder, 1989) are one class of statistical models devel-
oped specifically for such situations. These models are now routinely used for
modeling non-Gaussian longitudinal data, usually using a “GEE” approach
for inference. The GEE approach was adapted for time series count data by
Zeger (1988), and in the following sections we show how these ideas can be
applied to non-Gaussian spatial data.
In all of the previous sections we have assumed that the mean response is a
linear function of the explanatory covariates, i.e., µ = E[Z(s)] = X(s)β. We
also implicitly assumed that the variance and covariance of observations does
not depend on the mean. Note that this is a separate assumption from mean
stationarity. The implicit assumption was that the mean µ does not convey
information about the variation of the data. For non-Gaussian data, these
assumptions are usually no longer tenable. Suppose that Y1 , · · · , Yn denote
uncorrelated binary observations whose mean depends on some covariate x.
If E[Yi ] = µ(xi ), then

• Var[Yi ] = µ(xi )(1−µ(xi )). Knowing the mean of the data provides complete
knowledge about the variation of the data.

• Relating µ and x linearly is not a reasonable proposition on (at least)


two grounds. The mean µ(xi ) must be bounded between 0 and 1 and this
constraint is difficult to impose on a linear function such as β0 + β1 x. The
effect of the regressor x may not be linear on the scale of the mean, which is
a probability. Linearity is more likely on a transformed scale, for example,
the logit scale log{µ(xi )/(1 − µ(xi ))}.
The specification of a generalized linear model (GLM) entails three compo-
nents: the linear predictor, the link function, and the distributional specifica-
tion. The linear predictor in a GLM expresses the model parameters as a linear
function of covariates, x(s)! β. We now assume that some monotonic function
of the mean, called the link function, is linearly related to the covariates, i.e.,
g(µ) = g(E[Z(s)]) = X(s)β, (6.69)
where µ is the vector of mean values [µ(s1), µ(s2), · · · , µ(sn)]′ of the data vector
Z(s). This enables us to model nonlinear relationships between the data and
the explanatory covariates that often arise with binary data, count data, or
skewed continuous data. In a GLM, the distribution of the data is a member
of the exponential family. This family is very rich and includes continuous
distributions such as the Gaussian, Inverse Gaussian, Gamma, and Beta dis-
tributions. Among the discrete members of the exponential family are the
binary (Bernoulli), geometric, Binomial, negative Binomial, and Poisson dis-
tributions. The name exponential family stems from the fact that the densities
or mass functions in this family can be written in a particular exponential
form. The identification of a link function and the mean-variance relationship
is straightforward in this form. A density or mass function in the exponential
family can be written as

  f(y) = exp{ [yφ − b(φ)]/ψ + c(y, a(ψ)) }          (6.70)
for some functions b() and c(). The parameter ψ is a scale parameter and not
all exponential family distributions have this parameter (for those that do
not, we may simply assume ψ ≡ 1).

Example 6.5 Let Y be a binary random variable so that


  Pr(Y = y) = µ        if y = 1,
              1 − µ    if y = 0,

or, more concisely, p(y) = µ^y (1 − µ)^(1−y), y ∈ {0, 1}. Exponentiating the mass function, you obtain

  p(y) = exp{ y log[µ/(1 − µ)] + log{1 − µ} }.

The natural parameter is the logit of the mean, φ = log{µ/(1 − µ)}. The other elements in the exponential form (6.70) are b(φ) = log{1 + e^φ} = −log{1 − µ}, ψ = 1, and c(y, ψ) = 0.

To specify a statistical model in the exponential family of distributions you


can draw on some important relationships. For example, the mean and vari-
ance of the response relate to the first and second derivative of b(φ) as follows:
E[Y] = µ = b′(φ), Var[Y] = ψb′′(φ). If we express the natural (= canonical) parameter φ as a function of the mean µ, the second derivative v(µ) = b′′(φ(µ))
is called the variance function of the GLM. In the binary example above
you can easily verify that v(µ) = µ(1 − µ).
In (6.69), the linear predictor x(s)′β was related to the linked mean g(µ). Since the link function is monotonic, we can also express the mean as a function of the inverse linked linear predictor, µ = g⁻¹(x(s)′β). Compare this to the relationship between b(φ) and the mean, µ = b′(φ). If we substitute φ = x(s)′β, then the first derivative of b(·) could be the inverse link function.
In other words, every exponential family distribution implies a link function
that arises naturally from the relationship between the natural parameter φ
and the mean of the data µ. Because φ is also called the canonical param-
eter, this link is often referred to as the canonical link. The function b′(φ)
is then the inverse canonical link function. For Poisson data, for example,
the canonical link is the log link, for binary and binomial data it is the logit
link, for Gaussian data it is the identity link (no transformation). You should
feel free to explore other link functions than the canonical ones. Although
they are good starting points in most cases, non-canonical links are preferable
for some distributions. For example, data following a Gamma distribution
are non-negative, continuous, and right-skewed. The canonical link for this distribution is the reciprocal link, 1/µ = x(s)′β. This link does not guarantee non-negative predicted values. Instead, the log link ensures positivity, µ = exp{x(s)′β}.

6.3.2 Fixed Effects and the Marginal Specification

In traditional applications of GLMs (e.g., in the development of Dobson,


1990), the data are assumed to be independent, but with heterogeneous vari-
ances given by the variance function. Thus, the variance-covariance matrix of
the data Z(s) is
Σ = Var[Z(s)] = ψVµ , (6.71)
where Vµ is an (n × n) diagonal matrix with the variance function terms
v(µ(si )) on the diagonal.
To adapt GLMs for use with spatial data, we need to modify the traditional
GLM variance-covariance matrix in equation (6.71) to reflect small-scale spa-
tial autocorrelation. In §6.2 we incorporated spatial autocorrelation by allow-
ing a more general variance-covariance matrix Σ(θ), where θ is a q × 1 vector
of unknown parameters with q << n. We extend this idea here by modifying
Σ(θ) to include the variance-to-mean relationships inherent in a GLM specifi-
cation. Based on the ideas in Wolfinger and O’Connell (1993) and Gotway and
Stroup (1997), one such approach is to model the variance-covariance matrix

of the data as
  Var[Z(s)] = Σ(µ, θ) = σ²Vµ^(1/2) R(θ) Vµ^(1/2),          (6.72)
where R(θ) is a correlation matrix with elements ρ(si − sj; θ), the spatial correlogram defined in §1.4.2. The diagonal matrix Vµ^(1/2) has elements equal to the square root of the variance functions, √v(µ).
If R(θ) = I, then (6.72) reduces to σ²Vµ, which is not quite the same as (6.71). The parameter σ² obviously equals the scale parameter ψ for those exponential family distributions that possess a scale parameter. In cases where ψ ≡ 1, for example for binary, Binomial, or Poisson data, the parameter σ² measures overdispersion of the data. Overdispersion is the phenomenon by which the data are more dispersed than is consistent with a particular distributional assumption. Adding the multiplicative scale factor σ² in (6.72) is a basic method to account for the “inexactness” in the variance-to-mean relationship.
Recall that the correlogram or autocorrelation function is directly related
to the covariance function and the semivariogram, and so many of the ideas
concerning parametric modeling of covariance functions and semivariograms
described in §4.3 apply here as well. Often we take q = 1 and θ to equal the
range of spatial autocorrelation, since the sill of a correlogram is 1. A nugget
effect can be included by using

  Var[Z(s)] = c0Vµ + σ²Vµ^(1/2) R(θ) Vµ^(1/2).

In this case, Var[Z(si)] = (c0 + σ²)v(µ(si)), and the covariance between any two variables is Cov[Z(si), Z(sj)] = σ²√(v(µ(si))v(µ(sj))) ρ(si − sj).
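As a sketch of how (6.72), with or without a nugget, can be assembled in practice, assume an exponential correlogram and the Poisson variance function v(µ) = µ; both are illustrative choices, not prescribed by the text.

import numpy as np

def glm_spatial_cov(mu, coords, sigma2, theta, c0=0.0, v=lambda m: m):
    # Sigma(mu, theta) = c0*Vmu + sigma^2 * Vmu^{1/2} R(theta) Vmu^{1/2}
    d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
    R = np.exp(-d / theta)                       # exponential correlogram
    vm = v(np.asarray(mu, dtype=float))
    sd = np.sqrt(vm)                             # square roots of the variance functions
    Sigma = sigma2 * np.outer(sd, sd) * R
    Sigma[np.diag_indices_from(Sigma)] += c0 * vm
    return Sigma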
2

6.3.3 A Caveat

Before we proceed further with spatial models for non-Gaussian data, an im-
portant caveat of the presented model needs to be addressed. In the Gaussian
case you can always construct a multivariate distribution with mean µ and
variance Σ. This also leads to valid models for the marginal distributions,
these are of course Gaussian with respective means and variances µi and
[Σ]ii . The model
  Y = µ + ε,   ε ∼ G(0, Σ)

is a generalization of

  Y = µ + ε,   ε ∼ G(0, I).
For non-Gaussian data, such generalizations are not possible. In the indepen-
dence case, there is a known marginal distribution for each observation based
on the exponential family. There is also a valid joint likelihood, the product of
the individual likelihoods. Estimation in generalized linear models can largely
be handled based on a specification of the first two moments of the responses
alone. Expressions (6.69) and (6.72) extend these moment specifications to

the non-Gaussian, spatially correlated case. There is, however, no claim made
at this point that the underlying joint distribution may be a “multivariate
Binomial” distribution, or some such thing. We may even have to step back
from the assumption that a joint distribution in which
• the mean is defined by (6.69),
• the variance is given by (6.72) with R ≠ I,
• the marginal distributions are Binomial, Poisson, etc.,
exists at all.
Moreover, in addition to the difficulties inherent in building non-Gaussian
multivariate distributions described in §5.8, the mean-variance relationship
also imposes constraints on the possible covariation between the responses. In
the simple case of n binary outcomes with common mean (success probability)
µ, Gilliland and Schabenberger (2001) examine the constraints placed on the
association parameter ρ in a model of equi-correlation (compound symmetry),
when the joint distribution on {0, 1}n is symmetric with respect to permu-
tation of coordinates. While the association parameter is not restricted from
above, it is severely restricted from below (Figure 6.7). The absolute mini-
mum in each case is of course ρmin = −1/(n − 1). This lower bound applies to
any n equi-correlated random variables, regardless of their distribution. But
in the binary case—especially for small n—the lower bound can be substan-
tially larger than ρmin . The bound is achieved for µ = 0.5 if n is even and for
µ = ±1/n. The valid parameter space depends on the sample size and on the
unknown success probability.
If one places further restrictions on the joint distributions, the valid com-
binations of (µ, ρ) are further constrained. If one sets third- and higher-order
correlations to zero as in Bahadur (1961), then the lower bounds are even
more restrictive than shown in Figure 6.7 and the range of ρ is now also bounded from above (see also Kupper and Haseman, 1978). Such higher-
order restrictions may be necessary to facilitate a particular parameterization
or estimation technique, e.g., second-order generalized estimating equations.
Prentice (1988) notes that without higher-order effects, models for correlated
binomial data sacrifice a desired marginalization and flexibility.

6.3.4 Mixed Models and the Conditional Specification

In the development in §6.3.2, we assumed that β is a vector of fixed, unknown


parameters. The literature refers to this approach as the marginal specifi-
cation since the marginal mean, E[Z(s)], is modeled as a function of fixed,
non-random (but unknown) parameters. An alternative specification defines
the distribution of each Z(s) conditional on an unobserved (latent) spatial pro-
cess. We considered this type of specification with linear models in §6.2.1.3.
With linear models, the marginal and conditional specifications give the same
inference, but in a GLM, the two approaches generally lead to models with

[Figure 6.7: lower bound of ρ (horizontal axis, −1.0 to 0.0) plotted against µ (vertical axis, 0.1 to 0.9), with one curve for each of n = 2, 3, 4, 5, 6.]

Figure 6.7 Lower bounds of the correlation parameter of equi-correlated binary obser-
vation with mean µ so that the n-variate joint distribution is permutation symmetric
(n = 2, 3, 4, 5, 6). For a given n, the region to the right of the respective curve is that
of possible combinations of (µ, ρ) under this model. After Gilliland and Schaben-
berger (2001).

very different interpretations. In a GLM, the conditional approach incorpo-


rates the unobserved spatial process through the use of random effects within
the mean function and models the conditional mean and variance of Z(s) as
a function of both fixed covariate effects and these random effects deriving
from the unobserved spatial process. The conditional formulation leads to a
generalized linear mixed model (GLMM).
We assume the data are conditionally dependent on an underlying, smooth,
spatial process {S(s) : s ∈ D}. Given S(s), Z(s) has distribution in the expo-
nential family. Instead of relating the marginal mean E[Z(s)] to the covariates,
we now consider the conditional mean

E[Z(s)|S] ≡ µ(s).

The link function relates this conditional mean to the explanatory covariates

and also to the underlying spatial random field so that


  g[µ(s)] = x(s)′β + S(s).          (6.73)
Note that S(s), a random effect at location s, enters the linear component of
the GLM as an addition to the intercept. At any location, we can consider
S(s) to represent a random intercept that varies with spatial location (or more
accurately, a random addition to the intercept). In tandem with traditional
GLM development, we allow the conditional variance to depend on the mean
via
  Var[Z(s)|S] = σ²v(µ(s)),          (6.74)
where the function v(·) is the variance function described earlier and σ 2 is a
dispersion parameter. Finally, to complete the model specification, we need
to specify the spatial dependence structure in the data. Instead of modeling
Var[Z(s)] directly, we assume the data are conditionally independent (i.e., in-
dependent given S) and that {S(s)} is a Gaussian random field with mean 0
and covariance function σS² ρS(si − sj). Thus, the assumption of conditional
independence defers treatment of spatial autocorrelation to the {S(s)} pro-
cess.
This conditional specification and the marginal specification given in the
previous section are different in structure and interpretation. Because S(s) is
a component of the conditional mean and because the link function g(·) is a
nonlinear function, we have,
  E[Z(s)] = ES[E[Z(s)|S]]
          = ES[g⁻¹(x(s)′β + S(s))]
          ≠ g⁻¹(x(s)′β).
Taking expectations is a linear operation and does not carry through in case
of a nonlinear link function. Evaluating the inverse link function at x(s)′β
does not yield the marginal mean. In some cases, such as the log link, the
marginal mean can be derived, or equivalently, corrections can be applied to
g −1 (x(s)! β); see, for example, Zeger, Liang, and Albert (1988).

Example 6.6 Consider a model with canonical link, g(µ) = log{µ}, identity
variance function v(µ) = µ, and let m(s) = exp{x(s)′β}. Such a construction
arises, for example, in Poisson regression. Using results from statistical theory
relating marginal and conditional means and variances, together with the
mean and variance relationships of the lognormal distribution (see §5.6.1),
the marginal moments of the data can be derived as
  E[Z(s)] = ES[E[Z(s)|S]] = ES[m(s) exp{S}] = m(s) exp{σS²/2},

  Var[Z(s)] = m(s)[ σ² exp{σS²/2} + m(s) exp{σS²}(exp{σS²} − 1) ],

  Cov[Z(si), Z(sj)] = m(si)m(sj) exp{σS²}(exp{σS² ρ(si − sj)} − 1).

Note Var[Z(s)] > E[Z(s)], even if σ² = 1. Also, both the overdispersion and
the autocorrelation induced by the latent process {S(s)} depend on the mean,
so the conditional model can be used with non-stationary spatial processes.
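The moment expressions of Example 6.6 are easy to evaluate; the sketch below simply codes them for a pair of locations (all argument names are hypothetical).

import numpy as np

def marginal_moments(m_i, m_j, sigma2, sigma2_S, rho_ij):
    # marginal mean, variance, covariance from Example 6.6 (log link, v(mu) = mu)
    mean_i = m_i * np.exp(sigma2_S / 2.0)
    var_i = m_i * (sigma2 * np.exp(sigma2_S / 2.0)
                   + m_i * np.exp(sigma2_S) * (np.exp(sigma2_S) - 1.0))
    cov_ij = m_i * m_j * np.exp(sigma2_S) * (np.exp(sigma2_S * rho_ij) - 1.0)
    return mean_i, var_i, cov_ij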

As with the marginal spatial GLMs described in §6.3.2, the correlation


function ρS(si − sj) can be modeled as a function of a q × 1 vector of unknown
parameters θ S that completely characterizes the spatial dependence in the
underlying process, i.e., Corr[S(si ), S(sj )] = ρS (si − sj ; θ S ).

6.3.5 Estimation in Spatial GLMs and GLMMs

Traditional GLMs allow us to move away from the Gaussian distribution and
utilize other distributions that allow mean-variance relationships. However,
the likelihood-based inference typically used with these models requires a
multivariate distribution, and when the data are spatially-autocorrelated we
cannot easily build this distribution as a product of marginal likelihoods as
we do when the data are independent.
We can bypass this problem by using conditionally-specified generalized lin-
ear mixed models (GLMMs), since these models assume that, conditional on
random effects, the data are independent. The conditional independence and
hierarchical structure allow us to build a multivariate distribution, although
we cannot always be sure of the properties of this distribution. However, the
hierarchical structure of GLMMs poses problems of its own. As noted in Bres-
low and Clayton (1993), exact inference for first-stage model parameters (e.g.,
the fixed covariate effects) typically requires integration over the distribution
of the random effects. This necessary multi-dimensional integration is diffi-
cult and can often result in numerical instabilities, requiring approximations
or more computer-intensive estimation procedures.
There are several ways to avoid all of these problems (although arguably
they all introduce other problems). To address the situation where we can
define means and variances, but not necessarily the entire likelihood, Wed-
derburn (1974) introduced the notion of quasi-likelihood based on the first
two moments of a distribution, and the approach sees wide application for
GLMs based on independent data (McCullagh and Nelder, 1989). This leads
to an iterative estimating equation based on only the first two moments of a
distribution that can be used with spatial data. Another solution is based on
an initial Taylor series expansion that then allows pseudo-likelihood methods
for spatial inference similar to those described in §5.5.2. A similar approach,
called penalized quasi-likelihood by Breslow and Clayton (1993), uses a Laplace
approximation to the log-likelihood. If combined with a Fisher-scoring algo-
rithm, the estimating equations for the fixed effects parameters and predictors

of random effects in the model are of the same form as the mixed model equa-
tions in the pseudo-likelihood approach of Wolfinger and O’Connell (1993).
Finally, Bayesian hierarchical models and the simulation methods used for
inference with them offer another popular alternative.

6.3.5.1 Generalized Estimating Equations

As noted briefly above, quasi-likelihood builds inference in terms of the first


two moments only, rather than the entire joint distribution, but it has many
of the familiar properties of a joint likelihood. For example, estimates derive
from maximizing the quasi-likelihood function using score equations (partial
derivatives set equal to zero) and have nice asymptotic properties such as
limiting Gaussian distributions. While most fully developed for independent
data, McCullagh and Nelder (1989) extend quasi-likelihood estimation to de-
pendent data, and Gotway and Stroup (1997) cast the approach in a spatial
GLM context, which we summarize here.
The quasi-likelihood (or quasi-log-likelihood, to be more exact) function
Q(µ; z) is defined by the relationship

  ∂Q(µ; z)/∂µ = V⁻¹(Z(s) − µ(β)),

where V represents the variance-covariance matrix capturing the spatial correlation, and µ(β) ≡ µ = (µ1, · · · , µn)′ is the mean vector. We use the notation
µ(β) to emphasize the dependence of the mean on β that arises through the
link function. Note that in most GLMs, the elements of V are also functions
of µ(β). Differentiating Q(µ; z) with respect to each element of β yields the
set of quasi-likelihood score equations

  ∆′V⁻¹(Z(s) − µ(β)) = 0,

where ∆ denotes the matrix with elements [∆]ij = ∂µi/∂βj and j = 1, · · · , p
indexes the parameters in the linear portion of the GLM. Solving the score
equations for β yields the quasi-likelihood estimates of the model parameters.
McCullagh and Nelder (1989, pp. 333–335) note that the inverse of the
variance-covariance matrix V−1 must satisfy several conditions to guarantee
a solution to the score equations, some not easily verified in practice. One
problem in practical applications is that V−1 is assumed to depend only on
β and not on any unknown spatial autocorrelation parameters θ. As a result,
Wolfinger and O’Connell (1993) and Gotway and Stroup (1997) follow Liang
and Zeger (1986) and Zeger and Liang (1986) and limit attention to variance-
covariance matrices of the form introduced in equation (6.72),
  V ≡ Σ(µ, θ) = σ²Vµ^(1/2) R(θ) Vµ^(1/2),

where R(θ) denotes a matrix of correlations among the observations parameterized by the vector θ, and Vµ^(1/2) a diagonal matrix of variance functions.
With V now also depending on unknown autocorrelation parameters, the

quasi-likelihood score equations are known as generalized estimating equations.


Liang and Zeger (1986) and Zeger and Liang (1986) show in the context of
longitudinal data, that, under mild regularity conditions, these equations gen-
erate consistent estimators of β, even with mis-specified correlation matrices.
Zeger (1988) shows that the same conditions are met for a single time series
replicate, provided the covariance function is “well behaved” in the sense that
Σ(µ, θ) breaks into independent blocks for large n. This same result holds for
spatial data as well (McShane et al., 1997). In the spatial setting, to avoid
any distributional assumptions in a completely marginal analysis, Gotway
and Stroup (1997) suggest using an iteratively re-weighted generalized least
squares approach to solving the generalized estimating equations for β and
θ. With η = X(s)β, the matrix ∆ can be written as ∆ = ΨX(s), where
Ψ = diag[∂µi /∂ηi ], and the generalized estimating equations can be written
as
  X(s)′A(θ)X(s)β = X(s)′A(θ)Z(s)*,          (6.75)

where A(θ) = ΨΣ(µ, θ)⁻¹Ψ and Z(s)* = X(s)β + Ψ⁻¹(Z(s) − µ).
Assume for the moment that θ is known. The solution to the estimating
equation (6.75) is
  β̂ = (X(s)′A(θ)X(s))⁻¹X(s)′A(θ)Z(s)*,

a generalized least squares estimator in the model

  Z(s)* = X(s)β + e(s),   e(s) ∼ (0, A(θ)⁻¹).
In fact, this model can be derived from a first-order Taylor series of the mean
µ = g⁻¹(X(s)β) in the model

  Z(s) = g⁻¹(X(s)β) + e*(s),   e*(s) ∼ (0, V).          (6.76)
This marginal formulation is somewhat “unnatural” in generalized linear mod-
els, since it assumes a zero mean error vector, but it suffices for estimating
techniques that require only the first two moments. Now, expanding about
some value β̃, we obtain

  µ ≈ g⁻¹(X(s)β̃) + [∂g⁻¹(X(s)β)/∂β]|β̃ (β − β̃)
    = µ̃ + Ψ̃X(s)(β − β̃).
Substituting the approximated mean into (6.76) and re-arranging yields the pseudo-model

  Z(s)* = Ψ̃⁻¹(Z(s) − µ̃) + X(s)β̃
        = X(s)β + Ψ̃⁻¹e*(s).          (6.77)
This model is the linearized form of a nonlinear model with spatial corre-
lated error structure. The pseudo-response Z(s)∗ is also called the “working”
outcome variable, but the attribute should not be confused with the concept

of a “working correlation structure” in GEE estimation. The working out-


come variable earns the name because its values change every time we “work
through” the generalized least squares problem.
Solving (6.75) is an iterative process (even if θ is known), because in order
to evaluate the pseudo-response Z(s)∗ , we need to know the mean, hence, β
must be available. In order to start the process, we assume some starting value
for β. Another way of looking at the iterative nature is through the Taylor
series expansion (6.77). The initial expansion locus (the starting value β̃) is arbitrary. Hence, the GLS solution to (6.75) depends on that choice. Thus,
after a solution has been obtained, we update the pseudo-response Z(s)∗ and
repeat the fitting process. The procedure stops if the estimates of β do not
change between two model fits.
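A compact sketch of this iterative scheme for a Poisson model with log link is given below; Sigma_of(mu) is a hypothetical placeholder for whatever routine returns Σ(µ, θ) at the current (here fixed) covariance parameters.

import numpy as np

def irwgls_poisson(z, X, Sigma_of, beta0, tol=1e-8, maxit=50):
    # iteratively re-weighted GLS solution of (6.75) for a log-link Poisson model
    beta = np.asarray(beta0, dtype=float)
    for _ in range(maxit):
        eta = X @ beta
        mu = np.exp(eta)                          # inverse log link
        zstar = eta + (z - mu) / mu               # pseudo-response Psi^{-1}(z - mu) + X beta
        Psi = np.diag(mu)                         # d mu / d eta = mu for the log link
        A = Psi @ np.linalg.inv(Sigma_of(mu)) @ Psi
        beta_new = np.linalg.solve(X.T @ A @ X, X.T @ A @ zstar)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta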
The formulation in terms of the linearization of a nonlinear model with
correlated error is also helpful because it points us to methods for estimating
the covariance parameters. If θ is unknown—as is typically the case—we form
residuals after each fit of model (6.77) and apply the geostatistical techniques
in Chapter 4. It is recommended to perform a studentization first, to remedy
the heteroscedasticity problem. See (6.68) on page 351 on the studentization
of GLS residuals. Probably the two most frequent current practices are to
(i) estimate θ by fitting a semivariogram model to the raw GLS residuals,
and to repeat this process after each GLS fit;
(ii) estimate θ by initially fitting a semivariogram model to the raw residu-
als from a generalized linear model without any spatial correlation structure
(R(θ) = I). Then, the GLS fit is performed once, assuming that the θ̂ so obtained equals the true value. In other words, no further updates of the covariance parameters are obtained.
You could also formulate a second set of pseudo-responses based on the GLS
residuals and apply the generalized estimating equation or composite likeli-
hood techniques of §4.5.3.

6.3.5.2 Pseudo-likelihood Estimation

Instead of quasi-likelihood, Wolfinger and O’Connell (1993) suggest an ap-


proach termed pseudo-likelihood (PL) as a flexible and efficient way of estimat-
ing the unknown parameters in a generalized linear mixed model (GLMM).
(We note that this “pseudo-likelihood” is different from the one developed by
Besag (1975) for inference with non-Gaussian auto models.)
The pseudo-likelihood approach of Wolfinger and O’Connell (1993) differs
from the quasi-likelihood approach above in that at any step of the iterative
process, a function that is a true joint likelihood is used to estimate unknown
parameters. In the case we consider here, a pseudo-likelihood approach as-
sumes β is known and estimates σ 2 and θ using ML (or REML, as described
in §5.5.2–5.5.3), and then assumes θ is known and estimates β using ML

(or EGLS), and iterates until convergence. We can apply this approach to
marginal GLMs as special cases of a GLMM.
The idea behind the pseudo-likelihood approach is to linearize the problem
so we can use the approach in §5.5.2–5.5.3 for estimation and inference. This
is done using a first-order Taylor series expansion of the link function to
give what Wolfinger and O’Connell (1993) call “pseudo data” (similar to the
“working” outcome variable used in the generalized estimating equations)

Z(s) ). As before, the pseudo-data is constructed as
νi = g('
µi ) + g ! ('
µi )(Z(si ) − µ
'i ), (6.78)
where g′(µ̂i) is the first derivative of the link function with respect to µ, evaluated at the current estimate µ̂. To apply the methods in §5.5.2–5.5.3,
we need the mean and variance-covariance matrix of the pseudo data, ν.
Conditioning on β and S, assuming Var[Z(s)|S] has the form of equation
(6.72), and using some approximations described in Wolfinger and O’Connell
(1993), these can be derived in almost the traditional fashion as
E[ν|β, S] = Xβ + S
Var[ν|β, S] = Σµ̂ ,
with
  Σµ̂ = σ²Ψ̂⁻¹Vµ̂^(1/2) R(θ) Vµ̂^(1/2)Ψ̂⁻¹ = Ψ̂⁻¹Σ(µ̂, θ)Ψ̂⁻¹.          (6.79)

Recall that the matrix Ψ̂ is an (n × n) diagonal matrix with typical element [∂µ(si)/∂η(si)] and is evaluated at µ̂. The marginal moments of the pseudo-data are

  E[ν] = Xβ
  Var[ν] = ΣS + Σµ̂ ≡ Σν,

and ΣS has (i, j)th element σS² ρS(si − sj). This can be considered as a general
linear regression model with spatially autocorrelated errors as described in
§6.2, since the mean (of the pseudo-data ν) is linear in β. Thus, if we are
willing to assume Σµ̂ is known (or at least does not depend on β) when we
want to estimate β, and that β is known when we want to estimate θ, we can
maximize the log-likelihood analytically yielding the least squares equations
  β̂ = (X′Σν⁻¹X)⁻¹X′Σν⁻¹ν,                                    (6.80)

  Ŝ = ΣSΣν⁻¹(ν − Xβ̂),                                        (6.81)

  σ̂² = (ν − Xβ̂)′(Σ*ν)⁻¹(ν − Xβ̂)/n.                           (6.82)
(The matrix Σ*ν in (6.82) is obtained by factoring the residual variance from Σµ̂ and ΣS.) However, because Σµ̂ does depend on β, we iterate as follows:
1. Obtain an initial estimate of µ̂ from the original data. An estimate from the non-spatial generalized linear model often works well;

2. Compute the pseudo-data from equation (6.78);


3. Using ML (or REML) with the pseudo-data, obtain estimates of the spatial autocorrelation parameters, θ, and σS² in Σν;
4. Use these estimates to compute generalized least squares estimates (which are also the maximum pseudo-likelihood estimators) of β and σ² from equations (6.80) and (6.82) and to predict S from equation (6.81);
5. Update the estimate of µ, using µ̂ = g⁻¹(Xβ̂ + Ŝ);
6. Repeat these steps until convergence (a schematic sketch of this loop is given below).
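The following schematic codes the loop above under simplifying assumptions; fit_covariance is a hypothetical stand-in for the REML fit in step 3, and g, ginv, gprime are the link, inverse link, and link derivative.

import numpy as np

def pseudo_likelihood_glmm(z, X, g, ginv, gprime, fit_covariance, maxit=20, tol=1e-6):
    # doubly iterative pseudo-likelihood loop (a sketch, not a full implementation)
    mu = np.clip(np.asarray(z, dtype=float), 0.5, None)   # crude start for count data
    beta = np.zeros(X.shape[1])
    S = np.zeros(len(z))
    for _ in range(maxit):
        nu = g(mu) + gprime(mu) * (z - mu)                 # pseudo-data, eq. (6.78)
        Sigma_nu, Sigma_S = fit_covariance(nu, X)          # step 3: REML for theta, sigma_S^2
        Si = np.linalg.inv(Sigma_nu)
        beta_new = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ nu)   # eq. (6.80)
        S = Sigma_S @ Si @ (nu - X @ beta_new)                    # eq. (6.81)
        mu = ginv(X @ beta_new + S)                               # step 5
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new, S
        beta = beta_new
    return beta, S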

Approximate standard errors for the fixed effects derive from


  V̂ar(β̂) = (X′Σ̂ν⁻¹X)⁻¹,

using the converged parameter estimates of θ in Σν to define the estimator Σ̂ν. We can construct approximate p-values using t-tests analogous to those
described in §6.2.3.1.
As for the marginal model and the generalized estimating equation ap-
proach, we can cast the pseudo-likelihood approach in terms of a linearization
and a pseudo-model. In the marginal case, the pseudo-model had the form of a
general linear model with correlated errors. In the conditional case, we arrive
at a linear mixed model. First, take a first-order Taylor series of µ about some
values β̂ and Ŝ(s),

  µ ≈ µ̂ + Ψ̂X(s)(β − β̂) + Ψ̂(S(s) − Ŝ(s)).

Rearranging terms we can write

  Ψ̂⁻¹(µ − µ̂) + X(s)β̂ + Ŝ(s) ≈ X(s)β + S(s).

Notice that the left-hand side of this expression is the conditional expectation, given S(s), of

  ν = Ψ̂⁻¹(Z(s) − µ̂) + X(s)β̂ + Ŝ(s),          (6.83)
since we assume that the expansion locus in the Taylor series is fixed. The
conditional variance is given by (6.79). We can now consider a linear mixed
model ν = X(s)β + S(s) + e(s), where Var[e(s)] = Σµ̂ and Var[S(s)] = ΣS.
In order to estimate θ, we further assume that the pseudo-data ν follows a
Gaussian distribution and apply ML or REML estimation. For example, minus twice the restricted log likelihood for the linearized pseudo-model is

  ϕR(θ; Kν) = ln{|Σν|} + ln{|X(s)′Σν⁻¹X(s)|} + r′Σν⁻¹r + (n − k) ln{2π},          (6.84)

where r = ν − X(s)(X(s)′Σν⁻¹X(s))⁻¹X(s)′Σν⁻¹ν.

While in the marginal model (with generalized estimating equations) ap-


proach we are repeatedly fitting a general linear model, in the pseudo-likelihood
approach we are repeatedly fitting a linear mixed model. One of the advan-
tages of this approach is that steps 3 and 4 in the algorithm above can be

accommodated within the mixed model framework. Estimates of θ are ob-


tained by minimizing (6.84), and updated estimates of θ and predictions of
S(s) are obtained by solving the mixed model equations,
? ! !
@? @ ? !
@
X(s) Σµ ' X(s) X(s) Σµ ' β' X(s) Σµ ' ν
−1 = . (6.85)
Σµ' X(s) Σ µ' + ΣS
'
S(s) S(s)Σ 'ν
µ
The solutions to the mixed model equations are (6.80) and (6.81).
We have a choice as to how to model any spatial autocorrelation: through
R or through ΣS . Marginal models let R be a spatial correlation matrix,
R(θ), and set S = 0. Conditional models are specified through S and ΣS ,
with spatial dependence incorporated in ΣS = σS² V(θS), and R equal to an identity matrix. Once we have determined which type of model we want to use, we use the iterative approach just described to estimate any unknown parameters (σ², β, θ for a marginal model and σ², σS², β, θS for a conditional model).
This distinction between marginal and conditional approaches should not
be confused with another issue of “marginality” that comes about in the
pseudo-likelihood approach. Because we have started with a Taylor series in
a conditional model about β̂ and Ŝ(s), the question is how to choose these
expansion loci. In case of the random field S(s), there are two obvious choices.
We can expand about some current predictor of S(s), for example, the solution
to the mixed model equations, or about the mean E[S(s)] = 0. The latter
case is termed a marginal or population-averaged expansion, a term borrowed
from longitudinal data analysis. The former expansion is also referred to as a
conditional or subject-specific expansion. A marginal expansion also yields a
pseudo-model of the linear mixed model variety, however, the pseudo-data in
this case is
  νm = Ψ̂⁻¹(Z(s) − µ̂) + X(s)β̂.
Notice that this is the same pseudo-data as in (6.77), but the right-hand side
of the pseudo-model remains a mixed model. Changing the expansion locus
from the conditional to the marginal form has several important implications:
• predictions of the random field S(s) are not required in order to compute the pseudo-data. You can hold the computation of Ŝ(s) until the overall, doubly iterative algorithm has converged. This can save substantial computing time.
• the expansion about Ŝ(s) typically produces a better approximation, but it is also somewhat more sensitive to model mis-specification.
• the interpretation of Ŝ(s), computed according to (6.81) after the algorithm has converged, is not clear.
The conditional, mixed model approach also provides another important
advantage. Our formulation of the conditioning random field S(s) was entirely
general. As in §6.2.1 we can provide additional structure. For example, let
S(s) = U(s)α, and determine U(s) as a low-rank radial smoother. Then,

Var[α] = σ²I, and the mixed model equations are particularly easy to solve. This provides a technique to spatially smooth, for example, counts and rates.

6.3.5.3 Penalized Quasi-likelihood

The pseudo-likelihood technique is by no means the only approach to fit gener-


alized linear mixed models. It is in fact only one representative of a particular
class of methods, the linearization methods. For the case of clustered (lon-
gitudinal and repeated measures) data, Schabenberger and Gregoire (1996)
described more than ten algorithms representing different assumptions and
approaches to the problem. Among the strengths of linearization methods
are their flexibility, the possibility to accommodate complex random struc-
tures and high-dimensional random effects, and to cope with correlated errors.
The derivation of the pseudo-model rests solely on the first two moments of
the data. In the marginal GEE approach, no further distributional assump-
tions were made. In the pseudo-likelihood approach we also assumed that the
pseudo-response follows a Gaussian distribution, which led to the objective
function (6.84) (in the case of REML estimation). To be perfectly truthful,
it is not necessary to make a Gaussian assumption for the pseudo-response
(6.83). The REML objective function is a reasonable objective function for the
estimation of θ, regardless of the distribution of ν. From this vantage point,
the solutions (6.80) and (6.81) are simply a GLS estimator and a best linear
unbiased predictor. We prefer, however, to think of ν as a Gaussian random
variable, and of (6.84) as minus twice its restricted log likelihood, even if it
is only a vehicle to arrive at estimates and predictions. The analysis to some
extent depends on our choice of vehicle.
The pseudo-likelihood approach is so eminently feasible, because it yields a
Gaussian linear mixed model in which the marginal distribution can be easily
obtained. It is this marginal distribution on which the optimization is based,
but because it is dependent on the expansion locus, the optimization must
be repeated; the process is doubly iterative. In other words, the linearization
provides a solution to the problem of obtaining the marginal distribution
  f(Z(s)) = ∫ f(Z(s)|S(s)) fS(S(s)) dS(s),          (6.86)

by approximating the model. The disadvantage of the approach is that you


are not working with the likelihood of Z(s), but with the likelihood of ν, the
pseudo-response.
A different class of methods approaches the problem of obtaining the mar-
ginal distribution of the random process by numerically solving (6.86). The
methods abound to accomplish the integration numerically, ranging from
quadrature methods and Laplace approximations, to Monte Carlo integration,
importance sampling, and Markov chain Monte Carlo (MCMC) methods. We
mention this class of methods here only in passing. Their applicability to spa-

tial data analysis is limited by the dimensionality of the integral in (6.86). Note
that this is an integral over the random variables in S(s), an n-dimensional
problem. A numerical integration via quadrature or other multi-dimensional
technique is computationally not feasible, unless the random effects struc-
ture is much simplified (which is the case for the radial smoothing models).
However, an integral approximation that does not require high-dimensional
integration and assumes conditional independence (R = σ²I) is an option. In
the case of longitudinal data, where the marginal covariance matrix is block-
diagonal, Breslow and Clayton (1993) applied a Laplace approximation in an
approach that they termed penalized quasi-likelihood.
Assume that you wish to compute ∫ p(τ)dτ, that τ is (m × 1), and that the function p(τ) can be written as exp{nh(τ)}. Then the Laplace approximation of the target integral (Wolfinger, 1993) is

  ∫ p(τ) dτ = ∫ exp{nh(τ)} dτ ≈ (2π/n)^(m/2) exp{nh(τ̂)} |−h′′(τ̂)|^(−1/2).

The quantity τ̂ in this approximation is not just any value; it is the value that maximizes exp{nh(τ)}, or equivalently, maximizes h(τ). The term |−h′′(τ̂)| is the determinant of the second derivative matrix, evaluated at that maximizing value.
In the case of a GLMM, we have p(τ) = f(Z(s)|S(s)) fS(S(s)), τ = [β′, S(s)′]′, and h(τ) = h(β, S(s)) = n⁻¹ log{f(Z(s)|S(s)) fS(S(s))}. Note that this approximation involves the random field S(s), rather than the covariance parameters. Breslow and Clayton (1993) use a Fisher-scoring algorithm, which amounts to replacing −h′′(τ̂) with the expected value of

  −(∂²/∂τ∂τ′) log{f(Z(s)|S(s)) fS(S(s))},

which turns out to be

  ⎡ X(s)′Σµ̂⁻¹X(s)   X(s)′Σµ̂⁻¹    ⎤
  ⎣ Σµ̂⁻¹X(s)        Σµ̂⁻¹ + ΣS⁻¹  ⎦ .

You will recognize this matrix as one component of the mixed model equations
(6.85). From the first order conditions of the problem we find the values that
maximize h(β, S(s)) as
  X(s)′ΨΣµ̂⁻¹(Z(s) − g⁻¹(X(s)β + S(s))) = 0
  ΨΣµ̂⁻¹(Z(s) − g⁻¹(X(s)β + S(s))) − ΣS⁻¹S(s) = 0.
The solutions for β and S(s) are (6.80) and (6.81). Substituting into the
Laplace approximation yields the objective function that is maximized to ob-
tain the estimates of the covariance parameters. One can show that this ob-
jective function differs from the REML log likelihood in the pseudo-likelihood
method only by a constant amount. The two approaches will thus yield the
same estimates.

6.3.5.4 Inference and Diagnostics

The marginal as well as the conditional formulation of the generalized linear


model for spatial data arrive at a doubly iterative fitting algorithm. After
producing the initial pseudo-data, the fit alternates between estimation of the
covariance parameters and the estimation of the fixed effects. The final esti-
mate of β takes on the form of an estimated generalized least squares estimate
in either case. Thus, hypothesis tests, confidence intervals, and other inferen-
tial procedures for β can be based on the methods in §6.2.3. Essentially, this
applies the formulas for a general linear model with correlated errors to the
last linearized model that was fit. In other words, this produces inferences on
the linear scale. In a model for Binomial rates with a logit link, the infer-
ences apply to the logit scale. This is generally meaningful, since the model
states that covariate effects are additive (linear) on the linked scale. In order
to produce estimates of the mean, associated standard errors, and confidence
intervals, however, the predictions on the linearized scale can be transformed.
Standard errors are derived by the delta method.

Example 6.7 Assume that a model has been fit with a log link, E[Y] = µ = exp{η} = exp{x′β}, and we have at our disposal the estimates β̂ as well as an estimate of their variability, V̂ar[β̂]. Then the plug-in estimate of µ is µ̂ = exp{η̂} and the variance of the linear predictor is Var[η̂] = x′Var[β̂]x. To obtain an approximate variance of µ̂, expand it in a first-order Taylor series about η,

  µ̂ ≈ µ + (∂µ/∂η)|η (η̂ − η).

In our special case,

  µ̂ = exp{η} + exp{η}(η̂ − η),

so that Var[µ̂] ≈ exp{2η}Var[η̂]. The estimate of this approximate variance is then

  V̂ar[µ̂] = exp{2η̂}V̂ar[η̂].
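A short sketch of this delta-method computation, together with the back-transformed confidence limits discussed next, for the log link (the arguments x, beta_hat, and cov_beta are hypothetical inputs from a fitted model):

import numpy as np

def mean_and_se_log_link(x, beta_hat, cov_beta):
    # plug-in estimate of mu = exp(x'beta) and its delta-method standard error
    eta_hat = x @ beta_hat
    var_eta = x @ cov_beta @ x                  # Var[eta_hat] = x' Var[beta_hat] x
    mu_hat = np.exp(eta_hat)
    se_mu = np.sqrt(np.exp(2.0 * eta_hat) * var_eta)
    # confidence limits built on the linear scale and back-transformed
    lo = eta_hat - 1.96 * np.sqrt(var_eta)
    hi = eta_hat + 1.96 * np.sqrt(var_eta)
    return mu_hat, se_mu, (np.exp(lo), np.exp(hi))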

Since link functions are monotonic, confidence limits for η̂ = x(s)′β̂ can be transformed to confidence limits for µ(s) by using the upper and lower limits as arguments of the inverse link function g⁻¹(·).
Confirmatory statistical inference about the covariance parameters θ in
generalized linear mixed models is more difficult compared to their linear
mixed model counterparts. Likelihood-ratio or restricted likelihood-ratio tests
are not immediately available in methods based on linearization. The obvious
reason is that adding or removing columns of the X(s) matrix or changing the
structure of S(s) changes the linearization. When the linearization changes,
so does the pseudo-data, and the pseudo objective functions such as (6.84) are
not comparable. Information based criteria such as AIC or AICC also should

not be used for model comparisons, unless the linearizations are the same;
this holds for ML as well as REML estimation. That two models are nested with respect to their large-scale trend structure (X(s)2 is a subset of X(s)1) and that ML pseudo-likelihood estimation is performed does not change this. An exception to this rule, where models can be compared based on their
pseudo objective functions, occurs when their large-scale trend structures are
the same, their formulation of S(s) is the same, and they are nested with
respect to the R(θ) structure.

6.3.6 Spatial Prediction in GLMs

The marginal as well as the conditional generalized spatial models have a


pseudo-data formulation that corresponds to either a linear model with corre-
lated errors or a linear mixed model. The goal of prediction at unobserved loca-
tions is the process Z(s) in the marginal formulation or g⁻¹(x′(s0)β + S(s0))
in the conditional formulation (filtering). In either case, these are predic-
tions on the scale of the data, not the scale of the pseudo-data. It is illus-
trative, however, to approach the prediction problem in spatial generalized
linear models from the pseudo-data. Consider a marginal formulation. Since
the model for the pseudo-data (6.77) is linear with mean X(s)β and variance
Σµ = Ψ⁻¹Σ(µ, θ)Ψ⁻¹, we can apply universal kriging and obtain the UK predictor for the pseudo-data as

  p(ν; ν(s0)) = x′(s0)β̂gls + σ′Σµ⁻¹(ν − X(s)β̂gls),

where σ is the covariance vector between the pseudo-data for a new obser-
vation and the “observed” vector ν. Plug-in estimation replaces the GLS
estimates with EGLS estimates and evaluates σ and Σµ at the estimated
covariance parameters. The mean-squared prediction error E[(p(ν; ν(s0)) − ν(s0))²] = σν²(s0) is computed as in §5.3.3, equation (5.30). To convert this prediction into one for the original data, you can apply the inverse link function,

  Ẑ(s0) = g⁻¹(p(ν; ν(s0)))          (6.87)

and apply the delta method to obtain a measure of prediction error,

  [∂g⁻¹(p(ν; ν(s0)))/∂p(ν; ν(s0))]² σν²(s0) = [∂µ/∂η|p]² σν²(s0).          (6.88)

However, expression (6.88) is not the mean-squared prediction error of the


inverse linked predictor (6.87). It is the prediction error of a different predictor of the original data, which we derive as follows (see Gotway and Wolfinger, 2003).
To predict the original data (and not the pseudo-data), assume that ν(s₀) and
the new observation to be predicted, Z(s₀), and their associated predictors,
ν̂(s₀) and Ẑ(s₀), also satisfy equation (6.78), so that
$$\widehat{\nu}(\mathbf{s}_0) = g(\widehat{\mu}(\mathbf{s}_0)) + g'(\widehat{\mu}(\mathbf{s}_0))\left(\widehat{Z}(\mathbf{s}_0) - \widehat{\mu}(\mathbf{s}_0)\right), \qquad (6.89)$$
where g′(µ̂(s₀)) denotes the derivative of g(µ) = η with respect to µ, evaluated
at µ̂(s₀). Note that this derivative corresponds to the reciprocal diagonal
elements of the matrix Ψ defined earlier. Solving equation (6.89) for Ẑ(s₀)
yields the predictor
$$\widehat{Z}(\mathbf{s}_0) = \widehat{\mu}(\mathbf{s}_0) + \left(g'(\widehat{\mu}(\mathbf{s}_0))\right)^{-1}\left(\widehat{\nu}(\mathbf{s}_0) - g(\widehat{\mu}(\mathbf{s}_0))\right). \qquad (6.90)$$

The mean-squared prediction error associated with this predictor is
E[(Ẑ(s₀) − Z(s₀))²], which can be obtained from the mean-squared prediction
error of ν̂₀ obtained from universal kriging by noting that
$$\widehat{Z}(\mathbf{s}_0) - Z(\mathbf{s}_0) = \left(g'(\widehat{\mu}(\mathbf{s}_0))\right)^{-1}(\widehat{\nu}_0 - \nu_0),$$
and then
$$\sigma^2_Z(\mathbf{s}_0) = \mathrm{E}[(\widehat{Z}(\mathbf{s}_0) - Z(\mathbf{s}_0))^2]
= \left[\left(g'(\widehat{\mu}(\mathbf{s}_0))\right)^{-1}\right]^2 \mathrm{E}[(\widehat{\nu}_0 - \nu_0)^2]
= \left[\left(g'(\widehat{\mu}(\mathbf{s}_0))\right)^{-1}\right]^2 \sigma^2_{\nu}(\mathbf{s}_0)
= \left(\frac{\partial \mu}{\partial \eta}\Big|_{\widehat{\mu}(\mathbf{s}_0)}\right)^2 \sigma^2_{\nu}(\mathbf{s}_0). \qquad (6.91)$$

The root mean-squared prediction error, $\sqrt{\sigma^2_Z(\mathbf{s}_0)}$, is usually reported as a
prediction standard error.
In the case of a marginal GLMM where the mean function is defined as in
equation (6.69), the predictor based on pseudo-likelihood estimation (equation
6.90) is very similar in nature to that proposed by Gotway and Stroup (1997)
without the explicit linearization of pseudo data. Similarly, the mean-squared
prediction errors are comparable to those derived by Vijapurkar and Gotway
(2001) by expanding the nonlinear mean function instead of the link function;
see also Vallant (1985).
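As a small numerical illustration of the back-transformation in (6.90) and (6.91), the following Python sketch assumes a log link and made-up values for the universal-kriging prediction of the pseudo-datum, its prediction variance, and the current estimate of µ(s₀); it is not taken from the example that follows.

import numpy as np

# Hypothetical quantities from a universal-kriging fit to the pseudo-data (log link)
nu_hat = 1.80                 # UK prediction of the pseudo-datum at s0
var_nu = 0.09                 # its mean-squared prediction error, sigma^2_nu(s0)
mu_hat = np.exp(1.75)         # estimate of mu(s0) = exp{x'(s0) beta_hat} (assumed)

# For the log link, g(mu) = log(mu) and g'(mu) = 1/mu, so (g'(mu))^{-1} = mu
g_prime_inv = mu_hat

# Equation (6.90): predictor on the data scale
z_hat = mu_hat + g_prime_inv * (nu_hat - np.log(mu_hat))

# Equation (6.91): mean-squared prediction error on the data scale
mse_z = g_prime_inv ** 2 * var_nu
print(z_hat, np.sqrt(mse_z))   # prediction and its prediction standard error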

Example 1.4 (Blood lead levels in children, Virginia 2000. Continued)
To illustrate the use of GLMs and GLMMs in the analysis of spatial
data, we again consider Example 1.4, first introduced on page 10. The out-
come data of interest are the percent of children under 6 years of age with
elevated blood lead levels for each Virginia county in 2000 (shown in Figure
1.5, page 10). As discussed in Example 1.4, the primary source of elevated
blood lead levels in children is dust from lead-based paint in older homes in
impoverished areas. Thus, if data on home age and household income (and
perhaps other variables as well) were available, we could use these variables
to explain (at least some of) the variation in the percent of children with ele-
vated blood lead levels. However, the only data available to us are aggregated
to the county level. Since we do not have data on individual homes, we will use
instead the median housing value per county as a surrogate for housing age
and maintenance quality (Figure 1.6). We also do not have data on whether
or not an individual child in the study was living in poverty in 2000. Thus, we
obtained from the U.S. Census Bureau the number of children in each county
under 17 years of age living in poverty in 2000, and will use this variable as a
surrogate measure of impoverishment (Figure 6.8).

Figure 6.8 Number of children under 17 years of age living in poverty in Virginia
in 2000 (map legend classes: 80–345, 346–705, 706–1045, 1046–1838, 1839–15573).
Source: U.S. Census Bureau.

Thus, in terms of the building blocks of a statistical model we have the following:
Z(si ) ≡ Zi = the number of children under 6 years of age with elevated
blood lead levels in county i, i = 1, · · · , N = 133;
n(si ) ≡ ni = the number of children under 6 years of age tested for elevated
blood lead levels in county i;
p(si ) ≡ pi = Zi /ni × 100 the percent of children under 6 years of age with
elevated blood lead levels in county i;
x1 (si ) ≡ x1i = the median housing value in $100,000 (i.e., the actual value
from the Census, divided by 100,000);
x2 (si ) ≡ x2i = the number of children under 17 years of age living in
poverty in 2000, per 100,000 children at risk (i.e., the actual value from the
Census, divided by 100,000).
Since the outcome data are percentages, it seems reasonable to fit a gen-
eralized linear regression model to these data and we could use either of the
following models:
• A Poisson distribution with Z(sᵢ) ∼ Poisson(µ(sᵢ) ≡ n(sᵢ)λ(sᵢ)),
$$\log\{\lambda(\mathbf{s}_i)\} = \beta_0 + \beta_1 x_1(\mathbf{s}_i) + \beta_2 x_2(\mathbf{s}_i)$$
$$g(\mu(\mathbf{s}_i)) = \log\{\mu(\mathbf{s}_i)\} = \log\{n_i\} + \beta_0 + \beta_1 x_1(\mathbf{s}_i) + \beta_2 x_2(\mathbf{s}_i)$$
$$v(\mu(\mathbf{s}_i)) = \mu(\mathbf{s}_i),$$
so that
$$\mathrm{E}[p(\mathbf{s}_i)] = \lambda(\mathbf{s}_i), \qquad \mathrm{Var}[p(\mathbf{s}_i)] = \lambda(\mathbf{s}_i)/n(\mathbf{s}_i);$$

• A Binomial distribution with Z(sᵢ) ∼ Binomial(n(sᵢ), µ(sᵢ)), so that
$$\mathrm{E}[p(\mathbf{s}_i)] = \mu(\mathbf{s}_i)$$
$$g(\mu(\mathbf{s}_i)) = \log\left\{\frac{\mu(\mathbf{s}_i)}{1 - \mu(\mathbf{s}_i)}\right\} = \beta_0 + \beta_1 x_1(\mathbf{s}_i) + \beta_2 x_2(\mathbf{s}_i)$$
$$v(\mu(\mathbf{s}_i)) = n(\mathbf{s}_i)\mu(\mathbf{s}_i)(1 - \mu(\mathbf{s}_i)).$$

To accommodate possible overdispersion of the data—at this stage of model-


ing—we add a multiplicative scale parameter to the variance functions, so that
v(µ(si )) = σ 2 µ(si ) in the Poisson case and v(µ(si )) = σ 2 n(si )µ(si )(1 − µ(si ))
in the Binomial case. The results from REML fitting of these two models are
shown in Table 6.3.

Table 6.3 Results from Poisson and Binomial regressions with overdispersion.
Poisson-Based Regression

Effect Estimate Std. Error. t-value p-value


Intercept (β0 ) −2.2781 0.1606 −14.18 < 0.0001
Median Value (β1 ) −0.6228 0.2120 −2.94 0.0039
Poverty (β2 ) 0.6293 1.4068 0.45 0.6554
σ̂ 2 6.7825 0.8413

Binomial-Based Regression

Effect Estimate Std. Error. t-value p-value


Intercept (β0 ) −2.1850 0.1712 −12.76 < 0.0001
Median Value (β1 ) −0.6552 0.2241 −2.92 0.0041
Poverty (β2 ) 0.6566 1.5086 0.44 0.6641
σ̂ 2 7.2858 0.9037
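As a rough, purely illustrative sketch of how overdispersed count regressions of this kind can be fit, the following Python code (statsmodels, with simulated placeholder data rather than the Virginia data) fits a quasi-Poisson model with a log(n) offset and a Binomial counterpart with a multiplicative scale parameter. This is not the REML/pseudo-likelihood fit that produced Table 6.3, only an aspatial analogue.

import numpy as np
import statsmodels.api as sm

# Hypothetical county-level data: tested children n, covariates x1, x2, and cases z
rng = np.random.default_rng(1)
n = rng.integers(50, 2000, size=133)
x1 = rng.uniform(0.5, 2.5, size=133)           # median housing value in $100,000 (simulated)
x2 = rng.uniform(0.0, 0.16, size=133)          # children in poverty per 100,000 at risk (simulated)
z = rng.poisson(n * np.exp(-2.3 - 0.6 * x1 + 0.6 * x2))

X = sm.add_constant(np.column_stack([x1, x2]))

# Poisson model with log(n) offset; scale='X2' estimates the multiplicative
# overdispersion parameter from the Pearson chi-square (quasi-likelihood).
pois = sm.GLM(z, X, family=sm.families.Poisson(), offset=np.log(n)).fit(scale='X2')

# Binomial model for (successes, failures) with the same overdispersion adjustment.
binom = sm.GLM(np.column_stack([z, n - z]), X,
               family=sm.families.Binomial()).fit(scale='X2')

print(pois.params, pois.scale)
print(binom.params, binom.scale)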

There is very little difference in the results obtained from the two models.
There is also very little to guide us in choosing between them. Comparing the
value of the AIC criterion from both models may be misleading since the mod-
els are based on two different distributions whose likelihoods differ. Moreover,
GLMs are essentially weighted regressions, and when different distributions
and different variance functions are used, the weights are different as well.
Thus, comparing traditional goodness of fit measures for these two models
(and for the other GLMs and GLMMs considered below) may not be valid.
Instead, we rely on some general observations to help us choose between these


two models.
Typically, when there is overdispersion in the data, the Poisson-based model
often appears to fit better than one based on the Binomial distribution, since
the Poisson distribution has more variability relative to the Binomial. As
a consequence, it appears to have less overdispersion. To see this, suppose
Z|n ∼ Binomial(n, π), and n ∼ Poisson(λ). Then, Z is a Poisson random
variable with mean λπ and Var[Z] > E[n]π(1 − π). Also, when π is small, and
nπ → λ, the Binomial distribution converges to a Poisson distribution. For
the Virginia blood lead level data, Z̄ = 12.63 and s²_Z = 1378.64, p̄ = 7.83 and
s²_p = 110.25, reflecting a great deal of overdispersion in the outcome data and
also indicating that perhaps elevated blood lead levels in children are unusual
(and hopefully so). These observations might lead us toward choosing the
Poisson-based model over the Binomial-based model and, for the rest of this
example, we will adopt the Poisson-based model. Repeating the analysis using
the Binomial-based model is left as Exercise 6.17.
One key feature of the results in Table 6.3 is the large amount of overdisper-
sion in the data (measured by σ̂² = 6.7825). This may just be indicating that
the data do not follow a Poisson (or a Binomial) distribution, but this varia-
tion could also be spatial. Thus, if we make the model more spatially explicit,
e.g., by using spatial GLMs and GLMMs, we might be able to account for
some of the overdispersion and obtain a better fit. To determine whether the
data are spatially autocorrelated and an adjustment of the covariance struc-
ture of the is necessary, we follow an analogous procedure to that used for the
C/N ratio data (continuation of Example 6.1 on page 311). Since the current
situation is more complicated—as we are using generalized linear, and not lin-
ear models—we use traditional residuals from the Poisson-based GLM, rather
than recovered errors. Thus, we compute the empirical semivariogram of the
standardized Pearson residuals based on Euclidean distances between county
centroids, measured in kilometers after converting the centroids to Virginia
state plane coordinates. The standardized Pearson residuals are
$$r_i = \frac{Z_i - \widehat{\mu}_i}{\sqrt{\widehat{\sigma}^2 v(\widehat{\mu}_i)(1 - l_{ii})}},$$
where $l_{ii}$ is the $i$th diagonal element of
$$\widehat{\boldsymbol{\Delta}}^{1/2}\mathbf{X}(\mathbf{s})\left(\mathbf{X}(\mathbf{s})'\widehat{\boldsymbol{\Delta}}\mathbf{X}(\mathbf{s})\right)^{-1}\mathbf{X}(\mathbf{s})'\widehat{\boldsymbol{\Delta}}^{1/2},$$
and $\widehat{\boldsymbol{\Delta}}$ is a diagonal matrix with $i$th element $(\widehat{\sigma}^2 v(\widehat{\mu}_i))^{-1}(g'(\widehat{\mu}_i))^{-2}$.
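The following Python sketch shows one way these residuals and their empirical semivariogram could be computed. The function names, the number of lag bins, and the maximum distance are hypothetical choices, not quantities taken from the analysis in the text.

import numpy as np

def std_pearson_residuals(z, mu_hat, sigma2_hat, v_mu, g_prime_mu, X):
    """Standardized Pearson residuals r_i = (z_i - mu_i)/sqrt(sigma2 v(mu_i)(1 - l_ii))."""
    # Delta_hat has diagonal elements (sigma2 v(mu_i))^{-1} (g'(mu_i))^{-2}
    delta = 1.0 / (sigma2_hat * v_mu * g_prime_mu ** 2)
    W = np.sqrt(delta)[:, None] * X                       # Delta^{1/2} X
    H = W @ np.linalg.solve(X.T @ (delta[:, None] * X), W.T)
    l_ii = np.diag(H)                                     # leverages
    return (z - mu_hat) / np.sqrt(sigma2_hat * v_mu * (1.0 - l_ii))

def empirical_semivariogram(coords, r, n_bins=17, max_dist=300.0):
    """Binned semivariance 0.5 * mean[(r_i - r_j)^2] of residuals by centroid distance."""
    d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
    i, j = np.triu_indices(len(r), k=1)
    dist, sq = d[i, j], 0.5 * (r[i] - r[j]) ** 2
    edges = np.linspace(0.0, max_dist, n_bins + 1)
    which = np.digitize(dist, edges) - 1
    keep = (which >= 0) & (which < n_bins)
    gamma = np.array([sq[keep][which[keep] == b].mean() if (which[keep] == b).any() else np.nan
                      for b in range(n_bins)])
    npairs = np.array([(which[keep] == b).sum() for b in range(n_bins)])
    return 0.5 * (edges[:-1] + edges[1:]), gamma, npairs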
The empirical semivariogram based on these residuals is shown in Figure
6.9. We use it only as a rough guide to help indicate whether or not we need
a model with a more complicated variance structure and, if so, to suggest
a possible parametric form. This semivariogram is probably not definitive of
the spatial variability in our data. For example, it does not lend itself to
an interpretation of the spatial dependency in a conditional model, where
the covariance structure of a latent spatial random field is modeled on the
logarithmic (linked) scale. The empirical semivariogram in Figure 6.9 suggests


that there might be some spatial autocorrelation in the standardized Pearson
residuals and that perhaps a spherical model might be an adequate parametric
representation.

Figure 6.9 Empirical semivariogram of standardized Pearson residuals (semivariance
plotted against distance, 0–300 km). The numbers of pairs in the lag classes, from
shortest to longest lag, are 257, 328, 372, 442, 477, 449, 512, 490, 445, 502, 441, 443,
408, 399, 336, 306, and 262.

across top indicate number of pairs in lag class.

To include spatial autocorrelation, both a conditional spatial GLMM and


a marginal GLM can be used. To investigate how well these models explain
variation in the percentages of elevated blood lead levels we fit both types
of models as well as a random effects model without spatial autocorrelation.
Specifically, we considered the following models, with g(µ(s)) = log{µ(s)} in
all cases:
• Random Effects Model:
$$Z(\mathbf{s}_i)|S(\mathbf{s}_i) \overset{ind}{\sim} \text{Poisson}(\mu(\mathbf{s}_i) \equiv n(\mathbf{s}_i)\lambda(\mathbf{s}_i)),$$
$$\log\{\lambda(\mathbf{s}_i)\} = \beta_0 + \beta_1 x_1(\mathbf{s}_i) + \beta_2 x_2(\mathbf{s}_i) + S(\mathbf{s}_i),$$
$$\mathrm{Var}[\mathbf{Z}(\mathbf{s})|\mathbf{S}(\mathbf{s})] = \sigma^2\mathbf{V}_{\mu}, \qquad \mathbf{S}(\mathbf{s}) \sim G(\mathbf{0}, \sigma^2_S\mathbf{I});$$
• Conditional Spatial GLMM:
$$Z(\mathbf{s}_i)|S(\mathbf{s}_i) \overset{ind}{\sim} \text{Poisson}(\mu(\mathbf{s}_i) \equiv n(\mathbf{s}_i)\lambda(\mathbf{s}_i)),$$
$$\log\{\lambda(\mathbf{s}_i)\} = \beta_0 + \beta_1 x_1(\mathbf{s}_i) + \beta_2 x_2(\mathbf{s}_i) + S(\mathbf{s}_i),$$
$$\mathrm{Var}[\mathbf{Z}(\mathbf{s})|\mathbf{S}(\mathbf{s})] = \sigma^2\mathbf{V}_{\mu}, \qquad \mathbf{S}(\mathbf{s}) \sim G(\mathbf{0}, \sigma^2_S\mathbf{R}_S(\alpha_s));$$
• Marginal Spatial GLM:
$$\mathrm{E}[\mathbf{Z}(\mathbf{s})] = \boldsymbol{\mu}(\mathbf{s}) \equiv \mathbf{n}(\mathbf{s})\lambda(\mathbf{s}),$$
$$\log\{\lambda(\mathbf{s})\} = \beta_0 + \beta_1 x_1(\mathbf{s}) + \beta_2 x_2(\mathbf{s}),$$
$$\mathrm{Var}[\mathbf{Z}(\mathbf{s})] = \sigma^2_0\mathbf{V}_{\mu} + \sigma^2_1\mathbf{V}_{\mu}^{1/2}\mathbf{R}_S(\alpha_m)\mathbf{V}_{\mu}^{1/2}.$$
For the spatial models, spatial autocorrelation was modeled using the spherical
model defined in (4.13). The parameters of the models were estimated using
the pseudo-likelihood estimation described in §6.3.5.2. The results are shown
in Table 6.4.
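Before turning to those results, the following Python sketch shows one way the marginal-model covariance matrix Var[Z(s)] with a spherical correlation could be assembled; the centroid coordinates, means, and parameter values are hypothetical, and the function names are ours.

import numpy as np

def spherical_corr(D, alpha):
    """Spherical correlation: 1 - 1.5(h/alpha) + 0.5(h/alpha)^3 for h <= alpha, else 0."""
    h = np.minimum(D / alpha, 1.0)
    return 1.0 - 1.5 * h + 0.5 * h ** 3

def marginal_glm_cov(mu, sigma0_sq, sigma1_sq, alpha, D):
    """Var[Z(s)] = sigma0^2 V_mu + sigma1^2 V_mu^{1/2} R(alpha) V_mu^{1/2}, Poisson variance V_mu = diag(mu)."""
    Vmu = np.diag(mu)
    Vhalf = np.diag(np.sqrt(mu))
    return sigma0_sq * Vmu + sigma1_sq * Vhalf @ spherical_corr(D, alpha) @ Vhalf

# Hypothetical use with made-up centroids and means
coords = np.random.default_rng(0).uniform(0, 300, size=(10, 2))
D = np.sqrt(((coords[:, None] - coords[None, :]) ** 2).sum(-1))
Sigma = marginal_glm_cov(mu=np.full(10, 5.0), sigma0_sq=6.1, sigma1_sq=0.96, alpha=187.0, D=D)
print(np.all(np.linalg.eigvalsh(Sigma) > 0))   # check positive definiteness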
From Table 6.4, some tentative conclusions emerge:

• The results from the random effects model indicate that the relationship
between median housing value and the percentage of children with elevated
blood lead levels is not significant. They also show an inverse relationship
between the percentage of children with elevated blood lead levels and
poverty, although the relationship is not significant. The results from all
other models indicate just the opposite: a significant association with me-
dian housing value and a nonsignificant poverty coefficient that has a pos-
itive sign. Thus, some of the spatial variation in the data arising from spatial
autocorrelation (incorporated through R(α)) is perhaps being attributed
to median housing value and poverty in the random effects model. In the
conditionally-specified model, choices about R(α) affect S(s), which in turn
can affect the estimates of the fixed effects; this just reflects the fact that
the general decomposition of “data=f(fixed effects, random effects, error)”
is not unique. The random effects model does not allow any spatial struc-
ture in the random effects or the errors. The conditional and marginal
spatial models do allow spatial structure in these components, with the
conditional model assigning spatial structure to the random effects, and
the marginal model assigning the same sort of structure to “error.” Thus,
while the conditional and marginal spatial GLMs accommodate variation
in different ways, incorporating spatially-structured variation in spatial re-
gression models can be important.
• The marginal spatial model and the traditional Poisson-based model (Table
6.3) give similar results. However, the standard errors from the margi-
nal spatial GLM are noticeably higher than for the traditional Poisson-
based model. Thus, with the “data=f(fixed effects, error)” decomposition in
marginal models, the marginal spatial GLMs incorporate spatial structure
through the “error” component.
• The majority of the models suggest that median housing value is significantly
associated with the percentage of children with elevated blood lead levels, and
that counties with higher median housing values tend to have a lower percentage
of children with elevated blood lead levels. Also, there appears to be a positive
relationship between the percentage of children with elevated blood lead levels
and poverty, although the relationship is not significant (and may not be well
estimated).

Table 6.4 Results from spatial GLMs and GLMMs.

Random Effects Model

Effect               Estimate   Std. Error   t-value   p-value
Intercept (β0)       −2.6112    0.1790       −14.59    < 0.0001
Median Value (β1)    −0.2672    0.2239       −1.19     0.2349
Poverty (β2)         −0.9688    2.900        −0.33     0.7389
σ̂²S                  0.5047     0.1478
σ̂²                   1.1052     0.3858

Conditional Spatial GLMM

Effect               Estimate   Std. Error   t-value   p-value
Intercept (β0)       −2.4246    0.3057       −7.93     < 0.0001
Median Value (β1)    −0.7835    0.3610       −2.17     0.0318
Poverty (β2)         3.5216     2.5933       1.36      0.1768
σ̂²S                  0.8806     0.1949
σ̂²                   0.7010     0.1938
α̂s (km)              78.81      6.51

Marginal Spatial GLM

Effect               Estimate   Std. Error   t-value   p-value
Intercept (β0)       −2.2135    0.2010       −11.01    < 0.0001
Median Value (β1)    −0.8096    0.2713       −2.98     0.0034
Poverty (β2)         1.3172     1.5736       0.84      0.4041
σ̂²0                  6.1374     0.8846
σ̂²1                  0.9633     0.7815
α̂m (km)              186.65     70.04

The spatial GLM and GLMM considered thus far in this example use a
geostatistical approach to model autocorrelation in the data. However, the
regional nature of the data suggests that using a different measure of spa-
tial proximity might be warranted. For linear regression models, the spatial
autoregressive models described in §6.2.2 can incorporate such measures of
spatial proximity. However, the SAR models described in §6.2.2.1 are only de-
fined for multivariate Gaussian data and extending the CAR models described
in §6.2.2.2 is fraught with problems. Besag (1974) and Cressie (1993, Section
6.4) discuss the conditions necessary to construct a joint likelihood from a con-
ditional specification. Central to these conditions is the Hammersley-Clifford
theorem (Besag, 1974) that derives the general form of the joint likelihood.
This theorem shows that the joint likelihood allows interaction among the
data only for data observed at locations that form a clique, a set of locations
that are all neighbors of each other. The practical impact of this occurs when
we want to go the other way and specify a neighborhood structure together
with a set of conditional distributions. For many of the distributions in the
exponential family guaranteeing a valid joint likelihood results in excessive
restrictions on the neighborhood structure (cf., §5.8). For example, even as-
suming just pairwise dependence among locations, so that no clique contains
more than two locations, a conditional Poisson model permits only negative
spatial dependence.
However, adopting a modeling approach allows us to circumvent some of
these difficulties. For example, in a conditionally-specified spatial model, we
could describe the Gaussian process S(s) using either a SAR or a CAR
model. This basically amounts to different choices for ΣS . For example, in
the simplest one-parameter case, we could choose ΣS = σ 2 (I − ρW)−1 for
a CAR model and ΣS = σ 2 (I − ρW)−1 (I − ρW! )−1 for a SAR model,
where W is a specified spatial proximity matrix and ρ is the spatial depen-
dence parameter. With a marginal GLM—since it makes only assumptions
about the first two moments—we could model σ 2 R as either σ 2 (I − ρW)−1
corresponding to the variance-covariance matrix from a CAR model, or as
σ 2 (I − ρW)−1 (I − ρW! )−1 corresponding to the variance-covariance matrix
from a SAR model. While this strategy does not model the data using a CAR
or SAR model, it does allow us to consider more general parametric models
for Var[Z(s)] that may be more desirable when working with regional data.
Moreover, we are not limited to the forms arising from CAR and SAR models;
we can create any parametric representation as long as the resulting matrices
are positive definite and the parameters are estimable.
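A minimal Python sketch of the two covariance constructions just described is given below; the adjacency matrix, ρ, and σ² are hypothetical, and the positive-definiteness check simply verifies that the chosen ρ yields a valid covariance.

import numpy as np

def car_cov(W, rho, sigma2):
    """Sigma_S = sigma^2 (I - rho W)^{-1}, the CAR-type covariance mentioned in the text."""
    n = W.shape[0]
    return sigma2 * np.linalg.inv(np.eye(n) - rho * W)

def sar_cov(W, rho, sigma2):
    """Sigma_S = sigma^2 (I - rho W)^{-1} (I - rho W')^{-1}, the SAR-type covariance."""
    n = W.shape[0]
    A = np.linalg.inv(np.eye(n) - rho * W)
    return sigma2 * A @ A.T

# Hypothetical binary connectivity matrix for five regions arranged on a line
W = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
rho = 0.9 / np.linalg.eigvals(W).real.max()     # keep I - rho W nonsingular and PD
print(np.all(np.linalg.eigvalsh(car_cov(W, rho, 1.0)) > 0))
print(np.all(np.linalg.eigvalsh(sar_cov(W, rho, 1.0)) > 0))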
To illustrate this idea, we used the binary connectivity weights defined in
equation (1.6) of §1.3.2 to measure spatial similarity in the percent of children
under 6 years of age with elevated blood lead levels for each Virginia county.
Using this measure of spatial proximity, the value of Moran’s I computed from
the standardized Pearson residuals obtained from the traditional Poisson-
based model is I = 0.0884, (the eigenvalues of W indicate this value can
range from −0.3275 to 0.1661) with a z-score of z = 1.65. Thus, there may
be some evidence of spatial autocorrelation in the residuals, although the
assumptions underlying any test, and even the definition and interpretation
of Moran’s I computed from standardized Pearson residuals, are extremely
questionable. Thus, we considered instead the following marginal spatial GLM
using g(µ(s)) = log{µ(s)}:


$$\mathrm{E}[\mathbf{Z}(\mathbf{s})] = \boldsymbol{\mu}(\mathbf{s}) \equiv \mathbf{n}(\mathbf{s})\lambda(\mathbf{s}),$$
$$\log\{\lambda(\mathbf{s})\} = \beta_0 + \beta_1 x_1(\mathbf{s}) + \beta_2 x_2(\mathbf{s}),$$
$$\mathrm{Var}[\mathbf{Z}(\mathbf{s})] = \sigma^2_0\mathbf{V}_{\mu} + \sigma^2_1\mathbf{V}_{\mu}^{1/2}(\mathbf{I} - \rho\mathbf{W})^{-1}\mathbf{V}_{\mu}^{1/2},$$
where W is a spatial proximity matrix constructed from the binary connec-
tivity weights. The results are shown in Table 6.5.

Table 6.5 Results from a marginal spatial GLM with a CAR autocorrelation struc-
ture.
Effect Estimate Std. Error t-value p-value
Intercept (β0 ) −2.1472 0.1796 −11.95 < 0.00001
Median Value (β1 ) −0.7532 0.2471 −3.05 0.0014
Poverty (β2 ) 0.5856 1.4417 0.40 0.6574
σ̂02 0.0000
σ̂12 6.2497
ρ̂ 0.0992

These results are quite similar to those from the marginal spatial GLM
given in Table 6.4, although it is difficult to generalize this conclusion. Often,
adjusting for spatial autocorrelation is more important than the parametric
models used to do the adjustment, although in some applications, different
models for the spatial autocorrelation can lead to different results and con-
clusions.


Figure 6.10 Predictions from marginal spatial GLM vs. predictions from conditional
spatial GLMM.

The fitted mean from the regression, µ̂, can be interpreted as the predicted
percentage of children with elevated blood lead levels, adjusted for median
housing value and poverty. For the models with random effects, this mean
is g⁻¹(X(s)β̂ + Ŝ(s)), where Ŝ(s) is obtained from (6.81). For the marginal
spatial models, this mean should be based on prediction of Z(sᵢ) at the data
locations, using the methods described in §6.3.6 (cf. §5.4.3).
Figure 6.10 compares the adjusted percentages for the conditional spatial
GLMM and the marginal spatial GLM with a geostatistical variance structure.
There is a strong relationship between them, although there is more smoothing
in the marginal model. This is dramatically evident from the adjusted maps
shown in Figure 6.12.
The maps of the adjusted percentages obtained from the conditional spatial
GLMM are more similar to the raw rates shown in Figure 1.5, since this model
tends to shrink the percentages to local means (through S(s)) rather than to
a global mean as in the marginal models. Figure 6.11 shows the adjusted
percentages from the two marginal spatial GLMs, one with a geostatistical
variance, and one with the CAR variance. The adjusted percentages are more
different than the maps in Figure 6.12 reflect. The model with the CAR vari-
ance smooths the data much more than the model with the geostatistical
variance. This may result from our choice of adjacency weights since with
these weights adjacent counties are assumed to be spatially similar and so the
contribution from σ02 which mitigates local spatial similarity is reduced.

Figure 6.11 Predictions from marginal GLM using geostatistical variance structure
vs. predictions from marginal GLM using CAR variance structure.
Figure 6.12 Predictions from regression fits: maps of the predicted percentage of
children with elevated blood lead levels (classes < 2.62, 2.62–5.23, 5.24–10.46,
10.47–20.93, > 20.93) for the traditional Poisson-based model, the conditional spatial
GLMM, the marginal spatial GLM with geostatistical variance, and the marginal
spatial GLM with CAR variance.



Both the conditional spatial GLMM and the marginal spatial GLM indicate
three areas where blood lead levels in children remain high, after adjusting for
median housing value and poverty level: one, comprised of Frederick and War-
ren counties, in the north; another in south-central Virginia near Farmville;
and another in the northeast in the Rappahannock river basin. To investigate
these further, a more thorough analysis that includes other potentially impor-
tant covariates (e.g., other demographic variables, more refined measures of
known lead exposures), more local epidemiological and environmental infor-
mation, and subject-matter experts, is needed.
Moreover, these analyses all assume that the relationship between the per-
centage of children with elevated blood lead level and median housing value
and poverty level is the same across all of the Virginia. We can relax this
assumption by considering a Poisson version of geographically weighted re-
gression (§6.1.3.1, Fotheringham et al., 2002). Instead of assuming a linear
relationship between the data and the covariates as in (6.9), we assume
E[Z(s)] = µ(s) ≡ n(s)λ(s),

log{λ(s)} = β0 (s) + β1 (s)x1 (s) + β2 (s)x2 (s),


$$\mathrm{Var}[\mathbf{Z}(\mathbf{s})] = \sigma^2_1\mathbf{V}_{\mu}^{1/2}\mathbf{W}(\mathbf{s}_0)^{-1}\mathbf{V}_{\mu}^{1/2}.$$
We chose weights defined by the Gaussian kernel, $w_{ij} = \exp\{-\tfrac{1}{2}(d_{ij}/b)^2\}$, where
dij is the Euclidean distance between regional centroids and b is a bandwidth
parameter which controls the degree of local smoothing. Following Fothering-
ham et al. (2002), we estimated b by cross validation and used local versions of
the generalized estimating equations derived in §6.3.5.1 to estimate β 0 . Since
the data are aggregated, we obtained estimates of β 0 and standard errors at
each county centroid. A map of the effect of median housing value on the
percentage of children with elevated blood lead levels is shown in Figure 6.13
and the corresponding standard error map is shown in Figure 6.14.
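A simplified sketch of a locally weighted Poisson fit of this kind is given below (Python). The function and all inputs are hypothetical; it uses Gaussian kernel weights and a plain IRLS loop rather than the local generalized estimating equations used for the maps, and its standard errors are the naive model-based ones.

import numpy as np

def gwr_poisson_fit(coords, z, X, offset_log_n, s0, b):
    """Locally weighted Poisson regression at location s0 with kernel bandwidth b."""
    d = np.sqrt(((coords - s0) ** 2).sum(axis=1))
    w = np.exp(-0.5 * (d / b) ** 2)                 # geographic kernel weights
    beta = np.zeros(X.shape[1])
    for _ in range(50):
        eta = offset_log_n + X @ beta
        mu = np.exp(eta)
        score = X.T @ (w * (z - mu))                # weighted score
        info = X.T @ ((w * mu)[:, None] * X)        # weighted information
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < 1e-8:
            break
    se = np.sqrt(np.diag(np.linalg.inv(info)))      # naive local standard errors
    return beta, se

In practice this function would be called once per county centroid s0, after choosing the bandwidth b (e.g., by cross validation), to produce the coefficient and standard error surfaces mapped in Figures 6.13 and 6.14.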
Figure 6.13 clearly shows that the effect of median housing value on the
percentage of children with elevated blood lead levels is not constant across
Virginia. There is a rather large area in south central Virginia where this
effect is as we would expect; as median housing value increases, blood lead
levels decrease. Referring back to Figure 1.5, this is also an area with some
of the largest percentages of elevated blood lead levels. It is also a fairly rural
region with relatively low median housing values (cf., Figure 1.6). There is
another area of interest near West Virginia, where the relationship between
median housing value and the percentage of elevated blood lead levels is not
what we would expect; as median housing value increases, blood lead levels
also tend to increase. This area also has relatively low median housing values,
but is also one that has some of the lowest percentages of elevated blood lead
levels in Virginia. This suggests that the homes in this area might be different
from those in south central Virginia; perhaps many of the older homes are
not painted at all or only those with relatively high median values for this
region are painted with lead-based paint. Conjectures like these are one of the
382 SPATIAL REGRESSION MODELS

Figure 6.13 Estimates of the effect of median housing value from GWR (map legend
classes: [−6.13, −2.71], [−2.70, −1.19], [−1.18, −0.13], [−0.12, 1.54], [1.55, 4.03]).

major advantages of this type of spatial analysis: it allows us to fine tune our
hypotheses and isolate more specific exposure sources that may affect blood
lead levels in children.

Figure 6.14 Standard errors of the effect of median housing value estimated from
GWR (map legend classes: 0.10–0.46, 0.47–0.63, 0.64–0.93, 0.94–1.17, 1.18–2.05).

6.4 Bayesian Hierarchical Models

The conditionally specified generalized linear mixed models discussed in the


previous section are examples of hierarchical models. At the first stage of the
hierarchy, we describe how the data depend on the random effects, S(s), and
on other covariate values deemed fixed and not random. At the second stage
of the hierarchy, we model the distribution of the random effects. Since the
data are assumed to be conditionally independent, constructing the likelihood
function (the likelihood of the data given unknown parameters) is relatively
easy and forms the basis for inference about the unknown parameters. Thus,
in the abstract, we have data Z = (Z₁, · · ·, Zₙ)′ whose distribution depends
on unknown quantities φ (these may be parameters, missing data, or even
other variables). We assume that Z|φ ∼ f (z|φ), where, borrowing from tra-
ditional statistical terminology, f (z|φ) is called the likelihood function. The
distribution of φ is specified using π(φ) and is called the prior distribution.
The key to Bayesian inference is the posterior distribution of the unknown
quantities given the data that reflects the uncertainty about φ after observing
the data. From Bayes’ theorem, the posterior distribution is proportional to
the product of the likelihood function and the prior distribution,
$$h(\boldsymbol{\phi}|\mathbf{Z}) \propto f(\mathbf{z}|\boldsymbol{\phi})\pi(\boldsymbol{\phi}), \qquad (6.92)$$
where the constant of proportionality is $\int f(\mathbf{z}|\boldsymbol{\phi})\pi(\boldsymbol{\phi})\,d\boldsymbol{\phi}$, which ensures that the
posterior distribution integrates to one.

Example 6.8 Suppose that given φ, the data Zᵢ are independent Poisson
variables, Zᵢ|φ ∼ Poisson(φ), and suppose φ ∼ Γ(α, β). Then, the likelihood
function is
$$f(\mathbf{z}|\phi) = \prod_{i=1}^{n} e^{-\phi}\frac{\phi^{z_i}}{z_i!},$$
and the posterior is
$$h(\phi|\mathbf{z}) = \prod_{i=1}^{n} e^{-\phi}\frac{\phi^{z_i}}{z_i!} \times \frac{\beta^{\alpha}}{\Gamma(\alpha)}\phi^{\alpha-1}e^{-\beta\phi}.$$
Combining terms, the posterior can be written as
$$h(\phi|\mathbf{z}) \propto \phi^{\alpha+\sum z_i - 1}e^{-(\beta+n)\phi},$$
which is Γ(α + Σzᵢ, β + n).
Given the posterior distribution, we can use different summary statistics
from this distribution to provide inferences about φ. One that is commonly
used is the posterior mean, which in this case is
$$\mathrm{E}[\phi|\mathbf{z}] = \frac{\alpha + \sum z_i}{\beta + n} = \bar{z}\left(\frac{n}{n+\beta}\right) + \frac{\alpha}{\beta}\left(1 - \frac{n}{n+\beta}\right).$$
Thus, the Bayes estimate is a linear combination of the maximum likelihood
estimate z̄, and the prior mean, α/β.
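As a quick numerical check of this conjugate updating, the following Python sketch uses hypothetical counts and hyperparameters (β is a rate parameter) and confirms that the posterior mean equals the weighted combination of z̄ and α/β given above.

from scipy import stats
import numpy as np

# Hypothetical data and Gamma(alpha, beta) prior
z = np.array([3, 5, 2, 4, 6, 1, 3])
alpha, beta = 2.0, 1.0
n = len(z)

# Posterior is Gamma(alpha + sum(z), beta + n)
post = stats.gamma(a=alpha + z.sum(), scale=1.0 / (beta + n))
weight = n / (n + beta)
print(post.mean(), weight * z.mean() + (1 - weight) * alpha / beta)  # identical values
print(post.interval(0.95))   # equal-tail 95% credible interval for phi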
Example 6.9 Consider the linear mixed model
$$Z_i = \mu_i + e_i,$$
where the µᵢ are random effects and the errors are iid with a common known
variance σ². One Bayesian hierarchical formulation might be
$$[\mathbf{Z}|\boldsymbol{\mu}, \sigma^2] \sim G(\boldsymbol{\mu}, \sigma^2\mathbf{I})$$
$$[\boldsymbol{\mu}|\alpha, \tau^2] \sim G(\alpha\mathbf{1}, \tau^2\mathbf{I})$$
$$[\alpha] \sim G(a, b^2).$$
At the final stage of the hierarchy, numerical values must be specified for
all unknowns, in this case τ², a, and b². The lack of information on these
parameters has caused much controversy in statistics, leading many to adopt
an empirical Bayes approach. An empirical Bayes version of this model is
$$[\mathbf{Z}|\boldsymbol{\mu}, \sigma^2] \sim G(\boldsymbol{\mu}, \sigma^2\mathbf{I})$$
$$[\boldsymbol{\mu}|\alpha, \tau^2] \sim G(\alpha\mathbf{1}, \tau^2\mathbf{I}),$$
where α and τ² are unknown and then estimated from the data using the
likelihood
$$f(\mathbf{z}|\sigma^2, \alpha, \tau^2) = \int f(\mathbf{z}|\boldsymbol{\mu}, \sigma^2)\pi(\boldsymbol{\mu}|\alpha, \tau^2)\,d\boldsymbol{\mu}.$$
In this case, because of the conditional independence,
$$f(\mathbf{z}|\sigma^2, \alpha, \tau^2) = \prod_{i=1}^{n}\int f(z_i|\mu_i, \sigma^2)\pi(\mu_i|\alpha, \tau^2)\,d\mu_i,$$
which is G(α, σ² + τ²). Thus, α and τ² can be estimated by maximum
likelihood, giving α̂ = Z̄ and τ̂² = max(0, s² − σ²), where s² = Σ(Zᵢ − Z̄)²/n.

A more general Bayesian hierarchical model—corresponding to the GLMMs


discussed in the previous section—is slightly more complicated. Suppose we
model spatial autocorrelation through the random effects, so that R = σ 2 I,
with σ 2 known, and Var[S(s)] = Σ(θ). For a hierarchical specification, we
must specify how the data depend on both the fixed and the random effects,
and then how the distribution of the random effects depends on θ. Thus, the
joint posterior distribution may look something like
$$h(\boldsymbol{\theta}, \mathbf{S}(\mathbf{s}), \boldsymbol{\beta}|\mathbf{Z}(\mathbf{s})) \propto f(\mathbf{Z}(\mathbf{s})|\boldsymbol{\beta}, \mathbf{S}(\mathbf{s}))\,\pi(\mathbf{S}(\mathbf{s})|\boldsymbol{\theta})\,\pi(\boldsymbol{\beta})\,\pi(\boldsymbol{\theta}). \qquad (6.93)$$
Note that a fully Bayesian analysis treats all model parameters as random
variables rather than fixed but unknown values. However, we still refer to
“fixed effects” as parameters pertaining to all experimental units and “ran-
dom effects” as parameters that vary among experimental units. This distinc-
tion remains a key component of mixed models regardless of the inferential
approach. There is another important distinction. The covariance parameters
θ enter the model at a second stage of the hierarchy and are also assigned a
prior distribution. They are often referred to as hyperparameters. Finally,
to construct (6.93), we assume statistical independence between the fixed ef-


fects parameters β and the hyperparameters θ, yielding a product of prior
distributions for the different model parameters.

6.4.1 Prior Distributions

In principle, the prior distribution is a statistical rendition of the information


known about a parameter of interest. Most parameters are not completely
unknown; we usually have some information about them. For example, we
know that variance components are positive, and from previous experience
we might expect their distribution to be skewed. We also know that probabil-
ities lie between 0 and 1. There might be historical data from other studies
that provide information that can be used to construct a prior distribution.
Theoretically, the prior distribution may be elicited from subject-matter ex-
perts, but in practice this is often done only in complex risk analysis studies.
A more intriguing philosophical approach to prior elicitation advocates eliciting,
not a prior distribution, but plausible data (Lele, 2004).
To make progress analytically, a conjugate prior, one that leads to a
posterior distribution belonging to the same family as the prior, is often used.
With a conjugate prior, the prior parameter can often be interpreted as a
prior sample, with the posterior distribution being just an updated version
based on new data. In the first example given above, the Γ(α, β) prior for φ
is a conjugate prior.
From the beginning, Bayesian analysis was meant to be subjective. As the
question “Where do you get the prior?” became more persistent, attempts
were made to make Bayesian analysis more objective by defining noninfor-
mative priors. For example, suppose the parameter space is discrete, taking
on m values. Then the uniform prior π(φ) = 1/m is a noninformative prior
in that every value of φ is equally likely. In this case, the term “noninfor-
mative” is somewhat misleading; this prior does indeed provide information
about φ, namely the fact that every value of φ is equally likely. Other nonin-
formative priors include flat priors (e.g., the prior distribution is uniform on
(−∞, ∞)), vague priors (e.g., φ ∼ G(0, 10,000) or 1/σ² ∼ Γ(0.001, 0.001)),
and Jeffreys' prior (π(φ) ∝ I(φ)^{1/2}, where I(φ) is the Fisher information
in the model). In the multi-parameter case, the situation is more complex
since we need to specify prior distributions that are joint distributions, e.g.,
with φ = (φ1 , φ2 ) the prior distribution of φ must define the joint distribution
of φ1 and φ2 .
Unfortunately, many practical applications of Bayesian analysis seem to
specify noninformative priors when very little information is known about the
prior distribution or the impact of the prior distribution is to be minimized.
We have to wonder why one would choose a Bayesian analysis in such cases,
and we caution against the automatic inference for complex models that
such choices permit. Moreover, many prior distributions, particularly those
used in complex hierarchical modeling situations, are chosen for convenience


(e.g., multivariate Gaussian, the inverse Gamma) and often, in addition to
conditional independence, the hyperparameters are also assumed to be in-
dependent. Thus, while there are now many more answers to the question
“Where do you get the prior?” the question itself still remains.

6.4.2 Fitting Bayesian Models

Suppose φ = (θ, τ)′ with joint posterior distribution h(θ, τ|z). Inference about
θ is made from the marginal posterior distribution of θ obtained by inte-
grating τ out of the joint posterior distribution. For most realistic models,
the posterior distribution is complex and high-dimensional and such inte-
grals are difficult to evaluate. However, in some cases, it is often possible
to simulate realizations from the desired distribution. For example, suppose
h(θ, τ |z) = f (θ|τ, z)g(τ |z), and suppose it is easy to simulate realizations from
f (θ|τ, z) and g(τ |z). Then
$$h(\boldsymbol{\theta}|\mathbf{z}) = \int f(\boldsymbol{\theta}|\boldsymbol{\tau}, \mathbf{z})g(\boldsymbol{\tau}|\mathbf{z})\,d\boldsymbol{\tau},$$

and we can obtain a sample θ1 , · · · , θm from h(θ|z) as follows:


1. Given observed data z, generate τ ∗ from g(τ |z);
2. Generate θ∗ from f (θ|τ, z);
3. repeat m times.
The pairs (θ₁, τ₁), · · ·, (θₘ, τₘ) are then a random sample from the joint posterior
distribution h(θ, τ|z) = f(θ|τ, z)g(τ|z), and θ₁, · · ·, θₘ are a random sample
from the marginal posterior distribution $h(\boldsymbol{\theta}|\mathbf{z}) = \int f(\boldsymbol{\theta}|\boldsymbol{\tau}, \mathbf{z})g(\boldsymbol{\tau}|\mathbf{z})\,d\boldsymbol{\tau}$ (see,
e.g., Tanner and Wong, 1987; Tanner, 1993). When this is possible, it makes
Bayesian inference fairly easy. However, in many applications, the distribu-
tions involved are non-standard and sampling from them is difficult at best.
Sometimes, it is possible to approximate the desired integrals without having
to directly simulate from a specified distribution, e.g., importance sampling
(Ripley, 1987; Robert and Casella, 1999) and rejection sampling (Ripley, 1987;
Smith and Gelfand, 1992; Tanner, 1993). Taking this one step further, it is
also possible to generate a sample, say θ1 , · · · , θm ∼ g, without directly sim-
ulating from g. However, generating θ1 , · · · , θm independently is difficult, so
attention has focused on generating a dependent sample from a specified dis-
tribution indirectly, without having to calculate the density or determine an
adequate approximation. This is the essence of Markov chain Monte Carlo
(MCMC).
A Markov chain is a sequence of random variables {Xm ; m ≥ 0} gov-
erned by the transition probabilities Pr(Xm+1 ∈ A|X1 . . . Xm ) = Pr(Xm+1 ∈
A|Xm ) ≡ P (xm , A), so that the distribution of the next value depends only
on the present “state” or value. This is called the Markov property. An in-
variant distribution π(x) for the Markov chain is a density satisfying
$$\pi(A) = \int P(x, A)\pi(x)\,dx.$$
Markov chain Monte Carlo (MCMC) is a sampling-based
simulation technique for generating a dependent sample from a specified dis-
tribution of interest, π(x). Under certain conditions, if we “run” a Markov
chain, i.e., we generate X⁽¹⁾, X⁽²⁾, · · ·, then
$$X^{(m)} \overset{d}{\rightarrow} X \sim \pi(x) \quad \text{as } m \rightarrow \infty.$$
Thus, if a long sequence of values is generated, the chain will converge to a
stationary distribution, i.e., after convergence, the probability of the chain
being in any particular “state” (or taking any particular value) at any partic-
ular time remains the same. In other words, after convergence, any sequence
of observations from the Markov chain represents a sample from the station-
ary distribution. To start the chain, a starting value, X0 must be specified.
Although the chain will eventually converge to a stationary distribution that
does not depend on X0 , usually the first m∗ iterations are discarded (called
the “burn-in”) to minimize any potential impact of the starting value on the
remaining values in the sequence. Note that the values in the remaining, post-
convergence sequence are dependent since each new value still depends on the
previous value. Thus, in practice, one selects an independent random sam-
ple by using only every k th value appearing in the sequence, where k is large
enough to ensure the resulting sample mimics that of a purely random process.
Tierney (1994) and Robert and Casella (1999) provide much more rigorous
statistical treatments of MCMC sampling and convergence, and Gilks et al.
(1996) provide a practical discussion.

6.4.2.1 The Gibbs Sampler

One of the most popular approaches to construct a Markov chain with a


specified target density π(x) is the Gibbs sampler, introduced by Geman and
Geman (1984) in image analysis and popularized in statistics by Gelfand and
Smith (1990). The Gibbs sampler is perhaps best explained by an example.
Suppose we have a model for data vector Z that has two parameters φ1 and
φ2 . Suppose that the desired posterior distribution, f (φ1 , φ2 |Z), is intractable,
but we can derive and sample from the full conditional distributions
$$f(\phi_1|\phi_2, \mathbf{Z}) \quad \text{and} \quad f(\phi_2|\phi_1, \mathbf{Z}).$$
Armed with a starting value $\phi_2^{(0)}$, we iterate through the full conditionals as
follows:
generate $\phi_1^{(1)}$ from $f(\phi_1|\phi_2^{(0)}, \mathbf{Z})$,
generate $\phi_2^{(1)}$ from $f(\phi_2|\phi_1^{(1)}, \mathbf{Z})$,
generate $\phi_1^{(2)}$ from $f(\phi_1|\phi_2^{(1)}, \mathbf{Z})$,
generate $\phi_2^{(2)}$ from $f(\phi_2|\phi_1^{(2)}, \mathbf{Z})$,
and so on.
As we continue to sequentially update the values of φ1 and φ2 , they will even-


tually become indistinguishable from samples from the joint posterior distribu-
tion f(φ₁, φ₂|Z), provided such a stationary distribution exists (Gelfand and Smith,
1990). Casella and George (1992) provide an excellent, elementary treatment
of the Gibbs sampler. More theoretical treatments are given in Gelfand and
Smith (1990), Smith and Roberts (1993), and Robert and Casella (1999) and
practical guidelines for implementation can be found in Gilks (1996).
Besag (1974) showed that knowledge of all full conditional distributions
uniquely determines the joint distribution, provided that the joint distribution
exists and is proper. Note, however, a collection of proper full conditional dis-
tributions does not guarantee the existence of a proper joint distribution (see,
e.g., Casella and George, 1992, for a simple, analytical example). Schervish
and Carlin (1992) provide general convergence conditions needed for the Gibbs
sampler. If Gibbs sampling is used with a set of full conditionals that do not
determine a proper joint distribution, the sampler may fail to converge or give
nonsensical results. Unfortunately, such situations may not always manifest
themselves in such obvious ways, and determining convergence or lack thereof
can be tricky, particularly with complex spatial models.

Example 6.10 Consider a hierarchical model that builds on the model given
in Example 6.8 above. Suppose that
$$Z_i|\lambda_i \overset{ind}{\sim} \text{Poisson}(\lambda_i)$$
$$\lambda_i \overset{ind}{\sim} \Gamma(\alpha, \beta_i)$$
$$\beta_i \overset{ind}{\sim} \Gamma(a, b)$$
with α, a, and b known. Then, the posterior distribution is
$$h(\boldsymbol{\lambda}, \boldsymbol{\beta}|\mathbf{z}) = \prod_{i=1}^{n} \frac{e^{-\lambda_i}\lambda_i^{z_i}}{z_i!} \times \frac{\beta_i^{\alpha}}{\Gamma(\alpha)}\lambda_i^{\alpha-1}e^{-\beta_i\lambda_i} \times \frac{b^{a}}{\Gamma(a)}\beta_i^{a-1}e^{-b\beta_i}.$$

Gilks (1996) demonstrates that full conditionals can be determined by simply


collecting the terms of the joint posterior that correspond to the parameter
of interest. So, to obtain the full conditional of λi , collect the terms involving
λi and this gives
h(λi |βi , z, α, a, b) ∼ Γ(zi + α, 1 + βi ),
and similarly the full conditional for β i is
h(βi |λi , z, α, a, b) ∼ Γ(α + a, λi + b).
Thus, given a starting value for λi and values for α, a, and b, βi can be
generated from Γ(α + a, λi + b) and then λi can be generated from Γ(zi +
α, 1 + βi ) to obtain a sample (λi , βi )1 , · · · , (λi , βi )m .
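A minimal Python sketch of this Gibbs sampler, using the rate parametrization of the Gamma distribution and hypothetical counts and hyperparameter values, is given below.

import numpy as np

def gibbs_poisson_gamma(z, alpha, a, b, n_iter=5000, burn_in=1000, seed=0):
    """Gibbs sampler for Example 6.10:
    lambda_i | ... ~ Gamma(z_i + alpha, 1 + beta_i), beta_i | ... ~ Gamma(alpha + a, lambda_i + b),
    with the second Gamma argument a rate (so scale = 1/rate)."""
    rng = np.random.default_rng(seed)
    lam = np.ones(len(z))                      # starting values for the lambda_i
    lam_draws, beta_draws = [], []
    for t in range(n_iter):
        beta = rng.gamma(shape=alpha + a, scale=1.0 / (lam + b))
        lam = rng.gamma(shape=z + alpha, scale=1.0 / (1.0 + beta))
        if t >= burn_in:
            lam_draws.append(lam.copy())
            beta_draws.append(beta.copy())
    return np.array(lam_draws), np.array(beta_draws)

# Hypothetical data and hyperparameters
z = np.array([4, 0, 7, 3, 2])
lam_s, beta_s = gibbs_poisson_gamma(z, alpha=1.0, a=1.0, b=1.0)
print(lam_s.mean(axis=0))                      # posterior means of the lambda_i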

6.4.2.2 The Metropolis-Hastings Algorithm

Another popular approach to construct a Markov chain with a specified tar-


get density π(x) is the Metropolis-Hastings algorithm. It was originally
developed by Metropolis, Rosenbluth, Teller, and Teller (1953) and general-
ized by Hastings (1970), whence the name was coined. It is frequently used
in geostatistical simulation where it forms the basis of simulated annealing
(§7.3, Deutsch and Journel, 1992). Tierney (1994) was among the first to
demonstrate its utility in Bayesian computing. The Gibbs sampling algorithm
discussed in §6.4.2.1 is a special case of the Metropolis-Hastings algorithm.
The algorithm depends on the creation of what Chib and Greenberg (1995)
call a candidate generating density q(x⁽ᵗ⁾, y) that depends on the current
state, x⁽ᵗ⁾, and can generate a new sample y, e.g., q(x⁽ᵗ⁾, y) = G(y − x⁽ᵗ⁾, σ²I).
Then, a Markov chain is generated as follows:
1. Repeat for j = 1, . . ., m;
2. Draw y ∼ q(x⁽ʲ⁾, ·) and u ∼ U(0, 1);
3. If u ≤ π(y)/π(x⁽ʲ⁾), then set x⁽ʲ⁺¹⁾ = y;
4. Otherwise set x⁽ʲ⁺¹⁾ = x⁽ʲ⁾.
5. This gives x⁽¹⁾, x⁽²⁾, · · ·, x⁽ᵐ⁾.
Eventually—after the burn-in and under certain mild regularity conditions—
the distribution of samples generated by this algorithm will converge to the
target distribution, π(x).
The key to the algorithm is selection of the candidate generating density,
q(x, y). Often, this density will depend on unknown location or scale parame-
ters that must be tuned during the burn-in period. The choice of scale param-
eter is important since it affects the rate of convergence and the region of the
sample space covered by the chain. Typically, this choice is made by specifying
an acceptance rate, the percentage of times a move to a new state is made.
This rate is often set to 25–50%, but depends on the dimensionality of the
problem and the candidate generating density. More implementational details
are given in Chib and Greenberg (1995) and Robert and Casella (1999).
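The following Python sketch implements the simplest variant described above, a random-walk Metropolis sampler with a Gaussian candidate generating density; the target density, starting value, and step size are all hypothetical, and the step size would in practice be tuned to the acceptance rates mentioned above.

import numpy as np

def metropolis(log_target, x0, n_iter=10000, step=1.0, seed=0):
    """Random-walk Metropolis sampler; log_target is the log of the (possibly
    unnormalized) target density pi(x), step is the proposal standard deviation."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    lp = log_target(x)
    draws, accepted = [], 0
    for _ in range(n_iter):
        y = x + step * rng.standard_normal(x.shape)   # candidate draw
        lp_y = log_target(y)
        if np.log(rng.uniform()) <= lp_y - lp:         # accept with prob min(1, pi(y)/pi(x))
            x, lp = y, lp_y
            accepted += 1
        draws.append(x.copy())
    return np.array(draws), accepted / n_iter

# Hypothetical target: a standard Gaussian in two dimensions
draws, acc = metropolis(lambda x: -0.5 * np.sum(x ** 2), x0=np.zeros(2), step=2.4)
print(acc, draws[2000:].mean(axis=0))   # acceptance rate and post-burn-in means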

6.4.2.3 Diagnosing Convergence

At convergence, the MCMC values should represent a sample from the tar-
get distribution of interest. Thus, for large m they should randomly fluctuate
around a stable mean value. So, a first diagnostic is to plot the sequence φt
against t. Another approach is to use multiple chains with different starting
values. This allows us to compare results from different replications; for ex-
ample, Gelman and Rubin (1992) suggest convergence statistics based on the
ratio of between-chain variances to within-chain variances. However, conver-
gence to a stationary distribution may not be the only concern. If we are
interested in functions of the parameters, we might be more interested in
convergence of averages of the form $(1/m)\sum_t g(\phi_t)$. Another consideration is
the dependence in the samples. If independent samples are required for infer-
ence, we can obtain these by subsampling the chain, but then we need to assess
how close the resulting sample is to being independent. There have been many
different convergence diagnostics proposed in the literature and Cowles and
Carlin (1996) provide a comprehensive, comparative review. Many of these,
and others proposed more recently, are described and illustrated in Robert
and Casella (1999).

6.4.2.4 Summarizing Posterior Distributions

Summaries of the post-convergence MCMC samples provide posterior in-


ference for model parameters. For instance, the sample mean of the (post-
convergence) sampled values for a particular model parameter provides an
estimate of the marginal posterior mean and a point estimate of the parame-
ter itself. Other measures of central tendency such as the median might also
be used as point estimates. The utility of the simulation approach is that it
provides an estimate of the entire posterior distribution which allows us to
obtain interval estimates as well. For example, the 2.5th and 97.5th quantiles
of the (post-convergence) sampled values for a model parameter provides a
95% interval estimate of the parameter. In Bayesian inference, such an inter-
val is termed a credible set to distinguish it from the confidence interval
in frequentist statistics. Unlike confidence intervals, credibility intervals have
a direct probabilistic interpretation: A 95% credible set defines an interval
having a 0.95 posterior probability of containing the parameter of interest.
A comprehensive discussion of the theory or the practice of Bayesian hierar-
chical modeling and MCMC is beyond the scope of this text. Instead, we have
provided a summary that will allow us to understand how such models might
be used in spatial data analysis. A discussion of the details, nuances, consid-
erations in implementation, and illustrative examples can be found in Gilks
et al. (1996), Robert and Casella (1999), Carlin and Louis (2000), Congdon
(2001, 2003), Banerjee, Carlin and Gelfand (2003), and Gelman et al. (2004).

6.4.3 Selected Spatial Models

At this point, there has been nothing explicitly spatial about our discussion
and the examples provided were basically aspatial. It is difficult to give a
general treatment of Bayesian hierarchical models, in general, and their use
in spatial data analysis, in particular, since each application can lead to a
unique model. Instead, we give an overview of several general types of models
that have been useful in spatial data analysis. More concrete examples can
be found in e.g., Carlin and Louis (2000) and Banerjee, Carlin, and Gelfand
(2003).

6.4.3.1 Linear Regression and Bayesian Kriging

Consider the simple linear model

$$\mathbf{Z}(\mathbf{s}) = \mathbf{X}(\mathbf{s})\boldsymbol{\beta} + \boldsymbol{\varepsilon}(\mathbf{s}),$$
with E[ε(s)] = 0 and Var[ε(s)] = R, where R is assumed known. A simple


hierarchical rendition is

[Z(s)|β] ∼ G(X(s)β, R)
[β] ∼ G(m, Q), (6.94)
with m and Q known. We assume that Q and (X(s)′R⁻¹X(s)) are both of
full rank. The posterior distribution, h(β|Z(s)), is
$$\begin{aligned}
h(\boldsymbol{\beta}|\mathbf{Z}(\mathbf{s})) &\propto f(\mathbf{Z}(\mathbf{s})|\boldsymbol{\beta})\pi(\boldsymbol{\beta}) \\
&\propto \exp\left\{-\tfrac{1}{2}\left[(\mathbf{Z}(\mathbf{s}) - \mathbf{X}(\mathbf{s})\boldsymbol{\beta})'\mathbf{R}^{-1}(\mathbf{Z}(\mathbf{s}) - \mathbf{X}(\mathbf{s})\boldsymbol{\beta}) + (\boldsymbol{\beta} - \mathbf{m})'\mathbf{Q}^{-1}(\boldsymbol{\beta} - \mathbf{m})\right]\right\} \\
&= \exp\left\{-\tfrac{1}{2}\left[\boldsymbol{\beta}'(\mathbf{Q}^{-1} + \mathbf{X}(\mathbf{s})'\mathbf{R}^{-1}\mathbf{X}(\mathbf{s}))\boldsymbol{\beta} - \boldsymbol{\beta}'(\mathbf{Q}^{-1}\mathbf{m} + \mathbf{X}(\mathbf{s})'\mathbf{R}^{-1}\mathbf{Z}(\mathbf{s})) \right.\right.\\
&\qquad \left.\left. -\ (\mathbf{Q}^{-1}\mathbf{m} + \mathbf{X}(\mathbf{s})'\mathbf{R}^{-1}\mathbf{Z}(\mathbf{s}))'\boldsymbol{\beta} + \mathbf{m}'\mathbf{Q}^{-1}\mathbf{m} + \mathbf{Z}(\mathbf{s})'\mathbf{R}^{-1}\mathbf{Z}(\mathbf{s})\right]\right\} \\
&\propto \exp\left\{-\tfrac{1}{2}(\boldsymbol{\beta} - \mathbf{m}^*)'(\mathbf{Q}^*)^{-1}(\boldsymbol{\beta} - \mathbf{m}^*)\right\}
\end{aligned}$$
with
$$\mathbf{m}^* = (\mathbf{Q}^{-1} + \mathbf{X}(\mathbf{s})'\mathbf{R}^{-1}\mathbf{X}(\mathbf{s}))^{-1}(\mathbf{X}(\mathbf{s})'\mathbf{R}^{-1}\mathbf{Z}(\mathbf{s}) + \mathbf{Q}^{-1}\mathbf{m})$$
$$\mathbf{Q}^* = (\mathbf{Q}^{-1} + \mathbf{X}(\mathbf{s})'\mathbf{R}^{-1}\mathbf{X}(\mathbf{s}))^{-1}.$$
Thus, h(β|Z(s)) is G(m∗, Q∗). Note that if Q⁻¹ = 0, m∗ reduces to the
generalized least squares estimator of (the now fixed effects) β with variance-
covariance matrix Q∗. This is also true if we assume independent vague or
uniform priors for the elements of β. If, on the other hand, Q⁻¹ is much
larger than (X(s)′R⁻¹X(s)), the prior information dominates the data and
m∗ ≈ m and Q∗ ≈ Q.
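A short Python sketch of this conjugate update is given below; the data, design matrix, and prior settings are hypothetical, and the comparison simply illustrates that a very diffuse prior (large Q) makes m∗ close to the GLS estimate, as noted above.

import numpy as np

def posterior_beta(Z, X, R, m, Q):
    """Posterior mean m* and covariance Q* for beta under model (6.94), with R and Q known."""
    Rinv = np.linalg.inv(R)
    Qinv = np.linalg.inv(Q)
    Q_star = np.linalg.inv(Qinv + X.T @ Rinv @ X)
    m_star = Q_star @ (X.T @ Rinv @ Z + Qinv @ m)
    return m_star, Q_star

# Hypothetical small example
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(20), rng.uniform(size=20)])
R = 0.5 * np.eye(20)
Z = X @ np.array([1.0, 2.0]) + rng.multivariate_normal(np.zeros(20), R)
m_star, Q_star = posterior_beta(Z, X, R, m=np.zeros(2), Q=1e6 * np.eye(2))
gls = np.linalg.solve(X.T @ np.linalg.inv(R) @ X, X.T @ np.linalg.inv(R) @ Z)
print(m_star, gls)   # nearly identical under the diffuse prior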
Now consider the prediction of a k × 1 vector of new observations, Z(s₀).
For this we assume Z(s₀) = X(s₀)β + ε(s₀) and that
$$\begin{bmatrix}\mathbf{Z}(\mathbf{s})\\ \mathbf{Z}(\mathbf{s}_0)\end{bmatrix} \sim G\left(\begin{bmatrix}\mathbf{X}(\mathbf{s})\\ \mathbf{X}(\mathbf{s}_0)\end{bmatrix}\boldsymbol{\beta},\ \begin{bmatrix}\mathbf{R}_{zz} & \mathbf{R}_{z0}\\ \mathbf{R}_{0z} & \mathbf{R}_{00}\end{bmatrix}\right). \qquad (6.95)$$

For spatial prediction with the hierarchical model in (6.94), we require the
predictive distribution of Z(s0 ), the marginal distribution of Z(s0 ) given
both the data and the prior information,
$$p(\mathbf{Z}(\mathbf{s}_0)|\mathbf{Z}(\mathbf{s})) = \int_{\boldsymbol{\beta}} f(\mathbf{Z}(\mathbf{s}_0)|\mathbf{Z}(\mathbf{s}), \boldsymbol{\beta})f(\boldsymbol{\beta})\,d\boldsymbol{\beta}.$$
The mean of this distribution can be obtained as
$$\mathrm{E}[\mathbf{Z}(\mathbf{s}_0)|\mathbf{Z}(\mathbf{s})] = \mathrm{E}_{\boldsymbol{\beta}}[\mathrm{E}(\mathbf{Z}(\mathbf{s}_0)|\mathbf{Z}(\mathbf{s}); \boldsymbol{\beta})] = \int_{\boldsymbol{\beta}}\left[\mathbf{X}(\mathbf{s}_0)\boldsymbol{\beta} + \mathbf{R}_{0z}\mathbf{R}_{zz}^{-1}(\mathbf{Z}(\mathbf{s}) - \mathbf{X}(\mathbf{s})\boldsymbol{\beta})\right]f(\boldsymbol{\beta})\,d\boldsymbol{\beta},$$
and the variance can be obtained using
$$\mathrm{Var}[\mathbf{Z}(\mathbf{s}_0)|\mathbf{Z}(\mathbf{s})] = \mathrm{E}_{\boldsymbol{\beta}}[\mathrm{Var}(\mathbf{Z}(\mathbf{s}_0)|\mathbf{Z}(\mathbf{s}); \boldsymbol{\beta})] + \mathrm{Var}_{\boldsymbol{\beta}}[\mathrm{E}(\mathbf{Z}(\mathbf{s}_0)|\mathbf{Z}(\mathbf{s}); \boldsymbol{\beta})].$$
For the model in (6.94) these results give (Kitanidis, 1986)
$$\begin{aligned}
\mathrm{E}[\mathbf{Z}(\mathbf{s}_0)|\mathbf{Z}(\mathbf{s})] ={}& (\mathbf{X}(\mathbf{s}_0) - \mathbf{R}_{0z}\mathbf{R}_{zz}^{-1}\mathbf{X}(\mathbf{s}))(\mathbf{Q}^{-1} + \mathbf{X}(\mathbf{s})'\mathbf{R}_{zz}^{-1}\mathbf{X}(\mathbf{s}))^{-1}\mathbf{Q}^{-1}\mathbf{m} \\
&+ \left[\mathbf{R}_{0z}\mathbf{R}_{zz}^{-1} + (\mathbf{X}(\mathbf{s}_0) - \mathbf{R}_{0z}\mathbf{R}_{zz}^{-1}\mathbf{X}(\mathbf{s}))(\mathbf{Q}^{-1} + \mathbf{X}(\mathbf{s})'\mathbf{R}_{zz}^{-1}\mathbf{X}(\mathbf{s}))^{-1}\mathbf{X}(\mathbf{s})'\mathbf{R}_{zz}^{-1}\right]\mathbf{Z}(\mathbf{s}),
\end{aligned}$$
and
$$\begin{aligned}
\mathrm{Var}[\mathbf{Z}(\mathbf{s}_0)|\mathbf{Z}(\mathbf{s})] ={}& \mathbf{R}_{00} - \mathbf{R}_{0z}\mathbf{R}_{zz}^{-1}\mathbf{R}_{z0} + (\mathbf{X}(\mathbf{s}_0) - \mathbf{R}_{0z}\mathbf{R}_{zz}^{-1}\mathbf{X}(\mathbf{s})) \\
&\times (\mathbf{Q}^{-1} + \mathbf{X}(\mathbf{s})'\mathbf{R}_{zz}^{-1}\mathbf{X}(\mathbf{s}))^{-1}(\mathbf{X}(\mathbf{s}_0) - \mathbf{R}_{0z}\mathbf{R}_{zz}^{-1}\mathbf{X}(\mathbf{s}))'.
\end{aligned}$$

If Q−1 = 0, (or we assume independent vague or uniform priors for the


elements of β) these reduce to the universal kriging predictor and its variance,
given in (5.29) and (5.30), respectively.
Consider slightly relaxing the assumption that R is completely known and
assume R = σ 2 V, where V is known, but σ 2 is unknown. Now there are
two types of unknown parameters to consider, β and σ 2 . The natural conju-
gate family of prior distributions is the normal-inverse-Gamma distribu-
tion (O’Hagan, 1994, p. 246), defined as
$$\pi(\boldsymbol{\beta}, \sigma^2) = \frac{(a/2)^{d/2}}{(2\pi)^{p/2}|\mathbf{Q}|^{1/2}\Gamma(d/2)}(\sigma^2)^{-(d+p+2)/2}\exp\left\{-\frac{1}{2\sigma^2}\left[(\boldsymbol{\beta} - \mathbf{m})'\mathbf{Q}^{-1}(\boldsymbol{\beta} - \mathbf{m}) + a\right]\right\},$$
with hyperparameters a, d, m, and Q. This distribution is denoted NIG(a, d, m, Q).
The hierarchical model is now
$$[\mathbf{Z}(\mathbf{s})|\boldsymbol{\beta}, \sigma^2] \sim G(\mathbf{X}(\mathbf{s})\boldsymbol{\beta}, \sigma^2\mathbf{V})$$
$$[\boldsymbol{\beta}, \sigma^2] \sim NIG(a, d, \mathbf{m}, \mathbf{Q}), \qquad (6.96)$$
with V, a, d, m, and Q known. Another way of writing this model is
$$[\mathbf{Z}(\mathbf{s})|\boldsymbol{\beta}, \sigma^2] \sim G(\mathbf{X}(\mathbf{s})\boldsymbol{\beta}, \sigma^2\mathbf{V})$$
$$[\boldsymbol{\beta}|\sigma^2] \sim G(\mathbf{m}, \sigma^2\mathbf{Q})$$
$$[\sigma^2] \sim IG(a, d),$$
where IG(a, d) denotes the inverse Gamma distribution with density
$$\pi(\sigma^2) = \frac{(a/2)^{d/2}}{\Gamma(d/2)}(\sigma^2)^{-(d+2)/2}\exp\{-a/(2\sigma^2)\},$$
and mean a/(d − 2). Then σ −2 has a Gamma distribution with parameters
(d/2, a/2) and mean d/a. The joint posterior distribution is h(β, σ 2 |Z(s)) =
f (Z(s)|β, σ 2 )π(β, σ 2 ), but inferences about β must be made from h(β|Z(s)),
obtained by integrating out σ 2 from the joint posterior. We can obtain the
moments of this distribution using results on conditional expectations. Condi-
tioning on σ 2 is the same as assuming σ 2 is known, so this mean is equivalent to
that derived above. Thus, the posterior mean is E[β|Z(s)] = E[E(β|σ 2 , Z(s))]
= m∗ . However, the variance is Var[β|Z(s)] = E[σ 2 |Z(s)]Q∗ . The posterior
mean of σ 2 , E[σ 2 |Z(s)], can be derived using similar arguments (and much
more algebra). O’Hagan (1994, p. 249) shows that it is a linear combination
of three terms: the prior mean, E(σ 2 ), the usual residual sum of squares, and
a term that compares β̂_gls to m. The real impact of the prior distribution,
π(β, σ²), is that the posterior density function of β follows a multivariate t-
distribution with mean m∗, scale a∗Q∗, and d + n degrees of freedom, where
$$a^* = a + \mathbf{m}'\mathbf{Q}^{-1}\mathbf{m} + \mathbf{Z}(\mathbf{s})'\mathbf{V}^{-1}\mathbf{Z}(\mathbf{s}) - (\mathbf{m}^*)'(\mathbf{Q}^*)^{-1}\mathbf{m}^*$$
(Judge et al., 1985; O’Hagan, 1994).
O’Hagan, 1994).
We chose the N IG(a, d, m, Q) as a prior distribution for (β, σ 2 ) since the
conjugacy leads to tractable (although messy) analytical results. However,
this distribution specifies a particular relationship between β and σ 2 . Often,
we will have no reason to assume these two parameters are related. Thus,
instead we could assume f (β, σ 2 ) = f (β)f (σ 2 ), (independence), and then we
are free to independently choose priors for β and σ 2 . Common choices include
an improper uniform prior for β and an inverse Gamma distribution for σ 2 . If
Q⁻¹ = 0 (corresponding to an improper uniform prior for β), and we choose
a = 0, then m∗ = β̂_gls, and E[σ²|Z(s)] = {(n − p)/(d + n − 2)}σ̂², with
$$\widehat{\sigma}^2 = (\mathbf{Z}(\mathbf{s}) - \mathbf{X}(\mathbf{s})\widehat{\boldsymbol{\beta}}_{gls})'\mathbf{V}^{-1}(\mathbf{Z}(\mathbf{s}) - \mathbf{X}(\mathbf{s})\widehat{\boldsymbol{\beta}}_{gls})/(n - p).$$
Thus, the posterior distribution of β is centered at β̂_gls, which gives the
classical inference for β.
The predictive distribution of Z(s0 ) given Z(s) is derived analogously to
that given above, and Kitanidis (1986) and Le and Zidek (1992) show that it
has a multivariate t-distribution with mean
$$\begin{aligned}
\mathrm{E}[\mathbf{Z}(\mathbf{s}_0)|\mathbf{Z}(\mathbf{s})] ={}& (\mathbf{X}(\mathbf{s}_0) - \mathbf{V}_{0z}\mathbf{V}_{zz}^{-1}\mathbf{X}(\mathbf{s}))(\mathbf{Q}^{-1} + \mathbf{X}(\mathbf{s})'\mathbf{V}_{zz}^{-1}\mathbf{X}(\mathbf{s}))^{-1}\mathbf{Q}^{-1}\mathbf{m} \\
&+ \left[\mathbf{V}_{0z}\mathbf{V}_{zz}^{-1} + (\mathbf{X}(\mathbf{s}_0) - \mathbf{V}_{0z}\mathbf{V}_{zz}^{-1}\mathbf{X}(\mathbf{s}))(\mathbf{Q}^{-1} + \mathbf{X}(\mathbf{s})'\mathbf{V}_{zz}^{-1}\mathbf{X}(\mathbf{s}))^{-1}\mathbf{X}(\mathbf{s})'\mathbf{V}_{zz}^{-1}\right]\mathbf{Z}(\mathbf{s}), \qquad (6.97)
\end{aligned}$$
and variance
$$\begin{aligned}
\mathrm{Var}[\mathbf{Z}(\mathbf{s}_0)|\mathbf{Z}(\mathbf{s})] ={}& \left[\mathbf{V}_{00} - \mathbf{V}_{0z}\mathbf{V}_{zz}^{-1}\mathbf{V}_{z0} + (\mathbf{X}(\mathbf{s}_0) - \mathbf{V}_{0z}\mathbf{V}_{zz}^{-1}\mathbf{X}(\mathbf{s}))\right. \\
&\left.\times (\mathbf{Q}^{-1} + \mathbf{X}(\mathbf{s})'\mathbf{V}_{zz}^{-1}\mathbf{X}(\mathbf{s}))^{-1}(\mathbf{X}(\mathbf{s}_0) - \mathbf{V}_{0z}\mathbf{V}_{zz}^{-1}\mathbf{X}(\mathbf{s}))'\right]\nu^*, \qquad (6.98)
\end{aligned}$$
where
$$\nu^* = \frac{a + (n - p)\widehat{\sigma}^2 + (\widehat{\boldsymbol{\beta}}_{gls} - \mathbf{m})'(\mathbf{X}(\mathbf{s})'\mathbf{V}_{zz}^{-1}\mathbf{X}(\mathbf{s}))(\widehat{\boldsymbol{\beta}}_{gls} - \mathbf{m})}{n + d - 2}.$$

Finally, consider the more general case with R = σ²V(θ), where θ characterizes
the spatial autocorrelation in the data. We now need to assume
something for the prior distribution π(β, σ², θ). For convenience, Handcock
and Stein (1993) choose π(β, σ 2 , θ) ∝ π(θ)/σ 2 and use the Matérn class of
covariance functions to model the dependence of V on θ, choosing uniform
priors for both the smoothness and the range parameters in this class. In this
case, and for other choices of π(β, σ 2 , θ), analytical derivation of the predictive
distribution becomes impossible. Computational details for the use of impor-
tance sampling or MCMC methods for a general π(θ) and any parametric
model for V(θ) are given in Gaudard et al. (1999).
Unfortunately, and surprisingly, many (arguably most) of the noninforma-
tive priors commonly used for spatial autocorrelation (including the one used
by Handcock and Stein described above) can lead to improper posterior dis-
tributions (Berger, De Oliveira, and Sansó, 2001). Berger et al. (2001) provide
tremendous insight into how such choices should be made and provide a flex-
ible alternative (based on a reference prior approach) that always produces
proper prior distributions. However, this prior has not been widely used in
spatial applications, so its true impacts on Bayesian modeling of spatial data
are unknown.

6.4.3.2 Smoothing Disease Rates

One of the most popular uses of Bayesian hierarchical modeling of spatial data
is in disease mapping. In this context, for each of i = 1, · · · , N geographic
regions, let Z(si ) be the number of incident cases of disease, where si is a
generic spatial index for the ith region. Suppose that we also have available
the number of incident cases expected in each region based on the population
size and demographic structure within region i (E(si ), i = 1, . . . , N ); often the
E(si ) reflect age-standardized values. The Z(si ) are random variables and we
assume the E(si ) are fixed values. In many applications where information on
E(si ) is unknown, n(si ), the population at risk in region i is used instead, as
in Example 1.4.
To allow for the possibility of region-specific risk factors in addition to those
defining each E(si ), Clayton and Kaldor (1987) propose a set of region-specific
relative risks ζ(sᵢ), i = 1, · · ·, N, and define the first stage of a hierarchical
model through
$$[Z(\mathbf{s}_i)|\zeta(\mathbf{s}_i)] \overset{ind}{\sim} \text{Poisson}(E(\mathbf{s}_i)\zeta(\mathbf{s}_i)). \qquad (6.99)$$
Thus E[Z(si )|ζ(si )] = E(si )ζ(si ), so that the ζ(si ) represent an additional
multiplicative risk associated with region i, not already accounted for in the
calculation of E(si ).

The ratio of observed to expected counts, Z(si )/E(si ) corresponds to the


local standardized mortality ratio (SMR) and is the maximum likelihood es-
timate of the relative risk, ζ(si ), experienced by individuals residing in region
i. Instead of SMR, the same type of model can be used with rates, Z(si )/ni ,
where ni is the population at risk in region i.
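The simplest conjugate version of this idea, in the spirit of Clayton and Kaldor (1987), shrinks the crude SMRs toward a Gamma prior mean. The following Python sketch is purely illustrative: the counts, expected counts, and the hyperparameters α and β are hypothetical, whereas in practice α and β would be estimated from the data (empirical Bayes) or given hyperpriors (fully Bayes).

import numpy as np

def gamma_poisson_smooth(z, e, alpha, beta):
    """Posterior mean of zeta_i under Z_i | zeta_i ~ Poisson(e_i zeta_i) and
    zeta_i ~ Gamma(alpha, beta) (rate beta): a weighted average of the SMR z_i/e_i
    and the prior mean alpha/beta."""
    return (alpha + z) / (beta + e)

# Hypothetical counts and expected counts for six regions
z = np.array([0, 2, 5, 1, 12, 3])
e = np.array([1.2, 2.5, 4.0, 0.8, 9.5, 3.1])
print(z / e)                                              # crude SMRs
print(gamma_poisson_smooth(z, e, alpha=2.0, beta=2.0))    # smoothed relative risks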
We can incorporate fixed and random effects within the relative risk pa-
rameter if we include covariates and let
$$\log\{\zeta(\mathbf{s}_i)\} = \mathbf{x}(\mathbf{s}_i)'\boldsymbol{\beta} + \psi(\mathbf{s}_i), \qquad (6.100)$$
so that the model is
$$[Z(\mathbf{s}_i)|\boldsymbol{\beta}, \psi(\mathbf{s}_i)] \overset{ind}{\sim} \text{Poisson}(E(\mathbf{s}_i)\exp\{\mathbf{x}(\mathbf{s}_i)'\boldsymbol{\beta} + \psi(\mathbf{s}_i)\}). \qquad (6.101)$$
Note that this is the same model used in Example 1.4, with ψ(si ) representing
S(si ).
The next step is to specify distributions for the parameters β and ψ. For
the fixed effects β, noninformative priors are often used (e.g., uniform, or
Gaussian with very wide prior variance). Since the elements comprising β
are continuous parameters potentially taking values anywhere in [−∞, ∞]
the uniform prior distribution is an improper distribution, i.e., its probability
density function does not integrate to one. In many cases when the joint
likelihood is sufficiently well-defined, the posterior distribution arising from
an improper prior distribution will be a proper distribution, but care must
be taken when using improper priors, in general, and in spatial modeling in
particular.
We also have several choices for the prior distribution of the random effects
ψ(si ). A common assumption is
$$\psi(\mathbf{s}_i) \overset{ind}{\sim} G(0, \sigma^2_{\psi}),\quad i = 1, \cdots, N. \qquad (6.102)$$
With this choice of prior, the ψ(si ) do not depend on location i and are
said to be exchangeable. The effect of this choice is to add excess (but not
spatially structured) variation to the model in equation (6.100), and hence
to the model in equation (6.101). As we saw in §6.3.4, even though the prior
distributions are essentially aspatial, they do induce some similarity among
the observations. Exploiting this similarity to obtain better predictions of
the relative risks is the concept of “borrowing strength” across observations
commonly referred to in discussions of smoothing disease rates.
Estimating the remaining model parameter σψ2 (or the parameters from
other choices of the prior for ψ) from the data, leads to empirical Bayes in-
ference often used in smoothing disease rates prior to mapping (e.g., Clayton
and Kaldor, 1987; Marshall, 1991; Cressie, 1992; Devine et al., 1994). For
fully Bayes inference, we complete the hierarchy by defining a hyperprior dis-
tribution for σψ2 . This is often taken to be a conjugate prior; for the variance
parameter of a Gaussian distribution, the inverse Gamma distribution pro-
vides the conjugate family.
Most applications complete the model at this point, assigning fixed values
to the two parameters of the inverse Gamma hyperprior. There has been
much recent discussion on valid noninformative choices for these parameters
since the inverse Gamma distribution is only defined for positive values and
zero is a degenerate value (see e.g., Kelsall and Wakefield, 1999 and Gelman
et al., 2004, Appendix C). Ghosh et al. (1999) and Sun et al. (1999) define
conditions on the inverse Gamma parameters to ensure a proper posterior.
In our experience, convergence of MCMC algorithms can be very sensitive to
choices for these parameters, so they must be selected carefully.
To more explicitly model spatial autocorrelation, we can use the prior dis-
tribution for ψ to allow a spatial pattern among the ψ(si ), e.g., through a
parametric covariance function linking pairs ψ(si ) and ψ(sj ) for j ≠ i. Thus,
we could consider a joint multivariate Gaussian prior distribution for ψ with
spatial covariance matrix Σψ , i.e.,
ψ ∼ G(0, Σψ ), (6.103)
and, following the geostatistical paradigm, define the elements of Σψ through
a parametric covariance function, e.g., such as in Handcock and Stein (1993)
discussed above. A more popular approach is to use a conditionally specified
prior spatial structure for ψ similar to the conditional autoregressive models
introduced in §6.2.2.2. With a CAR prior, we define the distribution of ψ
conditionally, i.e.,
    ψ(si )|ψ(sj≠i ) ∼ G( Σ_{j=1}^{N} cij ψ(sj ), τi² ),   i = 1, · · · , N,

which is equivalent to

    ψ ∼ G(0, (I − C)⁻¹M),   (6.104)

where M = diag(τ1², · · · , τN²). As in §6.2.2.2, the cij s denote spatial dependence parameters and we set cii = 0 for all i. Typical applications take M = τ²I
and consider adjacency-based weights where cij = φ if region j is adjacent to
region i, and cij = 0, otherwise. Clayton and Kaldor (1987) propose priors
in an empirical Bayes setting, and Breslow and Clayton (1993) apply CAR
priors as random effects distributions within likelihood approximations for
generalized linear mixed models.
For a fully Bayesian approach, we need to specify a prior for [φ, τ²], e.g.,

    π(φ, τ²) = π(φ)π(τ²),   τ² ∼ IG(a, b),   φ ∼ U(γmin⁻¹, γmax⁻¹),

where γmin and γmax are the smallest and largest eigenvalues of W, C = φW, and W is a spatial proximity matrix (e.g., Besag et al., 1991; Stern and Cressie, 1999).
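As a small illustration (not from the text) of how the pieces of the CAR prior fit together, the sketch below builds Σψ = (I − φW)⁻¹M with M = τ²I for a hypothetical four-region adjacency matrix and computes the admissible interval for φ from the eigenvalues of W:

    import numpy as np

    # four regions on a line: neighbors share an edge (hypothetical W)
    W = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    eig = np.linalg.eigvalsh(W)
    phi_lo, phi_hi = 1 / eig.min(), 1 / eig.max()    # support of the uniform prior for phi
    print(f"phi must lie in ({phi_lo:.3f}, {phi_hi:.3f})")

    tau2, phi = 0.5, 0.15                            # illustrative values inside the bounds
    C = phi * W
    Sigma_psi = tau2 * np.linalg.inv(np.eye(4) - C)  # Var[psi] = (I - C)^{-1} M, M = tau2 * I
    print(Sigma_psi)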
There are many issues surrounding the use of CAR priors, e.g., choices for
cij that lead to desired interpretations (cf., intrinsic autoregression of Besag
and Kooperberg, 1995), the “improperness” that results from the pairwise
construction of CAR models and the use of constraints on ψ to assure proper
posterior distributions, etc. (see, e.g., Besag et al., 1991; Besag et al., 1995).

6.4.3.3 Mapping Non-Gaussian Data

When the data are not Gaussian, the optimal predictor of the data at new
locations, E[Z(s0 )|Z(s)], may not be linear in the data and so linear predic-
tors (like the universal kriging predictor described in §5.3.3 and the Bayesian
kriging predictors described above in §6.4.3.1) may be poor approximations
to this optimal conditional expectation. Statisticians typically address this
problem in one of two ways, either through transformation as described in
§5.6 or through the use of generalized linear models described in §6.3. Just as
there is a Bayesian hierarchical model formulation for universal kriging, there
are also Bayesian hierarchical model formulations for trans-Gaussian (§5.6.2)
and indicator kriging (§5.6.3) and for spatial GLMMs (§6.3.4).
De Oliveira et al. (1997) extend the ideas given in §6.4.3.1 to the case of
transformed Gaussian fields. Consider data Z(s), and a parametric family of
monotone transformations, {gλ (·)}, for which gλ (Z(s)) is Gaussian. In contrast
to the methods described in §5.6.2, De Oliveira et al. (1997) assume λ is
unknown. On the transformed scale, (gλ (Z(s))′, gλ (Z(s0 )))′ is assumed to follow
the same multivariate Gaussian distribution as in (6.95), with R = σ 2 V(θ).
The idea is to build a hierarchical model that will incorporate uncertainty in
the sampling distribution through a prior distribution on λ. Following ideas
in Box and Cox (1964), De Oliveira et al. (1997) choose the prior
    π(β, σ², θ, λ) ∝ π(θ)/(σ² Jλ^{p/n}),

where Jλ = ∏ᵢ |g′λ(z(si ))|, and then assume (bounded) uniform priors for λ and
each component of θ.
The predictive distribution is obtained from
    p(Z(s0 )|Z(s)) = ∫ f (Z(s0 )|Z(s), φ) f (φ|Z(s)) dφ,   (6.105)

where φ = (β, σ², θ, λ)′. By first fixing θ and λ, De Oliveira et al. (1997)


make repeated use of the results for linear models given in §6.4.3.1 to con-
struct f (φ|Z(s)), and to integrate out β and σ 2 from (6.105), using the fact
that p(gλ (Z(s0 ))|Z(s), θ, λ) has a multivariate t-distribution with mean and
variance similar to those given in equations (6.97) and (6.98). This leaves
(6.105) as a function of θ and λ and De Oliveira et al. (1997) construct a
Monte Carlo algorithm to integrate out these parameters.
De Oliveira et al. (1997) compare their approach, which they call the
Bayesian transformed Gaussian model (BTG), to the more traditional trans-
Gaussian kriging approach (TGK). They use a cross-validation study based
on rainfall amounts measured at 24 stations near Darwin, Australia. Their re-
sults indicate that both approaches were about as accurate, but the empirical
probability coverages based on 95% prediction intervals from the BTG model
were much closer to nominal than those from trans-Gaussian kriging. BTG
intervals covered 91.6% of the 24 true rainfall values, while TGK covered 75%,
although the kriging prediction intervals were based on z-scores rather than
t-scores. Thus, the BTG model is another approach to adjust for use of the
plug-in variance in kriging in small samples (in addition to that of Kackar and
Harville and Kenward and Roger discussed in §5.5.4).
Kim and Mallick (2002) adopt a slightly different approach to the prob-
lem of spatial prediction for non-Gaussian data. They utilize the multivariate
skew-Gaussian (SG) distribution developed by Azzalini and Capitanio (1999).
The SG distribution is based on the multivariate Gaussian distribution, but
includes an extra parameter for shape. Kim and Mallick (2002) construct a hi-
erarchical model very similar to the most general model described in §6.4.3.1,
but with the likelihood based on a skew-Gaussian distribution rather than a
Gaussian distribution. Because the multivariate Gaussian distribution forms
the basis for the SG distribution, the SG distribution has many similar prop-
erties and the ideas and results in §6.4.3.1 can be used to obtain all full
conditionals needed for Gibbs sampling.
These approaches assume the data are continuous and that there is no rela-
tionship between the mean and the variance. In a generalized linear modeling
context, Diggle et al. (1998) consider the model in §6.3.4 where given an un-
derlying, smooth, spatial process {S(s) : s ∈ D}, Z(s) has distribution in the
exponential family. Thus,
E[Z(s)|S] ≡ µ(s),
g[µ(s)] = x(s)′β + S(s),   (6.106)
and we assume S(s) is a Gaussian random field with mean 0 and covariance
function σS2 ρS (si − sj ; θ).
Thus, in a hierarchical specification
    [Z(s)|β, S] ∼ f (g⁻¹(X(s)β + S)),  independently,
[S|θ] ∼ G(0, ΣS (θ))
[β, θ] ∼ π(β, θ),
where f (·) belongs to the exponential family, and has mean g −1 (X(s)β + S).
Diggle et al. (1998) take a two step approach, first constructing an algorithm
to generate sample from the full conditionals of θ, β and S(s), and then
considering prediction of new values S(s0 ). For prediction of S(s0 ), again
assume that, conditional on β and θ, S(s) and S(s0 ) follow a multivariate
normal distribution of the form of equation (6.95), where

    [ S(s)  ]          (      [ Σ_ss^S(θ)   Σ_s0^S(θ) ]  )
    [ S(s0) ]   ∼   G  (  0,  [ Σ_0s^S(θ)   Σ_00^S(θ) ]  ).

Diggle et al. (1998) assume that the data Z(s) are independent of S(s0 ) (which
we note in passing seems counterintuitive to the geostatistical paradigm) and


so predicting S(s0 ) given values of Z(s), β, and θ can be done by using
ordinary kriging of S(s0 ) assuming data S(s).
Once the entire two step algorithm is complete, prediction of data Z(s0 )
is done using g −1 (X(s)β + S(s0 )). Diggle et al. (1998) called their approach
“model based geostatistics,” a name we disagree with for reasons that should
be obvious from Chapter 5.
As with other Bayesian hierarchical formulations for modeling spatial data,
care must be taken when choosing prior distributions, particularly those for
spatial autocorrelation parameters. The cautions in Berger et al. (2001) apply
here as well, and recent work by Zhang (2004) indicates that not all parameters of the Matérn class can be consistently estimated in model-based geostatistics.
Moyeed and Papritz (2002) compare this approach (MK) to ordinary kriging
(OK), lognormal kriging (LK), and disjunctive kriging (DK) using a validation
study based on 2649 measurements on copper (Cu) and cobalt (Co) concen-
trations. They predicted values at each of 2149 locations, using predictions
from each model constructed from 500 observations. Their results are similar
to those of De Oliveira et al. (1997):
1. The accuracy of the predictions was essentially the same for all methods.
2. The empirical probability coverages for OK were not nominal; OK gave
smaller coverage at the tails and higher coverage for central values than
nominal.
3. The degree to which the empirical probability coverages departed from
nominal depends on the strength of autocorrelation, supporting a similar
finding by Zimmerman and Cressie (1992) (§5.5.4). For Co, with relatively
weak spatial dependence, the empirical probability coverages for OK were
about the same as those for MK with a spherical covariance structure. For
Cu, with stronger spatial dependence, the empirical probability coverages
for OK were much worse, while those for MK were far closer to nominal.
4. The MK method is sensitive to the choice of the parametric covariance
function used to construct ΣS (θ). For some choices, the empirical prob-
ability coverages were close to nominal; for other choices the results were
similar to OK or LK.
5. The empirical probability coverages for LK and DK were as good or bet-
ter than those from MK, even though LK and DK do not account for the
uncertainty in estimated autocorrelation parameters.
While most Bayesian methods for spatial prediction are compared to tra-
ditional kriging approaches, this is not the most enlightening comparison.
Better comparisons would use OK with a t-statistic to ac-
count for the small sample size, OK or universal kriging with the Kackar and
Harville-type adjustments, or results from kriging-based geostatistical simu-
lation methods (Chapter 7), invented primarily for the purpose of adjusting
for the smoothness in predictions from kriging.
6.5 Chapter Problems

Problem 6.1 Consider a Gaussian linear mixed model for longitudinal data.
Let Yi denote the (ni × 1) response vector for subject i = 1, · · · , s. If Xi and
Zi are (ni × p) and (ni × q) matrices of known constants and bi ∼ G(0, G),
then the conditional and marginal distributions in the linear mixed model are
given by
Yi |bi ∼ G(Xi β + Zi bi , Ri )
Yi ∼ G(Xi β, Vi )
Vi = Zi GZ′i + Ri
(i) Derive the maximum likelihood estimator for β based on the marginal
distribution of Y = [Y′1 , · · · , Y′s ]′.
(ii) Formulate a model in which to perform local estimation. Assume that
you want to localize the conditional mean function E[Yi |bi ].

Problem 6.2 In the first-difference approach model (6.12), describe the cor-
relation structure among the observations in a column that would lead to
uncorrelated observations when the differencing matrix (6.13) is applied.

Problem 6.3 Show that the OLS residuals (6.4) on page 307 have mean zero,
even if the assumption that Var[e(s)] = σε²I is not correct.

Problem 6.4 In model (5.24) show that


    Var[Z(si )] − h(θ)′i Cov[Z(si ), Z(s)] ≥ 0,
where h(θ)′i is the ith row of
    H(θ) = X(s)(X(s)′Σ(θ)⁻¹X(s))⁻¹X(s)′Σ(θ)⁻¹.

Problem 6.5 Consider a statistical model of the following form


Y = Xβ + e e ∼ (0, V).
Assume that V is a known, positive definite matrix and that the model is fit
by (a) ordinary least squares and (b) generalized least squares. The leverage
matrices of the two fits are
(a) Hols = X(X′X)⁻X′
(b) Hgls = X(X′V⁻¹X)⁻¹X′V⁻¹
(i) If X contains an intercept, find lower and upper bounds for the diagonal
elements of Hols . Can you find similar bounds for Hgls ?
(ii) Find the lower bound for the diagonal elements of Hgls if X = 1.
(iii) Compare tr{Hols } and tr{Hgls }.
Problem 6.6 Consider a statistical model of the following form


Y = Xβ + e e ∼ (0, V).
Assume that V is positive definite so there exists a lower triangular matrix L
such that V = LL′. An alternative model to the one given is
L−1 Y = L−1 Xβ + L−1 e
= X∗ β + e∗
e∗ ∼ (0, I).
(i) Derive the estimators for β in both models and compare.
(ii) Find the projector in the second model and compare its properties to
the oblique projector Hgls from the previous problem.

Problem 6.7 In a standard regression model Y = Xβ + e, e ∼ (0, σ 2 I),


where X contains an intercept, the OLS residuals
    êi = yi − ŷi = yi − x′i (X′X)⁻¹X′Y
have expectation zero. Furthermore, Σ_{i=1}^{n} êi = 0. Now assume that Var[e] =
V and that β is estimated by GLS (EGLS). Which of the mentioned properties
of the OLS residuals are shared by the GLS (EGLS) residuals?

Problem 6.8 Suppose that t treatments are assigned to field plots arranged
in a single long row so that each treatment appears in exactly r plots. To
account for soil fertility gradients as well as measurement error, the basic
model is
    Z(s) = Xτ + S(s) + ε,
where S(s) ∼ (0, σ²Σ) and ε ∼ (0, σε²I). Assume it is known that the random field {S(s)} is such that the (rt − 1) × rt matrix of first differences (∆, see (6.13) on page 320) transforms Σ as follows: ∆Σ∆′ = I.
(i) Describe the estimation of treatment effects τ and the variance components σ² and σε² in the context of a linear mixed model, §6.2.1.3.
(ii) Is this a linear mixed model with or without correlations in the condi-
tional distribution? Hint: After the transformation matrix ∆ is applied, what
do you consider to be the random effects in the linear mixed model?

Problem 6.9 In the mixed model formulation of the linear model with au-
tocorrelated errors (§6.2.1.3), verify the expression for C in (6.28) and the
expressions for the mixed model equation solutions, (6.29)–(6.30).

Problem 6.10 In the mixed model formulation of §6.2.1.3, show that C in


(6.28) is the variance matrix of [β̂′, (υ̂(s) − υ(s))′]′. Verify that the element in the lower right corner of C is the prediction mean squared error of υ̂(s).

Problem 6.11 Show that the mixed model predictor (6.31) has the form of
a filtered, universal kriging predictor.
Problem 6.12 Imagine a lattice process on a 2 × 3 rectangle. The sites


s1 , s2 , s3 make up the first row, the remaining sites make up the second row.
Assume that a spatial connectivity matrix is given by

          ⎛ 0 1 0 1 0 0 ⎞
          ⎜ 1 0 1 0 1 0 ⎟
          ⎜ 0 1 0 0 0 1 ⎟
    W  =  ⎜ 1 0 0 0 1 0 ⎟
          ⎜ 0 1 0 1 0 1 ⎟
          ⎝ 0 0 1 0 1 0 ⎠
For a simultaneous and a conditional autoregressive scheme with Var[Z(s)] = σ²(I − ρW)⁻¹(I − ρW′)⁻¹ and Var[Z(s)] = σ²(I − ρW)⁻¹, respectively, perform the following tasks for σ² = 1, ρ = 0.25.
(i) Derive the variance-covariance matrix for the SAR and CAR scheme.
(ii) Determine which of the processes is second-order stationary.
(iii) Describe the correlation pattern that results. Are observations equi-
correlated that are the same distance apart? Do correlations decrease with
increasing lag distance?

Problem 6.13 Consider the following exponential family distributions:


(a) Binomial(n, π):
    Pr(Y = y) = (n choose y) π^y (1 − π)^{n−y},   y = 0, 1, · · · , n
(b) Gaussian, G(µ, σ²):
    f (y) = (σ√(2π))⁻¹ exp{−0.5(y − µ)²/σ²},   −∞ < y < ∞
(c) Negative Binomial, NegBin(k, µ):
    Pr(Y = y) = [Γ(y + 1/k)/(Γ(y + 1)Γ(1/k))] (kµ)^y /(1 + kµ)^{y+1/k},   y = 0, 1, 2, · · ·
(d) Inverse Gaussian, IG(µ, σ):
    f (y) = (σ√(2πy³))⁻¹ exp{ −(1/(2y)) ((y − µ)/(µσ))² },   y > 0

(i) Write the densities or mass functions in exponential form


(ii) Identify the canonical link, variance function, and scale parameter.
(iii) Is the canonical link a reasonable link for modeling? If not, suggest an
appropriate link.
(iv) Assume a random sample of size m from each of these distributions.
For the distributions with a scale parameter, can you profile this parameter
from the likelihood?
Problem 6.14 If Y = log{X} has a Gaussian distribution, then X = exp{Y }


has a lognormal distribution.
(i) Derive the density of X, f (x).
(ii) Prove or disprove, whether the lognormal distribution is a member of
the exponential family.
(iii) Name a distribution that is a member of the exponential family and
has a similar shape and the same support as the lognormal distribution.

Problem 6.15 If Z|n ∼ Binomial(n, π) and n ∼ Poisson(λ), find the distribu-
tion of Z. Is it more or less dispersed than the distribution of Z|n?

Problem 6.16 Repeat the previous exercise where n is a truncated Poisson


variate, that is,
    Pr(n = m) = λ^m exp{−λ} / [m! (1 − exp{−λ})],   m = 1, 2, · · · .

Problem 6.17 For the data on blood lead levels in children in Virginia, 2000,
repeat the analysis in Example 1.4 using models based on the Binomial dis-
tribution.
CHAPTER 7

Simulation of Random Fields

Simulating spatial data is important on several grounds.


• The worth of a statistical method for georeferenced data can often be estab-
lished convincingly only if the method has exhibited satisfactory long-run
behavior.
• Statistical inference for spatial data often relies on randomization tests,
e.g., tests for spatial point patterns. The ability to simulate realizations
of a hypothesized process quickly and efficiently is important to allow a
sufficient number of realizations to be produced.
• Most spatial data sets lack replication; simulation provides the repeated observation of a phenomenon needed to obtain empirical estimates of mean, variation, and covariation.

Constructing a realization of a random field {Z(s) : s ∈ D ⊂ Rd } is not a


trivial task, since it requires knowledge of the spatial distribution
Pr(Z(s1 ) < z1 , Z(s2 ) < z2 , · · · , Z(sn ) < zn ).
From a particular set of data we can only infer, under stationarity assump-
tions, the first and second moment structure of the field. Inferring the joint
distribution of the Z(si ) from the mean and covariance function is not possible
unless the random field is a Gaussian random field (GRF). Even if the spatial
distribution were known, it is usually not possible to construct a realization
via simulation from it. The best we may be able to do—unless the random
field is Gaussian—is to produce realizations whose marginal or maybe bivari-
ate distributions agree with the respective distributions of the target field. In
other instances, for example, the simulation of correlated counts, even this
condition may be difficult to attain. Generating a random field in which the
Z(s) are marginally Bernoulli(π) random variables with a particular covari-
ance function may be impossible, because the model itself is vacuous. What
we may be able to accomplish is to generate random deviates with known au-
tocovariance function whose marginal moments (mean and variance) “behave
like” those of Bernoulli(π) variables.
In this Chapter we review several methods for generating spatial data with
the help of computers and random number generators. Some methods will gen-
erate spatial data with known spatial distribution, for example, the methods
to generate Gaussian random fields (§7.1). Other methods generate data that
comply only to first and second moment assumptions, that is, data whose
mean, variance, and covariance function are known. Particularly important


among the latter methods are those that generate data behaving like counts
or proportions. Convolution theory (§7.4) can be used to simulate such data
as well as simulated annealing methods (§7.3).

7.1 Unconditional Simulation of Gaussian Random Fields

The Gaussian random field holds a core position in the theory of spatial
data analysis much in the same vein as the Gaussian distribution is key to
many classical approaches of statistical inferences. Best linear unbiased krig-
ing predictors are identical to conditional means in Gaussian random fields,
establishing their optimality beyond the class of linear predictors. Second-
order stationarity in a Gaussian random field implies strict stationarity. The
statistical properties of estimators derived from Gaussian data are easy to
examine and test statistics usually have a known and simple distribution.
Hence, creating realizations from GRFs is an important task. Fortunately, it
is comparatively simple.
Chilès and Delfiner (1999, p. 451) discuss an instructive example that high-
lights the importance of simulation. Imagine that observations are collected
along a transect at 100-meter intervals measuring the depth of the ocean floor.
The goal is to measure the length of the profile. One could create a continuous
profile by kriging and then obtain the length as the sum of the segments be-
tween the observed transect locations. Since kriging is a smoothing of the data
in-between the observed locations, this length would be an underestimate of
the profile length. In order to get a realistic estimate, we need to generate val-
ues of the ocean depth in-between the 100-meter sampling locations that are
consistent with the stochastic variation we would have seen, had the sampling
interval been shorter.
In this example it is reasonable that the simulated profile passes through the
observed data points. After all, these were the values which were observed, and
barring measurement error, reflect the actual depth of the ocean. A simulation
method that honors the data in the sense that the simulated value at an
observed location agrees with the observed value is termed a conditional
simulation. Simulation methods that do not honor the data, for example,
because no data has yet been collected, are called unconditional simulations.
Several methods are available to simulate GRFs unconditionally, some are
more brute-force than others. The simplest—and probably crudest— method
relies on the reproductive property of the (multivariate) Gaussian distribution
and the fact that a positive-definite matrix Σ can be represented as

    Σ = Σ^{1/2}(Σ^{1/2})′.

If Y ∼ G(µ, Σ), and X ∼ G(0, I), then

µ + Σ1/2 X
has a G(µ, Σ) distribution. Two of the elementary ways of obtaining a square


root matrix of the variance-covariance matrix Σ are the Cholesky decompo-
sition and the spectral decomposition.

7.1.1 Cholesky (LU) Decomposition

If Σn×n is a positive definite matrix, then there exists an upper triangular ma-
trix Un×n such that Σ = U! U. The matrix U is called the Cholesky root of Σ
and is unique up to sign (Graybill, 1983). Since U! is lower-triangular and U is
upper-triangular, the decomposition is often referred to as the lower-upper or
LU decomposition. Many statistical packages can calculate a Cholesky root,
for example, the root() function of the SAS/IML® module. Since Gaussian
random number generators are also widely available, this suggests a simple
method of generating data from a Gn (µ, Σ) distribution. Generate n indepen-
dent standard Gaussian random deviates and store them in vector x. Calculate the Cholesky root U′ of the variance-covariance matrix Σ and the (n × 1) vector of means µ. Return y = µ + U′x as a realization from a G(µ, Σ) distribution. This approach works well for small to moderate sized problems. As n grows large, however, calculating the Cholesky decomposition is numerically expensive.
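The text points to SAS/IML for these computations; the following NumPy sketch (an assumed, illustrative implementation, not the book's code) carries out the same recipe for an exponential covariance function on a 10 × 10 grid:

    import numpy as np

    rng = np.random.default_rng(42)

    # locations on a 10 x 10 unit grid
    xx, yy = np.meshgrid(np.arange(10), np.arange(10))
    s = np.column_stack([xx.ravel(), yy.ravel()]).astype(float)
    n = s.shape[0]

    # exponential covariance sigma2 * exp(-||h|| / range); parameter values are illustrative
    d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
    Sigma = 1.0 * np.exp(-d / 3.0)

    mu = np.zeros(n)
    L = np.linalg.cholesky(Sigma)      # lower-triangular root, Sigma = L L'
    x = rng.standard_normal(n)         # independent G(0, 1) deviates
    y = mu + L @ x                     # one realization from G(mu, Sigma)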

7.1.2 Spectral Decomposition

A second method of generating a square root matrix of Σ relies on the spectral


decomposition of a real symmetric matrix. If A is a (p × p) real symmetric matrix, then there exists a (p × p) orthogonal matrix P such that
    A = P∆P′,
where ∆ is a diagonal matrix containing the eigenvalues of A. Since P′P = I, the matrix
    Σ^{1/2} = P∆^{1/2}P′
has the needed properties to function as the square root matrix to generate G(µ, Σ) deviates by
    y = µ + Σ^{1/2}x.
The spectral decomposition can be calculated in the SAS® System with the eigen function of the SAS/IML® module.
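A corresponding sketch for the spectral route, again an illustrative NumPy version rather than the SAS/IML code mentioned in the text; clipping tiny negative eigenvalues guards against round-off when Σ is nearly singular:

    import numpy as np

    def sqrt_spectral(Sigma):
        """Square root matrix P Delta^{1/2} P' from the spectral decomposition of Sigma."""
        evals, P = np.linalg.eigh(Sigma)        # Sigma = P diag(evals) P', P orthogonal
        evals = np.clip(evals, 0.0, None)       # guard against tiny negative round-off
        return P @ np.diag(np.sqrt(evals)) @ P.T

    rng = np.random.default_rng(0)
    d = np.abs(np.subtract.outer(np.arange(25.0), np.arange(25.0)))  # transect distances
    Sigma = np.exp(-d / 4.0)                    # exponential covariance (illustrative)
    mu = np.full(25, 2.0)

    y = mu + sqrt_spectral(Sigma) @ rng.standard_normal(25)  # one G(mu, Sigma) realization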

7.2 Conditional Simulation of Gaussian Random Fields

A conditional simulation is a realization S(s) of a random field Z(s) that


honors the observed values of Z(s). Let Z(s1 ), Z(s2 ), · · · , Z(sm ) denote the
observed values of a random field. A conditional simulation produces n =
m + k values
    S(s) = [Z(s1 ), Z(s2 ), · · · , Z(sm ), S(sm+1 ), · · · , S(sm+k )]′.
If m = 0, the simulation is unconditional. Some methods for conditional simu-


lation condition on the data directly, while others start with an unconditional
simulation which is then conditioned.

7.2.1 Sequential Simulation

The idea of sequential simulation is simple. For the general case consider
simulating a (n×1) random vector Y with known distribution F (y1 , · · · , yn ) =
Pr(Y1 ≤ y1 , · · · , Yn ≤ yn ). The joint cdf can be decomposed into conditional
distributions

F (y1 , · · · , yn ) = F (y1 )F (y2 |y1 ) × · · · × F (yn |y1 , · · · , yn−1 ).

If the conditional distributions are accessible, the joint distribution can be


generated from the conditionals. The name of the method stems from the
sequential nature of the decomposition. In the spatial case, the conditioning
sequence can be represented as

Pr (S(sm+1 ) ≤ sm+1 , · · · , S(sm+k ) ≤ sm+k |z(s1 ), · · · , z(sm ))


= Pr (S(sm+1 ) ≤ sm+1 |z(s1 ), · · · , z(sm ))
× Pr (S(sm+2 ) ≤ sm+2 |z(s1 ), · · · , z(sm ), s(sm+1 ))
..
.
× Pr (S(sm+k ) ≤ sm+k |z(s1 ), · · · , z(sm ), s(sm+1 ), · · · , s(sm+k−1 ))

The advantage of sequential simulation is that it produces a random field with not only the correct covariance structure but also the correct spatial distri-

not only with the correct covariance structure, but the correct spatial distri-
bution. The disadvantage is having to work out the conditional distributions.
In one particular case, when Z(s) is a GRF, the conditional distributions are
simple. If

    [ Z(s0) ]         (  [ µ0 ]   [ σ²   c0′ ]  )
    [ Z(s)  ]   ∼  G  (  [ µ  ] , [ c0    Σ  ]  ),

then Z(s0 )|Z(s) is Gaussian distributed with mean E[Z(s0 )|Z(s)] = µ0 + c0′Σ⁻¹(Z(s) − µ) and variance Var[Z(s0 )|Z(s)] = σ² − c0′Σ⁻¹c0 . If the mean of
the random field is known, this is the simple kriging predictor and the corre-
sponding kriging variance. Thus we can calculate the conditional distributions
of S(sm+i ) given z(s1 ), · · · , z(sm ), s(sm+1 ), · · · , s(sm+i−1 ) as Gaussian. The mean of the distribution is the simple kriging predictor of S(sm+i ) based on the data Z(s1 ), · · · , Z(sm ) and S(sm+1 ), · · · , S(sm+i−1 ). Notice that if m = 0,
this leads to an unconditional simulation of a GRF where successive values are
random draws from Gaussian distributions. The means and variances of these
distributions are updated sequentially. Fortunately, the stochastic properties
are independent of the order in which the S(s) values are being generated.
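The following sketch illustrates the unconditional case (m = 0) on a transect: each new value is drawn from the Gaussian conditional distribution whose mean and variance are the simple kriging predictor and kriging variance computed from the previously simulated values. The covariance function and grid are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(7)
    s = np.arange(50.0)                           # simulation locations along a transect
    cov = lambda h: np.exp(-np.abs(h) / 5.0)      # C(h), sigma^2 = 1 (illustrative)
    mu = 0.0

    S = np.empty_like(s)
    S[0] = mu + rng.standard_normal() * np.sqrt(cov(0.0))
    for i in range(1, len(s)):
        Sig = cov(np.subtract.outer(s[:i], s[:i]))     # covariance of previous values
        c = cov(s[:i] - s[i])                          # covariance with the new location
        w = np.linalg.solve(Sig, c)                    # simple kriging weights
        m_i = mu + w @ (S[:i] - mu)                    # conditional (kriging) mean
        v_i = cov(0.0) - w @ c                         # conditional (kriging) variance
        S[i] = m_i + rng.standard_normal() * np.sqrt(max(v_i, 0.0))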
7.2.2 Conditioning a Simulation by Kriging

Consider a random field with covariance function C(h), sampled at locations


s1 , · · · , sm and vector of realizations Z(s) = [Z(s1 ), · · · , Z(sm )]! . We want
to simulate a random field with the same mean and covariance structure
as Z(s), but ensure that the realization passes through the observed values
Z(s1 ), · · · , Z(sm ). This can be accomplished based on an unconditional simu-
lation S(s) of the random field with the same covariance function,
Cov[S(s), S(s + h)] = Cov[Z(s), Z(s + h)],
as follows. The decomposition
Z(s) = psk (s; Z) + Z(s) − psk (s; Z)
holds always, but the simple kriging residual Z(s)−psk (s; Z) is not observable.
Instead we can substitute for it S(s)−psk (s; Sm ), where psk (s; Sm ) denotes the
simple kriging predictor at location s based on the values of the unconditional
simulation at the locations s1 , · · · , sm where Z was observed. That is, Sm (s) =
[S(s1 ), · · · , S(sm )]! . We define the conditional realization as
Zc (s) = psk (s; Z) + S(s) − psk (s; Sm )
      = S(s) + c′Σ⁻¹(Z − Sm ).   (7.1)

Note that (7.1) corrects the unconditional simulation at S(s) by the residual
between the observed data and the values from the unconditional simulation.
Also, it is not necessary that S(s) is simulated with the same mean as Z(s).
Any mean will do, for example, E[S(s)] = 0. It is left as an exercise to show
that (7.1) has the needed properties. In particular,
(i) For s0 ∈ {s1 , · · · , sm }, Zc (s0 ) = Z(s0 ); the realization honors the data;
(ii) E[Zc (s)] = E[Z(s)], i.e., the conditional simulation is (unconditionally)
unbiased;
(iii) Cov[Zc (s), Zc (s + h)] = Cov[Z(s), Z(s + h)] ∀h.
The idea of a conditional simulation is to reproduce data where it is known
but not to smooth the data in-between. The kriging predictor is a best linear
unbiased predictor of the random variables in a spatial process that smoothes
in-between the observed data. A conditional simulation of a random field
will exhibit more variability between the observed points than the kriging
predictor. In fact, it is easy to show that
    E[(Zc (s) − Z(s))²] = 2σsk²,

where σsk² is the simple kriging variance.
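A compact sketch of the conditioning step (7.1): an unconditional simulation S(s) with the same covariance function is corrected at every grid location by c′Σ⁻¹(Z − Sm ). All numerical inputs are illustrative; the final assertion checks property (i), that the realization honors the data.

    import numpy as np

    rng = np.random.default_rng(3)
    cov = lambda h: np.exp(-np.abs(h) / 5.0)            # shared covariance function (illustrative)

    grid = np.arange(0.0, 100.0)                        # simulation grid
    obs_idx = np.array([5, 30, 55, 80])                 # indices of the observed locations
    Z = np.array([1.2, -0.4, 0.8, 0.3])                 # observed data (illustrative)

    # unconditional simulation on the grid via the Cholesky route, mean 0
    D = np.abs(np.subtract.outer(grid, grid))
    S = np.linalg.cholesky(cov(D)) @ rng.standard_normal(len(grid))

    Sigma = cov(D[np.ix_(obs_idx, obs_idx)])            # covariance among data locations
    c = cov(D[:, obs_idx])                              # covariance, grid vs. data locations
    Zc = S + c @ np.linalg.solve(Sigma, Z - S[obs_idx]) # equation (7.1)

    assert np.allclose(Zc[obs_idx], Z)                  # the realization honors the data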

7.3 Simulated Annealing

On occasion we may want to constrain the simulations more than having the
proper covariance on average or honoring the data points. For example, we
may want all realizations have the same empirical covariance function than
an observed process. Imagine that m observations have been collected and the
empirical semivariogram γ '(h1 ), · · · , γ
'(hk ) has been calculated for a set of k lag
classes. A conditional realization of the random field is to be simulated that
honors the data Z(s1 ), · · · , Z(sm ) and whose semivariogram agrees completely
with the empirical semivariogram γ '(h).
To place additional constraints on the realizations, the optimization method
of simulated annealing (SA) can be used. It is a heuristic method that is used in
operations research, for example, to find solutions in combinatorial optimiza-
tion when standard mathematical programming tools fail. The monograph
by Laarhoven and Aarts (1987) gives a wonderful account of SA, its history,
theory, and applications.
The name of the method reveals a metallurgic connection. If a molten metal
is cooled slowly, the molecules can move freely, attaining eventually a state
of the solid with low energy and little stress. If cooling occurs too quickly,
however, the system will not be able to find the low-energy state because the
movement of the molecules is obstructed. The solid will not reach thermal
equilibrium at a given temperature, and defects are frozen into the solid.
Despite the name, simulated annealing originated in statistical physics. The
probability density function f (a) of state a in a system with energy U (a) and
absolute temperature T > 0, when the system reaches thermal equilibrium, is
given by the Boltzmann distribution,

f (a) = g(T )−1 exp {−U (a)/(kB T )} , (7.2)

where kB is the Boltzmann constant, g(T )−1 is a normalizing constant, and


exp{−U (a)/(kB T )} is known as the Boltzmann factor.
The probability that the system is in a high energy state at temperature
T tends to zero because the density function f (a) concentrates on the set of
states with minimum energy. The idea of simulated annealing is to continue
to generate realizations from the state space as the temperature T is reduced.
If the process converges, uniform sampling from the states that have mini-
mum energy is achieved. In statistical applications, the energy function U (a)
represents a measure of the discrepancy between features of the current state
(= current realization) and the intended features of the simulated process.
Simulated annealing is related to iterative improvement (II) algorithms
(Laarhoven and Aarts, 1987, p. 4). An II algorithm presupposes a state space
of configurations, a cost function, and a mechanism to generate a transition
from one state to another by means of small perturbations. Starting from an
initial configuration (state) of the system, transitions from the current to a
new state are selected if a new configuration lowers the cost function. If a
configuration is found such that no neighboring state has lower energy (cost),
the configuration is adopted as the solution. Unfortunately, II algorithms will
find a local minimum that depends on the initial configuration. Simulated
annealing avoids this pitfall by allowing the algorithm to accept a transition
if it increases the energy in the system and not only configurations that lower
the system’s energy. The key is to allow these transitions to higher energy
states with the appropriate frequency. In addition, the probability to accept
higher-energy states will gradually decrease in the process.
The connection of simulated annealing to optimization and an illustration
on why this procedure works, is given in the following example.

Example 7.1 Consider that we want to find the global maximum of the
function
    f(x) = exp{−½ U(x)} = exp{−½ (x − µ)²}.
Obviously, the function is maximized for x = µ when the energy function U (x)
takes on its minimum. In order to find the maximum of f (x) by simulated
annealing we sample from a system with probability distribution g*T(x) ∝ f(x)^{1/T}, where T denotes the absolute temperature of the system. For exam-
ple, we can sample from
    gT(x) = (2πT)^{−1/2} f(x)^{1/T} = (2πT)^{−1/2} exp{−(x − µ)²/(2T)}.
Note that gT(x) is the Gaussian density function with mean µ and variance
T . The process commences by drawing a single sample from gT (x) at high
temperatures and samples are drawn continuously while the temperature is
dropped. As T → 0, x → µ with probability one. From this simple example
we can see that the final realization will not be identical to the abscissa where
f (x) has its maximum, but should be close to it.

The connection of annealing a metal to the simulation of thermal equilib-


rium was drawn by the Monte Carlo approach of Metropolis et al. (1953).
First, we wish to generate a sequence of configurations of the solid in thermal
equilibrium at temperature T . Assume that the solid has energy Uu at the
uth step of the procedure. A randomly chosen particle is slightly displaced.
This perturbation will lead to a new energy state of the solid. If Uu+1 < Uu ,
the new configuration is accepted. Otherwise, the new configuration is ac-
cepted with probability exp{−(Uu+1 − Uu )/(kB T )}. After a large number of
random perturbations, the system should achieve thermal equilibrium at tem-
perature T . The application in optimization problems involves—in addition
to the described process—a controlled reduction of the temperature T . Af-
ter the system has reached thermal equilibrium, the temperature is reduced,
and a new equilibrium is sought. The entire process continues until, for some
small temperature, no perturbations lead to configurations with lower energy.
The acceptable random perturbations are then a random sampling from the
uniform distribution of the set of configurations with minimal energy.
The application of this iterative simulated annealing process to the simula-
tion of spatial random fields requires the following:

(i) An initial realization of the random field. This can be a conditional or


unconditional simulation. We call the realization at the uth iteration of the
process the uth image.
(ii) A rule by which to perturb the image to reallocate its molecules.
(iii) A rule by which to determine whether the perturbed image at iteration
u + 1 should be accepted or rejected compared to the image at iteration u.
(iv) A rule by which to change the temperature of the system, the cooling
schedule.

Denote by Su (s) the realization of the random field at stage u of the iterative
process; S0 (s) is the initial image. The most common rule of perturbing the
image is to randomly select two locations si and sj and to swap (exchange)
their values. If the simulation is conditional such that the eventual image
honors an observed set of data Z(s1 ), · · · , Z(sm ), we choose
    S0 (s) = [Z(s1 ), Z(s2 ), · · · , Z(sm ), S0 (sm+1 ), · · · , S0 (sm+k )]′,
and perturb only the values at the unobserved locations.
After swapping values we decide whether to accept the new image. Su+1 (s)
is accepted if Uu+1 < Uu . Otherwise the new image is accepted with proba-
bility fu+1 /fu , where
    fu ∝ exp{−Uu /Tu }.
Chilès and Delfiner (1999, p. 566) point out that when the temperature Tu
is reduced according to the cooling schedule Tu = c/log{u + 1}, the process
converges to a global minimum energy state. Here, c is chosen as the minimum
increase in energy that moves the process out of a local energy minimum—
that is not the global minimum—into a state of lower energy. In practice,
the cooling schedule is usually governed by simpler functions such as Tu =
T0 λ^u, λ < 1. The important issues here are the magnitude of the temperature
reduction (λ) and the frequency with which the temperature is reduced. It
is by no means necessary or prudent to reduce the temperature every time a
new image is accepted. At each temperature, the system should be allowed
to achieve thermal equilibrium. Goovaerts (1997, p. 420) discusses adjusting
the temperature after a sufficient number of perturbations have been accepted
(2n—10n) or too many have been tried unsuccessfully (10n—100n); n denotes
the number of sites on the image.
Simulated annealing is a heuristic, elegant, intuitive, brute-force method.
It starts with an initial image of n atoms and tries to improve the image by
rearranging the atoms in pairs. Improvement is measured by a user-defined
discrepancy function, the energy function U . Some care should be exercised
when simulating spatial random fields by this method.
• If several simulations are desired one should not start them from the same
initial image S0 (s), but from different initial images. Otherwise the real-
izations are too similar to each other.
• Care must be exercised in choosing and monitoring the objective function.


If U measures the discrepancy between a theoretical and the empirical
semivariogram for the current image and a set of k lags,
    Uu (h) = Σ_{s=1}^{k} (γ̂(hs ) − γ(hs ))²,

it is not necessarily desirable to achieve U = 0. This would imply that


all realizations in the set of realizations with minimum energy have the
same empirical semivariogram. If we sample a surface of a random field
with semivariogram γ(h), even a sampling conditional on m known values
of the surface, we expect the empirical semivariogram to agree with the
theoretical semivariogram within the limits of sampling variation. We do
not expect perfect agreement. If states with zero energy exist, it may be
advisable to stop the annealing algorithm before such states are reached to
allow the simulated system to represent uncertainty realistically.
• Since successive steps of the algorithm involve swapping values of the cur-
rent image, the quality of the initial image is important. An initial image
with high energy will require many iterations to achieve a low energy con-
figuration. The minimum energy configuration that is achievable must be
viewed in light of the initial image that is being used. Simulated annealing—
if the process converges—finds one of the configurations of the sites that
has lowest energy among the states that can be achieved starting from the
initial image. This lowest energy configuration is not necessarily a good
representation of the target random field if the initial image was chosen
poorly.
• Simulated annealing is a computer intensive method. Hundreds of thou-
sands of iterations are often necessary to find a low energy configuration.
It is thus important to be able to update the objective function between
perturbations quickly. For example, if the objective function monitors the
empirical semivariogram of the realization, at stage u we can calculate the
new semivariogram at lag h by subtracting the contribution of the swapped
values from the previous semivariogram and adding their contributions at
that lag to the new semivariogram. A complete re-calculation of the em-
pirical semivariogram is not needed (see Chapter problems).
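The following stripped-down sketch (illustrative choices throughout) puts the ingredients (i)–(iv) together for an unconditional simulation on a transect: the energy is the squared discrepancy between the image's empirical semivariogram and a target at a few lags, the perturbation swaps two values, uphill moves are accepted with probability exp{−(Uu+1 − Uu )/T}, and a geometric cooling schedule is used. For speed one would use the update formula of the chapter problems instead of recomputing the semivariogram from scratch.

    import numpy as np

    rng = np.random.default_rng(11)
    n, lags = 200, np.array([1, 2, 3, 5, 10])
    gamma_target = 1.0 - np.exp(-lags / 5.0)        # target semivariogram values (illustrative)

    def gamma_hat(z, lags):
        # Matheron estimator at integer lags on a transect
        return np.array([0.5 * np.mean((z[h:] - z[:-h]) ** 2) for h in lags])

    def energy(z):
        return np.sum((gamma_hat(z, lags) - gamma_target) ** 2)

    z = rng.normal(0.0, 1.0, n)                     # initial image: iid values
    U, T = energy(z), 1.0
    for it in range(20000):
        i, j = rng.integers(n, size=2)
        z[i], z[j] = z[j], z[i]                     # perturb: swap two values
        U_new = energy(z)
        if U_new < U or rng.random() < np.exp(-(U_new - U) / T):
            U = U_new                               # accept the perturbed image
        else:
            z[i], z[j] = z[j], z[i]                 # reject: undo the swap
        if (it + 1) % 1000 == 0:
            T *= 0.9                                # geometric cooling schedule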

7.4 Simulating from Convolutions

Simulating Gaussian random fields via the LU decomposition (§7.1) exploits


two important facts: linear combinations of Gaussian random variables are
Gaussian, and the mean and variance of a Gaussian random variable can be
determined independently. The only obstacle is the size of the spatial data
set being generated, calculating the Cholesky decomposition of the variance-
covariance matrix of a n-dimensional Gaussian distribution is computer-in-
tensive for large n.
The difficulties encountered in generating spatial data with non-Gaussian


distributions are formidable in comparison, however. In recent years, statis-
tical methods for analyzing correlated data have flourished. The generalized
estimating equation (GEE) approach of Liang and Zeger (1986) for longitudi-
nal data and Zeger and Liang (1986), for example, incorporates the correlation
with clustered data by choosing a working correlation matrix and estimating
the parameters of this matrix by the method of moments based on model
residuals. In essence, correlation structures common for Gaussian processes
are paired with models for the mean that draw on generalized linear models.
One appealing feature of the GEE method for longitudinal data is that the
estimates of the parameters in the mean model are consistent, provided the
estimators of the working correlation matrix are consistent and the variance
matrix satisfies certain limiting conditions. This does not require that the
working variance-covariance matrix converges to the true variance-covariance
matrix of the response; it requires that the working matrix converges to some-
thing. In addition, a sandwich estimator allows “robust” estimation of the
variance of the estimated coefficients, down-weighing the impact of having
employed a working correlation model, rather than the true one.
As a result of the popularity of the GEE approach, which requires only as-
sumptions about the first two moments, it may be tempting to combine models
for mean and covariance structure that maybe should not be considered in the
same breath. The resulting model may be vacuous. For example, when model-
ing counts correlated in time, you may aim for Poisson random variables with
an AR(1) correlation structure. A valid joint distribution with that correla-
tion structure in which all marginal distributions are Poisson may not exist.
The essential problem is the linkage of the mean and variance for many dis-
tributions, which imposes constraints on the joint distributions, see §5.8 and
§6.3.3. The acid test to determine whether pairing a certain marginal behavior
with a particular correlation model is appropriate is of course whether a valid
joint probability distribution exists. The sloppy test is to ask: “is there a way
to simulate such behavior?” It should give us pause if we intend to model
data as marginally Poisson with an exponential covariance function and no
mechanism comes to mind that could generate this behavior.
In many instances, statistical models for correlated data are based on first
and second moments. When modeling counts, for example, it is thus natural
to draw on distributions such as the Binomial, Poisson, or Negative Bino-
mial, to suggest a mean-variance relationship for the model. What is usu-
ally required is not that the data have marginally a known distribution, but
that they exhibit certain mean-variance behavior that concurs with models
used for uncorrelated data. Such data can be generated in many instances.
For example, is a model for Z(s) with E[Z(s)] = π, Var[Z(s)] = π(1 − π),
Cov[Z(s), Z(s + h)] = σ 2 exp{−||h||2 /α} vacuous? Most likely not, if a data-
generating mechanism can be devised. The convolution representation is help-
ful in the construction (and simulation) of this mechanism.
Recall that a spatial random field {Z(s) : s ∈ D ⊂ Rd } can be represented
as the convolution of a kernel function K(u) and an excitation field X(s),


    Z(s) = ∫_u K(s − u) X(u) du.
Z(s) will be second-order stationary, if X(s) is second-order stationary. In
addition, if the excitation field has E[X(s)] = µx , Var[X(s)] = σx2 , and
Cov[X(s), X(s + h)] = 0, h ≠ 0, then
(i) E[Z(s)] = µx ∫_u K(u) du;
(ii) Var[Z(s)] = σx² ∫_u K²(u) du;
(iii) Cov[Z(s), Z(s + h)] = σx² ∫_u K(u)K(u + h) du.

Since the kernel function governs the covariance function of the Z(s) pro-
cess, an intuitive approach to generating random fields is as follows. Generate
a dense field of the X(s), choose a kernel function K(u), and convolve the
two. This ensures that we understand (at least) the first and second moment
structure of the generated process. Unless the marginal distribution of X(s)
has some reproductive property, however, the distribution of Z(s) is difficult
to determine by this device. A notable exception is the case where X(s) is
Gaussian; Z(s) then will also be Gaussian.
If it is required that the marginal distribution of Z(s) is characterized by
a certain mean-variance relationship or has other moment properties, then
these can be constructed by matching moments of the desired random field
with the excitation process. We illustrate with an example.

Example 7.2 Imagine it is desired to generate a random field Z(s) that


exhibits the following behavior:
E[Z(s)] = λ, Var[Z(s)] = λ, Cov[Z(s), Z(s + h)] = C(h).
In other words, the Z(s) are marginally supposed to behave somewhat like Poisson(λ) variables. From λ = µx ∫K(u) du and λ = σx² ∫K²(u) du we obtain the condition
    σx² = µx ∫K(u) du / ∫K²(u) du.   (7.3)
K (u)du
For a given kernel function
B K(u) we could choose a Gaussian B white noise
field with mean µx = λ/ K(u) du and variance σx2 = λ/ K 2 (u) du. But
that does not guarantee that the Z(s) are non-negative. The near-Poisson(λ)
behavior should be improved if the stochastic properties of the excitation
field are somewhat close to the target. Relationship (7.3) suggests a count
variable whose variance is proportional to the mean. The Negative Binomial
random variable behaves this way. If X ∼ NegBinomial(a, b), then E[X] = ab,
Var[X] = ab(b + 1) = E[X](b + 1). By choosing the parameters a and b of the
Negative Binomial excitation field carefully, the convolved
B process will have
mean λ, variance λ, and covariance function C(h) = K(u)K(u + h) du.
Matching moments we obtain


    λ = ab ∫_u K(u) du   ⇔   ab = λ / ∫_u K(u) du,   (7.4)
    λ = ab(b + 1) ∫_u K²(u) du.   (7.5)
Substituting (7.4) into (7.5) leads to
    b = ∫K(u) du / ∫K²(u) du − 1,   (7.6)
and back-substitution yields
    a = λ ∫K²(u) du / { ∫K(u) du [ ∫K(u) du − ∫K²(u) du ] }.   (7.7)
Although these expressions are not pretty, they are straightforward to calculate, since K(u) is a known function and hence the integrals are constants. Additional simplifications arise when ∫K(u) du = 1, since then the range of values in the excitation field equals the range of values in the convolved process. It is also obvious that the kernel function must be chosen so that ∫K(u) du > ∫K²(u) du in this application for a and b to be non-negative.
Figure 7.1 shows (average) semivariograms of random fields generated by
this device. The domain of the process was chosen as a transect of length 500
with observations collected at unit intervals. A Gaussian kernel function
    K(s − u, h) = (h√(2π))⁻¹ exp{−(s − u)²/(2h²)}
was used and the range of integration was extended outside of the domain
to avoid edge effects. The goal was to generate a random field in which the
realizations have mean λ = 5, variance λ = 5 and are autocorrelated. The pa-
rameters a and b of the Negative Binomial excitation field were determined by
(7.6) and (7.7). The Gaussian kernel implies a Gaussian covariance function;
other covariance functions can be generated by choosing the kernel accordingly
(§2.4.2). For each of h = 5, 10, 15, 20, ten random convolutions were generated
and their empirical semivariograms calculated. Figure 7.1 shows the average
semivariogram over the ten repetitions for the different kernel bandwidths.
With increasing h, the smoothness of the semivariogram as well as its range
increases. The sample means and sample variances averaged over the ten re-
alizations are shown in Table 7.1. Even with only 10 simulations, the first two
moments of Z(s) match the target closely.
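A sketch of the mechanism of Example 7.2 on a transect with unit spacing: the kernel is discretized, a and b are obtained from the discrete analogues of (7.6) and (7.7), a Negative Binomial excitation field is generated, and the two are convolved. The discretization and the mapping to NumPy's negative_binomial parameterization are assumptions of this illustration.

    import numpy as np

    rng = np.random.default_rng(2024)
    lam, h = 5.0, 5.0                                   # target mean/variance and kernel bandwidth
    offsets = np.arange(-40, 41)                        # kernel support at unit spacing (assumed)
    K = np.exp(-0.5 * (offsets / h) ** 2) / (h * np.sqrt(2 * np.pi))

    c, d = K.sum(), (K ** 2).sum()                      # discrete analogues of int K and int K^2
    b = c / d - 1.0                                     # equation (7.6)
    a = lam * d / (c * (c - d))                         # equation (7.7)

    # NumPy's negative_binomial(n, p) has mean n(1-p)/p; taking n = a and p = 1/(1+b)
    # reproduces E[X] = ab and Var[X] = ab(b+1)
    n_sites = 500 + len(offsets) - 1                    # pad to avoid edge effects
    X = rng.negative_binomial(a, 1.0 / (1.0 + b), size=n_sites)
    Z = np.convolve(X, K, mode="valid")                 # convolved field on the transect

    print(Z.mean(), Z.var())                            # both should be close to lam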

Another important example is the generation of a random field in which the


Z(s) behave like Bernoulli(π) random variables, i.e., E[Z(s)] = π, Var[Z(s)] =
π(1 − π). Since π is the success probability in a Bernoulli experiment, it is
convenient to consider excitation fields with support (0, 1). For example, let
Table 7.1 Average sample mean and sample variance of simulated convolutions whose semivariograms are shown in Figure 7.1

    Bandwidth h      Z̄        s²
         5          4.98      4.29
        10          5.30      4.67
        15          5.23      4.02
        20          5.07      4.72

[Figure 7.1 here: empirical semivariogram versus lag distance (0–250) for kernel bandwidths h = 5, 10, 15, 20.]

Figure 7.1 Average semivariograms for Negative Binomial excitation fields convolved with Gaussian kernel functions of different bandwidth h (= standard deviation).

X(s) be a Beta(α, β) random variable with E[X(s)] = α/(α + β), Var[X(s)] =


αβ(α + β)−2 /(α + β + 1). Matching the moments of the convolution with the
target yields the equations
    π = [α/(α + β)] ∫_u K(u) du,
    π(1 − π) = [αβ / ((α + β)²(α + β + 1))] ∫_u K²(u) du.

It is left as an exercise to establish that the parameters α and β of the Beta


excitation field must be chosen to satisfy
    α = (π/c) { d(c − π) / [(1 − π)c²] − 1 },   (7.8)
    β = [(c − π)/c] { d(c − π) / [(1 − π)c²] − 1 },   (7.9)
where c = ∫K(u) du and d = ∫K²(u) du. Notice that if the kernel is chosen such that ∫K(u) du = 1, then the expressions simplify considerably: α = π(d − 1), β = (1 − π)(d − 1).

7.5 Simulating Point Processes

Because of the importance of Monte Carlo based testing for mapped point
patterns, efficient algorithms to simulate spatial point processes are critical
to allow a sufficient number of simulation runs in a reasonable amount of
computing time. Since the CSR process is the benchmark for initial analysis
of an observed point pattern, we need to be able to simulate a homogeneous
Poisson process with intensity λ. A simple method relies on the fact that
if N(A) ∼ Poisson(λν(A)), then, given N(A) = n, the n events form a random sample from a uniform distribution on A (a Binomial process, §3.2).

7.5.1 Homogeneous Poisson Process on the Rectangle (0, 0) × (a, b) with Intensity λ

1. Generate a random number from a Poisson(λab) distribution


→ n.
2. Order n independent draws from a U (0, a) distribution
→ x1 < x2 < · · · < xn .
3. Generate n independent realizations from a U (0, b) distribution
→ y1 , · · · , yn .
4. Return (x1 , y1 ), · · · , (xn , yn ) as the coordinates of the homogeneous Poisson
process.
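A direct transcription of these four steps in NumPy (the rectangle dimensions and intensity are illustrative):

    import numpy as np

    def sim_homogeneous_poisson(lam, a, b, rng):
        """CSR realization on the rectangle (0, 0) x (a, b) with intensity lam."""
        n = rng.poisson(lam * a * b)           # step 1: total number of events
        x = np.sort(rng.uniform(0, a, n))      # step 2: ordered x-coordinates
        y = rng.uniform(0, b, n)               # step 3: y-coordinates
        return np.column_stack([x, y])         # step 4: event coordinates

    events = sim_homogeneous_poisson(lam=0.5, a=20.0, b=10.0, rng=np.random.default_rng(5))
    print(len(events), "events")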

Extensions of this algorithm to processes in Rd are immediate. When comparing an observed pattern with simulated ones, the number of events in the simulated patterns is typically set equal to that in the observed pattern. In that case, step 1 of the algorithm is omitted and n for step 2 is set equal to the number of
of the algorithm is omitted and n for step 2 is set equal to the number of
observed points. If the study region is of irregular shape, a Poisson process
can be generated on a bounding rectangle that encloses the study region.
Points that fall outside the area of interest are removed.
7.5.2 Inhomogeneous Poisson Process with Intensity λ(s)

Lewis and Shedler (1979) suggested the following acceptance/rejection algo-


rithm to simulate a Poisson process on A with spatially varying intensity.
1. Simulate with intensity λ0 ≥ max{λ(s)} a homogeneous Poisson process
on A
→ (x1 , y1 ), · · · , (xm , ym ).
2. Generate a U (0, 1) realization (independently of the others) for each event
in A after step 1
→ u1 , · · · , um .
3. Retain event i if ui ≤ λ(si )/λ0 , where si is the location of the ith event; otherwise remove the event.
The initial step of the Lewis-Shedler algorithm generates a homogeneous
Poisson process whose intensity dominates that of the inhomogeneous process
everywhere. Steps 2 and 3 are thinning steps that remove excessive events with
the appropriate frequency. The algorithm works for any λ0 ≥ max{λ(s)}, but
if λ0 is chosen too large many events will need to be thinned in steps 2 and
3. It is sufficient to choose λ0 = max{λ(s)}.
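A sketch of the thinning algorithm for an illustrative intensity surface on the unit square, with λ0 taken as the maximum of λ(s):

    import numpy as np

    def lam(x, y):
        return 100.0 * np.exp(-3.0 * x)        # illustrative spatially varying intensity

    def sim_inhomogeneous_poisson(rng):
        lam0 = 100.0                           # max of lam(s) on the unit square
        n = rng.poisson(lam0)                  # step 1: dominating homogeneous process
        x, y = rng.uniform(size=n), rng.uniform(size=n)
        u = rng.uniform(size=n)                # step 2: independent U(0, 1) marks
        keep = u <= lam(x, y) / lam0           # step 3: thin with probability lam(s)/lam0
        return np.column_stack([x[keep], y[keep]])

    events = sim_inhomogeneous_poisson(np.random.default_rng(9))
    print(len(events), "retained events")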

7.6 Chapter Problems

Problem 7.1 Generate data on a 10 × 10 regular lattice by drawing iid ob-


servations from a G(µ, σ 2 ) distribution and assign the realizations at random
to the lattice positions. Given a particular nearest-neighbor definition (rook,
bishop, queen move), produce a rearrangement of the observations by simu-
lated annealing such that the sample correlation between Z(si ) and the aver-
age of its neighbors takes on some value. Compare the empirical semivariogram
between the lattice arrangements so achieved and the original arrangement
for various magnitudes of the sample correlations.
Things to consider: you will have to decide on a cooling schedule, the number of successful
perturbation attempts before a new temperature is selected, and the number of unsuccessful
attempts. How do you select the convergence criterion? How close is the actual sample
correlation to the desired one upon convergence? How do your choices affect the end result?

Problem 7.2 Consider the Matheron estimator of the empirical semivari-


ogram (see §4.4.1). Swap two values, Z(si ) and Z(sj ). Give an update for-
mula for the empirical semivariogram that avoids re-calculation of the entire
semivariogram.

Problem 7.3 Consider a bivariate and a trivariate Gaussian distribution.


Develop explicit formulas to generate a realization by sequential simulation.
Given an input vector for the mean µ and the variance matrix Σ of the
Gaussian distribution, write the software to generate realizations based on a
standard Gaussian random number generator.
Problem 7.4 In a conditional simulation by kriging, verify that the condi-


tional realization (7.1) satisfies the following requirements: (i) Zc (s) honors
the data, (ii) Zc (s) is unbiased in the sense that its mean equals E[Z(s)], (iii)
its variance and covariance function are identical to that of Z(s). Furthermore,
establish that E[(Zc (s) − Z(s))²] = 2σsk².

Problem 7.5 In a conditional simulation by kriging, establish whether the


conditional distribution of Zc (s) given the observed data Z(s1 ), · · · , Z(sm ) is
stationary. Hint: consider the conditional mean and variance.

Problem 7.6 Verify formulas (7.8) and (7.9), that is, show that a convolution
where X(s) ∼ Beta(α, β) satisfies E[Z(s)] = π, Var[Z(s)] = π(1 − π).

Problem 7.7 Use the convolution method to devise a mechanism to generate


Z(s) such that E[Z(s)] = kπ and Var[Z(s)] = kπ(1 − π).
CHAPTER 8

Non-Stationary Covariance

8.1 Types of Non-Stationarity

Throughout this text (second-order) stationarity of the stochastic process


was an important assumption, without which there was little hope to make
progress with statistical inference based on a sample of size one (§2.1). Recall
that a process {Z(s) : s ∈ D ⊂ Rd } is second-order (weakly) stationary, if
E[Z(s)] = µ and Cov[Z(s), Z(s + h)] = C(h). A non-stationary process is any
random field for which these conditions do not hold; some aspect of the spatial
distribution is not translation invariant; it depends on the spatial location.
A non-constant mean function and variance heterogeneity are two frequent
sources of non-stationarity. Mean and variance non-stationarity are not the fo-
cus of this chapter. Changes in the mean value can be accommodated in spatial
models by parameterizing the mean function in terms of spatial coordinates
and other regressor variables. Handling non-stationarity through fixed-effects
structure of the model was covered in Chapter 6. Variance heterogeneity can
sometimes be allayed by transformations of the response variables.
Non-stationarity is a common feature of many spatial processes, in par-
ticular those observed in the earth sciences. It can also be the result of
operations on stationary random fields. For example, let X(s) be a white
noise random field in R2 with mean µ and variance σ 2 . The domain con-
sists of subregions S1, · · · , Sk and we consider modeling the block aggregates
Z(S_i) = \int_{S_i} X(u) du. Obviously, the Z(Si) are neither mean constant nor
homoscedastic, unless ν(Si) = ν(Sj), ∀ i ≠ j. Lattice processes with unequal
areal units are typically not stationary and the formulation of lattice models
takes the variation and covariation structure into account. The type of non-
stationarity that is of concern in this section is covariance non-stationarity, the
absence of translation invariance of the covariance function in geostatistical
applications.
When the covariance function varies spatially, Cov[Z(s), Z(s+h)] = C(s, h),
two important consequences arise. One, the random field no longer “replicates
itself” in different parts of the domain. This implication of stationarity enabled
us to estimate the spatial dependency from pairs of points that shared the
same distance but without regard to their absolute coordinates. Two, the
covariogram or semivariogram models considered so far no longer apply.
The approaches to model covariance non-stationarity can be classified coarse-
ly into global and local methods. A global method considers the entire do-

main, a local method assumes that a globally non-stationarity process can be


represented as a combination of locally stationary processes. Parametric non-
stationary covariance models and space deformation methods are examples of
global modeling. The use of convolutions, weighted averages, and moving
windows is typical of local methods. The “replication” mechanisms needed
to estimate dispersion and covariation from the data are different in the two
approaches. Global methods, such as space deformation, require multiple ob-
servations for at least a subset of the spatial locations. Spatio-temporal data,
where temporal stationarity can be assumed, can work well in this case. Local
methods do not require actual replicate observations at a site, but local sta-
tionarity in a neighborhood of a given site. Estimates of variation and spatial
covariation then can be based on observations within the neighborhood.
It is sometimes stated that spatial prediction is not possible for non-statio-
nary processes. That assertion is not correct. Consider the ordinary kriging
predictor for predicting at a new location s0 ,
p_{ok}(Z; s_0) = \hat{µ} + c' Σ^{-1} (Z(s) − 1\hat{µ}),
where Σ is the variance matrix of the random field and c = Cov[Z(s0 ), Z(s)].
It is perfectly acceptable for the covariance matrix Σ to be that of a non-
stationary process. We require Σ to be positive definite, a condition entirely
separate from stationarity. There is, however, a problem with spatial predic-
tion for non-stationary processes. The elements of Σ are unknown and must
be estimated from the data. If Σ is parameterized, then the parameters of
Σ(θ) must be estimated. In either case, we require valid semivariogram or
covariance function models for non-stationary data and a mechanism for es-
timation.

8.2 Global Modeling Approaches

8.2.1 Parametric Models

If one understands the mechanisms that contribute to covariance non-statio-


narity, these can be incorporated in a model for the covariance structure. The
parameters of the non-stationary covariance model can subsequently be esti-
mated based on (restricted) maximum likelihood. Hughes-Oliver et al. (1998a)
present a correlation model for stochastic processes driven by one or few point
sources. Such a point source can be an industrial plant emitting air pollution
or waste water, or the center of a wafer in semiconductor processing (Hughes-
Oliver et al., 1998b). Their model for a single point source at location c is

Corr[Z(si), Z(sj)] = exp{−θ1 ||si − sj|| exp{θ2 |ci − cj| + θ3 min[ci, cj]}},    (8.1)

where ci = ||si − c|| and cj = ||sj − c|| are the distances between the sites and the point source. This model is non-stationary because the correlation between sites si and sj depends on the site distances from the point source through ci and cj.

The Hughes-Oliver model is a clever generalization of the exponential cor-


relation model. First note that when θ2 = θ3 = 0, (8.1) reduces to the ex-
ponential correlation model with practical range α = 3/θ1 . In general, the
correlation between two sites si and sj is that of a process with exponential
correlation model and practical range
α(si, sj) = 3 exp{−θ2 |ci − cj| − θ3 min[ci, cj]}/θ1.
Consider two sites equally far from the point source, so that ci = cj . Then,
α(si, sj) = 3 exp{−θ3 ||si − c||}/θ1.
The spatial correlation will be small if θ3 is large and/or the site is far removed
from the point source.
The correlation model (8.1) assumes that the effects of the point source
are circular. Airborne pollution, for example, does not spread in a circular pattern. Point source anisotropy can be incorporated by correcting distances for
geometric anisotropy,
Corr[Z(si), Z(sj)] = exp{−θ1 h∗ij exp{θ2 |c∗i − c∗j| + θ3 min[c∗i, c∗j]}},

where h∗ij = ||A(si − sj)||, c∗i = ||Ac(si − c)||, and A, Ac are anisotropy matrices.
To establish whether a non-stationary correlation model such as (8.1) is
valid can be a non-trivial task. Hughes-Oliver et al. (1998a) discuss that the
obvious requirements θ1 > 0, θ2 , θ3 ≥ 0 are necessary but not sufficient for
a covariance matrix defined by (8.1) to be positive semi-definite. If it is not
possible to derive conditions on the model parameters that imply positive
semi-definiteness, then one must examine the eigenvalues of the estimated
covariance or correlation matrix to ensure that at least the estimated model
is valid.
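As a quick numerical check of this kind, the following sketch (in Python, with illustrative parameter values; the function name hughes_oliver_corr is ours) builds the correlation matrix implied by (8.1) for a set of sites and inspects its smallest eigenvalue; a negative value would flag an invalid parameter combination.

```python
import numpy as np

def hughes_oliver_corr(sites, c, theta1, theta2, theta3):
    """Correlation matrix implied by the point-source model (8.1).

    sites : (n, 2) array of coordinates; c : (2,) point-source location.
    theta1 > 0 and theta2, theta3 >= 0 are assumed (necessary, not sufficient).
    """
    d_src = np.linalg.norm(sites - c, axis=1)                         # c_i = ||s_i - c||
    h = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=2)  # ||s_i - s_j||
    ci, cj = np.meshgrid(d_src, d_src, indexing="ij")
    return np.exp(-theta1 * h * np.exp(theta2 * np.abs(ci - cj)
                                       + theta3 * np.minimum(ci, cj)))

# Since the parameter constraints do not guarantee positive definiteness,
# inspect the eigenvalues of the matrix that was actually fitted.
rng = np.random.default_rng(1)
sites = rng.uniform(0.0, 10.0, size=(50, 2))
R = hughes_oliver_corr(sites, c=np.array([5.0, 5.0]),
                       theta1=0.3, theta2=0.05, theta3=0.05)
print("smallest eigenvalue:", np.linalg.eigvalsh(R).min())
```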

8.2.2 Space Deformation

If a random process does not have the needed attributes for statistical infer-
ence, it is common to employ a transformation of the process that leads to
the desired properties. Lognormal kriging (§5.6.1) or modeling a geometrically
anisotropic covariance structure (§4.3.7) are two instances where transforma-
tions in spatial statistics are routinely employed. An important difference be-
tween the two types of transformations is whether they transform the response
variable (lognormal kriging) or the coordinate system (anisotropic modeling).
Recall from §4.3.7 that if iso-correlation contours are elliptical, a linear trans-
formation s∗ = f (s) = As achieves the rotation and scaling of the coordinate
system so that covariances based on the s∗ coordinates are isotropic. If g() is

a variogram, then
Var[Z(si) − Z(sj)] = g(||f(si) − f(sj)||),    (8.2)

and ĝ(||f(si) − f(sj)||) is its estimate, assuming f is known.
This general idea can be applied to spatial processes with non-stationary
covariance function. The result is a method known as “space deformation:”
• find a function f that transforms the space in such a way that the covariance
structure in the transformed space is stationary and isotropic;
• find a function g such that Var[Z(si ) − Z(sj )] = g(||f (si ) − f (sj )||);
• if f is unknown, then a natural estimator is
\widehat{Var}[Z(si) − Z(sj)] = ĝ(||f̂(si) − f̂(sj)||).

The work by Sampson and Guttorp (1992) marks an important milestone


for this class of non-stationary models. We give only a rough outline of the approach; the interested reader is referred to the paper by Sampson and Guttorp
(1992) for details and to Mardia, Kent, and Bibby (1979, Ch. 14). The tech-
nique of multidimensional scaling, key in determining the space deformation,
was discussed earlier (see §4.8.2).
In the non-stationary case, neither g nor f are known and the Sampson-
Guttorp approach determines both in a two-step process. First, the “spatial
dispersions”
d^2_{ij} = \widehat{Var}[Z(si) − Z(sj)]
are computed and multidimensional scaling is applied to the matrix D = [dij ].
Starting from an initial configuration of sites, the result of this operation is a
set of points {s∗1 , · · · , s∗n } and a monotone function m, so that
m(dij ) ≈ ||s∗i − s∗j ||.
Since m is monotone we can solve dij = m−1 (||s∗i − s∗j ||) and obtain d2ij =
g(||s∗i − s∗j ||). The approach of Sampson and Guttorp (1992) in analyzing
spatio-temporal measurements of solar radiation is to assume temporal sta-
tionarity and to estimate the d2ij from sample moments computed at each
spatial location across the time points. Once multidimensional scaling has
been applied to the dij , a model g() is fit to the scatterplot of the d2ij versus
distances h∗ij = ||s∗i − s∗j || in the deformed space. This fit can be a paramet-
ric model or a nonparametric alternative as discussed in §4.6. Sampson and
Guttorp chose a nonparametric approach.
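A rough computational sketch of the two-step idea follows. It assumes the matrix of spatial dispersions has already been estimated (for instance from temporal replicates), uses metric multidimensional scaling from scikit-learn in place of the nonmetric scaling of Sampson and Guttorp, and a thin-plate spline for the mapping f rather than their smoothing splines; the function name deform_space is ours.

```python
import numpy as np
from sklearn.manifold import MDS
from scipy.interpolate import RBFInterpolator

def deform_space(sites, disp, seed=0):
    """Two-step space-deformation sketch.

    sites : (n, 2) monitoring locations; disp : (n, n) matrix of spatial
    dispersions d_ij^2, assumed to have been estimated beforehand (e.g.,
    from replicate observations over time).
    """
    # Step 1: (metric) multidimensional scaling on d_ij gives the deformed
    # configuration s*_1, ..., s*_n; the observed sites are a natural start.
    mds = MDS(n_components=2, dissimilarity="precomputed", random_state=seed)
    deformed = mds.fit_transform(np.sqrt(disp))
    # Step 2: a smooth map f with f(s_i) = s*_i, here a thin-plate spline,
    # carries arbitrary (unmonitored) locations into the deformed space.
    f = RBFInterpolator(sites, deformed, kernel="thin_plate_spline")
    return deformed, f

# A stationary, isotropic variogram g() would then be fit to the scatter of
# d_ij^2 against the distances ||s*_i - s*_j|| in the deformed space.
```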
The space deformation method presents several challenges to the analyst.
• An empirical semivariogram is needed as input for the multidimensional
scaling algorithm. Sampson and Guttorp (1992) call the d2ij the spatial dis-
persions instead of the sample variogram to disassociate the quantities from
notions of stationarity. How to obtain estimates of the d2ij without station-
arity assumptions or replication is not clear. Repeated measurements over
time of each spatial location and an assumption of temporal stationarity

enable the estimation of the spatial dispersions in Sampson and Guttorp’s


case.
• In order to compute the covariance between arbitrary points si and sj in
the domain of measurement, a smooth and injective mapping is required
that yields the corresponding coordinates of the points in the deformed
space in which the Euclidean distances hij are computed. In other words,
one needs a function f (s) = s∗ . Sampson and Guttorp (1992) determine f
as a smoothing spline.
• The result of the multidimensional scaling algorithm depends on the initial
configuration of sites. The observation sites are a logical choice.

8.3 Local Stationarity

A function f (x) that changes in x can be viewed as constant or approximately


constant in an interval [a, a + ε], provided |ε| > 0 is sufficiently small. A similar
idea can be applied to non-stationary random fields; even if the mean and/or
covariance function are non-stationary throughout the domain, it may be
reasonable to assume that
(i) the process is stationary in smaller subregions;
(ii) the process can be decomposed into simpler, stationary processes.
We label methods for non-stationary data making assumptions (i) or (ii) as
“local” methods. Three important representatives are moving window tech-
niques, convolutions, and weighted stationary processes.

8.3.1 Moving Windows

This technique was developed by Haas (1990) to perform spatial prediction in


non-stationary data and extended to the spatio-temporal case in Haas (1995).
To compute the ordinary kriging predictor
p_{ok}(Z; s_0) = \hat{µ} + c' Σ^{-1} (Z(s) − 1\hat{µ})
for a set of prediction locations s0 (1) , · · · , s0 (m) , one only needs to recompute
the vector of covariances between the observed and prediction location, c. For
large data sets, however, the inversion of the variance-covariance matrix Σ
is a formidable computational problem, even if performed only once. In ad-
dition, observations far removed from the prediction location may contribute
only little to the predicted value at s0 , their kriging weights are close to zero.
It is thus a commonplace device to consider for prediction at s0 (i) only those
observed sites within a pre-defined neighborhood of s0 (i) . This kriging win-
dow changes with prediction location and points outside of the window have
kriging weight 0 (see §5.4.2). The local kriging approach has advantages and
disadvantages. The predictor that excludes observed sites is no longer best and
the analyst must decide on the size and shape of the kriging neighborhood. As

points are included and excluded in the neighborhoods with changing predic-
tion location, spurious discontinuities can be introduced. On the upside, local
kriging is computationally less involved than solving the kriging equations
based on all n data points for every prediction location. Also, if the mean
is non-stationary, it may be reasonable to assume that the mean is constant
within the kriging window and to re-estimate µ based on the observations in
the neighborhood.
Whether the mean is estimated globally or locally, the spatial covariation
in local kriging is based on the same global model. Assume that the covari-
ances are determined based on some covariance or semivariogram model with
parameter vector θ. Local kriging can then be expressed as
p_{ok}(Z; s_0^{(i)}) = \hat{µ} + c^{(i)}(\hat{θ})' Σ^{(i)}(\hat{θ})^{-1} (Z(s)^{(i)} − 1\hat{µ}),
where Z(s)(i) denotes the subset of points in the kriging neighborhood, Σ(i) =
Var[Z(s)(i) ], and c(i) = Cov[Z(s0 (i) ), Z(s)(i) ]. All n data points contribute to
the estimation of θ in local kriging.
The moving window approach of Haas (1990, 1995) generalizes this idea
by re-estimating the semivariogram or covariance function locally within a
circular neighborhood (window). A prediction is made at the center of the
window using the local estimate of the semivariogram parameters,
p_{ok}(Z; s_0^{(i)}) = \hat{µ} + c^{(i)}(\hat{θ}^{(i)})' Σ^{(i)}(\hat{θ}^{(i)})^{-1} (Z(s)^{(i)} − 1\hat{µ}).

The neighborhood for local kriging could conceivably be different from the
neighborhood used to derive the semivariogram parameters θ (i) , but the neigh-
borhoods are usually the same. Choosing the window size must balance the
need for a sufficient number of pairs to estimate the semivariogram parameters
reliably (large window size), and the desire to make the window small so that
a stationarity assumption within the window is tenable. Haas (1990) describes
a heuristic approach to determine the size of the local neighborhood: enlarge
a circle around the prediction site until at least 35 sites are included, then
include five sites at a time until there is at least one pair of sites at each lag
class and the nonlinear least squares fit of the local semivariogram converges.
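The sketch below illustrates the flavor of the moving-window approach for a single prediction location. It simplifies Haas' heuristic by fixing the window at the nearest n_sites observations and by fitting the local exponential semivariogram with ordinary rather than weighted least squares; all function names and parameter values are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.spatial import cKDTree

def exp_semivar(h, nugget, psill, rng):
    # gamma(h) = nugget + psill * (1 - exp(-3h/range)) for h > 0
    return nugget + psill * (1.0 - np.exp(-3.0 * h / rng))

def exp_cov(h, nugget, psill, rng):
    # covariance implied by the same model (nugget contributes only at h = 0)
    return psill * np.exp(-3.0 * h / rng) + nugget * (h == 0)

def moving_window_krige(s0, sites, z, n_sites=35, n_lags=10):
    """Ordinary kriging at s0 with a locally re-estimated semivariogram."""
    idx = cKDTree(sites).query(s0, k=n_sites)[1]
    S, Z = sites[idx], z[idx]
    # local empirical semivariogram on lag classes (Matheron estimator)
    D = np.linalg.norm(S[:, None] - S[None, :], axis=2)
    iu = np.triu_indices(n_sites, k=1)
    lags, sq = D[iu], 0.5 * (Z[iu[0]] - Z[iu[1]]) ** 2
    edges = np.linspace(0.0, lags.max(), n_lags + 1)
    mids = 0.5 * (edges[:-1] + edges[1:])
    gam = np.array([sq[(lags > lo) & (lags <= hi)].mean()
                    for lo, hi in zip(edges[:-1], edges[1:])])
    ok = np.isfinite(gam)
    theta, _ = curve_fit(exp_semivar, mids[ok], gam[ok],
                         p0=[1e-6, gam[ok].max(), lags.max() / 2],
                         bounds=(0.0, np.inf))
    # local ordinary kriging system with a Lagrange multiplier
    C = exp_cov(D, *theta)
    c = exp_cov(np.linalg.norm(S - s0, axis=1), *theta)
    A = np.block([[C, np.ones((n_sites, 1))],
                  [np.ones((1, n_sites)), np.zeros((1, 1))]])
    w = np.linalg.solve(A, np.append(c, 1.0))
    return w[:n_sites] @ Z
```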

8.3.2 Convolution Methods

Constructing non-stationary processes from convolutions is an elegant and


powerful approach with great promise. We include convolution methods in
the class of local modeling approaches because of the presence of a kernel
function that can be viewed as a local weighting function, decreasing with the distance from the target point s, and because of local window techniques used at the estimation stage. Two illustrative references for this approach are
Higdon (1998) and Higdon, Swall, and Kern (1999).
Consider a zero-mean white noise process X(s) such that E[X(s)] = µx = 0,
Var[X(s)] = φ_x, and E[X(s)X(s + h)] = 0, h ≠ 0. Then a weakly stationary

random field Z(s) can be constructed by convolving X(s) with a kernel func-
tion Ks (u), centered at s. The random field
Z(s) = \int_{all u} K_s(u) X(u) du

has mean E[Z(s)] = µ_x \int_u K_s(u) du = 0 and covariance function

Cov[Z(s), Z(s + h)] = E[Z(s)Z(s + h)]
                    = \int_u \int_v K_s(u) K_{s+h}(v) E[X(u)X(v)] du dv
                    = φ_x \int_v K_s(v) K_{s+h}(v) dv.
Since the covariance function depends on the choice of the kernel function, a
non-stationary covariance function can be constructed by varying the kernel
spatially. Following Higdon, Swall, and Kern (1999), consider the following
progression for a process in R2 and points u = [ux , uy ], s = [sx , sy ].

1. K_s(u, σ^2) = (2πσ^2)^{-1} exp{−0.5[(u_x − s_x)^2 + (u_y − s_y)^2]/σ^2}.


This bivariate normal kernel is the product kernel of two univariate Gaussian kernels with equal variances. Note that σ^2 is the dispersion of the kernel functions, not the variance of the excitation process X(s). The result of the convolution with X(s) is a stationary, isotropic random field with gaussian covariance structure. The kernel function has a single parameter, σ^2.

2. K_s(u, θ) = |Σ|^{-1/2} (2π)^{-1} exp{−0.5 (u − s)' Σ^{-1} (u − s)},

   Σ = \begin{bmatrix} σ_x^2 & ρσ_xσ_y \\ ρσ_xσ_y & σ_y^2 \end{bmatrix}.

The bivariate Gaussian kernel with correlation and unequal variances


yields a random field with geometric anisotropy and gaussian covariance
structure. The parameters of this kernel function are θ = [σx , σy , ρ]. The
isotropic situation in 1. is a special case, ρ = 0, σx = σy .

3. K_s(u, θ(s)) = |Σ(s)|^{-1/2} (2π)^{-1} exp{−0.5 (u − s)' Σ(s)^{-1} (u − s)}.

   The parameters of the kernel function depend on the spatial location, θ(s) = [σ_x(s), σ_y(s), ρ(s)]. The resulting process has non-stationary covariance function

   Cov[Z(si), Z(sj)] = φ_x \int_v K_{si}(v, θ(si)) K_{sj}(v, θ(sj)) dv.
Higdon et al. (1999) give a closed-form expression for this integral.

For a process in Rd and n sites, the non-stationary convolution model has


nd(d + 1)/2 kernel parameters, a large number. To make estimation feasible,
Higdon (1998) draws on the local window idea and the connection between the
gaussian semivariogram and the gaussian convolution kernel. First, estimate
the kernel parameters only for a subset of sites. Second, at each site in the
subset estimate the semivariogram parameters in a local neighborhood and
convert the parameter estimates to parameters of the kernel function. Third,
interpolate the kernel function parameters between the estimation sites so
that the kernel functions vary smoothly throughout the domain.
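A discretized sketch of such a process convolution in R1 is shown below: white noise on a fine grid of support points is smoothed with a Gaussian kernel whose standard deviation changes with location, so that the realization is smoother in some parts of the domain than in others. The bandwidth function and the grid sizes are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
u = np.linspace(0.0, 10.0, 400)          # support points of the white noise X(u)
x = rng.normal(size=u.size)              # discretized excitation process
s = np.linspace(0.0, 10.0, 200)          # locations at which Z(s) is built

def bandwidth(si):
    # spatially varying kernel standard deviation: smoother process for larger s
    return 0.3 + 0.15 * si

Z = np.empty_like(s)
for i, si in enumerate(s):
    K = np.exp(-0.5 * ((u - si) / bandwidth(si)) ** 2)   # Gaussian kernel K_s(u)
    K /= np.sqrt(np.sum(K ** 2))                         # keep Var[Z(s)] constant
    Z[i] = K @ x                                         # Z(s) ~ sum_u K_s(u) X(u)
```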

8.3.3 Weighted Stationary Processes

The method of weighted stationary processes is closely related to convolution


methods and many models in this class have a convolution representation.
The important difference between this and the previously discussed approach
is the assumption about which model component is spatially evolving. The
convolution method of Higdon (1998) and Higdon et al. (1999) varies parame-
ters of the kernel function spatially. The weighted stationary process approach
of Fuentes (2001) varies the stationary processes but not the kernel.
Fuentes (2001) assumes that the non-stationary process Z(s) can be written
as the weighted sum of stationary processes Z1 (s), · · · , Zk (s),
Z(s) = \sum_{i=1}^{k} w_i(s) Z_i(s).    (8.3)

The local processes are uncorrelated, Cov[Zi(s), Zj(s)] = 0, i ≠ j, and have covariance functions Cov[Zi(s), Zi(s + h)] = C(h, θ_i). The resulting, non-stationary covariance function of the observed process is

Cov[Z(s), Z(s + h)] = Cov[\sum_{i=1}^{k} Z_i(s) w_i(s), \sum_{i=1}^{k} Z_i(s + h) w_i(s + h)]
                    = \sum_{i=1}^{k} \sum_{j=1}^{k} Cov[Z_i(s), Z_j(s + h)] w_i(s) w_j(s + h)
                    = \sum_{i=1}^{k} C(h, θ_i) w_i(s) w_i(s + h).

To see how mixing locally stationary processes leads to a model with non-
stationary covariance, we demonstrate the Fuentes model with the following,
simplified example.

Example 8.1 Consider a one-dimensional stochastic process on the interval


(0, 10). The segment is divided into four intervals of equal widths, S1 , · · · , S4 .
The covariance function in segment i = 1, · · · , 4 is
Cov[Zi(t1), Zi(t2)] = exp{−|t2 − t1|/θi},

with θ1 = 10, θ2 = 2, θ3 = 2, θ4 = 10. The weight functions wi (t) are Gaussian


kernels with standard deviation σ = 1.5, centered in the respective segment,
so that
w_i(t) = \frac{1}{\sqrt{2π}\,σ} exp{−0.5 (t − c_i)^2 / σ^2},
c1 = 1.25, c2 = 3.75, c3 = 6.25, c4 = 8.75. Figure 8.1 shows a single realiza-
tion of this process along with the kernel functions. Although the covariance
function changes discontinuously between segments, the smooth, overlapping
kernels create a smooth process.

[Figure 8.1 here; axes: "Z(t) process" versus "Time coordinate t".]

Figure 8.1 A single realization of a Fuentes process in R1 and the associated Gaussian kernel functions.

Because the degree of spatial dependence varies between the segments, the
covariance function of the weighted process
Z(t) = \sum_{i=1}^{4} Z_i(t) w_i(t)

is non-stationary (Figure 8.2). The variances are not constant, and the corre-
lations between locations decrease slower for data points in the first and last
segment and faster in the intermediate segments.
Figure 8.2 Covariance function of the Fuentes process shown in Figure 8.1.
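A short simulation sketch along the lines of Example 8.1 closes this chapter. It generates four independent stationary processes by Cholesky decomposition of their exponential covariance matrices and mixes them with the Gaussian kernel weights; the jitter added to the diagonal is a numerical device only, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0.01, 10.0, 250)
thetas = [10.0, 2.0, 2.0, 10.0]             # exponential parameters per segment
centers = [1.25, 3.75, 6.25, 8.75]          # kernel centers
sigma = 1.5                                  # kernel standard deviation

D = np.abs(t[:, None] - t[None, :])
Z = np.zeros_like(t)
for theta, c in zip(thetas, centers):
    C = np.exp(-D / theta)                              # Cov[Z_i(t1), Z_i(t2)]
    L = np.linalg.cholesky(C + 1e-10 * np.eye(t.size))  # small jitter for stability
    Zi = L @ rng.normal(size=t.size)                    # one stationary realization
    w = np.exp(-0.5 * (t - c) ** 2 / sigma ** 2) / (np.sqrt(2 * np.pi) * sigma)
    Z += w * Zi                                          # weighted sum as in (8.3)
```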
CHAPTER 9

Spatio-Temporal Processes

9.1 A New Dimension

The significant advance in the statistical analysis of spatial data is to ac-


knowledge the fact that the configuration of observations carries important
information about the relationship of data points.
We made the argument early on that incorporating the spatial context into
the statistical analysis is a need and a benefit when “space matters.” By the
same token, we must argue now that addressing the time component in space-
time processes must not be overlooked and that many—if not most—spatial
processes change over time. Unfortunately, statistical tools for the analysis
of spatio-temporal processes are not (yet) as fully developed as methods for
time series or spatial data alone. Also, there is a paucity of commercial software solutions for such data. The temptation thus arises naturally to proceed along
one of the following lines:

1. separate spatial analyses for each time point;

2. separate temporal analyses for each location;

3. spatio-temporal data analysis with methods for random fields in “Rd+1 ”.

The first two approaches can be considered conditional methods because


they isolate a particular time point or location and apply standard techniques
for the type of data that results. A two-stage variation on the theme is to
combine the results from the conditional analyses in a second stage. Two-
stage approaches are common in statistical application where multiple sources
of variation are at work but methodology and/or software (and computing
power) are unavailable for joint modeling. A case in point is nonlinear mixed
model applications for clustered data. Two important sources of variation
there are the changes in response as a function of covariates for each cluster
and cluster-to-cluster heterogeneity. The first source of variation is captured
by a nonlinear regression model. The second source is expressed by allowing model coefficients to vary at random among clusters. The obvious two-stage approach
is to fit the nonlinear model separately to each cluster and to combine the
regression coefficients into a set of overall (population-averaged) coefficients
in a second stage. The reader interested in multi-stage and modern, mixed
model based approaches to nonlinear mixed models is referred to Davidian
and Giltinan (1995).

Two-stage approaches are appealing because of simplicity but have also


serious shortcomings.
• If it is not possible to analyze the data spatially for a time point tj , then
data collected at that time will not contribute in the second stage when
the spatial analyses are combined. Data that are sparse in time or space
can present difficulties in this regard. Integrating observations over time if
data are sparse temporally, may enable a spatial analysis but can confound
temporal effects. When data “locations” are unique in time and space, a
two-stage analysis fails.
• There are many ways to summarize the results from the first-stage analysis.
One can examine the temporal distribution of semivariogram ranges and
sills, for example. To reasonably combine those statistics into some over-
all measure—if that is desired—requires information about the temporal
correlation between the statistics.
• Interpolation of observations in a continuous space-time process should take
into account the interactions between the spatial and temporal components
and allow for predictions in time and space. Separate analyses in time
(space) allow predictions in space (time) only.
Joint analyses of spatio-temporal data are preferable to separate analyses.
In the process of building a joint model, separate analyses are valuable tools,
however. Again, we can draw a parallel to nonlinear mixed modeling. Fitting
the nonlinear response model separately to all those clusters that provide a
sufficient number of observations and examining the variation of the regression
coefficients enables one to make better choices about which coefficients to
vary at random. Examining semivariogram parameter estimates over time
and auto-regressive estimates for each location leads to more informed model
selection for a joint analysis.
To express the space-time nature of the process under study, we expand our
previous formulation of a stochastic process in the following way:
{Z(s, t) : s ∈ D(t) ⊂ R2 , t ∈ T }. (9.1)

The observation of attribute Z at time t and location s is denoted as Z(s, t)


to emphasize that the temporal “location” t is not just an added spatial
coordinate. The dependence of the domain D on time symbolizes the condition
where the spatial domain changes over time. For simplicity, we may assume
that D(t) ≡ D. The spatial component in (9.1) can be discrete or continuous,
fixed or random. The same applies to the temporal component of the process.
The distribution of particles in the atmosphere, for example, can be seen as a
spatio-temporal process with continuous spatial domain and continuous time.
Data on the disposable household incomes by counties on April 15 of every
year, comprises a spatial lattice process and a discrete time process. This
creates a dizzying array of spatio-temporal data structures.
The third approach mentioned above, to treat spatio-temporal data as spa-
tial data with an extra dimension is not encouraged. Time and space are not

directly comparable. Space has no past, present, and future and the spatial
coordinate units are not comparable to temporal units. If certain assumptions
are met, it is acceptable to model spatio-temporal data with separable co-
variance structures. This is not the same as modeling spatio-temporal data as
“3-D” data. A spatio-temporal process with a spatial component in R2 may
be separable, but it is not a process in R3 . Assume we would treat the spatio-
temporal data as the realization of a process in R3 and denote the coordinates
as si = [xi, yi, ti]'. Let si − sj = [hij, ti − tj]. An exponential correlation model results in

Corr[Z(si), Z(sj)] = exp{−θ((xi − xj)^2 + (yi − yj)^2 + (ti − tj)^2)^{1/2}}
                   = exp{−θ(||hij||^2 + (ti − tj)^2)^{1/2}}.    (9.2)

This is a valid correlation function in R3 and two observations at the same


point in time have correlation exp{−θ||hij ||}. Similarly, the serial correlation
of two observations at the same spatial location is exp{−θ|ti − tj |}. Both
covariance functions are valid in R2 and R1 but the “3-D” covariance does
not make much practical sense. The range of the spatial process is different from that of the temporal process; the units are not comparable. The anisotropy of
spatio-temporal models must be reflected in the covariance functions. Gneiting
(2002) points out that from a mathematical point of view the space-time
domain Rd × R is no different than the spatial domain Rd+1 . A valid spatio-
temporal covariance function is also a valid spatial covariance function. He
argues that through notation and construction, the difference between space
and time must be brought to the fore.
There are two distances between points in a space-time process: the Eu-
clidean distance between points in space and the Euclidean distance between
points in time. The spatio-temporal lag between Z(s, t) and Z(s + h, t + k) is
[h; k]. If time is considered an “added” dimension, then the anisotropy must
be taken into account. A class of spatio-temporal covariance functions that
uses this approach constructs models as follows. If R is a valid correlation
function in Rd+1 , then
Corr[Z(si , ti ), Z(sj , tj )] = R(θs ||hij ||2 + θt k 2 ). (9.3)
The parameters θs and θt are the spatial and temporal anisotropy parameters,
respectively.
A more reasonable correlation structure than (9.2), drawing again on the
exponential model, is
Corr[Z(si , ti ), Z(sj , tj )] = exp{−θs ||hij ||} × exp{−θt |ti − tj |}. (9.4)
This is an example of a separable covariance function. If the time points are
evenly spaced, the temporal process has an AR(1) structure and the spatial
component has an exponential structure. A model of this kind is considered
in Mitchell and Gumpertz (2003) in modeling CO2 released over time on
rice plots. Also, it is easy to see that this type of model is a member of the
anisotropic class (9.3), if one chooses the gaussian correlation model in (9.4).

9.2 Separable Covariance Functions

Whether separable or not, a spatio-temporal covariance function must be a


valid covariance function in Rd × R (we restrict discussion to d = 2 in this
section). Because mathematically there is no difference between that domain
and Rd+1 , but our notation accommodates the special role of space and time,
we re-state the conditions for stationarity and validity of a covariance function.
A spatio-temporal covariance function is (second-order) stationary in space
and time if
Cov[Z(si , ti ), Z(sj , tj )] = C(si − sj , ti − tj ).
The covariance function is furthermore isotropic in space if
Cov[Z(si , ti ), Z(sj , tj )] = C(||si − sj ||, |ti − tj |).
The functions C(h, 0) and C(0, k) are spatial and temporal covariance func-
tions, respectively. The condition of positive definiteness must be met for C to
be a valid covariance function. This implies that for any set of spatio-temporal
locations (si , ti ), · · · , (sk , tk ) and real numbers a1 , · · · , ak
\sum_{i=1}^{k} \sum_{j=1}^{k} a_i a_j C(s_i − s_j, t_i − t_j) ≥ 0.    (9.5)

These features are obvious “extensions” of the requirements for covariance


functions in section §2.2. Two of the elementary properties of covariance func-
tions reviewed there are especially important for the construction of spatio-
temporal covariance functions:
If Cj(h) are valid covariance functions, j = 1, · · · , k, then \sum_{j=1}^{k} b_j C_j(h) is a valid covariance function, if b_j ≥ 0 ∀ j.

If Cj(h) are valid covariance functions, j = 1, · · · , k, then \prod_{j=1}^{k} C_j(h) is a valid covariance function.
Separability of spatio-temporal covariance functions can be defined based
on the additive and multiplicative properties. A separable spatio-temporal covariance function decomposes Cov[Z(si, ti), Z(sj, tj)] into a purely spatial and a purely temporal component. The components usually have different param-
eters to allow for space-time anisotropy. The final spatio-temporal covariance
function is then the result of addition and or multiplication operations. For
example, let Cs (h; θ s ) and Ct (k; θ t ) be a spatial and temporal covariance
function. Valid separable spatio-temporal covariance functions are
Cov[Z(s, t), Z(s + h, t + k)] = Cs (h; θ s ) Ct (k; θ t )
and
Cov[Z(s, t), Z(s + h, t + k)] = Cs (h; θ s ) + Ct (k; θ t )
These covariance models are sometimes referred to as the product and the
sum covariance structures. The product-sum covariance structure
Cov[Z(s, t), Z(s + h, t + k)] = Cs (h; θ s ) Ct (k; θ t ) + Cs (h; θ s ) + Ct (k; θ t )

of De Cesare, Myers, and Posa (2001) is generally nonseparable.


Separable covariance functions are easy to work with and valid, provided the
components are valid covariance functions. Furthermore, existing commercial
software for spatial data analysis can sometimes be coaxed into fitting sepa-
rable spatio-temporal covariance models. Mitchell and Gumpertz (2003), for
example, use a spatio-temporal covariance function with product separabil-
ity in which the temporal process has a first-order autoregressive correlation
structure (see equation (9.4)). This, in turn, enables the authors to rewrite
the observational model in autoregressive form, which makes parameter esti-
mation possible based on nonlinear mixed model tools in SAS.
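For data observed at a common set of sites and times, the product-separable structure makes the full covariance matrix a Kronecker product of a temporal and a spatial correlation matrix, which is one reason for its computational appeal. A sketch with exponential components as in (9.4) follows; parameter values and the function name are illustrative only.

```python
import numpy as np

def separable_cov(sites, times, sigma2, theta_s, theta_t):
    """sigma^2 * [exponential in time] x [exponential in space], cf. (9.4)."""
    Hs = np.linalg.norm(sites[:, None, :] - sites[None, :, :], axis=2)
    Ht = np.abs(times[:, None] - times[None, :])
    Cs = np.exp(-theta_s * Hs)          # spatial correlation matrix
    Ct = np.exp(-theta_t * Ht)          # temporal correlation matrix (AR(1) on an even grid)
    return sigma2 * np.kron(Ct, Cs)     # observations ordered by time, then by site

rng = np.random.default_rng(3)
sites = rng.uniform(0.0, 1.0, size=(20, 2))
times = np.arange(5.0)
Sigma = separable_cov(sites, times, sigma2=1.0, theta_s=3.0, theta_t=0.5)
print(Sigma.shape)                       # (100, 100) for 20 sites x 5 times
```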
Cressie and Huang (1999) note that separable models are often chosen for
their convenience. We add that this is true for related models. Posa (1993)
notes the invariance of kriging predictions to scaling of the covariance func-
tions. That is, the kriging solutions in no-nugget models remain the same,
regardless of changes in scale of the observations (the kriging variance, mind
you, is not invariant). The spatio-temporal covariance models considered by
Posa (1993) are of the form
Cov[Z(s, t), Z(s + h, t + k)] = σ^2(t) Cs(h).    (9.6)
Only the sill of the covariance function is time dependent to account for non-
stationarity in time. If σ 2 (t) ≡ σ 2 , model (9.6) is a product separable model
with a nugget-only model in time. Although the model is non-stationary, the
kriging predictions at a given time point are not affected by the non-stationary
temporal variance.
The primary drawback of separable models is that they do not incorporate space-time interactions. Consider a product separable structure
Cov[Z(s, t), Z(s + h, t + k)] = Cs (h; θ s ) Ct (k; θ t )
The spatial covariances at time lags k and u ≠ k have the same shape; they are
proportional to each other. The temporal and spatial components represent
dependencies in time (given spatial location) and space (given time). The
processes do not act upon each other. To incorporate space-time interactions
requires non-separability of the covariance function.

9.3 Non-Separable Covariance Functions

Non-separable, valid covariance functions for spatio-temporal data are typi-


cally more complicated than separable models but incorporate space-time in-
teractions. Some non-separable models reduce to separable ones for particular
values of model parameters and allow a test for separability. Methods for con-
structing valid covariance functions with spatio-temporal interactions include
the monotone function approach of Gneiting (2002), the spectral method of
Cressie and Huang (1999), the mixture approach, and the partial differential
equation approach of Jones and Zhang (1997).

9.3.1 Monotone Function Approach

Gneiting (2002) presented a flexible and elegant approach to construct spatio-


temporal covariance functions. The method is powerful because it does not
require operations in the spectral domain and builds valid covariance functions
from elementary components whose validity is easily checked. To fix ideas let
[h; k] denote a lag vector in Rd × R1 and choose two functions φ(t), t ≥ 0 and
ψ(t), t ≥ 0 such that φ(t) is completely monotone and ψ(t) is positive with a
completely monotone derivative. The functions φ(t) = exp{−ct}, c > 0 and
ψ(t) = (at+1), a > 0, for example, satisfy the requirements. Tables 1 and 2 in
Gneiting (2002) list a variety of functions and valid ranges of their parameters.
With these ingredients in place a valid covariance function in Rd × R1 is
Cov[Z(s, t), Z(s + h, t + k)] = C(h, k) = \frac{σ^2}{ψ(|k|^2)^{d/2}}\, φ(||h||^2 / ψ(|k|^2)).    (9.7)
Gneiting (2002) illustrates the construction with functions φ(t) = exp{−ctγ }
and ψ(t) = (atα + 1)β . They satisfy the requirements for
c, a > 0; 0 < γ, α ≤ 1; 0 ≤ β ≤ 1.
Substituting into (9.7) yields (for d = 2)
C(h, k) = \frac{σ^2}{(a|k|^{2α} + 1)^{β}} \exp\left\{ −\frac{c||h||^{2γ}}{(a|k|^{2α} + 1)^{βγ}} \right\}.    (9.8)
For β = 0 the covariance function does not depend on the time lag. Multi-
plying (9.8) with a purely temporal covariance function leads to a separable
model for β = 0. For ||h|| = 0, (9.8) reduces to the temporal covariance
function Ct(k) = (a|k|^{2α} + 1)^{−β_t}. The function
Ct(k) × C(h, k) = \frac{σ^2}{(a|k|^{2α} + 1)^{β_t + β}} \exp\left\{ −\frac{c||h||^{2γ}}{(a|k|^{2α} + 1)^{βγ}} \right\}    (9.9)
is a valid spatio-temporal covariance function. It is separable for β = 0 and
non-separable otherwise. Since the separable and non-separable models are
nested, a statistical test for H0 : β = 0 can be carried out. One technique would
be to estimate the parameters of (9.9) by (restricted) maximum likelihood
with and without the constraint β = 0 and to compare twice the negative
(restricted) log likelihood. A correction for the fact that the null value of the
test falls on the boundary of the parameter space can be applied (Self and
Liang, 1987).
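The covariance (9.8) is straightforward to evaluate. The sketch below codes it for d = 2 with illustrative parameter values (the function name gneiting_cov is ours) and shows how β = 0 removes the dependence of the spatial decay on the temporal lag.

```python
import numpy as np

def gneiting_cov(h, k, sigma2=1.0, a=1.0, c=1.0, alpha=1.0, gamma=1.0, beta=0.5):
    """Non-separable space-time covariance (9.8) for d = 2.

    Validity requires c, a > 0, 0 < gamma, alpha <= 1, and 0 <= beta <= 1.
    """
    psi = a * np.abs(k) ** (2 * alpha) + 1.0
    return sigma2 / psi ** beta * np.exp(-c * np.abs(h) ** (2 * gamma) / psi ** (beta * gamma))

h = np.linspace(0.0, 3.0, 4)
# with beta > 0 the spatial decay changes with the temporal lag ...
print(gneiting_cov(h, k=0.0), gneiting_cov(h, k=2.0))
# ... with beta = 0 it does not (the separable special case)
print(gneiting_cov(h, k=0.0, beta=0.0), gneiting_cov(h, k=2.0, beta=0.0))
```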

9.3.2 Spectral Approach

We noted in §4.3 (page 141) that by Bochner’s theorem valid covariance func-
tions have a spectral representation. For a process in Rd we can write
C(h) = \int_{−∞}^{∞} \cdots \int_{−∞}^{∞} exp{iω'h} s(ω) dω,

which suggests the following method for constructing a valid covariance func-
tion: determine a valid spectral density and take its inverse Fourier transform.
To acknowledge the physical difference between time and space, the spectral
representation of a spatio-temporal covariance that satisfies (9.5) is written
as

C(h, k) = \int_{−∞}^{∞} \cdots \int_{−∞}^{∞} exp{iω'h + iτk} s(ω, τ) dω dτ,    (9.10)
where s(ω, τ ) is the spectral density. One could proceed with the construction
of covariance functions as in the purely spatial case: determine a valid spatio-
temporal spectral density and integrate. This is essentially the method applied
by Cressie and Huang (1999), but these authors use two clever devices to avoid
the selection of a joint spatio-temporal spectral density.
First, because of (9.10), the covariance function and the spectral density are
a Fourier transform pair. Integration of the spatial and temporal components
can be separated in the frequency domains:
s(ω, τ) = \frac{1}{(2π)^{d+1}} \int_{−∞}^{∞} \cdots \int_{−∞}^{∞} exp{−iω'h − iτk} C(h, k) dh dk
        = \frac{1}{2π} \int_{−∞}^{∞} exp{−iτk} \left[ \frac{1}{(2π)^{d}} \int_{−∞}^{∞} \cdots \int_{−∞}^{∞} exp{−iω'h} C(h, k) dh \right] dk
        = \frac{1}{2π} \int_{−∞}^{∞} exp{−iτk} h(ω, k) dk.    (9.11)

The function h(ω, k) is the (spatial) spectral density for temporal lag k. What
has been gained so far is that if we know the spectral density for a given lag
k, the spatio-temporal spectral density is obtained with a one-dimensional
Fourier transform. And it is presumably simpler to develop a model for h(ω, k)
than it is for s(ω, τ ). To get the covariance function from there still requires
complicated integration in (9.10).
The second device used by Cressie and Huang (1999) is to express h(ω, k) as
the product of two simpler functions. They put h(ω, k) ≡ R(ω, k) r(ω), where R(ω, k) is a continuous correlation function and r(ω) > 0. If \int R(ω, k) dk < ∞ and \int r(ω) dω < ∞, then substitution into (9.11) gives

s(ω, τ) = \frac{1}{2π}\, r(ω) \int_{−∞}^{∞} exp{−iτk} R(ω, k) dk

and

C(h, k) = \int_{−∞}^{∞} \cdots \int_{−∞}^{∞} exp{−iω'h} R(ω, k) k(ω) dω.    (9.12)
Cressie and Huang (1999) present numerous examples for functions R(ω, k)
and k(ω) and the resulting spatio-temporal covariance functions. Gneiting
(2002) establishes that some of the covariance functions in the paper by Cressie

and Huang are not valid, because one of the correlation functions R(ω, k) used
in their examples does not satisfy the needed conditions.

9.3.3 Mixture Approach

Instead of integration in the frequency domain, nonseparable covariance func-


tions can also be constructed by summation or integration in the spatio-
temporal domain. Notice that if Zs (s) and Zt (t) are purely spatial and tempo-
ral processes with covariance functions Cs (h; θ s ) and Ct (k; θ t ), respectively,
then Z(s, t) = Zs (s)Zt (t) has the separable product covariance function
C(h, k) = Cs (h; θ s ) × Ct (k; θ t ),
if Zs (s) and Zt (t) are uncorrelated. Space-time interactions can be incorpo-
rated by mixing product covariance or correlation functions. Ma (2002) con-
siders the probability mass functions πij , where (i, j) ∈ Z+ . In other words,
πij is the mass function of [U, V ], a bivariate discrete random vector with
support on the non-negative integers. If, conditional on [U, V ] = [i, j], the
spatio-temporal process has correlation function
R(h, k | U = i, V = j) = Rs(h)^i × Rt(k)^j,

then the correlation function of the unconditional process is the non-separable model

R(h, k) = \sum_{i=0}^{∞} \sum_{j=0}^{∞} Rs(h)^i Rt(k)^j π_{ij}.    (9.13)
Ma (2002) terms this model a positive power mixture; it makes use of the fact that if R(u) is a correlation model in Rd, then R(u)^i is also a valid correlation model in Rd for any positive integer i. The method of power mixtures does not require a bivariate, discrete mass function. This is important, because such distributions are quite rare. A non-separable model can be constructed in the univariate case, too,

R(h, k) = \sum_{i=0}^{∞} (Rs(h)Rt(k))^i π_i.    (9.14)

The right hand side of equation (9.14) bears a striking resemblance to


the probability generating function (pgf) of a discrete random variable with
support on the non-negative integers. If U takes realizations in {0, 1, · · ·} with probability Pr(U = i) = π_i, then its probability generating function is G(w) = \sum_{i=0}^{∞} w^i π_i for 0 ≤ w ≤ 1. The correlation product in (9.14) takes the role of w^i in the generating function. This provides a convenient method to
the role of wi in the generating function. This provides a convenient method to


construct spatio-temporal correlation models. Obtain the probability generat-
ing function and replace w with Rs (h)Rt (k). In the bivariate case, replace w1
in the pgf with Rs (h) and w2 with Rt (k). See Ma (2002) for further examples
of the univariate and bivariate case.

Example 9.1 Let U ∼ Binomial(n, π), so that its pgf is


G(w) = (π(w − 1) + 1)^n.

A valid spatio-temporal correlation model based on the power mixture of Binomial probabilities is

C(h, k) = (π(Rs(h)Rt(k) − 1) + 1)^n.
If U follows a Poisson distribution with mean λ, then its pgf is G(w) =
exp{λ(w − 1)} and exp{λ(Rs (h)Rt (k) − 1)} is a spatio-temporal correlation
function.
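A sketch of this recipe, using exponential spatial and temporal component correlations purely for illustration, is given below; the function names are ours.

```python
import numpy as np

def Rs(h, theta_s=1.0):
    return np.exp(-theta_s * np.abs(h))      # spatial component correlation

def Rt(k, theta_t=1.0):
    return np.exp(-theta_t * np.abs(k))      # temporal component correlation

def poisson_mixture_corr(h, k, lam=2.0):
    # pgf of Poisson(lam) is G(w) = exp{lam (w - 1)}; substitute w = Rs(h)Rt(k)
    return np.exp(lam * (Rs(h) * Rt(k) - 1.0))

def binomial_mixture_corr(h, k, n=5, p=0.4):
    # pgf of Binomial(n, p) is G(w) = (p (w - 1) + 1)^n
    return (p * (Rs(h) * Rt(k) - 1.0) + 1.0) ** n

print(poisson_mixture_corr(0.5, 1.0), binomial_mixture_corr(0.5, 1.0))
```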

A different approach to construct a non-separable covariance function from


a product covariance function is to make the spatial and temporal coordinates
depend on one another. Ma (2002) terms this the scale mixture approach.
Let [U, V ] be a bivariate random vector with distribution function F (u, v), not
necessarily discrete. If [U, V ] is independent of the purely spatial and temporal
processes Zs (s) and Zt (t), which are independent of each other, then the scale
mixture process
Z(s, t) = Zs (sU )Zt (tV )
has covariance function
C(h, k) = \int Cs(hu) Ct(kv) dF(u, v).    (9.15)

As with the power mixture approach, a univariate version follows easily:


C(h, k) = \int Cs(hu) Ct(ku) dF(u).    (9.16)

Covariance function (9.16) is a special case of the family of covariance func-


tions of De Iaco, Myers, and Posa (2002). In their work, the distribution
function F (u) is replaced by a positive measure. They furthermore apply the
mixing idea not only to separable product covariance functions, but also to
product-sum functions and show the connection to the Cressie-Huang repre-
sentation in the frequency domain.
Our notation in the preceding paragraphs may have suggested that the
spatial and temporal covariance functions Cs (h) and Ct (k) are stationary. This
is not necessarily the case. The development of the mixture models applies
in the non-stationary situation as well. Most examples and applications start
from stationary (and isotropic) covariance functions for the two components,
however.

9.3.4 Differential Equation Approach

This approach of constructing spatio-temporal covariance functions relies on


the representation of the process as a stochastic differential equation. For ex-
ample, a temporal process Z(t) with exponential covariance function Ct (k) =
σ 2 exp{−θk} has representation


\left( \frac{d}{dt} + θ \right) Z(t) = ε(t),

where ε(t) is a white noise process with variance σ_t^2. In R2, Whittle (1954) considered the stochastic Laplace equation

\left( \frac{∂^2}{∂x^2} + \frac{∂^2}{∂y^2} − θ^2 \right) Z(s) = ε(s)

as describing the elementary process Z(s) = Z([x, y]). This process has covariance function Cs(||h||) = σ_s^2 θ||h|| K_1(θ||h||).
A stochastic equation that combines the spatial and temporal components
(Jones and Zhang, 1997),
\left( \frac{∂^2}{∂x^2} + \frac{∂^2}{∂y^2} − θ_s^2 \right)\left( \frac{∂}{∂t} + θ_t \right) Z(s, t) = ε(s, t),

leads to a product separable model with

C(h, k) = Ct(k)Cs(h) = σ_{st}^2 exp{−θ_t k} × θ_s||h|| K_1(θ_s||h||).

To construct non-separable models in Rd × R1 , Jones and Zhang (1997)


consider stochastic differential equations of the form
\left[ \left( \sum_{i=1}^{d} \frac{∂^2}{∂s_i^2} − θ^2 \right)^{p} − c\,\frac{∂}{∂t} \right] Z(s, t) = ε(s, t).
In this equation s1 , · · · , sd denote the coordinates of a point s. The parameter p
governs the smoothness of the process and must be greater than max{1, d/2}.
For d = 2 the spatio-temporal covariance function in the isotropic case is
C(h, k) = \frac{σ^2}{4cπ} \int_{0}^{∞} \frac{τ exp{−(k/c)(τ^2 + θ^2)^p}}{(τ^2 + θ^2)^p}\, J_0(τh) dτ.    (9.17)

The connection between (9.17) and purely spatial (isotropic) covariance


models is interesting. Expressions (4.7)–(4.8) on page 141 expressed the co-
variance function in the isotropic case as a Hankel transformation. For d = 2
this is a Hankel transformation of zero order,
Cs(h) = \int_{0}^{∞} J_0(hω) dH(ω),

just as (9.17).

9.4 The Spatio-Temporal Semivariogram

To estimate the parameters of a spatio-temporal covariance function, the same


basic methods can generally be applied as in Chapter 4. The model can be fit
to the original data by maximum or restricted maximum likelihood, or using

least squares or CL/GEE methods based on pseudo-data. The semivariogram


for a stationary spatio-temporal process relates to the covariance function as
before:

γ(h, k) = \frac{1}{2} Var[Z(s, t) − Z(s + h, t + k)]
        = Var[Z(s, t)] − Cov[Z(s, t), Z(s + h, t + k)]
        = C(0, 0) − C(h, k).

When the empirical spatio-temporal semivariogram is estimated from data,


the implicit anisotropy of spatial and temporal dimensions must be accounted
for. A graph of the spatio-temporal semivariogram (or covariogram) with an
isotropic spatial component is a three-dimensional plot. The axes consist of
temporal and spatial lags, the semivariogram values are the ordinates.
The empirical spatio-temporal semivariogram estimator that derives from
the Matheron estimator (4.24), page 153, is
\hat{γ}(h, k) = \frac{1}{2|N(h, k)|} \sum_{N(h,k)} \{Z(si, ti) − Z(sj, tj)\}^2.    (9.18)

The set N (h, k) consists of the points that are within spatial distance h
and time lag k of each other; |N (h, k)| denotes the number of distinct pairs in
that set. When data are irregularly spaced in time and/or space, the empirical
semivariogram may need to be computed for lag classes. The lag tolerances in
the spatial and temporal dimensions need to be chosen differently, of course,
to accommodate a sufficient number of point pairs at each spatio-temporal
lag.
Note that (9.18) is an estimator of the joint space-time dependency. It is
different from a conditional estimator of the spatial semivariogram at time t
which would be used in a two-stage method:
\hat{γ}_t(h) = \frac{1}{2|N_t(h)|} \sum_{N_t(h)} \{Z(si, t) − Z(sj, t)\}^2.    (9.19)

A weighted least squares fit of the joint spatio-temporal empirical semivar-


iogram to a model γ(h, k; θ) estimates θ by minimizing
\sum_{j=1}^{m_s} \sum_{l=1}^{m_t} \frac{|N(h_j, k_l)|}{2γ(h_j, k_l; θ)} \{\hat{γ}(h_j, k_l) − γ(h_j, k_l; θ)\}^2,

where ms and mt are the number of spatial and temporal lag classes, respec-
tively. A fit of the conditional spatial semivariogram at time t minimizes
\sum_{j=1}^{m_s} \frac{|N_t(h_j)|}{2γ(h_j, t; θ)} \{\hat{γ}(h_j, t) − γ(h_j, t; θ)\}^2.
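A sketch of the estimator (9.18) on spatial and temporal lag classes is given below; the returned pair counts |N(h_j, k_l)| are exactly the weights needed for the weighted least squares fit above. Function and argument names are ours, and all pairs are formed explicitly, which is practical only for moderate sample sizes.

```python
import numpy as np

def st_semivariogram(sites, times, z, h_edges, k_edges):
    """Empirical spatio-temporal semivariogram (9.18) on lag classes.

    Returns the semivariogram surface and the pair counts |N(h_j, k_l)|.
    """
    i, j = np.triu_indices(len(z), k=1)
    h = np.linalg.norm(sites[i] - sites[j], axis=1)       # spatial lags
    k = np.abs(times[i] - times[j])                        # temporal lags
    sq = 0.5 * (z[i] - z[j]) ** 2
    gamma = np.full((len(h_edges) - 1, len(k_edges) - 1), np.nan)
    counts = np.zeros(gamma.shape, dtype=int)
    hi, ki = np.digitize(h, h_edges) - 1, np.digitize(k, k_edges) - 1
    for a in range(gamma.shape[0]):
        for b in range(gamma.shape[1]):
            sel = (hi == a) & (ki == b)
            counts[a, b] = sel.sum()
            if counts[a, b] > 0:
                gamma[a, b] = sq[sel].mean()
    return gamma, counts
```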

9.5 Spatio-Temporal Point Processes

9.5.1 Types of Processes

A spatio-temporal point process is a spatio-temporal random field with a


random spatial index D and a temporal index T . As before, the temporal
index can be either fixed or random, discrete or continuous. According to
the nature of the temporal component we distinguish the following types of
spatio-temporal point processes (Dorai-Raj, 2001).
• Earthquake Process
Events are unique to spatial locations and time points, only one event can
occur at a particular location and time. If a record indicates—in addition
to time and location of the earthquake—the magnitude of the quake, the
process is marked. The connection of this type of process to earthquakes
is intuitive and it has been used in the study of seismic activity (Choi
and Hall, 1999; Ogata, 1999). It plays an important role in many other
applications. For example, the study of burglary patterns in a suburban
area will invariably involve spatio-temporal point processes of this type,
unless the data are temporally aggregated.
• Explosion Process
The idea of an explosion process is the generation of a spatial point pro-
cess at a time t which itself is a realization in a stochastic process. The
realization of an explosion process consists of locations s_{i1}, · · · , s_{i n_i} ∈ D∗
at times ti ∈ T ∗ . Temporal events occur with intensity γ(t) and produce
point patterns with intensity λt (s). An example of such a spatio-temporal
process is the distribution of acorns around an oak tree. The time at which
the (majority of the) acorns fall each year can be considered a temporal
random process. The distribution of the acorns is a point process with some
intensity, possibly spatially varying.
• Birth-Death Process
This process is useful to model objects that are placed at a random location
by birth at time tb and exist at that location for a random time tl . Cressie
(1993, p. 720) refers to such a process as a space-time survival point process.
At time t, an event is recorded at location s if a birth occurred at s at
time tb < t and the object has a lifetime of tb + tl > t. Rathbun and
Cressie (1994) formulate the spatio-temporal distribution of longleaf pines
in Southern Georgia through a birth-death process.
In the explosion process the time points at which the point patterns are ob-
served are the realization of a stochastic process; they are a complete mapping
of temporal events. If the observation times are not the result of a stochas-
tic process, but chosen by the experimenter, the spatio-temporal pattern is
referred to as a point pattern sampled in time. Even if sampling times are
selected at random, they do not represent a mapping of temporal events, nor
are they treated as a stochastic process. The realization of a birth-death pro-
cess observed at fixed time points can be indistinguishable from a temporally

sampled point pattern. Events observed at location s at time ti but not at


time ti+1 could be due to the death of a spatially stationary object or due to
the displacement of a non-stationary object between the two time points.

9.5.2 Intensity Measures

Recall from §3.4 that the first- and second-order intensities of a spatial point
process are defined as the limits
λ(s) = \lim_{|ds| \to 0} \frac{E[N(ds)]}{|ds|}

λ_2(si, sj) = \lim_{|dsi|, |dsj| \to 0} \frac{E[N(dsi) N(dsj)]}{|dsi| |dsj|},
where ds is an infinitesimal disk (ball) of area (volume) |ds|. To extend the
intensity measures to the spatio-temporal scenario, we define N (ds, dt) to
denote the number of events in an infinitesimal cylinder with base ds and
height dt (Dorai-Raj, 2001). (Note that Haas (1995) considered cylinders in
local prediction of spatio-temporal data.) The spatio-temporal intensity of the
process {Z(s, t) : s ∈ D(t), t ∈ T } is then defined as the average number of
events per unit volume as the cylinder is shrunk around the point (s, t):
λ(s, t) = \lim_{|ds|, |dt| \to 0} \frac{E[N(ds, dt)]}{|ds| |dt|}.    (9.20)

To consider only a single component of the spatio-temporal process, the


intensity (9.20) can be marginalized to obtain the marginal spatial intensity
λ(s, ·) = \int_{T} λ(s, v) dv,    (9.21)

or the marginal temporal intensity

λ(·, t) = \int_{D} λ(u, t) du.    (9.22)
If the spatio-temporal intensity can be marginalized, it can also be condi-
tioned. The conditional spatial intensity at time t is defined as
λ(s|t) = \lim_{|ds| \to 0} \frac{E[N(ds, t)]}{|ds|},

and the conditional temporal intensity at location s as

λ(t|s) = \lim_{|dt| \to 0} \frac{E[N(s, dt)]}{|dt|}.
In the case of an earthquake process, these conditional intensities are not
meaningful and should be replaced by intensities constructed on intervals in
time or areas in space (Rathbun, 1996).
Second-order intensities can also be extended to the spatio-temporal case

by a similar device. Let Ai = dsi × dti be an infinitesimal cylinder containing


point (si , ti ). The second-order spatio-temporal intensity is then defined as
λ_2(si, sj, ti, tj) = \lim_{|Ai|, |Aj| \to 0} \frac{E[N(Ai) N(Aj)]}{|Ai| |Aj|}.

A large array of different marginal, conditional, and average conditional


second-order intensities can be derived by similar arguments as for the first-
order intensities.

9.5.3 Stationarity and Complete Randomness

First- and second-order stationarity of a spatio-temporal point process can


refer to stationarity in space, in time, or both. We thus consider an array of
conditions.
(i) Z(s, t) is first-order stationary in space (FOS) if λ(t|s) = λ∗ (t).
(ii) Z(s, t) is first-order stationary in time (FOT) if λ(s|t) = λ∗∗ (s).
(iii) Z(s, t) is first-order stationary in space and time (FOST) if λ(s, t) does
not depend on s or t.
Dorai-Raj (2001) shows that the spatio-temporal intensity (9.20) is related
to the conditional intensities by
\lim_{|dt| \to 0} \int_{dt} \frac{λ(s, v)}{|dt|} dv = \lim_{|ds| \to 0} \int_{ds} \frac{λ(u, t)}{|ds|} du.

The following results are corollaries of this relationship.


(i) If a process is FOT, then λ(s, t) = λ∗∗ (s) and λ(s, ·) = |T |λ∗∗ (s).
(ii) If a process is FOS, then λ(s, t) = λ∗ (t) and λ(·, t) = |A|λ∗ (t).
Second-order stationarity in space and time requires that λ(s, t) does not
depend on s or t, and that
λ2 (s, s + h, t, t + k) = λ∗2 (h, k).

Bartlett’s complete covariance density function (§4.7.3.1) can be extended


for FOST processes as
\lim_{|Ai|, |Aj| \to 0} \frac{Cov[N(Ai), N(Aj)]}{|Ai| |Aj|} = v(h, k) + λ δ(h, k).    (9.23)

Equipped with these tools, a spatio-temporal process can be defined as a


completely spatio-temporally random (CSTR) process if it is a Poisson process
in both space and time, that is, a process void of any temporal or spatial
structure, so that N (A, T ) ∼ Poisson(λ|A × T |). For this process, λ(s, t) = λ
and λ2 (s, s + h, t, t + k) = λ2 . If the CSR process is an unattainable standard
for spatial point processes, then the CSTR process is even more so for spatio-
temporal processes. Its purpose is to serve as the initial benchmark against

which observed spatio-temporal patterns are tested, in much the same way as
observed spatial point patterns are tested against CSR.
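Simulating the CSTR benchmark is straightforward, since the event count over A × T is Poisson and locations and times are then independently uniform. A minimal sketch on a rectangular domain (function and argument names ours):

```python
import numpy as np

def simulate_cstr(lam, a=1.0, b=1.0, T=1.0, seed=None):
    """Events of a CSTR process on [0, a] x [0, b] observed over (0, T)."""
    rng = np.random.default_rng(seed)
    n = rng.poisson(lam * a * b * T)                 # N(A, T) ~ Poisson(lam |A x T|)
    sites = rng.uniform([0.0, 0.0], [a, b], size=(n, 2))
    times = rng.uniform(0.0, T, size=n)
    return sites, times

sites, times = simulate_cstr(lam=50.0, T=2.0, seed=11)
print(len(times), "events")
```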
References

Abramowitz, M. and Stegun, I.A. (1964) Handbook of Mathematical Functions,


Applied Mathematics Series, Vol. 55. National Bureau of Standards, Washington,
D.C (reprinted 1972 by Dover Publications, New York).
Aitchison, J. and Brown, J. (1957) The Lognormal Distribution. Cambridge Univer-
sity Press, London.
Akaike, H. (1974) A new look at the statistical model identification. IEEE Transac-
tion on Automatic Control, AC-19, 716–723.
Aldworth, J. and Cressie, N. (1999) Sampling designs and prediction methods for
Gaussian spatial processes. In: S. Ghosh (ed.), Multivariate Analyses, Design of
Experiments, and Survey Sampling. Marcel Dekker, New York, 1–54.
Aldworth, J. and Cressie, N. (2003) Prediction of nonlinear spatial functionals. Jour-
nal of Statistical Planning and Inference, 112:3–41.
Allen, D.M. (1974) The relationship between variable selection and data augmenta-
tion and a method of prediction. Technometrics, 16:125–127.
Anselin, L. (1995) Local indicators of spatial association—LISA. Geographic Anal-
ysis, 27(2):93–115.
Armstrong, M. (1999) Basic Linear Geostatistics. Springer-Verlag, New York.
Azzalini, A. and Capitanio, A. (1999) Statistical applications of the multivari-
ate skew normal distribution. Journal of the Royal Statistical Society, Series B,
61:579–602.
Baddeley, A. and Silverman, B.W. (1984) A cautionary example for the use of second
order methods for analyzing point patterns. Biometrics, 40:1089–1094.
Bahadur, R.R. (1961) A representation of the joint distribution of responses to n
dichotomous items, In Studies in Item Analysis and Prediction, ed. H. Solomon,
Stanford University Press, Stanford, CA, 158–165.
Banerjee, S., Carlin, B.P., and Gelfand, A.E. (2003) Hierarchical Modeling and Anal-
ysis for Spatial Data. Chapman and Hall/CRC, Boca Raton, FL.
Barry, R.P. and Ver Hoef, J.M. (1996) Blackbox kriging: spatial prediction without
specifying variogram models. Journal of Agricultural, Biological, and Environ-
mental Statistics, 1:297–322.
Bartlett, M.S. (1964) The spectral analysis of two-dimensional point processes.
Biometrika, 51:299–311.
Bartlett, M.S. (1978) Stochastic Processes. Methods and Applications. Cambridge
University Press, London.
Belsley, D.A., Kuh, E., and Welsch, R.E. (1980), Regression Diagnostics; Identifying
Influential Data and Sources of Collinearity. John Wiley & Sons, New York.
Berger, J.O., De Oliveira, V., and Sansó, B. (2001) Objective Bayesian analysis
of spatially correlated data. Journal of the American Statistical Association, 96:
1361–1374.
Besag, J. (1974) Spatial interaction and the statistical analysis of lattice systems.
Journal of the Royal Statistical Society, Series B, 36:192–225.
Besag, J. (1975) Statistical analysis of non-lattice data. The Statistician, 24:179–195.


Besag, J., Green, P., Higdon, D., and Mengersen, K. (1995) Bayesian computation
and stochastic systems (with discussion), Statistical Science, 10:3–66.
Besag, J. and Kempton, R. (1986) Statistical analysis of field experiments using
neighbouring plots. Biometrics, 42:231–251.
Besag, J. and Kooperberg, C. (1995) On conditional and intrinsic autoregressions.
Biometrika, 82:733–746.
Besag, J. and Newell, J. (1991) The detection of clusters in rare diseases. Journal of
the Royal Statistical Society, Series A, 154:327–333.
Besag, J., York, J., and Mollié, A. (1991) Bayesian image restoration, with two appli-
cations in spatial statistics (with discussion). Annals of the Institute of Statistical
Mathematics, 43:1–59.
Bloomfield, P. (1976) Fourier Analysis of Time Series: An Introduction. John Wiley
& Sons, New York.
Boufassa, A. and Armstrong, M. (1989) Comparison between different kriging esti-
mators. Mathematical Geology, 21:331–345.
Box, G.E.P. and Cox, D.R. (1964) An analysis of transformations. Journal of the
Royal Statistical Society, Series B, 26:211–243.
Breslow, N.E. and Clayton, D.G. (1993) Approximate inference in generalized linear
mixed models. Journal of the American Statistical Association, 88:9–25.
Breusch, T.S. (1980) Useful invariance results for generalized regression models.
Journal of Econometrics, 13:327–340.
Brillinger, D.R. (1972) The spectral analysis of stationary interval functions. In Pro-
ceedings of the 6th Berkeley Symposium on Mathematical Statistics and Proba-
bility, 1:483–513.
Brown, R.L., Durbin, J., and Evans, J.M. (1975) Techniques for testing the con-
stancy of regression relationships over time. Journal of the Royal Statistical So-
ciety (B), 37:149–192.
Brownie, C., Bowman, D.T., and Burton, J.W. (1993) Estimating spatial variation
in analysis of data from yield trials: a comparison of methods. Agronomy Journal,
85:1244–1253.
Brownie, C. and Gumpertz, M.L. (1997) Validity of spatial analyses of large field
trials. Journal of Agricultural, Biological, and Environmental Statistics, 2(1):1–
23.
Burnham, K.P. and Anderson, D.R. (1998) Model Selection and Inference: A Prac-
tical Information-Theoretic Approach. Springer-Verlag, New York.
Burt, W.H. (1943) Territoriality and home range concepts as applied to mammals.
Journal of Wildlife Management, 54:310–315.
Carlin, B.P. and Louis, T.A. (2000) Bayes and Empirical Bayes Methods for Data
Analysis. 2nd Edition, Chapman and Hall/CRC, Boca Raton, FL.
Casella, G. and George, E.I. (1992) Explaining the Gibbs sampler. The American
Statistician, 46:167–174.
Chatfield, C. (1996) The Analysis of Time Series; An Introduction, 5th ed. Chapman
& Hall, London.
Cherry, S., Banfield, J., and Quimby, W. (1996) An evaluation of a nonparametric
method of estimating semi-variograms of isotropic spatial processes. Journal of
Applied Statistics, 23:435–449.
Chib, S. and Greenberg, E. (1995) Understanding the Metropolis-Hastings algo-
rithm. The American Statistician, 49:327–335.
Chilès, J.P. and Delfiner, P. (1999) Geostatistics. Modeling Spatial Uncertainty.
John Wiley & Sons, New York.
Choi, E. and Hall, P. (1999) Nonparametric approach to analysis of space-time data
on earthquake occurrences. Journal of Computational and Graphical Statistics,
8:733–748.
Christensen, R. (1991) Linear Models for Multivariate, Time Series, and Spatial
Data. Springer-Verlag, New York.
Clark, I. (1979) Practical Geostatistics. Applied Science Publishers, Essex, England.
Clayton, D.G. and Kaldor, J. (1987) Empirical Bayes estimates of age-standardized
relative risks for use in disease mapping, Biometrics, 43:671–682.
Cleveland, W.S. (1979) Robust locally weighted regression and smoothing scatter-
plots. Journal of the American Statistical Association, 74:829–836.
Cliff, A.D. and Ord, J.K. (1981) Spatial Processes; Models and Applications. Pion
Limited, London.
Cody, W.J. (1987) SPECFUN—A portable special function package. In New Com-
puting Environments: Microcomputers in Large-Scale Scientific Computing, ed.
A. Wouk, SIAM, Philadelphia, 1–12.
Congdon, P. (2001) Bayesian Statistical Modelling. John Wiley & Sons, Chichester.
Congdon, P. (2003) Applied Bayesian Modelling. John Wiley & Sons, Chichester.
Cook, R.D. (1977) Detection of influential observations in linear regression. Tech-
nometrics, 19:15–18.
Cook, R.D. (1979) Influential observations in linear regression. Journal of the Amer-
ican Statistical Association, 74:169–174.
Cook, R.D. and Weisberg, S. (1982) Residuals and Influence in Regression. Chapman
and Hall, New York.
Cowles, M.K. and Carlin, B.P. (1996) Markov chain Monte Carlo convergence di-
agnostics: a comparative review. Journal of the American Statistical Association,
91:883–904.
Cressie, N. (1985) Fitting variogram models by weighted least squares. Journal of
the International Association for Mathematical Geology, 17:563–586.
Cressie, N. (1990) The origins of kriging. Mathematical Geology, 22:239–252.
Cressie, N. (1992) Smoothing regional maps using empirical Bayes predictors. Geo-
graphical Analysis, 24:75–95.
Cressie, N.A.C. (1993) Statistics for Spatial Data. Revised ed. John Wiley & Sons,
New York.
Cressie, N. (1993b) Aggregation in geostatistical problems. In: A. Soares (ed.), Geo-
statistics Troia ’92. Kluwer Academic Publishers, Dordrecht, 25–35.
Cressie, N. (1998) Fundamentals of spatial statistics. In: W.G. Müller (ed.), Collecting
Spatial Data: Optimum Design of Experiments for Random Fields. Physica-Verlag,
9–33.
Cressie, N. and Grondona, M.O. (1992) A comparison of variogram estimation with
covariogram estimation. In The Art of Statistical Science, ed. K.V. Mardia. John
Wiley & Sons, New York, 191–208.
Cressie, N. and Hawkins, D.M. (1980) Robust estimation of the variogram, I. Journal
of the International Association for Mathematical Geology, 12:115–125.
Cressie, N. and Huang, H.-C. (1999) Classes of nonseparable, spatio-temporal sta-
tionary covariance functions. Journal of the American Statistical Association,
94:1330–1340.
Cressie, N. and Lahiri, S.N. (1996) Asymptotics for REML estimation of spatial
covariance parameters. Journal of Statistical Planning and Inference, 50:327–341.
Cressie, N. and Majure, J.J. (1997) Spatio-temporal statistical modeling of live-
stock waste in streams. Journal of Agricultural, Biological, and Environmental
Statistics, 2:24–47.
Croux, C. and Rousseeuw, P.J. (1992) Time-efficient algorithms for two highly robust
estimators of scale. In Computational Statistics, Vol. 1, eds. Y. Dodge and J.
Whittaker, Physica-Verlag, Heidelberg, 411–428.
Curriero, F.C. (1996) The use of non-Euclidean distances in geostatistics. Ph.D.
thesis, Department of Statistics, Kansas State University, Manhattan, KS.
Curriero, F.C. (2004) Norm dependent isotropic covariogram and variogram models.
Journal of Statistical Planning and Inference, In review.
Curriero, F.C. and Lele, S. (1999) A composite likelihood approach to semivari-
ogram estimation. Journal of Agricultural, Biological, and Environmental Statis-
tics, 4(1):9–28.
Cuzick, J. and Edwards, R. (1990) Spatial clustering for inhomogeneous populations
(with discussion). Journal of the Royal Statistical Society, Series B, 52:73–104.
David, M. (1988) Handbook of Applied Advanced Geostatistical Ore Reserve Esti-
mation. Elsevier, Amsterdam.
Davidian, M. and Giltinan, D.M. (1995) Nonlinear Models for Repeated Measure-
ment Data. Chapman and Hall, New York.
De Cesare, L., Myers, D., and Posa, D. (2001) Estimating and modeling space-time
correlation structures. Statistics and Probability Letters, 51(1):9–14.
De Iaco, S., Myers, D.E., and Posa, D. (2002) Nonseparable space-time covariance
models: some parametric families. Mathematical Geology, 34(1):23–42.
De Oliveira, V., Kedem, B. and Short, D.A. (1997) Bayesian prediction of trans-
formed Gaussian random fields. Journal of the American Statistical Association,
92:1422–1433.
Deutsch, C.V. and Journel, A.G. (1992) GSLIB: Geostatistical Software Library and
User’s Guide. Oxford University Press, New York.
Devine, O.J., Louis, T.A. and Halloran, M.E. (1994) Empirical Bayes methods for
stabilizing incidence rates before mapping, Epidemiology, 5(6):622–630.
Diggle, P.J. (1983) Statistical Analysis of Point Processes. Chapman and Hall, New
York.
Diggle, P.J. (1985) A kernel method for smoothing point process data. Applied
Statistics, 34:138–147.
Diggle, P., Besag, J.E., and Gleaves, J.T. (1976) Statistical analysis of spatial pat-
terns by means of distance methods. Biometrics, 32:659–667.
Diggle, P.J. and Chetwynd, A.G. (1991) Second-order analysis of spatial clustering
for inhomogeneous populations. Biometrics, 47:1155–1163.
Diggle, P.J., Tawn, J.A., and Moyeed, R.A. (1998) Model-based geostatistics. Ap-
plied Statistics, 47:299–350.
Dobson, A.J. (1990) An Introduction to Generalized Linear Models. Chapman and
Hall, London.
Dorai-Raj, S.S. (2001) First- and Second-Order Properties of Spatiotemporal Point
Processes in the Space-Time and Frequency Domains. Ph.D. Dissertation, Dept.
of Statistics, Virginia Polytechnic Institute and State University.
Dowd, P.A. (1982) Lognormal kriging—the general case. Journal of the International
Association for Mathematical Geology, 14:475–499.
Draper, N. and Smith, H. (1981) Applied Regression Analysis, 2nd edition. John
Wiley & Sons, New York.
Eaton, M.L. (1985) The Gauss-Markov theorem in multivariate analysis. In Multi-
variate Analysis—VI, ed. P.R. Krishnaiah, Elsevier, Amsterdam, 177–201.
Ecker, M.D. and Gelfand, A.E. (1997) Bayesian variogram modeling for an isotropic
spatial process. Journal of Agricultural, Biological, and Environmental Statistics,
2:347–369.
Fedorov, V.V. (1974) Regression problems with controllable variables subject to
error. Biometrika, 61:49–65.
Fisher, R.A., Thornton, H.G., and MacKenzie, W.A. (1922) The accuracy of the
plating method of estimating the density of bacterial populations. Annals of Ap-
plied Biology, 9:325–359.
Fotheringham, A.S., Brunsdon, C. and Charlton, M. (2002) Geographically
Weighted Regression. John Wiley & Sons, New York.
Freeman, M.F. and Tukey, J.W. (1950) Transformations related to the angular and
the square root. Annals of Mathematical Statistics, 21:607–611.
Fuentes, M. (2001) A high frequency kriging approach for non-stationary environ-
mental processes. Environmetrics, 12:469–483.
Fuller, W.A. and Battese, G.E. (1973) Transformations for estimation of linear mod-
els with nested error structure. Journal of the American Statistical Association,
68:626–632.
Galpin, J.S. and Hawkins, D.M. (1984) The use of recursive residuals in checking
model fit in linear regression. The American Statistician, 38(2):94–105.
Gandin, L.S. (1963) Objective Analysis of Meteorological Fields. Gidrometeoro-
logicheskoe Izdatel’stvo (GIMIZ), Leningrad (translated by Israel Program for
Scientific Translations, Jerusalem, 1965).
Gaudard, M., Karson, M., Linder, E., and Sinha, D. (1999) Bayesian spatial predic-
tion. Environmental and Ecological Statistics, 6:147–171.
Geary, R.C. (1954) The contiguity ratio and statistical mapping. The Incorporated
Statistician, 5:115–145.
Gelfand, A.E. and Smith, A.F.M. (1990) Sampling-based approaches to calculating
marginal densities. Journal of the American Statistical Association, 85:398–409.
Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (2004) Bayesian Data Anal-
ysis. Chapman and Hall/CRC, Boca Raton, FL.
Gelman, A. and Rubin, D.B. (1992) Inference from iterative simulation using mul-
tiple sequences (with discussion). Statistical Science, 7:457–511.
Geman, S. and Geman, D. (1984) Stochastic relaxation, Gibbs distributions and
the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 6:721–741.
Genton, M.G. (1998a) Highly robust variogram estimation. Mathematical Geology,
30:213–221.
Genton, M.G. (1998b) Variogram fitting by generalized least squares using an ex-
plicit formula for the covariance structure. Mathematical Geology, 30:323–345.
Genton, M.G. (2000) The correlation structure of Matheron’s classical variogram
estimator under elliptically contoured distributions. Mathematical Geology,
32(1):127–137.
Genton, M.G. (2001) Robustness problems in the analysis of spatial data. In Spatial
Statistics: Methodological Aspects and Applications, ed. M. Moore, Springer-
Verlag, New York, 21–38.
Genton, M.G. and Gorsich, D.J. (2002) Nonparametric variogram and covariogram
estimation with Fourier-Bessel matrices. Computational Statistics and Data Anal-
ysis, 41:47–57.
Genton, M.G., He, L., and Liu, X. (2001) Moments of skew-normal random vectors
and their quadratic forms. Statistics & Probability Letters, 51:319–325.
Gerrard, D.J. (1969) Competition quotient: a new measure of the competition affect-
ing individual forest trees. Research Bulletin No. 20, Michigan Agricultural Ex-
periment Station, Michigan State University.
Ghosh, M., Natarajan, K., Waller, L.A., and Kim, D. (1999) Hierarchical GLMs
for the analysis of spatial data: an application to disease mapping. Journal of
Statistical Planning and Inference, 75:305–318.
Gilks, W.R. (1996) Full conditional distributions. In Gilks, W.R., Richardson, S.,
and Spiegelhalter, D.J. (eds.) Markov Chain Monte Carlo in Practice. Chapman
& Hall/CRC, Boca Raton, FL, 75–88.
Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. (1996) Introducing Markov
chain Monte Carlo. In Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. (eds.)
Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC, Boca Raton, FL,
1–19.
Gilliland, D. and Schabenberger, O. (2001) Limits on pairwise association for equi-
correlated binary variables. Journal of Applied Statistical Science, 10:279–285.
Gneiting, T. (2002) Nonseparable, stationary covariance functions for space-time
data. Journal of the American Statistical Association, 97:590–600.
Godambe, V.P. (1960) An optimum property of regular maximum likelihood esti-
mation. Annals of Mathematical Statistics, 31:1208–1211.
Goldberger, A.S. (1962) Best linear unbiased prediction in the generalized linear
regression model. Journal of the American Statistical Association, 57:369–375.
Goldberger, A.S. (1991) A Course in Econometrics. Harvard University Press, Cam-
bridge, MA.
Goodnight, J.H. (1979) A tutorial on the sweep operator. The American Statistician,
33:149–158. (Also available as The Sweep Operator: Its Importance in Statistical
Computing, SAS Technical Report R-106, SAS Institute, Inc. Cary, NC).
Goodnight, J.H. and Hemmerle, W.J. (1979) A simplified algorithm for the W-
transformation in variance component estimation. Technometrics, 21:265–268.
Goovaerts, P. (1994) Comparative performance of indicator algorithms for modeling
conditional probability distribution functions. Mathematical Geology, 26:389–411.
Goovaerts, P. (1997) Geostatistics for Natural Resource Evaluation. Oxford Univer-
sity Press, Oxford, UK.
Gorsich, D.J. and Genton, M.G. (2000) Variogram model selection via nonparamet-
ric derivative estimation. Mathematical Geology, 32(3):249–270.
Gotway, C.A. and Cressie, N. (1993) Improved multivariate prediction under a gen-
eral linear model. Journal of Multivariate Analysis, 45:56–72.
Gotway, C.A. and Hergert, G.W. (1997) Incorporating spatial trends and anisotropy
in geostatistical mapping of soil properties. Soil Science Society of America Jour-
nal, 61:298–309.
Gotway, C.A. and Stroup, W.W. (1997) A generalized linear model approach to
spatial data analysis and prediction. Journal of Agricultural, Biological and En-
vironmental Statistics, 2:157–178.
Gotway, C.A. and Wolfinger, R.D. (2003) Spatial prediction of counts and rates.
Statistics in Medicine, 22:1415–1432.
Gotway, C.A. and Young, L.J. (2002) Combining incompatible spatial data. Journal
of the American Statistical Association, 97:632–648.
Graybill, F.A. (1983) Matrices With Applications in Statistics. 2nd ed. Wadsworth
International, Belmont, CA.
Greig-Smith, P. (1952) The use of random and contiguous quadrats in the study of
the structure of plant communities. Annals of Botany, 16:293–316.
Grondona, M.O. (1989) Estimation and design with correlated observations. Ph.D.
Dissertation, Iowa State University.
Grondona, M.O. and Cressie, N. (1995). Residuals based estimators of the covari-
ogram. Statistics, 26:209–218.
Haas, T.C. (1990) Lognormal and moving window methods of estimating acid de-
position. Journal of the American Statistical Association, 85:950–963.
Haas, T.C. (1995) Local prediction of a spatio-temporal process with an applica-
tion to wet sulfate deposition. Journal of the American Statistical Association,
90:1189–1199.
Haining, R. (1990) Spatial Data Analysis in the Social and Environmental Sciences.
Cambridge University Press, Cambridge.
Haining, R. (1994) Diagnostics for regression modeling in spatial econometrics. Jour-
nal of Regional Science, 34:325–341.
Hampel, F.R., Rochetti, E.M., Rousseeuw, P.J., and Stahel, W.A. (1986) Robust
Statistics, the Approach Based on Influence Functions. John Wiley & Sons, New
York.
Handcock, M.S. and Stein, M.L. (1993) A Bayesian analysis of kriging. Technomet-
rics, 35:403–410.
Handcock, M.S. and Wallis, J.R. (1994) An approach to statistical spatial-temporal
modeling of meteorological fields (with discussion). Journal of the American Sta-
tistical Association, 89:368–390.
Hanisch, K.-H. and Stoyan, D. (1979) Formulas for the second-order analysis of
marked point processes. Mathematische Operationsforschung und Statistik. Series
Statistics, 10:555–560.
Harville, D.A. (1974) Bayesian inference for variance components using only error
contrasts. Biometrika, 61:383–385.
Harville, D.A. (1977) Maximum-likelihood approaches to variance component esti-
mation and to related problems. Journal of the American Statistical Association,
72:320–340.
Harville, D.A. and Jeske, D.R. (1992) Mean squared error of estimation or prediction
under a general linear model. Journal of the American Statistical Association,
87:724–731.
Haslett, J. and Hayes, K. (1998) Residuals for the linear model with general covari-
ance structure. Journal of the Royal Statistical Society, Series B, 60:201–215.
Hastings, W.K. (1970) Monte Carlo sampling methods using Markov chains and
their applications. Biometrika, 57:97–109.
Hawkins, D.M. (1981) A cusum for a scale parameter. Journal of Quality Technology,
13:228–231.
Hawkins, D.M. and Cressie, N.A.C. (1984) Robust kriging—a proposal. Journal of
the International Association of Mathematical Geology, 16:3–18.
Heagerty, P.J. and Lele, S.R. (1998) A composite likelihood approach to binary
spatial data. Journal of the American Statistical Association, 93:1099–1111.
Henderson, C.R. (1950) The estimation of genetic parameters. The Annals of Math-
ematical Statistics, 21:309–310.
Heyde, C.C. (1997) Quasi-Likelihood and Its Applications. A General Approach to
Optimal Parameter Estimation. Springer-Verlag, New York.
Higdon, D. (1998) A process-convolution approach to modeling temperatures in the
North Atlantic Ocean. Environmental and Ecological Statistics, 5(2):173–190.
Higdon, D., Swall, J., and Kern, J. (1999) Non-stationary spatial modeling. Bayesian
Statistics, 6:761–768.
Hinkelmann, K. and Kempthorne, O. (1994) Design and Analysis of Experiments.
Volume I. Introduction to Experimental Design. John Wiley & Sons, New York.
Houseman, E.A., Ryan, L.M., and Coull, B.A. (2004) Cholesky residuals for assessing
normal errors in a linear model with correlated outcomes. Journal of the American
Statistical Association, 99:383–394.
Huang, J.S. and Kotz, S. (1984) Correlation structure in iterated Farlie-Gumbel-
Morgenstern distributions. Biometrika, 71:633–636.
Hughes-Oliver, J.M., Gonzalez-Farias, G., Lu, J.-C., and Chen, D. (1998) Parametric
nonstationary correlation models. Statistics & Probability Letters, 40:267–278.
Hughes-Oliver, J.M., Lu, J.-C., Davis, J.C., and Gyurcsik, R.S. (1998) Achieving
uniformity in a semiconductor fabrication process using spatial modeling. Journal
of the American Statistical Association, 93:36–45.
Hurvich, C.M. and Tsai, C.-L. (1989) Regression and time series model selection in
small samples. Biometrika, 76:297–307.
Isaaks, E.H. and Srivastava, R.M. (1989) An Introduction to Applied Geostatistics.
Oxford University Press, New York.
Jensen, D.R. and Ramirez, D.E. (1999) Recovered errors and normal diagnostics in
regression. Metrika, 49:107–119.
Johnson, M.E. (1987) Multivariate Statistical Simulation. John Wiley & Sons, New
York.
Johnson, N.L. and Kotz, S. (1972) Distributions in Statistics: Continuous Multivari-
ate Distributions. John Wiley & Sons, New York.
Johnson, N.L. and Kotz, S. (1975) On some generalized Farlie-Gumbel-Morgenstern
distributions. Communications in Statistics, 4:415–427.
Jones, R.H. (1993) Longitudinal Data With Serial Correlation: A State-space Ap-
proach. Chapman and Hall, New York.
Jones, R.H. and Zhang, Y. (1997) Models for continuous stationary space-time pro-
cesses. In Gregoire, T.G., Brillinger, D.R., Diggle, P.J., Russek-Cohen, E., Warren,
W.G., and Wolfinger, R.D. (eds.) Modeling Longitudinal and Spatially Correlated
Data, Springer-Verlag, New York, 289–298.
Journel, A.G. (1980) The lognormal approach to predicting local distributions of
selective mining unit grades. Journal of the International Association for Mathe-
matical Geology, 12:285–303.
Journel, A.G. (1983) Nonparametric estimation of spatial distributions. Journal of
the International Association for Mathematical Geology, 15:445–468.
Journel, A.G. and Huijbregts, C.J. (1978) Mining Geostatistics. Academic Press,
London.
Jowett, G.H. (1955a) The comparison of means of industrial time series. Applied
Statistics, 4:32–46.
Jowett, G.H. (1955b) The comparison of means of sets of observations from sections
of independent stochastic series. Journal of the Royal Statistical Society, (B),
17:208–227.
Jowett, G.H. (1955c) Sampling properties of local statistics in stationary stochastic
series. Biometrika, 42:160–169.
Judge, G.G., Griffiths, W.E., Hill, R.C., Lütkepohl, H., and Lee, T.-C. (1985) The
Theory and Practice of Econometrics, John Wiley & Sons, New York.
Kackar, R.N. and Harville, D.A. (1984) Approximations for standard errors of estima-
tors of fixed and random effects in mixed linear models. Journal of the American
Statistical Association, 79:853–862.
Kaluzny, S.P., Vega, S.C., Cardoso, T.P., and Shelly, A.A. (1998) S+SpatialStats.
User’s Manual for Windows® and Unix®. Springer-Verlag, New York.
Kelsall, J.E. and Diggle, P.J. (1995) Non-parametric estimation of spatial variation
in relative risk. Statistics in Medicine, 14: 2335–2342.
Kelsall, J.E. and Wakefield, J.C. (1999) Discussion of Best et al. (1999). In Bernardo,
J.M., Berger, J.O., Dawid, A.P., and Smith, A.F.M. (eds.) Bayesian Statistics 6,
Oxford University Press, Oxford, p. 151.
Kempthorne, O. (1955) The randomization theory of experimental inference. Journal
of the American Statistical Association, 50:946–967.
Kenward, M.G. and Roger, J.H. (1997) Small sample inference for fixed effects from
restricted maximum likelihood. Biometrics, 53:983–997.
Kern, J.C. and Higdon, D.M. (2000) A distance metric to account for edge effects
in spatial analysis. In Proceedings of the American Statistical Association, Bio-
metrics Section, Alexandria, VA, 49–52.
Kianifard, F. and Swallow, W.H. (1996) A review of the development and applica-
tion of recursive residuals in linear models. Journal of the American Statistical
Association, 91:391–400.
Kim, H. and Mallick, B.K. (2002) Analyzing spatial data using skew-Gaussian pro-
cesses. In Lawson, A.B. and Denison, D.G.T. (eds.) Spatial Cluster Modelling,
Chapman & Hall/CRC, Boca Raton, FL, 163–173.
Kitanidis, P.K. (1983) Statistical estimation of polynomial generalized covariance
functions and hydrological applications. Water Resources Research, 19:909–921.
Kitanidis, P.K. (1986) Parameter uncertainty in estimation of spatial functions:
Bayesian analysis. Water Resources Research, 22:499–507.
Kitanidis, P.K. and Lane, R.W. (1985) Maximum likelihood parameter estimation
of hydrological spatial processes by the Gauss-Newton method. Journal of Hy-
drology, 79:53–71.
Kitanidis, P.K. and Vomvoris, E.G. (1983) A geostatistical approach to the inverse
problem in groundwater modeling (steady state) and one-dimensional simulations.
Water Resources Research, 19:677–690.
Knox, G. (1964) Epidemiology of childhood leukemia in Northumberland and
Durham. British Journal of Preventative and Social Medicine, 18:17–24.
Krige, D.G. (1951) A statistical approach to some basic mine valuation problems
on the Witwatersrand. Journal of Chemical, Metallurgical, and Mining Society of
South Africa, 52:119–139.
Krivoruchko, K. and Gribov, A. (2004) Geostatistical interpolation in the presence
of barriers. In: geoENV IV - Geostatistics for Environmental Applications: Pro-
ceedings of the Fourth European Conference on Geostatistics for Environmental
Applications 2002 (Quantitative Geology and Geostatistics), 331–342.
Kulldorff, M. (1997) A spatial scan statistic. Communications in Statistics-Theory
and Methods, 26:1487–1496.
Kulldorff, M. and International Management Services, Inc. (2003) SaTScan v. 4.0:
Software for the spatial and space-time scan statistics. National Cancer Institute,
Bethesda, MD.
Kulldorff, M. and Nagarwalla, N. (1995) Spatial disease clusters: detection and in-
ference. Statistics in Medicine, 14:799–810.
Kupper, L.L. and Haseman, J.K. (1978) The use of a correlated binomial model for
the analysis of certain toxicological experiments. Biometrics, 34:69–76.


Laarhoven, P.J.M. van and Aarts, E.H.L. (1987) Simulated Annealing: Theory and
Applications. Reidel Publishing, Dordrecht, Holland.
Lahiri, S.N., Lee, Y., and Cressie, N. (2002) On asymptotic distribution and asymptotic
efficiency of least squares estimators of spatial variogram parameters. Journal of
Statistical Planning and Inference, 103:65–85.
Lancaster, H. O. (1958) The structure of bivariate distributions. Annals of Mathe-
matical Statistics, 29:719–736.
Lawson, A.B. and Denison, D.G.T. (2002) Spatial Cluster Modelling, Chapman and
Hall/CRC, Boca Raton, FL.
Le, N.D. and Zidek, J.V. (1992) Interpolation with uncertain spatial covariances: a
Bayesian alternative to kriging. Journal of Multivariate Analysis, 43:351–374.
Lele, S. (1997) Estimating functions for semivariogram estimation, In: Selected Pro-
ceedings of the Symposium on Estimating Functions, eds. I.V. Basawa, V.P. Go-
dambe, and R.L. Taylor, Hayward, CA: Institute of Mathematical Statistics, 381–
396.
Lele, S. (2004) On using expert opinion in ecological analyses: A frequentist ap-
proach. Environmental and Ecological Statistics, to appear.
Lewis, P.A.W. and Shedler, G.S. (1979) Simulation of non-homogeneous Poisson
processes by thinning. Naval Research Logistics Quarterly, 26:403–413.
Liang, K.-Y. and Zeger, S.L. (1986) Longitudinal data analysis using generalized
linear models. Biometrika, 73:13–22.
Lindsay, B.G. (1988) Composite likelihood methods. Contemporary Mathematics,
80:221–239.
Littell, R.C., Milliken, G.A., Stroup, W.W., and Wolfinger, R.D. (1996) The SAS
System for Mixed Models. SAS Institute, Inc., Cary, NC.
Little, L.S., Edwards, D. and Porter, D.E. (1997) Kriging in estuaries: as the crow
flies or as the fish swims? Journal of Experimental Marine Biology and Ecology,
213:1–11.
Lotwick, H.W. and Silverman, B.W. (1982) Methods for analysing spatial processes
of several types of points. Journal of the Royal Statistical Society, Series B, 44:
406–413.
Ma, C. (2002) Spatio-temporal covariance functions generated by mixtures. Mathe-
matical Geology, 34(8):965–975.
Mantel, N. (1967) The detection of disease clustering and a generalized regression
approach. Cancer Research, 27(2):209–220.
Marcotte, D. and Groleau, P. (1997) A simple and robust lognormal estimator.
Mathematical Geology, 29:993–1009.
Mardia, K.V. (1967) Some contributions to contingency-type bivariate distributions.
Biometrika, 54:235–249.
Mardia, K.V. (1970) Families of Bivariate Distributions. Hafner Publishing Com-
pany, Darien, CT.
Mardia, K.V., Kent, J.T., and Bibby, J.M. (1979) Multivariate Analysis. Academic
Press, London.
Mardia, K.V. and Marshall, R.J. (1984) Maximum likelihood estimation of models
for residual covariance in spatial regression. Biometrika, 71:135–146.
Marshall, R.J. (1991) Mapping disease and mortality rates using empirical Bayes
estimators. Applied Statistics, 40:283–294.
Martin, R.J. (1992) Leverage, influence and residuals in regression models when
observations are correlated. Communications in Statistics–Theory and Methods,
21:1183–1212.
Matérn, B. (1960) Spatial variation. Meddelanden från Skogsforskningsinstitut,
49(5).
Matérn, B. (1986) Spatial Variation, 2nd ed. Lecture Notes in Statistics, Springer-
Verlag, New York.
Matheron, G. (1962) Traité de Géostatistique Appliquée, Tome I. Mémoires du Bu-
reau de Recherches Géologiques et Minières, No. 14. Editions Technip, Paris.
Matheron, G. (1963) Principles of geostatistics. Economic Geology, 58:1246–1266.
Matheron, G. (1976) A simple substitute for conditional expectation: The disjunc-
tive kriging. In: M. Guarascio, M. David, and C. Huijbregts (eds.), Advanced
Geostatistics in the Mining Industry. Reidel, Dordrecht, 221–236.
Matheron, G. (1982) La destructuration des hautes teneures et le krigeage des indica-
trices. Technical Report N-761, Centre de Géostatistique, Fontainebleau, France.
Matheron, G. (1984) Isofactorial models and change of support. In: G. Verly, M.
David, A. Journel, A. Marechal (eds.), Geostatistics for Natural Resources Char-
acterization. Reidel, Dordrecht, 449–467.
McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models, Second Edition.
Chapman and Hall, New York.
McMillen D.P. (2003) Spatial autocorrelation or model misspecification? Interna-
tional Regional Science Review, 26:208–217.
McShane, L.M., Albert, P.S., and Palmatier, M.A. (1997) A latent process regression
model for spatially correlated count data. Biometrics, 53:698–706.
Mercer, W.B. and Hall, A.D. (1911) The experimental error of field trials. Journal
of Agricultural Science, 4:107–132.
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., and Teller, E.
(1953) Equation of state calculations by fast computing machines. Journal of
Chemical Physics, 21:1087–1092.
Mitchell, M.W. and Gumpertz, M.L. (2003) Spatio-temporal prediction inside a
free-air CO2 enrichment system, Journal of Agricultural, Biological, and Envi-
ronmental Statistics, 8(3):310–327.
Mockus, A. (1998) Estimating dependencies from spatial averages. Journal of Com-
putational and Graphical Statistics, 7:501–513.
Møller, J. and Waagepetersen, R.P. (2003) Statistical Inference and Simulation for
Spatial Point Processes, Chapman & Hall/CRC, Boca Raton, FL.
Moran, P.A.P. (1950) Notes on continuous stochastic phenomena, Biometrika, 37:17–
23.
Moyeed, R.A. and Papritz, A. (2002) An empirical comparison of kriging methods
for nonlinear spatial prediction. Mathematical Geology, 34:365–386.
Mugglestone, M.A. and Renshaw, E. (1996a) A practical guide to the spectral anal-
ysis of spatial point processes. Computational Statistics & Data Analysis, 21:43–65.
Mugglestone, M.A. and Renshaw, E. (1996b) The exploratory analysis of bivariate
spatial point patterns using cross-spectra. Environmetrics, 7:361–377.
Müller, W.G. (1999) Least-squares fitting from the variogram cloud. Statistics &
Probability Letters, 43:93–98.
Nadaraya, E.A. (1964) On estimating regression. Theory of Probability and its Ap-
plications, 10:186–190.
Neyman, J. and Scott, E.L. (1972) Processes of clustering and applications. In:
P.A.W. Lewis (ed.) Stochastic Point Processes. John Wiley & Sons, New York,
646–681.
Ogata, Y. (1999) Seismicity analysis through point-process modeling: a review. Pure
and Applied Geophysics, 155:471–507.
O’Hagan, A. (1994) Bayesian Inference. Kendall’s Advanced Theory of Statistics,
2B, Edward Arnold Publishers, London.
Olea, R. A. (ed.) (1991) Geostatistical Glossary and Multilingual Dictionary. Oxford
University Press, New York.
Olea, R. A. (1999) Geostatistics for Engineers and Earth Scientists. Kluwer Aca-
demic Publishers, Norwell, Massachusetts.
Openshaw, S. (1984) The Modifiable Areal Unit Problem. Geobooks, Norwich, Eng-
land.
Openshaw, S. and Taylor, P. (1979) A million or so correlation coefficients. In N.
Wrigley (ed.), Statistical Methods in the Spatial Sciences. Pion, London, 127–144.
Ord, K. (1975) Estimation methods for models of spatial interaction. Journal of the
American Statistical Association, 70:120–126.
Ord, K. (1990) Discussion of “Spatial Clustering for Inhomogeneous Populations”
by J. Cuzick and R. Edwards. Journal of the Royal Statistical Society, Series B,
52:97.
Pagano, M. (1971) Some asymptotic properties of a two-dimensional periodogram.
Journal of Applied Probability, 8:841–847.
Papadakis, J.S. (1937) Méthode statistique pour des expériences sur champ. Bull.
Inst. Amelior. Plant. Thessalonique, 23.
Patterson, H.D. and Thompson, R. (1971) Recovery of inter-block information when
block sizes are unequal. Biometrika, 58:545–554.
Percival, D.B. and Walden, A.T. (1993) Spectral Analysis for Physical Applications:
Multitaper and Conventional Univariate Techniques. Cambridge University Press,
Cambridge, UK.
Plackett, R.L. (1965) A class of bivariate distributions. Journal of the American
Statistical Association, 60:516–522.
Posa, D. (1993) A simple description of spatial-temporal processes. Computational
Statistics & Data Analysis, 15:425–437.
Prasad, N.G.N. and Rao, J.N.K. (1990) The estimation of the mean squared error
of small-area estimators. Journal of the American Statistical Association, 85:161–
171.
Prentice, R.L. (1988) Correlated binary regression with covariates specific to each
binary observation. Biometrics, 44:1033–1048.
Priestley, M.B. (1981) Spectral analysis of time series. Volume 1: Univariate series.
Academic Press, New York.
Rathbun, S.L. (1996) Asymptotic properties of the maximum likelihood estimator
for spatio-temporal point processes. Journal of Statistical Planning and Inference,
51:55–74.
Rathbun, S.L. (1998) Kriging estuaries. Environmetrics, 9:109–129.
Rathbun, S.L. and Cressie, N.A.C. (1994) A space-time survival point process for
a longleaf pine forest in Southern Georgia. Journal of the American Statistical
Association, 89:1164–1174.
Rendu, J.M. (1979) Normal and lognormal estimation. Journal of the International
Association for Mathematical Geology, 11:407–422.
Renshaw, E. and Ford, E.D. (1983) The interpretation of process from pattern us-
ing two-dimensional spectral analysis: methods and problems of interpretation.
Applied Statistics, 32:51–63.
Ripley, B.D. (1976) The second-order analysis of stationary point processes. Journal
of Applied Probability, 13:255–266.
Ripley, B.D. (1977) Modeling spatial patterns. Journal of the Royal Statistical So-
ciety (B), 39:172–192 (with discussion, 192–212).
Ripley, B.D. (1981) Spatial Statistics. John Wiley & Sons, New York.
Ripley, B.D. (1987) Stochastic Simulation. John Wiley & Sons, Chichester.
Ripley, B.D. and Silverman, B.W. (1978) Quick tests for spatial interaction.
Biometrika, 65:641–642.
Rivoirard, J. (1994) Introduction to Disjunctive Kriging and Nonlinear Geostatistics.
Clarendon Press, Oxford.
Robert, C.P. and Casella, G. (1999) Monte Carlo Statistical Methods. Springer-
Verlag, New York.
Rogers, J.F., Thompson, S.J., Addy, C.L., McKeown, R.E., Cowen, D.J., and De-
Coulfé, P. (2000) The association of very low birthweight with exposures to envi-
ronmental sulfur dioxide and total suspended particulates. American Journal of
Epidemiology, 151:602–613.
Rousseeuw, P.J. and Croux, C. (1993) Alternatives to the median absolute deviation.
Journal of the American Statistical Association, 88:1273–1283.
Royaltey, H., Astrachan, E., and Sokal, R. (1975) Tests for patterns in geographic
variation. Geographical Analysis, 7:369–396.
Ruppert, D., Wand, M.P., and Carroll, R.J. (2003) Semiparametric Regression,
Cambridge University Press, Cambridge, UK.
Russo, D. and Bresler, E. (1981) Soil hydraulic properties as stochastic processes,
1. An analysis of field spatial variability. Journal of the Soil Science Society of
America, 45:682–687.
Russo, D. and Jury, W.A. (1987) A theoretical study of the estimation of the cor-
relation scale in spatially variable fields. 1. Stationary Fields. Water Resources
Research, 7:1257–1268.
Sampson, P.D. and Guttorp, P. (1992) Nonparametric estimation of nonstation-
ary spatial covariance structure. Journal of the American Statistical Association,
87:108–119.
Schabenberger, O. and Gregoire, T.G. (1996) Population-averaged and subject-
specific approaches for clustered categorical data. Journal of Statistical Com-
putation and Simulation, 54:231–253.
Schabenberger, O. and Pierce, F.J. (2002) Contemporary Statistical Models for the
Plant and Soil Sciences. CRC Press, Boca Raton, FL.
Schervish, M.J. and Carlin, B.P. (1992) On the convergence of successive substitu-
tion sampling. Journal of Computational and Graphical Statistics, 1:111–127.
Schmidt, P. (1976) Econometrics, Marcel Dekker, New York.
Searle, S.R., Casella, G., and McCulloch, C.E. (1992) Variance Components. John
Wiley & Sons, New York.
Self, S.G. and Liang, K.Y. (1987) Asymptotic properties of maximum likelihood
estimators and likelihood ratio tests under nonstandard conditions. Journal of
the American Statistical Association, 82:605–610.
Shapiro, A. and Botha, J.D. (1991) Variogram fitting with a general class of condi-
tionally nonnegative definite functions. Computational Statistics and Data Anal-
ysis, 11:87–96.
Smith, A.F.M. and Gelfand, A.E. (1992) Bayesian statistics without tears: A
sampling-resampling perspective. The American Statistician, 46:84–88.
Smith, A.F.M. and Roberts, G.O. (1993) Bayesian computation via the Gibbs sam-
pler and related Markov chain Monte Carlo methods. Journal of the Royal Sta-
tistical Society, Series B, 55:3–24.
Solie, J.B., Raun, W.R., and Stone, M.L. (1999) Submeter spatial variability of
selected soil and bermudagrass production variables. Journal of the Soil Science
Society of America, 63:1724–1733.
Stein, M.L. (1999) Interpolation of Spatial Data. Some Theory of Kriging. Springer-
Verlag, New York.
Stern, H. and Cressie, N. (1999) Inference for extremes in disease mapping. In A.
Lawson et al. (eds.) Disease Mapping and Risk Assessment for Public Health,
John Wiley & Sons, Chichester, 63–84.
Stoyan, D., Kendall, W.S. and Mecke, J. (1995) Stochastic Geometry and its Appli-
cations. 2nd ed. John Wiley & Sons, New York.
Stroup, D.F. (1990) Discussion of “Spatial Clustering for Inhomogeneous Popula-
tions” by J. Cuzick and R. Edwards. Journal of the Royal Statistical Society,
Series B, 52:99.
Stroup, W.W., Baenziger, P.S., and Mulitze, D.K. (1994) Removing spatial variation
from wheat yield trials: a comparison of methods. Crop Science, 86:62–66.
Stuart, A. and Ord, J.K. (1994) Kendall’s Advanced Theory of Statistics, Volume
I: Distribution Theory. Edward Arnold, London.
Sun, D., Tsutakawa, R.K., and Speckman, P. L. (1999) Posterior distribution of
hierarchical models using CAR(1) distributions, Biometrika, 86:341–350.
Switzer, P. (1977) Estimation of spatial distributions from point sources with ap-
plication to air pollution measurement. Bulletin of the International Statistical
Institute, 47:123–137.
Szidarovsky, F., Baafi, E.Y., and Kim, Y.C. (1987). Kriging without negative
weights, Mathematical Geology, 19:549–559.
Tanner, M.A. (1993) Tools for Statistical Inference. 2nd Edition, Springer-Verlag,
New York.
Tanner, M.A. and Wong, W.H. (1987) The calculation of posterior distributions by
data augmentation. Journal of the American Statistical Association, 82:528–540.
Theil, H. (1971) Principles of Econometrics. John Wiley & Sons, New York.
Thiébaux, H.J. and Pedder, M.A. (1987) Spatial objective analysis with applications
in atmospheric science. Academic Press, London.
Thompson, H.R. (1955) Spatial point processes with applications to ecology.
Biometrika, 42:102–115.
Thompson, H.R. (1958) The statistical study of plant distribution patterns using a
grid of quadrats. Australian Journal of Botany, 6:322–342.
Tierney, L. (1994) Markov chains for exploring posterior distributions (with discus-
sion). Annals of Statistics, 22:1701–1786.
Tobler, W. (1970) A computer movie simulating urban growth in the Detroit region.
Economic Geography, 46:234–240.
Toutenburg, H. (1982) Prior Information in Linear Models. John Wiley & Sons, New
York.
Turnbull, B.W., Iwano, E.J., Burnett, W.S., Howe, H.L., and Clark, L.C. (1990)
Monitoring for clusters of disease: Application to leukemia incidence in upstate
New York. American Journal of Epidemiology, 132:S136–S143.
Upton, G.J.G. and Fingleton, B. (1985) Spatial Data Analysis by Example, Vol. 1:
Point Pattern and Quantitative Data. John Wiley & Sons, New York.
Vallant, R. (1985) Nonlinear prediction theory and the estimation of proportions in
a finite population. Journal of the American Statistical Association, 80:631–641.
Vanmarcke, E. (1983) Random Fields: Analysis and Synthesis. MIT Press, Cam-
bridge, MA.
Verly, G. (1983) The Multi-Gaussian approach and its applications to the estima-
tion of local reserves. Journal of the International Association for Mathematical
Geology, 15:259–286.
Vijapurkar, U.P. and Gotway, C.A. (2001) Assessment of forecasts and forecast
uncertainty using generalized linear models for time series count data. Journal of
Statistical Computation and Simulation, 68:321–349.
Waller, L.A. and Gotway, C.A. (2004) Applied Spatial Statistics for Public Health
Data. John Wiley & Sons, New York.
Walters, J.R. (1990) Red-cockaded woodpeckers: a ’primitive’ cooperative breeder.
In: Cooperative Breeding in Birds: Long-term Studies of Ecology and Behaviour.
Cambridge University Press, Cambridge, 67–101.
Wand, M.P. and Jones, M.C. (1995) Kernel Smoothing. Chapman and Hall/CRC
Press, Boca Raton, FL.
Watson, G.S. (1964) Smooth regression analysis. Sankhya (A), 26:359–372.
Webster, R. and Oliver, M.A. (2001) Geostatistics for Environmental Scientists.
John Wiley & Sons, Chichester.
Wedderburn, R.W.M. (1974) Quasi-likelihood functions, generalized linear models
and the Gauss-Newton method. Biometrika, 61:439–447.
Whittaker, E.T. and Watson, G.N. (1927) A Course of Modern Analysis, 4th ed.,
Cambridge University Press, Cambridge, UK.
Whittle, P. (1954) On stationary processes in the plane. Biometrika, 41:434–449.
Wolfinger, R.D. (1993) Laplace’s approximation for nonlinear mixed models.
Biometrika, 80:791–795.
Wolfinger, R.D. and O’Connell, M. (1993) Generalized linear mixed models: a
pseudo-likelihood approach. Journal of Statistical Computation and Simulation,
48:233–243.
Wolfinger, R., Tobias, R., and Sall, J. (1994) Computing Gaussian likelihoods and
their derivatives for general linear mixed models. SIAM Journal on Scientific and
Statistical Computing, 15:1294–1310.
Wong, D.W.S. (1996) Aggregation effects in geo-referenced data. In D. Griffiths
(ed.), Advanced Spatial Statistics. CRC Press, Boca Raton, Florida, 83–106.
Yaglom, A. (1987) Correlation Theory of Stationary and Related Random Functions
I. Springer-Verlag, New York.
Yule, G. U. and Kendall, M. G. (1950) An Introduction to the Theory of Statistics.
14th Edition, Griffin, London.
Zahl, S. (1977) A comparison of three methods for the analysis of spatial pattern.
Biometrics, 33:681–692.
Zeger, S.L. (1988) A regression model for time series of counts. Biometrika, 75:621–
629.
Zeger, S.L. and Liang, K.-Y. (1986) Longitudinal data analysis for discrete and
continuous outcomes. Biometrics, 42:121–130.
Zeger, S.L., Liang, K.-Y., and Albert, P.S. (1988) Models for longitudinal data: a
generalized estimating equation approach. Biometrics, 44:1049–1060.
Zellner, A. (1986) Bayesian estimation and prediction using asymmetric loss func-
tions. Journal of the American Statistical Association, 81:446–451.
Zhang, H. (2004) Inconsistent estimation and asymptotically equal interpolators
in model-based geostatistics. Journal of the American Statistical Association,
99:250–261.
Zhao, L.P. and Prentice, R.L. (1990) Correlated binary regression using a quadratic
exponential model. Biometrika, 77:642–648.
Zimmerman, D.L. (1989). Computationally efficient restricted maximum likelihood
estimation of generalized covariance functions. Mathematical Geology, 21:655–
672.
Zimmerman, D.L. and Cressie, N.A. (1992) Mean squared prediction error in the
spatial linear model with estimated covariance parameters. Annals of the Institute
of Statistical Mathematics, 32:1–15.
Zimmerman, D.L. and Harville, D.A. (1991) A random field approach to the analysis
of field-plot experiments and other spatial experiments. Biometrics, 47:223–239.
Zimmerman, D.L. and Zimmerman, M.B. (1991) A comparison of spatial semivari-
ogram estimators and corresponding kriging predictors. Technometrics, 33:77–91.
Subject Index

K-function, see K-function and inference, 31


L-function, see L-function and precision of estimators, 34
δ function, 73 and prediction, 32
“G”aussian covariance model, 143 diagnosis based on residuals, 307
function, 25, 73
Adjacency weights, 396 in OLS residuals, 307
Affine transformation, 294 local indicators, 23, 24
Aggregation, 9, 340 measures, 14
effect, 285 measures on lattices, 18
horizontally, 92, 94 negative, 147, 178
of events, 92, 109 relay effect, 140
of quadrat counts into blocks, 93 spurious, 23
vertically, 92, 94 Autocovariance density function, 201
AIC, see Akaike’s information criterion Autocovariance function, 26
AICC, 347 Autoregression, 335
Akaike’s corrected information criterion, conditional, 336
347, 368 intrinsic, 396
Akaike’s information criterion, 236, 237, simultaneous, 335
346, 368 Autoregressive model, 35, 335
Analysis of covariance, 301
Analysis of variance, 92, 94, 299, 300
Anamorphosis function, 271 Bandwidth, 59, 79, 111, 240, 416
ANCOVA, 301 Basis function, 142, 181
Anisotropy, 27, 151–152, 245 and disjunctive kriging, 280
and Θ spectrum, 204 for linear spline, 327
and convolutions, 427 radial, 329
geometric, 45, 423, 427 Bayes
geometric (correction), 151 analysis, 385
in moving average model, 185 controversy, 384
in point process, 204 Empirical–, 384, 395
parameters (spatial and temporal), estimate, 383
433 risk, 221
Point source–, 423 Bayes’ theorem, 383
spatio-temporal, 433 Bayesian hierarchical model, 360, 383–
zonal, 152 399
Annealing, simulated, 5 and GLMM, 384
ANOVA, 299 conjugate prior, 385
AR(1) model, 35 fitting, 386–390
Arcsine transformation, 271 kriging, 397, 399
Aspatial analysis, 316, 390 posterior distribution, 383, 384, 386,
Aspatial distributions, 395 395, 397
Autocorrelation, 14–30 predictive distribution, 391
468 SUBJECT INDEX

prior distribution, 383, 385, 393, 395– Blood lead levels in children, Virginia
397 2000, 370
prior sample, 385 as example of lattice data, 10
Bayesian kriging, 391 building blocks for models, 371
Bayesian linear regression, 391 geographically weighted Poisson re-
Bayesian transformed Gaussian model, gression, 381
397 GLM results, 372
Bernoulli GLMs and GLMMs, 370–382
experiment, 83, 416 introduction, 10
process, 83 maps of predictions, 380
random variable, 416 predicted value comparison, 378, 379
Bessel function, 210–211 pseudo-likelihood, 375
I, 211 semivariogram of std. Pearson resid-
J, 141, 147, 180, 210 uals, 374
K, 143, 210 spatial GLMs and GLMMs results,
modified, 210 376
Best linear unbiased estimator, 223, 299 spatial GLMs with CAR structure,
Best linear unbiased prediction 378
of residuals, 224 spatial models, 374
with known mean, 223 variable definitions, 371
BLUP, see Best linear unbiased predic-
with unknown but constant mean,
tor
226
Bochner’s theorem, 72, 141, 436
Best linear unbiased predictor, 33, 134,
Boltzmann distribution, 410
220, 222, 242, 299
Box-Cox transformation, 271
filtered measurement error, 332
Breakdown point, 162
of random effects (mixed model), 326
BTG, see Bayesian transformed Gaus-
Binary classification, 17
sian model
Binomial
distribution, 352, 353
C/N ratio data
experiment, 83
comparison of covariance parameter
multivariate, 356
estimates, 176
point process, 83
filtered kriging, 252
power mixture, 439
introduction, 154
probability generating function, 439 ordinary kriging (REML, and OLS),
process, 83, 86 246
random variable, 83 semivariogram estimation, 176
Binomial sampling, see Sampling semivariogram estimation (paramet-
Birth-death process, 442 ric kernel), 186
Bishop move, 19, 336 spatial configuration, 156
Black-Black join count, 20 Candidate generating density, 389
Black-White join count, 19 Canonical link, 354
Block indicator values, 290 inverse, 354
Block kriging, 285–289 Canonical parameter, 354
in terms of semivariogram, 287 CAR, see Spatial autoregressive model,
predictor, 286 conditional
variance, 286 Cardinal sine covariance model, 148
weights, 286 Cauchy distribution, 293
with indicator data, 290 Cauchy-Schwarz inequality, 44
Blocking, 300, 325 Cause-and-effect, 301
SUBJECT INDEX 469

Central Limit theorem, 46 distribution (approximated), 279


Change of support, 92, 95, 284–292 distribution in exponential family,
aggregation effect, 285 357
grouping effect, 285 distribution, full, 387
scale effect, 285 expansion (linearization), 365
zoning effect, 285 expectation, 220, 267, 270, 364
Chaotic component, 139 expectation function, 218
Characteristic function, 69 independence, 384
Chebyshev-Hermite polynomial, 281 independence (in GLMM), 358
standardized, 281 mean function, 222
Chi-square mean in GRF, 218, 224
approximation, 91, 341 model formulation, 334
distribution, 95, 341, 342 probability, 338
goodness-of-fit test, 91, 130 realization, 409
mixtures, 344 simulation, 290
statistic, 91 variance (pseudo-likelihood), 364
test, 17 variance in GRF, 218
Cholesky decomposition, 310, 321, 351, Conditional autoregression, 336
352, 407, 413 Conditional simulation, 406–409, 412
Cholesky root, see Cholesky decomposi- Conditionally negative definite, 204
tion Conditioning by kriging, 409
Circular covariance model, 146 Conditioning sequence, in simulation, 408
City block distance, 205, 207, 209 Confidence interval, 219, 221, 300, 342
CL, see Composite likelihood for βj , 305
Classical semivariogram estimator, 30 for θi , 343
Clique, 377 in GLMs, 368
Cluster process, 82 monotonic transform, 368
Clustering operation, 123, 125 Conjugacy, 393
Cokriging, 278 Conjugate prior, 385, 395
Cold spot, 24 Connectivity, 18, 238
Complete covariance density function, 201 matrix, 314
Complete spatial randomness, 81, 83, 91, Consistency
103, 130, 203 and periodogram, 194
hypothesis, 84, 96, 98 and Wald test, 341
process, 84 in CL and GEE estimation, 175
Completely randomized design, 300 of CL estimator of θ, 173
Composite likelihood, 163, 170 of estimator of βgls , 322
and GEE, 171 of estimator of θ, 323
and spatial GLMs, 362 of GEE approach in longitudinal anal-
bias correction, 173 ysis, 414
component score function, 171 of least squares estimator, 172
component scores, 171 of least squares in SAR model (lack
consistency, 173 thereof), 337
Compound Poisson process, 84 of Matheron estimator, 154
Compound symmetry, 32, 34, 356 of median absolute deviation, 162
Conditional of quasi-likelihood estimator, 361
autoregression, 339 Constant risk hypothesis, 108, 116
distribution, 293, 338 Constrained kriging, 291–292
distribution (and indicator kriging), Constrained simulation, 409
278 Constraint
linearity, 33 hole, 148


unbiasedness, 33 indirect, 57
Contingency table, 17, 91 induced, 57
Continuity, see Spatial continuity linear, 149
and kriging, 229 Matérn class, 143, 210, 394
in mean square, 49 non-separable, 435–440
Continuous component, 139 nonparametric, 148
Contrast coefficient, 341 power, 149
Convex hull, 89, 241 product (spatio-temporal), 434
Convolution, 57, 406 product-sum (spatio-temporal), 434
and non-stationarity, 422, 426, 427 properties, 43, 434
anisotropic, 427 separable, 72, 433
as linear filter, 75 spectral representation, 67, 141
between kernel and data, 59 spherical, 49, 146
in simulation, 413–418 sum (spatio-temporal), 434
kernel, 61, 79 valid, 139
of white noise, 60, 62, 427 validity of, 44
representation, 59, 70, 74, 183, 414 Whittle, 144
Cook’s D, 307 Covariance structure, see Covariance func-
Correlation tion
exchangeable, 32 Covariogram, 136
exponential structure, 74, 423 Covariogram, empirical, 136
first-order autoregressive structure, Cox process, 123–125, 127, 130
435 CRD, 300
serial, 25 Credible set, 390
Correlogram, 136, 355 Cressie-Hawkins (CH) semivariogram es-
COSP, see Change of support timator, 159
Cost function (simulated annealing), 410 Cross-validation, 328, 397
Cost weighted distance, 205 CSR, see Complete Spatial Randomness
Counting measure, 83
Covariance Decomposition
density function, 100, 201, 444 Cholesky, 310, 321, 351, 352, 407,
matrix, 3 413
model LU, 290, 407, 413
direct, 54 operational, 54, 232, 249, 299
induced, 54, 395 regionalization, 150
parameters, 3, 4 sequential – of joint cdf, 408
Covariance function, 25, 26, 28, 43, 45, singular value, 209, 310, 321
51, 52, 59, 64, 67, 75 spectral, 133, 309, 407
and semivariogram, 45 spectral (of a real symmetric ma-
and spectral density, 66, 67 trix), 352, 407
Bartlett’s, 201 Degrees of freedom, 305, 341–343
behavior near origin, 138 Delta method, 270, 368, 369
cardinal sine, 148 Density
circular, 146 estimation, 110, 111
differentiability, 138 function (spectral), 64
estimate on lattice, 192 of energy, 64
exponential, 49, 144 of power, 64
for point pattern, 201 Dependent sample, 387
Gaussian, 49, 143, 205, 427 Design-based inference, 301
Designed experiment, 301 Empirical semivariogram cloud, 153


DFFITS, 307 Equicorrelation, 32, 34, 356
Differential equation, 439 Equivalence theorem, 85
Dirac delta function, 73, 201 Error
Discrete Gaussian model, 291 contrast in REML estimation, 168
Disease clustering, 14 mean-squared, 35, 111, 220, 224, 264,
Disease mapping, 394 265
Disjunctive coding, 279 of measurement, 55
Disjunctive kriging, 279–284, 291, 295, Type 1, 25
399 Error recovery, 309–310, 351
predictor, 282 Error-control design, 300, 301
variance, 282 Estimated best linear unbiased predic-
Distance tor, 265
city block, 205, 207, 209 Estimated GLS, 225
cost weighted, 205 and “REML estimator” of β, 262
Euclidean, 127, 202, 205, 206, 209, and GLMs, 368
329, 433 estimator, 167, 168, 257, 265
fitted by MDS, 208 estimators in mixed model, 326
inter-event, 97, 116 residuals, 257, 348
nearest neighbor, 97, 114 Estimating function, 169
non-Euclidean, 204, 207 optimal, 169
polygonal, 205 unbiased, 169, 171
water, 204 Estimation
Domain local vs global, 238
continuous, 7 versus prediction, 215
discrete, 8 Euclidean distance, 127, 202, 205, 206,
fixed, 7, 8, 15 209, 329, 433
random, 12 Euler relation, 191
spatial, 432 Exact interpolation, 224
temporal, 432 Examples
Doubly stochastic process, 125 L function, 103
Dynamic range, 196, 213 geostatistical data, 8
Bayesian hierarchical model, 383, 384
E-type estimate, 279 bias correction for log link, 358
Earthquake process, 442, 443 binary classification of leukemia cases,
Edge correction, 102, 104, 112, 122 17
Edge effects, 87, 102, 112 Binomial process, 86
Edgeworth expansion, 293 bivariate point pattern, 106
Effect block kriging semivariance, 287
fixed, 232, 300, 325 Blood lead levels in children, Vir-
random, 232, 325 ginia 2000, see Blood lead lev-
Effective sample size, 32, 34 els in children, Virginia, 2000
Efficiency, 35 BLUS estimates, 312
EGLS, see Estimated GLS C/N ratios, see C/N ratio data
Empirical Bayes, 384, 395 cluster detection (point patterns),
Empirical covariance, 30 114
Empirical covariogram, 136 comparing L functions, 106
Empirical distribution function, 88 composite likelihood estimation, 176
Empirical semivariogram, see Semivari- conditional models, 332
ogram contiguous quadrats analysis, 95
contingency analysis for double di- ML/REML estimation of covariance


chotomy, 17 parameters, 176
CSR hypothesis, 91 Monte Carlo test for CSR, 98
delta method, 368 Moran’s I and residuals, 236
diagnosing the covariance model, 350 Moran’s I for quadrat counts, 96
disjunctive kriging, 283 Moran’s I and spurious autocorre-
distribution for asymptotic test, 341 lation, 22
Edward’s NN test, 115 Multi-dimensional scaling, 207
effect of measurement error on krig- ordinary kriging (C/N ratio data),
ing, 251, 252 246
empirical semivariogram, 311 parametric kernel estimation of semi-
equivalence theorem, 86 variogram, 186
estimating the intensity ratio (point plug-in estimation, 254, 368
patterns, 112 point pattern comparison, 106, 108
exponential family, 353 point pattern data, 12
filtering (in prediction), 251, 252 point process equivalence, 86
Fuentes process, 428 posterior distribution, 383, 388
Geary’s c for quadrat counts, 96 prediction and simple linear regres-
generalized estimating equation es- sion, 219
timation, 176 probability generating function, 439
Procrustes rotation (of MDS solution), 209
geographically weighted regression, 381
pseudo-likelihood, 375
GHCD 9 infant birth weights, see
quadrat counts in point patterns, 91
GHCD 9 infant birth weights
radial smoothing, 332
GLS estimator and sample mean, 167
Rainfall amounts, see Rainfall amounts
recursive residuals, 312
Greig-Smith analysis, 95
red-cockaded woodpecker, see Red-cockaded woodpecker
induced covariance because of shared random effects, 2
semivariogram of OLS residuals, 236, 311
influence of semivariogram parameters on ordinary kriging, 229
Simple linear regression and prediction, 222
integrating covariance function, 287
isotropy and anisotropy, 45 simple linear regression and predic-
kriging neighborhood, 246 tion, 219
kriging standard errors, 246 simulated annealing, 411
kriging with nugget effect, 246 simulated Lattice (spatial context),
lattice data, 10, 47 5
Lewis-Shedler algorithm, 124 simulating from convolutions by match-
Lightning strikes, see Lightning strikes ing moments, 415
data simulation envelopes for L, 104
likelihood ratio test, 344, 345 simulation envelopes for contiguous
likelihood ratio test for spatial au- quadrat analysis, 96
tocorrelation, 345 Soil carbon regression, see Soil car-
linear filtering, 77 bon regression
linear mixed model, 384 spatial (linear) regression, 302
linear semivariogram and convolu- spatial GLM with CAR structure,
tion, 184 378
lognormal bias correction, 268 spatial GLMMs, 376
Mercer and Hall grain yield, 47 spatial GLMs, 376
spatial scan statistic, 116 Fisher information, 385


studentized residuals, 312 Fisher scoring, 359, 367
thinning of Poisson process, 124 Fixed effect, 3, 232, 299, 300, 325
trans-Gaussian kriging, 271 Flat prior distribution, 385
transfer function, 77 Forecasting, 251
trend surface model, 236 Fourier
using convolution in simulation, 415 frequencies, 190
variance of estimation in GLM, 368 integral, 63, 65
variance of Matheron estimator, 154 pair, 64, 67, 69–71, 75, 189
what is a Gaussian random field, 47 series, 63, 65
white noise convolution, 60, 184 transform, 69, 73, 77, 198
Exceedance probability, 278 Freeman-Tukey transformation, 271
Excess variability, 35 Frequency
Exchangeable correlation, 32 angular, 64
Excitation field, 58, 60, 62, 75, 184, 415 content, 62
Excitation process, see Excitation field distribution of energy over, 64
Expansion locus, pseudo-likelihood, 365, domain, 68, 70, 71, 75, 77
366 interval, 66, 68
Experimental data, 301 leakage, 196
Experimental design, 320 response function, 75
Experimental error, 301 Fubini’s theorem, 59
Explosion process, 442 Fuentes model, 428
Exponential covariance model, 144 Full conditional distribution, 387
Exponential family, 353, 398
canonical link, 354 Gauss-Markov theorem, 220, 223
canonical parameter, 354 Gaussian kernel, 111
density, 353 Gaussian random field, 24, 46, 47, 153,
inverse canonical link, 354 159, 160, 163, 166, 174, 195,
natural parameter, 353 218, 224, 267, 358, 398, 405,
scale parameter, 353, 355 406
variance function, 354, 358 Geary’s c, 22, 96
External studentization, 306 GEE, see Generalized estimating equa-
tions
F General linear model, 331
distribution, 95, 305, 341, 342 Generalized estimating equations, 163,
Kenward-Roger statistic, 343, 347 169
statistic for linear hypothesis, 305 and block kriging, 288
test and sums of squares reduction, and composite likelihood, 171
305 and quasi-likelihood, 360, 361
Wald statistic, 341, 347 and spatial GLMs, 360–362
Factorizing, 280 for θ, 170
Farlie-Gumbel-Morgenstern distributions, working covariance structure, 169,
293 170
Fejér’s kernel, 196, 197 Generalized least squares, 4, 32, 35, 161
Filtering, 250 estimator, 36, 134, 167, 256, 322
Finite-dimensional distribution, 85 estimator in SAR model, 337
First difference model, 320–321 estimator of β (mixed model), 326
First-order autoregression, 35 residuals, 260, 348–352
First-order intensity, 83, 85, 102, 111, 125, sum of squares, 164
443 Generalized linear mixed model, 357, 362
and Bayesian model, 384 “soaked up” by correlated errors,


penalized quasi-likelihood, 366–367 302
pseudo-likelihood, 362–366 and aggregation, 11
spatial prediction, 369–382 cluster-to-cluster, 431
tests about θ, 368 of kriging weights, 231
Generalized linear model, 300, 352–382 of variance, 421
and universal kriging, 369 of variance (GLM), 354
canonical link, 354 of variances of residuals, 306, 349
canonical parameter, 354 Heteroscedasticity, 157, 309, 349, 362, 421
inference and diagnostics, 368–369 Hilbert space, 280
inverse canonical link, 354 Hole covariance model, 148
spatial prediction, 369–382 Homogeneity of a point process, 84
variance function, 354 Homogeneity of variance, see Homoscedas-
Generalized sum of squares, 164 ticity
Genton semivariogram estimator, 162 Homoscedasticity, 77, 154, 219, 235, 261,
Geographical analysis, 336, 338 303, 320, 421
Geographically weighted regression, 316– Hot spot, 24
317, 381 HPP, see Point pattern, homogeneous
Geometric anisotropy, 45, 427 Hull, convex, 241
Geostatistical data, 7, 30, 57, 134, 215, Hypergeometric sampling, see Sampling
234, 236, 321 Hyperparameter, 384, 386
GHCD 9 infant birth weights Hyperprior, 395, 396
cases and controls, 110 Hypothesis
cluster detection, 114 linear, 305, 341
Edward’s NN test, 115 of constant risk, 108
intensity ratio estimation, 112 of no clustering, 114
introduction, 107 of random labeling, 105
pattern comparison, 108 testable, 305
relative risk map, 113
spatial scan statistic, 116 Idempotency, 304
Gibbs sampling, 387, 389, 398 Identity link, 354
GLM, see Generalized linear model iid observations, 31, 41, 131, 234
GLMM, see Generalized linear mixed Importance sampling, 366, 386, 394
model Index of dispersion, 91
Global kriging, 245 Indicator
Global model, 238 cokriging, 278
GLS, see Generalized least squares function, 278, 280
Goodness of fit, 90, 130, 236 semivariogram, 278
Greig-Smith analysis, 92, 93, 95 transform, 12, 278, 290
GRF, see Gaussian random field variable, 280
Gridding, 245 Indicator kriging, 278–279, 290, 295
Grouping effect, 285 and exceedance probability, 278
in hierarchical model, 397
Hammersley-Clifford theorem, 339 predictor, 278
Hankel transform, 141, 440 Influence function, 161
Hat matrix, 304 Information criterion, 328, 346
Hermitian expansion, 282 Information matrix, 260, 304, 338, 340
Hermite polynomials, 282, 290 Inhomogeneity of a point process, 84
Hessian matrix, 260, 261, 263 Inhomogeneous Poisson process, 107–110
Heterogeneity Integral scales, 140
Integrated spectrum, 70 and clustering, 101


normalized, 71 and regularity, 101
Intensity, 81 and second-order intensity, 101
and average number of events, 110 estimation, 102
and density, 111 in bivariate pattern, cross, 104
cross, between pattern types, 121 in multivariate patterns, cross, 122
estimating ratios of, 112 of Neyman-Scott process, 127
first-order, 83, 85, 102, 109, 111 properties, 101
function (inhomogeneous), 109 Kenward-Roger F statistic, 343, 347
kernel estimator (edge corrected), Kernel
112 bandwidth, 59
marginal spatial, 443 bivariate Gaussian, 112, 240, 427
marginal temporal, 443 estimation, 239
second-order, 85, 99 Fejér’s, 196
spatio-temporal, 443 function, 59, 75, 111, 145, 181, 196,
Inter-event distance, 97, 116 426
Internal studentization, 306 Gaussian, 60, 184, 240, 429
Intrinsic autoregression, 396 Gaussian product, 240
Intrinsic hypothesis, 51, 139, 143, 149 Minimum variance, 111
Intrinsic stationarity, see Stationarity multivariate, 240
Invariance Product, 111, 240
of a linear filter, 76 Quadratic, 111
of kernel, 62 smoother, 219
of kriging predictions to scaling, 435 smoothing, 58
to rotation, 100, 206 spatially varying, 427
to translation, 43, 85, 100, 206, 265, Uniform, 60, 111, 184
421 univariate, 111
Inverse distance-squared interpolation, 244 Knots, spline smoothing, 329
Inverse link function, 354 Kriging, 216
IPP, 84 and exact interpolation, 224
IRWGLS and non-stationarity of covariance,
and conditional formulation, 334 227
and marginal formulation, 333 Bayesian, 391
and quasi-likelihood, 361 constrained, 291–292
estimator, 258 disjunctive, 279–284, 399
estimator in CAR model, 340 global, 245
in SAR model, 337 in spatial GLMs (universal), 369
Isofactorial model, 284, 291, 293 Indicator, 278–279
Isometric embedding, 206 local, 244, 425
Isotropy, 45, 78
lognormal, 267–270, 399, 423
of point process, 99
neighborhood, 245, 426
Iteratively reweighted least squares, see
of polynomials, 284
IRWGLS
ordinary, 33, 226, 399, 425
practice of, 243–254
Jeffrey’s prior, 385 predictor (block), 286
Join count statistics, 37 predictor (disjunctive), 282
Join counts, 19 predictor (indicator), 278
predictor (lognormal, ordinary), 269
K-function, 101, 133, 200, 202 predictor (lognormal, simple), 268
predictor (ordinary), 33, 227, 242, regular, 29


250, 263, 299 Lattice data, 8, 18, 57, 134
predictor (simple), 223, 408, 409 Leakage, 196, 213
predictor (trans-Gaussian), 271 Least squares
predictor (universal), 242, 256 constrained, 180
screening effect, 232 equations, 363
simple, 223 fitting of semivariogram, 164–166
trans-Gaussian, 270–277 generalized, 32, 35, 161
universal, 225, 241–243 nonlinear, 164, 170, 183, 186
variance (block), 286 ordinary, 23, 183
variance (disjunctive), 282 reweighted, 180
variance (ordinary), 231, 263 weighted, 161, 172, 189
variance (trans-Gaussian), 271 Leave-one-out estimates (OLS), 306
variance (universal), 242 Leverage, 306
weight (negative), 231 Leverage matrix, 306, 307, 349
weights, 230, 425 Lightning strikes data
weights (ordinary), 227, 228, 230 G function, 89
weights (positivity constraint), 232 L function, 103
window, 425 as a bivariate pattern, 106
with known mean, 223 as point pattern, 12
with positive weights, 232 bounding box and convex hull, 88
with unknown but constant mean, comparing L functions, 106
226 introduction, 12
Kurtosis, 310 simulation envelopes for G, 89
simulation envelopes for L, 104
L-function, 101 simulation envelopes for difference
and random labeling, 104 of L functions, 106
estimation, 102 Likelihood ratio test, 344
plot, 103 and REML, 344
Labeled point process, 103 boundary problem, 344
Lag, 60, 66 example, 344
class, 30 in SAR model, 348
distance, 28, 73 Linear covariance model, 149
spatio-temporal, 433, 441 Linear filter, 74
tolerance, 30 input, 74
Lag distance, see Lag, 139 output, 74
Lag tolerance, see Lag properties, 75
Lagged variables, 336 Linear hypothesis, 305, 341
Lagrange multiplier, 33, 226, 241, 250, Linear mixed model, 325–334
269, 291 and general linear model, 331
Laguerre polynomials, 284 and pseudo-likelihood, 364
Laplace approximation, 359, 366, 367 and signal model, 325
Laplace equation, 440 and spline smoothers, 327
Large-scale, 232, 299 conditionally specified, 334
Large-scale trend, 53, 55, 225, 244 criterion in variance component model,
linear, 137 327
Latent process, 334, 357 general form, 325
Lattice hierarchical, 334
irregular, 18 low-rank smoother, 327–331
rectangular, 29 maximum likelihood estimators, 326
mixed model equations, 325 squared-error, 220, 226, 264, 326,


restricted maximum likelihood esti- 332
mators, 326 Low-rank smoothing, 327–331
solutions (to mixed model equations), LRE, see Recovered errors
326 LU decomposition, 290, 407, 413
variance components, 327 LUS, see Linearly unbiased scaled esti-
Linear model mation
correlated errors, 321–352
uncorrelated errors, 301–321 Mantel statistic, 15, 20, 22
Linear model of regionalization, 150 Mapped point patterns, 81–129
Linear predictor, 353, 354 Marginal
Linear spline model, 327 analysis, 361
Linearization, 361, 363, 364 correlation, 294, 330
Linearly unbiased scaled estimation, 309, distribution, 284, 292, 330
315 distribution (in GLM), 355
Link function, 353 distribution (upward construction),
and conditional mean, 357 339
identity, 354 equivalence (between mixed and cor-
inverse, 354 related error models), 334
log, 354 expansion (linearization), 365
logit, 354 expectation, 220
reciprocal, 354 likelihood, 359
LISA, 23 model formulation, 333
LMM, see Linear mixed model moments, 334, 405
Local estimation, 238–241 posterior distribution, 386
Local kriging, 244 spatial covariance structure, 334
compared to local estimation, 245 variance, 330
Local likelihood ratio, 116 Marginality, 365
Local Moran’s I, 24 Mark space, 118
Local polynomial, 316 Mark variable, 118
Log likelihood, 166 and K-function, 103
profiled for β, 260 and random labeling, 105
REML, profiled, 263 Marked point pattern, 103, 118–122
restricted (REML), 262 Markov
Log link, 354 chain, 386, 389
Log-likelihood chain, stationary distribution, 388
Laplace approximation, 359 process, 338
Logit, 271, 353 property, 338, 386
Logit link, 354 random field, 339
Lognormal distribution, 267, 269, 270, Markov chain Monte Carlo, 366, 386
358, 403 burn-in, 387
multivariate, 295 convergence, 387, 389
Lognormal kriging, 267–270, 399 stationary distribution, 388
calibrated predictor, 269 subsampling the chain, 390
predictor (ordinary), 269 Matérn class, 143, 189, 198, 200, 394, 399
predictor (simple), 268 Matérn covariance model, 143
Longitudinal data, 2, 302, 365, 414 Matérn point process models, 128
Loss Matching moments, 415
function (asymmetric), 217 Matheron estimator, 30, 153, 165, 174,
function (squared-error), 33, 217, 218 244
MAUP, see Modifiable areal unit prediction error (universal kriging),


Maximum likelihood, 163, 259 242
bias of covariance parameter esti- Measurement error, 28, 55, 139, 150, 224,
mator, 167 232, 299, 334
and information matrix, 304 and filtering, 248
estimation in SAR model, 337 and kriging predictions, 251
estimator of β, 167, 304, 322 Median absolute deviation, 162
estimator of σ 2 , 304 Median indicator kriging, 279
estimator of relative risk, 395 Metric solution to MDS, 207
estimator of variance components, Metric space, 205, 206
328 Metropolis-Hastings algorithm, 389
in Gaussian CAR model, 340 Micro-scale, 28, 55, 133, 139, 150, 224,
in linear mixed model, 326 232, 254, 299, 334
score equations, 304 Minimization
spatially varying mean, 259–261 in filtered kriging, 250
spatio-temporal, 440 in simple kriging, 223
standard errors of estimates, 260 of penalized spline criterion, 328
MCMC, see Markov chain Monte Carlo to find BLUP, 222
MDS, see Multidimensional scaling to obtain mixed model equations,
Mean function, 52, 54, 56, 59, 232, 234, 326
300 unconstrained, 33
and splines, 327 Minimized
and uncorrelated errors, 301 error variance, 304
inference, 300 joint density (mixed model equations),
misspecified, 302 326
overfit, 237 mean-squared prediction error, 33,
spatially varying, 323 223, 224, 227, 242, 250, 269,
Mean model, 55 286
Mean square minus two log likelihood, 167
blocks (quadrats), 93 stress criterion, 209
continuity, 49, 75, 78, 138 Minimum variance kernel, 111
derivative, 51 Mixed model
differentiability, 51, 140 generalized linear, see Generalized
Mean-squared linear mixed model
error, 35, 111, 220, 224, 264, 265 linear, see Linear mixed model
error (Prasad-Rao estimator), 266 nonlinear, 431
prediction error (trans-Gaussian krig- Mixed model equations, 325
ing), 271 and general linear model, 332
prediction error, 54, 217, 218, 220, and penalized quasi-likelihood, 367
241, 266, 269 and pseudo-likelihood, 365
prediction error (block kriging), 286 and spline criterion, 328
prediction error (constrained krig- augmented, 329
ing), 291 minimized joint density, 326
prediction error (filtered kriging), 250 solutions, 326
prediction error (in GLM), 370 Mixing locally stationary processes, 428
prediction error (multivariate), 243 Mixing process, 57
prediction error (ordinary kriging), ML, see Maximum likelihood
227 MLE, see Maximum likelihood estimator
prediction error (simple kriging), 224 Model
prediction error (trend surface), 235 Bayesian hierarchical, 383–399
Bayesian linear regression, 391 set (Markov process), 338


Bayesian transformed Gaussian, 397 structure, 335
conditional formulation, 334 structure (SAR vs. CAR), 339
Discrete Gaussian, 291 Nested model, 150, 164, 180
general linear, 255, 331 Neyman-Scott process, 126, 131
general linear with correlated errors, Non-stationarity, 62, 421–429
249 and Cox process, 125
generalized linear, 352–382 and weighted processes, 428
generalized linear mixed, 362 convolutions, 426, 427
isofactorial, 291, 293 global method, 421
linear, correlated errors, 321–352 in mixture models, 439
linear, uncorrelated errors, 301–321 in time, 435
marginal formulation, 333 local method, 421
power spline of degree p, 327 moving windows, 422, 425
Signal, 249 of covariance, 206, 245
spatial autoregressive, 335–341 of mean, 225
Spatial linear, 256, 262 space deformation, 422, 424
spatial linear regression, 335 Nonlinear generalized least squares, 164
Spline, 327 Nonlinear least squares, 170, 258
Model based geostatistics, 215–295 Nonlinear mixed model, 431
Modifiable areal unit, 95, 285 Nonlinear spatial prediction, 267–284
Moment matching (in simulation), 415 disjunctive kriging, 279
Monte Carlo integration, 366 indicator kriging, 278
Monte Carlo test, 16, 87, 95, 130 lognormal kriging, 267
Moran’s I, 21, 38, 96 multi-Gaussian approach, 289
and OLS residuals, 314–315, 348 trans-Gaussian kriging, 270
for residuals, 23, 236, 307 Nonparametric covariance model, 148
local, 24 Normal equations, 304
Moving average, 179 Normal score transformation, 271
MSE, see Mean-squared error Normal-inverse-Gamma distribution, 392
MSPE, see Mean-squared prediction Nugget
error effect on OK predictions, 228–232
Multidimensional scaling, 206–209, 424 Nugget effect, 28, 50, 139, 140, 150, 224,
classical solution, 207 230, 237, 251, 334, 344, 355
metric solution, 207 and nested models, 150
non-metric solution, 208 and smoothing, 224
Multinomial distribution, 120 in nonparametric model, 186
Multiplicity, 25 relative, 161
Multivariate point pattern, 103, 120 Nuisance parameter, 300
Multivariate radial smoother, 329 Numerical integration, 366

Natural parameter, 353 Objective analysis, 216


Nearest neighbor, 5, 87 Oblique projector, 350
average distance as test statistic, 89 Octant search, 245
distance, 97, 114 Offspring event, 125, 127
distribution of distances, 88 OLS, see Ordinary least squares
residuals, 320 Operational decomposition, 54, 232, 249,
Negative Binomial distribution, 353, 415 299
Negative Binomial excitation field, 416 Operations on point processes, 123
Neighborhood, 111, 133 Operations research, 410
Order-relation problems, 279, 295 PGF, see Probability generating func-


Orderly point process, 201 tion
Ordinary kriging, 33, 226–228, 269, 399 Plackett’s family of distributions, 293
and semivariogram, 228 Plug-in
equations, 250 estimate of residual semivariogram,
in hierarchical model, 399 350
of indicator transformed data, 278 estimate of variance of EGLS resid-
plug-in predictor, 256 uals, 351
predictor, 227, 242, 250, 263, 299 estimation, 254, 263
related model, 232 estimation (effects on F ), 342
variance, 227, 231, 250, 263 estimation, consequences, 263–267
weights, 227, 228, 230 estimator of β, 323
Ordinary least squares, 23 estimator of kriging variance, 267
“estimator” of residual variance, 303 estimator of mean-squared error, 266,
“leave-one-out” estimates, 306 267
estimator, 222 predictor (ordinary kriging), 256
estimator of fixed effects, 303 Point data, 57
estimators in SLR, 220 Point pattern
raw residuals, 307 aggregated, 82
residuals, 257, 303, 304 Bernoulli, 83
residuals and semivariogram, 137 cluster detection, 114–117
semivariogram of residuals, 310 clustering, 114
spatial model (uncorrelated errors), completely random, 81
301 components of multivariate, 121
Orthogonal basis, 284 data type, 11
Orthogonal process, 69, 75 intensity ratios, 112
and nested models, 151 mapped, 81
Orthogonal rotation, 310 marked, 103, 118
Orthogonal subspaces, 304 multitype, 120
Orthonormal basis, 280 multivariate, 103, 120
Orthonormal polynomials, 284, 291 offspring, 131
Overdispersion, 126, 294, 355, 359, 373 parent events, 131
regular, 82
Pair correlation function, 101 relationship between two, 103–106
Papadakis analysis, 301, 318–320 type, 120
Parent event, 125, 127 Point process
Partial sill, 140, 251 Binomial, 83
Patch size, 140 bivariate, 104, 120
Pearson residuals, 373 CSTR, 444
Pearson’s system of distributions, 292 distribution, 85
Pearson’s Type II distribution, 294 equivalence, 85
Penalized quasi-likelihood, 359, 366 hard-core, 128
Penalized spline, 327–331 homogeneous, 101, 102
Period function, 63 independence, 105
Periodicity, 25 inhomogeneous, 84
Periodogram, 62, 68, 78, 189–198 isotropy, 99
interpretation, 193 labeled, 103
leakage, 196, 213 Matérn models, 128
properties, 193 models, 122–129
Permutation test, 15, 25 models with regularity, 128
Neyman-Scott, 126 error, 215, 264, 265


operations, 123 function, 221
orderly, 201 in spatial GLMs, 369–382
Poisson, 84 interval, 219
random labeling, 105 kriging (indicator), 278
simple sequential inhibition, 129 local, 238
simulation, 418–419 location, 238
spatio-temporal, 14, 442–445 of blocks (average values), 285
stationarity, 99 of new observation, 215, 219, 263
uniformity, 84, 99 simultaneous, 242
void probabilities, 85 standard error, 243, 250, 370
Point source, 422 variance, 33
Point source anisotropy, 423 versus estimation, 215
Poisson cluster process, 123, 125, 126 Prediction interval, see Prediction
Poisson distribution, 271, 293, 352, 353, Prediction theorem, 295
383, 439 Predictive distribution, 391, 393, 397
power mixture, 439 Predictor
probability generating function, 439 best, 267
Poisson power mixture, 439 best (in trend surface model), 235
Poisson process best (under squared-error), 218, 222
Cluster, 84 kriging (block), 286
compound, 84 kriging (disjunctive), 282
homogeneous, 81, 83, 84, 86, 122 kriging (lognormal, ordinary), 269
inhomogeneous, 84, 107, 123, 125 kriging (lognormal, simple), 268
properties, 84 kriging (ordinary), 227, 250, 263, 299
Polygonal data, 18 kriging (simple), 223, 226, 408, 409
Polygonal distance, 205 kriging (trans-Gaussian), 271
Polynomial trend, 320 kriging (universal), 256
Population-average, 365 linear, 267
Positive definiteness, 44, 68, 178, 204, of random effects, 325
205, 321, 423, 434 optimal linear, 241
Positive power mixture, 438 PRESS residual, 307
Posterior distribution, 383 Prior
from improper prior, 395 aspatial, 395
improper, 394 CAR, 396
joint, 384, 386 conjugate, 385, 395
marginal, 386 distribution, 383, 385–386
proper (CAR), 397 flat, 385
Power covariance model, 149 improper (uniform), 393
Power mixture, 438 Jeffrey’s, 385
Power spectrum, 63 multivariate Gaussian, 396
Power spline of degree p, 327 noninformative, 385, 394, 395
Practical range, 28, 73, 138, 144, 149, Reference, 394
154, 231 sample, 385
Prasad-Rao MSE estimator, 266 uniform, 385
Pre-whitening, 197 uniform (bounded), 397
Predicted surface, 248 vague, 385
Predicted value, 215, 248 Prior distribution, see Prior
Prediction Probability generating function, 438
and (non-)stationarity, 228 Probit, 271
Procrustes rotation, 208, 209 introduction, 271


Product kernel, 111 semivariogram of normal scores, 275
Product separable covariance, 435, 440 spatial configuration, 272
Profiled log likelihood, 260 Random component, 233, 234
Profiled REML log likelihood, 263 Random effect, 232, 325
Profiling, 259 Random field, 41–78
Projection matrix, 304, 350 Gaussian, 46, 267
Proximity, 1, 11, 14, 336 Markov, 339
matrix, 336, 338, 348, 396 white noise, 68
regional, 336 Random function, 41, 65
spatial, 22, 322, 335 Random intercept, 358
spatial weights, 17 Random labeling hypothesis, 105–106, 114
temporal, 14 Random walk, 44
Pseudo-data, 163, 169, 170, 176, 363, 368, Randomization, 16, 300, 301
441 Randomization test, 25, 405
Pseudo-likelihood, 359, 362 Randomized complete block design, 300,
expansion locus, 365 325
Pseudo-model, 361 Range, 28, 55, 148
Pseudo-response, 163 effect on OK predictions, 228–232
in hole (wave) model, 149
in spherical model, 146
Quadrant search, 245
of spatial process, 28, 138
Quadratic kernel, 111
practical, 28, 73, 138, 144, 149, 154
Quadrats, 86, 90
Raw residual, 306
contiguous, 92
RCBD, 300, 325
smoothing of counts, 110
Reciprocal link, 354
tests based on, 90
Recovered errors, 310, 312, 315, 351
variance of counts, 91
Rectangular lattice, 29
Quadrature, 181, 366
Recursive residuals, 309, 312
Quasi-likelihood, 359, 362
Red-cockaded woodpecker, 93
and consistency, 361
analysis of contiguous quadrats, 95
and GEE in GLMs, 360
Monte Carlo test for CSR, 98
and Taylor series, 361
Moran’s I, 96
penalized, 359, 366
quadrat counts, 94
pseudo-model, 361
simulation envelopes for contiguous
score equations, 360, 361 quadrat analysis, 96
Queen move, 19, 29, 96, 336 Reduced second moment measure, 101
Reference prior, 394
Radial basis functions, 329 Regional data, 8, 322, 335
Radial covariance function, 198 Regional proximity, 336
Radial distribution function, 101 Regionalization, 150
Radial smoothing, 327–331 Regular lattice, 29
knot selection, 331 Rejection sampling, 386
low-rank, 329 Relative efficiency, 35
multivariate, 329 Relative excess variability, 36
Rainfall amounts, 271–277 Relative structured variability, 140, 247
cross validation results, 277 Relay effect, 140, 231
histogram of amounts, 273 REML, see Restricted maximum likeli-
histogram of transformed amounts, hood
273, 274 Repeated measures, 2
Replication, 42, 55, 133, 140, 405, 421 in MCMC methods, 387
Residual maximum likelihood, see Re- point patterns temporally, 443
stricted maximum likelihood SAR, see Spatial autoregressive model,
Residual recursion, 309 Simultaneous
Residual semivariogram (GLS), 349 Scale effect, 285
Residual semivariogram (OLS), 308, 311, Scale mixture, 439
349, 350 Scale parameter, exponential family, 353,
Residual sum of squares, 303, 305, 393 355
Residual variance, 303 Scales of pattern, 95
Residuals Scales of variation, 54
(E)GLS, 257, 348 Scan statistic, 116
and autocorrelation, 235, 307 Scatterplot smoother, 327, 329
and pseudo-data (GLS), 362 Score equations, 304
GLS, 225, 260, 348–352 Screening effect, 232
OLS, 225, 257, 303, 304 Second reduced moment function, 101
PRESS, 307 Second-order
standardized (OLS), 305 intensity of Neyman-Scott process,
studentized (GLS), 362 intensity, 85, 99, 101, 443
studentized (OLS), 305 intensity of Cox process, 125
trend surface, 235 intensity of Neyman-Scott process,
Restricted maximum likelihood, 163, 168, 127
422 product density, 99
and error contrasts, 168, 262 reduced moment measure, 101
estimator of µ, 168 spatio-temporal intensity, 444
estimator of scale (σ 2 ), 263 stationarity, 27, 43, 52, 65, 201, 255,
estimator of variance components, 308, 406, 415
328 Semivariogram, 28, 30, 45, 50, 135, 224
in linear mixed model, 326 and covariance function, 45, 136
objective function, 263 and lognormal kriging, 270
spatially varying mean, 261–263 as structural tool, 28
spatio-temporal, 440 biased estimation, 137
REV, 35 classical estimator, 30
Robust semivariogram estimator, 160 cloud, 153, 170, 178, 186
Rook move, 19, 238, 336 conditional spatial, 441
RSV, see Relative structured variability Cressie-Hawkins estimator, 159, 239
empirical, 30, 135, 153, 236, 244,
Sample 255, 256, 410, 413, 416, 424,
autocovariance function, 191 441
covariance function, 192, 193 empirical (spatio-temporal), 441
dependent, 387 estimate, 30
mean, 31, 34, 191, 231 estimation, 153–163
size, 83 Genton estimator, 162
size (effective), 32, 34 local estimation, 426
Sampling Matheron estimator, 30, 136, 153,
binomial, 20 174, 441
by rejection, 386 moving average formulation, 184
from stationary distribution, 387 nonparametric modeling, 178–188
Gibbs, 387, 389, 398 nugget, 50, 139
hypergeometric, 20 of a convolved process, 184
Importance, 386, 394 of GLS residuals, 258, 362
of indicators, 278 Simulation


of OLS residuals, 236, 308, 310–313, conditional, 406, 412
350 conditioning by kriging, 409
of recovered errors, 312 constrained, 409
of recursive residuals, 312 from convolutions, 413–418
of residuals, 225, 307 of Cox process, 126
partial sill, 140 of GRF (conditional), 407–409
power model, 51 of GRF (unconditional), 406–407
practical range, 138 of homogeneous Poisson process, 418
range, 138 of inhomogeneous Poisson process,
resistant estimator, 162 419
robust estimator, 160 of point processes, 418–419
sill, 28, 138, 154, 231 sequential, 408
spatio-temporal, 440 unconditional, 406, 409, 412
Separability Simulation envelopes, 87, 88, 93, 95
additive, 434 Simultaneous autoregression, 335
multiplicative, 434 Simultaneous prediction, 242
of covariance function, 79, 433 Simultaneous spatial autoregression, 56
of covariance functions, 72, 434 Singular value decomposition, 209, 310,
of spectral density, 79 321, 329
Test for, 435, 436 SLR, see Simple linear regression
Sequential decomposition (of joint cdf), Small-scale, 55, 133, 243, 354
408 Smooth-scale, 55, 139, 232, 299
Sequential residuals, 309 Smoothing, 224, 234
Sequential simulation, 408 and filtered kriging, 250
Serial and measurement error, 248
correlation, 25 disease rates, 394
variation curve, 135 in autoregressive models, 341
variation function, 135 parameter, 240, 328
Shepard-Kruskal algorithm, 209 parameter (model-driven), 328
Signal model, 55, 249, 325 with splines, 327
Sill, 28, 136, 138, 147, 148, 154, 236, 251 SMR, see Standardized mortality ratio
and isotropy (geometric), 151 Soil carbon regression, see also C/N ra-
and isotropy (zonal), 152 tio data
and nonparametric semivariogram, BLUS estimates, 312
183 conditional models, 332
effect on OK predictions, 228–232 diagnosing covariance model, 350
of correlogram, 355 empirical semivariogram, 311
of hole model, 149 fixed-effect estimates, 323
of residual semivariogram, 308 introduction, 302
partial, 140, 251 likelihood ratio test for spatial au-
time-dependent, 435 tocorrelation, 345
Similarity statistics, 30 ML/REML estimation of mean, 323
Simple kriging, 223–225 OLS estimation of mean, 323
predictor, 223, 408, 409 OLS residuals, 312
residual, 409 radial smoothing, 332
variance, 224, 225 recursive residuals, 312
Simple linear regression, 222 residual semivariogram, 311
Simple sequential inhibition process, 129 studentized residuals, 312
Simulated annealing, 5, 406, 409–413 Space deformation, 206, 422, 424
Space-filling design, 331 joint vs. separate analyses, 432


Space-time, see Spatio-temporal lag, 433, 441
Spatial point process, 14, 442–445
autoregression, 335 processes, 431–445
closeness, 29 semivariogram, 440
connectivity, 18, 56, 238 two-stage analysis, 431
context, 5 Spectral
dispersions, 424 decomposition, 133, 309, 352, 407
distribution, 43, 405 density, 62, 64, 66, 67, 70, 71, 78,
neighbors, 18 198
proximity, 14, 17, 322, 335 density (marginal), 72
scales, 54 density (normalized), 71
Spatial autoregressive model, 335–341 density (spatio-temporal), 437
conditional, 338, 396 density function, 73, 74, 188
covariance matrix (parameterized), distribution function, 71
337 mass, 66, 67
eigenvalue conditions, 336 representation, 65, 74, 77, 179
estimation and inference, 336, 339 of isotropic covariance, 141
induced correlation, 336 spatio-temporal, 436
lagged variables, 336 theorem, 68
maximum likelihood, 337 Spectral decomposition (of a real sym-
modified least squares, 337 metric matrix), 407
one-parameter CAR, 340 Spectral density, see Spectral
one-parameter SAR, 336, 347 Spherical covariance model, 146
prediction, 340 Spline coefficients, 344
restricted maximum likelihood, 338 Spline model, 327
simultaneous, 335 Spline smoothing
standardized weights, 336 and linear mixed models, 327
universal kriging, 340 coefficients, 329
Spatial continuity, 48–52 knot selection, 331
and kriging predictions, 230 knots, 329
at origin, 50, 138 thin-plate, 329
degree of, 48 Split-plot, 32
in mean-square, 49 Squared-error loss, 33, 134, 217, 218, 220,
Spatial linear model, 234, 256 221, 226, 264, 326, 332
Spatial model SSAR model, 56
interaction model, 56 Standardized mortality ratio, 395
mean model, 55 Standardized OLS residuals, 305
nearest-neighbor, 55 Stationarity, 27, 42
reaction model, 56 intrinsic, 44, 135, 255
signal model, 55 local, 422, 425
simultaneous AR, 56 of Cox process, 126
SSAR, 56 of Neyman-Scott process, 127
Spatial neighbors, 18 of point process, 99
Spatial regression, 54 of residuals (lack thereof), 308
Spatial scales, 54 of spatio-temporal point pattern, 444
Spatially lagged variables, 336 second-order, 27, 43, 52, 65, 201,
Spatio-temporal 255, 308, 406, 415
data, 422 Space-time, 434
interactions, 435, 438 strict, 43, 48, 406
strong, 43 t distribution, 342


temporal, 422 multivariate, 295, 393, 397
weak, 43, 51, 427 Taper functions, 197
Stationary distribution, 387 Tapering, 197
Statistic Taylor series
F -test, SS-reduction, 305 and disjunctive kriging, 270
F -test, linear hypothesis, 305 and MSE estimation, 266
t-test, for βj = 0, 305 and pseudo-likelihood, 364
AIC, 346 and quasi-likelihood, 359
AICC, 347 and spatial GLM (marginal), 361
autocorrelation and trans-Gaussian kriging variance,
Black-black join count, 20 271
Geary’s c, 17, 21, 22 expansion locus, 365
join count, 19, 37 of link function, 363
local, 24 Temporal proximity, 14
local Moran’s I, 24 Tessellation, 98
Mantel, 15, 22 Test for normality, 310
Moran’s I, 17, 21 Test statistic
Moran’s I for OLS residuals, 315 for βj = 0, 305
Cook’s D, 307 for (testable) linear hypothesis, 305
DFFITS, 307 for autocorrelation (Moran’s I and
Kenward-Roger F , 343, 347 OLS residuals), 315
leverage, 306 for CSR, based on distances, 97
likelihood ratio test, 344 in Monte Carlo test, 87
PRESS, 307 Theorem
Wald F , 347 Bayes, 383
Wald F , for linear hypothesis, 341 Bochner’s, 68, 72, 141, 436
Stochastic process, 41 Central limit, 46, 322, 342
orthogonal, 69 Equivalence, 85, 86
spatio-temporal, 432 Fubini, 59
Strict stationarity, 43, 406 Gauss-Markov, 220, 223
Strong stationarity, 43, 406 Hammersley-Clifford, 339, 377
Studentization, 362 Prediction, 295
external, 306 Sherman-Morrison-Woodbury, 306
internal, 306 Spectral representation, 68
Studentized EGLS residual, 351 Wiener-Khintchine, 72
Studentized OLS residuals, 305 Thin-plate spline, 329
Sub-sampling, 32 Thinning, 101, 123, 128
Subject-specific expansion, 365 π–, 123
Success probability, 83 p–, 123
Sum of squares p(s)–, 123
generalized, 164 in simulation, 419
in constrained model, 305 Time series, 44, 52
in OLS, 303 Tobler’s law, 26, 28
reduction test, 305 Trans-Gaussian kriging, 270–277
residual, 305, 393 in hierarchical model, 397
weighted, 165 predictor, 271
Superpositioning, 123 variance, 271
Support, 285 Transfer function, 75
Transformation Vague prior, 385


affine, 294 Variability, see Variance
anamorphosis, 271, 283 Variance
and nonlinear prediction, 267 at discrete frequency, 67
arcsine, 271 heterogeneity, 11, 421
Box-Cox, 271 of black-black join count, 20
Freeman-Tukey, 271 of black-white join count, 20
Indicator, 278, 290 of block kriging predictor, 286
log, 271 of convolved random field, 184
logit, 271 of difference, 220, 221
normal scores, 271, 283 of disjunctive kriging predictor, 282
of coordinates, 206, 423 of filtered kriging predictor, 250
of GRF, 397 of gradient, 51
of model, 321 of IRWGLS estimator, 258
of predictor, 269 of linear combination, 44
of regressor space, 301 of Matheron estimator, 154, 165
of residuals, 309 of measurement error, 55, 139
of response, 421, 423 of model errors, 304
probit, 271 of Moran’s I, 38
square-root, 271 of OLS residuals, 308
to Gaussian, 292 of OLS residuals (in correlated error
to normality, 270 model), 349
Transition probability, 386 of ordinary kriging predictor, 227,
Treatment design, 300 263
Treatment effects, 320 of prediction, 33
Trend surface model, 234–238 of quadrat counts, 91
of second order random field with
Trend, large-scale, 53
nugget, 140
Trend, polynomial, 320
of second-order random field, 43
Triangulation, 98
of simple kriging predictor, 224
Truncated line function, 327
of trans-Gaussian kriging predictor,
Two-stage analysis, 431
271
Type-I error, 25, 90, 342
of universal kriging predictor, 242
Variance component, 254, 330, 344
UMVU, 218 Variance component model, 327
Unconditional, see Marginal Variance function, 354, 358
Unconditional simulation, 406, 409, 412 Variogram, 30, 135
Unconstrained minimization, 33, 226 Variography, 137
Uniform prior distribution, 385 Void probabilities, 85
Uniformity of a point process, 84
Universal kriging, 241–243 W-transformation, 327, 330
and hierarchical model, 392 Wald F statistic, 341, 347
and residuals, 225 Water distance, 204
and spatial GLMs, 369 Wave covariance model, 148
globally, 245 Weak stationarity, 427
in hierarchical model, 397 Weighted least squares, 161, 165, 172,
in SAR/CAR models, 340 189
predictor, 242, 256 objective function, 240
variance, 242 Weighted nonlinear least squares, 258
Unstructured covariance, 302 Weighted stationary processes, 428
Weighted sum of squares, 165


White noise, 60, 62, 68, 83, 150, 184, 232,
282, 294, 312, 415, 426, 440
Whittle model, 144
Wiener-Khintchine theorem, 72
WLS, see Weighted least squares
Woodpecker, red-cockaded, 93, 285
Working structure, in GEE, 170

Zero-probability functionals, 85
Zoning effect, 285
