Statistical Methods
for Spatial Data Analysis
CHAPMAN & HALL/CRC
Texts in Statistical Science Series
Series Editors
Bradley P. Carlin, University of Minnesota, USA
Chris Chatfield, University of Bath, UK
Martin Tanner, Northwestern University, USA
Jim Zidek, University of British Columbia, Canada
Bayes and Empirical Bayes Methods for Data Analysis, Second Edition
Bradley P. Carlin and Thomas A. Louis

Bayesian Data Analysis, Second Edition
Andrew Gelman, John B. Carlin, Hal S. Stern, and Donald B. Rubin

Beyond ANOVA — Basics of Applied Statistics
R.G. Miller, Jr.

Computer-Aided Multivariate Analysis, Fourth Edition
A.A. Afifi and V.A. Clark

A Course in Large Sample Theory
T.S. Ferguson

Data Driven Statistical Methods
P. Sprent

Decision Analysis — A Bayesian Approach
J.Q. Smith

Elementary Applications of Probability Theory, Second Edition
H.C. Tuckwell

Elements of Simulation
B.J.T. Morgan

Large Sample Methods in Statistics
P.K. Sen and J. da Motta Singer

Linear Models with R
Julian J. Faraway

Markov Chain Monte Carlo — Stochastic Simulation for Bayesian Inference
D. Gamerman

Mathematical Statistics
K. Knight

Modeling and Analysis of Stochastic Systems
V. Kulkarni

Modelling Survival Data in Medical Research, Second Edition
D. Collett

Multivariate Analysis of Variance and Repeated Measures — A Practical Approach for Behavioural Scientists
D.J. Hand and C.C. Taylor

Multivariate Statistics — A Practical Approach
B. Flury and H. Riedwyl

Practical Data Analysis for Designed Experiments
B.S. Yandell
Texts in Statistical Science
Statistical Methods
for Spatial Data Analysis
Oliver Schabenberger
Carol A. Gotway
This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with
permission, and sources are indicated. A wide variety of references are listed. Reasonable efforts have been made to publish
reliable data and information, but the author and the publisher cannot assume responsibility for the validity of all materials
or for the consequences of their use.
No part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or
other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information
storage or retrieval system, without written permission from the publishers.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Contents

Preface

1 Introduction
1.1 The Need for Spatial Analysis
1.2 Types of Spatial Data
1.2.1 Geostatistical Data
1.2.2 Lattice Data, Regional Data
1.2.3 Point Patterns
1.3 Autocorrelation—Concept and Elementary Measures
1.3.1 Mantel’s Tests for Clustering
1.3.2 Measures on Lattices
1.3.3 Localized Indicators of Spatial Autocorrelation
1.4 Autocorrelation Functions
1.4.1 The Autocorrelation Function of a Time Series
1.4.2 Autocorrelation Functions in Space—Covariance and Semivariogram
1.4.3 From Mantel’s Statistic to the Semivariogram
1.5 The Effects of Autocorrelation on Statistical Inference
1.5.1 Effects on Prediction
1.5.2 Effects on Precision of Estimators
1.6 Chapter Problems

References
Preface

The study of statistical methods for spatial data analysis presents challenges
that are fairly unique within the statistical sciences. Like few other areas,
spatial statistics draws on and brings together philosophies, methodologies,
and techniques that are typically taught separately in statistical curricula.
Understanding spatial statistics requires tools from applied statistics, mathe-
matical statistics, linear model theory, regression, time series, and stochastic
processes. It also requires a different mindset, one focused on the unique char-
acteristics of spatial data, and additional analytical tools designed explicitly
for spatial data analysis.
When preparing graduate level courses in spatial statistics for the first time,
we each struggled to pull together all the ingredients necessary to present the
material in a cogent manner at an accessible, practical level that did not tread
too lightly on theoretical foundations. This book ultimately began with our
efforts to resolve this struggle. It has its foundations in our own experience,
almost 30 years combined, with the analysis of spatial data in a variety of
disciplines, and in our efforts to keep pace with the new tools and techniques
in this diverse and rapidly evolving field.
The methods and techniques discussed in this text by no means provide a complete accounting of statistical approaches in the analysis of spatial data.
Weighty monographs are available on any one of the main chapters. Instead,
our goal is a comprehensive and illustrative treatment of the basic statistical
theory and methods for spatial data analysis. Our approach is mostly model-
based and frequentist in nature, with an emphasis on models in the spatial,
and not the spectral, domain. Geostatistical methods that developed largely
outside of the statistical mainstream, e.g., kriging methods, can be cast easily
in terms of prediction theory based on statistical regression models. Focusing
on a model formulation allows us to discuss prediction and estimation in the
same general framework. But many derivations and results in spatial statistics
either arise from representations in the spectral domain or are best tackled
in this domain, so spectral representations appear throughout. We added a
section on spectral domain estimation (§4.7) that can be incorporated in a
course together with the background material in §2.5. While we concentrate
on frequentist methods for spatial data analysis, we also recognize the utility
of Bayesian hierarchical models. However, since these models are complex
and intricate, we leave their discussion until Chapter 6, after much of the
to Victor De Oliveira for the rainfall data, and to Felix Rogers for the low
birth weight data. Thanks to Charlie for being there when shape files go bad;
his knowledge of Virginia and his patience with the “dry as toast” material
in the book are much appreciated.
Finding time to write was always a challenge, and we are indebted to David
Olson for supporting Carol through her struggles with government bureau-
cracy. Finally, we learned to value our close collaboration and critical thinking
ability, enjoying the time we spent discussing and debating recent develop-
ments in spatial statistics, and combining our different ideas like pieces of a
big puzzle. We hope the book reflects the integration of our knowledge and
our common philosophy about statistics and spatial data analysis. A special
thanks to our editor at CRC Press, Bob Stern, for his vision, support, and
patience.
The material in the book will be supplemented with additional material
provided through the CRC Press Web site (www.crcpress.com). The site will
provide many of the data sets used as examples in the text, software code
that can be used to implement many of the principal methods described and
illustrated in the text, as well as updates and corrections to the text itself.
We welcome additions, corrections, and discussions for this Web page so that
it can make statistical methods for spatial data analysis useful to scientists
across many disciplines.
CHAPTER 1

Introduction
Statistical methods for spatial data analysis play an ever increasing role in
the toolbox of the statistician, scientist, and practitioner. Over the years,
these methods have evolved into a self-contained discipline which continues to
grow and develop and has produced a specific vocabulary. Characteristic of
spatial statistics is its immense methodological diversity. In part, this is due
to its many origins. Some of the methods developed outside of mainstream
statistics in geology, geography, meteorology, and other subject matter areas.
Some are rooted in traditional statistical areas such as linear models and
response surface theory. Others are derived from time series approaches or
stochastic process theory. Many methods have undergone adaptations to cope with the specific challenges presented by, for example, the fact that
spatial processes are not equivalent to two-dimensional time series processes.
The novice studying spatial statistics is thus challenged to absorb and combine
varied tools and concepts, revisit notions of randomness and data generating
mechanisms, and to befriend a new vernacular.
Perhaps the foremost reason for studying spatial statistics is that we are
often not only interested in answering the “how much” question, but the “how
much is where” question. Many empirical data contain not only information
about the attribute of interest—the response being studied—but also other
variables that denote the geographic location where the particular response
was observed. In certain instances, the data may consist of location infor-
mation only. A plant ecologist, for example, records the locations within a
particular habitat where a rare plant species can be found. It behooves us to
utilize this information in statistical inference provided it contributes mean-
ingfully to the analysis.
Most authors writing about statistical methods for spatial data will argue
that one of the key features of spatial data is the autocorrelation of observa-
tions in space. Observations in close spatial proximity tend to be more similar
than is expected for observations that are more spatially separated. While
correlations between observations are not a defining feature of spatial data,
there are many instances in which characterizing spatial correlation is of pri-
mary analytical interest. It would also be shortsighted to draw a line between
“classical” statistical modeling and spatial modeling because of the existence
of correlations. Many elementary models exhibit correlations.
During the past two decades we witnessed tremendous progress in the anal-
ysis of another type of correlated data: longitudinal data and repeated mea-
sures. It is commonly assumed that correlations exist among the observations
repeatedly collected for the same unit or subject. An important aspect of such
data is that the repeated measures are made according to some metric such
as time, length, depth, etc. This metric typically plays some role in expressing
just how the correlations evolve with time or distance. Models for longitudinal
and repeated measures data thus bring us closer to models for spatial data,
but important differences remain. Consider, for example, a longitudinal study
of high blood pressure patients, where s subjects are selected at random from
some (large) population. At certain time intervals $t_1, \cdots, t_{n_i}$, $(i = 1, \cdots, s)$,
the blood pressure of the ith patient is measured along with other variables,
e.g., smoking status, exercise and diet habits, medication. Thus, we observe a
value of $Y_i(t_j)$, $j = 1, \cdots, n_i$, the blood pressure of patient $i$ at time $t_j$. Although the observations taken on a particular patient are most likely serially correlated, these data fall within the realm of traditional random sampling. We consider the vectors $\mathbf{Y}_i = [Y_i(t_1), \cdots, Y_i(t_{n_i})]'$ as independent random
vectors because patients were selected at random. A statistical model for the
data from the ith patient might be
$$\mathbf{Y}_i = \mathbf{X}_i\boldsymbol{\beta} + \mathbf{e}_i, \qquad (1.1)$$
where Xi is a known (ni ×p) matrix of regressor and design variables associated
with subject i. The coefficient vector β can be fixed, a random vector, or have
elements of both types. Assume for this example that β is a vector of fixed
effects coefficients and that ei ∼ (0, Vi (θ l )). The matrix Vi (θ l ) contains the
variances of and covariances between the observations from the same patient.
These are functions of the elements of the vector θ l . We call θ the vector of
covariance parameters in this text, because of its relationship to the covariance
matrix of the observations.
The analyst might be concerned with
1. estimating the parameters β and θ l of the data-generating mechanism,
2. testing hypotheses about these parameters,
3. estimating the mean vector E[Yi ] = Xi β,
4. predicting the future blood pressure value of a patient at time t.
To contrast this example with a spatial one, consider collecting n observa-
tions on the yield of a particular crop from an agricultural field. The analyst
may raise questions similar to those in the longitudinal study:
1. about the parameters describing the data-generating mechanism,
2. about the average yield in the northern half of the field compared to the
southern-most area where the field surface is sloped,
3. about the average yield on the field,
4. about the crop yield at an unobserved location.
Since the questions are so similar in their reference to estimates, parame-
ters, averages, hypotheses, and unobserved events, maybe a statistical model
similar to the one in the longitudinal study can form the basis of inference?
One suggestion is to model the yield $Z$ at spatial location $\mathbf{s} = [x, y]'$ as
$$Z(\mathbf{s}) = \mathbf{x}_s'\boldsymbol{\alpha} + \nu,$$
where $\mathbf{x}_s$ is a vector of known covariates, $\boldsymbol{\alpha}$ is the coefficient vector, and $\nu \sim (0, \sigma^2)$. A point $\mathbf{s}$ in the agricultural field is identified by its $x$ and $y$ coordinate in the plane. If we collect the $n$ observations into a single vector, $\mathbf{Z}(\mathbf{s}) = [Z(\mathbf{s}_1), Z(\mathbf{s}_2), \cdots, Z(\mathbf{s}_n)]'$, this spatial model becomes
$$\mathbf{Z}(\mathbf{s}) = \mathbf{X}_s\boldsymbol{\alpha} + \boldsymbol{\nu}, \quad \boldsymbol{\nu} \sim (\mathbf{0}, \boldsymbol{\Sigma}(\boldsymbol{\theta}_s)), \qquad (1.2)$$
where Σ is the covariance matrix of the vector Z(s). The subscript s is used
to identify a component of the spatial model; the subscript l is used for a
component of the longitudinal model to avoid confusion.
The models for $\mathbf{Y}$ and $\mathbf{Z}(\mathbf{s})$ are rather similar. Both suggest some form of generalized least squares (GLS) estimation for the fixed effects coefficients,
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}_l'\mathbf{V}(\boldsymbol{\theta}_l)^{-1}\mathbf{X}_l)^{-1}\mathbf{X}_l'\mathbf{V}(\boldsymbol{\theta}_l)^{-1}\mathbf{Y}$$
$$\hat{\boldsymbol{\alpha}} = (\mathbf{X}_s'\boldsymbol{\Sigma}(\boldsymbol{\theta}_s)^{-1}\mathbf{X}_s)^{-1}\mathbf{X}_s'\boldsymbol{\Sigma}(\boldsymbol{\theta}_s)^{-1}\mathbf{Z}(\mathbf{s}).$$
Also common to both models is that $\boldsymbol{\theta}$ is unknown and must be estimated. The estimates $\hat{\boldsymbol{\beta}}$ and $\hat{\boldsymbol{\alpha}}$ are efficient, but unattainable. In Chapters 4–6 the issue of estimating covariance parameters is visited repeatedly.
However, there are important differences between models (1.1) and (1.2)
that imperil the transition from the longitudinal to the spatial application.
The differences have technical, statistical, and computational implications.
Arguably the most important difference between the longitudinal and the
spatial model is the sampling mechanism. In the longitudinal study, we sam-
ple s independent random vectors, and each realization can be thought of as
that of a temporal process. The sampling of the patients provides the repli-
cation mechanism that leads to inferential brawn. The technical implication
is that the variance-covariance matrix V(θ) is block-diagonal. The statistical
implication is that standard multivariate limiting arguments can be applied,
for example, to ascertain the asymptotic distribution of $\hat{\boldsymbol{\beta}}$, because the estimator and its associated estimating equation can be expressed in terms of sums of independent contributions. Solving
$$\sum_{i=1}^{s} \mathbf{X}_i'\mathbf{V}_i(\boldsymbol{\theta}_l)^{-1}\left(\mathbf{Y}_i - \mathbf{X}_i\boldsymbol{\beta}\right) = \mathbf{0}$$
leads to
$$\hat{\boldsymbol{\beta}} = \left(\sum_{i=1}^{s}\mathbf{X}_i'\mathbf{V}_i(\boldsymbol{\theta}_l)^{-1}\mathbf{X}_i\right)^{-1}\sum_{i=1}^{s}\mathbf{X}_i'\mathbf{V}_i(\boldsymbol{\theta}_l)^{-1}\mathbf{Y}_i.$$
The computational implication is that data processing can proceed on a
subject-by-subject basis. The key components of estimators and measures of
their precision can be accumulated one subject at a time, allowing fitting of
models to large data sets (with many subjects) to be computationally feasible.
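To make the computational point concrete, here is a minimal sketch (ours, not the authors'; the function and variable names are hypothetical) of how the GLS normal equations can be accumulated one subject at a time:

```python
import numpy as np

def gls_blockwise(blocks):
    """Accumulate the GLS normal equations one subject at a time.

    blocks: iterable of (X_i, V_i, y_i) per subject, where V_i is the
    (n_i x n_i) covariance matrix of that subject's observations.
    Returns the GLS estimate of beta.
    """
    XtVinvX, XtVinvY = None, None
    for X_i, V_i, y_i in blocks:
        Vinv = np.linalg.inv(V_i)
        A = X_i.T @ Vinv @ X_i          # subject's contribution to X'V^{-1}X
        b = X_i.T @ Vinv @ y_i          # subject's contribution to X'V^{-1}y
        XtVinvX = A if XtVinvX is None else XtVinvX + A
        XtVinvY = b if XtVinvY is None else XtVinvY + b
    return np.linalg.solve(XtVinvX, XtVinvY)
```

Only one subject's data needs to be in memory at a time, which is exactly why the block-diagonal structure makes large longitudinal fits feasible.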
In the spatial case, $\boldsymbol{\Sigma}$ is not block-diagonal and there is usually no replication.
Example 1.2 Data were generated on a 10×10 lattice by drawing iid variates
from a Gaussian distribution with mean 5 and variance 1, denoted here as
G(5, 1). The observations were assigned completely at random to the lattice
coordinates (Figure 1.1a). Using a simulated annealing algorithm (§7.3), the
data points were then rearranged such that a particular value is surrounded
by more and more similar values. We define a nearest-neighbor of site si ,
(i = 1, · · · , 100), to be a lattice site that could be reached from si with a
one-site move of a queen piece on a chess board. Then, let $\bar{z}_i$ denote the average of the values at the neighboring sites of $\mathbf{s}_i$. The arrangements shown in Figure
1.1b–d exhibit increasing correlations between Z(si ), the observation at site
si , and the average of its nearest-neighbors.
Since the same 100 data values are assigned to the four lattice arrange-
ments, any exploratory statistical method that does not utilize the coordinate
information will lead to identical results for the four arrangements. For ex-
ample, histograms, stem-and-leaf plots, Q-Q plots, and sample moments will
be identical. However, the spatial pattern depicted by the four arrangements
is considerably different. This is reflected in the degree to which the four
arrangements exhibit spatial autocorrelation (Figure 1.2).
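A sketch of how such a diagnostic can be computed, assuming a complete rectangular lattice stored as a two-dimensional array (the function name is hypothetical):

```python
import numpy as np

def neighbor_average_corr(Z):
    """Correlation between Z(s_i) and the average of its queen
    neighbors on a rectangular lattice Z (2-d array)."""
    nr, nc = Z.shape
    zbar = np.empty_like(Z, dtype=float)
    for r in range(nr):
        for c in range(nc):
            # queen neighbors: the (up to) 8 adjacent cells
            nb = [Z[i, j]
                  for i in range(max(r - 1, 0), min(r + 2, nr))
                  for j in range(max(c - 1, 0), min(c + 2, nc))
                  if (i, j) != (r, c)]
            zbar[r, c] = np.mean(nb)
    return np.corrcoef(Z.ravel(), zbar.ravel())[0, 1]

# For a completely random arrangement the statistic should be near 0:
rng = np.random.default_rng(1)
Z = rng.normal(5.0, 1.0, size=(10, 10))
print(neighbor_average_corr(Z))
```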
[Figure 1.1: four 10 × 10 lattice arrangements, panels a)–d); axes are Column (1–10) and Row (1–10).]
[Figure 1.2: four scatter panels a)–d) of $z$ against $\bar{z}$; annotated correlations are 0.3 in panel b), 0.6 in c), and 0.9 in d).]
Figure 1.2 Correlations between lattice observations Z(si ) and the average of the
nearest neighbors for lattice arrangements shown in Figure 1.1.
Figure 1.3 U.S. Weather Stations. Source: National Climatic Data Center.
Lattice data are spatial data where the domain D is fixed and discrete, in other
words, non-random and countable. The number of locations can be infinite;
what is important is that they can be enumerated. Examples of lattice data
are attributes collected by ZIP code, census tract, or remotely sensed data
reported by pixels. Spatial locations with lattice data are often referred to as
sites and these sites usually represent not points in space, but areal regions.
It is often mathematically convenient or necessary to assign to each site one
precise spatial coordinate, a “representative” location. If we refer to Wake
County in North Carolina, then this site is a geographic region, a polygon
on a map. To proceed statistically, we need to spatially index the sites so we
can make measurements between them. For example, to measure the distance
between the Wake County site and another county in North Carolina, we need
to adopt some convention for measuring the distance between two regions. One
possibility would be the Euclidean distance between a representative location
within each county, for example, the county centroid, or the seat of the county
government.
Unfortunately, the mathematical notation typically used with lattice data,
and the term lattice data itself, are both misleading. It seems natural to refer
to the observation at the ith site as Zi and then to use si to denote the rep-
resentative point coordinate within the site. In order to emphasize the spatial
nature of the observation, the same notation, Z(si ), used for geostatistical
data is also routinely used for lattice data. The subscript indexes the lattice
site and si denotes its representative location. Unfortunately, this notation
and the idea of a “lattice” in general, has encouraged most scientists to treat
lattice data as if they were simply a collection of measurements recorded at a
finite set of point locations. In reality, most, if not all, lattice data are spatially
aggregated over areal regions. For example,
• yield is measured on an agricultural plot,
• remotely sensed observations are associated with pixels that correspond to
areas on the ground,
• event counts, e.g., number of deaths, crime statistics, are reported for coun-
ties or ZIP codes,
• U.S. Census Bureau data is made available by census tract.
The aggregation relates to integration of a continuous spatial attribute (e.g.,
Example 1.4 Blood lead levels in children, Virginia 2000. The mis-
sion of the Lead-Safe Virginia Program sponsored by the Virginia Department
of Health is to eradicate childhood lead poisoning. As part of this program,
children under the age of 72 months were tested for elevated blood lead levels.
The number of children tested each year is reported for each county in Vir-
ginia, along with the number that had elevated blood lead levels. The percent
of children with elevated blood lead levels for each Virginia county in 2000 is
shown in Figure 1.5.
[Figure 1.5 map legend, percent classes: 0.00–2.62, 2.63–5.23, 5.24–10.46, 10.47–20.93, 20.94–71.43.]
Figure 1.5 Percent of children under the age of 72 months with elevated blood lead
levels in Virginia in 2000. Source: Lead-Safe Virginia Program, Virginia Department
of Health.
Figure 1.6 Median housing value per county in Virginia in 2000. Source: U.S. Census
Bureau.
Geostatistical and lattice data have in common the fixed, non-stochastic do-
main. A domain D is fixed if it does not change from one realization of the
spatial process to the next. Consider pouring sand out of a bucket onto a
desk and let Z(s) denote the depth of the poured sand at location s. The set
Example 1.5 Lightning strikes. Once again we turn to the weather for
an interesting example. Prompt and reliable information about the location of
lightning strikes is critical for any business or industry that may be adversely
affected by the weather. Weather forecasters use information about lightning
strikes to alert them (and then the public at large) to potentially dangerous
storms. Air traffic controllers use such information to re-route airline traffic,
and major power companies and forestry officials use it to make efficient use
of human resources, anticipating the possibility of power outages and fires.
Figure 1.7 Locations of lightning strikes within approximately 200 miles of the east
coast of the U.S. between April 17 and April 20, 2003. Data courtesy of Vaisala Inc.,
Tucson, AZ.
Some elementary statistical measures of the degree to which data are au-
tocorrelated can be motivated precisely in this way; as methods for detect-
ing clusters in three-dimensional space. Mantel (1967) considered a general
procedure to test for disease clustering in a spatio-temporal point process
$\{Z(\mathbf{s}, t) : \mathbf{s} \in D \subset \mathbb{R}^2, t \in T^* \subset \mathbb{R}\}$. For example, $(\mathbf{s}, t)$ is a coordinate in space and time at which a leukemia case occurred. This is an unmarked spatio-temporal point pattern; $Z$ is a degenerate random variable. To draw
the parallel with studying autocorrelation in a three-dimensional coordinate
system, we could also consider the data-generating mechanism as a spatial
point process with a mark variable T , the time at which the event occurs.
Denote this process as T (s) and the observed data as T (s1 ), T (s2 ), · · · , T (sn ).
The disease process is said to be clustered if cases that occur close together
in space also occur close together in time. In order to develop a statistical
measure for this tendency to group in time and space, let Wij denote a measure
of spatial proximity between si and sj and let Uij denote a measure of the
temporal proximity of the cases. For example, we can take for leukemia cases
at si and sj
$$W_{ij} = ||\mathbf{s}_i - \mathbf{s}_j||, \qquad U_{ij} = |T(\mathbf{s}_i) - T(\mathbf{s}_j)|.$$
Mantel (1967) suggested testing for clustering by examining the test statistics
$$M_1 = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} W_{ij}U_{ij} \qquad (1.4)$$
$$M_2 = \sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}U_{ij}, \qquad W_{ii} = U_{ii} = 0. \qquad (1.5)$$
Because of the restriction Wii = Uii = 0, (1.4) sums the product Wij Uij for
the n(n − 1)/2 unique pairs and M2 is a sum over all n(n − 1) pairs. Collect
the spatial and temporal distance measures in $(n \times n)$ matrices $\mathbf{W}$ and $\mathbf{U}$ whose diagonal elements are 0. Then $M_2 = \sum_{i=1}^{n} \mathbf{u}_i'\mathbf{w}_i$, where $\mathbf{u}_i'$ is the $i$th row of $\mathbf{U}$ and $\mathbf{w}_i'$ is the $i$th row of $\mathbf{W}$.
Although Mantel (1967) focused on diagnosing spatio-temporal disease clus-
tering, the statistics (1.4) and (1.5) can be applied for any spatial attribute
and are not restricted to marked point patterns. Define Wij as a measure
of closeness between si and sj and let Uij be a measure of closeness of the
observed values; for example, Uij = |Z(si ) − Z(sj )| or Uij = {Z(si ) − Z(sj )}2 .
Notice that for a geostatistical or lattice data structure the Wij are fixed
because the domain is fixed and the Uij are random. We can imagine a re-
gression through the origin with response Uij and regressor Wij . If the process
exhibits positive autocorrelation, then small Wij should pair with small Uij .
Points will be distributed randomly beyond some critical Wij . The Mantel
statistics are related to the slope estimator of the regression $U_{ij} = \beta W_{ij} + e_{ij}$ (Figure 1.8):
$$\hat{\beta} = \frac{M_2}{\sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}^2}.$$
Figure 1.8 Scatter plots of $u_{ij} = |Z(\mathbf{s}_i) - Z(\mathbf{s}_j)|$ vs. $w_{ij} = ||\mathbf{s}_i - \mathbf{s}_j||$ for the simulated lattice of Figure 1.1d. Bottom panel shows the average $u_{ij}$ per distance class vs. $w_{ij}$.
• Monte Carlo Test. Even for moderately large n, the number of possible
random assignments of values to sites is large. Instead of a complete enu-
meration one can sample independently k random assignments to construct
the empirical null distribution of M2 . For a 5%-level test at least k = 99
samples are warranted and k = 999 is recommended for a 1%-level test.
The larger k, the better the empirical distribution approximates the null
distribution. Then, $M_{2(obs)}$ is again combined with the $k$ values for $M_2$ from the simulation and the relative rank of $M_{2(obs)}$ is computed. If this rank is sufficiently extreme, the hypothesis of no autocorrelation is rejected (a code sketch of this test follows the list).
• Asymptotic Z-Test With Gaussian Assumption. The distribution of
M2 can also be determined—at least asymptotically—if the distribution of
the Z(s) is known. The typical assumption is that of a Gaussian distribution
for the Z(s1 ), · · · , Z(sn ) with common mean and common variance. Under
the null distribution, Cov[Z(si ), Z(sj )] = 0 ∀i (= j, and the mean and
variance of M2 can be derived. Denote these as Eg [M2 ] and Varg [M2 ],
since these moments may differ from those obtained under randomization.
A test statistic
$$Z_{obs} = \frac{M_{2(obs)} - \mathrm{E}_g[M_2]}{\sqrt{\mathrm{Var}_g[M_2]}}$$
is formed and, appealing to the large-sample distribution of $M_2$, $Z_{obs}$ is
compared against cutoffs from the G(0, 1) distribution. The spatial prox-
imity weights Wij are considered fixed in this approach.
• Asymptotic Z-Test Under Randomization. The mean and variance
of M2 under randomization can also be used to formulate a test statistic
$$Z_{obs} = \frac{M_{2(obs)} - \mathrm{E}_r[M_2]}{\sqrt{\mathrm{Var}_r[M_2]}},$$
which follows approximately (for n large) a G(0, 1) distribution under the
null hypothesis. The advantage over the previous Z-test lies in the absence
of a distributional assumption for the Z(si ). The disadvantage of either
Z-test is the reliance on an asymptotic distribution for Zobs .
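A minimal sketch of the Monte Carlo test for $M_2$, using $W_{ij} = ||\mathbf{s}_i - \mathbf{s}_j||$ and $U_{ij} = |Z(\mathbf{s}_i) - Z(\mathbf{s}_j)|$ as in the text (names are hypothetical; the one-sided direction shown rejects for large $M_2$, i.e., positive association between spatial and attribute distances):

```python
import numpy as np

def mantel_m2(W, U):
    """M2 of (1.5): sum over all pairs of W_ij * U_ij (diagonals are zero)."""
    return np.sum(W * U)

def mantel_monte_carlo(coords, z, k=999, seed=None):
    """Monte Carlo test of no spatial autocorrelation based on M2.

    Builds W_ij = ||s_i - s_j|| and U_ij = |z_i - z_j|, then compares the
    observed M2 with k random re-assignments of the values to the sites.
    Returns an empirical p-value for positive spatial association.
    """
    rng = np.random.default_rng(seed)
    W = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    U = np.abs(z[:, None] - z[None, :])
    m2_obs = mantel_m2(W, U)
    n_ge = 0
    for _ in range(k):
        perm = rng.permutation(len(z))
        # re-assign the attribute values to the sites at random
        if mantel_m2(W, U[np.ix_(perm, perm)]) >= m2_obs:
            n_ge += 1
    return (n_ge + 1) / (k + 1)
```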
The Mantel statistics (1.4) and (1.5) are usually not applied in their raw
form in practice. Most measures of spatial autocorrelation that are used in
exploratory analysis are special cases of M1 and M2 . These are obtained by
considering particular structures for Wij and Uij and by scaling the statistics.
Other special cases of the Mantel statistics are the Black-White join count,
Moran’s I, and Geary’s c statistic. We now discuss these in the context of
lattice data and then provide extensions to a continuous domain.
1.3.2 Measures on Lattices
Lattice or regional data are in some sense the coarsest of the three spatial
data types because they can be obtained from other types by spatial accumu-
lation (integration). Counting the number of events in non-overlapping sets
A1 , · · · , Am of the domain D in a point process creates a lattice structure.
A lattice process can be created from a geostatistical process by integrating
Z(s) over the sets A1 , · · · , Am .
Key to analyzing lattice structures is the concept of spatial connectivity. Let
i and j index two members of the lattice and imagine that si and sj are point
locations with which the lattice members are identified. For example, i and j
may index two counties and si and sj are the spatial locations of the county
centroid or the seat of the county government. It is not necessary that each
lattice member is associated with a point location, but spatial connectivity
between sites is often expressed in terms of distances between “representative”
points. With each pair of sites we associate a weight wij which is zero if i = j
or if the two sites are not spatially connected. Otherwise, wij takes on a
non-zero value. (We use lowercase notation for the spatial weights because
the domain is fixed for lattice data.) The simplest connectivity structure is
obtained if the lattice consists of regular units. It is then natural to consider
binary weights
$$w_{ij} = \begin{cases} 1 & \text{if sites } i \text{ and } j \text{ are connected} \\ 0 & \text{if sites } i \text{ and } j \text{ are not connected.} \end{cases} \qquad (1.6)$$
Sites that are connected are considered spatial neighbors and you determine
what constitutes connectedness. For regular lattices it is customary to draw
on the moves that a respective chess piece can perform on a chess board
(Figure 1.9a–c). For irregularly shaped areal units spatial neighborhoods can
be defined in a number of ways. Two common approaches are shown in Figure
1.10 for counties of North Carolina. Counties are considered connected if they
share a common border or if representative points within the county are less
than a certain critical distance apart. The weight wij assigned to county j, if
it is a neighbor of county i, may be a function of other features of the lattice
sites; for example, the length of the shared border, the relative sizes of the
counties, etc. Symmetry of the weights is not a requirement. If housing prices
are being studied and a small, rural county abuts a large, urban county, it is
reasonable to assume that changes in the urban county have different effects
on the rural county than changes in the rural environment have on the urban
situation.
[Figure 1.9: panels a)–c), 5 × 5 regular lattices illustrating neighborhood definitions based on chess moves.]
known as the Black-Black join count statistic. The moniker stems from considering sites with $Z(\mathbf{s}_i) = 1$ as colored black and sites where the event of interest did not occur as colored white. By the same token, the Black-White join count statistic is given by
$$BW = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(Z(\mathbf{s}_i) - Z(\mathbf{s}_j))^2. \qquad (1.8)$$
The $BB$ statistic is a special case of the Mantel statistic with $Z(\mathbf{s}_i)$ binary and $U_{ij} = Z(\mathbf{s}_i)Z(\mathbf{s}_j)$. If $Z$ is a continuous attribute whose mean is not spatially varying, i.e., $\mathrm{E}[Z(\mathbf{s})] = \mu$, then closeness of attributes at sites $\mathbf{s}_i$ and $\mathbf{s}_j$ can be expressed by various measures:
$$U_{ij} = (Z(\mathbf{s}_i) - \mu)(Z(\mathbf{s}_j) - \mu) \qquad (1.9)$$
$$U_{ij} = (Z(\mathbf{s}_i) - \bar{Z})(Z(\mathbf{s}_j) - \bar{Z}) \qquad (1.10)$$
$$U_{ij} = |Z(\mathbf{s}_i) - Z(\mathbf{s}_j)| \qquad (1.11)$$
$$U_{ij} = (Z(\mathbf{s}_i) - Z(\mathbf{s}_j))^2. \qquad (1.12)$$
The practically important measures are (1.10) and (1.12) because of mathematical tractability and interpretation. If $\bar{Z}$ is consistent for $\mu$, then $(Z(\mathbf{s}_i) - \bar{Z})(Z(\mathbf{s}_j) - \bar{Z})$ is a consistent estimator of $\mathrm{Cov}[Z(\mathbf{s}_i), Z(\mathbf{s}_j)]$. It is tempting to scale this measure by an estimate of $\sigma^2$. If the $Z(\mathbf{s}_i)$ were a random sample, then $S^2 = (n-1)^{-1}\sum_{i=1}^{n}(Z(\mathbf{s}_i) - \bar{Z})^2$ would be the appropriate estimator. In case of (1.10) this leads us to consider
$$(n-1)\,\frac{(Z(\mathbf{s}_i) - \bar{Z})(Z(\mathbf{s}_j) - \bar{Z})}{\sum_{i=1}^{n}(Z(\mathbf{s}_i) - \bar{Z})^2}, \qquad (1.13)$$
As with the general Mantel statistic, inference for the I and c statistic
can proceed via permutation tests, Monte-Carlo tests, or approximate tests
based on the asymptotic distribution of I and c. To derive the mean and
variance of I and c one either proceeds under the assumption that the Z(si )
are Gaussian random variables with mean µ and variance σ 2 or derives the
mean and variance under the assumption of randomizing the attribute values
to the lattice sites. Either assumption yields the same mean values,
$$\mathrm{E}_g[I] = \mathrm{E}_r[I] = -\frac{1}{n-1}, \qquad \mathrm{E}_g[c] = \mathrm{E}_r[c] = 1,$$
but expressions for the variances under Gaussianity and randomization differ.
The reader is referred to the formulas in Cliff and Ord (1981, p. 21) and to
Chapter problem 1.8.
The interpretation of the Moran and Geary statistic is as follows. If I > E[I],
then a site tends to be connected to sites that have similar attribute values.
The spatial autocorrelation is positive and increases in strength with |I −E[I]|.
If I < E[I], attribute values of sites connected to a particular site tend to be
dissimilar. The interpretation of the Geary coefficient is opposite. If c > E[c],
sites are connected to sites with dissimilar values and vice versa for c < E[c].
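For reference, a sketch of the standard definitions of these statistics, together with the $BW$ join count of (1.8); the weight matrix `w` and attribute vector `z` are hypothetical inputs:

```python
import numpy as np

def morans_i(w, z):
    """Moran's I: (n / w..) * sum_ij w_ij (z_i - zbar)(z_j - zbar)
    divided by sum_i (z_i - zbar)^2."""
    zc = z - z.mean()
    return (len(z) / w.sum()) * (zc @ w @ zc) / np.sum(zc ** 2)

def gearys_c(w, z):
    """Geary's c: ((n - 1) / (2 w..)) * sum_ij w_ij (z_i - z_j)^2
    divided by sum_i (z_i - zbar)^2."""
    zc = z - z.mean()
    diff2 = (z[:, None] - z[None, :]) ** 2
    return ((len(z) - 1) / (2 * w.sum())) * np.sum(w * diff2) / np.sum(zc ** 2)

def bw_join_count(w, z):
    """BW join count (1.8) for binary z: 0.5 * sum_ij w_ij (z_i - z_j)^2."""
    return 0.5 * np.sum(w * (z[:, None] - z[None, :]) ** 2)
```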
The assumptions of constant mean and constant variance of the Z(si ) must
not be taken lightly when testing for spatial autocorrelation with these statis-
tics. Values in close spatial proximity may be similar not because of spatial
autocorrelation but because the values are independent realizations from dis-
tributions with similar mean. Values separated in space may appear dissimilar
if the mean of the random field changes.
[Figure: realization of a time series $Z(t)$ plotted against time $t = 1, \ldots, 21$.]
Figure 1.12 Exponential semivariogram with sill 10 and practical range 15. The semivariogram not passing through the origin has a nugget effect of $\theta_0 = 4$.
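A sketch of the exponential semivariogram in the parameterization suggested by Figure 1.12 (sill, practical range, and nugget; the factor 3 makes the practical range the distance at which roughly 95% of the sill is attained, an assumption about the parameterization here since the model is treated formally only in later chapters):

```python
import numpy as np

def exp_semivariogram(h, sill=10.0, prac_range=15.0, nugget=0.0):
    """Exponential semivariogram parameterized by sill, practical range,
    and nugget; gamma(0) = 0 by definition."""
    h = np.asarray(h, dtype=float)
    partial_sill = sill - nugget
    gamma = nugget + partial_sill * (1.0 - np.exp(-3.0 * h / prac_range))
    return np.where(h > 0, gamma, 0.0)

lags = np.linspace(0, 40, 9)
print(exp_semivariogram(lags))               # no nugget
print(exp_semivariogram(lags, nugget=4.0))   # nugget effect theta_0 = 4
```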
$$M_1^{(1)} = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} w_{ij}^{(1)} U_{ij}^{(1)}$$
$$M_1^{(2)} = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} w_{ij}^{(2)} U_{ij}^{(2)}$$
For irregular (or small) lattices we can incorporate a lag tolerance $\epsilon > 0$ to ensure a sufficient number of pairs in each lag class:
$$w_{ij}(h, \epsilon) = \begin{cases} 1 & \text{if } h - \epsilon \le ||\mathbf{s}_i - \mathbf{s}_j|| < h + \epsilon \\ 0 & \text{otherwise.} \end{cases}$$
These weight definitions are meaningful whenever an observation is associated
with a point in Rd , hence they can be applied to geostatistical data. Two com-
mon choices for Uij are squared differences and cross-products. The similarity
statistics can be re-written as
$$M_1^{*} = \sum_{N(h)} (Z(\mathbf{s}_i) - Z(\mathbf{s}_j))^2$$
$$M_1^{**} = \sum_{N(h)} (Z(\mathbf{s}_i) - \bar{Z})(Z(\mathbf{s}_j) - \bar{Z}),$$
where $N(h)$ is the set of site pairs that are lag distance $h$ (or $h \pm \epsilon$) apart. The cardinality of this set is $|N(h)| = \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} w_{ij}(h)$ (or $\sum\sum w_{ij}(h, \epsilon)$). If the random field has constant mean $\mu$, then $(Z(\mathbf{s}_i) - Z(\mathbf{s}_j))^2$ is an unbiased estimator of the variogram $2\gamma(\mathbf{s}_i - \mathbf{s}_j)$, and averaging these squared differences over the pairs in $N(\mathbf{h})$ gives
$$\hat{\gamma}(\mathbf{h}) = \frac{1}{2|N(\mathbf{h})|}\sum_{N(\mathbf{h})} (Z(\mathbf{s}_i) - Z(\mathbf{s}_j))^2.$$
This estimator, due to Matheron (1962, 1963), is termed the classical semivariogram estimator, and a plot of $\hat{\gamma}(\mathbf{h})$ versus $||\mathbf{h}||$ is known as the empirical semivariogram. Similarly,
$$\hat{C}(\mathbf{h}) = \frac{1}{|N(\mathbf{h})|}\sum_{N(\mathbf{h})} (Z(\mathbf{s}_i) - \bar{Z})(Z(\mathbf{s}_j) - \bar{Z})$$
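A direct implementation sketch of the classical estimator (coordinates in the rows of `coords`; names are hypothetical):

```python
import numpy as np

def empirical_semivariogram(coords, z, lags, tol):
    """Matheron's classical estimator: for each lag class h +/- tol,
    gamma_hat(h) = sum over pairs in N(h) of (z_i - z_j)^2 / (2 |N(h)|)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    iu = np.triu_indices(len(z), k=1)                 # unique site pairs
    dist = d[iu]
    sqdiff = ((z[:, None] - z[None, :]) ** 2)[iu]
    gamma = []
    for h in lags:
        in_class = (dist >= h - tol) & (dist < h + tol)
        n_h = in_class.sum()                          # |N(h)|
        gamma.append(sqdiff[in_class].sum() / (2 * n_h) if n_h else np.nan)
    return np.array(gamma)
```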
1.5 The Effects of Autocorrelation on Statistical Inference

Since much of classical applied statistics rests on the assumption of iid ob-
servations, you might be tempted to consider correlation in the data as a
nuisance at first. It turns out, however, (i) that ignoring autocorrelation has
serious implications for the ensuing statistical inference, and (ii) that correla-
tion in the data can be employed to your benefit. To demonstrate these points,
we consider the following simple problem.
Assume that observations $Y_1, \cdots, Y_n$ are Gaussian distributed with mean $\mu_y$, variance $\sigma^2$, and covariances $\mathrm{Cov}[Y_i, Y_j] = \sigma^2\rho$ $(i \ne j)$. A second sample of size $n$ for variable $X$ has similar properties: $X_i \sim G(\mu_x, \sigma^2)$, $(i = 1, \cdots, n)$, $\mathrm{Cov}[X_i, X_j] = \sigma^2\rho$ $(i \ne j)$. The samples for $Y$ and $X$ are independent, $\mathrm{Cov}[Y_i, X_j] = 0$ $(\forall i, j)$. Ignoring the fact that the $Y_i$ are correlated, one might consider the sample mean $\bar{Y} = n^{-1}\sum_{i=1}^{n} Y_i$ as the “natural” estimator of $\mu_y$. Some straightforward manipulations yield
$$\mathrm{Var}[\bar{Y}] = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} \mathrm{Cov}[Y_i, Y_j] = \frac{1}{n^2}\left\{n\sigma^2 + n(n-1)\sigma^2\rho\right\} = \frac{\sigma^2}{n}\left\{1 + (n-1)\rho\right\}.$$
Assume that $\rho > 0$, so that $\mathrm{Var}[\bar{Y}] > \sigma^2/n$; the sample mean is more dispersed than in a random sample. More importantly, we note that $\mathrm{E}[\bar{Y}] = \mu_y$, regardless of the correlations, but that
$$\lim_{n \to \infty} \mathrm{Var}[\bar{Y}] = \sigma^2\rho.$$
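The derivation is easy to verify numerically; a sketch with arbitrary parameter values:

```python
import numpy as np

# Check Var[Ybar] = (sigma^2 / n) * (1 + (n - 1) * rho) under compound symmetry
n, sigma2, rho = 25, 2.0, 0.4
V = sigma2 * (rho * np.ones((n, n)) + (1 - rho) * np.eye(n))
a = np.ones(n) / n                      # Ybar = a'Y
var_ybar = a @ V @ a                    # exact variance of the sample mean
print(var_ybar, sigma2 / n * (1 + (n - 1) * rho))  # the two agree
# As n grows, Var[Ybar] approaches sigma^2 * rho = 0.8, not zero.
```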
The effect of ignoring autocorrelation in the data and proceeding with in-
ference as if the data points were uncorrelated was discussed in §1.5. The
effective sample size formula (1.22) allows a comparison of the precision of
the arithmetic sample mean for the compound symmetry model and the case
of uncorrelated data. The intuitive consequence of this expression is that posi-
tive autocorrelation results in a “loss of information.” A sample of independent
observations of size $n$ contains more information than a sample of autocorrelated
observations of the same size. As noted, the arithmetic sample mean is not the
appropriate estimator of the population mean in the case of correlated data.
To further our understanding of the consequences of positive autocorrelation
This model is known as the autoregressive model of first order, or simply the
AR(1) model. Let $\mathbf{Y} = [Y_1, \cdots, Y_n]'$ and $\mathrm{Var}[\mathbf{Y}] = \boldsymbol{\Sigma}$. As for the compound symmetry model, an expression for $\boldsymbol{\Sigma}^{-1}$ is readily available. Graybill (1983, pp. 198–201) establishes that $\boldsymbol{\Sigma}^{-1}$ is a diagonal matrix of type 2, that is, $\sigma^{ij} \ne 0$ if $|i - j| \le 1$, $\sigma^{ij} = 0$ if $|i - j| > 1$, and that
$$\boldsymbol{\Sigma}^{-1} = \frac{1}{\sigma^2(1-\rho^2)}\begin{bmatrix}
1 & -\rho & 0 & 0 & \cdots & 0 \\
-\rho & 1+\rho^2 & -\rho & 0 & \cdots & 0 \\
0 & -\rho & 1+\rho^2 & -\rho & \cdots & 0 \\
0 & 0 & -\rho & 1+\rho^2 & \cdots & \vdots \\
\vdots & \vdots & \vdots & \ddots & \ddots & -\rho \\
0 & 0 & \cdots & 0 & -\rho & 1
\end{bmatrix}.$$
The generalized least squares estimator of $\mu$ is
$$\hat{\mu} = (\mathbf{1}'\boldsymbol{\Sigma}^{-1}\mathbf{1})^{-1}(\mathbf{1}'\boldsymbol{\Sigma}^{-1}\mathbf{Y}) \qquad (1.29)$$
and some algebra leads to (see Chapter problems)
$$\mathrm{Var}[\hat{\mu}] = \sigma^2\,\frac{1+\rho}{(n-2)(1-\rho)+2}. \qquad (1.30)$$
Two special cases are of particular interest. If $\rho = 0$, the variance is equal to $\sigma^2/n$ as it should be, since $\hat{\mu}$ is then the arithmetic mean. If the data points are perfectly correlated ($\rho = 1$), then $\mathrm{Var}[\hat{\mu}] = \sigma^2$, the variance of a single observation. The variance of the estimator then does not depend on sample size. Having observed one observation, no additional information is accrued by sampling additional values; if $\rho = 1$, all further values would be identical to the first.
When you estimate $\mu$ in the autoregressive model by (1.29), the correlations are not ignored; the best possible linear estimator is being used. Yet, compared to a set of independent data of the same sample size, the precision of the estimator is reduced. The values in the body of Table 1.2 express how many times more variable $\hat{\mu}$ is compared to $\sigma^2/n$, that is,
$$\left\{\frac{1-\rho}{1+\rho} + \frac{2}{n}\,\frac{\rho}{1+\rho}\right\}^{-1}.$$
This is not a relative efficiency in the usual sense. It does not set into relation
the mean square errors of two competing estimators for the same set of data. It
is a comparison of mean-squared errors for two estimators under two different
36 INTRODUCTION
data scenarios. We can think of the values in Table 1.2 as the relative excess
variability (REV) incurred by correlation in the data.
Table 1.2 Relative excess variability of the generalized least squares estimator (1.29) in the AR(1) model.
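The entries of Table 1.2 follow directly from the expression above; a sketch that reproduces such values for a few choices of $n$ and $\rho$:

```python
import numpy as np

def rev(n, rho):
    """Relative excess variability of the GLS estimator (1.29) in the
    AR(1) model: Var[mu_hat] relative to sigma^2 / n; see (1.30)."""
    return 1.0 / ((1 - rho) / (1 + rho) + (2.0 / n) * rho / (1 + rho))

for n in (10, 50, 100):
    print(n, [round(rev(n, r), 2) for r in (0.0, 0.25, 0.5, 0.9)])
```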
For any given sample size $n > 1$ the REV of (1.29) increases with $\rho$. No loss is incurred if only a single observation is collected, since then $\hat{\mu} = Y_1$. The REV increases with sample size and this effect is more pronounced for $\rho$ large than for $\rho$ small. As is seen from (1.30) and the fact that $\mathrm{E}[\hat{\mu}] = \mu$,
the GLS estimator is consistent for µ. Its precision increases with sample
size for any given value ρ < 1. An important message that you can glean
from these computations is that the most efficient estimator when data are
(positively) correlated can be (much) more variable than the most efficient
estimator for independent data. In designing simulation studies with corre-
lated data these effects are often overlooked. The number of simulation runs
1.6 Chapter Problems
Problem 1.3 Derive the mean and variance of the BB join count statistic
(1.7) under the assumption of binomial sampling. Notice that E[Z(si )k ] =
π, ∀k, because Z(si ) is an indicator variable. Also, under the null hypothesis
of no autocorrelation, Var[Z(si )Z(sj )] = π 2 − π 4 .
Problem 1.4 Is the variance of the BB join count statistic larger under
binomial sampling or under hypergeometric sampling? Imagine that you are
studying the retail volume of grocery stores in a municipality. The data are
coded such that Z(si ) = 1 if the retail volume of the store at site si exceeds
20 million dollars per year, Z(si ) = 0 otherwise. The BB join count statistic
with suitably chosen weights is used to test for spatial autocorrelation in the
sale volumes. Discuss a situation when you would rely on the assumption of
binomial sampling and one where hypergeometric sampling is appropriate.
Problem 1.6 Given a lattice of sites, assume that Z(si ) is a binary variable.
Compare the Black-White join count statistic to Moran’s I in this case (using
the same contiguity definition). Are they the same? If not, are they very
different? What advantages would one have in using the BB (or BW ) statistic
in this problem that I does not offer?
Problem 1.7 Show that Moran’s I is a scale-free statistic, i.e., Z(s) and
λZ(s) yield the same I statistic.
(i) Derive the mean and variance of Moran’s I under randomization em-
pirically by enumerating the 6! permutations of the lattice. Compare your
answer to the formulas for E[I] and Var[I].
(ii) Calculate the empirical p-value for the hypothesis of no spatial autocor-
relation. Compare it against the p-value based on the Gaussian approxima-
tion under randomization. For this problem you need to know the variance
of I under randomization. It can be obtained from the following expression,
given in Cliff and Ord (1981, Ch. 2):
$$\mathrm{E}_r[I^2] = \frac{n[n^2 - 3n + 3]S_1 - nS_2 + 3w_{..}^2 - b\left[(n^2 - n)S_1 - 2nS_2 + 6w_{..}^2\right]}{(n-3)(n-2)(n-1)w_{..}^2}$$
$$S_1 = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} (w_{ij} + w_{ji})^2$$
$$S_2 = \sum_{i=1}^{n}\left(\sum_{j=1}^{n} w_{ij} + \sum_{j=1}^{n} w_{ji}\right)^2$$
$$b = n\,\frac{\sum_{i=1}^{n}(Z(\mathbf{s}_i) - \bar{Z})^4}{\left\{\sum_{i=1}^{n}(Z(\mathbf{s}_i) - \bar{Z})^2\right\}^2}$$
Compare this power function to the power function that is obtained for ρ = 0.
Problem 1.10 Using the same setup as in the previous problem, find the
best linear unbiased predictor p(Y0 ) for a new observation based on Y1 , · · · , Yn .
Compare its prediction variance $\sigma^2_{\mathrm{pred}}$ to that of the predictor $\bar{Y}$.
Problem 1.11 Show that (1.23) and (1.24) are the solutions to the con-
strained minimization problem that yields the best (under squared-error loss)
linear unbiased predictor in §1.5.1. Establish that the solution is indeed a
minimum.
(i) Find the covariance function, the correlogram, and the semivariogram
for the processes Z1 (t) and Z2 (t).
(ii) Set ρ = 0.5 and simulate the two processes with p = 0.5, 0.8, 0.9. For
each of the simulations generate a sequence for t = 0, 1, · · · , 200 and graph
the realization against t. Choose σ 2 = 0.2 throughout.
Problem 1.15 Using the expression for Σ−1 in §1.5.2, derive the variance of
the generalized least squares estimator in (1.30).
Problem 1.16 Table 1.2 gives relative excess variabilities (REV) for the GLS
estimator with variance (1.30) for several values of ρ ≥ 0. Derive a table akin
to Table 1.2 and discuss the REV if −1 < ρ ≤ 0.
CHAPTER 2

Some Theory on Random Fields
Imagine pouring sand from a bucket onto a surface. The sand distributes
on the surface according to the laws of physics. We could—given enough
resources—develop a model that predicts with certainty how the grains come
to lie on the surface. By the same token, we can develop a deterministic model
that predicts with certainty whether a coin will land on heads or tails, taking
into account the angle and force at which the coin is released, the conditions
of the air through which it travels, the conditions of the surface on which
it lands, and so on. It is accepted, however, to consider the result of the
coin-flip as the outcome of a random experiment. This probabilistic model is
more parsimonious and economic than the deterministic model, and enables
us to address important questions; e.g., whether the coin is fair. Considering
the precise placement of the sand on the surface as the result of a random
experiment is appropriate by similar reasoning. At issue is not that we consider
the placement of the sand a random event. What is at issue is that the sand
was poured only once, no matter at how many locations the depth of the sand
is measured. If we are interested in the long-run average depth of the sand at
a particular location s0 , the expectation (2.1) tells us that we need to repeat
the process of pouring the sand over and over again and consider the expected
value with respect to the probability distribution of all surfaces so generated.
The implications are formidable. How are we to learn anything about the
variability of a random process if only a single realization is available? In
practical applications there is usually no replication in spatial data in the
sense of observing several, independent realizations of the process. Are infer-
ences about the long-run average really that important then? Are we then
not more interested in modeling and predicting the realized surface rather than
some average surface? How are we to make progress with statistical inference
based on a sample of size one? Fortunately, we can, provided that the random
process has certain stationarity properties. The assumption of stationarity in
random fields is often criticized, and sometimes justifiably so. Analyzing ob-
servations from a stochastic process as if the process were stationary—when it
is not—can lead to erroneous inferences and conclusions. Without a good un-
derstanding of stationarity (and isotropy) issues, little progress can be made
in the study of non-stationary processes. And in the words of Whittle (1954),
The processes we mentioned can only as a first approximation be regarded
as stationary, if they can be so regarded at all. However, the approximation is
satisfactory sufficiently often to make the study of the stationary type of process
worth while.
A random field
$$\{Z(\mathbf{s}) : \mathbf{s} \in D \subset \mathbb{R}^d\} \qquad (2.2)$$
(vi) If $C_j(\mathbf{h})$ are valid covariance functions, $j = 1, \cdots, k$, then $\prod_{j=1}^{k} C_j(\mathbf{h})$ is a valid covariance function;
(vii) If $C(\mathbf{h})$ is a valid covariance function in $\mathbb{R}^d$, then it is also a valid covariance function in $\mathbb{R}^p$, $p < d$.
Properties (i) and (ii) are immediate, since C(h) = Cov[Z(s), Z(s + h)]. At
lag h = 0 this yields the variance of the process. Since C(h) does not de-
pend on spatial location s—otherwise the process would not be second-order
stationary—we have C(h) = Cov[Z(s), Z(s + h)] = Cov[Z(t − h), Z(t)] =
C(−h) for t = s + h. Since R(h) = C(h)/C(0) is the autocorrelation function
and is bounded −1 ≤ R(h) ≤ 1, (iii) follows from the Cauchy-Schwarz in-
equality. The lack of importance of absolute coordinates that is characteristic
for a stationary random field, is the reason behind (iv). This particular prop-
erty will be helpful later to construct covariance functions for spatio-temporal
data. Properties (i) and (iii) together suggest that the covariance function has
a true maximum at the origin. In §2.3 it is shown formally that this is indeed
the case.
Property (v) is useful to construct covariance models as linear combinations
of basic covariance models. It is also an important mechanism in nonparamet-
ric modeling of covariances. The proof of this property is simple, once we
have established what valid means. For a covariance function C(si − sj ) of a
second-order stationary spatial random field in $\mathbb{R}^d$ to be valid, $C$ must satisfy the positive-definiteness condition
$$\sum_{i=1}^{k}\sum_{j=1}^{k} a_i a_j C(\mathbf{s}_i - \mathbf{s}_j) \ge 0, \qquad (2.3)$$
for any set of locations and real numbers. This is an obvious requirement, since (2.3) is the variance of the linear combination $\mathbf{a}'[Z(\mathbf{s}_1), \cdots, Z(\mathbf{s}_k)]'$.
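Condition (2.3) can be checked numerically for a candidate model by building the covariance matrix of a finite set of locations and inspecting its eigenvalues; a sketch using an exponential covariance (parameter values are arbitrary):

```python
import numpy as np

# The covariance matrix built from a valid model must be nonnegative
# definite, so Var[a'Z] = a' C a >= 0 for every a. This is a numerical
# check, not a proof of validity.
rng = np.random.default_rng(7)
s = rng.uniform(0, 10, size=(30, 2))                  # 30 random locations
h = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)
C = 2.0 * np.exp(-h / 3.0)                            # exponential covariance
print(np.linalg.eigvalsh(C).min() >= -1e-10)          # eigenvalues >= 0
a = rng.normal(size=30)
print(a @ C @ a >= 0)                                 # variance of a'Z
```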
In time series analysis, stationarity is just as important as with spatial
data. A frequent device employed there to turn a non-stationary series into a
stationary one is differencing of the series. Let Y (t) denote an observation in
the series at time t and consider the random walk
Y (t) = Y (t − 1) + e(t),
where the e(t) are independent random variables with mean 0 and variance
σ 2 . Although we have E[Y (t)] = E[Y (t − k)], the variance is not constant,
Var[Y (t)] = tσ 2 ,
and the covariance does depend on the origin, Cov[Y (t), Y (t − k)] = (t − k)σ 2 .
The random walk is not second-order stationary. However, the first differences
Y (t) − Y (t − 1) are second-order stationary (see Chapter problems). A similar
device is used in spatial statistics; even if Z(s) is not second-order stationary,
the increments Z(s)−Z(s + h) might be. A process that has this characteristic
is said to have intrinsic stationarity. It is often defined as follows: the process is intrinsically stationary if $\mathrm{E}[Z(\mathbf{s}+\mathbf{h}) - Z(\mathbf{s})] = 0$ and $\mathrm{Var}[Z(\mathbf{s}+\mathbf{h}) - Z(\mathbf{s})] = 2\gamma(\mathbf{h})$ depends only on the lag vector $\mathbf{h}$.
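A small simulation illustrating the random walk argument above (all values are arbitrary):

```python
import numpy as np

# The random walk Y(t) = Y(t-1) + e(t) is not second-order stationary
# (Var[Y(t)] = t * sigma^2 grows with t), but its first differences are
# just the iid e(t), hence stationary.
rng = np.random.default_rng(3)
sigma2, T, reps = 1.0, 200, 5000
e = rng.normal(0, np.sqrt(sigma2), size=(reps, T))
Y = np.cumsum(e, axis=1)                    # Y(t) = e(1) + ... + e(t)
print(Y[:, 49].var(), Y[:, 199].var())      # approx 50 and 200: grows with t
D = np.diff(Y, axis=1)                      # first differences Y(t) - Y(t-1)
print(D[:, 48].var(), D[:, 198].var())      # both approx sigma^2 = 1
```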
the covariance structure (see §4.3.7). Under geometric anisotropy the variance
of the process is the same in all directions, but the strength of the spatial
autocorrelation is not. The realization in the left-hand panel was generated
with autocorrelations that are stronger in the East-West direction than in the
North-South direction. The realization on the right-hand side of the panel has
the same covariance structure in all directions as the anisotropic model in the
East-West direction.
Figure 2.1 Anisotropic (left) and isotropic (right) second-order stationary random
fields. Adapted from Schabenberger and Pierce (2002, Figure 9.11).
Example 2.2 A famous data set used in many spatial statistics texts is the
uniformity trial of Mercer and Hall (1911). On an area consisting of 20 × 25
experimental units a uniform wheat variety was planted and the grain yield
was recorded for each of the units. These are lattice data since a field plot is
a discrete spatial unit. We can, however, identify a particular unit with a unique spatial location, e.g., the center of the field plot. From the histogram and the
Q-Q plot of the data one might conclude that the data came from a Gaussian
distribution (Figure 2.2). That statement would ignore the spatial context of
the data and the random mechanism. From a random field perspective, the
500 observations represent a single observation from a 500-dimensional spatial
distribution.
Figure 2.2 Histogram of Mercer and Hall grain yields and normal QQ-plot.
Figure 2.3 Realizations of four processes on a transect of length 50 that differ in the
degree of spatial continuity. More continuous processes have smoother realizations,
their successive values are more similar. This is indicative of higher spatial auto-
correlation for the same distance lag. Realization a) is that of a white noise process
(uncorrelated data), b)–d) are those of Gaussian processes with exponential, spheri-
cal, and gaussian covariance function, respectively. Adapted from Schabenberger and
Pierce (2002).
estimates of the parameters governing the two models will differ for a partic-
ular set of data. The fitted correlation models may imply the same degree of
continuity.
We have focused on inferring the degree of continuity from the behavior of
the correlation (or covariance) model near the origin. This is intuitive, since
the near-origin behavior governs the lag interval for which correlations are
high. The theoretical reason behind this focus lies in mean square continuity
and differentiability of the random field. Consider a sequence of random variables $\{X_n\}$. We say that $\{X_n\}$ is mean square continuous if there exists a random variable $X$ with $\mathrm{E}[X^2] < \infty$, such that $\mathrm{E}[(X_n - X)^2] \to 0$. For a spatial random field $\{Z(\mathbf{s}) : \mathbf{s} \in D \subset \mathbb{R}^d\}$ with constant mean and constant variance, mean-square continuity at $\mathbf{s}$ implies that
$$\lim_{\mathbf{h} \to \mathbf{0}} \mathrm{E}\left[(Z(\mathbf{s}) - Z(\mathbf{s}+\mathbf{h}))^2\right] = 0.$$
Figure 2.4 Correlation functions of the spatial processes shown in Figure 2.3. The
more sharply the correlation function decreases from the origin, the less continuous
is the process.
we conclude from
$$\lim_{\mathbf{h} \to \mathbf{0}} \mathrm{E}\left[(Z(\mathbf{s}) - Z(\mathbf{s}+\mathbf{h}))^2\right] = \lim_{\mathbf{h} \to \mathbf{0}} 2\left(C(\mathbf{0}) - C(\mathbf{h})\right),$$
that the process is mean square continuous if and only if $C(\mathbf{h})$ is continuous at the origin. This excludes
models with nugget effect from consideration and reflects the sentiment that
only the study of mean square continuous processes is worthwhile.
Mean square continuity by itself does not convey much about the smooth-
ness of the process and how it is related to the covariance function. The
smoothness concept is brought into focus by studying the partial derivatives
of the random field. First, consider the special case of a weakly stationary
spatial process on a transect, {Z(s) : s ∈ D ⊂ R}, with mean µ and variance
σ 2 . Furthermore assume that data are collected at equally spaced intervals δ.
The gradient between successive observations is then
$$\dot{Z} = \frac{Z(s+\delta) - Z(s)}{\delta},$$
with $\mathrm{E}[\dot{Z}] = 0$ and variance
$$\mathrm{Var}[\dot{Z}] = \delta^{-2}\left\{\mathrm{Var}[Z(s+\delta)] + \mathrm{Var}[Z(s)] - 2\,\mathrm{Cov}[Z(s+\delta), Z(s)]\right\} = 2\delta^{-2}\left\{\sigma^2 - C(\delta)\right\} \equiv \dot{\sigma}^2. \qquad (2.4)$$
For a second-order stationary random field, we know that $C(0) = \sigma^2$ and hence $[dC(\delta)/d\delta]_{\delta=0} = 0$. Additional details can be garnered from (2.4), because
$$C(\delta) = \sigma^2 - \frac{\delta^2}{2}\dot{\sigma}^2 \qquad (2.5)$$
can be the limiting form (as $\delta \to 0$) of $C(\delta)$ only if $\dot{\sigma}^2$ is finite. As a consequence of (2.5), the negative of the second derivative of $C(\delta)$ at the origin is the mean square derivative $\dot{\sigma}^2$; the covariance function has a true maximum at the origin.
Notice that (2.4) can be written as $\dot{\sigma}^2 = 2\delta^{-2}\{C(0) - C(\delta)\} = 2\delta^{-2}\gamma(\delta)$, where $\gamma(\delta)$ is the semivariogram of the $Z$ process. For the mean square derivative to be finite, the semivariogram cannot rise more quickly in $\delta$ than $\delta^2$. This condition is known as the intrinsic hypothesis. It is, in fact, slightly stronger than $2\gamma(\delta)/\delta^2 \to \text{const.}$, as $\delta \to \infty$. A valid semivariogram must satisfy $2\gamma(\delta)/\delta^2 \to 0$ as $\delta \to \infty$. For example, the power semivariogram model
$$\gamma(\mathbf{h}; \boldsymbol{\theta}) = \begin{cases} 0 & \mathbf{h} = \mathbf{0} \\ \theta_1 + \theta_2||\mathbf{h}||^{\theta_3} & \mathbf{h} \ne \mathbf{0} \end{cases}$$
is a valid semivariogram for an intrinsically stationary process only if $0 \le \theta_3 < 2$.
For a general process $Z(s)$ on $\mathbb{R}$ with covariance function $C$, define
$$\dot{Z}_h = \frac{Z(s+h) - Z(s)}{h}.$$
Stein (1999, Ch. 2.6) proves that $Z(s)$ is mean square differentiable if and only if the second derivative of $C(h)$ evaluated at $h = 0$ exists and is finite. In general, $Z(s)$ is $m$-times mean square differentiable if and only if
$$\left[\frac{d^{2m}C(h)}{dh^{2m}}\right]_{h=0}$$
exists and is finite; the covariance function of the $m$th mean square derivative of the process is then
$$(-1)^m\,\frac{d^{2m}C(h)}{dh^{2m}}.$$
The smoothness of a spatial random field increases with the number of times it is mean square differentiable. The gaussian covariance model is
$$C(\mathbf{s}_i - \mathbf{s}_j) = \sigma^2\exp\left\{-3\,\frac{||\mathbf{s}_i - \mathbf{s}_j||^2}{\alpha^2}\right\}. \qquad (2.6)$$
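A numerical look at the smoothness criterion above: the second central difference of $C$ at the origin converges for the gaussian model but diverges for the exponential, so only the former yields a mean square differentiable process (the models and parameter values are illustrative):

```python
import numpy as np

def second_diff_at_zero(C, delta):
    """Second central difference of C at the origin."""
    return (C(delta) - 2 * C(0.0) + C(-delta)) / delta ** 2

gauss = lambda h: np.exp(-3 * h ** 2 / 25.0)   # gaussian model, alpha = 5
expo = lambda h: np.exp(-3 * abs(h) / 5.0)     # exponential model, alpha = 5
for d in (1e-1, 1e-2, 1e-3):
    print(d, second_diff_at_zero(gauss, d), second_diff_at_zero(expo, d))
# gaussian column converges (to -6/25); exponential column diverges like -1/d
```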
2.4 Random Fields in the Spatial Domain

The representation
$$\{Z(\mathbf{s}) : \mathbf{s} \in D \subset \mathbb{R}^d\} \qquad (2.7)$$
is very general and reveals little about the structure of the random field under
study. To be applicable, the formulation must be cast within a framework
through which (i) statistical methods of analysis and inference can be derived,
and (ii) the properties of statistical estimators as well as the properties of the
random field itself can be studied. For second-order stationary random fields,
the core components of any formulation are the mean function E[Z(s)] = µ(s),
the covariance function C(h) = Cov[Z(s), Z(s + h)], and the properties of
the index set D (fixed continuous, fixed discrete, or random). Of the many
possible formulations that add structure to (2.7), we present two that structure
the random field in the spatial domain (§2.4.1 and §2.4.2), and the spectral
representation in the frequency domain (§2.5). The distinction between spatial
and spectral representation is coarsely whether Z(s) is expressed in terms of
functions of the observed coordinates s, or in terms of a random field X(ω)
that lives in a space consisting of frequencies.
Readers accustomed to traditional statistical modeling techniques such as
linear, nonlinear, and generalized linear models will find the model repre-
sentation in §2.4.1 most illustrative. Readers trained in the analysis of time
series data in the spectral domain might prefer the representation in §2.5. The
following discussion enunciates the relationships and correspondence between
the three formulations. They have specific advantages and disadvantages. The
model formulation will be the central representation for most of the remainder
of this text. We invoke the spectral representation when it is mathematically
more convenient to address an issue in the frequency domain, compared to
the spatial domain.
Consider a statistical model for the random field Z(s) with additive error
structure,
Z(s) = µ(s) + e(s),
where the errors have covariance function Cov[e(s), e(s + h)] = Ce (h). As
with other statistical models, the errors can contain more than a single com-
ponent. It is thus helpful to consider the following decomposition of the process
(Cressie, 1993, Ch. 3.1):
$$Z(\mathbf{s}) = \mu(\mathbf{s}) + W(\mathbf{s}) + \eta(\mathbf{s}) + \epsilon(\mathbf{s}). \qquad (2.10)$$
With the decomposition (2.10) in place, we now define two types of models.
1. Signal Model. Let S(s) = µ(s) + W (s) + η(s) denote the signal of the
process. It contains all components which are spatially structured, either
through deterministic or stochastic sources. The decomposition Z(s) =
S(s) + ε(s) plays an important role in applications of spatial prediction.
There, it is the signal S(s) that is of interest to the modeler, not the noisy
version Z(s).
2. Mean Model. Let e(s) = W(s) + η(s) + ε(s) denote the error process of
the model and consider Z(s) = µ(s) + e(s). If the modeler focuses on the
mean function but needs to account for autocorrelation structure in the
data, the mean model is often the entry point for analysis. It is noteworthy
that e(s) contains different spatial error processes, some more structured
than others. The idea of W (s) and η(s) is to describe small- and microscale
stochastic fluctuations of the process. If one allows the mean function µ(s)
to be flexible, then a locally varying mean function can absorb some of
this random variation. In other words, “one modeler’s mean function is
another modeler’s covariance structure.” The early approaches to cope with
spatial autocorrelation in field experiments, such as trend surface models
and nearest-neighbor adjustments, attempted to model the mean structure.
Cliff and Ord (1981, Ch. 6) distinguish between reaction and interac-
tion models. In the former, sites react to outside influences, e.g., plants react
to the availability of nutrients in the root zone. Since this availability varies
spatially, plant size or biomass will exhibit a regression-like dependence on nu-
trient availability. By this reasoning, nutrient-related covariates are included
as regressors in the mean function f (x, s, β). In an interaction model, sites
react not to outside influences, but react with each other. Neighboring plants,
for example, compete with each other for resources. Schabenberger and Pierce
(2002, p. 601) conclude that “when the dominant spatial effects are caused by
sites reacting to external forces, these effects should be part of the mean func-
tion [f (x, s, β)]. Interactive effects [. . .] call for modeling spatial variability
through the autocorrelation structure of the error process.”
The distinction between reactive and interactive models is not cut-and-
dried. Significant autocorrelation does not imply an interactive model over
a reactive one. Spatial autocorrelation can be spurious—if caused by large-
scale trends—or real—if caused by cumulative small-scale, spatially varying
components.
When the spatial domain is discrete, the decomposition (2.10) is not directly
applicable, since the random processes W (s) and η(s) are now defined on a
fixed, discrete domain. They no longer represent continuous spatial variation.
As before, reactive effects can be modeled directly through the mean function
µ(s). To incorporate interactive effects, the covariance structure of the model
must be modified, however. One such modification gives rise to the simul-
taneous spatial autoregressive (SSAR) model. Let µ(si ) denote the mean of
the random field at location si . Then Z(si ) is thought to consist of a mean
contribution, contributions of neighboring sites, and random noise:
Z(si) = µ(si) + e(si)
      = µ(si) + Σ_{j=1}^{n} bij {Z(sj) − µ(sj)} + ε(si).    (2.11)
The coefficients bij in (2.11) describe the spatial connectivity of the sites.
If all bij = 0, the model reduces to a standard model with mean µ(si) and
uncorrelated errors ε(si). The coefficients govern the spatial autocorrelation
structure, but not directly. The responses at sites si and sj can be correlated,
even if bij = 0. To see this, consider a linear mean function µ(si) = x(si)′β,
collect the coefficients into a matrix B = [bij], and write the model as

Z(s) = X(s)β + B(Z(s) − X(s)β) + ε(s)
ε(s) = (I − B)(Z(s) − X(s)β).
Lemma 2.1 If X(s) in (2.13) is a white noise process with mean µx and
variance σx², then under some mild regularity conditions,

(i) E[Z(s)] = µx ∫_u K(u) du;

(ii) Cov[Z(s), Z(s + h)] = σx² ∫_u K(u)K(u + h) du;
(iii) Z(s) is a weakly stationary random field.
Proof. The proof is straightforward and requires only standard calculus but it
is dependent on being able to exchange the order of integration (the regularity
conditions that permit application of Fubini’s theorem). To show (iii), it is
sufficient to establish that E[Z(s)] and Cov[Z(s), Z(s + h)] do not depend on
s. Provided the order of integration can be exchanged, tackling (i) yields
E[Z(s)] = ∫_X ∫_v K(v) X(s − v) dv F(dx)
        = ∫_v K(v) ∫_X X(s − v) F(dx) dv
        = µx ∫_v K(v) dv.
To show (ii) assume that µx = 0. The result holds in general for other values
of µx . Then,
Cov[Z(s), Z(s + h)] = ∫_v ∫_t K(v)K(t) E[X(s − v) X(s + h − t)] dv dt
                    = ∫_v ∫_t K(v)K(t) Cx(h + v − t) dv dt.
Since X(s) is a white noise random field, only those terms for which h+v−t =
0 need to be considered and the double integral reduces to
Cov[Z(s), Z(s + h)] = σx² ∫_v K(v)K(h + v) dv.    (2.16)

Since the mean and covariance function of Z(s) do not depend on spatial
location, (iii) follows. This completes the proof of the lemma.
Example 2.3
To demonstrate the effect of convolving white noise we consider an excitation
process X(s) on the line and two kernel functions: a gaussian kernel, whose
bandwidth h corresponds to the standard deviation of the kernel, and a uniform
kernel, whose width was chosen so that its standard deviation also equals h.
Figure 2.5 shows the realization of the excitation field and the realizations
of the convolution (2.13) for bandwidths h = 0.1, 0.05, 0.025. For a given
bandwidth, convolving with the uniform kernel produces realizations less smooth
than those with the gaussian kernel; the uniform kernel distributes weights
evenly. For a particular kernel function, the smoothness of the process
increases with the bandwidth.
Figure 2.5 Convolutions of Gaussian white noise with gaussian (solid line) and uni-
form (dotted line) kernel functions. The bandwidth h corresponds to the standard
deviation of the gaussian kernel. The width of the uniform kernel was chosen to
have standard deviation equal to that of the gaussian kernel. The jagged line represents
the realization of the excitation field.
Figure 2.6 Correlation functions determined according to (2.16) for the convolu-
tions shown in Figure 2.5. The bandwidth h corresponds to the standard deviation
of the gaussian kernel. The width of the uniform kernel was chosen to have standard
deviation equal to that of the gaussian kernel.
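The construction in Example 2.3 is easy to reproduce numerically. The sketch below assumes a standard Gaussian excitation field on a fine grid of the unit interval; the grid spacing, seed, and the lag-1 correlation used as a crude smoothness summary are illustrative choices, not part of the example.

```python
import numpy as np

def kernel_weights(kind, h, dx):
    """Discrete kernel weights with standard deviation h.
    'gaussian': Gaussian density with sd h, truncated at 4h.
    'uniform' : uniform on [-w, w] with w = h*sqrt(3), so its sd is also h."""
    if kind == "gaussian":
        s = np.arange(-4 * h, 4 * h + dx, dx)
        w = np.exp(-0.5 * (s / h) ** 2)
    else:
        half = h * np.sqrt(3.0)
        w = np.ones(int(2 * half / dx) + 1)
    return w / w.sum()                      # normalize to sum to one

rng = np.random.default_rng(1)
dx = 0.001
x = rng.standard_normal(int(1.0 / dx))      # white noise excitation on (0, 1)

def lag1(z):
    """Lag-1 correlation: closer to 1 indicates a smoother realization."""
    return np.corrcoef(z[:-1], z[1:])[0, 1]

for h in (0.10, 0.05, 0.025):
    zg = np.convolve(x, kernel_weights("gaussian", h, dx), mode="same")
    zu = np.convolve(x, kernel_weights("uniform", h, dx), mode="same")
    print(f"h={h:5.3f}  lag-1 corr: gaussian {lag1(zg):.3f}, uniform {lag1(zu):.3f}")
```

Consistent with the example, the gaussian kernel should produce the smoother realization at every bandwidth, with smoothness dropping as h shrinks.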
The importance of the spectral representation for such functions lies
in the fact that the coefficients c_i² yield the contribution to the power from
the term in the Fourier series at frequency i/(2p). The power spectrum is
obtained by graphing c_i² against i/(2p). For periodic functions this spectrum
is discrete.
Most (deterministic) functions are not periodic, and adjustments must be
made. We are still interested in the energy and power distribution, but the
power spectrum will no longer be discrete; there will be power at all fre-
quencies. In a sense, moving from Fourier analysis of periodic to non-periodic
functions is somewhat akin to the changes incurred in switching the study of
probability from discrete to continuous random variables. The switch is made
by representing the function not as a Fourier series, but a Fourier integral,
provided the function satisfies some additional regularity conditions. If g(s) is
a non-periodic, deterministic function, then, provided g(s) decays to zero as
s → ∞ and s → −∞, and

∫_{−∞}^{∞} |g(s)| ds < ∞,

it can be represented as a Fourier integral.
This shows that |G(ω)|2 represents the density of energy contributed by com-
ponents of g(s) in the vicinity of ω. This interpretation is akin to the coef-
ficients c2i in the case of periodic functions, but now we are concerned with
a continuous distribution of energy over frequencies. Again, this makes the
point that studying frequency properties of non-periodic functions vs. peri-
odic functions bears resemblance to the comparison of studying probability
mass and density functions for discrete and continuous random variables.
Once the evolution from deterministic-periodic to deterministic-nonperiodic
functions to realizations of stochastic processes is completed, the function
G(ω) will be a random function itself and its average will be related to the
density of power across frequencies.
One important special case of (2.19) occurs when the function g(s) is real-
valued and even, g(s) = g(−s), ∀s, since then the complex exponential can
be replaced with a cosine function and does not require complex mathemat-
ics. The even deterministic function of greatest importance in this text is the
covariance function C(h) of a spatial process. But this mathematical conve-
nience is not the only reason for our interest in (2.19) when g(s) is a covariance
function. Of central importance is the function that takes the place of G(ω)
when we consider random processes rather than deterministic functions: the
spectral density function.
The conditions under which the limit (2.24) is indeed finite are surprisingly
related to the rate at which the covariance function C(h) = Cov[Z(s), Z(s + h)]
decays with increasing h, that is, to the continuity of the process. The problem of infer-
ring properties of the process from a single realization is tackled by considering
the expectation of (2.24). If it exists, the function

s(ω) = lim_{S→∞} (2S)⁻¹ E[ |G̃(ω)|² ]    (2.25)
is called the (power) spectral density function of the random process. We
now establish the relationship between spectral density and covariance func-
tion for a process on the line because it yields a more accessible formulation
than (2.25). The discussion will then extend to processes in Rd .
which leads to (2S)−1 |G̃(ω)|2 as the Fourier transform of the sample covari-
ance function. Taking expected values and the limit in (2.25) establishes
the relationship between s(ω) and C(h). This is the approach considered
in Priestley (1981, pp. 212–213).
The development that follows considers approach (i) and is adapted from
the excellent discussion in Vanmarcke (1983, Ch. 3.2–3.4). We commence by
focusing on random processes in R1 . The extensions to processes in Rd are
immediate, the algebra more tedious. The extensions are provided at the end
of this subsection.
Write the process as a sum of 2K sinusoids,

Z(s) = µ + Σ_{j=−K}^{K} Yj(s) = µ + Σ_{j=−K}^{K} Aj cos(ωj s + φj).    (2.26)
The Aj are random amplitudes and the φj are random phase angles dis-
tributed uniformly on (0, 2π); j = 1, · · · , K. All Aj and φj are mutually inde-
pendent. The Yj (s) are thus zero mean random variables because
(2π)⁻¹ ∫_0^{2π} cos(a + φ) dφ = 0.
We defined the spectral density and claimed that this is the appropriate
function through which to study the power/energy properties of a random
process in the frequency domain, provided s(ω) exists! Since C(h) is a
non-periodic deterministic function, s(ω) exists when C(h) satisfies the
conditions in §2.5.1 for (2.20): C(h) must decay to zero as h → ∞ and must
be absolutely integrable,

∫_{−∞}^{∞} |C(h)| dh < ∞.
• Bochner’s theorem states that for every continuous nonnegative-definite function
C(h) with finite C(0) there corresponds a nondecreasing function S(ω)
such that (2.28) holds. If S(ω) is absolutely continuous, then dS(ω) =
s(ω) dω (see §2.5.4 for the implications if S(ω) is a step-function). It is
thus necessary that the covariance function be continuous, which disallows
a discontinuity at the origin (see §2.3). Bochner’s theorem further tells us
that C(h) is positive-definite if and only if it has representation (2.28). This
provides a method for constructing valid covariance functions for stochastic
processes. If, for example, a function g(h) is a candidate for describing the
covariance in a stochastic process, then, if g(h) can be expressed as (2.28)
it is a valid model. If g(h) does not have this representation it should
not be considered. The construction of valid covariance functions from the
spectral representation is discussed (for the isotropic case) in §4.3.1 and for
spatio-temporal models in §9.3.
• Since the spectral density function is the Fourier transform of the covariance
function, this suggests a simple method for estimating s(ω) from data:
calculate the sample covariance function at a set of lags and perform a
Fourier transform. This is indeed one method to calculate an estimate of
s(ω), known as the periodogram (see §4.7.1 and the sketch below).
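A minimal sketch of this recipe; the AR-type series standing in for observations on an equally spaced transect, the lag truncation, and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
# AR(1)-type series as a stand-in for observations on a transect
n, phi = 500, 0.7
z = np.zeros(n)
for t in range(1, n):
    z[t] = phi * z[t - 1] + rng.standard_normal()

zc = z - z.mean()
maxlag = 25
# sample covariances C-hat(k), k = 0, ..., maxlag
chat = np.array([np.sum(zc[:n - k] * zc[k:]) / n for k in range(maxlag + 1)])

def s_hat(omega):
    """Estimate s(w) = (2*pi)^{-1} sum_h exp(-i*w*h) C(h), using the
    sample covariances and the symmetry C(h) = C(-h)."""
    lags = np.arange(1, maxlag + 1)
    return (chat[0] + 2 * np.sum(chat[1:] * np.cos(omega * lags))) / (2 * np.pi)

for w in (0.0, 0.5, 1.0, 2.0, 3.0):
    print(f"omega={w:.1f}  s_hat={s_hat(w): .4f}")
```

For this positively correlated series the estimate is largest near ω = 0 and decays with frequency, as expected for a slowly varying process.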
2.5.3.2 Extensions to Rd
The random increments dX(ω) of the spectral process have special properties.
Their mean is zero and they are uncorrelated, representing a white noise
process in the frequency domain. Because of their uncorrelatedness the
covariance function can now be written as (X̄ denotes the complex conjugate)

C(h) = Cov[Z(0), Z(h)] = E[ ∫_{Ω1} ∫_{Ω2} dX(ω1) dX̄(ω2) ]
     = ∫_Ω exp{iωh} E[ |dX(ω)|² ].
The covariance function C(h) and the spectral density function s(ω) form a
Fourier transform pair,
C(h) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} exp{iω′h} s(ω) dω    (2.31)

s(ω) = (2π)^{−d} ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} exp{−iω′h} C(h) dh,    (2.32)
The spectral density function does not exist for processes whose covariance
functions are discontinuous at the origin, whether the process is in R1 or in Rd.
Also notice the similarity of (2.30) to the convolution representation (2.13).
Both integrate over a stochastic process of independent increments. The con-
volution representation operates in the spatial domain, (2.30) operates in
the frequency domain. This correspondence between convolution and spectral
representation can be made more precise through linear filtering techniques
(§2.5.6). But first we consider further properties of spectral density functions.
Up to this point we have tacitly assumed that the domain D is continuous.
Many stochastic processes have a discrete domain and lags are restricted to
an enumerable set. For example, consider a process on a rectangular r ×c row-
column lattice. The elements of the lag vectors h = [h1, h2]′ consist of the set
{(h1, h2) : h1, h2 = 0, ±1, ±2, · · ·}. The first modification to the previous formu-
las is that integration in the expression for s(ω) is replaced by summation.
Let ω = [ω1, ω2]′. Then,

s(ω) = (2π)^{−2} Σ_{h1=−∞}^{∞} Σ_{h2=−∞}^{∞} C(h) cos(ω1 h1 + ω2 h2).
The second change is the restriction of the frequency domain to [−π, π], hence
C(h) = ∫_{−π}^{π} ∫_{−π}^{π} cos(ω′h) s(ω) dω.
Note that we still assume that the spectral density is continuous, even if the
spatial domain is discrete. Continuity of the domain and continuity of the
spectral distribution function dS(ω) are different concepts.
If s(ω) exists for all ω we can introduce the integrated spectrum

S(ω) = S(ω1, ω2) = ∫_{−∞}^{ω1} ∫_{−∞}^{ω2} s(ϑ1, ϑ2) dϑ1 dϑ2.    (2.33)
the other coordinate can be obtained from the marginal spectral density as
F1(∞) = ∫_{−∞}^{∞} F(∞, ϑ2) dϑ2. Because

f(ω) = ∂^d F(ω1, · · · , ωd) / (∂ω1 · · · ∂ωd),

the spectral densities can also be marginalized,

f1(ω) = ∫_{−∞}^{∞} · · · ∫_{−∞}^{∞} f(ω, ω2, · · · , ωd) dω2 · · · dωd.
Priestley (1981, pp. 219–222) outlines the proof of the theorem and establishes
the connection to Bochner’s theorem.
The spectral density function for a process with α small is flatter than the
sdf for a process with α large (Figure 2.7b).
Figure 2.7 Autocorrelation function and spectral density functions for processes with
exponential correlation structure and different ranges. Panel b displays f (ω)/f (0) to
amplify differences in shape, rather than scale.
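As a worked instance of the transform pair (2.32) in R1, assume the common parameterization C(h) = σ² exp{−|h|/α} (conventions differ; some texts absorb a factor of 3 so that α is the practical range). Then

s(ω) = (2π)⁻¹ ∫_{−∞}^{∞} exp{−iωh} σ² exp{−|h|/α} dh
     = (σ²/π) ∫_0^{∞} exp{−h/α} cos(ωh) dh
     = σ²α / {π(1 + α²ω²)},

which concentrates near ω = 0 as α grows and flattens as α shrinks, in agreement with Figure 2.7b.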
At the beginning of §2.5 we noted the similarity between the spectral repre-
sentation (2.30) of the field Z(s) and the convolution representation (2.13).
The correspondence can be made more precise by considering linear filter-
ing techniques. Wonderful expositions of linear filtering and its applications
to spectral analysis of random fields can be found in Thiébaux and Pedder
(1987, Ch. 5.3) and in Percival and Walden (1993, Ch.5). Our comments are
based on these works.
A linear filter is defined as a linear, location-invariant transformation of
one variable to another. The input variable is the one to which the linear
transformation is applied, the resulting variable is called the output of the
filter. We are interested in applying filtering techniques to random fields, and
denote the linear filtering operation as
Z(s) = L{Y (s)}, (2.36)
to depict that Z(s) is the output when the linear filter L is applied to the
process Y(s). A linear filter is defined through the following properties:

(i) For any constant a, L{aY(s)} = aL{Y(s)};
(ii) If Y1(s) and Y2(s) are any two processes, then L{Y1(s) + Y2(s)} =
L{Y1(s)} + L{Y2(s)};
(iii) If Z(s) = L{Y(s)} for all s, then Z(s + h) = L{Y(s + h)}.
The linear filter is a scale-preserving, linear operator [(i) and (ii)] and is not
affected by shifts in the origin (iii). It is easy to establish that the convolution
(2.13) is a linear filter with input X(s) (Chapter problems). It would be more
appropriate to write the linear filter as L{Y (·)} = Z(·) since it associates a
function with another function. The notation we choose here suggests that
the filter associates a point with a point and should be understood to imply
that the filter maps the function defined on a point-by-point basis by Y (s) to
the function defined on a point-by-point basis by Z(s).
The excitation field of a convolution has a spectral representation, provided
it is mean square continuous. Hence,
X(s) = ∫ exp{iω′s} U(dω),    (2.37)
Passing X(s) through the linear filter and comparing with the spectral
representation of the output, we see from (2.38) that the two frequency
domain processes must be related,

P(dω) = H(ω)U(dω).
As a consequence, the spectral densities of the Z and X processes are also
related. Since E[|P (dω)|2 ] = sz (ω)dω and E[|U (dω)|2 ] = sx (ω)dω, we have
sz (ω) = |H(ω)|2 sx (ω). (2.39)
The spectral density function of the Z process is the product of the squared
modulus of the transfer function and the spectral density function of the input
field X(s). This result is important because it enables us to construct one
spectral density from another, provided the transfer function and the spectral
density function of either Z or X are known; for example,

sx(ω) = |H(ω)|⁻² sz(ω).
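Relationship (2.39) can be checked empirically. In the sketch below, white noise (for which sx(ω) is flat) is passed through a simple moving-average filter; the filter weights, series length, and the band-averaging of raw periodograms are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2 ** 14
x = rng.standard_normal(n)            # input: white noise, flat spectrum
a = np.array([0.25, 0.5, 0.25])       # weights of a simple smoothing filter
z = np.convolve(x, a, mode="same")    # output of the linear filter

w = 2 * np.pi * np.fft.rfftfreq(n)    # angular frequencies
fx = np.abs(np.fft.rfft(x)) ** 2      # raw periodogram of the input
fz = np.abs(np.fft.rfft(z)) ** 2      # raw periodogram of the output

def h2(omega):
    """Squared modulus of the transfer function H(w) = sum_j a_j exp(-i w j)."""
    return abs(np.sum(a * np.exp(-1j * omega * np.arange(len(a))))) ** 2

for k in (500, 3000, 6000):
    band = slice(k - 100, k + 100)    # average over a band to tame noise
    print(f"w={w[k]:.2f}  periodogram ratio {fz[band].mean() / fx[band].mean():.3f}"
          f"  |H|^2 {h2(w[k]):.3f}")
```

The band-averaged periodogram ratio tracks |H(ω)|², mirroring sz(ω) = |H(ω)|² sx(ω).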
The transfer function relates the spectra of the input and output of a linear
filter in a simple fashion. It is noteworthy that the relationship is independent
of location. We can learn more about the transfer function H(ω). Consider the
one-dimensional case for the moment. Since spectral representations involve
complex exponentials, consider the function εω(s) ≡ exp{iωs} for a particular
frequency ω. When εω (s) is processed by a linear location-invariant filter, we
can write the output yω (s + h) as
yω (s + h) = L{εω (s + h)} = L{exp{iωh}εω (s)}
= exp{iωh}L{εω (s)} = exp{iωh}yω (s),
for any shift h. We have made use here of the scale preservation and the
location invariance properties of the linear filter. Since the result holds for
any s, it also holds for s = 0 and we can write for the output of the filter
yω (h) = exp{iωh}yω (0).
Since the shift h can take on any value we also have
yω (s) = exp{iωs}yω (0) = εω (s)H(ω).
Is it justified to call the coefficient of exp{iωs} the transfer function? Let a
realization of X(s) be given by
x(s) = ∫_{−∞}^{∞} exp{iωs} u(dω).

Then L{x(s)} = ∫_{−∞}^{∞} exp{iωs} H(ω) u(dω) and

L{X(s)} = ∫_{−∞}^{∞} exp{iωs} H(ω) U(dω).
Problem 2.7 Consider a random field in the plane with covariance func-
tion C(h) = C(h1 , h2 ), where h1 and h2 are the coordinate shifts in the x-
and y-directions. The covariance function is called separable if C(h1 , h2 ) =
C1 (h1 )C2 (h2 ). Here, C1 is the covariance function of the random process Z(x)
along lines parallel to the x-axis. Show that separability of the covariance
function implies separability of the spectral density.
Figure 3.1 Realizations of a completely random pattern (a), a Poisson cluster process
(b), and a process with regularity (sequential inhibition, c). All patterns have n = 100
events on the unit square.
For a Binomial process, the average number of events in region A is simply
nπ(A), and for any Borel subset A of D

λ(s) = lim_{ν(ds)→0} nπ(ds)/ν(ds) = lim_{ν(ds)→0} {nν(ds)/ν(D)}/ν(ds) = n/ν(D) ≡ λ.
Since the first-order intensity does not change with spatial location, the
Binomial process is a homogeneous (or uniform) process.
Points in non-overlapping subregions are not independent, however. Since
the total number of events in D is fixed, m events in A necessarily implies
n−m events in D\A. Because of the correlation between the number of events
in disjoint subregions, a Binomial process is not a completely spatially random
process. It is a very important point process, however, for testing observed
patterns against the CSR hypothesis. Whereas a CSR pattern is the result of
a homogeneous Poisson process, in Monte Carlo tests of the CSR hypothesis
one usually conditions the simulations to have the same number of events
as the observed pattern. Conditioning a homogeneous Poisson process on the
number of events yields a Binomial process.
There are many types of Poisson processes with relevance to spatial statis-
tics. Among them are the homogeneous Poisson process, the inhomogeneous
Poisson process, the Poisson cluster process, and the compound Poisson pro-
cess. A process is referred to as the Poisson process if it has the following two
properties:
(i) If N(A) denotes the number of events in subregion A ⊂ D, then N(A) ∼
Poisson(λν(A)), where 0 < λ < ∞ denotes the constant intensity function
of the process;
(ii) If A1 and A2 are two disjoint subregions of D, then N (A1 ) and N (A2 )
are independent.
Stoyan, Kendall, and Mecke (1995, p. 33) call (ii) the “completely random”
property. It is noteworthy that property (ii) follows from (i) but that the
reverse is not true. The number of events in A can be distributed as a Poisson
variable with a spatially varying intensity, but events can remain independent
in disjoint subsets. We consider the combination of (i) and (ii) as the definition
of complete spatial randomness. A point process that satisfies properties (i)
and (ii) is called a homogeneous Poisson (or CSR) process.
If the intensity function λ(s) varies spatially, property (i) is not met, but (ii)
may still hold. A process of this kind is the inhomogeneous Poisson process
(IPP). It is characterized by the following properties.
(i) If N(A) denotes the number of events in subregion A ⊂ D, then N(A) ∼
Poisson(λ(A)), where 0 < λ(s) < ∞ is the intensity at location s and
λ(A) = ∫_A λ(s) ds;
(ii) If A1 and A2 are two disjoint subregions of D, then N (A1 ) and N (A2 )
are independent.
The HPP is obviously a special case of the IPP where the intensity is constant.
Stoyan et al. (1995) refer to the HPP as the stationary Poisson process and
label the IPP the general Poisson process. Stationarity of point processes is
explored in greater detail in §3.4. We note here that stationarity implies (at
least) that the first-order intensity of the process is translation invariant which
requires that λ(s) ≡ λ. The inhomogeneous Poisson process is a non-stationary
point process.
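Both processes are straightforward to simulate; a sketch on a rectangle (the intensity, region, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_hpp(lam, a=1.0, b=1.0):
    """Homogeneous Poisson process on the rectangle [0,a] x [0,b]:
    N ~ Poisson(lam*a*b) events placed uniformly and independently."""
    n = rng.poisson(lam * a * b)
    return np.column_stack((rng.uniform(0, a, n), rng.uniform(0, b, n)))

def simulate_binomial(n, a=1.0, b=1.0):
    """Binomial process: exactly n uniform events; equivalently an HPP
    conditioned on the total number of events."""
    return np.column_stack((rng.uniform(0, a, n), rng.uniform(0, b, n)))

print("HPP realization:", len(simulate_hpp(100.0)), "events")
print("Binomial process:", len(simulate_binomial(100)), "events")
```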
The first-order intensity λ(s) and the yet to be introduced second-order inten-
sity λ2 (si , sj ) (§3.4) capture the mean and dependence structure in a spatial
point pattern. Just as the mean and covariance of two random variables X and
Y provide an incomplete description of the bivariate distribution, these two
intensity measures describe a point process incompletely. Quite different pro-
cesses can have the same intensity measures λ(s) and λ2 (si , sj ) (for an exam-
ple, see Baddeley and Silverman, 1984). In order to establish the equivalence
of two point processes, their distributional properties must be studied. This
investigation can focus on the distribution of the n-tuple {s1 , · · · , sn } by con-
sidering the process as random sets of discrete points, or through distributions
defined for random measures counting the number of points. We focus on the
second approach. Let N (A) denote the number of events in region (Borel set)
A with volume ν(A). The finite-dimensional distributions are probabilities of
the form

Pr(N(A1) = n1, · · · , N(Ak) = nk),

where n1, · · · , nk ≥ 0 and A1, · · · , Ak are Borel sets. The distribution of
the counting measure is determined by the system of these probabilities for
k = 1, 2, · · ·. It is convenient to focus on regions A1, · · · , Ak that are mutu-
ally disjoint (non-overlapping). A straightforward system of probabilities that
determines the distribution of a simple point process consists of the zero-
probability functionals

P⁰N(A) = Pr(N(A) = 0)

for Borel sets A. Stoyan et al. (1995, Ch. 4.1) refer to these functionals as void-
probabilities since they give the probability that region A is void of events.
Notice that using zero-probability functionals for point process identification
requires simple processes; no two events can occur at the same location.
Cressie (1993, p. 625) sketches the proof of the equivalence theorem,
which states that two simple point processes with counting measures N1 and
N2 are identically distributed if and only if their finite-dimensional distribu-
tions coincide for all integers k and sets A1 , · · · , Ak and if and only if their
void-probabilities are the same: PN0 1 (A) = PN0 2 (A) ∀A.
Example 3.1 The equivalence theorem can be applied to establish the equiv-
alence of a Binomial process and a homogeneous Poisson process on D that is
conditioned on the number of events. First note that for the Binomial process
we have N (A) ∼ Binomial(n, π(A)), where π(A) = ν(A)/ν(D). Hence,
P⁰N(A) = {1 − π(A)}ⁿ = {(ν(D) − ν(A))/ν(D)}ⁿ    (3.1)

and

Pr(N(A1) = n1, · · · , N(Ak) = nk) = [n!/(n1! · . . . · nk!)] × [ν(A1)^{n1} · . . . · ν(Ak)^{nk} / ν(D)ⁿ],    (3.2)
for A1 , · · · , Ak disjoint regions such that A1 ∪ · · · ∪ Ak = D, n1 + · · · + nk = n.
Let M (A) denote the counting measure in a homogeneous Poisson process
with intensity λ. The void-probability in region A is then given by
P⁰M(A) = exp{−λν(A)}.
Conditioning on the number of events M(D) = n, the void-probability of the
conditioned process becomes {(ν(D) − ν(A))/ν(D)}ⁿ, which agrees with (3.1).
A test for complete spatial randomness addresses whether or not the observed
point pattern could possibly be the realization of a homogeneous Poisson
process (or a Binomial process for fixed n). Just as the stochastic properties
of a point process can be described through random sets of points or counting
measures, statistical tests of the CSR hypothesis can be based on counts of
events in regions (so-called quadrats), or distance-based measures using the
event locations. Accordingly, we distinguish between quadrat count methods
(§3.3.3) and distance-based methods (§3.3.4).
With a homogeneous Poisson process, the number of events in region A
is a Poisson variate and counts in non-overlapping regions are independent.
The distributional properties of quadrat counts are thus easy to establish.
A Monte Carlo test for CSR is a special case of a simulation test. The hy-
pothesis is that an observed pattern Z(s) could be the realization of a point
process model Ψ. A test statistic Q is chosen which can be evaluated for the
observed pattern and for any realization simulated under the model Ψ. Let
q0 denote the realized value of the test statistic for the observed pattern.
Then generate g realizations of Ψ and calculate their respective test statis-
tics: q1 = q(ψ1 ), · · · , qg = q(ψg ). The statistic q0 is combined with these and
the set of g + 1 values is ordered (ranked). Depending on the hypothesis and
the choice of Q, either small or large values of Q will be inconsistent with
the model Ψ. For example, if Q is the average distance between events and
their nearest neighbors, then under aggregation one would expect q0 to be
small when Ψ is a homogeneous Poisson process. Under regularity, q0 should
be large. If Ψ is rejected as a data-generating mechanism for the observed
pattern when q0 ≤ q(k) or q0 ≥ q(g+1−k) , where q(k) denotes the kth smallest
value, this is a two-sided test with significance level α = 2k/(g + 1).
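A sketch of such a Monte Carlo test, using the average nearest-neighbor distance as Q and g = 99 CSR simulations on the unit square; the two-clump toy pattern is only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

def mean_nn_distance(pts):
    """Average distance from each event to its nearest neighbor."""
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    return d.min(axis=1).mean()

def csr_monte_carlo_test(pts, g=99):
    """Rank the observed statistic among g CSR simulations with the same
    number of events on the unit square (a Binomial process)."""
    n = len(pts)
    q0 = mean_nn_distance(pts)
    sims = [mean_nn_distance(rng.uniform(size=(n, 2))) for _ in range(g)]
    rank_low = 1 + sum(q <= q0 for q in sims)    # position of q0 from below
    rank_high = 1 + sum(q >= q0 for q in sims)   # position of q0 from above
    return q0, rank_low, rank_high

# a clustered toy pattern: two tight clumps of 50 events each
pts = np.vstack([rng.normal(0.3, 0.02, (50, 2)), rng.normal(0.7, 0.02, (50, 2))])
q0, rlo, rhi = csr_monte_carlo_test(pts)
print(f"q0={q0:.4f}; rank from below {rlo}, from above {rhi}, of 100")
```

A small rank from below flags an unusually small average nearest-neighbor distance, consistent with aggregation.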
Monte Carlo tests have numerous advantages. The p-values of the tests
are exact in the sense that no approximation of the distribution of the test
statistic is required. The p-values are inexact in the sense that the number
of possible realizations under Ψ is typically infinite. At least the number of
realizations will be so large that enumeration is not possible. The number g
of simulations must be chosen sufficiently large. For a 5% level test g = 99
and for a 1% level test g = 999 have been recommended. As long as the
model Ψ can be simulated, the observed pattern can be compared against
complex point processes by essentially the same procedure. Simulation tests
thus provide great flexibility.
A disadvantage of simulation tests is that several critical choices are left
to the user, for example, the number of simulations and the test statistic.
Diggle (1983) cautions against “data dredging,” the selection of non-sensible test
statistics for the sake of rejecting a particular hypothesis. Even if sensible test
statistics are chosen, the results of simulation tests may not agree. The power
of this procedure is also difficult to establish, in particular, when applied to
tests for point patterns. The alternative hypothesis for which the power is to
be determined is not at all clear.
A Monte Carlo test calculates a single test statistic for the observed pattern
and each of the simulated patterns. Often, it is illustrative to examine not
point statistics but functions of the point pattern. For example, let hi denote
the distance from event si to the nearest other event and let I(hi ≤ h) denote
the indicator function which returns 1 if hi ≤ h. Then
Ĝ(h) = (1/n) Σ_{i=1}^{n} I(hi ≤ h)
is an estimate of the distribution function of nearest-neighbor event distances
and can be calculated for any value of h. With a clustered pattern, we expect
an excess number of short nearest-neighbor distances (compared to a CSR
pattern). The method for obtaining simulation envelopes is similar to that
used for a Monte Carlo test, but instead of evaluating a single test statistic
for each simulation, a function such as Ĝ(h) is computed. Let Ĝ0(h) denote
the empirical distribution function based on the observed point pattern. Cal-
culate Ĝ1(h), · · · , Ĝg(h) from g point patterns simulated under CSR (or any
other hypothesis of interest). Calculate the percentiles of the investigated func-
tion from the g simulations. For example, upper and lower 100% simulation
envelopes are given by

Ĝl(h) = min_{i=1,···,g} Ĝi(h)   and   Ĝu(h) = max_{i=1,···,g} Ĝi(h).
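The envelope computation can be sketched as follows, assuming the unit square as study region and CSR as the null model; the grid of distances, g, and the pattern examined are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

def g_hat(pts, h_grid):
    """EDF of nearest-neighbor distances evaluated on a grid of h values."""
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    nn = d.min(axis=1)
    return np.array([(nn <= h).mean() for h in h_grid])

def csr_envelopes(n, h_grid, g=99):
    """Upper and lower 100% simulation envelopes of G-hat under CSR."""
    sims = np.array([g_hat(rng.uniform(size=(n, 2)), h_grid) for _ in range(g)])
    return sims.min(axis=0), sims.max(axis=0)

pts = rng.uniform(size=(100, 2))        # pattern to examine (here CSR itself)
h_grid = np.linspace(0.0, 0.15, 30)
lo, hi = csr_envelopes(len(pts), h_grid)
g0 = g_hat(pts, h_grid)
print("G-hat exits the envelopes:", bool(np.any((g0 < lo) | (g0 > hi))))
```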
Figure 3.2 Locations of lightning strikes, bounding rectangle, and convex hull.
Figure 3.3 G-function and simulation envelopes from 500 simulations on bounding
box and convex hull.
• If the null hypothesis is reasonable, the observed function Ĝ(h) should fall
within the simulation envelopes. When Ĝ(h) and its envelopes are graphed
against h and a 95% upper simulation envelope is exceeded at a given
small distance h0, a Monte Carlo test with test statistic Ĝ(h0) would have
rejected the null hypothesis in a one-sided test at the 5% level. It is thus
common to calculate 95% simulation envelopes and examine whether Ĝ(h)
crosses the envelopes. It must be noted, however, that simulation envelopes
are typically plotted against the theoretical G(h) or the average Ḡ(h) of the
simulated functions, not against distance. Furthermore, unless the value of
h0 is set in advance, the Type-I error of this method is not protected.
The most elementary test of CSR based on counting events in regions divides
the domain D into non-overlapping regions (quadrats) A1, · · · , Ak of equal size
such that A1 ∪ · · · ∪ Ak = D. Typically, the domain is assumed to be rectangular,
so that it can be partitioned on a regular grid.
Example 3.2 For the three point patterns on the unit square shown in
Figure 3.1, quadrat counts were calculated on a r = 5 × c = 5 grid. Since each
pattern contains n = 100 points, an average count of n/25 = 4 events per
quadrat is common to the three realizations. The quadrat counts are shown
in Table 3.1.
With a CSR process, the counts distribute evenly across the quadrats,
whereas in the clustered pattern counts concentrate in certain areas of the
domain. Consequently, the variability of quadrat counts, if events aggregate,
exceeds the variability of the Poisson process. Clustered processes exhibit large
values of X 2 . The reverse holds for regular processes whose counts are under-
dispersed relative to the homogeneous Poisson process. The CSR hypothesis
based on the index of dispersion is thus rejected in the right tail against the
clustered alternative and the left tail against the regular alternative.
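A sketch of a quadrat-count test along these lines. The statistic below is the usual index-of-dispersion form, X² = Σ(ni − n̄)²/n̄, which may differ in detail from the X² defined earlier in the chapter, and a Monte Carlo calibration stands in for the χ² reference distribution.

```python
import numpy as np

rng = np.random.default_rng(7)

def dispersion_index(counts):
    """X^2 = sum_i (n_i - nbar)^2 / nbar over the k quadrats; approximately
    chi-square with k-1 degrees of freedom under CSR."""
    nbar = counts.mean()
    return ((counts - nbar) ** 2).sum() / nbar

def quadrat_counts(pts, r, c):
    """Counts on an r x c grid of equal-size quadrats of the unit square."""
    ix = np.minimum((pts[:, 0] * c).astype(int), c - 1)
    iy = np.minimum((pts[:, 1] * r).astype(int), r - 1)
    counts = np.zeros((r, c))
    np.add.at(counts, (iy, ix), 1)
    return counts.ravel()

pts = rng.uniform(size=(100, 2))                 # toy CSR pattern
x2 = dispersion_index(quadrat_counts(pts, 5, 5))

# Monte Carlo right-tail p-value against the clustered alternative
sims = [dispersion_index(quadrat_counts(rng.uniform(size=(100, 2)), 5, 5))
        for _ in range(999)]
p_right = (1 + sum(s >= x2 for s in sims)) / 1000
print(f"X^2 = {x2:.2f}, right-tail p = {p_right:.3f}")
```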
The sensitivity of the goodness-of-fit test to the choice of quadrat size is
evident when the processes are divided into r = 3 × c = 3 quadrats. The
left tail X 2 probabilities for 9 quadrats are 0.08, 0.99, and 0.37 for the CSR,
clustered, and regular process, respectively. The scale on which the point
pattern appears random, clustered, or regular, depends on the scale on which
the counts are aggregated. This is a special case of what is known in spatial
data analysis as the change of support problem (see §5.7).

[Table 3.1 near here: quadrat counts on the 5 × 5 grid for the CSR, clustered,
and regular patterns of Figure 3.1.]
The fact that quadrat counts can indicate randomness, regularity, and clus-
tering depending on the scale of aggregation was the idea behind the method
of contiguous quadrat aggregation proposed in an influential paper by Greig-
Smith (1952). Whereas the use of randomly placed quadrats of differing size
had been common at the time, Greig-Smith (1952) proposed a method of ag-
gregating events at different scales by counting events in successively larger
quadrats which form a grid in the domain. Initially, the domain is divided into
a grid of 2^q × 2^q quadrats. Common choices for q are 4 or 5, leading to a basic
aggregation into 256 or 1024 quadrats. Then the quadrats are successively
combined into blocks consisting of 2 × 1, 2 × 2, 2 × 4, 4 × 4 quadrats and so
forth. Consider a division of the domain into 16 × 16 = 256 quadrats as shown
in Figure 3.4.
Depending on whether the rectangular blocks that occur at every second
level of aggregation are oriented horizontally or vertically, two modes of ag-
gregation are distinguished. The entire pattern contains two blocks of 128
quadrats. Each of these contains two blocks of 64 quadrats. Each of these
contains two blocks of 32 quadrats, and so forth. Let Nr,i denote the number
of events in the ith block of size r. The sum of squares between blocks of size
r is given by
SSr = Σ_{i=1}^{m} N_{r,i}² − (1/2) Σ_{j=1}^{m/2} N_{2r,j}²,

where m denotes the number of blocks of size r.
The term sum of squares reveals the connection of the method with a (nested)
analysis of variance. Blocks of size 1 are nested within blocks of size 2, these
Figure 3.4 Aggregation of quadrat counts into blocks of successively larger size for
analysis of contiguous quadrats according to Greig-Smith (1952). Horizontal aggre-
gation is shown in the left-hand panel, vertical aggregation in the right-hand panel.
Dashed line shows the division of one block of size 128 into two blocks of size 64.
are nested within blocks of size 4, and so forth. The mean square associated
with blocks of size r is then simply MSr = SSr/2^{2q}. The original Greig-Smith
analysis consisted of plotting MSr against the block size r. Peaks or troughs
in this graph are interpreted as indicative of clustered or regular patch sizes.
Since Var[MSr] increases with the quadrat area, care must be exercised not
to over-interpret the fluctuations in MSr, in particular for larger block sizes.
The Greig-Smith analysis thus provides a good application where simulation
envelopes should be considered. The peaks and troughs in the M Sr plot can
then be interpreted relative to the variation that should be expected at that
block size. To calculate simulation envelopes, the quadrat counts for the finest
gridding are randomly permuted s times among the grid locations.
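A sketch of the block aggregation, using the pairwise-merging sum of squares as reconstructed above; Poisson counts stand in for a CSR pattern, and q = 4 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(8)

def greig_smith(counts):
    """Sums of squares between blocks of doubling size, from a 2^q x 2^q grid
    of quadrat counts. Blocks are merged pairwise, alternating horizontal
    and vertical aggregation."""
    blocks = counts.astype(float)
    ss = {}
    r, horizontal = 1, True
    while blocks.size > 1:
        if horizontal:
            merged = blocks[:, 0::2] + blocks[:, 1::2]
        else:
            merged = blocks[0::2, :] + blocks[1::2, :]
        # SS_r = sum N_{r,i}^2 - (1/2) sum N_{2r,j}^2
        ss[r] = (blocks ** 2).sum() - 0.5 * (merged ** 2).sum()
        blocks, r, horizontal = merged, 2 * r, not horizontal
    return ss

counts = rng.poisson(4.0, size=(16, 16))        # CSR-like counts, q = 4
for r, s in greig_smith(counts).items():
    # MS_r = SS_r / 2^{2q}; here 2^{2q} = 16^2 = 256
    print(f"block size {r:3d}: SS = {s:8.2f}, MS = {s / 16**2:6.3f}")
```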
In seventeen of the twenty-three months, one bird from each cluster was selected and
observed for an entire day. At eight-minute intervals during that observation
period, the location of the bird was geo-referenced. One of the goals of the
study was to obtain an estimate of the home range of the cluster, that is, the
area in which the animals perform normal activities (Burt, 1943). Figure 3.5
shows the counts obtained from partitioning the 4,706 ft × 4,706 ft bounding
square of the pattern into 32 × 32 quadrats of equal size. A concentration of
the birds near the center of the study area is obvious; the data appear highly
clustered.
Figure 3.5 Quadrat counts for woodpecker data. The bounding square was divided
into 32 × 32 square quadrats of equal size. The width of a quadrat is approximately
147 ft. A total of 675 locations were recorded. Data kindly provided by Professor Jeff
Walters, Department of Biology, Virginia Tech.
Table 3.2 Nested analysis of variance for vertical and horizontal aggregation into
contiguous quadrats
Random permutations of the quadrat counts in Figure 3.5 were generated and
the nested analysis of variance was repeated for each permutation. For the
smallest block sizes the woodpecker distribution exhibits some regularity, but
the effects of strong clustering are apparent for the larger block sizes (Table
3.2). Since Var[MSr] increases with block size, peaks and troughs must be
compared to the simulation envelopes to avoid over-emphasizing spikes in the
MSr. Figure 3.6 shows significant clustering for block sizes of 32 and more
quadrats. The spike of M̄Sr at r = 128 represents the mean square for blocks
of 16 × 8 quadrats. This corresponds to a “patch size” of 2,352 ft × 1,176 ft.
The Woodpecker quadrat count data are vertically aggregated. Observa-
tions collected over time were accumulated into discrete spatial units. The
analysis shows that both the size of the unit of aggregation and the
spatial configuration (orientation) have an effect on the analysis. This is an ex-
ample of a particular aspect of the change of support problem, the modifiable
areal unit problem (MAUP, see §5.7).
The graph of the mean squares by block size in the Greig-Smith analysis
is an easily interpretable, exploratory tool. The analysis can attain a confir-
matory character if it is combined with simulation envelopes or Monte Carlo
tests. Closed-form significance tests have been suggested based on sums of
squares and mean squares and the Chi-square and F -distributions (see, for
example, Thompson, 1955, 1958; Zahl, 1977). Upton and Fingleton (1985,
p. 53) conclude that “virtually all the significance tests are suspect.” The
analysis of contiguous quadrats also conveys information only about “scales
of pattern” that coincide with blocks of size 2^k, k = 0, · · · , 2q − 1. A peak or
trough in the MSr plot for a particular value of r could be induced by a patch
Figure 3.6 Mean squares for nested, contiguous quadrat counts as a function of
block size. Solid lines without symbols denote 100% simulation envelopes for
M̄Sr = 0.5(MSr^(v) + MSr^(h)).
size for which the mean squares cannot be calculated. The effects of orienta-
tion of the grid on the analysis are formidable. Despite these problems, the
Greig-Smith analysis remains a popular tool.
Table 3.3 Quadrat counts for woodpecker data based on 10 × 10 square quadrats
Column
Row 1 2 3 4 5 6 7 8 9 10
10 0 0 0 5 9 4 5 0 0 0
9 0 2 2 6 3 4 1 0 0 0
8 0 0 6 3 3 0 1 0 2 2
7 1 0 2 1 1 3 5 4 2 2
6 2 1 4 19 15 14 6 4 3 1
5 3 4 17 42 41 8 9 5 1 0
4 2 14 21 34 31 31 22 7 4 4
3 5 5 11 25 31 45 33 3 0 1
2 2 0 3 5 6 10 19 8 0 0
1 0 0 3 5 0 3 2 2 0 0
Table 3.4 Results for Moran’s I and Geary’s c analysis based on quadrat counts in
Table 3.3
The choice of shape and number of quadrats in CSR tests based on areal
counts is a subjective element that can influence the outcome. Test statistics
that are based on distances between events or between sample points and
events eliminate this subjectiveness, but are more computationally involved.
In this subsection tests based on distances between events are considered.
Let hij denote the inter-event distance between events at locations si and sj ,
hij = ||si − sj ||. The distance between event si and the nearest other event is
called the nearest-neighbor distance and denoted hi .
Sampling distributions of test statistics based on inter-event or nearest-
neighbor distances are elusive, even under the CSR assumption. Ripley and
Silverman (1978) described a closed-form quick test that is based on the first
ordered inter-event distances; for example, if t1 = min{hij}, then t1² has an
asymptotic exponential distribution under CSR. Test statistics can also be
built from empirical distribution functions of the distances, for example

Ĝ(y0) = #(yi ≤ y0)/n,

the empirical estimate of the probability that the nearest-neighbor distance
is at most y0, or

Ĥ(h0) = 2 #(hij ≤ h0) / {n(n − 1)},

the empirical estimate of the probability that the inter-event distance is at
most h0 , and so forth. There are many other sensible choices provided that the
test statistic is interpretable in the context of testing the CSR hypothesis. In
a clustered pattern, for example, the average nearest-neighbor distance tends
to be smaller than in a CSR pattern, and larger in a regular pattern (Figure 3.7).
If y0 is chosen small, Ĝ(y0) will be larger than expected under CSR in a
clustered pattern and smaller in a regular pattern.
[Figure 3.7: three panels of density histograms (y-axis “Density”).]
If λ(s) plays a role in point pattern analysis akin to the mean function, what
function of event locations expresses dependency of events? In order to capture
spatial interaction, more than one event needs to be considered. The second-
order intensity function sets into relationship the expected cross-product of
event counts in infinitesimal disks and the volumes of the disks as the disks
are shrunk:

λ2(si, sj) = lim_{|dsi|→0, |dsj|→0} E[N(dsi)N(dsj)] / (|dsi| |dsj|).    (3.4)
Stoyan et al. (1995, p. 112) refer to (3.4) as the second-order product density
since it is the density of the second-order factorial moment measure.
A point process is homogeneous (uniform) if λ(s) = λ. A process is stationary
if the second-order intensity depends only on event location differences,
λ2(si, sj) = λ2*(si − sj). If the process is furthermore isotropic, the second-
order intensity depends only on distance, λ2(si, sj) = λ2*(||si − sj||) = λ2*(h).
Figure 3.8 EDF of nearest-neighbor distance (solid line) for woodpecker data and
simulation envelopes (dashed lines) based on 200 simulations.
The inner sum yields the number of observed extra events within distance
h of event si . The outer sum accumulates these counts. Since the process is
stationary, the intensity is estimated with (3.8) and K̃(h) = λ̂−1 Ẽ(h).
Because events outside the study region are not observed, this estimator
is negatively biased. If one calculates the extra events for an event near the
boundary of the region, counts will be low because events outside the region
are not taken into account. To adjust for these edge effects, various corrections
have been applied. If one considers only those events for the computation of
K(h) whose distance di from the nearest boundary exceeds h, one obtains
E*(h) = [ Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} I(hij ≤ h and dj > h) ] / [ Σ_{j=1}^{n} I(dj > h) ]

and

Êd(h) = E*(h) if Σ_{j=1}^{n} I(dj > h) > 0, and Êd(h) = 0 otherwise.
In either case,

K̂(h) = λ̂⁻¹ E*(h)   or   K̂(h) = λ̂⁻¹ Êd(h).
Cressie (1993, p. 616) discusses related estimators of K(h).
In statistical analyses one commonly computes K̂(h) for a set of distances
and compares the estimate against the K-function of the CSR process (πh²).
Unfortunately, important deviations between empirical and theoretical second-
order behavior are often difficult to determine when K̂(h) and K(h) are over-
laid in a plot. In addition, the variance of the estimated K-function increases
quickly with h and for large distances the behavior can appear erratic. Using
a plug-in estimate, the estimated L-function
L̂(h) = √(K̂(h)/π)
has better statistical properties. For graphical comparisons of empirical and
'
theoretical second-order behavior under CSR we recommend a plot of L(h)−h
versus h. The CSR model is the horizontal reference line at 0. Clustering of
events manifests itself as positive values at short distances. Significance is
assessed through Monte Carlo testing as described in §3.3.1 and in practice, we
'
consider a plot of L(h)−h versus h together with the corresponding simulation
envelopes computed under CSR as described in §3.3.2.
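A sketch of K̂(h) and the recommended L̂(h) − h values, using one variant of the border edge correction described above; a rectangular region and the naive intensity estimate λ̂ = n/ν(D) are assumed.

```python
import numpy as np

rng = np.random.default_rng(9)

def l_minus_h(pts, h_grid, a=1.0, b=1.0):
    """L(h)-h on the rectangle [0,a]x[0,b] with a simple border correction:
    only events farther than h from the boundary serve as 'centers'."""
    n = len(pts)
    lam_hat = n / (a * b)
    dmat = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(dmat, np.inf)
    # distance of each event to the nearest boundary
    dbound = np.minimum.reduce([pts[:, 0], a - pts[:, 0],
                                pts[:, 1], b - pts[:, 1]])
    out = []
    for h in h_grid:
        keep = dbound > h
        if keep.sum() == 0:
            out.append(np.nan)
            continue
        e_hat = (dmat[keep] <= h).sum() / keep.sum()   # avg extra events
        k_hat = e_hat / lam_hat
        out.append(np.sqrt(k_hat / np.pi) - h)
    return np.array(out)

pts = rng.uniform(size=(200, 2))
h_grid = np.linspace(0.01, 0.2, 10)
print(np.round(l_minus_h(pts, h_grid), 4))   # near 0 for a CSR pattern
```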
The K-function considers only the location of events; it ignores any attribute
values (marks) associated with the events. However, many point patterns in-
clude some other information about the events and this information is often
binary in nature, e.g., which of two competing species of trees occurred at a
particular location, whether or not an individual with a certain disease at a
particular location is male or female, or whether or not a plant at a location
was diseased. Diggle and Chetwynd (1991) refer to such processes as labeled.
In cases such as these, we may wonder whether the nature of the spatial pat-
tern is different for the two types of events. We discuss marked point patterns
and multivariate spatial point processes in more generality in §3.6. In this
section, we focus on the simple, yet common, case of a bivariate process with
binary marks.
One generalization of K(h) to a bivariate spatial point process is (Ripley,
1981; Diggle, 1983, p. 91)

Kij(h) = λj⁻¹ E[number of type j events within distance h of a randomly
chosen type i event].
Suppose the type i events in A are observed with intensity λi at locations
Figure 3.9 L-functions and simulation envelopes from 200 simulations on bounding
box and convex hull for lightning data. Edge correction is based on Êd(h).
where hkl = ||sk − ul ||, and w(sk , ul ) is the proportion of the circumference
of a circle centered at location sk with radius hkl that lies inside A.
If the bivariate spatial process is stationary, the cross-K-functions are sym-
metric, i.e., K12 = K21. However, K̂12 ≠ K̂21, so Lotwick and Silverman
(1982) suggest using the more efficient estimator

K*ij(h) = {λ̂j K̂ij(h) + λ̂i K̂ji(h)} / (λ̂j + λ̂i).
Values of L*ij(h) − h > 0 indicate attraction between the two processes at
distance h; values of L*ij(h) − h < 0 indicate repulsion. Unfortunately,
hypothesis tests are more difficult in this situation since a complete bivariate
model must be specified.
Diggle (1983) provides an alternative philosophy that is not based on CSR.
Another way to define the null hypothesis of “no association” between the
two processes is that each event is equally likely to be a type i (or type j)
event. This is known as the random labeling hypothesis. This hypothesis is
subtly different from the independence hypothesis. The two scenarios
arise from different random mechanisms. Under independence, the locations
and associated marks are determined simultaneously. Under the random la-
beling hypothesis, locations arise from a univariate spatial point process and a
second random mechanism determines the marks. Thus, the marks are deter-
mined independently of the locations. Diggle notes that the random labeling
hypothesis neither implies nor is implied by the stricter notion of statistical
independence between the two spatial point processes, and confusing the two
scenarios can lead to “the analysis of data by methods which are largely ir-
relevant to the problem in hand” (Diggle, 1983, p. 93). The random labeling
hypothesis always conditions on the set of locations of all observed events,
and, under this hypothesis
K11 = K22 = K12 (3.10)
(Diggle and Chetwynd, 1991). In contrast, the independence approach con-
ditions on the marginal structure of each process. Thus, the two approaches
lead to different expected values for K12 (h), to different tests, and to different
interpretation.
Diggle and Chetwynd use the relationships in (3.10) to construct a test
based on the difference of the K-functions
D(h) = Kii (h) − Kjj (h).
They suggest estimating D(h) by plugging in estimates of Kii and Kjj ob-
tained from (3.9) (adjusted so that D(h) is unbiased, see Diggle and Chet-
wynd, 1991). Under the random labeling hypothesis, the expected value of
D(h) is zero for any distance h. Positive values of D(h) suggest spatial clus-
tering of type i events over and above any clustering observed in the type j
events.
Diggle and Chetwynd (1991) derive the variance-covariance structure of
D̂(h) and give an approximate test based on the standard Gaussian distri-
bution. However, Monte Carlo simulation is much easier. To test the random
bution. However, Monte Carlo simulation is much easier. To test the random
labeling hypothesis, we condition on the set of all n1 + n2 locations, draw a
sample of n1 from these (if we want to test for clustering in type i events),
assign the locations not selected to be of type j, and then compute D̂(h)
for each sampling. Under the random labeling hypothesis, the n1 sampled
locations reflect a random “thinning” (see §3.7.1) of the set of all locations.
Also, Monte Carlo simulation enables us to consider different statistics for the
comparison of the patterns, for example D(h) = Lii (h) − Ljj (h).
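A sketch of the random labeling Monte Carlo, with an uncorrected K̂ (adequate for illustration only) and toy marks mimicking the 76/753 split of the lightning data; locations, distance, and number of relabelings are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(10)

def k_hat(pts, h, area=1.0):
    """Naive (edge-uncorrected) estimate of K(h) on a region of given area;
    adequate for a sketch, not for small samples near boundaries."""
    n = len(pts)
    d = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)
    return area * (d <= h).sum() / (n * (n - 1))

def d_stat(pts, is_case, h):
    """D(h) = K_ii(h) - K_jj(h) for the two label groups."""
    return k_hat(pts[is_case], h) - k_hat(pts[~is_case], h)

# toy data: 76 'positive' and 753 'negative' events, as in Figure 3.10
pts = rng.uniform(size=(829, 2))
labels = np.zeros(829, dtype=bool)
labels[:76] = True

h = 0.05
d0 = d_stat(pts, labels, h)
sims = [d_stat(pts, rng.permutation(labels), h)   # relabel, locations fixed
        for _ in range(199)]
lo, hi = np.percentile(sims, [5, 95])
print(f"D(h)={d0:.4f}; random-labeling 5%/95% envelopes: [{lo:.4f}, {hi:.4f}]")
```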
Figure 3.10 Lightning flashes off the coast of Florida, Georgia, and South Carolina.
Empty circles depict flashes with negative charge (nn = 753), closed circles depict
flashes with positive charge (np = 76).
Figure 3.11 displays the L-functions for the two types of events, their differ-
ence, and 5% and 95% simulation envelopes for D(h) = Lp(h) − Ln(h) based on
200 random assignments of labels to events. There is no evidence that lightning
strikes of different polarity differ in their degree of clustering. In carrying out
the simulations, the same bounding shape is assumed for all patterns, that
based on the data for both event types combined.
Figure 3.11 Observed L-functions for flashes with positive and negative charge in
the pattern of Figure 3.10 and their difference, D(h) = Lp(h) − Ln(h). Dotted lines
depict 5 and 95 percentile envelopes from 200 random labelings of polarity.
The study area is Georgia Health Care District 9 (GHCD9) (Figure 3.12).
One of the purposes of the study was to examine geographic risk factors
associated with the risk of having a very low birth weight (VLBW) baby,
one weighing less than 1,500 grams at birth. Cases were identified from all
live-born, singleton infants born between April 1, 1986 and March 30, 1988,
and the locations of the mothers’ addresses are shown in Figure 3.13.
[Figure 3.12: map of Georgia Health Care District 9 showing Statesboro,
Vidalia, Savannah, Fort Stewart, Wilmington Island, Hinesville, Douglas,
Brunswick, Waycross, and St. Simons.]
Notice how the aggregated pattern in the locations of the cases corresponds
to the locations of the cities and towns in Georgia Health Care District 9. A
formal test of CSR will probably not tell us anything we did not already know.
How much of this clustering can be attributed to a geographical pattern in
cases of very low birth weight infants and how much is simply due to clustering
in residences is unclear. To separate out these two confounding issues, we
need to compare the geographic pattern in the cases to that based on a set
of controls that represent the geographic pattern in infants who were born
with normal birth weights. Controls were selected for this study by drawing
a 3% random sample of all live-born infants weighing more than 2,499 grams
at birth. This sampling was constrained so that the controls met the same
residency and time frame requirements as the case subjects. Their geographic
distribution is shown in Figure 3.14. Notice that the locations of both the
cases and the controls appear to be clustered. We can use the controls to
quantify a background geographical variation in infant birth weights and then
assess whether there are differences in the observed spatial pattern for babies
born with very low birth weights. In order to do this, we need a different
null hypothesis than the one provided by CSR. One that is often used is
called the constant risk hypothesis (Waller and Gotway, 2004). Under the
Figure 3.13 Cases of very low birth weight babies in Georgia Health Care District
9 from Rogers et al. (2000). The locations have been randomly relocated to protect
confidentiality.
constant risk model, the probability of being an event is the same, regardless
of location (e.g., each baby born to a mother residing in Georgia Health Care
District 9 has the same risk of being born with a very low birth weight).
Under the constant risk hypothesis, we expect more events in areas with more
individuals. Clusters of cases in high population areas could violate CSR but
would not necessarily violate the constant risk hypothesis. Thus, choosing
the constant risk hypothesis as a null model allows us to refine the question
of interest from “are the cases clustered?” (the answer to which we already
know is probably “yes”) to the question “are the cases more clustered than
we would expect under the constant risk hypothesis?” Answering this latter
question allows adjustment for any patterns that might occur among all the
individuals within a domain of interest.
Figure 3.14 Cases of very low birth weight babies in Georgia Health Care District 9
and Controls from Rogers et al. (2000). The locations have been randomly relocated
to protect confidentiality.
Studying point patterns through λ(s) rather than through E[N (A)] is often
mathematically advantageous because it eliminates the dependency on the
size (and shape) of the area A. In practical applications, when an estimate of
the intensity function is sought, an area context is required.
[Figure 3.15 legend: relative risk classes 0.00–0.20, 0.21–0.28, 0.29–0.32,
0.33–0.40, 0.41–0.52, 0.53–0.75, 0.76–1.10, 1.10–1.75, 1.76–2.80, 2.81–4.65.]
Figure 3.15 Relative risk of very low birth weight babies in Georgia Health Care Dis-
trict 9. Conclusions are not epidemiologically valid since locations and case/control
status were altered to preserve confidentiality.
inherently zero. Second, zero estimates for f2(s) are clearly problematic for
computation of r̂(s), but certainly can occur and do have meaning as part of
the estimation of f2.
An advantage to choosing a kernel with infinite tails such as the bivariate
Gaussian kernel, as we have done here, is that the estimate of f2 (s) is non-zero
for all locations. Third, the choice for the bandwidths is critical and different
choices can have a dramatic effect on the resulting surface. In constructing
the map in Figure 3.15 we experimented with a variety of choices for the
bandwidth and the underlying grid for which the kernel density estimates are
obtained. The combination of the two (bandwidth and grid spacing) reflects
the tradeoff between resolution and stability. A fine grid with a small band-
width will allow map detail, but the resulting estimates may be unstable. A
coarse grid with a large bandwidth will produce more stable estimates, but
much of the spatial variation in the data will be smoothed away. Also, there
are large regions within Georgia Health Care District 9 without controls. This
leads to unstable estimates for certain bandwidths. We began by choosing the
bandwidths according to automatic selection criteria (e.g., cross validation,
Wand and Jones, 1995), but found the results were visually uninteresting; the
resulting map appeared far too smooth. Because of the large gaps where there
are no controls, we took the same bandwidth for the cases as for the controls
and then increased it systematically until we began to lose stability in the
estimates. We may have actually crossed the threshold here: the area with a
high relative risk on the western edge of the domain may be artificially high,
reflecting estimate instability and edge effects, rather than a high relative risk.
This illustrates the importance of careful estimation and interpretation of the
results, particularly if formal inference (e.g., hypothesis tests, see Kelsall and
Diggle, 1995 and Waller and Gotway, 2004) will be conducted using the re-
sulting estimates. However, even with a few potential anomalies, and the odd
contours that result from the kernel, Figure 3.15 does allow us to visualize
the spatial variation in the risk of very low birth weight.
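A minimal sketch of this kernel approach in Python (the coordinates, bandwidth factor, and grid are hypothetical assumptions for illustration, not the study data): the case density f1 and control density f2 are smoothed separately with the same bivariate Gaussian kernel, and the log relative risk is the log ratio of the two estimates.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical planar coordinates (n x 2) for cases and controls.
rng = np.random.default_rng(1)
cases = rng.uniform(0, 100, size=(60, 2))
controls = rng.uniform(0, 100, size=(600, 2))

# Bivariate Gaussian kernel estimates of f1 (cases) and f2 (controls),
# using the same bandwidth factor for both surfaces, as in the text.
bw = 0.3
f1_hat = gaussian_kde(cases.T, bw_method=bw)
f2_hat = gaussian_kde(controls.T, bw_method=bw)

# Evaluate the log relative risk r(s) = log{f1(s)/f2(s)} on a grid.
# The Gaussian kernel has infinite tails, so f2_hat is positive
# everywhere; a small guard is kept only for numerical safety.
gx, gy = np.meshgrid(np.linspace(0, 100, 50), np.linspace(0, 100, 50))
grid = np.vstack([gx.ravel(), gy.ravel()])
eps = 1e-300
log_rr = (np.log(f1_hat(grid) + eps) - np.log(f2_hat(grid) + eps))
log_rr = log_rr.reshape(gx.shape)
```

Varying bw and the grid spacing in this sketch reproduces the resolution versus stability tradeoff discussed above.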
While the K-function can be used to assess clustering in events that arise from
a homogeneous Poisson process, the assumption of stationarity upon which it
is based precludes its use for inhomogeneous Poisson processes. Thus, Cuzick
and Edwards (1990) adapted methods based on nearest neighbor distances
(described in §3.3) for use with inhomogeneous Poisson processes. Instead of
assuming events occur uniformly in the absence of clustering, a group of con-
trols is used to define the baseline distribution and nearest neighbor statistics
are based on whether the nearest neighbor to each case is another case or a
control. The null hypothesis of no clustering is that each event is equally likely
to have been a case or a control, i.e., the random labeling hypothesis.
Let {s1, . . . , sn} denote the locations of all events and assume n1 of these are cases and n2 are controls. Let

$$\delta_i = \begin{cases} 1 & \text{if } \mathbf{s}_i \text{ is a case} \\ 0 & \text{if } \mathbf{s}_i \text{ is a control,} \end{cases}$$

and

$$d_i = \begin{cases} 1 & \text{if the nearest neighbor to } \mathbf{s}_i \text{ is a case} \\ 0 & \text{if the nearest neighbor to } \mathbf{s}_i \text{ is a control.} \end{cases}$$

The test statistic counts, among the q nearest neighbors of cases, the number that are also cases,

$$T_q = \sum_{i=1}^{n} \delta_i d_i,$$

where q is specified by the user and, for q > 1, d_i is extended to count the cases among the q nearest neighbors of s_i. For inference, Cuzick and Edwards (1990) derive an asymptotic test based on the Gaussian distribution. A Monte Carlo test based on the random labeling hypothesis is also applicable.
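A minimal sketch of the statistic and its Monte Carlo random-labeling test (Python; the coordinates, labels, and q are hypothetical, and ties in nearest-neighbor distances are ignored):

```python
import numpy as np
from scipy.spatial import cKDTree

def cuzick_edwards_tq(delta, neighbors):
    """T_q: for each case i, count the cases among its q nearest
    neighbors, and sum over cases."""
    return int(np.sum(delta[:, None] * delta[neighbors]))

# Hypothetical event locations and case/control labels.
rng = np.random.default_rng(2)
n, n1, q = 300, 60, 5
coords = rng.uniform(size=(n, 2))
delta = np.zeros(n, dtype=int)
delta[:n1] = 1

# q nearest neighbors of every event (k = q + 1 because each point
# is returned as its own closest neighbor).
_, idx = cKDTree(coords).query(coords, k=q + 1)
neighbors = idx[:, 1:]

# Random labeling keeps the locations (and hence the neighbor
# structure) fixed and permutes only the labels.
t_obs = cuzick_edwards_tq(delta, neighbors)
nsim = 999
sims = [cuzick_edwards_tq(rng.permutation(delta), neighbors)
        for _ in range(nsim)]
p_value = (1 + sum(s >= t_obs for s in sims)) / (nsim + 1)
```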
Example 3.4 (Low birth weights. Continued) We use Cuzick and Edwards' NN test to assess whether there is clustering in locations of babies born with very low birth weights in Georgia Health Care District 9. This test is not entirely applicable to this situation in that it assumes each event location must be either a case or a control. However, because people live in apartment buildings, there can be multiple cases and/or controls at any location; we cannot usually measure a person's location so specifically.
Table 3.5 Results from Cuzick and Edward’s NN Test Applied to Case/Control Data
in Georgia Health Care District 9. The p-values were obtained from Monte Carlo sim-
ulation. Conclusions are not epidemiologically valid since locations and case/control
status were altered to preserve confidentiality.
q Tq p-value
1 81 0.0170
5 388 0.0910
10 759 0.2190
20 1464 0.3740
The results in Table 3.5 seem to indicate that there is some clustering
among the cases at very local levels. As q is increased, the test statistics are
not significant, indicating that, when considering Georgia Health Care District
9 overall, there is no strong evidence for clustering among locations of babies
born with very low birth weights. Note, however, that Tq2 is correlated with
Tq1 for q1 < q2 since the q2 nearest neighbors include the q1 nearest neighbors.
Ord (1990) suggests using contrasts between statistics (e.g., Tq2 − Tq1 ) since
they exhibit considerably less correlation and can be interpreted as excess
cases between the q1 and the q2 nearest neighbors of cases.
A cluster of people with a rare disease could indicate a common, local environmental contaminant. A cluster of burglarized residences can alert police
to “hot spots” of crime that warrant increased surveillance. Thus, what is
needed in such situations is a test that will: 1) detect clusters; 2) determine
whether they contain a significantly higher or lower number of events than we
would expect; and 3) identify the location and extent of the cluster. There are
several methods for cluster detection (see, e.g., Waller and Gotway, 2004, for
a comprehensive discussion and illustration), but the most popular is the spa-
tial scan statistic developed by Kulldorff and Nagarwalla (1995) and Kulldorff
(1997) and popularized by the SaTScan software (Kulldorff and International
Management Services, Inc., 2003).
Scan statistics use moving windows to compare a value (e.g., a count of
events or a proportion) within the window to the value outside of the window.
Kulldorff (1997) uses circular windows with variable radii ranging from the
smallest inter-event distance to a user-defined upper bound (usually one half
the width of the study area). The spatial scan statistic may be applied to
circles centered at specified grid locations or centered on the set of observed
event locations.
The spatial scan statistic developed by Kulldorff (1997) considers local likelihood ratio statistics that compare the likelihood under the constant risk hypothesis to various alternatives where the proportion of cases within the window is greater than that outside the window. Let C denote the total number of cases and let c be the total number of cases within a window. Under the assumption that the number of cases follows a Poisson distribution, the likelihood function for a given window is proportional to

$$\left( \frac{c}{n} \right)^{c} \left( \frac{C - c}{C - n} \right)^{C - c} I(\cdot),$$

where n is the number of cases expected within the window under the constant risk assumption over the study domain. I(·) denotes the indicator function which, when high proportions are of interest, is equal to 1 when the window has more cases than expected.
The likelihood function is maximized over all windows and the window with
the maximum likelihood function is called “the most likely cluster.” Signifi-
cance is determined by using Monte Carlo simulation. Using random labeling,
cases are randomly assigned to event locations, the likelihood function is com-
puted for each window, and the maximum value of this function is determined.
In this way the distribution of the maximum likelihood function is simulated
(Turnbull, Iwano, Burnett, Howe, and Clark, 1990). As a result, the spatial
scan statistic provides a single p-value for the study area, and avoids the
multiple testing problem that plagues many other approaches.
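A sketch of the scan in Python (hypothetical event locations and labels; circular windows centered on the events, expected counts formed from the event set itself, and no covariate adjustment):

```python
import numpy as np
from scipy.spatial.distance import cdist

def max_llr(coords, case, radii):
    """Maximum Poisson log likelihood ratio over circular windows
    centered at the observed event locations."""
    C = case.sum()                      # total cases
    N = len(case)                       # total events
    D = cdist(coords, coords)           # inter-event distances
    best = 0.0
    for r in radii:
        inside = (D <= r).astype(float)
        c = inside @ case               # cases inside each window
        n = inside.sum(axis=1) * C / N  # expected under constant risk
        with np.errstate(divide="ignore", invalid="ignore"):
            llr = c * np.log(c / n) + (C - c) * np.log((C - c) / (C - n))
        # indicator I(.): keep only windows with more cases than
        # expected; boundary windows (c = 0 or c = C) are skipped.
        llr = np.where((c > n) & np.isfinite(llr), llr, 0.0)
        best = max(best, float(llr.max()))
    return best

rng = np.random.default_rng(3)
coords = rng.uniform(size=(200, 2))     # hypothetical locations
case = np.zeros(200); case[:40] = 1.0   # hypothetical labels
radii = np.linspace(0.05, 0.5, 10)

obs = max_llr(coords, case, radii)
sims = [max_llr(coords, rng.permutation(case), radii) for _ in range(199)]
p_value = (1 + sum(s >= obs for s in sims)) / 200
```

The single p-value for the maximum over all windows is what avoids the multiple testing problem noted above.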
Example 3.4 (Low birth weights. Continued) We use the spatial scan
statistic developed by Kulldorff (1997) and Kulldorff and International Man-
agement Services, Inc. (2003) to find the most likely cluster among locations
of babies born with very low birth weights in Georgia Health Care District 9.
We note that this “most likely” cluster may not be at all “likely,” and thus
we rely on the p-value from Monte Carlo testing to determine its significance.
We assumed a Poisson model and allowed the circle radii to vary from the
smallest inter-event distance to one half of the largest inter-event distance.
The results from the scan give the location and radius of the circular win-
dow that constitutes the most likely cluster and a p-value from Monte Carlo
testing. The results are shown in Figure 3.16.
Figure 3.16 Results from the spatial scan statistic. Conclusions are not epidemio-
logically valid since locations and case/control status were altered to preserve confi-
dentiality.
3.6.1 Extensions
Figure 3.17 Marked point pattern. Distribution of hickories (open circles) and
maples (closed circles) in a forest in Lansing, MI (Gerrard, 1969; Diggle, 1983).
The mark variable is discrete with two levels.
marked point pattern, on the other hand, represents the complete observation
of all event locations. There are no other locations at which the attribute Z
could have been observed. Consequently, the notion of a continuous random
surface expanding over the domain does not arise and predicting the value of
Z at an unobserved location appears objectionable. Why would one want to predict the diameter of trees that do not exist? For planning purposes, for example. In order to do so, one views the marked point process conditionally on the observed locations and treats the event locations as if they were non-stochastic.
Because there are potentially two sources of randomness in a marked point
pattern, the randomness of the mark variable at s given that an event occurred
at that location and the distribution of events according to a stochastic pro-
cess, one can choose to study either conditional on the other, or to study
them jointly. If the mark variable is not stochastic as in §3.1–3.4, one can
still ask questions about the distribution of events. When the mark variable
is stochastic, we are also interested in studying the distributional properties
of Z. In the tree example, we may inquire about the distributional properties of the mark variable, the tree diameters, for example.
Recall that for a univariate point pattern the first- and second-order intensities are defined as

$$\lambda(\mathbf{s}) = \lim_{|d\mathbf{s}| \to 0} \frac{\mathrm{E}[N(d\mathbf{s})]}{|d\mathbf{s}|}, \qquad \lambda_2(\mathbf{s}_i, \mathbf{s}_j) = \lim_{|d\mathbf{s}_i| \to 0,\, |d\mathbf{s}_j| \to 0} \frac{\mathrm{E}[N(d\mathbf{s}_i)N(d\mathbf{s}_j)]}{|d\mathbf{s}_i|\,|d\mathbf{s}_j|}.$$
Figure 3.18 Lansing tree data as a bivariate point pattern. The bivariate pattern is
the collection of the two univariate patterns. Their superposition is a marked process
(Figure 3.17).
where λml,2 is the isotropic cross-pattern intensity (Hanisch and Stoyan, 1979;
Cressie, 1993). For the univariate case, Ripley's edge corrected estimator of the K-function is

$$\hat{K}(h) = \frac{\nu(A)}{n^2} \sum_{i=1}^{n} \sum_{j \neq i} w(\mathbf{s}_i, \mathbf{s}_j)^{-1} I(h_{ij} \leq h).$$

An edge corrected estimator of the cross K-function between the mth and lth pattern is

$$\hat{K}_{ml}(h) = \frac{\nu(A)}{n_m n_l} \sum_{i=1}^{n_m} \sum_{j=1}^{n_l} w(\mathbf{s}_{i_m} - \mathbf{s}_{j_l})^{-1} I(h^{ml}_{ij} \leq h),$$

where $h^{ml}_{ij}$ denotes the distance between the ith point of type m and the jth point of type l.
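A minimal sketch of both estimators in Python, with the simplifying assumption w ≡ 1 (i.e., no edge correction, which biases the estimates low near the boundary):

```python
import numpy as np
from scipy.spatial.distance import pdist, cdist

def k_hat(coords, h, area):
    """K-function estimate with w(si, sj) = 1 (edge correction omitted)."""
    n = len(coords)
    d = pdist(coords)               # the n(n-1)/2 inter-event distances
    # each unordered pair contributes twice to the double sum
    counts = np.array([2.0 * np.count_nonzero(d <= hk) for hk in h])
    return area / n**2 * counts

def k_cross_hat(coords_m, coords_l, h, area):
    """Cross K estimate between patterns m and l (again w = 1)."""
    d = cdist(coords_m, coords_l)
    counts = np.array([float(np.count_nonzero(d <= hk)) for hk in h])
    return area / (len(coords_m) * len(coords_l)) * counts

# Under CSR on the unit square, K(h) is approximately pi*h^2.
rng = np.random.default_rng(4)
pts = rng.uniform(size=(400, 2))
h = np.linspace(0.01, 0.15, 8)
print(np.c_[np.pi * h**2, k_hat(pts, h, area=1.0)])
```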
The homogeneous Poisson process provides the natural starting point for a
statistical investigation of an observed point pattern. Rejection of the CSR
hypothesis does not come as a great surprise in many applications and you are
naturally confronted with the question “What kind of pattern is it?” If the
CSR test suggests a clustered pattern, one may want to compare, for example,
the observed K-function to simulated K-functions from a cluster process.
We can only skim the surface of point process models in this chapter. A large number of models have been developed and described for clustered and regular alternatives; details can be found in, e.g., Diggle (1983), Cressie (1993), Stoyan, Kendall, and Mecke (1995), and Møller and Waagepetersen (2004).
The remainder of this chapter draws on these sources as well as on Appendix
A9.9.11 in Schabenberger and Pierce (2002). The models were chosen for their
representativeness for a particular data-generating mechanism, and because
of their importance in theoretical and applied statistics. When you analyze
an observed spatial point pattern, keep in mind that based on a single re-
alization of the process unambiguous identification of the event-generating
point process model may not be possible. For example, an inhomogeneous
Poisson process and a Cox process (see below) both lead to clustering of events. The mechanisms are entirely different, however. In the case of the IPP, events in non-overlapping regions are independent and clustering arises because the intensity function varies spatially. In the Cox process, clustering occurs because events are dependent; the (average) intensity may be homogeneous. Certain
Poisson cluster processes, where one point process generates parent events
and a second process places offspring events around the locations of the par-
ent events, can be made equivalent to a Poisson process with a randomly
varying intensity.
Processes that are indistinguishable based on a single realization can have generating mechanisms that suggest very different biological and physical interpretations. It behooves the analyst to consider process models whose genesis is congruent with the subject-matter theory. Understanding the genesis of the process models also holds important clues about how to simulate realizations from the model.
process to which thinning is applied. For this reason, one often applies thinning
operations to a homogeneous Poisson process, because its properties are well
understood.
2. If the original process has intensity λ, then the thinned process has intensity
λ∗ (s) = λp(s). For π-thinning of a process with intensity λ, the resulting
intensity is λE[π(s)].
3. If Z(s) is a Poisson process subject to p(s)-thinning, then the thinned
process Z ∗ and the process Z \ Z ∗ of the removed points are independent
Poisson processes with intensities λp(s) and λ(1−p(s)), respectively (Møller
and Waagepetersen, 2003, p. 23).
4. The p-thinning of a stationary process yields a stationary process. p(s)-
thinning does not retain stationarity. The π-thinning of a stationary process
is a stationary process, provided that the random field π(s) is stationary.
5. The K-function of a point process is not affected by p-thinning. The in-
tensity and the expected number of extra events within distance h from an
arbitrary event are reduced by the same factor.
6. The K-function of a π-thinned process can be constructed from the K-function of the original process and the mean and covariance of the π(s) process. If the random field π(s) has mean ξ and covariance function C(||h||) = E[π(0)π(h)] − ξ², then

$$K^*(h) = \frac{1}{\xi^2} \int_0^h \left\{ C(u) + \xi^2 \right\} dK(u).$$

A short simulation sketch of these thinning operations follows the list.
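To make the thinning properties concrete, here is a minimal simulation sketch (Python; the intensity, retention function p(s), and unit-square domain are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)

# Homogeneous Poisson process on the unit square with intensity lam:
# a Poisson number of events placed uniformly.
lam = 500
n = rng.poisson(lam)
events = rng.uniform(size=(n, 2))

# A hypothetical location-dependent retention function p(s); by
# property 2 above, the thinned process has intensity lam * p(s).
def p(s):
    return 0.1 + 0.8 * s[:, 0]     # east-west gradient in retention

keep = rng.uniform(size=n) < p(events)
thinned = events[keep]             # Poisson process, intensity lam*p(s)
removed = events[~keep]            # independent Poisson, lam*(1 - p(s))
print(len(thinned), len(removed))
```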
If the random intensity measure Λ(s) is stationary, so is the Cox process, and its intensity is the mean of the random intensity measure, λ = E[Λ(s)].
(ii∗) The numbers of offspring are realized independently and identically for each parent event.
Problem 3.2 For the homogeneous Poisson process, find unbiased estimators
of λ and λ2 .
Problem 3.3 Data are sampled independently from two respective Gaussian populations with variance σ² = 1.5². Five observations are drawn from each population. The realized values are

Sampled Values
Population yi1 yi2 yi3 yi4 yi5
i=1 9.7 10.2 10.9 8.6 10.3
i=2 8.2 6.1 9.6 8.2 8.9
Problem 3.4 The random intensity function of the Cox process induces spa-
tial dependency between the events. You can think of this mechanism as
creating stochastic dependency through shared random effects. To better un-
derstand and appreciate this important mechanism consider the following two
statistical scenarios
1. A one-way random effects model can be written as Yij = µ + αi + εij, where the εij are iid random variables with zero mean and variance σ²ε and the αi are independent random variables with mean 0 and variance σ²α. Also, Cov[αi, εij] = 0. Determine Cov[Yij, Ykl]. How does "sharing" of random effects relate to the covariance of observations for i = k and i ≠ k?

2. Let Y1|λ, · · · , Yn|λ be a random sample from a Poisson distribution with mean λ. If λ ∼ Gamma(α, β), find the mean and variance of Yi as well as Cov[Yi, Yj].
Problem 3.5 (Stein, 1999, Ch. 1.4) Consider a Poisson process on the line
with intensity λ. Let A denote an interval on the line with length |A|, and
N (A) the number of events in A. Define Z(t) = N ((t − 1, t + 1]). Find the
mean and covariance function of Z(t).
(iii) Show that the K-function is K(h) = πh² + λ⁻¹(1 − exp{−h²/(4σ²)}).
(iv) Explain why the K-function does not depend on the mean number of
offspring per parent (µ).
(v) Under what condition do this process and the homogeneous Poisson
process have the same second-order properties?
CHAPTER 4

Semivariogram Analysis and Estimation
4.1 Introduction
Two important features of a random field are its mean and covariance struc-
ture. The former represents the large-scale changes of Z(s), the latter the
variability due to small- and micro-scale stochastic sources. In Chapter 2, we gave several different representations of the stochastic dependence (second-order structure) between spatial observations: direct and indirect specifications based on model representations (§2.4.1), representations based on convolutions (§2.4.2), and spectral decompositions (§2.5). In the case of a spatial
point process, the second-order structure is represented by the second-order
intensity, and by the K-function in the isotropic case (Chapter 3). If a spatial
random field has model representation Z(s) = µ(s)+e(s), where e(s) ∼ (0, Σ),
the spatial dependence structure is expressed through the variance-covariance
matrix Σ. The semivariogram and covariance function of a spatial process
with fixed, continuous domain were introduced in §2.2, since these parameters
require that certain stationarity conditions be met. The variance-covariance matrix of e(s) is not bound by any stationarity requirements; it simply captures the variances and covariances of the process. In addition, the model representation does not confine e(s) to geostatistical applications; the domain may be a lattice, for example. In practical applications, Σ is unknown and must be estimated from the data. Unstructured variance-covariance matrices that are common in multivariate statistical methods are uncommon in spatial statistics. There is typically structure to the spatial covariances; for example, they may decrease with increasing lag. And without true replications, there is no hope of estimating the entries in an unspecified variance-covariance matrix. Parametric forms are thus assumed so that Σ ≡ Σ(θ) and θ is estimated from the data. The techniques employed to parameterize Σ vary with
circumstances. In a lattice model Σ is defined indirectly by the choice of a
neighborhood matrix and an autoregressive structure. For geostatistical data,
Σ is constructed directly from a model for the continuous spatial autocorre-
lation among observations. The importance of choosing the correct model for
Σ(θ) also depends on the application. Consider a spatial model
Z(s) = Xβ + e(s), e(s) ∼ (0, Σ(θ)),
where the primary interest is inference about β, for example, confidence in-
tervals and hypothesis tests about the mean. When θ is estimated from data,
making use of the fact that Var[Z(s)] = C(0) under second-order stationarity.
Note that the name semivariogram is used both for the function γ(si − sj) and for the graph of γ(h) against h. When working with covariances, C(si − sj)
is the covariance function, the graph of C(h) against h is referred to as the
covariogram. Similarly, a graph of the correlation function R(h) against h
is termed the correlogram. In the spirit of parallel language, we sometimes
will use the term covariogram even if technically the term covariance function
may be more appropriate.
Because of the simple relationship between semivariogram and covariance
function, it seems immaterial which function is used to study the spatial de-
pendence of a process. Since the class of intrinsic stationary processes contains
the class of second-order stationary processes, the semivariogram of a second-
order stationary process can be constructed from the covariance function by
(4.2). If the process is intrinsic but not second-order stationary, the covari-
ance function is not a parameter of the process. Our preference to work with
the semivariogram γ and not the variogram 2γ is partly due to the fact that
γ(si − sj) → C(0) provided C(si − sj) → 0. An unbiased estimate of the semivariogram for lag distances at which data are (practically) uncorrelated is an unbiased estimate of the variance of the process.
In geostatistical applications, it is common to work with the semivariogram rather than the covariance function. Statisticians, on the other hand, are trained to express dependency between random variables in terms of covariances. The reasons for preferring the semivariogram go beyond convenience, training, and the nice interpretation of the semivariogram sill in a second-order stationary process. When
you are estimating the spatial dependence from data, the ambivalence between
covariance function and semivariogram gives way to differences in statistical
properties of the empirical estimators. Details on empirical semivariogram
estimators are given in §4.4. Here we address briefly the issue of bias when
working with (semi-)variograms and with covariances. Let Z(s1 ), · · · , Z(sn )
denote the observations from a spatial process with constant but unknown
mean. Since

$$\gamma(\mathbf{s}_i - \mathbf{s}_j) = \frac{1}{2} \mathrm{E}\left[ \{Z(\mathbf{s}_i) - Z(\mathbf{s}_j)\}^2 \right],$$

a simple, moment-based estimator due to Matheron (1962, 1963) is the Matheron estimator

$$\hat{\gamma}(\mathbf{s}_i - \mathbf{s}_j) = \frac{1}{2|N(\mathbf{s}_i - \mathbf{s}_j)|} \sum_{N(\mathbf{s}_i - \mathbf{s}_j)} \{Z(\mathbf{s}_i) - Z(\mathbf{s}_j)\}^2,$$

which is unbiased for γ(h) if Z(s) is intrinsically stationary. If the mean is estimated from the data, then Ĉ(h) is a biased estimator of the covariance function at lag h.
Furthermore,
$$\hat{\gamma}(\mathbf{h}) = \frac{1}{2|N(\mathbf{h})|} \sum_{N(\mathbf{h})} \left\{ Z(\mathbf{s}_i) - \bar{Z} - \left( Z(\mathbf{s}_j) - \bar{Z} \right) \right\}^2 = \frac{1}{|N(\mathbf{h})|} \sum_{N(\mathbf{h})} \left\{ Z(\mathbf{s}_i) - \bar{Z} \right\}^2 - \hat{C}(\mathbf{h}),$$

but $\hat{C}(\mathbf{0}) = n^{-1} \sum_{i=1}^{n} \{Z(\mathbf{s}_i) - \bar{Z}\}^2$. As a consequence, Ĉ(0) − Ĉ(h) ≠ γ̂(h), and a semivariogram estimate constructed from (4.3) will also be biased. As |N(h)|/n → 1, the bias disappears.
The semivariogram estimator γ̂(h) has other appealing properties. For example, γ̂(0) = 0 = γ(0) and γ̂(h) = γ̂(−h), sharing properties of the semivariogram γ(h). If the data contain a linear large-scale trend,
Z(s) = X(s)β + e(s),
then the spatial dependency in the model errors e(s) is often estimated from
the least squares residuals. Since Var[e(s)] = Σ is unknown—otherwise there
is no need for semivariogram or covariance function estimation—the ordinary
least squares residuals
$$\hat{\mathbf{e}}(\mathbf{s}) = \left[ \mathbf{I} - \mathbf{X}(\mathbf{s}) \left( \mathbf{X}(\mathbf{s})' \mathbf{X}(\mathbf{s}) \right)^{-1} \mathbf{X}(\mathbf{s})' \right] \mathbf{Z}(\mathbf{s})$$
are often used. Although the semivariogram estimated from ê(s) is a biased estimator for the semivariogram of e(s), this bias is less than the bias of the covariance function estimator based on ê(s) for C(h) (Cressie and Grondona, 1992; Cressie, 1993, p. 71 and §3.4.3).
The advantages of the classical semivariogram estimator over the covariance
function estimator (4.3) stem from the fact that the unknown—but constant—
mean is not important for estimation of γ(h). The semivariogram filters the
mean. This must not be interpreted as robustness of variography to an arbitrary mean. First, the semivariogram is a parameter of a spatial process only under intrinsic or second-order stationarity, both of which require E[Z(s)] = µ. Second, the semivariogram reacts rather poorly to changes in the mean with spatial locations. Let Z(s) = µ(s) + e(s), where E[e(s)] = 0 and γe(h) = γz(h). It is easy to show (Chapter problem 4.1) that

$$\mathrm{E}[\hat{\gamma}_z(\mathbf{h})] = \gamma_e(\mathbf{h}) + \frac{1}{2|N(\mathbf{h})|} \sum_{N(\mathbf{h})} \{\mu(\mathbf{s}_i) - \mu(\mathbf{s}_j)\}^2. \qquad (4.4)$$
The behavior of the covariance function near the origin and its differentiability
were studied in §2.3 to learn about the continuity and smoothness of a second-
order stationary random field. Recall that a mean square continuous random
field must be continuous everywhere, and that a random field cannot be mean
square continuous unless it is continuous at the origin. Hence, C(h) → C(0)
as h → 0 which implies that γ(h) → 0 as h → 0. Furthermore, we must have
γ(0) = 0, of course. Mean square continuity of a random field implies that
the semivariogram is continuous at the origin. The notion of smoothness of
a random field was then brought into focus in §2.3 by studying the partial
derivatives of the process. The more often a random field is mean square
differentiable, the higher its degree of smoothness.
The semivariogram is not only a device to derive the spatial dependency
structure in a random field and to build the variance-covariance matrix of
Z(s), which is needed for model-based statistical inferences. It is a structural
tool which in itself conveys much information about the behavior of a random
field. For example, semivariograms that increase slowly from the origin and/or
exhibit quadratic behavior near the origin, imply processes more smooth than
those whose semivariogram behaves linear near the origin.
For a second-order stationary random field, the (isotropic) semivariogram
γ(||h||) ≡ γ(h) has a very typical form (Figure 1.12, page 29). It rises from
the origin and if C(h) decreases monotonically with increasing h, then γ(h)
will approach Var[Z(s)] = σ² either asymptotically or exactly at a particular lag h∗. The asymptote itself is termed the sill of the semivariogram and the lag h∗ at which the sill is reached is called its range. Observations Z(si) and Z(sj) for which ||si − sj|| ≥ h∗ are uncorrelated. If the semivariogram reaches the sill asymptotically, the practical range is defined as the lag h∗ at which γ(h) = 0.95 × σ². Semivariograms that do not reach a sill occur frequently.
This could be due to
• Non-stationarity of the process, e.g., the mean of Z(s) is not constant across
the domain;
• An intrinsically stationary process. The intrinsic hypothesis states that a variogram must satisfy

$$\frac{\gamma(\mathbf{h})}{||\mathbf{h}||^2} \to 0 \quad \text{as } ||\mathbf{h}|| \to \infty,$$

so the semivariogram may grow without bound, provided it grows more slowly than ||h||².
• The process is second-order stationary, but the largest lag for which the
semivariogram can be estimated is shorter than the range of the process.
The lag distance at which the semivariogram would flatten has not been
observed.
In practice, empirical semivariograms γ̂(h) calculated from a set of data often suggest that the semivariogram does not pass through the origin. This intercept of the semivariogram has been termed the nugget effect c0, with c0 = lim_{h→0} γ(h) ≠ 0. If the random field under study is mean square continuous, such a discontinuity at the origin must not exist. Following Matérn (1986, Ch. 2.2), define Qd as the class of all functions that are valid covariance functions in Rd, Q′d as the subclass of functions which are continuous everywhere except possibly at the origin, and Q′′d as the subclass of covariance functions continuous everywhere. Matérn shows that if C(h) ∈ Q′d it can be written as C(h) = aC0(h) + bC1(h), where a, b ≥ 0, C1(h) ∈ Q′′d, and

$$C_0(\mathbf{h}) = \begin{cases} 1 & \mathbf{h} = \mathbf{0} \\ 0 & \text{otherwise.} \end{cases} \qquad (4.5)$$
It follows that if Z(s) has a covariance function in Q′d, it can be decomposed as Z(s) = U(s) + ν(s), where U(s) is a process with covariance function in Q′′d and ν(s) has covariance function (4.5). Matérn (1986, p. 12) calls U(s) the continuous component and ν(s) the chaotic component in the decomposition. The variance of the latter component is the nugget effect of the semivariogram. The chaotic component is not necessarily completely spatially unstructured; it can be further decomposed. Recall the decomposition Z(s) = µ(s) + W(s) + η(s) + ε(s) from §2.4, where W(s) depicts smooth-scale spatial variation, η(s) micro-scale variation, and ε(s) is pure measurement error. The micro-scale process η(s) is a stationary spatial process whose semivariogram has sill Var[η(s)] = σ²η. It represents spatial structure but cannot be observed unless data points are collected at lag distances smaller than the range of the η(s) process. The measurement error component has variance Var[ε(s)] = σ²ε and the nugget effect of a semivariogram is

c0 = σ²η + σ²ε.
The name was coined by Matheron (1962) in reference to small nuggets of
ore distributed throughout a larger body of rock. The small nuggets consti-
tute a microscale process with spatial structure. Matheron’s definition thus
appeals to the micro-scale process and Matérn’s definition to the measure-
ment error process. In practice, η(s) and ε(s) cannot be distinguished unless
there are replicate observations at the same spatial locations. The modeler
who encounters a nugget effect in a semivariogram thus needs to determine
on non-statistical grounds whether the effect is due to micro-scale variation
or measurement error. The choice matters for deriving best spatial predictors
and measures of their precision (§5). Software packages are not consistent in
this regard.
In the presence of a nugget effect, the variance of a second-order stationary process is Var[Z(s)] = c0 + σ²0, where σ²0 is the partial sill. Since the nugget reduces the smoothness of the process, a common measure for the degree of spatial structure is the relative structured variability

$$\mathrm{RSV} = \left( \frac{\sigma_0^2}{\sigma_0^2 + c_0} \right) \times 100\%. \qquad (4.6)$$
This is a rather crude measure for the degree of structure (or smoothness) of a
random field. Besides the relative magnitude of the discontinuity at the origin
of the semivariogram it does not incorporate other features of the process that
represents the continuous component, e.g., its mean square differentiability.
The range of the semivariogram is often considered an important parameter.
In ecological applications it has been related to the size of patches that form af-
ter human intervention. It is not clear why the distance at which observations
are no longer spatially correlated should be equal to the diameter of patches.
Consider observations Z(s1), Z(s2), and Z(s3). If ||s1 − s2|| < h∗, where h∗ is the range, but ||s1 − s3|| > h∗ and ||s2 − s3|| < h∗, then Cov[Z(s1), Z(s2)] ≠ 0, Cov[Z(s1), Z(s3)] = 0, but Z(s3) and Z(s2) are correlated. In spatial prediction Z(s3) can impact Z(s1) through its correlation with Z(s2). Chilès and Delfiner (1999, p. 205) call this the relay effect of spatial autocorrelation.
The relative structured variability measures that component of spatial conti-
nuity that is reflected by the nugget effect. Other measures, which incorporate
the shape of the semivariogram (and the range) have been proposed. Russo
and Bresler (1981) and Russo and Jury (1987) consider integral scales. If
R(h) = C(h)/C(0) is the autocorrelation function of an isotropic process,
then the integral scales for processes in R1 and R2 are

$$I_1 = \int_0^\infty R(h)\, dh, \qquad I_2 = \left\{ 2 \int_0^\infty R(h)\, h\, dh \right\}^{1/2}.$$
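For example, for a process with exponential correlation function R(h) = exp{−θh}, the integral scales evaluate in closed form:

$$I_1 = \int_0^\infty e^{-\theta h}\, dh = \frac{1}{\theta}, \qquad I_2 = \left\{ 2 \int_0^\infty e^{-\theta h}\, h\, dh \right\}^{1/2} = \left\{ \frac{2}{\theta^2} \right\}^{1/2} = \frac{\sqrt{2}}{\theta}.$$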
In §4.3.1–4.3.5 we consider isotropic models for the covariance function and the
semivariogram of a spatial process (accommodating anisotropy is discussed in
§4.3.7). We start from models for covariance functions because valid semivar-
iograms for second-order stationary processes can be constructed from valid
covariance functions. For example, if C(h) is the covariance function of an isotropic process with variance σ² and no nugget effect, then

$$\gamma(h) = \begin{cases} 0 & h = 0 \\ \sigma^2 \left( 1 - C(h)/\sigma^2 \right) & h > 0. \end{cases}$$
Not every mathematical function can serve as a model for the spatial de-
pendency in a random field, however. Let C(h) be the isotropic covariance
function of a second-order stationary field and γ(h) the isotropic semivari-
ogram of a second-order or intrinsically stationary field. Then the following
hold:
• If C(h) is valid in Rd , then it is also valid in Rs , s < d (Matérn, 1986, Ch.
2.3). If γ(h) is valid in Rd , it is also valid in Rs , s < d.
• If C1 (h) and C2 (h) are valid covariance functions, then aC1 (h) + bC2 (h),
a, b ≥ 0, is a valid covariance function.
• If γ1 (h) and γ2 (h) are valid semivariograms, then aγ1 (h) + bγ2 (h), a, b ≥ 0,
is a valid semivariogram.
• A valid covariance function C(h) is a positive-definite function, that is,

$$\sum_{i=1}^{k} \sum_{j=1}^{k} a_i a_j C(\mathbf{s}_i - \mathbf{s}_j) \geq 0,$$

for any set of real numbers a1, · · · , ak and sites. By Bochner's theorem this implies that C(h) has spectral representation (§2.5)

$$C(\mathbf{h}) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \exp\{i \boldsymbol{\omega}' \mathbf{h}\}\, dS(\boldsymbol{\omega}).$$

H is related to the spectral distribution function of the process through $H(u) = \int_{||\boldsymbol{\omega}|| < u} dS(\boldsymbol{\omega})$ (Stein, 1999, p. 43).
We call Ωd the basis function of the covariance model in Rd. For processes in Rd, d ≤ 3, Ω1(t) = cos(t), Ω2(t) = J0(t), Ω3(t) = sin(t)/t. Also, for d → ∞, Ωd(t) → exp{−t²} (Figure 4.1).
Figure 4.1 Basis functions for processes in R2 , R3 , and R∞ . With increasing di-
mension, the basis functions have fewer sign changes. Covariance functions are non-
increasing unless the basis function permits at least one sign change. A process in
R∞ does not permit negative autocorrelation at any lag distance.
grows more slowly than ||h||². This is often referred to as the intrinsic hypothesis.
and substitution into (4.9) yields (recall that Γ(1/2) = √π) the exponential model

$$C(h) = \sigma^2 \exp\{-\theta h\} = \sigma^2 \exp\left\{ -\frac{3h}{\alpha} \right\}. \qquad (4.11)$$
The second parameterization is again common in geostatistical applications
where α denotes the practical range. The exponential model is the continuous-
time analog of the first-order autoregressive time series covariance structure.
It enjoys popularity not only in spatial applications, but also in modeling
longitudinal and repeated measures data (see, e.g., Jones, 1993; Schabenberger
and Pierce, 2002, Ch. 7). The Whittle model, the case ν = 1,

$$C(h) = \sigma^2 \theta h K_1(\theta h), \qquad (4.12)$$
was suggested by Whittle (1954). He considered the exponential model as the
“elementary” covariance function in R1 and (4.12) as the “elementary” model
in R2 . A process Z(t) in R1 with exponential correlation can be represented
by the stochastic differential equation

$$\left( \frac{d}{dt} + \theta \right) Z(t) = \epsilon(t),$$
where ε(t) is a white noise process. Whittle (1954) and Jones and Zhang (1997) consider this the elementary stochastic differential equation in R1. In R2, with coordinates x and y, Whittle awards this distinction to the stochastic Laplace equation

$$\left( \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} - \theta^2 \right) Z(x, y) = \epsilon(x, y).$$

A process represented by this equation has correlation function

$$R(h) = \theta h K_1(\theta h),$$
a Whittle model. Whittle (1954) concludes that “the exponential function has
no divine right in two dimensions” and calls processes in R2 with exponen-
tial covariance function “artificial”; finding it “difficult to visualize a physical
mechanism” that has covariance function (4.11).
We strongly feel that the exponential model has earned its place among the
isotropic covariance models for modeling spatial data. In fitting these models
to data, the exponential model has a definite advantage over Whittle’s model.
It does not require evaluation of infinite series (§4.9.2).
If there is an “artificial” model for the spatial dependence, it is the gaus-
sian model (4.10). Because it is the limiting model for ν → ∞ it is infinitely
differentiable. Physical and biological processes with this type of smoothness
are truly artificial. The name is unfortunately somewhat misleading. Covari-
ance model (4.10) is called the “gaussian” model because of the functional
similarity of the spectral density of a process with that covariance function to
the Gaussian probability density function (§4.7.2). It does not command the
same respect as the Gaussian distribution. We choose lowercase notation to
distinguish the covariance model (4.10) from the Gaussian distribution.
• Spherical Model, d = 3:

$$R_1(h) = \begin{cases} 1 - \dfrac{3h}{2\alpha} + \dfrac{1}{2}\left(\dfrac{h}{\alpha}\right)^3 & h \leq \alpha \\ 0 & \text{otherwise.} \end{cases} \qquad (4.13)$$
The reasons for the popularity of covariance models that vanish beyond a finite range are a mystery to Stein (1999, p. 52), who argues that perhaps “there is a mistaken belief that there is some statistical advantage in having the autocorrelation function being exactly zero beyond some finite distance.”
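The isotropic models discussed in this section are easy to evaluate numerically. A minimal sketch (Python; the gaussian parameterization shown is one common form, an assumption since (4.10) is parameterization-dependent):

```python
import numpy as np
from scipy.special import kv   # modified Bessel function K_nu

def exponential(h, theta):
    """Exponential correlation model, as in (4.11): exp(-theta*h)."""
    return np.exp(-theta * np.asarray(h, dtype=float))

def gaussian(h, theta):
    """One common parameterization of the gaussian model (4.10)."""
    return np.exp(-theta * np.asarray(h, dtype=float) ** 2)

def whittle(h, theta):
    """Whittle model (4.12): theta*h*K_1(theta*h), with limit 1 at h=0."""
    h = np.asarray(h, dtype=float)
    out = np.ones_like(h)
    pos = h > 0
    out[pos] = theta * h[pos] * kv(1, theta * h[pos])
    return out

def spherical(h, alpha):
    """Spherical model (4.13) with range alpha."""
    r = np.minimum(np.asarray(h, dtype=float) / alpha, 1.0)
    return 1.0 - 1.5 * r + 0.5 * r**3

# A semivariogram for a second-order stationary process with variance
# sigma2 and no nugget follows as gamma(h) = sigma2 * (1 - R(h)).
h = np.linspace(0.0, 3.0, 7)
print(2.0 * (1.0 - exponential(h, theta=1.0)))
```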
The second-order stationary models discussed so far permit only positive autocorrelation; the semivariogram is a non-decreasing function (the covariance function is non-increasing).
[Figure: spherical, circular, and tent semivariograms plotted against lag distance.]
The “practical” range for this model is defined as the lag distance at which the first peak is no greater than 1.05σ² or the first valley is no less than 0.95σ². It is approximately 6.5 × πα (Figure 4.4b).
The two basic isotropic models for processes that are not second-order stationary are the linear and the power model. The former is a special case of the latter. The power model is given in terms of the semivariogram as

$$\gamma(h) = \theta h^\lambda. \qquad (4.21)$$
Figure 4.4 Power semivariogram (a) and hole (cardinal-sine) models (b).
Figure 4.5 Contours of iso-correlation (0.8, 0.5, 0.3, 0.1) for two processes with
exponential correlation function. The isotropic model (a) has spherical correlation
contours. The elliptic contours in panel b) correspond to a 45 degree rotation and a
ratio of λ = 0.5 between the two major axes. The axes depict lag distances in the
(x, y) (a) and (x∗ , y ∗ ) coordinate systems (b).
Chilès and Delfiner (1999, p. 96) warn about zonal models that partition the
coordinates, because certain linear combinations can have zero variance. In R2
let Z(s) = Z1 (x) + Z2 (y). If the components Z1 (x) and Z2 (y) are orthogonal,
then, by §4.3.6, γz(h) = γ1(hx) + γ2(hy). Let hu = [u, 0]′ and hv = [0, v]′ be two vectors shifting the coordinates. Then

$$Z(\mathbf{s}) - Z(\mathbf{s} + \mathbf{h}_u) - Z(\mathbf{s} + \mathbf{h}_v) + Z(\mathbf{s} + \mathbf{h}_u + \mathbf{h}_v) = 0,$$

because the shifts in each coordinate cancel component by component: this linear combination of observations has zero variance.
The set N(h) consists of location pairs (si, sj) such that si − sj = h, and |N(h)| denotes the number of distinct pairs in N(h). When data are sparse or irregularly spaced, the number of distinct pairs in N(h) may not be sufficient to obtain a stable estimate at lag h. Typical recommendations are that at least 30 (better 50) pairs of locations should be available at each lag. If the number of pairs is smaller, lags are grouped into lag classes so that γ̂(h) is the average squared difference of site pairs that satisfy si − sj = h ± ε. The choice of the tolerance ε is left to the user. A graph of γ̂(h) against ||h|| is called the Matheron semivariogram or the empirical semivariogram.
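A minimal sketch of the Matheron estimator with distance-based lag classes (Python; the data, class width, and number of classes are hypothetical assumptions):

```python
import numpy as np
from scipy.spatial.distance import pdist

def matheron(coords, z, bin_edges):
    """Matheron (classical) estimator over distance-based lag classes."""
    d = pdist(coords)                               # pairwise distances
    sq = pdist(z[:, None], metric="sqeuclidean")    # (Z(si) - Z(sj))^2
    gamma = np.full(len(bin_edges) - 1, np.nan)
    counts = np.zeros(len(bin_edges) - 1, dtype=int)
    for m, (lo, hi) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
        in_class = (d > lo) & (d <= hi)
        counts[m] = in_class.sum()    # |N(h)|; want at least 30-50 pairs
        if counts[m] > 0:
            gamma[m] = sq[in_class].mean() / 2.0
    return gamma, counts

# Hypothetical irregularly spaced data and 6-unit lag classes.
rng = np.random.default_rng(6)
coords = rng.uniform(0, 200, size=(195, 2))
z = rng.normal(size=195)
gamma_hat, npairs = matheron(coords, z, np.arange(0.0, 120.0, 6.0))
```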
Among the appealing properties of the Matheron estimator, which are partly responsible for its widespread use, are simple computation, unbiasedness, evenness, and attaining zero at zero lag: E[γ̂(h)] = γ(h), γ̂(h) = γ̂(−h), and γ̂(0) = 0. It is difficult in general to determine distributional properties and
moments of semivariogram estimators without further assumptions. The es-
timators at two different lag values are usually correlated because (i) obser-
vations at that lag class are spatially correlated, and (ii) the same points are
used in estimating the semivariogram at the two lags. Because the Matheron
estimator is based on squared differences, more progress has been made in
establishing (approximate) moments and distributions than for some of its
competitors.
Consider Z(s) to be a Gaussian random field, so that {2γ(h)}⁻¹{Z(s) − Z(s + h)}² ∼ χ²₁ and

$$\mathrm{Var}\left[ \{Z(\mathbf{s}) - Z(\mathbf{s} + \mathbf{h})\}^2 \right] = 2 \times \{2\gamma(\mathbf{h})\}^2.$$

Cressie (1985) shows that the variance of (4.24) at lag hi can be approximated as

$$\mathrm{Var}[\hat{\gamma}(\mathbf{h}_i)] \approx 2 \frac{\gamma(\mathbf{h}_i)^2}{|N(\mathbf{h}_i)|}. \qquad (4.25)$$
The approximation ignores the correlations between Z(si )−Z(sj ) and Z(sk )−
Z(sl ) (see §4.5.1). If it holds, consistency of the Matheron estimator is easily
ascertained from (4.25), since γ̂(h) is unbiased. The expression (4.25) also tells us what to expect for large lag values. In practice, empirical semivariograms are common that appear ill-behaved and erratic for large lags. Since the semivariogram γ(hi) of a second-order stationary process rises until it reaches the sill, the numerator of (4.25) increases sharply in hi as long as the semivariogram has not reached the sill. Even then, the variance of the Matheron estimator does not remain constant. The number of pairs from which γ̂(h) can be computed decreases sharply with h.
One could, however, group the pairs into lag classes such that each class
contains a number of observations that makes the empirical semivariogram
values homoscedastic under a particular semivariogram model. This leads to
a concentration of lag classes at small lags and sparsity at large lags. The
overall shape of the semivariogram may be difficult to determine. Since one
would have to know the form and parameters of the true semivariogram, this
is an impractical proposition in any case. But even choosing lag classes that
have the same number of points may inappropriately group the lag distances.
Example 4.2 C/N ratios. Figure 4.6 displays the 195 locations on an agri-
cultural field at which the total soil carbon and total soil nitrogen percentages
were measured. These data were kindly provided by Dr. Thomas G. Mueller,
Department of Agronomy, University of Kentucky, and represent a subset of
the data used in Chapter 9 of Schabenberger and Pierce (2002). The field had
been in no-tillage management for more than ten years when strips of the
field were chisel-plowed. The data considered here correspond to these plowed
parts of the field.
The data are geostatistical and irregularly spaced. The Euclidean distances
between observations range between 5 and 565.8 feet. In order to obtain lag
classes with at least 50 observations per class, we decided on 35 lag classes of
width 6 feet. The resulting Matheron estimator of the empirical semivariogram
is shown in Figure 4.7.
It is customary not to compute the empirical semivariogram up to the
largest possible lag class. The number of available pairs shrinks quickly for
larger lags and the variability of the empirical semivariogram increases. A
common recommendation is to compute the empirical semivariogram up to
about one half of the maximum separation distance in the data, although this
is only a general guideline. It is important to extend the empirical semivar-
iogram far enough so that the important features of the spatial dependency
structure can be discerned but not so far as to hinder model selection and
interpretation due to lack of reliability. The empirical semivariogram of the
C/N ratios appears quite “well-behaved.” It rises from what appears to be
the origin up to a distance of 100 feet and has a sill between 0.25 and 0.30. A
spherical or exponential model may fit this empirical semivariogram well. We
will return to these data throughout the chapter.
The question of possible anisotropy can be investigated by computing the empirical semivariogram surface or by constructing directional empirical semivariograms. To compute the semivariogram surface you divide the domain into non-overlapping regions of equal size, typically rectangles or squares. If δx and δy are the dimensions of the rectangles in the two main directions, we compute a point on the surface at location h by averaging the pairs whose separation vector falls into the δx × δy rectangle centered at h.
Figure 4.7 Empirical semivariogram based on the Matheron (classical) estimator for
C/N ratios. Numbers across the top denote the number of pairs within the lag class.
To see the impact of Z([3, 4]) on the computation of the Matheron estimator, notice that there are two pairs for each of five lag classes. The respective estimates are

$$\hat{\gamma}(\sqrt{2}) = \frac{1}{4} \left\{ (1 - 2)^2 + (2 - 3)^2 \right\} = 1/2$$
$$\hat{\gamma}(2) = \frac{1}{4} \left\{ (1 - 3)^2 + (4 - 20)^2 \right\} = 65$$
$$\hat{\gamma}(\sqrt{5}) = \frac{1}{4} \left\{ (4 - 2)^2 + (20 - 2)^2 \right\} = 82$$
$$\hat{\gamma}(3) = \frac{1}{4} \left\{ (4 - 1)^2 + (20 - 3)^2 \right\} = 74.5$$
$$\hat{\gamma}(\sqrt{13}) = \frac{1}{4} \left\{ (3 - 4)^2 + (20 - 1)^2 \right\} = 90.5$$
When the outlying observation Z([3, 4]) = 20 is removed from the data, the estimates are γ̂(2) = 2, γ̂(√5) = 2, γ̂(3) = 4.5, and γ̂(√13) = 1/2.
Cressie and Hawkins (1980) suggested an estimator that alleviates the nega-
tive impact of outlying observations by eliminating squared differences from
the calculation. It is often referred to as the robust semivariogram estimator;
we refer to it as the Cressie-Hawkins (CH) estimator. Its genesis is as follows.
In a Gaussian random field all bivariate distributions of [Z(si), Z(sj)] are Gaussian and

$$\frac{Z(\mathbf{s}_i) - Z(\mathbf{s}_j)}{\sqrt{2\gamma(\mathbf{s}_i - \mathbf{s}_j)}} \sim G(0, 1), \qquad \frac{\{Z(\mathbf{s}_i) - Z(\mathbf{s}_j)\}^2}{2\gamma(\mathbf{s}_i - \mathbf{s}_j)} \sim \chi^2_1.$$
Cressie and Hawkins (1980) note that the fourth root transformation of {Z(si) − Z(sj)}² yields an approximately Gaussian random variable with mean

$$\mathrm{E}\left[ |Z(\mathbf{s}_i) - Z(\mathbf{s}_j)|^{1/2} \right] \approx 2^{1/2} \pi^{-1/2} \Gamma(0.75) \times \gamma(\mathbf{s}_i - \mathbf{s}_j)^{1/4}.$$

Furthermore, the expected value of the fourth power of

$$\frac{1}{|N(\mathbf{h})|} \sum_{N(\mathbf{h})} |Z(\mathbf{s}_i) - Z(\mathbf{s}_j)|^{1/2}$$

is approximately 2γ(h) times the correction factor 0.457 + 0.494/|N(h)| + 0.045/|N(h)|².
The term 0.045/|N(h)|² contributes very little to the bias correction, particularly if |N(h)| is large. The (robust) Cressie-Hawkins semivariogram estimator is finally given by

$$\bar{\gamma}(\mathbf{h}) = \frac{1}{2} \left\{ \frac{1}{|N(\mathbf{h})|} \sum_{N(\mathbf{h})} |Z(\mathbf{s}_i) - Z(\mathbf{s}_j)|^{1/2} \right\}^4 \bigg/ \left( 0.457 + \frac{0.494}{|N(\mathbf{h})|} \right). \qquad (4.26)$$
Because the square root differences are averaged first and the resulting
average is then raised to the fourth power, the first term in (4.26) is much
less affected by extreme values than the average of the squared differences in
the Matheron estimator. The robust estimator is not unbiased, but the term
in the denominator serves to achieve approximate unbiasedness.
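Under the same lag-class setup as the Matheron sketch above, a minimal implementation of (4.26) replaces the average of squared differences with the fourth power of averaged square-root differences (the function and variable names are hypothetical):

```python
import numpy as np
from scipy.spatial.distance import pdist

def cressie_hawkins(coords, z, bin_edges):
    """Cressie-Hawkins (robust) estimator, eq. (4.26), per lag class."""
    d = pdist(coords)
    root = pdist(z[:, None]) ** 0.5    # |Z(si) - Z(sj)|^(1/2)
    gamma = np.full(len(bin_edges) - 1, np.nan)
    for m, (lo, hi) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
        in_class = (d > lo) & (d <= hi)
        npairs = in_class.sum()
        if npairs > 0:
            avg4 = root[in_class].mean() ** 4  # fourth power of average
            gamma[m] = 0.5 * avg4 / (0.457 + 0.494 / npairs)
    return gamma
```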
The attribute robust of the CH estimator refers to small amounts of contamination in a Gaussian process. It is under this premise that Hawkins and Cressie (1984) investigated the robustness of (4.26): white noise ε(s) was added to an intrinsically stationary process such that ε(s) is G(0, σ²0) with probability 1 − p and G(0, kσ²0) with probability p. To be more specific, let

Z(s) = µ + S(s) + ε(s),

where S(s) is a Gaussian random field with semivariogram γS(h). For some value h it is assumed that γS(h) = mσ²0. One could thus think of S(s) as second-order stationary with sill mσ²0. The particular model investigated was

$$\epsilon(\mathbf{s}) \sim \begin{cases} G(0, \sigma_0^2) & \text{with probability } 0.95 \\ G(0, 9\sigma_0^2) & \text{with probability } 0.05. \end{cases}$$
In Hawkins and Cressie (1984) and Cressie (1993, p. 82, Table 2.2) it is seen that γ̄(h) is less biased than γ̂(h) if the relative nugget effect is small. Similarly, if the nugget σ²0 is small relative to the semivariogram of the intrinsically stationary process, then the variability of γ̄(h) is less than that of the Matheron estimator. The CH estimator will typically show less variation at small lags and also result in generally smaller values than (4.24).
However, at m = 1 the variability of γ̂(h) and γ̄(h) are approximately the same, and the robust estimator is more variable for m > 1. In that case
the contamination of the data plays a minor role compared to the stochastic
variation in S(s). As Hawkins and Cressie (1984) put it: “The loss of efficiency
as m → ∞ may be thought of as a premium paid by the robust estimators on
normal data to insure against the effects of possible outliers.”
As shown by Hawkins (1981), the |Z(si) − Z(sj)|^{1/2} are less correlated than the squared differences {Z(si) − Z(sj)}². This is a reason to prefer the CH
estimator over the Matheron estimator when fitting a semivariogram model by
weighted (instead of generalized) least squares to the empirical semivariogram
(see §4.5).
Example 4.3 (Four point semivariogram. Continued) For the lag distances in this simple example the estimates according to equation (4.26) are

$$\bar{\gamma}(\sqrt{2}) = \frac{1}{2} \left\{ \frac{1}{2} \left( \sqrt{|1-2|} + \sqrt{|2-3|} \right) \right\}^4 \Big/ 0.704 = 0.71$$
$$\bar{\gamma}(2) = \frac{1}{2} \left\{ \frac{1}{2} \left( \sqrt{|1-3|} + \sqrt{|4-20|} \right) \right\}^4 \Big/ 0.704 = 38.14$$
$$\bar{\gamma}(\sqrt{5}) = \frac{1}{2} \left\{ \frac{1}{2} \left( \sqrt{|4-2|} + \sqrt{|20-2|} \right) \right\}^4 \Big/ 0.704 = 45.5$$
$$\bar{\gamma}(3) = \frac{1}{2} \left\{ \frac{1}{2} \left( \sqrt{|4-1|} + \sqrt{|20-3|} \right) \right\}^4 \Big/ 0.704 = 52.2$$
$$\bar{\gamma}(\sqrt{13}) = \frac{1}{2} \left\{ \frac{1}{2} \left( \sqrt{|3-4|} + \sqrt{|20-1|} \right) \right\}^4 \Big/ 0.704 = 36.6$$

Here 0.704 = 0.457 + 0.494/2, since each lag class contains |N(h)| = 2 pairs.
where medianᵢ(xᵢ) denotes the median of the xᵢ. The factor b is chosen to yield approximate unbiasedness and consistency. If x1, · · · , xn are independent realizations from a G(µ, σ²), for example, the MAD will be consistent for σ for b = 1.4826.

Rousseeuw and Croux (1993) suggested a robust estimator of scale which also has a 50% breakdown point but a smooth influence function. Their Qn estimator is given by the kth order statistic of the n(n − 1)/2 inter-point distances. Let h = ⌊n/2⌋ + 1 and $k = \binom{h}{2}$. Then

$$Q_n = c \times \left\{ |x_i - x_j|;\ i < j \right\}_{(k)},$$

the kth order statistic of the pairwise distances, scaled for consistency. For Gaussian data, the multiplicative factor that gives consistency for the standard deviation is c = 2.2191. The Qn estimator has positive small-sample bias (see Table 1 for n ≤ 40 in their paper) which can be corrected (Croux and Rousseeuw, 1992).
Genton (1998a, 2001) considers the modification that leads from (4.24) to (4.26) not sufficient to impart robustness and develops a robust estimator of the semivariogram based on Qn. If spatial data Z(s1), · · · , Z(sn) are observed, let N(h) denote the set of pairwise differences Ti = Z(si) − Z(si + h), i = 1, · · · , n(n − 1)/2. Next, calculate Q|N(h)| for the Ti and return as the Genton semivariogram estimator at lag h

$$\tilde{\gamma}(\mathbf{h}) = \frac{1}{2} Q^2_{|N(\mathbf{h})|}. \qquad (4.29)$$

Since Qn has a 50% breakdown point, γ̃(h) has a 50% breakdown point in terms of the process of differences Ti, but not necessarily in terms of the Z(si). Genton (2001) establishes through simulation that (4.29) will be resistant to roughly 30% of outliers among the Z(si).
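A direct sketch of Qn and the Genton estimator at a single lag (Python; this naive version forms all pairwise differences, so memory grows as O(n²)):

```python
import numpy as np

def qn(x, c=2.2191):
    """Rousseeuw-Croux Qn: c times the kth order statistic of the
    n(n-1)/2 pairwise distances |x_i - x_j|, i < j."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    iu = np.triu_indices(n, k=1)
    diffs = np.abs(x[:, None] - x[None, :])[iu]
    h = n // 2 + 1
    k = h * (h - 1) // 2
    return c * np.sort(diffs)[k - 1]   # kth smallest pairwise distance

def genton(lag_differences):
    """Genton estimator (4.29) from the differences
    T_i = Z(s_i) - Z(s_i + h) collected at lag h."""
    return 0.5 * qn(lag_differences) ** 2
```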
Another approach to “robustifying” the empirical semivariogram estimator is to consider quantiles of the distribution of {Z(si) − Z(sj)}² or |Z(si) − Z(sj)| instead of arithmetic averages (as in (4.24) and (4.26)). If [Z(s), Z(s + h)]′ are bivariate Gaussian with common mean, then

$$\frac{1}{2} \{Z(\mathbf{s}) - Z(\mathbf{s} + \mathbf{h})\}^2 \sim \gamma(\mathbf{h}) \chi^2_1, \qquad \frac{1}{2} |Z(\mathbf{s}) - Z(\mathbf{s} + \mathbf{h})| \sim \sqrt{\tfrac{1}{2}\gamma(\mathbf{h})}\, |U|, \quad U \sim G(0, 1).$$

Let $q^{(p)}_{|N(\mathbf{h})|}$ denote the pth quantile. Then

$$\hat{\gamma}_p(\mathbf{h}) = q^{(p)}_{|N(\mathbf{h})|} \left\{ \frac{1}{2} [Z(\mathbf{s}) - Z(\mathbf{s} + \mathbf{h})]^2 \right\}$$

estimates γ(h) × χ²_{p,1}. A median-based estimator (p = 0.5) would be

$$\hat{\gamma}_{0.5}(\mathbf{h}) = \frac{1}{2}\, \mathrm{median}_{|N(\mathbf{h})|} \left\{ [Z(\mathbf{s}) - Z(\mathbf{s} + \mathbf{h})]^2 \right\} / 0.455 = \frac{1}{2} \left[ \mathrm{median}_{|N(\mathbf{h})|} \left\{ |Z(\mathbf{s}) - Z(\mathbf{s} + \mathbf{h})|^{1/2} \right\} \right]^4 / 0.455.$$

The latter expression is (2.4.13) in Cressie (1993, p. 75).
In the following sections we discuss the various approaches and their respective merits and demerits. To distinguish the empirical semivariogram γ(h) and its estimate γ̂(h) from the semivariogram model being fit, we introduce the notation γ(h, θ) for the latter. The vector θ contains all unknown parameters to be estimated from the data. The model may be a single, isotropic semivariogram function as in §4.3.2–4.3.5, a model with nugget effect, an anisotropic model, or a nested model.
Since

$$\mathrm{Cov}[T_{ij}, T_{kl}] = \mathrm{E}[T_{ij} T_{kl}] = \mathrm{E}[Z(\mathbf{s}_i)Z(\mathbf{s}_k) - Z(\mathbf{s}_i)Z(\mathbf{s}_l) - Z(\mathbf{s}_j)Z(\mathbf{s}_k) + Z(\mathbf{s}_j)Z(\mathbf{s}_l)]$$

and E[Z(si)Z(sj)] = C(0) − γ(hij, θ) + µ², we have

$$\mathrm{Corr}[T^2_{ij}, T^2_{kl}] = \frac{\{\gamma(h_{il}, \theta) + \gamma(h_{jk}, \theta) - \gamma(h_{jl}, \theta) - \gamma(h_{ik}, \theta)\}^2}{4\gamma(h_{ij})\gamma(h_{kl})},$$

which is (2.6.10) in Cressie (1993, p. 96). Finally,

$$\mathrm{Cov}[T^2_{ij}, T^2_{kl}] = 2 \left\{ \gamma(h_{il}, \theta) + \gamma(h_{jk}, \theta) - \gamma(h_{jl}, \theta) - \gamma(h_{ik}, \theta) \right\}^2. \qquad (4.32)$$

If i = k and j = l, (4.32) reduces to 8γ(hij, θ)², of course. The variance of the Matheron estimator at lag hm is now obtained as

$$\mathrm{Var}[2\hat{\gamma}(\mathbf{h}_m)] = \frac{1}{|N(\mathbf{h}_m)|^2} \mathrm{Var}\left[ \sum_{N(\mathbf{h}_m)} T^2_{ij} \right] = \frac{1}{|N(\mathbf{h}_m)|^2} \sum_{i,j} \sum_{k,l} \mathrm{Cov}[T^2_{ij}, T^2_{kl}].$$
Because the off-diagonal entries of R(θ) are appreciable, the WLS criterion is a poor approximation of (4.31). Since (4.34) can be written as a weighted sum of squares over the k lag classes, it is a simple matter to fit a semivariogram model with a nonlinear statistics package, provided it can accommodate weights. A further “simplification” is possible if one assumes that R = φI. This ordinary least squares (OLS) approach ignores the correlation and the unequal dispersion among the γ̂(hm). Zimmerman and Zimmerman (1991) found that the ordinary least squares and weighted least squares estimators of the semivariogram performed more or less equally well. One does not lose much by assuming that the γ̂(hm) have equal variance. The greatest loss of efficiency is not incurred by employing OLS over WLS, but by not incorporating the correlations among the γ̂(hm).
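A sketch of the WLS fit in Python, with the weights formed from the variance approximation (4.25) and re-evaluated at each iterate (the exponential model, starting values, and data values are illustrative assumptions):

```python
import numpy as np
from scipy.optimize import least_squares

def gamma_exp(h, sigma2, alpha):
    """Exponential semivariogram: sill sigma2, practical range alpha."""
    return sigma2 * (1.0 - np.exp(-3.0 * h / alpha))

def wls_fit(h, gamma_hat, npairs, start=(0.3, 100.0)):
    """Weighted least squares fit; each residual is scaled by the
    approximate standard deviation from (4.25),
    Var[gamma_hat(h)] ~ 2 gamma(h, theta)^2 / |N(h)|."""
    def scaled_resid(theta):
        g = gamma_exp(h, *theta)
        return (gamma_hat - g) / (g * np.sqrt(2.0 / npairs))
    return least_squares(scaled_resid, x0=start, bounds=(1e-8, np.inf)).x

# Hypothetical empirical semivariogram, lag midpoints, pair counts.
h = np.array([6.0, 12.0, 18.0, 24.0, 30.0, 36.0])
gamma_hat = np.array([0.08, 0.15, 0.20, 0.24, 0.26, 0.27])
npairs = np.array([74.0, 136.0, 168.0, 224.0, 241.0, 256.0])
sigma2, alpha = wls_fit(h, gamma_hat, npairs)
```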
The covariance and correlation structure of 2γ̂(h) has been studied by Genton (1998b) under the assumption that Z(s) is Gaussian and by Genton (2000) for elliptically contoured distributions (see also Genton, He, and Liu, 2001). The derivations rest on writing the Matheron estimator as

$$2\hat{\gamma}(\mathbf{h}) = \mathbf{Z}(\mathbf{s})' \mathbf{A}(\mathbf{h}) \mathbf{Z}(\mathbf{s}),$$

where A(h) is a spatial design matrix of the data at lag h. Applying known results for quadratic forms in Gaussian random variables, Z(s) ∼ G(µ, Σ(θ)), yields

$$\mathrm{E}[2\hat{\gamma}(\mathbf{h})] = \mathrm{tr}[\mathbf{A}(\mathbf{h})\boldsymbol{\Sigma}(\theta)]$$
$$\mathrm{Var}[2\hat{\gamma}(\mathbf{h})] = 2\,\mathrm{tr}[\mathbf{A}(\mathbf{h})\boldsymbol{\Sigma}(\theta)\mathbf{A}(\mathbf{h})\boldsymbol{\Sigma}(\theta)]$$
$$\mathrm{Cov}[2\hat{\gamma}(\mathbf{h}_i), 2\hat{\gamma}(\mathbf{h}_j)] = 2\,\mathrm{tr}[\mathbf{A}(\mathbf{h}_i)\boldsymbol{\Sigma}(\theta)\mathbf{A}(\mathbf{h}_j)\boldsymbol{\Sigma}(\theta)],$$

where tr is the trace operator.
As is the case in (4.33), these expressions depend on the unknown parameters. Genton (1998b) assumes that the data are only “slightly correlated” and puts Σ(θ) ∝ I. It seems rather strange to assume that the data are uncorrelated in order to model the parameters of the data dependence. Genton (2000) shows that if the distribution of the data is elliptically contoured, Σ = φI + 1a′ + a1′, φ ∈ R, and Σ is positive definite, the correlation structure of the Matheron estimator is

$$\mathrm{Corr}[2\hat{\gamma}(\mathbf{h}_i), 2\hat{\gamma}(\mathbf{h}_j)] = \frac{\mathrm{tr}[\mathbf{A}(\mathbf{h}_i)\mathbf{A}(\mathbf{h}_j)]}{\sqrt{\mathrm{tr}[\mathbf{A}^2(\mathbf{h}_i)]\,\mathrm{tr}[\mathbf{A}^2(\mathbf{h}_j)]}}.$$
Example 4.4 Notice that the first term in (4.36) is a scalar, the inverse of the sum of the elements of the inverse variance-covariance matrix. In the special case where Σ(θ) = θI, we obtain 1′Σ(θ)⁻¹1 = n/θ and the generalized least squares estimator is simply the sample mean,

$$\hat{\mu} = \frac{\theta}{n} \mathbf{1}' \boldsymbol{\Sigma}(\theta)^{-1} \mathbf{Z}(\mathbf{s}) = \frac{\theta}{n} \frac{1}{\theta} \sum_{i=1}^{n} Z(\mathbf{s}_i) = \bar{Z}.$$
Akin to the profiled maximum likelihood, (4.39) does not contain information about the mean µ. In ML estimation, µ was profiled out of the log likelihood; in REML estimation the likelihood being maximized is that of a different set of data, KZ(s) instead of Z(s). Hence, there is no restricted maximum likelihood estimator of µ. Instead, the estimator obtained by evaluating (4.36) at the REML estimates θ̂reml is an estimated generalized least squares estimator:

$$\hat{\mu}_{reml} = \left( \mathbf{1}' \boldsymbol{\Sigma}(\hat{\theta}_{reml})^{-1} \mathbf{1} \right)^{-1} \mathbf{1}' \boldsymbol{\Sigma}(\hat{\theta}_{reml})^{-1} \mathbf{Z}(\mathbf{s}). \qquad (4.40)$$
The idea of using generalized estimating equations (GEE) for the estimation of
parameters in statistical models was made popular by Liang and Zeger (1986)
and Zeger and Liang (1986) in the context of longitudinal data analysis. The
technique is an application of estimating function theory and quasi-likelihood.
Let T denote a random vector whose mean depends on some parameter vector
θ, E[T] = f (θ). Furthermore, denote as D the matrix of first derivatives of
the mean function with respect to the elements of θ. If Var[T] = Σ, then
$$U(\theta; \mathbf{T}) = \mathbf{D}' \boldsymbol{\Sigma}^{-1} (\mathbf{T} - \mathbf{f}(\theta)),$$

is an unbiased estimating function for θ in the sense that E[U(θ; T)] = 0, and an estimate θ̂ can be obtained by solving U(θ; t) = 0 (Heyde, 1997). The optimal estimating function in the sense of Godambe (1960) is the (likelihood) score function. In estimating problems where the score is inaccessible or intractable, as is often the case when data are correlated, U(θ; T) nevertheless implies a consistent estimator of θ. The efficiency of this estimator increases with closeness of U to the score function. For correlated data, where Σ is unknown or contains unknown parameters, Liang and Zeger (1986) and Zeger and Liang (1986) proposed to substitute a “working” variance-covariance matrix W(α) for Σ and to solve instead the estimating equation

$$U^*(\theta; \mathbf{T}) = \mathbf{D}' \mathbf{W}(\alpha)^{-1} (\mathbf{T} - \mathbf{f}(\theta)) \equiv \mathbf{0}.$$

If the parameter vector α can be estimated and α̂ is a consistent estimator, then for any particular value of α̂

$$U_{gee}(\theta; \mathbf{T}) = \mathbf{D}' \mathbf{W}(\hat{\alpha})^{-1} (\mathbf{T} - \mathbf{f}(\theta)) \equiv \mathbf{0} \qquad (4.41)$$
is an unbiased estimating equation. The root is a consistent estimator of θ,
provided that W(α) ' satisfies certain properties; for example, if W is block-
diagonal, or has specific mixing properties (see Fuller and Battese, 1973; Zeger,
1988).
Initially, the GEE methodology was applied to the estimation of parameters
that model the mean of the observed responses. Later, it was extended to the
estimation of association parameters, variances, and covariances (Prentice,
1988; Zhao and Prentice, 1990). This process commences with the construc-
tion of a vector of pseudo-data. For example, let Tij = (Yi − µi )(Yj − µj ),
then E[Tij ] = Cov[Yi , Yj ] and after parameterizing the covariances, the GEE
methodology can be applied. Now assume that the data comprise the incomplete sampling of a geostatistical process, Z(s) = [Z(s_1), ..., Z(s_n)]', and consider the pseudo-data

T_ij^(1) = (Z(s_i) − µ(s_i))(Z(s_j) − µ(s_j))
T_ij^(2) = Z(s_i) − Z(s_j)
T_ij^(3) = (Z(s_i) − Z(s_j))².
GEE estimates can thus be calculated as the ordinary (nonlinear) least squares estimates in the model

T_ij^(3) = 2γ(h_ij, θ) + δ_ij,    δ_ij ∼ iid (0, φ),

with a Gauss-Newton algorithm. Notice that this is the same as fitting the semivariogram model by OLS to the semivariogram cloud consisting of the {Z(s_i) − Z(s_j)}², instead of fitting the model to the Matheron semivariogram estimator.
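To make this concrete, here is a minimal Python sketch (our construction; the exponential model with practical-range parameterization is an assumption) that builds the pseudo-data T_ij^(3) and fits 2γ(h; θ) by nonlinear OLS, i.e., the GEE fit under working independence, using a Levenberg-Marquardt (Gauss-Newton type) algorithm.

    import numpy as np
    from scipy.optimize import least_squares

    def gee_cloud_fit(coords, z, theta0):
        """Fit 2*gamma(h; theta) to the semivariogram cloud by nonlinear OLS.
        Assumed model: exponential, gamma(h) = sill*(1 - exp(-3h/range))."""
        i, j = np.triu_indices(len(z), k=1)
        h = np.linalg.norm(coords[i] - coords[j], axis=1)   # lags h_ij
        t3 = (z[i] - z[j]) ** 2                             # pseudo-data T^(3)
        def resid(theta):
            sill, rng = theta
            return t3 - 2.0 * sill * (1.0 - np.exp(-3.0 * h / rng))
        return least_squares(resid, theta0, method="lm").x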
In the GEE approach only the model for the mean of the data (or pseudo-data) is required; the variance-covariance matrix is supplanted by a "working" structure. The efficiency of GEE estimators increases with the closeness of the working structure to Var[T]. The essential problems that lead to the consideration of generalized estimating equations are the intractability of the likelihood function and the difficulties in modeling Var[T]. No likelihood is used at any
stage of the estimation problem. A different—but as we will show, related—
strategy is to consider the likelihood of components of T, rather than the
likelihood of T. This is an application of the composite likelihood (CL) idea
(Lindsay, 1988; Lele, 1997; Heagerty and Lele, 1998).
Let Y_i, i = 1, ..., n, denote random variables with known (marginal) distribution and let ℓ(θ; y_i) denote the log likelihood function for Y_i. Then ∂ℓ(θ; y_i)/∂θ' is a true score function, and the component score

S(θ; y_i) = ∂ℓ(θ; y_i)/∂θ'

is an unbiased estimating function for θ. Unless the Y_i are mutually independent, the sum of the component scores S(θ; y_i) is not the score function for the entire data. Nevertheless,

U_cl(θ; Y) = Σ_{i=1}^n S(θ; Y_i)

remains an unbiased estimating function for θ.
4.5.4 Comparisons
sites increases. At first glance, one might conjecture in this case that com-
posite likelihood estimators are more efficient than their GEE counterparts,
because the GEE approach with working independence structure does not
take into account the unequal dispersion of the pseudo-data. Recall that CL
estimation as described above entails WLS fitting of the semivariogram model
to the empirical semivariogram cloud, while GEE estimation with working in-
dependence structure is OLS estimation. Write the two non-linear models as
T_ij^(3) = 2γ(h_ij, θ) + ε_ij,    ε_ij ∼ iid (0, φ)
T_ij^(3) = 2γ(h_ij, θ) + ε_ij,    ε_ij ∼ iid (0, 8γ²(h_ij, θ)).
Now replace site indices with lag classes and let K → ∞. If θ* denotes the true parameter vector, then the limit of the WLS criterion is

Σ_{k=1}^∞ [1/(8γ²(h_k, θ))] (2γ(h_k, θ*) − 2γ(h_k, θ) + ε_k)².

Applying the law of large numbers and rearranging terms, we find that this is equivalent to the minimization of

(3/2) Σ_{k=1}^∞ { [γ(h_k, θ*)/γ(h_k, θ)]² + 1/3 − (2/3) γ(h_k, θ*)/γ(h_k, θ) }.
The particular relationship between the mean and variance of T_ij^(3) and the dependence of the variance on model parameters has created a situation where the semivariogram evaluated at the CL estimator is not consistent for γ(h_k, θ*). Instead, it consistently estimates 3γ(h_k, θ*). The "bias correction" for the CL estimator is remarkably simple.
If correlated data are fit by least squares methods, we would like to take into
account the variation and covariation of the observations. In general, GLS esti-
mation is more efficient than WLS estimation, and it, in turn, is more efficient
than OLS estimation. We are tacitly implying here that the covariance matrix
of the data is known for GLS estimation, and that the weights are known for
WLS estimation. The preceding discussion shows, however, that if one works
with the semivariogram cloud, one may be better off fitting the semivariogram
model by OLS, than by weighted least squares without bias correction, be-
cause the weights depend on the semivariogram parameters. Since the bias
correction is so simple (multiply by 1/3), it is difficult to argue in favor of
OLS. Müller (1999) also proposed an iterative scheme to obtain consistent
estimates of the semivariogram based on WLS. In the iteratively re-weighted
algorithm the weights are computed for current estimates of θ and held fixed.
Based on the non-linear WLS estimates of θ, the weights are re-computed and
the semivariogram model is fit again. The process continues until changes in
the parameter estimates from two consecutive fits are sufficiently small. The
performance of the iteratively re-weighted estimators was nearly as good as
that of the iterated GLS estimator in the simulation study of Müller (1999).
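A sketch of such an iteratively re-weighted scheme in Python follows (our rendering, not Müller's code); the weights N(h_k)/γ²(h_k; θ) follow the usual WLS approximation for the empirical semivariogram and are an assumption here.

    import numpy as np
    from scipy.optimize import least_squares

    def irwls_fit(h, gamma_hat, n_pairs, model, theta0, tol=1e-6, maxit=25):
        """Iteratively re-weighted WLS: weights are evaluated at the current
        theta, held fixed while the model is re-fit, then updated."""
        theta = np.asarray(theta0, float)
        for _ in range(maxit):
            w = np.sqrt(n_pairs) / model(h, theta)   # sqrt of N(h)/gamma^2
            fit = least_squares(lambda t: w * (gamma_hat - model(h, t)), theta)
            if np.max(np.abs(fit.x - theta)) < tol:
                return fit.x
            theta = fit.x
        return theta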
The difficulty of a pure or iterated GLS approach lies in the determination
of the full covariance structure of the pseudo-data and the possible size of the
covariance matrix. For Gaussian random fields, variances and covariances of the T_ij^(3) are easy to ascertain and can be expressed as a function of semivariogram values. For the empirical semivariogram in a Gaussian random field, the covariance matrix of the Matheron estimator γ̂(h) is derived in Genton (1998b). In either case, the covariance matrix depends on θ and an iterative approach seems prudent. Based on a starting value θ_0, compute Var[T^(3)] or Var[γ̂(h)] and estimate the first update θ̂_1 by (estimated) generalized least squares. Recompute the variance-covariance matrix and repeat the GLS step. This process continues until changes in subsequent estimates of θ are minor. The difficulty with applying GLS estimation to the semivariogram cloud is the size of the data vector. Since the set of pseudo-data contains up to n(n − 1)/2 points, compared to K ≪ n if you work with the empirical semivariogram, building and inverting Var[T^(3)] quickly becomes computationally prohibitive as n grows.
If the distribution of the data is elliptically contoured, the simplification
described in Genton (2000) can be put in place, provided the covariance matrix
of the data is of the form described there (see also §4.5.1 in this text). This
eliminates the covariance parameters θ from the GLS weight matrix.
The least-squares methods require sorting pairs of observations (or subsets thereof) into lag classes. Even for regularly spaced data, binning is often
necessary to achieve a recommended number of pairs in each lag class. The
process of binning itself is not without controversy. The choice (number and
spacing) of lag classes affects the resulting semivariogram cloud. The choice
of the largest lag class for which to calculate the empirical semivariogram can
eliminate values with large variability. The user who has a particular semi-
variogram model in mind may be tempted to change the width and number
of lag classes so that the empirical semivariogram resembles the theoretical
model. Combined with trimming values at larger lags, the process is slanted towards creating a set of data to which a model fits well. The process of fitting a statistical model entails the development of a model that supports the data, not the development of a set of data that supports a model.
The CL and GEE estimation methods are based on the semivariogram cloud
and avoid the binning process. As discussed above, the choice of parameter-
dependent weights can negatively affect the consistency of the estimates, how-
ever, and a correction may be necessary. Estimation based on the semivari-
ogram cloud can also “trim” values at large lags, akin to the determination of
a largest lag class in least-squares fitting. Specifically, let wij denote a weight
associated with {Z(s_i) − Z(s_j)}². For example, take

w_ij = 1 if ||s_i − s_j|| ≤ c, and w_ij = 0 if ||s_i − s_j|| > c,

and modify the composite likelihood score equation as

CS(θ; T^(2)) = 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} w_ij [∂γ(h_ij, θ)/∂θ'] [1/(8γ(h_ij, θ)²)] {T_ij^(3) − 2γ(h_ij, θ)},

to exclude from estimation those pairs whose distance exceeds c. One might also use weights that depend on the magnitude of the residual (t_ij^(3) − 2γ(h_ij, θ)) to "robustify" the estimator.
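The Python fragment below (a sketch under our naming; `gamma` and `dgamma` are user-supplied functions, the latter returning the p partial derivatives of γ for each pair) evaluates this weighted composite likelihood score with the indicator weights cutting off pairs beyond distance c; a root in θ can then be sought with, e.g., scipy.optimize.root.

    import numpy as np

    def cl_score(theta, coords, z, gamma, dgamma, c):
        """Weighted CL score CS(theta; T^(2)): pairs with ||s_i - s_j|| > c
        receive weight 0; gamma(h, theta) is the semivariogram model and
        dgamma(h, theta) its (npairs x p) gradient."""
        i, j = np.triu_indices(len(z), k=1)
        h = np.linalg.norm(coords[i] - coords[j], axis=1)
        keep = h <= c                                  # w_ij = 1{h_ij <= c}
        hk = h[keep]
        t3 = (z[i[keep]] - z[j[keep]]) ** 2            # pseudo-data T^(3)
        g = gamma(hk, theta)
        return 2.0 * dgamma(hk, theta).T @ ((t3 - 2.0 * g) / (8.0 * g ** 2))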
The ML and REML estimators also avoid the binning process altogether.
In fact, squared differences between observed values do not play a role in
likelihood estimation. The objective functions (4.35) and (4.39) involve gen-
eralized sums of squares between observations and their means, not squared
differences between the Z(si ). It is thus not correct to cite as a disadvan-
tage of likelihood methods that one cannot eliminate from estimation pairs
at large lags. The lag structure figures into the structure of the covariance
matrix Σ(θ), which reflects the variation and covariation of the data. Likeli-
hood methods use an objective function built on the data, not an objective
function built on pseudo-data that was crafted by the analyst based on the
spatial configuration.
Comparing CL with ML (or REML) estimation, it is obvious that composite likelihood methods are less efficient, since CS(θ; T^(2)) is not the score function of Z(s). Computationally, CL and GEE estimation are more efficient, however; solving the maximum or restricted maximum likelihood score equations is a more demanding numerical problem.
Example 4.2 (C/N ratios. Continued) For the C/N ratio data we applied
the previously discussed estimation methods assuming an exponential covari-
ance structure or semivariogram. Table 4.2 displays the parameter estimates
for five estimation methods with and without a nugget effect and Figure 4.10
displays the fitted semivariograms. The least-squares fits used the classical
empirical semivariogram (Matheron estimator).
It is noteworthy in Table 4.2 that the inclusion of a nugget effect tends to
raise the estimate of the range, a common phenomenon. In other words, the
decrease in spatial continuity due to measurement error is compensated to
some degree by an increase in the range which counteracts the decline in the
spatial autocorrelations on short distances. Unfortunately, the OLS, WLS,
CL, and GEE estimation methods do not produce reliable standard errors
for the parameter estimates and the necessity for inclusion of a nugget effect
must be determined on non-statistical grounds. These methods estimate the
semivariogram parameters from pseudo-data and do not account properly for
the covariation among the data points.
Table 4.2 Estimated parameters for C/N ratios with exponential covariance structure; see also Figure 4.10. For each estimation method the table reports estimates of the nugget, the (partial) sill, and the practical range.
The −2 (restricted) log likelihoods for the nugget and no-nugget models are 264.83 and 277.79, respectively. The likelihood ratio statistic to test whether the presence of the nugget effect significantly improves the model fit is 277.79 − 264.83 = 12.96 and is significant; Pr(χ²₁ > 12.96) ≈ 0.0003. Based on this test, the model
should contain a nugget effect. On the other hand, the REML method pro-
duces by far the largest estimate of the variance of the process (0.318 and
0.215 + 0.118 = 0.333), and the REML estimate of the range in the no-nugget
model (171.1) appears large compared to other methods. Recall that OLS,
WLS, CL, and GEE estimates are obtained from a data set in which the
largest lag does not coincide with the largest distance in the data. Pairs at
large lags are often excluded from the analysis. In our case, only data pairs
with lags less than 6 × 35 = 210 feet were used in the OLS/WLS/CL/GEE
analyses. The ML and REML methods cannot curtail the data.
The consequences of using the empirical semivariogram cloud (GEE/CL)
versus the empirical semivariogram (OLS/WLS) are minor for these data. The
OLS and GEE estimates are quite close, as are the WLS and CL estimates.
This is further amplified in a graph of the fitted semivariograms (Figure 4.10).
The CL and WLS fits are nearly indistinguishable. The same holds for the
OLS and GEE fits.
Performing a weighted analysis does, however, affect the estimate of the (practical) range in the no-nugget models. Perhaps surprisingly, the CL estimates do not exhibit the large bias that was mentioned earlier. Based on the previous derivations, one would have expected the CL estimator of the practical range to be much larger on average than the consistent WLS estimator. Recall that the lack of consistency of the CL estimator (which is a weighted version of the GEE estimator) was established based on an asymptotic model in which the number of observations as well as the number of lag classes grows to infinity. In this application, we curtailed the largest lag class (to 6 × 35 = 210 feet), a practice we generally recommend for composite likelihood and generalized estimating equation estimation. An asymptotic model under which the domain and the number of observations increase is not meaningful in this application. The experimental field cannot be arbitrarily enlarged. We accept the CL estimators without "bias" correction.
When choosing between pseudo-data based estimates and ML/REML estimates, the fitted semivariograms are sometimes displayed together with the empirical semivariogram. The CL/GEE and ML/REML estimates will generally fare less favorably in such a visual comparison than the OLS/WLS estimates: CL/GEE and ML/REML estimates do not minimize a (weighted) sum of squares between the model and the empirical semivariogram. The least squares estimates obviously fit the empirical semivariogram best; that is their job. This does not imply that least squares yields the best estimates from which to reconstruct the second-order structure of the spatial process.
Figure 4.10 Fitted semivariograms for C/N data with and without nugget effect, constructed from the parameter estimates in Table 4.2. (Two panels of semivariance against distance; curves labeled REML, OLS/GEE, and CL/WLS.)
In §2.5.2 it was shown that the class of valid covariance functions in R^d can be expressed as

C(h) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} cos(ω'h) S(dω),

and in the isotropic case we have

C(h) = ∫_0^∞ Ω_d(hω) F(dω).    (4.45)
The main distinction between this and the previous approach lies in the
solution of the integration problem. In the step function approach F (ω) is
discretized to change integration to summation. The “unknowns” in the step
function approach are the number of nodes, their placement, and the weights.
Some of these unknowns are fixed a priori and the remaining unknowns are
estimated from the data. In the kernel approach we fix a priori only the class
of functions f (θ, ω). Unknowns in the estimation phase are the parameters
that index the kernel function f(θ, ω). If the integral in (4.48) cannot be solved in closed form, we can resort to a quadrature or trapezoidal rule to calculate C(θ, h) numerically. The parametric kernel function approach for
semivariogram estimation is appealing because of its flexibility and parsimony.
Valid semivariograms can be constructed with a surprisingly small number of
parameters. The process of fitting the semivariogram can typically be carried
out by nonlinear least squares, often without constraints.
The need for F to be non-decreasing and ∫_0^∞ f(θ, ω) dω < ∞ suggests drawing on probability density functions in the construction of f(θ, ω). A particularly simple, but powerful, choice is as follows. Suppose G(ω; θ) is the cumulative distribution function (cdf) of a U(θ_l, θ_u) random variable, so that θ = [θ_l, θ_u]'. Then define

F(θ, ω) = 0 for ω < 0,   F(θ, ω) = G(ω; θ) for 0 ≤ ω ≤ 1,   F(θ, ω) = 1 for ω > 1.    (4.49)
The kernel f(θ, ω) is positive only for values of ω between 0 and 1. As a consequence, the largest value for which the basis function Ω_d(hω) is evaluated corresponds to the largest semivariogram lag. This largest lag may imply too many or too few sign changes of the basis function (see Figure 4.1 on page 142). In the latter case, you can increase the bounds on ω in (4.49) and model

F(θ, ω) = 0 for ω < 0,   F(θ, ω) = G(ω; θ) for 0 ≤ ω ≤ b,   F(θ, ω) = 1 for ω > b.

The weighting of the basis functions is controlled by the shape of the kernel between 0 and b. Suppose that θ_l = 0 and G(ω; θ) is a uniform cdf. Then all values of Ω(hω) receive equal weight 1/θ_u for 0 ≤ ω ≤ θ_u, and weight 0 everywhere else. By shifting the lower and upper bounds of the uniform cdf, different parts of the basis functions are weighted. For θ_l small, the product hω is small for values where f(θ, ω) ≠ 0 and the basis functions will have few sign changes on the interval (0, hω) (Figure 4.11).
Figure 4.11 Semivariograms with sill 1.0 constructed from covariance functions (4.48) for processes in R² with uniform kernels of width θ_u − θ_l = 0.2 for various values of θ_l = −0.1, 0, 0.1, 0.2, 0.4, 0.6, 0.8. As θ_l increases the semivariogram becomes more and more wavy. The basis function is Ω_d(t) = J_0(t).
To adjust for the variance of the process, a sill parameter is added and we model the parametric kernel estimator

C(θ, h) = σ² ∫_0^b Ω(hω) f(θ, ω) dω,    (4.50)

and the semivariogram

γ(θ, h) = σ² − C(θ, h),    (4.51)

where C(θ, h) is given in (4.50). The parameter σ² is akin to the sill of classical semivariogram models in the sense that the nonparametric semivariogram oscillates about σ² and will approach it asymptotically. In practical implementation, the integral in (4.50) can often be approximated with satisfactory accuracy by a sum, applying a trapezoidal or quadrature rule. This is helpful if the fitting procedure allows array processing, such as the NLIN or NLMIXED procedures of SAS/STAT®. An example of fitting a semivariogram with the parametric kernel approach is presented at the conclusion of §4.6.3.
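For example, a trapezoidal-rule evaluation of (4.50) with the uniform kernel of (4.49) takes only a few lines in Python (a sketch; the grid size and the use of scipy's J_0 are our choices):

    import numpy as np
    from scipy.special import j0   # basis Omega_2(t) = J_0(t) for R^2

    def kernel_cov(h, sigma2, theta_l, theta_u, b=1.0, m=500):
        """C(theta, h) = sigma^2 * int_0^b J0(h*w) f(theta, w) dw with a
        uniform U(theta_l, theta_u) kernel, by the trapezoidal rule."""
        w = np.linspace(0.0, b, m)
        f = np.where((w >= theta_l) & (w <= theta_u),
                     1.0 / (theta_u - theta_l), 0.0)
        h = np.atleast_1d(np.asarray(h, float))
        C = np.array([np.trapz(j0(hi * w) * f, w) for hi in h])
        return sigma2 * C

    # Semivariogram implied by (4.50)-(4.51): gamma(h) = sigma2 - kernel_cov(h, ...)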
The uniform kernel is simple to work with, but you may not want to weight the basis functions equally for values of ω where f(θ, ω) is nonzero. Kernel functions with unequal weighting can be constructed easily by drawing on other probability densities. For example,

f(µ, ξ, ω) = exp{−0.5(ω − µ)²/ξ} for 0 ≤ ω ≤ b, and 0 otherwise,    (4.52)

is a two-parameter kernel derived from the Gaussian density. The kernel can be scaled on 0 ≤ ω ≤ b so that it integrates to one, for example,

f(µ, ξ, ω) = exp{−0.5(ω − µ)²/ξ} / ∫_0^b exp{−0.5(u − µ)²/ξ} du.
A random field Z(s) can be expressed in terms of a kernel function K(u) and a white noise excitation field X(s) as

Z(s) = ∫_u K(s − u) X(u) du.
It was shown in §2.4.2 that

Cov[Z(s), Z(s + h)] = σ²_x ∫_u K(u)K(h + u) du,

and thus

Var[Z(s)] = σ²_x ∫_u K(u)² du.    (4.53)
These results are put to use in finding the semivariogram of the convolved process. Taking σ²_x = 1 without loss of generality,

γ(h) = Var[Z(s) − Z(s + h)]
     = Var[ ∫_u K(s − u)X(u) du − ∫_u K(s + h − u)X(u) du ]
     = Var[ ∫_u (K(s − u) − K(s + h − u)) X(u) du ]
     = Var[ ∫_u P(s − u, h)X(u) du ].

The last expression is the variance of a random field U(s) with kernel P(u, h). Applying (4.53) one finds

γ(h) = Var[Z(s) − Z(s + h)] = ∫_u P(s − u, h)² du = ∫_u (K(s − u) − K(s + h − u))² du.

Because s is arbitrary and C(h) is an even function, the expression simplifies to

γ(h) = ∫_u (K(u) − K(u − h))² du,    (4.54)

the moving average formulation of the nonparametric semivariogram.

Example 4.5 Recall Example 2.3, where white noise was convolved with a
uniform and a Gaussian kernel function. The resulting correlation functions in Figure 2.5 on page 60 show a linear decline of the correlation for the uniform kernel up to some range r; the correlation remains zero afterward. Obviously, this correlation function corresponds to a linear isotropic semivariogram model

γ(h) = θ|h| for |h| ≤ r,   γ(h) = θr for |h| > r,

with sill θr (θ, r > 0). This is easily verified with (4.54). For a process in R¹ define K(u) = √(θ/2) if 0 ≤ u ≤ r and 0 elsewhere. Then (K(u) − K(u − h))² equals θ/2 on the symmetric difference of the intervals [0, r] and [h, h + r], a set of measure 2 min(|h|, r), so that (4.54) yields γ(h) = θ min(|h|, r).
Barry and Ver Hoef (1996) have drawn on these ideas and extended the basic moving average procedure to more than one dimension, also allowing for anisotropy. Their families of variogram models are based on moving averages using piecewise linear components. From the previous discussion it is seen that any valid (square integrable) kernel function K(u) can be used to construct nonparametric semivariograms from moving averages.
The approach of Barry and Ver Hoef (1996) uses linear structures which yield explicit expressions for the integral in (4.54). For a one-dimensional process you choose a range c > 0 and divide the interval (0, c] into k equal subintervals of width w = c/k. Let f(u, θ_i) denote the rectangular function with height θ_i on the ith interval,

f(u, θ_i) = θ_i if (i − 1) < u/w ≤ i, and 0 otherwise.    (4.55)
The moving average function K(u) is a step function with steps θ_1, ..., θ_k,

K(u | c, k) = Σ_{i=1}^k f(u, θ_i).

When the lag distance |h| = jw equals or exceeds the range c, that is, when j ≥ k, the second sum vanishes and the semivariogram remains flat at σ² = w Σ_{i=1}^k θ_i², the sill value. When h is not an integer multiple of the width w, the semivariogram value is obtained by interpolation.
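A brute-force numerical version of this construction is sketched below (ours, not Barry and Ver Hoef's code; the factor 1/2 reflects the semivariogram convention under which the flat level equals σ² = wΣθ_i² as stated above, and should be dropped if (4.54) is applied without it):

    import numpy as np

    def ma_semivariogram(h, heights, c, ngrid=4000):
        """Moving average semivariogram from a step kernel with k equal-width
        steps `heights` on (0, c]; evaluates 0.5 * int (K(u) - K(u-h))^2 du."""
        k = len(heights)
        w = c / k
        th = np.asarray(heights, float)
        u = np.linspace(-c, 2.0 * c, ngrid)
        du = u[1] - u[0]
        def K(x):
            idx = np.clip(np.ceil(x / w).astype(int) - 1, 0, k - 1)
            return np.where((x > 0) & (x <= c), th[idx], 0.0)
        return 0.5 * np.sum((K(u) - K(u - h)) ** 2) * du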
For processes in R2 , the approach uses piecewise planar functions that are
constant on the rectangles of a grid. The grid is formed by choosing ranges
c and d in two directions and then dividing the (0, 0) × (c, d) rectangle into
k × l rectangles. Instead of the constant function on the line (4.55), the piece-
wise planar functions assign height θi,j whenever a point falls inside the ith,
jth sub-rectangle (see Barry and Ver Hoef, 1996, for details). Note that the
piecewise linear model in (4.55) is not valid for processes in R2 .
The piecewise linear (planar) formulation of Barry and Ver Hoef combines
Any semivariogram model can be furnished with a nugget effect using the method in §4.3.6; nonparametric models are no exception. There is, however, a trade-off between estimating the parameters of a nonparametric model that govern the smoothness of the semivariogram and estimating the nugget effect. The sum of the weights Σ_{i=1}^p w_i in the spectral approach and the sum of the squared step heights Σ_{i=1}^k θ_i² in the moving average approach represent the partial sill σ_0² in the presence of a nugget effect c_0. The nonlinear least squares objective function is adjusted accordingly. When the nugget effect is large, the process contains a lot of background noise and the nonparametric semivariogram estimate tends not to be smooth; the nonparametric coefficients w_i, θ_i² tend to be large. Since the sill σ² = c_0 + σ_0² is fixed, the nugget effect will then be underestimated. A large estimate of the nugget effect, on the other hand, leads to an artificially smooth nonparametric semivariogram that is not sufficiently flexible because of small weights.
Barry and Ver Hoef (1996) recommend estimating the nugget effect separately from the nonparametric coefficients, for example, by fitting a line to the first few lags of the empirical semivariogram cloud and obtaining the nugget estimate ĉ_0 as the intercept. The data used in fitting the nonparametric semivariogram are then shifted by that amount, provided ĉ_0 > 0.
The nugget effect was held fixed at the same value as in the moving average approach. Figure 4.12 shows fitted semivariograms for b = 1 and b = 2 for the case of an externally estimated nugget effect (ĉ_0 = 0.1169). The parameter estimates were θ̂_u = 0.1499 and σ̂² = 0.166 for b = 1, and θ̂_u = 0.1378 and σ̂² = 0.166 for b = 2. Increasing the limit of integration had the effect of reducing the upper limit of the kernel, but not proportionally to the increase. As a result, the Bessel functions are evaluated up to a larger abscissa, and the fitted semivariogram for b = 2 is less smooth.
Figure 4.12 Fitted semivariograms (semivariance against distance) for C/N ratios with uniform kernel functions and b = 1, 2. Nugget effect estimated separately from kernel parameters.
When the nugget effect is estimated simultaneously with the kernel parameter θ_u, we observe a phenomenon similar to that reported by Barry and Ver Hoef (1996). The simultaneous estimate of the nugget effect is smaller than the externally obtained estimate. The trade-off between the nugget effect and the sill is resolved in the optimization by decreasing the smoothness of the fit, at the nugget effect's expense. The estimate of the nugget effect for the semivariograms in Figure 4.13 is 0.039 for b = 1 (0.056 for b = 2) and the estimate of the kernel parameter increased to 0.335 for b = 1 (0.291 for b = 2).
Figure 4.13 Fitted semivariograms (semivariance against distance) for C/N ratios with uniform kernel functions and b = 1, 2. Nugget effect estimated simultaneously.
4. Use the estimated model γ(h; θ̂) in further calculations, for example, to solve a spatial prediction problem (Chapter 5).
In the frequency domain, similar steps are performed. Instead of the semi-
variogram, we work with the spectral density, however. The first step, then,
is to compute an empirical estimator of s(ω), the periodogram. Having se-
lected a theoretical model for the spatial dependency, the spectral density
model s(ω; θ), the parameter vector θ is estimated. Further calculations are then based on the estimated spectral density s(ω; θ̂). For example, one could construct the semivariances or covariances of the process from the estimated spectral density function in order to solve a prediction problem.
Recall that the covariance function and the spectral density function of
a second-order stationary stochastic process form a Fourier transform pair
(§2.5),
s(ω) = (2π)^{−d} ∫_{R^d} C(u) exp{−iω'u} du.
Because you can switch between the spectral density and the covariance function by means of an (inverse) Fourier transform, it is sometimes noted that the two approaches are "equivalent." This is not correct, in the sense that space-domain and frequency-domain methods for studying the second-order
space-domain and frequency-domain methods for studying the second-order
properties of a random field represent different aspects of the process. The
space-domain analysis expresses spatial dependence as a function of separa-
tion in spatial coordinates. It is a second-order method because it informs us
about the covariation of points in the process and its dependence on spatial
separation. Spectral methods do not study covariation of points but the man-
ner in which a function dissipates energy (or power, which is energy per unit
interval) at certain frequencies. The second step in the process, the selection of
an appropriate spectral density function, is thus arguably more difficult than
in the spatial domain, where the shape of the empirical semivariogram suggests
the model. By choosing from a sufficiently flexible family of processes—which
leads to a flexible family of spectral densities—this issue can be somewhat
defused. The Matérn class is particularly suited in this respect.
A further, important, difference between estimating the spectral density
function from the periodogram compared to estimating the semivariogram
from its empirical estimator, lies in the distributional properties. We discussed
in §4.5.1 that the appropriate least-squares methodology in fitting a semivariogram model is generalized least squares, because the "data" γ̂(h_i) are not
independent. The dependency stems from the spatial autocorrelation and the
sharing of data points. GLS estimation is nearly impractical, however, be-
cause of the difficulty of computing a dense variance matrix for the empirical
semivariogram values. Weighted or ordinary least squares are used instead.
The spectral approach has a considerable advantage in that the periodogram
values are—at least asymptotically—independent. A weighted least squares
approach based on the periodogram is more justifiable than a weighted least
squares approach based on the empirical semivariogram.
The spectral density and the covariance function are related to each other
through a Fourier transform. Considering that the asymptotic properties of
the empirical estimators are vastly different, it may come as a surprise that
the estimates are related in the same fashion as the process quantities; the
periodogram turns out to be the Fourier transform of the sample covariance
function.
In what follows we focus on the case where the domain D is discrete and
Z is real-valued. Specifically, we assume that the data are observed on a
rectangular r×c row-column lattice. Letting u and v denote a row and column
position, respectively, Z(u, v) represents the attribute in row u and column v.
The covariance function can then be expressed as Cov[Z(u, v), Z(u + j, v + k)] = C(j, k), and the integral in the spectral density function can be replaced by a double summation:

s(ω_1, ω_2) = (2π)^{−2} Σ_{j=−∞}^{∞} Σ_{k=−∞}^{∞} C(j, k) exp{−i(ω_1 j + ω_2 k)}
            = (2π)^{−2} Σ_{j=−∞}^{∞} Σ_{k=−∞}^{∞} C(j, k) cos{ω_1 j + ω_2 k},    (4.56)

for −π < ω_1 < π, −π < ω_2 < π.
where ⌊·⌋ is the greatest integer (floor) function. These frequencies, which are multiples of 2π/r and 2π/c, are known as the Fourier frequencies. The connection between (4.57) and the spectral density as the Fourier transform of the covariance function (4.56) is not obvious in this formulation. We now establish this connection between the periodogram and the sample covariance function for the case of a one-dimensional process Z(1), Z(2), ..., Z(r). The operations are similar for the two-dimensional case, though the algebra is more tedious (see Chapter problems).
Using the Euler relation exp{ix} = cos(x) + i sin(x), the periodogram can be expressed in terms of trigonometric functions:

2πI(ω_j) = (1/r) Σ_{u=1}^r Z(u){cos(ω_j u) − i sin(ω_j u)} Σ_{u=1}^r Z(u){cos(ω_j u) + i sin(ω_j u)}
         = (1/r) [ Σ_{u=1}^r Σ_{p=1}^r Z(u)Z(p) cos(ω_j u) cos(ω_j p) + Σ_{u=1}^r Σ_{p=1}^r Z(u)Z(p) sin(ω_j u) sin(ω_j p) ].
At this point we make use of the fact that, by definition of the Fourier frequencies, Σ_u cos(ω_j u) = 0, and hence we can subtract any value from Z without altering the sums. For example,

Σ_{u=1}^r Z(u) cos(ω_j u) = Σ_{u=1}^r (Z(u) − Z̄) cos(ω_j u).
Using the further fact that cos(a)cos(b) + sin(a)sin(b) = cos(a − b), we arrive at

2rπI(ω_j) = Σ_{u=1}^r Σ_{p=1}^r (Z(u) − Z̄)(Z(p) − Z̄) cos(ω_j u) cos(ω_j p)
            + Σ_{u=1}^r Σ_{p=1}^r (Z(u) − Z̄)(Z(p) − Z̄) sin(ω_j u) sin(ω_j p)
          = Σ_{u=1}^r Σ_{p=1}^r (Z(u) − Z̄)(Z(p) − Z̄) cos(ω_j(u − p)).    (4.58)
Collecting in (4.58) the terms for which u − p = k and recognizing the sample covariance function Ĉ(k) leads to

2πI(ω_j) = Ĉ(0) + 2 Σ_{k=1}^{r−1} cos(ω_j k) Ĉ(k) = Σ_{k=−r+1}^{r−1} cos(ω_j k) Ĉ(k).
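The identity is easy to confirm numerically; the short Python check below (our sketch) compares 2πI(ω_j) with the cosine-weighted sum of the biased sample covariances Ĉ(k) for a white noise series.

    import numpy as np

    rng = np.random.default_rng(1)
    r = 64
    Z = rng.normal(size=r)
    Zbar = Z.mean()
    jfreq = 3
    wj = 2.0 * np.pi * jfreq / r                         # a Fourier frequency
    dft = np.sum(Z * np.exp(-1j * wj * np.arange(1, r + 1)))
    two_pi_I = np.abs(dft) ** 2 / r                      # 2*pi*I(w_j)
    Chat = np.array([np.sum((Z[:r - k] - Zbar) * (Z[k:] - Zbar)) / r
                     for k in range(r)])                 # biased C-hat(k)
    rhs = Chat[0] + 2.0 * np.sum(np.cos(wj * np.arange(1, r)) * Chat[1:])
    print(np.allclose(two_pi_I, rhs))                    # True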
Similar operations as in the previous paragraphs can be carried out for a two-dimensional lattice process. It can be established that for ω_1 ≠ 0 and ω_2 ≠ 0 the periodogram is the Fourier transform of the sample covariance function,

I(ω_1, ω_2) = (1/((2π)² rc)) | Σ_{u=1}^r Σ_{v=1}^c Z(u, v) exp{−i(ω_1 u + ω_2 v)} |²
            = (2π)^{−2} Σ_{j=−r+1}^{r−1} Σ_{k=−c+1}^{c−1} Ĉ(j, k) exp{−i(ω_1 j + ω_2 k)}
            = (2π)^{−2} Σ_{j=−r+1}^{r−1} Σ_{k=−c+1}^{c−1} Ĉ(j, k) cos{ω_1 j + ω_2 k},    (4.59)
4.7.1.3 Interpretation
Table 4.3 Lattice data on a 10 × 10 grid (rows by columns); these are the highly spatially correlated data of Figure 1.1d.

Row\Column   1     2     3     4     5     6     7     8     9     10
1 3.55 3.52 4.41 4.04 4.43 5.01 4.81 4.97 5.02 4.63
2 3.56 3.91 3.39 3.22 4.29 5.77 4.61 5.02 4.14 4.24
3 4.26 4.26 4.32 3.99 5.40 5.83 4.92 3.55 2.68 3.22
4 4.29 5.84 5.24 5.30 5.27 5.48 5.02 2.98 3.15 2.71
5 5.75 4.80 5.50 4.51 4.71 4.62 5.26 4.15 3.10 3.97
6 6.20 5.81 4.45 5.14 4.62 5.38 5.31 4.50 4.84 3.94
7 6.87 6.23 6.18 5.61 6.00 5.75 5.67 5.32 5.03 4.68
8 6.73 7.24 6.53 5.20 4.65 5.19 4.83 5.11 5.35 5.82
9 6.42 7.27 6.14 5.95 5.52 4.78 5.40 5.32 5.36 6.17
10 7.52 6.49 6.18 5.46 5.12 5.55 5.09 5.69 5.99 5.91
Figures 4.14 and 4.15 show the sample covariance functions (4.60) for the spatially uncorrelated data in Figure 1.1a and the highly correlated data in Figure 1.1d. The covariances are close to zero everywhere, except for (j, k) = (0, 0), where the sample covariance function estimates the variance of the process. In the correlated case, covariances are substantial for small j and k. Notice the evenness of the sample covariance function, Ĉ(j, k) = Ĉ(−j, −k), and the absence of reflection symmetry, Ĉ(j, k) ≠ Ĉ(−j, k). The periodograms are displayed in Figures 4.16 and 4.17. For spatially uncorrelated data, the periodogram is more or less evenly distributed. High and low ordinates occur for large and small frequencies (note that I(0, 0) = 0). Strong positive spatial association among the responses is reflected in large periodogram ordinates for small frequencies.
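On a complete lattice the periodogram ordinates at all Fourier frequencies can be obtained in one stroke with the fast Fourier transform; a minimal sketch (ours) follows.

    import numpy as np

    def periodogram2d(Z):
        """Periodogram ordinates I(w1, w2) for an r x c lattice at the
        Fourier frequencies (2*pi*p/r, 2*pi*q/c), via the FFT."""
        r, c = Z.shape
        I = np.abs(np.fft.fft2(Z)) ** 2 / ((2.0 * np.pi) ** 2 * r * c)
        I[0, 0] = 0.0    # the (0,0) ordinate reflects only the sample mean
        return I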
4.7.1.4 Properties
The conditions under which these asymptotic results hold are detailed in
Pagano (1971). Specifically, it is required that the random field is second-
order stationary with finite variance and has a spectral density. Then we
can write Z(u, v) as the discrete convolution of a white noise excitation field {X(s, t) : (s, t) ∈ I × I},

Z(u, v) = Σ_{j=−∞}^{∞} Σ_{k=−∞}^{∞} a(j, k) X(u − j, v − k),

for a sequence of constants a(·, ·). It is important to note that it is not required that Z(u, v) is a Gaussian random field (as is sometimes stated). Instead, it is only necessary that the excitation field X(s, t) satisfies a central limit condition,

(rc)^{−1/2} Σ_{s=1}^{r} Σ_{t=1}^{c} X(s, t) → G(0, 1).
The asymptotic results are very appealing. For a finite sample size, however, the periodogram is a biased estimator of the spectral density function.

Figure 4.16 Periodogram I(ω_1, ω_2) for the spatially uncorrelated data of Figure 1.1a.

Fuentes (2001) gives the following expression for the expected value of the periodogram at ω in R²:

E[I(ω)] = (1/(rc(2π)²)) ∫ s(α) W(α − ω) dα,    (4.61)

with Fejér's kernel

W(α) = [sin²(rα_1/2)/sin²(α_1/2)] [sin²(cα_2/2)/sin²(α_2/2)],
and α is a non-zero Fourier frequency. The bias comes about because the function W has subsidiary peaks (sidelobes), large values away from ω. This allows contributions to the integral from parts of s(·) far away from ω; the phenomenon is termed leakage. The function W(·) operates as a kernel function in equation (4.61); it is known as Fejér's kernel. Leakage occurs when the kernel has substantive sidelobes. The kernel then transfers power from other regions of the spectral density to ω. The bias can be substantial when the process has high dynamic range, a quantity defined by Percival and Walden (1993, p. 201) as

10 log_10 ( max{s(ω)} / min{s(ω)} ).    (4.62)

The bias of I(ω) in processes with high dynamic range particularly affects those frequencies where the spectral density is small.
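Evaluating (4.62) on a grid of frequencies is straightforward; the sketch below (ours) uses the exponential spectral density from Table 4.4 purely as an example.

    import numpy as np

    def dynamic_range_db(s_vals):
        """Dynamic range (4.62) in decibels: 10*log10(max(s)/min(s))."""
        return 10.0 * np.log10(np.max(s_vals) / np.min(s_vals))

    sigma2, alpha = 1.0, 2.0
    w = np.linspace(-np.pi, np.pi, 1024)
    s = sigma2 * alpha / (np.pi * (1.0 + w ** 2 * alpha ** 2))
    print(dynamic_range_db(s))   # increases with alpha for this model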
The two common methods to combat leakage in periodogram estimation are tapering and pre-whitening of the data.

Figure 4.17 Periodogram I(ω_1, ω_2) for the highly spatially correlated data of Figure 1.1d and Table 4.3.

We mention these techniques only in passing; they are not without controversy. The interested reader
is referred to the monograph by Percival and Walden (1993) for the theory
as well as the various arguments for and against tapering and pre-whitening.
Data tapering replaces Z(s) with h(s)Z(s), where the function h(s) is termed
a taper function or data taper. The tapering controversy is ignited by the
fact that the operation in the spatial domain, creating the product h(s)Z(s), is a weighting of the observations by h(s). Data tapers typically give smaller weights to observations near the boundary of the domain, so tapering can be viewed as a method of adjusting for edge effects. The weighting is not one that gives more weight to observations with small variance, as is customary in statistics. Negative sentiments range from "losing information" and "throwing away data" to likening tapering to tampering. The real effect of tapering is best seen in the frequency domain. Its upshot is to replace the Fejér kernel in (4.61) with a kernel function that has smaller sidelobes, hence reducing leakage. Tapering applies weights to the data in order to change the resulting kernel function in the frequency domain.
Pre-whitening is a filtering technique where the data are processed with a
linear filter. The underlying idea is that (i) the spectral densities of the original
and the filtered data are related (see §2.5.6), and that (ii) the filtered process
has a smaller dynamic range. The spectral density of the transformed process
(the filter output) is the product of the spectral density of the filter input and
the filter transfer function H(ω), see equation (2.39) on page 76. Then, the
periodogram is estimated from the filtered data and this is used to construct
the periodogram of the original process. Ideally, filtering would create a white
noise process, since it has the smallest dynamic range (zero). But in order to
do this we need to know either the variance matrix Var[Z(s)] = Σ, or the
spectral density function of the process. This is the very parameter we are
trying to estimate.
We now give the spectral densities that correspond to the second-order sta-
tionary and isotropic models discussed previously. For some models, e.g., the
Matérn class, several parameterizations are presented and their advantages
and disadvantages are briefly discussed.
Since processes in R¹ are necessarily isotropic, the spectral density for a process with continuous domain can be constructed via

s(ω) = (1/2π) ∫_{−∞}^{∞} cos(ωh) C(h) dh.

In R^d, the "brute-force" method to derive the spectral density, when the covariance function is isotropic, is to compute

s(ω) = (2π)^{−d} ∫_{R^d} cos(ω'h) C(||h||) dh.
Table 4.4 gives expressions for the spectral densities in R1 for some common
covariance models in different parameterizations.
Table 4.4 Covariance and spectral density functions for second-order stationary pro-
cesses in R1 . The parameter α is a function of the range and σ 2 is the variance of
the process.
Model           Covariance C(h)                              Spectral density s(ω)

Exponential     σ² exp{−h/α}                                 σ²α / (π(1 + ω²α²))
Exponential     σ² exp{−3h/α}                                3σ²α / (π(9 + ω²α²))
"G"aussian      σ² exp{−h²/α²}                               (ασ²/(2√π)) exp{−(αω)²/4}
"G"aussian      σ² exp{−3h²/α²}                              (σ²α/(2√(3π))) exp{−(αω)²/12}
Spherical       σ²{1 − 3h/(2α) + h³/(2α³)} I(h ≤ α)          (3σ²α/(2π)) [(1 − cos(ωα))² + (ωα − sin(ωα))²] / (ωα)⁴
Matérn class    (σ²/Γ(ν)) (αh/2)^ν 2K_ν(αh)                  σ² α^{2ν} Γ(ν + ½)/(Γ(ν)Γ(½)) (α² + ω²)^{−(ν+½)}
Matérn class    (π^{1/2}φ/(Γ(ν + ½)α^{2ν})) (αh/2)^ν 2K_ν(αh)    φ(α² + ω²)^{−(ν+½)}
Matérn class    (σ²/Γ(ν)) (h√ν/ρ)^ν 2K_ν(2h√ν/ρ)             σ² g(ρ, ν) [1 + (ρω/(2√ν))²]^{−(ν+½)}
Notice the functional similarity of s(ω) for the "Gaussian" models and the Gaussian probability density function from which these models derive their name.
The second parameterization of the Matérn class in Table 4.4 is given by Stein (1999, p. 31). In R¹, it is related to the first parameterization through

φ = σ² α^{2ν} Γ(ν + ½) / (Γ(ν)Γ(½)).
In R^d, the spectral density of the third parameterization takes the same form with

g(ρ, ν, d) = ρ^d Γ(ν + d/2)(4ν)^ν / (Γ(ν) π^{d/2}).

The function g(ρ, ν) in Table 4.4 is g(ρ, ν, 1).
Among the advantages of spectral methods for spatial data is that the process is only required to be second-order stationary; it does not have to be isotropic (§2.5.7). This is particularly important for point pattern analysis, because the
tool most commonly used for second-order analysis in the spatial domain,
the K-function, requires stationarity and isotropy. The previous discussion of
periodogram analysis focused on the case of equally spaced, gridded data. This
is not a requirement of spectral analysis but offers computational advantages
in making available the Fast Fourier Transform (FFT). Spatial locations in a
point pattern are irregularly spaced by the very nature of the process and Z is
Bartlett (1964) then defined the complete covariance density function as

lim_{|ds_i|,|ds_j|→0} Cov[N(ds_i), N(ds_j)] / (|ds_i||ds_j|) = C(s_i − s_j) + λδ(s_i − s_j),    (4.66)

where C(s_i − s_j) = λ_2(s_i − s_j) − λ² is the autocovariance density function (§3.3.4) and δ(u) is the Dirac delta function,

δ(u) = ∞ if u = 0, and 0 if u ≠ 0.

Substituting into (4.66) and using (4.65) yields

lim_{|ds_i|,|ds_j|→0} Cov[N(ds_i), N(ds_j)] / (|ds_i||ds_j|) = C(s_i − s_j) for s_i ≠ s_j, and lim_{|ds_i|→0} λ/|ds_i| − λ² for s_i = s_j.

The Dirac delta function is involved in (4.66) because the variance of N(ds)/|ds| should go to infinity in the limit as |ds| shrinks.
We now can take the Fourier transform of (4.66) to obtain the spectral density function

s(ω) = (2π)^{−2} ∫_u e^{−iω'u} {C(u) + λδ(u)} du
     = (2π)^{−2} { λ + ∫_u e^{−iω'u} C(u) du }.
where J̄(ω_pq) is the complex conjugate of J(ω_pq), and {ω} = {[2πp, 2πq]'} for p = 0, 1, 2, ... and q = 0, ±1, ±2, .... The matrix L is diagonal with entries L_x and L_y such that |D| = L_x L_y. It is thus assumed that the bounding shape of the point process is a rectangle. The term L⁻¹s scales the process to the unit square and the intensity λ is estimated by the number of events n. The periodogram is then given by

I(ω) = J(ω)J̄(ω) = (2π)^{−2} Σ_{j=1}^n Σ_{k=1}^n exp{−iω'L⁻¹(s_j − s_k)}.    (4.68)
The numbers nρ and nθ denote the number of ordinates for which τ falls
within a specified tolerance. The result is that ordinates are averaged in arcs
around the origin in the R-spectrum and in rays emanating from the origin
in the Θ-spectrum (Figure 4.18).
The R-spectrum gives insight about clustering or regularity of events, the
Θ-spectrum about the isotropy of the process. The R-spectrum is interpreted
along the same lines as the K-function. If S_R(ρ) takes on large values for small ρ, the process is clustered; small values of S_R(ρ) for small ρ imply regularity.
Figure 4.18 Averaging of periodogram ordinates over the (p, q) frequency plane: arcs around the origin yield the R-spectrum, rays emanating from the origin yield the Θ-spectrum.
Furthermore, the CSR process has spectral density s(ω) = λ/(2π)2 . Com-
bining this result with (4.69) and (4.71) enables us to derive test statistics
for the CSR hypothesis for a test without simulations. In addition, we can
make use of the polar spectra and the fact that the spectral analysis does not
require isotropy of the process, to develop a test for isotropy (and other forms
of dependence symmetry).
First, asymptotically, any sum of periodogram ordinates is a sum of independent scaled Chi-square random variables. For example, under the CSR hypothesis,

(8π²/λ) Σ_{p,q≠0} I(ω_p, ω_q) ∼ χ²_{2m},

where m denotes the number of ordinates in the sum.
branching estuaries or large barriers prohibit this approach. For more com-
plex applications such as these, Kern and Higdon (2000) define an algorithm
to compute polygonal distance that compensates for irregularities in the
spatial domain. Krivoruchko and Gribov (2004) solve similar problems using
cost weighted distance, a common raster function in GIS. Unfortunately, as we illustrate below, not all isotropic covariance functions and semivariogram models remain valid when based on non-Euclidean distances.
Mardia, Kent, and Bibby (1979, Ch. 14) distinguish the classical solution—
also called a metric solution—from the non-metric solution that is based on
ranks of distances and iterative optimization. A metric solution determines the
point configuration directly and can serve as the solution of the MDS prob-
lem or as the starting configuration of a non-metric technique. The classical
solution consists of choosing as point configuration the k (scaled) eigenvectors
that correspond to the k largest (positive) eigenvalues of the matrix
B = −(1/2) (I − J/n)' D^{[2]} (I − J/n).

The notation A^{[p]} stands for a matrix whose elements are a_ij^p. The eigenvectors are scaled so that if s_i* is the ith eigenvector of B and λ_i is the ith (positive) eigenvalue, then s_i*'s_i* = λ_i.
Denote the solution so obtained as S and let d̂_ij = ||s_i − s_j||. The discrepancy between D and the fit can be measured by

ψ = trace{(B − B̂)²}.

The classical solution minimizes this trace among all configurations that have distance matrix D for a given value of k. If λ = [λ_(1), ..., λ_(n)]' denotes the vector of ordered eigenvalues of B, then

r = Σ_{i=1}^k λ_(i) / Σ_{i=1}^n |λ_i|

is a measure of the agreement between the metric solution and the distance matrix D.
Example 4.6 Classical MDS. Assume the following six point locations in R²: s_1 = [0, 65], s_2 = [115, 50], s_3 = [225, 120], s_4 = [175, 65], s_5 = [115, 132.5], s_6 = [30, 105]. The matrix of Euclidean distances is

D_1 = [   0.0  115.9  231.6  175.0  133.3   50.0
        115.9    0.0  130.4   61.8   82.5  101.2
        231.6  130.4    0.0   74.3  110.7  195.6
        175.0   61.8   74.3    0.0   90.3  150.4
        133.3   82.5  110.7   90.3    0.0   89.3
         50.0  101.2  195.6  150.4   89.3    0.0 ]
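The classical solution is a few lines of linear algebra; the Python sketch below (ours) carries out the construction of B and the scaled eigenvectors, and applied to D_1 with k = 2 it recovers the six locations up to rotation, reflection, and translation.

    import numpy as np

    def classical_mds(D, k=2):
        """Classical (metric) MDS: scaled eigenvectors of
        B = -0.5 * (I - J/n) D^[2] (I - J/n)."""
        n = D.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
        B = -0.5 * H @ (D ** 2) @ H
        vals, vecs = np.linalg.eigh(B)               # ascending order
        top = np.argsort(vals)[::-1][:k]             # k largest eigenvalues
        return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))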
The Bessel function of the first kind of order ν is defined by the series

J_ν(t) = (t/2)^ν Σ_{i=0}^∞ (−t²/4)^i / (i! Γ(ν + i + 1)).    (4.72)

We use the notation J_n(t) if the Bessel function has integer order. A special case is J_0(t), the Bessel function of the first kind of order 0. It appears as the basis function in spectral representations of isotropic covariance functions in R² (§4.3.1). Bessel functions of the first kind of integer order satisfy (Abramowitz and Stegun, 1964)

J_{n+1}(t) = (2n/t) J_n(t) − J_{n−1}(t)
J'_n(t) = (1/2)(J_{n−1}(t) − J_{n+1}(t)) = J_{n−1}(t) − (n/t) J_n(t)
J_{−n}(t) = (−1)^n J_n(t).
There are two types of modified Bessel functions. Of particular importance for spatial modeling are the modified Bessel functions of the second kind K_ν(t) of (real) order ν. They appear as components of the Matérn class of covariance functions for second-order stationary processes (see §4.3.2):

K_ν(t) = (π/2) (I_{−ν}(t) − I_ν(t)) / sin(νπ).    (4.73)
The function I_ν(t) in (4.73) is the modified Bessel function of the first kind, defined by

I_ν(t) = (t/2)^ν Σ_{i=0}^∞ (t²/4)^i / (i! Γ(ν + i + 1)).
Since computation of these functions can be numerically expensive, approximations can be used for t → 0:

K_0(t) ≈ −ln{t};    K_ν(t) ≈ (Γ(ν)/2) (t/2)^{−ν} for ν > 0.
Other important results regarding modified Bessel functions (Abramowitz and Stegun, 1964; Whittaker and Watson, 1927) are (n denoting integer and ν denoting real order)

K_ν(t) = K_{−ν}(t)
K_{n+1}(t) = K_{n−1}(t) + (2n/t) K_n(t)
K'_n(t) = (n/t) K_n(t) − K_{n+1}(t)  ⇒  K'_0(t) = −K_1(t)
K_ν(t) = (Γ(ν + ½)(2t)^ν / √π) ∫_0^∞ (u² + t²)^{−ν−1/2} cos{u} du
I_ν(0) = 1 if ν = 0, and 0 if ν > 0
I_n(t) = (1/π) ∫_0^π e^{t cos θ} cos{nθ} dθ
I_n(t) = I_{−n}(t)
I_{1/2}(t) = √(2/(πt)) sinh{t},   I_{−1/2}(t) = √(2/(πt)) cosh{t}
I'_n(t) = (n/t) I_n(t) + I_{n+1}(t)  ⇒  I'_0(t) = I_1(t).
Some of these properties have been used in §4.3.2 to establish that the Matérn model for ν = 1/2 reduces to the exponential covariance function.
A Fortran program (rkbesl) to calculate Kn+α (t) for non-negative t and
non-negative order n + α is distributed as part of the SPECFUN package (Cody,
1987). It is available at www.netlib.org.
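In Python, scipy.special.kv provides K_ν directly, so the Matérn covariance of Table 4.4 (first parameterization) can be coded in a few lines; the function name and vectorization below are our choices.

    import numpy as np
    from scipy.special import kv, gamma    # K_nu and the gamma function

    def matern_cov(h, sigma2, alpha, nu):
        """Matern covariance, first parameterization of Table 4.4:
        C(h) = sigma^2 (alpha*h/2)^nu * 2*K_nu(alpha*h) / Gamma(nu)."""
        h = np.atleast_1d(np.asarray(h, float))
        C = np.full(h.shape, sigma2)                     # C(0) = sigma^2
        pos = h > 0
        t = alpha * h[pos]
        C[pos] = sigma2 * (t / 2.0) ** nu * 2.0 * kv(nu, t) / gamma(nu)
        return C

    # matern_cov(h, 1.0, 3.0, 0.5) agrees with exp(-3*h), the exponential model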
Problem 4.2 (Schabenberger and Pierce, 2002, pp. 431–433) Let Yi = β +ei ,
(i = 1, · · · , n), where ei ∼ iid G(0, σ 2 ).
(i) Find the maximum likelihood estimators of β and σ 2 .
(ii) Define a random vector

U_{(n−1)×1} = [Y_1 − Ȳ, Y_2 − Ȳ, ..., Y_{n−1} − Ȳ]'

and find the maximum likelihood estimator of σ² based on U. Is it possible
to estimate β from the likelihood of U?
(iii) Show that the estimator of σ 2 found in (ii) is the restricted maximum
likelihood estimator for σ 2 based on the random vector Y = [Y1 , · · · , Yn ]! .
Problem 4.5 Derive the composite likelihood score equation (4.44) under
the assumption that Z(si ) − Z(sj ) are zero mean Gaussian random variables.
Problem 4.7 Establish the connection between the periodogram (4.57) and
the covariance function (4.60) for data on a rectangular r × c lattice. That is,
show that the periodogram is the Fourier transform of the sample covariance
function.
Problem 4.8 Determine the dynamic range, see equation (4.62), for some of the spectral densities shown in Table 4.4. Which models are susceptible to bias in periodogram estimation due to leakage? What is the dynamic range of a white noise process?
Problem 4.9 Show that a homogeneous Poisson process has complete covariance density function λδ(s_i − s_j) and spectral density s(ω) = (2π)^{−2} λ.
Example 5.1 A public shooting range has been in operation for seven years
in a national forest, operated by the U.S. Forest Service. The lead concentra-
tion on the range is observed at sampling locations s1 , · · · , sn by collecting the
soil on a 50 × 50 cm square at each location, sieving the non-lead materials
and weighing the lead content. The investigators are interested in determining
the lead concentration at all locations on the shooting range. Estimating the
mean lead concentration at location s_0 appeals to a universe of similar shooting ranges. There may be no other shooting ranges like the one under consideration. What matters to the investigators is not how much lead there is at location s_0 on average across many other (conceptual) shooting ranges. What matters is to determine the amount of lead on the shooting range that was sampled.
In terms of a Gaussian spatial random field, where Z(s_0) and Z(s) are jointly multivariate Gaussian with E[Z(s_0)] = µ(s_0), Cov[Z(s), Z(s_0)] = σ, Var[Z(s)] = Σ, and E[Z(s)] = µ(s), (5.2) becomes the conditional mean in a GRF,

E[Z(s_0) | Z(s)] = µ(s_0) + σ'Σ⁻¹(Z(s) − µ(s)).    (5.3)

This conditional expectation is linear in the observed data and establishing its statistical properties is comparatively straightforward.
Under squared-error loss the conditional mean is the best predictor and the mean-squared prediction error can be written as

E[(Z(s_0) − p_0(Z; s_0))²] = Var[Z(s_0)] − Var[p_0(Z; s_0)].    (5.4)

In the case of the Gaussian random field, where p_0(Z; s_0) is given by (5.3), we obtain the conditional variance in a GRF,

E[(Z(s_0) − p_0(Z; s_0))²] = Var[Z(s_0)] − σ'Σ⁻¹σ.    (5.5)
The result (5.4) is rather stunning. The term on the left-hand side must be
positive and variances are typically not subtracted. Somehow, the variation
of the best predictor under squared-error loss must be guaranteed to be less
than the variation of the random field itself (establishing (5.4) is a Chapter
problem). More importantly, this relationship conveys the behavior one should
expect from a predictor that performs well, that is, a predictor with small
mean-squared prediction error. It is a predictor that varies a lot. At first, this
seems contrary to the results learned in classical statistics where one searches
for those estimators of unknown quantities that have small mean square error.
In the search for UMVU estimators, this means finding the estimator with
least dispersion. There is a heuristic explanation for the fact that variable
predictors will perform well in this situation, however. Figure 5.1 shows a realization Z(t) of a temporal process. In order to predict the value of the series at time t = 20, three prediction functions are shown: the sample mean Z̄ and two kernel smoothers. As the smoothness of the prediction function decreases, the variability of the predictor increases. The prediction function
that will be close on average to the value of the series is one that is allowed to
vary a lot. Based on these arguments one would expect the “best” predictor
to follow the data even more closely than the prediction functions in Figure
5.1. One would expect the best predictor to interpolate the time series at the
observed points. Kriging predictors in mean square continuous random fields have this property; they honor the data.
Figure 5.1 Realization of a time series Z(t). The goal is to predict the value of the series at t = 20. Three possible predictors are shown: the sample mean (horizontal line) and kernel estimators with different bandwidths. Adapted from Schabenberger and Pierce (2002).
The "only" difference seems to be the additional "1 +" under the square root, and a common explanation for the distinction is that in one case we consider Var[ŷ_0] and for the prediction interval we consider the variance of the difference, Var[ŷ_0 − y_0]. We can examine the distinction now in terms of the problem of finding the best predictor under squared-error loss.
First, we need to settle the issue whether the new observation is depen-
dent on the data Y = [Y1 , · · · , Yn ]! . Since the fitted model assumes that
the observed data are uncorrelated, there is no need to assume that a de-
pendency would exist with any new observation; hence, Cov[Yi , Y0 ] = 0, ∀i.
Under squared error loss, the best predictor of Y0 is E[Y0 |Y]; in our case the
conditional expectation is equal to the unconditional expectation because of
the independence. So, E[Y0 ] = α + βx0 is the best predictor. Since α and β
are unknown, we turn to the Gauss-Markov theorem, which instructs us that
ŷ_0 = α̂ + β̂x_0 is the best linear unbiased predictor, where α̂ and β̂ are the ordinary least squares estimators.
We have now arrived at the familiar result that the best predictor of the random quantity Y_0 and the best estimator of the fixed quantity E[Y_0] are the same. But are the mean-squared prediction errors also the same? In order to answer this question based on what we know up to now, we cannot draw on equation (5.4), because ŷ_0 is not the conditional expectation. Instead, we
draw on the following first principle: the mean-squared error MSE[U; f(Y)] for estimating (predicting) U based on some function f(Y) is

MSE[U; f(Y)] = Var[U − f(Y)] = Var[U] + Var[f(Y)] − 2Cov[U, f(Y)],

provided E[U] = E[f(Y)].
In the simple linear regression example, we can apply this as follows, taking note that

Var[Ŷ_0] = σ² (1/n + (x_0 − x̄)²/S_xx).
Now, consider both the estimation problem and the prediction problem in this context. In the estimation problem, the target U = E[Y_0] = α + βx_0 is a constant, so that Var[U] = 0 and Cov[U, f(Y)] = 0. As a consequence,

MSE[U; f(Y)] = Var[U − f(Y)] = Var[f(Y)] = σ² (1/n + (x_0 − x̄)²/S_xx).
Under squared-error loss one should utilize the conditional mean function for predictions. Not only does p_0(Z; s_0) minimize the Bayes risk

E[L(Z(s_0), p(Z; s_0))],
it is also “unbiased” in the sense that E[p0 (Z; s0 )] = E[Z(s0 )]. This fol-
lows directly from the law of iterated expectations; E[E[Y |X]] = E[Y ]. If
{ Z(s) : s ∈ D ⊂ Rd } is a Gaussian random field, p0 (Z; s0 ) is also linear in
Z(s), the observed data. In general, however, p_0(Z; s_0) is not a linear function of the data, and establishing the statistical properties of the best predictor under squared-error loss can be difficult, even intractable. Thus, in statisti-
cal practice the search for good estimators is restricted to particular classes
of estimators. The properties of linearity in the observed data and unbiased-
ness are commonly imposed because of mathematical tractability and the
mistaken impression that unbiasedness is an intrinsically “good” feature. Not
surprisingly, similar constraints are imposed on prediction functions. In the
Gaussian case no additional restrictions are called for, since p0 (Z; s0 ) already
is linear and unbiased. For the general case, this new consideration of what
constitutes a best predictor leads to what are called Best Linear Unbiased
Predictors (BLUPs).
Consider random variables X and Y with joint density function f(x, y) (or mass function p(x, y)). We are given the value of X and wish to predict Y from it, subject to the condition that the predictor p(X) satisfies E[p(X)] = E[Y] ≡ µ_y and subject to a linearity condition, p(X) = α + βX. Under squared-error loss this amounts to finding the function p(X) that minimizes

MSE[p(X); Y] = E[(Y − p(X))²] subject to p(X) = α + βX, α = µ_y − βµ_x.

The solutions to this minimization problem are (Chapter problem 5.4)

α = µ_y − (σ_xy/σ_x²) µ_x,   β = σ_xy/σ_x²,

p(X) = BLUP(Y|X) = µ_y − (σ_xy/σ_x²)(µ_x − X),    (5.6)

where σ_xy = Cov[X, Y] and σ_x² = Var[X].
In practice one substitutes sample estimates for the unknown quantities in (5.6). Based on the discussion that follows, you will be able to show
that (5.7) is indeed the best linear unbiased predictor of Y |x if the means,
variance of X, and covariance are unknown. By the Gauss-Markov theorem we
know that (5.7) is also the best linear unbiased estimator (BLUE) of E[Y |x].
The differences in their mean-squared prediction errors were established pre-
viously.
to the conditional mean in a Gaussian random field, (5.3). It is thus the best
predictor (linear or not) under squared-error loss if Z(s) is a GRF. If the
joint distribution of the data is not Gaussian, then (5.10) is the best predictor
among those that are linear in Z(s). It is also unbiased, but since we did not
impose an unbiasedness constraint as part of the minimization, the simple
kriging predictor is best in the class of all linear predictors. Kriging is often
referred to as optimal spatial prediction, but optimality considerations are
confined to this sub-class of predictors unless the random field is Gaussian.
Substitution of λ' = σ'Σ⁻¹ into the expression for the mean-squared error in (5.9) yields the minimized mean-squared prediction error, also called the (simple) kriging variance,

σ²_sk(s_0) = σ² − σ'Σ⁻¹σ,    (5.11)

which agrees with Var[Z(s_0) | Z(s)] in (5.5) in the Gaussian case. The simple kriging variance depends on the prediction location through the vector σ of covariances between Z(s_0) and the observed data.
It was noted in §5.1, on heuristic grounds, that a good predictor should be variable in the sense that it follows the observed data closely. The simple kriging predictor has an interesting property that it shares with many other types of kriging predictors. Consider predicting at locations where data are actually observed. Thus, the predictor psk(Z; s0) becomes psk(Z; [s1, · · · , sn]′), and in (5.10) we replace Cov[Z(s0), Z(s)] = σ′ with Cov[Z(s), Z(s)] = Σ and µ(s0) with µ(s) to obtain

    psk(Z; [s1, · · · , sn]′) = µ(s) + ΣΣ⁻¹(Z(s) − µ(s)) = Z(s).
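This exactness is easy to check numerically. The sketch below uses a hypothetical exponential covariance and three made-up observations; predicting at an observed location recovers the observed value:

```python
import numpy as np

def exp_cov(h, sill=1.0, prange=10.0):
    # hypothetical exponential covariance C(h) = sill * exp(-3h / practical range)
    return sill * np.exp(-3.0 * h / prange)

s = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 8.0]])   # made-up locations
z = np.array([5.0, 10.0, 15.0])                       # made-up data
mu = 9.0                                              # known mean (simple kriging)

D = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
Sigma = exp_cov(D)

s0 = s[1]                                             # predict at an observed site
sigma0 = exp_cov(np.linalg.norm(s - s0, axis=1))

p_sk = mu + sigma0 @ np.linalg.solve(Sigma, z - mu)
print(p_sk)                                           # 10.0: honors the data
```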
Thus, the simple kriging predictor interpolates the observed data. It is an
“exact” interpolator or said to “honor the data.” Historically, many disci-
plines have considered this to be a very desirable property; one that should
be asked of any spatial predictor. However, in some situations, smoothing, as
is typically done in most regression situations, may be more appealing. For
example, when the semivariogram of the spatial process contains a nugget ef-
fect, it is not necessarily desirable to interpolate the data. If the nugget effect
consists of micro-scale variability only, then a structured portion of the spatial
variability has not been observed and honoring the observed data is reason-
able. If the nugget effect contains a measurement error component, that is,
Z(s) = S(s) + ε(s), where ε(s) is the measurement error at location s, then we
do not want the predictor to interpolate the data. We are then not interested
in the amount that has been erroneously measured, but the amount S(s) that
is actually there. The predictor should be an interpolator of the signal S(s),
not the observed amount Z(s). In §5.4.3 these issues regarding kriging with
and without a measurement error will be revisited in more detail.
The term simple kriging is unfortunate on another ground. There is nothing
simple or common about the requirement that the mean µ(s) of the random
field be known. An exception is best linear unbiased prediction of residuals
from a regression fit. If Z(s) = X(s)β + ε(s), then the vector of ordinary least squares (OLS) residuals

    ε̂(s) = Z(s) − X(s)(X(s)′X(s))⁻¹X(s)′Z(s)        (5.12)

has (known) mean 0, provided that the mean model X(s)β has been specified correctly.∗
fied correctly.∗ One application of simple kriging is thus the following method
intended to cope with the problem of non-stationarity that arises from large-
scale trends in the mean of Z(s). It is sometimes incorrectly labelled as “Universal Kriging,” which it is not (see §5.3.3).

1. Specify a linear spatial model Z(s) = X(s)β + e(s).
2. Fit the model by OLS to obtain β̂_ols = (X(s)′X(s))⁻¹X(s)′Z(s).
3. Perform simple kriging on the OLS residuals (5.12) to obtain psk(ε̂; ε̂(s0)).
4. Obtain the kriging predictor of Z(s0) as x(s0)′β̂_ols + psk(ε̂; ε̂(s0)).
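A compact sketch of these four steps, assuming the error covariance matrix Σ and the covariance vector between the data errors and the error at s0 are known (all inputs hypothetical):

```python
import numpy as np

def residual_kriging(X, Z, Sigma, x0, sigma0):
    """Steps 1-4 above. X: n x p regressors; Z: data vector;
    Sigma: n x n error covariance; x0: regressors at s0;
    sigma0: covariances between the errors at the data sites and at s0."""
    beta_ols, *_ = np.linalg.lstsq(X, Z, rcond=None)   # step 2: OLS fit
    resid = Z - X @ beta_ols                           # OLS residuals (5.12)
    p_sk = sigma0 @ np.linalg.solve(Sigma, resid)      # step 3: SK with known mean 0
    return x0 @ beta_ols + p_sk                        # step 4: recombine
```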
The simple kriging predictor is used when the mean µ(s) in model (5.8) is
known. With this model, the mean can change with spatial location. If E[Z(s)]
is unknown but constant across locations, E[Z(s)] ≡ µ1, best linear unbiased
prediction under squared-error loss is known as ordinary kriging.
We need to find the predictor p(Z; s0 ) of Z(s0 ) that minimizes E[(p(Z; s0 )
−Z(s0 ))2 ], when the data follow the model
Z(s) = µ1 + e(s), e(s) ∼ (0, Σ).
Thus, E[Z(s)] = µ1 and Var[Z(s)] = Σ, where µ is an unknown constant and
Σ is known.
As in the development of the simple kriging predictor, we consider linear
predictors of the form p(Z; s0) = λ0 + λ′Z(s), where λ0 and the elements of the vector λ = [λ1, · · · , λn]′ are unknown coefficients to be determined. Repeating the development in §5.2.1 gives λ0 = µ − λ′1µ. However, this does not determine the value of λ0 since µ is unknown. When the mean is unknown, there is no best linear predictor in the class of all linear predictors. Thus, we refine the problem by further restricting the class of linear predictors to those that are also unbiased. Since the mean of Z(s) does not depend on s, it is reasonable to posit also that E[Z(s0)] = µ. Then we require for unbiasedness that E[p(Z; s0)] = E[Z(s0)] or, equivalently, E[λ0 + λ′Z(s)] = E[Z(s0)], which implies that λ0 + µ(λ′1 − 1) = 0. Since this must hold for every µ, it must hold for µ = 0 and so the unbiasedness constraint requires that λ0 = 0 and λ′1 = 1.
Now our problem is to choose weights λ = [λ1, · · · , λn]′ that minimize

    E[(λ′Z(s) − Z(s0))²]   subject to   λ′1 = 1.

This can be accomplished as an unconstrained minimization problem by introducing the Lagrange multiplier m:

    arg min_{λ,m} Q = arg min_{λ,m} E[(λ′Z(s) − Z(s0))²] − 2m(λ′1 − 1).        (5.13)
The factor 2 in front of the Lagrange multiplier was chosen to allow cancella-
tion. It is left as an exercise (Chapter problem 5.8) to show that the solutions
to this problem are (Cressie, 1993, p. 123)

    λ′ = (σ + 1 [(1 − 1′Σ⁻¹σ)/(1′Σ⁻¹1)])′ Σ⁻¹        (5.14)

    m = (1 − 1′Σ⁻¹σ)/(1′Σ⁻¹1),        (5.15)

and that the minimized mean-squared prediction error, the ordinary kriging variance, is (Cressie, 1993, p. 123)

    σ²ok(s0) = C(0) − λ′σ + m
             = C(0) − σ′Σ⁻¹σ + (1 − 1′Σ⁻¹σ)²/(1′Σ⁻¹1).        (5.16)
Notice that σ²sk < σ²ok, since the last term on the right-hand side of (5.16) is positive. Not knowing the mean of the random field increases the mean-squared prediction error. The expression pok(Z; s0) = λ′Z(s), where λ is given
squared prediction error. The expression pok (Z; s0 ) = λ! Z(s), where λ is given
by (5.14), hides the fact that the unknown mean of the random field is actually
estimated implicitly. The formulation of the ordinary kriging predictor we
prefer is GLS form
' + σ ! Σ−1 (Z(s) − 1'
pok (Z; s0 ) = µ µ) (5.17) of Ordinary
(Toutenburg, 1982, p. 141, Cressie, 1993, p. 173; Gotway and Cressie, 1993). Kriging
This formulation shows the correspondence between ordinary and simple krig-
ing, as well as the connection to the best predictor in the Gaussian random
field more clearly. Comparing (5.17) and (5.10), it appears that “all” that is
required to accommodate an unknown mean is to replace µ with an estimate
'. It is important to note that not just any estimate will do. The algebraic
µ
manipulations leading from λ! Z(s) to (5.17) reveal that µ must be estimated
by its best linear unbiased estimator, which in this case is the generalized least
squares estimator (Chapter problem 5.9; Goldberger, 1962):
' = (1! Σ−1 1)−1 1! Σ−1 Z(s).
µ (5.18)
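The GLS form (5.17)-(5.18) translates directly into code. A minimal sketch, assuming Σ and the covariance vector σ are known and numpy is available:

```python
import numpy as np

def ordinary_kriging(Z, Sigma, sigma0):
    """GLS form of ordinary kriging; Sigma = Var[Z(s)], sigma0 = Cov[Z(s), Z(s0)]."""
    one = np.ones(len(Z))
    Sinv_one = np.linalg.solve(Sigma, one)
    mu_gls = (one @ np.linalg.solve(Sigma, Z)) / (one @ Sinv_one)      # (5.18)
    return mu_gls + sigma0 @ np.linalg.solve(Sigma, Z - mu_gls * one)  # (5.17)
```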
Example 5.5 Isaaks and Srivastava (1989, pp. 291, 301–307) use a small
data set of seven observations and one prediction location to examine the
effect of semivariogram parameters on ordinary kriging predictions. We use
a data set of the same size; the observed data locations and their attribute values are as follows:
i si Z(si )
1 [5,20] 100
2 [20,2] 70
3 [25,32] 60
4 [8,39] 90
5 [10,17] 50
6 [35,20] 80
7 [38,10] 40
The prediction location is s0 = [20, 20], and the sample mean of the observed
data is Z = 70.0. The observed locations surround the prediction location.
Notice that, in contrast to Isaaks and Srivastava (1989), two locations are
equidistant from the prediction location (s1 and s6 , Figure 5.2).
The kriging weights (5.14), predictions, and kriging variance (5.16), are
computed for the following series of semivariogram models.
Figure 5.2 Seven observed locations and target prediction location. The dotted rays
show Euclidean distance between the observed location and the target location.
Model   Practical Range   Sill   Nugget   γ(h)                        Type
A       20                10     0        10(1 − e^{−3h/20})          Exponential
B       10                10     0        10(1 − e^{−3h/10})          Exponential
C       20                10     5        5 + 5(1 − e^{−3h/20})       Exponential + nugget
D       –                 –      10       10                          Nugget only
E       20                20     0        20(1 − e^{−3h/20})          Exponential
F       20                10     0        10(1 − e^{−3(h/20)²})       Gaussian
Models A and B differ in the range, models A and C in the relative nugget
effect. Model D is a nugget-only model in which data are not spatially corre-
lated. A comparison of models A and E highlights the effect of the variability
of the random field. The final model has the same variability and practical
range as the exponential model A, but a much higher degree of spatial conti-
nuity. Model F exhibits large short-range correlations.
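The weights in Table 5.1 below can be computed directly from (5.14)-(5.16). A sketch for model A follows; the printed weights, prediction, and kriging variance should agree with the model A row of Table 5.1 up to rounding:

```python
import numpy as np

s = np.array([[5, 20], [20, 2], [25, 32], [8, 39],
              [10, 17], [35, 20], [38, 10]], dtype=float)
z = np.array([100, 70, 60, 90, 50, 80, 40], dtype=float)
s0 = np.array([20.0, 20.0])

def cov_A(h):
    # model A: C(h) = sill * exp(-3h / practical range) = 10 exp(-3h/20)
    return 10.0 * np.exp(-3.0 * h / 20.0)

D = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
Sigma = cov_A(D)
sigma0 = cov_A(np.linalg.norm(s - s0, axis=1))
one = np.ones(7)

Sinv_sigma = np.linalg.solve(Sigma, sigma0)
Sinv_one = np.linalg.solve(Sigma, one)
m = (1.0 - one @ Sinv_sigma) / (one @ Sinv_one)   # Lagrange multiplier, (5.15)
lam = Sinv_sigma + m * Sinv_one                   # kriging weights, (5.14)

p_ok = lam @ z                                    # prediction
var_ok = cov_A(0.0) - lam @ sigma0 + m            # kriging variance, (5.16)
print(lam.round(2), round(p_ok, 2), round(var_ok, 2))
```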
The kriging weights sum to 1.0 in all cases (within roundoff error) as needed
(Table 5.1); recall that ordinary kriging weights are derived subject to the con-
straint λ! 1 = 1. Interestingly, as the degree of spatial continuity decreases, so
does the “variation” among the kriging weights λi . Model D, the nugget-only
model, assigns the same weight λi = 1/n to each observation. The resulting
Table 5.1 Kriging weights for predicting Z(s0 ) with semivariograms A–F. λi denotes
the kriging weight for the attribute Z(si ). sλ is the “standard deviation” of the seven
kriging weights for a particular model.
Model   pok     σ²ok    λ1      λ2      λ3      λ4      λ5      λ6      λ7
A 66.23 9.74 0.08 0.13 0.20 0.10 0.24 0.15 0.09
B 69.04 11.25 0.12 0.14 0.16 0.14 0.16 0.14 0.13
C 68.64 10.63 0.12 0.14 0.17 0.12 0.18 0.14 0.12
D 70.00 11.43 0.14 0.14 0.14 0.14 0.14 0.14 0.14
E 66.23 19.48 0.08 0.13 0.20 0.10 0.24 0.15 0.09
F 44.52 6.67 -0.35 0.08 0.28 0.06 0.75 0.18 0.01
||s0 − si || 15.0 18.0 13.0 22.5 10.4 15.0 20.6
Model A B C D E F
sλ 0.06 0.01 0.03 0.00 0.06 0.33
predictor is the sample mean. Recall that models A and B are identical, ex-
cept for the practical range. In the model with larger range (A), short-distance
correlations are higher, creating greater heterogeneity in the weights.
The results in Table 5.1 also demonstrate that points that are separated
by more than the range do not have zero kriging weights. Also, the kriging
weight of these points is not 1/n, unless all observations are uncorrelated (as
in model D). For example, s1 is further from the prediction location than the
(practical) range in model B, yet its kriging weight is 0.12. Points more distant
than the range are spatially correlated with other points that are less distant
from the target location than the range. This is called the relay effect (Chilès
and Delfiner, 1999, p. 205).
Models A and E are identical, except for the sill of the semivariogram. The
variability of the random field is twice as large for E, than for A. This has
no effect on the kriging weights, and hence the kriging predictor is the same
under the two models. The kriging variance, however, increases accordingly
with the variability of the random field.
Finally, model F is highly continuous, with large short-distance correlations.
Since the range and sill of model F are identical to those of model A, the
gaussian model’s short-distance correlations exceed those of the exponential
model. The kriging weights show the most “variation” of the models in Table
5.1 and the value closest to the prediction location, Z(s5 ), receives the most
weight. It contributes 3/4 of its own value to pok (Z; s0 ), accounting for more
than half of the predicted amount. Perhaps surprisingly, this model yields a negative kriging weight for the observation at s1. A similar effect can be noted
for model A, but it is less pronounced. The weight for Z(s1 ) is considerably
less than that for observation Z(s5 ), although they occupy very similar points
in the spatial configuration. It is exactly because they occupy similar positions
that Z(s1 ) receives small weight, and even a negative weight in model F. The
effect of Z(s1 ) is screened by the observation at s5 , because it lies “behind”
it relative to the prediction location.
In the derivation of the kriging weights only a “sum-to-one” constraint was
imposed on the kriging weights, but not a positivity constraint. At first glance, a negative kriging weight may seem undesirable. If weights can be negative, the predicted values can possibly be negative as well. Spatial attributes are often positive, however, e.g., yields, concentrations, counts. When the weights are restricted to
be positive, then all predicted values lie between the minimum and maximum
observed value. Szidarovsky et al. (1987) derive a version of kriging with only
positive weights. While this predictor has attractive advantages for obtaining
predicted values of nonnegative processes, the extra constraint may lead to
unacceptably large kriging standard errors (Cressie, 1993, p. 143).
Any one or several of the random components of e(s) may be zero (or as-
sumed to be zero) at times. The common thread of models of form (5.23) is
linearity of the mean function and a spatially correlated error process. The
notation X(s) is used to emphasize that the (n × p) matrix X depends on
spatial coordinates. The columns of this regressor matrix are usually com-
prised of spatial variables, although there are applications where the columns
of X do not depend on spatial coordinates at all. Furthermore, X may contain
dummy (design) variables and can be of less than full column rank. This situ-
ation arises when experimental data with design and treatment structure are
modeled spatially. Occasionally, the dependence of X(s) on s will be omitted
for brevity.
The random processes W(s), η(s), and ε(s) have mean zero, and the measurement errors are uncorrelated,

    Cov[ε(si), ε(sj)] = 0 if si ≠ sj,   and σ²ε if si = sj.

As a result, the error process of (5.23) is a zero-mean stochastic process with

    Var[e(s)] = ΣW + Ση + σ²ε I ≡ Σ.
The modeling of spatially correlated data in this generality is complicated and still evolving. The temptation to bring models for spatial data
into the classical linear model (regression) framework is understandable.
If the process contains a smooth-scale spatial component, W (s), then the
smooth fluctuations in the spatial signal are handled in an uncorrelated error
model by allowing the mean function X(s)β to be sufficiently flexible. In other
words, the mean function is parameterized to capture local behavior. With
geostatistical data this can be accomplished parametrically by expressing the
mean as a polynomial function of the spatial coordinates. As the local fluctu-
ations of the spatial signal become more pronounced, higher-order terms must
be included (§5.3.1). A non-parametric alternative is to model local behavior
by applying d-dimensional smoothing or to localize estimation (§5.3.2). The
degree of smoothness is then governed by a smoothing parameter (bandwidth).
The contemporary approach, however, is to assume that some spatial stochastic structure is present, conveyed by W(s) and η(s); hence Σ will be a non-diagonal matrix. (The argument that Var[W(s)] = σ²W I and Var[η(s)] = σ²η I, and hence that Σ is diagonal, is vacuous. The individual random
The idea of the trend surface approach is to model the mean function in (5.24)
with a highly parameterized fixed effects structure, comprised of functions of
the spatial coordinates, si = [xi, yi]′. For example, a first-degree trend surface
model is
    Z(si) = β0 + β1 xi + β2 yi + εi,   εi ∼ iid (0, σ²).
If E[Z(si )] = µ, and the Z(si ) are correlated, then this model is incorrect in
several places: β0 +β1 xi +β2 yi is not the model for the mean and the errors are
not iid. By over-parameterizing the mean, the model accounts for variability
that is associated with the spatial random structure. The approach pretends
that the models
    Z(s) = 1µ + e(s),   e(s) ∼ (0, Σ),

and

    Z(s) = X(s)β + ε,   ε ∼ (0, σ² I),   Cov[εi, εj] = 0 ∀ i ≠ j,

are interchangeable.
The BLUE of E[Z(s0)] and the BLUP of Z(s0) are the same; the best estimator and the best predictor are

    Ẑ(s0) = Ê[Z(s0)] = x′(s0)β̂_ols.

The mean-squared prediction error for Z(s0) based on Ẑ(s0) is

    MSE[Z(s0); Ẑ(s0)] = σ² (1 + x′(s0)(X(s)′X(s))⁻¹x(s0)),
where we assumed that the new data point Z(s0 ) is uncorrelated with the
observed data. This is consistent with the uncorrelated error assumption of
model (5.25).
The number of regression coefficients in a trend surface model increases quickly with the degree of the polynomial; for a surface of degree p, β is a vector of length (p + 1)(p + 2)/2. To examine whether the mean function has been made sufficiently flexible, the residuals from the fit can be examined for residual spatial autocorrelation.
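A minimal sketch of fitting a trend surface by OLS, with hypothetical data; the design matrix contains the monomials x^a y^b with a + b ≤ p, so it has (p + 1)(p + 2)/2 columns as noted above:

```python
import numpy as np

def trend_surface_design(coords, p):
    # monomials x^a * y^b with a + b <= p: (p+1)(p+2)/2 columns
    x, y = coords[:, 0], coords[:, 1]
    return np.column_stack([x**a * y**b
                            for a in range(p + 1) for b in range(p + 1 - a)])

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(100, 2))
z = 1 + 0.5 * coords[:, 0] - 0.2 * coords[:, 1] + rng.normal(0, 0.3, 100)

X = trend_surface_design(coords, p=1)
beta, *_ = np.linalg.lstsq(X, z, rcond=None)
resid = z - X @ beta    # check these for leftover spatial autocorrelation
print(beta)
```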
In the model

    Z(s) = X(s)β + ε

[Figure 5.3, panels a)-c): observed data and fitted trend surfaces; see text.]
origin. The semivariogram of the original data shows the spatial structure
clearly as well as the nugget effect (Figure 5.4). A trend surface of fifth de-
gree still exhibits some near-origin spatial structure in the semivariogram. For
p = 8 this structure has disappeared and the empirical semivariogram appears
to decline with increasing lag distance. This apparent decline is a combination
of a biased estimator and a statistical model that overfits the mean function.
Notice that the nugget effect remains present in all semivariograms; it is the
non-spatially structured source of variability in the data.
Based on the results displayed in Figure 5.4, a trend surface model of degree
p = 8 seems adequate. The predicted surface for p = 8 is shown in Figure 5.3c.
It is considerably more smooth than the surface with p = 14. Note that with
the rook definition of spatial neighborhood, the Ires statistic focuses entirely
on the relationships between residuals of nearest neighbors, emphasizing, and
perhaps over-emphasizing, local behavior. A trend surface with p = 11, as
selected based on the AIC criterion, appears to be a good compromise. Note
Table 5.2 Fit statistics for trend surface models of different order fit to data shown
in Figure 5.3a. Zobs and p-values refer to the test of residual spatial autocorrelation
based on the I ∗ statistic with a rook definition of spatial connectivity. The second
column gives the number of regression coefficients (intercept included).
p    (p+1)(p+2)/2    σ̂²      Ires     Zobs    p-value    AIC
0 1 0.733 0.473 13.17 < 0.0001 1012.1
1 3 0.662 0.403 11.46 < 0.0001 973.4
2 6 0.643 0.375 10.92 < 0.0001 964.3
3 10 0.579 0.297 9.03 < 0.0001 926.5
4 15 0.552 0.266 8.51 < 0.0001 912.8
5 21 0.489 0.181 6.50 < 0.0001 869.6
6 28 0.443 0.095 4.46 < 0.0001 837.3
7 36 0.428 0.052 3.69 0.0001 830.4
8 45 0.410 0.006 2.84 0.0022 821.3
9 55 0.398 −0.019 2.64 0.0041 817.6
10 66 0.386 −0.046 2.42 0.0077 814.3
11 78 0.360 −0.098 1.49 0.0674 795.7
12 91 0.362 −0.117 1.57 0.0577 808.9
13 105 0.351 −0.147 1.36 0.0871 805.7
14 120 0.347 −0.183 1.05 0.1466 809.6
15 136 0.343 −0.207 1.08 0.1400 813.0
that p = 11 is the largest value in Table 5.2 for which the Ires statistic is not
significant at the 0.05 level.
[Figure 5.4: robust empirical semivariograms of the data (p = 0) and of trend-surface residuals for p = 5, 6, 7, 8, plotted against lag distance.]
to data Z(s1 ), · · · , Z(sn ) that reflect the extent to which the low-rank model
applied at s0 is expected to fit at other locations. The weights control the
influence of observations on the estimation of the model at s0. The benefit of a parsimonious model x(s0)′β for any one site is traded against the estimation of local regression coefficients and the determination of the weight function.
Local estimation in this spirit is often referred to as non-parametric re-
gression. Kernel estimation, for example, moves a window over the data and
estimates the mean at a particular location as the weighted average of the
data points within the window. This window may extend over the entire data
range if the weight function reaches zero only asymptotically, or consist of a
set of nearest neighbors. Local polynomial regression applies the same idea
but fits a polynomial model at each prediction location rather than a constant
mean. An equivalent representation of local estimation—which we prefer—is
in terms of weighted linear regression models.
Assume that a prediction is desired at s0 and that the behavior of the
realized surface near s0 can be expressed as a polynomial of first degree.
The (kernel) weight function W (||si −s0 ||, λ) depends on the distance between
observed locations and the prediction location and the smoothing parameter
(bandwidth) λ. Collecting the W (||si −s0 ||, λ) into a diagonal matrix W(s0 , λ),
the objective function can be written more clearly as

    Q(s0, λ) = (Z(s) − X(s)β0)′ W(s0, λ) (Z(s) − X(s)β0).

This is a weighted least squares objective function in the model

    Z(s) = X(s)β0 + e0,   e0 ∼ (0, W(s0, λ)⁻¹).        (5.26)
The assignment of small weight to Z(sj ) compared to Z(si ), say, is equivalent
to pretending that the variance of Z(sj ) exceeds that of Z(si ). If W (||si −
s0 ||, λ) = 0, then the data point Z(si ) can be removed entirely to allow the
inversion in (5.26). This representation of local estimation as a weighted esti-
mation problem is entirely general. Any statistical model that can be written
in terms of means and variances can be localized by this device (see Chapter
problems).
A special case of local polynomial estimation is LOESS regression (Cleve-
land, 1979), where the (tri-cube) weight function achieves exactly zero and
the estimation uses robust re-weighting of residuals. In general, the choice of the weight function is less important than the choice of the bandwidth λ; any reasonable choice of W will lead to reasonable results. Common choices in R¹
are the Epanechnikov kernel

    We(d, λ) = (3/(4λ)) (1 − (d/λ)²)  for −λ ≤ d ≤ λ,  and 0 otherwise,

and the Gaussian kernel

    Wg(d, λ) = (1/(λ√(2π))) exp{−(1/2)(d/λ)²}.

In R², the product of Gaussian kernels,

    Wg(si − sj, λ) = (1/(2πλ²)) exp{−(1/2)((xi − xj)/λ)²} exp{−(1/2)((yi − yj)/λ)²}
                   = (1/(2πλ²)) exp{−||si − sj||²/(2λ²)},

has a common bandwidth for the major axes of the coordinate system and spherical weight contours.
The two important choices made in local estimation are the degree of the
local polynomial and the bandwidth. Locally constant means lead to estimates
which suffer from edge bias. For spatial data this is an important consideration
because many data points fall near the bounding box or the convex hull of a
set of points.
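A sketch of local estimation with a first-degree polynomial and the Gaussian kernel above, written as the weighted least squares problem (5.26); data and bandwidth are hypothetical:

```python
import numpy as np

def gaussian_kernel(d, lam):
    return np.exp(-0.5 * (d / lam) ** 2) / (2.0 * np.pi * lam**2)

def local_linear(coords, z, s0, lam):
    d = np.linalg.norm(coords - s0, axis=1)
    w = gaussian_kernel(d, lam)                          # weights W(||si - s0||, lam)
    X = np.column_stack([np.ones(len(z)), coords - s0])  # local first-degree polynomial
    W = np.diag(w)
    beta0 = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)    # weighted LS, as in (5.26)
    return beta0[0]   # intercept = estimate of the surface at s0
```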
We use the subscript “UK” to denote this as the universal kriging predictor,
to distinguish it from the simple and ordinary kriging predictors of §5.2.1
and §5.2.2. Obviously, if we assume a constant mean E[Z(s)] = 1µ, then the
universal kriging predictor reduces to the ordinary kriging predictor (compare
to equations (5.17) and (5.18))
    pok(Z; s0) = β̂_gls + σ′Σ⁻¹(Z(s) − 1β̂_gls),

with

    β̂_gls = (1′Σ⁻¹1)⁻¹1′Σ⁻¹Z(s).
As with ordinary kriging, universal kriging can also be done in terms of the
semivariogram (see Cressie, 1993, pp. 153–154).
To predict r variables, Z(s0), simultaneously, we extend the model above to

    E [ Z(s)  ]   [ X(s)β  ]
      [ Z(s0) ] = [ X(s0)β ]        (5.31)

    Var [ Z(s)  ]   [ ΣZZ  ΣZ0 ]
        [ Z(s0) ] = [ Σ0Z  Σ00 ].        (5.32)
Here, ΣZZ is the n × n variance-covariance matrix of the data, Σ00 is the r × r
variance-covariance matrix among the unobservables, and Σ0Z is the r × n
variance-covariance matrix between the data and the unobservables. With this
model, the best linear unbiased predictor (BLUP) (Goldberger, 1962; Gotway
and Cressie, 1993) is
    Ẑ(s0) = X(s0)β̂_gls + Σ0Z ΣZZ⁻¹ (Z(s) − X(s)β̂_gls).        (5.33)
Equations (5.33) and (5.34) are obvious extensions of (5.29) and (5.30) to
the multi-predictor case.
For the purpose of predictions, we can model the spatial variation entirely
through the covariates, entirely as small-scale variation characterized by the
semivariogram or Σ(θ), or through some combination of covariates and resid-
ual spatial autocorrelation. Thus, the decomposition of the data into covari-
ates plus spatially correlated error as depicted through equation (5.24) is not
unique. However, our choice impacts both the interpretation of our model and
the magnitude of the prediction standard errors.
For example, suppose we accidentally left out an important spatially-varying
covariate (say xp+1 ) when we defined X(s). If we do a good job of fitting both
models, the model omitting xp+1 may fit as well as the model including xp+1 .
So we could have two competing models defined by parameters (β1, e(s)1) and (β2, e(s)2) with comparable fit. If X(s)1β1 ≠ X(s)2β2, then the interpretations in the two models could be very different, although both models
are valid representations of the spatial variation in the data. The predicted
surfaces based on these two models will be similar, but the standard errors
and the interpretation of covariate effects will be substantially different (see
e.g., Cressie, 1993, pp. 212–224, and Gotway and Hergert, 1997).
As you will see in §5.5, the question of how to estimate the unknown parame-
ters of the spatial correlation structure—when the mean is spatially varying—
is an important aspect of spatial prediction. If the mean is constant, then the
techniques of §4.4 and §4.5 can be applied to obtain estimators of the co-
variance and/or semivariogram parameters. It is tempting from this vantage
point to adopt an “ordinary-kriging-at-all-cost” attitude and to model spatial
variation entirely through the small-scale variation. For example, because the
semivariogram filters the (unknown but) constant mean, not knowing µ is
of no consequence in semivariogram estimation. An incorrect assumption of a
constant large-scale mean can be dangerous for your spatial analysis, however.
Schabenberger and Pierce (2002, p. 614) give the following example, where
data are generated on a transect according to the deterministic functions
    Z1(t) = 1 + 0.5t
    Z2(t) = 1 + 0.22t + 0.022t² − 0.0013t³.

Note that “data” so generated are deterministic; there is no random variation.
If one computes the Matheron estimator of the empirical semivariogram from
these data, the graphs in Figure 5.5 result. A power semivariogram model was
fit to the empirical semivariogram in the left-hand panel. A gaussian semivar-
iogram fits the empirical semivariogram in the right-hand panel well. Not
accounting for the large-scale structure may lead you to attribute determinis-
tic spatial variation—because the large-scale trend is non-random—to random
sources. The spatial “dependency” one is inclined to infer from Figure 5.5 is
entirely spurious.
Figure 5.5 Empirical semivariograms (dots) and fitted models for data from deter-
ministic trend. Left panel is for Z1 (t), right panel is for Z2 (t). From Schabenberger
and Pierce (2002, p. 614).
Example 4.2 (C/N ratios. Continued) For the C/N ratio data, first intro-
duced in §4.4.1 on page 154, we modeled an exponential semivariogram para-
metrically in §4.5 and presented estimates for nugget and no-nugget models
in Table 4.2 on page 176. Now we compute ordinary kriging predictions on
a regular 10 feet × 10 feet grid of points for the REML and OLS parameter
estimates. To solve the kriging problem for the 1560 prediction locations ef-
ficiently, we define a local quadrant search neighborhood by dividing a circle
with radius equal to the range of the semivariogram into 4 sections and choose
the 5 points nearest to the prediction location in each quadrant.
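A sketch of such a quadrant search, under the simplifying assumption that points falling exactly on a quadrant boundary may be ignored:

```python
import numpy as np

def quadrant_search(coords, s0, radius, per_quadrant=5):
    d = coords - s0
    dist = np.linalg.norm(d, axis=1)
    keep = []
    for qx, qy in [(1, 1), (1, -1), (-1, 1), (-1, -1)]:
        # points in this quadrant and inside the search circle
        in_q = (np.sign(d[:, 0]) == qx) & (np.sign(d[:, 1]) == qy) & (dist <= radius)
        idx = np.where(in_q)[0]
        keep.extend(idx[np.argsort(dist[idx])][:per_quadrant])  # nearest few
    return np.array(keep, dtype=int)
```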
Figure 5.6 displays the predictions obtained using the OLS and REML es-
timates of the exponential semivariogram model with and without the nugget
effect. Also shown (in the third panel from the top) are the predictions ob-
tained using the REML estimates in the nugget model but using a search
radius equal to that of the range estimated using OLS.
[Figure 5.6: ordinary kriging predictions of the C/N ratio under four semivariogram fits. Panels, top to bottom: REML with nugget (c0 = 0.12, c1 = 0.22, range = 171.1); REML without nugget (c0 = 0.0, c1 = 0.32, range = 41.6); REML with nugget and search radius 85.3; OLS with nugget (c0 = 0.11, c1 = 0.17, range = 85.3). Predicted C/N ratio classes range from 9.38 to 12.31.]
The predictions from the nugget models are generally similar, showing an
area of high ratios in the south-east portion of the field and various pockets
of low C/N ratios. The predictions based on the OLS-fitted semivariogram
model shown in the bottom panel are less smooth than those obtained using
the corresponding REML-fitted semivariogram model. The estimates of the
relative structured variability are about the same (RSVols = 39%, RSVreml =
35%), but the smaller OLS estimate of the range (85.3 versus 171.1) creates
a process with less continuity. You can glean this from the more “frizzled”
contour edges in the lower panel. Overall, the predictions based on the two
sets of estimates are close (Figure 5.7).
Figure 5.7 Comparison of predictions in nugget models based on REML and OLS
estimates.
This may be surprising at first, since adding a nugget effect reduces the con-
tinuity of the process. However, the lack of a nugget effect is more than offset
in this case by a small range. Recall that earlier we rejected the hypothesis of
a zero nugget effect based on the REML analysis (see page 177). It would be
incorrect to attribute the more erratic appearance of the predicted C/N ratios
in the second panel to an analysis that reveals more “detail” about the C/N
surface and so would be preferable on those grounds. Since statistical infer-
ence is conditional on the selected model, the predictions in the second panel
must be dismissed if we accept the necessity of a nugget effect, regardless of
how informative the resulting map appears to be.
Figure 5.8 displays contour maps of the standard errors corresponding to
the kriging predictions in Figure 5.6. The standard errors are small near the
location of the observed data (compare to Figure 4.6 on page 156). At the data
locations the standard errors are exactly zero, since the predictions honor the
data. The standard error maps basically trace the observed locations.
The universal kriging predictor “honors” the data. The predicted surface
passes through the data points, i.e., the predicted values at locations where
data are measured are identical to the observed values. Thus, while kriging
produces a smooth surface, it does not smooth the data like least squares or
loess regression. In some applications such smoothing may be desirable, how-
ever. For example, when the data are measured with error, it would be better
to predict a less noisy version of the data that removes the measurement error
instead of requiring the prediction surface to pass through the noisy data.
Following the ideas in Cressie (1993), suppose we really want to make infer-
ences about a spatial process, S(s), but instead can only measure the process
Z(s), where
    Z(s) = S(s) + ε(s),   s ∈ D,

with E[ε(s)] = 0, Var[ε(s)] = σ²ε, Cov[ε(si), ε(sj)] = 0 for all i ≠ j, and S(s) and ε(s) independent. Further suppose that S(s) can be described with
Figure 5.8 Standard error maps corresponding to kriging predictions in Figure 5.6.
The standard errors associated with filtered kriging are smaller than those associated with ordinary kriging (except at data locations), since S(·) is less variable than Z(·).
If the mean is spatially varying, i.e., if we consider the more general case where E[Z(s)] = x(s)′β, a derivation analogous to that for universal kriging (in terms of the covariance matrix) gives the optimal weights as solutions to

    Σa + X(s)m = σ*
    X(s)′a = x(s0),

where σ* = Cov[Z(s), S(s0)] and thus has elements CS(si, s0). However, we cannot estimate CS(si, sj) directly; we obtain this quantity only through CZ(si, sj), which is equal to

    CZ(si, sj) = Cov[Z(si), Z(sj)] = CS(si, sj) + σ²ε  if si = sj,   and CS(si, sj)  if si ≠ sj.

Thus, at prediction locations where si ≠ sj, CZ(si, sj) = CS(si, sj) and σ* = σ. However, at the data locations, CZ(si, sj) = CS(si, sj) + σ²ε, so we use σ* = CZ(si, sj) − σ²ε.
The modification to the kriging predictor comes into play when we want
to predict at locations where we already have data. This is called filtering,
since we are removing the error from our observed data through a prediction
process. It is also smoothing, since the filtered kriging predictor smooths the
data, with larger values of σ$2 resulting in more smoothing.
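A sketch of filtered kriging for the constant-mean case, following the σ* construction above: the right-hand-side covariances come from the signal model C_S, while the data covariance matrix carries the full nugget σ²ε (inputs hypothetical):

```python
import numpy as np

def filtered_kriging(Z, coords, s0, cov_S, sigma2_eps):
    """Predict the signal S(s0). cov_S is the (noise-free) signal covariance
    function; sigma2_eps is the measurement error (nugget) variance."""
    n = len(Z)
    one = np.ones(n)
    D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    Sigma = cov_S(D) + sigma2_eps * np.eye(n)   # Var[Z(s)]: signal plus nugget
    # sigma* = Cov[Z(s), S(s0)]: no nugget, even if s0 coincides with a data site
    sigma_star = cov_S(np.linalg.norm(coords - s0, axis=1))
    Sinv_one = np.linalg.solve(Sigma, one)
    mu_gls = (one @ np.linalg.solve(Sigma, Z)) / (one @ Sinv_one)
    return mu_gls + sigma_star @ np.linalg.solve(Sigma, Z - mu_gls * one)
```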
Historically, the different terms (filtering and smoothing) arise from the time
series literature which identifies three distinct types of prediction problems.
Example 5.8 Schabenberger and Pierce (2002, Ch. A9.9.5) give a small ex-
ample to demonstrate the effects of measurement error on kriging predictions
and their standard errors, which we adapt here. Figure 5.9 shows four ob-
served locations on a grid, s1 = [0, 0]′, s2 = [4, 0]′, s3 = [0, 8]′, and s4 = [4, 8]′.
The observed values are Z(s1 ) = 5, Z(s2 ) = 10, Z(s3 ) = 15, and Z(s4 ) = 6,
with arithmetic average Z̄ = 9. The locations s0(1)-s0(4) are prediction locations. Notice that one of the prediction locations is also an observed location, s0(3) = s2.
Figure 5.9 Observed locations (crosses) and prediction locations (dots). Adapted from
Schabenberger and Pierce (2002).
Predictions at the four locations are obtained for two exponential semivari-
ogram models with practical range 8.5. Model A has a sill of 1.0 and no nugget
effect. Model B has a nugget effect of 0.5 and a partial sill of 0.5. We assume
that the entire nugget effect is due to measurement error.
In the absence of a nugget effect the predictions of Z(s0 ) and S(s0 ) agree
in value and precision (Table 5.3). In the presence of a nugget effect (model
B), predictions of Z(s0 ) and S(s0 ) agree in value but predictions of the signal
are more precise than those of the error-contaminated process. The difference
between the two kriging variances is the magnitude of the nugget effect.
At the observed location s2 = s0 (3) , the kriging predictor under model A
honors the data. Notice that the kriging weights are zero except for λ2 =
1. Since the predicted value is identical to the observed value, the kriging
variance is zero. In the presence of a nugget effect (model B) the predictor of
the signal does not reproduce the observed value and the kriging variance is
not zero.
Table 5.3 Kriging predictions of Z(s0 ) and S(s0 ) under semivariogram models A
and B. Adapted from Schabenberger and Pierce (2002, Ch. 9.9.5).
Prediction
s0 Model Target Value Variance λ1 λ2 λ3 λ4
[2, 4] A Z(s0 ) 9.0 0.92 0.25 0.25 0.25 0.25
[2, 4] A S(s0 ) 9.0 0.92 0.25 0.25 0.25 0.25
[2, 4] B Z(s0 ) 9.0 1.09 0.25 0.25 0.25 0.25
[2, 4] B S(s0 ) 9.0 0.59 0.25 0.25 0.25 0.25
[3, 2] A Z(s0 ) 8.77 0.78 0.25 0.48 0.12 0.15
[3, 2] A S(s0 ) 8.77 0.78 0.25 0.48 0.12 0.15
[3, 2] B Z(s0 ) 8.83 1.04 0.26 0.36 0.18 0.20
[3, 2] B S(s0 ) 8.83 0.54 0.26 0.36 0.18 0.20
[4, 0] A Z(s0) 10.0 0.0 0 1 0 0
[4, 0] A S(s0) 10.0 0.0 0 1 0 0
[4, 0] B Z(s0) 10.0 0.0 0 1 0 0
[4, 0] B S(s0 ) 9.25 0.30 0.17 0.60 0.11 0.12
[5, −2] A Z(s0 ) 9.26 0.88 0.17 0.57 0.13 0.13
[5, −2] A S(s0 ) 9.26 0.88 0.17 0.57 0.13 0.13
[5, −2] B Z(s0 ) 9.03 1.09 0.23 0.40 0.18 0.19
[5, −2] B S(s0 ) 9.03 0.59 0.23 0.40 0.18 0.19
Figure 5.10 Prediction and standard error maps for filtered kriging. Compare to top
panels in Figures 5.6 and 5.8.
Figure 5.11 Prediction standard errors for filtered and ordinary kriging.
Example 5.9 Suppose you want to perform ordinary kriging in terms of covariances and you choose the exponential model γ(h, θ) = σ²(1 − exp{−3h/α}) as the semivariogram model. Once you have obtained estimates θ̂ = [σ̂², α̂]′, you estimate the semivariogram as

    γ̂(h) = γ(h, θ̂) = σ̂²(1 − exp{−3h/α̂})

by plugging the estimates into the expression for the model. In order to estimate covariances under this model, you can invoke the relationship C(h) = C(0) − γ(h) and estimate

    Ĉ(h) = Ĉ(0) − γ̂(h)
         = γ(∞, θ̂) − γ(h, θ̂)
         = σ̂² − σ̂²(1 − exp{−3h/α̂})
         = σ̂² exp{−3h/α̂} = C(h, θ̂).
with x1 (si ) ≡ 1 for all i (Cressie, 1993, p. 165). The empirical semivariogram
no longer estimates the true, theoretical semivariogram. If the covariates are
trend surface functions, the trend often manifests itself in practice by an
empirical semivariogram that increases rapidly with ||h|| (often quadratically).
Analogous problems occur when estimating the covariance function. In light
of this problem, we need to re-examine the techniques from §4.5 for the case
of a spatially varying mean. This is the topic of §5.5.1–§5.5.3.
The third issue mentioned above, regarding the variability of θ̂, is of importance whether the mean is constant or spatially varying. The implications
here are whether a predictor that is best in some sense retains this property
    σ²ok(s0) = C(0) − σ(θ)′Σ(θ)⁻¹σ(θ) + (1 − 1′Σ(θ)⁻¹σ(θ))²/(1′Σ(θ)⁻¹1).

The plug-in predictor

    p̂ok(Z; s0) = (σ(θ̂) + 1 [(1 − 1′Σ(θ̂)⁻¹σ(θ̂))/(1′Σ(θ̂)⁻¹1)])′ Σ(θ̂)⁻¹ Z(s)        (5.36)
is no longer the best linear unbiased predictor of Z(s0). It is an estimate of the BLUP, a so-called EBLUP. Also, this EBLUP will not be invariant to your choice of θ̂. Different estimation methods yield different estimates of the covariance parameters, which affects the predictions (unless you are predicting at observed locations without measurement error; all predictors honor the data, regardless of “how” you obtained θ̂). Not only is (5.36) no longer best, we do not know its prediction error. The common practice of evaluating σ²ok(s0) at θ̂ does not yield an estimate of the prediction error of p̂ok(Z; s0). It yields an estimate of the prediction error of pok(Z; s0). In other
words, by substituting estimated covariance parameters into the expression
for the predictor, we obtain an estimate of the predictor. By substituting into
the expression for the prediction variance we get an estimate of the prediction
error of a different predictor, not for the one we are using. How to determine,
or at least approximate, the prediction error of plug-in predictors, is the topic
of §5.5.4.
where

    β̂_gls = (X(s)′Σ(θ)⁻¹X(s))⁻¹ X(s)′Σ(θ)⁻¹Z(s)        (5.37)
is the generalized least squares estimator of the fixed effects. How are we
going to respond to the fact that θ is unknown? In order to estimate θ by
least squares fitting of a semivariogram model, we cannot use the empirical
semivariogram based on the observed data Z(s), because the mean of Z(s) is
not constant. From equation (5.35) we see that the resulting semivariogram
would be biased, and there is little hope of finding reasonable estimates for θ this way. What we need is the semivariogram of e(s), not that of Z(s). If β were known, then we could construct “data” Z(s) − X(s)β and unbiasedly estimate the semivariogram, because the error process would be observable. But β is not known; otherwise we would not be contemplating a linear model for a spatially varying mean. Computing the GLS estimate of β, (5.37), requires knowledge of the covariance parameters.
If, instead, we use a plug-in estimator for β, the estimated generalized least squares (EGLS) estimator

    β̂_egls = (X(s)′Σ(θ̂)⁻¹X(s))⁻¹ X(s)′Σ(θ̂)⁻¹Z(s),        (5.38)
we are still left with the problem of having to find a reasonable estimator
for θ. And we just established that this is not likely by way of least squares
techniques without knowing the mean. Schabenberger and Pierce (2002, pp.
613–615) refer to this circular argument as the “cat and mouse game of uni-
versal kriging.” In order to estimate the mean you need to have an estimate
of the covariance parameters, which you cannot get by least squares without
knowing the mean. Percival and Walden (1993, p. 219) refer to a similar prob-
lem in spectral analysis of time series—where in order to derive a filter for
pre-whitening of a spectral density estimate one needs to know the spectral
density of the series—as a “cart and horse” problem.
The approach that is taken to facilitate least squares estimation of covari-
ance parameters in the case of a spatially varying mean is to compute initially
an estimate of the mean that does not depend on θ. Then use this estimate
to detrend the data and estimate the semivariogram by least squares based on the empirical semivariogram of the residuals. Since the estimate θ̂ depends on how we estimated β initially (we know it was not by GLS or EGLS), the process is often repeated (iterated). The resulting “simultaneous” estimation scheme for θ and β is based on the methods described in §4.5.1 and is termed
iteratively re-weighted generalized least squares (IRWGLS):

1. Obtain a starting estimate of β, say β̂;
2. Compute residuals r = Z(s) − X(s)β̂;
3. Estimate and model the semivariogram of the residuals using techniques described in §4.4 and §4.5, obtaining θ̂ by minimizing either equation (4.31) or equation (4.34);
4. Obtain a new estimate of β using equation (5.38);
5. Repeat steps 2-4 until the relative or absolute change in the estimates of β and θ is small.
The starting value in step 1 is almost always obtained by ordinary least
squares. In the first pass of the IRWGLS algorithm you are working with OLS residuals; once past step 4, you are working with (E)GLS residuals. The semivariogram estimator based on the residuals r = Z(s) − Xβ̂ (step 3 above) is biased. It is important to understand that this bias is different from the bias mentioned earlier that is due to not accounting for the change in the mean, namely equation (5.35). The bias in the semivariogram based on OLS residuals stems from the fact that the residuals fail to share important properties of the model errors e(s). For example, their variance is not Σ; they are rank-deficient and heteroscedastic. Frankly, the only thing e(s) and Z(s) − X(s)β̂_ols have in common is a zero mean. The semivariogram of Z(s) − X(s)β̂_ols is not
the semivariogram of e(s). A detailed discussion of the properties of OLS and
GLS residuals, and the diagnosis of the covariance model choice based on
residuals is deferred until Chapter 6. At this point it suffices to note that the
bias in the estimated semivariogram based on OLS residuals increases with
the lag. Cressie (1993, p. 166) thus argues that the bias can be controlled
by fitting the semivariogram model by weighted nonlinear least squares as
described in §4.5.1. Because the weights are proportional to the approximate
variance of the empirical semivariogram, empirical semivariogram values at
large lags—where the bias is greater—are down-weighted.
By iterating the above steps, that is, by repeating steps 2–4 in the IRWGLS
algorithm, the bias problem is not solved. The empirical semivariogram of the
generalized least squares residuals is also a biased estimator of the semivari-
ogram. Since both β̂_ols and β̂_egls are unbiased estimators of β, most of the
trend is removed in the very first step of the algorithm, and you may pick up
comparably little additional structure in subsequent iterations. What matters
for efficient estimation of the large-scale trend, and for spatial prediction, is
that the estimator of θ is efficient and has as little bias as possible. If you
estimate the covariogram instead of the semivariogram in step 3, then a single
OLS fit and OLS residuals may be preferable over an iterated algorithm.
The final IRWGLS estimator of β is an EGLS estimator. If we denote the estimator of θ obtained from IRWGLS by θ̂_IRWGLS, the IRWGLS estimator of β is

    β̂_IRWGLS = (X(s)′Σ(θ̂_IRWGLS)⁻¹X(s))⁻¹ X(s)′Σ(θ̂_IRWGLS)⁻¹Z(s),        (5.39)

and its variance-covariance matrix is usually estimated as

    Var̂[β̂_IRWGLS] = (X(s)′Σ(θ̂_IRWGLS)⁻¹X(s))⁻¹.        (5.40)
Since steps 2–4 are repeated, the process is iterative, but in contrast to
a numerical optimization technique, such as the Newton-Raphson algorithm,
it is difficult to study the overall behavior of the IRWGLS procedure. There
is, for example, no guarantee that the process “converges” in the usual sense
or that some “extremum” has been found when the process stops. When
the algorithm comes to a halt, think of it as lack of progress, rather than
convergence. Any continuance would lead to the same estimates of θ and these
would lead to the same estimates of β. We recommend monitoring not the absolute change in parameter estimates alone, but using a relative criterion in step 5. For example, if β̂^(u) and β̂^(u+1) are the estimates from two successive iterations, compute

    δj^(β) = |β̂j^(u) − β̂j^(u+1)| / (0.5 [|β̂j^(u)| + |β̂j^(u+1)|]),   j = 1, · · · , p,

    δk^(θ) = |θ̂k^(u) − θ̂k^(u+1)| / (0.5 [|θ̂k^(u)| + |θ̂k^(u+1)|]),   k = 1, · · · , q,

and stop when max{δk^(θ), δj^(β)} < 10⁻⁶.
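A skeleton of the IRWGLS loop with this relative stopping criterion; fit_semivariogram and build_Sigma are placeholders standing in for the weighted least squares semivariogram fit of step 3 and for the covariance matrix implied by θ̂ (both must be supplied; θ is assumed to be a numpy array):

```python
import numpy as np

def rel_change(a, b):
    den = np.maximum(0.5 * (np.abs(a) + np.abs(b)), 1e-12)  # guard against 0/0
    return np.abs(a - b) / den

def irwgls(Z, X, fit_semivariogram, build_Sigma, tol=1e-6, max_iter=20):
    beta, *_ = np.linalg.lstsq(X, Z, rcond=None)       # step 1: OLS start
    theta = fit_semivariogram(Z - X @ beta)            # steps 2-3
    for _ in range(max_iter):
        Sigma = build_Sigma(theta)
        Sinv_X = np.linalg.solve(Sigma, X)
        beta_new = np.linalg.solve(X.T @ Sinv_X, Sinv_X.T @ Z)   # step 4: EGLS (5.38)
        theta_new = fit_semivariogram(Z - X @ beta_new)          # steps 2-3 again
        done = max(rel_change(beta, beta_new).max(),
                   rel_change(theta, theta_new).max()) < tol     # step 5
        beta, theta = beta_new, theta_new
        if done:
            break
    return beta, theta
```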
Substituting this expression into (5.41) yields an objective function for minimization, profiled for β,

    ϕβ(θ; Z(s)) = ln{|σ²Σ(θ*)|} + n ln{2π} + σ⁻² r′Σ(θ*)⁻¹r,        (5.43)

where

    r = Z(s) − X(s)(X(s)′Σ(θ)⁻¹X(s))⁻¹X(s)′Σ(θ)⁻¹Z(s)

is the GLS residual. To profile σ² from the objective function, note that its MLE is

    σ̂²ml = (1/n) r′Σ(θ*)⁻¹r.
Substituting again yields the negative of twice the profiled log likelihood,

    ϕβ,σ(θ*; Z(s)) = ln{|Σ(θ*)|} + n ln{σ̂²} + n(ln{2π} − 1).        (5.44)

Minimizing (5.44) is an optimization problem with only q − 1 parameters. Upon convergence you obtain θ̂ml from σ̂²ml and θ̂*, and β̂ml by evaluating (5.42) at the maximum likelihood estimate θ̂ml of θ:

    β̂ml = (X(s)′Σ(θ̂ml)⁻¹X(s))⁻¹ X(s)′Σ(θ̂ml)⁻¹Z(s).        (5.45)

    Var̂(β̂ml) = (X(s)′Σ(θ̂ml)⁻¹X(s))⁻¹.        (5.46)

Thus, the variance-covariance matrix of β̂ml has the same form as that of equation (5.40), but with θ̂_IRWGLS replaced with θ̂ml.
For full ML estimation without profiling, the inverse of the information matrix for ω = [β′, θ′]′ can be written as (see Breusch, 1980, and Judge et al., 1985, p. 182)

    I(ω)⁻¹ = [ (X(s)′Σ(θ)⁻¹X(s))⁻¹                 0
                0                 2 (∆′(Σ(θ)⁻¹ ⊗ Σ(θ)⁻¹)∆)⁻¹ ],        (5.47)
Restricted (or residual) maximum likelihood (REML) estimates are often pre-
ferred over MLEs because the latter exhibit greater negative bias for estimates
of covariance parameters. The culprit of this bias—roughly—lies in the failure
of ML estimation to account for the number of mean parameters in the estima-
tion of the covariance parameters. The most famous—and simplest—example
is that of an iid sample from a G(µ, σ²) distribution, where µ is unknown. The MLE for σ² is

    σ̂²ml = (1/n) Σᵢ₌₁ⁿ (Yi − Ȳ)²,

which has bias −σ²/n. The REML estimator of σ² is unbiased:

    σ̂²reml = (1/(n − 1)) Σᵢ₌₁ⁿ (Yi − Ȳ)².
Similarly, in a regression model with Gaussian, homoscedastic, uncorrelated errors, the ML and REML estimators of the residual variance are

    σ̂²ml = (1/n) (Y − Xβ̂)′(Y − Xβ̂)

    σ̂²reml = (1/(n − k)) (Y − Xβ̂)′(Y − Xβ̂),

respectively. The ML estimator is again biased.
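A quick simulation makes the bias visible; the sample size and variance below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
n, sigma2, reps = 10, 4.0, 100_000
Y = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = ((Y - Y.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

print(ss.mean() / n)        # ML: close to sigma2 * (n-1)/n = 3.6
print(ss.mean() / (n - 1))  # REML: close to sigma2 = 4.0
```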
For the spatial model Z(s) ∼ G(X(s)β, Σ(θ)), the REML adjustment con-
sists of performing ML estimation not for Z(s), but for KZ(s), where the
((n − k) × n) matrix K is chosen so that E[KZ(s)] = 0 and rank[K] = n − k.
Because of these properties, the matrix K is called a matrix of error contrasts, which motivates the alternative name, residual maximum likelihood.
Although REML estimation is well-established in statistical theory and ap-
plications, in the geostatistical arena it appeared first in work by P. Kitanidis
and co-workers in the mid-1980’s (Kitanidis, 1983; Kitanidis and Vomvoris,
1983; Kitanidis and Lane, 1985).
An important aspect of REML estimation lies in the handling of β. In
§5.5.2, the β vector was profiled from the log likelihood to reduce the size of
the optimization problem. This led to the objective function ϕβ (θ; Z(s)) in
(5.43). When you consider ML estimation for the n − k vector KZ(s), the
fixed effects β have seemingly disappeared from the objective function. Minus
twice the log likelihood of KZ(s) is
    ϕR(θ; KZ(s)) = ln{|KΣ(θ)K′|} + (n − k) ln{2π} + Z(s)′K′(KΣ(θ)K′)⁻¹KZ(s).        (5.48)
This is an objective function in θ only. It is possible to write ϕR(θ; KZ(s)) in terms of an estimate β̂, and we will do so shortly to more clearly show the
relationship between the ML and REML objective functions. You need to
keep in mind, however, that there is no “REML estimator of β.” Minimizing
ϕR(θ; KZ(s)) yields θ̂reml. What is termed β̂reml and is computed as

    β̂reml = (X′Σ(θ̂reml)⁻¹X)⁻¹ X′Σ(θ̂reml)⁻¹Z(s)

is simply an EGLS estimator evaluated at θ̂reml.
We now rewrite (5.48) to eliminate the matrix K from the expression. First notice that if E[KZ(s)] = 0, then KX(s) = 0. If Σ(θ) is positive definite, Searle et al. (1992, pp. 451-452) show that

    K′(KΣ(θ)K′)⁻¹K = Σ(θ)⁻¹ − Σ(θ)⁻¹X(s)Ω(θ)X(s)′Σ(θ)⁻¹,

where Ω(θ) = (X(s)′Σ(θ)⁻¹X(s))⁻¹. This identity and Ω(θ)X(s)′Σ(θ)⁻¹Z(s) = β̂ yield

    Z(s)′K′(KΣ(θ)K′)⁻¹KZ(s) = r′Σ(θ)⁻¹r.
In fundamental work on REML estimation, Harville (1974, 1977) established
important further results. For example, he shows that if
    K′K = I − X(s)(X(s)′X(s))⁻¹X(s)′

and KK′ = I, then minus twice the log likelihood of KZ(s) can be written as

    ϕR(θ; KZ(s)) = ln{|Σ(θ)|} + ln{|X(s)′Σ(θ)⁻¹X(s)|} − ln{|X(s)′X(s)|} + r′Σ(θ)⁻¹r + (n − k) ln{2π}.
Harville (1977) points out that (n − k) × n matrices whose rows are linearly independent rows of I − X(s)(X(s)′X(s))⁻¹X(s)′ will lead to REML objective functions that differ by a constant amount. The amount does not depend on θ or β. The obvious choice as a REML objective function for minimization is

    ϕR(θ; KZ(s)) = ln{|Σ(θ)|} + ln{|X(s)′Σ(θ)⁻¹X(s)|} + r′Σ(θ)⁻¹r + (n − k) ln{2π}.
In this form, minus twice the REML log likelihood differs from (5.41) by the terms ln{|X(s)′Σ(θ)⁻¹X(s)|} and k ln{2π}. As with ML estimation, a scale parameter can be profiled from Σ(θ). The REML estimator of this parameter is

    σ̂²reml = (1/(n − k)) r′Σ(θ*)⁻¹r,

and upon substitution one obtains minus twice the profiled REML log likelihood

    ϕR,σ(θ*; KZ(s)) = ln{|Σ(θ*)|} + ln{|X(s)′Σ(θ*)⁻¹X(s)|} + (n − k) ln{σ̂²} + (n − k)(ln{2π} − 1).        (5.49)
Wolfinger, Tobias, and Sall (1994) give expressions for the gradient and Hes-
sian of the REML log likelihood with and without profiling of σ 2 .
There is a large literature on the use of ML and REML for spatial modeling
and this is an area of active research in statistics. Searle, Casella and McCul-
loch (1992) provide an introduction to REML estimation in linear models
and Littell, Milliken, Stroup and Wolfinger (1996) adapt some of these re-
sults to the spatial case. Cressie and Lahiri (1996) provide the distributional
properties of REML estimators in a spatial setting.
    σ̂²ok(s0) = Ĉ(0) − σ(θ̂)′Σ(θ̂)⁻¹σ(θ̂) + (1 − 1′Σ(θ̂)⁻¹σ(θ̂))²/(1′Σ(θ̂)⁻¹1).

p̂ok(Z; s0) is not the ordinary kriging predictor; it is an estimate thereof. Similarly, σ̂²ok(s0) is not the variance of the ordinary kriging predictor; it is an estimate thereof, and because of the nonlinear involvement of θ̂, it is a biased estimate of σ²ok(s0). More importantly, σ²ok(s0) is not the prediction error of p̂ok(Z; s0), but that of pok(Z; s0). Intuitively, one would expect the prediction error of p̂ok(Z; s0) to exceed that of the ordinary kriging predictor, because not knowing θ has introduced additional variability into the system; θ̂ is a random vector. So, if σ²ok(s0) is not the prediction error we should compute when p̂ok(Z; s0) is our predictor, are we not making things even worse by evaluating σ²ok(s0) at θ̂? Is that not a biased estimate of the wrong quantity?
The consequences of plug-in estimation for estimating variability are not specific to spatial models. To address the issues in more generality, we consider in this section a more generic model and notation. The model at issue is

    Z = Xβ + e,   e ∼ (0, Σ(θ)),

a basic correlated error model with a linear mean and parameterized variance matrix. In the expressions that follow, an overbar (ω̄) denotes a general estimator or predictor, a tilde (ω̃) denotes the quantity evaluated when θ is known, and a hat (ω̂) is used when the quantity is evaluated at the estimate θ̂.
Following Harville and Jeske (1992), our interest is in predicting a quantity ω with the properties

    E[ω] = x0′β,   Var[ω] = σ²(θ),   Cov[Z, ω] = σ(θ),

where x0′β is estimable. If ω̄ is any predictor of ω, then the quality of our prediction under squared-error loss is measured by the mean-squared error

    mse[ω̄, ω] = E[(ω̄ − ω)²].

A special case is that of Var[ω] = σ² = 0. The problem is then one of estimating x0′β, and

    mse[ω̄, ω] = mse[x0′β̄, x0′β] = x0′ E[(β̄ − β)(β̄ − β)′] x0.

If β̄ is unbiased for β, then this mean-squared error equals the variance of x0′β̄.

Under the stated conditions, the GLS estimator of β and the BLUP of ω are

    β̃ = (X′Σ(θ)⁻¹X)⁻¹X′Σ(θ)⁻¹Z

    ω̃ = x0′β̃ + σ(θ)′Σ(θ)⁻¹(Z − Xβ̃).

It is easy to establish that E[β̃] = β, and that ω̃ is unbiased in the sense that E[ω̃] = E[ω] = x0′β. The respective mean-squared errors are given by

    Var[β̃] = (X′Σ(θ)⁻¹X)⁻¹ ≡ Ω(θ)        (5.50)

    mse[ω̃, ω] = σ²(θ) − σ(θ)′Σ(θ)⁻¹σ(θ) + (x0 − X′Σ(θ)⁻¹σ(θ))′ Ω(θ) (x0 − X′Σ(θ)⁻¹σ(θ)).        (5.51)
The estimator β̂ and the predictor ω̂ remain unbiased. The question, then, is how to compute the variance of β̂ and the mean-squared error mse[ω̂, ω]. Before looking into the details, let us consider the plug-in quantity

    Ω(θ̂) = (X′Σ(θ̂)⁻¹X)⁻¹,

which is commonly used as an estimate of Var[β̂]. There are two major problems. First, it is not unbiased for Ω(θ); E[Ω(θ̂)] ≠ Ω(θ). We could use it as a biased estimator of (5.50), however. Second, even if it were unbiased, Ω(θ) is not the variance of β̂. We need

• an estimator of the mean-squared error that takes into account the fact that θ was estimated and hence that θ̂ is a random variable;
• a computationally feasible method for evaluating the mean-squared error.
Now let us return to the more general case of predicting ω with ω̂, keeping in mind that finding the variance of β̂ is a special case of determining mse[ω̂, ω]. Progress can be made by considering only those estimators θ̂ that have certain properties. For example, Kackar and Harville (1984) and Harville and Jeske (1992) consider even, translation-invariant estimators. Christensen (1991, Ch. VI.5) considers “residual-type” statistics; see also Eaton (1985). Suffice it to say that ML and REML estimators have the needed properties. Kackar and Harville (1984) decompose the prediction error into

    ω̂ − ω = (ω̃ − ω) + (ω̂ − ω̃) = e1(θ) + e2(θ).

If θ̂ is translation invariant, then e1(θ) and e2(θ) are distributed independently, and

    mse[ω̂, ω] = mse[ω̃, ω] + Var[ω̂ − ω̃].
By choosing to ignore the fact that θ was estimated, the mean-squared prediction error is underestimated by the amount Var[ω̂ − ω̃]. Hence, mse[ω̂, ω] ≥ mse[ω̃, ω]. If you follow the practice of plugging θ̂ into expressions that apply if θ were known, you would estimate mse[ω̂, ω] by evaluating (5.51) at θ̂. This yields the (estimated) mean-squared error of the wrong quantity, and can be substantially biased.
Kackar and Harville (1984) propose a correction term, and Harville and Jeske (1992) provide details about estimation. First, using a Taylor series, Var[ω̂ − ω̃] is approximated as tr{A(θ)B(θ)}, where

    A(θ) = Var[∂ω̃/∂θ]

    B(θ) = mse[θ̂, θ] = E[(θ̂ − θ)(θ̂ − θ)′].
biased, the nature of the covariance model used for Σ(θ), ' the spatial con-
figuration of the data, and the strength of spatial autocorrelation. Based on
examples in Zimmerman and Zimmerman (1991) and Zimmerman and Cressie
(1992), Zimmerman and Cressie (1992) offer the following general guidelines.
The performance of the plug-in mean-squared error estimator (mse[ω̃, ω] evaluated at θ̂, the estimated kriging variance) as an estimator of the true prediction mean-squared error can often be improved upon when the spatial correlation is weak, but it is often adequate, and sometimes superior to alternative estimators such as (5.53), when the spatial correlation is strong. Zimmerman and Cressie (1992) suggest that corrections of the type used in (5.53) should only be used when θ̂ is unbiased, or Σ(θ̂) is negatively biased, and the spatial correlation is
weak. In other words, the use of a plug-in estimator of the kriging variance is
fine for most spatial problems with moderate to strong spatial autocorrelation.
The development of the simple and ordinary kriging predictors requires no dis-
tributional assumptions other than those pertaining to the first two moments
of the random field. Thus, simple kriging is always the best linear predictor
and ordinary kriging is always the best linear unbiased predictor, regardless
of the underlying distribution of the data. The best predictor, i.e., the one
that minimizes the mean-squared prediction error, is given in (5.1), the con-
ditional expectation of Z(s0 ) given the observed data. When the data follow
a multivariate Gaussian distribution, this expectation is linear in the data
and is equivalent to the simple kriging predictor (5.10). For other distribu-
tions, this conditional expectation may not be linear and so linear predictors
may be poor approximations to this optimal conditional expectation. Statisticians often cope with such problems by transforming the data so that the transformed data follow a Gaussian distribution, and then performing the analysis on the transformed scale. In this section, we discuss several approaches to
constructing nonlinear predictors based on transformations of the data.
Since Y(s) is a Gaussian random field, psk(Y; s₀) = E[Y(s₀)|Y], and so

pslk(Z; s₀) = E[exp{Y(s₀)}|Y] = E[Z(s₀)|Z].

Thus, pslk(Z; s₀) is the optimal predictor of Z(s₀). The corresponding conditional variance, which is also the minimized mean-squared prediction error (MSPE), is (Chilès and Delfiner, 1999, p. 191)

Var[(pslk(Z; s₀) − Z(s₀))|Z] = (pslk(Z; s₀))²[exp{σ²sk(Y; s₀)} − 1].
When ordinary kriging is used to predict Y(s₀), the properties of the lognormal distribution can be used as before (see Cressie, 1993, pp. 135–136), and in this case the bias-corrected predictor of Z(s₀) is the ordinary lognormal kriging predictor

polk(Z; s₀) = exp{pok(Y; s₀) + σ²Y/2 − Var[pok(Y; s₀)]/2}
            = exp{pok(Y; s₀) + σ²ok(Y; s₀)/2 − mY}, (5.55)

where mY is the Lagrange multiplier obtained with ordinary kriging of Y(s₀) based on data Y. The bias-corrected MSPE (see, e.g., Journel, 1980; David, 1988, p. 118) is

E[(polk(Z; s₀) − Z(s₀))²] = exp{2µY + σ²Y} exp{σ²Y}
                            × {1 + exp{−σ²ok(Y; s₀) + mY}(exp{mY} − 2)}.
Thus, unlike ordinary kriging, we need to estimate µY and σ²Y as well as γY(·) in order to use lognormal kriging. Moreover, the optimality properties of polk(Z; s₀) are at best unclear. Finding a predictor that minimizes Var[p(Y; s₀) − Y(s₀)] within the class of linear, unbiased predictors of Y(s₀) (which in this case is pok(Y; s₀)) does not imply that polk(Z; s₀) minimizes Var[(p(Z; s₀) − Z(s₀))|Z] within the class of linear, unbiased predictors of Z(s₀).
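Once pok(Y; s₀), the kriging variance σ²ok(Y; s₀), and the Lagrange multiplier mY are available from the ordinary kriging system on the log scale, the back-transformation (5.55) is one line of arithmetic. A hedged sketch (the function name is ours):

```python
import numpy as np

def lognormal_ok_backtransform(p_ok, s2_ok, m_Y):
    """Bias-corrected ordinary lognormal kriging predictor (5.55).
    p_ok, s2_ok: OK predictor and kriging variance on the log (Y) scale;
    m_Y: Lagrange multiplier from the OK system.  Compare with the naive
    back-transform np.exp(p_ok), which is median- but not mean-unbiased."""
    return np.exp(p_ok + 0.5 * s2_ok - m_Y)
```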
The bias correction makes the ordinary lognormal kriging predictor sensi-
tive to departures from the lognormality assumption and to fluctuations in the
semivariogram (a criticism that applies to many nonlinear prediction meth-
ods and not just to lognormal kriging). Thus, some authors (e.g., Journel,
1980) have recommended calibration of polk (Z; s0 ), forcing the mean of kriged
predictions to equal the mean of the original Z data. This may be a useful
technique, but it is difficult to determine the properties of the resulting predic-
tor. Others (e.g., Chilès and Delfiner, 1999, p. 191) seem to regard mean unbi-
asedness as unnecessary, noting that exp{pok (Y; s0 )} is median unbiased (i.e.,
Pr(exp{pok (Y; s0 )} > Z(s0 )) = Pr(exp{pok (Y; s0 )} < Z(s0 )) = 0.5). Mar-
cotte and Groleau (1997) propose an interesting approach that works around
these problems. Instead of transforming the data, predicting Y (s0 ), and then
transforming back, they suggest predicting Z(s0 ) using the original data Z(s)
to obtain pok (Z; s0 ), and then transforming via p(Y; s0 ) = log{pok (Z; s0 )}.
Marcotte and Groleau (1997) suggest using E[Z(s0 )|p(Y; s0 )] as a predictor
of Z(s0 ). Using the properties of the lognormal distribution from Aitchison
and Brown (1957) given above, Marcotte and Groleau (1997) derive a compu-
tational expression for this conditional expectation that depends on µZ and
γZ (h), and is relatively robust to departures from the lognormality assump-
tion and to mis-specification of the semivariogram.
Although the theory of lognormal kriging has been developed and revis-
ited by many authors including Rendu (1979), Journel (1980), Dowd (1982),
and David (1988), problems with its practical implementation persist. David
(1988) gives several examples that provide some advice on how to detect and
correct problems with lognormal kriging and more modifications are provided
in Chilès and Delfiner (1999). Nonlinear spatial prediction is an area of active
research in geostatistics, and the last paragraph in Boufassa and Armstrong
(1989) seems to summarize the problems and the frustration: “The user of
geostatistics therefore is faced with the difficult task of choosing the most ap-
propriate stationary model for their data. This choice is difficult to make given
only information from a single realization. It would be helpful if statisticians
could devise a way of testing this.”
For unbiasedness, this should be equal to E[Z(s₀)], which can be obtained by applying the same type of expansion to ϕ(Y(s₀)), giving

E[Z(s₀)] = E[ϕ(Y(s₀))] ≈ ϕ(µY) + (ϕ″(µY)/2) E[(Y(s₀) − µY)²]. (5.57)

To make (5.56) equal to (5.57), we need to add

(ϕ″(µY)/2) E[(Y(s₀) − µY)²] − (ϕ″(µY)/2) E[(Ŷ₀ − µY)²] = (ϕ″(µY)/2)(σ²ok(Y; s₀) − 2mY)

to p(Z; s₀). Thus, the trans-Gaussian predictor of Z(s₀) is

ptg(Z; s₀) = ϕ(pok(Y; s₀)) + (ϕ″(µY)/2)(σ²ok(Y; s₀) − 2mY). (5.58)

The mean-squared prediction error of ptg(Z; s₀), based on just a first-order Taylor series expansion, is

E[(ptg(Z; s₀) − Z(s₀))²] ≈ [ϕ′(µY)]² σ²ok(Y; s₀). (5.59)
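When the data have been transformed by a Box-Cox transformation gλ (as in the rainfall example below), ϕ is its inverse, ϕ(y) = (1 + λy)^(1/λ) for λ ≠ 0 and ϕ(y) = exp(y) for λ = 0, and (5.58)–(5.59) can be evaluated directly. A sketch under that assumption (all names ours):

```python
import numpy as np

def phi(y, lam):     # inverse Box-Cox: Z = phi(Y); requires 1 + lam*y > 0
    return np.exp(y) if lam == 0 else (1.0 + lam * y) ** (1.0 / lam)

def phi1(y, lam):    # first derivative of phi
    return np.exp(y) if lam == 0 else (1.0 + lam * y) ** (1.0 / lam - 1.0)

def phi2(y, lam):    # second derivative of phi
    return (np.exp(y) if lam == 0
            else (1.0 - lam) * (1.0 + lam * y) ** (1.0 / lam - 2.0))

def trans_gaussian(p_ok, s2_ok, m_Y, mu_Y, lam):
    """Trans-Gaussian predictor (5.58) and first-order MSPE (5.59),
    with phi the inverse Box-Cox transform; a sketch under the
    assumptions of this section."""
    pred = phi(p_ok, lam) + 0.5 * phi2(mu_Y, lam) * (s2_ok - 2.0 * m_Y)
    mspe = phi1(mu_Y, lam) ** 2 * s2_ok
    return pred, mspe
```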
Figure 5.12 Locations of the 24 rainfall monitoring stations and associated weekly
rainfall amounts (mm). Data kindly provided by Dr. Victor De Oliveira, Department
of Mathematical Sciences, University of Arkansas.
of the weekly rainfall amounts (Figure 5.13) suggests that linear methods of
spatial prediction may not be the best choice for interpolating the rainfall
amounts. De Oliveira et al. (1997) consider a Box-Cox transformation
gλ(z) = (z^λ − 1)/λ   if λ ≠ 0
gλ(z) = log(z)        if λ = 0,
and estimate λ using maximum likelihood. The distribution of the Box-Cox
transformed data using λ̂ = −0.486 is shown in Figure 5.14. This transfor-
mation appears to have over-compensated for the skewness, and so we also
consider the normal scores transformation given in (5.60). A histogram of
the normal scores is shown in Figure 5.15. To use Trans-Gaussian kriging de-
[Figure 5.13 about here: histogram of the weekly rainfall amounts (percent versus rainfall, mm).]
Figure 5.14 Histogram of transformed weekly rainfall amounts using the Box-Cox
transformation.
Figure 5.15 Histogram of transformed weekly rainfall amounts using the normal
scores transformation.
[Empirical semivariogram figure about here: semivariance versus distance (km), with the number of pairs per lag bin printed above the estimates.]
construct a prediction interval (PI) for the omitted, true value. Thus, we can
compare each predicted value to the true value omitted from the analysis and
obtain measures of bias and the percentage of prediction intervals containing
the true values. The covariance parameters and the overall mean were re-
estimated for each of the 24 cross validation data sets.
With ordinary kriging, we describe the spatial autocorrelation in the orig-
inal rainfall amounts using both an exponential semivariogram in the above
parameterization and a spherical semivariogram model to examine the impact
of the semivariogram model.
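The cross-validation scheme just described is easy to reproduce. The sketch below implements leave-one-out ordinary kriging with an exponential semivariogram; for brevity it holds the semivariogram parameters fixed across deletions, whereas in the study above they were re-estimated for each of the 24 reduced data sets. All names and the specific parameterization are ours.

```python
import numpy as np

def exp_semivar(h, nugget, psill, rng):
    """Exponential semivariogram nugget + psill*(1 - exp(-h/rng)),
    with gamma(0) = 0 by convention."""
    return (nugget + psill * (1.0 - np.exp(-h / rng))) * (h > 0)

def ok_predict(coords, z, s0, theta):
    """Ordinary kriging of z at s0; theta = (nugget, partial sill, range).
    Returns the predictor and the kriging variance."""
    nugget, psill, rng = theta
    n = len(z)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = exp_semivar(d, nugget, psill, rng)
    A[n, :n] = 1.0
    A[:n, n] = 1.0                                 # unbiasedness constraint
    g0 = exp_semivar(np.linalg.norm(coords - s0, axis=1), nugget, psill, rng)
    sol = np.linalg.solve(A, np.append(g0, 1.0))
    lam, m = sol[:n], sol[n]
    return lam @ z, lam @ g0 + m                   # predictor, OK variance

def loo_cv(coords, z, theta):
    """Leave-one-out cross validation: delete each station in turn and
    predict it from the rest (theta fixed here for brevity)."""
    out = [ok_predict(np.delete(coords, i, axis=0), np.delete(z, i),
                      coords[i], theta) for i in range(len(z))]
    preds, variances = map(np.array, zip(*out))
    return preds, variances
```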
Predictions for Trans-Gaussian kriging based on the Box-Cox transformed
scores were obtained using (5.58) and the standard errors were obtained us-
ing (5.59). We also obtained a biased Trans-Gaussian predictor by ignoring the second term in (5.58), the term that depends on the Lagrange multiplier, mY. Note that ϕ(·) in §5.6.2 pertains to the original rainfall amounts, Z(sᵢ).
The transformed amounts are Y (s) = ϕ−1 (Z(s)) so that gλ (z) = ϕ−1 (z).
The normal scores transformation matches the cumulative probabilities that
define the distribution function of Z(s) to those of a standard normal distri-
bution. Thus, the transformed scores are just the corresponding percentiles
of the associated standard normal distribution. However, transforming back
after prediction at a new location is tricky, since the predicted values will
not coincide with actual transformed scores for which there is a direct link
back to the original rainfall amounts. Thus, for predictions that lie within
two consecutively-ranked values, predictions on the original scale are obtained
This should be relatively small for methods that are accurate (precise as
well as unbiased).
• The average percentage increase in prediction standard errors due to the
Prasad-Rao adjustment;
• The percentage of 95% prediction intervals containing the true value. This
should be close to 95%.
The results of the cross validation study are shown in Table 5.4.
From Table 5.4 we see that all empirical PI coverages were surprisingly close to nominal. Also surprisingly, the bias-corrected version of the Trans-Gaussian kriging predictor has the largest bias and the largest absolute relative error. Predictions from the bias-corrected version of Trans-Gaussian kriging are all higher than those obtained without this bias correction. Cressie
(1993, p. 137) notes that the approximations underlying the derivation of the bias-corrected Trans-Gaussian predictor rely on the kriging variance of the
transformed variables being small. On the average, for the cross-validation
study presented here, this variance was approximately 0.035, so perhaps this
is not “small” enough. Also, the Box-Cox transformation tends to over-correct
the skewness in the distribution of the rainfall amounts and this may be im-
pacting the accuracy of the predictions from Trans-Gaussian kriging (Figure
5.14). The predictions from the normal scores transformation were relatively
(and surprisingly) accurate, considering no effort was made to adjust for bias
and the amount of information lost during back-transformation. The Prasad-
Rao/Kenward-Roger adjustment did change the standard errors slightly, but
not enough to result in different inferences. However, in this example, the spatial autocorrelation is very strong (ρ̂ is between 0.92 and 0.98). Following the recommendation of Zimmerman and Cressie (1992) on page 267, the adjustment may not be needed in this case.
The empirical probability of coverage from the normal scores was below
nominal, but we used a fairly conservative method for back transformation
in the tails of the distribution that may be affecting these results. Using a
percentage point from a t-distribution to construct these intervals had little
impact on their probability of coverage. Ordinary kriging seems to do as well
as the other techniques and certainly requires far less computational (and
cerebral) effort. However, the distribution of the rainfall amounts (Figure 5.13)
is not tremendously skewed or long-tailed, so the relative accuracy of ordinary
kriging in this example may be somewhat misleading. On the other hand,
perhaps the effort spent in correcting departures from assumptions should
be directly proportional to their relative magnitude and our greatest efforts
should be spent on situations that show gross departures from assumptions.
When g(Z(s₀)) = Z(s₀), this is called the "E-type estimate" of Z(s₀) (Deutsch and Journel, 1992, p. 76). A measure of uncertainty is given by

σ²(s₀) = ∫ [p(Z, g(Z(s₀))) − g(z)]² dF̂(s₀, z | Z(s)).

If the partition is sufficiently fine, i.e., there are many subsets Rₖ, then any function g(Z(s₀)) can be approximated by a linear combination of these indicator functions:

g(Z(s₀)) = g₁I₁(s₀) + g₂I₂(s₀) + g₃I₃(s₀) + ⋯ .
In the situation we describe here, each Iₖ(s₀) is unknown, but we can obtain a predictor of any Iₖ(s₀) using indicator kriging of the data associated with the kth set, {Iₖ(sᵢ), i = 1, ⋯, n}. However, as discussed above, this does not make optimal use of all the indicator information. Another approach is to obtain a predictor of Iₖ(s₀) using not only the data associated with the kth set, but also the data associated with all of the other sets, i.e., use data {I₁(sᵢ), i = 1, ⋯, n}, ⋯, {Iₖ(sᵢ), i = 1, ⋯, n}, ⋯. Thus, if we use a linear combination of all available indicator data to predict each Iₖ(s₀), the predictor can be written as

Îₖ(s₀) = Σᵢ Σₖ λᵢₖ Iₖ(sᵢ),

and a predictor of g(Z(s₀)) is then

p(Z, g(Z(s₀))) = Σᵢ Σₖ gₖᵢ Iₖ(sᵢ) ≡ Σᵢ gᵢ(Z(sᵢ)). (5.63)
The weight function determines the system {χp(x)}. For example, taking w(x) = f(x) = (2π)^(−1/2) exp{−x²/2} over the interval (−∞, ∞), the standard Gaussian density function, gives the system of Chebyshev-Hermite polynomials (Stuart and Ord, 1994, pp. 226–228)

Hp(x) = (−1)^p e^(x²/2) (dᵖ/dxᵖ) e^(−x²/2). (5.64)

They are called polynomials since Hp(x) is in fact a polynomial of degree p:

H₀(x) = 1
H₁(x) = x
H₂(x) = x² − 1
⋮
Hp+1(x) = xHp(x) − pHp−1(x).
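The three-term recurrence makes these polynomials trivial to generate. A small sketch, which also verifies the orthonormality relations below by Monte Carlo, assuming the normalization ηp(x) = Hp(x)/√(p!) (the definition of ηp precedes this excerpt):

```python
import numpy as np
from math import factorial

def hermite_cheb(x, pmax):
    """Chebyshev-Hermite polynomials H_0, ..., H_pmax via the recurrence
    H_{p+1}(x) = x*H_p(x) - p*H_{p-1}(x) given in the text."""
    x = np.asarray(x, dtype=float)
    H = [np.ones_like(x), x.copy()]
    for p in range(1, pmax):
        H.append(x * H[p] - p * H[p - 1])
    return H

# Monte Carlo check of orthonormality under the standard Gaussian density,
# assuming eta_p(x) = H_p(x) / sqrt(p!):
x = np.random.default_rng(1).standard_normal(200_000)
H = hermite_cheb(x, 4)
eta = [H[p] / np.sqrt(factorial(p)) for p in range(5)]
print(np.mean(eta[2] * eta[2]), np.mean(eta[2] * eta[3]))  # ~1 and ~0
```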
Var[ηp(x)] = ∫ (ηp(x))² f(x) dx = 1

E[ηp(x)ηm(x)] = 0 if m ≠ p, and 1 if m = p.

Thus, the polynomials ηp(x) form an orthonormal basis on L² with respect to the standard Gaussian density, and we can expand any measurable function g(x) as

g(x) = Σ_{p=0}^{∞} bp ηp(x).
Suppose that we now want to predict g(Z(s₀)) using the predictor given in (5.63). Then, from the above development,

g(Z(s₀)) = Σ_{p=0}^{∞} b₀p ηp(Z(s₀)), (5.66)

and

gᵢ(Z(sᵢ)) = Σ_{p=0}^{∞} bᵢp ηp(Z(sᵢ)).

If we could predict ηp(Z(s₀)) from the available data, then we would have a predictor of g(Z(s₀)). Predicting ηp(Z(s₀)) from ηp(Z(sᵢ)) is now an easier task, since ηp(Z(s)) and ηm(Z(s)) are uncorrelated for p ≠ m. Thus, we can use ordinary kriging based on the data ηp(Z(s₁)), ⋯, ηp(Z(sₙ)) to predict ηp(Z(s₀)). To obtain the kriging equations, we need the covariance between ηp(Z(s + h)) and ηp(Z(s)). Matheron (1976) showed that if Z(s + h) and Z(s) are bivariate Gaussian with correlation function ρ(h), then for p ≥ 1 this covariance is

Cov[ηp(Z(s + h)), ηp(Z(s))] = [ρ(h)]^p.
Thus, the optimal predictor of ηp(Z(s₀)) is (Chilès and Delfiner, 1999, p. 393)

η̂p(Z(s₀)) = p({ηp(Z(sᵢ))}; ηp(Z(s₀))) = Σ_{i=1}^{n} λpᵢ ηp(Z(sᵢ)), p = 1, 2, ⋯, (5.67)
Often, the coefficients bp will be zero in the Hermite expansion (5.66). Also, the correlation function [ρ(h)]^p tends to that of an uncorrelated white noise process as p becomes large. Thus, in practice only a few (usually fewer than a dozen; Rivoirard, 1994, p. 43) Hermite polynomials need to be predicted for disjunctive kriging.
Rivoirard (1994) gives some excellent elementary examples that can be per-
formed with a calculator to show how these expansions and disjunctive kriging
work in practice with actual data.
where Fi,j (dxi , dxj ) is a bivariate distribution with marginals F (dxi ) and
F (dxj ), and the χm (z) are orthonormal polynomials with respect to some
probability measure F (dx). In kriging the polynomials, the covariances needed
for the kriging equations are given by the Tm (i, j). These are inferred from as-
sumptions pertaining to the bivariate distribution of the pairs (Z(si ), Z(sj )).
For example, as we noted above, if (Z(sᵢ), Z(sⱼ)) is bivariate Gaussian with correlation function ρ(h), then Tm(i, j) = [ρ(||sᵢ − sⱼ||)]^m. However, to actually predict the factors, we need to know (and parametrically model) Tm(h). The general form of Tm(h) has been worked out in special cases (see Chilès and Delfiner, 1999, pp. 398–413), but many of the models seem contrived, or there are undesirable constraints on the form of the Tm(i, j) needed to ensure a valid bivariate distribution. Thus, Gaussian disjunctive kriging
remains the isofactorial model that is most commonly used in practice.
In the previous sections we have assumed that the data were located at
“points” within a spatial domain D and that the inferential goal was pre-
diction at another “point” in D. However, spatial data come in many forms.
Instead of measurements associated with point locations, we could have mea-
surements associated with lines, areal regions, surfaces, or volumes. In geology
and mining, observations often pertain to rocks, stratigraphic units, or blocks
of ore that are three dimensional. The inferential goal may also not be lim-
ited to point predictions. We may want to predict the grade of a volume of
ore or estimate the probability of contamination in a volume of soil. Data
associated with areal regions are particularly common in geographical studies
where counts or rates are obtained as aggregate measures over geopolitical re-
gions such as counties, Census tracts, and voting districts. In many instances,
spatial aggregation is necessary to create meaningful units for analysis. This
latter aspect was perhaps best described by Yule and Kendall (1950, p. 312),
when they stated “... geographical areas chosen for the calculation of crop
yields are modifiable units and necessarily so. Since it is impossible (or at
any rate agriculturally impractical) to grow wheat and potatoes on the same
piece of ground simultaneously we must, to give our investigation any mean-
ing, consider an area containing both wheat and potatoes and this area is
where the weights are chosen to minimize the mean-squared prediction error E[(p(Z; Z(B)) − Z(B))²]. Since E[Z(s)] = µ and E[Z(B)] = µ, the same ideas used in the development of the ordinary kriging predictor in §5.2.2 can be applied to the prediction of Z(B). This leads to the block kriging predictor

p(Z; Z(B)) = Σ_{i=1}^{n} λᵢ Z(sᵢ),

where the optimal weights {λᵢ} are obtained by solving (Journel and Huijbregts, 1978; Chilès and Delfiner, 1999)

Σ_{k=1}^{n} λₖ C(sᵢ, sₖ) − m = Cov[Z(B), Z(sᵢ)], i = 1, ⋯, n;
Σ_{i=1}^{n} λᵢ = 1. (5.74)
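In practice the point-to-block covariances Cov[Z(B), Z(sᵢ)] are approximated by averaging the point covariance over a discretization of B, and (5.74) is solved as a bordered linear system. A sketch under those assumptions (all names ours; cov is an isotropic point-support covariance function of distance):

```python
import numpy as np

def block_kriging(coords, z, block_pts, cov):
    """Block kriging sketch: solve (5.74) with Cov[Z(B), Z(si)] approximated
    by the average point covariance between si and a discretization of B.
    coords: (n,2) data locations; block_pts: (N,2) points discretizing B."""
    n = len(z)
    D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    C = cov(D)                                    # C(si, sk)
    d0 = np.linalg.norm(coords[:, None, :] - block_pts[None, :, :], axis=-1)
    c0 = cov(d0).mean(axis=1)                     # approx. Cov[Z(B), Z(si)]
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = C
    A[:n, n] = -1.0                               # Lagrange multiplier m
    A[n, :n] = 1.0                                # constraint: sum(lambda) = 1
    sol = np.linalg.solve(A, np.append(c0, 1.0))
    lam = sol[:n]
    return lam @ z, lam                           # p(Z; Z(B)) and weights

# hypothetical usage with an exponential covariance (sill 1, range 2):
# pred, lam = block_kriging(coords, z, block_pts, lambda h: np.exp(-h / 2.0))
```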
Block kriging can also be carried out using the semivariogram. The relationship between the semivariogram associated with Z(B) and that associated with the underlying process of point support Z(s) is given by (Cressie, 1993, p. 16)

2γ(Bᵢ, Bⱼ) = −(1/|Bᵢ|²) ∫_{Bᵢ} ∫_{Bᵢ} γ(u − v) du dv
            − (1/|Bⱼ|²) ∫_{Bⱼ} ∫_{Bⱼ} γ(u − v) du dv
            + (2/(|Bᵢ||Bⱼ|)) ∫_{Bᵢ} ∫_{Bⱼ} γ(u − v) du dv, (5.77)

where 2γ(u − v) = Var[Z(u) − Z(v)] is the variogram of the point-support process {Z(s)}.
Figure 5.17 Point-to-point (dashed line) and point-to-block semivariances (solid line)
near the prediction location at s = 3.
through (5.73) and prediction of Z(B) is desired. The optimal linear predictor of Z(B) based on data {Z(Aᵢ)} is Ẑ(B) = Σ_{i=1}^{n} λᵢ Z(Aᵢ), where the optimal weights {λᵢ} are solutions to the equations obtained by replacing the point-to-point covariances C(sᵢ, sₖ) and the point-to-block covariances Cov[Z(B), Z(sᵢ)] in (5.74) with

Cov[Z(Aᵢ), Z(Aₖ)] = (1/(|Aᵢ||Aₖ|)) ∫_{Aₖ} ∫_{Aᵢ} C(u, v) du dv, (5.78)

and

Cov[Z(B), Z(Aᵢ)] = (1/(|B||Aᵢ|)) ∫_{B} ∫_{Aᵢ} C(u, v) du dv.
Because data on any support can be built from data with point support, these relationships can be used for the case when |Aᵢ| < |B| (aggregation), the case when |B| < |Aᵢ| (disaggregation), and also the case of overlapping units on essentially the same scale. However, unlike the previous situation, where we observed point-support data and could easily estimate the point-support covariance function C(u, v), in practice this function is more difficult to infer from aggregate data. If we assume a parametric model, γ(u − v; θ), for γ(u − v), a generalized estimating equations (GEE) approach can be used to estimate θ (see McShane et al., 1997). Consider the squared differences

Y⁽¹⁾ᵢⱼ = (Z(Bᵢ) − Z(Bⱼ))². (5.79)
Note that E[Z(Bᵢ) − Z(Bⱼ)] = 0 and E[Y⁽¹⁾ᵢⱼ] = 2γ(Bᵢ, Bⱼ; θ). Taking an identity working variance-covariance matrix for the Y⁽¹⁾ᵢⱼ, the generalized estimating equations are

U(θ; {Y⁽¹⁾ᵢⱼ}) = 2 Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} (∂γ(Bᵢ, Bⱼ; θ)/∂θ) [Y⁽¹⁾ᵢⱼ − 2γ(Bᵢ, Bⱼ; θ)] ≡ 0. (5.80)
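With an identity working covariance, (5.80) is exactly the set of normal equations of a nonlinear least-squares fit of 2γ(Bᵢ, Bⱼ; θ) to the squared differences, so a generic least-squares solver can be used. A sketch assuming an exponential point-support semivariogram and numerical evaluation of the block-average semivariances from (5.77); the model choice and all names are ours:

```python
import numpy as np
from scipy.optimize import least_squares

def gamma_exp(h, theta):
    """Exponential point-support semivariogram, theta = (nugget, sill, range)."""
    return (theta[0] + theta[1] * (1.0 - np.exp(-h / theta[2]))) * (h > 0)

def gamma_bar(Bi, Bj, theta):
    """Average of gamma(u - v) over discretizations of two blocks,
    a numerical stand-in for the integrals in (5.77)."""
    d = np.linalg.norm(Bi[:, None, :] - Bj[None, :, :], axis=-1)
    return gamma_exp(d, theta).mean()

def block_gamma(Bi, Bj, theta):
    """gamma(Bi, Bj; theta) from (5.77)."""
    return (2.0 * gamma_bar(Bi, Bj, theta) - gamma_bar(Bi, Bi, theta)
            - gamma_bar(Bj, Bj, theta)) / 2.0

def fit_gee(blocks, zB, theta0):
    """Solve (5.80): with an identity working covariance these are the
    normal equations of this least-squares problem.  blocks: list of
    (m,2) arrays discretizing each Bi; zB: block data Z(Bi)."""
    pairs = [(i, j) for i in range(len(zB)) for j in range(i + 1, len(zB))]
    def resid(theta):
        return np.array([(zB[i] - zB[j]) ** 2
                         - 2.0 * block_gamma(blocks[i], blocks[j], theta)
                         for i, j in pairs])
    return least_squares(resid, theta0, bounds=(1e-8, np.inf)).x
```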
In many cases, E[Z(B)|Z(s)] is not linear in the data Z(s) and, in others,
prediction of a nonlinear function of Z(B) is of interest. These problems re-
quire more information about the conditional distribution of Z(B) given the
data, FB (z|Z(s)) = Pr(Z(B) ≤ z|Z(s)), than that used for linear prediction.
Moreover, in many cases, such as mining and environmental remediation, the
quantity Pr(Z(B) > z|Z(s)) has meaning in its own right (e.g., proportion
of high-grade blocks available in mining evaluation or the risk of contamina-
tion in a volume of soil). Nonlinear geostatistics offers solutions to COSPs
that arise in this context. The multi-Gaussian approach (Verly, 1983) to non-
linear prediction in the point-to-block COSP assumes that available point
data Z(s1 ), · · · , Z(sn ) can be transformed to Gaussian variables, {Y (s)}, by
Z(s) = ϕ(Y(s)). The block B is discretized into points {u′ⱼ, j = 1, ⋯, N}, and Z(B) is approximated as

Z(B) ≈ (1/N) Σ_{j=1}^{N} Z(u′ⱼ). (5.81)
Then

FB(z | Z(s)) ≈ Pr[(1/N) Σ_{j=1}^{N} Z(u′ⱼ) < z | Z(s)]
            = Pr[Σ_{j=1}^{N} ϕ(Y(u′ⱼ)) < Nz | Y(s₁), Y(s₂), ⋯, Y(sₙ)].
This probability is estimated through simulation (see Chapter 7). The vector Y(u) = [Y(u′₁), ⋯, Y(u′N)]′ is simulated from the conditional distribution of Y(u) | Y(s). Since Y is Gaussian, this conditional distribution can be obtained by kriging, and simulation is straightforward. Then, FB(z | Z(s)) is estimated as the proportion of simulated vectors satisfying Σ_{j=1}^{N} ϕ(Y(u′ⱼ)) < Nz.
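The conditional simulation step is standard multivariate Gaussian conditioning. The sketch below assumes a zero-mean Y process (so that simple kriging applies), an isotropic covariance function cov, and a vectorized back-transform phi such as np.exp; it illustrates the recipe above and is not the authors' code.

```python
import numpy as np

def block_cdf_estimate(coords, y, upts, cov, phi, z, nsim=5000, seed=0):
    """Multi-Gaussian estimate of F_B(z | Z(s)) = Pr(Z(B) <= z | data):
    simulate Y(u) | Y(s) by Gaussian conditioning, back-transform with
    Z = phi(Y), and average over the block discretization upts."""
    Css = cov(np.linalg.norm(coords[:, None] - coords[None, :], axis=-1))
    Cus = cov(np.linalg.norm(upts[:, None] - coords[None, :], axis=-1))
    Cuu = cov(np.linalg.norm(upts[:, None] - upts[None, :], axis=-1))
    W = Cus @ np.linalg.inv(Css)
    m = W @ y                                   # conditional (kriging) mean
    V = Cuu - W @ Cus.T                         # conditional covariance
    L = np.linalg.cholesky(V + 1e-10 * np.eye(len(upts)))  # jitter for PD
    rng = np.random.default_rng(seed)
    sims = m[:, None] + L @ rng.standard_normal((len(upts), nsim))
    zB = phi(sims).mean(axis=0)                 # simulated block averages
    return np.mean(zB <= z)                     # proportion below z
```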
If, instead of point support data, data Z(A1 ), · · · , Z(An ), |Ai | < |B|, are
available, this approach can still be used provided an approximation similar
to that of equation (5.81) remains valid. More general COSP models based on
the multi-Gaussian approximation may be possible by building models from
data based on point support as described in §5.7.1.
Consider again indicator data I(s, z) = [I(s₁, z), ⋯, I(sₙ, z)]′, derived from the indicator transform in (5.61). From §5.6.3, indicator kriging provides an estimate of F_{s₀}(z | Z(s)) = Pr(Z(s₀) < z | Z(s)). For nonlinear prediction in the point-to-block COSP, it is tempting to use block kriging, described in §5.7.1, with the indicator data. However, this will yield a predictor of

I*(B) = (1/|B|) ∫_{B} I(Z(s) ≤ z) ds,

rather than of the block indicator I(B) = I(Z(B) ≤ z), which has the expansion

I(B) = F(z) + f(z) Σ_{p=1}^{∞} (−1)^p Hp−1(z)Hp(Z(B))/p!.
These are analogous to those in §5.6.4, but adapted to the point-to-block COSP through the term [Cov[Z(sⱼ), Z(B)]]^p. They also have a more general form in the case of isofactorial models (§5.6.4.3):

Σ_{i=1}^{n} λpᵢ Tp(i, j) = Tp(B, j), j = 1, ⋯, n.
λ′Z(s) as a predictor of Z(B), i.e., for linear prediction. Thus, g(λ′Z(s)) will not be optimal for g(Z(B)), but the advantage of constrained kriging is that the weights depend only on C(u, v), the point-to-point covariance, and the range of g(λ′Z(s)) exactly matches that of g(Z(B)). Simulations in Cressie (1993b) and Aldworth and Cressie (1999) indicate that accurate nonlinear predictions of aggregate data can be made using this approach. An extension of this, covariance-matching constrained kriging, has been shown to have even better mean-squared prediction error properties (Aldworth and Cressie, 2003).
We showed in sections §5.1 and §5.2 that it was possible to determine the best
predictor, E[Z(s0 )| Z(s)], when the data follow a multivariate Gaussian distri-
bution. When the assumption of a Gaussian distribution is relaxed, we must
either re-define the notion of an optimal predictor by imposing additional cri-
teria such as linearity and unbiasedness, or transform the data to a Gaussian
distribution in order to make use of its nice theoretical properties. This led
us to ask the question: Why is there such a dependence on the multivari-
ate Gaussian distribution in spatial statistics? This section explores several
answers to this crucial question. Additional discussion is given in §6.3.3 and
§7.4.
For simplicity, we begin by considering bivariate distributions. Let Z1 and
Z2 be two random variables with bivariate distribution function F12 (z1 , z2 ) =
Pr(Z1 ≤ z1 , Z2 ≤ z2 ). The marginal distributions F (z1 ) and F (z2 ) can be
obtained from the bivariate distribution F (z1 , z2 ) as
F1 (z1 ) = F12 (z1 , ∞); F2 (z2 ) = F12 (∞, z2 ).
A well-known example is that of the bivariate Gaussian distribution, where

F₁₂(z₁, z₂) = ∫_{−∞}^{z₁} ∫_{−∞}^{z₂} (2πσ₁σ₂√(1 − ρ²))⁻¹
    × exp{ −(1/(2(1 − ρ²))) [ ((u₁ − µ₁)/σ₁)² − 2ρ((u₁ − µ₁)/σ₁)((u₂ − µ₂)/σ₂) + ((u₂ − µ₂)/σ₂)² ] } du₂ du₁,

with −1 < ρ < 1, F₁(z₁) ∼ G(µ₁, σ₁²) and F₂(z₂) ∼ G(µ₂, σ₂²). The question
of interest in this section is: Can we go the other way, i.e., given F1 (z1 ) and
F2 (z2 ) can we construct F12 (z1 , z2 ) such that its marginals are F1 (z1 ) and
F2 (z2 ) and Corr[Z1 , Z2 ] = ρ? The answer is “yes” but as we might expect,
there may be some caveats depending on the particular case of interest.
There are several different ways to construct bivariate (and multivariate)
distributions (Johnson and Kotz, 1972; Johnson, 1987):
If σ 2 , µ(s), and µ(s+h) are small, Corr[Z2 (s), Z2 (s+h)] << ρ1 (h). For exam-
ple, taking σ 2 = 1 and µ(s) = µ(s + h) = 1, Corr[Z2 (s), Z2 (s + h)] = ρ1 (h)/2.
Thus, while the conditioning induces both overdispersion and autocorrelation
in the Z2 process, the marginal correlation has a definite upper bound and so
may not be a good model for highly correlated data.
Given all of this discussion (and entire books reflecting almost 50 years of
research in this area), it is now easy to see why the multivariate Gaussian dis-
tribution is so popular: it has a closed form expression, permits pairwise cor-
relations in (−1, 1), each (Zi , Zj ) has a bivariate Gaussian distribution whose
moments can be easily derived, all marginal distributions are Gaussian, and all
conditional distributions are Gaussian. Moreover, (almost) equally tractable
multivariate distributions can be derived from the multivariate Gaussian (e.g.,
the multivariate lognormal and the multivariate t-distribution) and these play
key roles in classical multivariate analysis. Thus, the multivariate Gaussian
distribution has earned its truly unique place in statistical theory.
Note that in geostatistical modeling, we are working with multivariate
data, i.e., rather than just considering Fij (zi , zj ) we must be concerned with
F1,2,···,n (z1 , z2 , · · · , zn ) and the relationships permitted under this multivari-
ate distribution. Herein lies the problem with the nonparametric indicator
approaches and non-Gaussian disjunctive kriging models: they attempt to
build a multivariate distribution from bivariate distributions. With indicator
kriging this is done through indicator semivariograms, and with disjunctive
kriging it is done through isofactorial models. From the above discussion, we
have to wonder if there is indeed a multivariate distribution that gives rise to
these bivariate distributions. Sometimes, this consideration may seem like just
a theoretical nuisance. However, in some practical applications it can cause
difficulties, e.g., “covariance” matrices that are not positive definite, numer-
ical instability, and order-relations problems. These ideas are important to
keep in mind as we go on to consider more complex models for spatial data
in subsequent chapters.
Problem 5.1 The prediction theorem states that for any random vector U and any random variable Y we have either of the following:
• For every function g, E[(Y − g(U))²] = ∞;
• E[(Y − E[Y | U])²] ≤ E[(Y − g(U))²] for every g, with equality only if g(U) = E[Y | U].
Hence, the conditional expectation is the best predictor under squared error loss. Prove this theorem.
Problem 5.2 Consider prediction under squared error loss. Let p0 (Z; s0 ) =
E[Z(s0 )|Z(s)]. Establish that
E[(Z(s0 ) − p0 (Z; s0 ))2 ] = Var[Z(s0 )] − Var[p0 (Z; s0 )].
Problem 5.4 Let the random variables X and Y have joint density f (x, y).
Show that the best linear unbiased predictor of Y based on X is given by the
linear regression function
BLUP(Y|X) = α + βX,
where β = Cov[X, Y ]/Var[X] and α = E[Y ] − βE[X].
Problem 5.5 (adapted from Goldberger, 1991, p. 55) Consider the joint mass
function p(x, y) in the following table.
X
Y x=1 x=2 x=3
y=0 0.15 0.10 0.30
y=1 0.15 0.30 0.00
Problem 5.6 Let random variables X and Y have joint probability density function

f(x, y) = (6/7)(x + y)², 0 ≤ x ≤ 1; 0 ≤ y ≤ 1.

Find E[Y|X] and BLUP(Y|X), and compare the mean-squared prediction errors for the two predictors.
Problem 5.7 The simple kriging predictor (5.10) on page 223 is the solu-
tion to the problem of best linear prediction with known mean. Show that
psk (Z; s0 ) not only is an extremum of the mean squared prediction error
E[(p(Z; s0 ) − Z(s0 ))2 ], but that it is a minimum. Assume that Var[Z(s)] is
positive definite.
Problem 5.8 Show that (5.14) and (5.15) are the solution to the ordinary kriging problem in §5.2 (page 227). Verify the formula (5.16) for σ²ok(s₀).
Problem 5.9 Verify that (5.17) is the ordinary kriging predictor, provided µ̂ is the generalized least squares estimate of µ.
Problem 5.11 Refer to the seven point ordinary kriging Example 5.5, p. 229.
Repeat the example with a prediction point that falls outside the hull of the
observed points, for example, s0 = [50, 20] or s0 = [60, 10].
Problem 5.12 Assume that you observe in the domain A a spatial random
field {Z(s) : s ∈ A ⊂ R2 } with covariance function C(h). Further let Bi and
Bj be some regions in A with volumes |Bi | and |Bj |, respectively. Show that
Cov[Z(Bᵢ), Z(Bⱼ)] = (1/(|Bᵢ||Bⱼ|)) ∫_{Bᵢ} ∫_{Bⱼ} C(u, v) du dv.
x y x y x y x y x y
3.03 0.03 3.10 −0.20 3.23 −0.44 3.34 0.48 3.42 −0.79
3.51 −0.35 3.62 −0.21 3.72 −1.02 3.80 −1.49 3.91 −1.78
4.04 −0.43 4.15 −0.91 4.23 −1.33 4.32 −1.41 4.44 −1.14
4.53 −1.67 4.65 −1.36 4.73 −0.96 4.85 −1.15 4.92 −1.27
5.02 −0.37 5.12 −1.25 5.24 −0.96 5.31 −0.44 5.44 −2.08
5.51 −0.68 5.63 −1.23 5.72 −0.82 5.81 −0.51 5.92 −0.45
6.01 −0.40
CHAPTER 6

Spatial Regression Models
of data from a random field process into large-scale trend µ(s), smooth, small-scale variation W(s), micro-scale variation η(s), and measurement error ε(s).
This decomposition was also used to formulate statistical models for spatial
prediction in the previous chapter. For example, the ordinary kriging predictor was obtained for µ(s) = µ, the universal kriging predictor for µ(s) = x′(s)β.
The focus in the previous chapter was on spatial prediction; predicting Z(s)
or the noiseless S(s) = µ(s) + W (s) + η(s) at observed or unobserved lo-
cations. Developing best linear unbiased predictors ultimately required best
linear unbiased estimators of µ and β. The fixed effects β were important in
that they need to be properly estimated to account for a spatially varying
mean and to avoid bias. The fixed effects were not the primary focus of the
analysis, however. They were essentially nuisance parameters. The covariance
parameters θ were arguably of greater importance than the parameters of the
mean function, as θ drives the various prediction equations and the precision
of the predictors along with the model chosen for Σ = Var[e(s)].
Statistical practitioners are accustomed to the exploration of relationships
among variables, modeling these relationships with regression and classifica-
tion (ANOVA) models, testing hypotheses about regression and treatment
effects, developing meaningful contrasts, and so forth. When first exposed to
spatial statistics, the practitioner often appears to abandon these classical
lines of data inquiry—that focus on aspects of the mean function—in favor
of spatial prediction and the production of colorful maps. What happened?
When you analyze a field experiment with spatially arranged experimental
units, for example, you can rely on randomization theory or on a spatial model
as the framework for statistical inference (more on the distinction below). In
either case, the goal is to make decisions about the effects of the treatments
applied in the experiment. And since the treatment structure is captured in
the mean function—unless treatment levels are selected at random—we cannot treat µ(s) as a nuisance. It is central to the inquiry.
In this chapter we discuss models for spatial data analysis where the focus is
on modeling and understanding the mean function. In a reversal from Chapter
5, the covariance parameters may, at times, take on the role of the nuisance
trends. Making poor design choices does not affect the validity of cause-and-
effect inferences in design-based analyses under randomization. It only makes
it difficult to detect treatment differences because of a large experimental
error variance. When experimental data are subjected to modeling, it is pos-
sible to increase the statistical precision of treatment contrasts. The ability
to draw cause-and-effect conclusions has been lost, however, unless it can be
established that the model is correct.
Some statisticians take general exception with the modeling of experimen-
tal data, whether its focus is on the mean or the covariance structure of the
data, because it is not consistent with randomization inference. Any devi-
ation from the statistical model that reflects the execution of the particular
design is detrimental in their view. We agree that you should “analyze ’em the
way you randomize ’em,” whenever possible; this is the beauty of design-
based inference. Nevertheless, we also know from experience that things can
go wrong and that scientists want to make the most of the data they have
worked so hard to collect. Thus, modeling of experimental data should also
be a choice, provided we attach the important caveat that modeling experi-
mental data does not lend itself to cause-and-effect inferences. If, for example,
blocking has been carried out too coarsely to provide a reduction in experi-
mental error variance substantial enough to yield smaller standard errors of
treatment contrasts than an analysis that accounts for heterogeneity outside
of the error-control design, why not proceed down that road?
When model errors are uncorrelated, the standard linear model battery can
be brought to bear and analyses are particularly simple. In designed experi-
ments, design-based analysis based on randomization theory does not need to
explicitly model spatial dependencies, since these are neutralized through ran-
domization at the spatial scale of the experimental unit. If spatial structure is
present at scales smaller or larger than the unit, or if data are observational,
or if one does not want to adopt a randomization framework for inference,
uncorrelated errors are justified if all of the spatial variation is captured with,
accounted for, or explained by the mean function. This leads to the addition of
terms in the mean function to account for spatial configuration and structure,
or to the transformation of the regressor space.
The significance of regression or ANCOVA models with uncorrelated errors
for spatial data is twofold. First, we want to discuss their place in spatial
analysis. Many statisticians have been led to believe that these models are in-
adequate for use with spatial data and that ordinary least squares estimates
of fixed effects are biased or inefficient. This is not always true. Second, these
models provide an excellent introduction to more complex models with cor-
related errors that follow later. For example, models with nearest neighbor
adjustments, such as the Papadakis analysis (Papadakis, 1937), are forerun-
ners of spatial autoregressive models. The danger of assuming that the salient
spatial structure can be captured through the mean function alone is that
there is “little room for error.” In the words of Zimmerman and Harville
(1991), a correlated error structure can “soak up” spatial heterogeneity. In
longitudinal data analyses it is not uncommon to model completely unstruc-
tured covariance matrices, in part to protect against omitted covariates. In
the spatial case unstructured covariance matrices are impractical, but even
a highly parameterized covariance structure can provide an important safety
net, protecting you against the danger of a misspecified mean function.
Figure 6.1 Scatterplot of soil carbon (%) versus soil nitrogen (%). Spatial configu-
ration of samples is shown in Figure 4.6 on page 156. Data kindly provided by Dr.
Thomas G. Mueller, Department of Agronomy, University of Kentucky.
Assume that you are fitting a linear model with uncorrelated, homoscedastic
errors to spatial data,
Z(s) = X(s)β + e(s), e(s) ∼ (0, σ 2 I). (6.2)
The ordinary least squares estimator of the fixed effects and the customary estimator of the residual variance are

β̂_ols = (X(s)′X(s))⁻¹X(s)′Z(s)
σ̂² = (1/(n − rank{X(s)})) (Z(s) − X(s)β̂_ols)′(Z(s) − X(s)β̂_ols).

These estimators are based on the least squares criterion: find the β that minimizes the residual sum of squares

(Z(s) − X(s)β)′(Z(s) − X(s)β).
The first and second properties follow from the assumption that the model
errors e(s) have zero mean. The third and fourth properties are based on the
assumption that the variance of the model errors is σ 2 I. The last property
of the OLS residuals (sum-to-zero) applies when the X(s) matrix contains
a constant column (an intercept). (We typically assume in this text that an
intercept is present.)
We can also derive the expected value and variance of σ̂². One of the advantages of maximum likelihood estimation is that the inverse of the information
matrix provides the variance-covariance matrix of the maximum likelihood
estimators. For the linear model considered here, this matrix can be obtained
as a special case of that given in (5.47).
In a model with uncorrelated errors, the test statistic (6.3) has another, appealing interpretation in terms of reductions in sums of squares. Let

SSR = (Z(s) − X(s)β̂_ols)′(Z(s) − X(s)β̂_ols)

denote the residual sum of squares of the model and let SSRᵣ denote the same sum of squares subject to the linear constraints H₀: Lβ = l₀. Then the sum of squares reduction test statistic is

F = (SSRᵣ − SSR) / (rank{L} σ̂²).
The ordinary least squares residuals ê_ols(s) have zero mean and variance σ²M = σ²(I − H). If hᵢᵢ denotes the ith diagonal element of H, then

tᵢ = ê_ols(sᵢ) / (σ̂₋ᵢ √(1 − hᵢᵢ))

is called the externally studentized residual.
The quantity hᵢᵢ is also called the leverage; it expresses how unusual an observation is in the "X"-space. Data points with high leverage have the potential to be influential on the analysis, but are not necessarily so. In a linear model with uncorrelated errors and an intercept, the leverages are bounded, 1/n ≤ hᵢᵢ ≤ 1, and the sum of the leverages equals the rank of X(s). Note that the leverage matrix is the gradient of the fitted values with respect to the data,

H = ∂Ẑ(s)/∂Z(s).
Diagnostic measures based on the sequential removal of data points are particularly simple to compute for linear models with uncorrelated errors. At the heart of the matter is the following powerful result. If X(s)₋ᵢ is the ((n − 1) × p) regressor matrix with the ith row removed, then the estimates of the fixed effects—if the ith data point is not part of the analysis—are

β̂_ols,₋ᵢ = (X(s)₋ᵢ′X(s)₋ᵢ)⁻¹X(s)₋ᵢ′Z(s)₋ᵢ.
1974). The importance of these results for diagnosing the fit of linear models is that statistics can be computed efficiently based only on the fit of the model to the full data, and that many statistics depend on only a fairly small number of elementary measures such as leverages and raw residuals. For example, a PRESS residual is simply

ê(sᵢ)₋ᵢ = Z(sᵢ) − Ẑ(sᵢ)₋ᵢ = ê(sᵢ)/(1 − hᵢᵢ),

and Cook's D (Cook, 1977, 1979), a measure of the influence of an observation on the estimate β̂, can be written as

Dᵢ = rᵢ² hᵢᵢ / (k(1 − hᵢᵢ)),

where k = rank{X(s)}. The DFFITS statistic of Belsley, Kuh, and Welsch (1980) measures the change in fit in terms of standard error units, and can be written as

DFFITSᵢ = tᵢ √(hᵢᵢ/(1 − hᵢᵢ)).

These and many other influence statistics are discussed in the monographs by Belsley, Kuh, and Welsch (1980) and Cook and Weisberg (1982).
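All of these case-deletion diagnostics follow from a single fit, using only the leverages and raw residuals. A compact numpy sketch (the function name is ours):

```python
import numpy as np

def ols_diagnostics(X, z):
    """Leverages, PRESS residuals, Cook's D, and DFFITS from one OLS fit,
    computed from leverages and raw residuals as described above."""
    n, k = X.shape
    H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat (leverage) matrix
    h = np.diag(H)
    e = z - H @ z                               # raw OLS residuals
    s2 = e @ e / (n - k)
    # leave-one-out variance and externally studentized residuals
    s2_i = ((n - k) * s2 - e**2 / (1 - h)) / (n - k - 1)
    t = e / np.sqrt(s2_i * (1 - h))
    press = e / (1 - h)                         # PRESS residuals
    r = e / np.sqrt(s2 * (1 - h))               # internally studentized
    cook = r**2 * h / (k * (1 - h))             # Cook's D
    dffits = t * np.sqrt(h / (1 - h))           # DFFITS
    return h, press, cook, dffits
```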
Fitted residuals in a statistical model are commonly used to examine the underlying assumptions of the model. For example, a QQ-plot or histogram of the ê_ols(s) is used to check whether it is reasonable to assume a Gaussian distribution, and scatter plots of the residuals are used to examine a constant variance assumption or the appropriateness of the mean function. In spatial
models, whether the errors are assumed to be correlated or uncorrelated, an
important question is whether the covariance structure of the model has been
chosen properly. It seems natural, then, to use the fitted residuals to judge
whether the assumed model Var[e(s)] = Σ(θ) appears adequate. When, as
in this section, it is assumed that Σ(θ) = σ²I and the model is fit by ordinary least squares, one would use the ê_ols(s) to inquire whether there is any
residual spatial autocorrelation. Common devices are a test for autocorrela-
tion based on Moran’s I with regional data and estimation of semivariograms
of the residuals with geostatistical data. To proceed with such analyses in a
meaningful way, the properties of residuals need to be understood.
Recall from the previous section that the "raw" residuals from an OLS fit are

ê_ols(s) = Z(s) − Ẑ(s) = Z(s) − HZ(s) = MZ(s), (6.4)

where H = X(s)(X(s)′X(s))⁻¹X(s)′ is the "hat" (leverage) matrix. Since we aim to use ê_ols(s) to learn about the unobservable e(s), let us compare their features. First, the elements of e(s) have zero mean, are non-redundant, uncorrelated, and homoscedastic. By comparison, the elements of ê_ols(s) are
residuals carry information about the model disturbances e(s). The remaining k residuals are redundant.
• correlated: The variance of the OLS residual vector is

Var[ê_ols(s)] = σ²M,

a non-diagonal matrix. The residuals are correlated because they result from fitting a model to data and thus obey certain constraints. For example, X(s)′ê_ols(s) = 0. Fitted residuals exhibit more negative correlations than the model errors.
• heteroscedastic: The leverage matrix H is the gradient of the fitted values with respect to the observed data. A diagonal element hᵢᵢ reflects the weight of an observation in determining its predicted value. With the exception of some balanced classification models, the hᵢᵢ are not of equal value and the residuals are thus not equi-dispersed. A large residual in a plot of the ê_ols(sᵢ) against the fitted values may not convey a model breakdown or an outlying observation. A large residual is more likely if Var[ê_ols(sᵢ)] is large (hᵢᵢ small). Furthermore, σ²(1 − hᵢᵢ) < σ², since in an OLS model with
ing with quantities that result from a model fit, rather than actual, observed
data.
Most practitioners adopt (i), usually without the caveat and understand-
ably so: it is difficult to interpret the results once we realize all the problems
that can arise when working with raw residuals. One approach to (ii) is to
derive a set of n − k “new” quantities that overcome the problems inherent in
working with raw residuals. This approach is termed error recovery. As for
(iii), spatial statistics typically used for spatial autocorrelation analysis can
be modified for use with residuals. We describe error recovery and some mod-
ifications to spatial autocorrelation statistics for use in OLS residual analysis
in subsequent paragraphs.
When working with spatial data, we need to assess whether there is any spatial
variation that has not been accounted for by the model. Such variation can be
due to an omitted spatially-varying covariate or to spatial autocorrelation in
the data, or both. The only real information we have for this assessment comes
from the OLS residuals. If the OLS residuals exhibit spatial patterns, then the
OLS regression model is not adequate to describe the spatial variation in the
data. We note in passing that a simple map of the residuals can be extremely
informative as can other spatial visualization techniques.
One of the most common tools for assessing spatial autocorrelation is the
empirical semivariogram. Unfortunately, the empirical semivariogram com-
puted from the OLS residuals (which we refer to as the residual semivari-
ogram) may not be a good estimate of the semivariogram of the error pro-
cess, e(s). As noted previously, the statistical properties of the two processes
are very different. It is difficult to determine whether any structure (or lack
thereof) apparent in the residual semivariogram is due to spatial autocor-
relation in the error process or to artifacts induced by the rank deficiency,
correlation, and heteroscedasticity among the residuals.
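For concreteness, the residual semivariogram is computed like any other empirical (Matheron) semivariogram, only with ê_ols(s) in place of data. A sketch (names ours); the caveats just described about rank deficiency, correlation, and heteroscedasticity apply to the interpretation of its output:

```python
import numpy as np

def empirical_semivariogram(coords, resid, nbins=15, maxdist=None):
    """Matheron estimator of the semivariogram of residuals:
    average of 0.5*(e_i - e_j)^2 over distance bins."""
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    iu = np.triu_indices(len(resid), k=1)       # all unordered pairs
    h = d[iu]
    sq = 0.5 * (resid[iu[0]] - resid[iu[1]]) ** 2
    maxdist = maxdist or h.max() / 2.0
    edges = np.linspace(0.0, maxdist, nbins + 1)
    idx = np.digitize(h, edges) - 1
    keep = (idx >= 0) & (idx < nbins)
    gamma = np.array([sq[keep][idx[keep] == b].mean()
                      if np.any(idx[keep] == b) else np.nan
                      for b in range(nbins)])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, gamma
```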
Figure 6.2 Empirical semivariograms of soil carbon (%) (left panel) and of the OLS residuals of the simple linear regression of C% on N% (right panel).
There are several options for reducing the potential for such artifacts in
the residual semivariogram. First, we could use the empirical semivariogram
Example 6.1 (Soil carbon regression. Continued) For the C/N regression model we computed the empirical semivariograms of the Best Linear Unbiased Scalar (BLUS) estimates (Theil, 1971), the recursive residuals, the studentized OLS residuals, and the raw OLS residuals (Figure 6.3). All four types
of residuals exhibit considerable structure. Based on the semivariograms of
the BLUS and recursive residuals we conclude that the model errors should
be considered spatially correlated. A simple linear regression model is not
adequate for these data. The empirical semivariogram for the OLS residuals
does not appear too different from the other semivariograms. Typically, the
bias in the OLS semivariogram is larger for models with more regression co-
efficients, as the residuals become more constrained. With a single regressor
and an intercept, there are only two redundant residuals and the leverages are
fairly homogeneous. The empirical semivariogram of the studentized residuals
bears a striking resemblance to the semivariograms of the recovered errors in
the top panels of Figure 6.3. Again, this is partly helped by the fact that the
X(s) matrix contains only two columns. On the other hand, it is encouraging
that the simple process of scaling the residuals to equal variance provides as
clear a picture of the spatial autocorrelation as the recovered errors, whose
construction is more involved.
A possible disadvantage of recovered errors is their dependence on data or-
der. What is important is that the residual semivariogram conveys the pres-
ence of spatial structure and the need for a spatial analysis. Figure 6.4 displays
empirical semivariograms of the recursive residuals for 10 random permuta-
tions of the data set. To show the variation among the semivariograms, they
are shown as series plots rather than as scatter plots. Any of the sets is equally
well suited to address the question of residual spatial dependency.
[Figure 6.3 about here: empirical semivariograms (semivariance versus distance) of the BLUS, recursive, studentized, and raw OLS residuals.]
Relatively little, practical work has been done on residual diagnostics for
spatial models. In particular, the practical impacts of rank deficiency, corre-
lation and heteroscedasticity among the OLS residuals on inferences drawn
from residual variography are not clearly understood. Also, it is not clear to
what extent the solutions described above are truly effective, and whether or
not they may introduce other problems (e.g., lack of uniqueness in LREs).
Some additional research has been done for more general linear models with
correlated errors and this is discussed in §6.2.3.
Earlier we mentioned three courses of action when working with residuals:
to proceed with an analysis of raw residuals, to compute transformed resid-
uals, and to use adjusted spatial statistics. The careful interpretation of the
semivariogram of OLS residuals is an example of the first action. The semi-
variograms for the recursive, studentized, and BLUS residuals in Figure 6.3
represent the second course. An example of taking into account the fact that
statistics are computed form fitted, rather than observed, quantities is the
test for autocorrelation in lattice data based on Moran’s I for OLS residuals.
[Figure 6.4 about here: empirical semivariograms (semivariance versus distance) of the recursive residuals for 10 random permutations of the data.]
E[Ires] = (n/((n − k)w..)) tr{MW} (6.6)

and variance

Var[Ires] = (n²/(w..²(n − k)(n − k + 2)))
    × [ S₁ + 2tr{G²} − tr{F} − 2(tr{G})²/(n − k) ], (6.7)

where

S₁ = (1/2) Σᵢ Σⱼ (wᵢⱼ + wⱼᵢ)²,
F = (X(s)′X(s))⁻¹X(s)′(W + W′)²X(s), and
G = (X(s)′X(s))⁻¹X(s)′WX(s).
Cliff and Ord (1981, p. 200) note that randomizing the residuals does not
provide the appropriate reference set for a permutation test of autocorrelation
based on OLS residuals. They consider only the asymptotic test based on the
asymptotic results in equations (6.6) and (6.7), and an assumption that the
data are Gaussian. Thus, an approximate test of the null hypothesis of no
spatial autocorrelation can be made by comparing the observed value of

z = (Ires − E[Ires]) / √(Var[Ires]) (6.8)

to the appropriate percentage point of the standard Gaussian distribution.
Note that this test is different from the one described in §1.3.2. For example,
the comparable test for Moran’s I described in (§1.3.2) uses a mean of −(n −
1)−1 to construct the test statistic. If this test statistic is used with OLS
residuals, which can be easily accomplished by importing the residuals into a
software program that computes Moran’s I, the resulting test will be incorrect;
the correct mean is given in (6.6). The same argument applies to the variance.
The computer will not know that you are working with residuals and so cannot
adjust the mean and variance to give the proper test statistic.
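A direct implementation of (6.6)–(6.8) is short. In this sketch W is a dense numpy array of spatial weights, the power in F is taken as a matrix product (our reading of the formula), and Ires is the usual Moran statistic evaluated at the OLS residuals; all names are ours:

```python
import numpy as np

def moran_residual_test(X, z, W):
    """Approximate z-test (6.8) for autocorrelation in OLS residuals,
    using the residual-adjusted moments (6.6) and (6.7)."""
    n, k = X.shape
    XtXi = np.linalg.inv(X.T @ X)
    M = np.eye(n) - X @ XtXi @ X.T              # residual projector
    e = M @ z
    w_dd = W.sum()                              # w.. in the text
    I_res = (n / w_dd) * (e @ W @ e) / (e @ e)  # Moran's I of residuals
    E_I = n * np.trace(M @ W) / ((n - k) * w_dd)                 # (6.6)
    S1 = 0.5 * ((W + W.T) ** 2).sum()
    F = XtXi @ X.T @ (W + W.T) @ (W + W.T) @ X
    G = XtXi @ X.T @ W @ X
    V_I = (n**2 / (w_dd**2 * (n - k) * (n - k + 2))) * (
        S1 + 2.0 * np.trace(G @ G) - np.trace(F)
        - 2.0 * np.trace(G) ** 2 / (n - k))                      # (6.7)
    z_stat = (I_res - E_I) / np.sqrt(V_I)                        # (6.8)
    return I_res, E_I, V_I, z_stat
```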
As with the semivariogram, we can also compute Moran’s I using the re-
cursive residuals or the LREs from LUS estimation described above. Since
the recursive residuals and the LREs are free of the problems inherent in OLS
residuals, a permutation test can be used. In addition, Cliff and Ord (1981, p.
204) give the moments of I computed from the LREs so that an approximate
z-test can be constructed. However, Cliff and Ord (1981, p. 204) note the
problems with determining which n − k recovered errors to use and the same
concern applies to the use of recursive residuals in this context. They tentatively suggest using the most well-connected observations, those for which Σⱼ wᵢⱼ is largest. Their discussion and recommendation clearly imply
the potential inferential perils that can occur when using recovered errors to
assess spatial autocorrelation.
Consider again the linear regression model with uncorrelated errors given in
(6.2). It is a spatial regression model since the dependent variable Z(s), and
the independent variables comprising X(s), are recorded at spatial locations
s1 , . . . , sn . However, for most independent variables, the spatial aspect of the
problem serves only to link Z(s) and X(s). Once the dependent and indepen-
dent variables are linked through location, there is nothing in the analysis
that explicitly considers spatial pattern or spatial relationships. In fact, if we
give you Z(s) and X(s), but simply refer to them as Z and X, you could apply
any and all tools from regression analysis to understand the effect of X on Z.
Moreover, you could move them around in space and still get the same results
(provided you move Z(s) and its corresponding covariates together). Fother-
ingham, Brunsdon, and Charlton (2002) refer to such analyses as aspatial,
a term we find informative. The field of spatial statistics is far from aspatial,
and even in the simple linear model case, there is more that can be done to
use spatial information more explicitly.
One of the easiest ways to make more use of spatial information and rela-
tionships is to use covariates that are polynomial functions of the spatial
coordinates si = [xi , yi ]! . The trend surface models described in §5.3.1 are
an example of this approach. For example, a linear trend surface uses a first
degree polynomial in [x,y] to describe the spatial variation in the response,
e.g.,

Z(sᵢ) = β₀ + β₁xᵢ + β₂yᵢ + εᵢ, εᵢ ∼ iid (0, σ²).

Ordinary least squares estimation and inference can be used for the β parameters. However, such an analysis is not aspatial; X is clearly completely
tied to the spatial locations. Although in §5.3.1, the parameter estimates were
simply a means of obtaining a response surface, the β coefficients themselves
have a spatial interpretation, measuring the strength of large-scale spatial
trends in Z(si ).
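A linear trend surface is just OLS on the coordinates. A minimal sketch (names ours):

```python
import numpy as np

def linear_trend_surface(coords, z):
    """Fit Z(s) = b0 + b1*x + b2*y + e by ordinary least squares;
    the slope coefficients measure large-scale trend in x and y."""
    X = np.column_stack([np.ones(len(z)), coords[:, 0], coords[:, 1]])
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    resid = z - X @ beta
    return beta, resid
```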
The parameter estimates from a trend surface analysis provide a fairly
broad, large-scale interpretation of the spatial variation in Z(si ). However,
they are essentially aspatial, since the model has one set of parameters that
apply everywhere, regardless of spatial location. As discussed in §5.3.2, we
can adapt traditional local polynomial regression to the spatial case by fitting
a polynomial model at any specified spatial location s0 . This model is essen-
tially a spatial version of the local estimation procedures commonly referred
to as LOESS or nonparametric regression, where the covariates are polyno-
mial functions of the spatial coordinates. In traditional applications of LOESS
and nonparametric regression methods where general covariates form X, the
term local refers to the attribute or X-space and not to spatial location. The
weights are functions of xi − x0 , differences in covariate values, and the anal-
ysis is aspatial (Fotheringham et al., 2002, pp. 3–4). What is really needed is
a model that is fit locally in the spatial sense, but allows general covariates
that are not necessarily polynomial functions of the spatial coordinates. The
same general model as that described in §5.3.2 can be used, but with general
covariates
If β is estimated at the sample locations, sᵢ, then the fitted values are given by Ẑ(s) = LZ(s), where the ith row of L is x(sᵢ)′(X(s)′W(sᵢ)X(s))⁻¹X(s)′W(sᵢ). The variance component, σ², can be estimated from the residuals of this fit using (Cressie, 1998)

σ̂² = (Z(s) − Ẑ(s))′(Z(s) − Ẑ(s)) / tr{(I − L)(I − L)′}.
Cressie (1998) gives more details on how this local model can be used in
geostatistical analysis and Fotheringham et al. (2002) provide many practical
examples of how this model, and various extensions of it, can be used in
geographical analysis.
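The sketch below fits this local model with a Gaussian kernel for W(s₀) and assembles the matrix L and the variance estimate of Cressie (1998); the kernel choice and bandwidth, and all names, are our assumptions.

```python
import numpy as np

def gwr_fit(coords, X, z, s0, bandwidth):
    """Locally weighted (geographically weighted) regression at s0,
    with Gaussian kernel weights in geographic space."""
    d = np.linalg.norm(coords - s0, axis=1)
    W = np.diag(np.exp(-0.5 * (d / bandwidth) ** 2))
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ z)   # local beta(s0)

def gwr_hat_matrix(coords, X, z, bandwidth):
    """Fitted values Z_hat = L Z, with row i of L equal to
    x(si)' (X' W(si) X)^{-1} X' W(si), and the residual-based
    sigma^2 estimate above."""
    n = len(z)
    L = np.zeros((n, n))
    for i in range(n):
        d = np.linalg.norm(coords - coords[i], axis=1)
        W = np.diag(np.exp(-0.5 * (d / bandwidth) ** 2))
        L[i] = X[i] @ np.linalg.solve(X.T @ W @ X, X.T @ W)
    zhat = L @ z
    R = np.eye(n) - L
    s2 = (z - zhat) @ (z - zhat) / np.trace(R @ R.T)
    return zhat, s2
```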
where ρk is the effect of the kth block and τl is the effect of the lth treatment
(Hinkelmann and Kempthorne, 1994). The somewhat unusual notation is used
to identify blocks and treatments (indices k and l), as well as lattice positions.
The Papadakis analysis is essentially an analysis of covariance where the
block effects in (6.10) are replaced by functions of OLS residuals in the model
Z(i, j)ₖₗ = µ + τₗ + εₖₗ. Because the residuals are based on a different model, the analysis involves several steps. In the first step the model with only treatment effects is fit and the residuals

ε̂(i, j) = Z(i, j)ₖₗ − Ê[Z(i, j)ₖₗ] = Z(i, j)ₖₗ − Z̄ₗ
There are many ambiguities in this analysis for which the user needs to make a determination. In a two-dimensional layout, you can define a single covariate adjusting in both directions, or separate covariates in each direction. You need to decide how to handle edge effects, that is, experimental units (EUs) near the boundary of the experimental area. You need to decide whether to include only immediate neighbors in the adjustments or to extend to second- or higher-order differences. The analysis can be non-iterative (as described above) or iterative: the residuals from the analysis of covariance model are used to recompute the covariates, and the process continues until changes are sufficiently small.
Since the covariates in the Papadakis analysis are linear functions of the responses Z(s), the analysis essentially accounts for local linear trends. This is one of the reasons why mean adjustments with this analysis tend to be more moderate than in models that add trend surface components, which typically involve polynomials of higher degree. Overfit trend surface models can produce large local adjustments to the treatment means (Brownie, Bowman, and Burton, 1993).
It is instructive to view the Papadakis analysis in a slightly different light.
The OLS residuals from which the covariates are constructed are linear functions of the data in the model Z(s) = Xτ + ε, namely ε̂ = MZ(s), with M = I − X(X'X)^{-1}X'. Consider the case of a single Papadakis covariate. Because τ̂_ols = (X'X)^{-1}X'Z(s), the resulting model can be written as

Z(s) = Xτ + βAε̂ + ε*
     = Xτ + βAMZ(s) + ε*
     = Xτ + βA(Z(s) − Xτ̂_ols) + ε*.
The matrix A determines how the averages of the residuals are formed. One of the shortcomings of the Papadakis analysis is that the covariates MZ(s) are assumed fixed; only ε* is treated as a random component on the right-hand side. Furthermore, the elements of ε* are considered uncorrelated and homoscedastic. If the variation in Z(s) were taken into account on the right-hand side, a very different correlation model would result. But if we can accommodate Z(s) on the right-hand side as a random variable, then there is no need to rely on the fitted OLS residuals in the first place. We could then fit a model of the form

Z(s) = Xτ + βW(Z(s) − Xτ) + ε*.   (6.11)
The matrix A has been replaced by the matrix W, because how you involve the model errors may be different from how you use fitted residuals for adjustments. The important point is that in equation (6.11) the response is regressed on its own residual. This model has autoregressive form; it is a simultaneous autoregressive model (§6.2.2.1), a special case of a correlated error model. Studies such as those by Zimmerman and Harville (1991), Brownie, Bowman, and Burton (1993), Stroup, Baenziger, and Mulitze (1994), and Brownie and Gumpertz (1997) have shown that correlated error models typically outperform trend surface models and neighbor-adjusted models with uncorrelated errors. The importance of models with neighbor adjustments lies in their connection to autoregressive models, and thus as stepping stones to other correlated error models.
The first-difference matrix Δ, when applied to model (6.12), plays the role of the inverse "square root" matrix L^{-1}. It transforms the model into one with uncorrelated errors. In the correlated error model, the transformation matrix L^{-1} is known because Σ is known. In the first-difference approach we presume knowledge about the transformation directly, at least up to a multiplicative constant. Note also that the differencing process produces a model for n − 1, rather than n, observations. The reality of fitting models with correlated errors is that Σ is unknown, or known only up to some parameter vector θ. Thus, the square root matrix is also unknown.
A second approach is based on the spatial proximity measures described in §1.3 and applies to regional data. In what follows, we first focus on geostatistical models for Σ(θ) and their use in linear regression models with spatially correlated errors. The approach for regional data will be described in §6.2.2. Our concern now is primarily with estimates of θ for inference about β, rather than with prediction of the Z(s) process.
If θ is known, the generalized least squares estimator

β̂_gls = [X(s)'Σ(θ)^{-1}X(s)]^{-1} X(s)'Σ(θ)^{-1}Z(s)   (6.14)
can be used to estimate β. In order to use this estimator for statistical inference, we need to be sure it is a consistent estimator of β and then determine its distributional properties. If we assume the errors e(s) follow a Gaussian distribution, then the maximum likelihood estimator of β is equivalent to the generalized least squares estimator. Thus, the generalized least squares estimator is consistent for β and β̂_gls ~ G(β, (X(s)'Σ(θ)^{-1}X(s))^{-1}). However, if the errors are not Gaussian, these properties are not guaranteed. Consistency of β̂_gls depends on the X(s) matrix and the covariance matrix Σ(θ). One condition that will ensure the consistency of β̂_gls for β is

lim_{n→∞} [X(s)'Σ(θ)^{-1}X(s)]/n = Q,   (6.15)
where Q is a finite, nonsingular matrix (see, e.g., Judge et al., 1985, p. 175).
The asymptotic properties of β̂_gls are derived from the asymptotic properties of

√n(β̂_gls − β) = [X(s)'Σ(θ)^{-1}X(s)/n]^{-1} X(s)'Σ(θ)^{-1}e(s)/√n.
Most central limit theorems given in basic statistics books are not directly relevant to this problem, since X(s)'Σ(θ)^{-1}e(s) is not a sum of independent and identically distributed random variables. One central limit theorem that can be applied here is the Lindeberg-Feller central limit theorem, and it can be used to show that (Schmidt, 1976)

√n(β̂_gls − β) →_d G(0, Q^{-1}).
Thus, if the condition in (6.15) and some added regularity conditions for central limit theorems are satisfied, then

β̂_gls ·∼ G(β, (X(s)'Σ(θ)^{-1}X(s))^{-1}).
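To make (6.14) concrete, here is a minimal Python sketch of GLS estimation under an assumed known exponential covariance model. The covariance parameters, the simulated data, and the function name are illustrative; in practice θ must be estimated, as discussed next.

```python
import numpy as np

def gls_estimate(X, z, Sigma):
    """GLS estimator (6.14) and its covariance, for a known Sigma(theta)."""
    Si = np.linalg.inv(Sigma)
    cov_beta = np.linalg.inv(X.T @ Si @ X)   # (X' Sigma^{-1} X)^{-1}
    beta = cov_beta @ (X.T @ Si @ z)
    return beta, cov_beta

rng = np.random.default_rng(7)
n = 80
S = rng.uniform(0, 10, size=(n, 2))
D = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)  # distance matrix
Sigma = np.exp(-D / 3.0)                                   # exponential model

X = np.column_stack([np.ones(n), rng.normal(size=n)])
e = np.linalg.cholesky(Sigma) @ rng.normal(size=n)         # correlated errors
z = X @ np.array([2.0, -1.0]) + e

beta_hat, cov_beta = gls_estimate(X, z, Sigma)
se = np.sqrt(np.diag(cov_beta))  # standard errors from the asymptotic result
```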
In practice, θ is unknown and must be estimated, for example by the methods described in §5.5 in the case of a spatially varying mean (and by the methods in §4.5–§4.6 in the case of a constant mean). As in Chapter 5, we need to be concerned with the effects of using estimated covariance parameters in plug-in expressions such as (6.16).
Derivation of the distributional properties of β̃_egls is difficult because Σ(θ̃) and e(s) will be correlated. Thus, we again turn to asymptotic results. If, in addition to the condition in (6.15), the conditions

lim_{n→∞} n^{-1} X(s)'[Σ(θ̃)^{-1} − Σ(θ)^{-1}]X(s) →_p 0

and

lim_{n→∞} n^{-1/2} X(s)'[Σ(θ̃)^{-1} − Σ(θ)^{-1}]e(s) →_p 0

hold, then β̃_egls has the same limiting distribution as β̂_gls, and so

√n(β̃_egls − β) →_d G(0, Q^{-1})

(Theil, 1971; Schmidt, 1976; Judge et al., 1985, p. 176). Thus, if these conditions hold,

β̃_egls ·∼ G(β, (X(s)'Σ(θ)^{-1}X(s))^{-1}).
That is, in large samples,

Var(β̃) = (X(s)'Σ(θ)^{-1}X(s))^{-1}.
The consequences of using the plug-in expression (X(s)'Σ(θ̃)^{-1}X(s))^{-1} as an estimator of Var(β̃) are discussed in §6.2.3.1.
Table 6.1 Estimated parameters for the C–N regression. The covariance structure for REML and ML estimation is exponential; independent errors are assumed for OLS estimation. (Columns give estimates and standard errors under OLS, REML, and ML.)
In the previous section, any variation not explained by the parametric mean
function, X(s)β, was assumed to be unstructured, random, spatial variation.
However, in many applications, the variation reflected in e(s) may have a
systematic component. For example, in a randomized complete block design,
it can be advantageous to separate out the variation explained by the blocking
and not simply lump this variation into a general error term. Thus, we can
consider mixed models that contain both fixed and random effects.
The general form of a linear mixed model (LMM) is

Z(s) = X(s)β + U(s)α + ε(s),   (6.18)

where α is a (K × 1) vector of random effects with mean 0 and variance G. The vector of model errors ε(s) is independent of α and has mean 0 and variance R. Our inferential goal is now more complicated. In addition to estimators of the fixed effects β and any parameters characterizing R, we will also need a predictor of the random effects α, as well as estimators of any parameters characterizing G.
The mixed model (6.18) can be related to a signal model (see §2.4.1),

Z(s) = S(s) + ε(s)
S(s) = X(s)β + W(s) + η(s),

so that U(s)α corresponds to W(s) + η(s), the smooth-scale and micro-scale components. For the subsequent discussion, we combine these two components into υ(s), so that (6.18) is a special case of Z(s) = X(s)β + υ(s) + ε(s). The various approaches to spatial modeling that draw on linear mixed model technology differ in how U(s)α is constructed and in their assumptions regarding G and R.
But first, let us return to the general case and assume that G and R are known. The mixed model equations of Henderson (1950) are a system of equations whose solutions yield β̂ and α̂, the estimates of the fixed effects and predictors of the random effects. The mixed model equations can be derived using a least squares criterion, augmenting the traditional β vector with the random effects vector α. Another derivation of the mixed model equations, presented here, commences by specifying the joint likelihood of [α, ε(s)] and maximizing it with respect to β and α. Under a Gaussian assumption for both random components, this joint density is

f(α, ε(s)) = (2π)^{-(n+K)/2} |blockdiag(G, R)|^{-1/2} × exp{ −(1/2) u' blockdiag(G, R)^{-1} u },

where u = [α', (Z(s) − X(s)β − U(s)α)']' denotes the stacked vector of the two random components.
An important special case of the linear mixed model arises when R = σ_ε²I and G = σ²I, a variance component model. This is a particularly simple mixed model, and fast algorithms are available to estimate the parameters σ² and σ_ε²; for example, the modified W-transformation of Goodnight and Hemmerle (1979). The linear mixed model criterion (6.19) now becomes

Q(β, α) = σ_ε^{-2}(Z(s) − X(s)β − U(s)α)'(Z(s) − X(s)β − U(s)α) + σ^{-2}α'α
        = σ_ε^{-2}||Z(s) − X(s)β − U(s)α||² + σ^{-2}||α||².   (6.22)
This expression is very closely related to the objective function minimized in spline smoothing. In this subsection we develop this connection and provide details on the construction of the spatial design or regressor matrix U(s). The interested reader is referred to the text by Ruppert, Wand, and Carroll (2003), on which the exposition regarding splines and radial smoothers is based.
The problem of modeling the mean function in Y_i = f(x_i) + ε_i has many statistical solutions. Among the nonparametric ones are scatterplot smoothers based on splines. A spline model essentially decomposes f(x) into an "overall" mean component and a linear combination of piecewise functions. For example, define the truncated line function

(x − t)_+ = x − t  if x > t;  0  otherwise.
The functions 1, x, (x − t_1)_+, ···, (x − t_K)_+ are linear spline basis functions, and their linear combinations are called splines (Ruppert, Wand, and Carroll, 2003, p. 62). The points t_1, ···, t_K are the knots of the spline. Ruppert et al. (2003) term a smoother low-rank if the number of knots is considerably less than the number of data points; if K ≈ n, the smoother is termed full-rank.
A linear spline model, for example, uses the linear spline basis functions:

f(x) = β_0 + β_1 x + Σ_{j=1}^{K} α_j (x − t_j)_+.   (6.23)

This model and its basis functions are a special case of the more general power spline of degree p,

f(x) = β_0 + β_1 x + ··· + β_p x^p + Σ_{j=1}^{K} α_j ((x − t_j)_+)^p.
We have used the symbol α for the spline coefficients to initiate the connection between spline smoothers and the mixed model in equation (6.18). We have not yet decided whether to treat the coefficients as random quantities, however. Whether α is fixed or random, we can write the observational model for (6.23)
as

y = Xβ + Uα + ε,

where the ith row of X is [1, x_i] and the ith row of U is [(x_i − t_1)_+, ···, (x_i − t_K)_+].
In order to fit a spline model to data, penalties are imposed that prevent the fit from being too variable. A penalty criterion that restricts the variation of the spline coefficients leads to the minimization of the spline criterion

Q*(β, α) = ||y − Xβ − Uα||² + λ²||α||²,   (6.24)

where λ > 0 is the smoothing parameter.
The connection between spline smoothing and linear mixed models is becoming clear. If β̂ and α̂ minimize (6.24), then they also minimize

Q*(β, α)/σ_ε² = Q(β, α),

the mixed model criterion (6.19) that led to Henderson's mixed model equations, with λ² = σ_ε²/σ².
You can use mixed model software to perform spline smoothing: construct the random-effects regressor matrix U from the spline basis functions and assume a variance component model for their dispersion. This correspondence has another, far-reaching consequence. The smoothing parameter does not have to be determined by cross-validation or a comparison of information criteria. If σ̂_ε² and σ̂² are the (restricted) maximum likelihood estimates of the variance components, then the smoothing parameter is determined as

λ = (σ̂_ε²/σ̂²)^{1/2}
in a linear spline model. In a power spline of degree p, this expression is taken to the pth power (Ruppert, Wand, and Carroll, 2003, p. 113). This automated, model-driven estimation of the smoothing parameter replaces the smoothing parameter selection that is a matter of much controversy in nonparametric modeling. In fact, the actual value of the smoothing parameter in these models may not even be of interest; it is implied by the estimates of the variance components, and has thereby been "de-mystified."
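The following Python sketch makes the spline–mixed-model connection concrete for a linear spline: it builds X and U from the basis functions and minimizes the penalized criterion (6.24) for a given λ. In a mixed-model fit, λ would be replaced by (σ̂_ε²/σ̂²)^{1/2} from the REML variance component estimates; the simulated data, knot placement, and λ value here are illustrative assumptions.

```python
import numpy as np

def linear_spline_design(x, knots):
    """X = [1, x] and U with columns (x - t_j)_+, one column per knot."""
    X = np.column_stack([np.ones_like(x), x])
    U = np.maximum(np.subtract.outer(x, knots), 0.0)
    return X, U

def penalized_fit(X, U, y, lam):
    """Minimize ||y - X b - U a||^2 + lam^2 ||a||^2, criterion (6.24)."""
    C = np.column_stack([X, U])
    p, K = X.shape[1], U.shape[1]
    P = np.diag(np.r_[np.zeros(p), np.full(K, lam ** 2)])  # penalize a only
    coef = np.linalg.solve(C.T @ C + P, C.T @ y)
    return coef[:p], coef[p:]

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)
X, U = linear_spline_design(x, np.linspace(0.05, 0.95, 20))  # low-rank: K << n

# With REML variance component estimates, lam = sqrt(sigma2_eps / sigma2);
# here an illustrative value is plugged in instead.
beta, alpha = penalized_fit(X, U, y, lam=1.0)
fitted = X @ beta + U @ alpha
```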
There is another important consequence of adopting the mixed model framework and ML/REML estimation of the smoothing parameter. In order for there to be a variance component σ², the spline coefficients α have to be random variables. Using the "mixed model crank" to obtain the solutions β̂ and α̂ is one thing. If you are interested in the precision of predicted values such as x'β̂ + u'α̂, then the appropriate estimate of precision depends on whether α is fixed or random. We argue that α is a random component and should be treated accordingly. Recall the correspondence U(s)α = υ(s) in the initial formulation of the linear mixed model. The spline coefficients are part of the spatial process υ(s), a random process. The randomness of υ(s) cannot be induced by U(s), which is a matrix of constants. The randomness must be induced by α; hence the coefficients are random variables.
So far, the spline model has the form of a scatterplot smoother in a single dimension. In order to cope with multiple dimensions, e.g., two spatial coordinates, we need to make some modifications. Again, we follow Ruppert, Wand, and Carroll (2003, Ch. 13.4–13.5). For a process in R^d, the first modification is to use radial basis functions as the spline basis. Define h_ik as the Euclidean distance between the location s_i and the spline knot t_k, h_ik = ||s_i − t_k||. Similarly, define c_kl = ||t_k − t_l||, the distance between the knots t_k and t_l, and p = 2m − d, p > 0. Then, the (n × K) matrix U*(s) and (K × K) matrix Ω(t) are defined to have typical elements

U*(s) = [h_ik^p]            if d is odd;   [h_ik^p log{h_ik}]   if d is even,
Ω(t)  = [c_kl^p]            if d is odd;   [c_kl^p log{c_kl}]   if d is even.
From a singular value decomposition of Ω(t) you can obtain the square root matrix Ω(t)^{1/2}. Define the mixed linear model (the multivariate radial smoother)

Z(s) = X(s)β + U(s)α + ε   (6.25)
U(s) = U*(s)Ω(t)^{-1/2}
α ~ (0, σ²I)
ε ~ (0, σ_ε²I)
Cov[α, ε] = 0.
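A sketch of the construction of U(s) for d = 2 follows. The SVD-based square root mirrors the recipe in the text; the knot grid, the choice m = 2 (so p = 2), and the function name are assumptions for illustration, and the pseudo-inverse is one way to handle a possibly ill-conditioned Ω(t).

```python
import numpy as np

def radial_design(S, knots, m=2):
    """Build U(s) = U*(s) Omega(t)^{-1/2} for coordinates in R^2 (d even)."""
    d = S.shape[1]
    p = 2 * m - d                                 # p = 2m - d > 0
    def basis(h):
        # h^p for d odd; h^p log(h) for d even, with the h -> 0 limit set to 0
        out = np.zeros_like(h)
        pos = h > 0
        hp = h[pos] ** p
        out[pos] = hp if d % 2 == 1 else hp * np.log(h[pos])
        return out
    U_star = basis(np.linalg.norm(S[:, None, :] - knots[None, :, :], axis=2))
    Omega = basis(np.linalg.norm(knots[:, None, :] - knots[None, :, :], axis=2))
    # Square root of Omega from its singular value decomposition
    P, sv, Qt = np.linalg.svd(Omega)
    Omega_half = P @ np.diag(np.sqrt(sv)) @ Qt
    return U_star @ np.linalg.pinv(Omega_half)

g = np.linspace(0.1, 0.9, 7)                      # 49 knots on a regular grid
knots = np.array([(a, b) for a in g for b in g])
S = np.random.default_rng(5).uniform(0, 1, size=(100, 2))
U = radial_design(S, knots)                       # (100 x 49) regressor matrix
```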
As another special case, consider again the general linear model (see also §5.4.3)

Z(s) = S(s) + ε(s),   s ∈ D,

where E[ε(s)] = 0, Var[ε(s)] = σ_ε², Cov[ε(s_i), ε(s_j)] = 0 for all i ≠ j, and S(s) and ε(s) are independent. We again assume that the unobserved signal S(s) can be described by a general linear model with autocorrelated errors,

S(s) = x(s)'β + υ(s),

where E[υ(s)] = 0 and Cov[υ(s_i), υ(s_j)] = Cov[S(s_i), S(s_j)] = C_S(s_i, s_j). Note that this model has a linear mixed model formulation,

Z(s) = X(s)β + υ(s) + ε(s)
υ ~ (0, Σ_S)
ε(s) ~ (0, σ_ε²I)
Cov[υ, ε(s)] = 0,
and (6.18) is a special case, obtained by taking U(s)α ≡ υ(s), R ≡ σ_ε²I, and G ≡ Σ_S. The matrix Σ_S contains the covariance terms C_S(s_i, s_j). In practice, Σ_S is modeled parametrically as Σ(θ_S). Note that the moments are

E[Z(s)] = X(s)β   (6.26)
Var[Z(s)] = Σ_S + σ_ε²I = V,   (6.27)

which are those of a linear model with correlated errors of the form of (6.1). Thus, as far as estimation of the fixed effects is concerned, the two models are equivalent.
Example 6.1 (Soil carbon regression. Continued) For the C/N data we fitted two conditional models: a mixed model

Z(s) = X(s)β + υ(s) + ε(s),

where υ(s) is a zero-mean random field with exponential covariance structure, and a low-rank radial smoother with 49 knots, placed on an evenly spaced grid throughout the domain. The large-scale trend structure is the same in the two models; as before, C% is modeled as a linear function of total soil nitrogen (N%). The low-rank smoother has two covariance parameters: the variance of the random spline coefficients and the residual variance σ_ε² = Var[ε(s)]. The spatial mixed model has three covariance parameters: the variance and range of the exponential covariance structure Σ(θ) = Var[υ(s)], and the nugget variance σ_ε² = Var[ε(s)].
Because of the presence of the variance component σ², predictions at the observed locations are filtered and do not honor the data. The left-hand panel of Figure 6.5 shows the adjustments that are made to the estimate of the mean to predict the C% at the observed locations. The right-hand panel of the figure compares the C% predictions in the two conditional models. They are generally very close. The computational effort to fit the radial smoother model is considerably smaller, however. The low-rank smoother has a simpler variance structure, σ²U(s)U(s)' + σ_ε²I compared to Σ(θ) + σ_ε²I, which does not have second derivatives with respect to the covariance parameters. Representing the spatial random structure requires K = 49 random components in the smoothing model and n = 195 components in the spatial mixed model.
Figure 6.5 Comparison of fitted and predicted values in two conditional models for the C–N spatial regression. The left-hand panel compares predictions of the C-process and estimates of the mean carbon percentage in the radial smoother model. The right-hand panel compares the C-predictions in the two conditional formulations.
predictions of Z(s_0). There are no random effects, and all the spatial autocorrelation is modeled through what is essentially R = σ_ε²I + Σ(θ).
Modeling these moments directly can be a difficult task, and it is often easier to hypothesize a spatially varying latent process S(s) and to construct models around the moments of this process. For example, we could assume, as a conditional formulation, that, given an unobservable spatial process S(s),

E[Z(s)|S] ≡ S(s),
Var[Z(s)|S] = σ_ε²,
Cov[Z(s_i), Z(s_j)|S] = 0   for s_i ≠ s_j.
To complete the specification we assume
S(s) = x(s)! β + υ(s),
where υ(s) is a zero mean, second-order stationary process with covariance
function Cυ (si , sj ) = CS (si , sj ) = Cov[υ(si ), υ(sj )] = Cov[S(si ), S(sj )], and
corresponding variance-covariance matrix Σ(θ S ). This is an example of a
conditionally-specified or hierarchical model. At the first stage of the
hierarchy, we describe how the data depend on the random process S(s). At
the second stage of the hierarchy, we model the moments of the random pro-
cess. Estimation of β and θ can also be done using iteratively reweighted
least squares, or by using likelihood methods if we assume S(s) is Gaussian
(as described earlier as part of the theory of mixed models). All of the spatial
autocorrelation is modeled through what is essentially G = Σ(θ S ).
Note that the marginal moments of Z(s) are then easily obtained as

E[Z(s)] = E_S[E[Z(s)|S(s)]] = X(s)β
Var[Z(s)] = Var_S[E[Z(s)|S(s)]] + E_S[Var[Z(s)|S(s)]]
          = Var_S[S(s)] + σ_ε²I
          = Σ(θ_S) + σ_ε²I = V.
Thus, our conditionally specified, or hierarchical, model is a linear mixed model and is marginally equivalent to a general linear model with autocorrelated errors whose moments are given in (6.32). Because Var[Z(s)|S(s)] = σ_ε², this variance component represents the nugget effect of the marginal spatial covariance structure. For identifiability of the variance components, the covariance structure of υ(s) is free of a nugget effect, or it is assumed that the overall nugget effect can be decomposed into micro-scale and measurement error variation in known proportions.
With a linear model and only two levels in the hierarchy, the choice of the conditional specification over the marginal specification is basically one of personal preference, made for ease of understanding. The two approaches give equivalent inference if the marginal moments coincide. However, if we want to consider additional levels in the hierarchy or nonlinear models, the conditional specification can offer some advantages, and we revisit this formulation later in this chapter.
The designation simultaneous distinguishes this type of spatial model from the class of conditional autoregressive models defined in §6.2.2.2.
Clearly, the matrix of spatial dependence parameters, B, plays an important
role in SAR models. To make progress with estimation and inference, we will
need to reduce the number of spatial dependence parameters through the use
of a parametric model for the {bij } and, for interpretation, we would like to
relate them to the ideas of proximity and autocorrelation we have described
previously. One way to do this is to take B = ρW, where W is one of the
spatial proximity matrices discussed in §1.3.2. For example, if the spatial
locations form a regular lattice, W may be a matrix of 0’s and 1’s based on
a rook, bishop, or queen move. In geographical analysis, W may be based on
the length of shared borders, centroid distances, or other measures of regional
proximity. With this parameterization of B, the SAR model can be written
as the simultaneous autoregressive (SAR) model

Z(s) = X(s)β + e(s),
e(s) = ρWe(s) + υ,   (6.36)

where B = ρW. We can manipulate this model in a variety of ways, and it is often intuitive to write it as

Z(s) = X(s)β + (I − ρW)^{-1}υ   (6.37)
     = X(s)β − ρWX(s)β + ρWZ(s) + υ.   (6.38)
From equation (6.37) we can see how the autoregression induces spatial autocorrelation in the linear regression model through the term (I − ρW)^{-1}υ. From equation (6.38) we obtain a better appreciation for what this means in terms of a linear regression model with uncorrelated errors: we now have two additional terms in the regression model, ρWX(s)β and ρWZ(s). These terms are called spatially lagged variables.
For a well-defined model, we require (I−ρW) to be non-singular (invertible).
This restriction imposes conditions on W and also on ρ, best summarized
through the eigenvalues of the matrix W. If ϑmax and ϑmin are the largest
and smallest eigenvalues of W, and if ϑmin < 0 and ϑmax > 0, then 1/ϑmin <
ρ < 1/ϑmax (Haining, 1990, p. 82). For a large set of identical square regions,
these extreme eigenvalues approach −4 and 4, respectively, as the number of
regions increases, implying |ρ| < 0.25, but actual constraints on ρ may be
more severe, especially when the sites are irregularly spaced. Often, the row sums of W are standardized to 1 by dividing each entry in W by its row sum, Σ_j w_ij. Then ϑ_max = 1 and ϑ_min ≥ −1, so ρ < 1, but the lower limit 1/ϑ_min may be less than −1 (see Haining, 1990, §3.2.2).
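These eigenvalue bounds are easy to verify numerically. The following Python sketch constructs a binary rook-contiguity W for a square lattice and examines the admissible interval for ρ before and after row standardization; the grid size and the helper name are arbitrary choices.

```python
import numpy as np

def rook_W(nrow, ncol):
    """Binary rook-contiguity weights for a regular nrow x ncol lattice."""
    n = nrow * ncol
    W = np.zeros((n, n))
    for r in range(nrow):
        for c in range(ncol):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrow and 0 <= cc < ncol:
                    W[r * ncol + c, rr * ncol + cc] = 1.0
    return W

W = rook_W(10, 10)
ev = np.linalg.eigvalsh(W)            # W is symmetric here
print(ev.min(), ev.max())             # approach -4 and 4 as the grid grows
print(1 / ev.min(), 1 / ev.max())     # admissible interval for rho

Wr = W / W.sum(axis=1, keepdims=True) # row-standardized weights
evr = np.linalg.eigvals(Wr).real      # real eigenvalues (Wr is similar to a
print(evr.min(), evr.max())           # symmetric matrix): max 1, min >= -1
```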
Estimation and Inference in SAR Models. The one-parameter SAR model
described by equations (6.36), (6.37), and (6.38), with Σ_υ = σ²I, is by far
the most commonly-used SAR model in practical applications. Probably the
main reason for the popularity of this model is the difficulty of actually fitting
models with more parameters. In what follows we describe both least squares and maximum likelihood methods for estimation and inference with this one-parameter SAR model.
If ρ is known, then generalized least squares can be used to estimate β and to obtain an estimate of σ². Thus,

β̂_gls = [X(s)'Σ_SAR^{-1}X(s)]^{-1} X(s)'Σ_SAR^{-1}Z(s),

and

σ̂² = (Z(s) − X(s)β̂_gls)'Σ_SAR^{-1}(Z(s) − X(s)β̂_gls) / (n − k),

with Σ_SAR given in equation (6.35) using B = ρW and Σ_υ = σ²I. Of course
ρ is not known, and we might be tempted to use iteratively re-weighted least squares to estimate both ρ and β simultaneously through iteration. However, Z(s) and υ are, in general, not independent, making the least squares estimator of ρ inconsistent (Whittle, 1954; Ord, 1975). Based on the work of Ord (1975) and Cliff and Ord (1981, p. 160), Haining (1990, p. 130) suggested the use of a modified least squares estimator of ρ that is consistent, although inefficient:

ρ̂ = Z(s)'W'WZ(s) / Z(s)'W'W²Z(s).
However, given this problem with least squares estimation and the unifying theory underlying maximum likelihood, parameters of SAR models are usually estimated by maximum likelihood. Suppose the data follow the general SAR model defined in (6.36) and (6.35); i.e., the data are multivariate Gaussian with mean X(s)β and variance-covariance matrix given in equation (6.35),

Z(s) ~ G(X(s)β, (I − B)^{-1}Σ_υ(I − B')^{-1}).
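For the one-parameter model, β and σ² can be concentrated out of this Gaussian likelihood, leaving a one-dimensional search over ρ. The sketch below is a standard profile-likelihood formulation under the stated assumptions (B = ρW, Σ_υ = σ²I); it is not necessarily the exact algorithm used in the text, and the optimizer mentioned in the comment is only one way to carry out the search.

```python
import numpy as np

def sar_profile_negloglik(rho, W, X, z):
    """Profile -log-likelihood for the one-parameter Gaussian SAR model
    (Sigma_upsilon = sigma^2 I), concentrating out beta and sigma^2."""
    n = len(z)
    A = np.eye(n) - rho * W                  # I - rho W
    Xa, za = A @ X, A @ z                    # whitened model: Az = AX b + ups
    beta = np.linalg.lstsq(Xa, za, rcond=None)[0]
    u = za - Xa @ beta
    sigma2 = (u @ u) / n                     # profiled-out variance
    sign, logdet = np.linalg.slogdet(A)      # log |I - rho W|
    return -(logdet - 0.5 * n * np.log(sigma2))

# A one-dimensional search over the admissible interval for rho (e.g.,
# scipy.optimize.minimize_scalar with method="bounded") then yields the
# ML estimate of rho; beta and sigma^2 follow by back-substitution.
```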
a Markov random field. Thus, with the conditional autoregressive approach, we construct models for f(Z(s_i)|Z(s_j), s_j ∈ N_i). For example, if we assume each of these conditional distributions is Gaussian, then we might model them using

E[Z(s_i)|Z(s)_{−i}] = x(s_i)'β + Σ_{j=1}^{n} c_ij(Z(s_j) − x(s_j)'β),   (6.43)
Estimation and Inference with Gaussian CAR models. Following the ideas
regarding SAR models, we consider the case where Σ_c = σ²I and the spatial dependence parameters can be written as functions of a single spatial autocorrelation parameter, e.g., C = ρW.
Unlike the SAR model, the least squares estimator of the autocorrelation
parameter ρ in a one-parameter CAR model is consistent. Thus, iteratively
re-weighted generalized least squares (described in §5.5.1) can be used to
estimate all of the parameters of this CAR model. As before, β is estimated
using generalized least squares with Σ(θ) = ΣCAR = ΣCAR (ρ), and ρ is
estimated using (Haining, 1990, p. 130)

ρ̂_OLS = ε̂'Wε̂ / ε̂'W²ε̂,

where ε̂ is the residual vector from the OLS regression.
To perform maximum likelihood estimation, we consider a slightly more
general formulation for Σ_c, as in the case of the SAR models. Specifically, we reparameterize it as Σ_c = σ²V_c, with V_c known. Thus, with C = ρW, the variance-covariance matrix of this CAR model can be written as

Σ_CAR = σ²(I − C)^{-1}V_c = σ²V_CAR(ρ).   (6.47)
Maximizing the Gaussian likelihood based on this variance-covariance struc-
ture is usually straightforward, and the information matrix has a form very
similar to that associated with the one-parameter SAR model (see (6.42)) (Cliff and Ord, 1981, p. 242):

I(β, σ², ρ) = [ σ^{-2}X(s)'A(ρ)^{-1}X(s)    0                      0
                0'                           n/(2σ⁴)                (1/2)σ^{-2}tr(G)
                0'                           (1/2)σ^{-2}tr(G)       (1/2)α ],   (6.48)

where A(ρ) = (I − ρW)^{-1}, G = W(I − ρW)^{-1}, α = Σ_i ϑ_i²/(1 − ρϑ_i)², and ϑ_i are the eigenvalues of W.
Spatial autoregressive models were developed for use with regional data, and in most applications the data are aggregated over a set of finite areal regions. Thus, prediction at a new location (which is actually a new region) is usually not of interest unless there are missing data. Assume that, in the Gaussian case, the SAR model Z(s) ~ G(X(s)β, Σ_SAR), with Σ_SAR given in (6.35), or the CAR model Z(s) ~ G(X(s)β, Σ_CAR), with Σ_CAR given in (6.45), holds for the observed data as well as at the new location. Then universal kriging can be used to predict at the new location by jointly modeling the data and the missing value.
Linear Hypotheses About Fixed Effects. The correlated error models of this
section, whether marginal models, mixed models, or autoregressive models,
have an estimated generalized least squares solution for the fixed effects. As
for models with uncorrelated errors, we consider linear hypotheses involving
β of the form

H_0: Lβ = l_0  vs.  H_1: Lβ ≠ l_0,   (6.49)

where L is an l × p matrix of contrast coefficients and l_0 is a specified l × 1 vector. The statistic equivalent to (6.3) is the Wald F statistic,

F̃ = (Lβ̃ − l_0)'[L(X(s)'Σ(θ̃)^{-1}X(s))^{-1}L']^{-1}(Lβ̃ − l_0) / rank(L),   (6.50)
taking β̃ as the EGLS estimator based on θ̃, the latter being either the IRWGLS, ML, or REML estimator. For uncorrelated, Gaussian distributed errors, the regular F statistic ((6.3), page 305) followed an F distribution. In the correlated error case, the distributional properties of F̃ are less clear-cut. If θ̃ is a consistent estimator, then rank{L}F̃ has an approximate Chi-square distribution with rank{L} degrees of freedom. This is also its distribution if θ is known and Z(s) is Gaussian. Consequently, p-values computed from the Chi-square approximation tend to be too small; the test tends to be liberal, and Type-I error rates tend to exceed the nominal level. A better approximation to the nominal Type-I error level is achieved when p-values for F̃ are computed from an F distribution with rank{L} numerator and (n − rank{X(s)}) denominator degrees of freedom.
the estimated standard error for a single β̂_i is √s_ii, and a (1 − α)100% confidence interval for β_i is

β̂_i ± t_{α/2, n−k} √s_ii,   (6.52)

where t_{α/2, n−k} is the α/2 percentage point from a t-distribution with n − k degrees of freedom.
Using an F-distribution (t-distribution) instead of the asymptotic Chi-square (Gaussian) distribution improves the properties of the test of H_0: Lβ = l_0, but it does not yet address the essential problem that

Ω(θ̃) = [X(s)'Σ(θ̃)^{-1}X(s)]^{-1}
where the variance of the estimators derives from the inverse of the observed
or expected information matrix, substituting (plugging-in) ML (REML) esti-
mates for any unknown parameters. Alternatively, we can use likelihood ratio
tests of hypotheses about θ and compare two nested parametric models to see
whether a subset of the covariance parameters fits the data as well as the full
set. Consider comparing two models of the same form, one based on parameters θ_1 and a larger model based on θ_2, with dim(θ_2) > dim(θ_1). That is, θ_1 is obtained by constraining some parameters in θ_2, usually setting them to zero, and dim(θ) denotes the number of free parameters. Then a test of H_0: θ = θ_1 against the alternative H_1: θ = θ_2 can be carried out by comparing the likelihood ratio test statistic

ϕ(β; θ_1; Z(s)) − ϕ(β; θ_2; Z(s))   (6.57)

to a χ² distribution with dim(θ_2) − dim(θ_1) degrees of freedom. (Recall that ϕ(β; θ; Z(s)) denotes twice the negative of the log-likelihood, so ϕ(β; θ_1; Z(s)) > ϕ(β; θ_2; Z(s)).)
If estimation is by REML, we are comparing

ϕ_R(θ_1; KZ(s)) − ϕ_R(θ_2; KZ(s))   (6.58)

to a χ² distribution with dim(θ_2) − dim(θ_1) degrees of freedom. While likelihood ratio tests can be formulated for models that are nested with respect to the mean structure and/or the covariance structure, tests based on ϕ_R(θ; KZ(s)) can only be carried out for models that are nested with respect to the covariance parameters and have the same fixed effects structure (same X(s) matrix).
Example 6.4 Consider a partitioned matrix X(s) = [X(s)_1 X(s)_2] and the following models:

1. Z(s) = X(s)[β_1', β_2']' + e(s) with Cov[e(s_i), e(s_j)] = c_0 + σ²exp{−||s_i − s_j||/α}
2. Z(s) = X(s)_2β_2 + e(s) with Cov[e(s_i), e(s_j)] = c_0 + σ²exp{−||s_i − s_j||/α}
3. Z(s) = X(s)[β_1', β_2']' + e(s) with Cov[e(s_i), e(s_j)] = σ²exp{−||s_i − s_j||/α}
4. Z(s) = X(s)_2β_2 + e(s) with Cov[e(s_i), e(s_j)] = σ²exp{−||s_i − s_j||/α}

To test the hypothesis H_0: β_2 = 0, a likelihood ratio test is possible, based on the log likelihoods from models 1 and 2 (or models 3 and 4). The REML log likelihoods from these two models cannot be compared because K_1Z(s) and K_2Z(s) are two different sets of "data."
To test the hypothesis H0 : c0 = 0, models 1 and 3 (or models 2 and 4) can
be compared, since they are nested. The test can be based on either the −2 log
likelihood (ϕ(β; θ; Z(s))) or on the −2 residual log likelihood (ϕR (θ; KZ(s))).
The test based on ϕR is possible because both models use the same matrix of
error contrasts.
Finally, a test of H0 : c0 = 0, β 2 = 0 is possible by comparing models 1 and
4. Since the models differ in their regressor matrices (mean model), only a
test based on the log likelihood is possible.
Table 6.2 Minus two times (restricted) log likelihood values for the C–N regression analysis, under independent and exponential covariance structures. See Table 6.1 on page 324 for parameter estimates.
The independence and correlated error models can be compared with a likelihood ratio test, provided that the models are nested; that is, the independence model can be derived from the correlated error model by a restriction of the parameter space. The exponential covariance structure exp{−||s_i − s_j||/θ} approaches the independence model only asymptotically, as θ → 0. For θ = 0 the likelihood cannot be computed, so it appears that the models cannot be compared. There are two simple ways out of this "dilemma." We can reparameterize the covariance model so that the OLS model is nested. For example,

exp{−||s_i − s_j||/θ} = ρ^{||s_i − s_j||}
for ρ = exp{−1/θ}. The likelihood ratio statistic for H_0: ρ = 0 is −706 − (−747) = 41, with p-value Pr(χ²_1 > 41) < 0.0001. The addition of the exponential covariance structure significantly improves the model. The test based on the restricted likelihood ratio yields the same conclusion (−695 − (−738) = 43). It should be obvious that −747 is also the −2 log likelihood for the model in the exp{−||s_i − s_j||/θ} parameterization, so we could have used the log likelihood from the exponential model directly. The second approach is to evaluate the log likelihood at a value of the range parameter far enough away from zero to prevent floating point exceptions and inaccuracies, but small enough that the correlation matrix is essentially the identity.
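The arithmetic of these comparisons is easily reproduced; a small Python check using the −2 log likelihood values quoted above is shown below. The 50:50 chi-square mixture used in the boundary-adjustment comment is a standard choice for a variance-type parameter on the boundary, stated here as an assumption rather than the book's prescription.

```python
from scipy.stats import chi2

lr_ml = -706 - (-747)          # = 41, likelihood ratio statistic (ML)
lr_reml = -695 - (-738)        # = 43, restricted likelihood ratio statistic
print(chi2.sf(lr_ml, df=1))    # p-value << 0.0001
print(chi2.sf(lr_reml, df=1))

# Boundary adjustment: for a variance-type parameter on the boundary of the
# parameter space, a 50:50 mixture of chi-square(0) and chi-square(1) is a
# common reference distribution, which halves (shrinks) the p-value.
p_adjusted = 0.5 * chi2.sf(lr_ml, df=1)
```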
For a stationary process, ρ = 0 falls on the boundary of the parameter
space. The p-value of the test could be adjusted, but it would not alter our
conclusion (the adjustment shrinks the p-value). The correlated error model
fits significantly better than a model assuming independence.
To compare models that are not nested, we can draw on likelihood-based information criteria. For example, Akaike's Information Criterion (AIC; Akaike, 1974), a penalized log-likelihood, is defined by adding twice the number of estimated parameters to the −2 log likelihood.
To contrast residual analysis in the correlated error model with the analysis of OLS residuals (§6.1.2), it is important to understand the consequences of engaging a non-diagonal variance-covariance matrix in estimation. We will focus on this aspect here, and therefore concentrate on GLS estimation.
The matrix H(θ) is the gradient of the fitted values with respect to the observed data,

H(θ) = ∂Ẑ(s)/∂Z(s),

and it is thus reasonable to consider it the "leverage" matrix of the model. As with the model with uncorrelated errors, the leverage matrix is not diagonal,
and its diagonal entries are not necessarily the same. So, as with OLS residuals, GLS residuals are correlated and heteroscedastic, and since rank{H(θ)} = k, the GLS residuals are also rank-deficient. These properties stem from fitting the model to data and have nothing to do with whether the application is spatial or not. Since ê_egls(s) is afflicted with the same problems as ê_ols(s), the same caveats apply. For example, a semivariogram based on the GLS residuals is a biased estimator of the semivariogram of e(s). In fact, the semivariogram of the GLS residuals may not be any more reliable or useful in diagnosing the correctness of the covariance structure than the semivariogram of the OLS residuals. From (6.66) it follows immediately that

Var[ê_egls(s)] = M(θ)Σ(θ)M(θ)',   (6.67)

which can be quite different from Σ(θ). Consequently, the residuals will not be uncorrelated, even if the model fits well. If you fit a model by OLS but the true variance-covariance matrix is Var[e(s)] = Σ(θ), then the variance of the OLS residuals is

Var[ê_ols(s)] = M_ols Σ(θ) M_ols,

where M_ols = I − X(s)(X(s)'X(s))^{-1}X(s)'.
There is more. Like the OLS residuals, the GLS residuals have zero mean, provided that E[e(s)] = 0. The OLS residuals also satisfy very simple constraints, for example X(s)'ê_ols(s) = 0. Thus, if X(s) contains an intercept column, the OLS residuals not only have zero mean in expectation; they sum to zero in every sample. The GLS residuals do not satisfy X(s)'ê_egls(s) = 0. The question, then, is how to diagnose whether the covariance model has been specified correctly.
Notice that the matrix M(θ) is transposed on the right-hand side of equation (6.67). The leverage matrix H(θ) is an oblique projector onto the column space of X(s) (Christensen, 1991). It is not a projection matrix in the true sense, since it is not symmetric. It is idempotent, however:

H(θ)H(θ) = X(s)[X(s)'Σ(θ)^{-1}X(s)]^{-1}X(s)'Σ(θ)^{-1} × X(s)[X(s)'Σ(θ)^{-1}X(s)]^{-1}X(s)'Σ(θ)^{-1}
         = X(s)[X(s)'Σ(θ)^{-1}X(s)]^{-1}X(s)'Σ(θ)^{-1} = H(θ).

Figure 6.6 Comparison of g_e(h) (open circles) and the empirical semivariogram of the OLS residuals for exponential models with and without anisotropy.
It also shares other properties of the leverage matrix in an OLS model (see Chapter problems) and has the familiar interpretation as a gradient. Because diagonal elements of H(θ) can be negative, other forms of leverage matrices in correlated error models have been proposed, for example Σ(θ)^{-1}H(θ) and L^{-1}H(θ)L, where L is the lower-triangular Cholesky root of Σ(θ) (Martin, 1992; Haining, 1994).
In order to remedy shortcomings of the GLS residuals as diagnostic tools, we can apply techniques similar to those in §6.1.2. For example, the plug-in estimate of the variance of the EGLS residuals is

V̂ar[ê_egls(s)] = (I − H(θ̂))Σ(θ̂)(I − H(θ̂))'
               = Σ(θ̂) − H(θ̂)Σ(θ̂),   (6.68)

and the ith studentized EGLS residual is computed by dividing ê_egls(s_i) by the square root of the ith diagonal element of (6.68). The Kenward-Roger variance estimator (6.53) can be used to improve this studentization.
We can also apply ideas of error recovery to derive a set of n − k variance-covariance standardized, uncorrelated, and homoscedastic residuals. Recall from §6.1.2 that if the variance of the residuals is A, then we wish to find an (n × n) matrix Q such that

Var[Q'ê_egls(s)] = Q'AQ = [ I_{(n−k)}       0_{((n−k)×k)}
                            0_{(k×(n−k))}   0_{(k×k)} ].

The first n − k elements of Q'ê_egls(s) are the linearly recovered errors (LREs) of e(s). Since A = Var[ê_egls(s)] = Σ(θ̂) − H(θ̂)Σ(θ̂) is real, symmetric, and positive semi-definite, it has a spectral decomposition PΔP' = A. Let D denote the diagonal matrix whose ith diagonal element is 1/√δ_i if δ_i > 0 and zero otherwise. The needed matrix to transform the residuals ê_egls(s) is Q = PD.
You can also compute a matrix Q with the needed properties by way of a Cholesky decomposition for positive semi-definite matrices. This decomposition yields a lower-triangular matrix L such that LL' = A. The decomposition is obtained row-wise, and elements of L corresponding to singular rows are replaced with zeros. Then choose Q = L^−, where the superscript − denotes a generalized inverse, obtained, for example, by applying the sweep operator to all rows of L (Goodnight, 1979).
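A minimal Python sketch of the spectral-decomposition route to the LREs follows; the tolerance for declaring an eigenvalue zero and the function name are implementation choices, and the positive eigenvalues are selected explicitly rather than by position.

```python
import numpy as np

def recovered_errors(resid, A, tol=1e-10):
    """Linearly recovered errors from EGLS residuals.

    A = Var[e_hat], real, symmetric, positive semi-definite. Build
    Q = P D from the spectral decomposition P Delta P' = A; the
    elements of Q' e_hat belonging to positive eigenvalues (n - k of
    them) are standardized, uncorrelated, and homoscedastic.
    """
    delta, P = np.linalg.eigh(A)                 # eigenvalues, ascending
    keep = delta > tol                           # the n - k positive ones
    d = np.zeros_like(delta)
    d[keep] = 1.0 / np.sqrt(delta[keep])
    Q = P * d                                    # equals P @ diag(d)
    return (Q.T @ resid)[keep]
```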
Other types of residuals can be considered in correlated error models. Haslett
and Hayes (1998), for example, define marginal and conditional (prediction)
residuals. Houseman, Ryan, and Coull (2004) define rotated residuals based
on the Cholesky root of the inverse variance matrix of the data, rather than
a root constructed from the variance of the residuals.
6.3.1 Background
A linear model may not always be appropriate, particularly for discrete data
that might be assumed to follow a Poisson or Binomial distribution. General-
ized linear models (comprehensively described and illustrated in the treatise
by McCullagh and Nelder, 1989) are one class of statistical models devel-
oped specifically for such situations. These models are now routinely used for
modeling non-Gaussian longitudinal data, usually using a “GEE” approach
for inference. The GEE approach was adapted for time series count data by
Zeger (1988), and in the following sections we show how these ideas can be
applied to non-Gaussian spatial data.
In all of the previous sections we have assumed that the mean response is a linear function of the explanatory covariates, i.e., µ = E[Z(s)] = X(s)β. We also implicitly assumed that the variance and covariance of observations do not depend on the mean. Note that this is a separate assumption from mean stationarity. The implicit assumption was that the mean µ does not convey information about the variation of the data. For non-Gaussian data, these assumptions are usually no longer tenable. Suppose that Y_1, ···, Y_n denote uncorrelated binary observations whose mean depends on some covariate x. If E[Y_i] = µ(x_i), then

• Var[Y_i] = µ(x_i)(1 − µ(x_i)). Knowing the mean of the data provides complete knowledge about the variation of the data.
of the data as

Var[Z(s)] = Σ(µ, θ) = σ²V_µ^{1/2}R(θ)V_µ^{1/2},   (6.72)

where R(θ) is a correlation matrix with elements ρ(s_i − s_j; θ), the spatial correlogram defined in §1.4.2. The diagonal matrix V_µ^{1/2} has elements equal to the square roots of the variance functions, √v(µ).
If R(θ) = I, then (6.72) reduces to σ²V_µ, which is not quite the same as (6.71). The parameter σ² obviously equals the scale parameter ψ for those exponential family distributions that possess a scale parameter. In cases where ψ ≡ 1, for example for binary, Binomial, or Poisson data, the parameter σ² measures overdispersion of the data. Overdispersion is the phenomenon by which the data are more dispersed than is consistent with a particular distributional assumption. Adding the multiplicative scale factor σ² in (6.72) is a basic method to account for the "inexactness" in the variance-to-mean relationship.
Recall that the correlogram or autocorrelation function is directly related
to the covariance function and the semivariogram, and so many of the ideas
concerning parametric modeling of covariance functions and semivariograms
described in §4.3 apply here as well. Often we take q = 1 and θ to equal the range of spatial autocorrelation, since the sill of a correlogram is 1. A nugget effect can be included by using

Var[Z(s)] = c_0 V_µ + σ²V_µ^{1/2}R(θ)V_µ^{1/2}.

In this case, Var[Z(s_i)] = (c_0 + σ²)v(µ(s_i)), and the covariance between any two variables is Cov[Z(s_i), Z(s_j)] = σ²√(v(µ(s_i))v(µ(s_j))) ρ(s_i − s_j).
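As a concrete sketch, the covariance matrix in the display above can be assembled as follows, assuming (for illustration only) a Poisson variance function v(µ) = µ and an exponential correlogram; neither choice is dictated by the text.

```python
import numpy as np

def glm_spatial_cov(mu, D, sigma2, theta, c0=0.0):
    """Marginal GLM covariance per (6.72), plus an optional nugget term:
    c0 * V_mu + sigma2 * V_mu^{1/2} R(theta) V_mu^{1/2}."""
    v = mu                                 # v(mu) = mu (Poisson, illustrative)
    R = np.exp(-D / theta)                 # exponential correlogram, sill 1
    vh = np.sqrt(v)
    Sigma = sigma2 * np.outer(vh, vh) * R  # elementwise V^{1/2} R V^{1/2}
    return np.diag(c0 * v) + Sigma

# Diagonal elements: (c0 + sigma2) v(mu_i); off-diagonal elements:
# sigma2 * sqrt(v(mu_i) v(mu_j)) * rho(s_i - s_j), matching the text.
```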
6.3.3 A Caveat
Before we proceed further with spatial models for non-Gaussian data, an important caveat of the presented model needs to be addressed. In the Gaussian case you can always construct a multivariate distribution with mean µ and variance Σ. This also leads to valid models for the marginal distributions; these are of course Gaussian with respective means and variances µ_i and [Σ]_ii. The model

Y = µ + ε,   ε ~ G(0, Σ)

is a generalization of

Y = µ + ε,   ε ~ G(0, I).
For non-Gaussian data, such generalizations are not possible. In the indepen-
dence case, there is a known marginal distribution for each observation based
on the exponential family. There is also a valid joint likelihood, the product of
the individual likelihoods. Estimation in generalized linear models can largely
be handled based on a specification of the first two moments of the responses
alone. Expressions (6.69) and (6.72) extend these moment specifications to
the non-Gaussian, spatially correlated case. There is, however, no claim made at this point that the underlying joint distribution is a "multivariate Binomial" distribution, or some such thing. We may even have to step back from the assumption that a joint distribution in which

• the mean is defined by (6.69),
• the variance is given by (6.72) with R ≠ I,
• the marginal distributions are Binomial, Poisson, etc.,

exists at all.
Moreover, in addition to the difficulties inherent in building non-Gaussian
multivariate distributions described in §5.8, the mean-variance relationship
also imposes constraints on the possible covariation between the responses. In
the simple case of n binary outcomes with common mean (success probability)
µ, Gilliland and Schabenberger (2001) examine the constraints placed on the
association parameter ρ in a model of equi-correlation (compound symmetry),
when the joint distribution on {0, 1}n is symmetric with respect to permu-
tation of coordinates. While the association parameter is not restricted from
above, it is severely restricted from below (Figure 6.7). The absolute mini-
mum in each case is of course ρmin = −1/(n − 1). This lower bound applies to
any n equi-correlated random variables, regardless of their distribution. But in the binary case—especially for small n—the lower bound can be substantially larger than ρ_min. The bound is achieved for µ = 0.5 if n is even, and for µ = 0.5 ± 1/(2n) if n is odd. The valid parameter space depends on the sample size and on the unknown success probability.
If one places further restrictions on the joint distributions, the valid com-
binations of (µ, ρ) are further constrained. If one sets third- and higher-order
correlations to zero as in Bahadur (1961), then the lower bounds are even more restrictive than shown in Figure 6.7, and the range of ρ is now also bounded from above (see also Kupper and Haseman, 1978). Such higher-order restrictions may be necessary to facilitate a particular parameterization or estimation technique, e.g., second-order generalized estimating equations. Prentice (1988) notes that without higher-order effects, models for correlated binomial data sacrifice a desired marginalization and flexibility.
Figure 6.7 Lower bounds of the correlation parameter of equi-correlated binary observations with mean µ so that the n-variate joint distribution is permutation symmetric (n = 2, 3, 4, 5, 6). For a given n, the region to the right of the respective curve is that of possible combinations of (µ, ρ) under this model. After Gilliland and Schabenberger (2001).
E[Z(s)|S] ≡ µ(s).
The link function relates this conditional mean to the explanatory covariates
Example 6.6 Consider a model with canonical link, g(µ) = log{µ}, identity
variance function v(µ) = µ, and let m(s) = exp{x(s)'β}. Such a construction
arises, for example, in Poisson regression. Using results from statistical theory
relating marginal and conditional means and variances, together with the
mean and variance relationships of the lognormal distribution (see §5.6.1),
the marginal moments of the data can be derived as

E[Z(s)] = E_S[E[Z(s)|S]] = E_S[m(s)exp{S}] = m(s)exp{σ_S²/2},

Var[Z(s)] = m(s){ σ²exp{σ_S²/2} + m(s)exp{σ_S²}(exp{σ_S²} − 1) }.
Note that Var[Z(s)] > E[Z(s)], even if σ² = 1. Also, both the overdispersion and the autocorrelation induced by the latent process {S(s)} depend on the mean, so the conditional model can be used with non-stationary spatial processes.
Traditional GLMs allow us to move away from the Gaussian distribution and
utilize other distributions that allow mean-variance relationships. However,
the likelihood-based inference typically used with these models requires a
multivariate distribution, and when the data are spatially-autocorrelated we
cannot easily build this distribution as a product of marginal likelihoods as
we do when the data are independent.
We can bypass this problem by using conditionally-specified generalized lin-
ear mixed models (GLMMs), since these models assume that, conditional on
random effects, the data are independent. The conditional independence and
hierarchical structure allow us to build a multivariate distribution, although
we cannot always be sure of the properties of this distribution. However, the
hierarchical structure of GLMMs poses problems of its own. As noted in Bres-
low and Clayton (1993), exact inference for first-stage model parameters (e.g.,
the fixed covariate effects) typically requires integration over the distribution
of the random effects. This necessary multi-dimensional integration is diffi-
cult and can often result in numerical instabilities, requiring approximations
or more computer-intensive estimation procedures.
There are several ways to avoid all of these problems (although arguably
they all introduce other problems). To address the situation where we can
define means and variances, but not necessarily the entire likelihood, Wed-
derburn (1974) introduced the notion of quasi-likelihood based on the first
two moments of a distribution, and the approach sees wide application for
GLMs based on independent data (McCullagh and Nelder, 1989). This leads
to an iterative estimating equation based on only the first two moments of a
distribution that can be used with spatial data. Another solution is based on
an initial Taylor series expansion that then allows pseudo-likelihood methods
for spatial inference similar to those described in §5.5.2. A similar approach,
called penalized quasi-likelihood by Breslow and Clayton (1993), uses a Laplace
approximation to the log-likelihood. If combined with a Fisher-scoring algo-
rithm, the estimating equations for the fixed effects parameters and predictors
360 SPATIAL REGRESSION MODELS
of random effects in the model are of the same form as the mixed model equa-
tions in the pseudo-likelihood approach of Wolfinger and O’Connell (1993).
Finally, Bayesian hierarchical models and the simulation methods used for
inference with them offer another popular alternative.
µ ≈ g^{-1}(X(s)β̃) + [∂g^{-1}(η)/∂η]|_{X(s)β̃} X(s)(β − β̃)
  = µ̃ + Ψ̃X(s)(β − β̃).
Substituting the approximated mean into (6.76) and re-arranging yields the linearized pseudo-model

Z(s)* = Ψ̃^{-1}(Z(s) − µ̃) + X(s)β̃
      = X(s)β + Ψ̃^{-1}e*(s).   (6.77)

This model is the linearized form of a nonlinear model with spatially correlated error structure. The pseudo-response Z(s)* is also called the "working" outcome variable, but the attribute should not be confused with the concept
(or EGLS), and iterates until convergence. We can apply this approach to
marginal GLMs as special cases of a GLMM.
The idea behind the pseudo-likelihood approach is to linearize the problem so that we can use the approach in §5.5.2–5.5.3 for estimation and inference. This is done using a first-order Taylor series expansion of the link function to give what Wolfinger and O'Connell (1993) call "pseudo-data" (similar to the "working" outcome variable Z(s)* used in the generalized estimating equations). As before, the pseudo-data are constructed as

ν_i = g(µ̂_i) + g'(µ̂_i)(Z(s_i) − µ̂_i),   (6.78)
where g'(µ̂_i) is the first derivative of the link function with respect to µ, evaluated at the current estimate µ̂. To apply the methods in §5.5.2–5.5.3, we need the mean and variance-covariance matrix of the pseudo-data ν. Conditioning on β and S, assuming Var[Z(s)|S] has the form of equation (6.72), and using some approximations described in Wolfinger and O'Connell (1993), these can be derived in almost the traditional fashion as
E[ν|β, S] = Xβ + S
Var[ν|β, S] = Σ_µ̂,

with

Σ_µ̂ = σ²Ψ̂^{-1}V_µ̂^{1/2}R(θ)V_µ̂^{1/2}Ψ̂^{-1} = Ψ̂^{-1}Σ(µ̂, θ)Ψ̂^{-1}.   (6.79)
Recall that the matrix Ψ̂ is an (n × n) diagonal matrix with typical element [∂µ(s_i)/∂η(s_i)], evaluated at µ̂. The marginal moments of the pseudo-data are

E[ν] = Xβ
Var[ν] = Σ_S + Σ_µ̂ ≡ Σ_ν,
and Σ_S has (i, j)th element σ_S²ρ_S(s_i − s_j). This can be considered a general linear regression model with spatially autocorrelated errors as described in §6.2, since the mean (of the pseudo-data ν) is linear in β. Thus, if we are willing to assume Σ_µ̂ is known (or at least does not depend on β) when we want to estimate β, and that β is known when we want to estimate θ, we can maximize the log-likelihood analytically, yielding the least squares equations

β̂ = (X'Σ_ν^{-1}X)^{-1}X'Σ_ν^{-1}ν,   (6.80)
Ŝ = Σ_S Σ_ν^{-1}(ν − Xβ̂),   (6.81)
σ̂² = (ν − Xβ̂)'(Σ_ν*)^{-1}(ν − Xβ̂)/n.   (6.82)

(The matrix Σ_ν* in (6.82) is obtained by factoring the residual variance from Σ_µ̂ and Σ_S.) However, because Σ_µ̂ does depend on β, we iterate as follows:
1. Obtain an initial estimate of µ̂ from the original data. An estimate from the non-spatial generalized linear model often works well;
Var[α] = σ²I, and the mixed model equations are particularly easy to solve. This provides a technique to spatially smooth, for example, counts and rates.
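For a log-link Poisson model, one linearization step of the kind defined in (6.78) takes the following form in Python; the function name is illustrative, and the surrounding loop (re-estimating β, S, and the covariance parameters from the pseudo-data via (6.80)–(6.82) and iterating) is sketched only in the closing comment.

```python
import numpy as np

def poisson_pseudo_data(z, eta):
    """One linearization step (6.78) for a log-link Poisson model.

    With g = log: g'(mu) = 1/mu, and Psi = diag(d mu / d eta) = diag(mu),
    so nu_i = eta_i + (z_i - mu_i) / mu_i.
    """
    mu = np.exp(eta)                 # current estimate of the mean
    nu = eta + (z - mu) / mu         # pseudo-data ("working" variable)
    Psi = np.diag(mu)                # enters Sigma_mu of (6.79)
    return nu, Psi

# The pseudo-data nu then enter the linear machinery of (6.80)-(6.82):
# estimate beta and S from nu, update eta = X beta + S, recompute nu,
# and iterate until the estimates stabilize.
```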
spatial data analysis is limited by the dimensionality of the integral in (6.86). Note that this is an integral over the random variables in S(s), an n-dimensional problem. A numerical integration via quadrature or other multi-dimensional technique is computationally not feasible unless the random effects structure is much simplified (which is the case for the radial smoothing models). However, an integral approximation that does not require high-dimensional integration and assumes conditional independence (R = σ_ε²I) is an option. In the case of longitudinal data, where the marginal covariance matrix is block-diagonal, Breslow and Clayton (1993) applied a Laplace approximation in an approach that they termed penalized quasi-likelihood.
Assume that you wish to compute ∫p(τ)dτ, that τ is (m × 1), and that the function p(τ) can be written as exp{nh(τ)}. Then the Laplace approximation of the target integral (Wolfinger, 1993) is

∫p(τ)dτ = ∫exp{nh(τ)}dτ ≈ exp{nh(τ̂)} (2π/n)^{m/2} |−h''(τ̂)|^{-1/2}.

The quantity τ̂ in this approximation is not just any value; it is the value that maximizes exp{nh(τ)}, or equivalently, maximizes h(τ). The term |−h''(τ̂)| is the determinant of the second derivative matrix, evaluated at that maximizing value.
In the case of a GLMM, we have p(τ) = f(Z(s)|S(s)) f_s(S(s)), τ = [β', S(s)']', and h(β, S(s)) = n^{-1}log{f(Z(s)|S(s)) f_s(S(s))}. Note that this approximation involves the random field S(s), rather than the covariance parameters. Breslow and Clayton (1993) use a Fisher-scoring algorithm, which amounts to replacing −h''(τ̂) with the expected value of

−∂²/∂τ∂τ' log{f(Z(s)|S(s)) f_s(S(s))},
which turns out to be

[ X(s)'Σ_µ̂^{-1}X(s)    X(s)'Σ_µ̂^{-1}
  Σ_µ̂^{-1}X(s)         Σ_µ̂^{-1} + Σ_S^{-1} ].
You will recognize this matrix as one component of the mixed model equations (6.85). From the first-order conditions of the problem we find the values that maximize h(β, S(s)) as the solutions of

X(s)'ΨΣ_µ̂^{-1}(Z(s) − g^{-1}(X(s)β + S(s))) = 0
ΨΣ_µ̂^{-1}(Z(s) − g^{-1}(X(s)β + S(s))) − Σ_S^{-1}S(s) = 0.
The solutions for β and S(s) are (6.80) and (6.81). Substituting into the
Laplace approximation yields the objective function that is maximized to ob-
tain the estimates of the covariance parameters. One can show that this ob-
jective function differs from the REML log likelihood in the pseudo-likelihood
method only by a constant amount. The two approaches will thus yield the
same estimates.
Example 6.7 Assume that a model has been fit with a log link, E[Y] = µ = exp{η} = exp{x'β}, and we have at our disposal the estimate β̂ as well as an estimate of its variability, V̂ar[β̂]. Then the plug-in estimate of µ is µ̂ = exp{η̂}, and the variance of the linear predictor is Var[η̂] = x'Var[β̂]x. To obtain an approximate variance of µ̂, expand it in a first-order Taylor series about η,

µ̂ ≈ µ + [∂µ̂/∂η̂]|_η (η̂ − η).

In our special case,

µ̂ ≈ exp{η} + exp{η}(η̂ − η),

so that Var[µ̂] ≈ exp{2η}Var[η̂]. The estimate of this approximate variance is then

V̂ar[µ̂] = exp{2η̂}V̂ar[η̂].

Since link functions are monotonic, confidence limits for η̂ = x(s)'β̂ can be transformed to confidence limits for µ(s), by using the upper and lower limits as arguments of the inverse link function g^{-1}(·).
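A small Python sketch of this Delta-method computation under the log-link assumptions of the example follows; the 1.96 multiplier for an approximate 95% interval and the function name are added illustrative choices.

```python
import numpy as np

def delta_log_link(x, beta_hat, cov_beta, zcrit=1.96):
    """Plug-in mean, Delta-method variance, and an approximate interval
    for mu = exp(x' beta) (roughly 95% coverage if zcrit = 1.96)."""
    eta_hat = x @ beta_hat
    var_eta = x @ cov_beta @ x              # Var[eta_hat] = x' Var[beta_hat] x
    mu_hat = np.exp(eta_hat)
    var_mu = np.exp(2 * eta_hat) * var_eta  # (d mu / d eta)^2 * Var[eta_hat]
    half = zcrit * np.sqrt(var_eta)
    # Limits computed on the linear predictor, then back-transformed through
    # the monotone inverse link, which keeps them positive
    return mu_hat, var_mu, (np.exp(eta_hat - half), np.exp(eta_hat + half))
```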
Confirmatory statistical inference about the covariance parameters θ in
generalized linear mixed models is more difficult compared to their linear
mixed model counterparts. Likelihood-ratio or restricted likelihood-ratio tests
are not immediately available in methods based on linearization. The obvious
reason is that adding or removing columns of the X(s) matrix or changing the
structure of S(s) changes the linearization. When the linearization changes,
so does the pseudo-data, and the pseudo objective functions such as (6.84) are
not comparable. Information-based criteria such as AIC or AICC also should not be used for model comparisons unless the linearizations are the same; this holds for ML as well as REML estimation. That two models are nested with respect to their large-scale trend structure (the columns of X(s)₂ are a subset of those of X(s)₁) and that ML pseudo-likelihood estimation is performed does not change this fact. An exception to this rule, where models can be compared based on their
pseudo objective functions, occurs when their large-scale trend structures are
the same, their formulation of S(s) is the same, and they are nested with
respect to the R(θ) structure.
where σ is the covariance vector between the pseudo-data for a new observation and the “observed” vector ν. Plug-in estimation replaces the GLS estimates with EGLS estimates and evaluates σ and Σμ at the estimated covariance parameters. The mean-squared prediction error E[(p(ν; ν(s₀)) − ν(s₀))²] = σ²ν(s₀) is computed as in §5.3.3, equation (5.30). To convert this prediction into one for the original data, you can apply the inverse link function,

Ẑ(s₀) = g⁻¹(p(ν; ν(s₀)))     (6.87)

and apply the Delta method to obtain a measure of prediction error,

σ²Z(s₀) = (∂g⁻¹(p(ν; ν(s₀)))/∂p(ν; ν(s₀)))² σ²ν(s₀) = (∂μ/∂η|p)² σ²ν(s₀).     (6.88)

The mean-squared prediction error associated with this predictor is E[(Ẑ(s₀) − Z(s₀))²], which can be obtained from the mean-squared prediction error of the pseudo-data, σ²ν(s₀), as in (6.88). The root mean-squared prediction error, √σ²Z(s₀), is usually reported as a prediction standard error.
In the case of a marginal GLMM where the mean function is defined as in
equation (6.69), the predictor based on pseudo-likelihood estimation (equation
6.90) is very similar in nature to that proposed by Gotway and Stroup (1997)
without the explicit linearization of pseudo data. Similarly, the mean-squared
prediction errors are comparable to those derived by Vijapurkar and Gotway
(2001) by expanding the nonlinear mean function instead of the link function;
see also Vallant (1985).
instead the median housing value per county as a surrogate for housing age
and maintenance quality (Figure 1.6). We also do not have data on whether
or not an individual child in the study was living in poverty in 2000. Thus, we
obtained from the U.S. Census Bureau the number of children in each county
under 17 years of age living in poverty in 2000, and will use this variable as a
surrogate measure of impoverishment (Figure 6.8).
[Figure 6.8 legend: # in Poverty — 80–345; 346–705; 706–1045; 1046–1838; 1839–15573]
Figure 6.8 Number of children under 17 years of age living in poverty in Virginia
in 2000. Source: U.S. Census Bureau.
so that
E[p(si )] = λ(si )
Var[p(si )] = λ(si )/n(si );
[Table 6.3 near here: Results from Poisson and Binomial regressions with overdispersion; one column each for the Poisson-based and the Binomial-based regression]
There is very little difference in the results obtained from the two models.
There is also very little to guide us in choosing between them. Comparing the
value of the AIC criterion from both models may be misleading since the mod-
els are based on two different distributions whose likelihoods differ. Moreover,
GLMs are essentially weighted regressions, and when different distributions
and different variance functions are used, the weights are different as well.
Thus, comparing traditional goodness of fit measures for these two models
(and for the other GLMs and GLMMs considered below) may not be valid.
[Figure near here: empirical semivariogram; Semivariance (0.0–2.0) against Distance (0–300), with the number of pairs per lag class printed above the plot]
• The results from the random effects model indicate that the relationship
between median housing value and the percentage of children with elevated
blood lead levels is not significant. They also show an inverse relationship
between the percentage of children with elevated blood lead levels and
poverty, although the relationship is not significant. The results from all
other models indicate just the opposite: a significant association with me-
dian housing value and a nonsignificant poverty coefficient that has a pos-
itive sign. Thus, some of the spatial variation in the data arising from spatial autocorrelation (incorporated through R(α)) is perhaps being attributed to median housing value and poverty in the random effects model. In the conditionally-specified model, choices about R(α) affect S(s), which in turn can affect the estimates of the fixed effects; this just reflects the fact that the general decomposition of “data = f(fixed effects, random effects, error)”
is not unique. The random effects model does not allow any spatial struc-
ture in the random effects or the errors. The conditional and marginal
spatial models do allow spatial structure in these components, with the
conditional model assigning spatial structure to the random effects, and
the marginal model assigning the same sort of structure to “error.” Thus,
while the conditional and marginal spatial GLMs accommodate variation
in different ways, incorporating spatially-structured variation in spatial re-
gression models can be important.
• The marginal spatial model and the traditional Poisson-based model (Table
6.3) give similar results. However, the standard errors from the margi-
nal spatial GLM are noticeably higher than for the traditional Poisson-
based model. Thus, with the “data=f(fixed effects, error)” decomposition in
marginal models, the marginal spatial GLMs incorporate spatial structure
through the “error” component.
• The majority of the models suggest that median housing value is significantly associated with the percentage of children with elevated blood lead levels, and that counties with higher median housing values tend to have
a lower percentage of children with elevated blood lead levels. Also, there
appears to be a positive relationship between the percentage of children
with elevated blood lead levels and poverty, although the relationship is
not significant (and may not be well estimated).
The spatial GLM and GLMM considered thus far in this example use a
geostatistical approach to model autocorrelation in the data. However, the
regional nature of the data suggests that using a different measure of spa-
tial proximity might be warranted. For linear regression models, the spatial autoregressive structures of §6.2.2 provide such proximity-based alternatives; here, the marginal spatial GLM is refit with a CAR autocorrelation structure (Table 6.5).
Table 6.5 Results from a marginal spatial GLM with a CAR autocorrelation structure.

Effect               Estimate    Std. Error    t-value    p-value
Intercept (β₀)       −2.1472     0.1796        −11.95     < 0.00001
Median Value (β₁)    −0.7532     0.2471        −3.05      0.0014
Poverty (β₂)         0.5856      1.4417        0.40       0.6574
σ̂₀²                  0.0000
σ̂₁²                  6.2497
ρ̂                    0.0992
These results are quite similar to those from the marginal spatial GLM
given in Table 6.4, although it is difficult to generalize this conclusion. Often,
adjusting for spatial autocorrelation is more important than the parametric
models used to do the adjustment, although in some applications, different
models for the spatial autocorrelation can lead to different results and con-
clusions.
[Figure 6.10 near here: scatter plot; Marginal Spatial GLM Predictions (0–50) against Conditional Spatial GLMM Predictions (0–50)]
Figure 6.10 Predictions from marginal spatial GLM vs. predictions from conditional
spatial GLMM.
The maps show the predicted percentage of children with elevated blood lead levels, adjusted for median housing value and poverty. For the models with random effects, this mean is g⁻¹(X(s)β̂ + Ŝ(s)), where Ŝ(s) is obtained from (6.81). For the marginal spatial models, this mean should be based on prediction of Z(sᵢ) at the data locations, using the methods described in §6.3.6 (cf., §5.4.3).
Figure 6.10 compares the adjusted percentages for the conditional spatial
GLMM and the marginal spatial GLM with a geostatistical variance structure.
There is a strong relationship between them, although there is more smoothing
in the marginal model. This is dramatically evident from the adjusted maps
shown in Figure 6.12.
The maps of the adjusted percentages obtained from the conditional spatial
GLMM are more similar to the raw rates shown in Figure 1.5, since this model
tends to shrink the percentages to local means (through S(s)) rather than to
a global mean as in the marginal models. Figure 6.11 shows the adjusted
percentages from the two marginal spatial GLMs, one with a geostatistical
variance, and one with the CAR variance. The adjusted percentages are more
different than the maps in Figure 6.12 suggest. The model with the CAR variance smooths the data much more than the model with the geostatistical variance. This may result from our choice of adjacency weights: with these weights, adjacent counties are assumed to be spatially similar, and so the contribution from σ₀², which mitigates local spatial similarity, is reduced.
[Figure 6.11 near here: scatter plot; Marginal Model with Geostatistical Variance (0–25) against Marginal Model with CAR Variance (0–25)]
Figure 6.11 Predictions from marginal GLM using geostatistical variance structure
vs. predictions from marginal GLM using CAR variance structure.
[Figure 6.12 near here: adjusted maps for the Traditional Poisson-Based model and the Marginal Spatial GLM With Geostatistical Variance; legend: % Elevated — < 2.62; 2.62–5.23; 5.24–10.46; 10.47–20.93; > 20.93]
Both the conditional spatial GLMM and the marginal spatial GLM indicate
three areas where blood lead levels in children remain high, after adjusting for
median housing value and poverty level: one, comprising Frederick and Warren counties, in the north; another in south-central Virginia near Farmville;
and another in the northeast in the Rappahannock river basin. To investigate
these further, a more thorough analysis that includes other potentially impor-
tant covariates (e.g., other demographic variables, more refined measures of
known lead exposures), more local epidemiological and environmental infor-
mation, and subject-matter experts, is needed.
Moreover, these analyses all assume that the relationship between the per-
centage of children with elevated blood lead level and median housing value
and poverty level is the same across all of Virginia. We can relax this assumption by considering a Poisson version of geographically weighted regression (§6.1.3.1, Fotheringham et al., 2002). Instead of assuming a linear relationship between the data and the covariates as in (6.9), we assume

E[Z(s)] = μ(s) ≡ n(s)λ(s),
Figure 6.13 Estimates of the effect of median housing value from GWR.
A major advantage of this type of spatial analysis is that it allows us to fine-tune our hypotheses and isolate more specific exposure sources that may affect blood lead levels in children.
Figure 6.14 Standard errors of the effect of median housing value estimated from
GWR.
Example 6.8 Suppose that given φ, the data Zᵢ are independent Poisson variables, Zᵢ|φ ~ ind Poisson(φ), and suppose φ ~ Γ(α, β). Then, the likelihood function is

f(z|φ) = ∏ᵢ₌₁ⁿ e^{−φ} φ^{zᵢ}/zᵢ!,

and the posterior is

h(φ|z) = ∏ᵢ₌₁ⁿ e^{−φ} φ^{zᵢ}/zᵢ! × β^{α}/Γ(α) φ^{α−1} e^{−βφ}.

Combining terms, the posterior can be written as

h(φ|z) ∝ φ^{α+Σzᵢ−1} e^{−(β+n)φ},

which is Γ(α + Σzᵢ, β + n).
Given the posterior distribution, we can use different summary statistics from this distribution to provide inferences about φ. One that is commonly used is the posterior mean, which in this case is

E[φ|z] = (α + Σzᵢ)/(β + n) = z̄ (n/(n + β)) + (α/β)(1 − n/(n + β)).

Thus, the Bayes estimate is a linear combination of the maximum likelihood estimate z̄, and the prior mean, α/β.
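A short sketch verifying the conjugate update and the shrinkage form of the posterior mean, with hypothetical counts and prior settings:

import numpy as np

z = np.array([3, 1, 4, 2, 5])      # hypothetical Poisson counts
alpha, beta = 2.0, 1.0             # Gamma(alpha, beta) prior for phi
n = len(z)

# Conjugate update: phi | z ~ Gamma(alpha + sum(z), beta + n)
alpha_post, beta_post = alpha + z.sum(), beta + n
post_mean = alpha_post / beta_post

# Equivalent shrinkage form: weighted average of zbar and the prior mean
w = n / (n + beta)
assert np.isclose(post_mean, w * z.mean() + (1 - w) * (alpha / beta))
print(post_mean)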
Suppose φ = (θ, τ)′ with joint posterior distribution h(θ, τ|z). Inference about θ is made from the marginal posterior distribution of θ, obtained by integrating τ out of the joint posterior distribution. For most realistic models, the posterior distribution is complex and high-dimensional and such integrals are difficult to evaluate. However, in some cases, it is often possible to simulate realizations from the desired distribution. For example, suppose h(θ, τ|z) = f(θ|τ, z)g(τ|z), and suppose it is easy to simulate realizations from f(θ|τ, z) and g(τ|z). Then

h(θ|z) = ∫ f(θ|τ, z)g(τ|z) dτ,

so drawing τ* from g(τ|z) and then θ* from f(θ|τ*, z) yields a realization from the marginal posterior h(θ|z). More generally, a distribution π(x) is stationary for a Markov chain with transition kernel P(x, A) if π(A) = ∫ P(x, A)π(x) dx. Markov chain Monte Carlo (MCMC) is a sampling-based simulation technique for generating a dependent sample from a specified distribution of interest, π(x). Under certain conditions, if we “run” a Markov chain, i.e., we generate X⁽¹⁾, X⁽²⁾, · · ·, then

X⁽ᵐ⁾ →d X ~ π(x) as m → ∞.
Thus, if a long sequence of values is generated, the chain will converge to a
stationary distribution, i.e., after convergence, the probability of the chain
being in any particular “state” (or taking any particular value) at any partic-
ular time remains the same. In other words, after convergence, any sequence
of observations from the Markov chain represents a sample from the station-
ary distribution. To start the chain, a starting value, X⁽⁰⁾, must be specified.
Although the chain will eventually converge to a stationary distribution that
does not depend on X0 , usually the first m∗ iterations are discarded (called
the “burn-in”) to minimize any potential impact of the starting value on the
remaining values in the sequence. Note that the values in the remaining, post-
convergence sequence are dependent since each new value still depends on the
previous value. Thus, in practice, one selects an approximately independent random sample by using only every k-th value appearing in the sequence, where k is large enough to ensure the resulting sample mimics that of a purely random process.
Tierney (1994) and Robert and Casella (1999) provide much more rigorous
statistical treatments of MCMC sampling and convergence, and Gilks et al.
(1996) provide a practical discussion.
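The following is a minimal illustrative sketch of burn-in and subsampling using a Metropolis chain; the target, starting value, and tuning constants are arbitrary choices, not from the text:

import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    # Log density of the target pi(x), up to a constant; here standard Gaussian
    return -0.5 * x * x

def metropolis(n_iter, x0=5.0, step=1.0):
    chain, x = np.empty(n_iter), x0
    for t in range(n_iter):
        prop = x + step * rng.normal()          # symmetric proposal
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop                            # accept
        chain[t] = x
    return chain

chain = metropolis(20000)
burn_in, k = 2000, 10                # discard burn-in, keep every k-th value
sample = chain[burn_in::k]
print(sample.mean(), sample.var())   # approximately 0 and 1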
Example 6.10 Consider a hierarchical model that builds on the model given
in Example 6.8 above. Suppose that
Zᵢ|λᵢ ~ ind Poisson(λᵢ)
λᵢ ~ ind Γ(α, βᵢ)
βᵢ ~ ind Γ(a, b)

with α, a, and b known. Then, the posterior distribution is

h(λ, β|z) = ∏ᵢ₌₁ⁿ e^{−λᵢ}λᵢ^{zᵢ}/zᵢ! × βᵢ^{α}/Γ(α) λᵢ^{α−1}e^{−βᵢλᵢ} × b^{a}/Γ(a) βᵢ^{a−1}e^{−bβᵢ}.
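The joint posterior above factors into recognizable full conditionals, λᵢ|βᵢ, zᵢ ~ Γ(α + zᵢ, βᵢ + 1) and βᵢ|λᵢ ~ Γ(α + a, λᵢ + b) (rate parameterization), which suggests a Gibbs sampler. A minimal sketch with hypothetical data and hyperparameters:

import numpy as np

rng = np.random.default_rng(7)
z = np.array([4, 0, 2, 7, 3])        # hypothetical counts
alpha, a, b = 2.0, 2.0, 1.0          # known hyperparameters
n, n_iter = len(z), 5000

lam, beta = np.ones(n), np.ones(n)
draws = np.empty((n_iter, n))
for t in range(n_iter):
    # lambda_i | beta_i, z_i ~ Gamma(alpha + z_i, rate = beta_i + 1)
    lam = rng.gamma(alpha + z, 1.0 / (beta + 1.0))
    # beta_i | lambda_i ~ Gamma(alpha + a, rate = lambda_i + b)
    beta = rng.gamma(alpha + a, 1.0 / (lam + b))
    draws[t] = lam

print(draws[1000:].mean(axis=0))     # posterior means of the lambda_i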
At convergence, the MCMC values should represent a sample from the tar-
get distribution of interest. Thus, for large m they should randomly fluctuate
around a stable mean value. So, a first diagnostic is to plot the sequence φ⁽ᵗ⁾
against t. Another approach is to use multiple chains with different starting
values. This allows us to compare results from different replications; for ex-
ample, Gelman and Rubin (1992) suggest convergence statistics based on the
ratio of between-chain variances to within-chain variances. However, conver-
gence to a stationary distribution may not be the only concern. If we are
interested in functions of the parameters, we might be more interested in convergence of averages of the form m⁻¹ Σₜ₌₁ᵐ g(φ⁽ᵗ⁾). Another consideration is
the dependence in the samples. If independent samples are required for infer-
ence, we can obtain these by subsampling the chain, but then we need to assess
how close the resulting sample is to being independent. There have been many
different convergence diagnostics proposed in the literature and Cowles and
Carlin (1996) provide a comprehensive, comparative review. Many of these,
and others proposed more recently, are described and illustrated in Robert
and Casella (1999).
At this point, there has been nothing explicitly spatial about our discussion
and the examples provided were basically aspatial. It is difficult to give a
general treatment of Bayesian hierarchical models, and of their use in spatial data analysis in particular, since each application can lead to a
unique model. Instead, we give an overview of several general types of models
that have been useful in spatial data analysis. More concrete examples can
be found in e.g., Carlin and Louis (2000) and Banerjee, Carlin, and Gelfand
(2003).
[Z(s)|β] ~ G(X(s)β, R)
[β] ~ G(m, Q),     (6.94)

with m and Q known. We assume that Q and (X(s)′R⁻¹X(s)) are both of full rank. The posterior distribution, h(β|Z(s)), is

h(β|Z(s)) ∝ f(Z(s)|β)π(β)
   ∝ exp{−½[(Z(s) − X(s)β)′R⁻¹(Z(s) − X(s)β) + (β − m)′Q⁻¹(β − m)]}
   = exp{−½[β′(Q⁻¹ + X(s)′R⁻¹X(s))β − β′(Q⁻¹m + X(s)′R⁻¹Z(s))
        − (Q⁻¹m + X(s)′R⁻¹Z(s))′β + m′Q⁻¹m + Z(s)′R⁻¹Z(s)]}
   ∝ exp{−½(β − m*)′(Q*)⁻¹(β − m*)}

with

m* = (Q⁻¹ + X(s)′R⁻¹X(s))⁻¹(X(s)′R⁻¹Z(s) + Q⁻¹m)
Q* = (Q⁻¹ + X(s)′R⁻¹X(s))⁻¹.
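A minimal numerical sketch of the posterior computation; the design, covariance, and prior settings are hypothetical placeholders, while the formulas for m* and Q* are those above:

import numpy as np

rng = np.random.default_rng(3)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # hypothetical X(s)
R = np.eye(n)                                          # data covariance R
m, Q = np.zeros(p), 10.0 * np.eye(p)                   # prior mean and covariance
Z = X @ np.array([1.0, -0.5]) + rng.normal(size=n)     # synthetic data

Rinv, Qinv = np.linalg.inv(R), np.linalg.inv(Q)
Qstar = np.linalg.inv(Qinv + X.T @ Rinv @ X)           # posterior covariance Q*
mstar = Qstar @ (X.T @ Rinv @ Z + Qinv @ m)            # posterior mean m*
print(mstar)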
For spatial prediction with the hierarchical model in (6.94), we require the predictive distribution of Z(s₀), the marginal distribution of Z(s₀) given Z(s). This distribution is Gaussian with mean

E[Z(s₀)|Z(s)] = X(s₀)m* + R₀zRzz⁻¹(Z(s) − X(s)m*)

and

Var[Z(s₀)|Z(s)] = R₀₀ − R₀zRzz⁻¹Rz₀ + (X(s₀) − R₀zRzz⁻¹X(s))
   × (Q⁻¹ + X(s)′Rzz⁻¹X(s))⁻¹(X(s₀) − R₀zRzz⁻¹X(s))′.
Suppose now that σ² is unknown and has an inverted Gamma prior with density

π(σ²) = (a/2)^{d/2}/Γ(d/2) (σ²)^{−(d+2)/2} exp{−a/(2σ²)},

and mean a/(d − 2). Then σ⁻² has a Gamma distribution with parameters (d/2, a/2) and mean d/a. The joint posterior distribution is h(β, σ²|Z(s)) =
f (Z(s)|β, σ 2 )π(β, σ 2 ), but inferences about β must be made from h(β|Z(s)),
obtained by integrating out σ 2 from the joint posterior. We can obtain the
moments of this distribution using results on conditional expectations. Condi-
tioning on σ 2 is the same as assuming σ 2 is known, so this mean is equivalent to
that derived above. Thus, the posterior mean is E[β|Z(s)] = E[E(β|σ 2 , Z(s))]
= m∗ . However, the variance is Var[β|Z(s)] = E[σ 2 |Z(s)]Q∗ . The posterior
mean of σ 2 , E[σ 2 |Z(s)], can be derived using similar arguments (and much
more algebra). O’Hagan (1994, p. 249) shows that it is a linear combination
of three terms: the prior mean, E(σ 2 ), the usual residual sum of squares, and
a term that compares β̂gls to m. The real impact of the prior distribution, π(β, σ²), is that the posterior density function of β follows a multivariate t-distribution, as does the predictive distribution of Z(s₀), with variance

Var[Z(s₀)|Z(s)] = {V₀₀ − V₀zVzz⁻¹Vz₀ + (X(s₀) − V₀zVzz⁻¹X(s))
   × (Q⁻¹ + X(s)′Vzz⁻¹X(s))⁻¹(X(s₀) − V₀zVzz⁻¹X(s))′} ν*,     (6.98)
where

ν* = [a + (n − p)σ̂² + (β̂gls − m)′(X(s)′Vzz⁻¹X(s))(β̂gls − m)] / (n + d − 2).
Finally, consider the more general case with R = σ 2 V(θ), where θ charac-
terizes the spatial autocorrelation in the data. Now we need to assume something for the prior distribution π(β, σ², θ). For convenience, Handcock
and Stein (1993) choose π(β, σ 2 , θ) ∝ π(θ)/σ 2 and use the Matérn class of
covariance functions to model the dependence of V on θ, choosing uniform
priors for both the smoothness and the range parameters in this class. In this
case, and for other choices of π(β, σ 2 , θ), analytical derivation of the predictive
distribution becomes impossible. Computational details for the use of impor-
tance sampling or MCMC methods for a general π(θ) and any parametric
model for V(θ) are given in Gaudard et al. (1999).
Unfortunately, and surprisingly, many (arguably most) of the noninforma-
tive priors commonly used for spatial autocorrelation (including the one used
by Handcock and Stein described above) can lead to improper posterior dis-
tributions (Berger, De Oliveira, and Sansó, 2001). Berger et al. (2001) provide
tremendous insight into how such choices should be made and provide a flex-
ible alternative (based on a reference prior approach) that always produces proper posterior distributions. However, this prior has not been widely used in
spatial applications, so its true impacts on Bayesian modeling of spatial data
are unknown.
One of the most popular uses of Bayesian hierarchical modeling of spatial data
is in disease mapping. In this context, for each of i = 1, · · · , N geographic
regions, let Z(si ) be the number of incident cases of disease, where si is a
generic spatial index for the ith region. Suppose that we also have available
the number of incident cases expected in each region based on the population
size and demographic structure within region i (E(si ), i = 1, . . . , N ); often the
E(si ) reflect age-standardized values. The Z(si ) are random variables and we
assume the E(sᵢ) are fixed values. In many applications where E(sᵢ) is unknown, n(sᵢ), the population at risk in region i, is used instead, as in Example 1.4.
To allow for the possibility of region-specific risk factors in addition to those
defining each E(si ), Clayton and Kaldor (1987) propose a set of region-specific
relative risks ζ(sᵢ), i = 1, · · · , N, and define the first stage of a hierarchical
model through
[Z(sᵢ)|ζ(sᵢ)] ~ ind Poisson(E(sᵢ)ζ(sᵢ)).     (6.99)
Thus E[Z(si )|ζ(si )] = E(si )ζ(si ), so that the ζ(si ) represent an additional
multiplicative risk associated with region i, not already accounted for in the
calculation of E(si ).
Most applications complete the model at this point, assigning fixed values
to the two parameters of the inverse Gamma hyperprior. There has been
much recent discussion on valid noninformative choices for these parameters
since the inverse Gamma distribution is only defined for positive values and
zero is a degenerate value (see e.g., Kelsall and Wakefield, 1999 and Gelman
et al., 2004, Appendix C). Ghosh et al. (1999) and Sun et al. (1999) define
conditions on the inverse Gamma parameters to ensure a proper posterior.
In our experience, convergence of MCMC algorithms can be very sensitive to
choices for these parameters, so they must be selected carefully.
To more explicitly model spatial autocorrelation, we can use the prior dis-
tribution for ψ to allow a spatial pattern among the ψ(si ), e.g., through a
parametric covariance function linking pairs ψ(sᵢ) and ψ(sⱼ) for j ≠ i. Thus,
we could consider a joint multivariate Gaussian prior distribution for ψ with
spatial covariance matrix Σψ , i.e.,
ψ ∼ G(0, Σψ ), (6.103)
and, following the geostatistical paradigm, define the elements of Σψ through
a parametric covariance function, e.g., such as in Handcock and Stein (1993)
discussed above. A more popular approach is to use a conditionally specified
prior spatial structure for ψ similar to the conditional autoregressive models
introduced in §6.2.2.2. With a CAR prior, we define the distribution of ψ
conditionally, i.e.,
ψ(sᵢ)|ψ(sⱼ≠ᵢ) ~ G(Σⱼ₌₁ᴺ cᵢⱼψ(sⱼ), τᵢ²),  i = 1, · · · , N,

which is equivalent to

ψ ~ G(0, (I − C)⁻¹M),     (6.104)

where M = diag(τ₁², · · · , τN²). As in §6.2.2.2, the cᵢⱼ denote spatial dependence
parameters and we set cii = 0 for all i. Typical applications take M = τ 2 I
and consider adjacency-based weights where cij = φ if region j is adjacent to
region i, and cij = 0, otherwise. Clayton and Kaldor (1987) propose priors
in an empirical Bayes setting, and Breslow and Clayton (1993) apply CAR
priors as random effects distributions within likelihood approximations for
generalized linear mixed models.
For a fully Bayesian approach, we need to specify a prior for [φ, τ²], e.g.,

π(φ, τ²) = π(φ)π(τ²)
π(τ²) ~ IG(a, b)
π(φ) ~ U(γmin⁻¹, γmax⁻¹),

where γmin and γmax are the smallest and largest eigenvalues of W, where C = φW and W is a spatial proximity matrix (e.g., Besag et al., 1991; Stern and Cressie, 1999).
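A small sketch of the eigenvalue computation that determines the support of the uniform prior for φ, using a hypothetical four-region adjacency matrix and the typical choice M = τ²I:

import numpy as np

# Hypothetical adjacency (proximity) matrix W for four regions in a row
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

eig = np.linalg.eigvalsh(W)
phi_low, phi_high = 1.0 / eig.min(), 1.0 / eig.max()   # support of the U prior

# For phi in this interval, I - phi*W is positive definite, so the joint
# CAR distribution psi ~ G(0, (I - phi W)^{-1} tau^2 I) is well defined
phi, tau2 = 0.9 / eig.max(), 1.0
Sigma_psi = tau2 * np.linalg.inv(np.eye(4) - phi * W)
assert np.all(np.linalg.eigvalsh(np.eye(4) - phi * W) > 0)
print(phi_low, phi_high)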
There are many issues surrounding the use of CAR priors, e.g., choices for
cij that lead to desired interpretations (cf., intrinsic autoregression of Besag and Kooperberg, 1995), the “improperness” that results from the pairwise
construction of CAR models and the use of constraints on ψ to assure proper
posterior distributions, etc. (see, e.g., Besag et al., 1991; Besag et al., 1995).
When the data are not Gaussian, the optimal predictor of the data at new
locations, E[Z(s0 )|Z(s)], may not be linear in the data and so linear predic-
tors (like the universal kriging predictor described in §5.3.3 and the Bayesian
kriging predictors described above in §6.4.3.1) may be poor approximations
to this optimal conditional expectation. Statisticians typically address this
problem in one of two ways, either through transformation as described in
§5.6 or through the use of generalized linear models described in §6.3. Just as
there is a Bayesian hierarchical model formulation for universal kriging, there
are also Bayesian hierarchical model formulations for trans-Gaussian (§5.6.2)
and indicator kriging (§5.6.3) and for spatial GLMMs (§6.3.4).
De Oliveira et al. (1997) extend the ideas given in §6.4.3.1 to the case of
transformed Gaussian fields. Consider data Z(s), and a parametric family of
monotone transformations, {gλ (·)}, for which gλ (Z(s)) is Gaussian. In contrast
to the methods described in §5.6.2, De Oliveira et al. (1997) assume λ is
unknown. On the transformed scale, (gλ(Z(s)), gλ(Z(s₀)))′ is assumed to follow the same multivariate Gaussian distribution as in (6.95), with R = σ²V(θ).
The idea is to build a hierarchical model that will incorporate uncertainty in
the sampling distribution through a prior distribution on λ. Following ideas
in Box and Cox (1964), De Oliveira et al. (1997) choose the prior
π(β, σ², θ, λ) ∝ π(θ)/(σ²Jλ^{p/n}),

where Jλ = ∏ᵢ₌₁ⁿ |g′λ(z(sᵢ))|, and then assume (bounded) uniform priors for λ and each component of θ.
The predictive distribution is obtained from

p(Z(s₀)|Z(s)) = ∫ f(Z(s₀)|Z(s), φ)f(φ|Z(s)) dφ,     (6.105)
probability coverages based on 95% prediction intervals from the BTG model
were much closer to nominal than those from trans-Gaussian kriging. BTG
intervals covered 91.6% of the 24 true rainfall values, while TGK covered 75%,
although the kriging prediction intervals were based on z-scores rather than
t-scores. Thus, the BTG model is another approach to adjust for use of the
plug-in variance in kriging in small samples (in addition to that of Kackar and
Harville and Kenward and Roger discussed in §5.5.4).
Kim and Mallick (2002) adopt a slightly different approach to the prob-
lem of spatial prediction for non-Gaussian data. They utilize the multivariate
skew-Gaussian (SG) distribution developed by Azzalini and Capitanio (1999).
The SG distribution is based on the multivariate Gaussian distribution, but
includes an extra parameter for shape. Kim and Mallick (2002) construct a hi-
erarchical model very similar to the most general model described in §6.4.3.1,
but with the likelihood based on a skew-Gaussian distribution rather than a
Gaussian distribution. Because the multivariate Gaussian distribution forms
the basis for the SG distribution, the SG distribution has many similar prop-
erties and the ideas and results in §6.4.3.1 can be used to obtain all full
conditionals needed for Gibbs sampling.
These approaches assume the data are continuous and that there is no rela-
tionship between the mean and the variance. In a generalized linear modeling
context, Diggle et al. (1998) consider the model in §6.3.4 where given an un-
derlying, smooth, spatial process {S(s) : s ∈ D}, Z(s) has distribution in the
exponential family. Thus,
E[Z(s)|S] ≡ μ(s),
g[μ(s)] = x(s)′β + S(s),     (6.106)

and we assume S(s) is a Gaussian random field with mean 0 and covariance function σ²SρS(sᵢ − sⱼ; θ). Thus, in a hierarchical specification,

[Z(s)|β, S] ~ ind f(g⁻¹(X(s)β + S))
[S|θ] ~ G(0, ΣS(θ))
[β, θ] ~ π(β, θ),
Diggle et al. (1998) take a two-step approach, first constructing an algorithm to generate samples from the full conditionals of θ, β, and S(s), and then considering prediction of new values S(s₀). For prediction of S(s₀), again assume that, conditional on β and θ, S(s) and S(s₀) follow a multivariate normal distribution of the form of equation (6.95), where

[S(s), S(s₀)]′ ~ G(0, [ΣSss(θ)  ΣSs₀(θ); ΣS₀s(θ)  ΣS₀₀(θ)]).

Diggle et al. (1998) assume that the data Z(s) are independent of S(s₀) (which holds if Z(s) depends on the random field only through S(s)).
Problem 6.1 Consider a Gaussian linear mixed model for longitudinal data. Let Yᵢ denote the (nᵢ × 1) response vector for subject i = 1, · · · , s. If Xᵢ and Zᵢ are (nᵢ × p) and (nᵢ × q) design matrices of known constants and bᵢ ~ G(0, G), then the conditional and marginal distributions in the linear mixed model are given by

Yᵢ|bᵢ ~ G(Xᵢβ + Zᵢbᵢ, Rᵢ)
Yᵢ ~ G(Xᵢβ, Vᵢ)
Vᵢ = ZᵢGZ′ᵢ + Rᵢ

(i) Derive the maximum likelihood estimator for β based on the marginal distribution of Y = [Y′₁, · · · , Y′ₛ]′.
(ii) Formulate a model in which to perform local estimation. Assume that
you want to localize the conditional mean function E[Yi |bi ].
Problem 6.2 In the first-difference approach model (6.12), describe the cor-
relation structure among the observations in a column that would lead to
uncorrelated observations when the differencing matrix (6.13) is applied.
Problem 6.3 Show that the OLS residuals (6.4) on page 307 have mean zero,
even if the assumption that Var[e(s)] = σε²I is not correct.
Problem 6.8 Suppose that t treatments are assigned to field plots arranged
in a single long row so that each treatment appears in exactly r plots. To
account for soil fertility gradients as well as measurement error, the basic
model is
Z(s) = Xτ + S(s) + ε,

where S(s) ~ (0, σ²Σ) and ε ~ (0, σε²I). Assume it is known that the random field {S(s)} is such that the (rt − 1) × rt matrix of first differences (∆, see (6.13) on page 320) transforms Σ as follows: ∆Σ∆′ = I.
(i) Describe the estimation of treatment effects τ and the variance components σ² and σε² in the context of a linear mixed model, §6.2.1.3.
(ii) Is this a linear mixed model with or without correlations in the condi-
tional distribution? Hint: After the transformation matrix ∆ is applied, what
do you consider to be the random effects in the linear mixed model?
Problem 6.9 In the mixed model formulation of the linear model with au-
tocorrelated errors (§6.2.1.3), verify the expression for C in (6.28) and the
expressions for the mixed model equation solutions, (6.29)–(6.30).
Problem 6.11 Show that the mixed model predictor (6.31) has the form of
a filtered, universal kriging predictor.
Problem 6.15 If Z|n ~ Binomial(n, π) and n ~ Poisson(λ), find the distribution of Z. Is it more or less dispersed than the distribution of Z|n?
Problem 6.17 For the data on blood lead levels in children in Virginia, 2000,
repeat the analysis in Example 1.4 using models based on the Binomial dis-
tribution.
CHAPTER 7

Simulation of Random Fields
The Gaussian random field holds a core position in the theory of spatial
data analysis much in the same vein as the Gaussian distribution is key to
many classical approaches to statistical inference. Best linear unbiased krig-
ing predictors are identical to conditional means in Gaussian random fields,
establishing their optimality beyond the class of linear predictors. Second-
order stationarity in a Gaussian random field implies strict stationarity. The
statistical properties of estimators derived from Gaussian data are easy to
examine and test statistics usually have a known and simple distribution.
Hence, creating realizations from GRFs is an important task. Fortunately, it
is comparatively simple.
Chilès and Delfiner (1999, p. 451) discuss an instructive example that high-
lights the importance of simulation. Imagine that observations are collected
along a transect at 100-meter intervals measuring the depth of the ocean floor.
The goal is to measure the length of the profile. One could create a continuous
profile by kriging and then obtain the length as the sum of the segments be-
tween the observed transect locations. Since kriging is a smoothing of the data
in-between the observed locations, this length would be an underestimate of
the profile length. In order to get a realistic estimate, we need to generate val-
ues of the ocean depth in-between the 100-meter sampling locations that are
consistent with the stochastic variation we would have seen, had the sampling
interval been shorter.
In this example it is reasonable that the simulated profile passes through the
observed data points. After all, these were the values which were observed, and
barring measurement error, reflect the actual depth of the ocean. A simulation
method that honors the data in the sense that the simulated value at an
observed location agrees with the observed value is termed a conditional
simulation. Simulation methods that do not honor the data, for example,
because no data has yet been collected, are called unconditional simulations.
Several methods are available to simulate GRFs unconditionally; some are more brute-force than others. The simplest (and probably crudest) method relies on the reproductive property of the (multivariate) Gaussian distribution and the fact that a positive-definite matrix Σ can be represented as

Σ = Σ¹ᐟ²(Σ¹ᐟ²)′.

If X ~ G(0, I), then μ + Σ¹ᐟ²X is a realization from a G(μ, Σ) distribution.
If Σn×n is a positive definite matrix, then there exists an upper triangular matrix Un×n such that Σ = U′U. The matrix U is called the Cholesky root of Σ and is unique up to sign (Graybill, 1983). Since U′ is lower-triangular and U is upper-triangular, the decomposition is often referred to as the lower-upper or LU decomposition. Many statistical packages can calculate a Cholesky root, for example, the root() function of the SAS/IML® module. Since Gaussian random number generators are also widely available, this suggests a simple method of generating data from a Gn(μ, Σ) distribution. Generate n independent standard Gaussian random deviates and store them in vector x. Calculate the Cholesky root U′ of the variance-covariance matrix Σ and form the (n × 1) mean vector μ. Return y = μ + U′x as a realization from a G(μ, Σ) distribution. This method works well for small to moderately sized problems. As n grows large, however, calculating the Cholesky decomposition is numerically expensive.
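A minimal sketch of this method for a one-dimensional GRF with an exponential covariance function; the grid, mean, and covariance parameters are arbitrary choices:

import numpy as np

rng = np.random.default_rng(11)
s = np.linspace(0, 10, 100)                         # 1-D site coordinates
D = np.abs(s[:, None] - s[None, :])                 # distance matrix
Sigma = np.exp(-D / 2.0) + 1e-10 * np.eye(len(s))   # exponential covariance (+ jitter)
mu = np.zeros(len(s))

L = np.linalg.cholesky(Sigma)                       # lower-triangular root U'
x = rng.standard_normal(len(s))                     # independent G(0, 1) deviates
y = mu + L @ x                                      # one realization from G(mu, Sigma)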
The idea of sequential simulation is simple. For the general case consider simulating a (n × 1) random vector Y with known distribution F(y₁, · · · , yₙ) = Pr(Y₁ ≤ y₁, · · · , Yₙ ≤ yₙ). The joint cdf can be decomposed into conditional distributions,

F(y₁, · · · , yₙ) = F(yₙ|y₁, · · · , yₙ₋₁) × F(yₙ₋₁|y₁, · · · , yₙ₋₂) × · · · × F(y₂|y₁) × F(y₁).
Note that (7.1) corrects the unconditional simulation at S(s) by the residual
between the observed data and the values from the unconditional simulation.
Also, it is not necessary that S(s) is simulated with the same mean as Z(s).
Any mean will do, for example, E[S(s)] = 0. It is left as an exercise to show
that (7.1) has the needed properties. In particular,
(i) For s0 ∈ {s1 , · · · , sm }, Zc (s0 ) = Z(s0 ); the realization honors the data;
(ii) E[Zc (s)] = E[Z(s)], i.e., the conditional simulation is (unconditionally)
unbiased;
(iii) Cov[Zc (s), Zc (s + h)] = Cov[Z(s), Z(s + h)] ∀h.
The idea of a conditional simulation is to reproduce data where it is known
but not to smooth the data in-between. The kriging predictor is a best linear
unbiased predictor of the random variables in a spatial process that smoothes
in-between the observed data. A conditional simulation of a random field
will exhibit more variability between the observed points than the kriging
predictor. In fact, it is easy to show that
predictor. In fact, it is easy to show that

E[(Zc(s) − Z(s))²] = 2σ²sk,

where σ²sk is the simple kriging variance.
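A sketch of one standard construction consistent with (7.1): correct an unconditional simulation by the simple-kriging interpolation of the residuals at the observed sites. All settings below are hypothetical placeholders:

import numpy as np

rng = np.random.default_rng(5)
s = np.linspace(0, 10, 101)                         # prediction grid
obs = np.arange(0, 101, 10)                         # indices of observed sites
D = np.abs(s[:, None] - s[None, :])
Sigma = np.exp(-D / 2.0) + 1e-8 * np.eye(len(s))    # known covariance, mean zero
L = np.linalg.cholesky(Sigma)

Z = L @ rng.standard_normal(len(s))                 # "observed" truth (simulated here)
S = L @ rng.standard_normal(len(s))                 # unconditional simulation S(s)

# Simple-kriging weights from every site to the observed sites
W = Sigma[:, obs] @ np.linalg.inv(Sigma[np.ix_(obs, obs)])

# Correct S(s) by the kriged residual; the result honors the data
Zc = S + W @ (Z[obs] - S[obs])
assert np.allclose(Zc[obs], Z[obs])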
On occasion we may want to constrain the simulations more than having the proper covariance on average or honoring the data points. For example, we may want all realizations to have the same empirical covariance function as an observed process. Imagine that m observations have been collected and the empirical semivariogram γ̂(h₁), · · · , γ̂(hₖ) has been calculated for a set of k lag classes. A conditional realization of the random field is to be simulated that honors the data Z(s₁), · · · , Z(sₘ) and whose semivariogram agrees completely with the empirical semivariogram γ̂(h).
To place additional constraints on the realizations, the optimization method
of simulated annealing (SA) can be used. It is a heuristic method that is used in
operations research, for example, to find solutions in combinatorial optimiza-
tion when standard mathematical programming tools fail. The monograph
by Laarhoven and Aarts (1987) gives a wonderful account of SA, its history,
theory, and applications.
The name of the method reveals a metallurgic connection. If a molten metal
is cooled slowly, the molecules can move freely, attaining eventually a state
of the solid with low energy and little stress. If cooling occurs too quickly,
however, the system will not be able to find the low-energy state because the
movement of the molecules is obstructed. The solid will not reach thermal equilibrium at a given temperature; defects are frozen into the solid.
Despite the name, simulated annealing originated in statistical physics. The
probability density function f (a) of state a in a system with energy U (a) and
absolute temperature T > 0, when the system reaches thermal equilibrium, is
given by the Boltzmann distribution, f(a) ∝ exp{−U(a)/(κT)}, where κ is Boltzmann's constant. A simulation of such a system must therefore be able to accept a new configuration even if it increases the energy in the system, and not only configurations that lower the system's energy. The key is to allow these transitions to higher energy states with the appropriate frequency. In addition, the probability of accepting higher-energy states will gradually decrease in the process.
The connection of simulated annealing to optimization, and an illustration of why this procedure works, is given in the following example.
Example 7.1 Consider that we want to find the global maximum of the
function

f(x) = exp{−½U(x)} = exp{−½(x − μ)²}.
Obviously, the function is maximized for x = µ when the energy function U (x)
takes on its minimum. In order to find the maximum of f (x) by simulated
annealing we sample from a system with probability distribution gT∗ (x) ∝
f (x)1/T , where T denotes the absolute temperature of the system. For exam-
ple, we can sample from
gT(x) = (2πT)^{−1/2} f(x)^{1/T} = (2πT)^{−1/2} exp{−½(x − μ)²/T}.

Note that gT(x) is the Gaussian density function with mean μ and variance
T . The process commences by drawing a single sample from gT (x) at high
temperatures and samples are drawn continuously while the temperature is
dropped. As T → 0, x → µ with probability one. From this simple example
we can see that the final realization will not be identical to the abscissa where
f (x) has its maximum, but should be close to it.
Denote by Su (s) the realization of the random field at stage u of the iterative
process; S0 (s) is the initial image. The most common rule of perturbing the
image is to randomly select two locations si and sj and to swap (exchange)
their values. If the simulation is conditional such that the eventual image
honors an observed set of data Z(s1 ), · · · , Z(sm ), we choose
S₀(s) = [Z(s₁), Z(s₂), · · · , Z(sₘ), S₀(sₘ₊₁), · · · , S₀(sₘ₊ₖ)]′,
and perturb only the values at the unobserved locations.
After swapping values we decide whether to accept the new image. Sᵤ₊₁(s) is accepted if Uᵤ₊₁ < Uᵤ. Otherwise the new image is accepted with probability fᵤ₊₁/fᵤ, where

fᵤ ∝ exp{−Uᵤ/Tᵤ}.
Chilès and Delfiner (1999, p. 566) point out that when the temperature Tᵤ is reduced according to the cooling schedule Tᵤ = c/log{u + 1}, the process converges to a global minimum energy state. Here, c is chosen as the minimum increase in energy that moves the process out of a local energy minimum (one that is not the global minimum) into a state of lower energy. In practice, the cooling schedule is usually governed by simpler functions such as Tᵤ = T₀λᵘ, λ < 1. The important issues here are the magnitude of the temperature reduction (λ) and the frequency with which the temperature is reduced. It
is by no means necessary or prudent to reduce the temperature every time a
new image is accepted. At each temperature, the system should be allowed
to achieve thermal equilibrium. Goovaerts (1997, p. 420) discusses adjusting
the temperature after a sufficient number of perturbations have been accepted
(2n—10n) or too many have been tried unsuccessfully (10n—100n); n denotes
the number of sites on the image.
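A generic sketch of the annealing loop under a geometric cooling schedule, using a simple additive perturbation instead of the pairwise swaps described above; the energy function and tuning constants are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(9)

def U(x):
    # User-defined energy (discrepancy) function; minimized at x = 3
    return (x - 3.0) ** 2

x, T, lam = -10.0, 10.0, 0.95          # initial state, temperature, cooling factor
for sweep in range(200):
    for _ in range(50):                # allow near-equilibrium at each temperature
        prop = x + rng.normal()
        dU = U(prop) - U(x)
        # Accept downhill moves; accept uphill moves with probability exp(-dU/T)
        if dU < 0 or rng.uniform() < np.exp(-dU / T):
            x = prop
    T *= lam                           # geometric cooling, T_u = T_0 * lam^u
print(x)                               # close to, but not exactly, the minimizer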
Simulated annealing is a heuristic, elegant, intuitive, brute-force method.
It starts with an initial image of n atoms and tries to improve the image by
rearranging the atoms in pairs. Improvement is measured by a user-defined
discrepancy function, the energy function U . Some care should be exercised
when simulating spatial random fields by this method.
• If several simulations are desired one should not start them from the same
initial image S0 (s), but from different initial images. Otherwise the real-
izations are too similar to each other.
Since the kernel function governs the covariance function of the Z(s) pro-
cess, an intuitive approach to generating random fields is as follows. Generate
a dense field of the X(s), choose a kernel function K(u), and convolve the
two. This ensures that we understand (at least) the first and second moment
structure of the generated process. Unless the marginal distribution of X(s)
has some reproductive property, however, the distribution of Z(s) is difficult
to determine by this device. A notable exception is the case where X(s) is
Gaussian; Z(s) then will also be Gaussian.
If it is required that the marginal distribution of Z(s) is characterized by
a certain mean-variance relationship or has other moment properties, then
these can be constructed by matching moments of the desired random field
with the excitation process. We illustrate with an example.
Table 7.1 Average sample mean and sample variance of simulated convolutions whose semivariograms are shown in Figure 7.1.

Bandwidth h    Z̄       s²
5              4.98     4.29
10             5.30     4.67
15             5.23     4.02
20             5.07     4.72
[Figure 7.1 near here: average empirical semivariograms (0–4) for bandwidths h = 5, 10, 15, 20]
Figure 7.1 Average semivariograms for Negative Binomial excitation fields convolved
with Gaussian kernel functions of different bandwidth h (=standard deviation).
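A minimal sketch of the convolution device itself, using a Gaussian excitation field and a Gaussian kernel (the example above instead uses a Negative Binomial excitation field; the bandwidth and the row-normalization are arbitrary choices):

import numpy as np

rng = np.random.default_rng(13)
t = np.arange(0.0, 100.0, 0.5)               # dense excitation grid
h = 5.0                                      # kernel bandwidth (std. deviation)

X = rng.standard_normal(len(t))              # Gaussian excitation field X(s)

# Gaussian kernel weights between output sites and excitation sites;
# row-normalization is one convention, chosen here so that E[Z] = E[X]
K = np.exp(-0.5 * ((t[:, None] - t[None, :]) / h) ** 2)
K /= K.sum(axis=1, keepdims=True)

Z = K @ X                                    # convolved process Z(s)
print(Z.mean(), Z.var())                     # smoother, less variable than X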
Because of the importance of Monte Carlo based testing for mapped point
patterns, efficient algorithms to simulate spatial point processes are critical
to allow a sufficient number of simulation runs in a reasonable amount of
computing time. Since the CSR process is the benchmark for initial analysis
of an observed point pattern, we need to be able to simulate a homogeneous
Poisson process with intensity λ. A simple method relies on the fact that
if N(A) ~ Poisson(λ|A|), then, given N(A) = n, the n events form a random sample from a uniform distribution on A (a Binomial process, §3.2).
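A minimal sketch of this two-step construction for a homogeneous Poisson process on a square region; intensity and region size are arbitrary choices:

import numpy as np

rng = np.random.default_rng(17)
lam, a = 50.0, 1.0                     # intensity and side of the square region A
n = rng.poisson(lam * a * a)           # N(A) ~ Poisson(lambda * |A|)
xy = rng.uniform(0.0, a, size=(n, 2))  # given N(A) = n, events are uniform on A
print(n, xy[:3])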
Problem 7.6 Verify formulas (7.8) and (7.9), that is, show that a convolution
where X(s) ∼ Beta(α, β) satisfies E[Z(s)] = π, Var[Z(s)] = π(1 − π).
CHAPTER 8

Non-Stationary Covariance
If a random process does not have the needed attributes for statistical infer-
ence, it is common to employ a transformation of the process that leads to
the desired properties. Lognormal kriging (§5.6.1) or modeling a geometrically
anisotropic covariance structure (§4.3.7) are two instances where transforma-
tions in spatial statistics are routinely employed. An important difference be-
tween the two types of transformations is whether they transform the response
variable (lognormal kriging) or the coordinate system (anisotropic modeling).
Recall from §4.3.7 that if iso-correlation contours are elliptical, a linear trans-
formation s∗ = f (s) = As achieves the rotation and scaling of the coordinate
system so that covariances based on the s* coordinates are isotropic. If g(·) is a variogram, then

Var[Z(sᵢ) − Z(sⱼ)] = g(||f(sᵢ) − f(sⱼ)||),     (8.2)

and ĝ(||f(sᵢ) − f(sⱼ)||) is its estimate, assuming f is known.
This general idea can be applied to spatial processes with non-stationary
covariance function. The result is a method known as “space deformation:”
• find a function f that transforms the space in such a way that the covariance
structure in the transformed space is stationary and isotropic;
• find a function g such that Var[Z(si ) − Z(sj )] = g(||f (si ) − f (sj )||);
• if f is unknown, then a natural estimator is

V̂ar[Z(sᵢ) − Z(sⱼ)] = ĝ(||f̂(sᵢ) − f̂(sⱼ)||).
As points are included and excluded in the neighborhoods with changing prediction location, spurious discontinuities can be introduced. On the upside, local
kriging is computationally less involved than solving the kriging equations
based on all n data points for every prediction location. Also, if the mean
is non-stationary, it may be reasonable to assume that the mean is constant
within the kriging window and to re-estimate µ based on the observations in
the neighborhood.
Whether the mean is estimated globally or locally, the spatial covariation
in local kriging is based on the same global model. Assume that the covari-
ances are determined based on some covariance or semivariogram model with
parameter vector θ. Local kriging can then be expressed as
p⁽ⁱ⁾ok(Z; s₀) = μ̂ + c⁽ⁱ⁾(θ̂)′Σ⁽ⁱ⁾(θ̂)⁻¹(Z(s)⁽ⁱ⁾ − 1μ̂),

where Z(s)⁽ⁱ⁾ denotes the subset of points in the kriging neighborhood, Σ⁽ⁱ⁾ = Var[Z(s)⁽ⁱ⁾], and c⁽ⁱ⁾ = Cov[Z(s₀), Z(s)⁽ⁱ⁾]. All n data points contribute to the estimation of θ in local kriging.
The moving window approach of Haas (1990, 1995) generalizes this idea
by re-estimating the semivariogram or covariance function locally within a
circular neighborhood (window). A prediction is made at the center of the
window using the local estimate of the semivariogram parameters,
p⁽ⁱ⁾ok(Z; s₀) = μ̂ + c⁽ⁱ⁾(θ̂⁽ⁱ⁾)′Σ⁽ⁱ⁾(θ̂⁽ⁱ⁾)⁻¹(Z(s)⁽ⁱ⁾ − 1μ̂).
The neighborhood for local kriging could conceivably be different from the
neighborhood used to derive the semivariogram parameters θ (i) , but the neigh-
borhoods are usually the same. Choosing the window size must balance the
need for a sufficient number of pairs to estimate the semivariogram parameters
reliably (large window size), and the desire to make the window small so that
a stationarity assumption within the window is tenable. Haas (1990) describes
a heuristic approach to determine the size of the local neighborhood: enlarge
a circle around the prediction site until at least 35 sites are included, then
include five sites at a time until there is at least one pair of sites at each lag
class and the nonlinear least squares fit of the local semivariogram converges.
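A minimal sketch of local ordinary kriging with a fixed, hypothetical exponential covariance model and a locally re-estimated constant mean; data, radius, and parameter values are placeholders, and θ is treated as known rather than estimated within the window:

import numpy as np

rng = np.random.default_rng(19)
s = rng.uniform(0, 10, size=(200, 2))        # data locations (placeholders)
Z = rng.normal(size=200)                     # data values (placeholders)

def cov(d, sill=1.0, rnge=2.0):
    return sill * np.exp(-d / rnge)          # exponential covariance model

def local_ok(s0, radius=2.0):
    d0 = np.linalg.norm(s - s0, axis=1)
    idx = np.where(d0 <= radius)[0]          # kriging neighborhood
    Zi = Z[idx]
    D = np.linalg.norm(s[idx, None, :] - s[None, idx, :], axis=2)
    Sinv = np.linalg.inv(cov(D))
    c, one = cov(d0[idx]), np.ones(len(idx))
    mu = (one @ Sinv @ Zi) / (one @ Sinv @ one)   # local GLS estimate of mu
    return mu + c @ Sinv @ (Zi - mu * one)        # local kriging predictor

print(local_ok(np.array([5.0, 5.0])))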
Let X(s) denote a white-noise excitation field with mean μx = 0 and E[X(u)X(v)] = φx if u = v and 0 otherwise. A random field Z(s) can be constructed by convolving X(s) with a kernel function Ks(u), centered at s. The random field

Z(s) = ∫ Ks(u)X(u) du

has mean E[Z(s)] = μx ∫ Ks(u) du = 0 and covariance function

Cov[Z(s), Z(s + h)] = E[Z(s)Z(s + h)]
   = ∫∫ Ks(u)Ks₊h(v)E[X(u)X(v)] du dv
   = φx ∫ Ks(v)Ks₊h(v) dv.
Since the covariance function depends on the choice of the kernel function, a
non-stationary covariance function can be constructed by varying the kernel
spatially. Following Higdon, Swall, and Kern (1999), consider the following
progression for a process in R2 and points u = [ux , uy ], s = [sx , sy ].
The local processes are uncorrelated, Cov[Zᵢ(s), Zⱼ(s)] = 0, i ≠ j, and have covariance functions Cov[Zᵢ(s), Zᵢ(s + h)] = C(h, θᵢ). The resulting, non-stationary covariance function of the observed process is

Cov[Z(s), Z(s + h)] = Cov[Σᵢ₌₁ᵏ Zᵢ(s)wᵢ(s), Σᵢ₌₁ᵏ Zᵢ(s + h)wᵢ(s + h)]
   = Σᵢ₌₁ᵏ Σⱼ₌₁ᵏ Cov[Zᵢ(s), Zⱼ(s + h)]wᵢ(s)wⱼ(s + h)
   = Σᵢ₌₁ᵏ C(h, θᵢ)wᵢ(s)wᵢ(s + h).

To see how mixing locally stationary processes leads to a model with non-stationary covariance, we demonstrate the Fuentes model with the following, simplified example.
[Figure 8.1 near here: Z(t) process (−1.0 to 1.0) with the associated kernel weights (0.00 to 0.50) overlaid]

Figure 8.1 A single realization of a Fuentes process in R¹ and the associated Gaussian kernel functions.
Because the degree of spatial dependence varies between the segments, the covariance function of the weighted process

Z(t) = Σᵢ₌₁⁴ Zᵢ(t)wᵢ(t)

is non-stationary (Figure 8.2). The variances are not constant, and the correlations between locations decrease more slowly for data points in the first and last segments and more quickly in the intermediate segments.
Figure 8.2 Covariance function of the Fuentes process shown in Figure 8.1.
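A minimal sketch of this weighted-mixture construction in R¹: four independent, locally stationary processes with different correlation ranges are combined with Gaussian kernel weights, and the implied non-stationary covariance matrix is assembled from C(h, θᵢ)wᵢ(s)wᵢ(s + h). Segment centers, ranges, and bandwidth are arbitrary choices:

import numpy as np

rng = np.random.default_rng(23)
t = np.linspace(0, 1, 200)
centers = np.array([0.125, 0.375, 0.625, 0.875])   # one kernel per segment
ranges = np.array([0.30, 0.05, 0.05, 0.30])        # local correlation ranges

w = np.exp(-0.5 * ((t[:, None] - centers[None, :]) / 0.1) ** 2)  # kernel weights

D = np.abs(t[:, None] - t[None, :])
Z = np.zeros(len(t))
C = np.zeros((len(t), len(t)))                     # covariance of the mixture
for i in range(4):
    Ci = np.exp(-D / ranges[i])                    # C(h, theta_i)
    L = np.linalg.cholesky(Ci + 1e-8 * np.eye(len(t)))
    Z += w[:, i] * (L @ rng.standard_normal(len(t)))
    C += np.outer(w[:, i], w[:, i]) * Ci           # C(h, theta_i) w_i(s) w_i(s+h)

# Non-stationarity: the variance (diagonal of C) changes with location
print(C.diagonal().min(), C.diagonal().max())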
CHAPTER 9
Spatio-Temporal Processes
Space and time are not directly comparable: space has no past, present, and future, and spatial coordinate units are not comparable to temporal units. If certain assumptions are met, it is acceptable to model spatio-temporal data with separable covariance structures. This is not the same as modeling spatio-temporal data as “3-D” data. A spatio-temporal process with a spatial component in R² may be separable, but it is not a process in R³. Assume we would treat the spatio-temporal data as the realization of a process in R³ and denote the coordinates as sᵢ = [xᵢ, yᵢ, tᵢ]′. Let sᵢ − sⱼ = [hᵢⱼ, tᵢ − tⱼ]. An exponential correlation model results in

Corr[Z(sᵢ), Z(sⱼ)] = exp{−θ((xᵢ − xⱼ)² + (yᵢ − yⱼ)² + (tᵢ − tⱼ)²)¹ᐟ²}
   = exp{−θ(||hᵢⱼ||² + (tᵢ − tⱼ)²)¹ᐟ²}.     (9.2)
We noted in §4.3 (page 141) that by Bochner’s theorem valid covariance func-
tions have a spectral representation. For a process in Rd we can write
C(h) = ∫ · · · ∫ exp{iω′h}s(ω) dω,
which suggests the following method for constructing a valid covariance func-
tion: determine a valid spectral density and take its inverse Fourier transform.
To acknowledge the physical difference between time and space, the spectral
representation of a spatio-temporal covariance that satisfies (9.5) is written as

C(h, k) = ∫ · · · ∫ exp{iω′h + iτk}s(ω, τ) dω dτ,     (9.10)
where s(ω, τ ) is the spectral density. One could proceed with the construction
of covariance functions as in the purely spatial case: determine a valid spatio-
temporal spectral density and integrate. This is essentially the method applied
by Cressie and Huang (1999), but these authors use two clever devices to avoid
the selection of a joint spatio-temporal spectral density.
First, because of (9.10), the covariance function and the spectral density are a Fourier transform pair. Integration of the spatial and temporal components can be separated in the frequency domain:

s(ω, τ) = (2π)^{−(d+1)} ∫ · · · ∫ exp{−iω′h − iτk}C(h, k) dh dk
   = (2π)⁻¹ ∫ exp{−iτk} [(2π)^{−d} ∫ · · · ∫ exp{−iω′h}C(h, k) dh] dk
   = (2π)⁻¹ ∫ exp{−iτk}h(ω, k) dk.     (9.11)
The function h(ω, k) is the (spatial) spectral density for temporal lag k. What
has been gained so far is that if we know the spectral density for a given lag
k, the spatio-temporal spectral density is obtained with a one-dimensional
Fourier transform. And it is presumably simpler to develop a model for h(ω, k)
than it is for s(ω, τ ). To get the covariance function from there still requires
complicated integration in (9.10).
The second device used by Cressie and Huang (1999) is to express h(ω, k) as the product of two simpler functions. They put h(ω, k) ≡ R(ω, k)r(ω), where R(ω, k) is a continuous correlation function and r(ω) > 0. If ∫R(ω, k) dk < ∞ and ∫r(ω) dω < ∞, then substitution into (9.11) gives

s(ω, τ) = r(ω) (2π)⁻¹ ∫ exp{−iτk}R(ω, k) dk

and

C(h, k) = ∫ · · · ∫ exp{iω′h}R(ω, k)r(ω) dω.     (9.12)
Cressie and Huang (1999) present numerous examples for functions R(ω, k) and r(ω) and the resulting spatio-temporal covariance functions. Gneiting (2002) establishes that some of the covariance functions in the paper by Cressie and Huang are not valid, because one of the correlation functions R(ω, k) used in their examples does not satisfy the needed conditions.
just as (9.17).
The set N (h, k) consists of the points that are within spatial distance h
and time lag k of each other; |N (h, k)| denotes the number of distinct pairs in
that set. When data are irregularly spaced in time and/or space, the empirical
semivariogram may need to be computed for lag classes. The lag tolerances in
the spatial and temporal dimensions need to be chosen differently, of course,
to accommodate a sufficient number of point pairs at each spatio-temporal
lag.
Note that (9.18) is an estimator of the joint space-time dependency. It is
different from a conditional estimator of the spatial semivariogram at time t
which would be used in a two-stage method:
γ̂ₜ(h) = 1/(2|Nₜ(h)|) Σ_{Nₜ(h)} {Z(sᵢ, t) − Z(sⱼ, t)}².     (9.19)
where ms and mt are the number of spatial and temporal lag classes, respec-
tively. A fit of the conditional spatial semivariogram at time t minimizes
Σⱼ₌₁^{mₛ} |Nₜ(hⱼ)|/(2γ(hⱼ, t; θ)²) {γ̂(hⱼ, t) − γ(hⱼ, t; θ)}².
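A minimal sketch of the conditional estimator (9.19) for irregularly spaced data at a fixed time t, using distance-tolerance lag classes; locations, values, and the tolerance are placeholders:

import numpy as np

rng = np.random.default_rng(29)
n = 60
s = rng.uniform(0, 10, size=(n, 2))     # spatial locations (placeholders)
Zt = rng.normal(size=n)                 # observations Z(s_i, t) at a fixed t

def gamma_hat_t(h, tol=0.5):
    # Equation (9.19): average squared difference over pairs whose
    # spatial separation falls in the lag class h +/- tol
    d = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=2)
    upper = np.arange(n)[:, None] < np.arange(n)[None, :]
    i, j = np.where((np.abs(d - h) <= tol) & upper)
    return 0.5 * np.mean((Zt[i] - Zt[j]) ** 2) if len(i) else np.nan

print([round(gamma_hat_t(h), 3) for h in (1.0, 2.0, 4.0)])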
Recall from §3.4 that the first- and second-order intensities of a spatial point
process are defined as the limits
λ(s) = lim_{|ds|→0} E[N(ds)]/|ds|
λ₂(sᵢ, sⱼ) = lim_{|dsᵢ|,|dsⱼ|→0} E[N(dsᵢ)N(dsⱼ)]/(|dsᵢ||dsⱼ|),
where ds is an infinitesimal disk (ball) of area (volume) |ds|. To extend the
intensity measures to the spatio-temporal scenario, we define N (ds, dt) to
denote the number of events in an infinitesimal cylinder with base ds and
height dt (Dorai-Raj, 2001). (Note that Haas (1995) considered cylinders in
local prediction of spatio-temporal data.) The spatio-temporal intensity of the
process {Z(s, t) : s ∈ D(t), t ∈ T } is then defined as the average number of
events per unit volume as the cylinder is shrunk around the point (s, t):

λ(s, t) = lim_{|ds|,|dt|→0} E[N(ds, dt)]/(|ds||dt|).     (9.20)
This yields a benchmark against which observed spatio-temporal patterns are tested, in much the same way as
observed spatial point patterns are tested against CSR.
References
Genton, M.G., He, L., and Liu, X. (2001) Moments of skew-normal random vectors
and their quadratic forms. Statistics & Probability Letters, 51:319–325.
Gerrard, D.J. (1969) Competition quotient: a new measure of the competition affecting individual forest trees. Research Bulletin No. 20, Michigan Agricultural Experiment Station, Michigan State University.
Ghosh, M., Natarajan, K., Waller, L.A., and Kim, D. (1999) Hierarchical GLMs
for the analysis of spatial data: an application to disease mapping. Journal of
Statistical Planning and Inference, 75:305–318.
Gilks, W.R. (1996) Full conditional distributions. In Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. (eds.) Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC, Boca Raton, FL, 75–88.
Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. (1996) Introducing Markov chain Monte Carlo. In Gilks, W.R., Richardson, S., and Spiegelhalter, D.J. (eds.) Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC, Boca Raton, FL, 1–19.
Gilliland, D. and Schabenberger, O. (2001) Limits on pairwise association for equi-
correlated binary variables. Journal of Applied Statistical Science, 10:279–285.
Gneiting, T. (2002) Nonseparable, stationary covariance functions for space-time
data. Journal of the American Statistical Association, 97:590–600.
Godambe, V.P. (1960) An optimum property of regular maximum likelihood esti-
mation. Annals of Mathematical Statistics, 31:1208–1211.
Goldberger, A.S. (1962) Best linear unbiased prediction in the generalized linear
regression model. Journal of the American Statistical Association, 57:369–375.
Goldberger, A.S. (1991) A Course in Econometrics. Harvard University Press, Cam-
bridge, MA.
Goodnight, J.H. (1979) A tutorial on the sweep operator. The American Statistician,
33:149–158. (Also available as The Sweep Operator: Its Importance in Statistical
Computing, SAS Technical Report R-106, SAS Institute, Inc. Cary, NC).
Goodnight, J.H. and Hemmerle, W.J. (1979) A simplified algorithm for the W-
transformation in variance component estimation. Technometrics, 21:265–268.
Goovaerts, P. (1994) Comparative performance of indicator algorithms for modeling
conditional probability distribution functions. Mathematical Geology, 26:389–411.
Goovaerts, P. (1997) Geostatistics for Natural Resource Evaluation. Oxford Univer-
sity Press, Oxford, UK.
Gorsich, D.J. and Genton, M.G. (2000) Variogram model selection via nonparamet-
ric derivative estimation. Mathematical Geology, 32(3):249–270.
Gotway, C.A. and Cressie, N. (1993) Improved multivariate prediction under a gen-
eral linear model. Journal of Multivariate Analysis, 45:56–72.
Gotway, C.A. and Hegert, G.W. (1997) Incorporating spatial trends and anisotropy
in geostatistical mapping of soil properties. Soil Science Society of America Jour-
nal, 61:298–309.
Gotway, C.A. and Stroup, W.W. (1997) A generalized linear model approach to
spatial data analysis and prediction. Journal of Agricultural, Biological and En-
vironmental Statistics, 2:157–178.
Gotway, C.A. and Wolfinger, R.D. (2003) Spatial prediction of counts and rates.
Statistics in Medicine, 22:1415–1432.
Gotway, C.A. and Young, L.J. (2002) Combining incompatible spatial data. Journal
of the American Statistical Association, 97:632–648.
Graybill, F.A. (1983) Matrices With Applications in Statistics. 2nd ed. Wadsworth
International, Belmont, CA.
Greig-Smith, P. (1952) The use of random and contiguous quadrats in the study of
the structure of plant communities. Annals of Botany, 16:293–316.
Grondona, M.O. (1989) Estimation and design with correlated observations. Ph.D.
Dissertation, Iowa State University.
Grondona, M.O. and Cressie, N. (1995) Residuals based estimators of the covariogram.
Statistics, 26:209–218.
Haas, T.C. (1990) Lognormal and moving window methods of estimating acid de-
position. Journal of the American Statistical Association, 85:950–963.
Haas, T.C. (1995) Local prediction of a spatio-temporal process with an applica-
tion to wet sulfate deposition. Journal of the American Statistical Association,
90:1189–1199.
Haining, R. (1990) Spatial Data Analysis in the Social and Environmental Sciences.
Cambridge University Press, Cambridge.
Haining, R. (1994) Diagnostics for regression modeling in spatial econometrics. Jour-
nal of Regional Science, 34:325–341.
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A. (1986) Robust
Statistics: The Approach Based on Influence Functions. John Wiley & Sons, New
York.
Handcock, M.S. and Stein, M.L. (1993) A Bayesian analysis of kriging. Technometrics,
35:403–410.
Handcock, M.S. and Wallis, J.R. (1994) An approach to statistical spatial-temporal
modeling of meteorological fields (with discussion). Journal of the American Statistical
Association, 89:368–390.
Hanisch, K.-H. and Stoyan, D. (1979) Formulas for the second-order analysis of
marked point processes. Mathematische Operationsforschung und Statistik. Series
Statistics, 10:555–560.
Harville, D.A. (1974) Bayesian inference for variance components using only error
contrasts. Biometrika, 61:383–385.
Harville, D.A. (1977) Maximum-likelihood approaches to variance component esti-
mation and to related problems. Journal of the American Statistical Association,
72:320–340.
Harville, D.A. and Jeske, D.R. (1992) Mean squared error of estimation or prediction
under a general linear model. Journal of the American Statistical Association,
87:724–731.
Haslett, J. and Hayes, K. (1998) Residuals for the linear model with general covari-
ance structure. Journal of the Royal Statistical Society, Series B, 60:201–215.
Hastings, W.K. (1970) Monte Carlo sampling methods using Markov chains and
their applications. Biometrika, 57:97–109.
Hawkins, D.M. (1981) A cusum for a scale parameter. Journal of Quality Technology,
13:228–231.
Hawkins, D.M. and Cressie, N.A.C. (1984) Robust kriging—a proposal. Journal of
the International Association of Mathematical Geology, 16:3–18.
Heagerty, P.J. and Lele, S.R. (1998) A composite likelihood approach to binary
spatial data. Journal of the American Statistical Association, 93:1099–1111.
Henderson, C.R. (1950) The estimation of genetic parameters. The Annals of Math-
ematical Statistics, 21:309–310.
Heyde, C.C. (1997) Quasi-Likelihood and Its Applications. A General Approach to
Optimal Parameter Estimation. Springer-Verlag, New York.
Higdon, D. (1998) A process-convolution approach to modeling temperatures in the
North Atlantic Ocean. Environmental and Ecological Statistics, 5(2):173–190.
Higdon, D., Swall, J., and Kern, J. (1999) Non-stationary spatial modeling. Bayesian
Statistics, 6:761–768.
Hinkelmann, K. and Kempthorne, O. (1994) Design and Analysis of Experiments.
Volume I. Introduction to Experimental Design. John Wiley & Sons, New York.
Houseman, E.A., Ryan, L.M., and Coull, B.A. (2004) Cholesky residuals for assessing
normal errors in a linear model with correlated outcomes. Journal of the American
Statistical Association, 99:383–394.
Huang, J.S. and Kotz, S. (1984) Correlation structure in iterated Farlie-Gumbel-
Morgenstern distributions. Biometrika, 71:633–636.
Hughes-Oliver, J.M., Gonzalez-Farias, G., Lu, J.-C., and Chen, D. (1998) Parametric
nonstationary correlation models. Statistics & Probability Letters, 40:267–278.
Hughes-Oliver, J.M., Lu, J.-C., Davis, J.C., and Gyurcsik, R.S. (1998) Achieving
uniformity in a semiconductor fabrication process using spatial modeling. Journal
of the American Statistical Association, 93:36–45.
Hurvich, C.M. and Tsai, C.-L. (1989) Regression and time series model selection in
small samples. Biometrika, 76:297–307.
Isaaks, E.H. and Srivastava, R.M. (1989) An Introduction to Applied Geostatistics.
Oxford University Press, New York.
Jensen, D.R. and Ramirez, D.E. (1999) Recovered errors and normal diagnostics in
regression. Metrika, 49:107–119.
Johnson, M.E. (1987) Multivariate Statistical Simulation. John Wiley & Sons, New
York.
Johnson, N.L. and Kotz, S. (1972) Distributions in Statistics: Continuous Multivari-
ate Distributions. John Wiley & Sons, New York.
Johnson, N.L. and Kotz, S. (1975) On some generalized Farlie-Gumbel-Morgenstern
distributions. Communications in Statistics, 4:415–427.
Jones, R.H. (1993) Longitudinal Data With Serial Correlation: A State-space Ap-
proach. Chapman and Hall, New York.
Jones, R.H. and Zhang, Y. (1997) Models for continuous stationary space-time processes.
In Gregoire, T.G., Brillinger, D.R., Diggle, P.J., Russek-Cohen, E., Warren,
W.G., and Wolfinger, R.D. (eds.) Modeling Longitudinal and Spatially Correlated
Data. Springer-Verlag, New York, 289–298.
Journel, A.G. (1980) The lognormal approach to predicting local distributions of
selective mining unit grades. Journal of the International Association for Mathe-
matical Geology, 12:285–303.
Journel, A.G. (1983) Nonparametric estimation of spatial distributions. Journal of
the International Association for Mathematical Geology, 15:445–468.
Journel, A.G. and Huijbregts, C.J. (1978) Mining Geostatistics. Academic Press,
London.
Jowett, G.H. (1955a) The comparison of means of industrial time series. Applied
Statistics, 4:32–46.
Jowett, G.H. (1955b) The comparison of means of sets of observations from sections
of independent stochastic series. Journal of the Royal Statistical Society, Series B,
17:208–227.
Jowett, G.H. (1955c) Sampling properties of local statistics in stationary stochastic
series. Biometrika, 42:160–169.
Judge, G.G., Griffiths, W.E., Hill, R.C., Lütkepohl, H., and Lee, T.-C. (1985) The
Theory and Practice of Econometrics, John Wiley & Sons, New York.
Kackar, R.N. and Harville, D.A. (1984) Approximations for standard errors of estimators
of fixed and random effects in mixed linear models. Journal of the American
Statistical Association, 79:853–862.
Kaluzny, S.P., Vega, S.C., Cardoso, T.P., and Shelly, A.A. (1998) S+SpatialStats:
User's Manual for Windows® and UNIX®. Springer-Verlag, New York.
Kelsall, J.E. and Diggle, P.J. (1995) Non-parametric estimation of spatial variation
in relative risk. Statistics in Medicine, 14:2335–2342.
Kelsall, J.E. and Wakefield, J.C. (1999) Discussion of Best et al. 1999. In Bernardo,
J.M., Berger, J.O., Dawid, A.P., and Smith, A.F.M. (eds.) Bayesian Statistics 6,
Oxford University Press, Oxford, p. 151.
Kempthorne, O. (1955) The randomization theory of experimental inference. Journal
of the American Statistical Association, 50:946–967.
Kenward, M.G. and Roger, J.H. (1997) Small sample inference for fixed effects from
restricted maximum likelihood. Biometrics, 53:983–997.
Kern, J.C. and Higdon, D.M. (2000) A distance metric to account for edge effects
in spatial analysis. In Proceedings of the American Statistical Association, Bio-
metrics Section, Alexandria, VA, 49–52.
Kianifard, F. and Swallow, W.H. (1996) A review of the development and applica-
tion of recursive residuals in linear models. Journal of the American Statistical
Association, 91:391–400.
Kim, H. and Mallick, B.K. (2002) Analyzing spatial data using skew-Gaussian processes.
In Lawson, A.B. and Denison, D.G.T. (eds.) Spatial Cluster Modeling,
Chapman & Hall/CRC, Boca Raton, FL, 163–173.
Kitanidis, P.K. (1983) Statistical estimation of polynomial generalized covariance
functions and hydrological applications. Water Resources Research, 19:909–921.
Kitanidis, P.K. (1986) Parameter uncertainty in estimation of spatial functions:
Bayesian analysis. Water Resources Research, 22:499–507.
Kitanidis, P.K. and Lane, R.W. (1985) Maximum likelihood parameter estimation
of hydrological spatial processes by the Gauss-Newton method. Journal of Hy-
drology, 79:53–71.
Kitanidis, P.K. and Vomvoris, E.G. (1983) A geostatistical approach to the inverse
problem in groundwater modeling (steady state) and one-dimensional simulations.
Water Resources Research, 19:677–690.
Knox, G. (1964) Epidemiology of childhood leukemia in Northumberland and
Durham. British Journal of Preventative and Social Medicine, 18:17–24.
Krige, D.G. (1951) A statistical approach to some basic mine valuation problems
on the Witwatersrand. Journal of the Chemical, Metallurgical and Mining Society
of South Africa, 52:119–139.
Krivoruchko, K. and Gribov, A. (2004) Geostatistical interpolation in the presence
of barriers. In: geoENV IV - Geostatistics for Environmental Applications: Pro-
ceedings of the Fourth European Conference on Geostatistics for Environmental
Applications 2002 (Quantitative Geology and Geostatistics), 331–342.
Kulldorff, M. (1997) A spatial scan statistic. Communications in Statistics-Theory
and Methods, 26:1487–1496.
Kulldorff, M. and International Management Services, Inc. (2003) SaTScan v. 4.0:
Software for the spatial and space-time scan statistics. National Cancer Institute,
Bethesda, MD.
Kulldorff, M. and Nagarwalla, N. (1995) Spatial disease clusters: detection and in-
ference. Statistics in Medicine, 14:799–810.
Kupper, L.L. and Haseman, J.K. (1978) The use of a correlated binomial model for
the analysis of certain toxicological experiments. Biometrics, 34:69–76.
Neyman, J. and Scott, E.L. (1972) Processes of clustering and applications. In
P.A.W. Lewis (ed.) Stochastic Point Processes. John Wiley & Sons, New York,
646–681.
Ogata, Y. (1999) Seismicity analysis through point-process modeling: a review. Pure
and Applied Geophysics, 155:471–507.
O’Hagan, A. (1994) Bayesian Inference. Kendall’s Advanced Theory of Statistics,
2B, Edward Arnold Publishers, London.
Olea, R. A. (ed.) (1991) Geostatistical Glossary and Multilingual Dictionary. Oxford
University Press, New York.
Olea, R. A. (1999) Geostatistics for Engineers and Earth Scientists. Kluwer Aca-
demic Publishers, Norwell, Massachusetts.
Openshaw, S. (1984) The Modifiable Areal Unit Problem. Geobooks, Norwich, Eng-
land.
Openshaw, S. and Taylor, P. (1979) A million or so correlation coefficients. In N.
Wrigley (ed.), Statistical Methods in the Spatial Sciences. Pion, London, 127–144.
Ord, K. (1975) Estimation methods for models of spatial interaction. Journal of the
American Statistical Association, 70:120–126.
Ord, K. (1990) Discussion of “Spatial Clustering for Inhomogeneous Populations”
by J. Cuzick and R. Edwards. Journal of the Royal Statistical Society, Series B,
52:97.
Pagano, M. (1971) Some asymptotic properties of a two-dimensional periodogram.
Journal of Applied Probability, 8:841–847.
Papadakis, J.S. (1937) Méthode statistique pour des expériences sur champ. Bull.
Inst. Amelior. Plant. Thessalonique, 23.
Patterson, H.D. and Thompson, R. (1971) Recovery of inter-block information when
block sizes are unequal. Biometrika, 58:545–554.
Percival, D.B. and Walden, A.T. (1993) Spectral Analysis for Physical Applications:
Multitaper and Conventional Univariate Techniques. Cambridge University Press,
Cambridge, UK.
Plackett, R.L. (1965) A class of bivariate distributions. Journal of the American
Statistical Association, 60:516–522.
Posa, D. (1993) A simple description of spatial-temporal processes. Computational
Statistics & Data Analysis, 15:425–437.
Prasad, N.G.N. and Rao, J.N.K. (1990) The estimation of the mean squared error
of small-area estimators. Journal of the American Statistical Association, 85:161–
171.
Prentice, R.L. (1988) Correlated binary regression with covariates specific to each
binary observation. Biometrics, 44:1033–1048.
Priestley, M.B. (1981) Spectral Analysis and Time Series. Volume 1: Univariate Series.
Academic Press, New York.
Rathbun, S.L. (1996) Asymptotic properties of the maximum likelihood estimator
for spatio-temporal point processes. Journal of Statistical Planning and Inference,
51:55–74.
Rathbun, S.L. (1998) Kriging estuaries. Environmetrics, 9:109–129.
Rathbun, S.L. and Cressie, N.A.C. (1994) A space-time survival point process for
a longleaf pine forest in Southern Georgia. Journal of the American Statistical
Association, 89:1164–1174.
Rendu, J.M. (1979) Normal and lognormal estimation. Journal of the International
Association for Mathematical Geology, 11:407–422.
Renshaw, E. and Ford, E.D. (1983) The interpretation of process from pattern using
two-dimensional spectral analysis: methods and problems of interpretation. Applied
Statistics, 32:51–63.
Smith, A.F.M. and Gelfand, A.E. (1992) Bayesian statistics without tears: A
sampling-resampling perspective. The American Statistician, 46:84–88.
Smith, A.F.M. and Roberts, G.O. (1993) Bayesian computation via the Gibbs sam-
pler and related Markov chain Monte Carlo methods. Journal of the Royal Sta-
tistical Society, Series B, 55:3–24.
Solie, J.B., Raun, W.R., and Stone, M.L. (1999) Submeter spatial variability of
selected soil and bermudagrass production variables. Soil Science Society of America
Journal, 63:1724–1733.
Stein, M.L. (1999) Interpolation of Spatial Data. Some Theory of Kriging. Springer-
Verlag, New York.
Stern, H. and Cressie, N. (1999) Inference for extremes in disease mapping. In A.
Lawson et al. (eds.) Disease Mapping and Risk Assessment for Public Health,
John Wiley & Sons, Chichester, 63–84.
Stoyan, D., Kendall, W.S. and Mecke, J. (1995) Stochastic Geometry and its Appli-
cations. 2nd ed. John Wiley & Sons, New York.
Stroup, D.F. (1990) Discussion of “Spatial Clustering for Inhomogeneous Popula-
tions” by J. Cuzick and R. Edwards. Journal of the Royal Statistical Society,
Series B, 52:99.
Stroup, W.W., Baenziger, P.S., and Mulitze, D.K. (1994) Removing spatial variation
from wheat yield trials: a comparison of methods. Crop Science, 86:62–66.
Stuart, A. and Ord, J.K. (1994) Kendall’s Advanced Theory of Statistics, Volume
I: Distribution Theory. Edward Arnold, London.
Sun, D., Tsutakawa, R.K., and Speckman, P.L. (1999) Posterior distribution of
hierarchical models using CAR(1) distributions. Biometrika, 86:341–350.
Switzer, P. (1977) Estimation of spatial distributions from point sources with ap-
plication to air pollution measurement. Bulletin of the International Statistical
Institute, 47:123–137.
Szidarovszky, F., Baafi, E.Y., and Kim, Y.C. (1987) Kriging without negative
weights. Mathematical Geology, 19:549–559.
Tanner, M.A. (1993) Tools for Statistical Inference. 2nd Edition, Springer-Verlag,
New York.
Tanner, M.A. and Wong, W.H. (1987) The calculation of posterior distributions by
data augmentation. Journal of the American Statistical Association, 82:528–540.
Theil, H. (1971) Principles of Econometrics. John Wiley & Sons, New York.
Thiébaux, H.J. and Pedder, M.A. (1987) Spatial Objective Analysis with Applications
in Atmospheric Science. Academic Press, London.
Thompson, H.R. (1955) Spatial point processes with applications to ecology.
Biometrika, 42:102–115.
Thompson, H.R. (1958) The statistical study of plant distribution patterns using a
grid of quadrats. Australian Journal of Botany, 6:322–342.
Tierney, L. (1994) Markov chains for exploring posterior distributions (with discus-
sion). Annals of Statistics, 22:1701–1786.
Tobler, W. (1970) A computer movie simulating urban growth in the Detroit region.
Economic Geography, 46:234–240.
Toutenburg, H. (1982) Prior Information in Linear Models. John Wiley & Sons, New
York.
Turnbull, B.W., Iwano, E.J., Burnett, W.S., Howe, H.L., and Clark, L.C. (1990)
Monitoring for clusters of disease: Application to leukemia incidence in upstate
New York. American Journal of Epidemiology, 132:S136–S143.
Upton, G.J.G. and Fingleton, B. (1985) Spatial Data Analysis by Example, Vol. 1:
Point Pattern and Quantitative Data. John Wiley & Sons, New York.
Valliant, R. (1985) Nonlinear prediction theory and the estimation of proportions in
a finite population. Journal of the American Statistical Association, 80:631–641.
Vanmarcke, E. (1983) Random Fields: Analysis and Synthesis. MIT Press, Cam-
bridge, MA.
Verly, G. (1983) The Multi-Gaussian approach and its applications to the estima-
tion of local reserves. Journal of the International Association for Mathematical
Geology, 15:259–286.
Vijapurkar, U.P. and Gotway, C.A. (2001) Assessment of forecasts and forecast
uncertainty using generalized linear models for time series count data. Journal of
Statistical Computation and Simulation, 68:321–349.
Waller, L.A. and Gotway, C.A. (2004) Applied Spatial Statistics for Public Health
Data. John Wiley & Sons, New York.
Walters, J.R. (1990) Red-cockaded woodpeckers: a ’primitive’ cooperative breeder.
In: Cooperative Breeding in Birds: Long-term Studies of Ecology and Behaviour.
Cambridge University Press, Cambridge, 67–101.
Wand, M.P. and Jones, M.C. (1995) Kernel Smoothing. Chapman and Hall/CRC
Press, Boca Raton, FL.
Watson, G.S. (1964) Smooth regression analysis. Sankhya (A), 26:359–372.
Webster, R. and Oliver, M.A. (2001) Geostatistics for Environmental Scientists.
John Wiley & Sons, Chichester.
Wedderburn, R.W.M. (1974) Quasi-likelihood functions, generalized linear models
and the Gauss-Newton method. Biometrika, 61:439–447.
Whittaker, E.T. and Watson, G.N. (1927) A Course of Modern Analysis. 4th ed.,
Cambridge University Press, Cambridge, UK.
Whittle, P. (1954) On stationary processes in the plane. Biometrika, 41:434–449.
Wolfinger, R.D. (1993) Laplace’s approximation for nonlinear mixed models.
Biometrika, 80:791–795.
Wolfinger, R.D. and O’Connell, M. (1993) Generalized linear mixed models: a
pseudo-likelihood approach. Journal of Statistical Computing and Simulation,
48:233–243.
Wolfinger, R., Tobias, R., and Sall, J. (1994) Computing Gaussian likelihoods and
their derivatives for general linear mixed models. SIAM Journal on Scientific and
Statistical Computing, 15:1294–1310.
Wong, D.W.S. (1996) Aggregation effects in geo-referenced data. In D. Griffith
(ed.), Advanced Spatial Statistics. CRC Press, Boca Raton, Florida, 83–106.
Yaglom, A. (1987) Correlation Theory of Stationary and Related Random Functions
I. Springer-Verlag, New York.
Yule, G. U. and Kendall, M. G. (1950) An Introduction to the Theory of Statistics.
14th Edition, Griffin, London.
Zahl, S. (1977) A comparison of three methods for the analysis of spatial pattern.
Biometrics, 33:681–692.
Zeger, S.L. (1988) A regression model for time series of counts. Biometrika, 75:621–
629.
Zeger, S.L. and Liang, K.-Y. (1986) Longitudinal data analysis for discrete and
continuous outcomes. Biometrics, 42:121–130.
Zeger, S.L., Liang, K.-Y., and Albert, P.S. (1988) Models for longitudinal data: a
generalized estimating equation approach. Biometrics, 44:1049–1060.
Zellner, A. (1986) Bayesian estimation and prediction using asymmetric loss func-
tions. Journal of the American Statistical Association, 81:446–451.
Zhang, H. (2004) Inconsistent estimation and asymptotically equal interpolators
in model-based geostatistics. Journal of the American Statistical Association,
99:250–261.
Zhao, L.P. and Prentice, R.L. (1990) Correlated binary regression using a quadratic
exponential model. Biometrika, 77:642–648.
Zimmerman, D.L. (1989) Computationally efficient restricted maximum likelihood
estimation of generalized covariance functions. Mathematical Geology, 21:655–
672.
Zimmerman, D.L. and Cressie, N.A. (1992) Mean squared prediction error in the
spatial linear model with estimated covariance parameters. Annals of the Institute
of Statistical Mathematics, 32:1–15.
Zimmerman, D.L. and Harville, D.A. (1991) A random field approach to the analysis
of field-plot experiments and other spatial experiments. Biometrics, 47:223–239.
Zimmerman, D.L. and Zimmerman, M.B. (1991) A comparison of spatial semivari-
ogram estimators and corresponding kriging predictors. Technometrics, 33:77–91.
Subject Index
  prior distribution, 383, 385, 393, 395–397
  prior sample, 385
Bayesian kriging, 391
Bayesian linear regression, 391
Bayesian transformed Gaussian model, 397
Bernoulli
  experiment, 83, 416
  process, 83
  random variable, 416
Bessel function, 210–211
  I, 211
  J, 141, 147, 180, 210
  K, 143, 210
  modified, 210
Best linear unbiased estimator, 223, 299
Best linear unbiased prediction
  of residuals, 224
  with known mean, 223
  with unknown but constant mean, 226
Best linear unbiased predictor, 33, 134, 220, 222, 242, 299
  filtered measurement error, 332
  of random effects (mixed model), 326
Binary classification, 17
Binomial
  distribution, 352, 353
  experiment, 83
  multivariate, 356
  point process, 83
  power mixture, 439
  probability generating function, 439
  process, 83, 86
  random variable, 83
Binomial sampling, see Sampling
Birth-death process, 442
Bishop move, 19, 336
Black-Black join count, 20
Black-White join count, 19
Block indicator values, 290
Block kriging, 285–289
  in terms of semivariogram, 287
  predictor, 286
  variance, 286
  weights, 286
  with indicator data, 290
Blocking, 300, 325
Blood lead levels in children, Virginia 2000, 370
  as example of lattice data, 10
  building blocks for models, 371
  geographically weighted Poisson regression, 381
  GLM results, 372
  GLMs and GLMMs, 370–382
  introduction, 10
  maps of predictions, 380
  predicted value comparison, 378, 379
  pseudo-likelihood, 375
  semivariogram of std. Pearson residuals, 374
  spatial GLMs and GLMMs results, 376
  spatial GLMs with CAR structure, 378
  spatial models, 374
  variable definitions, 371
BLUP, see Best linear unbiased predictor
Bochner's theorem, 72, 141, 436
Boltzmann distribution, 410
Box-Cox transformation, 271
Breakdown point, 162
BTG, see Bayesian transformed Gaussian model
C/N ratio data
  comparison of covariance parameter estimates, 176
  filtered kriging, 252
  introduction, 154
  ordinary kriging (REML and OLS), 246
  semivariogram estimation, 176
  semivariogram estimation (parametric kernel), 186
  spatial configuration, 156
Candidate generating density, 389
Canonical link, 354
  inverse, 354
Canonical parameter, 354
CAR, see Spatial autoregressive model, conditional
Cardinal sine covariance model, 148
Cauchy distribution, 293
Cauchy-Schwarz inequality, 44
Cause-and-effect, 301
Replication, 42, 55, 133, 140, 405, 421
Residual maximum likelihood, see Restricted maximum likelihood
Residual recursion, 309
Residual semivariogram (GLS), 349
Residual semivariogram (OLS), 308, 311, 349, 350
Residual sum of squares, 303, 305, 393
Residual variance, 303
Residuals
  (E)GLS, 257, 348
  and autocorrelation, 235, 307
  and pseudo-data (GLS), 362
  GLS, 225, 260, 348–352
  OLS, 225, 257, 303, 304
  PRESS, 307
  standardized (OLS), 305
  studentized (GLS), 362
  studentized (OLS), 305
  trend surface, 235
Restricted maximum likelihood, 163, 168, 422
  and error contrasts, 168, 262
  estimator of µ, 168
  estimator of scale (σ²), 263
  estimator of variance components, 328
  in linear mixed model, 326
  objective function, 263
  spatially varying mean, 261–263
  spatio-temporal, 440
REV, 35
Robust semivariogram estimator, 160
Rook move, 19, 238, 336
RSV, see Relative structured variability
Sample
  autocovariance function, 191
  covariance function, 192, 193
  dependent, 387
  mean, 31, 34, 191, 231
  size, 83
  size (effective), 32, 34
Sampling
  binomial, 20
  by rejection, 386
  from stationary distribution, 387
  Gibbs, 387, 389, 398
  hypergeometric, 20
  importance, 386, 394
  in MCMC methods, 387
  point patterns temporally, 443
SAR, see Spatial autoregressive model, simultaneous
Scale effect, 285
Scale mixture, 439
Scale parameter, exponential family, 353, 355
Scales of pattern, 95
Scales of variation, 54
Scan statistic, 116
Scatterplot smoother, 327, 329
Score equations, 304
Screening effect, 232
Second reduced moment function, 101
Second-order
  factorial moment, 99
  intensity, 85, 99, 101, 443
  intensity of Cox process, 125
  intensity of Neyman-Scott process, 127
  product density, 99
  reduced moment measure, 101
  spatio-temporal intensity, 444
  stationarity, 27, 43, 52, 65, 201, 255, 308, 406, 415
Semivariogram, 28, 30, 45, 50, 135, 224
  and covariance function, 45, 136
  and lognormal kriging, 270
  as structural tool, 28
  biased estimation, 137
  classical estimator, 30
  cloud, 153, 170, 178, 186
  conditional spatial, 441
  Cressie-Hawkins estimator, 159, 239
  empirical, 30, 135, 153, 236, 244, 255, 256, 410, 413, 416, 424, 441
  empirical (spatio-temporal), 441
  estimate, 30
  estimation, 153–163
  Genton estimator, 162
  local estimation, 426
  Matheron estimator, 30, 136, 153, 174, 441
  moving average formulation, 184
  nonparametric modeling, 178–188
  nugget, 50, 139
  of a convolved process, 184
  of GLS residuals, 258, 362
Zero-probability functionals, 85
Zoning effect, 285