Spatial Statistical Data Analysis for GIS Users
Ask for Esri Press titles at your local bookstore or order by calling 1-800-447-9778. You can also shop online at www.esri.com/esripress. Outside the United States, contact
your local Esri distributor.
Esri Press titles are distributed to the trade by the following:
In North America:
Ingram Publisher Services
Toll-free telephone: 800-648-3104
Toll-free fax: 800-838-1149
E-mail: [email protected]
In the United Kingdom, Europe, Middle East and Africa, Asia, and Australia:
Eurospan Group
3 Henrietta Street
London WC2E 8LU
United Kingdom
Telephone: 44(0) 1767 604972
Fax: 44(0) 1767 601640
E-mail: [email protected]
CONTENTS
PREFACE
PART 1: INTRODUCTION TO STATISTICAL DATA ANALYSIS
CHAPTER 1: STATISTICAL APPROACH TO GIS DATA ANALYSIS
Spatial analysis and spatial data analysis
Types of spatial data and related statistical models
Spatial data dependency
Distance between geographic objects
An example of statistical smoothing of regional data
Assignment:
1) Investigate how a semivariogram can capture spatial dependence
Further reading
CHAPTER 2: EXAMPLES OF THE IMPORTANCE OF ESTIMATING DATA AND MODEL
UNCERTAINTY
The difference between averaging dependent and independent data
Prediction error maps and maps that show the probability that a specified threshold is
exceeded
Statistical comparison of mortality rates
The uncertainty of prediction errors
Quantile estimation and minimization of a loss function
Hypothesis testing and modeling
Assignments:
1) DEM averaging exercise
2) Create a kriging prediction error map that depends on the data values
Further reading
CHAPTER 3: UNCERTAINTY AND ERROR IN GIS DATA
Errors in GIS data
Systematic and random errors
Data variation at different scales
Using a semivariogram to detect data uncertainty
Locational uncertainty
Local data integration
Rounding-off errors
Censored and truncated data
Digital elevation model uncertainty
Case study: Error propagation in radiocesium food contamination
Estimating the internal dose in people from the measured food contamination
Estimating uncertainty in expressions with imprecise terms. Error in estimating the internal
dose
Assignments:
1) Investigate the influence of the locational uncertainty on the semivariogram model
2) Recalculate the uncertainty in estimating the internal dose
Further reading
CHAPTER 4: THE IMPORTANCE OF THE DISTRIBUTION ASSUMPTION
Gaussian processes
Lognormal processes
Bernoulli, binomial, and Poisson processes
Modeling data distribution as a mixture of Gaussian distributions
Use of gamma distribution for modeling positive continuous data
Modeling proportions using beta distribution
Negative binomial distribution
Modeling data with extra zeros
Nonparametric modeling
Confidence intervals and Chebyshev’s inequality
Spatial Poisson process
Assignments:
1) Simulate and plot normal, lognormal, gamma, binomial, Poisson, and negative
binomial distributions
2) Detect multimodal data distributions
3) Find a golf course with the best air quality using kriging and Chebyshev's inequality
CHAPTER 5: METHODS FOR SENSITIVITY AND UNCERTAINTY ANALYSIS
Example of sensitivity analysis in GIS
Monte Carlo simulation
Bayesian belief network
Fuzzy set theory
Fuzzy logic
Raster maps comparison
Assignments:
1) Use the Bayesian belief network for relating the risk of asthma to environmental
factor
2) Classify environmental variables using fuzzy logic
3) Using fuzzy inference, find areas that most likely contain a large number of people with a high irradiation dose
Further reading
CHAPTER 6: TYPES OF SPATIAL DATA, STATISTICAL MODELS, AND MODEL DIAGNOSTICS
Three types of spatial data
Goals of spatial data modeling
Goals of spatial data exploration
Examples of applications with input data of different types: radioecology, fishery, agriculture, wine grape quality modeling, wine price formation, forestry, criminology
Random variables and random fields
Stationarity and isotropy
Model diagnostics
Methods for indicator (yes/no) prediction
Methods for continuous prediction: cross-validation and validation
Summary of spatial modeling
Assignments:
1) Choose the appropriate model for analyzing the quantity of good-sized shoots
produced by the vine
2) Investigate possible models for the variation of malaria prevalence
3) Calculate and display indices for indicator prediction using ozone concentration
measured in June 1999 in California
4) Investigate the variability of yields of individual trees
Further reading
CHAPTER 7: SPATIAL INTERPOLATION USING DETERMINISTIC MODELS
Spatial interpolation goals
Predictions are always inaccurate
Deterministic and statistical models
Inverse distance weighted interpolation
Radial basis functions: RBFs and kriging
Global and local polynomial interpolation
Local polynomial interpolation and kriging
Interpolation using a non-Euclidean distance metric
Nontransparent barriers determined by polylines
Semitransparent barriers based on cost surface
Assignments:
1) Compare the performance of deterministic interpolation models
2) Find the best deterministic model for interpolation of cesium-137 soil
contamination
Further reading
PART 2: PRINCIPLES OF MODELING SPATIAL DATA
CHAPTER 8: PRINCIPLES OF MODELING GEOSTATISTICAL DATA: BASIC MODELS AND
TOOLS
Optimal prediction
Geostatistical model
Geostatistical Analyst’s kriging models
Semivariogram and covariance
What functions can be used as semivariogram and covariance models?
Convolution
Semivariogram and covariance models
Models with true ranges
Powered exponential family or stable models
K-Bessel or Matérn class of covariance and semivariogram models
Models allowing negative correlations
J-Bessel semivariogram models
Rational quadratic model
Nested models
Indicator semivariogram models
Semivariogram and covariance model fitting
Trend and anisotropy
Kriging neighborhood
Continuous kriging predictions and prediction standard error
Data transformations
Data declustering
Assignments:
1) Simulate surfaces using various semivariogram models
2) Find the best semivariogram models for simulated data
3) Investigate the Geostatistical Analyst's prediction smoothing option
4) Try a general transformation of nonstationary data
Further reading
CHAPTER 9: PRINCIPLES OF MODELING GEOSTATISTICAL DATA: KRIGING MODELS AND
THEIR ASSUMPTIONS
Choosing between simple and ordinary kriging
Kriging output maps
Multivariate geostatistics
Indicator kriging and indicator cokriging
Disjunctive kriging
Checking for bivariate normality
Moving window kriging
Kriging assumptions and model selection
A moving-window kriging model
Kriging with varying model parameters: sensitivity analysis and Bayesian predictions
Copula-based geostatistical models
Assignments:
1) Reproduce prediction maps shown in the “Accurate Temperature Maps Creation for
Predicting Road Conditions” demo
2) Investigate the performance of simple, ordinary, and universal kriging models by
comparing their predictions with known values
3) Find the optimal number of neighbors for prediction using simple and ordinary
kriging models by comparing their root-mean-squared prediction errors
4) Participate in the Spatial Interpolation Comparison 97 exercise
5) Predict the tilt thickness in the lake
6) Develop a geostatistical model for interpolation of the lake Kozjak depth data
Further reading
CHAPTER 10: OPTIMAL NETWORK DESIGN AND PRINCIPLES OF GEOSTATISTICAL
SIMULATION
Spatial sampling and optimal network design
Monitoring design in the pre-computer and early computer era
Ideas on a network design formulated after 1963
Sequential versus simultaneous network design
Geostatistical simulation
Unconditional simulation and conditioning by kriging
Sequential Gaussian simulations
Simulating from kernel convolutions
Simulated annealing
Applications of unconditional simulations
Applications of conditional simulations
Assignments:
1) Find optimal places for the addition of new stations to monitor air quality in
California
2) Simulate a set of candidate sampling locations from an inhomogeneous Poisson process
3) Reduce the number of monitoring stations in the network using validation
diagnostics
4) Discuss two simulation algorithms proposed by GIS users that are based on
estimated local mean and standard error
5) Conditional simulation with Geostatistical Analyst 9.3
Further reading
CHAPTER 11: PRINCIPLES OF MODELING REGIONAL DATA
Geostatistics and regional data analysis
The question of applying geostatistics to regional data
Binomial and Poisson kriging
Distance between polygonal features
Regional data modeling objectives
Spatial smoothing
Cluster detection methods
Spatial regression modeling
Simultaneous autoregressive model
Markov random field and conditional autoregressive model
Assignments:
1) Investigate the new proposal for mapping the risk of disease
2) Smooth the data for the tapeworm infection in red foxes
3) Spatial cluster detection using the R package DCluster
Further reading
CHAPTER 12: SPATIAL REGRESSION MODELS: CONCEPTS AND COMPARISON
Geographically weighted regression
Linear mixed model
Generalized linear model and generalized linear mixed models
Semiparametric regression
Hierarchical spatial modeling
Hierarchical models versus binomial and Poisson kriging
Multilevel and random coefficient spatial models
Geographically weighted regression versus random coefficients models
Spatial factor analysis
Copula-based spatial regression
Regional data aggregation and disaggregation
Spatial regression models diagnostics and selection
Assignments:
1) Investigate the effect of sun exposure on lip cancer deaths
2) Practice with ArcGIS 9.3 geographically weighted regression geoprocessing tool
Further reading
CHAPTER 13: PRINCIPLES OF MODELING DISCRETE POINTS
Examples of point patterns
Spatial point processes: complete spatial randomness and Poisson processes; spatial clusters;
inhibition processes: Cox processes
Point pattern analysis and geostatistics
Ripley's K function
Cross K function
Pair correlation functions
Test of association between two types of point events
Model fitting
Inhomogeneous K functions
K functions on a network
Cluster analysis
Marked point patterns
Hierarchical modeling of spatial point data
Residual analysis for spatial point processes
Local indicators of spatial association
Assignments:
1) Modeling the distribution of early medieval grave sites
2) Getting to know Gibbs processes
Further reading
PART 3: STATISTICAL SOFTWARE USAGE
CHAPTER 14: GEOSTATISTICS FOR EXPLORATORY SPATIAL DATA ANALYSIS
Data visualization
Exploration of ozone data clustering, dependence, distribution, variability, stationarity, and possible data outliers
Analysis of spatially correlated heavy metal deposition in Austrian moss
Zoning the territory of Belarus contaminated by radionuclides
Averaging air quality data in time and space
Analysis of the nonstationary data from a farm field in Illinois
Spatial distribution of thyroid cancer in children in post-Chernobyl Belarus
Assignments:
1) Exploration of the arsenic groundwater contamination in Bangladesh in 1998
2) Average the particulate matter data collected in the United States in June 2002 in
time and space
3) Explore the annual precipitation distribution in South Africa
Further reading
CHAPTER 15: USING COMMERCIAL STATISTICAL SOFTWARE FOR SPATIAL DATA
ANALYSIS
Programming with SAS
Traditional (nonspatial) linear regression
Linear regression with spatially correlated errors (kriging with external trend)
Using MATLAB and libraries developed by MATLAB users
Moran’s I scatter plot
Simultaneous autoregressive model
Using the S-PLUS spatial statistics module S+SpatialStats
Creating spatial neighbors
Moran’s I
Conditional autoregressive model
Assignments:
1) Repeat the Bangladesh case study using another subset of the data
2) Use MATLAB for non-Gaussian disjunctive kriging
3) Use CAR model from S+SpatialStats module for analysis of infant mortality data
collected in North Carolina from 1995–1999
Further reading
CHAPTER 16: USING FREEWARE R STATISTICAL PACKAGES FOR SPATIAL DATA ANALYSIS
Analysis of the distribution of air quality monitoring stations in California using the R splancs
package
Epidemiological data analysis using the R environment and spdep package
Analysis of the relationships between two types of crime events using the splancs package
Cluster analysis using the mclust package
Assignments:
1) Simulate spatial processes with the spatstat package
2) Repeat the analysis of infant mortality using data collected in North Carolina from
1995–1999
3) Repeat the analysis of the relationships between robbery and auto theft crime
events using the splancs package with 1998 Redlands data
4) Estimate the density and clustering of gray whales near the coastline of Flores
Island
5) Test for the spatial effects around putative source of health risk
Further reading
APPENDIX 1: USING ARCGIS GEOSTATISTICAL ANALYST 9.2
Exploratory spatial data analysis
Displaying the semivariogram surface on the map
Statistical predictions
Predictions using ordinary kriging
Replicated data prediction using lognormal ordinary kriging
Continuous predictions
A close look at predictions with replicated data
Quantile map creation using simple kriging
Multivariate predictions
Probability map creation using cokriging of indicators
Probability map creation using disjunctive and ordinary kriging
Moving window kriging
Example of geoprocessing: finding places for new monitoring stations
Deterministic models
Validation diagnostics
About ArcGIS Geostatistical Analyst 9.3
Assignments:
1) Repeat the analysis shown in this appendix
2) Use the Geostatistical Analyst models and tools to analyze heavy metals
measurements collected in Austria in 1995
3) Find the 20 best places for collecting new values of arsenic to improve predictions of
this heavy metal distribution over Austrian territory
4) Practice with the Gaussian Geostatistical Simulation geoprocessing tool
Further reading
APPENDIX 2: USING R AS A COMPANION TO ARCGIS
Downloading
The first R session
Reading and displaying the data
Scatterplots
The linear models
Fitting a linear model in R
Regression diagnostics
Beyond linear models
Running R scripts from ArcGIS
Assignments:
1) Repeat the linear regression analysis of the infant mortality and house prices data
2) Verify the assumptions of the linear regression model
Further reading
APPENDIX 3: INTRODUCTION TO BAYESIAN MODELING USING WINBUGS
On the reasons for using Bayesian modeling
Bayesian regression analysis of housing data
Multilevel Bayesian modeling
Bayesian conditional geostatistical simulations
Regional Bayesian analysis of thyroid cancer in children in Belarus
Assignments:
1) Repeat the case studies presented in this appendix
2) Interpolate the precipitation data using Gaussian and Bayesian kriging
3) Perform the Bayesian analysis of weeds data
4) Verify the classification of the happiest countries in Europe
5) Bayesian random coefficient modeling of the crime data
6) Bayesian spatial factor analysis of the crime data
Further reading
APPENDIX 4: INTRODUCTION TO SPATIAL REGRESSION MODELING
USING SAS
Logistic regression
Logistic regression with spatially correlated errors
Poisson regression with spatially correlated errors
Binomial regression with spatially correlated errors
Semivariogram modeling using Geostatistical Analyst and the procedures mixed and nlin
Assignments:
1) Repeat analysis of the pine beetle and thyroid cancer data
2) Reconstruct the semivariogram model parameters
3) Compare two semivariogram models fitting
Further reading
AFTERWORD
GLOSSARY
BIBLIOGRAPHY
PREFACE
This book is intended to help GIS users extract useful information from their data with spatial statistics by:
Explaining basic concepts of statistical models that are commonly used or central to current research
Indicating difficulties in applying models to the data
Sketching solutions to typical problems
Providing examples of data analysis and computer codes
Offering exercises for practicing spatial data analysis
Referring to an extensive bibliography
To achieve our goals, ideas behind a number of statistical models are discussed, and many case studies are
presented, mostly based on real data from GIS applications. These serve to introduce readers to the goals of
statistical analysis. Many case studies are completed using the ArcGIS Geostatistical Analyst extension,
version 9.2. Some use a prototype upgraded version of Geostatistical Analyst, and the rest use various other
statistical software packages.
ArcGIS Geostatistical Analyst is a collection of models and tools for spatial data exploration, identification of
data anomalies, optimal prediction, evaluation of prediction uncertainty, and surface creation. To do the
predictions successfully, data should be spatially correlated; that is, data should have a discernible, but
perhaps not easily detected, relation to each other. Observed values closer together usually should be more
alike than values further apart. The data should also possess the quality of being stationary, meaning that
data variability is the same everywhere in the data domain.
To test how well these conditions of spatial data dependency and stationarity are met, statistical analysis
begins with data exploration. Geostatistical Analyst provides a set of linked exploration tools, including the
three‐dimensional data view, histogram, scatter plot, and semivariogram. Exploration with these tools allows
one to gain insight simultaneously into both the “where” and the “why” of the data.
Once the exploratory analysis is complete, statistical modeling can begin. A model only approximates reality
and uses data that inevitably have errors, and these errors are compounded by modeling itself. In modeling,
there are three tasks: to find out how much data variation exists, to make sure that the model describes data
as closely as possible and is not adding too much additional error, and to know how much error is too much.
Various diagnostic tools help in performing these tasks.
Spatial data can be divided into three main categories:
1. Geostatistical or continuous data: data that can be measured at any location in the study area but are
known only at a limited number of sample points. Geostatistical data occur in meteorology,
agriculture, mining, and environmental studies, among other applications.
2. Regional (sometimes known as aggregated, polygonal, or lattice) data: data that are associated with
areas and typically include counts of an event within a polygon. Regional data occur in epidemiology,
criminology, agriculture, census, and business‐related applications.
3. Discrete point data: data that consist of locations of events. Applications of point pattern analysis
include forestry, epidemiology, and criminology.
Available statistical models work well with large quantities of input data that strictly conform to models’
assumptions, but in practice these assumptions are not often met. Therefore, statistical models can be applied
incorrectly and their outputs can be misleading. Currently available statistical packages are not capable of
warning users of a wrong result. For this reason, proof is required in the decision‐making process to ensure
that a proper model was used and that uncertainty was taken into account.
Debate as to the meaning of “proof” has been ongoing since the beginning of civilization. In ancient Egypt,
mathematical formulas with or without proof were only accessible to a small circle of priests. Several
Egyptologists suggest that the area of the isosceles triangle with a narrow base was calculated by multiplying
the base by half of the side instead of half of the height. While this has not caused a catastrophe (the pyramids
are still standing, after all), this error could have been avoided if the proof had been available to more people.
In ancient India, proof was often based on “inner vision.” In other words, if a person was believed to have
been enlightened by a higher power, the statements he or she wrote down were inarguably correct. This
enlightenment was the only proof needed. Look at the example in the figure below. Ask yourself whether it proves that the area of a circle is equal to the area of a rectangle whose sides are the radius and half the circumference of the circle.
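For reference, simple algebra confirms the claim behind the figure (this short derivation is ours, not part of the original illustration): a rectangle whose sides are the radius r and half the circumference C/2 has area

$$r \times \frac{C}{2} = r \times \frac{2\pi r}{2} = \pi r^{2},$$

which is exactly the area of the circle.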
In this example, some may choose to accept the proof and adopt the formula. However, it is unlikely that such
inner vision could provide compelling evidence when analyzing more complicated issues such as spatial
phenomena. Fortunately, the ancient Greeks adopted a more rational approach, with proof and truth
defended at the court by speakers without absolute authority. From this point in history, reasoning became
the basis of the proof. Here, by proof, we do not mean a fixed sequence of statements, which are derived from
previous statements via commonly agreed upon rules, as in mathematics, but a process, perhaps interactive,
by which the validity of the statements is accepted. In this process, the goal is not reaching certainty but
rather accepting the statements with bounded error probability.
How does the above discussion relate to the use of Geostatistical Analyst and other statistical software? Quite
simply, when analyzing complex spatial data, the best way to convince the audience that the output from a
statistical model is correct is to clearly present the key ideas behind the model.
This book discusses many spatial statistical concepts and models, from simple to complex. Models are
presented at the level of detail sufficient to familiarize readers with the ideas behind them and to apply these
models to the readers’ data. Although formulas are presented, it is not critical to follow the math, as the
models are explained qualitatively in the text.
This book is written for the following readers:
ArcGIS Geostatistical Analyst and ArcGIS Spatial Analyst users who are interested in case studies and more detailed descriptions of statistical models.
Students and teachers who are looking for qualitative explanations of the advantages and
disadvantages of various spatial statistical models. Assignments and sample data are provided to
assist self‐study and to help to organize courses and lectures based on this book.
Researchers who need a practical introduction to statistical modeling of certain categories of spatial
data and to common statistical software usage.
Readers who want to learn about applications of spatial statistics.
ABOUT THE BOOK’S STRUCTURE
This book consists of three parts: exposition (chapters 1–6), development (chapters 7–13), and recapitulation
(chapters 14–16 and the four appendixes).
In the exposition, the reader may find that the ideas repeat themselves: the themes are exposed, discussed
once, and then followed by variations. The development evolves the themes further. In the finale, everything
is recapitulated. This construction makes the book comparable to a musical sonata. The structure of the
sonata, with its themes and variations, development, and summary, has proven enduring to music lovers for
generations.
The main themes of the exposition are data and model uncertainty, data distribution, spatial dependence, and
probabilistic modeling. Here, we show examples of geoprocessing enhanced by statistical data analysis. We
begin by introducing the statistical approach to analysis of GIS data, followed by discussion and examples of
the importance of evaluating the uncertainty of the data, as well as explanation of how uncertainty in the data
carries over to uncertainty in the model. Also in this part, sources of uncertainty and error in GIS data and
models are discussed. Commonly used statistical distributions and methods of sensitivity and uncertainty
analysis are also illustrated. Examples of spatial data analysis using data from radioecology, criminology,
fisheries science, agriculture, and forestry are presented. Finally, the exposition discusses basic statistical
concepts, such as random variables and stationarity, and the principles of model diagnostics.
The development part elaborates on modeling assumptions and requirements for statistical modeling of
continuous, regional, and discrete data. The advantages and disadvantages of deterministic and statistical
models are discussed because an understanding of the limitations of spatial data analysis is essential for
choosing the optimal model and for decision making.
The recapitulation part shows many more applications of spatial data analysis, this time focusing on using
statistical software packages. The exposition is like visiting a restaurant with an extensive menu, while the
recapitulation presents detailed cooking instructions. In the recapitulation we show how to reproduce maps
and statistical tables with statistical software packages. Working with the application clarifies one’s
understanding of the concepts by viewing them in action.
Successful application of spatial statistics requires confidence in the researcher’s knowledge of the data and
use of the software. This confidence can be attained through practice in solving real data problems. Examples
of such problems are included in the assignments provided at the end of each chapter and appendix.
The following data are analyzed in this book:
1. Radiocesium soil contamination data collected in Belarus six years after the Chernobyl accident
2. Radiocesium food contamination data collected in Belarus seven years after the Chernobyl accident
3. Pediatric thyroid cancer data collected in Belarus during the first seven years after the Chernobyl
accident
4. Particulate matter and ozone concentration measurements made in California
5. Heavy metal contamination in Austrian moss
6. Arsenic concentration in groundwater collected in Bangladesh
7. Fertilizer application using data collected in a farmer’s field in Illinois
8. Soil temperature and moisture
9. Weed counts from an eastern Nebraska cornfield
10. Bell pepper disease
11. Temperature in North America and Western Europe
12. Rainfall on April 29, 1986, in Sweden
13. Monthly precipitation in Catalonia, Spain
14. Annual precipitation in South Africa
15. Fishery data collected west of the British Isles
16. Favorable habitat conditions for the California gnatcatcher
17. Elevation data collected in various places
18. Coal seam thickness
19. Agriculture‐related data about America’s farms and farmers for U.S. counties
20. Housing block group data from the vicinity of Rancho Cucamonga, California
21. House prices in part of the city of Nashua, New Hampshire
22. Family size data for U.S. counties
23. Tapeworm infections in red foxes in Lower Saxony, Germany
24. Level of happiness in western European countries
25. Clustering of pharmacy locations in the city of Montpellier, France
26. Christian church distribution in the United States
27. Cancer mortality data collected for U.S. counties
28. Prevalence of malaria in children in Gambia, Africa
29. Childhood mortality rates in African countries
30. Distribution of lightning strikes
31. Martian crater locations
32. Serious crime in California counties
33. Automobile theft and robbery in Redlands, California
34. Violent crimes in Houston city tracts
35. Shrubs data collected in Valencia Province, Spain
36. Tree distribution in the Bavarian forest, Germany
37. Trees in central Oregon attacked by the mountain pine beetle
Most of these data are included on the DVD (or links to the data locations are provided) for repeating and
improving the analyses presented in this book. Several additional datasets are provided for the assignments.
Writing this book would not have been possible without the support of the following people:
The developers of the ArcGIS software extension Geostatistical Analyst: Dmitry Pavlushko, Alex Zhigimont,
Alexander Gribov, Denis Remezov, Ivan Figurin, and Sergey Karebo; Carol Gotway, discussions with whom
greatly improved the book; Jay Ver Hoef, the consultant for the first release of the Geostatistical Analyst; Josef
Strobl, who gave me the opportunity to read a course on spatial statistics for graduate students in the
Geography Department at Salzburg University and accepted a draft version of the book for the UNIGIS online
education network; Robert Siegfried, who greatly improved my knowledge about agricultural practices; Roger
Bivand, who helped me to learn the R language and packages; Corey LaMar, who helped me find geographical
data; Michael Karman, an editor with a passion for good writing; Mark Henry, an editor who guided this book
to completion; David Boyles, editorial supervisor; Peter Adams, manager of Esri Press; and Kathleen Morgan,
permissions specialist. I am also grateful to the many Esri employees who helped me at various stages of
writing this book. Special thanks to Jack Dangermond, Scott Morehouse, and Clint Brown for their support of
this project.
Konstantin Krivoruchko
Redlands and Yucaipa, California
Dedicated with love to
Tatsiana, Katja, and Maryia
INTRODUCTION TO
STATISTICAL DATA
ANALYSIS
CHAPTER 1: STATISTICAL APPROACH TO GIS DATA ANALYSIS
CHAPTER 2: EXAMPLES OF THE IMPORTANCE OF ESTIMATING DATA AND
MODEL UNCERTAINTY
CHAPTER 3: UNCERTAINTY AND ERROR IN GIS DATA
CHAPTER 4: THE IMPORTANCE OF THE DISTRIBUTION ASSUMPTION
CHAPTER 5: METHODS FOR SENSITIVITY AND UNCERTAINTY ANALYSIS
CHAPTER 6: TYPES OF SPATIAL DATA, STATISTICAL MODELS, AND MODEL
DIAGNOSTICS
CHAPTER 7: SPATIAL INTERPOLATION USING DETERMINISTIC MODELS
STATISTICAL APPROACH
TO GIS DATA ANALYSIS
SPATIAL ANALYSIS AND SPATIAL DATA ANALYSIS
TYPES OF SPATIAL DATA AND RELATED STATISTICAL MODELS
SPATIAL DATA DEPENDENCY
DISTANCE BETWEEN GEOGRAPHIC OBJECTS
AN EXAMPLE OF STATISTICAL SMOOTHING OF REGIONAL DATA
ASSIGNMENT
1) INVESTIGATE HOW A SEMIVARIOGRAM CAN CAPTURE SPATIAL
DEPENDENCE
FURTHER READING
This chapter begins with a discussion of the difference between deterministic and statistical spatial
data analysis. Definitions are given, and an example of modeling radiocesium soil contamination
data with measurement errors is presented.
Spatial statistical models are classified into geostatistical, regional, and point pattern models. They
correspond to the data classified by their locations into continuous, regional, and discrete types. There are
four types of data values that can be associated with any spatial statistical model: continuous, ordinal,
categorical, and binary. Continuous means that data values come from the real number line. Ordinal means
that they are in discrete categories, and the order of those categories is meaningful; for example, small,
medium, and large. Categorical means that data values are in discrete categories, and order is not meaningful;
for example, red, yellow, and green. Binary data take either the value 0 or 1. In this book, we usually use the
word “continuous” when describing data locations, not data values.
The differences in modeling assumptions for each data type are briefly discussed, and an example is given of
data (lightning strike locations) that can be modeled using any of the three types of spatial statistical models.
Spatial data dependency is a common feature of any spatial statistical model. The importance of modeling the
extent of spatial dependency is illustrated using simulated data.
Spatial data dependency is based on the distance between spatial objects, but distance is not necessarily
related to the length of a line connecting any two objects. Several examples illustrate the problem of selecting
the appropriate measure of distance between spatial objects.
The chapter ends with an example of regional data visualization problems using family size data in counties
of the United States.
An addendum to the chapter illustrates how spatial dependence is estimated and used in geostatistics for
prediction to the unsampled locations.
Many readers begin a book at chapter 1, skipping the preface. We suggest reading the preface before this
chapter because it provides additional motivation for studying and using spatial statistics.
SPATIAL ANALYSIS AND SPATIAL DATA ANALYSIS
The use of GIS functionality to understand and interpret spatial information is called spatial analysis. GIS tools
are powerful, and the results they return are often so visually impressive that it is easy to lose sight of
something that everyone knows at some level, namely, that most datasets have errors both in attributes and
in locations, and that processing data—with overlay, buffering, and the like—propagates errors. Much of the
current progress in GIS consists of providing open and consistent database support and making it easier to
construct maps. If there is a need to test hypotheses and to make predictions in fields such as agriculture,
epidemiology, or hydrology, it is also necessary to be informed about the uncertainty of these tests and
predictions in order to fully understand the decisions that can be made using the results of the analysis.
Spatial data analysis uses statistical theory and software to analyze data with location coordinates.
Nonstatistical models (often called deterministic) postulate data relationships by the imposition of a priori
models, whereas statistical models estimate the model parameters from the data at hand. For example,
inverse distance weighted (IDW) interpolation is a deterministic model because the dependence between
pairs of values is defined simply by the distance between data locations raised to a predefined power value.
This model has been applied to fields as diverse as air pollution, meteorology, and the study of plant nutrient
distributions in soils. The main shortcomings of deterministic models are the arbitrariness of the postulated
spatial data similarity and the absence of information on their prediction uncertainties.
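As a minimal illustration of that deterministic character, the following base-R sketch computes an IDW prediction at a single location. The function name, the default power of 2, and the simulated data are our own choices for illustration; they are not taken from Geostatistical Analyst's implementation, which also uses a search neighborhood.

    # A minimal base-R sketch of inverse distance weighted (IDW) prediction.
    idw_predict <- function(x, y, z, x0, y0, p = 2) {
      d <- sqrt((x - x0)^2 + (y - y0)^2)    # distances to the prediction location
      if (any(d == 0)) return(z[d == 0][1]) # coincident point: return its measured value
      w <- 1 / d^p                          # weights decay with distance raised to power p
      sum(w * z) / sum(w)                   # weighted average of the observations
    }

    set.seed(1)
    x <- runif(50); y <- runif(50)          # 50 random data locations
    z <- rnorm(50, mean = 100, sd = 10)     # simulated measurements
    idw_predict(x, y, z, x0 = 0.5, y0 = 0.5)

The power p is fixed in advance rather than estimated from the data, which is precisely what makes the model deterministic.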
Kriging is a statistical model because it infers the dependence between pairs of points from examination of all
similarly distant pairs in the data. This allows calculation of averages and their associated properties along
with an estimate of the uncertainty of the prediction. The parameters of the kriging model are different for
any two different datasets.
A GIS provides (1) spatial data management, (2) mapping, (3) spatial data analysis, and (4) decision‐making
assistance. A GIS has outstanding tools for the first two and parts of the latter two. ArcGIS Geostatistical
Analyst is a spatial data analysis program, an extension to Esri ArcGIS software developed to integrate GIS
data management and mapping capabilities with statistical modeling of spatially continuous data such as
temperature and elevation—data for which Tobler’s first law of geography applies: “Everything is related to
everything else, but near things are more related than distant things.” Incorporating spatial statistical
modeling into the GIS environment is natural because GIS tools describe things over distance, while statistics
can explain how these same things change and interrelate. With the addition of statistical methods, GIS is
enhanced with key tools for decision support and evaluating the uncertainty associated with those decisions.
Statistics is not the only tool for decision making, but much of statistics is concerned with quantifying
uncertainty. Making decisions requires managing risk, and this, in turn, requires understanding uncertainty.
Integration of 1 through 4 above provides scientists and managers with an inclusive environment in which
data are managed, models are constructed, and prediction uncertainty is made explicit.
In Belarus, after the Chernobyl accident in 1986, decisions had to be made quickly about whom to evacuate.
The maximum permissible exposure was considered one millisievert per year. The amount of exposure is not
important for our purposes here, only that there is a measurable threshold. The decision to evacuate had to
proceed in the face of two types of uncertainty: uncertainty about the quality of the data and uncertainty
about the model of the various processes through which the data was run to get results.
In the years since the accident, territory zoning has been based on the available experimental data:
Direct measurements of the dose accumulated in a person, or
If the individual dose is unavailable, dose estimation from radioactive food contamination, or
If irradiation from eating contaminated food cannot be calculated, the level of radioactive
contamination of soil with cesium‐137, strontium‐90, and actinoids
Decision making based on individual doses is the easiest because the data uncertainty is small.
Several elements go into the calculation of dose: contamination of food and characteristics of a person’s daily
diet, age, and body weight. Each of these is necessarily an average. Not every person weighs the same, nor
does everyone eat the same amount of the same food. Contamination varies from place to place and from
foodstuff to foodstuff. Calculations (see chapter 3) show that uncertainty of the dose based on the data
collected in one of the villages is 47 percent, uncertainty of intake (average daily diet plus average
contamination) is 42 percent, and uncertainty of weight is 21 percent. Such a great amount of uncertainty
makes identifying which areas to evacuate a difficult task. It is no small effort (or expense) to evacuate people,
and there is significant risk in leaving people in dangerous areas.
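These percentages are consistent with combining independent relative errors in quadrature. If, as a rough model of our own, the dose is taken to be proportional to intake divided by body weight, then

$$\sqrt{0.42^{2} + 0.21^{2}} \approx 0.47,$$

that is, a 42 percent intake uncertainty and a 21 percent weight uncertainty together reproduce roughly the 47 percent dose uncertainty quoted above.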
In practice, threshold values of soil contamination were used in zoning instead of the dose measuring or
calculation, since soil samples are much easier to collect than doses are to measure or calculate. The adopted
thresholds are the calculated effective values of soil contamination that are believed to correspond to some
critical values of the individual dose. Although measurements of the radionuclide soil contamination are
considered accurate, the uncertainty of such territory zoning for the purpose of the dose control is large. The
major source of uncertainty is concealed in the coefficient of correlation between the dose and the soil
contamination. Another component of uncertainty is that of the measurements of soil contamination
themselves.
Filtered prediction is one of many methods of measuring data quality and accounting for uncertainties. This
method improves decisions in one area by using information from other areas. The map in figure 1.1 shows
the locations of settlements close to Chernobyl, along with numbers representing each settlement's level of
radiocesium soil contamination in curies per square kilometer (Ci/km2). The upper permissible limit is 15.
The error of radioactive soil contamination measurements is about 20 percent, and two circled settlements
are close to the upper permissible level limit. When information from other settlements is used to predict the
level of contamination at the two circled settlements, the one measured at 14.56 Ci/km2 is predicted to be
17.05 Ci/km2, and the one marked 15.45 is predicted to be 16.88. According to the kriging model with a
measurement error component, their true contamination values are in the intervals 14.2–19.9 and 13.8–20.0.
Thus, both settlements are rather unsafe, and people living in and around them should probably be
evacuated.
Figure 1.1 Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
In the Chernobyl example, it is evident that the data were uncertain, both in the accuracy of the
measurements themselves and in the prediction of the values at the unsampled locations. Similarly, much
spatial data contains errors in both attribute values and locations because extremely precise measurements
are very costly or impractical to obtain. Although this should not limit the use of the data for decision making,
neither should it give the decision maker the false impression that maps based on such data are exact.
In summary, a model is an approximation of reality, both incomplete and imperfect. The data from which
models are made is a sampling of reality, also incomplete and imperfect. To use models intelligently, their
limitations must be recognized. Where is the error? How much of it is there? How much is too much? Only
with answers to these questions can uncertain data be used and models incorporating that uncertainty be
created and understood to make sensible and reliable decisions.
The statistical methods covered in this book provide objective tools for measuring data quality, accounting
for the inevitable errors and uncertainties involved in mapping, assessing spatial patterns, and even deciding
whether what seems to be a pattern is in fact a pattern. Humans are good at seeing patterns, even where
there are not any. Just think how easy it is for people to see faces everywhere—in rocks, in clouds, in the
moon.
TYPES OF SPATIAL DATA AND RELATED STATISTICAL MODELS
There are three broad groups of spatial data classified according to their locations: discrete, continuous, and
regional. All are illustrated in figure 1.2.
Discrete data are isolated on the earth’s surface, for example, tree locations, which cannot be found at every
point. Continuous data exist at every point (e.g., elevation), but they are measured in a finite number of
locations. Regional data are taken collectively or in summary form; for example, the incidence of a particular
cancer in the United States.
Figure 1.2
Statistical models for discrete, continuous, and regional data are different. In statistical literature, modeling of
discrete point data is called point pattern analysis, modeling of spatially continuous data is called
geostatistics, and modeling of regional data is called lattice analysis. Applying tools developed for continuous
data to data that are discrete or regional will produce anomalous results. This may not be immediately
obvious, for nothing in those results will necessarily announce that there are anomalies.
For example, locations of lightning strikes on one particular day are displayed in figure 1.3. The red circles
reflect a radius of uncertainty in the location of the strikes. In addition to the approximate data location, there
is information on the polarity and strength of the lightning strikes. (Positive polarity is more likely to ignite a
wildfire.)
Figure 1.3
A continuous map of positive‐polarity lightning strikes using a statistical interpolation method (kriging) can
be created and used to indicate the risk of a wildfire. However, this approach does not account for the spatial
distribution of the lightning strikes. There may be far more strikes with negative polarity, with increased fire
risk simply because of the number of strikes. More important, prediction of the polarity and strength between
observation locations is questionable because lightning strikes did not happen between actual strike
locations.
To account for errors in identifying the locations of lightning strikes, polygons can be defined that partition
the data according to soil and forest types, then the number of lightning strikes can be counted and polarity
distribution estimated in the polygons. Regional data analysis on such aggregated data would be challenging
because this requires sophisticated methods for the creation of polygons and their neighborhood selection.
In spatial statistics, the lightning strike data would be considered as the realization of a marked point pattern
process, which is a combination of two processes. The first process locates the lightning strikes (a discrete
point process), while a second process controls the strength and polarity associated with strikes at the
recorded locations (a continuous process).
Statistical models for discrete, continuous, and regional data can be employed and visually appealing maps
created with the lightning strike data, which are discrete by nature. The results of modeling will be different
but equally appealing, although most likely erroneous, if the geostatistical method is used.
The analysis of the spatial point pattern usually starts with finding the most similar pattern among those that
can be generated with known statistical features. A typical goal of statistical analysis of discrete points is
estimation of the points’ intensity at each location in the study area. The intensity surface can be a function of
covariates. For example, the density of plants can be a function of the soil properties.
A model of point pattern intensity can be developed using spatial trend, dependence on spatial covariates,
and interaction between points of the pattern. Models called marked point processes allow dependence
between point locations and their associated values. For example, tree location and diameter at breast height
could interact because trees compete for light and nutrients.
Geostatistical models (kriging) predict values where no measurements have been taken assuming that data
have similar statistical properties in any part of the study area. This assumption is called stationarity. More
precisely, the mean and variance of a variable at one location should be equal to the variable’s mean and
variance at other locations, and the correlation between data at any two locations should depend only on the
direction and magnitude of the vector that separates them and not on their exact locations. Stationarity is a
property of the kriging model; it is not necessarily a feature of the phenomena under study.
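In symbols, the standard second-order formulation of this assumption is

$$\mathrm{E}[Z(s)] = \mu \quad \text{and} \quad \mathrm{Cov}\bigl(Z(s),\, Z(s+h)\bigr) = C(h) \quad \text{for all locations } s,$$

so the mean is constant over the study area, and the covariance depends only on the separation vector h, not on where the two locations lie.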
In kriging, data are assumed to be stationary even if there is uncertainty as to what “stationary” means. If conventional kriging is used on nonstationary data, such as disease rates (the incidence of disease observed in a geographical area per 1,000 or 100,000 persons), whose variability (variance) depends on the population and therefore changes from region to region, it will not return results that can be used safely to make decisions: the fundamental model assumption has been violated, and the accuracy of the analysis cannot be trusted.
Statistical models always use some assumptions, and if these assumptions are not fulfilled, decision making
based on the results of the modeling may be incorrect. Once it is known what assumptions are made by the
model, the appropriate data can be matched to it or altered to fit the assumptions. This can be done, for
example, by removing large‐scale data variation and transforming data to approximate a theoretical data
distribution such as Gaussian.
For instance, the popular indexes of spatial data association, Moran’s I and Geary’s c, are widely used for
analyzing rates. However, these indices were derived under the assumption that data mean and variance are
constants, a condition that hardly applies for rates. Therefore, other measures that more explicitly take into
account the aggregated nature of the data are to be preferred.
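For readers who want to see what Moran's I actually measures, the following base-R sketch computes the index from a vector of regional values and a spatial weights matrix. The function name, the binary contiguity weights, and the example values are our own illustration; tested implementations are available in the spdep package used in chapter 16.

    # Moran's I = (n / S0) * sum_ij( w_ij * z_i * z_j ) / sum_i( z_i^2 ),
    # where z_i are deviations from the mean and S0 is the sum of all weights.
    morans_i <- function(x, W) {
      n  <- length(x)
      z  <- x - mean(x)                  # deviations from the mean
      s0 <- sum(W)                       # sum of all spatial weights
      (n / s0) * sum(W * outer(z, z)) / sum(z^2)
    }

    # Example with made-up values on a three-region chain (region 2 neighbors 1 and 3):
    x <- c(10, 12, 30)
    W <- matrix(c(0, 1, 0,
                  1, 0, 1,
                  0, 1, 0), nrow = 3, byrow = TRUE)
    morans_i(x, W)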
In contrast to continuous data, regional data are usually observed at every location (that is, in every polygon), so there is usually no need for interpolation. Researchers working with regional data are typically interested in ways of validly presenting a map of counts or rates. One problem is that regions with smaller populations may show abnormally high rates simply because there are fewer people in the region. Other typical goals of regional data analysis are finding significant clusters, homogeneous groups of regions that vary in a similar manner, and variables that vary together with the variables of interest (covariates).
Statistical models for continuous, regional, and discrete data are discussed in chapters 8–13.
SPATIAL DATA DEPENDENCY
The applicability of spatial statistics demands that the data be spatially dependent, at least to a certain extent.
If data are spatially independent, there is no way to predict a data value at an unsampled location with
reasonable accuracy. If there is data dependency, data contain information about the relationship between
values at nearby locations. As the spatial dependence grows stronger, the prediction uncertainty lessens, and less data are required for a reasonable prediction. In figure 1.4, it would not be difficult to
predict the value of one or several blue points if they were removed, but prediction of a removed green point
is more difficult. The best that could be done would be to predict the removed green point within the area
bounded by the red lines.
Figure 1.4
The importance of spatial data dependency in the case of spatial interpolation can be illustrated by using
deterministic and statistical models implemented in Geostatistical Analyst. Fifty data points are displayed in
figure 1.5 (left). Using this data, a surface was created using an inverse distance weighted interpolation (IDW)
with the Geostatistical Analyst’s default parameters. The predictions are plotted in figure 1.5 (right).
Figure 1.5
Using the same data, another surface was created using a radial basis function with a thin‐plate spline kernel
(figure 1.6 at left). The surfaces differ, although not greatly. Each surface is the product of different
assumptions about the relation of data points to each other. These assumptions are not connected in any
fundamental way to characteristics abstracted from the dataset but are predefined. Thus, the models are
deterministic.
However, the default statistical model, ordinary kriging (figure 1.6 at right), applied to the same data
produces a very different surface—a noisy, flat surface—unlike the chain of valleys and hills produced by the
deterministic inverse distance weighted interpolation and the radial basis function models. Is this surface
better in some way?
The relationship between points on the surface produced by ordinary kriging is defined by characteristics
drawn from the data. No prediction surface is absolutely accurate, but the error in the surface produced by ordinary kriging can be evaluated.
Figure 1.6
Figure 1.7
The smooth‐looking surfaces created using deterministic interpolation models are less accurate than the
rough surface produced by ordinary kriging, because the data used in this example are spatially independent.
Spatially independent data can be produced by a random number generator (as these were), and the points
are unrelated to each other in the sense that nearby data are no more alike than data separated by greater distances. Because the data do not contain information about similarity with their neighbors, interpolation is
not appropriate. Although none of the models can predict values removed from the dataset, kriging produces
a surface close to the average value of the data. This is the best prediction surface when data are spatially
independent.
Real data are spatially dependent more often than not. If the elevation is 500 feet in one place and 600 feet in
another, there must be at least one location somewhere in between with an elevation of 550 feet. Spatial data
dependency is not simply present or absent; it occurs in varying degrees. Spatial statistics provide methods to
define the extent of spatial data dependency for use in modeling.
DISTANCE BETWEEN GEOGRAPHIC OBJECTS
Distance between geographic objects is an important attribute of models in spatial statistics. Specification of
distances between points will influence the conclusions drawn from data. The terrain that forms the
foundation of the spatial effects under investigation is not a flat surface. Spatial objects are frequently
separated by barriers that may be natural, political, or conceptual as shown in figure 1.8. Therefore, distance
in spatial statistics is a more complicated concept than simply the length of a line connecting two objects.
Figure 1.8
Straight‐line distance may not be the best measure of the proximity between geographic objects. In figure 1.9,
lines connect locations into pairs that are all about an equal distance apart. Add the geographic layers in
figure 1.9 (right), and it is easy to see that many pairs cross bodies of water. Therefore, if the points represent
animals that do not swim, then the lines that cross water do not represent true shortest distances between
those points for those animals.
Figure 1.9
Figure 1.10 at left shows the straight‐line (Euclidean) distance to the closest measurement point for each cell
of the 285‐by‐253 grid with a cell size of 180 meters. This map of distances can be compared to a distance
map in figure 1.10 at right, which shows the least accumulative cost distance over a cost surface calculated as a sum of three grids (a minimal sketch of this sum follows the list):
1. Cost value equals 1 + elevation/(lowest elevation value).
2. Cost value equals 5 if cell intersects stream and 0 otherwise.
3. Cost value equals 20 if cell intersects lake and 0 otherwise.
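A minimal sketch of this sum in R, with plain matrices standing in for ArcGIS rasters. The elevation values and the stream and lake masks are invented for illustration; in ArcGIS the same sum is a raster-algebra expression computed with Spatial Analyst.

    set.seed(1)
    nr <- 285; nc <- 253                                   # grid size from the text
    elevation <- matrix(runif(nr * nc, 200, 900), nr, nc)  # hypothetical elevation grid
    stream <- matrix(rbinom(nr * nc, 1, 0.02), nr, nc)     # 1 where a cell intersects a stream
    lake   <- matrix(rbinom(nr * nc, 1, 0.01), nr, nc)     # 1 where a cell intersects a lake

    cost <- (1 + elevation / min(elevation)) +             # grid 1: relative elevation cost
            5 * stream +                                   # grid 2: stream-crossing penalty
            20 * lake                                      # grid 3: larger lake-crossing penalty
    range(cost)                                            # inspect the resulting cost values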
The points’ density will be very different if different distance metrics are used (see example in chapter 13).
Figure 1.10
Figure 1.11 shows different interpolations of ozone concentration. Figure 1.11 at left, which shows the
predicted concentration of ozone in California, neglects to account for the effects of mountains. But since
mountains are as effective a barrier to industrial ozone as water is to nonswimmers, taking the effect of the
mountains into consideration gives a more realistic picture of ozone distribution, as shown in figure 1.11 at
right, which depicts it traveling up gorges, for instance.
Figure 1.11
From California Ambient Air Quality Data CD, 1980‐2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support
Division, Air Quality Data Branch.
Figure 1.12
AN EXAMPLE OF STATISTICAL SMOOTHING OF REGIONAL DATA
The ease of spatial visualization provided by GIS entices users to rely on visual comparison between maps for
reaching conclusions. For example, an impression that frequently arises in disease mapping is the apparent
clustering of disease cases on a map. Apparent clustering tempts the researcher to look for an environmental
cause for the disease outbreak, while this clustering may simply be caused by mapping the raw number of
disease cases instead of rates (the number of cases per capita).
It is also easy to produce misleading impressions from data points just by classifying them for mapping and
manipulating the color palette of the map. Figure 1.13 shows the average family size in United States counties
using two widely used classification schemes: natural breaks (top) and quantiles (bottom). The bottom map
suggests that most large families prefer to live near the borders, while the top map suggests a more
homogeneous distribution.
The population in the bottom map appears to be higher, too. Which map is correct? Which map more closely
corresponds to reality, and how can that be determined? Answering these questions requires understanding
the data.
Figure 1.13
Data in their aggregated form exist not by their nature, but by the way they are assembled into regional
subdivisions. Figure 1.14 at left shows hypothetical locations of families of different sizes (the stars) and the administrative borders. The distribution of family sizes is affected by the administrative divisions within which the average family size is quantified. But the idea of family size distribution has a continuous
component as shown in figure 1.14 at right, because family sizes are likely to grade into each other, however
steeply.
Figure 1.14
Figure 1.15
The intensity map, in other words, the map of the population distribution, is usually unknown. Smoothing
techniques try to approximate it by averaging values in each county with those of the neighboring counties
(figure 1.16).
Figure 1.16
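The idea can be sketched in a few lines of base R. The neighbor lists, weights, and example values below are invented; the statistically grounded smoothers discussed in chapter 11 weight neighbors far more carefully.

    # Replace each region's value with a weighted average of its own value
    # and the mean of its neighbors' values.
    smooth_regions <- function(values, neighbors, self_weight = 0.5) {
      sapply(seq_along(values), function(i) {
        nb <- neighbors[[i]]
        if (length(nb) == 0) return(values[i])   # isolated region: keep its value
        self_weight * values[i] + (1 - self_weight) * mean(values[nb])
      })
    }

    # Example: four regions in a chain, so region 2 neighbors regions 1 and 3, and so on.
    values    <- c(3.2, 3.9, 2.8, 3.5)
    neighbors <- list(c(2), c(1, 3), c(2, 4), c(3))
    smooth_regions(values, neighbors)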
In this example, data smoothing makes sense because the average family size in a given county does not
completely characterize the distribution of the variable within that county. For example, figure 1.17 shows
the distribution of family size in block groups in the part of the city of Redlands, California. Family size
variation is large, and values between the first and third data quartiles (3.17 and 3.91 people per family,
respectively) have approximately the same frequency of appearance as the mean family size value, 3.54. In
other words, the number of families with three and four members is approximately the same. The uncertainty
of the family size is described by its standard deviation. If the standard deviation of the family size for each
administrative unit is available, a pair of maps of the average family size and its standard deviation is much
more informative than the mean family size map alone.
Figure 1.17
Figure 1.18 shows a smoothed map of average family size. A new value is predicted in each county, using
values in the neighboring counties and the distance between county centroids (note that more accurate
smoothing can be done using an areal interpolation model; see chapter 12). After smoothing, it is easier to see
which areas have a large family size and which have a small family size.
Figure 1.18
But the modeling job is not over until a determination is made of how much uncertainty is associated with the
smoothing just accomplished. Figure 1.19 shows that the uncertainty associated with data smoothing is small
in counties with little migration and a homogeneous population, shown in blue. This estimated uncertainty
substitutes for the unknown standard deviation of the family size in U.S. counties.
The 95‐percent prediction interval for the displayed values of the family size in figures 1.18 and 1.19 can be
estimated as
prediction ± 2 · prediction standard error
For example, in the southern states, the family size value is approximately 3.0 ± 0.4.
This estimation (it is no more than that) of data uncertainty will have to be taken into account when family
size is used in further calculations. Each operation on family size values, whether overlay or buffering or
other, will increase the uncertainty.
ASSIGNMENT
1) INVESTIGATE HOW A SEMIVARIOGRAM CAN CAPTURE SPATIAL DEPENDENCE.
Statistical data interpolation, also called kriging, is based on the statistical description of the data using a
semivariogram. Combinations of playing cards can be used to illustrate how a semivariogram captures spatial
dependence (see glossary for statistical terminology explanations). The initial order of a new partial deck of
cards is shown in figure 1.20.
Figure 1.20
Figure 1.21
The semivariogram γ(h) is constructed by calculating half the average squared difference of the values Z(si) of
all the pairs of cards separated by a given distance. The resulting quantity, γ(h), is plotted on the y‐axis against
the separation distance h, known as the lag distance. In mathematical notation,
γ(h) = [1 / (2N(h))] · Σ [Z(si) − Z(si + h)]²,
where the sum is taken over all N(h) pairs of locations separated by the lag distance h.
For example, the semivariogram value in the blue circle in figure 1.21 is calculated using the 10 pairs connected
by the light blue lines.
The differences between the values of points separated by small distances are expected to be smaller than the
differences between points separated by greater distances if data are spatially correlated. The pink line is an
approximation of the semivariogram values for the distances between pairs of locations. Many functions can
be used as semivariogram models. One of them, Gaussian, is used in figure 1.21.
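To see how the empirical semivariogram in figure 1.21 is computed, the following Python sketch calculates γ(h) for a one-dimensional sequence of values such as the ordered cards in figure 1.20. The card values used here are illustrative placeholders, not the deck from the figure.

import numpy as np

def empirical_semivariogram(values, max_lag):
    # Half the average squared difference of all pairs separated by each lag h
    z = np.asarray(values, dtype=float)
    return {h: 0.5 * np.mean((z[h:] - z[:-h]) ** 2) for h in range(1, max_lag + 1)}

# Hypothetical ordered card values (1 = ace, ..., 13 = king)
cards = list(range(1, 14))
for lag, gamma in empirical_semivariogram(cards, max_lag=6).items():
    print(f"lag {lag}: gamma = {gamma:.2f}")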
There is a good visual fit between the model and the semivariogram points for distances smaller than 4.
However, the Gaussian semivariogram model cannot describe data with a strong trend at greater distances.
How much the value changes when moving from a known point to an unsampled location is governed by the
distance dependency explicitly shown by the semivariogram model. Weights of the measurements separated
from the prediction location by a distance greater than 3 are small, less than 1 percent (in Geostatistical
Analyst, weights are provided by the Searching Neighborhood dialog box next to the Semivariogram Modeling
dialog box), and reliable prediction within the extent of the cards is possible with the Gaussian semivariogram
shown in figure 1.21.
Figure 1.22
Removing an arbitrary number of cards from the middle of the deck and putting them at the end results in the
sequence shown in figure 1.23.
Figure 1.23
Figure 1.24
The kriging predictions (figure 1.25) using the semivariogram in figure 1.24 indicate that two of the four
predictions (4.28 and 5.28) are not accurate.
Figure 1.25
Figure 1.26
Kriging using the semivariogram shown in figure 1.26 results in poor predictions at three out of four
locations (figure 1.27).
Figure 1.27
In summary, when strong spatial correlation exists, the semivariogram can help to reconstruct missed values,
and kriging makes reliable predictions at the unsampled locations. If data are not spatially correlated, there is
no way to predict the value between measurement locations other than assigning the arithmetic mean to all of them.
With Geostatistical Analyst, repeat this exercise using a different number of initially ordered data, several
reordering scenarios, and various locations of removed data.
FURTHER READING
1. De Veaux, R., P. Velleman, D. Bock (2007) Stats: Data and Models. Second Edition. Addison Wesley, 800 pp.
This book can be recommended for readers with no knowledge of statistics. The book focuses on statistical
thinking, highlighting how statistics helps us to understand the world. It is organized into short chapters that
focus on one topic at a time. Each chapter includes a discussion of common misuses, misapplications, and
misunderstanding of statistics to help students recognize and avoid them.
2. Krivoruchko, K., and C. A. Gotway. 2002. “Expanding the ‘S’ in GIS: Incorporating Spatial Statistics in GIS.”
This paper discusses the need for spatial statistics within a GIS and the power of the inferential methods it
provides. The authors illustrate through several case studies the utility of spatial statistics within a GIS.
Although Geostatistical Analyst is an excellent first step in integrating spatial statistics and GIS, there is also a
need to include spatial statistics for other types of data for which the premises of geostatistical models do not
apply. Available at http://training.esri.com/campus/library/index.cfm
3. Krivoruchko, K., and C. A. Gotway. 2003. “Using Spatial Statistics in GIS.” Available at
http://training.esri.com/campus/library/index.cfm
This paper was presented at the International Congress on Modeling and Simulation, Townsville, Australia,
July 2003. It was published in the conference proceedings.
The authors provide several examples that show the power of exploratory spatial data analysis within a GIS
and how this can provide the foundation for more sophisticated probabilistic modeling. While Esri ArcGIS
software now facilitates the integration of spatial data analysis and GIS functionality, more tools are needed
for comprehensive spatial data analysis. The authors suggest how to implement additional spatial statistical
methods within a GIS including methods for using non‐Euclidean distances in the analysis of geostatistical,
lattice, and point pattern data.
4. Krivoruchko K. and R. Bivand (2009) “GIS, Users, Developers, and Spatial Statistics: On Monarchs and Their
Clothing.” In Interfacing Geostatistics and GIS, pp. 209‐228. Springer. Available at
http://training.esri.com/campus/library/index.cfm
This paper was presented at StatGIS 2003, the International Workshop on Interfacing Geostatistics, GIS, and
Spatial Databases, September 29–October 1, 2003, Pörtschach, Austria.
The paper discusses the statistical tools and models that propagate between communities of users and the
problems that arise with using statistical inference in inappropriate settings.
5. Bailey, T. C., and A. C. Gatrell. 1995. Interactive Spatial Data Analysis. Harlow, England: Longman.
This introductory book on spatial data analysis for undergraduate students describes statistical methods for
discrete, regional, and continuous data. The Bailey and Gatrell book is recommended for additional reading in
several other chapters.
EXAMPLES OF THE IMPORTANCE
OF ESTIMATING DATA AND
MODEL UNCERTAINTY
THE DIFFERENCE BETWEEN AVERAGING DEPENDENT AND INDEPENDENT
DATA
PREDICTION ERROR MAPS AND MAPS THAT SHOW THE PROBABILITY THAT A
SPECIFIED THRESHOLD IS EXCEEDED
STATISTICAL COMPARISON OF MORTALITY RATES
THE UNCERTAINTY OF PREDICTION ERRORS
QUANTILE ESTIMATION AND MINIMIZATION OF A LOSS FUNCTION
HYPOTHESIS TESTING AND MODELING
ASSIGNMENTS
1) DEM AVERAGING EXERCISE
2) CREATE A KRIGING PREDICTION ERROR MAP THAT DEPENDS ON THE DATA
VALUES
FURTHER READING
Understanding how to consistently assign probabilities to geographic data that contain various types
of uncertainties is difficult, even in the research domain, because uncertainty has various meanings
and levels. Providing estimates of uncertainty associated with geographic data in an
understandable and ready‐to‐use form for the decision maker is even more challenging. In this
chapter, the importance of accurate estimation of data and prediction uncertainties is discussed using
examples from several typical applications. The chapter begins with discussion of data averaging. It is shown
that arithmetical averaging of spatially dependent elevation data seriously underestimates averaging
uncertainty.
Estimation of the prediction uncertainty is an essential feature of statistical analysis. Prediction error and
probability maps are discussed, and their use is illustrated using air pollution particulate matter
measurements made in California.
The uncertainty of regional data (counts of events for a known population within a polygon) varies when
crossing a border between regions, and some statistical calculations are required to compare data in different
polygons. The section on statistical comparison of mortality rates in U.S. counties introduces this problem,
which will be discussed in detail in the following chapters.
Prediction errors can be calculated using various methods, meaning that they are a subject of uncertainty
themselves. The uncertainty of prediction errors produced by different kriging models is discussed and
illustrated using elevation data.
The most common method of statistical data interpolation, kriging, produces the most probable values (the
mean values). An example of an application in which a quantity other than a mean is required is presented in
the section “Quantile estimation and minimization of a loss function.”
The usage of statistical models can be classified as either hypothesis testing or estimation/prediction through
modeling. Researchers often perform a hypothesis test first to reject a simple model, for example, one that
assumes the lack of spatial data dependence. Then researchers begin actually modeling the data. The difference between
hypothesis testing and modeling and the danger of the informal use of the former are discussed using crime
data collected in Redlands, California.
THE DIFFERENCE BETWEEN AVERAGING DEPENDENT AND INDEPENDENT DATA
Geographical data are often collected at different spatial scales. For example, disease counts are usually
available starting from the county level (figure 2.1, left); social and environmental variables that could be
used to explain the anomalies in the spatial distribution of the disease are available at a finer scale—ZIP
Codes (figure 2.1, center left) and monitoring station locations, which are treated as points (figure 2.1, center
right)—while historical meteorological data, which may help in understanding the distribution of diseases
such as asthma, can be available as averages on a grid (figure 2.1, right).
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 2.1
As another example, using raster data at different resolutions, estimated areas of land cover may vary by up to
30 percent depending on the image source used for the same area, such as Landsat TM (30‐meter resolution),
SPOT MX (20 meters), and Ikonos (4 meters).
When data are available at different scales, the first step in analysis is usually data aggregation (also known as
“data upscaling”) at coarser scales. For example, data in ZIP Codes are summarized to the county level and
points inside a grid cell are averaged.
Data disaggregation (also known as “data downscaling”) is used rarely because this operation is more
uncertain than aggregation and because statistical software for data disaggregation is not readily available.
Indeed, there are too many ways of dividing county population among ZIP Codes. For example, disaggregating
census data in the counties to the ZIP Codes in figure 2.2 just according to the ZIP Code areas is misleading
because people’s houses and apartments are distributed non‐uniformly.
Figure 2.2 shows population density in ZIP Codes in four Northern California counties as a choropleth map and
histograms of the number of people in each ZIP Code divided by the ZIP Code area in square miles. The difference between population
density in counties and population density in ZIP Codes is small in Del Norte County with the most
homogeneous population, large in Siskiyou and Trinity counties, and very large in the most urbanized county,
Humboldt.
Figure 2.2
Data aggregation uncertainty arises because of heterogeneity among units. When the values associated with
fine polygonal objects are similar, the information loss during polygon aggregation is smaller than when
aggregating polygons with dissimilar values.
Dissimilarity among units is reduced at the coarser scale: the larger the polygons, the smaller the difference
between their aggregated values. For example, census data variability at the ZIP Code level is larger than at
the county level. Loss of information when using population density in counties instead of ZIP Codes
increases with increasing urbanization.
Accurate data aggregation must relate the variation among the aggregated coarse polygons to the variation
among the polygons at the finer scale or among the point values. In other words, uncertainty of the
aggregation should be provided along with aggregated values since this uncertainty often varies significantly
between administrative units.
Figure 2.3 shows a raster map of elevation values. We will compare aggregation of the fine 119‐by‐139 grid to
a grid with a coarser resolution of 17 by 20 (shown by red lines in figure 2.3) using zonal statistics, which
assumes that data are independent, and block kriging, which takes into account spatial data correlation.
Figure 2.3
There are several functions for local data averaging in the ArcGIS Spatial Analyst extension. In all of them, the
mean value is estimated as the arithmetic average µ = (1/N) Σ z(si) of the values z(si) measured at the locations si
inside each cell of the coarser grid, where N is the number of data values in a moving window (grid cell).
The variance of the mean estimator can then be computed as
var(µ) = [1 / (N(N − 1))] · Σ (z(si) − µ)²,
where the sum is taken over i = 1, …, N.
The arithmetic mean value and the square root of the mean variance (standard deviation) will be compared
with kriging prediction and prediction standard error below. There are several variants of averaging data
using geostatistics (for example, we will see how averaging can be done using geostatistical conditional
simulation in chapter 10), but in this example the choice of the kriging model is not very important; we used
simple block kriging implemented in Geostatistical Analyst.
Elevation is a classic example of spatially correlated data, because elevation values closer together are usually
more similar than those farther apart. Therefore, we might expect estimations that assume the data are
dependent will differ from those that assume the data are independent. The visual comparison of aggregation
in figure 2.4, using arithmetical averaging (left) and block kriging (right) is not very different, however.
Figure 2.4
But the estimated uncertainty associated with the two approaches is dramatically different in figure 2.5,
where the square root of the mean variance (standard deviation) estimated using arithmetical averaging is
shown on the left and standard error estimation using block kriging is shown on the right (note that the
legends are different).
Figure 2.5
Arithmetical averaging gives unrealistically small, overoptimistic estimates of the averaging accuracy for
spatially correlated data. Indeed, theory tells us that if normally distributed observations are dependent, the
variance of the mean, assuming that the data have the same known variance σ², is equal to
var(µ) = σ²/N + (2/N²) · Σ cov(z(si), z(sj)), summed over all pairs i < j,
where cov(expression) is the data covariance (see glossary and discussion in chapter 8). According to the formula
above, the value σ²/N underestimates the true variance of the mean when the values z(si) are positively spatially
correlated (for uncorrelated data, the second term in the formula above is equal to zero).
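The size of the covariance term can be checked numerically. The sketch below assumes an exponential covariance model with illustrative sill and range values and compares σ²/N with the variance of the mean that includes the covariances between all data pairs; it is not tied to the elevation data used in figures 2.3 through 2.5.

import numpy as np

rng = np.random.default_rng(1)
coords = rng.uniform(0, 100, size=(50, 2))     # 50 hypothetical sample locations
sill, range_par = 25.0, 30.0                   # illustrative exponential covariance parameters

d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
cov = sill * np.exp(-d / range_par)            # covariance between every pair of locations

n = len(coords)
naive_var = sill / n                           # variance of the mean assuming independence
full_var = cov.sum() / n**2                    # includes the covariance terms
print(f"independent: {naive_var:.2f}, spatially correlated: {full_var:.2f}")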
The measure of uncertainty associated with predicted values affects the conclusions that can be drawn from
the map. If the mean elevation is predicted to be 850 ± 5 meters in one place and 850 ± 35 meters in another,
the predictions will look similar on the map. But if we are developing an algorithm that routes us through
complex topography with the least amount of effort, a cell that is predicted to be 850 ± 5 meters might lead to
different decisions than a cell predicted to be 850 ± 35 meters.
Although it seems obvious that correlated data should not be averaged using an algorithm that assumes data
independence, arithmetical averaging is still the default approach for recalculating values when changing
raster cell size in GIS software.
PREDICTION ERROR MAPS AND MAPS THAT SHOW THE PROBABILITY THAT A
SPECIFIED THRESHOLD IS EXCEEDED
In ArcGIS Geostatistical Analyst, there are four types of output kriging maps. Prediction maps are created by
contouring many interpolated values systematically obtained throughout the region of interest. Standard
error maps are produced from the standard errors of interpolated values. Probability maps show where and
with what probability the interpolated values exceed a specified threshold. Quantile maps are a special type of
probability map where the thresholds are the quantiles of the prediction distribution.
In environmental sciences, the upper permissible level of contamination is often known and can be used as a
threshold value for probability mapping. The colored circles in figure 2.6 show measurements of the
maximum 24‐hour PM2.5 value (particulate matter with an aerodynamic diameter of less than 2.5
micrometers) in November 1999 in California. Black points represent populated places. The California
Ambient Air Quality Standard for the maximum 24‐hour PM2.5 is 65 µg/m3.
There are large territories without measurements of air quality. However, many people want to know the
level of air pollution near their houses. A very large number of samples would have to be collected to satisfy
them. This is impractical to do in real life. As a compromise, prediction to the unsampled places can be used,
and, as in the case of the first example in this chapter, prediction uncertainty should be provided together
with predictions. A prediction of 75 ± 5 µg/m3 means that residents are living in a rather contaminated area,
while a prediction of 75 ± 20 µg/m3 is less certain because the critical level of 65 µg/m3 falls within the
prediction interval [55, 95] µg/m3.
Figure 2.6 at right shows a prediction kriging map from which areas with PM2.5 values larger than 65 µg/m3
can be determined. However, the data and prediction model are not absolutely precise, and a decision about
the classification of safe and unsafe areas with predictions near the threshold of 65 µg/m3 can be inaccurate.
Prediction standard errors are displayed in figure 2.7 at left. Prediction quality is low in the areas where
measurements are sparse and where measurements differ substantially from their neighbors.
Prediction uncertainty in dark‐colored areas is large, and a decision about air quality for those areas is
difficult. In this case, the probability map presented in figure 2.7 at right may help to identify the most
contaminated areas. The threshold for the probability value (for example, 0.8, which is 80‐percent assurance
that the threshold was exceeded) above which an area is considered highly contaminated should be chosen
using additional, perhaps subjective, information.
Figure 2.7
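Under the usual Gaussian assumption for the kriging prediction distribution, the probability of exceeding the threshold can be computed directly from the prediction and its standard error. A short sketch with hypothetical values (Geostatistical Analyst produces the probability map itself; this only illustrates the calculation):

import numpy as np
from scipy.stats import norm

threshold = 65.0                              # 24-hour PM2.5 threshold, micrograms per cubic meter
pred = np.array([48.0, 62.0, 75.0, 75.0])     # hypothetical kriging predictions
se = np.array([10.0, 8.0, 5.0, 20.0])         # hypothetical prediction standard errors

p_exceed = norm.sf(threshold, loc=pred, scale=se)   # P(value > threshold) at each location
highly_contaminated = p_exceed > 0.8                # 80-percent assurance rule
print(np.round(p_exceed, 2), highly_contaminated)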
A quantile is a specific value that divides the distribution into two parts: the values that are greater than the
quantile value and the values that are less. In figure 2.8, the highlighted 65 percent of the data is below the 0.65
quantile value.
Figure 2.8
Figure 2.9 at left shows a prediction map of the radiocesium soil contamination in the southeastern part of
Belarus in 1992, six years after the Chernobyl accident. Digits represent the measured value of radiocesium in
Ci/km2. According to regulations in effect at that time, relocation was required if the radiocesium
concentration was higher than 15 Ci/km2. However, measurement errors were up to 20 percent to 30 percent
of the measurement values. Therefore, the decision to relocate people is difficult in areas where radiocesium
values are between 10.5 and 19.5 Ci/km2 (15 ± 15⋅0.3 = 15 ± 4.5).
A possible solution is to use an overestimation of the contamination by mapping the 0.65 quantile of the
prediction distribution at each location (figure 2.9, right) instead of mapping the most probable mean value,
equal to 0.5 quantile in the case of the symmetric Gaussian distribution of the predictions. In this case, a
larger territory has a dangerous contamination status and the chance of missing populated places with large
contamination is reduced. An example of justification of a specific quantile value for mapping will be
presented in the section “Error in estimating the internal dose” in chapter 3.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 2.9
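With the same Gaussian assumption, the 0.65 quantile map is simply the prediction shifted upward by a fixed multiple of the prediction standard error. A sketch with hypothetical radiocesium values:

import numpy as np
from scipy.stats import norm

pred = np.array([12.0, 14.5, 16.0])     # hypothetical predictions, Ci/km2
se = np.array([2.0, 3.0, 1.5])          # hypothetical kriging standard errors

q65 = pred + se * norm.ppf(0.65)        # 0.65 quantile of each prediction distribution
relocate = q65 > 15.0                   # compare the conservative estimate with the 15 Ci/km2 limit
print(np.round(q65, 2), relocate)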
Predictions obtained from kriging are fairly robust, and if a map of predicted values is all that is needed, the
default procedure allowed by ArcGIS Geostatistical Analyst (ordinary kriging) will produce an adequate visual
picture of a spatial surface. However, prediction uncertainty (prediction standard error) is more sensitive to
the selected kriging model parameters and to how close the data are to the model's assumptions. The
difference between default and optimal kriging models increases with increasing complexity of the data.
Probability and quantile maps are constructed using prediction and prediction standard error estimates.
Thus, in order to validly interpret the uncertainty maps associated with kriging (standard error, probability,
and quantile maps), it is important that the kriging parameters be well estimated. Figure 2.10 shows an
example of outputs from two different kriging models (in this case, ordinary kriging without additional
Figure 2.10
Because different maps can be created using different geostatistical models, clear justification for the chosen
model is required.
STATISTICAL COMPARISON OF MORTALITY RATES
The National Cancer Institute published a dataset that contains 1970–1994 cancer mortality information for
counties in the United States (http://www.nationalatlas.gov/mld/cancerp.html). Included are
the number of deaths, death rates (the number of deaths per 10,000 people), and confidence intervals for the most
common cancer types. When the variable uncertainty is known or estimated, it can be incorporated into the
choropleth map production.
Figure 2.11 shows a choropleth map of the lung cancer mortality rates in females in California during the
years 1970 to 1994. Color intervals are selected using the geometric intervals option, the default classification
algorithm in Geostatistical Analyst. The table shows part of the data, including lower and upper bounds of
rates. Therefore, cancer rates can be considered as a random variable that follows a probability distribution.
Distributions of the cancer rates in three counties are shown in figure 2.11 (right). Vertical colored stripes
reproduce the classification intervals. Although the mean value, 30.31, of the rate distribution shown in red falls
in the yellow interval, there is a high probability (the area under the distribution line) that the actual value lies
in the light blue or brown intervals. Similarly, it is quite possible that the true value of the rate with the blue
distribution line, with a mean of 34.90, actually lies in the red interval, and the blue and dark blue areas under
the black distribution line, with a mean of 20.99, are almost the same.
From U.S. Geological Survey, 2008, Cancer Mortality in the United States: 1970‐1994. Courtesy of U.S. Geological Survey, Reston, Va.
Figure 2.11
This example highlights difficulties with selecting optimal classification intervals for the data with relatively
large uncertainty. Other classification schemes, such as quantile or natural breaks, have the same problem.
One possibility for choosing the best classification method for a given number of intervals is to calculate
probabilities that each rate value falls in the interval with its mean value and compare the sums of the
probabilities for each classification method. The method with the largest sum of probabilities visualizes rates
most accurately.
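A sketch of this comparison, assuming each county rate is approximately Gaussian with its published mean and a standard error derived from the confidence bounds; all rates, errors, and class breaks below are hypothetical.

import numpy as np
from scipy.stats import norm

rates = np.array([20.99, 30.31, 34.90])   # hypothetical county mortality rates
se = np.array([3.0, 4.0, 2.5])            # hypothetical rate standard errors

def classification_score(breaks):
    # Sum of probabilities that each rate falls in the class interval containing its mean
    idx = np.searchsorted(breaks, rates)
    lo = np.concatenate(([-np.inf], breaks))[idx]
    hi = np.concatenate((breaks, [np.inf]))[idx]
    return np.sum(norm.cdf(hi, rates, se) - norm.cdf(lo, rates, se))

quantile_breaks = np.array([25.0, 33.0])
natural_breaks = np.array([28.0, 32.5])
print(classification_score(quantile_breaks), classification_score(natural_breaks))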
Figure 2.12 shows female esophagus cancer mortality rates (red bars and red font) and their estimated
standard error (blue) in the central part of California. Counties are colored from cold to warm colors
according to the female population increase. The standard errors are small in the counties with large
populations in the Bay Area, allowing certainty about esophagus mortality there. However, in sparsely
populated eastern counties, the standard deviation is sometimes larger than the rate. For example, in the
selected county with 5,426 females and only one cancer case, the 95‐percent confidence interval of the
esophageal cancer mortality rate of 1.48 is [0, 5.53], which means very large uncertainty about the true
esophageal cancer mortality rate in this county.
Figure 2.12
A choropleth map of esophagus cancer mortality rates shown in figure 2.13 (left) gives a misleading picture of
the disease distribution because of the uncertainty of the mortality rates in sparsely populated counties.
If mapping cancer rates is required, it would be better to use one of the statistical smoothing techniques.
Figure 2.13 (right) shows the empirical Bayesian smoothing of esophagus cancer mortality rates. When the
observed rates are based on small populations, the empirical Bayes estimator is doing local smoothing,
borrowing information from the neighboring counties, and the resulting smooth rate tends toward the
California mean rate. When the county population is large, the resulting smooth rate is close to the observed
rate in the county.
The estimated rate for the highlighted county in figure 2.13 is higher than the observed rate because the
smoothed value is a weighted sum of the mortality data in the nearby counties.
Figure 2.13
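The exact smoother used for figure 2.13 is not reproduced here; the following sketch illustrates the general idea with a simple global method-of-moments empirical Bayes estimator, in which the weight given to the observed rate grows with the county population. All counts and populations are hypothetical.

import numpy as np

cases = np.array([1, 4, 120, 15])                 # hypothetical cancer deaths per county
pop = np.array([5426, 21000, 900000, 60000])      # hypothetical female populations

r = cases / pop                                   # observed raw rates
m = cases.sum() / pop.sum()                       # statewide reference rate
# Method-of-moments estimate of the between-county (prior) variance, floored at zero
s2 = max(np.average((r - m) ** 2, weights=pop) - m / pop.mean(), 0.0)
w = s2 / (s2 + m / pop)                           # shrinkage weight for each county
smoothed = w * r + (1.0 - w) * m                  # small counties are pulled toward m
print(np.round(10000 * smoothed, 2))              # smoothed rates per 10,000 people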
Unfortunately, information on the data confidence intervals is rarely included in the databases. If regional
data uncertainty is not available, it can be estimated as shown in chapters 4 and 10 and appendixes 3 and 4.
THE UNCERTAINTY OF PREDICTION ERRORS
Errors arise from incomplete observation of spatial variables. Data in unsampled locations can be predicted
using available measurements. Deterministic models ignore data and model uncertainty, while statistical
models provide errors associated with predictions. Prediction error is the difference between true and
predicted value in an unsampled location. Since the true value at the time of prediction is unknown, the error
is also unknown. Statistical models replace the unknown error either with a set of possible simulated
values, from which the mean value and its variance are derived, or with an estimated prediction variance
obtained by averaging spatial data features. For example, in geostatistics, the averaging set consists of pairs of
data locations grouped by distance and orientation.
There are many ways to define the averaging set, to group it, and to calculate error using these groups.
Therefore, there are many variants of the prediction error, and the task is to find the method that minimizes
the difference between true and predicted values. Intuitively, prediction error must reflect neighboring data
variability, and an optimally estimated prediction error must depend on local data values. We would expect
that prediction error is lower if data are changing slowly. For example, the error of digital elevation model
(DEM) values is lower in a valley than in mountains. We would also expect a lower error at a shorter distance
from the measurement location because closer values are more likely to be similar.
We expect that the variable at an unsampled location is described by the observed data distribution. However,
this leads to difficulty in predicting extreme values, for example, a concentration of air pollution higher
than any of the observations in the region, since practically all spatial prediction models tend to predict a
value closer to the mean than to the maximum or minimum data values. Intuitively, such an extreme value can
occur rather far from the measurement locations, where the prediction error is large, so that the extreme value
could be inside the prediction interval (prediction ± prediction standard error), but there is no guarantee that
this is always so.
The most popular and simplest geostatistical model, ordinary kriging, accurately describes prediction error
that arises from the data configuration (the relative position of the data locations), since averaging and
grouping is based on the distance between pairs of points. However, the same data configuration produces
the same error regardless of the prediction location, because all data pairs are used in estimating the spatial
data structure. Figure 2.14 shows a prediction standard error map created by ordinary kriging. The method
assumes that spatial data correlation is the same in all directions. Input data points are elevation
measurements on a grid.
Courtesy of U.S. Geological Survey.
Figure 2.14
Using the Geostatistical Analyst trend analysis tool shown in figure 2.15 and described in chapter 14, we can
see that there is a tendency for the elevation data values to change faster in the northwestern direction
(green line), and there is a periodicity in the perpendicular direction (blue line).
Figure 2.15
Assuming data anisotropy, averaging the pairs of data values produces different results in different
directions, and the resulting prediction standard error map, shown in figure 2.16, reflects the difference. The
anisotropy option for deterministic and statistical data interpolation is discussed in chapters 7 and 8.
Since prediction error depends on the method of averaging, it is important to present information on the
method when reporting the results of geostatistical data analysis.
Figure 2.16
In practice, there are two main methods of estimating prediction error that depend not only on the data
configuration but also on the nearby measurements. The first method is based on locally estimated spatial
dependence (semivariogram or covariance) by taking into account only those measurements closest to the
prediction location. There are many variants of this method, and one of them, moving window kriging, is
discussed in chapter 9. The second method is based on data transformation and data detrending, which aim to
produce data with a particular theoretical distribution. The kriging variance of successfully
transformed data depends on the data values, and after proper back transformation, the resulting kriging
prediction error becomes dependent on the local data values. Figure 2.17 shows a prediction error map
created by simple kriging, with normal score transformation and a mixture of Gaussian kernels options (see
discussion in chapter 9). This time, prediction error is small near the measurement locations and in the areas
with small changes in the data values, as it should be.
Figure 2.17
The kriging model estimates only the spatially correlated part of the prediction error assuming that there is
no data variability at the measurement locations. A spatially uncorrelated part of the prediction error is
included in the nugget parameter and interpreted as constant measurement error plus data variation at very
short distances. However, it is possible that kriging may underestimate the total prediction uncertainty
because the uncorrelated part of the prediction error may include several other components, including an
uncertain measurement location and a varying mean value, covariance, and data distribution.
Methods for taking into account each particular component of the prediction error do exist. However, the
only way to deal with all of them together is to use Monte Carlo simulations. Then a distribution of many
possible predictions and, therefore, the prediction variability, is available for the prediction locations. Monte
Carlo simulations are discussed in chapters 5 and 10 and appendix 3.
QUANTILE ESTIMATION AND MINIMIZATION OF A LOSS FUNCTION
Prediction error is used in various applications. In this section, the use of kriging standard error to minimize a
loss function is demonstrated. We consider an example in which the loss function describes the consequences
of over‐ and underestimating the required level of fertilizer application for yield improvement.
If fertilizer is applied in time, the yield response has a shape shown in figure 2.18 (left). A loss function can be
defined in monetary units based on the economic benefits that result from the application of fertilizer, for
example, potassium. The required amount of potassium fertilization equals the difference between that
recommended for a particular crop and that which is known to be available in the soil. The recommended
amount of potassium per acre M is dependent on the desired crop yields, current crop prices, soil type, and
fertilizer cost.
If the amount of fertilizer is underestimated, the yield loss in dollars can be calculated for every pound of
potassium that should have been applied but was not. Alternatively, if more potassium is applied than is needed
to reach approximately the same yield, the excess application results in an unnecessary loss of money spent on
useless potassium. The resulting loss function would display an asymmetric relationship similar to that shown in
figure 2.18 (right).
Figure 2.18
Assuming that M is an optimum amount of fertilizer, x is the amount of potassium at the location s, and y is the
estimate of x at location s on which the fertilizer application is based, the loss function can be defined as in
figure 2.18 (right). If the distribution of x is Gaussian, with kriging prediction µ(s) and kriging standard error
σ(s), the expected value of the random loss function is obtained by integrating the loss over this normal
distribution. The optimal value of y at location s can be found by minimizing this expected loss with respect
to y, that is, by setting its first derivative with respect to y equal to zero. The resulting integral and its inverse
can be calculated numerically, so that the expression for the optimal y becomes
y(s) = µ(s) + σ(s) · Φ⁻¹(c2 / (c1 + c2)),
that is, a particular quantile of the normal distribution with mean µ(s) and standard deviation σ(s) at the
prediction location s. Here c1 and c2 are the slopes of the loss function for underestimation and overestimation
of the fertilizer, and Φ⁻¹ is the inverse of the standard normal distribution function. The value Φ⁻¹(c2/(c1 + c2))
can be calculated in Microsoft Excel as NORMSINV(c2/(c1 + c2)) or in any statistical software, for example, in
the freeware R.
Therefore, instead of mapping kriging prediction values, the minimum loss strategy is to estimate the value
y(s), where σ(s) is the kriging prediction standard error, and then subtract it from the recommended value M
to compensate for the potassium deficiency.
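Assuming the reconstruction above, the minimum-loss application map is straightforward to compute from the kriging outputs. In the sketch below, the recommended amount M, the loss slopes c1 and c2, and the kriging predictions and standard errors are all hypothetical values.

import numpy as np
from scipy.stats import norm

M = 300.0                                   # recommended potassium, hypothetical units per acre
c1, c2 = 2.0, 0.5                           # hypothetical loss slopes (yield loss vs. wasted fertilizer)
mu = np.array([180.0, 240.0, 310.0])        # hypothetical kriging predictions of soil potassium
sigma = np.array([25.0, 15.0, 30.0])        # hypothetical kriging prediction standard errors

y = mu + sigma * norm.ppf(c2 / (c1 + c2))   # quantile at probability c2/(c1 + c2); NORMSINV in Excel
apply = np.maximum(M - y, 0.0)              # negative values mean no potassium should be added
print(np.round(apply, 1))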
The correct estimation of prediction standard error is essential in loss function analysis, the aim of which is to
improve the decision‐making process to increase profit. Figure 2.19 illustrates this by showing two maps of
the optimal values
of potassium in a field in Illinois (assuming that c1 and c2 are known). Data are provided by the University of
Illinois at Urbana‐Champaign and discussed in more detail in chapter 14.
The map in figure 2.19 (left) was created using ordinary kriging without additional options, and the map in
figure 2.19 (right) was produced by ordinary kriging with a Box‐Cox transformation with power value 0.15,
which brings the kriging prediction and prediction standard error close to a normal distribution and makes the
prediction standard error dependent on the nearby measurement values. The estimated required amount of
potassium differs between the two maps and, therefore, justification for using a particular kriging model is
required.
Note that if some of the values are negative, leaving potassium levels as they are is the best we can do
because we cannot take potassium away from the field. The next crop will draw the potassium down.
In reality, the field is not homogeneous, and the recommended value M of the fertilizer can vary as a function
of the local soil properties. Figure 2.20 shows that the field under consideration consists of various soil types,
and each of them potentially produces a different volume of corn (red digits, in bushels per acre). Therefore,
loss of money in the case of underestimation of fertilizer varies with differing soil types, and the improved
model should use variable slope c1 in the loss function.
Figure 2.20
HYPOTHESIS TESTING AND MODELING
Epidemiological, crime, and forestry applications often begin with testing the spatial distributions of disease,
crime, and tree locations for homogeneity and the tendency to cluster near specific places, such as industrial
plants, low‐income parts of cities, and rich soil patches. The researchers in these and other areas use their
sample data to test the null hypothesis:
“There are no areas with a significant excess of the observed number of events.”
The 269 yellow points in figure 2.21 are locations of auto thefts in the center of Redlands, California, in 2000.
The largest parking lots and apartment complexes are displayed in pink and red. Their locations can be used
for testing the hypothesis of auto theft locations clustering around specific places (see assignment 5 in
chapter 16). The proportion of the number of auto thefts to the total number of crimes is displayed in the
background. In practice, information on every crime event location may be inaccessible due to privacy or
other reasons, and researchers have to work with aggregated instead of original count data. A typical example
is epidemiological data, which are rarely available at the home address level but can be found at county,
district, province, or state levels as the numbers of disease cases per population.
Courtesy of City of Redlands, Police Department.
Figure 2.21
Verification of the hypothesis that points are distributed randomly is possible because the distribution of
random points is known. The mathematical model of the random pattern of points requires only one
parameter: the intensity of points. The model is called the homogeneous Poisson process, or complete spatial
randomness, and is discussed in chapters 4 and 13.
Typically, a tool called Ripley’s K function, or simply K function, is used to check the hypothesis of random
distribution of the observed locations. K function is a normalized empirical distribution of the pairwise
distances between observed points. The theoretical K function for complete spatial randomness in the
polygon around crime locations is shown as a red line in figure 2.22 (left). One hundred simulations of a
similar number of random points in the same polygon were made, and the maximum and minimum K
function values are displayed as blue lines. If observed crime locations are distributed randomly, the
empirical K function should be inside these two blue lines. However, the K function of the observed data is
higher for all distances between pairs of the crime locations. It is displayed by red points in figure 2.22.
This discrepancy between the observed and the theoretical K functions for complete spatial randomness
indicates positive association between points or points clustering. In the case when the observed K function
lies below the theoretical K function, association between points is negative, indicating repulsion (see
examples in chapter 13).
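A minimal sketch of this test for a rectangular study area, ignoring edge correction (which production tools apply) and using hypothetical point coordinates in place of the Redlands auto theft locations:

import numpy as np
from scipy.spatial.distance import pdist

def k_function(points, area, distances):
    # Uncorrected Ripley's K: scaled average number of neighbors within each distance r
    n = len(points)
    d = pdist(points)
    counts = np.array([(d <= r).sum() * 2 for r in distances])
    return area * counts / (n * (n - 1))

rng = np.random.default_rng(2)
width, height = 1000.0, 800.0                              # hypothetical study area in map units
obs = rng.uniform((0, 0), (width, height), size=(269, 2))  # placeholder for observed crime points
r = np.linspace(10, 300, 30)

k_obs = k_function(obs, width * height, r)
sims = np.array([k_function(rng.uniform((0, 0), (width, height), size=(269, 2)),
                            width * height, r) for _ in range(100)])
envelope_low, envelope_high = sims.min(axis=0), sims.max(axis=0)
clustered = k_obs > envelope_high                          # above the envelope suggests clustering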
First, auto thefts often happen on roads where cars are parked and left. Figure 2.22 (at bottom) shows
rasterized streets and a random selection of 295 grid cells that belong to the streets. The K function of these
points is shown in figure 2.22 (top) in green. Its values are slightly higher than the values of the K function of
the random points on the plane, meaning that points are slightly aggregated. That may occur just because
streets are denser in some areas than others. This observation suggests using distance metrics associated
with a street network for testing the hypothesis about distribution of the auto theft locations. Point pattern
analysis on a network is discussed in chapter 13.
Figure 2.22
Another problem is that a completely random pattern of points is the exception in the world of real discrete
spatial objects, but a comparison of the real thing with the unlikely theoretical distribution will always show
the difference between them. This is a problem because any hypothesis test, including a test of the spatial
independence of points, is designed for answering the question yes or no and not for quantitative
differentiation between several patterns. If two K functions are outside of the lower and upper envelopes of
simulations from the homogeneous Poisson processes, then neither one of them is a random pattern, and
which point process is more irregular should be decided using other statistical tools.
Hypothesis testing can help formalize a scientific question in statistical terms, but it does not necessarily help
in decision making. If the null hypothesis is rejected, the next step is modeling and inspection of the
residuals—the difference between data and the predictions from the model. Methods for doing this are
discussed at the end of chapter 13. Residual inspection permits modeling such complicated processes as a
spatial mixture of regular and clustered point patterns. An example of this type of point pattern is mixed‐age
forest in which locations of old trees form a regular pattern as a result of a thinning mortality process, and
young trees grow in clusters that are associated with gaps in the canopy created after old trees fall (see
chapter 6).
If crime locations are nonrandomly distributed, they cannot be described by the complete random point
process with constant intensity, and methods for validating the alternative non‐Poisson models are required.
A crime analyst may be interested in an explanation of particular crime location intensity by spatial
covariates such as city infrastructure, census data, and the occurrence of other types of crimes. This analysis
can be done using modern point pattern analysis theory.
In the section “The difference between averaging dependent and independent data” at the beginning of this
chapter, elevation data were averaged, and the prediction standard error map—the last map in the section—
was presented as well. Several other maps can be created using predicted values on a grid. For example, a
map of local entropy, figure 2.23, is discussed in detail in chapter 14. It is produced using the Geostatistical
Analyst Voronoi map local statistics. In this map, the smaller the entropy value (shown in light colors), the
smaller the variability of the data in the local neighborhood. Areas in light colors are those where data do not
change significantly, and dark‐colored areas are those where the data gradient is large. The entropy map may
serve as a benchmark for confirmation that prediction errors are calculated correctly, at least qualitatively,
since entropy is larger in the areas where prediction error is also large.
Another measure of data variability is Moran’s index of spatial data association (Moran’s I) shown in figure
2.24 (note that this map is qualitatively similar to the map in figure 2.23). It is discussed in detail in chapter
11. In figure 2.24, dark green shows the area in which data are more similar (large Moran’s I), and light green
depicts areas with large data dissimilarity (small Moran’s I). Prediction uncertainties should be small in the
areas with large local Moran’s I values and large where local Moran’s I are small. Indeed, a correctly chosen
kriging model (here, simple kriging with the normal score transformation option) displays such association:
the lines in figure 2.24 are simple kriging prediction errors with large values in hot colors and low values in
cold ones.
Figure 2.24
The usage of Moran’s I requires data normality and stationarity (that is, the same data mean and data
variance at any location). Here these requirements are met, and Moran's I is tractable. However, Moran's I is rarely used
in geostatistics, in which data stationarity is the main assumption and data normality is a desirable feature.
The local Moran’s I compares each data value with the weighted values of its closest neighbors, and in the case of
polygons, the number of possible choices of neighborhood structure and weights is very large; see the discussion in chapter 11.
Figure 2.25 shows possible links between adjoining administrative regions in the center of Redlands,
California, and weights calculated using the inverse 1,000 feet distance (weight=1,000/distance; map units are
feet) between polygon centroids.
Courtesy of City of Redlands, Police Department.
Figure 2.25
The local Moran’s I map, based on the proportion of auto thefts to the total number of crimes as the value for
each polygon and on the above‐defined neighbors and weights, is shown in figure 2.26. Polygons with positive
values of Moran’s I are interpreted as areas in which there is positive spatial correlation with neighbors,
negative values indicate negative spatial correlation, and values close to zero show areas with negligible
correlation between neighbors.
Figure 2.26
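One common formulation of the local Moran's I (the z-scored value at a polygon multiplied by the weighted sum of its neighbors' z-scored values) can be sketched as follows; the proportions, centroid distances, and the five-polygon neighborhood are hypothetical.

import numpy as np

def local_morans_i(values, weights):
    # Local Moran's I for each polygon: z_i * sum_j w_ij * z_j, with z-scored values
    z = (values - values.mean()) / values.std()
    return z * (weights @ z)

props = np.array([0.04, 0.12, 0.15, 0.10, 0.13])    # hypothetical auto theft proportions
dist = np.array([[np.inf, 1500, 2200, 1800, 2600],  # hypothetical centroid distances, feet
                 [1500, np.inf, 1300, 2100, 1900],
                 [2200, 1300, np.inf, 1600, 1400],
                 [1800, 2100, 1600, np.inf, 1700],
                 [2600, 1900, 1400, 1700, np.inf]])
w = 1000.0 / dist                                   # weight = 1,000 / distance, zero on the diagonal
print(np.round(local_morans_i(props, w), 3))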
The difference between this and the previous map of local Moran’s I for elevation data is that, this time, input
data are not stationary and not normally distributed, meaning that the result of the calculations must be
questioned. It is possible that the proportion of stolen cars in the blue polygon (the value 0.04 printed in
black) with 27 total crimes (printed in red) is significantly different in comparison with those proportions in
the neighboring polygons as suggested by the local Moran’s I (meaning that a proportion value in the polygon
is almost certainly less than the mean value of 0.11, in this case, while neighbors have values larger than the
mean value). But it is also possible that it is just a random variation of the crime data, since the total number of
crimes in the blue polygon is less than half the average number of crimes per polygon in this area, so its
uncertainty must be high.
The Moran’s I and other exploratory data analysis tools are good for investigating spatial data features.
However, their usage should not be the last stage in the regional data analysis. The next step in statistical
analysis of regional data is incorporation of the results of the data exploration into the data modeling. These
topics are discussed in the second and third parts of the book.
ASSIGNMENTS
The assignments require Geostatistical Analyst. If you are unfamiliar with this extension to ArcGIS, read
appendix 1 first.
1) DEM AVERAGING EXERCISE.
Repeat the data averaging exercise using a small part of any digital elevation model (DEM). Averaging by
assuming data independence can be done using formulas in the text or by using ArcGIS Spatial Analyst
functionality, or using equal weights with inverse distance weighted interpolation, or using a pure nugget effect
model with kriging. Averaging that takes spatial data dependence into account can be done using the block
kriging option in Geostatistical Analyst (Properties > Symbology > Grid or Properties > Export to Raster
options).
Create prediction standard error maps using Geostatistical Analyst’s ordinary kriging and simple kriging with
the normal score transformation option. Investigate the difference between prediction standard error maps.
Elevation data for this exercise are in the folder assignment 2.2, figure 2.27.
Figure 2.27
FURTHER READING
Gotway, C. A., and L. J. Young. 2002. “Combining Incompatible Spatial Data,” Journal of the American Statistical
Association 97: 632–48.
This paper gives an overview of the problems that arise when modeling data gathered from a variety of
sources.
UNCERTAINTY AND ERROR IN GIS
DATA
ERRORS IN GIS DATA
SYSTEMATIC AND RANDOM ERRORS
DATA VARIATION AT DIFFERENT SCALES
USING A SEMIVARIOGRAM TO DETECT DATA UNCERTAINTY
LOCATIONAL UNCERTAINTY
LOCAL DATA INTEGRATION
ROUNDING-OFF ERRORS
CENSORED AND TRUNCATED DATA
DIGITAL ELEVATION MODEL UNCERTAINTY
CASE STUDY: ERROR PROPAGATION IN RADIOCESIUM FOOD
CONTAMINATION. ESTIMATING THE INTERNAL DOSE IN PEOPLE FROM THE
MEASURED FOOD CONTAMINATION. ESTIMATING UNCERTAINTY IN
EXPRESSIONS WITH IMPRECISE TERMS. ERROR IN ESTIMATING THE
INTERNAL DOSE.
ASSIGNMENTS
1) INVESTIGATE THE INFLUENCE OF THE LOCATIONAL UNCERTAINTY ON THE
SEMIVARIOGRAM MODEL
2) RECALCULATE THE UNCERTAINTY IN ESTIMATING THE INTERNAL DOSE
FURTHER READING
This chapter begins with general comments on error in data and models. The difference between
systematic and random errors is illustrated using meteorological data for the territory of the United
States.
Data variation is often different at different scales, so that data variation at a finer scale can be considered as
measurement error at a coarser scale. This is illustrated using radiocesium soil contamination data
collected in Belarus six years after the 1986 Chernobyl accident.
The next section of the chapter discusses semivariogram modeling, which is one method of estimating the
measurement error in spatial data.
Practically all geographical data include locational uncertainty. In some applications it can be ignored, and in
others locational error is too large to be ignored. Examples of locational uncertainty in agricultural,
radioecological, and forestry data are presented in the next section.
The next three sections deal with three types of data uncertainty. Local data integration error is illustrated
using measurements of radiocesium in forest berries collected in Belarus. The influence of rounding error on
prediction is illustrated using temperature data. Censored and truncated data are discussed using heavy
metals soil contamination in Austria.
The chapter concludes with two case studies. First, the uncertainty of DEM (digital elevation model) data is
investigated. We conclude that a DEM is the result of smooth interpolation with unknown spatially varying
error. The second case study discusses the error propagation in estimating the exposure of people from
measured food radiocesium contamination. We discuss how to estimate the uncertainty in expressions that
have imprecise terms and use this knowledge in estimating the internal dose error. The result of calculations
suggests using a quantile map instead of a prediction map if spatial interpolation of the radiocesium food
contamination dose is required.
ERRORS IN GIS DATA
Error is the discrepancy between the true value and a measurement, or between a map and the real world it
represents. Measurement errors arise in both attribute values and locations. The two main sources of error
are data collection and data modeling.
In GIS, errors propagate as a result of manipulating the input data. If a value is known with an accuracy of ±10
percent of its value, any mathematical operation with this value, such as addition to or multiplication by
another imprecise value, will produce output with an error greater than 10 percent.
Although often used as synonyms, the words uncertainty and error have their own preferred uses. An error is
a difference between a true value and another value. If that other value is a measurement, we have
measurement error; if we assume that other value is random, we have random errors; if that other value is a
prediction, we have prediction errors. Because error is used in many ways, we need a qualifier in front of it.
Uncertainty is a more general term for variance, the expected squared error defined as var(Z) = E[(Z − E(Z))²],
where E(expression) is the expected value of the expression in brackets. For example, when estimating the
sample mean, uncertainty is another word for variance. There are other ways to assess uncertainty, for
example by using entropy or fuzzy logic. Therefore, the uncertainty is more general than variance.
Multiple measurements of a variable made at the same location are often different, whether from faulty
sampling, errors in the measurement device, human recording error, changes in measurement conditions, or
data integration. If replicated measurements are available, measurement error is proportional to the
standard deviation of the replicated data.
In practice, GIS datasets often have only one value per location. If information about data uncertainty is not
available, it is prudent to assume that data are not precise, and one’s first efforts should be spent on exploring
the uncertainty.
The smaller the measurement error ΔZ relative to the
estimated true value, the more accurate the measurement. Since ΔZ cannot be calculated precisely, usually it
is preferable to overestimate it rather than underestimate. If a measurement has larger error than justified by
the theory of the physical phenomena under study, that measurement, called an outlier, should be removed
from the data.
If relative measurement or locational error is significant, a model should incorporate information about data
inaccuracy. For example, GPS devices have accuracy specifications for well‐defined conditions.
The reliability of a measurement depends upon the probability that a very similar result will be obtained if a
measurement is repeated. In practice, measurements of soil contamination by a chemical taken in different
parts of the city can be significantly different. In this case, a single measurement used as a representative of
the city contamination is untrustworthy, and decision making based on it must be questioned. A number of
measurements are necessary to obtain reliable results. The average value of a series of measurements is not
always sufficient. When the average is used, it should be accompanied by its estimated error.
It is advisable to make at least two measurements at the same location. The cost of taking a second
measurement in a nearby part of the city or lake or region is usually low, and it allows estimating the
measurement error (see the section “Using a semivariogram to detect data uncertainty”). In some cases, the
data variability at the sample locations can be so large that further data interpolation makes no sense.
Investigation of the cause of large data variability should be done before modeling spatial data dependence
because the quality of the statistical analysis cannot compensate for poor quality of the data.
Often, measurement devices do not measure the data of interest itself but rather a substitute associated with
it. For example, instead of measuring the amount of O3 molecules in the air sample, the sample is exposed to
ultraviolet light, and the degree of absorption of the light, which is proportional to the amount of ozone, is
measured. As another example, small airborne particles with diameters 2.5 and 10 microns (PM2.5 and PM10),
which may cause acute respiratory and cardiovascular diseases, are not measured directly. Instead, they are
measured by using the tapered element oscillating microbalance monitor. When air is drawn into the
monitor, the tube resonates, and the resonant frequency is used to estimate the particulate concentration
value. The amounts of ozone and particulate matter, and similarly nitrogen oxides, sulphur dioxide, and
carbon monoxide, are never measured absolutely precisely; the data uncertainty depends upon the
measuring device and upon the accuracy of the assumed relationship between the chemical and the
substitute.
The number of measurements is usually limited for a variety of reasons (often financial), so modeling is used
to predict values at unsampled locations. These predictions cannot be absolutely accurate even if the model is
valid and correctly fitted (see the beginning of chapter 7). Also, modeling always contains a subjective
component since it is based on the researcher’s knowledge and experience.
The optimal trade‐off between the number of measurements and the amount of modeling is unknown or at
least data‐ and problem‐specific.
In this chapter, various sources of data uncertainty are discussed. Modeling uncertainty is a subject of the
following chapters.
SYSTEMATIC AND RANDOM ERRORS
Errors that affect all measurements in the same way are called systematic. Systematic error is also called
“bias” in statistics. Variations of repeated measurements that occur due to natural variation in the
measurement process are called random errors. They follow some law but cannot be predicted exactly.
Figure 3.1 illustrates the difference between random and systematic error, where error is the difference
between the putt and the hole. The golfer in the left and center photographs is consistently putting balls close
to each other. The difference in the ball positions is random. The balls in the center figure are shifted toward
the same direction and at relatively large distances from the hole; hence, it is a combination of systematic
(shift) and small random error in putting. By rotating his putter slightly and hitting the balls a little harder,
the golfer can correct the putts. In the photograph at right, balls are also shifted, but less consistently. This is
an example of large random and large systematic error.
Photo by the author.
Figure 3.1
Figure 3.2 displays a winter day’s temperature in the United States. Temperature systematically increases
from north to south and does so regardless of mountains or ocean. Of course, mountains and ocean influence
temperature, but latitude has a stronger effect on a large scale. A trend in temperature data works as
systematic error. However, trend is not a difference between “truth” and something else. It is deterministic
rather than random, because there is a clear external reason for changing temperature from south to north.
Figure 3.2
Systematic directional change in temperature data is recognizable in the semivariogram modeling dialog
boxes in figure 3.3 because:
The semivariogram surfaces (see its definition in the glossary and in chapter 6) in the bottom left
corner of each picture are not symmetric.
There is a large difference between the semivariogram values (red points) in the north–south search
direction (left) and the east‐west search direction (right).
Figure 3.3
Large‐scale data variation detection is important because most statistical models require that systematic
errors be negligible or that they be removed from the data during data preprocessing.
Figure 3.4
If the option to remove the trend (figure 3.4) is chosen, the trend can be modeled in the next dialog box by moving the slider slightly closer to
Global trend, as shown in figure 3.5 at left. The values of this smooth surface at the measurement locations si
are calculated and subtracted from the measurement values. The resulting values are called residuals.
residuals(si) = data(si) − trend surface(si)
Kriging proceeds to the semivariogram modeling using residuals instead of the observed data. The
semivariogram surface built on residuals is almost symmetrical, and the semivariogram points behave
similarly in different directions, as can be seen in figure 3.5 at right.
Figure 3.5
After the residuals are interpolated, the trend surface is added back to produce the final prediction:
prediction = trend surface + predicted residuals
Usually, a model applied to detrended data produces more accurate predictions than a model applied directly
to data with a trend. However, the prediction standard error is then underestimated (see chapter 8).
DATA VARIATION AT DIFFERENT SCALES
The scale of a map determines how much detail can be shown. It would be ideal to depict on one map, for
example, geological variability that exists at all scales: variability of deposits associated with a mountain‐
building or tectonic event at distances up to 100 kilometers, variability of mineralized lenses at distances up to
100 meters, and variability resulting from the transition from one mineralogical element to another at the
centimeter scale. In practice, if samples are collected to describe variability at the scale of meters, data
separated by centimeters are not available, and data variation at the centimeter scale can be attributed to
measurement error. Accordingly, data variation at the kilometer scale is treated as trend.
Figure 3.6 shows three subsets of radiocesium soil contamination data in Belarus collected in 1992, six years
after the Chernobyl accident. The gray circles show measurements with values less than 1 Ci/km2. Brown
shows data values in the interval 1–3 Ci/km2, and green shows 10–20 Ci/km2. Semivariogram modeling using
these three datasets shows that data with contamination of less than 1 Ci/km2 are spatially dependent up to
120 kilometers, data in the range of 1–3 Ci/km2 are spatially dependent up to 30 kilometers, and data from
the dataset with contamination in the 10–20 Ci/km2 range are spatially dependent at distances less than
5 kilometers.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 3.6
Figure 3.7 shows radiocesium soil contamination in southern Belarus interpolated using kriging. The kriging
model for spatial correlation consists of three spatial scales and uncorrelated measurement error. The first
scale represents spatial continuity on very short scale, less than the distance between observations. The
second scale represents the distance between nearby observations, the neighborhood scale. The third scale
represents the distance between observations separated by distances several times larger than a typical
distance between nearest neighbors, the regional scale.
Filled contours show large‐scale variation in contamination (spatial trend), which decreases with increasing
distance from the Chernobyl nuclear power plant. Using the estimated residuals, contour lines display
detailed variations in radiocesium soil contamination. Outlines of cities and villages with populations larger
than 300 are also displayed to show where measurement errors originated. Only average values of
radiocesium estimated from replicated measurements within these settlements are available, even though
the values are different in different parts of the cities. They belong to the first scale, the microscale of
contamination. Measurement errors are significant for this data—about 20 percent of the measured values.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 3.7
Spatial data models are based on the estimated or postulated similarity of neighboring data. If the
distance between locations is measured by straight lines, the most popular function for measuring spatial
data correlation is the semivariogram. It is defined as follows:
Semivariogram(distance h) = ½ average[(value at location i − value at location j)²]
for all pairs of locations i and j separated by distance h.
For data that are regularly distributed in space, that is, for grids, distances at which data are averaged can be
the first dozen shortest distances between grid cells. For irregularly distributed observations, several
distances h (called lags) are defined, and pairs separated by distances h, 2h, 3h, . . . are considered together.
For example, all pairs of observations separated by more than 190 yards and less than 210 yards form one
such group, bin 200. The squared differences between the values for each lag are averaged for each bin.
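A minimal Python sketch of this binning procedure follows; the arrays x, y, and z (coordinates and measured values) are illustrative names, and the routine is not meant to reproduce Geostatistical Analyst's directional averaging.

import numpy as np

def empirical_semivariogram(x, y, z, lag_size, n_lags):
    # Average half squared differences of all data pairs falling into each lag bin.
    n = len(z)
    sums = np.zeros(n_lags)
    counts = np.zeros(n_lags, dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            h = np.hypot(x[i] - x[j], y[i] - y[j])
            k = int(h // lag_size)          # bin index: pairs with h in [k*lag, (k+1)*lag)
            if k < n_lags:
                sums[k] += 0.5 * (z[i] - z[j]) ** 2
                counts[k] += 1
    gamma = np.full(n_lags, np.nan)
    gamma[counts > 0] = sums[counts > 0] / counts[counts > 0]
    lag_centers = (np.arange(n_lags) + 0.5) * lag_size
    return lag_centers, gamma, counts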
The average semivariogram value in each bin is plotted as a red dot in figure 3.8. The figure has more than
one red dot per bin because Geostatistical Analyst is averaging the semivariogram values in different
directions separately.
A semivariogram model is used to describe spatial data variability at various distances between points and to
predict values at locations where no data have been collected. Because a prediction location can be
separated from the data locations by any distance, a fitting algorithm is used to approximate the empirical
semivariogram with an analytical function of the distance between points. For example, the semivariogram
model (blue line) is fitted to the semivariogram values (red dots) in figure 3.8.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 3.8
Many semivariogram models have three parameters, as shown in figure 3.9. The range is the distance beyond
which data lack significant spatial dependence. The partial sill is the amount of spatially correlated variation
of the process. The nugget is the data variation at distances shorter than the distance between the closest data pairs.
The following text focuses mostly on the nugget parameter because its value can be used to estimate the
measurement error.
Figure 3.9
In semivariogram modeling, the nugget parameter is estimated by extrapolating the semivariogram model to
zero distance between pairs, the y‐intercept.
The nugget parameter is the sum of measurement error and microscale irregularities (called microstructure
in Geostatistical Analyst Wizard). In Geostatistical Analyst, the measurement error part of the nugget value
can be specified. Then kriging can filter measurement error and predict a new noiseless value at the data
location.
If there are multiple observations per location, Geostatistical Analyst estimates the measurement error variance using
the following formula:
σ²ME = (1/nD) Σ over si in D [ (1/(ni − 1)) Σ over j (zj(si) − z̄(si))² ],
where D is the set of all data locations that have more than one measurement, zj(si) is the jth measurement
at location si, z̄(si) is the mean value of the measurements at location si, ni is the number of observations at
location si in D, and nD is the number of spatial locations in D.
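A minimal Python sketch of this pooled estimate follows; the dictionary of replicated measurements is a hypothetical stand‐in for a real dataset, and Geostatistical Analyst's internal implementation may differ in detail.

import numpy as np

def measurement_error_variance(replicates):
    # Average the per-location sample variances over all locations with more than one measurement.
    per_location = []
    for values in replicates.values():
        values = np.asarray(values, dtype=float)
        if len(values) > 1:                          # only locations in D (with replicates)
            per_location.append(values.var(ddof=1))  # sample variance at this location
    return float(np.mean(per_location))              # average over the nD locations in D

replicates = {"site_a": [1.2, 1.4, 1.1], "site_b": [3.0, 2.7], "site_c": [5.5]}
print(measurement_error_variance(replicates))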
The slider in the Geostatistical Wizard, step 2 of 4 of the Semivariogram/Covariance Modeling dialog box (in
the red box in figure 3.10), is set at the position that corresponds to the value
(nugget − microstructure)/nugget, that is, the proportion of the nugget attributed to measurement error.
Figure 3.10
If replications are not available, measurement error can be estimated indirectly using additional information.
The example below shows one such possibility.
Measurement error can also be determined using information on the known error of another measurement of the
same variable. Suppose z1 and z2 are two measurements of the same variable, for example, taken by two
devices, z1 = z1,true + ε1 and z2 = z2,true + ε2, where the errors ε1 and ε2 are independent with zero means and
variances σ1² and σ2². Then
σ1² = E[(z1 − z2)²] − σ2².
Here E[expression] is the expected value of the expression in brackets.
A derivation of the formula above follows. The expected squared difference between the measurements is
E[(z1 − z2)²] = E[(z1,true − z2,true)²] + σ1² + σ2².
The averaged squared difference between the true values is typically much smaller than the averaged squared
errors and can be neglected, which gives the formula above. Thus, the unknown measurement error can be
estimated from the averaged squared difference of two measurements and the known measurement error of another experiment.
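A minimal Python sketch of this indirect estimate follows, using hypothetical paired measurements and a known error standard deviation for the second device.

import numpy as np

def unknown_error_sd(z1, z2, sigma2):
    # Mean squared difference of the paired measurements, E[(z1 - z2)^2]
    msd = np.mean((np.asarray(z1) - np.asarray(z2)) ** 2)
    var1 = msd - sigma2 ** 2          # neglecting the squared difference of true values
    return np.sqrt(var1) if var1 > 0 else 0.0

z_device1 = [10.2, 9.8, 10.5, 10.1]   # hypothetical paired readings
z_device2 = [10.0, 10.1, 10.3, 9.9]
print(unknown_error_sd(z_device1, z_device2, sigma2=0.1))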
In most applications, measurement data are not precise, and it is advisable to assume that at least part of the
nugget is due to measurement error.
LOCATIONAL UNCERTAINTY
Locational error arises when the measurement ascribed to a point si has actually been taken at the point
si + Δsi, where Δsi is a nonzero displacement. After the September 11, 2001, terrorist attacks in New York and
Washington, D.C., the National Geospatial‐Intelligence Agency investigated how publicly accessible geospatial
information could be used by possible attackers. As a result, some useful and unique information
sources have not been made public since 2001. It is good that terrorists do not have access to precise
geospatial information, but this also means that our own research is usually based on inaccurate coordinates.
Sources of locational errors include measurements collected for territories (polygons), the area of which is
unknown; measurements distributed throughout a city but with coordinates represented by the city centroid;
truncated coordinates; and coordinates varying by map projection. An example of the locational errors due to
unknown area where measurements are collected follows.
A farmer’s field in Illinois was intensively sampled for corn yields using a combine equipped with a GPS
receiver; see measurement locations in green in figure 3.11 at left. The combine driver did his best to
maintain a constant speed, and the collected corn was weighed regularly. Nonetheless, there are many
uncertainties in the data including locational ones. The volume of the grain collector, the speed of the
combine, and the U‐turn locations influence the accuracy of the corn yield measurements. The combine is
computing yield in discrete packets as the grain gets picked, peeled, and deposited. This cannot be done
precisely. As a result, there are areas with gradually changing yield value and islands with unusually large
values. Figure 3.11 at right shows the enlarged part of the field displayed in light blue in the map at left.
Neighboring data along the north–south orientation of the combine passes look more similar than neighboring
data to the east and west, meaning that each measurement depends on the previous one.
Farm Spatial statistics data courtesy of the Department of Crop Sciences, University of Illinois at Urbana–Champaign.
Figure 3.11
Farm Spatial statistics data courtesy of the Department of Crop Sciences, University of Illinois at Urbana–Champaign.
Figure 3.12
Predictions at the sampled locations using kriging with the semivariogram model in figure 3.12 differ from
the actual measurements: they are more similar to their neighboring data.
Farm Spatial statistics data courtesy of the Department of Crop Sciences, University of Illinois at Urbana–Champaign.
Figure 3.13
Figure 3.14 shows a subset of radiocesium data, collected in cities and villages of Belarus in 1992, depicted as
colored points. The city of Gomel with a population of approximately 500,000 is in the center of the map
surrounded by villages with populations of roughly 100 people. Contours of some Belarus cities and villages
are shown in this figure as green polygons. Points inside polygons are measurements associated with the
polygons’ centroids. Measurements were made inside cities, but it is not known where. Any point inside a
polygon can be the location of radiocesium measurement. Clearly, locational error exists, and its magnitude
depends on the size of the city. The bigger the city, the larger the locational error.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 3.14
The semivariogram is calculated using the distance between pairs of locations, and uncertainty in location
coordinates leads to changes in the semivariogram model. In figure 3.15, the semivariogram without
locational errors (purple) is different from the semivariogram with errors at small distances between data
pairs (green).
If it is known that locational error exists, the semivariogram model should be modified for distances shorter
than the error in the data coordinates by increasing the nugget parameter value. With a larger nugget
parameter value, kriging predictions are smoother, and the prediction uncertainty is larger. Therefore,
locational errors may change decision‐making.
There are many applications in which spatial data consists of locations only. The main goal of such data
analysis is the explanation of the observed point patterns. Locational errors in point data may change the
result of an analysis significantly, because the statistical models are based on distances between points.
One of the major applications of point pattern analysis is in forestry. The main goal of the analysis is to
uncover the mechanism that generates the pattern of tree data. With few exceptions, researchers assume that
tree or shrub locations are known precisely.
In the example shown in figure 3.16, the tree root systems overlap, and the answer to the question "where are
the trees located?" is uncertain. Of course, "tree" is generally understood to be the above‐ground part, but here
we are interested in its point location, which will then be used to calculate various functions of the distances
between pairs of trees. A point representation of a polygon is always uncertain, especially when the
shape of the polygon is not known. Therefore, a good statistical model must take into account tree location
uncertainty or at least estimate how much this uncertainty may influence the results.
Figure 3.16
Spatial health data have different sources of locational uncertainty, one of which is people's mobility.
Individuals can contract and spread infections anywhere they go, whether by personal contact or by other routes,
but information about a particular disease case is usually tied to the place of residence. Spatial analysis of such
data becomes appropriate only after measures of geographic proximity that describe the various routes of
infection are taken into account.
Ideally, locational uncertainty should be among the modeling parameters of spatial data analysis.
LOCAL DATA INTEGRATION
If data are collected over an area, a point representing this area can be inaccurate. The goal of the analysis of Cs‐137
contamination of wild berries in Belarus was to zone the areas where consuming local forest berries would be
dangerous. To measure the radioactive contamination, about one kilogram of berries was required.
But such an amount could only be collected over a relatively large area of forest, probably somewhere within
a 5‐mile radius around the monitoring station, as shown in figure 3.17. The resulting sample integrated
berries that were more or less contaminated than the contamination value assigned to that sample.
The consequence of local data integration on modeling is similar to the consequence of locational error: the
semivariogram changes near the origin, kriging predictions have a larger variance, and the resulting
interpolation surface is smoother than when exact point coordinates are used.
A related effect occurs when the areas of data integration overlap. In that case, highly correlated
data can be produced simply because values at nearby locations may be based on almost the same
material. Observe that the forest berries example in figure 3.17 shows such overlap.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 3.17
Local data integration can occur over time. For example, temperature is usually available as an averaged
value around midnight or noon and ozone values are measured during one‐ or eight‐hour periods. The longer
the time period of data integration, the smoother the spatial predictions and the smaller the prediction
variance.
ROUNDING‐OFF ERRORS
Many applications have rounded measurements. For example, air temperature, which can be measured with
high precision, is often available as an integer value because the scale of most thermometers is one degree
and most people do not need better than one degree of precision to plan their outdoor activities.
The precision of a measurement or storage device can be characterized by a small constant ε such that the expression
1 + δ = 1
is true whenever δ is smaller than ε. For example, if latitude is recorded with precision ε = 0.01, the increment
δ = 0.002 is smaller than ε, and the values 32.46 and 32.462 (= 32.46 + δ) are stored as equal. In ArcGIS, a numeric field
has a precision, the number of digits that can be stored in a number field. Therefore, data could be
erroneously rounded just because of the chosen precision.
The precision of data coordinates should be sufficient for calculating distances several times smaller than the
shortest distance between pairs of measurement locations. This is especially important when data
coordinates are latitude‐longitude.
Figure 3.18 shows airport locations in the United States with original coordinates in latitude‐longitude. The
locations were projected using precisions δ = 0.01 (fields LAT2 and LON2) and δ = 0.000001 (fields LAT and
LON). The resulting x‐ and y‐coordinates are presented, and the difference in coordinates is calculated and
shown in the last two fields on the right side of the table. For Washington‐Dulles International Airport with
original coordinates (38.93, 77.43) and (38.934722, 77.433333), the difference in x‐ and y‐coordinates is 383
and 463 meters. For some applications these differences are negligible, but for others they can be important.
Figure 3.18
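A minimal Python sketch of this comparison follows, assuming the pyproj package is available; the projection (UTM zone 18N) is chosen only for illustration and need not match the projection behind figure 3.18, so the offsets will differ from the values quoted above.

from pyproj import Transformer

to_utm = Transformer.from_crs("EPSG:4326", "EPSG:32618", always_xy=True)

# Washington-Dulles, negative longitude for the western hemisphere
lon_precise, lat_precise = -77.433333, 38.934722   # ~0.000001 degree precision
lon_rounded, lat_rounded = -77.43, 38.93           # the same point rounded to 0.01 degree

x1, y1 = to_utm.transform(lon_precise, lat_precise)
x2, y2 = to_utm.transform(lon_rounded, lat_rounded)
print(f"offset in x: {abs(x1 - x2):.0f} m, offset in y: {abs(y1 - y2):.0f} m")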
The influence of rounding errors on prediction can be illustrated using winter temperatures for one day in
January 1999 in the continental United States. The original data, with a precision of 1 degree Fahrenheit,
were rounded down to the nearest multiple of 5.
Figure 3.19
The map in figure 3.20 shows a predicted kriging surface using original data. Contour lines are predictions
using rounded down values. Because rounding down produces underestimated values, predictions are
systematically shifted to the south.
Figure 3.20
CENSORED AND TRUNCATED DATA
Censored data are known only to lie above or below some threshold. Truncated data are missing from the sample
altogether because of the measurement device's sensitivity limit.
Censored data are typical in survival analysis, which deals with death in biological organisms and failure in
mechanical systems, when the time of interest (disappearance of a disease symptom or death) is not known
exactly, but partial information is available. A right‐censored value occurs when unobserved survival time is
known to be larger than the estimated time of interest. A left‐censored value occurs when the time (for
example, age at death) is known to be less than the estimated time of interest.
It is common that some observations of environmental measurements, such as herbicide concentrations in
soil, air, and water, are not recorded below specified analytical reporting limits because of the limitation of
measurement devices. Measurements below a certain limit are called left censored. Their precise values are
unknown.
Some of the heavy metals in moss data collected in Austria in 1995 are left censored, with the following
detection limits:
0.13 mg/kg for arsenic
0.10 mg/kg for cadmium
0.25 mg/kg for cobalt
0.25 mg/kg for molybdenum
Figure 3.21 shows the cadmium data with question marks for censored values. Only 4 out of 219
locations have cadmium values below the detection limit. They are situated in areas with low contamination, and their
removal, or replacement with a random value between zero and 0.1, should not seriously affect the analysis of
the cadmium data distribution.
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 3.21
Figure 3.22 shows a different situation with arsenic measurements: 143 out of 219 measurements are
censored. An interpolation map using noncensored measurements only, with replacement of censored
measurements using random values between zero and 0.13 or using value 0 for censored measurements, can
be seriously biased. In this situation, a model based on truncated data distribution is required.
Several statistical papers describe the interpolation models for truncated data, but at this writing there is no
software for these models. A possible workaround is transformation of the arsenic variable to an indicator
variable with a value of zero if the measurement is less than the detection limit of 0.13 (green) and a value of 1
otherwise (purple), as shown in figure 3.23.
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 3.23
Then cokriging (multivariate spatial prediction) using only measurements with values greater than the
detection limit as the primary variable and with indicators as a secondary variable can be used for creating a
surface of arsenic predictions (see discussion on cokriging in the section on multivariate geostatistics in
chapter 8).
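A minimal Python sketch of the indicator coding follows; the array of arsenic values is hypothetical, with censored observations stored as NaN.

import numpy as np

arsenic = np.array([0.21, np.nan, 0.45, np.nan, 0.17, 0.13])   # hypothetical measurements, mg/kg
detection_limit = 0.13

# 0 if the measurement is censored (below the detection limit), 1 otherwise
indicator = np.where(np.isnan(arsenic) | (arsenic < detection_limit), 0, 1)
# values above the detection limit serve as the primary variable for cokriging
primary = arsenic[~np.isnan(arsenic) & (arsenic >= detection_limit)]
print(indicator, primary)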
An example of a right‐censored dataset in which the data values were known to exceed the capacity of the
measuring device is the set of measurements of the short‐lived radionuclides made in the first several days
after the Chernobyl accident in April 1986. These radioactive substances, with a half‐life of hours to days,
were released into the atmosphere within the first several days after the accident. At that time, measurement
devices for very high concentrations of radionuclides in the air were not available at most of the monitoring stations.
A similar situation arises when measuring rainfall on stormy days: available containers quickly become full,
making it impossible to know the true amount of rain that fell.
An example of interval‐censored data is when data are available for selected ranges such as household
income.
Less than $10,000
$10,000–$14,999
$15,000–$19,999
$20,000–$24,999
$25,000–$29,999
$30,000–$34,999
$35,000–$39,999
$40,000–$44,999
$45,000–$49,999
$50,000–$59,999
$60,000–$74,999
$75,000–$99,999
$100,000–$124,999
$125,000–$149,999
$150,000–$199,999
$200,000 and greater
If these data are used to make a map or used as variables in statistical models, a single value will be required
for representing each range. However, using rather arbitrary values from the data intervals may produce a
biased and inconsistent estimation.
DIGITAL ELEVATION MODEL UNCERTAINTY
Figure 3.24 shows light detection and ranging (LIDAR) elevation measurements, the common source for the
creation of cell‐based digital elevation model (DEM) data of the shape of the earth’s surface. DEM data are
used in many applications as input for quantification of the characteristics of the land surface. Therefore,
accurate prediction of the average values in the grid cells is an important task.
Suppose the size of the required DEM cell is 30 by 30 meters, as shown in figure 3.24. Any prediction of the
average elevation in the cell is imprecise because measurements in each cell are different. We may expect that
inaccuracy will be greater in the cells on the slopes where data variation is greatest, for example, near the
edges between points with different colors, and in the cells with a relatively small number of measurements.
Figure 3.24
Figure 3.25 at left shows averaged values of elevation in the 30‐by‐30‐meter grid cells predicted by block
kriging. As expected, kriging prediction standard error is inhomogeneous, varying from 0.11 to 2.52 meters,
at right. Assuming that predictions are normally distributed, a 95‐percent prediction interval for most of the
cells is ± 3–5 meters. This uncertainty can be too high for hydrological and some other applications.
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 3.25
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 3.26
Data for modeling can be the output from another model as it is in the case of a DEM, which is a result of
preprocessing and interpolating elevation measurements. Figure 3.27 shows a small part of a DEM near a
creek that we will use to investigate DEM features.
Figure 3.27
DEM data was used as the input to kriging, and semivariogram models available in Geostatistical Analyst
were compared by using the cross‐validation diagnostic. Cross‐validation uses all the data to fit the
semivariogram model. Then it removes each data location, one at a time, and predicts the associated data
value. For all points, cross‐validation compares the measured and predicted values. The fitted line through
the scatter of points is shown in blue in the plot in figure 3.28 at right. This is almost a 1:1 line, which is very
unusual for real data because all interpolators, including kriging, overestimate low data values and
underestimate large data values. The Gaussian semivariogram used for modeling spatial data correlation is
shown in figure 3.28 at left.
Figure 3.28 can be compared to figure 3.29, which shows a circular semivariogram fit and cross‐validation
kriging diagnostic using a circular semivariogram model.
Although the regression function is not far from a 1:1 line, prediction error statistics in the bottom left part of
the cross‐validation dialog are much worse for kriging with a circular semivariogram. For a model that
provides accurate predictions, the following should be true (a sketch for computing these diagnostics from the cross‐validation output follows the list):
Mean prediction error should be near zero (this investigates bias).
Root‐mean‐square prediction error should be as small as possible.
Average standard error should be close to root‐mean‐square error.
Mean standardized prediction error (the prediction errors divided by their estimated prediction standard
errors, then averaged) should be near zero, just like the mean prediction error.
Root‐mean‐square standardized prediction error should be near 1, indicating that the estimated
prediction uncertainty is consistent.
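A minimal Python sketch of these diagnostics follows; the input arrays (measured values, cross‐validation predictions, and their standard errors) are hypothetical stand‐ins for the values reported in the Geostatistical Analyst dialog.

import numpy as np

def cross_validation_statistics(measured, predicted, std_error):
    err = np.asarray(predicted) - np.asarray(measured)
    se = np.asarray(std_error)
    return {
        "mean error": err.mean(),
        "root-mean-square error": np.sqrt((err ** 2).mean()),
        "average standard error": se.mean(),
        "mean standardized error": (err / se).mean(),
        "root-mean-square standardized error": np.sqrt(((err / se) ** 2).mean()),
    }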
Kriging using a Gaussian semivariogram clearly outperforms kriging with a circular semivariogram because
the root‐mean‐square and the average standard errors are much smaller for the former and the root‐mean‐
square standardized error is too far from 1 for the latter. This is a problem since the Gaussian model is too
smooth and it does not correspond to any real physical process (see discussion in chapter 8).
Figure 3.29
CASE STUDY: ERROR PROPAGATION IN RADIOCESIUM FOOD CONTAMINATION
Nowadays, more than half the radiation received by Belarusian people comes from food contaminated by
radiocesium. Inhabitants of villages in southeastern Belarus, unable to afford clean food, eat the vegetables,
potatoes, and milk that they produce on their contaminated properties. This diet is often supplemented with
mushrooms and berries, also contaminated, from nearby forests.
A number of factors influence the transfer of radionuclides from soil to plants. Among them are the levels of
soil contamination, the soil type, the meteorological conditions at the time of radionuclide deposition, and the
type and extent of countermeasures.
Many models exist that predict internal radiocesium doses in humans using measurements of soil
contamination. These models first investigate radiocesium transport from soils up the food chain then give an
estimate of the internal dose by applying models of absorption and excretion of radiocesium from the body.
Error analysis for these models can be very complicated because of the large number of input parameters.
These models have two main sources of uncertainty: one is attributable to the error of measurement of soil
contamination, and the other is contained inside the numerous coefficients of retention of radionuclides in
the foods and in the transfer parameters related to the metabolism of radionuclides in the body.
ESTIMATING THE INTERNAL DOSE IN PEOPLE FROM MEASURED FOOD
CONTAMINATION
Estimating the dose in people from measured food contamination rather than from soil contamination removes
part of the dose uncertainty. Even in this case, errors are large due to the transfer parameters and the
measurement error of the concentration of radionuclides in the food.
Figure 3.30 at left presents data on radiocesium concentration in milk in becquerels per kilogram (Bq/kg)
collected by the Belarusian Institute of Radiation Safety in 1993. Samples were collected in the territories
with high radiocesium soil contamination, shown in figure 3.30 at right. Samples close to Chernobyl were not
taken because people were evacuated from that area before 1993. One becquerel represents one radioactive
decay per second; each radioactive particle has a statistical probability of producing changes in DNA and the
immune system, which, in turn, affects an individual’s ability to cope with even low levels of radiation.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 3.30
Radiocesium is quickly absorbed in the human body and distributed almost uniformly. It is removed through
the kidneys, with an estimated half‐life as shown in table 3.1.
Suppose that a person consumes food with a constant daily radiocesium activity q0 (in becquerels). Table 3.2
shows how the activity accumulates in the body day by day, with the previously accumulated activity reduced
each day by the factor exp(−λe).
Day 1: q0
Day 2: q0⋅exp(−λe) + q0
Day 3: q0⋅exp(−2⋅λe) + q0⋅exp(−λe) + q0
...
Day N: q0⋅exp(−(N−1)⋅λe) + … + q0⋅exp(−λe) + q0
Table 3.2
Here, λe is the effective speed of elimination of the radionuclide from the body through biology and
radioactive decay. For adults, λe ≈ 0.0063 per day.
Summing up the geometric progression, the amount of radiocesium in the body after N days can be estimated as
q0⋅(1 − exp(−N⋅λe)) / (1 − exp(−λe)). Radiocesium accumulation in the human body at constant intake eventually slows
down because of the exponential nature of radioactive decay as well as the elimination of radionuclides from
the body by metabolism. The typical accumulation of radiocesium in a human body in millisieverts (mSv)
through contaminated food in rural southern Belarus in 1993 is displayed in figure 3.31.
Figure 3.31
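A minimal Python sketch of this accumulation model follows; the daily intake value is taken from the village estimate later in this chapter, and the elimination rate is the adult value quoted above.

import math

def body_burden(q0, lambda_e, n_days):
    a = math.exp(-lambda_e)                       # fraction retained from one day to the next
    return q0 * (1.0 - a ** n_days) / (1.0 - a)   # sum of the geometric progression, in Bq

q0 = 378.0          # daily intake, Bq (village estimate from the case study below)
lambda_e = 0.0063   # effective elimination rate for adults, per day
for n in (30, 180, 365, 1000):
    print(n, round(body_burden(q0, lambda_e, n)))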
An annual dose for the adult population can be calculated from the estimated daily radiocesium intake, the
average adult body weight, and a transfer coefficient from ingested activity to dose (the calculation is carried
out in the section "Error in estimating the internal dose" later in this chapter).
Information on radiocesium concentration in food was collected for several dozen foodstuffs in hundreds of
rural settlements (see table 3.3). The average internal dose can be calculated in each village under survey and
compared with the maximum permissible annual dose of radiation in unrestricted areas, 1 mSv.
Figure 3.32 at left shows the contribution of different radiocesium‐contaminated foodstuffs to internal
exposure for a typical village.
As a rule, in settlements where radioecological stations were established, several measurements were
collected. Each measurement was taken from a different place in the same village or from a large territory in
the forest, but radiocesium food contamination is not distributed uniformly, either geographically or within
different types of food.
Of approximately 4,500 villages in the area, only several hundred were sampled. Estimation of food
contamination and the dose in the unsampled settlements can be done using kriging. Probability maps
indicating where the upper permissible level was exceeded can be used for zoning territory according to the
level of food contamination and the risk of receiving a large dose by eating local food.
Figure 3.32 at right presents the probability that the upper permissible level of 370 Bq/kg was exceeded in
1993 for wild mushrooms in southern Belarus. It uses an arithmetic average of the contamination in
mushrooms collected in 292 locations. Note that there is not sufficient data for accurate interpolation in the
vicinity of the Chernobyl nuclear power plant.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 3.32
Other maps of the probability that the upper permissible level of radiocesium was exceeded in vegetables,
meat, fruits, berries, mushrooms, and milk can be created using kriging as well. In the next section, estimating
uncertainty in calculations with imprecise terms will be discussed. Then the analysis of spatial distribution of
radiocesium food contamination in southern Belarus and food contamination mapping can be better
understood.
ESTIMATION OF UNCERTAINTY IN EXPRESSIONS WITH IMPRECISE TERMS
Any measured quantity z is the sum of two parts, its “true value” and an error Δz,
ztrue = zmeasured ± Δz,
meaning that the true quantity ztrue is between (zmeasured − Δz) and (zmeasured + Δz).
When variables with uncertainty, such as the daily dose qdaily and the adult weight weightadult in the formula
for the internal dose, are used in a mathematical expression and the bounds of their errors Δz (in our example,
Δqdaily and Δweightadult) are known, the formulas for the sum, difference, product, quotient, and a function of
one or several variables can be used to estimate the error resulting from operations with uncertain values.
Below we present a simplified approach to deriving the necessary formulas, following the first reference in the
"Further reading" section. The same formulas can be derived using random variables, as shown at the end of
this section. Although statisticians consider the latter approach simpler, many GIS practitioners find the
approach below more understandable.
If two observations are
z = zmeasured ± Δz and y = ymeasured ± Δy,
then the maximum possible value of their sum is
(z + y)max = (zmeasured + ymeasured) + (Δz + Δy),
and the minimum possible value is
(z + y)min = (zmeasured + ymeasured) − (Δz + Δy),
so the uncertainty of the sum is
Δ(z + y) = Δz + Δy.
The uncertainty of the difference is also (Δz + Δy) because
(z − y)max = (zmeasured − ymeasured) + (Δz + Δy)
and
(z − y)min = (zmeasured − ymeasured) − (Δz + Δy).
Therefore,
Δ(z − y) = Δz + Δy.
When two measured values, z= zmeasured ± Δz and y= ymeasured ± Δy, are being multiplied, it is helpful to rewrite
the expressions for z and y as
z= zmeasured ± Δz = zmeasured (1 ± Δz/zmeasured)
and
y= ymeasured ± Δy = ymeasured (1 ± Δy/ymeasured).
Then the largest possible value of their product is
zmeasured ⋅ ymeasured ⋅ (1 + Δz/|zmeasured|) ⋅ (1 + Δy/|ymeasured|)
= zmeasured ⋅ ymeasured ⋅ (1 + Δz/|zmeasured| + Δy/|ymeasured| + Δz⋅Δy/(|zmeasured|⋅|ymeasured|)).
Because the product of the two small values Δz and Δy is negligible, the last expression can be simplified as
zmeasured ⋅ ymeasured ⋅ (1 + Δz/|zmeasured| + Δy/|ymeasured|).
Comparing this expression with
z= zmeasured ± Δz = zmeasured (1 ± Δz/zmeasured),
the relative uncertainty of the product zmeasured ⋅ ymeasured is
Δ(z ⋅ y)/|zmeasured ⋅ ymeasured| = Δz/|zmeasured| + Δy/|ymeasured|.
A similar calculation for the quotient gives the same expression for its relative uncertainty:
Δ(z / y)/|zmeasured / ymeasured| = Δz/|zmeasured| + Δy/|ymeasured|.
In general, the uncertainty of the algebraic sum
z = z1 + z2 + z3 + … + zn − zn+1 − zn+2 − … − zn+m
is
Δz = Δz1 + Δz2 + Δz3 + … + Δzn + Δzn+1 + Δzn+2 + … + Δzn+m,
and the relative uncertainty of the product and the quotient
z = z1 ⋅ z2 ⋅ z3 ⋅ … ⋅ zn / zn+1 / zn+2 / … / zn+m
is
Δz/|z| = Δz1/|z1,measured| + Δz2/|z2,measured| + … + Δzn/|zn,measured| + … + Δzn+m/|zn+m,measured|.
If one of the terms in the product is known exactly, for example b=A⋅z, where A is a constant without
uncertainty, the estimated uncertainty of the product is
Δb=|A|⋅Δz,
where |A| is the absolute value of A.
In the case of an arbitrary function of two or more variables, f(z1, z2) or f(z1, z2, …, zn), observed with
uncertainties Δz1, Δz2, …, Δzn, the uncertainty of the function is
Δf = |∂f/∂z1|⋅Δz1 + |∂f/∂z2|⋅Δz2 + … + |∂f/∂zn|⋅Δzn,
where |∂f/∂z1| is the absolute value of the partial derivative of f with respect to z1 while treating z2, z3, …, zn as
fixed. This formula is a result of the approximation
f(z + Δz) ≈ f(z) + (df/dz)⋅Δz,
assuming that Δz is much smaller than z.
For example, the uncertainty of the function sinz is |cosz|Δz (Δz must be measured in radians). If the angle z is
measured as 40°±3°≈40°±0.05 rad, then
Δ(sinz) = (cosz)⋅Δz ≈ 0.77⋅0.05 ≈ 0.04, and sin(40°±3°) = 0.64 ± 0.04.
As another example, the uncertainty of f(z) = z^n, where n is a fixed number, is
Δ(z^n) = |n|⋅|z|^(n−1)⋅Δz, or, in relative terms, Δ(z^n)/|z^n| = |n|⋅Δz/|z|.
Calculating the relative uncertainty for f(z) = z^(1/3) results in Δ(z^(1/3))/|z^(1/3)| = (1/3)⋅Δz/|z|.
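A minimal Python sketch of these propagation rules follows; the function names are illustrative, and the general rule is implemented with numerical partial derivatives.

import math

def err_sum(dz, dy):                      # uncertainty of z + y or z - y
    return dz + dy

def err_product(z, dz, y, dy):            # uncertainty of z * y from the sum of relative errors
    rel = dz / abs(z) + dy / abs(y)
    return abs(z * y) * rel

def err_function(f, values, errors, eps=1e-6):
    # Propagate errors through f(z1, ..., zn) using numerical partial derivatives.
    total = 0.0
    for i, (v, dv) in enumerate(zip(values, errors)):
        shifted = list(values)
        shifted[i] = v + eps
        partial = (f(*shifted) - f(*values)) / eps
        total += abs(partial) * dv
    return total

print(err_product(3.0, 0.1, 5.0, 0.2))    # 15 * (0.1/3 + 0.2/5) = 1.1

# The sin example from the text: z = 40 degrees +/- 3 degrees (about 0.05 rad)
z, dz = math.radians(40.0), math.radians(3.0)
print(round(math.sin(z), 2), "+/-", round(err_function(math.sin, [z], [dz]), 2))  # ~0.64 +/- 0.04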
The results of this section were worked out long ago using random variables. The approach is the
following: each measurement is written as the sum of its true value and a zero‐mean random error,
z = ztrue + ε, with variance var(ε) = σz².
Then the rules for random variables are applied, such as var(z + y) = var(z) + var(y) for independent z and y,
and var(A⋅z) = A²⋅var(z) for a constant A. The results of the random variable approach can be found in statistical textbooks.
ERROR IN ESTIMATING THE INTERNAL DOSE
The annual internal dose for the adult population can be estimated using adult weight (an average of both
men and women) and the daily radiocesium intake.
The weight of an adult Belarusian can be reasonably estimated as weightadult = 70 ± 15 kg. Calculation of the
daily intake requires knowledge of the average daily diet and the average radiocesium contamination in each
type of food.
Measurements of food contamination were taken multiple times with 53 villages contributing more than 100
milk samples each. To estimate the uncertainty of radiocesium food contamination, the distribution of
measurements in a moderately contaminated village with a population of 400 is examined below using the
Geostatistical Analyst Histogram tool and Normal Score Transformation dialog box, which is accessible in the
Wizard when simple or disjunctive kriging is used.
Geostatistical Analyst requires that measurements be taken in at least 10 different locations. However, the
exact location of each measurement is unknown because a cow does not eat at a mathematical point but
within a polygon, and this polygon is known only to within several miles. Therefore, the coordinates of
the measurements can be randomly shifted. For example, a random distance within a range of one mile from
the village centroid can be added for each measurement using the following algorithm:
Xrand = x – 1609 +3218*rnd()
Yrand = y – 1609 +3218*rnd(),
where x and y are the village centroid’s coordinates and rnd() is the random number generator in the interval
[0, 1]. The choice of the shifting distance is not important for the univariate data distribution analysis because
data coordinates are not used. It is necessary for running Geostatistical Analyst Wizard.
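A minimal Python sketch of this random shift follows; the centroid coordinates are hypothetical and assumed to be in meters (1609 m is one mile).

import random

def shift_within_one_mile(x, y, rng=random):
    x_rand = x - 1609 + 3218 * rng.random()   # uniform within +/- one mile of the centroid
    y_rand = y - 1609 + 3218 * rng.random()
    return x_rand, y_rand

centroid = (535200.0, 5772400.0)   # hypothetical village centroid, meters
print(shift_within_one_mile(*centroid))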
Figure 3.33 at left shows a histogram of 126 milk measurements made using samples presented by
inhabitants of this village and perhaps nearby ones as well.
The distribution of milk measurements is not symmetrical, and it is not immediately clear how to estimate
either the most probable value or its uncertainty. Using the arithmetic mean is equivalent to assuming that all
milk in the village is carefully mixed and distributed among the population. To protect people, it would have
been better to use the maximum measured value, or, if there had been evidence that some measurements
were incorrect, then the third or fourth largest value. But by using those values, a very large territory would
have been considered unsafe for living. Only very rich countries can afford countermeasures in large
territories. One possible way to find a compromise is to use a value that is larger than the arithmetic average
but that still corresponds to the typical exposure of a significant part of the population.
Figure 3.33 at right shows the milk distribution (red line) obtained with a mixture of several different
Gaussian distributions from the Geostatistical Analyst 9.1 Normal Score Transformation dialog box. In this
dialog box, a number of Gaussian functions can be specified, and the software will find the best parameters of
the distribution’s mean and standard deviation values.
The milk distribution data can be modeled using a mixture of three Gaussian distributions and the graph in
figure 3.33 at right can be interpreted as a mixture of three areas with high cesium‐137 contamination
around which cows have eaten grass. Two kernels describe the majority of the data; the reasonable value for
use as input to dose estimation in this village is the mean of the second kernel, the blue line with a value of
approximately 74 Bq/kg (the corresponding standard deviation is shown as a horizontal green line), which is
larger than the mean value. In this case, people are protected if their cows pasture near the unknown locations of
the first and second largest hot spots of radiocesium.
Figure 3.34 shows the same Normal Score Transformation dialog box in ArcGIS 9.2. In this version of the
software, parameters of the Gaussian functions are listed on the right side. Two kernels with means 21 and 74
Bq/kg describe 89 percent of the data (a sum of the probability values 0.49 and 0.40 in the fourth column P).
The standard deviation 17.18 in the third column (Sigma) can be used as uncertainty of the contamination in
the second hot spot with a mean value of 73.77.
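A minimal Python sketch of fitting such a mixture follows, assuming the scikit‐learn package is available; the simulated array stands in for the 126 milk measurements, so the fitted parameters only illustrate the kind of output described above.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# hypothetical milk measurements (Bq/kg) simulated from three components
milk_bq_per_kg = np.concatenate([rng.normal(21, 8, 60),
                                 rng.normal(74, 17, 50),
                                 rng.normal(150, 30, 16)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=3, random_state=0).fit(milk_bq_per_kg)
for mean, var, weight in zip(gmm.means_.ravel(), gmm.covariances_.ravel(), gmm.weights_):
    print(f"mean={mean:.1f}  sigma={np.sqrt(var):.1f}  p={weight:.2f}")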
The next several figures show histograms and the estimated distributions of radiocesium contamination in
vegetables (3.35), fruits (3.36), and meat (3.37) in the same Belarusian village. The values, which can be used
as representative of each type of food contamination in the village, are shown with blue lines.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 3.35
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 3.37
Sometimes data includes an outlier, as with potato radiocesium contamination in figure 3.38.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 3.38
That very large value contrasts with other measurements of contamination and most probably results from a
typing error. After removing the outlier, the data distribution (figure 3.39) can be used as representative of
the potato contamination in the village.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 3.39
Figure 3.40 shows the distribution of wild berry contamination around the village. One value is greater than
the others, but there are several other large values, and it is possible that there is a radiocesium hot spot in
the nearby forest. Thus, there is no good reason to remove the largest measurement of the berry
contamination from the database.
Figure 3.40
Radiocesium intake for the village can be calculated using the following formula (abbreviations are
described in table 3.4):
qdaily (Bq) = 0.001⋅(MI⋅Csmi + PO⋅Cspo + VE⋅Csve + FR⋅Csfr + BE⋅Csbe + ME⋅Csme),
where the coefficient 0.001 is used because radiocesium contamination is measured in Bq/kg, while daily food
consumption is in grams. Csi is the radiocesium contamination of each food type, represented in the graphs
above by blue lines. The standard deviations of the second Gaussian distribution in the mixture can be used to
describe the uncertainty of Csi.
Using the mathematical expressions discussed in the previous section, the daily intake and its uncertainty for
this village are estimated as
qdaily = 378 ± 160 Bq.
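A minimal Python sketch of this calculation follows. The daily consumption values and most contamination values are hypothetical placeholders (only the milk value of about 74 ± 17 Bq/kg comes from the text), so the output does not reproduce the 378 ± 160 Bq estimate.

consumption_g = {"milk": 700, "potato": 400, "vegetables": 300,     # hypothetical daily diet, grams
                 "fruits": 150, "berries": 40, "meat": 150}
cs_bq_per_kg = {"milk": (74, 17), "potato": (30, 10), "vegetables": (25, 8),
                "fruits": (15, 5), "berries": (300, 90), "meat": (120, 40)}   # (mean, uncertainty)

q_daily = 0.001 * sum(consumption_g[f] * cs_bq_per_kg[f][0] for f in consumption_g)
# consumption is treated as exact, so the absolute uncertainties of the products add up
dq_daily = 0.001 * sum(consumption_g[f] * cs_bq_per_kg[f][1] for f in consumption_g)
print(f"daily intake: {q_daily:.0f} +/- {dq_daily:.0f} Bq")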
Using the annual dose formula with an adult weight of 70 kg and a weight uncertainty of 15 kg, the
annual internal dose in that village can be estimated as
0.013 ± 0.008 millisievert (mSv).
Large uncertainty in the radiocesium food contamination data shows that a lot of variation exists around the
estimated mean value. Figure 3.41 created in ArcGIS 3D Analyst illustrates this. The range of prediction
variation (sticks) over predictions of the radiocesium food contamination (surface) shows that the true
values can differ from the most probable values given by kriging predictions by the height of the sticks. The
sticks go under the surface showing a prediction interval.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 3.41
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 3.42
ASSIGNMENTS
1) INVESTIGATE THE INFLUENCE OF THE X‐ AND Y‐COORDINATE UNCERTAINTIES
ON THE SEMIVARIOGRAM MODEL.
A subsample of Cs‐137 (radiocesium) milk contamination, in Bq/kg, measured in 70 villages of southern
Belarus in 1993 can be found in the assignment 3.1 folder. The locations of the measurements are the village
centroids, shown as circles in figure 3.43. Contours of large villages are in gray, lakes are in blue; roads in
yellow are also shown.
It is reasonable to assume that pastures spread 1 to 5 kilometers around the center of a village.
Change the x‐ and y‐coordinates of the data locations randomly by 1 to 5 kilometers and compare the
semivariogram models for the original and the shifted data locations at short distances. Use a lag size of
about 2,000 meters.
For instance, we can assume that the maximum radius of the spatial shift depends on the size of a village,
because cows from a small village are likely to pasture at a shorter distance from the village center than cows
from a large village. Since the size of a village is proportional to its population, the maximum radius can be
defined by the following algorithm:
If the population is less than 500, the maximum radius is equal to 1,000 meters.
If the population is in the interval (500; 2,000), the maximum radius is equal to population×2.
If the population is greater than 2,000, the maximum radius is equal to 4,000 meters.
The table in figure 3.44 lists the maximum radius of the shift for various village populations. Of course, other
algorithms for the maximum radius can be used as well.
The shifts in the x‐ and y‐coordinates can be simulated from the uniform distribution using the expression
−(maximum radius) + 2 × (maximum radius) × rnd(), as in the sketch that follows.
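A minimal Python sketch of the shifting procedure follows; the centroid coordinates and population are hypothetical, and the coordinates are assumed to be in meters.

import random

def max_radius(population):
    if population < 500:
        return 1000.0
    if population < 2000:
        return population * 2.0
    return 4000.0

def shift_coordinate(value, radius, rng=random):
    # uniform shift in the interval [value - radius, value + radius]
    return value - radius + 2.0 * radius * rng.random()

x, y, population = 428300.0, 5761800.0, 1200    # hypothetical village centroid and population
r = max_radius(population)
print(shift_coordinate(x, r), shift_coordinate(y, r))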
The resulting shifted locations and the village centroids are connected with arrows, as shown in figure 3.44.
Python code is provided with the assignment data to assist in the creation of the shapefile with shifted locations.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 3.44
2) RECALCULATE THE UNCERTAINTY IN ESTIMATING THE INTERNAL DOSE.
The transfer coefficient from food to dose used in the case study in the section “Error in estimating the
internal dose” was assumed to be known exactly. In fact, it is an estimated average from many case studies
made by different researchers under different assumptions. Assume that the transfer coefficient error is
15 percent and recalculate the intake estimation uncertainty.
Data measured around a village and used in this chapter are available in the folder assignment 3.2.
FURTHER READING
Taylor, J. R. 1997. An Introduction to Error Analysis. Sausalito, Calif.: University Science Books.
Rabinovich, S. G. 2000. Measurement Errors and Uncertainties. New York: Springer.
Although these books do not discuss geographic data, they are excellent introductions to measurement error
analysis for geographers.
THE IMPORTANCE OF THE
DISTRIBUTION ASSUMPTION
GAUSSIAN PROCESSES
LOGNORMAL PROCESSES
BERNOULLI, BINOMIAL, AND POISSON PROCESSES
MODELING DATA DISTRIBUTION AS A MIXTURE OF GAUSSIAN DISTRIBUTIONS
USE OF GAMMA DISTRIBUTION FOR MODELING POSITIVE CONTINUOUS DATA
MODELING PROPORTIONS USING BETA DISTRIBUTION
NEGATIVE BINOMIAL DISTRIBUTION
MODELING DATA WITH EXTRA ZEROS
NONPARAMETRIC MODELING
CONFIDENCE INTERVALS AND CHEBYSHEV’S INEQUALITY
SPATIAL POISSON PROCESS
ASSIGNMENTS
1) SIMULATE AND PLOT NORMAL, LOGNORMAL, GAMMA, BINOMIAL, POISSON,
AND NEGATIVE BINOMIAL DISTRIBUTIONS
2) DETECT MULTIMODAL DATA DISTRIBUTIONS
3) FIND A GOLF COURSE WITH THE BEST AIR QUALITY USING KRIGING AND THE
CHEBYSHEV’S INEQUALITY
FURTHER READING
Recently, a spatial regression model was presented at a GIS conference. In the discussion, the
presenter was asked how he dealt with the model’s assumption of data normality because, although
the model assumed a Gaussian data distribution, it was clear that the data were far from being
Gaussian. The presenter’s answer was that he did not care about assumptions, because assumptions
were for statisticians. Geographers just used whatever statistical model they liked. Many people in the
audience applauded that explanation, but if data do not satisfy a model’s assumptions, inference from the
model’s output can be wrong. Moreover, a model with valid assumptions about data distribution will
outperform a model that does not require an assumption about data distribution.
Statistical assumptions are statements about processes. The objective of statistical modeling is to describe a
wealth of data with a few parameters. Fewer parameters result in a simpler model, making it easier to
understand and faster to compute. For example, univariate normal distribution has only two parameters:
mean and standard deviation. The most effective way to reduce the number of values that describe a process
is to prove that the data follow a specific distribution.
A distribution describes the proportion of samples with specific values. For example, we can calculate the
proportions of house sales in the city that fall into price intervals incrementing by $10,000 and approximate
these proportions by a suitable mathematical function. If the number of house sales is very large, this function
gives the probability that a particular house taken at random will have a value in a specific interval.
This chapter discusses several widely used data distributions: Gaussian, lognormal, gamma, beta, Bernoulli,
binomial, Poisson, and negative binomial. Also discussed and illustrated is the special case when data consist
of more zeros than allowed by standard distributions. Figure 4.1 shows the relationships between the data
value types and the probability models discussed in this chapter.
Figure 4.1
The use of various distributions is illustrated using air quality data collected in California, radiocesium soil
contamination and thyroid cancer in children data from Belarus, cancer mortality data from U.S. counties,
precipitation in Sweden, fishery data collected west of the British Isles, milk cows as a proportion of all cattle
and calves in the central north part of the United States, house prices from the city of Nashua, New
Hampshire, and temperature observations made in Western Europe.
We continue with a discussion of how to construct confidence intervals for Gaussian and for unknown
distribution to quantify the uncertainty of the predictions. Air quality data are used to illustrate the theory.
Most of the illustrations in this chapter are based on univariate distributions of the data values. At the end of
this chapter, a basic spatial process that describes spatially independent data locations is presented.
Examples of spatial modeling using normal, lognormal, gamma, Poisson, binomial, and negative binomial
distributions can be found in the following chapters and in the appendices.
GAUSSIAN PROCESSES
When measuring values, we usually assume that although the true value of a measurable quantity is not
known, it exists and is a constant. A true value is an abstraction similar to the mathematical point. When
many measurements are taken, they are often distributed symmetrically around a true value. In scientific
literature, statisticians and mathematicians usually use the term “normal” for this distribution, physicists
mostly call it “Gaussian,” while social scientists sometimes prefer the term “bell curve.”
Figure 4.2 shows a Gaussian distribution with the mean and standard deviation equal to 1 (red line) and the
cumulative area under the red line displayed in green. The probability that the value of 1.5 is exceeded
(p≈0.31) is equal to the proportion of area in yellow to the area under the red curve. Both of these areas can
be easily calculated if the mean and standard deviation of the Gaussian distribution are known.
Figure 4.2
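A minimal Python sketch of this calculation follows, assuming SciPy is available.

from scipy.stats import norm

# probability that 1.5 is exceeded for a Gaussian with mean 1 and standard deviation 1
print(norm.sf(1.5, loc=1.0, scale=1.0))   # ~0.309, the p of about 0.31 quoted above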
ArcGIS Geostatistical Analyst creates probability and quantile maps by assuming that the predictions are
normally distributed at any location with a mean equaling the kriging prediction and the standard deviation
equaling a kriging prediction standard error. The area under the distribution line to the right of the threshold
value is calculated and divided by the area under the entire distribution line. The resulting value is
interpreted as the probability that the specified threshold is exceeded.
But how can we guarantee that predictions and their standard errors are described by Gaussian distribution?
The kriging prediction at an unsampled location s0 is a weighted sum of the measurements Z(si) at locations si:
Ẑ(s0) = λ1⋅Z(s1) + λ2⋅Z(s2) + … + λn⋅Z(sn),
where λi are the kriging weights.
Suppose that the prediction equals T and the prediction standard deviation is σ. Figure 4.3 shows the
probability (the y‐axis) that a measurement is inside the interval between
(T − x⋅σ) and (T + x⋅σ).
Figure 4.3
The measurement is inside one standard deviation interval with a probability of 0.683 as shown by green
lines. Table 4.1 shows the probability values for the standard deviation values between 0 and 3.
Confidence intervals for the kriging prediction in a particular location using table 4.1 can be constructed only
if predictions are normally distributed.
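A minimal Python sketch reproducing the probabilities behind table 4.1 follows, assuming SciPy is available.

from scipy.stats import norm

# probability that a normally distributed value falls within x standard deviations of T
for x in (1.0, 1.96, 2.0, 3.0):
    print(x, round(2 * norm.cdf(x) - 1, 3))   # 0.683, 0.95, 0.954, 0.997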
Figure 4.4 shows a histogram of the one‐hour maximum daily ozone concentration at Lake Tahoe, California,
in 1999. A normal distribution with a mean of 0.053 parts per million (ppm) and a standard deviation of
0.0103, blue line, describes the ozone data quite well.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 4.4
Figure 4.5 shows histograms of maximum one‐hour daily ozone concentration in three other California cities:
San Francisco, Pasadena, and Redlands. Their mean values and standard deviations are different. While the
assumption that data are normal may be appropriate for the first two cities, it is not so for Redlands, where
the distribution of the ozone concentration is more complicated.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 4.5
Redlands data can be described using a combination of several Gaussian distributions with different means
and standard deviations, as shown in figure 4.6. This is because ozone values are seasonal in Redlands: they
are large in summer and relatively small in winter, with intermediate values in autumn and spring. Figure 4.6
(left) uses two Gaussian distributions, while figure 4.6 (right) uses three.
From California Ambient Air Quality Data CD, 1980‐2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 4.6
Figure 4.7 shows the distribution of ozone values for 193 California cities. It is close to a normal distribution
(red curve). However, the distribution of ozone values at each place is different, as it was for the four cities above, and the distribution of ozone values over the entire state in figure 4.7 is not a good approximation for most places in California.
To satisfy the assumptions of the Gaussian process when analyzing two‐dimensional data such as ozone
concentration values in the California locations, a variable at each location in the domain must be normally
distributed with the same mean and the same standard deviation. Therefore, data preprocessing is required
to bring ozone data closer to the Gaussian distribution, even if the data histogram already looks close to bell
shaped.
Often, continuous data that are non‐Gaussian can be successfully transformed to an approximately normal
distribution (see chapter 8), but it is more difficult, if not impossible, to do the same for discrete and
categorical data such as disease rates and binary (indicator) 0/1 variables.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 4.7
LOGNORMAL PROCESSES
Suppose that µ(s) is the mean value of the variable Y(s), where s denotes the location's coordinates. The following equality is always true:

Y(s) = µ(s) + ε(s),

where ε(s) is a random error with zero mean. A common geostatistical assumption is that µ(s) is a deterministic function of the coordinates x and y.
There are natural multiplicative processes, such as spatial interaction, which can be described as the multiplication of a deterministic function of coordinates and a random error:

Z(s) = µ(s)⋅ε(s).

Such an equation may arise if the mean value is not a linear combination of measurements but a nonlinear function of them, such as entropy.
Taking the logarithm of both sides of the multiplicative process gives

ln Z(s) = ln µ(s) + ln ε(s),

so that we have an additive process for the variable Y(s) = ln Z(s). Then an additive kriging model for the log-transformed variable Y(s) can be used. After finishing all estimations and calculations, the prediction and its standard error should be transformed back to the original scale. Although the log transformation is simple, the back transformation of the predicted kriging value to the original scale is more complex:

Ẑsimple(s) = exp{Ŷ(s) + σ²Ŷ(s)/2},

where σ²Ŷ(s) is the predicted kriging variance of the log-transformed data Y(s) at location s. This formula is valid for simple kriging. For ordinary kriging, it becomes more complicated. Ordinary kriging yields kriging coefficients λi and a Lagrange multiplier m, and the back transformation of the predicted ordinary kriging value to the original scale includes the Lagrange multiplier as well:

Ẑordinary(s) = exp{Ŷ(s) + σ²Ŷ(s)/2 − m}.

Ignoring the Lagrange multiplier in the back transformation may result in a seriously overestimated prediction Ẑ(s).
Formulas for the simple and ordinary kriging variances in the original scale are again rather complex:

σ²Ẑ,simple(s) = Ẑ(s)²⋅(exp{σ²Ŷ(s)} − 1),

σ²Ẑ,ordinary(s) = exp{2µ̂Y + σ²Ŷ(s)}⋅[exp{σ²Ŷ(s)} + exp{var(Ŷ(s))} − 2⋅exp{cov(Y(s), Ŷ(s))}],

where µ̂Y is the estimated mean value of the transformed variable Y(s), and var(⋅) and cov(⋅) are the variance and covariance of the expressions in brackets.
In practice, researchers sometimes log transform original data and then model and visualize the results in log
scale. This makes sense since some physical processes are measured in log scale. For example, pH, the
common measure of acidity, is the negative logarithm of concentration of H+ ions. The power of earthquakes
is measured in the logarithm of the amplitude of waves (the Richter scale).
When kriging with logarithmic transformation is used, it is necessary to use the cross‐validation diagnostic to
check for possible unreliable predictions. Figure 4.8 shows a subset of cesium soil contamination data
collected in southern Belarus in 1992 and the histograms of original (top left) and log‐transformed (bottom
left) data. After log transformation, the histogram becomes more symmetrical and lognormal kriging may
produce better predictions than kriging using the original measurement values. However, the cross-validation diagnostic of lognormal ordinary kriging with default parameters indicates that the average root-mean-square standardized prediction error statistic, 0.2719, is far from the desirable value of 1, and that the root-mean-square prediction error, 4.462, is very different from the average standard prediction error, 21.78, whereas they should be similar when the predictions are nearly optimal (see the discussion on cross-validation diagnostics in chapter 6). In this case, more effort should be spent on data preprocessing (detrending and transformation) and on further checking of the cross-validation statistics.
Other transformations can also make data approach a normal distribution. For example, the power
transformation with a parameter of 0.1 (histogram in the right part of figure 4.8) is another candidate for
transforming data to make it close to normal.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 4.8
Figure 4.9 shows a lognormal distribution with parameters µ and σ both equal to 0.5 (red line) and the running area under the red line displayed in green. Lognormal data are always positive, and their distribution has a heavy right tail. The lognormal is a continuous distribution in which the logarithm of a variable has a Gaussian distribution.
Figure 4.9
The probability that a random sample from the lognormal distribution in figure 4.9 will be less than or equal
to 2.0 is approximately 0.65, as shown by blue lines.
The lognormal distribution has two parameters typically denoted as μ and σ2, although they are not the data
mean and variance. If variable Z is lognormally distributed, Z~lognormal(μ, σ2), then log(Z)~normal(μ, σ2)
where now μ and σ2 are the mean and variance of log(Z).
Like the negative binomial and the gamma distributions discussed below, the lognormal distribution has a variance that is related to its mean:

mean = exp(µ + σ²/2),  variance = (exp(σ²) − 1)⋅exp(2µ + σ²) = (exp(σ²) − 1)⋅mean².
Therefore, the successful use of lognormal kriging makes predictions and prediction standard error
dependent on each other, a feature expected in many applications.
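A short R check of these relationships, assuming the figure 4.9 parameters µ = σ = 0.5:
mu <- 0.5; sigma <- 0.5
# Probability that a lognormal value is less than or equal to 2, as in figure 4.9
plnorm(2, meanlog = mu, sdlog = sigma)    # approximately 0.65
pnorm(log(2), mean = mu, sd = sigma)      # the same value, because log(Z) is normal
# Mean-variance relationship of the lognormal distribution
m <- exp(mu + sigma^2/2)
v <- (exp(sigma^2) - 1)*exp(2*mu + sigma^2)
c(mean = m, variance = v, check = (exp(sigma^2) - 1)*m^2)   # variance equals mean^2*(exp(sigma^2) - 1)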
BERNOULLI, BINOMIAL, AND POISSON PROCESSES
A Bernoulli (or binary) variable is a random variable that can take only the value of either 1 (success) or 0
(failure). The trials are independent so that the outcome of one trial has no effect on the outcome of another.
The probability of success is a constant and is usually denoted by p. Because the sum of the probabilities over all possible values should be equal to 1, the probability of a failure is (1 − p). The mean and variance of a Bernoulli random variable are related; they are equal to p and p⋅(1 − p), correspondingly.
One typical application of the Bernoulli distribution is habitat suitability analysis, in which the presence or absence of a particular species of interest in polygons or quadrats is studied. Often several explanatory variables measured in each polygon are collected. A model that predicts species presence or absence using explanatory variables is called logistic regression. Other common binary processes include rain/no rain observations.
Suppose we repeat a Bernoulli experiment n times and count the number of successes. The number of successes is a binomial random variable. The mean and variance of a binomial random variable are equal to n⋅p and n⋅p⋅(1 − p), correspondingly.
Suppose that 35 percent of Geostatistical Analyst users always use the default kriging model. Suppose we
want to know the probability that among 20 randomly selected users, at least 10 use the default model only.
This problem is binomial because
The number of trials (n = 20) is fixed.
The trials are independent (whether one user uses the default model does not determine whether another user uses the defaults).
The trials can be classified into two categories: success (the default geostatistical model is used) or failure (the user is looking for more optimal predictions).
The probabilities (p = 0.35 for the default model usage and (1 − p) = 0.65 for a more flexible choice) remain constant.
Some events do not happen often. For example, the infant death rate in the United States is small, and murders in American cities are rare compared to the total number of crimes.
If the number of trials n is large (usually greater than 10) and the probability of occurrence p is small (usually less than 0.05), then the Poisson distribution with mean and variance equal to n⋅p is a good
approximation to the binomial distribution. To use the binomial distribution, we have to specify both the
number of trials n and the probability of each trial p, whereas the average number of events λ = n⋅p is
sufficient for applying the Poisson distribution.
The essential property of a Poisson process is that its mean is equal to its variance. That is, the variability of
the counts is smaller for experiments with small average counts and larger for experiments with large
average counts. For example, if two soil samples taken close to Chernobyl, Ukraine, after April 1986 gave values of µ1 = 400 and µ2 = 80 radioactive decays per unit time, the estimated uncertainty of this Poisson experiment is the square root of the number of counts. So the numbers of emitted particles are approximately
µ1 ± √µ1 = 400 ± 20
µ2 ± √µ2 = 80 ± 9.
If both n⋅p and n⋅(1 − p) are greater than 5, then a good approximation to the binomial distribution is given by the normal distribution with mean n⋅p and variance n⋅p⋅(1 − p). Historically, this approximation was the first
use of the normal distribution.
Figure 4.10 shows the probabilities, axis y, of observing a number of users who use the default Geostatistical
Analyst model only, axis x. Blue lines are the probabilities given by the binomial distribution with the
probability p = 0.35 and the number of trials n = 20. Red lines are Poisson probabilities when the expected
average number of users equals n⋅p = 7. Normal distribution with the same mean as selected for the Poisson
distribution, n⋅p, with a standard deviation equaling the square root of the mean is shown in green. To
calculate these probabilities, we used the R environment (see appendix 2). Code (in red) used for calculations
is shown in the left part of figure 4.10. The probability that among 20 randomly selected users at least 10 use the default model only is calculated as
sum(dbinom(10:20, n0, p0)).
It is equal to 0.12. The calculated probabilities are printed in the bottom part of the R console in blue.
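The code shown in figure 4.10 is not reproduced here, but a minimal sketch along the same lines (the object names n0 and p0 follow the text; the plotting details are illustrative) could be the following:
n0 <- 20     # number of randomly selected users
p0 <- 0.35   # probability that a user relies on the default kriging model only
x <- 0:n0
binom.prob <- dbinom(x, size = n0, prob = p0)            # binomial probabilities (blue)
pois.prob <- dpois(x, lambda = n0*p0)                    # Poisson approximation with mean 7 (red)
norm.dens <- dnorm(x, mean = n0*p0, sd = sqrt(n0*p0))    # normal curve with the same mean (green)
sum(dbinom(10:20, n0, p0))                               # approximately 0.12
plot(x, binom.prob, type = "h", col = "blue", xlab = "number of users", ylab = "probability")
points(x, pois.prob, col = "red", pch = 16)
lines(x, norm.dens, col = "green", lwd = 2)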
Figure 4.10
The National Atlas of the United States and the U.S. Geological Survey provide 1970–94 cancer mortality data.
Figure 4.11 shows U.S. counties colored according to male lung cancer mortality rates. Columns in the table
show the age‐adjusted rates, LUNM_RATE; the number of male deaths due to lung cancer, LUNM_CNT; the
lower and the upper bounds of the 95‐percent confidence interval on the lung cancer mortality rate for males,
LUNM_LB and LUNM_UB; the expected number of male lung cancer deaths based on U.S. mortality rates,
LUNM_ECNT; and the age‐adjusted number of males in the counties.
From U.S. Geological Survey, 2008, Cancer Mortality in the United States: 1970‐1994. Courtesy of U.S. Geological Survey, Reston, Va.
Figure 4.11
The U.S. standard population distributions are presented in table 4.2. Cancer mortality data shown in figure
4.11 are standardized based on 1970 proportion.
The age-adjusted rate is calculated using the following formula:

age-adjusted rate per 100,000 people = 100,000 × Σ(i=1 to 11) (counti / populationi) × proportioni,

where counti and populationi are the number of deaths and the population in age group i, and proportioni is the population proportion from table 4.2.
The binomial probability distribution and the expected number of deaths for two highlighted counties can be calculated using the R code in figure 4.12. The command sum(dbinom(100:150, n2, p)) calculates the probability, 0.176, that the number of deaths is greater than 100 in the county where 90 deaths were registered; that is, the proportion of the green area under the binomial distribution curve to the whole area.
Figure 4.12
The standard error of the estimated rate per 100,000 people using a binomial variable is

standard error = 100,000 × √( p⋅(1 − p)/population ),  where p = rate/100,000.

For example, the standard error of the estimated rate for the first row in the male lung cancer data table in the ArcMap screen capture in figure 4.11 is 8.55.
Confidence intervals are constructed using the normal approximation to the binomial distribution. This is
justified by comparing binomial and normal distributions for age‐adjusted populations of 4,000 (smallest
value in the dataset), 15,000, and 30,000 in figure 4.13. The normal distribution approximation works well
for populations larger than 15,000, but low bounds for rates can be smaller than zero for smaller populations.
Because negative bounds do not make sense, zero values are used for the lower bounds in such cases.
However, there are only a couple dozen counties with age‐adjusted populations lower than 15,000.
Figure 4.13
A 95-percent confidence interval for the true rate is given by

(rate − 1.96⋅standard error, rate + 1.96⋅standard error).
For the data displayed in the first row of the lung mortality data, the 95‐percent confidence interval for the
true rate is
(50.58 − 1.96⋅8.55, 50.58 + 1.96⋅8.55) = (33.83, 67.33).
Confidence intervals with Poisson variables are constructed assuming that the natural logarithm of the rate, not
the rate itself, follows a normal distribution. In other words, in the Poisson approach, we need the variance of
ln(rate) and not the variance of the rate as in the case of the binomial variables.
The standard error of the logarithm of the estimated rate per 100,000 people using the Poisson variable is

standard error of ln(rate) = 1/√(number of deaths).
Confidence intervals are constructed by assuming the natural logarithms of the rates are normally distributed
and then extrapolating back to the original scale.
A 95-percent confidence interval for the rate is given by

(exp{ln(rate) − 1.96⋅standard error}, exp{ln(rate) + 1.96⋅standard error}).

In the example, a 95-percent confidence interval for the rate is
(exp{ln(50.58) − 1.96⋅0.17}, exp{ln(50.58) + 1.96⋅0.17}) = (36.24, 70.58),
which is close to the confidence interval estimated using the binomial variable.
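A small R sketch reproducing both interval calculations from the rate and standard errors quoted above:
rate <- 50.58      # age-adjusted rate per 100,000 from the first table row
se.rate <- 8.55    # binomial standard error of the rate
se.log <- 0.17     # Poisson-based standard error of ln(rate)
# Binomial (normal-approximation) 95-percent confidence interval
c(rate - 1.96*se.rate, rate + 1.96*se.rate)                 # (33.83, 67.33)
# Poisson (log-scale) 95-percent confidence interval
exp(c(log(rate) - 1.96*se.log, log(rate) + 1.96*se.log))    # approximately (36.2, 70.6)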
The Poisson process is well suited for the analysis of regional data, such as disease rates, where the mean and variance are not constant over the entire domain but vary from region to region according to the size of the region's population. Therefore, it is often reasonable to assume that the data for region i follow a Poisson distribution with a mean and variance of r⋅ni, where ni is the population at risk in region i and r is the underlying risk of disease:
r = (the number of disease events)/(the number of people at risk of having the disease).
Note that for a Poisson random variable, as the mean increases, the variance will increase too. If a regression
model is fitted to the mean of a distribution where the distribution is Poisson, the spread around the
regression line will increase with the mean. Case studies using spatial Poisson regression can be found in
appendices 3 and 4. The logistic regression usage is illustrated in chapter 6 and appendix 4.
MODELING DATA DISTRIBUTION AS A MIXTURE OF GAUSSIAN DISTRIBUTIONS
Real data often have complex distributions. In “Gaussian processes” we mentioned that such data can be modeled
as a mixture of Gaussian distributions. Figure 4.14 shows temperature observations in Western Europe
(circles) and the Geostatistical Analyst Normal Score Transformation dialog with data histogram and its
approximation with five Gaussian distributions (red line), with means (Mu), standard deviations (Sigma), and
probabilities (P) that data are described by a particular Gaussian distribution in the right part of the dialog.
Figure 4.14
If these data are used in geoprocessing and simulation from the data distribution is needed, the following
algorithm does the job.
First, create a variable with probability intervals equal to sequential sums of the probabilities Pi, figure 4.15.
In our case its values are 0.429 (P1), 0.65 (P1+P2), 0.783 (P1+P2+P3), 0.903 (P1+P2+P3+P4), and 1.0
(P1+P2+P3+P4+P5).
Second, simulate a random value u from the uniform distribution on the interval [0, 1].
Third, simulate a value from the Gaussian distribution whose mean and standard deviation correspond to the probability interval that contains u. For example, if u is less than 0.429, simulate from the normal distribution with mean 31.86 and standard deviation 1.28 (yellow box in figure 4.15), and if u is between 0.65 and 0.783, simulate from the normal distribution with mean 20.39 and standard deviation 4.32 (see the gray box in figure 4.15). A short R sketch of this algorithm follows figure 4.16.
Figure 4.15
Figure 4.16
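A minimal R sketch of this simulation algorithm follows. The five component probabilities are read off figure 4.15 as 0.429, 0.221, 0.133, 0.12, and 0.097; only two of the component means and standard deviations (31.86/1.28 and 20.39/4.32) are quoted in the text, so the remaining three pairs are hypothetical placeholders:
p <- c(0.429, 0.221, 0.133, 0.120, 0.097)    # component probabilities P1..P5
mu <- c(31.86, 27.0, 20.39, 14.0, 8.0)       # means; only the 1st and 3rd are from the text
sigma <- c(1.28, 2.0, 4.32, 3.0, 2.5)        # standard deviations; only the 1st and 3rd are from the text
cum.p <- cumsum(p)                           # 0.429, 0.65, 0.783, 0.903, 1.0
simulate.mixture <- function(n) {
  u <- runif(n)                                  # step 2: uniform values on [0, 1]
  k <- findInterval(u, head(cum.p, -1)) + 1      # step 3: component whose interval contains u
  rnorm(n, mean = mu[k], sd = sigma[k])          # simulate from the selected Gaussian
}
hist(simulate.mixture(10000), breaks = 50, main = "Simulated mixture")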
USE OF GAMMA DISTRIBUTION FOR MODELING POSITIVE CONTINUOUS DATA
There are many applications in which variables with continuous positive or non‐negative values are modeled,
including soil, water, and air contamination; house price; amount of rainfall; and proportions (such as the
number of thyroid cancer cases to the total number of people in a region). In these applications, statistical models are usually based on the lognormal or gamma distributions. We already know that the product of independent lognormal random variables has a lognormal distribution. It turns out that, similar to the normal distribution, the sum of independent gamma random variables (with the same scale parameter; see below) has a gamma distribution. Because quantities of interest are often viewed as averages, the gamma distribution is
frequently preferred for modeling continuous positive data values. Note also that statistical software
packages more often provide an option to model gamma random variables than the lognormal ones.
The gamma distribution is defined by two parameters, the shape (α) and the scale (β), through the following probability density function:

f(z) = (z/β)^(α−1) ⋅ exp(−z/β) / (β⋅Γ(α)),

where z is the random variable and Γ(α) is the value of the mathematical function called the gamma function, defined by

Γ(α) = ∫0∞ t^(α−1)⋅e^(−t) dt.
The shape parameter α determines the level of positive skew (figure 4.17, top). α value should be greater than
0. For α < 1, the gamma distribution is strongly skewed to the right. For α greater than 1, the shape of the
distribution alters to approach the y‐axis at the origin. The scale parameter β defines the spread of values,
stretching the distribution with increasing β value (figure 4.17, bottom).
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 4.17
The gamma distribution is commonly used in meteorology to represent variation in precipitation amounts.
For example, one popular approach to model daily precipitation data consists of four stages:
1. Creating two datasets: the first dataset is binary and shows presence (1) and absence (0) of rain. The
second dataset includes all positive precipitation values.
2. Modeling the precipitation presence binary data using logistic regression (see discussion on logistic
regression in chapter 6 and appendix 4).
3. Modeling the amount of precipitation given that it is not zero and assuming that it follows gamma
distribution.
4. Combining the two models from stages 2 and 3 to estimate the expected precipitation using the
following formula
(probability of nonzero precipitation)×(expected amount of precipitation)
An advantage of this approach is that we can model the two processes separately, using different explanatory variables; see also the section below called “Modeling data with extra zeros.” A minimal sketch of this two-part approach in R is given below.
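The following is a minimal R sketch of the four-stage approach, assuming a hypothetical data frame precip with a daily precipitation column mm and two illustrative covariates elev and dist.coast (all of these names are placeholders, not from the text):
precip$wet <- as.numeric(precip$mm > 0)     # stage 1: presence (1) or absence (0) of rain
# Stage 2: logistic regression for the probability of nonzero precipitation
occ.model <- glm(wet ~ elev + dist.coast, data = precip, family = binomial)
# Stage 3: gamma regression for the positive amounts only
amt.model <- glm(mm ~ elev + dist.coast, data = subset(precip, mm > 0), family = Gamma(link = "log"))
# Stage 4: expected precipitation = (probability of nonzero precipitation) x (expected amount)
p.wet <- predict(occ.model, newdata = precip, type = "response")
e.amount <- predict(amt.model, newdata = precip, type = "response")
expected.precip <- p.wet * e.amount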
Figure 4.18
Figure 4.19 at left shows a histogram of thyroid cancer rates in children in Byelorussian districts in 1986–1994 (these data are discussed in detail in appendix 3). The data distribution can be approximated using
lognormal (green), gamma (red), and normal (blue) distributions. By inspecting figure 4.19 (left), we can
conclude that the lognormal and gamma distributions should be preferred over the normal distribution.
Suppose we want to find the relationship between the thyroid cancer rates (variable rate) and two
explanatory variables, the distance to Chernobyl (distance) and the average value of cesium soil contamination in the region (csmean), using the linear non-spatial model
rate = a0 + a1⋅csmean + a2⋅distance,
where a0, a1 and a2 are regression coefficients to be estimated.
The linear model is discussed in appendix 2, and here we present only the main results of the regression analysis using gamma and normal distributions (a sketch of these fits in R follows this list):
Both models found that the distance to Chernobyl and the average value of cesium soil contamination are significant for explaining the thyroid cancer rates in children (note that the spatial modeling of these data presented in appendix 3 leads to a different conclusion).
Gamma regression fits the data better.
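A minimal sketch of the two fits in R, assuming a hypothetical data frame thyroid with columns rate, csmean, and distance (the column names follow the model above; the data frame itself and the log link for the gamma regression are assumptions, since the text does not specify them):
normal.fit <- lm(rate ~ csmean + distance, data = thyroid)
gamma.fit <- glm(rate ~ csmean + distance, data = thyroid, family = Gamma(link = "log"))
summary(normal.fit)            # Gaussian regression: coefficients and significance
summary(gamma.fit)             # gamma regression: predictions are guaranteed to be positive
AIC(normal.fit, gamma.fit)     # one simple way to compare how well the two models fit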
Figure 4.19 at right shows distributions of simulations from the fitted normal and gamma regression models for fixed explanatory variable values: 1 Ci/km2 of cesium soil contamination and 400 miles from Chernobyl. The prediction of the thyroid rate by the gamma regression is 2.5 times larger and its standard deviation is 2 times smaller. More importantly, predictions by the gamma regression are always positive, while nearly half of the predictions made by the normal regression are negative, which contradicts the nature of the rate data.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 4.19
In the case of thyroid cancer rates, data approximation by normal and gamma distributions (figure 4.19 at
left) is very different, and the difference in the modeling results is expected. Figure 4.20 at left shows an
example of data that are approximated very similarly by normal and gamma distributions. The data are the
prices of houses that were sold in the first half of 2006 in part of the city of Nashua, New Hampshire. These
data are analyzed in chapter 12 and appendices 2 and 3.
House prices can be modeled using the following regression model:
house price = a0⋅(square feet) + a1⋅age + a2⋅(number of rooms) + a3⋅(garage availability),
where square feet, age, number of rooms, and garage availability are the explanatory variables.
Just as in the case of the thyroid cancer rates, both gamma and normal regression models found that the
explanatory variables are significant for an explanation of the house prices, and gamma regression better fits
the data.
Figure 4.20 (right) shows distributions of the simulated values from the fitted normal and gamma regressions for fixed explanatory variable values that correspond to a small house without a garage. Although the mean
prediction values are not very different—$155,000 and $130,000—the normal model allows for unreliably
small and even negative house prices (minimum values predicted by the normal and gamma regressions are ‐
$21,500 and $71,900 correspondingly, while the minimum house price value in the dataset is equal to
$100,000).
Courtesy of the City of Nashua, N.H.
Figure 4.20
We conclude that gamma regression is a viable alternative to the normal and lognormal regression models for
analysis of data with positive values.
MODELING PROPORTIONS USING BETA DISTRIBUTION
Beta distribution is appealing for modeling proportions (such as the proportion of infected plants) and
economic indices because it takes values between 0 and 1, has many shapes depending on the parameters,
and contains the uniform distribution as a special case. The mean value of the beta distribution is equal to

µ = α/(α + β),

and the variance is

σ² = αβ / [(α + β)²⋅(α + β + 1)],

where α and β are the parameters.
When the two parameters α and β are equal, the distribution is symmetrical. If α < 1 and β < 1, the probability density function is “U” shaped. If the product (α − 1)(β − 1) is less than 0, the probability density function is “J” shaped. If α is less than β, the distribution is skewed to the right (its longer tail extends toward 1), and if α is greater than β, it is skewed to the left. If α = β = 1, the beta distribution becomes the uniform distribution. As α and β tend to infinity such that α/β stays constant, the beta distribution tends to a normal distribution. Figure 4.21 shows examples of the beta distribution shapes.
The beta distribution can be rescaled to model a variable X that changes from a to b by using the formula
X = a + (b ‐ a)⋅beta (α, β)
where beta (α, β) is beta distribution with parameters α and β.
We illustrate the usage of beta distribution in regression analysis using agriculture‐related data collected by
the U.S. Department of Agriculture (USDA) National Agricultural Statistics Service (NASS). The Web site
http://www.nationalatlas.gov/maplayers.html?openChapters=chpagri#chpagri
provides data about America’s farms and farmers for U.S. counties.
Suppose we want to analyze milk cows as a percentage of all cattle and calves in the north-central part of the United States (see the map in figure 4.22 at left, which shows the milk cow proportion in 656 counties). The distribution of the proportion of milk cows, with a mean of 0.113 and a standard deviation of 0.13, can be approximated by a beta distribution with parameters α=1.813 and β=0.23 (see figure 4.22 at right).
Courtesy of The National Atlas. http://www.nationalatlas.gov.
Figure 4.22
Suppose we want to model the proportion of milk cows using two covariates, the average value of crops
sold per acre for harvested cropland (in dollars), and the percentage of farms operated by families and
individuals. Since the map of the proportion of milk cows in figure 4.22 at left suggests that the data are spatially dependent, we will model this dependence using spatial covariance, as discussed in chapter 12 and appendix 4. A simpler, non-spatial beta regression sketch is shown below.
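As a non-spatial starting point (it ignores the spatial covariance used in the analysis above), a beta regression can be fit with the betareg package; the data frame cows and its column names below are hypothetical:
# install.packages("betareg")   # if the package is not installed
library(betareg)
# Hypothetical data: cows$prop.milk in (0, 1), cows$crops.per.acre, cows$pct.family.farms
beta.fit <- betareg(prop.milk ~ crops.per.acre + pct.family.farms, data = cows)
summary(beta.fit)
fitted.means <- predict(beta.fit, type = "response")   # fitted mean proportions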
Courtesy of The National Atlas. http://www.nationalatlas.gov.
Figure 4.23
The maps of predictions and prediction standard errors are shown in figure 4.24. They are related because
the mean and variance relationship is a feature of the beta distribution.
Courtesy of The National Atlas. http://www.nationalatlas.gov.
Figure 4.24
The predicted distribution of the proportions of the milk cows, given the information on spatial dependency
and the relationships with relevant covariates, makes sense because it summarizes more information than
the proportions of the milk cow data themselves. Spatial data correlation in this example represents
NEGATIVE BINOMIAL DISTRIBUTION
The Poisson process discussed earlier in this chapter is the base model for count data in the applied statistical
literature. Other theoretical processes generate count data with the data mean not equal to the data variance.
Data with variance greater than mean are called overdispersed. Negative binomial distribution with mean µ
and variance µ + µ2/θ is the most popular alternative to the Poisson distribution for overdispersed count data.
The Poisson distribution is a special case of a negative binomial distribution when parameter θ becomes
infinitely large. An overdispersion parameter θ is a measure of deviation from a Poisson distribution.
Overdispersion is a problem in regression analysis because it causes underestimation of the prediction standard errors, so that an explanatory variable may appear to be significant for explaining the response variable when in fact it is not. Overdispersion is a typical feature of many count data processes.
Histograms of four distributions with mean equal to 20 and different parameters θ are shown in figure 4.25.
Figure 4.25
The features of the negative binomial distribution can be illustrated using the example about the probability
of someone having a car accident. In the case of the Poisson distribution, all individuals have an equal chance
of having one, two, or more accidents. In many applications, the variance and the mean of the distributions of
the events are not equal, meaning that more people have a higher (or sometimes, a lower) number of
accidents than the Poisson distribution predicts, as regulated by the parameter θ of the negative binomial
distribution. In other words, the negative binomial distribution describes a situation when everyone starts
out with the same chance to have an accident; however, when an accident occurs, the future probability of
accidents for that individual is changed. For example, after a car accident one is likely to be more cautious on
the road, a psychological effect decreasing the individual’s accident probability.
As another example, consider a set of entertainment places in a large tourism center. If the next arriving
tourist is equally likely to go to any place, the number of people in the entertainment place can be modeled
using binomial distribution. However, if the probability to go to different entertainment places varies
(because they are different), the negative binomial distribution is a better choice for modeling.
The binomial distribution gives the probability of m successes in n trials, whereas the negative binomial distribution gives the probability of the number of trials required to obtain the mth success. For example, if the proportion of individuals with a certain characteristic is known and we sample until we see a particular number of such individuals, then the number of individuals sampled is a negative binomial random variable. A negative binomial experiment has the following properties:
The experiment consists of repeated independent trials.
Each trial can result in two possible outcomes called a success and a failure.
The probability of success is the same on every trial.
The experiment continues until m successes are observed, where m is known in advance.
If the probability of success is p, the mean and variance of the negative binomial number of trials required to obtain m successes are m/p and m⋅(1 − p)/p², correspondingly.
The homogeneous Poisson process assumes that mean and variance are equal to the same constant value for
any part of the data domain. If this assumption is relaxed, the resulting inhomogeneous Poisson process has a mean value λi that varies from place to place. If this variation follows a gamma distribution, the resulting count distribution is the negative binomial. It can be interpreted as a Poisson distribution for each individual,
while the variability between individuals is described by a gamma distribution (so‐called compound Poisson
process). In other words, the negative binomial distribution can be generated as a mixture of Poisson
distributions.
There is also a derivation of the negative binomial distribution using the Poisson and logarithmic
distributions. In this case, the cluster centers are distributed according to the Poisson distribution, and the
number of individuals in a cluster follows the logarithmic distribution.
The negative binomial distribution is sometimes used with continuous data values in environmental and
geological applications instead of gamma distribution because it may describe the data variance more
accurately.
The Poisson, the binomial, and the negative binomial distributions are characterized by the relationships
between their variances and their means. For all three distributions, writing the variance as a function of the
mean µ, yields
variance(µ) = µ + µ²/θ = µ + α⋅µ², where α = 1/θ.
A zero value of α gives the Poisson distribution, a negative one leads to the binomial distribution, and a positive
one leads to the negative binomial distribution. Then the problem of the distribution choice for the observed
count data reduces to the estimation of the parameter α (or θ), assuming that the mean and variance can be
estimated.
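A minimal R sketch of this moment-based choice, using simulated counts (rnbinom with arguments size = θ and mu follows the same mean-variance parameterization as above):
set.seed(1)
y <- rnbinom(1000, size = 2, mu = 20)   # overdispersed counts: variance = mu + mu^2/theta
m <- mean(y); v <- var(y)
alpha.hat <- (v - m)/m^2                # moment estimate of alpha = 1/theta
theta.hat <- 1/alpha.hat                # should be close to the simulating value of 2
c(alpha = alpha.hat, theta = theta.hat)
# alpha near zero suggests the Poisson, negative alpha the binomial,
# and positive alpha the negative binomial distribution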
All distributions define specific expected values of counts for specific values of the distribution parameters.
Figure 4.26 shows the percentage of 0, 1, 2, 3, 5, and 6 counts expected for Poisson (red) and three negative
binomial distributions with mean equal to 1.
Courtesy of The National Atlas. http://www.nationalatlas.gov.
Figure 4.26
If the number of 0, 1, and other counts is substantially different from the expected numbers, distributional
assumptions are violated because of overdispersion. Note that the negative binomial model can also be overdispersed if the variance estimated by the negative binomial regression model is larger than the value of µ + µ²/θ.
The following violations of distributional assumptions are typical: an excess of zeros in the data, an insufficient
number of zeros, data truncation and censoring, data outliers, omission of important explanatory variables in
the regression model, and data correlation.
The reasons overdispersion arises can be illustrated using simulated Poisson data and a Z-score test for overdispersion in the Poisson regression model,

Zi = ((yi − µ̂i)² − yi) / (µ̂i⋅√2),

where yi are the modeled data (response) and µ̂i are the predicted mean values of yi. If the Z-score is larger than 0 with a t-probability less than 0.05 (it is calculated using linear regression on the Zi values; see the discussion on linear regression in appendix 2), the hypothesis that the Poisson distribution is suitable for modeling the data is rejected.
The response variable yi was simulated from the Poisson distribution, with the logarithm of the mean defined as a linear combination of three independent Gaussian variables x1, x2, and x3 (10,000 observations):

ln(µi) = 0.5⋅x1i + 0.25⋅x2i + 0.125⋅x3i.

After fitting the Poisson regression ln(µi) = a1⋅x1i + a2⋅x2i + a3⋅x3i to the simulated data, the Z-score did not show overdispersion, and the estimated regression coefficients a1 = 0.5040, a2 = 0.2441, and a3 = 0.1218 were very close to the true ones.
Next, a model without one covariate, ln(µi) = a2⋅x2i + a3⋅x3i, was fitted. This time the Z-score test indicated overdispersion, and the estimated regression coefficients a2 and a3 changed notably: a2 = 0.5890 and
a3=0.3295. The same result was observed when several data outliers were added to the simulated response
variable yi: overdispersion and inaccurate estimation of the regression coefficients. Overdispersion was also
observed when the response variable yi was made spatially dependent; in this case, the estimated regression
coefficients a1, a2, and a3 were 0.5308, 0.2884, 0.150 instead of 0.5, 0.25, and 0.125.
A conclusion from the simulation experiment is that both uncertainty in the data and the model
misspecification lead to overdispersion, meaning that overdispersion is not necessarily caused by the nature
of the data. Therefore, the problem with overdispersion can be fixed by fixing problems with the data and by
improving the statistical model.
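A minimal R sketch of this kind of experiment; the log-linear mean and the simple Pearson dispersion statistic below are stand-ins for the exact setup and Z-score test used in the text, which are not reproduced here:
set.seed(42)
n <- 10000
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- rpois(n, lambda = exp(0.5*x1 + 0.25*x2 + 0.125*x3))
full <- glm(y ~ x1 + x2 + x3 - 1, family = poisson)    # correctly specified model
reduced <- glm(y ~ x2 + x3 - 1, family = poisson)      # x1 omitted on purpose
# Pearson dispersion statistic: values well above 1 indicate overdispersion
dispersion <- function(fit) sum(residuals(fit, type = "pearson")^2)/df.residual(fit)
c(full = dispersion(full), reduced = dispersion(reduced))
coef(full); coef(reduced)   # the coefficients shift when a covariate is omitted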
Figure 4.27 (left) shows the estimated number of eggs released into the ocean west of the British Isles and
France in 1992 by Atlantic mackerel (these data are discussed and analyzed in chapter 6). There is a large
number of zero egg counts, 265 out of 634. Most of them were observed in the southern part of the data domain. The graph in figure 4.27 (center) shows a scatterplot of the mean and variance of the observed egg density (the number of eggs is estimated as egg count = (density of eggs) × (net area)) calculated in the polygons shown in figure 4.27 at left, together with the fitted model variance = µ + µ²/θ (blue line). The data variance is larger than the data mean, and the estimated value of the parameter θ is equal to 2.16 (the mean value is equal to 38.2).
However, the entire dataset consisting of point data can be approximated using the negative binomial
distribution with a much lower overdispersion parameter θ = 0.178. Histograms of the data (green) and the
negative binomial distribution with θ = 0.178 are shown in the right part of figure 4.27.
So, which negative binomial parameters should be used in the analysis of these data? From the spatial data interpolation point of view, the specification of the negative binomial distribution with θ = 2.16 may be preferable because it better describes the data in most of the data domain.
Mackerel data courtesy of A. Eltink, P. Lucio, J. Mass´e, L. Motos, J. Molloy, J. Nichols, C. Porteiro and M. Walsh.
Figure 4.27
The negative binomial distribution can be used with data from applications other than ecology. Figure 4.28
shows rainfall data on April 29 in Sweden (see the data description at http://www.esri.com/news/
arcnews/fall03articles/analyzing.html) and the negative binomial distribution fitting. In this
case, the mean value of rainfall is 2.15 millimeters, the estimated value of θ using the mean and variance in 19 polygons is equal to 1.2, and the estimated value of θ using the entire dataset is 0.3.
Courtesy of International Sakharov Environmental University, Minsk, Belarus;
Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 4.28
MODELING DATA WITH EXTRA ZEROS
The fish abundance and rainfall data discussed in the previous section have a large number of zeros. This inflated
number violates assumptions of many probability models (Gaussian, lognormal, Poisson, negative binomial,
and others), although the negative binomial can handle extra zeros. For example, a draw from the negative
binomial distribution with estimated parameters for the entire fishery dataset (µ=38.2, θ=0.178) gives 249
zeros out of 634 simulated values (the observed number of zeros is 265). In the case of the rainfall dataset,
the proportions of simulated and observed zeros are 366/710 and 337/710. However, the number of zeros
simulated from the negative binomial distribution with parameters estimated using counts averaged in
polygons is 1 for fishery data and 216 for the rainfall.
Data with extra zeros and the application of models that cope with these data are quite common in ecology,
epidemiology, and econometrics. Typically, a model with extra zeros uses the probability distribution that
underestimates the number of zeros in a dataset and repairs this distribution to account for the extra zeros.
The most popular models with excess zeros are zero-inflated and hurdle models. Both assume that the data distribution is generated by two probability models, say, PM1 and PM2. A zero-inflated model assumes that zeros can come from both PM1 and PM2, while the hurdle model assumes that zeros can come only from PM1; nonzero counts are generated by PM2 after crossing a zero barrier, or hurdle.
The Poisson model with extra zeros is the most common in the literature. The zero-inflated Poisson (ZIP) model is defined for a random variable x by two components:

PM1: x = 0 with probability θ (the extra zeros);
PM2: x follows a Poisson distribution with intensity λ, with probability (1 − θ).
In the zero‐inflated Poisson model, the observed zero counts come from two sources: 1) the Poisson
distribution that is capable of generating zero counts, and 2) the Bernoulli distribution for modeling
presence/absence data. For example, in habitat suitability modeling, rare species might not occur in some
areas of suitable habitat just because they are rare (PM2), and some areas can be unsuitable for the species to
live (PM1).
The expressions for PM1 and PM2 above can be combined using formulas for random variables. The result is the following:

P(x = 0) = θ + (1 − θ)⋅exp(−λ),
P(x = k) = (1 − θ)⋅exp(−λ)⋅λ^k/k!,  k = 1, 2, …,
where θ is a parameter that should be estimated when fitting the model; for example, using logistic regression
discussed in chapter 6 and appendix 4.
The hurdle Poisson model assumes that there are two separable processes. For example, in habitat suitability modeling, the two processes can be colonization and population growth. The hurdle Poisson model is defined by the following formulas:

PM1: P(x = 0) = θ and P(x > 0) = 1 − θ;
PM2: P(x = k | x > 0) = exp(−λ)⋅λ^k / (k!⋅(1 − exp(−λ))),  k = 1, 2, …

The expression for PM2 is called a truncated Poisson distribution (see the end of this section) because zero values are removed from the distribution. The denominator (1 − exp(−λ)) is used to make the sum of the probabilities equal to 1.
The above formulas can be combined into a single formula using formulas for random variables:

P(x = 0) = θ;  P(x = k) = (1 − θ)⋅exp(−λ)⋅λ^k / (k!⋅(1 − exp(−λ))),  k = 1, 2, …
When zero-inflated or hurdle models are used in regression, a typical goal is to develop separate regression models for the parameter θ and the Poisson intensity λ. The modeling algorithm can be the following (a small sketch in R follows this list):
Model the intensity λ using a gamma regression model (to make sure that the intensity is positive).
Estimate a θ value using logistic regression.
Substitute θ into the ZIP or hurdle model.
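A minimal base-R sketch of the ZIP probabilities and of simulation from a ZIP model (the θ and λ values are purely illustrative; fitting them from data with logistic and count regressions, as described above, is not shown):
theta <- 0.4    # probability of a structural (extra) zero
lambda <- 3     # Poisson intensity
k <- 0:10
zip.prob <- ifelse(k == 0, theta + (1 - theta)*dpois(0, lambda), (1 - theta)*dpois(k, lambda))
# Simulation from the ZIP model
n <- 10000
extra.zero <- rbinom(n, 1, theta)                   # 1 = zero generated by PM1
x <- ifelse(extra.zero == 1, 0, rpois(n, lambda))   # otherwise a Poisson count from PM2
mean(x == 0)    # close to theta + (1 - theta)*exp(-lambda)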
Hurdle models are preferable for continuous distributions of non‐negative data so that PM2 is either a
lognormal or gamma distribution (see the example in the earlier section about gamma distribution). For
example, lognormal or gamma distribution can be used to describe the fish abundance in successful catch
while the parameter θ describes trips that catch no fish.
The negative binomial probability model is often sufficient to handle occurrences of zero‐inflation in the data
as it is in the cases of fishery and rainfall data discussed above. Therefore, it makes sense to try the negative
binomial model before using the zero‐inflated or hurdle models. On the other hand, even when the number of
zeros in the negative binomial model is sufficient, zero‐inflated and hurdle models may better describe the
data when different physical and ecological effects are responsible for excess of zeros (parameter θ) and the
process intensity (λ).
The opposite of the zero-inflated situation is modeling count data without zeros. For example, we may
not have a list of every person who uses the Geostatistical Analyst in their research; we can only count those
who have published at least one article in which the software is mentioned. Or, a survey of how often people
are shopping consists of people who are currently at the mall. In both cases, observations with a value of zero
are not included in the dataset. Using standard Poisson or negative binomial regression with this sort of data
may be inappropriate. Zero‐truncated count models are designed for such situations.
Figure 4.29 shows distributions of counts simulated from Poisson and zero-truncated Poisson distributions with intensity λ = µ = 3. The numbers of small counts differ significantly for these two distributions. Therefore, if the mean value of the observed count data is small, the difference between using standard and zero-truncated Poisson models can be substantial.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 4.29
In the case of the negative binomial distribution, the probability of a zero count is (θ/(θ + µ))^θ. This value should be subtracted from 1, and then the remaining probabilities should be rescaled.
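A small R sketch of this rescaling (the parameter values are illustrative; dnbinom with size = θ and mu = µ uses the mean-variance parameterization from the negative binomial section above):
mu <- 3; theta <- 0.5
p0 <- dnbinom(0, size = theta, mu = mu)    # probability of a zero count
c(p0, (theta/(theta + mu))^theta)          # the two expressions agree
k <- 1:15
trunc.prob <- dnbinom(k, size = theta, mu = mu)/(1 - p0)   # zero-truncated probabilities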
There are other situations when standard distributions should be adjusted to the process that generated the
data. One example is a censored model that is different from the truncated one in the interpretation of the
threshold value: the (left‐)censored model assigns the threshold value to the data smaller than the threshold,
while the truncated model eliminates all counts smaller than the threshold. More information on regression
models for count data can be found in the last reference in “Further reading.”
NONPARAMETRIC MODELING
Statistical models that do not specify the data distribution are called nonparametric. Because they rely on a much broader set of model parameters, they can accommodate the true data distribution. The price for this flexibility is that the prediction uncertainty is higher. Moreover, this uncertainty is difficult to interpret because confidence intervals for the model output values are usually constructed assuming a specific data distribution.
It is always a good idea to test data for departures from Gaussian or Poisson distributions before data
modeling. During this testing, we learn about the data.
If, during discrete data exploration, we find that the data distribution looks like a Poisson distribution, the next step is to estimate the intensity of the Poisson process. This intensity need not be constant; it can be defined by another spatial process. In that case, explanatory variables should be used in the model rather than treating the distribution of the data as unknown.
Although derivation of basic kriging models (simple, ordinary, and universal) does not require the normality
assumption, many useful kriging features arise when data and predictions are normally distributed.
Moreover, a stationary Gaussian process is fully described by its mean and covariance functions, whereas an unknown distribution requires additional information. Therefore, proper interpretation of the kriging results is difficult if the data distribution is unknown.
Samples cannot be simulated from an unknown data distribution, but only from a specific model, which is in
practice almost always based on one or another theoretical data distribution or their mixture. For example,
geostatistical conditional simulations usually demand that the modeling process be Gaussian; see chapter 10.
Examples of simulations from other distributions can be found in appendix 3.
Most statistical models available in commercial and freeware software packages are based on the specific
distribution assumptions. A user may not be aware of these assumptions simply because the software
manuals may not discuss this issue.
CONFIDENCE INTERVALS AND CHEBYSHEV’S INEQUALITY
A confidence interval is a range of values that quantifies the uncertainty of the prediction or the parameter
estimation. A 95 percent confidence interval is an interval that is likely to contain the prediction 95 percent of
the time. For a given probability, a narrow confidence interval suggests a relatively precise prediction, while a
wide interval suggests the prediction must be of a more general nature.
Because researchers usually collect their spatial data only once, a 95-percent confidence interval is often constructed by taking the estimated mean value and adding and subtracting 1.96 estimated standard errors. A 90-percent confidence interval adds and subtracts 1.645 standard errors, while a 99-percent confidence interval adds and subtracts 2.576 standard errors, assuming a normal distribution. The values 1.96, 1.645, and 2.576 come from the Gaussian distribution, the use of which should be justified.
For example, kriging predictions tend to be normally distributed because the data surrounding the prediction
location are averaged, but this is not usually so for the kriging prediction errors. Figure 4.30 shows
histograms of the Cs‐137 soil contamination data (left) and predictions and prediction standard errors
estimated on a 40-by-65 grid using ordinary kriging. The input data are approximately lognormally distributed. The predictions are shifted toward a bell-shaped distribution compared to the input data, but the prediction standard errors are not. This is because ordinary kriging prediction standard errors do not depend on the nearby input data values (see the discussions in chapters 2, 8, and 9).
The probability that Z lies inside the interval (µ̂Z − ε⋅σ̂Z, µ̂Z + ε⋅σ̂Z), where µ̂Z and σ̂Z are the mean and standard deviation estimated from the data, is calculated using percentiles of the Student's t distribution. For example, for the construction of a 95-percent confidence interval from N measurements, ε can be found in table 4.3:
A problem is that these calculations are valid for the Gaussian distribution only. However, there is a way to construct a confidence interval for an unknown distribution using the Chebyshev inequality. For a random variable Z with mean µZ and standard deviation σZ, and any constant ε > 0,

P(|Z − µZ| ≥ ε⋅σZ) ≤ 1/ε²,

or, equivalently,

P(|Z − µZ|/σZ ≥ ε) ≤ 1/ε².
In words, the fraction of the data more than ε standard deviations from the mean is at most 1/ε². Therefore, at most a quarter of the data is more than two standard deviations away from the mean, at most 1/9 is more than three standard deviations away, and at most 1/16 is more than four standard deviations away.
The Chebyshev inequality is valid for any distribution of random variables. "At most" means that the fraction will never be more than 1/ε², and usually it is less. For example, if data are normally distributed, only about 1/20 of the data are more than two standard deviations away from the mean, not a quarter as indicated by the Chebyshev inequality.
The interval

(prediction − 1.96⋅standard error, prediction + 1.96⋅standard error)

is not a 95-percent kriging prediction interval unless the prediction errors are normally distributed. However, the Chebyshev inequality guarantees that its coverage will not be less than approximately 74 percent (1 − 1/1.96²).
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 4.31
Suppose we are interested in estimating the probability that the upper permissible level of 0.12 ppm was
exceeded.
Using statistics from the Histogram tool shown in figure 4.31 (left), the z-score for the average ozone concentration in June in California is (0.12 − 0.06)/0.018 = 10/3. The Chebyshev inequality tells us that the fraction of the data more than ε = 10/3 standard deviations from the mean is at most f = 1/(10/3)² = 0.09.
On the other hand, assuming that data are normally distributed, the probability that the ozone concentration
exceeds 3.3 standard deviations is less than 1 percent; see “Gaussian process” above.
Similar calculations of the Chebyshev inequality for the selected monitoring station, using statistics from the Histogram tool shown in figure 4.31 (right), give
z-score = (0.12 − 0.095)/0.028 = 0.89; f = 1/(0.89)² = 1.25.
A confidence interval that is very wide, like this one, indicates that the data sample size is too small or that
conditions of the experiment (of measuring ozone values) were changed significantly. Indeed, weather
conditions at that monitoring station were different in June 1999.
If we assume that the data presented in the histogram in figure 4.31 (right) are normally distributed, then the probability that the ozone concentration may exceed the upper permissible level is about 0.4. In fact, only five out of thirty measurements (the part of the histogram in blue) are larger than 0.12 ppm (that is, 1/6 ≈ 0.17 of the data).
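A small R sketch of the two calculations above, using the summary statistics quoted from the Histogram tool (the side-by-side normal-theory values are an illustration of the difference between the two approaches):
chebyshev.fraction <- function(threshold, m, s) {
  eps <- (threshold - m)/s    # number of standard deviations from the mean to the threshold
  1/eps^2                     # Chebyshev upper bound on the fraction beyond the threshold
}
# Statewide June averages: mean 0.06 ppm, standard deviation 0.018 ppm
chebyshev.fraction(0.12, 0.06, 0.018)       # 0.09, as in the text
1 - pnorm(0.12, mean = 0.06, sd = 0.018)    # under normality, well below 0.01
# Selected monitoring station: mean 0.095 ppm, standard deviation 0.028 ppm
chebyshev.fraction(0.12, 0.095, 0.028)      # 1.25; a bound above 1 is uninformative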
SPATIAL POISSON PROCESS
According to legend, the French mathematician Poisson derived his distribution while studying the pattern of
raindrops at the beginning of a rain. He counted the number of drops per tile and concluded that the drops
were distributed uniformly on each square and that the number of drops on one tile is unrelated to the
number of drops on other tiles.
The number of events in a planar region with area A, described by a homogeneous Poisson distribution with a mean value of λA, is

P{N(A) = n} = (λA)^n ⋅ exp{−λA} / n!,

where P{N(A) = n} is the probability that n points occur in the area A and λ is the mean number of events per unit area (the intensity of the Poisson process).
An event that follows a uniform distribution has the same probability of occurrence for all possible values.
Because the points of a homogeneous Poisson process are independently and randomly located, their x and y
coordinates can be simulated from a uniform distribution. If the uniform distribution is divided into equally
spaced intervals, an equal number of events in each interval is expected. Figure 4.32 shows histograms of 100,
1,000, and 10,000 simulations from a uniform distribution in the interval [0, 1]. When the number of
simulations is small, the distribution of the simulated values does not look “uniform,” meaning that the
distribution of a small data sample is difficult to reconstruct.
Figure 4.32
Figure 4.33 shows 100 points with x and y coordinates simulated from a uniform distribution. These randomly located points correspond to the spatial Poisson process (a more precise definition of the spatial Poisson process is given in chapter 13).
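A minimal R sketch of such a simulation on the unit square (the intensity value is illustrative):
set.seed(7)
lambda <- 100                   # expected number of events per unit area
n <- rpois(1, lambda*1)         # number of events on the unit square [0, 1] x [0, 1]
x <- runif(n); y <- runif(n)    # independent uniform coordinates
plot(x, y, pch = 16, asp = 1, main = "Homogeneous spatial Poisson process")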
Testing observed data locations for spatial randomness is the first step in point data analysis. If the test
reveals that spatial locations are distributed non‐randomly, the next hypothesis to test is whether point
patterns are regular or aggregated. If points are aggregated, the task is to explain why they are.
1) SIMULATE AND PLOT NORMAL, LOGNORMAL, GAMMA, BINOMIAL, POISSON,
AND NEGATIVE BINOMIAL DISTRIBUTIONS.
Use the following R functions to simulate normal, lognormal, gamma, binomial, Poisson, and negative
binomial distributions with various parameters:
rnorm(), rlnorm(), rgamma(), rbinom(), rpois(), rnbinom().
For example, this code produces a graph shown in figure 4.34:
gamma.dist <- rgamma(n = 50000, shape = 2, rate = 6)
hist(gamma.dist, col="pink", freq=FALSE, main="Gamma Distribution")
lines(density(gamma.dist), lwd=4, col="blue")
Figure 4.34
Figure 4.35 shows average one‐hour ozone concentrations in June 1999 in California. Two pairs of locations
from the northern part of the state (where ozone values are low) and from the southern part (where ozone
values are high) are selected, and a histogram of daily measurements shows that data are from two different
distributions.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 4.35
Separate the data from different locations using the Histogram tool and the Normal Score Transformation dialog box of the Geostatistical Analyst wizard, with the ozone data from the folder assignment 4.2.
Suppose you are visiting Esri in the summer and you would like to play golf on the weekend. There are many
public golf courses close by, shown as black flags in figure 4.36. You decide to play at the course closest to
Esri, where the ozone concentration is not higher than 0.1 ppm with the probability of at least 0.7. The
average ozone concentration at nearby monitoring stations is shown by white digits—these are data from
assignment 1 in chapter 3. Use kriging prediction and prediction standard error to estimate the probability 1)
assuming data normality, and 2) without distributional assumption using Chebyshev’s inequality.
How much shorter is your drive to a golf course if you make the stronger assumption of the data normality?
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 4.36
To make predictions at the specified locations, use the geostatistical layer's Prediction option, shown at left in figure 4.37. The output file will include columns with predictions and prediction standard errors (figure 4.37, right).
Figure 4.37
FURTHER READING
1) Martin, T. G., B. A. Wintle, J. R. Rhodes, P. M. Kuhnert, S. A. Field, S. J. Low‐Choy, A. J. Tyre, and H. P.
Possingham. 2005. “Zero Tolerance Ecology: Improving Ecological Inference by Modeling the Source of Zero
Observations.” Ecology Letters 8:1235–1246.
The authors propose a framework for understanding how zero‐inflated datasets originate and deciding how
best to model them. It is demonstrated how failing to account for the source of zero inflation can reduce the
ability to detect relationships in ecological data and at worst lead to incorrect inference.
2) Hilbe, J. M. 2007. Negative Binomial Regression. Cambridge University Press.
This book discusses the negative binomial model and its many variations. There are many examples prepared
using the statistical commercial software package Stata.
METHODS FOR SENSITIVITY AND
UNCERTAINTY ANALYSIS
EXAMPLE OF SENSITIVITY ANALYSIS IN GIS
MONTE CARLO SIMULATION
BAYESIAN BELIEF NETWORK
FUZZY SET THEORY
FUZZY LOGIC
RASTER MAPS COMPARISON
ASSIGNMENTS:
1) USE THE BAYESIAN BELIEF NETWORK FOR RELATING THE RISK OF ASTHMA TO
ENVIRONMENTAL FACTORS
2) CLASSIFY ENVIRONMENTAL VARIABLES USING FUZZY LOGIC
3) USING FUZZY INFERENCE, FIND THE AREAS THAT MOST LIKELY CONTAIN A LARGE
NUMBER OF PEOPLE WITH A HIGH IRRADIATION DOSE
FURTHER READING
Geoprocessing tools, such as buffering, overlay, intersection, and interpolation, allow researchers to
combine and interpret data obtained from different sources. Because it is easy to use geoprocessing
operations within a GIS, one might tend to forget that forming new features alters the input data.
Some geoprocessing operations are self‐explanatory and do not require scientific justification.
Many others need a more thorough understanding of the functions and models that produce new features.
The overall goal of modeling may be to understand how assumptions, parameters, and variation associated with the input data and the working model affect the resulting output data and the conclusions made from them. If
data are not precise or the model is not exact, uncertainty can only increase with each geoprocessing
operation so that tiny perturbations in model input may propagate into gross changes in model output (the
so‐called butterfly effect). These cases require a modeling framework that takes into account the data and
model uncertainty.
In many applications, optimal model parameters are difficult to find because of the large data variability. One
consequence is that it may be difficult to choose a reliable interval around the parameter values. For example,
figure 5.1 at left shows the empirical semivariogram values in black that were calculated from the thirty
simulated data, using the Gaussian distribution with the semivariogram model shown in red. Two estimated
semivariogram models in blue and green are shown in figure 5.1 at right. While different at short distances
(the distances most important for predictions), both estimated semivariogram models look reasonable.
However, the true semivariogram that was used in the simulation (red) has a shape that can easily fall outside
the credible interval around either of the two estimated models. This is illustrated by the green envelope around
one of the estimated models framed in figure 5.1 at right. Statistical spatial interpolation is based on the
semivariogram model, and its misspecification leads to inaccurate prediction.
Figure 5.1
Models are often used for prediction. For example, kriging predicts values at the unsampled locations. Models
are also used in what‐if experiments—for instance, in predicting environmental system response when
varying model parameters and input data. In this context, the goal of prediction is to make a choice, either
between consequences of management decisions or by comparing the models’ prediction performance.
There are two different approaches to decision making with imprecise data and models: sensitivity and
uncertainty analysis. Although closely related in their attempt to represent random processes that are present in
the empirical world but lost in modeling, they have different goals.
Sensitivity analysis assesses how the variation of the model parameters contributes to the total variation in
the output. General scientific principle dictates that replicating an experiment should produce similar results.
Similarly, a good model should not react sensitively to the addition of a small amount of noise to the input.
Sensitivity analysis is used to increase confidence in the model and its predictions by providing information
on how the model responds to changes in the inputs. If the model output is unreliable, then additional efforts
are required to provide more selective information on those inputs that most affect the quality of the model’s
predictions and conclusions. In this chapter, sensitivity analysis is illustrated using data on habitat conditions
for the California gnatcatcher in the San Diego area.
The goal of uncertainty analysis is to provide confidence intervals for inexact measured or estimated values.
Examples of estimated uncertainty include the following:
Kriging produces both the prediction map of attribute values and the map of the prediction
uncertainty.
A ratio of counts of disease to the population under risk (disease rate) is a common type of input data
in epidemiology. If the rates are based on very different population numbers, such as a mix of urban
and rural areas, the uncertainty associated with each rate will also be different. Using statistical
smoothing, each rate is replaced with its smoothed value calculated as a weighted average of
neighboring values. The resulting smoothed rates are more stable and less variable than the original
rates and more appropriate for visual interpretation. Smoothing uncertainty can be estimated and
used in decision making.
Locations of crimes are visualized with point features. They may be clustered simply because there
are a large number of people in the area. Statistical models can describe the variability of points on
various scales, and a map of the density of points provides an estimate of the probability of finding a
crime event at any location. Similarities in density of crime and population over the same study area
can be estimated, and the uncertainty of these estimates can be used in decision making.
Most spatial datasets have only one observation per location, which is not sufficient for estimation of the local
data distribution and its characteristics, such as mean and standard deviation values. To estimate statistical
characteristics of the data at each location, data simulation techniques that preserve statistical data features
can be used. Spatial data are often correlated, so that observations separated by small distances are more alike
than those far apart. Therefore, simulated data values and data locations should be dependent, and methods
developed for simulating data in classical statistics may not be appropriate when used with spatially
correlated data. In this chapter, the use of Monte Carlo simulation in uncertainty analysis is discussed using
California cancer mortality data and soil temperature and moisture.
In the second part of this chapter, two other common approaches to modeling imprecise data and model
uncertainties, Bayesian belief network and fuzzy logic, are discussed and illustrated by examples.
EXAMPLE OF SENSITIVITY ANALYSIS IN GIS
Effective use of complex ecological and environmental models requires knowledge of the sensitivity of their
outcomes to variations in parameters and inputs. Some input parameters come as constants or known
functions that cannot be changed, whereas others are chosen based on existing concepts that can be modified
and improved. Systematic use of sensitivity analysis is necessary even for parts of the model that are well
understood. Responses of local parts of a large model may either have little overall significance or be
crucial. The complex behavior of local parts of the model may also aggregate into simple overall model
behavior.
For example, imagine a fifty‐foot buffer created around a forest stream. Within this zone, logging is
prohibited. Even if the distance of fifty feet were derived from consideration of ecosystem protection,
questions remain, such as, what happens if the distance is changed to forty feet or seventy‐five feet?
A software manual for geoprocessing in ArcGIS explains how to locate regions with favorable habitat
conditions for the California gnatcatcher in the San Diego area. In the manual, the following criteria were
chosen for the potential habitats of the California gnatcatcher.
The impact of roads should be as low as possible. It is assumed that the impact of roads on
gnatcatcher habitat increases proportionately with the road size, and this is modeled using a buffer
of 1,312 feet around interstate freeways and 820 feet around other roads.
The gnatcatcher lives in areas where the primary vegetation is San Diego coastal sage scrub.
Near the coast, vegetation patches should be greater than 25 acres, and in other areas, they should be
greater than 50 acres.
The elevation should be less than 250 meters above sea level.
The slope of the terrain should be less than 40 percent.
These criteria were used to formulate the model for estimating gnatcatcher habitat in figure 5.2.
Figure 5.2
Given input data layers delineating major roads, vegetation types, elevation, and the boundaries of the San
Diego region, figure 5.3 shows the result of the analysis with the fixed parameter values listed above. Roads
are colored brown (freeways) and gray (others), suitable vegetation is shown by the green polygons, and red
polygons indicate regions of potential habitat.
Figure 5.3
To determine the sensitivity of this model to changes in inputs, we can reduce the maximum permissible
slope from 40 percent to 25 percent and 20 percent, increase and decrease the size of the buffers around the
roads, use a more detailed road layer, and include information on human population (cats are a mortal enemy
of these birds, so we may want to select areas that are at least 300‐500 feet from homes). Red polygons in
figure 5.4 show habitat estimated in figure 5.3. Blue polygons show areas closer than 500 feet from the
detailed streets layer that should be excluded from the estimated area if cats are taken into account. After this
operation, the habitat is reduced by 23 percent.
Figure 5.4
Changes in the size of the buffer around roads and the maximum permissible slope also lead to fairly different
habitat areas. Changing the maximum permissible slope from 40 percent to 25 percent reduces the estimated
area of optimal habitat by 24 percent, and the change in slope from 40 percent to 20 percent reduces the
estimated area of optimal habitat by 40 percent.
The gnatcatcher habitat model turns out to be very sensitive to changes in the slope parameter because the
elevation pattern of the area has very few patches of flat ground at the DEM cell size of 115 feet. Since the
effect of the slope parameter on gnatcatcher habitat significantly overrides the effects of all other components of
the model, the slope parameter is a candidate for elimination from the model.
Finally, the gnatcatcher habitat estimation should probably not be based only on geometric relationships
between geographic layers. Biological information, such as bird counts in the area, should be added to make
this model reliable.
In sensitivity analysis, variation is introduced into the model output by altering inputs or model parameters.
Ideally, these changes should be produced by a well‐understood random mechanism so that existing data
variability, rather than artificial data variability, will be tested.
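As a hedged illustration of this one‑at‑a‑time approach, the Python sketch below varies the slope limit and the road buffer and reports the relative change in estimated habitat area. Here estimate_habitat_area is a hypothetical placeholder for the GIS workflow in figure 5.2, not code from the book.

def estimate_habitat_area(max_slope_pct, road_buffer_ft):
    # hypothetical placeholder response, used only to make the sketch runnable;
    # in practice this would run the buffering/overlay workflow of figure 5.2
    return max(0.0, 1000.0 - 12.0 * (40 - max_slope_pct) - 0.2 * (road_buffer_ft - 820))

baseline = estimate_habitat_area(max_slope_pct=40, road_buffer_ft=820)
for slope in (40, 25, 20):                 # vary the slope limit
    for buffer_ft in (500, 820, 1312):     # and the road buffer, one combination at a time
        area = estimate_habitat_area(slope, buffer_ft)
        change_pct = 100.0 * (area - baseline) / baseline
        print(f"slope<={slope}%  buffer={buffer_ft}ft  habitat area change {change_pct:+.1f}%")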
MONTE CARLO SIMULATION
Monte Carlo simulation is a numerical method for solving mathematical problems by modeling random
variables. The method was proposed in 1949 by Metropolis and Ulam. The name of the method is associated
with gambling games of chance. Roulette, die, and coins are the simplest devices for producing random
numbers. Monte Carlo algorithms generate data with specified statistical properties. These data, called
realizations or simulations, are used as input to a transfer function. This function can be simple or
complicated. The resulting distribution of the function’s outputs characterizes the uncertainty of the
modeling result. A schematic illustration of Monte Carlo simulation is shown in figure 5.5. Simulated air
contamination surfaces data_i, data_j, data_k, and data_l, among others, are used as input to a function, which
calculates a summary of each simulation, for example, the average value or the population exposure. The histogram
at right shows the outputs from the function. Instead of using just one number, for example, the value
function(data_i), the researcher can use a variety of reasonable values as input to the next transfer function
or for decision making.
Figure 5.5
The Monte Carlo method can model any process influenced by random factors. Even if the problem is not
associated with randomness, it is sometimes possible to create a probabilistic model that can solve the
problem. For example, the area of a polygon (shown in white in figure 5.6) can be estimated by simulating random points
in a rectangle of known area constructed around the polygon of interest and calculating the proportion of
points that fall inside the polygon. In this figure, the exact proportion of white to light brown areas is equal
to 0.2. Nineteen out of one hundred randomly simulated points are inside the white polygon on the left, and
eighty‐eight out of four hundred random points are inside the white polygon on the right. This gives good
approximations, 0.19 and 0.22, of the area’s true proportion of 0.2.
Figure 5.6
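A minimal Python sketch of this idea is shown below: points are simulated uniformly in a unit bounding rectangle, and the fraction falling inside the polygon estimates its area. The polygon coordinates are made up for illustration.

# Monte Carlo area estimate as in figure 5.6: simulate points uniformly in a bounding
# rectangle of known area and count the fraction that falls inside the polygon.
import random

def point_in_polygon(x, y, poly):
    """Ray-casting test: returns True if (x, y) is inside the polygon `poly`."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

polygon = [(0.1, 0.1), (0.6, 0.2), (0.5, 0.5), (0.2, 0.4)]   # hypothetical polygon
rect_area = 1.0                                              # unit bounding rectangle
n_points = 400
hits = sum(point_in_polygon(random.random(), random.random(), polygon)
           for _ in range(n_points))
print("estimated area:", rect_area * hits / n_points)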
For each uncertain variable or parameter, the possible values can be defined with a probability distribution.
Depending on the type of data available, spatial data properties can be quantified by a density function for
discrete point data, a neighborhood structure depicting relationships among regions with polygonal data, or a
semivariogram model for spatially continuous data.
To test statistical features of the data, a question of interest is translated into a question about a parameter in
a probability model or about a function called a null hypothesis. Monte Carlo simulation testing of a null
hypothesis proceeds as follows:
1. Generate simulated values from the probability distribution of the null hypothesis.
2. Compute a statistic of interest, for example, U. For instance, in regional data analysis, U could be
Moran’s index of spatial data association for each polygon i.
3. Repeat M times. This gives U1, U2, ..., UM.
4. Compare the observed statistic computed from the given data, for example, Uobs, to the distribution of
the simulated Ujs and determine the proportion of simulated Uj values that are greater than Uobs.
If this proportion, called p‐value, is small, this provides evidence against the null hypothesis in the direction
of the particular alternative hypothesis being considered.
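The following Python sketch illustrates steps 1–4 and the p‑value calculation with a deliberately simple statistic (the sample mean) and a Gaussian null distribution; in the Moran’s I example, the statistic and the null simulation would be replaced accordingly. This is an illustrative sketch, not the book’s code.

# Monte Carlo test: simulate M datasets under the null hypothesis, recompute the
# statistic U each time, and report the proportion of simulated values >= the observed one.
import random

def monte_carlo_p_value(observed_data, statistic, simulate_null, m=500):
    u_obs = statistic(observed_data)
    u_sim = [statistic(simulate_null(len(observed_data))) for _ in range(m)]
    return sum(u >= u_obs for u in u_sim) / m

data = [random.gauss(1.0, 1.0) for _ in range(30)]            # made-up observations
p = monte_carlo_p_value(
    data,
    statistic=lambda d: sum(d) / len(d),                      # placeholder statistic (the mean)
    simulate_null=lambda n: [random.gauss(0.0, 1.0) for _ in range(n)],  # null: mean 0
)
print("Monte Carlo p-value:", p)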
Figure 5.7 shows Moran’s index for some cancer mortality rates in California counties, left, and proportion of
Moran’s indexes calculated from 500 simulated Gaussian datasets with the same mean and standard
deviation as the original mortality rates (null hypothesis distribution), right. Large positive Moran’s I values
(hot colors) indicate strong data similarity in the nearby polygons in the north. Counties in blue in the right
part of the figure have a small proportion of simulated Moran’s indexes greater than the observed values shown in
the left part of the figure. In several counties, Moran’s index calculated from simulated data was always less than
Moran’s index calculated from original data. A null hypothesis of the absence of spatial similarity of cancer
mortality rates in the nearby counties should be rejected in the counties with small p‐values (mostly in
Northern California) if the distribution assumption was chosen correctly. A detailed discussion on Moran’s
index can be found in chapters 11 and 16.
From U.S. Geological Survey, 2008, Cancer Mortality in the United States: 1970–1994. Courtesy of U.S. Geological Survey, Reston, Va.
Figure 5.7
A particular variant of Monte Carlo simulation called geostatistical simulation is often used in assessing
uncertainty in functions of continuous data. Suppose the model’s input is a measurement of a hydrologic
parameter and the output is an estimate of groundwater travel time, which is a function of a hydrologic
parameter. Results should be represented as a detailed map of groundwater travel time. Such a map can be
created using kriging. The interpolated values of an input parameter are used to obtain the output at required
locations, but the uncertainty associated with the output is as important as the estimated value itself, and the
uncertainty is typically much more difficult to estimate.
With geostatistical simulation, several plausible grids are generated, each reflecting the same statistical
information on the input parameter, such as the semivariogram model and the data histogram, and the
transfer function is used to compute the output value in each grid cell. In the groundwater example, the result
is not a single prediction of groundwater travel time but an entire distribution of groundwater travel times in
each grid cell whose spread reflects the uncertainty in the hydrologic parameter values. Knowledge of the
output value distribution allows one to analyze worst‐case scenarios using values in the tails of the output
distribution.
To see how geostatistical simulation might be useful in geoprocessing spatial data within a GIS, consider an
agronomist working with a layer of soil temperature (figure 5.8, left), and a layer of soil moisture (figure 5.8,
right). The agronomist wants to objectively determine whether there is a relationship between these two
maps, namely, whether areas with high soil temperature should correspond to areas with low soil moisture.
Figure 5.8
The well‐known measure of the relationship between these two variables is the linear correlation coefficient.
For the soil temperature and soil moisture maps in figure 5.8, it can be computed using the Pearson
correlation coefficient
r = Σ_{i=1}^{n} (temp_i − µ_temp)(moist_i − µ_moist) / ((n − 1) σ_temp σ_moist),
where temp_i and moist_i are the temperature and moisture predicted at grid cell i (prediction is used because
temperature and moisture data are collected at different locations), and µ and σ denote the estimated means and
standard deviations of two variables. The correlation coefficient is equal to −0.53, indicating a moderately
strong negative association between the soil temperature and moisture. However, both temperature and
moisture are measured and predicted with uncertainty, and the agronomist needs an estimation of the error
associated with this estimated correlation coefficient. Therefore, conditional autocorrelated simulations
(discussed in detail in chapter 10) for each of soil temperature and moisture can be used to generate a
distribution of the Pearson correlation coefficient from all pairs of grid cells. The use of auto‐correlated
simulations for soil temperature and moisture makes the distribution of the Pearson correlation coefficient
wider than when each variable has no autocorrelation.
The agronomist produced 500 plausible maps of soil temperature and soil moisture using conditional
geostatistical simulations. For each pair of maps generated, the correlation coefficient was computed, and the
distribution of the resulting 500 correlation coefficients is shown in figure 5.9.
Figure 5.9
From this figure, the correlation between soil temperature and soil moisture could potentially range from
−0.25 to 0.05. Therefore, the estimated correlation coefficient computed from the kriging‐interpolated maps,
−0.53, is likely to be an overestimation of the strength of the relationship between soil temperature and
moisture.
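A hedged sketch of this calculation is shown below: given already simulated temperature and moisture grids (here replaced by toy NumPy arrays), the Pearson correlation is computed for each pair to obtain a distribution like the one in figure 5.9. Real inputs would come from conditional geostatistical simulation, not from the toy generator used here.

# Compute the distribution of the Pearson correlation over pairs of simulated grids.
import numpy as np

def pearson_r(x, y):
    return float(np.corrcoef(np.ravel(x), np.ravel(y))[0, 1])

rng = np.random.default_rng(0)
# toy stand-ins for 500 pairs of simulated 50 x 50 grids (weak negative dependence)
simulated_temp = [rng.normal(size=(50, 50)) for _ in range(500)]
simulated_moist = [-0.3 * t + rng.normal(size=(50, 50)) for t in simulated_temp]

correlations = [pearson_r(t, m) for t, m in zip(simulated_temp, simulated_moist)]
print("correlation range:", round(min(correlations), 2), "to", round(max(correlations), 2))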
Instead of using the correlation coefficient −0.53 to measure the association between the two maps, we could
use the cross‐covariance function that measures this correlation as a function of distance. In this case, rather
than getting a histogram of the correlation coefficient values, we would obtain 500 estimated cross‐
covariance functions, which result in the uncertainty interval as a function of distance (as shown in the green
envelope around the blue cross‐covariance model estimated from the observed data in figure 5.10).
Figure 5.10
One major advantage of Monte Carlo simulation is flexibility. We can simulate data from different
distributions under different assumptions with different choices of a model’s parameters. Monte Carlo
simulation can be used to derive a measure of uncertainty associated with complex functions of spatial data
whose analytical form may be difficult or impossible to derive. Given the complex nature of many of the
geoprocessing operations and the tendency to combine many of them in complex modeling, Monte Carlo
simulation is a valuable method for assessing the uncertainty resulting from geoprocessing operations in a
GIS.
BAYESIAN BELIEF NETWORK
There are situations when the outcome of an event B makes event A probable to some degree. For example, if
we find the probability that a rolled die shows an odd number (event A) and we know that this number is less than
4 (event B), the probability of event A (2/3) is higher than when the additional information on event B is not
available (1/2).
Suppose we observe events A and B, which appear by chance. Every time we observe that B has a particular
value β, we put the corresponding A into a data subset. Then the probability distribution of A in the subset is
written P(A|B), the conditional probability of A given B. If A is conditional on more than one variable, the
probability distribution of A is written P(A|B,C,D,…), where B, C, and D are variables that have particular values β, γ,
and δ.
The aim of Bayesian statistics is to calculate P(A|B) and P(A|B,C,D,…), providing the inductive inferences from
evidence B,C,D,… to hypothesis A.
Suppose there are K possible outcomes and N of them are favorable for event A, so that P(A) = N/K, and M of N
are favorable for event B, so that P(B|A) = M/N. The probability that both A and B occur is P(A & B) = M/K. But
M/K = (M/K)⋅(N/N) = (M/N)⋅(N/K) = P(B|A) P(A). Therefore, the joint distribution of A and B is
P(A & B) = P(B|A) P(A)
A graphic illustration of this relationship is shown in figure 5.11, where P(A) is probability of A, yellow,
relative to the whole space in blue; P(B) is probability of B, green, and probability of both A and B, P(A & B), is
the overlap of yellow and green polygons. To get the conditional probability of B considered only in the cases
where A is known to be true, P(B|A), divide the overlap of the two polygons by the polygon in yellow.
Figure 5.11
Just exchanging letters A and B in the last formula results in
P(B & A) = P(A|B) P(B)
Because P(A & B) = P(B & A), and the right parts of the last two formulas are equal,
P(B|A) P(A) = P(A|B) P(B)
This last formula results in
P(A|B) = P(B|A) P(A) / P(B)
This relationship is Bayes’ theorem for two events A and B. It states that if information on part B of the
probability model is available, P(B), the belief in the other part of the probability model, A, should be updated to the new
probability given P(B) and P(B|A). The components of the formula above are
P(A|B) is known as the posterior probability A given B. The posterior probability expresses the belief
in A after making an observation B.
P(B|A) is known as the likelihood.
P(A) is the prior probability of A. The prior probability expresses the belief in A before making an
observation B.
P(B) is the prior probability of B.
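A small Python sketch of the formula, using the die example from the beginning of this section (A = “the number is odd,” B = “the number is less than 4”):

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
def posterior(prior_a, likelihood_b_given_a, prior_b):
    return likelihood_b_given_a * prior_a / prior_b

p_a = 3 / 6            # P(A): three odd faces out of six
p_b = 3 / 6            # P(B): faces 1, 2, 3
p_b_given_a = 2 / 3    # of the odd faces {1, 3, 5}, two are less than 4
print(posterior(p_a, p_b_given_a, p_b))   # 0.666..., matching P(A|B) = 2/3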
If N possible hypotheses A_1, A_2, …, A_N are involved in the calculation, the formula above can be rewritten as
P(A_i|B) = P(B|A_i) P(A_i) / (P(B|A_1) P(A_1) + … + P(B|A_N) P(A_N))
The idea of a Bayesian belief network is to use Bayes’ theorem to go from the prior probability of a hypothesis
Ai to its posterior probability, using information on the related variables.
The Bayesian approach is useful when there is insufficient information for scientific decision making because
of data scarcity and complexity. This approach can be illustrated using data on the spatial distribution of birds
attracted to chaparral. The purpose of this case study is to create a continuous map of credibility of bird
sightings. Available bird abundance data are imprecise because of locational errors, lack of access to
properties, outdated observations, and human errors (for example, incorrect identification of the bird type).
Each observation of a bird made in the past was assigned a credibility value (in the interval 0–1) by an expert,
and kriging was used to calculate a continuous credibility map of the bird sightings, as shown at the top of
figure 5.12. This credibility map is based on one person’s guess, and many other maps can be created based
on guesses of other experts.
A classified version of this map using three intervals of bird sighting credibility—low, moderate, and high, as
shown at the bottom of figure 5.12 —can be used as a component of the bigger model.
Figure 5.12
It is also known that different wild birds prefer nesting as far as possible from human activity and on
relatively flat ground, so information on human proximity, distance to roads, and slope can be used to define
preferred habitats, as in figure 5.13.
Figure 5.13
Areas close to the preferred vegetation, such as chaparral and riparian, can be constructed and considered as
possible preferred habitat for the bird, as shown in figure 5.14.
Figure 5.14
Using GIS software, all this information can be converted to grids with the same extent, and prior
probabilities of relationships between bird occurrences and classes of each raster cell can be assigned. For
instance, if a grid cell contains chaparral, we can assign the probability of finding a bird there to be 0.9 (such
cells are labeled high in the figure above); if a grid cell does not contain chaparral, assign this probability to be
0.05 (low); if a grid cell does not contain chaparral and it is contiguous to a grid cell that contains chaparral, a
probability of 0.3 (moderate) is assigned to that cell because of possible chaparral location misclassification.
We illustrate the use of simplified raster grids shown above with only two intervals of credibility, low and
high, by defining a graphic model for Bayes’ chain rule that indicates potential dependencies among the
rasterized values. This graphic model demonstrates the concept of conditional independence, for example: if
human factor (A8) is known, then no knowledge of roads (A3) and human proximity (A4) will affect the
probability of adjusted habitat (A9). Statistical literature sometimes calls the members of the chain “parents”
and “offspring/children.” For example, in figure 5.15, A3 is a parent, and A8 is offspring.
Figure 5.15
For the boxes A1, A2, and A7 (chaparral, riparian, and vegetation) in the scheme above, we need to set up the
conditional probabilities. Let A1 denote the chaparral random variable, A2 denote the riparian random
variable, and A7 denote the vegetation random variable. The joint distribution is obtained as P(A1, A2, A7) =
P(A7|A1, A2) P(A1) P(A2), assuming A1 and A2 are independent. The conditional table 5.1 is set up as
Condition                                          Vegetation
Riparian (bird uses)   Chaparral (bird uses)   True (bird uses)   False (bird doesn’t use)
F F 0 1
F T 1 0
T F 1 0
T T 1 0
Table 5.1
Values in the table above are explained as follows. If a bird uses a cell because it is in the high riparian
class but not because it is a high chaparral class, then the bird used it in the high–high vegetation class
(because it was in the high riparian class). Likewise, if a bird uses a cell because it is in the high chaparral
class, but not because it is a high riparian class, then the bird used it (because it was in the high chaparral
class). Thus, we have the following joint probabilities P(A1, A2, A7 ):
P(1, 1, 1) = 0.54
P(0, 1, 1) = 0.06
P(1, 0, 1) = 0.36
P(0, 0, 1) = 0.00
P(1, 1, 0) = 0.00
P(0, 1, 0) = 0.00
P(1, 0, 0) = 0.00
P(0, 0, 0) = 0.04,
which are summed to 1.
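The joint probabilities above can be reproduced with a short Python sketch from the chaparral and riparian priors given below and the conditional table 5.1 (this is an illustration, not code from the book).

# Joint distribution P(A1, A2, A7) = P(A7|A1, A2) * P(A1) * P(A2)
def p_veg_given(chaparral, riparian):
    return 1.0 if (chaparral or riparian) else 0.0          # table 5.1

p_a1 = {1: 0.9, 0: 0.1}   # chaparral prior
p_a2 = {1: 0.6, 0: 0.4}   # riparian prior

joint = {}
for a1 in (0, 1):
    for a2 in (0, 1):
        for a7 in (0, 1):
            p_a7 = p_veg_given(a1, a2) if a7 == 1 else 1.0 - p_veg_given(a1, a2)
            joint[(a1, a2, a7)] = p_a7 * p_a1[a1] * p_a2[a2]

print(joint[(1, 1, 1)], joint[(0, 0, 0)])   # 0.54 and 0.04, as in the text
print(sum(joint.values()))                  # 1.0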
Marginal probability for each variable sums to 1 as well. “Marginal” is another word for totals; the term is
used because totals appear in the margins of contingency tables (see chapter 6):
P(A1 = 1) = 0.9 and P(A1 = 0) = 0.1
P(A2 = 1) = 0.6 and P(A2 = 0) = 0.4
P(A7 = 1) = 0.96 and P(A7 = 0) = 0.04
In general, each variable is described by the conditional probability P(Ai | parents of Ai),
where parents are explanatory variables (if any) such as vegetation, human factor, and slope in the case of
adjusted habitat. In particular, P(A7 = 1) = Σ P(A7 = 1|A1, A2) P(A1) P(A2), with the sum taken over the values of A1 and A2,
because variable A7 (vegetation) has two parents with defined a priori probabilities.
Then we could obtain the probabilities like P(1,1,1,0,0,1,1,1,1,1) for the high‐chaparral, high‐riparian, high‐
roads, low‐human‐proximity, low‐slope, high‐previous sightings, high‐high‐vegetation, high‐low‐human‐
factor, high‐low‐high‐adjusted habitat, and high‐high‐adjusted sightings. Many of the probabilities will be 0,
as they cannot happen (that is, if a bird is present in any of the parents, it must be present in the offspring).
Then the Bayesian belief network calculates a predicted habitat probability for each raster cell, combining the
relationships described in the graph with their prior probabilities defined by an expert. Figure 5.16 shows
one possible result of the predicted habitat estimation, the probabilities P(1,1,1,0,0,1,1,1,1,1). Note that all
input layers, variables A1–A6, influence the resulting map.
Figure 5.16
We could also map other probabilities, such as P(1,1,1,1,1,1,1,1,1,1) for each combination of classes for all 10
variables when they make sense (for example, we cannot have low‐chaparral, low‐riparian, and high
vegetation). There would be 2^6 = 64 possible values of P(1,1,1,1,1,1,1,1,1,1), one for each combination of
classes among A1, A2, A3, A4, A5, and A6.
In the example above, the correlation between neighboring raster cells was not taken into account. However,
a grid cell without chaparral or riparian may have several neighboring cells with vegetation suitable for birds
drawn to chaparral. In this case, confidence in finding that bird in the cell increases. Unfortunately, it is very
difficult to introduce spatial dependency into the model using data in the adjacent cells as additional variables.
In reality, birds and animals are moving, and they have a home range where they can be found. This home
range is a distribution of the usage of space, rather than an area with equal probability of finding animals.
Several statistical models were proposed for modeling the home range distribution. It can also be modeled
using fuzzy set theory, discussed in the following section.
Bayes’ theorem allows the uncertainty of the posterior distribution to be reduced because the prior distribution
and the data come from different sources of information and support each other. This is true if the
likelihood and prior probabilities are consistent. By inconsistency we mean the following. For a given joint
distribution
P(A & B) = P(B|A) P(A),
the probability for the individual distribution P(B) can be found by summing over all possibilities of A. This so‐
called marginal distribution of B is
P(B) = Σ_A P(B|A) P(A).
This must be true after specifying the likelihood and the prior for A. If something different for P(B) is
specified, there may be a conflict.
Experts often disagree on the values of prior probabilities of bird presence as well as other parameters of the
model that, together with imperfect credibility about bird locations, lead to uncertainties in the results. If
more than one value of the prior probability is suggested for a parameter by experts, all these parameters can
be used as input to sensitivity analysis of the model. This can help to determine which parts of the model
most influence the predicted habitat and which influence it the least.
Bayesian data modeling can account not only for measurement and modeling uncertainty, but also for
uncertainty in sampling design and in initial and boundary conditions (see reference 5 in “Further reading”).
FUZZY SET THEORY
Fuzzy set theory is an example of nonprobabilistic mathematical theory that can analyze data and model
uncertainties. In mathematics, a collection of objects is called a set. A fundamental difference between fuzzy
set theory and traditional logic (also called Boolean after George Boole, a nineteenth‐century English
mathematician) is that, in the former, observations can partially belong to predefined sets, whereas in the
latter, membership in the set is an all‐or‐nothing proposition. Fuzzy set operations are grounded on a solid
theoretical foundation. They deal with imprecise quantities in a well‐defined, precise way. Just as Boolean set
theory forms a useful mathematical discipline with a variety of important applications, so does fuzzy set
theory. The traditional geoprocessing operations, such as intersection and union, can have fuzzy counterparts
that reflect the uncertainty in attribute values or locations.
GIS spatial objects—points, lines, and polygons—are defined by their location, topology, and attributes. The
topology is defined by the logical relationships between the positions of the objects. For example, in figure
5.17, a topographic map of Whiskeytown‐Shasta‐Trinity National Recreation Area in California, we can see
the shape of the park, the shape of the object within it, Whiskeytown Lake, and the shapes of the islands
within the lake.
Figure 5.17
In a topological map in figure 5.18, the precise shape of the objects (islands) is not important. It is only
important that the lake object is entirely contained inside the park object and the lake's islands are
completely within the lake.
Figure 5.18
Topology is critical for analyzing the relationships between objects. If the topology of a set of data is wrong,
the GIS cannot efficiently analyze how objects relate to each other. Topological relationships follow classical
set theory, in which zero (0) represents nonmembership and 1 represents membership. Therefore, a lake
either belongs to a park or it does not, and there is no need to broaden the concept of membership.
Farm Spatial statistics data courtesy of Department of Crop Sciences at the University of Illinois at Urbana–Champaign
Figure 5.19
Soil type is defined by a combination of a definite range in a particular property such as acidity, degree of
slope, texture, structure, land‐use capability, degree of erosion, and drainage. Table 5.2 below shows a small
part of the classification rules that define a soil type based on the soil texture.
Soil type Soil material that contains
Sandy clay loam 20% to 35% clay, less than 28% silt, and 45% or more sand
Clay loam 27% to 40% clay and 20 to 45% sand
Silty clay loam 27% to 40% clay and less than 20% sand
Sandy clay 35% or more clay and 45% or more sand
Silty clay 40% or more clay and 40% or more silt
Clay 40% or more clay, less than 45% sand, and less than 40% silt
Heavy clay more than 60% clay
Farm Spatial statistics data courtesy of Department of Crop Sciences at the University of Illinois at Urbana–Champaign
Table 5.2
As with the topological example at the beginning of this section, an object’s soil texture is either a member or
a nonmember of a given soil type set. For example, if a sample consists of 61 percent clay, it is a heavy clay soil type,
but if it contains 59 percent clay, it is not considered heavy. But the soil type boundaries are not changing
abruptly, and a more flexible approach to soil classification may be desirable.
The soil type dataset shown above includes average expected yield values for each polygon, shown in digits in
figure 5.20. These values were assigned to the cell centers of the overlaid grid, shown as white‐ to gray‐
colored circles. Then spatial statistical interpolation, kriging, was used to create a continuous surface of the
expected yields. The predicted polygons do not coincide with the original ones, and they give a rough picture of
possibly more flexible soil type classifications.
Next, an indicator‐like variable Z was defined at each grid cell:
if soil type is Chenoa Silt Loam, Z = 1 − 0.1⋅(random value in the interval 0–1);
else, Z = 0.1⋅(random value in the interval 0–1).
The resulting surface, shown in figure 5.21, can be interpreted as the probability (since predictions are in the
range 0–1) that the soil type belongs to Chenoa Silt Loam.
Farm Spatial statistics data courtesy of Department of Crop Sciences at the University of Illinois at Urbana–Champaign.
Figure 5.21
A probability map in figure 5.21 should not be treated very seriously because of arbitrary input data
construction and because optimum interpolation of indicator‐like variables is a difficult task (see discussion
on indicator kriging in chapters 6 and 9). The purpose of this map is to demonstrate that uncertainty in the
indicator variable results in vague data classification.
Figure 5.22
Figure 5.23 shows another example of a membership function, for slope classification defined rather loosely
with the linguistic terms flat, gentle, average, and steep. In this example, “gentle” lies between “flat” and
“average,” with some overlap.
Figure 5.23
Figure 5.24
A membership function can be multivariate. For example, a soil texture membership function may depend on the silt, clay, and
sand percentages:
µ(clay, sand, silt) = 100 / (1 + 2(clay − 35)² + (sand − 45)² + 3(silt − 40)²)
Soil texture membership functions using the formula above are shown in figure 5.25 for clay and sand (the x‑
and y‑axes) changing from 0 to 100 percent, with silt equal to 50 percent (left), 70 percent (center), and 90
percent (right).
Figure 5.25
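A minimal Python sketch evaluating the soil texture membership function above on a clay–sand grid for the three silt values plotted in figure 5.25 (illustrative only):

# Evaluate the soil-texture membership function on a 0-100 percent clay-sand grid.
import numpy as np

def soil_texture_membership(clay, sand, silt):
    return 100.0 / (1.0 + 2.0 * (clay - 35.0) ** 2
                        + (sand - 45.0) ** 2
                        + 3.0 * (silt - 40.0) ** 2)

clay, sand = np.meshgrid(np.linspace(0, 100, 101), np.linspace(0, 100, 101))
for silt in (50.0, 70.0, 90.0):                  # the three panels in figure 5.25
    mu = soil_texture_membership(clay, sand, silt)
    print(f"silt={silt:.0f}%  max membership={mu.max():.3f} (at clay=35, sand=45)")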
A more general form of this membership function is
µ(clay, sand, silt) = ρ / (1 + w_a(clay − a)² + w_b(sand − b)² + w_c(silt − c)²),
where a + b + c = 100, the weights w_a, w_b, and w_c are greater than zero, and ρ is between zero and one. This membership
function attains ρ when clay = a, sand = b, and silt = c.
An important property of fuzzy logic is that membership values need not sum to 1. If only one of two events A and B
may happen, rules based on classical logic assign a probability of, say, 80 percent that event A will occur and
20 percent that event B will be realized. The probabilities are constructed so that they sum to 100 percent. A
fuzzy approach might produce results of 70 percent and 50 percent, respectively. Not enforcing a constraint
on the sum of probabilities can be advantageous because absolute numbers can have clearer meaning. Note
also that there are no constraints on the choice of fuzzy membership values which are chosen based on
subjective judgment. In particular, membership values may change non‐monotonically in contrast to the
figures above.
A fuzzy expert system is one that uses a collection of fuzzy membership functions and rules to reason about
data. Each Boolean operation has a fuzzy counterpart, for example:
Complement (or negation): µ_not A(x) = 1 − µ_A(x)
Intersection (or conjunction): µ_A∩B(x) = min(µ_A(x), µ_B(x))
Union (or disjunction): µ_A∪B(x) = max(µ_A(x), µ_B(x))
Table 5.3
Figure 5.26
Assignment 1 in this chapter proposes to use the Bayesian belief network for relating the risk of asthma to
several rasterized environmental factors including ozone and particulate matter concentrations and
temperature. These factors are classified into three categories: low, moderate, and high. Figure 5.27 shows
polygons where all three factors are moderate (blue).
Figure 5.27
Figure 5.28 shows membership values for the ozone data using a constant equal to 5000 and thresholds equal
to 0.085 and 0.1. Light blue points are membership values for ozone concentration in the polygons shown
above.
Figure 5.28
Membership functions for particulate matter and temperature are shown in figure 5.29.
The fuzzy intersection is defined as the minimum of the contributing membership values (here, those for
ozone, particulate matter, and temperature).
The resulting fuzzy classification of the moderate value for all three environmental variables is shown
in figure 5.30.
Figure 5.30
Among other fuzzy operators used for analysis of geographical information are the fuzzy algebraic product, the fuzzy
algebraic sum, and the fuzzy gamma operator.
The fuzzy algebraic product multiplies the contributing membership values, µ = µ_1⋅µ_2⋅…⋅µ_n. The resulting fuzzy membership values are smaller than or equal to the smallest
contributing membership value. It is important that all the contributing membership values have an effect on
the result, unlike the fuzzy intersection and fuzzy union operators.
The fuzzy algebraic sum is defined as µ = 1 − (1 − µ_1)⋅(1 − µ_2)⋅…⋅(1 − µ_n), so that the resulting fuzzy membership value is larger than or equal to the largest contributing fuzzy membership
value. In this case, each piece of evidence that favors a hypothesis reinforces the others.
The fuzzy gamma operator is defined as the combination of the fuzzy algebraic product and the fuzzy
algebraic sum by the expression µ = (fuzzy algebraic sum)^γ ⋅ (fuzzy algebraic product)^(1−γ), where γ is a constant in the range (0, 1).
For γ not equal to 0 or 1, the resulting fuzzy membership value lies between the fuzzy algebraic
product and the fuzzy algebraic sum.
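The Python sketch below collects these operators in one place for a single cell’s membership values; it is an illustration under the definitions given above, not code from any particular GIS package.

# Fuzzy overlay operators applied to a list of membership values for one cell.
import math

def fuzzy_and(ms):        # fuzzy intersection: minimum membership
    return min(ms)

def fuzzy_or(ms):         # fuzzy union: maximum membership
    return max(ms)

def fuzzy_product(ms):    # fuzzy algebraic product: <= smallest contributing value
    return math.prod(ms)

def fuzzy_sum(ms):        # fuzzy algebraic sum: >= largest contributing value
    return 1.0 - math.prod(1.0 - m for m in ms)

def fuzzy_gamma(ms, gamma=0.9):   # compromise between product and sum
    return fuzzy_sum(ms) ** gamma * fuzzy_product(ms) ** (1.0 - gamma)

memberships = [0.7, 0.5, 0.9]     # e.g., ozone, particulate matter, temperature
for f in (fuzzy_and, fuzzy_or, fuzzy_product, fuzzy_sum, fuzzy_gamma):
    print(f.__name__, round(f(memberships), 3))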
FUZZY LOGIC
The logical operations of complement, containment, intersection, and union discussed in the previous section
are the foundation for a fuzzy expert system. A fuzzy expert system is based on fuzzy propositions (rules).
Logical reasoning is the process of combining given propositions into other propositions repeatedly.
Propositions are derived from the following operations:
Conjunction, where we assert the simultaneous truth of two separate propositions A and B.
Disjunction, where we assert the truth of either or both of two separate propositions.
Implication, which usually takes the form of an if–then rule. The if part of an implication is called the
antecedent, whereas the then part is called the consequent.
Negation, where a new proposition can be obtained from a given one by prefixing the clause “it is
false that . . .”
Equivalence, which means that both A and B are either true or false.
The most fundamental inference rules to infer a conclusion from a premise are Modus Ponens and Modus
Tollens, shown in table 5.4 below.
Modus Ponens Modus Tollens
Rule: if x is A, then y is B Rule: if x is A, then y is B
Premise: x is A Premise: y is not B
Conclusion: y is B Conclusion: x is not A
Table 5.4
In fuzzy logic, the premise and conclusion are not simple yes‐no decisions, since they are fuzzy sets. The rule
“if x is A, then y is B” has a membership function that measures the degree of truth of the
implication relationship between x and y. The question is how to modify this membership function to
accommodate the premise. The two most popular methods of inference in fuzzy logic are the correlation‐
minimum and correlation‐product methods:
correlation‐minimum: the conclusion membership is the rule membership clipped at the level of the premise membership, min(premise membership, rule membership);
correlation‐product: the conclusion membership is the rule membership scaled by the premise membership, (premise membership)⋅(rule membership).
Figure 5.31
The correlation‐minimum method has a drawback of ignoring all information about the premise membership
when that function exceeds the rule membership. The correlation‐product method is a scaled version of the
rule membership, and it preserves all information about premise membership function.
The result of a sequence of fuzzy operations may appear complicated, as in figure 5.32. Selecting a single
representative point from the domain of a fuzzy membership function is called defuzzification. As a rule, the
representative point is the one at which the function would balance if it were a physical object (see the formula in
figure 5.32). Other approaches to finding a centroid of the polygonal object can be used if a researcher is not
satisfied with this one.
Figure 5.32
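A minimal Python sketch of the two inference methods and of centroid defuzzification on a discretized output domain is shown below; the triangular rule membership and the premise degree of 0.6 are made‑up illustrative values.

# Correlation-minimum and correlation-product inference, plus centroid defuzzification.
def correlation_minimum(premise_degree, rule_membership):
    return [min(premise_degree, m) for m in rule_membership]    # clip at the premise

def correlation_product(premise_degree, rule_membership):
    return [premise_degree * m for m in rule_membership]        # scale by the premise

def defuzzify_centroid(y_values, memberships):
    """Balance point of the output membership function."""
    total = sum(memberships)
    return sum(y * m for y, m in zip(y_values, memberships)) / total if total else None

y = [i / 10 for i in range(11)]                          # output domain 0.0 .. 1.0
rule = [max(0.0, 1.0 - abs(v - 0.7) / 0.3) for v in y]   # triangular rule membership
clipped = correlation_minimum(0.6, rule)
scaled = correlation_product(0.6, rule)
print(defuzzify_centroid(y, clipped), defuzzify_centroid(y, scaled))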
A fuzzy map is appropriate for representing complex data relationships because it enables the map units
(pixels or polygons) to have multiple memberships in, for example, the soil type classes. The use of fuzzy sets,
continuous membership functions, and fuzzy logic gives more flexibility in modeling with geoprocessing
operations than the traditional formalism of Boolean logic. There are fuzzy versions of many mathematical
and statistical functions and models including fuzzy kriging.
RASTER MAPS COMPARISON
Comparison of raster maps is an important task in remote sensing, landscape ecology, forest inventory,
medical image analysis, and other applications. Reasons for this comparison include comparing numerical
models with a reference map and detection of temporal changes, such as clusters of unusual values.
Everyone can compare two or more images and decide which are more similar (say, whether a baby looks
more like the mother or the father), but such judgments are inconsistent among researchers (and parents). Therefore,
methods that imitate the human recognition system are required. In this section, we present a map
comparison approach based on fuzzy set theory because it is intuitively appealing and because freeware
software is available (see reference 6 in “Further Reading”).
Cell‐by‐cell matching is the simplest method of raster map comparison. Figure 5.33 shows when this
approach is totally unsatisfactory by comparing two chessboards, one with a white and one with a black lower left corner.
The cell‐by‐cell method reports no similarity between the images because it finds white cells where black ones are
expected and vice versa. However, the two images are the same; one of them is simply flipped about the horizontal
axis.
Figure 5.33
Figure 5.34
The difference between cell‐by‐cell and fuzzy set comparison can be illustrated using the popular game “find
the differences between the two drawings,” such as the rasters in figure 5.35. Here, the goal is to find nine
ways in which the drawings differ from each other.
Figure 5.35
The right part of figure 5.36 shows fuzzy agreement between maps with the red color used for the largest
disagreement. We see that the differences between the two drawings are recognizable.
Figure 5.36
The cell‐by‐cell comparison method does not capture the similarity of patterns. Pattern similarity is
important, in particular, when the outputs of spatial models are compared, because predictive models are
usually not expected to be accurate at a very fine scale, but they should be similar at the regional scale.
Figure 5.37 shows interpolated maps of logarithm of cesium‐137 soil contamination produced by the same
kriging model using data in the recommended coordinate projection for Belarus, Gauss‐Kruger, left, and
another projection, right.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 5.37
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus
Figure 5.38
Figure 5.39 shows the arithmetical difference, at left, and the fuzzy agreement, at right, between the maps in figure
5.38. The cell‐by‐cell method finds sharp edges between predicted values everywhere, which is
rather surprising because the maps in figure 5.38 look quite similar. From the cell‐by‐cell map comparison
method, we would conclude that the influence of the data projection is large for both small and large values of cesium‐
137 soil contamination.
A fuzzy map comparison method, as shown in figure 5.39 at right, tries to emulate the human reasoning
approach of identifying a hierarchy of map similarities. The preliminary step is data preprocessing, in which
the input raster is converted into a set of polygons by grouping cells with identical values. Then the two maps
that now consist of polygons are overlaid, and fuzzy areal intersections are calculated using the following
interpretation rules:
Scaling value Description
Very low Definite differences. Areal intersection is very low.
Low Differences very likely. Areal intersection is low.
Medium Possible differences. Areal intersections and complements are similar.
High Differences very unlikely. Areal agreement is high.
Very high Polygons are nearly identical. Areal agreement close to perfect.
The membership function for areal intersection is defined similarly to the loosely defined linguistic terms used
for slope classification in figure 5.23, so that the membership functions overlap.
To account for the areal intersection uncertainty due to the different number of cells in the polygons,
additional membership functions shown in figure 5.40 are used to guarantee that different weights are
assigned to small and large polygons.
Figure 5.40
Next, a collection of if–then statements for decision making about data classification is defined, for example,
if (area of intersection is low) and (area of complements is high) and (number of cells is large), then
(local matching is poor),
where “and” is interpreted as the fuzzy intersection operator.
The final step is defuzzification when the output membership function produced by the fuzzy inference
algorithm is transformed into a value for mapping. Interested readers can find the details of the method in
reference 7 in “Further reading.”
A fuzzy agreement map in the right part of figure 5.39 reflects cesium‐137 soil contamination patterns. Each
pattern has a certain similarity value for the two prediction maps. Most of the spots of large dissimilarity, in violet,
are located near the border of the country, where sampling density is low and, therefore, kriging prediction
errors are relatively large. There is also relatively large disagreement between the two maps in the areas with
low contamination. Overall, the effect of the data projection transformation is unexpectedly large. Therefore,
data re‐projection may be a significant component of the map uncertainty, and we recommend converting all
data to the same projection before interpolation and map comparison.
A global fuzzy index of the map similarity, summarizing the local fuzzy agreement values over the whole map, can also be computed (see reference 7 in “Further reading”).
ASSIGNMENTS
1) USE THE BAYESIAN BELIEF NETWORK FOR RELATING THE RISK OF ASTHMA TO
ENVIRONMENTAL FACTORS
Use the Bayesian belief network for relating the risk of asthma to five factors that might influence this risk in
California: ozone concentration, particulate matter concentration, temperature, elevation, and latitude. These
factors can be estimated and displayed as raster maps clipped by the California border (figure 5.41).
The polygonal data using annual averages for pollution and temperature are provided in the folder
assignment 5.1. Since ozone, particulate matter, and temperature change every day, daily
measurements should be used when running the Bayesian network in an attempt to predict bad days for
asthmatics. Even with the use of daily data, each value in a polygon is uncertain because of measurement and
averaging errors.
Figure 5.41
Figure 5.42
The prior probabilities for variables A6–A8 can be defined based on expert knowledge or, for simplicity, fixed in advance.
Using this simplification for the prior probabilities, calculate the result of the Bayesian network estimation of
asthma risk (the prior probabilities for variables A1–A5 can be adjusted as well).
The resulting map may be similar to the one shown in figure 5.43 (note that ordinary kriging was used to
produce smooth interpolation from the Bayesian network output).
Figure 5.43
According to the California Breathing Project, http://www.californiabreathing.org/, there were
3.4 million Californians with asthma, 40,000 asthma hospitalizations, and 543 deaths from asthma in 2001.
Therefore, it is possible to compare the estimated risk of asthma with the density of the observed number of
hospitalizations using information from the above‐mentioned Web site. The availability of accurate disease
data provides an interesting opportunity to calibrate the asthma risk model by sequentially changing the
prior probabilities and minimizing the difference between model output and the observed data.
2) CLASSIFY ENVIRONMENTAL VARIABLES USING FUZZY LOGIC
Use data from the first assignment and compare classification using Boolean and fuzzy logic for the following
combinations of classes:
Classes for the input variables can be defined using the quantile classification algorithm.
One way to create fuzzy sets is using the Spatial Analyst raster calculator. Figure 5.44 shows an example
of a fuzzy favorable-aspect calculation. In this example, it is assumed that the aspect is favorable when the
terrain is oriented toward the south:
Figure 5.44
Alternatively, use the fuzzy modeling environment implemented in the MapModels software (see reference 8
in “Further reading”).
3) USING FUZZY INFERENCE, FIND THE AREAS THAT MOST LIKELY CONTAIN A
LARGE NUMBER OF PEOPLE WITH A HIGH IRRADIATION DOSE.
Use the following rasters provided in the folder assignment 5.3 and shown in figure 5.45: 1992 strontium‐90
soil contamination prediction (left), calculated using lognormal ordinary kriging, its prediction standard error
(center), and estimated population density (right). The author produced these maps for illustration purposes
only, and they do not describe the soil contamination and population density accurately.
Figure 5.45
Strontium‐90 is a radioactive substance with a half‐life of 28 years that emits beta particles. It is taken up by
plants, animals, and people as a substitute for calcium. It attacks the bones and can cause bone cancer. A level
of 37,000 becquerels per square meter (Bq/sq.m=Bq/m2) is considered a very high strontium‐90 value.
FURTHER READING
1. Maps of soil temperature and soil moisture are created using data similar to the data from Yates, S. R., and A. W.
Warrick. 1987. “Estimating Soil Water Content Using Cokriging.” Soil Science Society of
America Journal 51:23–30.
2. Heuvelink, G. 1998. Error Propagation in Environmental Modeling with GIS. London: Taylor & Francis Books
Ltd., 150.
This book reviews different aspects of error propagation in GIS from a geographer’s point of view and
presents several case studies using Monte Carlo simulation.
3. Zadeh, L. A., Web site, The Berkeley Initiative in Soft Computing,
http://www.cs.berkeley.edu/~zadeh/
The concepts of fuzzy set and fuzzy logic were introduced by Lotfi Zadeh in 1965. His seminal paper and
information on the recent advances in fuzzy reasoning, as well as his thoughts on the link between probability
theory and fuzzy logic, can be found at this Web site.
4. Ross, T. J., J. M. Booker, and W. J. Parkinson, editors. 2002. Fuzzy Logic and Probability Applications: Bridging
the Gap. ASA‐SIAM Series on Statistics and Applied Probability. Philadelphia: SIAM. Alexandria, Va.: ASA.
This book states that probability theory and fuzzy logic are complementary rather than competitive because
probability theory deals with partial certainty, whereas fuzzy logic deals with partial possibility and partial
truth. The book explains how to employ combinations of methods from fuzzy logic and probability theory.
5. Cressie, N., C. Calder, J. Clark, J. Ver Hoef, and C. Wikle. 2009. “Accounting for Uncertainty in Ecological
Analysis: The Strengths and Limitations of Hierarchical Statistical Modeling.” Ecological Applications,
19(3):553–570.
This paper reviews hierarchical statistical modeling, mostly Bayesian, using data on an ecological study of
harbor seals and their abundance at haul‐out sites in Prince William Sound, Alaska. The data analysis assumes
that sites are spatially independent.
6. The freeware “Map Comparison Kit” software Web site http://www.riks.nl/mck/index.php.
7. Power, C., A. Simms, and R. White. 2001. “Hierarchical Fuzzy Pattern Matching for the Regional Comparison
of Land Use Maps.” International Journal of Geographical Information Science 15(1):77–100.
This paper describes the fuzzy inference algorithm implemented in the Map Comparison Kit software.
8. Riedl, L. Homepage of Leop’s MapModels—with downloads, demonstration data, and documentation. It is
available from http://www.srf.tuwien.ac.at/MapModels/MapModels_English.htm.
MapModels is an ArcView 3x fuzzy modeling environment extension developed by Leopold Riedl (Technical
University of Vienna, Austria). Although the software manual is provided in German only, the software is
intuitively understandable for those who do not speak German.
TYPES OF SPATIAL DATA, STATISTICAL
MODELS, AND MODEL DIAGNOSTICS
THREE TYPES OF SPATIAL DATA
GOALS OF SPATIAL DATA MODELING
GOALS OF SPATIAL DATA EXPLORATION
EXAMPLES OF APPLICATIONS WITH INPUT DATA OF DIFFERENT TYPES:
RADIOECOLOGY, FISHERY, AGRICULTURE, WINE GRAPES QUALITY MODEL,
WINE PRICE FORMATION, FORESTRY, CRIMINOLOGY
RANDOM VARIABLES AND RANDOM FIELDS
STATIONARITY AND ISOTROPY
MODEL DIAGNOSTIC
METHODS FOR INDICATOR (YES/NO) PREDICTION
METHODS FOR CONTINUOUS PREDICTION: CROSSVALIDATION AND
VALIDATION
SUMMARY OF SPATIAL MODELING
ASSIGNMENTS:
1) CHOOSE THE APPROPRIATE MODEL FOR ANALYZING THE QUANTITY OF
ADEQUATELY SIZED SHOOTS PRODUCED BY THE VINE
2) INVESTIGATE POSSIBLE MODELS FOR THE VARIATION OF MALARIA PREVALENCE
3) CALCULATE AND DISPLAY INDICES FOR INDICATOR PREDICTION USING OZONE
CONCENTRATION MEASURED IN JUNE 1999 IN CALIFORNIA
4) INVESTIGATE THE VARIABILITY OF YIELDS OF INDIVIDUAL TREES
FURTHER READING
This plan works for the analysis of various data, including spatial data. For example, we can minimize the squared
difference between Z(s0), the true value of Z at the unsampled location s0, and Ẑ(s0), the predicted value, where Ẑ(s0) is a
function of the observed data Z1, Z2, …, ZN, Ẑ(s0) = function(Z1, Z2, …, ZN), such as a weighted sum w1⋅Z1 + w2⋅Z2 + … + wN⋅ZN,
where the unknown coefficients (weights) need to be estimated.
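As a hedged illustration of such a weighted predictor, the Python sketch below uses inverse‑distance weights as one simple choice of the coefficients; kriging, discussed later in the book, estimates the weights from a semivariogram model instead. The sample locations and values are made up.

# Predict Z at an unsampled location s0 as a weighted sum of the observed values.
import math

def weighted_prediction(s0, locations, values, power=2.0):
    weights = []
    for s in locations:
        d = math.dist(s0, s)
        if d == 0.0:
            return values[locations.index(s)]   # exact match: return the observation
        weights.append(1.0 / d ** power)        # inverse-distance weight
    total = sum(weights)
    return sum(w * z for w, z in zip(weights, values)) / total

obs_locations = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]   # made-up samples
obs_values = [2.1, 2.9, 1.8, 3.4]
print(weighted_prediction((0.4, 0.6), obs_locations, obs_values))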
Although some research was done during the 145 years after publication of Laplace’s book, statistical analysis
of spatially dependent data became a common practice only after 1960, primarily because of advancements in
computer technology. Today, spatial statistics is a separate science with a large collection of models, tools,
and tricks. In this chapter we expand discussion of the ideas behind spatial statistical data modeling, which
we began in previous chapters.
In the next section, spatial data are classified into three types: discrete, continuous, and regional. The
difference in exploring and modeling each type is discussed. Then examples of applications with input data of
different types are presented. First, data concerning thyroid cancer in children are presented, and problems
with its spatial analysis are discussed. Then a case study using fishery data collected west of the British Isles
is presented using two different models, kriging and the generalized additive model. Next, typical agriculture
datasets—weed counts and vegetable disease—are analyzed using several spatial regression models. Then a
model for wine grape quality is proposed. Next, forestry data are presented, and several methods for their
analysis are discussed. Finally, some methods for regression analysis of the number of crimes in the
administrative regions are discussed.
The next part of this chapter discusses basic assumptions used in statistical modeling of all three types of
spatial data: random variables and random fields, stationarity, and isotropy. The main tool for measuring
spatial data dependency in geostatistics, semivariogram, is introduced there.
Next, two general approaches to model diagnostics—cross‐validation and validation—are discussed, focusing
on assessing the quality of prediction of indicator and continuous data.
The chapter ends with a short summary of spatial modeling.
THREE TYPES OF SPATIAL DATA
Geostatistical data exist and can be measured at any location in the data domain but are known in the
measurement locations only. Examples of geostatistical data are meteorological observations and soil
contamination because there is some temperature and some chemical concentration (perhaps zero) at any
location.
Discrete or point pattern data represent the locations of counted events. These events cannot be
observed at just any location. Examples of point pattern data are locations of crime events, locations of trees in a
forest, and earthquake locations (estimated using one method or another), because an arbitrary
location may not contain a crime or a tree. A more general kind of point pattern is a marked spatial point
pattern, in which a variable called a mark is observed at each location. Examples of marks are crime types,
tree diameters, and earthquake magnitudes.
Regional (or lattice) data are values aggregated over areal units such as census tracts or administrative
regions; examples are disease rates and crime counts per region.
Although the regression models play a role in analysis of all types of spatial data, the goal of the analysis is
usually different for applications with different data types. One of the main goals of geostatistical data
analysis is prediction of values at unsampled locations. In contrast, point pattern and regional data analyses are
more focused on inference about spatial data structure and spatial dependence between variables.
For all types of spatially dependent data, two observations separated by a small distance are more alike than
those separated by a large distance. This makes analysis of spatially dependent data much more complicated
than that of independent data, because the number of parameters that need to be estimated quickly becomes
very large. Figure 6.1 shows pairs of eight spatial objects. If they are correlated, it is necessary to find
correlation coefficients for all N⋅(N+1)/2 pairs composed of N available data samples. For N=8, this means that 8⋅9/2=36
correlation coefficients Cij for the pairs of locations si and sj would have to be estimated from just eight
measurements, which is, of course, impossible.
Created from USGS information downloaded free from the USGS Web site.
Figure 6.1
Spatial statistics provides models and tools for analyzing spatial patterns in data by finding relationships
between spatial objects. Information on spatial object relationships allows the number of model parameters
to be reduced to several parameters only. These parameters can be estimated from the data in a number of
ways, and the researcher’s task is to choose the optimal one.
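A short sketch of how the many pairwise covariances are replaced by a model with only a few parameters: an empirical semivariogram is computed in distance bins and an exponential model with nugget, partial sill, and range is fitted to it. The simulated coordinates and values, the bin choices, and the exponential model are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(200, 2))                   # hypothetical sample locations
z = np.sin(coords[:, 0]) + 0.3 * rng.standard_normal(200)    # toy spatially structured data

# Empirical semivariogram: average of 0.5*(z_i - z_j)^2 within distance bins.
i, j = np.triu_indices(len(z), k=1)
h = np.linalg.norm(coords[i] - coords[j], axis=1)
gamma = 0.5 * (z[i] - z[j]) ** 2
bins = np.linspace(0, 5, 11)
idx = np.digitize(h, bins)
h_mid = np.array([h[idx == k].mean() for k in range(1, len(bins))])
g_emp = np.array([gamma[idx == k].mean() for k in range(1, len(bins))])

# Exponential semivariogram model with three parameters: nugget, partial sill, and range.
def exp_model(h, nugget, psill, a):
    return nugget + psill * (1 - np.exp(-h / a))

params, _ = curve_fit(exp_model, h_mid, g_emp, p0=[0.1, 1.0, 1.0])
print("nugget, partial sill, range:", params)
```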
Spatial data structure may be different at different scales. Large‐scale variability is described in geostatistics
and regional data analysis by the mean function, and in point pattern data analysis by the intensity function.
Small‐scale or local data variability is described by a semivariogram or covariance model in geostatistics,
Ripley’s K and pair correlation functions in point pattern analysis, mark semivariogram and correlation
function in marked point pattern analysis, and neighbor weights in regional data analysis.
When choosing a particular model, we accept assumptions on which the model is based and from which
formulas are derived. The model usage follows the deductive reasoning: if the premises are true, the
conclusion is valid, but if the premises are not verified, the conclusion may be wrong.
The first step in model selection is defining the type of input data. Often, choosing a type of data is easy, but
sometimes it can be a problem, as seen in the section called “Types of spatial data and related statistical
models” in chapter 1. Regional, geostatistical, and marked point pattern data have many similarities. An
example of lightning data that can be analyzed as continuous, discrete, and aggregated was presented in
chapter 1. In that case, the result of the analysis depends on the choice of data type.
GOALS OF SPATIAL DATA MODELING
For all data types, modeling starts with the verification of the assumption of spatial data independence. If this
assumption holds, there is no reason to use spatial statistics. If the assumption that the data come from a
spatially random process is rejected, then the appropriate spatial statistical model for the data should be
chosen.
The linear model described by Laplace in the beginning of this chapter can be written as
data = trend + all random errors
or
Z(s) = μ(s) + ε(s),
where μ(s) is the trend at location s and ε(s) is the random error.
In geostatistics, the primary goals of the analysis are predicted data values and prediction uncertainty at
locations where data have not been collected. The trend component of the geostatistical data model is almost
always described by a low-order polynomial of the coordinates, such as
μ(s) = β0 + β1·x + β2·y + β3·x² + β4·x·y + β5·y², where s = (x, y). Although the trend can be very complicated, researchers usually
are not interested in the coefficient values because the main purpose of trend analysis in
geostatistics is removing the large-scale variation from the error term ε(s). Another typical geostatistical
goal is the prediction of some function of the observed variable. For example, block kriging predicts the
average value of the variable over a region. Important parts of geostatistical data analysis are detection of
erroneous data, finding cross‐correlations between variables and using them for prediction (cokriging), and
prediction of new values at the sampled locations when measurements contain errors.
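A sketch of the trend-removal step in Python: a second-order polynomial of the coordinates is fitted by ordinary least squares, and the residuals carry the small-scale variation to be modeled by a semivariogram. The coordinates, values, and trend coefficients are simulated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.uniform(0, 10, (2, 300))                          # hypothetical data locations
z = 2.0 + 0.5 * x - 0.2 * y + 0.05 * x * y + rng.standard_normal(300)

# Design matrix for a second-order polynomial trend in the coordinates.
X = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])
beta, *_ = np.linalg.lstsq(X, z, rcond=None)

trend = X @ beta
residuals = z - trend      # the error term epsilon(s); its spatial structure is modeled next
print(beta.round(3), residuals.std().round(3))
```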
In point pattern analysis, the locations are random variables. To distinguish that fact, they are often called
events instead of points. For spatial point patterns, the intensity of events plays the role of trend in
geostatistics. A primary mapping goal is estimation of the intensity surface λ(s). For regression, the intensity
surface is modeled as a function of covariates. Because the intensity surface must have positive values, a
common approach is modeling the intensity process as a log-normally distributed variable,
λ(s) = exp(β0 + β1·X1(s) + … + βp·Xp(s) + ε(s)).
An important goal of point and marked point pattern analysis is detection of the dependence between
locations and between marks and locations. In contrast with geostatistics, events typically interact at short
distances and the process that generates marks may influence the process that generates locations and vice
versa.
Statistical models can be simple or complex and may represent reality more or less adequately. Therefore, it
is necessary to know what models are available, what their assumptions are, and how to choose the optimal
model. Once a model is chosen, the next step is to estimate the model parameters and make an inference from
the model about the data.
GOALS OF SPATIAL DATA EXPLORATION
Before modeling the data, a good understanding of the phenomena under investigation is required.
Exploratory spatial data analysis tools help to determine the data distribution, find possible outliers, check
the independence of data values and data locations, reveal spatial trends in the data, and investigate spatial
similarity at different spatial scales.
At a deeper level, exploratory spatial data analysis helps to find appropriate data transformations to known
theoretical distribution, investigate the dependence of spatial data and locations as a function of the distance
between observations, find and separate small‐ and large‐scale data variation, identify spatial patterns, and
generate hypotheses that may explain the patterns. There are situations where exploratory spatial data
analysis answers all the important questions regarding the data and renders the modeling step unnecessary.
From the modeling point of view, the most important part of exploratory spatial data analysis is verification
of the model’s assumptions. For example, a popular index of spatial data association, Moran’s I, assumes that
data are stationary and normally distributed. Therefore, these two assumptions, stationarity and normality,
must be verified before using Moran’s I.
Several typical applications with input data of different types are presented below with a brief discussion of
possible statistical modeling scenarios. The next chapters and the appendixes discuss concepts behind spatial
statistical models in detail and present computer scenarios and codes for modeling typical geographical data.
RADIOECOLOGY
Figure 6.2 at left shows the number of thyroid cancer cases in children during the period 1986–1993 in
Belarusian towns and cities. Data of this type usually raise several typical questions:
Is the density of cancer disease locations dependent on location?
Do cancer events tend to be located near a particular location, in this case close to the nuclear power
plant?
How can the elevated number of thyroid cancer events be explained?
These questions can be formulated as statistical hypotheses: statements about data distribution, parameters
in a probability model, or anything about the data we would like to confirm. For example, comparing case
locations with simulated random locations can assess the hypothesis that cancer cases are equally likely to
occur at any location in Belarus.
Objective analysis of thyroid cancer cases is necessary because we cannot always distinguish between
clustered and random events. An example of a misleading map is shown in figure 6.2 at right. It has the same
number of case points as the map at left, but they are randomly distributed over the Belarus territory. The
ring buffers around points give the impression that data are clustered when they are not.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 6.2
To see how the actual distribution of cancer cases in Belarus differs from a random distribution, random
locations of cancer cases were generated that total the actual number of cases and are related to the local
density of population. Figure 6.3 at left shows the result of generating such events. The large number of
locations near the center of the map, 48, is the result of there being 1.5 million people in the capital of Belarus,
Minsk. It is immediately clear that the number of the actual cancer cases in Minsk, 20, is smaller than it would
be if the children had been randomly selected.
After answering the questions about disease location clustering (we will show how to do this in chapters 13
and 16), the next step is to understand the underlying physical processes and use them for explanation of the
elevated number of thyroid cancer events. The primary reason for thyroid cancer is irradiation by short‐lived
radioactive particles, specifically, by radioisotopes of iodine. Therefore, the correlation between thyroid
cancer cases and radioiodine deposition should be examined. However, detailed and accurate information on
radioiodine deposition right after the Chernobyl accident is not available. Figure 6.3 at right shows a map of the estimated milk contamination, which is used as a surrogate for radioiodine exposure.
The study of correlation between estimated milk contamination and thyroid cancer rates involves two types
of spatial data: continuous (milk contamination) and discrete points (locations of cancer cases). Although
both the settlement locations of children with thyroid cancer and milk dose estimation locations are available
as points, the models for analyzing these data are different. The dose of irradiation from local foodstuffs can
be predicted in any location including the locations of cancer cases, but predicting disease locations based on
the contamination data does not make sense, because we already know where all the disease locations are. It
does make sense to create a map of density of the disease, which can be interpreted as risk for local
population to have thyroid cancer. Local similarity between risk of thyroid cancer and food and soil
contamination can be examined using spatial statistical models.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 6.3
Typical continuous data are presented in figure 6.4: measurements of the radio‐cesium soil contamination in
the majority of Belarus settlements at left and mushroom radio‐cesium contamination in the southeastern
part of Belarus. These data are invaluable for explaining many diseases after Chernobyl, because people living
in the contaminated territories are receiving radio‐cesium through local food.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 6.4
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 6.5
Statistical distributions used for modeling count data are characterized by a dependence between the mean
and the variance; for example, the mean is equal to the variance in the case of the Poisson distribution. Count data in
epidemiology, ecology, and agriculture are often over-dispersed, meaning that the data exhibit more variation
than implied by the Poisson distribution, because people (animals or plants in the case of ecology or agriculture)
acting as a group have individual responses to covariates and are influenced by external factors similar to
their nearest neighbors. In this case, statisticians recommend using either the negative binomial or quasi‐
Poisson distribution because their standard errors reflect more accurately the range of likely count values.
Both negative binomial and quasi‐Poisson distributions account for overdispersion, have two parameters,
their means can vary as a function of covariates, and both distributions are available in statistical software
packages. The quasi-Poisson distribution with mean μ has the variance φ·μ, φ > 0, where φ is an overdispersion
parameter. In the case of a negative binomial distribution with mean μ, the overdispersion is defined by
var(Z) = μ + μ²/k, k > 0, that is, the variance is quadratic in the mean. A quasi-Poisson process is denoted as
quasi-Poisson(μ, φ), and a negative binomial process is denoted as NB(μ, k).
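The sketch below contrasts the two variance functions and simulates overdispersed counts as a gamma–Poisson mixture, which is one standard construction of the negative binomial distribution. The mean μ, the overdispersion φ, and the size parameter k are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, phi, k = 5.0, 3.0, 2.0           # illustrative mean, quasi-Poisson phi, and NB size k

mu_grid = np.linspace(0.5, 20, 50)
var_quasi = phi * mu_grid            # quasi-Poisson: variance = phi * mean (linear in the mean)
var_nb = mu_grid + mu_grid**2 / k    # negative binomial: variance quadratic in the mean
print(var_quasi[-1], var_nb[-1])     # the two variance functions diverge for large means

# Negative binomial counts as a Poisson whose rate is gamma-distributed (mean mu, shape k).
rates = rng.gamma(shape=k, scale=mu / k, size=10_000)
counts = rng.poisson(rates)
print(counts.mean(), counts.var(), mu + mu**2 / k)   # the sample variance exceeds the mean
```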
Regression models with negative binomial and quasi‐Poisson data are usually fitted using weighted least
squares with weights inversely proportional to the variance. Since the mean to variance relations in negative
binomial and quasi‐Poisson distributions are different, the estimated weights and, consequently, the
regression coefficients, are also different. A common approach is to use the quasi-Poisson distribution if there are
several areas where most of the events are concentrated, while the negative binomial distribution may be a better choice otherwise.
Data about thyroid cancer in children are further analyzed in appendix 3.
FISHERY
Fishing is an important source of food, and effective management of fish stock is necessary since most
commercially efficient fishing areas are overexploited. Usually fish samples are collected by trawling at a
speed of three knots. Economically important fish are sorted according to species, weighed, and counted. As a
rule, geostatistics is used to interpolate the sampled fish density assuming that spatial fish distribution is
determined by its spatial correlation.
Unfortunately, accurate counting of the number of fish is impossible since adult fish are fast moving, occur in
shallow water, and tend to avoid nets. Another problem is the difficulty of predicting next year’s fish
population using counts of adult fish from this year. One alternative is to estimate the number of fish eggs that
the fish spawn on the sea surface on particular days and then estimate the number of adult fish required to
produce this number of eggs to make a decision as to absolute spawning stock biomass and the maximum
amount of fish to catch in the near future. Eggs are picked up by a fine‐meshed net and their number
estimated using the formula
egg count = (density of eggs) × (net area)
Figure 6.6 shows free‐floating egg counts released into the ocean west of the British Isles and France in 1992
by Atlantic mackerel. Sample points are located systematically with increased sampling in the areas where
high egg densities were expected.
A histogram shows the distribution of the egg counts in 450 samples, represented as colored circles. The
distribution is not symmetrical, and we should look for a model that works well with skewed data.
From the dataset, 184 random samples, shown in black, were removed; they will be used below for comparison of
the observed counts with predictions produced by several statistical models.
Data courtesy of the University of Bath.
Figure 6.6
Figure 6.7 shows locally calculated means (left) and standard deviations (right) using data in the polygons
with common borders (we used the Geostatistical Analyst Voronoi Map tool discussed in chapter 14). The visual
impression is that the mean is proportional to the standard deviation,
both in the areas with low numbers of eggs in the southeast and in the remaining territory with larger numbers
of eggs. Normally distributed data do not have such a feature, and we should think about statistical models
that allow for data distribution with the data variance proportional to the data mean. The appropriate
statistical model for egg counts can be based on Poisson, binomial, or negative binomial data distributions.
Data courtesy of the University of Bath
Figure 6.7
The semivariogram cloud shape and semivariogram surface in figure 6.8 show that the egg counts are
spatially correlated (semivariogram surface is not flat) and that there are two measurements with unusually
large counts, the common locations of green lines in the map. Because data are spatially correlated, kriging
can be used for prediction of the number of mackerel eggs at the unsampled locations. These two possible
outliers should be verified and removed if their removal can be justified.
Data courtesy of the University of Bath.
Figure 6.8
Spawning depends on the value of environmental variables such as water temperature, salinity, and ocean
depth. Therefore, spatial regression models that can use explanatory variables are of interest. According to
the literature, imprecision in estimating egg abundance contributed about 60 percent of the data variance.
One regression model that was used for modeling variation in egg counts as a function of explanatory
variables is the generalized additive model (GAM). This model gives a choice of assumed data distributions. In
contrast to cokriging (multivariate version of kriging, see chapters 8 and 9), the generalized additive model
requires values of the explanatory variables at the predicted locations. In the case of fishery data, all
explanatory variables can be reasonably estimated at the required locations using kriging.
The generalized additive model has the following form:
g(E[Z]) = β0 + s1(X1) + s2(X2) + … + sp(Xp),
where Xi is the ith explanatory variable and Z is the variable of interest (egg counts), which has a specified
statistical distribution whose expectation is related to the explanatory variables through the link function g, and the functions si
are smoothing spline functions of the explanatory variables Xi.
The concept of regression using smooth functions can be illustrated using less‐complicated one‐dimensional
data. Red points in figure 6.9 display one‐hour maximum ozone concentration in Riverside, California, in June
1999. Suppose we want to fit the ozone data using a smooth function s1(x), Z = s1(x) + ε, where Z is a
response variable (ozone concentration), s1(x) is a smooth function of the explanatory variable x (day
number), and ε are independent and identically distributed random errors. This can be done by defining a
set of known basis functions bi(x) so that s1(x) has the representation
s1(x) = β1·b1(x) + β2·b2(x) + … + βq·bq(x),
where βi are regression coefficients that are estimated using the linear model discussed in appendix 2.
In practice, the most suitable basis functions bi(x) are splines. A set of four possible basis functions is shown
in figure 6.9 at bottom. We used these functions and fit a linear model to find the coefficients βi. According to the
linear model diagnostics (the R² statistic), four basis functions explain 96 percent of the data variation. The fitted
line is shown in figure 6.9 on top in blue.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 6.9
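A sketch of regression on a small set of basis functions: an intercept, a linear term, and two cubic truncated-power terms are fitted to a simulated ozone-like series by ordinary least squares. The day numbers, the simulated ozone values, the knot positions, and the choice of basis are illustrative stand-ins, not the basis actually used for figure 6.9.

```python
import numpy as np

rng = np.random.default_rng(3)
day = np.arange(1, 31, dtype=float)                                        # day number in June
ozone = 0.08 + 0.03 * np.sin(day / 5.0) + 0.005 * rng.standard_normal(30)  # toy ozone data

# Four basis functions: intercept, linear term, and cubic truncated-power terms at two knots.
knots = [10.0, 20.0]
B = np.column_stack([np.ones_like(day), day]
                    + [np.clip(day - k, 0, None) ** 3 for k in knots])

beta, *_ = np.linalg.lstsq(B, ozone, rcond=None)   # estimate the regression coefficients
fitted = B @ beta
r2 = 1 - np.sum((ozone - fitted) ** 2) / np.sum((ozone - ozone.mean()) ** 2)
print("R^2 =", round(r2, 3))
```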
Ozone concentration can be partially explained by the temperature. Therefore, the regression model can be
refitted using an additional smooth function s2 of the temperature. The resulting model is
Z = s1(day number) + s2(temperature) + ε.
The principle of spline interpolation in two dimensions is the following. Suppose the data Z(si) are measured
in the locations si = (xi, yi). The goal is to find a function f(x, y) that minimizes the tradeoff between goodness
of fit to the data and the smoothness of the resulting surface. There are many ways to measure surface
smoothness. For example, the difference between the mean value and the function f(x, y) can be penalized:
J1(f) = ∫∫ [f(x, y) − mean]² dx dy   (mean coefficient)
J2(f) = ∫∫ [(∂f/∂x)² + (∂f/∂y)²] dx dy   (variance coefficient)
J3(f) = ∫∫ [(∂²f/∂x²)² + (∂²f/∂x∂y)² + (∂²f/∂y²)²] dx dy   (roughness coefficient)
Then the task is to find optimal parameters mi, i = 0, 1, 2, 3, of the following function:
m0·Σi=1,…,N [f(xi, yi) − Zi]² + m1·J1(f) + m2·J2(f) + m3·J3(f)
The first term in the expression above is a measure of closeness of the fit to the data; the second term
guarantees that predictions far from data locations tend to the data mean value as it should be in the absence
of the information; the third and fourth terms are the measures of the function f(x, y) curvature.
The parameters mi control the tradeoff between the terms. Large values of m1 imply more smoothing toward
the mean; the larger the variance coefficient m2, the less variable and smoother the surface; the larger the roughness
coefficient m3, the smoother the surface (as in the global polynomial interpolation discussed in chapter 7).
If m1 = m2 = m3 = 0, there is no penalty for rapid changes in the function f(x, y), and the fitted surface passes
through the data values. Different choices of mi lead to different surfaces, from too smooth to too rough.
Coefficients mi do not work independently, and finding optimal parameters of the spline model is a
challenging task.
Sometimes a simplified version of spline smoothing with parameters m1 and m2 equal to zero is used for
modeling two‐dimensional data. Therefore, users of spline models should check which parameters are
available for controlling surface smoothness.
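A one-dimensional sketch of this fit-versus-smoothness tradeoff: the fitted values minimize a data-fit term plus a discrete roughness penalty built from second differences, the analogues of the m0 and m3 terms above. The simulated data and the penalty weights are made up, and the other penalty terms are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
x = np.linspace(0, 10, n)
z = np.sin(x) + 0.3 * rng.standard_normal(n)       # noisy one-dimensional observations

# Second-difference matrix: a discrete analogue of the roughness penalty J3.
D = np.diff(np.eye(n), n=2, axis=0)

def smooth(z, m3):
    """Minimize sum (f - z)^2 + m3 * ||D f||^2; larger m3 gives a smoother curve."""
    return np.linalg.solve(np.eye(n) + m3 * D.T @ D, z)

f_rough = smooth(z, m3=1.0)       # close to the data
f_smooth = smooth(z, m3=1000.0)   # much smoother curve, further from the data
print(np.abs(f_rough - z).mean(), np.abs(f_smooth - z).mean())
```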
Figure 6.10
Figure 6.11 at left shows the result of the validation exercise: the prediction versus observed egg counts for
ordinary kriging (see glossary and chapter 8) and the generalized additive model at the testing locations. The
generalized additive model uses three covariates shown in figure 6.10 and assumes that egg counts have a
negative binomial distribution. It also assumes that the response variable is spatially uncorrelated. We see
that ordinary kriging performance, shown in blue, is very poor: it heavily overestimates the small values of
egg counts (at the same time, there are predictions of a negative number of eggs) and underestimates large
values, while predictions made by the generalized additive model are more reasonable. For researchers with
experience in data interpolation, this comparison result may be surprising because egg counts are spatially
correlated.
Figure 6.11
The scientific assessments of the mackerel stock could force a drop in catching allowances in the future as
well as revision of quotas of this species for different countries. Therefore, a statistically sound measure of the
uncertainty associated with the egg counts prediction is required.
Figure 6.12 at left shows prediction errors of ordinary kriging, ordinary cokriging, and the generalized
additive model versus observed egg counts. Both ordinary kriging and ordinary cokriging prediction errors
are approximately the same for small and large egg counts, meaning that kriging reports an average
prediction error for the entire data domain. This is certainly unsatisfactory (see the discussion on kriging
prediction errors in chapter 2). A problem is that conventional kriging often does not assume any particular
distribution for the input data. This is frequently a disadvantage of the model, as discussed in chapter 4.
Assuming that ordinary kriging predictions are normally distributed, a 95-percent prediction interval for egg
counts in the range from 0 to 70 counts is
prediction ± 91.3,
which is unacceptably large. The same calculation for ordinary cokriging gives the following prediction
interval:
prediction ± 13.2,
which is too large for small egg count values and most probably too small for large egg counts.
In contrast, prediction standard errors produced by the generalized additive model are proportional to the
egg counts (figure 6.12 at right), which coincides with current knowledge about biological count data.
Figure 6.12
A comparison of semivariogram clouds and semivariogram surfaces constructed using the difference between
the observed egg counts and the predicted counts (figure 6.13) indicates that the cokriging and the
generalized additive models have captured a large part of the spatial variation in mackerel egg counts, while
the ordinary kriging residuals are spatially dependent (because the semivariogram surface has clear
structure), meaning that there is an external reason for spatial data correlation. Note that we did not remove
two outliers found in the beginning of this section, and they are responsible for the layering in the
semivariogram clouds.
Figure 6.13
The analysis above can be improved using additional explanatory variables. Other covariates available for
inclusion in the model are sea‐surface temperature, temperature at a depth of 20 meters, salinity, distance
from the 200‐meter contour of the ocean bed in meters (biologists believe that mackerel prefer to spawn at
that depth), and an approximation of the gradient of the ocean floor in the direction of the 200‐meter contour
at each sampled point. Adding new variables in the generalized additive model is easier than in the cokriging
model (see discussion on cokriging in chapter 9). Another good model for analyzing and predicting fishery
data is the generalized linear mixed model discussed in appendixes 3 and 4. Note that ordinary cokriging is
the simplest multivariate geostatistical model. More complex models will produce more accurate predictions
and more plausible prediction standard errors.
AGRICULTURE
The previous section indicates that predictions made by the generalized additive model with independent
errors may be an acceptable alternative to the cokriging model with spatially correlated errors. In this
section, we continue comparing the most popular geostatistical models with generalized linear and
generalized additive models, with the intention to raise interest in studying spatial regression models in the
next chapters. However, there is no intention to find the best model for the data.
Figure 6.14 shows data from an eastern Nebraska corn and soybean field consisting of counts of weeds in
small squares into which the field was divided. These data were first discussed and analyzed by Johnson,
Mortensen, and Gotway in 1996 (see reference in “Further reading”). The data histogram and local standard
deviation are shown in the right part of the figure. These data are not Gaussian, and data transformation
using a log or power function does not make the transformed weed counts data close to a normal distribution.
Data courtesy of Gregg A. Johnson, University of Minnesota.
Figure 6.14
The weed counts can be modeled as generated from stationary Poisson or negative binomial distributions. If we assume
that the data distribution is Poisson, then the expected mean and variance values are equal. However,
according to the statistical and weed science literature, it may be better to base the analysis on the negative binomial
distribution, since for the weed data the variance can be larger than the mean.
As a rule, agriculture scientists interpolate their spatial data using ordinary kriging. Figure 6.15 shows the
result of the prediction validation exercise in which the weed data were divided into 127 training points, shown in pink, and
85 testing points, shown as question marks. Using the training data, predictions (y axis) at the locations of the testing data
(x axis) were made using ordinary kriging. We see that predictions at the locations with both small and large
counts are inaccurate, and predictions of medium counts do not look very good either.
Figure 6.15
Figure 6.16 shows ordinary kriging prediction standard errors versus count values at the testing locations,
blue circles. Prediction variance is the same no matter how large the number of weeds in the cell, which
contradicts the nature of the weed data. The kriging model can be made more complex, as in the case of
simple kriging with normal score transformation, red. This time, the kriging prediction error depends on the
prediction value, but it still does not accurately represent the count data variability. Therefore, decision‐
making based on conventional kriging models may lead to poor results.
Figure 6.16
Possible solutions to the problem with the prediction error are a generalization of conventional kriging to
Poisson and negative binomial distributions (see chapter 12) and the generalized linear mixed model
(Poisson and negative binomial spatial regression).
The Poisson regression model assumes that the probability that the dependent variable is equal to some
number n is given by
Pr(Z = n) = λ^n·e^(−λ)/n!,  n = 0, 1, 2, …,
where λ is the mean of the variable Z and n! = n·(n−1)·(n−2)·…·2·1. Although Z can take only integer values, λ can be
any positive number. The parameter λ depends on the explanatory variables. Because λ cannot be negative, it
is modeled on the logarithmic scale,
log(λi) = β0 + β1·X1i + … + βp·Xpi + δi,
where λi varies across observations (i = 1, 2, …, m) and δi is spatially dependent random variation described
by the semivariogram model.
The Poisson regression model works well in the case of counts of events that occur over the same time
interval. It can be adjusted for situations when events are counted over different lengths of time by
incorporating time into the model. If ti is the observation time interval in the location i, the number of events
Zi that occur during that time interval is assumed to have the Poisson distribution
Pr(Zi = n) = (λi·ti)^n·e^(−λi·ti)/n!,  n = 0, 1, 2, …,
so that the expected value of Zi is λi·ti. Then the expected value of Zi is equal to ti·exp(β0 + β1·X1i + … + βp·Xpi + δi).
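A non-spatial sketch of such a Poisson regression with a log link and an observation-time offset, fitted with statsmodels. The covariate, counts, and time intervals are simulated, and the spatially correlated term δi described in the text is omitted here; the Bayesian WinBUGS fit used for figure 6.17 is not reproduced.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
x = rng.uniform(0, 1, n)                       # a hypothetical covariate
t = rng.uniform(0.5, 2.0, n)                   # observation time per location
lam = np.exp(0.2 + 1.5 * x)                    # true rate per unit time
counts = rng.poisson(lam * t)

X = sm.add_constant(x)
# log E[Z_i] = log(t_i) + beta0 + beta1 * x_i ; log(t_i) enters as a fixed offset.
model = sm.GLM(counts, X, family=sm.families.Poisson(), offset=np.log(t))
result = model.fit()
print(result.params)                           # estimates of beta0 and beta1
```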
Figure 6.17 shows the relationship between prediction of weed counts and prediction standard error for the
generalized linear model of Poisson type with exponential semivariogram model estimated using freeware
WinBUGS software (see appendix 3). The relationship between prediction and prediction standard error is
almost linear for small and medium values, but not for large values. This is evidence of overdispersion.
Overdispersion usually does not bias the prediction, but leads to underestimation of the prediction standard
error.
Figure 6.17
WinBUGS is Bayesian software, and the kriging model’s parameters are treated as random variables instead of
being optimally estimated constants (see discussion of Bayesian kriging in chapter 9 and appendix 3). Figure
6.18 shows distributions of the semivariogram model parameters in logarithmic scale, the mean, sill, and
range values.
Figure 6.18
Figure 6.19 shows predictions and prediction standard errors made by the generalized linear model of
Poisson and negative binomial types. The negative binomial type should be preferred since predictions and
prediction standard errors are proportional for both small and large values, as suggested by the weed science
literature.
Figure 6.19
Figure 6.20
Figure 6.21 shows predictions made by simple kriging (left) and simple cokriging (right). These geostatistical
models coincide with indicator kriging and cokriging in the case of zero-one (indicator) input data, and the maps
below are interpreted as the probability of having pepper disease in the field. Because of problems with indicator
kriging discussed in chapter 9, it is advisable to use these maps as a preliminary (exploratory) step to
statistical modeling and not for decision making.
Figure 6.21
The logarithm of the odds of disease presence, log(p/(1 − p)), is modeled as a linear combination of disease presence in the
neighborhood and soil moisture:
log(p/(1 − p)) = β0 + β1·f(neighboring disease values) + β2·m,
where p is the probability of disease presence, β0, β1, and β2 are regression coefficients to be estimated, m is a moisture
value, and f is a function of the disease presence in the neighborhood, such as a weighted average of the neighboring values.
This model is called the logistic regression model.
The equation above arises because we want to model the probability of the event occurrence p (in this case,
presence of disease) as a linear combination of n covariates Xi:
β0 + β1·X1 + … + βn·Xn,
where βi are regression coefficients. However, the probability p has to be between zero and one, but the
linear predictor can take any real value. A solution is to transform the probability. We move from
the probability p to the odds p/(1 − p), defined as the ratio of favorable to unfavorable cases. Then we take
the logarithm, calculating the log-odds or “logit”, log(p/(1 − p)). Now, as the probability tends to zero, the log-odds approach
minus infinity, and as the probability goes to one, the log-odds approach infinity, as in figure 6.22.
Figure 6.22
Taking the derivative with respect to Xi (this makes sense if the variable is continuous),
∂p/∂Xi = βi·p·(1 − p),
we can see that the effect on the probability p of increasing the ith covariate while holding the other covariates
constant depends on both the coefficient βi and the probability value p. If βi is negative, the curve in figure
6.22 is flipped horizontally, so that p is near 1 for small values of the linear predictor and near 0 when it is large.
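A non-spatial sketch of such a logistic regression: disease presence is regressed on a moisture value and a simple stand-in for the neighborhood term f, and the derivative above is evaluated at the fitted probabilities. The simulated data and the true coefficients are illustrative assumptions; the spatial structure used in the text is not reproduced here.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 300
moisture = rng.uniform(0, 1, n)
neigh = rng.uniform(0, 1, n)                   # stand-in for weighted neighboring disease values
logit = -2.0 + 3.0 * moisture + 1.5 * neigh    # true linear predictor
p = 1 / (1 + np.exp(-logit))
disease = rng.binomial(1, p)                   # presence/absence of disease

X = sm.add_constant(np.column_stack([neigh, moisture]))
fit = sm.Logit(disease, X).fit(disp=0)
print(fit.params)                              # beta0, beta1 (neighborhood), beta2 (moisture)

# Effect of moisture on the probability at the fitted values: beta2 * p * (1 - p).
p_hat = fit.predict(X)
print((fit.params[2] * p_hat * (1 - p_hat)).mean())
```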
Figure 6.23 shows how the logistic regression model predicts the probability value p using the bell pepper
disease data without (left) and with (right) the covariate variable moisture.
Figure 6.23
Figure 6.24 at left shows that the kriging prediction standard error is nearly the same everywhere, so the 95-percent prediction interval is approximately
prediction ± 1.96⋅(prediction standard error) ≈ prediction ± constant
at any location inside the field.
In contrast, the prediction standard error calculated by the spatial logistic regression model, right, depends
on the data values variation as it should be.
Figure 6.24
There is competition between plants for resources such as water availability, and these resources vary
throughout a field. The existence of a correlation between plant density and field resources allows for
variable rate seeding to be potentially profitable to the farmer.
Crop yields are affected by three categories of factors:
Factors that vary on a field and are not controlled by the farmer. Examples are soil texture and slope.
Uncontrolled stochastic factors, which do not remain constant over time. Examples are rainfall and
the first autumn frost date.
Factors controlled by the farmer, such as seed density and the amount of fertilizers.
Although spatially variable rate seeding is advantageous over traditional uniform rate seeding, variable rate
seeding may only be profitable if the costs of equipment and services do not exceed revenue returned by
increased yield and cost reductions achieved by reduced inputs of seed and fertilizer.
Many researchers have separately estimated the effects on yield of the factors mentioned above, but
information on how all three factors interact to affect yield is difficult to find. However, this is what farmers
really need because this information would help them estimate and map the economically optimal controlled
factors and investigate various scenarios with different uncontrolled factors.
In practice, there is much uncertainty about how crop yields depend on different values of controlled and
uncontrolled factors. Figure 6.25 at left shows possible dependence between the amount of rain and yield per
unit area for three different soils represented in red, pink, and green lines. Each point on the lines
corresponds to some average values of yield and rain calculated from the distributions shown in the top of
the figure for three particular values of the amount of rain. Each year, nature “draws” a value from these
distributions, and the generated values can differ from the mean (i.e., expected/average) values shown as
blue vertical lines.
The graph in figure 6.25 at right shows a typical dependence of yield on the amount of fertilizer for three
different soils. A case study in chapter 2 shows how this information can be used in decision making.
Figure 6.25
The yield response information is currently expensive, and variable rate seeding is economically beneficial
for the production of high‐value crops such as grapes, but not for relatively inexpensive corn.
WINE GRAPES QUALITY MODEL
Agricultural data analysis can be complex and challenging. The following discussion of the wine grape quality
model, proposed by Bob Siegfried, a Salinas Valley vineyard manager and producer of grapes for the only
California wine to have attained a one hundred point rating from Wine Spectator at the time of this writing, in
personal communication, illustrates problems that arise when modeling such a valuable product as wine
grapes.
Both grape quality and yield vary across vineyards, but increasing yields is less beneficial than increasing
grape quality. The value of the grapes quality model lies in defining geographical areas that may produce
premium‐quality wine and in planning improvements by making lower‐quality parts of the vineyard more
like the premium areas. It should be noted that in practice winemakers believe that most of the tasks needed to
grow good grapes are already known. It should be clear from the following discussion that this belief is
overoptimistic.
To spatially analyze wine grape quality across a vineyard, it would be necessary to produce small lots of wine,
5 to 10 gallons, from grapes collected from each measurement site in the vineyard to obtain wine quality
scores. The number of measurements should be at least several times the number of terms in the grapes
quality model.
Wine price reflects wine quality, fashion (which influences demand for certain grape varieties), and
rumor. Wine quality comprises objective information on wine characteristics, which may be related to
grape characteristics, and subjective sensory perception (amount of pleasure). Wine grape quality can
also be seen as a latent variable that cannot be measured directly by a particular device (more on this in the
section called Criminology below).
Subjective wine scoring is usually based on a one‐hundred‐point scale and is arrived at by tasting and giving
an overall score. Wine scoring for wine production is mostly done in‐house. (Wine scoring is also done by
wine critics and by judges at wine competitions.) It combines such categorical wine characteristics as
aromatic intensity (strong/classic/discrete), finesse of aromas (yes/no), complexity (yes/no), firmness of
attack (yes/no), excessive acidity (yes/no), suppleness (yes/no), flatness (yes/no), fat
(plump/medium/lean), well concentrated (yes/no), harmony (perfect/balanced/unbalanced), fine tannins
(yes/no), finish (long/medium/short), alcohol excess (yes/no), traces of staleness (yes/no), touch of
reduction (yes/no), and necessity for aging (yes/no). Many examples of wine tasting by amateurs and
professionals are described in the literature, and the influence of various components of tasting was
estimated. This literature can be used to estimate the variance of subjective wine scoring.
A model of objective grape quality can be written for site i as
wine grape quality = a0 + a1⋅(phenolic concentration) + a2⋅(sugar concentration) +
a3⋅(pH) + a4⋅(titratable acidity) + a5 (ratio of tartaric to malic acid) + a6 (nitrogen
concentration) + a7 (assorted stuff in the juice necessary for yeast growth) + random
error,
where all of the measurements of the explanatory variables are on the juice. In the model above,
The constant a0 describes grape variety and region specificity.
Phenolic concentration is the concentration of myriad compounds in the berry and seeds, among
which are the tannins and the pigments. Phenolic concentration integrates factors such as crop load,
berry size, and irrigation effects, removing the necessity to deal with yield. This is important
because the optimum yield is different for each variety and vineyard combination, and it is different
for subareas of the vineyard. Phenolic concentration would be better specified as concentration at a
given sugar concentration. Phenolic concentration is correlated with pH and titratable acidity.
Titratable acidity decreases as crop yield increases and as weather during ripening gets hotter. pH
increases (acidity decreases) as yield increases. Most phenolics decrease as the grapes ripen.
Therefore, the statistical model should be adjusted to account for these correlations (see the section
“Geographically weighted regression versus random coefficients models” in chapter 12 for an
example of such a model).
pH is the logarithm of hydrogen ion activity, one measure of acidity. Wine color changes as pH
increases, turning from red to purple. Low pH also inhibits the growth of undesirable bacteria.
At the farming level, this model can be translated into the following:
wine grapes quality = b0 + b1 (average daily heat accumulation above 50 degrees Fahrenheit from
bloom to harvest) + b2 (crop load ratio) + b3 (berry size) + b4 (ratio of light intensity in the fruit zone to
ambient light intensity) + b5 (deficit of crop evapotranspiration from reference evapotranspiration) + b6
(sugar concentration) + b7 (phenolic concentration) + b8 (titratable acidity) + b9 (pH) + b10 (leaf
nitrogen content at veraison) + b11 (leaf potassium content at veraison) + b12 (rot and mildew levels) +
b13 (grape flavor) + random error
In that model,
Average daily heat accumulation is the sum of (daily maximum temperature minus 50 degrees)
divided by the total number of days. Grapes need enough heat in order to ripen. Too much heat
accumulation after ripening causes the vines to respire away their acids. Cooler weather is more
conducive to retention of flavor components and acid.
Crop load ratio is defined as the ratio of leaf area to the summed weight of all the clusters (on a per-
vine basis). The predicted crop weight is used prior to harvest. Other measures may be substituted
for crop weight and leaf area. The summed weight of the wood (canes) pruned off the vine during the
winter preceding the growing season is the most typical substitute for leaf area. The canes can be
thought of as the scaffolding on which the leaves and clusters are displayed. The number of clusters
per vine is usually evaluated prior to bloom (because they are easy to count before the shoots get
long), and adjusted downward if necessary after the berries are formed. These ratios have optima
that can be found in a section of a vineyard with highest wine grapes quality.
Vineyard evapotranspiration is the daily water consumption of the vineyard expressed as a depth of
water. Evapotranspiration is a combination of evaporation and transpiration. The latter term refers
to water that is picked up by plant roots, and which is transported through the plant to exit from the
leaves. Reference evapotranspiration is a convention agreed to mean the daily water use of a crop of
grass or alfalfa that covers 100 percent of the ground and is not short of water. The water use of any
crop can be related to the water use of this reference crop. This term has an optimum as well. Berry
size decreases for deficits greater than approximately 20 percent of full‐vine water use, but the
grapes will develop off flavors if water is restricted below approximately 20 percent of potential
evapotranspiration, although this cutoff probably differs between varieties.
Small values of berry size are preferable.
Potassium attaches to the tartaric acid anion to produce potassium tartrate, which is observable as
crystals in the bottom or on the cork of a wine. A lot of potassium in the berries may raise juice and
wine pH, which is undesirable. Lack of potassium will retard or prevent the accumulation of sugar in
the grapes.
Veraison is the point in the vine’s annual cycle when the grapes begin to ripen.
Rot and mildew impart flavors to wine. There is a chemical assessment of rot, but it can be also
evaluated visually in the field or in the load of grapes using point scoring. Less is better for most wines.
All terms in the models above, except phenolic concentration, the tartaric to malic ratio, and rot and mildew
levels, have optima. This suggests using a second-order polynomial for most of the model’s terms, such as
b60⋅(sugar concentration) + b61⋅(sugar concentration)² instead of b6⋅(sugar concentration), or regression splines
as in the generalized additive model introduced earlier in this chapter.
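A tiny sketch of the second-order term: with b60·x + b61·x² and b61 < 0, the contribution has an interior optimum at x* = −b60/(2·b61). The coefficient values and the sugar range are made up for illustration; the corresponding two columns of a regression design matrix are also shown.

```python
import numpy as np

# Hypothetical quadratic contribution of sugar concentration to grape quality.
b60, b61 = 1.8, -0.04               # b61 < 0 gives an interior optimum
sugar = np.linspace(15, 30, 100)    # e.g., degrees Brix
contribution = b60 * sugar + b61 * sugar**2
optimum = -b60 / (2 * b61)          # sugar level at which the contribution peaks (here 22.5)
print("optimum sugar concentration:", optimum)

# The corresponding columns of the regression design matrix for this term.
X_sugar = np.column_stack([sugar, sugar**2])
```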
The grapes quality model can be evaluated at harvest. It can also be evaluated before harvest, perhaps a few
weeks prior, to predict wine quality.
The regression coefficients may differ for each grape variety within the two main classes of white and red
wines. White wines are not fermented on their skins as often as red wines (where fermentation on skins is
done for color extraction). An example of a difference within the red wine category is that a pinot noir would
not be expected to have as much color as a syrah because the two varieties differ in the amounts of pigments
they normally contain.
A spatial component can be added by allowing locally varying regression coefficients a0‐a7 and b0‐b13 or by
using weighted averages of the nearest observations. These models are discussed in chapter 12. We suggest
using the Bayesian spatially varying coefficient model (see chapter 12 and appendix 3).
Maps of the estimated distribution of grape quality in the subregions of the vineyard and the spatially varying
regression coefficients can be used for separating wine grapes of different quality for wine production of
greater variety, from the best possible to average, from the same vineyard. The results of the regression
analysis can also be used for planning and improving the next harvest.
WINE PRICE FORMATION
Intuitively, lower‐priced wines should score lower. In reality, there is a lack of correlation between price and
pleasure for both expensive and inexpensive wines, although reasons for the price randomness can be
different.
The wine price formation can be represented as a weighted sum of historic trend, wine grape quality, and
grape processing. Historic trend means the features associated with the label of the bottle, such as vintage,
name, and ranking. Grape quality is defined subjectively (or objectively as discussed above) and includes the
consequences of weather and soil type. Grape processing consists of the technology of grape picking,
pressing, and winemaking.
Wine price includes a perceptual component—an allusion to the fact that two identical juices may be
perceived very differently if one juice comes from Burgundy and another from Chile. Historic trend dominates
prices of expensive wines. For example, prices of top Bordeaux wines are still heavily influenced by the 1855
classifications, although vines, people, technology, and climate have changed over the past 150 years.
The number of prestigious wine tasters who are able to distinguish quality parameters within a wine variety
with small error is very small. These individuals are known by name and are oriented toward prestigious
wine tastings. This is bad news because, in contrast to blind‐taste evaluations of whiskies, the less expensive
samples of wines are usually preferred by nonexperienced tasters.
Unfortunately, objective wine quality plays an insignificant role in today’s wine market prices. Objective wine
grapes quality modeling could improve predictability of wine characteristics and, consequently, it may help in
pricing wines objectively.
FORESTRY
Locations of trees depend on positions of other trees, soil characteristics, slope, and forest management in the
past. Forest stands are usually inhomogeneous, and their characteristics vary in space. Usually the data for
forest stands are analyzed in statistical literature using point pattern analysis techniques. Sometimes tree
data are analyzed as regional data where the random variable of interest is, for example, the presence or
absence of some disease in forest plots.
Figure 6.26 shows shrub heights and diameters over the interpolated map of organic matter. Data were
collected in Spain on a 10-by-10-meter area after a forest fire, meaning that the distribution of shrubs is
natural. These data are analyzed in chapter 13 using point pattern analysis theory.
Courtesy of Jorge Mateu, Universitat Jaume I (UJI), Castellon, Spain.
Figure 6.26
Manual collection of tree samples in the forest is difficult and expensive. The limited number of samples that
can be collected is insufficient for a description of forest features with reasonable accuracy. An application of
aerial photo interpretation to tree species identification and the assessment of disease and insect infestation
allows incorporating statistical modeling into management strategies for maximizing forest yield, since
hundreds and thousands of samples can be collected in different territories and identified quickly.
Figure 6.27 illustrates the individual tree identification from combined LIDAR and multispectral airborne
imagery. A semiautomatic method allowed researchers to recognize nearly all dead trees, about 89 percent of
the live spruce, and about 62 percent of the deciduous crowns; this performance can be improved.
Figure 6.28 shows two images and ground data collected in the Bavarian Forest National Park. Locations of
spruce trees of different sizes are displayed in the image at left, and large trees of three types are displayed in
the image at right. A histogram shows distribution of the trees’ heights (units are in meters). Altitude data
from which a slope surface can be created and used as a covariate in tree intensity modeling are also
available.
Courtesy of Jorge Mateu, Universitat Jaume I (UJI), Castellon, Spain.
Figure 6.28
The spatial patterns of the small (smaller than 4 meters) and large (larger than 12 meters) spruce trees are visually
different, as shown in figure 6.28. Briefly introduced in chapter 2, the K function can be used to compare the
distribution of the observed tree locations with random points simulated in the same area. Figure 6.29 shows
a normalized variant of the K function called the L function. A red line is the observed L function of the large
spruce locations. Two hundred simulations of a similar number of random points in the polygon around
spruce locations were produced, and their maximum and minimum L function values are displayed as blue
lines. The observed L function of the large spruce locations (figure 6.29 at left) is inside the blue lines,
suggesting that large trees are distributed randomly. However, the L function calculated using small spruce
locations is outside the blue lines for all distances between pairs of the tree locations (figure 6.29 at right),
confirming our visual impression that small green points in figure 6.28 at left are clustered.
Figure 6.29
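A sketch of the envelope logic behind figure 6.29, using a naive L function without edge correction: the empirical L of a point pattern is compared to the pointwise minimum and maximum of L computed from complete-spatial-randomness simulations in the same window. The window size, number of points, and distances are illustrative, the observed pattern here is itself random, and point pattern packages such as spatstat in R apply edge corrections that this toy version omits.

```python
import numpy as np
from scipy.spatial.distance import pdist

def l_function(points, r, area):
    """Naive (uncorrected) Ripley's K transformed to L = sqrt(K / pi)."""
    n = len(points)
    d = pdist(points)
    k = np.array([area * 2 * np.sum(d <= ri) / (n * (n - 1)) for ri in r])
    return np.sqrt(k / np.pi)

rng = np.random.default_rng(7)
side, n = 100.0, 150
r = np.linspace(1, 25, 25)
observed = rng.uniform(0, side, (n, 2))          # stand-in for the observed tree locations

l_obs = l_function(observed, r, side**2)
sims = np.array([l_function(rng.uniform(0, side, (n, 2)), r, side**2)
                 for _ in range(200)])           # 200 simulations of random points
lo, hi = sims.min(axis=0), sims.max(axis=0)      # the simulation envelope (the blue lines)
outside = np.any((l_obs < lo) | (l_obs > hi))
print("outside the envelope at some distance:", outside)
```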
Spruce locations consist of two patterns and their joint modeling is difficult. Modeling of the locations of large
trees is relatively easy since their distribution is close to random. The distribution of the locations of small
trees is complicated and can be explained using covariate variables. For example, regression analysis using
grids with calculated distances to the nearest large tree (figure 6.30 at left) and slopes (figure 6.30 at right)
gives the following result:
Log(intensity of small trees) = 0.14 +
+ 0.127·(distance to the nearest large tree) + 0.056·slope,
where constant intensity 0.14 corresponds to a random (Poisson) process.
Courtesy of Jorge Mateu, Universitat Jaume I (UJI), Castellon, Spain.
Figure 6.30
Forest yields are often described using stand basal area, which is estimated using tree diameters at breast
height (dbh):
Stand basal area = Σi=1,…,n π·(dbhi/2)²,
where n is the number of trees in the stand. Tree diameters at breast height cannot be estimated from LIDAR
images directly, and a linear dependence between tree crowns or heights and dbh is assumed in applications:
dbhi = α·crowni or
dbhi = β·heighti,
where α and β are constants. Therefore, knowledge about tree crowns or tree heights distribution can help in
estimating forest yields.
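A short sketch of this yield calculation: crown diameters (for example, derived from LIDAR) are converted to dbh with an assumed coefficient α and summed into stand basal area. The crown widths and the value of α are purely illustrative, and both are taken to be in meters.

```python
import numpy as np

crown = np.array([3.2, 4.8, 6.1, 2.7, 5.5])   # crown diameters in meters (hypothetical)
alpha = 0.05                                   # assumed linear relation dbh = alpha * crown

dbh = alpha * crown                            # tree diameters at breast height, in meters
stand_basal_area = np.sum(np.pi * (dbh / 2) ** 2)   # sum of tree cross-sections at breast height
print(round(stand_basal_area, 4), "square meters")
```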
A nonstationary multitype Poisson process with tree crowns or heights as covariates can be fitted to tree
locations. For example, crowns of mixed stands were divided into three classes, small (less than 5 meters),
medium (between 5 and 13 meters), and large (between 13 and 22 meters), and the resulting model of tree
intensity can be the following:
Log(intensity of mixed stand) = 0.008 +
+ 0.087·(medium size crown) − 0.606·(large size crown)
According to the fitted model, tree intensity is less near trees with large crowns, as expected.
Figure 6.31 shows three simulations of the tree locations with small, medium, and large crowns using the
fitted intensity model above. If the model was correctly fitted, we may find such distributions of trees in the
forest. If so, the fitted model may help in the management of forests.
Figure 6.31
Note that the fitted model may generate a different number of each type of tree every time. This is because
the intensity model is probabilistic, and it generates random values from the estimated model’s parameter
distributions.
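One common way to generate such simulations from a fitted intensity surface is thinning: candidate points are drawn from a homogeneous Poisson process with rate equal to the maximum intensity and kept with probability λ(s)/λmax. The sketch below uses an arbitrary stand-in intensity function and window, not the fitted model from the text, and the number of simulated trees varies from run to run as described above.

```python
import numpy as np

rng = np.random.default_rng(8)
side = 100.0

def intensity(x, y):
    """Stand-in for a fitted log-linear intensity, e.g. exp(a + b * covariate(x, y))."""
    return np.exp(-3.0 + 0.02 * x)

lam_max = intensity(side, 0)                        # upper bound of the intensity in the window
n = rng.poisson(lam_max * side * side)              # candidate points from a homogeneous process
x, y = rng.uniform(0, side, (2, n))
keep = rng.uniform(0, 1, n) < intensity(x, y) / lam_max   # thinning step
trees = np.column_stack([x[keep], y[keep]])
print(len(trees), "simulated tree locations (varies from run to run)")
```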
CRIMINOLOGY
The risk of becoming a victim of crime varies geographically. Police departments try to identify
concentrations of crime (“hot spots”), determine what causes these concentrations by linking crimes and
underlying social conditions, and develop actions to reduce the hot spots. A hot spot is an area that has a
greater than average number of crimes so that people have a higher than average risk of victimization. Hot
spots vary in how far above average they are. Cool spots are areas with less than the average amount of
crime.
The first step in crime distribution analysis is mapping the density of crimes. The next step is fitting a model
that explains the density of crimes by the covariates and that can generate crime locations with observed
statistical features for simulation studies. Case studies using burglary and auto theft locations are presented
in chapters 13 and 16.
From the regional data analysis point of view, epidemiological and crime data are similar. Examples of
epidemiological data would include thyroid and lung cancer rates, and examples of crime data would include
arson or burglary rates. The statistical models and tools discussed in chapters 11‐12 using mostly
epidemiological data are also applicable to crime data. Another common, and this time unfortunate, feature of
practical epidemiological and crime data analyses is the usage of tools and models outside of their
applicability.
Figure 6.32 at left shows the ratio of the observed to the expected violent crimes (a sum of murder, rape,
robbery, and aggravated assault) from Houston, Texas. This map can be interpreted as the local relative
risk, measuring increases or decreases over the counts expected under the equal probability model, so that
regions with values greater than 1 have more observed crimes than expected, and regions with
values less than 1 have fewer crimes than expected. The expected number of crimes in region i is
defined as Ei = r·ni, where r is the total number of observed crimes divided by the total population size
and ni is the population size in region i.
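A small sketch of this calculation: r is the overall crime rate, Ei = r·ni, and the mapped quantity is the ratio of observed to expected counts. The counts and population sizes are made up for illustration.

```python
import numpy as np

observed = np.array([12, 45, 7, 30, 19])           # hypothetical crime counts per tract
population = np.array([3000, 9000, 2500, 5000, 4000])

r = observed.sum() / population.sum()              # overall rate across all tracts
expected = r * population                          # E_i = r * n_i
relative_risk = observed / expected                # > 1: more crimes than expected
print(np.round(relative_risk, 2))
```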
The available explanatory variables are the total alcohol sales and the number of illegal drug arrests in
Houston tracts. Figure 6.32 at right and figure 6.33 at left show the standardized logarithms of these
variables. Publicly available data were collected by the authors of reference 12 of “Further reading” from the
city police department Web site and the Web site of the Texas Alcoholic Beverage Commission.
Data from City of Houston Police Department and Texas Alcoholic Beverage Commission.
Figure 6.32
A statistical method for reducing the number of variables by exploiting the information that is shared by the
observed variables is called factor analysis. In factor analysis, the observed variables are assumed to be the
linear combinations of latent variables or factors. Spatial factor analysis generalizes classical factor analysis
assuming that the observed and latent variables are spatially correlated.
Data from City of Houston Police Department and Texas Alcoholic Beverage Commission.
Figure 6.33
One possible spatial factor model can be written as a collection of one Poisson regression and two normal
linear models (these models are discussed in chapter 12 and appendix 2), in which the three observed variables
share a common spatial factor fi: μ1i, μ2i, and μ3i are the mean values of the violent crimes, log standardized
alcohol sales, and log standardized drug arrests; ε1i, ε2i, and ε3i are normally distributed and spatially unstructured
processes; and the intercepts and factor loadings are unknown constants. Spatial correlation in the common factor fi
is introduced through a conditional autoregressive model, which requires specification of the neighboring regions.
One possible variant of connections between the polygon centroids and the nearest 5–8 centroids is shown in figure
6.33, right.
The spatial factor model presented above can be fitted using the freeware Bayesian software WinBUGS (see
appendix 3). In figure 6.34, the image at left shows the estimated common spatial factor, while the image at
right shows estimated distributions of the common factor in three selected tracts.
Data from City of Houston Police Department and Texas Alcoholic Beverage Commission.
Figure 6.34
In the adopted Bayesian model, the common crime factor is estimated as the posterior expectation of the latent
factor given the observed variables and model parameters. The factor is the weighted average of three
variables. The weights are functions of model parameters. They are derived from data by maximizing
available information in the area, meaning that weights will be different in other cities. Common factor values
are estimated not as single numbers, but as distributions that reflect factors’ uncertainties, which can be of
direct interest to the decision makers. The simulated values can be further used in geoprocessing as
discussed in chapters 5 and 10.
RANDOM VARIABLES AND RANDOM FIELDS
Spatial statistical models are based on the assumption that at least a part of the data can be modeled by a
random process. There are two sorts of randomness. The first is unpredictable even in theory. The second is
due to the complexity of the data. Many natural and artificial phenomena are complex, but they follow rules
that can be reconstructed by statistical models.
Probability theory studies events that cannot be predicted exactly because they depend at least in part on
conditions we do not know or know only approximately. Not all uncertain events can be analyzed using a
probabilistic approach but only those that are statistically stable—that is, those that can be repeated (at least
theoretically) under the same conditions and produce a result described by the same frequency distribution.
Objective probability theory is usually considered appropriate for natural sciences while subjective (largely
Bayesian) probability theory may be more suitable for social sciences. This is because two experiments are
always different in at least some ways and the differences can be neglected in environmental applications but
rarely in social studies because, for example, the probabilities of many economic events cannot be interpreted
in terms of repeated random experiments. Participants of the social processes have imperfect understanding
about those processes, and their beliefs are often full of errors: “… market prices are always wrong in the sense
that they present a biased view of the future… not only do market participants operate with bias, but their bias
can also influence the course of events” (George Soros, reference 11 in “Further reading”).
Many beliefs are social in character. Political parties, particular religions, and fan clubs for sports teams have common beliefs within each group, and the probabilities that members of a group will act in a particular way are shaped by those shared beliefs.
Statistical stability seems a serious restriction, but a surprisingly large number of natural and even social
events follow statistical laws. When modeling spatial data probabilistically, one expects that statistical data
features at unsampled locations are similar to the observed data features. Randomness is a characteristic of
the statistical model that helps to describe data, make estimations and predictions, and interpret outputs.
Variables that take different values when observations are repeated under the same conditions, and that have statistical stability, are called random variables. Statistics refers to random variables, not data. Statistics describes phenomena and makes predictions that are not absolutely precise.
Figure 4.4 at the beginning of chapter 4 showed a histogram of the maximum monthly concentration of ozone
(parts per million) at Lake Tahoe during the period 1981–85. These data can be approximately described by
the normal distribution, as displayed with the continuous blue line. It has only two parameters: the mean
value, around which all values are grouped, and the standard deviation, which controls the shape of the line
(the smaller the standard deviation, the narrower the normal distribution line).
If the theoretical distribution of the data is known, a single realization from the distribution can be drawn.
Today, most programming languages, spreadsheets, and calculators have a function, rand() or rnd(), that
generates a random number between 0 and 1. A sequence of these numbers will form a uniform distribution.
Figure 6.35 shows a histogram when the rand() function was used 100 times (left), 300 times (center), and
1,000 times (right).
Figure 6.35
A standard normal value N(0,1) can be obtained from two uniform random numbers using the Box–Muller transformation:
a = rand()
b = rand()
c = sqrt(−2⋅ln(1−a))
d = 2⋅π⋅b
N(0,1) = c⋅cos(d)
A value from the normal distribution with a mean of µ and a standard deviation of σ is
N(µ,σ) = µ + σ N(0,1)
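A short sketch of this recipe in code (NumPy is assumed; the mean and standard deviation below are placeholders, not the actual South Lake Tahoe ozone statistics, and numpy.random.normal gives the same result directly):

import numpy as np

rng = np.random.default_rng(0)

def box_muller_normal(mu, sigma, n):
    a = rng.random(n)                     # uniform values on [0, 1)
    b = rng.random(n)
    c = np.sqrt(-2.0 * np.log(1.0 - a))   # Box-Muller transformation
    d = 2.0 * np.pi * b
    z = c * np.cos(d)                     # standard normal N(0, 1)
    return mu + sigma * z                 # rescale to N(mu, sigma)

sample = box_muller_normal(mu=0.06, sigma=0.01, n=60)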
Figure 6.36 shows a histogram of 60 simulations from the normal distribution, using the mean and standard
deviation of ozone at the city of South Lake Tahoe in California.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 6.36
We would not be surprised to find that such monthly ozone concentration values had actually been observed at South Lake Tahoe.
Assuming that the data distribution is known at each monitoring station or estimated from the available
repeated measurements, we can repeat the exercise of fitting histograms of ozone concentration to normal
distribution and drawing values from the distribution in other California cities. This realization of spatial data
is called a random field or stochastic process. The term field came from physics as an analogy to magnetic or
gravitational fields.
Interpolation applied to a sequence of real or simulated values at each city location results in a sequence of
surfaces. Instead of displaying all of them, we can show the most typical one and a couple that differ
substantially from the typical one. In practice, the required surfaces are constructed from the prediction and from the bounds of the 95 percent prediction interval
[prediction – 1.96⋅(prediction standard error), prediction + 1.96⋅(prediction standard error)]
Only one of the seven data ranges (0.2–0.3 in yellow) is observed in both the top and bottom surfaces,
indicating that cadmium contamination varies greatly.
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 6.37
Predictions with associated uncertainties are possible because properties of the random variables are
assumed to follow some laws. Correlations between random variables represent the unseen, complex connections between samples. We can imagine that there is some mechanism that governs the process under investigation, as
shown in the image of an engine in figure 6.38.
Photo by the author.
Figure 6.38
A consequence of the discussion on random variables above is that a dataset of N observations observed in
N locations is just one sample from one random surface, not N samples. The mean value of the sample Z(s) at
the location s is considered as an average of all possible random draws from the random surfaces at the
location s. In practice, observations are usually not repeated at each measurement location many times, and
by “replications under similar conditions,” spatial statistics considers analysis of neighboring data in different
parts of the data domain, assuming that statistical features of the data such as mean and variance are the
same.
STATIONARITY AND ISOTROPY
The statistical approach requires observations to be replicated in order to estimate prediction uncertainty.
This is called data stationarity.
In time‐series data analysis, the random function is called stationary if the distribution function
Ft(τ) = Probability{Z(t) < τ}
where τ is a specified data threshold, remains the same if the group of time observations t1, t2, …, tn is shifted
along the time axis. A stationary process Z(t) describes the change in time of a completely stable phenomenon with a distribution function that does not depend on time t, but only on the differences t2 − t1, t3 − t2, and so on. Stationary time‐series values have the same amplitude; they do not increase or decrease
systematically, and their short‐time changes are similar at different times. The mean (expected value) of a
stationary process is a constant.
In two dimensions, stationarity means that statistical properties do not depend on exact locations. That
means that the mean of a variable at one location is equal to the mean at any other location; data variance is
constant in the area under investigation, and the correlation between any two locations depends only on the
vector that separates them, not on their exact locations.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 6.39
A stationary Gaussian random field is completely described by the mean (expected value) and covariance
(spatial correlation modeled as function of distance between pairs of locations). This makes modeling
relatively easy in comparison with non‐Gaussian data that may require information about additional
statistical characteristics of the data. This is why geostatistical analysis often begins with verification of the
assumption of the data normality.
If locations do not have values, stationarity of the point process means that the change in the point density depends on the distance between event locations, not on the locations themselves (see chapter 13).
Of the several types of stationarity, the strongest one states that the probability distribution is the same at
any location. This assumption is often non‐realistic, and weaker types of stationarity are commonly used in
practice. The most common assumption in geostatistics is second‐order stationarity that requires a constant
mean and spatial covariance that depends only on the distance and direction separating any two locations.
Isotropy means that the magnitude of spatial dependence is the same in all directions.
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 6.40
The total number of pairs is very large, and the extent of data dependency is difficult to recognize from the semivariogram cloud. Data dependency becomes evident when the semivariogram is plotted in polar coordinates with a common center for all pairs: each pair is plotted at the angle of the line connecting the two sample points and at the distance between them along that line (figure 6.41). Then the pairs are covered by a grid whose cell size, called the lag, is larger than the minimum distance between points. The cell size of the grid can be given various values to track data dependency at various scales. Semivariogram values that fall in each cell are then averaged, and the cells are colored according to their average semivariogram values. Cool colors like blue and green are used for low average semivariogram values, and warm colors like red and yellow denote high values. The color scale next to the semivariogram graph links the values of the semivariogram graph to the semivariogram surface.
Figure 6.41
In the example in figure 6.41, the semivariogram surface shows a clear difference in the semivariogram
values in the northwest and northeast directions: the values of eastern and western points are less different
from each other than those of northern and southern points.
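A rough sketch of the semivariogram-surface computation just described, assuming a NumPy array coords of point coordinates and an array z of data values (plotting and the color scale are omitted):

import numpy as np

def semivariogram_surface(coords, z, lag, n_lags):
    # average the half squared differences of pairs, binned by the separation vector
    grid = np.zeros((2 * n_lags, 2 * n_lags))
    count = np.zeros_like(grid)
    n = len(z)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = coords[j] - coords[i]
            gamma = 0.5 * (z[i] - z[j]) ** 2
            # each pair contributes to two symmetric cells (+h and -h)
            for sx, sy in ((dx, dy), (-dx, -dy)):
                col = int(np.floor(sx / lag)) + n_lags
                row = int(np.floor(sy / lag)) + n_lags
                if 0 <= col < 2 * n_lags and 0 <= row < 2 * n_lags:
                    grid[row, col] += gamma
                    count[row, col] += 1
    return np.where(count > 0, grid / np.maximum(count, 1), np.nan)

Changing the lag argument tracks data dependency at different scales, as described above.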
Figure 6.42 shows an example of spatial dependence detection in the case of polygonal input data, lung cancer
rates in males in California. Geostatistical Analyst calculates and uses centroids of the polygons as data
locations. One unusually large semivariogram value is highlighted in the graph; a pair of polygons from which
the semivariogram value was calculated is automatically highlighted on the map. A semivariogram surface
shows data anisotropy clearly: rapid change of the cancer rates in the direction perpendicular to the ocean
and slow data variation along the coastline.
U.S. Geological Survey, 2008, Cancer Mortality in the United States: 1970–1994.
Courtesy of U.S. Geological Survey, Reston, Va.
Figure 6.42
MODEL DIAGNOSTIC
Model diagnostics can be viewed as a collection of tools for comparing observed data to what can be obtained under a statistical model and for detecting the differences between the observed data and predictions. Knowing how much the data depart from the fitted model may suggest possible model improvements. The more complex the model and the stronger the model assumptions, the more important model diagnostics become in detecting areas of model misfit. It is better to look at many diagnostics rather than at a single, albeit useful, test such as a residual plot.
One common approach for testing model assumptions is graphical comparison of a reference distribution with the actual data. The statistical literature proposes several general methods for defining the reference distribution, including simulation from the model used for data fitting, permutation tests, bootstrapping, validation, and cross‐validation. The simulated data are then interpreted as an approximate reference distribution.
Having reference distributions, we are interested in test statistics, which can be displayed as scatterplots,
quantile‐quantile plots, or histograms as in the example of estimation of the correlation coefficient between
two spatial variables in chapter 5. If the data show patterns that do not generally appear in the reference
distributions, then there is a potential misfit of the model to the data.
The map in figure 6.43 shows 40 observations of average January temperature (in Celsius) in 2004 in
Catalonia and continuous predictions made by kriging. The number of data is small, so the true semivariogram model cannot be reconstructed very accurately. In this case, the Bayesian version of kriging is
useful because it estimates distributions of the kriging parameters instead of supposedly optimal constant
values (see chapter 9 and appendix 3). Figure 6.43 at right displays distributions of the values of range, sill,
and mean parameters together with their median (vertical red line) and mean values (vertical green line). We
see that there is substantial uncertainty about the parameters of the semivariogram model. Bayesian kriging incorporates this uncertainty into the prediction standard errors.
From Generalitat de Catalunya, Departament de Medi Ambient i Habitatge, Direcció General de Qualitat Ambiental.
Figure 6.43
The graph in the top left part of figure 6.43 shows the difference between predicted and observed values
(errors) versus observed values. Points in the graph do not show any clear pattern or trend, indicating that
our kriging model is reliable.
Figure 6.44 shows distributions of the predictions in four measurement locations and their measured values
(red vertical lines). We see that deviation from the measured values is small and approximately symmetric so
that we should be satisfied with the predictions. The prediction distributions in figure 6.44 can be used for estimating measurement errors in the data locations (by calculating standard deviation values).
Figure 6.45 shows distributions of the predictions at four unsampled locations in different parts of Catalonia.
All distributions are nearly symmetric, and predictions are less certain than predictions at the sampled locations, as they should be.
From Generalitat de Catalunya. Departament de Medi Ambient i Habitatge. Direcció General de Qualitat Ambiental.
Figure 6.44
Figure 6.45
It is instructive to investigate what happens if we change one of the largest temperature values 6.5°C
(highlighted in figure 6.43 by white lines) to 0°C. Figure 6.46 shows a new prediction map and graph of the
error versus observed (including wrong 0) values. Five points with the largest errors are highlighted in the
graph and on the map (white crosses). Four prediction distributions are shown in figure 6.47. Selected
unsampled locations for predictions are the same as in figure 6.44. From figure 6.47, we see that
Most of the predictions at the location with the wrong 0 value (top right) are very different from 0. In fact, the mean and median of the prediction distribution are closer to the correct value 6.5°C than to 0°C.
The prediction at the nearest sampled location with observed value 9.7°C (bottom left) is also seriously biased.
Mean values of the predictions at two locations situated relatively far from the wrong 0 value (top left and bottom right) did not change much, but the prediction uncertainties (the spread of the prediction distributions) increased in comparison with the prediction distributions using correct values in figure 6.44.
Figure 6.47
In summary, there are differences between good and bad data and between good and bad models, and model diagnostics may help to reveal both kinds of problems.
In contrast to classical statistics, permutation tests are often not appropriate for spatially related data (see the example in chapter 16). Bootstrapping, introduced in chapter 12, is also rarely used in spatial statistics. The
validation and cross‐validation diagnostics discussed in the following sections of this chapter are the most
popular. Other diagnostics, including AIC (Akaike’s information criterion) are discussed in the following
chapters and appendices after introducing the statistical models.
Some applications require a decision in the form of yes/no: add fertilizer or not, take a sample or not, rainy
weather or not, healthy air or not, and so on. For some applications, a threshold can be specified to separate
yes from no; for example, eight hours’ maximum ozone concentration greater than 0.08 parts per million
indicates unhealthy air in California.
A common way to verify indicator predictions is to employ a contingency table. The four possible
combinations of predictions and observations (the joint distribution) are:
1. hit—event was predicted and did occur
2. miss—event was not predicted, but did occur
3. false alarm—event was predicted but did not occur
4. correct rejection—event was not predicted and did not occur
A perfect prediction model produces only hits and correct rejections.
The total number of observed and predicted occurrences and non‐occurrences (the marginal distribution) are
given at the lower and right sides of table 6.1 (contingency table).
Contingency table

                            observed
                    yes             no                   total
predicted   yes     hits            false alarms         predicted yes
            no      misses          correct rejections   predicted no
            total   observed yes    observed no          total

Table 6.1
Suppose that over the course of a year unhealthy air quality predictions and observations were made and are
summarized in table 6.2 for one particular city:
Contingency table

                            observed
                    yes     no      total
predicted   yes     60      25      85
            no      28      252     280
            total   88      277     365

Table 6.2
Several statistics can be computed from the elements in table 6.2 to describe prediction performance.
The accuracy statistic shows what fraction of the predictions is correct:
(hits + correct rejections) / total = (60 + 252) / 365 ≈ 0.85,
that is, 85 percent of the air quality predictions were correct.
The bias score, the ratio of the total number of predicted yes events to the total number of observed yes events ((hits + false alarms) / (hits + misses) = 85/88 ≈ 0.97 here), shows whether events are predicted more or less often than they are observed.
The detection score, hits / (hits + misses) = 60/88 ≈ 0.68,
shows what fraction of the observed events was correctly predicted. The detection score is good for rare
events but ignores false alarms.
The false detection rate, false alarms / (false alarms + correct rejections) = 25/277 ≈ 0.09,
shows what fraction of unobserved events was incorrectly predicted. False detection is sensitive to false
alarms but ignores misses.
The skill statistic
shows how well the prediction separates the yes events from the no events. The skill statistic uses all
elements in the contingency table. A skill statistic value of zero indicates no skill, and a value of one indicates
perfect skill.
The odds ratio, (hits × correct rejections) / (false alarms × misses) = (60 × 252) / (25 × 28) ≈ 21.6,
shows the ratio of the odds of a yes prediction being correct to the odds of a yes prediction being wrong. The
larger the odds ratio, the better the prediction.
Statistics discussed in this section can be generalized from binary yes/no events to prediction of multiple
categories of events using a multicategory contingency table (see the first reference at the end of this
chapter).
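As a small worked example, the statistics above can be computed directly from the cells of table 6.2 (the skill statistic is left out because its exact formula is not reproduced here, and the odds ratio is taken as the standard cross-product ratio of a 2 × 2 table):

hits, false_alarms = 60, 25
misses, correct_rejections = 28, 252
total = hits + false_alarms + misses + correct_rejections              # 365

accuracy = (hits + correct_rejections) / total                         # about 0.85
detection = hits / (hits + misses)                                     # about 0.68
false_detection = false_alarms / (false_alarms + correct_rejections)   # about 0.09
odds_ratio = (hits * correct_rejections) / (false_alarms * misses)     # about 21.6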
The most popular methods for verifying predictions of continuous variables are cross‐validation and
validation.
CROSS‐VALIDATION
Cross‐validation uses all of the data to estimate the model. Then it removes each data location, one at a time,
and predicts the associated data value. For all points, cross‐validation compares the measured and predicted
values. In figure 6.48, the fitted line in the prediction plot through the scatter of points in Geostatistical
Analyst is given in blue, with the equation given just below the plot. All interpolators tend to overestimate
low data values and underestimate high data values, and the slope of the blue line is almost always smaller
than 45 degrees.
Figure 6.48
The cross‐validation dialog shown in figure 6.48 includes a diagnostic result for each observed value (the
table at bottom right), prediction errors statistics (bottom left), and a series of graphs that visualize data in
the table. Graphs and tables are linked through a selected row in the table or a point in the graph, in green.
The first graph is a scatter plot of the predicted versus measured values. If all the data are independent (no
spatial correlation), every prediction would be close to the mean of the measured data, so the blue line would
be horizontal. With strong spatial correlation and a good model, the blue line should be closer to the black
dashed 1:1 line.
The error plot in figure 6.49 is the same as the prediction plot, except here the measured values are
subtracted from the predicted values. This plot helps show the degree of underestimation of large values and
the degree of overestimation of small values.
Figure 6.49
For the standardized error plot (figure 6.50), the measured values are subtracted from the predicted values,
and the result is divided by the estimated prediction standard error. The standardized error plot appears
when the statistical model is used, since deterministic models cannot estimate the prediction uncertainty. The
standardized error is unit free while the error has the same units as the data. Unit‐free errors calculated for
different variables can be compared.
Figure 6.50
Figure 6.51
Points that deviate from the lines more than the average standard error are potential data outliers. Ideally,
they should be corrected or excluded from the dataset before final model creation because most statistical
models, including kriging, are very sensitive to data outliers.
Summary statistics on the model prediction errors in the lower left corner of the dialog in figure 6.51 examine
the average performance of the prediction model.
The averaged difference (mean prediction error) between predicted Ẑ(si) and measured z(si) values,
(1/n)·Σ (Ẑ(si) − z(si)),
is displayed in the first row in the bottom left part of the dialog. If the mean prediction error is near zero,
predictions are centered on the measurement values. In this case they are unbiased. The mean prediction
error does not measure the magnitude of the errors. It is possible to get a perfect score close to zero for a bad
prediction if there are compensating positive and negative terms in the sum in the above formula.
The closer the predictions are to their true values, the smaller the root‐mean‐square prediction error. The root‐mean‐square prediction error is computed as the square root of the average of the squared differences between observed and predicted values,
√[(1/n)·Σ (Ẑ(si) − z(si))²].
This is shown in the second row in the bottom left part of the cross‐validation dialog in figures 6.48–6.51. The
root‐mean‐square prediction error calculates the average error, weighted according to the square of the
error. It puts greater influence on larger errors than smaller errors, which is good if large errors are
especially undesirable. For a model that provides accurate predictions, the root‐mean‐square prediction
error should be as small as possible.
Statistical prediction such as kriging provides the prediction standard error σ̂(si) for the location si. This allows for additional cross‐validation statistics. Geostatistical Analyst calculates the average standard error
√[(1/n)·Σ σ̂²(si)],
the mean standardized prediction error
(1/n)·Σ (Ẑ(si) − z(si)) / σ̂(si),
and the root‐mean‐square standardized prediction error
√[(1/n)·Σ ((Ẑ(si) − z(si)) / σ̂(si))²].
These three diagnostics are shown in the third, fourth, and fifth rows in the bottom left section of the cross‐
validation dialog.
The first two diagnostics above should be as small as possible. If the kriging standard errors are close to the root‐mean‐square prediction errors √[(1/n)·Σ (Ẑ(si) − z(si))²], then the variability in prediction is correctly assessed, because two different estimates of the prediction error give very similar results. Therefore, the root‐mean‐square standardized prediction error should be close to one.
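A compact sketch of these cross-validation summaries, assuming NumPy arrays zhat (cross-validation predictions), z (measured values), and se (kriging standard errors); the average standard error is taken here as the square root of the mean kriging variance:

import numpy as np

def cv_diagnostics(zhat, z, se):
    err = zhat - z
    return {
        "mean prediction error": err.mean(),
        "root-mean-square error": np.sqrt(np.mean(err ** 2)),
        "average standard error": np.sqrt(np.mean(se ** 2)),
        "mean standardized error": np.mean(err / se),
        "root-mean-square standardized error": np.sqrt(np.mean((err / se) ** 2)),
    }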
Each summary statistic is based on specific assumptions. For example, the root‐mean‐square error depends on a quadratic loss function, meaning that larger deviations between the data and the predictions (outliers) dominate and that over‐predictions have the same scientific importance as under‐predictions. The latter are undesirable, for example, in environmental applications, where missing contaminated areas is worse than overestimating pollution.
VALIDATION
Validation first removes part of the data (the test dataset) and then uses the rest of the data (the training
dataset) to develop the model to be used for prediction and diagnostics.
In Geostatistical Analyst, the test and training datasets can be created from the data using the Create Subset
tool. The data subsetting algorithm is as follows. Suppose we want to randomly create subsets with L and M data from N = L + M data. The probability that a datum will be assigned to the first subset is L/N and to the second subset M/N. A random value is simulated in the range [0, 1] from the uniform distribution. If the simulated value is less than L/N, the first datum is assigned to the first subset; otherwise it is assigned to the second subset. Then N − 1 data remain, and the same procedure is used to assign the next datum to subsets with room for L − 1 and M, or L and M − 1, data, depending on which subset received the first datum. This procedure continues until the subsets are filled.
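A sketch of this subsetting scheme in code (a simple stand-in, not the Create Subset tool itself):

import random

def split_data(records, L, M, seed=0):
    rnd = random.Random(seed)
    first, second = [], []
    l, m = L, M
    for rec in records:
        n = l + m
        # assign to the first subset with probability l/n, otherwise to the second
        if n > 0 and rnd.random() < l / n:
            first.append(rec)
            l -= 1
        else:
            second.append(rec)
            m -= 1
    return first, second

Because the assignment probabilities are updated after each record, the subsets end up with exactly L and M records when the input contains N = L + M records.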
Validation creates a model for only a subset of the data, so it does not directly check the model, which should
include all available data. Validation checks whether the model options and parameters, such as the
transformation, detrending, semivariogram model, lag size, or prediction neighborhood work well for the test
subset of the data. If so, they usually work for the whole dataset. The validation diagnostic is more objective
than cross‐validation, because data in the predicted locations are not used in modeling. It is advisable to
repeat the validation exercise several times with different training and testing subsets of the data.
Validation and cross‐validation are very useful diagnostics, but their results should not be overrated. They
respond too much to measurement errors and not enough to stable features of the data. Generally, building
and evaluating a model using the same data can be misleading: the model always looks better than it really is.
This statement is illustrated below using simulated data. Simulated data are used because in this case we
know the exact true data values.
A spherical model with parameters range=0.2, partial sill=1, and nugget=0.1 was used to generate 150 points
randomly distributed in the unit square. Table 6.3 below shows the result of cross‐validation diagnostics for
true and estimated (the default Geostatistical Analyst semivariogram model) models.
From this table we see that there is no clear reason to prefer the true model: three cross‐validation diagnostics are better for the true model, while two others support the estimated model.
The results of validation and cross‐validation, the table at the bottom of the Geostatistical Analyst dialogs, can
be saved to calculate other statistics, for example multiplicative error and correlation coefficient.
The multiplicative error,
[(1/n)·Σ Ẑ(si)] / [(1/n)·Σ z(si)],
shows how the average prediction magnitude is compared to the average observed magnitude. A good
prediction model must have a multiplicative error close to one.
The correlation coefficient between predicted and measured values,
r = Σ (Ẑ(si) − mean(Ẑ))·(z(si) − mean(z)) / √[ Σ (Ẑ(si) − mean(Ẑ))² · Σ (z(si) − mean(z))² ],
shows the strength of the linear association between predictions and observations.
Both should be small for optimal predictions.
The percentage of the data inside the 95 percent prediction interval, assuming that the predictions and prediction standard errors are normally distributed,
(100/n)·Σ I(|Ẑ(si) − z(si)| ≤ 1.96·σ̂(si)),
should be close to 95 percent. In the last formula, I(⋅) is an indicator function, which is equal to 1 if the expression in brackets is true and 0 otherwise.
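These additional statistics can be computed from the saved validation or cross-validation table; a sketch using the same zhat, z, and se arrays as in the earlier diagnostics example:

import numpy as np

def extra_diagnostics(zhat, z, se):
    multiplicative_error = zhat.mean() / z.mean()   # should be close to one
    r = np.corrcoef(zhat, z)[0, 1]                  # correlation of predicted and observed
    inside = np.abs(zhat - z) <= 1.96 * se          # data inside the 95 percent interval
    coverage = 100.0 * inside.mean()                # should be close to 95
    return multiplicative_error, r, coverage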
In summary, cross‐validation and validation statistics are useful, but their importance should not be
overestimated since they are summaries of large amounts of information. For instance, a single summary
measure such as root‐mean‐square error provides no indication whether the model reproduces the data
better in some locations than in others. Also, we should not forget that measured data are almost never
precise and that geographical data are often processed by imperfect computer models before they are used.
Therefore, one should be careful when using data with measurement errors as “true” data.
SUMMARY OF SPATIAL MODELING
In statistical analysis, the researcher observes data and tries to learn (using hypothesis testing, parameter estimation, and prediction) about the probability distribution from which data can be generated by some
random mechanism and from which inference about spatial phenomenon under study is possible. Usually the
researcher tries several models, each with different parameters. It is recommended that researchers use
methods that allow for simulating realizations from the statistical models.
Ideally, the researcher’s choice is based on the information on the statistical features of the phenomena, the
results of the exploratory data analysis, the researcher’s knowledge and intuition, and the most relevant
statistical software available. However, more often than not, such subjective factors as internal organization
instructions and limited experience in specific software usage play a role.
The next step after choosing the statistical model or models is fitting the model to the observed data and
estimating the unknown parameters. Then various diagnostics are used for refining the fitted model and for
comparing several suitable models.
The most popular spatial models are linear and generalized linear. A linear model has the following form
data = expected value + all random errors
or
data = first‐order effects + second‐order effects
A generalized linear model uses a function of the data (called the link function) instead of the data itself. This model has the same right‐hand side as a linear model. A linear model is a particular case of a generalized linear model with an identity link function and the assumption that the data distribution is Gaussian.
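As an illustration of the difference between the two model types (statsmodels is assumed, and the variables are synthetic rather than taken from the book's data), a Poisson generalized linear model with a log link can be fitted next to an ordinary linear model:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 200)
y = rng.poisson(np.exp(0.5 + 2.0 * x))     # counts generated through a log link

X = sm.add_constant(x)
linear = sm.OLS(y, X).fit()                                   # identity link, Gaussian errors
poisson = sm.GLM(y, X, family=sm.families.Poisson()).fit()    # log link, Poisson errors
print(linear.params, poisson.params)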
First‐order effects are a variation of the mean value in space—that is, large‐scale data variation or trend.
Trend is often modeled using covariates (explanatory variables).
Second‐order effects result from the spatial correlation structure and describe local (small‐scale) data variation
using information on the neighboring values. Small‐scale data variation is usually modeled as a stationary
spatial process. Data stationarity implies that the mean value or mean surface is known, variance of the
second‐order effects is constant, and the data covariance depends on the distance and direction between
pairs of the data locations and not on the locations themselves. Probability distribution of the second‐order
effects is usually assumed to be known; most often it is a Gaussian distribution. Unfortunately, there is no
optimal and general way of choosing the most appropriate data neighborhood for the point data and optimal
measure of proximity for regional data. The distinction between first‐ and second‐order effects is not a reality
but a modeling assumption.
There are various approaches to data analysis, including deterministic, exploratory, frequentist, and
Bayesian, and it is helpful to consider them as complementary to one another rather than as competitors.
However, there are situations when a particular approach is more relevant for the observed data than others.
For example, a small data sample dictates the usage of Bayesian methods, because additional information
such as an expert opinion becomes essential in estimation and prediction in the case of insufficient amount of
data. Generally, simulations are preferable over predictions for applications in which the result of statistical
analysis is used for further geoprocessing, because simulations allow producing distributions of the values in
the next step of the data analysis, and this is extremely useful for decision‐making.
1) CHOOSE THE APPROPRIATE MODEL FOR ANALYZING THE QUANTITY OF
ADEQUATELY SIZED SHOOTS PRODUCED BY THE GRAPE VINE.
A vineyard in California is shown in the figure 6.52. Vines in the foreground have intact canes. Vines in the
background have already been pruned. Count is the quantity of good‐sized shoots produced by the vine the
previous season. This quantity of shoots is indicative of the vine’s capacity to produce grapes. Vines are
pruned to a prudent number of buds each winter, about one bud for every good‐sized shoot the farmer
expects the vine to produce the following season. What is the appropriate model for analyzing these data:
geostatistical, point pattern, or regional?
Courtesy of Bob Siegfried.
Figure 6.52
Within rhetorical pedagogy it was the practice of imitation that helped students analyze form and content.
Students were asked to observe a model closely and then to copy the form but supply new content or vice
versa. Such imitations forced students to think what exactly a given form did to bring about a given content.
Determine whether you agree with the following description of data content:
One can draw an analogy with soil pH. pH is continuous. A vine can be regarded as a sensing device like pH
electrodes. Soil pH is the result of random processes such as soil properties and the environment (rainfall,
presence of other plants, age of soil, and so on). Soil pH is detected by electrodes. The integrated effect of
numerous environmental variables is detected by a plant, grapevines in this case. Electrodes give you a
number, and grapevines give you a number—cane counts. Therefore, count data are continuous and
geostatistics is the best model to use.
Formulate discrete point and regional data models and choose the correct one.
Figure 6.53 shows malaria prevalence samples in children recorded at villages (represented by red points) in
Gambia, Africa. The variation in malarial prevalence measured from each test on the presence of malarial
parasites in a blood sample taken from a child can be modeled using the effects of age of the child and the use
of mosquito nets over beds, greenness of surrounding vegetation derived from satellite information, and the
residual spatial and nonspatial variation due to unknown or nonmeasured covariates.
What are the possible statistical models for malaria in Gambia?
Data are courtesy Diggle, P., R. Moyeed, B. Rowlingson, and M. Thomson. 2002. "Childhood Malaria in the Gambia: A Case‐Study in Model‐Based Geostatistics." Applied Statistics.
Background is from Africa 150m EarthSat.
Figure 6.53
Authors of the paper “Childhood Malaria in the Gambia: A Case‐Study,” (Diggle, P., R. Moyeed, B. Rowlingson,
and M. Thomson. 2002. Journal of the Royal Statistical Society: Series C (Applied Statistics) 51(4):493–506.)
discussed the Gambia data and proposed a statistical model for this epidemiological problem. Their results
explain the increase in prevalence of malaria with age and the protective effects of bed‐nets and show that the
residual variation is spatially structured, suggesting an environmental effect rather than variation in familial
susceptibility.
Read the paper and verify your answer.
3) CALCULATE AND DISPLAY INDICES FOR INDICATOR PREDICTION USING
OZONE CONCENTRATION MEASURED IN JUNE 1999 IN CALIFORNIA.
Daily measurements of one‐hour‐maximum ozone concentration were made at 183 monitoring stations in
June 1999 in California. Assuming that a 15‐percent increase in ozone concentration can be unhealthy for
people, the distribution of increases was predicted using average ozone values observed in the previous days
using the following algorithm:
If the average value of ozone concentration in previous days of June multiplied by 1.15 is less than today’s value,
then tomorrow is an unhealthy day.
For example, the average daily ozone concentration during the period June 1–17 in Ventura was 0.057 ppm.
Measured ozone on June 18 was 0.069 ppm; therefore, June 19 was predicted to be unhealthy because 0.069
is greater than 0.057 × 1.15.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 6.54
The assignment is to calculate other indices for indicator prediction and visualize them as a continuous map
using one of the available interpolation models.
4) INVESTIGATE THE VARIABILITY OF YIELDS OF INDIVIDUAL TREES.
Batchelor and Reed (Batchelor, L. D., and H. S. Reed. 1918. “Relation of the variability of yields of fruit trees to
the accuracy of field trials.” Journal of Agricultural Research 12:245–83) published data on the yields of
individual trees (in pounds) from plantations in California and Utah that had received uniform treatments for
growing fruit for a number of years. All plots in the fields are arranged on a regular grid. All orchards were
irrigated during the summer months so that the variability due to non‐uniform water supply was significantly
reduced. Three datasets are visualized below.
The first dataset consists of yields of individual 24‐year‐old orange trees from a grove at Arlington Station,
California, collected in 1915–1916 (figure 6.55).
The second dataset consists of yields from a 10‐year‐old Jonathan apple orchard at Providence, Utah (figure
6.56). The surface soil of this orchard was very uniform except on the eastern edge, where the percentage of
gravel increased slightly.
"Yields of Individual Trees" by Batchelor, L. D. and H. S. Reed. 1918. Relation of the Variability of Yields of Fruit Trees to the Accuracy of Field Trials.
Journal of Agricultural Research 12:245–283.
Figure 6.56
The third dataset consists of yields from 24‐year‐old walnut trees at Whittier, California, during the seasons
of 1915 and 1916 (figure 6.57). The planting was entirely surrounded by additional walnut plantings, except
on a part of one side that was adjacent to an orange grove.
"Yields of Individual Trees" by Batchelor, L. D., and H. S. Reed. 1918. Relation of the Variability of Yields of Fruit Trees to the Accuracy of Field Trials.
Journal of Agricultural Research 12:245–283.
Figure 6.57
Are these data stationary? Are these data isotropic? Are they distributed normally? What are the best models
to describe the trees’ yields?
The authors of the paper divided the groves into imaginary plots of various sizes and shapes and then compared
yield variability. Repeat this exercise.
Data are available in the folder assignment 6.4.
Jolliffe, I. T., and D. B. Stephenson, editors. 2003. Forecast Verification. A Practitioner’s Guide in Atmospheric
Science. Wiley and Sons, 240 pp.
This book discusses different verification and evaluation techniques and scores for meteorological forecasts.
Most of the discussed methods can be used in applications outside meteorology.
Burnett, C., M. Heurich, and D. Tiede. 2003. “Exploring Segmentation‐based Mapping of Tree Crowns:
Experiences with the Bavarian Forest NP Lidar/Digital Image Dataset.” Poster presented at ScandLaser 2003
International Conference and Workshop, 2‐23 September, Umeå, Sweden.
This paper discusses methods for extraction of individual tree characteristics using multispectral digital
images and scanning lidar measurements.
Bullock, D. G., D. S. Bullock, E. D. Nafziger, T. A. Doerge, S. R. Paszkiewicz, P. R. Carter, and T. A. Peterson. 1998.
“Does Variable Rate Seeding of Corn Pay?” Agronomy Journal 90:830–36.
This paper estimates and discusses the economic value to the farmer of variable‐rate seeding versus uniform‐
rate seeding.
In the next two papers, weed counts data were analyzed using geostatistics and then using a generalized
linear model.
Johnson, G. A., D. A. Mortensen, and C. A. Gotway. 1996. “Spatial Analysis of Weed Seedling Populations Using Geostatistics.” Weed Science 44:704–10.
Gotway, C. A., and W. W. Stroup. 1997. “A Generalized Linear Model Approach to Spatial Data Analysis and
Prediction.” Journal of Agricultural, Biological and Environmental Statistics 2:157–78.
The mackerel data were originally analyzed using the generalized additive model in the following paper.
Borchers, D. L., S. T. Buckland, I. G. Priede, and S. Ahmadi. 1997. “Improving the Precision of the Daily Egg
Production Method Using Generalized Additive Models.” Canadian Journal of Fisheries and Aquatic Sciences
54:2727–42.
Mackerel data are provided by the R library gamair.
Mgcv package created by Simon Wood was used in the analysis. The package and the generalized additive
model theory are discussed in the following book.
Wood, S. N. 2006. Generalized Additive Models: An introduction with R. Boca Raton, Fla.: CRC. 391 pages.
The case study related to figures 6.20‐6.24 was conducted using data generated to mimic the bell pepper data
originally published and analyzed in the following paper.
Gumpertz, M. L., J. Graham, and J. B. Ristaino. 1997. “Autologistic Model of Spatial Pattern of Phytophthora
Epidemic in Bell Pepper: Effects of Soil Variables on Disease Presence.” Journal of Agricultural, Biological, and Environmental Statistics 2:131–56.
This paper provides background information about environmental consequences of the Chernobyl accident.
Lecocq, S., and M. Visser. 2006. “What Determines Wine Prices: Objective vs. Sensory Characteristics.” Journal of Wine Economics 1(1):42–56. This paper can be downloaded from http://www.wine-economics.org/journal/content/Volume1/number1/index.shtml
This is a good review of wine price formation. However, the regression analysis part of the paper must be
questioned because the authors did not verify the linear model assumptions and the essential part of any
statistical data analysis, model diagnostics, is not provided.
Soros, G. 1994. The Alchemy of Finance: Reading the Mind of the Market. John Wiley.
In this book first published in 1987, George Soros presented a nontechnical description and forecast of the
dynamic interplay between participants of the stock market. In particular, the author presented arguments
for his thesis that the social and natural sciences require different modeling approaches.
Waller, L. A., L. Zhu, C. A. Gotway, D. M. Gorman, and P. J. Gruenewald. 2007. “Quantifying Geographic Variations in Associations Between Alcohol Distribution and Violence: A Comparison of Geographically Weighted Regression and Spatially Varying Coefficient Models.” Stochastic Environmental Research and Risk Assessment 21(5):573–88.
The authors of this paper analyzed crime data collected in Houston tracts and compared the geographically weighted regression and the spatially varying regression coefficients models.
SPATIAL INTERPOLATION USING
DETERMINISTIC MODELS
SPATIAL INTERPOLATION GOALS
PREDICTIONS ARE ALWAYS INACCURATE
DETERMINISTIC AND STATISTICAL MODELS
INVERSE DISTANCE WEIGHTED INTERPOLATION
RADIAL BASIS FUNCTIONS
RBFS AND KRIGING
TREND SURFACE OR GLOBAL AND LOCAL POLYNOMIAL INTERPOLATION
LOCAL POLYNOMIAL INTERPOLATION AND KRIGING
INTERPOLATION USING A NONEUCLIDEAN DISTANCE METRIC
NONTRANSPARENT BARRIERS DEFINED BY POLYLINES
SEMITRANSPARENT BARRIERS BASED ON COST SURFACE
ASSIGNMENTS
1) COMPARE THE PERFORMANCE OF DETERMINISTIC INTERPOLATION
MODELS
2) FIND THE BEST DETERMINISTIC MODEL FOR INTERPOLATION OF CESIUM
137 SOIL CONTAMINATION
FURTHER READING
This chapter discusses and illustrates three commonly used deterministic interpolation models that are
based on three different concepts: inverse distance weighted interpolation, radial basis functions, and
local polynomial interpolation (this model is partly deterministic because its prediction standard
error can be estimated).
The ideas behind these three deterministic interpolation models, their use, and their relationships with statistical interpolation (kriging) are then discussed in detail.
The chapter ends with a discussion of local polynomial interpolation using a non‐Euclidean distance metric.
SPATIAL INTERPOLATION GOALS
Interpolation is the process of obtaining a value for a variable of interest at an unsampled location based on
surrounding measurements.
The map in figure 7.1 shows measurements of radiocesium contamination of milk (becquerel per kilogram)
as colored circles. Predictions in villages without monitoring stations, shown as light green polygons, are
important because people living there have the same right as others to know how much their milk is
contaminated.
It seems to make sense to predict a value at an unsampled location by using an average of the nearby data.
But an average implies that the data are equally weighted, and the only kind of data that are equally weighted
is independent data. With independent spatial data, prediction to an unsampled location using either
deterministic or statistical interpolators is not advantageous because the data do not contain information
about their spatial structure. Predictions that use an arithmetic average of spatially dependent data will give
overoptimistic estimates of averaging accuracy. See the DEM averaging example in chapter 2.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 7.1
Ideally, we would like our predictions to have the following attributes:
Be based on local information.
Have a measure of prediction uncertainty so that we can tell how good they are.
Have smaller prediction uncertainty than other prediction models.
PREDICTIONS ARE ALWAYS INACCURATE
Suppose we know the value of the variable Z at two points, a and b, separated by a distance l. Linear interpolation to the point x on the line l between a and b (where x is the distance from point a) is
Ẑ(x) = (1 − x/l)·Z(a) + (x/l)·Z(b).
The average square error of the interpolation is
σ²(x) = E[Ẑ(x) − Z(x)]²,
where E(•) is the expected value of the expression in brackets. Using the definition of the semivariogram, γ(h) = ½·E[Z(a) − Z(a + h)]², the error can be rewritten as
σ²(x) = 2·[(1 − x/l)·γ(x) + (x/l)·γ(l − x)] − 2·(x/l)·(1 − x/l)·γ(l).
The error of the interpolation to the center of the line, x = l/2, equals
σ²(l/2) = 2·γ(l/2) − ½·γ(l).
Similar formulas can be derived for regular polygons with four, six, and twelve sides for interpolation from the corners to the center, where r is the distance from a corner of the polygon to its center. In the limiting case of a circle of radius r, the error of interpolation to the center is
σ∞² = 2·γ(r) − (1/π)·∫ γ(2r·sin(ϕ/2)) dϕ,
with the integral taken from ϕ = 0 to π, where ϕ is the angle of a sector of the circle.
For r = l/2 = 0.5 and a spherical semivariogram model with a range of 1, a partial sill of 1, and a zero nugget (that is, γ(r) = 3/2⋅r − 1/2⋅r³), the errors of interpolation can be computed from the formulas above; all of them are greater than zero.
We see that the error of interpolation is never equal to zero no matter how accurately data are measured (in
this example data are measured exactly since the nugget parameter is equal to zero) or how many sides the
regular polygons have.
The error of the interpolation does decrease as the number of samples increases, but the reduction is small. For example, the interpolation error is reduced by only about 3.5 percent when the number of points in the interpolation is increased from four to twelve. When the nugget
parameter of the semivariogram model is greater than zero (that is, when measurements are imprecise), the
error of interpolation is larger.
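A small sketch of these error calculations for the two-point (line) and circle cases with the spherical semivariogram used above (SciPy is assumed for the numerical integral):

import numpy as np
from scipy.integrate import quad

def gamma(h, range_=1.0, psill=1.0):
    # spherical semivariogram with zero nugget
    h = np.minimum(np.abs(h), range_) / range_
    return psill * (1.5 * h - 0.5 * h ** 3)

r = 0.5                                              # half of the line length l = 1
line_error = 2 * gamma(r) - 0.5 * gamma(2 * r)       # interpolation to the line center
circle_error = 2 * gamma(r) - (1.0 / np.pi) * quad(
    lambda phi: gamma(2 * r * np.sin(phi / 2.0)), 0.0, np.pi)[0]
print(line_error, circle_error)                      # both remain greater than zero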
Since prediction is always associated with uncertainty, the goal is to minimize that uncertainty.
Usually we are analyzing unique spatial phenomena, meaning that particular weather patterns, locations of
fast‐food restaurants, and cancer mortality incidence cannot be found elsewhere in exactly the same form.
Therefore, spatial data interpolation must be based on analysis of the data features, not just on a distance
between pairs of locations.
Kriging is the name given to a class of statistical techniques for optimal spatial interpolation. Kriging
predictors are called optimal since they are statistically unbiased (that is, on average the predicted value and
the true value coincide), and they minimize prediction mean‐squared error, a measure of uncertainty in the
predicted values. Statistical models derive their parameters based on the analysis of the actual data. They
satisfy the above‐mentioned goals of spatial interpolation.
Deterministic interpolation models are based on mathematical formulas that determine the smoothness of
the resulting surface. This smoothness can be calibrated using the cross‐validation technique, but it is
unlikely that it corresponds to natural data variability, because deterministic models do not take into account
relationships among the actual data. Another unfortunate feature of deterministic models is that they cannot
provide a measure of the accuracy of the predictions. In summary, deterministic interpolation models ignore
the achieved level of knowledge on spatial data correlation.
INVERSE DISTANCE WEIGHTED INTERPOLATION
In inverse distance weighted (IDW) interpolation, the value at the prediction location is calculated by
summing the weighted contributions from the neighboring sampling points. IDW weights the points closer to the prediction location more heavily than those farther away: the weight is inversely proportional to the pth power of the distance between the prediction location and the sampling points. Note that other interpolators discussed in this book, both deterministic and statistical, have weights that depend not only on the distance between locations but also on the configuration of the data.
IDW is an exact interpolator, that is, it predicts a value identical to the measured value at a sampled location.
If the prediction location coincides with one of the sampling points, then the prediction is replaced with the
measured value at that location. If data are not absolutely precise, which is typical in many applications, exact
interpolation does not make much sense.
The maximum and minimum values in the interpolated surface using IDW can only occur at sample points.
The surface is sensitive to data clustering and to the presence of data outliers.
Figure 7.2 at left shows a Geostatistical Analyst IDW searching neighborhood dialog. The points highlighted in
the data view give an indication of the weights that are associated with each point located within the
searching neighborhood. These points are used in calculating the value at the unsampled location, the center
of the crosshairs.
All nearby points may lie in one particular direction. Such data structure can be accounted for by restricting
the neighbor search to four or eight angular sectors. In this case, the data involved in the prediction are
distributed more uniformly, avoiding the problem of selecting data in preferentially sampled regions.
Points in northeast and southwest sectors have small weights—only two points are between 3 percent and 5
percent—while three points in the southeast sector have weights larger than 10 percent and two points are
between 5 percent and 10 percent. The test location panel displays the estimated value at the center of the
circle as well as the number of neighbors used for the prediction.
Figure 7.2
Data in the image at left are clustered, with one cluster heavily affecting predictions in its vicinity. If data
correlation has a range larger than the typical intercluster distance, prediction in the center of the circle in
this figure will be seriously biased. It is better when the whole cluster affects the prediction proportionally to
the distance between the predicted point and the cluster’s center.
Figure 7.2 at right shows weights (absolute value in percent) of the neighboring points when predicting to the
same location using ordinary kriging. Points in each nonempty sector have sufficient influence on the
prediction in the center of the circle, meaning that statistical interpolation automatically takes care of data
clustering.
The IDW prediction at the unsampled location s0 is
Ẑ(s0) = Σ λi·Z(si), with λi = d(si, s0)^(−p) / Σ d(sj, s0)^(−p),
where d(si, s0) is the distance between locations si and s0, N is the number of data used for prediction (both sums run from 1 to N), λi is the weight of the ith sample point, and p is the power value (p is greater than or equal to 1). The weights for the measured locations used in the prediction are scaled so that their sum is equal to 1.
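A minimal IDW sketch under these definitions (NumPy assumed; no sector search, anisotropy, or other Geostatistical Analyst options):

import numpy as np

def idw_predict(s0, coords, z, p=2.0, n_neighbors=15):
    d = np.linalg.norm(coords - s0, axis=1)
    if np.any(d == 0.0):
        return float(z[np.argmin(d)])          # exact interpolator at a sampled location
    nearest = np.argsort(d)[:n_neighbors]      # simple circular searching neighborhood
    w = d[nearest] ** (-p)
    w /= w.sum()                               # weights scaled to sum to 1
    return float(np.sum(w * z[nearest]))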
Figure 7.3 shows how weights change with increasing distance and power value. For distances less than 1
distance unit, weights are greater than 1, and for distances larger than 1, weights are less than 1. For very
small power values, weights are approximately the same for any distance.
Figure 7.3
For a power value greater than zero, the weight function decreases with increasing distance, as it should, and one may wonder why ArcGIS Geostatistical Analyst allows only power values greater than 1. The reason is the behavior of the weights when 0 < p < 1, illustrated in figure 7.4.
Figure 7.4
In that case, samples far away from the prediction location can influence the prediction more than those that are close, which contradicts common sense.
The common choice for the default power value p is 2. But there is no reasonable theoretical justification for
that choice. It is better to use the Geostatistical Analyst option for choosing the optimum power value based
on cross‐validation, in which each measured point is removed and compared to the predicted value for that
location. Geostatistical Analyst tries several different powers for IDW to identify the power that produces the
minimum root‐mean‐square prediction error
[(1/N)·Σ (Ẑ(si) − Z(si))²]^(1/2),
where Ẑ(si) is the estimated value at the data location si.
To calculate the optimum power value, Geostatistical Analyst plots the root‐mean‐square prediction error for
several different power values using the same searching neighborhood. A second‐order local polynomial is
fitted to the points, and the power that provides the smallest error is the optimal value, as shown in figure 7.5.
Figure 7.5
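A leave-one-out sketch of this search, reusing the idw_predict function above (Geostatistical Analyst additionally fits a local polynomial to the error curve, which is omitted here):

import numpy as np

def optimal_power(coords, z, powers=np.arange(1.0, 4.25, 0.25)):
    rmse = []
    for p in powers:
        errors = [idw_predict(coords[i], np.delete(coords, i, axis=0),
                              np.delete(z, i), p=p) - z[i]
                  for i in range(len(z))]
        rmse.append(np.sqrt(np.mean(np.square(errors))))
    return powers[int(np.argmin(rmse))]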
With the large p value, only a few surrounding points will influence the prediction, there will be little
averaging, and the predicted value will be close or equal to the value of the nearest point. When the power
value p is very large, say 20, the resulting prediction map will be close to that of a Voronoi map, figure 7.6 at
left.
A similar surface can be produced using the default power value 2 and just one neighboring measurement
(see figure 7.6 at right).
Figure 7.6
It can be shown that when data are dense and spatially correlated, the optimal power value is close to 3, while
it is close to 1 when spatial data correlation is weak. Therefore, an optimal power value close to 1 is an
indication of data independence. In the case of independent data, interpolation should not be used.
If there are no directional influences on the data, neighboring points can be selected equally in all directions
using a circular neighborhood. If there is a directional influence on the data, such as air quality measurements
in a prevailing wind, then the searching neighborhood should be elliptical, as locations upwind from a
prediction location are going to be more similar at farther distances than locations perpendicular to the wind.
Both deterministic and statistical models can be anisotropic, but they are based on different concepts.
The semivariogram model in figure 7.7 is anisotropic, with a shorter range in the northeast direction, meaning that data are changing more rapidly in the northeast direction.
Figure 7.7
Since deterministic methods are based on predefined mathematical formulas and not on the data themselves,
anisotropy in deterministic models (including IDW) can be defined by transformation of the coordinates such
that distance in one direction changes faster than in the perpendicular direction by a user’s specified factor.
Figure 7.8
Figure 7.9 shows how this works in the case of IDW interpolation. Assuming that directional variation in the
northeast direction is twice as fast, space is modified in such a way that a circle is transformed into an ellipse.
Figure 7.9
To show that this approach works, data were simulated using the semivariogram model shown above in
figure 7.7. Since the data are simulated, we know how the true surface looks. Then IDW models with the parameters displayed in figure 7.9 were used. The result of the interpolation is displayed in figure 7.10 using contour lines over the surface created by kriging with the true semivariogram model. Both IDW maps include artificial
islands of high and low values around the measurement locations. However, the anisotropic IDW model at the
right has fewer islands, and its contour lines match kriging predictions better than the isotropic IDW model at
the left. This can be verified using validation and cross‐validation diagnostics, discussed in chapter 6.
Figure 7.10
Although very popular among GIS users, inverse distance weighted interpolation does not have the features
needed in a predictor, including the ability to estimate how uncertain predictions are. The only advantage this
model has over the others is its speed of calculation. But with today’s computers, kriging predictions are
produced in seconds, so this reason to prefer IDW has disappeared.
RADIAL BASIS FUNCTIONS
Radial basis functions (RBF) is a particular variant of spline interpolation (see chapter 6) in which knots
(points that control the shape of the surface) coincide with measurement locations. RBFs make surfaces that
pass through measured sample values with the least amount of curvature. The method can be compared to a draftsman
who uses a flexible ruler to draw a smooth contour line that goes through all the marks. An infinite number
of surfaces can be constructed using this idea. Figure 7.11 shows two RBF surfaces fitted through the sample
values.
Figure 7.11
In contrast to IDW, RBFs can predict values above the maximum measured value or below the minimum
measured value. Figure 7.12 shows fragments of two interpolated surfaces using cesium radionuclide soil
contamination data collected in Belarus, created using RBF (right) and IDW (left). The red arrows show
where the difference in prediction is most noticeable.
Figure 7.12
Like IDW, RBFs are exact interpolators. They produce good results for gently varying data such
as elevation. The models are inappropriate when there are large changes in the surface values within short
distances, or when the sample data are imprecise.
The RBF kernels available in Geostatistical Analyst are the thin-plate spline, the multiquadric, the completely regularized spline (its formula involves E1, the exponential integral function, and CE, the Euler constant, CE = 0.577215…), the spline with tension (its formula involves K0, the modified Bessel function, and CE, the Euler constant), and the inverse multiquadric.
Figure 7.13
Figure 7.14
Figure 7.15
Prediction at the unsampled location s0 is made using a linear combination of the basis functions situated at
the sampled locations si:
$$\hat{Z}(s_0) = \sum_{i=1}^{N} \omega_i\, \phi\big(|s_i - s_0|\big) + \omega_0,$$
where $|s_i - s_0|$ is the distance between locations si and s0, $\phi(\cdot)$ is the chosen basis function, ωi are weights given to each of the basis functions,
ω0 is a bias parameter, and N is the number of points in the searching neighborhood.
Prediction is made in two steps. First, the value of each function is calculated for a prediction
location, as illustrated in figure 7.16. The multiquadric RBF, shown for three locations s1, s2, and s3, is
calculated using the distance from each location si to the prediction location s0. The value of each RBF at the
prediction location is shown by the intersection of black stick and kernel surfaces at ϕ1, ϕ2, and ϕ3 in figure
7.16.
Figure 7.16
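The two-step calculation can be sketched in Python as follows. This is a generic multiquadric RBF interpolator, not Geostatistical Analyst's implementation; the multiquadric form, the parameter sigma, and the sample data are illustrative assumptions.

```python
import numpy as np

def multiquadric(r, sigma=1.0):
    """Multiquadric basis function of distance r (one common form)."""
    return np.sqrt(r**2 + sigma**2)

def rbf_fit(xs, zs, sigma=1.0):
    """Solve for the weights w_i and the bias w_0 so the surface passes through the data."""
    n = len(zs)
    d = np.linalg.norm(xs[:, None, :] - xs[None, :, :], axis=2)   # pairwise distances
    A = np.zeros((n + 1, n + 1))
    A[:n, :n] = multiquadric(d, sigma)
    A[:n, n] = 1.0          # bias column
    A[n, :n] = 1.0          # constraint row tying the weights to the bias term
    b = np.append(zs, 0.0)
    return np.linalg.solve(A, b)

def rbf_predict(x0, xs, coef, sigma=1.0):
    """Step 1: evaluate each basis function at the prediction location; step 2: weighted sum."""
    phi = multiquadric(np.linalg.norm(xs - x0, axis=1), sigma)
    return np.dot(coef[:-1], phi) + coef[-1]

# Illustrative data and a prediction in the middle of the four samples.
xs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
zs = np.array([1.0, 2.0, 0.5, 1.5])
coef = rbf_fit(xs, zs)
print(rbf_predict(np.array([0.5, 0.5]), xs, coef))
```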
The default parameter σ is determined using cross‐validation in a way similar to that used for IDW
interpolation: one sample is removed from the dataset, and prediction made to this point is compared with
the measured value. This procedure is repeated for all data and for different values of the parameter σ .
Optimal σ corresponds to the minimum average difference between predicted and measured values:
$$\sum_{i=1}^{N} \big(\hat{Z}(s_i) - Z(s_i)\big)^2 \rightarrow \min.$$
For data with anisotropic properties, a coordinate transformation is used similarly to IDW interpolation: the
coordinates x and y are transformed to new coordinates x′ and y′ by rotating the axes by α, the direction of the
anisotropy, and then stretching one of the rotated coordinates by ρ, the ratio of the major semiaxis to the minor
semiaxis of the ellipse of anisotropy.
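A minimal sketch of such a transformation, assuming the common rotate-then-stretch convention (the exact form used by the software may differ), is shown below; distances are then computed in the transformed coordinates.

```python
import numpy as np

def anisotropic_transform(xy, alpha_deg, rho):
    """Rotate coordinates by alpha and stretch the minor-axis coordinate by rho.

    alpha_deg : anisotropy direction in degrees (assumed convention)
    rho       : ratio of the major semiaxis to the minor semiaxis
    """
    a = np.radians(alpha_deg)
    rot = np.array([[np.cos(a),  np.sin(a)],
                    [-np.sin(a), np.cos(a)]])
    xy_rot = xy @ rot.T
    xy_rot[:, 1] *= rho           # distances across the minor axis count rho times more
    return xy_rot

# Illustrative use: the distance between two points under anisotropy.
pts = np.array([[0.0, 0.0], [1.0, 1.0]])
t = anisotropic_transform(pts, alpha_deg=45.0, rho=2.0)
print(np.linalg.norm(t[1] - t[0]))
```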
RBFS AND KRIGING
RBF and kriging interpolations have different objectives. The goal of kriging is not to make a smooth surface
but to produce good average prediction accuracy. Note that there is a version of kriging that produces both
accurate predictions and smooth maps (see chapter 8). We do not know how accurate RBF predictions are,
but we do know how accurate kriging is, because kriging provides information about the prediction
uncertainty.
Although not easy, it is possible to find the RBF kernel that corresponds to the kriging interpolation using
specific semivariogram models. One such kernel is the thin-plate spline. The function γ(h) = |σh|²·ln|σh| could
be used as a semivariogram, except that it is not valid without the nugget effect and it changes too fast.
This is because a semivariogram must be positive for all distances h and because the thin-plate spline
function grows faster than the function γ(h) = h², meaning that one of the main kriging assumptions—that the mean value
must be constant—is violated.
RBF can produce surfaces similar to kriging at short or large scales. Figure 7.17 shows predictions with
default parameters using thin‐plate splines (filled contours) and inverse multiquadric kernels (contours). The
former display primarily large‐scale data variations while the latter concentrate on small‐scale variations
around sampled data.
Figure 7.17
In this situation, kriging should outperform RBF because a kriging model can include both large‐ and small‐
scale data variation components. Further examples of the radial smoother maps can be found in appendix 4.
GLOBAL AND LOCAL POLYNOMIAL INTERPOLATION
Global polynomial interpolation (or trend surface interpolation) is like fitting a piece of paper between data
values represented as sticks of different heights, then making a surface so that the sum of the heights above
the surface is close to the sum of the heights below the surface, as shown in figure 7.18.
Figure 7.18
The following are examples of polynomials:
A constant (zero-order polynomial): $Z(x,y) = a_0$ (used in figure 7.18 at left)
A linear (first-order polynomial): $Z(x,y) = a_0 + a_1 x + a_2 y$
A quadratic (second-order polynomial): $Z(x,y) = a_0 + a_1 x + a_2 y + a_3 x^2 + a_4 xy + a_5 y^2$ (used in figure 7.18 in the center)
A cubic (third-order polynomial): $Z(x,y) = a_0 + a_1 x + a_2 y + a_3 x^2 + a_4 xy + a_5 y^2 + a_6 x^3 + a_7 x^2 y + a_8 x y^2 + a_9 y^3$.
A fifth‐order polynomial was used to create the surface in figure 7.18 at right.
Here, aj are coefficients determined by minimizing the squared difference E between the predictions $\hat{Z}(x_i,y_i)$ at the data
locations and the data $Z(x_i,y_i)$:
$$E = \sum_{i=1}^{N} \big(\hat{Z}(x_i, y_i) - Z(x_i, y_i)\big)^2,$$
where N is the number of data samples. So, all data are used in this interpolation method, hence the name
global interpolation.
For example, in the case of linear trend, the following expression would be minimized:
$$E = \sum_{i=1}^{N} \big(a_0 + a_1 x_i + a_2 y_i - Z(x_i, y_i)\big)^2.$$
Figure 7.19
Mathematically, small changes dE in E can be presented as changes in the coefficients aj:
$$dE = \frac{\partial E}{\partial a_0}\,da_0 + \frac{\partial E}{\partial a_1}\,da_1 + \frac{\partial E}{\partial a_2}\,da_2,$$
where $\partial E/\partial a_j$ are partial derivatives of E.
At the minimum of E, dE equals zero. This is equivalent to having zero derivatives $\partial E/\partial a_j = 0$. If we find the
derivatives of E with respect to the coefficients aj and equate them to zero in accordance with the requirement
that E be a minimum, then we obtain the following equations for determining the coefficients aj of the
polynomial:
From $\partial E/\partial a_0 = 0$: $\sum_{i=1}^{N} \big(a_0 + a_1 x_i + a_2 y_i - Z(x_i, y_i)\big) = 0$
From $\partial E/\partial a_1 = 0$: $\sum_{i=1}^{N} x_i\big(a_0 + a_1 x_i + a_2 y_i - Z(x_i, y_i)\big) = 0$
From $\partial E/\partial a_2 = 0$: $\sum_{i=1}^{N} y_i\big(a_0 + a_1 x_i + a_2 y_i - Z(x_i, y_i)\big) = 0$
The technique described above is called ordinary least squares. Using $\hat{Z}(x_i, y_i) = a_0 + a_1 x_i + a_2 y_i$, the system
of equations above can be rewritten as
$$\sum_{i=1}^{N}\big(\hat{Z}(x_i, y_i) - Z(x_i, y_i)\big) = 0,\quad \sum_{i=1}^{N} x_i\big(\hat{Z}(x_i, y_i) - Z(x_i, y_i)\big) = 0,\quad \sum_{i=1}^{N} y_i\big(\hat{Z}(x_i, y_i) - Z(x_i, y_i)\big) = 0.$$
The last two equations show that the residuals, the differences between measured and predicted values,
$Z(x_i, y_i) - \hat{Z}(x_i, y_i)$, are uncorrelated with the x- and y-coordinates.
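The normal equations above are exactly what a least-squares routine solves. As a rough illustration on made-up data (the coordinates and the true coefficients are arbitrary), the following sketch fits a first-order trend surface and checks the residual properties just stated.

```python
import numpy as np

# Hypothetical measurements Z(x_i, y_i)
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = rng.uniform(0, 10, 50)
z = 2.0 + 0.5 * x - 0.3 * y + 0.2 * rng.standard_normal(50)

# Design matrix for a first-order polynomial: columns for a0, a1, a2
A = np.column_stack([np.ones_like(x), x, y])
coef, *_ = np.linalg.lstsq(A, z, rcond=None)
residuals = A @ coef - z

print("a0, a1, a2 =", coef)
print("sum of residuals ~ 0:", residuals.sum())
print("residuals orthogonal to x and y:", residuals @ x, residuals @ y)
```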
Trend surface interpolation is sensitive to unusually high and low values. If a point with a very large value is
added to the dataset, the estimated surface will rise. Therefore, the first step in using this model is checking
for data outliers.
As the name local polynomial interpolation suggests, it does not use the entire dataset but only the data in the
specified local moving window around the locations where prediction is required. While ArcGIS Geostatistical
Analyst allows specifying a searching neighborhood around locations where prediction is required, the main
idea of local polynomial interpolation is different. The ordinary least squares method discussed above
assumes that all data are equally weighted. The idea of local polynomial interpolation is to use larger weights
for data separated by small distances from the estimated location. ArcGIS Geostatistical Analyst follows Lev
Gandin’s suggestion, made in his book on optimal interpolation (kriging) in 1963, and minimizes the
following weighted squared error of prediction:
$$E = \sum_{i=1}^{N} h_i \big(\hat{Z}(x_i, y_i) - Z(x_i, y_i)\big)^2,$$
where N is the number of samples used for prediction. The weights hi decrease as the distance between points
increases:
$$h_i = \exp\!\big(-r_i/(3a)\big),$$
where ri is the distance between data sample i and the location where prediction is required, and a is a
parameter, a distinctive distance. The parameter a is optimized using the same cross-validation method used to optimize the
parameters of IDW and RBF interpolation: by minimizing the cross-validation root-mean-square prediction
error.
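A minimal weighted-least-squares sketch of this idea, assuming the exponential distance-decay weights in the form given above (the data and the distinctive distance a are illustrative, and Geostatistical Analyst's actual implementation has more options): at each prediction location, nearby samples receive larger weights and a first-order polynomial is fitted locally.

```python
import numpy as np

def local_polynomial_predict(x0, y0, x, y, z, a):
    """First-order local polynomial prediction at (x0, y0).

    a : distinctive distance controlling how quickly the weights decay.
    """
    r = np.hypot(x - x0, y - y0)
    h = np.exp(-r / (3.0 * a))                  # distance-decay weights (illustrative form)
    A = np.column_stack([np.ones_like(x), x - x0, y - y0])
    W = np.diag(h)
    # Weighted normal equations: (A' W A) c = A' W z
    coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ z)
    return coef[0]                              # intercept = prediction at (x0, y0)

rng = np.random.default_rng(2)
x, y = rng.uniform(0, 10, 40), rng.uniform(0, 10, 40)
z = np.sin(x / 3.0) + 0.05 * rng.standard_normal(40)
print(local_polynomial_predict(5.0, 5.0, x, y, z, a=2.0))
```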
The resulting surface does not need to pass through each measurement point. Therefore, global and
local polynomials are inexact interpolators. Figure 7.20 at left shows the result of interpolation using a first‐
order local polynomial with between 15 and 20 neighbors involved in the prediction, depending on
prediction location. The estimated surface looks like the surface produced using a fifth‐order global
polynomial interpolation in figure 7.18 at right, in the beginning of this section.
To see the difference between exact and inexact interpolation, a prediction using IDW using the same data is
shown in figure 7.20 at right. Generally, inexact interpolation is preferable for smoothly changing data with
measurement and locational errors such as temperature or elevation.
Figure 7.20
A typical application of polynomial interpolation is large‐scale data variation detection and filtering. With the
same radiocesium data as were used to compare IDW and RBF interpolation above, a first‐order local
polynomial interpolation surface is presented in figure 7.21. Several data nearly touch the predicted surface,
while some of them are under‐ or overestimated.
Figure 7.21
LOCAL POLYNOMIAL INTERPOLATION AND KRIGING
The main assumption behind statistical interpolation, kriging, is data stationarity, in particular that the mean
data values in the area under investigation remain constant. Finding and removing large‐scale data variation
from the data can make kriging predictions more accurate. Figure 7.22 shows an ordinary kriging prediction
using the mean value estimated by the local polynomial interpolation in figure 7.21. This variant of kriging
first calculates the difference between the data values and predictions produced by global or local polynomial
interpolation (called residuals), uses these derived data as input for geostatistical modeling, and finally adds
the small‐scale data variation surface predicted by kriging to the smooth surface produced using polynomial
interpolation.
Figure 7.22
Kriging can be both an exact and an inexact interpolator. In this example, we used an inexact variant because
cesium soil contamination measurements are not precise, and there is no reason to reproduce noisy data
exactly. It is better to use the available information in nearby locations to predict a new value to the
measurement location, a value that can turn out to be more precise than the measurement itself.
The more complex the polynomial, the more difficult it is to ascribe physical meaning to it. Low‐order
polynomials, with their slowly varying surfaces, are more appropriate to describe a physical process such as
temperature change with changing latitude. Since low-order polynomials also require estimating a smaller number of coefficients, they are usually preferred for describing large-scale data variation.
Local polynomial models can account for anisotropy based on the transformation of coordinates. In figure
7.23, the prediction surface changes when distance between locations changes faster in vertical direction
(left) or horizontal direction (right) with an anisotropic ratio equaling two.
Figure 7.23
The use of local polynomial interpolation can be illustrated with meteorological data. Continental weather
maps can be described by low‐order polynomial trend surfaces. Figure 7.24 shows local polynomial
interpolation with a very large number of points in the searching neighborhood. Contour lines show the
temperature increasing in the northern hemisphere from north to south. The difference between the
measurements and the interpolation to measurement locations is shown by squares (departures from the
temperature trend may interest travelers). Blue and red symbols show where the error of prediction is large:
near and far from the coasts.
Data courtesy NOAA.
Figure 7.24
Data courtesy NOAA.
Figure 7.25
We can investigate how much variation remains in the data after removing trend estimated by polynomial
interpolation. The simplest way to do this in Geostatistical Analyst is to use kriging on residuals. The first step
will be local polynomial interpolation. We can use, for example, the two trend models described above. The
next step is modeling spatial correlation in the residuals (arithmetic difference between data and trend
surface value in the measurement locations).
Figure 7.26 shows a semivariogram model for residuals after removing the trend shown in the maps in
figures 7.24 and 7.25. Figure 7.26 at left, which corresponds to the map with slow change in temperature in
figure 7.24, shows that the distance at which the correlation between measurements becomes negligible is about
the distance between Seattle and Houston; see Major Range value. The nonsymmetrical semivariogram
surface in the bottom left corner indicates that significant trend still exists in the data. The range of data
correlation in figure 7.26 at right is about the distance between Seattle and Vancouver. The semivariogram
surface does not show preferential direction in the data variation.
Figure 7.26
Data courtesy NOAA.
Figure 7.27
Local polynomial interpolation is useful at the data exploration stage and in data preprocessing for kriging.
Changing the scale of the data variation in local polynomial interpolation—from the distance compatible with
the size of the study area to the distance only several times larger than the shortest distance between data
locations—helps make apparent different scales of data variability. Figure 7.28 shows surfaces of cesium‐137
soil contamination in Belarus after the Chernobyl accident; the surfaces were created using local polynomial
interpolation with different parameters. The map at the bottom shows the two most contaminated spots; the
map in the middle displays several other contaminated areas; and the map at the top shows local but still
smooth data variation. All these trend surfaces can be used as a variable mean value in kriging interpolation.
If local polynomial interpolation is used for data detrending in kriging, the researchers look for a surface that
helps to remove large‐scale data variation from the data and, in the case of simple and disjunctive kriging, to
make stationary data with a mean value close to zero (see the discussion in chapters 8 and 9). In practice, the mean
surface specified in kriging is usually assumed to be known exactly, and this leads to underestimation of the
prediction standard errors, since there is little hope that the estimated mean surface coincides with the true
one.
Local polynomial interpolation is a particular case of the geographically weighted regression with
explanatory variables defined as functions of x and y coordinates (x, y, xy, x2, y2 and so on; see chapter 12). In
geographically weighted regression, it is possible to estimate prediction errors (see example in figure A3.1 in
appendix 3), although there are problems with the interpretation of these prediction errors. The geographically weighted
regression is available in ArcGIS 9.3 Spatial Statistics Tools for users with Geostatistical Analyst or ArcInfo
licenses. However, this version of the geographically weighted regression cannot be used instead of local
polynomial interpolation (see chapter 12 for the explanation why). The local polynomial interpolation model
with the prediction standard error and additional diagnostic will be available in the ArcGIS version
subsequent to version 9.3. The diagnostic is the spatial condition number, a measure of how unstable the
solution of the prediction equation is for a specific location. It is necessary because the prediction standard
error is calculated, assuming that the LPI model correctly describes the data, which is rarely true.
INTERPOLATION USING A NONEUCLIDEAN DISTANCE METRIC
Theoretically, interpolation using a non‐Euclidean distance metric (it is also called interpolation in the
presence of barriers) may be preferable in GIS applications because the earth is not flat and because physical barriers, such as the shoreline shown in figure 7.29, can separate locations that are close in straight-line distance.
Funding for the Chesapeake Earth Science Study was provided by the U.S. Environmental Protection Agency (EPA) under Contract No. R805965.
Figure 7.29
Interpolation using a non‐Euclidean distance metric is also important in geological applications because
fractures and faults may have a significant influence on fluid properties in reservoirs. Deformations of the
earth’s upper crust due to tectonic forces are called fractures. A fracture with significant displacement of rock
is called a fault. Faults can be barriers to water and can change the direction of fluid currents. Faults are
three‐dimensional objects. In two dimensions, faults can be modeled as discontinuities within the rock. Figure
7.30 at left shows that rocks have been displaced on the sides of a fault. The size of a fault can be represented
by the distance between two points on opposite sides of a fault that were infinitesimally close prior to
faulting. Large faults can be estimated from seismic data as discontinuities in seismic amplitudes. However,
the locations of small faults can be missed or misplaced because seismic measurement errors are usually
large.
Modeling the reservoir prior to faulting often can be done successfully by assuming that the data are described by a
Gaussian random field. However, after faulting, the original spatial structure of the rock is altered due to the
spatial rearrangement of the vectors between points on opposite sides of the faults.
Figure 7.30 at right shows data from a geological survey of a region in Queensland, Australia (data are from
Berman, M., 1986, “Testing for Spatial Association between a Point Process and Another Stochastic Process,”
Applied Statistics 35: 54–62). Dots mark the locations of copper ore deposits. Line segments represent
geological features that are visible on a satellite image. It is believed that they are geological faults.
Interestingly, it was found that the density of copper deposits does not depend on the distance from the
nearest fault. Therefore, faults may or may not be barriers between spatial objects. An intermediate situation
in which faults act as semitransparent barriers is very likely.
NONTRANSPARENT BARRIERS DEFINED BY POLYLINES
Figure 7.31 at bottom left shows inverse Euclidean distance weighted interpolation of the simulated data in a
channel with several additional nontransparent barriers defined by polylines. There are several ways to use
information about barrier locations. An idea implemented in ArcGIS Spatial Analyst is to use only those points
that are “visible,” such as when a straight line between points does not cross any barrier. This approach to
interpolation leads to jumps in predictions near the edges of barriers because the searching neighborhood
changes abruptly there (when different observations are used to predict values in the nearest locations, these
predictions can be very different; see figure 7.31 top left and also figure 8.46 in chapter 8). It is unlikely that
there is a physical process that can be reproduced using the visibility rule.
Another idea is to use the shortest distance between points so that points on the sides of the barriers are
connected by a series of straight lines. Figure 7.31 at bottom right shows the inverse distance weighted
interpolation using this approach. This time, all points are connected and the resulting map is smoother.
Figure 7.31
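One way to picture the shortest-distance idea is to compute path lengths on a fine grid from which barrier cells are removed, then feed those distances into the IDW formula. The following sketch uses a hand-coded Dijkstra search on a small illustrative grid; the barrier layout, sample values, and power value are made up, and this is a conceptual toy rather than any particular software's algorithm.

```python
import heapq
import numpy as np

def grid_shortest_distances(blocked, start):
    """Dijkstra distances from `start` on a grid; `blocked` marks barrier cells."""
    nr, nc = blocked.shape
    dist = np.full((nr, nc), np.inf)
    dist[start] = 0.0
    heap = [(0.0, start)]
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1),
             (-1, -1), (-1, 1), (1, -1), (1, 1)]      # 8-connected moves
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[r, c]:
            continue
        for dr, dc in steps:
            rr, cc = r + dr, c + dc
            if 0 <= rr < nr and 0 <= cc < nc and not blocked[rr, cc]:
                nd = d + np.hypot(dr, dc)
                if nd < dist[rr, cc]:
                    dist[rr, cc] = nd
                    heapq.heappush(heap, (nd, (rr, cc)))
    return dist

# Toy example: a vertical barrier with gaps at the top and bottom; samples on both sides.
blocked = np.zeros((20, 20), dtype=bool)
blocked[2:18, 10] = True                               # barrier polyline rasterized to cells
samples = [((5, 3), 1.0), ((15, 16), 4.0)]             # (cell, measured value)

# IDW at one prediction cell using shortest-path distances instead of straight lines.
target = (5, 16)
weights, values = [], []
for cell, z in samples:
    d = grid_shortest_distances(blocked, cell)[target]
    weights.append(1.0 / d**2)
    values.append(z)
print(np.dot(weights, values) / np.sum(weights))
```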
Inverse distance weighted interpolation can be considered as having a radially symmetric kernel proportional to 1/r^p, where p is
a power value and r is the radius from the prediction location. Other kernels are possible, including the exponential kernel used in
Geostatistical Analyst's local polynomial interpolation and the Epanechnikov and Gaussian kernels common in the point
pattern analysis literature; these kernels decay with the radius r at a rate controlled by a bandwidth h. The bandwidth h can be estimated from the data by minimizing the mean squared
prediction error as discussed in the section called “Inverse distance‐weighted interpolation.”
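For reference, common textbook forms of the Epanechnikov and Gaussian kernels as functions of the radius r and bandwidth h are sketched below; the normalizing constants and the exact internal scaling used by any particular software may differ.

```python
import numpy as np

def epanechnikov(r, h):
    """Epanechnikov kernel: quadratic decay, exactly zero beyond the bandwidth h."""
    u = r / h
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def gaussian(r, h):
    """Gaussian kernel: smooth decay, never exactly zero."""
    u = r / h
    return np.exp(-0.5 * u**2)

r = np.linspace(0, 3, 7)
print(epanechnikov(r, h=1.0))
print(gaussian(r, h=1.0))
```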
Figure 7.32 shows interpolated maps created using the Epanechnikov (left) and Gaussian (right) kernels.
Comparing predictions near the channel’s edges with predictions in figure 7.31 at bottom left, it can be seen
that kernels based on the shortest distance between points tend to spread near the edges of the channel,
while predictions based on the straight‐line distance between points do not “see” the channel surface. Models
based on the shortest distance between points are preferable in hydrological applications and the like.
Figure 7.32
Another option is the diffusion kernel, which is based on the heat (diffusion) equation. The physical meaning of this equation is to describe how heat, gases, or particles
diffuse with time in a homogeneous medium. The diffusion kernel is similar to the Gaussian one when there
are no barriers, but near the barrier edges the predictions made using the diffusion kernel gently flow around
barriers as in figure 7.33.
Figure 7.33
Cross‐validation diagnostics are shown in figure 7.34 for the inverse distance weighted (at top left),
Epanechnikov (bottom left), Gaussian (top right), and diffusion (bottom right) kernels. With this particular
dataset and barriers, the inverse distance weighted kernel has the lowest average prediction root‐mean‐
square error and the diffusion kernel has the largest. This is because the data were simulated ignoring
barriers. Nevertheless, predictions made with the diffusion kernel can be preferred because they have better
physical interpretation.
Figure 7.34
SEMITRANSPARENT BARRIERS BASED ON COST SURFACE
Cost‐weighted distance is a common raster function in GIS that calculates the cost of travel from one cell of a
grid to the next, making it a natural choice for the distance metric. The value of each cell in the cost surface
represents the resistance of passing through the cell and may be expressed in units of cost, risk, or travel
time. Figure 7.35 at left illustrates the use of a cost surface for interpolation using a side view of elevation
data. The x‐axis shows cell locations, and the y‐axis shows the cost values assigned to grid cells. There are
penalties for moving up and down, because a car, for example, uses more gas to go uphill and has more brake
wear going downhill. Cell locations where distance is changed are highlighted. On a flat surface, the distance
between points is calculated without penalties: moving from cell 3 to cell 4 is not penalized. Going uphill,
from cell 4 to cell 5, distance is added to the path because of the difference between the cost surface values in
the neighboring cells, using one of the following formulas:
1. “Additive” barrier
(average cost value in the neighboring cells) × (distance between cell centers)
2. “Cumulative” barrier
(difference between cost values in the neighboring cells) + (distance between cell centers)
3. “Flow” barrier for interpolating data with preferential direction of data variation
Indicator (cost value in the to neighboring cell > cost value in the from neighboring cell) × (cost value in the to
neighboring cell − cost value in the from neighboring cell) + (distance between cell centers),
where indicator(true) = 1 and indicator(false) = 0.
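As a plain-Python restatement of the three formulas above (the function and variable names are mine, not the software's, and the cumulative formula is written with an absolute difference as one reading of "difference between cost values"):

```python
def additive_barrier(cost_from, cost_to, d):
    """'Additive' barrier: average cost of the two cells times the distance."""
    return 0.5 * (cost_from + cost_to) * d

def cumulative_barrier(cost_from, cost_to, d):
    """'Cumulative' barrier: difference between the cost values plus the distance."""
    return abs(cost_to - cost_from) + d

def flow_barrier(cost_from, cost_to, d):
    """'Flow' barrier: penalize movement only in the increasing-cost direction."""
    return max(cost_to - cost_from, 0.0) + d

# Example: neighboring cells whose centers are 1 unit apart on a cost surface.
print(additive_barrier(2.0, 4.0, 1.0),
      cumulative_barrier(2.0, 4.0, 1.0),
      flow_barrier(4.0, 2.0, 1.0))
```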
The templates in figure 7.35 at right show several ways to calculate distances between centers of neighboring
cells. The more directions used, the closer the distance between points will be to an optimal trajectory.
However, the more directions used, the more time calculations take.
Figure 7.35
Courtesy of California Air Resources Board.
Figure 7.36
Figure 7.37 at left shows interpolation of ozone values using exponential kernel and a semitransparent
cumulative barrier with the cost surface from figure 7.36 at left. The predictions shown as contours over
elevation can be compared with standard local polynomial interpolation (figure 7.37, right). The elevation
values, acting through the cost surface, heavily influence the predicted ozone values, especially in the Los Angeles area
and near the ocean.
Courtesy of California Air Resources Board.
Figure 7.37
Courtesy of California Air Resources Board.
Figure 7.38
A major problem with local polynomial interpolation in the presence of barriers using kernels described in
this section is that the shape of the kernel is calibrated by cross‐validation instead of being defined by the
degree of spatial data correlation. This is a problem, because kernels with small bandwidth cannot predict
values between points that are separated by relatively large distance, while kernels with large bandwidth
produce an overly smooth surface. This problem can be solved using a geostatistical approach to data
interpolation (see “Simulating from kernel convolutions” in chapter 10).
ASSIGNMENTS
1) COMPARE THE PERFORMANCE OF DETERMINISTIC INTERPOLATION MODELS
The performance of interpolation models can be determined by comparing their predictions with known
values using diagnostics discussed in chapter 6. Figure 7.39 shows three different datasets with known
features:
1. A two‐dimensional mathematical function, at left.
2. DEM data assuming that elevation values are absolutely accurate, center.
3. Simulated Gaussian data with spatial correlation defined by a particular anisotropic
semivariogram model, right.
Figure 7.39
Geostatistical Analyst software option Create Subset can be used to create training and testing subsets.
However, this tool divides the data randomly, while in practice data are often preferentially sampled. Therefore,
methods for simulating preferentially sampled (clustered and inhibited) data are required. Chapters 13 and
16 explain how to simulate point patterns with required characteristics.
Use Geostatistical Wizard option Validation shown in figure 7.40 to perform this exercise.
Figure 7.40
2) FIND THE BEST DETERMINISTIC MODEL FOR INTERPOLATION OF CESIUM‐137
SOIL CONTAMINATION.
Find the best deterministic model (implemented in Geostatistical Analyst) for interpolation of cesium‐137
soil contamination data collected in Belarus in 1992. Data are available in the folder assignment 7.2.
FURTHER READING
1. Kondor, R. I., and J. Lafferty. 2002. “Diffusion Kernels on Graphs and Other Discrete Input Spaces.”
Proceedings of the International Conference on Machine Learning (ICML‐2002).
The authors propose and discuss a method of constructing kernels based on the heat equation; the
paper is written for readers with a strong background in mathematics.
PRINCIPLES OF
MODELING
SPATIAL DATA
CHAPTER 8: PRINCIPLES OF MODELING GEOSTATISTICAL DATA: BASIC
MODELS AND TOOLS
CHAPTER 9: PRINCIPLES OF MODELING GEOSTATISTICAL DATA: KRIGING
MODELS AND THEIR ASSUMPTIONS
CHAPTER 10: OPTIMAL NETWORK DESIGN AND PRINCIPLES OF
GEOSTATISTICAL SIMULATION
CHAPTER 11: PRINCIPLES OF MODELING REGIONAL DATA
CHAPTER 12: SPATIAL REGRESSION MODELS: CONCEPTS AND
COMPARISON
CHAPTER 13: PRINCIPLES OF MODELING DISCRETE POINTS
PRINCIPLES OF MODELING
GEOSTATISTICAL DATA: BASIC
MODELS AND TOOLS
OPTIMAL PREDICTION
GEOSTATISTICAL MODEL
GEOSTATISTICAL ANALYST’S KRIGING MODELS
SEMIVARIOGRAM AND COVARIANCE
WHAT FUNCTIONS CAN BE USED AS SEMIVARIOGRAM AND COVARIANCE
MODELS?
CONVOLUTION
SEMIVARIOGRAM AND COVARIANCE MODELS
MODELS WITH TRUE RANGES
POWERED EXPONENTIAL FAMILY OR STABLE MODELS
K-BESSEL OR MATÉRN CLASS OF COVARIANCE AND SEMIVARIOGRAM
MODELS
MODELS ALLOWING NEGATIVE CORRELATIONS
J-BESSEL SEMIVARIOGRAM MODELS
RATIONAL QUADRATIC MODEL
NESTED MODELS
INDICATOR SEMIVARIOGRAM MODELS
SEMIVARIOGRAM AND COVARIANCE MODEL FITTING
TREND AND ANISOTROPY
KRIGING NEIGHBORHOOD
DATA TRANSFORMATIONS
DATA DECLUSTERING
ASSIGNMENTS
1) SIMULATE SURFACES USING VARIOUS SEMIVARIOGRAM MODELS
2) FIND THE BEST SEMIVARIOGRAM MODELS FOR SIMULATED DATA
3) INVESTIGATE THE GEOSTATISTICAL ANALYST’S PREDICTION SMOOTHING
OPTION
4) TRY A GENERAL TRANSFORMATION OF NONSTATIONARY DATA
FURTHER READING
This chapter covers the foundations of geostatistical data analysis. First, the concept of optimal
prediction (kriging) is discussed, and the ArcGIS Geostatistical Analyst kriging models are introduced.
Next, semivariogram and covariance modeling, where most of the effort in geostatistics typically goes, is
discussed in detail, clarifying the difference between semivariogram and covariance models by using the
relationship between the theoretical covariance models and the kernel convolution. Then the difference
between standard and indicator semivariograms is discussed, and problems with indicator semivariogram
usage are highlighted. Ideas behind semivariogram and covariance model fitting are also discussed.
Next, a case when data vary differently in different directions is examined using concepts of trend and
anisotropy.
In practice, predictions are usually made using a dozen or so closest observations instead of using the entire
dataset. The consequences of using the kriging neighborhood are discussed and kriging modification for
continuous predictions is presented.
Finally, the use of data transformations to bring the data close to a normal distribution and satisfy stationarity
assumptions is discussed.
OPTIMAL PREDICTION
In this chapter we discuss statistical models (kriging) that predict a value of a variable of interest at
an unsampled location s0 based on the surrounding measurements Z(si).
Most kriging models predict a value using the weighted sum of the values in the nearby N
locations si:
$$\hat{Z}(s_0) = \sum_{i=1}^{N} \lambda_i Z(s_i),$$
where λi are the kriging weights.
The main statistical assumption behind kriging is one of stationarity, which means that statistical properties
of the data (mean and variance) do not depend on exact spatial locations, so the mean and variance of a
variable at one location are equal to the mean and variance at another location. Also, the correlation between
any two locations depends only on the vector that separates them, not on their exact locations. The
assumption of stationarity is very important since it provides a way to obtain replication from a single set of
correlated data and allows one to estimate parameters of the statistical model. Stationarity is needed for
estimation of spatial dependence through semivariogram, not for spatial prediction (kriging). When data
cannot be assumed to be stationary, detrending and transformation techniques may help to make data
reasonably close to stationarity.
Andrei Kolmogorov, one of the great mathematicians of the twentieth century, made major contributions to
the problem of predicting and filtering stationary random processes in one dimension in 1941. The Soviet
mathematician Isaak Yaglom summarized a theory of stationary random functions in one dimension (for a
time series) in 1952. In the introduction to his book, Yaglom mentioned that generalizing the theory of
stationary random functions to two and higher dimensions and space‐time would be easy. The Soviet
statistician and meteorologist Lev Gandin did this supposedly easy job. He developed optimal interpolation in
two dimensions in 1959. The term “optimal interpolation” was introduced by Norbert Wiener, an American
mathematician, in 1949, characterizing the fact that the mean square error of the interpolation is minimal in
this method. Gandin summarized his research in a book that was published in Russian in 1963 and translated
into English in 1965. Somewhat parallel work was done by the American econometrician Arthur Goldberger
in 1962. The French geologist and mathematician Georges Matheron published his first article on the subject in
1962. Matheron knew Russian and he visited the Soviet Union in the early 1960s; his comprehensive book on
geostatistics was published in Russian in the Soviet Union in 1968. Matheron knew about Yaglom’s and
Gandin’s research but for some reason did not refer to them. Matheron called the optimal interpolation
“kriging” after Danie Krige, a South African mining engineer who used empirical methods for determining
spatial ore-grade distributions but did not formulate an optimal spatial prediction model, and this name is used
today. According to the so‐called law of eponymy, a scientific notion is never attributed to the right person;
kriging is no exception.
Figure 8.1
One can reach the following conclusion from this derivation: if the semivariogram or covariance is known for
distances between all pairs of points in the data domain, a value at the unsampled location can be predicted,
and the prediction error can be estimated. Because the semivariogram and covariance are key parts of
geostatistical data analysis, a large part of this chapter is devoted to their estimation and modeling.
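To make this statement concrete, here is a minimal simple-kriging sketch assuming a known zero mean and an exponential covariance with made-up parameters; it solves the kriging system for the weights and reports the prediction and its kriging variance. It is a toy illustration of the idea, not a substitute for the models discussed below.

```python
import numpy as np

def exp_cov(h, sill=1.0, corr_range=10.0):
    """Exponential covariance model (illustrative parameterization)."""
    return sill * np.exp(-3.0 * h / corr_range)

def simple_kriging(x0, xs, zs, sill=1.0, corr_range=10.0):
    """Simple kriging with a known zero mean: prediction and kriging variance."""
    d = np.linalg.norm(xs[:, None, :] - xs[None, :, :], axis=2)
    C = exp_cov(d, sill, corr_range)                             # data-to-data covariances
    c0 = exp_cov(np.linalg.norm(xs - x0, axis=1), sill, corr_range)  # data-to-target covariances
    w = np.linalg.solve(C, c0)                                   # kriging weights
    prediction = w @ zs
    variance = sill - w @ c0                                     # kriging (prediction error) variance
    return prediction, variance

xs = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
zs = np.array([0.8, -0.2, 0.4, 0.1])                             # residuals from a known mean of zero
print(simple_kriging(np.array([2.0, 2.0]), xs, zs))
```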
Gandin’s classical derivation of ordinary kriging, which has been reproduced thousands of times in various
articles and books, does not assume data normality, and many practitioners believe that normality is not important.
However, most of the papers on kriging written by statisticians state that kriging is a Gaussian predictor and,
as a rule, this statement is not justified since it seems obvious to the authors. This is because only a Gaussian
random field is completely described by the mean and covariance. If the data follow multivariate Gaussian
distribution, kriging (more precisely, “simple” kriging, see below) is the best predictor under squared‐error
minimization criterion. If the process that produces the data is not Gaussian, kriging is the best linear
predictor, and additional model refinement is required for optimal prediction. For example, if the data follow
binomial or Poisson distribution, the empirical semivariogram formula must be modified (see “Binomial and
Poisson kriging” in chapter 12). Predictions and especially prediction standard errors can be significantly
improved if information about data distribution is used. If input data are not Gaussian, they often can be
transformed to an approximately Gaussian distribution.
The standard geostatistical modeling process is presented in figure 8.2. First, the empirical semivariogram
values, half of the squared difference between pairs of data values separated by distance h, are calculated. The
empirical semivariogram is shown in figure 8.2 at top right along the y‐axis, and the distance between the
data points is shown along the x-axis.
Then the semivariogram values are averaged for pairs within the specified distances and directions (figure
8.2 at bottom right) and fitted by a selected function of distance (semivariogram model), shown as a
blue line. The empirical semivariogram values calculation using the grid and sector methods is discussed in
chapter 6 and reference 2 in “Further reading.” Figure 8.2 at bottom right shows the empirical semivariogram
values calculated using the Geostatistical Analyst’s grid method (pink points), traditional empirical
semivariogram values (the green circles), the estimated spherical semivariogram model (the blue line), and
the kernel smoothed semivariogram values (black). The kernel smoothed semivariogram values
$$\hat{\gamma}(u) = \frac{\displaystyle\sum_{i=1}^{N}\sum_{j=1}^{N} \mathrm{kernel}\!\left(\frac{u - |s_i - s_j|}{b}\right)\big(Z(s_i) - Z(s_j)\big)^2}{\displaystyle 2\sum_{i=1}^{N}\sum_{j=1}^{N} \mathrm{kernel}\!\left(\frac{u - |s_i - s_j|}{b}\right)},$$
where kernel(⋅) is a symmetric function and N is the number of data pairs in the neighborhood defined by the
bandwidth parameter b, can be considered as a target for modeling using nested semivariogram models; see
"Nested models" below.
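A bare-bones version of the binned (averaged) empirical semivariogram can be written as follows; the bin width, number of lags, and simulated data are arbitrary choices for illustration, and this is not the grid or sector method used by the software.

```python
import numpy as np

def empirical_semivariogram(xs, zs, lag_width, n_lags):
    """Average 0.5*(Z(si)-Z(sj))^2 over distance bins of width lag_width."""
    d = np.linalg.norm(xs[:, None, :] - xs[None, :, :], axis=2)
    sq = 0.5 * (zs[:, None] - zs[None, :])**2
    iu = np.triu_indices(len(zs), k=1)              # each pair counted once
    d, sq = d[iu], sq[iu]
    centers, gammas = [], []
    for k in range(n_lags):
        in_bin = (d >= k * lag_width) & (d < (k + 1) * lag_width)
        if np.any(in_bin):
            centers.append((k + 0.5) * lag_width)
            gammas.append(sq[in_bin].mean())
    return np.array(centers), np.array(gammas)

rng = np.random.default_rng(3)
xs = rng.uniform(0, 100, size=(200, 2))
zs = np.sin(xs[:, 0] / 25.0) + 0.2 * rng.standard_normal(200)
print(empirical_semivariogram(xs, zs, lag_width=10.0, n_lags=8))
```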
Semivariogram(distance h) = ½ average[(value at location i − value at location j)²] for all pairs of locations i and j separated by distance h.
Compute the empirical semivariogram using points averaging; estimate the semivariogram model (yellow line); calculate weights and make a prediction to the center of the ellipse using the highlighted neighboring measurements.
Figure 8.2
But how good is our kriging model? Performing diagnostics before mapping is good practice, and the final
dialogs in Geostatistical Analyst are cross‐validation (default) and validation (optional) diagnostics. Cross‐
validation and validation diagnostics are discussed in chapter 6.
A typical scenario of data analysis using kriging is presented in figure 8.3. In this scheme, γ(dij,θ) is a
semivariogram model that depends on the distance between points dij and on the parameters θ (such as
nugget, range, and partial sill, see below). If data preprocessing (transformation and/or detrending) Y(si) = F(Z(si)) is not used, Y(si) is equal to Z(si).
Figure 8.3
Geostatistical Analyst includes several kriging models that are based on different assumptions. The models
differ in the way they model the mean value and in the assumptions they make about the data distribution
(no assumptions, multivariate Gaussian distribution, or bivariate Gaussian distribution). The choice of a
kriging model should be based on the result of data exploration and a priori information about the physical
process that supposedly generated the data.
The variation of many processes, both natural and artificial, can be categorized as large‐scale and small‐scale.
For example, weather usually has no predictability beyond a week (small-scale variation), while the bigger
concept of climate is the averaged weather (large‐scale variation). Climate is a reflection of the probabilistic
properties of weather fluctuations. The random weather fluctuations may hide the real changes in climate.
In Geostatistical Analyst, large‐scale variation is modeled by low‐order polynomials. Large‐scale variation can
also be modeled using covariates (explanatory variables) with the spatial regression models discussed in
chapter 12 if the data variation partially depends on other variables and not just on x‐ and y‐coordinates.
Small‐scale variation is modeled by a stationary process with an estimated semivariogram or covariance
model. If the semivariogram is not the same for all distances, accurate estimation of the large‐scale variation
requires simultaneous estimation of the correlation among observations.
The distinction between large‐scale and small‐scale variability is a modeling assumption, not a real data
feature. Therefore, the same data can be treated by different researchers as mean surface or random effects.
The geostatistical model assumes that data are the sum of a true measured variable called signal (meaning
“the message”) and a sum of random errors:
data = signal + random errors
Normally we are interested in the true data value, not in the value with the erroneously measured noise
component. Therefore, spatial prediction should be a prediction of the signal. Signal is modeled as the sum of
a deterministic mean function, which changes slowly in space (large-scale data variation or trend),
a small-scale random error variation with a range of data correlation of approximately several typical distances between pairs of data locations, and
a microscale random error variation at distances smaller than the shortest distance between measurement locations.
Kriging with an assumed measurement error component produces new values at the observed locations (see the
example in chapter 1). Predictions of the signal and of the error-contaminated process at the unsampled
locations coincide, but the prediction standard errors are different; the signal's prediction standard error is
smaller. Although this kriging feature was discussed by Gandin in 1963, it is often overlooked in
modern geostatistical and statistical literature. For example, the SAS 9.1 procedure krige2d (ordinary kriging)
predicts the noisy process, always assuming that there is no measurement error in the data.
The sum of the errors is usually called measurement error. Measurement error here is not just a
measurement device error. It may consist of several error types, including locational error and error due to
local data integration, see the discussion in chapter 3.
The geostatistical model above can be rewritten as:
data = trend + smallscale variation + microscale variation + measurement error
By definition, the last three components in the expression above have an expected value of 0. It is assumed
that all four components of the geostatistical model are independent of each other because it is unclear how
to estimate the correlation between the components.
Figure 8.4 displays data measurements, a true but unknown signal, and a trend in one dimension (in this case,
the true signal is known because data are simulated).
Figure 8.4
Figure 8.5
The goal of kriging is the reconstruction of large‐ and small‐scale data variations. If the data are not precise,
kriging can filter out average measurement error and predict a new value at the measurement location.
A new prediction at the data location is more accurate than the original noisy measurement. This statement is
often surprising to beginners. But when making a new prediction, we use additional information on the
nearby measurements and the spatial data structure. To understand why this makes the prediction more
accurate than the actual measurement, recall the formula for the standard deviation,
$$\sqrt{\frac{\sum_{i=1}^{n} (Z_i - \bar{Z})^2}{n - 1}},$$
where $\bar{Z}$ is the mean of the n measurements: combining several measurements reduces the uncertainty of the estimated value.
There is no way to reconstruct microscale data variation because data are not available at the scale where
this variation is detectable. Because of this, surfaces based on kriging predictions are smoother than real data
variation.
Figure 8.6 presents the data, the true signal, and the filtered kriging predictions for data contaminated by
measurement errors. The true signal is equal to neither data nor predictions. This is bad news. But the good
news is that kriging provides estimation of the upper and lower bounds of prediction. If the kriging model is
estimated properly, the bounds include the true signal, as shown in figure 8.6.
Figure 8.6
GEOSTATISTICAL ANALYST’S KRIGING MODELS
Six kriging models in the Geostatistical Wizard (versions 8.1 – 9.3) allow manipulation of a large number of
model parameters.
Simple, ordinary, and universal kriging are linear predictors, meaning that prediction at any location is
obtained as a weighted average of neighboring data. These three models make different assumptions about
the mean value of the variable under study.
Simple kriging requires a known mean value or mean surface as input to the model;
Ordinary kriging assumes a constant, but unknown, mean and estimates the mean value as a constant
in the prediction searching neighborhood; and
Universal kriging models local means using low-order polynomial functions of the spatial coordinates.
Figure 8.7 shows the difference between mean value specification (simple kriging) and estimation (ordinary
and universal kriging). The residuals that result, shown for two data points, are different, and we might
expect that the kriging predictions will be different as well.
Figure 8.7
For ordinary kriging, we estimate the mean (the blue line), so we also estimate the random errors. In figure
8.7, the measurements can be interpreted as the elevation values collected from a line transect over a
mountain. The data are more variable at the edges and smoother in the middle. In fact, these data were
simulated from the kriging model with a constant mean. Therefore, both simple and ordinary kriging can be
used for data that seem to have a trend because there is usually no way to decide, based on the data alone,
whether the observed pattern is the result of data correlation only.
For universal kriging, the mean (the black line) in figure 8.7 is a third‐order polynomial. If we subtract it from
the original data, we obtain the random errors. Using universal kriging, we are doing regression with the
spatial coordinates as the explanatory variables. However, instead of assuming that the errors (residuals) are
independent, we model them as correlated. There is a computational problem with universal kriging:
knowledge about trend is required for estimation of the semivariogram model, but estimation of trend
requires knowledge of the semivariogram model (the chicken and egg problem). Iterative algorithms provide
an approximate solution of the problem.
Although prediction surfaces created by simple, ordinary, and universal kriging are usually not very different,
the differences in their approaches to modeling large‐scale data variation may produce very different
prediction standard error surfaces.
Disjunctive kriging uses a linear combination of functions of the data rather than the original data values
themselves:
$$\hat{Z}(s_0) = \sum_{i=1}^{N} f_i\big(Z(s_i)\big).$$
Gaussian disjunctive kriging implemented in Geostatistical Analyst assumes that all data pairs come from a
bivariate normal distribution. The validity of this assumption should be checked in the Geostatistical Analyst
Examine Bivariate Distribution dialog (see chapter 9). When this assumption is met, disjunctive kriging may
outperform other kriging models.
Indicator kriging uses preprocessed data. Indicator values are defined for each data location as the following:
an indicator is set to zero if the data value at the location s is below the threshold, and to 1 otherwise:
$$I(s) = I\big(Z(s) \geq \mathrm{threshold}\big) = \begin{cases} 1, & Z(s) \geq \mathrm{threshold},\\ 0, & Z(s) < \mathrm{threshold}.\end{cases}$$
These indicator values are then used as input to one of the linear kriging models, usually ordinary kriging.
Assuming that ordinary kriging produces continuous predictions with values between 0 and 1, a predicted
value at a location s is interpreted as the probability that the threshold is exceeded at that location.
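The indicator transformation itself is a one-line operation; the values and the threshold below are hypothetical.

```python
import numpy as np

z = np.array([0.2, 1.7, 0.9, 2.4, 0.5])   # measured values
threshold = 1.0                            # hypothetical threshold
indicators = (z >= threshold).astype(int)  # 1 where the threshold is met or exceeded, else 0
print(indicators)                          # [0 1 0 1 0]
```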
Cokriging combines spatial data on several variables to make a single map of one of the variables using
information about the spatial correlation of the variable of interest and cross correlations between it and
other variables. For example, predictions of ozone pollution can be improved by using distance from roads
and measurements of nitrogen dioxide as secondary variables.
Probability kriging is a variant of cokriging in which the primary variable first undergoes an indicator
transformation; the indicators are then modeled jointly with the original values of the primary variable to enhance the predictions. Because
indicators are used for the primary variable, the resulting prediction is of the same type as in indicator
kriging: the probability that a specified threshold is exceeded.
It is appealing to use information from other variables to help in making predictions, but it comes at a price—
the more parameters that need to be estimated, the more uncertainty is introduced in the predictions.
When fitting a semivariogram or covariance model with only one variable, we estimate three or sometimes a
few more parameters (depending on the covariance model). Model fitting for two or more variables is more
complicated, because we need to estimate (3–6 + 3–6 + 1–3 = 7–15; here “3–6” means “between 3 and 6”)
parameters for two variables, (3–6 + 3–6 + 3–6 + 1–3 + 1–3 + 1–3 = 12–27) parameters for three variables,
and (3–6 + 3–6 + 3–6 + 3–6 + 1–3 + 1–3 + 1–3 + 1–3 + 1–3 + 1–3 = 18–42) parameters for four variables. In
the case of anisotropy, one additional parameter should be estimated for each model. Cross covariance model
with shift discussed in chapter 14 requires two more parameters. It is very difficult to estimate such a large
number of parameters accurately. Although there are applications with a large number of variables,
Geostatistical Analyst allows the use of no more than four continuous variables, because a reliable estimation
of a very large number of model parameters is problematic.
All kriging models mentioned in this section can use secondary variables. Then they are called simple
cokriging, ordinary cokriging, and so on. Cokriging is discussed in “Multivariate geostatistics” in chapter 9.
SEMIVARIOGRAM AND COVARIANCE
Semivariogram and covariance functions model spatial data variability based on the distance between
locations. For most applications, the semivariogram and covariance are unknown functions of distance and
are estimated using the observed data.
Semivariogram functions have been used in meteorology since 1925 (when they were called the structure
functions). Meteorologists assumed that moments of order greater than two (the first two moments of the
data are mean and variance) might be neglected when characterizing the average motion in hydrodynamics.
In 1941, Kolmogorov presented a simplification of the theory for locally homogeneous and isotropic
turbulence flow. In particular, he showed that the velocity structure function (semivariogram) is proportional
to the two-thirds power of the distance in a range that satisfies an assumption of local stationarity. After
Kolmogorov’s publications on structural analysis, a number of case studies were published, mostly in the
Soviet Union. Since 1959, when Gandin developed optimal spatial interpolation (kriging), the semivariogram
has been used extensively as a necessary step in spatial interpolation.
A flow chart of the geostatistical process with emphasis on structural analysis (variography) is shown in
figure 8.8.
Figure 8.8
In chapter 6, a section on isotropy includes a discussion of how Geostatistical Analyst calculates the empirical
semivariogram values (red points in figure 8.9) and semivariogram surface. The next step is to estimate the
model that best fits it. Default parameters for the model are found with the software by minimizing the
squared differences between the empirical semivariogram values and the theoretical model (blue line). The
user can then change the model’s parameters.
The semivariogram function has three or more parameters. In the exponential model below, the three
parameters are
Partial sill—the amount of variation in the process that generated the data
A nugget (a discontinuity at the origin)—data variation due to measurement errors and data
variation at fine scale
Range—the distance beyond which data do not have significant statistical dependence
Figure 8.9
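For orientation, a common textbook parameterization of the exponential semivariogram combines these three parameters as follows; the exact scaling of the range inside the exponent is a convention that may differ from the software's:
$$\gamma(h) = \mathrm{nugget} + \mathrm{partial\ sill}\cdot\left(1 - \exp\!\left(-\frac{3h}{\mathrm{range}}\right)\right), \qquad h > 0,$$
so that under this convention the model reaches about 95 percent of the sill (nugget plus partial sill) at the range distance.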
The nugget occurs when measurements are different at very short distances. Ideally, the closer the sampling
points, the less disparate the measurements should be, with this difference approaching zero at the zero
distance between samples. With real data, that does not happen, and samples taken at the same location can
differ, leading to a nugget effect in the semivariogram model.
Figure 8.10
Similar to spatial data, the happiness model is uncertain near the origin, because life without money or its
equivalent is not possible, and an absolutely unhappy person does not exist.
In Geostatistical Analyst, the proportion between measurement error and microstructure can be specified, as
shown in figure 8.11 at left. If measurement replications are available, the proportion of measurement error
in the nugget can be estimated, as shown in figure 8.11 at right.
Figure 8.11
In two‐dimensional space, the semivariogram and covariance functions may change not only with distance
but also with direction. Figure 8.12 shows how the Semivariogram/Covariance Modeling dialog looks when
spatial dependency varies in different directions. The semivariogram values are averaged based on the
direction and distance between pairs of locations, after which a surface of the semivariogram values in polar
coordinates, shown in the bottom left corner of the Semivariogram/Covariance Modeling dialog in figure 8.12,
is produced. The center of the semivariogram surface corresponds to zero distance between locations. An
asymmetrical semivariogram surface indicates that data variation in different directions is different. In this
case, the semivariogram model (yellow lines) changes gradually as direction changes between pairs of points.
In figure 8.12, the distance of significant correlation in the northwest direction (Major Range) is
approximately twice as large as in the perpendicular direction (Minor Range).
Figure 8.12
Covariance, the expected value of the product of the deviations of two random variables from their mean
values, is a statistical measure of similarity and correlation. It looks like an upside‐down semivariogram
model, as in figure 8.13 (blue). The total variation within the data or the sill parameter is a sum of nugget (red
line) and partial sill (yellow); the distance between points beyond which the data values are independent of
one another (the range) is shown in pink. The nugget value is the distance between the circle close to the top
of the y‐axis and where the blue line nearly intersects the y‐axis.
Figure 8.13
For semivariogram models with sills, it is easy to go back and forth between semivariogram and covariance
functions using the relations
$$\gamma(h) = \mathrm{sill} - \mathrm{Cov}(h), \qquad \mathrm{Cov}(h) = \mathrm{sill} - \gamma(h).$$
In Geostatistical Analyst, all semivariograms and covariances have a sill. Covariance requires the overall mean
value, which is usually unknown, to be specified, but the semivariogram does not. This is the main reason why
the semivariogram is used more often than covariance. In practice, the semivariogram is usually better for
estimating the data correlation at small distances (the nugget parameter and the shape of the model at small
distances) while covariance can be better for estimating the model parameter for large lags (sill and range
parameters).
To be a valid model, the semivariogram γ(h) must grow more slowly than the function a⋅h².
(Note: in the 1950s, Yaglom proposed a more general class of covariance called the generalized covariance.
Spatial processes based on the generalized covariance were called "intrinsic random functions" by Matheron
in the 1970s. They can be used when the semivariogram increases faster than the square of the distance between
pairs of points.)
Figure 8.14 shows clouds of semivariogram value points exported from Geostatistical Analyst and the best fit
provided by Microsoft Excel using the power model a⋅h^b (in red). The estimated power value b in figure
8.14 at left is 0.5646. This is a valid model. The semivariogram model in figure 8.14 at right grows faster than
is permissible, with power value b equal to 2.1952.
Figure 8.14
All semivariogram models in Geostatistical Analyst grow more slowly than the function a⋅h².
WHAT FUNCTIONS CAN BE USED AS SEMIVARIOGRAM AND COVARIANCE MODELS?
Using a linear combination G of the variables being studied, Z(si),
$$G = \sum_{i=1}^{N} b_i Z(s_i),$$
where bi are some constants, and assuming that the mean value of Z(si) equals zero (that is, the data were
detrended), the square average of the linear combination G is
$$E\big(G^2\big) = \sum_{i=1}^{N}\sum_{j=1}^{N} b_i b_j\, \mathrm{Cov}\big(|s_i - s_j|\big) \geq 0,$$
since $E\big(Z(s_i)Z(s_j)\big)$ is the covariance between si and sj when the mean is equal to zero. In the last expression, h = |si − sj|
is the distance between points si and sj.
The inequality above is true by construction, no matter what the constants bi and bj are. A
covariance function that satisfies such an inequality is called positive definite. To check the validity of a
formula for a particular covariance model, we could examine all combinations of the coefficients bi and bj.
The result of all the calculations in two dimensions is the following: the covariance model Cov(h) is positive
definite if its Fourier transformation
$$f(u) = \frac{1}{2\pi}\int_0^{\infty} \mathrm{Cov}(h)\, J_0(uh)\, h\, dh$$
is greater than zero for all 0 ≤ u < ∞, where J0(⋅) is a Bessel function. The covariance can be found from its Fourier representation as
$$\mathrm{Cov}(h) = 2\pi\int_0^{\infty} f(u)\, J_0(uh)\, u\, du.$$
For some covariance functions, the integral can be calculated analytically.
For more complicated models, it can be calculated numerically.
Using the relationship between the covariance and the semivariogram, the function f(u) can be expressed in
terms of the semivariogram, so all the calculations for the covariance can be used for the semivariogram as well.
The consequence of this covariance/semivariogram feature is that it is quite possible that the semivariogram
model drawn manually in figure 8.15 at left (blue line) will not be positive definite, so using it could lead to
absurd results: a negative variance in some or many prediction locations. Figure 8.15 at right shows another
invalid model in two dimensions, the tent semivariogram, which consists of two lines. However, this
semivariogram can be used safely in one dimension. (Geostatistical Analyst has the circular semivariogram,
which is very close to the tent semivariogram and valid in two dimensions, see below.)
Figure 8.15
It is possible to check the validity of the lines displayed in blue in figure 8.15 at left using their spectral
representation. If a spectrum of some parts of the line is negative, then it can be corrected, and a valid
semivariogram model that is close to those shown in blue in figure 8.15 at left can be derived.
The valid semivariogram models are functions of the distance h and of constants a, b, c, d, and f that should be estimated; some of the formulas also involve J(⋅), a Bessel
function; K(⋅), a modified Bessel function; and Γ(⋅), a gamma function. There are some limitations on the constants. For example, the constant b
should be less than 2.
Some of the semivariogram functions look complicated, and one may ask how it would be possible to discover such formulas. An answer is that these formulas are simple
in their spectral representation.
It is almost always assumed that the distance between points is a straight-line (Euclidean) distance. With another
distance metric, a model that is valid for the Euclidean distance metric may become invalid (its spectrum may become negative).
metric. The exponential semivariogram (see the formula for this model below) is the model proven to be valid
for the city‐block distance metric. There is little information on valid semivariogram models for other non‐
Euclidean distance metrics because kriging is rarely used there. Interpolation using distance metric based on
cost surface is discussed in chapters 7 and 10.
Suppose we are modeling the distribution of contamination from a point source located in the grid cell shown
in the center of figure 8.16.
Figure 8.16
We want to estimate the pollution in each raster cell, assuming that pollution can be described by a
symmetrical function that decreases as the distance from the central cell increases, as shown in figure 8.16,
where red represents the highest pollution, pink high, brown medium, and yellow low. This
function is called a kernel. With just one source of pollution, we know the level of contamination everywhere. If
we have several sources each described by a kernel, one way to calculate the sum of pollutions using distance
from the sources to the cells is to weight the influence of each source according to the kernel value, which
changes with distance from the source of pollution, figure 8.17.
Figure 8.17
If the cell we are interested in has the location (i,j), and the one to the right is (i+1,j), and so on, then the
formula for total contamination is
Z(i,j) = Σk Σl Source(k,l)⋅Kernel(k – i, l – j),
where Source(k,l) is the value of pollutant in cell (k,l) and Kernel(k – i, l – j) is a function of decreasing
pollution with distance from the source. This process is called a discrete values convolution or moving
average. If identically and independently distributed Gaussian random variables are used instead of
Source(k,l) in the last formula, convolution leads to a continuous Gaussian white noise process over the space.
The same process can be equivalently generated using a covariance function (see also “Simulating from
kernel convolutions” in chapter 10).
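A minimal Python sketch of this moving-average construction, under the assumption of a simple Gaussian-shaped kernel on a raster (the grid size and kernel width are arbitrary), is the following:

import numpy as np
from scipy.signal import fftconvolve

rng = np.random.default_rng(0)

# Independent Gaussian white noise on a 200 x 200 raster.
noise = rng.standard_normal((200, 200))

# A symmetric kernel that decreases with distance from the central cell.
half = 10
y, x = np.mgrid[-half:half + 1, -half:half + 1]
kernel = np.exp(-(x**2 + y**2) / (2.0 * 3.0**2))

# Discrete convolution (moving average) produces a spatially correlated surface.
surface = fftconvolve(noise, kernel, mode="same")
print(surface.shape, surface.std())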
One function can be calculated from another using the following sequences:
Kernel(s) → (FT) → Kernel(ω) → (squared) → Kernel2(ω) → (IFT) → Cov(s)
Cov(s) → (FT) → Cov(ω) → (square root) → Cov1/2(ω) → (IFT) → Kernel(s),
where FT and IFT denote the Fourier transformation and its inverse, the squaring and square‐root operations are
applied to each cell of the spectrum, and Cov(ω) and Kernel(ω) are the spectra of the covariance Cov(s) and the kernel Kernel(s).
The relationship between covariance and kernel is not one to one; multiple kernels can give the same
covariance function, but not vice versa.
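The first of the two sequences can be sketched with the discrete Fourier transform; in the Python lines below, the one-dimensional grid and the exponential kernel are arbitrary choices for illustration only.

import numpy as np

n = 256
x = np.arange(n) - n // 2
kernel = np.exp(-np.abs(x) / 5.0)               # a 1D kernel, for illustration

spec = np.abs(np.fft.fft(np.fft.ifftshift(kernel)))**2    # Kernel(w) squared
cov = np.real(np.fft.fftshift(np.fft.ifft(spec)))         # back to Cov(s)
cov /= cov.max()                                          # normalize to unit variance
print(cov[n // 2], cov[n // 2 + 1])   # covariance at lag 0 and lag 1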
Using the convolution approach, the covariance functions can be transformed to the relevant kernels. One can
imagine the relationship between the physical process that generated the data and the moving kernel that
averages the data, and then select the closest kernel from the right column in the pairs of graphs in the next
section. The corresponding covariance model may produce kriging predictions that make better
physical sense. For example, if heavy rain can be modeled as a moving cylinder, then a circular model is the
most appropriate one for modeling the amount of precipitation. However, it is unlikely that a cylindrical
kernel would be relevant for generating data with scattered rainfall.
SEMIVARIOGRAM AND COVARIANCE MODELS
In theory, the best covariance model in one dimension is exponential and in two dimensions the K‐Bessel
model has the best theoretical properties.
Geostatistical Analyst 9.3’s default model is a sum of two models, nugget and spherical, because this is the
most popular combination of models in geostatistical literature.
Nugget model
γ(h; θs) = 0 for h = 0 and γ(h; θs) = θs for all h > 0, where θs ≥ 0.
The kernel that corresponds to the nugget model is infinitely small so other cells do not influence the value in
a particular cell when this kernel is used for averaging (see figure 8.18).
The nugget describes measurement error as well as the data variation at distances smaller than the shortest
distance between any two locations with measurements. The nugget model is usually used together with one
of the other semivariogram or covariance models.
Figure 8.18
Covariance models and kernels in figure 8.18 are shown in three dimensions by rotating a space under the
covariance model line around the y‐axis, as shown at the right in figure 8.19. In the case of the nugget model, the
line rotation produces a plane, as shown in figure 8.18 at left.
Figure 8.19
If the estimated covariance or semivariogram model is a nugget model, then data are spatially independent.
The process described by the nugget model is called white noise in signal processing by analogy with white
light, where the power density spectrum is constant. In this case, all frequency components of the signal are
distributed uniformly.
MODELS WITH TRUE RANGES
Four of the semivariogram and covariance models in Geostatistical Analyst are close to linear at small
distances, and they have true ranges beyond which the correlation is exactly zero:
Circular
Spherical
Tetraspherical
Pentaspherical
The circular model is valid in one and two dimensions; the spherical in one, two, and three dimensions; the
tetraspherical in one to four dimensions; and the pentaspherical in one to five dimensions.
These models can be constructed using the indicator function of the sphere of d‐dimension with the radius
equal to half of the range (the corresponding kernel is constant inside the sphere and zero outside it).
In one dimension, one such model is the tent semivariogram:
γ(h; θ) = θs⋅h/θr for h ≤ θr and γ(h; θ) = θs for h > θr.
The tent model is not valid in two or more dimensions.
The circular model is produced by a cylindrical kernel as presented in figure 8.20.
Figure 8.20
One of the kernels that corresponds to the spherical covariance is also cylindrical. The spherical model
received its name because its value is equal to the volume of the intersection of two spheres with the same
diameter (equal to the range of data correlation) that are separated by the distance h.
One kernel that corresponds to the pentaspherical model is presented in figure 8.21.
Figure 8.21
Models with true ranges are compared in figure 8.22 at left. Not many physical processes can be modeled
using a cylindrical kernel, because it is unlikely that real processes have such a strong edge effect.
Consequently, circular and spherical semivariograms do not make much physical sense either, and the
popularity of the spherical model among geostatistical software users is probably because of its simple
formula, not because the model has good statistical features. Pentaspherical and tetraspherical models have
better statistical features in two and three dimensions than a spherical semivariogram.
Other models in Geostatistical Analyst become uncorrelated only at infinitely large distances. The parameter range
in these models, called the practical range, is the distance at which the model reaches 95 percent of the data variance (the parameter sill).
Figure 8.22
POWERED EXPONENTIAL FAMILY OR STABLE MODELS
γ(h; θ) = θs⋅(1 − exp(−3⋅(h/θr)^θe)) for all h, where the power parameter θe is in the range 0 ≤ θe ≤ 2.
Powered exponential models shown in figure 8.23 are valid in all dimensions. When a parameter θe tends to
zero, the model tends to the nugget model. Special cases of the powered exponential model are Gaussian
(parameter equals 2) and exponential (parameter equals 1). The stable model has a parameter that controls
the shape of the model. It is flexible, and it is a good candidate for the default model.
Figure 8.23. Stable models with power parameters 1.0 (exponential), 1.5, and 1.95.
The behavior of the semivariogram and covariance models near the origin (or the sharpness of their kernels)
is important because the predicted surface near the measurement location behaves like the covariance at
small distances.
Figure 8.24
Figure 8.25 presents surfaces that were simulated using circular, Gaussian, exponential, and spherical models
with the same semivariogram parameters
partial sill = 1.0, nugget = 0.01, and range = 2.0
We used the hillshading option to highlight surface variability. Note that the smooth version of conditional
simulation discussed in reference 5 in “Further reading” was used.
We see that the choice of a semivariogram model influences interpolated surface smoothness. A surface
simulated using the Gaussian semivariogram model is smoother than surfaces simulated using the three
other semivariogram models. The number of hills and valleys is the largest for a surface created using an
exponential semivariogram.
Most statisticians do not recommend using a Gaussian model because it is numerically unstable (the default
Gaussian model in Geostatistical Analyst always has a nugget component to overcome this problem) and
because it is smoother than any real physical process. In chapter 3, we presented an example of fitting DEM
data in which the Gaussian model outperformed all other models. This is evidence that the commonly used
DEM provides oversmoothed data and cannot be successfully used for modeling local physical processes.
K‐BESSEL OR MATÉRN CLASS OF COVARIANCE AND SEMIVARIOGRAM MODELS
γ(h; θ) = θs⋅[1 − ((Ωθk⋅h/θr)^θk ⋅ Kθk(Ωθk⋅h/θr)) / (2^(θk−1)⋅Γ(θk))] for all h,
where the shape parameter θk ≥ 0; Ωθk is a value found numerically so that γ(θr) = 0.95⋅θs for any θk; Γ(θk) is the
gamma function; and Kθk(⋅) is the modified Bessel function of the second kind of order θk.
This model is named in the statistical literature in honor of the Swedish statistician Bertil Matérn, although it is not
clear who proposed it first. In particular, this model was used for two‐dimensional statistical data analysis by
Soviet meteorologists in the 1950s, before Matérn. Geostatistical Analyst uses the name K‐Bessel, referring to
the mathematical function used in the formula. The K‐Bessel model is very flexible; it includes not only
exponential and Gaussian models but also some others.
Covariance models corresponding to particular values of the parameter θk:
θk → 0: Nugget
θk = 1/2: Exponential
θk = 1: θs⋅(h/θr)⋅K1(h/θr) (Whittle’s model)
θk = 3/2: θs⋅(1 + h/θr)⋅exp(−h/θr)
θk = 5/2: θs⋅(1 + h/θr + (h/θr)2/3)⋅exp(−h/θr)
θk → ∞: Gaussian
Table 8.1
Some known processes can be described by the K‐Bessel model with specific parameters. For example, the
following stochastic Laplace process has Whittle’s covariance function:
(∂²/∂x² + ∂²/∂y² − κ²)⋅Z(s) = ε(s),
where ε(s) is a white‐noise process.
The parameter θk of the model also specifies the degree of differentiability of the random field: Z is n times
mean square differentiable if θk > n. Mean square differentiability means that the expected squared difference
between the difference quotient (Z(s + h) − Z(s))/h and the derivative process converges to zero when h tends to zero.
Comparing the covariances and kernels in figure 8.26 suggests that the process becomes smoother as the
parameter θk increases.
Figure 8.26. Covariances and kernels for the K‐Bessel model with parameters θk = 0.25, 1.0, 1.75, 4.0, and 15.0.
Because the K‐Bessel model estimates smoothness from the data instead of using a predefined shape as do
the spherical or exponential models, we suggest that the K‐Bessel model should almost always be used as the
final semivariogram model. However, because computation of parameters for this model is time consuming,
the stable model can be used to define the appropriate lag and maybe some other semivariogram parameters
first, using the fact that the power parameter θe in the stable model corresponds to approximately 2⋅θk in the
K‐Bessel model.
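For reference, the K-Bessel (Matérn) covariance can be evaluated with standard special functions. The Python sketch below uses a common textbook parameterization (scale r, shape k, unit sill) that only approximates the parameterization used internally by Geostatistical Analyst.

import numpy as np
from scipy.special import gamma, kv

def matern_cov(h, r=1.0, k=0.5, sill=1.0):
    """Matern (K-Bessel) covariance; k = 0.5 gives the exponential model."""
    h = np.asarray(h, dtype=float)
    c = np.empty_like(h)
    zero = h == 0
    hs = h[~zero] / r
    c[~zero] = sill * (2.0**(1.0 - k) / gamma(k)) * hs**k * kv(k, hs)
    c[zero] = sill
    return c

h = np.array([0.0, 0.5, 1.0, 2.0])
print(matern_cov(h, r=1.0, k=0.5))   # close to exp(-h) for k = 0.5
print(matern_cov(h, r=1.0, k=10.0))  # much smoother process, closer to a Gaussian model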
MODELS ALLOWING NEGATIVE CORRELATIONS
The semivariogram models discussed so far increase monotonically with increasing lag distance. However,
some meteorological, ecological, and geologic processes are naturally cyclical and non‐monotonically
increasing models are required. An example of cyclical correlation is daily rainfall data, because rainfall
occurs in areas of low atmospheric pressure that alternate with areas of high atmospheric
pressure.
A semivariogram that describes the periodic structure of data can be constructed by multiplying two valid
semivariograms, for example, exponential and cosine, as meteorologists suggested
many years ago. But current practice shows that semivariograms and covariances constructed with the J‐Bessel
function, described next, are preferable.
J‐BESSEL SEMIVARIOGRAM MODEL
γ(h; θ) = θs⋅[1 − 2^θd⋅Γ(θd + 1)⋅Jθd(Ωθd⋅h/θr) / (Ωθd⋅h/θr)^θd] for all h,
where the shape parameter θd ≥ 0; Jθd(⋅) is a Bessel function of the first kind of order θd; Γ(⋅) is the gamma
function; and Ωθd is a value found numerically so that γ(θr) = 0.95⋅θs.
A special case (θd = 0.5) is the hole effect model:
γ(h; θ) = 0 for h = 0 and γ(h; θ) = θs⋅[1 − sin(2π⋅h/θr) / (2π⋅h/θr)] for h ≠ 0.
Figure 8.27. Covariances and kernels for the J‐Bessel model with parameters θd = 0.01, 0.5 (hole effect), 1.0, 2.0, and 5.0.
The amplitude of the periodic part of the J‐Bessel semivariogram model decreases as the parameter θd
increases, and for a large parameter θd (θd>10.0), it becomes similar to the K‐Bessel model with a large
parameter θk.
Figure 8.28
The cross‐validation diagnostic comparison shows that kriging with the hole effect model outperformed
kriging with the spherical one: the regression line is almost equal to a perfect fit y = x for the hole effect
model, and its root‐mean‐square and average standard prediction errors are significantly smaller (see figure
8.29).
Figure 8.29
RATIONAL QUADRATIC MODEL
The rational quadratic semivariogram model,
shown in figure 8.30, is included in Geostatistical Analyst because it is popular in the geostatistical
community. It is comparable to a K‐Bessel model with a parameter θk smaller than 1.
Figure 8.30
The most important part of the semivariogram and covariance models from the prediction point of view is at
small distances. Generally, an incorrect choice of model leads to less accurate predictions than an incorrect choice of
parameters of the correct model. Based on the correspondence between covariances and kernels shown
above, the Gaussian, exponential, and spherical models differ at small distances. Therefore, using, for
example, a spherical model, when the actual process is described by, say, the exponential model, may lead to
inaccurate predictions.
Closer study of the shape of the Gaussian kernel near the origin suggests that this model is the most sensitive
to parameter misspecification while the exponential model is one of the most stable. The sensitivity of
predictions to changes in the semivariogram parameters can be investigated with Geostatistical Analyst’s
Semivariogram Sensitivity tool (see chapter 9).
If the physical process behind the data is known, the corresponding kernel should be used. If it is unknown,
the most flexible K‐ and J‐Bessel models should be used (usually after fitting a stable model to save time on
estimating the lag parameter) because the smoothness of K‐ and J‐Bessel models is estimated from the data,
not defined a priori as, for example, in a spherical model. The cross‐validation and validation diagnostics are
useful for selecting the most reliable model or combination of models.
A semivariogram modeling case study can be found in appendix 4 under the heading “Semivariogram
modeling using Geostatistical Analyst and the procedures mixed and nlin.”
The covariance of the process can be more complicated than any particular model. One way to improve
structural analysis is to use a combination of models. This is possible because any linear combination of
positive definite functions with nonnegative coefficients gives a positive definite function:
Cov(h) = a1⋅Cov1(h) + a2⋅Cov2(h) + a3⋅Cov3(h) + … + an⋅Covn(h), ai ≥ 0.
Figure 8.32 at left illustrates a nested model usage. The data were simulated using a sum of the Gaussian
semivariogram model with range equaling 0.2 and sill equaling 1 and a spherical semivariogram model with
range equaling 0.6 and sill equaling 1.
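A nested semivariogram is evaluated simply as the sum of its components. The Python sketch below combines a Gaussian and a spherical model with the parameters quoted above; the practical-range forms of the two models are assumptions, and the helper functions are illustrative rather than Geostatistical Analyst code.

import numpy as np

def gaussian_sv(h, sill, rng_):
    return sill * (1.0 - np.exp(-3.0 * (h / rng_)**2))

def spherical_sv(h, sill, rng_):
    h = np.minimum(h, rng_)                      # the model is constant beyond the range
    return sill * (1.5 * h / rng_ - 0.5 * (h / rng_)**3)

def nested_sv(h):
    # A sum of two positive definite components is positive definite.
    return gaussian_sv(h, sill=1.0, rng_=0.2) + spherical_sv(h, sill=1.0, rng_=0.6)

h = np.linspace(0.0, 1.0, 6)
print(np.round(nested_sv(h), 3))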
When the underlying process is complicated and unknown, the most flexible model should be used. The
default K‐Bessel model estimated by the Geostatistical Analyst software is shown in figure 8.32 at right.
Figure 8.31
Figure 8.32
INDICATOR SEMIVARIOGRAM MODELS
Indicator values can be defined for any variables Z(s) at each data location. If the data value at the location s is
below the threshold, the indicator is set to zero. Otherwise, it is set to 1.
Figure 8.33 at left shows the semivariogram cloud calculated using radiocesium soil contamination
measurements. A typical example of the indicator semivariogram cloud is presented in figure 8.33 at right
using a threshold value of 15 Ci/sq. km (the upper permissible level for cesium‐137 soil contamination). Since
all indicator semivariogram values are either zero or 0.5, the semivariogram cloud for indicator data is less
informative than the semivariogram cloud for the original observations.
Figure 8.33
The empirical (averaged) semivariogram values for the original values and the indicator values in figure 8.33
are presented in figure 8.34 at left and right, correspondingly. The expected value of the sill of any indicator
semivariogram is 0.25. The estimated sum of the partial sill and the nugget in figure 8.34 at right is close to
this value.
Figure 8.34
Although the estimated semivariogram models look similar, there is an important difference in the
interpretation of the semivariogram parameters. The proportion of nugget to sill for the estimated indicator
semivariogram model is 0.066/(0.066+0.174) = 0.275, and for the estimated semivariogram using original
data, it is 10.4/(10.4+122.5) = 0.08. Therefore, the variability of the indicator variable at distances compatible
with the shortest distance between data locations is much larger than for the standard semivariogram.
However, the interpretation of this variation (the nugget parameter) is different. In the case of original data, it
could be primarily because of measurement error. However, indicators are known precisely by definition, and
the only reason for such a large nugget must be data variation at distances less than the shortest distance
between observed data locations.
Unfortunately, there is no theory that describes a class of permissible indicator semivariograms and
covariances. The theory discussed in section “What functions can be used as semivariogram and covariance
models” cannot be used with indicator semivariogram models since it was developed for continuous data,
while indicator data are 0s and 1s only and there is nothing in between. It is known, however, how to
construct valid semivariograms if the original data from which indicators are produced are normally
distributed. If the covariance of the Gaussian random field is Cov(Θ,h), where Θ is a set of the covariance
model parameters and h is a lag distance, then the correct indicator semivariogram model is
γ(h) = ω⋅(1 − ω) − (1/2π)⋅∫0^Cov(Θ,h) exp(−β²/(1 + t)) / (1 − t²)^0.5 dt,
where β and ω are two parameters to be defined. The only possible computation of this semivariogram is
numerical.
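A possible numerical evaluation of this integral is sketched below in Python, assuming an exponential correlation for the underlying Gaussian field; β is the threshold in standard normal units, and ω is taken as the corresponding exceedance probability (the specific values are for illustration only).

import numpy as np
from scipy.integrate import quad

def indicator_semivariogram(h, beta, omega, cov):
    """gamma_I(h) = omega*(1-omega) - (1/2pi) * integral_0^Cov(h) exp(-beta^2/(1+t)) / sqrt(1-t^2) dt"""
    upper = cov(h)
    integrand = lambda t: np.exp(-beta**2 / (1.0 + t)) / np.sqrt(1.0 - t**2)
    val, _ = quad(integrand, 0.0, upper)
    return omega * (1.0 - omega) - val / (2.0 * np.pi)

exp_cov = lambda h: np.exp(-h / 0.5)          # correlation of the underlying Gaussian field
for h in (0.1, 0.5, 1.0, 3.0):
    print(h, round(indicator_semivariogram(h, beta=1.0, omega=0.16, cov=exp_cov), 4))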
For a fixed distance h1, for example, h1 = 0.5, and a series of distances h2 in the interval [0.1, 2.0], calculated
weights λ1 and λ2 are presented in figure 8.35 for three semivariogram models: Gaussian (red), stable with
power value 1.5 (blue), and exponential (green). Models are shown in the same colors as their weights, and
the space between models is filled with gray.
Figure 8.35
If Z1 and Z2 are indicators, the prediction of Z? will be between the minimum of (λ1, λ2) and the maximum of (λ1, λ2).
The prediction will be in the interval [0, 1] only if λ1 and λ2 are not negative. However, for distances shorter than
some specific distance, the stable semivariogram model with a power value greater than 1 can produce
predictions greater than 1 and less than zero. The specific distance depends on a power value. It is small for
power values close to 1 and large for values close to 2, that is, for a Gaussian semivariogram model.
Therefore, only a stable model with a power value equal to or less than 1 should be used with indicator
variables. The same is true for a K‐Bessel semivariogram; this model is permissible when used with indicator
data only if its shape parameter is equal to or less than ½, as expected, because the K‐Bessel model with shape parameter ½ is the exponential model.
This example shows that using a model that is not permitted may lead to absurd results. When modeling real
data, it is not always easy to see that the prediction map is incorrect, however. Therefore, although an
indicator semivariogram can be successfully used in exploratory data analysis, it should be used cautiously in
modeling and decision making.
Sometimes the geostatistical literature recommends using iterative visual semivariogram model fitting. The
sequence would therefore be to first choose and fix a reasonable range parameter, then the nugget, then find
an appropriate partial sill. Figure 8.36 shows that a visually reasonable model, left, can be very different from
the true one, right, which was used to simulate the data from which the empirical semivariogram values in
the graphs were calculated.
Figure 8.36
It is better to find the semivariogram model objectively and to use the same approach for every problem so
that models can be compared. In other words, it is better to use a statistical procedure to estimate the
parameters of the model.
The two main statistical approaches for estimating the parameters of the semivariogram model are weighted
least squares and maximum likelihood.
Statisticians have commonly used variants of maximum likelihood estimation to fit the semivariogram model.
In this case, the best‐fitting model does not necessarily come close to the visually expected line in blue, since
likelihood methods use all semivariogram pair values without averaging them. Because very large numbers of
points are used in estimation, likelihood methods cannot be used with large datasets (however, see reference
11 in Further reading, which describes a method for very large datasets).
Researchers from physical sciences, such as meteorology, geology, and environmental science, more often use
variants of a weighted least‐squares approach, because it provides visual confirmation that the
semivariogram model fits the empirical data well. Because of this and, more importantly, because weighted
least squares can be used with a large number of measurements, Geostatistical Analyst uses the weighted
least‐squares method of fitting the semivariogram model.
Note that when the number of data values is greater than 5,000, Geostatistical Analyst randomly selects 5,000
observations for the semivariogram model fitting. The result of modeling is usually insensitive to this random
sampling because all the data are
used in the predictions. However, if the dataset has a few very large values, they may or may not be in the
selected dataset and, consequently, the estimated semivariogram model can be different from the
semivariogram model estimated using the entire dataset. This will result in different predictions.
The assumption behind the ordinary (nonweighted) least‐squares method is that data are independent. This
assumption is always violated by empirical semivariogram values, which largely use the same data for
averaging at different distances.
In practice, the weighted least‐squares method of semivariogram fitting with weights associated with the
variance of the empirical semivariogram is used. It can be shown that variance of the empirical
semivariogram is proportional to the squared semivariogram model and inversely proportional to the
number of pairs used in the empirical semivariogram averaging. That variance can be calculated if the
semivariogram parameters are known. Because the true semivariogram model is unknown, a solution is to
use an iterative algorithm, assigning some estimated values to the semivariogram parameters, estimating the
variance–covariance matrix, making a new estimate for the semivariogram parameters, and so on until the
algorithm converges.
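A bare-bones version of such an iterative weighted least-squares fit, here for an exponential model with made-up empirical semivariogram values, pair counts, and weighting details (none of which reproduce the exact Geostatistical Analyst algorithm), might look like this in Python:

import numpy as np
from scipy.optimize import least_squares

def exp_sv(h, nugget, psill, rng_):
    return nugget + psill * (1.0 - np.exp(-3.0 * h / rng_))

def fit_wls(h, gamma_hat, n_pairs, n_iter=5):
    params = np.array([0.1 * gamma_hat.max(), gamma_hat.max(), h.max() / 2.0])
    for _ in range(n_iter):
        model = np.maximum(exp_sv(h, *params), 1e-12)
        # Weights: proportional to pair counts, inversely proportional to the squared model value.
        w = np.sqrt(n_pairs) / model
        res = least_squares(lambda p: w * (exp_sv(h, *p) - gamma_hat),
                            params, bounds=(0.0, np.inf))
        params = res.x
    return params

# Toy empirical semivariogram (values and pair counts are invented for illustration).
h = np.array([0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 1.5, 2.0])
gamma_hat = np.array([0.35, 0.55, 0.80, 0.93, 0.97, 1.02, 1.01, 0.99])
n_pairs = np.array([50, 120, 300, 420, 480, 500, 520, 510])
print(np.round(fit_wls(h, gamma_hat, n_pairs), 3))  # nugget, partial sill, range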
Likelihood (maximum likelihood and restricted maximum likelihood) and least‐squares methods can be
compared as follows.
The advantages of the likelihood methods are the following:
They operate on the original data rather than the empirical semivariogram. The user does not have to
choose lag values.
Maximum likelihood estimates are asymptotically normally distributed, and no other estimates have
smaller asymptotic variance.
Estimating the trend (large‐scale data variation) and spatial covariance structure is done
simultaneously. There is no need to remove the trend in one step and model the semivariogram of
the residuals in another step.
The concept of the estimation algorithm is more straightforward than that of the least‐squares estimator.
Among the disadvantages of the likelihood methods are the following:
They are less graphical. Comparing the estimated semivariogram (constructed from the likelihood
estimates of the covariance parameters) against the empirical semivariogram is not appropriate,
because the estimates were not obtained by fitting to the empirical semivariogram values. The least‐
squares estimates usually look better in comparison with the likelihood estimates when compared
against the empirical semivariogram.
They require a distributional assumption. Likelihood theory is well developed for several probability
distributions, but in geostatistics the normal distribution is used almost without exception.
Computationally, they are much more involved: likelihood methods for spatial models require very
sophisticated optimization algorithms.
New semivariogram model fitting algorithms are published every year, but the universal method that
outperforms all others remains to be discovered. In practice, any reasonable algorithm implemented
interactively and with sufficient model fitting diagnostics works better than any black‐box algorithm because
graphical representation is indispensable for good data analysis. This is what Geostatistical Analyst does—it
provides interactive semivariogram modeling with cross‐validation and validation diagnostics.
A detailed description of the Geostatistical Analyst weighted least‐squares algorithm and its comparison with
the restricted maximum likelihood method can be found in reference 2 at the end of this chapter. Several
semivariogram model fitting case studies using Geostatistical Analyst and SAS software packages are
presented in appendix 4. Assignment 3 in this appendix refers to the diagnostic for examining a model for
covariance or semivariogram which is based on the “error decorrelation” approach. The idea is to produce
residuals that should be uncorrelated after a transformation and examine these residuals for the remaining
autocorrelation. Note also that the weighted least‐squares method is usually used with a constant lag size. A
variable lag size, with values that increase with the distance between points, better describes the shape of
the semivariogram model at the most important small distances. This semivariogram fitting method will be
implemented in a future version of Geostatistical Analyst.
TREND AND ANISOTROPY
Trend leads to a systematic increase of semivariogram values in particular direction or directions, sometimes
unbounded. Anisotropy can be detected when parameters of the semivariogram model change differently in
divergent directions at small and moderate distances between pairs of points.
A typical example of trend in data is temperature measurements over a large territory. Measured between any
two points in the northern hemisphere separated by at least 100 kilometers in the north–south direction,
temperature rises the farther south we go.
An example of anisotropic data is movement of air pollution with prevailing wind from the ocean: wind speed
along the coast is different from inland wind speed, and the difference is approximately the same for similarly
distant locations from the ocean.
To see how the semivariogram and covariance react to the presence of anisotropic data variation, the
example below will use simulated data so the model that generated the data is known.
The left graphic in figure 8.37 shows the first component of the data: a surface created in a unit square using
300 simulated Gaussian data values randomly located in the data domain with a zero mean and a spherical
covariance model with the partial sill equaling 1, range equaling 0.2, and a zero nugget parameter. Figure
8.37, center, shows the second component of the data: a strong trend in the east direction. These data were
created using the formula
Ztrend(si) = const⋅xi,
where xi is the x‐coordinate of the data location si.
Figure 8.37 at right shows projections of data values on the ZY and ZX planes: there is no systematic change in
the data in the north–south direction, whereas there is a clear increase in data values in the west–east
direction.
Figure 8.38 shows contour lines of the ordinary kriging predictions in the top part of the data domain. This
was accomplished by using an estimated (meaning that we “forgot” the true semivariogram) nearly optimal
isotropic spherical semivariogram model (that is, a model with better cross‐validation statistics than
spherical models with other parameters), bottom left, and an estimated isotropic spherical semivariogram on
residuals after removing large‐scale data variation using first‐order global polynomial interpolation, bottom
right. Colors from yellow to brown show the predictions using the semivariogram on residuals; blue colors
show the prediction using kriging without data detrending.
Figure 8.38
The semivariogram modeling on residuals found the correct parameters almost exactly, while the
semivariogram estimated using input data found a relatively large nugget, a range two times larger, and a 60‐
percent larger partial sill. The accuracy of the interpolation is also different according to the cross‐validation
statistics comparison at the bottom of the contour map. Predictions based on the kriging model with the
detrending option are clearly better, and we conclude that data detrending is an important part of the
geostatistical analysis here.
It should be noted, however, that when trend is negligible but the detrending option is used, the predictions
can be less accurate than using kriging without the detrending option.
Even though trend is, in theory, a systematic change in data and should be removed from the data before
statistical analysis (see discussion in chapter 3), kriging can produce reasonable predictions in its place. This
is because kriging is a local predictor, meaning that measurements close to the prediction location influence
prediction more than measurements farther apart. Because at relatively small distances systematic change in
the data is usually insignificant, trend and autocorrelation are difficult to distinguish.
To illustrate, two time series can be generated in a spreadsheet. First, trend is defined as
Z(i) = const⋅i + ε(i),
where i = 1, 2, 3, …, N and ε(i) is a random value from a normal distribution with a mean of zero and a
variance of 1. Second, autocorrelation is defined as
Z(i) = 0.9⋅Z(i − 1) + ε(i),
so that the next value is equal to 90 percent of the previous one plus Gaussian uncorrelated noise. It is not
always easy to recognize which line is trend and which is autocorrelation in figure 8.39 (these graphics were
created in Excel using the two formulas above).
Figure 8.39
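The same two series can be generated outside a spreadsheet; in the Python sketch below, the trend coefficient and the series length are arbitrary choices that only mimic the description above.

import numpy as np

rng = np.random.default_rng(1)
n = 200
eps = rng.standard_normal(n)          # N(0, 1) noise

# Deterministic trend plus independent noise.
trend = 0.02 * np.arange(1, n + 1) + eps

# Autocorrelated series: each value is 90 percent of the previous one plus noise.
auto = np.zeros(n)
for i in range(1, n):
    auto[i] = 0.9 * auto[i - 1] + eps[i]

print(trend[:5].round(2), auto[:5].round(2))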
Misunderstanding of basic statistical features in data analysis may lead to wrong decision making. An
interesting historical example of wrong interpretation of trends in data is Marx’s study of interest rates in the
nineteenth century. Marx’s economic law of motion of modern society is a collection of conceptually linked
relationships, including the law of decreasing interest rates, the law of the falling tendency of the rate of
profit, and the general law of capitalist accumulation. Marx’s explanations of economic laws were almost
always qualitative because he did not use statistics when analyzing time series of the observed data.
After an economic upturn in the late eighteenth century, interest rates tended to fall in England from 1800 to
1844, raising the question of whether profit rates also declined. According to Marx, “The fall in the interest on
money is a necessary consequence and result of industrial development.” Based on this observation, he
discovered, with no empirical evidence, the law of the falling tendency of the rate of profit, which Marx had
identified as “in every respect the most important law of modern political economy.” The general law of
capitalist accumulation depends on rates of unemployment and underemployment and other variables, which
were and are difficult to measure. Nevertheless, Marx produced a model of capitalism without regression
analysis. However, just after Marx’s discovery, the behavior of interest rates changed. From the mid‐1840s
through the 1870s, English and German interest rates exhibited a moderate rising trend. Just like in the
example in figure 8.39, it was not easy to make the distinction between trend and autocorrelation in the
interest rates in the nineteenth century without statistical data analysis. Marx’s economic theory was used for
theoretical justification of the necessity of socialist revolution.
It is important to note that although trend may not seriously influence kriging predictions locally, kriging uses the
same semivariogram everywhere, and finding optimal semivariogram parameters in the presence of trend is
difficult, as in the example in figure 8.38.
Another way to take different data variation in different directions into account is to estimate and use
different semivariogram model parameters in different directions. Figure 8.40 at left shows such
semivariograms using data from the previous example (figure 8.38). Blue lines display models, and the ellipse
over the semivariogram surface shows the range of data correlation in different directions. Cross‐validation
statistics (figure 8.40, see right side) and contour lines are closer to the predictions using kriging on residuals
than kriging without detrending. Therefore, when data variation differs in different directions, both detrending and
anisotropy options can improve semivariogram estimation and, consequently, kriging predictions.
Figure 8.40
If the range of the data correlation varies with direction, it is called geometric anisotropy. It is possible that
other semivariogram parameters change with direction. A variant of anisotropy when the partial sill changes
with direction is called zonal anisotropy. The partial sill describes data variance, and all conventional kriging
models assume that variance is constant. Therefore, zonal anisotropy indicates data nonstationarity, and this
nonstationarity should be removed using detrending and transformation tools before applying kriging. If it is
not possible, a non‐stationary model should be used instead of the zonal anisotropy option.
The nugget parameter represents measurement error in the data locations (that is, data variation at zero
distance) and data variability at distances smaller than the shortest distance between measurement locations.
The nugget itself cannot vary in different directions, because it is defined for just one, zero distance between
measurements. However, measurement error (a part of the nugget parameter) can be different at each data
location so that it is possible to think about “anisotropy” of the measurement error. Varying measurement
error can be modeled using Monte Carlo simulations (The version of Geostatistical Analyst after 9.3 includes a
varying measurement error option for the Gaussian Geostatistical Simulations geoprocessing tool).
In Geostatistical Analyst 9.3, only the option for geometric anisotropy is implemented. If the data variation is
different in different directions, using a geometric anisotropy model and choosing a searching neighborhood
radius smaller than the minor range (0.068 instead of 0.139 in figure 8.41 at left, shown with the green
vertical line) will lead to the use of models with different sill parameters in different directions (pink
horizontal lines). This can be compared to using a searching neighborhood radius (green vertical line) smaller
than the range distance (light blue vertical line) in figure 8.42 (right side), where there is true sill anisotropy.
Figure 8.41
Figure 8.42
KRIGING NEIGHBORHOOD
Near the prediction location, spatial data variation is typically smaller than average data variation and it
makes sense to predict a value at the unsampled location using nearby measurements only. In Geostatistical
Analyst, kriging uses a neighborhood constraint, forcing most of the kriging weights to zero. In figure 8.43,
weights are set to zero if they do not correspond to observations that are the nearest three within each of
four sectors around the estimation location. The legends in the top center parts of the dialog boxes show that
measurements in green have weights smaller than 3 percent of the sum of weights and that measurements in
red have weights larger than 10 percent of the sum of weights. A searching neighborhood is defined by
specifying both a minimum and maximum number of observations to be included in the specified number of
sectors. Weights for the measurements in the searching neighborhood are displayed in the right side of the
dialog boxes in figure 8.43.
Figure 8.43
In the case of real data with measurement errors, the larger the number of points involved, the more
the prediction is corrupted by noise in the data.
The closer the data are to the model assumptions, the larger the number of points that can be safely used.
However, when data features do not follow the model assumptions, even a moderate number
of nearby observations may produce poor predictions.
Kriging with a large number of nearby observations (depending on the data and the model, from
several dozen to several hundred) may lead to the computational problem of solving a large system
of linear equations if the nugget parameter is zero or very small.
The kriging weights associated with distant observations tend to zero anyway because of the small data
correlation.
There is no theory as to the best neighborhood size; a general recommendation is to use a neighborhood with
at least three but fewer than 25 observations. In particular, 50‐year‐old case studies in meteorology
suggested using approximately eight observations to minimize the prediction error. This recommendation
contradicts geostatistical theory, according to which global kriging has the smallest prediction standard
errors. This theory (for example, asymptotic behavior of semivariogram and covariance models when a
number of observations is infinitely large) is based on simulated Gaussian data that are absolutely precise.
However, such data do not occur in practice.
Instability of a kriging system of linear equations can be investigated by adding small perturbations to the
covariance model and to the data coordinates. It can be shown that the instability increases when increasing
the number of observations N involved in prediction and decreasing the proportion of the nugget to the partial sill.
Because of this instability, kriging with a greater number of neighbors may produce larger mean‐squared
prediction error than with a smaller number of neighbors (additional kriging prediction instability arises
when the data locations are very close to each other).
Figure 8.44 shows how to verify the suggestion to use a small number of neighbors in Geostatistical Analyst:
change the Neighbors to Include value, and see how the prediction standard error changes in a particular
location, in this case in the center of a neighborhood circle. The number of neighbors is approximately optimal if the
prediction standard error remains nearly the same when the number of data points in the kriging
searching neighborhood is increased.
Figure 8.44
nugget model with the following proportion of the parameter values: . The resulting
surface can be used as an estimated mean value surface. With a decrease in the number of measurements
involved in the prediction and an increase of the proportion of the partial sill to the nugget, the prediction surface becomes more
variable. Note that simple kriging produces a flat surface in this case because the mean value is defined for all
locations.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 8.45
CONTINUOUS KRIGING PREDICTIONS AND PREDICTION STANDARD ERROR
GIS users usually expect to see a continuous surface made from continuous data, such as temperature
observations. Otherwise, a reason for jumps in the predictions should be justified. However, kriging
predictions and prediction standard errors in nearby locations can be substantially different if their local
neighborhoods are different. Also, the prediction error at an individual location is actually a spatially averaged
prediction error (see “Uncertainty of prediction errors,” chapter 2), and it should be continuous. Another
reason for continuous mapping is esthetic: regardless of statisticians’ wishes, visual appeal is sometimes
more important than map accuracy.
Figure 8.46
Consider kriging prediction at two nearby points, represented as centers of two searching neighborhood
circles (red and blue in figure 8.46). The only difference between the local neighborhoods is whether the location
with the value –3.60 was included, which leads to discontinuity of the prediction and prediction standard error
surfaces between two points in the center. Figure 8.47 shows the resulting predictions (at left) and prediction
standard errors (at right) in three dimensions in the rectangle around data locations.
Figure 8.47
Conventional kriging cannot produce a continuous surface with a local searching neighborhood, and breaks in
the prediction surface are clearly seen if there are significantly different data values in nearby local
neighborhoods. Other interpolators that use a limited number of nearby measurements, including
deterministic models discussed in chapter 7, have this drawback as well.
Before presenting a solution to this problem, it is instructive to discuss a rather unexpected feature of kriging:
prediction depends on the data outside the range of correlation.
This feature is illustrated using three points on a line (figure 8.48 at left), where the distances from point B to
points A and C are smaller than the range of data correlation, while the distance between A and C is larger
than the range, so that the correlation between A and C is zero. Assume simple kriging of data with unit
variance and with correlation a between the points in each of the pairs A‐B and B‐C.
Figure 8.48
Prediction and prediction variance at point A using point B only are
Ẑ(A) = a⋅Z(B) and E(Z(A) − Ẑ(A))2 = 1 − a2.
Prediction and prediction variance at point A using point C only are
Ẑ(A) = 0 and E(Z(A) − Ẑ(A))2 = 1.
Prediction and prediction variance at point A using both points B and C are
Ẑ(A) = (a⋅Z(B) − a2⋅Z(C))/(1 − a2) and E(Z(A) − Ẑ(A))2 = (1 − 2a2)/(1 − a2).
Therefore, data outside the range of correlation (in our case, point C) influence prediction if at least one
measurement at a distance less than the range of correlation exists.
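These expressions follow from the simple kriging equations and can be checked numerically; the Python sketch below assumes unit variance, correlation a between the pairs A-B and B-C, and zero correlation between A and C.

import numpy as np

a = 0.4   # correlation between the neighboring pairs A-B and B-C

# Simple kriging weights for predicting at A from B and C.
C = np.array([[1.0, a],     # Cov(B,B), Cov(B,C)
              [a,   1.0]])  # Cov(C,B), Cov(C,C)
c0 = np.array([a, 0.0])     # Cov(A,B), Cov(A,C)
lam = np.linalg.solve(C, c0)

variance = 1.0 - lam @ c0
print("weights:", lam.round(4))                       # [a/(1-a^2), -a^2/(1-a^2)]
print("kriging variance:", round(variance, 4))
print("formula (1-2a^2)/(1-a^2):", round((1 - 2*a**2) / (1 - a**2), 4))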
Continuous surfaces can be produced using global neighborhood with all input data. However, as we
discussed above, it is usually not a good idea to use all measurements.
A smooth surface can be produced by predicting values on a grid and then using smoothing filters. This is
equivalent to adding uncontrollable noise to the prediction. The resulting prediction and prediction standard
error surfaces do not match since they are smoothed out independently, which is a problem if they are both
needed for further analysis and decision making.
Geostatistical Analyst 9.2 and higher has an option to produce statistically consistent continuous (without
breaks) prediction and prediction standard error surfaces. Because the data outside the range of correlation
do influence prediction, the idea is to modify the kriging system so that data outside a specified distance from
the prediction location have zero weights. The implementation is shown in figure 8.49. The standard
searching neighborhood with range r weights corresponds to the central circle in the searching neighborhood
dialog, with points inside used for prediction at the center of the circle. This searching neighborhood can be
presented as a cylindrical kernel, whose cross section is shown at left in blue. The kernel at right in pink is used to
modify the data to produce a smooth surface. It is equal to 1 up to a specified distance ri that corresponds to the
inner circle, so that the data inside this circle are unchanged. The kernel value then falls smoothly and becomes 0
at distance ro (shown as the outer circle).
Figure 8.49
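The exact shape of the taper used by Geostatistical Analyst is not given here, but any kernel that equals 1 inside ri and decreases smoothly to 0 at ro behaves as described; a cosine taper is one plausible choice, sketched below in Python.

import numpy as np

def smoothing_kernel(d, r_inner, r_outer):
    """1 inside r_inner, smooth cosine decay to 0 at r_outer, 0 beyond."""
    d = np.asarray(d, dtype=float)
    w = np.zeros_like(d)
    w[d <= r_inner] = 1.0
    mid = (d > r_inner) & (d < r_outer)
    w[mid] = 0.5 * (1.0 + np.cos(np.pi * (d[mid] - r_inner) / (r_outer - r_inner)))
    return w

print(smoothing_kernel([0.0, 5.0, 7.5, 10.0, 12.5, 15.0], r_inner=7.5, r_outer=12.5))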
Variance of the modified data remains the same as variance of the original data, while covariance between
datum and prediction location decreases and data outside distance ro do not correlate with data inside a circle
with radius ro.
Using this smoothing option, simple kriging with zero mean, the covariance model used for simulating data in
figure 8.46, and a radius of the searching neighborhood equaling 10 units (that is a radius smaller than the
range of the data correlation), the following smooth predictions (figure 8.50 at left) and prediction standard
errors (figure 8.50 at right) are produced.
Figure 8.50
Figure 8.51
It is important that the continuous simple kriging variance lies between the standard simple kriging variances with
searching neighborhood radii of 7.5 and 12.5 units. The standard simple kriging variance using a searching
neighborhood radius equal to 10 units (green) is close to the continuous simple kriging variance (blue).
Figure 8.52 shows the difference between smooth and nonsmooth ordinary kriging prediction surfaces using
Geostatistical Analyst tutorial data, annual maximum eight‐hour nitrogen dioxide concentration in California
in 1999. A hillshading option in figure 8.52 at left highlights discontinuities in the NO2 predictions in the areas
where the searching neighborhood is changing. The prediction surface becomes smooth when the smoothing
option is used, as shown at right.
Note that ordinary kriging with the smoothing option cannot predict values when the searching neighborhood does not
contain observations, as in the northern part of California. In contrast, conventional ordinary kriging uses a
specified number of the nearest neighbors, and predictions are possible in the areas beyond the extent of the
data.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 8.52
Any interpolator with a local searching neighborhood produces nonsmooth predictions. Figure 8.53 at left
shows that the radial basis function, which is a smooth predictor by its very nature, creates a nonsmooth map of
NO2 concentration when a searching neighborhood diameter of 175 kilometers is used. After applying the
Geostatistical Analyst smoothing option, the interpolated surface becomes smooth, as seen at right.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 8.53
DATA TRANSFORMATIONS
Transformations are used in geostatistics to bring data close to normal distribution and to satisfy stationarity
assumptions: constant data mean and constant data variance.
The logarithmic transformation is used for a skewed data distribution with a small number of large values.
Lognormal process, formulas for lognormal simple and ordinary kriging predictions and prediction standard
errors, and an example of cross‐validation diagnostics are presented in “Lognormal process,” chapter 4.
Additional information on lognormal kriging can be found in reference 10 in “Further reading.”
There are applications in which the transformation required to achieve normality is not the logarithm but
some other function. In this case, exact formulas for kriging predictions and prediction standard errors do not
exist, and approximate formulas based on expansion of the transformation function in a second‐order Taylor
series can be used. This method, called the delta method, is used in Geostatistical Analyst for kriging with
power (Box‐Cox) and arcsine data transformations.
If data values in part of the study area are smaller than in other parts, the data variability in that region is
usually smaller than in other regions. In this case, the square‐root transformation helps to make the variances
more similar throughout the study area, and this transformation simultaneously often makes the data closer
to normal distribution. The square‐root transformation is a particular case of the Box‐Cox transformation
with the parameter equal to ½.
When data values are proportions, the variance is smallest near 0 and 1 (small and large data values) and
largest near 0.5 (intermediate data values). In this case, the arcsine (the inverse sine expressed in radians)
transformation makes data variance less variable throughout the study area, and it also makes the data nearly
normally distributed.
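Both transformations are one-line functions. The Python sketch below shows the Box-Cox family (the square root corresponds to the parameter value 1/2) and the arcsine transformation for proportions; the sample values are arbitrary.

import numpy as np

def box_cox(z, lam):
    """Box-Cox power transformation for positive data; lam = 0 gives the logarithm."""
    z = np.asarray(z, dtype=float)
    return np.log(z) if lam == 0 else (z**lam - 1.0) / lam

def arcsine(p):
    """Arcsine (angular) transformation for proportions in [0, 1]."""
    return np.arcsin(np.sqrt(np.asarray(p, dtype=float)))

z = np.array([0.5, 1.0, 4.0, 9.0])
print(box_cox(z, 0.5))             # square-root case of the Box-Cox family
print(arcsine([0.05, 0.5, 0.95]))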
Successful use of normal score transformation makes data multivariate normal. A very important
consequence of successful normal score transformation is that kriging prediction standard error becomes a
function of the observed data in addition to a function of the data configuration, so that the prediction
standard error becomes larger in the areas where the observed values are larger (see examples in chapter 6).
Suppose four ordered values are Z(s1), Z(s2), Z(s3), Z(s4), where Z(s1) is the lowest value and Z(s4) is the
largest. The empirical cumulative distribution function is shown in figure 8.54 as a step function. Note that
the weights on the y‐axis need not be increments of 1/4 if a data declustering algorithm is used (see next
section, which explains how to estimate the univariate cumulative data distribution in spatial domain).
Figure 8.54
Geostatistical Analyst provides three methods for normal score transformation: direct, linear, and a mixture
of Gaussian distributions.
Figure 8.55
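A minimal Python sketch of the direct method, with equal weights and no declustering (the plotting-position formula is one common convention, not necessarily the one used in Geostatistical Analyst), is the following:

import numpy as np
from scipy.stats import norm, rankdata

def normal_score(z):
    """Direct normal score transform with equal weights 1/n (no declustering)."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    ranks = rankdata(z)                   # average ranks for ties
    p = (ranks - 0.5) / n                 # plotting positions in (0, 1)
    return norm.ppf(p)

z = np.array([3.1, 0.4, 7.9, 2.2, 5.5])
print(np.round(normal_score(z), 3))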
Because of the incomplete data sampling, the exact shape of the true cumulative distribution is not known. In
particular, there may be a nonzero probability for values less than the minimum value and greater than the
maximum value in the dataset.
Kriging and other interpolators tend to underestimate large values and overestimate small ones, and it is an
exception when prediction is greater than the maximum or less than the minimum value in the dataset. The
situation is worse when a large or small quantile value needs to be estimated, for example, estimating the 0.75
quantile at a location near the maximum value of the ozone data from the Geostatistical Analyst tutorial (see
the location of the question mark in figure 8.56). Simple kriging with direct normal score transformation
gives prediction 0.155 and prediction standard error 0.015. Since 0.75 quantile of normal distribution with
mean 0.155 and standard deviation 0.015 is equal to 0.175 and this value is greater than dataset maximum
value 0.174, which has zero probability to be exceeded in the direct normal score transformation algorithm,
Geostatistical Analyst generates an error message.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 8.56
Figure 8.57
A mixture of Gaussian distributions with different means and standard deviations can be used to smooth the
probability density function (and, therefore, the cumulative distribution function) with sufficient accuracy.
Figure 8.58 shows how three‐modal data distribution is approximated by a mixture of three Gaussian
distributions, with parameters Mu (mean µ), Sigma (standard deviation σ), and P (weight of the Gaussian
kernel) shown in the right part of the Geostatistical Analyst dialog box Normal Score Transformation. The
data are approximated by the formula
f(z) = p1⋅N(z; µ1, σ1) + p2⋅N(z; µ2, σ2) + p3⋅N(z; µ3, σ3),
with the following constraint on the weights pi: p1 + p2 + p3 = 1, pi ≥ 0.
A mixture of Gaussian distributions is defined on the entire real line, so both very small and very large values
are allowed, although with small probability.
Figure 8.58
Using the known function of the back transformation z = f(y) for a particular dataset, as shown in figure 8.59 at
right, the kriging variance can be calculated using the following formula:
σ̂z2(x) = ∫−∞^+∞ [f(ŷ(x) + σ̂y(x)⋅u) − f(ŷ(x))]2 ⋅ (1/√(2π))⋅exp(−u2/2) du,
where the subscript signs z and y refer to the original and transformed data respectively, so that σ̂y2(x) is the kriging
variance of the transformed data. Unfortunately, this estimation is biased because of the inequality
E(f(Y)) ≠ f(E(Y)). Therefore, another back transformation method should be used.
Figure 8.59
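Both the integral and the bias can be checked numerically; the Python sketch below uses the exponential back transformation f(y) = exp(y) as an assumed example, with arbitrary values for the prediction and its standard error.

import numpy as np
from scipy.integrate import quad

f = np.exp                    # assumed back transformation z = f(y)
y_hat, sigma_y = 1.0, 0.5     # kriging prediction and standard error of the transformed data

phi = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # standard normal density

# Kriging variance of the back-transformed prediction (the formula above).
var_z, _ = quad(lambda u: (f(y_hat + sigma_y * u) - f(y_hat))**2 * phi(u), -8, 8)

# Bias: E f(Y) differs from f(E Y).
mean_fy, _ = quad(lambda u: f(y_hat + sigma_y * u) * phi(u), -8, 8)
print(round(var_z, 4), round(mean_fy, 4), round(f(y_hat), 4))  # E f(Y) > f(E Y) here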
For transformation of the cumulative distribution to and from the standard univariate Gaussian distribution
using the linear and Gaussian mixture methods, Geostatistical Analyst uses piecewise linear approximation by
segments of such length that the deviation between the approximation of the cumulative distribution function
and each line segment is less than the specified value ε=5⋅10−5. The piecewise linear approximation is defined
for n segments, each segment being approximated by a linear function.
The variance var(Z) = E(Z − E(Z))2 = E(Z2) − (E(Z))2 of the function Z = f(Y), where Y is a Gaussian
random variable with expectation ŷ(x) and variance σ̂y2(x), can then be calculated analytically for the piecewise
linear approximation, segment by segment.
A Gaussian mixture is generally a multimodal distribution. If a unimodal distribution is required, the data can be
approximated, for example, by the generalized lambda distribution, which has four parameters (it is not
implemented in Geostatistical Analyst 9.3). The generalized lambda distribution can take on a very wide
range of shapes within one distributional form. Four examples of this distribution with different parameters
are shown in figure 8.60 at left. A fit to the temperature data used for illustration of the mixture of Gaussian
kernels in chapter 4 (figure 4.14) is shown in figure 8.60 at right.
Figure 8.60
The fundamental difference between methods of data transformation discussed in this section is that the
normal score transformation of the data values changes with each particular dataset because the result of the
normal score transformation depends on the position of a value in the ordered dataset, whereas, for example,
the logarithmic transformation of a value is always its natural logarithm.
The normal score transformation must occur after data detrending, since univariate normal distribution does
not have the trend component. This is in contrast to logarithmic, Box‐Cox, and arcsine transformations in
which transformations are used before removing trend from the data.
Successful data transformation is not always possible. All functional transformations in Geostatistical Analyst
require positive data values. If data have negative values, a constant larger than the absolute value of the
smallest observation in the dataset can be added before transformation to all data values; this constant
should be subtracted from the kriging predictions. Normal score transformation does not work properly
when there are many repeated values in the dataset. An example is a daily precipitation dataset with
numerous entries of zero precipitation. In this case, many of the transformed data with different values
should be associated with a particular zero in the original dataset, and there is no unique way to do this.
However, transformed data with many equal values cannot be bivariate normal, and the normal score
transformation verification (see “Checking for bivariate normality” in chapter 9) will indicate this. Daily
rainfall data is an example of “zero‐inflated” process which requires a special model (see discussion in
“Modeling data with extra zeros” of chapter 4).
Using equal weights ωi = 1/N, where N is the number of observations, for regularly sampled data is sufficient for
calculating the data cumulative distribution. If the data are oversampled in some subregions of the area being
studied, data in densely sampled areas should receive less weight than data in sparsely sampled regions
because our goal is estimation of the data distribution that represents the entire data domain.
Specification of locally varying weights ωi is called data declustering in geostatistical literature. One method
of data declustering is to divide the data domain into a grid of cells and weight each sample according to the
number of samples falling inside the cell. For example, if there are M cells with data, and the number of
samples in each cell is given by n1, n2, …, nM, then the weight of each sample in cell i is ωi = 1/(M⋅ni).
Figure 8.61 shows an example of calculating declustering weights for M=3. If each of the eight data samples
has a value equal to its number, that is, z1=1, z2=2, …, z8=8, the declustering mean is equal to 5.5, while the
mean value with equal weights (ωi=1/8) is equal to 4.5.
Figure 8.61
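Cell declustering weights can be computed in a few lines of Python. In the sketch below the coordinates are invented; they are chosen so that three cells contain 4, 3, and 1 samples, which reproduces the declustered mean of 5.5 and the equal-weight mean of 4.5 quoted above.

import numpy as np

def cell_declustering_weights(x, y, cell_size):
    ix = np.floor(np.asarray(x) / cell_size).astype(int)
    iy = np.floor(np.asarray(y) / cell_size).astype(int)
    cells = list(zip(ix, iy))
    counts = {c: cells.count(c) for c in set(cells)}   # samples per occupied cell
    m = len(counts)                                     # number of cells with data
    return np.array([1.0 / (m * counts[c]) for c in cells])

x = [0.1, 0.2, 0.3, 0.4, 1.2, 1.3, 1.4, 2.5]
y = [0.1, 0.3, 0.2, 0.4, 0.2, 0.3, 0.1, 0.2]
z = np.arange(1, 9)                                     # values 1..8 as in figure 8.61
w = cell_declustering_weights(x, y, cell_size=1.0)
print("declustered mean:", np.sum(w * z), " equal-weight mean:", z.mean())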
Figure 8.62 shows a Geostatistical Analyst dialog box for cell declustering of the data. Input parameters are
the coordinates of the data and the data values. The output parameter is the declustering weight for each data
sample. The user can select Cell Size, Angle of grid rotation, and Shift (number of origin offsets). The last
option allows incremental movement of the grid origin using up to 16 shifts in such a way that the grid is
shifted less than one‐half of the cell size. Then weight ωi is the average of the weights calculated for each shift.
The shift option is provided because, for relatively small and clustered datasets, declustering weights can
change significantly if the origin is shifted by one‐half or one‐quarter of the cell size.
A strategy for selecting the optimal grid and its orientation is to find the minimum (if high values are
preferentially sampled) or the maximum (in the case of clustered low values) of the declustering mean,
which is shown as the y‐value in the graph in the right part of the dialog box. The abscissa (x‐
value) could be the cell area, the anisotropy of the cell size (proportion of the cell size lengths), or the angle of
rotation.
Figure 8.62
From California Ambient Air Quality Data CD,
1980–2000, December 2000.
Courtesy of The California Air Resources Board,
Planning and Technical Support Division, Air
Quality Data Branch.
An optional method for data declustering in Geostatistical Analyst consists of defining weights as a function of
the area of a Voronoi polygon that surrounds each point. Voronoi polygon construction is discussed in
chapter 13. Figure 8.63 shows two examples of data declustering using Voronoi polygons clipped by the
specified boundary polygon, using a meteorological monitoring network in western Europe (at left) and
pollution in California (at right). The polygonal declustering algorithm assigns larger weights to the data in
large red areas and smaller weights to the data in small areas in blue. Clipping is important for non‐grid data
to avoid overweighting of the data near the data boundary.
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 8.63
The motivation is the same as in the case of estimating the data distribution in the data domain: arithmetic
average of the empirical semivariogram values assumes that each available datum is equally representative of
the local areas, although this is true for regular data sampling only. Therefore, correction for the data
irregularity can be helpful.
There are other approaches for defining data weights in addition to cell and polygonal declustering (for
instance, there is a proposal to weigh each datum inversely proportional to the variance of its contribution to
the estimated semivariogram at lag h, see reference 4 in “Further reading”) and all of them assign smaller
weights to the data in the densely sampled areas (clusters). Therefore, a rough approximation of the data weights can also be obtained from the estimated density of the observed data locations, as sketched below. Point density estimation is discussed in chapter 13.
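A rough sketch of that idea, assuming a Gaussian kernel density estimate of the sampling locations (the bandwidth choice and the normalization are illustrative assumptions, not the method used by Geostatistical Analyst):

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_declustering_weights(x, y):
    """Approximate declustering weights as inversely proportional to the
    estimated density of the sampling locations, normalized to sum to 1."""
    density = gaussian_kde(np.vstack([x, y]))(np.vstack([x, y]))
    w = 1.0 / density
    return w / w.sum()

# Hypothetical locations: a dense cluster plus a few scattered points
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 0.05, 30), rng.uniform(-1, 1, 5)])
y = np.concatenate([rng.normal(0, 0.05, 30), rng.uniform(-1, 1, 5)])

w = density_declustering_weights(x, y)
print("average weight inside the cluster:", w[:30].mean())
print("average weight of the scattered points:", w[30:].mean())
```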
ASSIGNMENTS
1) SIMULATE SURFACES USING VARIOUS SEMIVARIOGRAM MODELS.
If you have Geostatistical Analyst 9.3, use the Gaussian Geostatistical Simulation tool to unconditionally simulate surfaces using various semivariogram models. In particular, try to reproduce the surfaces in figure 8.25. Discuss
the difference between simulated surfaces.
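If Geostatistical Analyst is not available, an unconditional realization can be sketched outside ArcGIS, for example by Cholesky factorization of a covariance matrix derived from a spherical semivariogram. The grid size, parameter values, and function names below are illustrative assumptions, not the Gaussian Geostatistical Simulation tool itself.

```python
import numpy as np

def spherical_cov(h, nugget=0.0, psill=1.0, range_=0.2):
    """Covariance corresponding to a spherical semivariogram: C(h) = sill - gamma(h)."""
    c = np.where(h < range_,
                 psill * (1 - 1.5 * h / range_ + 0.5 * (h / range_) ** 3),
                 0.0)
    return np.where(h == 0, nugget + psill, c)

# Small grid over the unit square
n = 30
xx, yy = np.meshgrid(np.linspace(0, 1, n), np.linspace(0, 1, n))
pts = np.column_stack([xx.ravel(), yy.ravel()])
h = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)

C = spherical_cov(h, nugget=0.0, psill=1.0, range_=0.2)
L = np.linalg.cholesky(C + 1e-8 * np.eye(len(C)))    # small jitter for numerical stability
z = L @ np.random.default_rng(1).standard_normal(len(C))
field = z.reshape(n, n)                              # one unconditional realization
print(field.shape, field.std())
```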
2) FIND THE BEST SEMIVARIOGRAM MODELS FOR SIMULATED DATA.
Accurate semivariogram modeling is not an easy task, and some experience is required to find models and
parameters close to the optimal ones. One way to improve semivariogram modeling skills is to use simulated
data with a known model, then “forget” about this model and try to reconstruct it.
The folder assignment 8.2 includes simulated Gaussian data using different semivariogram models. Names of
the shapefiles are random. Using Geostatistical Analyst, try to find the best models. Then compare your
findings with correct answers in the file solution.txt.
3) REPRODUCE THE MAPS IN FIGURE 8.64.
Use data from the folder assignment 8.3 and try to reproduce the maps in figure 8.64, using a searching neighborhood size that is detectable from the map in the left part of the figure.
Figure 8.64
4) TRY A GENERAL TRANSFORMATION OF NONSTATIONARY DATA.
Stationarity can be achieved by numerically transforming and smoothing the mean and variance using a
moving average technique. For example, a smooth mean and variance can be estimated using the following
formulas:
$$\mathrm{mean}(s) = \frac{\sum_{i=1}^{n} e^{-\|s_i - s\|/\Theta}\, Z(s_i)}{\sum_{i=1}^{n} e^{-\|s_i - s\|/\Theta}}, \qquad \mathrm{stdev}(s) = \sqrt{\frac{\sum_{i=1}^{n} e^{-\|s_i - s\|/\Theta}\, \big(Z(s_i) - \mathrm{mean}(s)\big)^2}{\sum_{i=1}^{n} e^{-\|s_i - s\|/\Theta}}},$$

where si are the data locations, Z(si) are the data values, and Θ is a smoothing parameter that can be estimated empirically using cross-validation.
The following formulas are then used for the direct and back transformation:

$$Y(s_i) = \frac{Z(s_i) - \mathrm{mean}(s_i)}{\mathrm{stdev}(s_i)}, \qquad \hat{Z}(s) = \mathrm{mean}(s) + \mathrm{stdev}(s)\cdot\hat{Y}(s).$$

Here is a simplified algorithm based on a local polynomial interpolation model (a sketch of the standardization step is given after this assignment):
1. Estimate the local mean value mean(s) using a medium-size searching neighborhood, where s are the coordinates of the unsampled locations, usually the cell centers of the overlapping grid.
2. Predict local mean values mean(si) at the data locations si.
3. Interpolate the stdev(si) values to a stdev(s) surface using local polynomial interpolation.
4. Transform the data Z(si) to the new variable Y(si) using the estimated mean and standard deviation values: Y(si) = (Z(si) − mean(si)) / stdev(si).
5. Back transform the predictions using the formula Ẑ(s) = mean(s) + stdev(s)·Ŷ(s), where Ẑ(s) is a prediction of the original data at the location s.
6. Calculate the prediction standard error map as stdev(s) multiplied by the kriging standard error of the transformed variable Y.
Use the algorithm above to produce prediction and prediction‐error maps. Compare predictions and
prediction standard errors with those produced by one of the kriging models available in Geostatistical
Analyst. Use data from assignments to chapter 7.
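A minimal sketch of the standardization step of the algorithm above, assuming the kernel weights exp(-||si − s||/Θ) and a hypothetical dataset; kriging of Y(s) and the back transformation Ẑ(s) = mean(s) + stdev(s)·Ŷ(s) would follow.

```python
import numpy as np

def local_mean_std(s, si, z, theta):
    """Kernel-weighted local mean and standard deviation at location s,
    using the weights exp(-||si - s|| / theta) from the formulas above."""
    w = np.exp(-np.linalg.norm(si - s, axis=1) / theta)
    m = np.sum(w * z) / np.sum(w)
    sd = np.sqrt(np.sum(w * (z - m) ** 2) / np.sum(w))
    return m, sd

# Hypothetical data: 200 points with a strong trend in both the mean and the variance
rng = np.random.default_rng(2)
si = rng.uniform(0, 10, size=(200, 2))
z = 5 * si[:, 0] + rng.normal(0, 1 + si[:, 0], size=200)

theta = 2.0                                   # smoothing parameter (cross-validated in practice)
mz, sz = zip(*[local_mean_std(p, si, z, theta) for p in si])
y = (z - np.array(mz)) / np.array(sz)         # transformed, approximately stationary variable
print("mean and std of Y:", round(float(y.mean()), 3), round(float(y.std()), 3))
# back transformation of a prediction yhat at location s: zhat = mean(s) + stdev(s) * yhat
```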
FURTHER READING
1. Geostatistical Analyst Web site
http://www.esri.com/software/arcgis/extensions/geostatistical/index.html
Visit the Geostatistical Analyst Web site for more information about available models, tools, and new
functionality in development.
2. Gribov, A., K. Krivoruchko, and J. M. Ver Hoef. 2006. “Modeling the Semivariogram: New Approach, Methods
Comparison, and Simulation Study.” In Stochastic Modeling and Geostatistics: Principles, Methods, and Case
Studies, Volume II. Edited by T. C. Coburn, J. M. Yarus, and R. L. Chambers, 45–57. AAPG Computer Applications
in Geology 5.
Available at http://training.esri.com/campus/library/index.cfm
This paper discusses the semivariogram parameter called the nugget effect and compares exact and filtered
kriging models.
3. Ver Hoef, J., and R. P. Barry. 1998. “Constructing and Fitting Models for Cokriging and Multivariable Spatial
Prediction.” Journal of Statistical Planning and Inference 69:275–94.
Traditional cokriging requires that semivariogram models for each variable be identical—that is, they should
have the same shape and range parameters (although a different nugget parameter is allowed). This results in
difficulties with constructing flexible and valid cross‐covariance models. The authors consider more general
class of cross‐semivariogram and cross‐covariance models based on moving average methodology that
support models with different shape and range parameters.
The authors show that the standard practice of using equal weights for all pairs of points for the empirical
semivariogram calculation may lead to inefficient and unstable semivariogram estimation. They propose an
algorithm for optimal weights for the empirical semivariogram estimate.
5. Gribov, A., and K. Krivoruchko. 2004. “Geostatistical Mapping with Continuous Moving Neighborhood.”
Mathematical Geology 36(2), February 2004. Available at
http://training.esri.com/campus/library/index.cfm
The authors proposed a modification of kriging that produces continuous prediction and prediction standard
error surfaces. Simple, ordinary, and universal kriging, conditional geostatistical simulation, and limitations
using the proposed model are discussed. A model is implemented in Geostatistical Analyst version 9.2 for
both deterministic and statistical models.
6. Chiles, J. P., and P. Delfiner. 1999. Geostatistics: Modeling Spatial Uncertainty. John Wiley & Sons, 720 pages.
There are many books on geostatistics written by researchers and scientists from a variety of fields in earth
sciences. Books written by professional statisticians are difficult to read without a good background in
mathematics and classical statistics. From a mathematical notation point of view, this book can be read by a
larger number of GIS users. The number of problems, methods, and tricks discussed in this book will surprise
most new readers.
7. Diggle, P. J., R. Menezes, and T. Su. 2008. "Geostatistical Inference under Preferential Sampling." Johns
Hopkins University, Working Paper #162.
The authors of this paper developed a geostatistical model for preferentially sampled environmental data. It
is shown that ignoring preferential sampling can lead to misleading inferences.
8. Emery, X. 2005. “Variograms of Order ω: A Tool to Validate a Bivariate Distribution Model.” Mathematical
Geology 37(2):163–181.
This paper discusses several tests for data consistency with a bivariate Gaussian distribution.
9. Kagan, R. L. 1997. Averaging of Meteorological Fields. Kluwer Academic Publishers. (Original Russian
edition: Gidrometeoizdat, St. Petersburg, Russia, 1979)
In this classic geostatistical book, the author summarized his research on optimal data averaging in time and
space from 1962 (when Gandin and Kagan published their first paper on “block kriging”) to 1979. Data
averaging is also important in geological applications, such as estimating recoverable reserves.
Block kriging is not discussed in this chapter primarily because data averaging is often more convenient using
geostatistical conditional simulation as discussed in chapter 10.
10. Cressie, N. 2006. "Block Kriging for Lognormal Spatial Processes." Mathematical Geology 38:413–443.
The purpose of this paper is to take a fresh look at geostatistics for lognormal data, build on the results of
earlier papers, and develop new results in light of the statistical literature on linear models and data
transformations.
This paper discusses the implementation of the maximum likelihood algorithm for interpolating large
datasets, assuming that the input data are stationary and Gaussian. The geostatistical literature is generally in agreement that maximum likelihood is usually preferable to weighted least squares covariance fitting algorithms. However, implementations of maximum likelihood in commercial and freeware software are limited to several hundred data samples. The authors proposed an innovative solution to the
“large N” problem.
PRINCIPLES OF MODELING GEOSTATISTICAL DATA: KRIGING MODELS AND THEIR ASSUMPTIONS
CHOOSING BETWEEN SIMPLE AND ORDINARY KRIGING
KRIGING OUTPUT MAPS
MULTIVARIATE GEOSTATISTICS
INDICATOR KRIGING AND INDICATOR COKRIGING
DISJUNCTIVE KRIGING
CHECKING FOR BIVARIATE NORMALITY
MOVING WINDOW KRIGING
KRIGING ASSUMPTIONS AND MODEL SELECTION
A MOVING-WINDOW KRIGING MODEL
KRIGING WITH VARYING MODEL PARAMETERS: SENSITIVITY ANALYSIS AND
BAYESIAN PREDICTIONS
COPULA-BASED GEOSTATISTICAL MODELS
ASSIGNMENTS
1) REPRODUCE PREDICTION MAPS SHOWN IN THE “ACCURATE TEMPERATURE MAPS
CREATION FOR PREDICTING ROAD CONDITIONS” DEMO
2) INVESTIGATE THE PERFORMANCE OF SIMPLE, ORDINARY, AND UNIVERSAL
KRIGING MODELS BY COMPARING THEIR PREDICTIONS WITH KNOWN VALUES
4) PARTICIPATE IN THE SPATIAL INTERPOLATION COMPARISON 97 EXERCISE
5) PREDICT THE TILT THICKNESS IN THE LAKE
6) DEVELOP A GEOSTATISTICAL MODEL FOR INTERPOLATION OF THE LAKE KOZJAK
DEPTH DATA
FURTHER READING
This chapter continues discussing the foundations of geostatistical data analysis, focusing on the advantages and disadvantages of various kriging models. First, two basic geostatistical models, simple and ordinary kriging, are compared. Formally, the only difference between these models is the specification of the data mean value as a known or an unknown constant. However, this small difference may lead to very different predictions and especially prediction standard errors.
Next, the construction of four kriging output maps is discussed and illustrated using cadmium data collected in Austria.
Using data on milk and soil contaminated with radiocesium, we discuss and illustrate the multivariate
extension of kriging, when prediction of the variable of interest is based on the surrounding observations of
several variables.
Assumptions behind indicator kriging, a popular model for probability map creation in geostatistical
literature, are discussed. Indicator kriging has several shortcomings, and this chapter presents disjunctive
kriging, a model that overcomes some of them. This kriging model requires bivariate normal data, and a method for checking the bivariate normality assumption is presented.
A method for analyzing large non‐stationary data, moving window kriging, is also discussed and illustrated
using precipitation data collected in South Africa.
A summary of conventional kriging assumptions and consequences of the model misspecification are
presented and illustrated using meteorological and radioecological data.
Then Bayesian kriging that treats the model parameters as random variables instead of constants is discussed
and illustrated using cesium‐137 soil contamination data from Belarus.
This chapter ends with a discussion of copula‐based geostatistics, an interesting alternative to spatial
dependence modeling using semivariogram.
Several other geostatistical models are discussed in the following chapters and appendixes.
CHOOSING BETWEEN SIMPLE AND ORDINARY KRIGING
How Gandin derived formulas for ordinary kriging was briefly discussed at the beginning of chapter 8. This
section presents the derivation of simple kriging formulas and discusses their features, then compares simple
and ordinary kriging.
Simple kriging requires the detrended data,
Y(si) = Z(si) – mean,
where Z(si) is an observed data value at the location si and mean is a constant or the value of the mean surface at location si (in Geostatistical Analyst, the mean surface is usually estimated using local polynomial interpolation). If the assumed mean value or surface is defined correctly, then the mean of the residuals Y(si) is equal to zero. After making kriging predictions Ŷ(s) at location s, where the hat denotes a predicted value, the mean is added back to the predictions, and the prediction of the original variable is calculated as

$$\hat{Z}(s) = \hat{Y}(s) + \mathrm{mean}.$$
Simple kriging assumes that the prediction of Y(s0), where s0 is a location where a measurement is not available, is a linear combination of the measurements

$$\hat{Y}(s_0) = \sum_{i=1}^{N}\lambda_i Y(s_i),$$

where N is the number of surrounding data points and λi are the weights of the measurements Y(si).
Weights should depend on the distances between the prediction location and the observed data locations and also on the degree of correlation in the data values; this is what makes kriging so different from deterministic methods. To choose λi optimally, the expected squared difference between the actual value Y(s0) and its prediction is minimized:

$$\min_{\lambda_1,\ldots,\lambda_N} E\Big[\big(Y(s_0) - \sum_{i=1}^{N}\lambda_i Y(s_i)\big)^2\Big].$$

The minimized expected value is the variance of the prediction error. For the optimal weights, the formula can be rewritten as

$$\mathrm{var}\big(\hat{Y}(s_0)\big) = \mathrm{var}\big(Y(s_0)\big) - E\Big[\big(Y(s_0) - \hat{Y}(s_0)\big)^2\Big].$$

The last formula indicates that the variance of the prediction is less than the variance of the analyzed variable at the unsampled location Y(s0), meaning that kriging underestimates local data variability, resulting in a prediction map that is always smoother than the actual surface.
Using one of the textbooks on random variables, the minimized expression can be rewritten using the covariances between the unsampled location s0 and the sampled locations si as

$$E\Big[\big(Y(s_0) - \sum_{i=1}^{N}\lambda_i Y(s_i)\big)^2\Big] = C(s_0,s_0) - 2\sum_{i=1}^{N}\lambda_i C(s_0,s_i) + \sum_{i=1}^{N}\sum_{j=1}^{N}\lambda_i\lambda_j C(s_i,s_j),$$

where C(sj, si) is the covariance between locations sj and si. Setting the derivative with respect to each weight to zero gives

$$\sum_{j=1}^{N}\lambda_j C(s_i,s_j) = C(s_0,s_i), \quad \text{for } i = 1, 2, \ldots, N.$$

This is a system of N linear equations from which the kriging weights λi are found.
To find the kriging weights, the covariance between data locations si and sj, expressed as C(si, sj), and the covariance between the unsampled location s0 and the data locations si, which is C(s0, si), should be known. In practice, the covariance model is estimated from the data and then assumed to be known. This is a strong assumption, and modern statistical literature tends to overcome it (see the section "Kriging with varying model parameters: sensitivity analysis and Bayesian predictions" in this chapter).
Using the formulas above, the prediction standard error is calculated as

$$\sigma(s_0) = \sqrt{C(s_0,s_0) - \sum_{i=1}^{N}\lambda_i C(s_0,s_i)},$$

meaning that knowledge of the data covariance allows estimation of the prediction uncertainty. Note that the kriging variance does not directly depend on the observed values Y(si). It depends on the observations only through the covariances C(si, sj) and C(s0, si).
It is shown in statistical literature that the simple kriging predictor has the smallest mean‐squared prediction
error among all linear kriging models including ordinary and universal kriging. This statement is true only if
the mean is really known and not estimated from the same data.
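A small numerical sketch of the simple kriging equations above, assuming an exponential covariance model and hypothetical detrended data; it solves the linear system for the weights and evaluates the prediction and its standard error at one location.

```python
import numpy as np

def exp_cov(h, psill=1.0, range_=1.0):
    """Exponential covariance model C(h) = psill * exp(-3h / range_)."""
    return psill * np.exp(-3.0 * h / range_)

# Hypothetical detrended observations Y(si) (mean already removed)
si = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.5, 1.5]])
y = np.array([0.8, -0.3, 0.5, -1.1])
s0 = np.array([0.5, 0.5])                      # prediction location

C = exp_cov(np.linalg.norm(si[:, None] - si[None, :], axis=2))   # C(si, sj)
c0 = exp_cov(np.linalg.norm(si - s0, axis=1))                    # C(s0, si)

lam = np.linalg.solve(C, c0)                   # simple kriging weights
y_hat = lam @ y                                # prediction of Y(s0)
se = np.sqrt(exp_cov(0.0) - lam @ c0)          # prediction standard error
print("weights:", lam.round(3), " prediction:", round(y_hat, 3), " std. error:", round(se, 3))
```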
The predictor for the original data Z(s0) is

$$\hat{Z}(s_0) = \sum_{i=1}^{N}\lambda_i Z(s_i) + \Big(1 - \sum_{i=1}^{N}\lambda_i\Big)\cdot \mathrm{mean}.$$

Note that the mean works as an additional observation with weight (1 − Σi λi). If data are weakly correlated, the neighbors' weights are small, the sum Σi λi is much smaller than one, and predictions tend to be
close to the mean. Predictions also tend to be close to the mean in areas without observations. Usually this
makes sense, but not always. For example, some areas can be undersampled because values there are smaller
than average value. In these areas, the mean value is not the best predictor. On the other hand, these areas are
The mean value can be excluded from the last formula by assuming that the sum of the kriging weights of the neighboring data is 1, that is, Σi λi = 1. The resulting ordinary kriging predictor assumes that the mean may have
any value and vary across the areas, being approximately constant in the kriging neighborhood. However, it is
reasonable to suspect that such a simplistic solution to the complicated problem of mean value or mean
surface estimation may not work for each particular problem.
Note that from the Bayesian statistics point of view, the simple kriging prediction standard error tends to the
ordinary kriging prediction standard error when uncertainty of the mean value (prior of the mean variance)
tends to infinity. However, it is often possible to make a reasonable estimation of the mean value or mean
surface and, therefore, use additional information to create a better model.
In the early years of geostatistics, the primary applications were in meteorology and mining. Meteorologists
often preferred simple kriging, whereas geologists almost exclusively used ordinary kriging. Both schools
were concerned with specification of the mean value or the mean surface.
Meteorological data are replicated in time, and observations from previous periods of time can be used for
estimation of the mean surface or mean value. Then mean surface is removed from the data, and simple
kriging on residuals is used.
In geological applications, data replications are usually not available, and it is difficult to estimate the mean
value or mean surface. Geological data are expensive and the number of samples available is usually smaller
than in meteorological applications. In addition, measurement errors of geological data are often larger than
in meteorological data. In these circumstances, the constraint Σi λi = 1 on the kriging weights may not be a good idea, since in theory this constraint works well only when the measurement error is negligible
and the number of samples in the kriging neighborhood is large.
In the case of equally spaced measurements, it is possible to show in the spectral domain that an optimal
value of the sum of weights should decrease with a decreasing number of neighboring samples and an
increasing measurement error. In particular, when nugget parameter is equal to partial sill, the optimal sum
of weights is 0.5 instead of 1.
In the extreme case of independent data and a small number of samples in the kriging neighborhood, the
surface created using ordinary kriging varies too much while simple kriging predictions are very close to the
flat surface, as it should be. In this case, simple kriging weights tend to zero values.
Figure 9.1 shows five spherical semivariograms with a range equal to 0.2, sill equal to 1, and various
combinations of nugget and partial sill parameters. Spatially dependent, standard normal data with
correlation defined by spherical semivariograms were simulated. Note that geostatistical simulations are
almost always based on simple kriging (see discussion in chapter 10). Then performance of ordinary kriging
can be compared with simple kriging, which in this case is the best predictor by construction.
Figure 9.1
Figure 9.2 shows predictions at three locations and kriging weights of the five closest samples for simple (at
left) and ordinary (at right) kriging, using data simulated with a spherical semivariogram with nugget
parameter equal to partial sill (the green line in figure 9.1). The only difference between prediction models is
a constraint on the weights, since mean value is equal to zero and does not influence the predictions directly.
The difference between predictions using ordinary and simple (in this case, optimal) kriging in figure 9.2 is
significant. This difference is larger in the areas with a smaller number of samples. It can be shown using
simulations with other models displayed in figure 9.1 that this difference decreases as nugget parameter
(here interpreted as a measurement error) is decreasing.
Figure 9.2
KRIGING OUTPUT MAPS
Figure 9.3 shows four output maps created by Geostatistical Analyst using the same data (maximum annual ozone concentration in California in 1999) and different renderers.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 9.3
The prediction map is created by contouring many interpolated values systematically obtained throughout
the region. Note that the Geostatistical layer is not a regular grid with square cells but an irregular triangular
grid where the number of triangles is a function of the data density.
The prediction standard error map is produced from the standard errors of the interpolated values. The second map in figure 9.3 shows that the prediction standard error increases significantly with increasing distance from the measurement locations. The actual value of ozone concentration can be very large or very small in the areas with large prediction error.
The probability map shows the probability that the interpolated values exceed a specified threshold.
The quantile map is a probability map in which the threshold is equal to the specified quantile of the data
distribution. A quantile map shows overestimated (quantile is greater than 0.5) or underestimated (quantile
is less than 0.5) values and can be used in decision making. For example, it is better to evacuate people from a
larger area than to underestimate the area contaminated by dangerous levels of chemicals. Hence, mapping
the quantile values of, say, 0.7 instead of the mean values can be advantageous.
The kriging prediction is at the center of each curve in figure 9.4. The probability that the value is greater than a threshold value of 5 is the proportion of the area under the curve shown in blue to the whole area under the curve. The values shown as blue crests in figure 9.4 are predictions of the 0.66 quantiles.
Figure 9.4
Figure 9.5 illustrates the construction of probability values that cadmium concentration exceeded the
threshold of 0.25 microgram per kilogram in moss in Austria in 1995 estimated using disjunctive kriging.
Filled contours display cadmium predictions. Prediction standard errors are shown with contour lines.
Predictions and prediction standard errors for a low‐contamination location on the border of Hohe Tauern
National Park, medium contamination in Vienna, and relatively high contamination in Innsbruck are shown in
the table in the bottom left part of the figure. Using these values and assuming that predictions are normally
distributed with mean equals kriging prediction and standard deviation equals kriging prediction standard
error, probability values for a threshold of 0.25 are calculated and shown in the top part of the graph.
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 9.5
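Under the normality assumption described above, the probability and quantile values can be reproduced directly from a kriging prediction and its standard error; the numbers in this sketch are placeholders, not the values from figure 9.5.

```python
from scipy.stats import norm

prediction, std_error = 0.21, 0.05      # hypothetical kriging output at one location
threshold = 0.25                        # micrograms per kilogram

prob_exceed = 1.0 - norm.cdf(threshold, loc=prediction, scale=std_error)
quantile_66 = norm.ppf(0.66, loc=prediction, scale=std_error)

print("P(Z > 0.25):", round(prob_exceed, 3))
print("0.66 quantile:", round(quantile_66, 3))
```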
MULTIVARIATE GEOSTATISTICS
Spatial data may change with a changing environment. For example, ozone concentration in the air varies
with changes in elevation, temperature, and concentration of other chemicals.
The cokriging models—including simple, ordinary, universal, indicator, probability, and disjunctive
cokriging—are all generalizations of the kriging models for prediction using several variables.
Cokriging was first used in meteorology to improve prediction of the geopotential (a variable that describes
the earth’s gravitational field) using data from wind measurements.
In Geostatistical Analyst, one, two, or three secondary variables (Z2(s), Z3(s), and Z4(s)) can be used for making more accurate predictions of the primary variable Z1(s) at the unsampled location s0. The prediction is a weighted sum of the observations in the searching neighborhoods:

$$\hat{Z}_1(s_0) = \sum_{i=1}^{N}\lambda_i Z_1(s_i) + \sum_{j=1}^{M}\alpha_j Z_2(s_j) + \sum_{k=1}^{L}\beta_k Z_3(s_k) + \sum_{l=1}^{K}\delta_l Z_4(s_l),$$

where λi, αj, βk, and δl are the weights, and N, M, L, and K are the numbers of measurements of the variables involved in the prediction.
Weights in different cokriging models have different meaning and constraints. In ordinary cokriging, the sum
of weights of the primary variable is equal to 1 and the sums of weights of the secondary variables are zeroes.
In simple cokriging, the differences between the measurements and their mean values are weighted, and the weights can have any values:

$$\hat{Z}_1(s_0) - \mu_1 = \sum_{i=1}^{N}\lambda_i\big(Z_1(s_i) - \mu_1\big) + \sum_{j=1}^{M}\alpha_j\big(Z_2(s_j) - \mu_2\big) + \ldots$$
Each location can have measurements of all variables or each variable can have its own unique set of
locations.
Cross‐correlation between spatial variables is defined by cross‐covariance as a function of distance between
measurement locations of two variables
Crosscovariance(distance h) = average[(Z1(si) – mean(Z1))⋅(Z2(sj) – mean(Z2))]
for all pairs of locations i and j separated by distance h.
Because the variances of primary and secondary variables may differ significantly, the Geostatistical Analyst covariance model fitting algorithm first scales each dataset,

$$\tilde{z}_j(s_i) = \big(z_j(s_i) - \mu_j\big)/\sigma_j,$$

where μj and σj are the sample mean and standard deviation of the jth variable, reducing the risk of numerical instability when fitting the covariance and cross-covariance models.
Weights of the measurements are estimated using semivariogram and cross‐covariance models. If variables
are spatially independent, cross‐covariance is equal to zero; weights of the secondary variables are also zero;
and predictions using kriging and cokriging are identical.
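A compact sketch of simple cokriging with one secondary variable. A proportional ("intrinsic coregionalization") covariance model with a common range is assumed so that the joint covariance matrix stays valid; all locations, data values, and sill parameters are hypothetical.

```python
import numpy as np

def exp_cov(h, sill, range_=10.0):
    """Exponential (cross-)covariance with a common range, C(h) = sill * exp(-3h / range_)."""
    return sill * np.exp(-3.0 * h / range_)

# Hypothetical detrended data: primary Z1 (sparse) and secondary Z2 (dense)
s1 = np.array([[0.0, 0.0], [8.0, 1.0], [2.0, 9.0]]);               z1 = np.array([1.2, -0.4, 0.9])
s2 = np.array([[1.0, 1.0], [3.0, 2.0], [2.0, 4.0], [5.0, 5.0]]);   z2 = np.array([0.8, 0.1, 0.6, -0.2])
s0 = np.array([2.0, 2.0])

sill11, sill22, sill12 = 1.0, 0.8, 0.6          # sill12**2 <= sill11*sill22 keeps the model valid

d = lambda a, b: np.linalg.norm(a[:, None] - b[None, :], axis=2)
C = np.block([[exp_cov(d(s1, s1), sill11), exp_cov(d(s1, s2), sill12)],
              [exp_cov(d(s2, s1), sill12), exp_cov(d(s2, s2), sill22)]])
c0 = np.concatenate([exp_cov(np.linalg.norm(s1 - s0, axis=1), sill11),
                     exp_cov(np.linalg.norm(s2 - s0, axis=1), sill12)])

w = np.linalg.solve(C, c0)                      # weights for the Z1 and Z2 observations
z1_hat = w @ np.concatenate([z1, z2])           # simple cokriging prediction of Z1(s0)
se = np.sqrt(sill11 - w @ c0)                   # prediction standard error
print("prediction:", round(z1_hat, 3), " std. error:", round(se, 3))
```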
Figure 9.6 shows Cesium‐137 measurements in milk (squares) and soil (circles) in part of Belarus in 1993.
Gray lines display the road network. Red check marks highlight areas where Cesium‐137 soil samples were
collected and milk samples were not. Both datasets reflect Cesium‐137 deposition, and it is reasonable to use
more densely sampled soil measurements to improve prediction of Cesium‐137 milk contamination.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 9.6
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 9.7
Digits at the bottom in figure 9.7 number the estimated parameters of the cokriging spatial correlation model.
Six parameters must be estimated for two variables when using cokriging instead of three parameters in the
case of kriging (note that the parameter range is the same for all semivariogram and cross‐covariance models,
and the nugget parameter is equal to zero for the cross‐covariance model). This number increases when three
or four variables are used.
If cross‐correlation exists and is estimated correctly, cokriging outperforms kriging. However, the need to
estimate additional parameters introduces additional uncertainty, and prediction precision may decrease if
cokriging model parameters are estimated incorrectly.
Figure 9.8 shows weights of simple kriging (bottom right window) and simple cokriging (top right and bottom
left) using six nearest measurements to the prediction location with the secondary variable observations (soil
contamination) closer to the prediction location than the primary variable (milk contamination). The
semivariograms and cross‐covariance models in figure 9.7 have a range of correlation of approximately 15
kilometers. Distances from the six nearest primary variable measurements to the prediction location are
between 5 and 7.5 kilometers, whereas nearest observations of the secondary variable are closer than 4
kilometers. There is strong spatial cross‐correlation between two variables at distances less than 8
kilometers, which results in substantially different predictions (131.75 and 96.524) by cokriging and kriging.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 9.9
Because large negative ordinary cokriging weights do not make much sense, a so-called "rescaled" version of ordinary cokriging was proposed in the geostatistical literature. The rescaled ordinary cokriging shifts the secondary data by the difference between the primary and secondary means and constrains the sum of all of the weights, primary and secondary, to equal one:

$$\sum_{i=1}^{N}\lambda_i + \sum_{j=1}^{M}\alpha_j = 1.$$
The rescaled ordinary cokriging is a mixture of simple and ordinary cokriging because it requires
specification of the stationary means of the primary and secondary variables. Since specification of the means
is required anyway, it is better to define proper mean surfaces and then use simple cokriging.
Figure 9.10 compares prediction surfaces created by simple kriging (contours) and simple cokriging (filled
contours). In areas with dense sample measurements of the primary variable (cesium‐137 milk
contamination, shown as squares), the difference between the two surfaces is small. In the areas where
measurements of the primary variable are scarce or absent but the secondary variable (cesium‐137 soil
contamination, circles) was measured densely, the filled contours of cokriging deviate substantially from the
kriging contours.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 9.10
Cokriging usually produces more accurate predictions than kriging if the secondary variable is more densely
sampled than the primary variable (for example, when the secondary variable is DEM or remotely sensed
data), but not vice versa, because a coarsely sampled variable predominates over a densely sampled variable
in estimating the parameters of covariance models.
Cokriging is sometimes used for prediction using data collected from more than one sampling network used
by different organizations with different personnel, research laboratories and quality criteria. When
measurement quality differs between sampling networks, cokriging yields more accurate results than kriging applied to the simple combination of the datasets with different accuracy, because cokriging can use different measurement
error variances for each dataset.
When measurement error is different at each sampling location, a conditional geostatistical simulation with a
varying nugget parameter should be used instead of kriging or cokriging.
INDICATOR KRIGING AND INDICATOR COKRIGING
The indicator transformation for one‐dimensional data is illustrated in figure 9.11. Transformed input data
inside the interval adjacent to the threshold (for example, this can be the measurement error interval) is
displayed as red points in the indicator transformation in figure 9.11 at right. It is quite possible that these
points could be flipped if the measurements were more precise. However, it is assumed that we know exactly
whether or not a value is below or above the threshold. After indicator transformation, input values nearer to
or farther from the threshold become either zeroes or ones. Therefore, when transforming the data,
information on how close an observation is to the threshold is lost.
Figure 9.11
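The indicator transformation itself is a one-line operation, as the following sketch with hypothetical temperature data shows; note that the distance of each value from the threshold is discarded.

```python
import numpy as np

temp_f = np.array([28.0, 31.5, 33.0, 40.2, 25.1])    # hypothetical temperatures in Fahrenheit
threshold = 32.0
indicator = (temp_f > threshold).astype(float)        # 1 above the threshold, 0 otherwise
print(indicator)   # information about how far each value is from 32 F is lost
```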
In indicator kriging, the indicator values are used as input to ordinary (or sometimes to simple) kriging. The
indicator kriging prediction at location s is interpreted as the probability that the threshold is exceeded. For
instance, if the prediction equals 0.27, it is interpreted as a 27 percent chance that the threshold was exceeded.
Predictions made at each location form a surface that could be interpreted as a probability that the specified
threshold is exceeded.
It is expected that the predictions at the unsampled locations are between zero and one. Unfortunately, this is
often not fulfilled in practice.
Any continuous function can be represented by a weighted sum of indicators for different thresholds (see also
the discussion on disjunctive kriging below). The larger the number of indicators, the more accurate the
function representation. If the number of thresholds is large, predictions of indicator kriging can be used for
reconstruction of the data distribution at a prediction location. Figure 9.12 illustrates this idea: black crosses
are the estimated probabilities that the data thresholds (vertical dashed lines) are exceeded. One may hope
that the red line can be reconstructed using a sufficiently large number of crosses. Unfortunately, this idea does not work well in practice.
Figure 9.12
For small and large thresholds, the indicator functions contain too many identical values (zeroes or ones),
spatial structure disappears, and corresponding indicator semivariogram estimation becomes unstable.
Another problem is that there is strong correlation between indicators defined by different thresholds, and
important information about the data is lost if this correlation is ignored.
All conventional geostatistical models assume data stationarity. In practice, data often depart from
stationarity. The solution is to use detrending and transformation techniques to bring data close to stationary.
However, indicators are almost always nonstationary, and there is no possibility of transforming them to
stationary data. Figure 9.13 shows local mean (left) and local standard deviation (right) of indicator variable
I(rainfall > 2mm) using rainfall data collected in Sweden (map in the center). Both local mean and local
variance are changing too fast to be considered locally stationary.
Figure 9.13
In the case of rainfall data in figure 9.13, the local mean is proportional to the local variance, and the optimal
probability map can be created assuming that data follow gamma or negative binomial distribution. Simple or
disjunctive kriging with normal score transformation can also be an option if only positive values are used as
in the algorithm presented in chapter 4. See also an example of complex data transformation in chapter 10
(figure 10.12).
Kriging is the best linear predictor for Gaussian random variables, but the indicators 0 and 1 are Bernoulli
random variables. However, indicator kriging does not use the properties of Bernoulli distribution and treats
the indicators as if they were a measure on a continuous scale. The Bernoulli distribution defines the relationship between the mean and variance values (see chapter 4). If one of these values is unknown (for example, the mean value in ordinary indicator kriging), then so is the other; yet the variance corresponds to the sill parameter of the indicator semivariogram, which is assumed to be known.
Indicator kriging attempts to build a multivariate distribution from the bivariate one. This is rarely possible
(see reference 1 in “Further reading”).
There is a problem with interpretation of the indicator kriging prediction. Figure 9.14 shows the prediction of
the probability that a temperature of 32 degrees Fahrenheit was exceeded in the western part of Switzerland
in the middle of winter 2004. Five closest observations are used for prediction, and three of them have zero
value after the indicator transformation. The weights estimated from the semivariogram are multiplied by
these zero values so that the resulting prediction depends only on the observations that are above the
threshold temperature. The predicted probability is equal to 0.23 and will be the same for any other three
northern temperature observations below 32 degrees Fahrenheit, for example ‐10, 0, and 10. This is
counterintuitive.
Data courtesy of NOAA.
Figure 9.14
We conclude that it is safe to use indicator kriging as an exploratory data analysis tool and not as a predictive
model for decision making.
Indicator cokriging is cokriging of indicators at several different thresholds. Theoretically, indicator cokriging
can solve a problem with correlation between different indicators in figure 9.12 if all covariance and cross‐
covariance parameters are estimated correctly. However, accurate estimation of many covariance and cross‐
covariance parameters is very difficult to do interactively and almost impossible automatically. In addition, indicator cokriging predictions at different thresholds may be inconsistent with each other (the order relation problem).
It is possible to express the data and indicators as linear combinations of uncorrelated functions. This
decomposition of the indicators is the basis of disjunctive kriging. In this model, cokriging of the indicators is
equivalent to a set of simple kriging systems of linear equations.
DISJUNCTIVE KRIGING
Disjunctive kriging is the cokriging of independent indicators. The method is called “disjunctive” because it
uses disjoint orthogonal (uncorrelated) functions. Gaussian disjunctive kriging tries to improve upon linear
kriging by making prediction standard errors dependent on the data values in the kriging neighborhood and by making indicator cokriging predictions reliable. The price for these improvements is the requirement that the
data distribution should be bivariate normal, and the data should be known without error. In this section, we
discuss Gaussian disjunctive kriging (see reference 5 for non‐Gaussian disjunctive kriging).
The pair of data Z(s) and Z(s+h) separated by distance h has a bivariate normal distribution if any linear
combination of Z(s) and Z(s+h) is normally distributed with a correlation coefficient ρ(h) that depends only
on distance between pairs. The correlation coefficient is expressed through the covariance or semivariogram
models.
Any function of the data Z(s) can be expanded in terms of orthogonal functions such as trigonometric,
Chebyshev, Hermite, and Bessel orthogonal polynomials. In the 1950s, meteorologists widely used various
expansions of meteorological fields when modeling the spatial structure of meteorological data
(semivariogram and covariance). It was found that an expansion in terms of Hermite polynomials gives the fastest decrease in the unexplained data variability as the number of terms in the expansion increases. For example, in 1959 Leonid Rukhovets studied the expansion of geopotential collected over the
territory of Europe and found that the first two and the first four expansion terms explain approximately 90
and 99 percent of the variability in the geopotential data, respectively.
The first three normalized Hermite polynomials are as follows:

$$H_0(Z) = 1, \qquad H_1(Z) = -Z, \qquad H_2(Z) = \big(Z^2 - 1\big)/\sqrt{2}.$$

The other polynomials are calculated using the recurrence relation

$$H_{k+1}(Z) = -\frac{Z}{\sqrt{k+1}}\,H_k(Z) - \sqrt{\frac{k}{k+1}}\,H_{k-1}(Z).$$
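A short sketch that evaluates the polynomials from the recurrence above; the normalization (the 1/√2 factor in H2) follows the convention commonly used in the disjunctive kriging literature and should be treated as an assumption of this sketch.

```python
import numpy as np

def hermite_normalized(z, m):
    """Return H_0(z), ..., H_m(z) using the recurrence
    H_{k+1}(z) = -(z / sqrt(k+1)) H_k(z) - sqrt(k/(k+1)) H_{k-1}(z)."""
    H = [np.ones_like(z), -z]
    for k in range(1, m):
        H.append(-(z / np.sqrt(k + 1)) * H[k] - np.sqrt(k / (k + 1)) * H[k - 1])
    return H[: m + 1]

z = np.linspace(-3, 3, 5)
H = hermite_normalized(z, 4)
print(np.allclose(H[2], (z**2 - 1) / np.sqrt(2)))   # matches the closed form for H_2
```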
In the 1970s, French geostatisticians developed a model called Gaussian disjunctive kriging. The disjunctive
kriging predictor has the following form:
$$\hat{Z}(s_0) = \sum_{s \in D} g_s\big(Z(s)\big),$$
where gs(Z(s)) is some function of the variable Z(s) and D is the set of data locations in the searching neighborhood. In Geostatistical Analyst, the available functions gs(Z(s)) are simply Z(s) itself and the indicator I(Z(s) > threshold). Geostatistical Analyst uses the following predictor:

$$\hat{Z}(s_0) = \sum_{k=1}^{m} f_k \sum_{i=1}^{n} \lambda_{ki}\, H_k\big(Z(s_i)\big),$$

where fk and λki are coefficients, Hk are Hermite polynomials, m is the number of terms in the expansion, n is the number of observations in the kriging neighborhood, and each pair of values Z(si) and Z(sj) has a bivariate normal distribution.
In Geostatistical Analyst, the number of terms in the expansion varies. For each data value, a number is
chosen so that at least 99 percent of the data variability is described by the expansion. The number of terms
in the expansion is equal to the number of simple kriging systems of linear equations that should be solved to
find kriging weights.
If the indicator transformation is used, disjunctive kriging is similar to indicator cokriging. The difference is
that thresholds for indicator transformation in disjunctive kriging are automatically selected in such a way
that pairs of indicator‐transformed data with different thresholds are not correlated. Then the large and
complicated cokriging system is replaced by several simple kriging systems and the semivariogram model
estimated using the original data is used to obtain the indicator semivariogram models. Since disjunctive
kriging and indicator cokriging are related, the former may have the same problems as the latter (such as the
order relation problem mentioned in "Indicator kriging and indicator cokriging"), although much more rarely.
Table 9.1 below summarizes requirements for indicator kriging, linear kriging, Gaussian disjunctive kriging,
and kriging using multivariate normally distributed data (so‐called multi‐Gaussian kriging). Theoretically, the
model shown in the right column is better than the model in the left column if model assumptions are
satisfied.
Model | Indicator kriging | Simple, ordinary, universal kriging | Gaussian disjunctive kriging | Multi-Gaussian kriging
Optimal predictor | Σi λi I(zi) | Σi λi zi | Σi functioni(zi) | function(z1, z2, …, zn)
Requirement | Indicator semivariogram model | Semivariogram model | Bivariate normal distribution | (n + 1)-variate normal distribution
Table 9.1
In practice, both Gaussian disjunctive and multi‐Gaussian kriging are used after transformation, usually
normal score transformation. However, the consequence of using these models when the assumption of a
bivariate normal distribution is not satisfied is different. Multi‐Gaussian kriging produces reliable predictions
with non‐optimally estimated prediction standard error. On the other hand, both prediction and prediction
standard errors calculated by disjunctive kriging may be far from the optimal values when the data
distribution is not bivariate normal because, in this case, the expansion terms are not orthogonal and the
algorithm does not work properly since an important model assumption is violated. That is why data
transformation is a desirable feature for simple kriging, but the bivariate data normality is a requirement for
Gaussian disjunctive kriging.
CHECKING FOR BIVARIATE NORMALITY
Gaussian disjunctive kriging requires that data have a bivariate normal distribution. Also, the creation of
probability and quantile maps using linear kriging models (simple, ordinary, and universal) assumes that the
data come from a multivariate normal distribution. Quantile‐quantile plots can be used to check whether the
data appear to be univariate normally distributed. As an additional tool, it is possible to check for bivariate
normality. Neither of these guarantees that the data come from a full multivariate normal distribution, but it
is often reasonable to assume so based on these diagnostic tools.
If data transformation to bivariate normal distribution is not possible, more complex non‐Gaussian
disjunctive kriging can be used. The idea of non‐Gaussian disjunctive kriging is to find a suitable
decomposition of particular bivariate distributions into orthogonal factors so that prediction could be
obtained by kriging each of these factors separately, just as in Gaussian disjunctive kriging with Hermite
polynomials. A general method for decomposition of various continuous and discrete distributions into
orthogonal factors exists (see reference 5 in “Further reading”).
The semivariogram and covariance models estimated on the transformed data can be used to obtain the
expected semivariogram and covariance models of indicators for various standard normal quantiles zp using a
function f(p,h):
f(p,h) = Prob[Z(s) ≤ zp, Z(s + h) ≤ zp],
where p is a probability value. When p = 0.975, then zp = 1.96; when p = 0.5, then zp = 0; and when p = 0.025,
then zp = −1.96. The probability statement above gives the probability that both Z(s) and Z(s + h) are less than
zp.
f(p,h) can be converted to semivariogram and covariance functions for indicators using the relationship
f(p,h) = E[I(Z(s) ≤ zp)×I(Z(s + h) ≤ zp)],
where I(statement) is the indicator function and E(statement) is the expected value of the statement.
The covariance model for the indicators CI(p,h) for probability p and function f(p,h) are related (see reference
10 for details):
CI(p,h) = f(p,h) – p2
The relationship between semivariogram for indicators γI(p,h) for probability p and function f(p,h) is the
following:
γI(p,h) = p − f(p,h)
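The expected indicator semivariogram under bivariate normality can be evaluated numerically from these relationships; the sketch below assumes an exponential correlation model ρ(h) for illustration and uses the bivariate normal CDF for f(p,h).

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def expected_indicator_semivariogram(p, h, range_=1.0):
    """gamma_I(p, h) = p - f(p, h), where f(p, h) = Prob[Z(s) <= z_p, Z(s+h) <= z_p]
    for a standard bivariate normal pair with correlation rho(h)."""
    zp = norm.ppf(p)
    rho = np.exp(-3.0 * h / range_)             # assumed exponential correlation model
    f = multivariate_normal.cdf([zp, zp], mean=[0.0, 0.0],
                                cov=[[1.0, rho], [rho, 1.0]])
    return p - f

for h in [0.1, 0.5, 1.0, 2.0]:
    print(h, round(expected_indicator_semivariogram(0.5, h), 4))
```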
The points in figure 9.15 taken from Geostatistical Analyst’s Examine Bivariate Distribution dialog box are the
values for the empirical covariance and semivariogram on the indicator variables. The green lines are the
expected semivariogram and covariance models. The yellow line is the covariance or semivariogram model
fitted to the indicator covariance and semivariogram. The green line and the yellow line should be similar for
several selected quantile values if the data have a bivariate normal distribution.
Note that examination of the bivariate normality is not a formal statistical test because a probability that the
testing hypothesis is accepted is not provided.
Several additional tests for data consistency with a bivariate Gaussian distribution are discussed in reference
8 in “Further reading.”
MOVING WINDOW KRIGING
The kriging prediction standard error is the average mean square error in the kriging neighborhood. For the
same pattern of point locations and the same semivariogram, the conventional kriging prediction uncertainty
is the same regardless of the actual values of the data in the kriging neighborhood. This is an unfortunate
feature of conventional kriging, since areas with larger data values usually have larger data uncertainty. For
example, the measurement error in precipitation values is usually in the interval of 10 percent to 30 percent,
and the kriging prediction standard error surface over the country with a variety of climatic zones should be
different in the dry desert and in the wet mountain areas. The linear kriging model can produce data‐
dependent prediction error if predictions (but not necessarily the observed data) are multivariate normal. In
practice, the only way to make predictions normally distributed is to use an appropriate data transformation,
because linear combination of Gaussian variables (kriging prediction) is also a Gaussian variable.
Figure 9.16 at left shows observations of average annual rainfall in South Africa. More than 8,000 measurements are available for a territory covered by mountains, desert, and jungle and spanning several climate zones. Universal kriging with a locally estimated first-order polynomial mean value produces a very reasonable prediction surface from both visual and validation-diagnostic points of view. But the universal
kriging prediction standard error map (figure 9.16 at right) reflects only the density of the data locations.
According to this map, the uncertainty in the desert with average annual rainfall of about 150 mm is larger
than in the wet areas near the coast with average annual rainfall values greater than 800 mm, which does not
make much sense.
The standard error map in figure 9.17 at left was produced by simple kriging with detrending and normal
score transformation options. The medium prediction errors are in the desert where the monitoring network
is sparse, and the largest prediction errors are in the rainforest where variability in precipitation values is
large.
We see that the prediction standard errors became conditioned by nearby observations—that is, the standard
error is a function of the data values in the kriging neighborhood. However, it is not known how close this
prediction error surface is to the optimal one that has completely localized prediction standard errors.
Therefore, various model diagnostics should be used to choose between kriging models. In the case of the
data with a very large number of observations, the prediction standard errors can be compared to the
prediction standard errors produced using moving window kriging with a locally estimated semivariogram
model (figure 9.17 at right); that is, a model with parameters that vary across the country. It can be observed
that the standard error values in figure 9.17 at left are underestimated in most of the areas.
Data courtesy of R.S.A. Water Research Commission, 2004.
Figure 9.17
The rationale behind moving window kriging that was used to create the map in figure 9.17 at right is to
recalculate the range, nugget, and partial sill semivariogram parameters in a moving window centered on the
location to be predicted, as shown in figure 9.18. As the window moves through the study area, new
semivariogram parameters are calculated, and the prediction is made using a specified number of
neighboring points. For location s1, the blue and pink points are used to estimate the semivariogram model parameters (but a smaller number of points near the prediction location is used for the prediction); for location s2, the pink and light blue points; and for location s3, the blue points. Within each moving window, the semivariogram parameters are re-estimated automatically.
Figure 9.18
Moving window kriging implementation in Geostatistical Analyst is presented in appendix 1 and will be
further discussed in the next section.
KRIGING ASSUMPTIONS AND MODEL SELECTION
Conventional geostatistical models discussed in this chapter so far are based on several assumptions that are implicitly accepted when using kriging, even if the user is not aware of them:
Data are stationary (in the case of moving window kriging, in the moving window); that is, the mean
and variance of a variable at one location is equal to the mean and variance at another location.
The semivariogram or covariance model is known exactly and is the same throughout the entire data
domain (in the case of moving window kriging, in the moving window).
Sampling locations have been chosen independently of the data values.
Kriging models may also require knowledge of the data mean value or mean surface, distribution of the input
data, and measurement error value.
It is helpful to use exploratory spatial data analysis to check kriging assumptions before modeling and
mapping. The deviations from the kriging assumptions should be minimized using data preprocessing to make the predictions close to optimal and interpretable.
Kriging models have been successfully used for several decades in a variety of applications and many
practitioners do not verify kriging assumptions, believing that the approach always works. In fact, kriging
assumptions are rarely satisfied simply because a true semivariogram model is estimated from the data.
Because not all uncertainty is taken into account, a 95‐percent prediction interval constructed using formula
kriging prediction ± 1.96⋅(kriging prediction standard error)
can in reality be under 90, 80, or even 50 percent (see “Chebyshev’s Inequality” in chapter 4).
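One way to see how far the nominal coverage can drift is to compare the interval with cross-validation output; in this sketch the observations, predictions, and standard errors are simulated placeholders, with the reported standard errors deliberately too small.

```python
import numpy as np

def empirical_coverage(observed, predicted, std_error, k=1.96):
    """Fraction of observations that fall inside prediction +/- k * standard error."""
    inside = np.abs(observed - predicted) <= k * std_error
    return inside.mean()

# Hypothetical cross-validation output
rng = np.random.default_rng(3)
observed = rng.normal(0, 2.0, 500)
predicted = observed + rng.normal(0, 1.5, 500)     # actual prediction errors
std_error = np.full(500, 1.0)                      # overly optimistic reported standard errors

print("nominal 95% interval, empirical coverage:",
      round(empirical_coverage(observed, predicted, std_error), 3))
```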
Since real data never exactly follow kriging assumptions, this raises a question about prediction accuracy
when one or more assumptions are violated.
In “Moving window kriging,” an example of precipitation modeling in South Africa using a moving‐window
kriging model was presented, and a map of the prediction standard errors produced using a locally estimated
semivariogram model was shown. In this model, the semivariogram parameters vary across the country.
Figure 9.19 shows that the estimated range parameter is in an interval from 16 to 200 kilometers.
Data courtesy of R.S.A. Water Research Commission, 2004.
Figure 9.19
A MOVING-WINDOW KRIGING MODEL
A moving‐window kriging model has several shortcomings. First, semivariogram modeling is performed
automatically, meaning that we assume that the software is extremely reliable in finding semivariogram
parameters without interaction. This is certainly not so when spatial correlation is not strong or the number
of measurements in the window is relatively small (less than 200). Second, the local semivariogram model
fitted within the window may be incompatible with the local semivariogram model calculated in slightly
larger or smaller window. This incompatibility may lead to jumps in the estimated semivariogram
parameters in the neighboring windows. However, moving‐window kriging results are very informative if
used with a full understanding of the model limitations.
Two main reasons for dependency between data values and data locations are data interaction and
preferential sampling. An example of an interaction between data locations can be found in chapter 13. An
example of preferentially sampled data is environmental monitoring when most samples are concentrated in
areas where greater air pollution is expected. In this case, there is no reason to believe that the
semivariogram at small distances, estimated from data collected in the area with the largest data values, will
work adequately in other areas. However, it may be a good idea to use clustered data when clustered
locations are independent of the measurement values, for example, when clusters are distributed
systematically in the area being studied. Such a data configuration (sampling plan) helps to estimate
accurately the most important part of the semivariogram model at short distances in areas with both small
and large values.
The blue points in figure 9.20 at left are generated from an inhomogeneous Poisson point process with intensity defined by the interpolated temperature surface,

$$\lambda(s) = \exp\big(\alpha_1 + \beta_1 \hat{T}(s)\big),$$

where T̂(s) is the temperature surface estimated using actual measurements and α1 and β1 are some constants. In this case, locations and the value of interest (temperature) are related, and conventional semivariogram modeling cannot reconstruct the true spatial dependence between the temperature observations if they are made at the blue point locations.
average temperature using the semivariogram model that was used for the temperature interpolation in
figure 9.20 (unconditional simulation is discussed in chapter 10). Both maps have the same spatial structure
because they were created using the same geostatistical model, but surface in figure 9.20 at right does not
honor the actual temperature values. Green points are generated from the inhomogeneous Poisson process
with intensity defined by the simulated surface as

$$\lambda(s) = \exp\big(\alpha_1 + \beta_1 \tilde{T}(s)\big),$$

where T̃(s) is the simulated temperature surface. In this case, locations and the actual value of the temperature are not related, and conventional semivariogram modeling can be used to reconstruct the true spatial dependence between the temperature data if they are made at the green point locations.
Figure 9.20
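A sketch of generating such preferentially located points by thinning an inhomogeneous Poisson process; the exponential link between the surface and the intensity, and the surface itself, are assumptions of this illustration.

```python
import numpy as np

def inhomogeneous_poisson(surface_fn, alpha, beta, domain=(0.0, 1.0), seed=0):
    """Simulate an inhomogeneous Poisson process with intensity
    lambda(s) = exp(alpha + beta * surface(s)) by thinning a homogeneous process."""
    rng = np.random.default_rng(seed)
    lo, hi = domain
    grid = rng.uniform(lo, hi, size=(5000, 2))           # crude bound on the intensity
    lam_max = np.exp(alpha + beta * surface_fn(grid)).max()
    n = rng.poisson(lam_max * (hi - lo) ** 2)            # candidates from the dominating process
    xy = rng.uniform(lo, hi, size=(n, 2))
    lam = np.exp(alpha + beta * surface_fn(xy))
    return xy[rng.uniform(0, 1, n) < lam / lam_max]      # keep with probability lam / lam_max

surface = lambda xy: np.sin(3 * xy[:, 0]) + np.cos(2 * xy[:, 1])   # hypothetical smooth surface
points = inhomogeneous_poisson(surface, alpha=4.0, beta=2.0)
print(len(points), "points, concentrated where the surface is high")
```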
The data in figure 9.21 show an example of the clustered data with the centers of the clusters mostly in areas
with large data values.
Figure 9.21
It is difficult to recognize the spatial structure of these data without data preprocessing because a small
number of large data values act as data outliers (low‐density points at the top part of the
Semivariogram/Covariance Cloud dialog box) which distort estimation of the semivariogram model
parameters.
Data preprocessing improves estimation of the kriging model parameters significantly. Table 9.2 shows cross‐
validation statistics for two models with all default parameters estimated by Geostatistical Analyst, lognormal
ordinary kriging and simple kriging with normal score transformation and data declustering options. The
performance of the latter model is better because average mean, root‐mean‐square, standard error, and mean
standardized prediction errors are smaller, and root‐mean‐square standardized prediction error is closer to 1
(see discussion of validation and cross‐validation statistics in chapter 6).
Average prediction error | Lognormal ordinary kriging | Simple kriging with normal score transformation and data declustering
Mean | 0.29170 | 0.087740
Root-mean-square | 5.85200 | 5.632000
Standard error | 13.22000 | 3.519000
Mean standardized | 0.01842 | 0.004868
Root-mean-square standardized | 0.57050 | 1.229
Table 9.2
Note that in regression analysis model selection is often based on information criteria such as AIC (Akaike’s
information criteria) (see chapter 12 and appendix 2). The AIC is used as an overall measure of the quality of
fit. However, the information criterion does not tell us whether the model with the best AIC value is also the
model that predicts best. Therefore, from the prediction point of view, cross‐validation and validation are
more meaningful diagnostics.
The left map in figure 9.22 shows the predictions produced by simple kriging with normal score
transformation and data declustering options. Sampling from the point process with intensity

$$\lambda(s) = \exp\big(\alpha_2 + \beta_2 \hat{Z}(s)\big),$$

where Ẑ(s) is the prediction surface and α2 and β2 are some constants (this process is called a Cox point process in point pattern analysis; see chapter 13), gives the black points displayed over the map in the right part of figure 9.22. These points are concentrated in the areas with large predicted values, mimicking preferential sampling.
A geostatistical model for preferential sampling can be found in reference 9 in “Further reading.” The idea is
to combine sampling theory with geostatistics by including the coefficient β2 to the kriging model as a
parameter.
Figure 9.22
Suppose we want to create an accurate map of monthly precipitation in Catalonia, Spain. Figure 9.23 at left
shows total precipitation in May 2004 measured at 40 monitoring stations. Such a small number of
measurements does not allow the use of validation diagnostics in choosing an optimal kriging model. A work‐
around is to simulate similar data and use them to choose a model. Figure 9.23 at right shows estimated trend
surface on the grid clipped by the Catalonia border. This grid was created by local polynomial interpolation
with a large searching neighborhood.
From Generalitat de Catalunya. Departament de Medi Ambient i Habitatge. Direcció General de Qualitat Ambiental.
Figure 9.23
From Generalitat de Catalunya. Departament de Medi Ambient i Habitatge. Direcció General de Qualitat Ambiental.
Figure 9.24
Figure 9.25 at left shows simulated data at the monitoring station locations. We will use these data for modeling and the data in figure 9.24 at right for comparison of the predictions with the true values of the simulated precipitation.
Of course, in this validation exercise, we should forget the simulated large‐ and small‐scale variation
components of the true data and just fit the best model with the data provided. We begin by checking
assumptions behind kriging models. The assumption of the independence of data values and data locations
can be verified using the marked point pattern analysis theory (see chapter 13). In practice, this test can be
skipped if data locations are distributed randomly. One tool for testing how points are distributed in space is
the K function, shown in figure 9.25 at right (it is discussed in chapters 2 and 13). The red points show the K function calculated for the precipitation measurement locations, plotted against the circle radius on the x-axis. Randomly distributed points for the Catalonia territory were simulated 100 times, and an envelope formed by the upper and lower bounds of the 100 calculated K function values is displayed by the blue lines. Because the K function calculated using the locations of the precipitation measurements is inside the envelope, the distribution of monitoring stations in Catalonia can be considered random.
From Generalitat de Catalunya. Departament de Medi Ambient i Habitatge. Direcció General de Qualitat Ambiental.
Figure 9.25
From Generalitat de Catalunya. Departament de Medi Ambient i Habitatge. Direcció General de Qualitat Ambiental.
Figure 9.26
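The K function test described above can be sketched in a few lines. The version below omits edge correction and simulates the envelope inside a bounding box rather than the Catalonia border, so it is only an approximation of the tool in the software:

import numpy as np

rng = np.random.default_rng(0)

def k_function(points, area, radii):
    # Ripley's K without edge correction: scaled count of point pairs within distance r.
    n = len(points)
    d = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
    np.fill_diagonal(d, np.inf)                            # exclude self-pairs
    return np.array([(d <= r).sum() for r in radii]) * area / (n * (n - 1))

def csr_envelope(n, area, bbox, radii, n_sim=100):
    # Upper and lower bounds of K under complete spatial randomness (100 simulations).
    (xmin, xmax), (ymin, ymax) = bbox
    sims = []
    for _ in range(n_sim):
        pts = np.column_stack([rng.uniform(xmin, xmax, n), rng.uniform(ymin, ymax, n)])
        sims.append(k_function(pts, area, radii))
    sims = np.array(sims)
    return sims.min(axis=0), sims.max(axis=0)

If the K function of the observed station locations stays between the two bounds for all radii, the hypothesis of randomly distributed locations is not rejected.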
Figure 9.27 shows validation of the predictions using simple kriging (red triangles) with the mean surface estimated by zero-order local polynomial interpolation with the slider position at Global 65 percent (top center of the dialog box) and ordinary kriging without additional options (blue circles). The spread of the blue points (ordinary kriging predictions) around the 1:1 line in pink is similar to the spread of the red triangles for small and medium precipitation values but slightly larger for large values, so one may conclude that the validation statistics for simple kriging are better.
From Generalitat de Catalunya. Departament de Medi Ambient i Habitatge. Direcció General de Qualitat Ambiental.
Figure 9.27
Statistics/Model                                    Ordinary kriging    Simple kriging with estimated mean surface
Average root-mean-square prediction error                 7.85                            7.62
Average standard error                                    3.38                            3.11
Average root-mean-square standardized error               0.67                            0.77
From Generalitat de Catalunya. Departament de Medi Ambient i Habitatge. Direcció General de Qualitat Ambiental.
Table 9.3
There are other considerations for choosing a model for predictions (see reference 7 in “Further reading”). In
connection with model selection in the natural sciences, Albert Einstein is credited with the saying that things
should be made as simple as possible but not simpler, since it is difficult to find optimal parameters of a
complex model and understand how that complex model is really working. From a Bayesian statistical point
of view, the simpler the model, the larger its prior probability.
The simplest geostatistical model is ordinary kriging because it is based on fewer assumptions than simple,
universal, and disjunctive kriging. Ordinary kriging can be safely used when the prediction map is all that is
needed. On the other hand, ordinary kriging can be too simple if an accurate map of prediction error is
required. In that case, simple or disjunctive kriging with data preprocessing are generally preferable.
In summary, there is no simple method of kriging model selection. Instead, there is a model selection process
that includes exploratory spatial data analysis, simulation studies, model diagnostics, and models
comparison.
Often the major difference between kriging models is the values of the prediction standard errors, not the
predictions themselves. Exercises using simulated data with known statistical features help in understanding
the differences between the semivariogram and kriging models. The next time a similar situation arises,
information gained from the exercise can be used for optimal model selection.
However, if the variable under study is weakly correlated, strongly nonstationary, or clustered, the choice of
conventional kriging model is practically irrelevant because all these models will produce poor predictions.
The next section illustrates kriging with varying model parameters. First, a data exploration approach,
sensitivity analysis, is used. Next, an example of Bayesian kriging with model parameters treated as random
variables is shown.
KRIGING WITH VARYING MODEL PARAMETERS: SENSITIVITY ANALYSIS AND
BAYESIAN PREDICTIONS
Figure 9.28 depicts locations of cesium‐137 soil measurements (blue points) collected in 1992 in the eastern
part of Belarus, over a prediction map created using ordinary kriging. The histogram to the left shows that the
distribution of the data is not symmetrical. Data transformation would make the data closer to a normal distribution and the predictions closer to optimal, but in this section, we are interested in the influence of varying
parameters on kriging predictions rather than in the most accurate predictions, so the data will not be
preprocessed.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 9.28
A spherical semivariogram model used for the soil contamination map creation in figure 9.28 is shown in
figure 9.29 at left. According to the model diagnostics, the semivariogram parameters are nearly optimal.
Another spherical model with a smaller parameter range is shown in figure 9.29 at right. It does not fit
the empirical semivariogram values. This second model will be used to investigate how sensitivity analysis reacts to semivariogram model misspecification.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 9.29
In the sensitivity analysis, the semivariogram parameters were varied within the intervals
[0.85·range, 1.15·range],
[0.85·nugget, 1.15·nugget], and
[0.85·(partial sill), 1.15·(partial sill)],
and the calculations were made 333 times for each of the range, nugget, and partial sill, with values selected at random inside the intervals shown above.
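This sensitivity analysis can be imitated with any kriging implementation by drawing the three parameters from their ±15-percent intervals and recording a prediction quality measure for each draw. In the sketch below, krige_quality is a placeholder for a user-supplied function (for example, one that returns cross-validation errors or predictions); it is not an existing Geostatistical Analyst call:

import numpy as np

rng = np.random.default_rng(0)

def sensitivity(base_params, krige_quality, n_draws=333):
    # base_params: dict with 'range', 'nugget', and 'partial_sill' of the fitted model.
    results = []
    for _ in range(n_draws):
        perturbed = {name: rng.uniform(0.85 * value, 1.15 * value)
                     for name, value in base_params.items()}
        results.append((perturbed, krige_quality(perturbed)))
    return results

With a nearly optimal model, the recorded quality measures vary little; with a misspecified model, small parameter changes produce widely different results, as in figure 9.30.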
Figure 9.30 shows two histograms of the 999 estimated range values. The histogram at the left corresponds
to the nearly optimal spherical semivariogram model shown in figure 9.29 at left, and the histogram at the
right was created using parameters of the spherical semivariogram model with reduced range parameter
from figure 9.29 at right.
Calculated and estimated range parameters in the case of a nearly optimal spherical semivariogram model are consistent and form a symmetrical distribution, indicating that the kriging predictions are similar if some or all semivariogram parameters change within the 15-percent intervals. The distribution of the range parameter is very different in the case of the model with the reduced range, where all values are grouped into three clusters, showing that small changes in the parameters of that non-optimal semivariogram model lead to substantially different semivariogram models and, therefore, to very different kriging predictions.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 9.30
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 9.31
A modeling approach using varying model parameters (Bayesian kriging) treats the model parameters as
random variables, meaning that uncertainty in the parameters is allowed. Bayesian kriging requires a large
number of simulations from multivariate distributions using Markov chain Monte Carlo techniques, and it
often produces more realistic predictions and prediction errors.
Presented below are the results of the Bayesian kriging predictions for cesium-137 soil contamination data (those shown in figure 9.28) using the R package geoR (see "Further reading," reference 6).
The red lines in figure 9.32 show the distributions of the simulated parameters: the mean (left) and the range (right). Blue vertical lines show the mean values of the distributions.
The simulated kriging mean values are distributed nearly symmetrically around their mean value, justifying the use of ordinary kriging with these dense soil contamination data.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 9.32
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 9.33
Figure 9.34 shows the distributions of the nugget and partial sill parameters. In contrast to the mean
distribution, they are not symmetrical and are more concentrated at small values. The probability of having
values larger than the nugget and the partial sill in the semivariogram models shown in figure 9.34 is
relatively high, indicating possible nonstationarity of the data.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 9.34
Allowing for the semivariogram model uncertainty through simulations with different model parameters is
important because the semivariogram estimation is optimal only in the ideal case when 1) pairs [Z(s), Z(s+h)]
follow a bivariate Gaussian distribution, 2) the observations are sampled randomly, and 3) there are no data outliers. The first assumption is violated, for example, when Z(s) takes only positive values or the distribution of Z(s) is skewed and heavy-tailed. The second assumption is often violated because clustered sampling naturally follows an inhomogeneous population. The third assumption is frequently violated because of measurement errors and because of the very high variability of the sampling process (when nature "draws" a value from a distribution with large variance; see also chapter 3).
Although predictions at these three locations by Bayesian kriging are slightly better than the predictions
made by standard ordinary kriging, the difference is not large, confirming that ordinary kriging is a quite
flexible and reliable predictor.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 9.35
The difference between Bayesian and ordinary kriging predictive distributions is larger when the dataset is
smaller, because in this case, the effect of the model parameters’ uncertainty is more substantial.
An advantage of Bayesian kriging is that it is more informative because it takes into account the uncertainty
associated with the estimation of the mean value and the semivariogram parameters. Another advantage of
Bayesian kriging is that it can be easily used with non‐Gaussian data (see appendix 3). Finally, it should be
noted that the name “Bayesian kriging” is misleading because the methodology produces a large number of
simulations at sampled and unsampled locations. Then the mean and standard deviation are calculated from
the distribution of the simulated values. Therefore, a better name would be “geostatistical simulation with
varying model parameters,” but it is too long.
COPULA-BASED GEOSTATISTICAL MODELS
The semivariogram model does not provide a complete description of the spatial data dependence structure
because it is an integral over the entire distribution of the empirical semivariogram values and, therefore, it
characterizes spatial dependence of large, medium, and small values equally, although in practice this
dependence is varying (see an example in figure 3.6). Another problem is that the semivariogram is strongly
influenced by the univariate distribution of the random field and by unusually large data values. Also, the
semivariogram fitting methods usually assume data normality and this assumption is rarely fulfilled.
Disjunctive kriging assumes a particular (usually Gaussian) bivariate data distribution and uses fitted indicator semivariograms for different data quantiles. However, in the case of the Gaussian distribution, data
correlation always decreases at high quantiles, although there are many examples when this bivariate normal
distribution feature contradicts the observed non‐uniform data dependency as in figure 3.6.
A copula is a function that links univariate marginal distributions to the full multivariate distribution. An
essential feature of copulas is that they incorporate the data dependence structure in a distribution‐free way.
A copula requires that the variables be transformed to uniform marginal distributions (that is, the variables should be rescaled into percentiles). This isolates the dependence between the variables from the influence of their marginal distributions.
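The percentile (rank) rescaling mentioned above takes a single line per variable; mapping the uniform ranks through the standard normal quantile function gives the normal scores implied by a Gaussian copula. A minimal sketch:

import numpy as np
from scipy.stats import rankdata, norm

def to_uniform(z):
    # Empirical ranks rescaled to (0, 1): only the dependence structure remains.
    return rankdata(z) / (len(z) + 1)

def to_normal_scores(z):
    # Uniform ranks mapped through the standard normal quantile function.
    return norm.ppf(to_uniform(z))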
It is shown in the literature that bivariate copulas and indicator semivariograms and cross-covariances are related. Moreover, the bivariate copula is a generalization of the semivariogram. It is more general because the spatial copula can describe the dependence of different quantiles differently, in contrast to the semivariogram, which models the mean dependence only. It can be shown that use of the bivariate Gaussian copula leads to the same result as disjunctive kriging with a successful normal score transformation. The use of non-Gaussian copulas allows for more flexible and realistic modeling of spatial data dependence. In particular, bivariate copulas with strong dependence in the lower or upper parts of the data distribution can be constructed.
Copula‐based kriging and sequential simulation (see chapter 10) are not very computationally involved and,
therefore, they are candidates for implementation in geostatistical software in the near future. However,
additional research is required. For instance, currently used maximum-likelihood methods for model parameter estimation can be used with several hundred observations in the case of a multivariate Gaussian
copula and with just a dozen observations in the case of non‐symmetric multivariate copulas; the latter is
perhaps insufficient for describing data dependency.
ASSIGNMENTS
1) REPRODUCE PREDICTION MAPS SHOWN IN THE “ACCURATE TEMPERATURE
MAPS CREATION FOR PREDICTING ROAD CONDITIONS” DEMO.
Road maintenance during winter is important because ice and snow may lead to road closures and hence
economic losses and because, according to the literature, the risk of accidents may increase as much as five
times in the presence of ice and snow. The cost of not taking anti‐icing measures when ice does form on the
road is greater than that of anti‐icing when no ice forms. Therefore, maps of the predicted probability that ice
will form are required.
A demonstration on accurate temperature map creation for predicting road conditions is available at
http://www.esri.com/software/arcgis/extensions/geostatistical/about/demos.html.
Data are in the folder assignment 9.1. Watch this demonstration and repeat the case study using temperature
measurements shown in figures 9.36 and 9.37.
Courtesy of NOAA.
Figure 9.37
Note: Ice forms on the road if the air temperature is below freezing and precipitation occurs. Therefore, a
model for the joint predictive probability distribution of temperature and precipitation is needed. However, a
map of the probability that the air temperature is below freezing is a step in the right direction.
Use data and follow instructions in the assignments in chapter 7.
3) FIND THE OPTIMAL NUMBER OF NEIGHBORS FOR PREDICTION USING SIMPLE
AND ORDINARY KRIGING MODELS BY COMPARING THEIR ROOT‐MEAN‐SQUARED
PREDICTION ERRORS.
Use data in the assignments in chapter 7 and create graphs of the root‐mean‐squared prediction errors
versus the number of neighbors in the circular searching neighborhood.
4) PARTICIPATE IN THE SPATIAL INTERPOLATION COMPARISON 97 EXERCISE.
Figure 9.38 shows rainfall measurements on May 8, 1986, a dozen days after the Chernobyl accident, from 467 locations in Switzerland. Rainfall on that day might have been radioactive.
Figure 9.38
One hundred randomly selected observations were provided to formulate and fit models and to predict the remaining 367 measurements. About 20 analyses were published in the Journal of Geographic Information and Decision Analysis, Volume 2, No. 1–2, 1998, Special Issue: Spatial Interpolation Comparison 97, available at http://www.geodec.org/gida_4.htm (see also http://www.ai-geostats.org/index.php?id=45). Other researchers then used the Swiss rainfall data to illustrate their models' performance.
The daily rainfall measurements can be downloaded from http://www.ai-geostats.org/index.php?id=data.
Follow the convention by using the 100 provided data values for modeling, and find the best model for prediction at the 367 validation locations.
Use models provided with Geostatistical Analyst.
Read “Spatial Interpolation Comparison 97” papers, and choose the best and the worst among the proposed
models. Justify your choice.
Note: These data are not typical for daily rainfall because almost all land surface under study was wet. More
often, at least half of the rainfall data values are equal to zero (see, for example, a map with Swedish rainfall
observations three days after the Chernobyl accident at
5) PREDICT THE SILT THICKNESS IN THE LAKE.
The author of “Deriving Volumes with ArcGIS Spatial Analyst” (available at
http://www.esri.com/news/arcuser/1002/files/volumes.pdf) mentioned that the silt
thickness in the lake (data in the left part of figure 9.39) was carefully constructed using ArcGIS Geostatistical
Analyst. One problem with interpolation inside the lake has to do with predictions near the coastline. The silt thickness is equal to zero on the land; the addition of zeroes along the coastline (gray circles in the right part
of figure 9.39) may improve the prediction and prediction standard error maps.
Data courtesy of Mike Price GISP Entrada/San Juan Inc.
Figure 9.39
Use data in the folder assignment 9.5 and compare the result of interpolation with and without adding zero
values.
6) CHOOSE AN OPTIMAL MODEL FOR PREDICTING THE DEPTH OF LAKE KOZJAK.
Figure 9.40 at left shows bathymetric measurements in the lake Kozjak, Plitvice Lakes National Park, Croatia,
and isolines’ vertices in the close vicinity to the lake (mostly red points). The bathymetric data were collected
by researchers from the Faculty of Geodesy, University of Zagreb in May 2004. The combination of a high‐
frequency echosounder and precise satellite positioning (Real Time Kinematics GPS) was used for
bathymetry. The isolines were vectorized from a digitized topographic map of the area. A detailed description of the data can be found in
Medak, D., B. Pribicevic, and K. Krivoruchko. 2008. Geostatistical Analysis of Bathymetric Measurements: Lake Kozjak Case Study. Geodetski list, 3, 1–18.
Note that the measurements are not absolutely precise.
Figure 9.40 at right shows a subset of the data (14,494 points) and their summary statistics. This data subset
is available in the folder assignment 9.6. The task of this assignment is to choose an optimal geostatistical
model for predicting and mapping the lake depth as well as the associated prediction standard errors.
Figure 9.40
Data courtesy of Damir Medak, Faculty of Geodesy, University of Zagreb.
FURTHER READING
1) Schabenberger, O., and C. A. Gotway. 2004. Statistical Methods for Spatial Data Analysis. New York:
Chapman & Hall/CRC, 488.
Without knowing the statistical theory behind the models and tools used in applications, data analysis becomes a collection of tricks that are difficult to remember and interrelate. The authors of this book discuss
geostatistical theory in chapters 4 and 5.
2) Rivoirard, J. 1994. Introduction to Disjunctive Kriging and NonLinear Geostatistics. Oxford: Clarendon Press,
181.
This book explains the ideas behind disjunctive kriging using a relatively small number of formulas. The
implementation of disjunctive kriging in Geostatistical Analyst is largely based on this book.
3) Cressie, N., and G. Johannesson. 2006. Fixed rank kriging for large spatial datasets. Technical Report
Number 780. Department of Statistics, The Ohio State University, https://www.stat.ohio-state.edu/~sses/papers.html.
The authors developed a kriging model for large nonstationary spatial datasets.
4) Krivoruchko, K., and A. Gribov. 2004. "Geostatistical Interpolation in the Presence of Barriers," geoENV
IV—Geostatistics for Environmental Applications: Proceedings of the Fourth European Conference on
Geostatistics for Environmental Applications (Quantitative Geology and Geostatistics), 331–342. Available at
http://www.esri.com/software/arcgis/extensions/geostatistical/about/literature.html.
This paper discusses a kernel convolution model for prediction and simulation using a non‐Euclidean
distance metric.
5) Armstrong, M., and G. Matheron. 1986. Disjunctive kriging revisited, Part I. Mathematical Geology 18 (8), 711–728.
Armstrong, M., and G. Matheron. 1986. Disjunctive kriging revisited, Part II. Mathematical Geology 18 (8), 729–742.
The authors of these papers presented a generalization of disjunctive kriging for modeling non‐Gaussian data,
including data that follow gamma, Poisson, binomial, and negative binomial distributions. Armstrong and
Matheron found a decomposition of several non‐Gaussian bivariate distributions into orthogonal factors.
Then they show that predictions can be obtained by kriging each of these factors separately, so that the model works similarly to Gaussian disjunctive kriging with Hermite polynomials, as discussed in this chapter. See also
assignment 2 of chapter 15.
6) Ribeiro, P. J. Jr., and P. J. Diggle. 2001. geoR: A package for geostatistical analysis. R News, Volume 1, Number 2, pp. 15–18. Available at http://cran.R-project.org/doc/Rnews.
The authors developed a Bayesian kriging software application that assumes that mean value and the
semivariogram parameters are random variables. Several conventional kriging models are also implemented
in the geoR package.
7) Huang, H.‐C. and C.‐S. Chen. 2007. Optimal geostatistical model selection. Journal of the American Statistical
Association, 102, 1009‐1024.
8) Emery, X. (2005) Variograms of order ω: A tool to validate a bivariate distribution model. Mathematical
Geology, 37 (2), pp. 163‐181.
This paper discusses several tests for data consistency with a bivariate Gaussian distribution.
9) Diggle P. J., R. Menezes, and T. Su. 2008. Geostatistical Inference under Preferential Sampling. Johns
Hopkins University, Working Paper #162.
The authors of this paper developed a geostatistical model for preferentially sampled environmental data. It
is shown that ignoring preferential sampling can lead to misleading inferences.
The next two papers present theory and applications of copula-based geostatistical models. The second paper suggests using a Bayesian variant of copula-based kriging.
10) Bardossy, A. 2006. Copula‐based geostatistical models for groundwater quality parameters, Water
Resources Research, 42, W11416
11) Kazianka H. and J. Pilz. 2008. Spatial interpolation using copula‐based geostatistical models. Proceedings
of the GeoENV conference in Southampton, UK.
OPTIMAL NETWORK DESIGN AND
PRINCIPLES OF GEOSTATISTICAL
SIMULATION
SPATIAL SAMPLING AND OPTIMAL NETWORK DESIGN
MONITORING DESIGN IN THE PRECOMPUTER AND EARLY COMPUTER ERA
IDEAS ON A NETWORK DESIGN FORMULATED AFTER 1963
SEQUENTIAL VERSUS SIMULTANEOUS NETWORK DESIGN
GEOSTATISTICAL SIMULATION
UNCONDITIONAL SIMULATION AND CONDITIONING BY KRIGING
SEQUENTIAL GAUSSIAN SIMULATIONS
SIMULATING FROM KERNEL CONVOLUTIONS
SIMULATED ANNEALING
APPLICATIONS OF UNCONDITIONAL SIMULATIONS
APPLICATIONS OF CONDITIONAL SIMULATIONS
ASSIGNMENTS
1) FIND OPTIMAL PLACES FOR THE ADDITION OF NEW STATIONS TO MONITOR AIR
QUALITY IN CALIFORNIA
2) SIMULATE A SET OF CANDIDATE SAMPLING LOCATIONS FROM INHOMOGENEOUS
POISSON PROCESS
3) REDUCE THE NUMBER OF MONITORING STATIONS IN THE NETWORK USING
VALIDATION DIAGNOSTICS
4) DISCUSS TWO SIMULATION ALGORITHMS PROPOSED BY GIS USERS THAT ARE
BASED ON ESTIMATED LOCAL MEAN AND STANDARD ERROR.
5) CONDITIONAL SIMULATION WITH GEOSTATISTICAL ANALYST 9.3
FURTHER READING
This chapter discusses two applications of modeling continuous spatial data: optimal network design and simulation of the data with known statistical features that are consistent with available observations.
A set of locations where measurements are taken is called a monitoring network design or simply a monitoring network. A monitoring network may have locations that are inefficient for recognizing pollutants and for maintaining the ability to understand long-term historical air quality and weather trends, because the variables of interest are changing and our understanding of environmental issues is improving. Therefore, environmental protection agencies need to optimize monitoring networks.
A monitoring network design is called prospective when a network is to be created before collection of data
begins. When stations are being removed from or added to an existing network, the design is called
retrospective.
The number and location of monitoring stations are affected by economic considerations, so designing the
optimal monitoring network is always a task of balancing the accuracy of predictions with minimizing the
costs. Several historical and more recent statistical methods for optimizing monitoring networks are
discussed in this chapter.
The main goal of geostatistical conditional simulation is to mimic the spatial small‐scale data variability of the
variable under study more realistically than can be done using kriging. The primary use of conditional
simulation is in uncertainty analysis: the evaluation of the impact of spatial uncertainty on the results of
complex modeling such as flooding and the consequences of environmental or industrial disasters.
There are many geostatistical simulation algorithms, and all of them have advantages and disadvantages.
Therefore, a good understanding of the ideas behind each and the use of various pre‐ and post‐processing
diagnostics is imperative for production of simulated surfaces that can be used in further modeling and
decision making.
The geostatistical simulation part of this chapter begins with a discussion of the geostatistical simulation
ideas. Then the principles of the following geostatistical conditional simulation methods are presented:
Unconditional simulations and conditioning by kriging
Sequential simulations
Simulating from kernel convolutions (this method is implemented in Geostatistical Analyst 9.3)
Simulated annealing
The chapter ends with examples of unconditional and conditional simulations.
SPATIAL SAMPLING AND OPTIMAL NETWORK DESIGN
Randomly selected locations for monitoring stations can be inefficient because such a design does not take the spatial data dependency into consideration. A systematic design with regularly distributed monitoring locations is usually adequate for well-behaved stationary random fields, but it is inflexible when the data features change in space—for example, when the data variance is inhomogeneous, when data features vary with direction, or when spatial objects are separated by natural barriers.
From the kriging prediction point of view, it is preferable to use irregularly distributed stations so that a
sufficient number of short, medium, and long distances between monitoring stations are available. At the
same time, areas without any measurements should be as few and as small as possible. The resulting optimal design typically combines a nearly regular network supplemented by groups of closely spaced locations. If a new location is added too close to an existing location, knowledge about measurement error and microscale variation increases, but the associated solution of the kriging system of linear equations becomes unstable.
Also, locations of relatively dense points should not depend on the measurement values. Therefore, clusters
of locations should not be created only in areas where the variable of interest has very large values.
Figure 10.1 shows two design examples with small and medium numbers of sampling locations for monitoring pollution in a lake when a priori information about water contamination is not available. Locations for
sampling were generated by using an inhibition point process (a process that does not allow two points to lie
closer than a specified distance) discussed in chapter 13. These designs can be used as a starting point in the
optimization of the monitoring network.
Figure 10.1
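A simple way to generate such starting designs is a sequential inhibition process: candidate locations are drawn uniformly and rejected if they fall too close to an already accepted location. The sketch below uses a rectangular study area; a real lake would require an additional point-in-polygon test:

import numpy as np

rng = np.random.default_rng(0)

def sequential_inhibition(n, bbox, min_dist, max_tries=100000):
    (xmin, xmax), (ymin, ymax) = bbox
    points = []
    tries = 0
    while len(points) < n and tries < max_tries:
        tries += 1
        candidate = np.array([rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)])
        # accept the candidate only if it is at least min_dist from all accepted points
        if all(np.linalg.norm(candidate - p) >= min_dist for p in points):
            points.append(candidate)
    return np.array(points)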
MONITORING DESIGN IN THE PRECOMPUTER AND EARLY COMPUTER ERA
The first attempts to create an optimal monitoring network using geostatistical theory were made in the
Soviet Union in 1946 and 1947 by meteorologists Drozdov and Shepelevskij. They demonstrated that
interpolation error to the midpoints between pairs of monitoring locations is a function of the semivariogram
model estimated from observations. Drozdov and Shepelevskij estimated the maximum permissible distance
between stations using the estimated interpolation error in the middle of the line between pairs of
monitoring stations. These maximum permissible distances were used for planning a network to monitor
daily temperature, snow cover, monthly total precipitation, number of days with thunderstorms, and soil
temperature. In 1961, Gandin showed that the kriging prediction standard error can be used to choose the
most suitable locations for monitoring stations. From 1960 to 1965, Gandin and Kagan developed a method
for estimating the accuracy of values averaged over an area from the observational data at point locations
(block kriging). This allowed improvement of the monitoring network design for such area‐averaged
meteorological variables as snow cover and precipitation.
The Drozdov and Shepelevskij method is based on the formula for the interpolation error between two points, which was discussed at the beginning of chapter 7. The only difference is that their method adjusts the formula for data with measurement errors, the last term in the expression below:
σ²(h1, h2) = 2w1γ(h1) + 2w2γ(h2) − 2w1w2γ(l) + (w1² + w2²)σδ²,
where w1 and w2 (w1 + w2 = 1) are the interpolation weights, h1 and h2 are the distances from the prediction location to the two stations separated by the distance l = h1 + h2, γ(·) is the semivariogram, and σδ² is the measurement error variance. The error of the interpolation to the center of the line, l/2, is equal to
σ²(l/2) = 2γ(l/2) − γ(l)/2 + σδ²/2.
If the semivariogram model is known, the maximum permissible distance between stations is calculated using
a predefined maximum permissible standard error of interpolation. For example, if the semivariogram model
is assumed to be spherical,
solving a quadratic equation is all that is needed to find a maximum permissible distance between stations.
In most early work on optimal network design, the requirement was that the standard error of interpolation of point values (the right-hand side of the formulas above without the last term) should not exceed the standard error of the observation σδ, and the inequality σ(l/2) ≤ σδ was usually used. The standard error σδ was determined from the estimated nugget parameter of the semivariogram model after correction for microscale variation. This approach works if the semivariogram or covariance function is the same everywhere in the region under study, that is, when data are stationary. If this assumption could not be made, Soviet scientists instead assumed that the standardized covariance is stationary,
ρ(s1, s2) = C(s1, s2)/(σ(s1)σ(s2)),
where C(s1, s2) is the covariance between points s1 and s2 and σ(si) is the data standard deviation at location si. In this case, the maximum permissible distance varies within the region with changes in the data variance.
The average interpolation error along the line is obtained by integrating the two-point interpolation error over the line:
σ̄²(l) = (1/l) ∫ σ²(h, l − h) dh, 0 ≤ h ≤ l,
where σ²(h1, h2) is defined above for the interpolation error between two points. An approach based on this estimated average error on the line was proposed by Laikhtman and Kagan in 1960. It was successfully used for planning meteorological networks for estimation of snow cover and precipitation.
The use of average error is preferable for planning meteorological networks because there are situations
when the maximum value of the prediction error does not occur at the midpoint. Also, network planners are
usually interested in average values of the prediction error across areas rather than the prediction error in
exact locations.
After the development of statistical interpolation in two dimensions in 1959 and with the improvement of
computers in the 1960s, interpolation along a line was replaced by interpolation on a plane. As shown at the
beginning of chapter 7, interpolation using more than two points increases the accuracy of the prediction, and the use of the kriging standard error results in a reduction of the required network density. Meteorological data
are usually measured with error, so a filtered variant of kriging is preferable.
Statistical spatial interpolation can also be used to determine the best location for the addition of a new
monitoring station: for example, on the standard kriging error surface produced using measurements at the
monitoring network and an optimal kriging or cokriging model, the maximum value can be found. Because
suitable locations for new monitoring stations cannot occur everywhere, a practical solution is to choose a
location with the highest prediction standard error within a set of predefined locations or, if these locations
are not defined, to indicate the areas where prediction errors are unacceptably large and search for suitable
locations within these areas. Data variability decreases with an increase in the area over which a spatial
average is computed. Therefore, obtaining a specified accuracy in the values averaged over an area requires a
less dense network than in the case of the prediction of a point with the same accuracy.
Using kriging, Soviet meteorologists defined three groups of meteorological monitoring networks with
different permissible densities of monitoring stations. The least dense network with permissible distances
between 150 and 200 kilometers was used for collecting measurements of air pressure, mean monthly values
of soil temperature, and monthly duration of sunshine. A network with intermediate permissible distances
between 50 and 60 kilometers was used for calculating the mean daily values of air temperature and
humidity, wind speed, and the amount of cloud cover. It was found that the densest network with permissible
distances less than 30 kilometers was required to track monthly totals of the snow cover and precipitation as
well as patterns of fog and thunderstorms.
The estimated permissible distances discussed above refer to flat areas. In mountainous areas, the network
density should increase. Even in relatively flat areas, the permissible distance between meteorological
stations can be different in different geographic regions. For example, early Soviet studies indicated that
prediction of the meteorological data variables with the same accuracy required more measurements in
Ukraine than in the Valdaj region.
If a critical value of the variable under study, Zcritical, is known, a network optimization algorithm can be based on the estimated probabilities that this critical value is exceeded. For example, locations where the estimated probability that the ozone concentration exceeds the upper permissible level of 0.08 ppm is large are candidates for the installation of a new station.
The probability of exceeding a critical value can be used together with the prediction standard error to select a new location. For example, the probability of exceeding a critical value can be used to weigh the prediction standard error. If this probability is equal to 0.5 (a very large uncertainty), the optimality criterion is equal to the prediction standard error, and the criterion value decreases as the uncertainty about exceeding the threshold value decreases. The locations with the largest weighted prediction standard error are candidates for a new monitoring station. Adding these locations to the monitoring network will improve prediction of data values near the threshold.
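Several weight functions satisfy the description above; one possible choice, written here purely as an assumption for illustration, multiplies the prediction standard error by 1 − 2|p(s) − 0.5|, which equals 1 when the exceedance probability p(s) is 0.5 and shrinks to 0 as the exceedance becomes certain either way:

import numpy as np

def weighted_std_error(sigma, p_exceed):
    # sigma: kriging prediction standard errors; p_exceed: probability of exceeding the threshold.
    weight = 1.0 - 2.0 * np.abs(p_exceed - 0.5)
    return sigma * weight                      # largest values mark candidate station locations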
Quantile predictions can also be used in searching for the best place to add a new observation. For example, the following function can be used for the optimization of a network:
O(s) = (Zp(s) − Z1−p(s)) / |Z0.5(s) − Zcritical|,
where Zq(s) is the estimated value of the variable under study at the location s, with the subscript q indicating the quantile of the estimated prediction distribution at location s, and p is an upper quantile (for example, 0.75). O(s) increases with increasing data variability (the numerator) and decreases with increasing difference between the estimated data median and a threshold (the denominator).
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 10.2
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 10.3
Instead of designing a network for making the most reliable predictions, other goals for the monitoring
design can be used, for example, the optimal estimation of semivariogram model parameters. In this case, the
suggested design can be based on minimization of the variance of the empirical semivariogram within each
bin. One motivation for this design is that semivariogram model parameters have a physical meaning: the distance of the data correlation (range parameter), the measurement error (part of the nugget parameter), the data variability (partial sill parameter), and the smoothness (an extra parameter in models such as K-Bessel).
It is useful to investigate how the uncertainty in semivariogram model estimation influences optimal
predictions. One possible way to check this is to simulate a set of reasonable semivariogram models, find the optimal design for each, and then use the resulting designs to evaluate whether predicted values at the unknown locations have acceptable accuracy.
The ideas above can be combined, and a goal of a monitoring network design can be a reasonable compromise between accurate estimation of the prediction model parameters and prediction accuracy.
Rather than optimizing a kriging model, optimal estimation of the unknown parameters of one of the spatial
autoregressive regression models (they are discussed in chapter 12) can be a goal in designing a monitoring
network. An example is a design for epidemiological studies, which should provide effective alerts for high
air‐pollution days.
Sequential network design creates a network by optimally adding new locations step by step. At each step,
the algorithm adds one new location that together with existing locations forms an optimal network with
respect to some criterion—for example, minimum average prediction standard error. This strategy can result
in a final network design that is not optimal. This is illustrated in figure 10.4, which compares the final result
of a sequential design (top line) with a simultaneous design in which all previously selected points can be
shifted (bottom line).
Figure 10.4
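The sequential strategy can be sketched as a greedy loop that repeatedly adds the candidate with the largest prediction standard error. Here std_error is a placeholder for a user-supplied kriging routine that returns the prediction standard error at a candidate location given the current network:

import numpy as np

def sequential_design(existing, candidates, std_error, n_new):
    network = list(existing)
    remaining = list(candidates)
    for _ in range(n_new):
        errors = [std_error(np.array(network), c) for c in remaining]
        best = int(np.argmax(errors))                  # candidate with the largest error
        network.append(remaining.pop(best))            # add it and re-evaluate the rest
    return np.array(network)

Because previously selected points are never moved, the final configuration may be worse than one produced by a simultaneous design, as figure 10.4 illustrates.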
The simultaneous design in the example above is better from a geometrical point of view, since the locations' areas of influence are more homogeneous, as shown in figure 10.5 for three-, four-, and six-point designs using Voronoi polygons, where the top row is the sequential design and the bottom row is the simultaneous design.
Figure 10.5
Figure 10.6 shows a southeastern part of the Belarus territory. The black triangles are locations of the observations of cesium-137 contamination in forest berries. A continuous map shows the prediction standard errors of forest berry contamination. Suppose we need to collect 100 additional measurements of the cesium-137 contamination in forest berries.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 10.6
Figure 10.7 at left shows the candidate locations in blue found using sequential design with optimization of
the kriging prediction standard error. Appendix 1 explains how these locations can be found using
Geostatistical Analyst.
Figure 10.7 at right shows 100 candidate locations (in blue) proposed by simultaneous spatially‐balanced
random survey design (references and additional examples of this design can be found in assignment 1 at the
end of this chapter). The locations are selected from a list of the populated areas with known and non‐zero
inclusion probability so that all locations within a study area have a chance of being selected. That is why the
design is called a “random survey.” In this example, the inclusion probability is proportional to the prediction
standard error. Both sequential and simultaneous designs prefer locations with a large prediction standard
error, but the latter also gives a chance for locations with relatively small prediction error to be selected. This
makes sense since measurement and locational errors of the berry observations are large, and it may be
desirable to collect additional information in the densely sampled areas where the prediction standard error
has medium values.
The concept of simultaneous spatially‐balanced random design is the following: The inclusion probability
layer defines the desired sample intensity function (the number of samples per unit measure of the
population). This inhomogeneous intensity surface is transformed into an equiprobable surface from which
the required number of random points can be selected using the well‐known algorithm discussed in chapter
13.
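A simplified version of this idea selects raster cells with probability proportional to the inclusion-probability surface. The sketch below reproduces the unequal inclusion probabilities but not the additional spatial-balance constraint of the full survey design:

import numpy as np

rng = np.random.default_rng(0)

def sample_by_inclusion_probability(prob_raster, n_samples):
    # prob_raster: 2D array, for example a normalized prediction standard error map.
    p = prob_raster / prob_raster.sum()
    idx = rng.choice(p.size, size=n_samples, replace=False, p=p.ravel())
    return np.unravel_index(idx, prob_raster.shape)    # row and column indices of selected cells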
Figure 10.8 shows 100 candidate locations selected by another simultaneous spatially‐balanced random
survey design from a list of a fine raster cells (instead of the predefined locations of populated places) with
assigned inclusion probabilities taken from the normalized prediction standard error map and with the
maximum prediction standard error equaling 1.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 10.8
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 10.9
The spatially‐balanced random survey design is very flexible because the inclusion probabilities can reflect
not only statistical data features, such as the prediction standard error, but also other information, including
geographical features such as population density, distance to the nearby road, and slope.
Another method for implementing a simultaneous design is based on the simulated annealing algorithm.
The simulated annealing algorithm is an iterative procedure that modifies a network design set Sk = {s1k, s2k, ..., sNk}, where k is the number of the algorithm's iteration, in the direction of reducing the error criterion function O(Sk), for example, the average prediction standard error.
An initial design can be obtained by adding or removing the required number of monitoring locations in the areas where the kriging prediction error is large or by just using randomly chosen locations in the study area.
The design Sk+1 is obtained by randomly modifying the location of one of the points of the design Sk. A design is accepted if it leads to a reduction in the error criterion function, in other words, if O(Sk+1) < O(Sk). A design that leads to a larger error can also be accepted, with a specified probability that decreases with the number of computer iterations. The probability of accepting a design Sk+1 over a previous design Sk can be defined as
P(Sk → Sk+1) = min{1, exp(−[O(Sk+1) − O(Sk)]/Tk)},
where Tk is a decreasing function of the iteration k that controls the likelihood of accepting a worse design, and O(Sk+1) − O(Sk) is a measure of the discrepancy between the current and new designs. For large Tk, a worse design is more likely to be accepted than for small Tk. Therefore, the first iterations are less stable. As Tk decreases, the chance of accepting worse designs decreases, an accepted design makes only small refinements in point locations, and iterations stop.
The function Tk must decrease slowly to guarantee the algorithm's convergence. The simplest possible schedule for changing Tk with iterations is
Tk = T0·α^k, 0 < α < 1.
However, this schedule may not be effective, since convergence to the optimal solution is not guaranteed and it is very difficult to find proper values of the starting constant T0 and α. A more robust, adaptive schedule adjusts Tk through the variability of the error function at the current network design; for example, Tk can be set proportional to the standard deviation of the error criterion values calculated for designs close to the current one. This adaptive schedule ensures uniform reduction of the error function to the minimal error achieved by the optimal network design.
Using simulated annealing, complicated criteria for optimal design can be utilized. For example, it is possible
to define the error function so that it incorporates the cost of establishing or relocating the new
station. Unfortunately, simulated annealing is a computer-intensive method. Hundreds of thousands of iterations may be necessary to find an optimal network configuration. Simulated annealing is further discussed below in the geostatistical simulation part of this chapter.
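The algorithm described above can be sketched as follows. The criterion argument is a placeholder for the user's error function O(S), for example, the average kriging prediction standard error; the geometric cooling used here is the simple schedule mentioned earlier:

import numpy as np

rng = np.random.default_rng(0)

def anneal_design(initial, criterion, bbox, n_iter=5000, t0=1.0, alpha=0.999):
    (xmin, xmax), (ymin, ymax) = bbox
    design = initial.copy()
    err = criterion(design)
    best, best_err = design.copy(), err
    temp = t0
    for _ in range(n_iter):
        candidate = design.copy()
        i = rng.integers(len(candidate))               # randomly move one station
        candidate[i] = [rng.uniform(xmin, xmax), rng.uniform(ymin, ymax)]
        cand_err = criterion(candidate)
        # always accept improvements; accept worse designs with probability exp(-dO/T)
        if cand_err < err or rng.random() < np.exp(-(cand_err - err) / temp):
            design, err = candidate, cand_err
            if err < best_err:
                best, best_err = design.copy(), err
        temp *= alpha                                  # slowly decrease the temperature
    return best, best_err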
The geoprocessing tools for the monitoring network design, including the spatially‐balanced random survey
design, will be available in the Geostatistical Analyst version after 9.3.
GEOSTATISTICAL SIMULATION
In this part of the chapter we review some of the procedures that generate stochastic random fields, collectively termed geostatistical simulation. All procedures use some form of kriging, and they are mathematical algorithms rather than statistical concepts. Most geostatistical simulation algorithms require normally distributed data (information on non-Gaussian geostatistical simulations can be found in references 9 and 10 in "Further reading" as well as in appendix 3). Since real data are rarely Gaussian, the data transformation and detrending options discussed in chapter 8 are usually used for data preprocessing.
We begin with a motivating example. At the end of this chapter, several applications of Gaussian geostatistical simulations will be presented.
Figure 10.10
When using the elevation data from the coarse-resolution DEM, kriging produces smooth elevation predictions (blue line in the graph in figure 10.11). When geostatistical simulation is used with the same
elevation data, the simulated values are shown with the green line. The length of the road determined from
the kriging prediction (blue line) is significantly shorter than the length of the road determined from the
geostatistical simulation (green line). The road derived from the finer resolution DEM is shown as the red
line, and we can see that geostatistical simulation gives a better estimate of the true road length than kriging.
Generated by the author.
Figure 10.11
Kriging predicts a single value that is close to the true but unknown value. Geostatistical conditional simulations try to do a better job than kriging using the same information by estimating microscale variation in addition to large- and small-scale data variability.
If data represent a single realization from a model with specified mean value and covariance function, a limitless number of surfaces can be produced that fit values at sampled points but deviate from one another at unsampled locations. Each such surface can obey particular statistical characteristics of the dataset, so any one of them is a possible depiction of the unknown true surface.
Unconditional geostatistical simulation produces a set of values, usually on a grid, that correspond to a particular data distribution with a specified mean value and semivariogram. In practice, it is usually assumed that the data follow a normal distribution.
Conditional geostatistical simulation does the same and also reproduces the observed data values at their locations. Another name for the conditional simulation technique is spatially consistent Monte Carlo simulation. The average of all the spatial conditional simulations tends toward the surface produced by kriging as the number of simulations increases.
It can be shown that the mean squared prediction error from conditional simulation based on simple kriging
is two times greater than the mean squared prediction error of simple kriging. Therefore, for the purpose of
prediction, kriging should be preferred to conditional simulation.
However, it is possible to obtain better predictions than those produced by kriging using a large number of conditional simulations after proper data transformation. In chapter 4, a formula for back transformation of the predicted value by lognormal simple kriging to the original scale was presented:
Ẑ(s) = exp(Ŷ(s) + σ²(s)/2),
where σ²(s) is the predicted kriging variance of the log-transformed data Y(s) at location s. The point was that it is not sufficient to simply use the inverse log-transformation formula Ẑ(s) = exp(Ŷ(s)). However, the average of the exponents of a large number n of conditional simulations, (1/n)·Σi exp(Yi(s)), gives the formula for simple kriging prediction above (it can be shown by evaluating the expected value of the inverse-transformed normal distribution of Y(s)). Generalizing, if the formula for the inverse transformation is known, all that is needed to estimate the average value on the original scale (that is, to do the job of kriging prediction) is inverse transformation of the simulated values and averaging them. This simplifies estimation of the expected values if the transformation function is complex.
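The equivalence between the closed-form back transformation and averaging the exponentiated simulations is easy to verify numerically; the mean and standard error below are arbitrary illustration values:

import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 2.0, 0.7                          # log-scale prediction and its standard error (assumed)
y_sims = rng.normal(mu, sigma, 100000)        # conditional simulations of Y(s) at one location

print(np.exp(y_sims).mean())                  # average of back-transformed simulations
print(np.exp(mu + sigma ** 2 / 2))            # lognormal simple kriging back transformation
print(np.exp(mu))                             # naive inverse log transform (biased low)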
Figure 10.12 shows an example of complex data transformation. The histogram in the top left shows
distribution of about 15,000 measurements of cesium‐137 soil contamination collected in Belarus in 1992. It
is practically impossible to describe this distribution by a mixture of Gaussian kernels because there are too
many isolated and relatively large values that require a separate kernel. Log transformation shown in the
bottom left does not make the data Gaussian either. However, a combination of log and normal score data
transformations can do the job.
A problem is that the resulting kriging predictions are on the log‐scale. In this case, the expected value of the
cesium‐137 soil contamination on the original scale can be calculated by exponentiating and then averaging
the simulated log‐transformed values in the grid cells, as in the map in figure 10.12 at center. A map of
standard deviations calculated for each grid cell using exponents of 30 simulations is shown in figure 10.12 at
right. These maps can be more accurate than maps created using conventional kriging models.
Simulations are useful for constructing a probability distribution at each location or in each region. With that
distribution, one can estimate, for example, the probability that soil contamination in a particular area
simultaneously exceeds a permissible upper threshold.
Conditional simulation is preferred to block kriging prediction for averaging data in the polygons or grid cells
because all weighted average methods, including kriging, smooth the values between measurement locations
and the resultant averaged values have a lower variance than the true data. Geostatistical conditional
simulations may better reflect the variability between points.
It makes sense to vary the parameters of the geostatistical model during simulations, particularly those of the semivariogram, since we do not know the true semivariogram but have only an estimate of it. In other words, it may not be necessary to reproduce the semivariogram model parameters exactly, since they are not observed. The simplest way to use a variable semivariogram model is the following:
1. simulate data
2. estimate the semivariogram model from simulated data
3. simulate new data using the semivariogram model estimated in the second step
If data transformation is used in the kriging model, the semivariogram model in step two should be estimated using the transformed data. Note that randomization of the semivariogram parameters should lead to smaller changes in the resulting simulations than those produced by the geostatistical simulation algorithm itself.
Conditional simulation discussed in this chapter deals with a single variable. For example, we can simulate
soil moisture and soil temperature in the farmer’s field, as was discussed in chapter 5. But soil moisture and
soil temperature are correlated (so that high values of moisture inform us about the neighboring temperature and vice versa), while the simulations reproduce the distributions of soil moisture and soil temperature but not the intrinsic correlation between the variables. Conditional cosimulation is a
multivariate extension of conditional simulation. In our example, soil moisture and soil temperature can be
simulated in such a way that their cross‐covariance and both semivariograms are respected.
Four commonly used methods of geostatistical simulation are presented below.
We know that the parameters of the semivariogram model have a physical meaning: measurement error
(part of the nugget), distance of data correlation (range), and data variance (sill). A question may arise: What do surfaces of values that share a particular semivariogram model have in common?
The left part of figure 10.13 shows measurements of the ozone concentration in California, a Gaussian
semivariogram model, and a circular searching neighborhood with a radius equal to the estimated range of
data correlation. Using the estimated Gaussian model and the data mean and assuming that data distribution
is Gaussian, unconditional simulations on a grid of points inside the California border are shown in the right
part of figure 10.13. Simulated values reproduce the spatial structure of the data in the right part of the
figure—in particular, we can clearly see one large and several smaller areas of large values—but the positions
of the spots are arbitrary.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 10.13
Figure 10.14
Several reliable algorithms for generating unconditional simulations with a specified covariance model are available, and two of them are discussed below.
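One widely used algorithm, shown here only as a sketch and not necessarily one of the two discussed below, builds the covariance matrix of the grid nodes and multiplies its Cholesky factor by independent standard normal values; it is practical only for grids of modest size:

import numpy as np

rng = np.random.default_rng(0)

def exponential_cov(coords, sill=1.0, data_range=10.0, nugget=1e-6):
    # Exponential covariance with practical range: C(h) = sill * exp(-3h / range).
    h = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1))
    return sill * np.exp(-3.0 * h / data_range) + nugget * np.eye(len(coords))

def unconditional_simulation(coords, mean, cov_func, n_sim=1):
    # Gaussian simulation: mean + L e, where C = L L^T and e is standard normal noise.
    L = np.linalg.cholesky(cov_func(coords))
    e = rng.standard_normal((len(coords), n_sim))
    return mean + L @ e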
A method that transforms an unconditional simulation into a conditional one is the following. Suppose the kriging predictor of the true value Z(s) at the location s is Ẑ(s). The following equality is always true:
Z(s) = Ẑ(s) + (Z(s) − Ẑ(s)).
Suppose we know the mean value and the covariance of the process Z(s). We can simulate a variable ZN(s) using this mean and covariance and write a similar equality:
ZN(s) = ẐN(s) + (ZN(s) − ẐN(s)),
where ẐN(s) is the kriging prediction made from the simulated values at the data locations. The geostatistical simulation conditioning method is to use the simulated error (ZN(s) − ẐN(s)) instead of the unknown one (Z(s) − Ẑ(s)):
ZC(s) = Ẑ(s) + (ZN(s) − ẐN(s)),
where ZC(s) denotes the conditional simulation at location s. This substitution makes sense because unconditional simulations correctly reproduce the spatial data structure, as in figures 10.13 and 10.14.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 10.15
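The conditioning formula above translates directly into code once an unconditional simulation is available at both the data and the grid locations (for example, from the Cholesky sketch earlier in this chapter); cov_func is assumed to be a covariance function such as exponential_cov:

import numpy as np

def simple_kriging_weights(cov_func, data_coords, grid_coords):
    # Solve C_dd w = c_dg for every grid node; returns an (n_data, n_grid) weight matrix.
    n = len(data_coords)
    C_all = cov_func(np.vstack([data_coords, grid_coords]))
    return np.linalg.solve(C_all[:n, :n], C_all[:n, n:])

def condition_by_kriging(z_data, mean, weights, z_sim_data, z_sim_grid):
    # Z_C(s) = Zhat(s) + (Z_N(s) - Zhat_N(s)), with simple kriging predictions
    # computed from the observed data and from the simulated data, respectively.
    z_hat = mean + weights.T @ (z_data - mean)
    z_hat_sim = mean + weights.T @ (z_sim_data - mean)
    return z_hat + (z_sim_grid - z_hat_sim)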
Figure 10.16 (center) shows the data locations and the unconditional simulation ZN(s) created using the estimated mean value and the semivariogram model on a grid of points inside the California border. The difference between the simulated values ZN(s) and the predictions ẐN(s) made from them on the grid is shown in figure 10.16 at right.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 10.16
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 10.17
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 10.18
Looking at many simulated surfaces, one can observe that some conditional simulations show extreme variations while others show average behavior of the variable. However, conditional simulations are equally probable in the sense that each one is a possible depiction of the true spatial surface. The mean of a large number of conditional geostatistical simulations tends to the kriging prediction, and the simulation variance tends to twice the kriging variance, meaning that conditional simulation values fluctuate within an envelope defined by the kriging prediction standard errors.
Because the majority of observed data are not absolutely precise, measurement errors should be
incorporated in the simulation model. If measurement errors exist, we can use a filtered version of
conditional simulation that honors filtered values (reconstructed signal) and, in the case of zero
measurement errors, honors the data. In this case, the goal should be simulation of the true signal (that is, the data not contaminated by errors), S(s), of the measurement Z(s),
Z(s) = S(s) + ε(s),
where Z(s) is the observational process at location s, S(s) is the process of interest (signal) with mean value µ, and ε(s) is the measurement error process with zero mean.
The unconditional step of simulation requires knowledge about the process ZN(s) = SN(s) + εN(s), and we should generate unconditional processes SN(s) and εN(s), independent of one another, with expected mean values
E(SN(s)) = E(S(s)) = μ,
E(εN(s)) = E(ε(s)) = 0,
and with specified covariances
cov(SN(s1), SN(s2)) = cov(S(s1), S(s2)),
cov(εN(s1), εN(s2)) = cov(ε(s1), ε(s2)).
The covariance models of the observed and the signal processes are the same at the unsampled locations. The
covariance models differ at the sampled locations, and the difference is defined by the measurement process.
Conditional simulation of the signal S(s) based on simple kriging requires knowledge of the mean value µ, the simple kriging weights λi, and simulations at the sampled and target locations. This is because substitution of the simple kriging predictions based on the observed and the simulated data into the conditioning formula,
SC,sk(s) = Ŝsk(s) + (SN(s) − ŜN,sk(s)),
leads to the following expression:
SC,sk(s) = SN(s) + Σi λi ⋅ (Z(si) − ZN(si)),
where ZN(si) = SN(si) + εN(si).
From the last formula above, we see that conditional simulation of S(s) depends on the mean value µ only through the simulated values SN(s) and ZN(si), not directly.
It is possible to simulate data with a locally stationary covariance. For example, the semivariogram can be defined as
γs(h) = c(s) ⋅ γ(h),
where the partial sill c(s) is a slowly changing semivariogram parameter (it can be estimated using the moving window kriging discussed in chapter 9). Then a locally stationary simulation is produced by multiplying values simulated from a distribution with semivariogram γ(h) by the square root of the locally varying partial sill parameter c(s).
Conditioning can also be based on other kriging models, for example, ordinary kriging:
SC,ok(s) = Ŝok(s) + (SN(s) − ŜN,ok(s)),
where Ŝok(s) and ŜN,ok(s) are the ordinary kriging predictions based on the observed and the simulated data, respectively.
However, theory suggests that conditional simulation based on simple kriging is always to be preferred to
ordinary and other kriging models because the mean squared prediction error of the conditional simulation
based on simple kriging is smaller. In addition, even for nearly Gaussian data, the kriging variance does not
depend on the actual data values when conditioning is based on ordinary and universal kriging, but simple
kriging variance may depend on the actual data values themselves and not on just the relative spatial
arrangement of data locations.
If kriging with a local searching neighborhood is used, the resulting conditional simulation surface is not continuous, and this can be a problem in applications that deal with continuous phenomena such as flood modeling or weather prediction. In this case, the smooth variant of kriging discussed in chapter 8 can be used. Two examples of conditional simulation using Geostatistical Analyst tutorial data, the maximum annual 8-hour ozone concentration in California in 1999, are shown in figure 10.19 using simple kriging without the smoothing option at left and with this option at right. All other parameters of the simple kriging model are identical. In the case of non-smooth simulations, jumps in the simulation values are clearly seen (for example, at the top of the map in figure 10.19 at left) around measurement locations that are separated from their nearest neighbors by a distance comparable to the 60-kilometer radius of the searching neighborhood, shown as a red circle in figure 10.19 in the center.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 10.19
Just as in the case of kriging, the choice of semivariogram model does matter. Figure 10.20 shows three conditional simulations using the data shown as pink dots. The left surface was created using a Gaussian semivariogram model with a zero nugget parameter. In this case, the solution of the kriging system of linear equations is unstable. Consequently, the kriging weights vary too much, and the variability of the simulated data is much larger than it should be (compare the intervals for the values in the left and in the right surfaces). The maps in the center and at right show simulated surfaces with the same semivariogram model as at left, but with a small nonzero nugget parameter, which stabilizes the solution of the kriging system.
Figure 10.20
The most popular algorithms for unconditional Gaussian simulations are the Cholesky decomposition and
spectral decomposition. They are based on mathematical features of the Gaussian distribution, and the details
on the algorithms can be found in statistical textbooks. The ideas of the algorithms are presented below.
To simulate a spatially correlated process with mean µ(s) and symmetrical variance-covariance matrix Σ, the Cholesky decomposition Σ = L⋅LT is computed first, where L is a lower triangular matrix and LT is its transpose. Then, if W is a vector of independent standard normal random values, the unconditional simulation is calculated as
Z = µ + L⋅W,
where µ is the vector of mean values. The simulated data have mean µ and variance-covariance matrix Σ. The decomposition and storage of large matrices are inefficient, however, and the Cholesky decomposition algorithm is used for simulation of a relatively small number of values.
The spectral decomposition approach is based on a mathematical method for generating a square root matrix of Σ:
Σ = P⋅Δ⋅PT,  Σ^1/2 = P⋅Δ^1/2⋅PT,
where P is the matrix of eigenvectors of Σ, which satisfies P⋅PT = I, the identity matrix composed of all zeroes with just the diagonal having ones, and Δ is a diagonal matrix containing the eigenvalues of Σ. The resulting unconditional simulation is calculated using the formula
Z = µ + Σ^1/2⋅W,
where W is a vector of independent standard normal random values. If programmed properly, the spectral decomposition is the fastest among all known simulation algorithms.
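A minimal R sketch of both decompositions, assuming an exponential covariance on randomly placed locations (all names and values are illustrative):

set.seed(1)
n      <- 400
coords <- cbind(runif(n), runif(n))
mu     <- 0; sill <- 1; rng <- 0.3
Sigma  <- sill * exp(-3 * as.matrix(dist(coords)) / rng)   # covariance matrix
W      <- rnorm(n)                                         # independent standard normal values

# Cholesky decomposition: Sigma = L %*% t(L), Z = mu + L %*% W
L      <- t(chol(Sigma))                 # chol() returns the upper triangle, so transpose
z_chol <- mu + L %*% W

# Spectral decomposition: Sigma = P %*% diag(d) %*% t(P); square root = P %*% diag(sqrt(d)) %*% t(P)
e      <- eigen(Sigma, symmetric = TRUE)
Sroot  <- e$vectors %*% (sqrt(pmax(e$values, 0)) * t(e$vectors))  # square root matrix of Sigma
z_spec <- mu + Sroot %*% W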
SEQUENTIAL GAUSSIAN SIMULATIONS
The sequential Gaussian simulation is presented below through a discussion of its algorithm.
STEP 1 Define a kriging or cokriging model. In practice, simple kriging with a normal score transformation is
most commonly used.
STEP 2 Define the prediction locations. There are no restrictions on the choice of prediction locations, but in
practice a grid of cells is almost always used. Figure 10.21 shows data locations with fine and coarse grids
superimposed over the data domain shown in blue. Note that two cells include more than one observation.
The reason why two grids are used is explained in step 5 below.
Figure 10.21
STEP 3 Average data in each nonempty cell and shift the new data to the centers of the cells.
Averaging and shifting introduce additional uncertainty for short distances in the semivariogram modeling
(see “Locational uncertainty” in chapter 3). The resulting semivariogram will not be equal to the semivariogram
based on original data near the origin. One possible alternative is using a random location inside the cell for
each simulation.
In the case of replicated data, averaging can be done in two steps. First, the observations at a location with replicated data are averaged using a linear combination weighted according to the observation variances at that location,
Z* = Σi=1..N wi ⋅ Zi,
where measurement Zi is a sum of the true value of Z, a Gaussian measurement error normal(0, σi²), and microstructure; N is the number of measurements in a particular location; and the weights are calculated as
wi = (1/σi²) / Σj=1..N (1/σj²).
Second, the averaged values inside each cell are averaged taking into account the microstructure component of each value’s variance. The microstructure is not filtered out because it is a part of the true data values.
The resulting average values are in fact predictions at the centers of the cells. If we are interested in the average values over the cells, they can be approximated assuming that the microstructure is equal to zero, because the larger the number of values averaged in a cell, the smaller the contribution of the microstructure, and perfect averaging is the weighted sum of a very large number of values.
Note that in the available geostatistical software, the arithmetic mean of the data inside each cell is almost certainly used, which is incorrect.
STEP 4 If a clipping polygon is specified, assign a “no data” value to the cells outside that polygon.
Usually simulation in the non-empty cells (green cells in figure 10.22) is not required since we already know the values there. However, if the observations are contaminated by errors or the shift of the data locations to the centers of the cells introduces large uncertainty (when the cell size is relatively large), simulation of new values in the non-empty cells may be required.
Figure 10.22
STEP 5 Define a random path through all empty grid cells, that is, the non-green cells that contain neither data nor “no data” values in figure 10.23, by simulating random numbers from the uniform probability distribution. For example, (a) create a list of the n non-green cells; (b) simulate a random value p in the interval [0, 1] using the function rand(), which is available in any computer language and also in ArcGIS; (c) choose cell number i, equal to p⋅n rounded up to the nearest integer, as the next cell in the random path; (d) remove item i from the list and move the last item in the list to position i; and (e) repeat with n = n − 1 while n is larger than one.
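A minimal R sketch of steps (a) through (e), assuming only that the empty cells are numbered 1 through n; in R the single call sample(n) produces an equivalent random path:

random_path <- function(n) {
  cells <- 1:n                         # (a) list of the n empty cells
  path  <- integer(0)
  while (n > 1) {
    p <- runif(1)                      # (b) uniform random value in [0, 1]
    i <- max(1, ceiling(p * n))        # (c) cell number i = p*n rounded up
    path     <- c(path, cells[i])
    cells[i] <- cells[n]               # (d) move the last item in the list to position i
    cells    <- cells[-n]
    n <- n - 1                         # (e) repeat with n = n - 1
  }
  c(path, cells)                       # append the last remaining cell
}
random_path(10)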
In practice, one or two coarser grids with coincident cell centers are defined and a random path is chosen
hierarchically, first for the grid with large cells, then for the medium grid, and finally for the remaining cells
on the fine grid.
Figure 10.23
In the case of clustered measurements, such a hierarchical, not purely random, path may be preferable to limit the influence of the clusters, which otherwise will lead to simulations that are too similar. Another possibility is to begin simulations in the areas with a low density of points and then progressively move toward areas with denser measurements.
Since the simulated surface depends on the path, a large number of simulations is required to properly describe the data variability.
STEP 6 Predict the value and its standard error in the first empty cell in the random path using simple kriging and the data in the green cells inside the clipping polygon (all of them or, more often, a subset; see below).
In theory, all data should be included in the simple kriging searching neighborhood, but in practice, only the
closest 10 to 15 measurements are used. This is because of the desire to have fast simulations since solving a
kriging system based on 10 to 15 points can be done quickly. The use of a hierarchy of grids in step 5 avoids
the problem of reconstructing the large‐scale spatial correlation.
A problem is that predictions in the neighboring cells are usually made using different (and sometimes very different) neighborhoods (see steps 8 and 9 below), and the resulting raster is far from being smooth. However, smoothness may be a desirable feature in applications.
STEP 7 Simulate a value from a normal distribution with the mean equal to the simple kriging prediction and
variance equal to the squared simple kriging standard error from the previous step. The right part of figure
10.24 shows a normal distribution with mean defined by the kriging prediction and standard error defined by
the kriging standard error. A random draw from this normal distribution is shown in pink.
The normal score transformation was used in step 1 because it is the only transformation that guarantees
that the transformed data are Gaussian.
Figure 10.24
STEP 8 Add the random draw from normal distribution to the original dataset (a pink cell in figure 10.24).
Simulation from the normal distribution implicitly assumes that data are error free because if data have
measurement error, then it makes no sense to mix original noisy data with filtered predictions. This is a
serious drawback of the algorithm.
In 1963, Lev Gandin considered the following exercise: compare the error of kriging interpolation to the same location using 1) the original data, 2) data predicted on a regular grid, and 3) a mixture of original and predicted data. The result of the exercise is that the error is inflated when predicted data are used, because the algorithm is no longer optimal, and this error is greater than the error resulting from a direct interpolation from the original observations. This additional variability is artificial and not the natural variability of the data.
The sequential simulation algorithm adds to the dataset not the most probable mean value at the empty cell
location but a random value that is often close to the mean but sometimes very different from it. Because the
simulated values at the unsampled locations can be very far from the most probable one given by kriging, one
particular simulation should not be used as representative of the prediction surface.
STEP 9 Repeat steps 6 to 8 for the remaining non‐green cells.
STEP 10 Back-transform the data. The semivariogram was estimated and the simulations were made using transformed data. Therefore, obtaining the simulations in the original scale requires a proper back transformation.
STEP 11 For this last step, repeat steps 5-10 to produce a new realization on the grid.
To produce the sequential Gaussian simulation using several variables, simple cokriging instead of simple
kriging can be used.
If the conditional simulation algorithm described above is used without data, the result is unconditional
simulation with values drawn sequentially from a Gaussian distribution. In this case, sequential Gaussian
simulation and Cholesky decomposition are related.
Suppose we are using sequential Gaussian simulation without data. Then the first draw is made from the unconditional normal distribution with mean µ and the variance defined by the covariance model, Z(s1) = µ + σ(s1)⋅W1, where W1 is a standard normal random value, and each subsequent draw is made from the normal distribution with mean and variance given by simple kriging based on all previously simulated values.
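This equivalence is easy to verify with a small R sketch of sequential Gaussian simulation without data, assuming an exponential covariance; each draw conditions on all previously simulated values (names and settings are illustrative):

set.seed(1)
n      <- 50
coords <- cbind(runif(n), runif(n))
mu     <- 10; sill <- 2; rng <- 0.4
C      <- sill * exp(-3 * as.matrix(dist(coords)) / rng)

z    <- numeric(n)
z[1] <- mu + sqrt(C[1, 1]) * rnorm(1)              # first draw: unconditional normal
for (i in 2:n) {
  c12 <- C[i, 1:(i - 1), drop = FALSE]             # covariances with the simulated values
  w   <- c12 %*% solve(C[1:(i - 1), 1:(i - 1)])    # simple kriging weights
  m   <- mu + w %*% (z[1:(i - 1)] - mu)            # simple kriging prediction (mean)
  v   <- C[i, i] - w %*% t(c12)                    # simple kriging variance
  z[i] <- drop(m) + sqrt(max(v, 0)) * rnorm(1)     # draw from the conditional normal
}
# z has mean mu and covariance C, the same joint distribution as the
# Cholesky-based simulation mu + t(chol(C)) %*% rnorm(n)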
In unconditional simulations with conditioning by kriging, simple kriging is used only to get predictions, while the prediction uncertainty is not used in the simulation process. Sequential Gaussian simulations use both the simple kriging predictions and the prediction standard errors; therefore, inaccurate specification of the mean value (or mean surface) and of the data transformation may lead to inaccurately estimated prediction uncertainty and, consequently, to artificial variation in the simulated values.
The sequential Gaussian simulation algorithm is relatively slow. It only allows simulation from simple spatial processes without trend. However, despite many shortcomings, it is the most popular algorithm in engineering applications.
SIMULATING FROM KERNEL CONVOLUTIONS
In chapter 8, we discussed the relationship between covariance and kernel functions: any stationary Gaussian process Z(s) with covariance
C(h) = σ² ⋅ ∫ Kernel(u) ⋅ Kernel(u + h) du
can be expressed as the convolution of a Gaussian white noise process g(s) with the convolution kernel Kernel(⋅):
Z(s) = ∫ Kernel(u − s) ⋅ g(u) du.
By a white noise process g(s), we mean a process for which the value accumulated over an area A, G(A) = ∫A g(u) du, is distributed as a normal variable with zero mean and with variance proportional to the area of A, G(A) ~ normal(0, σ²⋅|A|). The correlation between the values G(A) and G(B) for the white noise process is proportional to the area of intersection of A and B for any A and B contained in the data domain.
The moving average construction guarantees that the resulting covariance function is positive definite. The
resulting zero‐mean stationary Gaussian process is certain to be valid no matter which kernel is chosen.
Figure 10.25
Four unconditional simulations in figure 10.26 were produced with MATLAB two‐dimensional convolution
function, conv2, using the kernel in figure 10.25 at right and four different white noise surfaces.
Figure 10.26
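An R analog of this construction, written as an explicit moving average so that no special toolbox is needed; the Gaussian kernel, its bandwidth, and the grid size are illustrative choices:

set.seed(7)
nr <- 100; nc <- 100; half <- 10
# white-noise grid, padded so that the kernel window fits everywhere
noise <- matrix(rnorm((nr + 2 * half) * (nc + 2 * half)), nr + 2 * half, nc + 2 * half)

# Gaussian kernel on a (2*half+1) x (2*half+1) window, scaled to give unit variance
k <- outer(-half:half, -half:half, function(dx, dy) exp(-(dx^2 + dy^2) / (2 * 4^2)))
k <- k / sqrt(sum(k^2))

sim <- matrix(0, nr, nc)
for (i in 1:nr) {
  for (j in 1:nc) {
    win <- noise[i:(i + 2 * half), j:(j + 2 * half)]   # white-noise values around cell (i, j)
    sim[i, j] <- sum(k * win)                          # moving-average (convolution) value
  }
}
image(sim)   # a spatially correlated Gaussian surface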
Given unconditional simulations on a grid, conditioning by the kriging approach discussed earlier in this chapter can be used for creating a surface consistent with the observed data.
Although a spatially correlated surface can be created using almost any kernel, interpretable predictions and simulations are produced with kernel functions that satisfy certain properties. Two different strategies are possible: specifying a covariance model and deriving the corresponding kernel, or specifying a kernel and computing the covariance model that it implies.
It is possible to find kernels that correspond to non-Gaussian processes. Examples of kernels for generating practically important non-Gaussian processes can be found in reference 6 at the end of this chapter. Kernels are constructed using the definitions of the mean, variance, and covariance expressed through the chosen kernel. The resulting expressions for one such process with a mean equal to the variance (such a process is important for regional data analysis; see chapters 11 and 12) and with a covariance function that depends on the distance between locations are complicated, but the calculations are straightforward since the kernel is defined by an analytical formula and the integrals can be calculated quickly.
A question arises: is it possible to create a conditional simulation from, say, a Poisson unconditional simulation using the method from the earlier section “Unconditional Simulation and Conditioning by Kriging”? Unfortunately, conditioning by kriging does not guarantee that the simulated values will belong to the sample space (because conventional simple kriging was not designed for predicting Poisson data), and they certainly would not be integers, as a simulated Poisson variable should be. Still, the result would meet the requirements of having the desired mean and variance values and covariance model. It would be interesting to
compare this approach with other methods that simulate correlated Poisson data. For example, correlated
Poisson data can be simulated from the posterior predictive distribution of the spatial generalized linear
model holding all parameters constant (mean and variance values and covariance model) as shown in chapter
6 for the weed counts data prediction with Poisson type generalized linear model.
The kernel convolution approach can be used for modeling nonstationary data. In this case, the shape of the
kernel depends on spatial location. An interesting application is interpolation and simulation in the presence
of semitransparent and nontransparent barriers.
In each grid cell (i,j), model an independent random variable with zero mean and variance σ².
Based on the cost value in each grid cell, find the cost distance cd(i,j),(t,s) to the cells (t,s) in the specified neighborhood, where the pairs of indices refer to grid rows and columns.
Define a kernel function Kernel(cd) of the cost distance and calculate the covariance between grid cells (i,j) and (k,l):
cov(Zi,j, Zk,l) = σ² ⋅ Σt,s Kernel(cd(i,j),(t,s)) ⋅ Kernel(cd(k,l),(t,s)) / sqrt(Σt,s Kernel²(cd(i,j),(t,s)) ⋅ Σt,s Kernel²(cd(k,l),(t,s))).
The shape of the kernel can be complicated near the barriers. Figure 10.27 at left shows how the shape is
modified from symmetrical near a line barrier.
An example of unconditional simulation using a non‐Euclidean distance metric in the area with
nontransparent barriers in black is shown in figure 10.27 at right. The resulting simulated image is changing
smoothly near the barriers.
Figure 10.27
SIMULATED ANNEALING
Simulated annealing is an optimization technique that is used widely due to its flexibility and ability to generate complex processes. It is used in spatial statistics for producing a surface with specified statistical properties.
Suppose there is a set of values A, and for each element a of the set A there is a function f(a) that returns a real number. For example, the set A could be the permutations of values assigned to the polygons, and f(a) could be an index of spatial association discussed in chapters 2, 11, and 15-16. The task is to find the maximum value of f(a). This seems like a trivial task: calculate f(a) for all a and choose the maximum value. However, if the number of elements in A is very large, it is impossible to check all values in a reasonable time. For example, if the number of polygons is n, the number of possible permutations is n!, and a problem arises even for n greater than 8, since 8! = 40,320.
A common approach is to use a local search in A. This requires that neighboring elements in A be similar. It is often clear how to define neighbors. For example, for the set of permutations, two elements are neighbors if one of them can arise from the other by exchanging only two of the objects. With neighbors defined, the searching algorithm starts at a random element a, computes f(a) for it and for a neighbor, and moves to the neighbor if its f value is larger. The process is then repeated, moving each time to the closest neighbor that has not been visited yet. The algorithm can be visualized as a hiker in hilly terrain walking in the steepest direction until the top of the hill is reached. The drawback is that the algorithm may stop on a modest hill (a local maximum) without noticing a mountain peak (the global maximum) farther away. To deal with this drawback, several modifications were proposed, including occasional downhill steps.
The simulated annealing optimization algorithm gets its name from the age‐old technology of metal (or glass)
annealing, that is, slow cooling of a melted substance to give it enough time to come to the most regularly
ordered crystalline structure that has the lowest atomic energy and hence the best mechanical qualities. To
reach an optimal energy state in metal annealing, all one needs is to cool the metal very slowly. If cooling is
done faster, the metal can freeze in one of the intermediate states, with the energy of the crystalline lattice
above the lowest possible level. Such intermediate states are called local energy minimums as opposed to the
global energy minimum.
Stochastic simulation using simulated annealing imitates the metal annealing process by letting a simulated
set of data go through a succession of states (images) that differ from each other by one or several data values
in the grid cells and by their objective function. In conventional terminology of the simulated annealing
optimization, the objective function is the analog of energy in the metal annealing process.
The objective function measures the difference between the values of statistical parameters of the reference data and the values of these parameters in each simulation. For example, the objective function can be written as the difference between the specified semivariogram at a set of lags hj, which is usually estimated from the data, and the semivariogram estimated at the same lags at each iteration of the optimization.
Just as the purpose of annealing is to bring the melted metal to the lowest energy state, the purpose of
annealing optimization is to simulate data with the lowest objective function. This results in simulated data
with statistical properties close to those specified by objective functions.
STEP 1 Create an initial image. This image could be randomly generated, but it usually satisfies some
statistical characteristics of the required image to speed up calculations. In the case of conditional simulation,
the initial realization usually includes both the observed data and the simulated values.
For example, the initial image of the ozone concentration over California can be simulated from the univariate distribution of the observed ozone values. This image has an objective function that is far from optimal, but it has the advantage of not incorporating the large-scale variation in the data. This can be important since simulated annealing freezes large-scale variation early, at high temperatures (the word “temperature” is used here following the practice of the geostatistical literature), and then converges on details at lower temperatures.
Another way of creating the initial image is using a surface created by another, faster, unconditional
simulation algorithm. Such an image will usually have desired statistical properties; for example, it may
reproduce the histogram and the semivariogram model of the data. This allows starting annealing
computations at a relatively lower temperature.
STEP 2 Calculate the objective function for the initial image. This objective function measures the closeness of
the desired statistical characteristics of the image to those of the image at current iteration.
STEP 3 Create another image by perturbing the first image, for example, by swapping values in two randomly
selected cells. If the resulting image must honor values at the sample points, then swapping takes place only
at cells without sampled locations.
STEP 4 Calculate the objective function for the new image.
STEP 5 Compare the two objective functions and decide if the new simulated image should be accepted or
rejected.
The image is accepted if its objective function is lower than the objective function of the previous image.
However, one should use a more flexible rule of acceptance and not just always choose a simulation that
lowers the objective function. Simulations that increase the objective function should also be given a chance,
because they may help the algorithm to avoid local minimums of the objective function. The function in red in
figure 10.28 has several minimums, and the goal is to find a global one. At the iteration i, objective function Oi
is on the downslope of one of the local minimums. At the next iteration i+1, objective function Oi+1 is close to
the local minimum, and without a possibility to jump to another part of the objective function Oi+j, the
optimization will stop in the local minimum.
Figure 10.28
Figure 10.29
Simulations that increase the objective function are accepted with a probability that is usually defined as
P(accept) = exp(−ΔO / t),
where ΔO = Oi+1 − Oi is the difference between the values of the objective function at the last two iterations and t is the current temperature. In the beginning of the annealing optimization, when the temperature is high, the probability of accepting an image that increases the objective function is relatively high, and it vanishes as the temperature goes down.
STEP 6 If the current image is accepted, the previous image is replaced by the current one and steps 3 to 5 are
repeated until the difference between objective functions is small enough or until new perturbations stop
affecting it, which means that the system has achieved equilibrium.
STEP 7 The temperature is lowered and the system goes through steps 3 to 6 at a lower temperature.
STEP 8 For this last step, the algorithm terminates either when the temperature reaches a specified value or
when some other stopping criterion has been met.
When an annealing algorithm is used for finding an image with minimal discrepancies between its
semivariogram and the specified semivariogram model, the objective function can be written as
Oγ = Σj=1..L [γmodel(hj) − γimage(hj)]²,
where L is the number of semivariogram lags taken into consideration.
The objective function above can be supplemented with constraints on other statistical or physical properties
of the process being studied, for example:
Oγµ = w0 ⋅ Σj=1..L [γmodel(hj) − γimage(hj)]² + Σk=1..K wk ⋅ [µk,model − µk,image]²,
where µk is some physical (geological, engineering, financial) property of the image, and w0 and wk are the relative importance of matching the semivariogram and the physical property k, respectively.
The most commonly used annealing schedule is exponential cooling. Exponential cooling begins at some initial temperature t0 and repeatedly multiplies the temperature by a reduction factor λ,
tn+1 = λ ⋅ tn,
where 0 < λ < 1. The system must produce enough simulations at each temperature to reach thermal equilibrium before proceeding to the next step. The choice of suitable values for t0 and λ is data-dependent and is usually made empirically.
In another variant of exponential cooling, the temperature at the nth step of the schedule is specified directly as a function of the iteration number n.
The simulated annealing process is illustrated below using five points on the plane, enumerated in figure 10.30 at left. They were simulated from a normal distribution with the exponential semivariogram shown at right, with a range equal to 4/3 of the largest distance between the points.
Data values 0.650 and 1.659 at points 2 and 5 are fixed, and simulated annealing will be used to reconstruct the values of the remaining three data points using the known data distribution (here we assume that the data are Gaussian) and the semivariogram model.
Figure 10.30
Table 10.1
An empirical semivariogram for five distances can be estimated using the data in these five locations, and we will use the objective function
Oγ = Σj=1..5 [γ(hj) − γmodel(hj)]²,
where γmodel(hj) is the true semivariogram model value at lag hj and γ(hj) is the estimated empirical semivariogram at lag hj, to check how close the estimated semivariogram values are to the required ones.
This exercise imitates a real situation when, knowing a limited number of observations, we are interested in a collection of values at specific unsampled locations that closely reproduces the spatial data structure. Only three of the five values are unknown and are updated during the optimization.
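A compact R sketch of this kind of exercise is shown below. It uses the semivariogram cloud of the ten point pairs rather than five averaged lags, and the point locations, semivariogram sill, proposal distribution, and cooling settings are illustrative, so the numbers will not match table 10.1:

set.seed(3)
xy    <- cbind(c(1, 3, 6, 8, 9), c(2, 7, 4, 8, 1))     # five point locations (illustrative)
fixed <- c(2, 5); free <- c(1, 3, 4)
z        <- numeric(5)
z[fixed] <- c(0.650, 1.659)                            # the two known data values
z[free]  <- rnorm(3, mean = 1.5, sd = 1.5)             # initial random values (iteration 0)

h      <- as.matrix(dist(xy))
rng    <- 4 / 3 * max(h)                               # range = 4/3 of the largest distance
gmodel <- function(d) 1.5 * (1 - exp(-3 * d / rng))    # exponential semivariogram model
pair_idx  <- which(upper.tri(h), arr.ind = TRUE)       # the ten point pairs
objective <- function(z) {
  gam <- 0.5 * (z[pair_idx[, 1]] - z[pair_idx[, 2]])^2 # empirical semivariogram cloud
  sum((gam - gmodel(h[upper.tri(h)]))^2)
}

temp <- 10; lambda <- 0.9
O_old <- objective(z)
for (iter in 1:600) {
  znew    <- z
  j       <- sample(free, 1)                           # perturb one of the unknown values
  znew[j] <- rnorm(1, mean = 1.5, sd = 1.5)            # new value from the data distribution
  O_new   <- objective(znew)
  P_accept <- min(1, exp(-(O_new - O_old) / temp))     # acceptance probability
  if (runif(1) < P_accept) { z <- znew; O_old <- O_new }
  if (iter %% 20 == 0) temp <- lambda * temp           # exponential cooling every 20 iterations
}
z; O_old                                               # reconstructed values and final objective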
Table 10.1 shows the first 17 iterations of the simulated annealing process. The first column contains the iteration number, with zero used for the preliminary step in which three random values from the data distribution are assigned to points 1, 3, and 4. Temperature values are in the second column. The temperature has the same value in all the rows shown because it is updated every 20 iterations using the exponential cooling formula tn+1 = λ⋅tn. Columns 3-5 show the current values of the first, third, and fourth data points. At each iteration, one of these data points is randomly selected and updated by simulating a new value from the normal distribution from which the five points in figure 10.30 were simulated. The updated value is printed in bold. Columns 6-7 show the calculated and the last accepted values of the objective function. The last three columns contain the probability P(accept) of acceptance of a new value for the first, third, or fourth data point, calculated using the formula above; a random draw from the uniform distribution on the interval from zero to one, which is compared with P(accept); and the decision about acceptance of the new datum (the new value is accepted if the random draw is smaller than P(accept)).
Figure 10.31 shows the current (left) and accepted (right) objective functions for 600 iterations. In this run, the optimization algorithm could have been stopped around iteration 120 since there was no significant decrease of the objective function during the next 450 iterations.
Figure 10.31
Table 10.2 shows data and objective functions for true data, randomly simulated data for iteration 0, and
three iterations with very low objective function.
                              data1   data2   data3   data4   data5   Objective function
True data                     3.034   0.650   2.227   0.036   1.659   75.4
Simulated data (iteration 0)  4.521   0.650   1.169   3.941   1.659   226.868
Iteration 99                  2.395   0.650   3.802   1.788   1.659   10.164
Iteration 306                 3.031   0.650   3.895   2.071   1.659   12.241
Iteration 593                 2.768   0.650   3.557   1.653   1.659   9.852
Table 10.2
We should not be surprised that the simulated data at iterations 99, 306, and 593 reproduce the true semivariogram model better than the true values do, since only two or three pairs are averaged at each of the five lags, which is insufficient for restoring the true semivariogram model.
Figure 10.32
A simulated annealing algorithm will also work when all five values are unknown. However, the objective function may stop improving once it is smaller than the one calculated using five random values, while the current empirical semivariogram values are still far from the true semivariogram. Therefore, the additional information about the two known data values helps to solve the problem more effectively.
Simulated annealing is a flexible algorithm that can produce images with specified statistical features. For
example, using simulated annealing, we do not have to work out the theoretical properties of the kernels to
produce realizations of the processes with the mean equal to the variance. Instead, we simulate independent
Poisson values then match the mean, variance, and covariance. Moreover, we can use additional constraints
to simulate data with required features.
Simulated annealing is powerful, but it is the slowest simulation algorithm. Therefore, simulated annealing is used when other algorithms cannot produce the required images. It should also be noted that practitioners have observed that simulated annealing algorithms often underestimate the actual data variability.
APPLICATIONS OF UNCONDITIONAL SIMULATIONS
One usage of unconditional geostatistical simulation was illustrated in chapter 9 (choosing the optimal
kriging model for Catalonia precipitation data). Applications of unconditional simulations are mostly
exploratory, for instance, generation of what‐if scenarios for planning and data collection purposes when data
are not available but the semivariogram model is approximately known. For example, data can be simulated
in a region before actual data collection using a semivariogram model estimated in a similar sampled region.
Then subsamples can be drawn according to several sampling plans, values can be predicted at the other locations where the simulated values are known, and the efficiency of the sampling plans can be compared.
The semivariogram model in the figure 10.33, center, describes spatial correlation of the DEM data displayed
at left. The range of data correlation is about 5,700 meters. The map in figure 10.33 at right shows filtered
ordinary kriging predictions assuming that the nugget parameter of the semivariogram model consists of
measurement error.
Figure 10.33
The difference between elevation values and filtered kriging predictions, the residuals, is shown in figure
10.34 at left. These residuals have clear spatial structure, which can be described by the semivariogram
model—with a range of about 100 meters—shown in figure 10.34, center. Therefore, random errors for
simulation experiments can be unconditionally simulated using a semivariogram model with a range of about
100 meters. One such simulation is shown in figure 10.34 at right.
Note that it may be useful to multiply the simulated values by a function of slope to distinguish between steep and flat areas.
Figure 10.34
Figure 10.35 illustrates practicing semivariogram modeling using unconditionally simulated data, shown as the raster map at left. This figure also shows the Geostatistical Analyst’s cross-validation comparison dialog at bottom left, with kriging models using the true (left) and default (right) semivariogram models. The cross-validation diagnostics for the model with the true semivariogram are clearly better. However, the default semivariogram model looks very reasonable at bottom right, and many researchers would be satisfied with it.
The true (and, of course, the best for these data) anisotropic model at top right looks worse than the default
semivariogram model. This example shows that cross‐validation diagnostics can help to improve the default
kriging model.
Figure 10.35
APPLICATIONS OF CONDITIONAL SIMULATIONS
One example of using the conditional geostatistical simulation was presented in chapter 5 (testing the
hypothesis about correlation between soil moisture and soil temperature).
Conditional geostatistical simulations were initially used in mining, then in hydrology, petroleum,
meteorology, and other areas. A typical application of conditional simulation in reservoir characterization is
providing the input to a transfer function such as a flow simulator that requires unsmoothed spatial
distribution of permeability, porosity, and saturation, the output being breakthrough times, production rates,
and sweeping efficiency. All input simulation surfaces should be consistent with available information on the
modeling phenomena. Their differences yield the measure of uncertainty of the response function and
assessment of the probability that the response is greater than some critical value (estimated by the
proportion of the observed responses greater than the threshold value) after processing through the flow
simulator.
The usage of conditional simulation is illustrated below using radioecological, agricultural, mining, and DEM
data. Figure 10.36 shows 197 measurements of cesium-137 soil contamination collected in Belarus in 1992,
kriging predictions on the 121 by 71 grid with a cell size of 765 meters, and semivariogram model (the figure
at bottom right) estimated from the data and used for the predictions. Lakes and streams are shown in blue,
and large villages are displayed as hatched polygons.
Figure 10.36
Figure 10.37 shows a close‐up view of the left part of the data domain. The surface consists of kriging
predictions. The contour of one of the villages is covered by 17 grid cells, shown in gray. The histogram shows
the distribution of simulated values in all these grid cells, based on 100 simulations per cell. The mean cesium
value in the village is 2.75 Ci/km2. The 97.5th and 2.5th percentiles of the distribution of the simulations can
be computed. Then the 95 percent credible interval of cesium soil contamination
[2.5th percentile, 97.5th percentile]=[0.53, 4.79]
gives an estimated range of values that likely contains the true average cesium contamination in the village.
Other credible intervals can also be computed; for example, the 90 percent and 80 percent credible intervals:
[5th percentile, 95th percentile]= [0.94, 4.54] and
[10th percentile, 90th percentile]= [1.39, 4.08]
The use of higher percentiles in the credible intervals requires more simulations. For example, the 99th
percentile of a distribution cannot be estimated accurately if only 100 simulations are available.
The average kriging prediction in the 17 grid cells that cover the village is 2.7 Ci/km2. This value is close to
the mean of the simulations, 2.75 Ci/km2, and if all that is needed is the average value, the kriging prediction
is sufficient. However, there is a problem with reliable estimation of the kriging prediction standard error of
the areal average because it is incorrect to use the arithmetic average of the dependent standard deviations.
Estimation of the uncertainty of the mean value in the polygon is straightforward in the case of conditional
geostatistical simulations. Simple statistics displayed by the Histogram tool show that the standard deviation
of 1,700 simulated values equals 1.08.
The proportion of simulated values greater than 2.1 Ci/km2 is 1,240/1,700 = 0.73. This value can be
interpreted as an estimated probability that the value 2.1 Ci/km2 is exceeded. The proportion of values
greater than 4 Ci/km2, indicated by the blue histogram bars, is 200/1,700 = 0.12.
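In R, these summaries are one-line computations once the simulated values for the cells covering the village are collected in a vector; the values below are random stand-ins, not the Belarus data:

sims <- rgamma(1700, shape = 4, rate = 1.5)    # stand-in for the 17 cells x 100 simulations
quantile(sims, c(0.025, 0.975))                # 95 percent credible interval
quantile(sims, c(0.05, 0.95))                  # 90 percent credible interval
sd(sims)                                       # standard deviation of the simulated values
mean(sims > 2.1)                               # estimated probability of exceeding 2.1 Ci/km2
mean(sims > 4.0)                               # estimated probability of exceeding 4 Ci/km2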
Figure 10.37
Figure 10.38 shows the 95th percentile of the 500 simulated values at each grid cell, smoothed using the
raster visualization option “bilinear interpolation.” The data for visualization was created using the following
algorithm:
Fix a grid point.
Compute the 95th percentile of the 500 simulated values in that cell (the value below which 95 percent of the simulated values fall); the short sketch after this list shows the computation for all cells at once.
Repeat for all grid nodes.
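Assuming the simulations are stored as a three-dimensional array with one layer per realization (illustrative random values below), the whole percentile surface can be computed at once in R:

sims <- array(rnorm(50 * 50 * 500, mean = 3, sd = 1), dim = c(50, 50, 500))  # rows x cols x simulations
p95  <- apply(sims, c(1, 2), quantile, probs = 0.95)   # 95th percentile in every grid cell
image(p95)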
The resulting surface displays an overestimated cesium soil contamination value, one that is exceeded with only 5 percent probability according to the model. This map can be used for decision making if the goal is to protect the most sensitive part of the population living within the contaminated territory.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 10.38
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 10.39
Figure 10.40 shows a map of the proportion of 500 simulated values above a threshold value of 3 Ci/km2,
calculated as
(number of simulated values greater than 3 Ci/km2)/(total number of simulations)
for each grid cell.
The surface in figure 10.40 can be interpreted as a probability map that the specified threshold of 3 Ci/km2
was exceeded.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 10.40
If a transfer function (such as a formula for irradiation dose estimation) from known to unknown variables is
not available, regression models can be used to find the relationships between the response and the
explanatory variables, using a sufficient number of samples. Figure 10.41 shows corn yields collected by a
combine equipped with GPS receivers and averaged by block kriging on the grid with a cell size of 30 meters.
Figure 10.41
Several explanatory variables that influence yield were measured at several hundred locations, and
predictions of their values were made to the grid cell locations in figure 10.41. For example, figure 10.42
shows maps of cation exchange capacity (CEC) predictions and prediction standard errors. The maps depict
relatively large variability of CEC.
Figure 10.42
A linear regression model (discussed in appendix 2) can be used to estimate the regression coefficients a0 - a4 of the model
yieldi = a0 + a1⋅pHi + a2⋅CECi + a3⋅Cai + a4⋅slopei,
where i = 1, 2, ..., 1225 is the cell number in the map above.
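A sketch of this estimation in R with lm(), using simulated stand-in values for the 1,225 cells (the column names and the coefficients used to generate the data are purely illustrative):

set.seed(2)
n     <- 1225
cells <- data.frame(pH    = rnorm(n, 6.5, 0.3),
                    CEC   = rnorm(n, 12, 3),
                    Ca    = rnorm(n, 1500, 300),
                    slope = runif(n, 0, 5))
cells$yield <- 50 + 2 * cells$pH + 0.5 * cells$CEC + 0.01 * cells$Ca - cells$slope + rnorm(n, 0, 5)

fit <- lm(yield ~ pH + CEC + Ca + slope, data = cells)   # coefficients a0-a4
summary(fit)                                             # estimates, standard errors, significance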
Figure 10.43
The distributions of regression coefficients are shown in figure 10.44. According to linear model diagnostics,
all coefficients are significant because the regression coefficient standard errors are small as shown in table
10.3.
Table 10.3
A large number of samples for model calibration can be collected only rarely, but the model for yields can be used as a transfer function for estimating next year's yield on that field and on similar nearby fields, because the cost of collecting several dozen measurements of pH, cation exchange capacity, and calcium is acceptable for many farmers. With these samples, conditionally simulated explanatory variables can be used for estimating the distribution of the variable of interest, yield, over the entire field and its parts.
Figure 10.45 shows one iteration of the transfer function usage. Circles show locations of samples at another,
but similar, field. Maps to the left of the sign “=” are the conditional simulations of the explanatory variables.
Note that the regression coefficients are estimated with uncertainty, and it makes sense to simulate their values from the multivariate normal distribution assumed by linear regression so that the coefficients vary when the model is used repeatedly.
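A minimal sketch of this idea, assuming fit is the lm() object from the sketch above; the coefficients are drawn from the multivariate normal distribution implied by their estimates and covariance matrix:

library(MASS)                                                # mvrnorm()
beta_sim <- mvrnorm(100, mu = coef(fit), Sigma = vcov(fit))  # 100 simulated coefficient sets
head(beta_sim)                                               # one row per use of the transfer function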
Figure 10.45
Figure 10.46
Figure 10.47 shows an interpolated map of the residuals (the difference between the yield, figure 10.41, and its prediction using the linear regression model). The semivariogram model at right shows that the residuals have a clear spatial structure, meaning that the model systematically overestimates yield values in some areas and underestimates them in other areas. Because linear regression assumes that the observations are independent, though in our case they are not, the estimate of the number of degrees of freedom is too high and the estimated standard errors are too low, resulting in unrealistically high significance of the regression coefficients.
The last statement can be illustrated using the following example. Assume that a small number of samples of the response and explanatory variables is regularly distributed over the field, perhaps a dozen. Twelve samples are not enough to get significant results, so data are collected at a dozen additional locations, each close to one of the originally sampled locations. Usually pH, cation exchange capacity, calcium, and slope do not change abruptly; therefore, very little additional information is added. However, linear regression assumes that there are two dozen independent samples, while there are actually only about a dozen. This results in underestimation of the standard errors of the regression coefficients because of overestimation of the number of degrees of freedom in the linear model.
Figure 10.47
Therefore, in this case, the transfer function should be estimated more accurately using spatial regression
models discussed in chapters 6, 12, 15, and 16 and appendix 4. A good candidate is kriging with external
trend model (known as linear mixed model in statistical literature), which simultaneously estimates the
regression coefficients and the parameters of the semivariogram model based on maximum likelihood or
restricted maximum likelihood techniques.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 10.48
A question may arise: How many simulations are required to properly characterize the data? Because the
result of calculations should not depend on the number of simulations, one way to answer this question is to
compare some statistics of different numbers of simulations in a typical small area of the data domain. It is
expected that the statistics tend to a fixed value as the number of simulations increases. Therefore, the task is to find when the statistic of interest, such as the mean value, becomes nearly constant as the number of simulations grows.
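For example, the running mean of a polygon statistic can be plotted against the number of simulations; the per-simulation values below are random stand-ins:

poly_mean <- rnorm(500, mean = 2.7, sd = 0.4)            # one mean value per simulated surface
running   <- cumsum(poly_mean) / seq_along(poly_mean)    # mean after 1, 2, ... simulations
plot(running, type = "l",
     xlab = "number of simulations", ylab = "running mean of the polygon statistic")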
Since geostatistical simulation is a time‐consuming process, it is desirable to choose as large an output grid
cell size as possible. This can be done by iteratively decreasing cell size and comparing the statistics for a
selected subregion of the data domain. With coarse grids, it is possible to make more simulations and do more
experiments with different model parameters at the same time.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 10.49
Polygonal statistics such as minimum, maximum, first quartile, median, third quartile, mean, standard
deviation, and proportion of values that exceed a specified threshold are useful in decision making, and they
are used in various applications. Note, however, that these statistics can also be calculated using simple
kriging predictions on the fine grid because the average of many simulations tends toward the simple kriging
predictions. What cannot be calculated with kriging (but only with conditional simulations) is the uncertainty
of the statistics above: “statistics of statistics.” For example, mean value can be calculated in the polygon using
the first simulated surface, then the second simulated surface, and so on, and then a statistical summary of the
mean value in the polygon—including minimum, maximum, first quartile, median, third quartile, mean, and
standard deviation—can be presented.
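A minimal sketch of such a "statistics of statistics" calculation, assuming the simulated values in the 17 cells are stored column by column, one column per realization (random stand-ins below):

sims_in_poly <- matrix(rgamma(17 * 500, shape = 4, rate = 1.5), nrow = 17)  # 17 cells x 500 simulations
poly_means   <- colMeans(sims_in_poly)     # the polygon mean for each simulated surface
summary(poly_means)                        # minimum, quartiles, median, mean, maximum
sd(poly_means)                             # uncertainty of the polygon mean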
To estimate the total amount of the variable of interest, such as cesium-137 contained in the soil in each Byelorussian district, the mean value is multiplied by the area of the polygon. If statistics of the mean are available, other possible totals can also be calculated using, for example, the first and third quartile values.
Statistics of statistics are also important for estimating the uncertainty of the proportion of simulated values above the threshold because they help in making informed decisions about such actions as evacuating people from the region or beginning mining in the region.
Figure 10.50
Figure 10.51
Table 10.4 shows the proportion of values above the threshold of 39.7 feet estimated using simple kriging
with detrending and transformation options (first column) and statistics for 100 proportions calculated using
geostatistical conditional simulations (columns 2‐8). The decision making about whether to begin mining is
difficult if only the proportion of the values greater than 39.7—calculated either by kriging or geostatistical
simulations—is available, because this proportion is close to the lower permissible value of 60 percent. The
statistics based on conditional geostatistical simulations help in evaluating possible values of the proportion
of interest. In particular, about 75 percent of the calculated proportions have values greater than 60 percent.
However, all these values are not much larger than the threshold, so the decision whether to begin mining is
still difficult.
Table 10.4
Figure 10.52
ASSIGNMENTS
1) FIND OPTIMAL PLACES FOR THE ADDITION OF NEW STATIONS TO MONITOR
AIR QUALITY IN CALIFORNIA.
Figure 10.53 at left shows 5,707 populated places in California. Suppose we want to establish monitoring
stations to monitor air quality in 100 places. They can be selected randomly as shown with blue circles in
figure 10.53 at right. This monitoring network design mimics the population density in California.
Figure 10.53
Figure 10.54 at right shows a map of the prediction standard errors of the concentration of particulate matter with an aerodynamic diameter of 2.5 micrometers, using observations located in the places shown by the pink rectangles. Suppose we want to add 50 new monitoring stations to the monitoring network and we want
to randomly select five locations within the areas with the prediction standard error smaller than 8.73, eight
locations in the areas with errors between 8.73 and 13.0, 10 locations in the areas with errors between 13.0
and 15.1, 12 locations in the areas with errors between 15.1 and 16.13, and 15 locations in the areas with
errors larger than 16.13. The resulting 50 locations for new monitoring stations are shown in yellow.
Figure 10.54
Install the software following the instruction at the Web site, and upgrade the air quality monitoring network
to 300 stations using information on the locations of the California places and the Geostatistical Analyst’s
tutorial data, ozone and particulate matter with aerodynamic diameter 10 micrometer concentrations,
available in the folder assignment 10.1. Justify your decision. This algorithm will be available in the next
version of Geostatistical Analyst.
2) SIMULATE A SET OF CANDIDATE SAMPLING LOCATIONS FROM
INHOMOGENEOUS POISSON PROCESS.
After reading appendix 2 (and perhaps the beginning of chapter 13), use the function rpoispp() from the R package spatstat to simulate approximately 300 new monitoring locations using the point intensity defined by the raster image created with Geostatistical Analyst using the air pollution data from assignment 1. The Esri grid file can be read using the function read.asciigrid() from the R package sp. Explain qualitatively the difference between this sampling design algorithm and the method used in assignment 1.
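A possible starting point is sketched below; the file name is hypothetical, the conversion from the sp grid to a spatstat image follows one common route, and the rescaling so that roughly 300 points are expected is an added assumption:

library(sp)        # read.asciigrid()
library(spatstat)  # im(), eval.im(), rpoispp()

grd <- read.asciigrid("pm10_intensity.asc")            # hypothetical exported Esri ASCII grid
img <- as.image.SpatialGridDataFrame(grd)              # list with x, y, and the z matrix
lam <- im(t(img$z), xcol = img$x, yrow = img$y)        # spatstat intensity image

expected <- sum(lam$v, na.rm = TRUE) * lam$xstep * lam$ystep   # approximate integral of the intensity
lam <- eval.im(lam * (300 / expected))                 # rescale so about 300 points are expected

pts <- rpoispp(lam)                                    # inhomogeneous Poisson point pattern
plot(pts)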
3) REDUCE THE NUMBER OF MONITORING STATIONS IN THE NETWORK USING
VALIDATION DIAGNOSTICS.
The goal of all monitoring network design algorithms discussed in this chapter is to select the best locations
for additional monitoring stations. There are situations when several monitoring stations should be removed
from the network, for example, because of limited financial support. One idea is to remove monitoring
stations at which the value of the monitoring variable can be predicted with small error using observations
from the remaining monitoring stations.
Develop the algorithm for removing 10 monitoring stations from the air pollution monitoring network using
validation diagnostics. Use that algorithm with data from assignment 1.
4) DISCUSS TWO SIMULATION ALGORITHMS THAT ARE BASED ON ESTIMATED
LOCAL MEAN AND STANDARD ERROR.
4a) A prediction interval can be constructed using the kriging prediction and the prediction standard error or using specified quantile values. For example, figure 10.55 shows the range of possible values of air pollution using the 0.9, 0.5, and 0.1 quantile surfaces (from top to bottom).
Figure 10.55
Suppose that the kriging model is optimal and both kriging predictions and prediction standard errors are
calculated accurately. Then a surface with simulated values can be produced using the following formula:
prediction + (random value between 0 and 1)×(two prediction standard errors)
4b) For raster data, the local mean and standard deviation can be calculated using Spatial Analyst function
focalStatistics. Then a value in the raster cell can be simulated as
local mean + NORMAL(0,1)·(local standard deviation),
where NORMAL(0,1) is a simulated value from standard normal distribution with mean zero and standard
deviation one.
Is it a good idea to use the two algorithms above in addition to the conditional simulation algorithms
discussed in this chapter?
5) USE CONDITIONAL SIMULATION WITH GEOSTATISTICAL ANALYST 9.3.
If you have Geostatistical Analyst version 9.3, read the documentation and, using cesium137 data from folder
assignment 10.5,
5A) REPRODUCE THE MAP SHOWN IN FIGURE 10.12.
5B) REPRODUCE THE MAPS SHOWN IN FIGURE 10.48.
FURTHER READING
1. Stevens, D. L. Jr., and A. R. Olsen. 2004. Spatially balanced sampling of natural resources. Journal of the American Statistical Association 99(465): 262–278.
This paper discusses a strategy for designing probability samples of discrete, finite resource populations,
such as lakes within some geographical region; linear populations, such as a stream network in a drainage
basin; and continuous, two‐dimensional populations, such as forests.
2. Xia, G., M. L. Miranda, and A. E. Gelfand. 2006. Approximately optimal spatial design approaches for
environmental health data. Environmetrics 17(4):363–385.
This paper discusses approximately optimal spatial design in the case of one‐time sampling at a large number
of spatial locations. The goal is simultaneously learning about the spatial dependence structure, the response
variable, and the explanatory variables. The authors show that these objectives can be in conflict. Similar to
the paper in reference 1, the new samples are selected among counted number of possible locations.
3. Lantuejoul, C. 2002. Geostatistical Simulation: Models and Algorithms. Oxford: Springer‐Verlag, p. 256.
This book discusses geostatistical simulation tools, algorithms, and models that are currently used in practice
or intensively discussed in the specialized literature. The expected readers are undergraduate and
postgraduate students at statistical and engineering departments of universities.
4. Chiles, J. P., and P. Delfiner. 1999. Geostatistics: Modeling Spatial Uncertainty. New York: John Wiley & Sons,
p. 695.
Chapter 7 of this book describes the most commonly used geostatistical simulation models. The math is easier
than in Lantuejoul’s book, and the text can be understood by a larger number of GIS users.
5. Krivoruchko, K., and A. Gribov. 2004. "Geostatistical Interpolation in the Presence of Barriers," geoENV IV—
Geostatistics for Environmental Applications: Proceedings of the Fourth European Conference on Geostatistics
for Environmental Applications (Quantitative Geology and Geostatistics), 331–342.
The authors developed a kernel convolution model for prediction and simulation using a non‐Euclidean
distance metric.
6. Schabenberger, O., and C. A. Gotway. 2004. Statistical Methods for Spatial Data Analysis. New York:
Chapman & Hall/CRC, p. 488.
The authors of this book are statisticians, and this book is primarily for students and researchers from
statistical departments of universities. The theoretical foundations of geostatistical simulations are presented
in chapter 7.
7. Bellehumeur C., D. Marcotte, and P. Legendre. 2000. “Estimation of Regionalized Phenomena by
Geostatistical Methods: Lake Acidity on the Canadian Shield.” Environmental Geology 39(3–4):211‐222.
This simple case study is a good example of direct use of conditional geostatistical simulations (averaging).
The authors investigate whether acid rain control programs will succeed in reducing acidity in surface waters
on the Canadian Shield.
The authors proposed a modification of kriging and conditional geostatistical simulation that produces
continuous surfaces.
9. Emery, X. 2005. “Conditional Simulation of Random Fields with Bivariate Gamma Isofactorial
Distributions.” Mathematical Geology 37(4):419–445.
Emery, X. 2008. “Substitution Random Fields with Gaussian and Gamma Distributions: Theory and
Application to a Pollution Dataset.” Mathematical Geosciences 40(N1): 83‐100.
The author of these papers discusses conditional geostatistical simulation based on specific bivariate
continuous data distributions, largely Gaussian and gamma.
10. Armstrong, M., A. Galli, G. Le Loc’h, F. Geffroy, R. Eschard. 2003. Plurigaussian simulations in geosciences.
Berlin: Springer. 160 pp.
This book discusses conditional geostatistical simulation methods for simulating categorical variables.
PRINCIPLES OF MODELING REGIONAL DATA
GEOSTATISTICS AND REGIONAL DATA ANALYSIS
THE QUESTION OF APPLYING GEOSTATISTICS TO REGIONAL DATA
BINOMIAL AND POISSON KRIGING
DISTANCE BETWEEN POLYGONAL FEATURES
REGIONAL DATA MODELING OBJECTIVES
SPATIAL SMOOTHING
CLUSTER DETECTION METHODS
SPATIAL REGRESSION MODELING
SIMULTANEOUS AUTOREGRESSIVE MODEL
MARKOV RANDOM FIELD AND CONDITIONAL AUTOREGRESSIVE MODEL
ASSIGNMENTS
1) INVESTIGATE THE NEW PROPOSAL OF MAPPING THE RISK OF DISEASE
2) SMOOTH THE DATA FOR THE TAPEWORM INFECTION IN RED FOXES
3) SPATIAL CLUSTERS DETECTION USING R PACKAGE DCLUSTER
FURTHER READING
Regional data, what statisticians call “lattice data,” are often presented in statistical textbooks as data averaged from continuous variables and assigned to discrete nodes such as centers of grid cells. However, with the exception of agricultural and image data, which involve continuous variables, nearly all regional data are aggregated; that is, they are a count of events for a known population within a polygon. Aggregated data present additional challenges for analysis since we must also factor in differing population sizes. This challenge comes from the fact that the variability of a rate estimated from a count of 1 out of 1,000 is much larger than the variability of the same rate estimated from a much larger population.
This chapter begins with a comparison of the geostatistical and regional data analysis concepts. Then regional
data modeling objectives are presented and discussion of the regional data analysis begins with spatial
smoothing. Several approaches are discussed and illustrated using male lung cancer data collected in
California counties. Then cluster detection methods are discussed with emphasis on their assumptions. First,
the Moran’s I indices (global, local, and cross) are discussed in detail, and then one modification of those
indices, Walter’s modified I, is presented.
Next, spatial regression models are introduced and compared with kriging. Then simultaneous and
conditional autoregressive models are presented and illustrated using crime rates in California counties and
the level of happiness of people living in European countries.
GEOSTATISTICS AND REGIONAL DATA ANALYSIS
In geostatistics, the semivariogram model defines the statistical distance between pairs of points, and weights
are calculated from the semivariogram model. In regional data analysis, weights are defined first, and then
data correlation is estimated using these weights.
To define weights, neighbors for each region (polygon) are chosen first. This task can be done in a number of
ways. In figure 11.1 at left, green polygons are selected as neighbors of the pink polygon because they fall into
a specified distance range between polygon centroids. Figure 11.2 at right shows the choice of neighbors
based on the common border between polygons. These two variants of the neighborhood selection are the
most popular in the available software packages because they are relatively easy to program.
These two methods of selecting neighbors often encounter problems, from simple geometrical difficulties
illustrated in figure 11.1 to more complicated ones, such as the straight line distance between polygon
centroids not being a good measure of the similarity of economic, crime, or epidemiological data. Even
without knowing the origin of the data in figure 11.1, we may ask why the pink polygon in the left figure has
the two easternmost polygons and one northernmost polygon as its neighbors, and why the one in the right
figure has no neighbors to the south. This often happens when polygons that otherwise meet the neighboring
criteria lie near the edge of the domain or are separated by water. These polygons could be manually included
in the neighborhood based on the expert knowledge about a process that supposedly generated the data.
Figure 11.1
This is a good reason to choose a particular variant of the neighborhood for each polygon individually.
However, in practice, researchers often use one variant for all polygons with little thought, believing
that it is better to use some spatial neighbors than to ignore them.
Figure 11.2
Regional data analysis tools and models require specification of spatial relationships among the nearby
objects. These relationships are called spatial weights. After selecting neighbors for each polygon, weights can
be assigned to the neighbors using the geometrical relationships between polygons. For example, weights of
the neighboring polygons can be inversely proportional to the distance between centroids or the weights can
be proportional to the length of the common border. Weights of the non‐neighboring polygons are usually set
to zero.
Usually a simple model for spatial dependence between a polygon and its neighbors is assumed through the
use of expression ρ⋅W, where W is a known weights matrix and ρ is the spatial dependence parameter to be
estimated. Surprisingly, in contrast to geostatistics, the weights alone are rarely of interest in regional data
analysis. Their usual role is improving estimation of the regression coefficients that are of concern.
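As an illustration, the following R sketch builds the two popular neighborhood definitions described above with the spdep package and converts one of them into row‐standardized spatial weights. The polygon layer name counties, the file name, and the 100‐kilometer distance threshold are assumptions chosen only for this example.

library(sf)     # reading and handling the polygon layer
library(spdep)  # neighbor lists and spatial weights

# 'counties' is assumed to be a polygon layer with one row per region
counties <- st_read("counties.shp")                            # hypothetical file name
coords   <- st_coordinates(st_centroid(st_geometry(counties)))

# Neighbors chosen by a distance range between polygon centroids (as in figure 11.1);
# the 0-100 km range assumes projected coordinates in meters
nb_dist <- dnearneigh(coords, d1 = 0, d2 = 100000)

# Neighbors chosen by a common border between polygons (as in figure 11.2)
nb_border <- poly2nb(counties, queen = TRUE)

# Row-standardized weights: each row sums to 1, and weights of
# non-neighboring polygons are zero
lw <- nb2listw(nb_border, style = "W", zero.policy = TRUE)
W  <- nb2mat(nb_border, style = "W", zero.policy = TRUE)       # the same weights as a matrix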
THE QUESTION OF APPLYING GEOSTATISTICS TO REGIONAL DATA
It is possible to use geostatistics with regional data simply by identifying each polygon by its centroid (or
sometimes by some other point inside each polygon). In Geostatistical Analyst, if the input data are polygons,
identification is done automatically using centroids.
However, conventional kriging may not be the best approach for regional data analysis because of the
following problems:
1. Use of centroids ignores the sizes, shapes, and orientations of the polygons. However, the data
variance decreases as the polygon size increases. Centroid locations can be considered as polygons
with infinitely small areas.
3. In some situations, events occur in areas that are not well‐defined. For example, animals live and
travel in groups over a territory whose shape may be polygonal, but the boundaries are known only
approximately. In this case, the straight‐line distance between polygons’ centroids may not be a good
measure of the data proximity. However, an animal researcher can often define in general terms how
close one group of animals is to another.
4. Use of rates (proportion of events observed in a region per 1,000 or 100,000 persons) in polygons
violates the requirement of data stationarity because population often changes abruptly on the
border between neighboring polygons.
5. It does not account for the geographical distribution of the population in the case of modeling
epidemiological or crime data.
6. The variance of regional data is often a function of the mean, and conventional kriging does not
preserve this relationship.
Figure 11.3 at left shows California counties and urban areas (in yellow). Red crosses indicate the centroids of
the county polygons. For most counties, the centroids are not where the majority of people live; see, for
example, San Bernardino County in green and Los Angeles and Riverside counties in gray. Any statistical
model that uses the centroid locations as data coordinates may give misleading results if the analyzed data
are related to the population (such as disease and crime rates).
One reason that the population in California is denser in some parts than in others is that a large part of the
territory is either desert or mountain. However, in geologically and climatically homogeneous Belarus, the
situation is not much different. In figure 11.4, compare the population density (low density is in
green and high density is in brown) with the centroid locations (black crosses).
Figure 11.3
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 11.4
Although prediction is not the primary goal of regional analysis, since data are usually available in each
polygon, there is nothing wrong with using conventional geostatistical tools and models for regional data
exploration. For example, figure 11.5 shows ordinary kriging prediction of male lung cancer rates using
semivariogram models with the same range parameters, but with different shapes defined by exponential and
Gaussian models. Prediction surfaces are different locally but large‐scale spatial variation is the same and this
is sufficient for data exploration.
Figure 11.5
To deal with problems 5 and 6 in the list in the beginning of this section, binomial and Poisson kriging were
proposed. Problem 1 is discussed in “Regional Data Aggregation and Disaggregation” in chapter 12.
BINOMIAL AND POISSON KRIGING
Development of the binomial kriging model was motivated by the need to map spatial distributions of the
abundance of individual bird species in South Africa. The field data were collected by observers, and
observations were aggregated in quarter degree “squares” of 15 minutes of a degree latitude by 15 minutes of
a degree longitude. The ratio of the number of observations indicating the presence of a given species to the
total number of observations was calculated for each polygon. In some areas the total number of observations
was large and in some other areas it was very small or zero. The variation of reporting rates was
considerable. The measurement error depends on the sample size and the unknown mean value. Therefore,
prior to mapping, it is appropriate to smooth the data using, for example, a filtered version of kriging. The
problems are that conventional kriging does not respect the relationship between mean and variance, and
that the empirical semivariogram of the raw data is not a good estimator in the case of binomial data.
Denote by ni the total number of observations in polygon i, by yi the number of observations indicating the
presence of the individual bird species in polygon i, which is binomial, binomial(ni, πi), and by ri = yi/ni the
reporting rate. It can be shown (see reference 1 in the “Further reading” section) that the modified empirical
semivariogram for polygons j and k is defined as

$$\hat{\gamma}(j,k) = \frac{1}{2}(r_j - r_k)^{2} - \frac{1}{2}\big(\hat{\mu}(1-\hat{\mu})-\hat{\sigma}^{2}\big)\sum_{r=j,k}\frac{1}{n_r}\,,$$

where the global mean and variance of the risk are estimated as

$$\hat{\mu} = \frac{\sum_{i=1}^{N} y_i}{\sum_{i=1}^{N} n_i}\,,\qquad
\hat{\sigma}^{2} = \frac{\sum_{i=1}^{N}(r_i-\hat{\mu})^{2} - \hat{\mu}(1-\hat{\mu})\sum_{i=1}^{N}\frac{1}{n_i}}{N-\sum_{i=1}^{N}\frac{1}{n_i}}\,,$$

where N is the number of polygons.
Binomial kriging is the best linear unbiased estimator of each risk πi in the form of a weighted sum
λ1r1 + λ2r2 + … + λMrM of the neighboring rates, where M is the number of observations in the kriging
neighborhood and the weights λj are chosen to minimize the mean squared prediction error.
The binomial kriging variance depends not only on the geometrical configuration of the centroids’ locations,
but also on the data values. However, binomial kriging does not use information on the sizes and shapes
of the regions, indirectly assuming that they are all the same (which can, of course, be true, as in the case of bird
counting in South Africa). Since the modified semivariogram is defined not for points, but for aggregated data,
binomial kriging can be used to predict the probability in the polygons with or without (that is, missing)
observations. Therefore, binomial kriging can be used for smoothing regional data.
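The estimators above are simple to compute directly. The following base R sketch implements the formulas given in this section for vectors of counts y and totals n (both names are assumptions); it returns the global mean, the global variance, and the modified semivariogram value for a chosen pair of polygons.

# y: counts of observations of the species in each polygon
# n: total numbers of observations in each polygon; r = y/n are the reporting rates
binomial_moments <- function(y, n) {
  r      <- y / n
  mu     <- sum(y) / sum(n)                                 # global mean
  sigma2 <- (sum((r - mu)^2) - mu * (1 - mu) * sum(1 / n)) /
            (length(n) - sum(1 / n))                        # global variance of the risk
  list(mu = mu, sigma2 = sigma2, r = r)
}

# Modified empirical semivariogram value for polygons j and k
binomial_gamma <- function(j, k, y, n) {
  m <- binomial_moments(y, n)
  0.5 * (m$r[j] - m$r[k])^2 -
    0.5 * (m$mu * (1 - m$mu) - m$sigma2) * (1 / n[j] + 1 / n[k])
}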
Binomial kriging was used in epidemiology for mapping cancer rates. In this case, a smoothly varying cancer
risk was assumed, meaning that a hypothetical person living at place s has a probability π(s) of having a
diagnosed cancer during the period under investigation. The number of cancer cases can be found in cancer
registries, and the population at risk, ni, for the administrative region i is known from the census. The numbers
of cases yi for each region are independent binomial variables with parameters ni and πi, and the probability πi
is modeled as a realization of a random function for which the semivariogram model for any distance h between
polygon centroids is known.
Figure 11.6 at left shows maps of predicted thyroid cancer risk for children who lived in Belarus from 1986 to
1994 using conventional ordinary kriging. Data were collected from 1986 to 1994 in 117 Belarusian districts.
Belarusian districts have similar sizes, and binomial kriging (figure 11.6 at right) is applicable to the thyroid cancer data.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 11.6
Unobserved risk factors may occupy smaller areas than typical administrative regions, and the prediction
surface can be more informative if binomial cokriging with covariates measured at a finer scale is used.
Estimating animal abundance has motivated another modification of conventional kriging, this time Poisson
kriging. This modification considers Poisson‐distributed data with a mean parameter that is the product of
the observation time at a particular area and the expectation of sighting for a unit of observation time. Like
binomial kriging, the semivariogram estimation formula is modified, and this leads to different kriging
weights and, therefore, different predictions and prediction standard errors.
Detailed information on binomial and Poisson kriging can be found in references 1 and 2 in “Further reading.”
A summary on both kriging modifications is presented in chapter 12 after discussing disease‐mapping based
on the hierarchical Poisson and binomial models.
It is important to note again that it is unsafe to use conventional kriging as a distribution‐free model because
its predictions and prediction standard errors can be very different when the data come from different
distributions. Recall that the formula for the empirical semivariogram is different for Gaussian, binomial, and
Poisson distributions.
DISTANCE BETWEEN POLYGONAL FEATURES
Distance between spatial objects cannot be defined uniquely when the objects have nonzero areas. There are
situations in which regions may overlap or be disjoint. In some applications, the distance between spatial
objects can be based on travel time between polygons. Flexible spatial proximity measures can be calculated
using a cost surface, which can be constructed from many geographical layers and, therefore, be very
complex.
Weights of the nearby objects are defined by a proximity measure, which specifies how each neighboring
value contributes to the local statistic. Some expressions for the weights are listed below, with hij being the
distance between regions i and j, lij being the length of the common border, and lij/lj being the proportion of
the common border.
where δ, ν, and β are constants specified by the researcher.
Similar to the gravity model, weights can be defined as
where Aj and Bi are some attributes of the regions and ϕ is a constant. For example, Aj could be the population
in the region j, and Bi could be the number of health care providers in the region i.
Weights can also be defined as combinations of the weights given in the formula above, for instance,
Since a covariance model can be interpreted as a statistical distance between data locations, it can be used for
defining regional data weights if it makes sense to use a straight‐line distance between polygon centroids as a
distance measure between regions and if the covariance model is assumed to be known or can be estimated
from the data:
With scientific guidance about how the spatial weights should be chosen, any of the approaches above can be
applicable. However, it is a good idea to compare the outputs of local indices of spatial association and the
spatial regression models discussed below using different weights to find out where the outputs are
significantly different. Such instances may require special attention since the cause could be the unusual data
values that require verification, or it could be because there is a problem with the model itself.
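Because the exact form of the weights is a modeling choice, the base R sketch below only illustrates a few commonly used candidates: an inverse‐distance power weight, a common‐border proportion, and a gravity‐type weight. The constants delta and phi and the input names (h, l, A, B) are assumptions for the example and are not taken from the expressions omitted above.

# h[i, j]: distance (or travel cost) between regions i and j
# l[i, j]: length of the common border between regions i and j
# A, B   : attributes of the regions (for example, population and number of providers)

inverse_distance_w <- function(h, delta = 2) {
  w <- 1 / h^delta                       # weight decreases with distance
  diag(w) <- 0                           # a region is not its own neighbor
  w
}

border_share_w <- function(l) {
  w <- sweep(l, 1, rowSums(l), "/")      # l[i, j] divided by the total border length of region i
  w[is.nan(w)] <- 0                      # regions without any common border get zero weights
  w
}

gravity_w <- function(A, B, h, phi = 2) {
  w <- outer(B, A) / h^phi               # attribute product divided by a power of distance
  diag(w) <- 0
  w
}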
REGIONAL DATA MODELING OBJECTIVES
Researchers working with regional data are typically interested in the following:
Knowing how to construct a valid map of regional data. Displaying raw counts or rates gives a
misleading picture of the geographical distribution of disease or crime. For example, a county
with a high number of cases may have such a high number simply because there are more people
in that county.
Finding an objective answer to questions such as whether crimes are clumped together with
substantial empty areas, and if so, to find the reason for the increased levels of crime in certain
areas.
SPATIAL SMOOTHING
Smoothing of rates is desirable if the rates are based on small numbers, that is, either the numerator (number
of events) or the denominator (population) forming the rate is small. Smoothing is also desirable if the rates
are based on markedly different population sizes so the variability associated with each rate estimate is
substantially different. With smoothing, each raw rate is replaced with a smoothed value calculated from
weighted neighboring data.
Although binomial and Poisson kriging can be successfully used for regional data smoothing, commonly used
smoothing methods are different, and some of them will be discussed in this section.
Let y1, y2, …yN be event counts (that is, a number of disease cases or number of crimes) associated with N
polygons, and assume that each polygon also has a population value n1, n2,… nN. Denote the proportions
(rates) as r1, r2, … rN. Several locally weighted functions can be calculated, including the locally weighted mean
of the rates

$$\bar{r}_j = \frac{\sum_{i=1}^{N'}\omega_{ij}\,r_i}{\sum_{i=1}^{N'}\omega_{ij}}$$

and the locally weighted standard deviation of the rates

$$\sigma_j = \sqrt{\frac{\sum_{i=1}^{N'}\omega_{ij}\,(r_i-\bar{r}_j)^{2}}{\frac{N'-1}{N'}\sum_{i=1}^{N'}\omega_{ij}}}\,,$$

where N′ is the number of polygons in the neighborhood of polygon j and ωij are the neighborhood weights.
The empirical Bayesian approach to smoothing does not specify weights but derives them from a more formal
probability model, and it is a precursor to statistical models for counts and rates. This approach is predicated
on the assumption that the observed rates are noisy measurements of an underlying relative risk (of being a
crime victim or of contracting a disease) associated with each region. We denote these risks as the random
variables ξ1, ξ2, …, ξN. We assume that depending on these true but unobserved relative risks, the observed
counts are a realization from a Poisson distribution with expected means E(Yi|ξi)=niξi and variances
var(Yi|ξi)=niξi, since for the Poisson distribution, the mean is equal to the variance (see chapter 4). To
construct a Bayesian estimator, we need to make some assumptions about the probability distribution (also
called the prior distribution) of ξi. There are many possible assumptions, and hence, many different empirical
Bayesian estimators. Denoting the prior mean and variance of ξi by m and v, one commonly used choice leads to
the estimator

$$\hat{\xi}_i = m + C_i\,(r_i - m)\,,$$

with a shrinkage factor Ci that depends on m, v, and the regional population ni.
The factor Ci, called the shrinkage factor, is the ratio of the prior variance to the data variance. When the
observed rates are based on small populations, ni is small; Ci tends to zero, and the Bayes estimator is close to
the prior mean, m (there is a lot of local smoothing). However, when the population is large, Ci tends to 1 and
the Bayes estimator approaches the observed rate (there is little local smoothing).
We need to determine values for m and v to actually compute the estimates. With the empirical Bayesian
smoother, these unknown parameters are estimated from the data by the method of moments from the
observed rates and the regional populations (one common choice of these estimates is shown in the sketch
below).
The empirical Bayesian method described above is called global smoothing because the Bayesian estimates
tend toward the global mean of all regions. The modification consists of using the local mean based on
neighboring regions. This is termed local smoothing. Local smoothing is not necessarily better than global
smoothing. Global smoothing may perform better than local smoothing because regions with higher or lower
population density cluster into urbanized regions and rural surroundings, respectively, so the mean and
variance that are estimated with only several neighboring values can be inaccurate.
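A minimal sketch of the global empirical Bayesian smoother is shown below, assuming the commonly used method‐of‐moments estimates of m and v (the exact estimators used in this section may differ slightly). The spdep package provides ready‐made EBest() and EBlocal() functions for global and local smoothing.

# y: event counts, n: populations at risk; r = y/n are the raw rates
eb_global <- function(y, n) {
  r <- y / n
  m <- sum(y) / sum(n)                                    # global mean rate
  v <- max(0, sum(n * (r - m)^2) / sum(n) - m / mean(n))  # prior variance (moment estimate)
  C <- v / (v + m / n)                                    # shrinkage factor, one value per region
  m + C * (r - m)                                         # smoothed rates
}

# Ready-made alternatives (nb is a neighbor list built as shown earlier):
# library(spdep)
# EBest(y, n)        # global empirical Bayes smoother
# EBlocal(y, n, nb)  # local empirical Bayes smoother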
A filtered variant of kriging could be used for rate smoothing as shown in figures 1.18 and 1.19. Filtered
kriging predicts a new value at the measurement location using weighted neighboring data. In Geostatistical
Analyst, filtered kriging is initialized by specifying the nonzero measurement part of the nugget parameter.
Then the kriging prediction of lung cancer rates at points can be interpreted as the lung cancer risk for local
people.
A problem with this approach is that input data for conventional kriging must be stationary, that is, the data
mean and data variance must be the same everywhere. Predictions and especially prediction standard errors
using nonstationary data may be biased. Figure 11.7, left, shows that the locally weighted standard deviation
(with adjacent weights) of lung cancer rates varies across the state. This is evidence of data nonstationarity.
Empirical Bayes smoothing stabilizes the rate variances because the variance will be shrunk back to the
global variance in the case of a small regional population, which is responsible for a large part of the unstable
estimates. Figure 11.7 at right shows that the locally weighted standard deviation of the smoothed (by the
Empirical Bayes method) lung cancer rates has a smaller variability, between 0.28 and 1.02, than crude rates,
between 0.29 and 1.70.
Figure 11.7
Continuous maps are easier to read than choropleth maps. It is advantageous to create a continuous map of
smoothed rates in two steps: first, use empirical Bayes estimates of the rates; second, use that estimation as
input to filtered kriging. Figure 11.8 shows the second step of the process, with the empirical Bayesian
smoothing shown at left, the semivariogram modeling in the center, and the resulting smooth male lung
cancer rates at right. One disadvantage of this approach is that accurate estimation of the smoothing
uncertainty cannot be provided (see the example about smoothing rates in appendix 4). One solution to the
problem (areal interpolation or regional data disaggregation) is discussed in chapter 12.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 11.8
CLUSTER DETECTION METHODS
A distinction between detecting clustering (for which a global measure is appropriate) and detecting clusters
(for which a local measure is appropriate) should be made. If we do not care where the clusters are but want
to know if the regional data are clustered, then a global index is appropriate. Local indices are used for finding
locations of the clusters.
As discussed in chapter 8, spatial correlation of continuous stationary data is described by covariance or
semivariogram using the measures

(Z(si) − Z̄(sj))² squared differences (variogram) and (Z(si) − Z̄) ⋅ (Z(sj) − Z̄) cross‐products (covariance),

where Z(si) and Z(sj) are the data at locations si and sj and Z̄ is the global mean value, calculated as an average
based on all the data.
Using these measures, two estimators of regional data correlation can be constructed. The most popular,
Moran’s index of spatial data association (Moran’s I), is based on covariance, and it compares the weighted
cross‐products of the deviations of pairs of data Z(si) and Z(sj) from the overall mean value Z̄ using the
following formula:

$$I = \frac{N}{\sum_{i=1}^{N}\sum_{j=1}^{N}\omega_{ij}} \cdot \frac{\sum_{i=1}^{N}\sum_{j=1}^{N}\omega_{ij}\,(Z(s_i)-\bar{Z})\,(Z(s_j)-\bar{Z})}{\sum_{i=1}^{N}(Z(s_i)-\bar{Z})^{2}}\,,$$

where N is the number of data measurements involved in the calculations and ωij are weights of data at
locations si and sj. Data are usually rates, because working with count data is misleading since the underlying
population also varies among the polygons. Moran’s I is a global statistic in that it averages all cross‐products
over the entire domain.
Assuming that the data are independent and follow a stationary Gaussian distribution, the Moran’s I expected
value is –1/(N−1), approaching zero for a large number of polygons. This result is sometimes used for
calculating the uncertainty of the Moran’s index (p‐value), although it is not a good idea because real regional
data always violate the independence, stationarity, and normality assumptions, meaning that in this case,
apples and oranges are compared.
Positive values of Moran’s I indicate positive spatial autocorrelation, whereas negative values indicate
negative spatial autocorrelation. Recent research shows that Moran’s I is only a good estimator of the
strength of the spatial dependence when this dependence is weak (see reference 3 in the “Further reading”
section, in which an alternative index of spatial association for stationary Gaussian data is proposed). This is
important because local Moran’s I is sometimes interpreted as an estimator of the correlation coefficient,
although statistical literature does not suggest this interpretation.
Formally, Moran’s I should be discussed in the geostatistical chapters of the book (chapters 8‐10) because its
usage requires stationary Gaussian data, but in practice, it is almost always used with regional data
(originally, Moran’s I was developed for testing continuous data: the title of Australian statistician Patrick
Moran’s original 1950 paper is “Notes on Continuous Stochastic Phenomena”).
A postulate about global data stationarity can be relaxed by assuming that data mean and variance are locally
constant. This idea is widely used in geostatistics when only neighboring data are used for prediction of
values in the unsampled locations (see chapter 8). In regional data analysis, the same idea is used for
analyzing local clusters.
Local cluster detection methods examine the similarity between a polygon's value and the values of its
neighbors. In this case, local Moran’s I is calculated for each polygon i as

$$I_i = (Z(s_i)-\bar{Z})\sum_{j=1}^{N}\omega_{ij}\,(Z(s_j)-\bar{Z})\,,$$

where N is the number of data measurements in the polygon i neighborhood. Often, local Moran’s I is divided
by the overall standard deviation σ, which turns it into the standardized local Moran’s I.
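One way to compute the local (and global) Moran’s I in R is sketched below with the spdep package, reusing the row‐standardized weights lw built at the beginning of the chapter; the rate column name is an assumption.

library(spdep)

# localmoran() returns I_i for every polygon together with its expectation, variance,
# z-score, and p-value computed under the (often unrealistic) normality assumption
li <- localmoran(counties$rate, lw, zero.policy = TRUE)
head(li)

# Global Moran's I with the analytical normal-approximation test
moran.test(counties$rate, lw, zero.policy = TRUE)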
Figure 11.9 shows local Moran’s I values calculated using lung cancer mortality rates for males for California
counties from 1970–1994. The map to the left is produced using neighborhoods based on the distance
between centroids, and the map to the right uses adjacent neighboring counties. The maps show different
clusters of both positively (red) and negatively (blue) correlated cancer rates in the counties.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 11.9
Before drawing conclusions from these maps about lung cancer clustering in California, it is necessary to
verify the data stationarity and data normality assumptions. If the local mean and the local standard deviation
vary slowly, an assumption about data stationarity is reasonable. However, figures 11.7 and 11.8 indicate that
the local mean and standard deviation of lung cancer rates vary significantly in neighboring counties. Using
Moran’s I in this situation is analogous to computing a semivariogram when there are outliers and strong
trends in the data. In the case of nonstationary data, anything picked up with Moran’s I could be because of
spatial patterns in the denominator (population at risk) data.
Regional data, such as lung cancer rates, arise from aggregating disease events for administrative regions
such as counties. The basic model for disease events is the spatial Poisson process. In the case of a rare
disease, aggregation of the data over regions results in Poisson distributed data. For a more common disease,
the regional counts may be binomially distributed. Poisson and binomial distributions are discussed in
chapter 4.
Moran’s I can be modified to relax the assumption of constant mean and variance. One such index of spatial
data association is called Walter’s modified Moran’s I:
$$I_i^{WM} = \frac{y_i - r\,n_i}{\sqrt{r\,n_i}}\cdot\sum_{j=1}^{N}\omega_{ij}\,\frac{y_j - r\,n_j}{\sqrt{r\,n_j}}\,,$$

where r is the underlying risk, assumed to be constant over all regions and estimated from the data as

r = (the total number of events)/(the total number of people at risk).
Walter’s modified Moran’s I statistic is based on properties of the Poisson distribution, assuming that the
expected count in polygon i is E(yi) = r⋅ni. Figure 11.10 displays Walter’s modified Moran’s I local
statistics for neighbors based on the distance between centroids shown at left and based on the adjacency of
the polygons shown at right.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 11.10
Walter’s modified Moran’s I statistic shows a different picture of potential local clustering in regional lung
cancer rates than the Moran’s I statistic because it uses different assumptions about data distribution.
Walter’s modified Moran’s I is better suited for analyzing epidemiological data, but it would be a good idea to
try several other indices of spatial association (see assignment 3 and references 4‐5 at the end of this
chapter).
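A base R sketch of the local Walter’s modified I, written directly from the formula above (including the standardized Poisson residuals), is shown below; W is the weights matrix built earlier, and the count and population names are assumptions.

# y: observed counts, n: populations at risk, W: N-by-N spatial weights matrix
walter_local_I <- function(y, n, W) {
  r <- sum(y) / sum(n)                 # constant risk: total events / total population
  z <- (y - r * n) / sqrt(r * n)       # standardized Poisson residuals
  drop(z * (W %*% z))                  # I_i^WM for every polygon
}

# Example call with the border-based weights matrix built earlier (assumed column names):
# IW <- walter_local_I(counties$cases, counties$population, W)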
A disadvantage of all indices of spatial association is that they are “short‐sighted” because they consider only
the nearest neighbors, which are usually selected automatically using simplistic rules. What is beyond the
nearest neighbors is ignored.
We know from reading about Sherlock Holmes that if the alternative explanation is disproved, then the
remaining explanation, no matter how unlikely, must be true. Rejection of the null hypothesis that the data
are randomly distributed in space only demonstrates that some spatial pattern exists. But how do we
interpret the magnitude of the clustering indices? For example, if the value of Moran’s I is I=0.3, does this
indicate weak, moderate, or strong spatial autocorrelation? This determination is made through a formal
statistical hypothesis test. The best approach is to use Monte Carlo testing. An example of testing Moran’s I
index is presented in the Monte Carlo Simulation section in chapter 5. Another approach, random
permutation, is discussed in chapter 16.
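In R, the Monte Carlo (permutation) test of the global Moran’s I is available as moran.mc() in spdep; a short sketch, reusing the weights and rates from the earlier examples, is shown below.

library(spdep)

# 999 random permutations of the rates over the polygons give a reference distribution
# for Moran's I under the null hypothesis of no spatial pattern
set.seed(1)
mc <- moran.mc(counties$rate, lw, nsim = 999, zero.policy = TRUE)
mc$statistic   # observed Moran's I
mc$p.value     # pseudo p-value: rank of the observed I among the simulated values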
Local spatial cross‐correlation indices allow local spatial correlation to be assessed between two variables
using the cross‐covariance measure of data association:
(Z(si) − Z̄) ⋅ (Y(sj) − Ȳ)
These indices identify regions where the association between one spatial variable, Z, and another spatial
variable, Y, is strong or weak. For example, the local cross‐Moran’s I is calculated either as

$$I_i^{ZY} = (Z(s_i)-\bar{Z})\sum_{j=1}^{N}\omega_{ij}\,(Y(s_j)-\bar{Y})$$

or, in the normalized version, with the deviations divided by the standard deviations of the two variables.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 11.11
Other cross‐indices can be constructed from known univariate indices of spatial data association, for example,
an index based on Walter’s modified I is obtained by pairing the standardized deviation of the first variable in
polygon i with the standardized deviations of the second variable in the neighboring polygons, where
superscripts 1 and 2 refer to the two variables.
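A base R sketch of the local cross‐Moran’s I, following the cross‐covariance measure above, is shown below; dividing the deviations by the two standard deviations is one common normalization, and the input names are assumptions.

# z, y: two variables measured in the same polygons; W: spatial weights matrix
local_cross_moran <- function(z, y, W, normalize = TRUE) {
  dz <- z - mean(z)
  dy <- y - mean(y)
  if (normalize) {                 # normalized version
    dz <- dz / sd(z)
    dy <- dy / sd(y)
  }
  drop(dz * (W %*% dy))            # cross-index for every polygon
}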
Local indices of spatial data association are useful for detecting regions with unusually large or small values
and for finding possible data outliers. It may be dangerous to use these indices for inference, for example, for
making conclusions about the existence of “hotspots” with extreme values. This is because the data
distribution and stationarity assumptions are rarely satisfied, and large index values could have causes other
than spatial correlation of nearby data; see reference 6 in “Further reading.” Regional data are usually complex in
geographic variation, and spatial patterns should be further investigated using spatial regression models.
SPATIAL REGRESSION MODELING
In regression analysis, we are interested in how the entire distribution of variable Z may vary with values of
predictors X1, X2, … Xn. This coincides with one of the goals of regional data analysis, which is identification of
variables that influence the outcome of interest. Regression analysis helps answer questions such as what
causes a disease, what factors contribute to crime, and how house characteristics determine its price.
The geostatistical model discussed in chapter 8 decomposes the data into four components:
data = trend + small‐scale variation + microscale variation + random error
In geostatistics, the primary goal of analysis is predicting a value at an unsampled location. Although the
trend component of the geostatistical data model can be complex, researchers usually are not interested in its
exact form.
In spatial regression analysis, the nature of the trend component is the primary goal. The spatial regression
model can be rewritten to highlight the trend component importance as
data = trend + all random errors,

or

$$Z(s_i) = \beta_0 + \beta_1 X_1(s_i) + \ldots + \beta_M X_M(s_i) + \varepsilon(s_i)\,.$$

Under particular assumptions about the trend and the error terms, this universal spatial regression model leads
to trend surface, kriging, and spatial autoregressive models.
SIMULTANEOUS AUTOREGRESSIVE MODEL
Usually the relationship between Z and X is assumed to be the same across the entire spatial domain so none
of the regression parameters β0, β1,… βM depends on spatial location. One of the most popular spatial
regression models of this kind is the simultaneous autoregressive model (SAR). The simultaneous
autoregressive model expresses a value Z(si) at a location si as a sum of three components: a mean value μ(si)
at location si, a weighted influence of values from neighboring regions, and random Gaussian error ε(si) with
zero mean and variance σi²:

$$Z(s_i) = \mu(s_i) + \rho\sum_{j=1}^{N}\omega_{ij}\,\big(Z(s_j)-\mu(s_j)\big) + \varepsilon(s_i)\,,$$

where the weights ωii = 0, so that the second term includes information only about the values of neighbors of
the polygon at location si.
The spatial weights matrix can be asymmetric for some of the region pairs i and j, ωij ≠ ωji, so that the
influence of polygon i on polygon j is different than the influence of polygon j on polygon i. Asymmetric
weights can be useful, for example, in the case of data anisotropy.
The variance σi² can be taken to be proportional to 1/ni, where ni is the population of region i. This
relationship can be justified as follows: binomial data can be approximated by a normal distribution whose
variance is inversely proportional to the number of observations ni. Alternatively, the variance is often assumed
to be constant to reduce the number of parameters (that is, σi² = σ² for all i).
The parameter ρ controls the spatial dependence between a value in the polygon and its neighbors,
measuring the average influence of neighboring observations Z(sj), j = 1, 2, …, Ni, through weights ωij defined
by the specification of the searching neighborhood of si. The parameter ρ is estimated simultaneously
with other regression parameters.
In the statistical literature on spatial regression, we can see that the matrix I − ρW, where I is the identity
matrix and W is the weights matrix, must be invertible and, hence, its determinant must be nonzero. This
restricts the possible values of ρ: if the smallest eigenvalue λmin of the weights matrix is negative, and the
largest eigenvalue λmax is positive, ρ should be in the interval (1/λmin, 1/λmax).
Therefore, to better understand the results of spatial regression output, one should use a function that
returns an array of eigenvalues. The eigenvalues can be complex numbers. Often, the imaginary part is small
and can be ignored. However, if the neighborhood structure is complicated, spatial weight matrices may have
large imaginary parts of the eigenvalues, and there is no theory on interpretation of the parameter ρ in this
case.
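The eigenvalues and the resulting interval for ρ can be inspected directly, as in the base R sketch below; W is assumed to be the weights matrix built earlier (for example, with nb2mat()).

# Eigenvalues of the spatial weights matrix
ev <- eigen(W, only.values = TRUE)$values

max(abs(Im(ev)))   # size of the imaginary parts; often negligible and then ignored
ev <- Re(ev)

# (I - rho*W) is invertible when rho avoids the reciprocals of the eigenvalues;
# with a negative smallest and a positive largest eigenvalue, rho must lie in this interval
c(1 / min(ev), 1 / max(ev))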
A practical way to avoid this problem is by standardizing the weights matrix so that the rows sum to 1, which
forces the minimum and maximum eigenvalues to be −1.0 and 1.0. In this case, interpreting the spatial
dependence parameter ρ is the same as interpreting the correlation coefficient: a standardized measure of
the relationship between two variables with values ranging from −1 (strong negative relationship) to +1 (strong
positive relationship). However, ρ=0 does not necessarily indicate independence because the variance is not
constant across the data domain. Weights standardization has another useful feature: the autoregressive
models have larger variance at the edges than in the middle of the data domain, as it should be.
The spatial correlation between pairs of regions implied by the simultaneous autoregressive model is difficult
to interpret. We would expect all neighbors to have the same correlation since parameter ρ is constant.
However, with the same parameter ρ, correlations between a region and each of its neighbors may be
different. Using common borders between polygons, figure 11.12 at left shows African countries with three
highlighted neighborhoods. Some countries have just two neighbors, and other countries have seven
neighbors. Figure 11.12 at right shows that there are as many different correlations as there are pairs of
countries, meaning that the simultaneous autoregressive model corresponds to processes with nonstationary
covariance. This is a problem because it is not possible to control the pairwise correlations in the
simultaneous autoregressive model.
Figure 11.12
From this formula, the larger the parameter ρ, the more the trend is influenced by the local
neighborhood weights.
If the mean is known, a simultaneous autoregressive model is similar to a simple kriging model, except with a
different error structure induced by the autoregressive component
instead of being defined through the covariance Cov(Z(s), Z(s+h)).
Assuming that all weights are equal and normalized, we can write the simultaneous autoregressive model as

$$Z(s_i) = \hat{\mu}(s_i) + \frac{\rho}{N_i}\sum_{j\in N(i)}\big(Z(s_j)-\hat{\mu}(s_j)\big) + \varepsilon(s_i)\,,$$

where μ̂(si) is the fitted value of the regression for the ith site and Ni is the number of its neighbors. The sum
of the first two components of the
simultaneous autoregressive model—the smooth fitted value and the neighbor adjustment based on the
amount of autocorrelation—is a predicted value in the region. From the formula above, if all of the values of
neighbors are above their mean values, then their deviations from the mean are averaged and adjusted
upward for the ith site based on the amount of autocorrelation ρ.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 11.13
Counties in Northern California have the lowest crime rates. However, individual county populations vary
significantly statewide, ranging from 1,208 people in Alpine County to 9,519,338 people in Los Angeles
County. Although population is included in rates, the rate uncertainty varies greatly with these disparate
population sizes. Several counties in the eastern part of Central California have relatively small populations
but also have relatively large murder rates. The map of murder rates may be biased because the rates
associated with sparsely populated counties are not as precisely measured as those based on much larger
populations.
In several of the sparsely populated counties, there were zero murders in 2000. If ranked by the murder rate,
such counties would transform from the safest into the most dangerous ones if only one murder happened
there during the next year. The minimum number of events required for a rate to be considered stable
depends on the desired precision. For instance, a requirement can be that the standard error of the rate is
less than 30 percent of the rate value. That can be translated into a minimum number of cases per county.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 11.14
Figure 11.15 at left shows the predicted murder rates while figure 11.15 at right displays the standard error
of prediction using serious crime rates as explanatory variables. These predictions smoothed the original
murder rates, which are adjusted for the serious crime rates and for spatial dependence in the model defined
by the neighboring polygon weights. As expected, the largest difference between the raw and predicted
murder rates is in the less‐populated counties. Other explanatory variables may help to better explain the
variation in crime incidents, for example, the average household income and level of education.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 11.15
A reproducible case study using the simultaneous autoregressive model can be found in chapter 16.
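One way to fit a simultaneous autoregressive model of the kind used for the California crime data is the spautolm() function in the spatialreg package (formerly part of spdep); the sketch below is only an outline, and the column names murder_rate and serious_crime_rate are assumptions.

library(spdep)
library(spatialreg)

nb <- poly2nb(counties, queen = TRUE)                # neighbors by common border
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)

# Murder rate explained by the serious crime rate plus a spatially autocorrelated term;
# the spatial dependence parameter (reported as lambda) is estimated by maximum
# likelihood together with the regression coefficients
sar_fit <- spautolm(murder_rate ~ serious_crime_rate, data = counties,
                    listw = lw, family = "SAR")
summary(sar_fit)

fitted(sar_fit)      # smoothed (predicted) rates
residuals(sar_fit)   # what the covariate and the neighbors do not explain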
MARKOV RANDOM FIELD AND CONDITIONAL AUTOREGRESSIVE MODEL
Markov random fields are based on the conditional independence assumption discussed and illustrated in
chapter 5 in the section about the Bayesian belief network. Recall that random variables z1 and z2 are
conditionally independent given z3 if, once z3 is known, information about z2 does not provide new
information about the distribution of z1:
Prob(z1 | z2, z3) = Prob(z1 | z3)
In a Markov random field, the values associated with polygons are related to one another via a neighborhood.
This is a very useful feature because most statistical models and tools used in regional data analysis assume,
explicitly or implicitly, that the values of interest in a particular region can be explained using the region’s
neighborhood only.
With green polygons, figure 11.16 at left shows the U.S. counties that share a common border with Lake
Michigan. Counties’ centroids with common borders are linked. For 28 out of 35 counties with one or two
neighbors, the ordering is “natural” since the rule of choosing neighbors is transparent: they are neighbors
along the coastline.
However, in most cases, there is no natural ordering among polygonal objects. Figure 11.16 at right shows the
35 European countries colored according to their world ranking of satisfaction with life (or overall
happiness) calculated by psychologist Adrian White from the University of Leicester, UK. The map is based on
an analysis of the results from more than 100 studies (see
http://www2.le.ac.uk/ebulletin/news/press-releases/2000-
2009/2006/07/nparticle.2006-07-28.2448323827). The “happiest” countries in the world are
shown in red. According to this research, happiness is primarily associated with good healthcare and then
with ease of access to secondary education and with financial needs.
Data from White, Adrian G., “A Global Projection of Subjective Well‐Being: A Challenge to Positive Psychology?”
Figure 11.16
Data from White, Adrian G., “A Global Projection of Subjective Well‐Being: A Challenge to Positive Psychology?”
Figure 11.17
Other popular choices of algorithms for automatic selection of the neighboring polygons are the Delaunay
triangulation and the Gabriel graph. If two Voronoi regions share a boundary, the nodes of these regions are
connected as shown in figure 11.18 at left for Austria. Such nodes are called the Voronoi or the Delaunay
neighbors. Points A and B are Gabriel neighbors if their diametral circle (the circle such that line AB is its
diameter) does not contain any other points. For example, in figure 11.18 at right, points A and B are Gabriel
neighbors, whereas B and C are not. The Gabriel graph consists of all pairs of Gabriel neighbors. The Gabriel
neighbors are shown for Austria in figure 11.18 at right.
Data from White, Adrian G., “A Global Projection of Subjective Well‐Being: A Challenge to Positive Psychology?”
Figure 11.18
Figure 11.19
The distributions of the number of neighbors defined based on common borders, a distance of less than 1,500
kilometers between polygon centroids, and the Delaunay and Gabriel neighbors are shown in tables 11.1‐
11.4.
Common border
Number of neighbors    0   1   2   3   4   5   6   7   8
Number of polygons     1   6   8   5   8   2   3   1   1
Table 11.1
Distance threshold equals 1,500 kilometers
Number of neighbors    0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16  17
Number of polygons     1   2   3   2   2   1   2   1   1   3   1   2   3   2   2   2   2   3
Table 11.2
Delaunay triangulation
Number of neighbors    4   5   6   7   9
Number of polygons     10  11  7   5   2
Table 11.3
Table 11.4
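The four neighborhood definitions summarized in tables 11.1 through 11.4 can be reproduced with spdep, as sketched below; the polygon layer name countries is an assumption, and geographic (longitude–latitude) coordinates are assumed for the 1,500-kilometer threshold.

library(sf)
library(spdep)

coords <- st_coordinates(st_centroid(st_geometry(countries)))

nb_border   <- poly2nb(countries)                          # common border (table 11.1)
nb_dist     <- dnearneigh(coords, 0, 1500, longlat = TRUE) # centroids closer than 1,500 km (table 11.2)
nb_delaunay <- tri2nb(coords)                              # Delaunay triangulation (table 11.3)
nb_gabriel  <- graph2nb(gabrielneigh(coords), sym = TRUE)  # Gabriel graph (table 11.4)

# card() returns the number of neighbors of each polygon;
# table() tabulates these counts as in tables 11.1-11.4
table(card(nb_border))
table(card(nb_dist))
table(card(nb_delaunay))
table(card(nb_gabriel))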
It is unlikely that all these neighborhoods define the Markov random field so that an observation in each
polygon is completely described by its neighbors. The important practical questions are how to verify the
assumption of the Markov random field and how to change neighborhoods if this assumption does not hold.
There is a spatial regression model called the conditional autoregressive (or CAR) model, which satisfies the
Markov property (under certain theoretical conditions, see references 7‐9 in “Further reading”). It describes
the conditional probability distribution of the value Zi at each location given the observed values of the
neighboring observations Zj, j ∈ N(i), where N(i) is the set of neighbors of polygon i. Assuming that each
conditional distribution is Gaussian, the conditional autoregressive model is written as

$$E\big(Z_i \mid Z_j,\ j\in N(i)\big) = \mu_i + \rho\sum_{j\in N(i)}\omega_{ij}\,(Z_j-\mu_j)\,,$$
$$\mathrm{var}\big(Z_i \mid Z_j,\ j\in N(i)\big) = \sigma_i^{2}\,,$$

where σi² is the conditional variance of Zi given the Zj, which should be specified.
It is shown in the statistical literature that the two equations above define a joint multivariate normal
distribution. The inverse covariance matrix must be symmetric, leading to the constraint ωij/σi² = ωji/σj². A
commonly used model formulation that satisfies this symmetry condition is weight matrix normalization.
This normalization limits the spatial dependence parameter ρ to values ranging from −1 to +1, irrespective of
the neighborhood weights.
The conditional autoregressive model can be written similarly to the simultaneous autoregressive model.
The difference between the conditional and simultaneous autoregressive model outputs can be substantial.
The conditional autoregressive model is more intuitive than the simultaneous autoregressive
model.
The conditional autoregressive model is more general than the simultaneous autoregressive
model: any simultaneous autoregressive model with the normalized weights can be
represented as a conditional autoregressive model, but not vice versa.
The conditional autoregressive model prediction assuming a normal distribution is the best
predictor at new locations from the mean squared error point of view.
Since the conditional autoregressive model preserves the Markov property, one possibility for examining this
property is to use cross‐validation diagnostics. This diagnostic is possible with regional data if the covariates
are specified for each data value.
We illustrate diagnostics for regional data using the happiness data for the European countries, which are
accompanied by the data on life expectancy, gross domestic product (the total value of goods and services
produced by a nation) per capita (GDP), and access to secondary education scores (not for all countries).
The spatial regression model’s performance is usually assessed using a technique known as maximum
likelihood estimation. Relative values of the maximized likelihood function help to assess how well
the model fits. Specifically, several criteria, known as the Akaike information criterion (AIC), its finite‐sample
corrected version AICc, and the Bayesian information criterion (BIC), are based on this function but also
adjust for the number of parameters in the model (models with a larger number of parameters will typically
fit the data better, but parsimony is the goal). The criteria values (scores) should be as small as possible.
Table 11.5 shows these scores for the conditional autoregressive model with four different neighborhoods
without covariates (ordered from the worst to the best) and for the conditional autoregressive model with
the best neighborhood and two covariates, life expectancy and gross domestic product.
Table 11.5
Neighbors defined by the Gabriel graph are clearly better at describing the spatial dependence of the
satisfaction with life in Europe. Because happiness depends on health and wealth, the prediction is improved
when these covariates are added to the spatial regression model.
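A sketch of how such a comparison can be run in R with spautolm(family = "CAR") from the spatialreg package and the AIC() function is shown below; the data frame and column names (happiness, life_expectancy, gdp) are assumptions, and binary weights are used because the CAR specification requires a symmetric structure.

library(spdep)
library(spatialreg)

lw_dist    <- nb2listw(nb_dist,    style = "B")   # binary, symmetric weights
lw_gabriel <- nb2listw(nb_gabriel, style = "B")

# CAR models without covariates for two of the neighborhood definitions above
car_dist    <- spautolm(happiness ~ 1, data = countries, listw = lw_dist,    family = "CAR")
car_gabriel <- spautolm(happiness ~ 1, data = countries, listw = lw_gabriel, family = "CAR")

# The best neighborhood with life expectancy and GDP per capita as covariates
car_cov <- spautolm(happiness ~ life_expectancy + gdp,
                    data = countries, listw = lw_gabriel, family = "CAR")

# Smaller values indicate a better compromise between fit and the number of parameters
AIC(car_dist, car_gabriel, car_cov)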
The cross‐validation diagnostics help to evaluate the performance of the prediction of each datum using its
neighbors. Figure 11.20 shows prediction versus true data (index of happiness) for three conditional
autoregressive models with different neighborhoods and covariates: a distance less than 1,500 kilometers
(shown in green), Gabriel graph (blue), and Gabriel graph and two covariates (pink). The linear trend is fitted
to each cloud of points, and the formulas for these lines are shown in the top part of figure 11.20.
The cross‐validation diagnostics confirm the ranking of the models based on the information criteria:
neighbors selected by the Gabriel graph better explain the spatial structure of the happiness data than the
other neighborhood definitions.
Data from White, Adrian G., “A Global Projection of Subjective Well‐Being: A Challenge to Positive Psychology?”
Figure 11.20
Figure 11.21 shows the countries with the largest differences between the indices of happiness and their
predictions, with neighborhoods based on the Gabriel graph without covariates (left) and with two covariates
(right). The index of happiness is poorly predicted in the countries in blue and red, and it is desirable to
update their lists of neighbors manually.
Data from White, Adrian G., “A Global Projection of Subjective Well‐Being: A Challenge to Positive Psychology?”
Figure 11.21
Figure 11.22
Data from White, Adrian G., “A Global Projection of Subjective Well‐Being: A Challenge to Positive Psychology?”
As mentioned in chapter 8, the smoothed values predicted by spatial autoregressive models can be used as an
estimated large‐scale variation (trend) in geostatistics. This method of detrending is useful when the trend in the
data of interest can be explained by covariates instead of the function of coordinates, as in the temperature
trend surface modeling discussed in chapter 7.
If there are missing values in several polygons and we want to predict them, we can assume that the
conditional autoregressive model predictor is distributed normally with the conditional mean and variance
given above, and this distribution can be used to predict the new values.
Another case study using conditional autoregressive model can be found in chapter 15. A conditional
autoregressive model is also used in the reproducible case study in appendix 3.
ASSIGNMENTS
1) INVESTIGATE THE NEW PROPOSAL OF MAPPING THE RISK OF DISEASE.
The Asian GIS monthly magazine GIS Development published the article “Mapping Infectious Diseases Using
SARS Maps” by Xun Shi (June 2005, vol. 9, issue 6, available online at
http://www.gisdevelopment.net/magazine/years/2005/jun/sars.htm). The author
proposed a new local risk score r for mapping the risk of disease. The risk score r of a spatial unit (for example,
a province) is calculated from c, the number of cases in this spatial unit, n, the number of people in this spatial
unit, and a, the size (area) of this spatial unit.
Compare this new index with the indices of spatial association discussed in the “Cluster Detection Methods”
section.
According to the new risk score, the most dangerous province was that with 1 out of 7,209 total cases of
disease. Do you agree with this result?
2) SMOOTH THE DATA FOR THE TAPEWORM INFECTION IN RED FOXES
Figure 11.23 shows the prevalence of tapeworm infection in red foxes (the proportion of diseased animals)
for 43 regions in Lower Saxony, northern Germany. From 1991 to 1997, 5,365 red foxes were sampled and
examined for the infection. The data are represented as Voronoi polygons created around regions’ centroids
given in the paper
Berke, O., 2004. Exploratory disease mapping: kriging the spatial risk function from regional count data.
International Journal of Health Geographics, 3:18.
The electronic version of this article can be found online at: http://www.ij-healthgeographics.com
/content/3/1/18.
From the International Journal of Health Geographics,
Figure 11.23
An analysis of spatial distribution of the diseased animals requires continuous mapping of the infection
prevalence. Develop a plan for creating a smooth map and compare your plan with the paper’s proposal.
Data for the case study are in the folder Assignment 11.2.
3) SPATIAL CLUSTERS DETECTION USING R PACKAGE DCLUSTER
The R package DCluster provides several methods of spatial cluster detection that are not described in this
chapter, including the Openshaw, Besag and Newell, Kulldorff and Nagarwalla, and Tango indices and tests.
After reading appendix 2 and maybe chapter 16, use DCluster functions with data from assignment 3 of
chapter 15.
Since DCluster’s documentation is very brief, we suggest reading chapters 6 and 7 of the book in reference 5
below. They discuss the advantages and disadvantages of many of the methods implemented in DCluster.
The assignment on the simultaneous autoregressive model usage can be found in chapter 16 (assignment 2).
FURTHER READING
1a. McNeill, L. 1991. “Interpolation and Smoothing of Binomial Data for the Southern African Bird Atlas
Project.” South African Statistical Journal 25:129–136.
1b. Lajaunie, C. 1991. “Local Risk Estimation for a Rare Noncontagious Disease Based on Observed
Frequencies.” Centre de Geostatistique de l’Ecole des Mines de Paris. Fontainebleau. Note N‐36/91/G
These papers propose modification of conventional kriging for interpolation of binomial data.
2. Monestiez, P., L. Dubroca, E. Bonnin, J. P. Durbec, C. Guinet. 2006. “Geostatistical Modeling of Spatial
Distribution of Balenoptera Physalus in the Northwestern Mediterranean Sea from Sparse Count Data and
Heterogeneous Observation Efforts.” Ecological Modeling, 193:615‐628.
This paper proposes modification of conventional kriging for interpolation of Poisson data.
3. Li, H., C. A. Calder, and N. Cressie. 2007. “Beyond Moran’s I: Testing for Spatial Dependence Based on the
SAR Model.” Geographical Analysis. 39:357–375.
This paper discusses the statistical features of Moran’s index of spatial data association and argues against its
informal use outside of formal hypothesis testing. An alternative to Moran’s I was developed.
4. Krivoruchko, K., C. A. Gotway, and A. Zhigimont. 2003. “Statistical Tools for Regional Data Analysis Using
GIS.” In The proceedings of the Eleventh ACM International Symposium on Advances in GIS. Edited by E. Hoel
and P. Rigaux, 41–48. ACM Press.
Using county‐level crime data in California, the authors show different statistical methods for regional data
analysis that can be implemented within a GIS to provide a powerful set of interactive, analytical tools for
regional data analysis.
5. Waller, L. A., and C. A. Gotway. 2004. Applied Spatial Statistics for Public Health Data. New York: John Wiley
& Sons, 494.
Although this well‐written book is for advanced undergraduate and graduate courses in statistics, a reader
with minimal statistical background can learn about assumptions, differences, similarities, and implications
of statistical models. Chapter 7 on regional data analysis is recommended for all readers.
The author shows that the rejection of the null hypothesis of no dependence means not only that dependence
is present but also that other misspecifications, including functional form misspecification,
heteroskedasticity, and the effects of missing variables, are correlated over space.
7. Schabenberger, O., and C. A. Gotway. 2004. Statistical Methods for Spatial Data Analysis. New York:
Chapman & Hall/CRC, 488.
Chapter 6 of this book is recommended for readers who want an in‐depth discussion of the spatial regression
models.
8. Besag, J. 1974. “Spatial Interaction and the Statistical Analysis of Lattice Systems.” Journal of the Royal
Statistical Society, Series B 36:192–225.
This paper on spatial Markov random fields is considered a pioneering work on statistical regional data
analysis by most statisticians. See, however, reference 11 below.
9. Cressie, N. A. C. 1993. Statistics for Spatial Data. Revised Edition. New York: John Wiley & Sons.
This book is the most cited reference on spatial statistics, but it is very difficult to read for nonstatisticians.
However, the introductions to chapters 6 and 7 on regional data analysis are not that difficult to understand.
10. Xiao, N., C. A. Calder, and M. C. Armstrong. 2007. “Assessing the Effect of Uncertainty on Choropleth Map
Classification.” International Journal of Geographic Information Science, 21(2):121–144.
This paper discusses a method that can be used to evaluate the classification robustness of choropleth maps
when the attribute uncertainty associated with the data is known or can be estimated.
11. Dobrushin, R. L. 1968. “Description of a Random Field by Means of Conditional Probabilities and the
Conditions Governing its Regularity,” Theory of Probability and its Applications, 13: 197–224.
Ronald Dobrushin made fundamental contributions to several fields of mathematics and mathematical
physics. In the 1960s, he published several papers in Russian and English, including the one referenced here, on
the existence and features of Markov random fields, extending the Markov property to more than one dimension
(in one dimension, the Markov property states that the future of the process, given the present, is not affected by
the past).
Some other of Dobrushin’s papers can be found at http://www.cpt.univ-mrs.fr/dobrushin
/list.html.
SPATIAL REGRESSION MODELS:
CONCEPTS AND COMPARISON
GEOGRAPHICALLY WEIGHTED REGRESSION
LINEAR MIXED MODEL
GENERALIZED LINEAR AND GENERALIZED LINEAR MIXED MODELS
SEMIPARAMETRIC REGRESSION
HIERARCHICAL SPATIAL MODELING
HIERARCHICAL MODELS VERSUS BINOMIAL AND POISSON KRIGING
MULTILEVEL AND RANDOM COEFFICIENT SPATIAL MODELS
GEOGRAPHICALLY WEIGHTED REGRESSION VERSUS RANDOM COEFFICIENTS
MODELS
SPATIAL FACTOR ANALYSIS
COPULA‐BASED SPATIAL REGRESSION
REGIONAL DATA AGGREGATION AND DISAGGREGATION
SPATIAL REGRESSION MODELS DIAGNOSTICS AND SELECTION
ASSIGNMENTS
1) INVESTIGATE THE EFFECT OF SUN EXPOSURE ON LIP CANCER DEATHS
2) PRACTICE WITH THE ARCGIS 9.3 GEOGRAPHICALLY WEIGHTED REGRESSION
GEOPROCESSING TOOL
FURTHER READING
Mixed models extend linear regression models, allowing for random effects. The necessity of random effects
is illustrated using measurements of carbon monoxide collected in California.
Next, the generalized linear and generalized linear mixed models are introduced. The generalization of the
linear model has two aspects: (1) a variety of distributions can be used in addition to the normal distribution,
and (2) the mean is transformed through a “link function,” linking the regression part to the mean of the
selected distribution.
Next, the semiparametric regression model concept is introduced, and the model usage is illustrated using the
malaria data collected in Gambia.
Concepts of hierarchical and random coefficients spatial models are then presented. The two models are
compared with cokriging and geographically weighted regression, respectively.
Next, the application of the hierarchical spatial model to spatial factor analysis is illustrated using crime data
collected in small administrative areas.
Then the recently proposed copula‐based approach to spatial regression is introduced.
The chapter ends with a discussion of the regional data disaggregation and regression model diagnostics.
GEOGRAPHICALLY WEIGHTED REGRESSION
In the simultaneous and conditional autoregressive models discussed in chapter 11, it is assumed that the
relationship being modeled applies everywhere in the study area; that is, the regression coefficients are
constants. In some applications, this is not the case. Geographically weighted regression allows parameters
β0, β1,… βk to vary spatially:

$$Z(s) = \beta_0(s) + \beta_1(s)X_1(s) + \ldots + \beta_k(s)X_k(s) + \varepsilon(s)\,.$$

This means that, at every location s, coefficients β0(s), β1(s), …, βk(s) need to be estimated. In the expression
above, ε(s) is an independently normally distributed random error with mean zero and constant variance σ².
The geographically weighted regression is simply a collection of separately fitted classical linear regression
models for weighted noncorrelated data. Geographically weighted regression allows regression coefficients
to vary spatially using the moving window approach: at each location s, all the data in the neighborhood s are
used to estimate the parameters βk(s), with weights assigned to observations being a function of the
Euclidean distance between the estimated location s and its neighbors. Therefore, the regression model
coefficients can be locally discontinuous, just as the output from the moving window kriging model (the
semivariogram parameters, prediction, and prediction standard error) discussed in chapter 9.
The weights are defined as a decreasing function of the distance between locations. Here, ωij is the weight
assigned to location sj to estimate the parameters βk(si) for observation si, and hij is the
straight‐line distance between observations si and sj.
Geographically weighted regression assumes that all the spatial variation in the data is explained by the
covariates so that the data Z(s) are uncorrelated (spatially independent) given the covariates Xk(s). Since spatial
variation of the aggregated data in this approach is explained by the impact of the explanatory variables
rather than by the influence of the neighbors, spatial conditional independence of the aggregated data does
not create a problem. Spatially independent geostatistical data are rare, but aggregated and discrete data of
this kind exist. Data are often modeled as spatially dependent as a surrogate for missing explanatory
variables.
The geographically weighted regression variance is assumed constant, meaning that this model allows only the mean of the data to vary. In this sense, the model is similar to the kriging models discussed in chapter 9. If the Xk are polynomial functions of the spatial coordinates x and y, such as x, y, x², x⋅y, and y², and the weights are defined using the expression ωij = exp(−hij/(3b)), where b is a constant to be estimated, the model becomes the Geostatistical Analyst's local polynomial interpolation model. Therefore, geographically weighted regression can also be used for data detrending.
The geographically weighted regression model is fitted in the following steps:
1) The kernel bandwidth b is estimated by cross-validation.
2) The weights ωij are calculated using the distance between pairs of measured and predicted locations.
3) The regression coefficients βk(s) are estimated at each location s where the explanatory variables X1(s), …, Xk(s) are observed.
4) The response values are estimated at locations s by Ẑ(s) = β0(s) + β1(s)⋅X1(s) + … + βk(s)⋅Xk(s).
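To make these steps concrete, the following is a minimal sketch of the local weighted least-squares estimate at a single location, written in Python with a Gaussian kernel. The function name, the simulated data, and the fixed bandwidth are illustrative assumptions, not the algorithm of any particular GWR implementation.

```python
import numpy as np

def gwr_coefficients_at(s0, coords, X, z, bandwidth):
    """Estimate the local coefficients beta_k(s0) by weighted least squares."""
    # Euclidean distances from the estimation location s0 to all data locations
    h = np.sqrt(((coords - s0) ** 2).sum(axis=1))
    # Gaussian kernel: nearby observations get large weights, distant ones nearly zero
    w = np.exp(-0.5 * (h / bandwidth) ** 2)
    # weighted least squares: solve (X' W X) beta = X' W z
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ z)

# toy illustration with simulated data
rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(200, 2))                  # data locations
X = np.column_stack([np.ones(200), rng.normal(size=200)])   # intercept and one covariate
z = 2.0 + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)   # response values
print(gwr_coefficients_at(np.array([5.0, 5.0]), coords, X, z, bandwidth=3.0))
```

In a full fit, this local estimation is simply repeated at every location of interest after the bandwidth has been chosen by cross-validation, as described in the steps above.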
The locally estimated coefficients β0(s), β1(s), …, βk(s) are used as input data for mapping. Maps of these coefficients show how much each covariate impacts the value of the data Z(si) locally. For example, a
Realtor may be interested in the influence of the number of bedrooms and the size of the backyards on house
prices in different parts of the city. The model residuals show how much spatial data variation is not
explained by the covariates Xk.
The block group data from the vicinity of Rancho Cucamonga, California, are used to illustrate the application
of geographically weighted regression. Figure 12.1 shows 112 block polygons colored from blue to red
according to the average house price in 2000. Two freeways at the south and east (red) and mountains to the
north limit the selected region. The input data points for geographically weighted regression are coordinates
of the centroids of each block group (yellow). Weights based on the distance between polygon centroids may not be a good measure of the proximity between polygons, but this is not a serious problem because the primary goal of geographically weighted regression is data exploration, not modeling.
Figure 12.1
The average house price depends on many factors, and in this exercise, the effects of the following factors will be estimated as the spatially varying coefficients β0(s), β1(s), β2(s), β3(s), β4(s), and β5(s) of the following model:
[house price at location s] = β0(s) + β1(s)⋅[distance to the freeway]
+ β2(s) [per capita income] + β3(s) [average family size]
+ β4(s) [house age] + β5(s) [average number of rooms]
with weights ωij = exp(−(hij/b)²),
where b is the bandwidth parameter, which is estimated from the data using a cross-validation technique.
Usually geographically weighted regression assumes that data follow the Gaussian distribution. This
assumption can be reasonable for house prices, but it may be invalid for other data. For non‐Gaussian data,
geographically weighted regression can be used with Poisson regression when input data are counts or rates
and with logistic regression when input data are binary or proportions.
Figure 12.2
Figure 12.3 shows the estimated spatially varying house age coefficients β4(s) using another formula for the weights, the "bisquare scheme"

ωij = (1 − (hij/b)²)² if hij < b, and ωij = 0 otherwise,

and using the Akaike information criterion instead of the cross-validation diagnostic for choosing the weights' parameter b.
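For reference, a small sketch of the two weighting schemes is shown below; these are the common textbook forms of the Gaussian and bisquare kernels, and the exact formulas used by a particular software package may differ.

```python
import numpy as np

def gaussian_weights(h, b):
    # weight decays smoothly with distance and never reaches exactly zero
    return np.exp(-0.5 * (h / b) ** 2)

def bisquare_weights(h, b):
    # weight is exactly zero for observations farther away than the bandwidth b
    w = (1.0 - (h / b) ** 2) ** 2
    return np.where(h < b, w, 0.0)

h = np.array([0.0, 250.0, 500.0, 1000.0, 2000.0])  # distances, in the same units as b
print(gaussian_weights(h, b=1000.0))
print(bisquare_weights(h, b=1000.0))
```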
The resulting map is qualitatively similar to the map of the same house age coefficient above, but there are
differences in the coefficient values everywhere. Therefore, more modeling is required before making a
decision about the spatial structure of the house prices in a particular part of the studied region.
Figure 12.3
Figure 12.4
Usually, houses are cheaper near large noisy freeways and more expensive near the mountains. The map of
coefficients β1(s), which are positive everywhere in figure 12.5, shows that the distance to the freeway is an
important characteristic of housing prices in all block groups but is more important in the red areas than in
the blue ones.
Figure 12.5
Figure 12.6
Many other factors, including distance to nearby schools, proximity of industrial zones or agricultural areas,
square footage of the house, size of the back yard, and presence of a garage, are among the variables that
influence housing prices. All these unaccounted factors are combined in the coefficient β0(s).
In figure 12.7, large positive or negative coefficients β0(s) indicate areas where housing prices are not
explained by the model with five variables. Areas in yellow and in red with small coefficients β0(s) are
reasonably explained, whereas areas in blue may require additional variables for the explanation of the
housing price structure by geographically weighted regression.
Figure 12.7
Providing accurate prediction uncertainties is critical if the prediction maps are used as “data” in other
studies, and the question arises: How accurately are the regression coefficients estimated? Although standard
errors of local regression coefficients can be estimated and mapped showing areas where the relationships
between the variable of interest and the explanatory variables are uncertain, there are several problems with
interpretation of the prediction uncertainty.
A situation when the predictors (explanatory variables) are nearly linearly related to each other is called
collinearity or multicollinearity. The moderate correlation of two explanatory variables makes their
associated local regression coefficients dependent. Moreover, recent publications indicate that local
regression coefficients can be correlated even if the explanatory variables are uncorrelated globally. This is
partly because stable estimation of the regression coefficients requires a relatively large number of data values, and the coefficients at nearby locations are then estimated using almost the same data with similar weights.
In the case of housing data, any two explanatory variables (say, the number of bathrooms and the size of the
lot) are reasonably correlated since there is a premium on additional bathrooms and lot sizes in more
prestigious areas. Geographically weighted regression assumes that the regression coefficients change smoothly in space, and a small change in the geographic coordinates of a house usually produces a small change in the explanatory variables. Therefore, the coefficients for the number of bathrooms and for lot size would vary smoothly along a particular street, because a large fluctuation in the coefficients does not make sense. This means that correlation between the regression coefficients is natural.
If correlation between coefficients exists, their meaningful interpretation is not possible because regression
coefficients are not uniquely defined. For example, if the model assumptions hold, it can be said that a
predicted house price consists of n dollars because of a particular number of bedrooms, m dollars because of
that size garage, and so on. However, if the assumptions are violated, there is no reason to believe that the model accurately partitions the data into independent terms.
From a mathematical point of view, correlation between explanatory variables leads to an unstable solution of the linear equations. If two variables are the same, there is no unique solution of the linear system

u + v = a.

If two identical variables are used in standard geographically weighted regression software, the chosen solution is u = a and v = 0. However, if the variable u is equal to the noisy variable v (u = v + random noise), the solution for some reason becomes u = a/2 and v = a/2; in both cases, it is an arbitrary decision.
Instability always arises when there is a moderate correlation between explanatory variables because the
relationship between them can be described by the following equation:
u = constant1 + constant2·v + random noise
Technically, correlation between explanatory variables results in an ill-conditioned matrix. A system of equations is ill-conditioned if a small change in the coefficient matrix or a small change in the right-hand side results in a large change in the solution vector, so that a tiny perturbation of the inputs can completely change the estimated coefficients.
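The effect is easy to demonstrate with a classic two-equation example; the specific numbers below are chosen only for illustration, and the condition number reported by numerical software quantifies the problem.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 1.0001]])       # nearly collinear columns
b1 = np.array([2.0, 2.0])
b2 = np.array([2.0, 2.0001])        # tiny change in the right-hand side

print(np.linalg.solve(A, b1))       # [2., 0.]
print(np.linalg.solve(A, b2))       # [1., 1.]  -- a large change in the solution
print(np.linalg.cond(A))            # condition number of about 40,000
```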
It is possible to stabilize an ill‐conditioned matrix, although the inference from the resulting model can be
questioned since the original and new models are not the same. In practice, areas with numerical instability
can be highlighted to prevent any decision making there.
A common method to overcome multicollinearity is ridge regression (another name for this technique is Tikhonov regularization; see reference 16 in "Further reading"). This type of regression produces stable estimates of the regression coefficients by placing a constraint on the sum of the squared coefficients after data normalization, when all variables are centered and scaled as (Xk − X̄k)/σ̂k, where X̄k is the sample mean of the variable Xk and σ̂k is its estimated standard deviation. The ridge regression parameter is estimated by cross-validation before estimating the ridge regression coefficients; the more unstable the linear system, the more correction is required. The stabilization of the linear system is obtained at the price of additional bias of the local linear estimator, which increases as the kernel's bandwidth increases.
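A minimal sketch of ridge regression on standardized variables is shown below; the ridge parameter is fixed here for illustration, whereas in practice it would be chosen by cross-validation as described above.

```python
import numpy as np

def ridge_coefficients(X, z, ridge):
    """Ridge estimate for centered and scaled explanatory variables."""
    # data normalization: subtract the sample mean, divide by the sample standard deviation
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    zc = z - z.mean()
    # penalized normal equations: (Xs'Xs + ridge*I) beta = Xs'z
    return np.linalg.solve(Xs.T @ Xs + ridge * np.eye(Xs.shape[1]), Xs.T @ zc)

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)    # nearly collinear with x1
z = 3.0 * x1 + rng.normal(size=100)
X = np.column_stack([x1, x2])
print(ridge_coefficients(X, z, ridge=0.0))    # unstable: the two coefficients are nearly arbitrary
print(ridge_coefficients(X, z, ridge=10.0))   # stabilized, at the price of some bias
```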
Another way to stabilize the linear regression fitting is removing the points that contribute most to the instability (this approach is used in ArcGIS 9.3). This can be done based on the condition number, which indicates how close the explanatory data are to linear dependency at each prediction location. According to the literature on linear regression, a condition number in excess of 10 may indicate a near linear dependency between variables. Note that data transformation should be avoided when the condition number is estimated because it may mask data collinearity (see the examples in reference 1 of "Further reading").
Geographically weighted regression is very sensitive to data outliers (data values that are extreme with respect to neighboring values), and it does not work well with explanatory variables that exhibit a trend, for example, the distance to a particular geographic object (such as the location of an environmental accident or the coastline), because such variables can make the system of geographically weighted regression equations ill-conditioned.
It is true that in the presence of nonstationarity, a global statistic produced by conditional and simultaneous spatial autoregressive models may not be accurate locally, as is also the case in geostatistical modeling of data collected over a large territory (see "Moving windows and disjunctive kriging" in chapter 9). However, the opposite situation can also occur, when a stationary process is treated as nonstationary. Then local statistics are less reliable than global statistics, as in the case of empirical Bayes smoothing discussed in chapter 11.
When input data are aggregated in administrative regions as in the case of housing data example above,
information about data variation inside each polygon is usually not available and variation of the regression
coefficients inside each polygon is not justified (see the discussion in the section “Regional data aggregation
and disaggregation” below).
With any statistical model, the actual model must be differentiated from its use when model assumptions are
violated. Geographically weighted regression is a local method, and the problems with violating modeling
assumptions are similar to those discussed in chapter 9 for geostatistical models; however, they are made
worse due to additional regression coefficient calculations and subsequent inference.
Geographically weighted regression is sometimes used for prediction of unknown values, such as annual precipitation over a large territory with varying meteorological conditions. Generally, this may not be a good idea because geographically weighted regression is not a true nonstationary model: it assumes that the data are uncorrelated given the explanatory variables.
Figure 12.8 at left shows the prices of houses that were sold in the first half of 2006 in part of the city of
Nashua, New Hampshire. These data are described and used in appendix 2. Four Gaussian variables were
simulated at the houses’ locations using an exponential semivariogram model with a range parameter of
4,000 feet and various nugget and partial sill parameters. A value of 4,000 for the range parameter was used
because this is an approximate range of the sale price variability. One of these simulations is shown in figure
12.8 at right (the data values are arbitrary).
Courtesy of the City of Nashua, N.H.
Figure 12.8
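The simulation itself is easy to reproduce; a sketch is shown below, using a Cholesky decomposition of an exponential covariance matrix. The locations, nugget, and partial sill are arbitrary illustrative values, and the factor 3 in the exponent follows the practical-range convention (other conventions drop it).

```python
import numpy as np

def simulate_exponential_field(coords, range_param, partial_sill, nugget, rng):
    """Simulate one Gaussian field with an exponential covariance model."""
    # pairwise distances between all locations
    d = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1))
    # exponential covariance; the nugget acts only at zero distance
    cov = partial_sill * np.exp(-3.0 * d / range_param) + nugget * np.eye(len(coords))
    # the Cholesky factor turns independent normals into spatially correlated values
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(len(coords)))
    return L @ rng.standard_normal(len(coords))

rng = np.random.default_rng(42)
coords = rng.uniform(0, 20000, size=(300, 2))   # locations, in feet
field = simulate_exponential_field(coords, range_param=4000.0,
                                   partial_sill=1.0, nugget=0.2, rng=rng)
print(field[:5])
```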
Figure 12.9 at left shows scatterplots of the sale price (first row) and four simulated variables. We see from
the scatterplots that all five variables are uncorrelated, and this is what the geographically weighted
regression model requires. Another good feature of the simulated data is that they do not have outliers and
trends by construction. The data were randomly divided into two parts, 369 and 124 houses, and the
geographically weighted regression was fitted using 369 values of four simulated variables as predictors.
Then the predictions were validated using the remaining 124 house price data values. The result of the
validation exercise is shown in figure 12.9 at right. The green line is locally weighted regression, and red is
the 1:1 line. It appears that four arbitrary mutually uncorrelated random variables predict low and median
house prices reasonably well!
Figure 12.9
Figure 12.10 shows maps of two local regression coefficients. They are slowly changing in space, "explaining"
why housing prices are low or high in different parts of the city. Since the sale prices of the expensive houses
are not predicted well, it may be that additional variables are necessary for simulation to improve the
geographically weighted regression predictions.
Figure 12.10
The validation exercise was repeated using the standard linear regression model, which also assumes that
data are uncorrelated. Standard linear regression estimates global regression coefficients. In this case, all
regression coefficients are nonsignificant (their p‐values are near 0.5 instead of being smaller than 0.05 when
the relationships between the variable of interest and the explanatory variables are significant).
When the simulated data were used to improve prediction of house prices using the ordinary cokriging model, Geostatistical Analyst assigned extremely low weights to all explanatory variables by default (figure 12.11).
Figure 12.11
It can be concluded that while geographically weighted regression can be effective in isolating the
contributions from individual factors, it should be used with great care (with careful diagnostics) because the
model can produce absurd results.
An example of the geographically weighted regression diagnostics is shown below with the following model
using actual housing data:
[house price at location s] =
β1(s)⋅[gross area] + β2(s) [sq feet] + β3(s) [garage area] + β4(s) [house age]
Figure 12.12
Figure 12.13
Figure 12.14 shows maps of correlation between the regression coefficients for the square footage and the
age of a house (left), the area of the lot (center), and the garage area (right). Since cross‐correlation is strong
in relatively large areas in red, the housing price prediction with geographically weighted regression can be
inaccurate.
Figure 12.14
Real estate data collected in the city of Nashua are further analyzed in appendixes 2 and 3. It is shown that predictions made by the generalized additive model (it was presented in chapter 6, and its variant is further discussed in the section "Semiparametric regression" below) explain about 90 percent of the housing price variation, while geographically weighted regression explains only 64 percent. We also used a conditional autoregressive model and cokriging with the same data, and those models explained about 82 percent of the housing price variation. This example shows that geographically weighted regression can be a poor predictor.
The geographically weighted regression is available in ArcGIS 9.3 Spatial Statistics Tools for users with
Geostatistical Analyst or ArcInfo licenses.
LINEAR MIXED MODEL
Mixed models extend standard regression models, allowing for random effects. The necessity of random effects can be illustrated using measurements of carbon monoxide (CO) collected in California in 1999.
California is divided into 12 regions with supposedly similar air pollution shown in figure 12.15 at left using
different colors. A scatterplot of monthly data (CO values in ppm versus the month number), observed in 24
locations in the San Francisco Bay area, is shown in figure 12.15 at right. Although the monitoring stations are
situated in a relatively small area, there is a large variability of carbon monoxide each month, and answering
questions about air pollution in the region is rather difficult.
Data courtesy of California Air Resources Board.
Figure 12.15
Figure 12.16 shows CO data in the Bay area with lines connecting measurements made at the same
monitoring station. We see that data variability is different for different monitoring stations.
Data courtesy of California Air Resources Board.
Figure 12.16
A simple (nonspatial) linear model for the CO data is the quadratic polynomial regression

COij = β0 + β1⋅monthi + β2⋅(monthi)² + errorij,

where the index i goes through the month numbers, the index j numbers the monitoring stations, and errorij is an identically and normally distributed random error with constant variance σ².
Examples of linear model fitting can be found in appendix 2. For the linear model for CO data, the estimated
residual standard error equals 0.82 and the regression coefficients are displayed in table 12.1.
Coefficient   Estimate   Standard error   Significance
β0            4.19       0.17             significant
β1            −1.01      0.06             significant
β2            0.083      0.0046           significant
Table 12.1
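A sketch of fitting such a quadratic regression with ordinary least squares is shown below. Because the California measurements are not reproduced in this section, the data are simulated using values close to those in table 12.1; only the structure of the fit matters.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
month = np.tile(np.arange(1, 13), 24)     # 12 months at 24 monitoring stations
co = 4.2 - 1.0 * month + 0.083 * month**2 + rng.normal(scale=0.8, size=month.size)

X = sm.add_constant(np.column_stack([month, month**2]))  # intercept, month, month^2
fit = sm.OLS(co, X).fit()
print(fit.params)            # estimates of beta0, beta1, beta2
print(fit.bse)               # their standard errors
print(np.sqrt(fit.scale))    # residual standard error
```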
A problem with the linear model above is that it ignores correlation and population effects within a particular month and within a monitoring station (that is, between-station and between-month variability), which are of interest in air pollution monitoring. One solution to the problem is adding random effects to the model, describing the CO data variability around the mean value for each month:

monthi = mean(monthi) + (monthi − mean(monthi)) = mean(monthi) + variability(monthi)

The last term in the expression above has a mean value equal to zero by construction, so that only its variance needs to be estimated. The resulting linear mixed model is the following:
COij = β0 + γi + β1⋅monthi + β2⋅(monthi)² + errorij,

where γi is a random sample from a normal distribution with zero mean and variance σγ². Fitting the linear mixed model gives the following standard errors for the varying intercept and the residuals: σ̂γ = 0.339 and σ̂ = 0.767. The regression coefficients are shown in table 12.2. The difference between the linear and linear mixed models is in the estimated standard errors for the regression coefficients and for the residuals: they are larger in the case of the linear mixed model, meaning that the linear model underestimates the model uncertainty.
Table 12.2
Figure 12.17 shows the CO data at the monitoring stations (red) and the fitted mean curve β0 + β1⋅monthi + β2⋅(monthi)² (black line).
Data courtesy of California Air Resources Board.
Figure 12.17
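A sketch of fitting the random intercept model above with a general-purpose mixed model routine is shown below; the data are simulated for illustration, and the grouping variable plays the role of the index i in γi.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
month = np.tile(np.arange(1, 13), 24)                  # 12 months, 24 stations
gamma = rng.normal(scale=0.34, size=12)                # random month effects gamma_i
co = (4.2 - 1.0 * month + 0.083 * month**2
      + gamma[month - 1] + rng.normal(scale=0.77, size=month.size))
df = pd.DataFrame({"co": co, "month": month, "month2": month**2})

# fixed quadratic trend in month plus a random intercept for every month
model = smf.mixedlm("co ~ month + month2", df, groups=df["month"])
result = model.fit()
print(result.params)                  # fixed effects beta0, beta1, beta2
print(result.cov_re, result.scale)    # random intercept variance and residual variance
```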
γi in the model above describes the variability of CO measurements around the monthly averages. Another possibility is adding a normally distributed variable γj with zero mean and its own variance for modeling random variation at each monitoring station. It also makes sense to add spatial variability to the model because CO
values are spatially correlated (this can be shown using semivariogram modeling). Spatial correlation in the
mixed model can be described using both geostatistical (through the semivariogram model) and spatial
autoregressive (through spatial weights) approaches to spatial data analysis. A spatial linear mixed modeling
case study using the SAS statistical software package can be found in chapter 15.
GENERALIZED LINEAR AND GENERALIZED LINEAR MIXED MODELS
Spatial autoregressive and geographically weighted regression models discussed in chapter 11 and in this chapter assume that the mean value of the response Z(s) is a linear function of the explanatory variables X1(s), …, Xk(s) and that the observed values are distributed around the mean. To derive optimal model parameters and their associated standard errors, assumptions about the distribution of the errors ε(s) must be made, and it is usually assumed that the errors follow the Gaussian distribution. Violations of the assumptions about linearity and Gaussian distribution may not seriously affect the estimation of the coefficients βk, but estimation of the regression coefficient uncertainty is much more sensitive to distribution misspecification. In the Gaussian case, the estimated mean and variance provide complete knowledge about the variation of the data.
If the Gaussian approximation to the data does not hold, data transformation can result in a more symmetrical distribution with smaller variance heterogeneity, but it may not be possible to linearize the mean relationship and stabilize the variance at the same time. For example, there is usually no acceptable way to transform count or rate data to a normal distribution with constant variance. In any case, if it is known that the data follow not the Gaussian but another theoretical distribution, the best approach is to use models that are designed for that distribution.
Generalized linear models extend linear modeling to a broad exponential family of distributions. This family
includes the following members:
continuous distributions: Gaussian, log‐Gaussian, inverse Gaussian, gamma, and beta;
discrete distributions: Bernoulli (binary), Poisson, binomial, and negative binomial.
The name exponential family is used because the probability density functions in this family can be written in
exponential form.
The following example shows what happens if a nonspatial linear model is used with binary data. In the top of
figure 12.18, the subset of the arsenic concentration measurements in groundwater collected in the
southeastern part of Bangladesh is displayed as depths of wells over a map of the probability that a specific
threshold value in drinking water is exceeded.
Groundwater arsenic contamination was measured to determine whether the drinking water in the wells was polluted above the upper permissible contamination level. Suppose that the available measurements take only two values, 1 (dangerous level of arsenic) and 0 (well is not contaminated), and our goal is to determine whether the probability that a well is contaminated depends on the well's depth. In this case, the linear model is inadequate because complicated constraints on the regression coefficients are required to make sure that the predictions are inside the interval [0, 1].
Figure 12.18
The result of the linear model fitting is shown in figure 12.18 as a red line. Although the observed data are in the interval [0, 1], the predictions are outside this interval when the well depth is greater than 235 meters. We see that the arsenic concentration is very low starting from a well depth of about 60 meters, and we would expect the prediction of the threshold exceedance to have a sigmoidal rather than linear shape. The green line in figure 12.18 shows predictions made by a model that takes into account the binary nature of the response variable: the generalized linear model with logit link

log(p(s)/(1 − p(s))) = β0 + β1⋅[well depth at s],

where p(s) is the probability that the threshold concentration of arsenic is exceeded (see also the section "Agriculture" in chapter 6).
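The contrast between the two fits can be sketched as follows; the well depths and the sigmoid used to simulate contamination are invented for illustration and are not the Bangladesh data.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
depth = rng.uniform(5, 300, size=400)                    # well depth, meters
p_true = 1.0 / (1.0 + np.exp(-(2.0 - 0.05 * depth)))     # deeper wells are safer
contaminated = rng.binomial(1, p_true)                   # 1 = dangerous arsenic level

X = sm.add_constant(depth)
linear = sm.OLS(contaminated, X).fit()                                   # the red line
logistic = sm.GLM(contaminated, X, family=sm.families.Binomial()).fit()  # the green line

new = sm.add_constant(np.array([10.0, 250.0]))   # a shallow and a deep well
print(linear.predict(new))      # linear predictions are not constrained to [0, 1]
print(logistic.predict(new))    # logit-link predictions stay inside [0, 1]
```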
Suppose that we are modeling the dependence of Poisson random variables, say epidemiological count data, on a vector of explanatory variables X1(s), …, Xk(s). There is a problem with the linear model of the mean, μ(s) = β0 + β1⋅X1(s) + … + βk⋅Xk(s), because the right-hand side can take any real value while the Poisson mean (the expected count of events) on the left-hand side must be nonnegative. A solution to the problem is to model the logarithm of the mean value, assuming that the transformed mean follows a linear model:

log(μ(s)) = β0 + β1⋅X1(s) + … + βk⋅Xk(s).

This is the generalized linear model with log link. Its inverse is μ(s) = exp(β0 + β1⋅X1(s) + … + βk⋅Xk(s)), so that exp(βk) is the relative increase in the mean if the covariate Xk changes by one unit.
Summarizing, the systematic part of the generalized linear model, g(μ(s)) = β0 + β1⋅X1(s) + … + βk⋅Xk(s), describes a transformation of the mean of the predicted variable instead of the mean itself. This transformation is called the link function. The random part of the generalized linear model supports the exponential family of data distributions for the variation of the data around the mean value. Each theoretical data distribution is used with a particular link function g(⋅).
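A sketch of the log-link model for counts is shown below, including the multiplicative interpretation of an exponentiated coefficient; the data and coefficient values are simulated for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.normal(size=500)                   # one explanatory variable
mu = np.exp(0.5 + 0.3 * x)                 # log link: log(mu) = 0.5 + 0.3*x
counts = rng.poisson(mu)

fit = sm.GLM(counts, sm.add_constant(x), family=sm.families.Poisson()).fit()
print(fit.params)             # close to (0.5, 0.3)
print(np.exp(fit.params[1]))  # multiplicative change in the mean per unit change in x
```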
In the case of spatially correlated data, a generalized linear mixed model is used. In this case, the link function has the following form:

g(μ(s)) = β0 + β1⋅X1(s) + … + βk⋅Xk(s) + S(s),

where S(s) is a Gaussian random field with zero mean and a particular spatial covariance function. S(s) adds spatially structured random variability to the generalized linear model. Note that the other random component of the model, the error variance (it is discussed earlier in the section "Linear mixed model"), which describes the distribution of the random errors of the response variable, can have any distribution from the exponential family, including Gaussian. The Gaussian random field S(s) can be thought of as a random addition to the intercept coefficient β0.
Although prediction to unsampled locations where the explanatory variables are observed or estimated is possible and useful, the generalized linear mixed model is used more often for accurate estimation of the regression coefficients βk (they are called fixed effects in the statistical literature because random variation of the coefficients is not allowed).

The variance of the response variable in the spatial mixed models is defined conditionally; that is, it is assumed that the data are independent given the value of the Gaussian random field, meaning that the spatial correlation of the response does not depend on the explanatory variables but is defined by some other hidden process that generated the Gaussian random field.
Estimation of the generalized linear mixed model parameters is iterative, and convergence depends on the initial values provided by the user of the statistical software, meaning that it can be dangerous to rely on the default parameters estimated by the software. Therefore, a good understanding of the theory and of the specific
algorithms used by the software developers is essential for successful generalized linear mixed model fitting
and subsequent statistical inference.
Additional information and examples of generalized linear mixed models usage can be found in appendixes 3
and 4.
SEMIPARAMETRIC REGRESSION
Authors of the book in reference 10 showed that the varying intercept in mixed models can be described by
penalized splines, and the resulting fitted regression line is the same as in the mixed model (for example, the
black line in figure 12.17). Note that the number of parameters to be estimated by the penalized spline model
is the same as in the mixed model discussed in the section “Linear mixed model”: the variance of the spline
smoother is estimated instead of coefficient β2. The regression spline modeling concept is introduced in
chapter 6. In this section, we illustrate the penalized spline implementation in the semiparametric regression
model.
The semiparametric regression model is a variant of the generalized additive model introduced in the section
“Fishery” in chapter 6. Figure 12.19 shows how the semiparametric regression model estimates the
dependency between house price (y axis) and the explanatory variables family size, number of rooms, house
age, distance to the freeway, income (x axis), as well as large‐scale spatial data variation (a map in the bottom
right corner) for the block group data from the vicinity of Rancho Cucamonga, California. Gray areas in the
graphs show a 95 percent prediction interval around estimated dependency shown as the black lines.
Variability bands are obtained by adding and subtracting twice the standard error of the estimated function.
These graphs are very informative about the relationships between housing prices and the housing
characteristics in the area under investigation. Using the dependency between housing prices and housing
characteristics, predictions to the locations with known house characteristics are possible.
Figure 12.19
Often researchers are interested in individual-level inference, and working with aggregated data may not be appropriate. For example, the risk of disease depends on age, gender, and bad habits such as smoking, and these individual characteristics are not very helpful when they are averaged over administrative boundaries. Semiparametric regression allows using both spatial and individual-specific data.
The approach is illustrated using the malaria data collected in villages in The Gambia. The following variables are available:
presence (1) or absence (0) of malaria in a blood sample taken from a child (36 percent of blood samples tested positive for malaria);
satellite-derived measure of the greenness of vegetation in the immediate vicinity of the village (arbitrary units);
age of the child in days (average age is 1,080 days);
indicator variable denoting whether (1) or not (0) the child regularly sleeps under a bed-net (71 percent of the children used a bed-net);
indicator variable denoting the presence (1) or absence (0) of a health center in the village (68 percent of villages have a health center).
Modeling of the malaria risk is difficult since three variables are binary and two variables are continuous, and there are many measurements in each village. The log odds of malaria presence at a given location s can be modeled by the semiparametric model as

log(p(s)/(1 − p(s))) = β0 + β1⋅[age] + β2⋅[bed-net] + β3⋅[health center] + f([greenness]) + S(s),

where f(⋅) is a penalized spline and S(s) is a spatially structured effect.
Fitting the semiparametric model to the malaria data gives the dependencies shown in figure 12.20 (the y axis is the exponent of the indicator variable malaria presence). According to the model, all explanatory variables play a significant role in explaining the presence of malaria in children.
Figure 12.20
The greenness of vegetation (figure 12.20 at bottom right) is modeled using spline, allowing for a nonlinear
relationship between the malaria presence and the covariate, while the dependence between the response
variable and other variables is linear.
There is a link between the variance of the spatial effect and the smoothing parameter of the spline model, and that smoothing parameter can be estimated through semivariogram modeling. Then we can say, for clarity, that semiparametric regression models the spatial effect using a particular kriging model, although formally it is a spline with estimated parameters that correspond to a particular semivariogram model. Specifically, the spatial variation of the disease presence term is modeled via simplified kriging.
Kriging is simplified assuming that the covariance model is KBessel with a shape parameter that equals 3/2
(see the discussion on the KBessel model in chapter 8), a nugget parameter that equals zero, a range
parameter that is equal to the largest distance between all pairs of points, and a partial sill parameter that is
calculated from the data variance. This allows fitting of a spatial model by the numerical optimization of just one parameter (the partial sill). The resulting semivariogram model describes large-scale data variation, and kriging based on this semivariogram model generates an over-smoothed prediction surface, shown in figure 12.21.
Data from Diggle, P. J., and P. J. Ribeiro. 2007. Model-based Geostatistics. New York: Springer.
Figure 12.21
A comparison of the maps in figure 12.21 shows that only part of the spatial variation is captured by
semiparametric regression. Note also that simplified kriging underestimates the prediction standard errors.
This can be shown by comparing prediction standard errors made by simplified and nearly optimal kriging
models.
Data from Diggle, P. J., and P. J. Ribeiro. 2007. Model-based Geostatistics. New York: Springer.
Figure 12.22
Conceptually, modeling spatial data variation in a semiparametric model using penalized splines is similar to the radial basis function approach discussed in chapter 7. The main difference is that the radial basis functions are simply located at the measurement locations, while semiparametric regression, as well as the generalized additive model discussed in chapter 6 and the radial smoother option of the generalized linear mixed model discussed in appendix 4, requires a relatively small number of "well-distributed" locations (called knots in the literature) for the basis functions in order to be computationally efficient. The task of selecting optimal locations for the knots (a number between 50 and 200 in the case of spatial data) in such a way that additional knots become unimportant to the final prediction is very difficult; it may require more computational time than the model fitting and prediction.
The semiparametric model should be used carefully in the decision making because a large number of the
model parameters is estimated by the software and, as with any other model, there is no guarantee that the
default parameters are close to the optimal values.
HIERARCHICAL MODELS
In many GIS applications, there are multiple interactive factors that can affect the variable of interest. Because
the best possible prediction requires all available information, researchers are looking for complex and
realistic models that can successfully generate observed data.
In statistics, complex problems can be modeled using either a joint or conditional approach. In practice, it is
difficult to specify joint multivariate spatial covariance structures for complex spatial processes (in principle,
it can be done using multivariate copulas, see the section “Copula‐based spatial regression” below). It is easier
to factor such joint distributions into a series of conditional models. For example, in modeling the quality of
wine grapes discussed in chapter 6, it is possible to simplify modeling specifications and account for the data
and model uncertainties using a series of conditional models linked together by the probability rules as
shown in the examples of hierarchical modeling of spatially uncorrelated data using the Bayesian belief
network in chapter 5.
Hierarchical modeling is based on the probability rule that decomposes the joint distribution of a collection of
random variables into a series of conditional models: if A, B, and C are random variables, then we can write
their joint distribution as [A, B, C] = [C| B, A][B| A][A]. Here the probability distribution in brackets [A] refers to the distribution of A, and [A| B] refers to the conditional distribution of A given B (see also chapter 5).
Usually hierarchical data modeling includes three stages:
1. Data model: [data| process, parameters]
This stage describes the distribution of the data given the process of interest and the parameters of the
process. For example, the observed count data can be modeled using a Poisson distribution [Z(s)|λ(s)] ~
Poisson(λ(s)), where λ(s) is the process mean.
2. Process model: [process| parameters]
This stage describes the process, conditional on the parameters. For example, the logarithm of process mean
λ(s) can be modeled using several exploratory variables, spatially correlated small‐scale effects, and
uncorrelated noise.
3. Parameter model: [parameters].
The last stage accounts for the uncertainty in the parameters of the process. These parameters may vary
depending on the origin of the data.
From these three entities, a joint distribution is obtained using Bayes’ rule:
[data, process, parameters] ∝ [data| process, parameters][process| parameters][parameters]
With Bayes’ rule, we can estimate the posterior distribution of everything that we did not observe given
everything that we did.
For example, suppose that Z1(s) and Z2(s) are two datasets measuring the same true process Y(s). The data model can then be factored as

[Z1(s), Z2(s)| Y(s), θ1, θ2] = [Z1(s)| Y(s), θ1][Z2(s)| Y(s), θ2],

where θ1 and θ2 are the model parameters, meaning that the data Z1(s) and Z2(s) are independent given (conditioned on) the true process Y(s). This does not mean that the two datasets are unconditionally independent. Instead, it is assumed that the dependence among the datasets is due to the true process Y(s): if we know Y(s), we do not need to know Z1(s) to explain Z2(s).
Suppose Yi is the number of cancer deaths in the ith region during the study period. The expected number of deaths Ei is calculated by applying the age-specific death rates rj to the stratified population at risk nij in the region,

Ei = r1⋅ni1 + r2⋅ni2 + … + rM⋅niM,

where M is the number of age-specific groups (a discussion on the use of age-specific groups can be found in
chapter 4). This model is appropriate if (a) the disease is rare, (b) the individual risk of having the disease is not spatially clustered within areas, and (c) the risk associated with living in area i is proportional to the average disease rate.
Independently in each area, the number of deaths Yi is assumed to follow a Poisson distribution with mean
and variance Ei⋅Θi, where Θi is the unknown risk of cancer mortality in region i. Then the model of the number
of cancer deaths, the first hierarchical level of the model, is the following:
Yi | Θi ~ Poisson(Ei⋅Θi).
In the case of a rare disease, an assumption about a constant relative risk Θi is unrealistic, for example, because of clustering of cases in space and time. Therefore, modeling the relative risk Θi becomes the second level of the hierarchical model. The risk variation itself can be modeled as

log(Θi) = β0 + ui + vi,

where ui describes the geographically unstructured component of heterogeneity of the relative risk, and vi represents a locally spatially structured relative risk variation:

ui ~ Normal(0, σu²),

vi | vj, j ≠ i ~ Normal(mean of the neighboring vj, σv²/mi), where mi is the number of neighbors of region i.
Figure 12.23
As a rule, modern papers on hierarchical modeling are also papers on Bayesian inference, as higher levels in
hierarchy act as priors for parameters in lower levels. Several reproducible examples of hierarchical models,
including a model shown in figure 12.23, are presented in appendix 3.
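A forward simulation of the two levels of the disease-mapping model above clarifies how they fit together. The sketch below uses arbitrary expected counts and, for brevity, omits the spatially structured term vi; fitting such a model to real data is the subject of appendix 3.

```python
import numpy as np

rng = np.random.default_rng(11)
n_regions = 50
E = rng.uniform(5.0, 80.0, size=n_regions)   # expected deaths E_i from age-specific rates

# second level: log relative risk = intercept + unstructured heterogeneity u_i
beta0 = 0.0
u = rng.normal(scale=0.3, size=n_regions)    # spatially unstructured component u_i
theta = np.exp(beta0 + u)                    # relative risks Theta_i (v_i omitted here)

# first level: observed deaths given the relative risks
Y = rng.poisson(E * theta)

print((Y / E)[:10])                          # raw standardized mortality ratios Y_i / E_i
```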
To better understand the ideas behind hierarchical data modeling, it is useful to compare it with the
geostatistical variant of multivariate data analysis (cokriging). To do this, we will discuss a hierarchical model
for the prediction of precipitation using elevation data as a covariate variable.
Suppose that precipitation data are collected in a mountainous area. As a rule, the amount of precipitation
tends to increase as altitude increases, and it is reasonable to use elevation data to improve the prediction of
the amount of rain in the unsampled locations.
Cokriging and the hierarchical model require specifications of the mean and covariance (the first two moments of the data distribution), which we will write as (mean, covariance).

The simple cokriging model assumes that the precipitation Z1 and elevation Z2 variables are jointly distributed in the region being studied with means µ1 and µ2, covariances C11 and C22, and cross-covariances C12 and C21 (which are usually equal, C12 = C21):

(Z1, Z2) ~ ((µ1, µ2), (C11, C12; C21, C22)),

Z1| Z2 ~ (µ1 + W⋅(Z2 − µ2), C1|2),

where Z1| Z2 is precipitation Z1 given elevation Z2; W is a weight matrix that specifies the dependence between Z1 and Z2; and C1|2 is the conditional covariance of Z1 given Z2, that is, the covariance of Z1 after removing the influence of Z2, which is considered part of the trend component µ1 + W⋅(Z2 − µ2). In other words, the trend in precipitation Z1 is a function of elevation Z2, and the conditional covariance is estimated using residuals after trend removal.
The hierarchical model requires specification of the two moments of Z1| Z2 and of Z2:

Z1| Z2 ~ (µ1 + W⋅(Z2 − µ2), C1|2),

Z2 ~ (µ2, C22).

Then it can be shown that the joint distribution of Z1 and Z2 is defined by

(Z1, Z2) ~ ((µ1, µ2), (C1|2 + W⋅C22⋅Wᵀ, W⋅C22; C22⋅Wᵀ, C22)).
The formula above shows that the hierarchical model requires knowledge of the means, the covariance of the explanatory variable C22, the conditional covariance C1|2, and the weights W. The weights W usually vary in space, and, consequently, the covariance of the hierarchical model is nonstationary, in contrast to the global cokriging covariance and cross-covariance models.
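The relationship between the conditional and joint specifications can be checked numerically in the scalar case. The sketch below simulates Z2 and Z1 given Z2 with arbitrary moments and recovers the implied variance and cross-covariance of the joint distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

mu2, C22 = 10.0, 4.0        # moments of the explanatory variable Z2
mu1 = 11.0                  # unconditional mean of Z1
W, C1g2 = 0.8, 1.5          # weight and conditional covariance of Z1 given Z2

# simulate from the conditional specification
z2 = mu2 + np.sqrt(C22) * rng.standard_normal(n)
z1 = mu1 + W * (z2 - mu2) + np.sqrt(C1g2) * rng.standard_normal(n)

print(np.var(z1), C1g2 + W * C22 * W)    # variance of Z1 matches C1|2 + W*C22*W'
print(np.cov(z1, z2)[0, 1], W * C22)     # cross-covariance of Z1 and Z2 matches W*C22
```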
The difference between cokriging and the hierarchical model is in the way dependence between variables is
modeled. The main goal of the hierarchical model of precipitation given elevation is to avoid specification of
the cross‐covariance between variables by modeling the dependence between mean values. This is because in
practice it is much easier to estimate regression coefficients and the covariance models than to assess the
cross‐covariance. This is especially true if one more variable is available, for example, the distance to the
ocean, since fitting two cross‐covariance models is much more difficult than one, whereas adding one or more
variables to the hierarchical regression model is usually unproblematic.
Note also that the main goal of disjunctive kriging discussed in chapter 9 is similar to the goal of the
hierarchical model: avoiding the cross‐covariance model estimation.
Estimation of the hierarchical model parameters is iterative, since estimation of the spatial regression coefficients requires measurements of precipitation Z1 and elevation Z2 at the same locations. The algorithm can be the following:
STEP 1 Estimate the covariance (semivariogram) models for the precipitation and elevation data.
STEP 2 Predict both the precipitation and elevation values at the required locations, for example, on a regular grid, using kriging.
STEP 3 Estimate the spatial regression coefficients, assuming that the estimated covariance models and the precipitation and elevation predicted at the previous step are true values.
STEP 4 Re-estimate the conditional covariance model from the residuals after removing the estimated trend.
STEP 5 Predict new values of the precipitation and, if necessary, elevation.
STEP 6 Repeat steps 3–5 until convergence.
Usually, the algorithm converges in less than a dozen iterations. Then the predictions are immediately
available since they are calculated at each iteration, and prediction standard errors are calculated from the
estimated joint covariance structure.
An example of hierarchical modeling can be found below in the section "Spatial factor analysis." A case study using a
two‐level hierarchical model can be found at the end of chapter 13. Additional examples of hierarchical
modeling are presented in appendix 3.
HIERARCHICAL MODELS VERSUS BINOMIAL AND POISSON KRIGING
The hierarchical model features are extensively discussed, and suggestions on their usage are made in the
statistical literature. There are also papers about the comparison of hierarchical models using simulated and
real data. The situation with binomial and Poisson kriging discussed in chapter 11 is different: these models
are used in several applications, but there are no simulation studies in which binomial and Poisson kriging
models are compared with standard disease mapping based on hierarchical models. However, the following general notes on the comparison of these models can be made.
Binomial and Poisson kriging are based on the estimated mean and variance values, and they may
produce negative predictions of rates and counts. The existence of negative predictions based on
nonnegative data may suggest that the chosen semivariogram model is wrong for short distances
(see also the discussion on the indicator semivariogram models in chapter 8).
The conditional expectation used in hierarchical models is theoretically better than linear prediction
in kriging.
Many hierarchical models use the Bayesian framework and are more similar to geostatistical
conditional simulation than kriging since the predictions are calculated from a posterior distribution
(see the examples in appendix 3). These distributions can be used for further modeling and for more
realistic estimation of the prediction credible intervals.
Hierarchical models are used not only for smoothing and prediction at the unsampled polygons but
also for the explanation of the reasons for the increased levels of animal abundance, crime, or disease
in certain areas, while the interpretation of cokriging weights in terms of causal dependence is
doubtful.
Theoretically, weights used in spatial regression models are more flexible than the semivariogram
functions that are based on the Euclidean distance between polygon centroids, because weights can
take into account the size and the shape of the regions. However, in practice, researchers often limit
themselves by using a weighting scheme based on the common border between polygons. Such
neighborhoods can be far from optimal, as shown in chapter 11. Consequently, hierarchical models
that use such simplistic weights may perform poorly.
If spatial weights are defined based on the distances between polygon centroids, kriging is preferable
because it fits the covariance model objectively, using the data, while weights for spatial regression
models are typically defined deterministically, without analysis of the spatial correlation between
neighboring polygons.
Kriging typically uses a larger number of neighboring observations with nonzero weights than spatial
regression models, and this results in smoother predictions. Depending on the application, this can
be an advantage or disadvantage.
Both models have advantages and disadvantages, and it is desirable to have them both in the research
arsenal.
MULTILEVEL AND RANDOM COEFFICIENT SPATIAL MODELS
Multilevel models are usually used in applied educational and econometric research to analyze hierarchically
nested data. In multilevel models, separate predictors characterize the micro (smaller) level units such as
students and macro (larger) level units such as schools.
Multilevel spatial statistical models analyze data simultaneously at different spatial scales. Figure 12.24
shows part of a large city with blocks of houses (white lines) built at various times for people with various
incomes. Yellow circles show house locations with recent and accurate information on house features,
whereas brown circles show locations with partial information. House features, including housing prices, are
similar within a block, but may be markedly different across the street. The idea of multilevel modeling is to
have a regression equation characterizing relationships at the micro, or house, level, then model the
regression coefficients as functions of variables (predictors) at the macro, or block, level. Accurately specified
multilevel modeling should be able to explain the variability of housing prices both at city block and at house
levels.
Photo courtesy of City of Riverside, Calif.
Figure 12.24
This approach is sometimes called a random coefficient model because regression coefficients are modeled as
random variables with a distribution that is generally different for different spatial levels. Formally, a random
coefficient model is more general than a multilevel model. A random coefficient model is a nonstationary
linear model in which the covariance between the data depends on the similarity of the nearby explanatory variables.
The price for house i within block j can be modeled as
priceij = mean + effect of block j + errorij
The simplest assumption about errorij is that it is an uncorrelated random variation with zero mean and
specified variance, Normal(0, σ2), but in this nested case, it might be reasonable to assume that this error is
composed of a house within a block effect and an unstructured error.
errorij = houseij + new errorij
Substituting this model for error into the one for price, we get
priceij = mean + effect of block j + houseij + new errorij
In the case of housing data, there are n blocks, and within block j there are nj houses. At the individual level,
the price of house i from block j is explained by variables X1ij, X2ij, …, XMij, meaning that there are M regression
coefficients and the intercept (a constant to be defined) for each house.
It is assumed that the errors are uncorrelated with the predictors Xkij, and these predictors fully explain
the variable of interest priceij (in practice, these assumptions must be verified).
For example, house price priceij can be a function of square footage, X1ij=SFij; number of bedrooms, X2ij=BRij;
the presence of a garage, specified by the indicator variable X3ij=GEij (with a value of 1 if there is a garage and
zero otherwise); and house age, X4ij=HAij.
priceij = β0j + β1j⋅SFij + β2j⋅BRij + β3j⋅GEij + β4j⋅HAij + errorij.

The parameter β0j (intercept) can be modeled as

β0j = χ00 + u0j.

The parameter β1j explains the influence of the square footage on the price of a house. It may depend on the average income known at the block level. In this case, it can be modeled as

β1j = χ10 + χ11⋅incomej + u1j.

The remaining coefficients can be modeled similarly, for example,

β2j = χ20 + χ21⋅family_sizej + u2j,

β3j = χ30 + χ31⋅population_densityj + u3j,

β4j = χ40 + χ41⋅downtownj + u4j.
It is usually assumed that the intrablock errors u0j, u1j, u2j, u3j, and u4j are normally distributed variables with zero mean and a specified variance, Normal(0, σu²), or that they are spatially dependent, similar to the random errors in a conditional autoregressive model.
Suppose that, instead of a single downtown indicator, the travel time to downtown is described by three indicator variables:
the first equals 1 if house i in block j is very near downtown and 0 otherwise;
the second equals 1 if house i in block j is near, and 0 otherwise;
the third equals 1 if house i in block j is far, and 0 otherwise.
There are three indicator variables because there are four travel-time categories, and if we know three of them, we can determine the remaining one. Then the model for the parameter β4j uses these three indicators as predictors instead of the single downtown variable.
Note that in the example above, all houses in block j have the same random coefficient distribution.
Multilevel models are useful for examination of how variables at one level of hierarchy are related to
processes at another level and can help in studying the interactions between phenomena at different spatial
scales. Therefore, we can learn when to use the macro level only and when the micro level should be taken into account.
Researchers who need to model spatially correlated data with explanatory variables have several options.
They may try to use x and y coordinates as explanatory variables, as in the example in chapter 6 (the section "Fishery") in which the generalized additive model with independent errors predicts egg counts reasonably well. This approach does not always work, and there is a theoretical problem with it: it is a curve-fitting approach, which assumes that the data are generated according to the specified function up to additive error, and in principle that function can be unrelated to the actual process that generated the data, as in the example with simulated data in the section "Geographically weighted regression." A curve-fitting approach does not differentiate between measurement errors and model misspecification.
They can assume that only residuals are spatially correlated. Although this approach works in many
situations (see an example of modeling groundwater arsenic contamination in chapter 15), it has
limitations when not only the response, but also the explanatory variables, are spatially dependent
(see an example of less than successful modeling of pine beetle attacks using logistic regression in
appendix 4).
Researchers can use the autoregressive models discussed in chapter 11 by assigning weights to the
nearest data. The main problem with this approach is that in many applications such as ecology there
is no theory on how to construct the weight matrix. Even when weights are somehow chosen, they
compete with regression coefficients for the same information because the explanatory variables
usually contain spatial information as well. This may lead to an inaccurate estimate of model
parameters and an inaccurate prediction of new values because small changes in the weights or in
the data may lead to large changes in the model output. Another obstacle is that the majority of
software implementations of spatial autoregressive models assume that the response variable is
normally distributed. This limits applications of the autoregressive models, although in practice some
researchers ignore this problem.
Weights can be selected objectively using geostatistical methods (semivariogram modeling)
assuming that data are located at the mathematical points at the polygon centroid locations, but this
assumption is rarely acceptable. Even if this problem is solved using areal interpolation as discussed
below, the data stationarity assumption may not hold.
Researchers can fit large numbers of classical linear regression models with spatially varying weights
with geographically weighted regression, but this approach should not be used outside of the data
exploration step of spatial data analysis.
If all the approaches above are unsatisfactory, researchers can try a multilevel model that simplifies the data
correlation structure, assuming that data located in different macro level units are independent. Note that a
multilevel model for housing data discussed in this section can be made more realistic (and more
complicated) by relaxing the assumption that data at micro levels are spatially independent.
An example of multilevel modeling is presented in appendix 3.
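For readers who want a feel for the computations, a minimal two-level (random intercept) sketch on simulated house prices is shown below; the block-level slope models and the spatial dependence discussed above are omitted, and all names and values are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n_blocks, houses_per_block = 40, 15
block = np.repeat(np.arange(n_blocks), houses_per_block)

block_effect = rng.normal(scale=30.0, size=n_blocks)    # macro (block) level variation
sqft = rng.normal(1800.0, 300.0, size=block.size)       # micro (house) level predictor
price = 50.0 + 0.1 * sqft + block_effect[block] + rng.normal(scale=15.0, size=block.size)

df = pd.DataFrame({"price": price, "sqft": sqft, "block": block})

# fixed effect of square footage, random intercept for every block
fit = smf.mixedlm("price ~ sqft", df, groups=df["block"]).fit()
print(fit.params)    # intercept and square-footage effect
print(fit.cov_re)    # estimated between-block variance
```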
GEOGRAPHICALLY WEIGHTED REGRESSION VERSUS RANDOM COEFFICIENTS
MODELS
Both geographically weighted regression and random coefficients models are very popular, depending on the
user community: the former is often used by GIS users, and the latter are frequently used by statisticians.
Surprisingly, we found only one paper that compares the two approaches by exploring associations between
local levels of regional data (see reference 4 in “Further reading”). Authors of that paper used Houston crime
data (violence, alcohol sales, and illegal drug activities) introduced in chapter 6, in the section “Criminology.”
This section summarizes the difference between geographically weighted regression and random coefficients
models based on the results from paper 4 in “Further reading.” First, we present a variant of the random
coefficients model used in that paper. Second, we compare two modeling approaches. In short, it was found
that the only advantage of geographically weighted regression is speed of computations (this is in line with
the comparison in earlier chapters between inverse distance weighted interpolation and kriging). Interested
readers may try to reproduce the comparison results using information from assignment 5 in appendix 3.
The spatial random coefficients model can be formulated as follows. For the ith tract, the observed number of events (crimes) Yi is described by a Poisson distribution with mean µi,

Yi ~ Poisson(µi), log(µi) = log(Ni) + β0i + β1i⋅x1i + β2i⋅x2i + εi,

where x1i and x2i are the explanatory variables (in this case, the logarithms of the standardized numbers of alcohol sales and illegal drug activities) and Ni is the population size in tract i. The nonspatial component εi represents extra-Poisson variability, and the random intercept β0i is spatially structured, with spatial correlation between nearby regions defined through spatial weights.

Spatial cross-correlations between the spatially varying regression coefficients β0i, β1i, and β2i are specified through the prior distribution of the spatial random effects in two steps. First, the local spatially correlated random effects are defined as the weighted sum of their neighboring values.
A scatterplot of these coefficients for all tracts shows clear correlation between β1i and β2i (the regression coefficients of the alcohol sales and illegal drug activities). Remember that geographically weighted regression assumes that the regression coefficients are not correlated. Therefore, it is expected that the estimation of the regression coefficients β1i and β2i by the geographically weighted regression model would be inaccurate.
The maps in figure 12.25 are not exactly the same as in the discussed paper because we used a different
weighting scheme: from 8 to 12 nearest neighbors with slowly decreasing weights with increasing distance
between polygon centroids instead of a small number of adjacent neighbors with equal weights, as was used
in the original paper. The use of different neighborhoods results in different surfaces of spatially varying
regression coefficients. Typically, these surfaces are only slightly different; otherwise the model is unstable
and, therefore, not very useful.
Courtesy of Texas City Police Department and Texas Alcoholic Beverage Commission.
Figure 12.25
Geographically weighted regression and random coefficients models can be compared as follows.
Geographically weighted regression can successfully describe strong spatial data dependence only,
while a random coefficient model is more general because it includes both nonspatial data variation
and spatial data correlation, weak or strong.
Geographically weighted regression requires independent explanatory data (this is often an unrealistic requirement) to make the inference, while a spatial random coefficient model incorporates spatial cross-correlation between the explanatory variables and the random intercept. Authors of the discussed
paper fitted 9 random coefficient models, and the best were those that included correlation between
the spatially varying covariate effects.
Geographically weighted regression does not have a formal theory for testing whether the spatial variation in the regression parameters is significant when the multiple weights assigned to each observation are estimated from the data.
In summary, the random effects spatially varying coefficients model provides a broader class of statistical
models, which allow better interpretation and wider use of the estimated spatial data relationships.
Courtesy of Texas City Police Department and Texas Alcoholic Beverage Commission.
Figure 12.26
SPATIAL FACTOR ANALYSIS
Spatial regression methods discussed so far in this chapter have dealt with modeling one particular spatial
variable using observations of that variable at other locations and using one or more explanatory variables.
However, there are applications in which several variables show similar patterns of geographic variation, which suggests the existence of a common latent variable or variables (other terms for latent variables used in the
literature include factors, unmeasured variables, hypothetical constructs, and true scores). For example,
different diseases are caused by similar environments, social and food habits, and other common things. The
idea that observable phenomena are caused by unobserved forces is as old as humanity. For instance, Plato
considered latent variables as real things, not just mind constructions. In everyday life people often explain
events based on the things that cannot be observed directly.
The latent variables help to generalize relationships between things and facts. They are used in such
statistical applications as regression, factor analysis, and structural equation modeling. The idea is that one or
more latent variables create the relationship between observable variables, which become independent provided
that the latent variable is held fixed. In the case of spatial data, one goal of factor data analysis can be a
determination of which observed variables are caused by the same latent spatially correlated factor and an
estimation of that factor. Another goal can be data reduction by summarizing multiple observations by one
index. For example, the index of material deprivation (a latent variable) is constructed from regional
information on percent unemployed, percent of households without a car, percent of rented houses, and
percent of households with more than one person per room. Ideally, this index should be spatially correlated,
as are the data from which it is constructed, and it should be accompanied by information about its
uncertainty. Another example is the quality of wine grapes, which is not measured by a particular device but
can be constructed from observable variables (see the section “Wine grapes quality model” in chapter 6).
In the section “Criminology” in chapter 6, one method for estimating the latent variable called the common
spatial factor model is introduced. The idea is to select a statistical model that allows for relevant statistical
inferences with observed data and assume that the latent variable can be a part of that model. Then the
estimation procedure is specified, and hidden spatial factors are estimated.
Suppose there are k observed variables Zi(s), i=1,2,…,k, which can be explained by a linear function of m
unobservable factors fj(s), j=1,2,…,m, plus zero‐mean error processes εi(s), i=1,2,…,k, meaning that all
relationships among the Zi(s) are explained by the fj(s) and not by the εi(s) processes. In this model, all fj(s)
are spatially correlated latent processes, while the error processes εi(s) are spatially independent.
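As a rough illustration of the idea, the base R sketch below simulates four variables driven by a single spatially correlated factor and then recovers the factor with classical, nonspatial factor analysis (factanal). This is only a stand‐in for the Bayesian common spatial factor model used in the case study; the locations, loadings, and noise levels are invented.

# Sketch: one spatially correlated latent factor driving four observed variables
set.seed(2)
n  <- 200
xy <- cbind(runif(n), runif(n))
D  <- as.matrix(dist(xy))
C  <- exp(-D / 0.3)                        # exponential spatial covariance
f  <- as.vector(t(chol(C)) %*% rnorm(n))   # spatially correlated common factor

loadings <- c(1.0, 0.8, 0.6, 0.9)          # influence of the factor on each variable
Z <- sapply(loadings, function(l) l * f + rnorm(n, sd = 0.5))
colnames(Z) <- c("Z1", "Z2", "Z3", "Z4")

fa <- factanal(Z, factors = 1, scores = "regression")
print(fa$loadings)
plot(f, fa$scores[, 1], xlab = "true common factor", ylab = "estimated factor score")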
We illustrate the common spatial factor model using epidemiological data: bladder, esophagus, lung, and
pancreas cancer mortality among men in three southern U.S. states. Figure 12.27 shows cancer mortality
rates. Our goal is finding a possible spatial structure that can explain all four types of cancer.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 12.27
In the common spatial factor model above, the only reason for spatial variation of all four cancer rates is the
common factor f(s). Figure 12.28 shows the estimated mean and prediction standard error values of the common
spatial factor. It would be interesting to know if epidemiologists had an opinion on how reasonable the
estimated common factor is.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 12.28
The regression model for the Poisson mean values can be rewritten so that the logarithm of each cancer's mean
is a sum of the logarithm of the relative risk and a spatially uncorrelated error, and the loading coefficients
describe the influence of the common factor on the spatial distribution of each cancer. Figure 12.29 shows values of relative risk for lung and
bladder cancer mortality. Values greater than 1 show areas with an excess of the cancer mortality counts, and
we see that the area where lung cancer mortality is greater than expected is larger than the area where the
number of bladder cancer counts is greater than expected. We do not know a reason for that, and it could be a question for further epidemiological research.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 12.29
The common spatial factor analysis model presented in this section can be improved using additional
covariates and additional latent variables. For example, there may be factors that influence cancer mortality
such as smoking prevalence, characteristics of the lifestyle, and socioeconomic inequalities. Also, disease‐
specific risk factors can be added to the model. Finally, the common spatial factor model assumes that the
latent and observed variables are independent, ignoring possible interactions between the model’s
components. Cross‐correlation can be added to the model using an approach discussed in the section
“Geographically weighted regression versus random coefficients models.”
COPULA‐BASED SPATIAL REGRESSION
Bivariate copula‐based geostatistical modeling is briefly introduced at the end of chapter 9. In this section,
additional information on modeling data dependence with copulas is provided. Then an example of a spatial
regression model based on a particular copula is sketched. This section can be difficult to read because a well‐
developed case study is not provided, since at the time of this writing the copula‐based regression software
was not accessible.
Copulas are used to couple two or more univariate distributions. For example, a copula can describe
dependence between two different Gaussian distributions or between a gamma distribution and a Gaussian
distribution. This flexibility is not found in traditional multivariate statistics. Figure 12.30 at top left shows
the bivariate normal distribution density for distributions Normal(mean=3, standard deviation=2) and
Normal(mean=5, standard deviation=3). The dependence structure is symmetrical with equal probabilities for
both high and low data quantiles. Because probabilities of very low and very high values are low, bivariate
normal distribution has a nearly zero probability for joint extreme values. However, in risk management,
where everything that can go wrong may go wrong at the same time, another type of dependence is required.
Three other graphs in figure 12.30 show examples of possible dependence between the two normally
distributed variables described by different copulas. In particular, dependence as described by the Gumbel
copula at bottom left allows a stronger correlation between high data values. Note that all bivariate
distributions in figure 12.30 are characterized by the same dependence parameter (Kendall's tau, which is
equal to 0.5 in this case).
Figure 12.30
Figure 12.31 shows the dependence between a variable with a gamma distribution with parameters shape = 1
and rate = 2 and a variable with a normal distribution with parameters mean = 4 and standard deviation = 3 in
three dimensions. Two different copulas describe different dependence for low and high values, although the
dependence between the two variables is characterized by the same Kendall's tau parameter; this time it is equal
to 0.3.
Figure 12.31
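A sketch of this kind of construction, assuming the R copula package is available: a gamma and a normal margin (the same marginal parameters as in figure 12.31) are coupled here by a normal copula and by a Gumbel copula (the figure may use different copulas), both calibrated to Kendall's tau = 0.3.

# Sketch: couple a gamma and a normal margin with two different copulas,
# both calibrated to Kendall's tau = 0.3 (requires the 'copula' package)
library(copula)

tau <- 0.3
cop.norm   <- normalCopula(iTau(normalCopula(), tau))
cop.gumbel <- gumbelCopula(iTau(gumbelCopula(), tau))

margins <- c("gamma", "norm")
params  <- list(list(shape = 1, rate = 2), list(mean = 4, sd = 3))

mv.norm   <- mvdc(cop.norm,   margins, params)
mv.gumbel <- mvdc(cop.gumbel, margins, params)

# simulate from both joint distributions and compare the dependence patterns
s1 <- rMvdc(2000, mv.norm)
s2 <- rMvdc(2000, mv.gumbel)
par(mfrow = c(1, 2))
plot(s1, main = "Normal copula", xlab = "gamma margin", ylab = "normal margin")
plot(s2, main = "Gumbel copula", xlab = "gamma margin", ylab = "normal margin")
cor(s1, method = "kendall")[1, 2]          # both samples should be close to 0.3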
Statistical literature (see reference 13 in “Further reading”) shows that for any univariate cumulative
distribution functions F(x) and G(y) and any copula c(F(x), G(y)), a two‐dimensional function
H(x,y)=c(F(x),G(y))
is a joint two‐dimensional distribution function with marginal distributions F(x) and G(y). Furthermore, if
F(x) and G(y) are continuous, then the function c(F(x),G(y)) is unique. In other words, the copula is a
multivariate distribution function of the distribution functions of the variables, not the variables themselves.
A copula is constructed from the known inverse distribution functions F⁻¹(x) and G⁻¹(y) as
c(u,v) = H(F⁻¹(u), G⁻¹(v))
This corresponds to the normal score transformation discussed in chapter 8, when a Gaussian random function
is predicted and then back‐transformed to obtain the required empirical marginal distribution.
In practice, researchers do not know the joint distribution, and they choose a particular copula that
adequately captures the dependence structure of the data while preserving the marginal distributions.
In a regression context, each marginal distribution can depend on covariates. Then the regression analysis
consists of selecting:
the appropriate copula c that defines the dependence between the univariate marginal distributions,
and
the marginal distributions F1(z1 | x1, β1) and F2(z2 | x2, β2), where x1 and x2 are covariates and β1 and β2 are regression parameters.
Next, the maximum likelihood technique is applied to the joint distribution
F(z1, z2 | x1, x2, β1, β2) = c(F1(z1 | x1, β1), F2(z2 | x2, β2))
to find the model parameters.
Reference 14 in “Further reading” presents an example of logistic spatial regression based on the Farlie‐
Gumbel‐Morgenstern (FGM) copula. In the bivariate case, the FGM copula has the following form:
c(u,v) = u⋅v⋅(1 + θ⋅(1−u)⋅(1−v))
The parameter θ allows for a correlation between the uniform marginal distributions U and V. When θ is
positive, the dependence is stronger if u and v are both close to one or both close to zero. When θ is negative,
the dependence is stronger if u is high and v is low or if v is high and u is low. Zero θ corresponds to data
independence. There is a limitation, however: the correlation between U and V is restricted to the range
between −1/3 and 1/3.
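A minimal base R sketch of simulation from the FGM copula using the conditional distribution (inverse CDF) method; the helper function rfgm is invented for this illustration. With θ = 1, the empirical correlation between U and V approaches the 1/3 limit mentioned above.

# Sketch: simulate from the FGM copula C(u,v) = u*v*(1 + theta*(1-u)*(1-v))
rfgm <- function(n, theta) {
  u <- runif(n)
  w <- runif(n)                       # w plays the role of the conditional CDF of v given u
  a <- theta * (1 - 2 * u)
  # solve a*v^2 - (1 + a)*v + w = 0 for the root lying in [0, 1]
  v <- ifelse(abs(a) < 1e-12, w,
              ((1 + a) - sqrt((1 + a)^2 - 4 * a * w)) / (2 * a))
  cbind(u = u, v = v)
}

set.seed(3)
uv <- rfgm(10000, theta = 1)          # the strongest positive dependence allowed
cor(uv[, "u"], uv[, "v"])             # close to 1/3, the upper limit for FGM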
The authors of the paper in reference 14 developed the following variant of the bivariate logistic regression
model. The response Z(si) and explanatory variables X(si) are linked by the latent variable framework: the
unobserved variable z*(si) is a linear function of the covariates plus an error, and Z(si) equals 1 when z*(si)
exceeds a threshold and 0 otherwise, where ε(si) is a logistically distributed error with a scale parameter σi
(this allows spatially varying data variance).
Then, after some algebraic manipulations with the FGM copula, the joint probability of the observed indicators
can be written in closed form.
In this probability function, the spatial correlation is defined by the θij terms. In particular, when all
the θij are equal to zero, the expression becomes a heteroscedastic logistic model without spatial
correlation. In practice, estimating all θij terms from the data is not possible, and one of the expressions from
the section “Distance between polygonal features” in chapter 11 can be used instead (note that each term
should be bounded between 0 and 1).
The model above can be compared with Bayesian modeling as discussed in appendix 3, which shows how to
estimate a full conditional distribution for the model parameters for several spatial regression models. An
important advantage of the copula‐based spatial regression is a closed‐form expression for the joint
distribution, so that there is no need for a large number of simulations, since the model parameters can be
estimated using well‐known maximum likelihood algorithms.
In the model above, the correlation between observations is limited to a moderate level. This may not work in
all situations, although the allowed range of correlation is usually sufficiently large because the dependence
between neighboring data decreases quickly as the distance increases between them. However, additional
research is required before implementing copula‐based spatial regression models in commercial software.
REGIONAL DATA AGGREGATION AND DISAGGREGATION
Diseases are often tabulated according to ZIP codes from hospital records, and crimes are counted by
precinct, but many of the socio‐demographic risk factors for crime and disease are tabulated using census
tracts. In these cases, aggregation and disaggregation of regional data are necessary. A problem arises in
many applications, and it has different names, including the modifiable areal unit problem, the scaling
problem, inference between incompatible zonal systems, pycnophylactic geographic interpolation, the
polygonal overlay problem, areal interpolation, inference with spatially misaligned data, contour
reaggregation, change of support problem, and multiresolution modeling.
Simple averaging for aggregation and proportional allocation for disaggregation are not sufficient because
they do not incorporate spatial data correlation and do not provide a measure of uncertainty associated with
the newly aggregated or disaggregated values (examples of data averaging using geostatistics can be found in
the beginning of chapter 2 and in chapter 10, figure 10.48). Disaggregation is more challenging because it can
be done in a number of ways since a value at large scale (statewide, for example) can be decomposed to
smaller scale (counties, for example), using many functions.
In geostatistical jargon, aggregation and disaggregation are called the change of support problem, or block
kriging. The term support means the geometrical size, shape, and spatial orientation associated with each data
value. Changing the support of a variable creates a new variable. This new variable is related to the original
one but has different statistical properties.
The predictions of regional data to new regions (they can be partially overlapping, and they can be
infinitesimally small, that is points; see figure 12.32 at left) should satisfy the following requirements:
The method should work both when the input data and predictions are averages and when they are counts. In
practice, this means that generalization of Gaussian, binomial, and Poisson kriging is required.
Shapes of the polygons should be taken into account.
The predicted surface should be smooth across polygon boundaries.
When data for downscaling are counts, the predictions should satisfy the volume‐preserving
(pycnophylactic) property: the predicted values inside each polygon A with an observed count should sum
to that observed count. In other words, the number of predicted counts should remain the same in the entire
region and in each polygon with observed counts.
Standard errors of the predictions should be accurately computed.
Covariates should be used to improve predictions.
Figure 12.32
The key part of areal kriging interpolation is reconstruction of the semivariogram model for points from the
empirical semivariogram values calculated from the averaged or aggregated data (see illustration in figure
11.4 in chapter 11). The geostatistical disaggregation model assumes that both the observed polygonal and the
unobserved point data values come from the same spatial random process, so that the areal value in polygon
A, Z(A), is a weighted average of the unobserved point values z(si) in the polygon, Z(A) = Σi wi⋅z(si), where
N is the number of unobserved points in the polygon. The covariance between region A with observed averaged or
aggregated data and region B where prediction is required is a double‐weighted average of the point
covariance values for points si and sj in regions A and B with N and M points, respectively:
C(A, B) = Σi Σj wi⋅vj⋅C(si, sj),
where wi and vj are weights; or, more precisely, wi = 1/N and vj = 1/M, so that
C(A, B) = (1/(N⋅M))⋅Σi Σj C(si, sj).
The task is to use one of the available algorithms to estimate parametric (that is, defined by a particular
formula; see chapter 8) semivariogram model (the blue line in figure 12.33 at left) for points on the
overlapping grid (blue triangles in figure 12.33 at right) from which the observed empirical semivariogram
values for regions (the pink crosses in figure 12.33 at left) can be reconstructed after data averaging.
Since spatial correlation in geostatistics is based on point data, calculation of empirical semivariogram values
for polygonal data requires specification of a single point that represents the region. Choosing a good
representative point is not an easy task. However, point locations are used for calculating distances between
input data, and average distance between polygons can be calculated using a sufficient number of randomly
distributed points in the polygons, as illustrated in figure 12.33 at right, where lines show connections
between all randomly located points in the yellow polygon, with one particular random point in the green
polygon.
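A base R sketch of this distance calculation; for simplicity the two regions are rectangles (a point‐in‐polygon test would be needed to place random points inside arbitrary polygon shapes).

# Sketch: approximate the average distance between two regions by averaging
# pairwise distances between random points placed inside them
set.seed(4)
rand.in.rect <- function(n, xmin, xmax, ymin, ymax)
  cbind(runif(n, xmin, xmax), runif(n, ymin, ymax))

pA <- rand.in.rect(500, 0, 2, 0, 1)      # random points in region A
pB <- rand.in.rect(500, 3, 4, 1, 3)      # random points in region B

dAB <- sqrt(outer(pA[, 1], pB[, 1], "-")^2 + outer(pA[, 2], pB[, 2], "-")^2)
mean(dAB)                                # average region-to-region distance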
Figure 12.33
Using the fitted semivariogram model for points, predictions and prediction standard errors (conditioned on
the observed areal data) can be produced both for specified new polygons and for points.
Note that reconstructing a semivariogram model encounters the problem that existing algorithms cannot
estimate the nugget effect parameter and instead assume that it is equal to zero, although most real data are
not error‐free, so that the nugget parameter is typically nonzero. However, binomial and Poisson kriging
take data uncertainty into account.
The reestimated covariance values can be calculated using the last formula above (for the observed regions Ai
and Aj). Generally, the re‐estimated covariance values are fluctuating around the observed ones as shown in
figure 12.34 at left for two empirical semivariogram points. For illustration purposes, the difference between
the observed and reestimated semivariogram values can be shifted to show where the true empirical point
semivariogram values can be, as shown in figure 12.34 at left with blue crosses.
Figure 12.34 at right shows a scatterplot of the predicted versus observed empirical covariance values.
Ideally, these points should lie on 1:1 line (gray). Different point covariance models will produce different
empirical semivariogram re‐estimations, so that the semivariogram model modified by the researcher
(“Expert’s” gray line in figure 12.34 at left) may produce a cloud of points that are closer to the 1:1 line in
figure 12.34 at right.
Figure 12.34
Details on the Gaussian geostatistical areal interpolation model can be found in reference 3 of “Further
reading.” Areal interpolation for Gaussian, binomial, and Poisson data will be available in the Geostatistical
Analyst version after version 9.3.
SPATIAL REGRESSION MODELS DIAGNOSTICS AND SELECTION
Regression diagnostics indicate whether the fitted model accurately describes the data. Nonspatial linear
regression model diagnostics are discussed in appendixes 2 and 4. There are several assumptions about
standard linear regression models that should be verified:
Data are produced by random sampling or the response variable is generated by the linear model.
The response variable is correlated, but the residuals are not.
Errors are random, uncorrelated with the explanatory variables, and normally distributed (often
with constant variance).
Regression coefficients are constants.
Explanatory variables are measured exactly.
Explanatory variables completely determine the response variable.
When these assumptions do not hold, the model output can be incorrect, and decision making based on it can
be wrong.
The range of diagnostic tools is large, and they are very focused, each assuming that there is a single problem
with the model and everything else is almost right. Therefore, the strategy is to use various tests trying to
solve the most serious potential problem first and then go to the next problem. Understanding ideas behind
standard linear regression diagnostics is important because a large number of these diagnostics can be used
with spatial regression models.
The first diagnostic test is usually outlier identification because unusually large values can either ruin the
regression analysis (if they are wrong) or be the most interesting data in the collection (if they are correct).
By outlier we mean here an unusual value of the response variable given the predictors. Detection of
unusual values has a long history, and the applied regression literature proposes a large number of useful
diagnostics. One of the most common tools for outlier detection is the studentized residual (the residual
adjusted by dividing it by an estimate of its standard deviation). One way to illustrate how studentized
residuals work is via the regression model with an additional indicator predictor variable Di, which is
equal to 1 for observation i and 0 for all others:
z = Xβ + γ⋅Di + ε.
If the regression coefficient γ is not equal to zero, then observation i is considered an outlier. The
hypothesis γ = 0 is usually tested using the studentized residuals.
Another common outlier diagnostic is called “dfbeta,” or the influence measure. It calculates how much the
regression coefficients change when observation i is deleted (more often, the standardized
difference, which divides dfbeta by the standard error of the coefficients, is used).
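Both diagnostics are available for linear models in base R as rstudent and dfbetas. The following sketch plants a single outlier in simulated data and shows how it is flagged.

# Sketch: outlier and influence diagnostics for a linear model in base R
set.seed(5)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
y[10] <- y[10] + 8                     # plant an outlier in the response

fit <- lm(y ~ x)
rs  <- rstudent(fit)                   # studentized residuals
db  <- dfbetas(fit)                    # standardized change in the coefficients
                                       # when each observation is deleted
which.max(abs(rs))                     # observation 10 should stand out
db[10, ]                               # its influence on the intercept and slope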
The goals of the regression diagnostics for the spatial models are the same as in nonspatial linear regression:
to detect possible problems and confirm that the chosen model is correct (or justify why the model is wrong).
Researchers are typically concerned with correctly specifying the random part of the linear model, but in
practice the deterministic part (mean value) can be more important. If a distribution of the random errors is
incorrectly specified, the uncertainty of regression coefficients and predictions will not be estimated
correctly, and the confidence intervals will be inaccurate. But if the true data model is not linear or the
regression coefficients are estimated inaccurately due to, for example, data collinearity, all modeling results
can be wrong.
The residual spatial correlation can be tested using a semivariogram or using indices of spatial data
association. If the residuals are correlated, they contain information about response variable variation and,
therefore, the explanation of the dependence between response and explanatory variables may be incorrect
since the explanatory variables do not explain the response variable completely. Another reason for the
residuals correlation is that weights of the spatial nearest neighbors are defined poorly. This can be tested
using the cross‐validation diagnostic as shown in chapter 11.
The distribution of errors dictates the choice of method for fitting the regression model. For example, least
squares algorithms perform well when the errors are normally distributed. Decisions about the error
distribution can be made based on the distribution of the linear regression residuals, although the two
distributions are usually not identical since the residuals are weighted averages of the data, and they tend
toward a Gaussian distribution.
Data exploration tools are used to reveal the distribution of errors. If the normality hypothesis is rejected, one
of the normalizing transformations discussed in chapter 8 can make data (and, consequently, random errors)
close to a normal distribution. If transformation to data normality is not possible, another distribution for the
random error should be postulated and, consequently, another model (for example, a generalized linear
model) should be used instead of the linear model.
The assumption of constant data variance is almost always violated when spatial models are used with
regional data, primarily because the population in the administrative regions is different. The solutions to the
problem with nonconstant variance are the same as in the case of violation of the normality assumption: data
transformation or an assumption about a different distribution for the random errors. Nonconstant variance can
also be corrected by weighting the random error in inverse proportion to the population in the region (see chapters
11 and 16).
Linear regression models assume that the expectation of the error term is equal to 0 everywhere. Violation of
this assumption can be caused by nonlinear relationships between the response and predictor variables. One
possible solution is modeling the relationship between response and predictor variables as low‐order
polynomials. Generally, if there is evidence that nonlinearity is a feature of the data, the nonlinear regression
model should be used instead of the linear one.
If two or more explanatory variables are strongly correlated with one another, it is difficult to estimate their
distinct effects on the response variable. Linear dependence between explanatory variables (collinearity)
influences the estimated standard errors: the greater the linear dependence, the larger the estimated
regression coefficient standard error. However, if the standard error is large, small changes in the
explanatory variable (for example, due to measurement error) can significantly change the estimated
regression coefficients. Data collinearity is a property of the explanatory variables. Therefore, to check the
collinearity property of the generalized linear model, it is sufficient to estimate the equivalent linear model
with collinearity diagnostic options. When data collinearity is discovered, some action should be taken, such
as dropping one or more variables or combining variables into a common factor.
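Because collinearity is a property of the explanatory variables only, it can be checked directly, for example with variance inflation factors. The base R sketch below computes them from the definition; the helper function vif is written for illustration, not taken from a package.

# Sketch: variance inflation factors computed from the definition
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing the jth
# explanatory variable on all the others
vif <- function(X) {
  sapply(seq_len(ncol(X)), function(j) {
    r2 <- summary(lm(X[, j] ~ X[, -j]))$r.squared
    1 / (1 - r2)
  })
}

set.seed(6)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)        # nearly collinear with x1
x3 <- rnorm(100)
vif(cbind(x1 = x1, x2 = x2, x3 = x3))  # large values for x1 and x2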
In the case of data nonstationarity, the regression model can be improved, allowing the regression
coefficients to change in space. In this case, geographically weighted regression can be used for data
exploration while the multilevel and random coefficients regression models are candidates for data modeling.
Diagnostics for the geographically weighted regression in the commonly used software packages are usually
the same as in the classical linear regression model. However, the geographically weighted regression is not a
model, but a set (usually a large set) of linear models with essentially different weights, and there is no theory
that justifies the use of various linear regression diagnostics such as Cook’s distance.
There are examples in the statistical literature in which the confidence interval for statistical model predictions and
parameters is based on a bootstrap approximation. Bootstrapping refers to a tale about Baron Münchausen,
who was able to lift himself out of a swamp by pulling himself up by his own hair. Bootstrapping is a general
approach to statistical inference based on building a sampling distribution for a statistic using the sample
data from which repeated samples are drawn. The bootstrap approach differs from classical simulation
because bootstrapping resamples the observed data rather than a hypothetical population. The key idea of
bootstrap resampling is that statistics of the sample are analogous to statistics of the original distribution. In
the case of a small sample, bootstrapping may be more reliable than traditional methods such as confidence
interval based on the data mean and standard deviation. Bootstrapping is a method for assessing a statistic; it
does not provide a new method for estimation.
The bootstrap estimation of the confidence interval of the model’s parameter may consist of the following
steps:
1. Estimate the model's parameter using the observed data.
2. Sample the data with replacement n times, and each time estimate the model's parameter θi using the
sample.
3. Use the estimated parameter from step 1 and the bootstrap distribution from step 2 to create the
confidence interval for the parameter as follows. Assume that the distribution of the estimate is normal with
mean mean(θ). Then a 95% confidence interval is the estimate from step 1 plus or minus 1.96⋅σθ, where
σθ = sqrt( Σi (θi − mean(θ))² / (n − 1) ) and mean(θ) = (1/n)⋅Σi θi.
Alternatively, a bootstrap confidence interval can be constructed from the bootstrap samples. We illustrate
this approach using 40 temperature and precipitation measurements collected in August 2004 in Catalonia (see the data
description in chapter 9). Scatterplots of the temperature versus precipitation and vice versa are shown in
figure 12.35 at left. The Pearson correlation coefficient can be calculated using the formula
r = Σi (tempi − µtemp)⋅(preci − µprec) / ((n − 1)⋅σtemp⋅σprec),
where tempi and preci are the temperature and precipitation at monitoring station i, and µ and σ denote the
estimated means and standard deviations of the two variables. The correlation coefficient is equal to 0.66357.
The confidence interval for the correlation coefficient can be calculated, for example, in SAS using Fisher's z
transformation, as shown in figure 12.35 at left. This confidence interval can be compared with a bootstrap
confidence interval, which we calculated from 1,000 resamplings with replacement of 30 measurements from
the 40 meteorological stations. A histogram of the 1,000 correlation coefficients is shown in figure 12.35 at right. It is
not symmetrical, and the formula for the 95 percent confidence interval
mean ± 1.96⋅(standard deviation) = 0.66357 ± 1.96⋅0.10819
produces an inaccurate result [0.45152, 0.87562], while the 0.025 and 0.975 quantiles of the bootstrap
samples distribution give the confidence interval [0.44394, 0.8151], which is close to the one estimated
using the SAS procedure corr. Note that temperature and precipitation values are sampled simultaneously;
that is, 30 monitoring stations are chosen randomly at each iteration. If temperature and precipitation are
sampled separately, the resulting correlation would be near zero because temperature and precipitation
would be nearly independent by construction.
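A base R sketch of the percentile bootstrap interval described above; synthetic values stand in for the Catalonia temperature and precipitation sample, so the numbers will differ from those in figure 12.35.

# Sketch: percentile bootstrap confidence interval for a correlation coefficient
set.seed(7)
n    <- 40
temp <- rnorm(n, 25, 3)
prec <- 2 + 0.5 * temp + rnorm(n, sd = 2)    # roughly 0.6 correlation by construction

boot.r <- replicate(1000, {
  idx <- sample(n, 30, replace = TRUE)       # stations are resampled in pairs,
  cor(temp[idx], prec[idx])                  # so the dependence is preserved
})
cor(temp, prec)                              # observed correlation
quantile(boot.r, c(0.025, 0.975))            # percentile bootstrap interval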
Figure 12.35
Statistical literature suggests that 200 bootstraps are usually sufficient for a good estimation of the standard
deviation of the model's parameter, and about 2,000 bootstraps are necessary for estimating quantiles of
the parameter’s distribution. The required number can be smaller or larger depending on how far into the
tails of the distribution we want to look. The number of bootstraps should be large enough so that the result
is not substantially changed when the bootstrap sampling is repeated with a larger number of repetitions.
There are two general approaches for bootstrapping a linear regression model z = Xβ + ε:
bootstrapping the observations and bootstrapping the residuals. If observations are bootstrapped, the entire
vector of response and covariate values (zi, xi) is resampled with replacement. In this case the
distribution of covariate values is not fixed in the bootstrap samples, meaning that their mean and variance are
changing. Bootstrapping the residuals is a three‐step process. First, residuals ei = zi − ẑi are
calculated for each observation. Then a sample is drawn with replacement from the observed residuals
ei. Finally, the bootstrap sample of observations is constructed by adding a randomly sampled residual to the
predicted value for each observation: zi* = ẑi + e*i. An example of a bootstrapping algorithm for the
generalized additive model confidence intervals estimation can be found in reference 5 in “Further reading.”
The bootstrapping algorithms discussed above assume that data are independent and the data variance is
constant. This assumption is not valid in the case of spatially correlated data because the bootstrapping
algorithms destroy the correlation pattern. One solution to the problem is resampling in such a way that the
spatial dependence structure is preserved: instead of single observations, the dependent neighboring
observations are selected. Usually, observations located in nonoverlapping polygons are drawn with
replacement, although bootstrapping based on a moving window searching neighborhood (overlapping
resampling) is also used. Another solution is to write the model in terms of independent and identically
distributed components that can be estimated and then resampled (see reference 6 in “Further reading”).
One of the main uses of regression diagnostics is optimal model selection. In the case of relatively complex
regression models, including spatial, the strategy is choosing several alternative models, checking
assumptions about models, looking for data outliers, and comparing how the models fit using sensitivity
analysis, cross‐validation and validation techniques, and residual plots. If models give substantially different
results and each model makes sense, the best we can do is report different conclusions. If the model outputs
are similar, the choice is based on penalties for the lack of fit and for model complexity (a more complex
model usually fits the observed data better but may do a bad job predicting new data by mistakenly
substituting noise for signal), where complexity reflects the number of parameters in the model. The Akaike
Information Criterion (AIC) is used most often for comparing the prediction accuracy of different models (see
reference 7 for the spatial variant of AIC). It tells how good the model is for predicting new data. Note that AIC
produces results similar to cross‐validation diagnostics for large data samples.
Models that have stronger assumptions also have a larger choice of diagnostics and model selection methods,
compared to models with weaker assumptions. For example, the SAS/STAT procedure glmselect provides five
selection methods for obtaining a parsimonious nonspatial linear model that does not overfit the data.
However, there is no procedure for selecting optimal parameters of the generalized linear model in SAS 9.1.3.
A similar situation exists in Bayesian regression analysis: although a general theory for the optimal model
selection exists (reversible jump Markov chain Monte Carlo), the implemented function for regression model
selection in WinBUGS software (function jump.lin.pred) works for linear regression only.
ASSIGNMENTS
1) INVESTIGATE THE EFFECT OF SUN EXPOSURE ON LIP CANCER DEATHS.
Figure 12.36 shows the lip cancer standardized morbidity ratio at left and the percentage of the workforce
engaged in agriculture, fishing, or forestry at right in the 56 districts of Scotland from 1975 to 1980.
Figure 12.36
From Breslow, N. E., and D. G. Clayton. 1993. “Approximate Inference in Generalized Linear Mixed Models.” Journal of the American Statistical Association 88:9–25.
This is a classic dataset for regional data analysis, and it was analyzed by many statisticians. Reading a case
study in reference 4 from the “Further reading” section, pages 392–398, you will learn that:
Decomposition between covariate effects and random variation is not unique: the variation that one
model attributes to the covariates may be attributed to random variation in another model, and both
models could be valid.
If the signal in the data is strong, all meaningful models should give similar conclusions, even if the
estimated parameters and standard errors are different.
It is a good idea to fit several statistical models in order to understand their similarities and their
differences.
Data for the case study are in the folder Assignment12.1. Use these data with any spatial regression model that
you can run (for example, with models discussed in appendixes 3 and 4) and compare the result of your
analysis with the published ones.
2) PRACTICE WITH THE ARCGIS 9.3 GEOGRAPHICALLY WEIGHTED REGRESSION
GEOPROCESSING TOOL.
Read ArcGIS 9.3 Geographically Weighted Regression geoprocessing tool documentation and use the tool for
exploration of the infant mortality data from assignment 3 of chapter 15.
Additional assignments on regional data modeling can be found at the end of appendixes 3 and 4.
FURTHER READING
1. Belsley, D. 1991. Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, 396 pp.
This book discusses tools for assessing quality and reliability of linear regression estimates.
2. Nakaya T., A. S. Fotheringham, C. Brunsdon, M. Charlton. 2005. “Geographically Weighted Poisson
Regression for Disease Association Mapping.” Statistics in Medicine 24:2695–2717.
This paper discusses the generalization of geographically weighted regression to Poisson regression with
conditionally independent observations.
3a. Gotway, C. A., and L. J. Young. 2004. “A Geostatistical Approach to Linking Geographically‐Aggregated Data
from Different Sources.” Technical report # 2004–012. Department of Statistics, University of Florida.
3b. Young, L. J., C. A. Gotway. 2007. “Linking Spatial Data from Different Sources: The Effects of Change of
Support.” Stochastic Environmental Research and Risk Assessment 21(5):589–600.
3c. Gotway, C. A., L. J. Young. 2007. “A Geostatistical Approach to Linking Geographically Aggregated Data
from Different Sources.” Journal of Computational and Graphical Statistics 16(1):115–135.
These papers present a general geostatistical framework for linking geographic data from different sources.
3d. Gelfand, A. E., L. Zhu, and B. P. Carlin. 2001. “On the Change of Support Problem for Spatiotemporal Data.”
Biostatistics 2(1):31–45.
This paper proposes a Bayesian approach for geostatistical prediction from points to points, points to
polygons, polygons to points, and polygons to polygons.
This paper shows how statisticians view geographically weighted regression. The proposed alternative model
is complex and Bayesian, but the end result is the same: interpretation and mapping of the spatially varying
regression coefficients.
5. Wood, S. N. 2006. “On Confidence Intervals for Generalized Additive Models Based on Penalized Regression
Splines.” Australian and New Zealand Journal of Statistics. 48(4): 445–464.
This paper discusses the construction of confidence intervals for the generalized additive model using
bootstrapping.
6. Cressie, N. A. C. 1993. Statistics for Spatial Data. Revised Edition. New York: John Wiley & Sons.
This book is the most cited reference on spatial statistics by statisticians, but it is very difficult to read for
nonstatisticians. However, the introductions in chapters 6 and 7 on regional data analysis are not that
mathematically difficult. Section 7.3.2 discusses bootstrapping algorithms.
7. Hoeting J. A., R. A. Davis, A. A. Merton, and S. E. Thompson. 2006. “Model Selection for Geostatistical
Models.” Ecological Applications, 16(1):87–98.
The authors consider the problem of spatial mixed model (universal kriging with external trend) selection.
First, they discuss the application of the Akaike Information Criterion to a geostatistical model with a different
number of explanatory variables. Then the principle of minimum description length is applied to the covariate
selection. The principle of minimum description length attempts to achieve maximum data
compression by the fitted model.
8. Waller, L. A., and C. A. Gotway. 2004. Applied Spatial Statistics for Public Health Data. New York: John Wiley
& Sons, 494.
Although this well‐written book is for advanced undergraduate or graduate courses in statistics, a reader
with minimal statistical background can learn about assumptions, differences, similarities, and implications
of statistical models. Chapter 7 on regional data analysis is recommended for all readers. The Web site
http://www.sph.emory.edu/~lwaller/WGindex.htm includes many of the data sets featured in
“Data Breaks” and “Case Studies” sections of the book and also some computer codes.
9. Schabenberger, O., and C. A. Gotway. 2004. Statistical Methods for Spatial Data Analysis. New York:
Chapman & Hall/CRC, 488.
Chapter six of this book is recommended for readers who want an in‐depth discussion of the spatial
regression models.
10. Ruppert, D., M. P. Wand, and R. J. Carroll. 2003. Semiparametric Regression. Cambridge University Press.
386 pp.
This book presents an extension of linear modeling by combining parametric terms with the regression
splines approach to smoothing. The discussion on the semiparametric regression in this chapter is based on
the semiparametric regression book and R software package SemiPar developed by Matt Wand.
The authors proposed a model for the joint spatial analysis of several related diseases.
12. Royle, J. A., and L. M. Berliner. 1999. “A Hierarchical Approach to Multivariate Spatial Modeling and
Prediction.” Journal of Agricultural, Biological, and Environmental Statistics 4:29–56.
The authors compare multivariate spatial modeling and prediction using cokriging and hierarchical modeling.
13. Nelsen, R. B. 2006. An Introduction to Copulas. Second Edition. New York: Springer‐Verlag.
This is the most cited reference to copulas.
14. Bhat, C. R., and I. N. Sener, "A Copula‐Based Closed‐Form Binary Logit Choice Model for Accommodating
Spatial Correlation Across Observational Units," Technical paper, Department of Civil, Architectural &
Environmental Engineering, The University of Texas at Austin, June 2008.
The authors proposed a copula‐based approach for spatial logistic regression. This is probably the first paper
on the subject.
15. Research and applied papers on copulas can be found at the home pages of the following organizations:
Financial Econometric Research Center, UK
Actuarial Science, Katholieke Universiteit Leuven, Belgium
Center for Research in Economics and Statistics, France
International Center for Financial Asset Management and Engineering, Switzerland
16. Tikhonov, A. N. 1943. “On the Stability of Inverse Problems.” Doklady Akademii Nauk SSSR 39(5):195–198.
Tikhonov regularization is a common method for solving ill‐posed problems, named after the Soviet mathematician
Andrey Tikhonov. In statistics, the method is usually called ridge regression.
PRINCIPLES OF MODELING
DISCRETE POINTS
EXAMPLES OF POINT PATTERNS
SPATIAL POINT PROCESSES: COMPLETE SPATIAL RANDOMNESS AND POISSON
PROCESSES; SPATIAL CLUSTERS; INHIBITION PROCESSES; COX PROCESSES
POINT PATTERN ANALYSIS AND GEOSTATISTICS
RIPLEY’S K FUNCTION
CROSS K FUNCTION
PAIR CORRELATION FUNCTIONS
TEST OF ASSOCIATION BETWEEN TWO TYPES OF POINT EVENTS
MODEL FITTING
INHOMOGENEOUS K FUNCTIONS
K FUNCTIONS ON A NETWORK
CLUSTER ANALYSIS
MARKED POINT PATTERNS
HIERARCHICAL MODELING OF SPATIAL POINT DATA
RESIDUAL ANALYSIS FOR SPATIAL POINT PROCESSES
LOCAL INDICATORS OF SPATIAL ASSOCIATION
ASSIGNMENTS
1) MODELING THE DISTRIBUTION OF EARLY MEDIEVAL GRAVESITES
2) GETTING TO KNOW GIBBS PROCESSES
FURTHER READING
The data are called a marked point pattern if a value is associated with each data location.
Geostatistical (continuous) and marked point pattern data differ in that marked point patterns are observed
at a finite number of random locations, whereas geostatistical data can be potentially measured anywhere. An
example of a marked point pattern is tree locations, with the tree type as a categorical mark variable and the
tree diameter as a continuous mark variable. The goal of marked point pattern analysis is explaining the
correlation between marks and between marks and locations in addition to explaining the correlation
between locations.
Formulas used in the point pattern analysis are often more complex than formulas used in geostatistics and
regional data analysis. This results in a gap between practitioners who are usually using simplistic
exploration tools and statisticians who are writing papers that a limited number of GIS users can understand.
The number of users of modern point pattern analysis software is much smaller than the number of users of
geostatistical software, although the range of applications is comparable.
Discussion of the point data analysis in this chapter begins with examples of point patterns and general ideas
about the distribution of points. The summary statistics of the areas and perimeters of the Voronoi polygons
with centers in point locations can be used in qualitative analysis of the point distribution as shown for four
basic spatial point processes—random, cluster, inhibition, and Cox. A point process is a probabilistic model of a
point pattern. Although the word “process” suggests some development in time, usually distributions of
points on the plane are studied.
Introduction to the modeling of point patterns begins with a comparison between geostatistical and point
pattern analyses. Then Ripley’s K function and pair correlation function are introduced and illustrated using
simulated and real data.
Next, two examples of detecting clusters in point patterns are presented.
Then, fitting a model to observed points is discussed, and the points’ density and intensity concepts are
presented and illustrated using Euclidean and network distance metrics.
Next, inhomogeneous K function and K function on a network are introduced and illustrated using real data.
Then marked point patterns are discussed, and an example of modeling using shrub data is presented.
Next, regression modeling and model diagnostics are introduced and illustrated using explanatory variables
collected in the area where shrubs grow.
Finally, exploration and modeling of local features of the point pattern is introduced and illustrated using
simulated and real data.
Figure 13.1 shows locations of Martian craters in an arbitrarily selected part of the planet (green) and three
simulated point patterns in the same area: random (violet), regular (yellow), and clustered (blue). Statistical
analysis of the distribution of crater locations can detect whether craters result from volcanic activity (a
nonrandom pattern) or from impacts from meteors (random).
Courtesy United States Geological Survey. Ftp://ftpflag.wr.usgs.gov/dist/pigpen/mars/crater_consotrium/.
Figure 13.1
One of the goals of point pattern analysis is recognizing the type of point pattern. As a rule, the observed
pattern is compared with a random pattern first. If there is no significant difference between random and
observed patterns, the analysis is completed. If there is a difference, the next step is to find the closest pattern
among those that can be generated by models with known statistical features.
The following example shows the difference between random and nonrandom patterns from the application
point of view. Figure 13.2 shows locations of trees in the region with area A. The number of trees N can be
estimated based on measuring the distances hi from the M randomly selected locations (green sticks) to the
nearest tree using the following formula:
N ≈ A⋅M / (π⋅Σi hi²).
This formula works well in the case of randomly distributed trees, but it becomes less accurate when trees are
more clustered. The result is similar to the result of the data averaging discussed in chapter 2, in which
averaging dependent data behaves differently from averaging independent data.
Event locations are distributed nonrandomly if they are generated by a mechanism that prefers some areas
over others. In that case, the locations are somehow dependent.
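Assuming the nearest‐neighbor estimator given above, the following base R sketch compares its behavior for a roughly random and for a strongly clustered tree pattern (both simulated).

# Sketch: estimate the number of trees from nearest-tree distances measured at
# M random sample locations, using N.hat = A * M / (pi * sum(h^2))
set.seed(8)
estimate.N <- function(trees, M, A = 1) {
  samp <- cbind(runif(M), runif(M))                # random sample locations
  h <- apply(samp, 1, function(p)
    min(sqrt((trees[, 1] - p[1])^2 + (trees[, 2] - p[2])^2)))
  A * M / (pi * sum(h^2))
}

random.trees <- cbind(runif(200), runif(200))      # close to a random pattern
parents <- cbind(runif(20), runif(20))             # clustered pattern
clustered.trees <- parents[rep(1:20, each = 10), ] +
  matrix(rnorm(400, sd = 0.02), ncol = 2)

estimate.N(random.trees, M = 100)      # close to the true 200
estimate.N(clustered.trees, M = 100)   # typically far below the true 200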
Figure 13.2
Point patterns can be compared using Voronoi polygons. Figure 13.3 at left shows Voronoi polygons
constructed around two points. The line that divides the region in two is equidistant to those points. Each
point within a Voronoi polygon is closer to the center point than to the center point of any other polygon.
In figure 13.3 at right, which displays three points, any location in the yellow Voronoi polygon is closer to the
blue point than any location in the pink or green Voronoi polygons. The intersection of the three lines, which
are perpendicular to the lines that connect blue and red, blue and green, and red and green points, is
equidistant to all three points.
Figure 13.3
Courtesy United States Geological Survey. Ftp://ftpflag.wr.usgs.gov/dist/pigpen/mars/crater_consotrium/.
Figure 13.4
In figure 13.5 at left, the histograms of the areas and perimeters of the randomly simulated points look
similar to those of the Martian craters.
Figure 13.5
Courtesy United States Geological Survey. Ftp://ftpflag.wr.usgs.gov/dist/pigpen/mars/crater_consotrium/.
Figure 13.6
However, the histograms of areas and perimeters generated by the cluster process in figure 13.7 are
markedly different.
Courtesy of City of Redlands, Police Department.
Figure 13.7
Courtesy United States Geological Survey. Ftp://ftpflag.wr.usgs.gov/dist/pigpen/mars/crater_consotrium/.
Figure 13.8
The expected values of Voronoi polygons constructed using a random point process (called the Poisson
process, see below) with an intensity of λ are shown in table 13.1:
Number of vertices: 6
Area: 1/λ
Perimeter: 4⋅λ^(−1/2)
Length of a polygon edge: 0.667⋅λ^(−1/2)
Table 13.1
These values can be used when testing point patterns for spatial randomness.
Point patterns can also be described using variability indices similar to indices of spatial association
discussed in chapter 11. Usually, these indices are based on the location pairs such as
event <–> neighboring event
event of type A <–> neighboring event of type B
For example, the aggregation index (the Clark–Evans index) is defined as the ratio of the observed mean
distance between an event and its nearest neighboring event to the mean nearest‐neighbor distance expected
under complete spatial randomness, 0.5⋅λ^(−1/2).
A value of the aggregation index greater than 1 indicates that the pattern has a tendency toward regularity,
whereas values less than 1 indicate clustering.
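A base R sketch of this index, computed without edge correction (which matters for small patterns); the helper function is written for illustration.

# Sketch: aggregation index as the ratio of the observed mean nearest-neighbor
# distance to the value 0.5/sqrt(lambda) expected under complete spatial randomness
aggregation.index <- function(x, y, area = 1) {
  d <- as.matrix(dist(cbind(x, y)))
  diag(d) <- Inf
  nn <- apply(d, 1, min)                   # nearest-neighbor distances
  lambda <- length(x) / area               # estimated intensity
  mean(nn) / (0.5 / sqrt(lambda))
}

set.seed(9)
aggregation.index(runif(200), runif(200))          # random pattern: close to 1
cl <- cbind(rep(runif(10), each = 20), rep(runif(10), each = 20)) +
      matrix(rnorm(400, sd = 0.01), ncol = 2)
aggregation.index(cl[, 1], cl[, 2])                # clustered pattern: well below 1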
It is not sufficient to make one comparison between observed and simulated point patterns. It is necessary to
simulate many patterns with the same statistical features so that they can be compared with the statistical
characteristics of the observed data. The simulation approach is discussed in the section “Ripley’s K function”
below.
SPATIAL POINT PROCESSES
Spatial point processes produce spatial locations in a known area, usually called the data domain, using
random value generators. It is assumed that all points in the data domain are counted; that is, there is no
omission.
The realization of the point process can also be thought of as a fine grid overlapping the data domain, with
values of 1 in the grid cells that contain event locations and values of 0 otherwise. However, it is assumed that there
is a chance of finding a point in any cell.
Point patterns without points outside the data domain are called finite point patterns; for example, the
locations of holes in a slice of cheese. These points have a natural border, which influences the distribution of
points. If additional points can be observed outside the sampling window, it is usually assumed that the
observed points are a part of the infinite point pattern; for example, locations of craters on Mars. In this case,
the influence of the sampling window should be eliminated using edge correction techniques because, for
points near the border of the data domain, there could be closer neighbors outside the border than inside it.
Interestingly, the theory of infinite point patterns is mathematically simpler than the theory of finite point patterns,
and methods for the former predominate in applications.
There are three main spatial processes: Poisson, cluster, and inhibition, which generate random, aggregated,
and regular point patterns.
The spatial process is characterized by the intensity of points λ, defined as the average number of events per
unit area. Multiplied by the area of interest A, it yields
the expected number of events in area A = λA
The formula above is correct if the spatial process is stationary; that is, the intensity of the point distribution
is the same everywhere. Stationarity assumption is difficult to verify using the point pattern alone because,
for example, the pattern can be part of a large cluster. However, stationarity assumptions can be justified
using additional information. For example, if environmental conditions and soil properties are nearly the
same in the area under investigation, there is no reason for nonstationarity in the point pattern consisting of
plant locations.
Figure 13.9 shows an area of 10 by 10, or 100 square units, which contains 20 points. If the points are randomly
distributed in the area, then we would expect approximately one percent of the observations in each grid cell.
Figure 13.9
A point pattern is completely random if the following conditions hold simultaneously:
The number of events in region A follows a Poisson distribution with a mean of λ⋅area(A), where λ is
the mean number of points per unit area (the intensity of the process).
The xi and yi coordinates of each point i are independent random samples from two uniform
distributions.
The algorithm for simulating a homogeneous Poisson process (it is usually called a Poisson process) on
rectangle (0,0) – (a,b) with intensity λ⋅area(A)= λ⋅ab is the following:
Simulate a random number n from Poisson distribution with intensity λ⋅ab.
Simulate the x coordinates xi (i = 1 to n) of the two‐dimensional Poisson process in the rectangle from
a uniform distribution Uniform(0, a).
Simulate the y coordinates yi (i = 1 to n) of the two‐dimensional Poisson process in the rectangle from
a uniform distribution Uniform(0, b).
To simulate a Poisson process in the irregular polygon, simulate points in the bounding rectangle with sides a
and b then remove the points simulated outside the polygon.
The process requiring N random points to be generated in the area is even simpler than the Poisson process
because it does not require the first step in the algorithm above.
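A base R sketch of the three‐step algorithm; the fixed‐N (binomial) variant simply skips the first step.

# Sketch: simulate a homogeneous Poisson process on the rectangle (0,a) x (0,b)
rpoisproc <- function(lambda, a, b) {
  n <- rpois(1, lambda * a * b)     # step 1: random number of points
  x <- runif(n, 0, a)               # step 2: x coordinates
  y <- runif(n, 0, b)               # step 3: y coordinates
  cbind(x = x, y = y)
}

set.seed(10)
pts <- rpoisproc(lambda = 70, a = 1, b = 1)
nrow(pts)                           # close to 70, but different in each run
plot(pts, asp = 1)
# For an irregular polygon, simulate in the bounding rectangle and drop the
# points that fall outside the polygon; for a fixed number N of points, skip
# step 1 and draw exactly N uniform coordinates.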
Note that the simulations described above are unconditional because they do not take into account existing
point patterns. There are applications in which new points need to be simulated in addition to the observed
points. For example, after fitting a point process model using data collected in a typical area in the forest,
researchers may want to simulate a forest in a larger area (see the case study in chapter 6). If the data domain
for simulation is only several times larger than the observed point pattern, it makes sense to preserve the
observed points and simulate only the additional points conditionally on them.
A superposition of several Poisson processes is also a Poisson process because the sum of independent
Poisson variables is itself a Poisson variable.
The homogeneous Poisson process is a reference point in point patterns analysis. If the number of locations is
larger than expected under the homogeneous Poisson process, the pattern tends to be clustered. If the
number of events is smaller than expected under the homogeneous Poisson process, the pattern tends to be
regular.
Figure 13.10 shows simulated locations using the Poisson process with a constant intensity of 70 (only 61
points were simulated; in the next simulation it can be a different number close to 70) in the unit square at
left and using the Poisson process with intensity as a linear function of coordinates λ =140⋅x (67 points were
simulated). The red arrow at right shows the direction of increasing intensity of points.
The choice of symbol used for visualization may mask the point distribution because expectations about
“typical” patterns of trees and of houses might be different.
Figure 13.10
In practice, the assumption of the Poisson process with a constant intensity may be unrealistic. For example,
in epidemiology, a model with a constant intensity of a disease is appropriate only when the population is
homogeneous. Because population density changes from area to area, the expected number of cases of
disease is larger in more populated areas. This can be modeled using an intensity that is a function of
population density.
A clustered process with its intensity defined by another random process is called the Cox process. For
example, intensity of disease incidence may depend on air pollution, which can be estimated using
geostatistical methods.
SPATIAL CLUSTERS
In 1939, Jerzy Neyman studied the effects of insecticides by counting larvae in plots in a field. Neyman found
that distributions of larvae could not be explained by the Poisson process. There were too many empty plots
and too few with one larva. He made the following assumptions:
The masses of eggs in the field are distributed independently and uniformly.
The masses of eggs produce a random number n of survivors with the specific probability.
The various larvae are asocial and behave independently.
Larvae movements are slow.
Later, Neyman investigated a mechanism for the spread of epidemics. He studied a population of susceptible
people and a population of people who became infected. Neyman found that the only difference between the
larvae and epidemic models was that people became infected after the incubation period.
If the mechanism that generates the data is known, an inference can be made. For example, model fitting can
be based on minimizing the squared differences between the model and the properties of the sample. After
successfully fitting a model to the epidemic data, Neyman found that vaccinating a proportion ϕ of the
population at random reduces the expected total size of the epidemic by a factor of (1 − ϕ)/ϕ . For ϕ=0.05, the
factor is equal to 19.
Additional examples of the clustered point processes are the following:
Wild strawberries and mushrooms usually occur at high density and low frequency. Examples of
such clusters are shown in figure 13.11.
In meteorology, an analysis of fine resolution radar data shows that storm locations are described by
a homogeneous Poisson process in space and time. Within a storm, places where rainfall is intense
are clustered in both time and space.
Figure 13.11
The clustered patterns of maples, palms, and pines in figure 13.12 were simulated as follows.
A homogeneous parent Poisson point process with an intensity of 10 in the unit square was
generated.
Each parent point was replaced by 10 (maples and palms) or 15 (pines) points randomly over the
area in a circle with a radius of 0.1. These points were also randomly shifted in the circle with radius
0.1 (palms and pines) or 0.2 (maples).
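A base R sketch following this recipe for one species (parent intensity 10, ten offspring per parent placed uniformly in a disc of radius 0.1); the additional random shift used for the figure is omitted.

# Sketch: cluster (Neyman-Scott type) process: Poisson parents with intensity 10
# in the unit square, each replaced by 10 offspring placed uniformly in a disc
# of radius 0.1 around the parent
set.seed(11)
n.parents <- rpois(1, 10)
px <- runif(n.parents); py <- runif(n.parents)

offspring <- do.call(rbind, lapply(seq_len(n.parents), function(i) {
  r     <- 0.1 * sqrt(runif(10))          # uniform in a disc of radius 0.1
  theta <- runif(10, 0, 2 * pi)
  cbind(px[i] + r * cos(theta), py[i] + r * sin(theta))
}))
plot(offspring, asp = 1, xlab = "x", ylab = "y")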
Figure 13.12
In this figure at bottom right, pines and maples are displayed together, showing a situation where there are
two types of points.
An inhibition process has an inhibition radius within which no two points are allowed.
Figure 13.13 shows two patterns generated using the following inhibition process:
A point with random coordinates is generated in the unit square.
Another point with random coordinates is generated. If the new point lies closer than a specified
distance to the previously generated point (0.04 for yucca and 0.025 for tulips), it is rejected and
another random point is generated.
New points are generated until 200 points are reached or until a new point is rejected 10 times.
Figure 13.13
Typical examples of inhibition processes are the distribution of animal and plant locations. Environmental
monitoring stations are usually located in cities but not close to each other and can be described by an
inhibition process as well.
COX PROCESSES
Cox process is a generalization of the Poisson process with the intensity function defined by a continuous
random variable. It can be generated in two steps:
1. nonnegative intensity function λ(s) is generated
2. Poisson process with intensity λ(s) is constructed
The resulting distribution of points is conditionally independent given the intensity λ(s).
In practice, the intensity function of the stationary Cox process is defined by a stationary random field with
nonnegative values. This random field cannot be generated automatically by conventional kriging since it may
produce both positive and negative values. To avoid possible negative values, the intensity function of the Cox
process is defined as an exponent of the Gaussian random field, λ(s)=exp(Z(s)), where Z(s) is a spatially
correlated Gaussian variable such as a kriging prediction surface or a Gaussian conditional simulation. The
point process generated using the intensity function λ(s)=exp(Z(s)) is called the log‐Gaussian Cox process.
If Z(s) is stationary and isotropic, the log‐Gaussian Cox process is also stationary and isotropic with a pair
correlation function (see its definition below) defined by the mean and variance of random variable Z(s):
g(h)=1+cov(h)/mean2
In the case of simple kriging, .
An example of the log‐Gaussian Cox process usage can be found in the section “Kriging assumptions and
model selection” in chapter 9.
POINT PATTERN ANALYSIS AND GEOSTATISTICS
Geostatistical models assume that data locations are fixed, and the distributions of values measured at these
fixed locations are analyzed.
Figure 13.14 at left shows a probability density for possible values at one particular location or some area (if
block kriging is used), where . The probability that a value is in the interval (20, 30) is
Figure 13.14
In point pattern analysis, the distribution of data locations themselves is studied, and it is of interest to know
how many different values can be found in the specific area. Therefore, we are interested in finding the
probability density f(s)
,
where integration is taken for the area, s are the spatial coordinates, and N is a number of points s in the area
A in Figure 13.14 at right. The shape of the surface provides information on the local density of points. Since
the integral is not equal to 1, f(s) is called the intensity function. Usually, notation is used instead of f(s).
Researchers are interested in estimating the intensity surface and mapping it.
The geostatistical model, kriging, predicts the most probable (expected) value at any location in the study
area. In point pattern analysis, the intensity function predicts the expected number of points per unit
area (that is, the mean value), and the density function f(s) estimates the probability of observing an event at a
location s. The expected number of points in any part of the data domain is proportional to its area if the point
process is homogeneous. The integral over the data domain of the density function is equal to one. The
relationship between the density fA(s) in the area A and the intensity in the point s is
,
where is an average number of points in A.
Density function mapping is analogous to interpolation in geostatistics: cluster centers can be identified from
the density surface similar to identifying high‐pollution regions when interpolating continuous data. Figure
13.15 at left shows the predicted surface of the variable measured at points shown as circles. Figure 13.15 at
right shows the estimated density of the same points with the high density in hot colors and the low density
in cool colors.
Figure 13.15
To understand the concept of the commonly used kernel density estimation, imagine that towers of sand are
placed at each point location and the flat surface on which the towers stand is gently shaken for some time.
The resulting surface corresponds to the point density map.
Usually, the intensity of points is not the same everywhere. The intensity of points within the distance h can
be estimated using a cylindrical kernel, which calculates the average number of points when moving over an
area under investigation.
,
where ω is the edge correction parameter, I(expression) is an indicator function equal to 1 if the expression is
true and zero otherwise, and si is the location of a point i. Alternatively, radially symmetric kernels
(Gaussian),
1 , (Epanechnikov),
(five order polynomial),
(inverse distance weighted),
where r is a radius with a center in the point s, are used instead of cylindrical kernel.
Statistical literature states that the choice of the bandwidth h is more important than the choice of the kernel
function. A reliable value of the bandwidth h can be estimated from the data by minimizing the mean squared
error of the intensity function. The bandwidth also can be selected based on typical distances derived from
the point patterns; for example, from the distribution of the nearest distances between pairs of points.
Sometimes knowledge of the physical or artificial process that supposedly generated a point pattern can be
used for choosing the bandwidth value. Since selection of the bandwidth using data themselves is
problematic, it is a good idea to try several h values and choose the h that corresponds to the intensity surface
with desirable smoothness.
The intensity of Martian craters using a cylindrical kernel is presented in figure 13.16. Low values are shown
in blue and high in red. Brown lines are the pedestals of the largest craters.
Figure 13.16
Figure 13.17
Mathematically, the intensity is defined as
,
where ds is an infinitesimally small area containing the point s.
Note that the intensity function does not contain information about the spatial dependence between
points. The spatial dependency can be introduced using the second‐order intensity:
If the spatial point process is stationary, the first‐order intensity l(s) is constant, and the second‐order
intensity is a function of the distance between pairs of points and perhaps the directions between pairs of
points (in the case of anisotropy): . This concept is similar to the semivariogram and
covariance concepts discussed in chapter 8.
By definition, the covariance of random variables A and B is a function of the expected values E(AB), E(A), and
E(B):
cov(A, B) = E(AB) E(A)E(B)
By analogy, a covariance density of a spatial point process is defined as
RIPLEY'S K FUNCTION
Ripley's K function is the variance of the number of events occurring in the area A, a subdivision of the area
under study. It estimates the average number of points within distance h normalized on the average number
of points per unit area. Ripley's K function is estimated as
= ,
that is, the expected number of points within distance h from an arbitrary point divided by the intensity λ of
the point pattern, where is a distance between points i and j, I(•) is an indicator function that equals 1 if
< h and zero otherwise, N is a number of points, and is the edge correction factor.
In Figure 13.18, is the proportion of a circle centered at point i with a radius h within the area D.
This correction is based on the assumption that the outside part of the circle could have contained the same
neighbors density as the inside part. This is one of the possible edge correction algorithms called Ripley’s
correction. Usually, point pattern analysis software proposes a choice between Ripley’s correction and
several other algorithms. Therefore, it is advisable to consult the software manual for available corrections
and the differences between them.
Figure 13.18
Under the null hypothesis that the point process is homogeneous Poisson with intensity λ, is
asymptotically normal,
2πh 2
K (h) = Normal πh 2 , 2 ,
λA
€
Spatial Statistical Data Analysis for GIS Users 555
as the area of observation A tends to infinity. In other words, if points are distributed independently from
each other, the expected value of the K function at distance h is equal to πh2. This value is used as a
benchmark.
The quantity λ⋅ is interpreted as the mean number of points in a disk of radius h, centered at a point
inside the area under study.
Although the semivariogram is useful in exploring spatial data dependency, its main use in geostatistics is to
calculate weights of neighboring measurements when predicting a value to the unsampled location. Ripley’s K
function is not used for prediction but for testing hypotheses about data randomness, clustering, and
regularity.
In the case of clustered patterns, most points are encircled by the outer points of the cluster, and Ripley’s K
function is larger than it would be for randomly distributed points. If points are regularly distributed, the
average neighbors’ density is smaller than the average point density of the random pattern, and Ripley’s K
function is relatively small.
Recall that the cluster process generates a specified number of offspring around randomly distributed cluster
centers. In this case, there is aggregation at small distances and regularity at larger distances. Ripley’s K
function can detect different scales in point patterns. This is an important property because most spatial
processes are scale dependent, and their characteristics may change across scales.
In some cases, a small number of events are missed. If the locations of missed events are random, the Ripley’s
K function estimates the second‐order property of the complete process correctly because random thinning
reduces both the number of points in the searching neighborhood and the overall intensity of the process by
the same amount (the nominator and denominator in the formula for Ripley’s K function).
When the point pattern is close to randomly distributed points, it is convenient to use a normalized Ripley’s K
function with a benchmark value of zero:
is called L function in point pattern analysis literature. L function may be preferable to K function
because the square root transformation may reduce data variability, making data close to stationary. Also, the
features of the point process are more easily visualized and interpreted using the L function than the K
function because the L function tends to be a horizontal line, making it easier to detect any deviation from it.
The Ripley’s K function assumes that all points are equivalent, that is, point characteristics do not matter. It
may be inappropriate for analysis of objects whose size is compatible with the shortest distance between
points.
Figure 13.19 shows Ripley’s K function for the Martian craters (red) and the simulated clustered (blue) and
random (green) points shown in the beginning of this chapter in figure 13.1. Comparing these lines, we can
conclude that the Martian craters are distributed similarly to random points.
The estimated values of K and L functions are compared to benchmarks πh2 and 0 given by a homogeneous
Poisson process. The Monte Carlo algorithm is used to test whether a value of K function is significantly
different from the benchmark πh2 as follows:
A large number of random point datasets, perhaps N=1000, is generated.
A confidence level α is chosen, for example, α=5%.
Estimated values of simulations i=1, 2, … N are sorted in increasing order for each lag h.
For N=1000 and α=5%, the 26th and the 974th values are selected.
is considered significantly different from the benchmark πh2 if its value is outside the interval
[the 26th, the 974th] values.
CROSS K FUNCTION
The bivariate (or cross‐type) K function of two point patterns observed in area A is defined as the expected
number of points in a pattern of the first kind within a given distance h of an arbitrary point in a pattern of
the second kind, divided by the intensity λ1 of points in a pattern of the first kind:
K12(h) = (the mean number of points in a pattern of the first kind in a circle of radius h,
given that there is a point in a pattern of the second kind in the center of the circle)/λ1
An example of using the cross K function is shown in the section “Test of association between two types of
point events” below.
PAIR CORRELATION FUNCTIONS
The pair correlation function g(h) of a stationary point process arises if the circles of Ripley’s K function are
replaced by rings. It gives the expected number of points at distance h from an arbitrary point, divided by the
intensity λ of the pattern. The quantity l⋅g(h) is interpreted as the mean number of points at distance h from a
point inside the area under study.
Using an analogy with univariate statistics, K function is the cumulative distribution functions, while the pair
correlation function is the probability density function. Most researchers prefer using the probability density
function for exploratory data analysis, especially its empirical counterpart, the histogram. Similarly, the pair
correlation function has become more popular among practitioners who have access to the software with
both K and pair correlation functions.
For a stationary Poisson process, the pair correlation function is equal to 1. Values of g(h) smaller than 1
indicate inhibition between points, and values greater than 1 indicate clustering. At large distances, the pair
correlation function usually tends toward a value of 1.
Figure 13.20 shows a simulated cluster process (left) and an estimated pair correlation function using these
simulated data (right). By analogy with semivariogram/covariance modeling, the range of correlation
(intersection of the estimated pair correlation function and horizontal line y=1) shows the average size of the
clusters.
Figure 13.20
Figure 13.21
Figure 13.22 at left shows Ripley’s K function K(h) for a Poisson process (green) and for lightning strike
locations data presented in chapter 1 (purple). We see that there is a clustering tendency for the lightning
strike locations. Figure 13.22 at right shows the pair correlation function g(h) for lightning strike data. It
looks like the geostatistical covariance model: locations separated by small distances tend to be clustered (in
geostatistics, correlated data values separated by small distances are more similar). Note that the wave shape
of the function is allowed for pair correlation function. In this case, it may be due to an insufficient number of
data points, which results in some instability of the numerical computations.
If the hypothesis of circular clusters is accepted, then the pair correlation function suggests a mean cluster of
about 20,000 to 30,000 feet in diameter, based on the distance of the intersection of the two lines, the
observed pair correlation function, and the expected value of 1 for random locations. The distance of the
intersection can also be interpreted as the range of correlation between events beyond which locations are
independently distributed.
Figure 13.22
,
where is the derivative of the cross‐type K function K12(h). The function λ1 g12(h) gives the expected
density of points of pattern 1 at distance h of an arbitrary point of pattern 2:
If two patterns are independent, the cross‐type pair correlation function g12(h) is equal to 1. It is smaller than
1 for point inhibition and greater than 1 for point attraction.
Figure 13.23 at left shows lightning strikes classified according to the number of strikes, with yellow symbols
showing the location of a single strike and red displaying locations of multiple strikes. The cross‐type pair
correlation function g12(h) for single and multiple strikes in figure 13.23 at right is greater than 1 for locations
separated by distances of less than 20,000 feet, meaning that single and multiple lightning strike patterns are
more similar than different.
Figure 13.23
TEST OF ASSOCIATION BETWEEN TWO TYPES OF POINT EVENTS
Cross pair and cross K functions can be used for testing the association between events of different types, but
these tests are incomplete without constructing the confidence intervals. If the studied point pattern can be
described by a Poisson process, the confidence intervals can be constructed by simulating many random
points in the data domain and displaying cross statistics on the same graph. Unfortunately, this approach
does not work in the case of clustered data such as the locations of single and multiple lightning strikes.
Permutation of the types of lightning strikes is also not appropriate because the point pattern is
inhomogeneous.
This approach cannot be used directly since a new pattern may not occupy the data domain. This problem can
be overcome by wrapping the pattern on a torus. Figure 13.24 at left shows the original patterns of single
(blue) and multiple (empty circle) lightning strikes and one shifted pattern of the multiple strikes (red).
Figure 13.24 at right shows the estimated (red line) and the expected (blue) cross K function in the case of
random pattern and 200 cross K functions calculated using randomly shifted multiple strike locations
wrapped on a torus (green). The original values of the cross K function show much stronger local dependence
than do the values calculated using shifted multiple strike locations. This confirms that the single and
multiple lightning strike patterns are dependent.
Figure 13.24
MODEL FITTING
There is a general algorithm called the method of minimum contrast for fitting point process models with a
known theoretical K function (or pair correlation or any other theoretical function) to observed point pattern
data. The algorithm minimizes the criterion
p
h0
q
∫ {[ }
q
D(θ ) = ]
K (h) − [K theoretical (h)] dh
0
by finding the closest match between the theoretical and empirical functions. Here is the K function
computed from the data, is the theoretical value of the K function, and p and q are two
€
exponents.
Process K function
Homogeneous Poisson π h 2
Neiman‐Scott
cπh2, 0≤h≤δ
Strauss
πh2 –(1 c)πδ2, h≥δ
Table 13.2
where λ is the intensity of the parent homogeneous Poisson process, 2σ2 is the mean squared distance to an
offspring from the parent, δ is the minimum distance between points, and c is the probability of observing
two points closer than δ units. The Strauss process is a generalization of a simple inhibition process that
arises if parameter c of the Strauss process is equal to zero. The Neiman‐Scott process with theoretical K
function above becomes the Thomas point process if the locations of the offspring points of one parent are
independent and normally distributed around the parent point with standard deviation σ.
Fitting the Thomas process with exponents q= 0.25 and p= 2 to the lightning strike data gives the following
estimation of the model’s parameters (using thousands of feet distance units):
The intensity of the Poisson process of cluster centers λ = 0.00105.
The standard deviation of displacement of a point from its cluster center σ = 7.11.
The expected number of points per cluster µ = 10.16.
This information helps in understanding of the lightning strike data features. In addition, fitted Thomas point
process can be used for simulating similar point patterns for the purpose of experimentation.
Computations with Poisson, Neiman‐Scott, and Strauss cluster processes are direct. More flexible and
complex spatial processes are fitted and simulated using numerical methods. Examples of fitted more
complex processes will be presented later in this chapter.
INHOMOGENEOUS K FUNCTIONS
Ripley’s K function discussed above was defined for stationary data—point processes with constant intensity
λ. If the point process is not stationary, deviations between the empirical and theoretical K functions can be
explained by the variations in intensity λ(s). The inhomogeneous K function is a generalization to
nonstationary point processes with intensity λ(s) varying with location s. The is modified by rescaling
the interevent distances by the product of the intensities λ(si) and λ(sj) at locations si and sj:
= ,
where, as before, is a distance between points i and j, I(expression) is an indicator function that equals 1 if
< h and zero otherwise, N is a number of points, and is the edge correction factor.
is greater than the benchmark, the points are clustered.
Figure 13.25 at left shows estimated lightning strike intensity using second‐order polynomial
with fitted coefficients for trend formula a0 = 5.81,
a1 = 6.82·104, a
2 = 0.04·102, a
3 = 7.05·106, a
4 = 9.9·105, a
5 = 3.0·10 . The estimated intensity increases from
4
dark green to pink colors. Figure 13.25 at right shows the inhomogeneous K function (red) and benchmark
πh2 (gray). We see that after adjusting for trend in the intensity of lightning strikes, there is still evidence of
locations clustering.
Figure 13.25
Figure 13.26
One simplified diagnostic is based on comparing the number of observed points in quadrats i and the
area(quadrat i), where is the estimated constant point intensity. In the case of an inhomogeneous
counts), and the Pearson residual value is calculated as
If the modeled point process fits the observed one well, the expression above has the approximate mean of 0
and variance of 1. Figure 13.27 at left shows the point pattern (blue points), the values of Pearson residuals
(bottom), the observed (top left), and the expected (top right) counts for 9 quadrats that overlap the lighting
strike data domain. The number of expected counts is calculated assuming that the point process is Poisson
with a constant intensity so that the number of expected counts is 15 in all quadrats. A quadrat with a
Pearson residual of 8.1 departs from the Poisson process model significantly.
The expected number of counts is different for the point process with the intensity shown in figure 13.26 at
left. Figure 13.27 at right shows the expected numbers and the calculated Pearson residuals over the locally
estimated points intensity. This time the departure of the observed point pattern from the inhomogeneous
Figure 13.27
This section would be incomplete without an algorithm for simulating an inhomogeneous Poisson process
with the intensity λ(s) in area A. It is the following:
Simulate a random number n from Poisson distribution with the intensity (λab), where
.
Simulate the x coordinates xi (i is from 1 to n) in the data extent from a uniform distribution
Uniform(0, a).
Simulate the y coordinates yi (i is from 1 to n) in the data extent from a uniform distribution
Uniform(0, b).
Simulate values ci (i is from 1 to n) from a uniform distribution Uniform(0, 1).
Remove the points simulated outside the polygon.
K FUNCTIONS ON A NETWORK
In chapters 1 and 7, the problem of choosing a reliable distance between geographic objects was discussed,
and examples of comparing interpolation using Euclidean and non‐Euclidean metrics were presented. We
might expect that point pattern analysis models are more sensitive to the distance metric choice than
interpolation models because all calculations in point pattern analysis are based on pair distances.
Figure 13.28 shows six locations of robberies in Redlands, California, in 1998, labeled using their
identification numbers. Most crimes occur close to roads since most people, including criminals, travel by car
in this area. Street segments are colored according to the closest crime location measured along roads. Any
two locations are not visible from the others, and they may belong to neighborhoods with different social
Courtesy of City of Redlands, Police Department.
Figure 13.28
Figure 13.29 shows locations of auto thefts in red, robberies in blue, and random points in yellow and green.
Observed crime locations were shifted to the nearest point on the network. Random points were generated
from the Poisson process on a network so that the probability of a point being placed on a unit line segment
on a network is the same regardless of the location of the segment. It is difficult to decide visually whether the
distribution of crime locations is random or not on the road network.
Courtesy of City of Redlands, Police Department.
Figure 13.29
where the intensity of points λ is
The following estimator of the network K function (the observed K function) is used:
The observed K function can be used to detect whether the crime locations are independently and randomly
distributed with respect to a set of randomly located points on the street network.
The observed K function calculated using auto theft locations on the street network is shown in red in figure
13.30. A 95 percent confidence envelope calculated from 100 simulations of random points on the same
street network is shown in blue. Interpretation of the graph is the same as in the case of the Euclidean
distance metric: auto theft locations are clustered on the network because observed K function values are
larger than the upper 95 percent interval of the K function calculated using simulated random points.
Courtesy of City of Redlands, Police Department.
Figure 13.30
Figure 13.31 shows the locations of auto theft in red and robbery in blue. Two point patterns can be
compared using network cross K function defined as
with estimator
Courtesy of City of Redlands, Police Department.
Figure 31.31
In figure 13.32, the observed cross K function values (red) are greater than the values of K function estimated
using independent point locations (blue), meaning that locations of the two crimes tend to be adjacent.
Therefore, we may suggest that the same group of people may be involved in robbery and auto theft crimes.
This suggestion should be further investigated using additional information. For example, if the crimes did
not occur closely in time, then it might not be the same people but instead different criminals taking
advantage of a vulnerable location. Examples of point patterns modeling using covariates are presented later
in this chapter.
Courtesy of City of Redlands, Police Department.
Figure 13.32
CLUSTER ANALYSIS
Spatial clusters occur when events are located more closely than points from random sampling from the
population at risk (note that this population is often clustered as well) or, if the population at risk is not
defined, just from random sets of points. Epidemiological literature has many examples when local clusters
were first observed and then scientific investigation determined the causes of the clusters. For example, hot
spots of female lung cancer in China in 1980s were attributed to cooking with smoky coal in unventilated
houses. Note that identification of the areas of low risk is sometimes of interest as well because, for example,
these areas may indicate exceptional treatment of the disease.
Cluster detection methods are especially useful when there is a cause that increases the number of events and
this increase is small so that it is difficult to detect. For example, in most cases environmental exposures are
not high enough to produce clearly seen clusters of cancer events, taking into account the long incubation
period and population mobility.
Examples of applications in which cluster detection is helpful include road accidents that can be influenced by
street design; crimes that can be due to the presence of pubs, brothels, and ATMs; and an influenza epidemic
that can be detected earlier than might otherwise be identified by counting an increased number of the
purchased medications.
There are many approaches to the detection of spatial clusters. They should be used carefully since some of
the approaches cannot properly adjust for heterogeneous population density because the intensity of events
varies with the varying population (see also discussion on the cluster detection in regional data in chapter
11). However, these same approaches can be used in applications which need not take population
heterogeneity into account. One such approach, model‐based clustering, is presented in chapter 16 using
forestry data.
Are the locations of pharmacies located preferentially?
Does the distribution of pharmacies reflect a variation in the population density?
Figure 13.33
K function is invariant under random sampling from the same population. If the locations of the events (in our
case, pharmacies) are a random sample from the population (in point pattern analysis literature, this random
sampling is usually called random labeling), they should have the same K functions:
Kevents = Kpopulation = Kcross
Figure 13.34 at left shows the estimated L functions for pharmacies locations (green), population (blue), and
cross L function (red). The L functions are clearly different.
Figure 13.34 at right shows the difference between K functions Kevents and Kpopulation (red) and the simulation
envelope for assessing the significance of the difference. The simulation envelope was calculated using the
following algorithm:
Randomly assign labels “events” and “population” to the point pattern that consists of the
pharmacies and population locations shown in figure 13.33.
Calculate the K functions for “events” and “population” and the difference between them.
Plot the lower and upper differences.
We see that the difference
Kevents ‐ Kpopulation
is outside the simulation envelope. This confirms that the two point patterns are distributed differently and
that the locations of pharmacies are more aggregated.
The next step in cluster analysis is finding the cluster locations. We will discuss two approaches. The idea of
the first approach can be explained graphically, and it is implemented in the R software package spatclus.
The spatclus software package creates a trajectory in which each next point is the nearest among the
remaining ones (blue segments in the left part of figure 13.35). The first point is selected near the border of
the region where the cluster center is unlikely. The idea of the cluster detection algorithm is based on the
assumption that points in the clusters have consecutive selection orders and that their segments’ lengths are
smaller than the average distance between the nearest neighbors as can be seen inside green rectangle. Using
the trajectory, two variables are constructed: the distance between points and the order of the points’
selection. Then the weighted distance is defined as the ratio of the distance between points and expected
distance given the population density. The ordered weighted distances are shown in the right part of figure
13.35. Finally, the relationship between the weighted distance and the order variable is modeled using the
staircase function (red line). Jumps in the line indicate the borders of the clusters.
Figure 13.35
Figure 13.36
A more traditional statistical approach to spatial cluster analysis is based on comparison of the densities (or
intensities) of the events and population.
The intensity of points in the formula for the points’ intensity
is defined as intensity per unit area. The unit area can be replaced by the intensity of the underlying
population by estimating the ratio of the two kernels:
,
practice, the locations are usually unknown and they are replaced by “control” locations from another
point process which can be considered as representative of the population variation. For example, in
epidemiology, locations of two cancers incidences are often used as case and control events if the disease
etiology (the cause or origin of disease) is completely different.
As mentioned above, the choice of the kernel function is less important than the choice of the bandwidth h.
The value of the bandwidth misspecification usually has a larger effect on the estimated intensities ratio. For
example, the small change of the bandwidth may produce a large change in the intensities’ ratio in the areas
with a sparse population and a relatively large number of cases. A rule of thumb is that the bandwidth should
be several times larger than the range of points’ interactions.
Figure 13.37 at right shows the ratio of two intensities. The area of the largest values of the ratio almost
coincides with cluster estimated by the spatclus software package (red contour).
Two different cluster detection methods found a cluster in nearly the same area, and this increases our
confidence that the number of pharmacies in the city center is indeed larger than the expected value based on
the population density.
Figure 13.37
The next research question in cluster analysis is: “Do events cluster around a particular point or linear
source?” In the case of the disease locations, the point of interest can be the location of a chemical or nuclear
power plant or a high voltage power line. In Montpellier, pharmacies are clustered in the city center. A
possible reason is that there are more people in the city during the daytime than at night; consequently,
additional pharmacies are required (note that in Europe pharmacies are generally closed at night in contrast
to pharmacy hours in the United States).
In the analysis above, we used the data exploration approach. However, from the decision‐making point of
view, formal hypothesis testing is required. For example, the null hypothesis of no clustering of pharmacies is
equivalent to the location of pharmacies being an independent random sample from the superposition of the
pharmacies and population.
The population at risk or the control process is usually known and its intensity can be estimated
assuming an inhomogeneous Poisson process. We expect that the intensity of cases and population are
proportional; therefore, the local intensity of cases can be expressed as
,
(
f (s − s0 ,θ ) = 1+ α exp −β ( s − s0 )
2
) ,
€
Spatial Statistical Data Analysis for GIS Users 573
where α and β are parameters to be estimated, although peak‐decline function (figure 13.38, blue) is
sometimes used as well because the largest risk may occur at some distance from the source. For example,
the peak‐decline function can be a superposition of two or more physical processes.
Figure 13.38
In some applications, measurements associated with known or hypothesized risk factors for cases are
available. For example, when studying the distribution of disease locations, such covariates as age, gender,
alcohol consumption, and smoking status are usually used. Then the model for intensity can be
adjusted by multiplying the right part of expression to an exponent of the linear
function of the covariates (exponentiation is used to guarantee that is positive):
,
where ai are regression coefficients to be estimated and Xi are the explanatory variables. Note that the
estimate of in the expression above depends on the case locations through the values of the covariates
Xi at those locations. In practice, one of the two factors, or , is
dominating so that, after fitting the model for intensity , the model can be often simplified by removing
one of the factors.
The regression coefficients ai can be estimated using a logistic regression for the indicator variable
constructed from events (value of 1) and controls (value of 0) and logit link function as discussed in chapters
6 and 12 and in appendix 4.
Additional examples of cluster analysis, including a case‐control case‐study, are presented in chapter 16 and
its assignments.
Marked point patterns are point patterns that also have a value attached to the event locations. For example,
in epidemiology, the point pattern is formed by the incidence of disease. Any personal characteristic of ill
people, such as race, age, or gender, makes this data a marked point pattern. In particular, the case‐control
data discussed at the end of the previous section can be analyzed as a marked point pattern with marks 1 (or
a label “case”) and 0 (or a label “control”) and a covariate surface with values defined as the distance to the
source of risk.
The difference between marks and covariates (explanatory variables) is that the former are observed at the
point event locations while the latter can be found at any point in the data domain. For example, a value of the
tree diameter is a mark, while the predicted surface of organic matter is a covariate.
The sum of the marks of all points in area A (for example, the total wood volume in a forest stand) for a
stationary marked point process with intensity λ is estimated as λ⋅A⋅µ, where µ is the mean value of the marks
in area A.
The interpolation of marks does not make sense since marks can only be measured at a counted number of
measurement locations. For example, there is no sense in predicting tree diameter between trees. Moreover,
events typically interact at short distances, and the processes that generated marks and locations can be
dependent. For example, plants compete for light and nutrients, and neighboring trees stunt one another.
Then the interaction between trees can be described by modeling dependence between the distance between
trees and tree thicknesses.
The mark correlation function ρ(h) of a marked point process is a measure of the dependence between the
marks of two points separated by distance h. It is defined as the proportion of the expected value of the
function of marks m(s) and m(s+h) at locations s and s+h attached to two points separated by a distance h to
an expected value of the function of marks separated by a very large distance (independent marks):
The expected value is calculated as the mean number of pairs of points separated by distance h with mark
values m(s) and m(s+h). Similar to the semivariogram model in geostatistics, the function ρ(h) is required to
be known for any distance. Because a number of pair distances is counted, the estimated intensity of points is
used in the calculation of the mark correlation so the mark correlation function is proportional to the point
intensity.
The function should be nonnegative. For continuous real‐valued marks, the following
two functions are commonly used:
which formally corresponds to the covariance in geostatistics, and
€ 1 2
f ( m(s),m(s + h)) = (m(s) − m(s + h)) ,
2
which formally conforms to the semivariogram.
The mark correlation function ρ(h) can take any nonnegative real value. A value of 1 means that the marks
are independent. The interpretation of values larger or smaller than 1 depends on the choice of
function . In the case of covariance‐like mark correlation function, objects with values
larger than 1 attract one another, and objects with values less than 1 inhibit one another. In the case of mark
semivariogram, objects with values other than 1 correlate, but the type of correlation is undefined.
To highlight the difference between mark and geostatistical semivariograms, the definition of both functions
is repeated in the table below. Note that a mark semivariogram is conditioned on the existence of the points,
but not on the values of the marks at those points.
Geostatistics Point process
The semivariogram is equal to 1/2 of the The mark semivariogram is equal to 1/2 of the mean
mean squared difference of the random squared difference of the marks m(s) and m(s+h) at the
field values Z(s) and Z(s+h) at the locations s and s+h, where h is the distance between s and
locations s and s+h, where h is the s+h, under the condition that at the locations s and s+h,
distance between s and s+h. the events are really observed.
Table 13.3
The semivariogram and mark semivariogram coincide if both marks and locations are independent. In that
case, the event is marked according to the value of the random field, completely ignoring the existence of
other events. For example, we can imagine that the diameter of the tree takes its value from the random field,
assuming that no other trees influence the tree’s growth. In this case, there is no interaction between trees,
but there is a spatial correlation between the diameters of trees as defined by the random field.
Figure 13.39 shows the heights and diameters of shrubs over the location intensity surface, using colors from
yellow to brown. Data were collected in an area 10 by 10 meters in Valencia Province, Spain. The area was
recently burned and is considered natural post‐fire soil.
Figure 13.39
The pair correlation function (blue), calculated using the shrub locations, shows that the locations are
correlated (since its shape is different from the pink line that corresponds to the independent point pattern)
and distributed rather regularly (since the pair correlation line is below the horizontal pink line for small
distance between pairs of points), with a correlation range of approximately 1.5 meters, as shown in figure
13.40.
Figure 13.40
Mark semivariograms for shrub heights (left) and diameters (right) are shown in figure 13.41. A
semivariogram shaped this way is not acceptable in geostatistics. The best fit by the Geostatistical Analyst
semivariogram model produces a nugget effect, shown by the blue horizontal line in the top right corners.
However, visual fitting suggests decreasing the semivariogram for short distances (light brown lines), but this
is prohibited since the model in light brown does not belong to a class of positive‐definite semivariogram
functions (see chapter 8). However, mark semivariograms in the figure below are valid functions. Therefore,
using geostatistical software for an estimation of the mark semivariogram can be misleading.
Figure 13.41
Figure 13.42
HIERARCHICAL MODELING OF SPATIAL POINT DATA
The point pattern can be a result of the interplay of many factors. For example, bird distribution is likely due
to birds’ attraction to specific habitats, and it is known that habitat is correlated in space. As another example,
possible changes in local environmental variables, such as groundcover vegetation, light conditions,
microclimate, and soil characteristics, may affect the distribution of trees in the forest.
If we know the coordinates of the tree locations and the amount of organic matter under each tree, then the
statistical model can be formulated as follows:
The tree distribution is modeled as a Poisson process with a spatially varying intensity.
The intensity is explained by a change of the organic matter.
The expected number of trees Y(s) at the location s depends on the spatially varying mean intensity
process λ(s):
The common approach is modeling the mean intensity process λ(s) as a Poisson log‐linear regression (so that
λ(s) is nonnegative)
,
where µ(s) is a mean component, Z(s) is a continuous, spatially correlated random component; and ε(s) is an
uncorrelated random component.
The Spanish shrub data introduced above also include information on soil features collected under each
shrub for 60 locations. The kriging prediction map of organic matter created in Geostatistical Analyst is
shown in figure 13.43. In the following illustrative case study, we use just the most probable values predicted
by kriging and ignore information on prediction standard errors. This brings up uncertainty issues, which can
be addressed via simulating many possible surfaces and using them all in further analysis as outlined in
chapter 5.
Figure 13.43
Figure 13.44
The intensity map created using information on organic matter is a compromise between the intensity map
using shrub locations only and a prediction map of organic matter. The next section presents more details on
modeling the points’ intensity using covariates.
RESIDUAL ANALYSIS FOR SPATIAL POINT PROCESSES
The following measurements are available for the Spanish shrubs data:
Shrub morphological characteristics: diameter, height, and biomass
Soil characteristics: organic matter, humidity, structural elements (aggregates) and pH
Using this information, a large number of point process models can be fitted to the Spanish shrubs data, and
several of them are presented below.
Fitted coefficients of the nonstationary Poisson process with intensity as a function of the spatial coordinates x
and y
λ(x, y) =β1 exp(−(β2 x + β3 y)),
where β2 x + β3 y is a trend formula, are the following:
Log(β1) β 2 β 3
‐0.265 ‐0.0254 ‐0.0248
The trend formula can be made more complicated by adding a spatially variable covariate calculated on the
grid in the data domain
λ(x, y) =β1 exp(−(β2 x + β3 y + β4 grid)).
Log(β1) β 2 β 3 β 4
‐3.382 ‐0.0645 ‐0.004 1.327
According to this model, organic matter has more influence on point intensity than spatial trend x + y since
coefficient β4 is much larger than coefficients β2 and β3 and the organic matter unit is compatible with the
coordinates units.
Another modeling possibility is to transform diameters of the shrubs—which vary in the range from 17 to 70
millimeters—from numeric values into factors (0,20), (20,30), (30,40), and (40,70) and use these marks as
covariates. Then the fitted trend coefficients
β1 + β2 x + β3 y + β4 marks(20,30) + β5 marks(30,40) + β6 marks(40,70)
of the nonstationary multitype Poisson process with marks can be estimated as
Log(β1) β 2 β 3 β 4 β 5 β 6
‐2.057 ‐0.0254 ‐0.0248 0.916 0.47 ‐0.105
The estimated regression coefficients show that the smaller the shrub diameters, the larger the points
intensity. This makes sense because large shrubs crowd out smaller ones.
Each of three models above has some objective and subjective justification. Therefore, the next modeling
steps verify the models’ assumptions and make the models’ comparison. Several diagnostics for the first two
models above—the nonstationary Poisson process with intensity λ(x, y) =β1 exp(−(β2 x + β3 y)) and the model
with organic matter as a spatial covariate—are presented below.
Estimated intensity of the nonstationary Poisson process is shown in figure 13.45, and the estimated intensity
of the process with organic matter as covariate was shown in figure 13.44.
Figure 13.45
If the model was fitted correctly and the area where shrubs grow is typical for that forest, we would expect to
see three distributions of shrubs in figure 13.46 in other parts of the forest. In other words, the fitted model
can be used for the management of the forest; for example, simulations from the fitted model may help in
planning forest thinning and regeneration.
Figure 13.46
Figure 13.47 shows three simulations using the nonstationary Poisson process with organic matter
predictions on a grid as spatial covariates. The density surfaces show more variability than the density
surfaces shown in figure 13.46 since the estimated intensity for the point process with covariate organic
matter is more variable.
Figure 13.47
Diagnostics of the point process models are based on the conditional point intensity and on the residuals of
the fit of a point process model to a spatial point pattern.
Informally, the conditional intensity is the probability of seeing a point at a specified location s if the point
pattern everywhere else (locations u) is known. It is called the Papangelou conditional intensity in
the point pattern analysis literature. One way to estimate this intensity is the following: for a point s and a
disk with radius r centered at s, search for the location on the disk where point s should be placed to make the
conditional intensity optimal in a statistical sense (maximum likelihood function optimization). The
Papangelou conditional intensity is a random variable with a mean value that equals the intensity of
the point process .
,
where s is a data set consisting of the locations s1,…, sn and is the number of points falling in A.
In geostatistics, diagnostics are calculated for measurement locations only. In point pattern analyses, the
residuals are also ascribed to the unsampled locations. This is because the observed information consists of
both locations of the points of the pattern and of areas where events do not occur, so that information about
the absence of points is also informative.
Figure 13.48 at left shows a smooth map of residuals for the Poisson process with the trend β1 + β2 x + β3 y.
Positive values of the smoothed residual suggest where the model underestimates the points intensity, while
negative values indicate where the model overestimates the points intensity.
A diagnostic of the interpoint interaction is a Q–Q‐plot of the residuals. In geostatistics, a Q–Q‐plot compares
the data or the residuals quantiles. The Q–Q‐plot in figure 13.48 at right compares quantiles of the smoothed
residuals for the Poisson process with spatial trend β1 + β2 x + β3 y with the expected quantiles of the fitted
model residuals, shown as small circles. The expected residuals are calculated for a large number of
independent simulations from the fitted model. For each simulation, the smoothed residuals are calculated on
the grid, and the mean of the ith quantile is calculated. The two red‐dotted lines in figure 13.48 at right show
the 2.5 and 97.5 percentiles of the simulated values.
If a line of circles in figure 13.48 at right is close to the black 1:1 line and does not fall outside the two red‐
dotted lines (that is, the fitted model is inside the confidence interval), the fitted model consistently describes
point interactions. It is not so in the case of a Poisson process with spatial trend β1 + β2 x + β3 y, meaning that a
more complicated point process model is required to describe the point pattern; for example, one of the
cluster or inhibition models discussed earlier in this chapter.
Figure 13.48
Figure 13.49 shows the diagnostics of the nonstationary Poisson process with the intensity
λ(x, y) =β1 exp(−(β2 x + β3 y)) = exp(0.7672 +0.0254 x 0.0248 y)
for spatial trend x + y (left), which is included in the model, and for the possible covariate effect pH (right),
which is not included in the model. The empirical plot in red is shown together with its expected values
(green), assuming that the model is true. The two standard‐deviation limits
mean ± two standard deviations,
are also displayed (blue).
In figure 13.49 at left, the value of the y‐axis is the sum of the residuals over all points on the fine grid where
the spatial trend x + y takes a value less than or equal to the running value of the x‐axis. In the same figure at
right, the value of the y‐axis is the sum of the inverse residuals over all data points where the covariate pH
takes a value less than or equal to the running value of the x‐axis.
Because the empirical and theoretical lines deviate substantially from one another at short and medium
distances, the interpretation of the diagnostics is that the fitted model does not correctly account for the
dependence on both spatial trend x + y and the covariate pH.
Figure 13.49
The diagnostic tools described above detect departures from the fitted model but do not identify locations
where the model fits poorly. One way to find possible outliers is to use the semivariogram cloud tool with
estimated residuals as input data. Figure 13.50 shows that unusually large semivariogram values have arisen
because of just two points (inside the pink circles). This suggests that the point process model can be
improved by assuming a cluster process with a small number of clusters (assuming that the unusual points
are not outliers).
Another observation from the semivariogram cloud plot is that there is a strong southeast–northwest trend
in the residuals. Because such a trend component is already included in the model, one can try to explain this
large‐scale variation with the additional covariate.
Figure 13.50
Other geostatistical exploratory data analysis tools can be used as well since the conditional intensity and the
residuals can be estimated at any point in the data domain, meaning that the residuals are continuous
variables.
Figure 13.51 at left shows the difference between the residuals of the nonstationary Poisson model and that
model with the inclusion of one more covariate, organic matter. The residuals for the model with a covariate
are smaller in the areas displayed in hot colors and larger in the areas displayed in cool colors. We conclude
that the presence of organic matter in the model reduces the point intensity in the areas with dense points.
Figure 13.51 at right shows the diagnostics for the possible covariate effect pH, which is not included in the
model. Comparison with the same diagnostics for the model without the covariate effect (figure 13.49 at
right) shows that addition of the organic matter covariate to the model makes it more sensitive to the pH
values, although the expected and the empirical lines in green and red are still not very close to each other.
Figure 13.51
Figure 13.52
LOCAL INDICATORS OF SPATIAL ASSOCIATION
The residual analysis presented in the previous section can be used to find unusual points configurations:
points that are not expected in a particular area. It can be also used for finding areas where points are
expected according to the statistical model but are not observed. In this section we discuss several other
approaches for investigating individual points and groups of points in the pattern in relationship to their
neighboring points.
A preliminary detection of areas with extra points can be based on the analysis of values assigned to the
points; for example, areas and perimeters of the Voronoi polygons, as presented in the beginning of this
chapter. Figure 13.53 illustrates this idea by showing semivariogram modeling (left) and the kriging
prediction of the Voronoi polygons’ areas (right) using the lightning strike locations data.
Figure 13.53
From figures 13.53 and 13.54, we see that there are clusters of lightning strikes, and there are low populated
areas. The next step after exploring the data is explaining the local point pattern features through regression
modeling.
Figure 13.54
There are two main approaches for finding local point pattern features. They are based on the nearest‐
neighbor distances and pair correlation function statistics.
The nearest‐neighbor distance distribution function is the cumulative distribution function of the interevent
distances
,
where is the distance from point i to the nearest other point and I(expression) is the indicator function
which is equal to 1 if the expression is true and equal to 0 otherwise. If interactions between points exist, we
expect an excess number of short nearest‐neighbor distances compared to the completely random point
pattern for which is known:
,
where is the intensity of the Poisson point process.
Figure 13.55 at left shows two simulated point processes in the unit square: Poisson (green) and cluster
(pink) and the same figure at right shows a mixture of the two processes colored in gray.
Figure 13.55
The nearest‐neighbor distance functions of the two processes, their mixture, and a completely random
process are shown in figure 13.56 at left. If all we have is a point pattern in figure 13.55 at right, we can view
the data as a mixture of two components arising from patterns with different densities and consequently
different nearest‐neighbor distance functions.
If a point is far from other points, then it tends to have a large nearest‐neighbor distance, and it can be
considered an outlier. However, if a cluster contains a moderate number of points, it may be a feature of the
data. Figure 13.56 at right shows the result of mixture process declustering based on the local nearest‐
neighbor distance estimator: according to the method, red and blue points belong to two different point
processes.
Figure 13.56
Figure 13.57
Figure 13.58 at left shows houses (blue), which were sold in a portion of Nashua city, New Hampshire, during
a six‐month period. The same figure at right shows houses (red) in areas with more dense sales. The
remaining points are classified as noise. The resulting sale clusters may indicate rapidly appreciating
neighborhoods where houses are being easily “flipped” or locations with problems that force the owners to
sell shortly after moving in.
Figure 13.58
Courtesy of the City of Nashua, N.H.
Courtesy of the City of Nashua, N.H.
Figure 13.59
The local nearest‐neighbor distance estimator is a useful statistic summarizing the inter‐distance aspect of
the clustering of points. Another approach is to use a local version of the pair correlation function, which is a
rate of change of the K function , which in turn is a measure of clustering in a point pattern.
The K function can be used as well, but statistical literature suggests that the pair correlation function is more
sensitive to different patterns and unusual points.
References to the papers on the local indicators of spatial association models for point patterns can be found
in the “Further reading” section.
1) MODELING THE DISTRIBUTION OF EARLY MEDIEVAL GRAVESITES
Figure 13.60 shows the locations of gravesites near Neresheim, Baden-Württemberg, Germany. One of the
research questions is whether the spatial pattern of the graves with affected teeth (missing or reduced
wisdom teeth) differs from the pattern of the graves with nonaffected teeth.
Figure 13.60
The analysis of these data can be found on pages 134–136, 141–146, 167–171, 173–174, 177–181, and 183 of
the Waller and Gotway book (see the reference in “Further reading”). Repeat the analysis with point pattern
analysis software that you can run (several packages are described in chapter 16 and appendix 2). Data for
the case study are in the folder assignment 13.1.
2) GETTING TO KNOW GIBBS PROCESSES
Most of the point processes discussed in this chapter describe points that are either independent of one another or conditionally independent given the intensity function. In both cases, points do not interact. Gibbs point processes describe a more realistic situation in which interaction (attraction or inhibition) between points exists. Gibbs processes originated in statistical physics; in spatial statistics, they are often used in forestry for modeling short-range interactions between trees.
In a bounded area A, a pairwise-interaction Gibbs point process is defined through the density function
f(s1, ..., sn) ∝ exp( −Σi<j φ(rij) )
for a point pattern of n points si, where n is known, rij is the distance between the points si and sj, and φ(r) is the pair potential function of that distance. φ(r) is interpreted in terms of interaction: positive values of φ(r) describe inhibition, and negative values of φ(r) describe attraction between points. φ(rij) = 0 if and only if the two points si and sj are not interacting. Gibbs point processes are finite since they are defined in a bounded area. This is often an advantage because geographical data are almost always bounded.
The simplest Gibbs process is the Strauss process, with pair potential function
φ(r) = β ≥ 0 for r ≤ D and φ(r) = 0 for r > D,
for which points separated by distances smaller than the threshold distance D are interacting. More realistic
processes are those that allow varying the strength of interaction with varying distance between points.
The theory of Gibbs processes is complicated (see reference 12 in “Further reading”). However, software for
simulating Gibbs processes and fitting Gibbs models to the observed point patterns is available: freeware R
package spatstat.
Learn about Gibbs processes through examples provided with the spatstat software package. Try the
following interaction functions supported by spatstat: Poisson, Strauss, StraussHard, MultiStrauss,
MultiStraussHard, Softcore, DiggleGratton, Pairwise, PairPiece, AreaInter, Geyer, BadGey, SatPiece,
LennardJones, Saturated, OrdThresh, and Ord.
Fit one or several Gibbs processes to the data from assignment 1, and interpret the results of the model
fitting.
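A minimal R sketch of fitting one Gibbs model with spatstat's ppm function. The pattern is simulated and the interaction radius 0.05 is an arbitrary illustrative value; for the gravesite data, the coordinates would first be read into a ppp object and several radii and interaction functions compared.

# Sketch: fitting a stationary Strauss (pairwise-interaction Gibbs) model.
library(spatstat)

set.seed(4)
X <- rpoispp(lambda = 100)                # stand-in for the observed point pattern

fit <- ppm(X ~ 1, Strauss(r = 0.05))      # Strauss model with interaction radius 0.05
summary(fit)                              # fitted intensity and interaction parameter
exp(coef(fit))                            # back-transformed coefficients; the Strauss
                                          # gamma < 1 indicates inhibition between points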
Additional assignments on point pattern analysis can be found at the end of chapter 16.
FURTHER READING
1. Diggle, P. J. 2003. Statistical Analysis of Spatial Point Patterns. Second Edition. London: Arnold, 160.
From a mathematical point of view, modern point pattern analysis is often more complicated than
geostatistics and regional data analysis. Peter Diggle is one of the leaders in this field, and his book on central
concepts of spatial point processes is accessible to researchers without a strong background in mathematics.
2. Okabe, A., and I. Yamada. 2001. “The K Function Method on a Network and Its Computational
Implementation.” Geographical Analysis 33(3):271–290.
This paper discusses the network K and cross K function for analyzing the distribution of points on a network.
Point pattern analysis in the section “The K Functions on a Network” is performed using SANET, a toolbox for spatial analysis on a network, developed under the leadership of Atsuyuki Okabe, University of Tokyo.
3. Walder, O., and D. Stoyan. 1996. “On Variograms in Point Process Statistics.” Biometrical Journal 38:395–
405.
This paper discusses the difference between semivariogram in geostatistics and mark semivariogram in point
pattern analysis, using simulated and real data.
4. Waller, L. A., and C. A. Gotway. 2004. Applied Spatial Statistics for Public Health Data. New York: John Wiley
& Sons, 494.
Chapters 5 and 6 of this book discuss point pattern analysis for public health data.
5. Baddeley, A., and R. Turner. 2005. “Spatstat: An R Package for Analyzing Spatial Point Patterns.” Journal of Statistical Software 12(6):1–42.
Many figures in this chapter were produced using the spatstat package.
6. Baddeley A., R. Turner, J. Møller, and M. Hazelton. 2005. “Residual Analysis for Spatial Point Processes (with
discussion).” Journal of the Royal Statistical Society, Series B 67(5):617–666.
This paper discusses the theory of residual methods for spatial point processes by comparing them with
residual methods for classical regression analysis. Several examples of model diagnostics are presented. A
discussion of the paper by leading statisticians is very instructive and should not be overlooked.
Papers 7–9 discuss and compare local indicators of spatial association.
7. Byers, S., and A. E. Raftery. 1998. “Nearest‐Neighbor Clutter Removal for Estimating Features in Spatial
Point Processes.” Journal of the American Statistical Association 93:577–584.
8. Cressie, N., and L. B. Collins. 2001. “Patterns in Spatial Point Locations: Local Indicators of Spatial
Association in a Minefield with Clutter.” Naval Research Logistics, 48:333–347.
9. Mateu, J., G. Lorenzo, and E. Porcu. 2007. “Detecting Features in Spatial Point Processes with Clutter via Local Indicators of Spatial Association.” Journal of Computational and Graphical Statistics 16(4):968–990.
This book discusses various models and tools for disease mapping and risk assessment and presents a range
of case‐studies on the applications of the statistical methods to particular problems. In particular, a
comprehensive review of spatial cluster detection methods is provided.
11. Demattei C., N. Molinari, and J. P. Daures. 2006. “Arbitrarily Shaped Multiple Spatial Cluster Detection for
Case Event Data.” Computational Statistics and Data Analysis 51:3931‐3945.
The authors of this paper proposed a method for detecting spatial clusters based on the weighted distance
between the nearest neighbors.
12. Van Lieshout, M. N. M. 2000. Markov Point Processes and their Applications. Imperial College Press.
The author studies Gibbs point processes (in Van Lieshout’s book, they are called Markov point processes),
showing that they form a flexible class of models for a range of problems involving the interpretation of
spatial point data. The mathematics in this book is advanced.
CHAPTER 14: GEOSTATISTICS FOR EXPLORATORY SPATIAL DATA
ANALYSIS
CHAPTER 15: USING COMMERCIAL STATISTICAL SOFTWARE FOR SPATIAL
DATA ANALYSIS
CHAPTER 16: USING FREEWARE R STATISTICAL PACKAGES FOR SPATIAL
DATA ANALYSIS
APPENDIX 1: USING ARCGIS GEOSTATISTICAL ANALYST 9.2
APPENDIX 2: USING R AS A COMPANION TO ARCGIS
APPENDIX 3: INTRODUCTION TO BAYESIAN MODELING USING WINBUGS
APPENDIX 4: INTRODUCTION TO SPATIAL REGRESSION MODELING
USING SAS
GEOSTATISTICS FOR EXPLORATORY
SPATIAL DATA ANALYSIS
DATA VISUALIZATION
EXPLORATION OF OZONE DATA CLUSTERING, DEPENDENCE, DISTRIBUTION,
VARIABILITY, STATIONARITY, AND FINDING POSSIBLE DATA OUTLIERS
ANALYSIS OF SPATIALLY CORRELATED HEAVY METAL DEPOSITION IN
AUSTRIAN MOSS
ZONING THE TERRITORY OF BELARUS CONTAMINATED BY RADIONUCLIDES
AVERAGING AIR QUALITY DATA IN TIME AND SPACE
ANALYSIS OF NONSTATIONARY DATA FROM A FARM FIELD IN ILLINOIS
SPATIAL DISTRIBUTION OF THYROID CANCER IN CHILDREN IN POST
CHERNOBYL BELARUS
ASSIGNMENTS
1) EXPLORE THE ARSENIC GROUNDWATER CONTAMINATION IN BANGLADESH IN
1998
2) AVERAGE THE PARTICULATE MATTER DATA COLLECTED IN THE UNITED STATES
IN JUNE 2002 IN TIME AND SPACE
3) EXPLORE THE ANNUAL PRECIPITATION DISTRIBUTION IN SOUTH AFRICA
FURTHER READING
Exploratory spatial data analysis consists of a variety of techniques for discovering facts about data.
Each technique clarifies only some features of the data, and it is better to use many tools rather than
just one to summarize data features.
Display data with all possible covariates and available geographical layers.
Review basic statistical properties of the data.
Identify the spatial data type.
Explore the sampling plan.
Look into the distribution of the data and possible data transformations to a known theoretical
distribution.
Look for unusual data (outliers).
Separate small‐ and large‐scale data variation.
Investigate the correlation between data values and maybe between data locations as a function of
distance between observations.
Expose the dependence and possible causal relationships between the variables under study.
Exploratory data analysis is usually used at the beginning of statistical model formulation. However, it can also be used for finding aspects of the data not explained by the model. In this case, exploratory data analysis can
reveal modeling problems that were ignored or overlooked at the model formulation stage. An example of the
model output exploration can be found in the section “Residual analysis for spatial point processes” in
chapter 13.
It is not always possible to find the best model for the available data. In this case, exploratory data analysis is
especially helpful because it may suggest using several different models in an attempt to explain the observed
data and predict new data.
Various data exploration tools are discussed in previous chapters and in the appendixes. In this chapter,
several typical data exploration scenarios are presented in the context of exploring geographical data from
several applications, mostly using Geostatistical Analyst. In chapters 15 and 16, other geographical data will
be explored using commercial and freeware software packages. References to the books that discuss other
data exploration tools and other software packages can be found in the “Further reading” section.
DATA VISUALIZATION
Visualization is the starting point in spatial data exploration. It can be done in many ways, from simple
schematics to rich 3D pictures showing the data in every detail, depending on what visualization tools are
available. Figure 14.1 displays ozone data (left) and an interpolated surface (right) using these ozone values
in some statistical software environment. Graphs such as these can be found in many journal papers written
by statisticians. However, they are not very helpful in understanding spatial variations of ozone concentration
and investigating hypotheses about the causes of such variation.
Figure 14.1
Figure 14.2 shows a map of ozone measurements for Southern California in parts per million (some of these
observations are also displayed in figure 14.1 at left). Pollutants in the lower layer of air from the Los Angeles
basin flow eastward when it gets warm and often reach the Palm Springs and Joshua Tree National Park
areas. Joshua Tree National Park in the mountainous desert has reported airborne pollutants, including
nitrogen, affecting soil and plants there.
To obtain a better understanding of how pollution might move, the same map displays topography. One
of the major sources of air pollution is exhaust from motor vehicles, so major road networks are also
displayed on the map.
Figure 14.2
Good data visualization facilitates choosing the appropriate model for air quality. For example, figure 14.2
reveals a large‐scale, east‐west trend in the ozone values and the barrier to ozone movement that the
mountains provide. Larger values of ozone tend to be close to the mountains in the east, and the ozone
concentration declines toward the coast.
EXPLORATION OF OZONE DATA CLUSTERING, DEPENDENCE, DISTRIBUTION,
VARIABILITY, STATIONARITY, AND FINDING POSSIBLE DATA OUTLIERS
In this section, we will go through various spatial data exploration tools using the maximum one‐hour ozone
concentration on June 22, 1998, and the maximum monthly concentration of nitrogen dioxide in June 1998 in
California. Units for both pollutants are parts per million (ppm).
Figure 14.3 shows ozone measurements over elevation. Red and yellow circles show ozone concentrations
larger than the critical threshold 0.08 ppm (the upper permissible level for one‐hour ozone concentration),
while the blue circles show relatively low ozone values.
Data transformation to the normal distribution allows one to use models that require data normality. The closeness of the data to normality can be further verified using a quantile-quantile plot, shown at top right. Normally distributed data will lie along the solid line. After a Box-Cox transformation with a power value of 0.5, the data lie closer to the straight line, especially the most critical high-contamination values, bottom right.
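A small R sketch of the Box-Cox power transformation used above; the skewed values are simulated here only for illustration.

# Sketch: Box-Cox power transformation with power lambda (0.5 in the text).
box_cox <- function(z, lambda) {
  if (lambda == 0) log(z) else (z^lambda - 1) / lambda
}

set.seed(6)
z <- rlnorm(500, meanlog = -3, sdlog = 0.5)   # skewed positive values (e.g., ppm-scale data)
z_t <- box_cox(z, lambda = 0.5)
qqnorm(z_t); qqline(z_t)                      # transformed values lie closer to the straight line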
Because sources of air pollution and meteorological conditions can be different in different parts of the state,
data distribution can be different as well. The selection in blue in the ozone histogram highlights the data in
the eastern part of California. The data distribution is different from the distribution of the entire dataset.
This is evidence of data nonstationarity (see also the section “Gaussian processes” in chapter 4).
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.3
Figure 14.4
Figure 14.5 shows modeling of the ozone concentration histogram (red lines) using a combination of two
Gaussian distributions (left) and six Gaussian distributions (right) with different means (“Mu”) and standard
deviations (“Sigma”). The weight of each normal distribution component is shown in the right part of the
dialog box (“P”). Parameters of the Gaussian distributions can be saved and then used for further data
analysis (see the section “Modeling data distribution as a mixture of Gaussian distributions” in chapter 4).
Figure 14.5
The mean data value is used as representative of the contamination in the specified area and as input to
modeling. For example, simple and disjunctive kriging require data mean specification.
When data are uniformly sampled, the data mean is obtained by summation assuming that all measurements have the same weight, equal to 1/N, where N is the number of measurements. The ozone concentration data were preferentially sampled, with denser sampling in the highly populated areas near the ocean, where the majority of the low readings were taken. To estimate the mean value of ozone concentration, we will use
the cell declustering dialog box of the Geostatistical Wizard shown in figure 14.6 at left.
In the cell declustering dialog box, rectangular cells are arranged over the data locations in a grid, and the
weight attached to each datum in the cell is inversely proportional to the number of data points in its cell.
Cell shapes may vary, and an optimization technique for optimal cell size and orientation is available if
clustering occurs for large or small values but not when clusters with both small and large data values are
present. The cell size that corresponds to the maximum weighted mean should be selected if the data have been preferentially sampled in areas of low values, as in the case of the ozone measurements shown in figure
14.6 at left. For ozone (the blue points in this figure at left), the mean value estimated by declustering is
0.0687 parts per million, while the arithmetical average is 0.0664.
The cell size that corresponds to the minimum weighted mean should be selected if the data have been
preferentially sampled in areas of high values, as in figure 14.6 at right where clustered data inside the pink
contours have the largest values in the dataset. These particular data are the specially designed tutorial data
from the geostatistical software library (GSLIB) freeware software (see also figure 9.21 in chapter 9).
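The declustered mean can also be reproduced outside Geostatistical Analyst. The base-R sketch below assumes vectors x, y, and z with the sample coordinates and values and uses one fixed square cell size; Geostatistical Analyst additionally searches over cell sizes and orientations.

# Sketch: cell declustering with a fixed square cell size.
decluster_mean <- function(x, y, z, cell_size) {
  ix <- floor((x - min(x)) / cell_size)     # column index of the cell containing each sample
  iy <- floor((y - min(y)) / cell_size)     # row index
  cell <- paste(ix, iy)                     # cell identifier
  n_in_cell <- ave(z, cell, FUN = length)   # number of samples sharing each cell
  w <- 1 / n_in_cell                        # weight inversely proportional to the cell count
  sum(w * z) / sum(w)                       # declustered (weighted) mean
}

# Illustration with preferential sampling in a low-value corner of the unit square:
set.seed(5)
x <- c(runif(100), runif(100, 0, 0.2))
y <- c(runif(100), runif(100, 0, 0.2))
z <- x + y + rnorm(200, sd = 0.05)          # low values where sampling is dense
mean(z)                                     # arithmetic mean, pulled toward the low values
decluster_mean(x, y, z, cell_size = 0.2)    # declustered mean, closer to the true mean of 1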
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.6
On the map of ozone data (figure 14.2), most of the low values of ozone concentration occur near the ocean in
the western part of California while the high values tend to be in the east. This spatial trend can be further
analyzed using a 3D perspective of the data in figure 14.7 at left (with the Geostatistical Analyst Trend
Analysis tool).
Locations of sample points are plotted on the XY plane. The value of each sample point is depicted by the
height of a gray stick in the Z dimension. Then the values are projected onto the XZ plane (light green) and the
YZ plane (light blue) as scatter plots.
When a point is selected (red stick), pink lines perpendicular to the XZ and YZ planes show projections of the
selected points on the planes.
Blue and green polynomial lines are then fitted through the scatter plots on the projected planes. In this
example, a third order polynomial is chosen. The green line shows that data values increase on average from
the west and from the east to the center, resulting in the highest data values in the center of the state, while
the parabolic blue line shows that the ozone values are smaller in the north and the south than in the center.
Figure 14.7
A single mean value in most cases does not provide sufficient information about data. Data values can slowly
fall or rise as one moves from one measurement location to another along a fairly long line. Large‐scale
variation in the data is called trend. Data values in neighboring locations usually vary, and these variations
can oscillate around a particular value at short distances, representing small‐scale data variation. Both types
of variations in data often happen simultaneously.
To account for the trend in data, a mean surface is used instead of a single mean value. The mean value now
depends on location and is estimated at every location using samples from a limited number of nearest
neighbors (local interpolation) or from the entire data set (global interpolation).
Figure 14.8 shows the ozone concentration estimated by the first-order local polynomial model
m(x, y) = β0 + β1·x + β2·y,
with coefficients that vary among four different searching neighborhoods.
Figure 14.8
Kriging requires the data to be stationary; that is, both the data mean and the data variance are constant at
any location in the data domain. The constant mean requirement can be fulfilled by subtracting the mean
surface values from the data in each location. The result of this subtraction, called residuals, can be used as
input data to kriging with a mean value equaling zero.
The next step of data exploration is to verify that the data variance is constant. The usual approach to
analyzing data variance is mapping of the local Moran’s I (see the section on cluster detection methods in
chapter 11). Figure 14.9 at left shows a Voronoi map constructed around ozone samples. The Voronoi
polygons are colored according to the calculated local Moran’s I using the nine closest measurements. The
critical threshold of data similarity is their mean value, so if neighboring data are on the same side of the
distribution in relation to the arithmetical mean value, the Moran’s I is positive, and if they are on different
sides, Moran’s I is negative.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.9
Like other exploratory data analysis tools, local Moran’s I is a qualitative statistical method, so other tools
should be used to confirm or reject conclusions made based on the local Moran’s I surface. Possible
alternatives include the locally weighted standard deviation (see the section “Spatial smoothing” in chapter
11) and local entropy.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.10
Using the standard deviation has several shortcomings: it is scale dependent (a unit-free measure of uncertainty is better for comparing different variables), it is sensitive to unusually large values (outliers), and the combined standard deviation is not equal to the sum of its components. Another widely used measure of uncertainty, entropy, does not have some of these shortcomings.
Suppose we define four intervals for all ozone values using the quantile classification, as shown in figure
14.11 at right. Then the Voronoi polygons are constructed (figure 14.11 at left) as discussed in chapter 13,
and neighbors for each polygon are defined using the common border rule. The numbers of ozone values in
the neighborhood (shown in figure 14.11 at right) that fall into each of the four intervals are printed in red.
For other local neighborhoods (not shown here) these numbers can be different, as shown by the green
numbers in figure 14.11 at bottom right.
The proportion
Pi = (number of values in the interval i)/(number of polygons in the neighborhood)
of the values that fall into each interval to the total number of values in the local neighborhood is calculated.
These frequencies can be similar (figure 14.11 at top right), or one frequency can dominate (figure 14.11 at bottom right), as in
P1 = 0/9, P2 = 0/9, P3 = 0/9, P4 = 9/9.
The entropy value of each Voronoi polygon is calculated as
E = −Σi Pi·log2(Pi), i = 1, ..., M,
where M is the number of intervals (classes in the histogram in figure 14.11). In our examples, the entropy is close to its maximum for the red numbers, whose frequencies are similar, and equal to zero for the green numbers, for which a single interval dominates.
Minimum entropy occurs when all values in the local neighborhood fall into the same interval:
P1 = P2 = P3 = 0; P4 = 1,
Emin = −1·log2(1) = 0.
Minimum entropy corresponds to the lowest uncertainty of the data in the neighborhood.
Maximum entropy occurs when the frequency is the same for each class interval:
P1 = P2 = P3 = P4 = P = 1/4,
Emax = −4·P·log2(P) = log2(4) = 2.
Maximum entropy corresponds to the highest uncertainty of the data in the neighborhood.
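A small base-R sketch of the local entropy calculation; the two count vectors correspond to a neighborhood with similar frequencies and to the neighborhood dominated by a single interval, as in the examples above.

# Sketch: local entropy of a neighborhood from the counts of values in M intervals.
local_entropy <- function(counts) {
  p <- counts / sum(counts)      # proportions P_i
  p <- p[p > 0]                  # treat 0*log2(0) as 0
  -sum(p * log2(p))
}

local_entropy(c(2, 3, 2, 2))     # similar frequencies: close to the maximum log2(4) = 2
local_entropy(c(0, 0, 0, 9))     # one interval dominates: minimum entropy 0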
Figure 14.10 at right shows smoothed local entropy using the same neighbors and weights as in the cases of
local Moran’s I (figure 14.10 at left), and weighted local standard deviation (figure 14.9 at right).
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.11
In previous chapters we discussed and used a semivariogram, which is defined as half the average of the squared difference between pairs of data values separated by the specified distance and direction. Each component of the average, half the squared difference between the data values Zi and Zj at the locations i and j,
(Zi − Zj)² / 2,
can be used for finding unusually large and unusually small values in the local neighborhood.
When dots are selected on the semivariogram cloud (shown in black), the pairs that these dots represent are
connected on the map.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.12
To prepare figure 14.13, one value in the ozone dataset was changed from 0.072 to 0.72 to imitate a typing
error. This value is different from any other value in the dataset and is called a global outlier.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.13
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.14
Potential erroneous ozone measurements should be verified and removed from the dataset if they are
incorrect. Figure 14.15 illustrates what happens when that outlier remains in the input data for kriging
interpolation. The Searching Neighborhood dialog box to the right shows the nine closest measurements
(colored circles) to the prediction location (black cross). Correspondence between points in the Searching
Neighborhood dialog box and points on the map is shown by blue and pink lines. Weights assigned to the
neighbors by kriging are displayed in the right part of the dialog box. The weight of the outlier in light blue is
highlighted. The prediction at the location marked by a cross equals 0.065 with the outlier and 0.056 when the
outlier value of 0.125 was replaced by 0.04, which is a typical ozone concentration for coastal areas.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.15
The directional change in semivariogram values can be further analyzed in Geostatistical Wizard. In figure
14.16, averaged semivariogram values (red points) are shown in two perpendicular directions, with the
direction angle highlighted by pink ellipses. The semivariogram values clearly depend on direction, which
should be addressed by the model.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.16
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.17
If samples of several variables are available and these variables are spatially correlated, using those
additional variables in the model improves predictions of each of the variables.
Figure 14.18 (center) shows locations of ozone (green) and nitrogen dioxide (pink) samples. Some
measurements of ozone and nitrogen dioxide were collected at the same locations and others at different
locations.
The cross covariance between variables Z and Y at the locations si and sj is defined as a product of the differences between the two observed values and their corresponding mean values:
c(si, sj) = (Z(si) − mZ)·(Y(sj) − mY),
where mZ and mY are the mean values of Z and Y.
The cross‐covariance cloud for ozone and nitrogen dioxide is presented in figure 14.18 at right. Most of the
cross‐covariance values are close to zero, but there are unusually large positive and negative values. Samples
of ozone and nitrogen dioxide corresponding to the highlighted large negative cross‐covariance values at
relatively short distances between data locations are connected on the map. In each of those connected pairs, one value is a large value of one variable and the other is a small value of the other variable.
The general quantile‐quantile plot is used to assess the similarity of the distributions of two datasets. For two
identical distributions, the general quantile‐quantile plot is a straight line. The general quantile‐quantile plot
in figure 14.18 at left shows that the highlighted data (pink and blue points) are from the tails of the ozone
and nitrogen dioxide data distributions. Highlighted ozone and nitrogen dioxide samples are also selected in
the table associated with the data and can be further analyzed for possible errors, because any unusually
large value of one variable surrounded by a small value of another variable could be erroneous.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.18
The linear correlation coefficient r measures the strength and the direction of a linear relationship between two variables Z and Y. The value of r falls between −1 and +1. If Z values increase when Y values increase, r is positive. Negative values of r indicate that as values of Z increase, the values of Y decrease (or vice versa). A correlation coefficient greater than 0.7 usually indicates a strong correlation between variables, and a coefficient of less than 0.4 points to a weak correlation. If the linear correlation between the two variables is negligible, r is close to zero. However, a nearly zero correlation coefficient may also indicate a nonlinear relationship between the variables.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.19
The estimated linear correlation between the ozone and nitrogen dioxide data is close to zero. According to this estimate, the linear correlation between ozone and nitrogen dioxide is negligible. This is surprising because it is known that these two air pollutants are related to one another. However, the estimated linear correlation reflects the average relationship between ozone and nitrogen dioxide, and it is possible that there are areas with positive and negative correlations between the two variables that compensate for each other on average.
Local cross correlation between ozone and nitrogen dioxide can be explored using the standardized cross-Moran's index of spatial data association, calculated for an observation i as
Ii = ((Zi − mZ)/σZ) · ((Ȳi − mY)/σY),
where Ȳi is the average of the Y values in the neighborhood around the location i, mZ and mY are the mean values of Z and Y, σZ is the standard deviation calculated using the entire data Z, and σY is the standard deviation calculated using the entire data Y.
The interpretation of the cross‐Moran’s I is the same as the interpretation of the correlation coefficient r: if
the cross‐Moran’s I is large, there is a positive correlation between pollutants, while large negative values
indicate negative spatial correlation. A cross-Moran's I close to zero indicates areas where the correlation between the two variables is weak. Note that the cross-Moran's I can be greater than +1 and less than −1.
Since cross‐Moran’s I should be calculated at the same locations for both variables (and only part of the ozone
and nitrogen dioxide sample locations coincide), the nitrogen dioxide values were predicted to the ozone
locations using kriging.
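A base-R sketch of the cross-Moran calculation under simplifying assumptions: the neighborhood of each location is taken as its k nearest locations with equal weights, and the vectors x, y (coordinates), z (ozone), and yvar (nitrogen dioxide predicted at the ozone locations) are assumed to exist.

# Sketch: standardized cross-Moran's I at every location, with the k nearest
# locations used as the neighborhood.
cross_moran <- function(x, y, z, yvar, k = 9) {
  zs <- (z - mean(z)) / sd(z)             # Z standardized by the global mean and sd
  ys <- (yvar - mean(yvar)) / sd(yvar)    # Y standardized the same way
  d  <- as.matrix(dist(cbind(x, y)))      # pairwise distances between locations
  sapply(seq_along(z), function(i) {
    nb <- order(d[i, ])[2:(k + 1)]        # indices of the k nearest neighbors of location i
    zs[i] * mean(ys[nb])                  # cross-Moran's I for observation i
  })
}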
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.20
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.21
The linear cross-correlation coefficient r for this subset is large and positive. This calculation shows that there is a strong positive correlation between ozone and nitrogen dioxide in the northern part of the state.
Since both semivariogram and cross covariance models above are estimated with uncertainty, the cross‐
correlation coefficient is also uncertain. The distribution of the cross‐correlation coefficient can be calculated
using Monte Carlo simulation as shown in chapter 5.
ANALYSIS OF SPATIALLY CORRELATED HEAVY METAL DEPOSITION IN AUSTRIAN
MOSS
In 1995, a study of heavy metal deposition in mosses was conducted in 29 European countries. Heavy
metals deposited from the atmosphere tend to be retained by mosses, since mosses obtain most of their
nutrients directly from precipitation and dry deposition. Each sampling site was located at least 300 meters
from main roads and populated places and at least 100 meters from roads or houses of any kind.
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 14.22
Among the objectives of the moss survey were identification of the main polluted areas and production of regional maps of heavy metal deposition. This was done by mapping the arithmetical mean of the measured concentrations for each metal on a 50- by 50-kilometer grid. Figure 14.23 shows arithmetical averaging of the cadmium data using the 50- by 50-km grid.
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 14.23
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 14.24
Figure 14.25 shows averaging of the cadmium values in twenty-six Voronoi polygons created around regularly spaced points over the map of Austria, using arithmetical and statistical averaging. In this case, conditional geostatistical simulation (see chapter 10) was used to create the map at right. We see that the estimated mean
cadmium values in the polygons are different. The spatial dependence of the averaged data shown in the
semivariogram clouds and semivariogram surfaces is clearly different as well.
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 14.25
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 14.26
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 14.27
Even if the measurement of molybdenum at that location is correct, it is advisable to estimate a
semivariogram model without an outlier because a single unusually large value makes the spatial correlation
structure (which is clearly seen in the dialog box in figure 14.28 at right) unrecognizable. Then the outlier can
be put back in the dataset for predictions (see the section “Predictions using ordinary kriging” in appendix 1).
It is very unlikely that the arithmetical averaging of sulfur, lead, molybdenum, and vanadium can be used to
produce reliable maps.
Copyright © 1995, Umweltbundesamt, GmbH.
Figure 14.28
Predicting or averaging of one heavy metal concentration can be improved using data for another heavy
metal (that is, using cokriging instead of kriging). It is appropriate to combine heavy metals that have the same
sources of origin. Some of the sources of heavy metal contamination are shown in table 14.1 below.
Table 14.1
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 14.29
ZONING THE TERRITORY OF BELARUS CONTAMINATED BY RADIONUCLIDES
After the Chernobyl accident in 1986 in the former Soviet Union, many people were forced to live in areas
contaminated with radionuclides. Many attempts were made to build meteorological models that would help
calculate the radioactive deposition at any location. However, because the information on the atmospheric
release of radioactive material from the nuclear reactor during the accident was fragmentary and
meteorological conditions were complicated, forecasts of the radioactive fallout had little in common with the actual spatial distribution of radionuclides even at distances of just several dozen miles from Chernobyl.
The only practical way of assessing the risk of living there is through the measurements of radionuclide
concentrations in the soil, food, and human body.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 14.30
Zoning of the territory is done by assessing threshold levels of soil contamination. Table 14.2 lists the
threshold values of Cs‐137 soil contamination and the appropriate actions to be taken according to
regulations.
Zone name Cs-137, Ci/km2
Zone with periodic control 1–5
Zone with rights for relocation 5–15
Relocation required in the future 15–40
Relocation of people required > 40
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus
Table 14.2
However, point measurements are not available for every village and town, and the measurements that are
available contain locational, averaging, and data integration errors, so that the value associated with a particular place is known only to within ±(10–30) percent of its measured value (see the example of Cs-
137 soil contamination zoning in the beginning of chapter 1).
Science fiction often portrays what might happen after a nuclear attack. Often the characters try to survive in
the forests. In reality, the forest is by far not the best place to live after radioactive fallout. The forest absorbs,
concentrates, and recycles radionuclides, which soon can be found in every wild product, including
mushrooms, berries, game, and timber.
One component of the uncertainty of the food contamination data is the uncertainty of the data locations. This
uncertainty results from food samples being brought to measuring stations from villages as far away as tens
of kilometers. The consequences of imprecise coordinates for data analysis can be investigated by comparing the statistics of the data bound to their original coordinates with the statistics of the same data after the sample coordinates have been randomly shifted, reflecting that the food samples could have been collected anywhere within the specified distance from the measuring stations.
If there are fewer than 1,000 people in a settlement, the radius of the circle around its centroid in
which the new location will be randomly simulated is 1,000 meters.
If the population is more than 1,000 and less than 2,000, then the radius of the shift equals
2⋅population.
If there are more than 2,000 people in the populated place, then the radius of the shift equals 5,000
meters.
Examples of the calculated radius of the circle can be found in the table in figure 14.31.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 14.31
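A base-R sketch of the location-shift simulation described by the rules above; the radius rule follows the list given in the text, the shifted position is drawn uniformly within the circle, and the centroid coordinates below are purely illustrative.

# Sketch: simulate an alternative sample location within the population-dependent
# radius around a settlement centroid (planar coordinates in meters assumed).
shift_radius <- function(population) {
  if (population < 1000) 1000                  # fewer than 1,000 people: 1,000-meter radius
  else if (population < 2000) 2 * population   # between 1,000 and 2,000: 2 * population
  else 5000                                    # more than 2,000 people: 5,000 meters
}

shift_location <- function(x, y, population) {
  R <- shift_radius(population)
  r <- R * sqrt(runif(1))                      # sqrt makes the point uniform over the circle area
  a <- runif(1, 0, 2 * pi)
  c(x + r * cos(a), y + r * sin(a))
}

shift_radius(1500)                             # returns 3000 (meters)
shift_location(535000, 5870000, 1500)          # one simulated alternative location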
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 14.32
Another component of uncertainty in the food contamination data is the measurement error of the counts of
radioactive decays per unit time. The uncertainty of such a measurement can be estimated as the square root of the number of counts.
Figure 14.33
Because both data values and data coordinates contain errors, and because spatial distribution of
radionuclides is too complicated for modeling using a statistical model with a small number of estimated
parameters, Monte Carlo simulation algorithms may be preferable, since they allow incorporating uncertainty
in both input data and model parameters.
AVERAGING AIR QUALITY DATA IN TIME AND SPACE
Southern California experiences some of the worst air quality in the United States. Atmospheric mixing is low
here, so pollutants concentrate in the air. In addition, long hours of sunshine facilitate photochemical
production of ozone. Federal ozone standards are often exceeded from spring through early autumn. But the
late autumn and winter are not much better with high carbon monoxide and particulate matter levels.
The Geostatistical Analyst kriging functionality allows exploring both temporal and spatial characteristics of
the air quality. Although kriging is commonly used for spatial data analysis, it can be used for time series data
analysis as well. Andrey Kolmogorov developed the theory of optimal interpolation (later called “simple
kriging”) in one dimension in 1941, some 18 years before Lev Gandin generalized Kolmogorov’s theory for
two‐dimensional data, and kriging predictions in one dimension are used in various applications (see an
example at the beginning of chapter 10).
Figure 14.34
To make daily pollution measurements in one particular place two‐dimensional, a dataset is placed as points
on a plane with temporal axis X and with the same arbitrary constant pseudo‐coordinate Y1. The daily
measurements of air pollution fall into a straight line (see figure 14.35), with a value of pollution associated
with each point. Because contour calculation on the line is impossible, Geostatistical Analyst will not work
with such data until an additional datum with the coordinate (X, Y1 + ΔY) is added to the dataset, where X is
any value between minimum and maximum x‐coordinates and ΔY is greater than the difference between
maximum and minimum x‐coordinates, ΔY > (Xmax – Xmin). After the pseudo datum is added (its value can be
equal to the data mean), kriging modeling continues as usual. The pseudo datum is placed far enough from
the measured data on the line and does not influence the kriging predictions.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.35
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.36
Data averaging using a six‐day lag interval for one‐hour maximum ozone concentration in San Francisco is
presented in figure 14.37. The Geostatistical Analyst validation option Save Validation was used to save the
predictions of ozone for each day in 1999 to a .dbf file. Microsoft Excel was used to create the graph in figure
14.37.
The semivariogram model used for predictions is in the top left corner. It changes when averaging with
different time intervals; for example, when a lag value of three or eight days is used. The red line shows the
filtered ordinary kriging prediction (it was assumed that the nugget parameter consists of the data
measurement error only), and the pink and blue lines are the upper and lower 95 percent confidence
intervals, assuming that the predictions and their standard errors are distributed normally.
Figure 14.37
Although daily measurements of maximum one‐ and eight‐hour concentrations of several air pollutants are
available for analysis and mapping, annual and seasonal maps are typically used to characterize regional air
quality. Such maps could be created in two different ways: by averaging daily measurements over the desired
period of time (week, month, etc.) and mapping the averages, or by using all values collected over time at
each location, thus modeling the replicated measurements.
Ozone measurements in California in June 1999 and their histogram in one of the cities, Alameda, are shown
in figure 14.38 at left. Note that measurements made at the same location at a different time vary
significantly. Figure 14.38 (center) shows a semivariogram modeling dialog box for the monthly average
ozone maximum one‐hour concentration. Alternatively, all daily ozone measurements can be used as input to
ozone concentration modeling. Note that the semivariogram model based on repeated measurements for
each monitoring station, shown in figure 14.38 at right, has a nugget parameter three times larger than the
semivariogram model for the averaged values, meaning that usage of monthly average data underestimates
data uncertainty.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.38
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.39
Figure 14.40 at left shows the cross‐covariance model of maximum one‐hour concentrations of ozone and
nitrogen dioxide in June 1999. The cross‐covariance surface is asymmetric; the highest cross‐covariance for
small distances between pairs of data locations occurs at locations separated by some distance. The visual fit
of the cross‐covariance cloud (red line) suggests that there should be a maximum at a distance around 50
kilometers, shown in pink, but not at the zero distance, as in the model shown as a blue line.
Ozone and nitrogen dioxide maximum values are shifted because ozone is produced as a result of chemical
reactions between pollutants. Ozone reaches maximum concentration several hours after nitrogen dioxide is
emitted. During this time, pollutants shift to the east because typically the wind direction in the summer is
from the ocean.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.40
Figure 14.41 presents two additional examples of the cross‐covariance models with shift parameters:
between carbon monoxide and nitrogen dioxide (left) and between ozone and a grid of distances from major
California roads (right). In both cases, the largest correlation occurs at the nonzero distances between the
locations of pairs of the variables, and it would be interesting to find meaningful explanations for why this is so.
The yellow lines in figure 14.41 at right show the semivariogram models in different directions, while the
figure at left shows a semivariogram model in one particular direction.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.41
Historically, when farmers’ fields were small and uniform, averaging crop and soil data did not pose a
problem. However, mechanization led to fields of increasing size and to increasing variability of the data.
Yield data can be collected using sensors and GPS devices located on the harvesting machine, whereas soil
data can be obtained through sampling. Yield and soil characteristics of the data are displayed as continuous
surfaces through interpolation algorithms. The accuracy of the interpolated maps is essential for deciding, for
example, which fertilizer to use and how much.
The statistical approach to data interpolation, kriging, requires data stationarity, which is used to obtain the
necessary data replication:
The data mean is constant between samples in the searching neighborhood used in the interpolation
model.
The semivariogram is the same between any two points the same distance apart.
Figure 14.42 shows phosphorus samples from a field in Illinois. The stationarity of these data should be
verified before using kriging.
Farm Spatial statistics data courtesy of the Department of Crop Sciences, University of Illinois at Urbana–Champaign.
Figure 14.42
One way to assess spatial variability of the mean and the variance of the data is to use local statistics
calculated using Voronoi polygons. The polygons are created so that every location within each polygon is
closer to the sample point in this particular polygon than to a sample point in any other polygon. Polygons
with common borders are considered neighbors. Then a local mean is computed by averaging the
neighboring samples, with the average value assigned to that polygon. This is repeated for all polygons and
their neighbors, and the polygons are then colored according to their local mean values. The same approach is
used for calculating local standard deviation and other statistics. To ensure data stationarity, the local mean
values and the local standard deviation should vary slowly throughout the map.
Figure 14.43 displays Voronoi maps that show the local mean (left) and the local standard deviation (right)
for 1,749 samples of phosphorus in the soil. Neither the mean nor the standard deviation is constant. Notice
also that areas with a large local mean usually have a large local standard deviation. Therefore, optimal
kriging maps of predictions and prediction standard errors should also be related.
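The local statistics can be approximated with a few lines of base R. The sketch below uses the k nearest samples as a simple stand-in for the Voronoi common-border neighbors and assumes vectors x, y (coordinates) and p (phosphorus values).

# Sketch: local mean and local standard deviation from the k nearest samples.
local_stats <- function(x, y, p, k = 8) {
  d <- as.matrix(dist(cbind(x, y)))           # pairwise distances between sample locations
  t(sapply(seq_along(p), function(i) {
    nb <- order(d[i, ])[1:(k + 1)]            # the sample itself plus its k nearest neighbors
    c(local_mean = mean(p[nb]), local_sd = sd(p[nb]))
  }))
}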
Figure 14.43
Splitting data into subsets and exploring them separately can further clarify the data stationarity issue. Figure
14.44 at left shows six semivariogram surfaces created using subsets with approximately 100 points each in
different parts of the field. Subsets of data and their semivariogram surfaces are shown inside the red frames.
Semivariogram surfaces of the subsets look different, providing additional evidence of the nonstationarity of
the data.
Surprisingly, the semivariogram surface of the entire dataset, shown in figure 14.44 at right, indicates little
difference in the data variability in different directions. This means that variation of the phosphorus content
in one direction in one part of the field is compensated by the data variation in an opposite direction in
another part of the field.
Farm Spatial statistics data courtesy of the Department of Crop Sciences, University of Illinois at Urbana–Champaign.
Figure 14.44
It can be seen that the spatial data structure varies across the field, and the sill and the range values are
correlated: in most cells, the larger the range, the smaller the sill and vice versa.
Farm Spatial statistics data courtesy of the Department of Crop Sciences, University of Illinois at Urbana–Champaign.
Figure 14.45
The phosphorus data exploration suggests transforming the data in order to remove or reduce the mean and
the variance nonstationarity. The mean nonstationarity can be removed by detrending. The variance
nonstationarity sometimes can be alleviated by data transformations, such as power transformation or the
normal score transformation. Figure 14.46 at left shows the prediction standard error map using square‐root
transformation.
Farm Spatial statistics data courtesy of the Department of Crop Sciences, University of Illinois at Urbana–Champaign.
Figure 14.46
If data transformations fail to improve data stationarity, the nonstationary variant of interpolation should be
used on such data.
SPATIAL DISTRIBUTION OF THYROID CANCER IN CHILDREN IN POSTCHERNOBYL
BELARUS
Prior to the Chernobyl accident, the annual number of thyroid cancer cases was about one in the entire Belarus child population. Figure 14.47 shows the number of thyroid cancer cases in children in 117 Belarus districts registered during the seven years after the accident. The black digits on the map correspond to the CASES column in the table. There were 435 cases of thyroid cancer in 1994 in children born after May 1972; that is, children who were under 14 years of age at the time of the accident.
The short‐lived radioisotopes, mainly radioactive iodine I‐131, that enter the body with contaminated air and
food are believed to cause thyroid cancer because radioactive iodine accumulated in the thyroid emits beta
particles, which destroy thyroid cells.
The I‐131 isotope has a half‐life of only eight days, which is why direct measurements of I‐131 in Belarus
were scarce. Assessment of the human exposure to I‐131 is problematic because the intake of milk and
duration of outdoor stays throughout the critical time after the accident cannot be reconstructed with
reasonable accuracy several years after the event.
Figure 14.47
The rate of thyroid cancer per 1,000 children in the table in figure 14.47 was calculated for each district using the formula
rate = 1000 · cases / population.
The standard error of the estimated rate per 1,000 children was calculated assuming a binomial distribution
(see chapter 4).
The bars in figure 14.47 at left show cancer rates in blue and their standard errors in red. These two bars
have similar values for districts with a few cases of cancer, meaning that the uncertainty about the disease
rate is high there.
Confidence intervals for the rates were calculated based on the normal approximation to the binomial distribution as follows:
lower = rate − 1.96·(standard error), upper = rate + 1.96·(standard error).
In this case, rates fall into the interval between the lower and upper values with 95 percent probability, as shown in the last two columns in the table at right in figure 14.47.
The rates are unstable if they are calculated for a small number of thyroid cancer cases or for a small number of children in the district, or both. In this case, a choropleth map of the rates is not an effective way to describe the spatial distribution of thyroid cancer.
The risk of thyroid cancer was elevated in the regions that were under a radioactive cloud in the first few
days after the Chernobyl accident, and varying risk existed in each place in Belarus, depending on the levels of
radioactivity. Therefore, districts with zero rates should not be interpreted as safe. One practical solution to
the problem of zero values in the rare disease datasets is to use the adjusted rates by adding one cancer case
to the actual number of cases in each district to distinguish between one and zero cases, uniformly elevating
the morbidity level without changing its spatial structure, as follows:
rate_adjusted = 1000 · (cases + 1) / population
The table in the bottom left part of figure 14.47 shows original rates (INCID1000) and adjusted rates
(INCID1000A) for three districts with similar populations of children and zero, one, and two cases of thyroid
cancer in children. Using adjusted rates, the proportion between districts with one and zero cases is equal to
approximately two instead of an infinitely large value in the case of original rates.
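The rate, its binomial standard error, the normal-approximation confidence interval, and the adjusted rate can be computed together; a base-R sketch, with the three districts below given illustrative child populations.

# Sketch: district rate per 1,000 children, binomial standard error,
# 95 percent normal-approximation confidence interval, and adjusted rate.
rate_stats <- function(cases, population) {
  p    <- cases / population
  rate <- 1000 * p
  se   <- 1000 * sqrt(p * (1 - p) / population)   # binomial standard error, per 1,000 children
  data.frame(rate          = rate,
             se            = se,
             lower95       = rate - 1.96 * se,
             upper95       = rate + 1.96 * se,
             rate_adjusted = 1000 * (cases + 1) / population)
}

rate_stats(cases = c(0, 1, 2), population = c(20000, 21000, 19500))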
For investigating spatial distribution of rare diseases, such as thyroid cancer, methods based on the analysis
of the similarity of neighboring regions can be used. Two maps, empirical Bayes smoothing and Walter’s I
index (discussed in chapter 11 in the section called “Spatial smoothing”), are presented in figure 14.48. The
main difference between empirical Bayes smoothing and the choropleth map of thyroid cancer in children
rates displayed in figure 6.5 of chapter 6 is that in the former, shown in figure 14.48 at left, no districts are
displayed as having zero rates. In reality, 35 of 117 districts have no registered cancer cases and 24 districts
have one cancer case each and a 95 percent probability that the number of thyroid cases is between zero and
three. The empirical Bayes smoothing makes adjustments for the districts with small rates and small
population by using the weighted average rate for the entire country.
The Walter’s I values in figure 14.48 at right divide Belarus into three parts, with thyroid cancer rates higher
than average in the south, lower than average in the north, and with an intermediate zone in between.
Districts are colored in red in the south because rates in the neighboring polygons are greater than the
expected rate, and districts in the northern part of the country are also colored in red because neighboring
values are below the expected rate. Other indexes of spatial data association can be used to make this zoning
more certain.
Figure 14.48
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 14.49 shows a map of adjusted thyroid cancer rates and several Geostatistical Analyst data exploration
tools commonly used in the first stage of continuous data modeling. When polygons are used as input data to
the Geostatistical Analyst tools and models, the software uses the polygon’s centroids as input points.
In figure 14.49, the Trend Analysis tool displays the trend in the north–south direction. The quantile–quantile
plot indicates that the data are far from being normally distributed, while the histogram shows that after log
transformation adjusted rates became close to normally distributed data. Selected points in the
semivariogram cloud highlight two districts with the largest difference in rates from the neighboring districts.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 14.49
Geostatistical Analyst Wizard can be used to investigate the correlation between rates and deposition of
radionuclides.
The most comprehensive information on radionuclide measurements in soil is the data on cesium
radioisotope 137, Cs‐137. Although Cs‐137 did not induce thyroid cancer in children, it is commonly assumed
that its spatial distribution is similar to the distribution of the radioactive iodine in the territory close to
Chernobyl.
Figure 14.50 shows semivariogram models for thyroid cancer rates (left) and for Cs-137 soil contamination (center), and the cross-covariance between them (right). Using the semivariogram and cross-covariance model parameters in figure 14.50, the linear correlation coefficient between thyroid cancer and radioactive cesium in soil can be estimated as 0.68, meaning that there is a strong correlation between Cs-137 and thyroid cancer rates. See, however, the case study in appendix 3, which concludes that one should be careful in explaining the thyroid cancer risk using cesium-137 soil contamination data.
Figure 14.50
Depending on the weather, the deposition of heavy Cs-137 particles relative to light I-131 particles changed considerably from region to region and from day to day, and it would be unrealistic to expect a linear correlation between Cs-137 and I-131 contamination in the territories where deposition occurred in windy and wet conditions. Therefore, several attempts were made to reconstruct radioiodine deposition and the exposure of the thyroid gland to it.
Figure 14.51 shows semivariogram and cross‐covariance modeling made by the author from the mid‐ to late
1990s using one such reconstruction of the amount of radioiodine in milk in the first days after the Chernobyl
accident. The linear correlation coefficient between thyroid cancer and the amount of I‐131 in milk can be
estimated as 0.88.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 14.51
All data were log transformed because the cross-Moran I only works correctly with data that are close to
normally distributed. The two histograms in figure 14.52 at right show the log-transformed iodine concentration
in milk (top) and the thyroid cancer rate (bottom). The almost symmetric distributions of both variables
justify the use of the cross-Moran I statistic for data exploration.
A strong positive correlation between the reconstructed radioiodine contamination in the first few days
following the Chernobyl accident and thyroid cancer in children over the following seven years is evident in
the southern and southeastern parts of Belarus, where both rates and doses are larger than their expected
values, so that the cross-Moran I takes large positive values.
Courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure 14.52
The modeling stage of the data analysis can begin once the exploratory data analysis is completed. See how
thyroid cancer data are further analyzed in appendixes 3 and 4.
ASSIGNMENTS
1) EXPLORE THE ARSENIC GROUNDWATER CONTAMINATION IN BANGLADESH IN
1998.
Use Geostatistical Analyst to explore spatial distribution of arsenic concentration at wells sampled in
Bangladesh in 1998. Data can be downloaded from http://bicn.com/acic/resources/arsenic-
on-the-www/data.htm.
2) INTERPOLATE PARTICULATE MATTER DATA MEASURED IN U.S. CITIES.
a) Use kriging in one dimension with the particulate matter data measured in one particular city.
b) Use replicated data for interpolation of the particulate matter data collected in U.S. cities.
Data are in the folder assignment 14.2. Measured locations are displayed in figure 14.53.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 14.53
3) EXPLORE THE ANNUAL PRECIPITATION IN SOUTH AFRICA USING MOVING WINDOW KRIGING.
Use Geostatistical Analyst’s moving window kriging to explore the spatial distribution of the annual precipitation
in South Africa. Data are in the folder assignment 14.2. Data are displayed in figure 14.54.
Copyright © 2004, R.S.A. Water Research Commission.
Figure 14.54
FURTHER READING
1. Tukey, J. W. 1977. Exploratory Data Analysis. Reading, Mass.: Addison‐Wesley.
Tukey’s book is about thinking numerically without a computer. It is important reading for researchers who
analyze data frequently since a very large number of ideas and approaches to data analysis are presented and
clearly explained by the author.
The next three books present a wide range of data‐exploration tools. Each book uses a particular statistical
software package. What can be done with one statistical software package is largely reproducible with
another. Therefore, all books are useful even if you are using another software package.
2. Reimann, C., P. Filzmoser, R. G. Garrett, R. Dutter. 2008. Statistical Data Analysis Explained: Applied
Environmental Statistics with R. Chichester, UK: Wiley.
The authors of this book try to avoid mathematical formulas and use examples and graphics instead. For all
graphics presented in the book, the R codes are provided.
3. Martinez, W. L., and A. R. Martinez. 2004. Exploratory Data Analysis with MATLAB. Boca Raton, Fla.: CRC Press.
This book illustrates the computational aspects of exploratory data analysis using functions from the
commercial software package MATLAB and the MATLAB Statistics Toolbox. The authors also provide pseudo-
codes for users of other software packages.
4. Millard, S. P., and N. K. Neerchal. 2000. Environmental Statistics with S-PLUS. CRC Press.
This book discusses the vast array of statistical methods used to collect and analyze environmental data using
commercial software package S‐Plus and the add‐on modules EnvironmentalStats for S‐PLUS, S+SpatialStats,
and S‐PLUS for ArcView.
USING COMMERCIAL STATISTICAL
SOFTWARE FOR SPATIAL DATA
ANALYSIS
PROGRAMMING WITH SAS
TRADITIONAL (NONSPATIAL) LINEAR REGRESSION
LINEAR REGRESSION WITH SPATIALLY CORRELATED ERRORS (KRIGING WITH
EXTERNAL TREND)
USING MATLAB AND LIBRARIES DEVELOPED BY MATLAB USERS
MORAN’S I SCATTERPLOT
SIMULTANEOUS AUTOREGRESSIVE MODEL
USING SPLUS SPATIAL STATISTICS MODULE S+SPATIALSTATS
CREATING SPATIAL NEIGHBORS
MORAN’S I
CONDITIONAL AUTOREGRESSIVE MODEL
ASSIGNMENTS
1) REPEAT THE BANGLADESH CASE STUDY USING ANOTHER SUBSET OF THE DATA
2) USE MATLAB FOR NONGAUSSIAN DISJUNCTIVE KRIGING
3) USE THE CAR MODEL FROM S+SPATIALSTATS MODULE FOR ANALYSIS OF INFANT
MORTALITY DATA COLLECTED IN NORTH CAROLINA FROM 1995–1999
FURTHER READING
This chapter shows examples of statistical spatial data analysis using three well-respected commercial
statistical software programs. The use of general-purpose statistical software, such as the Statistical Analysis
System (SAS), MATLAB, S-PLUS, and R (see chapter 16), has several advantages for GIS users.
Rereading chapters 11-12 can be helpful for understanding the case studies in this chapter.
PROGRAMMING WITH SAS
SAS is an extensive statistical software package for data management and analysis. Although the SAS bridge
for Esri makes it easier to move data back and forth between SAS and ArcGIS, the SAS procedures that make
explicit use of spatial locations and relationships are still fairly limited. However, SAS has powerful
procedures for regression analysis, some of which are illustrated below. Additional examples of SAS usage
can be found in appendix 4. All examples of SAS usage in this book were prepared using the software version
9.1.3.
Programming with SAS can be illustrated using arsenic data collected in Bangladesh. In that country, many
people are drinking groundwater with arsenic concentrations far above acceptable levels. Thousands of
people have been diagnosed with symptoms of arsenic poisoning. Detailed information about this
environmental disaster and arsenic data can be found at
http://bicn.com/acic/resources/arsenic-on-the-www/data.htm.
The probable origin of the arsenic lies in the outcrops of hard rocks higher in the Ganges River catchment,
from where it is redeposited in Bangladesh by the courses of the Ganges River. Pesticides can leach into the
groundwater as well. Figure 15.1 shows 333 measurements of arsenic concentration (in logarithmic scale)
collected in the southeastern part of the country.
BGS and DPHE. 2001. “Arsenic Contamination of Groundwater in Bangladesh.” Edited by Kinniburgh, D. G., and P. L. Smedley.
British Geological Survey Technical Report WC/00/19, Keyworth: British Geological Survey.
Figure 15.1
proc import datafile='c:\myfiles\asdata.dbf' out=arsenic;
run;
This imports the data file asdata.dbf into a temporary SAS dataset called arsenic. To make transformations
with the imported data, the following code can be used (notice that x and y coordinates were divided by 1,000
to make calculations stable):
Data arsenic; set arsenic;
log_As=log(As);
log_depth=log(depth);
x1=x/1000;
y1=y/1000;
x1y1=x1*y1; /* interaction term used in model 2 below */
run;
Arsenic does not occur at all depths in alluvial sediments. Figure 15.2 at left shows a scatterplot of arsenic
concentration versus well depths with an estimated linear regression line using a logarithm‐logarithm scale.
Linear regression is used to analyze the logarithms of the variable values because it is on the log‐log scale that
the relationship between arsenic concentration and well depth can be approximately described by a straight
line. A first look at the data suggests that high concentrations are restricted to the upper 150 meters of the
alluvial sediments and offer prospects of obtaining arsenic‐free waters from deeper layers. In this case study,
the relationships between arsenic concentration and well depths will be investigated for one of the most
contaminated areas of the country.
The graph in figure 15.2 at left suggests that there is a negative correlation between the logarithms of arsenic
concentration and well depth. This correlation, r = −0.4335, can be obtained using the statements
Proc Corr;
Var log_As log_depth;
run;
Note that the correlation coefficient can also be estimated in the Geostatistical Analyst
Semivariogram/Covariance Modeling dialog box as the ratio of the cross-covariance between the two variables
at lag zero to the square root of the product of their sills.
The estimated cross-covariance between arsenic and well depths is shown in figure 15.2 at right.
BGS and DPHE. 2001. “Arsenic Contamination of Groundwater in Bangladesh.” Edited by Kinniburgh, D. G., and P. L. Smedley.
British Geological Survey Technical Report WC/00/19, Keyworth: British Geological Survey.
Figure 15.2
There are many places where a data value is different from all its closest neighbors. This can be visualized
using the Geostatistical Analyst Voronoi Map Cluster tool, presented in figure 15.3. Voronoi polygons are
created so that every location within a polygon is closer to the sample point in that polygon than any other
sample point. After the polygons are created, neighbors of a sample point are defined as sample points whose
polygons share a border with the chosen sample point. If the class interval of a cell is different from every
neighbor, the cell is colored pink. A Voronoi map uses a quantile classification method so that each class
interval contains an equal number of data. The values in pink cells are possible local outliers, most likely due
to unusually shallow or deep wells. Local data variability is large, and large prediction errors are expected for
any interpolator.
BGS and DPHE. 2001. “Arsenic Contamination of Groundwater in Bangladesh.” Edited by Kinniburgh, D. G., and P. L. Smedley.
British Geological Survey Technical Report WC/00/19, Keyworth: British Geological Survey.
Figure 15.3
Two regression models with different sets of explanatory variables will be compared:
Model 1: The logarithm of well depth
Model 2: The coordinates x and y, their product x⋅y, and the logarithm of well depth
By including x, y, and their product, a trend surface is used to account for possible large‐scale variation in
arsenic concentrations because of sources other than well depths.
The analysis has two goals: 1) to determine whether or not well depth, the trend surface variables, or both
are significant predictors of arsenic concentration and, if so, 2) to use the variables deemed statistically
significant to make predictions of arsenic concentration on the entire territory covered by the samples of
arsenic data.
TRADITIONAL (NONSPATIAL) LINEAR REGRESSION
First, we use nonspatial linear regression with a constant but unknown error variance at all arsenic data locations:
Model 1: log(As) = β1 + β2⋅log(depth) + ε
Model 2: log(As) = β1 + β2⋅x + β3⋅y + β4⋅x⋅y + β5⋅log(depth) + ε,
where the errors ε are assumed to be independent with the same variance σ².
This can be done in SAS using either the General Linear Model Procedure (Proc GLM)
Model I:
Proc GLM;
Model log_As=log_depth;
Model II:
Proc GLM;
Model log_As=x1 y1 x1y1 log_depth;
or the Mixed Model Procedure (Proc Mixed)
Model I:
Proc Mixed;
Model log_As=log_depth;
Model II:
Proc Mixed;
Model log_As=x1 y1 x1y1 log_depth;
Both procedures give the same results but in slightly different formats. Since Proc Mixed can handle the case
of spatially correlated errors, the results from Proc Mixed are presented in tables 15.1 and 15.2. The
estimated variance σ² and the Akaike’s information criterion (AIC) are shown at the top of the tables. The AIC
criterion is a statistic that assesses model fit. In linear regression, the basic model fit is assessed by measuring
the squared deviations between the observed data and the fitted regression equation (Proc GLM) or,
equivalently, by calculating the value of the likelihood evaluated using the estimated maximum likelihood
parameters (Proc Mixed). Since a regression model with more parameters will always fit the data better than
one with a smaller number of parameters, the AIC criterion adjusts for the number of parameters used in the
model. The model that has the smallest AIC value is preferred.
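For readers who prefer to experiment in R (the freeware environment discussed in chapter 16), the following minimal sketch illustrates the AIC comparison of two nested regression models. The simulated variables depth, x, y, and log_as are stand-ins and are not the Bangladesh dataset used in the text.
# Illustrative AIC comparison of two nested linear models in R, using
# simulated stand-in data (not the arsenic data analyzed in this chapter).
set.seed(1)
n <- 300
depth <- runif(n, 5, 250)
x <- runif(n)
y <- runif(n)
log_as <- 4 - 0.8 * log(depth) + 0.5 * x * y + rnorm(n, sd = 1.5)
m1 <- lm(log_as ~ log(depth))                     # model 1: depth only
m2 <- lm(log_as ~ x + y + I(x * y) + log(depth))  # model 2: depth plus trend surface
AIC(m1, m2)  # the model with the smaller AIC is preferred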
Model 1. Regression of log(As) on log(depth). σ² = 2.9991, AIC = 1316.0
BGS and DPHE. 2001. “Arsenic Contamination of Groundwater in Bangladesh.” Edited by Kinniburgh, D. G., and P. L. Smedley.
British Geological Survey Technical Report WC/00/19, Keyworth: British Geological Survey.
Table 15.1
Model 2. Regression of log(As) on x, y, x⋅y, and log(depth). σ² = 2.1710, AIC = 1242.5
BGS and DPHE. 2001. “Arsenic Contamination of Groundwater in Bangladesh.” Edited by Kinniburgh, D. G., and P. L. Smedley.
British Geological Survey Technical Report WC/00/19, Keyworth: British Geological Survey.
Table 15.2
All variables are significant in predicting the arsenic values, since the absolute values of their t statistics are
greater than 2 and the p-values are less than 0.05. The regression statistics indicate that the trend surface
estimated using a low-order polynomial is significant and that the error variance σ² is reduced by adding the
x, y, and x⋅y terms.
By definition, the residuals are equal to the observed data values minus the values predicted by the fitted
model. The studentized residual equals the raw residual divided by an estimate of its standard deviation.
These can be obtained by adding the residual option to the model statements in the SAS code above. For
example, for model 1 (the output dataset name resid1 is arbitrary):
Proc Mixed;
Model log_As=log_depth / residual outp=resid1;
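For readers who prefer R, the same kind of residuals can be computed with the rstudent() function; the sketch below uses simulated stand-in variables rather than the arsenic data.
# Sketch: studentized residuals of an ordinary (nonspatial) regression in R.
set.seed(2)
depth <- runif(200, 5, 250)
log_as <- 4 - 0.8 * log(depth) + rnorm(200, sd = 1.5)  # simulated stand-in data
fit <- lm(log_as ~ log(depth))
stud <- rstudent(fit)   # residuals scaled by their estimated standard errors
summary(stud)           # values far outside roughly [-2, 2] flag poorly fitted points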
Figure 15.4 shows the studentized residuals produced by model 2. Red and dark blue circles identify
observations that are poorly explained by the nonspatial regression model.
BGS and DPHE. 2001. “Arsenic Contamination of Groundwater in Bangladesh.” Edited by Kinniburgh, D. G., and P. L. Smedley.
British Geological Survey Technical Report WC/00/19, Keyworth: British Geological Survey.
Figure 15.4
Though there are some locations with fairly large errors, these tend to be associated with locations near the
edges of the study region. There does not seem to be an obvious pattern in the errors.
The next step is estimating the semivariogram model of the studentized residuals from the fitted model using
Geostatistical Analyst. The semivariogram model will be used to determine an initial model for the
semivariogram of these residuals when doing spatial regression in SAS. Because studentized residuals are in
units of standard deviations, the sill (in Geostatistical Analyst, it is a sum of nugget and partial sill
parameters) should be equal to 1. If the estimated sill is much larger, it indicates a trend. Figure 15.5 shows
semivariogram models for models 1 and 2.
BGS and DPHE. 2001. “Arsenic Contamination of Groundwater in Bangladesh.” Edited by Kinniburgh, D. G., and P. L. Smedley.
British Geological Survey Technical Report WC/00/19, Keyworth: British Geological Survey.
Figure 15.5
Using these semivariogram models, the linear regression models in SAS can be refitted, accounting for
residual spatial autocorrelation. This can be done using the repeated statement in Proc Mixed:
Proc mixed;
model log_as = log_depth;
repeated / type=sp(exp)(x1 y1) local subject=intercept;
parms (0.8) (20) (.22);
Proc mixed;
model log_as = x1 y1 x1y1 log_depth;
repeated / type=sp(exp)(x1 y1) local subject=intercept;
parms (0.65) (15) (0.35);
The repeated statement tells SAS that there are repeated measurements on each experimental unit. In the
spatial case, the experimental unit is the spatial domain of interest, which is specified using the
subject=intercept option. This option allows all data values to be spatially correlated. The type=sp(exp)(x1 y1)
option informs SAS that the correlation structure is a spatial one (sp) of exponential form (exp) with spatial
coordinates x1 and y1. The parms statement gives SAS starting values (partial sill, range, and nugget) for the
parameters of this exponential spatial autocorrelation model, inferred through geostatistical analysis of the
studentized residuals. Note that when the exponential semivariogram model is used, Proc Mixed reports the
distance parameter θ of the exponential covariance exp(−d/θ) rather than the practical range at which the
correlation becomes negligible; the reported value is therefore about one third of the practical range shown
by Geostatistical Analyst.
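The relationship between the two reported values can be checked with a few lines of R; the value 71 below is only an illustrative distance parameter, roughly one third of the 213 km range quoted for model 2.
# With the exponential model exp(-d/theta), the correlation falls to about
# 5 percent of the sill at d = 3*theta, the usual "practical range".
theta <- 71                 # illustrative distance parameter (km)
d <- c(1, 2, 3) * theta
round(exp(-d / theta), 3)   # 0.368 0.135 0.050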
The following tables show the results of fitting two models. In the spatial models, σ₀² is a nugget parameter
assumed to be pure measurement error and σ₁² is the partial sill.
Table 15.3
Model 2: Regression of log(As) on x, y, x⋅y, and log(depth). σ₀² = 1.2528; σ₁² = 3.0478; estimated range of data
correlation is 213 km; AIC = 1163.7
Table 15.4
The spatial models fit better than the nonspatial ones, since their AICs and standard errors are substantially
smaller. The estimated regression coefficients obtained from both spatial models are significant.
The next step is prediction from spatial models to the dense grid over the data domain and mapping. Because
depth values are not known at the prediction grid locations, kriging is used to predict depth values; depth
predictions are then used in the regression model to predict the logarithm of arsenic values and make a map.
BGS and DPHE. 2001. “Arsenic Contamination of Groundwater in Bangladesh.” Edited by Kinniburgh, D. G., and P. L. Smedley.
British Geological Survey Technical Report WC/00/19, Keyworth: British Geological Survey.
Figure 15.6
There are problems with predicting well depths at the unsampled locations using the 333 observed wells. The
semivariogram is always close to the nugget model, since deep wells are distributed rather randomly and are
often surrounded by shallow wells (see figure 15.7 at left).
One possible solution is to remove the deepest wells highlighted in figure 15.6, estimate the semivariogram,
then use the entire dataset for predictions. It makes sense because it is better to underestimate depth to
protect people, since deep wells are the cleanest. Therefore, 303 relatively shallow wells are used for
predictions of well depth, meaning that the worst‐case situation is being modeled. The semivariogram model
in this case shows strong spatial data dependence (see figure 15.7 at right).
BGS and DPHE. 2001. “Arsenic Contamination of Groundwater in Bangladesh.” Edited by Kinniburgh, D. G., and P. L. Smedley.
British Geological Survey Technical Report WC/00/19, Keyworth: British Geological Survey.
Figure 15.7
First, we input the depth predictions into SAS and create a new dataset that includes all locations, both those
at which arsenic concentration was measured and those at which it is to be predicted. For the prediction
locations, initially the arsenic concentration will be assigned a missing value.
proc import datafile='c:\myfiles\depthpred.dbf' out=depthpred;
data depthpred; set depthpred;
x1=x/1000;
y1=y/1000;
log_depth=log(preddepth);
run;
data comb;
set arsenic depthpred;
run;
Then, we fit the same models as before but include the outpred option to output predicted values to a new
dataset (called pred3 here).
proc mixed data=comb;
model log_as = x1 y1 x1y1 log_depth/outpred=pred3;
repeated / type=sp(exp)(x1 y1) local subject=intercept;
parms (0.15) (15) (0.05);
run;
Figure 15.8 at left shows the resulting map using model 2. This map can be compared with predictions made
by the Geostatistical Analyst’s universal cokriging with depth as a secondary variable in figure 15.8 at right.
BGS and DPHE. 2001. “Arsenic Contamination of Groundwater in Bangladesh.” Edited by Kinniburgh, D. G., and P. L. Smedley.
British Geological Survey Technical Report WC/00/19, Keyworth: British Geological Survey.
Figure 15.8
The map in figure 15.9 shows the difference between the predictions in maps shown in figure 15.8. Green
colors indicate areas where predictions based on model 2 are higher, and areas in red show where
predictions using a universal cokriging model are higher. In the gray areas, predictions are similar.
BGS and DPHE. 2001. “Arsenic Contamination of Groundwater in Bangladesh.” Edited by Kinniburgh, D. G., and P. L. Smedley.
British Geological Survey Technical Report WC/00/19, Keyworth: British Geological Survey.
Figure 15.9
One advantage of using model 2 (figure 15.8 at left) to model arsenic water contamination is that it shows what
happens if all wells are shallow or if all are deep. Figure 15.10 shows the predictions at the locations on
the transect displayed as black rectangles, assuming that all wells are 10 meters deep (red line) or
150 meters deep (blue line), and using the estimated well depths (black line). Error bars show prediction
standard errors; the height of each vertical line is equal to two prediction standard errors. The predicted arsenic concentration
depends heavily on the well depth, confirming the observation made at the beginning of the case study on the
importance of drilling deep drinking water wells in the country.
BGS and DPHE. 2001. “Arsenic Contamination of Groundwater in Bangladesh.” Edited by Kinniburgh, D. G., and P. L. Smedley.
British Geological Survey Technical Report WC/00/19, Keyworth: British Geological Survey.
Figure 15.10
BGS and DPHE. 2001. “Arsenic Contamination of Groundwater in Bangladesh.” Edited by Kinniburgh, D. G., and P. L. Smedley.
British Geological Survey Technical Report WC/00/19, Keyworth: British Geological Survey.
Figure 15.11
As discussed in chapter 2, transforming the regression results back to the original scale without bias cannot
be done using the formula
arsenic = exp(prediction value on log scale).
To do an unbiased transformation, substantial programming in SAS/IML is required, since the formulas for
bias correction are complex. Geostatistical Analyst automatically transforms predictions back to the original
scale using the bias correction formulas if the logarithm or another data transformation is used.
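The size of this bias is easy to demonstrate with a short simulation in R; the sketch below is purely illustrative and does not reproduce the Geostatistical Analyst correction, which also accounts for the kriging variance at each location.
# For a lognormal variable Z = exp(Y) with Y ~ N(mu, sigma^2),
# E[Z] = exp(mu + sigma^2/2), not exp(mu) = exp(E[Y]).
set.seed(3)
mu <- 1
sigma <- 1.5
z <- exp(rnorm(1e6, mu, sigma))
mean(z)                 # empirical mean on the original scale
exp(mu)                 # naive back-transformation: clearly too small
exp(mu + sigma^2 / 2)   # lognormal bias correction, close to mean(z)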
USING MATLAB AND LIBRARIES DEVELOPED BY MATLAB USERS
MATLAB does not have a library for spatial statistical data analysis. However, several statistical spatial data
analysis libraries have been created by MATLAB users and are freely available in the public domain. The
following illustrates the use of one of these libraries developed by James LeSage, called the Econometrics
Toolbox, available at http://www.spatial-econometrics.com/.
Econometrics is the application of statistical and mathematical methods to the empirical estimation of
economic relationships. The Econometrics Toolbox consists of functions for regional data analysis and
includes several tests for spatial data dependence and several spatial regression models. A distinctive feature
of the toolbox is the ability to work with large datasets.
To obtain, install, and use the toolbox, visit the download index, download the MATLAB Arc Map Toolbox and
the MATLAB version 6 (or 5.3 or 7, depending on your MATLAB installation) Zip files, and unpack them. Start
MATLAB and use the File > Set Path menu option. In the Set Path dialog box, choose Add Folder with
Subfolders. Click Save and then Close to exit the Set Path window.
Using data on church adherents in U.S. counties, described below, it can be shown how two functions from the
Econometrics Toolbox work: one data exploration tool, the Moran scatterplot; and one spatial regression
model, the simultaneous autoregressive model. These tools were chosen because they can be used immediately
after installing the toolbox, whereas most of the others require an understanding of the library organization
and some programming.
Figure 15.12 shows a map of the proportion of Church of the Nazarene adherents in U.S. counties and part of
the dataset, bottom.
From Chaves, Mark, and Shawna Anderson. 2008. “National Congregations Study: Cumulative Data File and Codebook.”
Durham, N.C.: Duke University, Department of Sociology.
Figure 15.12
In figure 15.12 at left, 1,334 of 3,065 counties have no Church of the Nazarene adherents. To distinguish
among counties without Church of the Nazarene adherents but with different numbers of total adherents, the
data are transformed into a new variable, nazarn_pp.
The histogram in figure 15.12 at right shows the distribution of the logarithm of the new variable nazarn_pp.
It consists of two bell-shaped components that correspond to counties without Church of the Nazarene activities
(yellow) and to the proportion of adherents in the remaining counties (blue).
The Moran scatterplot provides a visual exploration of spatial autocorrelation by plotting the same variable
for the counties and their neighbors. The Spatial Econometrics Toolbox provides a function to find a specified
number of the closest neighbors based on straight‐line distances between polygon centroids. In this exercise,
five closest neighbors will be used.
Figure 15.13 shows a histogram of the variable nazarn_pp and a map of its weighted local standard deviation
(see the description in chapter 11). The variability in the proportion of Church of the Nazarene adherents is
different in different parts of the United States.
From Chaves, Mark, and Shawna Anderson. 2008. “National Congregations Study: Cumulative Data File and Codebook.”
Durham, N.C.: Duke University, Department of Sociology.
Figure 15.13
From Chaves, Mark, and Shawna Anderson. 2008. “National Congregations Study: Cumulative Data File and Codebook.”
Durham, N.C.: Duke University, Department of Sociology.
Figure 15.14
Before mapping, the variables are standardized so that the units in the graph correspond to standard
deviations. The standardized variables, called z-scores, are the differences between the measured values of a
variable and its mean, divided by its overall standard deviation.
In the Moran’s I scatterplot in figure 15.15, the x‐axis is the standardized proportion of Church of the
Nazarene adherents (z‐score) for each county. The y‐axis is the neighborhood average of z‐scores. The axes
values are in standard deviation units.
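A similar exploration can be sketched in R with the spdep package (see chapter 16); the coordinates and the variable below are simulated stand-ins, not the church adherents data.
# Five-nearest-neighbor weights, z-scores, and a Moran scatterplot in R.
library(spdep)
set.seed(4)
n <- 200
xy <- cbind(runif(n), runif(n))                 # stand-in polygon centroids
lw <- nb2listw(knn2nb(knearneigh(xy, k = 5)))   # 5 nearest neighbors, row-standardized
v <- rnorm(n)                                   # stand-in for log(nazarn_pp)
z <- as.numeric(scale(v))                       # z-score: (v - mean(v)) / sd(v)
moran.plot(z, lw, xlab = "z-score", ylab = "spatially lagged z-score")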
The four quadrants in the Moran’s I scatterplot provide a classification of four types of spatial
autocorrelation, as shown in table 15.5.
Scatterplot quadrant   Autocorrelation   Interpretation
Upper right            Positive          Value and its neighbors are high
Lower right            Negative          High value with low values among neighbors
Lower left             Positive          Value and its neighbors are low
Upper left             Negative          Low value with high values among neighbors
From Chaves, Mark, and Shawna Anderson. 2008. “National Congregations Study: Cumulative Data File and Codebook.”
Durham, N.C.: Duke University, Department of Sociology.
Table 15.5
From Chaves, Mark, and Shawna Anderson. 2008. “National Congregations Study: Cumulative Data File and Codebook.”
Durham, N.C.: Duke University, Department of Sociology.
Figure 15.15
The code used to create the maps in figure 15.15 was modified from the demonstration provided with the
MATLAB Arc_Map Toolbox as follows.
filename = '..\shape_files\church_counties_prop';
results = shape_read(filename);
% use the logarithm of the variable in the 18th column of the DBF file, nazarn_pp
variable =log(results.data(:,18));
% assign a name to the variable
vnames = strvcat('proportion of Church of the Nazarene adherents ');
options.vname = vnames;
% create variables with the x- and y-coordinates of the polygons' centroids
latt = results.xc;
long = results.yc;
% construct a spatial weight matrix based on the 5 nearest neighbors
W = make_nnw(latt,long,5);
% create a Moran's I scatterplot and map
arc_moranplot(variable,W,results,options);
There is a small menu with a zoom option in the bottom left corner of the map. Using this zoom option, the
area of the southeastern United States was enlarged in figure 15.16 at left. Figure 15.16 at right shows the
corresponding Moran’s I scatterplot.
From Chaves, Mark, and Shawna Anderson. 2008. “National Congregations Study: Cumulative Data File and Codebook.”
Durham, N.C.: Duke University, Department of Sociology.
Figure 15.16
The Moran’s I scatterplot indicates that there is local autocorrelation in the data, so there is a reason for
spatial regression modeling.
SIMULTANEOUS AUTOREGRESSIVE MODEL
The Econometrics Toolbox has several spatial regression models, including Casetti’s spatial expansion model, the
simultaneous autoregressive model, the spatial Durbin model, and spatial error models. Computations can be
done using both maximum likelihood and Bayesian approaches. These models and the computational
approaches are described in the documentation available at the toolbox’s Web site.
Figure 15.17 at right shows an output from a simultaneous autoregressive model in which the logarithm of
the proportion of Church of the Nazarene adherents is modeled as a weighted sum of the neighbors and the
logarithm of the proportions of American Lutheran Church and Catholic Church adherents.
log(Nazarene) ~ β1 + β2⋅log(Lutheran) + β3⋅log(Catholic)
Table 15.6
According to the model output, the proportions of Lutheran and Catholic church adherents help explain the
proportion of Church of the Nazarene adherents because the absolute values of their t-statistics are greater than 2. Estimates of the
coefficients β1 and β2 are negative, meaning that there are inverse relationships between the proportions.
Therefore, it is reasonable to expect a significantly large proportion of adherents to the Church of the
Nazarene in counties where the proportion of Catholics is relatively large and the proportion of Lutherans is
small.
The additional model parameter reported below the βi estimates shows that the adherents’
proportions are strongly correlated in neighboring counties: the correlation coefficient ρ equals 0.609, and its
estimated standard deviation is 0.036. The parameters ρmin and ρmax show the possible range for the coefficient ρ,
from −0.999 to 1, indicating that the spatial weights matrix was standardized (see discussion in chapter 11).
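A simultaneous autoregressive (spatial lag) model of the same general form can also be fitted in R with the lagsarlm() function from the spatialreg package (formerly part of spdep). The data frame below is simulated and the variable names are placeholders, not the church adherents data.
# Sketch: a spatial lag (SAR) regression in R on simulated data.
library(spdep)
library(spatialreg)
set.seed(5)
n <- 100
xy <- cbind(runif(n), runif(n))
lw <- nb2listw(knn2nb(knearneigh(xy, k = 5)))
d <- data.frame(y = rnorm(n), lutheran = rnorm(n), catholic = rnorm(n))
fit <- lagsarlm(y ~ lutheran + catholic, data = d, listw = lw)
summary(fit)   # reports the beta estimates and the spatial coefficient rho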
Figure 15.17
From Chaves, Mark, and Shawna Anderson. 2008. “National Congregations Study:
Cumulative Data File and Codebook.”
Durham, N.C.: Duke University, Department of Sociology.
USING SPLUS SPATIAL STATISTICS MODULE S+SPATIALSTATS
The S‐PLUS spatial statistics module, S+SpatialStats, provides analytical options for all three types of spatial
data. Because the geostatistical library in S+SpatialStats is very limited in comparison with that in
Geostatistical Analyst and the point pattern analysis library is very limited in comparison with those in the R
packages described in chapter 16, S+SpatialStats will be used here for regional data analysis (called the
analysis of “lattice data” in S+SpatialStats). An advantage of S+SpatialStats is that it permits much spatial
analysis to be done using GUI menus and dialog boxes.
The first step in the analysis of regional data is the definition of spatial neighbors. S+SpatialStats can create
spatial neighbors based on the straight‐line distance between observations. Since the observations are
associated with regional polygons, the polygon centroid is used to represent the polygon. S+SpatialStats
allows a choice between a specified number of neighbors and a maximum distance between regions, within
which observations are considered neighbors.
It is also possible to read information about neighbors from an ASCII file. Figure 15.18 shows six
administrative units and a list of adjacent polygons. Polygon 1 has polygons 2, 3, and 5 as neighbors; polygon
2 has polygons 1, 3, and 6 as neighbors; and so on. The function read.neighbor reads in a text file and converts
the list of neighbors to the internal spatial.neighbor object using the following command:
mynbr <- read.neighbor("nbrs.txt")
File nbrs.txt:
1 2 3 5
2 1 3 6
3 1 2 4 5 6
4 3 5
5 1 3 4
6 2 3
Figure 15.18
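The read.neighbor function is specific to S+SpatialStats; in R, a file in this simple format could be turned into a list of neighbors with a few lines of base code, as in the sketch below (the adjacency rows are copied from the nbrs.txt listing above).
# Sketch: converting an adjacency list like nbrs.txt into an R list of neighbors.
txt <- c("1 2 3 5", "2 1 3 6", "3 1 2 4 5 6", "4 3 5", "5 1 3 4", "6 2 3")
fields <- strsplit(trimws(txt), "\\s+")
nbrs <- lapply(fields, function(v) as.integer(v[-1]))  # drop the region id
names(nbrs) <- sapply(fields, `[`, 1)
nbrs   # e.g., polygon 1 has polygons 2, 3, and 5 as neighbors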
To illustrate use of the S+SpatialStats module, data from the Environmental Performance Measurement
Project, http://www.yale.edu/esi/, will be used. This Web site provides a diverse set of
socioeconomic, environmental, and institutional indicators that characterize environmental sustainability at
the national level.
Figure 15.19 at left displays a choropleth map of mortality rates per 1,000 live births for children under five
(given in the U5MORT field). The five selected countries in this figure are Morocco (first row in the dataset)
and its four closest neighbors established according to the straight‐line distance between polygon centroids.
Other variables shown below can be used in the regression analysis for explanation of child mortality:
DISCAS: Average number of deaths per million inhabitants from floods, tropical cyclones, and droughts
WATSUP: Percentage of population with access to improved drinking water source
TFR: Total fertility rate
INDOOR: Indoor air pollution from solid fuel use
CO2GDP: Carbon emissions per million U.S. dollars GDP (the total dollar value of all final goods and services
produced in the country in a year)
VOCKM: Anthropogenic volatile organic compounds (VOC) emissions per populated land area
WATAVL: Freshwater availability per capita
GRDAVL: Internal groundwater availability per capita
NOXKM: Anthropogenic NOx emissions per populated land area
SO2KM: Anthropogenic SO2 emissions per populated land area
UND.NO: Percentage of undernourished children in total population
PECR: Female primary education completion rate
Meaningful regression analysis using 12 explanatory variables with only 43 observations is difficult because
some variables measure similar things (this is called multicollinearity), which leads to unstable results of
calculations (see the discussion in the section “Geographically weighted regression” in chapter 12).
Data from Environmental Sustainability Index, Yale University.
Figure 15.19
Data from Environmental Sustainability Index, Yale University.
Figure 15.20
Some of the functions of the S+SpatialStats module, including Moran’s I, can handle missing data values, but
others, including the spatial regression models, generate an error message in that case.
CREATING SPATIAL NEIGHBORS
Figure 15.21 shows how to create spatial neighbors. First, the S‐PLUS module spatial is initialized using the
command File > Load Module. Then the dialog box Spatial Neighbors is initialized using the command Spatial
‐> Spatial Neighbors. On this dialog box, the AfricaESIs dataset is specified, fields “X” and “Y” are the
coordinates of the country centroids, the number of closest neighbors is set to four, the Euclidean
distance metric is used to define closeness and determine the closest neighbors, and the resulting list of
neighbors is called AfricaNbrs. This list is shown in the right part of figure 15.21. The first two columns show
the row numbers with neighbors, and the third column displays the weights of each neighbor, which are
equal to 1 by default. It is assumed that pairs of data points that are not specified in the columns row.id and
col.id have zero weights and are not related. Weights can be changed for the neighbors, thus making some
regions more influential than others. This is illustrated in more detail below.
Figure 15.21
Connections between neighbors can be visualized by typing the command
"plot(AfricaNbrs, xc=AfricaESIs$X, yc=AfricaESIs$Y,scaled=T)"
in the Commands window, which produces the map at left in figure 15.22.
Data from Environmental Sustainability Index, Yale University.
Figure 15.22
Neighbors in the AfricaNbrs window can be edited and saved for further usage. The map at right in figure
15.22 shows neighbors after adding rows
1.00 18.00 1.00
1.00 22.00 1.00
after the fourth row, that is, adding Libya and Niger to the list of Morocco’s neighbors. The new line segments
between neighbors are highlighted in figure 15.22 at right.
With specified neighbors and their weights, spatial data exploration using Moran's I and Geary c statistics can
be done using the Spatial Correlations dialog box shown in figure 15.23. This dialog box specifies the dataset,
AfricaESIs; the variable of interest, U5MORT; the neighbor object, AfricaNbrs, saved from the neighborhood
construction above; the name of the statistic of interest (Moran); and the sampling type used to specify the
probability distribution of Moran’s I that is needed to determine its statistical significance (to answer the
question, “Is the found value unusually high or low?”). With free sampling, all observations are assumed to
follow an identical and independent Gaussian distribution, and the distributional properties of the Moran’s I
statistic can be derived from this assumption. A second assumption, called the randomization assumption,
may be made instead. Under this assumption, S+SpatialStats randomly rearranges all the data values many
times, in this case 500 times, and computes Moran’s I for each permutation. Since the set of observations
remains the same in each randomization, this is also referred to as nonfree sampling. Further discussion of
the Moran’s I p‐values can be found in chapter 16.
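The same two sampling assumptions are available in R's spdep package, where the randomisation argument of moran.test() switches between them and moran.mc() performs the permutation version; the sketch below uses simulated stand-in data, not the AfricaESIs dataset.
# Moran's I under the normality ("free sampling") and randomization assumptions.
library(spdep)
set.seed(6)
xy <- cbind(runif(43), runif(43))                # stand-in country centroids
lw <- nb2listw(knn2nb(knearneigh(xy, k = 4)))    # four nearest neighbors
u5mort <- rnorm(43)                              # stand-in variable
moran.test(u5mort, lw, randomisation = FALSE)    # free sampling (normality)
moran.test(u5mort, lw, randomisation = TRUE)     # randomization assumption
moran.mc(u5mort, lw, nsim = 500)                 # 500 random permutations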
After clicking Apply, spatial correlation statistics for the U5MORT variable are displayed in the Report1
window and are highlighted with the pink frame in figure 15.23. Users can immediately see statistics for
another variable, perhaps VOCKM, in the green frame.
In this example, there is significant spatial correlation (because p‐values are small) between neighboring
countries in the case of child mortality data but not in the case of VOC emissions (because of large p‐values).
Data from Environmental Sustainability Index, Yale University.
Figure 15.23
Different neighbor weights, inversely proportional to the distance between country centroids, can be assigned
using the following code:
distance <- sqrt((AfricaESIs$X[AfricaNbrs$row.id] - AfricaESIs$X[AfricaNbrs$col.id])^2 +
(AfricaESIs$Y[AfricaNbrs$row.id] - AfricaESIs$Y[AfricaNbrs$col.id])^2)
min.dist <- min(distance)
AfricaNbrs$weights <- min.dist/distance
Using these new weights, estimation of the spatial correlation of child mortality remains practically the same
(see the red frame in figure 15.24). However, this choice of weights has a great impact on the estimate of the
spatial correlation of volatile organic compound emissions (green frame), which is now significant. Because
the choice of weights influences the result of the analysis, additional research using more sophisticated
weights and other indexes of spatial data association is required.
Data from Environmental Sustainability Index, Yale University.
Figure 15.24
Spatial correlation exists for some of the AfricaESI variables and this can be incorporated in modeling spatial
relationships between variables. The S+SpatialStats module provides three models for spatial regression
analysis with regional data. We will use the conditional autoregressive model (CAR) discussed in chapter 11.
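For comparison, a conditional autoregressive model can be sketched in R with spautolm() from the spatialreg package; as in S+SpatialStats, the CAR model needs symmetric weights, which is why the neighbors are symmetrized below. The data and variable names are simulated placeholders, not the AfricaESIs dataset.
# Sketch: a CAR model in R on simulated data with symmetric binary weights.
library(spdep)
library(spatialreg)
set.seed(7)
n <- 60
xy <- cbind(runif(n), runif(n))
nb <- make.sym.nb(knn2nb(knearneigh(xy, k = 4)))  # symmetrize the neighbor list
lw <- nb2listw(nb, style = "B")                   # symmetric binary weights
d <- data.frame(y = rnorm(n), indoor = rnorm(n), watavl = rnorm(n))
car_fit <- spautolm(y ~ indoor + watavl, data = d, listw = lw, family = "CAR")
summary(car_fit)   # coefficients and the spatial autocorrelation parameter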
The Spatial Regression dialog box, shown in figure 15.25, requires specification of the dataset, AfricaESIs; the
spatial neighbors, AfricaNbrs; the type of model, CAR; and its schematic formula, which consists of the
response variable, a tilde (~), and a list of predictor variables separated by plus symbols. An intercept is
automatically included by default. In figure 15.25, the response variable U5MORT is explained by correlation
with neighboring U5MORT values, an intercept (labeled (Intercept) in the report window), and the predictors INDOOR and WATAVL.
The research model is the following:
U5MORT ~ β1 + β2⋅INDOOR + β3⋅WATAVL
Data from Environmental Sustainability Index, Yale University.
Figure 15.25
However, after clicking Apply, S-PLUS reports that the weighted spatial neighbor matrix is not symmetric,
and the model does not run. Indeed, the conditional autoregressive model requires a symmetric matrix of spatial
weights, in which the weight of region j for region i equals the weight of region i for region j. A symmetric
neighbor object can be created with
AfricaNbrNew <- spatial.neighbor(row.id=AfricaNbr2$row.id, col.id=AfricaNbr2$col.id, AfricaNbr2$weights,
symmetric=T)
With the new neighborhood object, AfricaNbrNew, the conditional autoregressive model will work.
In addition, a variable with values inversely proportional to each country's population can be created,
AfricaESIs$weights <- 10/AfricaESIs$POP.CNTRY
and used as an estimate of the relative data variances, so that rates based on larger populations are treated
as more precise.
Fitted coefficients β1, β2, and β3 are presented in the Report1 window in the Value column (see figure 15.26).
P-values in the column Pr(>|t|) indicate that the only significant explanatory variable, with a very small p-value, is
INDOOR (indoor air pollution from solid fuel use). The indicator WATAVL, freshwater availability per capita,
is not considered significant because its p-value is much greater than 0.05.
Data from Environmental Sustainability Index, Yale University.
Figure 15.26
The response variable U5MORT in the neighboring countries helps to explain child mortality data in a
particular country, since its p‐value 0.01 is smaller than 0.05. However, the estimated parameter of spatial
autocorrelation ρ (rho) is small and negative, −0.006. The range of values that ρ can take on in this model can
be found from the eigenvalues of the spatial weights matrix using the following code:
sw <- spatial.weights(AfricaNbrNew)
esw <- eigen(sw)
1/min(esw$values)   # lower bound for rho
1/max(esw$values)   # upper bound for rho
Therefore, possible ρ values lie in the interval [−0.91, 0.48].
The Correlation of Coefficient Estimates part of the report may help in understanding the relationships
between pairs of variables. For example, a researcher may wonder why freshwater availability (WATAVL)
negatively correlates with indoor air pollution.
To analyze relationships of child mortality to other indicators, only countries where all these indicators are
available can be used. In this case, only six predictors collected in 32 countries can be used, since in
S+SpatialStats, NoData values are not allowed in the regression modeling.
Figure 15.27 shows countries where 13 indicators were calculated.
Data from Environmental Sustainability Index, Yale University.
Figure 15.27
Figure 15.28 shows the result of another run of spatial regression modeling, this time using the explanatory
variable UND.NO, the percentage of undernourished children in the total population. According to the output
of the spatial regression model, indicator UND.NO can be used to explain child mortality rates in African
countries, since its p‐value is very small.
Figure 15.28
ASSIGNMENTS
1) REPEAT THE BANGLADESH CASE STUDY USING ANOTHER SUBSET OF THE
DATA.
If you have SAS software, repeat the Bangladesh case study using another subset of the data. Data can be
downloaded from http://bicn.com/acic/resources/arsenic-on-the-www/data.htm.
Additional exercises using SAS can be found in appendix D.
2) USE MATLAB FOR NON‐GAUSSIAN DISJUNCTIVE KRIGING.
Xavier Emery provides MATLAB code with the paper
Emery, X. 2006. “A disjunctive kriging program for assessing point-support conditional distributions.”
Computers & Geosciences 32 (7): 965–983.
This paper discusses a generalization of disjunctive kriging to a set of data distributions including the gamma,
Poisson, binomial, and negative binomial. Two tutorial datasets are provided, which are analyzed assuming that
the data follow gamma and negative binomial distributions.
If you have access to MATLAB,
1. Repeat the author’s analysis.
2. Use the code provided with the paper for mapping binomial data from assignment 2 of chapter 11
(Smoothing data on the tapeworm infection in red foxes).
3) USE THE CAR MODEL FROM S+SPATIALSTATS MODULE FOR ANALYSIS OF INFANT
MORTALITY DATA COLLECTED IN NORTH CAROLINA FROM 1995–1999.
If you have S-PLUS and the S+SpatialStats module, use the CAR model for analysis of infant mortality data
collected in North Carolina from 1995–1999, as shown in figure 15.29. Data are in the assignment 15.3 folder.
Note that similar data are analyzed in chapter 16.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 15.29
FURTHER READING
1. The SAS user’s guide is available online at http://v8doc.sas.com/sashtml/stat/index.htm and
http://support.sas.com/documentation/onlinedoc/91pdf/index_913.html.
2. The Econometrics Toolbox for MATLAB, written by James LeSage, can be found at
http://www.spatial-econometrics.com. The author states that “Anyone is free to use these
routines, no attribution (or blame) need be placed on the author/authors.”
3. The Spatial Statistics Toolbox, http://www.spatial-statistics.com/index.htm.
The Spatial Statistics Toolbox Web site contains public domain spatial software, written in MATLAB, capable
of using spatial autoregression models (simultaneous spatial autoregression and conditional spatial
autoregression; SAR and CAR) with very large datasets, up to 1 million observations. The “Spatial Statistics
Articles” section provides articles using some of the techniques in the toolbox.
4. Martinez, A. R., and W. L. Martinez. 2007. Computational Statistics Handbook with MATLAB. CRC Press.
This book describes some of the most commonly used techniques in computational statistics. The authors
provide MATLAB codes, example files, and datasets used in the book. Chapter 12 shows examples of basic
point pattern analysis.
5. Kaluzny, S. P., S. C. Vega, T. P. Cardoso, and A. Shelly. 1998. S+SpatialStats: User’s Manual for Windows and
UNIX. Springer.
This book describes a spatial statistics extension to S‐PLUS software.
USING FREEWARE R STATISTICAL
PACKAGES FOR SPATIAL DATA
ANALYSIS
ANALYSIS OF THE DISTRIBUTION OF AIR QUALITY MONITORING STATIONS IN
CALIFORNIA USING THE R SPLANCS PACKAGE
EPIDEMIOLOGICAL DATA ANALYSIS USING THE R ENVIRONMENT AND SPDEP
PACKAGE
ANALYSIS OF THE RELATIONSHIPS BETWEEN TWO TYPES OF CRIME EVENTS
USING THE SPLANCS PACKAGE
CLUSTER ANALYSIS USING THE MCLUST PACKAGE
ASSIGNMENTS
1) SIMULATE SPATIAL PROCESSES WITH THE SPATSTAT PACKAGE
2) REPEAT THE ANALYSIS OF INFANT MORTALITY USING DATA COLLECTED IN
NORTH CAROLINA FROM 1995 TO 1999
3) REPEAT THE ANALYSIS OF THE RELATIONSHIPS BETWEEN ROBBERY AND AUTO
THEFT CRIME EVENTS USING THE SPLANCS PACKAGE WITH 1998 REDLANDS
DATA
4) ESTIMATE THE DENSITY AND CLUSTERING OF GRAY WHALES NEAR THE
COASTLINE OF FLORES ISLAND
5) TEST FOR THE SPATIAL EFFECTS AROUND A PUTATIVE SOURCE OF HEALTH RISK
FURTHER READING
In this chapter, we present several packages for spatial data analysis written in R, a freely available
language and environment for statistical computing and graphics, which provides a wide variety of
statistical and graphic techniques.
The programming languages R and S‐PLUS (discussed in chapter 15) look almost the same to most users. If
you know one, you can use the other. Among the differences is data handling: R keeps all data in the main
memory while S‐PLUS writes data out to files after each execution.
R is an interpreted language. R programs are usually edited in a text editor and then pasted to the R console.
Help functions (usually they are very short) are provided by authors of the packages, and the source code is
usually accessible. While the authors of R packages strive to create the highest‐quality product possible, R is,
nevertheless, free software, and each package may not be subjected to the rigorous beta testing found with
commercial software. When a problem arises, the user should contact the package developer(s) for support.
Reference 5 in “Further reading” provides a link to the Web site with a list of R packages for spatial statistical
analysis. Geostatistical packages are not discussed in this chapter because we assume that readers of this
book have access to the comprehensive and better tested Geostatistical Analyst extension to ArcGIS (the
software tutorial can be found in appendix 1). There are several packages for regional data analysis, and
examples of the usage of the most popular package, spdep, are presented below. At the time of this writing, R is the
only statistical environment for doing modern point pattern analysis, and we illustrate the use of the
packages called splancs and mclust. It should be noted that a large number of the examples in chapter 13 were
prepared using the most comprehensive point pattern analysis package, spatstat.
An introduction on the installation and use of R is provided in appendix 2. Reading chapters 11‐13 on
regional and point pattern analysis aids in understanding the case studies in this chapter.
ANALYSIS OF THE DISTRIBUTION OF AIR QUALITY MONITORING STATIONS IN
CALIFORNIA USING THE R SPLANCS PACKAGE
In chapter 13, the distribution of data locations within a study area was described by the estimated K function
K(h), where h is the distance between pairs of points. The observed K function values can be compared to the
K functions calculated from simulated realizations of a given spatial point process model. The results are displayed
by recording the estimated K(h) for each simulation and plotting the largest and smallest values for each h as a
simulation envelope. An observed K(h) inside the envelope means that the process used in the simulation could have
generated the observed pattern.
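Before turning to splancs, the idea of a simulation envelope can be illustrated with the spatstat package (used for many of the figures in chapter 13); the pattern below is simulated on the unit square and has nothing to do with the ozone data.
# K-function simulation envelope under complete spatial randomness (CSR).
library(spatstat)
set.seed(8)
pp <- rpoispp(100)                    # a simulated Poisson point pattern
env <- envelope(pp, Kest, nsim = 30)  # envelope from 30 CSR simulations
plot(env)  # an observed K inside the envelope is consistent with CSR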
In this section, we will test the locations of the one‐hour maximum ozone concentration measured in June
1999 in California for the presence of a clustered process.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 16.1
Figure 16.2 at left shows the R Console with a code that initializes two packages; reads DBF files with the
ozone measurement locations and the California border using the package foreign, written by Saikat DebRoy
and Roger Bivand; and draws the data in the window at the right.
The x‐ and y‐coordinates of monitoring locations and the California border are converted to the point pattern
objects o3.spp and poly.spp. The bounding box around the California border points is calculated using the R
range function. Points are displayed using the plot function, and a polygon is drawn using the polymap
function. The original coordinates were divided by 100,000 to avoid using very large values.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 16.2
The code in figure 16.3 at left uses the splancs package written by Barry Rowlingson and Peter Diggle and
adapted and packaged for R by Roger Bivand.
The estimated K function is displayed as circles for 50 equally spaced distances in figure 16.3 at right.
The meaning of the estimated K function can be understood by comparing it with the K function
calculated for the appropriate theoretical point process. Looking at the ozone data in figure 16.2 at right, one
could conclude that the points are clustered. Therefore, the K function calculated for a cluster process with
parameters estimated from the ozone data locations can be used for comparison.
The pcp function estimates the intensity of the parent process of a Poisson cluster process and the mean squared
distance of offspring from their parents. The average number of offspring per parent, m, is also calculated and is equal to
23.2.
Then a random number generator is initialized, and kenv.pcp computes the envelope of the estimated K function
from nsim=30 simulations of a Poisson cluster process over the California territory.
The upper and lower bounds of the envelope and the mean of the simulations are displayed by dashed lines.
Because the K function calculated using the locations of the ozone measurements lies inside the envelope and close to
the mean of the Poisson cluster process simulations, we conclude that the air quality monitoring stations in
California form a clustered pattern similar to the estimated Poisson cluster process.
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 16.3
From California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
Figure 16.4
It is no surprise that ozone monitoring stations are clustered around large population centers—this is done
purposely. Confirmation that ozone data locations are distributed nonrandomly is important because most of
the interpolation models, including kriging, require random, not clustered, data sampling. Therefore,
interpolation models that do not address clustering of observations may produce nonoptimal results.
The following code
sims <- pcp.sim(pcp.fit$par[2], m, pcp.fit$par[1], poly.spp)
polymap(poly.spp)
pointmap(as.points(sims), add=TRUE, col="red")
was used to simulate points from the Poisson cluster process with parameters estimated using the function
pcp(); see figure 16.5 at left.
Cluster centers can be visualized using just one point per cluster:
sims1 <- pcp.sim(pcp.fit$par[2], 1, pcp.fit$par[1], poly.spp)
windows()
polymap(poly.spp)
pointmap(as.points(sims1), add=TRUE, pch="+", col="blue")
Three simulated point patterns of possible cluster centers (blue crosses) are shown in the center and right
parts of figure 16.5. The windows() command in the code above opens a new graphics window for each new plot.
EPIDEMIOLOGICAL DATA ANALYSIS USING THE R ENVIRONMENT AND THE SPDEP
PACKAGE
This section will consider data on the number of babies who died in Georgia during the period 1995–1999.
Data for U.S. counties appear in the National Atlas of the United States,
http://nationalatlas.gov/atlasftp.html. The choice of the state of Georgia is arbitrary.
Figure 16.6, created in ArcGIS, shows a choropleth map of the death rates (the number of deaths per 1,000
live births) in Georgia counties and the raw number of deaths. The data table includes
county centroids; per capita personal income for each county; the percentage of mothers who smoked during
pregnancy; and the rates of women of ages 10–14, 15–17, 18–19, 20–29, 30–39, and 40–54 who had a live‐
born infant per 1,000 women of that age. This additional data can be used in the spatial regression modeling
in an attempt to explain high and low infant mortality rates.
Per capita personal income is the total personal income of the residents of a given area divided by the
resident population of the county. All dollar estimates are in 2002 dollars, not adjusted for inflation.
The percentage of mothers who smoked during pregnancy is the number of women who gave birth to live
infants and who smoked tobacco at any time during pregnancy, per 100 women who gave birth to live infants.
This section will show how to:
Read the Georgia infant mortality shapefile in R and write the results of calculations to a new
shapefile.
Calculate expected counts of cases and relative risks, assuming that data follow a Poisson
distribution.
Create and visualize a Choynowski probability map.
Create two types of spatial neighborhoods and define associated weights.
Visualize and compare spatial neighborhoods.
Calculate and display local empirical Bayes statistics.
Calculate global Moran’s I statistics.
Calculate and display local Moran’s I statistics.
Examine the spatial relationship between death rates and per capita personal income, mothers’ ages,
and smoking habits using a simultaneous autoregressive model.
In assignment 2 for this chapter, a reader can repeat the analysis using a similar dataset with epidemiological
data collected in North Carolina.
In the following text, R code is in italics and R text output is in green.
The default folder for the data can be set up using the command
setwd("c:\\book_data\\infant_mortality").
A maptools package created by Nicholas J. Lewin‐Koh for reading shapefiles is loaded using the command
library(maptools).
Then a shapefile with infant mortality data is read using the command
data_im <- read.shape("Georgia_infant_mortality_income").
im_polys <- Map2poly(data_im, region.id=as.character(data_im$att.data$FIPS))
im_df <- data_im$att.data
rownames(im_df) <- as.character(im_df$COUNTY)
A data frame in R corresponds to a dataset: a list of variables of the same length. The str() function
prints a compact listing of the variables and their names.
str(im_df)
'data.frame': 159 obs. of 20 variables:
$ STATE_NAME: Factor w/ 1 level "Georgia": 1 1 1 1 1 1 1 1 1 1 ...
$ FIPS : Factor w/ 159 levels "13001","13003",..: 119 155 105 23 41 ...
$ POP2000 : int 15050 83525 36506 53282 15154 61053 9319 17289 ...
$ FEMALES : int 7623 41491 18264 27499 7733 31412 4908 8792 10232 ...
$ COUNTY : Factor w/ 159 levels "Appling Cou..",..: 119 155 105 23 41 ...
$ YEAR : Factor w/ 1 level "199599": 1 1 1 1 1 1 1 1 1 1 ...
$ birth : int 438 2557 1258 1698 447 1955 305 495 623 740 ...
$ x : num 1138733 999410 1019195 982788 949983 ...
$ y : num 1388765 1362953 1363286 1372267 1363185 ...
$ death_rate: num 7.08 7.46 7.20 7.36 7.39 ...
$ death : int 3 19 9 12 3 15 2 4 4 5 ...
$ b_rate_all: num 57.5 61.6 68.9 61.8 57.8 ...
$ smok_all : num 17.9 17.9 18.2 17.0 18.8 ...
$ AGE10_14 : num 0.977 1.300 1.215 1.335 1.296 ...
$ AGE15_17 : num 35.5 45.0 52.3 44.6 40.3 ...
$ AGE18_19 : num 90.4 115.9 136.3 114.0 106.0 ...
$ AGE20_29 : num 108 113 130 113 107 ...
$ AGE30_39 : num 39.0 39.4 37.7 41.4 38.0 ...
$ AGE40_54 : num 1.99 2.07 2.13 2.13 1.89 ...
$ pc_income : int 23720 26485 20400 23086 21463 22201 25588 23270 ...
attr(, "data_types")= chr "C" "C" "N" "N" ...
All the data can be displayed by printing the name of the data frame, for example, by typing the command
im_df.
Next, load the spdep package for the analysis and the ColorBrewer palettes package for creating nice‐looking
palettes:
library(spdep)
library(RColorBrewer)
(The spdep package is created by Roger Bivand, and the ColorBrewer package is created by Cynthia Brewer.)
The spdep probmap function
pmap <- probmap(im_df$death, im_df$birth)
calculates the expected number (count) of deaths in county i as
expected_i = birth_i ⋅ (Σ death / Σ birth),
that is, the county's number of births multiplied by the overall death rate.
Relative risk (the ratio of the observed and expected counts of cases multiplied by 100) in county i is calculated as
relRisk_i = 100 ⋅ death_i / expected_i.
The probability of getting a more extreme count in county i than the one actually observed is calculated from a
Poisson distribution with the expected (mean) value expected_i.
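The same quantities can be computed by hand for a few made-up counts, which may help in checking the formulas; the exact probability convention used by probmap may differ slightly from the one sketched here.
# Expected counts, relative risks, and Poisson probabilities "by hand".
death <- c(3, 19, 9)                          # made-up observed deaths
birth <- c(438, 2557, 1258)                   # made-up births
expected <- birth * sum(death) / sum(birth)   # births times the overall rate
relRisk  <- 100 * death / expected            # relative risk, in percent
p_low    <- ppois(death, lambda = expected)   # P(count <= observed)
cbind(expected, relRisk, p_low)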
The summary(pmap) command displays a summary of the results of the calculations (the ratio of deaths to births,
the expected number of deaths, the relative risk, and the estimated probability).
Using this information, the data classes for visualizing the relative risk (relRisk) and the probability map
(pmap) can be the following:
brks_relRisk <- c(59, 90, 107, 128, 185) and
brks_prob <- c(0.022, 0.05, 0.3, 0.7, 0.95, 0.975, 1),
where function c (short for concatenate) creates a vector of values. In other words, the classification is chosen
based on the data quartiles for relative risk and data intervals for highlighting very small and very large
values of the estimated probabilities.
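For example, the relative risk breaks could be derived from the quartiles of the probmap output (a sketch; the rounded values above were chosen by the author):
round(quantile(pmap$relRisk, probs = seq(0, 1, 0.25)))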
The relative risk and probability maps are displayed in figure 16.7 using the “Accent” and “PuOr” palettes as
follows.
cols <- brewer.pal(5, "Accent")
plot(im_polys, col=cols[findInterval(pmap$relRisk, brks_relRisk, all.inside=TRUE)], forcefill=FALSE,
main="Relative risk")
legend(c(13e5,13e5), c(14e5,10e5), fill=cols, legend=leglabs(brks_relRisk), bty="n", ncol=1, x.intersp=0.9,
y.intersp=0.9)
cols <- rev(brewer.pal(7, "PuOr"))
plot(im_polys, col=cols[findInterval(pmap$pmap, brks_prob, all.inside=TRUE)], forcefill=FALSE, main="Poisson
probability map")
legend(c(13e5,13e5), c(14e5,10e5), fill=cols, legend=leglabs(brks_prob), bty="n", ncol=1, x.intersp=0.9,
y.intersp=0.9)
The position of the legend was selected manually.
Another variant of probability mapping is the Choynowski approach, which tests the hypothesis of homogeneity of the rates,
p_i = count_i / population_i = const, that is, p_1 = p_2 = p_3 = … = p_N.
The code for calculation and visualization of a Choynowski probability map using the spdep package is the following:
ch <- choynowski(im_df$death, im_df$birth)
The summary() function is used again to specify breaks for the data visualization.
summary(ch)
pmap:
Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
0.003236 0.268300 0.381400 Inf 0.502400 Inf 2.0
type:
Mode FALSE TRUE
Logical 93 66
Here, a TRUE value is assigned to the county if the observed number of deaths is less than expected.
The code for visualizing the Choynowski probability map is the following:
brks_ch <- c(0.003, 0.05, 0.25, 0.5, 0.95, 1)
cols <- rev(brewer.pal(6, "BrBG"))
plot(im_polys, col=cols[findInterval(ch$pmap, brks_ch, all.inside=TRUE)], forcefill=FALSE, main="Choynowski
probability map")
legend(c(13e5,13e5), c(14e5,10e5), fill=cols, legend=leglabs(brks_ch), bty="n", ncol=1, x.intersp=0.9,
y.intersp=0.9)
The resulting map is shown in figure 16.8. The white color indicates counties with an unusually high number
of deaths.
So far, we have not used information about data in neighboring counties. That is, our analyses have been
nonspatial. The first step in spatial regional data analysis is defining neighbors and their weights for each
geographic object. The spdep package provides several functions for neighborhood definition, and we will use
two of them.
Some of the functions below require a variable that consists of the polygons’ centroid coordinates in the form
of a two‐column matrix. Two vectors with coordinates of the centroids can be glued together using the cbind()
function:
im_cents <- cbind(im_df$x, im_df$y)
The most intuitive way to define neighbors is based on the common borders of polygons. Command poly2nb
creates a list of such neighbors.
im_cont_nb <- poly2nb(im_polys)
Polygons that share a common border have a weight of 1, and nonadjacent polygons have a weight of zero. It
seems natural to make weights proportional to the length of the common border, but a function for
calculation of the length is not yet available in spdep.
Lines in figure 16.9 show adjacent Georgia counties that form data neighborhoods. This graph was created
using the following commands:
plot(im_polys, border="gray")
plot(im_cont_nb, im_cents, add=TRUE)
Using the summary() function, useful information about neighborhood features, including information on the
number of polygons with 1, 2, 3, …, 10 neighbors, can be printed:
summary(im_cont_nb, im_cents)
Neighbor list object:
Number of regions: 159
Number of nonzero links: 856
Percentage of nonzero weights: 3.385942
Average number of links: 5.383648
Link number distribution:
1 2 3 4 5 6 7 8 9 10
1 4 12 27 38 40 28 6 1 2
One least‐connected region:
13083 with one link
Two most‐connected regions:
13121, 13107 with 10 links
Summary of link distances:
Min. 1st Qu. Median Mean 3rd Qu. Max.
15140 28840 33330 34450 39680 68790
The second neighborhood definition is based on the Euclidean distance between polygon centroids; counties whose centroids lie within 55,000 meters of each other are treated as neighbors:
k1 <- knn2nb(knearneigh(im_cents))   # each county's single nearest neighbor
nb_Euclid <- dnearneigh(im_cents, 0, 55000, row.names=rownames(im_df))
summary(nb_Euclid, im_cents)
Neighbor list object:
Number of regions: 159
Number of nonzero links: 1272
Percentage of nonzero weights: 5.031447
Average number of links: 8
Link number distribution:
3 4 5 6 7 8 9 10 11 12 13
5 10 11 17 24 21 27 21 11 6 6
Five least‐connected regions:
Seminole County, Charlton County, Echols County, Camden County, and Chatham County with three links
Six most‐connected regions:
Jackson County, Barrow County, Rockdale County, Newton County, Henry County, and Butts County with 13
links
Summary of link distances:
Min. 1st Qu. Median Mean 3rd Qu. Max.
15140 30810 38440 38520 46550 54980
Neighbors based on the Euclidean distance between centroids can be visualized using the following code (see
figure 16.10):
plot(im_polys, border="grey")
plot(nb_Euclid, im_cents, add=TRUE)
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 16.10
plot(im_polys, border="grey")
plot(im_cont_nb, im_cents, add=TRUE)
plot(diffnb(im_cont_nb, nb_Euclid), im_cents, col="red", add=TRUE)
In figure 16.11, red lines show links that belong to neighbors according to the Euclidean distance method but
are not included in the neighborhoods based on the common border method.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 16.11
The following code calculates the local empirical Bayes rates (see “Spatial smoothing” in chapter 11) using two different definitions of neighbors. Classes for visualization are defined by quantiles of the estimated rates.
EB_cont <- EBlocal(im_df$death, im_df$birth, im_cont_nb)
brks_EBL <- round(quantile(EB_cont$est1000, seq(0,1,1/6)),2)
plot(im_polys, col=cols[findInterval(EB_cont$est1000, brks_EBL, all.inside=TRUE)], forcefill=FALSE, main="Local
Empirical Bayes rate per 1000, contiguity weights")
legend(c(13e5,13e5), c(14e5,10e5), fill=cols, legend=leglabs(brks_EBL), bty="n", ncol=1, x.intersp=0.9,
y.intersp=0.9)
EB_Euclid <- EBlocal(im_df$death, im_df$birth, nb_Euclid)
plot(im_polys, col=cols[findInterval(EB_Euclid$est1000, brks_EBL, all.inside=TRUE)], forcefill=FALSE,
main="Local Empirical Bayes rate per 1000, inverse distance weights")
legend(c(13e5,13e5), c(14e5,10e5), fill=cols, legend=leglabs(brks_EBL), bty="n", ncol=1, x.intersp=0.9,
y.intersp=0.9)
The resulting maps created in the R environment are shown in figure 16.12.
The result of the empirical Bayes calculations of the rates can be written to a shapefile using the following
code:
write.polylistShape(im_polys, data.frame(data_im$att.data$FIPS, EB_cont=EB_cont$est1000,
EB_Euclid=EB_Euclid$est1000), file="GA_EB")
Figure 16.13 uses ArcMap to visualize the proportion of the empirical Bayes rates using contiguity and
inverse distance weights. The difference between estimated rates is not large, and both neighborhoods can be
used for the exploration and modeling of infant mortality in Georgia.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 16.13
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 16.14
The permutation algorithm is the following: randomly rearrange all the values, compute Moran’s I, and repeat
this many times (for example, 1,000). The permutation method tries to determine the range of values
generated for Moran’s I when there is no spatial autocorrelation. If the original Moran’s I value is large
compared to all the permutations of Moran’s I, there is more positive autocorrelation than expected under a
random data pattern hypothesis.
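In spdep this permutation procedure is available as the moran.mc() function. A minimal sketch, assuming the im_df data frame and the im_cont_nb neighbor list created above (999 permutations is an arbitrary choice):
# Monte Carlo (permutation) test of Moran's I for the infant mortality rates
moran.mc(1000*im_df$death/im_df$birth, nb2listw(im_cont_nb, style="W"), nsim=999, zero.policy=TRUE)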
Each statistical test has an associated null hypothesis, and the p-value is a measure of the evidence against the null hypothesis. The null hypothesis of the Moran's I test is that the data are spatially uncorrelated. To get the p-value, count the proportion of the permutation-based Moran's I values that are greater than the one observed in the given data. Traditionally, researchers reject a hypothesis if the p-value is less than 0.05, that is, if data at least as extreme as those observed would arise less than 5 percent of the time under the null hypothesis, in this case complete spatial randomness.
The general rule is that a small p‐value is evidence against the null hypothesis, whereas a large p‐value means
there is little evidence against the null hypothesis. Note that a failure to reject the null hypothesis does not
mean that the null hypothesis is true; there could be many other hypotheses that will also not be rejected by
the same test.
However, in spdep, Moran's I "under randomization" is not a permutation test as such but a test based on an asymptotic formula with the number of samples and the neighbors' weights as parameters.
moran.test(1000*im_df$death/im_df$birth, nb2listw(im_cont_nb, style="W"), randomisation=TRUE,
zero.policy=TRUE, alternative="greater")
Moran I statistic standard deviate = 14.0754, p‐value < 2.2e‐16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
0.685307327 ‐0.006329114 0.002414548
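The reported standard deviate can be reproduced from the printed estimates:
(0.685307327 - (-0.006329114)) / sqrt(0.002414548)   # approximately 14.0754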
According to the randomization test, there is strong evidence that infant mortality rate data are clustered.
Because data are clustered, the second step is finding where neighboring data are more alike. Local Moran’s I
with contiguity weights can be calculated and visualized using the following code:
im_localI <- localmoran(1000*im_df$death/im_df$birth, nb2listw(im_cont_nb, style="B"), zero.policy=TRUE)
localI <- as.data.frame(im_localI)
brks_LM <- c(-Inf, -1, -0.5, 0.5, 1, 2.5, 4, +Inf)
cols <- rev(brewer.pal(7, "PuOr"))
plot(im_polys, col=cols[findInterval(localI$Z.Ii, brks_LM, all.inside=TRUE)], forcefill=FALSE, main="Local
Moran's I, randomization standard deviates")
legend(c(13e5,13e5), c(14e5,10e5), fill=cols, legend=leglabs(brks_LM), bty="n", ncol=1, x.intersp=0.9,
y.intersp=0.9)
Figure 16.15 shows a map of a local Moran’s I index of spatial association. Counties with large Moran’s I
values (brown) have death‐rate values similar to those of their neighbors. As in the previous examples, the
color intervals were selected based on the summary statistics summary(localI).
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 16.15
Differences in the means and variances of the rates caused by their denominators (the widely varying numbers of births among counties) often lead to rejection of the null hypothesis of no spatial autocorrelation simply because of heterogeneity in the population distribution.
The alternative approach implemented in spdep is to treat the data as normally distributed with a constant mean and variance. In this case, exact formulas for the expected value E(I) and the variance var(I) of Moran's I are available under the null hypothesis of no spatial autocorrelation. To make the test, the standardized statistic (I − E(I)) / √var(I) is computed and compared with z1−α/2, the 1−α/2 percentage point of the Gaussian distribution (usually 1.96 for a 0.05-sized test). This approach also assumes constant mean and variance, and it can be better than the permutation method if the normality assumption is verified and justified.
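In spdep, this normal-theory variant of the test is obtained by turning off the randomization option (a sketch, using the same rates and weights as above; note the British spelling of the argument name):
moran.test(1000*im_df$death/im_df$birth, nb2listw(im_cont_nb, style="W"),
randomisation=FALSE, zero.policy=TRUE, alternative="greater")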
The approach accepted in modern statistical literature is Monte Carlo simulation from the appropriate model.
In this approach, we would generate realizations from a specified univariate distribution that describes the
data, calculate Moran’s I for each, then compute the p‐value as in the permutation approach.
The Monte Carlo approach is more flexible than the approaches based on permutation and data normality since realizations can be generated from any distribution, including the Poisson or binomial, which are very useful for aggregated data such as epidemiological, crime, and census data. The Monte Carlo approach also allows relaxing the assumption of data stationarity; that is, it allows the population sizes (in this example, the number of live births) to differ among counties.
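A sketch of such a Monte Carlo test, assuming the objects created above and the expCount vector from the probmap sketch (999 simulations and the independent-Poisson null model are illustration choices):
lw <- nb2listw(im_cont_nb, style="W")
obs_I <- moran(1000*im_df$death/im_df$birth, lw, length(im_cont_nb), Szero(lw))$I
sim_I <- replicate(999, {
  d_sim <- rpois(length(expCount), lambda=expCount)   # simulated death counts
  moran(1000*d_sim/im_df$birth, lw, length(im_cont_nb), Szero(lw))$I
})
p_value <- (sum(sim_I >= obs_I) + 1) / (length(sim_I) + 1)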
The Monte Carlo approach can also test null hypotheses other than the hypothesis of spatially uncorrelated data. For example, correlated Gaussian data could be generated from a simultaneous autoregressive model with a specified correlation coefficient, perhaps ρ = 0.4, and then used to test this null hypothesis against the alternative hypothesis that ρ ≠ 0.4.
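A sketch of generating one such realization from a simultaneous autoregressive error model with ρ = 0.4 (W, rho, and x_sim are illustration names; repeating the simulation many times and recomputing Moran's I would give the reference distribution):
W <- nb2mat(im_cont_nb, style="W", zero.policy=TRUE)   # row-standardized weights matrix
rho <- 0.4
eps <- rnorm(nrow(W))
x_sim <- solve(diag(nrow(W)) - rho * W, eps)           # (I - rho*W)^(-1) %*% eps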
The next spdep function that will be used in this case study is the simultaneous autoregressive model (see
chapter 11). The intercept and four variables, with coefficients α0–α4 of the following model, will be used in
an attempt to explain the death rates:
Infant mortality rate_fit =
α0 +
α1·(per capita income) +
α2·(percentage of mothers who smoked during pregnancy) +
α3·(rate of 10- to 14-year-old women who had a live-born infant) +
α4·(rate of 40- to 54-year-old women who had a live-born infant)
A code that performs estimation of the coefficients is the following:
pc_income_scale <- 0.001*im_df$pc_income
sar_im <- errorsarlm(1000*im_df$death/im_df$birth ~ pc_income_scale + im_df$smok_all + im_df$AGE10_14 +
im_df$AGE40_54, listw=nb2listw(im_cont_nb, style="B"), zero.policy=TRUE)
The summary(sar_im) command prints a report of the model parameters including the following:
1. Table 16.1 is a summary of statistics on the residuals; that is, the difference of
infant mortality rate data − infant mortality rate fit.
2. Table 16.2 gives the estimated coefficients of the infant mortality rates model.
The first column names the coefficient; the second gives the estimate; the third is the standard error of the estimate; the fourth is the ratio of the estimate to its standard error (called z-value in spdep); and the last column is the p-value (called Pr(>|z|) in spdep), the probability of observing a value at least as extreme as the estimated one under the null hypothesis of a zero coefficient αi. If the p-value is less than 0.05, researchers usually reject the null hypothesis of no effect and conclude that the variable has a significant effect on the outcome.
The estimates (e.g., 1.354204 for variable AGE10_14) are the estimated αi coefficients in the simultaneous
autoregressive model. These give the relative contribution of each of the variables in explaining infant
mortality. Associated with each estimate is a standard error that measures the uncertainty in these estimates.
Two variables are inversely related to infant mortality, since their estimated coefficients are negative. Only
two out of five variables contribute significantly to the explanation of infant mortality in Georgia: rates in the
neighboring counties (intercept) and the rate of 10‐ to 14‐year‐old females who had a live‐born infant,
because their p-values are close to zero. Per capita income, smoking habits, and the rate of 40- to 54-year-old women who had a live-born infant are not significant because their p-values are greater than 0.18. Indeed, the uncertainties of their estimated coefficients are too large for the estimates to be useful in decision making. For example, the estimated coefficient α4 (for the variable im_df$AGE40_54) has a 95-percent confidence interval of [−0.2, 0.5]; that is, it can be either positive or negative.
3. Lambda: 0.12881 p‐value: 4.9405e‐14
The estimated spatial autocorrelation parameter is lambda=0.12881 (it is called ρ in the discussion on the
simultaneous autoregressive model in chapter 11), and it appears to be significantly different from zero
because its p‐value is very small. Therefore, a hypothesis that neighboring data are not correlated should be
rejected.
The Akaike Information Criterion (AIC) indicates how well the model can be expected to predict new data; smaller values are better. The main use of AIC is for model comparison.
Package spdep has another function for fitting the simultaneous autoregressive model, spautolm. This
function has additional options, including weights specification, and additional output. The code below fits
the same regression model with weights equal to the number of births, which was justified in chapter 11.
sar_weighted <- spautolm(1000*im_df$death/im_df$birth ~ pc_income_scale + im_df$smok_all + im_df$AGE10_14
+ im_df$AGE40_54, listw=nb2listw(im_cont_nb), weights=im_df$birth, zero.policy=TRUE)
The following table displays the estimated coefficients of the infant mortality rates model with option
weights.
The estimated spatial autocorrelation parameter and AIC are as follows:
Lambda: 0.88698 p-value: < 2.22e-16
AIC: 495.52
We see that the signs of the coefficients α1 and α2 change (although the variables pc_income_scale and smok_all were and remain insignificant for explaining the infant mortality rates); the coefficient α4 becomes large, and the variable AGE40_54 becomes nearly significant; the prediction accuracy of the model with weights is better because its AIC is smaller; and the spatial autocorrelation parameter increases dramatically, meaning that the spatial dependence between neighboring observations is very strong according to the weighted model.
Next, we can try a model without the insignificant variables pc_income_scale and smok_all:
sar_weighted2var <- spautolm(1000*im_df$death/im_df$birth ~ im_df$AGE10_14 + im_df$AGE40_54, listw=
nb2listw(im_cont_nb), weights=im_df$birth, zero.policy=TRUE)
Output from this model is the following:
Figure 16.16 shows three components of the simultaneous autoregressive model: the nonspatial component of the fitted values (left), the spatial component of the fitted values (center), and the difference between the observed and fitted values (right). The model's components are stored in the objects sar_weighted2var$signal_trend, sar_weighted2var$signal_stochastic, and sar_weighted2var$residuals, respectively.
The semivariogram models in the top right corners in figure 16.16 show the spatial dependence of each
component of the simultaneous autoregressive model. We see that the first two components exhibit
anisotropic spatial correlation (note that the semivariogram surface is asymmetric), while the residuals are
spatially independent. The clear anisotropic structure of the spatial component of the fitted values suggests using spatial weights that depend on the direction of the lines connecting the region centroids, or adding a covariate that explains the directional variation in the rates.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 16.16
It is interesting to compare coefficient estimates using the nonspatial variant of linear regression discussed in
appendix 2 (that is, ignoring information about neighbors and their weights). The code for performing
traditional linear modeling is the following:
im_lm <- lm(formula = 1000*im_df$death/im_df$birth ~ pc_income_scale + im_df$smok_all + im_df$AGE10_14 +
im_df$AGE40_54, data=im_df)
summary(im_lm)
The following table shows the estimated regression coefficients. The smoking and older-age coefficients have small p-values. Based on this model, one might conclude that smoking and pregnancy at ages 40–54 help to reduce infant mortality in Georgia. The estimated model performance based on R² is
multiple R-squared = 0.5077, adjusted R-squared = 0.4949.
A simultaneous autoregressive model is a linear model with spatially correlated errors. When this spatial correlation is strong, as it appears to be for the infant mortality data, it has a large impact on the standard errors of the regression coefficients. The nonspatial linear regression fit ignores this correlation and in doing so reports biased standard errors. Therefore, when spatial correlation is present, results from nonspatial linear regression can be misleading and lead to erroneous conclusions, as in this attempt to explain the causes of infant mortality in Georgia.
It is possible to eliminate spatial autocorrelation within the observations by a data transformation that uses information on the eigenvalues and eigenvectors of the weights matrix. The idea of this approach
is to minimize spatial correlation instead of maximizing the fit of a model with spatial dependence. This is
called “spatial filtering” in geographical literature. Note that in statistical literature, “filtering” usually means
filtering out noise from the data with measurement errors (see the discussion on filtered kriging in the
section “Geostatistical model” in chapter 8).
There are several methods for replacing the spatial component of the fitted values by a linear combination of eigenvectors of the spatial weights matrix (see reference 6 in “Further reading”). These eigenvectors are independent of the explanatory variables, but they depend on the weights matrix, meaning that different weights matrices produce different eigenvectors. The following code calculates a feasible subset of the eigenvectors:
sar.filtered <- SpatialFiltering(1000*im_df$death/im_df$birth ~ pc_income_scale + im_df$smok_all +
im_df$AGE10_14 + im_df$AGE40_54, nb=im_cont_nb, ExactEV=TRUE)
The values of four selected eigenvectors are displayed in figure 16.17 together with estimated semivariogram
models. The most suitable models have different shapes and a very small nugget effect, which arises only
because the shortest distance between the region centroids is relatively large. It is worth remembering that
these four (and many more) spatial structures originated from the data displayed in figure 16.16, center.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 16.17
The filtered model is a linear regression that includes, in addition to the original explanatory variables, the selected eigenvectors with their own estimated regression coefficients. This model can be fitted using the following code:
lm.sar.nw <- lm(1000*im_df$death/im_df$birth ~ pc_income_scale + im_df$smok_all + im_df$AGE10_14 +
im_df$AGE40_54 + fitted(sar.filtered))
summary(lm.sar.nw)
We see that spatial dependence disappears after filtering because the estimated semivariogram model is the
nugget effect model.
Output from the linear model lm.sar.nw is shown in table 16.6. The estimated regression coefficients for the
intercept and four explanatory variables are the same, while their standard errors are different and this time
unbiased because eigenvectors are not correlated with the explanatory variables, and the residuals are
spatially independent (not shown here). Note that per capita income becomes significant in explaining the infant mortality data. However, the negative correlation between the mortality rates and both smoking and the rate of women aged 40–54 contradicts common sense. The estimated model performance,
multiple R-squared = 0.7701, adjusted R-squared = 0.746,
is much better than that of the linear model, but the improvement results from the addition of unknown factors that
are responsible for spatial dependence. Researchers should think about model improvement using additional
explanatory variables because the linear combination of eigenvectors can be interpreted as model
misspecification. Regression diagnostics discussed in chapter 12 and appendixes 2 and 4 may help to identify
problems other than unexplained spatial data correlation.
As discussed in chapter 12, a more appropriate model for counts of disease or death is the generalized linear
model, which can be fitted using the following code:
glm.base <- glm(im_df$death ~ offset(log(im_df$birth)) + pc_income_scale + im_df$smok_all + im_df$AGE10_14 +
im_df$AGE40_54, family="quasipoisson")
The linear combination of eigenvectors can be used with the generalized linear model similar to the linear
model discussed above:
glm.filtered <- glm(im_df$death ~ offset(log(im_df$birth)) + pc_income_scale + im_df$smok_all +
im_df$AGE10_14 + im_df$AGE40_54 + fitted(sar.filtered), family="quasipoisson")
Table 16.7 shows the estimated regression coefficients and their standard errors from the last model. Note
that all the explanatory variables are significant, and the regression coefficients have the same signs as in the
linear regression. This means that using a generalized linear model instead of a linear one does not really
improve our analysis of infant mortality data collected in Georgia when the spatial data dependence is
ignored.
Courtesy of National Atlas of the United States, http://nationalatlas.gov.
Figure 16.19
In this case study, we conclude that the simultaneous autoregressive model produces more understandable and more interpretable results than the linear and generalized linear models with spatial dependence removed. However, because the purpose of this exercise is to demonstrate the R software rather than to perform a rigorous epidemiological analysis, the question of an optimal statistical model for these particular regional data remains open.
It is often of interest to compare two point processes. To illustrate how this can be done using the splancs
package, a comparison will be made of auto theft (red, 288 events) and robberies (green, 68 events) in
Redlands, California, in 2000, shown in figure 16.20.
Data courtesy of Redlands Police Department, Redlands, Calif.
Figure 16.20
Figure 16.21 at right shows a raster image of the intensity of auto theft events and the contour lines of
robbery events along with crime locations. Figure 16.21 at left displays the code that calculated the raster and
contour lines and displayed them together with points. Coordinate scaling matters, so the original
coordinates were divided by 1,000.
Data courtesy of Redlands Police Department, Redlands, Calif.
Figure 16.21
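Because the code in figure 16.21 is reproduced only as an image, the following is a minimal sketch of how such an intensity raster could be created with splancs; theft_xy, robbery_xy, and redlands_poly are hypothetical objects (two-column coordinate matrices in meters and a boundary polygon), and the bandwidth h0 is an arbitrary illustration value:
library(splancs)
theft_km <- as.points(theft_xy/1000)          # scale coordinates from meters to kilometers
robbery_km <- as.points(robbery_xy/1000)
poly_km <- redlands_poly/1000
dens_theft <- kernel2d(theft_km, poly_km, h0=0.5, nx=100, ny=100)      # auto theft intensity
dens_robbery <- kernel2d(robbery_km, poly_km, h0=0.5, nx=100, ny=100)  # robbery intensity
image(dens_theft, col=terrain.colors(20))
contour(dens_robbery, add=TRUE)
points(theft_km, col="red", pch=3)
points(robbery_km, col="green", pch=1)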
To examine whether the two types of crime are spatially related, the K functions were estimated for both types of crime, as well as the bivariate K function, which estimates the expected number of robbery events within distance h of the auto theft events (see the code in the top left part of figure 16.22).
The estimated bivariate K function is shown as a line of tiny circles. The individual K functions for auto theft events (blue) and robbery events (green) are also shown. These lie close to the bivariate K function, suggesting that the same process describes both types of crime.
The bivariate K function for two independent processes, K12(h) = πh², is known exactly. It is shown as a red-dashed line in figure 16.22 and is clearly different from the bivariate K function calculated from the robbery and auto theft event locations, confirming that these two crimes are dependent.
Data courtesy of Redlands Police Department, Redlands, Calif.
Figure 16.22
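A sketch of the K-function comparison itself, continuing the hypothetical objects from the previous sketch (khat() and k12hat() are the splancs estimators of the individual and bivariate K functions; the distance grid is arbitrary):
h <- seq(0.1, 3, by=0.1)                      # distances in kilometers
K_theft <- khat(theft_km, poly_km, h)
K_robbery <- khat(robbery_km, poly_km, h)
K_biv <- k12hat(theft_km, robbery_km, poly_km, h)
plot(h, K_biv, ylab="K(h)")
lines(h, K_theft, col="blue")
lines(h, K_robbery, col="green")
lines(h, pi*h^2, col="red", lty=2)            # bivariate K for two independent processes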
Although the intensity maps of robberies and auto thefts look similar, there are differences between them. To
see the differences, the code at the left in figure 16.23 can be used, where the two intensities are divided.
Because the intensity is small or equal to zero in some areas, the estimated ratio of the two intensities is checked for division by zero, and a NODATA value is assigned to the resulting grid cell if the denominator, the robbery events intensity, is equal to zero.
Data courtesy of Redlands Police Department, Redlands, Calif.
Figure 16.23
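Because the code in figure 16.23 is also shown only as an image, a sketch of the same idea using the kernel2d results from the earlier sketch might look like this:
ratio <- dens_theft$z / dens_robbery$z
ratio[dens_robbery$z <= 0] <- NA              # NODATA where the robbery intensity is zero
image(dens_theft$x, dens_theft$y, ratio, col=heat.colors(20))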
It is possible to add a legend to the map in R or export the results of the data analysis to ArcMap for
visualization, as in figure 16.24. In this map, blue areas are where the estimated intensity of robberies is
larger than the estimated intensity of auto thefts. The light blue to light red colors show areas with an
increasing proportion of auto theft intensity over robbery intensity.
The map in figure 16.24 suggests that the intensity of auto thefts is greater than the intensity of robberies in
most parts of the city of Redlands (light red and yellow). This suggestion can be verified using the approaches
discussed in “Hierarchical modeling of point data” in chapter 13 (after collecting relevant explanatory data).
Data courtesy of Redlands Police Department, Redlands, Calif.
Figure 16.24
Cluster analysis helps to identify groups of points on the plane that are separated from other points. The simplest clustering algorithm, nearest neighbor clustering, repeats the following action:
find the nearest pair of points (or groups of points) and combine them into one group
until the required number of groups is reached or the distance between the nearest remaining pair becomes larger than a specified distance. Nearest neighbor clustering is a deterministic method, and it performs poorly when clusters are not well separated.
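A minimal sketch of nearest neighbor (single-linkage) clustering with base R, assuming a two-column table of point coordinates such as the mydata data frame created below (six groups is an arbitrary choice):
hc <- hclust(dist(mydata), method="single")   # single linkage = nearest neighbor clustering
groups <- cutree(hc, k=6)                     # or cutree(hc, h=some_distance) to cut at a distance
plot(mydata, col=groups, pch=19)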
Model-based clustering, implemented in the mclust02 package written by Chris Fraley and Adrian Raftery, is based on the idea that the observed data are a mixture of several populations with possibly different areas, shapes, and orientations. Model-based clustering is a probabilistic model: it provides the uncertainty of assigning each point to a cluster. The mclust package uses the Bayesian information criterion (BIC) to find the optimal cluster model.
We begin the case study by loading the mclust and foreign (for reading .dbf files) packages:
library(mclust02)
library(foreign)
Then we specify a folder with the data input and the results of calculations output:
setwd("c:\\book_data\\Bavarian_forest")
Data for this case study are stored in the file spruce_trees.dbf. They were described in the section called
“Forestry” in chapter 6. The following command reads the data to the variable trees:
trees <- read.dbf("spruce_trees.dbf")
As with several other packages, mclust works better with relative or normalized coordinates. We create a data
frame mydata with origin (0,0) using the following commands:
x_min <- min(trees$X_GK)
y_min <- min(trees$Y_GK)
mydata <- data.frame(x=trees$X_GK - x_min, y=trees$Y_GK - y_min)
In the code above, columns X_GK and Y_GK are data coordinates in the Gauss‐Kruger projection.
The following commands perform all the calculations. The result is stored in the variable treesModel:
treesBIC <- EMclust(mydata)
treesModel <- summary(treesBIC, mydata)
The command treesModel reports the result of the modeling. The most informative part for us is printed at the
end of the report:
best BIC values:
EEE,6 EEE,7 VVV,3
‐3059.498 ‐3075.882 ‐3061.573
best model: ellipsoidal, equal variance
This means that according to the Bayesian information criterion, the optimal model consists of six ellipsoidal
clusters with equal variance. The difference between Bayesian information criterion values for the optimal
EEE,6 (six clusters with equal shape, orientation, and variance) and VVV,3 (three clusters with variable shape,
orientation, and variance) models is small. Model EEE,6 is simpler than VVV,3 since it has a smaller number of covariance parameters.
The mclust package provides a function for mapping the result of data clustering. The classification and
uncertainty maps for the optimal EEE,6 model are displayed in figure 16.25 using the following commands
(we will show how to visualize model VVV,3 at the end of this section):
mclust2Dplot(mydata, type = "classification", z = treesModel$z, ask=FALSE,mu = treesModel$mu, sigma =
treesModel$sigma)
mclust2Dplot(mydata, type = "uncertainty", z = treesModel$z, ask=FALSE,mu = treesModel$mu, sigma =
treesModel$sigma)
In each map, six equal ellipses show the locations of clusters. Different symbols at the left are used to show
points assigned to different clusters. The size of the symbols at the right corresponds to the uncertainty in the
classification: the larger the symbol size, the larger the uncertainty.
Data courtesy of Bavarian Forest National Park.
Figure 16.25
The following code saves the result of the modeling (classification, uncertainty of classification, and
conditional probabilities that a point belongs to each of the six clusters) to the shapefile:
zz <- data.frame(x=trees$X_GK, y=trees$Y_GK, clust_n=treesModel$classification,
clas_uncert=treesModel$uncertainty, cond_prob1=treesModel$z[,1], cond_prob2=treesModel$z[,2],
cond_prob3=treesModel$z[,3], cond_prob4=treesModel$z[,4], cond_prob5=treesModel$z[,5],
cond_prob6=treesModel$z[,6])
write.dbf(zz, "m_clusters_50_stat.dbf")
Figure 16.26 shows interpolated maps of the uncertainty of the classification (at left, the variable
treesModel$uncertainty) and the conditional probability that a point belongs to the cluster number six with
points displayed as gray circles (at right, the variable treesModel$z[,6]) in ArcMap. The conditional probability
is large on the edges of the cluster and small in the cluster’s center.
The mclust package provides a function for estimating the density of points:
surfacePlot(mydata, mu = treesModel$mu, sigma = treesModel$sigma, pro=treesModel$pro, type = "contour",
what = "density", grid=100, nlevels = 8, transformation = "none")
The output of this function is contours of the points density shown in figure 16.27.
Data courtesy of Bavarian Forest National Park.
Figure 16.27
The density can also be calculated on a regular grid and displayed as an image with superimposed contours and data points (figure 16.28):
minx <- min(mydata$x)
miny <- min(mydata$y)
maxx <- max(mydata$x)
maxy <- max(mydata$y)
factor <- (maxx - minx)/(maxy - miny)
ncol <- 100
nrow <- as.integer(ncol/factor)
x <- grid1(ncol, range=c(minx,maxx))
y <- grid1(nrow, range=c(miny,maxy))
xyDens <- do.call("dens", c(list(data=grid2(x, y)), treesModel))
xyDens <- matrix(xyDens, ncol=length(y), nrow=length(x))
image(x, y, xyDens, col = terrain.colors(20), axes = FALSE)
contour(x, y, xyDens, add = TRUE, col = "peru")
axis(1, at = as.integer(10*x)/10)
axis(2, at = as.integer(10*y)/10)
points(mydata, pch=19, cex=.5)
Data courtesy of Bavarian Forest National Park.
Figure 16.28
The cluster density can be saved in dBASE format using the following code:
xy <- expand.grid(x + x_min, y + y_min)
zz <- data.frame(x=xy[,1], y=xy[,2], z=c(xyDens))
zz$z[is.na(zz$z)] <- -9999
write.dbf(zz, "m_clusters_50.dbf")
This density can then be visualized in ArcGIS as shown in figure 16.29.
The following code creates a model with three clusters of different size, orientation, and variance:
treesVVV <- EMclust(mydata, G = 3, modelName = "VVV")
treesModel <- summary(treesVVV, mydata)
Figure 16.30 shows the result of points classification (left) and points density estimation (right).
Data courtesy of Bavarian Forest National Park.
Figure 16.30
ASSIGNMENTS
1) SIMULATE SPATIAL PROCESSES WITH THE SPATSTAT PACKAGE.
Various random point patterns can be simulated using the spatstat package developed by Adrian Baddeley
and Rolf Turner.
The R code below shows how to simulate the inhomogeneous Poisson random pattern (function rpoispp),
Matern cluster process (function rMatClust), Neyman‐Scott cluster process (function rNeymanScott), and
simple sequential inhibition process (function rSSI). Each process is simulated in the unit square. The result of each simulation can be saved to a DBF file for further visualization and use in ArcGIS.
library(spatstat)
# Poisson random pattern
ppsim <- rpoispp(function(x,y) { 140*x }, 140)
plot(ppsim$x, ppsim$y, xlab="x", ylab="y")
title(main="Inhomogeneous Poisson Random Pattern")
title(sub="maximum intensity = 140, trend = 140*x")
# Matern cluster process
ppsim <- rMatClust(lambda=15, r=0.05, mu=15, win=owin(c(0,1), c(0,1)))
plot(ppsim$x, ppsim$y, xlab="x", ylab="y")
title(main="Simulation Matern Cluster Process")
title(sub="lambda=15 r=0.05 mu=15")
# Neyman-Scott cluster process
nclust <- function(x0, y0, radius, n) { return(runifdisc(n, radius, x0, y0)) }
ppsim <- rNeymanScott(lambda=10, rmax=0.10, rcluster=nclust, radius=0.10, n=15, win=owin(c(0,1),c(0,1)))
plot(ppsim$x, ppsim$y, xlab="x", ylab="y")
title(main="Simulation Neyman-Scott cluster")
title(sub="lambda=10, Max radius=0.10, Cluster radius=0.10, Points per Cluster=15")
# Simple sequential inhibition process
ppsim <- rSSI(n=200, r=0.02, win=owin(c(0,1),c(0,1)))
plot(ppsim$x, ppsim$y, xlab="x", ylab="y")
title(main="Simple Sequential Inhibition")
title(sub="radius=0.02, number of points=200")
Figure 16.31 shows four simulated point patterns visualized in ArcMap.
Modify the above code to simulate point patterns in other data domains. Learn about other point processes
available in the spatstat package, and simulate points using these processes.
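As a starting point, a sketch of one of the simulations in a non-unit rectangular window (the window limits and intensities below are arbitrary illustration values):
win2 <- owin(c(0, 2000), c(0, 1000))
# parent intensity chosen so that about 15 parent points fall in the window on average
ppsim <- rMatClust(15/(2000*1000), 100, 15, win=win2)
plot(ppsim$x, ppsim$y, xlab="x", ylab="y")
title(main="Matern Cluster Process in a 2000 x 1000 window")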
2) REPEAT THE ANALYSIS OF INFANT MORTALITY USING DATA COLLECTED IN
NORTH CAROLINA FROM 1995 TO 1999.
This is shown in figure 15.29 (assignment 3 of chapter 15). Data are in the assignment 15.3 folder.
3) REPEAT THE ANALYSIS OF THE RELATIONSHIPS BETWEEN ROBBERY AND
AUTO THEFT CRIME EVENTS USING THE SPLANCS PACKAGE WITH 1998
REDLANDS DATA.
Two shapefiles with information about crime event coordinates—crime description, date, and time—are in
the assignment 16.3 folder. Locations of robbery events (green) and auto theft events (pink) are shown in
figure 16.32.
Data courtesy of Redlands Police Department, Redlands, Calif.
Figure 16.32
4) CREATE INTENSITY MAPS AND FIND CLUSTERS OF GRAY WHALE LOCATIONS NEAR FLORES ISLAND.
Gray whale point locations recorded in 2002 (green) and the coastline of Flores Island are shown in figure
16.33.
Data courtesy of Dr. Dave Duffus and Laura‐Joan Feyrer, University of Victoria, Department of Geography, Whale Research Lab.
Figure 16.33
Data are in the assignment 16.4 folder. The sum02_prj shapefile includes coordinates of gray whale locations
and the number of recorded animals. A random offset in the range of (−15, 15) meters was added to the locations where more than one whale was recorded (the whales_02all shapefile). Using the splancs package, create intensity maps for each dataset within a polygon bounded by the coastline and some distance from Flores Island. Find the clusters using the mclust package.
Data are provided by Dr. Dave Duffus and Laura‐Joan Feyrer, University of Victoria, Department of
Geography, Whale Research Lab, P.O. Box 3025, Victoria, British Columbia, Canada V8W 3P5.
5) TEST FOR THE SPATIAL EFFECTS AROUND A PUTATIVE SOURCE OF HEALTH
RISK.
One of the problems in the spatial analysis of disease events is how to estimate the disease risk in relation to a point source of environmental pollution (see “Cluster analysis” in chapter 13). A classical example of this
problem is what statistician Peter Diggle called “raised incidence” modeling using the locations of lung and
larynx cancer cases and the location of an old incinerator in Lancashire. The analysis of these data can be
found in many papers and books beginning with the following paper:
Diggle, P. J. 1990. “A Point Process Modelling Approach to Raised Incidence of a Rare Phenomenon in the Vicinity of a Prespecified Point.” Journal of the Royal Statistical Society, Series A (Statistics in Society) 153(3):349–62.
Use the code below and provide comments on each logical part of the testing procedure.
FURTHER READING
R software packages used in this chapter can be found at the following Web sites:
http://cran.r-project.org/web/packages/spdep/index.html
http://cran.r-project.org/web/packages/splancs/index.html
http://cran.r-project.org/web/packages/spatstat/index.html
http://cran.r-project.org/web/packages/mclust/index.html
and their descriptions are in the following four papers:
1. Bivand, R. S. 2006. “Implementing Spatial Data Analysis Software Tools in R.” Geographical Analysis 38:23–
40.
2. Rowlingson, B., and P. Diggle. 1993. “Splancs: Spatial Point Pattern Analysis Code in S‐Plus.” Computers and
Geosciences 19:627–55.
3. Baddeley, A., and R. Turner. 2005. “Spatstat: An R Package for Analyzing Spatial Point Patterns.” Journal of
Statistical Software 12(6):1–42.
4. Fraley, C., and A. E. Raftery. 2002. “MCLUST: Software for Model‐Based Clustering, Density Estimation and
Discriminant Analysis.” Technical Report, Department of Statistics, University of Washington. See
http://www.stat.washington.edu/tech.reports.
5. A guide to resources for analysis of spatial data using R can be found at http://cran.r-
project.org/web/views/Spatial.html. Note, however, that there are functions for spatial data
analysis in some other R packages.
This paper discusses methods for spatial dependence elimination in nonspatial regression models.
7. Krivoruchko, K., and R. Bivand. 2009. “GIS, Users, Developers, and Spatial Statistics: On Monarchs and Their Clothing.” In Interfacing Geostatistics and GIS, 209–28. Springer.
This paper discusses the benefits of freeware and commercial software usage. Available at
http://training.esri.com/campus/library/index.cfm
DISPLAYING THE SEMIVARIOGRAM SURFACE ON THE MAP
STATISTICAL PREDICTIONS
PREDICTIONS USING ORDINARY KRIGING
REPLICATED DATA PREDICTION USING LOGNORMAL ORDINARY KRIGING
CONTINUOUS PREDICTIONS
A CLOSE LOOK AT PREDICTIONS WITH REPLICATED DATA
QUANTILE MAP CREATION USING SIMPLE KRIGING
MULTIVARIATE PREDICTIONS
PROBABILITY MAP CREATION USING COKRIGING OF INDICATORS
PROBABILITY MAP CREATION USING DISJUNCTIVE AND ORDINARY KRIGING
MOVING WINDOW KRIGING
EXAMPLE OF GEOPROCESSING: FINDING PLACES FOR NEW MONITORING
STATIONS
DETERMINISTIC MODELS
VALIDATION DIAGNOSTICS
ABOUT ARCGIS GEOSTATISTICAL ANALYST 9.3
ASSIGNMENTS
1) REPEAT THE ANALYSIS SHOWN IN THIS APPENDIX
2) USE THE GEOSTATISTICAL ANALYST MODELS AND TOOLS TO ANALYZE THE
HEAVY METALS MEASUREMENTS COLLECTED IN AUSTRIA IN 1995
3) FIND THE 20 BEST PLACES FOR COLLECTING NEW VALUES OF ARSENIC TO
IMPROVE PREDICTIONS OF THIS HEAVY METAL DISTRIBUTION OVER
AUSTRIAN TERRITORY
4) PRACTICE WITH THE GAUSSIAN GEOSTATISTICAL SIMULATION
GEOPROCESSING TOOL
FURTHER READING
ArcGIS Geostatistical Analyst is an extension of ArcGIS Desktop (ArcView, ArcInfo, and ArcEditor) that
provides a variety of tools for spatial data exploration, optimal prediction, and surface creation. This
appendix is prepared using ArcGIS version 9.2, service pack 2. Information about a 60‐day trial
version of ArcGIS Geostatistical Analyst can be found at
http://www.esri.com/software/arcgis/extensions/geostatistical/eval/evalcd.html.
The ArcGIS Geostatistical Analyst user manual is a good introduction to the software, with tutorials based on
air quality data collected in California. A condensed version of the manual is available as help files, which can
also be accessed online at http://webhelp.esri.com/arcgisdesktop/9.2/ (see figure A1.1).
Figure A1.1
This appendix shows examples of data analysis using cesium‐137 soil and food contamination data collected
in Belarus six years after the Chernobyl accident, focusing on the enhancements implemented in ArcGIS 9.2.
Therefore, it is a good idea to read the manual or the help files first, then repeat the geostatistical analysis
tutorials presented in this appendix.
Figure A1.2 displays the data that will be used in this appendix: the cesium‐137 soil contamination (left) and
the concentration of cesium‐137 in berries collected in the forest, with coordinates assigned to the nearby
village where the measurements were made (right). Notice that the Chernobyl nuclear power plant is located
in the bottom right part of the maps. These data are stored in the Assignment A1.1 folder. The data are
described in chapters 1 and 3.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.2
EXPLORATORY SPATIAL DATA ANALYSIS
Figure A1.3 at left shows a histogram of log‐transformed cesium‐137 data. It is also shown in figure A1.3 at
right, with a superimposed approximation of the data distribution using a combination of six Gaussian
distributions, with parameters shown on the right side of the Normal Score Transformation dialog box (Mu is
mean, Sigma is standard deviation, and P is weight of the Gaussian kernel). This dialog box is accessible
through the Geostatistical Wizard when a simple or disjunctive kriging model is selected and the normal
score transformation option is chosen. We see that log‐transformation does not make data exactly normally
distributed since more than one Gaussian kernel is needed for the normal distribution approximation.
However, log‐transformed cesium‐137 data are much closer to symmetric Gaussian distribution than the
original data. Therefore, lognormal kriging may produce more accurate predictions than kriging without the
data transformation option.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.3
When visualizing the berry contamination data, the Histogram tool opens a dialog box for handling coincident samples (shown in figure A1.4 at left). Figures A1.4 (center) and A1.4 (right) show the histogram and simple
statistics when Use Mean (center) and Include all (right) options are selected (in both cases, the data were log
transformed). The peculiarity of this data is that more than one sample was analyzed in most of the villages.
Several of the Geostatistical Analyst kriging models can use replicated data, and an example of data modeling
will be presented later in this appendix.
Figure A1.4
The histograms in figure A1.4 are almost symmetrical, and it might be expected that the log‐transformed data
is close to a normal distribution. This assumption can be verified using the normal quantile‐quantile plots,
shown in figure A1.5. Since the dots lie close to the line that corresponds to the exact fit of the normal data,
log transformation makes both the soil (figure A1.5 at left) and berry (figure A1.5 at right) contamination data
close to a normal distribution, and lognormal kriging may be a good candidate for the optimal geostatistical
model for predicting the level of contamination at the unsampled locations.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.5
The general quantile‐quantile plot (figure A1.6, center) assesses the similarity between two data
distributions. This dialog box does not provide an option to transform the input data, and the fields log_Cs137
and log_berry—with a logarithm of the cesium‐137 measurements in soil and berries—are used. Since most
of the points in the General QQPlot dialog box in figure A1.6 lie close to the hidden 1:1 line, we may assume
that the distributions of soil and berry contamination are similar.
If one more exploratory data analysis plot, Crosscovariance Cloud, is opened (left), lines appear on the map
between pairs of points selected in the General QQPlot dialog box. These lines show that small values of soil
contamination are usually located far from the locations of small values of berry contamination. This is an
indication that the relationship between cesium‐137 content in the soil and forest berries is complex,
meaning that the cesium‐137 soil contamination level in the villages (where all samples were taken) can be
insufficient to accurately predict the level of contamination in the nearby forest.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.6
When the Semivariogram/Covariance Cloud tool is run the first time with more than 300 input points (notice
that the number of points in the Cs137 layer is 445), a message is displayed giving the rationale for this
default value (see figure A1.7 at left). Several default options can be changed in the ArcMap Advanced Settings
dialog box shown in figure A1.7 at right. The maximum number of points in the Semivariogram/Covariance
and Crosscovariance Cloud dialog boxes is set at 500 in figure A1.7 at right so that the maximum number of
pairs allowed is 500*499/2=124,750. The default value of 300 input data is used because it is difficult to
explore a very large number of points in a relatively small graph window. Note that there is no limitation on
the number of input data in other parts of Geostatistical Analyst.
Figure A1.7
The Trend Analysis tool in figure A1.8 at left shows that there is large‐scale cesium‐137 data variation in two
perpendicular directions after rotating the data 107 degrees clockwise, meaning that data change the most in
the north‐northeast and perpendicular directions. This observation can be further verified in the Local
Polynomial Interpolation dialog box of the Geostatistical Wizard shown in figure A1.8 at right. The power
value at the top right of this figure shows the order of the polynomials. The slider at the top of the dialog box
shows the relative size of the data‐averaging window. If a slider is in the left position, all data are used in
calculating the polynomial coefficients. When the slider is moved to the right, the polynomial coefficients are
calculated locally, and the number of points involved in the calculation at every point in the data domain
decreases, resulting in a map with more details. Usually, the large‐scale data variation is clearly seen when
the slider is positioned about one‐third from the left.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.8
The Voronoi Map tool helps to investigate local data variability. Figure A1.9 at left shows clipped Voronoi
polygons colored according to the entropy of the log‐transformed cesium‐137 soil contamination values
calculated using each Voronoi polygon and its neighbors. The large entropy values indicate large local data
variability. Local statistics calculated in the Voronoi Map window can be exported as a polygon feature layer
(see figure A1.9 at right) for further analysis. For example, it makes sense to collect new measurements of
cesium-137 contamination in the pink and yellow areas. Locations with large local entropy values (for example, the largest 20 percent of values) can be found using the command Selection > Select By Attributes. They
are visualized in figure A1.9 at right as green crosses.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.9
The Semivariogram/Covariance Cloud dialog box is shown in figure A1.10. The semivariogram surface in the
bottom part of the dialog has a clear structure, and this is evidence of spatially dependent input data. Several
unusually large semivariogram values at short distances between the pairs of points are highlighted, and the
pairs are displayed on the map at right as green lines (the color of these lines was changed from the default
light blue to green in the Cs137 layer Properties dialog box). Two points that are connected to most of the
selected neighbors are highlighted with red circles. They have unusually large values in comparison with
their neighbors. In fact, the selected data points in the semivariogram cloud are either unusually small or
large, but the map shows that they have large values in this case. Both points are located near the border of
the data domain, and their values might not be unusual if samples to the east (in Russia) were available.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.10
It is good practice to select several subregions and explore spatial data dependence in the selected areas
separately. Figure A1.11 shows semivariogram clouds with unusually large values highlighted for three data
subsamples from the Cs137 layer (yellow, green, and pink circles). This time, the number of pairs is smaller,
and local spatial data dependence can be explored in more detail. In particular, clusters of unusual values at
shorter distances are found in the northern and central parts of the region while large semivariogram values
in the southern subset do not form a cluster. Also, there is a clear northwest trend in the spatial data
variability in the south. The semivariogram surfaces and the semivariogram clouds are different in different
parts of the area under investigation, suggesting that soil contamination data are nonstationary.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.11
Spatial data dependence can be further explored using the Semivariogram/Covariance Modeling dialog box.
Figure A1.12 shows directional semivariograms using data collected to the northwest of Chernobyl, a similar
subset as shown in green in figure A1.11. The cloud of averaged semivariogram values is very different when
lines between pairs are directed northwest versus northeast. This is evidence of large‐scale variation of
cesium‐137 soil contamination in that part of the data domain. Directional data variation can be taken into
account at the modeling stage using the Geostatistical Analyst data detrending and anisotropy options.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.12
In addition to the standard semivariogram cloud, the cloud of indicator semivariogram values can be used for spatial data exploration. The indicator value is equal to zero if the data value at location s is below the threshold and equal to 1 otherwise:
I(s) = 0 if Z(s) < threshold, and I(s) = 1 otherwise.
The indicator values can be calculated using the ArcGIS field calculator.
Figure A1.13 shows two semivariogram clouds for indicators created using cesium‐137 soil contamination
threshold values of 20 (left) and 5 (right) Ci/km2. Although at first the clouds of points look useless since they
are displayed as just two horizontal lines (because the only two possible values for the indicator
semivariogram are 0 and 0.5), selecting points at distances smaller than a particular distance on the top line (left) selects pairs with values on opposite sides of the threshold (blue lines), while selecting points on the bottom line (right) selects pairs with values on the same side of the threshold (yellow lines). Therefore, the Semivariogram Cloud tool with indicators based on different thresholds helps to
find local data similarity and dissimilarity. This is especially useful when the threshold value is an upper
permissible level of contamination or another important regulatory value.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.13
The empirical indicator semivariogram values and models calculated by the Geostatistical Analyst wizard (in
this case, it is not necessary to calculate the indicator values because the transformation is done automatically
by the software) are shown in figure A1.14 for the thresholds 10, 5, and 1 Ci/km2. The indicator
semivariogram surface and model may show a clearer spatial correlation structure than the semivariogram
surface based on the original data since the indicator semivariogram has neither very small nor very large
values that may corrupt the picture of spatial dependency in the original data. The estimated semivariogram
models shown at the bottom of the dialog boxes are different for different threshold values. In particular, the
range parameter underlined in red is increasing with decreasing threshold values, meaning that the
difference between values on the other side of the threshold is recognizable at the shorter scale for larger
threshold values. Therefore, to safely identify areas with large contamination values, a larger number of data
should be collected than for identification of the areas with moderate cesium‐137 soil contamination. Note
also that data variability (sill parameter) is decreasing with increasing threshold values. In this case, the
transformation to bivariate normal distribution can be problematic, and predictions with disjunctive kriging
can be non‐optimal (see also the sections on copulas in chapters 9 and 12).
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.14
Figure A1.15 at left shows the cross‐covariance values for log‐transformed cesium‐137 soil and berry
contamination. The cross‐covariance surface is not symmetrical, and finding the cross‐covariance model
could be a challenge (the cross‐covariance model is necessary for multivariate prediction, which is called
cokriging in geostatistical literature). Figure A1.15 at right shows how this cross‐covariance model may look
in the Geostatistical Analyst Semivariogram/Covariance Modeling dialog box: a sum of two models (actually,
the difference since one of the models has negative partial sill parameter), circular and exponential, is used;
the models are anisotropic; and their origin is shifted to the east and slightly to the north.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.15
Figure A1.16 shows two Searching Neighborhood dialog boxes. The right part of each dialog box shows
weights of neighboring observations when predicting to the center of the circle in the case of lognormal
ordinary kriging (left) and lognormal ordinary cokriging (right). The more complicated the model, the more
complicated the distribution of weights can be. The distribution of weights can be considered to be part of
data exploration when it is desirable to have full control over predictions to important locations where
people live or to where they are scheduled to be evacuated.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.16
DISPLAYING THE SEMIVARIOGRAM SURFACE ON THE MAP
Figure A1.17 shows the menus that open when right-clicking in the Semivariogram/Covariance Cloud and Semivariogram/Covariance Modeling dialog boxes. The semivariogram cloud values, the empirical semivariogram values, and the semivariogram model can be saved to a DBF file for further analysis and visualization. The semivariogram model and the semivariogram surface can be saved to raster format.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.17
To display the raster on the map, it should be projected first and then shifted to the desired position. Katja Krivoruchko at Esri wrote geoprocessing tools to perform these tasks. To use them, follow these instructions:
1. Copy files from the Assignment A1.1 folder to the folder on your hard drive.
2. Add the Geostatistical Analyst samples toolbox to ArcMap as shown in figure A1.18 at left.
3. Create the semivariogram model using the Geostatistical Analyst wizard, and save the semivariogram
surface in raster format. In preparing this exercise, five subsamples were created using the
Neighborhood Selection tool from the Geostatistical Analyst Tools toolbox available in ArcGIS 9.2 SP2
(figure A1.18, center). The data used for estimating the semivariogram surfaces are shown with five colors in figure A1.18 at right.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.18
4. Click Geostatistical Analyst samples, then Semivariogram, and then double‐click Shift Semivariogram.
The Shift Semivariogram dialog box opens (figure A1.19 at left).
5. Fill out the tool with the following parameters:
For the Semivariogram raster, select the raster created in step 3.
Provide the coordinate system for the new raster by pointing to the dataset from which
the semivariogram was created (figure A1.19 at right).
For the x and y coordinates, enter the coordinates of where you want the
semivariogram’s center to be. In this example, they are the coordinates of five points
shown as red crosses in figure A1.18 at right.
Choose the name and location of the output raster.
Figure A1.19
6. Run the tool.
7. Repeat steps 5 and 6 four more times with the four other semivariogram surfaces. Alternatively, use the Batch Shift Semivariogram tool, which displays a series of semivariogram surfaces (the table five_pts.dbf in the folder Assignment A1.1 shows an example specification of five input semivariogram surfaces to be projected and displayed at five specified locations).
The resulting five semivariogram surfaces are shown in figure A1.20. The raster legends on the left indicate
that the semivariogram values are significantly different in the northern and southern parts of the data
domain. Note that in this exercise the cesium‐137 soil contamination data were not transformed.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.20
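The shift performed in steps 4 through 7 essentially recomputes the raster origin so that the semivariogram surface is centered on a chosen map location (the actual tool also assigns the coordinate system). A minimal sketch of that arithmetic, with made-up coordinates and raster dimensions, is:

# Sketch of the origin arithmetic behind shifting a semivariogram surface raster
# so that it is centered on a chosen map location (center_x, center_y).
def shifted_origin(center_x, center_y, n_cols, n_rows, cell_size):
    """Return the lower-left corner of a raster centered on (center_x, center_y)."""
    x_min = center_x - 0.5 * n_cols * cell_size
    y_min = center_y - 0.5 * n_rows * cell_size
    return x_min, y_min

# Hypothetical example: a 100 x 100 cell surface with 500 m cells centered on one point.
print(shifted_origin(center_x=520_000.0, center_y=5_740_000.0,
                     n_cols=100, n_rows=100, cell_size=500.0))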
STATISTICAL PREDICTIONS
Figure A1.21 at left shows a list of interpolation methods available in ArcGIS Geostatistical Analyst 9.2. The
first four methods are deterministic, and they will be briefly discussed at the end of this appendix. The fifth
and sixth methods on the list—kriging and cokriging—are statistical. In cokriging, besides the neighboring
data, additional variables are used for predicting data at the unsampled locations. Each sampled location can
have one or more measurements. Geostatistical Analyst recognizes whether multiple measurements are
available at the sampled locations, and the dialog box with the choice of actions appears as shown at the
bottom left of figure A1.21 at left.
Figure A1.21 at right shows a list with six kriging models: ordinary, simple, universal, indicator, probability, and disjunctive. The names are historical and not always descriptive. For example, the simplest method in the list is not simple kriging but ordinary kriging (this is the main reason why ordinary kriging is the default model in Geostatistical Analyst 9.2). In step 1 of the wizard, several options are available, including data transformation, data detrending, and threshold selection. Option availability depends on the chosen kriging model. In the case of cokriging, different options can be used for each variable.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.21
The following sections show several examples of geostatistical data analysis and mapping using different
kriging models.
PREDICTIONS USING ORDINARY KRIGING
In this section, the cesium‐137 soil contamination data will be analyzed. This time instead of using the entire
dataset, a subset of data points will be used (located around the green semivariogram surface on the map in
figure A1.20). This data subset is shown in figure A1.22 at left. Values greater than 3 Ci/km2 are shown inside circles. A histogram of the logarithms of the data is shown in figure A1.22 at right. This histogram can be approximated with two Gaussian distributions (right part of the dialog box). A bimodal data distribution may originate from two different processes that contributed to the soil contamination values, perhaps dry deposition and precipitation.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.22
Figure A1.23 at left shows the estimated semivariogram of original data (without log transformation). The
wide spread of points means that the estimated semivariogram model may not describe spatial data variation
accurately; therefore, kriging predictions based on that semivariogram model may be inaccurate as well.
Indeed, a cross‐validation diagnostic in figure A1.23 at right confirms that there are problems with
predictions since the root‐mean‐square and average standard prediction errors are very different. One
possible reason for this discrepancy is an outlier whose value (20.99) is more than twice the next largest value in the subset (9.54) (see the Measured and Predicted values in the cross-validation table). By removing the outlier, one
can see how the cloud of empirical semivariogram values and the cross‐validation statistics change.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.23
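The diagnostics being compared here can be recomputed from the cross-validation table. A minimal sketch, assuming that the measured values, cross-validation predictions, and prediction standard errors have been exported as arrays, is:

# Sketch: cross-validation summary statistics of the kind compared in figure A1.23.
import numpy as np

def cv_summary(measured, predicted, std_errors):
    """Root-mean-square, average standard, and RMS standardized prediction errors."""
    errors = predicted - measured
    rmse = np.sqrt(np.mean(errors ** 2))
    avg_std_err = np.mean(std_errors)
    rms_standardized = np.sqrt(np.mean((errors / std_errors) ** 2))
    return rmse, avg_std_err, rms_standardized

# A large gap between rmse and avg_std_err (or an rms_standardized value far from 1)
# signals a poorly specified semivariogram model or influential outliers,
# such as the 20.99 value discussed below.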
The next three steps in the cesium‐137 soil contamination data analysis are removing the outlier, estimating a
new semivariogram model that produces better model diagnostic statistics, and applying this new model
back to the subset containing the outlier. All this can be done in Geostatistical Analyst because of its ability to
preserve selections made in ArcMap and to use an existing kriging model as input to the interpolation methods. Selections can be made in ArcMap by drawing a box around multiple features with the Select Features tool, by using the Select By Attributes or Select By Location tools, or with the Geostatistical Analyst Neighborhood Selection tool. Back in Geostatistical Analyst, it is clear that the semivariogram cloud changed dramatically
after removing the largest value in the data subset: the semivariogram model line now lies close to most of
the empirical semivariogram values, especially at the most important small distances (figure A1.24 at left).
The cross‐validation statistics shown in figure A1.24 at right are also significantly improved.
Removing the outlier helped build a better semivariogram model for predicting values at the unsampled
locations. We saved this model to the “Ordinary Kriging 1 point out” layer.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.24
Now, take a closer look at the 20.99 value. An outlier like this one could be caused by a measuring or typing
error, in which case it should be removed from the subsequent data processing, but it could be a reliable
measurement, and omitting it could result in the loss of important information. In the present example, a
subset of data that included only one high value of soil contamination was purposely selected. If, however,
more data points to the east were included, one would see that many of them are as high or even higher,
meaning that the 20.99 value is most probably correct, and it should be used in predicting values at the
unsampled locations. Therefore, it is always advisable to use caution when deciding how to treat possible
data outliers.
The Create Geostatistical Layer tool (figure A1.25 at left) allows the application of the “Ordinary Kriging 1 point out” model to other datasets. We used the dataset that includes the 20.99 value. The resulting
prediction map is shown in figure A1.25 at right. After activating the Show Map Tips option, one can see the
predicted values at the cursor locations. Predictions in the vicinity of the outlier are close to 20.99.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.25
REPLICATED DATA PREDICTION USING LOGNORMAL ORDINARY KRIGING
Simple, ordinary, and universal kriging can be used with more than one datum per measurement location.
Having several measurements per location allows the estimation of measurement data error, which is not an
option when data are measured once per location. In this section, lognormal ordinary kriging will be used on
cesium‐137 berry contamination data to illustrate measurement error estimation and usage. The key dialog
box used for modeling replicated data is Semivariogram/Covariance Modeling. The semivariogram model of
the replicated data must have a nugget. However, when the software finds the best‐fit model for the
semivariogram cloud, it can sometimes give a model with a zero nugget. In such cases, the nugget value should be manually changed because different values at the same location dictate that the data variance at zero distance between pairs of points must be non-zero. Figure A1.26 shows the Semivariogram/Covariance
Modeling dialog box with its Error Modeling tab, in which the proportion of measurement error in the nugget
parameter can be typed in or estimated by clicking the Estimate button. For the data used in this example, the
estimated measurement error equals 0.5857, or 85 percent of the nugget effect of 0.68418 for the fitted
circular semivariogram model (figure A1.26 at left).
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.26
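A common way to estimate the measurement-error component of the nugget from replicated data is the pooled within-location variance; the estimator behind the Estimate button is not reproduced here, so the following sketch is only an illustration of the idea (location_ids and log_values are placeholder arrays):

# Sketch: pooled within-location variance of replicated (log-transformed) measurements
# as an estimate of the measurement-error component of the nugget.
import numpy as np
from collections import defaultdict

def measurement_error_variance(location_ids, log_values):
    groups = defaultdict(list)
    for loc, value in zip(location_ids, log_values):
        groups[loc].append(value)
    ss, dof = 0.0, 0
    for values in groups.values():
        if len(values) > 1:          # only locations with replicates inform the estimate
            v = np.asarray(values)
            ss += np.sum((v - v.mean()) ** 2)
            dof += len(v) - 1
    return ss / dof if dof > 0 else float("nan")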
However, if the exponential model is used, the nugget parameter estimated by the software is 0.4612 (figure
A1.26 at right), which is less than the measurement error in figure A1.26 at left. If the software is forced to
estimate measurement error in this case, it changes the measurement error to the value equal to the nugget
parameter (figure A1.27 at left).
With a nonzero measurement error, new predictions to data locations are possible, and figure A1.27 at right
shows a histogram of the prediction errors and a part of the table with the number of measurements available
at each location (column Join_Count), the lowest measurement (AKTBKLKG), ordinary kriging prediction
(ok), and prediction standard error (ok_stderr). Note that the prediction standard error distribution in the
histogram is similar to the lognormal distribution, meaning that the prediction errors depend on the data
values.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.27
This is in contrast to ordinary kriging without data transformation, which produces a nearly uniform distribution of the prediction standard errors (figure A1.28 at left), although the distributions of the input data
and predictions look like the lognormal distribution (figure A1.28 at right).
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.28
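The dependence of the prediction standard errors on the data values is a consequence of the back-transformation from the log scale. The textbook lognormal moments below illustrate this; the back-transformation actually used by lognormal kriging contains additional correction terms, so the sketch is only for intuition:

# Sketch: textbook lognormal back-transformation of a prediction m and standard
# error s obtained on the log scale. The back-transformed standard error grows
# with m, which is why the error histogram in figure A1.27 looks lognormal.
import numpy as np

def lognormal_back_transform(m, s):
    mean = np.exp(m + 0.5 * s ** 2)
    var = (np.exp(s ** 2) - 1.0) * np.exp(2.0 * m + s ** 2)
    return mean, np.sqrt(var)

print(lognormal_back_transform(m=1.0, s=0.8))   # larger m gives a larger back-transformed error
print(lognormal_back_transform(m=3.0, s=0.8))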
The next modeling step is a decision about the most suitable searching neighborhood. In the case of cesium‐
137 berry contamination, an optimal configuration of neighbors is difficult to find because the pattern of
measurement locations is clustered. The Declustering dialog boxes shown in figure A1.29 provide evidence
that low values are indeed clustered since the cell declustering method found a global maximum in the graph
of the weighted mean value (left), and the polygonal method shows that areas of the Voronoi polygons
around points are very different (right). Note that cell declustering is allowed with replicated data while polygonal declustering is not; therefore, the mean value of berry contamination at the locations with replicated data was used to produce figure A1.29 at right. Ideally, the optimal searching neighborhood should be different in areas with different sampling density, but the optimal number of varying neighbors is difficult to find automatically.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.29
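Cell declustering assigns each point a weight inversely proportional to the number of points sharing its grid cell, and the weighted mean is then examined as a function of cell size. A minimal NumPy sketch of this calculation (x, y, z are placeholder arrays) is:

# Sketch: cell-declustering weights and the resulting weighted mean for one cell size.
import numpy as np

def cell_declustered_mean(x, y, z, cell_size):
    cols = np.floor((x - x.min()) / cell_size).astype(int)
    rows = np.floor((y - y.min()) / cell_size).astype(int)
    cell_ids = cols * (rows.max() + 1) + rows
    _, inverse, counts = np.unique(cell_ids, return_inverse=True, return_counts=True)
    weights = 1.0 / counts[inverse]              # points in crowded cells get small weights
    weights /= weights.sum()
    return np.sum(weights * z)

# Scanning a range of cell sizes and looking for the maximum of the weighted mean
# (because low values are clustered here) mirrors the graph in figure A1.29 at left.
# sizes = np.linspace(1_000, 50_000, 50)
# curve = [cell_declustered_mean(x, y, z, s) for s in sizes]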
A reasonable strategy is to use eight sectors in the searching neighborhood circle or ellipse to prevent kriging
from using too many nearby points and to include points that are distributed in several directions, as shown in
figure A1.30 at left. However, even with this searching neighborhood, filled contours in the prediction map
(figure A1.30 at right) change abruptly.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.30
The roughness of predictions is clearly seen when hillshade visualization is activated in the Geostatistical Layer Properties dialog box, as shown in figure A1.31 at left. The noncontinuous prediction map in figure A1.31 at right is a feature of all interpolation methods at the locations where the set of neighbors changes (see discussion in chapter 8).
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.31
CONTINUOUS PREDICTIONS
A prediction map is continuous if all measurements are used to predict values at all prediction locations. But
using all measurements is often not a good idea because 1) solving a system of a very large number of linear
equations is time consuming (there are 1,381 measurements in the cesium‐137 berry contamination dataset),
and 2) measurements are not absolutely precise; hence each one adds uncertainty to the predictions. The
largest number of neighbors allowed in Geostatistical Analyst 9.2 is 200; figure A1.32 shows predictions using
a searching neighborhood with the nearest 200 measurements. The resulting prediction map is much
smoother than in figure A1.31 at right, but it is still not continuous.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.32
The Smooth option tab on the Searching Neighborhood dialog box (figure A1.33 at left) allows the creation of
the continuous map shown in figure A1.33 at right. The idea behind the smoothing method is making kriging
weights equal to zero outside the searching neighborhood as explained in chapter 8. All the points within
three circles in figure A1.33 at left are used in the interpolation. The points that fall outside the smaller ellipse
but inside the largest ellipse are weighted using a sigmoidal function with a value between zero and one.
Empty areas in figure A1.33 at right arise because a searching neighborhood centered on these areas does not include any measurements (the Include at least option is not supported in smooth interpolation).
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.33
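The exact sigmoidal function used by the Smooth option is not reproduced here, but its effect can be sketched as a taper that multiplies the kriging weights: one inside the inner search radius, zero beyond the outer radius, and falling smoothly in between (the circular radii below stand in for the ellipses in the dialog box):

# Sketch: a smooth S-shaped taper that makes weights decay to zero between an
# inner and an outer search radius, as the Smooth option does conceptually
# (the exact function used by the software may differ).
import numpy as np

def smooth_taper(distance, r_inner, r_outer):
    """Weight multiplier: 1 inside r_inner, 0 beyond r_outer, S-shaped in between."""
    d = np.asarray(distance, dtype=float)
    t = np.clip((d - r_inner) / (r_outer - r_inner), 0.0, 1.0)
    return 1.0 - (3.0 * t ** 2 - 2.0 * t ** 3)   # smoothstep falling from 1 to 0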
The geostatistical layer stores parameters of the interpolation model, and the map of prediction standard
errors can be immediately created using the Create Prediction Standard Error Map option from the
Geostatistical Layer context menu (figure A1.34 at left). To create probability or quantile maps, the Method
Properties option is used to go back to the Geostatistical Wizard (figure A1.34 at right). All that is needed to
create a probability map with the same model parameters (data transformation, detrending, semivariogram
model, searching neighborhood, and others) is to change the output map type and choose a threshold (for
example, the upper permissible level of the cesium‐137 forest berry contamination, 185 Bq/kg).
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.34
A CLOSE LOOK AT PREDICTIONS WITH REPLICATED DATA
The difference between data averaging before and during kriging interpolation can be illustrated using a
subset of cesium‐137 berry contamination data. Figure A1.35 shows the measurement locations as green
circles. The digits are the counts of replicated measurements at each location. The Semivariogram/Covariance Modeling dialog
boxes, with default parameters in the case of averaged (left) and replicated (right) measurements, show that
the estimated models are very different: weak correlation in the case of averaged data and moderate when all
the data are used as input to Geostatistical Analyst. For the final map production, it is preferable to use the
semivariogram estimated using replicated data even if the mean (or maximum or minimum) value at each
measurement location was chosen in the Handling Coincidental Samples dialog. This is because the quality of modeling usually improves as more information is used in model building.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.35
The Searching Neighborhood dialog box (figure A1.36) shows the measurements and their weights when
predicting to the location shown as symbol “+” with coordinates displayed in the top two rows in the right
part of the dialog box. The eight nearest points are used in prediction, and they are concentrated in just three
locations. Note that weights are equal for measurements made in the same location.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.36
Figure A1.37 (left) shows the measurements selected by the Searching Neighborhood, with a maximum of
three values in each of the eight sectors. Only three out of six measurements are used in the northern location
as well as three out of 43 and three out of five in two other locations with replicated samples. Information on
which three measurements Geostatistical Analyst is using is not provided. In fact, it is not important, since there is no automatic rule to decide which three measurements are more important than the others. Using only part of the measurements can still be better than using just one measurement or the average value. Alternatively, all measurements in the moving circle or ellipse can be used, as in the example in figure A1.37 (right), with the number of neighbors to include set to 100 (in this example, the number of neighboring measurements used in prediction equals 57; see the Identify window).
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.37
Figure A1.38 shows the weights of the eight nearest measurements when the mean value at each
measurement location was used as input to a kriging model. This time all weights assigned to the averaged
values at the measurement locations are different.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.38
Because predictions at the same unsampled location in figures A1.36‐A1.38 are different, the choice of the
searching neighborhood is important in the case of replicated data.
QUANTILE MAP CREATION USING SIMPLE KRIGING
A prediction map shows the expected (most probable) values. This is what is usually required in the case of
nearly symmetrical data distribution when the mean and median values are similar. Figure A1.39 shows the
Geostatistical Analyst Classification dialog box with non‐symmetric distribution of cesium‐137 forest berry
contamination. Note that the mean and median values, 385 and 230, are very different. The median value can be a more reliable predictor because half of the possible prediction values are smaller and the other half are larger than the median value.
Figure A1.39
A quantile map shows the predicted quantile values. Quantile predictions require knowledge of the prediction distribution at each point, which is generally unknown. Geostatistical Analyst assumes that the predicted distribution is Gaussian with a mean equal to the kriging prediction and a standard deviation equal to the kriging prediction standard error. This assumption is difficult to verify in practice. However, it holds if the input data are normally distributed (the assumption of input data normality is verifiable). Therefore, data transformation is usually the first step in quantile map creation. Quantile map creation is illustrated below
using a subset of the cesium‐137 soil contamination data used in the earlier section, “Predictions using
ordinary kriging.”
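Under the Gaussian assumption just described, a quantile prediction at a location is simply the kriging prediction plus the corresponding standard normal quantile times the prediction standard error (applied on the transformed scale in practice, followed by back-transformation). A sketch with made-up numbers:

# Sketch: quantile prediction under the Gaussian assumption, with the mean equal
# to the kriging prediction and the standard deviation equal to the kriging
# prediction standard error.
from scipy.stats import norm

def quantile_prediction(prediction, std_error, p):
    return prediction + norm.ppf(p) * std_error

print(quantile_prediction(prediction=2.1, std_error=0.6, p=0.7))   # 0.7 quantile
print(quantile_prediction(prediction=2.1, std_error=0.6, p=0.8))   # 0.8 quantile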
We use simple kriging (figure A1.40 at left) because it has a normal score transformation option, which
guarantees that transformed data are normally distributed. Normal score transformation should be used with
data without trend; hence we detrend the data first, as shown in figure A1.40 at right. Note that kriging will then be performed on the residuals, which have both negative and positive values.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.40
The next step is the Normal Score Transformation dialog box in which the Gaussian Kernels method is chosen
(this choice is justified in chapter 8). Eighty-nine percent of the detrended data are described by a Gaussian distribution with a mean of −0.35 and a standard deviation of 0.97 (note that it is close to the standard normal
distribution with zero mean and standard deviation of 1), while three other kernels comprise the remaining
11 percent of the detrended data (figure A1.41 at left). The cross‐validation diagnostics in figure A1.41 at
right are quite good, and we can expect that the quantile predictions will be nearly optimal after accurate
back transformation of the predictions and prediction standard errors.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.41
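Geostatistical Analyst builds the normal score transformation from a fitted density (here a mixture of Gaussian kernels). A simpler rank-based variant of the same idea, shown only for orientation and not identical to the software's method, is:

# Sketch: rank-based normal score transformation (the software's Gaussian Kernels
# method fits a density instead, so results will differ).
from scipy.stats import norm, rankdata

def normal_score_transform(values):
    ranks = rankdata(values)                       # 1..n, ties get average ranks
    return norm.ppf((ranks - 0.5) / len(values))   # approximately standard normal scores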
Figure A1.42 shows 0.7 (left) and 0.8 (right) quantile maps. They can be used in decision making instead of
the 0.5 quantile (median) map because it may be desirable to overestimate cesium‐137 soil contamination to
prevent a possible mistake in classifying contaminated soil as safe.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.42
MULTIVARIATE PREDICTIONS
The Geostatistical Analyst’s cokriging can use one, two, or three additional (secondary) variables to improve
prediction of the primary variable at unsampled locations. Theoretically, the cokriging predictions are better
than predictions made by the kriging models because more information is used. In this section, we use a mean
value of cesium‐137 berry contamination data in each measurement location as the variable of interest and
more densely sampled cesium‐137 soil contamination data as a secondary variable. Both variable
distributions are asymmetrical, and it is a good idea to transform them both. A normal score transformation
was applied to both variables (see figure A1.43 for the primary variable called “Dataset 1”).
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.43
Three models should be fitted in the Semivariogram/Covariance Modeling dialog box, two semivariograms
(or covariances) and one cross‐covariance. Figure A1.44 shows these three models. The exponential model
seems to be a reasonable choice in this case.
Note that one model may not be sufficient because there are situations when different variables are better
described by different models. For example, the spatial structure of the primary variable can be well
described by the circular model while the exponential model can be more suitable for the explanatory
variable. Other times, different variables have different degrees of spatial correlation, and semivariogram
models with different range parameters are required; for example, two spherical models, one with a small
range and the other with a large range.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.44
We used default searching neighborhood parameters, and the next step is model diagnostics (figure A1.45 at
left). We see that the root‐mean‐square and average standard errors are very similar, suggesting that the
predictions in figure A1.45 at right are nearly optimal.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.45
Cross‐validation and validation diagnostics can help to verify which model performs better. The cross‐
validation graphs and the prediction error statistics can be compared using the geostatistical layer context
menu option Compare, as shown in figure A1.46 at left. From the same figure at right, we see that lognormal
ordinary kriging with replicated data outperforms simple cokriging because the three most important
diagnostics (root‐mean‐square, average standard, and root‐mean‐square standardized prediction errors) are
better for the kriging model. But according to geostatistical theory, cokriging, if used properly, must work at least as well as kriging. The cross-validation comparison is therefore an indication that our cokriging model is not optimal. In fact, it is possible to significantly improve the cokriging model using
replicated cesium‐137 berry contamination data jointly with cesium‐137 soil contamination data, and we
suggest that the reader do this exercise.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.46
PROBABILITY MAP CREATION USING COKRIGING OF INDICATORS
In environmental applications, researchers are often interested in the probability that the upper permissible
level of soil, air, or groundwater contamination is exceeded. Indicator kriging, cokriging of indicators, and
probability kriging are the models that were developed to provide a solution to this problem. In this section
we show an example of probability mapping using cokriging of indicators. We use cesium‐137 soil
contamination data and a threshold of 5 Ci/km2.
Figure A1.47 at left shows that indicator kriging is chosen and the threshold of 5 is typed in the Primary Threshold control. Geostatistical Analyst will perform the indicator transformation, but other data transformations are not allowed. The detrending option is hidden because indicator values cannot be represented as a sum of large-scale and small-scale data variation.
The next dialog box in the wizard is Additional Cutoffs Selection (figure A1.47 at right). Here, up to three
additional cutoffs (thresholds) can be chosen. The software will then do indicator transformations using
these cutoffs, and the transformed data will be used as secondary variables for cokriging; hence, the name of
the method—cokriging of indicators. In this example, two additional cutoffs, 2 and 8, are chosen on the
different sides of the primary threshold that equals 5.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.47
Figure A1.48 shows the Semivariogram/Covariance Modeling dialog box. With three indicator variables, three
semivariograms and three cross‐covariance models should be estimated. It is important to check all of them
since there is no guarantee that the sixth model is estimated accurately just because the five others are.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.48
Figure A1.49 at left shows the resulting probability map. A standard error of indicators map can be
immediately created by clicking Create Standard Error of Indicators Map on the Geostatistical Layer
Properties menu (figure A1.49 at right). The error of the indicator prediction depends on the data configuration, not on the data values (displayed as colored circles), as would be desired. This is an expected result because the observed data values themselves are not used by cokriging of indicators.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.49
PROBABILITY MAP CREATION USING DISJUNCTIVE AND ORDINARY KRIGING
Data normality is a desirable feature for linear kriging models. Disjunctive kriging is the only geostatistical
model in Geostatistical Analyst that requires that data follow a particular distribution (bivariate normal). Real
data are rarely normally distributed, and the first step in using disjunctive kriging is data transformation, usually a normal score transformation (see figure A1.50).
Figure A1.50
The next step after the semivariogram/covariance modeling (figure A1.51 at left) is checking the assumption
on bivariate normal distribution. Figure A1.51 at right shows the Examine Bivariate Distribution dialog box in
which the theoretical covariance models (blue line) estimated using the covariance model from the previous
dialog box are compared with covariance models for several data quantiles (green). If the blue and green lines
are nearly the same, the assumption about bivariate normal distribution holds, and disjunctive kriging can be
used safely. Otherwise, disjunctive kriging may produce poor predictions. We see in figure A1.51 at right that
the blue and green lines are reasonably close.
Figure A1.51
Weights are not calculated in the Searching Neighborhood dialog box (figure A1.52 at left) because
disjunctive kriging uses a linear combination of functions of the data instead of a linear combination of the
original data values as linear kriging does. To create the probability map (figure A1.52 at right), disjunctive
kriging uses a linear combination of indicators. The difference between cokriging of indicators and
disjunctive kriging is that the latter uses a different number of indicators at different prediction locations, and
the indicators are uncorrelated so that the most difficult part of semivariogram modeling—cross‐covariance
model fitting—is avoided (see the discussion in chapter 9).
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.52
When a geostatistical layer is created, other output maps can be immediately displayed using the same
disjunctive kriging model. For example, figure A1.53 shows the standard error of indicators (left) and prediction (right) maps. The ability to produce prediction and prediction standard error maps is one of the advantages of using the disjunctive kriging model over indicator kriging.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.53
Indicator, probability, and disjunctive kriging with replicated data are not possible since these models require
an exact answer to the question of whether the threshold is exceeded: yes (indicator value 1) or no (indicator value 0). In the case of cesium-137 contamination of forest berries, it is not always easy to decide whether the
upper permissible value of 185 Bq/kg was exceeded at the measurement location. Figure A1.54 at left shows
histograms and simple statistics for the measurements made in three villages. There are many measurements
that are either smaller or larger than the threshold 185. Although all three mean values are above the
threshold, two median values are below it.
All measurements can be used with one of the linear kriging models. For example, figure A1.54 at right shows
the map of probabilities that the 185 Bq/kg threshold is exceeded created using lognormal ordinary kriging.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.54
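The probability map in figure A1.54 at right can be understood as follows: if a prediction m and its standard error s are available on the log scale and are taken to describe a normal distribution of the log values, the exceedance probability follows directly. A sketch under that assumption, with made-up inputs:

# Sketch: probability that a threshold (here 185 Bq/kg) is exceeded, assuming the
# log-scale prediction m and its standard error s describe a normal distribution
# of log contamination at the prediction location.
import numpy as np
from scipy.stats import norm

def exceedance_probability(m, s, threshold=185.0):
    return 1.0 - norm.cdf((np.log(threshold) - m) / s)

print(exceedance_probability(m=np.log(150.0), s=0.5))   # below the threshold on average
print(exceedance_probability(m=np.log(400.0), s=0.5))   # above the threshold on average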
A probability map can also be created using geostatistical conditional simulation techniques (see chapter 10
and the last section of this appendix).
MOVING WINDOW KRIGING
From the data exploration and data modeling above, it is clear that spatial variation of cesium‐137 soil and
berry contamination is different in different parts of the data domain. In this case, it is difficult to describe the
data variation accurately by a single semivariogram model even after data detrending and transformation.
The Geostatistical Analyst geoprocessing tool Moving Window Kriging helps to estimate the semivariogram
model locally, using a specified number of the nearest observations (see chapter 9). To illustrate the Moving Window Kriging tool, cesium-137 soil contamination data are used.
Figure A1.55 at left shows Method Summary, the last dialog box in the Geostatistical Analyst wizard. The
geostatistical model can be saved to the XML file by clicking Save and typing the name of the file, for example,
Ordinary Kriging lognormal.xml. The next step is to run the Moving Window Kriging tool (figure A1.55 at
right). The previously created kriging model is used as the geostatistical model source, and the cesium‐137
feature layer is used as the input dataset. A feature layer with the coordinates of the prediction locations is also needed; in this example, it is the g16pts shapefile, which contains the centers of an overlapping grid with 10-by-10-mile cells. The number of nearest points that will be used for estimation of the local semivariogram
model (in this example, the 100 nearest points) is also specified. The model is run, and a new layer with local
predictions, local prediction standard errors, and local semivariogram model parameters is produced.
Figure A1.55
Figure A1.56 shows two maps of local prediction standard errors produced using lognormal ordinary kriging (left) and disjunctive kriging with normal score transformation (right). Although in both cases the prediction standard errors depend on the data values, the error variations are different, and additional
efforts are required to find out which moving window kriging model better describes the variation of cesium‐
137 soil contamination data.
Figure A1.56
Figure A1.57 at left shows a smooth prediction map created using moving window kriging predictions as
input to the local polynomial interpolation model with parameters estimated by the software. The prediction
map is not very detailed because a relatively small number of input data were used.
Estimated semivariogram parameters are useful from a data exploration point of view because they have a physical meaning: the partial sill is the amount of spatially structured data variation, the nugget is largely the data variation due to measurement errors, and the range is the distance beyond which data are statistically independent. The
histogram in figure A1.57 at right shows the distribution of estimated nugget parameters at the prediction
locations. The distribution consists of two subpopulations. The population with a large nugget parameter is
selected, and the map in figure A1.58 at left shows where large measurement errors are expected.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.57
Figure A1.58 shows two additional Histogram dialog boxes with histograms of estimated partial sill and range
parameters and a table with moving window kriging output. The exploratory data analysis tools, the table,
and a feature layer are linked, and the values of the parameters of the semivariogram models and the
predictions in the areas with large nugget parameter are highlighted. We can see that typically when nugget
parameter is large, partial sill and predictions are small, prediction standard errors are large, and ranges are
rather indifferent to the estimated nugget values. This information is useful for better understanding of the
distribution of the cesium‐137 soil contamination data and the geostatistical modeling.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.58
Figure A1.59 shows smooth maps of the locally estimated nugget at left and sill (sum of the nugget and partial
sill) semivariogram model parameters at right. These maps can be used, for example, in the monitoring
network design: it makes sense to collect new measurements of cesium‐137 soil contamination in areas
where nugget, or sill, or both are large.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.59
EXAMPLE OF GEOPROCESSING: FINDING PLACES FOR NEW MONITORING
STATIONS
Suppose that a researcher wants to collect 20 additional measurements of cesium‐137 berry contamination.
The question arises, where are the best locations for the new measurements given the observed data on the
soil and berry cesium‐137 contamination? This section will show how to choose new monitoring stations
with Geostatistical Analyst geoprocessing tools, using formulas from “Ideas on Network Design Formulated after 1963” in chapter 10.
In practice, cesium-137 measurements are collected in populated places, not at arbitrary mathematical points. Therefore, locations for new measurements will be chosen among the Byelorussian
settlements. A limitation for potential measurement locations can be introduced using, for example, the
prediction standard error map of berry contamination created with the cokriging model discussed in the
Multivariate Prediction section.
Figure A1.60 shows settlement locations (gray dots) inside contours where the prediction standard error was
greater than 555 Bq/kg. These contours were created by exporting the cokriging predictions using the
geostatistical layer’s option Export to Vector (top left). The cesium‐137 berry contamination data are
displayed with large circles.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.60
Potential locations for new measurements in figure A1.60 are selected using information on the prediction
standard error, and it makes sense to use other kriging outputs: 1) the estimated probability that the upper
permissible value of the contamination is exceeded, and 2) the estimated quantile values at the candidate locations. The following two formulas will be used:
1. Locations with the largest weighted prediction standard error at location s are candidates for a new monitoring station.
2. Locations with the largest value of the optimization function defined in chapter 10, in which Z(s) is the estimated value of the variable under study at location s and the subscripts indicate quantiles of the estimated prediction distribution at location s, are also candidates.
The next step is to use a tool written by Katja Krivoruchko at Esri. This tool selects the specified number of
new monitoring stations from the list of candidate locations based on the largest prediction standard error (StdErr) or on the optimization criteria O1 and O2. The Python scripts, the Monitoring Network tool, and the data are
provided in the folder Assignment A1.1. Instructions for running the script are as follows:
1. Add the Geostatistical Analyst samples toolbox to ArcToolbox.
2. If the script tool has a red ✕ next to it, right‐click the tool and click Properties. The Monitoring
Network Properties dialog box opens (figure A1.61 at left). Navigate to the folder with
Ga_monit_network.py file. Also make sure that this folder contains GA_AddStations.py and
GA_OptimizeLocs.py files.
3. Three kriging models need to be created using the Geostatistical Analyst wizard. First, create a
kriging model for a prediction map, for example, using ordinary kriging with log transformation, and save the resulting model as prediction.xml by clicking Save (figure A1.61 at right). Go back to the first
wizard dialog box and change Output Type to quantile map, click Finish, and save the model as
quantile.xml. Repeat this operation one more time to create a probability model with a 370 Bq/kg
threshold and save it to the probability.xml file.
4. Open the Monitoring Network tool (figure A1.62) within the Geostatistical Analyst samples toolbox.
Run the tool with specified inputs.
Figure A1.61
Notes:
Most of the Monitoring Network tool’s parameters are self-explanatory. The last parameter, Inhibition distance, prevents choosing a location as a candidate for a new measurement if it lies inside a circle whose radius equals the specified inhibition distance around any location selected at previous iterations (a sketch of this selection logic follows these notes). This prevents clustering in the pattern of new monitoring locations and reduces computational time, since the number of potential locations where kriging makes predictions is significantly reduced at each iteration.
Make sure that probability.xml, prediction.xml, and quantile.xml files exist in the specified workspace
directory.
Make sure that the output is specified as a shapefile (a file with .shp extension).
In this exercise, a threshold value of 370 Bq/kg was used, which is twice the upper permissible level. A large threshold value was chosen because the majority of berry samples are contaminated above the upper permissible level of 185 Bq/kg, and with a threshold of 185 Bq/kg there would be too many unsampled locations that improve the monitoring network when added.
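The selection logic sketched below is only an illustration of the greedy rule described in the notes; the Monitoring Network tool itself also reruns kriging as stations are added, which this sketch omits.

# Sketch: greedy selection of new monitoring stations from candidate locations.
# Each iteration picks the candidate with the largest criterion value (StdErr, O1, or O2)
# and then drops all candidates within the inhibition distance of the chosen point.
import numpy as np

def select_stations(xy, criterion, n_new, inhibition_distance):
    xy = np.asarray(xy, dtype=float)              # shape (n_candidates, 2)
    criterion = np.asarray(criterion, dtype=float)
    available = np.ones(len(xy), dtype=bool)
    chosen = []
    while len(chosen) < n_new and available.any():
        idx = np.flatnonzero(available)[np.argmax(criterion[available])]
        chosen.append(idx)
        dist = np.hypot(*(xy - xy[idx]).T)
        available &= dist > inhibition_distance   # inhibition: clear a circle around the pick
    return chosen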
Figure A1.62
The output shapefile BerryCs_20_O1 contains the newly selected monitoring stations. The monit_network
shapefile created in the workspace folder contains predictions, prediction standard errors, probability,
quantile, O1, and O2 values for the proposed new monitoring locations. The map of the smoothed optimization criterion O1 is shown in figure A1.63 at left. Green points show the 20 proposed locations for new measurements calculated using the O1 criterion. These 20 points are also shown in figure A1.63 at right, together with the potential locations remaining after inhibition (small green points).
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.63
Similar maps using optimization criterion O2 are shown in figure A1.64. In this example, the 20 proposed
measurement locations are selected in quite different places because the optimization criteria are based on
different weighting of the prediction standard error. Therefore, additional research and perhaps more
sophisticated algorithms for finding the best new measurement locations are needed (see one possible
improvement of the optimization criteria in assignment 2 at the end of this appendix).
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.64
It should be noted that several monitoring network design geoprocessing tools will be available in the
Geostatistical Analyst version after 9.3.
DETERMINISTIC MODELS
The implementation of three deterministic models in Geostatistical Analyst is similar to the implementation
of kriging: both have default values, a flexible searching neighborhood, prediction map preview, anisotropy
and smooth map creation options, and validation and cross‐validation diagnostics. The deterministic models
often produce nice‐looking maps, but the accuracy of these maps is unknown. Prediction uncertainty can be
estimated for deterministic models, although not without problems. For example, the local polynomial interpolation prediction error can be estimated assuming that the model is correct, although it often is not.
Then, in addition to the prediction uncertainty, a spatially varying function that shows areas where the model
assumptions are violated needs to be calculated and displayed. In this section, we illustrate the usage of
inverse distance weighted, radial basis functions, and local polynomial interpolation using the northeast part
of the cesium‐137 soil contamination data.
Figure A1.65 shows the IDW (inverse distance weighted) interpolation dialog box with an optimized power
value equal to 2.99 instead of the default value 2.0. The software performs optimization by iterative use of
cross‐validation to find the 2.99 power value. The preview of the map (figure A1.65 at center) is initialized by
changing the Preview type from Neighbors to Surface. The next dialog box in the wizard is the cross‐
validation diagnostics dialog box (partially shown in figure A1.65 at right). Only two averaged prediction
errors are calculated, since IDW does not provide information about prediction uncertainty.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.65
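The power optimization can be reproduced conceptually by leaving each point out in turn, predicting it by inverse distance weighting from its nearest neighbors, and keeping the power with the smallest root-mean-square error. A sketch (x, y, z are placeholder arrays; the software's optimizer may differ in details such as the neighborhood used):

# Sketch: choosing the IDW power by leave-one-out cross-validation,
# as the optimization in the dialog box does conceptually.
import numpy as np

def idw_loo_rmse(x, y, z, power, n_neighbors=15):
    n = len(z)
    errors = np.empty(n)
    for i in range(n):
        dist = np.hypot(x - x[i], y - y[i])
        dist[i] = np.inf                          # leave the point itself out
        nearest = np.argsort(dist)[:n_neighbors]
        w = 1.0 / dist[nearest] ** power
        errors[i] = np.sum(w * z[nearest]) / np.sum(w) - z[i]
    return np.sqrt(np.mean(errors ** 2))

# powers = np.arange(1.0, 4.01, 0.01)
# best_power = min(powers, key=lambda p: idw_loo_rmse(x, y, z, p))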
In figure A1.66, the searching neighborhood shape was changed from a circle to an ellipse. This results in
anisotropic interpolation. The optimal power value from the cross‐validation point of view is changed when
the searching neighborhood is changed and now equals 3.69. According to the cross-validation diagnostics, this arbitrarily chosen searching neighborhood leads to better predictions because the prediction
root‐mean‐square error is smaller than it is in figure A1.65. The cross‐validation diagnostic is helpful but not
always sufficient in deciding on the best model, as discussed in chapter 6.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.66
The radial basis functions are variants of spline interpolation with knots at measurement locations. Figure
A1.67 shows three default maps created using three different radial basis functions (kernels): inverse
multiquadric, thin plate spline, and completely regularized spline.
The predictions in the preview windows are different, and it is even more difficult than in the case of inverse distance weighted interpolation to choose the best model, because the default parameters for the radial basis functions are estimated using the cross-validation technique, so this diagnostic is not very helpful for improving the radial basis function. In practice, this is not a big problem because the goal of radial basis function interpolation is smooth map production, while prediction accuracy is a secondary goal.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.67
Figure A1.68 shows previews of local polynomial interpolation with three different power values (orders of
polynomial). According to the cross‐validation diagnostic (not shown here), the best model is zero‐order
polynomial. Zero‐order local polynomial interpolation is similar to inverse distance weighted interpolation:
both involve moving window kernels with exponential and power shapes, respectively.
Local polynomial interpolation is used in Geostatistical Analyst for exploratory data analysis and as a
detrending tool. Problems with this model motivated Lev Gandin to invent statistical optimal interpolation,
later called kriging.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A1.68
If used properly, kriging predictions are always more accurate than predictions made by the deterministic
models (in addition, a lot can be learned about data during geostatistical analysis). If the validation or cross‐
validation root‐mean‐squared prediction error of inverse distance weighted interpolation is smaller than that
of kriging, this indicates that the geostatistical model can be significantly improved.
VALIDATION DIAGNOSTICS
Validation uses a subset of data, the training dataset, to develop a model. Then this model is used for
comparing the measured and predicted values at the remaining (test) locations.
In Geostatistical Analyst, the test and training datasets can be created from the data using the Create Subsets
tool shown in figure A1.69 at left. The training dataset is used as input to one of the interpolation models, and
the test dataset is used for the validation of predictions, as shown in figure A1.69 at right.
Figure A1.69
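A random split of the kind produced by the Create Subsets tool can be sketched in a few lines (the tool's exact sampling scheme is not reproduced here):

# Sketch: random split of the measurements into training and test subsets,
# mirroring what the Create Subsets tool produces.
import numpy as np

def train_test_split_indices(n_points, test_fraction=0.2, seed=0):
    rng = np.random.default_rng(seed)
    permuted = rng.permutation(n_points)
    n_test = int(round(test_fraction * n_points))
    return permuted[n_test:], permuted[:n_test]   # training indices, test indices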
If the validation option is used, the Validation dialog box appears after the Cross‐validation one (figure
A1.70). It has the same features as the cross‐validation dialog. The difference between validation diagnostic
for deterministic (A1.70 at left) and statistical (A1.70 at right) models is in the number of calculated statistics.
A discussion and an illustrative example of validation diagnostics can be found in chapter 6 in the section
“Validation.”
Figure A1.70
ABOUT GEOSTATISTICAL ANALYST 9.3
Geostatistical Analyst 9.3 provides two additional geoprocessing tools: 1) Gaussian geostatistical simulation
discussed in chapter 10 (kernel convolution method) and 2) exploratory data analysis with geographically
weighted regression discussed in chapter 12. Also, this new version of the extension supports more than one processor and can allocate tasks between the processors, which increases the speed of calculations. It should also be noted that moving window kriging works much faster in version 9.3 than in 9.2.
Figure A1.71 shows the Gaussian geostatistical simulation geoprocessing tool. It requires a simple kriging model as the input parameter: a geostatistical layer created in the Geostatistical Analyst wizard. Simple
kriging with normal score transformation and with the option “mixture of Gaussian kernels” is recommended
as a default model for this geoprocessing tool. The user should also specify the number of simulations, output
grid cell size, a folder where simulations will be stored, and the prefix for simulations. In the example, prefix
“b” is specified and simulations are stored as grids with names bs1, bs2, …, bs100. If conditional simulation is
required, the conditioning data should be provided. In this example, it is a shapefile with measurements of
cesium‐137 contamination in the forest berries. The following statistics can be calculated for each grid cell:
minimum; maximum; mean; standard deviation; first, second (median), and third quartiles; one specified
quantile; and the probability that the specified threshold was exceeded. Using the “input statistical polygons”
option, the statistics above as well as the “statistics of statistics” described at the end of chapter 10 are
calculated and saved into the polygonal feature layer.
Figure A1.71
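The per-cell statistics listed above are straightforward to compute from a stack of simulated grids once they have been read into an array; a NumPy sketch, assuming a 3-D array with one simulation per slice:

# Sketch: per-cell statistics over a stack of simulated grids
# (sims is assumed to have shape (n_simulations, n_rows, n_cols)).
import numpy as np

def simulation_statistics(sims, quantile=0.9, threshold=185.0):
    return {
        "min": sims.min(axis=0),
        "max": sims.max(axis=0),
        "mean": sims.mean(axis=0),
        "std": sims.std(axis=0),
        "q1": np.percentile(sims, 25, axis=0),
        "median": np.percentile(sims, 50, axis=0),
        "q3": np.percentile(sims, 75, axis=0),
        "quantile": np.percentile(sims, 100 * quantile, axis=0),
        "prob_exceed": (sims > threshold).mean(axis=0),
    }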
The output of the Gaussian geostatistical simulation geoprocessing tool is shown in figure A1.72. The mean
values of the simulations inside polygons (Belarus districts) are displayed using colors from blue to red. The
standard deviation of 100 simulations for each grid cell is shown in the background. Statistics of the
simulated values in the polygons can be used in epidemiological analysis (see the example in appendix 3).
Figure A1.72
The geographically weighted regression geoprocessing tool in figure A1.73 at left is available in the
ArcToolbox “Spatial Statistics Tools” for users who hold licenses for Geostatistical Analyst or ArcInfo. Here we
illustrate the tool usage using crime and social data collected in the counties of the southeastern United
States. One goal is to explain the total number of robberies (robbery is defined as taking or attempting to take anything of value by force or threat of force or violence) using data on population density, total number of unemployed
persons in the county, total number of burglaries (burglary is defined as unlawful entry into a building or
other structure with intent to commit a felony or theft), and the total number of larcenies (larceny is defined
as unlawful taking of property from another person, and is also called theft; motor vehicle thefts are not
included).
Dependent and explanatory variables are specified in the top part of the geoprocessing tool dialog. Several
options for the moving window kernel are available in the middle part of the dialog. Input data can be
weighted, for example, by the number of observations in the polygon. Output of the geographically weighted
regression geoprocessing tool includes grids of the varying regression coefficients and some diagnostics for
the input data and for the predictions at the specified locations.
The right part of figure A1.73 shows counties with a calculated condition number (see the explanations in the
section “Geographically weighted regression” in chapter 12) and a part of the output table (bottom). The
condition number values are very large, indicating that the calculated spatially varying coefficients and diagnostics such as the local R2 must be questioned. According to the model, all R2 values are nearly equal to one, suggesting that the explanatory variables explain the dependent variable almost perfectly; this is because geographically weighted regression assumes that all model assumptions are satisfied, which rarely happens in practice.
Courtesy of the National Atlas of the United States, http://www.nationalatlas.gov.
Figure A1.73
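To see what the condition number measures, the following minimal R sketch (hypothetical data; not the tool's implementation) builds Gaussian kernel weights for one target location, weights the design matrix, and computes its condition number with kappa(); a common rule of thumb is that values above roughly 30 signal local collinearity among the weighted explanatory variables.
# Illustration only: local condition number at one hypothetical target location
set.seed(1)
n <- 100
coords <- cbind(runif(n), runif(n))                   # hypothetical centroid coordinates
X <- cbind(1, rnorm(n), rnorm(n))                     # intercept plus two explanatory variables
target <- c(0.5, 0.5)                                 # one target location
bw <- 0.2                                             # kernel bandwidth
d <- sqrt(rowSums((coords - matrix(target, n, 2, byrow = TRUE))^2))
w <- exp(-(d / bw)^2)                                 # Gaussian kernel weights
kappa(sqrt(w) * X, exact = TRUE)                      # local condition number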
Spatially varying regression coefficients for explanatory variable number four, the total number of
unemployed persons in the county, are shown in figure A1.74 at left. The prediction standard errors shown in
figure A1.74 at right are very large, indicating that there is a problem with this regression coefficient
estimation. Note also that the large‐scale data variation displayed in the maps in figure A1.74 is clearly
different, meaning that the mean and variance of the regression coefficient are not related.
Public Domain.
Nationalatlas.gov.
Figure A1.74
More information on Geostatistical Analyst 9.3 features can be found at
http://webhelp.esri.com/arcgisdesktop/9.3/index.cfm?TopicName=An_overview_of_Geostatistical_Analyst.
ASSIGNMENTS
1) REPEAT THE ANALYSIS SHOWN IN THIS APPENDIX.
In other words, reproduce the maps shown in the following sections of this appendix:
“Displaying the semivariogram surface on the map”
“Predictions using ordinary kriging”
“Replicated data prediction using lognormal ordinary kriging”
“Quantile map creation using simple kriging”
“Continuous predictions”
“Multivariate prediction”
“Probability maps creation using disjunctive and ordinary kriging”
“Moving window kriging”
Data are in the folder Assignment A1.1.
2) USE THE GEOSTATISTICAL ANALYST MODELS AND TOOLS TO ANALYZE THE
HEAVY METALS MEASUREMENTS COLLECTED IN AUSTRIA IN 1995 (FIGURE
A1.75).
In particular:
Find the optimal kriging model for prediction of the mercury contamination over the Austrian
territory.
Create a map of the probability that cadmium contamination exceeded the threshold value of 0.2
mg/kg.
The data description can be found in chapter 14 and in the following paper: Zechmeister, Harald. 1997.
Schwermetalldeposition in Österreich erfaßt durch Biomonitoring mit Moosen (Aufsammlung 1995). Vienna:
Umweltbundesamt Monographien, Band 94.
Note that these data are truncated: zeroes correspond to the values below the measurement device’s
sensitivity limit (see “Censored and Truncated Data” in chapter 3).
Copyright © 1995, Umweltbundesamt, GmbH.
Figure A1.75
Data are in the folder Assignment A1.2.
3) FIND THE 20 BEST PLACES FOR COLLECTING NEW VALUES OF ARSENIC TO
IMPROVE PREDICTIONS OF THIS HEAVY METAL DISTRIBUTION OVER AUSTRIAN
TERRITORY.
Use the geoprocessing tool described earlier in this appendix in the section "Example of geoprocessing: finding places for new monitoring stations." Use the coordinates of a dense grid overlapping the Austrian territory. Modify the optimization criterion by weighting it by a priori probability values in the range [0, 1]. For example, assign small a priori probability values to the areas near large roads and populated places, as required by the heavy metals deposition in mosses project (see the section
“Analysis of spatially correlated heavy metal deposition in Austrian moss” in chapter 14). Data are in the
folder Assignment A1.2.
4) PRACTICE WITH THE GAUSSIAN GEOSTATISTICAL SIMULATION
GEOPROCESSING TOOL.
If you have access to Geostatistical Analyst 9.3, read the documentation on the Gaussian Geostatistical
Simulation geoprocessing tool and then:
Practice using this tool with the data from assignment 1.
Reread the section on Monte Carlo simulation in chapter 5 and estimate the distribution of the
correlation coefficient between cadmium and mercury moss contamination from assignment 2.
FURTHER READING
1) Johnston K., J. M. Ver Hoef, K. Krivoruchko, and N. Lucas. 2001. Using ArcGIS Geostatistical Analyst.
Redlands, Calif.: Esri Press.
APPENDIX 2
USING R AS A COMPANION TO ARCGIS
DOWNLOADING R
THE FIRST R SESSION
READING AND DISPLAYING THE DATA
SCATTERPLOTS
THE LINEAR MODEL
FITTING A LINEAR MODEL IN R
REGRESSION DIAGNOSTICS
BEYOND LINEAR MODELS
RUNNING R SCRIPTS FROM ARCGIS
ASSIGNMENTS
1) REPEAT THE REGRESSION ANALYSIS OF DATA ON INFANT MORTALITY AND
HOUSE PRICES
2) VERIFY THE ASSUMPTIONS OF THE LINEAR REGRESSION MODEL
FURTHER READING
R is a language and environment for statistical computing and graphics. R is a different implementation of the S programming language used in the commercial statistical software S-PLUS. A large group of statisticians and programmers developed R, and many of them have also contributed libraries to S-PLUS. Today, R is used for teaching and research in most universities worldwide.
When R is started, the following statement appears: “R is free software and comes with ABSOLUTELY NO
WARRANTY.”
This is one of the differences between freeware and commercial software—vendors of commercial products
take responsibility for implemented functionality. Even so, most of the base R functions are well tested since
these functions have been used for years. Users have reported errors, and programmers have fixed them.
Unfortunately, most R libraries have limited documentation, usually just a short description of the functions' parameters.
R is extensive and includes a large number of libraries (usually called packages), including libraries for spatial statistical analysis that researchers write and make available for download. Generally, the quality of these add-on packages is unknown; researchers often do not have the time or the desire to extensively test the algorithms and code. As an example of the many errors found in freeware statistical software, note that the publicly available geostatistical library GSLIB (a collection of Stanford University student works written in FORTRAN) included about 4,000 known errors. GSLIB is one of the most popular pieces of scientific freeware, with thousands of users. The appearance of this freeware library dramatically changed practical geostatistical data analysis: the number of users greatly increased, but not necessarily the quality of scientific data analysis.
Because R is open-source software, existing functions can be modified using the R programming language,
which includes conditionals, loops, user‐defined recursive functions, and input and output facilities. But in
practice, modifying complex third‐party code is more difficult than developing a new implementation of the
algorithm.
R is available as free software under the terms of the Free Software Foundation’s GNU General Public License
in source code form. It runs on a wide variety of UNIX platforms, Windows, and Mac OS.
This appendix explains how to perform regression analysis in R but includes limited explanations on the
meaning of the analysis. Therefore, it might be useful to review several points about the interpretation of
linear regression models:
Several assumptions about linear regression models should be verified (see the beginning of the
section “Spatial regression models diagnostics and selection” in chapter 12 and the section “Linear
models” below).
Good model fitting does not automatically mean that the model describes causal relationships
between variables under study. There could be an unidentified additional variable (called a lurking
variable) that has a causal effect on both response and predictor variables so that they are not
directly related. It is also possible that both a causal effect and a lurking variable contribute to the
association between response and predictor variables.
It is safe to assume that the variable of interest is a function of the explanatory variables, and the
regression analysis helps to find the parameters of the assumed function, not a causal model.
Accurate causal inference cannot be made from the regression analysis alone but may be done with
additional information on the studied phenomena (for example, such information is often available in
environmental applications).
The end of the appendix presents two examples of running R scripts from ArcGIS. Readers should have
access to a computer while reading this appendix. It should be noted that the packages’ owners change the
code from time to time, so that scripts written in the past may not work with the latest version of the
software.
DOWNLOADING R
The R statistical analysis program is available for download from http://cran.r-project.org/. The
process for downloading and installing is as follows:
1. Go to any CRAN site (see http://cran.r-project.org/mirrors.html for a list), navigate to
the bin/windows/base directory, and download R software (for example R‐2.3.1‐win32.exe setup
program).
2. Click on R‐2.3.1‐win32.exe and follow the standard installation instructions. When R is started, a
standard help menu is available that contains documentation on R. Additional documentation is
provided on the CRAN Web site.
To install and use R libraries, do the following:
Start R. Figure A2.1 shows the R graphical user interface window immediately after the program is
started.
Figure A2.1
Click Packages.
Click Install packages.
Browse to find the required packages (in this appendix, we will use packages maptools,
RColorBrewer, faraway, foreign, arm, MASS, mgcv, and spgwr), select them, and click OK.
Information about available packages can be found at http://cran.r-project.org/web/packages/.
R unzips the files and installs the packages into subfolders of the library folder of the R installation.
In each new R session, load the packages you need by typing library(package name) to access their functions and help files; for example, library(foreign).
After loading the library, typing help(function name) or ?(function name) at the command line
will explain how to use the library functions; for example, help(read.dbf) or ?read.dbf.
Using a text editor such as Notepad in combination with R is recommended.
THE FIRST R SESSION
Data analysis in R is organized as a dialog: type a statement at the > prompt, press Enter, and R executes the statement. In this way, R can be thought of as an overgrown calculator. Type the following statements in the R console window:
2+5
2/5
2^5
2^5-2/5
R includes hundreds of functions with arguments specified within parentheses after the function name (note that an argument to a function usually has a default value). Type the following statements (use the help system if you do not know what some of these functions do; for example, type ?floor):
sqrt(abs(9))
besselK(2,3)
max(besselK(2,3), sqrt(abs(9)), 2^5-2/5)
cos(pi/4)
asin(0.5)
log10(20)
exp(5)
floor(5.4)
beta(2,3)
rbinom(1, size= 100, prob=0.7)
b <- rbinom(9,100,0.7)
rnbinom(10,100,0.7)
rnorm(1)
rlnorm(1)
rpois(1,10)
runif(5)
In R, an argument is in fact a vector (a single number is a vector of length one). One way to create a vector is with the "c" function, which combines
several elements. For example, type the following statement:
c(4,3,2,1)
The result returned by the "c" function can be assigned to a variable (the left-pointing arrow <- is the assignment operator):
a <- c(9,8,7,6,5,4,3,2,1)
b <- c(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
The two commands above do not print anything, but we can see what is inside a variable by entering the variable's name. When a vector is printed to the screen, each output line is preceded by the index (in square brackets) of the first vector element shown on that line:
> log(a)
[1] 2.1972246 2.0794415 1.9459101 1.7917595 1.6094379 1.3862944 1.0986123 0.6931472 0.0000000
> mean(a)
[1] 5
> var(a)
[1] 7.5
> sd(a)
[1] 2.738613
The function summary is an example of a generic function, whose behavior depends on the class of its argument. For
example, compare the following summaries:
summary(log(a))
summary(lm(a~b))
If only one element of the vector should be printed, the index of the element is specified in the square
brackets:
a[2]
Missing values are specified using the logical constant NA, the missing value indicator:
sum(a^2)
a[3] <- NA
a
sum(a^2)
Missing data are very common, and software needs to be able to handle them. The R function sum() has an argument called na.rm, which is set to FALSE by default, meaning that missing values are not removed and the summation returns NA. To calculate the sum of the available values, type the following command:
sum(a^2, na.rm = TRUE)
READING AND DISPLAYING THE DATA
We will illustrate data reading and initial data analysis using infant mortality data collected in four southern
states, Alabama, Arkansas, Louisiana, and Mississippi, during 1995–1999. These data are provided by the
National Atlas of the United States at http://nationalatlas.gov/atlasftp.html. A subset of the
data used in this appendix is available in the folder Assignment A2.1. Figure A2.2 shows a choropleth map of the death rates (the number of infant deaths per 1,000 live births) in 288 counties, displayed in ArcMap. The green circles depict the locations of the county centroids. Other available data include the
percentage of mothers who smoked during pregnancy, per 100 women who gave birth to live infants; and the
rates of women ages 15–17, 20–29, and 40–54 who had a live‐born infant per 1,000 women of that age. This
additional data can be used in the regression modeling in an attempt to explain the variation in infant
mortality.
Courtesy of the National Atlas of the United States, http://www.nationalatlas.gov.
Figure A2.2
The default folder where the data are stored can be specified using the command setwd:
setwd("e:\\book_data\\intro_to_r")
Several packages can import commonly encountered spatial data formats. Since we are interested in
analyzing data that already exists in GIS, we will use the package maptools, which can read and write
shapefiles. In the following text we are not explaining all functions and their parameters because this
information is available in the package’s help files. The maptools package is loaded using command
library(maptools).
The next two commands read the shapefile four_southern_states and convert shapefile data to a list of
polygons us4polys:
us4 <- read.shape("four_southern_states.shp")
us4polys <- Map2poly(us4, region.id=as.character(us4$att.data$FIPS))
The infant mortality data are stored in the data frame us4 with rows representing observations. The columns
are accessed using the $ operator. Square brackets are used to access data frame values.
A simplified data frame with a table of observations can be created using the following command:
us4df <- us4$att.data
The function str(us4df) shows the structure of the variable us4df; the function names(us4df) gets the names of
an object us4df; and the function summary(us4df) displays simple statistics of the observations (in other cases
summary() can be a summary of the results of various model fitting functions).
Visualization of the map of infant mortality requires some preparation. First, we can choose a desirable color palette using the RColorBrewer package:
library(RColorBrewer)
display.brewer.all()
Figure A2.3 shows the available palettes and their names.
Figure A2.3
Next, we compute quantile values that divide the range of the us4df$death_rate variable into six classes (the function seq() below generates a regular sequence from a starting value to an ending value by a given increment):
brks <- quantile(us4df$death_rate, seq(0,1,1/6))
brks
The function round() rounds the values in its first argument to the specified number of decimal places:
brks.r <- round(brks, digits=1)
brks.r
For our first map we choose the "Oranges" palette with six colors:
cols <- brewer.pal(length(brks.r)-1, "Oranges")
The next command plots a map:
plot(us4polys, col=cols, forcefill=FALSE, main="Death rates")
The function legend() adds a legend to the map (the position of the legend's rectangle, c(8.5e5,11e5), c(7e5,9.3e5), is chosen manually):
legend(c(8.5e5,11e5), c(7e5,9.3e5), fill=cols, legend=leglabs(brks.r), bty="n", ncol=1, x.intersp=0.9, y.intersp=0.9)
The resulting map is shown in figure A2.4 at left. Often data are centered and scaled using their mean and standard deviation to simplify a comparison between variables. We can use the function scale() to do this data transformation. By default, centering is done by subtracting the data mean, and then scaling is done by dividing the centered data by the data standard deviation. The command windows() in the code below sends the output to a new graphics window.
scaled.dr <- scale(us4df$death_rate)
brks <- quantile(scaled.dr, seq(0,1,1/4))
brks.r <- round(brks, digits=1)
pal <- brewer.pal(length(brks.r)-1, "Oranges")
cols <- brewer.pal(length(brks.r)-1, "Oranges")
windows()
plot(us4polys, col=cols, forcefill=FALSE, main="Death rates scaled")
legend(c(8.5e5,11e5), c(7e5,9.3e5), fill=cols, legend=leglabs(brks.r), bty="n", ncol=1, x.intersp=0.9, y.intersp=0.9)
SCATTERPLOTS
Figure A2.4 at right shows a scatterplot of the death rates (y axis) versus the rates of women ages 15–17 who
had a live‐born infant (x axis). An inferred relationship between the two variables is obtained using the
function lowess(), shown with a red line. This function performs robust locally weighted regression. Figure
A2.4 at right was created using the following commands:
plot(us4df$AGE15_17,us4df$death_rate)
lines(lowess(us4df$AGE15_17,us4df$death_rate), col="red", lwd=2)
As expected, we see that the increase in the death rate corresponds to the increase in the number of very
young women who gave birth.
Courtesy of the National Atlas of the United States, http://www.nationalatlas.gov.
Figure A2.4
There are several other variables, and the following command draws scatterplots for four of them in figure
A2.5:
pairs(cbind(us4df$death_rate,us4df$smok_all,us4df$AGE15_17, us4df$AGE40_54),
panel=function(x,y){
points(x,y)
lines(lowess(x,y), col="blue", lwd=2)
},
diag.panel=function(x){
par(new=TRUE)
hist(x, main="", axes=FALSE, col="green")
abline(v = mean(x), lwd=2, col="red")
},
lower.panel=NULL
)
The command cbind() in the code above takes a sequence of four vectors us4df$death_rate, us4df$smok_all,
us4df$AGE15_17, and us4df$AGE40_54 and combines them by columns. The command hist() displays a
histogram of each variable. The command abline() adds a straight line to the current plot, in this case a vertical line at the mean value of the variable. The keyword function defines our own function with the formal arguments x and y; when the function is called, the actual arguments take the place of the formal arguments.
Figure A2.5
Figure A2.5 shows that the death rate depends on smoking (although in a rather strange fashion: as smoking increases, the death rate decreases) and on the age of the women. The variables are in the order listed in the function cbind(), so that var 2 is smoking and var 3 is ages 15–17.
Note that a similar scatterplot is available in ArcMap. Figure A2.6 shows the relationships between the same
variables. An advantage of this scatterplot is that the data table, a map, and a graph are related (pink points
and polygons in figure A2.6), allowing effective data exploration using the selection option.
Figure A2.6
The next step in the infant mortality analysis is hypothesis testing on the dependencies of response and
explanatory variables. The fundamental model in regression analysis is the linear model, and we will
illustrate its use in R in the next section.
THE LINEAR MODEL
Mathematically, the linear model can be written as
y_i = β_0 + β_1·x_i1 + β_2·x_i2 + … + β_m·x_im + ε_i,   i = 1, …, n,
where y_i, called the response variable, is the variable of interest observed n times (in our case, one time in each of the 288 counties); x_i1, …, x_im are the explanatory (predictor) variables; ε_i is the unexplained variation in the response variable (the errors); and β_0, β_1, …, β_m are the unknown constants (the regression coefficients). It is assumed that the data and the errors have a common variance denoted σ². It is also assumed that m is less than n; for if m is equal to n, the model is overparameterized; that is, there are as many unknown parameters as there are data values, and finding the values of β is not interesting from a statistical point of view since the problem has a unique solution.
A model is nonlinear if at least one derivative of the mean function with respect to the parameters depends on at least one parameter. For example, the model above is linear because none of its derivatives with respect to β_0, β_1, …, β_m depends on the parameters. An example of a nonlinear model is y_i = β_0·exp(β_1·x_i1) + ε_i.
The expected value of the error term of a linear model is zero, since any nonzero constant value is absorbed by β_0 (the intercept). It is assumed that ε_i is independent of the predictors x_ij.
The method for estimating the coefficients β_0, β_1, …, β_m is least squares, which minimizes the quantity
Σ_i (y_i − β_0 − β_1·x_i1 − … − β_m·x_im)².
The estimated coefficients have the form
β̂ = function(x)·y,
and the exact equation of function(x) can be found in statistical textbooks (in matrix notation, β̂ = (XᵀX)⁻¹Xᵀy, where X is the matrix of explanatory variables). The quantity (n − m) is called the degrees of freedom.
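As a minimal illustration of the least-squares estimate (a sketch only, using an unweighted fit of the infant mortality variables introduced earlier in this appendix), the coefficients can be computed directly from the normal equations and compared with the output of lm():
X <- model.matrix(~ smok_all + AGE15_17 + AGE40_54, data = us4df)   # design matrix with an intercept column
y <- us4df$death_rate
beta.hat <- solve(t(X) %*% X, t(X) %*% y)                            # (X'X)^(-1) X'y
beta.hat
coef(lm(death_rate ~ smok_all + AGE15_17 + AGE40_54, data = us4df))  # the same values from lm()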
The coefficient of multiple determination R² is a measure of the proportion of the variability explained by the linear regression. The closer R² is to one, the larger the proportion of variability explained by the linear regression and the better the fit of the linear model to the observed data (if the linear model assumptions are satisfied, which is not always the case).
The linear regression model has the following main assumptions:
The model includes all relevant predictors, and the response variable accurately describes the
process under study.
The deterministic component is a linear function of the predictors.
The errors ε_i and the predictors x_ij (not the response y_i) are independent.
The variance of the regression errors is constant. The errors are also assumed to be normally distributed in the case of hypothesis testing (typically, we test the hypothesis that a regression coefficient β_j is equal to zero given all other regression coefficients). If the random errors are not normally distributed, the least squares method may not be appropriate for estimating the model's parameters.
The errors are uncorrelated. This assumption is important for estimating prediction uncertainty, but
it does not influence estimation of the regression coefficients. Note that in the case of spatial data, the
errors are usually correlated.
The relationship between the response and predictor variables can be an artifact of inappropriate data
domain selection. Figure A2.7 at left shows the relationship between the logarithm of cesium‐137 soil
contamination and the risk of thyroid cancer in populated places from four districts in Belarus displayed in
the center of figure A2.7. The red line is an output from the linear regression between two variables. We see a
strong relationship between cesium‐137 soil contamination and the risk of thyroid cancer. Figure A2.7 at
right displays data collected in four districts using the same four colors that were used for painting polygons
in the center of the figure. Four estimated regression lines are shown, and we can see that all individual
slopes show a weaker relationship between the response and predictor variables than in the case when all
data were processed together. This phenomenon is called Simpson’s paradox in statistics, a modifiable area
unit problem in geography, and a change of support problem in geostatistics. A hidden variable that
determines the relationship between response and predictor variables is called a lurking variable. In this
case, the lurking variable is geography: each district is separated by a different distance from the Chernobyl
nuclear power plant, and meteorological conditions in the very first days after the accident were different in
each district. Interestingly, cesium-137 soil contamination is not a significant factor in explaining the spatial distribution of thyroid cancer rates; see the case studies in appendixes 3 and 4.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A2.7
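The aggregation effect described above is easy to reproduce with simulated data. The following minimal sketch (artificial numbers, not the Belarus data) generates four groups whose within-group slopes are weak (0.2) while the pooled slope is close to 1:
# Illustration: four groups with weak within-group slopes but a strong pooled slope
set.seed(1)
centers <- c(0, 3, 6, 9)                       # group centers lie on a line with slope 1
x <- unlist(lapply(centers, function(m) rnorm(50, mean = m, sd = 1)))
group <- rep(1:4, each = 50)
y <- centers[group] + 0.2 * (x - centers[group]) + rnorm(200, sd = 0.3)
coef(lm(y ~ x))[2]                             # pooled slope, close to 1
sapply(1:4, function(g) coef(lm(y[group == g] ~ x[group == g]))[2])  # within-group slopes near 0.2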
Examining the residuals is important for checking the fit of a model. Visual checking can be done using the
following plots:
Residuals against explanatory variables. Any systematic variability in the plot may represent some
feature of the data not captured by the model. If residual values are increasing and then decreasing, a
more reliable model can be specified by adding a quadratic term, say x_j², to the model. If the spread
of residuals increases as the explanatory variable increases, the variability of the error may depend
on the explanatory variable. Data transformation may help to remove this dependency. If not, a
nonlinear regression model may better explain the relationships between variables.
Residuals against any omitted explanatory variables. In the case of infant mortality data, it can be
variable AGE20_29, the rate of women ages 20–29 who had a child per 1,000 women of that age. If
the residuals are not correlated with the omitted variable, it is not necessary to include it in the
model.
Residuals against fitted values. This helps to check how the model is changing with the overall values
of the process. If the model is good, the residuals should look random.
The researcher may also want to consider the following diagnostic tests:
Test residuals for normality. Nonnormality suggests that the model is not describing the
relationships between variables well.
Test for unusually large residuals, whose presence in the data has a distorting effect on the estimated model's parameters. In practice, a few data outliers (in both the response and the explanatory variables) may distort the model, allowing one or more observations to have undue influence on the overall fit of the model. An outlier in the response variable often arises because of an error in the observation or because the model is an inaccurate description of the data. It is much harder to detect outliers among the explanatory variables if there are many such variables. The reason for the outliers
should be identified, and the suspected data may be excluded from the analysis, or robust and
resistant regression models may be considered.
FITTING A LINEAR MODEL IN R
A linear model of the dependence of the death rate on the smoking habit and on the age of women can be
fitted using function lm() with the following commands:
death_rate.linear_model <- lm(death_rate ~ smok_all + AGE15_17 + AGE40_54, data=us4df, weights=birth)
death_rate.linear_model
summary(death_rate.linear_model)
Figure A2.8 shows the result of the modeling in R console.
Figure A2.8
A linear model formula consists of the response variable on the left of the ~ (tilde) and the predictor variables on the right side. The data frame us4df is attached, and the variables are accessed by their names. The model includes an intercept term (a constant β_0) unless it is suppressed by a "−1" term in the formula. The lm function returns an object (in this case death_rate.linear_model). The command death_rate.linear_model gives a brief report of the regression results, and the command summary(death_rate.linear_model) gives a more detailed report.
The option "weights" produces a weighted least squares fit. In the case of infant mortality data, the statistical literature recommends weighting the rates proportionally to the number of births associated with each observation. The weighted residual sum of squares Σ w_i·(y_i − ŷ_i)² is then minimized. This makes sense because death rates in counties with a small number of births have more variability than death rates in counties with a large number of births. Another example of linear regression with variable weights is the geographically weighted regression discussed in chapter 12 and at the end of this appendix.
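As a small check (an illustration only), the weighted residual sum of squares that lm() minimizes with weights=birth can be computed directly from the fitted object:
w <- us4df$birth
sum(w * residuals(death_rate.linear_model)^2)   # weighted residual sum of squares
deviance(death_rate.linear_model)               # the same value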
The Residuals block of the output provides a summary of the residual values.
The first column in the "Coefficients" table displays the variable names. The second column, "Estimate," gives the estimated regression coefficients. (Intercept) is interpreted as the expected value of the variable death_rate if all explanatory variables have zero values. The other estimates give the expected increase of the response variable for a unit increase in the corresponding explanatory variable. The third column, "Std. Error," is the standard error of the estimate. The fourth column, "t value," is the ratio of the estimate to its standard error, and the last column, "Pr(>|t|)," is the probability of observing a value at least as extreme under the null hypothesis of a zero coefficient (the p-value). If the p-value is less than 0.05, statisticians usually conclude that the explanatory variable has a significant effect on the response variable. Commonly used significance codes are listed below the table of coefficients, and the p-values are flagged according to them.
Assuming that the regression model death_rate.linear_model adequately summarizes the data, the proportion of smoking women, the proportion of young women, and the intercept have a statistically significant influence on the death rate, while the proportion of middle-aged women is nearly significant: it is significant at approximately the 93-percent level (see chapter 4), meaning that this variable may also influence the infant mortality rate.
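Confidence intervals for the regression coefficients at any chosen level can be computed with confint(), which makes such statements easy to check (an illustration only):
confint(death_rate.linear_model)                 # 95 percent intervals (the default)
confint(death_rate.linear_model, level = 0.93)   # intervals at the 93 percent level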
The residual standard error estimates the standard deviation σ of the errors; it is reported together with the degrees of freedom (n − m) defined above. The proportion of the variability explained by the linear regression is small, since the coefficient of multiple determination R² is small. An R² value close to 1 would suggest that the model explains most of the variability of the response variable death rate. In our case, however, the model explains only 29 percent of the death-rate variability. The adjusted R-squared is an attempt to correct for the fact that when an explanatory variable is added to the model, R² will likely increase; it may help users choose a better model among several linear regressions.
The F-statistic is used to test the null hypothesis that the data were generated from a model with an intercept term only. The very small p-value, 2.2e-16, indicates that this is not the case and that the model with three explanatory variables predicts the death rates better than the intercept-only model.
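As a small illustration (not required for the analysis), the F-statistic and its p-value reported by summary() can be reproduced directly:
fs <- summary(death_rate.linear_model)$fstatistic
fs
pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)   # p-value of the overall F test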
Note that the p-value is not the probability that the null hypothesis is wrong. The p-value measures the probability of observing an outcome at least as extreme as the observed one when the null hypothesis is true. Conceptually, the following test procedure is used:
Data consistent with the null hypothesis are generated n times.
The test statistic is calculated both for the observed and for the generated data.
If the observed data are in disagreement with the generated data, the observed value of the test statistic should be unusual, and the proportion of test statistics calculated from the generated data that are at least as extreme as the observed value should be small, say, less than 5 percent. In practice, the linear model lm does not generate a large number of values for comparison but derives the distribution of the test statistic from first principles. Note also that the use of p-values is criticized by Bayesian statisticians.
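A minimal Monte Carlo sketch of this idea (an illustration only; this is not how lm computes its p-values): the permutation p-value for the slope in a simple regression of the death rate on smoking is the proportion of slopes, computed after shuffling the response (data generated under the null hypothesis), that are at least as extreme as the observed slope.
set.seed(1)
obs.slope <- coef(lm(death_rate ~ smok_all, data = us4df))[2]
null.slopes <- replicate(999, coef(lm(sample(us4df$death_rate) ~ us4df$smok_all))[2])
mean(abs(null.slopes) >= abs(obs.slope))   # proportion at least as extreme as the observed slope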
It is possible to make predictions at known and new locations, provided values are available for all explanatory variables, by using the function predict.lm. The following code creates a data frame with the explanatory variables
used in the model death_rate.linear_model and predicts new values pred.interval$fit for each polygon;
prediction standard errors pred.interval$se.fit are also estimated.
p.new <- data.frame(smok_all=us4df$smok_all, AGE15_17=us4df$AGE15_17, AGE40_54=us4df$AGE40_54)
pred.interval <- predict.lm(death_rate.linear_model, p.new, level=0.95, data=us4df, se.fit = TRUE,
weights=us4df$birth, interval="prediction")
Histograms of the predictions and prediction standard errors can be displayed using the commands
hist(pred.interval$fit)
hist(pred.interval$se.fit)
The following commands plot a prediction map. Function pretty computes a sequence of nearly equally
spaced nice values that cover the range of the values in the predictions pred.interval$fit.
brks <- pretty(pred.interval$fit)
cols <- brewer.pal(length(brks)-1, "Oranges")
plot(us4polys, col=cols, forcefill=FALSE, main="Predicted death rates")
legend(c(8.5e5,11e5), c(7e5,9.3e5), fill=cols, legend=leglabs(brks), bty="n", ncol=1, x.intersp=0.9, y.intersp=0.9)
The summary of a fitted regression model gives estimated coefficients and their standard errors. However,
some applications need distributions of the regression coefficients for analyzing the propagation
of uncertainties, as discussed in chapter 5. The following commands produce 500 simulations of the regression
coefficients using function sim.lm from the package arm:
library(arm)
sim.dr <- sim.lm(death_rate.linear_model,500)
Distributions of the coefficients are visualized in four windows (specified by the command
par(mfrow=c(2,2))) in figure A2.9 using the following commands:
par(mfrow=c(2,2))
hist(sim.dr$beta[,1], xlab="Intercept")
hist(sim.dr$beta[,2], xlab="smoking")
hist(sim.dr$beta[,3], xlab="AGE 15-17")
hist(sim.dr$beta[,4], xlab="AGE 40-54")
par(mfrow=c(1,1))
Figure A2.9
REGRESSION DIAGNOSTICS
Regression diagnostics are used to detect when the assumptions of the analysis are invalid and to reveal data
outliers. If the assumptions are violated, another model should be used. If outliers are present in the data, one
may want to remove observations that lead to misinterpretation of the data. The outliers are difficult to
identify automatically in the case of a regression model with many explanatory variables. In practice, several
different techniques are used to find unusual observations.
The standard R library provides a set of functions for computing the regression diagnostics for a linear model:
influence(), influence.measures(), rstandard(), rstudent(), dffits(), dfbetas(), covratio(), cooks.distance(), and
hatvalues(). We illustrate the use of some of these functions below. Additional diagnostic functions are
provided by several add‐on R packages (see assignment 2).
The function rstandard() gives the standardized residuals with unit variance, computed using an overall estimate of the error variance.
The following commands display the standardized residuals versus fitted values and a map of the
standardized residuals in figure A2.10:
stdres_lm <- rstandard(death_rate.linear_model)
plot(death_rate.linear_model$fitted.values, stdres_lm)
abline(h = 0, lwd=1, col="green")
abline(h = 2, lwd=1, col="red")
abline(h = -2, lwd=1, col="red")
lines(lowess(death_rate.linear_model$fitted.values, stdres_lm), col="blue", lwd=2)
windows()
brks <- c(-Inf, -2, -1.5, -1, 1, 1.5, 2, +Inf)
cols <- rev(brewer.pal(7, "PuOr"))
plot(us4polys, col=cols[findInterval(stdres_lm, brks, all.inside=TRUE)], forcefill=FALSE, main="Standardized
residuals, linear fit")
legend(c(8.5e5,11e5), c(7e5,9.3e5), fill=cols, legend=leglabs(brks), bty="n", ncol=1, x.intersp=0.9, y.intersp=0.9)
Raw (nonstandardized) residuals can be calculated using the command
res_lm <- residuals(death_rate.linear_model)
However, the statistical literature states that the raw residuals are not a good diagnostic tool because they are correlated and do not have constant variance, while the standardized residuals do.
Courtesy of the National Atlas of the United States, http://www.nationalatlas.gov.
Figure A2.10
The graph in figure A2.10 at left indicates that our model death_rate.linear_model may not be adequate
because ideally the cloud of residuals should appear as random scatter around the green horizontal line at
zero. The largest residuals are clearly seen when they are displayed on a map shown in figure A2.10 at right.
The standardized residuals can be written out to a shapefile for further GIS analysis using the following
command:
write.polylistShape(us4polys, data.frame(us4df, stdres_lm), file="stdres_lm"),
where data.frame(us4df, stdres_lm) adds the column stdres_lm to the columns stored in the data frame us4df.
A plot of residuals against omitted explanatory variable AGE20_29 (figure A2.11 at left), the rate of women
ages 20–29 who had a child per 1,000 women of that age, and against the used explanatory variable smok_all
(figure A2.11 at right) can be made using the following commands:
plot(us4df$AGE20_29, res_lm)
abline(h = 0, lwd=1, col="green")
abline(h = 2, lwd=1, col="red")
abline(h = -2, lwd=1, col="red")
lines(lowess(us4df$AGE20_29, res_lm), col="blue", lwd=2)
windows()
plot(us4df$smok_all, res_lm)
abline(h = 0, lwd=1, col="green")
abline(h = 2, lwd=1, col="red")
abline(h = -2, lwd=1, col="red")
lines(lowess(us4df$smok_all, res_lm), col="blue", lwd=2)
Figure A2.11
We see that the residuals are correlated with the omitted variable AGE20_29. Therefore, it makes sense to add
it to the model. Figure A2.12 at left shows the summary of a new model with one more explanatory variable,
and figure A2.12 at right displays the residuals against variable AGE20_29. The explanatory variables in this
model are significant (because the p-values Pr(>|t|) are small, less than 0.05), and the dependence between the residuals and the variable AGE20_29 has decreased significantly in comparison with figure A2.11 at left.
Figure A2.12
Since the residual values in the initial model death_rate.linear_model are increasing and then decreasing with increasing values of the variable smok_all, we can try to add a quadratic term smok_all² (or, more generally, an orthogonal polynomial term poly(smok_all,2), where 2 is the order of the polynomial) to the model. This can be done using the following command:
death_rate.linear_model.2smoking <- lm(death_rate ~ poly(smok_all,2) + AGE15_17 + AGE40_54, data=us4df,
weights=birth)
Figure A2.13 at left shows the summary of a new regression model with a polynomial term, and figure A2.13
at right displays the residuals of this model against variable smok_all. This time the explanatory variable
AGE40_54 is not significant, and the dependence between the residuals and the variable smok_all has almost
disappeared. Because the p‐values of the polynomial terms in the model death_rate.linear_model.2smoking are
significant, we can try a cubic term poly(smok_all,3), then the fourth power poly(smok_all,4) and so on until
the p-values of the polynomial terms become insignificant. The disadvantage of using polynomials is that each additional term makes the fitted model harder to interpret.
Courtesy of the National Atlas of the United States, http://www.nationalatlas.gov.
Figure A2.13
A linear model assumes that the errors are distributed normally. This assumption can be tested using the
following commands:
qqnorm(res_lm, ylab = "Residuals")
qqline(res_lm, lwd=1.5, col = "red")
where qqnorm() produces a normal quantile‐quantile plot of the values res_lm. The command qqline() adds a
line to a normal quantile‐quantile plot that passes through the first and third quartiles of the res_lm values. If
you run the command above, you will see that the small values of the residuals of the initial model
death_rate.linear_model are not close to the first‐third quartiles line. We are not presenting that graph here
because it is more instructive to use the Geostatistical Analyst’s quantile‐quantile plot, which is interactively
linked with a map and the data table as shown in figure A2.14.
Courtesy of the National Atlas of the United States, http://www.nationalatlas.gov.
Figure A2.14
One problem with the residuals is that they are sensitive to outliers. Several specific tools indicate how the model is affected by deleting observations. Fortunately, it is not necessary to repeat the model fitting each time we delete an observation, since all the required formulas can be derived algebraically. One such tool is called DFFITS. It measures how the fitted value ŷ_i is affected by deleting the ith observation; DFFITS_i is a standardized measure of that change. The following commands compute and plot the DFFITS values (figure A2.15 at left):
dffits.dr <- dffits(death_rate.linear_model)
plot(us4df$death_rate, dffits.dr)
abline(h = 0, lwd=1, col="blue")
abline(h = 1, lwd=1, col="red")
abline(h = -1, lwd=1, col="red")
Alternatively, the half‐normal plot can be used to highlight unusually large DFFITSi. In the half‐normal plot,
the absolute values of deletion residuals are ordered and plotted against the n largest expected order
statistics from a normal sample of size 2n + 1, where n is the number of observations. The half-normal plot
can be visualized (figure A2.15 at right) using function halfnorm from the package faraway using the
following commands:
library(faraway)
halfnorm(dffits.dr, n=5, ylab = "Sorted dffits.dr")
where option n=5 indicates how many largest values of the dffits.dr variable should be labeled.
Courtesy of the National Atlas of the United States, http://www.nationalatlas.gov.
Figure A2.15
In the half‐normal plot we are not looking for a straight line relationship but for the outlier points located far
from the majority of the points. The five most suspicious points have the identification numbers 277, 288, 113,
247, and 53. The following commands remove observations with absolute values of the dffits larger than 1
from the data frame (dataset) us4df and vector dffits.dr:
dr.out <- subset(us4df, abs(dffits.dr) > 1)
dr.out
dff <- subset(dffits.dr, abs(dffits.dr) > 1)
dff
id    dff    State      County             Birth   Death rate   Death   Smoking   Age 15–17
113   1.43   Alabama    Jefferson County   20858   10.8         226     10.9      36.4
247   1.09   Alabama    Mobile County      13719   10.7         147     15.5      43.6
277   2.65   Louisiana  Orleans Parish     15906    7.9         125      6.2      40.6
288   1.82   Louisiana  Jefferson Parish   14802    7.9         118      7.3      41.2
Courtesy of the National Atlas of the United States, http://www.nationalatlas.gov
Table A2.1
A DFFITS_i value of 1 means that the fitted value ŷ_i will change by one standard error unit if the ith data point is removed. The following commands fit a linear model to the data without the counties with the identification numbers 277, 288, 113, and 247, where the absolute values of DFFITS_i are greater than one:
dr.4out <- subset(us4df, abs(dffits.dr) < 1)
death_rate.linear_model.4out <- lm(death_rate ~ smok_all + AGE15_17 + AGE40_54, data=dr.4out,
weights=birth)
summary(death_rate.linear_model.4out)
The coefficients of the new model are presented in table A2.2. We see that without 4 of the 288 observations, the variable AGE40_54 becomes insignificant in the explanation of the death rates.
The argument subset can be used to fit a linear model with lm() to a subset of observations directly. For example, the observations to be omitted from the fit can be specified using a numeric vector with negative entries:
death_rate.linear_model.sub1 <- lm(death_rate ~ smok_all + AGE15_17 + AGE40_54, data=us4df, weights=birth,
subset=-c(277, 288, 113, 247)),
where subset=-c(277, 288, 113, 247) excludes the suspicious observations 277, 288, 113, and 247. Because we
usually fit several models to the data, it is advisable to handle missing and suspicious data outside of the lm()
function to have better control over the data modeling.
The dffits and several other leave‐one‐out deletion diagnostics are calculated by function influence.measures:
im <- influence.measures(death_rate.linear_model)
The command summary(im) prints a table with potentially influential observations, which is shown in slightly
updated format in table A2.3 with suspicious observations shown in blue (in R, influential values are marked
with an asterisk). The first column id shows the identification number of the observation in the data frame.
The next four columns show the DFBETAS for each model variable. DFBETAS assesses the influence of the ith observation on the jth estimated regression coefficient, measured in units of the coefficient's standard error.
The cov.r statistic measures the change in the determinant of the covariance matrix of the estimates when the ith observation is deleted. The statistical literature suggests that observations with |cov.r − 1| > 3p/n, where p is the number of parameters in the model and n is the number of observations used to fit the model, can be outliers.
Cook's distance D measures the effect of the ith observation on the estimated regression coefficients. According to the statistical literature, observations with cook.d greater than 0.5 may need to be examined.
The hat value measures the leverage of the ith observation (it is the ith diagonal element of the hat matrix used to compute the fitted values). The value of hat is between 0 and 1, with average value p/n. Observations with hat values above the cutoff 2p/n (or 3p/n) should be investigated.
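A minimal sketch of this leverage check for the model fitted earlier (an illustration only; 2p/n is one common cutoff):
h <- hatvalues(death_rate.linear_model)
p <- length(coef(death_rate.linear_model))   # number of estimated parameters
n <- length(h)                               # number of observations used in the fit
mean(h)                                      # approximately p/n
which(h > 2 * p / n)                         # potentially high-leverage observations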
Note that each potentially influential observation is calculated independently from the others. However, in
practice, it is possible that influential values are related. For example, it may happen that there is no single
influential value, but two or three observations together act as an outlier.
A half‐normal plot can be used with each statistic in the table above. For example, the following commands
display dfbeta smok_all and cov.r diagnostics:
par(mfrow=c(1,2))
halfnorm(im$infmat[,2], n=2, ylab = "sorted dfbeta smok_all ")
halfnorm(im$infmat[,6], n=2, ylab = "sorted cov.r")
par(mfrow=c(1,1))
We have fitted three linear models so far: death_rate.linear_model, death_rate.linear_model.2smoking, and death_rate.linear_model.4out. It is possible to compare the models in terms of predictive performance using Akaike's Information Criterion (AIC) with the following command:
AIC(death_rate.linear_model, death_rate.linear_model.2smoking, death_rate.linear_model.4out)
df AIC
death_rate.linear_model 5 1116.315
death_rate.linear_model.2smoking 6 1097.134
death_rate.linear_model.4out 5 1017.405
The best model is the one that has the smallest AIC value; in our case it is a model
death_rate.linear_model.4out.
Many more linear regression model diagnostics can be run in the R environment. However, their usefulness
should not be overestimated because in practice one or more linear regression model assumptions are
violated, and a real problem is to decide how seriously these violations influence the conclusions drawn from
the fitted model.
BEYOND LINEAR MODELS
It is easy to find influential observations with so many available diagnostic techniques. If the number of the
suspicious observations is small, it is a good idea to remove them from the data and then refit the model.
However, the removal of one or several measurements is not a statistical decision. It is based on data quality,
the available theory of the studied phenomena, and the researcher’s intuition and experience.
If there are many influential observations, we can suspect the linear model itself. Figure A2.16 illustrates a situation in which the conditional distribution of the response variable Y, Prob(Y|X), changes with the explanatory variable X. In this case, least squares estimation is not efficient, and the linear regression assumptions are violated because:
The relationship between the response and explanatory variables is not linear.
The center of a bimodal distribution should be represented by two numbers instead of one.
The mean is not a good representation of the distribution's center when the distribution is skewed.
Figure A2.16
Figure A2.17
As discussed in chapters 4 and 11, the distribution of mortality rates in administrative areas is nonstationary and non-Gaussian. The commands below show regression modeling of the infant mortality data using the generalized linear model with a negative binomial distribution (a discussion and examples of the negative binomial distribution can be found in chapters 4 and 6). This model is implemented in the package MASS created by Venables and Ripley. In epidemiological applications, the counts of disease cases (or deaths) are partly explained by the population at risk of having the disease. Therefore, the number of births in the counties can be added to the list of explanatory variables. However, it is useful to fix the regression coefficient for the population term to the value of 1, because then the remaining regression coefficients have a more direct interpretation. This term is called an offset in regression analysis software. In the example below, we use the logarithm of the number of births as the explanatory variable with the regression coefficient fixed at one.
library(MASS)
res_nb <- glm.nb(death ~ offset(log(birth)) + smok_all + AGE15_17 + AGE40_54, data=us4df)
summary(res_nb)
The following commands display the standardized residuals of the generalized linear model with negative
binomial distribution (figure A2.18).
stdres_nb <- rstandard(res_nb)
brks <- c(-Inf, -2, -1.5, -1, 1, 1.5, 2, +Inf)
cols <- rev(brewer.pal(7, "PuOr"))
plot(us4polys, col=cols[findInterval(stdres_nb, brks, all.inside=TRUE)], forcefill=FALSE, main="Standardised
residuals, negative binomial fit")
legend(c(8.5e5,11e5), c(7e5,9.3e5), fill=cols, legend=leglabs(brks), bty="n", ncol=1, x.intersp=0.9, y.intersp=0.9)
The standardized residuals from the generalized linear model can be visually compared with the standardized residuals from the linear model (figure A2.10) using the following commands.
stdres_lm <- rstandard(death_rate.linear_model)
plot(us4polys, col=cols[findInterval(stdres_lm, brks, all.inside=TRUE)], forcefill=FALSE, main="Standardised
residuals, linear model fit")
legend(c(8.5e5,11e5), c(7e5,9.3e5), fill=cols, legend=leglabs(brks), bty="n", ncol=1, x.intersp=0.9, y.intersp=0.9)
Figure A2.18
One of the criteria of good regression modeling states that the standardized residuals from the model should not be smaller than minus one or greater than one. Comparison of the maps in figure A2.18 and figure A2.10 at right shows that the generalized linear model better explains the infant mortality data. Additional examples
of generalized linear model fitting can be found in appendix 4.
Spatial regression models are discussed in chapters 11, 12, 15, 16, and appendixes 3 and 4. In particular,
chapter 16 presents the analysis of similar infant mortality data collected in Georgia using a simultaneous
autoregressive model. It is shown that the results of modeling infant mortality data with spatial and
nonspatial models are different.
Most regression models require substantial knowledge to fit real data successfully with the available software, while some other models work almost as black boxes that even beginners can use easily. One such model, the generalized additive model, is implemented in the R package mgcv created by Simon Wood. The generalized additive model and its close relative, the semiparametric model, are
discussed in chapters 6 and 12.
Figure A2.19 shows houses that were sold in the first half of 2006 in part of the city of Nashua, New
Hampshire. Similar data are provided in the folder Assignment A2.1. Information on housing characteristics,
including house price; age; total area of the lot; house and garage square footage; and number of rooms,
bedrooms, and baths, is available. Similar information about almost all houses in the area, excluding the sale
price, is also available, and it is of interest to predict the house prices using a regression model, this time the
generalized additive model.
The following commands specify the default folder where data are stored (c:\\book_data\\housing) and two
packages, foreign (to read the data in DBF format) and mgcv (to fit a model):
setwd("c:\\book_data\\housing")
library(foreign)
library(mgcv)
The next two commands read the data:
tr <- read.dbf("sales2006_selected.dbf")
te <- read.dbf("ResidentialPropertiesSelected.dbf")
The next two commands create data frames with locations and features of sold and unsold houses:
m_tr <- data.frame(x=tr$x, y=tr$y, sale.price=tr$SalePrice, rooms=tr$NumofRooms, bedrooms=tr$NumofBedro,
age=tr$EffYearBlt, sq.feet=tr$CurrentSF, area=tr$GrossArea, gar=tr$GAR, bath=tr$bath, years=2007-tr$EffYearBlt)
m_te <- data.frame(x=te$X, y=te$Y, rooms=te$NumofRooms, bedrooms=te$NumofBedro, age=te$EffYearBlt,
sq.feet=te$CurrentSF, area=te$GrossArea, gar=te$GAR, years=2007-te$EffYearBlt)
The following three commands fit a model using information on the sale price, x and y‐coordinates of the
houses, house age, gross area, living square footage, and garage area, and display the result of fitting (shown
in figure A2.20):
GAM.model <- gam(sale.price~s(x,y)+s(years)+s(area)+s(sq.feet)+s(gar), data=m_tr, family=gaussian)
GAM.model
plot(GAM.model,pages=1)
The solid lines in the one‐dimensional graphs in figure A2.20 show the smoothed estimated effect of the
explanatory variables with approximate 95 percent confidence limits shown as dashed lines. Zero on the
vertical axis corresponds to no effect of the explanatory variable. The complexity of the smoothing spline is summarized by its associated degrees of freedom (the value in parentheses in the y-axis title): the larger the degrees of freedom, the more complex the smoothing function. Bold contours in the upper left plot show
the estimate of the isotropic smooth function of two explanatory variables, the coordinates x and y. The
nonsolid contours show the smooth function plus (dashed lines) and minus (dotted lines) the standard error
of the smooth function.
The model GAM.model was fitted using the default parameters of the function gam, which hide a number of options. If we suspect that this model does not explain our data well, we may try to change the parameters of
the model. We will not do this in our housing case study, but if the predicted house prices are to be used in the
decision making, the researcher should not be satisfied with predictions based on the default model’s
parameters.
The following command predicts housing prices for houses for which data are available on the same
explanatory variables as those used in the fitted model.
pred.prices <- predict.gam(GAM.model, m_te, se.fit=TRUE)
The estimated predictions and prediction standard errors stored in the variable pred.prices can be exported to DBF format using the following commands:
m_te_pred <- data.frame(x=m_te$x, y=m_te$y, years=m_te$years, area=m_te$area, sq_feet=m_te$sq.feet,
garage=m_te$gar, fit_gaus=pred.prices$fit, err_gaus=pred.prices$se.fit)
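The table itself is written to disk with write.dbf() from the foreign package (loaded earlier); the output file name below is arbitrary:
write.dbf(m_te_pred, "predicted_prices.dbf")   # arbitrary output file name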
Then predictions can be visualized in ArcGIS as shown in figure A2.21.
Figure A2.21
Figure A2.22 shows an enlarged part of the map, with actual house prices as large rectangles and predicted
prices as small circles.
Data courtesy of the City of Nashua, N.H.
Figure A2.22
It is instructive to compare predictions made by the generalized additive model and geographically weighted
regression because housing price analysis is a typical application of the latter model. We did this exercise by
randomly splitting the sales data into training and testing datasets using the Geostatistical Analyst option Create Subsets. The code below can be used for the geographically weighted regression validation exercise
using R package spgwr created by Roger Bivand and Danlin Yu.
# Load the spgwr package
library(spgwr)
# read data
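One possible continuation of the script, sketched under the assumption that the training subset was exported to a DBF file (the file name below is hypothetical) with the same column names as the sales data used above; consult ?gwr.sel and ?gwr for the exact options, including how to request predictions at the validation locations:
library(foreign)
train <- read.dbf("sales2006_training.dbf")      # hypothetical training subset
coords <- cbind(train$x, train$y)
# bandwidth selection by cross-validation, then the geographically weighted fit
bw <- gwr.sel(SalePrice ~ CurrentSF + GrossArea + GAR, data = train, coords = coords)
gwr.fit <- gwr(SalePrice ~ CurrentSF + GrossArea + GAR, data = train, coords = coords,
               bandwidth = bw, hatmatrix = TRUE)
gwr.fit                                           # summary of the locally varying coefficients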
The generalized additive model was fitted and validated using the same data that were used with the
geographically weighted regression above. Predicted and actual prices can be compared using the linear
model. Figure A2.23 shows the validation scatterplots of 116 actual values and their predictions using two
models in Microsoft Excel. The scatterplots and the linear regression fits are shown in red for the generalized
additive model and in green for the geographically weighted regression. Predictions made by the generalized
additive model are more accurate because the estimated coefficient β0 is smaller while the estimated
coefficient β1 and the coefficient R2 are much closer to 1. This is an expected result because the geographically
weighted regression is an exploratory data analysis tool, and prediction is not the primary goal of this tool.
Figure A2.23
RUNNING R SCRIPTS FROM ARCGIS
Predicting housing prices in Nashua using the commands for the generalized additive model presented in the
previous section can be done in ArcGIS. Housing data are available in the folder Assignment A2.1.
The first step is to install Python 2.x and Pythonwin 2.x (we used build 208 from
http://sourceforge.net/project/showfiles.php?group_id=78018) and R DCOM Server from
http://cran.r-project.org/contrib/extra/dcom/. R version 2.2 or higher should be installed as
discussed in the beginning of this appendix.
The second step is writing a python script and creating a toolbox. Katja Krivoruchko from Esri did that and
the script Estimate housing prices (R) is provided with the Geostatistical Analyst samples toolbox described in
appendix 1. Instructions for running the script are the following:
1. Open the Estimate housing prices (R) tool in the Geostatistical Analyst samples toolbox by double‐
clicking the tool.
2. Populate the tool with input data as shown in figure A2.24.
3. Run the tool and convert the output DBF table to the shapefile format using the Make XY Events
Layer tool. The resulting map should be the same as the last two maps in the previous section,
figures A2.21 and A2.22.
Figure A2.24
The next example shows the use of ArcGIS and R functionality together. The task is to compare two point
patterns by dividing their estimated spatial intensities. The R script itself is only a few lines long (see the similar
example in figure 16.23 in chapter 16), but the data preparation and the visualization of the point pattern analysis
results require some effort. These tasks can be done using Python. The Python code was written by Mark
Janikas at Esri.
Suppose we want to compare densities of different species observed in the Atlantic Ocean near the East Coast
of the United States. The data can be found and downloaded from the OBIS‐SEAMAP Web site at http://seamap.env.duke.edu (see reference 6 under "Further reading").
Data courtesy of Cetacean and Turtle Assessment Program, University of Rhode Island Graduate School of Oceanography.
Figure A2.25
Here are instructions for installation of the Point Pattern Analysis Samples Toolbox in ArcGIS 9.3. In this case
study, we used R version 2.6.2. It is recommended to uninstall previous versions of R from the computer.
1. Open R and add the following packages: splancs, foreign, and ncdf from menu Packages ‐> Install
package(s) ‐> Choose a Repository ‐> Choose a Package. Note that dependencies will auto‐install
additional packages.
2. Download and install RPy. Carefully read the instructions provided at
http://rpy.sourceforge.net/download.html . Windows users will need the Win32 Extensions
at http://starship.python.net/crew/mhammond/win32/Downloads.html. Numpy is
installed with ArcGIS 9.3, so no additional installs are required. As noted at the RPy Web site, you will
need the version of RPy that corresponds to ArcGIS 9.3, that is
http://sourceforge.net/projects/rpy/files/rpy/1.0.2/rpy-1.0.2-R-1.3.0-to-
2.6.2-win32-py2.5.exe/download on the SourceForge.net site:
http://sourceforge.net/project/showfiles.php?group_id=48422.
3. Ensure that the PYTHONPATH is set to the ArcGIS Scripts directory. Go to My Computer ‐> Right Click…
Properties ‐> Advanced Tab… Environment Variables ‐> Edit/Add PYTHONPATH for the local and system
variables ‐> Add ArcGIS Home\ArcToolbox\Scripts. If the ArcGIS 9.3 Program exists in Program Files
then add C:\Program Files\ArcGIS\ArcToolbox\Scripts to the PYTHONPATH. It is OK to have more than
one folder in the PYTHONPATH; just make sure that you delineate each with a semicolon. The final
Python Path could look like this: C:\Program Files\ArcGIS\ArcToolbox\Scripts; C:\Program
Files\ArcGIS\Bin.
5. Add the Point Pattern Analysis Samples Toolbox. Open ArcMap9.3; then right click ArcToolbox; then click
Add Toolbox; then choose the location where downloaded Point Pattern Analysis Samples.tbx is stored.
6. Change the source of the script tools. Open the Point Pattern Analysis Samples Toolbox, right‐click each
tool, and choose Properties. Under the Source tab, make sure that the associated path to the Python files
is correct. Do this for DensityRatio.py and SplitFC.py files.
Figure A2.26 at top left shows the Split Feature Class By Attribute geoprocessing tool, while figure A2.26 at
bottom left shows part of the output from the tool, with information on the created shapefiles, whose names
combine the prefix “fish_” with the scientific names of the species.
Figure A2.26 at right shows the Density Ratio geoprocessing tool. It requires specification of two shapefiles
with points, the name of the output raster, and information about the shape of the study area. We selected the
Balaenoptera and Balaenoptera acutorostrata species. Balaenoptera is the largest genus of the Rorqual
whales, containing eight species, and Balaenoptera acutorostrata is the scientific name for the Minke whale.
Accurate specification of the study area (the data domain) is important because point pattern analysis models
usually assume that the area where the events are observed is known. The Density Ratio geoprocessing tool
offers four options for specifying the data domain. We used the convex hull option, although the user‐
specified polygon option should always be preferred if there is enough information about the study area, as there
is in this case (the locations of the green points in figure A2.25).
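For reference, the convex hull of a set of points is easy to compute directly in R; a small sketch, assuming xy is a two-column matrix of event coordinates:
# indices of the hull vertices are returned by chull(); the resulting matrix of vertex
# coordinates can serve as a simple study-area polygon
hull <- xy[chull(xy), ]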
Figure A2.26
Permission obtained for Python Script files from Mark Janikas, Esri.
Data courtesy of Cetacean and Turtle Assessment Program, University of Rhode Island Graduate School of Oceanography.
Figure A2.27
The Density Ratio geoprocessing tool is user‐friendly, and it could even be used for testing the splancs package
function kernrat.
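A sketch of calling kernrat directly in R is shown below. Here pts1 and pts2 are two-column coordinate matrices for the two species, poly is the study-area polygon (for example, the convex hull computed above), the bandwidths h1 and h2 are purely illustrative, and the returned grid is assumed to have the same x, y, z components as the output of splancs::kernel2d.
library(splancs)
# ratio of the two kernel intensity estimates over a grid covering poly
rat <- kernrat(pts1, pts2, poly, h1=5000, h2=5000)
image(rat$x, rat$y, rat$z, asp=1)   # map of the estimated intensity ratio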
ASSIGNMENTS
1) REPEAT THE REGRESSION ANALYSIS OF DATA ON INFANT MORTALITY AND
HOUSING PRICES.
Repeat the linear regression analysis of infant mortality data collected in four southern states and housing
price data from Nashua, New Hampshire, using the generalized additive model.
The data are in the folder Assignment A2.1
2) VERIFY THE ASSUMPTIONS OF THE LINEAR REGRESSION MODEL.
The R package lmtest (http://cran.r-project.org/web/packages/lmtest/index.html)
provides a collection of diagnostic tests for various deviations from the assumptions of the linear regression
model. Install this package and learn from its examples about advanced linear regression model diagnostics.
FURTHER READING
1. “Contributed documentation” at the R Web site http://cran.r-project.org/.
This site contains a number of tutorials provided by users of R. They usually assume that readers already
know why statistical functions are implemented in the software.
2a. Faraway, J. 2005. Linear Models with R. Chapman & Hall/CRC, 230 pp.
This book is a practical guide to linear regression modeling in R. The author presents a large number of short
examples of the function lm() usage. However, he assumes that readers know the essentials of statistical
inference. A draft version of this book is available online at http://cran.r-
project.org/doc/contrib/Faraway-PRA.pdf.
2b. Faraway, J. 2006. Extending the Linear Model with R. Chapman & Hall/CRC, 301 pp.
This book presents a large number of examples of generalized linear models, mixed effects models, and
nonparametric regression models usage with R packages. However, as the author mentioned in the preface,
“the breadth comes at the expense of some depth.”
3. Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth Edition. Springer.
The authors work through a variety of major topics in statistics and demonstrate the use of R with sample
code and datasets. As stated by the authors, almost all the material in the book is covered in the
Master of Science degree in Applied Statistics at Oxford University, including the following topics: linear
models, generalized linear models, nonlinear and smooth regression, tree‐based methods, random and mixed
effect models, multivariate analysis, classification, survival analysis, time series analysis, and spatial statistics.
4. Schabenberger, O., and F. J. Pierce. 2002. Contemporary Statistical Models for the Plant and Soil Sciences.
Boca Raton, Fla.: CRC Press. 738 pp.
This book discusses theory of various regression models, including linear, from the statistician’s point of
view. All case studies in this book are performed using SAS software. It would be interesting to reproduce the
authors’ analyses in R.
5. A guide to the resources for the analysis of spatial data using R can be found at http://cran.r-
project.org/web/views/Spatial.html.
6. Read, A. J., P. N. Halpin, L. B. Crowder, K. D. Hyrenbach, B. D. Best, S. A. Freeman, editors. 2003. “OBIS‐
SEAMAP: Mapping Marine Mammals, Birds, and Turtles.” World Wide Web electronic publication.
http://seamap.env.duke.edu, accessed 03/28/2006.
This paper discusses the Spatial Ecological Analysis of Marine Megavertebrate Animal Populations (SEAMAP)
initiative: the geoinformatics facility of the Ocean Biogeographic Information System (OBIS) network. OBIS‐
SEAMAP has developed an expanding geo‐database of marine mammal, seabird, and sea turtle distribution
and abundance data. The OBIS‐SEAMAP information system supports research into the ecology and
management of these important marine megavertebrates and augments public understanding of the ecology
of marine megavertebrates. A dataset with whale locations presented in the last section of this chapter was
downloaded from the OBIS‐SEAMAP information system.
INTRODUCTION TO BAYESIAN
MODELING USING WINBUGS
THE REASONS FOR USING BAYESIAN MODELING
BAYESIAN REGRESSION ANALYSIS OF HOUSING DATA
MULTILEVEL BAYESIAN MODELING
BAYESIAN CONDITIONAL GEOSTATISTICAL SIMULATIONS
REGIONAL BAYESIAN ANALYSIS OF THYROID CANCER IN CHILDREN IN
BELARUS
ASSIGNMENTS
1) REPEAT THE CASE STUDIES PRESENTED IN THIS APPENDIX
2) INTERPOLATE THE PRECIPITATION DATA USING GAUSSIAN AND GAMMA
BAYESIAN KRIGING.
3) PERFORM THE BAYESIAN ANALYSIS OF WEEDS DATA.
4) VERIFY THE CLASSIFICATION OF THE HAPPIEST COUNTRIES IN EUROPE.
5) BAYESIAN RANDOM COEFFICIENT MODELING OF THE CRIME DATA.
6) BAYESIAN SPATIAL FACTOR ANALYSIS OF THE CRIME DATA
FURTHER READING
Most statistical models discussed in this book assume that unknown parameters such as regression
coefficients or the semivariogram model parameters are fixed constants. For example,
conventional simple kriging assumes that the mean value or the mean surface is known exactly.
conventional simple kriging assumes that the mean value or the mean surface is known exactly.
However, in Geostatistical Analyst, the mean surface is usually estimated using local polynomial interpolation,
and it is possible to estimate the uncertainty of this surface. Figure A3.1 shows the estimated mean surface
for the Geostatistical Analyst 9.2 tutorial data (left) and its prediction standard error (right) calculated using
a modification of the geographically weighted regression discussed in chapter 12 (the prediction standard
error option for this model will be available in the ArcGIS version after 9.3). The trend surface in figure A3.1
at left is not known exactly, and it is uncertain in different ways in different areas. Kriging on residuals
ignores this uncertainty, which leads to an underestimation of the kriging prediction standard errors; moreover, this
underestimation differs in different parts of the data domain. The Bayesian version of simple kriging can
take into account the uncertainty of the mean surface. Note that many geostatisticians believe that constant
parameter values are preferable because they represent the researcher’s modeling decision, which is based
on thorough exploratory data analysis and the researcher’s experience in semivariogram modeling, so that
the chosen model is the best and, therefore, modeling additional uncertainty is not required.
Data from California Ambient Air Quality Data CD, 1980–2000, December 2000.
Courtesy of The California Air Resources Board, Planning and Technical Support Division, Air Quality Data Branch; U.S. Geological Survey, EROS Data Center.
Figure A3.1
The concept of a priori information usage in cognition has deep philosophical roots. It was an essential part of
Plato’s theory of knowledge, and it is a cornerstone of Immanuel Kant’s philosophy. In the Critique of Pure
Reason, Kant explains the a priori contributions of the mind to experience. According to Kant, “accidental
observations, made according to no previously designed plan, can never connect up into a necessary law,
which is yet what reason seeks and requires.” Although cognition depends on observations temporally, it is
not necessarily dependent on the data logically. It follows that the principle of causality cannot be based on
the observed data alone. Kant noted that geometry, physics, and chemistry became sciences when
researchers stopped measuring objects to discover their properties and began using methods based on causal
principles.
Kant showed that both theoretical and practical knowledge have metaphysical parts. The metaphysics of
knowledge consists of a priori rules originating in reason alone. Kant argues that empirical knowledge depends
jointly on what exists independently of us (data) and on our nature as subjects. Kant describes the deduction
of both objective and subjective a priori concepts. The former demonstrates the applicability of a priori
concepts to objects of experience, and the latter explains how a priori representations arise from subjective
cognitive processes.
Finding essential prior information can be more difficult than collecting the data. Ignoring relevant prior
information in data analysis can be very costly.
Technically, Bayes’ rule can be considered as a particular formalization of Plato’s and Kant’s philosophical
methods. Readers who need an inspiration for appreciation of Bayesian statistics may want to read their
books (or books about their philosophies, which is easier).
The a priori statements are subjective because they are summaries of the researcher’s knowledge about the
model. Even if the prior distribution is uniform (a so‐called flat prior), it is a subjective choice because the
researcher believes that the parameter can take any value with the same probability; another researcher
may use a different prior distribution.
Most Bayesian models require a large number of Monte Carlo simulations. Samples generated from the
posterior distribution are used to estimate the quantities of interest.
Bayesian methods have become more popular in applied statistics as a result of the recent increase in speed
of the computer‐intensive Markov Chain Monte Carlo calculations. According to a list maintained by the
Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/web/views/Bayesian.html,
about 40 R packages were devoted to Bayesian statistics as of 2008. Bayesian methods have been applied in such
areas as genetics, finance, social and political sciences, medicine, ecology, epidemiology, archaeology, geology,
and environmental sciences. Nowadays, Bayesian statistics is so popular that the word “non‐Bayesian” is sometimes used to describe all other approaches to statistical inference.
Bayesian modeling is briefly discussed in the section of chapter 5 called “Bayesian belief network.” An
example of Bayesian kriging is presented in the section of chapter 9 called “Kriging with varying model
parameters: Sensitivity analysis and Bayesian predictions.”
Bayesian models have advantages and disadvantages. The advantages can be summarized as follows:
Like the geostatistical conditional simulation discussed in chapter 10, the Bayesian simulation
approach can answer scientific questions that a single point estimate cannot.
Bayes’ formula provides a natural way of combining relevant prior information with data. For
example, when new observations become available, the previous posterior distribution can be used
as a priori distribution, meaning that we can learn from experience. In contrast, conventional
statistical models use only information contained in the data. In Bayesian statistics, data are used to
falsify hypotheses rather than to verify them as in classical (frequentist) statistics.
Small datasets are analyzed in the same manner as large datasets, nonlinear models have the same
structure as linear models, and normally distributed data are analyzed similarly to data that follow
other theoretical distributions.
Bayesian software, such as WinBUGS, allows nonexperts in statistics to fit models of high complexity,
including nonlinear functions of mean and variance parameters.
For nonstatisticians, the results of Bayesian analysis can be more easily understood than the results
of classical statistical analysis. For example, posterior probability (such as “there is an 83 percent
probability that the increase of disease is greater than 8 percent in the contaminated areas”) and
credible interval (central area that contains 80, 90, or 95 percent of the posterior distribution) may
be more understandable than p‐values. The usage of p‐values is criticized by Bayesian statisticians
because a) unobserved data influence the result of null hypothesis testing (the p‐value is influenced
by the values that are more extreme than the observed data) and b) evidence in support of the
alternative hypothesis is ignored. More importantly, there are many problems with actual p‐value
usage. For example, some practitioners believe that the p‐value is the probability that the null
hypothesis is true.
Statistics calculated from simulations often can be made sufficiently accurate by increasing the
number of simulations, while classical statistics can be less accurate because they are based on
large‐sample and distributional assumptions that are not necessarily valid.
The main disadvantages of Bayesian models are the following:
The output from Bayesian models consists of many simulations, and a major obstacle to using Bayesian
methods in spatial statistics is the computational burden when the number of observations is large.
While sample sizes of a few hundred are usually not a challenge, processing datasets with thousands
of observations can be very time‐consuming.
Bayesian models are often complex; therefore, a good understanding of the model
characteristics and good model diagnostics are even more important than in classical statistics.
Posterior distribution may be heavily influenced by the prior distributions, but there are only general
rules as to how to select a prior. Specifying model parameter uncertainties as probability
distributions allows developing models that always produce some probability distribution of
possible values of interest, even if the model does not make physical sense. Researchers with
different prior information may produce different results using the same model. Depending on the
situation, this can be an advantage or a disadvantage. However, in practice, researchers are usually
using models and priors recommended in the literature, which minimizes the danger of producing unreasonable results.
The above statement, that WinBUGS software allows fitting models of high complexity, requires some
explanation. In the section called “Bayesian belief network” in chapter 5, we discussed an example of a
Bayesian model for estimating bird habitat. The model uses Bayes’ theorem to go from the prior probabilities
for several spatial variables that influence bird habitat to the posterior probability. Now we consider the top
left part of the bird habitat model (figure A3.2) from a computational point of view.
This figure shows a directed acyclic graph. It is directed because each link between graph nodes is an arrow,
and it is acyclic because it is impossible to return to a node after leaving it. All nodes are equal in the sense
that each is considered a random quantity. Nodes A1, A2, and A5 are founders, as they have no parents. A1 and
A2 are parents of A7; A7 and A5 are parents of A9. Nodes A1, A2, and A5 are marginally independent; A9 is
conditionally independent given A7 and A5. The full joint distribution of the set of quantities can be expressed
using parent‐child conditional relationships (see chapter 5) as
p(A1, A2, A7, A5, A9) = p(A1)×p(A2) ×p(A7| A1,A2) ×p(A5) × p(A9| A7,A5)
Two nodes without common parents are independent only in the absence of observed common descendants: after observing A9, a
dependency between A7 and A5, and also between A1 and A2, is induced.
Figure A3.2
A directed acyclic graph describes the joint relationship between all quantities in a model through a series of
local relationships. A crucial assumption is that, conditional on its parent nodes, each node is independent of
all other nodes in the graph except of its own children. In other words, the conditional model takes into
consideration one unknown variable at a time, pretending that the other variables are given. For example, the
full conditional distribution p(A7 | A1, A2, A5, A9) is proportional to p(A7 | A1, A2) × p(A9 | A7, A5), that is, to
p(A7 | parents(A7)) multiplied by the terms involving the children of A7. This assumption is the basis for the WinBUGS computational
method called Gibbs sampling.
For a model with parameters θ1, θ2, …, θn, the full conditional distributions are
[θ1 | θ2, θ3, …, θn],
[θ2 | θ1, θ3, …, θn],
…
[θn | θ1, θ2, …, θn−1].
The Gibbs sampling algorithm consists of the following steps:
1) Choose initial values θ1(0), θ2(0), …, θn(0) for all parameters.
2) Sample θ1(t) from [θ1 | θ2(t−1), θ3(t−1), …, θn(t−1)],
sample θ2(t) from [θ2 | θ1(t), θ3(t−1), …, θn(t−1)],
…
sample θn(t) from [θn | θ1(t), θ2(t), …, θn−1(t)].
3) Repeat step 2 many times.
The Gibbs sampling algorithm can work when the number of parameters in the model is very large. According
to the theory, draws from this algorithm eventually converge to the joint posterior distribution of the
parameters θi if the model is formulated correctly. The only problem is deciding when to stop the algorithm.
This is a problem because there is no way to prove convergence. Instead, it is only possible to detect the lack
of convergence. If the model does not converge, it is recommended to reformulate the model, usually making
it simpler.
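A toy illustration of these steps (not from the book) is a Gibbs sampler for a bivariate normal distribution with correlation rho, whose full conditionals are known in closed form:
# theta1 | theta2 ~ N(rho*theta2, 1-rho^2) and theta2 | theta1 ~ N(rho*theta1, 1-rho^2)
rho <- 0.8
n.iter <- 5000
theta <- matrix(0, n.iter, 2)        # step 1: initial values theta1 = theta2 = 0
for (t in 2:n.iter) {                # step 2: sample each parameter from its full conditional
  theta[t, 1] <- rnorm(1, rho * theta[t - 1, 2], sqrt(1 - rho^2))
  theta[t, 2] <- rnorm(1, rho * theta[t, 1],     sqrt(1 - rho^2))
}                                    # step 3: the loop repeats step 2 many times
cor(theta[1001:n.iter, ])            # after a burn-in, draws approximate the joint distribution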
A very large number of models can be formulated using the concept of conditional probability. We will
discuss several such models in this appendix and many more examples can be found in the literature. All
results will be obtained using WinBUGS (Bayesian inference Using Gibbs Sampling) software available for
free at http://www.mrc-bsu.cam.ac.uk/bugs/.
Using WinBUGS, the user must specify the model and load data and initial values for model parameters. Then
the model can be run, and the results for the parameters and predictions can be reviewed and visualized.
However, from a software developer’s point of view, the WinBUGS program is amateurish and, at least for
beginners, it is more convenient to run WinBUGS from R using the package R2WinBUGS.
The classical approach to data modeling starts with data, then applies different tools and models to the data,
and finally reports predictions and prediction standard errors. In contrast, Bayesian analysis starts with a
model specification, supplies prior distributions for the parameters and the observed data, and finally uses a particular generator of
simulations to obtain inference about the quantities of interest. Therefore, Bayesian modeling requires more
preparation before the analysis is started. However, this preparation forces the researcher to think more carefully about the data and the model.
Note that there is a correspondence between Bayesian and classical models if prior distributions are very
uncertain (so‐called noninformative priors; see below). In this case, Bayesian software works as a powerful
generator of conditional simulations (more precisely, filtered conditional simulations because data are
usually not reproduced exactly). Therefore, it is advisable not to use WinBUGS if similar results can be
produced by an accessible non‐Bayesian software package.
The main learning objectives of this appendix are:
Stimulating an interest in using Bayesian methods
Providing examples of spatial Bayesian models that can be adapted for the analysis of similar
geostatistical and regional data
Before reading this appendix further, please install the WinBUGS software following instructions at
http://www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml, read or reread appendix 2,
and copy data from the folder Assignment A3.1 to your working directory to reproduce the case studies
presented below.
Note that WinBUGS, just like R packages, is not commercially validated, and for some organizations this can
be a problem, although there are many researchers who believe that the software is reliable.
BAYESIAN REGRESSION ANALYSIS OF HOUSING DATA
Figure A3.3 shows the locations of 327 sold (red) and 347 unsold (green) houses in the city of Nashua, New
Hampshire. Information about housing characteristics is available, and we will use it for predicting housing
prices. We divided the region into eight subregions based on the road network configuration displayed with
red lines. We assume that living conditions can be different in each subregion.
Data courtesy of City of Nashua, N.H.
Figure A3.3
setwd("e:\\book_data\\bugs_housing")
library(foreign)
tr <- read.dbf("sel_houses.dbf")
ts <- read.dbf("sel_sold_houses.dbf")
A list of variables in both datasets is identical. It can be displayed using command names(tr):
[1] "SalePrice" "NumofRooms" "NumofBedro" "GAR" "x" "y" "sold"
[8] "age" "SF" "num"
Another subset of housing data collected in the city of Nashua was analyzed in appendix 2, and here we
briefly explain the variables with abbreviations above:
SalePrice—sale price; zero values are assigned to unsold houses
NumofRooms—number of rooms
NumofBedro—number of bedrooms
GAR—size of the garage; zero means that there is no garage in the house
x—x‐coordinate
y—y‐coordinate
sold—indicator variable showing whether house was sold or not
age—age of the house
SF—square footage
num—the subregion’s number (from 1 to 8)
The next commands create the data frames data.sold and data.all. Note that we give new names to some variables,
create an indicator variable for garage existence, and assign a “no data” value to the sale price variable for
houses that are not sold.
data.sold <- data.frame(x=ts$x, y=ts$y, sale.price=ts$SalePrice, age=ts$age, sq.feet=ts$SF,
rooms=ts$NumofRooms, bedrooms=ts$NumofBedro, gar=ts$GAR, region=ts$num)
for(i in 1:length(data.sold$sale.price)) {
if(data.sold$gar[i] > 0) data.sold$gar[i] <- 1
}
data.all <- data.frame(x=tr$x, y=tr$y, sale.price=tr$SalePrice, age=tr$age, sq.feet=tr$SF,
rooms=tr$NumofRooms, bedrooms=tr$NumofBedro, gar=tr$GAR, sold=tr$sold, region=tr$num)
i.all <- length(data.all$sale.price)
for(i in 1:i.all) {
if(data.all$sale.price[i] < 1) data.all$sale.price[i] <- NA
if(data.all$gar[i] > 0) data.all$gar[i] <- 1
}
The command round(cor(data.sold), 3) calculates correlations between house features:
Figure A3.4
The next commands fit the linear regression model to the data and display the summary of the estimated regression
coefficients:
linear.model1 <- lm(sale.price ~ sq.feet + age + rooms + bedrooms + gar, data=data.all)
summary(linear.model1)
The summary is shown in figure A3.5.
Figure A3.5
According to the linear regression model, all variables except the number of bedrooms are significant in
explaining housing prices. There are several ways to explore the multicollinearity problem; here we will
calculate the variance inflation factor (VIF) using the vif() function from the R package faraway:
library(faraway)
xx <- model.matrix(linear.model1)[,-1]
vif(xx)
The result of calculations is shown in the table below. The variance inflation factor for the ith independent variable is defined as
VIFi = 1/(1 − Ri2),
where Ri2 is the coefficient of determination obtained from regressing the ith independent variable (say, age)
on the remaining independent variables (say, square feet, rooms, bedrooms, and garages). Ideally, the variance
inflation factor should be equal to one. As a rule, variance inflation factors greater than 10 indicate a
collinearity problem. Values greater than 2 indicate some problem with the independent variables rooms and
bedrooms, meaning that the estimated regression coefficient values may be distorted.
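The variance inflation factor can also be computed directly from its definition; a short sketch for the variable rooms, using the data frame data.all created above:
# R-squared from regressing rooms on the remaining predictors
r2.rooms <- summary(lm(rooms ~ sq.feet + age + bedrooms + gar, data = data.all))$r.squared
1 / (1 - r2.rooms)   # should be close to the value reported by vif()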
The following commands predict sale prices of the sold houses and create a graph of the actual house prices
versus the predicted ones. The graph is shown in figure A3.6 at left. The red line represents a linear fit and the
green line is a locally weighted polynomial regression (lowess). Description of the function lowess can be
displayed in R using command ?lowess.
pred.lm1 <- predict.lm(linear.model1, data.all, level=0.95, data=data.all, se.fit = TRUE, interval="prediction")
plot(data.all$sale.price[1:327], pred.lm1$fit[1:327], xlab="actual sale price", ylab="predicted sale price")
lines(lowess(data.all$sale.price[1:327], pred.lm1$fit[1:327]), col="green", lwd=2)
abline(lm(data.all$sale.price[1:327] ~ pred.lm1$fit[1:327]), col="red", lwd=2)
Data courtesy of City of Nashua, N.H.
Figure A3.6
These predictions certainly could not satisfy us since standard deviation of the residuals calculated using
command
sd(data.all$sale.price[1:327] - pred.lm1$fit[1:327])
is equal to 43535.44 dollars (any reasonable person can predict the house price values more accurately using
information on nearby houses). We will try to improve the prediction performance using the WinBUGS
program (note that we did a similar exercise using spatial statistical models in appendix 2).
WinBUGS cannot automatically transform data as some other statistical software does. Therefore, it is the user’s
responsibility to avoid very large data values, and we will use standardized data. The following commands
create two data frames of housing data using the function scale(), which subtracts the mean value from each datum
and divides the difference by the data standard deviation.
data.sold <- data.frame(x=scale(ts$x), y=scale(ts$y), sale.price=scale(ts$SalePrice), age=scale(ts$age),
sq.feet=scale(ts$SF), rooms=scale(ts$NumofRooms), bedrooms=scale(ts$NumofBedro), gar=ts$GAR,
region=ts$num)
i.sold <- length(data.sold$sale.price)
for(i in 1:i.sold) {
if(data.sold$gar[i] > 0) data.sold$gar[i] <- 1
}
data.all <- data.frame(x=scale(tr$x), y=scale(tr$y), sale.price=tr$SalePrice, age=scale(tr$age),
The following command runs a linear model with standardized variables.
linear.model1 < lm(sale.price ~ sq.feet + age + rooms + bedrooms + gar, data=data.all)
Inference from the linear model with standardized variables is the same as in the case of linear regression
with data in the original scale as can be seen from the linear.model1 summary displayed in table A3.2 and in
cross‐validation graph in figure A3.6 at right.
A Bayesian version of the linear regression model above can be written in WinBUGS as follows (note that like
R, the WinBUGS model is case sensitive).
model
{
for (i in 1:n)
{
price[i] ~ dnorm(mu[i], tau);
mu[i] <- a0 + a1*sq.feet[i]+a2*age[i]+a3*rooms[i]+a4*bedrooms[i]+a5*gar[i]
}
# Priors
a0 ~ dnorm(0,0.0001);
a1 ~ dnorm(0,tau.alpha)
a2 ~ dnorm(0,tau.alpha)
a3 ~ dnorm(0,tau.alpha)
a4 ~ dnorm(0,tau.alpha)
a5 ~ dnorm(0,tau.alpha)
tau ~ dgamma(1,0.0001);
tau.alpha <- k*tau
# ridge factor
k ~ dgamma(1,1)
}
The model above should be saved to a file. We saved it to the file “housing_model1.bug.”
In the model above, housing prices are assumed to follow a normal distribution, price[i] ~ dnorm(mu[i], tau), where tau is the
precision (the inverse variance), although a gamma distribution can be a better choice (see the section “Use of gamma distribution for
modeling positive continuous data” in chapter 4). The mean value is modeled as a linear combination of the same
explanatory variables as in the classical linear model above.
Next, prior distributions for each coefficient are defined. Again, we use the normal distribution. Note that the prior
variance is very large: we expect the coefficient a0 to be roughly in the range (−100, 100) because the precision 0.0001
corresponds to a standard deviation of 1/√0.0001 = 100,
meaning that the prior distribution does not provide much additional information. This type of prior distribution
is called noninformative because the range of allowed coefficient values includes practically all reasonable
values. Sometimes the normal distribution dnorm(0, 0), which is flat over the entire number line, is also used.
This distribution is called improper because the area under the distribution curve is infinite (instead of
being equal to 1). We define the variance for the regression coefficients a1–a5 as a function of the ridge factor. (There is
also a generalized ridge regression with different ridge factors for different predictors. It can be programmed
in WinBUGS by specifying different ridge factors k1–k5 for the coefficients a1–a5.)
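Because WinBUGS parameterizes the normal distribution by the precision (the inverse variance), it is easy to check in R what a prior such as dnorm(0, 0.0001) implies for a coefficient; a short sketch:
tau0 <- 0.0001
sd0 <- 1 / sqrt(tau0)           # prior standard deviation = 100
qnorm(c(0.025, 0.975), 0, sd0)  # roughly (-196, 196): practically noninformative for standardized data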
Figure A3.7
Figure A3.8
Distributions available in WinBUGS are listed in table I of the WinBUGS user manual. Detailed discussion of
prior distributions can be found in reference 3 under “Further reading.”
Noninformative and improper prior distributions are used when little is known beyond the observed data.
One way to choose an informative prior is illustrated by the following example. Suppose we want to predict the
price of an existing house, and we know nothing about it except its approximate location, something like
“near Diamond Hill golf course.” Our intuition tells us that the price should be around $400,000, and this belief could be described by the mixture distribution
0.95*normal(400,000; 120,000) + 0.05*uniform(50,000; 1,500,000)
This prior distribution is displayed in figure A3.9 at left.
Figure A3.9
It is also possible to add the unknown weight w to the model
w*normal(400,000; 120,000) + (1 − w)*uniform(50,000; 1,500,000)
and interpret the unknown random variable w as the assessed probability that distribution normal(400,000;
120,000) is correctly describing the data.
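The shape of such a mixture prior is easy to examine by simulation in R; a short sketch using the weights and parameters given above:
# draw from the mixture: with probability 0.95 from the normal component, otherwise from the uniform
n <- 100000
ind <- rbinom(n, 1, 0.95)
prior <- ind * rnorm(n, 400000, 120000) + (1 - ind) * runif(n, 50000, 1500000)
plot(density(prior), main = "Mixture prior for the house price")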
Alternatively, gamma distribution can be used instead of normal distribution because it does not allow for
negative values. For example, if we assume that mean value is equal to $400,000 and standard deviation
equals $100,000, we can solve the following system of equations to find the parameters scale and
shape:
scale/shape = 400,000
scale/shape^2 = 100,000,
which gives the values 1,600,000 and 4 for the parameters scale and shape. This distribution is shown in
figure A3.9 at right.
We return to the preparations for running a Bayesian version of the linear regression model for housing
prices.
The initial values for the model parameters can be specified using the command
model.inits <- function() {list(a0 = 0, a1=0, a2=0, a3=0, a4=0, a5=0, tau= 0.01, k=0.1)}
The Gibbs sampling should work long enough to “forget” the initial values. Therefore, the choice of initial
values is not important, although they should not be arbitrarily large or small for numerical efficiency
reasons. Note, however, that we do not use prior information on the regression coefficients (estimated by the
linear model linear.model1) for the initial values to prevent immediate convergence to possibly nonoptimal
values.
Data for the WinBUGS model should be stored in the form of a list, with list names equal to the names of data
in the corresponding model. The following command does the job:
housing.data <- list(n=i.all,
price=data.all[,"sale.price"],
sq.feet=data.all[,"sq.feet"],
age=data.all[,"age"],
rooms=data.all[,"rooms"],
bedrooms=data.all[,"bedrooms"],
gar=data.all[,"gar"])
In WinBUGS, data can have NA (“no data”) values. In this case, they are treated as unknown parameters and
these parameters are estimated by the software. This is why we assign NA values to the sale price variable
data.all$sale.price for unsold houses. However, explanatory variables should be specified for each datum to
make predictions possible. An interesting feature of the WinBUGS model is that modeled data (in our case,
housing prices) may not be provided at all. In this case, they are specified by distribution (in our case, the
normal distribution dnorm(mu[i], tau)). Then WinBUGS produces unconditional simulations (see the
discussion on conditional and unconditional simulations in chapter 10).
Finally, it is necessary to specify which parameters should be monitored. The following command creates a
list of such parameters for our linear model:
model.parameters <- c("a0", "a1", "a2", "a3", "a4", "a5", "tau", "k", "mu")
The following commands load package R2WinBUGS and run WinBUGS using function bugs():
library(R2WinBUGS)
housing.1 <- bugs(housing.data, model.inits, model.parameters, "housing_model1.bug", n.chains=1,
n.iter=100000, debug=TRUE, bugs.directory = "e:/program/WinBUGS14/")
The function bugs() is a wrapper for several other functions. It writes the data files data.txt and stores initial
values in files inits1.txt, inits2.txt and others into the R working directory (in this example, the folder
e:\\book_data\\bugs_housing). Then these files are used by WinBUGS during batch processing, the text for
which is created and stored in the file script.txt. If the argument inits is not specified, WinBUGS generates the
starting values itself. It is better to specify the initial values to have better control over the
modeling process.
It is possible to run one or more simulation processes (Markov chains) simultaneously. In this case, the
number of Markov chains n.chains must be specified together with the chains’ initial values in function
model.inits(). Although the literature suggests using several chains, in this introduction we will always use
only one chain because the models discussed are relatively simple, and the chance that one chain is
insufficient is very low. When more than one chain is used, the initial values are usually selected from
different ends of an overdispersed distribution.
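A hypothetical sketch of running three chains with dispersed random starting values (not used in the book's example, which runs a single chain) could look like this; the inits function is called once per chain by bugs():
model.inits.3 <- function() {
  list(a0 = rnorm(1, 0, 10), a1 = 0, a2 = 0, a3 = 0, a4 = 0, a5 = 0,
       tau = runif(1, 0.001, 1), k = runif(1, 0.01, 1))
}
housing.1c <- bugs(housing.data, model.inits.3, model.parameters, "housing_model1.bug",
                   n.chains = 3, n.iter = 100000, bugs.directory = "e:/program/WinBUGS14/")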
Figure A3.10
The autocorrelation at lag h of the simulated values θi(1), θi(2), …, θi(n) of a parameter θi is estimated by the following formulas:
ρ̂(h) = Ĉ(h)/Ĉ(0),   Ĉ(h) = (1/(n − h)) Σt=1..n−h (θi(t+h) − θ̄i)(θi(t) − θ̄i),   0 ≤ h < n,
where n is the number of samples in the chain and θ̄i is their mean. High correlations at long lags indicate poor convergence.
This is not so in our case: starting from lag 2, the autocorrelation is negligible.
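These autocorrelations can also be computed directly in R from the bugs() output; a short sketch for the coefficient a1 (sims.list is a standard component of the object returned by bugs()):
# lag autocorrelations of the simulated values of a1
a1.chain <- housing.1$sims.list$a1
acf(a1.chain, lag.max = 30, main = "Autocorrelation of the a1 chain")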
Figure A3.11
Figure A3.12
The R package boa provides a variety of other diagnostics including (see reference 8 in “Further reading”):
Summary statistics
Convergence diagnostics such as the Gelman‐Rubin, Geweke, and Raftery‐Lewis tests that indicate
nonconvergence of the Markov chain samples
Auto‐ and cross‐correlations of the simulated model’s parameters
A variety of plots including running means and lag correlations shown above
Interested readers should consult the boa manual for information on how to display additional convergence
diagnostics.
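The coda package provides similar diagnostics and has a simple programmatic interface; a short sketch applied to the regression coefficients from the model fitted above (the sims.matrix component of the bugs() output holds the simulations with parameter names as column names):
library(coda)
coef.sims <- as.mcmc(housing.1$sims.matrix[, c("a0","a1","a2","a3","a4","a5")])
geweke.diag(coef.sims)    # Geweke convergence diagnostic
raftery.diag(coef.sims)   # Raftery-Lewis run-length diagnostic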
The communication between WinBUGS and R is based on files, and several output files are saved in the
working directory after closing the WinBUGS session. In particular, the log.txt file contains the node statistics,
part of which is shown in table A3.3. The columns in the table represent the following (note that WinBUGS
provides information on the lower and upper 95‐percent credible intervals instead of the p‐value):
node: the name of the quantity of interest
mean: the average of the simulations
sd: the standard deviation of the simulations
MC error: the computational accuracy of the mean. It can be interpreted as a round‐off error.
2.5%: the 2.5th percentile of the simulations
median: the median or 50th percentile of the simulations
97.5%: the 97.5th percentile of the simulations
Node Mean sd MC error 2.5% Median 97.5% Start Sample
a0 0.1563 0.05071 0.001316 0.2487 0.1572 0.0487 1001 1000
a1 0.1689 0.04586 0.001378 0.08151 0.1684 0.2567 1001 1000
a2 0.1683 0.03899 0.001055 0.241 0.1692 0.08775 1001 1000
a3 0.4567 0.05325 0.001761 0.3588 0.4577 0.5563 1001 1000
a4 0.08879 0.05898 0.001703 0.02972 0.08769 0.2043 1001 1000
a5 0.6064 0.09124 0.002817 0.4318 0.6031 0.7937 1001 1000
deviance 701.4 3.938 0.1425 695.9 700.7 711.3 1001 1000
k 2.133 1.189 0.03254 0.4672 1.954 5.128 1001 1000
mu[1] 0.007859 0.1025 0.002774 0.2044 0.01078 0.1952 1001 1000
mu[2] 0.9028 0.08193 0.002799 0.7528 0.905 1.068 1001 1000
Table A3.3
A comparison of this table with coefficients estimated by the classical linear regression model linear.model1
fitted above shows little difference between the estimated values, meaning that the collinearity problem is
not severe.
MULTILEVEL BAYESIAN MODELING
In this section, we will fit the multilevel model discussed in chapter 12. We start with a model with the
varying coefficient a2 for house age, allowing for different values in different subregions, while all other
coefficients have an unknown constant value. This model can be written as follows:
model
{
for (i in 1:n)
{
price[i] ~ dnorm(mu[i],tau);
mu[i] <- a0 + a1*sq.feet[i]+a2[region[i]]*age[i]+a3*rooms[i]+a4*bedrooms[i]+a5*gar[i]
}
for (j in 1:reg)
{
a2[j] ~ dnorm(mu0,tau.alpha);
}
mu0 ~ dnorm(0.0,0.0001)
a0 ~ dnorm(0,tau.alpha)
a1 ~ dnorm(0,tau.alpha)
a3 ~ dnorm(0,tau.alpha)
a4 ~ dnorm(0,tau.alpha)
a5 ~ dnorm(0,tau.alpha)
tau ~ dgamma(1,0.0001);
tau.alpha <- k*tau
# ridge parameter
k ~ dgamma(1,1)
}
The initial values can be specified as follows, with the varying coefficient a2 initialized by random values between −1 and 1:
a_init <- rep(0, 8)
for (i in 1:8) a_init[i] <- runif(1, -1, 1)
model.inits <- function() {list(a0=0, a2=a_init, a1=0, a3=0, a4=0, a5=0, tau=0.0001, k=1)}
The only differences in the input data are adding a parameter for the number of subregions and adding a
variable region that indicates to which subregion a house belongs:
housing.data <- list(n=i.all, reg=8,
price=data.all[,"sale.price"],
sq.feet=data.all[,"sq.feet"],
age=data.all[,"age"],
rooms=data.all[,"rooms"],
bedrooms=data.all[,"bedrooms"],
gar=data.all[,"gar"],
region=data.all[,"region"])
The monitoring parameters are the same as in the previous model:
model.parameters <- c("a0", "a1", "a2", "a3", "a4", "a5", "tau", "k", "mu")
The following command runs the model:
housing.2 <- bugs(housing.data, model.inits, model.parameters, "housing_model2.bug", n.chains=1,
n.iter=100000, debug=TRUE, bugs.directory = "e:/program/WinBUGS14/")
The resulting node statistics are presented in table A3.4.
Node Mean sd MC error 2.5% Median 97.5% Start Sample
a0 0.2284 0.05211 0.001644 0.3361 0.2291 0.1239 1001 1000
a1 0.2158 0.04481 0.001277 0.1252 0.2162 0.3049 1001 1000
a2[1] 0.04803 0.06117 0.002052 0.1666 0.04864 0.07588 1001 1000
a2[2] 0.09 0.1877 0.004832 0.2929 0.09928 0.454 1001 1000
a2[3] 0.4774 0.1094 0.003553 0.7034 0.4753 0.2691 1001 1000
a2[4] 1.108 0.1686 0.00501 1.433 1.11 0.7605 1001 1000
a2[5] 0.1539 0.1252 0.00331 0.4033 0.1502 0.0967 1001 1000
a2[6] 0.157 0.1028 0.00329 0.3568 0.1644 0.05476 1001 1000
a2[7] 0.07009 0.1019 0.002634 0.2702 0.06918 0.124 1001 1000
a2[8] 0.05573 0.1428 0.004357 0.3229 0.05772 0.2284 1001 1000
a3 0.4135 0.05364 0.001766 0.3088 0.4115 0.5181 1001 1000
a4 0.1497 0.057 0.001654 0.03673 0.1498 0.2615 1001 1000
a5 0.5946 0.08388 0.002631 0.4292 0.5916 0.7601 1001 1000
deviance 656.0 5.654 0.1698 647.2 655.1 669.1 1001 1000
k 2.262 0.8931 0.02521 0.8866 2.167 4.29 1001 1000
mu[1] 0.1238 0.2737 0.00845 0.3961 0.1143 0.6967 1001 1000
Table A3.4
We see that the coefficient a2 is different in different regions and most of the coefficients have changed in
comparison with the linear model with constant regression coefficients. One question is how these changes
affect the predictions.
Figure A3.13
A model with six varying coefficients can be written as follows:
model
{
for (i in 1:n)
{
price[i] ~ dnorm(mu[i],tau);
mu[i] <- a0[region[i]] + a1[region[i]]*sq.feet[i] + a2[region[i]]*age[i] + a3[region[i]]*rooms[i] + a4[region[i]]*bedrooms[i] + a5[region[i]]*gar[i]
}
for (j in 1:reg)
{
a0[j] ~ dnorm(mu0,tau.alpha);
a1[j] ~ dnorm(mu0,tau.alpha);
a2[j] ~ dnorm(mu0,tau.alpha);
a3[j] ~ dnorm(mu0,tau.alpha);
a4[j] ~ dnorm(mu0,tau.alpha);
a5[j] ~ dnorm(mu0,tau.alpha);
}
mu0 ~ dnorm(0.0,0.0001)
tau ~ dgamma(1,0.0001);
tau.alpha <- k*tau
# ridge parameter
k ~ dgamma(1,1)
}
The initial values can be specified as
a0_init <- a1_init <- a2_init <- a3_init <- a4_init <- a5_init <- rep(0, 8)
model.inits <- function() {list(a0=a0_init, a1=a1_init, a2=a2_init, a3=a3_init, a4=a4_init, a5=a5_init, tau=0.0001,
k=0.1)}
The data preparation and the parameters are the same as in the previous model.
We run a third Bayesian model using the following command:
housing.3 <- bugs(housing.data, model.inits, model.parameters, "housing_model3.bug", n.chains=1,
n.iter=100000, debug=TRUE, bugs.directory = "e:/program/WinBUGS14/")
Figure A3.14 shows actual housing prices against predicted housing prices. There is a visual improvement in
comparison with the previous model, confirmed by the estimated adjusted R‐squared value 0.74, which is
much closer to the ideal value of 1 than 0.58 from the previous model.
Figure A3.14
The predictions and prediction standard errors can be saved to a DBASE file for further visualization and use
in ArcGIS using the following commands:
ls <- length(data.all$sale.price)
prediction <- rep(NA, ls)
stderr <- rep(NA, ls)
for(i in 1:ls) {
prediction[i] <- housing.3$mean$mu[i]
stderr[i] <- housing.3$sd$mu[i]
}
res3 <- data.frame(data.all, prediction, stderr)
write.dbf(res3, "housing_bugs_ridge_ml.dbf")
polynum <- seq(1,8)
a0 <- a1 <- a2 <- a3 <- a4 <- a5 <- a0se <- a1se <- a2se <- a3se <- a4se <- a5se <- rep(NA, 8)
for(i in 1:8) {
a0[i] <- housing.3$mean$a0[i]
a1[i] <- housing.3$mean$a1[i]
a2[i] <- housing.3$mean$a2[i]
a3[i] <- housing.3$mean$a3[i]
a4[i] <- housing.3$mean$a4[i]
a5[i] <- housing.3$mean$a5[i]
a0se[i] <- housing.3$sd$a0[i]
a1se[i] <- housing.3$sd$a1[i]
a2se[i] <- housing.3$sd$a2[i]
a3se[i] <- housing.3$sd$a3[i]
a4se[i] <- housing.3$sd$a4[i]
a5se[i] <- housing.3$sd$a5[i]
}
res.coef <- data.frame(num=polynum, a0, a1, a2, a3, a4, a5, a0se, a1se, a2se, a3se, a4se, a5se)
write.dbf(res.coef, "housing_bugs_ridge_ml_coef.dbf")
The coefficients can be added to the polygonal feature layer using the ArcGIS function “Join.” The maps of
varying coefficients for age of the house and garage existence are shown in figure A3.15.
Figure A3.15
Figure A3.16
Since the predictions are still not very accurate, we can try a model with varying coefficients for every house.
This model can be written as follows:
model
{
for (i in 1:n)
{
price[i] ~ dnorm(mu[i],tau);
mu[i] <- a0[i] + a1[i]*sq.feet[i]+a2[i]*age[i]+a3[i]*rooms[i]+a4[i]*bedrooms[i]+a5[i]*gar[i]
}
for (j in 1:n)
{
a0[j] ~ dnorm(mu0,tau.alpha);
a1[j] ~ dnorm(mu0,tau.alpha);
a2[j] ~ dnorm(mu0,tau.alpha);
a3[j] ~ dnorm(mu0,tau.alpha);
a4[j] ~ dnorm(mu0,tau.alpha);
a5[j] ~ dnorm(mu0,tau.alpha);
}
mu0 ~ dnorm(0.0,0.0001)
tau ~ dgamma(1,0.0001);
tau.alpha <- k*tau
# ridge parameter
k ~ dgamma(1,1)
}
The following commands create a graph of 68‐percent credible intervals for predicted prices of the unsold
houses. First, we back‐transform predictions and prediction standard errors to the original scale:
prediction <- stderr <- rep(NA, i.all - i.sold)
i <- 1
mean.s <- mean(ts$SalePrice)
sd.s <- sd(ts$SalePrice)
for(j in (i.sold+1):i.all) {
Second, we create variables with lower and upper bounds and sort the predictions in increasing order:
nn <- seq(1:(i.all - i.sold))
low <- prediction - stderr
up <- prediction + stderr
pred.sort <- rbind(prediction, stderr, low, up)[, order(prediction, stderr, low, up)]
pred.sort.n <- rbind(nn, pred.sort)
Finally, the following commands create a graph in figure A3.17 (we do not show predictions with the seven
largest values because they are so large that the graph will be distorted):
ylim <- c(min(pred.sort.n[4,1:340]), max(pred.sort.n[5,1:340]))
plot(pred.sort.n[1,1:340], pred.sort.n[2,1:340], col="blue", lwd=2, ylim=ylim, xlab=" ", ylab="predicted sale
price")
lines(pred.sort.n[1,1:340], pred.sort.n[5,1:340], col="green", lwd=2)
lines(pred.sort.n[1,1:340], pred.sort.n[4,1:340], col="green", lwd=2)
Figure A3.17
We see that prediction uncertainty is unacceptably large, with an average standard deviation of the first 340
most reliable predictions equal to $50,876. This is because we tried to estimate too many parameters.
Therefore, this overparametrized model should be rejected. Our previous model can be improved using more
reasonable subregions, which could be defined with the help of local knowledge.
Note that the mean can also be specified as a nonlinear function of the explanatory variables, for example,
price[i] ~ dnorm(mu[i], tau)
mu[i] <- a0 + a1*exp(a2*sq.feet[i]) + a3*age[i]^(3/2) + a4*sqrt(rooms[i]*bedrooms[i])
if it makes practical sense. This is one of the advantages of using Bayesian regression analysis, which also
makes it possible to do the following things that are difficult to reproduce with classical regression:
use arbitrary sampling distributions for the data and model parameters
handle missing and censored data
allow for measurement errors in the data
include restrictions for model parameters
allow for correlation between explanatory variables (see example in chapter 12)
BAYESIAN CONDITIONAL GEOSTATISTICAL SIMULATIONS
WinBUGS has functions for fitting Bayesian geostatistical models. We illustrate their usage using two subsets
of the housing data. Note, however, that the Bayesian geostatistical simulations in WinBUGS are very slow,
and one should be prepared to wait several hours for the result of calculations. Therefore, it is advisable to
run the model for several hundred iterations first and see how simulations converge. If there are obvious
problems with convergence, the model or its parameters should be adjusted before running a final model
with dozens or hundreds of thousands of iterations.
Below is the model for predicting house prices. Comments briefly explain the meaning of each command.
model
{
for (i in 1:n)
{
# the housing price, given the spatial surface, has a normal distribution
price[i] ~ dnorm(surface[i] ,tau);
# the spatial surface has a mean of beta and zero‐mean small‐scale autocorrelation eps
surface[i] <- beta + eps[i]
}
# this sets the mean mu at 0 for the spatially varying component eps
for(i in 1:n) { mu[i] <- 0 }
# WinBUGS only has a stable covariance model described in chapter 8 and we set the power value to 1
# (the last parameter in the function below), so it is an exponential model.
# For some reason, WinBUGS uses inverse sill and inverse range instead of sill and range
eps[1:n] ~ spatial.exp(mu[], x[], y[], inv.sill, inv.range, 1)
# a flat prior on beta is specified
beta ~ dflat()
# a diffuse prior on the inverse sill
inv.sill ~ dgamma(0.001, 0.001)
# we are interested in the posterior distribution for the sill rather than its inverse
sill <- 1/inv.sill
# a general prior that is suggested for an inverse range parameter
inv.range ~ dunif(0.001, 0.8)
# the posterior distribution for the range rather than its inverse
range <- 1/inv.range
Save the model above to the file housing_model5.bug.
Note that we did not define a rule for selecting neighbors in the function spatial.unipred because WinBUGS
uses all data for predictions. This is one of the reasons why geostatistical functionality is very slow in this
software package.
We will use sold houses data randomly divided into two parts: 228 training and 99 testing locations. The data
are shown in figure A3.18.
Data courtesy of City of Nashua, N.H.
Figure A3.18
tr <- read.dbf("sold_houses_training.dbf")
te <- read.dbf("sold_houses_test.dbf")
i.tr <- length(tr$SalePrice)
i.te <- length(te$SalePrice)
We will standardize the house price data using the mean and standard deviation of the training dataset. We
will also scale the coordinates using maximum values of the x‐ and y‐coordinates. The following commands
do the job:
mean.s <- mean(tr$SalePrice)
sd.s <- sd(tr$SalePrice)
maxx <- max(tr$x)
maxy <- max(tr$y)
Next, we create two data frames with standardized data using the following commands:
h.c <- data.frame(price=scale(tr$SalePrice), x=tr$x/maxx, y=tr$y/maxy)
h.c.p <- data.frame(x.pred=te$x/maxx, y.pred=te$y/maxy)
A list of housing data in WinBUGS format is created using the following command:
housing.data <- list(n=i.tr,
n.pred=i.te,
price=h.c$price,
x=h.c$x,
y=h.c$y,
x.pred=h.c.p$x.pred,
y.pred=h.c.p$y.pred)
We also specify the initial values and the parameters to monitor using the following commands:
eps.inits <- rep(0, i.tr)
eps.pred.inits <- rep(0, i.te)
model.inits <- function() {list(inv.sill=0.01, inv.range=0.4, beta=0.1, eps.pred=eps.pred.inits, eps=eps.inits,
tau=0.1)}
model.parameters <- c("beta", "sill", "range", "price.pred", "eps.pred", "surface")
Bayesian geostatistical simulations and summary statistics are produced and saved using the command:
housing.5 <- bugs(housing.data, model.inits, model.parameters, "housing_model5.bug", n.chains=1, n.iter=5000,
debug=TRUE, bugs.directory = "e:/program/WinBUGS14/")
Simulations are stored in the housing.5$sims.array in the same order as the model parameters: beta, sill,
range, price.pred[1], price.pred[2], and so on. The following commands display densities of the simulated
mean, sill, and range values as well as predictions at the first testing location:
par(mfrow=c(2,2))
plot(density(housing.5$sims.array[,1,1]), xlim=c(1,4), lwd=3, col="green", main="Mean Distribution",
xlab="Mean")
plot(density(housing.5$sims.array[,1,2]), xlim=c(0,500), lwd=3, col="blue", main="Sill Distribution", xlab="Sill")
plot(density(housing.5$sims.array[,1,3]), xlim=c(0,15), lwd=3, col="red", main="Range Distribution",
xlab="Range")
plot(density(housing.5$sims.array[,1,4]), xlim=c(1,3), lwd=3, col="black", main="Prediction Distribution", xlab="Prediction")
The resulting graphs are shown in figure A3.19. The difference between WinBUGS simulations and
conditional geostatistical simulations discussed in chapter 10 is in treating the kriging model parameters:
they are varying in the former case and fixed in the latter. Bayesian conditional simulations are more variable
and better represent the variability of the phenomena if the statistical model is correct and the calculations
are reliable.
Figure A3.19
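Because the full set of simulations is available, any summary of interest can be computed from it. A short sketch (assuming the monitored vector node price.pred from the model above) of an empirical 90-percent interval for the predicted price at the first testing location, back-transformed to dollars with the training mean and standard deviation defined earlier:
pred1 <- housing.5$sims.list$price.pred[, 1]          # simulated standardized price at the first testing location
quantile(pred1 * sd.s + mean.s, c(0.05, 0.5, 0.95))   # lower bound, median, and upper bound in dollars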
Now we will do a validation exercise. The following commands create the graph in figure A3.20 at left and
estimate the quality of the predictions at the sampled locations:
par(mfrow=c(1,2))
plot(res5.tr$SalePrice, res5.tr$pred_new, main="Crossvalidation", xlab="actual sale price", ylab="predicted sale
price", xlim=c(100000, 400000), ylim=c(100000, 400000))
lines(lowess(res5.tr$SalePrice, res5.tr$pred_new), col="blue", lwd=3)
cv.old <- lm(res5.tr$SalePrice ~ res5.tr$pred_new)
summary(cv.old)
The adjusted R‐squared is equal to 0.8173. This is good performance, but in fact we made predictions at the
sampled locations, where the house price values are known. We conclude that Bayesian geostatistical simulations
in WinBUGS do not honor the observed data values. Although we did not specify the nugget parameter in the
covariance model, the software has estimated it for us and treated the nugget parameter as measurement
error. Why should we assume that house prices are observed with error? Probably because there is some
subjectivity in the prices; it is unlikely that two very similar used houses would be sold for the same price.
Also, data were collected during a six‐month period and some trend in sale prices can be interpreted as
measurement error in the data.
cv.new <- lm(res5.te$SalePrice ~ res5.te$pred)
summary(cv.new)
plot(res5.te$SalePrice, res5.te$pred, main="Validation", xlab="actual sale price", ylab="predicted sale price",
xlim=c(100000, 400000), ylim=c(100000, 400000))
lines(lowess(res5.te$SalePrice, res5.te$pred), col="blue", lwd=3)
par(mfrow=c(1,1))
Figure A3.20
The model performance is not good since the adjusted R‐squared is equal to 0.4146. However, our
geostatistical model can be improved by modeling mean value using explanatory variables instead of
estimating just a constant value. In other words, useful information on housing characteristics should be used
to improve the model performance. For example, the mean value can be modeled similarly to our first
Bayesian linear regression model:
for (i in 1:n)
{
price[i] ~ dnorm(surface[i], tau);
surface[i] <- a0 + a1*sq.feet[i] + a2*age[i] + a3*rooms[i] + a4*bedrooms[i] + a5*gar[i] + eps[i]
mu[i] <- 0
}
We recommend that the reader do this exercise.
REGIONAL BAYESIAN ANALYSIS OF THYROID CANCER IN CHILDREN IN BELARUS
In this section we will use several Bayesian models for modeling risk of thyroid cancer in children using data
collected in Belarusian districts from 1986 to 1994. At the end of this section, we will use two explanatory
variables: the average value of cesium‐137 soil contamination estimated by the conditional Gaussian simulation
discussed in chapter 10 and the distance from the districts’ centroids to the Chernobyl nuclear power plant.
The next commands prepare the data as described in appendix 2:
setwd("e:\\book_data\\bugs_thyroid")
library(maptools)
data_th <- read.shape("Districts_stat")
# convert shapefile data to a list of polygons and a data frame
th_polys <- Map2poly(data_th, region.id=as.character(data_th$att.data$FIPS))
th_df <- data_th$att.data
th_cents <- cbind(th_df$xcoord, th_df$ycoord)
rownames(th_df) <- as.character(th_df$NAME)
# load spdep package
library(spdep)
# calculate the expected counts of thyroid cases using children population variable POPULUSE
pmap <- probmap(th_df$CASES, th_df$POPULUSE)
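As a quick check (a sketch, not part of the book's script), the raw standardized incidence ratios, observed counts divided by expected counts, can be computed directly from the probmap output before any Bayesian smoothing:
sir <- th_df$CASES / pmap$expCount   # raw relative risk (observed/expected) per district
summary(sir)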
Our first Bayesian model will ignore spatial data correlation in estimating relative risk of thyroid cancer in
children. By relative risk we mean the ratio of observed thyroid cancer counts to expected counts in each
district. When the disease is rare, as in the case of thyroid cancer, the incidences of disease in each region can
be assumed to be independent and described by the Poisson distribution
cases[i] ~ Poisson(expected[i] × θ[i]),
where expected[i] is the expected count and θ[i] is the relative risk in district i. In WinBUGS, this model can be written as follows:
model
{
for (i in 1:n)
{
# Poisson distribution for observed counts
cases[i] ~ dpois(mu[i])
log(mu[i]) <- log(expected[i]) + alpha + v[i]
# Relative Risk
theta[i] <- exp(alpha + v[i])
# Prior distribution for the uncorrelated heterogeneity
v[i] ~ dnorm(0, tau)
}
# Exchangeable (meaning that it does not depend on district location) prior distribution on the inverse
# variance parameter of the random effects
tau ~ dgamma(0.5,0.0005)
# Vague prior distributions for intercept α
alpha ~ dnorm(0.0,0.00001)
}
We saved this model to the file thyroid_model1.bug. The next command stores the number of districts, which is used below when preparing the data and the initial values:
kk <- length(th_df$xcoord)
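The initial-value function itself is not reproduced here; the following is only an illustrative example of the kind of list that WinBUGS expects for this model, with one entry for each stochastic node (the particular starting values are our assumption, not the book's settings):
# illustrative starting values for alpha, tau, and the uncorrelated heterogeneity v
model.inits <- function() {list(alpha = 0, tau = 1, v = rep(0.1, kk))}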
A list of data includes the number of Belarusan districts, number of thyroid cancer cases in each district, and
expected numbers estimated by function probmap from the R package spdep (the formulas are presented in
chapter 16):
thyroid.data <- list(n=kk,
cases = th_df$CASES,
expected = pmap$expCount)
Monitoring parameters are the following:
model.parameters <- c("theta", "mu", "v", "alpha")
Run the model using the following command:
thyroid.1 <- bugs(thyroid.data, model.inits, model.parameters, "thyroid_model1.bug", n.chains=1, n.iter=1000,
debug=TRUE, bugs.directory = "e:/program/WinBUGS14/")
The results of estimations can be stored in the data frame res1 using the following commands:
rr.prediction <- rr.stderr <- mu <- mu.stderr <- uncor.heter <- uncor.heter.stderr <- rep(NA, kk)
for(i in 1:kk) {
rr.prediction[i] <- thyroid.1$mean$theta[i]
rr.stderr[i] <- thyroid.1$sd$theta[i]
mu[i] <- thyroid.1$mean$mu[i]
mu.stderr[i] <- thyroid.1$sd$mu[i]
uncor.heter[i] <- thyroid.1$mean$v[i]
uncor.heter.stderr[i] <- thyroid.1$sd$v[i]
}
res1 <- data.frame(th_df, expected=pmap$expCount, mu=mu, mu_stderr=mu.stderr, rr_freq=pmap$relRisk/100,
rr=rr.prediction, rr_stderr=rr.stderr, u_heter=uncor.heter, u_heter_se=uncor.heter.stderr)
The data frame res1 can be written out to a shapefile using command
write.polylistShape(th_polys, res1, file="thyroid_relrisk_bugs")
and the relative risk and its uncorrelated heterogeneity component can be visualized in ArcMap as shown in
figure A3.21. We see that both relative risk (left) and uncorrelated heterogeneity (right) have spatial
structure with large values in the south and small values in the north.
An obvious improvement of this model would consist of modeling spatial correlation of the thyroid data.
There are many ways to incorporate spatial correlation into a Bayesian model of the thyroid cancer relative
risk, and we will use four such models in the rest of this section.
First, we will use a conditionally specified prior spatial structure using a conditional autoregressive (CAR)
model discussed in chapter 11. A conditional autoregressive model requires specification of the neighbors of
each polygon and their weights. We create the neighbors list using the following commands (see chapter 16
for a discussion about neighborhood creation using the spdep package):
five.nn <- knn2nb(knearneigh(th_cents, k=6), sym=TRUE)
plotpolys(th_polys, bbs, border="grey")
plot( five.nn, th_cents, add=TRUE)
Figure A3.22 shows the connected neighbors.
Figure A3.22
The BYM model, with both spatially correlated (u) and uncorrelated (v) random effects, is the following:
model
{
for (i in 1:n)
{
cases[i] ~ dpois(mu[i])
log(mu[i]) <- log(expected[i]) + alpha + u[i] + v[i]
theta[i] <- exp(alpha + u[i] + v[i])
v[i] ~ dnorm(0,tau.v)
# We also calculate the posterior probability PP that relative risk is greater than 1:
PP[i] <- step(theta[i] - 1 + eps)
}
eps <- 1.0E-6
# CAR prior distribution for spatial correlated heterogeneity
u[1:n] ~ car.normal(adj[], weights[], num[], tau.u)
# Improper prior distribution for the mean relative risk in the study region
alpha ~ dflat()
mean <- exp(alpha)
# Prior distributions on inverse variance parameters of random effects
tau.u ~ dgamma(0.5,0.0005)
tau.v ~ dgamma(0.5,0.0005)
}
In the model above we used the so‐called intrinsic Gaussian conditional autoregressive model prior
distribution car.normal(), which has the following parameters:
adj[]: A vector listing the ID numbers of the adjacent areas for each area.
weights[]: A vector of the same length as adj[] giving weights associated with each pair of areas.
num[]: A vector giving the number of neighbors for each area.
tau: A scalar argument representing the precision (inverse variance) parameter of the Gaussian
conditional autoregressive model prior distribution.
The first three arguments must be entered as data, and the variable tau is treated as an unknown parameter that requires a prior distribution. We saved the model to the file thyroid_model_BYM.bug ("BYM" because this model was proposed in the paper by Besag, York, and Mollié; see Bibliography).
The parameters for monitoring are the following:
model.parameters <- c("theta", "mu", "v", "u", "alpha", "PP")
The initial values can be specified as
mu_init <- v_init <- u_init <- rep(0.1, kk)
model.inits <- function() {list(alpha = 0, tau.v=10, tau.u=10, u=u_init, v=v_init)}
Data preparation requires some preliminary programming in R as follows:
dlist <- nbdists(five.nn, th_cents)
num <- rep(0, kk)
for (i in 1:kk) num[i] <- length(five.nn[[i]])
ll <- sum(num)
# equal weights:
weights <- rep(1, ll)
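The adjacency vector adj used in the data list below is not created in the code above. Because five.nn stores, for each district, the indices of its neighbors, one way to build adj (a minimal sketch) is simply to flatten the neighbours list:
# IDs of the neighbors of district 1, then of district 2, and so on,
# matching the counts stored in num
adj <- unlist(five.nn)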
A list of data can be created using the command
thyroid.data <- list(n=kk,
cases = th_df$CASES,
expected = pmap$expCount,
weights=weights,
num=num,
adj=adj)
A command for running WinBUGS is similar to all previous commands of this sort:
thyroid.2 <- bugs(thyroid.data, model.inits, model.parameters, "thyroid_model_BYM.bug", n.chains=1,
n.iter=100000, debug=TRUE, bugs.directory = "e:/program/WinBUGS14/")
The result of estimations can be stored in the data frame res2 using the following commands:
rr.prediction <- rr.stderr <- mu <- mu.stderr <- pp.prediction <- pp.stderr <- uncor.heter <- uncor.heter.stderr <-
cor.heter <- cor.heter.stderr <- rep(NA, kk)
for(i in 1:kk) {
rr.prediction[i] <- thyroid.2$mean$theta[i]
rr.stderr[i] <- thyroid.2$sd$theta[i]
pp.prediction[i] <- thyroid.2$mean$PP[i]
pp.stderr[i] <- thyroid.2$sd$PP[i]
mu[i] <- thyroid.2$mean$mu[i]
mu.stderr[i] <- thyroid.2$sd$mu[i]
cor.heter[i] <- thyroid.2$mean$u[i]
cor.heter.stderr[i] <- thyroid.2$sd$u[i]
uncor.heter[i] <- thyroid.2$mean$v[i]
uncor.heter.stderr[i] <- thyroid.2$sd$v[i]
}
res2 <- data.frame(th_df, expected=pmap$expCount, mu=mu, mu_stderr=mu.stderr, rr_freq=pmap$relRisk/100,
rr=rr.prediction, rr_stderr=rr.stderr, pp=pp.prediction, pp_stderr=pp.stderr, u_cor=cor.heter,
u_cor_se=cor.heter.stderr, u_heter=uncor.heter, u_heter_se=uncor.heter.stderr)
The data frame can be written out to a shapefile for visualization in ArcMap using the following command:
write.polylistShape(th_polys, res2, file="thyroid_relrisk_bym")
Maps of the uncorrelated (left) and correlated (right) heterogeneities v and u are shown in figure A3.23. The uncorrelated heterogeneity is small and fluctuates around zero, while the correlated heterogeneity is much larger and shows a clear spatial structure of the thyroid cancer distribution. A comparison of these maps with the
uncorrelated heterogeneity estimated by the previous model without a spatial component shows that the
previous uncorrelated heterogeneity looks like a sum of uncorrelated and correlated heterogeneities of the
new model.
Figure A3.23
Figure A3.24 presents maps of relative risk (left) and the probability that relative risk is greater than 1
(right). They show clear evidence that the risk of thyroid cancer is much larger in the areas close to the
Chernobyl location, which is in the bottom right part of the map. Further discussion about spatial distribution
of thyroid cancer in children in Belarus can be found in chapter 6 and in the paper, “Analyzing the
Consequences of Chernobyl Using GIS and Spatial Statistics,” available at
http://www.esri.com/news/arcnews/fall03articles/analyzing.html.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A3.24
Another approach for modeling spatial correlation is geostatistical. In the model below we are modeling
correlated heterogeneity using the already‐discussed function spatial.exp:
model
{
for (i in 1:n)
{
cases[i] ~ dpois(mu[i])
log(mu[i]) <- log(expected[i]) + alpha + u[i] + v[i]
theta[i] <- exp(alpha + u[i] + v[i])
PP[i] <- step(theta[i] - 1 + eps)
v[i] ~ dnorm(0,tau.v)
}
We saved this model to the file thyroid_model_kr.bug. The monitored parameters now also include the sill and range of the spatial covariance model:
model.parameters <- c("sill", "range", "alpha", "theta", "PP", "u", "v", "mu")
The initial values and data can be specified using the following commands:
model.inits <- function() {list(alpha = 0, tau.v=10, inv.sill=0.01, inv.range=0.4)}
thyroid.data <- list(n=kk,
cases = th_df$CASES,
expected = pmap$expCount,
x=th_df$xcoord,
y=th_df$ycoord)
We run our new model using the command
thyroid.2a <- bugs(thyroid.data, model.inits, model.parameters, "thyroid_model_kr.bug", n.chains=1,
n.iter=10000, debug=TRUE, bugs.directory = "e:/program/WinBUGS14/")
The posterior densities of the parameters sill and range, as well as of the relative risk and correlated heterogeneity for one selected district (Hojnikskij), can be visualized using the following commands:
par(mfrow=c(2,2))
plot(density(thyroid.2a$sims.array[,1,1]), xlim=c(0,2), lwd=3, col="blue", main="Sill Distribution", xlab="Sill")
plot(density(thyroid.2a$sims.array[,1,2]), xlim=c(0,15), lwd=3, col="red", main="Range Distribution",
xlab="Range")
plot(density(thyroid.2a$sims.array[,1,39]), xlim=c(0,8), lwd=3, col="green", main="Relative Risk Distribution",
xlab="HOJNIKSKIJ District")
plot(density(thyroid.2a$sims.array[,1,273]), xlim=c(1.0,3.0), lwd=3, col="black", main="Correlated
Heterogeneity Distribution", xlab="HOJNIKSKIJ District")
par(mfrow=c(1,1))
Figure A3.25
Maps of relative risk (left) and correlated heterogeneity (right) are presented in figure A3.26. They are similar to the maps produced by the model thyroid_model_BYM.bug above but noisier. This is probably because kriging in WinBUGS uses all the data for modeling spatial correlation, while the conditional autoregressive model uses only several nearby observations.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A3.26
Before adding the covariates to the Bayesian model, we check their significance with a simultaneous autoregressive model fitted by the spdep function spautolm:
sar_th <- spautolm(th_df$INCID1000 ~ th_df$MEAN + th_df$DISTANCE, listw= nb2listw( five.nn),
zero.policy=TRUE)
summary(sar_th)
Part of the model-fitting summary is presented in table A3.5. According to the simultaneous autoregressive model, cesium-137 soil contamination and distance to the Chernobyl nuclear power plant are both significant in explaining the thyroid cancer rates.
Table A3.5
We modify model thyroid_model_BYM.bug as follows:
model
{
for (i in 1:n)
{
cases[i] ~ dpois(mu[i])
log(mu[i]) <- log(expected[i]) + alpha + u[i] + v[i] + b1*cs[i] + b2*dist[i]
theta[i] <- exp(alpha + u[i] + v[i] + b1*cs[i] + b2*dist[i])
PP[i] <- step(theta[i] - 1 + eps)
v[i] ~ dnorm(0,tau.v)
RR_exp[i] <- exp(b1*cs[i] + b2*dist[i])
RR_het[i] <- exp(v[i])
RR_clust[i] <- exp(u[i])
}
eps <- 1.0E-6
u[1:n] ~ car.normal(adj[], weights[], num[], tau.u)
alpha ~ dflat()
mean <- exp(alpha)
b1 ~ dnorm(0.0,1.0E-6)
b2 ~ dnorm(0.0,1.0E-6)
tau.u ~ dgamma(0.5,0.0005)
tau.v ~ dgamma(0.5,0.0005)
}
We saved this model to the file thyroid_model_BYM_eco.bug. The model’s parameters, data, and initial values
can be specified using the following commands:
model.parameters <- c("theta", "mu", "v", "u", "alpha", "PP", "b1", "b2", "RR_exp", "RR_het", "RR_clust")
thyroid.data <- list(n=kk,
cases = th_df$CASES,
We run our modified BYM model using the command
thyroid.3 <- bugs(thyroid.data, model.inits, model.parameters, "thyroid_model_BYM_eco.bug", n.chains=1,
n.iter=500000, debug=TRUE, bugs.directory = "e:/program/WinBUGS14/")
Figure A3.27 shows the relative risk (left) and its environmental part (right, variable RR_exp[]). These maps describe the relative risk of thyroid cancer more accurately than the maps created using the previous models.
This statement can be verified using WinBUGS diagnostics, but we will not discuss this important feature of
the software in this appendix.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A3.27
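A simple numerical check that does not require working inside WinBUGS is to compare the deviance information criterion (DIC) reported by R2WinBUGS for the competing models. The sketch below assumes the bugs objects created above and that the models were run with the default DIC=TRUE:
# smaller DIC indicates a better compromise between fit and model complexity;
# pD is the effective number of parameters
c(BYM = thyroid.2$DIC, BYM_eco = thyroid.3$DIC)
c(BYM = thyroid.2$pD, BYM_eco = thyroid.3$pD)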
In our final model, we will allow the regression coefficients to be spatially correlated:
model
{
for (i in 1:n)
{
cases[i] ~ dpois(mu[i])
log(mu[i]) <- log(expected[i]) + alpha + u[i] + v[i] + b1[i]*cs[i] + b2[i]*dist[i]
theta[i] <- exp(alpha + u[i] + v[i] + b1[i]*cs[i] + b2[i]*dist[i])
PP[i] <- step(theta[i] - 1 + eps)
v[i] ~ dnorm(0,tau.v)
RR_exp[i] <- exp(b1[i]*cs[i] + b2[i]*dist[i])
RR_het[i] <- exp(v[i])
RR_clust[i] <- exp(u[i])
}
eps <- 1.0E-6
u[1:n] ~ car.normal(adj[], weights[], num[], tau.u)
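# The remaining statements of this model are not reproduced in the text. The
# lines below are only an illustrative completion (an assumption, not the
# book's code): each regression coefficient is written as an overall mean plus
# a spatially structured deviation with its own CAR prior.
for (i in 1:n)
{
b1[i] <- b1.mean + b1.sp[i]
b2[i] <- b2.mean + b2.sp[i]
}
b1.sp[1:n] ~ car.normal(adj[], weights[], num[], tau.b1)
b2.sp[1:n] ~ car.normal(adj[], weights[], num[], tau.b2)
b1.mean ~ dnorm(0.0, 1.0E-6)
b2.mean ~ dnorm(0.0, 1.0E-6)
alpha ~ dflat()
tau.u ~ dgamma(0.5,0.0005)
tau.v ~ dgamma(0.5,0.0005)
tau.b1 ~ dgamma(0.5,0.0005)
tau.b2 ~ dgamma(0.5,0.0005)
}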
The maps in figure A3.28 show the estimated relative risk (left) and its standard error (right). They are not very different from those produced by the previous model.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A3.28
We can use this model to map each variable regression coefficient to see how much each covariate influences
the value of the thyroid cancer risk locally. Figure A3.29 shows maps of the regression coefficient for cesium‐
137 soil contamination (left) and its standard error (right). We see that the influence of the cesium-137 soil contamination covariate decreases toward the northwest. However, the prediction standard error is very large, and we should be careful in relating the cesium-137 soil contamination to thyroid cancer risk.
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A3.29
Data courtesy of International Sakharov Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
Figure A3.30
ASSIGNMENTS
1) REPEAT THE CASE STUDIES PRESENTED IN THIS APPENDIX.
Repeat the analysis of the housing data from the city of Nashua and thyroid cancer in children in Belarus
using data from the folder Assignment A3.1.
2) INTERPOLATE THE PRECIPITATION DATA USING GAUSSIAN AND GAMMA
BAYESIAN KRIGING.
When the number of data is small, restoration of the true semivariogram model becomes difficult. In this case,
Bayesian kriging can produce more reliable results because it takes into account the semivariogram model
uncertainty. In chapter 6 (section “Model diagnostic”), a small dataset of meteorological data collected in
Catalonia was analyzed using the Bayesian variant of geostatistical simulation. Use precipitation variable and
3) PERFORM THE BAYESIAN ANALYSIS OF WEEDS DATA.
In chapter 6, we analyzed the spatial distribution of the counts of weeds in the small squares into which the agricultural field was divided. We mentioned that part of this analysis was done using the WinBUGS program.
Perform the Bayesian analysis of weeds data. The data can be found in the folder Assignment A3.3.
4) VERIFY THE CLASSIFICATION OF THE HAPPIEST COUNTRIES IN EUROPE.
WinBUGS provides posterior interval estimates for ranks: function rank(x[], i) returns the rank of the ith
element of array x. Use this function to verify the classification of the happiest countries in Europe discussed
in chapter 11. Assume that the observed rank is equal to the true rank plus the random error. Data are
available in the folder Assignment A3.4.
5) BAYESIAN RANDOM COEFFICIENT MODELING OF THE CRIME DATA.
Reproduce the results of the Bayesian random coefficient modeling of the crime data in the section
“Geographically weighted regression versus random coefficient models” in chapter 12. Data are described in
the section “Criminology” in chapter 6 and are available in the folder Assignment A3.56.
6) BAYESIAN SPATIAL FACTOR ANALYSIS OF THE CRIME DATA.
Reproduce the results of the Bayesian spatial factor analysis of the crime data described in the section
“Criminology” in chapter 6 (see also the section “Spatial factor analysis” in chapter 12). Data are available in
the folder Assignment A3.56.
FURTHER READING
1. Spiegelhalter, D. J., A. Thomas, N. G. Best, and D. Lunn. 2003. WinBUGS User Manual (Version 1.4).
Cambridge: MRC Biostatistics Unit, www.mrc-bsu.cam.ac.uk/bugs/.
Similar to most other software manuals, this explains how to use the software, not why.
2. Lindley, D. V. 2006. Understanding Uncertainty. New Jersey: Wiley & Sons.
This introduction to probability theory is delightfully written by the legendary Bayesian statistician. Bayesian
statistics play the central role in examples of probability calculations using everyday situations.
3. Press, S. J. 2003. Subjective and Objective Bayesian Statistics. Second Edition. New Jersey: Wiley & Sons.
This book is often recommended as an introduction to Bayesian statistics for researchers who have already
studied statistics and probability but know little about Bayesian theory and methods.
4. Congdon, P. 2006. Bayesian Statistical Modeling. Second Edition. Chichester, UK: Wiley & Sons.
This book presents numerous applications of Bayesian data analysis using WinBUGS. However, it is difficult to
read, even for professional Bayesian statisticians.
This book presents examples of spatial epidemiological data analysis using WinBUGS, with short explanations
about reasons to use one or another Bayesian model.
6. Ferreira, M. A. R., and V. De Oliveira. 2007. “Bayesian Reference Analysis for Gaussian Markov Random
Fields.” Journal of Multivariate Analysis 98(4):789‐812.
The authors evaluate the default priors by comparing Bayesian and classical spatial predictions using the
mean squared error of parameters estimation. They show that estimations and predictions may depend on
the choice of noninformative priors.
7. R2WinBUGS: a package for running WinBUGS from R.
http://cran.r-project.org/web/packages/R2WinBUGS/index.html.
Using this R package, it is possible to call a BUGS model, summarize inferences and convergence in a table and
graph, and save the simulations in arrays for easy access in R.
8. Bayesian Output Analysis Program (BOA) for MCMC. http://www.public-health.uiowa.edu/boa/
This is a menu‐driven program and library of functions for carrying out convergence diagnostics and
statistical and graphical analysis of the Markov chain Monte Carlo sampling output.
INTRODUCTION TO SPATIAL
REGRESSION MODELING USING SAS
LOGISTIC REGRESSION
LOGISTIC REGRESSION WITH SPATIALLY CORRELATED ERRORS
POISSON REGRESSION WITH SPATIALLY CORRELATED ERRORS
BINOMIAL REGRESSION WITH SPATIALLY CORRELATED ERRORS
SEMIVARIOGRAM MODELING USING GEOSTATISTICAL ANALYST AND THE
PROCEDURES MIXED AND NLIN
ASSIGNMENTS
1) REPEAT ANALYSIS OF THE PINE BEETLE AND THYROID CANCER DATA
2) RECONSTRUCT THE SEMIVARIOGRAM MODEL PARAMETERS
3) COMPARE THE FITTING OF TWO SEMIVARIOGRAM MODELS
FURTHER READING
In appendixes 1, 2, and 3 we introduced the statistical extension to ArcGIS Geostatistical Analyst 9.2 and two widely used freeware statistical packages, R and WinBUGS. In this appendix, we show examples of regression analysis using the commercial statistical software package SAS/STAT, version 9.1.3. Although
there are many good statistical packages, SAS/STAT is certainly among the best on the basis of the range
of available models and quality of documentation. The latter is very important when studying new statistical
techniques. In practice, SAS documentation is often used as an aid to some poorly documented freeware
statistical software packages.
Statistical models are more likely to be used in a comfortable software environment, and using SAS
procedures is convenient for both beginners and professionals. Generally, for researchers who use statistical
software only rarely, learning how to use SAS/STAT is easier than studying the related set of R packages.
Researchers can do almost any nonspatial, non‐Bayesian statistical data analysis using SAS procedures and, if
necessary, they can build custom models using the powerful SAS math library. However, the spatial statistics
part of the software is limited, and the available models (mostly geostatistical) are very basic compared to
specialized spatial statistical modules. Fortunately, there is an exception: SAS's linear mixed and generalized linear mixed models (the procedures mixed and glimmix), which can model random effects with a spatial covariance.
An example of analysis of arsenic data collected in Bangladesh using the SAS procedure mixed is presented in
chapter 15. Procedure mixed assumes that the response variable (in this case, arsenic contamination) is
normally distributed. This appendix shows how to use the generalized linear mixed model with binary,
Poisson, and binomial data.
The linear model discussed in appendix 2,
y = Xβ + ε,
(where y is the vector of observed values of the response variable, X is the matrix of explanatory variables, ε describes the unexplained variation in the response variable, and β are the unknown constants) has one random effect, ε. The explanatory variables X are treated as known constants and are called fixed effects in the statistical literature. It is assumed that they represent all relevant explanatory variables. The linear regression model should not, in general, be used with correlated data.
The linear mixed model includes one additional random-effect term Zγ:
y = Xβ + Zγ + ε,
where γ is a vector of random coefficients with zero mean and covariance matrix G, γ ~ N(0, G). The coefficients γ are random effects because they represent a sample from a known probability distribution, usually Gaussian. The meaning of the random term ε is changed in the mixed model compared to the general linear model: there is no requirement for the components of ε to be independent and homogeneous; instead, ε can be described by the normal distribution with covariance matrix R, ε ~ N(0, R). In the case of spatial data, R is usually described by a spatial covariance model. The variance of the response variable is the sum of the random-effects variances because the variance of the fixed effects is equal to zero.
Linear mixed models have different names in different applications, including multilevel models, hierarchical
linear models, random‐coefficient models, random‐effects models, variance and covariance‐component
models, contextual‐effects models, repeated‐measures models, and kriging with external trend.
The linear mixed model assumes that the relationship between the mean of the response variable and the
fixed and random effects can be modeled as a linear function, the variance is not a function of the mean, and
that the random effects follow a normal distribution. If these assumptions are violated, the generalized linear
mixed model can be used. It allows researchers to focus on selecting an appropriate model instead of finding a data transformation that satisfies the linear mixed model assumptions.
The linear predictor η = Xβ + Zγ is the systematic component of the model. It differs from the linear mixed model above by dropping the random error term ε.
Unlike in a linear model, the mean response is a function of the linear predictor. This function is called an inverse link function, μ = g^(-1)(η) (the link function itself is g(μ) = η); it transforms the linear predictor into the expected value of the response variable. The most commonly used link functions are
o identity: g(μ) = μ;
o logit: g(μ) = log(μ/(1 − μ));
o log: g(μ) = log(μ);
o complementary log‐log: g(μ) = log(−log(1 − μ));
The distributions used most often in the generalized linear mixed model are normal, Bernoulli, and
Poisson.
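The link functions listed above are easy to evaluate in R, which can be handy for checking calculations by hand; the following is a small sketch using base R only (the example values of the linear predictor are arbitrary):
eta <- c(-2, -1, 0, 1, 2)        # arbitrary example values of the linear predictor
# logit link and its inverse
p <- plogis(eta)                 # inverse logit: exp(eta)/(1 + exp(eta))
qlogis(p)                        # the logit link recovers eta
# log link and its inverse
mu <- exp(eta)
log(mu)
# complementary log-log link and its inverse
p2 <- 1 - exp(-exp(eta))
log(-log(1 - p2))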
The linear mixed model is available in most statistical software packages, but modeling random effects using
spatial covariance (this is the most important component of the model from the spatial regression modeling
point of view) is not a common option. Implementation of the generalized linear mixed model with spatial
covariance is much less frequent, although only two changes in the linear mixed model are needed: to add a
link function and to solve iteratively the estimating equations.
The SAS procedure glimmix is one of the best implementations of the generalized linear mixed model in
statistical software packages. Glimmix has so many options that most GIS users can use it properly only by
following examples provided by experts in the field. Note that the generalized linear mixed model can be also
fitted in the Bayesian software package WinBUGS as discussed in appendix 3, but this is a different story.
This appendix ends with examples of semivariogram model fitting using the maximum likelihood method
implemented in the SAS procedure mixed and the weighted least square method implemented in
Geostatistical Analyst 9.2. It also shows how to fit nonstandard semivariogram models using the SAS
procedure nlin. Semivariogram modeling is the central part of geostatistics, and alternative model fitting
methods are useful for finding reliable semivariogram model parameters, especially in the most difficult
situations from a practical point of view when the number of data is small.
The first part of this appendix uses a dataset from Stochastic Modeling of Scientific Data, a 1995 book by Peter
Guttorp. The data describe beetles attacking lodgepole trees in Oregon in 1984. Beetles can destroy a stand in
a few years; see for example a paper about a Colorado beetle infestation at
http://www.guardian.co.uk/environment/2007/mar/19/usnews.conservationandendangeredspecies.
The data contain the trees' x- and y-coordinates, vigor (in grams of stemwood per square meter of crown leaf area), the diameter at breast height (in inches), leaf area (in square meters), the age (in years), and the response indicator variable (1 for attacked and 0 for not attacked). The goal is to estimate the probability that a given
pine tree suffers a beetle attack as a function of the covariates. Data can be downloaded from Peter Guttorp’s
Web site at http://www.stat.washington.edu/peter/book.data/set7. Figure A4.1 shows the
indicator data called attacked over the interpolated map of the age of the pine trees (this map does not make
sense outside the exercise). One may have a visual impression that the indicator and age variables are related,
and we will verify this impression. Note that there are not enough data for interpolation in the top right
corner of the map.
Data from H.K. Preisler and R.G. Mitchell. 1993. “Colonization Patterns of the Mountain Pine Beetle in Thinned and Unthinned Lodgepole Pine Stands.”
Forest Science 39(3):528–545. Used by permission.
Figure A4.1
The dbf‐file beetle_infection.dbf, converted from the original data in ASCII format, can be read in SAS using the
following commands (they are formatted as in the SAS 9.1.3 Editor window):
The first five records of the dataset with the nickname beetles can be displayed using the following command:
proc print data= beetles (obs=5); run;
The result is shown in table A4.1.
The mean value of the indicator variable attacked, p = 258/449 = 0.575, is also the probability of drawing a tree labeled as 1 at random from the distribution of the variable attacked. The variance of a binary distribution is p(1 − p) = 0.244, and the standard deviation is √0.244 = 0.494.
We might suspect that the explanatory variables are interrelated because both leaf area and vigor are expected to be large for a pine tree with a large diameter and old age. Possible collinearity can be checked using the SAS procedure reg. Figure A4.2 shows the SAS session after submitting the regression model
ATTACKED = β0 + β1⋅LEAF_AREA + β2⋅DIAMETER + β3⋅VIGOR + β4⋅AGE
with options tol, vif and collin:
title 'linear regression diagnostics';
proc reg data=beetles;
model ATTACKED = LEAF_AREA DIAMETER VIGOR AGE/ tol vif collin;
run;
Figure A4.2
According to the parameter estimates in figure A4.2, only two variables, diameter and age, are significant in
explaining the indicator variable attacked, and our next model includes these two variables only:
Tables A4.2 and A4.3 show the results of the estimation. This time all the explanatory variables are significant, and there is not much evidence of collinearity in the data.
It makes sense to analyze data collinearity with a linear regression because collinearity is a feature of the explanatory variables, not of the response variable. Although for probability values between 0.2 and 0.8 the linear
model provides results similar to those produced by more appropriate generalized linear regression models,
there are several reasons to use logistic regression (a particular case of the generalized linear model) instead
of a linear regression when the response variable (in this case, the indicator variable attacked) is binary:
A linear regression assumes that the variance of the response variable is constant across values of the covariates. However, the variance of a binary variable is equal to p(1 − p). So, when 50 percent of the trees are attacked, the variance is 0.25, its maximum value, and the variance decreases as the proportion of attacked trees moves away from 0.5. For example, when p = 0.2, the variance is 0.2·0.8 = 0.16, and as p approaches one or zero, the variance approaches zero.
The model for the relationship between a binary response variable and one or more explanatory variables,
the logistic regression, was discussed in chapter 6. The logistic regression has the form
p = exp(β0 + β1x1 + … + βkxk) / (1 + exp(β0 + β1x1 + … + βkxk)),
where xi is the ith covariate and the βi are the regression coefficients to be estimated.
The expression above looks rather complicated because logistic regression uses odds instead of probability.
Suppose we only know a tree’s diameter, and we want to predict whether that tree is attacked or not. The
odds can be found by counting the number of attacked and not attacked trees for a particular tree diameter
and dividing one number by the other. We can think about the odds of being attacked or not in terms of
probability so that if the probability of being attacked at a given diameter is 0.8, then the odds of being
attacked are 0.8/(1 − 0.8) = 4, or four to one.
The odds of being not attacked are 1/4. Figure A4.3 at left shows the relationship between probability p and
odds graphically.
One can remove the odds asymmetry using a logarithm. The natural logarithm of 4 is 1.386, and the natural logarithm of 1/4 is −1.386, so the log odds of being attacked are exactly opposite to the log odds of being not attacked, as shown in figure A4.3. If the log odds are linearly related to the covariate x, then the relation between probability p and x is nonlinear; it is an S-shaped curve.
Figure A4.3
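The odds and log-odds arithmetic above, and the S-shaped curve, are easy to reproduce in R (a small sketch; the intercept and slope used for the curve are arbitrary illustrative values):
p <- 0.8
odds <- p / (1 - p)              # 4, that is, four to one
log(odds)                        # 1.386
log(1 / odds)                    # -1.386
# S-shaped relation between a covariate and the probability
x <- seq(0, 20, length.out = 200)
plot(x, plogis(-4.5 + 0.6 * x), type = "l",
     xlab = "covariate (for example, tree diameter)", ylab = "probability p")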
The following code fits a nonspatial logistic model with the explanatory variables diameter and age:
ods html;
ods graphics on;
title 'logistic regression, two covariates';
proc logistic data=beetles descending;
model ATTACKED = DIAMETER AGE / influence iplots;
run;
ods graphics off;
ods html close;
SAS proc logistic models the probability of being not attacked (attacked = 0) by default; that is, SAS chooses the smaller value and estimates its probability. To change the default setting in order to model the probability of
being attacked (attacked = 1), the descending option is added to the procedure logistic statement. Options
influence and iplots are used to display the regression diagnostics and the index plots.
The ods html statement specifies an html destination for the output (“ods” is an abbreviation of the “output
delivery system”). The ods graphics on statement requests graphics in addition to the tabular output.
The ods graphics off statement disables graphics, and the ods html close statement closes the
html destination.
Figure A4.4 shows the output tables. Part of the diagnostics is for comparison with other possible models, and
we will fit one more model shortly. The "Model Fit Statistics" table contains the Akaike Information Criterion
(AIC), the Schwarz Criterion (SC), and the negative of twice the log likelihood (−2 Log L) for the intercept-only model and for the model with covariates.
The "Analysis of Maximum Likelihood Estimates" table lists the parameter estimates, their standard errors,
and the results of the Wald test for individual parameters. The result shows that the estimated logistic model
is
log(p/(1 − p)) = β0 + 0.6062·diameter + 0.00702·age,
where p is the probability of a pine tree being attacked by beetles. The slope coefficients 0.6062 and 0.00702
represent the change in log odds for a one-unit increase in diameter and age. We see that the intercept and diameter variables are significant in explaining the variability of the indicator variable attacked, while the variable age, in contrast to the linear regression diagnostics presented earlier, is not.
The odds ratios 1.833 and 1.007 in the "Odds Ratios Estimates" table show the ratios of odds for a one-unit change in diameter and age, along with 95 percent Wald confidence intervals. We see that the variable diameter is more important than the variable age in explaining the beetle attacks on pine trees.
Figure A4.4
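Readers who prefer R can reproduce this logistic fit and the odds ratios with the function glm; the sketch below assumes that the beetle data have been read into a data frame with the same column names as in SAS (the call to read.dbf from the foreign package and the file location are our assumptions):
library(foreign)
beetles.r <- read.dbf("e:/book_data/beetle/beetle_infection.dbf")   # assumed path
fit2 <- glm(ATTACKED ~ DIAMETER + AGE, family = binomial, data = beetles.r)
summary(fit2)
# odds ratios for a one-unit change in each covariate;
# compare with 1.833 and 1.007 in the SAS output
exp(coef(fit2))[c("DIAMETER", "AGE")]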
The "Association of Predicted Probabilities and Observed Responses" table in figure A4.4 contains four
measures of association for assessing the predictive ability of a model. A pair of observations with different
responses (t, attacked and not attacked) is said to be concordant (nc) if the observation with the response that
has value 1 has the higher predicted value than the case with a 0. Otherwise it is called discordant (nd). If a
pair of observations with different responses is neither concordant nor discordant, it is a tie (t ‐ (nc) ‐(nd)).
The enumeration of the total numbers of concordant and discordant pairs in SAS is produced by categorizing
where N is the sum of observation frequencies in the data. The values of the indices above are between 0 and
1, and higher values correspond to stronger association between the predicted and observed values.
Figure A4.5 shows the diagnostics computed for the first 16 observations. Two residuals, Pearson and
Deviance, are useful for determining observations that are the most poorly fit by the model. The hat matrix
diagonal is a measure of how extreme the observation is in the space of the explanatory variables. Dfbetas are
similar to cross‐validation statistics; they show how much each regression coefficient changes when an
observation is deleted (the change is divided by the coefficient standard error). The displacements C and CBar
are analogous to Cook’s distance in linear regression; they show the overall change of the regression
coefficients when an observation is deleted. The last two columns in figure A4.5 show the change in deviance
with the deletion of the observation.
Figure A4.5
Graphs of the statistics shown in figure A4.5 are also provided; see figure A4.6.
In the case of spatial data, we want to see the locations of the suspicious observations. Figure A4.7 at left
shows the diagnostics table in ArcMap (after saving the table shown in figure A4.5 and joining it with the
beetle attack point layer). The 16 largest Pearson residuals are selected, and they are shown as green circles
on the map in figure A4.7 at right. These observations are considered abnormal because they represent
relatively old trees with large trunk diameters, and these trees were not attacked. However, these
observations do not look particularly unusual on the map since most of the selected trees are surrounded by
trees that were not attacked as well.
Data from H.K. Preisler and R.G. Mitchell. 1993. “Colonization Patterns of the Mountain Pine Beetle in Thinned and Unthinned Lodgepole Pine Stands.”
Forest Science 39(3):528–545. Used by permission.
Figure A4.7
Figure A4.8
Before using logistic regression with spatially correlated random errors, we will fit one more nonspatial
model, this time with just one explanatory variable, tree diameter (note that one additional line was added to
the code below to save the predictions and their confidence intervals to the SAS internal object, which we
called pred_1d):
ods html;
ods graphics on;
title 'logistic regression, one covariate';
proc logistic data=beetles descending;
model ATTACKED = DIAMETER / influence iplots;
output out=pred_1d p=prediction upper=up lower=lo resdev=dev predprobs=x;
run;
ods graphics off;
ods html close;
The SAS output can be saved to the Excel file shown in figure A4.9 using the output properties that are accessible by right-clicking the mouse button.
Figure A4.9
Comparing the global diagnostics does not reveal much difference between logistic regression models that
have one and two explanatory variables. In this case, the simpler model with just one explanatory variable
(tree diameter) is preferable.
Next, we can display the probability of beetles attacking a pine tree given the tree's diameter, as stored in the pred_1d output dataset. The following code saves the predictions to the DBASE file logit_1D:
filename a 'e:\book_data\beetle\logit_1D.dbf';
proc dbf db4=a data=pred_1d;
format prediction e8.3 up e8.3 lo e8.3;
run;
To predict whether a tree is attacked or not using information on the tree's diameter, we plot the relation between the two variables in figure A4.10 in red, along with the lower and upper 95 percent confidence intervals in blue. The y-axis is the predicted probability p, which can be read as the proportion of attacked trees at any given value of tree diameter. Note that none of the observations actually falls on the regression line since they all are 0s or 1s (green).
Figure A4.10
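The same kind of probability curve with pointwise 95 percent confidence limits can be obtained in R from a glm fit with diameter as the only covariate; the confidence intervals are computed on the link scale and then back-transformed (a sketch that reuses the beetles.r data frame from the earlier R example):
fit1 <- glm(ATTACKED ~ DIAMETER, family = binomial, data = beetles.r)
newd <- data.frame(DIAMETER = seq(min(beetles.r$DIAMETER),
                                  max(beetles.r$DIAMETER), length.out = 100))
pr <- predict(fit1, newdata = newd, type = "link", se.fit = TRUE)
newd$p  <- plogis(pr$fit)                      # predicted probability
newd$lo <- plogis(pr$fit - 1.96 * pr$se.fit)   # lower 95 percent limit
newd$up <- plogis(pr$fit + 1.96 * pr$se.fit)   # upper 95 percent limit
plot(newd$DIAMETER, newd$p, type = "l", col = "red", ylim = c(0, 1),
     xlab = "tree diameter", ylab = "probability of attack")
lines(newd$DIAMETER, newd$lo, col = "blue")
lines(newd$DIAMETER, newd$up, col = "blue")
points(beetles.r$DIAMETER, beetles.r$ATTACKED, col = "green")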
Categorical data also can be analyzed in SAS using the procedure genmod (the abbreviation of the generalized
linear model). In addition to the binomial and multinomial distributions fitted by the procedure logistic, the
genmod procedure can fit models based on distributions from the exponential family (Gaussian, binomial, Poisson, gamma, and inverse Gaussian distributions) and from nonexponential families (overdispersed binomial and Poisson distributions and the negative binomial distribution, which are suitable for data whose variance changes more rapidly than the mean). We are not discussing the genmod procedure here because our main
interest is spatial regression that can be fitted in SAS using mixed and generalized mixed linear models
(procedures mixed and glimmix).
LOGISTIC REGRESSION WITH SPATIALLY CORRELATED ERRORS
We already know that part of the attacked data dependency can be attributed to spatial correlation, and
figure A4.11 provides additional evidence of this: the semivariogram model of the attacked data (left) shows
relatively strong spatial dependence, and the local mean estimated using the Voronoi map (right) indicates
that the attacked trees are distributed nonrandomly in space. Green circles indicate trees with the nearest
neighbors having opposite indicator values (these data are found using the option cluster). These trees contribute the most to the uncertainty of the spatial correlation between trees at small distances, and, therefore, they look like spatial outliers. However, these seven trees are not outliers from the point of view of the
nonspatial logistic regression model discussed in the previous section. Remember that earlier we observed
the opposite situation: outliers from the nonspatial logistic regression model do not look unusual on the map.
Therefore, we expect that a model that takes into account the characteristics of the trees (such as age and
diameter) and the spatial correlation of the attacked data will better explain the spatial configuration of the
attacked pine trees.
To make the predictions sensitive to the presence of spatial random effects, the semivariogram model is used
in the procedure glimmix, which fits the generalized linear mixed model to the beetles data using the
following commands:
data beetles;
set beetles;
obs = _n_;
run;
title 'spatial logistic regression 3 parameters';
proc glimmix data=beetles;
class obs;
model ATTACKED (descending) = DIAMETER VIGOR AGE /
dist=binary solution ddfm=residual;
random obs / type=sp(exp) (X Y);
parms
/* sill */ (0.1 to 3 by 0.2)
/* range */ (5 to 30 by 2);
output out=gmxout pred(ilink)=p stderr(ilink)=se;
run;
Note that a new variable containing the observation number was created (variable obs). It is used in the class
and random statements. The predictions from this model are spatially varying. Other notes on the code above
are the following:
The attacked data have a binary distribution, a special case of the binomial distribution, and this
distribution is specified explicitly.
The default degrees-of-freedom method does not leave any degrees of freedom for testing the random effects, and the option ddfm = residual is one possible alternative to the default.
The statement solution requests that a solution for fixed effects diameter, vigor, and age be produced.
The statement sp(exp) specifies the exponential semivariogram model, and statement (X Y) indicates
spatial coordinates fields in the input dataset.
Mixed models implemented in SAS estimate the spatial covariance parameter using maximum
likelihood methods that do not require the assumption about constant mean value. They estimate
regression coefficients and spatial data structure simultaneously.
The parms statement suggests possible values for the parameters range and sill (the values were chosen using the Geostatistical Analyst's semivariogram modeling dialog shown in figure A4.11 at left). The glimmix procedure does not have a sophisticated algorithm for choosing reliable starting values for the covariance parameters.
Figure A4.12 shows part of the procedure glimmix output. The estimated spatial covariance parameters are
displayed in the “Covariance Parameter Estimates” table: the estimate of the sill parameter is reported as
Variance, and the range estimate is reported as sp(exp). The “Solution for Fixed Effects” table shows the
estimate of the intercept and three regression coefficients with their standard errors. Similar to the
nonspatial logistic regression model, the intercept and diameter are the only significant explanatory variables
since their values in the last column are less than the benchmark value of 0.05.
Figure A4.12
The model above assumes that the nugget parameter is equal to zero. It makes sense because we know
exactly whether a tree was attacked by beetles or not. In the next section we will use the generalized linear
mixed model with a nonzero nugget parameter.
The following statements display a 3D scatterplot of the predicted probabilities that trees are attacked by beetles, shown in figure A4.13.
Figure A4.13
The commands below and figure A4.14 show the generalized linear mixed model with only one covariate
diameter and the model output.
Figure A4.14
The estimated regression coefficients with and without spatial random errors are not very different, but look
at figure A4.15, which shows the predicted probabilities versus the tree diameter (blue). Red circles show the
probability line from the nonspatial logistic regression model discussed in the previous section. We see that
spatial correlation introduces additional uncertainty in the probability values, so that a statement about the possibility that a tree with a given diameter is attacked becomes much less certain for trees of small and moderate size.
Figure A4.15
Figure A4.16 at left shows probabilities calculated by the procedure glimmix at the tree locations interpolated
by radial basis functions (see discussion in chapter 7). This map can be compared with a map of probabilities
created using indicator kriging, as shown in figure A4.16 (center). The main difference between the two maps
is that indicator kriging assigns very large probabilities to the areas surrounded by indicator values 1s and
very low probabilities to the areas surrounded by indicator values 0s, while the generalized linear mixed
model is not that certain when pairs of zero‐one indicator values are separated by the distance comparable
with the range of data correlation. Another difference between the two models is in the prediction standard
error maps: indicator kriging produces a surface that reflects the configuration of tree locations only, while
prediction errors estimated by the generalized linear mixed model (figure A4.16 at right) depend on the data
values (see also a comparison of indicator kriging and the logistic regression in the section “Agriculture” in
chapter 6).
Data from H.K. Preisler and R.G. Mitchell. 1993. “Colonization Patterns of the Mountain Pine Beetle in Thinned and Unthinned Lodgepole Pine Stands.”
Forest Science 39(3):528–545. Used by permission.
Figure A4.16
The fitted model should be further investigated using diagnostic tools as discussed in chapter 12 in the
section “Regression models diagnostics and selection,” in appendix 2, and in previous sections of this
appendix. Residual and influence diagnostics for the generalized linear mixed model are available, and
interested readers can read more on the similarities and differences between linear and mixed model diagnostics in the book identified in reference 2 of "Further reading" at the end of this appendix.
Figure A4.17
Spatial regression models can predict new values at the locations where explanatory variables are known. For
this purpose, a new SAS dataset with missing values for the response variable should be created. Then the
new data are merged with the original data, and the regression model with the merged dataset is used.
Because the new dataset has missing values for the response variable, explanatory variables at the new
locations do not affect the model fit, while the predictions at new locations are produced. Figure A4.18 shows
interpolated values of the tree diameter on a grid estimated and displayed in ArcGIS. Of course, there are no
trees between the observed ones, but the interpolated values are needed for creation of a continuous map of
the risk that pine trees are attacked within the study domain.
Figure A4.18
Unfortunately, the glimmix procedure with random effects described by spatial covariance is an exception to the rule in the sense that it cannot predict values at the locations where the response variable is not
available because of issues in using pseudo‐quasi‐likelihood, back‐transformation, and valid prediction
intervals (note that the mixed procedure, which assumes that the response variable is distributed normally,
can do the job, see an example in chapter 15). However, the predictions_set data can be used for predictions at
the unsampled locations using the option rsmooth as shown in the statements below.
In the statements above
The option rsmooth initiates the radial smoother in the procedure glimmix (a very similar option is implemented in the semiparametric regression model discussed in chapter 12). It computes the
mixed model splines at any location, observed or not.
nloptions is one way of getting the algorithm to converge: by default, the procedure glimmix performs only 20 optimization iterations and then stops. The number of iterations can be increased with the option maxopt, or the user can control other aspects of the optimization through the nloptions statement.
The model above produces the following output (after conversion from SAS Mono to Times News Roman
font):
Standard
Cov Parm Estimate Error
Var[RSmooth(X, Y)] 4.862E‐6 5.582E‐6
Solutions for Fixed Effects
Standard
Effect Estimate Error DF t Value Pr > |t|
Intercept ‐4.5705 0.7755 447 ‐5.89 <.0001
DIAMETER 0.6994 0.07820 447 8.94 <.0001
We see that, similar to the model with the spatial covariance, both the intercept and the explanatory variable DIAMETER are significant in explaining the attacked pine trees. "Var[RSmooth(X, Y)]" in the output above is the estimated variance of the radial smoother spline coefficients. This variance has a large standard error, which suggests changing the default options. The model can be improved using a better knots configuration
than the default one. Interested readers can find the necessary information about penalized regression
splines parameters, including the optimal knots selection, in the book noted in reference 5 of “Further
reading”.
The maps in figure A4.19 show the probabilities that the pine tree will be attacked (at left) and their
prediction standard errors (at right). It seems that the prediction standard errors are too large on the edges.
This is probably because the default options of the radial smoother are not optimal.
Data from H.K. Preisler and R.G. Mitchell. 1993. “Colonization Patterns of the Mountain Pine Beetle in Thinned and Unthinned Lodgepole Pine Stands.”
Forest Science 39(3):528–545. Used by permission.
Figure A4.19
POISSON REGRESSION WITH SPATIALLY CORRELATED ERRORS
The data showing tapeworm infection in red foxes, collected in 43 regions in northern Germany, are
presented in assignment 2 of chapter 11. The goal of the research paper cited in that assignment was
smoothing the regional data and providing the uncertainty of the smoothing using kriging. The author did this
in two steps: (1) he used the empirical Bayesian smoothing discussed in chapter 11; and (2) he used the smoothed values in the regions from step 1 as input to universal kriging to create a smooth continuous map of predictions and prediction standard errors. The proposed smoothing method was motivated by the absence
of Poisson kriging in the available statistical software packages.
One disadvantage of the proposed workaround is that the uncertainty of the resulting continuous map of the
tapeworm infection in red foxes simply reflects the unfortunate feature of conventional universal kriging—its
inability to produce prediction standard errors that depend on the data values as dictated by a Poisson
distribution. In this section, we will use Poisson spatial regression (Poisson kriging) to smooth out the
infection rates in the polygons. Then any reasonable smooth interpolator (for example, local polynomial
interpolation or radial basis functions), can be used for creating the required continuous surface. We will also
show how to produce smoothed predictions by the SAS procedure glimmix with the radial smoother option.
The following statements read the data, namely the centroid coordinates (X and Y) and the numbers of tested (N) and infected (M) foxes:
Next, data are prepared for modeling as in the case study in the previous section. In addition, a new variable
logtested was created: the logarithm of the total number of tested foxes in each polygon.
The required model (called the conditional spatial generalized linear mixed model in the book cited in
reference 3 of “Further reading”) is the following.
In the model above, we specify a Poisson data distribution; the option offset specifies a fixed effect with a
regression coefficient known to be 1 (this is a standard option in a Poisson regression model); the option
sp(sph) requires using a spherical semivariogram model; a measurement error component (nugget effect) is
added to the model by the statement random _residual_; ods select asks for printing the covariance parameter estimates; the second row from the bottom specifies the optimization technique, where nrridg stands for a Newton-Raphson optimization with a ridging algorithm; and the last row saves the predictions and prediction standard errors in the original data scale to the foxpred dataset.
The estimated parameters and their standard errors are shown in the following SAS report (the nugget
parameter is called “Residual (VC)” here):
Conditional Spatial GLMM
The GLIMMIX Procedure
Covariance Parameter Estimates
Standard
Cov Parm Estimate Error
Variance 0.5414 0.2656
SP(SPH) 18.1406 5.1245
Residual (VC) 0.9684 0.4959
Solutions for Fixed Effects
Standard
Effect Estimate Error DF t Value Pr > |t|
Intercept ‐2.1244 0.3305 42 ‐6.43 <.0001
The first ten predictions are shown in figure A4.20. The observed numbers of the infected foxes (column M)
can be compared with the predicted numbers (column condpred). We see that many predictions are similar to
the observed values, but the number of predicted infected foxes in Wolfsburg (observation 3) is two times
smaller than the observed value. Also note that the prediction standard error (column se) depends on the
predicted values as required by a Poisson distribution.
Figure A4.20
The following statements save the predictions and their standard errors to the file sglmm_p.dbf for further
visualization in ArcGIS:
Figure A4.21 shows the observed rates at left (the proportion of the infected foxes to the tested foxes) and the
predicted rates at right in the Voronoi polygons created around regional data centroids. We see that the
northern part of the region, where the proportion of the infected foxes is low, is significantly updated.
Data from the International Journal of Health Geographics.
Figure A4.21
Data from the International Journal of Health Geographics.
Figure A4.22
The following statements fit the generalized linear mixed model using the radial smoother instead of spatial
covariance. Predictions require knowledge of the explanatory variable. In this case it is a logarithm of the
number of tested foxes log(Ni) in each region i. Therefore, a grid with log(Ni) values was created, figure A4.23
at left, and saved to the DBASE file fox_data_pts.dbf.
The output of the generalized linear mixed model with the radial smoother is shown in ArcMap in figure
A4.23 center (predictions) and right (prediction standard errors). Histograms under the maps show
distributions of predictions and prediction standard errors.
Data from the International Journal of Health Geographics.
Figure A4.23
One advantage of using the generalized linear mixed model is that both the predictions and the prediction standard errors can be improved using relevant covariate data.
BINOMIAL REGRESSION WITH SPATIALLY CORRELATED ERRORS
In appendix 3, we showed how to fit several spatial regression models to the thyroid cancer data collected in
Belarus during the first seven years after the Chernobyl accident. One of the discussed models is the Poisson
regression with spatial random effects described by the covariance model (Poisson kriging). In this section,
we will fit the thyroid cancer data (CASES of the thyroid cancer in children and children population
POPULUSE in the Belarus districts) with two covariates—the average cesium‐137 contamination in the
districts (variable MEAN) and the distance from the district centroids to the Chernobyl nuclear power plant
(variable DISTANCE)—using binomial regression with spatially correlated errors. The code below should be understandable since similar statements were used in the examples above. Note that we do not specify the response data distribution because the binomial distribution is the default in the procedure glimmix.
The output from the model above is shown below. The result of the regression coefficient estimation is important from an epidemiological point of view: neither of the explanatory variables MEAN and DISTANCE is significant in explaining the thyroid cancer rates in the Belarus districts. This result confirms our analysis in appendix 3.
Covariance Parameter Estimates
Standard
Cov Parm Estimate Error
Variance 1.1432 0.3657
SP(SPH) 352013 48093
Residual (VC) 0.6121 0.1322
Solutions for Fixed Effects
Standard
Effect Estimate Error DF t Value Pr > |t|
Intercept ‐8.8220 0.5041 114 ‐17.50 <.0001
MEAN 0.02469 0.02738 114 0.90 0.3690
DISTANCE ‐0.05384 0.04614 114 ‐1.17 0.2457
Figure A4.24 shows the adjusted (left) and the observed (right) rates. The latter looks like a noisy version of
the former and the adjusted rates can be used as a representative map of thyroid cancer in children.
Data from the International Journal of Health Geographics.
Figure A4.24
SEMIVARIOGRAM MODELING USING GEOSTATISTICAL ANALYST AND THE PROCEDURES MIXED AND NLIN
In this section, we compare semivariogram model fitting using Geostatistical Analyst 9.2 and the SAS
procedures mixed and nlin using relatively small datasets simulated in the unit square. Small datasets are
more difficult to analyze, and different semivariogram model fitting algorithms can help to find the most
reliable semivariogram model for kriging predictions. The advantages and disadvantages of the two methods
of the semivariogram model fitting—weighted least square (Geostatistical Analyst) and restricted maximum
likelihood (SAS procedure mixed)— are discussed in chapter 8 in the section “Semivariogram and covariance
model fitting.”
A key advantage of the semivariogram model fitting with Geostatistical Analyst is interactivity. It can be
illustrated as follows.
Figure A4.25 at left shows the default spherical semivariogram model. The most important part of the model, at distances smaller than the range of the data correlation, occupies approximately one third of the available graph space, and it is a good idea to change the lag size and the number of lags so that the space beyond the range takes about one fourth of the x-axis, as shown in figure A4.25 at right.
Figure A4.25
Although the estimated model looks good, it makes sense to try several other options. Often it is preferable to
use the empirical covariance for estimating the range parameter and the empirical semivariogram for the
nugget parameter and the model shape. This is because the variance of the empirical semivariogram
increases at large lags due to small number of pairs of points at large distances and, therefore, estimating the
range parameter can be unreliable. However, the variance of the empirical covariance at large lags is two
times smaller (see the formulas for the empirical semivariogram and covariance weights in reference 6 of
“Further reading.”) Therefore, we may want to concentrate on covariance modeling of just one parameter, the
range of data correlation, by changing the number of lags and the lag‐size values. After several iterations, the
estimated range is 0.194, as shown in figure A4.26 at left. We fix this value by clicking the pencil icon.
Sometimes researchers choose a particular semivariogram model in advance and then select the lag value and
the number of lags to support their favorite model. This results in preparing the data for a particular
model instead of finding a model that fits the data, as it should be. Consequently, fitting semivariogram models
with different shapes is a necessary step in semivariogram modeling. Therefore, we go back to the
semivariogram graph and choose a stable semivariogram model, which has an additional shape parameter.
Figure A4.26
We see that the estimated shape parameter of the stable model is close to 1, that is, the stable model is nearly
the same as the exponential semivariogram model (see chapter 8 for the semivariogram models’ formulas).
Consequently, it makes sense to use the exponential semivariogram model shown in figure A4.27 at left.
Finally, all parameters can be rounded off, and the resulting model is shown in figure A4.27 at right.
Figure A4.27
Other typical steps in choosing the optimal semivariogram model are the following:
Checking for possible periodicity in the semivariogram model by trying the hole effect or the J‐Bessel
model.
Examining data anisotropy by estimating both the major and minor range parameters.
Verifying whether a mixture of two or three semivariogram models describes the cloud of empirical
semivariogram points better than a single model. It makes sense to try two different models, say
Gaussian and exponential, or the same model with different ranges.
Examining the bivariate normality assumption. This is a requirement for disjunctive kriging and a
desirable feature for optimal estimation of the prediction errors in simple kriging.
Estimating or specifying the measurement error component of the nugget effect to properly use the
filtered kriging model.
All of this can be done in Geostatistical Analyst using only the mouse.
The data for the semivariogram modeling exercise above are from assignment 1 of chapter 8, which requires
finding the optimal semivariogram model for several simulated Gaussian datasets. The semivariogram model in
figure A4.27 at right matches the true semivariogram model exactly. This was possible because the number
of data points is large (500 points), the data are Gaussian, the data locations are randomly distributed in space,
and there are no trends or outliers in the data. Such perfect data are rarely available, and it is practically
impossible to reconstruct a true semivariogram model if the number of data points is small, the data distribution
is non‐Gaussian, measurements are not absolutely precise, and the data are nonstationary. In practice, it is
difficult to reconstruct the semivariogram model even when only one of these conditions is not fulfilled, as we
illustrate below.
Although semivariogram model fitting is much easier in Geostatistical Analyst, it is always good to double-check
the fitted model to ensure that the semivariogram model choice is reliable, and the SAS linear mixed
model provides good alternative algorithms: in addition to the default restricted maximum likelihood, the
maximum likelihood and the minimum variance quadratic unbiased estimation methods can be used. Note,
however, that the procedure mixed can be successfully used for semivariogram modeling only when the
number of data points does not exceed several hundred.
The following statements read the simulated data (x‐ and y‐coordinates X and Y and values Z) from the
exercise above and fit the spherical model. The name of the spherical semivariogram model, “sph”, is given in the
parentheses of the function sp(). The nugget parameter is added to the model by the option local in the repeated
statement. The option subject=intercept treats all the data as potentially correlated.
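A sketch of these statements is shown below; the file name gaussian_sim.dbf and the dataset name sim are assumptions for illustration (reading a dBASE file with proc import requires SAS/ACCESS Interface to PC Files).

proc import datafile="gaussian_sim.dbf"          /* hypothetical file containing X, Y, and Z  */
            out=sim dbms=dbf replace;
run;

proc mixed data=sim;
   model z = / solution;                          /* constant-mean model                       */
   repeated / subject=intercept                   /* all observations potentially correlated   */
              type=sp(sph)(x y)                   /* spherical spatial covariance, "sph"       */
              local;                              /* adds the nugget (Residual) parameter      */
run;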
The result of the semivariogram modeling shown below is rather disappointing: the estimated partial
sill (“Variance”) and range (“SP(SPH)”) parameters are about 10 and 6 times larger than the true values,
respectively.
Cov Parm    Subject      Estimate
Variance    Intercept      9.6785
SP(SPH)     Intercept      1.2089
Residual                   0.02056
It is therefore recommended to supply reasonable starting semivariogram model parameters, as we did earlier
in this appendix. Otherwise, the maximum likelihood algorithm may converge to a local instead of the global
maximum and provide unreliable estimates. The following code (note that in all estimates by the SAS procedures
mixed and nlin below, the semivariogram parameter partial sill is called sill although, according to the
geostatistical literature, the sill is the sum of the partial sill and the nugget)
produces the estimates shown after it.
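A sketch of such statements is given below; the dataset name sim and the particular starting values are illustrative assumptions, and the values in the parms statement follow the order of the Covariance Parameter Estimates table (partial sill, range, nugget).

proc mixed data=sim;
   model z = / solution;
   parms (1) (0.2) (0.1);                         /* starting values: partial sill, range, nugget (assumed) */
   repeated / subject=intercept type=sp(sph)(x y) local;
run;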
Cov Parm    Subject      Estimate
Variance    Intercept      0.8371
SP(SPH)     Intercept      0.1002
Residual                   0.01154
This is better, but the range parameter is still two times smaller than the true one, probably because the
number of input data points is too large for the SAS implementation of the restricted maximum likelihood method.
Next, we compare semivariogram modeling estimates using a small dataset of 49 points. The true spherical
model has the following semivariogram parameters: a partial sill of 0.9, a range of 0.4, and a nugget of 0.1.
The statements below are used for fitting a spherical semivariogram model with the procedure mixed:
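These statements mirror the sketch shown above for the larger dataset; only the input dataset and the starting values change. The dataset name sim49 and the starting values are again assumptions.

proc mixed data=sim49;                            /* 49 simulated points (assumed dataset name)            */
   model z = / solution;
   parms (1) (0.3) (0.05);                        /* rough starting guesses: partial sill, range, nugget   */
   repeated / subject=intercept type=sp(sph)(x y) local;
run;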
Figure A4.28 shows a simulated surface generated with the true semivariogram model, the sample data, the
Geostatistical Analyst’s Semivariogram/Covariance Modeling dialog with a model estimated by changing only
the lag size and the number of lags, and the estimates produced by the procedure mixed.
Figure A4.28
Both estimates are reasonably close to the true semivariogram model, although Geostatistical Analyst
produced a slightly better result for all three parameters. Since semivariogram modeling is usually used for
predictions, it is interesting to compare the cross‐validation diagnostics of ordinary kriging predictions using
both semivariogram models, as shown in figure A4.29. The cross‐validation diagnostics are slightly better for the
semivariogram model estimated by Geostatistical Analyst because the first four prediction error values are
larger for the semivariogram model estimated by the procedure mixed. However, the difference is small, and
the main conclusion is that both methods work reasonably well.
Figure A4.29
Figure A4.30 at left shows the same data as in the previous example (green circles) and one additional datum
in pink with the unusually large value of 6. Just one additional but erroneous data value makes reconstruction
of the true semivariogram model very difficult, if not impossible, as the estimates made by both Geostatistical
Analyst and the procedure mixed show in the right part of figure A4.30. This is because the squared differences
that enter the empirical semivariogram are strongly inflated by every pair of points that includes the outlying value.
Figure A4.30
Semivariogram modeling is more challenging when the range of spatial data correlation is small compared to
the size of the data domain. In this case, the selection of the final semivariogram model may require support
from fitting algorithms based on different methods. Figure A4.31 shows a simulated surface generated using
the exponential semivariogram model with a partial sill of 0.8, a range of 0.2, and a nugget of 0.1. Ninety‑six
sample points are used for fitting the exponential semivariogram model in Geostatistical Analyst and with the
procedure mixed using the following statements:
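A sketch of these statements is shown below; the dataset name sim96 and the starting values are assumptions. Recall that sp(exp) in the procedure mixed is parameterized through exp(-distance/rho), so rho corresponds to one third of the practical range.

proc mixed data=sim96;                            /* assumed dataset name                                  */
   model z = / solution;
   parms (0.8) (0.07) (0.1);                      /* partial sill, rho, nugget (assumed starting values)   */
   repeated / subject=intercept type=sp(exp)(x y) local;
run;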
Figure A4.31
Neither model fitting algorithm is very effective: the nugget parameter was seriously overestimated by
both methods, and each algorithm successfully recovered only one parameter value (Geostatistical Analyst,
the range; the procedure mixed, the partial sill). Note that the procedure mixed parameterizes the exponential
model through exp(-distance/range) and therefore reports one third of the practical range, so that, according
to the procedure mixed, the estimated range in our example is actually equal to 0.315.
The next example shows semivariogram model fitting using anisotropic data. Eighty Gaussian data points were
simulated using the Gaussian semivariogram model with a partial sill of 0.75, ranges of 0.2 and 0.4, an
anisotropy angle of 45 degrees, and a nugget of 0.15. The following statements are used to fit the semivariogram
model with the procedure mixed. In the case of anisotropic semivariogram modeling, five parameters must be
specified; the two additional parameters are the angle of anisotropy (in radians) and the proportion of the
major and minor semiaxes.
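A sketch of these statements is shown below. The geometrically anisotropic Gaussian structure sp(gauga)(x y) adds the anisotropy angle and the ratio of the major and minor semiaxes to the partial sill and range; the dataset name sim_aniso, the starting values, and the assumed parameter order (partial sill, range, angle, ratio, nugget) should be checked against the Covariance Parameter Estimates table.

proc mixed data=sim_aniso;                        /* assumed dataset name                                  */
   model z = / solution;
   parms (0.75) (0.4) (0.785) (2) (0.15);         /* partial sill, range, angle in radians, ratio, nugget
                                                     (order assumed; verify against the output)            */
   repeated / subject=intercept type=sp(gauga)(x y) local;
run;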
Figure A4.32
The SAS procedure mixed supports the following semivariogram models: spherical, exponential, Gaussian,
power, K‐Bessel (called “Matérn” in SAS; fitting it is extremely slow), and linear with sill (this model is invalid
in two or more dimensions). Only the first three models have an anisotropy option. The procedure mixed
does not support nested models. The nugget parameter cannot be divided into measurement error and
microscale variation components. Multivariate data modeling using cross‐covariance or cross‐semivariogram
models is not available. The choice of semivariogram models in the procedure glimmix is even
more limited.
It is possible to fit semivariogram models that are not available in the procedure mixed using the procedure
nlin. We demonstrate this by fitting the pentaspherical model. Figure A4.33 shows a simulated surface
generated using the pentaspherical model with a partial sill of 0.75, a range of 0.13, and a nugget of
0.05; the 98 sample data points; and the results of the model fitting using Geostatistical Analyst and the
procedure nlin.
Figure A4.33
The semivariogram model fitting with the procedure nlin requires some preparation. First, the empirical
semivariogram values, their weights, and the lags should be saved to a dBASE file using the option “Save values as
table” in the Geostatistical Analyst’s Semivariogram/Covariance Modeling dialog shown in the top right part
of figure A4.33. We saved the cloud of points to the file penta_sem_cloud_p.dbf.
The following commands read the dBASE file:
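The import step could look as follows (reading dBASE files with proc import requires SAS/ACCESS Interface to PC Files; the variable names distance and gamma used in the later steps are assumptions about the exported table).

proc import datafile="penta_sem_cloud_p.dbf"      /* file exported by Geostatistical Analyst */
            out=cloud dbms=dbf replace;
run;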
The empirical semivariogram cloud in figure A4.34 at left includes multiple values calculated for the same
distance between pairs of points (they correspond to different directions between the data pairs; see chapters 6
and 8), and these values should be averaged before using the procedure nlin. The data averaging step can be
done with the following statements:
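One way to average the cloud values that share the same lag distance is sketched below; the variable names distance and gamma are assumptions about the imported table.

proc sort data=cloud;
   by distance;
run;

proc means data=cloud noprint;
   by distance;                                    /* one average per lag distance            */
   var gamma;
   output out=svar mean=gamma;                     /* averaged empirical semivariogram values */
run;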
The averaged empirical semivariogram values are shown in figure A4.34 at right. New empirical
semivariogram values are stored in the SAS data object svar.
Figure A4.34
Finally, the pentaspherical semivariogram model can be fitted using the following statements:
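A sketch of the nlin fit is given below; the starting values are rough guesses, the variable names distance and gamma are assumptions, and, if desired, the weights saved with the empirical semivariogram could be supplied through the special _weight_ variable.

proc nlin data=svar;
   parms range=0.15 nugget=0.05 sill=0.8;          /* starting values are rough guesses               */
   h = distance / range;                           /* scaled lag                                      */
   if distance <= range then
      gmod = nugget + sill*(1.875*h - 1.25*h**3 + 0.375*h**5);   /* pentaspherical model, h <= range  */
   else gmod = nugget + sill;                      /* the sill is reached beyond the range            */
   model gamma = gmod;
run;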
From the right part of figure A4.33, we see that Geostatistical Analyst was more successful in estimating the
partial sill (an almost perfect estimate) and the range, while the procedure nlin was better in estimating the
nugget parameter (note that the estimated standard error of the nugget parameter is very large). In addition,
the procedure nlin shows that the semivariogram model parameters are correlated, especially the partial sill
and the nugget:
Approximate Correlation Matrix
             range       nugget         sill
range    1.0000000    0.5161746    0.1261505
nugget   0.5161746    1.0000000   -0.7132963
sill     0.1261505   -0.7132963    1.0000000
ASSIGNMENTS
1) REPEAT THE ANALYSIS OF THE PINE BEETLE AND THYROID CANCER DATA (IF YOU
HAVE ACCESS TO SAS SOFTWARE).
Data are in the folders Assignment A4.1 and Assignment 11.2. Try to analyze these data using R functions glm
(package stats) and glmmPQL (package VR) or WinBUGS if you do not have SAS.
2) RECONSTRUCT THE SEMIVARIOGRAM MODEL PARAMETERS.
Data in the file gaus_plus_exp.dbf are simulated using a mixture of exponential, Gaussian, and nugget models.
Use the procedure nlin to reconstruct the semivariogram model parameters. Data are in the folder Assignment
A4.2. Then compare your results with the true semivariogram model parameters in the file
SemivariogramModelingAnswer.txt. Use the R function nls for the semivariogram model parameter reconstruction
if you do not have SAS.
3) COMPARE THE FITTING OF TWO SEMIVARIOGRAM MODELS.
The authors of reference 3 in “Further reading” describe a diagnostic for examining an estimated
covariance (or semivariogram) model (page 350 of their book). They also provide SAS code for comparing
the fitting of two semivariogram models using soil carbon and nitrogen data. Repeat the authors’ analysis, then
modify the SAS code and compare two or more fitted semivariogram models using the data from assignment 2.
FURTHER READING
1) SAS/STAT 9.1 User’s Guide.
http://support.sas.com/documentation/onlinedoc/91pdf/sasdoc_91/stat_ug_7313.pdf
2) Littell, R., G. Milliken, W. Stroup, R. Wolfinger, and O. Schabenberger. 2006. SAS for Mixed Models. Second
Edition. Cary, N.C.: SAS Press.
This book presents the theory of mixed models, discusses a wide variety of applications, and provides SAS
codes for the illustrative examples.
3) Schabenberger, O., and C. A. Gotway. 2004. Statistical Methods for Spatial Data Analysis. New York:
Chapman & Hall/CRC, 488 pp.
Chapter 6 of this book is recommended for readers who want an in‐depth discussion of the spatial regression
models. Section 6.3.6 explains how to make spatial predictions in the generalized linear mixed models.
4) Berke, O. 2004. “Exploratory Disease Mapping: Kriging the Spatial Risk Function from Regional Count
Data.” International Journal of Health Geographics 3:18. The electronic version of this article can be found
online at http://www.ij-healthgeographics.com/content/3/1/18
This paper provides fox tapeworm data and discusses methods for smoothing regional data.
5) Ruppert, D., M. P. Wand, and R. J. Carroll. 2003. Semiparametric Regression. New York: Cambridge University Press.
This book discusses penalized regression splines and mixed models. The radial smoother in the SAS procedure
glimmix is implemented based on the materials in chapter 13 of this book.
6) Gribov, A., K. Krivoruchko, and J. M. Ver Hoef. 2006. “Modeling the Semivariogram: New Approach,
Methods Comparison, and Simulation Study.” In “Stochastic Modeling and Geostatistics: Principles, Methods,
and Case Studies,” edited by T. C. Coburn, J. M. Yarus, and R. L. Chambers. 45–57. Volume II: AAPG Computer
Applications in Geology 5.
This paper discusses methods for computing empirical semivariogram and covariance and for fitting
semivariogram and covariance models implemented in Geostatistical Analyst (versions from 8.1 to 9.3).
Available at http://training.esri.com/campus/library/index.cfm
Statistical software manuals usually refer to books and papers for explaining when and why to use certain
functions. A typical problem that GIS users have with reading statistical texts is dry academic writing in the
spirit of early scientific tradition based on Euclid’s famous book Elements. Although it is difficult to imagine
actual research without various frustrations because of many false steps inherent in modeling a particular
phenomenon, the textbooks and dissertations usually present the research as an inevitable logical process in
which A follows B, C follows D, and, therefore, E is proven. Far from being obvious for readers without a good
background in mathematics, the formulas and concepts are often presented as axioms because this is the
easiest way of scientific writing. However, most GIS practitioners are not educated in pure, abstract thinking
to the same extent as ancient Greeks and modern mathematicians. As a result, statistical analysis becomes
more of an exercise in memorization and less an exercise in scientific research. As a side effect, the difference between
the quality of the data modeling presented at statistical and GIS conferences is large, and this difference is
increasing with time because newly developed complex statistical models are rarely used in the GIS
community.
According to literature on the philosophy of science, if mathematics plays only a supporting role in your
research, you follow Aristotle, who logically systematized and explained the meaning of the natural and social
phenomena based on common sense (therefore, his methods are relatively easy to understand). If
mathematics plays a fundamental role in your scientific research, you follow Plato, who believed that nature
is perfectly designed and, therefore, can be explained using dialectical logic and quantitatively described by
sophisticated mathematical models. Plato’s scientific approach has been used by the majority of researchers
in both ancient and modern times, while Aristotle’s scientific methods dominated in medieval times because
they are better suited for dogmatic theories. Being Aristotelian nowadays seems like an anachronism because
real data and their interactions are often too complex for analysis by human intellect alone. Moreover,
believing in the existence of a mathematical solution to a complex problem may help in finding the right
model.
One popular workaround to the problem of reading difficult statistical texts has been writing statistical
textbooks for GIS‐oriented students in the style of popular cookbooks (the collections of recipes for solving
typical problems). A cookbook style is problematic because it oversimplifies the process of data modeling;
it is generally not suitable for communicating concepts more complex than a traditional recipe for Greek
salad. Foods are composed of water, proteins, fats, carbohydrates, minerals, vitamins, pigments, and flavor
elements. These components react when heated or mixed with other foods. Understanding cookery in
chemical and physical terms is necessary for improving cooking skills. Below is part of a randomly selected
recipe from one of the numerous books on the world’s best cooking. It is given without cooking instructions:
2 pounds of new potatoes, boiled; 2 tablespoons of red wine vinegar; ½ teaspoon of ground black pepper; 3
boiled eggs; 1 small celery stalk; and so on.
At first glance, this would appear to be a simple recipe to reproduce at home; however, in the realm of chemistry
and physics, it is not as simple as it might seem. For example, boiling an egg can be described by a parametric
model in which the characteristics of the cooked egg vary significantly with the model parameters.
proteins in egg white coagulate when the temperature is greater than 63 degrees Celsius, while the yolk
requires 70 degrees. The yolk in a soft‐boiled egg should not be heated above 70 degrees, and the white of an
egg therefore requires a relatively long heating time. Cooking the eggs to the desired state requires
controlling the temperature inside the egg white and yolk. The time required to reach a particular
temperature at the center of the yolk can be estimated using an approximate formula of the form
$t \approx c\,D^{2}\,\ln\!\left[\,0.76\,(T_{\mathrm{water}}-T_{0})/(T_{\mathrm{water}}-T_{\mathrm{yolk}})\,\right]$,
where $T_{\mathrm{water}}$ is the temperature of the water (acceptable temperatures are between 63 and 90 degrees Celsius),
$T_{0}$ is the initial temperature of the egg, $D$ is the diameter of the egg in millimeters, $T_{\mathrm{yolk}}$ is the desired
temperature at the center of the yolk, and $c$ is a constant determined by the thermal properties of the egg. Note that
the cooking time depends on three parameters, including the egg size.
The analogy between making food using popular cookbook recipes and oversimplified statistical data analysis
is transparent: neither is reproducible with varying inputs and equipment without understanding what
makes the respective recipe and statistical model work. The qualities of cooking and modeling depend on
how deeply the researcher understands the underlying scientific principles.
Several attempts have been made to develop efficient automatic statistical interpolation of measurements
taken from monitoring networks. Information on one of the recent projects, motivated by mapping
environmental radioactivity at a European scale, can be found at http://www.intamap.org. In this
project, it was found that detection of spatial extreme values is problematic with fast and simple interpolation
models such as conventional kriging. The proposed interpolation scheme takes into account large‐scale
data variation and data stationarity, the data distribution, data anisotropy and clustering, and the presence of
extreme data values and measurement errors. It can be concluded that a natural desire for straightforward data
geoprocessing and automatic map production conflicts with the requirements of modern scientific data analysis.
Theoretically, the best solution to the automatic interpolation problem is fully Bayesian kriging with
noninformative prior distributions for all parameters. The main problem with this model is a very large
computational time even for a moderately sized dataset (several hundred measurements). Therefore,
practitioners are trying to find a reasonable compromise between the quality of predictions and
computational time.
We tried to avoid major problems found in the popular cookbooks and in Euclidean presentations of
statistical theories by cycling between conceptual explanations and focused case studies. Our goal was to
provide readers with an informed and generally accessible guide to major spatial statistical models. Some
statistical misconceptions found in the literature and commonly presented at GIS conferences are also discussed.
The selection of models and case studies in this book is based on discussions with GIS users and on the
author’s experience in developing statistical software and teaching spatial statistics. Several significant GIS
data analysis themes have not been discussed in sufficient detail, including the following:
Nonspatial statistical data analysis. Classical statistical methods are useful because geographic data
are not necessarily spatially correlated. Presenting the use of nonspatial statistical models was
relegated to a secondary role in the book because good tutorials are available on the foundations of
statistical data analysis and statistical inference, as well as on regression analysis. However, it is
difficult to find a really good book on the philosophy of science that defines the classes of random
events best described by particular probability models.
Kriging and spatial regression for very large datasets. Very large spatial datasets almost always
represent nonstationary processes with the data mean and the variance varying in space instead of
being held constant. One of the modern modeling approaches to interpolation of large nonstationary
data is projecting the original spatial process onto a subspace at a representative set of locations
(called knots by the analogy with radial smoothers discussed in appendix 4) and then modeling the
reduced spatial dataset using a flexible nonstationary covariance function, which allows multiple
scales of spatial variation. Software for modeling large nonstationary datasets was not available in
2007, when most of the materials for this book were collected.
Space‐time and 3D models. The number of GIS users with space‐time and 3D data is increasing, but
reliable statistical software for these types of data is not easy to find. This is because the
development of a kriging model with space‐time interactions is difficult, since space and time are not
directly comparable, and 3D data typically have strong anisotropy. Note that an interesting
alternative to space‐time kriging, called functional kriging, was recently proposed. It uses a series of
measurements at data locations such as daily observations of air pollution or ocean temperature
profiles. Then the model predicts the series of values at the unknown locations. Functional kriging is
a generalization of the cokriging model discussed in chapter 9. Its implementation is relatively easy,
and addition of this kriging model to the geostatistical software is expected. Note that functional
models for regional and marked point pattern data are also under development.
Extreme values modeling. The theory of extreme values deals with the stochastic behavior of the
maximums and minimums of random variables. Applications include rainfall (to answer questions
like, “What is the probability of extreme rainfall in a particular area?”); floods (to answer questions
like, “What is the probability of simultaneous flooding at two specified locations?”); and air pollution
(to answer questions like, “Given extreme pollution in one place, what is the probability of extreme
pollution at some other place?”). The mean and the extreme of a spatial process (such as the annual
mean and the annual maximum of daily measurements of air pollution) may have very different
spatial structures. As in the case of space‐time and 3D models, software for spatial modeling of
extreme values is uncommon because the theory is complex, and the calculations are very involved.
Complex hierarchical models. Although the simple hierarchical models discussed in this book may be
adequate for many GIS applications, advanced models can be more efficient. For example, in the analysis of
public health data there is a tendency to explain the relationships between disease counts and individual
exposure by taking into account known characteristics of human behavior and the geographic
distribution of people and sources of air pollution. It has been shown that the results of this type of spatial
regression analysis and of the traditional analysis of the relationship between disease rates and
ambient concentration of air pollutants are significantly different. One step in the direction of more
Many GIS practitioners ask for additional statistical functionality in ArcGIS. Geoprocessing tools based on
statistical software packages (see examples in appendix 2 and the “R Point Clustering Tools for ArcGIS”
publication at http://resources.esri.com/geoprocessing/) may help GIS users access the
required statistical models until they become available in GIS software.
After the first draft of this book was prepared in 2006, we received comments from several groups of
researchers and students. We have always appreciated these comments even though we may not have
included all suggestions during revision of the text. We would likewise be grateful for your comments on the
content of this book and ideas on themes to raise and discuss in the next edition. To contact the author, please
direct your correspondence to [email protected]. Thank you for reading this book.
GLOSSARY
additive model A statistical model that can express an effect as a weighted sum of independent variables so
that the portion of the effect contributed by one explanatory variable does not depend on the values of the
others.
adjusted R² An adjustment to the R² statistic for the number of predictors in the model.
Akaike’s information criterion (AIC) A statistic that assesses model fit. The model that has the smallest
AIC value is preferred. AIC adjusts for the number of parameters estimated in the model, so that
parsimonious models are preferred.
algorithm The steps to do a particular calculation.
alternative hypothesis A conclusion that follows when the null hypothesis is rejected.
anisotropy A property of a spatial process or data in which spatial dependence varies with both the
distance and the direction between pairs of data locations.
a priori Knowledge derived from reasoning alone (before the data analysis).
a posteriori Knowledge gained on the basis of experience; the conditional probability calculated after the
data are taken into account.
approximation Not exact but sufficiently close to be a useful model.
arc sine transformation A transformation aimed to stabilize the data variance. Usually used for
proportion data.
autocorrelation The statistical correlation between spatial random variables of the same type, where the
correlation depends on the distance and direction that separates the spatial objects.
autoregressive models Models in which the statistical properties of the past behavior of a variable are
used to predict its current value. The spatial autoregressive model uses a linear combination of neighboring
values of the variable of interest in addition to the standard components of the linear regression model.
assumptions The conditions under which statistical models and tools produce valid results.
bandwidth An argument of the kernel function that specifies the maximum distance at which the nearby
data points are used in prediction. Prediction bias increases and prediction variance decreases with
bandwidth increase.
Bayesian network An expert system based on conditional probabilities and Bayes’ theorem.
Bernoulli process A sequence of independent identically distributed Bernoulli trials.
Bernoulli random variable A Bernoulli random variable takes the value 1 with probability p and the value
0 with probability 1 − p.
Bernoulli trial An experiment whose outcome is random and can be either of two possible outcomes,
"success" or "failure."
beta distribution The distribution takes values between 0 and 1, has many shapes depending on the two
parameters, and contains the uniform distribution as a special case. Beta distribution is often used for
modeling proportions.
bias
1. Systematic failure of a sampling method to represent the underlying population.
2. The difference between the expected value of a distribution and the true value.
3. A systematic distortion of the statistical result.
4. An unjustified tendency to favor a particular point of view.
bin A classification of lags, where all lags that have similar distance and direction are put into the same bin.
binomial distribution A theoretical distribution used to model the occurrence of discrete events. The
binomial distribution depends on two parameters, n and p, where n is the total number of trials. For each
independent trial, the chance of observing the event of interest is p, and of not observing it, 1 − p.
binomial probability model A model for a random variable that counts the number of successes in a fixed
number of Bernoulli trials.
bivariate Data that are recorded as a pair of variables and considered simultaneously.
bivariate distribution The joint probability distribution of two random variables (or the joint behavior of
the random variables) that are simultaneously measured when defining the outcomes of a random
experiment.
block kriging Estimation of average values inside polygons by kriging.
bootstrap A simulation approach to statistical inference based on taking repeated samples with
replacements from sample data.
Box–Cox transformation Data transformation designed to achieve normality.
categorical variable A variable labeled according to one of several possible categories.
cell declustering A method of declustering that weighs data based on the number of data points falling
within each cell of the overlapping grid so that each cell receives the same weight.
Chebyshev’s inequality A statement about the proportion of observations that are within specified
number of standard deviations of the mean for any probability distribution.
centroid A point at the geometric center of a polygon.
change of support The support of a spatial feature is the size, shape, and orientation of the feature.
Changing the support of a variable by averaging or aggregation creates a new variable with different support
(different statistical properties).
choropleth map A thematic map displaying patterns in polygonal data by using different colors selected
according to some classification scheme.
cluster process The process consisting of a set of parent locations, each of which generates a random set of
nearby offspring locations.
cluster analysis Methods for sorting objects into groups (clusters) on the basis of their similarity (in the
case of spatial data—proximity) to each other.
coefficient of determination The square of the correlation coefficient between two variables. Estimates
the proportion of the variation in the response variable explained by the explanatory ones, providing that the
model’s assumptions are satisfied.
cokriging A statistical interpolation method that predicts values of the primary variable at the unsampled
locations using covariate data that correlate with the primary variable. Cokriging also provides standard
errors of predictions.
collinearity A general regression problem when the predictors are nearly linearly related to each other.
Collinearity may lead to unexpected regression coefficients and inflated standard errors.
complete spatial randomness A situation where an event is equally likely to occur at any location within
the study area, regardless of the locations of other events.
condition number/condition index A value that indicates the proximity of explanatory variables to the
linear dependency. A condition number in excess of 20 indicates a near linear dependency between variables.
When the condition number is greater than 10, strong dependencies may affect the estimate of regression
coefficients and their standard errors.
conditional intensity The probability of observing a point at a specified location if the point pattern
everywhere else is known.
conditional probability The probability that event A will occur given that another event B in the same
sample space has occurred, shown as Pr(A|B).
conditional simulation A simulation method that constrains each simulated surface to pass through the
given data values at sampled locations.
confounding variable A variable which may affect the variables under study and confound the
relationship between the independent and dependent variables in regression analysis. There is potentially a
large number of confounding variables.
convolution A mathematical operator that takes a weighted mean of some quantity over a narrow range of
some variable. In image processing (such as smoothing, sharpening, edge detection), an operation in which
each output pixel is a weighted sum of neighboring pixels with the weights defined by the convolution kernel.
Cook’s distance A diagnostic in regression analysis designed to show the influential observations. It
estimates the influence of the ith data value on all fitted values except the fitted value of the ith
observation.
correlation An interdependence between pairs of variables; the degree to which two datasets are related.
correlation coefficient A measure of the strength of the linear association (straight‐line relationship)
between a pair of random variables. Note that the true relationship between the variables may be highly
nonlinear.
correlation matrix A symmetrical table that shows the correlations between a set of variables. The
diagonal of the table shows the correlation of each variable with itself which is equal to 1. The values of the
correlations in the lower left‐hand triangle of the matrix are usually the mirror image of those in the upper
right‐hand triangle.
count data Data obtained by counting the number of occurrences of a particular event.
covariance The expected value of the product of the deviations of two random variables from their mean
values. The statistical tendency of two variables to vary in ways related to each other. Positive covariance
occurs when both variables tend to be above their respective means, and negative covariance occurs if one
variable tends to be above its mean and the other variable is below its mean.
covariates Variables that are used in regression modeling because they are likely to explain variation of
the response variable. Usually the same as explanatory variables.
cost distance The distance between two locations that costs the least to traverse, where cost is a function
of time, slope, or some other criteria.
cost surface The value of each cell in the cost surface represents the resistance of passing through the cell
and may be expressed in units of cost, risk, or travel time.
Cox process A point process with the intensity function defined by covariates and spatial correlation.
cross‐correlation Statistical correlation between two spatial random variables, where the correlation
depends on the distance and direction that separates the locations.
cross‐validation The procedure where one data value is removed and the rest of the data are used to
predict the removed data value. The difference between observed and predicted values is used for model
diagnostics.
data smoothing Extracting a pattern from spatial observations contaminated by noise.
declustering The techniques for correcting the data distribution estimation bias caused by clustered data.
degrees of freedom The number of independent units of information in a sample used in the estimation of
the model’s parameters.
Delaunay triangulation The triangulation of a point set as a collection of edges satisfying an "empty circle"
property: there are no points inside the circumcircle of any triangle.
density function The density function f(s) estimates the probability of observing an event at location s.
deterministic The property of being perfectly reproducible without error, usually reachable only in
computer experiments. The fixed (nonrandom) components of a model. In Geostatistical Analyst 9.3, all
interpolation methods that do not have random components are deterministic. Statistical methods may have
a deterministic component that is often called trend.
detrending The process of subtracting the trend surface (usually polynomial function of the spatial x‐ and
y‐coordinates) from the original data values. The resulting values are called residuals.
deviance A measure of matching of the model to the data when the model parameters are estimated by the
maximum likelihood techniques.
diagnostics Identification of departures from statistical model assumptions or observed data values.
Digital Elevation Model (DEM) A data model that attempts to provide a three‐dimensional representation
of Earth relief as a continuous surface.
discrete variable A variable that takes the counting number of values.
disjunctive kriging Nonlinear kriging predictor that uses a linear combination of functions of the original
data. Gaussian disjunctive kriging assumes that all data pairs come from a bivariate normal distribution.
dissimilarity Becoming less and less alike. The semivariogram is a dissimilarity function because it
increases with distance, indicating that values become less alike as they get farther apart. Thus, the higher the
semivariogram value, the more dissimilar the values.
distribution A graphic or analytical representation of the frequency of occurrence for outcomes of a
random experiment.
downscaling The process of converting spatial data from a larger scale (for example, county) to the smaller
scale (for example, ZIP codes). Same as disaggregation. In geostatistical jargon, decreasing the support of the
data.
E Abbreviation for expected data value.
edge effect Arises in point pattern analysis when the region where the pattern is observed is part of a
larger region on which the point process operates.
eigenvalue Number λ is an eigenvalue of a matrix A if there is a nonzero column vector x for which A⋅x=λ⋅x.
Eigenvalues are used to determine the stability of systems of equations.
empirical A situation when the quantity depends on the data only; that is, it is not a model. For example,
empirical semivariogram is computed on data only, in contrast to theoretical semivariogram model.
empirical Bayes In some applications, the Bayesian estimators are so complicated to compute that
simplification is required. Empirical Bayes approximates the full Bayesian estimator by estimating the
subjective prior distribution of the parameters from the data.
entropy A measure of the degree of randomness of a system. In thermodynamics, entropy is a measure of
how close a system is to equilibrium.
error The data variation that the researcher cannot fully control or measure in a particular study.
estimation The process of forming a statistic from observed data to gauge the parameters in a model or the
data distribution.
event A subset of all the possible outcomes of an experiment to which a probability is assigned.
expected value The theoretical long‐run average value of a random variable.
explanatory variables The variables on the right‐hand side of the regression equation which attempt to
explain the response variable. Usually the same as covariates and independent variables, although
explanatory variables are rarely independent of one another.
exponential family of probability distributions Includes the normal distribution, gamma distribution,
binomial distribution, and Poisson distribution as special cases.
extrapolate To estimate an unknown quantity outside the surveyed area.
exploratory spatial data analysis (ESDA) Statistical and visualization techniques that attempt to produce
a good summary of the spatial data.
F test Tests the null hypothesis that the sample variances of two normally distributed populations are
equal. In linear regression, a test of the hypothesis that a proposed regression model fits well.
factor analysis 1) Statistical approach for reducing the model complexity by grouping the observed
variables. 2) A model that assumes that the dependencies between observed variables arise from the
relationship of these variables to unobservable variables. These unobservable variables are also called
“latent” and “common factors.”
faults Structural deformation of the geological layers that leads to a disconnection between the layers.
Fourier transform The algorithm that converts digital information from the time or space domain to the
frequency domain for rapid spectral analysis. For example, Geostatistical Analyst uses the Fourier transform
for computing convolution integrals without forming convolution matrices in the Gaussian geostatistical
simulation geoprocessing tool.
Gabriel graph A subgraph of the Delaunay triangulation. Two points are connected by an edge in the
Gabriel graph in a way that no point is inside the circumcircle of any edge.
gamma distribution A two‐parameter continuous probability distribution that describes positive real
numbers. The distribution is usually positively skewed (with a long tail on the right). Chi‐squared distribution
is a special case of the gamma distribution characterized by a single parameter, the number of degrees of
freedom (in this case, the mean of the gamma distribution is equal to the degrees of freedom and the variance
is equal to the mean multiplied by 2.)
Gaussian variable A variable that follows a Gaussian (normal) distribution. Since most classical statistical
models assume Gaussian variables, transformation of the input data to a Gaussian distribution is often a model
requirement.
generalized additive model A generalized linear model in which part of the linear predictor is specified in
terms of a sum of smooth functions of predictor variables, usually splines.
generalized additive mixed model A generalized additive model that accounts for data correlation by
adding random effects to the additive predictor.
generalized linear models A collection of models extending linear regression to applications where the
expected value of the response depends on a smooth monotonic function of the linear predictor, and the error
term follows one of the distributions from the exponential family, including normal, binomial, and Poisson.
generalized linear mixed model A statistical model that extends the class of generalized linear models by
incorporating normally distributed random effects.
geographically weighted regression A series of weighted linear regression models across the spatial data
domain, each fitted separately. The spatial coordinates of the data are used to calculate weights of the
neighboring observations. The model output includes estimated spatially varying regression coefficients.
Assumptions for geographically weighted regression are rarely fulfilled and, therefore, it is considered an
exploratory spatial data analysis tool rather than a predictive model.
geostatistics Statistical methodologies that use spatial coordinates in models used for prediction of
spatially dependent continuous data at unsampled locations. Spatial dependence is defined using
semivariogram and covariance models. The first application of geostatistics was the prediction of
meteorological variables.
geostatistical conditional simulation Realization of a random function that has the same statistical
characteristics as the observed data. Simulation is preferred over predictions made by kriging when the local
data fluctuations are more important than prediction accuracy.
goodness of fit Statistic that measures the agreement between the observed data and the corresponding
predicted values from the model.
half‐normal distribution Distribution of the absolute values of a variable Z, with Z having a normal
distribution.
half‐normal probability plot A graphic tool that highlights unusually large values of a particular statistic.
hat matrix A matrix H used to obtain the predicted values of the response variable from the observed
values Z via the equation $\hat{Z} = HZ$ (it is said that the matrix H puts the “hat” on the variable Z).
histogram A bar graph where data are divided into groups. The width of the bar shows the range of values
in each group, and the height of the bar indicates how many values are in each group.
hypothesis testing The use of statistics to determine whether a given hypothesis is true.
independent events Events are independent if the occurrence of one event does not change the
probability of occurrence of any other event. Correlated variables are not independent.
indicator A transformation of observation into binary numbers depending on whether the observation is
below or above a threshold value.
indices of spatial association Spatially weighted forms of the correlation coefficient. They show similarity
between a data value (usually collected in region) and values of its nearest neighbors. The most widely used
index of spatial association is Moran’s I. Several other indices adapt the Moran’s I for analysis of non‐Gaussian
data.
indicator kriging Ordinary or simple kriging using indicators as input data. The predicted value at the
unsampled location is interpreted as the probability that the threshold value is exceeded.
inference The process of using statistical models to learn about population based on samples from that
population.
inhibition process A process that does not allow two points to lie closer than a specified inhibition
distance.
interpolate Predicting values at locations where data have not been observed, using data from locations
where data have been collected. Typically, interpolation is used for predictions within the area where data
have been collected rather than areas outside the data domain.
intercept In regression analysis, the predicted value of the response variable when the predictor variables
are equal to zeroes.
intensity function Predicts the expected number of points per unit area.
intrinsic stationarity An assumption that the data come from a random process with a constant mean and
a semivariogram that depends only on the distance and direction separating any two locations. Since this is
the most common kriging assumption, kriging users should verify it before making maps to be sure that the
predictions and prediction standard errors are relevant.
isoline A line on a surface that connects points of equal value.
isotropy A property of a natural process or data where spatial dependence changes only with the distance
between two locations (direction is unimportant).
joint distribution The same as multivariate distribution.
K function A function that summarizes spatial dependence between event locations over a range of
possible distances between points. It gives the average number of points within specified distance divided by
the average number of points per unit area.
kernel A weighting function used in estimation techniques.
kriging A statistical interpolation method that uses data with a single attribute to predict values of that
same attribute at unsampled locations. Kriging also provides standard errors of predictions.
kriging with external drift/trend Kriging with mean value defined by the explanatory variables. This
model is called linear mixed model in statistical literature.
kurtosis A measure of whether the data are peaked or flat relative to a normal distribution.
L function A normalized K function.
lag The vector that separates two locations. A lag has length and direction.
large‐scale data variation Spatial data variation can often be decomposed into components, each of
which varies over a different spatial resolution or scale. Large‐scale variation refers to variation at the coarse
scale and is usually modeled with deterministic components such as low‐order polynomials of the spatial
coordinates (also known as trend).
latent variable Unobserved variable that is believed to be related to the observed ones.
least‐squares fit A model (line, surface, or smooth function) that is fit to data by finding the parameters of
the model that minimize the sum of the squared differences between each data value and the model.
leverage An undesirable effect in which a single data point located far from the bulk of the data has a
large effect on the estimated regression parameters.
likelihood The probability of an event occurring given the values of model parameters.
linear kriging A kriging model that predicts a value obtained as a weighted average of neighboring data.
link function A function of the expected value of a linear combination of the explanatory variables. It
allows mapping from the data space to another space in which the data can be linearly represented.
linear model of coregionalization A model for multivariate spatial prediction (cokriging) formed by
taking a linear combination of covariance models. In this model, all covariances, semivariograms, and cross‐
covariances are linear combinations of the same basic structures.
linear regression model A model in which the regression coefficients and the error term enter the model in a
linear way; for example, $Z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$.
local polynomial interpolation Interpolation using polynomials with locally varying coefficients.
logistic model A model that assumes that for each possible set of values for the explanatory variables,
there is a probability p that an event occurs.
logistic regression A regression analysis where the response variable is a binary variable.
lognormal distribution A distribution where the logarithm of the variable values follows a Gaussian
probability distribution.
loss function The loss in utility or in money estimated as a function of the difference between the predicted
value and the observed or expected value. For example, kriging prediction standard error is a quadratic loss
function.
lurking variable A hidden variable that simultaneously affects both the response and predictor variables,
accounting for the correlation between them.
marginal distribution The probability distribution of a variable obtained from the multivariate
distribution by integrating over the other variables.
marked point pattern The point pattern that has values attached to the event locations.
mark correlation function A measure of the dependence between the marks of two points separated by
distance h.
Markov chain Monte Carlo Algorithms for simulating random observations from high dimensional
probability distributions.
Markov process A stochastic process in which the distribution of future states depends only on the present
state. A process in which a limited number of nearest neighbors screen the information of all further away
data.
maximum likelihood A method based on the idea that unknown model parameters should be chosen to
maximize the probability that the observed data are in accordance with the model assumptions.
mean
1. The arithmetic mean (the sum of the observations divided by the number of observations).
2. The expected value of a random variable (the population mean).
mean square error The expected value of the square of the difference between the predicted and the true
value. Note that the true value is usually not equal to the observed value.
mean stationarity A property of a spatial process where all the spatial random variables have the same
mean value.
median A value that splits the sample data into two parts of equal size.
mode The most frequently occurring observation in a set of data.
model A mathematical statement of the relationships among variables. Probability models predict the
relative frequency of different random outcomes. Semivariogram model is a function that gives the
semivariogram values for all distances and directions between points.
Monte Carlo hypothesis test A computer experimental method that uses random numbers. In Monte Carlo
testing, the test statistic value based on the observed data and the same statistic for a large number of data
simulated independently under the null hypothesis of interest are calculated. The proportion of test statistic
values based on simulated data exceeding the value of the test statistic for the observed data provides a
Monte Carlo estimate of the p‐value.
moving average A weighted average of the observation and its nearest neighbors used to recognize and
visualize local data features.
multilevel model Model for hierarchically organized data.
multiplicative model A model in which the joint effect of the explanatory variables is the product of their
separate effects.
multivariate distribution Distribution involving three or more variables at the same time.
multivariate normal distribution Theoretical distribution that generalizes the one‐ and two‐dimensional
Gaussian distributions to higher dimensions. A multivariate Gaussian distribution is defined by the mean
value and the covariance matrix.
negative binomial distribution The probability distribution of the number of failures k before the rth
success in a series of independent and identically distributed Bernoulli trials. Negative binomial distribution
has two parameters, the number of successes r and the probability of success p on each trial.
noninformative prior distribution A prior distribution that is not specific about the frequency of the
model parameter occurrence. For example, a uniform distribution.
nonlinear kriging A kriging model that predicts a value obtained as a weighted average of functions of
neighboring data values.
nonlinear regression Regression analysis in which the fitted value of the response variable is a nonlinear
function of one or more explanatory variables.
nonparametric model A statistical model that analyzes data from the population without assuming a
particular theoretical probability distribution.
normal score transformation The transformation that ranks the data from lowest to highest values and
matches these ranks to equivalent ranks from a standard normal distribution (a normal distribution with
mean 0 and standard deviation 1).
null hypothesis The hypothesis which is tested, for example, “no difference,” “no effect,” “no relationship.”
nugget A parameter of a covariance or semivariogram model that represents independent error,
measurement error, and microscale data variation. The nugget effect is seen on the graph as a discontinuity at
the origin of either the covariance or semivariogram model.
objective methods Methods of data collection and data analysis that do not depend on the opinions or
knowledge of a particular individual. Objective methods are reproducible.
odds of a success The ratio of the probability of a success to the probability of a failure.
offset A variable included in the regression model with a fixed (not estimated) regression coefficient, usually set to 1.
ordinary cokriging Multivariate spatial prediction using covariance and cross‐covariance models that rely
on spatial relationships among two or more variables. Ordinary cokriging has constraints on the cokriging
weights: the sum of the primary variable weights equals 1, and the sum of each secondary variable’s weights
equals 0. These constraints limit the influence of the secondary variables on the prediction of the primary
variable.
ordinary kriging Spatial prediction using a semivariogram or covariance model that relies on spatial
relationships among the data. Ordinary kriging assumes intrinsic stationarity and requires that the sum of
kriging weights is equal to 1.
outlier Observed values that are not consistent with the rest of the data. They may be mistakes or unusual
values that require special attention.
overdispersion The presence of greater data variability than is expected based on a given theoretical data
distribution.
overfitting Having more information than necessary for estimating model parameters.
pair correlation function A function that summarizes spatial dependence between event locations. It gives
the expected number of points at distance h from an arbitrary point, divided by the intensity of the point
pattern.
parametric distribution A probability distribution characterized by several parameters such as mean and
variance values.
partial sill A parameter of a covariance or semivariogram model that represents the variance of a spatially
correlated process without the nugget effect. In the semivariogram model, the partial sill is the difference
between the sill and the nugget.
permutation A reordering of all the objects in a set in which the order of the objects makes a difference.
point pattern A set of locations irregularly distributed within a region of space presumably generated by
some stochastic mechanism.
Poisson regression Analysis of the relationship between observed counts with a Poisson distribution
and a set of explanatory variables.
polygonal declustering A declustering method that uses Voronoi polygons to weigh the influence of each
datum when estimating the data distribution.
polynomial An expression composed of the sum of a finite number of terms, each term being the product of a
constant coefficient and one or more variables raised to a nonnegative integer power. In a spatial context, a
polynomial often has terms 1, x, x², y, y², xy, x²y, and so on, all of which are added with coefficients:
b₀ + b₁x + b₂y + …
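As a sketch, a second-order trend surface of this form can be fitted by ordinary least squares; the coordinates and values below are synthetic:

import numpy as np

rng = np.random.default_rng(0)
x, y = rng.random(50), rng.random(50)                        # synthetic coordinates
z = 2 + 3 * x - y + 0.5 * x * y + rng.normal(0, 0.1, 50)     # synthetic values

X = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])  # terms 1, x, y, x², xy, y²
coef, *_ = np.linalg.lstsq(X, z, rcond=None)                     # estimated coefficients
trend = X @ coef                                                  # fitted trend surface
residuals = z - trend                                             # detrended values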
population 1) The set of all possible outcomes of a random experiment. 2) The set of all individuals limited
by geographical location within which a statistical inference is performed.
posterior probability The revised probability calculated after the relevant evidence is taken into account.
The conditional probability that results from the application of Bayes’ theorem.
prediction The process of forming a statistic from observed data, experience, and scientific reasoning to
predict random variables at locations where data have not been collected.
prediction standard error The square root of the prediction variance, which is the variation associated
with the difference between the true and predicted value. A rule of thumb is that 95 percent of the time the
true value will be within the interval of predicted value ± 2 times the prediction standard error if data are
normally distributed.
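For example, a predicted value of 10.0 with a prediction standard error of 1.5 gives the approximate 95 percent interval 10.0 ± 2 × 1.5, that is, from 7.0 to 13.0, provided the data are approximately normally distributed.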
prior probability Beliefs about the mean and the spread of the model parameter before the data are
examined, assuming that a process that may generate the data is at least partially known.
probability A measure of how likely it is that a particular outcome will occur. In a process that can be
repeated n times, m of which lead to a particular result, the probability of the result is m/n. In statistical
literature, the phrase “the probability that event A will occur” is abbreviated as Pr(A).
probability kriging A variant of cokriging in which the primary variable is indicators and the secondary
variable is the observed data from which indicators are calculated. The resulting prediction is interpreted as
the probability that a specified threshold is exceeded.
probability map A surface that gives the probability that the variable of interest is above (or below) some
threshold value. The threshold value can be fixed or varying. An example of the latter case is a quantile map in
which the threshold value is equal to a specific quantile value, say 0.75, estimated at the predicted location.
process A repeatable sequence of activities with measurable input and output.
p‐value The conditional probability of observing a value of the test statistic at least as extreme as the one
computed from the data, given the null hypothesis, the statistical model, and the sample size.
q‐q plot A scatterplot in which the quantiles of two distributions are plotted against each other.
quantile A value that divides a collection of observations arranged in order of magnitude into two specific
parts. The pth quantile, where p is between 0 and 1, is the value that has a proportion p of the data below it.
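A short illustration with NumPy (the sample and the choice p = 0.75 are arbitrary):

import numpy as np

data = np.array([2, 5, 1, 9, 7, 3, 8, 4, 6, 10])
q75 = np.quantile(data, 0.75)    # value with roughly 75 percent of the data below it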
quantitative data Data with numerical values.
quartiles Three values that split the ordered sample values into four groups of equal size. The second
quartile is the median. The difference between the third and first quartiles is called the interquartile range.
radial basis functions A particular case of splines with knots at the observed data locations. Radial basis
functions make surfaces that pass through measured sample values with the least amount of curvature.
random coefficient model A type of the mixed model in which the regression coefficients are modeled as
random effects.
random component A component of the statistical model that defines the distribution of the errors.
random error Error that occurs due to natural variation in the measurement process. Random errors
follow some law but cannot be predicted exactly.
random field A collection of random variables indexed by spatial location; observed spatial data are treated
as a realization of a random field. The term “field” came from physics as an analogy to magnetic or
gravitational fields.
random point process A set of points with the following features: 1) the number of points is random; 2)
the points’ locations are random.
random sampling A sampling method in which each subject in the population has an equal chance of being
selected.
random variable A variable that takes different values when an experiment is repeated under the same
conditions. A random variable complements a deterministic variable whose outcome is completely
predictable, so that the word “random” refers to a lack of information. The degree of information is usually
expressed in terms of probability.
randomization Reassigning observed values among the fixed locations randomly.
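A minimal sketch of how randomization can be used for inference, here as a permutation test of the correlation between the values observed at fixed locations and a covariate (all data below are synthetic):

import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(size=30)                         # observed values at fixed locations
covariate = 0.5 * values + rng.normal(size=30)       # covariate measured at the same locations
observed_r = np.corrcoef(values, covariate)[0, 1]

# reassign the observed values among the locations at random, many times
null_r = [np.corrcoef(rng.permutation(values), covariate)[0, 1] for _ in range(999)]
p_value = (1 + np.sum(np.abs(null_r) >= abs(observed_r))) / (1 + len(null_r))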
range A parameter of a covariance or semivariogram model that represents a distance beyond which there
is little or no correlation among variables.
rate The total number of observed cases within a fixed time interval in a geographical region divided by the
total number of possible cases. For example, the number of thyroid cancer cases in children, observed during
a five‐year period after the Chernobyl accident, divided by the population of children in a particular region.
realization A collection of values or locations generated under the spatial process model, actual or
simulated.
regression A statistical method in which a variable, often called the response or dependent variable, is a
function of one or more other variables, called covariate, explanatory, or independent variables. There are
several methods for fitting the function.
relative risk The ratio of observed to expected counts (typically, disease or crime events) in the region.
resampling Production of new hypothetical samples and examination of their properties.
residuals 1) The difference between the observed value and the predicted value derived from a model. 2)
Values formed by subtracting the trend surface (in geostatistics, usually a polynomial function of the spatial x‐
and y‐coordinates) from the observed data values.
response variable The characteristic of the population under study. A variable that is expected to change
when the predictor (explanatory) variables are changed.
R² (R‐squared) A statistic that measures how successfully the regression model linearly relates the response
and predictor variables, assuming that the model assumptions are satisfied. In the case of a perfectly fitted
model, R² is equal to 1.
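For illustration, R² can be computed from the fitted values of any regression model as one minus the ratio of the residual sum of squares to the total sum of squares (the numbers below are made up):

import numpy as np

y = np.array([3.0, 5.1, 7.2, 8.8, 11.1])       # observed response
y_hat = np.array([3.1, 5.0, 7.0, 9.0, 11.0])   # values fitted by some model
ss_res = np.sum((y - y_hat) ** 2)              # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)           # total sum of squares
r_squared = 1 - ss_res / ss_tot                # equals 1 for a perfectly fitted model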
robustness The property of a statistical test or model being insensitive to small departures from the
assumptions on which they depend, such as the assumption that the data distribution is normal.
sample 1) A representative set of values chosen from all possible data (from the population). 2) The
collection of measurements that are actually observed.
sampling The process of selecting a part of a population that will be used to estimate the features of the
entire population.
sampling design The procedure by which a sample is selected from the population.
scan statistic A statistic for evaluating whether a point pattern consists of clusters.
scatterplot A graph of a pair of variables that plots the first variable along the x‐axis and the second
variable along the y‐axis, often accompanied by a line of best fit for the cloud of points. Shows the way the two
variables are related to one another.
searching neighborhood A subset of data around the prediction location. Data within the searching
neighborhood are used for prediction of the value at unsampled location.
second‐order stationarity An assumption that the data come from a random process with a constant
mean and covariance that depends only on the distance and direction separating any two locations.
semiparametric regression A variant of the generalized additive model.
semivariogram The variogram divided by two.
semivariogram model An analytical function involving several parameters that gives the semivariogram
its values for all distances and directions.
semivariogram surface The semivariogram values plotted in polar coordinates with a common center for
all pairs of sample points and retaining the angle of the line connecting pairs and the distance between them
along that line.
shape parameter A numerical parameter of the parametric model. For example, several semivariogram
models, including K‐Bessel and stable, have a shape parameter that allows additional flexibility when
modeling spatial data dependence.
short‐range variation Data variation at a scale comparable with the typical distances between pairs of the
observed data. It is usually modeled as a spatially correlated random variable. Also called small‐scale variation.
significant A research finding is statistically significant when it is unlikely to have arisen by chance alone
under the null hypothesis.
sill A parameter of a covariance or semivariogram model that represents a value that the model approaches
when distance between pairs of points is very large. At large distances, variables become uncorrelated, so the
sill of the semivariogram model is equal to the variance of the random variable.
simple cokriging Multivariate spatial prediction using covariance and cross‐covariance models that rely
on spatial relationships among two or more variables. Simple cokriging requires knowledge of the mean
values for the primary and secondary variables.
simple kriging Spatial prediction using semivariogram or covariance models that rely on spatial
relationships among the data. Simple kriging assumes the data stationarity and that the data mean is a known
constant or surface.
simulated annealing A probabilistic algorithm for the global optimization problem, which often provides a
good approximation to the global minimum of a complex function with a large number of parameters. The
algorithm is an adaptation of the Monte Carlo method for generating sample states of a thermodynamic
system. It is used in geostatistics for producing spatial data with known properties by simulating a process of
slowly cooling metal or glass to relieve internal stresses after it was formed.
simulation Realization of a random function that has the same statistical properties as the observed data.
simulation envelope A confidence interval constructed by simulating from the model that is assumed to
be true. For example, the maximum and minimum values of a function calculated from simulations can be
displayed on the same graph with a function of the observed data. If the model is good, the observed function
should be inside the envelope made by the lines constructed from the extreme simulated values.
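A schematic sketch of constructing such an envelope; the lognormal model, the quantile-based summary function, and the number of simulations are arbitrary choices for illustration:

import numpy as np

rng = np.random.default_rng(1)
observed = rng.lognormal(size=100)             # stands in for the observed data

def summary(x, qs=np.linspace(0.05, 0.95, 19)):
    # summary function compared between the data and the simulations (here, quantiles)
    return np.quantile(x, qs)

sims = np.array([summary(rng.lognormal(size=100)) for _ in range(199)])
lower, upper = sims.min(axis=0), sims.max(axis=0)          # simulation envelope
inside = (summary(observed) >= lower) & (summary(observed) <= upper)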
Simpson’s paradox The surprising possibility that something true for each subset of a population need not
be true for the population as a whole.
skewness The degree of asymmetry in the data distribution. A distribution with many small values and few
large values is positively skewed; the opposite is negatively skewed.
slope In regression analysis, the change in the response variable associated with a change of one unit in the
predictor variable.
small‐scale variation The same as short‐range variation.
smoothing The process of removing random fluctuations in the data.
spatial dependence The observation that things near to each other are more similar than things farther
apart.
spline A smooth curve or surface that passes through the set of control points. Spline functions are defined
piecewise, allowing the fitting of data by joining relatively simple curves or surfaces.
standard deviation A measure of the variability of the data about their mean value. The square root of the
variance of the data.
standard error The standard deviation of a sampling distribution, for example, of the sample mean.
standard normal distribution A normal distribution with a mean of 0 and a standard deviation of 1.
standardized value A value calculated by subtracting the sample mean and dividing by the sample
standard deviation.
stationarity Preserving statistical properties of the data after an arbitrary shift of the points in the data
domain. See mean stationarity, intrinsic stationarity, and second‐order stationarity.
statistic A numerical summary about a group of observations.
statistical inference Inferring properties of a population from the data sampled from that population.
stochastic process A process governed by the laws of probability.
stratified random sampling A sampling design in which the population is divided into several
subpopulations and then random samples are drawn from each subpopulation.
structural analysis The same as variography.
Student’s t A unimodal and symmetric probability distribution that arises in the problem of estimating the
mean of normally distributed data when the sample size is small.
studentized residual A residual divided by an estimate of its standard deviation. Often used in the
detection of the data outliers.
support In geostatistics, the largest area for which the property of interest is considered homogeneous.
“Change of support” refers to prediction of larger units from smaller ones, or vice versa.
systematic component In a regression model, nonrandom variation in the response variable represented
by the linear combination of explanatory variables.
systematic error Error that affects all measurements in the same way.
transformation Manipulation of the data to make a particular model fit better or be more easily interpreted.
trend The nonrandom part of a spatial model. The same as the large‐scale data variation.
triangulation The process of covering a cloud of points with a mesh of triangles.
truncated data Data for which sample values smaller or larger than a threshold value are not observed.
uncertainty The variability in the measured or estimated value. Uncertainty is often measured by standard
errors and confidence intervals.
uniform distribution A distribution in which all possible outcomes have an equal chance of occurring.
unimodal A distribution with only one maximum (peak).
univariate analysis The analysis of single variable without reference to other variables.
univariate distribution A function for a single variable that gives the probability that the variable will take
a given value.
universal kriging A type of kriging with a trend model. The trend model is defined as a smoothly varying
low‐order polynomial function of spatial coordinates.
upscaling Increasing the support of the research area. Same as data aggregation and data averaging.
validation The procedure where part of the data are removed and the rest of the data are used to predict
the removed part. Validation checks whether the model works well for the test subset of the data. If so, it
usually works for the entire dataset.
variance The expected value of the squared deviation from the expected value. A measure of the spread of
data in a sample or population, estimated by averaging the squared differences between the observed values
and their mean. The unit of variance is the square of the unit of the variable.
variogram A function of the distance and direction separating two locations, used to quantify spatial data
dependence. The variogram is defined as the variance of the difference between the values of the variable at
two locations. The variogram generally increases with distance and is described by several parameters, such
as nugget, sill, range, and shape.
variography The process of estimating the theoretical semivariogram model. It begins with computing the
empirical semivariogram, then binning, fitting a semivariogram model, and using diagnostics to assess the
fitted model. Also called structural analysis (because the original name of semivariogram is structure
function).
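A bare-bones sketch of the first step, the empirical semivariogram, computed from synthetic points with arbitrary lag bins (model fitting and diagnostics are not shown):

import numpy as np

rng = np.random.default_rng(2)
xy = rng.uniform(0, 1000, size=(200, 2))                    # synthetic locations
z = np.sin(xy[:, 0] / 300) + rng.normal(0, 0.2, 200)        # synthetic values

i, j = np.triu_indices(len(z), k=1)                 # all pairs of points
h = np.linalg.norm(xy[i] - xy[j], axis=1)           # separation distances
half_sq_diff = 0.5 * (z[i] - z[j]) ** 2             # semivariogram cloud

bins = np.arange(0, 800, 100)                       # lag bins of width 100 (arbitrary)
which = np.digitize(h, bins)
emp_gamma = [half_sq_diff[which == k].mean() for k in range(1, len(bins))]
lag_centers = bins[:-1] + 50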
Voronoi polygon Area‐of‐influence for the sampling point. The locus of points in the plane that are closer to
the sample point than to any other sample point. Same as Thiessen polygon.
weighted average The sum of the products of the quantities and their relative importance, or weights,
divided by the sum of the weights.
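For example, the values 2, 4, and 10 with weights 3, 2, and 1 have the weighted average (3 × 2 + 2 × 4 + 1 × 10) / (3 + 2 + 1) = 24 / 6 = 4.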
weighted least squares An estimation method that minimizes a weighted sum of squares of the
differences between the observed and predicted values. Usually used when the variance of the observed
variable is varying. For example, this method is used in Geostatistical Analyst for semivariogram modeling.
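A minimal sketch of the estimator, not of the Geostatistical Analyst implementation: with a diagonal weight matrix W, the coefficients solve (XᵀWX)b = XᵀWy. The synthetic data below assume that the weights are the inverse variances of the observations:

import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 40)
sigma = 0.2 * x                                     # standard deviation grows with x
y = 1.0 + 2.0 * x + rng.normal(0, sigma)            # synthetic observations

X = np.column_stack([np.ones_like(x), x])           # intercept and slope terms
w = 1.0 / sigma**2                                  # weights: inverse variances
beta = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))   # weighted least-squares estimates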
white noise process A process for which an average value in area A is distributed as a normal variable
with zero mean and with a variance proportional to the area of A.
z‐score A value that shows how many standard deviations a variable is from the mean.
zero‐inflated regression A regression model for data with excess zeros.
REFERENCES
Aldworth, J., and N. Cressie. 1999. “Sampling Designs and Prediction Methods for Gaussian Spatial Processes,”
in Multivariate Design and Sampling, edited by S. Ghosh, 1–54. New York: Marcel Dekker.
Aldworth, J., and N. Cressie. 2003. “Prediction of Nonlinear Spatial Functionals.” Journal of Statistical Planning
and Inference, 112:3–41.
Alexander, F. E., and P. Boyle. 1996. Methods for Investigating Localized Clustering of Disease. International
Agency for Research on Cancer, Lyon, France: IARC Scientific Publications No. 135.
Armstrong, M., and P. A. Dowd, editors. 1994. “Geostatistical Simulations.” Proceedings of the Geostatistical
Simulation Workshop. Fontainebleau, France: Kluwer Academic Publishers.
Anselin, L. 1988. Spatial Econometrics: Methods and Models. Dordrecht: Kluwer Academic Publishers.
Anselin, L. 1995. “Local Indicators of Spatial Association—LISA.” Geographic Analysis 27(2):93–115.
Arbia G. 2006. Spatial Econometrics. New York: Springer.
Armstrong, M., and G. Matheron. 1986. “Disjunctive Kriging Revisited, Part I.” Mathematical Geology 18
(8):711–728.
Armstrong, M., and G. Matheron. 1986. “Disjunctive Kriging Revisited, Part II.” Mathematical Geology 18 (8):729–
742.
Armstrong, M., A. Galli, G. Le Loc’h, F. Geffroy, and R. Eschard. 2003. Plurigaussian Simulations in Geosciences.
Berlin: Springer, 160 pp.
Assuncao, R. M., and E. A. Reis. 1999. “A New Proposal to Adjust Moran’s I for Population Density.” Statistics in
Medicine 18:2147–2162.
Baafi, E. Y., and N. A. Schofield, editors. 1997. Geostatistics Wollongong '96 (Quantitative Geology and
Geostatistics). Springer.
Baddeley A., R. Turner, J. Møller, and M. Hazelton. 2005. “Residual Analysis for Spatial Point Processes (with
discussion).” Journal of the Royal Statistical Society, Series B 67(5):617–666.
Bailey, T. C., and A. C. Gatrell. 1995. Interactive Spatial Data Analysis. Essex: Addison Wesley Longman Ltd.
Banjevic, M., and P. Switzer. 2002. "Bayesian Network Designs for Fields With Variance as a Function of the
Location." Proceedings of the 2002 JSM Conference, Section on Statistics and the Environment. New York.
Banerjee, S., B. P. Carlin, and A. E. Gelfand. 2003. Hierarchical Modeling and Analysis for Spatial Data. Boca
Raton: Chapman and Hall/CRC.
Barry, R. D., and J. M. Ver Hoef. 1996. “Blackbox Kriging: Spatial Prediction Without Specifying The
Variogram.” Journal of Agricultural, Biological, and Environmental Statistics 1:297–322.
Belsley, D. 1991. Conditioning Diagnostics: Collinearity and Weak Data in Regression. Wiley, 396 pp.
Berk, R. A. 2004. Regression Analysis: A Constructive Critique. Thousand Oaks: Sage.
Besag, J. 1972. “Nearest‐Neighbor Systems and the Auto‐Logistic Model for Binary Data.” Journal of the Royal
Statistical Society, Series B 34:75–83.
Besag, J. 1974. “Spatial Interaction and the Statistical Analysis of Lattice Systems.” Journal of the Royal
Statistical Society, Series B 36:192–225.
Besag, J. 1975. “Statistical Analysis of Non‐Lattice Data.” The Statistician 24:179–195.
Besag, J., and C. Kooperberg. 1995. “On Conditional and Intrinsic Autoregressions.” Biometrika 82:733–746.
Besag, J., and J. Newell. 1991. “The Detection of Clusters in Rare Diseases.” Journal of the Royal Statistical
Society, Series A 154:327–333.
Besag, J., J. York, and A. Mollie. 1991. “Bayesian Image Restoration, With Two Applications in Spatial Statistics
(with discussion).” Annals of the Institute of Statistical Mathematics 43(1), 1–59.
Best, N. G., K. Ickstadt, and R. L. Wolpert. 2000. “Spatial Poisson Regression for Health and Exposure Data
Measured at Disparate Resolutions.” Journal of the American Statistical Association 95, 1076–1088.
Best N., S. Richardson, and A. Thomson. 2005. “A Comparison of Bayesian Spatial Models For Disease
Mapping.” Statistical Methods in Medical Research 14:35–59.
Bierkens, M. F. P., P. A. Finke, and P. De Willigen. 2000. Upscaling and Downscaling Methods for Environmental
Research (Developments in Plant and Soil Sciences). Dordrecht: Kluwer, 190 pp.
Bivand R. S. 2006. “Implementing Spatial Data Analysis Software Tools in R.” Geographical Analysis 38:23–40.
Borchers, D. L., S. T. Buckland, I. G. Priede and S. Ahmadi. 1997. "Improving the Precision of the Daily Egg
Production Method Using Generalized Additive Models." Canadian Journal of Fisheries and Aquatic Sciences
54:2727–2742.
Box, G. E. P. 1979. “Robustness in the Strategy of Scientific Model Building.” In Robustness in Statistics, edited
by Launer and Wilkinson, 201–236. New York: Academic Press.
Box, G. E. P., and D. R. Cox. 1964. “An Analysis of Transformations.” Journal of the Royal Statistical Society,
Series B 26:211–243.
Brewer, C. A. 1999. “Color Use Guidelines for Data Representation.” Proceedings of the American Statistical
Association’s Section on Statistical Graphics 55–60.
Brownie, C., and M. L. Gumpertz. 1997. “Validity of Spatial Analyses for Large Field Trials.” Journal of
Agricultural, Biological, and Environmental Statistics 2(1):1–23
Burrough, P. 1989. “Fuzzy Mathematical Methods for Soil Survey and Land Evaluation.” Journal of Soil Science
40:477–492.
Byers, S., and A. E. Raftery. 1998. “Nearest‐Neighbor Clutter Removal for Estimating Features in Spatial Point
Processes.” Journal of the American Statistical Association 93:577–584.
Carlin, B. P., and T. A. Louis. 2000. Bayes and Empirical Bayes Methods for Data Analysis. Second Edition. Boca
Raton: Chapman and Hall/CRC.
Chiles, J. P. 1975. “How to Adapt Kriging to Non‐Classical Problems: Three Case Studies.” Advanced
Geostatistics in the Mining Industry. Proceedings of the NATO Advanced Study Institute. Rome: October:69–89.
Chiles, J. P., and P. Delfiner. 1999. Geostatistics: Modeling Spatial Uncertainty. New York: John Wiley & Sons.
Choynowski, M. 1959. “Maps Based on Probabilities.” Journal of the American Statistical Association 54:385–
388.
Clayton, D. G., and J. Kaldor. 1987. “Empirical Bayes Estimates of Age‐Standardized Relative Risks for Use in
Disease Mapping.” Biometrics 43:671–682.
Cleveland, W. S. 1979. “Robust Locally Weighted Regression and Smoothing Scatterplots.” Journal of the
American Statistical Association 74:829–836.
Cliff, A. D., and J. K. Ord. 1981. Spatial Processes: Models and Applications. London: Pion Ltd.
Comas, C., and J. Mateu. 2006. “Modelling Forest Dynamics: A Perspective From Point Process Methods.”
Biometrical Journal 48(5):1–21. Technical Report, 87–2005, Universitat Jaume I.
Congdon, P. 2006. Bayesian Statistical Modeling. Second Edition. Chichester: Wiley & Sons.
Cook, R. D., and S. Weisberg. 1999. Applied Regression Including Computing and Graphics. Wiley, 594 pp.
Cowell, R. G., A. P. Dawid, S. L. Lauritzen, and D. J. Spiegelhalter. 1999. Probabilistic Networks and Expert
Systems. Springer‐Verlag.
Cressie, N. 1985. “Fitting Variogram Models by Weighted Least Squares.” Journal of the International
Association for Mathematical Geology 17:563–586.
Cressie, N. 1990. “The Origins of Kriging.” Mathematical Geology 22:239–252.
Cressie, N. 1992. “Smoothing Regional Maps Using Empirical Bayes Predictors.” Geographical Analysis 24:75–
95.
Cressie, N.A.C. 1993. Statistics for Spatial Data. Revised Edition. New York: John Wiley & Sons.
Cressie, N. 1996. “Change of Support and the Modifiable Areal Unit Problem.” Geographical Systems 3:159–
180.
Cressie, N. 1998. “Fundamentals of Spatial Statistics.” In Collecting Spatial Data: Optimum Design of
Experiments for Random Fields, edited by W. G. Muller, 9–33. Physica‐Verlag.
Cressie, N., and G. Johannesson. 2001. “Kriging for Cut‐Offs and Other Difficult Problems.” In geoENV III—
Geostatistics for Environmental Applications, edited by P. Monestiez, D. Allard, and R. Froidevaux, 299–310.
Dordrecht: Kluwer.
Cressie, N., and M. Pavlicova. 2002. “Calibrated Spatial Moving Average Simulations.” Statistical Modelling
2:267–279.
Cressie, N., and G. Johannesson. 2006. “Fixed Rank Kriging for Large Spatial Datasets.” Technical Report No.
780 for the Department of Statistics, The Ohio State University.
Cressie, N. 2006. “Block Kriging for Lognormal Spatial Processes.” Mathematical Geology 38:413–443.
Cressie, N., C. Calder, J. Clark, J. Ver Hoef, and C. Wikle. 2009. “Accounting for Uncertainty in Ecological
Analysis: The Strengths and Limitations of Hierarchical Statistical Modeling.” Ecological Applications
19(3):553–570.
Curriero, F. C. 2005. “On the Use of Non‐Euclidean Isotropy in Geostatistics.” Johns Hopkins University,
Department of Biostatistics Working Papers. Working Paper 94.
Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Applications. Cambridge University Press.
Diamond, P. 1989. “Fuzzy Kriging.” Fuzzy Sets and Systems 33:315–332.
De Oliveira, V., B. Kedem, and D. A. Short. 1997. “Bayesian Prediction of Transformed Gaussian Random
Fields.” Journal of the American Statistical Association 92:1422–1433.
Deutsch, C. V., and A. G. Journel. 1992. GSLIB: Geostatistical Software Library and User’s Guide. New York:
Oxford University Press.
Diggle, P. J. 1990. “A Point Process Modelling Approach to Raised Incidence of a Rare Phenomenon in the
Vicinity of a Prespecified Point.” Journal of the Royal Statistical Society, Series A (Statistics in Society)
153(3):349–362.
Diggle, P. J. 2003. Statistical Analysis of Spatial Point Patterns. Second Edition. London: Arnold.
Diggle, P. J. 2000. “Overview of Statistical Methods for Disease Mapping and its Relationship to Cluster
Detection.” In Spatial Epidemiology: Methods and Applications, edited by P. Elliott, J. C. Wakefield, N. G. Best,
and D. J. Briggs. Oxford: Oxford University Press.
Diggle, P. J., S. E. Morris, and J. C. Wakefield. 2000. “Point‐Source Modeling Using Matched Case‐Control Data.”
Biostatistics 89:89–105.
Diggle, P. J., J. A. Tawn, and R. A. Moyeed. 1998. “Model‐Based Geostatistics.” Applied Statistics 47:229–350.
Diggle, P., R. Moyeed, B Rowlingson, and M. Thomson. 2002. “Childhood Malaria in the Gambia: A Case‐Study
in Model‐Based Geostatistics.” Journal of the Royal Statistical Society, Series C (Applied Statistics) 51(4):493–
506.
Dobrushin, R. L. 1968. “Description of a Random Field by Means of Conditional Probabilities and the
Conditions Governing its Regularity.” Theory of Probability and its Applications 13:197–224.
Dubrule, O. 1984. “Comparing Splines and Kriging.” Computers and Geosciences 10(2–3):327–338.
Dyn, N., and G. Wahba. 1982. “On the Estimation of Functions of Several Variables From Aggregated Data.”
SIAM Journal on Mathematical Analysis 13(1):134–152.
Ecker, M. D., and A. E. Gelfand. 1997. “Bayesian Variogram Modeling for an Isotropic Spatial Process.” Journal
of Agricultural, Biological and Environmental Statistics 2:347–369.
Ecker, M. D., and A. E. Gelfand. 1999. “Bayesian Modeling and Inference for Geometrically Anisotropic Spatial
Data.” Mathematical Geology 31(1):67–83.
Emery, X. 2005. “Variograms of Order ω: A Tool to Validate a Bivariate Distribution Model.” Mathematical
Geology 37(2):163–181.
Emery, X. 2005. “Conditional Simulation of Random Fields with Bivariate Gamma Isofactorial Distributions.”
Mathematical Geology 37(4):419–445.
Emery, X., 2006. “A Disjunctive Kriging Program for Assessing Point‐Support Conditional Distributions.”
Computers & Geosciences 32(7):965–983
Emery, X., 2007. “Simulation of Geological Domains Using the Plurigaussian Model: New Developments and
Computer Programs.” Computers & Geosciences 33(9):1189–1201.
Emery, X. 2008. “Substitution Random Fields with Gaussian and Gamma Distributions: Theory and
Application to a Pollution Data Set.” Mathematical Geosciences 40(N1):83–100.
Faraway, J. 2005. Linear Models with R. Chapman & Hall/CRC, 230 pp.
Fotheringham, A. S., C. Brunsdon, and M. Charlton. 2002. Geographically Weighted Regression. New York: John
Wiley & Sons.
Fisher, R. A. 1935. The Design of Experiments. Oliver and Boyd.
Gandin, L. S. 1959. “The Problem on Optimal Interpolation.” Trudy GGO 99:67–75. (In Russian.)
Gandin, L. S. 1960. “On Optimal Interpolation and Extrapolation of Meteorological Fields.” Trudy GGO 114: 75–
89. (In Russian.)
Gandin, L. S., and R. L. Kagan. 1962. “The Accuracy of Determining the Mean Depth of Snow Cover from
Discrete Data.” Trudy GGO 130:3–10. (In Russian.)
Gandin, L. S., 1963. “Objective Analysis of Meteorological Fields.” Gidrometeorologicheskoe Izdatel’stvo
(GIMIZ), Leningrad (translated by Israel Program for Scientific Translations, Jerusalem, 1965).
Gandin, L. S., and R. L. Kagan. 1976. “Statistical Methods for Interpreting Meteorological Data.” Leningrad:
Gidrometeoizdat, 359 pp. (In Russian)
Gelfand, A. E., L. Zhu, and B. P. Carlin. 2001. “On the Change of Support Problem for Spatio‐Temporal Data.”
Biostatistics 2(1):31–45.
Gelman, A., and P. N. Price 1999. “All Maps of Parameter Estimates Are Misleading.” Statistics in Medicine
18:3221–3234.
Gelman A., Y. Goegebeur, F. Tuerlinckx, and I. Mechelen. 2000. “Diagnostic Checks for Discrete‐Data
Regression Models Using Posterior Predictive Simulations.” Applied Statistics 49:247–268.
Goldberger, A. S. 1962. “Best Linear Unbiased Prediction in the Generalized Linear Regression Model.” Journal
of the American Statistical Association 57:369–375.
Goodchild, M. F. 1992. “Geographical Information Science.” International Journal Geographical Information
Systems 6:31–45.
Gneiting, T., Z. Sasvári, and M. Schlather. 2001. “Analogies and Correspondences Between Variograms and
Covariance Functions.” Advances in Applied Probability 33:617–630.
Gotway, C. A. 1994. “The Use of Conditional Simulation in Nuclear Waste Site Performance Assessment (with
discussion).” Technometrics 36:129–161.
Gotway, C. A., and W. W. Stroup. 1997. “A Generalized Linear Model Approach to Spatial Data Analysis and
Prediction.” Journal of Agricultural, Biological and Environmental Statistics 2:157–178.
Gotway, C. A., and R. D. Wolfinger. 2003. “Spatial Prediction of Counts and Rates.” Statistics in Medicine
22:1415–1432.
Gotway, C. A., and L. J. Young. 2002. “Combining Incompatible Spatial Data.” Journal of the American Statistical
Association 97:632–648.
Gotway, C. A., and L. J. Young. 2004. “A Geostatistical Approach to Linking Geographically Aggregated Data
from Different Sources.” University of Florida, Department of Statistics, Technical Report 2004–012.
Gotway, C. A., and L. J. Young. 2007. “A Geostatistical Approach to Linking Geographically Aggregated Data
from Different Sources.” Journal of Computational and Graphical Statistics 16(1):115–135.
Goovaerts, P. 1998. “Ordinary Cokriging Revisited.” Mathematical Geology 30(1):21–42.
Goovaerts, P. 2006. “Geostatistical Analysis of Disease Data: Accounting for Spatial Support and Population
Density in the Isopleth Mapping of Cancer Mortality Risk Using Area‐to‐Point Poisson Kriging.” International
Journal of Health Geographics 5:52
Guttorp, P. 1995. Stochastic Modeling of Scientific Data. London: Chapman and Hall/CRC.
Gribov, A., K. Krivoruchko, and J. M. Ver Hoef. 2006. “Modeling the Semivariogram: New Approach, Methods
Comparison, and Simulation Study.” In Stochastic Modeling and Geostatistics: Principles, Methods, and Case
Studies, Volume II: AAPG Computer Applications in Geology 5. Edited by T. C. Coburn, J. M. Yarus, and R. L. Chambers.
Gribov, A., and K. Krivoruchko. 2004. “Geostatistical Mapping With Continuous Moving Neighborhood.”
Mathematical Geology 36(2) February 2004.
Guarascio, M., M. David, and C. Huijbregts, editors. 1976. Advanced Geostatistics in the Mining Industry.
Proceedings of the NATO Advanced Study Institute. D. Reidel Publishing Co.
Gumpertz, M. L., J. M. Graham, and J. B. Ristaino. 1997. “Autologistic Model of Spatial Pattern of Phytophthora
Epidemic in Bell Pepper: Effects of Soil Variables on Disease Presence.” Journal of Agricultural, Biological, and
Environmental Statistics 2:131–156.
Haas, T. C. 1990. “Lognormal and Moving Window Methods of Estimating Acid Deposition.” Journal of the
American Statistical Association 85:950–963.
Handcock, M., and M. Stein. 1993. “A Bayesian Analysis of Kriging.” Technometrics 35(4):403–410.
Haining, R. 1990. Spatial Data Analysis in the Social and Environmental Sciences. Cambridge: Cambridge
University Press.
Heuvelink, G. 1998. Error Propagation in Environmental Modeling with GIS. London: Taylor & Francis Books
Ltd.
Higdon, D. 1998. “A Process‐Convolution Approach to Modeling Temperatures in the North Atlantic Ocean.”
Environmental and Ecological Statistics 5(2):173–190.
Higdon, D. M., J. Swall, and J. Kern. 1999. “Non‐Stationary Spatial Modeling.” In Bayesian Statistics 6.
Proceedings of the Sixth Valencia International Meeting, 761–768. Oxford University Press.
Hoeting J. A., R. A. Davis, A. A. Merton, and S.E. Thompson. 2006. “Model Selection for Geostatistical Models.”
Ecological Applications 16(1):87–98.
Ihaka, R., and R. Gentleman. 1996. “R: A language for Data Analysis and Graphics.” Journal of Computational
and Graphical Statistics 5:299–314.
Jeffreys, H. 1961. Theory of Probability. Oxford University Press.
Jensen F. V. 2001. Bayesian Networks and Decision Graphs. Springer.
Johnson G. A., D. A. Mortensen, and C. A. Gotway. 1996. “Spatial Analysis of Weed Seeding Population Using
Geostatistics.” Weed Science 44:704–710.
Johnston, K., J. M. Ver Hoef, K. Krivoruchko, and N. Lucas. 2001. Using ArcGIS Geostatistical Analyst. Redlands:
Esri Press.
Jolliffe, I. T., and D. B. Stephenson, editors. 2003. Forecast Verification: A Practitioner's Guide in Atmospheric
Science. Wiley and Sons.
Journel, A. G. 1983. “Nonparametric Estimation of Spatial Distributions.” Journal of the International
Association for Mathematical Geology 15:445–468.
Kagan R. L. 1997. Averaging of Meteorological Fields. Kluwer Academic Publishers. (Original Russian edition:
1979 St. Petersburg, Russia: Gidrometeoizdat)
Kaluzny, S. P., S. C. Vega, T. P. Cardoso, and A. Shelly. 1998. S+SpatialStats: User's Manual for Windows and UNIX.
Springer.
Kazakevich, D. I. 1977. Foundation of the Theory of Random Functions and Its Application in
Hydrometeorology. Leningrad: Hydrometeorological Publishing.
Kazianka, H., and J. Pilz. 2008. Spatial Interpolation Using CopulaBased Geostatistical Models. Springer.
Kelsall, J. E., and P. J. Diggle. 1995a. “Kernel Estimation of Relative Risk.” Bernoulli 1:3–16.
Kern, J. C., and D. M. Higdon. 2000. “A Distance Metric to Account for Edge Effects in Spatial Analysis.” In
Proceedings of the American Statistical Association, Biometrics Section, Alexandria, Va., 49–52.
Kelsall, J., and J. Wakefield. 2002. “Modeling Spatial Variation in Disease Risk: A Geostatistical Approach.”
Journal of the American Statistical Association 97(459):692–701.
Kolmogorov, A. N. 1933. Foundation of the Theory of Probability. Second English Edition: Chelsea, 1956.
Kolmogorov, A. N. 1941. “Interpolation and Extrapolation of Stationary Random Sequences.” Izvestiya
Akademii Nauk SSSR. Seria Matematicheskaya 5:3–14. (Translation, 1962, Memo RM‐3090‐PR, RAND Corp.,
Santa Monica).
Kondor, R. I., and J. Lafferty. 2002. “Diffusion Kernels on Graphs and Other Discrete Input Spaces.” In
Proceedings of the Nineteenth Annual International Conference on Machine Learning. San Francisco: Morgan
Kaufmann Publishers Inc.
Konishi, S., and G. Kitagawa. 2008. Information Criteria and Statistical Modeling. Springer.
Krivoruchko, K., and R. Bivand. 2009. “GIS, Users, Developers, and Spatial Statistics: On Monarchs and Their
Clothing.” In Interfacing Geostatistics and GIS, 209–228. Springer.
Krivoruchko, K., and C. A. Gotway. 2003. “Assessing the Uncertainty Resulting from Geoprocessing
Operations.” In GIS, Spatial Analysis, and Modeling, 2005. Redlands: Esri Press.
Krivoruchko, K., and A. Gribov. 2004. “Geostatistical interpolation in the presence of barriers.” In geoENV IV—
Geostatistics for Environmental Applications: Proceedings of the Fourth European Conference on Geostatistics
for Environmental Applications 2002 (Quantitative Geology and Geostatistics).
Krivoruchko, K., C. A. Gotway, and A. Zhigimont. 2003. “Statistical Tools for Regional Data Analysis Using GIS.”
In Proceedings of the Eleventh ACM International Symposium on Advances in GIS, edited by E. Hoel and P.
Rigaux, 41–48. ACM Press.
Krivoruchko, K., A. Gribov, and J. M. Ver Hoef. 2006. “A New Method for Handling the Nugget Effect in Kriging.”
In Stochastic Modeling and Geostatistics: Principles, Methods, and Case Studies, Volume II: AAPG Computer
Applications in Geology 5. Edited by T. C. Coburn, J. M. Yarus, and R. L. Chambers, 81–89.
Kulldorff, M., T. Tango, and P. J. Park. 2003. “Power Comparisons for Disease Clustering Tests.” Computational
Statistics and Data Analysis 42:665–684.
Lajaunie, C. 1991. “Local Risk Estimation for a Rare Noncontagious Disease Based on Observed Frequencies.”
Centre de Geostatistique de l’Ecole des Mines de Paris. Fontainebleau. Note N‐36/91/G.
Lantuejoul, C. 2002. Geostatistical Simulation: Models and Algorithms. Oxford: Springer‐Verlag, 256 pp.
Laplace, P. S. 1951. A Philosophical Essay on Probabilities. New York: Dover Publications Inc. (The original
French edition was published in Paris in 1814.)
Lawson, A., A. Biggere, D. Bohning, E. Lesaffre, J.‐F. Viel, and R. Bertollini, editors. 1999. Disease Mapping and
Risk Assessment for Public Health. Wiley.
Lawson, A. B. 2001. Statistical Methods in Spatial Epidemiology. Chichester: John Wiley & Sons.
Lawson, A. B., and D. G. T. Denison. 2002. Spatial Cluster Modeling. Boca Raton: Chapman and Hall/CRC.
Lawson, A. B., W. J. Browne, and C. L. V. Rodeiro. 2003. Disease Mapping with WinBUGS and MLwiN. Cornwall:
Wiley.
Le, N. D. and J. V. Zidek. 1992. “Interpolation with Uncertain Spatial Covariances: A Bayesian Alternative to
Kriging.” Journal of Multivariate Analysis 43:351–374.
Le, N. D. and J. V. Zidek. 2006. Statistical Analysis of Environmental SpaceTime Processes. Springer.
Lele, S. R., and A. Das. 2000. “Elicited Data and Incorporation of Expert Opinion for Statistical Inference in
Spatial Studies.” Mathematical Geology 32(4):465–487.
Leyland, A. H., I. H. Langford, J. Rasbash, and H. Goldstein. 2000. “Multivariate Spatial Models for Event Data.”
Statistics in Medicine 19:2469–2478.
Li, H., C. A. Calder, and N. Cressie. 2007. “Beyond Moran’s I: Testing for Spatial Dependence Based on the SAR
Model.” Geographical Analysis 39:357–375.
Lindley, D. V. 2006. Understanding Uncertainty. New Jersey: Wiley & Sons Inc.
Little, L. S., D. Edwards, and D. E. Porter. 1997. “Kriging in Estuaries: As the Crow Flies or as the Fish Swims?”
Journal of Experimental Marine Biology and Ecology 213:1–11.
Littell R. C., G. A. Milliken, W. W. Stroup, R. D. Wolfinger, and O. Schabenberber. 2006. SAS for Mixed Models.
Second Edition. SAS Publishing.
Longley, P. A, M. F. Goodchild, D. J. Maguire, and D. Rhind. 2001. Geographic Information Systems and Science,
Chichester: John Wiley and Sons.
Marcot, B. G., R. S. Holthausen, M. G. Raphael, M. Rowland, and M. Wisdom. 2001. “Using Bayesian Belief
Networks to Evaluate Fish and Wildlife Population Viability Under Land Management Alternatives from an
Environmental Impact Statement.” Forest Ecology and Management 153: 29–42.
Martinez, W. L., and A. R. Martinez. 2004. Exploratory Data Analysis with MATLAB. Chapman & Hall/CRC.
Martinez, A. R., and W. L. Martinez. 2007. Computational Statistics Handbook with MATLAB. CRC Press.
Matern, B. 1986. Spatial Variation (Lecture Notes in Statistics). Second Edition. New York: Springer‐Verlag.
Mateu, J., and P. J. Ribeiro. 1999. “Geostatistical Data Versus Point Process Data: Analysis of Second‐Order
Characteristics.” Quantitative Geology and Geostatistics 10:213–224.
Mateu, J., and M. Montenegro. 2004. “On Kernel Estimators of Second‐Order Measures for Spatial Point
Processes.” In Spatial Point Process Modelling and its Applications. Edited by A. Baddeley, P. Gregori, J. Mateu,
R. Stoica, and D. Stoyan. 155–186. Publicaciones de la Universitat Jaume I.
Mateu, J., G. Lorenzo, and E. Porcu. 2007. “Detecting Features in Spatial Point Processes with Clutter via Local
Indicators of Spatial Association.” Journal of Computational and Graphical Statistics 16(4): 968–990.
Matheron, G. 1963. “Principles of Geostatistics.” Economic Geology, 58:1246–1266.
Matheron, G. 1968. Osnovy Prikladnoi Geostatistiki (Principles of Applied Geostatistics). Moscow: Mir. (In
Russian)
Matheron, G. 1976. “A Simple Substitute for Conditional Expectation: The Disjunctive Kriging.” In Advanced
Geostatistics in the Mining Industry. Edited by M. Guarascio, M. David, and C. Huijbregts. Dordrecht: Reidel.
Matheron, G. 1989. Estimating and Choosing. Berlin: Springer‐Verlag.
McCullagh, P., and J. A. Nelder. 1989. Generalized Linear Models. Second Edition. New York: Chapman and Hall.
McMillen D. P. 2003. “Spatial Autocorrelation or Model Misspecification?” International Regional Science
Review 26:208–217.
McNeill, L. 1991. “Interpolation and Smoothing of Binomial Data for the Southern African Bird Atlas Project.”
South African Statistical Journal 25:129–136.
Millard, S. P., and N. K. Neerchal. 2000. Environmental Statistics with SPLUS. CRC Press.
Møller, J., and R. P. Waagepetersen. 2003. Statistical Inference and Simulation for Spatial Point Processes. Boca
Raton: Chapman & Hall/CRC.
Monestiez P., L. Dubroca, E. Bonnin, J. P. Durbec, and C. Guinet. 2006. “Geostatistical Modeling of Spatial
Distribution of Balenoptera Physalus in the Northwestern Mediterranean Sea from Sparse Count Data and
Heterogeneous Observation Efforts.” Ecological Modeling 193:615–628.
Moran, P. A. P. 1950. “Notes on Continuous Stochastic Phenomena.” Biometrika 37:17–23.
Moyeed, R. A., and A. Papritz. 2002. “An Empirical Comparison of Kriging Methods for Nonlinear Spatial
Prediction.” Mathematical Geology 34:365–386.
Neyman, J. 1939. “On a New Class of ‘Contagious’ Distributions, Applicable in Entomology and Bacteriology.”
Annals of Mathematical Statistics 10:35–57.
Neyman, J., and E. L. Scott. 1958. “Statistical Approach to Problems of Cosmology.” Journal of the Royal
Statistical Society, Series B 20:1–43.
Oden, N. 1995. “Adjusting Moran’s I for Population Density.” Statistics in Medicine 14:17–26.
O’Hara, R. B. 2005. “Species Richness Estimators: How Many Species can Dance on the Head of a Pin?” Journal
of Animal Ecology 74: 375–386.
Okabe, A., and I. Yamada. 2001. “The K Function Method on a Network and Its Computational
Implementation.” Geographical Analysis 33(3):271–290.
Olea, R. A., ed. 1991. Geostatistical Glossary and Multilingual Dictionary. New York: Oxford University Press.
Oliver, D.S. 1995. “Moving Averages for Gaussian Simulation in Two and Three Dimensions.” Mathematical
Geology 27(8):939–960.
Omre, H. 1984. “The Variogram and its Estimation.” In Geostatistics for Natural Resources Characterization,
Part 1. Edited by G. Verly, M. David, A. G. Journel, and A. Meréchal, 107–125. Dordrecht: Reidel.
Openshaw, S., and P. Taylor. 1981. “The Modifiable Areal Unit Problem.” In Quantitative Geography: A British
View. Edited by N. Wrigley and R. Bennett, 60–69. London: Routledge & Kegan Paul.
Pilz, J. 1994. “Robust Bayes Linear Prediction of Regionalized Variables.” In Geostatistics for the Next Century.
Edited by R. Dimitrakopoulos, 464–475. Dordrecht: Kluwer.
Pinheiro J. C., and D.M. Bates. 2000. MixedEffects Models in S and SPLUS. Springer.
Power, C., A. Simms, and R. White. 2001. “Hierarchical Fuzzy Pattern Matching for the Regional Comparison of
Land Use Maps.” International Journal of Geographical Information Science 15(1):77–100.
Press, S. J. 2003. Subjective and Objective Bayesian Statistics. Second Edition. New Jersey: Wiley & Sons.
Rabinovich, S. G. 2000. Measurement Errors and Uncertainties. New York: Springer.
Rathbun, S. L. 1998. “Kriging Estuaries.” Environmetrics, 9:109–129.
Reilly, C., and A. Gelman. 2007. “Weighted Classical Variogram Estimation for Data with Clustering.”
Technometrics 49(2):184–194.
Reimann C., P. Filzmoser, R. G. Garrett, and R. Dutter. 2008. Statistical Data Analysis Explained: Applied
Environmental Statistics with R. Chichester: Wiley.
Ripley, B. D. 1981. Spatial Statistics. New York: John Wiley & Sons.
Ripley, B. D. 1987. Stochastic Simulation. Chichester: John Wiley & Sons.
Rivoirard, J. 1994. Introduction to Disjunctive Kriging and Nonlinear Geostatistics. Oxford: Clarendon Press.
Ross, T. J., J. M. Booker, and W. J. Parkinson, editors. 2002. Fuzzy Logic and Probability Applications: Bridging
the Gap. ASA‐SIAM Series on Statistics and Applied Probability. Philadelphia: SIAM. Alexandria, Va.: ASA.
Royle, J. A., and L. M. Berliner. 1999. “A Hierarchical Approach to Multivariate Spatial Modeling and
Prediction.” Journal of Agricultural, Biological, and Environmental Statistics 4:29–56.
Ruppert, D., M. P. Wand, and R. J. Carroll. 2003. Semiparametric Regression. Cambridge University Press.
Sampson, P. D., and P. Guttorp. 1992. “Nonparametric Estimation of Nonstationary Spatial Covariance
Structure.” Journal of the American Statistical Association 87:108–119.
Schabenberger, O., and C. A. Gotway. 2004. Statistical Methods for Spatial Data Analysis. New York: Chapman &
Hall/CRC.
Schabenberger, O., and F. J. Pierce. 2002. Contemporary Statistical Models for the Plant and Soil Sciences. Boca
Raton: CRC Press.
Shapiro, A., and J. D. Botha. 1991. “Variogram Fitting with a General Class of Conditionally Nonnegative
Definite Functions.” Computational Statistics and Data Analysis, 11:87–96.
Silverman, B. W. 1986. Density Estimation for Statistics and Data Analysis. Boca Raton: Chapman and Hall/CRC.
Schlather, M. 2002. “Characterisation of Point Processes with Gaussian Marks Independent of Locations.”
Mathematische Nachrichten 239–240(1):204–214.
Schlather, M., P. Ribeiro, and P. Diggle. 2004. “Detecting Dependence Between Marks and Locations of Marked
Point Processes.” Journal of the Royal Statistical Society, Series B (Statistical Methodology) 66(1):79–93.
Soares, A. O., editor. 1993. Geostatistics Tróia ‘92: Volume 1 & 2 (Quantitative Geology and Geostatistics).
Springer.
Soros, G. 1994. The Alchemy of Finance: Reading the Mind of the Market. John Wiley.
Spiegelhalter, D. J., A. Thomas, N. G. Best, and D. Lunn. 2003. WinBUGS User Manual (Version 1.4). Cambridge:
MRC Biostatistics Unit, www.mrc‐bsu.cam.ac.uk/bugs/
Stein, M. L. 1999. Interpolation of Spatial Data: Some Theory of Kriging. New York: Springer‐Verlag.
Stevens, D. L. Jr., and A. R. Olsen. 2004. “Spatially Balanced Sampling of Natural Resources.” Journal of the
American Statistical Association 99(465):262–278.
Stoyan, D., and A. Penttinen. 2000. “Recent Applications of Point Process Methods in Forestry Statistics.”
Statistical Science 15(1):61–78.
Stoyan, D., and O. Walder. 2000. “On Variograms in Point Process Statistics, II: Models of Marking and
Ecological Interpretation.” Biometrical Journal 42:171–187.
Tango, T. 1990. “An Index for Cancer Clustering.” Environmental Health Perspectives 87:157–162.
Taylor, J. R. 1997. An Introduction to Error Analysis. Sausalito: University Science Books.
Thiebaux, H. J., and M. A. Pedder. 1987. Spatial Objective Analysis: With Applications in Atmospheric Science.
San Diego: Academic Press.
Tiefelsdorf, M. 2000. Modelling Spatial Processes: The Identification and Analysis of Spatial Relationships in
Regression Residuals by Means of Moran’s I. Berlin: Springer Verlag.
Tiefelsdorf, M., and D. A. Griffith. 2007. “Semiparametric Filtering of Spatial Autocorrelation: The Eigenvector
Approach.” Environment and Planning A 39(5):1193–1221.
Tobler, W. 1979. “Smooth Pycnophylactic Interpolation for Geographical Regions (with discussion).” Journal
of the American Statistical Association 74:519–536.
Tukey, J. W. 1977. Exploratory Data Analysis. Reading, Mass.: Addison‐Wesley.
Turnbull, B. W., E. J. Iwano, W. S. Burnett, H. L. Howe, and L. C. Clark. 1990. “Monitoring for Clusters of
Disease: Application to Leukemia Incidence in Upstate New York.” American Journal of Epidemiology
Supplement 132:S136–S143.
Upton, G. J. G., and B. Fingleton 1985. Spatial Data Analysis by Example. Volume I: Point Pattern and
Quantitative Data. New York: John Wiley & Sons.
Van Lieshout, M. N. M. 2000. Markov Point Processes and their Applications. London: Imperial College Press.
Venables W. N., and B. D. Ripley. 2002. Modern Applied Statistics with SPLUS. Fourth Edition. Springer.
Ver Hoef, J. M. 1993. “Universal Kriging for Ecological Data.” In Environmental Modeling with GIS, edited by M.
F. Goodchild, B. Parks, and L. T. Steyaert, 447–453. Oxford University Press.
Ver Hoef, J. M., and N. Cressie. 1993. “Multivariable Spatial Prediction.” Mathematical Geology 25:219–240.
Ver Hoef, J., and R. P. Barry. 1998. “Constructing and Fitting Models for Cokriging and Multivariable Spatial
Prediction.” Journal of Statistical Planning and Inference 69:275–294.
Ver Hoef, J. M. 2002. “Sampling and Geostatistics for Spatial Data.” Ecoscience 9:152–161.
Ver Hoef, J. M., N. Cressie, and R. P. Barry. 2004. “Flexible Spatial Models for Kriging and Cokriging Using
Moving Averages and the Fast Fourier Transform (FFT).” Journal of Computational and Graphical Statistics
13:265–282.
Ver Hoef, J. M., E. Peterson, and D. Theobald. 2006. “Spatial Statistical Models that Use Flow and Stream
Distance.” Environmental and Ecological Statistics 13:449–464.
Ver Hoef, J. M., and P. L. Boveng. 2007. “Over‐Dispersed Poisson Versus Negative Binomial Regression: How
Should We Model Count Data?” Ecology 88:2766–2772
Verly, G. 1983. “The Multi‐Gaussian Approach and Its Applications to the Estimation of Local Reserves.”
Journal of the International Association for Mathematical Geology 15:259–286.
Wahba, G. 1990. Spline Models for Observational Data. Philadelphia: SIAM.
Wakefield, J. 2004. “A Critique of Statistical Aspects of Ecological Studies in Spatial Epidemiology.”
Environmental and Ecological Statistics, 11:31–54.
Waller, L. A. and C. A. Gotway. 2004. Applied Spatial Statistics for Public Health Data. New York: John Wiley &
Sons.
Waller, L. A., L. Zhu, C. A. Gotway, D. M. Gorman, and P. J. Gruenewald. 2007. “Quantifying Geographic
Variations in Associations Between Alcohol Distribution and Violence: A Comparison of Geographically
Weighted Regression and Spatially Varying Coefficient Models.” Stochastic Environmental Research and Risk
Assessment 21(5):573–588.
Walter, S. D. 1992. “The Analysis of Regional Patterns in Health Data I: Distributional Considerations.”
American Journal of Epidemiology 136:730–741.
Walter, S. D. 2000. “Disease Mapping: A Historical Perspective.” In Spatial Epidemiology: Methods and
Applications, edited by P. Elliott, J. C. Wakefield, N. G. Best, and D. Briggs, 223–239. Oxford: Oxford University
Press.
Walder, O., and D. Stoyan. 1996. “On Variograms in Point Process Statistics.” Biometrical Journal 38:395–405.
Wang, F., and M. M. Wall. 2003. “Generalized Common Spatial Factor Model,” Biostatistics 4:569–582.
Wartenberg, D., and M. Greenberg. 1990. “Detecting Disease Clusters: The Importance of Statistical Power.”
American Journal of Epidemiology 132 Supplement:S156–S166.
Wartenberg, D. 1985. “Multivariate Spatial Correlation: A Method for Exploratory Geographical Analysis.”
Geographical Analysis 17:263–283.
Wiener, N. 1949. “Extrapolation, Interpolation and Smoothing of Stationary Time Series.” Cambridge: MIT
Press.
Wheeler, D., and C. A. Calder. 2006. “An Assessment of Coefficient Accuracy in Linear Regression Models with
Spatially Varying Coefficients.” Journal of Geographical Systems 9(2):145–166.
Wheeler, D., and M. Tiefelsdorf. 2005. “Multicollinearity and Correlation Among Local Regression Coefficients
in Geographically Weighted Regression.” Journal of Geographical Systems 7:161–187.
Whittle, P. 1954. “On Stationary Processes in the Plane.” Biometrika 41:434–449.
Wolpert, R. L. and K. Ickstadt. 1998. “Poisson/Gamma Random Field Models for Spatial Statistics.” Biometrika
85:251–267.
Wood, S. N. 2006. Generalized Additive Models: An Introduction With R. 391. Boca Raton: CRC.
Xia, G., M. L. Miranda, and A. E. Gelfand 2006. “Approximately Optimal Spatial Design Approaches for
Environmental Health Data.” Environmetrics 17(4):363–385.
Yaglom, A. M. 1987. Correlation Theory of Stationary and Related Random Functions I. New York: Springer‐
Verlag. Original Russian edition: Yaglom, A. M. 1952. “Introduction to the Theory of Stationary Random
Functions.” Uspekhi Matematicheskikh Nauk 7(5):3–168.
Yaglom, A. M. 1955. “Correlation theory of processes with random stationary nth increments” (in Russian).
Matematicheskii Sbornik 37:141–196. (English translation in American Mathematical Society Translations,
Series 2, 1958, 8:87–141.)
Yaglom, A. M. 1957. “Some Classes of Random Fields in N‐Dimensional Space, Related to Stationary Random
Processes.” Theory of Probability and its Applications 2:273–320.
Yates, S. R., and A. W. Warrick. 1987. “Estimating Soil Water Content Using Cokriging.” Soil Science Society of
America Journal 51:23–30.
Young, L. J., and C. A. Gotway. 2007. “Linking Spatial Data from Different Sources: The Effects of Change of
Support.” Stochastic Environmental Research and Risk Assessment 21(5):589–600.
Zimmerman, D. L. 1993. “Another Look at Anisotropy in Geostatistics.” Mathematical Geology 25:453–470.
Zimmerman, D. L., and M. B. Zimmerman. 1991. “A Comparison of Spatial Semivariogram Estimators and
Corresponding Kriging Predictors.” Technometrics 33:77–91.
CHAPTER 3 DATA SOURCES INCLUDE
\StatAnalysis\assignment3.1\Errors and Links\Data courtesy of the author.
\StatAnalysis\assignment3.1\Milk_Cs137_pop_subsample.shp\Data courtesy of International Sakharov
Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
\StatAnalysis\assignment3.2\village_cs137food_contamination.xls\Data courtesy of International Sakharov
Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
CHAPTER 4 DATA SOURCES INCLUDE
\StatAnalysis\assignment4.2\ozone_1999m.shp\Data courtesy of The California Air Resources Board,
Planning and Technical Support Division, Air Quality Data Branch.
\StatAnalysis\assignment4.3\golfcourses_aroundEsri.shp\Data courtesy of United States Board on
Geographic Names; United States Forest Service; Office of Coast Survey; Federal Communications
Commission; Federal Aviation Administration; United States Army Corps of Engineers.
\StatAnalysis\assignment4.3\03_june_average_1hour.shp\Data courtesy of The California Air Resources
Board, Planning and Technical Support Division, Air Quality Data Branch.
\StatAnalysis\assignment4.3\Redlands.shp\Data courtesy of United States Board on Geographic Names;
United States Forest Service; Office of Coast Survey; Federal Communications Commission; Federal Aviation
Administration; United States Army Corps of Engineers.
CHAPTER 5 DATA SOURCES INCLUDE
\StatAnalysis\assignment5.1\asthma_risk_data.shp\Data courtesy of the author.
\StatAnalysis\assignment5.3\pp_dens.img\Image courtesy of International Sakharov Environmental
University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
\StatAnalysis\assignment5.3\sr90.img\Image courtesy of International Sakharov Environmental University,
Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
\StatAnalysis\assignment5.3\sr90se.img\Image courtesy of International Sakharov Environmental
University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
CHAPTER 6 DATA SOURCES INCLUDE
\StatAnalysis\assignment6.3\03daily_June99.shp\Data courtesy of The California Air Resources Board,
Planning and Technical Support Division, Air Quality Data Branch.
\StatAnalysis\assignment6.3\03daily_June99_change_prediction15p.shp\Data courtesy of The California Air
Resources Board, Planning and Technical Support Division, Air Quality Data Branch.
\StatAnalysis\assignment6.3\state_border.shp\Data courtesy of Esri Data and Maps, courtesy of Esri.
\StatAnalysis\assignment6.4\Jonathan_apples_poly.shp\Data from Batchelor, L. D., and H. S. Reed.
\StatAnalysis\assignment6.4\navel_orange_plantation_poly.shp\Data from Batchelor, L. D., and H. S. Reed.
\StatAnalysis\assignment6.4\walnut_plantation_poly.shp\Data from Batchelor, L. D., and H. S. Reed.
CHAPTER 7 DATA SOURCES INCLUDE
\StatAnalysis\assignment7.1\Mountains.shp\Data courtesy of the author.
\StatAnalysis\assignment7.1\Mountains_cluster_selection.shp\Data courtesy of the author.
\StatAnalysis\assignment7.1\Mountains_inhibition_selection.shp\Data courtesy of the author.
\StatAnalysis\assignment7.1\Mountains_random_selection.shp\Data courtesy of the author.
\StatAnalysis\assignment7.2\CS137_3districts.shp\Data courtesy of International Sakharov Environmental
University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
CHAPTER 8 DATA SOURCES INCLUDE
\StatAnalysis\assignment8.2\s1.shp\Data courtesy of the author.
\StatAnalysis\assignment8.2\s2.shp\Data courtesy of the author.
\StatAnalysis\assignment8.2\s3.shp\Data courtesy of the author.
\StatAnalysis\assignment8.3\SampleData.shp\Data courtesy of the author.
CHAPTER 9 DATA SOURCES INCLUDE
\StatAnalysis\assignment9.1\WesternEuropeanContinent.shp\Data courtesy of NOAA.
\StatAnalysis\assignment9.1\World_temperature.shp\Data courtesy of NOAA.
\StatAnalysis\assignment9.5\Lake.shp\Data courtesy of Mike Price, GISP, Entrada/San Juan Inc.
\StatAnalysis\assignment9.5\silt.shp\Data courtesy of Mike Price, GISP, Entrada/San Juan Inc.
\StatAnalysis\assignment9.5\silt_plus.shp\Data courtesy of Mike Price, GISP, Entrada/San Juan Inc.
\StatAnalysis\assignment9.6\plitvice_merged_sel.shp\Data courtesy of Damir Medak, Faculty of Geodesy,
University of Zagreb.
CHAPTER 10 DATA SOURCES INCLUDE
\StatAnalysis\assignment10.1\ozone_pts.shp\Data courtesy of The California Air Resources Board, Planning
and Technical Support Division, Air Quality Data Branch.
\StatAnalysis\assignment10.1\pm10_pts.shp\Data courtesy of The California Air Resources Board, Planning
and Technical Support Division, Air Quality Data Branch.
\StatAnalysis\assignment10.1\shifted_pop_places.shp\Data courtesy of the author.
\StatAnalysis\assignment10.1\state_border.shp\Data from Esri Data and Maps, courtesy of Esri.
\StatAnalysis\assignment10.5\cs_rep.shp\Data courtesy of International Sakharov Environmental
University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
\StatAnalysis\assignment10.5\Districts.shp\Data courtesy of International Sakharov Environmental
University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
CHAPTER 11 DATA SOURCES INCLUDE
\StatAnalysis\assignment11.2\fox_data_poly.shp\Data courtesy of www.ij-healthgeographics.com.
\StatAnalysis\assignment11.2\fox_data_pts.shp\Data courtesy of www.ij-healthgeographics.com.
CHAPTER 12 DATA SOURCES INCLUDE
\StatAnalysis\assignment12.1\scottish_lip_cancer.shp\Data courtesy of the author.
CHAPTER 13 DATA SOURCES INCLUDE
\StatAnalysis\assignment13.1\anemones.shp\Data courtesy of the author.
\StatAnalysis\assignment13.2\border.shp\Data courtesy of the author.
\StatAnalysis\assignment13.2\medieval_graves_sites.shp\Data courtesy of the author.
CHAPTER 14 DATA SOURCES INCLUDE
\StatAnalysis\assignment14.2\EPA_June2002.shp\Data courtesy of The California Air Resources Board,
Planning and Technical Support Division, Air Quality Data Branch.
\StatAnalysis\assignment14.2\Data courtesy of the author.
CHAPTER 15 DATA SOURCES INCLUDE
\StatAnalysis\assignment15.3\NC_infant_mortality_data.shp\Data courtesy of National Atlas of the United
States.
CHAPTER 16 DATA SOURCES INCLUDE
\StatAnalysis\assignment16.3\Auto_theft_98.shp\Data courtesy of Redlands Police Department, Redlands,
California.
\StatAnalysis\assignment16.3\Redlands_streets.shp\Data courtesy of the author.
\StatAnalysis\assignment16.3\Robbery_98.shp\Data courtesy of Redlands Police Department, Redlands,
California.
\StatAnalysis\assignment16.4\sum02_prj.shp\Data courtesy of Dr. Dave Duffus and Laura‐Joan Feyrer,
University of Victoria, Department of Geography, Whale Research Lab.
\StatAnalysis\assignment16.4\whales_02all.shp\Data courtesy of Dr. Dave Duffus and Laura‐Joan Feyrer,
University of Victoria, Department of Geography, Whale Research Lab.
APPENDIX 1 DATA SOURCES INCLUDE
\StatAnalysis\assignmentA1.1\Districts.shp\Data courtesy of International Sakharov Environmental
University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
\StatAnalysis\assignmentA1.1\BerryCs.shp\Data courtesy of International Sakharov Environmental
University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
\StatAnalysis\assignmentA1.1\Border_Cs137.shp\Data courtesy of International Sakharov
Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
\StatAnalysis\assignmentA1.1\Chernobyl.shp\Data courtesy of International Sakharov Environmental
University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
\StatAnalysis\assignmentA1.1\five_pts.xls\Data courtesy of the author.
\StatAnalysis\assignmentA1.1\five_pts.shp\Data courtesy of the author.
\StatAnalysis\assignmentA1.1\settlements.shp\Data courtesy of the author.
\StatAnalysis\assignmentA1.2\Austria_border.shp\Data courtesy of the author.
\StatAnalysis\assignmentA1.2\heavy_metals_tutorial.shp\Data courtesy of Umweltbundesamt, GmbH.
APPENDIX 2 DATA SOURCES INCLUDE
\StatAnalysis\assignmentA2.1\four_southern_states.shp\Data courtesy of National Atlas of the United States.
\StatAnalysis\assignmentA2.1\ResidentialPropertiesSelected.shp\Data courtesy of the City of Nashua, New
Hampshire.
\StatAnalysis\assignmentA2.1\sales2006_selected.shp\Data courtesy of the City of Nashua, New Hampshire.
\StatAnalysis\assignmentA2.1\sales2006_testing.dbf\Data courtesy of the City of Nashua, New Hampshire.
\StatAnalysis\assignmentA2.1\sales2006_training.dbf\Data courtesy of the City of Nashua, New Hampshire.
\StatAnalysis\assignmentA2.1\sales2006validation.dbf\Data courtesy of the City of Nashua, New Hampshire.
APPENDIX 3 DATA SOURCES INCLUDE
\StatAnalysis\assignmentA3.1\eight_polys.shp\Data courtesy of the author.
\StatAnalysis\assignmentA3.1\sel_houses.shp\Data courtesy of the City of Nashua, New Hampshire.
\StatAnalysis\assignmentA3.1\sel_sold_houses.shp\Data courtesy of the City of Nashua, New Hampshire.
\StatAnalysis\assignmentA3.1\Districts_stat.shp\Data courtesy of International Sakharov
Environmental University, Minsk, Belarus; Institution of Radiation Safety, Belrad, Republic of Belarus.
\StatAnalysis\assignmentA3.2\catalonia_border.shp\Data courtesy of the author.
\StatAnalysis\assignmentA3.2\precipitation_m.shp\Data courtesy of Generalitat de Catalunya. Departament
de Medi Ambient i Habitatge. Direcció General de Qualitat Ambiental.
\StatAnalysis\assignmentA3.3\w_border.shp\Data courtesy of Gregg A. Johnson, University of Minnesota.
\StatAnalysis\assignmentA3.4\WeedsData.shp\Data courtesy of Gregg A. Johnson, University of Minnesota.
\StatAnalysis\assignmentA3.4\western_europe.shp\Data courtesy of the author.
APPENDIX 4 DATA SOURCES INCLUDE
\StatAnalysis\assignmentA4.1\beetle_diameter.shp\Data from Preisler, H. K., and R. G. Mitchell.
\StatAnalysis\assignmentA4.1\Districts_stat.shp\Data courtesy of the City of Nashua, New Hampshire.
\StatAnalysis\assignmentA4.2\gaus_plus_exp.shp\Data courtesy of the author.