
Tony E. Smith
SPATIAL DATA ANALYSIS
ESE 502 COURSE

Philadelphia
2014
ESE 502 COURSE DESCRIPTION
The course is designed to introduce students to modern statistical methods for analyzing spatial data.
These methods include nearest‐neighbor analyses of spatial point patterns, variogram and kriging
analyses of continuous spatial data, and autoregression analyses of areal data. The underlying statistical
theory of each method is developed and illustrated in terms of selected GIS applications. Students are
also given some experience with ARCMAP, JMP, and MATLAB software.

Instructor: Tony E. Smith


274 Towne (898‐9647)
[email protected]

© Penn Engineering, 2015:


University of Pennsylvania, School of Engineering and Applied Science

COURSE TOPICS
Spatial Point Pattern Analysis
• Nearest-Neighbor Methods
• K-Function Methods

Continuous Spatial Data Analysis


• Variogram Methods
• Kriging Methods

Regional Data Analysis


• Spatial Regression Models
• Maximum Likelihood Estimation
• Spatial Diagnostics
TENTATIVE SCHEDULE FOR SPRING 2015

Lectures Day/Date Topic Homework


INTRO Th/Jan.15 Introduction
1 Tu/Jan.20 Point Pattern Data
2 Th/Jan.22 CSR Hypothesis
3 Tu/Jan.27 Nearest-Neighbor Methods
4 Th/Jan.29 Data Applications PS1 due
5 Tu/Feb.3 K-Function Analysis
6 Th/Feb.5 Simulation Testing Methods
7 Tu/Feb.10 Bivariate K-Functions
8 Th/Feb.12 Tests of Pattern Similarity
9 Tu/Feb.17 Local K-Functions
10 Th/Feb.19 Continuous Spatial Data PS2 due
11 Tu/Feb.24 Spatial Variograms
12 Th/Feb.26 Variogram Estimation
13 Tu./Mar.3 Simple Kriging Model
14 Th/Mar.5 Kriging Predictions
Tu/Mar.10 SPRING BREAK
Th/Mar.12 SPRING BREAK
15 Tu/Mar.17 Simple Regression Model
16 Th/Mar.19 Generalized Least Squares PS3 due
17 Tu/Mar.24 Universal Kriging Model
18 Th/Mar.26 Universal Kriging Estimation
19 Tu/Mar.31 Data Applications
20 Th/Apr.2 Data Applications
21 Tu/Apr.7 Regional Spatial Data PS4 due
22 Th/Apr.9 Spatial Autocorrelation
23 Tu/Apr.14 Spatial Concentration
24 Th/Apr.16 Spatial Autoregression
25 Tu/Apr.21 Spatial Lag Model
26 Th/Apr.23 Spatial Diagnostics
27 Tu/Apr.28 Additional Regression Topics PS6 due
PS7 Mon/May 4 Last Assignment PS7 due
NOTEBOOK
ON SPATIAL DATA ANALYSIS

NOTE: To cite this material, use:

Smith, T.E., (2014) Notebook on Spatial Data Analysis [online]


http://www.seas.upenn.edu/~ese502/#notebook

INTRODUCTION
I. SPATIAL POINT PATTERN ANALYSIS
1. Examples of Point Patterns
1.1 Clustering versus Dispersion
1.2 Comparisons between Point Patterns

2. Complete Spatial Randomness


2.1 Spatial Laplace Principle
2.2 Complete Spatial Randomness
2.3 Poisson Approximation
2.4 Generalized Spatial Randomness
2.5 Spatial Stationarity

3. Testing Spatial Randomness

3.1 Quadrat Method


3.2 Nearest-Neighbor Methods
3.2.1 Nearest-Neighbor Distribution under CSR
3.2.2 Clark-Evans Test
3.3 Redwood Seedling Example
3.3.1 Analysis of Redwood Seedlings using JMPIN
3.3.2 Analysis of Redwood Seedlings using MATLAB
3.4 Bodmin Tors Example
3.5 A Direct Monte Carlo Test of CSR

4. K-Function Analysis of Point Patterns

4.1 Wolf-Pack Example


4.2 K-Function Representations
4.3 Estimation of K-Functions
4.4 Testing the CSR Hypothesis
4.5 Bodmin Tors Example
4.6 Monte Carlo Testing Procedures
4.6.1 Simulation Envelopes
4.6.2 Full P-Value Approach
4.7 Nonhomogeneous CSR Hypotheses
4.7.1 Housing Abandonment Example
4.7.2 Monte Carlo Tests of Hypotheses
4.7.3 Lung Cancer Example
4.8 Local K-Function Analysis
4.8.1 Construction of Local K-Functions
4.8.2 Local Tests of Homogeneous CSR Hypotheses
4.8.3 Local Tests of Nonhomogeneous CSR Hypotheses

5. Comparative Analyses of Point Patterns

5.1 Forest Example


5.2 Cross K-Functions
5.3 Estimation of Cross K-Functions
5.4 Spatial Independence Hypothesis
5.5 Random-Shift Approach to Spatial Independence
5.5.1 Spatial Independence Hypothesis for Random Shifts
5.5.2 Problem of Edge Effects
5.5.3 Random Shift Test
5.5.4 Application to the Forest Example
5.6 Random-Labeling Approach to Spatial Independence
5.6.1 Spatial Indistinguishability Hypothesis
5.6.2 Random Labeling Test
5.6.3 Application to the Forest Example
5.7 Analysis of Spatial Similarity
5.7.1 Spatial Similarity Test
5.7.2 Application to the Forest Example
5.8 Larynx and Lung Cancer Example
5.8.1 Overall Comparison of the Larynx and Lung Cancer Populations
5.8.2 Local Comparison in the Vicinity of the Incinerator
5.8.3 Local Cluster Analysis of Larynx Cases

6. Space-Time Point Processes

6.1 Space-Time Clustering


6.2 Space-Time K-Functions
6.3 Temporal Indistinguishability Hypothesis
6.4 Random Labeling Test
6.5 Application to the Lymphoma Example

APPENDIX TO PART I

II. CONTINUOUS SPATIAL DATA ANALYSIS


1. Overview of Spatial Stochastic Processes

1.1 Standard Notation

1.2 Basic Modeling Framework

2. Examples of Continuous Spatial Data

2.1 Rainfall in the Sudan


2.2 Spatial Concentration of PCBs

3. Spatially-Dependent Random Effects

3.1 Random Effects at a Single Location


3.1.1 Standardized Random Variables
3.1.2 Normal Distribution
3.1.3 Central Limit Theorems
3.1.4 CLT for the Sample Mean
3.2 Multi-Location Random Effects
3.2.1 Multivariate Normal Distribution
3.2.2 Linear Invariance Property
3.2.3 Multivariate Central Limit Theorem
3.3 Spatial Stationarity
3.3.1 Example: Measuring Ocean Depths
3.3.2 Covariance Stationarity
3.3.3 Covariograms and Correlograms

4. Variograms

4.1 Expected Squared Differences


4.2 The Standard Model of Spatial Dependence
4.3 Non-Standard Spatial Dependence
4.4 Pure Spatial Dependence
4.5 The Combined Model
4.6 Explicit Models of Variograms
4.6.1 The Spherical Model
4.6.2 The Exponential Model
4.6.3 The Wave Model
4.7 Fitting Variogram Models to Data
4.7.1 Empirical Variograms
4.7.2 Least-Squares Fitting Procedure
4.8 The Constant-Mean Model
4.9 Example: Nickel Deposits on Vancouver Island
4.9.1 Empirical Variogram Estimation
4.9.2 Fitting a Spherical Variogram
4.10 Variograms versus Covariograms
4.10.1 Biasedness of the Standard Covariance Estimator
4.10.2 Unbiasedness of Empirical Variogram for Exact-Distance Samples
4.10.3 Approximate Unbiasedness of General Empirical Variograms

5. Spatial Interpolation Models

5.1 A Simple Example of Spatial Interpolation


5.2 Kernel Smoothing Models
5.3 Local Polynomial Models
5.4 Radial Basis Function Models
5.5 Spline Models
5.6 A Comparison of Models using the Nickel Data

6. Simple Spatial Prediction Models

6.1 An Overview of Kriging Models


6.1.1 Best Linear Unbiased Predictors
6.1.2 Model Comparisons
6.2 The Simple Kriging Model
6.2.1 Simple Kriging with One Predictor
6.2.2 Simple Kriging with Many Predictors
6.2.3 Interpretation of Prediction Weights
6.2.4 Construction of Prediction Intervals
6.2.5 Implementation of Simple Kriging Models
6.2.6 An Example of Simple Kriging
6.3 The Ordinary Kriging Model
6.3.1 Best Linear Unbiased Estimation of the Mean
6.3.2 Best Linear Unbiased Predictor of Y
6.3.3 Implementation of Ordinary Kriging
6.3.4 An Example of Ordinary Kriging
6.4 Selection of Prediction Sets by Cross Validation
6.4.1 Log-Nickel Example
6.4.2 A Simulated Example

7. General Spatial Prediction Models

7.1 The General Linear Regression Models


7.1.1 Generalized Least Squares Estimation
7.1.2 Best Linear Unbiasedness Property
7.1.3 Regression Consequences of Spatially Dependent Random Effects
7.2 The Universal Kriging Model
7.2.1 Best Linear Unbiased Prediction
7.2.2 Standard Error of Predictions
7.2.3 Implementation of Universal Kriging
7.3 Geostatistical Regression and Kriging
7.3.1 Iterative Estimation Procedure
7.3.2 Implementation of Geo-Regression
7.3.3 Implementation of Geo-Kriging
7.3.4 Cobalt Example of Geo-Regression
7.3.5 Venice Example of Geo-Regression and Geo-Kriging

APPENDIX TO PART II

A2.1. Covariograms for Sums of Independent Spatial Processes


A2.2. Expectation of the Sample Estimator under Sample Dependence
A2.3. A Bound on the Binning Bias of Empirical Variogram Estimators
A2.4. Some Basic Vector Geometry
A2.5. Differentiation of Functions
A2.6. Gradient Vectors
A2.7. Unconstrained Optimization of Smooth Functions
7.1 First-Order Conditions
7.2 Second-Order Conditions
7.3 Application to Ordinary Least Squares Estimation
A2.8. Constrained Optimization of Smooth Functions
8.1 Minimization with a Single Constraint
8.2 Minimization with Multiple Constraints
8.3 Solution for Universal Kriging

III. AREAL DATA ANALYSIS

1. Overview of Areal Data Analysis

1.1 Extensive versus Intensive Data Representations


1.2 Spatial Pattern Analysis
1.3 Spatial Regression Analysis

2. Modeling the Spatial Structure of Areal Units

2.1 Spatial Weights Matrices


2.1.1 Point Representations of Areal Units
2.1.2 Spatial Weights based on Centroid Distances
2.1.3 Spatial Weights based on Boundaries
2.1.4 Combined Distance-Boundary Weights
2.1.5 Normalizations of Spatial Weights
2.2 Construction of Spatial Weights Matrices
2.2.1 Construction of Spatial Weights based on Centroid Distances
2.2.2 Construction of Spatial Weights based on Boundaries

3. The Spatial Autoregressive Model

3.1 Relation to Time Series Analysis


3.2 The Simultaneity Property of Spatial Dependencies
3.3 A Spatial Interpretation of Autoregressive Residuals
3.3.1 Eigenvalues and Eigenvectors of Spatial Weights Matrices
3.3.2 Convergence Conditions in Terms of Rho
3.3.3 A Steady-State Interpretation of Spatial Autoregressive Residuals

4. Testing for Spatial Autocorrelation

4.1 Three Test Statistics


4.1.1 Rho Statistic
4.1.2 Correlation Statistic
4.1.3 Moran Statistic
4.1.4 Comparison of Statistics
4.2 Asymptotic Moran Tests of Spatial Autocorrelation
4.2.1 Asymptotic Moran Test for Regression Residuals
4.2.2 Asymptotic Moran Test in ARCMAP
4.3 Random Permutation Test of Spatial Autocorrelation
4.3.1 SAC-Perm Test
4.3.2 Application to English Mortality Data

5. Tests of Spatial Concentration

5.1 A Probabilistic Interpretation of G*


5.2 Global Tests of Spatial Concentration
5.3 Local Tests of Spatial Concentration
5.3.1 Random Permutation Test
5.3.2 English Mortality Example
5.3.3 Asymptotic G* Test in ARCMAP
5.3.4 Advantage of G* over G for Analyzing Spatial Concentration

6. Spatial Regression Models for Areal Data Analysis

6.1 The Spatial Errors Model (SEM)


6.2 The Spatial Lag Model (SLM)
6.2.1 Simultaneity Structure
6.2.2 Interpretation of Beta Coefficients
6.3 Other Spatial Regression Models
6.3.1 The Combined Model
6.3.2 The Durbin Model
6.3.3 The Conditional Autoregressive (CAR) Model

7. Spatial Regression Parameter Estimation

7.1 The Method of Maximum-Likelihood Estimation


7.2 Maximum-Likelihood Estimation for General Linear Regression Models
7.2.1 Maximum-Likelihood Estimation for OLS
7.2.2 Maximum-Likelihood Estimation for GLS
7.3 Maximum-Likelihood Estimation for SEM
7.4 Maximum-Likelihood Estimation for SLM
7.5 An Application to the Irish Blood Group Data
7.5.1 OLS Residual Analysis and Choice of Spatial Weights Matrices
7.5.2 Spatial Regression Analyses

8. Parameter Significance Tests for Spatial Regression

8.1 A Basic Example of Maximum Likelihood Estimation and Inference


8.1.1 Sampling Distribution by Elementary Methods
8.1.2 Sampling Distribution by General Maximum-Likelihood Methods
8.2 Sampling Distributions for General Linear Models with Known Covariance
8.2.1 Sampling Distribution by Elementary Methods
8.2.2 Sampling Distribution by General Maximum-Likelihood Methods
8.3 Asymptotic Sampling Distributions for the General Case
8.4 Parameter Significance Tests for SEM
8.4.1 Parametric Tests for SEM
8.4.2 Application to the Irish Blood Group Data
8.5 Parameter Significance Tests for SLM
8.5.1 Parametric Tests for SLM
8.5.2 Application to the Irish Blood Group Data

9. Goodness-of-Fit Measures for Spatial Regression

9.1 The R-Squared Measure for OLS


9.1.1 The Regression Dual
9.1.2 Decomposition of Total Variation
9.1.3 Adjusted R-Squared
9.2 Extended R-Squared Measures for GLS
9.2.1 Extended R-Squared for SEM
9.2.2 Extended R-Squared for SLM
9.3 The Squared Correlation Measure for GLS Models
9.3.1 Squared Correlation for OLS
9.3.2 Squared Correlation for SEM and SLM
9.3.3 A Geometric View of Squared Correlation

10. Comparative Tests among Spatial Regression Models

10.1 A One-Parameter Example


10.2 Likelihood-Ratio Tests against OLS
10.3 The Common-Factor Hypothesis
10.4 The Combined-Model Approach

APPENDIX TO PART III

A3.1. The Geometry of Linear Transformations


3.1.1 Nonsingular Transformations and Inverses
3.1.2 Orthonormal Transformations
A3.2. Singular Value Decomposition Theorem
3.2.1 Inverses and Pseudoinverses
3.2.2 Determinants and Volumes
3.2.3 Linear Transformations of Random Vectors
A3.3. Eigenvalues and Eigenvectors
A3.4. Spectral Decomposition Theorem
3.4.1 Eigenvalues and Eigenvectors of Symmetric Matrices
3.4.2 Some Consequences of SVD for Symmetric Matrices
3.4.3 Spectral Decomposition of Symmetric Positive Semidefinite Matrices
3.4.4 Spectral Decompositions with Distinct Eigenvalues
3.4.5 General Spectral Decomposition Theorem

INTRODUCTION

In this NOTEBOOK we develop the elements of spatial data analysis. The analytical
methods are divided into three parts: Part I. Point Pattern Analysis, Part II. Continuous
Spatial Data Analysis, and Part III. Regional Data Analysis. This classification of spatial
data types essentially follows the course text by Bailey and Gatrell (1995)1, hereafter
referred to as [BG]. It should be noted that many of the examples and methods used in
these notes are drawn from [BG]. Additional materials are drawn from Cressie (1993)
and Anselin (1988).

This course is designed to introduce both the theory and practice of spatial data analysis.
The practice of spatial data analysis depends heavily on software applications. Here we
shall use ARCMAP for displaying and manipulating spatial data, and shall use both
JMPIN and MATLAB for statistical analyses of this data. Hence, while these notes
concentrate on the statistical theory of spatial data analysis, they also develop a number
of explicit applications using this software. Brief introductions to each of these software
packages are given in Part IV of this NOTEBOOK, along with numerous tips on useful
procedures.

These notes will make constant reference to files and programs that are available in the
Class Directory, which can be opened in the Lab with the menu sequence:

File → Open → courses…(F:)\sys502\

The relevant files are organized into three subdirectories: arcview, jmpin, and matlab.
These are the three software packages used in the course. The files in each subdirectory
are formatted as inputs to the corresponding software package. Instructions for opening
and using each of these packages can be found in the Software portion of this
NOTEBOOK.

To facilitate references to other parts of the NOTEBOOK, the following conventions are
used. A reference to expression (3.4.7) means expression (7) in Section 3.4 of the same
part of the NOTEBOOK. If a reference is made to an expression in another part of the
NOTEBOOK, say Part II, then this reference is preceded by the part number, in this case,
expression (II.3.4.7). Similar references are made to figures by replacing expression
numbers in parentheses with figure numbers in brackets. For example, a reference to
figure II.3.4 means Figure 4 in Section 3 of Part II.

1
All references are listed in the Reference section at the end of this NOTEBOOK.
SPATIAL POINT PATTERN ANALYSIS

1. Examples of Point Patterns

We begin by considering a range of point pattern examples that highlight the types of
statistical analyses to be developed. These examples can be found in ARCMAP map
documents that will be discussed later.

1.1 Clustering versus Dispersion

Consider the following two point patterns below. The first represents the locations of redwood seedlings in a section of forest.1 This pattern of points obviously looks too clustered to have occurred by chance. The second represents the locations of cell centers on a microscope slide.2 While this pattern may look more random than the redwood seedlings, it is actually much too dispersed to have occurred by chance.3 This can be seen a bit more clearly by including the cell walls, shown schematically in Figure 1.3 to the right. This additional information shows that indeed there is a natural spacing between these cells, much like the individual cells of a beehive.

[Fig.1.1. Redwood Seedlings (scale: 0 to 10 feet)    Fig.1.2. Cell Centers    Fig.1.3. Cell Walls]

[The cell walls were actually constructed schematically in ARCMAP by using the "Voronoi Map" option in the Geostatistical Analyst extension of ARCMAP. But this process is a reasonable depiction of the actual cell-packing process].
1
This data first appeared in Strauss (1975), and is the lower left-hand corner of his Figure 1 (which
contains 199 redwood seedlings).
2
This data first appeared in Ripley (1977), where it relates to an interesting biological problem regarding
the process of cell division, posed by Dr. Francis Crick (of “Crick and Watson” fame).
3
The term “dispersion” is sometimes called “uniformity” in the literature. Here we choose the former.

So the key question to be addressed here is how we can distinguish these patterns statistically in a manner that will allow us to conclude that the first is "clustered" and the second is "dispersed" – without knowing anything else about these patterns.

The approach adopted here is to begin by developing a statistical model of purely random
point patterns, and then attempt to test each of these patterns against that statistical
model. In this way, we will be able to conclude that the first is “significantly more
clustered than random” and the second is “significantly more dispersed than random”.

1.2 Comparisons between Point Patterns

Figures 1.4 and 1.5 below show the locations of abandoned houses in central Philadelphia
for the year 2000.4 The first shows those abandonments for which the owner’s residence
is off site, and the second shows properties for which the owner's residence is on site.

[Fig.1.4. Off-Site Owners        Fig.1.5. On-Site Owners]

If off-site ownership
tends to reflect abandoned rental properties, while on-site ownership reflects abandoned
residences, then one might hypothesize that different types of decisions were involved:
abandoning a rental property might be more directly an economic decision than
abandoning one’s home. However, these patterns look strikingly similar. So one may ask
whether there are any statistically significant differences between them.

Notice that there appears to be significant clustering in each pattern. But here it is
important to emphasize that one can only make this judgment by comparing these
4
This data was obtained from the Neighborhood Information System data base maintained by the
Cartographic Modeling Lab here on campus, http://www.cml.upenn.edu/. For further discussion of this data
see Hillier, Culhane, Smith and Tomlin (2003).


patterns with the pattern of all housing in this area. For example, there are surely very
few houses in Fairmount Park, while there are many houses in other areas. So here it is
important to treat the pattern of overall housing as the relevant reference pattern or
“backcloth” against which to evaluate the significance of any apparent clusters of
abandoned houses.

A second comparison of point patterns is given by an example from [BG] (p.80, 129-132). This example involves a study of lung and larynx cancer cases in Lancashire county, England during the period 1974-1983.5 The specific data set is from the south-central area of Lancashire county, shown by the red area in Figure 1.6. An enlargement of this region is shown in Figure 1.7 below, where the population of blue dots are lung cancers during that period, and the smaller population of red dots are larynx cancers. Here the smaller areal subdivisions shown are parishes [also called civil parishes (cp)] and correspond roughly in scale to our census tracts.

[Fig.1.6. Lancashire County]

Here again it should be clear that clustering of such cancer cases is only meaningful
relative to the distribution of population in this area. The population densities in each
parish are shown in Figure 1.8 below.

[Fig.1.7. Larynx and Lung Cases        Fig.1.8. Population Backcloth]

5
This data first appeared in the paper by Diggle, Gatrell and Lovett (1990) which is included as Paper 12
“Larynx Cancer” in the Reference Materials on the class web page.

An examination of these population densities reveals that the clustering of cases in some
of the lower central parishes is now much less surprising. But certain other clusters do
not appear to be so easily explained. For example the central cluster in the far south
appears to be in an area of relatively sparse population. This cluster was in fact the center
of interest in this particular study. An enlargement of this southern portion in Figure 1.9
below indicates that a large incinerator6 is located just upwind of this cluster of cases.7

[Fig.1.9. Incinerator Location]

Moreover, an examination of the composition of this cluster suggests that there are
significantly more larynx cases present than one would expect, given the total distribution
of cases shown in Figures 1.7 and 1.8 above. This appears to be consistent with the fact
that large airborne particles such as incinerator ash are more likely to lodge in the larynx
rather than the lungs. So there is some suspicion that this incinerator may be a significant
factor contributing to the presence of this particular clustering of cases.

To analyze this question statistically, one may ask how likely it is that this could simply
be a coincidence. Here one must model the likelihood of such local clustering patterns.

6
According to Diggle, Gatrell and Lovett (1990), this incinerator burned industrial wastes, and was active
during the period from 1972-1980.
7
Prevailing winds are from the Atlantic ocean to the west, as seen in Figure 1.6 above.

2. Models of Spatial Randomness

As with most statistical analyses, cluster analysis of point patterns begins by asking:
What would point patterns look like if points were randomly distributed? This requires a
statistical model of randomly located points.

2.1 Spatial Laplace Principle

To develop such a model, we begin by considering a square region, S , on the plane and
divide it in half, as shown on the left in Figure 2.1 below:

[Figure: square region S divided into halves (probability 1/2 each) and into quarters (probability 1/4 each), with a subcell C]

Fig. 2.1. Spatial Laplace Principle

The Laplace Principle of probability theory asserts that if there is no information to indicate that either of two events is more likely, then they should be treated as equally likely, i.e., as having the same probability of occurring.1 Hence by applying this principle to the case of a randomly located point in square, S, there is no reason to believe that this point is more likely to appear in either the left half or the (identical) right half. So these two (mutually exclusive and collectively exhaustive) events should have the same probability, 1/2, as shown in the figure. But if these halves are in turn divided into equal quarters, then the same argument shows that each of these four "occupancy" events should have probability 1/4. If we continue in this way, then the square can be divided into a large number of n grid cells, each with the same probability, 1/n, of containing the point. Now for any subregion (or cell), $C \subseteq S$, the probability that C will contain this point is at least as large as the sum of probabilities of all grid cells inside C, and similarly is no greater than the sum of probabilities of all cells that intersect C. Hence by allowing n to become arbitrarily large, it is evident that these two sums will converge to the same limit – namely the fractional area of S inside C. Hence the probability, $\Pr(C \mid S)$, that a random point in S lies in any cell $C \subseteq S$ is proportional to the area of C.2

(2.1.1)    $\Pr(C \mid S) = \dfrac{a(C)}{a(S)}$

Finally, since this must hold for any pair of nested regions $C \subseteq R \subseteq S$, it follows that3

1
This is also known as Laplace’s “Principle of Insufficient Reason”.
2
This argument in fact simply repeats the construction of area itself in terms of Riemann sums [as for
example in Bartle (1975, section 24)].
3
Expression (2.1.2) refers to equation (2) in section 2.1. This convention will be followed throughout.


(2.1.2)    $\Pr(C \mid S) = \Pr(C \mid R)\,\Pr(R \mid S) \;\Longrightarrow\; \Pr(C \mid R) = \dfrac{\Pr(C \mid S)}{\Pr(R \mid S)} = \dfrac{a(C)/a(S)}{a(R)/a(S)} = \dfrac{a(C)}{a(R)}$

and hence that the square in Figure 2.1 can be replaced by any bounded region, R , in the
plane. This fundamental proportionality result, which we designate as the Spatial Laplace
Principle, forms the basis for almost all models of spatial randomness.

In probability terms, this principle induces a uniform probability distribution on R, describing the location of a single random point. With respect to any given cell, $C \subseteq R$, it is convenient to characterize this event as a Bernoulli (binary) random variable, X(C), where X(C) = 1 if the point is located in C and X(C) = 0 otherwise. In these terms, it follows from (2.1.2) that the conditional probability of this event (given that the point is located in R) must be

(2.1.3)    $\Pr[X(C) = 1 \mid R] = a(C)/a(R)$ ,

so that $\Pr[X(C) = 0 \mid R] = 1 - \Pr[X(C) = 1 \mid R] = 1 - [a(C)/a(R)]$ .
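Before moving on, it may help to see (2.1.3) numerically. The following minimal MATLAB sketch simulates a large number of independently located random points in the unit square and checks that the fraction falling in a given cell approaches a(C)/a(R); the particular cell used here is a hypothetical illustration, not part of the text.

```matlab
% Monte Carlo check of the Spatial Laplace Principle (2.1.3) on the unit square R = [0,1]x[0,1].
% The cell C = [0.2,0.5] x [0.1,0.4] is a hypothetical example, so a(C)/a(R) = 0.3*0.3 = 0.09.
nsim = 100000;                         % number of independently located random points
pts  = rand(nsim,2);                   % uniform (CSR) locations in R
inC  = pts(:,1) >= 0.2 & pts(:,1) <= 0.5 & ...
       pts(:,2) >= 0.1 & pts(:,2) <= 0.4;
fprintf('Empirical Pr[X(C)=1|R] = %.4f (theory: %.4f)\n', mean(inC), 0.3*0.3);
```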

2.2 Complete Spatial Randomness

In this context, suppose now that n points are each located randomly in region R . Then
the second key assumption of spatial randomness is that the locations of these points have
no influence on one another. Hence if for each i = 1, …, n, the Bernoulli variable, $X_i(C)$, now denotes the event that point i is located in region C, then under spatial randomness the random variables $\{X_i(C) : i = 1, \ldots, n\}$ are assumed to be statistically independent for each region C. This together with the Spatial Laplace Principle above defines the fundamental hypothesis of complete spatial randomness (CSR), which we shall usually refer to as the CSR Hypothesis.

Observe next that in terms of the individual variables, $X_i(C)$, the total number of points appearing in C, designated as the cell count, N(C), for C, must be given by the random sum

(2.2.1)    $N(C) = \sum_{i=1}^{n} X_i(C)$

[It is this additive representation of cell counts that in fact motivates the Bernoulli (0-1)
characterization of location events above.] Note in particular that since the expected


value of a Bernoulli random variable, X, is simply P(X = 1),4 it follows (from the linearity of expectations) that the expected number of points in C must be

(2.2.2)    $E[N(C) \mid n, R] = \sum_{i=1}^{n} E[X_i(C) \mid R] = \sum_{i=1}^{n} \Pr[X_i(C) = 1 \mid R] = \sum_{i=1}^{n} \frac{a(C)}{a(R)} = n\,\frac{a(C)}{a(R)} = \left(\frac{n}{a(R)}\right) a(C)$

Finally, it follows from expression (2.1.3) that under the CSR Hypothesis, the sum of independent Bernoulli variables in (2.2.1) is by definition a Binomial random variable with distribution given by

(2.2.3)    $\Pr[N(C) = k \mid n, R] = \dfrac{n!}{k!\,(n-k)!} \left(\dfrac{a(C)}{a(R)}\right)^{k} \left(1 - \dfrac{a(C)}{a(R)}\right)^{n-k}, \quad k = 0, 1, \ldots, n$

For most practical purposes, this conditional cell-count distribution for the number of points in cell, $C \subseteq R$ (given that n points are randomly located in R), constitutes the basic probability model for the CSR Hypothesis.
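As a quick illustration of (2.2.3), the MATLAB sketch below repeatedly scatters n random points in the unit square and compares the empirical frequency of the event N(C) = k with the corresponding Binomial probability; the cell C, the value n = 20, and the value k = 5 are illustrative choices only.

```matlab
% Empirical check of the Binomial cell-count distribution (2.2.3).
% R is the unit square; C = [0,0.5]x[0,0.5] is an illustrative cell, so a(C)/a(R) = 0.25.
n = 20; p = 0.25; nsim = 50000; k = 5;
counts = zeros(nsim,1);
for s = 1:nsim
    pts = rand(n,2);                                     % n randomly located points in R
    counts(s) = sum(pts(:,1) <= 0.5 & pts(:,2) <= 0.5);  % cell count N(C)
end
binom = nchoosek(n,k) * p^k * (1-p)^(n-k);               % Binomial probability (2.2.3)
fprintf('Pr[N(C)=%d]: empirical %.4f, Binomial %.4f\n', k, mean(counts==k), binom);
```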

2.3 Poisson Approximation

However, when the reference region R is large, the exact specification of this region and
the total number of points n it contains will often be of little interest. In such cases it is
convenient to remove these conditioning effects by applying the well-known Poisson
approximation to the Binomial distribution. To motivate this fundamental approximation
in the present setting, imagine that you are standing in a large tiled plaza when it starts to
rain. Now consider the number of rain drops landing on the tile in front of you during the
first ten seconds of rainfall. Here it is evident that this number should not depend on
either the size of the plaza itself or the total number of raindrops hitting the plaza. Rather,
it should depend on the intensity of the rainfall – which should be the same everywhere.
This can be modeled in a natural way by allowing both the reference region (plaza), R ,
and the total number of points (raindrops landing in the plaza), n , to become large in
such a way that the expected density of points (intensity of rainfall) in each unit area
remains the same. In our present case, this expected density is given by (2.1.2) as

(2.3.1)    $\lambda(n, R) = \dfrac{n}{a(R)}$

Hence to formalize the above idea, now imagine an increasing sequence of regions $R_1 \subset R_2 \subset \cdots \subset R_m \subset \cdots$ and corresponding point totals $n_1 < n_2 < \cdots < n_m < \cdots$ that expand in such a way that the limiting density

4
By definition $E(X) = \sum_x x \cdot p(x) = 1 \cdot p(1) + 0 \cdot p(0) = p(1)$.


(2.3.2)    $\lambda = \lim_{m \to \infty} \lambda(n_m, R_m) = \lim_{m \to \infty} \dfrac{n_m}{a(R_m)}$

exists and is positive. Under this assumption, it is shown in the Appendix (Section 1) that
the Binomial probabilities in (2.2.3) converge to simple Poisson probabilities,

(2.3.3)    $\Pr[N(C) = k \mid \lambda] = \dfrac{[\lambda\, a(C)]^{k}}{k!}\, e^{-\lambda\, a(C)}, \quad k = 0, 1, 2, \ldots$

Moreover, by (2.2.2) and (2.3.2), the expected number of points in any given cell (plaza tile), C, is now given by

(2.3.4)    $E[N(C)] = \lambda\, a(C)$

where density λ becomes the relevant constant of proportionality. Finally, if the set of
random variables {N (C )} describing cell-counts for every cell of finite area in the plane
is designated as a spatial point process on the plane, then any process governed by the
Poisson probabilities in (2.3.3) is designated as a spatial Poisson process on the plane.
Hence, when extended to the entire plane, the basic model of complete spatial
randomness (CSR) above corresponds precisely to a spatial Poisson process.
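The quality of this Poisson approximation is easy to examine numerically. The sketch below holds the expected cell count λ·a(C) fixed while n grows (so that a(C)/a(R) shrinks), and reports the maximum discrepancy between the Binomial probabilities (2.2.3) and the Poisson probabilities (2.3.3); all numerical values are illustrative.

```matlab
% Convergence of the Binomial probabilities (2.2.3) to the Poisson limit (2.3.3).
% The expected cell count lambda*a(C) = 4 is held fixed while n grows and a(C)/a(R) shrinks.
mu = 4; k = 0:12;
pois = mu.^k ./ factorial(k) .* exp(-mu);                        % Poisson probabilities
for n = [20 100 1000]
    p = mu/n;                                                    % a(C)/a(R) for this n
    logbin = gammaln(n+1) - gammaln(k+1) - gammaln(n-k+1) ...
             + k*log(p) + (n-k)*log(1-p);                        % log Binomial probabilities
    fprintf('n = %4d: max |Binomial - Poisson| = %.5f\n', n, max(abs(exp(logbin) - pois)));
end
```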

2.4 Generalized Spatial Randomness

The basic notion of spatial randomness above was derived from the principle that regions
of equal area should have the same chance of containing any given randomly located
point. More formally, this Spatial Laplace Principle asserts that for any two subregions
(cells), C1 and C2 , in R ,

(2.4.1)    $a(C_1) = a(C_2) \;\Longrightarrow\; \Pr[X(C_1) = 1 \mid R] = \Pr[X(C_2) = 1 \mid R]$

However, as was noted in the Housing Abandonment example above, simple area may
not always be the most relevant reference measure (backcloth). In particular, while one
can imagine a randomly located abandoned house, such houses are very unlikely to
appear in the middle of a public park, let alone the middle of a street. So here it makes
much more sense to look at the existing housing distribution, and to treat a “randomly
located abandoned house” as a random sample from this distribution. Here the Laplace
principle is still at work, but now with respect to houses. For if housing abandonments
are spatially random, then each house should have the same chance of being abandoned.
Similarly, in the Larynx cancer example, if such cancers are spatially random, then each
individual should have the same chance of contracting this disease. So here, the existing
population distribution becomes the relevant reference measure.


To generalize the above notion of spatial randomness, we need only replace “area” with
the relevant reference measure, say μ(C), which may be the "number of houses" in C or the "total population" of C. As a direct extension of (2.4.1) above, we then have the following Generalized Spatial Laplace Principle: For any two subregions (cells), C₁ and C₂, in R:

(2.4.2)    $\mu(C_1) = \mu(C_2) \;\Longrightarrow\; \Pr[X(C_1) = 1 \mid R] = \Pr[X(C_2) = 1 \mid R]$

If (2.4.1) is now replaced by (2.4.2), then one can essentially reproduce all of the results above. Given this assumption, exactly the same arguments leading to (2.2.3) now show that

(2.4.3)    $\Pr[N(C) = k \mid n, R] = \dfrac{n!}{k!\,(n-k)!} \left(\dfrac{\mu(C)}{\mu(R)}\right)^{k} \left(1 - \dfrac{\mu(C)}{\mu(R)}\right)^{n-k}, \quad k = 0, 1, \ldots, n$

To establish the Poisson approximation, there is one additional technicality that needs to
be mentioned. The basic Laplace argument in Figure 2.1 above required that we be able
to divide the square, S , into any number of equal-area cells. The simplest way to extend
this argument is to assume that the relevant reference measure, μ, is absolutely continuous with respect to the area measure, a. In particular, it suffices to assume that the relevant reference measure can be modeled in terms of a density function with respect to area.5 So if housing (or population) is the relevant reference measure, then we can model this in terms of a housing density (population density) with respect to area. In this setting, if we now let $\lambda(n, R) = n/\mu(R)$, and again assume the existence of a limiting positive density

(2.4.4)    $\lambda = \lim_{m \to \infty} \lambda(n_m, R_m) = \lim_{m \to \infty} \dfrac{n_m}{\mu(R_m)}$

as the reference region becomes larger, then the same argument for (2.3.3) [in Section A1.1 of the Appendix] now shows that

(2.4.5)    $\Pr[N(C) = k \mid \lambda] = \dfrac{[\lambda\,\mu(C)]^{k}}{k!}\, e^{-\lambda\,\mu(C)}, \quad k = 0, 1, 2, \ldots$

Spatial point processes governed by Poisson probabilities of this type (i.e., with non-
uniform reference measures) are often referred to as nonhomogeneous spatial Poisson
processes. Hence we shall often refer to this as the nonhomogeneous CSR Hypothesis.

5
More formally, it is assumed that there is some "density" function, f, on R such that μ is the integral of f, i.e., such that for any cell, $C \subseteq R$, $\mu(C) = \int_C f(x)\, dx$.
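A simple way to simulate such a nonhomogeneous pattern is by rejection sampling against the reference density. The sketch below does this on the unit square for a hypothetical density f(x,y) = x + y; the density and the sample size are illustrative assumptions, not taken from the text.

```matlab
% Rejection-sampling sketch of a nonhomogeneous random pattern on the unit square.
% Points are accepted with probability proportional to an illustrative reference
% density f(x,y) = x + y (normalization is irrelevant for rejection sampling).
n = 500;                               % desired number of points
f = @(x,y) x + y;                      % hypothetical reference density on R
fmax = 2;                              % an upper bound for f on the unit square
pts = zeros(n,2); kept = 0;
while kept < n
    cand = rand(1,2);                                  % uniform candidate location
    if rand < f(cand(1),cand(2)) / fmax                % accept with probability f/fmax
        kept = kept + 1;
        pts(kept,:) = cand;
    end
end
% The accepted points are denser where f is larger (toward the upper-right corner).
```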


2.5 Spatial Stationarity

Finally we consider a number of weaker versions of the spatial randomness model that
will also prove to be useful. First observe that some processes may in fact be “Laplace
like” in the sense that they look the same everywhere, but may not be completely
random. A simple example is provided by the cell centers in Figure 1.2 of Section 1
above. Here one can imagine that if the microscope view were shifted to the left or right
on the given cell slide, the basic pattern of cell centers would look very similar. Such
point processes are said to be stationary. To make this notion more precise, it is
convenient to think of each subregion $C \subseteq R$ as a "window" through which one can see only part of a larger point process on all of region R. In these terms, the most important notion of stationarity for our purposes is one in which the process seen in C remains the
same no matter how we move this window. Consider for example the pattern of trees in a
large rain-forest, R , part of which is shown in Figure 2.2 below. Here again this pattern
is much too dispersed to be completely random, but nonetheless appears to be the same
everywhere. Suppose that the relevant subregion, C , under study corresponds to the
small square in the lower left. In these terms, the appropriate notion of stationarity for our
purposes amounts to the assumption that the cell-count distribution in C will remain the same no matter where this subregion is located.

[Fig.2.2. Isotropic Stationarity        Fig.2.3. Anisotropic Stationarity]

For example the tilted square shown in
the figure is one possible relocation (or copy) of C in R. More generally, if cell C₂ is simply a translation and/or rotation of cell C₁, then these cells are said to be geometrically congruent, written $C_1 \cong C_2$. Hence our formal definition of stationarity asserts that the cell-count distributions for congruent cells are the same, i.e., that for any $C_1, C_2 \subseteq R$,

(2.5.1)    $C_1 \cong C_2 \;\Longrightarrow\; \Pr[N(C_1) = k] = \Pr[N(C_2) = k], \quad k = 0, 1, \ldots$

Since the directional orientation of cells makes no difference, this is also called isotropic
stationarity. There is a weaker form of stationarity in which directional variations are


allowed, i.e., in which (2.5.1) is only required to hold for cells that are translations of one
another. This type of anisotropic stationarity is illustrated by the tree pattern in Figure
2.3, where the underlying point process tends to produce vertical alignments of trees
(more like an orchard than a forest). Here the variation in cell counts can be expected to
differ depending on cell orientation. For example the vertical cell in Figure 2.3 is more
likely to contain extreme point counts than its horizontal counterpart. (We shall see a
similar distinction made for continuous stationary processes in Part II of this
NOTEBOOK.)

One basic consequence of both forms of stationarity is that mean point counts continue to
be proportional to area, as in the case of complete randomness, i.e. that

(2.5.2)    $E[N(C)] = \lambda \cdot a(C)$

where λ is again the expected point density (i.e., expected number of points per unit area). To see this, note simply that the basic Laplace argument in Figure 2.1 above depends only on similarities among individual cells in uniform grids of cells. But since such cells are all translations of one another, it now follows from (2.5.1) that they all have the same cell-count distributions, and hence have the same means. So by the same argument above (with cell occupancy probabilities now replaced by mean point counts) it follows that such mean counts must again be proportional to area. Thus while there can be
many types of statistical dependencies between counts in congruent cells (as in the
dispersed tree patterns above), the expected numbers of points must be the same in each.
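A dispersed pattern of the sort pictured above can be imitated by jittering a regular grid. The following sketch builds such a pattern and computes its quadrat counts: their mean is still λ·a(C), as (2.5.2) requires, but their variance is far below the Poisson value, reflecting the dispersion; the grid size and jitter are illustrative choices.

```matlab
% A dispersed (non-CSR) pattern: a jittered regular grid on the unit square.
g = 20;                                            % 20 x 20 = 400 grid points
[gx, gy] = meshgrid(((1:g) - 0.5)/g);              % regular grid of cell centers
jitter = 0.25/g;                                   % small, so points stay well separated
pts = [gx(:) gy(:)] + jitter*(2*rand(g^2,2) - 1);  % independent uniform jitter
% Quadrat counts in a 4 x 4 partition: the mean should be lambda*a(C) = 400/16 = 25,
% but the variance is far below 25 (the Poisson/CSR value), signalling dispersion.
idx = min(floor(pts*4), 3);                        % quadrat row/column indices (0..3)
counts = accumarray(idx(:,1)*4 + idx(:,2) + 1, 1, [16 1]);
fprintf('mean = %.2f, variance = %.2f\n', mean(counts), var(counts));
```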

One final point should be made about stationarity. This concept implicitly assumes that
the reference region, R , is sufficiently large to ensure that the relevant cells C never
intersect the boundary of R . Since this rarely happens in practice, the present notion of
stationarity is best regarded as a convenient fiction. For example, suppose that in the rain-
forest illustrated in Figure 2.2 above there is actually a lake, as shown in Figure 2.4
below. In this case, any copies of the given (vertical) cell that lie in the lake will of course
contain no trees. More generally, those cells that intersect that lake are likely to have
fewer trees, such as the tilted cell in the figure. Here it is clear that condition (2.5.1)
cannot possibly hold. Such violations of (2.5.1) are often referred to as edge effects.

Fig.2.4. Actual Landscape Fig.2.5. Stationary Version



Here there are two approaches that one can adopt. The first is to disallow any cells that
intersect the lake, and thus to create a buffer zone around the lake. While this is no doubt
effective, it has the disadvantage of excluding some points near the lake. If the forest, R,
is large, this will probably make little difference. But if R is small (say not much bigger
than the section shown) then this amounts to throwing away valuable data. An alternative
approach is to ignore the lake altogether and to imagine a “stationary version” of this
landscape, such as that shown in Figure 2.5. Here there are seen to be more points than
were actually counted in this cell. So the question is then how to estimate these missing
points. A method for doing so (known as Ripley's correction) will be discussed further in
Section 4.3 below.
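For a concrete sense of the buffer-zone approach, the minimal sketch below simply discards all points within a distance h of the boundary of a unit-square region; the pattern and the buffer width are illustrative assumptions (Ripley's correction itself is taken up in Section 4.3).

```matlab
% Buffer-zone treatment of edge effects: keep only points at least h from the boundary of R.
pts = rand(200,2);        % illustrative pattern on the unit square R
h = 0.1;                  % buffer width (an illustrative choice)
inBuffer = pts(:,1) < h | pts(:,1) > 1-h | pts(:,2) < h | pts(:,2) > 1-h;
core = pts(~inBuffer,:);  % points retained for analysis
```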


3. Testing Spatial Randomness

There are at least three approaches to testing the CSR hypothesis: the quadrat method, the
nearest-neighbor method, and the method of K-functions. We shall consider each of these
in turn.

3.1 Quadrat Method

This simple method is essentially a direct test of the CSR Hypothesis as stated in
expression (2.1.3) above. Given a realized point pattern from a point process in a
rectangular region, R, one begins by partitioning R into congruent rectangular subcells (quadrats), C₁, …, C_m, as in Figure 3.1 below (where m = 16).

[Fig. 3.1. Quadrat Partition of R]

Then, regardless of whether the given pattern represents trees in a forest or beetles in a field, the CSR Hypothesis asserts that the cell-count distribution for each C_i must be the same, as given by (2.1.3). But rather
than use this Binomial distribution, it is typically assumed that R is large enough to use
the Poisson approximation in (2.3.3). In the present case, if there are n points in R , and
if we let a = a(C₁), and estimate the expected point density λ by

(3.1.1)    $\hat{\lambda} = \dfrac{n}{a(R)}$

then this common Poisson cell-count distribution has the form

(3.1.2)    $\Pr[N_i = k \mid \hat{\lambda}] = \dfrac{(\hat{\lambda} a)^{k}}{k!}\, e^{-\hat{\lambda} a}, \quad k = 0, 1, 2, \ldots$

Moreover, since the CSR Hypothesis also implies that each of the cell counts,
N i  N (Ci ), i  1,.., k , is independent, it follows that  N i : i  1,.., k  must be a
independent random samples from this Poisson distribution. Hence the simplest test of


this hypothesis is to use the Pearson χ² goodness-of-fit test. Here the expected number of points in each cell is given by the mean of the Poisson above, which (recalling that a = a(R)/m by construction) is

(3.1.3)    $E(N \mid \hat{\lambda}) = a \cdot \hat{\lambda} = a \cdot \dfrac{n}{a(R)} = \dfrac{n}{m}$

Hence if the observed value of N_i is denoted by n_i, then the chi-square statistic

(3.1.4)    $\chi^2 = \sum_{i=1}^{m} \dfrac{(n_i - n/m)^2}{n/m}$

is known to be asymptotically chi-square distributed with m  1 degrees of freedom,


under the CSR Hypothesis. Thus one can test this hypothesis directly in these terms. But
since n / m is simply the sample mean, i.e., n / m  (1/ m) i1 ni  n , this statistic can also
m

be written as

(ni  n ) 2 s2
 2   i1
m
(3.1.5)  (m  1)
n n


m
where s 2  1
m1 i 1
(ni  n ) 2 is the sample variance. But since the variance if the Poisson
distribution is exactly the mean, it follows that var( N ) / E ( N )  1 under CSR. Moreover,
since s 2 / n is the natural estimate of this ratio, this ratio is often designated as the index
of dispersion, and used as a rough measure of dispersion versus clustering. If s 2 / n  1
then there is too little variation among quadrat counts, suggesting possible “dispersion”
rather than randomness. Similarly, if s 2 / n  1 then there is too much variation among
counts, suggesting possible “clustering” rather than randomness.
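To make this computation concrete, the following is a minimal MATLAB sketch (not one of the course programs) of how quadrat counts, the chi-square statistic in (3.1.4), and the index of dispersion might be obtained for a pattern stored as an n x 2 coordinate matrix, pts, observed in a rectangle [0,Lx] x [0,Ly]. All variable names here are illustrative assumptions.

% Hypothetical sketch: quadrat counts and index of dispersion
m1 = 4;  m2 = 4;                              % grid dimensions (m = m1*m2 quadrats)
ix = min(max(ceil(pts(:,1)*(m1/Lx)),1),m1);   % column index of each point
iy = min(max(ceil(pts(:,2)*(m2/Ly)),1),m2);   % row index of each point
ni = accumarray([iy ix],1,[m2 m1]);           % quadrat counts N_i
ni = ni(:);                                   % stack the counts into one vector
m  = m1*m2;
nbar = mean(ni);                              % sample mean count, n/m
chi2 = sum((ni - nbar).^2)/nbar;              % chi-square statistic (3.1.4)-(3.1.5)
disp_index = var(ni)/nbar;                    % index of dispersion, s^2/nbar

Under CSR, chi2 is approximately chi-square with m - 1 degrees of freedom, and values of disp_index well below (above) one point toward dispersion (clustering).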

But this testing procedure is very restrictive in that it requires an equal-area partition of
the given region.1 More importantly, it depends critically on the size of the partition
chosen. As with all applications of Pearson’s goodness-of-fit test, if there is no natural
choice of partition size, then the results can be very sensitive to the partition chosen.

3.2 Nearest-Neighbor Methods

In view of these shortcomings, the quadrat method above has for the most part been
replaced by other methods. The simplest of these is based on the observation that if one
simply looks at distances between points and their nearest neighbors in R , then this
provides a natural test statistic that requires no artificial partitioning scheme. More

1
More general "random quadrat" methods are discussed in Cressie (1995, Section 8.2.3).

precisely, for any given points, $s = (s_1, s_2)$ and $v = (v_1, v_2)$ in $R$ we denote the
(Euclidean) distance between $s$ and $v$ by2

(3.2.1)    $d(s,v) = \sqrt{(s_1 - v_1)^2 + (s_2 - v_2)^2}$

and denote each point pattern of size $n$ in $R$ by $S_n = (s_i : i = 1,..,n)$. Then for any point,
$s_i \in S_n$,3 the nearest-neighbor distance (nn-distance) from $s_i$ to all other points in $S_n$ is
given by4

(3.2.2)    $d_i = d_i(S_n) = \min\{d(s_i, s_j) : s_j \in S_n,\ j \neq i\}$
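For concreteness, the following is a minimal MATLAB sketch of how the nn-distances in (3.2.2) might be computed from an n x 2 coordinate matrix, pts. It is only an illustration; the variable names are not those of the course programs.

% Hypothetical sketch: nearest-neighbor distances d_i in (3.2.2)
n  = size(pts,1);
X  = pts(:,1);  Y = pts(:,2);
dx = repmat(X,1,n) - repmat(X',n,1);     % all pairwise x differences
dy = repmat(Y,1,n) - repmat(Y',n,1);     % all pairwise y differences
Dmat = sqrt(dx.^2 + dy.^2);              % pairwise Euclidean distances (3.2.1)
Dmat(1:n+1:end) = Inf;                   % exclude each point from its own comparison
d = min(Dmat,[],2);                      % d(i) = nn-distance of point i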

In a manner similar to the index of dispersion above, the average magnitudes of these
nn-distances (relative to those expected under CSR) provide a direct measure of
"dispersion" or "clustering" in point patterns. This is seen clearly by comparing the
two figures below, each showing a pattern of 14 points.

[Fig. 3.2. Dispersed Pattern    Fig. 3.3. Clustered Pattern]

In Figure 3.2 these points are seen to be very uniformly spaced, so that nn-distances tend
to be larger than what one would expect under CSR. In Figure 3.3 on the other hand, the
points are quite clustered, so that nn-distances tend to be smaller than under CSR.

2
Throughout these notes we shall always take $d(s,v)$ to be Euclidean distance. However, there are many other possibilities. At large scales it may be more appropriate to use great-circle distance on the globe. Alternatively, one may take $d(s,v)$ to be travel distance on some underlying transportation network. In any case, most of the basic concepts developed here (such as nearest-neighbor distances) are equally meaningful for these definitions of distance.
3
The vector notation, $S_n = (s_i : i = 1,..,n)$, means that each point $s_i$ is treated as a distinct component of $S_n$. Hence (with a slight abuse of notation), we take $s_i \in S_n$ to mean that $s_i$ is a component of pattern $S_n$.
4
This is called the event-event distance in [BG] (p.98). One may also consider the nn-distance from any random point, $x \in R$, to the given pattern, as defined by $d_x(S_n) = \min\{d(x, s_i) : i = 1,..,n\}$. However, we shall not make use of these point-event distances here. For a more detailed discussion see Cressie (1995, Section 8.2.6).

3.2.1 Nearest-Neighbor Distribution under CSR

To make these ideas precise, we must determine the probability distribution of nn-distance
under CSR, and compare the observed nn-distance with this distribution. To
begin with, suppose that the implicit reference region $R$ is large, so that for any given
point density, $\lambda$, we may assume that cell-counts are Poisson distributed under CSR.
Now suppose that $s$ is any randomly selected point in a pattern realization of this CSR
process, and let the random variable, $D$, denote the nn-distance from $s$ to the rest of the
pattern. To determine the distribution of $D$, we next consider a circular region, $C_d$, of
radius $d$ around $s$, as shown in Figure 3.4 below. Then by definition, the probability that
$D$ is at least equal to $d$ is precisely the probability that there are no other points in $C_d$.
Hence if we now let $C_d(s) = C_d - \{s\}$, then this probability is given by

(3.2.3)    $\Pr(D > d) = \Pr\{N[C_d(s)] = 0\}$

[Fig. 3.4. Cell of radius d]

But since the right-hand side is simply a cell-count probability, it follows from
expression (2.3.3) that,

(3.2.4)    $\Pr(D > d) = e^{-\lambda\, a[C_d(s)]} = e^{-\lambda \pi d^2}$

where the last equality follows from the fact that $a[C_d(s)] = a(C_d) = \pi d^2$. Hence it
follows by definition that the cumulative distribution function (cdf), $F_D(d)$, for $D$ is
given by,

(3.2.5)    $F_D(d) = \Pr(D \leq d) = 1 - \Pr(D > d) = 1 - e^{-\lambda \pi d^2}$

In Section 2 of the Appendix to Part I it is shown that this is an instance of the Rayleigh
distribution, and in Section 3 of the Appendix that for a random sample of $m$ nearest-neighbor
distances $(D_1,..,D_m)$ from this distribution, the scaled sum (known as Skellam's
statistic),

(3.2.6)    $S_m = 2\pi\lambda \sum_{i=1}^{m} D_i^2$

is chi-square distributed with $2m$ degrees of freedom (as on p.99 in [BG]). Hence this
statistic provides a test of the CSR Hypothesis based on nearest neighbors.
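As an illustration, the following hedged MATLAB sketch computes Skellam's statistic in (3.2.6) from a vector, d, of (approximately independent) nn-distances and an estimated density, lam. The P-values use chi2cdf, which requires the Statistics Toolbox, and all names are illustrative assumptions.

% Hypothetical sketch: Skellam's statistic (3.2.6)
m  = length(d);
Sm = 2*pi*lam*sum(d.^2);        % chi-square with 2m degrees of freedom under CSR
p_clust = chi2cdf(Sm,2*m);      % lower-tail P-value (small Sm suggests clustering)
p_disp  = 1 - p_clust;          % upper-tail P-value (large Sm suggests dispersion)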


3.2.2 Clark-Evans Test

While Skellam's statistic can be used to construct tests, it follows from the Central Limit
Theorem that sums of independent, identically distributed random variables are
approximately normally distributed.5 Hence the most common test of the CSR
Hypothesis based on nearest neighbors involves a normal approximation to the sample
mean of $D$, as defined by

(3.2.7)    $\bar{D}_m = \frac{1}{m}\sum_{i=1}^{m} D_i$

To construct this normal approximation, it is shown in Section 2 of the Appendix to Part I
that the mean and variance of the distribution in (3.2.4) are given respectively by

(3.2.8)    $E(D) = \frac{1}{2\sqrt{\lambda}}$

(3.2.9)    $\mathrm{var}(D) = \frac{4 - \pi}{4\lambda\pi}$

To get some feeling for these quantities, observe that under the CSR Hypothesis, as the
point density, $\lambda$, increases, both the expected value and variance of nn-distances
decrease. This makes intuitive sense when one considers denser scatterings of random
points in $R$.

Next we observe from the properties of independently and identically distributed (iid)
random samples that for the sample mean, $\bar{D}_m$, in (3.2.7) we must then have

(3.2.10)    $E(\bar{D}_m) = \frac{1}{m}\sum_{i=1}^{m} E(D_i) = \frac{1}{m}[m\,E(D_1)] = E(D_1) = \frac{1}{2\sqrt{\lambda}}$

and similarly must have

(3.2.11)    $\mathrm{var}(\bar{D}_m) = \left(\frac{1}{m}\right)^2 \sum_{i=1}^{m} \mathrm{var}(D_i) = \frac{1}{m^2}[m\,\mathrm{var}(D_1)] = \frac{4 - \pi}{4\lambda\pi m}$

But from the Central Limit Theorem it then follows that for sufficiently large sample
sizes,6 $\bar{D}_m$ must be approximately normally distributed under the CSR Hypothesis with
mean and variance given by (3.2.10) and (3.2.11), i.e., that:

(3.2.12)    $\bar{D}_m \sim N\left( \frac{1}{2\sqrt{\lambda}},\; \frac{4-\pi}{4\lambda\pi m} \right)$
5
See Section 3.1.4 in Part II of this NOTEBOOK for further detail. Here we simply state those results needed for the Clark-Evans test.
6
Here "sufficiently large" is usually taken to mean $m \geq 30$, as long as the distribution in (3.2.4) is not "too skewed". Later we shall investigate this by using simulations.


Hence this distribution provides a new test of the CSR Hypothesis, known as the Clark-Evans
Test [see Clark and Evans (1954) and [BG], p.100]. If the standard error of $\bar{D}_m$ is
denoted by

(3.2.13)    $\sigma(\bar{D}_m) = \sqrt{\mathrm{var}(\bar{D}_m)} = \sqrt{\frac{4-\pi}{4\lambda\pi m}}$

then to construct this test, one begins by standardizing the sample mean, $\bar{D}_m$, in order to
use the standard normal tables. Hence, if we now denote the standardized sample mean
under the CSR Hypothesis by

(3.2.14)    $Z_m = \frac{\bar{D}_m - E(\bar{D}_m)}{\sigma(\bar{D}_m)} = \frac{\bar{D}_m - 1/(2\sqrt{\lambda})}{\sqrt{(4-\pi)/(4\lambda\pi m)}}$

then it follows at once from (3.2.12) that under CSR,7

(3.2.15)    $Z_m \sim N(0,1)$

To construct a test of the CSR Hypothesis based on this distribution, suppose that one
starts with a sample pattern $S_n = (s_i : i = 1,..,n)$ and constructs the nn-distance $d_i$ for each
point, $s_i \in S_n$. Then it would seem most natural to use all these distances $(d_1,..,d_n)$ to
construct the sample-mean statistic in (3.2.10) above. However, this would violate the
assumed independence of nn-distances on which this distribution theory is based. To see
this it is enough to observe that if $s_i$ and $s_j$ are mutual nearest neighbors, so that $d_i = d_j$,
then these are obviously not independent. More generally, if $s_j$ is the nearest neighbor of
$s_i$, then again $d_i$ and $d_j$ must be dependent.8

However, if one were to select a subset of nn-distance values that contained no common
points, such as those shown in Figure 3.5, then this problem could in principle be
avoided. The question is how to choose independent pairs. We shall return to this
problem later, but for the moment we simply assume that some "independent" subset
$(d_1,..,d_m)$ of these distance values has been selected (with $m < n$). [This is why the
notation "$m$" rather than "$n$" has been used in the formulation above.]

[Fig. 3.5. Independent Subset]

7
For any random variable, $X$, with $E(X) = \mu$ and $\mathrm{var}(X) = \sigma^2$, if $Z = (X - \mu)/\sigma = X/\sigma - \mu/\sigma$ then $E(Z) = E(X)/\sigma - \mu/\sigma = 0$ and $\mathrm{var}(Z) = \mathrm{var}(X)/\sigma^2 = 1$.
8
If $s_j$ is the nearest neighbor of $s_i$, then since $D_j$ cannot be bigger than $d_i$, it follows that $\Pr(D_j \leq d_i \,|\, D_i = d_i) = 1$, and hence that these nn-distances are statistically dependent.

Given this sample, one can construct the sample-mean value,

(3.2.16)    $\bar{d}_m = \frac{1}{m}\sum_{i=1}^{m} d_i$

and use this to construct tests of CSR.

Two-Tailed Test of CSR

The standard test of CSR in most software is a two-tailed test in which both the
possibility of "significantly small" values of $\bar{d}_m$ (clustering) and "significantly large"
values of $\bar{d}_m$ (dispersion) are considered. Hence it is appropriate to review the details of
such a testing procedure. First recall the notion of upper-tail points, $z_\alpha$, for the standard
normal distribution, as defined by $\Pr(Z \geq z_\alpha) = \alpha$ for $Z \sim N(0,1)$. In these terms, it
follows that for the standardized mean in (3.2.14)

(3.2.17)    $\Pr\left( |Z_m| \geq z_{\alpha/2} \right) = \Pr\left[ (Z_m \leq -z_{\alpha/2}) \text{ or } (z_{\alpha/2} \leq Z_m) \right] = \alpha$

under the CSR Hypothesis. Hence if one estimates point density as in (3.1.1), and
constructs corresponding estimates of the mean (3.2.10) and standard deviation (3.2.13)
under CSR by

(3.2.18)    $\hat{\mu} = \frac{1}{2\sqrt{\hat{\lambda}}}\,, \qquad \hat{\sigma}_m = \sqrt{\frac{4-\pi}{4\pi m \hat{\lambda}}}$

then one can test the CSR Hypothesis by constructing the following standardized sample
mean:

(3.2.19)    $z_m = \frac{\bar{d}_m - \hat{\mu}}{\hat{\sigma}_m}$

If the CSR Hypothesis is true, then by (3.2.14) and (3.2.15), $z_m$ should be a sample from
$N(0,1)$.9 Hence a test of CSR at the $\alpha$-level of significance10 is then given by the rule:

Two-Tailed CSR Test: Reject the CSR Hypothesis if and only if $|z_m| > z_{\alpha/2}$

The significance level, $\alpha$, is also called the size of the test. Example results of this
testing procedure for a test of size $\alpha$ are illustrated in Figure 3.6 below. Here the two

9
Formally this assumes that $\hat{\lambda}$ is a sufficiently accurate estimate of $\lambda$ to allow any probabilistic variation in $\hat{\lambda}$ to be ignored.
10
By definition, the level of significance of a test is the probability, $\alpha$, that the null hypothesis (in this case the CSR Hypothesis) is rejected when it is actually true. This is discussed further below.

samples, $z_m$, in the tails of the distribution are seen to yield strong evidence against the
CSR Hypothesis, while the sample in between does not.

One-Tailed Tests of Clustering and Dispersion

As already noted, values of $\bar{d}_m$ (and hence $z_m$) that are too low to be plausible under
CSR are indicative of patterns more clustered than random. Similarly, values too large
are indicative of patterns more dispersed than random. In many cases, one of these
alternatives is more relevant than the other. In the redwood seedling example of Figure
1.1 it is clear that trees appear to be clustered. Hence the only question is whether or not

[Fig. 3.6. Two-Tailed Test of CSR]

this apparent clustering could simply have happened by chance. So the key question here
is whether this pattern is significantly more clustered than random. Similarly, one can ask
whether the pattern of Cell Centers in Figure 1.2 is significantly more dispersed than
random. Such questions lead naturally to one-tailed versions of the test above. First, a test
of clustering versus the CSR Hypothesis at the $\alpha$-level of significance is given by the
rule:

Clustering versus CSR Test: Conclude significant clustering if and only if $z_m < -z_\alpha$

Example results of this testing procedure for a test of size $\alpha$ are illustrated in Figure 3.7
below. Here the standardized sample mean $z_m$ to the left is sufficiently low to conclude
the presence of clustering (at the $\alpha$-level of significance), and the sample toward the
middle is not.

[Fig. 3.7. One-Tailed Test of Clustering]

In a similar manner, one can construct a test of dispersion versus the CSR Hypothesis at
the $\alpha$-level of significance using the rule:

Dispersion versus CSR Test: Conclude significant dispersion if and only if $z_m > z_\alpha$

Example results for a test of size $\alpha$ are illustrated in Figure 3.8 below, where the sample
$z_m$ to the right is sufficiently high to conclude the presence of dispersion (at the $\alpha$-level of
significance) and the sample toward the middle is not.

[Fig. 3.8. One-Tailed Test of Dispersion]

While such tests are standard in the literature, it is important to emphasize that there is no
"best" choice of $\alpha$. The typical values given by most statistical texts are listed in Tables
3.1 and 3.2 below:

   Significance    $\alpha$    $z_{\alpha/2}$            Significance    $\alpha$    $z_\alpha$
   "Strong"        .01         2.58                      "Strong"        .01         2.33
   "Standard"      .05         1.96                      "Standard"      .05         1.65
   "Weak"          .10         1.65                      "Weak"          .10         1.28

   Table 3.1. Two-Tailed Significance              Table 3.2. One-Tailed Significance


So in the case of a two-tailed test, for example, the non-randomness of a given pattern is
considered "strongly" ("weakly") significant if the CSR Hypothesis can be rejected at the
$\alpha = .01$ ($\alpha = .10$) level of significance.11 The same is true of one-tailed tests (where the
cutoff value, $z_{\alpha/2}$, is now replaced by $z_\alpha$). In all cases, the value $\alpha = .05$ is regarded as a
standard (default) value indicating "significance".
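For reference, the cutoff values in Tables 3.1 and 3.2 can be recovered with the inverse normal cdf, norminv, in MATLAB (Statistics Toolbox), as in the following illustrative fragment:

% Hypothetical check of the cutoff values in Tables 3.1 and 3.2
alpha = .05;
z_two = norminv(1 - alpha/2);   % two-tailed cutoff, approximately 1.96
z_one = norminv(1 - alpha);     % one-tailed cutoff, approximately 1.65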

P-Values for Tests

However, since these distinctions are admittedly arbitrary, another approach is often
adopted in evaluating test results. The main idea is quite intuitive. In the one-tailed test of
clustering versus CSR above, suppose that for the observed standardized mean value, $z_m$,
one simply asks how likely it would be to obtain a value this low if the CSR Hypothesis
were true. This question is easily answered by simply calculating the probability of a
sample value as low as $z_m$ for the standard normal distribution $N(0,1)$. If the cumulative
distribution function for the normal distribution is denoted by

(3.2.20)    $\Phi(z) = \Pr(Z \leq z)$

then this probability, called the P-value of the test, is given by

(3.2.21)    $\Pr(Z \leq z_m) = \Phi(z_m)$

as shown graphically below:

[Fig. 3.9. P-value for Clustering Test]

Notice that unlike the significance level, $\alpha$, above, the P-value for a test depends on the
realized sample value, $z_m$, and hence is itself a random variable that changes from
sample to sample. However, it can be related to $\alpha$ by observing that if $\Pr(Z \leq z_m) \leq \alpha$,
then for a test of size $\alpha$, one would conclude that there is significant clustering. More
generally, the P-value, $\Pr(Z \leq z_m)$, can be defined as the largest level of significance
(smallest value of $\alpha$) at which CSR would be rejected in favor of clustering based on the
given sample value, $z_m$.

Similarly, one can define the P-value for a test of dispersion in the same way, except that
now, for a given observed standardized mean value, $z_m$, one asks how likely it would be to
11
Note that lower values of $\alpha$ denote higher levels of significance.

obtain a value this large if the CSR Hypothesis were true. Hence the P-value in this case
is given simply by

(3.2.22)    $\Pr(Z \geq z_m) = \Pr(Z > z_m) = 1 - \Pr(Z \leq z_m) = 1 - \Phi(z_m)$

where the first equality follows from the fact that $\Pr(Z = z_m) = 0$ for continuous
distributions.12 This P-value is illustrated graphically below:

[Fig. 3.10. P-Value for Dispersion Test]

Finally, the corresponding P-value for the general two-tailed test is given as the answer to
the following question: How likely would it be to obtain a value as far from zero as $z_m$ if
the CSR Hypothesis were true? More formally, this P-value is given by

(3.2.23)    $\Pr(|Z| \geq |z_m|) = 2\,\Phi(-|z_m|)$

as shown below. Here the absolute value is used to ensure that $-|z_m|$ is negative
regardless of the sign of $z_m$. Also the factor "2" reflects the fact that values in both tails
are further from zero than $z_m$.

[Fig. 3.11. P-Value for Two-Tailed Test]
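To tie expressions (3.2.18) through (3.2.23) together, the following is a minimal MATLAB sketch of the full Clark-Evans calculation, given a vector, d, of m (approximately independent) nn-distances, the pattern size n, and the area a of R. This is only a sketch of the formulas above, not the ce_test program discussed later, and normcdf requires the Statistics Toolbox.

% Hypothetical sketch: Clark-Evans standardized mean and P-values
m    = length(d);
lam  = n/a;                          % estimated density (3.1.1)
mu   = 1/(2*sqrt(lam));              % CSR mean nn-distance (3.2.18)
sig  = sqrt((4-pi)/(4*pi*m*lam));    % CSR std deviation of the sample mean (3.2.18)
zm   = (mean(d) - mu)/sig;           % standardized sample mean (3.2.19)
p_clust = normcdf(zm);               % one-tailed P-value for clustering (3.2.21)
p_disp  = 1 - normcdf(zm);           % one-tailed P-value for dispersion (3.2.22)
p_csr   = 2*normcdf(-abs(zm));       % two-tailed P-value (3.2.23)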

3.3 Redwood Seedling Example

We now illustrate the Clark-Evans testing procedure in terms of the Redwood Seedling
example in Figure 1.1. This image is repeated in Figure 3.12a below, where it is
compared with a randomly generated point pattern of the same size in Figure 3.12b. Here
it is evident that the redwood seedlings are more clustered than the random point pattern.

12
By the symmetry of the normal distribution, this P-value is also given by $\Phi(-z_m)$ [$= 1 - \Phi(z_m)$].

However, it is important to notice that there are indeed some apparent clusters in the
random pattern. In fact, if there were none then this pattern would be “too dispersed”. So
the key task is to distinguish between degrees of clustering that could easily occur by
chance and those that could not. This is the essence of statistical pattern analysis.

[Fig. 3.12a. Redwood Seedlings    Fig. 3.12b. Random Point Pattern]

To do so, we shall start by assuming that most of the necessary statistics have already
been calculated. (We shall return to the details of these calculations later.) Here the area,
$a(R) = 44108$ sq. meters, of this region $R$ is given by ARCMAP. It appears in the Attribute
Table of the boundary file Redw_bnd.shp in the map document Redwoods.mxd. The
number of points, $n = 62$, in this pattern is given in the Attribute Table of the data file,
Redw_pts.shp, in Redwoods.mxd. [The bottom of the Table shows "Records (0 out of
62 Selected)". Note that there only appear to be 61 rows, because the row numbering
always starts with zero in ARCMAP.] Hence the estimated point density in (3.1.1) above is
given by

(3.3.1)    $\hat{\lambda} = \frac{n}{a(R)} = \frac{62}{44108} = .00141$

For purposes of this illustration we set $m = n = 62$, so that the corresponding estimates of
the mean and standard deviation of nn-distances under CSR are given respectively by

(3.3.2)    $\hat{\mu} = \frac{1}{2\sqrt{\hat{\lambda}}} = \frac{1}{2\sqrt{.00141}} = 13.336$ meters

(3.3.3)    $\hat{\sigma}_n = \sqrt{\frac{4-\pi}{4 n \pi \hat{\lambda}}} = \sqrt{\frac{4 - 3.14}{(62)\,4\,(3.14)\,(.00141)}} = .8853$

For the redwood seedling pattern, the mean nn-distance, $\bar{d}_n$, turns out to be


(3.3.4)    $\bar{d}_n = 9.037$ meters

At this point, notice already that this average distance is much smaller than the theoretical
value calculated in (3.3.2) under the hypothesis of CSR. So this already suggests that, for
the given density of trees in this area, individual trees are much too close to their nearest
neighbors to be random. To verify this statistically, let us compute the standardized mean

(3.3.5)    $z_n = \frac{\bar{d}_n - \hat{\mu}}{\hat{\sigma}_n} = \frac{9.037 - 13.336}{.8853} = -4.855$

Now recalling from Table 3.2 above that there is "strongly significant" clustering if
$z_n < -z_{.01} = -2.33$, one can see from (3.3.5) that clustering in the present case is even
more significant. In fact the P-value in this case is given by13

(3.3.6)    P-value $= \Pr(Z \leq z_n) = \Phi(z_n) = \Phi(-4.855) = .0000006$

(Methods for obtaining $\Phi$-values are discussed below.) So the chances of obtaining a
mean nearest-neighbor distance this low under the CSR hypothesis are less than one in a
million. This is very strong evidence in favor of clustering versus CSR.
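These calculations are easily reproduced (up to rounding) with a few MATLAB commands; the lines below simply restate (3.3.1) through (3.3.6), and normcdf requires the Statistics Toolbox.

% Check of the redwood calculations in (3.3.1)-(3.3.6)
lam  = 62/44108;                     % estimated density, approximately .00141
mu   = 1/(2*sqrt(lam));              % approximately 13.336 meters
sig  = sqrt((4-pi)/(4*pi*62*lam));   % approximately .885
zn   = (9.037 - mu)/sig;             % approximately -4.86
pval = normcdf(zn);                  % on the order of 6 x 10^-7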

However, one major difficulty with this conclusion is that we have used the entire point
pattern $(m = n)$, and have thus ignored the obvious dependencies between nn-distances
discussed above. Cressie (1993, pp.609-10) calls this "intensive" sampling, and shows
with simulation analyses that this procedure tends to overestimate the significance of
clustering (or dispersion). The basic reason for this is that positive correlation among nn-distances
results in a larger variance of the test statistic, $Z_n$, than would be expected
under independence (for a proof of this see Section 4 of the Appendix to Part I, and also
see p.99 in [BG]). Failure to account for this will tend to inflate the absolute value of the
standardized mean, thus exaggerating the significance of clustering (or dispersion). With
this in mind, we now consider two procedures for taking random subsamples of pattern
points that tend to minimize this dependence problem. These two approaches utilize
JMPIN and MATLAB, respectively, and thus provide convenient introductions to using
these two software packages.

3.3.1 Analysis of Redwood Seedlings using JMPIN

One should begin here by reading the notes on opening JMPIN in section 2.1 of Part IV
in this NOTEBOOK.14 In the class subdirectory jmpin now open the file,
Redwood_data.jmp in JMPIN. (The columns nn-dist and area contain data exported
from MATLAB and ARCMAP, respectively, and are discussed later). The column
Rand_Relabel is a random ordering of labels with associated nn-distance values in the

13
Methods for obtaining $\Phi$-values are discussed later.
14
This refers to section 2.1 in the Software portion (Part IV) of this NOTEBOOK. All other references to
software procedures will be done similarly.

column, Sample. [These can be constructed using the procedure outlined in section
2.2(2) of Part IV in this NOTEBOOK.]

Now open a second file, labeled CE_Tests.jmp, which is a spreadsheet constructed for
this class that automates Clark-Evans tests. Here we shall use a random 50% subsample
of points from the Redwood Seedlings data set to carry out a test of clustering.15 To do
so, click Rows → Add Rows and add 31 rows (= 62/2). Next, copy-and-paste the first
31 rows of Redwood_data.jmp into these positions.

In Redwood_data.jmp :

(i) Select rows 1 to 31 (click Row 1, hold down shift, and click Row 31)
(ii) Select column heading Sample (this entire column is now selected)
(iii) Click Edit → Copy

Now in CE_Tests.jmp :

(i) Select column heading nn-dist


(ii) Click Edit → Paste

Finally, to activate this spreadsheet you must fill in the two parameters (area, n), starting
with area as follows:

(i) Right click on the column heading area.


(ii) Right click on the small red box (may say “no formula”)
(iii) Type 44108, hit return and click Apply and OK. (The entire column should
now contain the value “44108” in each row.)

The procedure for filling in the value n (= 62) is the same. Once these values are
registered, the spreadsheet does all remaining calculations. (Open the formula windows
for lam, mu, sig, s-mean, and Z as above, and examine the formulas used.) The results
are shown below (where only the first row is displayed):

   lam      mu        sig      s-mean    Z         P-Val CSR    P-Val Clust    P-Val Disp
   0.0014   13.3362   1.2521   8.2826    -4.0363   0.0000546    0.0000273      0.9999727

Notice first that all values other than lam differ from the full-sample case $(m = n)$
calculated above, since we have only $m = 31$ samples. Next observe that the P-value for
clustering (.0000273) is a full order of magnitude larger than for the full-sample case. So
while clustering is still extremely significant (as it should be), this significance level has

15
In [BG] (p.99) it is reported that a common rule-of-thumb to ensure approximate independence is to take a random subsample of no more than 10% (i.e., $m \leq n/10$). But even for large sample sizes, $n$, this tends to discard most of the information in the data. An alternative approach will be developed in the MATLAB application of Section 3.3.2 below.

been deflated by removing some of the positive dependencies between nn-distances.

Notice also that the P-value for CSR is (by definition) exactly twice that for Clustering,
and similarly that the P-value for Dispersion is exactly one minus that for Clustering.
This latter P-value shows that there is no statistical evidence for Dispersion in the sense
that values "as large as" $Z = -4.0363$ are almost bound to occur under CSR.

3.3.2 Analysis of Redwood Seedlings using MATLAB

While the procedure in JMPIN above does allow one to take random subsamples, and
thereby reduce the effect of positive dependencies among nn-distances, it only allows a
single sample to be taken. So the results obtained depend to some degree on the sample
selected. What one would like to do here is to take many subsamples of the same size
(say with m  31 ) and look at the range of Z-values obtained. If almost all samples
indicate significant clustering, then this yields a much stronger result that is clearly
independent of the particular sample chosen. In addition, one might for example want to
use the P-value obtained for the sample mean of Z as a more representative estimate of
actual significance. But to do so in JMPIN would require many repetitions of the same
procedure, and would clearly be very tedious. Hence an advantage of programming
languages like MATLAB is that one can easily write a program to carry out such
repetitious tasks. With this in mind, we now consider an alternative approach to Clark-
Evans tests using MATLAB.

One should begin here by reading the notes on opening MATLAB in section 3.1 of Part
IV in this NOTEBOOK. Now open MATLAB, and set the Current Directory (at the top
of the MATLAB window) to the class subdirectory, T:/sys502/matlab, and open the data
file, Redwoods.mat.16 The Workspace window on the left will now display the data
matrices contained in this file. For example, area, is seen to be a scalar with value,
44108, that corresponds to the area value used in JMPIN above. [This number was
imported from ARCMAP, and can be obtained by following the ARCMAP procedure
outlined in Section 1.2(8) of Part IV.] Next consider the data matrix, Redwoods, which is
seen to be a 62 x 2 matrix, with each row denoting the (x,y) coordinates of one of the 62
redwood seedlings. You can display the first three rows of this matrix by typing

>> Redwoods(1:3,:).

I have written a program, ce_test.m,17 in MATLAB to carry out Clark-Evans tests. You
can display this program by clicking Edit → Open and selecting the file ce_test.m.18
The first few lines of this program are displayed below:

16
The extension .mat is used for data files in MATLAB.
17
The extension .m is used for all executable programs and scripts in MATLAB.
18
To view this program you can also type the command >> edit ce_test.

function OUT = ce_test(pts,a,m,test)

% CE_TEST.M performs the Clark-Evans tests.


%
% NOTE: These tests use a random subsample (size = m) of the
% full sample of n nearest-neighbor distances, and
% ignore edge effects.

% Written by: TONY E. SMITH, 12/28/99

% INPUTS:
% (i) pts = file of point locations (xi,yi), i=1..n
% (ii) a = area of region
% (iii) m = sample size (m <= n)
% (iv) test = indicator of test to be used
% 0 = two-sided test for randomness
% 1 = one-sided test for clustering
% 2 = one-sided test for dispersion
%
% OUTPUTS: OUT = vector of all nearest-neighbor distances
%
% SCREEN OUTPUT: critical z-value and p-value for test

The first line defines this program to be a function called ce_test, with four inputs
(pts, a, m, test) and one output called OUT. The percent signs (%) on subsequent lines
indicate comments intended for the reader only. The next few comment lines describe
what the program does. In this case ce_test takes a subsample of size $m \leq n$ and
performs a Clark-Evans test as in JMPIN. The next set of comment lines describe the four
inputs in detail. The first, pts, contains the (x,y) coordinates of the given point pattern,
and corresponds in our present case to Redwoods. The parameter a corresponds to area,
and m corresponds to the size of the random subsample to be taken (in this case m = 31). Finally,
test is an indicator denoting the type of test to be done, so that for a one-tailed test of
clustering we would give test the value 1. During the execution of this program, the
nearest-neighbor distance for each pattern point is calculated. Since this vector of nn-distances
is useful for other applications (such as the JMPIN spreadsheet above), it is
worth saving. Hence the single output, OUT, is in this case the n x 1 matrix
of nn-distances. The last comment line describes the screen output of this program,
which in the present case is simply a display of the Z-value obtained and its
corresponding P-value.


To run this program, suppose that you want to save the nn-distance output as a vector
called D (the names of inputs and outputs can be anything you choose). Then at the
command prompt you would type:

>> D = ce_test(Redwoods,area,31,1);

Here it is important to end this command statement with a semicolon (;), for otherwise,
all output will be displayed on the screen (in this case the contents of D). Hence by
hitting return after typing the above command, the program will execute and give a
screen display such as the following:

RESULTS OF TEST FOR CLUSTERING

Z_Value = -3.3282

P_Value = .00043697

The results are now different from those of JMPIN above because a different random
subsample of size m  31 was chosen. To display the first four rows of the output vector,
D, type19

>> D(1:4,:)

As with the Redwoods display above, the absence of a semicolon at the end will cause
the result of this command to be displayed. If you would like to save this output to your
home directory (S:) as a text file, say nn_dist.txt, then use the command sequence20

>> save S:\nn_dist.txt D -ascii

As was pointed out above, the results of this Clark-Evans test depend on the particular
sample chosen. Hence, each time the program is run there will be a slightly different
result (try it!). But in MATLAB it is a simple matter to embed ce_test in a slightly larger
program that will run ce_test many times, and produce whatever summary outputs are
desired. I have constructed a program to do this, called ce_test_distr.m. If you open this
program you will see that it has a similar format:

19
Since D is a vector, there is only a single column. So one could simply type D(1:4) in this case.
20
To save D in another directory, say with the path description, S:\path , you must use the full command:
>> save S:\path\nn_dist.txt D -ascii .


function OUT = ce_test_distr(pts,a,m,test,N)

% CE_TEST_DISTR.M samples ce_test.m a total of N times

% Written by: TONY E. SMITH, 12/28/99

% INPUTS:
% (i) pts = file of point locations (xi,yi), i=1..n
% (ii) a = area of region
% (iii) m = sample size (m <= n)
% (iv) test = indicator of test to be used
% 0 = two-sided test for randomness
% 1 = one-sided test for clustering
% 2 = one-sided test for dispersion
% (v) N = number of sample tests.
%
% OUTPUTS: OUT = vector of Z-values for tests.
%
% SCREEN OUTPUT: (1) Normal fit of Histogram for OUT
% (2) Mean of OUT
% (3) P-value of mean (if normcdf present)

The only key difference is the new parameter, N, that specifies the number of point
pattern samples of size m to be simulated (i.e., the number of times the ce_test is to be
run). The output chosen for this program is the vector of Z-values obtained. So if N =
1000, then OUT will be a vector of length 1000. The screen outputs now include
summary measures of this vector of Z-values, namely the histogram of Z-values in OUT,
along with the mean of these Z-values and the P-value for this mean. If this program is
run using the command

>> Z = ce_test_distr(Redwoods,area,31,1,1000);

then 1000 samples will be drawn, and the resulting Z-values will be saved in a vector, Z.
In addition, a histogram of these Z-values will be displayed, as illustrated in Figure 3.13
below. Notice that the results of this simulated sampling scheme yield a distribution of Z-
values that is approximately normal. While this normality property is again a
consequence of the Central Limit Theorem, it should not be confused with the normal
distribution in (3.2.12) upon which the Clark-Evans test is based (that requires n to be
sufficiently large). However, this normality property does suggest that a 50% sample
(m = n/2) in this case yields a reasonable amount of independence among nn-distances,
as it was intended to do.21

21
Hence this provides some evidence that the 10% rule of thumb in footnote 15 above is overly
conservative.

[Fig. 3.13. Sampling Distribution of Z-values]

In particular, the mean of this distribution is now about -3.46 as shown by the program
output below:

RESULTS OF TEST FOR CLUSTERING

Mean Z-Value = -3.4571

P-Value of Mean = 0.00027298

Here the P-value, .000273, is of the same order of magnitude as the single sample above,
indicating that this single sample was fairly representative.22 However, it is of interest to
note that the single sample in JMPIN above, with a P-value of .0000546, is an order of
magnitude smaller. Hence this sample still indicates more significance than is warranted.
But nonetheless, a P-value of .000273 is still very significant – as it should be for this
redwood seedling example.

3.4 Bodmin Tors Example

The Redwood Seedling example above is something of a “straw man” in that statistical
analysis is hardly required to demonstrate the presence of such obvious clustering. Rather

22
Again it should be emphasized that this P-value has nothing to do with the sampling distribution in Figure 3.13. Rather it is the P-value for the mean Z-value under the normal distribution in (3.2.12).

it serves as an illustrative case where we know what the answer should be.23 However,
the presence of significant clustering (or dispersion) is often not so obvious. Our second
example, again taken from [BG] (Figure 3.2), provides a good case in point. It also serves
to illustrate some additional limitations of the above analysis.

Here the point pattern consists of granite outcroppings (tors) in the Bodmin Moor,
located at the very southern tip of England in Cornwall county, as shown in the inset map
to the right. (The granite in these tors was used for tomb stones during the Bronze Age,
and they have a certain historical significance in England.)

[Inset map: location of BODMIN MOOR in Cornwall]

The map in Figure 3.14a below shows a portion of the Moor containing $n = 35$ tors. A
randomly generated pattern of 35 tors is shown for comparison in Figure 3.14b.

[Fig. 3.14a. Bodmin Tors    Fig. 3.14b. Random Tors]

Here there does appear to be some clustering of tors relative to the random pattern on the
right. But it is certainly not as strong as the redwood seedling example above. So it is of
interest to see what the Clark-Evans test says about clustering in this case (see also
exercise 3.5 on pp.114-15 in [BG]). The maps in Figures 3.14a and 3.14b appear in the
ARCMAP project, bodmin.mxd, in the directory arview/project/Bodmin. The area,
$a(R) = 206.62$, of the region $R$ in Figure 3.14a is given in the Attribute Table of the
shapefile, bod_bdy.24 This point pattern data was imported to MATLAB and appears in
the matrix, Bodmin, of the data file, bodmin.mat, in the matlab directory. For our
present purposes it is of interest to run the following full-sample version of the Clark-Evans
test for clustering:

23
Such examples are particularly useful for providing consistency checks on statistical methods for
detecting clustering.
24
The area and distance scales for this pattern are not given in [BG].

>> D = ce_test(Bodmin,area,35,1);

RESULTS OF TEST FOR CLUSTERING

Z_Value = -1.0346

P_Value = 0.15043

Hence even with the full sample of data points, the Clark-Evans test yields no significant
clustering. Moreover, since subsampling will only act to reduce the level of significance,
this tells us that there is no reason to proceed further. But for completeness, we include
the following results for a subsample of size $m = 18$ (approximately 50%):25

>> ce_test_distr(Bodmin,area,18,1,1000);

RESULTS OF TEST FOR CLUSTERING

Mean Z-Value = -0.71318

P-Value of Mean = 0.23787

So even though there appears to be some degree of clustering, this is not detected by
Clark-Evans. It turns out that there are two key theoretical difficulties here that have yet
to be addressed. The first is that for point pattern samples as small as the Bodmin Tors
example, the assumption of asymptotic normality may be questionable. The second is
that nn-distances for points near the boundary of region $R$ are not distributed the same as
those away from the boundary. We shall consider each of these difficulties in turn.

First, with respect to normality, the usual rule-of-thumb associated with the Central Limit
Theorem is that sample means should be approximately normally distributed for
independent random samples of size at least 30 from distributions that are not too
skewed. Both of these conditions are violated in the present case. To achieve sufficient
independence in the present case, subsample sizes $m$ surely cannot be much larger than
20. Moreover, the sampling distribution of nn-distances in Figure 3.15 shows a definite
skewness (with a long right tail).

[Fig. 3.15. Bodmin nn-Distances]

This type of skewness is typical of nn-distances – even under the CSR hypothesis. [Under
CSR, the theoretical distribution of nn-distances is given by the Rayleigh density in
expression (2) of Section 2 in the Appendix to Part I, which is seen to have the same
skewness properties.]

25
Here we are not interested in saving the Z-values, so we have specified no outputs for ce_test_distr.

The second theoretical difficulty concerns the special nature of nn-distances near the
boundary of region R. The theoretical development of the CSR hypothesis explicitly
assumed that the region R is of infinite extent, so that such “edge effects” do not arise.
But in practice, many point patterns of interest occur in regions R where a significant
portion of the points are near the boundary of R. Recall from the discussion in Section
2.4 that if region R is viewed as a “window” through which part of a larger (stationary)
point process is being observed, then points near the boundary will tend to have fewer
observed neighbors than points away from the boundary. So in cases where the nearest
neighbor of a point in the larger process is outside R, the observed nn-distance for that
point will be greater than it should be (such as the example shown in Figure 3.16 below).
Thus the distribution of nn-distances for such points will clearly have higher expected
values than for interior points. For samples from CSR processes, this will tend to inflate
mean nn-distances relative to their theoretical values under the CSR hypothesis. This
edge effect will be demonstrated more explicitly in the next section.

[Fig. 3.16. Example of Edge Effect]

3.5 A Direct Monte Carlo Test of CSR

Given these shortcomings, we now develop a testing procedure that simulates the true
distribution of $\bar{D}_n$ in region $R$ for a given pattern size, $n$.26 While this procedure is
computationally more intensive, it will not only avoid the need for normal approximations,
but will also avoid the need for subsampling altogether. The key to this
procedure lies in the fact that the actual distribution of a randomly located point in $R$ can
easily be simulated on a computer. This procedure, known as rejection sampling, starts
by sampling random points from rectangles. Since each rectangle is the Cartesian product
of two intervals, $[a_1,b_1] \times [a_2,b_2]$, and since drawing a random number, $s_i$, from an
interval $[a_i,b_i]$ is a standard operation in any computer language, one can easily draw a
random point $s = (s_1,s_2)$ from $[a_1,b_1] \times [a_2,b_2]$. Hence for any given planar region, $R$, the
basic idea is to sample points from the smallest rectangle, $rec(R)$, containing $R$, and then
to reject any points which are not in $R$.

26
Procedures for simulating distributions by random sampling are known as “Monte Carlo” procedures.

To obtain $n$ points in $R$, one continues to reject points until $n$ are found in $R$. [Thus the
choice of $rec(R)$ is designed to minimize the expected number of rejected samples.] An
example for the case of Bodmin is illustrated in Figure 3.17, where for simplicity we have
sampled only $n = 10$ points. Here there are seen to be four sample points that were
rejected. The resulting sample points in $R$ then constitute an independent random sample
of size $n$ that by construction must satisfy the CSR hypothesis. To see this, note simply
that since the larger sample in $rec(R)$ automatically satisfies this hypothesis, it follows
that for any subset $C \subseteq R$ the probability that a point lies in $C$ given that it is in $R$ must
have the form:

[Fig. 3.17. Rejection Sampling]

(3.5.1)    $\Pr(C \,|\, R) = \frac{\Pr(C \cap R)}{\Pr(R)} = \frac{\Pr(C)}{\Pr(R)} = \frac{a(C)/a[rec(R)]}{a(R)/a[rec(R)]} = \frac{a(C)}{a(R)}$

Hence expression (2.1.2) holds, and the CSR hypothesis is satisfied. More generally, for
any pattern of size $n$ one can easily simulate as many samples of size $n$ from $R$ as
desired, and use these to estimate the sampling distribution of $\bar{D}_n$ under the CSR
hypothesis.
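A minimal MATLAB sketch of this rejection-sampling idea is given below. It assumes the boundary of R is available as closed polygon coordinate vectors, xv and yv, and uses only the built-in functions rand and inpolygon; it is an illustration of the idea, not the clust_sim code itself.

% Hypothetical sketch: draw n CSR points in a polygon R by rejection sampling
xmin = min(xv);  xmax = max(xv);      % bounding rectangle rec(R)
ymin = min(yv);  ymax = max(yv);
pts = zeros(n,2);  k = 0;
while k < n
    s = [xmin + (xmax-xmin)*rand, ymin + (ymax-ymin)*rand];  % uniform point in rec(R)
    if inpolygon(s(1),s(2),xv,yv)     % accept only points falling inside R
        k = k + 1;
        pts(k,:) = s;
    end
end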

This procedure has been operationalized in the MATLAB program, clust_sim.m. Here
the only additional input information required is the file of boundary points defining the
Bodmin region, $R$. The coordinates of these boundary points are stored in the 145 x 2
matrix, Bod_poly, in the data file, bodmin.mat. To display the first three rows and last
three rows of this file, first type Bod_poly(1:3,:), hit return, and then type
Bod_poly(143:end,:). You will then see that this matrix has the following form:

      1     144
    4.7    -9.7
    4.4   -10.2
      :       :
      :       :
    5.2    -9.2
    5.1    -9.2
    4.7    -9.7

Here the first row gives information about the boundary, namely that there is one
polygon, and that this polygon consists of 144 points. Each subsequent row contains the
(x,y) coordinates for one of these points. Notice also that the second row and the last row
are identical, indicating that the polygon is closed (and thus that there are only 144
distinct points in the polygon). This boundary information for R is necessary in order to
define the rectangle, $rec(R)$. It is also needed to determine whether a given point in
$rec(R)$ is also in $R$ or not. While this latter determination seems visually evident in the
present case, it turns out to be relatively complex from a programming viewpoint. A brief
description of this procedure is given in section 5 of the Appendix to Part I.


The program clust_sim is designed to estimate the sampling distribution of $\bar{D}_n$ by
simulating a large number, $N$, of random patterns of size $n$ in $R$, and then using this
statistical population to determine whether there is significant clustering in a given
observed pattern in $R$ with mean nn-distance, $\bar{d}_n$. To do so, observe that if $\bar{d}_n$ were in
fact a sample from this same distribution, then the probability $\Pr(\bar{D}_n \leq \bar{d}_n)$ of obtaining a
value as low as $\bar{d}_n$ can be estimated by the fraction of simulated mean nn-distance values
that do not exceed $\bar{d}_n$. More precisely, if $N_0$ denotes the number of simulated patterns
with mean nn-distances not exceeding $\bar{d}_n$, then this probability can be estimated as
follows:

(3.5.2)    $\widehat{\Pr}(\bar{D}_n \leq \bar{d}_n) = \frac{N_0}{N+1}$

Here the denominator $N+1$ includes the observed sample along with the simulated
samples. This estimate then constitutes the relevant P-value for a test of clustering
relative to the CSR hypothesis. Hence the testing procedure in clust_sim consists of the
following two steps:

(i) Simulate $N$ patterns of size $n$ and for each pattern $i = 1,..,N$ compute the
mean nn-distance, $\bar{d}_n^{(i)}$.

(ii) Determine the number of patterns, $N_0$, with $\bar{d}_n^{(i)} \leq \bar{d}_n$, and calculate the
P-value for $\bar{d}_n$ using (3.5.2) above.
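A hedged sketch of these two steps is given below. It assumes a function csr_sample(n,xv,yv), like the rejection sampler sketched earlier, and a function nn_dist(pts) returning the vector of nn-distances of a pattern; both are hypothetical stand-ins for the corresponding parts of clust_sim, and dbar_obs denotes the observed mean nn-distance.

% Hypothetical sketch of the Monte Carlo test in (3.5.2)
dbar_sim = zeros(N,1);
for i = 1:N
    pts_i = csr_sample(n,xv,yv);         % simulated CSR pattern of size n in R (hypothetical helper)
    dbar_sim(i) = mean(nn_dist(pts_i));  % mean nn-distance of pattern i (hypothetical helper)
end
N0   = sum(dbar_sim <= dbar_obs);        % simulated patterns at least as clustered as the observed one
pval = N0/(N + 1);                       % estimated P-value (3.5.2)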

To run this program we require one additional bit of information, namely the value of $\bar{d}_n$.
Given the output vector, D, of nn-distances for Bodmin tors obtained above from the
program, ce_test, this mean value (say m_dist) can be calculated by using the built-in
function, mean, in MATLAB as follows:

>> m_dist = mean(D);

In the present case, m_dist = 1.1038. To input this value into clust_sim, we shall use a
MATLAB data array known as a structure. Among their many uses, structures offer a
convenient way to input optional arguments into MATLAB programs. In the present
case, we shall input the value m_dist together with the number of bins to be used in
constructing a histogram display for the simulated mean nn-distance values. [The default
value in MATLAB, bins = 10, is useful for moderate sample sizes, say N = 100. But for
simulations with N = 1000, it is better to use bins = 20 or 25.] If you open the program,
clust_sim, you will see that the last input of this function is a structure, namely opts (for
"options"), that is described in more detail under INPUTS:


function OUT = clust_sim(poly,a,m,N,opts)

% CLUST_SIM.M simulates the sampling distribution of average


% nearest-neighbor distances in a fixed polygon. It can also determine
% the P-value for a given mean nearest-neighbor distance, if supplied.
%
% Written by: TONY E. SMITH, 12/31/00

% INPUTS:
% (i) poly = boundary file of polygon
% (ii) a = area of polygon
% (iii) m = number of points in polygon
% (iv) N = number of simulations
% (v) opts = an (optional) structure with variable inputs:
% opts.bins = number of bins in histogram (default = 10)
% opts.m_dist = mean nearest-neighbor distance for testing

To define this structure in the present case, we shall use the value of m_dist just
calculated, and shall set bins = 20. This is accomplished by the two commands:

>> opts.m_dist = m_dist; opts.bins = 20;

Notice that opts is automatically defined by simply specifying its components.27 The key
point is that only the structure name, opts, needs to be specified in the command line.
The program clust_sim will look to see if either of these components for opts have been
specified. So if you want to use the default value of bins, just leave out this command.
Moreover, if you just want to look at the histogram of simulated values (and not run a test
at all), simply leave opts out of the command line. This is what is meant in the
description above when opts is referred to as an “(optional) structure”.
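The way such optional fields are typically read inside a MATLAB function can be sketched as follows, using the built-in functions nargin and isfield; this is only an illustration of the mechanism, not the actual clust_sim code.

% Hypothetical sketch: reading optional fields of an opts structure inside a function
if nargin < 5, opts = struct; end                               % no opts supplied at all
if isfield(opts,'bins'), bins = opts.bins; else bins = 10; end  % default bin count
do_test = isfield(opts,'m_dist');                               % run the test only if m_dist was supplied
if do_test, m_dist = opts.m_dist; end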

Given these preliminaries, we are now ready to run the program, clust_sim, for Bodmin.
To do so, enter the command line:

>> clust_sim(Bod_poly,area,35,1000,opts);

Here we have specified n = 35 for the Bodmin case, and have specified that N = 1000
simulated patterns be constructed. The screen output will start with successive displays:

percent_done = 10
percent_done = 20
:
percent_done = 100

27
Note also that we have put both commands on the same line to save room. Just remember to separate each command by a semicolon (;).

that indicate how the simulations are proceeding. The final screen output will then
include both a histogram of mean nn-distance values, and some numerical outputs, as
described in the “SCREEN OUTPUT” section of the comments in clust_sim. The
histogram will be something like that shown in Figure 3.18 below (the red vertical bar
will be discussed below):

[Fig. 3.18. Histogram of Mean nn-Distances]

Note first that in spite of the relatively skewed distribution of observed nn-distance
values for Bodmin, this simulated distribution of mean nn-distances appears to be
approximately normal. Hence, given the sample size, $n = 35$, it appears that the
dependencies between nn-distance values in this Bodmin region are not sufficient to rule
out the assumption of normality used in the Clark-Evans test.

But in spite of its normality, this distribution is noticeably different from that predicted
by the CSR hypothesis. To see this, recall first that for the given area of Bodmin,
$a(R) = 206.6$, the point density estimate is given by $\hat{\lambda} = 35/206.6 = .1694$. Hence the
theoretical mean nn-distance value predicted by the CSR hypothesis is

(3.5.3)    $\hat{\mu} = \frac{1}{2\sqrt{\hat{\lambda}}} = 1.215$

However, if we now look at the numerical screen output for this simulation, we have

CLUST_SIM RESULTS

SIM_MEAN_DIST = 1.3087

M_DIST = 1.1038

P-VALUE FOR M_DIST = 0.044955


Here the first line reports the mean value of the 1000 simulated mean nn-distances. But
since (by the Law of Large Numbers) a sample this large should give a fairly accurate
estimate of the true mean, $E(\bar{D}_n)$, we see that this true mean is considerably larger than
that predicted by the CSR hypothesis above.28 The key point to note here is that the edge
effects depicted in Figure 3.16 above are quite significant for pattern sizes as small as
$n = 35$ relative to the size of the Bodmin region, $R$.29 So this simulation procedure does
indeed give a more accurate distribution of nn-distances in the Bodmin region under the
CSR hypothesis.

Observe next that the second line of screen output above gives the value of opts.m_dist
as noted above (assuming this component of opts was included). The final line is the
critical one, and gives the P-value for opts.m_dist, as estimated by (3.5.2) above. Hence,
unlike the Clark-Evans test where no significant clustering was observed (even under full
sampling), the present procedure does reveal significant clustering.30 This is shown by
the position of the red vertical bar in Figure 3.18 above (at approximately the value
m_dist = 1.1038). Here there are seen to be only a few simulated values lower than
m_dist. Moreover, the discussion above now shows why this result differs from Clark-Evans.
In particular, by accounting for edge effects, this procedure reveals that under the
CSR hypothesis, mean nn-distance values for Bodmin should be higher than those
predicted by the Clark-Evans model. Hence the observed value of m_dist is actually
quite low once this effect is taken into account.

28 You can convince yourself of this by running clust_sim a few times and observing that the variation in these estimated mean values is quite small.
29 Note that as the sample size n becomes larger, the expected nn-distance, E(Dn), for a given region, R, becomes smaller. Hence the fraction of points sufficiently close to the boundary of R to be subject to edge effects eventually becomes small, and this edge effect disappears.
30 Note again that this P-value will change each time clust_sim is run. However, by trying a few runs you will see that all values are close to .05.

4. K-Function Analysis of Point Patterns

In the Bodmin Tors example above, notice from Figure 3.14a (p.20) that the clustering
structure is actually quite different from that of the Redwood Seedling example in Figure
3.12a (p.12). Rather than small isolated clumps, there appear to be two large groups of
points in the northwest and southwest, separated by a large empty region. Moreover, the
points within each group are actually quite evenly spaced (locally dispersed). These
observations suggest that the pattern of tors exhibits different structures at different
scales. Hence the objective of the present section is to introduce a method of point pattern
analysis that takes such scale effects into account, and in fact allows “scale” to become a
fundamental variable in the analysis.

4.1 Wolf-Pack Example

To motivate the main ideas, we begin with a new example involving wolf packs. A map
is shown in Figure 4.1a below representing the relative locations of wolf packs in a
portion of the Central Arctic Region in 1998.1 The enlarged portion in Figure 4.1b is a
schematic map depicting individual wolves in four of these packs.


   


  Wolf packs

 


 
  

   

0

50 km

Fig.4.1a. Map of Wolf Packs Fig.4.1b. Enlarged Portion

At the level of individual wolf locations in Figure 4.1b, there is a pattern of isolated
clumps that bears a strong resemblance to that of the Redwood seedlings above.2
Needless to say, this pattern would qualify as strongly clustered. But if one considers the
larger map in Figure 4.1a, a different picture emerges. Here, the dominant feature is the
remarkable dispersion of wolf packs. Each pack establishes a hunting territory large
enough for its survival (roughly 15 to 20 km in diameter), and actively discourages other

1 This map is based on a more detailed map published in the Northwest Territories Wolf Notes, Winter 1998/99. See the class file: ese502/extra_materials/wolf_packs.jpg.
2 The spacing of individual wolves is of course exaggerated to allow a representation at this scale.


packs from invading its territory.3 Hence this pattern of wolf locations is very clustered at
small scales, and yet very dispersed at large scales.

But if one were to analyze this wolf-location pattern using any of the nearest-neighbor
techniques above, it is clear that only the small-scale clustering would be detected. Since
each wolf is necessarily close to other wolves in the same den, the spacing between dens
would never be observed. In this simple example one could of course redefine wolf dens
to be aggregate “points”, and analyze the spacing between these aggregates at a larger
scale. But there is no way to analyze multiple scales using nearest neighbors without
some form of re-aggregation.4

4.2 K-Function Representations

To capture a range of scales in a more systematic way, we now consider what amounts to
an extension of the quadrat (or cell-count) method discussed in section 1 above. In
particular, recall that the quadrat method was criticized for being too dependent on the
scale of individual cells. Hence the key idea of K-functions is to turn this dependency
into a virtue by explicitly incorporating “scale” as a variable in the analysis. Thus, rather
than fixing the scale and locations of cell grids, we now consider randomly sampled cells
of varying sizes. While many sampling schemes of this type can be defined, we shall
focus on the single most basic scheme which is designed to answer the following
question for a given point process with density λ: What is the expected number of point
events within distance h from any randomly sampled point event? Note that this expected
number is not very meaningful without specifying the point density, λ, since it will of
course increase with λ. Hence if we divide by λ in order to eliminate this obvious
"density effect", then the quantities of interest take the form:

(4.2.1)    K(h) = (1/λ) · E(number of additional events within distance h of an arbitrary event)

If we allow the distance or scale, h , to vary then expression (4.2.1) is seen to define a
function of h , designated as a K-function.5 As with nn-distances, these values, K (h) ,
yield information about clustering and dispersion. In the wolf-pack example above, if one
were to define K (h) with respect to small distances, h , around each wolf in Figure 4.1b,
then given the close proximity to other wolves in the same pack, these values would
surely be too high to be consistent with CSR for the given density of wolves in this area.
Similarly, if one were to define K (h) with respect to much larger distances, h , around
each wolf in Figure 4.1a, then given the wide spacing between wolf packs (and the
relative uniformity of wolf-pack sizes6), these values would surely be too low to be
3 Since wolves are constantly on the move throughout their hunting territories, the actual locations shown in Figure 4.1a are roughly at the centers of these territories.
4 One could also incorporate larger scales by using higher-order nearest neighbors [as discussed for example in Ripley (1996, sec.6.2)]. But these are not only more complex analytically, they are difficult to associate with specific scales of analysis.
5 This concept was popularized by the work of Ripley (1976, 1977). Note also that following standard convention, we now denote distance by h to distinguish it from nn-distance, d.
6 Wolf packs typically consist of six to eight wolves (see the references in footnote 1 above).


consistent with CSR for the given density of wolves. Hence if one can identify
appropriate benchmark values for K(h) under CSR, then these K-functions can be used
to test for clustering and dispersion at various scales of analysis. We shall consider these
questions in more detail in Section 4.4 below.

But for the moment, there are several features of definition (4.2.1) that warrant further
discussion. First, while the distance metric in (4.2.1) is not specified, we shall always
refer to Euclidean distance, d(s,v), between pairs of points, as defined in expression (3.2.1)
above. Hence with respect to any given point event, s, the expected number of point
events within distance h of s is simply the expected number of such events in a circle of
radius h about s, as shown in Figure 4.2 below.


Fig.4.2. Interpretation of K(h)

This graphical image helps to clarify several additional assumptions implicit in the
definition of K (h) . First, since this value is taken to depend only on the size of the circle
(i.e., the radius h ) and not its position (i.e., the coordinates of s ) there is an implicit
assumption of spatial stationarity [as in expression (2.5.1) above]. In other words, it is
assumed that the expected number of additional points in this circle is the same regardless
of where s is located. (This assumption will later be relaxed in our Monte Carlo
applications of K-functions).

Observe next that the circularity of this region implicitly assumes that direction is not
important, and hence that the underlying point process is isotropic (as in Figure 2.2
above). On the other hand, if the point process of interest were to exhibit some clear
directionality, such as the vertical directionality in shown in Figure 2.3 above, then it
might be more appropriate to use directional ellipses as defined by weighted Euclidean
distances of the form:

(4.2.2) d ( s, v)  w1  ( s1  v1 ) 2  w2  ( s2  v2 ) 2

where the weights w1 and w2 reflect relative sensitivities of point counts to movements
in the horizontal or vertical direction, respectively.7 More generally, if the relevant point

7 One can also use appropriate quadratic forms to define anisotropic distances with any desired directional orientations. We shall consider such distances in more detail in the analysis of spatial variograms in Part II of this NOTEBOOK.


events occur in specific environments (such as the patterns of Philadelphia housing abandonments in Figures 1.4 and 1.5), then the relevant distances might be determined by these environments (such as travel distance on the Philadelphia street system).8
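As an illustration of (4.2.2), such a weighted distance can be coded in a few lines of MATLAB (a minimal sketch; the function name and the weight values in the call are purely illustrative):

function d = aniso_dist(s,v,w)
% ANISO_DIST computes the weighted Euclidean distance in (4.2.2)
% s,v = coordinate pairs (x,y);  w = [w1 w2] = directional weights
d = sqrt( w(1)*(s(1)-v(1))^2 + w(2)*(s(2)-v(2))^2 );

>> d = aniso_dist([0 0],[3 4],[1 2]);   % illustrative call with w1 = 1, w2 = 2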

Finally, it is important to emphasize that the expected value in (4.2.1) is a conditional expected value. In particular, given that there is a point event, s, at the center of the circle in Figure 4.2 above, this value gives the expected number of additional points in this circle. This can be clarified by rewriting K(h) in terms of conditional expectations. In particular, if [as in Section 3.2.1 above] we now denote the circle in Figure 4.2 minus its center by

(4.2.3)    C_h - {s} = {v ∈ R : 0 < d(v,s) ≤ h}

then K (h) can be written more precisely as follows:

(4.2.4)    K(h) = (1/λ) · E[ N(C_h - {s}) | N(s) = 1 ]

To see the importance of this conditioning, recall from expression (2.3.4) that for any
stationary process (not just CSR processes) it must be true that the expected number of
points in Ch  {s} is simply proportional to its area, i.e., that

(4.2.5)    E[ N(C_h - {s}) ] = λ · a(C_h - {s})

But this is not true of the conditional expectation above. Recall from the wolf-pack case,
for example, that for small circles around any given wolf, the expected number of
additional wolves is much larger than what would be expected based on area alone [i.e., is
larger than λ·a(C_h - {s})]. These ideas will be developed in more detail in Section 4.4,
where it is shown that such deviations from simple area proportionality form the basis for
all K-function tests of the CSR Hypothesis.

4.3 Estimation of K-Functions

Given this general definition of K-functions as (conditional) expected values, we now consider the important practical question of estimating these values. To do so, we introduce the following notation for analyzing point counts. For any given realized point pattern, S_n = (s_i : i = 1,..,n), and pair of points s_i, s_j ∈ S_n, we now denote the Euclidean distance between them by

(4.3.1)    d_ij = d(s_i, s_j)

and for any distance, h, define the indicator function, I_h, for point pairs in S_n by

8 Here it should be noted that tools are available in the spatial analyst extension of ARCMAP for constructing cost-weighted and shortest-path distances. However, we shall not do so in this NOTEBOOK.


(4.3.2)    I_h(d_ij) = I_h[d(s_i, s_j)] = { 1 if d_ij ≤ h ;  0 if d_ij > h }

From this definition it follows at once that for any given point s_i ∈ S_n, the total number of additional points s_j within distance h of s_i is given by the sum Σ_{j≠i} I_h(d_ij). Hence, if i now refers to a randomly selected point generated by a point process on R, and if both the number and locations of points in R are treated as random variables, then in terms of (4.3.2) the K-function in (4.2.1) above can now be given the following equivalent definition:

(4.3.3)    K(h) = (1/λ) · E[ Σ_{j≠i} I_h(d_ij) ]

Observe also that for stationary point processes the value of K(h) must be independent of the particular point event i chosen. So multiplying through by λ in (4.3.3) and summing over all point events i = 1,..,n in region R, it follows that

(4.3.4)    E[ Σ_{j≠i} I_h(d_ij) ] = λ·K(h) ,  i = 1,..,n
           ⇒  Σ_{i=1}^{n} E[ Σ_{j≠i} I_h(d_ij) ] = n·λ·K(h)
           ⇒  K(h) = (1/(λn)) Σ_{i=1}^{n} E[ Σ_{j≠i} I_h(d_ij) ]

This "pooled" version of K(h) motivates the following pooled estimate of K(h), designated as the sample K-function,

(4.3.5)    K̂(h) = (1/(λ̂n)) Σ_{i=1}^{n} Σ_{j≠i} I_h(d_ij)

where again, λ̂ = n/a(R).9 The advantage of this estimator is that it uses all points of the given realized point pattern, S_n, in R. To interpret K̂(h), note that if we rewrite (4.3.5) as

(4.3.6)    K̂(h) = (1/λ̂) · [ (1/n) Σ_{i=1}^{n} Σ_{j≠i} I_h(d_ij) ]

then the expression in brackets is seen to be simply an average of the relevant point counts for each of the pattern points, s_i ∈ S_n. Hence, if the underlying process were truly stationary (and edge effects were small) then this sample K-function would be approximately unbiased (and reasonably efficient) as an estimator of the common expected point count E[ Σ_{j≠i} I_h(d_ij) ] in (4.3.3).10

9 At this point it should be noted that our notation differs from [BG], where regions are denoted by a script R with area R. Here we use R for region, and make the area function, a(R), explicit. In these terms, (4.3.5) is seen to be identical to the estimate at the top of p. 93 in [BG], where 1/(λ̂n) = a(R)/n².
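Before turning to edge effects, it may help to see this estimator in computational form. The following is a minimal MATLAB sketch of (4.3.5); the function name sample_k and its inputs are purely illustrative (this is not one of the class programs), and no edge correction is applied:

function Khat = sample_k(loc,area,H)
% SAMPLE_K computes the sample K-function in (4.3.5) with no edge correction
% loc = n x 2 matrix of point locations (xi,yi); area = a(R); H = vector of h-values
n = size(loc,1);
lam = n/area;                                  % estimated point density
[X1,X2] = meshgrid(loc(:,1),loc(:,1));         % pairwise coordinate differences
[Y1,Y2] = meshgrid(loc(:,2),loc(:,2));
D = sqrt((X1-X2).^2 + (Y1-Y2).^2);             % pairwise distances d_ij
Khat = zeros(length(H),1);
for k = 1:length(H)
    I = (D <= H(k)) & (D > 0);                 % indicators I_h(d_ij), excluding j = i
    Khat(k) = sum(I(:))/(lam*n);               % pooled count in (4.3.5)
end

(Here the condition D > 0 excludes each point from its own count, and so assumes that no two points share exactly the same location.)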

However, since this idealization can never hold exactly in bounded regions R, it is necessary to take into account the edge effects created by the boundary of R. Unlike the case of nn-distances, where the expected values of nn-distances are increased for points near the boundary (as in Figure 3.16), the expected values of point counts are reduced for these points, as shown in Figure 4.3a below.
Fig.4.3a. Edge Effects for K(h) Fig.4.3b. Ripley’s Correction

To counter this downward bias, Ripley (1976) proposed a “corrected” version of (4.3.5)
that is quite effective in practice. His correction consists of weighting each point, s j , in
the count Σ_{j≠i} I_h(d_ij) in a manner that inflates counts for points near the boundary. If one
considers the circle about si passing through s j (as shown in Figure 4.3b) and defines
wij to be the fraction of its circumference that lies inside R, then the appropriate
reweighting of s j in the count for si is simply to divide I h (dij ) by wij , producing a new
estimate known as Ripley’s correction:

(4.3.7)    K̂(h) = (1/(λ̂n)) Σ_{i=1}^{n} Σ_{j≠i} I_h(d_ij)/w_ij

One can gain some intuition here by observing in Figure 4.3b that weights will be unity unless the circle about s_i passing through s_j actually leaves R. So only those point pairs will be involved that are close to the boundary of R, relative to distance h. Moreover, the closer that s_j is to the edge of R, the more of this circumference is outside R, and hence the smaller w_ij becomes. This means that the values I_h(d_ij)/w_ij are largest for points closest

10 For further discussion of this approximate unbiasedness, see Ripley (1977, Section 6).


to the edge, thus inflating K̂(h) to correct the bias. [An explicit derivation of Ripley's correction is given in Section 6 of the Appendix to Part I.]
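The weight w_ij itself is easy to approximate numerically: discretize the circle about s_i passing through s_j and compute the fraction of these circle points that fall inside the boundary polygon. A minimal sketch (not one of the class programs; the function name is hypothetical, and the boundary polygon is assumed to be supplied as a two-column matrix of vertex coordinates):

function w = ripley_weight(si,sj,poly,K)
% RIPLEY_WEIGHT approximates the edge-correction weight w_ij in Figure 4.3b
% si,sj = coordinate pairs (x,y); poly = boundary polygon (two-column vertex list)
% K = number of points used to discretize the circle (say K = 360)
h = sqrt(sum((si-sj).^2));                     % circle radius = d_ij
theta = linspace(0,2*pi,K+1); theta(end) = [];
cx = si(1) + h*cos(theta);                     % points on the circle about s_i
cy = si(2) + h*sin(theta);
in = inpolygon(cx,cy,poly(:,1),poly(:,2));
w = sum(in)/K;                                 % fraction of circumference inside R

Dividing each indicator I_h(d_ij) by this weight then yields the corrected estimate in (4.3.7).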

It should be emphasized that while Ripley’s correction is very useful for estimating the
true K-function for a given stationary process, this is usually not the question of most
interest. As we have seen above, the key questions relate to whether this process exhibits
structure other than what would be expected under CSR, and how this structure may vary
as the spatial scale of analysis is increased. Here it turns out that in most cases, Ripley’s
correction is not actually needed. Hence this correction will not be used in the analysis to
follow.11

4.4 Testing the CSR Hypothesis

To apply K-functions in testing the CSR Hypothesis, it is convenient to begin by ignoring edge effects, and considering the nature of K-functions under this hypothesis for points, s ∈ R, and distances, h, that are not influenced by edge effects. Hence, in contrast to Figure 4.3a above, we now assume that the set of locations, C_h, within distance h of s is entirely contained in R, i.e., that

(4.4.1)    C_h = {v ∈ R : d(s,v) ≤ h} ⊆ R

Next recall from the basic independence assumption about individual point locations in CSR processes (Section 2.2 above) that for such processes, the expected number of points in C_h - {s} does not depend on whether or not there is a point event at s, so that

(4.4.2)    E[ N(C_h - {s}) | N(s) = 1 ] = E[ N(C_h - {s}) ]

Hence from expression (4.2.3), together with the area formula for circles [and the fact that a(C_h - {s}) = a(C_h)], it follows that

(4.4.3)    E[ N(C_h - {s}) | N(s) = 1 ] = λ·a(C_h - {s}) = λ·a(C_h) = λπh²

which together with expression (4.2.4) yields the following simple K-function values:

(4.4.4)    K(h) = (1/λ)·(λπh²) = πh²

Thus by standardizing with respect to density, λ, and ignoring edge effects as in (4.4.1), we see that the K-function reduces simply to area under the CSR Hypothesis. Note also that when K(h) > πh², this implies a mean point count higher than would be expected under CSR, and hence indicates some degree of clustering at scale h (as illustrated in

11 Readers interested in estimating the true K-function for a given process are referred to Section 8.4.3 in Cressie (1993), and to the additional references found therein.


Section 4.2 above). Similarly, a value K(h) < πh² implies a mean point count lower than would be expected under CSR, and hence indicates some degree of dispersion at scale h. Thus for any given h > 0,

(4.4.5)    K(h) > πh²  ⇒  clustering at scale h
           K(h) < πh²  ⇒  dispersion at scale h

While these relations are adequate for testing purposes, area values are difficult to interpret directly. Hence it is usually convenient to further standardize K-functions in a manner that eliminates the need for considering these values. If for each h we let

(4.4.6)    L(h) = √[K(h)/π] - h

then under CSR, this L-function has the property that

(4.4.7)    L(h) = √[πh²/π] - h = h - h = 0

for all h > 0. In other words, this associated L-function is identically zero under CSR. Moreover, since L(h) is an increasing function of K(h), it follows that L(h) is positive exactly when K(h) > πh², and is negative exactly when K(h) < πh². Hence the relations in (4.4.5) can be given the following simpler form in terms of L-functions:

(4.4.8)    L(h) > 0  ⇒  clustering at scale h
           L(h) < 0  ⇒  dispersion at scale h

Given the estimate, K̂(h), in (4.3.7) above, one can estimate L(h) by

(4.4.9)    L̂(h) = √[K̂(h)/π] - h

and can in principle use (4.4.8) to test for clustering or dispersion.
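In MATLAB, converting K̂(h) to L̂(h) is a one-line operation. For example, using the hypothetical sample_k sketch from Section 4.3 above (the class program k_function described next performs the same computation internally):

>> H = linspace(0.5,10,20);               % illustrative range of h-values
>> Khat = sample_k(Bodmin,area,H);        % sample K-function (no edge correction)
>> Lhat = sqrt(Khat/pi) - H(:);           % L-function values in (4.4.9)
>> plot(H,Lhat,H,zeros(size(H)),'--');    % compare with the CSR reference L(h) = 0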

4.5 Bodmin Tors Example

We can apply these testing ideas to Bodmin by using the MATLAB program,
k_function.m. The first few lines of this program are shown below:


function C = k_function(loc,area,b,extent)

% K_FUNCTION computes the k-Function for a point pattern
% and plots the normalized L-Function (without
% edge corrections)

% Written by: TONY E. SMITH, 11/26/01

% INPUTS:
% (i) loc = file of locations (xi,yi), i=1..m
% (ii) area = area of region
% (iii) b = number of bins to use in CDF (and plot)
% (iv) extent = 1 if max h = half of max pairwise distance (typical case)
% = 2 if max h = max pairwise distance to be considered
% DATA OUTPUTS: C = (1:b) vector containing raw Point Count
% SCREEN OUTPUTS: Plot of L-Function over the specified extent.

To apply this program, again open the data file, Bodmin.mat, and recall that the tor
locations are given in the matrix, Bodmin. As seen above, the program first computes
K̂(h) for a range of distance values, h, and then converts this to L̂(h) and plots these
values against the reference value of zero. The maximum value of h for this illustration
is chosen to be the maximum pairwise distance between pattern points (tors), listed as
option 2 in input (iv) above. The number of intermediate distance values (bins) to be used
is specified by input (iii). Here we set b = 20. Hence to run this program, type:

>> k_function(Bodmin,area,20,2);

The resulting plot is shown in Figure 4.4 below. Here the horizontal line indicates the "theoretical" values of L(h) under the CSR Hypothesis. So it would appear that there is some degree of clustering at small scales, h. However, recall that the above analysis was predicated on the assumption of no edge effects. Since there are clearly strong edge effects in the Bodmin case, the real question here is how to incorporate these effects in a manner that will allow a meaningful test of CSR.

Fig.4.4. Bodmin L-function


One approach is suggested by recalling that a random point pattern for Bodmin was also
generated in Figure 3.14b above. Hence if the L-function for such a random pattern is
plotted, then this can serve as a natural benchmark against which to compare the L-
function for tors. This random pattern is contained in the matrix, Bod_rn2, of data file
Bodmin.mat (and is also shown again in Figure 4.7 below). Hence the corresponding
command, k_function(Bod_rn2,area,20,2), now yields a comparable plot of this
benchmark L-function as shown in Figure 4.5 below.

Fig.4.5. Random L-function Fig.4.6. L-function Overlay

Here it is clear that the L-function for this random pattern is not flat, but rather is
everywhere negative, and decreases at an increasing rate. Hence relative to zero, this
pattern appears to exhibit more and more dispersion as the scale increases.

The reason for this of course is that the theory above [and expression (4.4.1) in particular] ignores those points near the boundary of the Bodmin region, such as the point shown in Figure 4.7. Here it is clear that for sufficiently small scales, h, there is little effect on L̂(h), so that values are close to zero for small h. But as this radius increases, it is also clear that most of the circle is eventually outside of R, and hence is mostly empty. Thus, given the estimated point density, λ̂, for Bodmin tors inside R, point counts for large h start to look very small relative to the area πh². This is precisely the effect that Ripley's correction [expression (4.3.7)] attempts to eliminate.12

Fig.4.7. Bodmin Edge Effect

12 A nice comparison of Ripley's correction with uncorrected L-functions (such as in Figure 4.4 above) is given in Figure 8.15 of Cressie (1993, p.617).


But if we now ignore the zero reference line and use this random L-function as a benchmark, then a perfectly meaningful comparison can be made by overlaying these two L-functions, as in Figure 4.6 above. Here one can see that the region of relative clustering is now considerably larger than in Figure 4.4, and occurs up to a scale of about h = 8 (see the scale shown in Figure 3.14). But observe that even these benchmark comparisons have little meaning at scales so large that circles of radius h around all pattern points lie mostly outside the relevant region R. For this reason, the commonly accepted rule-of-thumb is that for any given point pattern, S_n, one should not consider h-values larger than half the maximum pairwise distance between pattern points. Hence if we now denote the maximum pairwise distance for S_n by h_max = max{d(s_i,s_j) : s_i,s_j ∈ S_n}, and use h̄ to indicate the largest value of h to be considered in a given case, then the standard rule-of-thumb is to set

(4.5.1)    h̄ = h_max/2

This corresponds to option 1 for input (iv) of k_function above, and option 2 corresponds to h̄ = h_max. We shall have occasion to use (4.5.1) in many of our subsequent analyses, and in fact this will usually denote the "default" value of h̄.
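This default value is easily computed from the pattern coordinates. For example, for the Bodmin tors:

>> [X1,X2] = meshgrid(Bodmin(:,1),Bodmin(:,1));
>> [Y1,Y2] = meshgrid(Bodmin(:,2),Bodmin(:,2));
>> Dmax = max(max(sqrt((X1-X2).^2 + (Y1-Y2).^2)));   % maximum pairwise distance
>> hbar = Dmax/2;                                    % rule-of-thumb value in (4.5.1)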

A more important limitation of this benchmark comparison is that (like the JMPIN
version of the Clark-Evans test in Section 3.3.1 above) the results necessarily depend on
the random point pattern that is chosen for a benchmark. Hence we now consider a much
more powerful testing procedure using Monte Carlo methods.

4.6 Monte Carlo Testing Procedures

As we saw in Section 3.5 above, it is possible to use Monte Carlo methods to estimate the
sampling distribution of nn-distances for any pattern size in a given region of interest.
This same idea extends to the sampling distribution of any statistics derived from such
patterns, and is of sufficient importance to be stated as a general principle:

SIMULATION PRINCIPLE: To test the CSR Hypothesis for any point pattern, S_n, of size n in a given region, R, one can simulate a large number of random point patterns, {S_n^(i) : i = 1,..,N}, of the same size, and compare S_n with this statistical population.

Essentially, this simulation procedure gives us a clear statistical picture of what realized
patterns from a CSR process on R should look like. In the case of K-function tests of
CSR, we first consider the standard application of these ideas in terms of “simulation
envelopes”. This method is then refined in terms of a more explicit P-value
representation.


4.6.1 Simulation Envelopes

The essential idea here is to simulate N random patterns as above and to compare the observed estimate L̂(h) with the range of estimates L̂_i(h), i = 1,..,N, obtained from this simulation. More formally, if one defines the lower-envelope and upper-envelope functions respectively by

(4.6.1)    L_N(h) = min{L̂_i(h) : i = 1,..,N}

(4.6.2)    U_N(h) = max{L̂_i(h) : i = 1,..,N}

then L̂(h) is compared with L_N(h) and U_N(h) for each h. So for a given observed pattern, S_n, in region R, the steps of this Monte Carlo testing procedure can be outlined as follows:

(i) Generate a number of random patterns, {S_n^(i) : i = 1,..,N}, of size n in region R (say N = 99).

(ii) Choose a selection of h-values, H = {h1, h2,...}, and compute L̂_i(h) for each h ∈ H and i = 1,..,N.

(iii) Form the lower- and upper-envelope functions, L_N(h) and U_N(h), in (4.6.1) and (4.6.2).

(iv) Plot the L-values, L̂(h), for the observed pattern S_n along with the upper and lower values, U_N(h) and L_N(h), for each h ∈ H. (A minimal MATLAB sketch of these steps is given below.)
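The sketch assumes a hypothetical generator rand_pattern(n,poly) that returns n random points in the polygon poly, and uses the hypothetical sample_k sketch above (the class program k_function_sim below packages this entire procedure):

% Simulation envelope for an observed pattern, loc, in polygon, poly (sketch)
N = 99; H = linspace(0.5,8,20);                 % number of simulations and h-values
L0 = sqrt(sample_k(loc,area,H)/pi) - H(:);      % observed L-values, as in (4.4.9)
Lsim = zeros(length(H),N);
for i = 1:N
    S = rand_pattern(size(loc,1),poly);         % hypothetical random-pattern generator
    Lsim(:,i) = sqrt(sample_k(S,area,H)/pi) - H(:);
end
LN = min(Lsim,[],2); UN = max(Lsim,[],2);       % envelopes in (4.6.1) and (4.6.2)
plot(H,L0,'b',H,LN,'r--',H,UN,'r--');           % observed L-function with envelope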

The result of this procedure is to yield a plot similar to that shown in Figure 4.8 below. Here the blue region indicates the area in which the observed L-function, L̂(·), is outside the range defined by the upper- and lower-envelope functions, U_N(·) and L_N(·). In the case shown, this area is above the envelope, indicating that there is significant clustering relative to the simulated population under CSR.

Fig.4.8. Simulation Envelope

The key difference between this figure and Figure 4.6 above is that, rather than a single
benchmark pattern, we now have a statistical population of patterns for gauging the


significance of L̂(·). This plot in fact summarizes a series of statistical tests at each scale of analysis, h ∈ H. In the case illustrated, if we consider any h under the blue area in Figure 4.8, then by definition, L̂(h) > U_N(h). But if pattern S_n were just another sample from this population of random patterns, then every sample value {L̂(h), L̂_1(h),.., L̂_N(h)} would have the same chance of being the biggest. So the chance that L̂(h) is the biggest is only 1/(N+1). More formally, if pattern S_n is consistent with the CSR Hypothesis then:

(4.6.3)    Pr[ L̂(h) > U_N(h) ] = 1/(N+1) ,  h ∈ H

(4.6.4)    Pr[ L̂(h) < L_N(h) ] = 1/(N+1) ,  h ∈ H

These probabilities are thus seen to be precisely the P-values for one-tailed tests of the CSR Hypothesis against clustering and dispersion, respectively. For example, if N = 99 [as in step (i) above] then the chance that L̂(h) > U_N(h) is only 1/(99+1) = .01. Hence at scale, h, one can infer the presence of significant clustering at the .01-level. Similarly, if there were any h ∈ H with L̂(h) < L_N(h) in Figure 4.8, then at this scale one could infer the presence of significant dispersion at the .01-level. Moreover, higher levels of significance could easily be explored by simulating larger numbers of random patterns, say N = 999.

This Monte Carlo test can be applied to the Bodmin example by using the MATLAB
program, k_function_sim.m, shown below.

function k_function_sim(loc,area,b,extent,sims,poly)

% K_FUNCTION_SIM computes the k-Function for a point
% pattern plus N random point patterns for a single polygon and
% plots the normalized L-Function plus Upper and Lower envelopes

% INPUTS:
% (i) loc = file of locations (xi,yi), i=1..n
% (ii) area = area of region
% (iii) b = number of bins to use in CDF (and plot)
% (iv) extent = 2 if max h = max pairwise distance to be considered
% = 1 if max h = half of max pairwise distance (typical case)
% (v) sims = number of simulated random patterns
% (vi) poly = polygon boundary file


Note that the two key additional inputs are the number of simulations (here denoted by
sims rather than N) and the boundary file, poly, for the region, R. As with the program,
clust_sim, in Section 3.5 above, poly is needed in order to generate random points in R.

To apply this program to Bodmin with sims = 99, be sure the data file, Bodmin.mat, is
open in the Workspace, and write:

>> k_function_sim(Bodmin,area,20,1,99,Bod_poly);

The results of this program are shown in Figure 4.9 below. Notice first that there is again some clustering, and that now it can be inferred that this clustering is significant at the .01-level (N = 99). Notice also that the range of significant clustering is considerably smaller than that depicted in Figure 4.6 above. This will almost always be the case, since here the L̂(h) values must be bigger than 99 other random values, rather than just one "benchmark" value. Notice also that this scale, roughly 1.5 ≤ h ≤ 4.5, appears to be more consistent with Figure 3.14a.

Fig.4.9. Bodmin Envelope Test

However, this approach is still rather limited in the sense that it provides information
only about the relation of Lˆ (h) to the maximum and minimum simulated values
U_N(h) and L_N(h) for each h ∈ H. Hence the following refinement of this approach is
designed to make fuller use of the information obtained from the above Monte Carlo
procedure.

4.6.2 Full P-Value Approach

By focusing on the maximum and minimum values, U_N(h) and L_N(h), for each h ∈ H, the only P-values that can be obtained are those in (4.6.3) and (4.6.4) above. But it is clear, for example, that values of L̂(h) that are just below U_N(h) are probably still very significant. Hence a natural extension of the above procedure is to focus directly on P-values for clustering and dispersion, and attempt to estimate these values on the basis of the given samples. Turning first to clustering, the appropriate P-value is given by the answer to the following question: If the observed pattern were coming from a CSR process in region R, then how likely would it be to obtain a value as large as L̂(h)? To answer this question let the observed L-value be denoted by l₀ = L̂(h), and let the random variable, L_CSR(h), denote the L-value (at scale h) obtained from a randomly sampled CSR pattern of size n on R. Then the answer to the above question


is given formally by the probability that L_CSR(h) is at least as large as l₀, which we designate as the clustering P-value, P_clustered(h), at scale h for the observed pattern, S_n:

(4.6.5)    P_clustered(h) = Pr[ L_CSR(h) ≥ l₀ ]

To estimate this probability, observe that our simulation has by construction produced a sample of N realized values, l_i = L̂_i(h), i = 1,..,N, of this random variable L_CSR(h). Moreover, under the CSR Hypothesis the observed value, l₀, is just another sample, which for convenience we designate as sample i = 0. Hence the task is to estimate (4.6.5) on the basis of a random sample, (l₀, l₁,.., l_N), of size N + 1. The standard approach to estimating event probabilities is simply to count the number of times the event occurs, and then to estimate its probability by the relative frequency of these occurrences. In the present case, the relevant event is "L_CSR(h) ≥ l₀". Hence if we now define the indicator variables for this event by

(4.6.6)    δ₀(l_i) = { 1 if l_i ≥ l₀ ;  0 if l_i < l₀ } ,  i = 0,1,..,N

then the relative-frequency estimator, P̂_clustered(h), of the desired P-value is given by13

(4.6.7)    P̂_clustered(h) = (1/(N+1)) Σ_{i=0}^{N} δ₀(l_i)

To simplify this expression, observe that if m⁺(l₀) denotes the number of simulated samples, i = 1,..,N, that are at least as large as l₀ [i.e., with δ₀(l_i) = 1], then this estimated P-value reduces to14

(4.6.8)    P̂_clustered(h) = [m⁺(l₀) + 1] / (N + 1)

Observe that expression (4.6.3) above is now the special case of (4.6.8) in which L̂(h) happens to be bigger than all of the N simulated values. But (4.6.8) conveys a great deal more information. For example, suppose that N = 99 and that L̂(h) is only the fifth highest among these N + 1 values. Then in Figure 4.9 this value of L̂(h) would be inside the envelope [probably much closer to U_N(h) than to L_N(h)]. But no further information could be gained from this envelope analysis. However, in (4.6.8) the estimated chance of observing a value as large as L̂(h) is 5/(99+1) = .05, so that

13 This is also the maximum-likelihood estimator of P_clustered(h). Such estimators will be considered in more detail in Part III of this NOTEBOOK.
14 An alternative derivation of this P-value is given in Section 7 of the Appendix to Part I.


this L-value is still sufficiently large to imply some significant degree of clustering.
Such examples show that the P-values in (4.6.8) are considerably more informative
than the simple envelopes above.
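In computational terms this estimate requires nothing more than a count over the simulated values. Using, for illustration, the L0 and Lsim arrays from the hypothetical envelope sketch in Section 4.6.1 above:

>> k = 5;                              % index of a chosen h-value in H
>> mplus = sum(Lsim(k,:) >= L0(k));    % number of simulated values at least as large as l0
>> Pc = (mplus + 1)/(N + 1);           % estimated clustering P-value in (4.6.8)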

Turning next to dispersion, the appropriate P-value is now given by the answer to the following question: If the observed pattern were coming from a CSR process in region R, then how likely would it be to obtain a value as small as L̂(h)? The answer to this question is given by the dispersion P-value, P_dispersed(h), at scale h for the observed pattern, S_n:

(4.6.9)    P_dispersed(h) = Pr[ L_CSR(h) ≤ l₀ ]

Here, if we let m⁻(l₀) denote the number of simulated L-values that are no larger than l₀, then exactly the same argument as above [with respect to the event "L_CSR(h) ≤ l₀"] now shows that the appropriate relative-frequency estimate of P_dispersed(h) is given by

(4.6.10)    P̂_dispersed(h) = [m⁻(l₀) + 1] / (N + 1)
N 1

To apply these concepts, observe first that (unless many l_i values are the same as l₀)15 it must be true that P̂_clustered(h) ≈ 1 - P̂_dispersed(h). So there is generally no need to compute both. Hence we now focus on clustering P-values, P̂_clustered(h), for a given point pattern, S_n, in region R. Observe next that to determine P̂_clustered(h), there is no need to use L-values at all. One can equally well order the K-values. In fact, there is no need to normalize by λ̂ since this value is the same for both the observed and simulated patterns. Hence we need only compute "raw" K-function values, as given by the bracketed part of expression (4.3.6). Finally, to specify an appropriate range of scales to be considered, we take the appropriate maximum value of h to be the default value h̄ = h_max/2 in (4.5.1), and specify a number b of equal divisions of h̄. The values of P̂_clustered(h) are then computed for each of these h values, and the result is plotted.

This procedure is operationalized in the MATLAB program, k_count_plot.m. This program will be discussed in more detail in the next section. So for the present, we simply apply this program to Bodmin (with Bodmin.mat in the Workspace), by setting N = 99, b = 20, and writing:

>> k_count_plot(Bodmin, 99,20,1,Bod_poly);

15 The question of how to handle such ties is treated more explicitly in Section 7 of the Appendix to Part I.


(Simply ignore the fourth input "1" for the present.) The screen output of k_count_plot gives the value of h̄ computed by the program, which in this case is Dmax/2 = 8.6859. The minimum pairwise distance between all pairs of points (Dmin = 0.5203) is also shown. This value is useful for interpreting P-values at small scales, since all values of h less than this minimum must have K̂(h) = 0 and hence must be "maximally dispersed" by definition [since no simulated pattern can have smaller values of K̂(h)].

The cluster P-value plot for Bodmin is shown in Figure 4.10 below. With respect to significant clustering, there is seen to be general agreement with the results of the envelope approach above. Here we see significant clustering at the .05 level (denoted by the lower dashed red line) for scale values in the range 1.3 ≤ h ≤ 6.1 (remember that one will obtain slightly different values for each simulation).16 But this figure clearly shows more. In particular, clustering at scales in the range 1.7 ≤ h ≤ 5.7 is now seen to be significant at the .01 level, which is by definition the highest level of significance possible for N = 99.

Fig.4.10. Bodmin Cluster P-Values

Here it is also worth noticing that the clustering P-value at scale h = .5 is so large (in fact .93 in the above simulation) that it shows weakly significant dispersion (where the upper dashed red line indicates significant dispersion at the .05 level). The statistical reason for this can be seen from the screen output that shows the minimum distance between any two tors to be .52. Hence at scale h = .5 it must be true that no circle of radius .5 about any tor can contain other tors, so that we must have K̂(.5) = 0. But since random point patterns such as in Figure 3.14b often have at least one pair of points this close together, it becomes clear that there is indeed some genuine local dispersion here. Further reflection suggests that this is probably due to the nature of rock outcroppings, which are often only the exposed portion of larger rock formations and thus cannot be too close together. So again we see that the P-value map adds information about this pattern that may well be missed by simple visual inspection.

4.7 Nonhomogeneous CSR Hypotheses

As mentioned in Section 2.4 above, it is possible to employ the Generalized Spatial Laplace Principle to extend CSR to the case of nonhomogeneous reference measures.

16 Simulations with N = 999 yield about the same results as Figure 4.10, so this appears to be a more accurate range than that given by the envelope in Figure 4.9.


While no explicit applications are given in [BG], we can illustrate the main ideas with
the following housing abandonment example.

4.7.1 Housing Abandonment Example

As in the Philadelphia example of Section 1.2 above, suppose that we are given the
locations of n currently abandoned houses in a given city, R, such as in Figure 4.11a
below.

Fig.4.11a. Abandoned Houses Fig.4.11b. Census Tract Data

In addition, suppose that data on the number of housing units, H_i = ρ(C_i), in each census tract, C_i, i = 1,..,m, within city R is also available, as in Figure 4.11b. If the total number of housing units in the city is denoted by

(4.7.1)    H = ρ(R) = Σ_{i=1}^{m} ρ(C_i) = Σ_{i=1}^{m} H_i

then the probability that a randomly sampled housing unit will be located in tract i is given by

(4.7.2)    P_i = H_i/H = ρ(C_i)/ρ(R) ,  i = 1,..,m

Thus if these n housing abandonments were completely random events (i.e., with no
housing unit more likely to be abandoned than any other) then one would expect the
distribution of abandoned houses across census tracts to be given by n independent
random samples from the distribution in (4.7.2).17 More formally, this is an example of
a nonhomogeneous CSR hypothesis with respect to a given reference measure, ρ.

17 In particular, this would yield a marginal distribution of abandonments in each tract C_i given by the binomial distribution in expression (2.4.3) above with C = C_i.


4.7.2 Monte Carlo Tests of Hypotheses

To test such hypotheses, we proceed exactly the same way as in the homogeneous case.
The only real difference here is that the probability distributions corresponding to
nonhomogeneous spatial hypotheses are somewhat more complex. Using the above
example as an illustration, we can simulate samples of n random abandonments from
the appropriate distribution by the following two-stage sampling procedure:

(i) Randomly sample a census tract, C_1i, from the distribution in (4.7.2).

(ii) Randomly locate a point s_1^(i) in C_1i.

(iii) Repeat (i) and (ii) n times to obtain a point pattern S_n^(i) = (s_j^(i) : j = 1,..,n).

The resulting pattern S_n^(i) corresponds to the above hypothesis in the sense that individual abandonment locations are independent, and the expected number of abandonments in each tract C_j is proportional to the reference measure, H_j = ρ(C_j). However, this reference measure ρ is only an approximation to the theoretical measure, since the actual locations of individual housing units are not known. [This is typical of situations where certain key spatial data is available only at some aggregate level.18] Hence in step (ii) the location of a housing unit in C_i is taken to be uniformly (homogeneously) distributed throughout this subregion. The consequences of this "local uniformity" approximation to the ideal reference measure, ρ, will be noted in the numerical examples below.
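A minimal MATLAB sketch of this two-stage procedure might look as follows, assuming a vector H of tract housing counts H_i, a cell array tracts of tract boundary polygons, and a hypothetical helper rand_point_in that returns one uniformly distributed point in a given polygon (none of these are part of the class programs):

% Two-stage sampling of n events under a nonhomogeneous CSR hypothesis (sketch)
P = H./sum(H);                         % tract probabilities P_i in (4.7.2)
cdf = cumsum(P);
S = zeros(n,2);                        % n = number of events to be simulated
for j = 1:n
    t = find(rand <= cdf, 1);          % stage (i): sample a tract from (4.7.2)
    S(j,:) = rand_point_in(tracts{t}); % stage (ii): uniform point in that tract
end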

Given a point pattern, S_n = (s_j : j = 1,..,n), such as the locations of n abandonments above, together with N simulated patterns {S_n^(i) : i = 1,..,N} from the Monte Carlo procedure above, we are now ready to test the corresponding nonhomogeneous CSR hypothesis based on this reference measure ρ. To do so, we can proceed exactly as before by constructing K-counts, K̂(h), for the observed pattern, S_n, over a selected range of scales, h, and then constructing the corresponding K-counts, K̂^(i)(h), for each simulated pattern, i = 1,..,N.

This procedure is operationalized in the same MATLAB program, k_count_plot (which is more general than the Bodmin application above). Here the only new elements involve a partition of region R into subregions, {C_i : i = 1,..,m}, together with a specification of the appropriate reference measure, ρ, defined on this set of subregions.

18 Such aggregate data sets will be treated in more detail in Part III of this NOTEBOOK.

________________________________________________________________________
ESE 502 I.4-19 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part I. Spatial Point Pattern Analysis
______________________________________________________________________________________

4.7.3 Lung Cancer Example

To illustrate this testing procedure, the following example has been constructed from
the Larynx and Lung Cancer example of Section 1.2 above. Here we focus only on
Lung Cancer, and for simplicity consider only a random subsample of n = 100 lung
cases, as shown in Figure 4.12 below.

Fig.4.12. Subsample of Lung Cases Fig.4.13. Random Sample of Same Size

Note from Figures 1.7 and 1.8 that this is fairly representative of the full data set (917
lung cancers). To analyze this data set we begin by observing that in terms of area
alone, the point pattern in Figure 4.12 is obviously quite clustered.

One can see this by comparison with a typical random pattern of the same size in Figure 4.13. This can be verified statistically by using the program k_count_plot (as in the Bodmin case) to conduct a Monte Carlo test for the homogeneous case developed above. The results are shown in Figure 4.14 below. Here it is evident that there is extreme clustering. In fact, note from the scale in Figure 4.12 above that there is highly significant clustering up to a radius of h = 20 km, which is large enough to encompass the entire region. Notice also that the significance levels here are as high as possible for the given number of simulations, which in this case was N = 999. This appears to be due to the fact that the overall pattern of points in Figure 4.12 is not only more clustered but is also more compact. So for the given common point density in these figures, cell counts centered at pattern points in Figure 4.12 tend to be uniformly higher than in Figure 4.13.

Fig.4.14. Test of Homogeneous Clustering


But the single most important factor contributing to this clustering (as observed in
Section 2.4 above) is the conspicuous absence of an appropriate reference measure –
namely population. In Figure 4.15 below, the given subsample of lung cases in Figure
4.12 above is now depicted on the appropriate population backcloth of Figure 1.8.


Fig.4.15. Subsample of Lung Cases Fig.4.16. Random Sample from Population

Here it is clear that much of the clustering in Figure 4.12 can be explained by variations
in population density. Notice also that the relative sparseness of points in the west and
east is also explained by the lower population densities in these areas (especially in
the east). For comparison, a random pattern generated using the two-stage sampling
procedure above is shown in Figure 4.16. Here there still appears to be somewhat less
clustering than in Figure 4.15, but the difference is now far less dramatic than above.

Using these parish population densities as the reference measure, ρ, a Monte Carlo test was run with N = 999 simulated patterns (including the one shown in Figure 4.16). The results of this test are plotted in Figure 4.17 below. Notice that the dramatic results of Figure 4.14 above have all but disappeared. There is now only significant clustering at the local scale (with h ≤ 2 km). Moreover, even this local clustering appears to be an artifact of the spatial aggregation inherent in the parish population density measure, ρ.

Fig.4.17. Test of Nonhomogeneous Clustering

As pointed out above, this aggregation
leads to simulated point patterns under
the nonhomogeneous CSR hypothesis that tend to be much too homogeneous at the
parish level. This is particularly evident in the densely populated area of the south-
central portion of the region shown. Here the tighter clustering of lung cancer cases
seen in Figure 4.15 more accurately reflects local variations in population density than
does the relatively uniform scattering of points in Figure 4.16. So in fact, a more


disaggregated representation of population density would probably show that there is no significant clustering of lung cancer cases whatsoever.

4.8 Local K-Function Analysis

Up to this point we have only considered global properties of point patterns, namely the
overall clustering or dispersion of patterns at various scales. However, in many cases
interest focuses on more local questions of where significant clustering or dispersion is
occurring. Here we begin by constructing local versions of K-functions, and then apply
them to several examples.

4.8.1 Construction of Local K-Functions

Recall from expression (4.3.3) that K-functions were defined in terms of expected point
counts for a randomly selected point in a pattern. But exactly the same definitions can
be applied to each individual point in the pattern by simply modifying the interpretation
of (4.3.3) to be a given point, i , rather than a randomly sampled point, and rewriting
this expression as a local K-function for each point, i :

(4.8.1)    K_i(h) = (1/λ) · E[ Σ_{j≠i} I_h(d_ij) ]

Moreover, if we now relax the stationarity assumption used in (4.3.4) above, then these
expected values may differ for each point, i . In this context, the pooled estimator
(4.3.5) for the stationary case now reduces to the corresponding local estimator:

(4.8.2)    K̂_i(h) = (1/λ̂) Σ_{j≠i} I_h(d_ij)

Hence to determine whether there is significant clustering about point i at scale h , one
can develop local Monte Carlo testing procedures using these statistics.
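Up to the constant 1/λ̂, the local statistic in (4.8.2) is just a point count, so its computation is straightforward. A minimal sketch for a pattern, loc, and a vector of distances, D, is given below (this plays the same role as the point-count output of the class program k_count_loc described in the next subsection, though that program's exact conventions may differ):

% Raw local K-counts: number of other pattern points within D(k) of each point i (sketch)
n = size(loc,1);
C = zeros(n,length(D));
for i = 1:n
    d = sqrt((loc(:,1)-loc(i,1)).^2 + (loc(:,2)-loc(i,2)).^2);
    for k = 1:length(D)
        C(i,k) = sum(d <= D(k)) - 1;   % subtract 1 to exclude point i itself
    end
end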

4.8.2 Local Tests of Homogeneous CSR Hypotheses

In the case of homogeneous CSR hypotheses, one can simply hold point i fixed in region R and generate N random patterns of size n - 1 in R (corresponding to the locations of all other points in the pattern). Note that in the present case, (4.8.2) is simply a count of the number of points within distance h of point i, scaled by 1/λ̂. But since this scaling has no effect on Monte Carlo tests of significance, one can focus solely on point counts (which may be thought of as a "raw" K-function). For each random pattern, one can then simply count the number of points within distance h of point i. Finally, by comparing these counts with the observed point count, one can then generate p-values for each point i = 1,..,n and distance, h, [paralleling (4.6.8) above]:

(4.8.3)    P̂_i(h) = [m_i(h) + 1] / (N + 1)


where m_i(h) now denotes the number of simulated patterns with counts at distance h
from i at least as large as the observed count. This testing procedure is operationalized
in the MATLAB program, k_count_loc.m, shown below:

function [PVal,C0] = k_count_loc(loc,sims,D,M,poly)

% K_COUNT_LOC computes the raw K-function at each point in the
% pattern, loc, for a range of distances, D, and allows tests of non-
% homogeneous CSR hypotheses by including a set of polygons, poly, with
% reference measure, M.
%
% INPUTS:
% (i) loc = population location file [loc(i)=(Xi, Yi),i=1:N]
% (ii) sims = number of simulations
% (iii) D = set of distance values (in ASCENDING order)
% (iv) M = k-vector of measure values for each of k polygons
% (v) poly = matrix describing boundaries of k polygons

Here the main output, Pval, is a matrix of P-values at each reference point and each
distance value under the CSR Hypothesis. (The point counts for each point-distance
pair are also in the output matrix, C0.) Notice that since homogeneity is simply a
special case of heterogeneity, this program is designed to apply both to homogeneous
and nonhomogeneous CSR hypotheses.
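
To make the underlying logic concrete, the following is a minimal sketch (not the actual k_count_loc.m code) of how such local point counts and Monte Carlo p-values might be computed in the homogeneous case. All variable names here are illustrative, and for simplicity random points are generated in a bounding rectangle rather than in the actual boundary polygon:

% Minimal sketch of the local count / Monte Carlo logic (homogeneous CSR case);
% not the actual k_count_loc.m program.
% loc = n-by-2 observed locations; D = vector of distances; sims = # simulations;
% xmin,xmax,ymin,ymax = bounding rectangle used here in place of the true boundary.
n  = size(loc,1);
C0 = zeros(n,length(D));                         % observed point counts
for i = 1:n
    d = sqrt((loc(:,1)-loc(i,1)).^2 + (loc(:,2)-loc(i,2)).^2);
    for w = 1:length(D)
        C0(i,w) = sum(d <= D(w)) - 1;            % exclude point i itself
    end
end
M = zeros(n,length(D));                          % # simulations with count >= observed
for m = 1:sims
    rnd = [xmin + (xmax-xmin)*rand(n-1,1), ymin + (ymax-ymin)*rand(n-1,1)];
    for i = 1:n
        d = sqrt((rnd(:,1)-loc(i,1)).^2 + (rnd(:,2)-loc(i,2)).^2);
        for w = 1:length(D)
            M(i,w) = M(i,w) + (sum(d <= D(w)) >= C0(i,w));
        end
    end
end
PVal = (M + 1)/(sims + 1);                       % p-values as in (4.8.3)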

Application to Bodmin Tors

The homogeneous case can be illustrated by the following application to Bodmin tors.
Recall that the location pattern of tors is given by the matrix, Bodmin, in the workspace
Bodmin.mat. Here there is a single boundary polygon, Bod_poly. Hence the reference
measure can be set to a constant value, say M = 1. So the appropriate command for
999 simulations in this case is given by:

>> [Pval,C0] = k_count_loc(Bodmin,999,D,1,Bod_poly);

In view of Figure 4.10 above, one expects that the most meaningful distance range for
significant clustering will be somewhere between h = 1 and h = 5 kilometers. Hence
the selected range of distances was chosen to be D = [1,2,3,4,5]. One key advantage of
this type of local analysis is that since a p-value is now associated with each individual
point, it is now possible to map the results. In the present case, the results of this Monte
Carlo analysis were imported to ARCMAP, and are displayed in Bodmin.mxd. In
Figure 4.18 below, the p-value maps for selected radii of h = 2, 3, 5 km are shown. As


seen in the legend (lower right corner of the figure), the darker red values correspond to
lower p-values, and hence denote regions of more significant clustering. As expected,
there are basically two regions of significant clustering corresponding to the two large
groupings of tors in the Bodmin field.

[Figure 4.18. Cluster P-Values for Bodmin Tors — three p-value maps for radii h = 2 km, 3 km, and 5 km; legend classes: 0.001–0.005, 0.005–0.010, 0.010–0.050, 0.050–0.100, 0.100–0.999; a red circle indicates the actual scale of a 3 km radius]

Notice here that clustering is much more pronounced at a radius of 3 km than at smaller
or larger radii. (The red circle in the figure shows the actual scale of a 3 km radius.)
This figure well illustrates the ability of local K-function analyses to pick up sharper
variations in scale than global analyses such as Figure 4.10 above (where there
appeared to be equally significant clustering at all three scales, h = 2, 3, 5 km). Hence it
should be clear from this example that local analyses are often much more informative
than their global counterparts.

Local Analyses with Reference Grids

The ability to map p-values in local analyses suggests one additional extension that is
often more appropriate than direct testing of clustering at each individual point. By way
of motivation, suppose that one is studying a type of tree disease by mapping the
locations of infected trees in a given forest. Here it may be of more interest to
distinguish diseased regions from healthy regions in the forest rather than to focus on
individual trees. A simple way to do so is to establish a reference grid of locations in
the forest, and then to estimate clustering p-values at each grid location rather than at
each tree. (The construction of reference grids is detailed in Section 4.8.3 below.) Such
a uniform grid of p-values can then be easily interpolated to produce a smoother visual
summary of disease clustering. An illustration of this reference-grid procedure is shown
in Figure 4.19 below, where the red dots denote diseased trees in the section of forest
shown, and where the white dots are part of a larger grid of reference points. In this
illustration the diseased-tree count within distance h of the grid point shown is thus
equal to 4.

[Figure 4.19. Reference Grid for Local Clustering — white dots denote reference grid points, red dots denote diseased trees, with a circle of radius h drawn around one grid point]

Assuming that the forest itself is reasonably uniform with respect to the spatial
distribution of trees, the homogeneous CSR hypothesis would again provide a natural
benchmark for identifying significant clustering of diseased trees. In this case, one
would simulate random patterns of diseased trees and compare disease counts with
those observed within various distances h of each grid point. Hence those grid points
with low p-values at distance h would denote locations where there is significant
disease clustering at scale h .

To develop the details of this procedure, it is convenient to construct a reference-grid
representation for Bodmin, so that the two approaches can more easily be compared. To
do so, we start by constructing a reference grid for Bodmin. By inspecting the boundary
of Bodmin in ARCMAP one can easily determine a box of coordinate values just large
enough to contain all of Bodmin. In the present case, appropriate bounding X-values
and Y-values are given by Xmin = -5.2, Xmax = 9.5, Ymin = -11.5, and Ymax = 8.3.
Next one needs to choose a cell size for the grid (as exemplified by the spacing between
white dots in Figure 4.19). One should try to make the grid fine enough to obtain a
good interpolation of the p-values at grid points. Here the value of .5 km was chosen for
spacing in each direction, yielding square cells with dimensions, Xcell = .5 = Ycell.


The construction of the corresponding reference grid is operationalized in the program grid_form.m with the command:

>> ref = grid_form(Xmin,Xmax,Xcell,Ymin,Ymax,Ycell);

This produces a 2-column matrix, ref, of grid point coordinates. (The upper left corner
of the grid is displayed on the screen for a consistency check.)
[Figure 4.20. Reference Grid for Bodmin — left: full rectangular grid of reference points; right: masked grid showing only the points inside the Bodmin boundary]

A plot of the full grid, ref, is shown on the left in Figure 4.20 (see footnote 19 below).
(In Section 8 of the Appendix to Part I a procedure is developed for obtaining this full
grid representation directly in MATLAB.) While all of these grid points are used in the
calculation, those outside of the Bodmin boundary are only relevant for maintaining
some degree of smoothness in the interpolation constructed below. On the right, these
grid points have been masked out in order to display only those points inside the
Bodmin boundary. (The construction of such visual masks is quite useful for many
displays, and is discussed in detail in Section 1.2.4 of Part IV in this NOTEBOOK.)
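
For readers who want to see what such a grid construction involves, the following is a minimal sketch of how a comparable reference grid could be built directly in MATLAB. It is an illustrative stand-in rather than the actual grid_form.m code, using the bounding values and cell size quoted above:

% Illustrative reference-grid construction for Bodmin (not the actual grid_form.m).
Xmin = -5.2;  Xmax = 9.5;  Ymin = -11.5;  Ymax = 8.3;    % bounding box (from ARCMAP)
Xcell = 0.5;  Ycell = 0.5;                               % cell spacing in km
xv = Xmin:Xcell:(Xmin + Xcell*ceil((Xmax-Xmin)/Xcell));  % extend to a whole number of cells
yv = Ymin:Ycell:(Ymin + Ycell*ceil((Ymax-Ymin)/Ycell));
[X,Y] = meshgrid(xv,yv);                                 % rectangular lattice of grid points
ref = [X(:), Y(:)];                                      % 2-column matrix of grid coordinates
plot(ref(:,1),ref(:,2),'.'), axis equal                  % quick visual check of the grid

The upward adjustment of Xmax and Ymax to a whole number of cells in this sketch mirrors the behavior noted in footnote 19.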

Given this reference grid, ref, the extension of k_count_loc.m that utilizes ref is
operationalized in the MATLAB program, k_count_loc_ref.m. This program is
essentially identical to k_count_loc.m except that ref is a new input. Here one obtains
p-values for Bodmin at each reference point in ref with the command:

19. Notice that the right side and top of the grid extend slightly further than the left and bottom. This is because the Xmax and Ymax values in the program are adjusted upward to yield an integral number of cells of the same size.


>> [Pval,C0] = k_count_loc_ref(Bodmin,ref,999,D,1,Bod_poly);

where the matrix Pval now contains one p-value for each grid point in ref and distance
radius in D. The results of this Monte Carlo simulation were exported to ARCMAP and
the p-values at each grid point inside Bodmin are displayed for h = 3 km on the left in
Figure 4.21 below (again with a mask). By comparing this with the associated point

[Figure 4.21. Interpolated P-Values for Bodmin — left: p-values at each reference grid point inside the Bodmin boundary (with mask); right: interpolated p-value surface; legend classes: 0.001–0.002, 0.002–0.005, 0.005–0.01, 0.01–0.02, 0.02–0.05, 0.05–0.10, 0.10–0.20, 0.20–1.00]

plot in the center of Figure 4.18, one can see that this is essentially a smoother version
of the results depicted there. However, this representation can be considerably
improved upon by interpolating these values using any of a number of standard
“smoothers” (discussed further in Part II). The interpolation shown on the right was
obtained by the method known as ordinary kriging. This method of (stochastic)
interpolation will be developed in detail in Section 6.3 of Part II in this NOTEBOOK.

4.8.3 Local Tests of Nonhomogeneous CSR Hypotheses

Next we extend these methods to the more general case of nonhomogeneous CSR
hypotheses. As with all spatial Monte Carlo testing procedures, the key difference
between the homogeneous and nonhomogeneous cases is the way in which random
points are generated. As discussed in Section 4.7.2 above, this generation process for
the nonhomogeneous case amounts to a two-stage sampling procedure in which a
polygon is first sampled in a manner proportional to the given reference measure, M,
and then a random location in this polygon is selected. Since this procedure is already
incorporated into both the programs k_count_loc.m and k_count_loc_ref.m above,
there is little need for further discussion at this point.
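
For completeness, the following is a minimal sketch of what that two-stage sampling step amounts to for a single random point. The cell-array storage of polygon boundaries (polyX, polyY) is an illustrative assumption and differs from the poly matrix format used by the course programs:

% Illustrative two-stage sampling under a nonhomogeneous CSR hypothesis.
% M = k-vector of reference-measure values (e.g., tract populations);
% polyX{i}, polyY{i} = boundary coordinates of polygon i (illustrative storage).
p = M(:)/sum(M);                      % sampling probabilities proportional to M
cumP = cumsum(p);
i = find(rand <= cumP, 1, 'first');   % stage 1: sample a polygon
xr = [min(polyX{i}), max(polyX{i})];  % stage 2: uniform location in polygon i
yr = [min(polyY{i}), max(polyY{i})];  %          via rejection from its bounding box
inside = false;
while ~inside
    x = xr(1) + diff(xr)*rand;
    y = yr(1) + diff(yr)*rand;
    inside = inpolygon(x, y, polyX{i}, polyY{i});
end
s = [x, y];                           % one random point of the simulated pattern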


By way of illustration, we now apply k_count_loc_ref.m to a Philadelphia data set,
which includes 500 incidents involving inter-group conflict (IGC) situations (such as
housing discrimination) that were reported to the Community Service Division of the
Philadelphia Commission on Human Relations from 1995-1996. [This data set is
discussed in more detail in the project by Amy Hillier on the ESE 502 class web page.]

The locations of these 500 incidents are shown on the left in Figure 4.22 below, and are
also displayed in the map document, Phil_igc.mxd, in ARCMAP. Here the natural null
hypothesis would be that every individual has the same chance of reporting an
“incident”. But as with the housing abandonment example in Figure 4.11 above,
individual location data is not available. Hence census tract population levels

[Figure 4.22. Comparison with IGC Random Incidents — left: actual IGC incidents; right: one realization of 500 random IGC incidents under the nonhomogeneous CSR hypothesis, with census-tract population densities shown in green]

will be used as an approximation to individual locations, so that the relevant reference
measure is here taken to be population by census tract (with corresponding population
densities shown in green in Figure 4.22). The relevant nonhomogeneous CSR
hypothesis for this case is thus simply that the chance of any incident occurring in a
given census tract is proportional to the population of that census tract. Under this
hypothesis, a typical realization of 500 “random IGC incidents” is shown on the right.
Here it is clear that incidents are more clustered in areas of high population density,
such as in West Philadelphia and South Philadelphia. So clusters of actual data on the


left are only significant if they are more concentrated than would be expected under this
hypothesis. Hence, even though there is clearly a cluster of cases in South Philadelphia,
it is not clear that this is a significant cluster. Notice however that the Kensington area
just Northeast of Center City does appear to be more concentrated than would be
expected under the given hypothesis. But no conclusion can be reached on the basis of
this visual comparison. Rather, we must simulate many realizations of random patterns
and determine statistical significance on this basis.

To do so, a reference grid for Philadelphia was constructed, and is shown (with
masking) on the left in Figure 4.23 below, in a manner similar to Figure 4.20 above.
Here a range of distances was tried, and clustering was most apparent at a radius of 500
meters (in a manner similar to the radius of 3 km in Figure 4.18 above for the Bodmin
example). The p-value results for this case are contained in the MATLAB workspace,

[Figure 4.23. P-Value Map for IGC Clustering — left: reference grid for Philadelphia (with mask); right: interpolated p-value contours; legend classes: 0.000–0.001, 0.001–0.005, 0.005–0.100, 0.100–0.200, 0.200–1.000]

phil_igc.mat, and were obtained using k_count_loc_ref.m with the command:

>> [Pval,C0] = k_count_loc_ref(loc,ref,999,D,pop,bnd);

Here loc contains the locations of the 500 IGC incidents, ref is the reference grid
shown above, D contains a range of distances including the 500-meter case,20 and pop

20. The actual coordinates for this map were in decimal degrees, so that the value .005 corresponds roughly to 500 meters.


contains the populations of each census tract, with boundaries given by bnd. These
results were imported to ARCMAP as a point file, and are displayed as P-val.shp in the
data frame, “P-Values for Dist = .005”, of Phil_igc.mxd. Finally, these p-values were
interpolated using a different smoothing procedure than that of Figure 4.21 above. Here
the spline interpolator in Spatial Analyst was used, together with the contour option.
The details of this procedure are described in Section 8 of the Appendix to Part I.21

Here the red contours denote the most significant areas of clustering, which might be
interpreted as IGC “hotspots”. Notice in particular that the dominant hotspot is
precisely the Kensington area mentioned above. Notice also that the clustering in West
Philadelphia, for example, is now seen to be explained by population density alone, and
hence is not statistically significant.

It is also worth noticing that there is a small “hotspot” just to the west of Kensington
(toward the Delaware River) that appears hard to explain in terms of the actual IGC
incidents in Figure 4.22. The presence of this hotspot is due to the fact that while there
are only four incidents in this area, the population density is less than a quarter of that
in the nearby Kensington area. So this number of incidents is unusually high given the low
density. This raises the practical question of how many incidents are required to
constitute a meaningful cluster. While there can be no definitive answer to this
question, it is important to emphasize that statistical analyses such as the present one
should be viewed as providing only one type of useful information for cluster
identification. 22

21. Notice also that this contour map of P-values is an updated version of that in the graphic header for the class web page. That version was based on only 99 simulations (run on a slower machine).
22. This same issue arises in regression, where there is a need to distinguish between the statistical significance of coefficients (relative to zero) and the practical significance of their observed magnitudes in any given context.


5. Comparative Analyses of Point Patterns

Up to this point, our analysis of point patterns has focused on single point patterns, such
as the locations of redwood seedlings or lung cancer cases. But often the relevant
questions of interest involve relationships between more than one pattern. For example if
one considers a forest in which redwoods are found, there will invariably be other species
competing with redwoods for nourishment and sunlight. Hence this competition between
species may be of primary interest. In the case of lung cancers, recall from Section 1.2
that the lung cancer data for Lancashire was primarily of interest as a reference
population for studying the smaller pattern of larynx cancers. We shall return to this
example in Section 5.8 below. But for the moment we start with a simple forest example
involving two species.

5.1 Forest Example

The 600 foot square section of forest shown in Figure 5.1 below contains only two types
of trees. The large dots represent the locations of oak trees, and the small dots represent
locations of maple trees. Although this is a fairly small section of forest, it seems clear
that the pattern of oaks is much more clustered than that of maples. This is not surprising,
given the very different seed-dispersal patterns of these two types of trees.

[Figure 5.1. Section of Forest — 600-foot-square section in which large dots denote oak locations and small dots denote maple locations; scale bar 0–200 feet.   Figure 5.2. Patterns of Seed Dispersal — oak versus maple]

As shown in Figure 5.2, oaks produce large acorns that fall directly from the tree, and
are only partially dispersed by squirrels. Maples on the other hand produce seeds with
individual “wings” that can transport each seed a considerable distance with even the
slightest breeze. Hence there are clear biological reasons why the distribution of oaks
might be more clustered than that of maples. So how might we test this hypothesis
statistically?


5.2 Cross K-Functions

As one approach to this question, observe that if oaks tend to occur in clusters, then one
should expect to find that the neighbors of oak trees tend to be other oaks, rather than
maples. Alternatively put, one should expect to find fewer maples near oak locations than
other locations. While one could in principle test these ideas in terms of nearest neighbor
statistics, we have already seen in the Bodmin tors example that this does not allow any
analysis of relationships between point patterns at different scales. Hence a more flexible
approach is to extend the above K-function analysis for single populations to a similar
method for comparing two populations.1

The idea is simple. Rather than looking at the expected number of oak trees within
distance h of a given oak, we look at the expected number of maple trees within distance
h of the oak. More generally, if we now consider two point populations, 1 and 2, with
respective intensities, $\lambda_1$ and $\lambda_2$, and denote the members of these two populations by i
and j, respectively, then the cross K-function, $K_{12}(h)$, for population 1 with respect to
population 2 is given for each distance h by the following extension of expression (4.2.1)
above:

(5.2.1)   $K_{12}(h) = \frac{1}{\lambda_2}\, E\big(\text{number of } j\text{-events within distance } h \text{ of an arbitrary } i\text{-event}\big)$

Notice that there is an asymmetry in this definition, and that in general, $K_{12}(h) \neq K_{21}(h)$.
Notice also that the word "additional" in (4.2.1) is no longer meaningful, since
populations 1 and 2 are assumed to be distinct. This definition can be formalized in a
manner paralleling the single population case as follows. First, for any realized point
patterns, $S_1 = (s_i : i = 1,..,n_1)$ and $S_2 = (s_j : j = 1,..,n_2)$, from populations 1 and 2 in region
R, let $d_{ij} = d(s_i, s_j)$ denote the distance between member i of population 1 and member j of
population 2 in R. Then for each distance h the indicator function

(5.2.2)   $I_h(d_{ij}) = I_h[d(s_i,s_j)] = \begin{cases} 1, & d_{ij} \le h \\ 0, & d_{ij} > h \end{cases}$

now indicates whether or not member j of population 2 is within distance h of a given
member i of population 1. In terms of this indicator, the cross K-function in (5.2.1) can
be formalized [in a manner paralleling (4.3.3)] as

(5.2.3)   $K_{12}(h) = \frac{1}{\lambda_2}\, E\Big[\sum_{j=1}^{n_2} I_h(d_{ij})\Big]$

1. Note that while our present focus is on two populations, analyses of more than two populations are usually formulated either as (i) pairwise comparisons between these populations (as with correlation analyses), or (ii) comparisons between each population and the aggregate of all other populations. Hence the two-population case is the natural paradigm for both these approaches.

where both the size, $n_2$, of population 2 and the distances $(d_{ij} : j = 1,..,n_2)$ are here
regarded as random variables.2 This function plays a fundamental role in our subsequent
comparative analyses of populations.

5.3 Estimation of Cross K-Functions

Given the definition in (5.2.3) it is immediately apparent that cross K-functions can be
estimated in precisely the same way as K-functions. First, since the expectation in (5.2.3)
does not depend on which random reference point i is selected from population 1, the
same argument as in (4.3.4) now shows that for any given size, $n_1$, of population 1,

(5.3.1)   $E\Big[\sum_{j=1}^{n_2} I_h(d_{ij})\Big] = \lambda_2 K_{12}(h)\,, \quad i = 1,..,n_1$
          $\;\Rightarrow\; \sum_{i=1}^{n_1} E\Big[\sum_{j=1}^{n_2} I_h(d_{ij})\Big] = n_1 \lambda_2 K_{12}(h)$

so that for each $n_1$, $K_{12}(h)$ can be written as3

(5.3.2)   $K_{12}(h) = \frac{1}{\lambda_2 n_1} \sum_{i=1}^{n_1} E\Big[\sum_{j=1}^{n_2} I_h(d_{ij})\Big]$

In this form, it is again apparent that for any given realized patterns, $S_1 = (s_{1i} : i = 1,..,n_1)$
and $S_2 = (s_{2j} : j = 1,..,n_2)$, the expected counts in (5.3.2) are naturally estimated by their
corresponding observed counts, and that the intensities, $\lambda_1$ and $\lambda_2$, are again estimated by
the observed intensities,

(5.3.3)   $\hat{\lambda}_k = \frac{n_k}{a(R)}\,, \quad k = 1,2$

Thus the natural (maximum likelihood) estimate of $K_{12}(h)$ is given by the sample cross
K-function:

(5.3.4)   $\hat{K}_{12}(h) = \frac{1}{\hat{\lambda}_2 n_1} \sum_{i=1}^{n_1} \sum_{j=1}^{n_2} I_h(d_{ij})$

2. To be more precise, $n_2$ is a random integer (count), and for any given value of $n_2$, the conditional distribution of $[d_{ij} = d(s_i, s_j) : j = 1,..,n_2]$ is then determined by the conditional distribution of the locations, $[s_i, (s_j : j = 1,..,n_2)]$ in R, where $s_i$ is implicitly taken to be the location of a randomly sampled member of population 1.
3. Technically this should be written as a conditional expectation given $n_1$ [and (4.3.4) should be a conditional expectation given n]. But for simplicity, we ignore this additional layer of notation.
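
As an illustration of (5.3.4), the following is a minimal MATLAB sketch of the sample cross K-function. It is not one of the course programs, uses illustrative input names, and applies no edge corrections:

% Illustrative sample cross K-function (5.3.4); not one of the course programs.
% loc1 = n1-by-2 locations of population 1; loc2 = n2-by-2 locations of population 2;
% aR = area of region R; D = vector of distance radii.
function K12 = cross_k(loc1, loc2, aR, D)
n1 = size(loc1,1);  n2 = size(loc2,1);
lam2 = n2/aR;                                 % estimated intensity of population 2, as in (5.3.3)
K12 = zeros(length(D),1);
for w = 1:length(D)
    count = 0;
    for i = 1:n1
        d = sqrt((loc2(:,1)-loc1(i,1)).^2 + (loc2(:,2)-loc1(i,2)).^2);
        count = count + sum(d <= D(w));       % population-2 events within D(w) of point i
    end
    K12(w) = count/(lam2*n1);                 % sample cross K-function at radius D(w)
end
end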

5.4 Spatial Independence Hypothesis

We next use these sample cross K-functions as test statistics for comparing populations 1
and 2. Recall that in the single population case, the fundamental question of interest was
whether or not the given population was more clustered (or more dispersed) than would
be expected if the population locations were completely random. This led to the CSR
hypothesis as a natural null hypothesis for testing purposes. However, when one
compares two populations of random events, the key question is usually whether or not
these events influence one another in some way. So here the natural null hypothesis takes
the form of statistical independence rather than randomness. In terms of cross K-
functions, if there are significantly more j -events close to i -events than would be
expected under independence, then one may infer that there is some “attraction” between
populations 1 and 2. Conversely, if there are significantly fewer j-events close to i-events
than expected, then one may infer that there is some "repulsion" between these
populations. These basic distinctions between the one-population and two-population
cases can be summarized as in Table 5.1 below:

CASE       HYPOTHESIS FRAMEWORK

One Pop    Clustering  |  Spatial Randomness    |  Dispersion

Two Pops   Attraction  |  Spatial Independence  |  Repulsion

Table 5.1. Comparison of Hypothesis Frameworks
Next we observe that from a testing viewpoint, the particular appeal of the CSR
hypothesis is that one can easily simulate location patterns under this hypothesis. Hence
Monte Carlo testing is completely straightforward. But the two-population hypothesis of
spatial independence is far more complex. In principle this would not be a problem if one
were able to observe many replications of these sets of events, i.e., many replications of
joint patterns from populations 1 and 2. But this is almost never the case. Typically we
are given a single joint pattern (such as the patterns of oaks and maples in Figure 5.1
above) and must somehow detect “departures from independence” using only this single
realization. Hence it is necessary to make further assumptions, and in particular, to define
“spatial independence” in a manner that allows the distribution of sample cross K-
functions to be simulated under this hypothesis. Here we consider two approaches,
designated respectively as the random-shift approach and the random-permutation
approach.

5.5 Random-Shift Approach to Spatial Independence

This approach starts by postulating that each individual population, k = 1, 2, is generated
by a stationary process on the plane. If region R is viewed as a window on this process

(as in Section 2) and we again represent each process by the collection of cell counts in
R, say $\mathcal{N}_k = \{N_k(C) : C \subseteq R\}$, $k = 1,2$, then it follows in particular from (2.5.1) that the
marginal cell-count distribution, $\Pr[N_k(C_h)]$, for population k in any circular cell, $C_h$, of
radius h must be the same for all locations.4 Hence if we now focus on population 2 and
imagine a two-stage process in which (i) a point pattern for population 2 is generated, and
(ii) this pattern is then shifted by adding some constant vector, a, to each point,
$s_j \rightarrow s_j + a$, then the expected number of points in $C_h$ would be the same for both stage
(i) and stage (ii). Indeed this shift simply changes the location of $C_h$ relative to the
pattern (as in Figure 5.5 below), so that by stationarity the expected point count must stay
the same.

5.5.1 Spatial Independence Hypothesis for Random Shifts

In this context, the appropriate spatial independence hypothesis simply asserts that cell
counts for population 2 are not influenced by the locations of population 1, i.e., that for
all cells, $C \subseteq R$,

(5.5.1)   $\Pr[N_2(C) = n \,|\, \mathcal{N}_1] = \Pr[N_2(C) = n]\,, \quad n \ge 0$

where $\Pr[N_2(C) = n \,|\, \mathcal{N}_1]$ is the conditional probability that $N_2(C) = n$ given all cell
counts, $\mathcal{N}_1$, for population 1.5 Under this hypothesis it then follows that the conditional
distribution on the left must also exhibit stationarity, so that if the circular cell, $C_h$, is
centered at the location of a point $s_i$ in population 1, this will make no difference. To
illustrate the substantive meaning of this hypothesis in the presence of stationarity,
suppose that populations 1 and 2 are plant species in which the root system of species 1 is
toxic to species 2, so that no plant of species 2 can survive within two feet of any species
1 plant. Then consider a two-stage process in which the plant locations of species 1 and 2
are first generated at random, and then all species 2 plants within two feet of any species
1 plant are removed.6 Then it is not hard to see that the marginal process for population
2 will still exhibit stationarity (since locations of population 1 are equally likely to be
anywhere). But the conditional process for population 2 given the locations of population
1 is highly non-stationary, and indeed must have zero cell counts for all two-foot cells
around population 1 sites.
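
To make this "hard-core" illustration concrete, the following is a minimal MATLAB sketch of such a two-stage process on a square region. The region size and plant counts are illustrative assumptions; only the two-foot removal rule comes from the example above:

% Illustrative simulation of the toxic-root ("hard-core") example.
L  = 100;                          % side length of a square region in feet (illustrative)
n1 = 50;  n2 = 200;                % illustrative numbers of plants of each species
s1 = L*rand(n1,2);                 % stage 1: random species-1 locations
s2 = L*rand(n2,2);                 % stage 1: random species-2 locations
keep = true(n2,1);
for j = 1:n2                       % stage 2: remove species-2 plants near species 1
    d = sqrt((s1(:,1)-s2(j,1)).^2 + (s1(:,2)-s2(j,2)).^2);
    keep(j) = all(d > 2);          % survives only if no species-1 plant within 2 feet
end
s2 = s2(keep,:);                   % conditional pattern of surviving species-2 plants

Averaged over many such realizations, the marginal species-2 pattern remains (approximately) stationary, while the pattern conditional on the species-1 locations clearly is not.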

Now returning to the two-stage “shift” process described above, this process suggests a
natural way of testing the independence hypothesis in (5.5.1) using sample cross K-
functions. In particular, if the given realization of population 2 is randomly shifted in any
way, then this should not affect the expected counts,

4. For the present, we implicitly assume that region R is "sufficiently large" that edge effects can be ignored.
5. Note that while there is an apparent asymmetry in this definition between populations 1 and 2, the definition of conditional probability implies that (5.5.1) must also hold with labels 1 and 2 reversed.
6. This is an instance of what is called a "hard-core" process in the literature (as for example in Ripley, 1977, section 3.2 and Cressie, 1995, section 8.5.4).

(5.5.2)   $E\{N_2[C_h(s_i)]\} = E\Big[\sum_{j=1}^{n_2} I_h(d_{ij})\Big]$

of population 2 events within distance h of any population 1 event, si . This in turn


implies from (5.3.2) that the cross K-function should remain the same for all such shifts
(remember that cross K-functions are expected values). Hence if one were to randomly
sample shifted versions of the given pattern and construct the corresponding statistical
population of sample cross K-functions, then this population could be used to test for
spatial independence in exactly the same way that the CSR hypothesis was tested using
K-functions. This testing scheme is in principle very appealing since it provides a direct
test of the spatial independence hypothesis that preserves the marginal distribution of
both populations.

5.5.2 Problem of Edge Effects

But in its present form, such a test is not practically possible since we are only able to
observe these processes in a bounded region, R. Thus any attempt to “shift” the pattern
for population 2 will require knowledge of the pattern outside this window, as shown in
Figures 5.4 and 5.5 below. Here the black dots represent unknown sites of population 2
events. Hence any shift of the pattern relative to region R will allow the possible entry of
unknown population 2 events into the window defined by region R.

[Figure 5.4. Pattern for Population 2.   Figure 5.5. Randomly Shifted Pattern — black dots outside the window R represent unknown population-2 events that may enter R under a shift]

However, it turns out that under certain conditions one can construct a reasonable
approximation to this ideal testing scheme. In particular, if the given region R is
rectangular, then there is indeed a way of approximating stationary point processes
outside the observable rectangular window. To see this, suppose we start with the two
point patterns in a rectangular boundary, R, as shown in Figure 5.6 below (with pattern 1


= white dots and pattern 2 = black dots).7 If these patterns are in fact generated by
stationary point processes on the plane, then in particular, the realized pattern,
$S_2^0 = (s_{2j}^0 : j = 1,..,n_2)$, for population 2 (shown separately in Figure 5.7 below) could
equally well have occurred in any shifted version of region R.

Figure 5.6. Rectangular Region Figure 5.7. Population 2

But since the rectangularity of R implies that the entire plane can be filled by a "tiling" of
disjoint copies of region R (also called a "lattice" of translations of R), and since this same
point pattern can be treated as a typical realization in each copy of R, we can in principle
extend the given pattern in region R to the entire plane by simply reproducing this pattern
in each copy of R [as shown partially in Figure 5.8 below].8 We designate this infinite
version of pattern $S_2^0$ by $\tilde{S}_2^0$.

Figure 5.8. Partial Tiling Figure 5.9. Random Shifts

7. This example is taken from Smith (2004).
8. Such replications are also called "rectangular patterns with periodic boundary conditions" (see for example Ripley, 1977 and Diggle, 1983, section 1.3).

In this way, we can effectively remove the "edge effects" illustrated in Figure 5.5 above.
Moreover, while the "replication process" that generates $\tilde{S}_2^0$ must of course exhibit
stronger symmetry properties than the original process for population 2, it can be shown
that this process shares the same mean and covariance structure as the original process.
Moreover, it can also be shown that under the spatial independence hypothesis, the cross
K-function yielded by this process must be the same as for the original process.9 Hence
for the case of rectangular regions, R, it is possible to carry out this replicated version of
the "ideal" testing procedure described above.

5.5.3 Random Shift Test

To make this test explicit, we start by observing that it suffices to consider only local
random shifts. To see this, note first that if point pattern 1 in Figure 5.6 is designated by
$S_1^0 = (s_{1i}^0 : i = 1,..,n_1)$, then shifting $\tilde{S}_2^0$ relative to $S_1^0$ on the plane is completely equivalent
to shifting $S_1^0$ relative to $\tilde{S}_2^0$. Hence we need only consider shifts of $S_1^0$. Next observe by
symmetry that every distinct rectangular portion of $\tilde{S}_2^0$ that can occur in shifted versions of
R (such as the pattern inside the blue box of Figure 5.8) can be obtained at some position
of R inside the red dotted boundary shown in Figure 5.8. Hence we need only consider
random shifts of $S_1^0$ within this boundary. Again, the blue box in Figure 5.8 represents
one such shift (where the white dots for population 1 have been omitted for the sake of
visual clarity). Hence to construct the desired random-shift test, we can use the following
procedure:

(i) Simulate N random shifts that will keep rectangle R inside the feasible region in
Figure 5.9. Then shift all coordinates in $S_1^0$ by this same amount.

(ii) If $S_2^m = (s_{2j}^m : j = 1,..,n_2^m)$ denotes the pattern for population 2 occurring in random
shift m = 1,..,N of rectangle R (which will usually be of a slightly different size than
$S_2^0$), then a sample cross K-function, $\hat{K}_{12}^m(h)$, can be constructed from $S_1^0$ and $S_2^m$. In
particular, if the relevant set of distance radii is chosen to be $D = \{h_w : w = 1,..,W\}$,
then the actual values constructed are $\{\hat{K}_{12}^m(h_w) : w = 1,..,W\}$.

(iii) Finally, if the observed sample cross K-function, $\hat{K}_{12}^0(h)$, is constructed in the
same way from $S_1^0$ and $S_2^0$ (where the latter pattern is equivalent to the "zero shift"
denoted by the central box in Figure 5.8), then under the spatial independence
hypothesis, (5.5.1), each observed value, $\hat{K}_{12}^0(h_w)$, should be a "typical" sample from
the list of values $[\hat{K}_{12}^m(h_w) : m = 0,1,..,N]$. Hence (in a manner completely analogous
to the single-population tests of CSR), if we now let $M_0^+$ denote the number of
simulated random shifts, m = 1,..,N, with $\hat{K}_{12}^m(h_w) \ge \hat{K}_{12}^0(h_w)$, then the estimated
probability of obtaining a value as large as $\hat{K}_{12}^0(h_w)$ under this spatial independence
hypothesis is given by the attraction p-value,

(5.5.3)   $\hat{P}_{attraction}(h_w) = \frac{M_0^+ + 1}{N + 1}$

where small values of $\hat{P}_{attraction}(h_w)$ can be interpreted as implying significant
attraction between populations 1 and 2 at scale $h_w$.

9. See the original paper by Lotwick and Silverman (1982) for proofs of these facts.

(iv) Similarly, if $M_0^-$ denotes the number of simulated random shifts, m = 1,..,N,
with $\hat{K}_{12}^m(h_w) \le \hat{K}_{12}^0(h_w)$, then the estimated probability of obtaining a value as small
as $\hat{K}_{12}^0(h_w)$ under this spatial independence hypothesis is given by the repulsion
p-value,

(5.5.4)   $\hat{P}_{repulsion}(h_w) = \frac{M_0^- + 1}{N + 1}$

where small values of $\hat{P}_{repulsion}(h_w)$ can be interpreted as implying significant
repulsion between populations 1 and 2 at scale $h_w$.
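
Before turning to the Forest application, the following is a minimal MATLAB sketch of the "wrap-around" (toroidal) shift that underlies step (i). It is one standard way of implementing the tiling construction above, is not the actual k12_shift_plot.m program, and uses the illustrative cross_k function sketched in Section 5.3:

% Illustrative toroidal random shift of pattern 1 relative to pattern 2.
% loc1, loc2 = location matrices for populations 1 and 2;
% xmin,xmax,ymin,ymax = rectangular region R; D = vector of distance radii.
Lx = xmax - xmin;   Ly = ymax - ymin;            % side lengths of rectangle R
ax = Lx*rand;       ay = Ly*rand;                % random shift vector
S1m = [mod(loc1(:,1) - xmin + ax, Lx) + xmin, ...
       mod(loc1(:,2) - ymin + ay, Ly) + ymin];   % shift pattern 1 and wrap around R
K12m = cross_k(S1m, loc2, Lx*Ly, D);             % one simulated cross K-function
% Repeating this for m = 1,..,N shifts yields the reference distribution of
% cross K-function values used in the p-values (5.5.3) and (5.5.4).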

5.5.4 Application to the Forest Example

This testing procedure is implemented in the MATLAB program, k12_shift_plot.m, and
can be applied to the Forest example above as follows. The forest data appears in the
ARCMAP file, Forest.mxd, and was exported to the MATLAB workspace, forest.mat.
The coordinate locations of the n1 = 21 oaks and n2 = 43 maples are given in matrices,
L1 and L2, respectively. An examination of these locations in ARCMAP (or in Figure
5.1 above) suggested that a reasonable range of radial distances to consider is from 10 to
330 feet, and the set of (14) distance values, D = [10:20:270],10 was chosen for analysis.
The rectangular region, R, in Figure 5.1 is seen in ARCMAP to be defined by the
bounding values, (xmin = -10, xmax = 589, ymin = 20, ymax = 577). Using these
parameters, the command:

>> PVal = k12_shift_plot(L1,L2,xmin,xmax,ymin,ymax,999,D);

yields a vector of attraction p-values (5.5.3) at each radial distance in D based on 999
simulated random shifts of the maples relative to the oaks. Recall that in this example, an
inspection of Figure 5.1 suggested that there are “island clusters” of oaks in a “sea” of

10. In MATLAB this yields a list D of values from 10 to 270 in increments of 20. (See also p.5-23 below.)

maples. Hence, in terms of attraction versus repulsion, this suggests that there is some
degree of repulsion between oaks and maples. Thus one must be careful when
interpreting the p-value output, PVal, of this program.

Recall that as with clustering versus dispersion, unless there are many simulated cross K-
function values exactly equal to $\hat{K}_{12}^0(h_k)$, we will have $\hat{P}_{repulsion}(h_k) \approx 1 - \hat{P}_{attraction}(h_k)$.
Hence one can identify significant repulsion by plotting $\hat{P}_{attraction}(h_k)$ for k = 1,..,K and
looking for large p-values. This plot is given as screen output for k12_shift_plot.m, and
is illustrated in Figure 5.10 below for a simulation with N = 999:

[Figure 5.10. Random Shift P-Values — attraction p-values plotted against radius (0 to 300 feet), with dashed reference lines near 0.05 ("attraction") and 0.95 ("repulsion")]

Here the red dashed line on the bottom corresponds to an attraction p-value of .05, so that
values below this level denote significant attraction at the .05 level. Similarly the red
dashed line at the top corresponds to an attraction p-value of .95, so that values above this
line denote significant repulsion at the .05 level. Hence there appears to be significant
repulsion between oaks and maples at scales $30 \le h \le 150$. This is seen to be in
reasonable agreement with a visual inspection of Figure 5.1 above.

But while this test is reasonable in the present case, this is in large part due to the
presence of a rectangular region, R. More generally, in cases such as large forests where
analyses of "typical" rectangular regions usually suffice, this is not much of a restriction.
But for point patterns in regions, R, such as the elongated island shown in Figure 5.10, it
is clear from the figure that any attempt to reduce R to a rectangle might remove most of
the relevant pattern data.

[Figure 5.10. Island Example — an elongated island region R]


This island example also raises another important limitation of the random-shift approach
when comparing point patterns. Recall that this approach treats the given region, R, as a
sample “window” from a much larger realization of point patterns, so that the hypothesis
of stationarity is at least meaningful in principle. But the shoreline of an island is physical
barrier between very different ecological systems. So if the point patterns were trees (as
in the oak-maple example) then the shoreline is not simply an “edge effect”. Indeed the
very concept of stationarity is at best artificial in such applications.

5.6. Random-Labeling Approach to Spatial Independence

An approach which overcomes many of these problems is based on an alternative
characterization of multiple-population processes. Rather than focusing on the individual
processes generating patterns $S_1 = (s_{1i} : i = 1,..,n_1)$ and $S_2 = (s_{2j} : j = 1,..,n_2)$ above, one
can characterize this realized joint pattern in an entirely different way. Suppose we let
$n = n_1 + n_2$ denote the total number of events generated, and associate with each event,
$i = 1,..,n$, a pair $(s_i, m_i)$ where $s_i \in R$ is the location of event i in R, and $m_i \in \{1,2\}$ is a
marker (or label) denoting whether event i is of type 1 or 2. Stochastic processes
generating such pairings of joint locations and labels for each event are called marked
point processes.11 The Forest example above can be regarded as the realization of a
marked point process where the number of events is $n = 21 + 43 = 64$, and the possible
labels for each event are "oak" and "maple". Clearly each realized set of values,
$[(s_i, m_i) : i = 1,..,n]$, yields a complete description of a joint pattern pair $(S_1, S_2)$ above.
The key advantage of this particular characterization is that it allows the location process
to be separated from the distribution of event types.
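
In MATLAB terms, such a marked point pattern amounts to nothing more than a combined location matrix together with a label vector. The following minimal sketch (using the L1 and L2 matrices of the Forest example) is illustrative rather than part of the course programs:

% Illustrative marked-point-pattern representation of the Forest example.
% L1 = 21-by-2 oak locations, L2 = 43-by-2 maple locations (from forest.mat).
loc = [L1; L2];                                     % locations s_i of all n = 64 events
m   = [ones(size(L1,1),1); 2*ones(size(L2,1),1)];   % labels m_i: 1 = oak, 2 = maple
% Under the random labeling approach, the locations loc are held fixed and
% only the label vector m is randomly permuted (see Section 5.6.2 below).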

This is particularly relevant in situations where the location process is complex, or where
the set of feasible locations may involve a host of unobserved restrictions. As a simple
illustration, suppose that in the Forest example there were in fact a number of subsurface
rock formations, denoted by the gray regions in Figure 5.11, that prevented the growth of
any large trees in these areas. Then even if these rock formations are not observed (and
thus impossible to model), the observed locations of trees must surely avoid these areas.
Hence if one were to condition on these observed locations, then it would still be possible
to analyze certain relations between oaks and maples without the need to model all
feasible locations.

[Figure 5.11. Location Restrictions — the forest pattern of Figure 5.1 with gray regions denoting subsurface rock formations]

11. The following development is based on the treatment in Cox and Isham (1980). For a nice overview discussion, see Diggle (2003, pp. 82-83), and for a deeper analysis of marked spatial point processes, see Cressie (1993, section 8.7).

More generally, by conditioning on the observed set of locations, one can compare a wide
variety of point populations without the need to identify alternative locations at all. Not
only does this circumvent all problems related to the shape of region, R, but it also avoids
the need to identify specific land-use constraints (such as street networks or zoning
restrictions) that may influence the locations of relevant point events (like housing sales
or traffic accidents).

5.6.1 Spatial Indistinguishability Hypothesis

To formalize an appropriate notion of spatial independence for population comparisons in
the context of marked point processes, we start by considering the joint distribution of a
set of n marked events,

(5.6.1)   $\Pr[(s_i, m_i) : i = 1,..,n] = \Pr[(s_1,..,s_n),(m_1,..,m_n)]$
          $\qquad = \Pr[(m_1,..,m_n) \,|\, (s_1,..,s_n)] \cdot \Pr(s_1,..,s_n)$

where $\Pr(s_1,..,s_n)$ denotes the marginal distribution of event locations, and where
$\Pr[(m_1,..,m_n) \,|\, (s_1,..,s_n)]$ denotes the conditional distribution of event labels given their
locations.12 If $\Pr(m_1,..,m_n)$ denotes the corresponding marginal distribution of event
labels, then the relevant hypothesis of spatial independence for our present purposes
asserts simply that event labels are not influenced by their locations, i.e., that

(5.6.2)   $\Pr[(m_1,..,m_n) \,|\, (s_1,..,s_n)] = \Pr(m_1,..,m_n)$

for all locations $s_1,..,s_n \in R$ and labels $m_1,..,m_n \in \{1,2\}$. In the Forest example above, for
instance, the hypothesis that there is no spatial relationship between oaks and maples is
here taken to mean that the given set of tree locations, $(s_1,..,s_n)$, tells us nothing about
whether these locations are occupied by oaks or maples. Hence the only locational
assumption implicit in this hypothesis is that any observed tree location could be
occupied by either an oak or a maple. Note also that this doesn't mean that oaks and
maples are equally likely events. Indeed if there are many more maples than oaks, then
all of this information is captured in the distribution of labels, $\Pr(m_1,..,m_n)$.

As with the random shift approach (where the marginal distributions of each population
were required to be stationary), we do require one additional assumption about the
marginal distribution of labels, $\Pr(m_1,..,m_n)$. Note in particular that the indexing of
events, 1,2,..,n, only serves to distinguish them, and that their particular ordering has no
relevance whatsoever.13
12. For simplicity we take the number of events, n, to be fixed. Alternatively, the distributions in (5.6.1) can all be viewed as being conditioned on n.

relevance whatsoever.13 Hence the likelihood of labeling events, $(m_1,\ldots,m_n)$, should not depend on which event is called "1", and so on. This exchangeability condition can be formalized by requiring that for all permutations $(\pi_1,\ldots,\pi_n)$ of the subscripts $(1,\ldots,n)$,14

(5.6.3)   $\Pr(m_{\pi_1},\ldots,m_{\pi_n}) \;=\; \Pr(m_1,\ldots,m_n)$

These two conditions together imply that the point processes generating populations 1
and 2 are essentially indistinguishable. Hence we now designate the combination of
conditions, (5.6.2) and (5.6.3) as the spatial indistinguishability hypothesis for
populations 1 and 2. This hypothesis will form the basis for many of the tests to follow.

5.6.2 Random Labeling Test

To test the spatial indistinguishability hypothesis, [(5.6.2),(5.6.3)], our objective is to show that for any observed set of locations $(s_1,\ldots,s_n)$ and population sizes $n_1$ and $n_2$ with $n_1 + n_2 = n$, all possible labelings of events must be equally likely under this hypothesis. This in turn will give us an exact sampling distribution that will allow us to construct Monte Carlo tests of (5.6.2).

To do so, we begin by observing that in the same way that stationarity of marginal distributions was inherited by conditional distributions in (5.5.1) above, it now follows that exchangeability of labeling events in (5.6.3) is inherited by the corresponding conditional events in (5.6.2). To see this, observe simply that for any given set of locations $(s_1,\ldots,s_n)$ and subscript permutation $(\pi_1,\ldots,\pi_n)$ it follows at once from (5.6.2) and (5.6.3) that

(5.6.4)   $\Pr[(m_{\pi_1},\ldots,m_{\pi_n}) \,|\, (s_1,\ldots,s_n)] \;=\; \Pr(m_{\pi_1},\ldots,m_{\pi_n}) \;=\; \Pr(m_1,\ldots,m_n) \;=\; \Pr[(m_1,\ldots,m_n) \,|\, (s_1,\ldots,s_n)]$

To complete the desired task, it is enough to observe that for any two labelings, $(m_1,\ldots,m_n)$ and $(m'_1,\ldots,m'_n)$, consistent with $n_1$ and $n_2$ we must have

(5.6.5)   $(m'_1,\ldots,m'_n) \;=\; (m_{\pi_1},\ldots,m_{\pi_n})$

for some permutation, $(\pi_1,\ldots,\pi_n)$. Hence if the conditional distribution of such labels given both $(s_1,\ldots,s_n)$ and $(n_1, n_2)$ is denoted by $\Pr[\,\cdot\, | (s_1,\ldots,s_n), n_1, n_2]$, then it follows that:

(5.6.6)   $\Pr[(m'_1,\ldots,m'_n) \,|\, (s_1,\ldots,s_n), n_1, n_2] \;=\; \Pr[(m_{\pi_1},\ldots,m_{\pi_n}) \,|\, (s_1,\ldots,s_n), n_1, n_2] \;=\; \Pr[(m_1,\ldots,m_n) \,|\, (s_1,\ldots,s_n), n_1, n_2]$

13 However, if one were to model the emergence of new events (such as new disease victims or new housing sales), then this ordering would indeed play a significant role.
14 For example, possible permutations of (1, 2, 3) include $(\pi_1, \pi_2, \pi_3) = (2,1,3)$ and $(\pi_1, \pi_2, \pi_3) = (3,2,1)$.

Moreover, since these conditional labeling events are mutually exclusive and collectively exhaustive, it also follows that this set of permutations must yield a well-defined conditional probability distribution, i.e., that:

(5.6.7)   $\sum_{(\pi_1,\ldots,\pi_n)} \Pr[(m_{\pi_1},\ldots,m_{\pi_n}) \,|\, (s_1,\ldots,s_n), n_1, n_2] \;=\; 1$

Finally, recalling that the number of permutations of $(1,\ldots,n)$ is given by $n!$, we may conclude from (5.6.6) and (5.6.7) that for any observed event locations, $(s_1,\ldots,s_n)$, and event labels, $(m_1,\ldots,m_n)$, with corresponding population sizes, $n_1$ and $n_2$, we have the following exact conditional distribution for all permutations $(\pi_1,\ldots,\pi_n)$ of these labels under the spatial indistinguishability hypothesis:15

(5.6.8)   $\Pr[(m_{\pi_1},\ldots,m_{\pi_n}) \,|\, (s_1,\ldots,s_n), n_1, n_2] \;=\; \dfrac{1}{n!}$

This provides us with the desired sampling distribution for testing this hypothesis. In
particular, the following procedure yields a random-labeling test of (5.6.2) that closely
parallels the random-shift test above:

(i) Given observed locations, $(s_1,\ldots,s_n)$, and labels $(m_1,\ldots,m_n)$ with corresponding population sizes, $n_1$ and $n_2$, simulate $N$ random permutations $[\pi_1(\ell),\ldots,\pi_n(\ell)]$, $\ell = 1,\ldots,N$, of $(1,\ldots,n)$,16 and form the permuted labels $(m_{\pi_1(\ell)},\ldots,m_{\pi_n(\ell)})$, $\ell = 1,\ldots,N$ [which is equivalent to taking a sample of size N from the distribution in (5.6.8)].

(ii) If $S_1^\ell = (s_{1i} : i = 1,\ldots,n_1)$ and $S_2^\ell = (s_{2j} : j = 1,\ldots,n_2)$ denote the patterns for populations 1 and 2 obtained from the joint realization, $[(s_1,\ldots,s_n),(m_{\pi_1(\ell)},\ldots,m_{\pi_n(\ell)})]$, and if $\hat{K}_{12}^\ell(h)$ denotes the sample cross K-function resulting from $(S_1^\ell, S_2^\ell)$, then choose a relevant set of distance radii, $D = \{h_w : w = 1,\ldots,W\}$, and calculate the sample cross K-function values, $\{\hat{K}_{12}^\ell(h_w) : w = 1,\ldots,W\}$, for each $\ell = 1,\ldots,N$.

(iii) Finally, if the observed sample cross K-function, $\hat{K}_{12}^0(h)$, is constructed from the observed patterns, $S_1^0$ and $S_2^0$, then under the spatial indistinguishability hypothesis

15 It should be noted that since $m_i \in \{1,2\}$ for each $i = 1,\ldots,n$, many permutations $(m_{\pi_1},\ldots,m_{\pi_n})$ will in fact be identical. Hence the probability of each distinct realization is $n_1!\,n_2!/n!$. But since it is easier to sample random permutations (as discussed in the next footnote), we choose to treat each permutation as a realization.
16 This is in fact a standard procedure in most software. In MATLAB, a random permutation of the integers $(1,\ldots,n)$ is obtained with the command randperm(n).

each observed value, $\hat{K}_{12}^0(h_w)$, should be a "typical" sample from the list of values $[\hat{K}_{12}^\ell(h_w) : \ell = 0,1,\ldots,N]$. Hence if we now let $M_0$ denote the number of simulated random relabelings, $\ell = 1,\ldots,N$, with $\hat{K}_{12}^\ell(h_w) \ge \hat{K}_{12}^0(h_w)$, then the estimated probability of obtaining a value as large as $\hat{K}_{12}^0(h_w)$ under this hypothesis is again given by the attraction p-value in (5.5.3) above.

(iv) Similarly, if M 0 denotes the number of simulated random relabelings,


  1,.., N , with Kˆ 12 (hw )  Kˆ 120 (hw ) , then the estimated probability of obtaining a
value as small as Kˆ 0 (h ) under this hypothesis is again given by the repulsion p-
12 w

value in (5.5.4) above.
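
As an illustration of how steps (i) through (iv) might be organized in code, the following MATLAB fragment is a minimal sketch of the random-labeling procedure for attraction p-values. It assumes a hypothetical helper function, cross_k(S1,S2,D), that returns sample cross K-function values at the radii in D; the class program k12_perm_plot.m (used below) is the full implementation, with the p-value form paralleling expressions such as (5.7.2) below.

% Minimal sketch of the random-labeling test (attraction p-values).
% L1, L2 are the observed location matrices for populations 1 and 2, and
% cross_k(S1,S2,D) is a hypothetical helper returning sample cross K-values.
n1  = size(L1,1);
loc = [L1; L2];                            % all n locations, population 1 on top
n   = size(loc,1);
N   = 999;                                 % number of random relabelings
K0  = cross_k(L1, L2, D);                  % observed cross K-function
M0  = zeros(size(D));                      % exceedance counts at each radius
for l = 1:N
    p  = randperm(n);                      % random permutation of event indices
    Kl = cross_k(loc(p(1:n1),:), loc(p(n1+1:n),:), D);
    M0 = M0 + (Kl >= K0);                  % count simulations with Kl >= K0
end
P_attraction = (M0 + 1) / (N + 1);         % estimated attraction p-values

Repulsion p-values are obtained in exactly the same way with the inequality reversed.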

Before applying this test it is of interest to ask why simulation is required at all. Since the distribution in (5.6.8) is constant, why not simply calculate the values, $\Pr[\hat{K}_{12}(h_w) \ge \hat{K}_{12}^0(h_w)]$, for each $w = 1,\ldots,W$? The difficulty here is that since there is no simple analytical expression for these probabilities, one must essentially enumerate the sample space of relabelings and check these inequalities case by case. But even for patterns as small as $n_1 = 10 = n_2$, the number of distinct relabelings to be checked is seen to be $20!/(10!\,10!) = 184{,}756$. So even for small patterns, there are sufficiently many distinct relabelings to make Monte Carlo simulation the most efficient procedure for testing purposes.
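
This count is easily verified in MATLAB with the built-in function nchoosek:

% Number of distinct relabelings of n = 20 events into two groups of 10
nchoosek(20,10)     % = 184756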

Finally it is important to stress that while this random-labeling approach is clearly more
flexible than the random-shift approach above, this flexibility is not achieved without
some costs. In particular, the most appealing feature of the random shift test was its
ability to preserve many key properties of the marginal distributions for populations 1
and 2. In the present approach, where the joint distribution is recast in terms of a location
and labeling process, all properties of these marginal distributions are lost. So (as
observed by Diggle, 2003, p.83) the present marked-point-process approach is most
applicable in cases where there is a natural separation between location and labeling of
population types. In the context of the Forest example above, a simple illustration would
be the analysis of a disease affecting say maples. Here the two populations might be
“healthy” and “diseased” maples. So in this case there is a single location process
involving all maple trees, followed by a labeling process which represents the spread of
disease among these trees.17

17
An example of precisely this type involving “Myrtle Wilt”, a disease specific to myrtle trees, is part of
Assignment 2 in this course.

5.6.3 Application to the Forest Example


In a manner paralleling the random-shift test, this random-relabeling test is implemented
in the MATLAB program, k12_perm_plot.m. If the observed locations of populations 1
and 2 are again denoted by L1 and L2, and if D again denotes the set of selected radial
distances, then a screen plot of attraction p-values for 999 simulations is now obtained by
the command (where the final argument, “1”, specifies that a random seed is to be used):
>> k12_perm_plot(L1,L2,999,D,1);
If this test is applied to the Forest example with the somewhat larger set of radial distance
values, D = [10:20:330], then a typical result is shown in Figure 5.11 below:

[Plot: repulsion and attraction p-values (vertical axis, 0 to 1) versus radius (horizontal axis, 0 to 350 feet)]

Figure 5.11 Random Relabeling P-Values

Here we see that the results are qualitatively similar to the random-shift test for short
distances, but that repulsion is dramatically more extreme for long distances. Indeed
significant repulsion now persists up to the largest possible relevant scale of 330 feet (=
Dmax/2). Part of the reason for this can be seen in Figure 5.12 below, where a partial
tiling of the maple pattern in Figure 5.1 is shown.

[Figure: partial tiling of the maple point pattern]

Figure 5.12 New Maple Structure


Even this small portion of the tiling reveals an additional hidden problem with the random-shift approach. For while this replication process statistically preserves the means of sample cross K-functions, the variance of these functions tends to increase. The reason for this is that tiling by its very nature tends to create new structure near the boundaries of the rectangular region, R.18 In the present case, the red ellipses in Figure 5.12 represent larger areas devoid of maples than those in R itself (created mainly by the combination of empty areas in the lower left and upper right corners of R). Similarly, the blue ellipses represent new clusters of maples larger than those in R. The result of this new structure in the present case is to make the tiled pattern, $S_2^0$, of maples appear somewhat more clustered at larger scales. This in turn yields higher levels of repulsion between oaks ($S_1^0$) and maples ($S_2^0$) at these larger scales for most simulated shifts. The result of this is to make the observed level of repulsion between $S_1^0$ and $S_2^0$ appear relatively less significant at these larger scales, as reflected in the plot of Figure 5.11.19

5.7. Analysis of Spatial Similarity

The two procedures above allowed us to test whether there was significant “attraction” or
“repulsion” between two patterns. This focuses on their joint distribution. Alternatively,
we might simply compare their marginal distributions by asking: How similar are the
spatial point patterns S1 and S2 ? For instance, in the Forest example of Figure 5.1 we
started off with the observation that the oaks appear to be much more clustered than the
maples. Hence rather than characterizing this relative clustering as repulsion between the
two populations, we might simply ask whether the pattern of oaks, S1 , is more clustered
than the pattern of maples, S2 .

But while the original (univariate) sample K-functions, Kˆ 1 (h) and Kˆ 2 (h) , provide
natural measures of individual population clustering, it is not clear how to compare these
two values statistically. Note that since the population values, K1 (h) and K 2 (h) , are
simply mean values (for any given h ), one might be tempted to conduct a standard
difference-between-means test. But this could be very misleading, since such tests
assume that the two underlying populations (in this case S1 and S2 ) are independently
distributed. As we have seen above, this is generally false. Hence the key task here is to
characterize “complete similarity” in a way that will allow deviations from this
hypothesis to be tested statistically.

Here the basic strategy is to interpret “complete similarity” to mean that both point
patterns are generated by the same spatial point process. Hence if the sizes of S1 and S2
are given respectively by n1 and n2 , then our null hypothesis is simply that the

18
For additional discussion of this point see Diggle (2003, p.6).
19
Lotwick and Silverman noted this same phenomenon in their original paper (1982, p.410), where they
concluded that such added structure will tend to “show less discrepancy from independence” and thus yield
a relatively conservative testing procedure.

combination of these two patterns, $S = [(s_{1i} : i = 1,\ldots,n_1),(s_{2j} : j = 1,\ldots,n_2)]$, is in fact a single population realization of size $n = n_1 + n_2$, i.e., $S = (s_1,\ldots,s_{n_1}, s_{n_1+1},\ldots,s_n)$. If this were true,
then it would not matter which subset of n1 samples was labeled as “population 1”. It
should be clear from the above discussion that a natural way to formulate this hypothesis
is to treat the combined process as a marked point process.20 In this framework, the
relevant null hypothesis is simply that given observed locations, ( s1 ,.., sn ) and labels
(m1 ,.., mn ) with n1 occurrences of “1” and n2 occurrences of “2”, each permutation of
these labels is equally likely. But this is precisely the assertion in expression (5.6.8)
above. Hence in the context of marked point processes, the joint distribution of labels
(m1 ,.., mn ) given locations ( s1 ,.., sn ) and population sizes, n1 and n2 , is here seen to be
precisely the spatial indistinguishability hypothesis.

However, the present focus is on the marginal distributions of populations 1 and 2 rather
than the dependency properties of their joint distribution. Hence the natural test statistics
are the sample K-functions, Kˆ 1 (h) and Kˆ 2 (h) , for each marginal distribution rather than
the sample cross K-function. Note moreover that if both samples are indeed coming from
the same population, then Kˆ 1 (h) and Kˆ 2 (h) should be estimating the same K-function,
say K (h) , for this common population. Hence if these sample K-functions were unbiased
estimates, then by definition the individual K-functions, $K_i(h) = E[\hat{K}_i(h)]$, $i = 1,2$, would be the same. In this context, "complete similarity" would thus reduce to the simple null hypothesis: $H_0 : K_1(h) = K_2(h)$. However, as noted in Section 4.3, this simplification is only appropriate for stationary isotropic processes with Ripley corrections. Thus, in view of the fact that hypothesis (5.6.2) is perfectly meaningful for any point process, we choose to adopt a more flexible approach.

To do so, we first note that even in the absence of stationarity, the sample K-functions, $\hat{K}_1(h)$ and $\hat{K}_2(h)$, continue to be reasonable measures of clustering (or dispersion) within populations. Hence to test for relative clustering (or dispersion) it is still natural to focus on the difference between these sample measures,21 which we now define to be

(5.7.1)   $\Delta(h) \;=\; \hat{K}_1(h) - \hat{K}_2(h)$

Hence the relevant spatial similarity hypothesis for our present purposes is that the observed difference obtained from (5.7.1) is not statistically distinguishable from the random differences obtained from realizations of the conditional distribution of labels under the spatial indistinguishability hypothesis [(5.6.2),(5.6.3)].

20
Indeed this is the reason why the analysis of joint distributions above was developed before considering
the present comparison of marginal distributions.
21 Note that one could equally well consider the ratio of these measures, or equivalently, the difference of their logs.

5.7.1 Spatial Similarity Test

If we simulate random relabelings in (5.6.8) to obtain a sampling distribution of $\Delta(h)$ under this spatial similarity hypothesis, then the observed difference can simply be compared with this distribution. In particular, if the observed difference is unusually large (small) relative to this distribution, then it can reasonably be inferred that subpopulation 1 is significantly more clustered (dispersed) than subpopulation 2. This procedure can now be formalized by the following simple variation of the random relabeling test, which we designate as the spatial similarity test:

(i) Given observed locations, $(s_1,\ldots,s_n)$, and labels $(m_1,\ldots,m_n)$ with corresponding population sizes, $n_1$ and $n_2$, simulate $N$ random permutations $[\pi_1(\ell),\ldots,\pi_n(\ell)]$, $\ell = 1,\ldots,N$, of $(1,\ldots,n)$, and construct the corresponding label permutations $(m_{\pi_1(\ell)},\ldots,m_{\pi_n(\ell)})$, $\ell = 1,\ldots,N$.

(ii) If $S_1^\ell = (s_{1i} : i = 1,\ldots,n_1)$ and $S_2^\ell = (s_{2j} : j = 1,\ldots,n_2)$ denote the population patterns obtained from the joint realization, $[(s_1,\ldots,s_n),(m_{\pi_1(\ell)},\ldots,m_{\pi_n(\ell)})]$, $\ell = 1,\ldots,N$, and if the corresponding sample difference function is denoted by $\Delta^\ell(h) = \hat{K}_1^\ell(h) - \hat{K}_2^\ell(h)$, then for the given set of relevant radial distances, $D = \{h_w : w = 1,\ldots,W\}$, calculate the sample difference values, $\{\Delta^\ell(h_w) : w = 1,\ldots,W\}$, for each $\ell = 1,\ldots,N$.

(iii) Finally, if the observed sample difference function, $\Delta^0(h) = \hat{K}_1^0(h) - \hat{K}_2^0(h)$, is constructed from the observed patterns, $S_1^0$ and $S_2^0$, then under the spatial similarity hypothesis, each observed value, $\Delta^0(h_w)$, should be a "typical" sample from the list of values $[\Delta^\ell(h_w) : \ell = 0,1,\ldots,N]$. Hence if we now let $m_0$ denote the number of simulated random relabelings, $\ell = 1,\ldots,N$, with $\Delta^\ell(h_w) \ge \Delta^0(h_w)$, then the probability of obtaining a value as large as $\Delta^0(h_w)$ under this hypothesis is estimated by the following relative clustering p-value for population 1 versus population 2:

(5.7.2)   $\widehat{\Pr}^{12}_{\Delta\text{-clustered}}(h) \;=\; \dfrac{m_0 + 1}{N + 1}$

(iv) Similarly, if m0 denotes the number of simulated random relabelings,   1,.., N ,
with  (hw )   0 (hw ) , then the probability of obtaining a value as small as  0 (hw )
under this hypothesis is estimated by the following relative dispersion p-value for
population 1 versus population 2:

m0  1
(5.7.3) Pˆr12 ( h ) 
N 1
-dispersed
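
As a minimal sketch of steps (i) through (iv), the following MATLAB fragment computes relative clustering p-values by random relabeling. It assumes a hypothetical helper, sample_k(S,D), that returns the (univariate) sample K-function of a pattern S at the radii in D; the class program k2_diff_plot.m, used in the next section, is the full implementation.

% Minimal sketch of the spatial similarity test (relative clustering p-values).
% loc is the n x 2 matrix of all locations, with population 1 in rows 1:n1;
% sample_k(S,D) is a hypothetical helper returning sample K-function values.
n    = size(loc,1);
N    = 999;
Del0 = sample_k(loc(1:n1,:),D) - sample_k(loc(n1+1:n,:),D);   % observed difference (5.7.1)
m0   = zeros(size(D));
for l = 1:N
    p    = randperm(n);                                       % random relabeling
    Dell = sample_k(loc(p(1:n1),:),D) - sample_k(loc(p(n1+1:n),:),D);
    m0   = m0 + (Dell >= Del0);
end
P_rel_clustered = (m0 + 1) / (N + 1);                         % p-values as in (5.7.2)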


5.7.2 Application to the Forest Example

This spatial similarity test is implemented in the MATLAB program, k2_diff_plot.m.


Here it is convenient to adopt the marked-point-process format by defining a single list of
locations, loc, in which the first n1 locations correspond to population 1 and all
remaining locations correspond to population 2. Hence both of these populations are
identified by simply specifying n1. If D again denotes the set of selected radial distances
used for the Forest example above, then a screen plot of relative clustering p-values for
999 simulations is now obtained by the command:

>> k2_diff_plot(loc,n1,999,D,1);

The output of a typical run is shown in Figure 5.13 below:

[Plot: p-values (vertical axis, 0 to 1; "r-dispersed" near 1, "r-clustered" near 0) versus radius (0 to 350 feet)]

Figure 5.13. Relative Clustering of Oaks

This confirms the informal observation above that oaks are indeed more clustered than
maples, for scales consistent with a visual inspection of Figure 5.1.

5.8 Larynx and Lung Cancer Example

While the simple Forest example above was convenient for developing a wide range of
techniques for analyzing bivariate point populations, the comparison of Larynx and Lung
cancer cases in Lancashire discussed in Section 1 is a much richer example. Hence we
now explore this example in some detail. First we analyze the overall relation between
these two patterns, using a variation of the spatial similarity analysis above. Next we
restrict this analysis to the area most relevant for the Incinerator in Figure 1.9. Finally, we
attempt to isolate the cluster near this Incinerator by a new method of local K-function
analysis that provides a set of exact local clustering p-values.


5.8.1 Overall Comparison of the Larynx and Lung Cancer Populations

Given the Larynx Cancer population of $n_1 = 57$ cases, and the Lung Cancer population of $n_2 = 917$ cases, we could in principle use k2_diff_plot to compare these populations. But
the great difference in size between these populations makes this somewhat impractical.
Moreover, it is clear that the Larynx cancer population in Figure 1.7 above is of primary
interest in the present example, and that Lung cancers serve mainly as an appropriate
reference population for testing purposes. Hence we now develop an alternative testing
procedure that is designed precisely for this type of analysis.

Subsample Similarity Hypothesis

To do so, we again start with the hypothesis that Larynx and Lung cancer cases are
samples from the same statistical population. But rather than directly compare the small
Larynx population with the much larger Lung population, we simply observe that if the
Larynx cases could equally well be any subsample of size $n_1$ from the larger joint population, $n = n_1 + n_2$, then the observed sample K-function, $\hat{K}_1(h)$, should be typical of the sample K-functions obtained in this way. Hence, in the context of marked point processes, the present subsample similarity hypothesis asserts that for any given realization $[(s_1,\ldots,s_n),(m_1,\ldots,m_n)]$, the value $\hat{K}_1(h)$ obtained from the $n_1$ locations with $m_i = 1$ is statistically indistinguishable from the same sample K-function obtained by randomly permuting these labels.

Test of the Subsample Similarity Hypothesis

The corresponding test of this subsample similarity hypothesis can be formalized as the following variation of the spatial similarity test procedure above:

(i) Same as for the spatial similarity test.

(ii) If $S_1^\ell = (s_{1i} : i = 1,\ldots,n_1)$ denotes the population pattern obtained from the joint realization, $[(s_1,\ldots,s_n),(m_{\pi_1(\ell)},\ldots,m_{\pi_n(\ell)})]$, and if the corresponding sample K-function is $\hat{K}_1^\ell(h)$, then for the given set of relevant radial distances, $D = \{h_w : w = 1,\ldots,W\}$, calculate the sample K-function values, $\{\hat{K}_1^\ell(h_w) : w = 1,\ldots,W\}$, for each $\ell = 1,\ldots,N$.

(iii) Finally, if the observed sample K-function, $\hat{K}_1^0(h)$, is constructed from the observed patterns, $S_1^0$ and $S_2^0$, then under the subsample similarity hypothesis, each observed value, $\hat{K}_1^0(h_w)$, should be a "typical" sample from the list of values $[\hat{K}_1^\ell(h_w) : \ell = 0,1,\ldots,N]$. Hence if we now let $m_0$ denote the number of simulated random relabelings, $\ell = 1,\ldots,N$, with $\hat{K}_1^\ell(h_w) \ge \hat{K}_1^0(h_w)$, then the probability of


obtaining a value as large as $\hat{K}_1^0(h_w)$ under this hypothesis is estimated by the following clustering p-value for population 1:

(5.8.1)   $\hat{P}^{1}_{\text{clustered}}(h) \;=\; \dfrac{m_0 + 1}{N + 1}$

(iv) In a similar manner, if $m_0$ denotes the number of simulated random relabelings, $\ell = 1,\ldots,N$, with $\hat{K}_1^\ell(h_w) \le \hat{K}_1^0(h_w)$, then the probability of obtaining a value as small as $\hat{K}_1^0(h_w)$ under this hypothesis is estimated by the following dispersion p-value for population 1:

(5.8.2)   $\hat{P}^{1}_{\text{dispersed}}(h) \;=\; \dfrac{m_0 + 1}{N + 1}$

Hence under this testing procedure, significant clustering (dispersion) for population 1 means that the observed pattern of size $n_1$ is more clustered (dispersed) than would be expected if it were a typical subsample from the larger pattern of size $n$. Note that while this test is in principle possible for subpopulations of any size less than $n$, it only makes statistical sense when $n_1$ is sufficiently small relative to $n$ to allow a meaningful sample of alternative subpopulations. Moreover, when $n_1$ is much smaller than $n$, the present Monte Carlo test is considerably more efficient in terms of computing time than the full spatial similarity test above.

Application to Larynx and Lung Cancers

This testing procedure is implemented in the MATLAB program, k2_global_plot.m.


(Here “global” refers to the global nature of this pattern analysis. We consider a local
version later.) Before carrying out the analysis, it is instructive to construct a sample
subpopulation pattern, S1 , for visual comparison with the observed pattern, S10 , of
Larynx cancers. The MATLAB workspace, Larynx.mat, contains the full set of
n  57  917  974 locations in the matrix, loc, where the n1  57 Layrnx cancer cases
are at the top. A random subpopulation of size n1 can be constructed in MATLAB by the
following command sequence:

>> list = randperm(974);


>> sublist = list(1:57);
>> sub_loc = loc(sublist,:);

The first command produces a random permutation, list, of the indices (1,...,974) and the
second command selects the first 57 values of list and calls them sublist. Finally, the last


command creates a matrix, sub_loc, of the corresponding locations in loc. While this
procedure is a basic component of the program, k2_global_plot.m, it is useful to perform
these commands manually in order to see an explicit example. This coordinate data can
then be imported to ARCMAP and compared visually with the given Larynx pattern as
shown in Figures 5.14 and 5.15 below:22

[Two maps: the observed Larynx cases, and a random subsample of the same size from the combined population of Larynx and Lung cases; scale bar 0 to 10 km]

Fig. 5.14. Observed Larynx Cases          Fig. 5.15. Sampled Larynx Cases

This visual comparison suggests that there may not be much difference between the
overall pattern of observed Larynx cancers and typical subsamples of the same size from
the combined population of Larynx and Lung cancers.

To confirm this by a statistical test, it remains only to construct an appropriate set of radial distances, D, for testing purposes. Here it is instructive to carry out this procedure explicitly by using the following command sequence:

>> Dist = dist_vec(loc);


>> Dmax = max(Dist);
>> d = Dmax/2;
>> D = [d/20:d/20:d];

The first command uses the program, dist_vec, to calculate the vector of $n(n-1)/2$ distinct pairwise distances among the n locations. The second command identifies the maximum, Dmax, of all these distances, and the third command uses the "Dmax/2" rule of thumb in expression (4.5.1) above to calculate the maximum radial distance for the test. Finally, some experimentation with the test results suggests that the p-value plot should include 20 equally spaced distance values up to Dmax/2. This can be obtained by the last command, which constructs a list of numbers starting at the value, d/20, and proceeding in increments of size d/20 until the number d is reached.

22
Note also that these subpopulations can be constructed directly in MATLAB. The relevant boundary file
is stored in the matrix, larynx_bd, so that subpopulation, sub_loc, can be displayed with the command,
poly_plot(larynx_bd,sub_loc). See Section 9 of the Appendix to Part I for further details.

Given this set of distances, D, a statistical test of the subsample similarity hypothesis for
this example can be carried out with the command:

>> k2_global_plot(loc,n1,999,D,1);

A typical result is shown in Figure 5.16 below:

[Plot: clustering p-values (vertical axis, 0 to 1; curves labeled "dispersed" and "clustered") versus radius (0 to 10,000 meters)]

Figure 5.16. P-Values for Larynx Cases

Here we can see that, except at small distances, there is no significant difference between
the observed pattern of Larynx cases and random samples of the same size from the
combined population. Moreover, since the default p-values calculated in this program are
the clustering p-values in (5.8.1), the portion of the plot above .95 shows that Larynx
cases are actually significantly more dispersed at small distances than would be expected
from random subsamples. An examination of Figures 1.7 and 1.8 suggests that, unlike
Lung cancer cases which (as we have seen in Section 4.7.3) are distributed in a manner
roughly proportional to population, there appear to be somewhat more Larynx cases in
less populated outlying areas than would be expected for Lung cancers. This is
particularly true in the southern area, which contains the Incinerator. Hence we now
focus on this area more closely.

5.8.2 Local Comparison in the Vicinity of the Incinerator

To focus in on the area closer to the Incinerator itself, we start with the observation that
heavier exhaust particles are more likely to affect the larynx (which is high in the throat).
Hence while little is actually known about either the exact composition of exhaust fumes
from this Incinerator or the exact coverage of the exhaust plume, it seems reasonable to
suppose that heavier exhaust particles are mostly concentrated within a few kilometers of
the source. Hence for purposes of the present analysis, a maximum range of 4000 meters


($\approx$ 2.5 miles) was chosen.23 This region is shown in Figure 5.17 below as a circle of
radius 4000 meters about the Incinerator (which is again denoted by a red cross as in
Figure 1.9):

[Map: Lung and Larynx cases within a circle of radius 4000 meters centered on the Incinerator]

Figure 5.17. Vicinity of the Incinerator

If the coordinate position of the Incinerator is denoted by Incin,24 then one can identify those cases that are within 4000 meters of Incin by means of the customized MATLAB program, Radius_4000.m. Open the workspace, Larynx.mat, and use the command:

>> OUT = Radius_4000(Incin,Lung,Larynx);

Here Lung and Larynx denote the locations of the Lung and Larynx cases, respectively.
The output structure, OUT, includes the locations of Lung and Larynx cases within 4000
meters of Incin, along with their respective distances from Incin. Here it can be seen by
inspection that the number of Larynx cases is n1 = 7. The total number of cases in this
area is n = 75. The appropriate inputs for k2_global_plot above can be obtained from
OUT as follows:

>> loc_4000 = OUT.LOC;


>> n1_4000 = length(OUT.L1);

Hence choosing D_4000 = [400:200:4000] to be an appropriate set of radial distances, a test of the subsample similarity hypothesis for this subpopulation can be run for 999 simulations with the command:

23 This is in rough agreement with the distance influence function, $f(d)$, estimated by Diggle, Gatrell and Lovett (1990, Figure 7), which is essentially flat for $d \ge 4$ kilometers.
24 This position is given in the ARCMAP layer, incin_loc.shp, as Incin = (354850,413550) in meters.

>> k2_global_plot(loc_4000,n1_4000,999,D_4000,1);

Here a typical result is shown in Figure 5.18 below:


[Plot: clustering p-values (vertical axis, 0 to 1; curves labeled "dispersed" and "clustered") versus radius (0 to 4000 meters)]

Figure 5.18. P-Values for Incinerator Vicinity

This plot is seen to be quite different from the global plot of Figure 5.16 above. In
particular, there is now some weakly significant clustering at scales below 500 meters.
This suggests that while the global pattern of Larynx cases exhibits no significant
clustering relative to the combined population of Larynx and Lung cases, the picture is
quite different when cases are restricted to the vicinity of the Incinerator. In particular,
the strong cluster of three Larynx cases nearest to the Incinerator in Figure 5.17 would
appear to be a contributing factor here.

5.8.3 Local Cluster Analysis of Larynx Cases

This leads to the third and final phase of our analysis of this problem. Here we consider a
local analysis of clustering which is a variation of the local K-function analysis in Section
4.8 above. We again adopt the spatial indistinguishability hypothesis that Larynx and
Lung cases are coming from the same point process, but now focus on each individual
Larynx case by considering the conditional distribution of all other labels given this
Larynx case.

To motivate this approach, we start by considering an enlargement of Figure 5.17 in Figure 5.19 below that focuses on the cluster of three Larynx cases closest to the Incinerator. Here we choose the uppermost case, labeled $s_{1i}$ in the figure, and consider a circular region of radius $h = 400$ meters about this case. There are seen to be six other cases within distance $h$ of $s_{1i}$, of which two are also Larynx cases. Hence it is of interest to ask how likely it is to find at least two other Larynx cases within this small set of cases near $s_{1i}$.


[Figure: circular neighborhood of radius h = 400 meters about the Larynx case $s_{1i}$, containing six other cases]

Figure 5.19. Neighborhood of Larynx Case

To determine the probability of this event, we start by removing the 4000-meter restriction and return to the full population of cancer cases, $n = n_1 + n_2 = 974$ with $n_1 = 57$. If we again adopt the null hypothesis of subsample similarity (so that Larynx cases could equally well be any subsample of size $n_1$ from the full population of $n$ cases), then under this hypothesis one can calculate the exact probability of this event. To start with, if there are $c$ other cases within distance $h$ of case $s_{1i}$, and $c_1$ of these belong to population 1, then under the subsample similarity hypothesis these $c$ cases can be regarded as a random sample of size $c$ from the population of $n-1$ other cases, and the event in question is that this sample contains exactly $c_1$ of the $n_1 - 1$ other population 1 cases. Hence the probability of this event is given by the general hypergeometric probability:

(5.8.3)   $p(k \,|\, m, K, M) \;=\; \dfrac{\dbinom{K}{k}\dbinom{M-K}{m-k}}{\dbinom{M}{m}} \;=\; \dfrac{\left[\dfrac{K!}{k!\,(K-k)!}\right]\left[\dfrac{(M-K)!}{(m-k)!\,(M-K-m+k)!}\right]}{\left[\dfrac{M!}{m!\,(M-m)!}\right]}$

where in the present case, $k = c_1$, $K = n_1 - 1$, $m = c$, and $M = n - 1$. Finally, to construct the


desired event probability as stated above, observe that if we let the random variable, $C_1$, denote the number of population 1 cases within distance $h$ of $s_{1i}$, then the chance of observing at least $c_1$ cases from population 1 is given by the sum:

(5.8.4)   $P(c_1 \,|\, c, n_1, n) \;=\; \mathrm{Prob}(C_1 \ge c_1 \,|\, c, n_1, n) \;=\; \sum_{k=c_1}^{c} p(k \,|\, c,\, n_1 - 1,\, n - 1)$

It is this cumulative probability, $P(c_1 \,|\, c, n_1, n)$, that yields the desired event probability. In the specific case above where $c_1 = 2$, $c = 6$, $n_1 = 57$, and $n = 974$, we see that this probability is given by

(5.8.5)   $P(2 \,|\, 6, 57, 974) \;=\; .042$

Hence if the subsample similarity hypothesis were true, then it would be quite surprising to find at least two Larynx cases within this subpopulation of six cases. In other words, for the given pattern of Larynx and Lung cases, there appears to be significant clustering of Larynx cases near $s_{1i}$ at the $h = 400$ meter scale.
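
As a quick check, the cumulative probability in (5.8.5) can be reproduced with the hypergeometric distribution function hygecdf in the MATLAB Statistics Toolbox:

% Upper-tail hypergeometric probability (5.8.4) for the case in (5.8.5):
% population of n-1 = 973 other cases containing n1-1 = 56 other Larynx cases,
% with c = 6 cases drawn; we want Prob(C1 >= c1) for c1 = 2.
c1 = 2;  c = 6;  n1 = 57;  n = 974;
P = 1 - hygecdf(c1 - 1, n - 1, n1 - 1, c)    % approximately 0.042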

Thus to construct a general testing procedure for local clustering (or dispersion) of
Larynx cases, it suffices to calculate the event probabilities in (5.8.4) for every observed
Larynx location, s1i , at every relevant radial distance, h . This procedure is implemented
in the MATLAB program, k2_local_exact.m.25 In the present case, if we consider only
the single radial distance, D = 400, and again use the location matrix, loc, then the set of
clustering p-values at each of the n1 = 57 Larynx locations is obtained with the
command:

>> [P,C,C1] = k2_local_exact(loc,n1,400);

Here P is the vector of p-values at each location, and C and C1 are the corresponding
vectors of total counts, c , and population 1 counts, c1 , at each location.
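
For illustration, these local p-values could in principle be computed directly from (5.8.4), as in the following sketch (which again assumes the Statistics Toolbox function hygecdf and that the n1 Larynx cases occupy the first rows of loc); the class program k2_local_exact.m is the full implementation.

% Sketch of local clustering p-values at a single radius h = 400 meters.
h = 400;
n = size(loc,1);
P_local = zeros(n1,1);
for i = 1:n1                                          % loop over Larynx cases
    d  = sqrt((loc(:,1)-loc(i,1)).^2 + (loc(:,2)-loc(i,2)).^2);  % distances to case i
    nb = (d <= h);  nb(i) = false;                    % other cases within distance h
    c  = sum(nb);                                     % total count c
    c1 = sum(nb(1:n1));                               % Larynx count c1
    P_local(i) = 1 - hygecdf(c1 - 1, n - 1, n1 - 1, c);   % Prob(C1 >= c1) as in (5.8.4)
end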

To gain further perspective on the significance of the cluster in Figure 5.19 above, one
can compare distances of cases to the Incinerator with the corresponding p-values as
follows:

>> L = [Incin;Larynx];
>> dist_L = dist_vec(L);
>> dist = dist_L(1:57);
>> COMP = [P,dist];
>> COMP = sortrows(COMP,1);
>> COMP(1:7,:)

        P           dist
    0.0094077      693.80
    0.029091       910.34
    0.042038      1002.90
    0.29995      12512.00
    0.34049      14858.00
    0.41478      13744.00
    0.48083      14982.00

The first command stacks the Incinerator location on top of the Larynx locations in a
matrix, L. The second and third commands then identify the relevant distances (i.e., from
Incin to all locations in Larynx ) as the first 57 distances, dist, produced by dist_vec(L).
The fourth and fifth commands combine P with dist in the matrix, COMP, and then sort the rows of COMP by P from low to high. Finally, the last command displays the first seven rows of this sorted version of COMP, as shown in the output above.

25
In the MATLAB directory for the class, there is also a Monte Carlo version of this program, k2_local.m.
By running these two programs for the same data set (say with 999 simulations) you can see that exact
calculations tend to be orders of magnitude faster than simulations – when they are possible.

The first three rows (those with the smallest p-values) are the three closest Larynx cases to the Incinerator, as can
be verified in ARCMAP (and can also be seen in Figure 5.17 above).26 Moreover, the
ordering of p-values shows that these are the only three locations that exhibit significant
clustering. Hence this result suggests that there may indeed be some relation between the
Incinerator and nearby Larynx cases.

26
Note that the case just below these three is almost as close to the Incinerator as one of these three. But
this case has only a single Lung case within 400 meters, and hence exhibits no clustering at this scale.

6. Space-Time Point Processes

Point events (such as crimes or disease cases) occur in time as well as space. If both time
and location data are available for these events, then one can in principle model this data
as the realization of a space-time point process. As a prime example, recall that the
Burkitt’s Lymphoma data (examined in Assignment 1 of this class) contains both onset
times and locations for 188 cases during the period 1961-1975. Moreover, the original
study of this data by Williams et al. (1978)1, (here referred to as [W]) focused precisely
on the question of identifying significant space-time clustering of these cases. Hence it is
of interest to consider this data in more detail.

The cases occurring in each five-year period of the study are displayed in Figure 6.1
below (with green shading reflecting relative population density in West Nile), and
correspond roughly to Figure 5 in [W].2 Here is does appear that cases in subsequent
periods tend to be clustered near cases in previous periods. But the inclusion of
population density in Figure 6.1 was done specifically to show that such casual
observations can be deceptive. Much of the new clustering is seen to occur in more
densely populated areas where one would expect new cases to be more likely based on
chance alone.

[Three maps: Lymphoma cases plotted for the five-year periods 1961-65, 1966-70, and 1971-75, with green shading reflecting relative population density in West Nile]

Figure 6.1 Lymphoma Cases in each Five-Year Period


The simple regression procedure used in Assignment 1 related times of cases to those of
their nearest-neighbors. But since population density is ignored in this approach, the
“clustering” result obtained by this procedure is questionable at best. Hence, one

1
This is Paper 1 in the Reference Materials on the class web page.
2
These cases differ slightly from those in Figure 5 of [W]. The present approximation is based on the
counting convention stated in [BG, p.81] that time is “measured in days elapsed since January 1st, 1960”.
This rule does not quite agree with the actual dates in the Appendix of [W], but the difference is very slight.


objective of the present section is to develop an alternative “random labeling” test that is
more appropriate. But before doing so, we shall consider the general question of space-
time clustering more closely.

6.1 Space-Time Clustering

Event sequences exhibit space-time clustering if events that are close in space tend to be
closer in time than would be expected by chance alone. The most classic examples of
space-time clustering are spatial diffusion processes in which point events are propagated
from locations to neighbors through some form of local interactions. Obvious examples
include the spread of forest fires (where new trees are ignited by the heat from trees
burning nearby), or the spread of contagious diseases (where individuals in direct contact
with infected individuals also become infected). Here it is worth noting that cancers such
as Burkitt’s Lymphoma are not directly contagious. However, as observed in [W,p.116],
malaria infections may be a contributing factor leading to Burkitt’s Lymphoma, and the
spread of malaria itself involves a diffusion process in which mosquitoes transmit this
disease from existing victims to new victims.

But even with genuine diffusion processes one must be careful in analyzing space-time
clustering. Consider the onset of a new flu epidemic introduced into a region, R, by a
single carrier, c, and suppose that the cases occurring during the first few days are those
shown in Figure 6.2 below.

 
[Two schematic maps of region R: early cases spreading outward from the initial carrier c (Figure 6.2), and widely dispersed later cases (Figure 6.3)]

Figure 6.2. Early Epidemic          Figure 6.3. Late Epidemic

Here there is a clear diffusion effect in which the initial cases involve contacts with c, and
are in turn propagated to others by secondary contacts. But notice that even though the
initial three cases shown are all close to c, this process spreads out quickly. So while the
six “second round” cases shown in the figure may all occur at roughly the same time,
they are already quite dispersed in space. This example shows that cases occurring close
in time need not occur close in space. However, this figure also suggests that cases
occurring close in space may indeed have a tendency to occur close in time.3 So there

3
Here we assume that most contacts involve individuals living in close spatial proximity – which may not
be the case. For example, some individuals have significant contact with co-workers at distant job sites.


appears to be some degree of asymmetry between space and time in such processes. We
shall return to this issue below.

While the early stages of this epidemic show clear propagation effects, this is not true at
later stages. After the first few weeks, such an epidemic may well have spread throughout
the region, so that the pattern of new cases occurring on each day may be very dispersed,
as shown in Figure 6.3. More importantly, this pattern will most likely be quite similar
from day to day. At this stage, the diffusion process is said to have reached a steady state
(or stationary state). In such a steady state it is clearly much harder to detect any space-
time clustering whatsoever. Diffusion is still at work, but the event pattern is no longer
changing in detectable ways.4 However, it may still be possible to detect such space-time
effects indirectly. For example, if one were to examine the distribution of cases on day t ,
and to identify the new cases on day $t+1$, then it might still be possible to test whether
these new cases are “significantly closer” to the population of cases on day t than would
be expected by chance alone. We shall not pursue such questions here. Rather the intent
of this illustration is to show that space-time clustering can be subtle in even the clearest
examples of spatial diffusion.

6.2 Space-Time K-Functions

With this preliminary discussion we turn now to the measurement of space-time clustering. Here we follow the approach of [BG, Section 4.3] by constructing a space-time version of K-functions.5 Consider a space-time pattern of events, $\{e_i = (s_i, t_i) : i = 1,\ldots,n\}$, in region, R, where $s_i$ again denotes the location of event $e_i$ in R, and $t_i$ denotes the time at which event $e_i$ occurs. If for a given event $e_i$ we are interested in the numbers of events that are "close" to $e_i$ in both space and time, then for each spatial distance, $h$, and time increment, $\tau$, it is natural to define the corresponding space-time neighborhood of event, $e_i = (s_i, t_i)$, by the Cartesian product:

(6.2.1)   $C_{(h,\tau)}(e_i) \;=\; \{(s,t) : \|s_i - s\| \le h,\ |t_i - t| \le \tau\} \;=\; \{s : \|s_i - s\| \le h\} \times \{t : |t_i - t| \le \tau\}$

Hence the circular neighborhoods, $C_h(s_i)$, in two dimensions are now replaced by cylindrical neighborhoods, $C_{(h,\tau)}(e_i)$, in three dimensions, as shown in Figure 6.4 below.

4
A more extreme example is provided by change in temperature distribution within a room after someone
has lit a match. While the match is burning, there is very sharp peak in the temperature distribution that
spreads out from this point source of heat. After the match has gone out, this heat is not lost. Rather it
continues to diffuse throughout the room until a new steady state is reached in which the temperature is
everywhere slightly higher than before.
5
For a more thorough treatment see Diggle, P., Chetwynd, A., Haggkvist, R. and Morris, S. (1995).


[Figure: cylindrical space-time neighborhood of event $e_i = (s_i, t_i)$, with spatial radius h about $s_i$ and time interval from $t_i - \tau$ to $t_i + \tau$]

Figure 6.4 Space-Time Neighborhoods

As in two dimensions, one can define a relevant space-time region as the Cartesian product, $R \times T$, of the given spatial region, R, and a relevant time interval, T. For a given pattern of events, $\{e_i = (s_i, t_i) : i = 1,\ldots,n\}$, the default time interval, T, for purposes of analysis is usually taken to be the smallest time interval containing all event times, i.e.,

(6.2.2)   $T \;=\; t_{\max} - t_{\min} \;=\; \max\{t_i : i = 1,\ldots,n\} \;-\; \min\{t_i : i = 1,\ldots,n\}$

as illustrated in Figure 6.5 below:6

tmax R T

T
tmin

Figure 6.5. Space-Time Region for Analysis

In this context, the desired space-time extension of K-functions is completely straightforward. First, if for any two space-time events, $e_i = (s_i, t_i)$ and $e_j = (s_j, t_j)$, we now let $t_{ij} = |t_i - t_j|$ (and again let $d_{ij} = \|s_i - s_j\|$), then as an extension of (4.3.2), we now have the following space-time indicator functions:

6 At this point it should be noted that, as with two dimensions, the cylindrical neighborhoods in (6.2.1) are subject to "edge effects" in $R \times T$, so that in general, one must replace $C_{(h,\tau)}(e_i)$ by $C_{(h,\tau)}(e_i) \cap (R \times T)$.


 1 , (dij  h) and (tij   )


(6.2.3) I ( h , ) (dij , tij )  
 0 , otherwise

If for a given space-time point process we let $\lambda_{st}$ denote the space-time (st) intensity of events, i.e., the expected number of events per unit of space-time volume, then the desired space-time K-function is again defined for each $h \ge 0$ and $\tau \ge 0$ to be the expected number of additional events within space-time distance $(h,\tau)$ of a randomly selected event, $e_i$, i.e.,

(6.2.4)   $K(h,\tau) \;=\; \dfrac{1}{\lambda_{st}}\, E\!\left[ \textstyle\sum_{j \ne i} I_{(h,\tau)}(d_{ij}, t_{ij}) \right]$

So as in (4.3.4), for any given pattern size, $n$, the pooled form of this function,

(6.2.5)   $K(h,\tau) \;=\; \dfrac{1}{n\,\lambda_{st}} \sum_{i=1}^{n} \sum_{j \ne i} E\!\left[ I_{(h,\tau)}(d_{ij}, t_{ij}) \right]$

implies that the natural estimator of $K(h,\tau)$ is given by the sample space-time K-function:

(6.2.6)   $\hat{K}(h,\tau) \;=\; \dfrac{1}{n\,\hat{\lambda}_{st}} \sum_{i=1}^{n} \sum_{j \ne i} I_{(h,\tau)}(d_{ij}, t_{ij})$

Here the sample estimate, $\hat{\lambda}_{st}$, of the space-time intensity is given by

(6.2.7)   $\hat{\lambda}_{st} \;=\; \dfrac{n}{a(R) \cdot (t_{\max} - t_{\min})}$

where the denominator is now seen to be the volume of the space-time region, $R \times T$, in Figure 6.5 above.
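
For concreteness, the estimator in (6.2.6) and (6.2.7) can be sketched in MATLAB as follows, ignoring edge-effect corrections. The function name and the variable names s, t, and aR (for the locations, times, and area a(R)) are assumed here, and the distance calculation uses the Statistics Toolbox functions pdist and squareform.

% Minimal sketch of the sample space-time K-function (6.2.6)-(6.2.7),
% ignoring edge effects.  s = n x 2 matrix of event locations,
% t = n x 1 vector of event times, aR = area a(R) of the study region.
function K = k_spacetime(s, t, aR, h, tau)
    n   = size(s,1);
    d   = squareform(pdist(s));                  % spatial distances d_ij
    dt  = abs(repmat(t,1,n) - repmat(t',n,1));   % time differences t_ij
    I   = (d <= h) & (dt <= tau);                % indicator I_(h,tau)(d_ij,t_ij)
    I(1:n+1:end) = false;                        % exclude the i = j terms
    lam = n / (aR * (max(t) - min(t)));          % space-time intensity (6.2.7)
    K   = sum(I(:)) / (n * lam);                 % sample K-function (6.2.6)
end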

6.3 Temporal Indistinguishability Hypothesis

To test for the presence of space-time clustering, one requires the specification of an
appropriate null hypothesis representing the complete absence of space-time clustering.
Here the natural null hypothesis to adopt is simply that there is no relation between the
locations and timing of events. Hence in a manner completely paralleling the treatment
of marked point processes in (5.6.1) it is convenient to separate space and time, and write
the joint probability of space-time events as,

(6.3.1)   $\Pr[(s_i, t_i) : i = 1,\ldots,n] \;=\; \Pr[(s_1,\ldots,s_n),(t_1,\ldots,t_n)] \;=\; \Pr[(t_1,\ldots,t_n) \,|\, (s_1,\ldots,s_n)] \cdot \Pr(s_1,\ldots,s_n)$


where Pr( s1 ,.., sn ) again denotes the marginal distribution of event locations, and where
Pr[(t1 ,.., tn ) | ( s1 ,.., sn )] denotes the conditional distribution of event times given their
locations.7 In this context, if the marginal distribution of event times is denoted by
Pr(t1 ,.., tn ) , then as a parallel to (5.6.2), the relevant hypothesis of space-time
independence for our present purposes can be stated as follows:

(6.3.2)   $\Pr[(t_1,\ldots,t_n) \,|\, (s_1,\ldots,s_n)] \;=\; \Pr(t_1,\ldots,t_n)$

Here it should be noted (as in footnote 5 of Section 5) that from a formal viewpoint, this
independence condition could equally well be stated by switching the roles of
locations, ( s1 ,.., sn ) , and times, (t1 ,.., tn ) , in (6.3.2). But as noted in Section 6.1 above,
there is a subtle asymmetry between space and time that needs to be considered here. In
particular, recall that event sequences are said to exhibit space-time clustering if events
that are close in space tend to be closer in time than would be expected by chance alone.
Hence it is somewhat more natural to condition on the spatial locations of events and
look for time similarities among those events that are close in space.

Note also that as with marked point processes, the indexing of events, $e_i$, is completely arbitrary. Here it might be argued that the ordering of indices i should reflect the ordering of event occurrences. But this is precisely why event times have been listed as distinct attributes of space-time events. Hence in the present formulation, it is again most appropriate to treat space-time pairs, $(s_i, t_i)$ and $(s_j, t_j)$, as exchangeable events. In a manner paralleling condition (5.6.3), this implies that for all permutations $(\pi_1,\ldots,\pi_n)$ of the subscripts $(1,\ldots,n)$ the marginal distribution of event times should satisfy the exchangeability condition:

(6.3.3)   $\Pr(t_{\pi_1},\ldots,t_{\pi_n}) \;=\; \Pr(t_1,\ldots,t_n)$

These two conditions together constitute our null hypothesis that spatial events are completely indistinguishable in terms of their occurrence times. Hence we now designate the combination of conditions (6.3.2) and (6.3.3) as the temporal indistinguishability hypothesis.

6.4 Random Labeling Test

In this setting, we next extend the argument in Section 5.6.2 to obtain an exact sampling distribution for testing this temporal indistinguishability hypothesis. To do so, observe first that the argument in (5.6.4) now shows that the conditional distribution in (6.3.2) inherits exchangeability from (6.3.3), i.e., that for all permutations $(\pi_1,\ldots,\pi_n)$ of $(1,\ldots,n)$,

7
Again for simplicity we take the number of space-time events, n, to be fixed. Alternatively, the
distributions in (6.3.1) can all be conditioned on n.


(6.4.1)   $\Pr[(t_{\pi_1},..,t_{\pi_n}) \mid (s_1,..,s_n)] \;=\; \Pr(t_{\pi_1},..,t_{\pi_n})$
          $\;=\; \Pr(t_1,..,t_n) \;=\; \Pr[(t_1,..,t_n) \mid (s_1,..,s_n)]$

Hence the only question is how to condition these permutations to obtain a well-defined
probability distribution. Recall that the appropriate conditional information shared by all
permutations of population labels, (m1 ,.., mn ) , was precisely the number of instances of
each label, “1” and “2”, i.e., the population sizes, n1 and n2 . Here the set of label
frequencies, {n1 , n2 } , is now replaced by the set of time frequencies, {nt : t  T } , where nt
is the number of times that t occurs in the given set of event times, (t1 ,.., tn ) , i.e.,8

(6.4.2)   $n_t \;=\; \#\{\, i : t = t_i\,,\; i = 1,..,n \,\}$

It is precisely this frequency distribution which is shared by all permutations, $(t_{\pi_1},..,t_{\pi_n})$,
in (6.4.1). Indeed, it follows [as a parallel to (5.6.5)] that for every list of times $(t'_1,..,t'_n)$
consistent with this distribution, there is some permutation $(t_{\pi_1},..,t_{\pi_n})$ of $(t_1,..,t_n)$ with:

(6.4.3)   $(t'_1,..,t'_n) \;=\; (t_{\pi_1},..,t_{\pi_n})$

Hence if the conditional distribution of such times given both $(s_1,..,s_n)$ and $\{n_t : t \in T\}$ is
denoted by $\Pr[\,\cdot \mid (s_1,..,s_n), \{n_t : t \in T\}\,]$, then the same arguments in (5.6.6) through (5.6.8)
now yield the following exact conditional distribution for all permutations $(\pi_1,..,\pi_n)$ of
these occurrence times under the temporal indistinguishability hypothesis:

(6.4.4)   $\Pr[(t_{\pi_1},..,t_{\pi_n}) \mid (s_1,..,s_n), \{n_t : t \in T\}] \;=\; \dfrac{1}{n!}$

As in Section 5.6.2, this sampling distribution again leads directly to a random-labeling


test of this hypothesis. For completeness, we list the steps of this test, which closely
parallels the random-labeling test of Section 5.6.2:

(i) Given observed locations, $(s_1,..,s_n)$, and occurrence times, $(t_1,..,t_n)$, simulate $N$
random permutations $[\pi_1(\nu),..,\pi_n(\nu)]$, $\nu = 1,..,N$, of $(1,..,n)$, and form the permuted
labels $(t_{\pi_1(\nu)},..,t_{\pi_n(\nu)})$, $\nu = 1,..,N$ [which is now equivalent to taking a sample of size $N$
from the distribution in (6.4.4)].
(ii) If $\hat{K}_\nu(h,\tau)$ denotes the sample space-time K-function resulting from the joint
realization, $[(s_1,..,s_n),(t_{\pi_1(\nu)},..,t_{\pi_n(\nu)})]$, then choose relevant sets of distance radii,

8
Note that in most cases these frequencies will either be zero or one. But the present general formulation
allows for the possibility of simultaneous events, as for example Lymphoma cases reported on the same
day (or even instantaneous events, such as multiple casualties in the same auto accident).


$\{h_w : w = 1,..,W_R\}$, for $R$, and time intervals, $\{\tau_v : v = 1,..,V_T\}$, for $T$, and calculate the
sample space-time K-function values, $\{\hat{K}_\nu(h_w,\tau_v) : w = 1,..,W_R,\ v = 1,..,V_T\}$, for each
$\nu = 1,..,N$.

(iii) Finally, if the observed sample space-time K-function, $\hat{K}_0(h,\tau)$, is constructed
from the observed event sequence, $[(s_1,..,s_n),(t_1,..,t_n)]$, then under the temporal
indistinguishability hypothesis each observed value, $\hat{K}_0(h_w,\tau_v)$, should be a “typical”
sample from the list of values $[\hat{K}_\nu(h_w,\tau_v) : \nu = 0,1,..,N]$. Hence if $M_0^+$ denotes the
number of simulated random relabelings, $\nu = 1,..,N$, with $\hat{K}_\nu(h_w,\tau_v) \ge \hat{K}_0(h_w,\tau_v)$,
then the probability of obtaining a value as large as $\hat{K}_0(h_w,\tau_v)$ under this hypothesis
is estimated by the space-time clustering p-value:

(6.4.5)   $\hat{P}_{st\text{-}clustered}(h_w,\tau_v) \;=\; \dfrac{M_0^+ + 1}{N+1}$

(iv) Similarly, if $M_0^-$ denotes the number of simulated random relabelings,
$\nu = 1,..,N$, with $\hat{K}_\nu(h_w,\tau_v) \le \hat{K}_0(h_w,\tau_v)$, then the estimated probability of obtaining
a value as small as $\hat{K}_0(h_w,\tau_v)$ under this hypothesis is again given by the space-time
dispersion p-value:

(6.4.6)   $\hat{P}_{st\text{-}dispersed}(h_w,\tau_v) \;=\; \dfrac{M_0^- + 1}{N+1}$

Our primary interest here is of course in space-time clustering for relatively small values
of $h$ and $\tau$. But it is clear that a range of other questions could in principle be addressed
within the more general framework outlined above.
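
To make steps (i)-(iv) above concrete, the following minimal MATLAB sketch shows how the
clustering p-value in (6.4.5) might be computed for a single space-time neighborhood (h, tau).
It is not the space_time_plot.m program discussed below; the function names and the simple
raw pair-count used in place of a fully edge-corrected space-time K-function are illustrative
assumptions only.

function P_cluster = st_cluster_pvalue(s, t, h, tau, N)
% Illustrative random-labeling test for one (h,tau) neighborhood.
% s = (n x 2) event locations, t = (n x 1) occurrence times,
% N = number of random relabelings.
  n  = size(s,1);
  DX = bsxfun(@minus, s(:,1), s(:,1)');    % pairwise coordinate differences
  DY = bsxfun(@minus, s(:,2), s(:,2)');
  D  = sqrt(DX.^2 + DY.^2);                % spatial distances between events
  K0 = st_count(D, t, h, tau);             % observed space-time statistic
  M  = 0;                                  % # relabelings with value >= observed
  for v = 1:N
      if st_count(D, t(randperm(n)), h, tau) >= K0
          M = M + 1;
      end
  end
  P_cluster = (M + 1)/(N + 1);             % p-value as in (6.4.5)
end

function c = st_count(D, t, h, tau)
% Count (ordered) pairs of distinct events close in both space and time.
  T = abs(bsxfun(@minus, t, t'));          % pairwise time differences
  A = (D <= h) & (T <= tau);
  c = sum(A(:)) - length(t);               % remove the n self-pairs
end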

6.5 Application to the Lymphoma Example

This testing procedure is implemented in the MATLAB program, space_time_plot.m,


and can be applied to the Lymphoma example above as follows. In the MATLAB
workspace, lymphoma.mat, the (188 x 3) matrix, LT, contains space-time data for the
n =188 lymphoma cases, with rows ( xi , yi , ti ) denoting the location, ( xi , yi ) , and onset
time, $t_i$, of each case $i$. In this program, the maximum distance is again set to $h_{max}/2$ as in
(4.5.1) above, and similarly, the maximum temporal interval is set to half the maximum
time interval, $\tau_{max}/2$, where $\tau_{max} = t_{max} - t_{min}$ in Figure 6.5 above. Given these
maximum values, the user has the option of choosing subdivisions of $h_{max}/2$ into $s$
equal increments, $h_i = (i/s)(h_{max}/2),\ i = 1,..,s$, and subdivisions of $\tau_{max}/2$ into $t$ equal


increments, $\tau_j = (j/t)(\tau_{max}/2),\ j = 1,..,t$. So for example the following command uses
999 random relabelings of times to test for space-time clustering of the Lymphoma data,
LT, at each point on a grid of space-time neighborhoods $(h_i,\tau_j)$ with $s = t = 20$:

>> results = space_time_plot(LT,999,20,20);

The results of these $s \times t = 400$ tests are plotted on a grid and then interpolated in
MATLAB to obtain a p-value contour map such as the one shown in Figure 6.6 below:

Figure 6.6. P-value Map for Lymphoma Data

Note first that each location in this region corresponds to the size of a space-time
neighborhood. Hence those areas with darker contours indicate space-time scales at
which there are significantly more cases in neighborhoods of this size (about randomly
selected cases) than would be expected under the temporal indistinguishability
hypothesis. In particular, the dark contours in the lower left corner show that there is very
significant concentration in small space-time neighborhoods, and hence significant space-
time clustering. This not only confirms the findings of the simple regression analysis
done in Assignment 1, but also conveys a great deal more information. In fact the darkest
contours show significance at the .001 level (which is the maximum significance
achievable with 999 simulations).9
Before discussing these results further, it is of interest to observe that while the direct plot
in MATLAB above is useful for obtaining visual results quickly, these p-values can also
be exported to ARCMAP and displayed in sharper and more vivid formats. For example,

9
Note also that these p-values can be retrieved in numerical form from the output structure, results, in the
command above.


the above results were exported to ARCMAP and smoothed by ordinary kriging to obtain
the sharper representation shown in Figure 6.7 below:

[Figure 6.7. Smoothed P-Value Map in ARCMAP. Axes: Distance (km), 0-70; Time (days), 0-2500.
P-value classes: .001, .001-.002, .002-.005, .005-.010, .010-.050, .050-.100, .100-.200, .200-1.00.]

Using this sharper image, notice first that the horizontal band of significance at the
bottom of the figure indicates significant clustering of cases within 500 days of each
other ($\approx$ 1.4 years) over a wide range of distances. This suggests the presence of short
periods (about 1.4 years) with unusually high numbers of cases over a wide region, i.e.,
local peaks in the frequency of cases over time. This can be confirmed by Figure 6.8
below, where a number of local peaks are seen, such as in years 7, 11, 13 and 15 (with
year 1 corresponding to 1961).

[Figure 6.8. Time Frequency of Lymphoma Cases. Axes: Time (years), 0-16; Number of Cases.]


Next observe that there is a secondary mode of significance at about 1500 days ($\approx$ 4
years) on the left edge of Figure 6.7. This indicates that many cases occurred close to one
another over a time lag of about 4 years. Note in particular that the peak years 7, 11, and
15 are spaced at intervals of 4 years. This suggests that such peaks may represent new outbreaks of
Lymphoma cases in the same areas at intervals of about 4 years. Hence the p-value plots
in Figures 6.6 and 6.7 above do indeed yield more information than simple space-time
clustering of events.


APPENDIX TO PART I
In this Appendix, designated as A1 (appendices A2 and A3 are for Parts II and III,
respectively), we shall again refer to equations in the text by section and equation
number, so that (2.4.3) refers to expression (3) in section 2.4 of Part I. Also, references to
previous expressions in this Appendix (A1), will be written the same way, so that
(A1.1.3) refers to expression (3) of section 1 in Appendix A1.

A1.1. Poisson Approximation of the Binomial

This standard result appears in many elementary probability texts [such as Larsen and
Marx (2001, p.247)]. Here one starts with the fundamental limit identity

(A1.1.1)   $\lim_{n \to \infty}\left(1 + \dfrac{x}{n}\right)^{n} \;=\; e^{x}$

that defines the exponential function. Given this relation, observe that since

(A1.1.2)   $\dfrac{n!}{k!(n-k)!} \;=\; \dfrac{n(n-1)\cdots(n-k+1)\,(n-k)!}{k!\,(n-k)!} \;=\; \dfrac{n(n-1)\cdots(n-k+1)}{k!}$

it follows that expression (2.2.3) can be written as

k nk
n!  a(C )   a(C ) 
(A1.1.3)   1  
k !(n  k )!  a( R)   a( R) 

k nk
 n k  n(n  1) (n  k  1)  a (C )   a (C ) 
 k   1  
n  k!  a( R)   a( R) 
 n n  1 n  k  1   n / a ( R)  a(C )
k n k
 a(C )   a(C ) 
    1   1  
n n n  k!  a( R)   a( R) 

But if we now evaluate expression (A1.1.3) at the sequence in (2.3.2) and recall that
$n_m/a(R_m) \to \lambda > 0$, then in the limit we can replace $n_m/a(R_m)$ by $\lambda$ in the second factor.
Moreover, since $(n_m - h)/n_m \to 1$ for all $h = 0,1,..,k-1$, it also follows that the first factor
in (A1.1.3) goes to one. In addition, the last factor also goes to one since
$a(R_m) \to \infty \;\Rightarrow\; a(C)/a(R_m) \to 0$. Hence by taking limits we see that

k nm k
nm !  a (C )   a(C ) 
(A1.1.4) lim m   1  
k !(nm  k )!  a( Rm )   a ( Rm ) 

________________________________________________________________________
ESE 502 A1-1 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part I. Spatial Point Pattern Analysis
______________________________________________________________________________________

[ a (C )]k   a (C )  
nm

 (1)  lim m 1    (1)


k!   a ( Rm )  

[ a(C )]k   a(C )[nm / a( Rm )]  


nm

  lim m 1   
k!   nm  

[ a(C )]k     a(C )  


nm

  lim m 1   
k!
  nm  

[ a(C )]k a (C )
 e
k!
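
As a quick numerical illustration of this limit, the following MATLAB sketch (with arbitrarily
chosen values of lambda, a(C) and k, which are illustrative assumptions only) compares the binomial
probabilities in (2.2.3) with their Poisson limit as n and a(R) grow with the density n/a(R) held fixed:

lambda = 2;  aC = 1.5;  k = 3;              % illustrative values only
for n = [50 500 5000]
    aR    = n/lambda;                        % keep n/a(R) = lambda fixed
    p     = aC/aR;                           % cell probability a(C)/a(R)
    binom = nchoosek(n,k) * p^k * (1-p)^(n-k);
    poiss = (lambda*aC)^k / factorial(k) * exp(-lambda*aC);
    fprintf('n = %5d:  binomial = %.6f   Poisson = %.6f\n', n, binom, poiss);
end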

A1.2. Distributional Properties of Nearest-Neighbor Distances under CSR

Recall that the nn-distance, $D$, for a randomly selected point has cdf

(A1.2.1)   $F_D(d) \;=\; 1 - \Pr(D > d) \;=\; 1 - e^{-\lambda\pi d^2}$

By differentiating (A1.2.1) we obtain the probability density $f_D$ of $D$ as

(A1.2.2)   $f_D(d) \;=\; F'_D(d) \;=\; 2\lambda\pi d\, e^{-\lambda\pi d^2}$

This distribution is thus seen to be an instance of the Rayleigh distribution (as for
example in Johnson and Kotz, 1970, p.197). This distribution is closely related to the
normal distribution, which can be used to calculate its moments. To do so, recall first that
since $E(X) = 0$ for any normal random variable, $X \sim N(0,\sigma^2)$, it follows that the
variance of $X$ is simply its second moment, i.e.,

(A1.2.3)   $\sigma^2 \;=\; \mathrm{var}(X) \;=\; E(X^2) - E(X)^2 \;=\; E(X^2)$

But since this normal density, $\phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-x^2/2\sigma^2\right)$, is symmetric about zero, we
then see that

(A1.2.4)   $\sigma^2 \;=\; E(X^2) \;=\; \dfrac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{\infty} x^2 e^{-x^2/2\sigma^2}\,dx \;=\; \dfrac{2}{\sqrt{2\pi\sigma^2}}\int_{0}^{\infty} x^2 e^{-x^2/2\sigma^2}\,dx$

$\;\Rightarrow\;\; \int_{0}^{\infty} x^2 e^{-x^2/2\sigma^2}\,dx \;=\; \dfrac{\sigma^2}{2}\sqrt{2\pi\sigma^2}$

Hence by setting $\sigma^2 = 1/(2\lambda\pi)$, so that $\lambda\pi = 1/(2\sigma^2)$, we obtain the identity

(A1.2.5)   $\int_{0}^{\infty} x^2 e^{-\lambda\pi x^2}\,dx \;=\; \dfrac{\sigma^2}{2}\sqrt{2\pi\sigma^2} \;=\; \dfrac{1}{4\lambda\pi}\sqrt{\dfrac{2\pi}{2\lambda\pi}} \;=\; \left(\dfrac{1}{4\lambda\pi}\right)\left(\dfrac{1}{\sqrt{\lambda}}\right)$

$\;\Rightarrow\;\; (2\lambda\pi)\int_{0}^{\infty} x^2 e^{-\lambda\pi x^2}\,dx \;=\; \dfrac{1}{2\sqrt{\lambda}}$

So to obtain the mean, $E(D)$, of $D$ observe from (A1.2.2) and (A1.2.5) that

(A1.2.6)   $E(D) \;=\; \int_{0}^{\infty} x\, f_D(x)\,dx \;=\; \int_{0}^{\infty} x\,(2\lambda\pi x\, e^{-\lambda\pi x^2})\,dx \;=\; 2\lambda\pi\int_{0}^{\infty} x^2 e^{-\lambda\pi x^2}\,dx \;=\; \dfrac{1}{2\sqrt{\lambda}}$

To obtain the variance, var( D) , of D we first calculate the second moment, E ( D 2 ) . To


do so, observe first from the integration-by-parts identity (as for example in Bartle, 1975,
Section 22) that for any differentiable functions, $f(x)$ and $g(x)$ on $[0,\infty)$,

(A1.2.7)   $\int_{0}^{\infty} f(x)\,g'(x)\,dx \;=\; -\int_{0}^{\infty} f'(x)\,g(x)\,dx \;-\; f(0)g(0) \;+\; \lim_{x\to\infty} f(x)g(x)$

whenever these integrals and limits exist. Hence letting $f(x) = x^2$ and $g(x) = -e^{-\lambda\pi x^2}$, it
follows that

(A1.2.8)   $\int_{0}^{\infty} x^2\,(2\lambda\pi x\, e^{-\lambda\pi x^2})\,dx \;=\; \int_{0}^{\infty} (2x)\,e^{-\lambda\pi x^2}\,dx \;-\; (0) \;-\; \lim_{x\to\infty} x^2 e^{-\lambda\pi x^2} \;=\; \int_{0}^{\infty} 2x\, e^{-\lambda\pi x^2}\,dx$

But by (A1.2.2) we have,

(A1.2.9)   $\int_{0}^{\infty} f_D(x)\,dx = 1 \;\Rightarrow\; \int_{0}^{\infty} 2\lambda\pi x\, e^{-\lambda\pi x^2}\,dx = 1 \;\Rightarrow\; \int_{0}^{\infty} 2x\, e^{-\lambda\pi x^2}\,dx = \dfrac{1}{\lambda\pi}$

which together with (A1.2.8) now shows that

(A1.2.10)   $E(D^2) \;=\; \int_{0}^{\infty} x^2 f_D(x)\,dx \;=\; \int_{0}^{\infty} x^2\,(2\lambda\pi x\, e^{-\lambda\pi x^2})\,dx \;=\; \int_{0}^{\infty} 2x\, e^{-\lambda\pi x^2}\,dx \;=\; \dfrac{1}{\lambda\pi}$

Finally, by combining (A1.2.6) and (A1.2.10) we obtain1

(A1.2.11)   $\mathrm{var}(D) \;=\; E(D^2) - [E(D)]^2 \;=\; \dfrac{1}{\lambda\pi} - \left(\dfrac{1}{2\sqrt{\lambda}}\right)^{2} \;=\; \dfrac{1}{\lambda\pi} - \dfrac{1}{4\lambda} \;=\; \dfrac{4-\pi}{4\pi\lambda}$

1
I am indebted to Christopher Jodice for pointing out several errors in my original posted derivations of
these moments.
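
These moments are easily checked by simulation. The following MATLAB sketch (with
illustrative parameter values) generates CSR point patterns on a square, computes
nearest-neighbor distances, and compares the sample mean and variance with (A1.2.6)
and (A1.2.11). Note that edge effects will make the simulated values slightly larger
than theory unless the region is large relative to typical nn-distances.

n = 1000;  L = 10;  lambda = n/L^2;         % density of points on an L x L square
nsim = 200;  Dmin = [];
for r = 1:nsim
    pts = L*rand(n,2);                       % one CSR realization
    DX  = bsxfun(@minus, pts(:,1), pts(:,1)');
    DY  = bsxfun(@minus, pts(:,2), pts(:,2)');
    D   = sqrt(DX.^2 + DY.^2);
    D(1:n+1:end) = inf;                      % ignore self-distances
    Dmin = [Dmin; min(D,[],2)];              % nearest-neighbor distances
end
fprintf('E(D):   simulated %.4f   theory %.4f\n', mean(Dmin), 1/(2*sqrt(lambda)));
fprintf('var(D): simulated %.5f   theory %.5f\n', var(Dmin),  (4-pi)/(4*pi*lambda));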


A1.3. Distribution of Skellam’s Statistic under CSR

Given these distributional properties of D , we next derive the distribution of Skellam’s


statistic in (3.2.6). To do so, we first observe from expression (A1.2.1) above that since
the cdf of the exponential distribution with mean $1/\theta$ is given by $F(x;\theta) = 1 - e^{-\theta x}$, it
follows at once that $D^2$ is exponentially distributed with mean $1/(\lambda\pi)$. But since the sum of
$m$ independent and identically distributed exponentials with mean $1/\theta$ is well known
to be Gamma distributed, $\Gamma(m,\theta)$ (as for example in Johnson and Kotz, 1970, Chapter
17), it then follows that under CSR, the distribution of $m$ independent nn-distance
samples, $(D_1,..,D_m)$, is given by,

(A1.3.1)   $W_m \;=\; \sum_{i=1}^{m} D_i^2 \;\sim\; \Gamma(m,\lambda\pi)$

For practical testing purposes, this is usually rescaled. Given that the gamma density for
Wm has the explicit form,

(A1.3.2)   $f_{W_m}(w) \;=\; \dfrac{(\lambda\pi)^{m}\, w^{m-1}}{(m-1)!}\; e^{-\lambda\pi w}$

the change of variables

(A1.3.3)   $S_m \;=\; 2\lambda\pi\, W_m \;=\; 2\lambda\pi \sum_{i=1}^{m} D_i^2$

yields a new density

(A1.3.4)   $g_{S_m}(s) \;=\; f_{W_m}(w(s))\,|w'(s)| \;=\; f_{W_m}\!\left(\dfrac{s}{2\lambda\pi}\right)\left|\dfrac{1}{2\lambda\pi}\right|$

$\;=\; \dfrac{(\lambda\pi)^{m}\,(s/2\lambda\pi)^{m-1}}{(m-1)!}\; e^{-\lambda\pi(s/2\lambda\pi)}\left(\dfrac{1}{2\lambda\pi}\right) \;=\; \dfrac{2^{-m}\, s^{m-1}}{(m-1)!}\; e^{-s/2}$

which is precisely the chi-square distribution with 2m degrees of freedom. Hence

(A1.3.5)   $S_m \;=\; 2\lambda\pi \sum_{i=1}^{m} D_i^2 \;\sim\; \chi^2_{2m}$
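
Since $D^2$ is exponentially distributed with mean $1/(\lambda\pi)$, the result in (A1.3.5) is easily
checked by simulation, as in the following MATLAB sketch (parameter values are illustrative only):

lambda = 5;  m = 20;  nsim = 10000;         % illustrative values only
D2 = -log(rand(nsim,m)) / (lambda*pi);      % D^2 ~ exponential with mean 1/(lambda*pi)
S  = 2*pi*lambda * sum(D2,2);               % Skellam's statistic for each sample of size m
fprintf('mean(S) = %.2f  (chi-square theory: %d)\n', mean(S), 2*m);
fprintf('var(S)  = %.2f  (chi-square theory: %d)\n', var(S),  4*m);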

A1.4. Effects of Positively Dependent Nearest-Neighbor Samples

In this section it is shown that positive dependencies among nearest neighbors have the
effect of increasing the variance of the test statistic, Z n , thus making outlier values more
likely than they would otherwise be. To show this, suppose first that the sample nn-


distance values $(D_1,..,D_n)$ are identically distributed with mean, $\mu = E(D_i)$, and
variance, $\sigma^2 = \mathrm{var}(D_i) = E[(D_i - \mu)^2]$. Then as a generalization of expression (3.2.11) in
the text, we have

(A1.4.1)   $\mathrm{var}(\bar{D}_n) \;=\; E[(\bar{D}_n - \mu)^2]$

$\;=\; E\Big[\big(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} D_i - \mu\big)^{2}\Big] \;=\; E\Big[\big(\tfrac{1}{n}\textstyle\sum_{i=1}^{n} (D_i - \mu)\big)^{2}\Big]$

$\;=\; E\Big[\tfrac{1}{n^2}\textstyle\sum_{i=1}^{n}\sum_{j=1}^{n} (D_i - \mu)(D_j - \mu)\Big] \;=\; \tfrac{1}{n^2}\textstyle\sum_{i=1}^{n}\sum_{j=1}^{n} E[(D_i - \mu)(D_j - \mu)]$

$\;=\; \tfrac{1}{n^2}\textstyle\sum_{i=1}^{n} E[(D_i - \mu)^2] \;+\; \tfrac{1}{n^2}\textstyle\sum_{i=1}^{n}\sum_{j \neq i} E[(D_i - \mu)(D_j - \mu)]$

$\;=\; \tfrac{1}{n^2}\textstyle\sum_{i=1}^{n} \mathrm{var}(D_i) \;+\; \tfrac{1}{n^2}\textstyle\sum_{i=1}^{n}\sum_{j \neq i} \mathrm{cov}(D_i, D_j)$

$\;=\; \tfrac{1}{n^2}\textstyle\sum_{i=1}^{n} \sigma^2 \;+\; \tfrac{1}{n^2}\textstyle\sum_{i=1}^{n}\sum_{j \neq i} \mathrm{cov}(D_i, D_j)$

$\;=\; \dfrac{\sigma^2}{n} \;+\; \tfrac{1}{n^2}\textstyle\sum_{i=1}^{n}\sum_{j \neq i} \mathrm{cov}(D_i, D_j)$

Hence if there are some positive dependencies (i.e., positive covariances) among the
nearest-neighbor values $(D_1,..,D_n)$, then the second term of the last line will be positive,
so that in this case $\mathrm{var}(\bar{D}_n) > \sigma^2/n$. Hence we must have

2 n D    2 
(A1.4.2) E[( Dn   ) ] 
2
 2 E[( Dn   ) ]  1  E  n   1
2

n    / n  

 E ( Z n2 ) 1  var( Z n ) 1

where the last line follows from the fact that $E(Z_n) = 0$ regardless of any dependencies
among the nn-distances. But since one should have $\mathrm{var}(Z_n) = 1$ under independent
random sampling, it then follows that realized values of Z n will tend to be farther away
from zero than would be expected under independence. Thus even those clustering or
uniformity effects due to pure chance will tend to look more significant than they actually
are.
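
This variance-inflation effect can be seen directly by simulation. The sketch below is an
illustrative assumption (equicorrelated normal draws with common correlation rho, not part
of the text); it shows that the variance of the sample mean exceeds sigma^2/n, so that the
standardized statistic Z_n has variance greater than one:

n = 30;  rho = 0.3;  sigma = 1;  nsim = 20000;       % illustrative values only
C = sigma^2 * ((1-rho)*eye(n) + rho*ones(n));        % equicorrelated covariance matrix
A = chol(C, 'lower');                                % so that A*z has covariance C
Dbar = mean(A*randn(n,nsim), 1);                     % sample means of correlated draws
fprintf('var(mean): simulated %.4f  vs  sigma^2/n = %.4f\n', var(Dbar), sigma^2/n);
fprintf('var(Z_n):  simulated %.2f  (= 1 under independence)\n', var(Dbar)/(sigma^2/n));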


A1.5. The Point-in-Polygon Procedure

The determination of whether a point, $s$, lies in a given polygon or not depends on certain
basic trigonometric facts. In Figure A1.1 below, the (hollow) point $s$ is seen to lie inside
the polygon, R, determined by the three boundary points {1,2,3}.

[Fig. A1.1. Point Inside Polygon: point $s$ inside the triangle R with boundary points {1,2,3}
and angles $\theta_{12}$, $\theta_{23}$, $\theta_{31}$ at $s$ between successive boundary points.]

If the angles (in radians) between successive points $i$ and $j$ are denoted by $\theta_{ij}$, then it
should be clear that for any point $s$ inside $R$ these angles constitute a full clockwise
rotation through $2\pi$ radians, and hence that we must have $\theta_{12} + \theta_{23} + \theta_{31} = 2\pi$. The
situation can be more complex when the given polygon is not convex. But nonetheless, it
can easily be seen that if counterclockwise rotations are given negative values, then any
counterclockwise rotations are canceled out by additional clockwise rotations to yield the
same total, $2\pi$. So if the polygon boundary points are numbered $\{1,2,..,N\}$ proceeding
in a clockwise direction from any initial boundary point, then we must always have:2


N 1
(A1.5.1) i 1
 i ,i1  2

On the other hand, if point s is outside of the polygon, R, then by cumulating angles
from s between each successive pair of points, the sum of clockwise and
counterclockwise rotations must cancel, leaving a total of zero radians, i.e.,


N 1
(A1.5.2) i 1
 i ,i1  0

In the case of the simple polygon, $R = \{1,2,3\}$, above, this is illustrated by the three
diagrams shown in Figure A1.2 below.

[Figure A1.2: three diagrams showing the angles $\theta_{12}$, $\theta_{23}$, $\theta_{31}$ measured from a point outside the polygon.]

2 Certain additional complications are discussed at the end of this section.
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part I. Spatial Point Pattern Analysis
______________________________________________________________________________________

Here the first two angles, $\theta_{12}$ and $\theta_{23}$, are positive, and the angle $\theta_{31}$ is precisely the
negative sum of $\theta_{12}$ and $\theta_{23}$. By extending this idea, it is easy to see that a similar
argument holds for larger polygons.


However, it is important to add here that this argument assumes that the polygon R is
connected, and has no holes. Unfortunately, these conditions can sometimes fail to hold
when analyzing general map regions. For example offshore islands are often included as
part of larger mainland regions, creating disconnected polygons. Also certain small
regions are sometimes nested in larger regions, creating holes in these regions. For
example, military bases or Indian reservations within states are often given separate
regional designations. There are other examples, such as the lake in Figure 2.4 of Part I,
where one may wish to treat certain subregions as “holes”.

So when using standard point-in-polygon routines in practice, one must be careful to


watch for these situations. Islands are usually best handled by redefining them as separate
regions. Then by applying a point-in-polygon procedure to each region separately, one
can determine whether a given point is in one of them, or in none of them. Holes can be
handled similarly. For example, suppose that $R_1 \subset R_2$, so that the relevant region is given by the
set-theoretic difference, $R_2 - R_1$. For this region, one can then apply point-in-polygon
routines to $R_1$ and $R_2$ separately, and accept only points that are in $R_2$ but not in $R_1$.
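
A minimal MATLAB implementation of this angle-summation test (for a single connected
polygon without holes) might look as follows; the function name is an illustrative assumption,
and MATLAB users can of course also rely on the built-in inpolygon function.

function inside = point_in_poly(s, poly)
% Angle-summation (winding-number) test of (A1.5.1)-(A1.5.2).
% s = (1 x 2) point; poly = (N x 2) list of boundary points.
  if any(poly(1,:) ~= poly(end,:))
      poly = [poly; poly(1,:)];              % close the boundary ring if necessary
  end
  v   = bsxfun(@minus, poly, s);             % vectors from s to each boundary point
  ang = atan2(v(:,2), v(:,1));               % directions of these vectors
  dth = diff(ang);                           % signed rotations between successive points
  dth = mod(dth + pi, 2*pi) - pi;            % wrap each rotation into (-pi, pi]
  inside = abs(sum(dth)) > pi;               % total is 2*pi inside, 0 outside
end

For example, point_in_poly([0.5 0.5], [0 0; 1 0; 1 1; 0 1]) returns true, while
point_in_poly([2 0.5], [0 0; 1 0; 1 1; 0 1]) returns false.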

A1.6. A Derivation of Ripley’s Correction

First observe that the circular cell, $C$, of radius $h$ about point $s_i$ can be partitioned into a
set of concentric rings, $C_k$, about $s_i$, each of thickness $\delta_k$, so that $C = \cup_k C_k$. One such
ring is shown in Figure A1.3 below.

[Fig. A1.3. Partition of Circular Cell, C: concentric rings $C_k$ of thickness $\delta_k$ about $s_i$,
with part of each ring possibly falling outside the region R.]

Since these rings are disjoint, it follows that the number of points in C is identically
equal to the sum of the numbers of points in each ring Ck , so that (in terms of the
notation in Section 2.2 in the text),

(A1.6.1)   $E[N(C)] \;=\; \sum_k E[N(C_k)]$

But by stationarity, it follows from expression (2.3.4) that


(A1.6.2)   $E[N(C_k)] \;=\; \lambda\, a(C_k) \;=\; \lambda\, a(C_k \cap R)\left(\dfrac{a(C_k)}{a(C_k \cap R)}\right)$

where $a(C_k \cap R)$ is by definition the area of the observable portion of $C_k$ inside $R$.
Now when the ring thickness, $\delta_k$, becomes small, it should be clear from Figure A1.3
that the ratio of $a(C_k \cap R)$ to $a(C_k)$ is approximately equal to the fraction of the circumference
of $C_k$ that is inside region $R$. So if this ratio is now denoted by $w_{ik}$, then,

(A1.6.3)   $\dfrac{a(C_k \cap R)}{a(C_k)} \;\approx\; w_{ik} \;\;\Rightarrow\;\; \dfrac{a(C_k)}{a(C_k \cap R)} \;\approx\; \dfrac{1}{w_{ik}}$

Hence, when the ring partition in Figure A1.3 becomes very fine, so that the $\delta_k$'s
become small, one has the approximation

(A1.6.4)   $E[N(C_k)] \;=\; \lambda\, a(C_k \cap R)\left(\dfrac{a(C_k)}{a(C_k \cap R)}\right) \;=\; E[N(C_k \cap R)]\left(\dfrac{a(C_k)}{a(C_k \cap R)}\right) \;\approx\; \dfrac{E[N(C_k \cap R)]}{w_{ik}}$

Putting these results together, we see that for fine partitions of C ,

(A1.6.5)   $K(h) \;=\; \tfrac{1}{\lambda}\, E[N(C)] \;=\; \tfrac{1}{\lambda}\sum_k E[N(C_k)] \;\approx\; \tfrac{1}{\lambda}\sum_k \dfrac{E[N(C_k \cap R)]}{w_{ik}}$

Note also that for sufficiently fine partitions it can be assumed that each ring contains at
most one of the observed points, $s_j \in C \cap R$, so that the point-count estimators
$\hat{E}[N(C_k \cap R)]$ for $E[N(C_k \cap R)]$ will have value one for those rings $C_k$ containing a
point and zero otherwise. Hence, observing by definition that $I_h(d_{ij}) = 1$ for all such
points, it follows that

(A1.6.6)   $\hat{E}[N(C_k \cap R)] \;=\; \begin{cases} I_h(d_{ij}) & ,\; s_j \in C_k \cap R \\ 0 & ,\; \text{otherwise} \end{cases}$

If we again estimate $\lambda$ by $\hat{\lambda} = n/a(R)$, and relabel the ring containing each point
$s_j \in C \cap R$ as $C_j$, then (A1.6.6) is seen to yield the following estimate of $K(h)$ in
(A1.6.5) based on point counts in the set $C \cap R$ centered at $s_i$,


(A1.6.7)   $\hat{K}_i(h) \;=\; \dfrac{1}{\hat{\lambda}}\sum_k \dfrac{\hat{E}[N(C_k \cap R)]}{w_{ik}} \;=\; \dfrac{1}{\hat{\lambda}}\sum_{j \neq i} \dfrac{I_h(d_{ij})}{w_{ij}}$

Finally, by averaging these estimates over all points $s_i \in R$ as in the text, we obtain the
pooled estimate,

(A1.6.8)   $\hat{K}(h) \;=\; \dfrac{1}{n}\sum_{i=1}^{n} \hat{K}_i(h) \;=\; \dfrac{1}{\hat{\lambda}\, n}\sum_{i=1}^{n}\sum_{j \neq i} \dfrac{I_h(d_{ij})}{w_{ij}}$

which is seen to be precisely Ripley’s correction in expression (4.3.7).
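
For readers who wish to experiment with this estimator directly, the following MATLAB sketch
computes (A1.6.8) for a polygonal region by approximating each weight w_ij numerically as the
fraction of the circle of radius d_ij about s_i lying inside R. This is an illustrative
implementation only, not the k_function programs used in the text, and the brute-force
weight computation is slow for large patterns.

function Khat = ripley_K(pts, bx, by, h)
% pts = (n x 2) event locations; (bx,by) = polygon boundary of region R; h = distance radius.
  n   = size(pts,1);
  aR  = polyarea(bx, by);                   % area of region R
  lam = n/aR;                               % estimated intensity
  th  = linspace(0, 2*pi, 360);             % discretization of the circle
  tot = 0;
  for i = 1:n
    for j = [1:i-1, i+1:n]
      d = norm(pts(i,:) - pts(j,:));
      if d <= h
        cx  = pts(i,1) + d*cos(th);         % circle of radius d about s_i
        cy  = pts(i,2) + d*sin(th);
        wij = mean(inpolygon(cx, cy, bx, by));   % fraction of circle inside R
        tot = tot + 1/wij;                  % edge-corrected contribution I_h(d_ij)/w_ij
      end
    end
  end
  Khat = tot/(lam*n);                       % pooled estimate as in (A1.6.8)
end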

A1.7. An Alternative Derivation of P-Values for K-functions

The text derivation of the P-values in expressions (4.6.8) and (4.6.10) is appealing from a
conceptual viewpoint in that it focused directly on the distribution of the test statistic,
Kˆ (h) , under the CSR Hypothesis. But there is an alternative derivation of this expression
that has certain practical advantages discussed below. This approach is actually much
closer in spirit to the argument used in deriving the “envelope” P-values of expressions
(4.6.3) and (4.6.4), which we now make more precise as follows. Observe that if l0 is
consistent with CSR then by construction (l0 , l1 ,.., lN ) must be independently and
identically distributed (iid) samples from a common distribution. In the envelope case it
was then argued from the symmetry of iid samples that none is more likely to be the
highest (or lowest) than any other. More generally, suppose we now ask how likely it is
for the observed sample value, l0 , to be the k th largest among the N  1 samples
(l0 , l1 ,.., lN ) , i.e., to have rank, k , in the ordering of these values. Here it is important to
note that ranks are not well defined in the case of ties. So for the moment we avoid this
complication by assuming that there are no ties. In this case, observe that there must be
( N  1)! possible orderings of these iid samples, and again by symmetry, that each of
these orderings must be equally likely. But since exactly N ! of these orderings have l0 in
the k th position (where N ! is simple the number of ways of ranking the other values), it
follows that if the random variable, R0 , denotes the rank of l0 , then under H 0 we must
have:

(A1.7.1)   $\Pr(R_0 = k) \;=\; \dfrac{N!}{(N+1)!} \;=\; \dfrac{N!}{(N+1)\cdot N!} \;=\; \dfrac{1}{N+1}\,,\quad k = 1,..,N+1$

which in turn implies that the chance of a rank as high as k is given by, 3

3
Remember that “high” ranks mean low values of k .


(A1.7.2)   $\Pr(R_0 \le k) \;=\; \sum_{r=1}^{k}\Pr(R_0 = r) \;=\; \sum_{r=1}^{k}\left(\dfrac{1}{N+1}\right) \;=\; \dfrac{k}{N+1}\,,\quad k = 1,..,N+1$

So rather than using the distribution of Kˆ (h) under CSR to test this null hypothesis, we
can use the distribution of its rank, $R_0$, in (A1.7.1) and (A1.7.2). But if we again let
$m^+(l_0)$ denote the number of simulated samples at least as large as $l_0$, then the observed
rank of $l_0$ (assuming no ties) is precisely $m^+(l_0) + 1$. So to test the CSR Hypothesis we
now ask: How likely would it be to obtain an observed rank as high as $m^+(l_0) + 1$ if CSR
were true? Here the answer is given from (A1.7.2) by the clustering P-value:

m (l0 )  1
(A1.7.3) Pcluster (h)  Pr[ R0  m (l0 )  1] 
N 1

which is seen to be precisely the same as expression (4.6.8). However there is one
important difference here, namely that we are no longer attempting to estimate a P-value.
The distribution in (A1.7.1) and (A1.7.2) is exact, so that there is no need for a “hat” on
Pcluster .

Another important advantage of this approach is that it is directly extendable to include


possible ties among values. In particular, suppose that whenever two values are tied, we
flip a fair coin to order them. More generally, suppose we use any tie-breaking procedure
under which the rankings $(R_0, R_1,.., R_N)$ are exchangeable random variables (i.e., under
which their joint distribution is invariant under any permutation of the indices, $0,1,..,N$).
Then it again follows that all $(N+1)!$ orderings resulting from this procedure must be
equally likely, and hence that (A1.7.1) and (A1.7.2) above continue to hold. The key
difference here is that in the presence of one or more ties, the ranking of $l_0$ is not
uniquely determined by its value; there must be some additional tie-breaking procedure.
So if $l_0$ is tied with exactly $q$ of the simulated values, then there must be some additional
information about the ranking, say $R_0(q)$, of $l_0$ among these $q+1$ equal values. Hence all that
can be said is that if $m^+(l_0)$ again has the same meaning, then the final rank of $l_0$ will be
$m^+(l_0) - q + R_0(q)$. For example, if $l_0$ were ranked last among the ties, so that
$R_0(q) = q+1$, then $l_0$ would again have rank $m^+(l_0) - q + (q+1) = m^+(l_0) + 1$, since all tied
values would be ranked ahead of $l_0$ (i.e., would be closer to rank 1 than $l_0$). Similarly, if
$l_0$ were ranked ahead of all other ties, so that $R_0(q) = 1$, then $l_0$ would have rank
$m^+(l_0) - q + 1$. Hence if we are given $R_0(q)$, then a conditional cluster P-value could be
defined in terms of expression (A1.7.2) as follows:

m (l0 )  q  R0 (q)
(A1.7.4) Pcluster [h | R0 (q)]  Pr[ R0  m (l0 )  q  R0 (q)] 
N 1


But since the above exchangeability property also implies that

(A1.7.5)   $\Pr[R_0(q) = i] \;=\; \dfrac{1}{q+1}\,,\quad i = 1,..,q+1$

it follows that we can obtain an unconditional clustering P-value (depending only on $q$)
by simply summing out these conditioning effects as follows:

(A1.7.6)   $P_{cluster}(h \mid q) \;=\; \sum_{i=1}^{q+1} P_{cluster}[h \mid R_0(q) = i]\;\Pr[R_0(q) = i]$

$\;=\; \sum_{i=1}^{q+1}\left(\dfrac{m^+(l_0) - q + i}{N+1}\right)\left(\dfrac{1}{q+1}\right) \;=\; \dfrac{1}{(N+1)(q+1)}\sum_{i=1}^{q+1}\big[\,m^+(l_0) - q + i\,\big]$

$\;=\; \dfrac{1}{(N+1)(q+1)}\left\{\big[\,m^+(l_0) - q\,\big](q+1) \;+\; \dfrac{(q+1)(q+2)}{2}\right\}$

$\;=\; \dfrac{m^+(l_0) - q + \tfrac{q+2}{2}}{N+1} \;=\; \dfrac{m^+(l_0) + 1 - (q/2)}{N+1}$

Hence this generalized cluster P-value amounts to replacing the rank, $m^+(l_0) + 1$, of $l_0$ in
(A1.7.2) for the case of no ties with its average rank, $m^+(l_0) + 1 - q/2$, for cases where $q$
values are tied with $l_0$. So for example, if $N = 3$ and $(l_0, l_1, l_2, l_3) = (5, 2, 5, 6)$, so that
$m^+(l_0) = 2$, $q = 1$, and the possible ranks of $l_0$ are $\{2,3\}$, then its average rank is 2.5 and

(2  1)  1/ 2 2.5
(A1.7.7) Pcluster (h)  
5 5

Note finally that (A1.7.3) above is now simply the special case of “no ties”, so that
$P_{cluster}(h) = P_{cluster}(h \mid 0)$.
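
In practice this average-rank adjustment is trivial to compute. The following MATLAB sketch
(the function name is an illustrative assumption) returns the generalized cluster P-value,
P_cluster(h | q), given the observed value l0 and the N simulated values:

function P = cluster_pvalue(l0, lsim)
% Generalized cluster P-value with the average-rank tie correction.
  N = numel(lsim);
  m = sum(lsim >= l0);                       % m+(l0): simulated values at least as large as l0
  q = sum(lsim == l0);                       % number of simulated values tied with l0
  P = (m + 1 - q/2)/(N + 1);                 % reduces to (m+1)/(N+1) when q = 0
end

For the example above, cluster_pvalue(5, [2 5 6]) returns the average-rank value (2+1-1/2)/(3+1).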

The argument for uniform P-values is of course identical. Thus the corresponding
generalized uniform P-value in the presence of q ties is given by:

m (l0 )  1  (q / 2)
(A1.7.8) Puniform (h | q) 
N 1


where m (l0 ) is again the number of simulated values li no larger than l0 . Here it is
important to note that these P-values are “almost complements” in the sense that for all q
and h ,

N 2
(A1.7.9) Pcluster (h | q)  Puniform (h | q) 
N 1

To see this, note simply that if we let $N^<$, $N^=$, $N^>$ denote the number of simulated
samples that are less than, equal to, or greater than $l_0$, respectively, then it follows by definition
that $q = N^=$, so that

(A1.7.10)   $m^+(l_0) \;=\; N^> + N^= \;=\; N^> + q$

(A1.7.11)   $m^-(l_0) \;=\; N^< + N^= \;=\; N^< + q$

and hence that

m (l0 )  1  (q / 2) m (l0 )  1  (q / 2)
(A1.7.12) Pcluster (h | q)  Puniform (h | q)  
N 1 N 1

[( N   q)  1  (q / 2)]  [( N   q)  1  ( q / 2)]

N 1

[( N   q  N  ]  2 N  2
 
N 1 N 1

Thus for even fairly small N it must be true that

(A1.7.13)   $P_{cluster}(h \mid q) \;+\; P_{uniform}(h \mid q) \;\approx\; 1$

so that we can essentially plot both P-values on one diagram. Hence all plots in K-
function programs such as k_function_plot focus on cluster P-values, $P_{cluster}(h \mid q)$,
where $P_{uniform}(h \mid q)$ is implicitly taken to be $1 - P_{cluster}(h \mid q)$.

A1.8. A Grid Plot Procedure in MATLAB

While the full grid, ref, can be represented in ARCMAP by exporting this grid from
MATLAB and displaying it as a point file, it is often more useful to construct this display
directly in MATLAB to obtain a quick check of whether or not the extent and grid size
are appropriate. Assuming that the boundary file exists in the MATLAB workspace, this
can be accomplished with the program poly_plot.m, which was written for this kind of
application. In the present case the boundary file, Bod_poly (shown on page 3-23 of Part

I), is the desired input. Hence to plot the grid, ref, with respect to this boundary, use the
command:

>> poly_plot(Bod_poly,ref);

[Fig. A1.4. Screen Output from poly_plot]

Notice that the size of the dots in the figure may be too large or too small, depending on
the size of the boundary being used. These attributes (and others, such as the thickness of
the boundary) can be altered. To do so, click on Edit and select Current Object Properties.
Then to edit the size of the grid points, click on any of these points. You will then see that
a few diagonal points are selected, and that a window has opened containing the attributes
of these points. Observe that under “Marker” there is a point-type window and a numerical
Marker size. If you increase or decrease this size, you will see that the point size in the
display above has changed. In a similar manner, you can edit the boundary thickness by
repeating the above Edit procedure, this time clicking on any exposed portion of the
boundary, rather than on one of the grid points.

A1.9. A Procedure for Interpolating P-Values

To duplicate the results in the text, open


Spatial Analyst and then select:

Interpolate to Raster → Spline.

In the Spline window that opens set:

Input points = “P-val.shp”


Z value field = “P_005”
Weight = “5”


and leave all other values as defaults. The value-field, P_005, contains the desired p-
values in the file, P-val.shp. The weight 5 adds a degree of “stiffness” to the spline
which yields a somewhat smoother result than the default .01 value. Now click OK and
a new layer appears called “Spline of P-val.shp”. Right click on this layer and select
“Make Permanent”. Save it to your home directory as say, spline_pvals. This will not
change the layer, but will give it an editable form. You can alter the display by right
clicking on the layer, “Spline of P-val.shp”, selecting “Classified” (rather than
“Stretched”), and editing its properties. [Notice that the values are mostly negative, and
that the relevant range from 0 to 1 is only a very small portion of the values. This is due
to the extreme nonlinearity of the spline fit.]

To obtain the display in Figure 4.23 above,


this spline surface can be converted to contour
lines as follows. First open Spatial Analyst
again and this time select

Surface Analysis → Contour

In the “Contour” window that opens set:

Input Surface = “Spline of PVals”


Contour Interval = “.08”
Base Contour = “.005”

Click OK and a new layer called “ctour” appears that shows the desired contours. This
file is stored as a temporary file. You can edit its properties. So select “Classify” and
choose the “Manual” option with settings (.01,.05,0.1,0.2) and appropriate colors. This
should yield roughly the representation in Figure 4.23 above. This file is stored as a
temporary file only. So you can keep trying different interval and base contour values
until you find values that capture the desired regions of significance. Then use Data →
Export to save a permanent copy in your home directory and edit as desired.

CONTINUOUS SPATIAL DATA ANALYSIS

1. Overview of Spatial Stochastic Processes

The key difference between continuous spatial data and point patterns is that there is
now assumed to be a meaningful value, Y ( s ) , at every location, s , in the region of
interest. For example, Y ( s ) might be the temperature at s or the level of air pollution at
s . We shall consider a number of illustrative examples in the next section. But before
doing so, it is convenient to outline the basic analytical framework to be used throughout
this part of the NOTEBOOK.

If the region of interest is again denoted by R , and if the value, Y ( s ) , at each location,
s  R is treated as a random variable, then the collection of random variables

(1.1)   $\{\, Y(s) : s \in R \,\}$

is designated as a spatial stochastic process on R (also called a random field on R ). It


should be clear from the outset that such (uncountably) infinite collections of random
variables cannot be analyzed in any meaningful way without making a number of strong
assumptions. We shall make these assumptions explicit as we proceed.

Observe next that there is a clear parallel between spatial stochastic processes and
temporal stochastic processes,

(1.2)   $\{\, Y(t) : t \in T \,\}$

where the set, T , is some continuous (possibly unbounded) interval of time. In many
respects, the only substantive difference between (1.1) and (1.2) is the dimension of the
underlying domain. Hence it is not surprising that most of the assumptions and analytical
methods to be employed here have their roots in time series analysis. One key difference
that should be mentioned here is that time is naturally ordered (from “past” to “present”
to “future”) whereas physical space generally has no preferred directions. This will have
a number of important consequences that will be discussed as we proceed.

1.1 Standard Notation

The key to studying infinite collections of random variables such as (1.1) is of course to
take finite samples of $Y(s)$ values, and attempt to draw inferences on the basis of this
information. To do so, we shall employ the following standard notation. For any given set
of sample locations, $\{s_i : i = 1,..,n\} \subset R$ (as in Figure 1.1), let the random vector:

[Fig. 1.1. Sample Locations: points $s_1, s_2,.., s_n, s_{n+1}$ in region R.]


(1.1.1)   $Y \;=\; \begin{pmatrix} Y(s_1) \\ \vdots \\ Y(s_n) \end{pmatrix} \;=\; \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}$

represent the possible list of values that may be observed at these locations. Note that
(following standard matrix conventions) we always take vectors to be column vectors
unless otherwise stated. The second representation in (1.1.1) will usually be used when
the specific locations of these samples are not relevant. Note also that it is often more
convenient to write vectors in transpose form as Y  (Y1 ,.., Yn ) , thus yielding a more
compact in-line representation. Each possible realization,

(1.1.2)   $y \;=\; (y_1,..,y_n)' \;=\; \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}$

of the random vector, Y , then denotes a possible set of specific observations (such as the
temperatures at each location i  1,.., n ).

Most of our analysis will focus on the means and variances of these random variables, as
well as the covariances between them. Again, following standard notation we shall
usually denote the mean of each random variable, Y  si  , by

(1.1.3)   $E[Y(s_i)] \;=\; \mu(s_i) \;=\; \mu_i\,,\quad i = 1,..,n$

so that the corresponding mean vector for Y is given by

(1.1.4)   $E(Y) \;=\; [E(Y_1),..,E(Y_n)]' \;=\; (\mu_1,..,\mu_n)' \;=\; \mu$

Similarly, the variance of random variable, Y  si  , can be denoted in a number of


alternative ways as:

(1.1.5)   $\mathrm{var}(Y_i) \;=\; E[(Y_i - \mu_i)^2] \;=\; \sigma^2(s_i) \;=\; \sigma_i^2 \;=\; \sigma_{ii}$

The last representation facilitates comparison with the covariance of two random
variables, Y  si  and Y  s j  , as defined by

(1.1.6)   $\mathrm{cov}[Y(s_i), Y(s_j)] \;=\; E[(Y_i - \mu_i)(Y_j - \mu_j)] \;=\; \sigma_{ij}$

The full matrix of variances and covariances for the components of Y is then designated
as the covariance matrix for Y , and is written alternatively as


(1.1.7)   $\mathrm{cov}(Y) \;=\; \begin{pmatrix} \mathrm{cov}(Y_1,Y_1) & \cdots & \mathrm{cov}(Y_1,Y_n) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(Y_n,Y_1) & \cdots & \mathrm{cov}(Y_n,Y_n) \end{pmatrix} \;=\; \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_{nn} \end{pmatrix}$

where by definition, $\mathrm{cov}(Y_i, Y_i) = \mathrm{var}(Y_i)$.

As we shall see below, spatial stochastic processes can often be usefully studied in
terms of these first and second moments (means and covariances). This is especially true
for the important case of multivariate normally distributed random vectors that will be
discussed in some detail below. For the present, it suffices to say that much of our effort
to model spatial stochastic processes will focus on the structure of these means and
covariances for finite samples. To do so, it is convenient to start with the following
overall conceptual framework.

1.2 Basic Modeling Framework

Essentially all spatial statistical models that we shall consider start by decomposing the
statistical variation of random variables, $Y(s)$, into a deterministic trend term, $\mu(s)$, and
a stochastic residual term, $\varepsilon(s)$, as follows [see also Cressie (1993, p.113)]:

(1.2.1)   $Y(s) \;=\; \mu(s) + \varepsilon(s)\,,\quad s \in R$

Here $\mu(s)$ is almost always taken to be the mean of $Y(s)$, so that by definition,

(1.2.2)   $\varepsilon(s) \;=\; Y(s) - \mu(s) \;\;\Rightarrow\;\; E[\varepsilon(s)] \;=\; E[Y(s)] - \mu(s) \;=\; 0\,,\quad s \in R$

Expressions (1.2.1) and (1.2.2) together constitute the basic modeling framework to be
used throughout the analyses to follow. It should be emphasized that this framework is
simply a convenient representation of Y ( s ) , and involves no substantive assumptions
whatsoever. But it is nonetheless very useful. In particular, since $\mu(\cdot)$ defines a
deterministic function on $R$, it is often most appropriate to think of $\mu(\cdot)$ as a spatial trend
function representing the typical values of the given spatial stochastic process over all of
$R$, i.e., the global structure of the $Y$-process. Similarly, since $\varepsilon(\cdot)$ is by definition a
spatial stochastic process on $R$ with mean identically zero, it is useful to think of $\varepsilon(\cdot)$ as
a spatial residual process representing local variations about $\mu(\cdot)$, i.e., the local structure
of the $Y$-process.


1.3 Spatial Modeling Strategy

Within this framework, our basic modeling strategy will be to identify a spatial trend
function, $\mu(\cdot)$, that fits the $Y$-process so well that the resulting residual process, $\varepsilon(\cdot)$, is
not statistically distinguishable from “random noise”. However, from a practical
viewpoint, the usual statistical model of such random effects as a collection of
independent random variables, $\{\varepsilon(s) : s \in R\}$, is somewhat too restrictive. In particular,
since most spatial variables tend to exhibit some degree of continuity over space (such as
average temperature or rainfall), one can expect these variables to exhibit similar values
at locations close together in space. Moreover, since spatial residuals  ( s ) by definition
consist of all unobserved spatial variables influencing Y ( s) that are not captured by the
global trend,  ( s) , one can also expect these residuals to exhibit similar values at
locations close together in space. In statistical terms, this means that for locations, s and
v , that are sufficiently close together, the associated residuals  ( s ) and  (v) will tend to
exhibit positive statistical dependence. Thus, in constructing statistical models of spatial
phenomena, it is essential to allow for such dependencies in the spatial residual process,
{ ( s ) : s  R} .

Before proceeding, it is important to emphasize that our basic measure of the degree of
dependency between spatial residuals -- and indeed between any random variables X
and Y -- is in terms of their covariance,

(1.3.1)   $\mathrm{cov}(X,Y) \;=\; E[(X - \mu_X)(Y - \mu_Y)]$

[as in expression (1.1.6) above]. To gain further insight into the meaning of covariance,
observe that if cov( X , Y ) is positive, then by definition, this means that the deviations
X   X and Y  Y are expected to have the same sign (either positive or negative), so
that typical scatter plots of ( x, y ) points will have a positive slope, as shown in the first
panel of Figure 1.2 below.

[Figure 1.2. Covariance Relations: three scatter plots of $(x,y)$ points illustrating
$\mathrm{cov}(X,Y) > 0$ (positive slope), $\mathrm{cov}(X,Y) < 0$ (negative slope), and $\mathrm{cov}(X,Y) = 0$ (no directional tendency).]



Similarly, if $\mathrm{cov}(X,Y)$ is negative, then the deviations $X - \mu_X$ and $Y - \mu_Y$ are expected to


have the opposite signs, so that typical scatter plots will have negative slopes, as in the
middle panel of Figure 1.2. Finally, if cov( X , Y ) is zero, then there is expected to be no
relation between the signs of these deviations, so that typical scatter plots will exhibit no
directional tendencies at all, as in the final panel of Figure 1.2. In particular, positive
dependencies among spatial residuals will thus tend to be reflected by positive covariance
among these residuals.

Given these initial observations, our basic strategy will be to start in Section 3 below by
constructing an appropriate notion of spatially-dependent random effects. While it may
seem strange to begin by focusing on the residual process, $\{\varepsilon(s) : s \in R\}$, which simply
describes “everything left out” of the model of interest, this notion of spatially-dependent
random noise will play a fundamental role in all spatial statistical models to be
developed. In particular, this will form the basis for our construction of covariance
matrices [as in expression (1.1.7) above], which will effectively summarize all spatial
statistical relationships of interest. This will be followed in Section 4 with a development
of a statistical tool for estimating covariance, known as a variogram. This will also
provide a useful graphical device for summarizing spatially-dependent random effects.

Finally in Section 5 we begin by applying these tools to full spatial models as in (1.2.1)
above. In the simplest of these models, it will be assumed that the spatial trend is constant
[i.e.,  ( s )   ] so that (1.2.1) reduces to1

(1.3.2) Y (s)     (s) , s  R

As will be shown, this simple model is useful for stochastic spatial prediction, or kriging.
In Section 6 we then begin to consider models in which the spatial trend  ( s) varies over
space, and in particular, dependents on possible explanatory variables, [ x1 ( s ),..., xk ( s ) ]
associated with each location, s  R .

But before launching into these details, it is useful to begin with a number of motivating
examples which serve to illustrate the types of spatial phenomena that can be modeled.

1
Note that the symbol “$\equiv$” means that $\mu(s)$ is identically equal to $\mu$ for all $s \in R$.


2. Examples of Continuous Spatial Data

As with point patterns, it is useful to consider a number of explicit examples of


continuous spatial data that will serve to motivate the types of analyses to follow. Each of
these examples is a case study in Chapter 5 of [BG], and the data for each example has
been reconstructed in ARCMAP.

2.1 Rainfall in the Sudan

Among the most common examples of continuous spatial data are environmental
variables such as temperature and rainfall, which can in principle be measured at each
location in space. The present example involves rainfall levels in central Sudan during
1942, and can be found in the ARCMAP file, arcview\Projects\Sudan\Sudan.mxd. The
Sudan population in 1942 was largely along the Nile River, as shown in Figure 2.1
below. The largest city, Khartoum, is at the fork of the Nile (White Nile to the west and
Blue Nile to the east). There is also a central band of cities extending to the west.1
Northern Sudan is largely desert with very few population centers. Hence it should be
clear that the information provided by rainfall measurements in the n  31 towns shown
in the Figure will yield a somewhat limited picture of overall rainfall patterns in Sudan.

[Figure 2.1. Rainfall in Sudan: map of the 31 towns along the Nile River (Khartoum at the fork),
with 1942 rainfall (mm) classed as 105-168, 168-272, 272-330, 330-384, 384-503, 503-744.]

This implies that one must be careful in trying to predict rainfall levels outside this band
of cities. For example, suppose that one tries a simple “smoother” like Inverse Distance
Weighting (IDW) in ARCMAP (Spatial Analyst extension) [See Section 5.1 below for
additional examples of “smoothers”] . Here, if the above rainfall data in each city,

1
The population concentrations to the west are partly explained by higher elevations (with cooler climate)
and secondary river systems providing water.


$i = 1,..,n$, is denoted by $y(s_i)$, then the predicted value, $\hat{y}(s)$, at a point, $s \in R$, is given
by a function of the form:

(2.1.1)   $\hat{y}(s) \;=\; \sum_{i=1}^{n(s)} w_i(s)\, y(s_i)$

where n( s ) is some specified number of points in {si : i  1,.., n} that are closest to s , and
where the inverse distance weights have the form,

(2.1.2)   $w_i(s) \;=\; \dfrac{d(s,s_i)^{-\beta}}{\sum_{j=1}^{n(s)} d(s,s_j)^{-\beta}}$

for some exponent, $\beta$ (which is typically either $\beta = 1$ or $\beta = 2$).2 An interpolation of the
rainfall data above is shown in Figure 2.2 below, for the default values, $n(s) = 12$ and
$\beta = 2$, in Spatial Analyst (Interpolate to Raster → Inverse Distance Weighted).3

[Figure 2.2. IDW Interpolation of Rainfall: interpolated surface over the same 31 towns,
with rainfall (mm) classes 105-168, 168-272, 272-330, 330-384, 384-503, 503-744.]

This is an “exact” interpolator in the sense that every data point, si , is assigned exactly
the measured value, $\hat{y}(s_i) = y(s_i)$. But in spite of this, it should be evident that this
interpolation exhibits considerably more variation in rainfall than is actually present. In
particular, one can see that there are small “peaks” around the highest values and small
“pits” around the lowest values. Mathematically, this is a clear example of what is called
“overfitting”, i.e., finding a sufficiently curvilinear surface that it passes exactly through
every data point.
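
For readers who wish to experiment outside ARCMAP, a minimal MATLAB version of the IDW
interpolator in (2.1.1)-(2.1.2) might look as follows; the function name and argument
conventions are illustrative assumptions, not those of the ARCMAP tool.

function yhat = idw(si, yi, s0, k, beta)
% si = (n x 2) data locations, yi = (n x 1) data values, s0 = (m x 2) prediction
% points, k = number of nearest neighbors n(s), beta = distance exponent.
  m = size(s0,1);  yhat = zeros(m,1);
  for g = 1:m
    d = sqrt(sum(bsxfun(@minus, si, s0(g,:)).^2, 2));   % distances to all data points
    [d, ix] = sort(d);                                   % k nearest data points
    d = d(1:k);  ix = ix(1:k);
    if d(1) == 0                                         % exact interpolation at data sites
        yhat(g) = yi(ix(1));
    else
        w = d.^(-beta);  w = w/sum(w);                   % weights as in (2.1.2)
        yhat(g) = w' * yi(ix);                           % weighted average as in (2.1.1)
    end
  end
end

For example, with a grid G of prediction points, yhat = idw(si, yi, G, 12, 2) would mimic
the default settings n(s) = 12 and beta = 2 used above.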

2
See also Johnston et al. (2001, p.114).
3
The results for IDW in the Geostatistical Analyst extension of ARCMAP are essentially identical.


For sake of comparison, a more recent detailed map of rainfall in the same area for the
six-month period from March to August in 2006 is shown in Figure 2.3 below. 4 Since
these are not yearly rainfall totals, the legend is only shown in ordinal terms. Moreover,
while there is a considerable difference in dates, it is not unreasonable to suppose that the
overall pattern of rainfall in 1942 was quite similar to that shown in the figure.

[Figure 2.3. Rainfall Pattern in 2006: map with ordinal legend from None to Highest.]

Here rainfall levels are seen to be qualitatively similar to Figure 2.2 in the sense that
rainfall is heavier in the south than in the north. But it is equally clear that the actual
variation in Figure 2.3 is much smoother than in Figure 2.2. More generally, without
severe changes in elevation (as was seen for the California case in the Example
Assignment) it is natural to expect that variations in rainfall levels will be gradual.

This motivates a very different approach to interpolating the data in Figure 2.1. Rather
than focusing on the specific values at each of these 31 towns, suppose we concentrate on
the spatial trend in rainfall, corresponding to  () in expression (1.2.1) above. Without
further information, one can attempt to fit trends as a simple function of location
coordinates, s  ( s1 , s2 ) . Given the prior knowledge that rainfall trends tend to be smooth,
the most natural specification to start with is the smoothest possible (non-constant)
function, namely a linear function of ( s1 , s2 ) :

(2.1.3)   $Y(s) \;=\; \mu(s) + \varepsilon(s) \;=\; \beta_0 + \beta_1 s_1 + \beta_2 s_2 + \varepsilon(s)$

This can of course be fitted by a linear regression, using the above data $[y(s_i), s_{1i}, s_{2i}]$ for
the $i = 1,..,31$ towns above. This data was imported to JMPIN as Sudan.jmp, and the

4
The source file here is Sudan_Rainfall_map_source.pdf in the class ArcMap directory, Sudan.


1942 rainfall data (R-42) was regressed on the town coordinates (X,Y). The estimates
$(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)$ were then imported to MATLAB in the workspace, sudan.mat. Here a grid,
G, of points covering the Sudan area was constructed using grid_form.m (as in Section
4.8.2 of Part I), and the predicted value, $\hat{y}_g = \hat{\beta}_0 + \hat{\beta}_1 s_{g1} + \hat{\beta}_2 s_{g2}$, at each grid point, $g$,
was calculated. These results were then imported to Sudan.mxd in ARCMAP and were
interpolated using the spline interpolator in Spatial Analyst (Interpolate to Raster →
Spline).5 The results of this procedure are shown in Figure 2.4 below:
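
The trend-fitting step itself can also be carried out directly in MATLAB rather than JMP.
The following sketch assumes that the rainfall values and town coordinates are available as
vectors y, X1, X2, and that G is a two-column matrix of grid-point coordinates; these
variable names are illustrative assumptions only.

n    = length(y);
Xmat = [ones(n,1), X1, X2];                  % design matrix for beta0 + beta1*s1 + beta2*s2
bhat = Xmat \ y;                             % OLS estimates (beta0_hat, beta1_hat, beta2_hat)
res  = y - Xmat*bhat;                        % regression residuals (as in Figure 2.5)
yhat_grid = [ones(size(G,1),1), G] * bhat;   % predicted trend value at each grid point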

[Figure 2.4. Linear Trend Model of Rainfall: fitted linear trend surface over the 31 towns,
with rainfall (mm) classes 105-168, 168-272, 272-330, 330-384, 384-503, 503-744.]

A visual comparison of Figure 2.4 with Figure 2.3 shows that this simple linear trend
model is qualitatively much more in agreement with actual rainfall patterns than the IDW
fit in Figure 2.2.6 The results of this linear regression are shown in Table 2.1 below.

Term         Estimate      Std Error    t Ratio   Prob>|t|
Intercept    12786.213     2031.626      6.29     <.0001
X                7.1438789    5.934012    1.20     0.2387
Y              -81.47974     12.89805    -6.32     <.0001

RSquare                        0.59831
RSquare Adj                    0.569618
Root Mean Square Error         1098.022
Mean of Response               3692.323
Observations (or Sum Wgts)     31

Table 2.1. Linear Regression Results

Notice in particular that the Y-coordinate ($s_2$ above) is very significant while the X-
coordinate ($s_1$ above) is not. This indicates that most rainfall variation is from north

5
See section 5.5 below for further discussion of spline interpolations.
6
It should be emphasized here that we have only used the “default” settings in the IDW interpolator to
make a point about “over fitting”. One can in fact construct more reasonable IDW fits by using the many
options available in the Geostatistical Analyst version of this interpolator.


to south, as is clear from Figures 2.3 and 2.4. However, the adjusted R-square shows that
only about 57% of the variation in rainfall levels is being accounted for by this linear trend
model, so that there is still considerable room for improvement. With additional data
about other key factors (such as elevations) one could of course do much better. But even
without additional information, it is possible to consider more complex specifications of
coordinate functions to obtain a better fit. As stressed above, there is always a danger of
over fitting this data. But if adjusted R-square is used as a guide, then it is possible to
seek better polynomial fits within the context of linear regression. To do so, it is natural
to begin by examining the regression residuals, as shown in Figure 2.5 below.

Figure 2.5. Residual Plot  [Linear_Resids vs. Pred Formula R-42]        Figure 2.6. Residuals vs X  [Linear_Resids vs. X]

While these residuals show nothing out of the ordinary, a plot of the residuals against the
X-coordinate is much more revealing. As seen in Figure 2.6 there appears to be a clear
nonlinearity here, suggesting that perhaps a quadratic specification of X would yield a
better fit than the linear specification in (2.1.3) above. This can also be seen by plotting
the residuals spatially, as in Figure 2.7 below:

Figure 2.7. Plot of Spatial Residuals  [map of positive (red) and negative (blue) residuals, with the heavy linear contour and dashed curved contour discussed below]


If we focus on the heavy linear contour in the figure, then the residuals near the middle of
this line are seen to be negative (blue), indicating that observed rainfall is smaller than
predicted rainfall. Hence, recalling that higher rainfall values are to the south, these
predictions could be reduced by pulling this contour line further south in the middle.
Similarly, since the residuals near both ends of this line tend to be positive (red), a similar
correction could be made by moving the ends north, yielding a curved contour such as the
dashed curve shown in the figure.

Hence this visual analysis of spatial residuals again suggests that a quadratic specification
of the X-coordinate should yield a better fit. Thus, as an alternative model, we now
consider the following quadratic form:7

(2.1.4)   $Y(s) \;=\; \beta_0 + \beta_1 s_1 + \beta_2 s_1^2 + \beta_3 s_2 + \varepsilon(s)$

The results of this quadratic regression are shown in Table 2.2 below, and confirm that
this new specification does indeed yield a significantly better overall fit, with adjusted R-
square showing that an additional 10% of rainfall variation has been accounted for. In
addition, it is clear that both the linear and quadratic terms in X are very significant,
indicating that each is important.8

Term        Estimate       Std Error    t Ratio   Prob>|t|
Intercept   52409.813      11219.48      4.67     <.0001
X            -258.7088        74.56896   -3.47     0.0018
Y             -94.47108       11.41716   -8.27     <.0001
X^2             0.4573417      0.127993   3.57     0.0014

RSquare                     0.727274
RSquare Adj                 0.696971
Root Mean Square Error      921.3522
Mean of Response            3692.323

Table 2.2. Quadratic Regression Results
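As a quick check on the reported fits, note that the adjusted R-square values in Tables 2.1 and 2.2 can be recovered from the ordinary R-square values by the standard adjustment formula (not stated explicitly above), $\bar R^2 = 1 - (1 - R^2)\frac{n-1}{n-k-1}$, where n is the number of observations and k the number of regressors. With n = 31 towns: for Table 2.1 (k = 2), $1 - (1 - 0.59831)(30/28) \approx 0.5696$; and for Table 2.2 (k = 3), $1 - (1 - 0.727274)(30/27) \approx 0.6970$, in agreement with the tabled values.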

By employing exactly the same procedure outlined for the linear regression above, the
results of this regression can be used to predict values on a grid and then interpolated in
ARCMAP (again using a spline interpolator) to yield a plot similar to Figure 2.4 above.
The results of this procedure are shown in Figure 2.8 below. Here a comparison of Figure
2.8 with the more accurate rain map from 2006 in Figure 2.3 shows that in spite of its
mathematical simplicity, this quadratic trend surface gives a fairly reasonable picture of
the overall pattern of rainfall in Sudan.

7
Here one can also start with a general quadratic form including terms for $s_2^2$ and $s_1 s_2$. But this more
general regression shows that neither of these coefficients is significant.
8
It is of interest to notice that over short ranges, the variables X and X^2 are necessarily highly correlated.
So the significance of both adds further confirmation to the appropriateness of this regression.


Figure 2.8. Quadratic Trend Model of Rainfall  [map legend: RAINFALL (mm): 105 - 168, 168 - 272, 272 - 330, 330 - 384, 384 - 503, 503 - 744]

Finally a plot of the spatial residuals for this quadratic model, as in Figure 2.9 below,
shows that much of the structure in the residuals for the linear model in Figure 2.7 has
now been removed.

Figure 2.9. Plot of Quadratic Residuals  [map of spatial residuals for the quadratic model, in the same format as Figure 2.7]


2.2 Spatial Concentration of PCBs near Pontypool in Southern Wales

Among the most toxic industrial soil pollutants are the class of PCBs (polychlorinated
biphenyls). The following data set from [BG] consists of 70 PCB soil measurements from
the area surrounding an industrial site near the town of Pontypool, Wales, in 1991. The
location and PCB levels for these 70 sites can be found in the JMPIN file, Pcbs.jmp. It is
clear from Figure 2.10 below that there is a significant concentration of PCB levels on
the eastern edge of this site. The task here is to characterize the spatial pattern of
variability in these levels surrounding the plant.

Figure 2.10. Spatial PCB Measurements  [map showing the Industrial Site and the sampled PCB Levels: 3 - 6, 6 - 23, 23 - 40, 40 - 58, 58 - 100, 100 - 4620; scale bar: 500 m]

A visual inspection suggests that the concentration falls off with distance from this area
of high concentration. To model this in a simple way, a representative location in this
site, designated as the “Center” in Figure 2.11 below,9 was chosen and distance from this
location to each measurement site was recorded (in the DIST column of Pcbs.jmp). Here
the simplest possible model is to assume that these PCB levels fall off linearly with
distance from this center. A plot of this regression is shown in Figure 2.11 below, and

9
The coordinates of this center location are given by $(x, y) = (330064,\,198822)$.


looks quite “reasonable” in terms of the concentric rings of decreasing PCB levels from
this center point.

Figure 2.11. First Regression Estimate  [map showing the CENTER location and concentric rings of PCB Estimates: 173 - 283, 93 - 173, 31 - 93, -19 - 31, -73 - -19, -125 - -73, -176 - -125, -248 - -176]

However an examination of the regression diagnostics in Figure 2.12 below tells a
different story. Notice in particular that while distance is significant, the R-Square
indicates that less than 6% of the variation in PCB levels is actually accounted for by
such distances.
[Scatterplot of PCB vs. DIST with the fitted regression line; the two extreme outliers discussed below are circled in red]

RSquare        0.059
RSquare Adj    0.045

Term        Estimate    Prob > |t|
Intercept    352.558      0.0116
DIST          -0.3513     0.0414

Figure 2.12. Linear Regression Results



The reason for this is evident from an examination of the scatter plot on the left side of
this figure, which reveals the presence of two dramatic outliers, circled in red. One could
of course remove these outliers and produce a much better linear fit. But an examination
of their distance shows that both are close to the center point in Figure 2.11, and hence
are extremely important data points. So removing them would defeat the whole purpose
of the analysis.

An alternative approach would be to attempt to transform the data to accommodate this
extreme nonlinearity. One possibility would be to take logs of the variables. But even this
is not sufficient in the present case. However a slight modification involving quadratic
functions of logged variables works reasonably well. In particular, if we perform the
following “translog” regression:10

(2.2.1)   $\ln PCB_i \;=\; \beta_0 + \beta_1 \ln DIST_i + \beta_2 (\ln DIST_i)^2 + \varepsilon_i\,,\quad i = 1,..,n$

then we obtain a vastly improved fit as well as more significant coefficients.11 (Note that
the positive coefficient on the quadratic term reflects the slight bowl shape seen in Figure
2.12 above.)
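A minimal MATLAB sketch of this translog fit is given below. The coordinate vectors px, py and the PCB vector pcb are hypothetical placeholders for the columns of Pcbs.jmp described above; the regression itself is just ordinary least squares on the logged variables.

   % Sketch: translog regression (2.2.1) of PCB levels on distance from the center.
   % Assumed inputs: column vectors px, py (site coordinates) and pcb (PCB levels).
   cx = 330064;  cy = 198822;                 % center coordinates (see footnote 9)
   DIST = sqrt((px - cx).^2 + (py - cy).^2);  % distance from the center to each site

   lnP = log(pcb);
   lnD = log(DIST);
   A   = [ones(size(lnD)), lnD, lnD.^2];      % regressors [1, ln DIST, (ln DIST)^2]
   b   = A \ lnP;                             % OLS estimates (b0, b1, b2)'

   res = lnP - A*b;                           % transformed residuals (compare Figure 2.13)
   R2  = 1 - sum(res.^2)/sum((lnP - mean(lnP)).^2);   % unadjusted R-square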

[Plot of the residuals against lnDIST for the translog regression; the two outliers are again circled in red]

RSquare        0.551
RSquare Adj    0.537

Term        Estimate    Prob > |t|
lnDIST        -7.55        0.0004
lnDIST^2       0.53        0.0021

Figure 2.13. Transformed Residuals

Moreover, the two outliers (again shown by red circles in Figure 2.13) have been
dramatically reduced by this data transformation. But while this transformed model of
PCBs seems to capture the spatial distribution in a more reasonable way, we cannot draw
sharp conclusions without an adequate statistical model of the residuals $(\varepsilon_i : i = 1,..,n)$ in
(2.2.1). This is the task to which we now turn.

10
This is closely related to the translog specifications of commodity production functions often used in
economics. See for example http://www.egwald.ca/economics/cesdatatranslog.php.
11
The estimated intercept term has been omitted to save space.


3. Spatially-Dependent Random Effects

Observe that all regressions in the illustrations above [starting with expression (2.1.3) in
the Sudan rainfall example] have relied on an implicit model of unobserved random
effects (i.e., regression residuals) as a collection $(\varepsilon_i : i = 1,..,n)$ of independently and
identically distributed normal random variables [where for our purposes, individual
sample points $i$ are taken to represent different spatial locations, $s_i$]. But recall from the
introductory discussion in Section 1.2 above that for more realistic spatial statistical
models we must allow for possible spatial dependencies among these residuals. Hence
the main objective of the present section is to extend this model to one that is sufficiently
broad to cover the types of spatial dependencies we shall need. To do so, we begin in
Section 3.1 by examining random effects at a single location, and show that normality
can be motivated by the classical Central Limit Theorem. In Section 3.2, these results
will be extended to random effects at multiple locations by applying the Multivariate
Central Limit Theorem to motivate multivariate normality of such joint random effects.
This multi-normal model will form the statistical underpinning for all subsequent
analyses. Finally, in Section 3.3 we introduce the notion of spatial stationarity to model
covariances among these spatial random effects $(\varepsilon_i : i = 1,..,n)$.

3.1 Random Effects at a Single Location

First recall that the unobserved random effects, $\varepsilon_i$, at each location (or sample point), $s_i$,
are assumed to fluctuate around zero, with $E(\varepsilon_i) = 0$. Now imagine that this overall
random effect, $\varepsilon_i$, is composed of many independent factors,

(3.1.1)   $\varepsilon_i \;=\; e_{i1} + e_{i2} + \cdots + e_{im} \;=\; \sum_{k=1}^{m} e_{ik}\,,$

where in typical realizations some of these factors, $e_{ik}$, will be positive and others
negative. Suppose moreover that each individual factor contributes only a very small part
of the total. Then no matter how these individual random factors are distributed, their
cumulative effect, $\varepsilon_i$, must eventually have a “bell shaped” distribution centered around
zero. This can be illustrated by a simple example in which each random component, $e_{ik}$,
assumes the values $1/m$ and $-1/m$ with equal probability, so that $E(e_{ik}) = 0$ for all
$k = 1,..,m$. Then each is distributed as shown for the $m = 1$ case in Figure 3.1(a) below.
Now even though this distribution is clearly flat, if we consider the $m = 2$ case

(3.1.2)   $\varepsilon_i \;=\; e_{i1} + e_{i2}$

then it is seen in Figure 3.1(b) that the distribution is already starting to be “bell shaped”
around zero. In particular the value 0 is much more likely than either of the extremes, -1
and 1. The reason of course is that this value can be achieved in two ways, namely
$(e_{i1} = \tfrac{1}{2},\, e_{i2} = -\tfrac{1}{2})$ and $(e_{i1} = -\tfrac{1}{2},\, e_{i2} = \tfrac{1}{2})$, whereas the extreme values can each occur in


only one way. This simple observation reveals a fundamental fact about sums of
independent random variables: intermediate values of sums can occur in more ways than
extreme values, and hence tend to be more likely. It is this property of independent sums
that gives rise to their “bell shaped” distributions, as can be seen in parts (c) and (d) of
Figure 3.1.
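The histograms in Figure 3.1 are easy to reproduce by simulation. The following MATLAB sketch draws a large number of realizations of $\varepsilon_i = \sum_k e_{ik}$ with $e_{ik} = \pm 1/m$ and plots their empirical distribution; the number of simulated realizations (nsim) is an arbitrary choice.

   % Sketch: simulate cumulative binary errors for a given number of factors m.
   m    = 20;                                 % number of independent factors
   nsim = 10000;                              % number of simulated realizations
   e    = (2*(rand(nsim, m) > .5) - 1) / m;   % each factor is +1/m or -1/m
   eps  = sum(e, 2);                          % cumulative random effects
   histogram(eps, 'Normalization', 'probability')   % bell shape around zero for large m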

Figure 3.1. Cumulative Binary Errors  [histograms of the cumulative sums for (a) m = 1, (b) m = 2, (c) m = 10, (d) m = 20]

But while this basic shape property is easily understood, the truly amazing fact is that the
limiting form of this bell shape always corresponds to essentially the same distribution,
namely the normal distribution. To state this precisely, it is important to notice first that


while the distributions in Figure 3.1 start to become bell shaped, they are also starting to
concentrate around zero. Indeed, the limiting form of this particular distribution must
necessarily be a unit point mass at zero,1 and is certainly not normally distributed. Here it
turns out that the individual values of these factors ($e_{ik} = 1/m$ or $e_{ik} = -1/m$), become
“too small” as m increases, so that eventually even their sum, $\varepsilon_i$, will almost certainly
vanish. At the other extreme, suppose that these values are independent of m, say
($e_{ik} = 1$ or $e_{ik} = -1$). Then while these individual values will eventually become small
relative to their sum, $\varepsilon_i$, the variance of $\varepsilon_i$ itself will increase without bound.2 In a
similar manner, observe that if the common means of these individual factors were not
identically zero, then the limiting mean of $\varepsilon_i$ would also be unbounded.3 So it should be
clear that precise analysis of limiting random sums is rather delicate.

3.1.1 Standardized Random Variables

The time-honored solution to these difficulties is to rescale these random sums in a
manner which ensures that both their mean and variance remain constant as m increases.
To do so, we begin by observing that for any random variable, X, with mean, $\mu = E(X)$,
and variance, $\sigma^2 = \mathrm{var}(X)$, the transformed random variable,

(3.1.3)   $Z \;=\; \dfrac{X - \mu}{\sigma} \;=\; \tfrac{1}{\sigma}(X - \mu)$

necessarily has zero mean since (by the linearity of expectations),

(3.1.4)   $E(Z) \;=\; \tfrac{1}{\sigma}\,[E(X) - \mu] \;=\; \tfrac{1}{\sigma}(\mu - \mu) \;=\; 0$

Moreover, Z also has unit variance, since by (3.1.3),

(3.1.5)   $\mathrm{var}(Z) \;=\; E(Z^2) \;=\; E\!\left[\left(\dfrac{X - \mu}{\sigma}\right)^{\!2}\right] \;=\; \tfrac{1}{\sigma^2}\,E[(X - \mu)^2] \;=\; \dfrac{\sigma^2}{\sigma^2} \;=\; 1$

1
Simply observe that if $x_{ik}$ is a binary random variable with $\Pr(x_{ik} = 1) = .5 = \Pr(x_{ik} = -1)$ then by
definition, $e_{ik} = x_{ik}/m$, so that $\varepsilon_i = (x_{i1} + \cdots + x_{im})/m$ is seen to be the average of m samples from this
binary distribution. But by the Law of Large Numbers, such sample averages must eventually concentrate at
the population mean, $E(x_{ik}) = 0$.
2
In particular, since $\mathrm{var}(e_{ik}) = E(e_{ik}^2) = .5(1)^2 + .5(-1)^2 = 1$ for all k, it would then follow from the
independence of individual factors that $\mathrm{var}(\varepsilon_i) = \sum_{k=1}^{m}\mathrm{var}(e_{ik}) = m\cdot\mathrm{var}(e_{i1}) = m$, and hence that
$\mathrm{var}(\varepsilon_i) \to \infty$ as $m \to \infty$.
3
Since $E(\varepsilon_i) = \sum_{k=1}^{m} E(e_{ik}) = m\,E(e_{i1})$ implies $|E(\varepsilon_i)| = m\,|E(e_{i1})|$, it follows that if $|E(e_{i1})| > 0$ then
$|E(\varepsilon_i)| \to \infty$ as $m \to \infty$.


This fundamental transformation procedure is called the standardization of X. We shall
use this device to study the limits of sums. But more generally, it is important to observe
that if one wants to compare the distributional “shapes” of any two random variables, say,
X and Y, it is much more convenient to compare their standardizations, $Z_X$ and $Z_Y$. Since
these new variables always have the same mean and variance, a comparison of $Z_X$ and
$Z_Y$ thus allows one to focus on qualitative differences in their shape.
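As a simple numerical illustration of this device, the MATLAB sketch below standardizes a deliberately skewed sample; the sample mean and variance of the standardized values are approximately 0 and 1, while the skewed shape itself is unchanged. The particular distribution used is an arbitrary choice.

   % Sketch: standardization recenters and rescales, but does not change shape.
   x = -3*log(rand(10000,1));          % a skewed (exponential-type) sample
   z = (x - mean(x)) / std(x);         % sample analogue of the standardization (3.1.3)
   [mean(z), var(z)]                   % approximately [0, 1]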

In particular, we can in principle use this standardization procedure to study the limiting
distributional shape of any sum of random variables, say

(3.1.6)   $S_m \;=\; X_1 + \cdots + X_m \;=\; \sum_{k=1}^{m} X_k$

As in our example, let us assume for the present that these variables are independently
and identically distributed (iid), with common mean, $\mu$, and variance, $\sigma^2$ [so that
$(X_1,..,X_m)$ can be viewed as a random sample of size m from some common
distribution]. Then the mean and variance of $S_m$ are given respectively by

(3.1.7)   $E(S_m) \;=\; \sum_{k=1}^{m} E(X_k) \;=\; \sum_{k=1}^{m}\mu \;=\; m\,\mu$

(3.1.8)   $\mathrm{var}(S_m) \;=\; \sum_{k=1}^{m}\mathrm{var}(X_k) \;=\; \sum_{k=1}^{m}\sigma^2 \;=\; m\,\sigma^2$

So as above, we may construct the associated standardized sum,

(3.1.9)   $Z_m \;=\; \dfrac{S_m - E(S_m)}{\sqrt{\mathrm{var}(S_m)}} \;=\; \dfrac{S_m - m\mu}{\sqrt{m\,\sigma^2}}$

which by definition implies that $E(Z_m) = 0$ and $\mathrm{var}(Z_m) = 1$ for all m. The key property
of these standardized sums is that for large m they are approximately normally
distributed.

3.1.2 Normal Distribution

To state this precisely, we must first define the normal distribution. A random variable, X,
with mean $\mu$ and variance $\sigma^2$ is said to be normally distributed, written, $X \sim N(\mu, \sigma^2)$,
if and only if X has probability density given by

(3.1.10)   $f(x) \;=\; \dfrac{1}{\sqrt{2\pi\sigma^2}}\; e^{-\frac{(x-\mu)^2}{2\sigma^2}} \;=\; \dfrac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$

[sketch of the bell-shaped density $f(x)$ plotted against x]


[where the first version shows $f(x)$ as an explicit function of $(\mu, \sigma^2)$ and the second
shows the more standard version of $f(x)$ in terms of $(\mu, \sigma)$]. This is the classical “bell-
shaped” curve, centered on the mean, $\mu$, as sketched above. A key property of normal
random variables (that we shall make use of many times) is that any linear function of a
normal random variable is also normally distributed. In particular, since the
standardization procedure in (3.1.3) is seen to be a linear function, it follows that the
standardization, Z, of any normal random variable must be normally distributed with
mean, $E(Z) = 0$, variance, $\mathrm{var}(Z) = 1$, and with density

(3.1.11)   $\phi(z) \;=\; \dfrac{1}{\sqrt{2\pi}}\,\exp\!\left(-\dfrac{z^2}{2}\right)$

[sketch of the standard normal density $\phi(z)$, centered at 0]

For obvious reasons, this is called the standard normal distribution (or density), and is
generally denoted by $\phi$. The importance of this particular distribution is that all
probability questions about normal random variables can be essentially answered by
standardizing them and applying the standard normal distribution (so that all normal
tables are based entirely on this standardized form).

Next, if the cumulative distribution function (cdf) of any random variable, X, is denoted
for all values, x, by $F(x) = \mathrm{Prob}(X \le x)$, then for any standard normal random variable,
$Z \sim N(0,1)$, the cdf of Z is denoted by

(3.1.12)   $\Phi(z) \;=\; \mathrm{Prob}(Z \le z) \;=\; \int_{-\infty}^{z}\phi(v)\,dv$

Again $\Phi$ is usually reserved for this important cdf (that forms the basis of all normal
tables).

3.1.3 Central Limit Theorems

With these preliminaries, we can now give a precise statement of the limiting normal
property of standardized sums stated above. To do so, it is important to note first that the
distribution of any random variable is completely defined by its cdf. [For example, in the
standard normal case above it should be clear that the standard normal density, $\phi$, is
recovered by simply differentiating $\Phi$.] Hence, letting the cdf of the standardized sum,
$Z_m$, in (3.1.9) be denoted by $F_{Z_m}$, we now have the following classical form of the
Central Limit Theorem (CLT):

Central Limit Theorem (Classical). For any sequence of iid random variables
$(X_1,..,X_m)$ with standardized sum, $Z_m$, in (3.1.9),

(3.1.13)   $\lim_{m \to \infty} F_{Z_m}(z) \;=\; \Phi(z)$   for all z.


In other words, the cdf of iid standardized sums, $Z_m$, converges to the cdf of the standard
normal distribution. The advantage of this cdf formulation is that one obtains an exact
limit result. But in practical terms, the implication of the CLT is that for “sufficiently
large” m, the distribution of such standardized sums is approximately normally
distributed.4 Even more to the point, since (3.1.3) implies that iid sums, $S_m$, are linear
functions of their standardizations, $Z_m$, and since linear functions of normal random
variables are again normal, it may also be concluded that these sums are approximately
normal. If for convenience we now use the notation, $X \sim_d N(\mu, \sigma^2)$, to indicate that a
random variable X is approximately normally distributed with mean, $\mu$, and variance, $\sigma^2$,
and if we recall from (3.1.7) and (3.1.8) that the mean and variance of $S_m$ are given by
$m\mu$ and $m\sigma^2$, respectively, then we have the following more useful form of the CLT:

Central Limit Theorem (Practical). For all sums, $S_m$, of iid random variables
with m sufficiently large,

(3.1.14)   $S_m \;\sim_d\; N(m\mu,\; m\sigma^2)$

This result can in principle be used to motivate the fundamental normality assumption
about random effects, $\varepsilon_i$. In particular, if $\varepsilon_i$ is a sum of iid random components as in
(3.1.1), with zero means, then by (3.1.14) it follows that $\varepsilon_i$ will also be approximately
normal with zero mean for sufficiently large m.
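The practical CLT in (3.1.14) can be illustrated by simulation. The MATLAB sketch below sums m iid uniform random variables (which are far from normal individually), standardizes the sums as in (3.1.9), and compares their histogram with the standard normal density; all numerical settings here are arbitrary choices.

   % Sketch: sums of iid (non-normal) variables become approximately normal.
   m    = 30;                          % number of terms in each sum
   nsim = 10000;                       % number of simulated sums
   X    = rand(nsim, m);               % iid uniform(0,1) variables: mu = 1/2, sig2 = 1/12
   S    = sum(X, 2);                   % the sums S_m
   Z    = (S - m*0.5) ./ sqrt(m/12);   % standardized sums Z_m as in (3.1.9)

   histogram(Z, 'Normalization', 'pdf')
   hold on
   z = linspace(-4, 4, 200);
   plot(z, exp(-z.^2/2)/sqrt(2*pi))    % standard normal density (3.1.11) for comparison
   hold off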

However, it should be emphasized here that in practical examples (such as the one
discussed in Section 3.2 below) the individual components, $e_{ik}$, of $\varepsilon_i$ may not be fully
independent, and are of course not likely to be identically distributed. Hence it is
important to emphasize that the CLT is actually much more general than the classical
assertion above for iid random variables. While such generalizations require conditions
that are too technical to even be stated in a precise manner here,5 it is nonetheless useful
to give a very rough statement of the general version as follows:6

4
Recall from footnote 5 in Section 3.2.2 of Part I that “sufficiently large” is usually taken to mean $m \ge 30$,
as long as the common distribution of the underlying random variables $(X_k)$ in (3.1.6) is not “too
skewed”.
5
For further details about such generalizations, an excellent place to start is the Wikipedia discussion of the
CLT at http://en.wikipedia.org/wiki/Central_limit_theorem.
6
The following version of the Central Limit Theorem (and the multivariate version of this theorem in
Section 3.2.3 below) is based on Theorem 8.11 in Breiman (1969). The advantage of the present version is
that it directly extends the “iid” conditions of the classical CLT.


Central Limit Theorem (General). For any sum, $S_m = X_1 + \cdots + X_m$, of random
variables with means, $\mu_1,..,\mu_m$, and variances, $\sigma_1^2,..,\sigma_m^2$, if (i) the distributions
of these random variables are “not too different”, and (ii) the dependencies
among these random variables are “not too strong”, then for sufficiently large m,
the distribution of $S_m$ is approximately normal, i.e.,

(3.1.15)   $S_m \;\sim_d\; N(\mu, \sigma^2)$

with $\mu = \mu_1 + \cdots + \mu_m$ and $\sigma^2 = \sigma_1^2 + \cdots + \sigma_m^2$.

So for random effects, $\varepsilon_i = e_{i1} + \cdots + e_{im}$, with total variance, $\sigma^2 = \sigma_1^2 + \cdots + \sigma_m^2$, it follows
that as long as conditions (i) and (ii) are reasonable and m is sufficiently large, random
effects, $\varepsilon_i$, will be approximately normally distributed as

(3.1.16)   $\varepsilon_i \;\sim_d\; N(0, \sigma^2)$

3.1.4 CLT for the Sample Mean

While the main application of the CLT for our present purposes is to motivate the
normality assumption about residuals in a host of statistical models (including linear
regression), it is important to add that perhaps the single most important application of
the CLT is for inference about population means. In particular, if one draws an iid random
sample, $(X_1,..,X_m)$, from a population with unknown mean, $\mu$, and constructs the
associated sample mean:

(3.1.17)   $\bar{X}_m \;=\; \tfrac{1}{m}\sum_{k=1}^{m} X_k \;=\; \tfrac{1}{m}\, S_m\,,$

then by (3.1.7) the identity,

(3.1.18)   $E(\bar{X}_m) \;=\; \tfrac{1}{m}\, E(S_m) \;=\; \tfrac{1}{m}(m\,\mu) \;=\; \mu$

implies that $\bar{X}_m$ is the natural unbiased estimator of $\mu$. Moreover, by (3.1.8), the
second identity,

(3.1.19)   $\mathrm{var}(\bar{X}_m) \;=\; \tfrac{1}{m^2}\,\mathrm{var}(S_m) \;=\; \tfrac{1}{m^2}(m\,\sigma^2) \;=\; \sigma^2/m$

implies that for large m this estimate has a small variance, and hence should be close to
$\mu$ (which is of course precisely the Law of Large Numbers). But one can say even more
by the CLT. To do so, note first that the standardized sample mean,


X m  E( X m ) X 
(3.1.20) Z Xm   m
 (Xm) 2 /m

can equivalently be written as

1
Sm   Sm  m  Sm  m 
(3.1.21) Z Xm  m
   Zm
 /m
2
m  /m 2
m 2

and hence satisfies exactly the same limiting properties as the sample sum. In particular
this yields the follows version of the practical CLT in (3.1.14) above for sample means:

Central Limit Theorem (Sample Means). For sufficiently large iid random
samples, ( X 1 ,.., X m ) , from any given statistical population with mean,  ,
and variance,  2 , the sample mean, X m , is approximately normal, i.e.,

(3.1.22) X m  d N (  ,  2 / m)

Note in particular that random samples from the same population are by definition
identically distributed. So as long as they are also independent, this sample-mean version
of the CLT is always applicable. But the Clark-Evans test in Section 3.2.2 of Part I provides a classic example
applicable. But the Clark-Evans test in Section 3.2.2 of Part I provides a classic example
where this latter assumption may fail to hold. More generally, the types of dependencies
inherent in spatial (or temporal) data require more careful analysis when applying the
CLT to sample means.
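The same simulation idea used above illustrates the sample-mean version (3.1.22): as m grows, the sample means concentrate around $\mu$ with variance shrinking like $\sigma^2/m$. A minimal MATLAB sketch (again with arbitrary settings) follows.

   % Sketch: the variance of the sample mean falls off as sig2/m.
   nsim = 10000;
   for m = [10, 40, 160]
       Xbar = mean(rand(nsim, m), 2);          % sample means of m iid uniform(0,1) variables
       fprintf('m = %4d:  var(Xbar) = %.5f,  sig2/m = %.5f\n', ...
               m, var(Xbar), (1/12)/m);
   end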

3.2 Multi-Location Random Effects

Given the above results for random effects, $\varepsilon_i$, at individual locations, $s_i$, we now
consider the vector, $\varepsilon$, of such random effects for a given set of sample locations,
$\{s_i : i = 1,..,n\} \subset R$, i.e.,

(3.2.1)   $\varepsilon \;=\; (\varepsilon_i : i = 1,..,n) \;=\; [\varepsilon(s_i) : i = 1,..,n]$

As a parallel to (3.1.1) we again assume that these random effects are the cumulative sum
of independent factors,

(3.2.2)   $\varepsilon \;=\; e_1 + e_2 + \cdots + e_m \;=\; \sum_{k=1}^{m} e_k$

where by definition each independent factor, $e_k$, is itself a random vector over sample
locations, i.e.,

(3.2.3)   $e_k \;=\; (e_{ik} : i = 1,..,n) \;=\; [e_k(s_i) : i = 1,..,n]$


As one illustration, recall the California rainfall example in which annual precipitation,
$Y_i$, at each of the $n = 30$ sample locations in California was assumed to depend on four
explanatory variables ($x_{i1}$ = “altitude”, $x_{i2}$ = “latitude”, $x_{i3}$ = “distance to coast”, and
$x_{i4}$ = “rain shadow”), as follows

(3.2.4)   $Y_i \;=\; \beta_0 + \sum_{j=1}^{4}\beta_j x_{ij} + \varepsilon_i\,,\quad i = 1,..,n$

Here the unobserved residuals, $\varepsilon_i$, are the random effects we wish to model. If we write
(3.2.4) in vector form as

(3.2.5)   $Y \;=\; \beta_0 1_n + \sum_{j=1}^{4}\beta_j x_j + \varepsilon$

[where $1_n = (1,..,1)'$ is the unit column vector], then the residual vector, $\varepsilon$, in (3.2.5) is an
instance of (3.2.1) with $n = 30$. This random vector by definition contains all factors
influencing precipitation other than the four “main” effects posited above. So the key
assumption in (3.2.2) is that the influence of each unobserved factor is only a small
additive part of the total residual effect, $\varepsilon$, not accounted for by the four main effects
above.

For example, the first factor, $e_1$, might be a “cloud cover” effect. More specifically, the
unobserved value, $e_{1i} = e_1(s_i)$, at each location, $s_i$, might represent fluctuations in cloud
cover at $s_i$ [where higher (lower) levels of cloud cover tend to contribute positively
(negatively) to precipitation at $s_i$]. Similarly, factor $e_2$ might be an “atmospheric
pressure” effect, where $e_{2i} = e_2(s_i)$ now represents fluctuations in barometric pressure
levels at $s_i$ [and where in this case higher (lower) pressure levels tend to contribute
negatively (positively) to precipitation levels].

The key point to observe is that while fluctuations in factors like cloud cover or
atmospheric pressure will surely exhibit strong spatial dependencies, the dependency
between these factors at any given location is much weaker. In the present instance, while
there may indeed be some degree of negative relation between fluctuations in pressure
and cloudiness, $(e_{1i}, e_{2i})$, at any given location, $s_i$, this tends to be much weaker than the
positive relations between either fluctuations in cloud cover, $(e_{1i}, e_{1j})$, or atmospheric
pressure, $(e_{2i}, e_{2j})$, at locations, $s_i$ and $s_j$, that are in close proximity. Hence while the
random vectors, $e_1$ and $e_2$, can each exhibit strong internal spatial dependencies, it is not
unreasonable to treat them as mutually independent. More generally, as a parallel to
Section 3.1.3 above, it will turn out that if (i) the individual distributions of the random
component vectors, $e_1,..,e_m$, in (3.2.2) are not “too different”, and (ii) the statistical
dependencies between these components are not “too strong”, then their sum, $\varepsilon$, will be
approximately “normal” for m sufficiently large.


But in order to make sense of this statement, we must first extend the normal distribution
in (3.1.10) to its multivariate version. This is done in the next section, where we also
develop its corresponding invariance property under linear transformations. This will be
followed by a development of the multivariate version of the Central Limit Theorem that
underscores the importance of this distribution.

3.2.1 Multivariate Normal Distribution

To motivate the multivariate normal (or multi-normal) distribution, observe that there is
one case in which we can determine the joint distribution of a random vector,
$X = (X_1,..,X_n)'$, in terms of the marginal distributions of its components, $X_1,..,X_n$,
namely when these components are independently distributed. In particular, suppose that
each $X_i$ is independently normally distributed as in (3.1.10) with density

(3.2.6)   $f_i(x_i) \;=\; \dfrac{1}{\sqrt{2\pi\sigma_i^2}}\; e^{-\frac{(x_i-\mu_i)^2}{2\sigma_i^2}}\,,\quad i = 1,..,n$

Then letting $\sigma_{ii} = \sigma_i^2$ and using the exponent notation, $\sqrt{a} = a^{1/2}$, it follows that the joint
density, $f(x_1,..,x_n)$, of X is given by the product of these marginals, i.e.,

(3.2.7)   $f(x_1,..,x_n) \;=\; f_1(x_1)\, f_2(x_2)\cdots f_n(x_n)$

$\;=\; \dfrac{1}{\sqrt{2\pi\sigma_{11}}}\,e^{-\frac{(x_1-\mu_1)^2}{2\sigma_{11}}}\cdot \dfrac{1}{\sqrt{2\pi\sigma_{22}}}\,e^{-\frac{(x_2-\mu_2)^2}{2\sigma_{22}}}\cdots \dfrac{1}{\sqrt{2\pi\sigma_{nn}}}\,e^{-\frac{(x_n-\mu_n)^2}{2\sigma_{nn}}}$

$\;=\; (2\pi)^{-n/2}(\sigma_{11}\sigma_{22}\cdots\sigma_{nn})^{-1/2}\; e^{-\frac{1}{2}\left[\frac{(x_1-\mu_1)^2}{\sigma_{11}} + \cdots + \frac{(x_n-\mu_n)^2}{\sigma_{nn}}\right]}$

where the last line uses the identity, $(e^{a_1})(e^{a_2})\cdots(e^{a_n}) = e^{a_1 + a_2 + \cdots + a_n}$. To write this in matrix
form, observe first that if $x = (x_1,..,x_n)'$ now denotes a typical realization of random
vector, $X = (X_1,..,X_n)'$, then by (3.2.6) the associated mean vector of X is given by
$\mu = (\mu_1,..,\mu_n)'$ [as in expression (1.1.4)]. Moreover, since independence implies that
$\mathrm{cov}(X_i, X_j) = \sigma_{ij} = 0$ for $i \neq j$, it follows that the covariance matrix of X now takes the
form [as in expression (1.1.7)],


  11 
 
  22 
(3.2.7) cov( X )   
  
 
  nn 

But since the inverse of a diagonal matrix is simply the diagonal matrix of inverse values,

  111 
 
1   22
1

(3.2.8)  
  
 
  nn1 

it follows that

(3.2.9)   $(x-\mu)'\,\Sigma^{-1}(x-\mu) \;=\; (x_1-\mu_1,\, x_2-\mu_2,..,\, x_n-\mu_n)\begin{pmatrix} \sigma_{11}^{-1} & & & \\ & \sigma_{22}^{-1} & & \\ & & \ddots & \\ & & & \sigma_{nn}^{-1} \end{pmatrix}\begin{pmatrix} x_1-\mu_1 \\ x_2-\mu_2 \\ \vdots \\ x_n-\mu_n \end{pmatrix}$

$\;=\; (x_1-\mu_1,\, x_2-\mu_2,..,\, x_n-\mu_n)\begin{pmatrix} (x_1-\mu_1)/\sigma_{11} \\ (x_2-\mu_2)/\sigma_{22} \\ \vdots \\ (x_n-\mu_n)/\sigma_{nn} \end{pmatrix}$

$\;=\; \dfrac{(x_1-\mu_1)^2}{\sigma_{11}} + \dfrac{(x_2-\mu_2)^2}{\sigma_{22}} + \cdots + \dfrac{(x_n-\mu_n)^2}{\sigma_{nn}}$

which is precisely the exponent sum in (3.2.7). Finally, since the determinant, $|\Sigma|$, of a
diagonal matrix, $\Sigma$, is simply the product of its diagonal elements, i.e.,

(3.2.10)   $|\Sigma| \;=\; \sigma_{11}\,\sigma_{22}\cdots\sigma_{nn}\,,$

we see from (3.2.9) and (3.2.10) that (3.2.7) can be rewritten in matrix form as

(3.2.11)   $f(x) \;=\; (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\; e^{-\frac{1}{2}(x-\mu)'\Sigma^{-1}(x-\mu)}$


This is in fact an instance of the multi-normal density (or multivariate normal density).
More generally, a random vector, $X = (X_1,..,X_n)'$, with associated mean vector,
$\mu = (\mu_1,..,\mu_n)'$, and covariance matrix, $\Sigma = (\sigma_{ij} : i,j = 1,..,n)$, is said to be multi-normally
distributed if and only if its joint density is of the form (3.2.11) for this choice of $\mu$ and
$\Sigma$. As a generalization of the univariate case, this is denoted symbolically by
$X \sim N(\mu, \Sigma)$.
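For the independent (diagonal-$\Sigma$) case used in the derivation above, the matrix form (3.2.11) can be checked directly against the product of univariate densities in (3.2.7). The MATLAB sketch below does this for an arbitrary three-dimensional example; only base matrix operations are used, and all numerical values are arbitrary choices.

   % Sketch: for diagonal Sigma, (3.2.11) equals the product of univariate normal densities.
   mu    = [1; -2; 0.5];
   sig2  = [0.5; 2; 1.3];                 % variances sigma_ii
   Sigma = diag(sig2);
   x     = [0.7; -1.1; 1.4];              % an arbitrary evaluation point
   n     = length(x);

   % Matrix form of the multi-normal density (3.2.11):
   q  = (x - mu)' / Sigma * (x - mu);     % quadratic form (x-mu)' Sigma^{-1} (x-mu)
   f1 = (2*pi)^(-n/2) * det(Sigma)^(-1/2) * exp(-q/2);

   % Product of the univariate densities as in (3.2.6)-(3.2.7):
   f2 = prod(exp(-(x - mu).^2 ./ (2*sig2)) ./ sqrt(2*pi*sig2));

   [f1, f2]                               % the two values agree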

While it is not possible to visualize this distribution in high dimensions, we can gain
some insight by focusing on the 2-dimensional case, known as the bi-normal (or bivariate
normal) distribution. If $X = (X_1, X_2)'$ is bi-normally distributed with mean vector,
$\mu = (\mu_1, \mu_2)'$, and covariance matrix,

(3.2.12)   $\Sigma \;=\; \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{pmatrix}$

then the basic shape of the density function in (3.2.11) is largely determined by the
correlation between $X_1$ and $X_2$, i.e., by

(3.2.13)   $\rho(X_1, X_2) \;=\; \dfrac{\mathrm{cov}(X_1, X_2)}{\sigma(X_1)\,\sigma(X_2)} \;=\; \dfrac{\sigma_{12}}{\sqrt{\sigma_{11}\,\sigma_{22}}}$

This is most easily illustrated by setting $\mu_1 = \mu_2 = 0$ and $\sigma_{11} = \sigma_{22} = 1$, so that the only
parameter of this distribution is the covariance, $\sigma_{12}$, which in this case is seen from (3.2.13)
to be precisely the correlation, $\rho$, between $X_1$ and $X_2$. The independence case
$(\rho = 0)$ is shown in Figure 3.2 below, which is simply a 2-dimensional version of the
standard normal distribution in (3.1.11) above. Indeed both of its marginal distributions
are identical with (3.1.11). Figure 3.3 depicts a case with extreme positive correlation
$(\rho = .8)$ to emphasize the role of correlation in shaping this distribution. In particular,
this high correlation implies that value pairs $(x_1, x_2)$ that are similar in magnitude (close
to the $45^\circ$ line) are more likely to occur, and hence have higher probability density. Thus
the density is more concentrated along the $45^\circ$ line, as shown in the figure.

These properties persist in higher dimensions as well. In particular, the “bell-shaped”
concentration of density around the origin continues to hold in higher dimensions, and is
more elongated in those directions where correlations between components are more
extreme.
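The elongation along the $45^\circ$ line in Figure 3.3 can be reproduced by simulating correlated bi-normal pairs. The MATLAB sketch below generates such pairs from independent standard normals using a Cholesky factor of $\Sigma$ (a standard device, not one introduced in the text); the correlation $\rho = .8$ matches the figure, and the sample size is arbitrary.

   % Sketch: simulate bi-normal pairs with zero means, unit variances and correlation rho.
   rho   = 0.8;
   Sigma = [1 rho; rho 1];
   L     = chol(Sigma, 'lower');      % Cholesky factor: Sigma = L*L'
   Z     = randn(2, 5000);            % independent standard normal pairs
   X     = L * Z;                     % correlated pairs: cov(X) is approximately Sigma
   plot(X(1,:), X(2,:), '.')          % scatter concentrates along the 45-degree line
   axis equal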


Figure 3.2. Bi-normal Distribution (ρ = 0)        Figure 3.3. Bi-normal Distribution (ρ = .8)  [density surfaces plotted over the $(x_1, x_2)$ plane]

3.2.2 Linear Invariance Property

For purposes of analysis, the single most useful feature of this distribution is that all
linear transformations of multi-normal random vectors are again multi-normal. To state
this precisely, we begin by calculating the mean and covariance matrix for general linear
transformations of random vectors. Given a random vector, $X = (X_1,..,X_n)'$, with mean
vector, $E(X) = \mu = (\mu_1,..,\mu_n)'$, and covariance matrix, $\mathrm{cov}(X) = \Sigma$, together with any
compatible $(m \times n)$ matrix, $A = (a_{ij} : i = 1,..,m,\; j = 1,..,n)$, and m-vector, $b = (b_1,..,b_m)'$, of
coefficients, consider the linear transformation of X defined by

(3.2.14)   $Y \;=\; AX + b$

Following standard conventions, if $m = 1$ then the $(1 \times n)$ matrix, A, is usually written as
the transpose of an n-vector, $a = (a_1,..,a_n)'$, so that (3.2.14) takes the form,

(3.2.15)   $Y \;=\; a'X + b$

where b is a scalar. If $b = 0$ then the random variable, $Y = a'X$, is called a linear
compound of X. For example, each component of X can be identified by such a linear
compound as follows. If the columns of the n-square identity matrix, $I_n$, are denoted by


1   1   0   0  
 1   0   1     
(3.2.16) In       ,   ,...,     [ e , e ,..., e ]
          0   1 2 n

        
 1  0   0   1  

then by setting a  ei and b  0 in (3.2.15), we see that

(3.2.17) X i  ei X , i  1,.., n

So linear transformations provide a very flexible tool for analyzing random vectors.

Next recall from the linearity of expectations that by taking expectations in (3.2.14) we
obtain

(3.2.18)   $E(Y) \;=\; E(AX + b) \;=\; A\,E(X) + b \;=\; A\mu + b$

By using this result, we can obtain the covariance matrix for Y as follows. First note that
by definition the expected value of a matrix of random variables is simply the matrix of
their expectations, i.e.,

(3.2.19)   $E\begin{pmatrix} Z_{11} & \cdots & Z_{1n}\\ \vdots & & \vdots \\ Z_{m1} & \cdots & Z_{mn}\end{pmatrix} \;=\; \begin{pmatrix} E(Z_{11}) & \cdots & E(Z_{1n})\\ \vdots & & \vdots \\ E(Z_{m1}) & \cdots & E(Z_{mn})\end{pmatrix}$

So the definition of $\mathrm{cov}(Y)$ in (1.1.7) can equivalently be written in matrix terms as

(3.2.20)   $\mathrm{cov}(Y) \;=\; \begin{pmatrix} E[(Y_1-\mu_1)(Y_1-\mu_1)] & \cdots & E[(Y_1-\mu_1)(Y_n-\mu_n)]\\ \vdots & & \vdots\\ E[(Y_n-\mu_n)(Y_1-\mu_1)] & \cdots & E[(Y_n-\mu_n)(Y_n-\mu_n)]\end{pmatrix}$

$\;=\; E\begin{pmatrix} (Y_1-\mu_1)(Y_1-\mu_1) & \cdots & (Y_1-\mu_1)(Y_n-\mu_n)\\ \vdots & & \vdots\\ (Y_n-\mu_n)(Y_1-\mu_1) & \cdots & (Y_n-\mu_n)(Y_n-\mu_n)\end{pmatrix}$

$\;=\; E\left[\begin{pmatrix} Y_1-\mu_1\\ \vdots\\ Y_n-\mu_n\end{pmatrix}(Y_1-\mu_1,\ldots,Y_n-\mu_n)\right]$

$\;=\; E[(Y-\mu)(Y-\mu)']$


By applying this to (3.2.14) we obtain the following very useful result:

(3.2.21)   $\mathrm{cov}(Y) \;=\; E[(Y-\mu)(Y-\mu)']$

$\;=\; E\{([AX+b] - [A\mu+b])([AX+b] - [A\mu+b])'\}$

$\;=\; E[(AX - A\mu)(AX - A\mu)']$

$\;=\; E[A(X-\mu)(X-\mu)'A']$

$\;=\; A\,E[(X-\mu)(X-\mu)']\,A'$

$\;=\; A\,\mathrm{cov}(X)\,A'$

$\;\Rightarrow\; \mathrm{cov}(AX + b) \;=\; A\,\Sigma\,A'$

So both the mean and covariance matrix of $AX + b$ are directly obtainable from those of
X. We shall use these properties many times in analyzing the multivariate spatial models
of subsequent sections.
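The two moment identities (3.2.18) and (3.2.21) are easy to verify numerically, and doing so also previews the Linear Invariance Theorem below. The following MATLAB sketch draws a large multi-normal sample (again via a Cholesky factor), applies a linear transformation $Y = AX + b$, and compares the sample mean and covariance of Y with $A\mu + b$ and $A\Sigma A'$; all choices of $\mu$, $\Sigma$, A, and b here are arbitrary.

   % Sketch: check E(AX + b) = A*mu + b and cov(AX + b) = A*Sigma*A' by simulation.
   mu    = [1; 2; 3];
   Sigma = [2 .5 0; .5 1 .3; 0 .3 1.5];               % an arbitrary covariance matrix
   A     = [1 -1 0; 2 0 1];                           % a (2 x 3) matrix of full row rank
   b     = [0.5; -1];

   nsim  = 100000;
   Z     = randn(3, nsim);                            % iid standard normal columns
   X     = chol(Sigma,'lower')*Z + repmat(mu,1,nsim); % multi-normal draws N(mu, Sigma)
   Y     = A*X + repmat(b,1,nsim);                    % transformed vectors Y = AX + b

   [mean(Y,2), A*mu + b]                              % sample mean vs. A*mu + b
   cov(Y'), A*Sigma*A'                                % sample covariance vs. A*Sigma*A'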

But for the moment, the key feature of these results is that the distribution of any linear
transformation, $AX + b$, of a multi-normal random vector, $X \sim N(\mu, \Sigma)$, is obtained by
simply replacing the mean and covariance matrix of X in (3.2.11) with those of $AX + b$.
The only requirement here is that the resulting covariance matrix, $A\Sigma A'$, be nonsingular
so that the inverse covariance matrix, $(A\Sigma A')^{-1}$, in (3.2.11) exists. This in turn is
equivalent to the condition that the rows of A be linearly independent vectors, so that A
is said to be of full row rank. With this stipulation, we have the following result
[established in Section A3.2.3 of the Appendix to Part III in this NOTEBOOK]:7

Linear Invariance Theorem. For any multi-normal random vector,
$X \sim N(\mu, \Sigma)$, and linear transformation, $Y = AX + b$, of X with A
of full row rank, Y is also multi-normally distributed as

(3.2.22)   $Y \;\sim\; N(A\mu + b,\; A\Sigma A')$

What this means in practical terms is that if a given random vector, X, is known (or
assumed) to be multi-normally distributed as $X \sim N(\mu, \Sigma)$, then we can immediately
write down the exact distribution of essentially any linear function, $AX + b$, of X.

3.2.3 Multivariate Central Limit Theorem

We are now ready to consider multivariate extensions of the univariate central limit
theorems above. Our objective here is to develop only those aspects of the multivariate

7
For an alternative development of this important result, see for example Theorem 2.4.4 in Anderson
(1958).


case that are relevant for our present purposes. The first objective is to show that the
multivariate case relates to the univariate case in a remarkably simple way. To do so,
recall first from (3.2.17) above that for any random vector, $X = (X_1,..,X_n)'$, each of its
components, $X_i$, can be represented as a linear transformation, $X_i = e_i'X$, of X. So each
marginal distribution of X is automatically the distribution of this linear compound.
More generally, each linear compound, $a'X$, can be said to define a generalized
marginal distribution of X.8 Now while the marginal distributions of X only determine
its joint distribution in the case of independence [as in (3.2.7) above], it turns out that the
joint distribution of X is always completely determined by its generalized marginal
distributions.9 To appreciate the power of this result, recall from the Linear Invariance
Theorem above that if X is multi-normal with mean vector, $\mu$, and covariance matrix,
$\Sigma$, then all of its linear compounds, $a'X$, are automatically univariate normally
distributed with means, $a'\mu$, and variances, $a'\Sigma a$. But since these marginals in turn
uniquely determine the distribution of X, it must necessarily be multi-normal. Thus we
are led to the following fundamental correspondence:

Univariate-Multivariate Correspondence. A random vector, X, with
mean vector, $\mu$, and covariance matrix, $\Sigma$, is multi-normally distributed as

(3.2.23)   $X \;\sim\; N(\mu, \Sigma)$

if and only if every linear compound, $a'X$, is univariate normal, i.e.,

(3.2.24)   $a'X \;\sim\; N(a'\mu,\; a'\Sigma a)$

In view of this correspondence, it is not surprising that there is an intimate relation
between univariate and multivariate central limit theorems. In particular, if any of the
univariate conditions in the central limit theorems above hold for all generalized marginal
distributions of X, then X will automatically be asymptotically multivariate normal. For
example, if as an extension of (3.1.15) one considers a sum of iid random vectors,

(3.2.25)   $S_m \;=\; X_1 + \cdots + X_m$

then it follows at once that the terms in each linear compound,

(3.2.26)   $a'S_m \;=\; a'X_1 + \cdots + a'X_m$

must necessarily be iid as well. Hence we obtain an immediate extension of the
“Practical” Central Limit Theorem in (3.1.14) above:

8
Since each marginal compound, $e_i'X$, has a coefficient vector of unit length, i.e., $\|e_i\| = 1$, it is formally
more appropriate to restrict generalized marginals to linear compounds, $a'X$, of unit length ($\|a\| = 1$). But
for our present purposes we need not be concerned with such scaling effects.
9
For a development of this idea (due to Cramér and Wold), see Theorem 29.4 in Billingsley (1979).


Multivariate Central Limit Theorem (Practical). For all sums of iid random
vectors, $S_m = X_1 + \cdots + X_m$, with common mean vector, $\mu$, and covariance matrix,
$\Sigma$, if m is sufficiently large then

(3.2.27)   $S_m \;\sim_d\; N(m\mu,\; m\Sigma)$

But since multivariate normality will almost always arise as a model assumption in our
spatial applications, the most useful extension is the “General” Central Limit Theorem in
(3.1.15), which may now be stated as follows:10

Multivariate Central Limit Theorem (General). For any sum,
$S_m = X_1 + \cdots + X_m$, of random vectors with individual means, $\mu_1,..,\mu_m$, and
covariance matrices, $\Sigma_1,..,\Sigma_m$, if (i) the distributions of these random vectors
are “not too different”, and (ii) the dependencies among these random vectors
are “not too strong”, then for sufficiently large m, the distribution of $S_m$ is
approximately multi-normal, i.e.,

(3.2.28)   $S_m \;\sim_d\; N(\mu, \Sigma)$

with $\mu = \mu_1 + \cdots + \mu_m$ and $\Sigma = \Sigma_1 + \cdots + \Sigma_m$.

Finally, it is appropriate to restate this result explicitly in terms of multi-location random
effects, which form the central focus of this section.

Spatial Random Effects Theorem. For any random vector of multi-location
effects, $\varepsilon = (\varepsilon_i : i = 1,..,n)$, comprised of a sum of individual random factors,
$\varepsilon = e_1 + e_2 + \cdots + e_m$, with zero means and covariance matrices, $\Sigma_1,..,\Sigma_m$, if
(i) the distributions of these random factors are “not too different”, and
(ii) the dependencies among these random factors are “not too strong”,
then for sufficiently large m, the distribution of $\varepsilon$ is approximately
multi-normal, i.e.,

(3.2.29)   $\varepsilon \;\sim_d\; N(0, \Sigma)$

with $\Sigma = \Sigma_1 + \cdots + \Sigma_m$.

It is this version of the Central Limit Theorem that will form the basis for essentially all
random-effects models in the analyses to follow.

10
For a similar (informal) statement of this general version of the Multivariate Central Limit Theorem, see
Theorem 8.11 in Breiman (1969).


3.3 Spatial Stationarity

Given the Spatial Random Effects Theorem above, the task remaining is to specify the
unknown covariance matrix, $\Sigma$, for these random effects. Since $\Sigma$ is in turn a sum of
individual covariance matrices, $\Sigma_k$, for random factors $k = 1,..,m$, it might seem better to
specify these individual covariance structures. But rather than attempt to identify such
factors, our strategy will be to focus on general spatial dependencies that should be
common to all these covariance structures, and hence should be exhibited by $\Sigma$. In doing
so, it is also important to emphasize that such statistical dependencies often have little
substantive relation to the main phenomena of interest. In terms of our basic modeling
framework, $Y(s) = \mu(s) + \varepsilon(s)$, in (1.2.1) above, we are usually much more interested in
the global structure of the spatial process, as represented by $\mu(s)$, than in the specific
relations among unobserved residuals $\{\varepsilon(s_i) : i = 1,..,n\}$ at sample locations $\{s_i : i = 1,..,n\}$.
Indeed, these relations are typically regarded as “second-order” effects in contrast to the
“first-order” effects represented by $\mu(s)$. Hence it is desirable to model such second-
order effects in a manner that will allow the analysis to focus on the first-order effects,
while at the same time taking these unobserved dependencies into account. This general
strategy can be illustrated by the following example.

3.3.1 Example: Measuring Ocean Depths

Suppose that one is interested in mapping the depth of the sea floor over a given region.
Typically this is done by taking echo soundings (sonar measurements) at regular intervals
from a vessel traversing a system of paths over the ocean surface. This will yield a set of
depth readings, $\{D_i = D(s_i) : i = 1,..,n\}$, such as the set of measurements shown in
Figure 3.4 below:

Figure 3.4. Pattern of Depth Measurements  [depth readings $D_1, D_2,\ldots,D_n$ at locations $s_1, s_2,\ldots,s_n$ along the survey path]

However, the ocean is not a homogeneous medium. In particular, it is well known that
such echo soundings can be influenced by the local concentration of zooplankton in the
region of each sounding. These clouds of zooplankton (illustrated in Figure 3.5 below)
create interference called “ocean volume reverberation”.


Figure 3.5. Zooplankton Interference

These interference patterns tend to vary from location to location, and even from day to
day (much in the same way that sunlight is affected by cloud patterns).11 So actual
readings are random variables of the form,

(3.3.1)   $D(s_i) \;=\; d(s_i) + \varepsilon(s_i)\,,\quad i = 1,..,n$

where in this case the actual depth at location $s_i$ is represented by $d(s_i) = E[D(s_i)]$, and
$\varepsilon(s_i)$ represents measurement error due to interference.12 Moreover these errors are
statistically dependent, since plankton concentrations at nearby locations will tend to be
more similar than at locations widely separated in space. Hence to obtain confidence
bounds on the true depth at location $s_i$, it is necessary to postulate a statistical model of
these joint interference levels, $[\varepsilon(s_i) : i = 1,..,n]$. Now one could in principle develop a
detailed model of zooplankton behavior, including their patterns of individual movement
and clustering behavior. However, such models are not only highly complex in nature,
they are very far removed from the present target of interest, which is to obtain accurate
depth measurements.13

11
Actual variations in the distribution of zooplankton are more diffuse than the “clouds” depicted in Figure
3.5. Vertical movement of zooplankton in the water column is governed mainly by changes in sunlight, and
horizontal movement by ocean currents.
12
In actuality, such measurement errors include many different sources, such as the reflective properties of
the sea floor. Moreover, depth measurements are actually made indirectly in terms of the transmission
loss, $L_i = L(s_i)$, between the signal sent and the echo received. The corresponding depth, $D_i$, is obtained
from $L_i$ by a functional relation, $D_i = \phi(L_i, \theta)$, where $\theta$ is a vector of parameters that have been
calibrated under “idealized” conditions. For further details, see Urick, R.J. (1983) Principles of Underwater
Sound, 3rd ed., McGraw-Hill: New York, and in particular the discussion around p.413.
13
Here it is important to note that such detailed models can be of great interest in other contexts. For
example, acoustic signals are also used to estimate the volume of zooplankton available as a food source
for sea creatures higher in the food chain. To do so, it is essential to relate acoustic signals to the detailed
behavior of such microscopic creatures. See for example, Stanton, T.K. and D. Chu (2000) “Review and
recommendations for the modeling of acoustic scattering by fluid-like elongated zooplankton: euphausiids
and copepods”, ICES Journal of Marine Science, 57: 793–807.


So what is needed here is a statistical model of spatial residuals that allows for local
spatial dependencies, but is simple enough to be estimated explicitly. To do so, we will
adopt the following basic assumptions of spatial stationarity:

(3.3.2) [Homogeneity] Residuals, $\varepsilon(s_i)$, are identically distributed at all
locations $s_i$.

(3.3.3) [Isotropy] The joint distribution of distinct residuals, $\varepsilon(s_i)$ and
$\varepsilon(s_j)$, depends only on the distance between locations $s_i$ and $s_j$.

These assumptions are loosely related to the notion of “isotropic stationarity” for point
processes discussed in Section 2.5 of Part I. But here we focus on the joint distribution of
random variables at selected locations in space rather than point counts in selected
regions of space. To motivate the present assumptions in the context of our example,
observe first that while zooplankton concentrations at any point of time may differ
between locations, it can be expected that the range of possible concentration levels over
time will be quite similar at each location. More generally, the Homogeneity assumption
asserts that the marginal distributions of these concentration levels are the same at each
location. To appreciate the need for such an assumption, observe first that while it is in
principle possible to take many depth measurements at each location and employ these
samples to estimate location-specific distributions of each random variable, this is
generally very costly (or even infeasible). Moreover, the same is true of most spatial data
sets, such as the set of total rainfall levels or peak daily temperatures reported by regional
weather stations on a given day. So in terms of the present example, one typically has a
single set of depth measurements $[D(s_i) : i = 1,\dots,n]$, and hence only a single joint
realization of the set of unobserved residuals $[\varepsilon(s_i) : i = 1,\dots,n]$. Thus, without further
assumptions, it is impossible to say anything statistically about these residuals. From this
viewpoint, the fundamental role of the Homogeneity assumption is to allow the joint
realizations, $[\varepsilon(s_i) : i = 1,\dots,n]$, to be treated as multiple samples from a common
population that can be used to estimate parameters of this population.

The Isotropy assumption is very similar in spirit. But here the focus is on statistical
dependencies between distinct random variables, $\varepsilon(s_i)$ and $\varepsilon(s_j)$. For even if their
marginal distributions are known, one cannot hope to say anything further about their
joint distribution on the basis of a single sample. But in the present example it is
reasonable to assume that if a given cloud of zooplankton (in Figure 3.5) covers location
$s_i$, then it is very likely to cover locations $s_j$ which are sufficiently close to $s_i$. Similarly,
for locations that are very far apart, it is reasonable to suppose that clouds covering $s_i$
have little to do with those covering $s_j$. Hence the Isotropy assumption asserts more
generally that similarities between concentration levels at different locations depend only
on the distance between them. The practical implication of this assumption is that all
pairs of residuals, $\varepsilon(s_i)$ and $\varepsilon(s_j)$, separated by the same distance, $h = \|s_i - s_j\|$, must
exhibit the same degree of dependency. Thus a collection of such pairs can in principle
provide multiple samples to estimate the degree of statistical dependency at any given
distance, h . A second advantage of this Isotropy assumption is that it allows simple
models of “local spatial dependency” to be formulated directly in terms of this single
distance parameter. So it should be clear that these two assumptions of spatial stationarity
do indeed provide a natural starting point for the desired statistical model of residuals.

But before proceeding, it should also be emphasized that while these assumptions are
conceptually appealing and analytically useful – they may of course be wrong. For
example, it can be argued in the present illustration that locations in shallow depths
(Figure 3.5) will tend to experience lower concentration levels than locations in deeper
waters. If so, then the Homogeneity assumption will fail to hold. Hence more complex
models involving “nonhomogeneous” residuals may be required in some cases.14 As a
second example, suppose that the spatial movement of zooplankton is known to be
largely governed by prevailing ocean currents, so that clouds of zooplankton tend to be
more elongated in the direction of the current. If so, then spatial dependencies will
depend on direction as well as distance, and the Isotropy assumption will fail to hold.
Such cases may require more complex “anisotropic” models of spatial dependencies.15

3.3.2. Covariance Stationarity

In many cases the assumptions above are stronger than necessary. In particular, recall
from the Spatial Random Effects Theorem (together with the introductory discussion in
Section 3.3) that such random effects are already postulated to be multi-normally
distributed with zero means. So all that is required for our purposes is that these
homogeneity and isotropy assumptions be reflected by the matrix,  , of covariances
among these random effects.

To do so, it will be convenient for our later purposes to formulate such covariance
properties in terms of more general spatial stochastic processes. A spatial stochastic
process, $\{Y(s) : s \in R\}$, is said to be covariance stationary if and only if the following two
conditions hold for all $s_1, s_2, v_1, v_2 \in R$:

(3.3.4)  $E[Y(s_1)] = E[Y(s_2)]$

(3.3.5)  $\|s_1 - s_2\| = \|v_1 - v_2\| \;\Rightarrow\; \operatorname{cov}[Y(s_1), Y(s_2)] = \operatorname{cov}[Y(v_1), Y(v_2)]$

These conditions can be stated more compactly by observing that (3.3.4) implies the
existence of a common mean value, $\mu$, for all random variables. Moreover, (3.3.5)

14 For example, it might be postulated that the variance of $\varepsilon(s)$ depends on the unknown true depth, $d(s)$, at each location, $s$. Such nonstationary formulations are complex, and beyond the scope of these notes.
15 Such models are discussed for example by Waller and Gotway (2004, Section 2.8.5).


implies that covariance depends only on distance, so that for each distance, $h$, and pair of
locations $s, v \in R$ with $\|s - v\| = h$, there exists a common covariance value, $C(h)$, such
that $\operatorname{cov}[Y(s), Y(v)] = C(h)$. Hence, process $\{Y(s) : s \in R\}$ is covariance stationary if and
only if (iff) the following two conditions hold for all $s, v \in R$:

(3.3.6)  $E[Y(s)] = \mu$

(3.3.7)  $\|s - v\| = h \;\Rightarrow\; \operatorname{cov}[Y(s), Y(v)] = C(h)$

Note in particular from (3.3.7) that since $\operatorname{var}[Y(s)] = \operatorname{cov}[Y(s), Y(s)]$ by definition, and
since $\|s - s\| = 0$, it follows that these random variables must also have a common
variance, $\sigma^2$, given by

(3.3.8)  $\operatorname{var}[Y(s)] = C(0) = \sigma^2, \quad s \in R$

While these definitions are in terms of general spatial stochastic processes, $\{Y(s) : s \in R\}$,
our most important applications will be in terms of spatial residuals (random effects).
With this in mind, notice that (3.3.6) together with (1.2.1) imply that every covariance
stationary process can be written as

(3.3.9)  $Y(s) = \mu + \varepsilon(s)$

so that each such process is associated with a unique residual process, $\{\varepsilon(s) : s \in R\}$.
Moreover, since $\operatorname{cov}[Y(s), Y(v)] = \operatorname{cov}[\varepsilon(s), \varepsilon(v)] = E[\varepsilon(s)\,\varepsilon(v)] - E[\varepsilon(s)]\,E[\varepsilon(v)]$, we
see that $\{\varepsilon(s) : s \in R\}$ must satisfy the following more specialized set of conditions for all
$s, v \in R$:

(3.3.10)  $E[\varepsilon(s)] = 0$

(3.3.11)  $\|s - v\| = h \;\Rightarrow\; E[\varepsilon(s)\,\varepsilon(v)] = C(h)$

These are the appropriate covariance stationarity conditions for residuals that correspond
to the stronger Homogeneity (3.3.2) and Isotropy (3.3.3) conditions in Section 3.3.1
above.
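To make these conditions concrete, the following MATLAB sketch simulates one joint realization of residuals satisfying (3.3.10) and (3.3.11) at randomly scattered locations. It assumes multi-normality and an exponentially decaying covariogram of the form $C(h) = \sigma^2 e^{-3h/r}$ [a form introduced in Section 4.6.2 below]; the location pattern and parameter values are purely illustrative and are not part of any course program.

   % Simulate one realization of a covariance-stationary residual process
   % satisfying (3.3.10)-(3.3.11): zero means, with covariances depending
   % only on the distance between locations (illustrative parameters only).
   rng(1);                              % reproducible example
   n    = 200;                          % number of sample locations
   s    = rand(n,2);                    % random locations in the unit square
   sig2 = 2;                            % common variance, C(0) = sigma^2
   r    = 0.4;                          % (practical) range of dependence

   % Pairwise distances h_ij = ||s_i - s_j||
   D = sqrt((s(:,1) - s(:,1)').^2 + (s(:,2) - s(:,2)').^2);

   % Covariance matrix implied by the covariogram C(h) = sig2*exp(-3h/r)
   Sigma = sig2 * exp(-3*D/r) + 1e-10*eye(n);   % small jitter for stability

   % Draw epsilon ~ N(0, Sigma) using a Cholesky factor of Sigma
   L       = chol(Sigma, 'lower');
   epsilon = L * randn(n,1);

Each run produces a single joint realization $[\varepsilon(s_i) : i = 1,\dots,n]$ of exactly the kind discussed above: the marginal distributions are identical at all locations, and dependencies are governed by distance alone.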

Note finally that even these assumptions are too strong in many contexts. For example (as
mentioned above), it is often convenient to relax the isotropy condition implicit in (3.3.7)
and (3.3.11) to allow directional variations in covariances. This can be done by requiring
that covariances depend only on the difference between locations, i.e., that for all
$h = (h_1, h_2)$, $s - v = h \;\Rightarrow\; \operatorname{cov}[Y(s), Y(v)] = C(h)$. This weaker stationarity condition is
often called intrinsic stationarity. See for example [BG] (p.162), Cressie (1993, Sections
2.2.1 and 2.3) and Waller and Gotway (2004, p.273). However, we shall treat only the
isotropic case [(3.3.7),(3.3.11)], and shall use these assumptions throughout.

3.3.3 Covariograms and Correlograms

Note that since the above covariance values, C (h) , are unique for each distance value, h ,
in region R , they define a function, C , of these distances which is designated as the
covariogram for the given covariance stationary process.16 But as with all random
variables, the values of this covariogram are only meaningful with respect to the
particular units in which the variables are measured. Moreover, unlike mean values, the
values of the covariogram are actually in squared units, which are difficult to interpret in
any case. Hence it is often more convenient to analyze dependencies between random
variables in terms of (dimensionless) correlation coefficients. For any stationary process,
$\{Y(s) : s \in R\}$, the (product moment) correlation between any $Y(s)$ and $Y(v)$ with
$\|s - v\| = h$ is given by the ratio:

(3.3.12)  $\rho[Y(s), Y(v)] = \dfrac{\operatorname{cov}[Y(s), Y(v)]}{\sqrt{\operatorname{var}[Y(s)]\,\operatorname{var}[Y(v)]}} = \dfrac{C(h)}{\sqrt{C(0)\,C(0)}} = \dfrac{C(h)}{C(0)}$

which is simply a normalized version of the covariogram. Hence the correlations at every
distance, $h$, for a covariance stationary process are summarized by a function called the
correlogram for the process:

(3.3.13)  $\rho(h) = \dfrac{C(h)}{C(0)}, \quad h \geq 0$

Probably the most important application of correlograms is to allow comparisons
between covariograms that happen to be in different units. One such application is
illustrated in Section 7.3.5 below.

16 To be more precise, if the set of all distances associated with pairs of locations in region $R$ is denoted by $h(R) = \{h : \|s - v\| = h \text{ for some } s, v \in R\}$, then the covariogram, $C$, is a numerical function on $h(R)$. Note also that for the weaker form of intrinsic stationarity discussed above, the covariogram depends on the differences in both coordinates, $h = (h_1, h_2)$, and hence is a two-dimensional function in this case.


4. Variograms

The covariogram and its normalized form, the correlogram, are by far the most intuitive
methods for summarizing the structure of spatial dependencies in a covariance stationary
process. However, from an estimation viewpoint such functions present certain
difficulties (as will be discussed further in Section 4.10 below). Hence it is convenient to
introduce a closely related function known as the variogram, which is widely used for
estimation purposes.

4.1 Expected Squared Differences

To motivate the notion of a variogram for a covariance stationary process, $\{Y(s) : s \in R\}$,
we begin by considering any pair of component variables, $Y_s = Y(s)$ and $Y_v = Y(v)$, and
computing their expected squared difference:

(4.1.1)  $E[(Y_s - Y_v)^2] = E[Y_s^2 - 2Y_sY_v + Y_v^2] = E(Y_s^2) - 2E(Y_sY_v) + E(Y_v^2)$

To relate this to covariograms, note that if $\|s - v\| = h$, then by (3.2.3) and (3.2.4),

(4.1.2)  $C(h) = \operatorname{cov}(Y_s, Y_v) = E[(Y_s - \mu)(Y_v - \mu)] = E[Y_sY_v - Y_s\mu - Y_v\mu + \mu^2]$
         $= E(Y_sY_v) - E(Y_s)\,\mu - E(Y_v)\,\mu + \mu^2$
         $= E(Y_sY_v) - \mu^2 - \mu^2 + \mu^2 = E(Y_sY_v) - \mu^2$
         $\Rightarrow\; E(Y_sY_v) = C(h) + \mu^2$

Exactly the same argument with $s = v$ shows that

(4.1.3)  $E(Y_s^2) = C(0) + \mu^2 = E(Y_v^2)$

Hence by substituting (4.1.2) and (4.1.3) into (4.1.1) we see that expected squared
differences for all $s, v \in R$ with $\|s - v\| = h$ can be expressed entirely in terms of the
covariogram, $C$, as

(4.1.4)  $E[(Y_s - Y_v)^2] = 2\,[C(0) - C(h)]$

To obtain a slightly simpler relation, it is convenient to suppress the factor “2” by
defining the associated quantity,

(4.1.5)  $\gamma(h) = \tfrac{1}{2}\,E[(Y_s - Y_v)^2], \quad \|s - v\| = h$


and observing from (4.1.4) that with this definition we obtain the following simple
identity for all distances, $h$:

(4.1.6)  $\gamma(h) = C(0) - C(h) = \sigma^2 - C(h)$

From (4.1.6) it is thus evident that the “scaled” expected squared differences in (4.1.5)
define a unique function of distance which is intimately related to the covariogram. For
any given covariance stationary process, this function is designated as the variogram, $\gamma$,
of the process. Moreover, it is also evident that this variogram is uniquely constructible
from the covariogram. But the converse is not true. In particular, since (4.1.6) also implies
that

(4.1.7)  $C(h) = \sigma^2 - \gamma(h)$

it is clear that in addition to the variogram, $\gamma$, one must also know the variance, $\sigma^2$, in
order to construct the covariogram.1 Hence this variance will become an important
parameter to be estimated in all models of variograms developed below.
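As a small numerical illustration of (4.1.6), (4.1.7) and (3.3.13), the MATLAB lines below take a hypothetical variogram together with a known variance $\sigma^2$, and recover the corresponding covariogram and correlogram. The functional form here is ours and is used purely for illustration.

   % Given a variogram gam and the variance sig2, the covariogram and
   % correlogram follow directly from (4.1.7) and (3.3.13).
   sig2 = 4;                            % variance, C(0) = sigma^2
   gam  = @(h) sig2*(1 - exp(-h));      % a hypothetical variogram
   C    = @(h) sig2 - gam(h);           % covariogram via (4.1.7)
   rho  = @(h) C(h)/C(0);               % correlogram via (3.3.13)
   [gam(1.5), C(1.5), rho(1.5)]         % approx. 3.107, 0.893, 0.223

Note that without the value of sig2 the first line of recovery would be impossible, which is precisely the point made above.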

Before proceeding further with our analysis of variograms, it is important to stress that the
above terminology is not completely standard. In particular, the expected squared
difference function in (4.1.4) is often designated as the “variogram” of the process, and
its scaled version in (4.1.5) is called the “semivariogram” [as for example in Cressie
(1993, pp.58-59) and Waller and Gotway (2004, p.274)]. (This same convention is used in
the Geostatistical Analyst extension in ARCMAP.) But since the scaled version in (4.1.5)
is the only form used in practice [because of the simple identity in (4.1.7)] it seems most
natural to use the simple term “variogram” for this function, as for example in [BG,
p.162].2

4.2 The Standard Model of Spatial Dependence

To illustrate the relation in (4.1.7) it is most convenient to begin with the simplest and
most commonly employed model of spatial dependence. Recall from the Ocean Depth
Example in Section 3.3.1 above, that the basic hypothesis there was that nearby locations
tend to experience similar concentration levels of plankton, while those in more widely
separated locations have little to do with each other. This can be formalized most easily
in terms of correlograms by simply postulating that correlations are high (close to unity)
for small distances, and fall monotonely to zero as distance increases. This same general
hypothesis applies to a wide range of spatial phenomena, and shall be referred to here as
the standard model of spatial dependence. Given the relation between correlograms and
covariograms in (3.3.13), it follows at once that covariograms for the standard model, i.e.,
standard covariograms, must fall monotonely from $C(0) = \sigma^2$ toward zero, as illustrated

1 However, assuming that $\lim_{h \to \infty} C(h) = 0$, it follows from (4.1.6) that $\lim_{h \to \infty} \gamma(h) = \sigma^2$. So $\sigma^2$ is in principle obtainable from $\gamma$ as the asymptote (sill) in Figure 4.2 below.
2 See also the “lament” regarding this terminology in Schabenberger and Gotway (2005, p.135).


in Figure 4.1 below. The right end of this curve has intentionally been left rather vague. It
may reach zero at some point, in which case covariances will be exactly zero at all
greater distances. On the other hand, this curve may approach zero only asymptotically,
so that covariance is positive at all distances but becomes arbitrarily small. Both cases are
considered to be possible under the standard model (as will be illustrated in Section 4.6
below by the “spherical” and “exponential” variogram models).
[Figures: the standard covariogram $C(h)$ falls from $\sigma^2$ toward zero as distance $h$ increases; the standard variogram $\gamma(h)$ rises from zero toward the sill at $\sigma^2$.]
Figure 4.1. Standard Covariogram        Figure 4.2. Standard Variogram

On the right in Figure 4.2 is the associated standard variogram, which by (4.1.6) above
must necessarily start at zero and rise monotonely toward the value $\sigma^2$. Graphically this
implies that the standard variogram must either reach the dashed line in Figure 4.2,
designated as the sill, or must approach this sill asymptotically.3

But while this mathematical correspondence between the standard variogram and
covariogram is quite simple, there are subtle differences in their interpretation. The
interpretation of standard covariograms is straightforward, since decreases in (positive)
covariance at large distances are naturally associated with decreases in spatial
dependence. But the associated increase in the standard variogram is somewhat more
difficult to interpret in a simple way. If we recall from (4.1.5) that these variogram values
are proportional to expected squared differences, then it is reasonable to conclude that such
differences should increase as variables become less similar (i.e., less positively
dependent). But as a general rule, it would still appear that the simplest approach to
interpreting variogram behavior is to describe this behavior in terms of the corresponding
covariogram.

4.3 Non-Standard Spatial Dependence

Since the analysis to follow will focus almost entirely on the standard model, it is of
interest to consider one example of a naturally occurring stationary process that exhibits
non-standard behavior. As a more micro version of the Ocean Depth Example in Section
3.3.1 above, suppose that one is interested in measuring variations in ocean depth due to
wave action on the surface. Figure 4.3 below depicts an idealized measurement scheme

3 As noted by [BG, p.162], the scaling by ½ in (4.1.5) is precisely to yield a “sill” which is associated with $\sigma^2$ rather than $2\sigma^2$.


involving a set of (yellow) corks at locations $\{s_i : i = 1,\dots,n\}$ that are attached to vertical
measuring rods, allowing them to bob up and down in the waves. The set of cork heights,
$H_i = H(s_i)$, on these $n$ rods at any point of time can be treated as a sample of size $n$
from a spatial stochastic process, $\{H(s) : s \in R\}$, of wave heights defined with respect to
some given ocean region, $R$.

[Figure: corks at locations $s_1, s_2, \dots, s_n$ bob on the water surface against vertical measuring rods, with heights $H_1, \dots, H_n$ measured relative to the mean water level; $d$ denotes the spacing between wave crests.]
Figure 4.3. Measurement of Wave Heights

Here the fluctuation behavior of corks should be essentially the same over time at each
location. Moreover, any dependencies among cork heights due to the smoothness of wave
actions should depend only on the spacing between their positions in Figure 4.3. Hence
the homogeneity and isotropy assumptions of spatial stationarity in Section 3.3.1 should
apply here as well, so that in particular, $\{H(s) : s \in R\}$ can be treated as a covariance
stationary process.

But this process has additional structure implied by the natural spacing of waves. If this
spacing is denoted by d , then it is clear that for corks separated by distance d , such as
those at locations s2 and s6 in Figure 4.3, whenever a wave crest (or trough) occurs at
one location it will tend to occur at the other as well. Hence pairs of locations separated
by a distance d should exhibit a positive correlation in wave heights, as shown in the
covariogram of Figure 4.4 below. However, for locations spaced at around half this
distance, such as s2 and s4 in Figure 4.3, the opposite should be true: whenever a crest
(or trough) occurs at one location, a wave trough (or crest) will tend to occur at the other.
Hence the wave heights at such locations can be expected to exhibit negative correlation,
as is also illustrated by the covariogram in Figure 4.4.

Finally, it should be clear that distances between wave crests are themselves subject to
some random variation (so that distance d in Figure 4.3 should be regarded as the
expected distance between wave crests). Thus, in a manner similar to the standard model,
one can expect that wave heights at distant locations will be statistically unrelated. This in
turn implies that the positive and negative correlation effects above will gradually
dampen as distance increases. Hence this process should be well represented by the
“damped sine wave” covariogram shown in Figure 4.4.4

[Figures: the wave covariogram oscillates about zero with decreasing amplitude as distance increases, while the corresponding wave variogram oscillates about the sill at $\sigma^2$; the expected wave spacing, $d$, is marked on the distance axis of each plot.]
Figure 4.4. Wave Covariogram        Figure 4.5. Wave Variogram

Finally, the associated variogram for this process [as defined by (4.1.6)] is illustrated in
Figure 4.5 for the sake of comparison. If the variance, $\sigma^2$, in Figure 4.4 is again taken to
define the appropriate sill for this variogram (as shown by the horizontal dashed line in
Figure 4.5), then it is clear that the values of this variogram now oscillate around the sill
rather than approaching it monotonely. Hence this sill is only meaningful at larger distances,
where wave heights no longer exhibit any significant correlation.

4.4 Pure Spatial Independence

A second example of a covariance stationary process, $\{Y(s) : s \in R\}$, which is far more
extreme, is the case of pure spatial independence, in which distinct random components,
$Y(s)$ and $Y(v)$, have no relation to each other – no matter how close they are in space.
Mathematically this implies that $\operatorname{cov}[Y(s), Y(v)] = 0$ for all distinct $s$ and $v$. But since
$\operatorname{cov}[Y(s), Y(s)] = \sigma^2 > 0$ for all $s$, this in turn implies that the covariogram, $C$, for such a
process must exhibit a discontinuity at the origin, as shown on the left in Figure 4.6.

[Figure: the covariogram $C(h)$ equals $\sigma^2$ at $h = 0$ and drops to zero for all $h > 0$; the corresponding variogram $\gamma(h)$ equals zero at $h = 0$ and jumps to $\sigma^2$ for all $h > 0$.]
Figure 4.6. Pure Spatial Independence

4 A mathematical model of this type of covariogram is given in expression (4.6.9) below.


Hence by definition, the corresponding variogram, $\gamma$, for pure spatial
independence (shown on the right in Figure 4.6) must also exhibit a discontinuity at the
origin, since $\gamma(0) = 0$ and $\gamma(h) = \sigma^2 > 0$ for all $h > 0$.

Such processes are of course only mathematical idealizations, since literally all physical
processes must exhibit some degree of smoothness (even at small scales). But if
independence holds at least approximately at sufficiently small scales, then this
idealization may be reasonable. For example, if one considers a sandy desert region, $R$,
and lets $D(s)$ denote the depth of sand at any location, $s \in R$, then this might well
constitute a smooth covariance stationary process, $\{D(s) : s \in R\}$, which is quite
consistent with the standard model of Section 4.2 (or perhaps even the “wave model” of
Section 4.3 if wind effects tend to ripple the sand). But in contrast to this, suppose that
one considers an alternative process $\{W(s) : s \in R\}$ in which $W(s)$ now denotes the
weight of the topmost grain of sand at location $s$ (or perhaps the diameter or quartz
content of this grain). Then while it is reasonable to suppose that the distribution of these
weights is the same at each location $s$ (and is thus a homogeneous process as in Section
3.3.1 above), there need be little relation whatsoever between the specific weights of
adjacent grains of sand. So at this scale, the process $\{W(s) : s \in R\}$ is well modeled by
pure spatial independence.

4.5 The Combined Model

The standard model in Section 4.2 and the model of pure spatial independence in
Section 4.4 can be viewed as two extremes: one with continuous positive dependence
gradually falling to zero, and the other with zero dependence at all positive distances.
However, many actual processes are well represented by a mixture of the two. This can
be illustrated by a further refinement of the Ocean Depth Example in Section 3.3.1.
Observe that while mobile organisms like zooplankton have some ability to cluster in
response to various stimuli, the ocean also contains a host of inert debris (dust particles
from the atmosphere, skeletal remains of organisms, etc.) which bear little relation to
each other. Hence in addition to the spatially correlated errors in sonar depth
measurements created by zooplankton, there is a general level of “background noise”
created by debris particles that is best described in terms of spatially independent errors.

If these two types of measurement errors at location $s$ are denoted respectively by $\varepsilon_1(s)$
and $\varepsilon_2(s)$, then a natural refinement of the depth measurement model in (3.3.1) would be
to postulate that total measurement error, $\varepsilon(s)$, is the sum of these two components:

(4.5.1)  $\varepsilon(s) = \varepsilon_1(s) + \varepsilon_2(s), \quad s \in R$

Moreover, it is also reasonable to assume that these error components are independent
(i.e., that the distribution of zooplankton is not influenced by the presence or absence of
debris particles). More formally, it may be assumed that $\varepsilon_1(s)$ and $\varepsilon_2(v)$ are independent
random variables for every pair of locations, $s, v \in R$. With this assumption it then
follows (see Section A2.1 in Appendix A2) that the covariogram, $C$, of the error process $\varepsilon$
must be the sum of the separate covariograms, $C_1$ and $C_2$, for the component processes $\varepsilon_1$
and $\varepsilon_2$, i.e., that for any $h \geq 0$,

(4.5.2)  $C(h) = C_1(h) + C_2(h)$

More generally, any covariance stationary process, $\{Y(s) : s \in R\}$, with covariogram of
the form (4.5.2) will be said to satisfy the combined model of covariance stationary
processes. Covariogram $C_1$ then represents the spatially dependent component of this
process, and covariogram $C_2$ represents its spatially independent component.5

To see the graphical form of this combined model, observe first that by setting $h = 0$ in
(4.5.2) it also follows that

(4.5.3)  $\sigma^2 = C(0) = C_1(0) + C_2(0) = \sigma_1^2 + \sigma_2^2$

where $\sigma_1^2$ and $\sigma_2^2$ are the corresponding variances for the spatially dependent and
independent components, respectively. Hence the covariogram for the combined process
in (4.5.2) is given by Figure 4.7 below:

[Figure: the spatially dependent covariogram $C_1$ (with variance $\sigma_1^2$) plus the spatially independent covariogram $C_2$ (with variance $\sigma_2^2$) yields the combined covariogram $C$, which starts at $\sigma^2$ but drops by $\sigma_2^2$ immediately beyond the origin – the nugget effect.]
Figure 4.7. Covariogram for Combined Model

In this graphical form it is clear that the covariogram for the combined model is
essentially the same as that of the standard model, except that there is now a discontinuity
at the origin. This local discontinuity is called the nugget effect in the combined model,6
and the magnitude of this effect (which is simply the variance, $\sigma_2^2$, of the pure
independent component) is called the nugget. Note that by definition the ratio, $\sigma_2^2 / \sigma^2$,

5 This combined model is an instance of the more general decomposition in Cressie (1993, pp.112-113), with $C_1$ reflecting the “smooth” component, $W$, and $C_2$ reflecting the “noise” components, $\eta + \varepsilon$.
6
This term originally arose in mining applications where there are often microscale variations in ore
deposits due to the presence of occasional nuggets of ore [as discussed in more detail by Cressie
(1993,p.59)]. In the present context, such a “nugget effect” would be modeled as an independent micro
component of a larger (covariance stationary) process describing ore deposits.


gives the relative magnitude of this effect, and is designated as the relative nugget effect.
For example, if the relative nugget effect for a given covariogram is say .75, then this
would indicate that the underlying process exhibits relatively little spatial dependence.

Next we consider the associated variogram for the combined model. If $\gamma$ denotes the
variogram of the combined process in (4.5.1), then we see from (4.1.6) together with
(4.5.2) and (4.5.3) that

(4.5.4)  $\gamma(h) = \sigma^2 - C(h) = (\sigma_1^2 + \sigma_2^2) - [C_1(h) + C_2(h)]$
         $= [\sigma_1^2 - C_1(h)] + [\sigma_2^2 - C_2(h)]$
         $= \gamma_1(h) + \gamma_2(h)$

where $\gamma_1$ and $\gamma_2$ are the variograms for the spatially dependent and independent
components, respectively. Hence it follows that variograms add as well, and yield a
corresponding combined variogram as shown in Figure 4.8 below:

[Figure: the combined covariogram $C(h)$ falls from $\sigma^2$ toward zero while the combined variogram $\gamma(h)$ rises from the nugget, $\sigma_2^2$, toward the sill at $\sigma^2 = \sigma_1^2 + \sigma_2^2$.]
Figure 4.8. Summary of the Combined Model

4.6 Explicit Models of Variograms

While the combined model above provides a useful conceptual framework for variograms
and covariograms, it is not sufficiently explicit to be estimated statistically. We require
explicit mathematical models that are (i) qualitatively consistent with the combined
model, and (ii) are specified in terms of a small number of parameters that can be
estimated.7

7 There is an additional technical requirement that covariograms yield well-defined covariance matrices, as detailed further in the Appendix to Part III (Corollary 2, p.A3-70).


4.6.1. The Spherical Model

The simplest and most widely used variogram model is the spherical variogram, defined
for all $h \geq 0$ by:

(4.6.1)  $\gamma(h; r, s, a) = \begin{cases} 0, & h = 0 \\ a + (s - a)\left(\dfrac{3h}{2r} - \dfrac{h^3}{2r^3}\right), & 0 < h \leq r \\ s, & h > r \end{cases}$

Here the parameters $(r, s, a)$ of $\gamma$ are assumed to satisfy $r, s > 0$ and $a \geq 0$ with $s \geq a$. [Note
that the argument, $h$, of function $\gamma$ is separated from its parameters, $(r, s, a)$, by a
semicolon.8] To interpret these parameters, it is useful to consider the spherical variogram
shown in Figure 4.9 below with $(r = 6, s = 4, a = 1)$:

[Figures: the spherical variogram rises from the nugget, $a = 1$, to the sill, $s = 4$, reaching it exactly at the range $r = 6$; the corresponding spherical covariogram falls from $s = 4$ to zero at $r = 6$, with a drop of size $a$ just beyond the origin.]
Figure 4.9. Spherical Variogram        Figure 4.10. Spherical Covariogram

A comparison of Figure 4.9 with the right hand side of Figure 4.8 shows that parameter
$s$ corresponds to the sill of the variogram and parameter $a$ corresponds to the nugget
[as can also be seen by letting $h$ approach zero in (4.6.1)]. So for this particular
example the relative nugget effect is $a/s = 1/4$. Note finally that since the spherical
variogram reaches the sill at value $r$ [as can also be seen by setting $h = r$ in (4.6.1)],
this implies that the corresponding covariogram in Figure 4.10 falls to zero at $r$. Hence
the parameter $r$ denotes the maximum range of positive spatial dependencies, and is

8 More generally, the expression $f(x_1,\dots,x_n; \theta_1,\dots,\theta_k)$ is taken to denote a function, $f$, with arguments $(x_1,\dots,x_n)$ and parameters $(\theta_1,\dots,\theta_k)$.


designated simply as the range of the variogram (and corresponding covariogram). These
same notational conventions for range, sill and nugget will be used throughout.9

The formal spherical covariogram corresponding to expression (4.6.1) is immediately
obtainable from (4.1.7) [with $s = \sigma^2$], and is given by:

(4.6.2)  $C(h; r, s, a) = \begin{cases} s, & h = 0 \\ (s - a)\left(1 - \dfrac{3h}{2r} + \dfrac{h^3}{2r^3}\right), & 0 < h \leq r \\ 0, & h > r \end{cases}$

Together, (4.6.1) and (4.6.2) will be called the spherical model. One can gain further
insight into the nature of this model by differentiating (4.6.2) in the interval, $0 < h < r$, to
obtain:

(4.6.3)  $\dfrac{dC}{dh} = (s - a)\left(\dfrac{3h^2}{2r^3} - \dfrac{3}{2r}\right) = (s - a)\left(\dfrac{3}{2r}\right)\left[\left(\dfrac{h}{r}\right)^2 - 1\right]$

Hence we see that

(4.6.3)  $\dfrac{dC}{dh} < 0, \quad 0 < h < r$

Moreover, by differentiating once more we see that

(4.6.4)  $\dfrac{d^2C}{dh^2} = (s - a)\left(\dfrac{3}{2r}\right)\left(\dfrac{2h}{r^2}\right) > 0$

whenever the sill is greater than the nugget (i.e., $s - a > 0$). Thus, except for the extreme
case of pure independence, this function is always “bowl shaped” on the interval
$0 < h < r$, and has a unique differentiable minimum at $h = r$. Hence this spherical
covariogram yields a combined-model form with finite range that falls smoothly to zero.
These properties (together with its mathematical simplicity) account for the popularity of
the spherical model.
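The spherical model is also simple enough to code directly. The following MATLAB function is a sketch of (4.6.1) and (4.6.2) for a vector of distances; the function name is ours, not that of any course program. Calling it with the parameter values $(r = 6, s = 4, a = 1)$, e.g. [gam, C] = spherical_model(0:0.1:8, 6, 4, 1), should reproduce the curves in Figures 4.9 and 4.10.

   function [gam, C] = spherical_model(h, r, s, a)
   % SPHERICAL_MODEL  Spherical variogram (4.6.1) and covariogram (4.6.2).
   %   h       : vector of distances (h >= 0)
   %   r, s, a : range, sill and nugget, with r, s > 0 and s >= a >= 0
   gam = zeros(size(h));
   mid = (h > 0) & (h <= r);                                    % 0 < h <= r
   gam(mid)   = a + (s - a)*( 3*h(mid)/(2*r) - h(mid).^3/(2*r^3) );
   gam(h > r) = s;                                              % at and beyond the range
   C = s - gam;                                                 % covariogram via (4.1.7)
   end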

All explicit variogram applications in these notes will employ this spherical model.
However, it is of interest at this point to consider one alternative model which is also in
wide use.

9 Note that the use of “s” to denote sill should not be confused with the use of “$s = (s_1, s_2)$” to denote spatial locations. Also, since the symbol, $n$, is used to denote sample size, we choose to denote the nugget by “a” rather than “n”.


4.6.2 The Exponential Model

While the spherical model is smooth in the sense of continuous differentiability, it makes
the implicit assumption that correlations are exactly zero at all sufficiently large
distances. But in some cases it may be more appropriate to assume that while correlations
may become arbitrarily small at large distances, they never vanish. The simplest model
with this property is the exponential variogram, defined for all $h \geq 0$ by,

(4.6.5)  $\gamma(h; r, s, a) = \begin{cases} 0, & h = 0 \\ a + (s - a)\left(1 - e^{-3h/r}\right), & h > 0 \end{cases}$

with corresponding exponential covariogram, defined for all $h \geq 0$ by,

(4.6.6)  $C(h; r, s, a) = \begin{cases} s, & h = 0 \\ (s - a)\,e^{-3h/r}, & h > 0 \end{cases}$

Together, this variogram-covariogram pair is designated as the exponential model, and is
illustrated in Figures 4.11 and 4.12 below, using the same set of parameter values
$(r = 6, s = 4, a = 1)$ as for the spherical model above.

[Figures: the exponential variogram rises from the nugget, $a$, toward the sill, $s$, approaching it only asymptotically (the partial sill, $s - a$, is marked); the exponential covariogram falls from $s$ toward zero, never quite reaching it, with value approximately $.05(s - a)$ at the practical range $r$.]
Figure 4.11. Exponential Variogram        Figure 4.12. Exponential Covariogram

Here it is clear that the sill, s , and nugget, a , play the same role as in the spherical
model. However, the “range” parameter, r , is more difficult to interpret in this case since
spatial dependencies never fall to zero. To motivate the interpretation of this parameter,
observe first that since spatial dependencies are only meaningful at positive distances, it
is natural to regard the quantity $s - a$ in Figure 4.12 as the maximal covariance for the
underlying process.10 In these terms, the practical range of spatial dependency is
typically defined to be the smallest distance, $r$, beyond which covariances are no more
than 5% of the maximal covariance. To see that $r$ in (4.6.6) is indeed the practical range
for this covariogram, observe simply that since $e^{-x} \leq .05 \;\Leftrightarrow\; x \geq -\ln(.05) = 2.9957 \approx 3$, it
follows that

(4.6.7)  $h \geq r \;\Rightarrow\; e^{-3h/r} \leq .05 \;\Rightarrow\; C(h) \leq (s - a)(.05)$

Note finally that in terms of the corresponding variogram (which plays the primary role
in statistical estimation of the exponential model), the quantity $s - a$ in Figure 4.11 is
usually called the partial sill.11
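A short MATLAB check of this practical-range interpretation, using the illustrative parameters $(r = 6, s = 4, a = 1)$ from the text, is sketched below (again, these lines are ours and not part of any course program).

   % Exponential variogram (4.6.5) and covariogram (4.6.6), with a check of
   % the practical range r in (4.6.7).
   r = 6;  s = 4;  a = 1;
   gam_exp = @(h) (h > 0).*( a + (s - a)*(1 - exp(-3*h/r)) );    % gam(0) = 0
   C_exp   = @(h) s*(h == 0) + (s - a)*exp(-3*h/r).*(h > 0);     % C(0) = s

   % At h = r the covariance has fallen to about 5% of the maximal
   % covariance (s - a), since exp(-3) = 0.0498:
   C_exp(r)/(s - a)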

4.6.3 The Wave Model

Finally, it is of interest to consider a mathematical model of the nonstandard “wave”
dependence example in Section 4.3 above. Here it is not surprising that the appropriate
variogram for this wave model is given by a damped sine wave as follows,12

(4.6.8)  $\gamma(h; w, s, a) = \begin{cases} 0, & h = 0 \\ a + (s - a)\left(1 - w\,\dfrac{\sin(h/w)}{h}\right), & h > 0 \end{cases}$

where the parameter, $w$, denotes the wave intensity. Here the corresponding covariogram
is given by:

(4.6.9)  $C(h; w, s, a) = \begin{cases} s, & h = 0 \\ (s - a)\,w\,\dfrac{\sin(h/w)}{h}, & h > 0 \end{cases}$

The wave covariogram and variogram shown in Figures 4.4 and 4.5 above are in fact
instances of this wave model with $(w = 0.6, a = 0, s = 0.6)$.
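A minimal MATLAB sketch of this hole-effect covariogram and variogram, using the same parameter values $(w = 0.6, a = 0, s = 0.6)$, is given below for illustration only.

   % Wave (hole-effect) model (4.6.8)-(4.6.9) evaluated at positive distances.
   w = 0.6;  a = 0;  s = 0.6;
   h = linspace(0.01, 5, 500);                        % positive distances
   C_wave   = (s - a) * w * sin(h/w) ./ h;            % oscillates about zero
   gam_wave = a + (s - a) * (1 - w * sin(h/w) ./ h);  % oscillates about the sill s
   plot(h, C_wave, h, gam_wave); legend('C(h)', '\gamma(h)');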

10 More generally, this maximal covariance for any combined model in Figure 4.7 is seen to be given by the variance, $\sigma_1^2$, of the (continuous) spatially dependent component.

11
Indeed this quantity plays such a central role that variograms are often defined with the partial sill as an
explicit parameter rather than the sill itself. See for example the spherical and exponential (semi) variogram
models in Cressie (1993, p.61). See also the Geostatistical Analyst example in Section 4.9.2 below.
12
This is also referred to as the hole-effect model [as in Cressie (1993, p.623)], and in particular, is given
this designation in the Geostatistical Analyst kriging option of ARCMAP.


4.7 Fitting Variogram Models to Data

There are many approaches to fitting possible variogram models to spatial data sets, as
discussed at length in Cressie (1993, Section 2.4) and Schabenberger and Gotway (2005,
Sections 4.4-4.6). Here we consider only the standard two-stage approach most
commonly used in practice (as for example in Geostatistical Analyst). The basic idea of
this approach is to begin by constructing a direct model-independent estimate of the
variogram called the “empirical variogram”. This empirical variogram is then used as
intermediate data to fit specific variogram models. We consider each of these steps in
turn.

4.7.1 Empirical Variograms

An examination of (4.1.5) suggests that for any given set of spatial data $\{y(s_i) : i = 1,\dots,n\}$
and distance, $h$, there is an obvious estimator of the variogram value, $\gamma(h)$, namely “half
the average value of $[y(s_i) - y(s_j)]^2$ for all pairs of locations $s_i$ and $s_j$ separated by
distance $h$”. However, one problem with this estimator is that (unlike K-functions) the
value $\gamma(h)$ refers to point pairs with distance $\|s_i - s_j\|$ exactly equal to $h$. Since in any
finite sample there will generally be at most one pair that is separated by a given
distance $h$ (except for data points on regular grids, as discussed below), one must
necessarily aggregate point pairs $(s_i, s_j)$ with similar distances, and hence estimate $\gamma(h)$
at only a small number of representative distances for each aggregate. The simplest way
to do so is to partition distances into intervals, called bins, and take the average distance,
$h_k$, in each bin $k$ to be the appropriate representative distances, called lag distances, as
shown in Figure 4.13 below:

[Figure: the distance axis from 0 to the max lag, $\overline{h}$, partitioned into bins, with the lag distances $h_1, h_2, h_3, h_4, \dots$ marked within successive bins.]
Figure 4.13. Lag Distances and Bins

More formally, if $N_k$ denotes the set of distance pairs, $(s_i, s_j)$, in bin $k$ [with the size
(number of pairs) in $N_k$ denoted by $|N_k|$], and if the distance between each such pair is
denoted by $h_{ij} = \|s_i - s_j\|$, then the lag distance, $h_k$, for bin $k$ is defined to be


(4.7.1)  $h_k = \dfrac{1}{|N_k|} \sum_{(s_i, s_j) \in N_k} h_{ij}$

To determine the size of each bin, the most common approach is to make all bins the
same size, in order to ensure a uniform approximation of lag distances within each bin.
However, there is an implicit tradeoff here between approximation of lag distances and
the number of point pairs used to estimate the variogram at each lag distance. Here the
standard rule of thumb is that each bin should contain at least 30 point pairs,13 i.e., that

(4.7.2)  $|N_k| \geq 30$

Next observe that the choice of the maximum lag distance (max-lag), $\overline{h}$ (in Figure 4.13),
also involves some implicit restrictions. First, for any given set of sample points,
$\{s_i : i = 1,\dots,n\} \subset R$, one cannot consider lag distances greater than the maximum pairwise
distance,

(4.7.3)  $h_{\max} = \max\left\{\|s_i - s_j\| : i < j \leq n\right\}$

in this sample, since no observations are available. Moreover, practical experience has
shown that even for lag distances close to $h_{\max}$ the resulting variogram estimates tend to
be unstable [Cressie (1985, p.575)]. Hence, in a manner completely analogous to the rule
of thumb for K-functions [expression (4.5.1) of Part I], it is common practice to restrict
$\overline{h}$ to be no greater than half of $h_{\max}$, i.e.,

(4.7.4)  $\overline{h} \leq \dfrac{h_{\max}}{2}$

Hence our basic rule for constructing bins is to choose a system of bins $\{N_k : k = 1,\dots,\overline{k}\}$ of
uniform size, such that the max-lag, $\overline{h} = h_{\overline{k}}$, is as large as possible subject to (4.7.3) and
(4.7.4). More formally, if the biggest distance in each bin $k$ is denoted by
$d_k = \max_{(s_i, s_j) \in N_k} h_{ij}$, then our procedure (in the MATLAB program variogram.m
discussed below) is to choose a maximum bin number, $\overline{k}$, and maximum distance (max-
dist), $\overline{d}$, such that 14

(4.7.5)  $|N_1| = \cdots = |N_{\overline{k}}| \geq 30$

(4.7.6)  $\overline{h} = h_{\overline{k}} \leq d_{\overline{k}} \leq \overline{d}$
13
Notice that this rule of thumb is reminiscent of that for the Central Limit Theorem used in the Clark-
Evans test of Section 3.2.2 in Part I (and in Section 3.1.3 above). Note also that some authors recommend
there be at least 50 pairs in each bin [as for example in Schabenberger and Gotway (2005, p.153)].
14
This is essentially a variation on the “practical rule” suggested by Cressie (1985, p.575).


(Here the default value of $\overline{d}$ is $h_{\max}/2$ and the default value of $\overline{k}$ is 100 bins.) With
these rules for constructing bins and associated lag distances, it then follows from (4.1.5)
that for any given set of sample points, $\{s_i : i = 1,\dots,n\} \subset R$, with associated data,
$\{y(s_i) : i = 1,\dots,n\}$, an appropriate estimate of the variogram value, $\gamma(h_k)$, at each lag
distance, $h_k \leq \overline{h}$, is given by half the average squared differences $[y(s_i) - y(s_j)]^2$ over
all point pairs $(s_i, s_j)$ in $N_k$, i.e.,

(4.7.7)  $\hat{\gamma}(h_k) = \dfrac{1}{2\,|N_k|} \sum_{(s_i, s_j) \in N_k} \left[y(s_i) - y(s_j)\right]^2$

This set of estimates at each lag distance is designated as the empirical variogram.15
More formally, if for any given set of (ordered) lag distances, $\{h_k : k = 1,\dots,\overline{k}\}$, the
associated variogram estimates in (4.7.7) are denoted simply by $\hat{\gamma}_k = \hat{\gamma}(h_k)$, then the
empirical variogram is given by the set of pairs $\{(h_k, \hat{\gamma}_k) : k = 1,\dots,\overline{k}\}$. A schematic
example of this empirical variogram construction is given in Figure 4.14 below:

[Figure: a scatter of squared differences $[y(s_i) - y(s_j)]^2$ plotted against pairwise distances $h_{ij}$, with vertical lines marking the bin boundaries and the bin averages $(h_k, \hat{\gamma}_k)$ and $(h_{k+1}, \hat{\gamma}_{k+1})$ highlighted.]
Figure 4.14. Empirical Variogram Construction

Here the blue dots correspond to squared-difference pairs, $[y(s_i) - y(s_j)]^2$, plotted
against distances, $h_{ij} = \|s_i - s_j\|$, for each point pair, $(s_i, s_j)$ [as illustrated for one point in
the lower left corner of the figure]. The vertical lines separate the bins, as shown for bins

15
The empirical variogram is also known as Matheron’s estimator, in honor of its originator
[Schabenberger and Gotway (2005, Section 4.4.1)].


k and k+1. So in bin k, for example, there is one blue dot for every point pair,
( si , s j )  N k . The red dot in the middle of these points denotes the pair of average values,
(hk , ˆk ) , representing all points in that bin. Hence the empirical variogram consists of all
these average points, one for each bin of points. [Schematics of such empirical
variograms are shown (as blue dots) in Figure 4.15 below. An actual example of an
empirical variogram is shown in Figure 4.19 below.]
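The construction just described can be sketched in a few lines of MATLAB. The function below is not the course program variogram_plot.m; it is a minimal illustration of the lag distances (4.7.1) and variogram estimates (4.7.7) using bins of (roughly) equal size, together with the default max-dist rule in (4.7.4). The function name and the data layout [an n-by-3 matrix of coordinates and values] are assumptions for this sketch.

   function [hbar, gamhat] = empirical_variogram(dat, kbar, dbar)
   % EMPIRICAL_VARIOGRAM  Minimal sketch of an empirical variogram.
   %   dat  : n x 3 data matrix [s1, s2, y]
   %   kbar : maximum number of bins
   %   dbar : maximum distance (defaults to half the maximum pairwise distance)
   s = dat(:,1:2);  y = dat(:,3);  n = size(dat,1);
   [I,J] = find(triu(true(n),1));                 % all point pairs with i < j
   hij = sqrt(sum((s(I,:) - s(J,:)).^2, 2));      % pairwise distances h_ij
   gij = (y(I) - y(J)).^2 / 2;                    % half squared differences
   if nargin < 3, dbar = max(hij)/2; end          % default max-dist, as in (4.7.4)
   keep = (hij <= dbar);  hij = hij(keep);  gij = gij(keep);
   [hij, ord] = sort(hij);  gij = gij(ord);
   m = numel(hij);
   k = max(1, min(kbar, floor(m/30)));            % roughly 30+ pairs per bin, cf. (4.7.2)
   edges = round(linspace(0, m, k+1));            % equal-count bins
   hbar = zeros(k,1);  gamhat = zeros(k,1);
   for b = 1:k
       idx = (edges(b)+1):edges(b+1);
       hbar(b)   = mean(hij(idx));                % lag distance (4.7.1)
       gamhat(b) = mean(gij(idx));                % variogram estimate (4.7.7)
   end
   end

For the nickel data of Section 4.9 below, a call such as [hbar, gamhat] = empirical_variogram(nickel, 100); plot(hbar, gamhat, 'o') should produce a point scatter broadly similar to the empirical variogram in Figure 4.19.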

While this empirical variogram will be used to fit all variograms in these notes, it should
be mentioned that a number of modifications are possible. First of all, while the use of
average distances, hk , in each bin k has certain statistical advantages (to be discussed
below), one can also use the median distance, or simply the midpoint of the distance
range. Similarly, while uniformity of bin sizes in (4.7.5) will also turn out to have certain
statistical advantages for fitting variograms in our framework (as discussed below), one
can alternatively require uniform widths of bins.

In addition, it has been observed by Cressie and Hawkins (1980) [also Cressie (1993,
Section 2.4.3)] that estimates involving squared values such as (4.7.7) are often
dominated by a few large values, and are thus sensitive to outliers. Hence these authors
propose several “robust” alternatives to (4.7.7) based on square roots and median values
of absolute differences.

Finally, it should be noted that a number of fitting procedures in use actually drop this
initial stage altogether, and fit variogram models directly in terms of the original data,
$\{y(s_i) : i = 1,\dots,n\}$.16 In such approaches, the empirical variogram is essentially replaced
by a completely disaggregated version called the variogram cloud, where each point pair
$(s_i, s_j)$ is treated as a separate “bin”, and where $\gamma_{ij} = \gamma(\|s_i - s_j\|)$ is estimated by the
single sample, $\hat{\gamma}_{ij} = \tfrac{1}{2}\left[y(s_i) - y(s_j)\right]^2$.17 While this approach can in many cases be more
powerful statistically, it generally requires stronger modeling assumptions. Moreover, it
turns out that such methods are not only very sensitive to these modeling assumptions,
but can also be less stable for smaller data sets. Finally, and most important from a
practical viewpoint, plots of the empirical variogram tend to be visually much more
informative than plots of the entire variogram cloud, and in particular, can often help to
suggest appropriate model forms for the variogram itself. [An example is given in Figure
4.20 below.] Hence we choose to focus on the classical empirical-variogram approach.18

16
Most prominent among these is the method of maximum likelihood, as detailed for example in
Schabenberger and Gotway (2005, Section 4.5.2). [This general method of estimation will also be
developed in more detail in Part III of these notes for fitting spatial regression models.]
17
An example is given in Figure 4.19 below.
18
For additional discussion see the section on “Binning versus Not Binning” in Schabenberger and
Gotway (2005, Section 4.5.4.3). See also the excellent discussion in Reilly and Gelman (2007).


4.7.2 Least-Squares Fitting Procedure

Given an empirical variogram, $\{(h_k, \hat{\gamma}_k) : k = 1,\dots,\overline{k}\}$, together with a candidate variogram
model, $\gamma(h; r, s, a)$ [such as the spherical model in (4.6.1)], the task remaining is to find
parameter values, $(\hat{r}, \hat{s}, \hat{a})$, for this model that yield a “best fit” to the empirical variogram
data. The simplest and most natural approach is to adopt a “least squares” strategy, i.e.,
to seek parameter values, $(\hat{r}, \hat{s}, \hat{a})$, that solve the following (nonlinear) least-squares
problem:

(4.7.8)  $\min_{(r,s,a)} \sum_{k=1}^{\overline{k}} \left[\hat{\gamma}_k - \gamma(h_k; r, s, a)\right]^2$

While this procedure will be used to fit all variograms in these notes, it is important to
note some shortcomings of this approach. First of all, since squared deviations are being
used in (4.7.8), it again follows that this least-squares procedure is sensitive to outliers.
As with all least-squares procedures, one can attempt to mitigate this problem by using an
appropriate weighting scheme, i.e., by considering the more general weighted least-
squares problem:

(4.7.9)  $\min_{(r,s,a)} \sum_{k=1}^{\overline{k}} w_k \left[\hat{\gamma}_k - \gamma(h_k; r, s, a)\right]^2$

for some set of appropriate nonnegative weights $\{w_k : k = 1,\dots,\overline{k}\}$. A very popular choice
for these weights [first proposed by Cressie (1985)] is to set:19

(4.7.10)  $w_k = \dfrac{|N_k|}{\gamma(h_k; r, s, a)^2}, \quad k = 1,\dots,\overline{k}$

Here the numerator simply places more weight on those terms with more samples. The
denominator is approximately proportional to the variance of the estimates, $\hat{\gamma}_k$,20 so that
the effect of both the numerator and denominator is to place more weight on those terms
for which the estimates, $\hat{\gamma}_k$, are most reliable. However, it has been pointed out by others
that the inclusion of the unknown parameters $(r, s, a)$ in these weights can create certain
instabilities in the estimation procedure [see for example Zhang et al. (1995) and Müller
(1999, Section 4)]. Moreover, since our constant bin sizes in (4.7.5) eliminate variation in
the sample weights, we choose to use the simpler unweighted least-squares procedure in
(4.7.8).
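Given an empirical variogram, the unweighted problem (4.7.8) can be solved with any general-purpose optimizer. The following MATLAB sketch uses the base function fminsearch together with the spherical_model and empirical_variogram sketches given earlier, for a generic n-by-3 data matrix dat = [s1, s2, y] (an assumed variable name). This is not the course fitting program, and the starting values are ad hoc.

   % Unweighted least-squares fit (4.7.8) of the spherical model to an
   % empirical variogram {(hbar_k, gamhat_k)}.
   [hbar, gamhat] = empirical_variogram(dat, 100);          % dat = [s1, s2, y]
   sse  = @(p) sum( (gamhat - spherical_model(hbar, p(1), p(2), p(3))).^2 );
   p0   = [max(hbar)/2, max(gamhat), min(gamhat)];          % crude starting values
   phat = fminsearch(sse, p0);                              % phat = [r, s, a]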

19
In particular, this is the weighted least-squares procedure used in Geostatistical Analyst.
20
This approximation is based on the important case of normally distributed spatial data.


Finally, it should also be noted that this least-squares procedure is implicitly a constrained
minimization problem, since it is required that (i) $r > 0$ and (ii) $s \geq a \geq 0$. In the present
setting, however, nonnegativity of both $r$ and $s$ is essentially guaranteed by the
nonnegativity of the empirical variogram itself. But nonnegativity of the nugget, $a$, is
much more problematic, and can in some cases fail to hold. This is illustrated by the
schematic example shown on the left in Figure 4.15 below, where a spherical variogram
model (red curve) has been fitted to a set of hypothetical empirical variogram data (blue
dots). Here it is clear that the best fitting spherical variogram does indeed involve a
negative value for the estimated nugget, $\hat{a}$.

[Figure: two panels of hypothetical empirical variogram data (blue dots) with fitted spherical variograms (red curves); in the left panel the fitted curve implies a negative nugget, while in the right panel the refitted curve is constrained to have a zero nugget.]
Figure 4.15. Negative Nugget Problem

Hence in such cases, it is natural to impose the additional constraint that $a = 0$, and then
solve the reduced minimization problem in the remaining unknown parameters, $(r, s)$:

(4.7.11)  $\min_{(r,s)} \sum_{k=1}^{\overline{k}} \left[\hat{\gamma}_k - \gamma(h_k; r, s, 0)\right]^2$

The solution to this reduced problem, shown schematically above, will yield the “closest
approximation” to the solution of (4.7.8) with a feasible value for the nugget, $a$. It is this
two-stage fitting procedure that will be used (implicitly) whenever nuggets are negative.
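In code, this two-stage procedure amounts to a single conditional re-fit. Continuing the fminsearch sketch above (again just a sketch, with the same assumed variable names):

   % If the fitted nugget is negative, impose a = 0 and re-solve the reduced
   % problem (4.7.11) in (r,s) only.
   if phat(3) < 0
       sse0 = @(q) sum( (gamhat - spherical_model(hbar, q(1), q(2), 0)).^2 );
       qhat = fminsearch(sse0, phat(1:2));
       phat = [qhat(1), qhat(2), 0];
   end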

4.8 The Constant-Mean Model

Our next objective is to develop a practical illustration of variogram estimation. But to do
so, it is important to begin by recalling that covariance stationarity was originally
motivated in the context of our general modeling framework in Section 1.2 above, where
it was assumed that spatial random variables are of the form:

(4.8.1)  $Y(s) = \mu(s) + \varepsilon(s), \quad s \in R$


and where covariance stationarity is actually a property of the unobserved residual
process, $\{\varepsilon(s) : s \in R\}$. Hence variogram estimation for any given set of spatial data,
$\{y(s_i) : i = 1,\dots,n\}$, must generally be done as part of a larger modeling effort in which
both the variogram and the spatial trend function $\{\mu(s) : s \in R\}$ are modeled explicitly.
One can then consider iterative fitting procedures in which the spatial trend function is
first fitted from the data, say by $\{\hat{\mu}(s_i) : i = 1,\dots,n\}$, to yield residual estimates,

(4.8.2)  $\hat{\varepsilon}(s_i) = y(s_i) - \hat{\mu}(s_i), \quad i = 1,\dots,n$

that are in turn used to fit the variogram model. Much of the present section on
Continuous Spatial Data Analysis will be devoted to this larger modeling-and-estimation
problem. Hence to develop a meaningful example of variogram estimation at this point, it
is necessary to make stronger assumptions about the general framework in (4.8.1) above.

In particular, we now assume that the entire process $\{Y(s) : s \in R\}$ is itself covariance
stationary. By (3.2.6) through (3.2.8), this is equivalent to assuming that in addition to
covariance stationarity of the residual process in the second term of (4.8.1), the spatial
trend function in the first term is constant, so that

(4.8.3)  $Y(s) = \mu + \varepsilon(s), \quad s \in R$

for some (possibly unknown) scalar, $\mu$. Under these conditions it follows at once that

(4.8.4)  $E\left[\left(Y(s) - Y(v)\right)^2\right] = E\left[\left((\mu + \varepsilon(s)) - (\mu + \varepsilon(v))\right)^2\right] = E\left[\left(\varepsilon(s) - \varepsilon(v)\right)^2\right]$

for all $s, v \in R$, so that by definition the variograms for the $Y$-process and the $\varepsilon$-process
are identical:

(4.8.5)  $\gamma_Y(h) = \gamma_\varepsilon(h), \quad h \geq 0$
Hence, under these assumptions we see that for any given spatial data, $\{y(s_i) : i = 1,\dots,n\}$,
the residual variogram, $\gamma_\varepsilon$, can be estimated directly in terms of the empirical variogram,

(4.8.6)  $\hat{\gamma}_Y(h_k) = \dfrac{1}{2\,|N(h_k)|} \sum_{(s_i, s_j) \in N(h_k)} \left[y(s_i) - y(s_j)\right]^2, \quad k = 1,\dots,\overline{k}$

for the observable $Y$-process. This approach will be illustrated in the following example.


4.9 Example: Nickel Deposits on Vancouver Island

The following example is taken from [BG, pp.150-151] and is based on sample data from
Vancouver Island in British Columbia collected by the Geological Survey of Canada.
This data set [contained in the ARCMAP file (…\projects\nickel\nickel.mxd)], extends
over the area at the northern tip of the island shown in Figure 4.16 below. The area
outlined in red denotes the full extent of the data site. For purposes of this illustration, a
smaller set of 436 sample sites was selected, as shown by the dots in Figure 4.17.

[Maps: the northern tip of Vancouver Island, with the full extent of the data site outlined in red (scale bar: 0–50 km), and the 436 selected sample sites shown as dots.]
Figure 4.16. Vancouver Sample Area        Figure 4.17. Vancouver Sample Area

Note the curvilinear patterns of these sample points. As with many geochemical surveys,
samples are here taken mainly along stream beds and lake shores, where mineral
deposits are more likely to be found. In particular, samples of five different ore types
were collected. The present application will focus on deposits of Nickel ore. [In class
Assignments 3 and 4 you will study deposits of Cobalt and Manganese at slightly
different site selections.] This Nickel data is shown in the enlarged map below, where
Nickel concentration in water samples is measured in parts per million (ppm).

[Map: the 436 sample sites symbolized by Nickel concentration, with legend classes (ppm): 1.00–19.00, 19.01–43.00, 43.01–78.00, 78.01–140.00, 140.01–340.00.]
Figure 4.18. Nickel Data



Since the mapped data exhibits strong similarities between neighboring values (at this
physical scale), we can expect to find a substantial range of spatial dependence in this
data. Notice however that the covariance-stationarity assumption of Isotropy in (3.3.5)
[and (3.3.3)] is much more questionable for this data. Indeed there appear to be diagonal
“waves” of high and low values rippling through the site. An examination of Figure 4.16
above shows that these waves are roughly parallel to the Pacific coastline, and would
seem to reflect the history of continental drift in this region.21 Hence our present
assumption of covariance stationarity is clearly an over-simplification of this spatial data
pattern. We shall see this more clearly in the variogram estimation procedure to follow.

4.9.1 Empirical Variogram Estimation


Given these $n = 436$ sites $(s_i: i = 1,\dots,n)$ together with their corresponding nickel measurements, $y_i = y(s_i)$, our first objective is to construct an empirical variogram for this data as in (4.8.6) above. This procedure is operationalized in the MATLAB program, variogram_plot.m. To use this program, the data from Nickel.mxd has been imported to the MATLAB workspace file, nickel.mat. The 436 x 3 matrix, nickel, contains the coordinate + nickel data $(s_{i1}, s_{i2}, y_i)$ for each location $i = 1,\dots,n$. By opening the program, variogram_plot.m, it can be seen that a matrix of this form is the first required input. Next, recall from Section 4.7.1 that along with this data, there are two inputs for defining an appropriate set of distance bins, namely the maximum bin number, $\bar{k}$, and the maximum distance (max-dist), $\bar{d}$. These parameter options are specified in an opts structure (similar to that in the program clust_sim.m of Section 3.5 in Part I). Here we shall start with the default values, $\bar{k} = 100$ and $\bar{d} = h_{\max}/2 = 48{,}203$ meters, so that there is no need to specify this structure. Hence by typing the simple command:

>> variogram_plot(nickel);

one obtains a plot of the empirical variogram, as shown in Figure 4.19 below.
[Plots omitted: the empirical variogram (left) and the variogram cloud (right) for the nickel data.]

Figure 4.19. Empirical Variogram Figure 4.20. Variogram Cloud

21
In fact these waves are almost mirror images of the Cascadia subduction zone that follows the coastline
immediately to the west of Vancouver Island.

________________________________________________________________________
ESE 502 II.4-21 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part II. Continuous Spatial Data Analysis
______________________________________________________________________________________

Here the point scatter does rise toward a “sill”, as in the classical case illustrated in
Figure 4.8 above. So it appears that one should obtain a reasonable fit using the spherical
model in Figure 4.9 [from expression (4.6.1)]. But before fitting this model, there are a
number of additional observations to be made.

First, for purposes of comparison, the corresponding variogram cloud is plotted in Figure 4.20.22 Notice first that while the horizontal (distance) scales of these two figures are the same, the vertical (squared difference) scales are very different. In order to include the full point scatter in the variogram cloud, the maximum squared-difference value has been increased from 2000 in Figure 4.19 to around 120,000 ($12 \times 10^4$) in Figure 4.20. For visual comparison, the value 2000 is shown by a red arrow in both figures. So while the empirical variogram does indeed look “classical” in nature, it is difficult to draw many inferences about the shape of the true variogram from the wider scatter of points exhibited by the variogram cloud. The reason for this is that while the empirical variogram shows mean estimates of the variogram at $\bar{k} = 100$ selected lag distances, the variogram cloud contains the squared y-differences for each of the 70,687 individual pairs, $(s_i, s_j)$, with $d_{ij} \leq \bar{d}$. Hence about all that can be seen from this “cloud” of points is that there are a considerable number of outliers that are very much larger than the mean values at each distance. But fortunately this pattern of outliers is fairly uniform across the distance spectrum, and hence should not seriously bias the final result in this particular case. On the other hand, if outliers were more concentrated in certain distance ranges (as is often typical for the larger distance values), then this might indicate the need to “trim” some of these outliers before proceeding. In short, while the variogram cloud may provide certain useful diagnostic information, the empirical variogram is usually far more informative in terms of the possible shapes of the true variogram.

Next, it should be noted that in addition to the variogram plot, one obtains the following
screen output
MAXDIST = 48203.698

which is precisely $\bar{d}$ above. To compare this with the max-lag distance, $\bar{h}$, note first that there are a number of optional outputs for this program as well. First, the actual values of the empirical variogram, $\{(h_k, \hat{\gamma}_k): k = 1,\dots,\bar{k}\}$, are contained in the matrix, DAT, where each row contains one $(h_k, \hat{\gamma}_k)$ pair. This can be seen by running the full command,

>> [DAT,maxdist,bin_size,bin_last] = variogram_plot(nickel);

and then clicking on the matrix, DAT, in the workspace to display the empirical variogram. In particular, the value $\bar{h}$ corresponds to the last element of the first column and can be obtained with the command [ >> DAT(end,1) ], yielding $\bar{h} = 47984$. This is smaller than $\bar{d}$ since $\bar{h}$ is somewhere in the middle of the last bin (as in Figure 4.13 above), and $\bar{d}$ is by definition the outer edge, $d_{\bar{k}}$, of this last bin.

22
This was constructed using the MATLAB program, variogram_cloud_plot.m.


As for the additional outputs, maxdist is precisely the screen output above, and the value, bin_size = 707, tells you how many point pairs there are in each bin [as in condition (4.7.5) above]. In this application there are many more than 30 point pairs in each bin, so that the maximum number of bins, $\bar{k} = 100$, is precisely the number realized. However, if the number of sample points had been sufficiently small, then bin_size = 30 would be a binding constraint in (4.7.5), and there could well be fewer than 100 bins.23 Finally, the value, bin_last, is simply a count of the points in the last bin, to check whether it is significantly smaller than the rest. This will only occur if $\bar{d}$ is chosen to be very close to the maximum pairwise distance, $h_{\max}$, and hence will rarely occur in practice.24

As one last observation, recall from the “wave” pattern in Figure 4.18 above that one may ask whether this effect is picked up by the empirical variogram at larger distances. By using the measurement tool in ARCMAP and tracing a diagonal line in the direction of these waves (from lower left to upper right), it appears that a reasonable value of maxdist to try is $\bar{d} = 80{,}000$ meters. To do so, we can run the program with this option as follows:
>> opts.maxdist = 80000;
>> variogram_plot(nickel,opts);

We then obtain the empirical variogram in Figure 4.21b, where the previous variogram
has been repeated in Figure 4.21a for ease of comparison:

[Plots omitted: empirical variograms of the nickel data for the two max-dist values, on a common vertical scale.]

Figure 4.21a. Max Distance = 48,203 Figure 4.21b. Max Distance = 80,000

23
For example, if n = 50, so that the number of distinct point pairs is 50(49)/2 = 1225 < 30(100) = 3000, then there would surely be fewer than 100 bins.
24
For example, if one were to set opts.maxdist = 95000, which is very close to $h_{\max}$ in the present example, then the last bin will indeed have fewer points than the rest.


Notice that while the vertical (squared difference) scales for these two figures are the
same, the horizontal distance scales are now different (reflecting the different maximum
distances specified). Moreover, while the segment of Figure 4.21b up to 50,000
( 5  104 ) meters is qualitatively similar to Figure 4.210a, the bins and corresponding
lag distances are not the same as in Figure 4.21a. Hence it is more convenient to show
separate plots of these two empirical variograms rather than try to superimpose them on
the same scale. Given this scale difference, it is nonetheless clear that the slight dip in the
empirical variogram on the left, starting at about 40,000 meters, becomes much more
pronounced at the larger lag distances shown on the right. Recall (from the corresponding
covariograms) that this can be interpreted to mean that pairs of y-values (nickel
measurements) separated by more than 40,000 meters tend to be more similar (positively
correlated) than those separated by slightly smaller distances. Finally, by again using the
measurement tool in ARCMAP, it can be seen that the spacing of successive waves is
about 40,000 meters. So it does appear that this effect is being reflected in the empirical
variogram.

As a final caveat however, it should be emphasized that the most extreme dip in Figure
4.21b occurs at lag distances close to $h_{\max}$, where variogram estimates tend to be very
unreliable. In addition, there are “edge effects” created by this rectangular sample region
that may add to the unreliability of comparisons at larger distances.

4.9.2 Fitting a Spherical Variogram

Recall from Section 4.6.1 above that all variogram applications in these notes (as well as
the class assignments) will involve fitting spherical variogram models to empirical-
variogram data. [Other models can easily be fitted using the Geostatistical Analyst (GA)
extension in ARCMAP, as illustrated below.] For purposes of the present application, we
shall adhere to the restriction in (4.7.4) that $\bar{d}$ not exceed $h_{\max}/2$, and hence shall use only the empirical variogram in Figure 4.19 (and 4.21a) constructed under this condition. To fit a spherical variogram model to this empirical-variogram data, we shall use the simple nonlinear least-squares procedure in (4.7.8) above.

Fitting Procedure using MATLAB


This is operationalized in the MATLAB program, var_spher_plot.m.25 Since this
program uses exactly the same inputs as those detailed for variogram_plot.m in Section
4.9.1 above, there is no need for further discussion of inputs. Hence a spherical variogram
model can be fitted in the present application with the command:
>> var_spher_plot(nickel);
The first output of this fitting procedure is the spherical variogram plot shown in Figure
4.22 below, where the blue dots are the empirical variogram points, and the estimated

25
One can also use the weighted nonlinear least-squares procedure in (4.8.9) and (4.8.10) above, which is
programmed in var_spher_wtd_plot.m.


spherical variogram is shown in red. If you click Enter again you will see the associated
covariogram plot, as shown in Figure 4.23 below.
[Plots omitted: VARIOGRAM PLOT (left) and COVARIOGRAM PLOT (right) produced by var_spher_plot.m.]

4.22 Fitted Spherical Variogram 4.23 Derived Spherical Covariogram

Here it must be emphasized that this covariogram is not being directly estimated. Rather, the estimates $(\hat{r}, \hat{s}, \hat{a})$ obtained for the spherical variogram are substituted into (4.6.2) in order to obtain the corresponding covariogram. Hence it is more properly designated as the derived spherical covariogram. Similarly, the blue dots shown in this figure are simply an inverted reflection of the empirical variogram shown in Figure 4.22. However, they can indeed be similarly interpreted as the derived empirical covariogram corresponding to the empirical variogram in Figure 4.22. To do so, recall first from (4.1.7) that for all distances, $h$, it must be true that $C(h) = \sigma^2 - \gamma(h)$. But since each empirical variogram point $(h_k, \hat{\gamma}_k)$ by definition yields an estimate of $\gamma(h_k)$, namely $\hat{\gamma}_k = \hat{\gamma}(h_k)$, and since the sill value, $\hat{s}$, is by definition an estimate of $\sigma^2$, i.e., $\hat{s} = \hat{\sigma}^2$, it is natural to use (4.1.7) to estimate the covariogram at distance $h_k$ by

(4.9.1)    $\hat{C}(h_k) \;=\; \hat{\sigma}^2 - \hat{\gamma}(h_k) \;=\; \hat{s} - \hat{\gamma}_k$

Hence by letting $\hat{C}_k = \hat{C}(h_k)$, it follows that the set of points, $\{(h_k, \hat{C}_k): k = 1,\dots,\bar{k}\}$, obtained is precisely the derived empirical covariogram in Figure 4.23 corresponding to the empirical variogram, $\{(h_k, \hat{\gamma}_k): k = 1,\dots,\bar{k}\}$, in Figure 4.22.26
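For concreteness, the shift in (4.9.1) takes only a couple of lines of MATLAB. The fragment below is only an illustration (not part of the course programs); it assumes DAT has columns $[h_k, \hat{\gamma}_k]$ as described above and uses the sill estimate reported in Figure 4.24 below.

% Sketch of (4.9.1): shift each empirical variogram value by the estimated sill
% to obtain the derived empirical covariogram points.
s_hat = 1554.658;               % estimated sill (from Figure 4.24 below)
C_hat = s_hat - DAT(:,2);       % C_hat_k = s_hat - gamma_hat_k
plot(DAT(:,1), C_hat, 'b.');    % derived empirical covariogram points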

As mentioned earlier, the advantage of displaying this derived covariogram is that it is


much easier to interpret than the estimated variogram. To do so, we begin by noting that
in addition to these two diagrams, the program var_spher_plot.m also yields a screen

26
In particular, the vertical component, $\hat{\gamma}_k$, of each variogram point $(h_k, \hat{\gamma}_k)$ has simply been shifted to the new value, $\hat{C}_k = \hat{s} - \hat{\gamma}_k$.


display of the parameter estimates $(\hat{r}, \hat{s}, \hat{a})$ [along with maxdist, $\bar{d}$, and the number of iterations in the optimization procedure27], as shown in Figure 4.24 below.

SPHERICAL VARIOGRAM:

RANGE 17769.160
SILL 1554.658
NUGGET 618.044
MAXDIST 48203.698

ITERATIONS = 126

Figure 4.24. Parameter Estimates

In particular, the RANGE ($\hat{r} = 17769.160$ meters) denotes the distance beyond which there is estimated to be no statistical correlation between nickel values.28 In Figure 4.22, this corresponds to the distance at which the variogram first “reaches the sill”. But this offers little in the way of statistical intuition. In Figure 4.23 on the other hand, it is clear that this is the distance at which covariance (and hence correlation) first falls to zero. This is the key difference between these two representations. Notice also that the vertical axis in Figure 4.23 has been shifted relative to Figure 4.22, in order to depict the negative covariance values in the cluster of values around the zero line.

Turning to the other estimated parameters, note first from Figure 4.23 that the SILL ($\hat{s} = 1554.658$) is seen to be precisely the estimated variance of individual nickel values (i.e., the estimated covariance at “zero distance”). Similarly, the NUGGET ($\hat{a} = 618.044$) is seen to be that part of the individual variance that is not related to spatial dependence among neighbors. Since in this case the relative nugget effect, 0.398 (= 618.044/1554.658), is well below 0.5, it is evident that there is a substantial degree of local spatial dependence among nickel values. So in summary, it should be clear that while the variogram model is useful for obtaining these parameter estimates, $(\hat{r}, \hat{s}, \hat{a})$, the derived covariogram model is far more useful for interpreting them.
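To see the mechanics of this fit, the following MATLAB fragment sketches a least-squares fit of the kind described in (4.7.8), assuming the standard spherical form (nugget $a$, sill $s$, range $r$). It is not the course program var_spher_plot.m: the function name is hypothetical, the optimization is unconstrained, and the starting values are crude.

% Hedged sketch of a nonlinear least-squares spherical variogram fit, as in (4.7.8).
% Assumes the standard spherical form; not the course program var_spher_plot.m.
function theta = fit_spher_sketch(h, g)
% h, g : column vectors of lag distances and empirical variogram values
sph = @(t,h) (h <= t(1)) .* (t(3) + (t(2) - t(3)) .* ...
             (1.5*(h/t(1)) - 0.5*(h/t(1)).^3)) ...
           + (h >  t(1)) .* t(2);                 % t = [range, sill, nugget]
sse = @(t) sum((g - sph(t,h)).^2);                % least-squares criterion
t0  = [max(h)/2, max(g), min(g)];                 % crude starting values
theta = fminsearch(sse, t0);                      % returns [r_hat, s_hat, a_hat]
end

With the DAT output of variogram_plot.m, this could be called as, say, theta = fit_spher_sketch(DAT(:,1), DAT(:,2)), and should yield parameter values in the general vicinity of those in Figure 4.24 above.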

Fitting Procedure using ARCMAP


Before proceeding, it is of interest to compare this estimated spherical variogram with the
fitting procedure used in the Geostatistical Analyst (GA) extension in ARCMAP
(Version 10). The results of this procedure applied to the nickel data in the ARCMAP
file, nickel.mxd, are shown in Figure 4.25 below.

27
Note that if ITERATIONS exceeds 600, you will get an error message telling you that the algorithm
failed to converge in 600 iterations (which is the default maximum number of iterations allowed).
28
Notice also that this RANGE value is considerably below the MAXDIST (48203.698 meters), indicating
that the range of spatial dependence among nickel values is well captured by this empirical variogram.


Figure 4.25. Spherical Variogram Fit in Geostatistical Analyst

Note first that the title of this window is “Semivariogram” rather than “Variogram” (as
discussed at the end of Section 4.1 above). Since the full details of this variogram fitting
procedure are given in Assignment 3, it suffices here to concentrate on the estimated
parameter values. However, it is important to point out one aspect of this procedure that
is crucial for parameter estimation. Recall from the discussion of Figure 4.13 that one
must define appropriate bins for the empirical variogram. Since the “default” option for
bin definitions in GA is rather complex compared to ver_spher_plot.m, it is most
convenient to define the bin sizes in GA manually in order to make them (roughly)
comparable to those in ver_spher_plot. To do so, recall from Figure 4.24 the MAXDIST
value is close to 48000 meters. So by setting the number of lags to 12 and choosing a
constant bin size of 4000 meters (as seen in the Lag window in the lower right of Figure
4.25), we will obtain a maximum distance of exactly 48000 meters (as seen on the
distance axis of the variogram plot). Note also that in the Model #1 window we have
chosen Type = “Spherical”, indicating that a spherical variogram is to be fitted.

The fitted spherical variogram is shown by the blue curve in the figure, and the empirical
variogram is shown by red dots. Note that while the number of lags (12) is considerably
smaller than the number of bins (100) used in var_spher_plot, there actually appear to be


more red dots here than there are blue dots in Figure 4.22 above. The reason for this can
be seen by considering the circular pattern of squares in the lower left corner of the
figure. Starting from the center and moving to the right, one can count 12 squares, which
denote the 12 lag distances. Hence as the figure shows, point pairs are here distinguished
not only by the length of the line between them (distance) but also by the direction of this
line (angle). Each square thus defines a “bin” of point pairs with similar separation
distances and angles. So the number of bins here is much larger than 12.29 While these
directional distinctions are important for fitting anisotropic variogram models in which
the isotropy assumption of covariance stationarity is relaxed, we shall not explore such
models in these notes.30 Hence, under our present isotropy assumption, the appropriate
empirical variogram in GA is constructed by using each of these squares as a separate bin
with “lag distance” equal to the average distance between point pairs with distance-angle
combinations in that square.

Next observe that in addition to the different binning conventions, the actual estimation
procedure used in GA is more complex than the simple least-squares procedure used in
var_spher_plot [and is essentially an elaboration of the weighted least-squares approach of Cressie shown in expressions (4.7.9) and (4.7.10) above]. So it should be clear that
the resulting spherical model estimate will not be the same as in Figure 4.22 above. In
particular, the estimated range and nugget in this case are given, respectively, by “Major
Range” (= 17806.86) and “Nugget” (= 617.32). However, the “Sill” is here replaced by
“Partial Sill” (= 943.17). Hence, recalling from the discussion at the end of Section 4.6.2
that “Sill = Partial Sill + Nugget”, it follows that the corresponding sill is here given by
1560.5 (= 943.17 + 617.32). A comparison of the parameter estimates using both
MATLAB and GA in this example (Figure 4.26 below) show that in spite of the
differences above, they are qualitatively very similar.

MATLAB GA
Range 17769.2 17806.9
Sill 1554.7 1552.6
Nugget 618.0 617.3

Figure 4.26. Parameter Estimates

29
In this example, the number of bins is given approximately by $\pi \cdot 12^2 \approx 452$. However, the number of red dots is actually half the number of bins, since each bin has a “twin” in the opposite direction. Hence the number of red dots in this case is given approximately by 226, which is still much larger than 100.
30
For a detailed discussion of such anisotropic models see Waller and Gotway (2004, Section 2.8.5).


4.10 Variograms versus Covariograms

Before applying these methods to analyze spatially dependent data, it is appropriate to


return to the question of why variograms are preferable to covariograms in terms of
estimation. To do so, we start by showing that for any spatial stochastic process, $\{Y(s): s \in R\}$, satisfying the covariance stationarity condition (3.3.7) above, the “standard” sample estimator of covariance is biased.

4.10.1 Biasedness of the Standard Covariance Estimator

First recall from expression (3.3.7) that for any distance $h \geq 0$ the covariogram value, $C(h)$, is by definition

(4.10.1)    $C(h) = \mathrm{cov}[\,Y(s_1), Y(s_2)\,]$

for any $s_1, s_2 \in R$ with $\|s_1 - s_2\| = h$. Hence suppose for the sake of simplicity that we are able to draw $n$ sample pairs, $[\,y(s_{1i}), y(s_{2i})\,] = (y_{1i}, y_{2i})$, from this process with $\|s_{1i} - s_{2i}\| = h$ holding exactly for all $i = 1,\dots,n$. In this context, the standard sample estimator for the covariance value in (4.10.1) is given by

(4.10.2)    $\hat{C}(h) = \dfrac{1}{n-1}\sum_{i=1}^{n} (y_{1i} - \bar{y}_1)(y_{2i} - \bar{y}_2)$

with sample means denoted by $\bar{y}_j = (1/n)\sum_{i=1}^{n} y_{ji}$, $j = 1, 2$. Here division by $n - 1$ (rather than the seemingly more natural choice of division by $n$) ensures that if these sample pairs [$(y_{1i}, y_{2i})$, $i = 1,\dots,n$] were independent draws from jointly distributed random variables $(Y_1, Y_2)$ with covariance given by (4.10.1), then $\hat{C}(h)$ in (4.10.2) would be an unbiased estimator of $C(h)$. However, if these pairs are not independent, then it is shown in Appendix A2.2 that the actual expectation of $\hat{C}(h)$ is given by

(4.10.3)    $E[\hat{C}(h)] \;=\; C(h) \;-\; \dfrac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j \neq i} \mathrm{cov}(Y_{1i}, Y_{2j})$

Notice first that if these sample pairs were independent, then by definition each covariance, $\mathrm{cov}(Y_{1i}, Y_{2j})$, with $i \neq j$ must be zero, so that (4.10.3) would reduce to $E[\hat{C}(h)] = C(h)$, and $\hat{C}(h)$ would indeed be an unbiased estimator. But for the more classical case of nonnegative spatial dependencies, all covariances in the second term of (4.10.3) must either be positive or zero. Hence for this classical case it is clear that there will in general be a considerable downward bias in this estimator. Moreover, without prior knowledge of the exact nature of such dependencies, it is difficult to correct this bias in any simple way. It is precisely this difficulty that motivates the need for alternative approaches to modeling spatial dependencies.
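To make this downward bias concrete, the following small Monte Carlo sketch (hypothetical data, purely for illustration) constructs pairs that all share a common random component, so that $\mathrm{cov}(Y_{1i}, Y_{2j}) = 1$ for every $i$ and $j$. In this extreme case (4.10.3) implies $E[\hat{C}(h)] \approx 0$, even though the true covariance is $C(h) = 1$.

% Illustrative Monte Carlo sketch of the bias in (4.10.3). All pairs share a
% common component Z, so cov(Y1i, Y2j) = 1 for every i, j (including i ~= j).
rng(1);
n = 50;  reps = 10000;  Chat = zeros(reps,1);
for r = 1:reps
    Z  = randn;                   % common random effect shared by all pairs
    y1 = Z + randn(n,1);          % true covariance C(h) = cov(Y1i, Y2i) = 1
    y2 = Z + randn(n,1);
    c  = cov(y1, y2);             % standard sample covariance matrix
    Chat(r) = c(1,2);             % the estimator in (4.10.2)
end
mean(Chat)                        % approximately 0: severe downward bias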


4.10.2 Unbiasedness of Empirical Variograms for Exact-Distance Samples

To motivate the use of variograms for modeling spatial dependencies, we begin by recalling from (4.1.7) that the covariogram, $C(h)$, is entirely determined by the variogram, $\gamma(h)$, together with the (nonspatial) variance parameter, $\sigma^2$. Hence if the empirical variogram, $\hat{\gamma}(h)$, can be shown to yield an unbiased estimate of $\gamma(h)$, then this will surely offer a better approach to capturing spatial dependencies.

There is one case in which this is possible, namely when there exist multiple pairs, $[\,Y(s_{1i}), Y(s_{2i}): i = 1,\dots,n_h\,]$, each separated by the same distance $h$, i.e., satisfying the condition that $\|s_{1i} - s_{2i}\| = h$ for all $i = 1,\dots,n_h$. In particular, if spatial samples form a regular lattice, as illustrated by the small set of red dots in Figure 4.27 below, then there will generally be a set of representative distances for which this is true. In particular, the symmetry of such lattices implies that distance values such as $h_1$, $h_2$, and $h_3$ in the figure will occur for many different point pairs.

[Diagram omitted: a regular lattice of sample points (red dots) with example separation distances h1, h2, h3.]

Figure 4.27. Regular Lattice of Sample Points

More generally, whenever there exists a representative range of distinct distance values, $\{h_k: k = 1,\dots,\bar{k}\}$, at which a substantial set of exact-distance pairs,

(4.10.4)    $N_k = \{(s_1, s_2): \|s_1 - s_2\| = h_k\}$

can be sampled at each $h_k$, then the associated empirical variogram, $\{(h_k, \hat{\gamma}_k): k = 1,\dots,\bar{k}\}$, in (4.7.7) will indeed provide a meaningful unbiased estimate of the true variogram, $\gamma(h_k)$, at each of these distance values.31 To see this, it is enough to recall from (4.1.5) that $E\{[\,Y(s_1) - Y(s_2)\,]^2\} = 2\gamma(h_k)$ for all $(s_1, s_2) \in N_k$, and hence that

31
Here the qualifier “meaningful” is meant to distinguish this estimator from one in which there is no possibility of eventually accumulating a large set of sample pairs, $N_k$, for each $h_k$.


(4.10.5)    $E[\hat{\gamma}(h_k)] \;=\; E\left[\dfrac{1}{2|N_k|}\sum_{(s_1,s_2)\in N_k} [\,Y(s_1) - Y(s_2)\,]^2\right]$

$\qquad\qquad\quad\;=\; \dfrac{1}{2|N_k|}\sum_{(s_1,s_2)\in N_k} E\{[\,Y(s_1) - Y(s_2)\,]^2\}$

$\qquad\qquad\quad\;=\; \dfrac{1}{2|N_k|}\sum_{(s_1,s_2)\in N_k} 2\gamma(h_k) \;=\; \dfrac{2|N_k|}{2|N_k|}\,\gamma(h_k) \;=\; \gamma(h_k)$

So regardless of the size of each exact-distance set, $N_k$, this empirical variogram will always yield an unbiased estimate of the true variogram, $\gamma(h_k)$, at each distance $k = 1,\dots,\bar{k}$. Hence if in addition it is true that each of these sets is sufficiently large, say $|N_k| \geq 30$, then this empirical variogram should provide a reliable estimate of the true variogram.
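The exact-distance construction can be sketched in a few lines of MATLAB. In the fragment below, the lattice and the (iid, hypothetical) y-values are purely illustrative; the point is only that the exact-distance set $N_k$ can be collected without any binning.

% Sketch of an exact-distance empirical variogram on a regular lattice.
% The y-values here are hypothetical iid noise, purely for illustration.
[gx, gy] = meshgrid(1:10, 1:6);            % a 10 x 6 unit-spaced lattice
s = [gx(:), gy(:)];  n = size(s,1);
y = randn(n,1);                            % placeholder observations
[I, J] = find(triu(true(n),1));            % all distinct site pairs
d  = sqrt(sum((s(I,:) - s(J,:)).^2, 2));
sq = (y(I) - y(J)).^2;
h1 = 1;                                    % an exact lattice distance, h_1 = 1
in = abs(d - h1) < 1e-10;                  % the exact-distance set N_1
gamma_hat = sum(sq(in)) / (2*sum(in))      % estimate of gamma(h_1), as in (4.10.5)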

Finally, it should be noted that if one is able to choose the pattern of samples to use in studying a given spatial stochastic process, $(Y(s): s \in R)$, then such regular lattices have the practical advantage of providing a uniform coverage of region $R$. This is particularly
desirable for interpolating unobserved values in R (as discussed in detail in Part 6
below). It is for this reason that much attention is focused on regular lattice samples of
such processes [as for example in Cressie (1993, p.69) and Waller and Gotway (2004,
p.281)].32

4.10.3 Approximate Unbiasedness of General Empirical Variograms

For the general case of irregular samples, where exact-distance sets rarely contain more
than one observation, it is necessary to rely on the binning procedure developed in
Section 4.7.1 above. The “Nickel” example in Section 2.4 above provides a good
illustration of such a case where regular sample patterns are impractical if not impossible.
In this more typical setting, it is difficult to find much discussion in the literature about
the bias of empirical variogram estimates created by binning.33

However, it is not difficult to show that if the true variogram is reasonably smooth, then
one can at least bound the bias in a rather simple way. In particular, if by “smooth” we

32
It should be mentioned again that these references define empirical variograms with respect to the more
general notion of stationarity mentioned in footnote 6 of Section 3.2 above. So the exact-distance sets used
here are replaced by “exact-difference sets”.
33
One noteworthy exception is the interesting analysis of “clustered” sampling schemes by Reilly and
Gelman (2007).


mean that the variogram, $\gamma(h)$, is locally linear in the sense that its values are well approximated by linear functions on sufficiently small intervals, then one can bound the bias of the general empirical variogram in (4.7.7) in terms of these linear approximations. To be more specific, suppose that the true variogram is given by the red curve in Figure 4.28 below, with the set of bins chosen for estimating this (unknown) function shown schematically along the distance axis [where by definition each bin, $k = 1,\dots,\bar{k}$, is defined by the interval of separation distances, $d_{k-1} \leq h < d_k$ (with $d_0 = 0$)].

[Diagrams omitted: the variogram curve partitioned into distance bins $0 = d_0 < d_1 < \dots < d_{\bar{k}}$ (left), and the local linear approximation $l_k(h)$ on a single bin (right).]

Figure 4.28 Bins for Variogram Estimation Figure 4.29. Local Linear Approximation

Here the variogram, $\gamma(h)$, illustrated is assumed to be an instance of the “combined model” in Figure 4.8 above. In addition, it is assumed that $\gamma(h)$ is sufficiently smooth to allow the section of the curve on each bin to be roughly approximated by a linear function. This is illustrated for a typical bin interval, $[d_{k-1}, d_k)$, by the solid blue line in Figure 4.29. This linear approximation function, denoted by

(4.10.6)    $l_k(h) = a_k\, h + b_k$

(with slope, $a_k$, and intercept, $b_k$) has been implicitly chosen to minimize the maximum deviation, $|\gamma(h) - l_k(h)|$, over the interval $d_{k-1} \leq h < d_k$. If this maximum deviation is denoted by $\epsilon_k$, then the variogram, $\gamma(h)$, is said to have an $\epsilon_k$-linear approximation on bin $k$. With these definitions, it is shown in Appendix A2.3 that in terms of this $\epsilon_k$-linear approximation, the maximum bias in the empirical variogram estimate of $\gamma(h_k)$ can never exceed $2\epsilon_k$, i.e.,

(4.10.7)    $\left|\,E[\hat{\gamma}(h_k)] - \gamma(h_k)\,\right| \;\leq\; 2\epsilon_k$


Of course one cannot know the value of $\epsilon_k$ without knowing the true variogram itself. So the bound in (4.10.7) is simply a qualitative result showing that if $\gamma(h)$ is assumed to be sufficiently smooth to ensure that the maximum deviation, $\epsilon = \max\{\epsilon_k: k = 1,\dots,\bar{k}\}$, for the given bin partition is “small”, then the bias in the empirical variogram, $\{(h_k, \hat{\gamma}_k): k = 1,\dots,\bar{k}\}$, will also be “small”. In other words, for variograms with good “piece-wise linear approximations” on the given set of bins, empirical variogram estimates can be expected to exhibit only minimal bias.


5. Spatial Interpolation Models


Given the above model of stationary random spatial effects, $\{\varepsilon(s): s \in R\}$, our ultimate objective is to apply these concepts to spatial models involving global trends, $\mu(s)$, i.e., to spatial stochastic models of the form, $Y(s) = \mu(s) + \varepsilon(s)$, $s \in R$. In continuous spatial data analysis, the most fully developed models of this type focus on spatial prediction, where values of spatial variables observed at certain locations are used to predict values at other locations. But it is important to emphasize here that many such models are in fact completely deterministic in nature [i.e., implicitly assume that $\varepsilon(s) \equiv 0$]. Such models are typically referred to as spatial interpolation (or smoothing) models [so we reserve the term spatial prediction for stochastic models of this type, as discussed later]. Indeed the Inverse Distance Weighting (IDW) model used for the Sudan Rainfall example in Section 2.1 above is an interpolation model. Moreover, a variety of other such models are in common use, and indeed, are also available in ARCMAP. So before developing the spatial prediction models that are of central interest for our purposes, it is appropriate to begin with selected examples of these interpolation models. In Section 6 below, we shall then consider the simplest types of spatial prediction models in which the global trend is constant, i.e., with $\mu(s) = \mu$ for all $s \in R$. This will be followed in Section 7 with a development of more general prediction models in which the global trend, $\mu(s)$, is allowed to vary over space, and takes on a more important role.

5.1 A Simple Example of Spatial Interpolation


The basic idea of spatial interpolation is well illustrated by the “elevation” example
shown in Figure 5.1 below (taken from the ESRI Desktop Help documentation)

Figure 5.1. Interpolating Elevations


Here it is assumed that elevations, $y(s)$, have been measured at a set of spatial locations $\{s_i: i = 1,\dots,n\}$ in some relevant region, $R$, as shown by the dots outlined in white. Given these measurements, one would like to estimate the elevation, $y(s_0)$, at some new location $s_0 \in R$, shown in the figure (outlined in black). Given the typical continuity properties of elevation, it is clear that those measurement locations closest to $s_0$ are the most relevant ones for estimating $y(s_0)$, as illustrated by the red dots lying in the neighborhood of $s_0$ denoted by the yellow circle. While it is not obvious how large this neighborhood should be, let us suppose for the moment that it has somehow been determined (we return to this question in Section 6.4 below). Then the question is how to use this set of five elevations at locations, say $s_1,\dots,s_5$, to estimate $y(s_0)$. These locations are displayed in more detail in Figure 5.2 below, where $d_{0i} = \|s_0 - s_i\|$ denotes the distance from $s_0$ to each point $s_i$, $i = 1,\dots,5$.

[Diagram omitted: the five neighboring sites s1,…,s5 of s0, with distances d01,…,d05.]

Figure 5.2. Neighborhood of Point s 0

5.2 Kernel Smoothing Models


Intuitively, those points closer to $s_0$ should have more influence in this estimate. For example, it is seen in the figure that point $s_3$ is considerably closer to $s_0$ than is point $s_4$. So it is reasonable to assume that $y(s_3)$ is more influential in the estimation of $y(s_0)$ than is $y(s_4)$. Hence if we now designate the set of points used for estimation at $s_0$ as the interpolation set, $S(s_0)$, [so that in the example above, $S(s_0) = \{s_1,\dots,s_5\}$] then it is natural to consider estimates, $\hat{y}(s_0)$, of the form,

(5.2.1)    $\hat{y}(s_0) \;=\; \dfrac{\sum_{s_i \in S(s_0)} w(d_{0i})\, y(s_i)}{\sum_{s_j \in S(s_0)} w(d_{0j})} \;=\; \sum_{s_i \in S(s_0)} \left[\dfrac{w(d_{0i})}{\sum_{s_j \in S(s_0)} w(d_{0j})}\right] y(s_i)$

where the weight function, $w(d)$, is a positive decreasing function of distance, $d$.


Interpolation models of this type are often referred to as kernel smoothers. The reason for the ratio form is that the effective weights on each $y(s_i)$ value [as defined by the bracketed expression in (5.2.1)] must then sum to one, i.e.,

(5.2.2)    $\sum_{s_i \in S(s_0)} \left[\dfrac{w(d_{0i})}{\sum_{s_j \in S(s_0)} w(d_{0j})}\right] \;=\; \dfrac{\sum_{s_i \in S(s_0)} w(d_{0i})}{\sum_{s_j \in S(s_0)} w(d_{0j})} \;=\; 1$

and are thus interpretable as the “fractional contribution” of each $y(s_i)$ to the estimate, $\hat{y}(s_0)$. Thus points closer to $s_0$ in $S(s_0)$ will have higher fractional contributions to $\hat{y}(s_0)$, since for all $s_i, s_j \in S(s_0)$,

(5.2.3)    $d_{0i} < d_{0j} \;\Rightarrow\; w(d_{0i}) > w(d_{0j}) \;\Rightarrow\; \dfrac{w(d_{0i})}{\sum_{s_k \in S(s_0)} w(d_{0k})} \;>\; \dfrac{w(d_{0j})}{\sum_{s_k \in S(s_0)} w(d_{0k})}$

We have already seen an example of a kernel smoother, namely the inverse distance weighting (IDW) smoother in Section 2.1 above. In this case, the weight function is a simple inverse power function of the form,

(5.2.4)    $w(d) = d^{-\alpha}$

where $\alpha$ is a positive constant (typically $\alpha = 1$ or $\alpha = 2$). While this is the only kernel smoother available in ARCMAP, it is worthwhile mentioning one other, namely the exponential smoother, in which the weights are given by a negative exponential function of the form

(5.2.5)    $w(d) = e^{-\theta d}$

for some positive constant, $\theta > 0$. To compare these two smoothers, it is instructive to plot typical values of these weight functions. In Figure 5.3 below, an inverse power function with $\alpha = 2$ (shown in blue) is compared to a negative exponential function with $\theta = 1$ (shown in red). As seen in the figure, the most important difference between these functions is near the origin, where $d \to 0$ implies that $e^{-\theta d} \to e^0 = 1$, but where $d^{-\alpha} \to \infty$. So for inverse power smoothers (like IDW), one necessarily obtains very “peaked” interpolation surfaces near data points, where the effective weight of that data point approaches one and totally dominates all other data points. (This is precisely the reason for the “rainfall peaks” around data points seen for Sudan in Figure 2.2.)

________________________________________________________________________
ESE 502 II.5-3 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part II. Continuous Spatial Data Analysis
______________________________________________________________________________________

[Plot omitted: the weight functions $d^{-2}$ and $e^{-d}$ plotted against distance, d.]
Figure 5.3. Comparison of Exponential and IDW Smoothers

In this respect, exponential smoothers may be preferred. However, if one requires exact interpolation at data points [i.e., $\hat{y}(s_i) = y(s_i)$] then this is only possible if $w(d) \to \infty$ as $d \to 0$. So kernel smoothers like the exponential yield results that are actually smoother than the data itself. In summary, it is important to be aware of such differences between possible kernel smoothers, and to employ one that is most suitable for the purpose at hand.
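A minimal MATLAB sketch of the kernel smoother (5.2.1) at a single location is given below. The function name and input layout are assumed for illustration only; the weight function is passed in as a handle, so the same fragment covers both the inverse-power weights (5.2.4) and the exponential weights (5.2.5).

% Sketch of the kernel smoother (5.2.1) at a single location s0.
function yhat = kernel_smooth_sketch(s0, S, y, wfun)
% s0   : 1 x 2 target location
% S    : m x 2 coordinates of the interpolation set S(s0)
% y    : m x 1 observed values at those sites
% wfun : weight function handle, e.g. @(d) d.^(-2)  or  @(d) exp(-d)
d    = sqrt(sum((S - repmat(s0, size(S,1), 1)).^2, 2));   % distances d_0i
w    = wfun(d);                                           % weights w(d_0i)
yhat = sum(w .* y) / sum(w);     % effective weights sum to one, as in (5.2.2)
end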

5.3 Local Polynomial Models

A second type of spatial interpolator available in the Geostatistical Analyst (GA) extension of ARCMAP is a local polynomial interpolator. Here it is assumed explicitly that the value, $y(s_0)$, lies on the same (smooth) surface as the observed values, $\{y(s_i): i = 1,\dots,n\}$, and that the local curvature of this surface is well approximated by polynomials of a given order. The simplest polynomial (of order one) is a linear function. So here the value, $y(s_0)$, is estimated by finding the linear function which best fits the y-values on the coordinate values of the points in the interpolation set, $S(s_0) = \{s_1,\dots,s_n\}$. More specifically, if the y-values of these points are denoted by $(y_i: i = 1,\dots,n)$ and their coordinate values by $[(s_{i1}, s_{i2}): i = 1,\dots,n]$, then estimation is done by the same least squares procedure as in linear regression, i.e., by finding the beta estimates, $(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2)$, that minimize the sum-of-squared deviations:1

1
It should be emphasized that while this estimation procedure is the same as in regression, there is no
appeal to a linear statistical model, and in particular, no random error model.



(5.3.1)    $\sum_{i=1}^{n} [\,y_i - (\beta_0 + \beta_1 s_{i1} + \beta_2 s_{i2})\,]^2$

Using these estimates, the interpolated value of $y(s_0)$ at point $s_0 = (s_{01}, s_{02})$ is given by

(5.3.2)    $\hat{y}(s_0) = \hat{\beta}_0 + \hat{\beta}_1 s_{01} + \hat{\beta}_2 s_{02}$

A one-dimensional illustration of local linear interpolation is shown in Figure 5.4 below, where in this case it is assumed that the interpolation set, $S(s_0) = \{s_1, s_2, s_3, s_4\} \subset \mathbb{R}$, is given by the four points shown in red (where $s_i$ is the coordinate of point $i$ on the line, $\mathbb{R}$). The dashed red line is a plot of the linear interpolation function obtained from these four data points, so that the interpolated value, $\hat{y}(s_0) = \hat{\beta}_0 + \hat{\beta}_1 s_0$, is shown by the white dot in the figure.



   
   


[Diagram omitted: four data points s1, s2, s3, s4 on the line, with the dashed local linear fit and the interpolated value at s0 shown as a white dot.]

Figure 5.4. Local Linear Interpolation

More generally, this defines a single point on the interpolation function defined for all points, as shown schematically by the solid curve in the figure.2 In practice this function would not be so smooth, and in fact would not even be continuous. In particular, there would be jumps in this function at each location, $s_0$, where a data point, $s_i$, either enters or leaves the current interpolation set, $S(s_0)$. Such discontinuities can be removed by fixing the diameter (bandwidth) of interpolation sets, say

(5.3.3)    $S(s_0) = \{s_i: \|s_0 - s_i\| \leq d_0\}$

2
A particularly good discussion of this local linear polynomial case is given in ESRI Desktop Help at
http://webhelp.esri.com/arcgisdesktop/9.2/index.cfm?TopicName=How_Local_Polynomial_interpolation_works .


and introducing a kernel smoothing weight function, $w$, similar to (5.2.1), which falls to zero at distance, $d_0$. One then modifies the local least squares in (5.3.1) to a local weighted least squares of the form,3

(5.3.4)    $\sum_{i=1}^{n} w_{0i}\, [\,y_i - (\beta_0 + \beta_1 s_{i1} + \beta_2 s_{i2})\,]^2$

where $w_{0i} = w(\|s_0 - s_i\|)$. Hence points in $S(s_0)$ at greater distances from $s_0$ will have less weight in the interpolation (and will have no weight at distance, $d_0$). This implies in particular that as $s_0$ moves along the axis in Figure 5.4, data points entering or leaving the interpolation set will initially have zero weight, thus preserving continuity of the interpolation function. An actual example of such a locally weighted linear interpolation function is shown (in red) on the left in Figure 5.5 below.
function is shown (in red) on the left in Figure 5.5 below.

[Plots omitted: a one-dimensional locally weighted linear interpolation of sample data (left) and the tri-cube weight function w(d) with bandwidth d0 (right).]

Figure 5.5 One-Dimensional Geographic Weighted Regression

Here the kernel smoothing function used is the popular “tri-cube” function, which has the mathematical form:

(5.3.5)    $w(d) = \begin{cases} [\,1 - (d/d_0)^3\,]^3\, , & d \leq d_0 \\ 0\, , & d > d_0 \end{cases}$

where in this case a bandwidth of $d_0 = 1$ was used. The shape of this function is shown on the right in Figure 5.5, where the distance scale has been increased for visual clarity.
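The locally weighted fit in (5.3.4) with tri-cube weights can be sketched in MATLAB as follows. This is only an illustration under the stated assumptions (the function name is hypothetical, and at least three sites with positive weight are assumed so that the weighted least-squares system is solvable).

% Sketch of the locally weighted linear interpolator (5.3.2)/(5.3.4) with
% tri-cube weights (5.3.5).
function yhat = local_linear_sketch(s0, S, y, d0)
% s0 : 1 x 2 target location;  S : m x 2 site coordinates;  y : m x 1 values
% d0 : bandwidth of the interpolation set (5.3.3)
d = sqrt(sum((S - repmat(s0, size(S,1), 1)).^2, 2));   % distances to s0
w = ((1 - (d/d0).^3).^3) .* (d <= d0);                 % tri-cube weights (5.3.5)
X = [ones(size(S,1),1), S];                            % columns [1, s_i1, s_i2]
W = diag(w);
b = (X' * W * X) \ (X' * W * y);                       % weighted least squares (5.3.4)
yhat = [1, s0] * b;                                    % interpolated value (5.3.2)
end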

This type of linear interpolation is formally related to Geographic Weighted Regression


(GWR), which is also available in the ArcToolBox of ARCMAP [Spatial Statistics
Tools → Modeling Spatial Relationships → GWR ]. Since GWR allows the use of
explanatory variables other than coordinate locations, and also includes stochastic
random effects, it is much more than a simple interpolation tool. But the interpolation
3
This is essentially a linear version of the nonlinear weighted least squares in expression (4.7.9) of the
Variograms section.


example above nonetheless serves to illustrate the basic mechanism of locally weighted least squares used in GWR.4

Finally, it should be mentioned that local polynomial interpolation can of course involve higher-order polynomials. As one illustration, suppose we consider local polynomial interpolation with second-order (quadratic) polynomials. Then the sum-of-squared deviations in (5.3.1) would now be replaced by the quadratic version,

(5.3.6)    $\sum_{i=1}^{n} [\,y_i - (\beta_0 + \beta_1 s_{i1} + \beta_2 s_{i2} + \beta_3 s_{i1}^2 + \beta_4 s_{i2}^2 + \beta_5 s_{i1} s_{i2})\,]^2$

and would in turn yield local quadratic interpolations of the form:

(5.3.7)    $\hat{y}(s_0) = \hat{\beta}_0 + \hat{\beta}_1 s_{01} + \hat{\beta}_2 s_{02} + \hat{\beta}_3 s_{01}^2 + \hat{\beta}_4 s_{02}^2 + \hat{\beta}_5 s_{01} s_{02}$

A schematic of such an interpolation paralleling Figure 5.4 is shown in Figure 5.6 below.



   
   


[Diagram omitted: four data points on the line, with the dashed local quadratic fit evaluated at s0.]
Figure 5.6. Local Quadratic Interpolation

In this schematic example, the red dashed curve is now the quadratic function in (5.3.7) evaluated at all points on the line, including $s_0$. Given the obvious nonlinear nature of this data, it would appear that a quadratic polynomial interpolation would yield a better fit to this data.

However, it should again be emphasized that the actual locus of interpolations obtained would still be discontinuous for exactly the same reasons as in the linear interpolation

4
Here it should be noted that the type of one-dimensional example shown in Figure 5.5 is not readily
implemented using GWR in ARCMAP. Rather this example was computed using the MATLAB program,
gwr.m , in the suite of programs by James LeSage, available at http://www.spatial-econometrics.com/ .


case above. So when using these interpolators in GA, be aware that implicit smoothing procedures are being used, in a manner similar to the kernel smoothing procedure outlined above. Hence the numerical values obtained will differ slightly from the simple interpolations, $\hat{y}(s_0)$, in (5.3.2) and (5.3.7) above, depending on the spacing of actual data points. Also be aware that the smoothing procedures used are not documented in ARCMAP help. As with essentially all commercial GIS software, there are often hidden layers of calculations being done that are not fully documented.

5.4 Radial Basis Function Models

In view of the continuity problem inherent in local polynomial interpolations, it is of interest to consider interpolation methods that are guaranteed to yield interpolation surfaces that are not only continuous but are in fact everywhere smooth (in contrast to IDW, which is continuous but not smooth at data points). The simplest of these are the so-called radial basis function interpolators also available in GA. Here the basic idea is to choose a family of radially symmetric functions, $f(s)$, about the origin that (typically) fall to zero as distance from the origin increases. We have already seen one such function, namely the standard bivariate normal density function in Figure 3.2 of the Spatially-Dependent Random Effects section above. Each data point, $(s_i, y_i)$, is then associated with a member of this family where the origin is set at $s_i$. So for the bivariate normal case one would have5

(5.4.1)    $f_i(s) = e^{-\frac{1}{2}\|s - s_i\|^2}\, , \quad s \in \mathbb{R}^2$

where the normalizing constant of the bivariate normal density plays no role here, and has been removed. One then defines the interpolation function for this model to be a weighted combination of these basis functions:

(5.4.2)    $\hat{y}(s) = \sum_{i=1}^{n} a_i\, f_i(s)$

To choose an appropriate set of weights, one typically requires exact interpolation at each data point $(s_i, y_i)$, i.e.,

(5.4.3)    $y_i = \hat{y}(s_i) = \sum_{j=1}^{n} a_j\, f_j(s_i)\, , \quad i = 1,\dots,n$

Since this is simply a system of linear equations in the unknown weight vector, $a = (a_1,\dots,a_n)$, one can easily solve for these weights. In particular, if we let $y = (y_1,\dots,y_n)$ denote the vector of observed y-values, and let the n-square function matrix, $F$, be defined by

(5.4.4)    $F_{ij} = f_i(s_j)\, , \quad i, j = 1,\dots,n$

5
Recall that Euclidean distance between vectors $x$ and $y$ is denoted by $d(x,y) = \|x - y\|$.


then it follows at once from (5.4.3) that

(5.4.5)    $\begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} F_{11} & \cdots & F_{1n} \\ \vdots & \ddots & \vdots \\ F_{n1} & \cdots & F_{nn} \end{pmatrix} \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} \;\;\Leftrightarrow\;\; y = F a$

Hence the desired weight vector is uniquely defined by6

(5.4.6)    $a = F^{-1} y$

This set of weights necessarily yields a smooth interpolation function that passes through all of the data points.
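The whole procedure (5.4.1)–(5.4.6) amounts to one linear solve. The MATLAB fragment below is a sketch under stated assumptions only: the function name is hypothetical, the Gaussian basis of (5.4.1) is used, and the coordinates are assumed to be scaled so that this kernel is not numerically degenerate.

% Sketch of radial basis function interpolation with the Gaussian basis (5.4.1):
% solve F a = y for the weights, then evaluate the interpolant at s0.
function yhat = rbf_sketch(s0, S, y)
n = size(S,1);
F = zeros(n);
for i = 1:n
    for j = 1:n
        F(i,j) = exp(-0.5 * sum((S(j,:) - S(i,:)).^2));   % F_ij = f_i(s_j), as in (5.4.4)
    end
end
a  = F \ y;                                               % weights a = F^{-1} y  (5.4.6)
f0 = exp(-0.5 * sum((repmat(s0,n,1) - S).^2, 2));         % f_i(s0), i = 1,..,n
yhat = f0' * a;                                           % interpolated value (5.4.2)
end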

Although the normal density above is a very common choice for radial basis functions, this option is not available in GA. However, one option that is available, and which looks very similar to the standard bivariate normal density, is the so-called inverse multiquadratic function defined for all $s \in \mathbb{R}^2$ by7

(5.4.7)    $f(s) = \dfrac{1}{1 + \|s\|^2}$

so that in this case, (5.4.1) is replaced by

(5.4.8)    $f_i(s) = \dfrac{1}{1 + \|s - s_i\|^2}$

While this function appears to be mathematically very different from the bivariate normal density, a two-dimensional plot of (5.4.7) shows that it is virtually indistinguishable from Figure 3.2 in a qualitative sense. About the only significant difference is that it falls to zero much more slowly than the normal density.

To gain some feeling for this type of interpolation, it is again convenient to develop a one-dimensional example paralleling the example for local polynomial interpolations above. In this case, (5.4.7) reduces to the function

(5.4.9)    $f(s) = \dfrac{1}{1 + s^2}\, , \quad s \in \mathbb{R}$

which is now seen to be qualitatively similar to the univariate normal density in expression (3.1.11) of Section 3. Interpolation with these radial basis functions can be

6
This of course assumes that F is nonsingular, which will hold in all but the most degenerate cases.
7
As with the standard normal density, this function can be generalized by adding a weight, $\theta$, to $s$, yielding the one-parameter family, $f(s\,|\,\theta) = \dfrac{1}{1 + \|\theta s\|^2}$.


illustrated by the example shown in Figure 5.7 below, which involves only three data points, $(s_i, y_i)$, $i = 1, 2, 3$.

[Plot omitted: the three fitted basis functions $a_1 f_1(s)$, $a_2 f_2(s)$, $a_3 f_3(s)$ (in black) and the resulting interpolation function $\hat{y}(s)$ (in red) through the data points at $s_1, s_2, s_3$.]

Figure 5.7. Interpolation with Radial Basis Functions

Here the fitted radial basis functions [$a_i f_i(s)$, $i = 1, 2, 3$] are shown in black, and the resulting interpolation function, $\hat{y}(s)$, is shown in red.

Notice that unlike the kernel smoothers above, there is no need for interpolation sets in this case. Here the entire interpolation function, $\hat{y}(s)$, is determined simultaneously at all point locations, $s$. Notice also that this function is necessarily smooth (since it is a sum of smooth functions). Finally, note from the figure that $\hat{y}(s)$ is indeed an exact interpolator at data points, i.e., it passes through each of the data points shown.

5.5 Spline Models

While the above procedure offers a remarkably simple way to obtain smooth and exact interpolations, it can be argued that the choice of basis functions is rather arbitrary. Moreover, it is difficult to regard this fitting procedure as “optimal” in any sense. Rather, its weights are determined entirely by the exact-interpolation condition in (5.4.3) above. However, there is an alternative method of interpolation, known as spline interpolation, which is more appealing from a theoretical viewpoint. As with radial basis functions, the classical spline model seeks to find a prediction function, $\hat{y}(s)$, that satisfies the exact-interpolation condition,

(5.5.1)    $\hat{y}(s_i) = y_i\, , \quad i = 1,\dots,n$


There are of course infinitely many smooth functions that could satisfy this condition. Hence the unique feature of spline interpolation is that rather than simply pre-selecting a given set of smooth candidate functions, this approach seeks to find the smoothest possible function satisfying (5.5.1). To characterize “smoothness”, recall that for one-dimensional functions, $f(s)$, the second derivative, $f''(s)$, measures the curvature of the function. In particular, linear functions, $f(s) = a + bs$, have zero curvature, as reflected by the fact that $f''(s) \equiv 0$. More generally, if we ignore signs and define the curvature of $f$ at $s$ by $f''(s)^2$, then to compare the curvature of functions $f$ on a given interval, say $[a, b]$, it is natural to consider their total curvature

(5.5.2)    $C(f) = \displaystyle\int_a^b f''(s)^2\, ds$

as a measure of “smoothness”, where higher degrees of smoothness correspond to lower total curvature.8

For two dimensions, the idea is basically the same. Here “curvature” at a point, $s = (s_1, s_2)$, is defined in terms of the Hessian matrix of second partial derivatives

(5.5.3)    $H_f(s) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial s_1^2} & \dfrac{\partial^2 f}{\partial s_1 \partial s_2} \\ \dfrac{\partial^2 f}{\partial s_1 \partial s_2} & \dfrac{\partial^2 f}{\partial s_2^2} \end{pmatrix}$

Again to ignore signs, one can define the size of a matrix, $M = (m_{ij})$, by its squared distance from the origin (as a vector), i.e.,

(5.5.4)    $\| M \|^2 = \sum_i \sum_j m_{ij}^2$

In these terms, the curvature of a two-dimensional function, $f(s)$, at $s = (s_1, s_2)$ is defined to be the size of its Hessian at $s$, i.e.,

(5.5.5)    $\| H_f(s) \|^2 = \left(\dfrac{\partial^2 f}{\partial s_1^2}\right)^2 + 2\left(\dfrac{\partial^2 f}{\partial s_1 \partial s_2}\right)^2 + \left(\dfrac{\partial^2 f}{\partial s_2^2}\right)^2$

[which is seen to be the natural generalization of the one-dimensional case, $\| f''(s) \|^2 = f''(s)^2$]. Hence to compare the curvature of functions, $f$, on a (bounded) two-dimensional region, $R \subset \mathbb{R}^2$, the natural extension of (5.5.2) is to define total curvature, $C(f)$, by

8
While it might be conceptually more appropriate to use average curvature, $\bar{C}(f) = \frac{1}{b-a}\, C(f)$, this simple rescaling has no effect on the ordering of smoothness among functions, $f$.



(5.5.6)    $C(f) = \displaystyle\int_R \| H_f(s) \|^2\, ds = \int_R \left[\left(\dfrac{\partial^2 f}{\partial s_1^2}\right)^2 + 2\left(\dfrac{\partial^2 f}{\partial s_1 \partial s_2}\right)^2 + \left(\dfrac{\partial^2 f}{\partial s_2^2}\right)^2\right] ds_1\, ds_2$

Thus, for a given set of data points, $[(s_1, y_1),\dots,(s_n, y_n)]$ with $\{s_1,\dots,s_n\} \subset R \subset \mathbb{R}^2$, the corresponding spline interpolation problem is to find a function, $\hat{y}(s)$, on $R$ which minimizes total curvature (5.5.6) subject to the exact-interpolation condition (5.5.1).

While these interpolation problems are relatively simple to state, they can only be solved by very sophisticated mathematical methods. Hence for our purposes, it suffices to say that these solutions are themselves remarkably simple, and lead to optimally smooth interpolation functions, $\hat{y}(s)$. To illustrate the basic ideas, it is again convenient to focus on the one-dimensional case. Here it turns out that the basic interpolation functions are combinations of cubic functions,

(5.5.7)    $f(s) = a_3 s^3 + a_2 s^2 + a_1 s + a_0$

between every pair of adjacent data points. To gain some intuition here, suppose one considers a “partial” smooth curve with a gap between two data points, $s_1$ and $s_2$, as shown in Figure 5.8a below. To “complete” this curve in a smooth way, one must match both the end values (shown as black dots) and the end slopes (shown as dashed lines).

Figure 5.8a. Partial Smooth Curve        Figure 5.8b. Completed Smooth Curve

Since the cubic function is the smoothest function with an inflection point (where the
second derivative changes sign), it should be clear from Figure 5.8b that this adds just
enough flexibility to complete this curve in the smoothest possible way.9

As one example of this interpolation method, the data set in Figure 5.7 has been re-
interpolated using cubic spline interpolation in Figure 5.9 below.10 Here there appears to
be a dramatic difference between the two methods. But except for the slight scale

9 For a more general discussion of fitting cubic splines to (one-dimensional) data sets, see McKinley and Levine at http://online.redwoods.cc.ca.us/instruct/darnold/laproj/fall98/skymeg/proj.pdf .
10 This interpolation was computed in MATLAB using their package program, spline.m.

differences between these two figures, the qualitative "bowl" shape of ŷ(s) within the interval defined by these three data points is roughly similar. Moreover, it should now be clear that this "bowl" is much smoother for the cubic spline case than for the (inverse quadratic) radial basis function case in Figure 5.7. Indeed, this cubic spline is the smoothest possible exact interpolator within this interval. But notice also that outside this interval, the two interpolations differ radically. Since the individual radial basis functions in Figure 5.7 all approach zero as distance from the data points increases, the interpolation function also decreases to zero. But for the cubic spline case, this smooth "bowl" is only
achieved by continuing the bowl shape outside the data interval.
Figure 5.9 Cubic Spline Interpolation

More generally, these cubic spline interpolations tend to diverge at the data boundaries,
and are much less trustworthy in this range. A better example of this is shown in Figure
5.10 below, where the interpolation now involves five points. Here again the cubic spline interpolator, ŷ(s), is seen to yield the smoothest possible exact interpolation of these five data points. But outside this range, ŷ(s) now diverges downward on both sides in
order to achieve this degree of smoothness.

Figure 5.10 Cubic Spline Interpolation
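For readers who wish to experiment, the built-in MATLAB function spline (the routine mentioned in footnote 10) produces exactly this kind of exact piecewise-cubic interpolator. The sketch below uses five arbitrary illustrative values (not the data plotted in Figure 5.10); note also that spline uses "not-a-knot" end conditions rather than the natural (minimum-curvature) end conditions of the variational problem above, though the qualitative behavior (exact fit at the data points and divergence outside the data range) is the same:

   % Exact cubic-spline interpolation of five hypothetical one-dimensional data points
   si = [2 4 6 8 10];                  % data locations s_i
   yi = [1.0 0.2 1.5 0.4 1.1];         % observed values y_i (arbitrary)
   pp = spline(si, yi);                % piecewise-cubic interpolant with yhat(s_i) = y_i
   s  = linspace(0, 12, 200);          % evaluation grid extending beyond [2,10]
   ys = ppval(pp, s);                  % interpolated (and extrapolated) values
   plot(s, ys, '-', si, yi, 'ko')      % divergence outside the data range is visible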
While these one-dimensional examples serve to illustrate the main ideas of spline
interpolation, the solution in two dimensions [i.e., the function ŷ(s) minimizing (5.5.6) subject to (5.5.1)] is mathematically quite different from the one-dimensional case. From
an intuitive viewpoint, the basic reason for this is that in two dimensions it is not possible
to “piece together” solutions between data points. In fact, the solution in two dimensions
is formally much closer to the radial basis function approach above. In particular, the
optimal interpolation function, yˆ ( s) , designated as a thin-plate spline function, takes
essentially the same form as (5.4.2), namely

yˆ ( s )  yˆ ( s1 , s2 )  (0  1s1  2 s2 )   a f (s |  )
n
(5.5.8) i 1 i i

with “radial basis functions” (parameterized by,   0 ) given by

d 
(5.5.9) fi ( s |  )  f (dis )  dis2 log  is  , i  1,.., n
 

where d si  d ( s, si )  || s  si || for each data point, si . The linear part of yˆ ( s ) is usually


called the trend function, and is seen to have little influence on the detailed shape of yˆ ( s)
relative to these radial basis functions.11 The key feature of these functions is their
flexible shape. In particular observe that for s closer to si than distance  , i.e., dis   ,
the log expression is negative. Hence each function first “dips” and eventually rises as
distance dis increases. So, as with the one-dimensional cubic splines above, this again
yields a single inflection point along rays in every direction from the origin, as shown by
the one-dimensional profile of this function in Figure 5.11 below:

Figure 5.11 Radial Shape of Spline Basis Functions
[one-dimensional profile of f_τ(d_is): negative for d_is < τ, zero at d_is = τ, rising thereafter]

11 In fact, this linear part has zero curvature by definition, and hence has no effect on the curvature of ŷ(s).

Here the value τ = 10 was used, so that by definition f_10(d_is) rises back up to zero at exactly d_is = 10. This also makes it clear that beyond radial distance τ from the data points, s_i, these functions diverge rapidly, and can produce rapid changes in ŷ(s). So larger values of τ tend to produce "stiffer" surfaces with less variation. This can also be seen in the full two-dimensional representation of f_i(s|τ) shown in Figure 5.12 below [again with τ = 10].

Figure 5.12. Two-Dimensional Spline Basis Functions

As in the one-dimensional case, the local flexibility of this “Mexican hat” function allows
much more rapid changes in value than say the multi-quadratic basis function above. So
for example in the case of elevations, these thin-plate splines will do a much better job of
capturing rapid changes in elevation than will the multi-quadratic functions.
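For completeness, a function of the form (5.5.8) can be fitted to scattered data by solving one linear system in the basis weights a_i and the trend coefficients β. The MATLAB sketch below is a minimal illustration only: the data, the choice τ = 10, and the standard side conditions used to close the system (requiring the a_i to be orthogonal to the trend terms) are assumptions for illustration, not details given in the text above:

   % Fit an interpolator of the form (5.5.8) with basis functions (5.5.9)
   S   = [0 0; 1 0; 0 1; 1 1; 0.5 0.5];        % hypothetical data locations (s1, s2)
   y   = [1.0; 0.3; 0.8; 0.2; 1.5];            % hypothetical observed values
   n   = size(S,1);
   tau = 10;                                    % "stiffness" parameter in (5.5.9)

   D = sqrt((S(:,1)-S(:,1)').^2 + (S(:,2)-S(:,2)').^2);   % pairwise distances d_ij
   K = D.^2 .* log(D/tau);   K(D == 0) = 0;     % basis values f_j(s_i | tau), with f(0) = 0
   P = [ones(n,1) S];                           % linear trend terms [1, s1, s2]

   A    = [K P; P' zeros(3)];                   % interpolation equations + side conditions
   coef = A \ [y; zeros(3,1)];
   a    = coef(1:n);   beta = coef(n+1:end);    % basis weights a_i and trend coefficients

   % Prediction at a new location s0, as in (5.5.8):
   s0    = [0.25 0.75];
   d0    = sqrt(sum((S - s0).^2, 2));
   f0    = d0.^2 .* log(d0/tau);   f0(d0 == 0) = 0;
   yhat0 = [1 s0]*beta + f0'*a;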
Finally, it should again be noted that (as mentioned for kernel smoothers at the end of
Section 5.2 above) the two-dimensional spline interpolation methods employed in the
Spatial Analyst (SA) extension of ARCMAP are considerably more complex than the
basic thin-plate spline model developed above. To describe the main differences, we
focus on the regularized spline option (rather than the less commonly used “tension”
spline option). While this model is based essentially on thin-plate splines, the radial basis
functions in (5.5.9) above are “augmented” by an additional term that reflects third
derivative effects as well as second derivative (curvature) effects.12 So the "τ"

12
This use of third derivatives is appropriate if continuity of second derivatives is required for the
prediction function [see Mitáš and Mitášová (1988, Section 6)]. But continuity of first derivatives is usually
sufficient for adequate smoothness. So from a practical viewpoint, the simpler approach of thin-plate spline
interpolation is in many ways more appealing. [See for example the constructive development of this
method in Franke (1982)].

parameter in (5.5.9) plays a somewhat more complex role in these functions. However, its basic interpretation remains the same. In particular, larger values of τ still produce "stiffer" surfaces with less variation.

But a key point here is that the exact interpolation constraint, ŷ(s) = y(s), is dropped. So the surface need not pass through each data point. In addition, the regularized spline interpolation procedure in SA is "localized" by partitioning R into smaller cells to simplify the calibration procedure. So in addition to the τ parameter, the user is also asked to choose the "number of points", say η, to be used in the interpolation of each cell. Again, larger numbers of points produce smoother interpolations with less variation.

5.6 A Comparison of Models using the Nickel Data

While the above models were motivated in terms of the "elevation" example in Figure 5.1, most spatial data sets tend to exhibit more local variation than elevations (or, say, the mean temperature levels studied in Assignments 3 and 5). A typical example is provided by the Nickel data displayed in Figure 4.18 of Section 4.9 above. Here we start with the regularized spline tool discussed above, and compare interpolations for two different parameter settings, (τ, η):

Figure 5.13a. Spline with (τ = 0.1, η = 12)        Figure 5.13b. Spline with (τ = 10, η = 50)

Here the spline interpolation shown in Figure 5.13a uses the default parameter settings for the spline tool, namely τ = 0.1 and η = 12. However, a comparison with Figure 4.18 shows that this interpolation does a rather poor job of capturing overall variations in nickel deposits. Much like the IDW interpolation of rainfall in Figure 2.2 of Section 2 above, this interpolation shows far too many local extremes, both high and low. This can also be seen numerically by noting that while the actual values of nickel deposits are of course nonnegative (starting from 1.0 ppm), the values in this spline interpolation go down to values as low as −700 ppm, which are of course totally meaningless. As described above, the key problem here is that this spline function is attempting to interpolate between points that exhibit extreme local variation in values. Hence while the

values, τ = 0.1 and η = 12, are reasonable choices for very smooth surfaces, they are creating far too much variation between data locations in this case. In fact, to achieve an interpolation that is sufficiently "stiff" to provide a reasonable overall approximation to this surface, it is necessary to use much larger settings, such as the values τ = 10 and η = 50 shown in Figure 5.13b.

Turning now to a broader range of models, we present a graphical comparison of four different methods for interpolating this nickel data in Figure 5.14 below. The first three relate to the deterministic methods above.

(a) Local Linear Polynomial (b) Radial Basis Function

(c) Spline Function (d) Ordinary Kriging

Figure 5.14 Interpolation Comparisons

Panel (a) shows the results of a local linear polynomial interpolation in Geostatistical
Analyst (GA). Panel (b) shows a radial basis function interpolation in GA with the
inverse multiquadratic option described above. Panel (c) shows the same regularized
spline interpolation in Figure 5.13b above (minus the data points). Finally, the last panel
compares these deterministic methods with the stochastic method of ordinary kriging in
GA, which will be developed fully in Section 6.

For the present, the main purpose of this graphical comparison is to show that, in spite of their mathematical differences, the actual results of these different interpolation methods are qualitatively quite similar. However, it should be emphasized that considerable experimentation was required to find parameter settings for each method that yielded this degree of visual similarity. For example, as already mentioned above, the spline interpolation in panel (c) required the use of very large values of (τ, η) to achieve a sufficient degree of overall smoothness. As for panel (b), note that while the inverse multiquadratic function is itself very smooth, the variation in y-values here leads to the least smooth interpolation of the four panels shown. The key factor appears to be the lack of flexibility in fitting, which in this case is determined entirely by the exact-interpolation condition. Finally, as was pointed out at the end of Section 5.2, the local linear interpolation in panel (a) involves a number of internal smoothing procedures to remove the discontinuities created by shifting interpolation sets from point to point. So here it is not even clear how such smoothness was achieved. The same is in fact true of the ordinary kriging results shown in panel (d). As we shall see below, this procedure involves "prediction sets" identical in spirit to the "interpolation sets" of local polynomial interpolations. So again, internal smoothing procedures have been employed in GA to obtain continuous results.

Finally, it is important to emphasize once again that all interpolation models developed in this section are completely deterministic. In other words, the surface being approximated is treated simply as an unknown function, y(s), on region R, rather than as the realization of an unknown spatial stochastic process, Y(s) = μ(s) + ε(s), on R. The deterministic approach may be quite appropriate for applications such as "filling in" surface elevations from a subset of observed values, where such elevations can reasonably be assumed to vary continuously. But for spatial data such as the nickel-deposits example, where local variations can be quite substantial, it is generally more useful to treat such variations as random fluctuations, ε(s), about a deterministic trend function, μ(s), representing the mean values of a stochastic process, Y(s) = μ(s) + ε(s). Hence we now turn to a consideration of this stochastic approach to spatial prediction.

6. Simple Spatial Prediction Models


In this section we consider the simplest spatial prediction models that incorporate random
effects. These spatial prediction models are part of a larger class of models known as
kriging models [in honor of the South African mining engineer, D.G. Krige, who
pioneered the use of statistical methods in ore-grade sampling in the early 50’s].1 So
before launching into the details of the specific models developed in this section, it is
appropriate to begin with a general overview of kriging models.

6.1 An Overview of Kriging Models

From a formal viewpoint, kriging models are closely related to the kernel smoothing
models developed in Sections 5.1 and 5.2 above. In particular, the fundamental idea of
predicting values based on local information is exactly the same. In fact, a slight
modification of Figure 5.2, as in Figure 6.1 below, serves to illustrate the main ideas.

Figure 6.1 Basic Kriging Framework
[prediction Ŷ(s_0) formed from neighboring values Y(s_1),…,Y(s_5) with weights λ_01,…,λ_05]

Given spatial data, y(s), at a set of locations, {s_i : i = 1,…,n} ⊂ R, we again consider the prediction of the unobserved value at some location, s_0 ∈ R. The first key difference is that we now treat the observed data as a finite sample from a spatial stochastic process, {Y(s) : s ∈ R}. As in the case of deterministic interpolation, not all sample data is necessarily relevant for prediction at s_0. Hence, for the present, we again assume that some appropriate subset of sample locations,

1
For further background discussion of kriging methods see Cressie (1990) and (1993, p.106).

(6.1.1)    S(s_0) ⊆ {s_i : i = 1,…,n}

has been chosen for prediction, which for convenience we here designate as the prediction set at s_0 (rather than "interpolation set"). The choice of S(s_0) will of course play a major role in determining the final prediction value at s_0. But it will turn out that the best way to choose these sets is first to determine a "best prediction" for any given set, S(s_0), and then determine a "best prediction set" by comparing these predictions. This procedure, known as cross validation, will be developed in Section 6.4 below.

So given prediction set, S(s_0) = {s_1,…,s_{n_0}}, the next question is how to determine a prediction, ŷ(s_0), based on the sample data, {y(s_1),…,y(s_{n_0})}. Given the present stochastic framework, this question is more properly posed by treating this prediction as a random variable, Ŷ(s_0), and asking how it can be determined as a function of the random variables, {Y(s_1),…,Y(s_{n_0})}, associated with the observed data. As with kernel smoothers, we again hypothesize that Ŷ(s_0) can be represented as some linear combination of these random variables, i.e., that Ŷ(s_0) is of the form:

(6.1.2)    Ŷ(s_0) = Σ_{i=1}^{n_0} λ_0i Y(s_i)

where the weights λ_0i are yet to be determined. This fundamental hypothesis shall be referred to as the linear prediction hypothesis.

6.1.1 Best Linear Unbiased Predictors

In contrast to kernel smoothing, the unknown weights λ_0i in (6.1.2) need not be simple functions of distance (so that λ_0i in Figure 6.1 now replaces d_0i in Figure 5.2).2 In any case, the key strategy of kriging models is to choose weights that are "statistically optimal" in an appropriate sense. To motivate this approach in the simplest way, we begin by designating the difference between Ŷ(s_0) and the unknown true random variable, Y(s_0), as the prediction error,

(6.1.3)    e(s_0) = Y(s_0) − Ŷ(s_0)

This prediction error will play a fundamental role in the analysis to follow. But before proceeding, it is important to distinguish the prediction error, e(s_0), from the random effects

2 One would expect that points s_i closer to s_0 will tend to have larger weights, λ_0i. However, we shall see in Section 6.2.3 below that this is not true, even when spatial correlations decrease with distance.

term, ε(s_0), in our basic stochastic model, Y(s_0) = μ(s_0) + ε(s_0). While they can both be viewed as "random errors", the random effects term, ε(s_0), describes the deviation of Y(s_0) from its mean, so that by definition, E[ε(s_0)] = 0. This is certainly not part of the definition of prediction error.

However, it is clearly desirable that prediction errors satisfy this zero-mean property, i.e., that prediction error on average be zero. Indeed, this is our first statistical optimality criterion, usually referred to as the unbiasedness criterion:

(6.1.4)    E[e(s_0)] = E[Y(s_0) − Ŷ(s_0)] = 0

All predictors, Ŷ(s_0), satisfying both (6.1.2) and (6.1.4) are referred to as linear unbiased predictors of Y(s_0). In these terms, our single most important optimality criterion is that, among all possible linear unbiased predictors, the prediction error of Ŷ(s_0) should be as "close to zero" as possible. While there are many ways to define "closeness to zero", for the case of random prediction error it is natural to require that the mean squared error, E[e(s_0)²], be as small as possible.3 Hence our third criterion, designated as the efficiency criterion, is that Ŷ(s_0) have minimum mean squared error among all linear unbiased predictors.

This criterion is so pervasive in the statistical literature that it is given many different
names. On the one hand, if we abbreviate “minimum mean squared error” as MMSE,
then such predictors are often called MMSE predictors. In addition, notice that since unbiasedness (E[e(s_0)] = 0) implies

(6.1.5)    var[e(s_0)] = E[e(s_0)²] − (E[e(s_0)])² = E[e(s_0)²] ,

such predictors are also instances of minimum variance predictors. However, to
emphasize their optimality among all linear unbiased predictors, it is most accurate to
designate them as best linear unbiased predictors, or BLU predictors. It is this latter
terminology that we shall use throughout.

6.1.2 Model Comparisons

Within this general framework we consider four different kriging models, proceeding
from simpler to more general models. These models are each characterized by the
specific assumptions made about the properties of the underlying spatial stochastic
process, {Y(s) = μ(s) + ε(s) : s ∈ R}. For all such models, we start with a fundamental

3
Another possibility would be to require that the mean absolute error, E[| e( s0 ) |] , be as small as possible.
However, since the absolute-value function is not differentiable at zero, this criterion turns out to be much
more difficult to analyze.

normality assumption about spatial random effects. In particular, for each finite set of locations, {s_i : i = 1,…,n}, in region R, it will be assumed that the associated spatial random effects, [ε(s_i) : i = 1,…,n], are multi-normally distributed.4 Since E[ε(s)] = 0 by definition, this distribution is determined entirely by the covariances, cov[ε(s_i), ε(s_j)], i,j = 1,…,n. Hence the assumptions characterizing each model can be summarized in terms of assumptions about (i) the spatial trend, μ(s), and (ii) the covariances, cov[ε(s), ε(s′)], between pairs of random errors.

Before stating these assumptions, it is important to make one additional clarification. When a given parameter, such as a mean value, μ, is assumed to be "known" or "unknown", these terms have very specific meanings. In particular, one almost never actually "knows" the value of any parameter. Rather, a phrase like "μ known" is taken to mean that the value of this parameter is determined outside of the given model. Similarly, "μ unknown" is taken to mean that the value of this parameter is to be determined inside (i.e., as part of) the given model.5

Simple Kriging Model

Here "simple" refers to the (rather heroic!) assumption that the underlying stochastic process itself is entirely known. In addition, it is also assumed that the spatial trend is constant. More formally, this amounts to the assumptions:

(6.1.6)    μ(s) = μ known ,   s ∈ R

(6.1.7)    cov[ε(s), ε(s′)] known ,   s, s′ ∈ R

Before proceeding, it is reasonable to ask why one would even want to consider this
model. Since all parameters of the stochastic process are determined outside the model, it
would appear that there is nothing left to be done. But remember that the underlying
stochastic process model serves only as a statistical framework for carrying out spatial prediction. In particular, given any location, s_0 ∈ R, and associated prediction set, S(s_0) = {s_1,…,s_{n_0}}, the basic task is to predict a value for Y(s_0) given observed values of {Y(s_1),…,Y(s_{n_0})}. So in terms of the linear prediction hypothesis in (6.1.2), the key prediction weights, (λ_0i : i = 1,…,n_0), are still unknown, i.e., are yet to be determined. Hence the chief advantage of this simple kriging model from a conceptual viewpoint is to
4
In addition there is an obvious “consistency” condition that must also be satisfied. For example, if
{Y ( s1 ), Y ( s2 )} is bivariate normal, then the univariate normal distributions for subsets {Y ( s1 )} and {Y ( s2 )}
must of course be the marginal distributions of {Y ( s1 ), Y ( s2 )} . More generally each subset of size k from
the n-variate normal, {Y ( s1 ),..., Y ( sn )} must have precisely the corresponding k-variate marginal normal
distribution.
5 A somewhat more accurate terminology would be to use "μ exogenous" and "μ endogenous". But the terms "known" and "unknown" are so widely used that we choose to stay with this convention.

allow us to derive optimal prediction weights without having to worry about estimating
other unknown parameters at the same time.

Ordinary Kriging Model

The only difference between this model and simple kriging is that the constant mean, μ, is now assumed to be unknown, and hence must be estimated within the model. More formally, it is assumed that

(6.1.8)    μ(s) = μ unknown ,   s ∈ R

(6.1.9)    cov[ε(s), ε(s′)] known ,   s, s′ ∈ R

This ordinary kriging model is in fact the simplest kriging model that is actually used in
practice. As will be seen below, the constant-mean assumption (6.1.8) allows both the
mean and covariances to be estimated in a direct way from observed data. So a practical
estimation procedure is available for this model. However, one may still ask why this
model is of any interest from a spatial viewpoint when all variations in spatial trends are
assumed away. The key point to keep in mind here is that spatial variation is still present
in this model, but all such variation is assumed to be captured by the covariance structure
of the model. We shall return to this issue in Section 6.3 below.

Universal Kriging Model

We turn now to kriging models that do allow for explicit variation in the trend function, μ(s). The simplest of these, designated as the universal kriging model, allows the trend function to be modeled as a linear function of spatial attributes, but maintains the assumption that all covariances are known. More formally, if we now let x(s) = [x_1(s),…,x_k(s)]′ denote a (column) vector of spatial attributes [which may include the coordinate attributes, s = (s_1, s_2), themselves], and let β = (β_1,…,β_k)′ denote a corresponding vector of coefficients, then this model is characterized by the assumptions:

(6.1.10)    μ(s) = x(s)′β ,   β unknown ,   s ∈ R

(6.1.11)    cov[ε(s), ε(s′)] known ,   s, s′ ∈ R

Here it should be emphasized that "linear" means linear in the parameters (β). For example, if x(s) = [1, s_1, s_2, s_1², s_2², s_1 s_2]′, so that

(6.1.12)    μ(s) = β_0 + β_1 s_1 + β_2 s_2 + β_3 s_1² + β_4 s_2² + β_5 s_1 s_2 ,

then the trend, μ(s), is a quadratic function of the coordinates, s = (s_1, s_2), but is linear in the parameter vector, β = (β_0, β_1, β_2, β_3, β_4, β_5)′.
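To make the "linear in parameters" point concrete, the following MATLAB fragment (with purely hypothetical coordinates and coefficients) builds the attribute vectors of (6.1.12) as rows of a matrix, so that the trend values are an ordinary linear combination X*beta:

   % Quadratic-in-coordinates but linear-in-beta trend, as in (6.1.12)
   S  = [0.2 1.3; 2.0 0.7; 1.1 1.8];                 % hypothetical locations (s1, s2)
   s1 = S(:,1);   s2 = S(:,2);
   X  = [ones(size(s1)) s1 s2 s1.^2 s2.^2 s1.*s2];   % rows are x(s_i)'
   beta = [1; 0.5; -0.2; 0.1; 0.3; -0.4];            % any coefficient vector
   mu   = X*beta;                                    % trend values mu(s_i) = x(s_i)'*beta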

Geostatistical Kriging Model

Our final kriging model relaxes the assumption that covariances are known. More
formally, this geostatistical kriging model (or simply, geo-kriging model) is characterized
by the following assumptions:

(6.1.13)    μ(s) = x(s)′β ,   β unknown ,   s ∈ R

(6.1.14)    cov[ε(s), ε(s′)] unknown ,   s, s′ ∈ R

In this model, the spatial trend parameters, β, as well as all covariance parameters, must
be simultaneously estimated. While this procedure is clearly more complex from an
estimation viewpoint, it provides the most general framework for spatial prediction in
terms of prior assumptions. Hence our ultimate goal in this part of the NOTEBOOK is to
develop this geostatistical kriging model in full, and show how it can be estimated.

6.2 The Simple Kriging Model

To develop the basic idea of kriging, we start by assuming, as in (6.1.6) and (6.1.7) above, that the relevant spatial stochastic process, {Y(s) = μ(s) + ε(s) : s ∈ R}, has a constant mean, E[Y(s)] = μ(s) = μ, and that this mean value, μ, together with all covariances, cov[ε(s), ε(s′)], s, s′ ∈ R, have already been estimated. We shall return to such estimation questions below. But for the present we simply take all these values to be given. In this setting, observe that if we want to predict a value, Y(s_0), at some location, s_0 ∈ R, then since μ(s_0) = μ is already known, we see from the identity,

(6.2.1)    Y(s_0) = μ + ε(s_0)

that it suffices to predict the associated error, ε(s_0). Moreover, if we are given a finite set of sample points, {s_1,…,s_n} ⊂ R, where observations, {y(s_1),…,y(s_n)}, have been made, then in fact we have already "observed" values of the associated errors, namely,

(6.2.2)    ε(s_i) = y(s_i) − μ ,   i = 1,…,n

Hence if S(s_0) = {s_1,…,s_{n_0}} ⊆ {s_1,…,s_n} denotes the relevant prediction set at s_0, then the linear prediction hypothesis for ε(s_0) in this setting reduces to finding a linear combination,

(6.2.3)    ε̂(s_0) = Σ_{i=1}^{n_0} λ_0i ε(s_i)

which yields a Best Linear Unbiased (BLU) predictor of ε(s_0). The corresponding predictor of Y(s_0) is then defined to be

(6.2.4)    Ŷ(s_0) = μ + ε̂(s_0)

Note that since by definition all errors, ε(s), s ∈ R, have zero means, it then follows at once from (6.2.1) and (6.2.4), together with the linearity of expectations, that

(6.2.5)    E[Y(s_0) − Ŷ(s_0)] = E[ε(s_0) − ε̂(s_0)]
                              = E[ε(s_0) − Σ_{i=1}^{n_0} λ_0i ε(s_i)]
                              = E[ε(s_0)] − Σ_{i=1}^{n_0} λ_0i E[ε(s_i)] = 0

and hence that the unbiasedness condition is automatically satisfied for Ŷ(s_0) [and indeed, for every possible linear estimator given by (6.2.3) and (6.2.4)]. This means that for simple kriging, BLU prediction reduces precisely to Minimum Mean Squared Error (MMSE) prediction. So the task remaining is to find the vector of weights, λ_0 = (λ_0i : i = 1,…,n_0), in (6.2.3) that minimizes mean squared error:
(6.2.6)    MSE(λ_0) = E{[Y(s_0) − Ŷ(s_0)]²} = E{[ε(s_0) − ε̂(s_0)]²}
Here it might seem that without further information about the distributions of these
errors, one could say very little. But surprisingly, it is enough to know their first and
second moments [as assumed in (6.1.6) and (6.1.7) above]. To see this, we begin by
introducing some simplifying notation. First, as in (1.1.1) above, we drop the explicit
reference to locations and now write simply

(6.2.7)    ε(s_i) = ε_i ,   i = 0,1,…,n_0

[Here it is worth noting that the choice of "0" for the prediction location is very convenient in that it often allows this location to be indexed together with its predictor locations, as in (6.2.7).] Next, recalling that E(ε_i) = 0, it follows that the variances and covariances of the predictor variables can be represented, respectively, as

(6.2.8)    var(ε_i) = E(ε_i²) = σ_ii ,   i = 1,…,n_0

(6.2.9)    cov(ε_i, ε_j) = E(ε_i ε_j) = σ_ij ,   i,j = 1,…,n_0  (j ≠ i)

In addition, the corresponding variance and covariances for the unknown error, ε_0, to be predicted can be written as

(6.2.10)    var(ε_0) = E(ε_0²) = σ²

(6.2.11)    cov(ε_0, ε_i) = σ_0i ,   i = 1,…,n_0

Notice in particular that in the variance expression (6.2.10) we have omitted subscripts and written simply σ_00 = σ². This variance will play a special role in many of the expressions to follow. Moreover, since only stationary models of covariance will actually be used in our kriging applications, this variance will be independent of the location s_0.6 In these terms, we can now write mean squared error explicitly in terms of these parameter values as follows:

(6.2.12)    MSE(λ_0) = E{[ε(s_0) − ε̂(s_0)]²} = E[ ( ε_0 − Σ_{i=1}^{n_0} λ_0i ε_i )² ]

                     = E[ ε_0² − 2 ε_0 ( Σ_{i=1}^{n_0} λ_0i ε_i ) + ( Σ_{i=1}^{n_0} λ_0i ε_i )² ]

                     = E(ε_0²) − 2 E[ ε_0 ( Σ_{i=1}^{n_0} λ_0i ε_i ) ] + E[ ( Σ_{i=1}^{n_0} λ_0i ε_i )² ]

But since

(6.2.13)    E[ ε_0 ( Σ_{i=1}^{n_0} λ_0i ε_i ) ] = E[ Σ_{i=1}^{n_0} λ_0i ε_0 ε_i ] = Σ_{i=1}^{n_0} λ_0i E(ε_0 ε_i) = Σ_{i=1}^{n_0} λ_0i σ_0i

and since the product identity

(6.2.14)    ( Σ_{i=1}^n x_i )² = ( Σ_{i=1}^n x_i )( Σ_{j=1}^n x_j ) = Σ_{i=1}^n Σ_{j=1}^n x_i x_j

implies that

(6.2.15)    E[ ( Σ_{i=1}^{n_0} λ_0i ε_i )² ] = E[ Σ_{i=1}^{n_0} Σ_{j=1}^{n_0} λ_0i λ_0j ε_i ε_j ]
                                             = Σ_{i=1}^{n_0} Σ_{j=1}^{n_0} λ_0i λ_0j E(ε_i ε_j) = Σ_{i=1}^{n_0} Σ_{j=1}^{n_0} λ_0i λ_0j σ_ij ,

6 Note that this also implies that subscripts could be dropped on all predictor variances, σ_ii. But here it is convenient to maintain these subscripts so that expressions involving all predictor variances and covariances can be stated more easily.

it follows by substituting (6.2.13) and (6.2.15) into (6.2.12) that

(6.2.16)    MSE(λ_0) = σ² − 2 Σ_{i=1}^{n_0} λ_0i σ_0i + Σ_{i=1}^{n_0} Σ_{j=1}^{n_0} λ_0i λ_0j σ_ij

Thus mean squared error, MSE(λ_0), is seen to be a simple quadratic function of the unknown vector of weights, λ_0 = (λ_0i : i = 1,…,n_0), with known coefficients given by the variance-covariance parameters in (6.2.8) and (6.2.9). This means that one can actually
minimize this function explicitly and determine the desired unknown weights. As shown
in Appendix A2, such quadratic minimization problems are easily solved in terms of
vector partial differentiation. But to illustrate the main ideas, it is instructive to consider a
simple case not requiring vector analysis.

6.2.1 Simple Kriging with One Predictor

Consider the one-predictor case shown in Figure 6.2 below. Here the task is to predict Y(s_0) on the basis of a single observation, Y(s_1), at a nearby location, s_1 [so the relevant prediction set is simply S(s_0) = {s_1}].

Figure 6.2 Single Predictor Case
[prediction Ŷ(s_0) from the single observation Y(s_1) with weight λ_01]

While such “sparse” predictions are of little interest from a practical viewpoint, the
derivation of a BLU predictor in this case is completely transparent. If we let

(6.2.17)    Y(s_i) = μ + ε(s_i) = μ + ε_i ,   i = 0,1 ,

then by (6.2.3), the linear prediction hypothesis reduces to

(6.2.18)    ε̂_0 = λ_01 ε_1 ,

so that the expression for mean squared error takes the simple form

(6.2.19)    MSE(λ_01) = E[(ε_0 − ε̂_0)²] = E[(ε_0 − λ_01 ε_1)²]
                      = σ² − 2 λ_01 σ_01 + λ_01² σ_11

where in this case, σ² = var(ε_0), σ_01 = cov(ε_0, ε_1), and σ_11 = cov(ε_1, ε_1) = var(ε_1).

A representative plot of this simple quadratic function in λ_01 is shown in Figure 6.3 below. Here it should be clear that mean squared error, MSE(λ_01), is minimized at the point, λ̂_01, shown in the figure.7
Figure 6.3 Optimal Weight Estimate
[plot of MSE(λ_01) against λ_01, minimized at λ̂_01]

Mathematically, this minimum point, λ̂_01, is characterized by the usual first-order condition that the derivative (slope) of MSE(λ_01) be zero (as shown in the figure), along with the second-order condition that this slope be increasing, i.e., that the second derivative of MSE(λ_01) be positive. By differentiating (6.2.19) twice, we see that

(6.2.20)    (d/dλ_01) MSE(λ_01) = −2σ_01 + 2σ_11 λ_01   and   (d²/dλ_01²) MSE(λ_01) = 2σ_11 > 0

Hence the second derivative is positive everywhere (as in the figure), and it follows that
the unique optimal weight, λ̂_01, is given by the solution of the first-order condition,

7 In this example, σ² = 1 = σ_11 and σ_01 = 0.5, so that the resulting optimal estimate in (6.2.21) is λ̂_01 = 0.5.

(6.2.21)    −2σ_01 + 2σ_11 λ̂_01 = 0   ⟹   λ̂_01 = σ_01 / σ_11 = (σ_11)⁻¹ σ_01

In this simple case, the interpretation of this optimal weight is also clear. Note first that if the covariance, σ_01 = cov(ε_0, ε_1), between ε_0 and ε_1 is zero (so that these random variables are uncorrelated), then λ̂_01 = 0. In other words, if they are uncorrelated, then ε_1 provides no information for predicting ε_0, and one can do no better than to ignore ε_1 altogether.8 Moreover, as this covariance increases, ε_1 is expected to provide more information about ε_0, and the optimal weight on ε_1 increases. On the other hand, as the variance, σ_11 = var(ε_1), of this predictor increases, the optimal weight, λ̂_01, decreases. This reflects the fact that a larger variance in ε_1 decreases its reliability as a predictor.

Finally, given this optimal weight, λ̂_01, it then follows from (6.2.4) together with (6.2.18) that the resulting optimal prediction, Ŷ(s_0), in Figure 6.2 is given by

(6.2.22)    Ŷ(s_0) = μ + ε̂(s_0) = μ + λ̂_01 ε_1 = μ + (σ_11)⁻¹ σ_01 ε_1

As we shall see below, these results are mirrored in the general case of more than one
predictor.
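As a quick numerical check of (6.2.21) and (6.2.22), one can plug in the values used in footnote 7 (the mean and the observed error below are additional hypothetical choices):

   % Single-predictor simple kriging check (sigma values from footnote 7)
   sigma2  = 1;     sigma11 = 1;     sigma01 = 0.5;
   mu   = 10;                         % assumed known mean (hypothetical)
   eps1 = -0.8;                       % "observed" error eps_1 = y(s_1) - mu (hypothetical)
   lambda01 = sigma01 / sigma11;      % optimal weight (6.2.21), here 0.5
   Yhat0    = mu + lambda01 * eps1;   % simple kriging prediction (6.2.22)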

6.2.2 Simple Kriging with Many Predictors

Given the above results for a single predictor, we now generalize this setting to many predictors. The main objective of this section is to reformulate (6.2.16) in vector terms, and to use this formulation to extend expression (6.2.22) to the general vector of optimal prediction weights, λ̂_0 = (λ̂_0i : i = 1,…,n_0), for Simple Kriging. A complete mathematical derivation of this result is given in Section A2.7.1 of Appendix A2. To begin with, let the full covariance matrix for ε_0 = ε(s_0), together with its corresponding prediction set of error values, ε_i = ε(s_i), be denoted by

(6.2.23)    C_0 = [ σ²        σ_01     ⋯    σ_0n_0
                    σ_01      σ_11     ⋯    σ_1n_0
                    ⋮          ⋮         ⋱     ⋮
                    σ_0n_0    σ_n_01   ⋯    σ_n_0n_0 ]

8
Note in particular that for the present case of multi-normally distributed errors, zero correlation is
equivalent to statistical independence.

The partitioning shown in this matrix identifies its relevant components. Given the ordering, i = 0,1,…,n_0, of both rows and columns, the upper left-hand corner denotes the variance of ε_0. The column vector below this value (and the row vector to the right) identifies the covariances of ε_0 with each predictor variable, ε_i, i = 1,…,n_0, and is now denoted by

(6.2.24)    c_0 = ( σ_01, …, σ_0n_0 )′

Finally, the matrix to the lower right is the covariance matrix for all predictor variables, ε_i, i = 1,…,n_0, and is now denoted by

(6.2.25)    V_0 = [ σ_11     ⋯    σ_1n_0
                    ⋮          ⋱     ⋮
                    σ_n_01   ⋯    σ_n_0n_0 ]

In these terms, the full covariance matrix, C_0, can be given the compact form,

(6.2.26)    C_0 = [ σ²    c_0′
                    c_0   V_0  ]

It is the components of this partitioned matrix that form the basic elements of all kriging analysis. In particular, for the vector of unknown weights, λ_0 = (λ_0i : i = 1,…,n_0), the mean squared error function, MSE(λ_0), in (6.2.16) can now be written in vector terms as follows

(6.2.27)    MSE(λ_0) = σ² − 2 c_0′ λ_0 + λ_0′ V_0 λ_0

[which can be checked by applying (6.2.24) and (6.2.25) together with the rules of matrix multiplication]. By minimizing this function with respect to the components of λ_0, it is shown in expression (A2.7.20) of the Appendix that the optimal weight vector, λ̂_0 = (λ̂_0i : i = 1,…,n_0), is given by

(6.2.28)    λ̂_0 = V_0⁻¹ c_0

Hence, letting ε = (ε_1,…,ε_{n_0})′ denote the vector of predictors for ε_0, it follows that the BLU predictor of ε_0 is given by

(6.2.29)    ε̂_0 = λ̂_0′ ε = c_0′ V_0⁻¹ ε

and that [as a generalization of (6.2.22)] the corresponding BLU predictor of Y(s_0) is given by

(6.2.30)    Ŷ(s_0) = μ + ε̂_0 = μ + c_0′ V_0⁻¹ ε

This predictor will generally be referred to as the Simple Kriging predictor of Y(s_0).
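In matrix form, (6.2.28) through (6.2.30) amount to a few lines of MATLAB. In the sketch below every numerical input is a hypothetical placeholder; in practice μ, c_0 and V_0 would be obtained from a fitted covariance model rather than chosen by hand:

   % Simple kriging with several predictors, as in (6.2.28)-(6.2.30)  [hypothetical inputs]
   mu     = 5;                          % known constant mean
   sigma2 = 1.0;                        % var(eps_0)
   c0     = [0.6; 0.4; 0.3];            % cov(eps_0, eps_i), i = 1,..,n0
   V0     = [1.0 0.5 0.2;
             0.5 1.0 0.4;
             0.2 0.4 1.0];              % covariance matrix of predictor errors
   y      = [5.8; 4.6; 5.3];            % observed values y(s_i)
   eps_obs = y - mu;                    % "observed" errors, as in (6.2.2)

   lambda0 = V0 \ c0;                   % optimal weights  lambda0 = V0^(-1) c0   (6.2.28)
   Yhat0   = mu + lambda0' * eps_obs;   % simple kriging predictor of Y(s_0)      (6.2.30)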

6.2.3 Interpretation of Prediction Weights

By way of comparison with the single-predictor case above, note that in the present setting, this case takes the form,

(6.2.31)    C_0 = [ σ²     σ_01
                    σ_01   σ_11 ]

so that by (6.2.24), c_0 = (σ_01) and V_0 = (σ_11). Hence it should now be clear that (6.2.21) is simply a special case of (6.2.29). Conversely, the simple interpretation of (6.2.21) can be (at least partially) extended to the present case. In particular, if the covariances between ε_0 and all predictor variables, ε_i, i = 1,…,n_0, are zero, i.e., if c_0 = (0,…,0)′, then by (6.2.28) we see that λ̂_0 = (0,…,0)′. Hence in this case it is again clear that these predictors provide no information. More generally, suppose that all predictors are uncorrelated, i.e., that σ_ij = 0 for all i, j = 1,…,n_0 (i ≠ j). Then V_0 reduces to a positive diagonal matrix with inverse given by the diagonal of reciprocals, i.e.,

(6.2.32)    V_0 = diag( σ_11, …, σ_n_0n_0 )   ⟹   V_0⁻¹ = diag( 1/σ_11, …, 1/σ_n_0n_0 )

(which can be checked by simply multiplying to obtain V_0 V_0⁻¹ = I_{n_0}). Hence by (6.2.24) and (6.2.29) we see that in this case all weights are the same as in the single-predictor case, i.e., that

(6.2.33)    λ̂_0i = (σ_ii)⁻¹ σ_0i ,   i = 1,…,n_0

So if all predictors are uncorrelated, then the contribution of each predictor, ε_i, to ε̂_0 in (6.2.3) is the same as if it were a single predictor. In particular, it has zero contribution if and only if it is uncorrelated with ε_0.

However, if such predictors are to some degree correlated, then optimal prediction involves a rather complex interaction between the covariances, V_0, among predictors and their covariances, c_0, with ε_0. In particular, even if σ_0i = 0, it is possible that interactions between both ε_0 and ε_i with other predictors may result in either positive or negative values for λ̂_0i. As one illustration, suppose there are two predictors, (ε_1, ε_2), with

(6.2.34)    V_0 = [ 1  1/2 ; 1/2  1 ] ,   c_0 = ( 0, 1/2 )′

so that ε_1 is uncorrelated with ε_0, but both have positive covariance (1/2) with ε_2. Then it can be verified in this case that

(6.2.35)    λ̂_0 = V_0⁻¹ c_0 = [ 4/3  −2/3 ; −2/3  4/3 ] ( 0, 1/2 )′ = ( −1/3, 2/3 )′

So even though all covariances (and hence correlations) are nonnegative, the optimal weight on ε_1 is actually negative. This shows that in the general case the interpretation of individual weights is much more complex. Indeed, it turns out in this case that the only quantity that can meaningfully be interpreted is the full linear combination of predictors in (6.2.29), i.e.,

(6.2.36)    ε̂_0 = λ̂_0′ ε = Σ_{i=1}^{n_0} λ̂_0i ε_i

which in the above example takes the form,

(6.2.37)    ε̂_0 = −(1/3) ε_1 + (2/3) ε_2

As expected, we see that ε_2 contributes positively to the prediction, ε̂_0, and makes a more influential contribution than ε_1. But the negative influence of ε_1 is less intuitive. To gain further insight here, notice that by definition,

(6.2.38)    cov(ε̂_0, ε_0) = E(ε̂_0 ε_0) = E(λ_0′ε ε_0) = λ_0′ E(ε ε_0) = λ_0′ c_0 ,

and similarly that

(6.2.39)    var(ε̂_0) = var(λ_0′ε) = λ_0′ cov(ε) λ_0 = λ_0′ V_0 λ_0

Hence mean squared error, MSE(λ_0), can also be written as

(6.2.40)    MSE(λ_0) = σ² − 2 cov(ε̂_0, ε_0) + var(ε̂_0)

But since σ² is a constant not involving ε̂_0, it becomes clear that minimization of MSE(λ_0) essentially involves a tradeoff between the covariance of the predictor ε̂_0 with ε_0 and the variance of the predictor itself. Indeed, this is the proper generalization of the original interpretation given in the single-predictor case, where the relevant covariance and variance in that case were simply σ_01 and σ_11, respectively. Moreover, the form of this tradeoff in (6.2.40) makes it clear that to minimize MSE(λ_0), one needs a predictor ε̂_0 with positive covariance, cov(ε̂_0, ε_0), as large as possible, while at the same time having a variance, var(ε̂_0), as small as possible. It is from this viewpoint that the negativity of λ̂_01 in (6.2.35) can be made clear. To see this, observe that since σ_01 = 0, covariance in this case takes the form

(6.2.41)    cov(ε̂_0, ε_0) = λ_0′ c_0 = λ_01 σ_01 + λ_02 σ_02 = λ_02 σ_02

But since σ_02 = 1/2 > 0, it follows that this covariance can only be positive if λ_02 > 0. Turning next to variance, observe that for any two-predictor case,

(6.2.42)    var(ε̂_0) = λ_0′ V_0 λ_0 = ( λ_01, λ_02 ) [ σ_11  σ_12 ; σ_12  σ_22 ] ( λ_01, λ_02 )′

                      = λ_01² σ_11 + 2 λ_01 λ_02 σ_12 + λ_02² σ_22

                      = ( λ_01² σ_11 + λ_02² σ_22 ) + 2 λ_01 λ_02 σ_12

But since the first term is always positive, and since σ_12 = 1/2 > 0, we see from the positivity of λ_02 above that var(ε̂_0) can only be made small by requiring that λ_01 < 0. In short, since ε_1 has no effect on the covariance of the predictor, ε̂_0, with ε_0, its best use for prediction is to shrink the variance of ε̂_0 by setting λ_01 < 0.
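This two-predictor example is easy to verify numerically; the following MATLAB lines use exactly the matrices of (6.2.34) and return the weights in (6.2.35):

   V0 = [1 0.5; 0.5 1];   c0 = [0; 0.5];    % values from (6.2.34)
   lambda0 = V0 \ c0                        % = (-1/3, 2/3)', as in (6.2.35)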

Before using these kriging weights for prediction, it is of natural interest to consider their spatial nature. In particular, referring again to our initial illustration in Figure 6.1, it would seem reasonable that points, s_i, closer to s_0 should have larger weights, λ̂_0i. In particular, if we invoke the "standard covariogram" assumption of Figure 4.1 in Section 4, namely that covariances decrease with distance, then points further away should contribute less to the prediction of Y(s_0). But for Simple Kriging predictors this is simply not the case. One simple example is shown in Figure 6.4 below:

   point      s_1     s_2     s_3     s_4
   distance   1.41    2.24    2.83    3.16
   weight     .555    .045    .306    .095
   rank       1       4       2       3

Figure 6.4 Weighting versus Distance

Here points (s_1, s_2, s_3, s_4) are ordered in terms of their distance from the prediction point, s_0, as shown in the second row of the table.9 To calculate weights in this case, a simple exponential covariogram was used.10 So in this spatially stationary setting, covariances are strictly decreasing in distance. Hence the key point to notice is that the kriging weights (λ̂_01, λ̂_02, λ̂_03, λ̂_04) in the third row of the table are not decreasing in distance. Indeed, the second closest point, s_2, is here the least influential of the four (as depicted by the ranking of weights in the last row of the table). Notice that since s_1 and s_2 are closer to each other than to s_0, and since distances are in this case inversely related to correlations,11 the errors ε(s_1) and ε(s_2) are more correlated with each other than either is with ε(s_0). So it might be argued here that ε(s_2) is adding little prediction information for ε(s_0) beyond that in ε(s_1). But notice that the influence of points s_3 and s_4 is also reversed, and that no such relative correlation effects are present here. So even in this simple monotone-covariance setting, it is difficult to draw general conclusions about the exact relation between distance and kriging weights.
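Readers who wish to reproduce the spirit of this experiment can do so with the point coordinates of footnote 9 and the covariogram parameters of footnote 10. Since expression (4.6.6) itself is not reproduced here, the MATLAB sketch below simply assumes the common exponential form C(h) = s·exp(−3h/r); with a different covariogram convention the computed weights may differ somewhat from those in the table, but the non-monotone ordering is the point of interest:

   % Simple kriging weights for the point configuration of Figure 6.4
   s0 = [0 0];                                   % prediction point (footnote 9)
   S  = [1 1; 2 1; -2 2; -1 3];                  % s_1,..,s_4 (footnote 9)
   r  = 30;   sill = 1;                          % range and sill (footnote 10), nugget = 0
   C  = @(h) sill*exp(-3*h/r);                   % assumed exponential covariogram form

   d0 = sqrt(sum((S - s0).^2, 2));               % distances from s_0 (row 2 of the table)
   D  = sqrt((S(:,1)-S(:,1)').^2 + (S(:,2)-S(:,2)').^2);   % pairwise distances
   c0 = C(d0);   V0 = C(D);                      % covariances with eps_0 and among predictors
   lambda0 = V0 \ c0                             % weights: compare with row 3 of the table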

While these illustrations are necessarily selective in nature, they do serve to emphasize
the complexity of possible interaction effects in MMSE prediction. Given this
development of Simple Kriging predictors, we turn now to the single most important
justification for such stochastic predictors, namely the construction of meaningful
prediction intervals for possible realized values of Y ( s0 ) .

6.2.4 Construction of Prediction Intervals

Note that up to this point we have relied only on knowledge of the means and
covariances of the spatial error process, {ε(s) : s ∈ R}, to derive optimal predictors. But to
develop prediction intervals for these errors, we must now make explicit use of the

9 The actual point coordinates are s_0 = (0,0), s_1 = (1,1), s_2 = (2,1), s_3 = (−2,2) and s_4 = (−1,3).
10 With respect to the notation in expression (4.6.6) of Section 4, the range, sill, and nugget parameters used were (r = 30, s = 1, a = 0).
11 Recall from (3.3.13) in Section 3 that spatially stationary correlations are proportional to covariances.

distributional assumption of multi-normality. In terms of (6.2.26), this assumption implies in particular that for any prediction site, s_0 ∈ R, and corresponding prediction set, S(s_0) = {s_i : i = 1,…,n_0}, the random (column) vector of errors,

(6.2.43)    ( ε(s_0), ε(s_1), …, ε(s_{n_0}) )′ = ( ε_0, ε )′

is multi-normally distributed as12

(6.2.44)    ( ε_0, ε )′ ~ N( ( 0, 0_{n_0}′ )′ ,  [ σ²   c_0′
                                                   c_0  V_0  ] )

Our primary application of this distribution will be to derive the distribution of the associated prediction error in (6.1.3), which we now write simply as e_0 = e(s_0). But before proceeding, it is important to emphasize once again the distinction between ε_0 and e_0. Recall that ε_0 is the deviation of Y(s_0) about its mean [Y(s_0) = μ + ε_0], while e_0 is the difference between Y(s_0) and its predicted value [e_0 = Y(s_0) − Ŷ(s_0)].

To derive the distribution of e_0 from that of the random error vector, (ε_0, ε),13 we begin by using (6.2.1), (6.2.4) and (6.2.29) to write e_0 in terms of (ε_0, ε) as follows,

(6.2.45)    e_0 = Y(s_0) − Ŷ(s_0) = ε(s_0) − ε̂(s_0) = ε_0 − ε̂_0
                = ε_0 − λ̂_0′ ε = ( 1, −λ̂_0′ ) ( ε_0, ε )′
Hence e_0 is seen to be a linear compound of (ε_0, ε). This, together with the multi-normality of (ε_0, ε), implies at once from the Invariance Theorem in Section 3.2.2 above that e_0 must also be normally distributed. Moreover, since we have already seen in (6.1.4) that E(e_0) = 0, it follows that if we can calculate the variance of e_0, then its distribution will be completely determined.

12 Here 0_{n_0} denotes the n_0-dimensional zero vector.
13 Note that technically this vector should be written as (ε_0, ε′)′ to indicate that it is a column vector. But for the sake of visual clarity, we write simply (ε_0, ε).

In view of the importance of this particular variance, we derive it in two ways. First we derive it directly from the covariance-transformation identity in (3.2.21) of Section 3. In particular, for any linear compound, a′X, of a random vector, X, with covariance matrix, Σ, it follows at once from (3.2.21) [with A = a′] that

(6.2.46)    var(a′X) = a′ Σ a

Hence by letting

(6.2.47)    X = ( ε_0, ε )′ ,   a = ( 1, −λ̂_0′ )′ ,   Σ = [ σ²   c_0′
                                                            c_0  V_0  ]

it follows from (6.2.45) and (6.2.46) that

(6.2.48)    var(e_0) = ( 1, −λ̂_0′ ) [ σ²  c_0′ ; c_0  V_0 ] ( 1, −λ̂_0′ )′

                     = ( 1, −λ̂_0′ ) ( σ² − c_0′ λ̂_0 ,  c_0 − V_0 λ̂_0 )′

                     = ( σ² − c_0′ λ̂_0 ) − λ̂_0′ c_0 + λ̂_0′ V_0 λ̂_0

But since for any vectors, x = (x_1,…,x_n)′ and y = (y_1,…,y_n)′, it must be true that x′y = Σ_i x_i y_i = Σ_i y_i x_i = y′x, we see that (6.2.48) can be reduced to

The form of the right hand side should look familiar. In particular, the representation of
mean squared error, MSE (0 ) , in (6.2.27) now yields the identity,

(6.2.50) var(e0 )  MSE (ˆ0 )

This relation is no coincidence. Indeed, recall from (6.1.5) that for any unbiased
predictor, ˆ0 ,

(6.2.51) E[( 0  ˆ0 ) 2 ]  E (e02 )  var(e0 ) ,

so that its mean squared error is identically equal to the variance of its associated
prediction error. So for the optimal predictor in particular, this variance must be given by
the mean squared error evaluated at ̂0 . Indeed we could have derived (6.2.49) through
this line of reasoning. Hence the direct derivation in (6.2.45) through (6.2.48) offers an
instructive confirmation of this fact.

To complete this derivation, it suffices to substitute the solution for λ̂_0 in (6.2.28) [i.e., λ̂_0 = V_0⁻¹ c_0] into (6.2.49) to obtain,

(6.2.52)    var(e_0) = σ² − 2 c_0′ [V_0⁻¹ c_0] + [V_0⁻¹ c_0]′ V_0 [V_0⁻¹ c_0]
                     = σ² − 2 c_0′ V_0⁻¹ c_0 + c_0′ V_0⁻¹ (V_0 V_0⁻¹) c_0
                     = σ² − 2 c_0′ V_0⁻¹ c_0 + c_0′ V_0⁻¹ (I_{n_0}) c_0
                     = σ² − 2 c_0′ V_0⁻¹ c_0 + c_0′ V_0⁻¹ c_0

By combining the last two terms, we obtain the final expression for prediction error variance (also called Kriging variance),

(6.2.53)    σ_0² = var(e_0) = σ² − c_0′ V_0⁻¹ c_0

where we have now introduced the simplifying notation (σ_0²) for this important quantity. While this expression for σ_0² is most useful for computational purposes, it is of interest to develop an alternative expression that is easier to interpret. To do so, if we now use the simplifying notation, Y_0 = Y(s_0), for the variable to be predicted at location s_0 ∈ R, then [as a consequence of (3.2.21)] it follows that the first term in (6.2.53) is simply the variance of Y_0, since

(6.2.54)    var(Y_0) = var(μ + ε_0) = var(ε_0) = σ²

Similarly, if we also represent the corresponding predictor in (6.2.30) by Yˆ0  Yˆ ( s0 ) , then


the second term in (6.2.53) turns out to be precisely the variance of Yˆ . To see this, note
0

simply from (6.2.3) together with (6.2.5) and (3.2.21) that

(6.2.55) var(Yˆ0 )  var(   c0V01 )  var(c0V01 )

 c0V01 cov( )V01c0  c0V01 (V0 )V01c0

 c0V01 (V0 V01 ) c0  c0V01c0

So the prediction error variance in (6.2.53) can be equivalently rewritten as

(6.2.56) $\sigma_0^2 = \mathrm{var}(Y_0) - \mathrm{var}(\hat{Y}_0)$


In these terms, it is clear that the prediction error variance, $\sigma_0^2$, is smaller than the original variance of $Y_0$. Moreover, the amount of this reduction is seen to be precisely the variance "explained" by the predictor, $\hat{Y}_0$. Indeed, it can be argued that this reduction in variance is the fundamental rationale for kriging predictions, often referred to as "borrowing strength from neighbors".

Given this expression for prediction error variance, it follows at once from the arguments above that the prediction error, $e_0$, must be normally distributed as

(6.2.57) $e_0 \sim N(0, \sigma_0^2)$

Hence the task remaining is to use this normal distribution of $e_0$ [$= Y_0 - \hat{Y}_0$] to construct prediction intervals for $Y_0$ in terms of $\hat{Y}_0$ and $\sigma_0^2$. To do so, we first recall from Sections 3.1.1 and 3.1.2 above that the standardization of $e_0$ must be distributed as $N(0,1)$. In particular, since the mean of $e_0$ is zero, and since the standard deviation of $e_0$ is given from (6.2.53) by $\sqrt{\mathrm{var}(e_0)} = \sigma_0$, it follows that

(6.2.58) $\dfrac{e_0}{\sigma_0} = \dfrac{Y_0 - \hat{Y}_0}{\sigma_0} \sim N(0,1)$

Hence it now becomes clear that, together with Yˆ0 , the key distributional parameter is the
standard deviation,  0 , of e0 , which is usually designated as the standard error of
prediction. Indeed, as will be seen below, the fundamental outputs of all kriging software
are precisely estimates of the kriging prediction, Yˆ0 , and standard error of prediction,
 0 , at all relevant prediction locations, s0 .

To construct prediction intervals for $Y_0$ based on (6.2.52), we proceed in a manner paralleling the two-tailed Clark-Evans test procedure in Section 3.2.2 of Part I. In particular, by recalling from (3.1.32) that $\Phi$ denotes the cumulative distribution function for $N(0,1)$, and that for any probability, $\alpha$, the $\alpha$-critical value, $z_\alpha$, is defined by $\Phi(z_\alpha) = 1 - \alpha$ [as in the figure below for $\alpha/2$], it follows that

(6.2.59) $\Pr\left( -z_{\alpha/2} \,\leq\, \dfrac{Y_0 - \hat{Y}_0}{\sigma_0} \,\leq\, z_{\alpha/2} \right) = 1 - \alpha$

[Figure: standard normal density with central area $1-\alpha$ between $-z_{\alpha/2}$ and $z_{\alpha/2}$, and area $\alpha/2$ in each tail]


But since the following events are equivalent:

(6.2.60) $-z_{\alpha/2} \leq \dfrac{Y_0 - \hat{Y}_0}{\sigma_0} \leq z_{\alpha/2} \;\Leftrightarrow\; -\sigma_0 z_{\alpha/2} \leq Y_0 - \hat{Y}_0 \leq \sigma_0 z_{\alpha/2}$
$\;\Leftrightarrow\; \hat{Y}_0 - \sigma_0 z_{\alpha/2} \leq Y_0 \leq \hat{Y}_0 + \sigma_0 z_{\alpha/2}$

it follows that their probabilities must be the same, and hence from (6.2.59) that,

(6.2.61) $\Pr\left( \hat{Y}_0 - \sigma_0 z_{\alpha/2} \,\leq\, Y_0 \,\leq\, \hat{Y}_0 + \sigma_0 z_{\alpha/2} \right) = 1 - \alpha$

In other words, the probability that the value of $Y_0$ lies between $\hat{Y}_0 - \sigma_0 z_{\alpha/2}$ and $\hat{Y}_0 + \sigma_0 z_{\alpha/2}$ is $1 - \alpha$. In terms of confidence levels, this means that we can be $100(1-\alpha)\%$ confident that $Y_0$ lies in the prediction interval,

(6.2.62) $\left[\, \hat{Y}_0 - \sigma_0 z_{\alpha/2}\, ,\; \hat{Y}_0 + \sigma_0 z_{\alpha/2} \,\right] \;\equiv\; \hat{Y}_0 \pm \sigma_0 z_{\alpha/2}$

The single most common instance of (6.2.62) is the case $\alpha = 0.05$, with corresponding critical value $z_{\alpha/2} = z_{.025} = 1.96$. In this case, one can thus be 95% confident that $Y_0$ lies in the prediction interval,

(6.2.63) $\left[\, \hat{Y}_0 - (1.96)\,\sigma_0 \,,\; \hat{Y}_0 + (1.96)\,\sigma_0 \,\right] \;\equiv\; \hat{Y}_0 \pm (1.96)\,\sigma_0$

As with all statistical confidence statements, the phrase "95% confident" here means that if we were able to carry out this same prediction procedure many times (i.e., to take many random samples from the joint distribution of $Y_0$ and its kriging prediction, $\hat{Y}_0$), then we would expect the realized values of $Y_0$ to lie in the corresponding realized intervals [$\hat{Y}_0 \pm (1.96)\,\sigma_0$] about 95% of the time.

Finally it should again be emphasized that it is the ability to make confidence statements
of this type that distinguishes stochastic prediction methods from the deterministic
methods of spatial interpolation developed in Section 5.
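
As a small computational aside, an interval of the form (6.2.63) is trivial to construct once a kriging prediction and standard error are in hand. The following MATLAB fragment is only an illustrative sketch: the values of Y0_hat and sig0 are hypothetical placeholders, and the critical value is computed from the built-in function erfcinv.

Y0_hat = 3.05;                     % hypothetical kriging prediction at s0
sig0   = 0.77;                     % hypothetical standard error of prediction at s0
alpha  = 0.05;                     % confidence level 100(1-alpha)% = 95%
z      = sqrt(2)*erfcinv(alpha);   % critical value z_{alpha/2} (= 1.96 for alpha = 0.05)
PI     = [Y0_hat - z*sig0 , Y0_hat + z*sig0]   % prediction interval as in (6.2.62)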


6.2.5 Implementation of Simple Kriging Models


Given the theoretical development of Simple Kriging above, the task remaining is to
make this procedure operational. But before doing so, it should again be emphasized, as
in Section 6.1.2 above, that from a practical viewpoint, Ordinary Kriging is almost
always used in empirical situations where Simple Kriging is relevant. Hence the main
relevance of this procedure for our purposes is to develop as many of the basic concepts
as possible within this simple setting. It should also be noted that this Simple Kriging
procedure is one of the options available in the Geostatistical Analyst extension of
ARCMAP. So we will be able to use the implementation developed here to illustrate
most of the operational procedures involved in the use of this software. With this in mind,
we now proceed to operationalize Simple Kriging through a series of procedural steps.
This will be followed in Section 6.2.6 below with an application of this procedure.

In the following development, we again postulate that the values of some variable Y
defined over a relevant region R can be modeled by a spatial stochastic process,
{Y ( s )     ( s ) : s  R} , with constant mean,  . In addition, we assume the existence of
a given set of n observations (data points), { yi  y ( si ):i  1,.., n} in R, where of course
each data point, yi , is taken to be a realization of the corresponding random variable,
Yi  Y ( si ) in this spatial stochastic process. Also, for purposes of illustration, we shall
again consider the problem of predicting, Y ( s0 ) , at a single given location, s0  R , with
respect to a given prediction set, S ( s0 )  {s1 ,.., sn 0 }  {s1 ,.., sn } . Within this framework,
we can operationalize the Simple Kriging model as follows:

Step 1. Estimation of the Mean

Recall from the assumption in (6.1.6) that our first task is to produce an estimate of the mean, $\mu$, outside the Simple Kriging model. Here the obvious choice is just to use the sample mean of the given data, i.e.,

(6.2.62) $\hat{\mu} = \bar{y}_n = \tfrac{1}{n}\sum_{i=1}^{n} y(s_i) = \tfrac{1}{n}\sum_{i=1}^{n} y_i$

One attractive feature of this estimate is that it is always unbiased, since

(6.2.63) $E(\hat{\mu}) = E\left( \tfrac{1}{n}\sum_{i=1}^{n} Y_i \right) = \tfrac{1}{n}\sum_{i=1}^{n} E(Y_i) = \tfrac{1}{n}\sum_{i=1}^{n} \mu = \mu$

So even though these random variables are spatially correlated, this has no effect on
unbiasedness. What spatial correlation does imply is that the variance of this estimator is
much larger than that of the classical sample mean under independence. We shall return
to this issue in the development of Ordinary Kriging in Section 6.3 below.


Step 2. Estimation of Covariances

Recall next from assumption (6.1.7) that the covariances, $\mathrm{cov}[\varepsilon(s), \varepsilon(s')]$, are assumed to be given for all locations, $s, s' \in R$. But we must of course provide some prior estimates
of these covariances. This was in fact one of the primary motivations for the assumption
of covariance stationarity in Section 3.3.2 above. Hence we now invoke this assumption
in order to estimate spatial covariances in a manner that accounts for spatial correlation
effects. Recall also from Section 4.10.1 that, unlike the mean above, the classical estimate
of covariance is biased in the presence of spatial correlation. So our estimation procedure
here will always start with variograms rather than covariograms. Fortunately, this basic
estimation procedure is exactly the same as that used for Ordinary Kriging, and indeed,
for all more advanced kriging models. So it is worthwhile to develop this procedure in
detail here.

To do so, we begin by recalling from (3.3.7) and (3.3.11) in Section 3 that under covariance stationarity, all covariances can be summarized by a covariogram, $C(h)$. As emphasized in Section 4, this is best estimated by first estimating a variogram, $\gamma(h; r, s, a)$, with parameters, $r =$ range, $s =$ sill, and $a =$ nugget. Since the common variance, $C(0) = \sigma^2$, is precisely the sill parameter, $s$, one can then obtain the desired covariogram from the identity in (4.1.7) of Section 4, namely14

(6.2.64) $C(h) = \sigma^2 - \gamma(h) = s - \gamma(h; r, s, a)$

Hence, the estimation procedure starts by using the MATLAB program, var_spher_plot, together with the full sample data set above to obtain estimates, $(\hat{r}, \hat{s}, \hat{a})$, of the spherical variogram parameters. The estimated spherical variogram, $\gamma(h; \hat{r}, \hat{s}, \hat{a})$, is then used together with (6.2.64) to obtain an estimate, $\hat{C}(h)$, of the desired covariogram as follows:

(6.2.65) $\hat{C}(h) = \hat{s} - \gamma(h; \hat{r}, \hat{s}, \hat{a})$

Recall that for any pair of points, $s, s' \in R$, separated by distance $\| s - s' \| = h$, the quantity, $\hat{C}(h)$, then yields an estimate of $\mathrm{cov}[\varepsilon(s), \varepsilon(s')]$, i.e.,

(6.2.66) $\widehat{\mathrm{cov}}[\varepsilon(s), \varepsilon(s')] = \hat{C}(\| s - s' \|)$

Using this identity, we can then estimate the full covariance matrix, $C_0$, relevant for prediction at $s_0$ [as in (6.2.23) above]. In particular, if we let $d_{ij} = \| s_i - s_j \|$ for each pair of points, $s_i, s_j \in \{s_0, s_1,.., s_{n_0}\}$, and [as instances of (6.2.66)] set

(6.2.67) $\hat{\sigma}_{ij} = \hat{C}(d_{ij})$

14
Again, remember not to confuse the symbol, s , for “sill” with points, s  ( s1 , s2 )  R .


then we immediately obtain the following estimate, $\hat{C}_0$, of $C_0$,

(6.2.68) $\hat{C}_0 = \begin{pmatrix} \hat{\sigma}^2 & \hat{c}_{01} & \cdots & \hat{c}_{0 n_0} \\ \hat{c}_{10} & \hat{c}_{11} & \cdots & \hat{c}_{1 n_0} \\ \vdots & \vdots & & \vdots \\ \hat{c}_{n_0 0} & \hat{c}_{n_0 1} & \cdots & \hat{c}_{n_0 n_0} \end{pmatrix} = \begin{pmatrix} \hat{\sigma}^2 & \hat{c}_0' \\ \hat{c}_0 & \hat{V}_0 \end{pmatrix}$

Note in particular that the common variance, $\sigma^2$, of all random variables is again estimated by the sill, since

(6.2.69) $\hat{\sigma}^2 = \hat{C}(0) = \hat{s}$
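
To make these covariance estimates concrete, the following MATLAB fragment sketches how the fitted spherical covariogram in (6.2.65) yields the matrix in (6.2.68). This is only a schematic illustration of what var_spher_plot and the kriging programs below do internally; the parameter values and the variable names s0 (a 1-by-2 location) and S0 (the n0-by-2 matrix of prediction-set coordinates) are assumptions made purely for illustration.

r_hat = 21631;  s_hat = 1.356;  a_hat = 0.340;   % illustrative (range, sill, nugget) estimates
% Spherical variogram gamma(h) and derived covariogram C(h) = sill - gamma(h)
gam = @(h) a_hat.*(h > 0) + (s_hat - a_hat).*( (h >= r_hat) + ...
           (h > 0 & h < r_hat).*( 1.5*(h./r_hat) - 0.5*(h./r_hat).^3 ) );
Cov = @(h) s_hat - gam(h);
% Pairwise distances among {s0, s1,..,s_n0} (uses implicit expansion; recent MATLAB releases)
P  = [s0; S0];
D  = sqrt( (P(:,1) - P(:,1)').^2 + (P(:,2) - P(:,2)').^2 );
C0_hat = Cov(D);                  % estimated covariance matrix in (6.2.68)
c0_hat = C0_hat(2:end,1);         % covariances between s0 and its prediction set
V0_hat = C0_hat(2:end,2:end);     % covariances among the prediction-set points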

Step 3. Estimation of Kriging Predictions

Finally, given these parameter estimates, we are ready to estimate the Simple Kriging
prediction, $\hat{Y}(s_0)$, of $Y(s_0)$. To do so, begin by recalling that the deviation error, $\varepsilon_i = y_i - \mu$, at each data point, $i = 1,..,n_0$, can now be estimated in terms of (6.2.62) by

(6.2.70) $\hat{\varepsilon}_i = y_i - \hat{\mu}$

So if we now designate the corresponding estimate of the deviation predictors, $\varepsilon = (\varepsilon_1,..,\varepsilon_{n_0})'$, for $\varepsilon_0 = \varepsilon(s_0)$ by

(6.2.71) $\hat{\varepsilon} = [\hat{\varepsilon}_i : s_i \in S(s_0)] = (\hat{\varepsilon}_1,..,\hat{\varepsilon}_{n_0})'$

then it follows from (6.2.29) that the Simple Kriging prediction of $\varepsilon_0$ is given by

(6.2.72) $\hat{\varepsilon}_0 = \hat{c}_0'\, \hat{V}_0^{-1} \hat{\varepsilon}$

Finally, by using (6.2.30) together with these estimates, it follows that the Simple Kriging prediction of $Y_0 = Y(s_0)$ is given by15

(6.2.73) $\hat{Y}_0 = \hat{Y}(s_0) = \hat{\mu} + \hat{c}_0'\, \hat{V}_0^{-1} \hat{\varepsilon}$

To complete the implementation of Simple Kriging, it remains only to estimate the


corresponding prediction error variance (or Kriging variance) in (6.2.53) by

15
Here it should be noted that for simplicity, we have used the same notation for the theoretical and
estimated Simple Kriging prediction, Yˆ0 (and ˆ0 ).


(6.2.74) $\hat{\sigma}_0^2 = \hat{s} - \hat{c}_0'\, \hat{V}_0^{-1} \hat{c}_0$

and take its square root,

(6.2.75) $\hat{\sigma}_0 = \sqrt{\hat{s} - \hat{c}_0'\, \hat{V}_0^{-1} \hat{c}_0}$

to be the relevant estimate of the standard error of prediction at location $s_0$. The pair of values $(\hat{Y}_0, \hat{\sigma}_0)$ can then be used as in (6.2.61) to estimate the (default) 95% prediction interval for $Y_0$, namely,

(6.2.76) $[\, \hat{Y}_0 \pm (1.96)\, \hat{\sigma}_0 \,]$

One final comment should be made about these estimates. In the theoretical development
of Section 6.2.2, the predictors ˆ0 and Yˆ0 were derived as Best Linear Unbiased (BLU)
predictors. This is only accurate if the true mean,  , and covariances, C0 , are known –
which is of course almost never the case. So to be accurate, the above values ˆ0 and
Yˆ are in fact only estimates of BLU predictors. This distinction is often formalized by
0

designating them as Empirical-BLU predictors. Similarly, as with all prediction intervals


or confidence intervals based on estimated parameters, the variation of these parameter
estimates is of course not accounted for in these intervals themselves. So again, a more
precise statement would be to designate (6.2.76) as an estimated 95% prediction interval.
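
Putting the three steps together, the core Simple Kriging computation can be sketched in a few lines of MATLAB. This is only a schematic outline of what krige_simple.m computes (not its actual code); it assumes that y is the full data vector, y0_set holds the data values at the prediction set S(s0), and c0_hat, V0_hat and s_hat are the covariogram-based estimates constructed as above.

mu_hat  = mean(y);                     % Step 1: sample-mean estimate of mu, as in (6.2.62)
eps_hat = y0_set - mu_hat;             % residuals at the prediction-set locations (6.2.70)
w       = V0_hat \ c0_hat;             % kriging weights, V0_hat^{-1} c0_hat
Y0_hat  = mu_hat + w'*eps_hat;         % Simple Kriging prediction (6.2.73)
sig0    = sqrt(s_hat - c0_hat'*w);     % standard error of prediction (6.2.75)
PI95    = [Y0_hat - 1.96*sig0 , Y0_hat + 1.96*sig0];   % 95% prediction interval (6.2.76)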

6.2.6 An Example of Simple Kriging

Given the estimation procedure above, we now illustrate an application of Simple Kriging
in terms of the Vancouver Nickel data in Section 4.9 above. But before developing this
example, it is important to emphasize that the underlying normality assumption on all
spatially-dependent random effects,  ( s ) , is crucial for the estimation of meaningful
prediction intervals. Moreover, since these random effects are not directly observable,
this distributional assumption can only be checked indirectly. But by assuming that there
are no global trends (as in Simple and Ordinary Kriging), it should be clear from the
identity

(6.2.77) $Y(s_i) = \mu + \varepsilon(s_i), \quad i = 1,..,n$

that these random effects differ from the observed data, $\{y(s_i) : i = 1,..,n\}$, only by a (possibly unobserved) constant, $\mu$. Moreover, since the variance, $\sigma^2 = \mathrm{var}[\varepsilon(s_i)] = \mathrm{var}[Y(s_i)]$, is constant for all covariance-stationary processes, it follows that under this additional assumption, the marginal distributions must be the same for all $Y$ data, namely


(6.2.78) $Y(s_i) \sim N(\mu, \sigma^2), \quad i = 1,..,n$

So even though these are not independent samples from this common distribution, it is
still reasonable to expect that the histogram of this data should look approximately
normal. This motivates the following simple test of normality.

Normal Quantile Plots and Transformations

A very simple and appealing test of normality is available in JMP, known as Normal
Quantile Plots (also called Normal Probability Plots). The main appeal of this test is that
it is graphical, and in addition, provides global information about possible failures of
normality. The idea is very simple. Given a set of data $(y_1,...,y_n)$ from an unknown distribution, one first reorders the data (if necessary) so that $y_1 \leq y_2 \leq \cdots \leq y_n$, and then standardizes it by subtracting the sample mean, $\bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i$, and dividing by the sample standard deviation, $s_n = \left[ \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y}_n)^2 \right]^{1/2}$, to obtain:

(6.2.79) $z_i = \dfrac{y_i - \bar{y}_n}{s_n}, \quad i = 1,..,n$

Now if $(y_1,...,y_n)$ were coming from a normal distribution, then $(z_1,...,z_n)$ should be approximately distributed as $Z_i \sim N(0,1),\ i = 1,..,n$ [we are using only estimated means and standard deviations here]. So for an independent sample $(Z_1,..,Z_n)$ of size $n$ from $N(0,1)$, if we compute the theoretical expected values, $\mu_i = E(Z_i),\ i = 1,..,n$, then we would expect on average that the observed values $z_i$ in (6.2.79) should be reasonably close to their expected values, $\mu_i$. This in turn implies that if we plot $z_i$ against $\mu_i$, the points should lie close to the 45° line. This is illustrated in Figure 6.5 below, where a sample of size $n = 100$ has been simulated in JMP (using Formula → Random → Random Normal).

Figure 6.5 Normal Quantile Plot



The values on the vertical axis are exactly the $z_i$ values, together with their histogram shown on the left. The Normal Quantile Plot is displayed on the right (using the procedure detailed in Assignment 4). The values on the horizontal axis at the top of the figure are precisely the expected values, $\mu_i$, for each $z_i,\ i = 1,..,100$.16 Here it is clear that all point pairs are indeed close to the 45° line (shown in red). The dashed lines denote 95% probability intervals on the realized values $z_i$, so that if the sample were normal (as in this simulation) then each dot should lie between these bands about 95% of the time.17 For example, the middle sample value, $z_{50}$, with expected value, $\mu_{50} = E(Z_{50}) \approx 0$, should lie in the interval between these two bands on the vertical green center line about 95% of the time. So this plot provides compelling evidence that this sample is indeed coming from a normal distribution.
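
While the plot above was produced in JMP, essentially the same construction can be sketched in MATLAB. The fragment below is only an illustrative approximation: it uses simple plotting positions for the expected quantiles rather than the exact order-statistic means used by JMP, and it omits the Lilliefors probability bands.

y  = randn(100,1);                     % simulated normal sample (illustration only)
n  = numel(y);
z  = sort( (y - mean(y)) ./ std(y) );  % ordered standardized values z_i
p  = ((1:n)' - 0.5) ./ n;              % plotting positions approximating Phi(mu_i)
mu = -sqrt(2).*erfcinv(2*p);           % approximate expected normal quantiles mu_i
plot(mu, z, 'o', mu, mu, 'r-');        % data points and the 45-degree reference line
xlabel('Expected normal quantile');  ylabel('Standardized data value');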

We now apply this tool to the Nickel data, as shown in Figure 6.6 below. For ease of comparison with Figure 6.7, the histogram and corresponding normal quantile plot are shown using the horizontal display option18. (The only difference here is that the Normal Quantile Plot is now above the histogram, with $\mu_i$ values on the vertical axis to the right.)
Since most data observed in practice is nonnegative (i.e., is truncated at zero), the
corresponding histograms tend to be “skewed to the right”, as illustrated by this Nickel
data.

Figure 6.6. Nickel Data Figure 6.7. Log-Nickel Data

16
The values on the bottom horizontal axis are the associated cumulative probabilities, so that "0" on the top corresponds to "$\Phi(0) = .5$" on the bottom.
17
Note that such probability intervals are different from confidence intervals. In particular, their end points are fixed. Note also that these (Lilliefors) probability bounds actually account for the estimated mean and standard-deviation values used [for more information, Google "Lilliefors test"].
18
Right click on the label bar above the histogram and select Display Options → Horizontal Layout.


The degree of non-normality of this data is even more evident from the Normal Quantile Plot. Here the mid-range values are well above the 45° line (slightly distorted in this plot), indicating that there is "too much probability mass to the left of center" relative to the normal distribution. Hence it is difficult to krige this data directly, since the corresponding prediction intervals would have little validity.

However, if this data is transformed to natural logs, then the familiar "bell shaped" curve
starts to appear, as seen in Figure 6.7 above. What is happening is that the log
transformation “shrinks” the upper range of the distribution (above value one) and
“expands” the lower range (below value one). While other transformations are possible
here, (such as taking square roots rather than logs), the log transformation is by far the
most common. It is also used for regression residuals, as we shall see in later sections.
To perform this log transformation in MATLAB, we start with the original data set, nickel, in the MATLAB file, nickel.mat. Next we replace the data column, nickel(:,3), with log data, and save it as log_nickel using the command:
>> log_nickel = [nickel(:,1:2),log(nickel(:,3))];
This makes a new matrix consisting of the first two columns of nickel and the log of the
third column.19

Estimation of the Spherical Variogram and Covariogram


Recall from Section 4.9.2 that the variogram and covariogram were estimated for the
nickel data, as in Figures 4.22 and 4.23, respectively. We now redo this procedure for the
log_nickel data in order to obtain initial covariance inputs for Kriging this data. To
estimate a spherical variogram we start with the default value of maxdist:
>> var_spher_plot(log_nickel);
and obtain the results shown in Figure 6.8 below:
[Figure 6.8. Log Nickel Variogram]    [Figure 6.9. Log Nickel Covariogram]
19
Note that the log command uses natural logs by default. Logs to the base 10 are obtained with the
command, log10.


The corresponding covariogram estimate is on the right in Figure 6.9. Here we again see
a wave effect which is qualitatively very similar to that in Figure 4.22 for the raw nickel
data. Here the reported maxdist value is 48,204. However, it appears that up to about
30,000 meters the empirical variogram is reasonably consistent with a classical spherical
variogram. Hence to capture this range, we now rerun var_spher_plot with this specified
maxdist value as follows:

>> opts.maxdist = 30000;

>> OUT = var_spher_plot(log_nickel,opts);

The new covariogram is plotted in Figure 6.10 below, and is seen to be quite in keeping
with the classical model.

[Figure 6.10. Final Log Nickel Covariogram (fitted parameter estimates: RANGE = 21630.857, SILL = 1.356, NUGGET = 0.340; MAXDIST = 30000)]

Here we no longer show the variogram, since its main purpose was to estimate the desired covariogram. By using the estimated range, sill and nugget parameters, $(\hat{r}, \hat{s}, \hat{a}) = (21631,\ 1.356,\ 0.340)$, shown in Figure 6.10 above, we can now construct estimates of all desired covariances as in (6.2.65) and (6.2.66) above.

To use these parameters in MATLAB, recall that the first cell of the OUT structure above
contains these parameter values. So we may identify these for later use as:

>> p_log = OUT{1}

Note that by leaving off the semicolon on the command line, the new vector is
automatically displayed as

p_log = 21631 1.3561 0.34026

so that the correctness of this command is easily checked from the output above.


Simple Kriging at a Selected Point

Given this covariogram estimate, we first apply simple kriging to a single point in order
to illustrate the procedure. In particular we choose the point, s0 = (659000,586000),20
shown as a red dot in Figure 6.11 below. Here the nickel values in Figure 4.18 have been
replaced by log-nickel values. Notice that while the values have changed, the overall
pattern is essentially the same. With respect to the particular point, s0, it appears that a
bandwidth of h0 = 5000 meters is sufficient to capture the (12) most important neighbors
of this point, as shown in the enlarged portion of the map. So for purposes of this
illustration we take the relevant prediction set, S ( s0 ) , to be given by these 12 points.


Figure 6.11. Point s0 and its Prediction Set

The rest of the simple kriging procedure is operationalized in the MATLAB program,
krige_simple.m . So to obtain the desired simple kriging prediction and an associated
estimate of the standard error of prediction at s0, one can use the command:

>> OUT = krige_simple(h0,p_log,log_nickel,s0)

Here the OUT matrix lists the krige prediction in the first column and the standard errors
in the second column (see also the documentation at the beginning of the program). So in
the present case, we can simply leave off the semicolon again and see the screen display:

>> OUT = 3.0488 0.76697

If we now denote nickel values by the random variable, Y, and log_nickel values by
logY, the kriging prediction of log_nickel at the point s0 is seen to be

(6.2.80) $\widehat{\log Y}(s_0) = 3.0488$

20
In the following discussion we shall refer to the given location as s0 when discussing input/output for MATLAB programs, and as $s_0$ when referring to the formal development above. The same is true of bandwidths, where h0 and $h_0$ will be used respectively.


where the "hat" notation, $\widehat{\log Y}$, is used to denote a prediction (or estimate) of the random variable, $\log Y$. The corresponding estimate of the standard error of prediction at location s0 is then given by,

(6.2.81) $\hat{\sigma}_0 = 0.76697$

For our later purposes, it is important to note that as in Step 1 of the estimation procedure
for simple kriging, this program uses the sample mean of the log_nickel data, which can
be obtained directly in MATLAB with the command

>> mean(log_nickel(:,3))

which in this case yields the value, $\hat{\mu} = 3.252$.

Comparison with Geostatistical Analyst

Before analyzing this simple kriging output further, it is instructive to compare it with the
output obtained by using the simple kriging procedure in Geostatistical Analyst. First it is
necessary to construct log-nickel values in ARCMAP. This is easily accomplished by
opening the attribute table for the Vancouver_dat shapefile, making a new field, say
LOGNI, and using the Calculator to create the logs of Nickel values [written in the
calculator window as log([NI]) ].21 [These log values are shown in Figure 6.11 above.]
To perform simple kriging start with the path:

Geostatistical Analyst → Geostatistical Wizard → Kriging

and use attribute LOGNI for input data Vancouver_dat. In the next window, select

Simple Kriging → Prediction Map

Notice that the mean value is displayed as 3.2515, which is precisely the (rounded)
MATLAB value above. In the next window, be sure to select the “Semivariogram”
option, to obtain a variogram plot. Recall that the maxdist above was chosen to be 30000
meters.

To obtain a fit that is roughly comparable in this case, set the number of lags to 15 with a lag size of 2000 meters (yielding a maxdist of $15 \times 2000 = 30000$ meters) as shown in Figure 6.12 below. Here the estimated range of 21706 meters is remarkably close to the MATLAB value of 21630 meters in Figure 6.10 above. Similarly, the estimated nugget value, 0.3409, and sill value, $(0.3409 + 1.0206 = 1.3615)$, are also very close to those in
Figure 6.10. So in this case one expects the simple kriging results to be quite similar as
well.

21
As with MATLAB, the “log( )” function in ARCMAP calculates natural logs.


Figure 6.12 Variogram for Log Nickel Data

This can be verified in the next window, shown in Figure 6.13 below. Here the sample
point coordinates have been set to X = 659000 and Y = 586000 to agree with the point s0
above. Similarly, to produce a circular neighborhood of 5000 meters, the “Sector type” is
set to the simple ellipse form shown, and the axes are both set to 5000 to yield a circle.22

Figure 6.13. Kriging Prediction at s0 = (X,Y)

22
Be sure to set Copy from Variogram = “False” in order to set these axis values.


Notice also that the maximum “Neighbors to include” has been set to 15 to ensure that all
points in the circle around point (X,Y) in the preview window will be included.23

The kriging prediction for log nickel is then displayed in Figure 6.13 as "Prediction = 3.0504" located below the (X,Y) coordinate values. [Notice also that exactly the 12
points inside the circle have been used for this kriging prediction.] As expected, this
value is seen to be quite close to the MATLAB prediction in (6.2.80) above.

Finally, to produce an estimate of the standard error of prediction at (X,Y), click “Back”
twice to return to the “Step 1” window and now select

Simple Kriging → Prediction Standard Error Map

With this selection, return to the "Step 3" window by clicking "Next" twice. Notice that all settings in Steps 2 and 3 have remained constant, so that prediction standard
errors are now being calculated under the same settings as the kriging prediction. The
only change is that “Prediction = 3.0504” is now replaced by “Error = 0.7676”. Again,
this value is quite close to the MATLAB standard error estimate in (6.2.81) above. As
mentioned above, this close agreement is largely due to the similarity of the variogram
parameter estimates in this case. Hence such close agreement cannot be expected in
general.

Analysis of the Simple Kriging Results

By applying the prediction interval result in expression (6.2.61) above, we can


immediately obtain a (default) prediction interval for the log-nickel value at s0. However,
this is not particularly appropriate, since it is nickel values (in parts per million, ppm) that
we are really interested in. Indeed the only reason for using log-nickel values was to
obtain a better normal approximation, so that prediction intervals will have some
statistical validity. But having obtained such a prediction interval, we now wish to
transform this interval back to nickel values. Here the idea is very simple. Notice first
that if g (Y ) is any monotone increasing function of a random variable [such as log(Y ) ]
then the function g has a well-defined inverse, g 1 , which is also monotone increasing.
So for any three random variables ( Z1 , Z 2 , Z 3 ) the following “inequality events” must be
identical

(6.2.82) $Z_1 \leq g(Z_2) \leq Z_3 \;\Leftrightarrow\; g^{-1}(Z_1) \leq g^{-1}[g(Z_2)] \leq g^{-1}(Z_3)$
$\;\Leftrightarrow\; g^{-1}(Z_1) \leq Z_2 \leq g^{-1}(Z_3)$

23
Note also in Figure 6.13 that the “Enlarge” tool for the preview window has been used to focus in on the
point (X,Y).


where the last line follows from the identity, $g^{-1}[g(Z_2)] = Z_2$. This in turn implies that the probabilities of these events must be identical, so that

(6.2.83) $\Pr[\, Z_1 \leq g(Z_2) \leq Z_3 \,] = \Pr[\, g^{-1}(Z_1) \leq Z_2 \leq g^{-1}(Z_3) \,]$

Now in the present case, recall from (6.2.59) that the 95% prediction interval for $\log Y(s_0)$ is defined by the relation:

(6.2.84) $\Pr\left[\, \widehat{\log Y}(s_0) - (1.96)\,\hat{\sigma}_0 \,\leq\, \log Y(s_0) \,\leq\, \widehat{\log Y}(s_0) + (1.96)\,\hat{\sigma}_0 \,\right] = .95$


Hence if we now let Z1  log 
Y ( s0 )  (1.96)ˆ 0 , Z 2  log Y ( s0 ), Z 3  log Y ( s0 )  (1.96)ˆ 0 and
let g ()  log() so that g 1 ()  exp() , then it follows at once from (6.2.83) and (6.2.84)
that

(6.2.85)



Pr exp log  
Y ( s0 )  (1.96)ˆ 0  Y ( s0 )  exp log  
Y ( s0 )  (1.96)ˆ 0   .95

This yields the desired prediction interval for Y ( s0 ) . In the present case we have the
estimated values,

(6.2.86)
 
exp log

 

Y ( s0 )  (1.96) ˆ 0 , exp log Y ( s0 )  (1.96) ˆ 0 

 exp  3.0504  (1.96) (.7676)  , exp  3.0504  (1.96) (.7676)  

 [ exp(1.5459) , exp(4.5549)] = [4.6922, 95.097]

and hence can be 95% confident that the true value of Y ( s0 ) lies in the interval
[4.6922, 95.097] . Note finally that [as stated following expression (6.2.61)] this result
can be interpreted to mean that if we were able to perform this same estimation procedure
many times, then Y ( s0 ) would lie in the estimated interval about 95% of the time. So in
the present case, one can be reasonably confident that the interval obtained (namely
[4.6922, 95.097] ) does indeed contain Y ( s0 ) .
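
In MATLAB, this back-transformation amounts to a single exponentiation of the log-scale interval endpoints. A minimal sketch, using the log-scale prediction and standard error reported above:

logY0_hat = 3.0504;  sig0 = 0.7676;    % log-scale kriging prediction and standard error
PI_log = [logY0_hat - 1.96*sig0 , logY0_hat + 1.96*sig0];   % 95% interval for log Y(s0)
PI_ppm = exp(PI_log)                   % back-transformed interval for Y(s0), in ppm
% yields approximately [4.6922 , 95.097], as in (6.2.86)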

Full Kriging of Log Nickel

While the restriction to a single point, s0, was valuable as an illustration of the Simple
Kriging procedure, typically one wishes to predict (estimate) the entire sample area based
on the observed data points { y ( si ) : i  1,.., N } . In ARCMAP this is precisely the “default”
option (where predictions are restricted to the smallest box in the sample area containing
the observed data). But in MATLAB one must actually specify the set of points where


predictions are desired. So a simple procedure here is to use the program, grid_form.m,
to construct a reasonably fine grid of points in the smallest box containing the data. To
display this visually, one can then import this data to ARCMAP and use some
appropriate (non-statistical) interpolation method to interpolate this grid to every pixel. In
the MATLAB file, nickel.mat, the coordinates of all 437 data points are in the matrix,
L0. So to form a bounding box, write:

>> Xmin=min(L0(:,1));
>> Xmax=max(L0(:,1));
>> Ymin=min(L0(:,2));
>> Ymax=max(L0(:,2));

Next, to choose a grid cell size, observe from the map display in ARCMAP that a
division of the box sides into about 25 segments yields a reasonably fine grid for
interpolation. So we now set,

>> Xcell = (Xmax-Xmin)/25;


>> Ycell = (Ymax-Ymin)/25;

and use the command (recall the application on p.4-26 of Part I):

>> G = grid_form(Xmin,Xmax,Xcell,Ymin,Ymax,Ycell);

to construct an appropriate grid, G. This grid is shown in Figure 6.14 below, and is seen
to just cover the region of the data points. Using grid G as an input rather than the single
point, s0, we can then obtain a full kriging of all grid points with the command:

>> OUT_G = krige_simple(h0,p_log,log_nickel,G);

[Here we use the semicolon to avoid screen output of all kriging values.] This data can
then be imported to ARCMAP by making a data table,

>> DAT_G = [G,OUT_G];

in which the first two columns include the grid coordinate points and the last two include
the krige and standard error estimates at each grid point. By saving this as an ASCII file;

>> save DAT_G.txt DAT_G -ascii

(and editing the file in EXCEL to include column labels) one can then import
DAT_G.txt into ARCMAP, make a shapefile Simple_Krige_Grid.shp, and display this
layer as shown in Figure 6.15 below.



Figure 6.14. Interpolation Grid Figure 6.15. Simple Kriging Comparison

To display the simple kriging results from MATLAB, we can then use any of the
interpolators in Geostatistical Analyst. The contours shown in Figure 6.15 are obtained
by first interpolating the kriging data in Simple_Krige_Grid with the radial basis
functions option, and then using the command, Data → Export to Vector. The layer
produced contains precisely these contours. The reason why contours are used here is to
allow a visual comparison with a simple kriging of log-nickel in Geostatistical Analyst.
This is accomplished by completing the simple kriging procedure outlined above [that we
terminated with Step 3 (Searching Neighborhood) shown in Figure 6.13]. If one places
the contours above the kriging map displayed, then both can be seen together.24

Finally, this visual comparison shows that while these two kriging surfaces are not in
perfect agreement, they are qualitatively very similar. Moreover, while the Geostatistical
Analyst procedure is clearly easier to perform in this case, the MATLAB “grid”
procedure will prove to be very useful for universal kriging, where the Geostatistical
Analyst version is very limited in terms of applications. This will be illustrated by the
“Venice example” in Section 7.3.5 below.

24
To make the boundaries of the kriging map agree exactly with the contours (as seen in Figure 6.15),
open the “properties” of the kriging map layer, select “Extent” and set this to “the rectangular extent of
Simple_Krige_Grid”.


6.3 The Ordinary Kriging Model

The procedural details of Ordinary Kriging are almost identical to those of Simple
Kriging. Hence the present development focuses on those aspects that extend the above
analysis by internalizing the estimation of the unknown mean,  . Here again we start
with a spatial stochastic process {Y ( s)     ( s) : s  R} where each finite set of sample
variates, $\{Y(s_i) = \mu + \varepsilon(s_i) : i = 1,..,n\}$, is assumed to be multi-normally distributed with known covariances, $\mathrm{cov}[\varepsilon(s_i), \varepsilon(s_j)],\ i,j = 1,..,n$. Given such a sample, we again
consider the problem of predicting Y ( s0 ) at some location, s0  R , not in this sample. It
is also assumed that the relevant prediction set, S ( s0 )  {s1 ,.., sn0 } , for location s0 has
been identified within this set of sample locations. Hence the basic task is to predict a
value for Y ( s0 ) in terms of observed values of the variates {Y ( s1 ),.., Y ( sn0 )} . By the linear
prediction hypothesis in (6.1.2) we then seek a best linear unbiased (BLU) predictor,

Yˆ ( s0 )   i01 i 0 Y ( si )
n
(6.3.1)

of Y ( s0 ) . To facilitate the interpretation of this predictor, it is convenient to proceed in


two steps. First we develop a BLU estimator of  , and then use this result to simplify the
form of the BLU predictor obtained for Y ( s0 ) .

6.3.1 Best Linear Unbiased Estimation of the Mean

Since the mean,  , is assumed to be constant throughout region R, it is natural to use the
entire set of sample observations, {Y ( si )     ( si ) : i  1,.., n} , to estimate  . To do so,
we again start with the linear hypothesis that the desired estimate, $\hat{\mu}_n$, can be written as a linear combination of these observations, say

(6.3.2) $\hat{\mu}_n = \sum_{i=1}^{n} a_i Y(s_i) = a'Y_n$

where $Y_n = [Y(s_1),.., Y(s_n)]'$ denotes the full sample vector of Y-variates, and where $a = (a_1,.., a_n)'$ denotes the vector of unknown coefficients. To ensure that this linear estimator is unbiased, we then require that

(6.3.3) $\mu = E(\hat{\mu}_n) = E(a'Y_n) = a'E(Y_n) = a'(\mu 1_n) = \mu (a'1_n) = \mu (1_n'a)$

where $1_n = (1,..,1)'$ is the unit vector of length $n$. Hence unbiasedness for all values of $\mu$ will be guaranteed if and only if these unknown coefficients sum to one, i.e.,

(6.3.4) $1_n'a = 1$


Among all such linear unbiased estimators, we seek that one with minimum variance. To
calculate the variance of linear estimators, we start by letting

(6.3.5) $V \equiv \mathrm{cov}(Y_n) = \begin{pmatrix} \sigma^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma^2 \end{pmatrix}$

denote the full sample covariance matrix (in contrast to the smaller covariance matrices, $V_0$, for each prediction set, $S(s_0) = \{s_1,.., s_{n_0}\}$). With this definition, it follows at once from (3.2.21) that

(6.3.6) $\mathrm{var}(a'Y_n) = a'\,\mathrm{cov}(Y_n)\, a = a'Va$

Hence to determine the linear unbiased estimator of $\mu$ with smallest variance, we seek to find that coefficient vector, $\hat{a}$, that yields a minimum value of (6.3.6) subject to the unit-sum condition in (6.3.4), i.e., which solves the following constrained minimization problem in $a$:

(6.3.7) minimize: $a'Va$   subject to: $1_n'a = 1$

In expression (A2.8.23) of the Appendix it is shown that the unique solution of this problem is given by the coefficient vector:

(6.3.8) $\hat{a} = \left( \dfrac{1}{1_n'V^{-1}1_n} \right) V^{-1} 1_n$

Hence for each possible vector of sample variates, $Y_n = [Y(s_1),.., Y(s_n)]'$, the unique BLU estimator for $\mu$ is given by:

(6.3.9) $\hat{\mu}_n = \hat{a}'Y_n = \left( \dfrac{1}{1_n'V^{-1}1_n} \right) 1_n'V^{-1}Y_n = \dfrac{1_n'V^{-1}Y_n}{1_n'V^{-1}1_n}$

To gain some feeling for this estimator, consider the classical case of uncorrelated samples, namely where the covariance matrix in (6.3.5) reduces to

(6.3.10) $V = \mathrm{cov}(Y_n) = \begin{pmatrix} \sigma^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 I_n$

with $I_n$ denoting the n-square identity matrix. In this case we see that

(6.3.11) $\hat{\mu}_n = \dfrac{1_n'(I_n)Y_n}{1_n'(I_n)1_n} = \dfrac{1_n'Y_n}{1_n'1_n}$

But since $1_n'1_n = \sum_{i=1}^{n}(1) = n$ and $1_n'Y_n = \sum_{i=1}^{n} Y(s_i)$, it follows that

(6.3.12) $\hat{\mu}_n = \tfrac{1}{n}\sum_{i=1}^{n} Y(s_i) = \bar{Y}_n$

Thus ˆ n reduces to the sample mean, Yn , which is of course the unique BLU estimator of
 for uncorrelated samples. Hence in the presence of spatial correlation, the optimal
weights in the coefficient vector, â , reflect the covariances among these correlated
samples. In the case of Simple Kriging, the use of Yn to estimate  necessarily results in
a linear unbiased estimator with higher variance than ˆ n .
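
As a computational aside, the BLU estimator in (6.3.9) is a one-line calculation once an estimate of the full covariance matrix is available. The following MATLAB sketch assumes only that y is the full data vector and V_hat is a covariance matrix estimate constructed from a fitted covariogram (as in Section 6.3.4 below); the variable names are illustrative.

n      = length(y);
one_n  = ones(n,1);
u      = V_hat \ one_n;                % V_hat^{-1} 1_n
mu_n   = (u'*y) / (u'*one_n);          % BLU mean estimate, as in (6.3.9)
% Note: if V_hat were sigma2*eye(n), this would reduce to mean(y), as in (6.3.12)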

6.3.2 Best Linear Unbiased Predictor of Y(s0)

Given this intermediate result, we now formulate the Best Linear Unbiased prediction
problem for Y ( s0 ) . Here we again stress that the prediction set, S ( s0 )  {s1 ,.., sn0 } , for s0
is generally smaller than the full sample of size n. So here we focus on the smaller vector
of sample variates, $Y = [Y(s_1),.., Y(s_{n_0})]'$, used for predicting $Y(s_0)$ in (6.3.1) above. As in the case of Simple Kriging, if we again denote the desired vector of prediction weights by $\lambda_0 = (\lambda_{01},.., \lambda_{0 n_0})'$, then the desired linear predictor of $Y_0 = Y(s_0)$ can be written in vector form as

(6.3.13) $\hat{Y}_0 = \lambda_0' Y$

For purposes of prediction, recall from (6.1.4) that the desired unbiasedness criterion for
Yˆ0 is that expected prediction error be zero, i.e., that

(6.3.14) $0 = E(e_0) = E(Y_0 - \hat{Y}_0) = E(Y_0) - E(\lambda_0'Y)$
$= \mu - \lambda_0' E(Y) = \mu - \lambda_0'(\mu 1_{n_0})$
$= \mu (1 - \lambda_0' 1_{n_0})$

So, as a parallel to (6.3.4) above, it follows that $\lambda_0$ will yield an unbiased predictor for all possible values of $\mu$ if and only if the bracketed expression is zero, i.e.,

(6.3.15) $1_{n_0}' \lambda_0 = 1$


Moreover, to satisfy the efficiency criterion it is required that among all linear unbiased predictors, $\hat{Y}_0$ should yield the smallest prediction error variance, which in view of (6.3.15) together with (6.2.12) is again seen to be precisely residual mean squared error,

(6.3.16) $\mathrm{var}(e_0) = E(e_0^2) = E[(Y_0 - \hat{Y}_0)^2] = E[(Y_0 - \lambda_0'Y)^2] = E\!\left\{ [(\mu + \varepsilon_0) - \lambda_0'(\mu 1_{n_0} + \varepsilon)]^2 \right\}$
$= E\!\left\{ [\mu(1 - \lambda_0'1_{n_0}) + (\varepsilon_0 - \lambda_0'\varepsilon)]^2 \right\} = E[(\varepsilon_0 - \lambda_0'\varepsilon)^2] = MSE(\lambda_0)$

But since all covariances in (6.2.26) continue to be given (i.e., are assumed to be known) for the case of Ordinary Kriging, the argument leading to (6.2.27) for Simple Kriging still holds. Hence we again seek to minimize

(6.3.17) $MSE(\lambda_0) = \sigma^2 - 2\, c_0'\lambda_0 + \lambda_0' V_0 \lambda_0$,

but now subject to the unit sum condition in (6.3.15). Hence the desired weights, $\hat{\lambda}_0$, for Ordinary Kriging are given by the solution of the constrained minimization problem:

(6.3.18) minimize: $\sigma^2 - 2\, c_0'\lambda_0 + \lambda_0' V_0 \lambda_0$   subject to: $1_{n_0}'\lambda_0 = 1$

The solution to this problem is shown in the Appendix [expression (A2.8.26)] to be given by

(6.3.19) $\hat{\lambda}_0 = \left( \dfrac{1 - 1_{n_0}'V_0^{-1}c_0}{1_{n_0}'V_0^{-1}1_{n_0}} \right) V_0^{-1}1_{n_0} + V_0^{-1}c_0$

By substituting this solution into (6.3.13), one then obtains the following BLU predictor of $Y_0$ [see also expression (A2.8.28) in the Appendix]:

(6.3.20) $\hat{Y}_0 = \left( \dfrac{1_{n_0}'V_0^{-1}Y}{1_{n_0}'V_0^{-1}1_{n_0}} \right) + c_0'V_0^{-1}Y - c_0'V_0^{-1}1_{n_0} \left( \dfrac{1_{n_0}'V_0^{-1}Y}{1_{n_0}'V_0^{-1}1_{n_0}} \right)$

At first glance, this expression appears rather formidable. But by using the results of Section 6.3.1 above, it can be made quite transparent. In particular, suppose that the samples available for mean estimation are taken to be given by the prediction sample, $Y$, at $s_0$ rather than the full sample, $Y_n$. Then it follows at once from (6.3.9) that this BLU estimator must be of the form

(6.3.21) $\hat{\mu}_{n_0} = \dfrac{1_{n_0}'V_0^{-1}Y}{1_{n_0}'V_0^{-1}1_{n_0}}$


where $n$ is now replaced by $n_0$, and where $V$ is replaced by $V_0 = \mathrm{cov}(Y)$. So by substituting (6.3.21) into (6.3.20), we see that this optimal predictor reduces to

(6.3.22) $\hat{Y}_0 = \hat{\mu}_{n_0} + c_0'V_0^{-1}Y - c_0'V_0^{-1}(\hat{\mu}_{n_0} 1_{n_0})$
$= \hat{\mu}_{n_0} + c_0'V_0^{-1}(Y - \hat{\mu}_{n_0} 1_{n_0})$

Finally, if we treat $\hat{\mu}_{n_0}$ as a prior estimate of $\mu$, and [as in (6.2.2)] take the corresponding sample residuals based on this prior estimate to be

(6.3.23) $\hat{\varepsilon}_i = Y(s_i) - \hat{\mu}_{n_0}, \quad i = 1,.., n_0$

then the vector of these residuals is given by

(6.3.24) $\hat{\varepsilon} = \begin{pmatrix} \hat{\varepsilon}_1 \\ \vdots \\ \hat{\varepsilon}_{n_0} \end{pmatrix} = \begin{pmatrix} Y(s_1) - \hat{\mu}_{n_0} \\ \vdots \\ Y(s_{n_0}) - \hat{\mu}_{n_0} \end{pmatrix} = Y - \hat{\mu}_{n_0} 1_{n_0}$

Similarly, if we let $\hat{\varepsilon}_0 = \hat{Y}_0 - \hat{\mu}_{n_0}$ denote the residual predictor corresponding to $\hat{Y}_0$, then (6.3.22) is further reduced to

(6.3.25) $\hat{\varepsilon}_0 = c_0' V_0^{-1} \hat{\varepsilon}$

But by (6.2.29) this is seen to be precisely the Simple Kriging predictor of $\varepsilon_0 = \varepsilon(s_0)$ based on the vector of residual data, $\hat{\varepsilon}$.

In short, the BLU predictor of $Y_0 = Y(s_0)$ in (6.3.20) can be obtained by the following two-part procedure:

(i). Construct the BLU estimator, $\hat{\mu}_{n_0}$, of $\mu$ based on the prediction sample data, $Y$, as in (6.3.21).

(ii). Use the sample residuals, $\hat{\varepsilon}$, in (6.3.24) to obtain the Simple Kriging predictor, $\hat{\varepsilon}_0$, of $\varepsilon_0$ as in (6.3.25), and set $\hat{Y}_0 = \hat{\mu}_{n_0} + \hat{\varepsilon}_0$.

In retrospect, this procedure seems quite natural. Since all covariance information is
assumed to be given (as in Simple Kriging) the first step simply uses this information to
obtain a BLU estimator for  . The second step then uses Simple Kriging to construct the
predictor. What is remarkable here is that this ad hoc procedure actually yields the Best
Linear Unbiased predictor for Y ( s0 ) based solely on the prediction sample Y .


The only shortcoming of this procedure is that it does not use all sample information
available for estimating  . For since this mean is assumed to be constant over the entire
region R, it should be clear that a better estimate can be obtained by using the BLU
estimator, ˆ n , based on the full sample, Yn . It is this modified procedure that constitutes
the most commonly used form of Ordinary Kriging.25 To formalize this procedure, it thus
suffices to modify the two steps above as follows:

(1). Construct the BLU estimator, $\hat{\mu}_n$, of $\mu$ based on the full sample data, $Y_n$, as in (6.3.9).

(2). Use the sample residuals, $\hat{\varepsilon} = Y - \hat{\mu}_n 1_{n_0}$, to obtain the Simple Kriging predictor, $\hat{\varepsilon}_0$, of $\varepsilon_0$ as in (6.3.25), and set $\hat{Y}_0 = \hat{\mu}_n + \hat{\varepsilon}_0$.

6.3.3 Standard Error of Prediction

Recall that to obtain prediction intervals, one requires an estimate of the standard error of prediction, $\sigma_0$, as well as $\hat{Y}_0$. To do so, recall from the argument in (6.3.16) and (6.3.17) that the prediction error variance for any weight vector, $\lambda_0$, has the same form as for Simple Kriging, i.e.,

(6.3.26) $\sigma_0^2 = \mathrm{var}(e_0) = \sigma^2 - 2\, c_0'\lambda_0 + \lambda_0' V_0 \lambda_0$

So all that is required to obtain the desired prediction error variance is to substitute the optimal weight vector, $\hat{\lambda}_0$, into this expression. After some manipulation, it can be shown [see expression (A2.8.69) in the Appendix] that the desired value, $\hat{\sigma}_0^2$, is given by:

(6.3.27) $\hat{\sigma}_0^2 = \sigma^2 - 2\, c_0'\hat{\lambda}_0 + \hat{\lambda}_0' V_0 \hat{\lambda}_0 = \left( \sigma^2 - c_0' V_0^{-1} c_0 \right) + \dfrac{(1 - 1_{n_0}'V_0^{-1}c_0)^2}{1_{n_0}'V_0^{-1}1_{n_0}}$

The key point to notice is that the first bracketed expression is precisely the prediction
error variance for Simple Kriging in expression (6.2.53). But since the second term is

25
It should be noted however that one may consider "local" versions of ordinary kriging in which the mean is
re-estimated at each prediction site, s0 . This yields a set of local mean estimates, ˆ ( s0 ) , which can be
regarded as local estimates of a possibly non-constant trend surface. See for example [BG], pp.195-196.
This idea is also implicit in Section 5.4.2 of Schabenberger and Gotway (2005).


always positive,26 it follows that the prediction error variance for Ordinary Kriging is always larger than for Simple Kriging. The additional positive term turns out to be precisely the addition to prediction error variance created by the internal estimation of the mean, $\mu$.

Finally, given this expression for prediction error variance, the desired standard error of prediction is simply the square root of this expression, namely,

(6.3.28) $\hat{\sigma}_0 = \sqrt{ \left( \sigma^2 - c_0' V_0^{-1} c_0 \right) + \dfrac{(1 - 1_{n_0}'V_0^{-1}c_0)^2}{1_{n_0}'V_0^{-1}1_{n_0}} }$

6.3.4 Implementation of Ordinary Kriging

From the development above, it should be evident how to implement Ordinary Kriging
by a direct modification of the three-step procedure for Simple Kriging in Section 6.2.5.
To do so, we again start by assuming the existence of a given set of n observations (data
points), { yi  y ( si ):i  1,.., n} in R, where each yi is a realization of the corresponding
random variable, Y ( si ) , in the full sample vector, Yn  [Y ( si ) : i  1,.., n] , in (6.3.2) above.
In this context, we again consider the prediction of Y0  Y ( s0 ) , at a single given location,
s0  R , with respect to a given prediction set, S ( s0 )  {s1 ,.., sn 0 }  {s1 ,.., sn } . Within this
framework, we can operationalize the Ordinary Kriging model by re-ordering the three
steps of the Simple Kriging implementation in Section 6.2.5 as follows:

Step 1. Estimation of Covariances

This step amounts essentially to a reinterpretation of Step 2 for Simple Kriging, where
here we focus on the $Y$-process rather than the $\varepsilon$-process. To do so, simply recall from Section 4.8 that (as with Simple Kriging) Ordinary Kriging assumes a constant-mean model, $[Y(s) = \mu + \varepsilon(s) : s \in R]$, so that the variograms for the $Y$-process and $\varepsilon$-process are identical. Hence we can again use the sample data $(y_1,.., y_n)$ in var_spher_plot.m to obtain a spherical variogram estimate, $\gamma(h; \hat{r}, \hat{s}, \hat{a})$, and derived covariogram estimate as in (6.2.65), i.e.,

(6.3.29) $\hat{C}(h) = \hat{s} - \gamma(h; \hat{r}, \hat{s}, \hat{a})$

The only difference in the present setting is that we treat the covariances between $Y$ values rather than $\varepsilon$ values. In particular, we now require estimates of the covariances, $\sigma_{ij} = \mathrm{cov}[Y(s_i), Y(s_j)]$, for all sample pairs, $Y(s_i)$ and $Y(s_j)$, in $Y_n$. Using (6.3.29), these can be estimated precisely as in (6.2.66) by setting,

26
Positivity of the denominator follows from the fact that it is the variance of the linear compound, $1_{n_0}'V_0^{-1}Y$, since $\mathrm{var}(1_{n_0}'V_0^{-1}Y) = 1_{n_0}'V_0^{-1}\mathrm{cov}(Y)V_0^{-1}1_{n_0} = 1_{n_0}'V_0^{-1}(V_0)V_0^{-1}1_{n_0} = 1_{n_0}'V_0^{-1}1_{n_0}$.


(6.3.30) $\hat{\sigma}_{ij} = \widehat{\mathrm{cov}}[Y(s_i), Y(s_j)] = \hat{C}(\| s_i - s_j \|)$

These in turn provide an estimate of the full-sample covariance matrix, $V = \mathrm{cov}(Y_n)$, in (6.3.5) as follows:

(6.3.31) $\hat{V} = \begin{pmatrix} \hat{\sigma}^2 & \cdots & \hat{\sigma}_{1n} \\ \vdots & \ddots & \vdots \\ \hat{\sigma}_{n1} & \cdots & \hat{\sigma}^2 \end{pmatrix}$

By the same procedure, we can obtain estimates for all covariances between the variable, $Y_0 = Y(s_0)$, to be predicted and the given set of prediction variates, $Y = [Y(s_1),.., Y(s_{n_0})]'$, namely,

(6.3.32) $\hat{\sigma}_{0j} = \widehat{\mathrm{cov}}[Y(s_0), Y(s_j)] = \hat{C}(\| s_0 - s_j \|), \quad j = 1,.., n_0$

By again letting $\hat{c}_0 = (\hat{\sigma}_{0i} : i = 1,.., n_0)'$, we can use these together with the appropriate sub-matrix, $\hat{V}_0$, of covariance estimates in (6.3.27) to obtain an estimate,

(6.3.33) $\hat{C}_0 = \begin{pmatrix} \hat{\sigma}^2 & \hat{c}_0' \\ \hat{c}_0 & \hat{V}_0 \end{pmatrix}$

of the full covariance matrix, C0 , relevant for prediction at s0 . From a computational


viewpoint, this matrix is numerically identical to the matrix in (6.2.23), with elements
now interpreted as covariances directly between Y values rather than  values.

Step 2. Estimation of the Mean

This step involves the main departure from Simple Kriging. Here we replace the sample-mean estimator ($\bar{Y}_n$) of $\mu$ with the BLU estimator, $\hat{\mu}_n$, in expression (6.3.9) above. By using the covariance estimates in (6.3.27) together with the sample data vector, $y = (y_1,\dots,y_n)'$, this estimate can be calculated as

(6.3.34)  $\hat{\mu}_n = \dfrac{\mathbf{1}_n' \hat{V}^{-1} y}{\mathbf{1}_n' \hat{V}^{-1} \mathbf{1}_n}$

Step 3. Estimation of Kriging Predictions

As emphasized in the final two-step procedure of Section 6.3.2 above, this step is identical to that in the Simple Kriging procedure. All that is required at this point is to replace the sample-mean estimate, $\hat{\mu}$, with the BLU estimate, $\hat{\mu}_n$, and redefine the appropriate residual estimates in (6.2.70) by

(6.3.35)  $\hat{\varepsilon}_i = y_i - \hat{\mu}_n \, , \quad i = 1,\dots,n_0$

and again use (6.2.71) and (6.2.72) to construct the desired prediction, $\hat{Y}_0$, by

(6.3.36)  $\hat{Y}_0 = \hat{Y}(s_0) = \hat{\mu}_n + \hat{c}_0' \hat{V}_0^{-1} \hat{\varepsilon}$

Finally, the estimated standard error of prediction, $\hat{\sigma}_0$, is given by substituting the covariance estimates into (6.3.28) above to obtain:

(6.3.37)  $\hat{\sigma}_0 = \sqrt{(\hat{\sigma}^2 - \hat{c}_0'\hat{V}_0^{-1}\hat{c}_0) + \dfrac{(1 - \mathbf{1}_{n_0}'\hat{V}_0^{-1}\hat{c}_0)^2}{\mathbf{1}_{n_0}'\hat{V}_0^{-1}\mathbf{1}_{n_0}}}$

The pair $(\hat{Y}_0, \hat{\sigma}_0)$ can then be used to construct prediction intervals for $Y_0 = Y(s_0)$ precisely as in (6.2.62) and (6.2.63) above.
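To fix ideas, Steps 2 and 3 can be sketched in MATLAB as follows. Again, this is only an illustrative sketch: it assumes that the quantities V, V0, c0 and the sill estimate sig2 have been formed as in Step 1, that y is the full data vector, and that y0 contains the data values at the n0 prediction-set locations; o_krige.m carries out the corresponding calculations internally.

% Illustrative sketch of Steps 2 and 3 (all variable names are assumptions):
one_n  = ones(length(y),1);
one_n0 = ones(length(y0),1);

mu_n  = (one_n' * (V \ y)) / (one_n' * (V \ one_n));   % BLU mean estimate, eq. (6.3.34)
ehat  = y0 - mu_n;                                     % prediction-set residuals, eq. (6.3.35)
w     = V0 \ c0;                                       % kriging weights, V0^(-1)*c0
Y0hat = mu_n + w' * ehat;                              % Ordinary Kriging prediction, eq. (6.3.36)
sig0  = sqrt((sig2 - c0'*w) + (1 - one_n0'*w)^2 / (one_n0' * (V0 \ one_n0)));   % eq. (6.3.37)

% A normal-theory 95% prediction interval (as in the Simple Kriging case) is then
CI = [Y0hat - 1.96*sig0,  Y0hat + 1.96*sig0];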

6.3.5 An Example of Ordinary Kriging

This implementation of Ordinary Kriging can be illustrated in terms of the Log-Nickel


example developed for Simple Kriging in Section 6.2.6. As emphasized in the above
implementation, all covariogram estimates are identical. Hence from a practical
viewpoint, the only numerical differences between these prediction procedures will result from the replacement of the sample-mean estimator, $\hat{\mu} = \bar{y}_n$, in (6.2.62) with the BLU estimator, $\hat{\mu}_n$, in (6.3.34). Recall that in the present case, $\bar{y}_n = 3.252$. A computation of $\hat{\mu}_n$ using the same data turns out to yield a value $\hat{\mu}_n = 3.329$, which is quite similar to that of $\bar{y}_n$. Hence in the present example, one can expect to find very similar predictions and standard errors. However, it should be stressed that this is by no means true in general. Indeed, when substantial spatial dependencies are present, the sample mean $\bar{y}_n$ can yield a very poor estimate of $\mu$ relative to $\hat{\mu}_n$.

With these general observations, we can now sketch how Ordinary Kriging is done in
both MATLAB and ARCMAP. Starting with MATLAB, Ordinary Kriging is
operationalized in the program, o_krige.m. The inputs are essentially the same as
simple_krige.m, except that values are made distinct from locations. So here, values are
given by y = log_nickel(:,3) and locations by L0 = log_nickel(:,1:2). To obtain a


prediction at the given location, s0 = (659000,586000), in Figure 6.11 above, one now
uses the command:

>> OUT = o_krige(y,L0,s0,h0,p_log);

Here the prediction and standard error are the last two cells of the output structure, which
can be obtained as:

>> [OUT{3} OUT{4}] = 3.0461 0.76771

A comparison with the results on p.5-30 above shows that (as expected) the Ordinary Kriging results are virtually the same.

Finally, to carry out an Ordinary Kriging prediction at s0 in ARCMAP, the procedure


described for Simple Kriging is again the same, except that at “Kriging Step 2 of 5” one
now selects Kriging Type = Ordinary (which is the default choice). By employing all
the same settings as in the Simple Kriging example (pp.5-31 to 5-32 above), the Ordinary
Kriging prediction, $\hat{Y}_0$, and standard error of prediction, $\hat{\sigma}_0$, at $s_0$ turn out to be

(6.3.38)  $\hat{Y}_0 = 3.0477 \, , \quad \hat{\sigma}_0 = 0.7683$

Hence, as expected, these are again seen to be virtually the same as those for MATLAB.

6.4 Selection of Prediction Sets by Cross Validation

Before proceeding to more general kriging models, it is important to consider the


question of choosing “best” prediction sets, $S(s_0)$, for each prediction site, $s_0 \in R$. At first glance, it would appear that if the range, $r$, of the covariogram has been correctly estimated by $\hat{r}$, then the most natural choice of prediction sets is to include all points closer to $s_0$ than $\hat{r}$. If the set of all $n$ sample point locations is denoted by

(6.3.39)  $S_n = \{s_1,\dots,s_n\}$

then this amounts formally to setting

(6.3.40)  $S(s_0) = \{s_i \in S_n : \|s_0 - s_i\| \le \hat{r}\}$

[In fact, this option for defining search neighborhoods is available in “Kriging step 4 of
5” in ARCMAP, as denoted by “Copy from Variogram”.] However, in spite of its
apparent theoretical appeal, this option generally tends to include “too much”. This will
become evident in the simulation analysis below.


To determine a “best” size for prediction sets, one first defines a set of candidate sizes. In
the present case, we shall focus on circular prediction sets of the form (6.3.40) for a
selection of bandwidths, $H = \{h_1,\dots,h_k\}$, and let

(6.3.41)  $S_h(s_0) = \{s_i \in S_n : \|s_0 - s_i\| \le h\} \, , \quad h \in H$

While it is in principle possible to consider different bandwidths at each prediction site,


we follow the standard convention of considering only uniform bandwidths (as reflected
by the bandwidth parameter, h0, used in o_krige.m). The task is then to find a “best”
bandwidth.

The standard procedures for doing so, known as cross validation procedures, leave out
part of the data and attempt to predict these values with the rest of the data. By
calculating the average prediction error for this data, one can then find the bandwidth that
minimizes this value. The most commonly used procedure, known as leave-one-out cross
validation, is to systematically omit single data points one at a time, and predict these
using the rest of the data. Hence, given a candidate bandwidth, h  H , one will obtain for
each data point, yi  y ( si ) , a predicted value, say yˆi (h) , by using all other sample data in
y  ( y1 ,.., yn ) together with the prediction set S h ( si ) . By squaring these prediction
errors, $y_i - \hat{y}_i(h)$, $i = 1,\dots,n$, and taking the average, one obtains a summary measure that can be viewed as a sample version of mean squared error. But in order to preserve units (so that values, for example, are in terms of Nickel rather than Nickel-squared) the most commonly used measure of performance is root mean squared error, as defined for each candidate bandwidth, $h \in H$, by:


(6.3.42)  $RMSE(h) = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} [\, y_i - \hat{y}_i(h)\,]^2}$

Hence by systematically calculating $RMSE(h)$ for all $h \in H$, one can define the best bandwidth, $h^*$, to be the one that minimizes (6.3.42), i.e.,

(6.3.43)  $RMSE(h^*) = \min_{h \in H} RMSE(h)$
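As an illustration of how such a cross-validation search might be organized, consider the following MATLAB sketch. Here ok_predict is a hypothetical stand-in for any routine (such as o_krige.m) that returns an Ordinary Kriging prediction at a single site from a given set of data values and locations; the names y, L, H and n are assumptions for this sketch, and o_krige_cross_val.m (used below) implements the actual procedure.

% Illustrative leave-one-out cross-validation sketch (not the course program):
RMSE = zeros(size(H));
for k = 1:length(H)
    h    = H(k);
    err2 = zeros(n,1);
    for i = 1:n
        d = zeros(n,1);
        for j = 1:n
            d(j) = norm(L(i,:) - L(j,:));      % distances from site i to all sites
        end
        J = find(d <= h);                      % candidate prediction set S_h(s_i)
        J = J(J ~= i);                         % leave out site i itself
        yhat_i  = ok_predict(y(J), L(J,:), L(i,:));   % hypothetical single-site kriging routine
        err2(i) = (y(i) - yhat_i)^2;
    end
    RMSE(k) = sqrt(mean(err2));                % eq. (6.3.42)
end
[~, kstar] = min(RMSE);                        % best bandwidth h* = H(kstar), eq. (6.3.43)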

6.4.1 Log-Nickel Example

For the case of Ordinary Kriging, this leave-one-out cross validation procedure is
operationalized in the MATLAB program, o_krige_cross_val.m. To apply this program
to the log-nickel example, recall that the estimated range was rˆ  21,631 meters, and that
the bandwidth chosen for kriging at s0 was h0  5000 meters. Hence we choose H to
be the set of bandwidths increasing from 1000 to 25,000 in increments of 1000, i.e.,

>> H = [1000:1000:25000];


If the n  436 locations and log_nickel values are denoted respectively by

>> L = log_nickel(:,1:2);

>> y = log_nickel(:,3);

then the above program can be run for this example using the command

>> o_krige_cross_val(y,L,H);

The output is a graph that plots the values of RMSE (h) against bandwidths, h, as shown
in Figure 6.16(a) below. Here the best bandwidth (shown by the red arrow in the figure) is seen to be 11,000 meters, which is roughly twice the value chosen for kriging at point $s_0$ in the examples above. This larger bandwidth is shown by the larger circle in Figure 6.16(b), with the smaller circle denoting the original choice of 5000 meters. Notice that many more data points are now included (33 versus 12 in the original analysis). The predictions obtained by o_krige.m using this larger bandwidth are shown below:

>> [OUT{3} OUT{4}] = 3.1219 0.76365

So the predicted value is seen to be somewhat higher, and the standard error of prediction is slightly smaller.27 Since the latter implies a slightly tighter prediction interval, this larger bandwidth may indeed be preferable.

[Figure 6.16. Log Nickel Example: (a) Cross Validation Plot of RMSE(h) against bandwidth h, (b) Enlarged Bandwidth]

27
The values obtained in ARCMAP are 3.1237 and 0.7643, respectively, and are again seen to be in close
agreement.


But the most important point to note here is that this best bandwidth is much smaller than the estimated range ($\hat{r} = 21{,}631$). It can of course be argued that in this particular example,
the estimated range may not be very accurate. Indeed, it is well known that estimates of
the range tend to be the least stable (most variable) of the three parameter estimates
(rˆ, sˆ, aˆ ) . Hence it is useful to consider this question in simulated data sets where the true
range is known.

6.4.2 A Simulated Example

To construct a simulated example, we start by generating $n = 500$ random points in a $100 \times 100$ kilometer square with locations denoted by $L = (s_1,\dots,s_n)$. Next we simulate $K = 20$ realizations, $Y = (y_1,\dots,y_K)$, of a covariance-stationary process on these points, where each column, $y_k = (y_{k1},\dots,y_{kn})'$, is a realization on the locations in $L$. Here we use a constant mean of $\mu = 10$ and a covariogram with parameters, $p = (r, s, a) = (25, 5, 0)$. The
simulation was carried out using the MATLAB program, cov_stat.m, with the command:

>> Y = cov_stat(p,L,20);
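While cov_stat.m is part of the course software, the essential idea behind such simulations can be sketched as follows (this is only a sketch under the stated assumptions, and the actual program may differ in its details): build the covariance matrix implied by the spherical covariogram with parameters p = (r,s,a) on the locations L, take its lower-triangular Cholesky factor, and multiply independent standard normal vectors by this factor before adding the constant mean.

% Illustrative sketch of covariance-stationary simulation (not cov_stat.m itself):
r = p(1);  s = p(2);  a = p(3);
n = size(L,1);
D = zeros(n);
for i = 1:n
    for j = 1:n
        D(i,j) = norm(L(i,:) - L(j,:));
    end
end
Cov = (D==0).*s + (D>0 & D<=r).*(s-a).*(1 - 1.5*(D/r) + 0.5*(D/r).^3);   % spherical covariogram
Tch = chol(Cov)';                  % lower-triangular factor with Cov = Tch*Tch'
mu  = 10;                          % the constant mean used in this example
Y   = mu + Tch * randn(n, 20);     % each column is one simulated realization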

A typical realization of this process is shown in Figure 6.17 below.

[Figure 6.17. Simulated Realization (point pattern of the n = 500 simulated sites, with a 25 km scale bar)]

Notice that spatial correlation is indeed evident at scales smaller than the 25 km range
shown. Hence the question of interest is whether bandwidths less than this range value do
a better job of prediction. The above program, o_krige_cross_val.m, was run for each of
these 20 simulations. Based on this limited sample, the answer is definitely yes. The
cross-validation plot for the realization in Figure 6.17 is shown in Figure 6.18 below:


[Figure 6.18. Cross Validation Plot of RMSE(h) against bandwidth h (in km), with the best bandwidth and the true range marked]

So again the best bandwidth is seen to be about half the true range value. It is also
important to note that the estimates of the constant mean and covariogram parameters are
actually quite reasonable:

(6.3.44)  $\hat{\mu}_n = 9.932 \, , \quad (\hat{r}, \hat{s}, \hat{a}) = (31.638,\ 3.502,\ 0.74557)$

So it cannot be argued that this is a result of parameter-estimation error. Indeed, given the
moderate overestimation of the true range in this case, one might have expected larger
bandwidths to do quite well here.

Finally it should be added that these best bandwidths showed considerable variation over the 20 simulated realizations. The lowest was 5 km, and one was actually above the true range (27 km), even though the range estimate for this case was almost exactly 25 km. So a great deal seems to depend on the spatial structure of the particular pattern realized. But this limited investigation does support the commonly held belief that the best bandwidths for kriging predictions are generally less than the estimated range value.


7. General Spatial Prediction Models


Recall that within our general spatial modeling framework, $\{Y(s) = \mu(s) + \varepsilon(s) : s \in R\}$, the global trend, $\mu(s)$, is assumed to be constant in both Simple and Ordinary Kriging. What this means in practice is that all spatial variations are assumed to be captured by the covariance structure among the residuals, $\varepsilon(s)$. However, the more general kriging models, described as Universal Kriging and Geostatistical Kriging in Section 6.1.2 above, allow non-constant spatial trend structures. Hence the central task of this section is to develop these more general models in detail.

We begin by developing the types of trend functions to be considered. Recall from the Sudan Rainfall example in Section 2.1 that a number of such trend functions were developed. Here the simplest of these postulated that there was some linear trend over space expressible entirely in terms of the spatial coordinates, $s = (s_1, s_2)$, i.e.,

(7.1)  $\mu(s) = \beta_0 + \beta_1 s_1 + \beta_2 s_2$

An elaboration of this was given by the quadratic trend function,

(7.2)  $\mu(s) = \beta_0 + \beta_1 s_1 + \beta_2 s_2 + \beta_3 s_1^2 + \beta_4 s_1 s_2 + \beta_5 s_2^2$

More generally, one may consider polynomial trend functions of the form,

(7.3)  $\mu(s) = \beta_0 + \sum_{j=1}^{k} \beta_j\, s_1^{n_j} s_2^{m_j}$

where n j and m j are nonnegative integer values. Spatial trends in phenomena that vary
smoothly over space tend to be well approximated (locally) by such polynomial
functions. A good example is elevation in hilly terrain. The advantage of these functions
is that they require nothing more than the coordinate data in the map itself. Hence the
data for constructing such functions is essentially always available. It is for this reason
that ARCMAP uses polynomial functions as built-in options for modeling spatial trends
(including all polynomials up to order three, i.e., with $n_j + m_j \le 3$, $j = 1,\dots,k$). A second advantage of these functions is that even though they may involve many spatially nonlinear terms, they are still linear in parameters. In other words, such functional forms are linear in all parameters to be estimated, namely $\beta_0, \beta_1,\dots,\beta_k$. So unlike the nonlinear
least squares estimation procedure required for the standard variogram models in Section
4.7.2 above, these models can be estimated by ordinary least squares (OLS).
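As a simple illustration of this point, the quadratic trend in (7.2) can be fitted by OLS with only a few lines of MATLAB. This sketch assumes coordinate data in an n-by-2 array, s, and observed values in y; these names are illustrative and not part of the course programs.

% Illustrative sketch: OLS fit of the quadratic trend function (7.2)
s1 = s(:,1);   s2 = s(:,2);
X  = [ones(size(s1)), s1, s2, s1.^2, s1.*s2, s2.^2];   % linear-in-parameters design matrix
b  = X \ y;                                            % OLS estimates (b0,...,b5)
mu_hat = X * b;                                        % fitted trend values at the data sites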

But while such functions are sufficiently general to fit many types of spatial trends, they
offer little in the way of explanation regarding the nature of these trends. For example,
we saw in the introductory California Rainfall example that variables such as “altitude”
and “rain shadow” were useful predictors of average rainfall that are not captured by
coordinate positions. Even in the case of Vancouver Nickel used for Simple and Ordinary


Kriging above, it may well be that local soil types as well as concentrations of other
mineral types might yield better predictions of nickel deposits than simple location
coordinates. So, in the spirit of the regression model used in the California Rainfall
example, it is of interest to consider linear-in-parameter spatial trend functions involving
many possible spatial attributes:

(7.4)  $\mu(s) = \beta_0 + \sum_{j=1}^{k} \beta_j\, x_j(s)$

This is seen to include all examples above, where for example, one may have polynomial terms, $x_j(s) = s_1^{n_j} s_2^{m_j}$, or more general spatial attributes such as $x_j(s) =$ “altitude at $s$”, or $x_j(s) =$ “copper concentration at $s$”. Moreover, it should be clear that all such trend functions yield spatial models

(7.5)  $Y(s) = \beta_0 + \sum_{j=1}^{k} \beta_j\, x_j(s) + \varepsilon(s) \, , \quad s \in R$

which appear to be simply instances of classical linear regression models like the
California Rain example. However there is one important difference, namely that the
spatial random effects, $\varepsilon(s)$, are allowed to exhibit nonzero covariances. The only
difference here is the covariance structure of the residuals. More formally, such models
are instances of the general linear regression model that allows for nonzero covariances
between residuals. Hence to develop spatial prediction models with non-constant trends,
we turn first to a consideration of the general linear regression model itself.

7.1 The General Linear Regression Model

To formalize such models in the simplest way, it is essential to use vector representations.
We start with a given finite sample,1 $Y = [Y(s_i) : i = 1,\dots,n] = (Y_i : i = 1,\dots,n)$, from a spatial stochastic process with global trend of the form (7.5). Let

(7.1.1)  $x(s_i) = [x_0(s_i), x_1(s_i),\dots,x_k(s_i)] = (x_{i0}, x_{i1},\dots,x_{ik}) = (1, x_{i1},\dots,x_{ik})$

denote the vector of relevant attributes at each sample location, $i = 1,\dots,n$, where by convention the “attribute”, $x_{i0} = 1$, corresponds to the intercept term ($\beta_0$) in (7.5). With this convention, the integer $k$ denotes the actual number of spatial attributes used in the model. If $\beta = (\beta_0, \beta_1,\dots,\beta_k)'$ denotes the corresponding vector of coefficients, then (7.5) can be rewritten as

(7.1.2)  $Y(s_i) = x(s_i)'\beta + \varepsilon(s_i) \, , \quad i = 1,\dots,n$

1
Notice that we now drop the notation, Yn , used for this sample in Section 6 [in order to avoid confusion
with the data point, Yn  Y ( sn ) ].


This can be further reduced by letting

(7.1.3)  $X = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix} = \begin{bmatrix} x(s_1)' \\ \vdots \\ x(s_n)' \end{bmatrix} \, , \qquad \varepsilon = \begin{bmatrix} \varepsilon(s_1) \\ \vdots \\ \varepsilon(s_n) \end{bmatrix} = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}$

so that (7.1.2) can be written in compact matrix form as

(7.1.4)  $Y = X\beta + \varepsilon$

Our primary interest for the moment focuses on the residual vector, $\varepsilon$. Recall from Section 3 that $\varepsilon$ is assumed to be multi-normally distributed with mean zero. Moreover, the usual multiple regression model (as for example in the California Rain case), assumes that the individual components of $\varepsilon$ are statistically independent, and hence have zero covariance. Thus [as in (6.3.10) above] this covariance matrix has the form:

(7.1.5)  $\operatorname{cov}(\varepsilon) = \begin{bmatrix} \sigma^2 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \sigma^2 \end{bmatrix} = \sigma^2 I_n$

In this spatial context, the classical regression model can be formally summarized as follows:

(7.1.6)  $Y = X\beta + \varepsilon \, , \quad \varepsilon \sim N(0, \sigma^2 I_n)$

But as in Section 3.3 above, we wish to extend this model by allowing covariance-stationary spatial dependencies between the individual components of $\varepsilon$. Hence, while all diagonal elements will continue to have the constant value, $\sigma^2$, many of the off-diagonal elements will now be nonzero. If we now let $\sigma_{ij}$ and $\rho_{ij}$ denote, respectively, the covariance and correlation between residuals $\varepsilon_i$ and $\varepsilon_j$, and recall that [as in expression (3.3.13)], $\rho_{ij} = \sigma_{ij}/\sigma^2$, then the general covariance matrix, $V$, for $\varepsilon$ can be written as follows:

(7.1.7)  $V = \operatorname{cov}(\varepsilon) = \begin{bmatrix} \sigma^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma^2 \end{bmatrix} = \sigma^2 \begin{bmatrix} 1 & \cdots & \rho_{1n} \\ \vdots & \ddots & \vdots \\ \rho_{n1} & \cdots & 1 \end{bmatrix} = \sigma^2 C$

where $C$ is the corresponding correlation matrix for $\varepsilon$. The advantage of this particular representation is that the important variance parameter, $\sigma^2$, is made explicit. Moreover, (7.1.7) is now more easily related to the classical case in (7.1.5) where $C$ reduces to the


identity matrix, $I_n$. In this setting, the general linear regression model can be formally summarized for our purposes by simply replacing $I_n$ with $C$ in (7.1.6), i.e.,2

(7.1.8)  $Y = X\beta + \varepsilon \, , \quad \varepsilon \sim N(0, V) = N(0, \sigma^2 C)$

7.1.1 Generalized Least Squares Estimation


Recall that the classical linear regression model is estimated by the method of ordinary
least squares. As shown below, this method is directly extendable to the generalized
linear regression model. In particular, since the correlation matrix, C, is assumed to be
given (as for example in the Universal Kriging model to be developed below), this model
can be reduced to an equivalent classical linear regression model. To develop these
results, we begin with the classical linear regression case, and then proceed to generalized
linear regression.

OLS Estimators
Given a sample realization, $y = (y_1,\dots,y_n)'$, of $Y$ in model (7.1.6), the method of ordinary least squares (OLS) seeks to determine an estimate of the unknown coefficient vector, $\beta$, that minimizes the sum of squared deviations between the $y_i$ values and their estimated mean values, $x(s_i)'\beta$. More formally, if the sum-of-squares function ($S$) is defined for all possible $\beta$ values by:

(7.1.9)  $S(\beta) = \sum_{i=1}^{n} [\,y_i - x(s_i)'\beta\,]^2 = \sum_{i=1}^{n} [\,y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})]^2$

then the OLS estimator, $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1,\dots,\hat{\beta}_k)'$, is taken to be the minimizer of (7.1.9), i.e.,

(7.1.10)  $S(\hat{\beta}) = \min_{\beta} S(\beta)$

To determine this estimator, we begin by using (7.1.3) to rewrite this function in matrix terms as,

(7.1.11)  $S(\beta) = (y - X\beta)'(y - X\beta) = y'y - 2\,y'X\beta + \beta'X'X\beta$

Notice that this is again a quadratic form in the unknown value, $\beta$, that is similar to the mean squared error function, $MSE(\lambda_0)$, in expression (6.2.27) above. So the solutions for these two problems are also similar. In the present case, it is shown in Section A2.7.3 of
the Appendix that the solution to (7.1.10) is given by

2 In Part III of this NOTEBOOK we shall return to this general linear regression model in a somewhat different context. So both covariance representations, $V$ and $\sigma^2 C$, will be useful. For similar treatments see expression (9.11) in Gotway and Waller and section 10.1 in Green (2003).


(7.1.12)  $\hat{\beta} = (X'X)^{-1}X'Y$

Notice that we have used the random vector, $Y$, rather than the realized sample data, $y$, in (7.1.12) in order to view $\hat{\beta}$ as a random vector defined for all realizations. [In statistical terms, the distinction here is between $\hat{\beta}$ as an estimate of $\beta$ for a given data set, $y$, and its role as an estimator of $\beta$ for all sample realizations of $Y$.] Notice also that for this OLS estimator to be well defined, it is necessary that the matrix $X'X$ be nonsingular. This will almost surely be guaranteed whenever the number of samples is substantially greater than the number of parameters to be estimated, i.e., whenever $n \gg k + 1$.3 More generally, statistical estimation of any set of parameters can only be reliable when the number of data points well exceeds the number of parameters. In the case of classical linear regression, a common rule of thumb is that there be at least 10 samples for every parameter, i.e., $n \ge 10(k+1)$.

Before proceeding to the more general case, it is important to point out that $\hat{\beta}$ is an unbiased estimator, since under model (7.1.6), $E(Y) = X\beta$ implies that

(7.1.13)  $E(\hat{\beta}) = E[(X'X)^{-1}X'Y] = (X'X)^{-1}X'E(Y) = (X'X)^{-1}X'X\beta = \beta$

What is equally important is the fact that (like the sample mean used in Simple Kriging
predictions) this unbiasedness property is independent of $\operatorname{cov}(\varepsilon)$. All that is required is that the linear trend specification, $X\beta$, is correct [i.e., that $E(\varepsilon) = 0$]. So in the case of
California Rainfall, for example, if the four final variables used were a correct
specification of the model, then regardless of possible spatial dependencies among
residuals ignored in this model, the estimated beta coefficients would still be unbiased.

GLS Estimators

To extend these results to generalized linear regression, we employ the fact that every
(nonsingular) covariance matrix can be decomposed in a very simple way. For the
covariance matrix, $C$, in (7.1.7) it is shown in the Appendix [by combining the Positive Definiteness Property above expression (A2.7.67) with the Cholesky Theorem following expression (A2.7.45)] that there exists a Cholesky decomposition of $C$, i.e., there exists a
lower triangular matrix,

(7.1.14)  $T = \begin{bmatrix} t_{11} & 0 & \cdots & 0 \\ t_{21} & t_{22} & & \vdots \\ \vdots & & \ddots & 0 \\ t_{n1} & t_{n2} & \cdots & t_{nn} \end{bmatrix}$

such that

3 The symbol “$\gg$” is conventionally used to mean “substantially greater than”.


(7.1.15)  $C = TT'$

The matrix, T, is usually called the Cholesky matrix for C. While we require no detailed
knowledge of such matrices here, it is of interest to point out that the desired Cholesky
matrix is easily obtained in MATLAB by the command,4

>> T = chol(C);

Perhaps the most attractive feature of lower triangular matrices is that they are extremely easy to invert (and indeed first appeared as part of the classical “Gaussian elimination” method for solving systems of linear equations). Moreover, it is this inverse, $T^{-1}$, which is directly useful for our purposes. In particular, since $C$ is given, we can compute $T^{-1}$ prior to any analysis of model (7.1.8). But if we then premultiply both sides of the equation in (7.1.8) by $T^{-1}$ to obtain,

(7.1.16)  $T^{-1}Y = T^{-1}X\beta + T^{-1}\varepsilon$

and define the following transformed quantities,

(7.1.17)  $\tilde{Y} = T^{-1}Y \, , \quad \tilde{X} = T^{-1}X \, , \quad \tilde{\varepsilon} = T^{-1}\varepsilon$

then by (7.1.16) we obtain the following transformed model:

(7.1.18)  $\tilde{Y} = \tilde{X}\beta + \tilde{\varepsilon}$

Moreover, since $\tilde{\varepsilon}$ is a linear transformation of $\varepsilon$, it follows from the Linear Invariance Theorem for multi-normal random vectors [in (3.2.22) above] that $\tilde{\varepsilon}$ is also multi-normally distributed with mean zero. But by using (7.1.15) and (3.2.21) [together with the matrix identity $(T^{-1})' = (T')^{-1}$] we can determine the covariance matrix for $\tilde{\varepsilon}$ as follows:

(7.1.19)  $\operatorname{cov}(\tilde{\varepsilon}) = \operatorname{cov}(T^{-1}\varepsilon) = T^{-1}\operatorname{cov}(\varepsilon)(T^{-1})' = T^{-1}(\sigma^2 C)(T')^{-1} = \sigma^2\, T^{-1}(TT')(T')^{-1} = \sigma^2 (T^{-1}T)[(T')(T')^{-1}] = \sigma^2 I_n$

4 Note the transpose operation here. MATLAB for some reason has chosen to produce $T'$ rather than $T$.


So this transformed model is seen to take the form:

(7.1.20)  $\tilde{Y} = \tilde{X}\beta + \tilde{\varepsilon} \, , \quad \tilde{\varepsilon} \sim N(0, \sigma^2 I_n)$

Finally, by comparing this with (7.1.6) we see that the generalized linear regression
model in (7.1.8) has been transformed into a classical linear regression model. This may
seem a bit like “magic”. But it is simply one of the many consequences of the Linear
Invariance Theorem for multi-normal random vectors, and serves to underscore the
power of this theorem.

Given this equivalence, we may again use OLS to estimate $\beta$. In particular, by using the transformed data, $(\tilde{X}, \tilde{Y})$, in (7.1.17), it follows at once from (7.1.12) that the desired OLS estimator is given by

(7.1.21)  $\hat{\beta} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{Y}$

To distinguish this from the classical linear regression model, it is customary to transform this estimator back into the form of the generalized linear regression model. This amounts simply to substituting the above relations into (7.1.21) [and using the matrix identity $(TT')^{-1} = (T')^{-1}(T^{-1}) = (T^{-1})'(T^{-1})$] to obtain

(7.1.22)  $\hat{\beta} = [(T^{-1}X)'(T^{-1}X)]^{-1}(T^{-1}X)'(T^{-1}Y) = [X'(T^{-1})'(T^{-1})X]^{-1}X'(T^{-1})'(T^{-1})Y = [X'(TT')^{-1}X]^{-1}X'(TT')^{-1}Y$

Finally, recalling from (7.1.15) that $C = TT'$, it follows that

(7.1.23)  $\hat{\beta} = (X'C^{-1}X)^{-1}X'C^{-1}Y$

which is entirely independent of Cholesky matrices or transformed models. For our later purposes, it is convenient to rewrite (7.1.23) using the full covariance matrix, $V$, for $\varepsilon$ in (7.1.7), i.e.,

(7.1.24)  $\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}Y$

The latter version is typically designated as the generalized least squares (GLS) estimator of $\beta$. However, these two versions are in fact equivalent representations of the GLS estimator since the substitution, $V = \sigma^2 C$, in (7.1.24) shows that


(7.1.25)  $\hat{\beta} = (X'[\sigma^2 C]^{-1}X)^{-1}X'[\sigma^2 C]^{-1}Y = (X'C^{-1}X)^{-1}X'C^{-1}Y$

Note that this identity also shows that the GLS estimator, $\hat{\beta}$, is functionally independent of $\sigma^2$. This independence will prove to be enormously useful in later applications.

Note also that even though $\hat{\beta}$ is still dependent on the covariance matrix, $V$, this dependence has no effect on the unbiasedness of $\hat{\beta}$. This should be obvious from its equivalence to an OLS estimator. But in any case, by taking expectations in (7.1.24) we see that

(7.1.26)  $E(\hat{\beta}) = E[(X'C^{-1}X)^{-1}X'C^{-1}Y] = (X'C^{-1}X)^{-1}X'C^{-1}E(Y) = (X'C^{-1}X)^{-1}X'C^{-1}(X\beta) = (X'C^{-1}X)^{-1}(X'C^{-1}X)\beta = \beta$

So regardless of how badly this covariance matrix is misspecified (including $C = I_n$), this by itself creates no bias. (Rather, it creates inefficiency of the estimator, $\hat{\beta}$.)

Finally it should be noted that by letting $\tilde{y} = T^{-1}y$, one can also transform the sum-of-squares function, $S$, (by using the same matrix identities above) to obtain:

(7.1.27)  $S(\beta) = (\tilde{y} - \tilde{X}\beta)'(\tilde{y} - \tilde{X}\beta) = (T^{-1}y - T^{-1}X\beta)'(T^{-1}y - T^{-1}X\beta) = [T^{-1}(y - X\beta)]'[T^{-1}(y - X\beta)] = (y - X\beta)'(T^{-1})'(T^{-1})(y - X\beta) = (y - X\beta)'(TT')^{-1}(y - X\beta) = (y - X\beta)'C^{-1}(y - X\beta)$

Note again that since $C$ differs from $V = \sigma^2 C$ by a positive scalar, it can be replaced by
V in (7.1.27) without altering the solution. Both forms are seen to be weighted versions of
(7.1.11). For this reason, GLS estimation is often referred to as weighted least squares. In
any case, it should be clear that by minimizing (7.1.27) to obtain (7.1.23) [or (7.1.24)],
one need never mention Cholesky matrices or transformed models. But this underlying
equivalence between OLS and GLS has many consequences that are not readily
perceived otherwise (as will be seen, for example, in Section 7.3.4 below).
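To see this equivalence at work numerically, the following MATLAB sketch computes the GLS estimator both ways, given a data vector y, design matrix X, and correlation matrix C as in (7.1.7). This is only an illustrative sketch (the variable names are those of the text, but the fragment is not one of the course programs):

% Illustrative sketch: GLS via the Cholesky transformation versus the direct formula
T     = chol(C)';                    % lower-triangular Cholesky matrix, C = T*T' (note the transpose)
Xt    = T \ X;                       % transformed design matrix, as in (7.1.17)
yt    = T \ y;                       % transformed data vector
b_ols = (Xt'*Xt) \ (Xt'*yt);         % OLS on the transformed model, eq. (7.1.21)
b_gls = (X'*(C\X)) \ (X'*(C\y));     % direct GLS formula, eq. (7.1.23)
max(abs(b_ols - b_gls))              % the two agree up to numerical rounding error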

7.1.2 Best Linear Unbiasedness Property

Having derived these estimators in terms of standard least squares procedures, it is


important to consider their optimality properties as estimators. Our objective is to show
that these estimators have the same BLU properties of Simple and Ordinary Kriging
above. But to do so, it is necessary to extend the notion of Best Linear Unbiased
estimation to vectors of parameters such as  . Here one might simply argue that we


should consider the estimation of each component, $\beta_j$, $j = 0,1,\dots,k$, separately. But it turns out that one can do much better than this. In particular, if we were trying to estimate the expected value of a particular component of $Y$, say $Y_i = Y(s_i)$, then by (7.1.2) this takes the form of a conditional mean

(7.1.28)  $E[Y_i \,|\, x(s_i)] = x(s_i)'\beta$

Here the standard regression procedure is simply to plug in the beta estimators, $\hat{\beta}$, and use the derived “Yhat” estimator,

(7.1.29)  $\hat{Y}_i = x(s_i)'\hat{\beta} = \hat{\beta}_0 + \sum_{j=1}^{k} x_j(s_i)\,\hat{\beta}_j$

Hence, even if one were able to establish optimality properties for individual estimators, $\hat{\beta}_j$, there would remain the question as to whether linear combinations of estimators such as in (7.1.29) were still optimal in any sense.

It is for this reason that a much more powerful way to characterize optimality properties
of vector estimators is in terms of all possible linear combinations of these estimators. In
the present case, observe that if we now focus on GLS estimators and consider any linear combination of the unknown $\beta$ vector, say $a'\beta$, then by (7.1.24) the corresponding estimator, $a'\hat{\beta}$, takes the form

(7.1.30)  $a'\hat{\beta} = a'(X'V^{-1}X)^{-1}X'V^{-1}Y = [\,a'(X'V^{-1}X)^{-1}X'V^{-1}]\,Y = \lambda_a' Y$

where $\lambda_a' = a'(X'V^{-1}X)^{-1}X'V^{-1}$. But since $(a, X, V)$ are all known values, this estimator is indeed seen to be a linear estimator of $a'\beta$, i.e., a linear function of the $Y$ vector (in a manner completely analogous to Simple and Ordinary Kriging weights). Moreover, by the argument in (7.1.26) it follows at once that

(7.1.31)  $E(a'\hat{\beta}) = a'E(\hat{\beta}) = a'\beta$

Hence the “plug-in” estimator, $a'\hat{\beta}$, is seen to be a linear unbiased estimator of $a'\beta$, for all possible choices of $a$. But the real power of this “linear compound” approach is that it provides a natural definition of best linear unbiased estimators in this vector setting. In particular, we now say that $\hat{\beta}$ is a Best Linear Unbiased (BLU) estimator of $\beta$, if and only if in addition to (7.1.30) and (7.1.31) it is also true that the variance of $a'\hat{\beta}$ is smallest among all such linear unbiased estimators. More formally, if we now denote the class of all linear unbiased estimators, $\tilde{\beta}$, of $\beta$ by

(7.1.32)  $LU(\beta) = \big\{\, \tilde{\beta} = \tilde{\beta}(X, V, Y) : [\, a'\tilde{\beta} = \lambda_a' Y \,] \ \& \ [\, E(a'\tilde{\beta}) = a'\beta \,] \, , \ \forall\, a \in \mathbb{R}^{k+1} \big\}$

then $\hat{\beta}$ is said to be a Best Linear Unbiased (BLU) estimator of $\beta$ if and only if for all linear compounds, $a \in \mathbb{R}^{k+1}$,

(7.1.33)  $\operatorname{var}(a'\hat{\beta}) = \min\{\operatorname{var}(a'\tilde{\beta}) : \tilde{\beta} \in LU(\beta)\}$

While this definition looks rather ambitious, it is shown in the Appendix (see the first subsection of Section A2.8.3) that the unique estimator in $LU(\beta)$ satisfying this minimum variance condition for all $a \in \mathbb{R}^{k+1}$ is precisely the GLS estimator in (7.1.24).

7.1.3 Regression Consequences of Spatially Dependent Random Effects

As discussed in detail in Section 3 above, our primary interest in GLS models is to allow
covariance structures to reflect spatially dependent random effects. We are now in a
position to see the consequences of such effects in more detail. To do so, we begin with
the simplest possible spatial regression model, where such effects can be seen explicitly.
We then examine these effects in a more complex setting by means of simulation.

Simple Constant-Mean Example.

Here we start with the simplest possible spatial regression model with a constant mean,
i.e., with no “explanatory variables” at all:

(7.1.34)  $Y(s) = \mu + \varepsilon(s) \, , \quad s \in \{s_1,\dots,s_n\} \subseteq R$

In this context, suppose we ignore possible spatial dependencies among residuals, and assume simply that the residuals in (7.1.34) are independent, say $\varepsilon(s) \sim_{iid} N(0, \sigma^2)$. Then in matrix form, we have the regression model:

(7.1.35)  $Y = \mu \mathbf{1}_n + \varepsilon \, , \quad \varepsilon \sim N(0, \sigma^2 I_n)$

where in this case, $X = \mathbf{1}_n = (1,\dots,1)'$ and $\beta = \mu$. Hence for this case it follows from (7.1.24) that the BLU estimator of $\mu$ is given by:

(7.1.36)  $\hat{\mu} = (\mathbf{1}_n'\mathbf{1}_n)^{-1}\mathbf{1}_n'Y = (n)^{-1}\sum_{i=1}^{n} Y_i = \tfrac{1}{n}\sum_{i=1}^{n} Y_i = \bar{Y}$

which is of course simply the sample mean, $\bar{Y}$. [Recall also expressions (6.3.11) and (6.3.12).] Moreover, recall from (3.1.19) that the variance of this estimator must be given by

(7.1.37)  $\operatorname{var}(\bar{Y}) = \dfrac{\sigma^2}{n}$


So all inferences about the true value of $\mu$ will be based on the estimator, $\bar{Y}$, and its variance in (7.1.37).

But suppose that in reality there are positive spatial dependencies among the residuals in (7.1.35), so that in fact the covariance of $\varepsilon$ has the form,

(7.1.38)  $\operatorname{cov}(\varepsilon) = \begin{bmatrix} \operatorname{cov}(\varepsilon_1, \varepsilon_1) & \cdots & \operatorname{cov}(\varepsilon_1, \varepsilon_n) \\ \vdots & \ddots & \vdots \\ \operatorname{cov}(\varepsilon_n, \varepsilon_1) & \cdots & \operatorname{cov}(\varepsilon_n, \varepsilon_n) \end{bmatrix} = \begin{bmatrix} \sigma^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma^2 \end{bmatrix} \ne \sigma^2 I_n$

with $\sigma_{ij} > 0$ for many distinct $(i,j)$ pairs. Then, in a manner similar to expression (4.10.3) above, it follows that since $\operatorname{cov}(Y) = \operatorname{cov}(\varepsilon)$, the true variance of $\bar{Y}$ is given by

(7.1.38)  $\operatorname{var}(\bar{Y}) = \operatorname{var}\!\left(\tfrac{1}{n}\sum_{i=1}^{n} Y_i\right) = \tfrac{1}{n^2}\left[\sum_{i=1}^{n}\operatorname{var}(Y_i) + \sum_i\sum_{j\ne i}\operatorname{cov}(Y_i,Y_j)\right]$

$\qquad\qquad = \tfrac{1}{n^2}\sum_{i=1}^{n}\operatorname{var}(Y_i) + \tfrac{1}{n^2}\sum_i\sum_{j\ne i}\operatorname{cov}(Y_i,Y_j) = \tfrac{1}{n^2}(n\sigma^2) + \tfrac{1}{n^2}\sum_i\sum_{j\ne i}\sigma_{ij}$

$\qquad\qquad = \dfrac{\sigma^2}{n} + \tfrac{1}{n^2}\sum_i\sum_{j\ne i}\sigma_{ij}$
which, in the presence of many positive spatial dependencies, implies that,

2
(7.1.39) var(Y ) 
n

and hence that standard deviation,  (Y ) , is much larger than assumed, i.e.,


(7.1.40)  (Y ) 
n

This means, for example, that if we consider a 95% confidence interval for the true mean,
 , then the actual interval is given by

(7.1.41) CI actual  [Y  (1.96)  (Y )]

rather than the assumed interval

(7.1.42) CI assumed  Y  (1.96) n 


So for any given estimate, $\bar{y}$, this implies from (7.1.40) that the actual confidence intervals for $\mu$ are much larger than those calculated, as depicted schematically below (inner brackets: assumed interval; outer brackets: actual interval):

                 Assumed CI
        [      [      ȳ      ]      ]
                 Actual CI

Thus if such spatial dependencies are not accounted for, then the results obtained will
tend to look “too good”. It is this type of false significance that motivates the need to
remove the effects of spatial dependencies in residuals before attempting to draw
statistical inferences.
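To get a feel for the magnitude of this effect, the following MATLAB sketch compares the assumed and actual variances of $\bar{Y}$ for a small artificial example with exponentially decaying positive covariances among n = 25 equally spaced sites on a line. The covariance structure and all numerical values here are illustrative assumptions, not taken from the text.

% Illustrative sketch of the variance inflation in (7.1.38)-(7.1.40):
n      = 25;
[I, J] = meshgrid(1:n, 1:n);
d      = abs(I - J);                       % inter-site distances on a line
sig2   = 1;
V      = sig2 * exp(-d/5);                 % a positive, spatially decaying covariance matrix
one_n  = ones(n,1);

var_assumed = sig2 / n                     % sigma^2/n under the independence assumption
var_actual  = (one_n' * V * one_n) / n^2   % true var(Ybar) = (1/n^2)*1'V1, as in (7.1.38)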

More Complex Example.

As one illustration of a more complex spatial example, consider a spatial regression model with data points, $\{s_i = (x_{1i}, x_{2i}) : i = 1,\dots,100\}$, forming a $10 \times 10$ unit grid on the plane, and with $Y_i$ defined by a linear function of these grid points,

(7.1.43)  $Y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i \, , \quad i = 1,\dots,100$

with specific parameter values, $\beta_0 = 1$, $\beta_1 = .04$, and $\beta_2 = .08$. Suppose moreover that the residuals $\{u_i : i = 1,\dots,100\}$ are part of an underlying covariance-stationary spatial stochastic process with covariogram, $C(h)$, parameterized by $[r = 5,\ s = 1,\ a = 0]$, as shown in Figure 7.1 below.

[Figure 7.1. Example Covariogram C(h), with sill s = 1 and range r = 5]


Given this model, one can in principle calculate the theoretical estimates and standard
errors for any given set of data { yi : i  1,..,100} under the (OLS) assumption of
independent errors, and under the true (GLS) model itself. But it is more instructive to
simulate this model many times and compare the OLS and GLS estimates of beta
parameters. In Table 7.1 below, the average results of 100 simulations are shown, where
the “GLS Est” column shows the average GLS estimates of each beta parameter, the
“GLS Std Err” column shows the corresponding average standard errors of these
estimates, and similarly for the OLS columns.

GLS Est GLS Std Err OLS Est OLS Std Err
const 0.9284 0.4802 0.9156 0.2396
X1 0.0564 0.0565 0.0568 0.0289
X2 0.0897 0.0565 0.0934 0.0289

Table 7.1. Average Values for 100 Simulations

Notice first that while the GLS estimates are on average slightly better than the OLS
estimates, both sets of estimates are unbiased (regardless of the true covariance) and
should tend to be roughly the same. The real difference is in the estimated standard errors for each of these models. Here it is clear that the GLS standard errors are about twice as large as the OLS standard errors. So as a direct parallel to expression (7.1.40) above, it is now clear that by ignoring the true spatial dependencies, OLS is severely underestimating the true standard deviations. So the confidence intervals on the true beta values are again much tighter than they should be.
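A simulation of this kind is easy to reproduce in outline. The following MATLAB sketch assumes that the 100-by-3 design matrix X on the grid, the true coefficient vector beta_true = (1, .04, .08)', and the true covariance matrix V implied by the covariogram in Figure 7.1 have already been constructed; these names, and the details of how V is built, are assumptions for this sketch.

% Illustrative sketch of the simulation comparison underlying Table 7.1:
Tch    = chol(V)';                                  % lower-triangular factor, V = Tch*Tch'
nobs   = size(X,1);
nsims  = 100;
SE_ols = zeros(3, nsims);
for k = 1:nsims
    y     = X*beta_true + Tch*randn(nobs,1);        % one realization of model (7.1.43)
    b_ols = (X'*X) \ (X'*y);                        % OLS estimates
    s2    = sum((y - X*b_ols).^2) / (nobs - 3);     % OLS residual variance estimate
    SE_ols(:,k) = sqrt(diag(s2 * inv(X'*X)));       % standard errors reported by OLS
end
SE_gls = sqrt(diag(inv(X'*(V\X))));                 % standard errors under the true covariance
[mean(SE_ols,2), SE_gls]                            % compare with the Std Err columns of Table 7.1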

To illustrate the consequences of such underestimation, we consider one specific instance


of the simulations above (number 47 in the set of 100 simulations). Here the specific
estimates and standard errors are shown in Table 7.2 below:

GLS Est GLS Std Err OLS Est OLS Std Err
const 1.5197 0.6754 2.0143 0.2669
X1 -0.0062 0.0789 -0.1228 0.0352
X2 0.0913 0.0789 0.0981 0.0352

Table 7.2. Specific Values for a “Bad Case”

This example illustrates a particularly bad case in which the estimates of $\beta_1$ actually have
the wrong sign in both OLS and GLS. But if we display the 95% confidence intervals for
each case, we can see a substantial difference in the conclusions reached. First for the
GLS case we have:


(7.1.44)  $\beta_1 = -.0062 \pm (1.96)(.0789) \;\Rightarrow\; \beta_1 \in [-.1607,\ .1485]$

In particular, since the true value, .04, is contained in this interval, this value cannot be ruled out by these results. More generally, since zero is also contained in this interval, it can certainly not be concluded that $x_1$ is negatively related to $y$. On the other hand, since the corresponding OLS confidence interval is given by:

(7.1.45)  $\beta_1 = -.1228 \pm (1.96)(.0352) \;\Rightarrow\; \beta_1 \in [-.1917,\ -.0537]$

it must here be concluded that $\beta_1 \le -.0537$, and thus that $x_1$ is significantly negatively related to $y$. This is precisely the type of false significance that one seeks to avoid by allowing for the possibility of spatially-dependent errors in estimation procedures.

Given this general linear regression framework, together with our present emphasis on
modeling spatially-dependent errors, the task remaining is to develop specific methods
for spatial prediction within this setting. Recall from our general classification of Kriging
models in Section 6.1.2 that the method for doing so is known as Universal Kriging.
Hence we now develop this spatial prediction model in more detail.

7.2 The Universal Kriging Model

Recall from (6.1.10) and (6.1.11) that the basic probability model underlying Universal
Kriging is essentially the general linear regression model in (7.1.2) above. Within this
probabilistic framework, the task of spatial prediction (as in both Simple and Ordinary
Kriging) is to determine a BLU predictor for values, Y ( s0 ) , at locations s0 not in the
given sample data set, Yn  [Y ( si ) : i  1,.., n] . As we shall see in the next section, this
essentially amounts to an appropriate extension of the analysis for Ordinary Kriging.
Following this development we derive the appropriate standard error of prediction for
Universal Kriging. As with Simple Kriging, our main interest in Universal Kriging is that
it provides the simplest setting within which one can include the types of spatial trend
models developed above. Because this model is included as part of ARCMAP, we also
outline the procedure for implementing this model. However, the main role of this model
for our present purposes is to serve as an introduction to Geostatistical Regression and
Kriging, as developed in Section 7.3 below.

7.2.1 Best Linear Unbiased Prediction

Here we again start with a given prediction set, $S(s_0) = \{s_i : i = 1,\dots,n_0\} \subseteq \{s_1,\dots,s_n\}$, for $s_0$ together with corresponding prediction samples, $Y = [Y(s_i) : i = 1,\dots,n_0]$.5 Moreover, by

5
Note that we have now returned to the convention that Yn denotes the full sample vector and Y is the
prediction sample vector for s0 . As with Ordinary Kriging, both random vectors will be used here.


again appealing to the linear prediction hypothesis, it is assumed that the desired
predictor, Yˆ0  Yˆ ( s0 ) , is of the form:

(7.2.1) Yˆ0  0 Y

for some appropriate weight vector, 0  (1 ,.., n0 ). Turning next to the unbiasedness
condition, it follows from condition (6.3.14) for Ordinary Kriging, that this unbiasedness
condition again takes the basic form:

(7.2.2) 0  E (e0 )  E (Y0  Yˆ0 )  E[Y0  0 Y ]  E[Y0 ]  0 E (Y )

But now these expectations are more complex. By (7.1.2) we see that

(7.2.3) E (Y0 )  E[Y ( s0 )]  x( s0 )

and similarly that

0 E (Y )   i1 0i E[Y ( si )]   0i x( si )


n0 n0
(7.2.4) i 1

So to write (7.2.2) more explicitly, it is convenient to introduce the following notational


conventions. First let the vector of attributes at the prediction location, s0 , be denoted by

 1 
 
x
(7.2.5) x( s0 )  x0   01 
  
 
 x0 k 

and similarly, let the matrix of attributes for locations in S ( s0 ) be denoted by

 x( s1 )   1 x11  x1k 
   
(7.2.6) X0         
 x( sn )   1 xn 1  xn k 
 0   0 0 

Then since

 x( s1 ) 
 
 i1 0i x(si )  (01,.., 0n0 )      0 X 0 
n0
(7.2.7)
 x( sn ) 
 0 

it follows that (7.2.2) can be written in an explicit compact form as


(7.2.8)  $0 = x_0'\beta - \lambda_0' X_0 \beta = (x_0 - X_0'\lambda_0)'\beta$

But since this unbiasedness condition is required to hold for all $\beta$, it should be clear that this is only possible if $x_0 - X_0'\lambda_0 = 0$, or equivalently, if and only if

(7.2.9)  $X_0'\lambda_0 = x_0$

Turning finally to the efficiency condition, the argument in (6.3.17) for Ordinary Kriging can now be extended by using (7.2.8) to show that prediction error variance continues to be the same as residual mean squared error:

(7.2.10)  $\operatorname{var}(e_0) = E(e_0^2) = E[(Y_0 - \hat{Y}_0)^2] = E[(Y_0 - \lambda_0'Y)^2] = E\{[(x_0'\beta + \varepsilon_0) - \lambda_0'(X_0\beta + \varepsilon)]^2\} = E\{[(x_0' - \lambda_0'X_0)\beta + (\varepsilon_0 - \lambda_0'\varepsilon)]^2\} = E[(\varepsilon_0 - \lambda_0'\varepsilon)^2] = MSE(\lambda_0)$

But since all covariances are given, it follows by setting $V_0 = \operatorname{cov}(Y)$ that (as with both Simple and Ordinary Kriging) prediction error variance must again be given by,

(7.2.11)  $\operatorname{var}(e_0) = \sigma^2 - 2\,c_0'\lambda_0 + \lambda_0'V_0\lambda_0$

Hence the optimal weight vector, $\hat{\lambda}_0$, for the case of Universal Kriging must be the solution to the following constrained minimization problem:

(7.2.12)  minimize: $\sigma^2 - 2\,c_0'\lambda_0 + \lambda_0'V_0\lambda_0$  subject to: $X_0'\lambda_0 = x_0$

At this point, it should be clear that Ordinary Kriging is simply a special case of Universal Kriging. Indeed, if one eliminates all explanatory variables and keeps only the intercept term in (7.1.2), then by (7.2.5) and (7.2.6), $x_0$ reduces to $1$ and $X_0$ reduces to $\mathbf{1}_{n_0}$, so that the constraint in (7.2.12) reduces to $\mathbf{1}_{n_0}'\lambda_0 = 1$, which is precisely (6.3.18). This is a consequence of the fact that under the assumptions of Ordinary Kriging, this reduced model implies that $\beta_0 = \mu$, i.e.,

(7.2.13)  $Y(s) = \beta_0 + \varepsilon(s) \;\Rightarrow\; E[Y(s)] = \beta_0 = \mu$

Turning now to the solution, ̂0 , of (7.2.12), it is shown in the Appendix [expression
(A2.8.58)] that


(7.2.14)  $\hat{\lambda}_0 = V_0^{-1} X_0 (X_0' V_0^{-1} X_0)^{-1} (x_0 - X_0' V_0^{-1} c_0) + V_0^{-1} c_0$

By substituting $\hat{\lambda}_0$ into (7.2.1) we then obtain the following BLU predictor of $Y_0 = Y(s_0)$ for Universal Kriging [see also expression (A2.8.59) in the Appendix]:

(7.2.15)  $\hat{Y}_0 = x_0'(X_0'V_0^{-1}X_0)^{-1}X_0'V_0^{-1}Y + c_0'V_0^{-1}[\,Y - X_0(X_0'V_0^{-1}X_0)^{-1}X_0'V_0^{-1}Y\,]$

While this solution appears to be even more complex than expression (6.3.20) for the Ordinary Kriging case, it turns out to have an equally simple interpretation. To show this, we start by noting that as a parallel to (6.2.21), if we now estimate $\beta$ based solely on the prediction sample, $Y = [Y(s_i) : i = 1,\dots,n_0]$, for $Y_0$ (with attribute data, $X_0$, and covariance matrix, $V_0$) then it follows from (7.1.24) that the resulting GLS estimator of $\beta$, say $\hat{\beta}_{n_0}$, must be given by,

(7.2.16)  $\hat{\beta}_{n_0} = (X_0'V_0^{-1}X_0)^{-1}X_0'V_0^{-1}Y$

Moreover, by the results of Section 7.1.2 above, this must be the BLU estimator of $\beta$ based on this sample data. But by substituting (7.2.16) into (7.2.15), we then see that $\hat{Y}_0$ has the simpler form,

(7.2.17)  $\hat{Y}_0 = x_0'\hat{\beta}_{n_0} + c_0'V_0^{-1}(Y - X_0\hat{\beta}_{n_0})$

Finally, since the last expression in brackets is simply the vector of estimated residuals,

(7.2.18)  $\hat{\varepsilon} = Y - X_0\hat{\beta}_{n_0}$

generated by $\hat{\beta}_{n_0}$, it follows that $\hat{Y}_0$ takes the following form:

(7.2.19)  $\hat{Y}_0 = x_0'\hat{\beta}_{n_0} + c_0'V_0^{-1}\hat{\varepsilon}$

So as with Ordinary Kriging, the construction of Universal Kriging predictors is seen to have an appealing two-step interpretation:

(i). Construct the BLU estimator, $\hat{\beta}_{n_0}$, of $\beta$ based on the prediction sample data, $Y$, as in (7.2.16).

(ii). Use the sample residuals, $\hat{\varepsilon}$, in (7.2.18) to obtain the Universal Kriging predictor, $\hat{\varepsilon}_0 = c_0'V_0^{-1}\hat{\varepsilon}$, of $\varepsilon_0$ and set $\hat{Y}_0 = x_0'\hat{\beta}_{n_0} + \hat{\varepsilon}_0$.


But as with Ordinary Kriging, it can also be argued that if $\beta$ characterizes the global trend over the entire region, $R$, then a better estimate can be obtained by using the GLS estimator,

(7.2.20)  $\hat{\beta}_n = (X'V^{-1}X)^{-1}X'V^{-1}Y_n$

based on the full set of samples, $Y_n$, with attribute data, $X$. It is this modified procedure that constitutes the most commonly used form of Universal Kriging.6 To formalize this procedure, it thus suffices to modify the two steps above as follows:

(1). Construct the BLU estimator, $\hat{\beta}_n$, of $\beta$ based on the full sample data, $Y_n$, as in (7.2.20).

(2). Use the sample residuals, $\hat{\varepsilon} = Y - X_0\hat{\beta}_n$, to obtain the Universal Kriging predictor, $\hat{\varepsilon}_0 = c_0'V_0^{-1}\hat{\varepsilon}$, of $\varepsilon_0$ and set $\hat{Y}_0 = x_0'\hat{\beta}_n + \hat{\varepsilon}_0$.

7.2.2 Standard Error of Prediction

As with Ordinary Kriging, one can obtain prediction error variance for the optimal weight vector, $\hat{\lambda}_0$, by substituting (7.2.14) into (7.2.11). As is shown in the Appendix [see expression (A2.8.69)], this yields the following explicit expression for prediction error variance in the general case of Universal Kriging:

(7.2.21)  $\hat{\sigma}_0^2 = (\sigma^2 - c_0'V_0^{-1}c_0) + (x_0 - X_0'V_0^{-1}c_0)'(X_0'V_0^{-1}X_0)^{-1}(x_0 - X_0'V_0^{-1}c_0)$

Paralleling the interpretation of $\hat{\sigma}_0^2$ for Ordinary Kriging, the first bracketed expression in (7.2.21) is again the prediction error variance for Simple Kriging, and the second expression is again positive. This second term now accounts for the additional variance created by estimating $\beta$ internally. Finally, the resulting standard error of prediction for Universal Kriging is by definition the square root of (7.2.21), i.e.,

(7.2.22)  $\hat{\sigma}_0 = \sqrt{(\sigma^2 - c_0'V_0^{-1}c_0) + (x_0 - X_0'V_0^{-1}c_0)'(X_0'V_0^{-1}X_0)^{-1}(x_0 - X_0'V_0^{-1}c_0)}$
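For concreteness, the Universal Kriging prediction and its standard error can be computed with a few matrix operations once the local pieces are in hand. The following MATLAB sketch assumes that x0, X0, V0, c0, the variance sig2, and the prediction-sample data vector y are available; all names are assumptions for this sketch, and the estimation of these quantities from data is taken up in Section 7.2.3 below.

% Illustrative sketch of Universal Kriging prediction at s0:
b_n0  = (X0'*(V0\X0)) \ (X0'*(V0\y));              % local GLS estimate, eq. (7.2.16)
ehat  = y - X0*b_n0;                               % estimated residuals, eq. (7.2.18)
Y0hat = x0'*b_n0 + c0'*(V0\ehat);                  % BLU predictor, eq. (7.2.19)
g     = x0 - X0'*(V0\c0);
sig0  = sqrt((sig2 - c0'*(V0\c0)) + g'*((X0'*(V0\X0)) \ g));   % standard error, eq. (7.2.22)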

6
As with Ordinary Kriging, there are again arguments for using the local version in [(i),(ii)] above. In fact,
many treatments of Universal Kriging implicitly use this local version, as for example in Section 5.3.3 of
Schabenberger and Gotway (2005).


7.2.3 Implementation of Universal Kriging

In many respects, the implementation of Universal Kriging closely parallels that of


Ordinary Kriging. But the key difference is that when global trends are not constant, the
fundamental identity between differences of $Y$-values and $\varepsilon$-values in (4.8.4) breaks
down. So prior estimation of the variogram becomes quite problematic in this more
general setting. Indeed, this is the primary motivation for the method of Geostatistical
Kriging to be developed in Section 7.3 below. The most common procedure here is to
start with OLS estimation of  , which assumes all covariances are zero. This will yield a
set of OLS residuals that can then be used to estimate a spherical variogram. Given this
estimate, the procedure closely follows that of Ordinary Kriging.

With these preliminary observations, the implementation procedure for Universal Kriging
can be specified as follows. We again start with a given set of sample data,
yn  ( y ( si ):i  1,.., n) in R, where each yi is taken to be a realization of the
corresponding random variable, Y ( si ) , in a sample vector, Yn  [Y ( si ) : i  1,.., n] . This
sample vector, Yn , is now hypothesized to satisfy the generalized linear regression model
in (7.1.8) with attribute data, X, and covariance matrix, V . In this context, we again
consider the prediction of Y0  Y ( s0 ) , at a given location, s0  R . This prediction is
carried out through the following series of steps:

Step 1. OLS Estimation

Construct an OLS estimate,

(7.2.23)  $\hat{\beta}_{OLS} = (X'X)^{-1}X'y_n$

of $\beta$ and form the corresponding residuals

(7.2.24)  $\hat{\varepsilon}_{OLS} = y_n - X\hat{\beta}_{OLS}$

Step 2. Covariance Estimation

Using these residuals, $\hat{\epsilon}_{OLS} = [\hat{\epsilon}_i = \hat{\epsilon}(s_i): i = 1,..,n]$, proceed as in Step 2 for Simple
Kriging by estimating a spherical variogram, $\gamma(h; \hat{r}, \hat{s}, \hat{a})$, and associated covariogram,

(7.2.25)  $\hat{C}(h) = \hat{s} - \gamma(h; \hat{r}, \hat{s}, \hat{a})$

as in (6.2.65). Then using the identity

(7.2.26)  $\hat{\sigma}_{ij} = \widehat{\mathrm{cov}}[\epsilon(s_i), \epsilon(s_j)] = \hat{C}(\| s_i - s_j \|)$

as in (6.2.66), construct an estimate:

(7.2.27)  $\hat{V} = \begin{pmatrix} \hat{\sigma}^2 & \cdots & \hat{\sigma}_{1n} \\ \vdots & \ddots & \vdots \\ \hat{\sigma}_{n1} & \cdots & \hat{\sigma}^2 \end{pmatrix}$

of the full-sample covariance matrix, $V$.
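
As a concrete illustration, the construction of $\hat{V}$ from a fitted spherical variogram can be
sketched in MATLAB as follows. This is only a minimal sketch (assuming the standard spherical
form with nugget $\hat{a}$, sill $\hat{s}$, and range $\hat{r}$); the variable names L, r, s, a are illustrative and
are not those of any particular course program.

   % L : (n x 2) matrix of sample locations;  r, s, a : fitted range, sill, nugget
   D = sqrt((L(:,1) - L(:,1)').^2 + (L(:,2) - L(:,2)').^2);   % pairwise distances
   gam = zeros(size(D));                                      % spherical variogram at D
   in  = (D > 0) & (D <= r);
   gam(in)    = a + (s - a)*(1.5*(D(in)/r) - 0.5*(D(in)/r).^3);
   gam(D > r) = s;
   Vhat = s - gam;                     % covariogram C(h) = s - gamma(h); diag(Vhat) = s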

Step 3. GLS Estimation

Now use (7.2.26) to construct a final GLS estimate of $\beta$ as in (7.2.20),

(7.2.28)  $\hat{\beta}_n = (X' \hat{V}^{-1} X)^{-1} X' \hat{V}^{-1} Y_n$

with $\hat{V}$ replacing $V$ in (7.2.20).

Step 4. Selection of a Prediction Set for Y(s0)

Given the development of prediction set selection in Section 6.4 above, we can now
consider this selection problem more explicitly for Universal Kriging. In particular, we
now assume that the appropriate prediction set, $S(s_0)$, is defined by an appropriate
bandwidth, $h_0$, as follows,

(7.2.29)  $S(s_0) = \{ s_i \in S_n : \| s_0 - s_i \| \le h_0 \}$

where $S_n = \{s_1,..,s_n\}$ is again the full sample set of locations. Ideally this bandwidth
should be selected by a cross-validation procedure such as in Section 6.4. But given the
computational intensity of such procedures, we here assume that $h_0$ is selected simply by a
visual inspection of the mapped data surrounding site, $s_0$.

However, there is one additional requirement that must be met by prediction sets,
$S(s_0) = \{s_1,..,s_{n_0}\}$, in the case of Universal Kriging. Recall that if the attribute vector at
$s_0$ is denoted by $x_0$ as in (7.2.5), then the unbiasedness condition for Universal Kriging
in (7.2.9) requires that

(7.2.30)  $X_0' \lambda_0 = x_0$

where the transpose, $X_0'$, of the prediction attribute matrix, $X_0$, in (7.2.6) has $k+1$ rows
(one for each attribute) and $n_0$ columns (one for each prediction point). But (7.2.30)
formally requires that the given ($k+1$)-vector, $x_0$, of attributes at $s_0$ be a linear


combination of the columns of $X_0'$. This can only be guaranteed in general if $n_0 \ge k+1$.
Moreover, to avoid trivial solutions, we require that $n_0 \ge k+2$.7
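
In practice, the bandwidth-based selection in (7.2.29) amounts to a one-line distance
calculation, sketched below in MATLAB (illustrative variable names only):

   % s0 : 1 x 2 prediction location;  L : n x 2 sample locations;  h0 : bandwidth
   d0  = sqrt(sum((L - s0).^2, 2));     % distances from s0 to each sample location
   idx = find(d0 <= h0);                % indices of the prediction set S(s0)
   n0  = numel(idx);                    % should satisfy n0 >= k+2, as required above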

Step 5. Construction of Estimated Prediction Covariances

Given a prediction set, $S(s_0) = \{s_1,..,s_{n_0}\}$, one can then use (7.2.26) above to construct
estimates of the set of covariances,

(7.2.31)  $\hat{C}_0 = \begin{pmatrix} \hat{\sigma}^2 & \hat{c}_0' \\ \hat{c}_0 & \hat{V}_0 \end{pmatrix}$

relevant for prediction of $Y(s_0)$ [as in (6.3.33)]. These can in turn be used to krige residuals as
in the second step of the basic two-step procedure for Universal Kriging above. Here the
procedure is as follows:

Step 6. Kriging Prediction Residuals at s0

If the prediction sample data relevant for $s_0$ is denoted by $y = (y_1,..,y_{n_0})$, and if the
corresponding prediction residuals are estimated by,

(7.2.32)  $\hat{\epsilon} = y - X_0 \hat{\beta}_n$

then the residual, $\hat{\epsilon}_0$, predicted at $s_0$ can be constructed by Simple Kriging of $\hat{\epsilon}$ as
follows:

(7.2.33)  $\hat{\epsilon}_0 = c_0' V_0^{-1} \hat{\epsilon}$

Step 7. Constructing the Prediction of Y(s0)

Finally, (7.2.33) can be combined with (7.2.28) to obtain the desired prediction of the
unobserved value, $Y(s_0) = x_0'\beta + \epsilon_0$, at $s_0$, namely

(7.2.34)  $\hat{Y}_0 = x_0' \hat{\beta}_n + \hat{\epsilon}_0$

7
More precisely, $x_0$ is required to lie in the span of these column vectors. Hence there must be at least
$k+1$ linearly independent columns of $X_0'$ to insure this condition. But if this number were exactly
$n_0 = k+1$, then $\lambda_0$ would be uniquely determined by $\lambda_0 = (X_0')^{-1} x_0$. So for nontrivial solutions one must
require that $n_0 \ge k+2$.


Step 8. Prediction Intervals

By combining this with the corresponding estimate of prediction standard error,

(7.2.35)  $\hat{\sigma}_0 = \sqrt{(\hat{\sigma}^2 - \hat{c}_0' \hat{V}_0^{-1} \hat{c}_0) + (x_0 - X_0' \hat{V}_0^{-1} \hat{c}_0)'(X_0' \hat{V}_0^{-1} X_0)^{-1}(x_0 - X_0' \hat{V}_0^{-1} \hat{c}_0)}$

one can use the pair $(\hat{Y}_0, \hat{\sigma}_0)$ to construct prediction intervals for $Y(s_0)$. As in (6.2.63),
the default interval takes the form:

(7.2.36)  $[\,\hat{Y}_0 - (1.96)\,\hat{\sigma}_0\, ,\ \hat{Y}_0 + (1.96)\,\hat{\sigma}_0\,]$
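
Steps 5 through 8 can be summarized in a few lines of MATLAB. The sketch below is purely
illustrative (it is not the ARCMAP implementation, and all variable names are assumptions):
it takes as given the GLS estimate $\hat{\beta}_n$ from (7.2.28), here b_n, together with the estimated
covariances $\hat{c}_0$, $\hat{V}_0$, and $\hat{\sigma}^2$ from Step 5.

   % y  : n0 x 1 sample values in S(s0);  X0 : n0 x (k+1) attribute matrix (leading ones)
   % x0 : (k+1) x 1 attribute vector at s0; c0 : n0 x 1; V0 : n0 x n0; sig2 : variance (sill)
   e_hat = y - X0*b_n;                          % prediction residuals, (7.2.32)
   w     = V0 \ c0;                             % kriging weight vector V0^{-1} c0
   e0    = w' * e_hat;                          % kriged residual at s0, (7.2.33)
   Y0    = x0'*b_n + e0;                        % prediction of Y(s0), (7.2.34)
   u     = x0 - X0'*w;                          % x0 - X0' V0^{-1} c0
   sig0  = sqrt((sig2 - c0'*w) + u'*((X0'*(V0\X0)) \ u));   % standard error, (7.2.35)
   PI    = [Y0 - 1.96*sig0, Y0 + 1.96*sig0];    % 95% prediction interval, (7.2.36)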

7.3 Geostatistical Regression and Kriging

As mentioned at the beginning of Section 7.2.3 above, the estimation of variograms for
Universal Kriging is somewhat problematic. In particular, observe that the OLS residuals
in (7.2.24) used for estimation of variograms are generally not consistent with the final
GLS residuals in (7.2.32). So if the variogram were re-estimated on the basis of these
GLS residuals, it would generally not agree with the variogram originally used. This
inconsistency is simply ignored in the implementation of Universal Kriging outlined
above, and hence renders this procedure somewhat ad hoc. To be more precise, if we now
denote the parameter vector for the spherical variogram by

(7.3.1)  $\theta = (r, s, a)\, ,$

then on the one hand, if $\theta$ were known (as is implicit in the "known covariance"
assumption of Universal Kriging) one could employ GLS estimation to determine $\hat{\beta}$. On
the other hand, if $\beta$ were known, then the residual "data", $\epsilon = Y - X\beta$, could be used to
construct a consistent estimate, $\hat{\theta}$, of the variogram parameters, $\theta$. Hence the real
difficulty here is trying to obtain simultaneous estimates, $(\hat{\beta}, \hat{\theta})$, of these two sets of
parameters. In Schabenberger and Gotway (2005, p.257) this circular argument is aptly
described as the "cat and mouse game of Universal Kriging". While it is possible to
reformulate this entire estimation problem in terms of more general maximum-likelihood
methods,8 a more practical approach is simply to construct an iterative estimation
procedure in which each parameter vector is estimated given some current value of the
other. It is this procedure that we now develop in more detail.9

8
For further discussion of such methods, see Section 9.2.1 in Waller and Gotway (2004). Here it should
also be noted that a maximum-likelihood estimation approach of this type will be developed to estimate
spatial autoregressive models in Part III of this NOTEBOOK.
9
This procedure is also developed in Section 9.2.11 in Waller and Gotway (2004), where it is designated as
the Iteratively Re-Weighted Generalized Least Squares (IRWGLS) procedure. A less formal presentation of
the same idea is given in [BG], p.189.


Before doing so, it is important to emphasize that the type of spatial model developed
here has uses other than simply predicting values of Y at unobserved locations. A good
example is the California Rainfall study, already used to motivate the present class of
more general spatial trend functions. In this study, the main focus was on identifying
spatial attributes that are significant predictors of rainfall at each data location. While one
could also attempt to predict rainfall levels at locations not in the data set, this was not the
main objective. Hence it is useful to distinguish between two types of spatial applications
here. We begin with a general linear regression model as in (7.1.8), where it is now
assumed that the covariance matrix, V , is generated by an underlying covariogram with
parameter vector, $\theta$, in (7.3.1), which we now write explicitly as,

(7.3.2)  $Y = X\beta + \epsilon\, , \quad \epsilon \sim N[0, V(\theta)]$

This is of course precisely the type of model postulated for Universal Kriging above.
However, since the iterative estimation procedure developed below differs from the
implementation of Universal Kriging as developed in Section 7.2.3, it is convenient to
distinguish between these two models. Hence we now designate model (7.3.2) [together
with its iterative implementation developed below] as a Geostatistical Regression model.
In the California Rainfall example, such a model might well be used to incorporate
possible spatial dependencies between rainfall in cities close to one another. The
emphasis here is on estimating $\beta$ in a manner that will allow proper statistical inferences
to be drawn about each of its components. On the other hand, such a model might also be
used for prediction purposes. Hence when such geostatistical regression models are used
for spatial prediction, they will be designated as Geostatistical Kriging models.10

With these preliminary observations, we can now develop an implementation of both
these models. As in Section 7.2.3, we start with a given set of sample data,
$y = (y(s_i): i = 1,..,n)$ in $R$, where each $y_i$ is taken to be a realization of the corresponding
random variable, $Y(s_i)$, in a sample vector, $Y = [Y(s_i): i = 1,..,n]$.11 This sample vector,
$Y$, is now hypothesized to satisfy the generalized linear regression model in (7.3.2) with
attribute data, $X$, and covariance matrix, $V(\theta)$.

7.3.1 Iterative Estimation of $\beta$ and $\theta$

We first give an overview of the estimation procedure and then formalize its individual
steps. Every iterative estimation procedure must start with some initial value. Here, as
with Universal Kriging, the initialization used (step [1] below) is to estimate $\beta$ by OLS,
which we designate as $\hat{\beta}_0$. The residuals $\hat{\epsilon}_0$ generated by $\hat{\beta}_0$ are then used to obtain an

10
It should be noted that in other treatments, such as Schabenberger and Gotway (2005), all such
implementations are regarded simply as different ways of estimating the same “Universal Kriging model”.
However, for our purposes it seems best to avoid confusion by reserving the term “Universal Kriging” for
the implementation adopted in ARCMAP, as outlined in Section 7.2.3 above.
11
Note again that we here use Y for the full sample rather than Yn . The latter is only required when we need
to distinguish between the full sample and subsamples used for prediction at each location.


estimate, $\hat{\theta}_0$, of the spherical variogram parameters in (7.3.1). These are in turn used (in
steps [2] to [6] below) to obtain a GLS estimate, $\hat{\beta}_1$, of $\beta$ using the covariance matrix,
$V(\hat{\theta}_0)$. Up to this point, the implementation is identical with that in Section 7.2.3. But the
purpose of the present numbering of these estimators is to formalize a continuation of this
procedure. Here the residuals, $\hat{\epsilon}_1$, generated by $\hat{\beta}_1$ are next used (in step [7]) to obtain a
new estimate, $\hat{\theta}_1$, of the spherical variogram parameters. If the estimates $(\hat{\beta}_1, \hat{\theta}_1)$ are
deemed (as in steps [8] to [9] below) to be "sufficiently similar" to $(\hat{\beta}_0, \hat{\theta}_0)$, then the
estimation procedure terminates with these as final values. Otherwise it continues until
such values are found. With this overview, we now formalize these steps as follows:

[1] First construct an OLS estimate,

(7.3.3)  $\hat{\beta}_0 = (X'X)^{-1} X' y$

    of $\beta$ with corresponding residuals,

(7.3.4)  $\hat{\epsilon}_0 = y - X\hat{\beta}_0\,$.

[2] Use these residuals to estimate an empirical variogram, $\hat{\gamma}_0(h)$, at some set of
    selected distance values, $(h_i : i = 1,..,q)$.

[3] Next use this empirical variogram data, $(\hat{\gamma}_{0i}, h_i),\ i = 1,..,q$, to fit (by nonlinear least
    squares) a spherical variogram, $\gamma(h; \hat{\theta}_0)$, with parameter vector,

(7.3.4)  $\hat{\theta}_0 = (\hat{r}_0, \hat{s}_0, \hat{a}_0)\,$.

[4] Then use the identity, $C(h) = \sigma^2 - \gamma(h)$, to construct the corresponding spherical
    covariogram,

(7.3.5)  $\hat{C}_0(h) = \hat{s}_0 - \gamma(h; \hat{r}_0, \hat{s}_0, \hat{a}_0)$

    for all distances $h$.

[5] If the distance between each pair of data points, $s_i$ and $s_j$, is denoted by $h_{ij}$, then
    the covariance, $\sigma_{ij} = \mathrm{cov}(\epsilon_i, \epsilon_j)$, between the residuals at $s_i$ and $s_j$ is estimated by
    $\hat{\sigma}_{0ij} = \hat{C}_0(h_{ij})$ [where by definition, $\sigma_{ii} = \sigma^2 \Rightarrow \hat{\sigma}_{0ii} = \hat{\sigma}_0^2 = \hat{s}_0$], and the resulting
    estimate of the covariance matrix, $V(\theta) = \mathrm{cov}(\epsilon)$, between residuals at all data
    points $i = 1,..,n$ is given by12

(7.3.6)  $\hat{V}_0 = V(\hat{\theta}_0) = \begin{pmatrix} \hat{\sigma}_0^2 & \cdots & \hat{\sigma}_{01n} \\ \vdots & \ddots & \vdots \\ \hat{\sigma}_{0n1} & \cdots & \hat{\sigma}_0^2 \end{pmatrix}$

[6] Using this covariance matrix, now apply GLS to obtain a new estimate of $\beta$:

(7.3.7)  $\hat{\beta}_1 = (X'\hat{V}_0^{-1} X)^{-1} X'\hat{V}_0^{-1} y\, ,$

    with corresponding residuals,

(7.3.8)  $\hat{\epsilon}_1 = y - X\hat{\beta}_1$

[7] Then replace $\hat{\epsilon}_0$ by $\hat{\epsilon}_1$ and apply steps [2] and [3] to obtain a new spherical
    variogram, $\gamma(h; \hat{\theta}_1)$, with parameter vector,

(7.3.9)  $\hat{\theta}_1 = (\hat{r}_1, \hat{s}_1, \hat{a}_1)$

[8] At this point, one can check to see if there are any "significant" differences between
    the initial parameter estimates, $(\hat{\beta}_0, \hat{\theta}_0)$, and the new estimates, $(\hat{\beta}_1, \hat{\theta}_1)$. Here there
    are many possible criteria for checking such differences. If one is primarily interested
    in the $\beta$ parameters (as is typical in regression), the simplest approach is to focus on
    fractional changes in these estimates by letting13

(7.3.10)  $\Delta_1 = \max\left\{ \dfrac{|\hat{\beta}_{1j} - \hat{\beta}_{0j}|}{|\hat{\beta}_{0j}|} : j = 0,1,..,k \right\}$

    One may then choose an appropriate threshold value, $\delta$ (say $\delta = .001$), and define a
    significant change to be $\Delta_1 > \delta$. If one is also interested in the variogram parameters,
    $\theta = (r, s, a)$, then one may replace (7.3.10) by the broader set of fractional changes

(7.3.11)  $\Delta_1' = \max\left\{ \Delta_1,\ \dfrac{|\hat{r}_1 - \hat{r}_0|}{\hat{r}_0},\ \dfrac{|\hat{s}_1 - \hat{s}_0|}{\hat{s}_0},\ \dfrac{|\hat{a}_1 - \hat{a}_0|}{\hat{a}_0} \right\}$

12
Be careful not to confuse this initial estimate, $\hat{V}_0$, with the estimated sub-matrix of covariances, $\hat{V}_0$, used
to predict $Y(s_0)$ in previous sections.
13
For a possible modification of this simple criterion, see Schabenberger and Gotway (2005, p.259).

[9] If there is no significant change, i.e., if $\Delta_1 \le \delta$ (or $\Delta_1' \le \delta$), then stop the iterative
    estimation procedure and set the final parameter estimates to be

(7.3.12)  $(\hat{\beta}, \hat{\theta}) = (\hat{\beta}_1, \hat{\theta}_1)\,$.

[10] On the other hand, if $\Delta_1 > \delta$ (or $\Delta_1' > \delta$), then continue the iterative estimation
    procedure by replacing $\hat{\theta}_0$ with $\hat{\theta}_1$ in steps [4] through [7] to obtain a new $\beta$
    estimate,

(7.3.13)  $\hat{\beta}_2 = (X'\hat{V}_1^{-1} X)^{-1} X'\hat{V}_1^{-1} y$

    [based on the new covariance matrix, $\hat{V}_1 = V(\hat{\theta}_1)$], and new variogram
    parameter estimates

(7.3.14)  $\hat{\theta}_2 = (\hat{r}_2, \hat{s}_2, \hat{a}_2)$

    [based on the new residuals, $\hat{\epsilon}_2 = y - X\hat{\beta}_2$].

[11] With these new parameters, define $\Delta_2$ (or $\Delta_2'$) as in step [8]. If $\Delta_2 \le \delta$ (or $\Delta_2' \le \delta$),
    then stop the procedure and set the final parameter estimates to

(7.3.15)  $(\hat{\beta}, \hat{\theta}) = (\hat{\beta}_2, \hat{\theta}_2)\,$.

[12] On the other hand, if $\Delta_2 > \delta$ (or $\Delta_2' > \delta$), then continue the iterative estimation
    procedure by replacing $(\hat{\beta}_1, \hat{\theta}_1)$ with $(\hat{\beta}_2, \hat{\theta}_2)$ in steps [4] through [7].

[13] Continue in the same way until a set of parameters $(\hat{\beta}_m, \hat{\theta}_m)$ is found for which
    $\Delta_m \le \delta$ (or $\Delta_m' \le \delta$). Then stop the procedure and set the final estimates to

(7.3.16)  $(\hat{\beta}, \hat{\theta}) = (\hat{\beta}_m, \hat{\theta}_m)\,$.

    These final parameter estimates are said to be mutually consistent in the sense
    that the covariance matrix, $\hat{V} = V(\hat{\theta})$, will (approximately) reproduce $\hat{\beta}$ as,

(7.3.17)  $\hat{\beta} = (X'\hat{V}^{-1} X)^{-1} X'\hat{V}^{-1} y$


and similarly, that the residuals, $\hat{\epsilon} = y - X\hat{\beta}$, yield an empirical variogram, $\hat{\gamma}(h)$,
that will (approximately) reproduce the parameter estimates, $\hat{\theta} = (\hat{r}, \hat{s}, \hat{a})$, of the
spherical variogram yielding $\hat{C}$.

Here it should be emphasized that while this mutual consistency property is certainly
desirable from a conceptual viewpoint, there is no guarantee that any of the Best Linear
Unbiased estimation properties for GLS estimators will continue to hold for $\hat{\beta}$. Hence, as
discussed at the end of the implementation for Simple Kriging in Section 6.2.5 above,
these are often designated as Empirical GLS estimators.14
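
For readers who find pseudocode helpful, the overall loop structure of steps [1]-[13] can be
sketched in MATLAB as follows. This is only a schematic sketch: fit_spherical and build_cov
are hypothetical placeholders for the variogram-fitting and covariance-construction steps in
[2]-[5] (they are not actual course programs), and the convergence test uses the simple
criterion (7.3.10).

   b_old  = (X'*X) \ (X'*y);                    % [1] OLS start
   delta  = 0.001;  change = Inf;
   while change > delta
       e      = y - X*b_old;                    % current residuals
       theta  = fit_spherical(e, L);            % [2]-[3] fit (r,s,a) to the empirical variogram
       Vhat   = build_cov(theta, L);            % [4]-[5] covariance matrix from the covariogram
       b_new  = (X'*(Vhat\X)) \ (X'*(Vhat\y));  % [6] GLS update of beta
       change = max(abs(b_new - b_old) ./ abs(b_old));   % [8] fractional change, (7.3.10)
       b_old  = b_new;                          % [10]-[13] iterate to mutual consistency
   end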

7.3.2. Implementation of Geostatistical Regression (Geo-Regression)

Given the regression estimates, $\hat{\beta}$, one can use the parameter estimates, $\hat{\theta} = (\hat{r}, \hat{s}, \hat{a})$, to
construct the final covariogram as follows:

(7.3.18)  $\hat{C}(h) = \hat{s} - \gamma(h; \hat{r}, \hat{s}, \hat{a})$

This covariogram is in turn used to obtain a final estimate,

(7.3.19)  $\hat{V} = V(\hat{\theta}) = \begin{pmatrix} \hat{\sigma}^2 & \cdots & \hat{\sigma}_{1n} \\ \vdots & \ddots & \vdots \\ \hat{\sigma}_{n1} & \cdots & \hat{\sigma}^2 \end{pmatrix}$

of the residual covariance matrix, $V = V(\theta) = \mathrm{cov}(\epsilon)$ [mentioned above (7.3.17)].

To employ these estimates for inference about the components of $\beta$ in geo-regression
applications, one must estimate the covariance matrix of the estimator, $\hat{\beta}$, say
$\Sigma = \mathrm{cov}(\hat{\beta})$. Following standard GLS procedures, one can determine $\Sigma$ as follows. By
definition,

(7.3.20)  $\hat{\beta} = (X'V^{-1}X)^{-1} X'V^{-1}Y = (X'V^{-1}X)^{-1} X'V^{-1}(X\beta + \epsilon)$

          $\qquad = (X'V^{-1}X)^{-1}(X'V^{-1}X)\,\beta + (X'V^{-1}X)^{-1} X'V^{-1}\epsilon$

          $\qquad = \beta + (X'V^{-1}X)^{-1} X'V^{-1}\epsilon$

14
See for example the discussion in Waller and Gotway (2004, p.337).


But by the Linear Invariance Theorem for multi-normal random vectors [in (3.2.22)], it
then follows that $\hat{\beta}$ is multi-normally distributed with mean

(7.3.21)  $E(\hat{\beta}) = \beta + (X'V^{-1}X)^{-1} X'V^{-1}E(\epsilon) = \beta + (0) = \beta$

and covariance,

(7.3.22)  $\Sigma = \mathrm{cov}(\hat{\beta}) = \mathrm{cov}\big[(X'V^{-1}X)^{-1} X'V^{-1}\epsilon\big]$

          $\quad\ = (X'V^{-1}X)^{-1} X'V^{-1}\,\mathrm{cov}(\epsilon)\,V^{-1}X\,(X'V^{-1}X)^{-1}$

          $\quad\ = (X'V^{-1}X)^{-1} X'V^{-1}V\,V^{-1}X\,(X'V^{-1}X)^{-1}$

          $\quad\ = (X'V^{-1}X)^{-1}(X'V^{-1}X)(X'V^{-1}X)^{-1}$

          $\quad\ = (X'V^{-1}X)^{-1}$

Hence (7.3.19) yields the following estimate of $\Sigma$,

(7.3.24)  $\hat{\Sigma} = (X'\hat{V}^{-1}X)^{-1} = \begin{pmatrix} \hat{v}_{11} & \cdots & \hat{v}_{1k} \\ \vdots & \ddots & \vdots \\ \hat{v}_{k1} & \cdots & \hat{v}_{kk} \end{pmatrix}$

which in turn yields standard error estimates

(7.3.25)  $s_j = \sqrt{\hat{v}_{jj}}$

for each beta parameter estimate, $\hat{\beta}_j$, $j = 0,1,..,k$. These standard errors can then be used
to construct p-values for significance tests of these coefficients based on the t-ratios:

(7.3.26)  $t_j = \hat{\beta}_j / s_j\, , \quad j = 0,1,..,k$

Hence, standard tests of significance can be carried out in terms of these estimates.15 This
procedure is implemented in the MATLAB program, geo_regr.m, and will be illustrated
in Section 7.3.4 below.
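
For reference, the inference calculations in (7.3.24) through (7.3.26) amount to only a few
lines of MATLAB once mutually consistent estimates (bhat, Vhat) are in hand. The sketch
below is illustrative only (it is not the geo_regr.m code, and tcdf requires the Statistics
Toolbox); here n is the sample size, k the number of explanatory variables, and X includes
the leading column of ones.

   Sigma = inv(X'*(Vhat\X));              % estimated cov(bhat), (7.3.24)
   se    = sqrt(diag(Sigma));             % standard errors, (7.3.25)
   tval  = bhat ./ se;                    % t-ratios, (7.3.26)
   df    = n - (k+1);                     % degrees of freedom (see footnote 15)
   pval  = 2*(1 - tcdf(abs(tval), df));   % two-sided p-values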

7.3.3. Implementation of Geostatistical Kriging (Geo-Kriging)

Recall that Universal Kriging used a prior estimate of the variogram parameters based on
OLS residuals. But one can now improve this procedure by using the mutually consistent
estimates obtained above. In doing so, we must again distinguish between the full sample

15
As with OLS, $t_j$ is t-distributed with $n - (k+1)$ degrees of freedom under the null hypothesis, $\beta_j = 0$.
See also expressions (9.16) through (9.18) in Waller and Gotway (2004).


vector, $Y_n$, and the prediction sample vector, $Y$, used for predicting $Y_0 = Y(s_0)$ at a
selected site, $s_0 \in R$. So for convenience we now rewrite model (7.3.2) as:

(7.3.27)  $Y_n = X\beta + \epsilon\, , \quad \epsilon \sim N(0, V) = N[0, V(\theta)]$

to emphasize that this model refers to the full sample. Hence for the mutually consistent
estimates, $(\hat{\beta}, \hat{\theta}) = [\hat{\beta}, (\hat{r}, \hat{s}, \hat{a})]$, obtained from the iterative procedure above, the estimate,
$\hat{\beta}$, now yields the (full sample) GLS estimate, $\hat{\beta}_n$, in (7.2.28), and the estimated
covariance matrix, $V(\hat{\theta})$, yields the appropriate $\hat{V}$ matrix. So by mutual consistency, we
may write16

(7.3.28)  $\hat{\beta}_n = (X'\hat{V}^{-1}X)^{-1} X'\hat{V}^{-1} Y_n$

At this point, Steps 4 through 8 in the implementation of Universal Kriging can now be
carried out intact [where the prediction covariance estimates, $\hat{C}_0$, in (7.2.31) are again
assumed to be constructed using the variogram parameters, $\hat{\theta} = (\hat{r}, \hat{s}, \hat{a})$, from the
iterative estimation procedure].

In summary, while the iterative estimation procedure in Geo-Kriging is computationally
more intensive than that of Universal Kriging, the mutual consistency of all estimated
parameters should in principle yield more satisfactory spatial predictions. This procedure
is implemented in the MATLAB program, geo_krige.m, and will be illustrated briefly at
the end of Section 7.3.5 below.

7.3.4 Cobalt Example of Geo-Regression

As an illustration of geo-regression, a small rectangular region of Vancouver Island has
been selected in which Cobalt (Co) values appear to exhibit an interesting spatial trend, as
shown in Figure 7.2(a) below. Notice in particular that the highest values tend to be in the
northwest and southeast corners of this rectangle, while the lowest values tend to be in
the southwest and northeast corners. This suggests a "saddle" shape, as depicted in Figure
7.2(b) below. Such saddle shapes, known technically as hyperbolic paraboloids, are
instances of quadratic functions in the underlying coordinate variables, $s = (x, y)$. This
suggests that spatial trends in this data might be well fitted by a geo-regression with a
quadratic spatial trend function of the form,

(7.3.29)  $Co = \beta_0 + \beta_1 x + \beta_2 y + \beta_3 x y + \beta_4 x^2 + \beta_5 y^2 + \epsilon$

The Cobalt data for this example is in the JMP file, Cobalt_1.JMP. Before proceeding,
it is worth noticing from this data that the coordinate locations are in feet, so that

16
Here the equality in (7.3.28) is implicitly taken to be “approximately equal” in the sense defined by the
mutual consistency condition in the iterative estimation procedure above.



(a) Cobalt Data Map (b) Spatial Trend

Figure 7.2. Cobalt Data Example

their values are quite large. For example, the first point is $(x_1 = 651612,\ y_1 = 566520)$.
More importantly, when one forms a quadratic function, these values are squared in order
of magnitude. So for example the cross product term in (7.3.29) is $x_1 y_1 \approx 3.69 \times 10^{11}$.
Since the cobalt magnitudes are drastically smaller (in this case, $Co_1 = 36$), it should be
clear that some of the beta slope coefficients in (7.3.29) will be vanishingly small
(roughly of order $10^{-8}$). Such values are so close to zero that they are awkward to
analyze. More importantly, since the intercept is by definition a data vector of ones,
$1_n = (1,..,1)$, this column in the data matrix, $X$, is vanishingly small compared to other data
columns like $xy$. This can create numerical instabilities in the regression itself.17 So
before beginning the present analysis, it is advisable to rescale the coordinate data to a
more reasonable range. In the present case, we have divided all coordinate values by
10,000, so that terms like the cross product above now have more tractable values
($x_1 y_1 \approx 3691.5$). With these values, the OLS regression in (7.3.29) yields the following
results (where xx denotes $x^2$, and so on):

  Term        Estimate     Std Error   t Ratio   Prob>|t|
  Intercept   -10652.86    3026.992    -3.52     0.0006*
  x           278.31445    61.45749     4.53     <.0001*
  y           59.926559    63.53409     0.94     0.3469
  xy          -2.379182    0.407688    -5.84     <.0001*
  xx          -1.103166    0.426945    -2.58     0.0106*
  yy          0.8149638    0.493603     1.65     0.1006

  RSquare                        0.21032
  RSquare Adj                    0.187094
  Root Mean Square Error         8.213746
  Mean of Response               24.78409
  Observations (or Sum Wgts)     176

Table 7.3. Initial OLS Regression

17
Software such as JMP is usually sophisticated enough to employ internal rescaling procedures to avoid
such obvious instabilities. But this is not true of all regression software.


Notice that $y$ is not significant, and that $y^2$ is only weakly significant. But since there are
clear nonlinearities in the y direction, this suggests that the collinearity between $y$ and
$y^2$ in this region is masking the effect of $y^2$. If the insignificant $y$ variable is removed,
then one obtains the new regression shown below.

  Term        Estimate     Std Error   t Ratio   Prob>|t|
  Intercept   -8439.792    1911.869    -4.41     <.0001*
  x           262.9905     59.25208     4.44     <.0001*
  xy          -2.204434    0.363043    -6.07     <.0001*
  xx          -1.062146    0.424588    -2.50     0.0133*
  yy          1.2389119    0.203946     6.07     <.0001*

  RSquare                        0.206187
  RSquare Adj                    0.187618
  Root Mean Square Error         8.211095
  Mean of Response               24.78409
  Observations (or Sum Wgts)     176

Table 7.4. Final OLS Regression

Notice that $y^2$ is now very significant, and moreover, that the adjusted $R^2$ value has
increased by removing $y$. This is a clear indication that the present model is capturing
this spatial trend more accurately. Note finally that the coefficients on $x^2$ and $y^2$ have
opposite signs. This is a characteristic of hyperbolic paraboloids.18
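
To make the rescaling and trend construction concrete, the design matrix for (7.3.29) and its
OLS fit might be set up in MATLAB roughly as follows. This is a minimal sketch only (not the
actual course code), using the workspace variables y0 (cobalt values) and L0 (coordinate
locations) that are described below.

   xs = L0(:,1)/10000;   ys = L0(:,2)/10000;    % rescale coordinates (originally in feet)
   X  = [xs, ys, xs.*ys, xs.^2, ys.^2];         % quadratic trend terms: x, y, xy, xx, yy
   XI = [ones(size(X,1),1), X];                 % add the intercept column
   b  = XI \ y0;                                % OLS estimates, as in Table 7.3
   e  = y0 - XI*b;                              % OLS residuals for variogram estimation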

However, there still remains the question of possible spatial dependencies among the
unobserved residuals, $\epsilon$, in (7.3.29). We can check this in the usual way by regressing
these residuals on their nearest-neighbor residuals. The results of this regression are shown
below:
[Scatterplot of OLS residuals versus nearest-neighbor residuals (nn_res)]

  Term        Estimate     Std Error   t Ratio   Prob>|t|
  Intercept   -0.128734    0.59103     -0.22     0.8278
  nn_res      0.2628387    0.069852     3.76     0.0002*

Figure 7.3. OLS Residual Analysis

Here it is clear that there does indeed exist significant spatial dependency among these
residuals. As discussed in Section 7.1.3, this can in turn inflate the significance levels

18
See for example http://mathworld.wolfram.com/HyperbolicParaboloid.html.


obtained in Table 7.4. So this motivates an extended analysis using geo-regression to


account for these dependencies.

To do so, this cobalt data has been transported to MATLAB, and is found in the
workspace, Cobalt_1.mat. Here the 176 locations are stored in the matrix, L0, with
corresponding cobalt values in y0 and data [x, xy, xx, yy] in the matrix, X0. The geo-
regression is run with the command,

>> OUT = geo_regr(y0,X0,L0,vnames);

where vnames contains the variable names, and is constructed by the command:

>> vnames = strvcat('X','XY','XX','YY');

The actual regression portion of the screen output for this iterative estimation procedure
is as follows:

FINAL REGRESSION RESULTS:

VAR COEFF T-VAL PROB


const -6848.808565 -1.877835 0.062106
X 212.895520 1.881204 0.061644
XY -1.813464 -2.509593 0.013017
XX -0.843514 -1.028343 0.305241
YY 1.021249 2.519087 0.012683

Table 7.5. Regression Output of Geo_Regr

Notice first that the basic signs of all beta coefficients are the same, so that this new spatial
trend is again a "saddle" shape. In fact this is precisely the saddle shape plotted in Figure
7.2(b) above. But the main thing to notice is that all variables are now less significant
than they were under OLS. In particular, $x^2$ is no longer even weakly significant.
However, the relative ordering among the p-values (as seen more clearly from the
absolute t-values) is essentially the same. So there appears to have been a fairly uniform
deflation of the significance levels obtained under OLS. While this will certainly not always be
true, in the present case it suggests that spatial dependencies in these OLS residuals are
relatively isotropic (i.e., the same in the x and y directions), and hence are consistent with
the covariance stationarity assumption underlying geo-regression.

Before interpreting these results, it is important to check to see whether this geo-
regression has in fact removed the spatial dependencies among residuals. Here it is
important to stress that this cannot be done by simply examining the residuals of the geo-


regression. Indeed these residuals exhibit precisely the spatial covariance structure
estimated by the geo-regression as displayed in Figure 7.4 below:
[Covariogram plot of the geo-regression residuals. Spherical variogram-covariogram
parameters: Range = 16265.94, Sill = 73.25, Nugget = 41.21]

Figure 7.4. Covariogram Estimate

So the task remaining is to remove this estimated spatial covariance structure and
determine whether any spatial dependencies remain. This can be accomplished by
recalling that every GLS model can be reduced to an equivalent OLS model by the
Cholesky procedure in (7.1.15) through (7.1.20) above. By way of review, let us now
write the appropriate GLS model as

(7.3.30)  $Y = X\beta + \epsilon\, , \quad \epsilon \sim N(0, V)$

where in this case, $Y$ is the random vector of $n = 176$ cobalt levels, $X$ is the $(n \times 4)$
matrix of coordinate variables (labeled as X0 above), and $\epsilon$ is the spatially dependent
residual vector with unknown covariance matrix, $V$. As in (7.1.15), if $T$ denotes the
Cholesky matrix for $V$, so that $V = TT'$, then as in (7.1.16) and (7.1.17), if we multiply
both sides of (7.3.30) by $T^{-1}$, and let $Y_T = T^{-1}Y$, $X_T = T^{-1}X$, and $\epsilon_T = T^{-1}\epsilon$, then we
obtain a new linear model,

(7.3.31)  $Y_T = X_T\beta + \epsilon_T\, , \quad \epsilon_T \sim N(0, V_T)$

where $\beta$ is exactly the same as in (7.3.30), but where the argument in (7.1.19) now
shows that the covariance matrix, $V_T$, is simply the identity matrix, i.e.,

(7.3.32)  $V_T = T^{-1}V(T^{-1})' = T^{-1}(TT')(T')^{-1} = I_n$

In particular, this implies that the components of the transformed residual vector, $\epsilon_T$, are
independent. Of course, the true covariance matrix, $V$, and its Cholesky matrix, $T$, are
unknown. But if the geo-regression above was successful, then the covariogram estimate
in Figure 7.4 should generate a reasonably good estimate, $\hat{V}$, of this covariance matrix
[by the same procedure as in (7.2.25) through (7.2.27) above]. If so, then by letting $\hat{T}$


denote the Cholesky matrix for $\hat{V}$, we can use this to transform the given data into an
OLS regression problem. In particular, if $[y, X]$ denotes the given cobalt and coordinate
data (represented by [y0, X0] above), then the transformed data for the present case is
given by,

(7.3.33)  $\hat{y}_T = \hat{T}^{-1} y\, , \quad \hat{X}_T = \hat{T}^{-1} X$

Hence if the geo-regression above was successful, then this data should yield an OLS
regression with approximately independent residuals. This can be checked by the
nearest-neighbor regression procedure above, and provides a useful diagnostic for geo-
regression. To do so, the transformed data in (7.3.33) is saved as part of the output of
geo-regression. By examining the program description of geo_regr.m, it can be seen that
the fifth component, OUT{5}, of the output cell structure, OUT, contains precisely the
matrix $[\hat{y}_T, \hat{X}_T]$. This can be imported to JMP and run as a regression. In doing so, it is
important to note that the first column of the data matrix, $X$, in (7.3.30) is necessarily the
unit vector, $1_n$, corresponding to the intercept coefficient, $\beta_0$. But in $\hat{X}_T$ this is
transformed to the vector, $\hat{T}^{-1} 1_n$, which is not a unit vector. So if this regression were run
in JMP without modification, then JMP would add a unit vector which is not present in
(7.3.30). This means that JMP must be run using the "No Intercept" option (at the bottom
of the Fit Model window).19 The results of this no-intercept regression must produce
exactly the same beta estimates as the geo-regression output above (except for possible
rounding errors in transporting the data from MATLAB). So this in itself is a good check
to be sure that the data has been transported properly. The results of this nearest-neighbor
residual regression are shown below:

[Bivariate fit of the transformed residuals (Residual Co*) against their nearest-neighbor
residuals (nn_res*)]

  Term        Estimate     Std Error   t Ratio   Prob>|t|
  Intercept   0.0334021    0.074549     0.45     0.6547
  nn_res*     -0.042016    0.073134    -0.57     0.5664

Figure 7.5. Transformed Residual Analysis

19
We shall see this option used again in Section 4.1.1 of Part III.


Here it should be clear that the geo-regression above has indeed been successful in
removing any trace of spatial dependencies among residuals. However, there is one
additional check that is worth mentioning. Notice in (7.3.32) that these transformed
residuals are not only independent, but in fact all have unit variance ($\sigma^2 = 1$), so that the
associated standard deviation is also one ($\sigma = 1$). This means that the estimated standard
deviation, $\hat{\sigma}$, known as "Root Mean Squared Error", should be close to one. This value is
reported in the regression output right under the adjusted $R^2$. In the present case,
$\hat{\sigma} = 0.995$, which provides additional support for the success of this geo-regression.

By way of summary, this cobalt example provides a simple illustration of the use of geo-
regression. Here the objective has been simply to capture the overall shape of spatial
trends in this data. (A more substantive example will be given in the next section.) But
aside from the geo-regression procedure itself, this example serves to illustrate a number
of more general issues that are common to all spatial regressions. First notice from the
initial OLS regression itself that this spatial trend captures less than 20% of the overall
variation in this cobalt data (with an adjusted $R^2$ of 0.188). So even though a visual
inspection of Figure 7.2(a) suggests an overall "saddle" shape for these trends, the
present quadratic specification is at best only a rough approximation. Thus for purposes
of spatial prediction, it is vital that the residual structure be modeled in a careful way.
This is a further motivation for techniques like geo-regression.

From an even more general perspective, this example illustrates the fundamental problem
of separating "trends" from "residuals". To what extent is the spatial pattern of cobalt
values in Figure 7.2(a) the result of some underlying trend, or simply the result of
correlations between cobalt values at nearby locations? If one were able to examine many
“replications” of the underlying spatial process, then such separation would be a
relatively simple matter. Indeed, if most replications produced similar “saddle-like”
patterns, then this would suggest the presence of a dominant spatial trend along the lines
that we have modeled. On the other hand, if such replications produced a wide variety of
similarly correlated patterns (including “mountains” and “valleys” as well as “saddles’),
then this would suggest the presence of a dominant covariance stationary process,
possibly even with a constant mean (as postulated in Ordinary Kriging for example). But
since direct replications are not possible, the best that one can do is to be aware of these
problems, and to treat all model specifications with some degree of suspicion. To
paraphrase the famous remark of George Box,20 “all models are wrong, but some are
more useful than others”.

7.3.5 Venice Example of Geo-Regression and Geo-Kriging

The following example of geo-regression is more substantive in nature, and is based on


the “Ground Water in Venice” data from [BG, pp.147-148]. This data set originally
appeared in the two-part article by Gambolati and Volpi (1979) [which is included as

20
See for example http://en.wikipedia.org/wiki/George_E._P._Box.


References 7 and 8 in the class reference material].21 The area around Venice Island in
Italy is shown (schematically) in Figure 7.5 below.

[Map of the Venice Lagoon area showing the well sites (dots), the Industrial Area (labeled
INDUSTRY) to the west, and Venice Island (labeled VENICE); scale bar: 0-5 miles]

Figure 7.5. Venice Island and Lagoon

Venice Island (shown in red) lies in a shallow lagoon, and has been slowly sinking for
many decades. In 1973 there was a suspicion that the Puerto Marghera industrial area to
the west of Venice was contributing to this rate of sinking. The reason for this suspicion
can be seen from the schematic depiction of the groundwater structure underlying the
Venice Lagoon shown in Figure 7.6 below.

[Schematic cross-section from the Industrial Area through Venice to the Lido, showing
alternating layers of aquifers and aquitards]

Figure 7.6. Venice Aquifer System

21
This paper also contains an excellent overview of Kriging methods, as well as the groundwater problem
in Venice.


Here the blue bands denote porous water-filled layers of soil called aquifers that are
separated by denser layers called aquitards. Industry consumes water by drilling wells
into the aquifer layers (as depicted by the red shaft in the figure). This lowered the level
of the water table, potentially contributing to the sinking of Venice. Thus the question in
1973 was whether or not this industrial draw-down of water was a significant factor in
the sinking of Venice.

Geo-Regression Model

To study this question, data was gathered on water table levels, $L_i$, from 40 bore hole
sites, $i = 1,..,40$, in existing wells throughout the Venice Lagoon area (shown by the dots
in Figure 7.5 above, with colors ranging from red to blue denoting higher to lower levels).
[This data, along with the coordinate locations of well sites, can be found in the ($40 \times 3$)
matrix, venice, in the workspace, venice.mat.] The objective of this study was to identify
the key factors influencing these water table levels by applying geo-regression methods.
Here it was hypothesized that the key factors influencing the water table level, $L(s)$, at
any location, $s = (s_1, s_2)$, were the elevation, $Ev(s)$, above sea level at $s$, together with
local draw-down effects both from industry, $D_I(s)$, and from local water consumption,
$D_V(s)$, in Venice itself. To model $D_I$, a convenient coordinate system was chosen, with
origin centered in the Industrial Area as shown in Figure 7.7 below.

[Rotated coordinate axes $(c_1, c_2)$ with origin centered in the Industrial Area; scale bar:
0-5 miles]

Figure 7.7. Spatial Coordinates for Analysis

For later use, we now record this coordinate transformation as follows:

(7.3.34)  $c_1 = c_1(s) = (0.01)\,[\,0.873\,(s_1 - 418) - 0.488\,(s_2 - 458)\,]$
          $c_2 = c_2(s) = (0.01)\,[\,0.488\,(s_1 - 418) + 0.873\,(s_2 - 458)\,]$


The orientation of these axes is designed to simplify the model representation of both
elevation and industrial draw-down effects. Starting with the Industrial draw-down
function, $D_I$, this can be essentially approximated by a decreasing function with elliptical
contours centered on the axes. The present equation used is the following:22

(7.3.35)  $D_I(s) = D_I[c_1(s), c_2(s)] = \exp\left[-\left((1.5)\,c_1^2 + c_2^2\right)\right]$

A similar draw-down function, $D_V$, was constructed for Venice Island and has the
following form:

(7.3.36)  $D_V(s) = D_V(s_1, s_2) = \exp\left\{-\left[\sqrt{(s_1 - 560)^2 + (s_2 - 390)^2}\,\big/\,35\right]^{8}\right\}$

Here the large exponent, $(\cdot)^8$, is designed to drive this function to zero outside of Venice
Island, where local water consumption has little effect. The procedure for calculating
these functions (as well as the elevation function below) can be found in the MATLAB
script, venice_funcs.m. The resulting contours of these two functions are shown in
Figures 7.8 and 7.9 below.


Figure 7.8. Industry Draw-Down Figure 7.9. Venice Draw-Down

As mentioned above, there is a third effect that cannot be overlooked, namely elevation.
Though detailed data on elevation was not available in this data set, the elevation
contours are roughly parallel to the c2 axis in Figure 7.7, and increase in elevation more

22
The actual functions used in Gambolati and Volpi (1979) are based on more complex hydrological
models. So the present simplified functions are for illustrative purposes only.


rapidly to the west. So the following simple (local) approximation to elevation, $Ev(s)$, at
locations, $s$, was adopted,23

(7.3.37)  $Ev(s) = Ev[c_1(s)] = 10\exp(-c_1)$

If the data sites (well locations) are denoted by $s_i = (s_{i1}, s_{i2})$, $i = 1,..,40$, and if the
computed values of the above functions at these locations are denoted by
$(D_{Ii}, D_{Vi}, Ev_i) = [D_I(s_i), D_V(s_i), Ev(s_i)]$, $i = 1,..,40$, then these values can now serve as the
explanatory variables in a linear regression model of this water table data as follows:

(7.3.38)  $L_i = \beta_0 + \beta_I D_{Ii} + \beta_V D_{Vi} + \beta_{Ev} Ev_i + \epsilon_i\, , \quad i = 1,..,40.$

As with the Cobalt example above, this model was run using both OLS and the iterative
Geostatistical Regression Procedure implemented in geo_regr.m, with the command

>> geo_regr(y0,X0,L0,vnames);

where y0 is the $L$ data, X0 the computed $(D_I, D_V, Ev)$ data, and L0 the coordinate data at
each of the 40 well sites. A comparison of the parameter estimates and significance levels
is shown in Tables 7.6 and 7.7 below:

  VAR      COEFF       T-RATIO     PROB
  const    -1.13394    -3.17757    0.003045
  Elev     0.016364    6.673262    < 0.000001
  Indus    -6.54763    -8.47941    < 0.000001
  Venice   -1.79037    -2.3946     0.021968

  Table 7.6. OLS Estimates

  VAR      COEFF       T-RATIO     PROB
  const    -1.11526    -2.41109    0.021134
  Elev     0.020487    6.014161    0.000001
  Indus    -7.34398    -6.00136    0.000001
  Venice   -2.34154    -3.13431    0.003419

  Table 7.7. Geo-Regression Estimates

Note that as in the Cobalt case above, the signs of all coefficients are consistent in both
procedures, but the t-ratios are generally lower (in absolute magnitude) for GLS. Notice
however that the Venice drawdown effect provides an exception to this rule, and shows
that significance levels need not always be higher for OLS. As a final consistency check,
note that the signs of these coefficients are as expected, namely that mean water table
levels rise with higher elevations and that greater levels of water drawdown lower the
mean water table level.

Before analyzing the consequences of these results, it is important to determine whether


spatial correlation effects have been removed by this geo-regression procedure. Rather

23
This approximation produces a maximum elevation of about 30 meters at the western edge of the
Industrial Area, where the water table level is about 7 meters.


than repeat the nearest-neighbor residual analysis done for the Cobalt case, it is of interest
to consider a different approach here. In particular, one can compare the (spherical)
covariogram for the original OLS residuals with that of the residuals from the final
transformed model in expressions (7.3.31) and (7.3.32) above. If the procedure has been
successful, then the final covariogram should be much closer to pure independence. But it
is important to note here that since the transformed data is quite different from that of the
original model, there is a problem in comparing these residual covariograms directly. In
fact, this provides us with an important case where it is more appropriate to compare the
correlograms derived from these covariograms, as defined in expression (3.3.13) above.
These correlograms are free from any dimensional restrictions, and hence are directly
comparable. In particular, since $\rho(0) = 1$ for all correlograms, their scales must be
identical. This allows one to focus entirely on their relative shapes. In the present case,
the original correlograms and final correlograms of the transformed data are shown in
Figures 7.10 and 7.11, respectively. Notice first that in the original correlogram the
relative nugget effect (defined in Section 4.5 above) is zero, indicating that this process
exhibits no spatial independence whatsoever. In contrast, the relative nugget effect in the
final correlogram is close to one, indicating that the process is now almost completely
spatially independent. In other words, very little spatial correlation remains in this
transformed data. Notice also that the fluctuation of nonzero correlation values is much
smaller, indicating that spatial correlations are uniformly closer to zero at all scales.24
These two observations provide convincing evidence that this geo-regression has indeed
been successful in accounting for almost all spatial correlation in the original OLS model.


Figure 7.10. Original Correlogram Figure 7.11. Final Correlogram

Impact Analysis of Industrial Water Drawdown

Given these preliminary findings, the main purpose of this model is to analyze the
impacts of industrial water drawdown effects on the water table level in Venice. To
estimate this impact, observe first from the geo-regression results above, that we can

24
This is due in part to the larger bin sizes used in this figure (50 rather than 30 points per bin).


obtain an upper 95% confidence bound on the beta coefficient, $\beta_I$, for $D_I$ in model
(7.3.38) as follows. First note that if the standard error of $\hat{\beta}_I$ is denoted by $s_I$, then for
any level of significance, $\alpha$, the $100(1-\alpha)\%$ upper confidence bound for $\beta_I$ can be
obtained from the probability identity,

(7.3.39)  $\Pr\big(\,\beta_I \le \hat{\beta}_I + t_{\alpha,\,n-(k+1)}\, s_I\,\big) = 1 - \alpha$

where $t_{\alpha,\,n-(k+1)}$ is the t-critical value at level $\alpha$ for degrees of freedom, $n - (k+1)$ [where
$n$ = sample size and $k$ = number of explanatory variables]. To obtain the desired standard
error, recall that by definition the t-ratio, $t_I$, for $\hat{\beta}_I$ in Table 7.7 is given by $t_I = \hat{\beta}_I / s_I$,
so that by Table 7.7,

(7.3.40)  $s_I = \hat{\beta}_I / t_I = (-7.34398)/(-6.00136) = 1.2237$

Hence noting that in our case, $n = 40$ and $k = 3$ [so that $n - (k+1) = 40 - 4 = 36$], the
desired upper 95% confidence bound is given by

(7.3.41)  $\beta_I \le \hat{\beta}_I + t_{.05,36}\, s_I = -7.34398 + (1.6883)(1.2237) = -5.278$

Next observe that for the representative location, $s = (s_1, s_2) = (555, 390)$, in the middle of
Venice Island (shown by the red dot in Figure 7.9 above), the transformed coordinates in
(7.3.34) are seen to be $(c_1, c_2) = (1.528,\ 0.075)$, so that the value of the Industrial
drawdown in (7.3.35) is given by:

(7.3.42)  $D_I(s) = \exp\left[-\left((1.5)\,c_1^2 + c_2^2\right)\right] = 0.02998$

Thus, for each additional meter of Industrial water drawdown, one can be 95% confident
that the expected decrease in the water table level at location $s$ will be bounded
below by

(7.3.43)  $D_I(s)\,\cdot\,|{-5.278}| = (0.02998)(5.278) = 0.1582$ meters

Thus, based on the above model, one can be 95% confident that the mean industrial
drawdown effect on Venice Island is at least 15%.
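
The arithmetic in (7.3.39) through (7.3.43) is easily reproduced in MATLAB from the values
reported in Table 7.7 (a sketch only; tinv requires the Statistics Toolbox, and DI0 is the
draw-down value from (7.3.42)):

   b_I  = -7.34398;   t_I = -6.00136;        % geo-regression estimate and t-ratio for D_I
   s_I  = b_I / t_I;                         % implied standard error, (7.3.40)
   ub   = b_I + tinv(0.95, 36) * s_I;        % upper 95% bound on beta_I, (7.3.41)
   DI0  = 0.02998;                           % industrial draw-down at the center of Venice
   drop = DI0 * abs(ub);                     % lower bound on expected decrease, (7.3.43)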

While this model is only a rough approximation to the analysis of Gambolati and Volpi
(1979),25 it serves to illustrate how geo-regression can actually be used to address
substantive spatial issues. According to these authors, water pumping in Puerto Marghera

25
Aside from their more elaborate drawdown functions, Gambolati and Volpi also used a universal kriging
approach rather than our present application of geo-regression.


was in fact reduced by 60% after 1973, and their subsequent analysis of 1977 data
showed that the “subsurface flow field had substantially recovered, and the land
settlement had been arrested”. So their post-analysis confirmed that this industrial water
drawdown was indeed a major contributing factor to the sinking of Venice. Of course, in
more recent times, Venice has once again started to sink from more natural causes. But
this is another story.

An Application of Geo-Kriging

Finally it is of interest to apply geo-kriging to the Venice data as an illustration of this


technique. To do so, a grid was constructed using grid_form.m in MATLAB with
specified values

(7.3.44)  s1 = [150 : 25 : 900]
          s2 = [200 : 25 : 650]

(where the cell size, 25, is roughly a third of a mile in terms of Figure 7.5). This grid was
then used as input to the program, geo_krige.m, with the command

>> OUT = geo_krige(y0,X0,L0,X1,L1,h);

where (y0,X0,L0) is the same as for geo_regr above, and where (X1,L1) are the
computed values of $(D_I, D_V, Ev)$ and coordinate values at each of the 589 grid points
from (7.3.44). Finally, the bandwidth used was h = 50 (around two thirds of a mile).

To visualize these results, it is convenient to compare the geo-kriging output values,
$\hat{Y}(s)$, with the geo-regression estimates, $\hat{L}(s)$, of expected water table levels based on
the results in Table 7.7, where by definition,

(7.3.45)  $\hat{L}(s) = \hat{\beta}_0 + \hat{\beta}_I D_I(s) + \hat{\beta}_V D_V(s) + \hat{\beta}_{Ev} Ev(s)$

for all locations, $s$. These results were constructed from the above output as follows:

>> b = OUT{1}(:,1);           % Extract beta estimates from output

>> X = [ones(589,1),X1];      % Construct regression matrix (with intercept)

>> L_hat = X*b;               % Compute estimates (L_hat) of expected L values

>> Y_hat = OUT{3};            % Extract kriged values from output

>> StdErr = OUT{4};           % Extract standard error values from output


This data was then collected into a single data matrix:

>> DAT = [L1, L_hat, Y_hat, StdErr];

and exported from MATLAB to ARCMAP. These values were then interpolated using
the Spline option in ArcToolbox:

Spatial Analyst Tools > Interpolation > Spline

and finally converted to contour form by applying

Spatial Analyst Tools > Surface > Contour

to the spline rasters. A comparison of the fitted values, $\hat{L}$, and kriged values, $\hat{Y}$, is shown
in Figures 7.12 and 7.13 below:

Figure 7.12. Geo-Regression $\hat{L}$ Values      Figure 7.13. Geo-Kriging $\hat{Y}$ Values

Notice that the $\hat{L}$ values are essentially a weighted combination of the drawdown effects,
$D_I$ and $D_V$, in Figures 7.8 and 7.9 respectively (as captured by their values at the 40 well-
site data points). The kriged values, $\hat{Y}$, also reflect these underlying drawdown effects,
but to a lesser extent. By construction, these values also include stochastic interpolations
of the regression residuals, and thus should reflect water table levels more accurately than
the simpler regression predictions. Note however that alternative models of drawdown
functions and fitting procedures will of course produce somewhat different results, as can
be seen by comparing Figure 7.13 with Figure 5.21(a) in [BG, p.199] and Figure 2(a) in
Part 2 of Gambolati and Volpi (1979, p.292).

Finally, the main advantage of this stochastic interpolation procedure is that it allows
prediction intervals to be constructed for actual water table levels in terms of estimated


standard errors of prediction. A plot of these standard errors around Venice Island is
shown in Figure 7.14 below (with the 0.4 and 0.7 contours labeled to indicate
representative values). Here a much finer grid of kriging locations was used (with
increments of about a tenth of a mile) in order to show the details of these standard error
contours.


Figure 7.14. Kriging Standard Errors

Notice in particular that these standard errors fall to zero at each of the five data points
(well sites) on Venice Island [in a manner similar to Figure 2(b) in Part 2 of Gambolati
and Volpi (1979), though Venice Island itself is rather difficult to see in their figure]. This
reflects the fact that geo-kriging (along with simple and ordinary kriging) is an exact
interpolator that goes through every data point. This can be seen most easily from
expression (7.2.17) above, together with the fact that if point $s_0$ is actually a data point,
then it must always be a member of its own prediction set, $S(s_0)$, and hence must
correspond to one of the elements of the covariance matrix, $V_0$. But since
$V_0 V_0^{-1} = I_{n_0} = (e_i : i = 1,..,n_0)$, it follows that if $c_0$ is the $i$th column of $V_0$, then $c_0' V_0^{-1} = e_i'$,
so that (7.2.17) becomes:

(7.3.46)  $\hat{Y}(s_0) = x_0'\hat{\beta}_{n_0} + c_0' V_0^{-1}(Y - X_0\hat{\beta}_{n_0})$

          $\qquad\ \ = x_0'\hat{\beta}_{n_0} + e_i'(Y - X_0\hat{\beta}_{n_0})$

          $\qquad\ \ = x_0'\hat{\beta}_{n_0} + [\,Y(s_0) - x_0'\hat{\beta}_{n_0}\,] = Y(s_0)$

This same argument also shows that the kriging standard error in (7.2.22) is identically
zero.
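
This reduction of the kriging weights to a unit vector is easy to verify numerically. The
following MATLAB fragment (purely illustrative, with an arbitrary covariance matrix) checks
that $c_0' V_0^{-1} = e_i'$ when $c_0$ is taken to be the $i$th column of $V_0$:

   V0 = [4 1 0.5; 1 3 0.2; 0.5 0.2 2];     % any symmetric positive-definite example
   i  = 2;
   c0 = V0(:, i);                          % covariances when s0 coincides with data point i
   w  = c0' / V0;                          % c0' * inv(V0)
   disp(w)                                 % prints (up to rounding) the unit vector [0 1 0]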

Finally, it is of interest to consider the kriged values on Venice Island. Though the
specific kriging contour values are not shown in Figure 7.13, these values yield water


table predictions of around $\hat{Y}(s_0) \approx -3.0$ for points, $s_0$, on Venice Island (i.e., about 3
meters below sea level). Moreover, while not all standard error contours are shown in
Figure 7.14, the 0.7 contour is roughly the average value, so that $\hat{\sigma}_0 = \hat{\sigma}(s_0) \approx 0.7$. Thus a
typical prediction interval for points $s_0$ on Venice is about

(7.3.47)  $\hat{Y}(s_0) \pm (1.96)\,\hat{\sigma}_0 \approx -3 \pm 1.4$ meters

While such intervals are not extremely sharp, one must take into account the fact that
only 5 of the 40 data points are actually on Venice Island. So this is probably about the
best that can be expected from such a small data set.


APPENDIX TO PART II
This Appendix, designated as A2, contains additional analytical results for Part II of the
NOTEBOOK, and follows the notational conventions in Appendix A1.

A2.1. Covariograms for Sums of Independent Spatial Processes

First recall that the covariance of any random variables, $Z_1$ and $Z_2$, with respective means, $\mu_1$ and $\mu_2$, is given by

(A2.1.1)  $\mathrm{cov}(Z_1, Z_2) \;=\; E[(Z_1 - \mu_1)(Z_2 - \mu_2)] \;=\; E(Z_1 Z_2 - Z_1\mu_2 - \mu_1 Z_2 + \mu_1\mu_2)$
          $\;=\; E(Z_1 Z_2) - E(Z_1)\mu_2 - \mu_1 E(Z_2) + \mu_1\mu_2$
          $\;=\; E(Z_1 Z_2) - \mu_1\mu_2 - \mu_1\mu_2 + \mu_1\mu_2$
          $\;=\; E(Z_1 Z_2) - \mu_1\mu_2$

so that if $Z_1$ and $Z_2$ are independent then

(A2.1.2)  $E(Z_1 Z_2) \;=\; E(Z_1)E(Z_2) \;=\; \mu_1\mu_2 \;\;\Rightarrow\;\; \mathrm{cov}(Z_1, Z_2) \;=\; 0$

Hence if a given covariance stationary stochastic process, $\{Y(s) : s \in R\}$, with mean, $\mu$, is the sum of two independent covariance stationary components

(A2.1.3)  $Y(s) \;=\; Y_1(s) + Y_2(s)\,,\quad s \in R$,

with respective means, $\mu_1$ and $\mu_2$, then it follows by definition that $\mu = \mu_1 + \mu_2$, and that $Y_1(s)$ and $Y_2(v)$ are independent for all $s, v \in R$. Hence for any $h \geq 0$ and $s, v \in R$ with $\|s - v\| = h$, we see that the covariogram, $C$, of the $Y$-process must satisfy,

(A2.1.4)  $C(h) \;=\; \mathrm{cov}[Y(s), Y(v)]$
          $\;=\; E[Y(s)\,Y(v)] - E[Y(s)]\,E[Y(v)] \;=\; E[Y(s)\,Y(v)] - \mu^2$
          $\;=\; E\big[\,(Y_1(s) + Y_2(s))\,(Y_1(v) + Y_2(v))\,\big] - (\mu_1 + \mu_2)^2$
          $\;=\; E[\,Y_1(s)Y_1(v) + Y_1(s)Y_2(v) + Y_2(s)Y_1(v) + Y_2(s)Y_2(v)\,] - (\mu_1^2 + 2\mu_1\mu_2 + \mu_2^2)$
          $\;=\; E[Y_1(s)Y_1(v)] + E[Y_1(s)]E[Y_2(v)] + E[Y_2(s)]E[Y_1(v)] + E[Y_2(s)Y_2(v)] - (\mu_1^2 + \mu_1\mu_2 + \mu_2\mu_1 + \mu_2^2)$
          $\;=\; E[Y_1(s)Y_1(v)] + \mu_1\mu_2 + \mu_2\mu_1 + E[Y_2(s)Y_2(v)] - (\mu_1^2 + \mu_1\mu_2 + \mu_2\mu_1 + \mu_2^2)$
          $\;=\; \big(E[Y_1(s)Y_1(v)] - \mu_1^2\big) + \big(E[Y_2(s)Y_2(v)] - \mu_2^2\big)$
          $\;=\; \mathrm{cov}[Y_1(s), Y_1(v)] + \mathrm{cov}[Y_2(s), Y_2(v)]$
          $\;=\; C_1(h) + C_2(h)$

where $C_1$ and $C_2$ are the respective covariograms for the $Y_1$ and $Y_2$ components of $Y$.
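As an informal numerical illustration (not part of the original derivation), one can simulate two independent stationary components on a line transect and check that the empirical lag-$h$ covariance of their sum is approximately the sum of the two component covariances. The AR(1) construction and parameter values below are purely illustrative.

% Minimal Monte Carlo check of C(h) = C1(h) + C2(h) for independent components.
n = 1000;  h = 3;                     % series length and lag (in grid units)
rho1 = 0.8;  rho2 = 0.5;              % illustrative AR(1) coefficients
reps = 200;  c = zeros(reps,3);
for r = 1:reps
    Y1 = filter(1, [1 -rho1], randn(n,1));   % stationary AR(1) component 1
    Y2 = filter(1, [1 -rho2], randn(n,1));   % stationary AR(1) component 2
    Y  = Y1 + Y2;
    cfun = @(Z) mean((Z(1:end-h)-mean(Z)).*(Z(1+h:end)-mean(Z)));
    c(r,:) = [cfun(Y), cfun(Y1), cfun(Y2)];  % empirical lag-h covariances
end
disp([mean(c(:,1)), mean(c(:,2)) + mean(c(:,3))])   % these two numbers should be close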

A2.2. Expectation of the Sample Covariance Estimator under Spatial Dependence


Given any collection of $2n$ jointly distributed random variables, $\{(Y_{1i}, Y_{2i}),\; i = 1,..,n\}$, where the pairs $(Y_{1i}, Y_{2i})$ have common means $E(Y_{1i}) = \mu_1$, $E(Y_{2i}) = \mu_2$ and covariance $\mathrm{cov}(Y_{1i}, Y_{2i}) = \sigma_{12}$ for all $i = 1,..,n$, consider the following estimator of $\sigma_{12}$,

(A2.2.1)  $\hat{\sigma}_{12} \;=\; \frac{1}{n-1}\sum_{i=1}^{n} (Y_{1i} - \bar{Y}_1)(Y_{2i} - \bar{Y}_2)$

where $\bar{Y}_j = \frac{1}{n}\sum_{i=1}^{n} Y_{ji}\,,\; j = 1,2$. Here $\hat{\sigma}_{12}$ and $\sigma_{12}$ are taken to correspond to the estimator $\hat{C}(h)$ of the covariance $C(h)$ in expressions (4.10.2) and (4.10.1), respectively. To analyze this estimator, it is convenient to begin with the rescaled version

(A2.2.2)  $\tilde{\sigma}_{12} \;=\; \frac{n-1}{n}\,\hat{\sigma}_{12} \;=\; \frac{1}{n}\sum_{i=1}^{n} (Y_{1i} - \bar{Y}_1)(Y_{2i} - \bar{Y}_2)$

and recall the following standard decomposition of sums of squares:

(A2.2.3)  $\tilde{\sigma}_{12} \;=\; \frac{1}{n}\sum_{i=1}^{n}\big(Y_{1i}Y_{2i} - Y_{1i}\bar{Y}_2 - \bar{Y}_1 Y_{2i} + \bar{Y}_1\bar{Y}_2\big)$
          $\;=\; \frac{1}{n}\sum_{i=1}^{n} Y_{1i}Y_{2i} \;-\; \Big(\frac{1}{n}\sum_{i=1}^{n} Y_{1i}\Big)\bar{Y}_2 \;-\; \bar{Y}_1\Big(\frac{1}{n}\sum_{i=1}^{n} Y_{2i}\Big) \;+\; \frac{n}{n}\,\bar{Y}_1\bar{Y}_2$
          $\;=\; \frac{1}{n}\sum_{i=1}^{n} Y_{1i}Y_{2i} \;-\; \bar{Y}_1\bar{Y}_2 \;-\; \bar{Y}_1\bar{Y}_2 \;+\; \bar{Y}_1\bar{Y}_2$
          $\;=\; \frac{1}{n}\sum_{i=1}^{n} Y_{1i}Y_{2i} \;-\; \bar{Y}_1\bar{Y}_2$

But since

(A2.2.4)  $\bar{Y}_1\bar{Y}_2 \;=\; \Big(\frac{1}{n}\sum_{i=1}^{n} Y_{1i}\Big)\Big(\frac{1}{n}\sum_{i=1}^{n} Y_{2i}\Big) \;=\; \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n} Y_{1i}Y_{2j}$

it follows from (A2.2.2) through (A2.2.4) that

(A2.2.5)  $E(\hat{\sigma}_{12}) \;=\; \tfrac{n}{n-1}\,E(\tilde{\sigma}_{12}) \;=\; \tfrac{n}{n-1}\Big[\tfrac{1}{n}\sum_{i=1}^{n}E(Y_{1i}Y_{2i}) - E(\bar{Y}_1\bar{Y}_2)\Big]$
          $\;=\; \tfrac{n}{n-1}\Big[\tfrac{1}{n}\sum_{i=1}^{n}E(Y_{1i}Y_{2i}) - \tfrac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}E(Y_{1i}Y_{2j})\Big]$
          $\;=\; \tfrac{n}{n-1}\Big[\tfrac{1}{n}\sum_{i=1}^{n}E(Y_{1i}Y_{2i}) - \tfrac{1}{n^2}\sum_{i=1}^{n}E(Y_{1i}Y_{2i}) - \tfrac{1}{n^2}\sum_{i=1}^{n}\sum_{j\neq i}E(Y_{1i}Y_{2j})\Big]$
          $\;=\; \tfrac{n}{n-1}\Big[\tfrac{n-1}{n^2}\sum_{i=1}^{n}E(Y_{1i}Y_{2i}) - \tfrac{1}{n^2}\sum_{i=1}^{n}\sum_{j\neq i}E(Y_{1i}Y_{2j})\Big]$
          $\;=\; \tfrac{1}{n}\sum_{i=1}^{n}E(Y_{1i}Y_{2i}) \;-\; \tfrac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}E(Y_{1i}Y_{2j})$

Finally, if we let $\mu_j = E(Y_{ji}),\; j = 1,2$, then since by definition, $E(Y_{1i}Y_{2i}) = \sigma_{12} + \mu_1\mu_2$ and $E(Y_{1i}Y_{2j}) = \mathrm{cov}(Y_{1i}, Y_{2j}) + \mu_1\mu_2$ for all $i = 1,..,n$ and $j \neq i$, it follows from (A2.2.5) that

(A2.2.6)  $E(\hat{\sigma}_{12}) \;=\; \tfrac{1}{n}\,n\,(\sigma_{12} + \mu_1\mu_2) \;-\; \tfrac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}\big[\mathrm{cov}(Y_{1i}, Y_{2j}) + \mu_1\mu_2\big]$
          $\;=\; \sigma_{12} + \mu_1\mu_2 \;-\; \tfrac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}\mathrm{cov}(Y_{1i}, Y_{2j}) \;-\; \tfrac{n(n-1)}{n(n-1)}\,\mu_1\mu_2$
          $\;=\; \sigma_{12} \;-\; \tfrac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}\mathrm{cov}(Y_{1i}, Y_{2j})$

Finally we note that if $Y_{1i} = Y_{2i}$ for all $i$, then $\sigma_{12} = \sigma^2$ and $\mu_1\mu_2 = \mu^2$. So precisely the same argument shows that for the standard sample variance estimator, $\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$, expression (A2.2.6) becomes:

(A2.2.7)  $E(\hat{\sigma}^2) \;=\; \sigma^2 \;-\; \tfrac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}\mathrm{cov}(Y_i, Y_j)$

A2.3. A Bound on the Binning Bias of Empirical Variogram Estimates

Here it suffices to consider the variogram, $\gamma(h)$, on the interval of distance values, $d_{k-1} \leq h < d_k$, for a typical bin $k$. Recall from (4.7.1) that for a given sample of values $\{Y(s_i) : i = 1,..,n\}$, if $N_k$ denotes the set of distance pairs, $(s_i, s_j)$, in bin $k$, and if the distance between each such pair is denoted by $h_{ij} = \|s_i - s_j\|$, then the lag distance, $h_k$, for bin $k$ is defined to be

(A2.3.1)  $h_k \;=\; \frac{1}{|N_k|}\sum_{(s_i, s_j)\in N_k} h_{ij}$


Recall also that if the $\varepsilon_k$-linear approximation to $\gamma(h)$ on this interval is denoted by

(A2.3.2)  $l_k(h) \;=\; a_k\,h + b_k$

then by definition,

(A2.3.3)  $h \in [d_{k-1}, d_k) \;\;\Rightarrow\;\; |\gamma(h) - l_k(h)| \;\leq\; \varepsilon_k$

In this context we have the following bound on the bias of the empirical variogram estimates,

(A2.3.4)  $\hat{\gamma}(h_k) \;=\; \frac{1}{2\,|N_k|}\sum_{(s_i, s_j)\in N_k}\big[Y(s_i) - Y(s_j)\big]^2$

at lag distance, $h_k$:

Proposition A2.1. If for any bin, $k$, the true variogram, $\gamma(h)$, has an $\varepsilon_k$-linear approximation, then at lag distance, $h_k$, it must be true that

(A2.3.5)  $\big|\,E[\hat{\gamma}(h_k)] - \gamma(h_k)\,\big| \;\leq\; 2\,\varepsilon_k$

Proof: If for each $(s_i, s_j) \in N_k$ we let

(A2.3.6)  $\gamma_{ij} \;\equiv\; \gamma(h_{ij}) \;=\; \tfrac{1}{2}\,E\big[\big(Y(s_i) - Y(s_j)\big)^2\big]$

with $h_{ij} = \|s_i - s_j\|$, then by (A2.3.4),

(A2.3.7)  $E[\hat{\gamma}(h_k)] \;=\; E\Big[\frac{1}{2|N_k|}\sum_{(s_i,s_j)\in N_k}\big(Y(s_i)-Y(s_j)\big)^2\Big]$
          $\;=\; \frac{1}{|N_k|}\sum_{(s_i,s_j)\in N_k}\tfrac{1}{2}\,E\big[\big(Y(s_i)-Y(s_j)\big)^2\big]$
          $\;=\; \frac{1}{|N_k|}\sum_{(s_i,s_j)\in N_k}\gamma_{ij}$

But since $h_{ij} \in [d_{k-1}, d_k)$ for all $(s_i, s_j) \in N_k$, we see from (A2.3.3) that $|\gamma_{ij} - l_k(h_{ij})| \leq \varepsilon_k$, and thus that

(A2.3.8)  $-\varepsilon_k \;\leq\; \gamma_{ij} - l_k(h_{ij}) \;\leq\; \varepsilon_k$  for all $(s_i, s_j) \in N_k$

Hence by summing this set of inequalities and taking averages [with the observation that $(1/|N_k|)\sum_{(s_i,s_j)\in N_k}\varepsilon_k = (|N_k|/|N_k|)\,\varepsilon_k = \varepsilon_k$], we have

(A2.3.9)  $-\varepsilon_k \;\leq\; \frac{1}{|N_k|}\sum_{(s_i,s_j)\in N_k}[\gamma_{ij} - l_k(h_{ij})] \;\leq\; \varepsilon_k$

Next, by using (A2.3.1), (A2.3.2) and (A2.3.7), the middle expression of (A2.3.9) can be rewritten as,

(A2.3.10)  $\frac{1}{|N_k|}\sum_{(s_i,s_j)\in N_k}[\gamma_{ij} - l_k(h_{ij})] \;=\; \frac{1}{|N_k|}\sum_{(s_i,s_j)\in N_k}\gamma_{ij} \;-\; \frac{1}{|N_k|}\sum_{(s_i,s_j)\in N_k} l_k(h_{ij})$
           $\;=\; E[\hat{\gamma}(h_k)] \;-\; \frac{1}{|N_k|}\sum_{(s_i,s_j)\in N_k}[a_k h_{ij} + b_k]$
           $\;=\; E[\hat{\gamma}(h_k)] \;-\; \Big(a_k\,\frac{1}{|N_k|}\sum_{(s_i,s_j)\in N_k} h_{ij} \;+\; b_k\Big)$
           $\;=\; E[\hat{\gamma}(h_k)] \;-\; (a_k h_k + b_k)$
           $\;=\; E[\hat{\gamma}(h_k)] \;-\; l_k(h_k)$

so that (A2.3.9) is seen to imply that

(A2.3.11)  $-\varepsilon_k \;\leq\; E[\hat{\gamma}(h_k)] - l_k(h_k) \;\leq\; \varepsilon_k$

But since $h_k \in [d_{k-1}, d_k)$, it also follows from (A2.3.3) that $|l_k(h_k) - \gamma(h_k)| \leq \varepsilon_k$, and hence that

(A2.3.12)  $-\varepsilon_k \;\leq\; l_k(h_k) - \gamma(h_k) \;\leq\; \varepsilon_k$

Finally, by adding (A2.3.11) and (A2.3.12) we may conclude that

(A2.3.13)  $-2\varepsilon_k \;\leq\; \big(E[\hat{\gamma}(h_k)] - l_k(h_k)\big) + \big(l_k(h_k) - \gamma(h_k)\big) \;\leq\; 2\varepsilon_k$
           $\;\Rightarrow\;\; -2\varepsilon_k \;\leq\; E[\hat{\gamma}(h_k)] - \gamma(h_k) \;\leq\; 2\varepsilon_k$
           $\;\Rightarrow\;\; \big|\,E[\hat{\gamma}(h_k)] - \gamma(h_k)\,\big| \;\leq\; 2\varepsilon_k$

and thus that (A2.3.5) must hold. ■


A2.4 Some Basic Vector Geometry

In order to understand multidimensional analysis, one must begin with vector geometry.
In particular, all matrix manipulations are interpretable geometrically. If for any vector,
x  ( x1 ,.., xn )   n we denote the (Euclidean) length of x by

xx  
n
(A2.4.1) || x ||  x2
i 1 i

then for any two vectors, x  ( x1 ,.., xn ), y  ( y1 ,.., yn )   n , the distance between x and
y is just the length of the vector x  y  ( x1  y1 ,.., xn  yn )  n , i.e.,

( x  y )( x  y )  
n
(A2.4.2) || x  y ||  i 1
( xi  yi ) 2

This is illustrated for two dimensions (  2 ) in Figures A2.1 and A2.2 below.

x  ( x1 , x2 )
x
yx || y  x ||
y  ( y1 , y2 ) || x ||
|| y ||
y

Figure A2.1. Vectors Figure A2.2. Orthogonal Vectors

These distances in turn define angles, that complete the geometry of Euclidean spaces,
 n . All that is really required here is the notion of orthogonal vectors which constitute
the sides of a right triangle, as shown for  2 in Figure A2.2. Recall from the
Pythagorean Theorem, that such triangles are characterized by the familiar identity that
the square of the hypotenuse equals the sum of squares of the sides, i.e.,

(A2.4.3) || x ||2  || y ||2  || x  y ||2

Hence if we now write this orthogonality relation as, x  y , then terms of the notation
above, this implies that

(A2.4.4) x  y  || x ||2  || y ||2  || x  y ||2

 xx  yy  ( x  y )( x  y )  xx  2 xy  yy

 xy  0


Hence we are led to the fundamental geometric relation that orthogonality between
vectors is equivalent to zero inner products. This essentially defines vector geometry in
Euclidean spaces. (A somewhat sharper derivation of this result is given in terms of
cosines in Section ?? below.)
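A quick MATLAB check of this equivalence (not part of the text), using arbitrary illustrative vectors:

% Minimal sketch: orthogonality <=> zero inner product, via the Pythagorean identity.
x = [3; 4; 0];                        % arbitrary illustrative vectors
y = [-4; 3; 0];                       % chosen here so that x'*y = 0
pythag = norm(x)^2 + norm(y)^2 - norm(x - y)^2;   % equals 2*x'*y
disp([x'*y, pythag/2])                % both are zero, so x and y are orthogonal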

A2.5 Differentiation of Functions

Our main objective here is to develop multidimensional optimization problems, both


with and without constraints. The key analytical tools are differential measurements of
change in functional values. First recall that the derivative of a scalar (i.e., one-
dimensional) function f ( x) at a point x0 is just the slope of the function at x0 , as
defined by the limiting slope of a series of triangles shown in Figure A2.3 below.

Figure A2.3. Derivatives of Scalar Functions

In formal terms, this is written as1

(A2.5.1)  $\frac{d}{dx} f(x_0) \;=\; \lim_{\Delta\to 0}\frac{f(x_0 + \Delta) - f(x_0)}{\Delta}$

The example in Figure A2.3 is a simple parabolic function, $f(x) = x^2$, for which the derivative is given explicitly by

(A2.5.2)  $\frac{d}{dx} f(x_0) \;=\; \lim_{\Delta\to 0}\frac{(x_0 + \Delta)^2 - (x_0)^2}{\Delta} \;=\; \lim_{\Delta\to 0}\frac{x_0^2 + 2x_0\Delta + \Delta^2 - x_0^2}{\Delta}$
          $\;=\; \lim_{\Delta\to 0}\,(2x_0 + \Delta) \;=\; 2x_0$

Such limiting slope values cannot usually be obtained so easily. But this case serves to illustrate the basic idea.

1 In Figure A2.3 we have implicitly assumed that increments are positive ($\Delta > 0$). But for smooth functions, the same limiting slope results for negative increments as well.


From a geometric viewpoint, this limiting slope defines the unique tangent line to f at
x0 (shown in red in Figure A2.3). More importantly, the linear function defined by this
line yields the best linear approximation to function f in small intervals around x0
(since by construction it has the same value and slope as f at x0 ).

For multidimensional functions, f ( x)  f ( x1 ,.., xn ) , there is no direct parallel to


(A2.5.2), since small movements (increments) can occur in many different directions.
However, the most fundamental directions are those defined by changes of individual
variables holding all others fixed. More formally, the partial derivative of f ( x) with
respect to variable, xi , at a point x0  ( x01 ,.., x0i ,.., x0 n ) , is just the slope of the function
when moving in the xi direction. This is shown for the n  2 case in Figure A2.4
below, where the partial derivative of f ( x)  f ( x1 , x2 ) with respect to x1 at
x0  ( x01 , x02 ) corresponds to the slope of the red line shown.

z  f ( x1 , x2 )
z 

 x  (x , x )
0 01 02

x01

Figure A2.4 Partial Derivative

Again, this can be represented mathematically by the limit

 f ( x01 ,.., x0i   i ,.., x0 n )  f ( x0 )


(A2.5.3) f ( x0 )  lim i 0
xi i

For example, if f ( x1 , x2 )  2 x12  x22 , then

 [2( x01  1 ) 2  x02


2
]  [2 x01
2
 x02
2
]
(A2.5.3) f ( x0 )  lim i 0
x1 1

________________________________________________________________________
ESE 502 A2-8 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part II. Continuous Spatial Data Analysis
______________________________________________________________________________________

2
[2( x01  2 x011  12 )  x02
2
]  [2 x01
2
 x02
2
]
 lim i 0
1

 lim i 0 (4 x01  21 )  4 x01

These partial derivatives can in turn be used to define differential changes in any
direction. The key point to note is that for smooth functions, f ( x)  f ( x1 ,.., xn ) , in
higher dimensions, the unique tangent line defining the scalar derivative in Figure A2.3
is replaced by a unique tangent plane. This is again illustrated by the two-dimensional
function,2 f ( x)  f ( x1 , x2 ) , shown in Figure A2.5 below:

Figure A2.5. Tangent Planes

As in the scalar case, the plane tangent to $f$ at a given point, $x_0 = (x_{01},..,x_{0n})$, is essentially the "best linear approximation" to $f$ in small neighborhoods of $x_0$. In geometric terms, this tangent plane is more accurately described as the n-dimensional (hyper)plane tangent to the surface (or graph) of $f$ at the point $[x_0, f(x_0)] \in \mathbb{R}^{n+1}$, as illustrated by the 2-dimensional plane tangent to $f$ at $z_0 = [x_0, f(x_0)] \in \mathbb{R}^3$ in the figure (where the "red arrows" can be ignored for the moment).

2 The actual function plotted is the quadratic function, $f(x) = f(x_1, x_2) = 10 - [2y_1^2 + y_1 y_2 + y_2^2]$ with $y_i = x_i - 10,\; i = 1,2$.


If we continue to focus on this two-dimensional case for the present, and consider any small change in $x_0$, say $x_0 \to x_0 + \Delta = (x_{01} + \Delta_1,\, x_{02} + \Delta_2)$, then the corresponding change in $f$, denoted by $\Delta f(x_0)$, is well approximated by a corresponding movement on this tangent plane. As we have already seen, movement in the $x_1$ direction (with $\Delta_2 = 0$) yields changes governed entirely by the partial derivative of $f$ with respect to $x_1$ at $x_0$. This can now be depicted graphically as in Figure A2.6 below, where for notational simplicity we have represented the partial derivative of $f$ with respect to $x_i$ at $x_0$ by $a_i = \partial f(x_0)/\partial x_i\,,\; i = 1,2$. Here we have also shifted the origin up to the point, $z_0 = [x_0, f(x_0)]$, so that local movements away from $x_0$ can be represented simply by pairs $(\Delta_1, \Delta_2)$. [Note that the sizes of these shifts (relative to the "red arrow" from Figure A2.5) have been exaggerated for visual clarity.]

Figure A2.6. Local Linear Approximations

In this graphical depiction, a movement of $(\Delta_1, 0)$ yields an increase in $f(x_0)$ given approximately by $a_1\Delta_1$, as shown in the figure. Similarly, a movement of $(0, \Delta_2)$ yields an approximate increase of $a_2\Delta_2$. So by linearity, it follows that for the combined movement, $(\Delta_1, \Delta_2)$, the total increment in $f(x_0)$ is approximated by,3

(A2.5.4)  $\Delta f(x_0) \;\approx\; a_1\Delta_1 + a_2\Delta_2 \;=\; \Big(\frac{\partial f(x_0)}{\partial x_1}\Big)\Delta_1 + \Big(\frac{\partial f(x_0)}{\partial x_2}\Big)\Delta_2$

Finally, if these $\Delta$-shifts are allowed to become "arbitrarily small", then we obtain the limiting differential relation

3 Here the symbol, $\approx$, can be loosely read as "is approximately equal to".

(A2.5.5)  $df(x_0) \;=\; \Big(\frac{\partial f(x_0)}{\partial x_1}\Big)dx_1 + \Big(\frac{\partial f(x_0)}{\partial x_2}\Big)dx_2$

designated as the total derivative of $f$. Hence in higher dimensions, the scalar derivatives in (A2.5.1) are replaced by the total derivatives in (A2.5.5).

A2.6 Gradient Vectors

But for our present purposes, the key property of total derivatives is what they imply about partial derivatives in particular. Here we use some vector geometry by first writing the vector of differential elements in (A2.5.5) as $dx = (dx_1, dx_2)'$. In geometric terms, this can be viewed as a directional vector of small movements from any given point. Similarly, if we designate the vector of partial derivatives of $f$ at $x = (x_1, x_2)$ as,

(A2.6.1)  $\nabla f(x) \;=\; \begin{pmatrix} \partial f(x)/\partial x_1 \\ \partial f(x)/\partial x_2 \end{pmatrix}$

then (A2.5.5) can be rewritten in vector form as:

(A2.6.2)  $df(x_0) \;=\; \nabla f(x_0)'\,dx$

To interpret this geometrically, observe that if we now consider the contour


representation of f , shown as ellipses on the ( x1 , x2 ) -plane in Figure A2.5, then the
curve passing through x0 is by definition the contour with constant value, f ( x0 ) .
Similarly, the line tangent to this contour is simply the “linear contour” of the
corresponding tangent plane, shown by the horizontal (constant height) line passing
through z0 . This tangent line thus defines the directions of movement from x0 yielding
no change in $f$. But by (A2.5.4) these directions, $dx$, are given precisely by the no-change condition:

(A2.6.3)  $0 \;=\; df(x_0) \;=\; \nabla f(x_0)'\,dx$

Hence, by recalling (A2.4.4), we see that the key geometric consequence of this zero-inner-product condition is that the vector of partial derivatives, $\nabla f(x_0)$, must necessarily be orthogonal to the directions of no change in $f$. In Figure A2.5, $\nabla f(x_0)$ thus corresponds to the red arrow on the $(x_1, x_2)$-plane starting at $x_0$. Moreover, since its three-dimensional counterpart starting at $z_0$ on the tangent plane (in both Figures A2.5 and A2.6) is necessarily the steepest direction of movement on this plane, it


follows that $\nabla f(x_0)$ defines the direction of movement in the $(x_1, x_2)$-plane yielding a maximum increase in $f$ at $x_0$. For this reason, the vector of partial derivatives, $\nabla f(x_0)$, is usually called the gradient vector of $f$ at $x_0$.

Finally, while the $n = 2$ case is extremely useful for gaining geometric intuition, it should be emphasized that all relationships above are immediately extendable to general functions, $f(x) = f(x_1,..,x_n)$. In particular, if we let $dx = (dx_1,..,dx_n)'$ and define the general gradient vector at $x = (x_1,..,x_n) \in \mathbb{R}^n$ by

(A2.6.4)  $\nabla f(x) \;=\; \begin{pmatrix} \partial_1 f(x) \\ \vdots \\ \partial_n f(x) \end{pmatrix} \;=\; \begin{pmatrix} \partial f(x)/\partial x_1 \\ \vdots \\ \partial f(x)/\partial x_n \end{pmatrix}$

then (A2.6.2) and (A2.6.3) continue to hold in $\mathbb{R}^n$.
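As an illustrative check of (A2.6.2) (not in the original text), one can compare a small actual change in a smooth function with the tangent-plane approximation $\nabla f(x_0)'\,dx$; the particular function, point, and displacement below are arbitrary.

% Minimal sketch: df(x0) is approximately grad'*dx for a small displacement dx.
f  = @(x) 2*x(1)^2 + x(2)^2;          % illustrative smooth function
x0 = [1; 3];
grad = [4*x0(1); 2*x0(2)];            % analytic gradient of f at x0
dx = 1e-4 * [0.6; -0.8];              % small displacement in an arbitrary direction
actual = f(x0 + dx) - f(x0);          % actual change in f
approx = grad' * dx;                  % total-derivative (tangent-plane) approximation
disp([actual, approx])                % the two values agree to leading order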

A2.7 Unconstrained Optimization of Smooth Functions

Given these key geometric results, we can now consider optimization problems involving smooth multidimensional functions, $f(x) = f(x_1,..,x_n)$. These amount to finding points, $x$, in some specified region, $R \subseteq \mathbb{R}^n$, with either maximum or minimum values, $f(x)$, in $R$, depending on the given problem. Here it is important to emphasize that maximizing the function, $f(x)$, over $R$ is equivalent to minimizing the function, $-f(x)$, over $R$. For this reason, it suffices to consider only maximization problems (which are usually easier to depict graphically for the $n = 2$ case).4

In this context, an unconstrained maximization problem for our purposes is taken to be


one in which the maximum of f ( x) is known to be achieved at some interior point of
R, and hence is a smooth maximum that can be characterized by the derivatives of f. In
the scalar case, this is the usual “zero-slope” condition that the derivative be zero at the
maximum, as shown for the scalar function, f ( x) , in Figure A2.7 below. Here the
maximum at x0 is seen to be uniquely characterized by this zero-slope condition. But
even with a unique maximum, this condition is by no means sufficient. Even when
there are no other local maxima (or minima), it is still possible to have other singular
points, i.e., with zero slope. Figure A2.8 illustrates a singular inflection5 point which is
neither a local minimum or maximum. In the scalar case, such possibilities can be

4 Here it is also worth noting that optimization software (such as the MATLAB optimization toolbox) is typically designed to do only minimization problems. So all maximization problems must be reformulated as minimization problems.
5 An inflection point, x, for f is a point at which the second derivative of f changes sign.


eliminated by requiring that the second derivative be negative at all singular points, so
that the unique maximum is always characterized by the zero-slope condition. This is
precisely analogous to the one-dimensional kriging problem in Section 6.2.1 of the text.
Here a global minimum was insured for the simple quadratic function in (6.2.19) with
positive second derivative in (6.2.20).

Figure A2.7. Scalar Maximum          Figure A2.8. Singular Inflection

The situation is more complex for multidimensional functions. Here the first-order "zero-slope" condition, $\frac{d}{dx}f(x_0) = 0$, is replaced by a more general "zero-gradient" condition, $\nabla f(x_0) = 0$, which ensures that the total derivative in (A2.6.2) is zero in all directions, $dx$.6 Geometrically, this first-order condition requires that the tangent plane at $x_0$ be flat, as is illustrated in Figure A2.9 below.

Figure A2.9. First Order Condition for a Maximum

6 Note that since $\nabla f(x_0)$ is an n-vector, the "0" here is also an n-vector, $0 = (0,..,0)'$. While we could write this as $0_n$, standard practice is to take the dimension of zero vectors as understood by context.


The function, $f(x) = f(x_1, x_2)$, actually shown in Figure A2.9 is a bivariate quadratic function, which takes the explicit form

(A2.7.1)  $f(x_1, x_2) \;=\; 928 + 26\,x_1 + 20\,x_2 - 3\,x_1^2 - x_1 x_2 - 4\,x_2^2$

So by taking the partial derivatives of this function and setting them equal to zero, we obtain the relations,

(A2.7.2)  $0 \;=\; \frac{\partial}{\partial x_1} f(x_0) \;=\; 26 - 6\,x_1 - x_2$

(A2.7.3)  $0 \;=\; \frac{\partial}{\partial x_2} f(x_0) \;=\; 20 - x_1 - 8\,x_2$

These linear equations can easily be solved to yield the unique solution point, $x_0 = (x_{01}, x_{02}) = (4, 2)$, shown in the figure. However, when the dimension, n, is much larger than two, it is practically impossible to write down the full expression for $f(x) = f(x_1,..,x_n)$, let alone the simultaneous equation system corresponding to the first-order condition. Here is where the power of matrix algebra takes full force. If we let

(A2.7.4)  $A \;=\; \begin{pmatrix} 3 & 2 \\ -1 & 4 \end{pmatrix}\,,\quad b \;=\; \begin{pmatrix} 26 \\ 20 \end{pmatrix}\,,\quad c \;=\; 928$

then it can easily be verified (by matrix multiplication) that the function in (A2.7.1) can be equivalently written in matrix form for all $x = (x_1, x_2)'$ as,

(A2.7.5)  $f(x) \;=\; c + b'x - x'Ax$

Notice the similarity of this quadratic form to the general expression for mean squared error, $MSE(\lambda_0)$, in expression (6.2.27), where $x$ now plays the role of the weight vector, $\lambda_0$.7 The power of this notation is that the quadratic form in (A2.7.5) can be analyzed in the same way regardless of the dimension, n. All that is required here is that we formalize the vector version of the partial derivatives in (A2.7.2) and (A2.7.3). To do so, notice first that for any coefficient vector, $b = (b_1, b_2)'$, such as in (A2.7.4), if we now employ the gradient notation in (A2.6.4) then it follows that,

(A2.7.6)  $\nabla(b'x) \;=\; \begin{pmatrix} \partial_1(b'x) \\ \partial_2(b'x) \end{pmatrix} \;=\; \begin{pmatrix} \frac{\partial}{\partial x_1}(b_1 x_1 + b_2 x_2) \\ \frac{\partial}{\partial x_2}(b_1 x_1 + b_2 x_2) \end{pmatrix} \;=\; \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} \;=\; b$

7 It is also worth noticing the difference in signs of the quadratic term, where MSE was to be minimized, and f is to be maximized. We shall return to this distinction below.


More generally, for any linear compound, $b'x = \sum_{i=1}^{n} b_i x_i$, exactly the same argument shows that

(A2.7.7)  $\nabla(b'x) \;=\; b$

Turning next to the quadratic term in (A2.7.5), observe that for any $2 \times 2$ matrix, $A$,

(A2.7.8)  $x'Ax \;=\; (x_1\;\,x_2)\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \;=\; (x_1\;\,x_2)\begin{pmatrix} a_{11}x_1 + a_{12}x_2 \\ a_{21}x_1 + a_{22}x_2 \end{pmatrix}$
          $\;=\; a_{11}x_1^2 + a_{12}x_1 x_2 + a_{21}x_2 x_1 + a_{22}x_2^2$

so that the corresponding partial derivative expression can be written as

(A2.7.9)  $\nabla(x'Ax) \;=\; \begin{pmatrix} \frac{\partial}{\partial x_1}\big(a_{11}x_1^2 + a_{12}x_1x_2 + a_{21}x_2x_1 + a_{22}x_2^2\big) \\ \frac{\partial}{\partial x_2}\big(a_{11}x_1^2 + a_{12}x_1x_2 + a_{21}x_2x_1 + a_{22}x_2^2\big) \end{pmatrix}$
          $\;=\; \begin{pmatrix} 2a_{11}x_1 + a_{12}x_2 + a_{21}x_2 \\ 2a_{22}x_2 + a_{12}x_1 + a_{21}x_1 \end{pmatrix} \;=\; \begin{pmatrix} a_{11}x_1 + a_{12}x_2 \\ a_{21}x_1 + a_{22}x_2 \end{pmatrix} + \begin{pmatrix} a_{11}x_1 + a_{21}x_2 \\ a_{12}x_1 + a_{22}x_2 \end{pmatrix}$
          $\;=\; \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \;=\; Ax + A'x$

More generally, for any quadratic expression, $x'Ax = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij} x_i x_j$, essentially the same argument shows that

(A2.7.10)  $\nabla(x'Ax) \;=\; (A + A')\,x$

Here there is one important special case, namely when the matrix $A$ is symmetric, i.e., when $A' = A$. For this case it follows at once from (A2.7.10) that

(A2.7.11)  $\nabla(x'Ax) \;=\; 2\,A\,x$
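As a quick numerical check of (A2.7.10) (not part of the text), one can compare a central-difference gradient of $x'Ax$ with $(A + A')x$ for an arbitrary non-symmetric matrix; all values below are illustrative.

% Minimal sketch: numerical check that grad(x'Ax) = (A + A')x.
A = [3 2; -1 4];                      % an arbitrary non-symmetric 2-by-2 matrix
x = [1; 2];  del = 1e-6;              % arbitrary evaluation point, small increment
q = @(x) x'*A*x;                      % quadratic form
g = zeros(2,1);
for i = 1:2
    e = zeros(2,1);  e(i) = del;      % perturb the i-th coordinate only
    g(i) = (q(x + e) - q(x - e)) / (2*del);   % central-difference partial derivative
end
disp([g, (A + A')*x])                 % the two columns agree (up to rounding)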

To see the special relevance of this case, notice that every square matrix, $A$, has an associated symmetrization,

(A2.7.12)  $A_s \;=\; \tfrac{1}{2}(A + A') \;\;\Rightarrow\;\; A_s' \;=\; \tfrac{1}{2}(A' + A) \;=\; A_s$

But since $x'y = y'x$ for all vectors, it then follows that

(A2.7.13)  $x'A_s x \;=\; x'\big[\tfrac{1}{2}(A + A')\big]x \;=\; \tfrac{1}{2}\big(x'Ax + x'A'x\big)$
           $\;=\; \tfrac{1}{2}\big(x'(Ax) + (Ax)'x\big) \;=\; \tfrac{1}{2}\big(x'(Ax) + x'(Ax)\big) \;=\; x'Ax$

So in fact, every quadratic expression, $x'Ax$, can be represented by a symmetric matrix as $x'A_s x$. As one illustration, observe that the matrix $A$ in (A2.7.4) is not symmetric. So in this case, one could replace $A$ with the symmetric matrix,

(A2.7.14)  $A_s \;=\; \tfrac{1}{2}\left[\begin{pmatrix} 3 & 2 \\ -1 & 4 \end{pmatrix} + \begin{pmatrix} 3 & -1 \\ 2 & 4 \end{pmatrix}\right] \;=\; \begin{pmatrix} 3 & 1/2 \\ 1/2 & 4 \end{pmatrix}$

A2.7.1 First-Order Conditions

Using these identities, we can now establish first-order conditions for any quadratic maximization problem as follows. If $f(x)$ is assumed to have the general quadratic form

(A2.7.15)  $f(x) \;=\; c + b'x + x'Ax$

with $A$ symmetric, then by linearity of differentiation [i.e., $\nabla(f + g) = \nabla f + \nabla g$] we have:

(A2.7.16)  $\nabla f(x) \;=\; \nabla(c + b'x + x'Ax) \;=\; 0 + \nabla(b'x) + \nabla(x'Ax) \;=\; b + 2Ax$

So the first-order condition for a maximum of $f(x)$ can be solved as follows:

(A2.7.17)  $0 \;=\; \nabla f(x_0) \;=\; b + 2Ax_0 \;\;\Rightarrow\;\; 2Ax_0 = -b \;\;\Rightarrow\;\; x_0 \;=\; -\tfrac{1}{2}A^{-1}b$

In the present case, where (symmetric) $A$ is given by the negative of (A2.7.14) [to be consistent with (A2.7.15)], it follows that

(A2.7.18)  $x_0 \;=\; -\tfrac{1}{2}(-A_s)^{-1}b \;=\; \tfrac{1}{2}A_s^{-1}b \;=\; \tfrac{1}{2}\begin{pmatrix} 3 & 1/2 \\ 1/2 & 4 \end{pmatrix}^{-1}\begin{pmatrix} 26 \\ 20 \end{pmatrix} \;=\; \begin{pmatrix} 4 \\ 2 \end{pmatrix}$

which is precisely the solution shown in Figure A2.9.
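A short MATLAB sketch of this calculation (not part of the text), repeating the illustrative values above and also anticipating the second-order check developed below:

% Minimal sketch: first-order condition for the quadratic maximization example.
As = [3 1/2; 1/2 4];  b = [26; 20];   % symmetric quadratic term and linear term
x0 = 0.5 * (As \ b);                  % x0 = (1/2)*inv(As)*b  -->  (4, 2)
H  = -2 * As;                         % Hessian of f(x) = c + b'x - x'*As*x
disp(x0')                             % stationary point
disp(eig(H)')                         % both eigenvalues negative => a maximum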

If this same line of reasoning is applied to the mean-squared-error function

(A2.7.19)  $MSE(\lambda_0) \;=\; \sigma^2 - 2\,c_0'\lambda_0 + \lambda_0' V_0\,\lambda_0$

in expression (6.2.25), we can now solve the corresponding first-order condition for the optimal weight vector, $\hat{\lambda}_0$, as follows

(A2.7.20)  $0 \;=\; \nabla MSE(\hat{\lambda}_0) \;=\; -2\,c_0 + 2\,V_0\hat{\lambda}_0 \;\;\Rightarrow\;\; V_0\hat{\lambda}_0 = c_0 \;\;\Rightarrow\;\; \hat{\lambda}_0 \;=\; V_0^{-1}c_0$

which is seen to be precisely the simple kriging solution in expression (6.2.26).

But while these first-order conditions are necessary for optimal solutions, they are not
sufficient. In particular, (A2.7.18) is claimed to be the solution of a maximization
problem, and (A2.7.20) is claimed to be the solution of a minimization problem. Hence
to check whether either of these is actually a solution of its respective problem, we must develop appropriate second-order conditions.

A2.7.2 Second-Order Conditions

Recall that in the scalar case, the second-order condition for a maximum (or minimum) of $f(x)$ at $x_0$ is that the second derivative, $\frac{d^2}{dx^2}f(x_0)$, be negative (or positive), as seen for the case of a maximum in Figure A2.7 above. In the multidimensional case the conditions are similar in nature, but are necessarily somewhat more complex. The simplest way to motivate the basic idea here is to reduce the problem to "one dimension" in the following way. For a two-dimensional function, $f(x) = f(x_1, x_2)$, with a maximum at point, $x_0$, such as in Figure A2.9 above, consider a one-dimensional "slice" through this function such as the one shown in Figure A2.10 below.

Figure A2.10. One-Dimensional Slices

Such a slice can be defined formally by choosing any fixed nonzero vector, $\Delta x$, and considering all linear combinations, $x_0 + t\,\Delta x$. As the scalar, $t$, increases from zero, one moves away from $x_0$ in "direction" $\Delta x$. Similarly, as $t$ decreases from zero, one moves


in the opposite direction. The one-dimensional slice through $f$ shown in the figure thus corresponds precisely to the scalar function of $t$ defined by $g_{\Delta x}(t) = f(x_0 + t\,\Delta x)$. So if $f$ achieves its maximum at $x_0$, then in particular, it must exhibit a maximum along this slice at $t = 0$. This of course implies that $\frac{d}{dt}g_{\Delta x}(0) = 0$, and more importantly for our present purposes, that $\frac{d^2}{dt^2}g_{\Delta x}(0) < 0$. To analyze this latter condition more explicitly, we introduce the following simplifying notation. For any function, $f(x) = f(x_1,..,x_n)$ of $n$ arguments, let

(A2.7.21)  $f_i(x) \;=\; \frac{\partial}{\partial x_i} f(x_1,..,x_i,..,x_n)$

denote the partial derivative of $f$ with respect to its i-th argument, and for each $i, j = 1,..,n$ let

(A2.7.22)  $f_{ij}(x) \;=\; \frac{\partial}{\partial x_i} f_j(x_1,..,x_i,..,x_n) \;=\; \frac{\partial^2}{\partial x_i\,\partial x_j} f(x_1,..,x_i,..,x_j,..,x_n)$

denote the cross partial derivative of $f$ with respect to its i-th and j-th arguments (so that in particular, $f_{ii}(x)$ is the second partial derivative of $f$ with respect to its i-th argument). In terms of this notation, if we consider a compound function, $g(t) = f[h_1(t), h_2(t)]$, and recall from the chain rule for derivatives that

(A2.7.23)  $\frac{d}{dt}g(t) \;=\; f_1[h_1(t),h_2(t)]\,\frac{d}{dt}h_1(t) \;+\; f_2[h_1(t),h_2(t)]\,\frac{d}{dt}h_2(t)$

then by applying this rule to the function $g_{\Delta x}(t)$ above, we see that

(A2.7.24)  $\frac{d}{dt}g_{\Delta x}(t) \;=\; \frac{d}{dt}f(x_0 + t\,\Delta x) \;=\; \frac{d}{dt}f(x_{01} + t\,\Delta x_1,\; x_{02} + t\,\Delta x_2)$
           $\;=\; f_1(x_0 + t\,\Delta x)\cdot\frac{d}{dt}(x_{01} + t\,\Delta x_1) \;+\; f_2(x_0 + t\,\Delta x)\cdot\frac{d}{dt}(x_{02} + t\,\Delta x_2)$
           $\;=\; f_1(x_0 + t\,\Delta x)\,\Delta x_1 \;+\; f_2(x_0 + t\,\Delta x)\,\Delta x_2$

Differentiating once again we have

(A2.7.25)  $\frac{d^2}{dt^2}g_{\Delta x}(t) \;=\; \frac{d}{dt}\Big[\frac{d}{dt}g_{\Delta x}(t)\Big] \;=\; \frac{d}{dt}\big[f_1(x_0 + t\,\Delta x)\big]\Delta x_1 \;+\; \frac{d}{dt}\big[f_2(x_0 + t\,\Delta x)\big]\Delta x_2$

So by applying the chain rule to the first term on the right, we obtain

(A2.7.26)  $\frac{d}{dt}\big[f_1(x_0 + t\,\Delta x)\big]\Delta x_1 \;=\; \frac{d}{dt}\big[f_1(x_{01} + t\,\Delta x_1,\; x_{02} + t\,\Delta x_2)\big]\Delta x_1$
           $\;=\; \big[f_{11}(x_0 + t\,\Delta x)\,\Delta x_1 + f_{12}(x_0 + t\,\Delta x)\,\Delta x_2\big]\Delta x_1$
           $\;=\; f_{11}(x_0 + t\,\Delta x)\,\Delta x_1^2 \;+\; f_{12}(x_0 + t\,\Delta x)\,\Delta x_1\Delta x_2$

Similarly, the second term in (A2.7.25) can be written out as

(A2.7.27)  $\frac{d}{dt}\big[f_2(x_0 + t\,\Delta x)\big]\Delta x_2 \;=\; f_{21}(x_0 + t\,\Delta x)\,\Delta x_2\Delta x_1 \;+\; f_{22}(x_0 + t\,\Delta x)\,\Delta x_2^2$

By combining these, we can now write the second derivative in (A2.7.25) more explicitly as

(A2.7.28)  $\frac{d^2}{dt^2}g_{\Delta x}(t) \;=\; f_{11}(x_0 + t\,\Delta x)\,\Delta x_1^2 + f_{12}(x_0 + t\,\Delta x)\,\Delta x_1\Delta x_2$
           $\qquad\qquad +\; f_{21}(x_0 + t\,\Delta x)\,\Delta x_2\Delta x_1 + f_{22}(x_0 + t\,\Delta x)\,\Delta x_2^2$

Finally, by evaluating this at $t = 0$, we obtain the explicit second-order condition

(A2.7.29)  $\frac{d^2}{dt^2}g_{\Delta x}(0) \;=\; f_{11}(x_0)\,\Delta x_1^2 + f_{12}(x_0)\,\Delta x_1\Delta x_2 + f_{21}(x_0)\,\Delta x_2\Delta x_1 + f_{22}(x_0)\,\Delta x_2^2 \;<\; 0$

The Hessian Matrix

This second-order condition can be written more compactly in matrix form as follows. If we now designate the matrix of cross partial derivatives of $f$ at point $x_0$ as the Hessian matrix,

(A2.7.30)  $H_f(x_0) \;=\; \begin{pmatrix} f_{11}(x_0) & f_{21}(x_0) \\ f_{12}(x_0) & f_{22}(x_0) \end{pmatrix}$

then the right-hand side of (A2.7.29) can be written in matrix terms as

(A2.7.31)  $\frac{d^2}{dt^2}g_{\Delta x}(0) \;=\; (\Delta x_1,\,\Delta x_2)\begin{pmatrix} f_{11}(x_0) & f_{21}(x_0) \\ f_{12}(x_0) & f_{22}(x_0) \end{pmatrix}\begin{pmatrix} \Delta x_1 \\ \Delta x_2 \end{pmatrix} \;=\; \Delta x'\,H_f(x_0)\,\Delta x$

Hence the desired second-order condition for a maximum of $f$ at $x_0$ with respect to direction $\Delta x$ takes the simple form:

(A2.7.32)  $\Delta x'\,H_f(x_0)\,\Delta x \;<\; 0$

Before proceeding, it is appropriate to extend condition (A2.7.32) to the general case of $n$ dimensions. Here it is enough to observe that while the $n = 2$ case permits one-dimensional slices in each direction to be seen graphically (as in Figure A2.10 above), none of the analysis is in any way restricted to this case. Hence, if for any smooth function, $f(x) = f(x_1,..,x_n)$, and point $x_0 \in \mathbb{R}^n$ in the domain of $f$ we now define the associated Hessian matrix at $x_0$ by

(A2.7.33)  $H_f(x_0) \;=\; \begin{pmatrix} f_{11}(x_0) & \cdots & f_{n1}(x_0) \\ \vdots & \ddots & \vdots \\ f_{1n}(x_0) & \cdots & f_{nn}(x_0) \end{pmatrix}$

then the argument leading to (A2.7.32) continues to hold for any direction vector, $\Delta x \in \mathbb{R}^n$, and Hessian matrix given by (A2.7.33).

Given this "one-dimensional" condition, it remains only to observe that for a true maximum at $x_0$, this same condition must hold in all directions with respect to $x_0$. So if we now designate an n-square matrix, $A$, to be negative definite if and only if

(A2.7.34)  $x'A\,x \;<\; 0$  for all $x \neq 0$

then it follows at once from (A2.7.32) and (A2.7.34) that the desired full-dimensional condition for a maximum of $f$ at $x_0$ is precisely that the Hessian matrix, $H_f(x_0)$, be negative definite.

This condition for a maximum also yields a corresponding condition for a minimum of $f$ at $x_0$. For the $n = 2$ case, simply observe that if the "mountain" shape of $f(x)$ in Figure A2.9 is inverted to a "bowl" shape, then it is clear that the function, $g_{\Delta x}(t) = f(x_0 + t\,\Delta x)$, corresponding to each slice in Figure A2.10 must now have a positive second derivative at $t = 0$, i.e., $\frac{d^2}{dt^2}g_{\Delta x}(0) > 0$. Hence the same argument leading to (A2.7.32) now shows that

(A2.7.35)  $\Delta x'\,H_f(x_0)\,\Delta x \;>\; 0$

must hold in each nonzero direction $\Delta x$. This argument is again directly extendable to $n$ dimensions (but without pictures). So if we now designate an n-square matrix, $A$, as positive definite if and only if

(A2.7.36)  $x'A\,x \;>\; 0$  for all $x \neq 0$

then the parallel full-dimensional condition for a minimum of $f$ at $x_0 \in \mathbb{R}^n$ is simply that the Hessian matrix, $H_f(x_0)$, be positive definite.


Conditions for Symmetric Positive Definiteness

The task remaining is to establish readily testable conditions for determining when a matrix is positive or negative definite. Here we begin by observing from (A2.7.34) and (A2.7.36) that a matrix, $A$, is positive definite if and only if $-A$ is negative definite. Hence, it suffices to consider only one of these two conditions. Following standard practice, we here focus on positive definiteness. Next recall from the identity in (A2.7.13) that to establish positive definiteness, we may assume that the matrix $A$ is symmetric (for if not, then use its symmetrization, $A_s$). For Hessian matrices in particular, it turns out that such matrices are guaranteed to be symmetric, i.e., $f_{ij}(x) = f_{ji}(x)$, whenever these cross partial derivatives are continuous.8 So we shall focus on conditions for establishing that a symmetric matrix is positive definite.

To motivate the conditions characterizing symmetric positive definite (SPD) matrices, we begin with the following fundamental observation, which forms the basis for essentially all characterizations of such matrices. An n-square matrix, $A$, is SPD if and only if it can be "decomposed" into a product of the form,

(A2.7.37)  $A \;=\; B\,B'$

for some nonsingular n-square matrix, B. To see this, observe first that since

(A2.7.38)  $A' \;=\; (BB')' \;=\; (B')'B' \;=\; BB' \;=\; A$

it follows that $A$ must be symmetric. More importantly, observe that since the inner product of a nonzero vector, $x$, with itself is always positive, i.e.,

(A2.7.39)  $x \neq 0 \;\;\Rightarrow\;\; x'x \;=\; \sum_{i=1}^{n} x_i^2 \;>\; 0$

and since the nonsingularity of $B$ insures that $B'x \neq 0$ whenever $x \neq 0$, it then follows from (A2.7.39) that for all $x \neq 0$,

(A2.7.40)  $x'Ax \;=\; x'(BB')x \;=\; (B'x)'(B'x) \;>\; 0$

and hence that $A$ is SPD. This characterization helps to clarify the real meaning of positive definiteness. In particular, if we consider the simplest case, $n = 1$, and let $a$ denote the scalar matrix, $A$, then the positive definiteness condition simply says that for all nonzero scalars, $x$, we must have $x(a)x = a\,x^2 > 0$, which of course simply characterizes positivity of the scalar, $a$. So again letting $b$ denote the scalar matrix, $B$, condition (A2.7.37) simply says that $a$ is positive if and only if it can be written as

8 This result is usually known as Young's Theorem, and can be found in most calculus textbooks.


a  b 2 for some scalar b , i.e., if and only if it has a real square root, b.9 So in this
sense, positive definite matrices are the natural generalization of positive numbers. But
while this decomposition characterizations is very informative, it is no more “testable”
than positive definiteness itself. However, there do exist testable conditions for
ensuring the existence of such decompositions as we now show.

The simplest and most commonly used test for positive definiteness is based on the properties of certain determinants. If the determinant of an n-square matrix, $A$, is denoted by $\det(A)$, then this condition involves positivity of the determinants of certain sub-matrices of $A$. In particular, for each $k = 1,..,n$, designate the k-square matrix, $A_k = (a_{ij} : i,j = 1,..,k)$, in the "upper left-hand corner" of $A = (a_{ij} : i,j = 1,..,n)$, i.e.,

(A2.7.41)  $A \;=\; \begin{pmatrix} a_{11} & \cdots & a_{1k} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots & & \vdots \\ a_{k1} & \cdots & a_{kk} & & \vdots \\ \vdots & & & \ddots & \vdots \\ a_{n1} & \cdots & \cdots & \cdots & a_{nn} \end{pmatrix}$

as the kth leading principle sub-matrix of $A$, and designate its determinant, $\det(A_k)$, as the kth leading principle minor of $A$. Then the following condition, known as Sylvester's Condition, is both necessary and sufficient for positive definiteness:

Sylvester’s Condition. A symmetric matrix, A, is positive definite if and only if


all principle minors of A are positive.

This result will be shown later to be a simple consequence of the Spectral Decomposition Theorem for symmetric matrices. To illustrate its application, consider the symmetrized matrix in (A2.7.14) above, i.e.,

(A2.7.42)  $A \;=\; \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \;=\; \begin{pmatrix} 3 & 1/2 \\ 1/2 & 4 \end{pmatrix}$

Observe that since the principle minors are $\det(a_{11}) = a_{11} = 3 > 0$ and

(A2.7.43)  $\det(A) \;=\; a_{11}a_{22} - a_{21}a_{12} \;=\; (3)(4) - (1/2)^2 \;>\; 0$

it follows at once from Sylvester's Condition that $A$ is positive definite.
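As a quick illustration (not in the text), Sylvester's Condition can be checked in MATLAB by computing the leading principle minors directly; the matrix below repeats the symmetric example above.

% Minimal sketch: Sylvester's Condition via leading principle minors.
A = [3 1/2; 1/2 4];                   % symmetric test matrix (illustrative values)
n = size(A,1);
minors = zeros(n,1);
for k = 1:n
    minors(k) = det(A(1:k,1:k));      % k-th leading principle minor
end
disp(minors')                         % all entries positive => A is positive definite
isSPD = all(minors > 0)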

9 Later we shall see that SPD matrices, A, actually have square roots as well, i.e., can be written as $A = B^2$ for a nonsingular symmetric matrix, B. But this requires the Spectral Decomposition Theorem for symmetric matrices.


But our main interest in Sylvester's condition is that it provides the basis for establishing a more useful testable condition that has many applications of its own. In particular, it yields a simple decomposition of SPD matrices known as the Cholesky decomposition. In particular, if a matrix, T, with zeros everywhere above the diagonal, i.e., of the form

(A2.7.44)  $T \;=\; \begin{pmatrix} t_{11} & 0 & \cdots & 0 \\ t_{21} & t_{22} & & \vdots \\ \vdots & & \ddots & 0 \\ t_{n1} & t_{n2} & \cdots & t_{nn} \end{pmatrix}$

is designated as a lower triangular matrix, then matrix $A$ is said to have a Cholesky decomposition if and only if there is a nonsingular lower triangular matrix, $T$, such that

(A2.7.45)  $A \;=\; T\,T'$

By the argument above, every matrix of this form is SPD. Moreover, this again turns
out to completely characterize SPD matrices as we now show: 10

Cholesky Theorem. A symmetric matrix A is positive definite if and only if there exists a Cholesky decomposition for A.

Proof: If A has a Cholesky decomposition, then the argument in (A2.7.40) shows that A is positive definite. Conversely, if A is positive definite, then by Sylvester's condition, all leading principle minors of A are positive. Using this property, we can now construct a Cholesky decomposition by induction on the dimension of the n-square matrix, A. For $n = 1$, A is by hypothesis a positive scalar, so that we may set $T = T' = \sqrt{A}$. Now suppose that it is true for $n - 1 > 0$ and consider a symmetric n-square matrix, A, with all positive principle minors. We may write A in partitioned form as

(A2.7.46)  $A \;=\; \begin{pmatrix} A_{n-1} & a_{n-1} \\ a_{n-1}' & a_{nn} \end{pmatrix}$

where $A_{n-1}$ is the $(n-1)$st leading principle sub-matrix of A. By construction, $A_{n-1}$ has all positive leading principle minors (namely the first $n-1$ leading principle minors of A). Thus by hypothesis, $A_{n-1}$ must have a Cholesky decomposition, say

(A2.7.47)  $A_{n-1} \;=\; T_{n-1}\,T_{n-1}'$

Our objective is to extend $T_{n-1}$ to a Cholesky decomposition, $A = T\,T'$, for A as follows. By lower triangularity, T must have the form

10 The following proof is based on an argument given by Prof. David Hill that is available online at: http://astro.temple.edu/~dhill001/course/math254/CHOLESKYDECOMPOSITION_stu.pdf


(A2.7.48)  $T \;=\; \begin{pmatrix} T_{n-1} & 0 \\ h' & c \end{pmatrix}$

for some unknown $(n-1)$-vector, $h$, and scalar, $c$. Hence by (A2.7.46) and (A2.7.48) we seek values for $h$ and $c$ such that,

(A2.7.49)  $\begin{pmatrix} A_{n-1} & a_{n-1} \\ a_{n-1}' & a_{nn} \end{pmatrix} \;=\; \begin{pmatrix} T_{n-1} & 0 \\ h' & c \end{pmatrix}\begin{pmatrix} T_{n-1}' & h \\ 0' & c \end{pmatrix} \;=\; \begin{pmatrix} T_{n-1}T_{n-1}' & T_{n-1}h \\ h'T_{n-1}' & h'h + c^2 \end{pmatrix}$

In particular, this implies both that

(A2.7.50)  $a_{n-1} \;=\; T_{n-1}\,h$ , and

(A2.7.51)  $a_{nn} \;=\; h'h + c^2$

But by the nonsingularity of $T_{n-1}$, we can solve for $h$ in (A2.7.50) as

(A2.7.52)  $h \;=\; T_{n-1}^{-1}\,a_{n-1}$

Similarly by (A2.7.51), the value of $c$ must be given by

(A2.7.53)  $c \;=\; \sqrt{a_{nn} - h'h}$

Hence to complete this construction, it remains only to show that the last operation is legitimate, i.e., that

(A2.7.54)  $a_{nn} - h'h \;>\; 0$

But by the determinant rule for partitioned matrices, it follows from (A2.7.46) that

(A2.7.55)  $\det(A) \;=\; \det\begin{pmatrix} A_{n-1} & a_{n-1} \\ a_{n-1}' & a_{nn} \end{pmatrix} \;=\; \det(A_{n-1})\,\big(a_{nn} - a_{n-1}'A_{n-1}^{-1}a_{n-1}\big)$

(since $a_{nn} - a_{n-1}'A_{n-1}^{-1}a_{n-1}$ is a scalar).11 Moreover, since the hypothesis of positive leading principle minors for A implies in particular that $\det(A) > 0$ and $\det(A_{n-1}) > 0$, we see from (A2.7.55) that

11 To gain some intuition for this determinant rule, observe simply that for the case of $n = 2$, we must have $\det\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = a_{11}a_{22} - a_{12}a_{21} = (a_{11})(a_{22} - a_{12}\,a_{11}^{-1}\,a_{21})$.


(A2.7.56)  $a_{nn} - a_{n-1}'A_{n-1}^{-1}a_{n-1} \;>\; 0$

Finally, by substituting (A2.7.50) into (A2.7.56), we may conclude that

(A2.7.57)  $0 \;<\; a_{nn} - (T_{n-1}h)'(T_{n-1}T_{n-1}')^{-1}(T_{n-1}h)$
           $\;=\; a_{nn} - h'\,T_{n-1}'\big[(T_{n-1}')^{-1}T_{n-1}^{-1}\big]T_{n-1}h$
           $\;=\; a_{nn} - h'\big[T_{n-1}'(T_{n-1}')^{-1}\big]\big[T_{n-1}^{-1}T_{n-1}\big]h$
           $\;=\; a_{nn} - h'h$

Thus (A2.7.54) must hold, and the result is established. ■

Remark: It should also be noted that this proof yields a recursive construction for $T$, and in particular shows that it is unique. This is obvious for $n = 1$, where $T = \sqrt{A}$ is the only possible choice. Moreover, by recursive use of the constructions in (A2.7.52) and (A2.7.53), one must obtain a unique extension $T$ for each $n > 1$. ■

As noted above, the most attractive feature of Cholesky decompositions is their ease of calculation. As mentioned in the text, this is easily accomplished with the command,

>> T = chol(A);

If this algorithm fails then one obtains the error message “Matrix must be positive
definite”. So by the Cholesky Theorem above, this procedure yields a practical test of
positive definiteness, which can be designated as the Cholesky Test. In summary, while
Sylvester’s Condition provides a useful test for relatively small matrices, such as
(A2.7.42), the calculation of principle minors is very time consuming for larger
matrices. Here the Cholesky Test is much faster and more practical.12 If the algorithm
succeeds, then the matrix is SPD, and otherwise, it is not.13
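One way to sketch such a Cholesky Test in code (not part of the text) is to use the two-output form of chol, which returns a flag instead of throwing an error:

% Minimal sketch of a "Cholesky Test" for positive definiteness.
A = [3 1/2; 1/2 4];                   % symmetric test matrix (illustrative values)
[T, p] = chol(A);                     % p == 0 exactly when A is positive definite
if p == 0
    disp('A is symmetric positive definite');
    disp(norm(T'*T - A));             % reconstruction error (essentially zero)
else
    disp('A is NOT positive definite');
end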

Calculation of Hessians

To see how these conditions can be applied in practice, it is instructive to analyze the
maximization example in (A2.7.4) and (A2.7.5). While in this simple case, the desired
Hessian can of course be calculated term by term (i.e., each cross partial derivative), for larger problems it is much more efficient to do the calculations in matrix terms. So it is
appropriate to see how this can be accomplished. To do so, we begin by rewriting the
gradient vector of first partial derivatives in expression (A2.6.4) in terms of our present
notation as follows

12
Even for n-square SPD matrices, A, as large as n  1000 , the MATLAB command, chol(A), produces
the unique Cholesky decomposition in about 0.03 seconds.
13
Care must be taken for “almost singular” SPD matrices, where rounding errors can sometimes lead to
failure. Methods of numerical analysis must then be used to check whether this is the case.

________________________________________________________________________
ESE 502 A2-25 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part II. Continuous Spatial Data Analysis
______________________________________________________________________________________

 f ( x) 
 x 
 1   f1 ( x) 
 
(A2.7.58) f ( x )        
   f ( x) 
 f ( x)   n 
 x 
 n 

This can be viewed as a vector of functions, fi ( x), i  1,.., n . Notice that in the Hessian
of (A2.7.33) the i-th column is just the gradient of the i-th function, fi ( x) , in (A2.7.58).
So if we now define the gradient of a vector of smooth functions, [ g ( x),..., h( x)] with
commons arguments, x  ( x1 ,.., xn ) , by

 g ( x)   g1 ( x)  h1 ( x) 
   
(A2.7.59)      [g ( x),..., h( x)]      
 h( x )   g ( x)  h ( x) 
   n n 

then the Hessian in (A2.7.33) is seen to be of the form,

(A2.7.60)  $H_f(x_0) \;=\; \nabla\big[f_1(x_0),...,f_n(x_0)\big] \;=\; \nabla\big[\nabla f(x_0)\big] \;=\; \nabla^2 f(x_0)$

As a second application of (A2.7.59), note that if the i-th row of a matrix, $A$, is denoted by $a_i'$, then the linear expression, $Ax$, can be written as a vector of linear functions as follows,

(A2.7.61)  $Ax \;=\; \begin{pmatrix} a_1' \\ \vdots \\ a_n' \end{pmatrix} x \;=\; \begin{pmatrix} a_1'x \\ \vdots \\ a_n'x \end{pmatrix}$

so that by (A2.7.59) and (A2.7.7),

(A2.7.62)  $\nabla(Ax) \;=\; \nabla\big[a_1'x,...,a_n'x\big] \;=\; \big[\nabla(a_1'x),...,\nabla(a_n'x)\big] \;=\; (a_1,...,a_n) \;=\; A'$

With these preliminaries, we can now reconsider the maximization of the general quadratic expression in (A2.7.5),

(A2.7.63)  $f(x) \;=\; c + b'x - x'Ax$

with $A$ assumed to be symmetric. Using (A2.7.58) through (A2.7.62), the Hessian matrix for this problem is now given by

(A2.7.64)  $H_f(x) \;=\; \nabla^2 f(x) \;=\; \nabla\big[\nabla(c + b'x - x'Ax)\big]$
           $\;=\; \nabla(b - 2Ax) \;=\; 0 - 2\,\nabla(Ax)$
           $\;=\; -2A$

Hence any point, $x_0$, satisfying the first-order condition for $f(x)$ will be a maximum if and only if the matrix $A$ is positive definite (so that the associated matrix, $-2A$, is negative definite). But for the specific maximum problem with parameters in (A2.7.4), we have already seen that the symmetrized matrix, $A_s$, in (A2.7.42) above is positive definite. Thus the unique point, $x_0 = (4, 2)$, satisfying the first-order conditions is indeed a maximum (which was already evident in Figure A2.9).

Finally, it is important to reconsider the mean squared error function in (A2.7.19) above, where it was shown in (A2.7.20) that the unique weight vector satisfying the first-order conditions for minimization of

(A2.7.65)  $MSE(\lambda_0) \;=\; \sigma^2 - 2\,c_0'\lambda_0 + \lambda_0' V_0\,\lambda_0$

was given by $\hat{\lambda}_0 = V_0^{-1}c_0$. We are now in a position to complete that analysis. If the Hessian for this function is denoted by $H_{MSE}$, then by recalling that every covariance matrix is symmetric, it follows that the same analysis as in (A2.7.64) now yields

(A2.7.66)  $H_{MSE}(\lambda_0) \;=\; \nabla^2 MSE(\lambda_0) \;=\; \nabla\big[\nabla(\sigma^2 - 2\,c_0'\lambda_0 + \lambda_0'V_0\lambda_0)\big]$
           $\;=\; \nabla(-2\,c_0 + 2\,V_0\lambda_0) \;=\; 0 + 2\,\nabla(V_0\lambda_0) \;=\; 2\,V_0$

Thus to ensure that $\hat{\lambda}_0 = V_0^{-1}c_0$ is the unique minimum of (A2.7.65), it remains only to show that $V_0$ is positive definite. In fact, it turns out that:

Positive Definiteness Property. Every (nonsingular) covariance matrix is


positive definite.

While we don’t yet have all the tools to show this fully, we can establish the most
essential part of this condition as follows. Recall from the covariance result in (3.2.21)
that for any random vector, X , with covariance matrix,   cov( X ) , the variance of
each linear compound, aX is given by var(aX )  aa . So it must certainly be true
that

(A2.7.67) aa  0 for all a  0


This condition is called positive semidefiniteness, and must be exhibited by every


covariance matrix. What remains to be shown is that for nonsingular covariance
matrices the inequality in (A2.6.67) is strict. Since this is a simple consequence of the
Spectral Decomposition Theorem (to be developed later), we postpone it for now.14

Non-Definite Hessians

Before proceeding to the case of constrained optimization, it is of interest to ask whether one can have stationary points that are neither maxima nor minima. An example for scalar functions was shown in Figure A2.8 above. But unlike this highly special case in one dimension, it turns out that such examples are quite common in higher dimensions. This is illustrated by the $(n = 2)$ example in Figure A2.11 below, where there exist two local maxima (the one on the right being the global maximum).

Figure A2.11. Saddle Point Example

However, there is seen to be a third point (shown in red) between these two local
maxima which also satisfies the first-order condition that the gradient be zero. Notice
also that movement from this point toward either maximum point must go “uphill”, so
that second derivative is positive in these directions. But movement orthogonal to these
directions leads “downhill” and hence yields negative second derivatives. At such
saddle point locations, the Hessian is neither positive nor negative definite. Note finally

14 This can actually be shown without the Spectral Decomposition Theorem. For a simple proof that positive semidefiniteness plus nonsingularity implies positive definiteness, see Horn and Johnson (1985, p.400).


that such saddle points are not rare. Indeed, whenever there are multiple maxima one
can expect to find intermediate saddle points.
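A tiny numerical illustration (not from the text) is the familiar saddle function $f(x_1, x_2) = x_1^2 - x_2^2$, whose gradient vanishes at $(0,0)$ but whose Hessian is indefinite:

% Minimal sketch: a stationary point that is neither a maximum nor a minimum.
% The saddle f(x1,x2) = x1^2 - x2^2 has gradient zero at (0,0),
% but its (constant) Hessian has eigenvalues of opposite sign.
H = [2 0; 0 -2];                      % Hessian of f(x1,x2) = x1^2 - x2^2
disp(eig(H)')                         % one positive, one negative => saddle point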

A2.7.3 Application to Ordinary Least Squares Estimation

Before considering constrained optimization problems, we consider one final application of the above concepts, namely to the least squares estimation of $\beta = (\beta_0, \beta_1,..,\beta_k)'$ in the classical linear regression model. Recall from (7.16) that the objective function is given by

(A2.7.68)  $SSD(\beta) \;=\; y'y - 2\,y'X\beta + \beta'X'X\beta$

and hence is seen to be a quadratic form very similar in nature to the mean squared error function, $MSE(\lambda_0)$, in (A2.7.19) above. Thus, as in (A2.7.20), we see from the symmetry of the matrix $X'X$ that the first-order condition for this minimization problem takes the form:

(A2.7.69)  $0 \;=\; \nabla SSD(\beta) \;=\; -2\,X'y + 2\,X'X\beta \;\;\Rightarrow\;\; X'X\beta \;=\; X'y$

But if it is assumed that there are no collinearities between the columns of $X$ (so that $X$ is of full column rank), then the $(k+1)$-square matrix, $X'X$, is nonsingular. Hence the unique solution to (7.17), designated as the ordinary least squares (OLS) estimator of $\beta$, is given by

(A2.7.70)  $\hat{\beta} \;=\; (X'X)^{-1}X'y$
The only question remaining is whether this yields a proper minimum. Here we can
answer this question definitively. In particular, recall first from (A2.7.64) that in this
case,
(A2.7.71) H SSD (  )  (2 X y  2 X X  )  2 X X

so that it remains only to show that $X'X$ is positive definite. But in the argument of
(A2.7.37) through (A2.7.40) above it was shown that for any nonsingular matrix, $B$,
the matrix $B'B$ is necessarily positive definite. Hence it is enough to observe that this
continues to hold as long as $B$ is of full column rank. For if it were true that
$0 = x'(B'B)x = (Bx)'Bx$ for some $x \neq 0$, then the same argument shows that

(A2.7.72)  $0 = Bx = (b_1,..,b_m)\begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix} = \sum_{j=1}^{m} x_j\, b_j$

which together with $x \neq 0$ implies the existence of a linear dependency (collinearity)
among the columns $(b_1,..,b_m)$ of $B$. Hence for any matrix of full column rank, such as
$X$, it follows that $X'X$ must be positive definite.
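
As a quick numerical illustration (not part of the text's development), the OLS estimator in (A2.7.70) and the positive definiteness of the Hessian in (A2.7.71) can be sketched in MATLAB with simulated data; all variable names here are hypothetical:

   n = 100;  k = 2;
   X = [ones(n,1), randn(n,k)];          % design matrix with intercept column
   beta_true = [1; 2; -0.5];
   y = X*beta_true + randn(n,1);         % simulated response
   beta_hat = (X'*X)\(X'*y);             % OLS estimator (A2.7.70)
   min(eig(2*(X'*X)))                    % Hessian (A2.7.71): smallest eigenvalue is positive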


A2.8 Constrained Optimization of Smooth Functions

As with the development of unconstrained optimization above, we shall be concerned


here with those cases of constrained optimization that are relevant for the applications
in the text. Hence we consider only linear equality constraints, where the optimum will
again be seen to be characterized by appropriate “tangency” conditions.

To motivate the main ideas, we again begin with a two-dimensional example in which
the relevant tangency conditions can be depicted graphically. For ease of visualization,
it is convenient to switch to a minimization problem. So consider minimizing the
quadratic objective function defined for each $x = (x_1, x_2)'$ by,

(A2.8.1)  $f(x) = c + b'x + x'Ax$

with $c = 20$, $b' = (1, 2)$ and

(A2.8.2)  $A = \begin{pmatrix} 25 & 1 \\ 1 & 15 \end{pmatrix}$

As in (A2.7.18), this function has a unique stationary point,

(A2.8.3)  $x^* = -\tfrac{1}{2}A^{-1}b = (-0.017, -0.066)$,

in the negative quadrant. Moreover, since $A$ is seen by inspection to be symmetric
positive definite [with its two leading principal minors, $\det(25) = 25$ and
$\det(A) = 25\cdot 15 - 1$, both positive], it follows as in (A2.7.64) that $H_f(x^*) = 2A$ is
positive definite, and hence that $x^*$ is a global minimum. This function is depicted in
Figure A2.12 below [where again for visual convenience the origin $(0,0)$ has been
placed at the back corner of the figure]. The global minimum point, $x^*$, is out of view,
since it is not the relevant minimum for our present purposes.

A2.8.1 Minimization with a Single Constraint

In particular, we now suppose that feasible values of $x$ for this minimization problem
are also required to satisfy a linear constraint of the following form,

(A2.8.4)  $d'x = \alpha$

with $d' = (5, 4)$ and $\alpha = 13$. In other words, the only relevant values of $x$ for this
problem are those lying on the blue line shown in Figure A2.12.


[Figure: surface of $f(x)$ over the $(x_1, x_2)$ plane, with the constraint line $d'x = \alpha$ and the constrained minimum point $x_0$ marked]

Figure A2.12. Constrained Minimization Example

To put this problem in more standard form, let the function $g(x)$ be defined by

(A2.8.5)  $g(x) = d'x$

so that (A2.8.4) is equivalent to the condition that $g(x) = \alpha$. In these terms, the present
problem is formally a constrained minimization problem of the form,

(A2.8.6)  minimize: $f(x)$   subject to: $g(x) = \alpha$

To solve this problem, observe next that [in a manner similar to Figure A2.5 (and
Figure A2.11) above] the contours of the function $f(x)$ are shown on the $(x_1, x_2)$ plane
in Figure A2.12. Moreover, we know from (A2.8.3) above that this function decreases
toward its global minimum, $x^*$, in the negative quadrant. So the lowest contour
touching the blue line in Figure A2.12 clearly defines the desired constrained minimum
point, $x_0$, solving problem (A2.8.6).


With these observations, the key question is how to identify this point analytically. Here
it is convenient to give a planar representation of these contours as in Figure A2.13
below [where the $(x_1, x_2)$ plane has now been rotated to place the origin in its more
natural position at the lower left corner of the figure].15

[Figure: contour plot of $f(x)$ in the $(x_2, x_1)$ plane, showing the constraint line $g(x) = \alpha$ through $x_0$, the gradient $\nabla g(x_0)$ (blue arrow), and the negative gradient $-\nabla f(x_0)$ (red arrow)]

Figure A2.13. Tangency Condition for Constrained Minimum

Here the solution point, $x_0$, is again identified by a tangency between the linear
constraint, $g(x) = \alpha$ (blue line), and the lowest contour of $f(x)$. But recall from
(A2.6.3) that the gradient, $\nabla f(x_0)$, of $f$ at $x_0$ must be orthogonal to this tangent line,
which by definition defines the directions of "no change" in $f$ at $x_0$. Recall also that
gradients point in the direction of maximum increase in $f$. But since we are here
interested in minimizing $f$, it is more appropriate to consider the (opposite) direction of
maximum decrease in $f$ at $x_0$, as given by the negative gradient, $-\nabla f(x_0)$. This
negative gradient is shown by the red arrow in Figure A2.13.

Similarly, since the blue tangent line is also a constant-value contour for the constraint
function, $g$ [i.e., the set of $x$ values where $g(x) = \alpha$], it then follows that the gradient,
$\nabla g(x_0)$, of $g$ at $x_0$ must be orthogonal to this same tangent line, as shown by the blue
arrow in Figure A2.13. [Since the positivity of the coefficient vector, $d$, in this case

15. Note also that for compatibility with Figure A2.12, the horizontal axis is $x_2$ rather than $x_1$.


implies that the function, $g(x) = d'x$, is increasing in $x$, this gradient points toward
higher values of $x$.]

Finally, since there is only a single line in the plane that is orthogonal to this blue line at $x_0$,
it follows that the two gradients $-\nabla f(x_0)$ and $\nabla g(x_0)$ must both lie on this same line,
i.e., must be collinear. Since this implies that $-\nabla f(x_0)$ and $\nabla g(x_0)$ must be scalar
multiples of one another, the fundamental tangency condition in Figure A2.13 implies
that for some scalar, $\lambda_0$, it must be true that $-\nabla f(x_0) = \lambda_0 \nabla g(x_0)$, or equivalently that

(A2.8.7)  $\nabla f(x_0) + \lambda_0 \nabla g(x_0) = 0$

Algebraically, this two-dimensional tangency condition yields two equations in three
unknowns, namely $x_0 = (x_{01}, x_{02})$ together with $\lambda_0$. However, since $x_0$ must lie on the
blue line, it is also required that

(A2.8.8)  $g(x_0) = \alpha$

This equation system allows all unknowns to be solved for. But before doing so, it is
important to note that while the above derivation is geometrical in nature, and hence
can be illustrated graphically, there is a mathematically more powerful way of deriving
the same conditions. In particular, if we now combine the functions, $f$ and $g$, into a
single function of the form

(A2.8.9)  $L(x, \lambda) = f(x) + \lambda\,[\,g(x) - \alpha\,]$

then this augmented function, called the Lagrangian function, actually yields conditions
(A2.8.7) and (A2.8.8) as first-order conditions. In particular, if for any function, $h(y,z)$,
of vectors, $y = (y_1,..,y_k)$ and $z = (z_1,..,z_m)$, we write the gradients of $h$ with respect to
$y$ and $z$ as,

(A2.8.10)  $\nabla_y h(y,z) = \begin{pmatrix} \partial_{y_1} h(y,z) \\ \vdots \\ \partial_{y_k} h(y,z) \end{pmatrix}$  and  $\nabla_z h(y,z) = \begin{pmatrix} \partial_{z_1} h(y,z) \\ \vdots \\ \partial_{z_m} h(y,z) \end{pmatrix}$

respectively, then it follows from (A2.8.9) that

(A2.8.11)  $\nabla_x L(x,\lambda) = \nabla f(x) + \lambda\,\nabla g(x)$

(A2.8.12)  $\nabla_\lambda L(x,\lambda) = g(x) - \alpha$


So (A2.8.7) and (A2.8.8) are seen to be precisely the first-order conditions of $L$ with
respect to $(x, \lambda)$ evaluated at $(x_0, \lambda_0)$, i.e.,

(A2.8.13)  $0 = \nabla_x L(x_0, \lambda_0) = \nabla f(x_0) + \lambda_0 \nabla g(x_0)$

(A2.8.14)  $0 = \nabla_\lambda L(x_0, \lambda_0) = g(x_0) - \alpha$

This is no coincidence, and in fact provides a general way of "converting" constrained
optimization problems into larger-dimensional unconstrained problems. Here the
original arguments, $x$, are augmented to $(x, \lambda)$, where the dimension of $\lambda = (\lambda_1,..,\lambda_k)$
corresponds precisely to the number of constraints imposed on the optimization
problem. These unknown scalars, known as Lagrange multipliers, play the same
geometric role as in our one-constraint example above.

We shall consider a general Lagrangian problem of this type below. But for the present,
it is instructive to complete the solution of our particular example. First, recall from
expressions (A2.8.1) and (A2.8.5) that (A2.8.9) can be written more explicitly as
follows:

(A2.8.15)  $L(x,\lambda) = (c + b'x + x'Ax) + \lambda\,(d'x - \alpha)$

Hence by employing the gradient identities in (A2.7.7) and (A2.7.11) together with
(A2.8.11) and (A2.8.12), we see that (A2.8.13) and (A2.8.14) take the explicit form:

(A2.8.16)  $0 = \nabla_x L(x_0, \lambda_0) = b + 2Ax_0 + \lambda_0 d$

(A2.8.17)  $0 = \nabla_\lambda L(x_0, \lambda_0) = d'x_0 - \alpha$

But by the nonsingularity of $A$ we can solve (A2.8.16) for $x_0$ as follows:

(A2.8.18)  $2Ax_0 = -(\lambda_0 d + b) \;\;\Rightarrow\;\; x_0 = -\tfrac{1}{2}A^{-1}(\lambda_0 d + b)$

Condition (A2.8.17) then yields the following explicit solution for $\lambda_0$,

(A2.8.19)  $\alpha = d'x_0 = -\tfrac{1}{2}\,d'A^{-1}(\lambda_0 d + b) \;\;\Rightarrow\;\; -2\alpha = \lambda_0\,(d'A^{-1}d) + d'A^{-1}b$

$\qquad\;\;\Rightarrow\;\; \lambda_0 = -\left(\dfrac{2\alpha + d'A^{-1}b}{d'A^{-1}d}\right)$

Finally, substitution of (A2.8.19) into (A2.8.18) yields the following explicit solution
for $x_0$:

(A2.8.20)  $x_0 = \tfrac{1}{2}A^{-1}\left[\left(\dfrac{2\alpha + d'A^{-1}b}{d'A^{-1}d}\right)d - b\right]$

Substitution of the values, $c = 20$, $b' = (1, 2)$, $\alpha = 13$, $d' = (5, 4)$, together with $A$ in
(A2.8.2) yields the final solution

(A2.8.21)  $x_0 = (1.2721,\; 1.6599)$

which is seen to correspond to the graphical solution shown in Figure A2.13.
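
A minimal numerical check of this solution (using the example values above; all MATLAB variable names are hypothetical) is sketched below:

   A = [25 1; 1 15];   b = [1; 2];   d = [5; 4];   alpha = 13;
   lam0 = -(2*alpha + d'*(A\b)) / (d'*(A\d));   % Lagrange multiplier (A2.8.19)
   x0   = -0.5*(A\(lam0*d + b));                % constrained minimum (A2.8.18)/(A2.8.20)
   disp(x0')       % approximately (1.2721, 1.6599), as in (A2.8.21)
   disp(d'*x0)     % constraint check: equals 13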

Solution for Ordinary Kriging

Finally, we apply these results to the case of ordinary kriging. Here we proceed in two
steps. First we derive a BLU estimator for the unknown mean parameter, $\mu$, and then
use this to interpret the solution to the optimal weight vector problem. Turning first to
the BLU estimator for $\mu$, recall from expression (6.3.7) of the text that the optimal
coefficient vector, $\hat{a}$, is given by the solution of the constrained minimization problem:

(A2.8.22)  minimize: $a'Va$   subject to: $a'1_n = 1$

This is seen to be a special case of the constrained minimization problem in (A2.8.15)
with $(c = 0,\, b = 0,\, A = V,\, \alpha = 1,\, d = 1_n)$. Hence by setting $x_0 = \hat{a}$ in (A2.8.20) and
making these appropriate substitutions, it follows that the unique optimal coefficient
vector is given by

(A2.8.23)  $\hat{a} = \tfrac{1}{2}V^{-1}\left[\left(\dfrac{2 + (0)}{1_n'V^{-1}1_n}\right)1_n - (0)\right] = \left(\dfrac{1}{1_n'V^{-1}1_n}\right)V^{-1}1_n$

This in turn implies that the unique BLU estimator, $\hat{\mu}_n$, of $\mu$ given sample vector $Y$ is
given by

(A2.8.24)  $\hat{\mu}_n = \hat{a}'Y = \left(\dfrac{1}{1_n'V^{-1}1_n}\right)1_n'V^{-1}Y = \dfrac{1_n'V^{-1}Y}{1_n'V^{-1}1_n}$

Turning next to the problem of determining a BLU predictor of $Y_0 = Y(s_0)$, recall from
expression (6.3.18) in the text that the desired weight vector, $\hat{\lambda}_0$, solves the constrained
minimization problem:

(A2.8.25)  minimize: $\sigma^2 - 2c_0'\lambda_0 + \lambda_0'V_0\lambda_0$   subject to: $1_{n_0}'\lambda_0 = 1$


But this is again a special case of the constrained minimization problem in (A2.8.15)
with $(c = \sigma^2,\, b = -2c_0,\, A = V_0,\, \alpha = 1,\, d = 1_{n_0})$. Hence by now setting $x_0 = \hat{\lambda}_0$ in
(A2.8.20), it follows that

(A2.8.26)  $\hat{\lambda}_0 = \tfrac{1}{2}V_0^{-1}\left[\left(\dfrac{2 - 2\,1_{n_0}'V_0^{-1}c_0}{1_{n_0}'V_0^{-1}1_{n_0}}\right)1_{n_0} + 2c_0\right]$

$\qquad\;\; = \left(\dfrac{1 - 1_{n_0}'V_0^{-1}c_0}{1_{n_0}'V_0^{-1}1_{n_0}}\right)V_0^{-1}1_{n_0} + V_0^{-1}c_0$

Hence the desired BLU predictor of $Y_0$ is given by

(A2.8.27)  $\hat{Y}_0 = \hat{\lambda}_0'Y = \left(\dfrac{1 - 1_{n_0}'V_0^{-1}c_0}{1_{n_0}'V_0^{-1}1_{n_0}}\right)1_{n_0}'V_0^{-1}Y + c_0'V_0^{-1}Y$

For purposes of interpreting this expression, observe that since $1_{n_0}'V_0^{-1}c_0 = c_0'V_0^{-1}1_{n_0}$, we
may rewrite (A2.8.27) as

(A2.8.28)  $\hat{Y}_0 = \left(\dfrac{1_{n_0}'V_0^{-1}Y}{1_{n_0}'V_0^{-1}1_{n_0}}\right) + c_0'V_0^{-1}Y - c_0'V_0^{-1}1_{n_0}\left(\dfrac{1_{n_0}'V_0^{-1}Y}{1_{n_0}'V_0^{-1}1_{n_0}}\right)$

By using (A2.8.24), this expression may then be simplified, as is done in expression
(6.3.21) of the text.
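
For concreteness, a minimal MATLAB sketch of these Ordinary Kriging formulas is given below. It assumes that the full covariance matrix V, the local covariance matrix V0, the covariance vector c0, the full sample vector Y, and the local sample vector Y0data have already been constructed; these names are hypothetical, and the construction itself is covered in the text.

   one_n  = ones(size(V,1),1);
   a_hat  = (V\one_n) / (one_n'*(V\one_n));      % BLU coefficient vector (A2.8.23)
   mu_hat = a_hat'*Y;                            % BLU estimate of mu (A2.8.24)
   one_n0 = ones(size(V0,1),1);
   lam0 = ((1 - one_n0'*(V0\c0)) / (one_n0'*(V0\one_n0))) * (V0\one_n0) + V0\c0;  % (A2.8.26)
   Y0hat = lam0'*Y0data;                         % BLU predictor of Y0 (A2.8.27)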

A2.8.2 Minimization with Multiple Constraints

Given the results above for a single constraint, we now proceed to the case of multiple
constraints. For purposes of illustration we begin with the case of two (linear)
constraints on functions of three variables, $f(x) = f(x_1, x_2, x_3)$, where it is still possible to
obtain some geometric intuition. As an extension of (A2.8.6) we thus consider the
following constrained minimization problem:

(A2.8.29)  minimize: $f(x)$   subject to: $\begin{pmatrix} g_1(x) \\ g_2(x) \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \alpha_2 \end{pmatrix}$

where $x = (x_1, x_2, x_3) \in \mathbb{R}^3$. Paralleling Figures A2.12 and A2.13 above, the solution
conditions for this problem are shown schematically in Figures A2.14 and A2.15 below.


[Figure: the constraint planes $g_1(x) = \alpha_1$ and $g_2(x) = \alpha_2$ in $(x_1, x_2, x_3)$-space, their one-dimensional intersection, the contour surface $f(x) = f_0$ tangent at $x_0$, and the negative gradient $-\nabla f(x_0)$]

Figure A2.14. Constrained Tangency Condition

[Figure: the vectors $\lambda_{01}\nabla g_1(x_0)$, $\lambda_{02}\nabla g_2(x_0)$, and $-\nabla f(x_0)$ at $x_0$ in $(x_1, x_2, x_3)$-space]

Figure A2.15. Constrained Gradient Condition


To compare these figures with the single constraint case above, we start by restricting
attention to the $x$-space in Figure A2.12, i.e., the $(x_1, x_2)$-plane. Recall that the single
linear constraint corresponds to the blue line in this plane, and the critical tangency
condition for a minimum is shown in terms of the contour representation of $f(x)$ on
this plane. The situation in Figure A2.14 is conceptually the same, except that the $x$-
space is now three dimensional. Here the two linear constraints, $g_1(x) = \alpha_1$ and
$g_2(x) = \alpha_2$, are shown, respectively, by the blue and black planes in this space. Note
that these planes constitute constant-value contour surfaces for the functions $g_1$ and
$g_2$. Hence, like Figure A2.12, the constraint space defined by the intersection of these
two planes is again one dimensional, as shown by the heavy blue line. With respect to
the objective function, $f(x) = f(x_1, x_2, x_3)$, constant-value contour surfaces in this
space are curvilinear. Hence for visual clarity, only the single contour surface,
$f(x) = f(x_0) \equiv f_0$, tangent to the constraint space at point $x_0$, is shown. As in Figure
A2.13, the negative gradient vector, $-\nabla f(x_0)$, at $x_0$ must be orthogonal to the
constraint space, as shown by the red arrows in both Figures A2.13 and A2.14. So the
tangency conditions in these two cases are seen to be conceptually the same.

Turning next to the relation between this gradient vector and those for the constraints,
recall that in Figure A2.13 the single gradient vector, $\nabla g(x_0)$, was also orthogonal to
the constraint space as defined by a constant-value contour of $g$. Moreover, since all
vectors orthogonal to this constraint line at $x_0$ must necessarily be collinear, this in turn
implied that $-\nabla f(x_0)$ must be a scalar multiple of $\nabla g(x_0)$. But in higher dimensions
this is no longer true. In the present case, the set of vectors orthogonal to the blue line at
$x_0$ must define a plane (not shown), which is called the orthogonal complement of this
line at $x_0$. So all that can be said is that these three gradient vectors, $-\nabla f(x_0)$,
$\nabla g_1(x_0)$ and $\nabla g_2(x_0)$, must all lie in this plane. But assuming that the two constraint
planes [$g_1(x) = \alpha_1$ and $g_2(x) = \alpha_2$] have a well-defined linear intersection (and hence
are not parallel), it follows that $\nabla g_1(x_0)$ and $\nabla g_2(x_0)$ cannot themselves be collinear.
Hence they must span this plane, which means that every vector in the plane can be
written as a unique linear combination of $\nabla g_1(x_0)$ and $\nabla g_2(x_0)$. In particular, this
implies that for the negative gradient vector, $-\nabla f(x_0)$, there must exist unique scalars,
$\lambda_{01}$ and $\lambda_{02}$, such that $-\nabla f(x_0) = \lambda_{01}\nabla g_1(x_0) + \lambda_{02}\nabla g_2(x_0)$, or equivalently,

(A2.8.30)  $\nabla f(x_0) + \lambda_{01}\nabla g_1(x_0) + \lambda_{02}\nabla g_2(x_0) = 0$

as shown in Figure A2.15. This is the fundamental constrained gradient condition that
generalizes (A2.8.7) for the single-constraint case. Hence, as an extension of (A2.8.9),
if we now consider the Lagrangian function:

(A2.8.31)  $L(x, \lambda_1, \lambda_2) = f(x) + \lambda_1\,[\,g_1(x) - \alpha_1\,] + \lambda_2\,[\,g_2(x) - \alpha_2\,]$


with first-order conditions

(A2.8.32)  $0 = \nabla_x L(x_0, \lambda_{01}, \lambda_{02}) = \nabla f(x_0) + \lambda_{01}\nabla g_1(x_0) + \lambda_{02}\nabla g_2(x_0)$

(A2.8.33)  $0 = \nabla_{\lambda_1} L(x_0, \lambda_{01}, \lambda_{02}) = g_1(x_0) - \alpha_1$

(A2.8.34)  $0 = \nabla_{\lambda_2} L(x_0, \lambda_{01}, \lambda_{02}) = g_2(x_0) - \alpha_2$

then it is clear that the minimum for this function satisfies both the constrained gradient
condition in (A2.8.30) together with the two constraints in (A2.8.29).

The extension of this programming problem to objective functions, $f(x) = f(x_1,..,x_n)$,
in $n$ dimensions with $k$ equality constraints is a straightforward generalization of the
geometric representations in Figures A2.14 and A2.15. In particular, if for any k-vector
of constraint functions,

(A2.8.35)  $G(x) = \begin{pmatrix} g_1(x) \\ \vdots \\ g_k(x) \end{pmatrix}$,   $x = (x_1,..,x_n) \in \mathbb{R}^n$

(with $k < n$) and corresponding constants, $\alpha = (\alpha_1,..,\alpha_k)'$, we consider the constrained
minimization problem:

(A2.8.36)  minimize: $f(x)$   subject to: $G(x) = \alpha$

then letting $\lambda = (\lambda_1,..,\lambda_k)'$ denote a vector of Lagrange multipliers, we may again form
the corresponding Lagrangian function,

(A2.8.37)  $L(x,\lambda) = f(x) + \sum_{j=1}^{k} \lambda_j\,[\,g_j(x) - \alpha_j\,] = f(x) + \lambda'[\,G(x) - \alpha\,]$

Hence by employing (A2.7.58), (A2.7.59) and (A2.8.10), it follows that a minimizing
pair, $(x_0, \lambda_0)$, is now characterized by the first-order conditions:

(A2.8.38)  $0 = \nabla_x L(x_0, \lambda_0) = \nabla f(x_0) + \sum_{j=1}^{k} \lambda_{0j}\,\nabla g_j(x_0)$

$\qquad = \nabla f(x_0) + [\,\nabla g_1(x_0),..,\nabla g_k(x_0)\,]\begin{pmatrix} \lambda_{01} \\ \vdots \\ \lambda_{0k} \end{pmatrix}$

$\qquad = \nabla f(x_0) + \nabla G(x_0)\,\lambda_0$


and,

(A2.8.39)  $0 = \nabla_\lambda L(x_0, \lambda_0) = G(x_0) - \alpha$

In terms of Figures A2.14 and A2.15, condition (A2.8.38) again reflects the
constrained gradient condition that the negative gradient, $-\nabla f(x_0)$, be a linear
combination of the constraint gradients. As a generalization of the constraint space in
these figures (with dimension $3 - 2 = 1$), it is implicitly assumed here that the relevant
constraint set (i.e., the intersection of $k$ constraint surfaces) is a well-defined surface of
dimension $n - k$, so that the orthogonal complement to this surface at $x_0$ has dimension
$k$. This is equivalent to assuming that the constraint gradients are linearly independent.
If so, then they must span this complement, so that (A2.8.38) must hold for some
unique vector of multipliers, $\lambda_0 = (\lambda_{01},..,\lambda_{0k})'$.

Our objective is to apply this general formulation to the case of quadratic objective
functions

(A2.8.40)  $f(x) = c + b'x + x'Ax$

on $\mathbb{R}^n$ with linear constraints,

(A2.8.41)  $Dx = \begin{pmatrix} d_1'x \\ \vdots \\ d_k'x \end{pmatrix} = \begin{pmatrix} \alpha_1 \\ \vdots \\ \alpha_k \end{pmatrix} = \alpha$

where the above constrained gradient condition is guaranteed to hold as long as these $k$
constraints are linearly independent (i.e., $D$ is of full row rank, $k$). Here the
minimization problem in (A2.8.36) takes the form:

(A2.8.42)  minimize: $c + b'x + x'Ax$   subject to: $Dx = \alpha$

with associated Lagrangian in (A2.8.37) of the form

(A2.8.43)  $L(x,\lambda) = [\,c + b'x + x'Ax\,] + \lambda'(Dx - \alpha)$

Assuming that $A$ is symmetric positive definite, this problem always has a unique
solution, $(x_0, \lambda_0)$, which is characterized by the first-order conditions,

(A2.8.44)  $0 = \nabla_x L(x_0, \lambda_0) = [\,b + 2Ax_0\,] + D'\lambda_0$

(A2.8.45)  $0 = \nabla_\lambda L(x_0, \lambda_0) = Dx_0 - \alpha$


which are seen to reduce precisely to (A2.8.16) and (A2.8.17) for the case of a single
constraint. Hence the solution is quite similar. Again we start by using the
nonsingularity of $A$ to solve for $x_0$ in (A2.8.44) as

(A2.8.46)  $2Ax_0 = -(D'\lambda_0 + b) \;\;\Rightarrow\;\; x_0 = -\tfrac{1}{2}A^{-1}(D'\lambda_0 + b)$,

and then use (A2.8.46) to solve for $\lambda_0$:

(A2.8.47)  $\alpha = Dx_0 = -\tfrac{1}{2}DA^{-1}(D'\lambda_0 + b) \;\;\Rightarrow\;\; -2\alpha = (DA^{-1}D')\lambda_0 + DA^{-1}b$

$\qquad\;\;\Rightarrow\;\; \lambda_0 = -(DA^{-1}D')^{-1}(DA^{-1}b + 2\alpha)$

Substitution of (A2.8.47) into (A2.8.46) then yields the following solution for $x_0$:

(A2.8.48)  $x_0 = \tfrac{1}{2}A^{-1}\left[\,D'(DA^{-1}D')^{-1}(DA^{-1}b + 2\alpha) - b\,\right]$
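
Since this pair of formulas is used repeatedly below, a small MATLAB function implementing (A2.8.46)-(A2.8.48) may be a useful sketch (the function name and arguments are hypothetical; it simply assumes A is symmetric positive definite and D has full row rank):

   function [x0, lam0] = qp_eq_min(A, b, D, alpha)
   % Minimize c + b'x + x'Ax subject to Dx = alpha (the constant c is irrelevant).
   lam0 = -((D*(A\D')) \ (D*(A\b) + 2*alpha));   % Lagrange multipliers (A2.8.47)
   x0   = -0.5*(A \ (D'*lam0 + b));              % constrained minimizer (A2.8.46)/(A2.8.48)
   end

For example, the single-constraint example of Section A2.8.1 is recovered by the call [x0,lam0] = qp_eq_min(A, b, d', 13).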

A2.8.3 Solution for Universal Kriging

We now apply these results to the case of Universal Kriging. As with Ordinary Kriging
above, we proceed in two steps. Given the linear model

(A2.8.49)  $Y = X\beta + \varepsilon$,  $\varepsilon \sim N(0, V)$

we first determine the unique BLU estimator of $\beta$, and then use this to interpret the
solution of the optimal weight vector problem. But in this case, the first step is of
major interest in itself, and in fact yields an important characterization of Generalized
Least Squares estimation.

Best Linear Unbiased Estimation of $\beta$

Here we proceed to show that the GLS estimator for $\beta$ as developed in Section 7.1.2 of
the text is a BLU estimator as defined there. Moreover, since this argument is required
to hold for all possible linear compounds, $a \in \mathbb{R}^{k+1}$, it suffices to pick a representative
compound, $a$, and consider the problem of finding that estimator of $\beta$ in the set of
linear unbiased estimators,

(A2.8.50)  $LU_a(\beta) = \left\{\, \tilde{\beta} = \tilde{\beta}(X,V,Y) : [\, a'\tilde{\beta} = \lambda'Y \,] \;\&\; [\, E(a'\tilde{\beta}) = a'\beta \,] \,\right\}$

with smallest variance. The solution to this problem will show that this estimator is
always given by the GLS estimator,

(A2.8.51)  $\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}Y$


To do so we can construct the appropriate constrained minimization problem as
follows.16 If we choose any estimator, $\tilde{\beta} \in LU_a(\beta)$, then linearity requires that for some
weight vector, $\lambda$ [which may depend on $(a, X, V)$], we must have

(A2.8.52)  $a'\tilde{\beta} = \lambda'Y$

Moreover, the unbiasedness condition requires that

(A2.8.53)  $a'\beta = E(a'\tilde{\beta}) = E(\lambda'Y) = \lambda'E(Y) = \lambda'X\beta$

But this can only hold for all possible values of $\beta$ if $a' = \lambda'X$, or equivalently,

(A2.8.54)  $X'\lambda = a$

Moreover, since the variance of $\lambda'Y$ is given by

(A2.8.55)  $\text{var}(\lambda'Y) = \lambda'\,\text{cov}(Y)\,\lambda = \lambda'\,\text{cov}(\varepsilon)\,\lambda = \lambda'V\lambda$

it follows that the weight vector, $\lambda$, of the desired BLU estimator must solve the
constrained minimization problem:

(A2.8.56)  minimize: $\lambda'V\lambda$   subject to: $X'\lambda = a$

But since this is the special case of (A2.8.42) with $(c = 0,\, b = 0,\, A = V,\, D = X',\, \alpha = a)$, it
follows from (A2.8.48) that the optimal value of $\lambda$ for compound $a$ is given by

(A2.8.57)  $\lambda_a = \tfrac{1}{2}V^{-1}\left[\,X(X'V^{-1}X)^{-1}(0 + 2a) - 0\,\right] = V^{-1}X(X'V^{-1}X)^{-1}a$

and hence that the corresponding linear estimator in (A2.8.50), say $\tilde{\beta}_a$, satisfies

(A2.8.58)  $a'\tilde{\beta}_a = \lambda_a'Y = a'(X'V^{-1}X)^{-1}X'V^{-1}Y = a'\hat{\beta}$

Finally, since this holds identically for all linear compounds, $a$, we see that the unique
estimator satisfying all these conditions is given precisely by the GLS estimator. To
make this precise, observe that by setting $a$ equal to the $i$th column, $e_i$, of $I_{k+1}$ for each
$i = 1,..,k+1$ [as in (3.2.16) of the text], it must follow from (A2.8.58) that

(A2.8.59)  $(\tilde{\beta}_{e_i})_i = e_i'\tilde{\beta}_{e_i} = e_i'\hat{\beta} = \hat{\beta}_i\,,\quad i = 1,..,k+1$

16. Our present approach is based on the development in Searle (1971, Section 3.3.d).


and hence that all components of $\hat{\beta}$ are uniquely identified by these particular choices
of $a$.

Finally, it should be noted that this result is usually referred to as the Gauss-Markov
Theorem in the literature.17 The above constrained minimization approach thus yields a
constructive proof of this theorem.
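
As an aside (not from the text), the GLS estimator in (A2.8.51) is easy to compute directly. A small MATLAB sketch with simulated data and hypothetical names:

   n = 50;
   X = [ones(n,1), randn(n,2)];                 % design matrix
   [I,J] = meshgrid(1:n);
   V = 0.5*eye(n) + 0.5*exp(-abs(I-J)/5);       % an illustrative positive definite covariance
   beta = [2; 1; -1];
   Y = X*beta + chol(V,'lower')*randn(n,1);     % simulated Y with cov(eps) = V
   beta_gls = (X'*(V\X)) \ (X'*(V\Y));          % GLS estimator (A2.8.51)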

Best Linear Unbiased Prediction of Y(s0)

Next we derive the solution of the constrained minimization problem for Universal
Kriging in expression (7.2.12) of the text:

(A2.8.60)  minimize: $\sigma^2 - 2c_0'\lambda_0 + \lambda_0'V_0\lambda_0$   subject to: $X_0'\lambda_0 = x_0$

Since this is now seen to be an instance of the general constrained minimization
problem (A2.8.42) with $(c = \sigma^2,\, b = -2c_0,\, A = V_0,\, D = X_0',\, \alpha = x_0)$, it follows from
(A2.8.48) that

(A2.8.61)  $\hat{\lambda}_0 = \tfrac{1}{2}V_0^{-1}\left[\,X_0(X_0'V_0^{-1}X_0)^{-1}(-2X_0'V_0^{-1}c_0 + 2x_0) + 2c_0\,\right]$

$\qquad\;\; = V_0^{-1}X_0(X_0'V_0^{-1}X_0)^{-1}(x_0 - X_0'V_0^{-1}c_0) + V_0^{-1}c_0$

Hence the BLU predictor of $Y_0$ is given by

(A2.8.62)  $\hat{Y}_0 = \hat{\lambda}_0'Y = (x_0 - X_0'V_0^{-1}c_0)'(X_0'V_0^{-1}X_0)^{-1}X_0'V_0^{-1}Y + c_0'V_0^{-1}Y$

$\qquad\;\; = (x_0' - c_0'V_0^{-1}X_0)(X_0'V_0^{-1}X_0)^{-1}X_0'V_0^{-1}Y + c_0'V_0^{-1}Y$

$\qquad\;\; = x_0'(X_0'V_0^{-1}X_0)^{-1}X_0'V_0^{-1}Y + c_0'V_0^{-1}\left[\,Y - X_0(X_0'V_0^{-1}X_0)^{-1}X_0'V_0^{-1}Y\,\right]$
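
In computational terms, the last line of (A2.8.62) says that the predictor is the local GLS trend value plus a covariance-weighted correction based on the GLS residuals. A minimal MATLAB sketch, assuming the prediction-site quantities x0, c0, X0, V0 and the local sample vector Y have already been built (hypothetical names):

   b0    = (X0'*(V0\X0)) \ (X0'*(V0\Y));      % local GLS estimate of beta
   Y0hat = x0'*b0 + c0'*(V0\(Y - X0*b0));     % BLU predictor (A2.8.62)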

Standard Error of Prediction

Finally, to determine the prediction error variance for Universal Kriging, one must
substitute $\hat{\lambda}_0$ into the general expression for the prediction error variance [as given by
the objective function in (A2.8.60)], to obtain:

17. See for example Section 4.4 in Green (2003).


(A2.8.63)  $\hat{\sigma}_0^2 = \text{var}(e_0) = \sigma^2 - 2\,c_0'\hat{\lambda}_0 + \hat{\lambda}_0'V_0\hat{\lambda}_0$

To evaluate $\hat{\sigma}_0^2$, it is convenient to simplify the expression for $\hat{\lambda}_0$ in (A2.8.61) as
follows. If we now let

(A2.8.63)  $\Gamma_0 = (X_0'V_0^{-1}X_0)^{-1}$, and

(A2.8.64)  $\gamma_0 = x_0 - X_0'V_0^{-1}c_0$,

then $\hat{\lambda}_0$ can be written as

(A2.8.65)  $\hat{\lambda}_0 = V_0^{-1}X_0\,\Gamma_0\,\gamma_0 + V_0^{-1}c_0$

The second term in (A2.8.63) then becomes

(A2.8.66)  $-2\,c_0'\hat{\lambda}_0 = -2\,c_0'\left[\,V_0^{-1}X_0\,\Gamma_0\,\gamma_0 + V_0^{-1}c_0\,\right] = -2\,c_0'V_0^{-1}X_0\,\Gamma_0\,\gamma_0 - 2\,c_0'V_0^{-1}c_0$

and the third term becomes

(A2.8.67)  $\hat{\lambda}_0'V_0\hat{\lambda}_0 = (V_0^{-1}X_0\,\Gamma_0\,\gamma_0 + V_0^{-1}c_0)'\,V_0\,(V_0^{-1}X_0\,\Gamma_0\,\gamma_0 + V_0^{-1}c_0)$

$\qquad = (V_0^{-1}X_0\,\Gamma_0\,\gamma_0 + V_0^{-1}c_0)'\,(X_0\,\Gamma_0\,\gamma_0 + c_0)$

$\qquad = (\gamma_0'\,\Gamma_0\,X_0'V_0^{-1} + c_0'V_0^{-1})\,(X_0\,\Gamma_0\,\gamma_0 + c_0)$

$\qquad = \gamma_0'\,\Gamma_0\,(X_0'V_0^{-1}X_0)\,\Gamma_0\,\gamma_0 + \gamma_0'\,\Gamma_0\,X_0'V_0^{-1}c_0 + c_0'V_0^{-1}X_0\,\Gamma_0\,\gamma_0 + c_0'V_0^{-1}c_0$

But since the two center terms are the same, and since $(X_0'V_0^{-1}X_0)\,\Gamma_0 = I$ by (A2.8.63),
we see that,

(A2.8.68)  $\hat{\lambda}_0'V_0\hat{\lambda}_0 = \gamma_0'\,\Gamma_0\,\gamma_0 + 2\,\gamma_0'\,\Gamma_0\,X_0'V_0^{-1}c_0 + c_0'V_0^{-1}c_0$

Finally, by substituting (A2.8.66) and (A2.8.67) into (A2.8.63) and cancelling terms,
we obtain an explicit expression for the prediction error variance:

(A2.8.69)  $\hat{\sigma}_0^2 = \sigma^2 - c_0'V_0^{-1}c_0 + \gamma_0'\,\Gamma_0\,\gamma_0$

$\qquad = \sigma^2 - c_0'V_0^{-1}c_0 + (x_0 - X_0'V_0^{-1}c_0)'\,(X_0'V_0^{-1}X_0)^{-1}\,(x_0 - X_0'V_0^{-1}c_0)$


In addition, since it is clear from a comparison of (A2.8.25) and (A2.8.60) that
Ordinary Kriging is simply the special case of Universal Kriging in which $x_0 = 1$ and
$X_0 = 1_{n_0}$, it follows from (A2.8.69) that the prediction error variance for Ordinary Kriging is
given by

(A2.8.70)  $\hat{\sigma}_0^2 = \sigma^2 - c_0'V_0^{-1}c_0 + (1 - 1_{n_0}'V_0^{-1}c_0)'\,(1_{n_0}'V_0^{-1}1_{n_0})^{-1}\,(1 - 1_{n_0}'V_0^{-1}c_0)$

$\qquad = \sigma^2 - c_0'V_0^{-1}c_0 + \dfrac{(1 - 1_{n_0}'V_0^{-1}c_0)^2}{1_{n_0}'V_0^{-1}1_{n_0}}$

AREAL DATA ANALYSIS

1. Overview of Areal Data Analysis

The key difference between areal data and continuous data is basically in terms of the
form of the data itself. While continuous data involves point samples from a continuous
spatial distribution (such as temperature readings at various point locations), areal data
involves aggregated quantities for each areal unit within some relevant spatial partition
of a given region (such as census tracts within a city, or counties within a state). Such
differences are illustrated in Figures 1.1 and 1.2 below.

[Figure: region $R$ with four sample points $s_1,..,s_4$ (left) and a partition of $R$ into four areal units $R_1,..,R_4$ with central points $c_1,..,c_4$ (right)]

Figure 1.1. Point Samples          Figure 1.2. Areal Units

Here Figure 1.1 shows four sample points, $s_i$, in region, $R$ (which is qualitatively the
same as Figure 1.1 of Part II), and Figure 1.2 represents a partition of region, $R$, into four
areal units $\{R_1, R_2, R_3, R_4\}$. Such areal units, $R_i$, are often represented by appropriate
central locations, $c_i \in R_i$, such as major cities, or geometric "centroids" (to be defined
below).1 But the data values associated with these points represent summary measures for
the areal unit as a whole. For example, rather than measuring the temperature at location,
$c_i$, one could assign (an estimate of) the average temperature over all points in areal unit
$R_i$. More importantly, one can represent values that have no particular point locations at
all, such as the population of $R_i$ or the average income of all household units in $R_i$.

The practical significance of areal data for purposes of analysis is that most socio-
economic data comes in this form. For example, while individual income data is
generally regarded as proprietary in nature, such data is often made publicly available
in terms of averages (such as per capita income at the state or county level). More
generally, most publicly available data (such as US Census data) is only of this type.2

1. This type of representation in terms of point locations has led to the alternative description of areal data as
"lattice data", as for example in Cressie (1993, Section 6.1).
2. There are exceptions however, such as the Center for Economic Studies (CES) run by the Census Bureau,
which allows restricted access to individual micro data by qualified researchers.


As in Parts I and II above, it is appropriate to illustrate some of the key features of areal
data in terms of specific examples (again drawn from [BG, Part D]).

1.1 Extensive versus Intensive Data Representations

Areal data is most easily represented visually in terms of choropleth maps, such as the
child mortality data for each of the 167 Census Districts in the city of Auckland, New
Zealand over the nine year period from 1977 to 1985 (taken from [BG, pp.249, 300-
303]). Here we focus only on the populations “at risk”, i.e., children under the age of 5 in
each district. Two possible representations of this data are shown in Figures 1.3 and 1.4
below.

[Figure: two choropleth maps of the Auckland census districts, each marking the CBD, with 10-mile scale bars]

Figure 1.3. Raw Population Data          Figure 1.4. Population Density Data

The representation in Figure 1.3, which shows the actual number of children under 5 in
each district, appears to suggest that the most substantial concentration of these children
lies in districts to the southeast of the Central Business District (CBD). But it is important
to note that census districts are specifically designed to include roughly the same
population totals in each district. So the smaller districts around the CBD indicate that
population densities are much higher in this area.

An alternative representation of this population is given in Figure 1.4, which displays the
density of such children in each district, i.e., the number of children per square mile
(approx.). Here it is clear that the most dense concentrations are precisely in the smallest
districts, including the CBD. So this representation suggests (not surprisingly) that
children under five are in fact quite evenly spread throughout the population as a whole.


This example serves to underscore the fact that the distribution of areal data is usually
more accurately represented in terms of density values. More generally, representations
in terms of actual data totals (such as population counts) are designated as extensive
representations of areal data, and representations in terms of densities (such as population
densities) are designated as intensive representations of areal data. The key difference is
that intensive representations allow more direct comparisons between values in each areal
unit. For example, "population per square mile" has the same meaning everywhere, and is
independent of the size of each areal unit.3

1.2 Spatial Pattern Analysis

The above example also demonstrates that when intensive data representations are used,
choropleth maps can serve to reveal meaningful patterns in areal data. In this case,
children under five (and indeed all people) are more concentrated around the CBD than in
outlying areas. But there are many more interesting pattern examples than this.

One example of comparative pattern analysis is provided by the Chinese socio-economic
data in [BG, pp.249-250]. Figures 1.5 and 1.6 below show the shift in per capita gross
domestic product (pcGDP) in the provinces of China from 1984 to 1994. (Note again that
this data is in intensive form, where "per capita" indicates that the relevant units of
comparison here are "individuals" rather than "square miles".)4

[Figure: two choropleth maps of the Chinese provinces with 500-mile scale bars]

Figure 1.5. 1984 Per Capita GDP          Figure 1.6. 1994 Per Capita GDP

Here it is clear at a glance that the coastal region of China has been the high-growth area.
Statistical analysis can of course be applied to confirm this. But the key point here is that
visual pattern analysis is a powerful heuristic tool for discerning relations that may not
be immediately evident in the data itself.

3. For a more detailed discussion of intensive versus extensive data representations, see the classic paper by
Goodchild and Lam (1980).
4. To allow direct comparison, data on both maps has been normalized to have unit maximum values.


A second example is provided by the Irish blood group data from [BG, p.253] for the 26
counties of Eire. From an historical perspective, there is strong reason to believe that the
Anglo-Norman colonization of Ireland in the 12th Century had a lasting effect on the
population composition. Figure 1.7 below shows the estimated proportion of adults in
each county with blood group A in 1958 (where values increase from blue to red). Figure
1.8 shows the original colonized area of Eire, known as the “Pale”. Since blood group A
is much more common among individuals with Anglo-Norman heritage, a visual
comparison of these two figures strongly suggests a continued pattern of Anglo-Norman
influence in the region around the Pale. We shall later confirm these findings with spatial
regression.

[Figure: two maps of Eire with 50-mile scale bars]

Figure 1.7. Blood Group A Percentages          Figure 1.8. Counties in the Pale

1.3 Spatial Regression Analysis

A final example of areal data is provided by the English Mortality data from [BG,
pp.252-253]. Here the areal units are the 190 Health Authority Districts throughout
England, and the data used involve deaths from myocardial infarctions among males (35-
64), as shown in Figure 1.9 below. This data is in standardized rates, defined here to be
the number of deaths in the period 1984-1989 divided by the expected number of deaths
during that period based on national averages. Such standardized rates are quite typical
for medical data, and help to identify those areas where death rates are much higher than
expected relative to national averages. In particular, the darkest areas on the map indicate
rates well above average. While such higher rates may be influenced by many factors, the
present analysis focuses on aspects of “social deprivation” as summarized by the “Jarman
underprivileged areas score”, or Jarman score (which is a weighted average of factors
including levels of unemployment and overcrowding). This measure for each Health


Authority District is shown in Figure 1.10, where darker areas here show higher levels of
“social deprivation”.

[Figure: two maps of the English Health Authority Districts with 100-km scale bars]

Figure 1.9. Myocardial Infarctions          Figure 1.10. Jarman Scores

A visual comparison of these two maps suggests that there may indeed be some positive
correlation between these two patterns, especially in Northern England where the highest
levels of both death rates and social deprivation seem to occur.

This relation can be readily confirmed by a simple regression of log Myocardial


Infarction (lnMI) rates on log Jarman scores (lnJARMAN). The results in Figure 1.11
below show that there is indeed a very strong relation between the two.

Figure 1.11. Regression Results (Parameter Estimates)

  Term         Estimate    Std Error   t Ratio   Prob>|t|
  Intercept    1.6897938   0.424182    3.98      <.0001*
  lnJARMAN     0.6290693   0.09235     6.81      <.0001*

Figure 1.12. NN-Residual Analysis (Parameter Estimates)

  Term         Estimate    Std Error   t Ratio   Prob>|t|
  Intercept    0.0061292   0.015545    0.39      0.6938
  Res_NN       0.5135203   0.063817    8.05      <.0001*


However, it is also clear from Figures 1.9 and 1.10 above that there is a strong correlation
between both MI rates and Jarman scores in neighboring districts. Moreover, since it is
highly unlikely that the correlations among Jarman scores could completely account for
those among MI rates, one can expect there to be a strong spatial autocorrelation among
the regression residuals. This is confirmed by the simple nearest-neighbor analysis of
these regression residuals shown in Figure 1.12 (where nearest neighbors are here defined
with respect to centroid distances between districts). In fact the correlation among these
residuals is even stronger than that between lnMI and lnJARMAN (as can be seen by
comparing the t-ratios of Res_NN versus lnJARMAN). While much of this residual
correlation could in principle be removed by including a range of other relevant
explanatory variables, it is quite apparent from Figure 1.9 that significant autocorrelation
will remain.

With these observations, our ultimate objective is to extend this simple nearest-neighbor
analysis to a broader and more rigorous framework for spatial autocorrelation analyses of
areal data. But to do so, we must first address the difficult issue of defining appropriate
measures of “distance” between areal units.


2. Modeling the Spatial Structure of Areal Units

Aside from data aggregations, the second major difference between continuous and areal
data models concerns the representation of spatial structure itself. In particular, while
“distance between points” for any given units of measure (straight-line distance, travel
distance, travel time, etc.) is fairly unambiguous, the same is not true for “distance
between areal units”. As mentioned above, the standard convention here is to identify
representative points for areal units, the most typical being areal centroids (as defined
formally below). In fact, these centroids serve as the default option in ARCMAP for
constructing such representative points [refer to Section 1.2.9 in Part IV of this
NOTEBOOK]. But in spite of the fact that these points constitute the so-called
“geometric centers” of each areal unit, they can sometimes be quite misleading in terms
of distance relations between areal units.

An example is given in Figure 1.13 below, which involves three areal units, R1 , R2 , and
R3 . Here it might be argued that since units R2 and R3 are spatially separated, but are
each adjacent to R1 , they are both “closer” to (or exhibit a “stronger tie” to) unit R1 than
to each other. However, the centroids of these three units, shown by the black dots in
Figure 1.14, are equidistant from one another. Thus all of these spatial relations are lost
when “closeness” is summarized by centroid distances.

[Figure: three areal units $R_1$, $R_2$, $R_3$ (left) and their centroids $c_1$, $c_2$, $c_3$ (right)]

Figure 1.13. Areal Units          Figure 1.14. Centroid Distances

In particular, this suggests that the shapes of areal units also contain important
information about their relative proximities, even though they are much more difficult to
quantify. We shall return to this question below.

In addition to these geometric issues, there are other non-spatial properties of areal units
that influence their “closeness” in terms of human interactions. For example, it is often
observed that the opposite coasts in the US are relatively “close” to one another in terms
of human interactions (such as phone calls or emails). More generally, there tends to be
more interaction between states with large cities (such as those shown in Figure 1.15)
than would be expected on the basis of their separation in geographical space. For
example, such cities tend to contain relatively large professional populations conducting
business between cities.

[Figure: map of the continental US with point symbols marking cities above 500,000]

Figure 1.15. Cities above 500,000

But while such socio-economic linkages between areal units may indeed be relevant for
many applications, we shall restrict our present analysis to purely geometric notions of
"closeness". The main justification for this is that we are primarily interested in modeling
unobserved residual effects in regression models involving areal units. So these measures
of closeness are designed solely to capture possible spatial autocorrelation effects.
Indeed, it can be argued that potentially relevant socio-economic interactions between
units (such as communication and travel flows) should be part of the model, and not the
residuals.

2.1 Spatial Weights Matrices

To model spatial relations between areal units, we now let $n$ denote the number of units
to be considered, so that the region of interest, say $R$ = Continental US, is partitioned
into areal units, $R = \{R_i : i = 1,..,n\}$, say the $n = 48$ states in $R$ (as in Figure 1.15 above).
Our basic hypothesis is that the relevant degree of "closeness" or "proximity" of each
areal unit $R_j$ to unit $R_i$ (or alternatively, the "spatial influence" of $R_j$ on $R_i$) can be
represented by a numerical weight, $w_{ij} \geq 0$, where higher values of $w_{ij}$ denote higher
levels of proximity or spatial influence. Under this hypothesis, the full set of such spatial
relations can be represented by a single nonnegative weight matrix:

(2.1.1)  $W = \begin{pmatrix} w_{11} & \cdots & w_{1n} \\ \vdots & & \vdots \\ w_{n1} & \cdots & w_{nn} \end{pmatrix}$

Notice in particular that while the distance between a point and itself is naturally zero,
this need not be true for areal units. For example, if $w_{ij}$ were to represent the average
distance between all cities in states $i$ and $j$ (possibly weighted by population sizes), then
since the average distance between cities within each state $i$ is certainly positive, one

must have $w_{ii} > 0$ for all $i = 1,..,n$. So in general, any nonnegative matrix can be a
spatial weights matrix.

However, certain special structural properties of such matrices are quite common. For
example, if distance itself is measured symmetrically, i.e., if $d(x,y) = d(y,x)$ for all
locations $x$ and $y$ (as with Euclidean distance), then weight measures such as the average
distance between cities in states $i$ and $j$ will also be symmetric, i.e., $w_{ij} = w_{ji}$. So, much
like covariance matrices, many spatial weights matrices will be symmetric matrices.

Moreover, while diagonal weights, $w_{ii}$, can in principle be positive (as in the city
example above), it will often be convenient for analysis to set $w_{ii} = 0$ for all $i = 1,..,n$. In
particular, when $w_{ij}$ is taken to reflect some notion of the "spatial influence" of unit $j$ on
unit $i$, then we set $w_{ii} = 0$ in order to avoid "self-influence". This will become clearer in
the development of spatial autocorrelation models in Section ?? below.

Many of the most common spatial weights are based on distances between point
representations of areal units. So before developing these weight functions, it is
convenient to begin with a more detailed consideration of point representations
themselves.

2.1.1 Point Representations of Areal Units

If distances between areal units, $R_i$, $i = 1,..,n$, are to be summarized by distances between
representative "central" points, $c_i \in R_i$, then it is natural to require that $c_i$ be "close" to
all other points in $R_i$. This leads to certain well-posed mathematical definitions of such
representative points.1 Perhaps the simplest is the "spatial median" of an areal unit, $R$,
which is defined to be the point, $c$, with minimum average distance to all points
in $R$. If the area of $R$ is denoted (as in Section 2.1 of Part I) by

(2.1.2)  $a(R) = \int_R dx$,

then the spatial median, $c$, of $R$ (with respect to Euclidean distance) is given by the
solution to

(2.1.3)  $\min_c \;\; \dfrac{1}{a(R)}\int_R \| x - c \|\, dx$

But while this point is well defined and is easily shown (from the convexity properties of
this programming problem) to be unique, it is not identifiable in closed form. Even if $R$
is approximated by a finite grid of points, the solution algorithms for determining spatial

1. Here we ignore other possible reference points (such as the capital cities of states or countries) that might
be relevant in certain applications.


medians are computationally intensive. For this reason, we shall not use spatial medians
for reference points. However, it is still of interest to note that if $R$ were approximated by
some finite grid of points, $R_n = \{x_i : i = 1,..,n\}$ (say the set of raster pixels inside an
ARCMAP representation of $R$), then the spatial median of this set, $R_n$, can in fact be
calculated in ARCMAP using the ArcToolbox command: Spatial Statistics Tools >
Measuring Geographic Distributions > Median Center.
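
As a rough illustration of this grid approximation (not the ARCMAP tool itself), the spatial median in (2.1.3) can be approximated in MATLAB by searching over the grid points; P below is a hypothetical m-by-2 matrix of grid coordinates inside R:

   m = size(P,1);
   avg_dist = zeros(m,1);
   for i = 1:m
       di = sqrt(sum((P - repmat(P(i,:),m,1)).^2, 2));   % distances from grid point i
       avg_dist(i) = mean(di);                           % average distance, as in (2.1.3)
   end
   [~, imin] = min(avg_dist);
   c_med = P(imin,:);                                    % approximate spatial median of R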

Spatial Centroids

But in view of these computational complexities, a far more popular choice is the spatial
"centroid" of $R$, which minimizes the average squared distance to all points in $R$. More
formally, the centroid, $c$, of $R$ is given by the solution to:

(2.1.4)  $\min_c \;\; \dfrac{1}{a(R)}\int_R \| x - c \|^2\, dx$

The advantage of using squared distances is that this minimization problem is actually
solvable in closed form. In particular, by recalling that $\| x - c \|^2 = x'x - 2x'c + c'c$, and that
the minimum of (2.1.4) is given by its first-order conditions [as for example in Section
A2.7 of the Appendix to Part II], we see in particular that

(2.1.5)  $0 = \nabla_c\left[\dfrac{1}{a(R)}\int_R (x'x - 2x'c + c'c)\,dx\right] \;\;\Rightarrow\;\; 0 = \nabla_c\int_R (x'x - 2x'c + c'c)\,dx$

$\qquad = \int_R (-2x + 2c)\,dx = -2\left[\int_R x\,dx - c\int_R dx\right]$

$\qquad\;\;\Rightarrow\;\; \int_R x\,dx = c\int_R dx = c\,a(R)$

$\qquad\;\;\Rightarrow\;\; c = \dfrac{1}{a(R)}\int_R x\,dx$

which is simply the average over all locations, $x \in R$. So the coordinate values of
$c = (c_1, c_2)$ are precisely the average values of the coordinates, $x = (x_1, x_2)$, over $R$. In more
practical terms, if one were to approximate $R$ by a finite grid of points,
$R_n = \{x_i : i = 1,..,n\}$, in $R$ as mentioned above for spatial medians, then the centroid
coordinates, $c = (c_1, c_2)$, are well approximated by

(2.1.6)  $c_i = \dfrac{1}{n}\sum_{x \in R_n} x_i\,,\;\; i = 1, 2$


For this reason, the centroid of R is also called the spatial mean of R . Such spatial means
can be calculated (for finite sets of points) in ARCMAP using the ArcToolbox
command: Spatial Statistics Tools > Measuring Geographic Distributions > Mean
Center.

Computation of Centroids

But while this view of centroids is conceptually very simple and intuitive, there is in fact
a much more efficient and exact way to calculate centroids in ARCMAP. In particular,
since areal units $R$ are defined as polygon features with finite sets of vertices (in a
manner paralleling the matrix representations of polygon boundaries in MATLAB
discussed in Section 3.5 of Part I), one can actually calculate the exact centroids of these
polygons with rather simple geometric formulas. Since the derivation of these formulas is
well beyond the scope of these notes, we simply record them for completeness.2 If we
proceed in a clockwise direction around a given polygon, $R$, and denote its vertex points
by $(x_{1i}, x_{2i})$, $i = 1,..,n$ [where by definition the last vertex repeats the first,
$(x_{1n}, x_{2n}) = (x_{11}, x_{21})$], then the area of $R$ is given by

(2.1.7)  $a(R) = \tfrac{1}{2}\sum_{i=1}^{n-1}\left(x_{1,i+1}\,x_{2i} - x_{1i}\,x_{2,i+1}\right)$

and the centroid coordinates, $c = (c_1, c_2)$, are given by

(2.1.8)  $c_j = \dfrac{1}{6\,a(R)}\sum_{i=1}^{n-1}\left(x_{ji} + x_{j,i+1}\right)\left(x_{1,i+1}\,x_{2i} - x_{1i}\,x_{2,i+1}\right)\,,\;\; j = 1, 2$

These formulas are implemented in the MATLAB program, centroids_areas.m. If the
boundary file (in MATLAB format) for a given system of areal units, $R = \{R_i : i = 1,..,n\}$,
is denoted by bnd_R, then the $n \times 2$ matrix, C, of centroid coordinates and the $n$-vector,
A, of corresponding areas can be obtained with the command,3

>> [C,A] = centroids_areas(bnd_R);

These are precisely the same formulas used for calculating areas and centroids in the
"Calculate Geometry" option in ARCMAP, using the procedures outlined in Sections
1.2.8 and 1.2.9 of Part IV of this NOTEBOOK.
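
For completeness, a minimal sketch of formulas (2.1.7)-(2.1.8) for a single polygon is given below (this is only an illustration, not the course program centroids_areas.m). Here P is an n-by-2 matrix of vertex coordinates listed clockwise, with the first vertex repeated as the last row:

   function [c, a] = polygon_centroid_area(P)
   x1 = P(:,1);  x2 = P(:,2);
   cr = x1(2:end).*x2(1:end-1) - x1(1:end-1).*x2(2:end);     % x1(i+1)*x2(i) - x1(i)*x2(i+1)
   a  = sum(cr)/2;                                           % area (2.1.7)
   c  = [ sum((x1(1:end-1) + x1(2:end)).*cr) ;               % centroid (2.1.8), j = 1
          sum((x2(1:end-1) + x2(2:end)).*cr) ] / (6*a);      % centroid (2.1.8), j = 2
   end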

Displaying Centroids

One can display these centroids in ARCMAP by opening the Attribute Table containing
the centroids calculated above and using Table Options > Export… to save this table as

2. Full derivations of (2.1.7) and (2.1.8) require an application of Green's Theorem, and are given in expressions
(31), (33) and (34) of Steger (1996). Here it should be noted that the signs in Steger are reversed, since it is
there assumed that vertices proceed in a counterclockwise direction.
3. A more general MATLAB program of this type can be downloaded at the web site:
http://www.mathworks.com/matlabcentral/fileexchange/319-polygeom-m.


say centroids.dbf. When prompted to add this data to the existing map, click OK. If you
right click on this new entry in the Table of Contents and select Display XY Data, then
the centroids will now appear on the map. If you wish to save these centroids, right click
on the new “centroid events” entry in the Table of Contents and use Data > Export
Data. Finally, if you save to the map as centroids.shp, then you can edit this copy as a
permanent file. This procedure was carried out for the Eire map in Figure 1.7 above, and
is shown in Figure 1.16 below.

[Figure: map of Eire with the county centroids marked; 50-mile scale bar]

Figure 1.16. Eire Centroids

Before proceeding to spatial weights based on centroid distances, it is important to stress
some of the limitations of this centroid-distance characterization of closeness between
areal units. As was illustrated in Figures 1.13 and 1.14 above, such point representations
can often ignore important shape relations between areal units. As seen in the Eire case,
for example, the actual boundary relations among these counties are quite complex. In
addition, while we usually refer to the centroid of a given areal unit by writing, $c_i \in R_i$, it
is not necessarily true that the point, $c_i$, is actually an element of $R_i$. This is obvious for cases
like the state of Hawaii, where the relevant areal unit is itself a string of disconnected
islands. But in fact such problems may exist even for spatially connected areal units, such
as the example of a "river shore" area, $R$, shown (in yellow) in Figure 1.17 below. Here
the centroid of this area (shown in red) is not only outside of $R$, but is actually on the
other side of the river. So it is important to remember that while such locations are indeed
closest (in squared distance) to all points of $R$, the shape of $R$ itself may dictate that such
locations lie outside of $R$. However, it must also be stressed that these are very
exceptional cases. Indeed, while the county boundaries in Eire are very complex, each
centroid in Figure 1.16 is seen to be contained in its respective county.


Figure 1.17. Exterior Centroid Example

2.1.2 Spatial Weights based on Centroid Distances

While we shall implicitly assume that point representations, $c_i$, of areal units, $R_i$, are
based on centroids, the following definitions hold intact for any relevant sets of points
(such as state capitals or county seats). Moreover, while centroid distances,
$d_{ij} = d(c_i, c_j)$, are implicitly assumed to be Euclidean distances, $\| c_i - c_j \|$, the present
definitions of spatial weights are readily extendable to other relevant notions of distance
(such as travel distance or travel time). But it should also be stressed that our present
conventions are in fact used in most areal data analyses. The following examples of
spatial weights based on centroid distances extend the list given in [BG, p.261].

k-Nearest-Neighbor Weights

Recall from Section 3.2 in Part I that the nearest-neighbor distances defined within and
between point patterns are readily extendable to centroid distances. However, such
distance relations can be very restrictive for modeling spatial relations between areal
units. This is again well illustrated by the Eire example above, where the neighbors of
Laoghis county are shown in Figure 1.18 below.

[Figure: detail of the Eire county map with centroids marked and a red arrow from Laoghis county to its nearest-neighbor centroid]

Figure 1.18. Nearest Neighbors Example



Here it turns out that the nearest neighbor to Laoghis county in centroid distance is Offaly county to the north (shown by the red arrow). But it is clear that the neighbors adjacent to Laoghis county in all other directions may be of equal importance in terms of spatial relations. We shall be more explicit about such adjacency relations below. But in the present case, it is clear that we can achieve the same effect by considering the five nearest neighbors to this county.

So to formalize such multiple-neighbor relations, let the centroid distances from each areal unit $i$ to all other units $j \ne i$ be ranked as follows: $d_{ij(1)} \le d_{ij(2)} \le \cdots \le d_{ij(n-1)}$. Then for each $k = 1,..,n-1$, the set $N_k(i) = \{j(1), j(2),..,j(k)\}$ contains the $k$ areal units closest to $i$ (where for simplicity we ignore ties). For each given value of $k$, the k-nearest neighbor weight matrix, $W$, is then defined to have spatial weights of the form:

(2.1.9)    $w_{ij} = \begin{cases} 1, & j \in N_k(i) \\ 0, & \text{otherwise} \end{cases}$

Note in particular that the values of wij for the k-nearest neighbors of i are higher than
for other areal units, signifying that these neighbors are deemed to have greater proximity
to i (or greater spatial influence on i ) than other spatial units. Similar conventions will
be used for all weights discussed below. Note also that the common value of these
weights implicitly assumes that levels of proximity or influence are the same for all k-
nearest neighbors. This constancy assumption will be relaxed for other types of spatial
weights.
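
For concreteness, a weight matrix of this form could be constructed in MATLAB along the following lines (a minimal sketch of definition (2.1.9) only, and not the course program dist_wts.m described in Section 2.2.1 below; the names L, k, D and W here are purely illustrative):

    % Sketch: k-nearest-neighbor weights from an n-by-2 matrix L of centroid coordinates
    n = size(L,1);
    D = zeros(n);
    for i = 1:n
        for j = 1:n
            D(i,j) = norm(L(i,:) - L(j,:));   % Euclidean centroid distance d_ij
        end
    end
    W = zeros(n);
    for i = 1:n
        [~, idx] = sort(D(i,:));              % idx(1) is unit i itself (distance zero)
        W(i, idx(2:k+1)) = 1;                 % mark its k nearest other units
    end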

Before proceeding to other weighting schemes, it is also important to note that such nearest-neighbor relations are generally asymmetric in nature. For if $j$ is a k-nearest neighbor of $i$, then it need not be true that $i$ is a k-nearest neighbor of $j$, i.e., one may have $w_{ij} \ne w_{ji}$. As seen in Figure 1.19 below, this is true even for $k = 1$, where $R_2$ is the nearest neighbor of $R_1$, but $R_3$ is the nearest neighbor of $R_2$:

[Diagram: three regions $R_1$, $R_2$, $R_3$ in a row, with centroids $c_1$, $c_2$, $c_3$, where $c_2$ lies closer to $c_3$ than to $c_1$]

Figure 1.19. Asymmetric Nearest Neighbors

But in some applications it might be argued that as long as either i or j is an “influential


neighbor” of the other, then i and j are “spatially related” in this sense. This symmetric
k-nearest neighbor relation can be formalized as follows:


(2.1.10)    $w_{ij} = \begin{cases} 1, & j \in N_k(i) \text{ or } i \in N_k(j) \\ 0, & \text{otherwise} \end{cases}$

Radial Distance Weights

In some cases, distance itself is an important criterion of spatial influence. For example,
locations “within walking distance” or “within one-hour driving distance” may be
relevant. Such proximity criteria are usually more relevant for comparing actual point
locations (such as distances to shopping opportunities or medical services), but are
sometimes also used for areal data. If d denotes some threshold distance (or bandwidth)
beyond which there is no “direct spatial influence” between spatial units, then the
corresponding radial distance weight matrix, W , has spatial weights of the form:

1 , 0  dij  d 1
(2.1.11) wij  
0 , dij  d 
d dij

Power-Distance Weights

In the radial distance weights above there is no diminishing effect of distance up to


threshold d. However, if there are believed to be diminishing effects, one common
approach is to assume that weights are a negative power function of distance of the form

(2.1.12)    $w_{ij} = d_{ij}^{-\alpha}$

[inset graph: $w_{ij}$ declining smoothly with $d_{ij}$]

where $\alpha$ is some positive exponent, typically $\alpha = 1$ (as in the graph) or $\alpha = 2$. Note that expression (2.1.12) is precisely the same as expression (5.2.4) in the interpolation discussion of Section 5.2 in Part II. Thus all of the discussion in that section is relevant here as well.

Exponential-Distance Weights

As in expression (5.2.5) of Part II, the negative exponential alternative to negative power
functions is also relevant here, and is again defined by:


(2.1.13)    $w_{ij} = \exp(-\beta\, d_{ij})$

[inset graph: $w_{ij}$ declining from one toward zero with $d_{ij}$]

for some positive exponent, $\beta$ (such as $\beta = 1$ in the graph). As discussed in that section, the negative exponential version is better behaved for short distances, but converges rapidly to zero for larger distances.

Double-Power-Distance Weights

A somewhat more flexible family incorporates finite bandwidths with “bell shaped” taper
functions. If d again denotes the maximum radius of influence (bandwidth) then the
class of double-power distance weights is defined for each positive integer k by

(2.1.14)    $w_{ij} = \begin{cases} \left[\,1 - (d_{ij}/d)^k\,\right]^k, & 0 \le d_{ij} \le d \\ 0, & d_{ij} > d \end{cases}$

[inset graph: a bell-shaped taper of $w_{ij}$ falling to zero at the bandwidth $d$]

where typical values of $k$ are 2, 3 and 4. Note that $w_{ij}$ falls continuously to zero as $d_{ij}$ approaches $d$, and is defined to be zero beyond $d$. The graph shows the case of a quadratic distance function with $k = 2$ (see also [BG, p.85]).
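
Each of the distance-decay weights above can likewise be computed in a few MATLAB lines from a full matrix D of centroid distances (a sketch only; the bandwidth d and the exponents alpha, beta and k below are illustrative choices, not recommendations):

    % D = n-by-n matrix of centroid distances (zero diagonal)
    d = 50;  alpha = 1;  beta = 1;  k = 2;       % illustrative parameter values
    W_rad = double(D > 0 & D <= d);              % radial distance weights (2.1.11)
    W_pow = zeros(size(D));
    off = (D > 0);                               % off-diagonal entries only
    W_pow(off) = D(off).^(-alpha);               % power-distance weights (2.1.12)
    W_exp = exp(-beta*D) - eye(size(D));         % exponential weights (2.1.13), zero diagonal
    W_dpd = ((1 - (D/d).^k).^k) .* (D <= d);     % double-power weights (2.1.14)
    W_dpd = W_dpd - diag(diag(W_dpd));           % zero out the diagonal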

2.1.3 Spatial Weights Based on Boundaries

The advantage of the distance weights above is that such distances are easily computed. But in many cases the boundaries shared between spatial units can play an important role in determining the degree of “spatial influence”. The case of Eire in Figure 1.16 is a good example. In particular, recall that k-nearest-neighbor weights were in fact motivated by an effort to capture the counties surrounding Laoghis county in Figure 1.18. But such neighbor distances can at best only approximate spatial contiguity relations (especially since areal units can each have different numbers of contiguous neighbors). A better approach is of course to identify such contiguities directly. The main difficulty here is that the identification of contiguities requires the manipulation of boundary files, which are considerably more complex than simple point coordinates. We shall return to this issue in Section ?? below. But for the moment, we focus on the formal task of defining contiguity relations.


Spatial Contiguity Weights

The simplest contiguity weights indicate only whether pairs of areal units share a boundary or not. If the set of boundary points of unit $R_i$ is denoted by $bnd(i)$ then the so-called queen contiguity weights are defined by

(2.1.15)    $w_{ij} = \begin{cases} 1, & bnd(i) \cap bnd(j) \ne \emptyset \\ 0, & bnd(i) \cap bnd(j) = \emptyset \end{cases}$

However, this allows the possibility that spatial units share only a single boundary point (such as a corner point shared by diagonally adjacent cells on a chess board).4 Hence a stronger condition is to require that some positive portion of their boundary be shared. If $l_{ij}$ denotes the length of shared boundary, $bnd(i) \cap bnd(j)$, between $i$ and $j$, then these so-called rook contiguity weights are defined by

(2.1.16)    $w_{ij} = \begin{cases} 1, & l_{ij} > 0 \\ 0, & l_{ij} = 0 \end{cases}$

A simple example of a contiguity weight matrix, W, is given in expression (2.1.22)


below.

Shared-Boundary Weights

As a sharper form of comparison, note that if $l_i$ defines the total boundary length of $bnd(i)$ that is shared with other spatial units, i.e., $l_i = \sum_{j \ne i} l_{ij}$, then the fraction of this length shared with any particular unit $j$ is given by $l_{ij}/l_i$. These fractions themselves yield a potentially relevant set of shared boundary weights, defined by

(2.1.17)    $w_{ij} = \dfrac{l_{ij}}{l_i} = \dfrac{l_{ij}}{\sum_{k \ne i} l_{ik}}$

4
In fact, the present use of the terms “queen” and “rook” in expressions (2.1.15) and (2.1.16) refers
precisely to the possible moves of queen and rook pieces on a chess board, where rooks can only move
through faces between adjacent squares, but the queen can also move diagonally through corners.


2.1.4 Combined Distance-Boundary Weights

Finally, it should be evident that in many situations spatial closeness or influence may exhibit aspects of both distance and boundary relations. One classical example of this is given in the original study of spatial autocorrelation by Cliff and Ord (1969). In analyzing the Eire blood-group data, they found that the best weighting scheme for capturing spatial autocorrelation effects was given by the following combination of power-distance and boundary-shares,

(2.1.18)    $w_{ij} = \dfrac{l_{ij}\, d_{ij}^{-\alpha}}{\sum_{k \ne i} l_{ik}\, d_{ik}^{-\alpha}}$

with simple inverse distance, $\alpha = 1$. We shall return to this example in Section ?? below.

2.1.5 Normalizations of Spatial Weights

Having defined a variety of spatial weights, we next observe that for modeling purposes
it is generally convenient to normalize these weights in order to remove dependence on
extraneous scale factors (such as the particular units of distance employed in exponential
and power weights). Here there are two standard approaches:

Row-Normalized Weights

Recall that the $i$-th row of $W$ contains all spatial weights influencing areal unit, $i$, namely $(w_{ij}: j = 1,..,n)$ [possibly with $w_{ii} = 0$]. So if the positive weights in each row are normalized to have unit sum, i.e., if

(2.1.19)    $\sum_{j=1}^{n} w_{ij} = 1, \quad i = 1,..,n$

then this produces what is called the row normalization of W.5 Note that each row-normalized weight, $w_{ij}$, can then be interpreted as the fraction of all spatial influence on unit $i$ attributable to unit $j$. The appeal of this interpretation has led to the current widespread use of row-normalized weight matrices. In fact, many of the spatial weight definitions above are often implicitly defined to be row normalized. The most obvious example is that of shared boundary weights in (2.1.17), which by definition are seen to be row normalized. [Also the combined example in (2.1.18) was defined by Cliff and Ord

5
In cases where $w_{ii} = 0$ by definition, it is possible that isolated units, $i$, may have all-zero rows in W. So condition (2.1.19) is only required to hold for those rows, $i$, with $\sum_j w_{ij} > 0$.


(1969) to be in row-normalized form.] Another simple example is provided by the k-nearest neighbor weights in (2.1.9) above, which are often defined using weights $1/k$ rather than 1 to ensure row normalization. A more interesting example is provided by the power distance weights in (2.1.12), which have the row-normalized form,

(2.1.20)    $w_{ij} = \dfrac{d_{ij}^{-\alpha}}{\sum_{k \ne i} d_{ik}^{-\alpha}}$

These normalized weights are seen to be precisely the Inverse Distance Weighting (IDW) scheme employed in Spatial Analyst for spatial interpolation (as mentioned in Section 5.2 of Part II). A similar example is provided by exponential distance weights, with row-normalized form,

(2.1.21)    $w_{ij} = \dfrac{\exp(-\beta\, d_{ij})}{\sum_{k \ne i} \exp(-\beta\, d_{ik})}$

These weights are also used for spatial interpolation. In addition, it should be noted that such normalized weights are commonly used in spatial interaction modeling, where (2.1.20) and (2.1.21) are often designated, respectively, as Newtonian and exponential models of spatial interaction intensities or probabilities.
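
In MATLAB, row normalization of any nonnegative weight matrix W takes only a few lines (a sketch; rows with no positive weights are left unchanged, in keeping with footnote 5):

    rs = sum(W,2);                        % row sums of W
    pos = (rs > 0);                       % rows with at least one positive weight
    Wrn = W;
    Wrn(pos,:) = W(pos,:) ./ rs(pos);     % divide each such row by its row sum (implicit expansion)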

Scalar Normalized Weights

In spite of its popularity, row-normalized weighting has its drawbacks. In particular, row normalization alters the internal weighting structure of W so that comparisons between rows become somewhat problematic. For example, consider spatial contiguity weighting with respect to the simple three-unit example shown below, with units $R_i$, $R_j$, $R_k$ arranged in a row:

(2.1.22)    $W = \begin{pmatrix} 0 & w_{ij} & w_{ik} \\ w_{ji} & 0 & w_{jk} \\ w_{ki} & w_{kj} & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$

As represented in the contiguity weight matrix, $W$, on the right, unit $j$ is influenced by both $i$ and $k$, while units $i$ and $k$ are each influenced only by the single unit $j$. Hence it might be argued that $j$ is subject to more spatial influence than either $i$ or $k$. But row normalization of $W$ changes this relation, as seen by its row-normalized form, $W_{rn}$, below:

(2.1.23)    $W_{rn} = \begin{pmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 0 & 1 & 0 \end{pmatrix}$

Here the “total” influence on each unit is by definition the same, so that unit $i$ now influences $j$ only “half as much” as $j$ influences $i$. While the exact meaning of “influence” is necessarily vague in most applications, this effect of row-normalization can hardly be considered as neutral.6

In view of this limitation, it is natural to consider simple scalar normalizations, where $W$ is multiplied by a single positive number, say $\gamma_W$, that removes any measure-unit effects but preserves relations between all rows of $W$. For example, if $w_{\max}$ denotes the largest element of matrix, $W$, then the choice,

(2.1.24)    $\gamma_W = \dfrac{1}{w_{\max}} > 0$

provides one such normalization that has the advantage of ensuring that the resulting spatial weights, $w_{ij}$, are all between 0 and 1, and thus can still be interpreted as relative influence intensities.

However, for theoretical reasons, it is often more convenient to divide $W$ by the maximum eigenvalue, $\lambda_W$, of $W$ (to be discussed in Section 3.3.2 below) and hence to set

(2.1.25)    $\gamma_W = \dfrac{1}{\lambda_W} > 0$
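
Both scalar normalizations are equally direct in MATLAB (a sketch; here lam_W denotes the largest positive eigenvalue of the nonnegative matrix W, anticipating the discussion in Section 3.3.2 below):

    w_max = max(W(:));                    % largest element of W
    W_s1 = W / w_max;                     % normalization (2.1.24)
    lam_W = max(real(eig(full(W))));      % largest positive eigenvalue of W
    W_s2 = W / lam_W;                     % normalization (2.1.25): maximum eigenvalue becomes 1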

2.2 Construction of Spatial Weights Matrices

Our primary interest here is to show how spatial weight matrices can be constructed for
applications in MATLAB. We begin with those spatial weights based on centroid
distances as in Section 2.1.2 above, and illustrate their construction in MATLAB. Next

6
A more detailed discussion of this problem can be found in Kelejian and Prucha (2010).


we consider certain of the spatial contiguity weights in Section 2.1.3, which require initial
calculations to be made on shapefiles in ARCMAP.

2.2.1 Construction of Spatial Weights based on Centroid Distances

All spatial weights defined in Section 2.1.2 can be constructed in MATLAB using the
program dist_wts.m. By opening this program, one can see that the inputs include a
matrix, L, of centroid coordinates together with a MATLAB structure, info, containing
information about the specific spatial weights desired. The use of this program can be
illustrated by an application to the Eire centroid data in the workspace, eire.mat. Here L is a 26 × 2 matrix containing the centroid coordinates for the 26 counties in Eire. If one chooses to construct a weight matrix containing the five nearest neighbors for each county, say W_nn5, then by looking at the top of the program, one sees that k-nearest neighbors corresponds to the first of six types of spatial weights that can be created. In particular, by setting info.type = [1,5], one specifies a 5-nearest-neighbor matrix. Thus the appropriate commands for this case are given by:

>> info.type = [1,5];


>> W_nn5 = dist_wts(L,info);

To understand the matrix which is produced, we again consider the case of Laoghis
county in Figure 1.18 above. By using the Identify tool in ARCMAP, one sees that the
FID of Laoghis county is 10, so that its centroid coordinates correspond to row 11 in L
(remember that FID numbers start at 0 rather than 1). Similarly, one can verify that the
five surrounding counties (which are also its five nearest neighbors) have FID values
(0,8,9,18,21). So their row numbers in L are given by (1,9,10,19,22). These numbers
should thus correspond to the “1” values in row 11. This can be verified by displaying the column numbers of all positive elements in row number 11 of W_nn5 using the find command in MATLAB as follows:

>> find(W_nn5(11,:) > 0)

ans =

1 9 10 19 22

It is also important to emphasize that this matrix is constructed to be in sparse form, which means that only nonzero values are recorded. This can be seen by attempting to display the first 5 rows and columns of W_nn5 as follows:

>> W_nn5(1:5,1:5)

ans =

(5,2) 1


The result displayed says that the only nonzero element here is in (row 5, column 2) and
has value 1. This is a particularly powerful format in MATLAB since spatial weight
matrices tend to have many zero values, and can thus be stored and manipulated very
efficiently in sparse form. If one wants to obtain a full matrix version of W_nn5, say
Wnn5, then use the command:

>> Wnn5 = full(W_nn5);

The above 5 × 5 display then yields:

>> Wnn5(1:5,1:5)

ans =

0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 1 0 0 0

and shows in particular that all elements other than (5,2) are indeed zero.

2.2.2 Construction of Spatial Weights based on Boundaries

As mentioned above, the construction of spatial weights based on boundaries is


inherently more complex from a computational viewpoint. While there are a number of
available procedures for doing so, we focus here on methods that can be done by
combining ARCMAP and MATLAB procedures. In particular, spatial weights matrices
based on both queen and rook contiguities [expressions (2.1.15) and (2.1.16),
respectively] are directly available in ARCMAP. So our present focus is on how to obtain
these results, and import them to MATLAB. Here it should be mentioned that boundary
share weights [expression (2.1.17)] can also be constructed, but require more complex
procedures (as developed in Sections 3.2.2 and 3.2.3 of Part IV).

Here again we use the Eire data as an example, and assume that the shapefile, Eire.shp,
is currently displayed in ARCMAP. The desired spatial weights can be obtained in
ArcToolbox using the command sequence:

Spatial Statistics Tools > Modeling Spatial Relationships


> Generate Spatial Weights Matrix

(i) In the window that opens, first set:

Inputs Feature Class = “Path/Eire.shp”

where “Path” here represents the full path to the directory containing Eire.shp.


(ii) One then needs to have a unique identifier for each boundary polygon (county) in
Eire. If none are present, then the simplest procedure is to construct a new field, ID,
calculated as “[FID] + 1” and to set:

Unique ID Field = “ID”

(iii) Here we calculate queen contiguity weights, and thus name the output as:

Output Spatial Weights Matrix File = “Path/Queen_W.swm”

where “Path” now represents the full path to the directory where the output should be
placed.

(iv) Queen contiguities are then specified by:

Conceptualization of Spatial Relationships = “CONTIGUITY_EDGES_CORNERS”

Note that a number of other spatial weight matrices can also be constructed:

“CONTIGUITY_EDGES_ONLY” = rook weights (2.1.16)

“k_NEAREST_NEIGHBORS” = k-nearest neighbors (2.1.9)

“FIXED_DISTANCE” = radial distance (2.1.11)

“INVERSE_DISTANCE” = power distance (2.1.12) [with exponent options]

► Before leaving this window be sure to remove the check on Row Standardization,
unless you want row standardized values.

(v) Now click OK and the file Queen_W.swm should appear in the directory specified.

►Note this file is a binary file that is only useful inside ARCMAP. To use this data in
MATLAB, it must be transformed into a suitable text file. To do so:

(i) Again in ArcToolbox, start with the command sequence:

Spatial Statistics Tools > Utilities > Convert Spatial Weights Matrix to Table

(ii) In the window that opens, set:

Input Spatial Weights Matrix File = “Path/Queen_W.swm”

Output Table: “Path/Queen_W_Table”


► Note that this output path can have no spaces (or you will get an error message). So you may have to choose a higher-level directory that can be reached without using spaces.
(iii) Click OK, and the file Queen_W_Table.dbf should appear in this directory.

(iv) Since MATLAB cannot (yet) import .dbf files, you must transform this to a text file.
To do so, open the file in EXCEL as a .dbf file. Now delete the first column (containing
zeros), so that only three columns remain (“ID” “NID” “WEIGHT”). Save this as a tab-
delimited text file, Queen_W_Table.txt.

(v) To import this text file to MATLAB, use

Home > Import Data

and open Queen_W_Table.txt.

(vi) In the IMPORT Window, change the default “Column vector” setting in the
IMPORTED DATA box to “Matrix”, and click

Import Selection > Import Data

The file will now appear in the workspace as a 112x3 matrix, QueenWTable. You can
rename this as W_queen by right clicking on the workspace entry.
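
Note that this imported matrix is an edge list, with one row per neighboring pair, rather than a full 26 × 26 weight matrix. For the row-wise check below, one possible way to convert it is MATLAB's sparse command (a sketch; the conversion overwrites W_queen with its matrix form):

    T = W_queen;                                        % imported table: columns ID, NID, WEIGHT
    W_queen = sparse(T(:,1), T(:,2), T(:,3), 26, 26);   % 26-by-26 sparse contiguity matrix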

As a check to be sure this procedure was successful, one may compare W_queen with
the ARCMAP representation. In particular, by repeating the procedure for W_nn5 above,
we now see that:

>> find(W_queen(11,:) > 0)

ans =

1 9 10 19 22

so that, as seen in Figure 1.18 above, the five contiguous neighbors to Laoghis county are indeed its five nearest neighbors with respect to centroid distance.


3. The Spatial Autoregressive Model

Given the above formulation of spatial structure in terms of weights matrices, our objective in this section is to develop the basic model of areal-unit dependencies that will be used to capture possible spatial correlations between such units. Unless otherwise stated, we shall implicitly represent the relevant set of areal units, $\{R_1,..,R_n\}$, by their indices, $i = 1,..,n$. In particular, these areal units will almost always represent the sample units of interest. To put this spatial-dependency model in proper perspective, we begin with a typical linear model of the form

(3.1)    $Y_i = \beta_0 + \sum_{j=1}^{k} \beta_j x_{ij} + u_i, \quad i = 1,..,n$

where $Y_i$ is taken to represent some relevant attribute of each spatial unit, $i$, and where $(x_{ij}: j = 1,..,k)$ represents a set of “explanatory” attributes of $i$ that are postulated to influence $Y_i$. For example, if $Y_i$ is the Myocardial Infarction rate of each English Health District, $i = 1,..,190$, in Section 1.3 above, then $x_{i1}$ might correspond to the Jarman score for District $i$, together with other possible attributes of that district. This model exhibits an obvious similarity to expression (7.5) in Part II. The key difference is in terms of their respective spatial sample units, where the point locations ($s$) in expression (7.5) are here replaced by areal units ($R$) that partition this space. As mentioned in the introduction, this change in spatial sample units reflects the type of spatial data being analyzed. For example, while, say, temperature is meaningful at each point in space, this is not true of Myocardial Infarction rates.1 But much more important for our present purposes is the way in which the unobserved errors (or residuals) are treated in each model. Notice in particular that we have switched notation in (3.1), and are now representing such residuals by $u_i$ rather than $\varepsilon_i$. The reason for this is that we shall proceed to develop an explicit linear model of these spatial residuals themselves.

Before doing so, it is convenient to restate (3.1) in matrix terms as

(3.2)    $Y = X\beta + u$

where as usual, $Y = (Y_1,..,Y_n)$, $X = [1_n, x_1,..,x_k]$, $\beta = (\beta_0, \beta_1,..,\beta_k)$ and $u = (u_1,..,u_n)$. We again assume that the random vector, $u$, of residuals is multinormally distributed with mean, $E(u) = 0$, so that by construction,

(3.3)    $E(Y) = X\beta$

1
Note however that in cases such as the California rainfall example, where cities were treated as points, the
relevant data implicitly involves “local” spatial averages. So in this setting, for example, it would be
perfectly meaningful to compare the Myocardial Infarction rates of San Francisco and Los Angeles.


In this setting, our primary objective is to model the covariance structure of $u$ in a manner that reflects possible spatial dependencies among areal units.

But rather than postulate spatial stationarity properties of $u$ (as was done for spatially continuous data in Part II), we must now rely on discrete spatial structure as summarized by a given spatial weights matrix, $W = (w_{ij}: i, j = 1,..,n)$. In terms of our Myocardial example above, $w_{ij}$ may represent some measure of the spatial proximity of Health District $j$ to (or influence on) Health District $i$, where higher values of $w_{ij}$ denote greater spatial proximity or influence. In this setting, it seems reasonable to postulate that each unobserved residual, $u_i$, in (3.1) is influenced by those residuals, $u_j$, in neighboring areal units $j$, i.e., with positive spatial weights, $w_{ij}$. As a parallel to (3.1), such influences might also be represented by a linear “spatial error” model of the form:

(3.4)    $u_i = \sum_{j \ne i} \rho(w_{ij})\, u_j + \varepsilon_i$

where $\rho(w_{ij})$ is some appropriate “influence” function depending on $w_{ij}$, and where $\varepsilon_i$ represents that part of residual $u_i$ that is not influenced by other areal units. But as we have seen in Section 3.2, there is already great flexibility in the specification of spatial weights, $w_{ij}$, and hence no need for further functional elaborations. Rather, the strategy here is to use the simplest possible specification in terms of a common scale factor, $\rho$, so that $\rho(w_{ij})$ takes the form $\rho\, w_{ij}$, and (3.4) reduces to2

(3.5)    $u_i = \rho \sum_{j \ne i} w_{ij}\, u_j + \varepsilon_i, \quad i = 1,..,n$

To interpret (3.5), note first that (except for the absence of an intercept term) this relation is essentially a type of linear regression model in which each residual, $u_i$, is regressed on its neighbors, $u_j$ (with coefficients $\rho\, w_{ij}$). Moreover, since this effectively implies that the full set of residuals is being regressed on itself, model (3.5) is designated as a spatial autoregressive model of residual dependencies. In this context, the summation over all $j \ne i$ ensures that no individual residual is “regressed on itself”. But even with this restriction, it will be shown below that the estimation of such autoregressive models is far more subtle than that of standard regression models.

For the present however, we focus only on the basic meaning of (3.5). First consider the parameter, $\rho$, which plays a very special role in this model. At one extreme, if $\rho = 0$ then each residual, $u_i$, reduces to its own intrinsic component, $\varepsilon_i$, and all spatial dependencies vanish. More formally, if we now assume that these individual components are independently and identically normally distributed as,

2
Here the notation, $\sum_{j \ne i}$, means summation over all units, $j$, other than unit $i$.


(3.6)    $\varepsilon_i \sim N(0, \sigma^2), \quad i = 1,..,n$

then model (3.1) is seen to reduce to a standard linear regression model when $\rho = 0$. At the other extreme, when $|\rho|$ becomes large, the strength of all spatial dependencies (positive or negative) must also become large. This suggests that $\rho$ be designated as the spatial dependency parameter for the model.

Note also, that for any pairs of areal units, $ij$ and $kh$, with positive spatial weights, $w_{ij}, w_{kh} > 0$, and any nonzero level of spatial dependence, $\rho \ne 0$, it must always be true that

(3.7)    $\dfrac{\rho\, w_{ij}}{\rho\, w_{kh}} = \dfrac{w_{ij}}{w_{kh}}$

Thus the relative strength of these spatial dependencies is determined entirely by their spatial weights. In summary, this model provides a natural “division of responsibilities” in which $\rho$ governs the overall strength of spatial dependencies, and in which the spatial weight structure governs their relative strength among individual areal-unit pairs.

Finally, to write this model in more compact matrix form, it is convenient to assume that $w_{ii} = 0$ in the given spatial weights matrix, $W$, so that (3.5) can be rewritten in more standard terms as

(3.8)    $u_i = \rho \sum_{j=1}^{n} w_{ij}\, u_j + \varepsilon_i, \quad i = 1,..,n$

In this form, if we now let $\varepsilon = (\varepsilon_1,..,\varepsilon_n)$ denote the random vector of intrinsic components, then expressions (3.8) and (3.6) together yield the following Spatial Autoregressive Model of residual dependencies:3

(3.9)    $u = \rho W u + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I_n)$

where in addition it is assumed that the diagonal elements of $W$ are zero, written as

(3.10)    $diag(W) = 0$.

3.1 Relation to Time Series Analysis

Like most of the spatial dependency models considered in these notes, model (3.9) was
originally inspired by a time series model [as in Whittle (1954)]. In the present case, this
3
This model was originally proposed by Whittle (1954). But the present matrix formulation was first given
by Ord (1975), who designated (3.9) as a first-order spatial autoregressive process.


“parent” model can be formulated as follows. If we consider a finite sequence of random variables, $(u_t: t = 1,..,T)$, over $T$ time periods (say average Philadelphia temperature, $u_t$, over $T$ successive days), then the standard first-order autoregressive [AR(1)] model of this series takes the recursive form:

(3.1.1)    $u_t = \rho\, u_{t-1} + \varepsilon_t, \quad t = 2,..,T$

with “initial condition”,4

(3.1.2)    $u_1 = \varepsilon_1$

where $(\varepsilon_t: t = 1,..,T)$ is assumed to be a sequence of independent random “innovations” identically distributed as $N(\mu, \sigma^2)$. In the “temperature” example above, these innovations $(\varepsilon_t: t = 1,..,T)$ can be viewed as random fluctuations about some constant mean daily temperature, $\mu$. The term “first-order” in this case refers to the fact that given the past history of daily temperatures in Philadelphia, model (3.1.1) assumes that today’s temperature, $u_t$, depends only on yesterday’s temperature, $u_{t-1}$, plus some current temperature innovation, $\varepsilon_t$.

Except for the nonzero value of $\mu$, this AR(1) model can be viewed formally as a special case of model (3.9). To see this, observe simply that if the $T \times T$ weights matrix, $W = (w_{ts}: t, s = 1,..,T)$, is defined by

(3.1.3)    $w_{ts} = \begin{cases} 1, & t = 2,..,T,\ s = t-1 \\ 0, & \text{otherwise} \end{cases}$

then it follows at once from (3.1.1) and (3.1.2) that:

(3.1.4)    $\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_T \end{pmatrix} = \rho \begin{pmatrix} 0 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ & \ddots & \ddots & \\ 0 & \cdots & 1 & 0 \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_T \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_T \end{pmatrix} \;\;\Rightarrow\;\; u = \rho W u + \varepsilon$

But this particular instance of (3.9) has the important property that time dependencies flow in only one direction – namely from the past to the present. Formally, this is reflected by the so-called “lower triangular” structure of $W$ in (3.1.4).

4
While (3.1.2) can be replaced by more standard “steady state” initial conditions, the present simpler form is most appropriate for our purposes.


To appreciate the significance of this unidirectional flow, it is instructive to ask how one might simulate this model. Here the answer is almost self-evident from (3.1.1) and (3.1.2):

Step 1: Sample a value of $\varepsilon_1$ from $N(\mu, \sigma^2)$ and set $u_1 = \varepsilon_1$.

Step 2: Sample a value of $\varepsilon_2$ from $N(\mu, \sigma^2)$ and set $u_2 = \rho\, u_1 + \varepsilon_2$.

Step 3: Sample a value of $\varepsilon_3$ from $N(\mu, \sigma^2)$ and set $u_3 = \rho\, u_2 + \varepsilon_3$.

⋮

Step T: Sample a value of $\varepsilon_T$ from $N(\mu, \sigma^2)$ and set $u_T = \rho\, u_{T-1} + \varepsilon_T$.
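
These steps translate directly into MATLAB (a minimal sketch; the values of T, rho, mu and sigma below are purely illustrative):

    T = 100;  rho = 0.7;  mu = 60;  sigma = 5;   % illustrative parameter values
    e = mu + sigma*randn(T,1);                   % innovations drawn from N(mu, sigma^2)
    u = zeros(T,1);
    u(1) = e(1);                                 % initial condition (3.1.2)
    for t = 2:T
        u(t) = rho*u(t-1) + e(t);                % recursion (3.1.1)
    end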

However, for more general examples of model (3.9), this simple process of simulation is
not possible.

3.2 The Simultaneity Property of Spatial Dependencies

This problem is most easily illustrated by the following one-dimensional example. Suppose we consider “over the fence” communications between residential neighbors on a given street, as depicted in Figure 3.1 below.

[Diagram: houses 1, 2, 3, …, n−1, n arranged along a street]

Figure 3.1. Bilateral Dependency Example

In particular, suppose that household $i$’s opinion, $u_i$, on how much each house should contribute to their annual street party is influenced both by $i$’s initial opinion, $\varepsilon_i$, and by the opinions of $i$’s immediate neighbors, including $u_{i-1}$ and/or $u_{i+1}$. Then a natural spatial model of opinion formation by these residents might well take the form:

(3.2.1)    $u_i = \begin{cases} \rho\, u_{i+1} + \varepsilon_i, & i = 1 \\ \rho\,(u_{i-1} + u_{i+1}) + \varepsilon_i, & 2 \le i \le n-1 \\ \rho\, u_{i-1} + \varepsilon_i, & i = n \end{cases}$

where $\rho$ now reflects how influential the opinions of these neighbors are. Note in particular that the “edge” residents 1 and $n$ have only one neighbor, while all other residents have two neighbors.


Given this spatial model of opinion formation,5 one may again ask: how might we simulate this model? Here the key question is where to start the simulation. For if we start with edge resident 1, then it is clear from the first line of (3.2.1) that we must know the opinion, $u_2$, of 1’s neighbor in order to simulate $u_1$. Similarly, if we start with edge resident $n$, then the last line of (3.2.1) shows that the opinion, $u_{n-1}$, of $n$’s neighbor is required to simulate $u_n$. Moreover, the situation is even worse for intermediate residents, $i$, where both neighboring opinions, $u_{i-1}$ and $u_{i+1}$, are required in order to simulate $u_i$. So it would appear that there is no way to simulate this process at all. But to be more precise, this argument shows that there is no possible sequential simulation procedure for realizing samples of (3.2.1). Rather, the full set of opinions, $(u_1, u_2,..,u_n)$, must somehow be simulated simultaneously.

Here it turns out that there is a remarkably simple procedure for doing so. In particular, let us again formulate (3.2.1) as an instance of (3.9), where $W$ now takes the form:

(3.2.2)    $w_{ts} = \begin{cases} 1, & s \in \{t-1,\, t+1\} \\ 0, & \text{otherwise} \end{cases}, \qquad t, s = 1,..,n$

Then it follows at once from (3.2.1) that:

(3.2.3)    $\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{n-1} \\ u_n \end{pmatrix} = \rho \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & & \vdots \\ 0 & 1 & 0 & \ddots & \\ \vdots & & \ddots & \ddots & 1 \\ 0 & \cdots & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{n-1} \\ u_n \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_{n-1} \\ \varepsilon_n \end{pmatrix} \;\;\Rightarrow\;\; u = \rho W u + \varepsilon$

But given this matrix formulation, observe that we may solve for $u$ in terms of $\varepsilon$ as follows:

(3.2.4)    $u = \rho W u + \varepsilon \;\Rightarrow\; u - \rho W u = \varepsilon \;\Rightarrow\; (I_n - \rho W)\, u = \varepsilon$

So assuming for the moment that the inverse matrix, $(I_n - \rho W)^{-1}$, exists, we can multiply both sides of (3.2.4) by $(I_n - \rho W)^{-1}$ to obtain the following reduced form solution for $u$ in terms of $\varepsilon$,

(3.2.5)    $u = (I_n - \rho W)^{-1}\varepsilon$

5
Formally, expression (3.2.1) is an instance of the bilateral autoregressive process proposed by Whittle
(1954). Indeed, this is precisely the one-dimensional example that motivated his original analysis of spatial
autoregressive processes.


Given this existence assumption, observe that if “intrinsic opinions” are again assumed (for sake of illustration) to be independently and identically normally distributed about some average opinion level, $\mu$, as $\varepsilon_i \sim N(\mu, \sigma^2)$, $i = 1,..,n$, then we can now simulate (3.2.1) in essentially only two steps:

Step 1: Sample each $\varepsilon_i$ from $N(\mu, \sigma^2)$, $i = 1,..,n$, and set $\varepsilon = (\varepsilon_1,..,\varepsilon_n)$.

Step 2: Solve for $u = (u_1,..,u_n)$ as $u = (I_n - \rho W)^{-1}\varepsilon$.
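
In MATLAB these two steps reduce to one random draw and one linear solve (a sketch; n, rho, mu and sigma are illustrative, and the backslash operator solves $(I_n - \rho W)u = \varepsilon$ without forming the inverse explicitly):

    n = 10;  rho = 0.4;  mu = 100;  sigma = 10;       % illustrative values
    W = diag(ones(n-1,1),1) + diag(ones(n-1,1),-1);   % "over the fence" weights (3.2.2)
    e = mu + sigma*randn(n,1);                        % Step 1: intrinsic opinions
    u = (eye(n) - rho*W) \ e;                         % Step 2: solve (I - rho*W)*u = e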

So by simple matrix manipulations, this simultaneity problem appears to have been


solved. But there remains the question of how this “magic” was possible, and what it
actually means in more intuitive terms.

3.3 A Spatial Interpretation of Autoregressive Residuals

Our objective in this section is to obtain conditions for the existence of $(I_n - \rho W)^{-1}$ and to give an intuitive spatial interpretation to this inverse matrix. To do so, we start by recalling that for any number, $a$, the basic geometric series:

(3.3.1)    $S = 1 + a + a^2 + a^3 + \cdots = \sum_{k=0}^{\infty} a^k$

represents the simplest example of an infinite summation that can be given a closed form solution in an elementary way. For if one considers the partial sum,

(3.3.2)    $S_k = 1 + a + a^2 + a^3 + \cdots + a^k$

and multiplies this by $a$,

(3.3.3)    $a\, S_k = a + a^2 + a^3 + \cdots + a^k + a^{k+1}$

then by subtracting (3.3.3) from (3.3.2),

$S_k - a\, S_k = (1 + a + a^2 + \cdots + a^k) - (a + a^2 + \cdots + a^k + a^{k+1}) = 1 - a^{k+1}$

we obtain the simple identity

(3.3.4)    $S_k = \dfrac{1 - a^{k+1}}{1 - a}$

But since by definition, $S = \lim_{k\to\infty} S_k$, it follows at once from (3.3.4) that this limiting sum exists if and only if $\lim_{k\to\infty} a^k = 0$, and must have the closed-form solution:


(3.3.5)    $S = \lim_{k\to\infty} S_k = \dfrac{1}{1-a} = (1-a)^{-1}$

Finally, by combining (3.3.1) and (3.3.5) we see that

(3.3.6)    $(1-a)^{-1} = 1 + a + a^2 + a^3 + \cdots = \sum_{k=0}^{\infty} a^k$

if and only if $\lim_{k\to\infty} a^k = 0$.

The point of this exercise for our purposes is that exactly the same argument can be applied to matrices, by simply substituting the scalar, $a$, with an n-square matrix $A$. In particular, if $O_n$ denotes the n-square zero matrix, then it is shown in Section A3.5 of the Appendix that

(3.3.7)    $(I_n - A)^{-1} = I_n + A + A^2 + A^3 + \cdots = \sum_{k=0}^{\infty} A^k$

if and only if $\lim_{k\to\infty} A^k = O_n$. So in our case, by setting $A = \rho W$, it follows that the inverse $(I_n - \rho W)^{-1}$ will exist and have the limiting form

(3.3.8)    $(I_n - \rho W)^{-1} = I_n + \rho W + \rho^2 W^2 + \rho^3 W^3 + \cdots = \sum_{k=0}^{\infty} \rho^k W^k$

if and only if

(3.3.9)    $\lim_{k\to\infty} \rho^k W^k = O_n$

Our main objective is to employ this representation to give a meaningful interpretation to the “steady states” of spatial autoregressive processes as in expression (3.2.5). But before doing so, it is important to establish conditions on the spatial dependency parameter which will ensure that (3.3.9) holds. Since this condition must surely hold when $\rho = 0$, it is not surprising that the desired condition will amount to placing a bound on the maximum size of $|\rho|$. But this bound will of course depend on the structure of the spatial weights matrix, $W$, as we now show.

3.3.1 Eigenvalues and Eigenvectors of Spatial Weights Matrices

In Section A3.1 of the Appendix we develop a number of important properties of n-square matrices, $A$, as representations of n-dimensional linear transformations on $\mathbb{R}^n$. Our focus is on the geometric interpretations of these properties, which can often be represented graphically in 2 dimensions. Without going into great detail here, it is enough to say that every 2-square matrix,

(3.3.10)    $A = (a_1, a_2) = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$

represents a 2-dimensional linear transformation that transforms each vector, $x = (x_1, x_2) \in \mathbb{R}^2$, into a new vector, $Ax \in \mathbb{R}^2$, called the image of $x$ under $A$. Each transformation, $A$, is entirely representable by the images of the identity basis vectors, $e_1, e_2 \in \mathbb{R}^2$ [recall expression (3.2.16) of Part II], as shown in Figure 3.2. In particular, since by definition each $x = (x_1, x_2)$ is representable as the weighted sum, $x = x_1 e_1 + x_2 e_2$, it follows from linearity that $Ax$ is representable by the corresponding weighted sum of the images, $(Ae_1, Ae_2)$, as shown in Figure 3.3 below (see also Figures A3.3 and A3.4 in the Appendix).
[Diagrams: the basis images $Ae_1$, $Ae_2$, and the image $Ax = x_1 Ae_1 + x_2 Ae_2$ of a general vector $x = x_1 e_1 + x_2 e_2$]

Figure 3.2. Basis Image Vectors Figure 3.3. General Image Vectors

From a geometrical viewpoint, it is of interest to ask whether there exist any vectors, $x \in \mathbb{R}^n$, that are simply “stretched” by $A$ into (possibly negative) multiples of themselves, i.e., whether

(3.3.11)    $A x = \lambda x$

for some scalar, $\lambda \in \mathbb{R}$. If so, then $\lambda$ is called an eigenvalue of $A$ with associated eigenvector, $x$. [Note that (3.3.11) continues to hold for any scalar multiple of $x$, so that eigenvectors are only unique up to scalar multiples.] For convenience we refer to eigenvalues together with their eigenvectors as the eigenstructure of $A$, and in particular, denote the set of distinct eigenvalues for $A$ by $Eig(A)$. To illustrate these ideas for spatial weights matrices in 2 dimensions, we are of course restricted to the simplest possible case of only two areal units, as shown in Figure 3.4 below.

(3.3.12)    $W = \begin{pmatrix} 0 & w_{12} \\ w_{21} & 0 \end{pmatrix}$    [two adjacent areal units, $R_1$ and $R_2$]


If $W$ represents a simple contiguity relation with $w_{12} = 1 = w_{21}$ [as in the 3-unit example of expression (2.1.22) above], and if we let $x_1 = (1, 1)$ and $x_2 = (-1, 1)$, then simple matrix multiplication shows that $W x_1 = x_1$ and $W x_2 = -x_2$, so that these are both eigenvectors of $W$ with corresponding eigenvalues, $Eig(W) = \{\lambda_1, \lambda_2\} = \{1, -1\}$. This is shown graphically in Figure 3.4 below (where $x_1$ and $W x_1$ are slightly offset so that both can be seen):

[Diagram: eigenvectors $x_1$ and $x_2$ of $W$, with $W x_1 = x_1$ and $W x_2 = -x_2$]

Figure 3.4. Eigenstructure of W

More generally (as shown in Section A3.3 of the Appendix), each n-square matrix, $A$, possesses at most $n$ distinct eigenvalues. To see that there may be fewer than $n$, consider the identity matrix, $I_n$, which has only one distinct eigenvalue ($\lambda = 1$) since by definition, $I_n x = x$ for all $x \in \mathbb{R}^n$. This example also shows that eigenvectors in such cases can be chosen in many ways. There also exist matrices with no (real) eigenvalues, as illustrated by the matrix

(3.3.13)    $A = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$

As seen in Figure 3.5 below, this matrix rotates the plane by 90°, so that no vector can be sent into a scalar multiple of itself.

[Diagram: $Ae_1 = e_2$ and $Ae_2 = -e_1$, a 90° rotation of the basis vectors]

Figure 3.5. Rotation Transformation



But for the sake of simplicity, we focus here on n-square matrices, $A$, with a full set of eigenvalues, $Eig(A) = \{\lambda_1,..,\lambda_n\}$, and associated eigenvectors, $x_1,..,x_n$, that are linearly independent.6 In geometric terms, this means that every point, $x \in \mathbb{R}^n$, can be written as a linear combination of these eigenvectors, as illustrated by the point, $x$, in Figure 3.4. In algebraic terms, it means that the n-square matrix, $X = [x_1,..,x_n]$, defined by these eigenvectors is nonsingular, so that the inverse matrix, $X^{-1}$, exists. We may thus write out the relations among these eigenvalues and eigenvectors as follows,

(3.3.14)    $A x_i = \lambda_i x_i,\ i = 1,..,n \;\;\Rightarrow\;\; A X = [A x_1,..,A x_n] = [\lambda_1 x_1,..,\lambda_n x_n] = [x_1,..,x_n] \begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix} \;\;\Rightarrow\;\; A X = X \Lambda$

where $\Lambda = diag(\lambda_1,..,\lambda_n)$ is the diagonal matrix of eigenvalues. So (post) multiplying both sides of (3.3.14) by $X^{-1}$, we obtain the following “spectral” representation of $A$,

(3.3.15)    $A X X^{-1} = X \Lambda X^{-1} \;\;\Rightarrow\;\; A = X \Lambda X^{-1}$

To see the power of this representation, observe that if we multiply $A$ by itself, then:

(3.3.16)    $A^2 = (X \Lambda X^{-1})(X \Lambda X^{-1}) = X \Lambda (X^{-1} X) \Lambda X^{-1} = X \Lambda^2 X^{-1}$

By comparing this with (3.3.15), it follows at once that the eigenvalues of $A^2$ are precisely the squares of the eigenvalues of $A$, and moreover that the associated eigenvectors remain the same. By simply repeating this argument $k$ times, it follows more generally that

(3.3.17)    $A^k = X \Lambda^k X^{-1} = X \begin{pmatrix} \lambda_1^k & & \\ & \ddots & \\ & & \lambda_n^k \end{pmatrix} X^{-1}, \quad k = 1, 2, \ldots$

So the eigenstructure of $A$ tells us a great deal about how the associated powers, $A^k$, of $A$ must behave. In particular, the limiting behavior of these powers as $k \to \infty$ for any matrix, $A$, is governed entirely by the maximum size of its eigenvalues, which we denote by,

(3.3.18)    $|\lambda|_A = \max_{\lambda \in Eig(A)} |\lambda|$

6
In fact the eigenvectors for distinct eigenvalues are always linearly independent, as illustrated in Figure
A3.27 of the Appendix.


To see this, note simply from (3.3.17) that these powers will converge to the zero matrix if and only if $\lambda^k \to 0$ for all $\lambda \in Eig(A)$. Because this is equivalent to the single condition, $|\lambda|_A < 1$, it then follows that

(3.3.18)    $\lim_{k\to\infty} A^k = O_n \;\iff\; |\lambda|_A < 1$

For the important case of nonnegative matrices, it is shown in Section ?? of the Appendix that this maximum always corresponds to the largest positive eigenvalue of $A$, denoted here by $\lambda_A$, so that $\lambda_A = |\lambda|_A$. As an illustrative example, the eigenstructure of the nonnegative matrix,

(3.3.19)    $A = \begin{pmatrix} 2/3 & 1/3 \\ 1/6 & 1/2 \end{pmatrix}$

is easily seen to be given by

(3.3.20)    $\Lambda = \begin{pmatrix} 5/6 & 0 \\ 0 & 1/3 \end{pmatrix}, \qquad X = [x_1, x_2] = \begin{pmatrix} 2 & -1 \\ 1 & 1 \end{pmatrix}$

(as can be checked by matrix multiplication). This eigenstructure is shown graphically in


Figure 3.6 below.

[Diagram: eigenvectors $x_1$ and $x_2$ of $A$, with images $Ax_1$ and $Ax_2$ shrunk toward the origin]

Figure 3.6. “Shrinking” Eigenvalue Example

Since all points are linear combinations of the eigenvectors, $x_1$ and $x_2$, and since $|\lambda|_A = \lambda_A = 5/6 < 1$ implies that both these eigenvectors shrink toward zero, we see that


all points are shrunk towards zero (as illustrated by the parallelogram in the figure). In
other words, by using the coordinate system created by these eigenvectors, we see that
the shrinking behavior of these eigenvectors is inherited by all points with respect to this
coordinate system. While not every case is so simply illustrated, Figure 3.6 helps to
provide some geometric intuition for the general result in (3.3.18).7
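
This eigenstructure is easily checked in MATLAB (a small sketch; note that eig returns eigenvectors only up to scalar multiples, so they may appear rescaled):

    A = [2/3 1/3; 1/6 1/2];
    [X, Lam] = eig(A);        % columns of X are eigenvectors, diag(Lam) the eigenvalues
    diag(Lam)                 % returns 1/3 and 5/6 (in some order)
    A^10                      % powers of A shrink toward the zero matrix, since 5/6 < 1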

3.3.2 Convergence Conditions in Terms of ρ

By combining (3.3.9) and (3.3.18), we see that a necessary and sufficient condition for the geometric-series representation in (3.3.8) to hold is that the maximum eigenvalue of the matrix $(\rho W)$ be less than one. But for each eigenvalue, $\lambda$, of $W$, say with eigenvector, $x$, it follows at once from (3.3.11) that

(3.3.21)    $W x = \lambda x \;\Rightarrow\; \rho W x = \rho \lambda x \;\Rightarrow\; (\rho W) x = (\rho \lambda) x$

and thus that $\rho \lambda$ is automatically an eigenvalue for $(\rho W)$, so that

(3.3.22)    $Eig(\rho W) = \rho\, Eig(W)$

In particular, since this implies that

(3.3.23)    $|\lambda|_{\rho W} = |\rho|\, |\lambda|_W = |\rho|\, \lambda_W$

it follows that

(3.3.24)    $|\lambda|_{\rho W} < 1 \;\iff\; |\rho|\, \lambda_W < 1 \;\iff\; |\rho| < \dfrac{1}{\lambda_W}$

So for the present case of spatial weight matrices, $W$, the general convergence condition in (3.3.18) now takes the form

(3.3.25)    $\lim_{k\to\infty} \rho^k W^k = O_n \;\iff\; |\rho| < 1/\lambda_W$

so that by (3.3.8) and (3.3.9),

(3.3.26)    $(I_n - \rho W)^{-1} = \sum_{k=0}^{\infty} \rho^k W^k \;\iff\; |\rho| < 1/\lambda_W$

Note in particular that if the maximum eigenvalue of $W$ happens to be unity, i.e., $\lambda_W = 1$, then (3.3.26) takes the simple and appealing form8

7
See Section ?? in the Appendix for a general development of this result.
8
Here it must be stressed that in spite of the apparent similarity of the condition, $|\rho| < 1$, to the properties of correlation coefficients, this spatial dependency parameter, $\rho$, is not a correlation coefficient.




(3.3.27)    $(I_n - \rho W)^{-1} = \sum_{k=0}^{\infty} \rho^k W^k \;\iff\; |\rho| < 1$

For this reason, it is often convenient to normalize $W$ to have a maximum eigenvalue of one. The simplest procedure for doing so is to divide $W$ by its maximum eigenvalue, $\lambda_W$, say $W^* = \frac{1}{\lambda_W} W$. For this normalized weights matrix, it then follows from the same argument in (3.3.21) through (3.3.23) that

(3.3.28)    $Eig(W^*) = Eig\!\left(\tfrac{1}{\lambda_W} W\right) = \tfrac{1}{\lambda_W}\, Eig(W) \;\;\Rightarrow\;\; \lambda_{W^*} = \tfrac{1}{\lambda_W}(\lambda_W) = 1$

and thus that (3.3.27) always holds for $W = W^*$. In fact, this is the primary motivation for the normalizing convention in expression (2.1.25) of Section 2 above.

Before proceeding, it is important to note that row normalized weight matrices, $W_{rn}$, must also exhibit this same property. This can be seen in part by observing that the normalizing condition (2.1.19) for $W_{rn}$ in Section 2 can be written as

(3.3.29)    $1 = \sum_{j=1}^{n} w_{ij} = [w_{i1},..,w_{in}] \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = w_i 1_n, \quad i = 1,..,n$

where $w_i$ is the $i$-th row of $W_{rn}$. This set of conditions can in turn be written in matrix form as

(3.3.30)    $\begin{pmatrix} w_1 \\ \vdots \\ w_n \end{pmatrix} 1_n = \begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} \;\;\Rightarrow\;\; W_{rn} 1_n = 1_n$

which shows that $1_n$ must always be an eigenvector of $W_{rn}$ with unit eigenvalue. Thus for the row normalization of any spatial weights matrix, $W$, we must have $1 \in Eig(W_{rn})$. In addition, it is shown in Section ?? of the Appendix that this unit eigenvalue is necessarily the maximum eigenvalue of $W_{rn}$, and thus that (3.3.27) must always hold for row normalized matrices.

3.3.3 A Steady-State Interpretation of Spatial Autoregressive Residuals

Assuming that W satisfies (3.3.25), it remains to give a spatial interpretation of the


expanded representation of ( I n  W )1 in (3.3.8). To do so, it is useful to start by
considering the direct influences among areal units as implied by a given spatial weights


matrix, $W$. This is well illustrated by the example in expression (2.1.22) of Section 2, which we reproduce here for convenience,

(3.3.31)    $W = \begin{pmatrix} 0 & w_{12} & w_{13} \\ w_{21} & 0 & w_{23} \\ w_{31} & w_{32} & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$

for three areal units, $R_1$, $R_2$, $R_3$, arranged in a row. In this example, the only direct influences are between unit 2 and each of the other units, 1 and 3. This can be represented by the following graph, with areal units as “nodes” and positive weights as directed “links”:

(3.3.32)    $1 \;\leftrightarrows\; 2 \;\leftrightarrows\; 3$

So, for example, two of these arrows indicate that unit 2 directly influences both units 1 and 3. Now consider the square of this weight matrix,

(3.3.33)    $W^2 = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 2 & 0 \\ 1 & 0 & 1 \end{pmatrix}$

If one thinks of direct links as influence paths of length 1, then the ij elements of $W^2 = (w_{ij}^{(2)})$ are precisely the numbers of influence paths of length 2 from $j$ to $i$. In particular, each $m$-th term of the ij-value, $w_{ij}^{(2)} = \sum_{m=1}^{3} w_{im} w_{mj}$, of $W^2$ contributes a value of 1 to this sum if and only if both $w_{im}$ and $w_{mj}$ are 1, i.e., if and only if there is a path, $j \to m \to i$, of length 2. For example, while unit 3 does not directly influence unit 1, there is an indirect influence on the path, $3 \to 2 \to 1$, seen in (3.3.32). This single influence path of length 2 corresponds to the 1 in the upper right hand corner of $W^2$. Notice also that while the diagonal elements of $W$ are zero by construction, this is not true of $W^2$. For example there is now an influence path of length 2 from unit 1 to itself, namely the path $1 \to 2 \to 1$, in which 1’s influence on 2 is “echoed back” as a second order influence on 1. In a similar manner, the ij elements of the $k$-th power, $W^k = (w_{ij}^{(k)})$, of $W$ indicate the number of length-$k$ paths from $j$ to $i$. But notice in the present example, that these relations depend explicitly on the fact that $W$ consists entirely of zeroes and ones. More generally, for any n-square weights matrix, $W$, the ij elements of the $k$-th power, $W^k = (w_{ij}^{(k)})$, of $W$ take the form9

(3.3.34)    $w_{ij}^{(k)} = \sum_{m_1=1}^{n}\Big[ \sum_{m_2=1}^{n}\big[ \cdots \big[ \sum_{m_{k-1}=1}^{n} w_{i m_1} w_{m_1 m_2} \cdots w_{m_{k-1} j} \big] \cdots \big] \Big]$

9
For a deeper discussion of such influence paths see Martellosio (2012).


where each positive product, $w_{i m_1} w_{m_1 m_2} \cdots w_{m_{k-1} j}$, in $w_{ij}^{(k)}$ still corresponds to a unique path, $j \to m_{k-1} \to \cdots \to m_1 \to i$, of positive influences – but where this product need not be unity. Moreover, if we now introduce the spatial dependency parameter, $\rho$, and consider the $k$-th power, $\rho^k W^k$, then (3.3.34) becomes

(3.3.35)    $\rho^k w_{ij}^{(k)} = \sum_{m_1=1}^{n}\Big[ \sum_{m_2=1}^{n}\big[ \cdots \big[ \sum_{m_{k-1}=1}^{n} (\rho\, w_{i m_1})(\rho\, w_{m_1 m_2}) \cdots (\rho\, w_{m_{k-1} j}) \big] \cdots \big] \Big]$

In this form, it is clear that the $w$-values along each path reflect only the relative influences of each link, where typically such influences will be smaller on links between more widely separated units. The full influences of these links are then determined by $\rho$.
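
This path-counting interpretation is easy to verify numerically for the three-unit example above (a small MATLAB sketch; the (1,3) entry of W^2 records the single length-2 path 3 → 2 → 1, and its diagonal entries record the “echo” paths):

    W = [0 1 0; 1 0 1; 0 1 0];   % contiguity weights from (3.3.31)
    W2 = W^2                     % entry (i,j) = number of length-2 paths from j to i
    W3 = W^3                     % entry (i,j) = number of length-3 paths from j to i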

With these preliminary observations, it should now be clear that the geometric sum in (3.3.8) represents the cumulative effect of all these direct and indirect spatial influences among units. This can be seen more explicitly by using (3.3.8) to expand (3.2.5) as follows:

(3.3.36)    $u = (I_n - \rho W)^{-1}\varepsilon = (I_n + \rho W + \rho^2 W^2 + \cdots)\,\varepsilon = \varepsilon + \rho W \varepsilon + \rho^2 W^2 \varepsilon + \cdots$

So for any given vector of intrinsic effects, $\varepsilon = (\varepsilon_1,..,\varepsilon_n)$, expression (3.3.36) displays the accumulation of all direct and indirect effects of $\varepsilon$ that define the vector, $u = (u_1,..,u_n)$, of autoregressive residuals. This is illustrated graphically in Figure 3.7 below for the “over the fence” communications example in Figure 3.1 (for the case of $n = 7$ neighbors):

[Diagram: the seven intrinsic effects $\varepsilon_1,..,\varepsilon_7$, plus the first-order term $\rho W \varepsilon$ and the second-order term $\rho^2 W^2 \varepsilon$, spreading outward across the neighbors]

Figure 3.7. Spatial Ripple Effect


Here we only show the first three terms of (3.3.36), where the first term reflects the initial (intrinsic) opinions of each neighbor, and where subsequent terms represent the cumulative indirect influences on these opinions resulting from over-the-fence communications. Alternatively, if one were to imagine each initial opinion as a pebble falling into water, then the influences of these opinions spread out like “ripples” in all directions. (An empirical example of such a ripple effect is given in Figure 7.8 below.)

More generally this example suggests that spatial autoregressive residuals, $u$, can be viewed as the steady state of an implicit spatial diffusion process generated by a random vector of intrinsic effects, $\varepsilon$. Of course, the spatial autoregressive model in (3.9) is static in nature, and involves no explicit notion of time. But such cumulative effects can nonetheless be usefully represented as a steady state over virtual time periods as shown in Figure 3.8 below.

[Diagram: virtual time periods $\ldots, t_{-4}, t_{-3}, t_{-2}, t_{-1}, t_0$, with the terms $\varepsilon$, $\rho W \varepsilon$, $\rho^2 W^2 \varepsilon$, $\rho^3 W^3 \varepsilon$, … accumulating in each period]

Figure 3.8. Steady State Interpretation

Here for example, W  , in the “current” state, t0 , is interpreted as the direct effect of 
in the “previous” state, t1 , and similarly,  2W 2 , in t0 is the indirect effect of  in t2 .
The main feature of this representation is that the total effect in each state resulting from
all previous states remains the same, thus yielding a “steady state” independent of time.

But regardless of whether or not this steady state interpretation is used, the essential
result here is that the reduced-form representation of spatial autoregressive residuals, u, in
(3.3.36) does indeed incorporate all direct and indirect effects generated by $\varepsilon$ in the
presence of spatial structure, $W$.

One final point needs to be made about this reduced-form representation. It is often
observed that this representation is not essential for the existence of the inverse
$(I_n - \rho W)^{-1}$. For example, if $W$ is given by (3.3.30), and say, $\rho = 2$, then it may be


verified (by direct matrix multiplication) that the inverse of this matrix exists, and is
given (approximately) by

(3.3.37)   $(I_3 - \rho W)^{-1} = \begin{bmatrix} 0.4286 & -0.2857 & -0.5714 \\ -0.2857 & -0.1429 & -0.2857 \\ -0.5714 & -0.2857 & 0.4286 \end{bmatrix}$

But while this inverse exists, it is far more difficult to interpret in a meaningful way. In
particular, the negative elements in this matrix are rather questionable. Note in particular
from the positivity of $\rho$ that $\rho W$ must be a nonnegative matrix. So it seems clear from
the basic autoregressive relation, $u = \rho W u + \varepsilon$, that a positive increase in the components
of $\varepsilon$ should certainly not decrease any component of $u$. However, (3.3.37) and
(3.3.36) together imply for example that the second component, $u_2$, is related linearly to
$\varepsilon = (\varepsilon_1, \varepsilon_2, \varepsilon_3)'$ by

(3.3.38)   $u_2 = (-0.2857)\,\varepsilon_1 + (-0.1429)\,\varepsilon_2 + (-0.2857)\,\varepsilon_3$

Thus an increase in either $\varepsilon_1$ or $\varepsilon_3$ will decrease $u_2$.

But such problems do not arise when this inverse is representable as in (3.3.36). In the
present case, observe that since $\|W\| = \sqrt{2} \approx 1.414$ for this $W$ matrix, it follows that if
$|\rho| < 1/\|W\| \approx 0.707$, then (3.3.36) must hold. But in this case, the nonnegativity of $\rho$
ensures that every term of the expansion, $\sum_{k \ge 0} \rho^k W^k$, must be nonnegative, so that
$(I_n - \rho W)^{-1}$ is always nonnegative. For example, if $\rho = .5$ then it can again be verified
that

(3.3.39)   $(I_3 - \rho W)^{-1} = \begin{bmatrix} 1.5 & 1 & .5 \\ 1 & 2 & 1 \\ .5 & 1 & 1.5 \end{bmatrix}$

So positive spatial dependencies here imply that spatial autoregressive residuals, $u$, are
always monotone nondecreasing in the components of $\varepsilon$.
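Both cases can also be checked numerically. The MATLAB sketch below assumes the simple 3-unit contiguity matrix of (3.3.30), in which units 1 and 3 are each linked only to unit 2 (restated here as an assumption, consistent with $\|W\| = \sqrt{2}$).

% Assumed 3-unit contiguity matrix from (3.3.30)
W = [0 1 0; 1 0 1; 0 1 0];

% rho = 2: the inverse exists but has negative elements, as in (3.3.37)
disp(inv(eye(3) - 2*W))

% rho = 0.5 < 1/norm(W): the inverse is nonnegative, as in (3.3.39), and agrees
% with a truncated version of the geometric expansion in (3.3.36)
rho = 0.5;  S = zeros(3);
for k = 0:50
    S = S + rho^k * W^k;       % partial sums of  I + rho*W + rho^2*W^2 + ...
end
disp(inv(eye(3) - rho*W))
disp(S)                        % essentially identical to the exact inverse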

Finally, it should be emphasized that the negative signs in (3.3.37) are no accident. In fact
it is shown in Section ?? of the Appendix that all elements of $(I_n - \rho W)^{-1}$ are
nonnegative if and only if $|\rho| < 1/\|W\|$. So while the steady-state representation in
(3.3.36) is not strictly necessary for the existence of a reduced form solution for u, it
characterizes those cases where a meaningful spatial interpretation of these residuals can
be given.


4. Testing for Spatial Autocorrelation

To apply the spatial autoregressive model above, we start by restating the linear model
(for n areal units) in expression (3.2) above, where the residuals, u, are now specified
more explicitly as:

(4.1)   $Y = X\beta + u \;, \quad u \sim N(0, V)$

with both the parameter vector, $\beta$, and covariance matrix, $V$, unknown. The simplest
procedure for specifying the residual covariance is to start by assuming that

(4.2)   $V = \sigma^2 I_n$ ,

so that  can be estimated by OLS. Given this estimate, one can then test to see whether
there is sufficient spatial autocorrelation in the residuals to warrant more elaborate
specifications of V . If for a given sample, y (i.e., observed realization of Y ) we denote
the OLS estimate of  by

(4.3) ˆ  ( X X )1 X y

and corresponding OLS residuals by

(4.4) û  y  X ˆ

then our objective is to develop statistical tests for the presence of spatial autocorrelation
using these residuals. To do so, we assume that the underlying spatial structure of these n
areal units is representable by a given spatial weights matrix, $W = (w_{ij} : i,j = 1,..,n)$. In
terms of W, it is then hypothesized that all relevant spatial autocorrelation among the
residuals, $u$, in (4.1) can be captured by the spatial autoregressive model,

(4.5)   $u = \rho W u + \varepsilon \;, \quad \varepsilon \sim N(0, \sigma^2 I_n)$

The key feature of this hypothesis is that testing for spatial autocorrelation then reduces
to testing the null hypothesis:

(4.6)   $H_0 : \rho = 0$

For if $H_0$ cannot be rejected, then (4.5) reduces to

(4.7)   $u = \varepsilon \sim N(0, \sigma^2 I_n)$ ,

so that the OLS specification of $V$ in (4.2) above is appropriate. If not, then some more
elaborate specification of $V$ needs to be considered.


4.1. Three Test Statistics

In this context, our main objective is to construct appropriate test statistics based on û
and W for testing H 0 . In the following subsections, we shall consider three alternative
test statistics that are in common use.

4.1.1 Rho Statistic

Given model (4.5), one natural approach is simply to treat the OLS residuals, $\hat{u}$, as a
sample of $u$, and use model (4.5) to obtain a corresponding OLS estimate of $\rho$. To do
so, recall the "one variable regression" illustration given in class, where we started with a
linear model:

(4.1.1)   $Y_i = x_i b + \varepsilon_i \;, \quad i = 1,..,n$

In vector form, this is seen to yield the special case of (4.1) where $X = x$ and $\beta = b$ is
a scalar, i.e.,

(4.1.2)   $Y = x b + \varepsilon$

Hence, as a special case of (4.3), the OLS estimate of the scalar, $b$, in (4.1.2) is given by

(4.1.3)   $\hat{b} = (x'x)^{-1} x'y = \dfrac{x'y}{x'x} = \dfrac{x'y}{\|x\|^2}$

But (4.5) can be viewed as a model of the form (4.1.2), where $b = \rho$, $Y = u$ and $x = Wu$.
Hence, for our present "data", $y = \hat{u}$, the corresponding OLS estimate of $\rho$ is given by

(4.1.4)   $\hat{\rho}_W = \dfrac{(W\hat{u})'\hat{u}}{(W\hat{u})'(W\hat{u})} = \dfrac{\hat{u}'W\hat{u}}{\hat{u}'W'W\hat{u}} = \dfrac{\hat{u}'W\hat{u}}{\|W\hat{u}\|^2}$

This yields our first test statistic for $H_0$, which we designate as the rho statistic. Note
also that we use the subscript "$W$" to emphasize that this statistic (and those below)
depends explicitly on the choice of W.
Having constructed this statistic, it is of interest to observe that the basic spatial
autocorrelation test we have been using so far, namely regressing residuals on nearest-
neighbor residuals, is essentially a special case of this rho statistic. To see this, observe
that the ith row of (4.5) is of the form:

(4.1.5)   $u_i = \rho\,(Wu)_i + \varepsilon_i = \rho \sum_{j=1}^{n} w_{ij} u_j + \varepsilon_i$

But if W is chosen to be the first nearest-neighbor matrix (i.e., $w_{ij} = 1$ if $j$ is the nearest
neighbor of $i$ and $w_{ij} = 0$ otherwise) and if we let $nn(i)$ denote the first nearest neighbor
of each point $i$, then by construction,


(4.1.6)   $\sum_{j=1}^{n} w_{ij} u_j = u_{nn(i)}$

So for this case, (4.1.5) is of the form

(4.1.7)   $u_i = \rho\, u_{nn(i)} + \varepsilon_i$

But this is almost exactly the regression we have been using, where the important slope
coefficient is now precisely $\rho$. So the test for significance of this slope is based on the
estimator, $\hat{\rho}_W$. Notice however that unlike our regression, there is no intercept in (4.1.7).
This makes sense theoretically since1

(4.1.8)   $E(u_i) = E(u_{nn(i)}) = E(\varepsilon_i) = 0$ ,

which in turn implies that the intercept term must also be zero in this model. So in fact,
(4.1.7) is the model we should have been using. But since regression residuals must sum
to zero by the construction of OLS,2 the intercept is usually not statistically significant.
This is well illustrated by the regression of Myocardial Infarctions on the Jarman Index in
Section 1.3 above. Residual regressions with and without the intercept are compared in
Figures 4.1 and 4.2 below. Notice in Figure 4.1 that the intercept is close to zero and
completely insignificant. More importantly, notice that the t-values for the slope
coefficients in both figures are virtually identical.

[Scatterplots of Resids on nn-Resids omitted; the fitted parameter estimates are as follows]

Figure 4.1. Regression with Intercept

   Term        Estimate    Std Error   t Ratio   Prob>|t|
   Intercept   0.0058051   0.015586    0.37      0.7100
   nn-Resids   0.5111284   0.063697    8.02      <.0001*

Figure 4.2. Regression with No Intercept

   Term        Estimate    Std Error   t Ratio   Prob>|t|
   nn-Resids   0.5100272   0.063483    8.03      <.0001*

1 Recall from the reduced form of (4.5) that $u = (I_n - \rho W)^{-1}\varepsilon$, so that $E(u) = (I_n - \rho W)^{-1} E(\varepsilon) = 0$.
2 This is established in expression (4.2.9) below.


However, it is also important to notice that from a regression viewpoint, models like
(4.1.7) are seriously flawed. In particular, since the same random vector, $u$, appears both
on the left and right hand sides, this regression suffers from what is called an
"endogeneity problem". Here it can be shown that $\hat{\rho}_W$ is actually an inconsistent
estimator of $\rho$, which means that even for very large sample sizes, there is no guarantee
that $\hat{\rho}_W$ will eventually be close to the true value. Nevertheless, we have already seen
that the p-values derived from this regression are generally quite reasonable. So even
though we will develop a more satisfactory Monte Carlo approach using this $\hat{\rho}_W$ statistic,
the regression approach in (4.1.7) is generally quite robust and easy to perform.
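For completeness, a minimal MATLAB sketch of this no-intercept residual regression is given below. The vectors uh (OLS residuals) and nn (the index of each unit's first nearest neighbor) are assumed to be available; the standard-error formula is the conventional one for a one-variable regression through the origin.

% Sketch: no-intercept regression of residuals on nearest-neighbor residuals (4.1.7)
x  = uh(nn);                    % nearest-neighbor residuals u_nn(i)
b  = (x' * uh) / (x' * x);      % slope estimate, as in (4.1.3)
e  = uh - b * x;                % regression residuals
s  = sqrt((e' * e) / (length(uh) - 1));   % residual standard deviation
t  = b / (s / norm(x));         % t-ratio for the slope (compare Figure 4.2)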

4.1.2 Correlation Statistic

But given the inconsistency of $\hat{\rho}_W$ as an estimator of $\rho$, it is appropriate to seek less
restrictive approaches to identifying $\rho$. One simple observation is to note in (4.1.5) that
if $\rho > 0$ then for each $i = 1,..,n$, the ith row, $(Wu)_i$, of $Wu$ contributes positively to the ith
component, $u_i$, of $u$. So in this case, the random variables, $u_i$ and $(Wu)_i$, should exhibit
some degree of positive correlation for each $i = 1,..,n$. Similarly, $u_i$ and $(Wu)_i$ should
be negatively correlated for all $i$ when $\rho < 0$. Hence it stands to reason that $u_i$ and
$(Wu)_i$ should also be uncorrelated for all $i$ when $\rho = 0$. The argument here is slightly
less obvious, but can be seen as follows. Since $\rho = 0$ implies that $u = \varepsilon$, we must also
have $Wu = W\varepsilon$. But since the components of $\varepsilon$ are all independent, and since $w_{ii} = 0$
implies that $(W\varepsilon)_i = \sum_{j \ne i} w_{ij}\varepsilon_j$ does not involve $\varepsilon_i$, it then follows that $\varepsilon_i$ and $(W\varepsilon)_i$
must be independent for all $i$, and hence uncorrelated. In short, all pairs of random
variables $[u_i, (Wu)_i]$ are correlated with the same sign (positive, negative, or zero).

Hence if the OLS regression residuals, $\hat{u}$, are taken to be a sample of $u$ (so that $W\hat{u}$ is a
sample of $Wu$) then all sample pairs $[\hat{u}_i, (W\hat{u})_i]$ must be correlated with the same sign.
This suggests that, as a summary measure, the sample correlation between vectors, $\hat{u}$
and $W\hat{u}$, should reflect this common sign. Since all these random variables have zero
means by construction,3 we start by observing that the correlation between any zero-mean
random variables, $X$ and $Y$, is given by4

(4.1.9)   $\rho(X,Y) = \dfrac{\mathrm{cov}(X,Y)}{\sigma(X)\,\sigma(Y)} = \dfrac{E(XY)}{\sqrt{E(X^2)\,E(Y^2)}}$

Hence the appropriate sample estimator of $\rho(X,Y)$ is constructed as follows. If for any
random samples, $(x_i, y_i),\; i = 1,..,n$, of $(X,Y)$ we employ the natural sample estimators,

3 Again, $(Wu)_i = \sum_j w_{ij} u_j$ implies from (4.1.8) that $E[(Wu)_i] = \sum_j w_{ij} E(u_j) = 0$.
4 Recall that $\mathrm{cov}(X,Y) = E(XY) - E(X)E(Y) = E(XY)$, so that $\mathrm{var}(X) = \mathrm{cov}(X,X) = E(X^2)$.


  
(4.1.10)   $\hat{E}(XY) = \tfrac{1}{n}\sum_{i=1}^{n} x_i y_i \;, \quad \hat{E}(X^2) = \tfrac{1}{n}\sum_{i=1}^{n} x_i^2 \;, \quad \hat{E}(Y^2) = \tfrac{1}{n}\sum_{i=1}^{n} y_i^2$

then the corresponding "plug in" estimator for $\rho(X,Y)$ is given by

(4.1.11)   $\hat{\rho}(X,Y) = \dfrac{\hat{E}(XY)}{\sqrt{\hat{E}(X^2)\,\hat{E}(Y^2)}} = \dfrac{\tfrac{1}{n}\sum_{i=1}^{n} x_i y_i}{\sqrt{\tfrac{1}{n}\sum_{i=1}^{n} x_i^2}\,\sqrt{\tfrac{1}{n}\sum_{i=1}^{n} y_i^2}} = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$

If we let $x = (x_1,..,x_n)'$ and $y = (y_1,..,y_n)'$, then in more common terminology, this
estimator is designated simply as the sample correlation, $r(x,y)$, between $x$ and $y$, i.e.,

(4.1.12)   $r(x,y) = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} = \dfrac{x'y}{\sqrt{x'x}\,\sqrt{y'y}} = \dfrac{x'y}{\|x\|\,\|y\|}$

In these terms, our second test statistic is given by the sample correlation between $\hat{u}$ and
$W\hat{u}$, i.e.,

(4.1.13)   $r_W = r(\hat{u}, W\hat{u}) = \dfrac{\hat{u}'W\hat{u}}{\|\hat{u}\|\,\|W\hat{u}\|}$

We designate this as the correlation statistic, or more simply as corr.

4.1.3 Moran Statistic

Up to this point we have focused mainly on constructing statistics for estimating the
value (or at least the sign) of $\rho$ in model (4.5). But we have given little attention to how
these statistics behave under the null hypothesis, $H_0$, in (4.6). One might suspect from
the inconsistency of $\hat{\rho}_W$ that this statistic exhibits little in the way of "optimal" behavior
under $H_0$. The sample correlation, $r_W$, does somewhat better in this respect. But from a
statistical viewpoint, it again suffers from another type of "inconsistency". For while the
classical sample correlation statistic assumes that $(x_i, y_i),\; i = 1,..,n$, are independent
random samples from the same statistical population $(X,Y)$, this is not true of the
samples $[\hat{u}_i, (W\hat{u})_i],\; i = 1,..,n$. Even under the null hypothesis, where (4.7) implies that
$(\hat{u}_i = \hat{\varepsilon}_i : i = 1,..,n)$ are independently and identically distributed, this is not true of the
samples, $(W\hat{\varepsilon})_i,\; i = 1,..,n$, which are neither independent nor identically distributed. So


there remains a question as to how well either of these statistics behaves with respect to
testing $H_0$. In this context, we now introduce our final test statistic,

(4.1.14)   $I_W = I(\hat{u}, W\hat{u}) = \dfrac{\hat{u}'W\hat{u}}{\hat{u}'\hat{u}} = \dfrac{\hat{u}'W\hat{u}}{\|\hat{u}\|^2}$

designated as the Moran statistic, or more simply as Moran's I.5 Here it is important to
emphasize that expression (4.1.14) is different from the version of Moran's I in [BG,
p.270] (also used in ARCMAP), which is designed for detecting autocorrelation in Y
itself. This can in fact be viewed as the special case of (4.1) in which there is only an
"intercept" term with coefficient, $\mu$, representing the common mean of the Y
components, i.e.,

(4.1.15)   $Y = \mu 1_n + u$

If $u$ is again assumed to satisfy (4.5), then under the null hypothesis, $\rho = 0$, the "OLS"
estimate in (4.3) reduces to the sample mean of $y$, i.e.,

(4.1.16)   $\hat{\mu} = (1_n' 1_n)^{-1} 1_n' y = (\tfrac{1}{n}) \sum_{i=1}^{n} y_i = \bar{y}$

Thus the "residuals" in (4.4) are here seen to be simply the deviations of the y
components about this sample mean, i.e.,

(4.1.17)   $\hat{u} = y - \bar{y}\,1_n$

So the appropriate version of Moran's I in this special case is seen to have the form,

(4.1.18)   $\tilde{I}_W = \dfrac{(y - \bar{y}1_n)'\,W\,(y - \bar{y}1_n)}{(y - \bar{y}1_n)'(y - \bar{y}1_n)} = \dfrac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(y_i - \bar{y})(y_j - \bar{y})}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

which is essentially the version used in [BG, p.270], except for the normalizing constant

(4.1.19)   $\kappa_W = \dfrac{n}{\sum_{i \ne j} w_{ij}}$

For simplicity we have simply dropped this constant [as for example in Tiefelsdorf
(2000, p.48)].6

5 Be careful not to confuse this use of "$I$" with the n-square identity matrix, $I_n$.
6 Notice that for the common case of row normalized W (with zero diagonal) it must be true that
$\sum_{i \ne j} w_{ij} = \sum_{i=1}^{n} (\sum_{j \ne i} w_{ij}) = \sum_{i=1}^{n} (1) = n$, so this constant is unity.


While this statistic is more difficult to motivate by simple arguments,7 it turns out to
exhibit better statistical behavior with respect to testing $H_0$ than either of the statistics
above.

4.1.4 Comparison of Statistics

By writing these three statistics side by side as

(4.1.20)   $\hat{\rho}_W = \dfrac{\hat{u}'W\hat{u}}{\|W\hat{u}\|^2} \;, \quad r_W = \dfrac{\hat{u}'W\hat{u}}{\|\hat{u}\|\,\|W\hat{u}\|} \;, \quad I_W = \dfrac{\hat{u}'W\hat{u}}{\|\hat{u}\|^2}$

we see that they exhibit striking similarities. Indeed, since the numerators are identical,
and since all denominators are positive, it follows that these three statistics must always
have the same sign. Hence the differences between them are not at all obvious.
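As a computational aside, all three statistics in (4.1.20) can be obtained in a few lines of MATLAB from a residual vector and a weights matrix. The sketch below is not one of the class programs; uh and W are simply placeholders for the OLS residuals and a given n-square weights matrix.

% The three test statistics in (4.1.20) for residuals uh and weights W
Wu    = W * uh;                         % spatial lag of the residuals
num   = uh' * Wu;                       % common numerator  u'Wu
rho_W = num / (Wu' * Wu);               % rho statistic (4.1.4)
r_W   = num / (norm(uh) * norm(Wu));    % correlation statistic (4.1.13)
I_W   = num / (uh' * uh);               % Moran statistic (4.1.14)

Since the numerator is shared, the three values necessarily agree in sign, exactly as noted above.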
But for testing purposes, the key issue is their relative behavior under the null hypothesis,
$H_0$. To study this behavior, it is necessary to express $\hat{u}$ more explicitly as a random
vector. To do so, observe first from (4.3) and (4.4) that

(4.1.21)   $\hat{u} = Y - X\hat{\beta} = Y - X(X'X)^{-1}X'Y = [I_n - X(X'X)^{-1}X']\,Y$

           $\quad\;\; = [I_n - X(X'X)^{-1}X']\,(X\beta + u)$

           $\quad\;\; = (X\beta - X\beta) + [I_n - X(X'X)^{-1}X']\,u$

           $\quad\;\; = [I_n - X(X'X)^{-1}X']\,u$

           $\;\Rightarrow\; \hat{u} = Mu$

where the matrix,

(4.1.22)   $M = I_n - X(X'X)^{-1}X'$

is seen to be symmetric, i.e., $M = M'$. Notice also that

(4.1.23)   $MM = [I_n - X(X'X)^{-1}X'][I_n - X(X'X)^{-1}X']$

           $\quad\;\; = [I_n - X(X'X)^{-1}X'] - X(X'X)^{-1}X' + X(X'X)^{-1}X'$

           $\quad\;\; = I_n - X(X'X)^{-1}X'$

           $\;\Rightarrow\; MM = M$

7 However, a compelling motivation of this statistic can be given in terms of the "concentrated likelihood
function" used in maximum likelihood estimation of $\rho$. We shall return to this question in Section (??)
after maximum likelihood estimation has been introduced.


Finally, to study the relative behavior of these estimators under $H_0$, recall from (4.5)
that $\rho = 0$ implies $u = \varepsilon \sim N(0, \sigma^2 I_n)$, so that $\hat{u}$ now takes the form

(4.1.24)   $\hat{u} = M\varepsilon$

with M given by (4.1.22), and satisfying $M' = M = MM$.8 The estimators in (4.1.20)
can then be expressed explicitly in terms of $\varepsilon \sim N(0, \sigma^2 I_n)$ as follows,

(4.1.25)   $\hat{\rho}_W = \dfrac{\varepsilon' M W M \varepsilon}{\|WM\varepsilon\|^2} \;, \quad r_W = \dfrac{\varepsilon' M W M \varepsilon}{\|M\varepsilon\|\,\|WM\varepsilon\|} \;, \quad I_W = \dfrac{\varepsilon' M W M \varepsilon}{\|M\varepsilon\|^2}$

In terms of these specific representations under $H_0$, the superiority of $I_W$ relative to $\hat{\rho}_W$
and $r_W$ can be illustrated in terms of the 190 Health Districts in the English Mortality
example above. Here we choose W to be a row normalized weight matrix [expression
(2.1.25) of Section 2 above] consisting of the five nearest neighbors of each district (with
respect to centroid distance).9 While the exact distributions of these statistics are difficult
to obtain,10 they can easily be approximated by simulating many random samples of $\varepsilon$.
In Figure 4.3 below, the approximate sampling distributions of these three statistics are
plotted using 10,000 simulated samples of $\varepsilon$ with $\sigma^2 = 1$.11
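A minimal sketch of this simulation is given below, written directly from the representations in (4.1.25). It assumes that a data matrix X (with intercept column) and weights matrix W are already in the workspace; it is not a listing of any class program.

% Simulated sampling distributions of Moran, corr and rho under H0
[n, k] = size(X);
M = eye(n) - X * ((X'*X) \ X');      % projection matrix (4.1.22)
N = 10000;
vals = zeros(N, 3);                  % columns: [Moran, corr, rho]
for s = 1:N
    e  = randn(n, 1);                % eps ~ N(0, I_n), i.e. sigma^2 = 1
    uh = M * e;                      % residuals under H0, as in (4.1.24)
    Wu = W * uh;
    num = uh' * Wu;
    vals(s,:) = [num/(uh'*uh), num/(norm(uh)*norm(Wu)), num/(Wu'*Wu)];
end
mean(vals)                           % compare with Table 4.1 below

Kernel-smoothed density plots of the three columns of vals (for example with ksdensity) reproduce the qualitative pattern in Figure 4.3.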
[Plot omitted: kernel-smoothed densities of the simulated Moran, corr, and rho values over the range -1 to 1]

Figure 4.3. Sampling Distributions

Note first that while all three distributions are roughly centered on the true value, $\rho = 0$,
there is actually some degree of bias in all three. The simulated means for these three

8 The conditions $M' = M = MM$ together imply that M is an orthogonal projection matrix.
9 Since the row sums are always 5, i.e., since $W 1_n = (5) 1_n$, it turns out that $\lambda_W = 5$, and thus that the max-
eigenvalue normalization [(2.1.25) of Section 2 above] and row normalization for this particular W matrix
are the same.
10 Exact distribution results for Moran's I have been obtained by Tiefelsdorf (2000, Chap. 7).
11 Density estimation was done using the kernel-smoothing procedure, ksdensity.m, in MATLAB.


statistics are displayed in Table 4.1 below, and show that the mean of Moran's I is in fact
an order of magnitude closer to zero than the other two. Moreover, the exact theoretical
mean for Moran in this case [expression (4.2.3) below] can be calculated to be
$-0.00655$, which shows that for a sample of size 10,000 these simulated values are quite
accurate.

Moran -0.0067
corr -0.0198
rho -0.0851

Table 4.1. Mean Values

But Figure 4.3 also suggests that relative bias among these three estimators is far less
important than their relative variances. Indeed it is here that the real superiority of $I_W$ is
evident. While the variance of $I_W$ under $H_0$ is known [see expression (4.2.5) below], its
exact relation to the variances of $r_W$ and $\hat{\rho}_W$ under $H_0$ is difficult to obtain analytically.
But simulations with many examples show that these English Mortality results are quite
typical. In fact, even for individual realizations of $\varepsilon$ it is generally true that

(4.1.26)   $|I_W(\varepsilon)| \le |r_W(\varepsilon)| \le |\hat{\rho}_W(\varepsilon)|$

While counterexamples show that (4.1.26) need not hold in all cases, this ordering was
exhibited by all 10,000 simulations in the English Mortality example.

In summary, this example shows why Moran's I is by far the most widely used statistic
for testing spatial autocorrelation. Given its relative unbiasedness and efficiency (small
variance) properties under $H_0$, Moran's I tends to be the most reliable tool for detecting
spatial autocorrelation.12

4.2. Asymptotic Moran Tests of Spatial Autocorrelation

Given the superiority of Moran's I, the most common procedure for testing $H_0$ is to use
the asymptotic normality of $I_W$ under $H_0$, first established by Cliff and Ord (1973) [see
also Cliff and Ord (1981, pp.47-51), [BG], p.281 and Cressie (1993), p.442]. Since an
asymptotic testing procedure using Moran's I is available in ARCMAP, it is of interest to
develop this procedure here. But before doing so, it must be emphasized that the test used
in ARCMAP is based on the version of Moran's I in expressions (4.1.18) and (4.1.19)
above, which we here denote by

(4.2.1)   $\tilde{I}_W = \left( \dfrac{n}{\sum_{i \ne j} w_{ij}} \right) \dfrac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij} (y_i - \bar{y})(y_j - \bar{y})}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$

12 For a deeper discussion of its optimality properties, see Section 4.3.2 in Tiefelsdorf (2000).


The mean and variance of this statistic under $H_0$ are given in [BG, p.281]. For our
present purposes, it is enough to observe that the mean of $\tilde{I}_W$ has a very simple form,

(4.2.2)   $E(\tilde{I}_W) = -\dfrac{1}{n-1}$

which is made possible precisely by introducing the normalizing factor, $\kappa_W$, in (4.1.19).
This yields an expression which is independent of W (and thus serves to motivate the use
of $\kappa_W$). Note also that like the English Mortality example in Table 4.1 above, this result
shows there is always a slight negative bias in $\tilde{I}_W$ under $H_0$, which shrinks to zero as n
becomes large.

However, the simplicity of (4.2.2) disappears when analyzing $I_W$ for regression
residuals, $\hat{u} = M\varepsilon$, under $H_0$. Here even the mean of $I_W$ has the more complex form:13

(4.2.3)   $E(I_W) = \tfrac{1}{n-k}\, \mathrm{tr}(MW)$

where M is given in terms of the $n \times k$ data matrix, X, by (4.1.22), and where the
trace, $\mathrm{tr}(A)$, of any n-square matrix, $A = (a_{ij})$, is defined to be the sum of its diagonal
elements, i.e.,

(4.2.4)   $\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii}$

So in the case of regression, $E(I_W)$ is seen to depend not only on W but also on the
particular data matrix, X. This is also true for the variance of $I_W$, which has the more
complex form:

(4.2.5)   $\mathrm{var}(I_W) = \dfrac{\mathrm{tr}(MWMW') + \mathrm{tr}(MWMW) + [\mathrm{tr}(MW)]^2}{(n-k)(n-k+2)} - [E(I_W)]^2$

So before considering the testing procedure in ARCMAP, it is appropriate to consider the


more general asymptotic test for regression residuals.

13 The mean (4.2.3) and variance (4.2.5) of $I_W$ under $H_0$ are taken from Tiefelsdorf (2000, p.48). The
original derivations of these moments (using the normalizing factor, $\kappa_W$) can be found in Cliff and Ord
(1981, Sections 8.3.1 and 8.3.2).


4.2.1 Asymptotic Moran Test for Regression Residuals


Given the above mentioned asymptotic normality property of $I_W$ under $H_0$, it follows
that if expressions (4.2.3) and (4.2.5) are now employed to standardize this statistic as,

(4.2.6)   $Z_W = \dfrac{I_W - E(I_W)}{\sqrt{\mathrm{var}(I_W)}}$

then under "appropriate conditions", this standardized statistic, $Z_W$, should be
approximately standard normally distributed, i.e.,

(4.2.7)   $Z_W \approx_d N(0,1)$

where (as in Section 3.1.3 of Part II) the notation, $\approx_d$, means "is approximately
distributed as". A more detailed description of these "appropriate conditions" will be
given at the end of Section 4.2.2 below. So for the present, we simply assume that the
approximation in (4.2.7) is valid.
Given this assumption, the appropriate testing procedure is operationalized in the
MATLAB program, moran_test_asymp.m. To apply this program to the English
Mortality example, let y = “lnMI” and x = “lnJarman” (as in Section 1.3 above) and
denote the 5-nearest neighbor weight matrix by W. This test can then be run with the
command:
>> moran_test_asymp(y,x,W);
Note that this program actually runs the OLS regression, calculates the residuals, û , and
then calculates ZW in (4.2.6). The test results are reported as screen outputs:

Moran = 0.46405
Zval = 11.3044
Pval < 10^(-323)

Here the calculated value, ZW , is denoted by Zval and is seen to be more than 11
standard deviations above the mean. This suggests that there is simply no chance at all
that these residual values (shown in Figure 4.4 below) could be spatially independent.14
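For readers who wish to see the calculation itself, the sketch below restates the core of this test directly from expressions (4.2.3) through (4.2.6). It is only an illustration of those formulas with placeholder variable names (y, Xv, W), not a listing of moran_test_asymp.m.

% Asymptotic Moran test for OLS residuals, following (4.2.3) - (4.2.6)
[n, kx] = size(Xv);
X  = [ones(n,1), Xv];                 % data matrix with intercept column
k  = kx + 1;
M  = eye(n) - X * ((X'*X) \ X');      % projection matrix (4.1.22)
uh = M * y;                           % OLS residuals
I  = (uh' * W * uh) / (uh' * uh);     % Moran statistic (4.1.14)
EI = trace(M*W) / (n - k);            % mean under H0, (4.2.3)
VI = (trace(M*W*M*W') + trace(M*W*M*W) + trace(M*W)^2) ...
      / ((n-k)*(n-k+2)) - EI^2;       % variance under H0, (4.2.5)
Z  = (I - EI) / sqrt(VI);             % standardized statistic (4.2.6)
P  = 1 - normcdf(Z);                  % one-sided p-value for positive autocorrelation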

4.2.2 Asymptotic Moran Test in ARCMAP

As mentioned above, the Moran test used in ARCMAP relies on $\tilde{I}_W$ in (4.2.1) rather than
$I_W$, and essentially tests whether a given set of spatial data, y, can be distinguished from
independent normal samples. This procedure can again be illustrated using the English

14 In fact, the p-value here is so small that it is reported as "0" in MATLAB. In such cases, the program
simply reports "Pval < 10^(-323)", which is roughly the smallest number treated as nonzero in MATLAB.


Mortality data. But in doing so, it must be borne in mind that these regression residuals, $\hat{u}$,
are now treated as the basic data set "y" itself. This is always possible in the case of OLS
residuals since by definition the "sample mean", $\tfrac{1}{n} 1_n' \hat{u}$, of such residuals is identically
zero. This is a consequence of the following property of the projection matrix, M,

(4.2.8)   $MX = [I_n - X(X'X)^{-1}X']\,X = X - X = O_n$

which together with the definition, $X = (1_n, x_1,..,x_k)$, implies in particular that $M 1_n = 0$.
But for any realized value, $y$, of $Y$ it then follows from (4.1.21) that

(4.2.9)   $\hat{u} = My \;\Rightarrow\; 1_n'\hat{u} = 1_n' M y = (M 1_n)' y = 0$

and thus that $\tfrac{1}{n} 1_n' \hat{u}$ is always zero. This means that if we set $\hat{u} = y$ in (4.1.17), then
$\bar{y} = 0$, and we obtain no immediate contradictions. However, it is important to emphasize
that the mean and variance of $\tilde{I}_W$ are in principle very different from those of $I_W$,
which depend on the explanatory data, X.
Given this observation, we now proceed to test for spatial autocorrelation in $\hat{u}$ by using
$\tilde{I}_W$ rather than $I_W$. To do so, the OLS residuals of the regression of Myocardial
Infarction rates on the Jarman Index must first be imported to ARCMAP and joined to
the Eng_Mort.shp file as a new column, say resid, and saved as a new shapefile, say
OLS_resids. These residuals are shown in Figure 4.4 below, with positive residuals in
red, negative in blue, and with all values “close to zero” (i.e., within half a standard
deviation) shown in white. Here it is clear that while the Jarman Index is certainly a
significant predictor of Myocardial Infarction, these unexplained residuals are highly
correlated in space. Recall from the simple nearest-neighbor test that this correlation was
more significant than the Jarman Index itself. We now show that this degree of
significance is in fact even greater than in that simple heuristic test.

[Map omitted; scale bar 0 to 100 km]

Figure 4.4. OLS Residuals


For this illustration, we again use the above weight matrix, W, consisting of the first 5
nearest neighbors of each district. To use this weight matrix in ARCMAP, it must first be
converted to a text file using the MATLAB program, arcmap_wt_matrix.m, with the
command:

>> L = arcmap_wt_matrix(W);

Here you must be sure that W is not in “sparse” form (which can be seen by displaying
the first row of W). If it is, then use the command, W = full(W), to convert it to a full
matrix. The matrix output, L, should then have initial rows of the form:

1 2 0.2
1 3 0.2
1 8 0.2
1 15 0.2
1 16 0.2
  

This shows in particular that the first 5 nearest neighbors of district 1 are districts
(2,3,8,15,16). To import this matrix to ARCMAP, first open it in EXCEL and “clean” the
numerical format to look like the above. ARCMAP also requires an ID for these values,
which can be accomplished in three steps:

(i) First add a new column to the attribute table (say next to resid) labeled ID and
use the calculator (with “short integer”) to create values (1 2 3 …) by setting
ID = [FID] + 1.

(ii) Now add a new row at the top of the matrix, L, in EXCEL, and write the
identifier name, ID, so that L is now of the form:

ID
1 2 0.2
1 3 0.2
1 8 0.2
  

(iii) Finally, save this as a text file, say Wnn_5.txt, (to indicate that it includes the
first 5 nearest neighbors). This file will be used by ARCMAP below.

To apply the Moran test to the OLS residuals, resid, in ARCMAP, follow the path:

ArcToolbox > Spatial Statistics Tools


> Analyzing Patterns
> Spatial Autocorrelation (Morans I)


In the “Spatial Autocorrelation” window that opens, fill in the shapefile and field, and be
sure to check the “Generate Report” box. You are now going to use the option,

“GET_SPATIAL_WEIGHTS_FROM_FILE”

in the “Conceptualization of Spatial Relationships” window. (It is this option which


makes the ARCMAP test worthwhile!). Here, browse to the text file, Wnn_5.txt,
constructed above. The relevant portion of the Spatial Autocorrelation window should
now look as shown below (where the last file path will of course vary):

Figure 4.5. Spatial Autocorrelation Window

Click OK, and when the procedure terminates, you will get a report displayed. The most
relevant portion of this report is shown in Figure 4.7 below:

Figure 4.7. Moran Test Report

Before proceeding further, notice that while the value of Moran's I, (Index = 0.464048),
is the same as in Section 4.2.1 above, the Z value (ZScore = 11.188206) is slightly
different. This is because the mean and variance used to standardize Moran's I are different


in these two tests. Rather than using (4.2.3) and (4.2.5), the values used are those in [BG,
p.281], including (4.2.2) for the mean. In the present case, spatial autocorrelation is so
strong that there is little difference between these results. But this need not always be the
case.

It is also important to note that while this report contains all key test information, there is
a much better graphical representation that can be obtained by clicking the “HTML
Report File” that is shown here in blue. This graphic is shown in Figure 4.9 below.

Moran's Index: 0.464048
z-score: 11.188206
p-value: 0.000000

Given the z-score of 11.19, there is a less than 1% likelihood that this clustered
pattern could be the result of random chance.

Figure 4.9. Moran Test Report

This graphic facilitates the interpretation of the results by making it abundantly clear (in
the present case) that these test results show positive spatial correlation that is even more
significant than that of the heuristic nearest-neighbor approach used previously.

But it should be emphasized that while spatial correlation is visually evident in Figure 4.4
above, this will not always be the case. Moreover, it should also be stressed that the
Moran statistics, $\tilde{I}_W$ and $I_W$ (as well as $\hat{\rho}_W$ and $r_W$), are defined only with respect to a
given weight matrix, W. Hence it is advisable to use a number of alternative weight
matrices when testing for spatial autocorrelation. For example, one might try alternative
numbers of neighbors (say 4, 5, and 6), or more generally, weight matrices involving both
distance-based and boundary-based notions of spatial proximity. A general rule-of-
thumb is to try three substantially different matrices, $(W_1, W_2, W_3)$, that cover a range of
potentially relevant types of proximity. If the results for all three matrices are comparable
(as will surely be the case in the English Mortality example), then this will help to
substantiate these results. On the other hand, if there are significant differences in these


results, then an analysis of these differences may serve to yield further information about
the underlying structure of the unobserved spatial dependencies.

Finally it should be emphasized that, as with all asymptotic tests, these asymptotic Moran
tests require that the number of samples (areal units) be "sufficiently large". Moreover, it
is also required that the W matrix be "sufficiently sparse" (i.e., consist mostly of zero-
valued cells) to ensure that the Central Limit Theorem is working properly. In the present
case, with $n = 190$ spatial units and with each row of W containing only 5 nonzero
entries, this should be a reasonable assumption. But as with the Clark-Evans tests for
random point patterns, it is often difficult to know how well this normal approximation is
working.15

4.3. A Random Permutation Test of Spatial Autocorrelation

With this in mind, we now develop an alternative testing procedure based on Monte
Carlo methods that is more computationally intensive, but requires essentially no
assumptions about the distribution of test statistics under H 0 . The basic idea is very
similar to the “random relabeling” test of independence for point patterns in Section 5.6
of Part I. There we approximated the hypothesis of statistical independence by “spatial
indistinguishability”. Here we adopt the same approach by postulating that if the
particular spatial arrangement of sample points doesn’t matter, then neither should the
labeling of these points. More specifically, in the vector of regression residuals,
$\hat{u} = (\hat{u}_1,..,\hat{u}_n)$, it shouldn't matter which residual is labeled as "$\hat{u}_1$", "$\hat{u}_2$", or "$\hat{u}_n$". If so,
then regardless of what the joint distribution of these residuals $(\hat{u}_1,..,\hat{u}_n)$ actually is, each
relabeling $(\hat{u}_{\pi_1},..,\hat{u}_{\pi_n})$ of these residuals should constitute an equally likely sample from
this distribution. So under this spatial invariance hypothesis, $H_{SI}$, we may generate the
sampling distribution for any statistic, say $S(\hat{u}_1,..,\hat{u}_n)$, under $H_{SI}$ by simply evaluating
$S(\hat{u}_{\pi_1},..,\hat{u}_{\pi_n})$ for many random relabelings, $\pi$, of $(\hat{u}_1,..,\hat{u}_n)$. As we have seen for point-
pattern tests, this hypothesis can then be rejected if the observed value, $S(\hat{u}_1,..,\hat{u}_n)$,
appears to be an unusually high or low value from this sampling distribution.

Before operationalizing this procedure, it is important to stress that it is applicable to any
statistic constructible from this residual data. Hence, in the same way that different
weight matrices, W, can be used to reflect alternative notions of spatial proximity, it is
advisable to use a range of alternative test statistics for $H_{SI}$. Indeed, this is precisely why
the rho statistic and correlation statistic were developed above. While Moran's I appears
to be the best choice when residuals are multi-normally distributed, this is less clear in the
present nonparametric setting. So it seems reasonable to check the results for $I_W$ with
those of $r_W$ and $\hat{\rho}_W$. However, it should also be emphasized that a substantial body of
simulation results in the literature suggests that Moran's I tends to be robust with respect

15
For further discussion of these issues, see for example Tiefelsdorf (2000, Section 9.4.1), Anselin and Rey
(1991) and Anselin and Florax (1995).


to violations of normality.16 So while we shall report results for all three statistics,
Moran’s I tends to be the most reliable of these three.

4.3.1 SAC-Perm Test

With this overview, we now outline the steps of this testing procedure for $H_{SI}$, designated as
the permutation test of spatial autocorrelation, or more simply the sac-perm test. For
convenience we maintain the general notation, $S$, which can stand for either $I$, $\hat{\rho}$, or $r$.
Since higher positive values of each of these three statistics correspond to higher levels of
positive spatial autocorrelation, we assume that S exhibits this same ordering property. In
this setting, our test is designed as a one-tailed test of positive spatial autocorrelation
(paralleling the one-tailed test of clustering for K-functions). In particular, significant
positive (negative) spatial autocorrelation will again be reflected by low (high) p-values.
As with the asymptotic Moran tests above, this sac-perm test is defined with respect to a
given spatial weight matrix, W. Finally, it should be noted that while the notation, $u$, will
be used to represent the given residual data in this procedure, virtually all applications
will be in terms of OLS residuals, i.e., $u = \hat{u}$. With these preliminary observations, the
steps of this testing procedure are as follows:

Step 1. Let $u^0 = (u_1,..,u_n)$ denote the vector of observed residuals, and construct the
corresponding value, $S^0 = S(u^0)$, of statistic, $S$.

Step 2. Simulate N random permutations, $\pi^j = (\pi_1^j,..,\pi_n^j)$, of the integers $(1,..,n)$.17

Step 3. For each permutation, $\pi^j$, construct the corresponding permuted data vector,
$u(\pi^j) = (u_{\pi_1^j},..,u_{\pi_n^j})$, and the resulting value of $S$, denoted by $S^j = S[u(\pi^j)]$, $j = 1,..,N$.

Step 4. Rank the values $(S^0, S^1,.., S^N)$ from high to low, so that if $S^j$ is the $k$-th highest
value then $\mathrm{rank}(S^j) = k$.

Step 5. If $\mathrm{rank}(S^0) = k$ then define the p-value for this test to be

(4.3.1)   $P = k / (N+1)$

(i) If $P$ is low (say $P \le 0.05$) then conclude that there is significantly positive
spatial autocorrelation at the $P$-level of significance.

16 See for example the Monte Carlo results in Anselin and Rey (1991) and Anselin and Florax (1995).
17 So if n = 3 then the first permutation of (1,2,3) might be $\pi^1 = (\pi_1^1, \pi_2^1, \pi_3^1) = (2,3,1)$.
1 1 1 1


(ii) Conversely, if $P$ is high (say $P \ge 0.95$) then conclude that there is
significantly negative spatial autocorrelation at the $(1-P)$-level of
significance.

(iii) If neither (i) nor (ii) holds, then conclude that the spatial independence
hypothesis, $H_{SI}$, cannot be rejected.

The "cutoff" levels for significantly positive or negative spatial autocorrelation are
intentionally left rather vague. Indeed, this sac-perm test is meant to be
exploratory in nature.
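The five steps above are easy to program. The MATLAB sketch below illustrates them for the Moran statistic only (the class program sac_perm.m computes all three statistics and reports their ranges); uh, W and N are placeholders for the residual data, weights matrix and number of permutations.

% Sketch of the sac-perm test (Steps 1-5) using the Moran statistic
moran = @(u,W) (u' * W * u) / (u' * u);    % statistic S, here taken from (4.1.14)
S0 = moran(uh, W);                         % Step 1: observed value
N  = 9999;                                 % Step 2: number of random permutations
S  = zeros(N,1);
for j = 1:N
    p = randperm(length(uh));              % random relabeling of (1,..,n)
    S(j) = moran(uh(p), W);                % Step 3: value for permuted residuals
end
k = 1 + sum(S >= S0);                      % Step 4: rank of S0 among all N+1 values
P = k / (N + 1);                           % Step 5: p-value, as in (4.3.1)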

4.3.2 Application to the England Mortality Data

Recall from the asymptotic Moran tests in Section 4.2 that there was extremely strong
autocorrelation in the OLS residuals of the England Myocardial Infarction data when
regressed on the Jarman Index. We now reproduce those results in MATLAB using
sac_perm. Here the data can be found in the workspace, eng_mort.mat, where the OLS
residuals are in the vector, Res. Finally, recall that the desired weight matrix, Wnn_5, was
already constructed for Eire in Section 2.2.1 above. So the only difference here is that L
is now a 190x2 matrix of coordinates for English Health Districts. To construct a sac-
perm test of the residual data, Res, using this weight matrix, we can employ 9999 random
permutations with the command:

>> sac_perm(Res, Wnn_5, 9999);

The screen output of this program is shown below:

RANGE OF RANDOM-PERMUTATION INDEX VALUES:

INDEX Moran corr rho


MAX 0.1939 0.3325 0.6799
MIN -0.1444 -0.3797 -0.9987

TABLE OF SIGNIFICANCE LEVELS:

INDEX VALUE SIGNIF


Moran 0.4640 0.0001
corr 0.6324 0.0001
rho 0.8618 0.0001

Here the key outputs are the significance levels for the three test statistics (Moran, corr,
rho). Notice that (as expected), these values are each maximally significant, i.e., they are
higher than the values for all of the 9999 random permutations simulated. In fact, they
are much higher, as can be seen by comparing them with the range of values displayed
above. For example, the important Moran value, 0.4640, is seen to be well above the
range of values, - 0.1444 to 0.1939, reported for all 9999 permutations. Note also that the


ranges of corr and rho are successively larger than this, in a manner consistent with
expression (4.1.26) for the asymptotic Moran test.

As expected, the Moran value, 0.4640 , is the same as that for the asymptotic tests above,
confirming that the same weight matrix and calculation procedure are being used.
Moreover, the extremely significant p-value reported for those tests is consistent with the
present fact that this Moran value is way above the simulated range. This shows that if
the number of permutations were increased well beyond, 9999, the same maximally-
significant results would almost surely persist.

Finally, just to show that normality of $I_W$ persists under random permutations for
samples this large, we have plotted the histogram for the 9999 simulated values of $I_W$
(ranging from -0.1444 to 0.1939), together with the observed value, 0.4640, shown in red.

[Histogram omitted: the 9999 permuted values of $I_W$ cluster between -0.2 and 0.2, with the
observed value 0.4640 marked in red far to the right]

Figure 4.10. SAC-Perm Test for $I_W$

This plot also serves to further dramatize the significance of spatial autocorrelation for
these particular regression residuals.


5. Tests of Spatial Concentration

The above testing procedures are all motivated by the spatial autoregressive model of
residual errors. So before moving on to spatial regression analyses of areal data, it is
appropriate to consider certain alternative measures of spatial association that are also
based on spatial weights matrices. By far the most important of these for our purposes are
the so-called G-statistics, developed by Getis and Ord (1992,1995).1 These statistics
focus on direct associations among (nonnegative) spatial attributes rather than spatial
residuals from some underlying explanatory model. For any given set of nonnegative
data, $x = (x_1,..,x_n)$, associated with n areal units, together with an appropriate spatial
weights matrix, $W = (w_{ij} : i,j = 1,..,n)$, the $G^*$ statistic for $x$ is defined to be:2

(5.1)   $G_W^*(x) = \dfrac{\sum_{i=1}^{n}\sum_{j=1}^{n} x_i w_{ij} x_j}{\sum_{i=1}^{n}\sum_{j=1}^{n} x_i x_j} = \dfrac{x'Wx}{(1_n'x)^2}$

As discussed further below, the diagonal elements of W are allowed to be nonzero (since
no autoregressive-type relations are involved). However, if one is only interested in
relations between distinct areal units, $i \ne j$, so that the diagonal elements of W are treated
as zeros, then the resulting statistic is called simply the G statistic, and is given by:

(5.2)   $G_W(x) = \dfrac{\sum_{i=1}^{n}\sum_{j \ne i} x_i w_{ij} x_j}{\sum_{i=1}^{n}\sum_{j \ne i} x_i x_j} = \dfrac{x'W^0 x}{(1_n'x)^2 - x'x}$

where $W^0 = W - \mathrm{diag}(W)$. However, our focus will be almost entirely on $G^*$ statistics.3
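As a quick illustration of these two definitions, the sketch below computes both statistics directly from a nonnegative data vector x and weights matrix W (the class program g_perm.m wraps this calculation in a permutation test; the lines here simply restate (5.1) and (5.2)).

% G* and G statistics of (5.1) and (5.2) for nonnegative data x and weights W
total = sum(x);                            % 1_n'x
Gstar = (x' * W * x) / total^2;            % G* statistic (5.1): diagonal terms included
W0    = W - diag(diag(W));                 % W with its diagonal set to zero
G     = (x' * W0 * x) / (total^2 - x'*x);  % G statistic (5.2): distinct pairs only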

5.1 A Probabilistic Interpretation of G*

While the definitions in (5.1) and (5.2) serve to clarify the formal similarities between
these indices and those of the previous section, there is an alternative representation
which suggests a more meaningful interpretation of these indices. Here we focus on $G^*$.
First observe that since $x_i \ge 0$, if we let

(5.1.1)   $p_i = \dfrac{x_i}{\sum_{j=1}^{n} x_j} = \dfrac{x_i}{1_n'x}$

1 The 1992 paper is Reference 7 in the class Reference Materials.
2 While our present focus is on areal units, it should be noted that these G-statistics are equally applicable
to sets of point locations, such as hospitals or supermarkets within a given urban area.
3 It should be clear from these definitions that a better choice of notation would have been to use $G$ with $W$
and $G^0$ with $W^0$. But at this point, it is best to stay with the standard notation in the literature.


denote the proportion (or fraction) of x in unit $i$, and let $p = (p_1,..,p_n)$ denote the
corresponding vector of proportions, then $G^*$ can be rewritten as

(5.1.2)   $G_W^* = \dfrac{\sum_{ij} x_i w_{ij} x_j}{(1_n'x)^2} = \sum_{ij} w_{ij} \left( \dfrac{x_i}{1_n'x} \right)\left( \dfrac{x_j}{1_n'x} \right) = \sum_{ij} p_i\, p_j\, w_{ij}$

Next observe (from the title of their 1992 paper) that Getis and Ord are primarily
interested in distance-based measures of proximity or accessibility. In particular, if we let
$d_{ij}$ denote some appropriate notion of distance between units $i$ and $j$, and let $a(d)$
denote an appropriate (nonincreasing) accessibility function of distance [such as
$a(d) = d^{-\theta}$ or $a(d) = \exp(-\theta d)$], then we may now interpret each spatial weight as an
accessibility measure

(5.1.3)   $w_{ij} = a(d_{ij}) \;, \quad i,j = 1,..,n$

and write

(5.1.4)   $G_a^* = \sum_{ij} (p_i\, p_j)\, a(d_{ij})$

To give a concrete interpretation to $G_a^*$, let us assume for the moment that $x_i$ represents
the population in areal unit $i$, so that $p_i$ is the fraction of population in $i$, and
$p = (p_1,..,p_n)$ is the population distribution among areal units. In this context one may
ask: What is the expected accessibility between two randomly sampled individuals from
this distribution? To answer this question, observe that since $p_i$ is by definition the
probability that a randomly sampled individual is from unit $i$, it follows by independence
that $p_i p_j$ must be the joint probability that these two random samples are from units $i$
and $j$, respectively. So if accessibility is treated as a random variable with values, $a(d_{ij})$,
for each pair of areal units, then it follows from (5.1.4) that $G_a^*$ must be the expected
value of this random variable, i.e.,

(5.1.5)   $G_a^* = E(a)$

Thus the value of $G_a^*$ is precisely the answer to the question above, i.e., the expected
accessibility between two randomly sampled individuals in this population.

In terms of this particular example, there are several additional features that should be
noted. First it should be clear that two individuals in the same areal unit are by definition
maximally accessible to one another. So any measure of overall accessibility will surely
be distorted if these relations are omitted – as in G statistics. It is for this reason that our
focus is almost exclusively on G * statistics. Notice also from the definitions of a and p


that $G_a^*$ must achieve its maximum value when all population is concentrated in the
smallest of these n areal units. This suggests that $G_a^*$ is more accurately described as a
measure of spatial concentration than association.

More generally, these interpretations carry over to essentially any nonnegative data. For
example, if $x_i$ denotes income or crime levels, then $G_a^*$ represents the spatial
concentration of income or crime. But here one must be careful to distinguish between
extensive and intensive quantities. For example, while the proportion of total income (dollars)
in areal unit $i$ is straightforward, the "proportion" of per capita income is less clear.
Hence one must treat such intensive quantities in terms of density units that can be added.
So for example, if per capita income is twice as high in $i$ as in $j$, this would here be
taken to mean that the income density in $i$ is twice that in $j$. So a better interpretation of
$G_a^*$ in this case would be in terms of the spatial concentration of income density. In any
case, it is certainly meaningful to ask whether certain spatial patterns of per capita income
are more concentrated than others.

Finally, we should add that even for spatial weights matrices, W, that are not distance
based (such as spatial contiguity matrices), such weights can still be viewed as measures
of “closeness” in an appropriate sense. So in the analyses to follow, we shall continue to
interpret $G_W^*$ in (5.1.2) as measuring the degree of spatial concentration of quantities,
$x = (x_1,..,x_n)$.

5.2 Global Tests of Spatial Concentration

To test whether population (income, crime, etc.) is “significantly concentrated” in space,


it is natural to again consider permutation tests involving $G_W^*$, where $w_{ij}$ is implicitly
interpreted as a measure of accessibility, $a$, as in (5.1.3) above. The details of such a
testing procedure are essentially identical to the sac-perm test above. The only difference
is that the relevant test statistic, S, in Section 4.3.1 above is now $G_W^*$ rather than, say, the
Moran statistic, $I_W$. This procedure is operationalized in the MATLAB program,
g_perm.m.

As one application of this testing procedure, we again consider the English Mortality data
in Figure 1.9 above (p.III.1-5). For purposes of illustration, we here consider a new type
of spatial weights matrix, namely exponential-distance weights [expression (2.1.13)],
which is also constructed by using the MATLAB program, dist_wts.m. Starting with
exponential-distance weights, say

(5.2.1)   $w_{ij} = a(d_{ij}) = \exp(-\theta\, d_{ij})$

we first note that since the negative exponential function approaches zero very rapidly, it
is often advisable to normalize distance data to the unit interval to avoid vanishingly


small values.4 To do so we first identify the largest possible centroid distance, $d_{\max}$,
between all pairs of Health Districts, and then convert centroid distances, $d_{ij}$, to the unit
interval by setting

(5.2.2)   $d_{ij}^* = d_{ij} / d_{\max} \;, \quad i,j = 1,..,n \;(= 199)$

so that $0 \le d_{ij}^* \le 1$. Using this normalization, we can then design exponential-distance
weights to yield some appropriate "effective bandwidth" by simply plotting the function
$\exp(-\theta d),\; 0 \le d \le 1$, for various choices of $\theta$. For our present purposes, the value
$\theta = 10$ yields the plot shown in Figure 5.1 below,5 which is seen to yield an effective
bandwidth of about $d = 1/2$ (shown by the red arrow). In terms of our normalization in
(5.2.2) this yields the familiar value, $d_{\max}/2$:

[Plot omitted: the function $\exp(-10 d)$ on $0 \le d \le 1$, which effectively vanishes beyond $d = 1/2$]

Figure 5.1. Negative Exponential Function

Using the workspace, eng_mort.mat, the corresponding spatial weights matrix, W1, is
constructed by using dist_wts.m with the commands:

>> info.type = [4,10,1];


>> W1 = dist_wts(L,info);

Here L is the 199x2 matrix of centroid coordinates, ‘4’ indicates that exponential-
distance weights are option 4 in dist_wts.m, ‘10’ denotes the exponent value, and (most
importantly) ‘1’ denotes the option to leave all diagonal elements as calculated [in this
case, exp(0)  1 ]. Note also that since these weights are already guaranteed to lie in the
unit interval (as in Figure 5.1), there is no need to consider any additional normalizations
(as provided by the info.norm option). Finally, denoting the myocardial infarction rates

4
For example, if distance were in meters, then while a distance of 800 meters is not very large, you will
discover that MATLAB yields the negative exponential value, exp(-800) = 0. Moreover, this is not
“rounded” to zero, but is actually so small a number that it is beyond the limits of double precision
arithmetic to detect.
5
This plot is obtained with the commands: x = [0:.01:1]; y = exp(-10*x); plot(x,y,'k','Linewidth',5);


by z = mort(:,3), the test of spatial concentration using g_perm.m is performed with the
command:

>> g_perm(z,W1,999);

The results of this test (with 999 random permutations of Health Districts) are shown
below:

SPATIAL CONCENTRATION RESULTS

INDEX VALUE PROB


G 0.0055 0.0010
G* 0.0054 0.0010

Notice first that both G and G * values are reported, even though G * is of primary interest
for our purposes. Next observe that, not surprisingly, these myocardial infarction rates are
maximally significant given 999 permutations, and that in this case there is very little
disagreement between G and G * .

For purposes of comparison, we also try the more local spatial weights matrix, Wnn_5,
already employed in Section 4.3.2 above to test for spatial autocorrelation in the
regression residuals for this same data. Here the results of using

>> g_perm(z,Wnn_5,999);

are seen to be practically the same:

SPATIAL CONCENTRATION RESULTS

INDEX VALUE PROB


G 0.0057 0.0010
G* 0.0056 0.0010

As with spatial autocorrelation, it is always a good idea to use several spatial weight
matrices to check the robustness of the results. Here it is clear from the very different
(implicit) bandwidths used in these two examples that the significance of spatial
concentration in this case is firmly established.
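To make the global index itself more concrete, the following minimal MATLAB sketch (one possible implementation, not the actual g_perm.m code) computes the $G^*$ value in the form implied by the decomposition (5.3.3) of Section 5.3 below, namely $G_W^* = \sum_i \sum_j p_i p_j w_{ij}$ with $p_i = x_i / \sum_h x_h$, together with its random-permutation p-value:

% Minimal sketch (not the actual g_perm.m): global G* with a random permutation test
N = 999;                                % number of random permutations
p = z / sum(z);                         % proportions p_i of the attribute total in each unit
Gstar = p' * W1 * p;                    % observed G* = sum_i sum_j p_i p_j w_ij
count = 1;                              % the observed value is included in its own reference set
for k = 1:N
    pk = p(randperm(length(p)));        % randomly permute the areal-unit values
    if pk' * W1 * pk >= Gstar           % permuted G* at least as large as the observed value
        count = count + 1;
    end
end
pval = count / (N + 1);                 % permutation p-value for spatial concentration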

Before moving on to the more interesting local tests of spatial concentration, it is of
interest to note that such tests can also be done in ARCMAP. Here ARCMAP has for
some reason chosen to use only G-statistics rather than $G^*$-statistics.6 But in the more
important case of local spatial concentration below, they do use $G^*$-statistics. So we shall
not spend much time on this particular application, other than to note that it can be
accessed by

6
To see this, simply Google “How High/Low Clustering (Getis-Ord General G) works”.


ArcToolbox > Spatial Statistics Tools
   > Analyzing Patterns
      > High/Low Clustering (Getis-Ord General G)

For the sake of comparison with the MATLAB results above, we have used exactly the same
procedure developed in Section 4.2.2 above for testing spatial autocorrelation in terms of
Wnn_5. Here the only difference is that General G is used rather than Moran’s I. The
graphical output for this application is shown in Figure 5.2 below:

Observed General G: 0.005676


z-score: 9.053801
p-value: 0.000000

Given the z-score of 9.05, there is a less than 1% likelihood that this
high-clustered pattern could be the result of random chance.

Figure 5.2. Application of the G Statistic

Notice from the value of G = 0.005676 that this is the same value (when rounded) as that
obtained in MATLAB above. Notice also that the result here is in terms of the asymptotic
normal approximation of this G statistic (obtained by Getis-Ord, 1992, under the same
random permutation hypothesis as above), and is thus reported as a z-score (9.0538) with
extremely small p-value. This again suggests that the MATLAB results would continue
to obtain maximal significance for many more permutations than 999.

5.3 Local Tests of Spatial Concentration

Observe that both $G_W^*$ and $G_W$ are decomposable into local measures of concentration
about each location $i$ as follows. Let the local $G_W^*$ value at $i$ be defined by



(5.3.1)  $\displaystyle G_W^*(i) \;=\; \frac{\sum_{j=1}^n w_{ij}\, x_j}{\sum_{j=1}^n x_j} \;=\; \sum_{j=1}^n p_j\, w_{ij}$

and similarly, let the local $G_W$ value at $i$ be defined by

(5.3.2)  $\displaystyle G_W(i) \;=\; \frac{\sum_{j \neq i} w_{ij}\, x_j}{\sum_{j \neq i} x_j}$

where, again, our interest focuses almost entirely on $G_W^*(i)$. Note in particular from
(5.1.2) that these local measures of concentration are related to $G_W^*$ by the identity,7

(5.3.3)  $\displaystyle G_W^* \;=\; \sum_{i=1}^n p_i \left( \sum_{j=1}^n p_j\, w_{ij} \right) \;=\; \sum_{i=1}^n p_i\, G_W^*(i)$

Thus $G_W^*$ can be viewed as a weighted average of these local concentration measures,
where the weights, $p_i$, are simply the proportions of $x$ in each areal unit $i$. In terms of
the probability interpretation above, if we again consider accessibility weights of the
form, $w_{ij} = a(d_{ij})$, then $G_a^*(i)$ is precisely the expected accessibility from a randomly
sampled unit of $x$ in $i$ to any other randomly sampled unit, i.e., the conditional expected
accessibility

(5.3.4)  $\displaystyle G_a^*(i) \;=\; \sum_{j=1}^n p_j\, a(d_{ij}) \;=\; E(a \,|\, i)$

In these terms, it follows from (5.1.5) together with (5.3.4) that the decomposition in
(5.3.3) is simply an instance of the standard conditional-expectation identity:

(5.3.5)  $\displaystyle E(a) \;=\; \sum_i p_i\, E(a \,|\, i)$

But the real interest in these local measures is that they provide information about where
concentration is and is not occurring.8 In particular, by assigning p-values indicating the
significance of local concentration at each areal unit, one can map the results and
visualize the pattern of these significance levels. Those areas of high concentration are
generally referred to as “hot spots” (in a manner completely analogous to strong clusters
in point patterns).

7 It is of interest to note that this decomposition is an instance of what Anselin (1995) has called Local Indicators of Spatial Association (LISA).
8 Indeed, the original paper by Getis and Ord (1992) starts with these local indices, and only groups them into a "General G" statistic in a later section of the paper.


5.3.1 Random Permutation Test

In this setting, one may test for the presence of such "hot spots" with respect to a data set,
$(x_i : i = 1,..,n)$, by employing essentially the same random permutation test as above. In
particular, for any random permutation, $\pi = (\pi_1,..,\pi_n)$, of the areal unit indices $(1,..,n)$,
one may compute for each unit $i$ the associated statistic, $G_W^*(i)$, and compare this
observed value with the distribution of values, $G_W^*(i, \pi_k)$, for $N$ random permutations,
$\pi_k = (\pi_{1k},..,\pi_{nk})$, $k = 1,..,N$. Here it is important to note that the index $i$ is itself included
in this permutation. For if the value of $x_i$ is relatively large, then to reflect the
significance of this local concentration at $i$ it is important to allow smaller values to
appear at $i$ in other random permutations.

If the observed value of $G_W^*(i)$ has rank $k_i$ among all values $[G_W^*(i),\, G_W^*(i,\pi_1),..,\, G_W^*(i,\pi_N)]$
(with rank 1 denoting the highest value), then the significance of concentration at $i$ is
again represented by the p-value,

(5.3.6)  $\displaystyle P_i \;=\; \frac{k_i}{N+1}\,, \quad i = 1,..,n\,.$

It is these values that are plotted to reveal visual patterns of concentration.
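A minimal MATLAB sketch of this procedure (one possible implementation; the actual course program g_perm_loc.m is described in the next section) is the following, where x denotes the vector of attribute values and W the chosen spatial weights matrix:

% Minimal sketch: local G* statistics (5.3.1) with permutation p-values (5.3.6)
N  = 999;                               % number of random permutations
n  = length(x);
Gi = (W * x) / sum(x);                  % observed local G*(i) = sum_j w_ij x_j / sum_j x_j
rank_i = ones(n,1);                     % start each rank at 1 (the observed value itself)
for k = 1:N
    xk  = x(randperm(n));               % permute values over areal units (unit i included)
    Gik = (W * xk) / sum(xk);           % local G* values under this permutation
    rank_i = rank_i + (Gik >= Gi);      % count permuted values at least as large as observed
end
P = rank_i / (N + 1);                   % local p-values P_i = k_i / (N+1)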

5.3.2 English Mortality Example

This testing procedure is implemented for local $G^*$-statistics in the MATLAB program,
g_perm_loc.m. Here it is assumed that tests for all areal units, $i = 1,..,n$, are to be done.
Hence the outputs contain the local $G^*$-statistic and P-value for each areal unit. To
illustrate the use of this local-testing procedure, it is convenient to continue with the
English Mortality example above. For the exponential-distance weights matrix, W1,
constructed above, together with the myocardial infarction data, z, the command:

>> GP1 = g_perm_loc(z,W1,999);

yields a (190 x 2) output matrix, GP1 $= [(G_i^*, P_i): i = 1,..,190]$, containing the local $G^*$-
statistic, $G_i^*$ $[= G_{W1}^*(i)]$, and P-value, $P_i$, for each of the 190 districts, based on 999
random permutations. These values were imported to ARCMAP and displayed in the
map document, Eng_mort.mxd, as shown in Figures 5.3 and 5.4 below. Figure 5.3 plots
the actual values of $G_i^*$ in each areal unit, $i$, with darker green areas denoting higher
values. The corresponding P-values are shown in Figure 5.4, where darker red shows the
area of most significance (and where only the legend for P-values is shown). As
expected, there is seen to be a rough correspondence between high local $G^*$ values and
more significant areas of concentration.


[Fig. 5.3. Exponential G*-Values; Fig. 5.4. Exponential P-Values. P-value legend classes: .001-.01, .01-.05, .05-.10, .10-.20, .20-1.00.]

Notice in particular that the local $G^*$-values reflect the general concentration of
myocardial infarction rates in the north that is seen in the original data set [Figure 1.9
(p.III.1-5)], but now are smoothed by the exponentially weighted averages in the local
$G^*$ statistics. However this "north-south" divide ([B-G], p.279) is seen to be much more
dramatic in the associated P-values, where the darkest region, denoting P-values less than
.01, now covers all of Northern England.

Turning next to the nearest-neighbor weights matrix, Wnn_5, the test results are now
obtained with the command,

>> GP2 = g_perm_loc(z,Wnn_5,999);

which again yields a (190 x 2) output matrix, GP2 $= [(G_i^*, P_i): i = 1,..,190]$, containing the
local $G^*$-statistics and P-values for this case. By again importing these values to
ARCMAP, we obtain the comparable displays shown in Figures 5.5 and 5.6 below.
Notice that the key difference between these two sets of results is the additional local
variation in values created by the smaller numbers of neighbors used by Wnn_5. For
example, while each areal unit has only 5 neighbors in Wnn_5, if we approximate the
bandwidth in the exponential matrix, W1, by counting only weights, $w_{ij} \ge .01$, then some
areal units $i$ still have more than 70 neighbors. So the degree of smoothing is much
greater in the associated $G_i^*$ values. But still, the highest $G_i^*$ values (and most significant $P_i$ values)
continue to be in the north, and in fact are seen to agree more closely with those
concentrations of myocardial infarction rates seen in the original data, such as the
concentration seen around Lancashire county [compare Figure 1.6 (p.I.1-3) with Figure
1.9 (p.III.1-5)]. So it would appear that 5 nearest neighbors yields a more appropriate scale
for this analysis.


[Fig. 5.5. Nearest Neighbor G*-Values; Fig. 5.6. Nearest Neighbor P-Values. P-value legend classes: .001-.01, .01-.05, .05-.10, .10-.20, .20-1.00.]

5.3.3 Asymptotic G* Test in ARCMAP

An alternative test using $G^*$ is available in ARCMAP. This procedure can be found at:

ArcToolbox > Spatial Statistics Tools
   > Mapping Clusters
      > Hot Spot Analysis (Getis-Ord G*)

To employ this procedure, we will again use the English Mortality data with the nearest-
neighbor spatial weights matrix, Wnn_5, already constructed for ARCMAP in Section
4.3.2. In the Hot Spot window that opens, fill in the required entries,


where the specific path names will of course vary. Click OK, and a shapefile will be
constructed and added to the Table of Contents in your map document. The result
displayed is shown in Figure 5.7 below (where the legend from the Table of Contents has
been added).

[Figure 5.7. Asymptotic G* Test Output. Legend classes (z-scores): < -2.58, -2.58 to -1.96, -1.96 to -1.65, -1.65 to 1.65, 1.65 to 1.96, 1.96 to 2.58, > 2.58 Std. Dev.]

As with the General G test in Figure 5.2 above, this test is based on the asymptotic
normal approximation of the local $G^*$-statistics under the same random permutation
hypothesis as above. So the values shown in the legend above are actually in terms of the
z-scores obtained for each test. For example, the familiar "1.96 - 2.58" value in the
second-to-last red entry indicates that myocardial infarction rates for districts with this
color are significantly concentrated at between the .05 and .01 levels. (The actual p-values
are listed in the Attribute Table for this map.) Here it is important to note that two-sided
tests are being performed. So for a corresponding one-sided test (as done above), these
values are actually twice as significant (i.e., with one-sided p-values between .025 and
.005). So even though the red areas look slightly "smaller" than those in Figure 5.6, the
results are actually more significant than those of MATLAB, in a manner consistent with
all of the asymptotic tests we have seen so far. Notice also that because two-sided tests
are being done, it is also appropriate to show areas with significantly less concentration than
would be expected under the null hypothesis. These districts are shown in blue.

5.3.4. The Advantage of G* over G for Analyzing Spatial Concentration

Before leaving this topic, it is instructive to consider an additional example that illustrates
the advantage of local $G^*$-statistics over G-statistics for the analysis of spatial
concentration. Here we construct a fictitious population distribution for the case of Eire in
which it is assumed that there is a single major concentration of population in one county
(FID 18 = "Offaly" County), as shown in Figure 5.8 below.9

9 In particular, about 25% of the population has been placed in this county, and the rest has been distributed randomly (under the additional condition that no other county contains more than 5% of the population).


[Fig. 5.8. Fictitious Data; Fig. 5.9. Exponential G*-Values]

Here an exponential-distance matrix has been constructed similar to W1 above (to ensure
a smooth representation), and the local $G^*$-statistics for this case are shown in Figure 5.9.
Notice that these $G^*$-values roughly approximate the concentration of the original data,
but are somewhat smoother (as was also seen for the myocardial infarction data above
using W1). The corresponding P-values (again for 999 simulations) are shown in Figure
5.10 below.

[Fig. 5.10. P-Values for G*; Fig. 5.11. P-Values for G]

These results confirm that Offaly County is the overwhelmingly most significant
concentration of population (P-value $\le$ .02), with several of the surrounding counties


gaining significance from their proximity to Offaly. However, if one carries out the same
test procedure using local G-statistics, then a substantially different picture emerges.
Here Offaly County is not at all significant, but two of its immediate neighbors
are. The reason of course is that by setting the matrix diagonal to zero, the population of
Offaly itself is ignored in the local G-test for this county. Moreover, since its neighbors
do not exhibit unusually high population concentrations, the local G-value for Offaly
will not be unusually high compared to the corresponding values for random
permutations of county populations. However, its neighbors are still likely to exhibit
significantly high values, because their proximity to the population concentration in
Offaly yields unusually high local G-values compared to those for random permutations.
Hence the anticipated result here is something like a "donut hot spot", with the "donut
hole" corresponding to Offaly. This is basically what is seen in Figure 5.11, except that
some neighbors are closer (in exponential proximities) to Offaly than others. This
extreme example serves to underscore the difference between these two local statistics,
and shows that local $G^*$-statistics are far more appropriate for identifying significant
local concentrations.


6. Spatial Regression Models for Areal Data Analysis

The primary models of interest for areal data analysis are regression models. In the same
way that geo-regression models were used to study relations among continuous-data
attributes of selected point locations (such as the California rainfall example), the present
spatial regression models are designed to study relations among attributes of areal units
(such as the English Mortality example in Section 1.3 above). The key difference is of
course the underlying spatial structure of this data. In the case of geo-regression, the
fundamental spatial assumption was in terms of covariance stationarity, which together
with multi-normality, enabled the full distribution of spatial residuals to be modeled by
means of variograms and their associated covariograms. In the present case, this
stationarity assumption is replaced by spatial autoregressive hypotheses that are based on
specific choices of spatial weights matrices, as developed in Section 5. Here we start with
the most fundamental spatial autoregressive hypothesis in terms of the regression residuals
themselves.

6.1 The Spatial Errors Model (SEM)

The most direct analogue to geo-regression is the spatial regression already developed in
Section 3 above. In particular, if we start with the regression model in (3.1) above, i.e.,

(6.1.1)  $\displaystyle Y_i \;=\; \beta_0 + \sum_{j=1}^k \beta_j\, x_{ij} + u_i\,, \quad i = 1,..,n$

and postulate that dependencies among the regression residuals (errors), $u_i$, at each areal
unit $i$ are governed by the spatial autoregressive model in (3.5) and (3.6), i.e., by

(6.1.2)  $\displaystyle u_i \;=\; \rho \sum_j w_{ij}\, u_j + \varepsilon_i\,, \quad \varepsilon_i \sim N(0,\sigma^2)\,, \quad i = 1,..,n$

for some choice of spatial weights matrix, $W = (w_{ij}: i,j = 1,..,n)$ [with $\mathrm{diag}(W) = 0$], then
the resulting model, summarized in matrix form by (3.2) and (3.9) as:

(6.1.3)  $Y \;=\; X\beta + u\,, \quad u \;=\; \rho W u + \varepsilon\,, \quad \varepsilon \sim N(0,\sigma^2 I_n)$

is now designated as the Spatial Errors Model (also denoted as the SE-model or simply
SEM).1

As mentioned above, this constitutes the most direct application of the spatial
autoregressive model in Section 3. In essence it is hypothesized here that all spatial
dependencies are among the unobserved errors in the model (and hence the name, SEM).
In the case of the English Mortality data for example, it is clear that while the Jarman
index includes many socio-economic and demographic factors influencing rates of
myocardial infarctions, there are surely other factors involved. Moreover, since many of

1
See footnote 3 below for further discussion of this terminology.


these excluded factors will exhibit spatial dependencies, such dependencies will
necessarily be reflected by the corresponding residual errors, $u$, in (6.1.3).

Before considering other types of autoregressive dependencies, it is of interest to
reformulate this model as an instance of the General Linear Regression Model. First, if
for notational convenience, we now let

(6.1.4)  $B_\rho \;=\; I_n - \rho W$

then by expression (3.2.5) above, we may solve for $u$ in terms of $\varepsilon$ as follows:

(6.1.5)  $u \;=\; (I_n - \rho W)^{-1}\varepsilon \;=\; B_\rho^{-1}\varepsilon\,, \quad \varepsilon \sim N(0,\sigma^2 I_n)$

Thus by the Invariance Theorem for multi-normal distributions, it follows at once from
the multi-normality of $\varepsilon$ that $u$ is also multi-normal with covariance given by2

(6.1.6)  $\mathrm{cov}(u) \;=\; \mathrm{cov}(B_\rho^{-1}\varepsilon) \;=\; B_\rho^{-1}\,\mathrm{cov}(\varepsilon)\,(B_\rho^{-1})' \;=\; B_\rho^{-1}(\sigma^2 I_n)(B_\rho^{-1})' \;=\; \sigma^2 B_\rho^{-1}(B_\rho')^{-1} \;=\; \sigma^2 (B_\rho' B_\rho)^{-1} \;=\; \sigma^2 V_\rho$

where the spatial covariance structure, $V_\rho$, is given by3

(6.1.7)  $V_\rho \;=\; (B_\rho' B_\rho)^{-1}$

This in turn implies that (6.1.3) can be rewritten as

(6.1.8)  $Y \;=\; X\beta + u\,, \quad u \sim N(0, \sigma^2 V_\rho)$

which is seen to be an instance of the General Linear Regression Model in expression
(7.1.8) of Part II, where in this case the matrix $C$ is replaced by $V_\rho$ in (6.1.7). This will
allow us to apply some of the GLS methods in Section 7.1.1 of Part II to SE-models.

Finally, there is a third equivalent way of writing this SE-model which is also useful for
analysis. If we simply substitute (6.1.5) directly into (6.1.3) and eliminate $u$ altogether,
then this same model can be written as

(6.1.9)  $Y \;=\; X\beta + B_\rho^{-1}\varepsilon\,, \quad \varepsilon \sim N(0,\sigma^2 I_n)$

Since all simultaneous relations, $u = \rho W u + \varepsilon$, have been eliminated, expression (6.1.9) is
usually called the reduced form of (6.1.3).
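As a concrete illustration of this reduced form, the following minimal MATLAB sketch (with purely hypothetical parameter values) simulates one realization of the SE-model (6.1.9) for a given n x n spatial weights matrix W with zero diagonal:

% Minimal sketch: simulate one draw from the SE-model reduced form (6.1.9)
n    = size(W,1);
X    = [ones(n,1), randn(n,1)];          % intercept plus one hypothetical explanatory variable
beta = [1; 0.5];                         % hypothetical beta coefficients
rho  = 0.6;  sigma = 1;                  % hypothetical spatial dependency and error scale
B    = eye(n) - rho * W;                 % B_rho = I_n - rho*W
e    = sigma * randn(n,1);               % eps ~ N(0, sigma^2 I_n)
u    = B \ e;                            % u = B_rho^{-1} eps, so that cov(u) = sigma^2 V_rho
Y    = X * beta + u;                     % Y = X*beta + u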

1 1 1
Here we have used the matrix identities, ( A)  ( A ) , and, A B  ( BA) , which are established,
2 1 1

respectively, in expressions (A3.1.20) and (A3.1.18) of the Appendix.


3
This terminology is motivated by the fact that all spatial aspects of covariance (6.1.6) are defined by V .


6.2 The Spatial Lag Model (SLM)

An alternative linear model based on the spatial autoregressive model is obtained by
assuming that these autoregressive relations are among the dependent variables
themselves. If we again assume that the underlying spatial relations among areal units are
representable by a spatial weights matrix, $W = (w_{ij}: i,j = 1,..,n)$ [with $\mathrm{diag}(W) = 0$], then
the simplest way to write such a model in terms of $W$ is by modifying expression (6.1.1)
as follows,

(6.2.1)  $\displaystyle Y_i \;=\; \beta_0 + \rho \sum_h w_{ih}\, Y_h + \sum_{j=1}^k \beta_j\, x_{ij} + \varepsilon_i\,, \quad i = 1,..,n$

where again $\varepsilon_i \sim N(0,\sigma^2)$, $i = 1,..,n$. Here the autoregressive term, $\rho \sum_{h=1}^n w_{ih} Y_h$, reflects
possible dependencies of $Y_i$ on values, $Y_h$, in other areal units. A standard example of
(6.2.1) is in terms of housing prices. If the relevant areal units are say city blocks within a
metropolitan area, and if $Y_i$ is interpreted as the average price (per square foot) of
housing on block $i$, then in addition to other housing attributes, $(x_{ij}: j = 1,..,k)$, of block
$i$, such prices may well be influenced by prices in surrounding blocks. So the relevant
autoregressive relations here are among the housing prices, $Y$, and not the spatial
residuals, $\varepsilon$. Such relations are typically called spatial lag relations, which motivates the
name spatial lag model (SLM).4

6.2.1 Simultaneity Structure

Before analyzing this model in detail, it is important to emphasize one fundamental
difference between (6.1.1) and (6.2.1). Since the residuals here are assumed to be
independent,5 one might at first glance conclude that (6.2.1) is nothing more than an OLS
model with an added term, $\rho\,(\sum_{h=1}^n w_{ih} Y_h)$, where the unknown spatial dependency
parameter, $\rho$, is simply the relevant "beta coefficient". But the key points to notice are
that (i) the $Y_h$ values are random variables, and moreover that (ii) they appear on both
sides of the equation system (6.2.1), i.e., that $Y_i$ will also appear in equations for $Y_h$
whenever $w_{hi} \neq 0$. Thus, in the same way that "opinions" $(u_1,..,u_n)$ among households in
Figure 3.1 involved simultaneities, the housing prices $(Y_1,..,Y_n)$ in the present illustration
also involve simultaneities. So this is not simply another term in an OLS model.

4 At this point, it should be emphasized that (much like "variograms" versus "semivariograms" in the Kriging models of Part II), there is no general agreement regarding the names of various spatial regression models. For example, while we have reserved the term Spatial Autoregressive Model (SAR) for the basic residual process in expression (3.9) above, this term is used by LeSage and Pace (2009) for the spatial lag model (SLM). Our present terminology follows that of the open-source software, GEODA (to be discussed later), and has the advantage of clarifying where the basic spatial autoregressive model is being applied, i.e., to the error terms in SEM and to the dependent variable in SLM.
5
Relaxations of this assumption will be considered in the “combined model” of Section 6.3.1 below.


This can be seen more clearly by formalizing this model in matrix terms and solving for
its reduced form. By employing the same notation as in (6.1.3), the Spatial Lag Model
(SL-model or simply SLM) can be written as

(6.2.2)  $Y \;=\; \rho W Y + X\beta + \varepsilon\,, \quad \varepsilon \sim N(0,\sigma^2 I_n)$

As a parallel to (6.2.1), we can rewrite this model by grouping $Y$ terms in (6.2.2) as
follows:

(6.2.3)  $Y - \rho W Y \;=\; X\beta + \varepsilon \;\;\Rightarrow\;\; (I_n - \rho W)Y \;=\; X\beta + \varepsilon$
         $\;\;\Rightarrow\;\; B_\rho Y \;=\; X\beta + \varepsilon$
         $\;\;\Rightarrow\;\; Y \;=\; B_\rho^{-1} X\beta + B_\rho^{-1}\varepsilon$

which then yields the corresponding reduced form of the SL-model:

(6.2.4)  $Y \;=\; B_\rho^{-1} X\beta + B_\rho^{-1}\varepsilon\,, \quad \varepsilon \sim N(0,\sigma^2 I_n)$

In this reduced form, it should now be clear that the spatial lag term, $\rho W Y$, in (6.2.2) is
not simply another "regression term".
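To underscore this point, a minimal MATLAB sketch (again with hypothetical values) of simulating data from the reduced form (6.2.4) is given below; note that the entire vector Y must be generated simultaneously by solving a linear system, rather than by adding one more regressor to an OLS specification:

% Minimal sketch: simulate one draw from the SL-model reduced form (6.2.4)
n    = size(W,1);
X    = [ones(n,1), randn(n,1)];          % hypothetical design matrix
beta = [1; 0.5];                         % hypothetical beta coefficients
rho  = 0.6;  sigma = 1;                  % hypothetical lag parameter and error scale
B    = eye(n) - rho * W;                 % B_rho = I_n - rho*W
Y    = B \ (X * beta + sigma * randn(n,1));   % Y = B_rho^{-1}(X*beta + eps)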

Finally, one can also view this model as an instance of the Generalized Linear Regression
Model, though the correspondence is not as simple as that of SEM. In particular, if we
now treat the spatial dependency parameter, $\rho$, as a known quantity, or more properly, if
we condition (6.2.4) on a given value of $\rho$, then [in a manner similar to the Cholesky
transformation in expression (7.1.16) of Part II] we can treat

(6.2.5)  $X_\rho \;=\; B_\rho^{-1} X$

as a transformed data set, and again use (6.1.5) through (6.1.7) to write (6.2.4) as

(6.2.6)  $Y \;=\; X_\rho\, \beta + u\,, \quad u \sim N(0, \sigma^2 V_\rho)$

with spatial covariance structure, $V_\rho$, again given by (6.1.7). The key difference here is
that $\rho$ is no longer simply an unknown parameter in the covariance matrix, $V_\rho$, but now
also appears in $X_\rho$. So while (6.2.6) does permit the GLS methods in Section 7.1.1 in


Part II to also be applied to SL-models, these applications are somewhat more restrictive
than for SE-models.

6.2.2 Interpretation of Beta Coefficients

One final difference between SE-models and SL-models that needs to be emphasized is
the interpretation of the standard beta coefficients, $\beta$, in (6.1.8) versus (6.2.4) [or
equivalently, (6.1.9) versus (6.2.6)]. Recall that one of the appealing features of OLS is
the simple interpretation of beta coefficients. For example, consider an OLS version of
the housing price example above, namely

(6.2.7)  $\displaystyle Y_i \;=\; \beta_0 + \sum_{j=1}^k \beta_j\, x_{ij} + \varepsilon_i\,, \quad i = 1,..,n$

with $\varepsilon_i \sim N(0,\sigma^2)$, $i = 1,..,n$. If say $x_{i1}$ denotes the average age of housing on block $i$
(as a surrogate for structural quality), then one would expect that $\beta_1$ is negative. In
particular since,

(6.2.8)  $\displaystyle E(Y_i \,|\, x_{i1},..,x_{ik}) \;=\; \beta_0 + \sum_{j=1}^k \beta_j\, x_{ij}\,, \quad i = 1,..,n$

the value of $\beta_1$ should indicate the expected decrease in mean housing prices on block $i$
resulting from a one-year increase in the average age of houses on block $i$. More
generally, these marginal changes can be expressed as partial derivatives of the form:

(6.2.9)  $\displaystyle \frac{\partial}{\partial x_{ij}}\, E(Y_i \,|\, x_{i1},..,x_{ij},..,x_{ik}) \;=\; \beta_j\,, \quad i = 1,..,n\,,\; j = 1,..,k$

and are seen to be precisely the corresponding $\beta_j$ coefficient for variable $x_j$.

Of course this OLS model ignores spatial dependencies between blocks. So if (6.2.7) is
reformulated as an SE-model to account for such dependencies, say of the form in
(6.1.8):

(6.2.10)  $\displaystyle Y_i \;=\; \beta_0 + \sum_{j=1}^k \beta_j\, x_{ij} + u_i\,, \quad (u_1,..,u_n)' \sim N(0, \sigma^2 V_\rho)$

then since $E(u_i) = 0$, $i = 1,..,n$, it follows that (6.2.8) and (6.2.9) continue to hold. Thus,
while certain types of spatial dependencies have been accounted for, the interpretation of
betas (such as $\beta_1$ above) continues to hold.

However, if the major spatial dependencies are among these price levels themselves, so
that an SL-model is more appropriate, then the situation is far more complex. This can be
seen by observing from the reduced form in (6.2.4), together with the "ripple"
decomposition of $(I_n - \rho W)^{-1}$ in expression (3.3.26) above, that6

6 Here it is implicitly assumed that the convergence condition, $|\rho| < 1/\|W\|$, holds for $\rho$ and $W$.


(6.2.11)  $E(Y\,|\,X) \;=\; B_\rho^{-1}X\beta \;=\; (I_n - \rho W)^{-1}X\beta \;=\; (I_n + \rho W + \rho^2 W^2 + \cdots)X\beta$
          $\qquad\quad =\; X\beta + \rho W X\beta + \rho^2 W^2 X\beta + \cdots$

So the partial derivative in (6.2.9) cannot even be defined without specifying all
attributes on all blocks. Moreover, while (6.2.8) implies that there are no interaction
effects between blocks, i.e., that the partial derivatives of $E(Y_i\,|\,x_{i1},..,x_{ij},..,x_{ik})$ with
respect to housing attributes on any other block are identically zero, this is no longer true
in (6.2.11). For example, if the age of housing on block $i$ is increased, then this not only
has a direct effect on expected mean prices in block $i$, but also has indirect effects on
prices in all other blocks. Moreover, such indirect effects in turn affect prices in $i$. So this
spatial ripple effect leads to complex interdependencies that must be taken into account
when interpreting each beta coefficient. These effects can be summarized by analyzing
(6.2.11) in more detail. To do so, we now employ the following notation. For any $n \times m$
matrix, $A = (a_{ij}: i = 1,..,n,\; j = 1,..,m)$, let $A(i,j) = a_{ij}$ denote the $(ij)^{th}$ element of $A$, and let
$A(\bullet, j)$ denote the $j^{th}$ column of $A$. In these terms, (6.2.11) can be decomposed as
follows:

(6.2.12)  $E(Y\,|\,X) \;=\; B_\rho^{-1}X\beta \;=\; B_\rho^{-1}\sum_{j=1}^k \beta_j\, X(\bullet,j) \;=\; \sum_{j=1}^k \beta_j\,[B_\rho^{-1}X(\bullet,j)]$
          $\qquad =\; \sum_{j=1}^k \beta_j \sum_{h=1}^n X(h,j)\,B_\rho^{-1}(\bullet,h) \;=\; \sum_{j=1}^k \beta_j \left[\sum_{h=1}^n x_{hj}\,B_\rho^{-1}(\bullet,h)\right]$
          $\qquad =\; \sum_{h=1}^n \left[\sum_{j=1}^k x_{hj}\,\beta_j\right] B_\rho^{-1}(\bullet,h)$

so that each $i^{th}$ row of $E(Y\,|\,X)$ can be written as

(6.2.13)  $\displaystyle E(Y_i\,|\,X) \;=\; \sum_{h=1}^n \left[\sum_{j=1}^k x_{hj}\,\beta_j\right] B_\rho^{-1}(i,h)$

In terms of this decomposition, it now follows that the desired partial derivatives can be
obtained directly. First, as a parallel to (6.2.9) we see that

(6.2.14)  $\displaystyle \frac{\partial}{\partial x_{ij}}\, E(Y_i\,|\,X) \;=\; \beta_j\, B_\rho^{-1}(i,i)$

So this marginal effect depends not just on $\beta_j$ but also on the $i^{th}$ diagonal element of
$B_\rho^{-1}$, which has the more explicit form

(6.2.15)  $B_\rho^{-1}(i,i) \;=\; 1 + \rho\,W(i,i) + \rho^2 W^2(i,i) + \cdots \;=\; 1 + \rho^2 W^2(i,i) + \cdots$


where the last line follows from the zero-diagonal assumption on $W$. But since $\rho^2 W^2(i,i)$
together with all higher-order effects are positive, it is clear that the effect of each $\beta_j$ is
being inflated by these spatial effects, as described informally above. Moreover it is also
clear from (6.2.13) that expected mean prices in $i$ are affected by housing attribute
changes in other blocks. In particular, for attribute $j$ in block $h$, it now follows that

(6.2.16)  $\displaystyle \frac{\partial}{\partial x_{hj}}\, E(Y_i\,|\,X) \;=\; \beta_j\, B_\rho^{-1}(i,h)$

Total effects on $E(Y_i\,|\,X)$ of attributes in the same areal unit $i$ are designated as direct
effects by LeSage and Pace (2009, Section 2.7.1), and similarly, the total effects of
attributes in different areal units are designated as indirect effects. For further analysis of
these effects see LeSage and Pace (2009).
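A minimal MATLAB sketch of these calculations (assuming estimates of rho and of a given coefficient beta_j are already available, and using the average-effect summaries proposed by LeSage and Pace) is as follows:

% Minimal sketch: direct and indirect effects of explanatory variable j in the SL-model
% (assumes rho, beta_j and the weights matrix W are available)
n   = size(W,1);
S_j = beta_j * inv(eye(n) - rho * W);        % S_j(i,h) = beta_j * B_rho^{-1}(i,h), as in (6.2.14)-(6.2.16)
direct_j   = mean(diag(S_j));                % average direct effect (own-unit attribute changes)
indirect_j = mean(sum(S_j,2) - diag(S_j));   % average indirect effect (spillovers from other units)
total_j    = direct_j + indirect_j;          % average total effect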

6.3 Other Spatial Regression Models

While there are many variations on the SE-model and SL-model above, we focus only on
those that are of particular interest for our purposes.

6.3.1 The Combined Model

When developing the SL-model above, a question that naturally arises is why all
unobserved factors should be treated as spatially independent. Clearly it is possible to
have spatial autoregressive dependencies both among the $Y$ variables and the residuals,
$\varepsilon$. If we now distinguish between these by letting $M$ and $\lambda$ denote the spatial weights
matrix and spatial dependency parameter for the spatial-error component, then one may
combine these two models as follows,7

(6.3.1)  $Y \;=\; \rho W Y + X\beta + u\,, \quad u \;=\; \lambda M u + \varepsilon\,, \quad \varepsilon \sim N(0,\sigma^2 I_n)$

with corresponding reduced form given by

(6.3.2)  $(I_n - \rho W)Y \;=\; X\beta + (I_n - \lambda M)^{-1}\varepsilon$
         $\;\;\Rightarrow\;\; Y \;=\; (I_n - \rho W)^{-1}X\beta + (I_n - \rho W)^{-1}(I_n - \lambda M)^{-1}\varepsilon$

However, our primary interest in this model will be to construct comparative tests of
SEM versus SLM as instances of the same model structure. Hence we shall focus on the
special case with $M = W$,

7
This model has been designated by Kelejian and Prucha (2010) as the SARAR(1,1) model, standing for
Spatial Autoregressive Model with Autoregressive disturbances of order (1,1).


(6.3.3)  $Y \;=\; \rho W Y + X\beta + u\,, \quad u \;=\; \lambda W u + \varepsilon\,, \quad \varepsilon \sim N(0,\sigma^2 I_n)\,,$

which we now designate as the combined model, with corresponding reduced form:

(6.3.4)  $Y \;=\; (I_n - \rho W)^{-1}X\beta + (I_n - \rho W)^{-1}(I_n - \lambda W)^{-1}\varepsilon$

So for any given spatial weights matrix, $W$, the corresponding SE-model (SL-model) is
seen to be the special case of (6.3.3) with $\rho = 0$ ($\lambda = 0$).

One additional point worth noting here is that while this combined model is
mathematically well defined, and can in principle be used to obtain joint estimates of
both $\rho$ and $\lambda$, these joint estimates are in practice often very unstable. In particular,
since both $\rho$ and $\lambda$ serve as dependency parameters for the same matrix, $W$, they in fact
play very similar roles in (6.3.4). But, as will be seen in Section 10.4 below, this
instability will turn out to have little effect on the usefulness of this model for comparing
SEM and SLM.

6.3.2 The Spatial Durbin Model

A second model that will prove useful for our comparisons of SEM and SLM can again
be motivated by the housing price example above. In particular, if housing prices, $Y_i$, in
block group $i$ are influenced by housing prices in neighboring block groups, then it is not
unreasonable to suppose that they may be influenced by other housing attributes in these
block groups. If so, then a natural extension of the SL-model in (6.2.1) would be to
include these spatial effects as additional terms, i.e.,

(6.3.5)  $\displaystyle Y_i \;=\; \beta_0 + \rho \sum_{h \neq i} w_{ih}\, Y_h + \sum_{j=1}^k \beta_j\, x_{ij} + \sum_{h=1}^n w_{ih}\left[\sum_{j=1}^k \gamma_j\, x_{hj}\right] + \varepsilon_i\,, \quad i = 1,..,n$

Following Anselin (1988) this extended model is designated as the Spatial Durbin Model
(also SD-model or simply SDM). This SD-model can be written in matrix form by letting
$\gamma = (\gamma_1,..,\gamma_k)'$. However, one important additional difference is that (as in all previous
models) the matrix, $X$, is defined to include the intercept term in (6.3.5). So here it is
convenient to introduce the more specific notation,

(6.3.6)  $X \;=\; [1_n,\, X_v] \quad \text{and} \quad \beta \;=\; \begin{pmatrix}\beta_0 \\ \beta_v\end{pmatrix}$

where both $X_v\;[= (x_1,..,x_k)]$ and $\beta_v$ now refer explicitly to the explanatory variables.
With this additional notation, (6.3.5) can be written in matrix form as follows:8

8 It is of interest to note here that in many ways it seems more natural to use $X$ for the $x$ variables, and to employ separate notation for the intercept. But while some authors have chosen to do so, including LeSage and Pace (2009) [compare (6.3.7) above with their expression (2.34)], the linear-model notation ($Y = X\beta + \varepsilon$) is so standard that it seems awkward at this point to attempt to introduce new conventions.


(6.3.7)  $Y \;=\; \rho W Y + \beta_0 1_n + X_v\beta_v + W X_v\gamma + \varepsilon\,, \quad \varepsilon \sim N(0,\sigma^2 I_n)$

As pointed out by LeSage and Pace (2009, Sections 2.2, 6.1), this model is also useful for
capturing omitted explanatory variables that may be correlated with the $x$ variables. In
this sense, it may serve to make the SL-model somewhat more robust. However, as
developed more fully in Section 10.3 below, our main interest in this model is that it
provides an alternative method for comparing SLM and SEM.

6.3.3 The Conditional Autoregressive (CAR) Model

There is one additional spatial regression model that should be mentioned in view of its
wide application in the literature. While this model is conceptually similar to the SE-
model, it involves a fundamentally different approach from a statistical viewpoint. In
terms of our housing price example, rather than modeling the joint distribution of all
housing prices $(Y_1,..,Y_n)$ among block groups, this approach focuses on the conditional
distributions of each housing price, $Y_i$, given all the others. The advantage of this
approach is that it avoids all of the simultaneity issues that we have thus far encountered.
In particular, since all univariate conditional distributions derivable from a multi-normal
distribution are themselves normal, this approach starts off by assuming only that the
conditional distribution of each price, $Y_i$, given any values $(y_h: h \neq i)$ of all other prices
$(Y_h: h \neq i)$, is normally distributed. So these distributions are completely determined by
their conditional means and variances. To construct these moments, we start by rewriting
the reduced SE-model in (6.1.9) as follows:

(6.3.8)  $Y \;=\; X\beta + B_\rho^{-1}\varepsilon \;\;\Rightarrow\;\; Y - X\beta \;=\; B_\rho^{-1}\varepsilon \;\;\Rightarrow\;\; B_\rho(Y - X\beta) \;=\; \varepsilon$
         $\;\;\Rightarrow\;\; (I_n - \rho W)(Y - X\beta) \;=\; \varepsilon$
         $\;\;\Rightarrow\;\; Y - X\beta - \rho W(Y - X\beta) \;=\; \varepsilon$
         $\;\;\Rightarrow\;\; Y \;=\; X\beta + \rho W(Y - X\beta) + \varepsilon$

But if we now denote the $i^{th}$ row of $W$ by $w_i = (w_{i1},..,w_{in})$, then the $i^{th}$ line of this
relation can be written as,

(6.3.9)  $\displaystyle Y_i \;=\; x_i'\beta + \rho\, w_i(Y - X\beta) + \varepsilon_i \;=\; x_i'\beta + \rho \sum_{h \neq i} w_{ih}(Y_h - x_h'\beta) + \varepsilon_i$

where the last equality follows from the assumption that $w_{ii} = 0$. This suggests that if
we now condition $Y_i$ on given values $(y_h: h \neq i)$ of $(Y_h: h \neq i)$, then the natural
conditional model of $Y_i$ to consider is the following:


(6.3.10)  $\displaystyle Y_i \,|\, (y_h: h \neq i) \;=\; x_i'\beta + \rho \sum_{h \neq i} w_{ih}(y_h - x_h'\beta) + \varepsilon_i\,, \quad i = 1,..,n$

where again $\varepsilon_i \sim N(0,\sigma^2)$, $i = 1,..,n$. In this form, it is now immediate that

(6.3.11)  $\displaystyle E[\,Y_i \,|\, (y_h: h \neq i)\,] \;=\; x_i'\beta + \rho \sum_{h \neq i} w_{ih}(y_h - x_h'\beta)\,, \quad i = 1,..,n$

Moreover, since $Y_i \,|\, (y_h: h \neq i)$ in (6.3.10) is simply a constant plus $\varepsilon_i$, it also follows that
$Y_i \,|\, (y_h: h \neq i)$ must be normally distributed with the same variance as $\varepsilon_i$, i.e.,

(6.3.12)  $\mathrm{var}[\,Y_i \,|\, (y_h: h \neq i)\,] \;=\; \sigma^2\,, \quad i = 1,..,n$

Such conditional models are usually designated as Conditional Autoregressive (CAR)
models. The advantages of such conditional formulations are most evident in Bayesian
spatial models, where standard "Gibbs sampling" procedures for parameter estimation
require only the specification of all conditional distributions. However, such Bayesian
models are beyond the scope of this NOTEBOOK. [For an excellent discussion of CAR
models in a Bayesian context, see Banerjee, Carlin and Gelfand (2004, Section 3.3).]

Thus our present analysis will focus on the Spatial Errors Model (SEM) and the Spatial
Lag Model (SLM), which are by far the most commonly used spatial regression models.
In the next section, we shall develop the basic methods for estimating the parameters of
these models. This will be followed in Section 8 with a development of the standard
regression diagnostics for these models.


7. Spatial Regression Parameter Estimation

Recall from the specification of both SEM in (6.1.3) and SLM in (6.2.2) above that the
parameters, $(\beta, \sigma^2, \rho)$, are essentially the same for both. As mentioned already, the key
difference is how the spatial autoregressive hypothesis is applied (namely to the
unobserved errors in SEM and to the observed dependent variable itself in SLM). So it is
not surprising that the method of estimation is very similar for both of these models. But
unlike the iterative estimation scheme employed for geo-kriging models in Section 7.3.1
of Part II (based on iteratively reweighted least squares), the present method involves the
simultaneous estimation of all model parameters. So our first objective is to develop this
general method of maximum-likelihood estimation, and then to apply this method to both
SEM and SLM.

7.1 The Method of Maximum-Likelihood Estimation

While maximum-likelihood estimation can in principle be applied to estimate the
parameters of any probability model, it should be clear that the models of primary interest
for our purposes are all based on the multi-normal model. So the following development
is restricted to such models. Here the basic idea can be motivated by the following
(extremely simplified) estimation problem for normal distributions. Suppose that a single
sample, $Y$, is drawn from one of two possible populations having normal densities, $f_1$
and $f_2$, [as in expression (3.1.10) of Part II] with common unit variance, but with
different means, $\mu_1 = 0$ and $\mu_2 = 2$. Here the problem is to estimate the true value of the
mean based on the value, $Y = y$, of this one observation, as shown in Figure 7.1 below:

[Figure 7.1. Simple Estimation Problem: the two normal densities $f_1$ (mean $\mu_1$) and $f_2$ (mean $\mu_2$), with the density values $f_1(y)$ and $f_2(y)$ at the observed sample point $y$.]

To do so, observe that while the density values, $f_1(y)$ and $f_2(y)$, are not themselves
probabilities, their ratio is approximately the relative likelihood of observing values from
these two populations in any sufficiently small neighborhood, $[y - \epsilon,\, y + \epsilon]$, of $y$, as

shown in Figure 7.2 below. In particular, the area under each density, $f_i$, is seen to be
well approximated by a rectangle with base length, $2\epsilon$, and height, $f_i(y)$, $i = 1,2$.

[Figure 7.2. Relation of Density to Local Occupancy Probabilities: the areas under $f_1$ and $f_2$ over the interval $[y - \epsilon,\, y + \epsilon]$ are approximated by the rectangle areas $(2\epsilon)f_1(y)$ and $(2\epsilon)f_2(y)$.]

This figure shows that for any sufficiently small positive increment, $\epsilon$,

(7.1.1)  $\displaystyle \frac{\Pr(y - \epsilon \le Y \le y + \epsilon \,|\, \mu_1)}{\Pr(y - \epsilon \le Y \le y + \epsilon \,|\, \mu_2)} \;\approx\; \frac{f_1(y)}{f_2(y)}$

So if $f_2(y) > f_1(y)$, as in the present example, then it is reasonable to infer that $y$ is
more likely to have come from population 2 than population 1. More formally, we now
say the maximum-likelihood estimate, $\hat\mu$, of the unknown mean, $\mu$, in this two-
population case is given by:

(7.1.2)  $\hat\mu \;=\; \mu_i \;\Leftrightarrow\; f_i(y) \ge f_j(y)\,, \quad i,j \in \{1,2\}\,,\; i \neq j$

Next suppose that nothing is known about the mean of this population, so that $\mu$ could in
principle be any real value. In this case, there is a continuum of possible normal
populations, $\{f(\cdot\,|\,\mu): \mu \in \mathbb{R}\}$, to be considered. But it should still be clear that $y$ is most
likely to have come from that population for which the probability density, $f(y\,|\,\mu)$, is
largest. Thus the maximum-likelihood estimate, $\hat\mu$, is now given by the condition that,

(7.1.3)  $f(y\,|\,\hat\mu) \;=\; \max_\mu f(y\,|\,\mu)$

More generally, suppose we consider a given sample, $y_0 = (y_{01},..,y_{0n})$, of a random
vector, $Y = (Y_1,..,Y_n)$, with multi-normal density, $f(y\,|\,\theta)$, where $\theta = (\theta_1,..,\theta_k)$ denotes
the vector of relevant parameters defining this density. Here, $\theta$, could in principle


contain all mean parameters,   ( 1 ,.., n ) , together with all covariance parameters,
  ( ij : i , j  1,.., n ) defining f [as in expression (3.2.11) of Part II]. But more typically,
 , will contain a much smaller set of parameters that are assumed to completely specify
both  and  in any given model (as will be illustrated by the many examples to
follow). Even in this general setting, the above notion of maximum-likelihood estimator
continues to be perfectly meaningful. For example, suppose that n  2 , so that each
candidate population is representable by a bivariate normal density similar to that Figure
3.2 of Part II. Then as a two-dimensional analogue to Figure 7.2 above, one can imagine
the portion of density above a small rectangular neighborhood of y0  ( y01 , y02 ) , as
shown schematically on the left side of Figure 7.3 below.

[Figure 7.3. Local Occupancy Probabilities for Bivariate Densities: the density mass $f(y_0\,|\,\theta)$ above a small square neighborhood $[y_{01} - \epsilon,\, y_{01} + \epsilon] \times [y_{02} - \epsilon,\, y_{02} + \epsilon]$ of the sample point $y_0$.]

Here again, it is clear that for sufficiently small positive increments, $\epsilon$, this density
volume is well approximated by the box with base area, $(2\epsilon)^2$, and height, $f(y_0\,|\,\theta)$, so
that for any candidate parameter vectors, $\theta_1$ and $\theta_2$, we again have the approximation1

(7.1.4)  $\displaystyle \frac{\Pr(y_0 - \epsilon 1_2 \le Y \le y_0 + \epsilon 1_2 \,|\, \theta_1)}{\Pr(y_0 - \epsilon 1_2 \le Y \le y_0 + \epsilon 1_2 \,|\, \theta_2)} \;\approx\; \frac{f(y_0\,|\,\theta_1)}{f(y_0\,|\,\theta_2)}$

1 Recall that $1_n$ is the unit vector in $\mathbb{R}^n$.


While such graphic representations are not possible in higher dimensions, $n > 2$, it
should be clear that the same approximations hold for all $n$. So as a direct extension of
(7.1.3), it follows that for any given sample observation, $y \in \mathbb{R}^n$, if the relevant set of
possible values of a given parameter vector, $\theta = (\theta_1,..,\theta_k)$, is denoted by $\Theta \subseteq \mathbb{R}^k$,2 then
the maximum-likelihood estimate, $\hat\theta$, of parameter vector $\theta$ is again defined by the
condition that:

(7.1.5)  $f(y\,|\,\hat\theta) \;=\; \max_{\theta \in \Theta} f(y\,|\,\theta)$

Given the fact that the sample $y$ is the known quantity and $\theta$ is unknown, it is usually more
convenient to define the corresponding likelihood function, $l(\theta\,|\,y)$, by

(7.1.6)  $l(\theta\,|\,y) \;=\; f(y\,|\,\theta)\,, \quad \theta \in \Theta$

and replace condition (7.1.5) by

(7.1.7)  $l(\hat\theta\,|\,y) \;=\; \max_{\theta \in \Theta} l(\theta\,|\,y)$

Finally, because densities are positive (in the range of realizable samples, $y$), and
because the log-likelihood function,3

(7.1.8)  $L(\theta\,|\,y) \;=\; \log[\,l(\theta\,|\,y)\,]$

is always monotone increasing in $l(\theta\,|\,y)$, it follows that maximum-likelihood estimates,
$\hat\theta$, can be equivalently characterized by the log-likelihood condition:

(7.1.9)  $L(\hat\theta\,|\,y) \;=\; \max_{\theta \in \Theta} L(\theta\,|\,y)$

The reason for this transformation is that multivariate density functions often involve
products, as exemplified by the important case of independent random sampling,
$f(y\,|\,\theta) = f(y_1,..,y_n\,|\,\theta) = \prod_{i=1}^n f(y_i\,|\,\theta)$. Moreover, since logs convert products to sums,
this representation is often simpler to analyze (as for example when differentiating
likelihood functions).
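As a standard one-dimensional illustration of this point (stated here only for completeness), if $Y_1,..,Y_n$ is an independent random sample from $N(\mu,\sigma^2)$, then taking logs converts the product of densities into the sum

$L(\mu,\sigma^2\,|\,y) \;=\; \log \prod_{i=1}^n f(y_i\,|\,\mu,\sigma^2) \;=\; -\tfrac{n}{2}\log(2\pi) - \tfrac{n}{2}\log(\sigma^2) - \tfrac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2$

which is far easier to differentiate with respect to $\mu$ and $\sigma^2$ than the product itself.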

7.2 Maximum Likelihood Estimation for General Linear Regression Models

To apply this estimation procedure, we start in Section 7.2.1 by considering the most
familiar case of Ordinary Least Squares (OLS). By applying essentially the same
arguments as in Section 7.1 of Part II, we then extend these results to Generalized Least

2 For example, if $\theta = (\theta_1, \theta_2) = (\mu, \sigma^2)$ for $N(\mu,\sigma^2)$, then $\Theta = \{(\theta_1,\theta_2) \in \mathbb{R}^2: \theta_2 > 0\}$.
3 In these notes "log" always means natural log, so the symbols ln and log may be used interchangeably.


Squares (GLS) in Section 7.2.2. These maximum-likelihood estimates for GLS will then
serve as the general framework for obtaining comparable results for the spatial regression
models, SEM and SLM in Sections 7.3 and 7.4 below.

7.2.1 Maximum Likelihood Estimation for OLS

Here we start with the standard linear model,

(7.2.1)  $Y \;=\; X\beta + \varepsilon\,, \quad \varepsilon \sim N(0,\sigma^2 I_n)$

which in turn implies that $Y$ must be multi-normally distributed as $Y \sim N(X\beta, \sigma^2 I_n)$. So
as the special case of expression (3.2.11) in Part II with $\mu = X\beta$ and $\Sigma = \sigma^2 I_n$, it follows
that $Y$ has multi-normal density, $f(y\,|\,\beta,\sigma^2)$, given by

(7.2.2)  $f(y\,|\,\beta,\sigma^2) \;=\; (2\pi)^{-n/2}\,|\sigma^2 I_n|^{-1/2}\, e^{-\frac{1}{2}(y - X\beta)'(\sigma^2 I_n)^{-1}(y - X\beta)}$

[where the parameter vector, $\theta$, for the general version in (7.1.5) above is here given by
$\theta = (\beta, \sigma^2) = (\beta_0, \beta_1,..,\beta_k, \sigma^2)$]. By observing that $|\sigma^2 I_n|^{-1/2} = (\sigma^{2n})^{-1/2}|I_n|^{-1/2} = (\sigma^2)^{-n/2}$
and $(\sigma^2 I_n)^{-1} = \sigma^{-2} I_n^{-1} = \sigma^{-2} I_n$, we see that this density can be simplified to:

(7.2.3)  $f(y\,|\,\beta,\sigma^2) \;=\; (2\pi\sigma^2)^{-n/2}\, e^{-\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)}$

so that the appropriate log-likelihood function for the OLS model is given by:

(7.2.4)  $L(\beta,\sigma^2\,|\,y) \;=\; -\tfrac{n}{2}\log(2\pi) - \tfrac{n}{2}\log(\sigma^2) - \tfrac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)$

Thus to obtain the maximum-likelihood estimates, $(\hat\beta, \hat\sigma^2)$, of the model parameters, we
must maximize (7.2.4) with respect to $\beta$ and $\sigma^2$. To do so, notice first that since $\beta$
appears only in the last term (which is negative), it follows that for any choice of $\sigma^2$,
the function $L$ is always maximized with respect to $\beta$ by minimizing the squared-
deviation function:

(7.2.5)  $SSD(\beta) \;=\; (y - X\beta)'(y - X\beta)$

in a manner identical to expressions (7.1.10) and (7.1.11) in Part II. Thus expression
(7.1.12) of Part II shows that this solution is again given by:

(7.2.6)  $\hat\beta \;=\; (X'X)^{-1}X'y$

While this simple identity might appear to suggest that there is really no need for
maximum likelihood estimation in the case of OLS, the real power of this method
maximum likelihood estimation in the case of OLS, the real power of this method


becomes evident when we turn to the estimation of $\sigma^2$. Indeed, the method of least
squares used for OLS is not directly extendable to $\sigma^2$, so that other methods must be
employed. Even in the case of geostatistical regression, where a comparable estimate of
$\sigma^2$ was developed in expression (7.3.19) of Part II, the actual estimation procedure
involved a rather ad hoc application of a nonlinear least-squares procedure for fitting
spherical variograms to data. But in the present setting, we can now obtain a theoretically
more meaningful estimate. In particular, by substituting $\hat\beta$ from (7.2.6) into (7.2.4), we
can derive the exact maximum-likelihood estimate, $\hat\sigma^2$, of $\sigma^2$ by maximizing the reduced
function,

(7.2.7)  $L_c(\sigma^2\,|\,y) \;=\; L(\hat\beta, \sigma^2\,|\,y) \;=\; -\tfrac{n}{2}\log(2\pi) - \tfrac{n}{2}\log(\sigma^2) - \tfrac{1}{2\sigma^2}(y - X\hat\beta)'(y - X\hat\beta)$

where the subscript "c" reflects the common designation of this function as the
concentrated likelihood function of parameter, $\sigma^2$ [also called a profile likelihood
function]. But since the first-order condition for a maximum yields:

(7.2.8)  $0 \;=\; \dfrac{d}{d\sigma^2}\, L_c(\sigma^2\,|\,y) \;=\; -\dfrac{n}{2}\cdot\dfrac{1}{\sigma^2} + \dfrac{1}{2}\left(\dfrac{1}{\sigma^2}\right)^2 (y - X\hat\beta)'(y - X\hat\beta)$
         $\qquad \Rightarrow\;\; n \;=\; \dfrac{1}{\sigma^2}\,(y - X\hat\beta)'(y - X\hat\beta)$

we see that the maximum-likelihood estimate for $\sigma^2$ is given by,4

(7.2.9)  $\hat\sigma^2 \;=\; \tfrac{1}{n}\,(y - X\hat\beta)'(y - X\hat\beta)$

This can be given a more familiar form in terms of the estimated residuals, $\hat\varepsilon = (\hat\varepsilon_1,..,\hat\varepsilon_n)'$, as

(7.2.10)  $\hat\sigma^2 \;=\; \tfrac{1}{n}\,\hat\varepsilon\,'\hat\varepsilon \;=\; \tfrac{1}{n}\sum_{i=1}^n \hat\varepsilon_i^2$

which is seen to be the "natural" estimator of $\sigma^2 = \mathrm{var}(\varepsilon) = E(\varepsilon^2)$.
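A minimal MATLAB sketch of these OLS maximum-likelihood computations (with y an n x 1 data vector and X an n x (k+1) matrix whose first column is all ones) is given by:

% Minimal sketch: OLS maximum-likelihood estimates (7.2.6), (7.2.9) and the value of (7.2.4)
n        = length(y);
beta_hat = (X' * X) \ (X' * y);            % beta_hat = (X'X)^{-1} X'y
res      = y - X * beta_hat;               % estimated residuals
sig2_hat = (res' * res) / n;               % sigma^2_hat = (1/n) * residual sum of squares
logL     = -(n/2)*log(2*pi) - (n/2)*log(sig2_hat) - (res'*res)/(2*sig2_hat);   % maximized log-likelihood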

7.2.2 Maximum Likelihood Estimation for GLS

To extend these estimation results to GLS, we start with the general linear model,

(7.2.11)  $Y \;=\; X\beta + \varepsilon\,, \quad \varepsilon \sim N(0,\sigma^2 V)$

4 One may also check that the second derivative of $L_c$ evaluated at $\hat\sigma^2$ is negative, and thus yields a maximum.


where the matrix, $V$, is assumed to be known.5 So in this setting, OLS is seen to be the
special case with $V = I_n$. The key feature of this model is that, like the OLS model in
(7.2.3) above, the only unknown parameters are the beta coefficients, $\beta$, together with
the positive variance parameter, $\sigma^2$ [so that again, $\theta = (\beta, \sigma^2)$]. As with OLS, this
implies that $Y$ is again multi-normally distributed, where in this case, $Y \sim N(X\beta, \sigma^2 V)$,
with density:

(7.2.12)  $f(y\,|\,\beta,\sigma^2) \;=\; (2\pi)^{-n/2}\,|\sigma^2 V|^{-1/2}\, e^{-\frac{1}{2}(y - X\beta)'(\sigma^2 V)^{-1}(y - X\beta)}$

By employing the parallel matrix identities, $|\sigma^2 V|^{-1/2} = (\sigma^{2n})^{-1/2}|V|^{-1/2} = (\sigma^2)^{-n/2}|V|^{-1/2}$ and
$(\sigma^2 V)^{-1} = \sigma^{-2}V^{-1}$, this can again be simplified to:

(7.2.13)  $f(y\,|\,\beta,\sigma^2) \;=\; (2\pi\sigma^2)^{-n/2}\,|V|^{-1/2}\, e^{-\frac{1}{2\sigma^2}(y - X\beta)'V^{-1}(y - X\beta)}$

which is seen to yield the associated log-likelihood function:

(7.2.14)  $L(\beta,\sigma^2\,|\,y) \;=\; -\tfrac{n}{2}\log(2\pi) - \tfrac{n}{2}\log(\sigma^2) - \tfrac{1}{2}\log|V| - \tfrac{1}{2\sigma^2}(y - X\beta)'V^{-1}(y - X\beta)$

So to obtain the maximum-likelihood estimate, $\hat\beta$, of $\beta$, it now follows (as an extension
of the OLS case) that for any choice of $\sigma^2$, the function $L$ will be maximized by choosing
$\hat\beta$ to minimize the quadratic form, $(y - X\beta)'V^{-1}(y - X\beta)$ [which is identical in form to
expression (7.1.27) of Part II, and may again be interpreted as a type of weighted least-
squares problem]. But at this point, we may now observe [as in expression (7.1.15) of Part
II] that if $T$ denotes the Cholesky matrix for $V$,6 then the matrix identity

(7.2.15)  $V \;=\; TT' \;\;\Rightarrow\;\; V^{-1} \;=\; (T')^{-1}T^{-1} \;=\; (T^{-1})'\,T^{-1}$

allows us to reduce this quadratic form as follows:

(7.2.16)  $(y - X\beta)'V^{-1}(y - X\beta) \;=\; (y - X\beta)'(T^{-1})'T^{-1}(y - X\beta)$
          $\qquad =\; (T^{-1}y - T^{-1}X\beta)'(T^{-1}y - T^{-1}X\beta)$
          $\qquad =\; (\tilde y - \tilde X\beta)'(\tilde y - \tilde X\beta)$

But this is precisely the squared-deviation function in (7.2.5) for the new data set, $\tilde y = T^{-1}y$
and $\tilde X = T^{-1}X$. So it follows at once from (7.2.6) that the GLS maximum-likelihood
estimate, $\hat\beta$, of $\beta$ is given [as in expressions (7.1.21) through (7.1.24) in Part II] by

5 Unlike the model specification in expression (7.1.8) of Part II, the matrix $V$ need not be a correlation matrix (i.e., its diagonal elements need not be all ones). However, since $\sigma^2 V$ is required to be a nonsingular covariance matrix, $V$ must be symmetric and positive definite (as in Section A2.7.2 of the Appendix to Part II).
6 Here existence of $T$ is ensured by the Cholesky Theorem in Section A2.7.2 of the Appendix to Part II.


(7.2.17) ˆ  ( X X )1 X  y  [(T 1 X ) (T 1 X )]1 (T 1 X )(T 1 y )

 [ Xˆ (T )1 T 1 X ]1 X (T )1 T 1 y

so that by (7.2.15),

(7.2.18) ˆ  ( X V 1 X )1 X V 1 y

Moreover, precisely the same maximization arguments for  2 in (7.2.8) and (7.2.9) above
now show that the GLS maximum-likelihood estimate for  2 is given by

(7.2.19) ˆ 2  n1 ( y  X ˆ )( y  X ˆ )  n1 (T 1 y  T 1 X ˆ )(T 1 y  T 1 X ˆ )

 n1 ( y  X ˆ )(T 1 )(T 1 )( y  X ˆ )

so that again by (7.2.15),

(7.2.20) ˆ 2  n1 ( y  X ˆ )V 1 ( y  X ˆ )

Thus the maximum-likelihood estimation results for OLS are seen to be directly
extendable to the class of GLS models (7.2.11).
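A minimal MATLAB sketch of these GLS estimates (assuming the covariance structure V is known, as in the model above) is:

% Minimal sketch: GLS maximum-likelihood estimates (7.2.18) and (7.2.20), with V known
n        = length(y);
Vi       = inv(V);                         % V^{-1} (equivalently, one could work with T = chol(V,'lower'))
beta_hat = (X' * Vi * X) \ (X' * Vi * y);  % beta_hat = (X'V^{-1}X)^{-1} X'V^{-1}y
res      = y - X * beta_hat;
sig2_hat = (res' * Vi * res) / n;          % sigma^2_hat = (1/n)(y - X*beta_hat)'V^{-1}(y - X*beta_hat)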

7.3 Maximum Likelihood Estimation for SEM

To apply these general results to SE-models, we start by recalling from expressions
(6.1.7) and (6.1.8) that SEM can be written as

(7.3.1)  $Y \;=\; X\beta + u\,, \quad u \sim N(0,\sigma^2 V_\rho)$

where the spatial covariance structure, $V_\rho$, is given by

(7.3.2)  $V_\rho \;=\; (B_\rho' B_\rho)^{-1} \;=\; B_\rho^{-1}(B_\rho^{-1})'$

with $B_\rho$ given in terms of the weights matrix, $W$, by

(7.3.3)  $B_\rho \;=\; I_n - \rho W$

So SEM can be viewed as an instance of the GLS model in (7.2.11), where $V$ now takes
the specific form $V_\rho$ in (7.3.2). However, it must be emphasized that unlike (7.2.11), the
matrix $V_\rho$ involves an unknown parameter, $\rho$. So to be precise, (7.3.1) should be viewed


as a GLS model conditioned on ρ. But nonetheless, we can still employ (7.2.14) to write down the appropriate log-likelihood function for SEM as

(7.3.4)    L(β, σ², ρ | y) = -(n/2) log(2π) - (n/2) log(σ²) - (1/2) log|V_ρ| - (1/(2σ²))(y - Xβ)'V_ρ⁻¹(y - Xβ)

In particular, we now know from (7.2.18) and (7.2.20) that for any given value of ρ, the maximum-likelihood estimates for β and σ², conditional on ρ, are given respectively by

(7.3.5)    β̂_ρ = (X'V_ρ⁻¹X)⁻¹X'V_ρ⁻¹y

and

(7.3.6)    σ̂²_ρ = (1/n)(y - Xβ̂_ρ)'V_ρ⁻¹(y - Xβ̂_ρ)

where the subscript on these estimates reflects their dependency on the value of ρ. But since these conditional estimates are expressible as explicit (closed-form) functions of ρ, we can substitute these results into (7.3.4) and obtain a concentrated likelihood function for ρ in a manner similar to that of σ² in the case of OLS [in expression (7.2.7) above]. In the present case, this concentrated likelihood takes the following form:

(7.3.7)    L_c(ρ | y) = L(β̂_ρ, σ̂²_ρ, ρ)
                      = -(n/2) log(2π) - (n/2) log(σ̂²_ρ) - (1/2) log|V_ρ| - (1/(2σ̂²_ρ))(y - Xβ̂_ρ)'V_ρ⁻¹(y - Xβ̂_ρ)

To further simplify this expression, we first note from (7.3.6) that the last term in (7.3.7) reduces to a constant, since

(7.3.8)    -(1/(2σ̂²_ρ))(y - Xβ̂_ρ)'V_ρ⁻¹(y - Xβ̂_ρ) = -(1/(2σ̂²_ρ))[n σ̂²_ρ] = -n/2

Moreover, it follows from standard properties of matrix inverses and determinants [as in expressions (A3.1.18), (A3.1.20), (A3.2.70) and (A3.2.71) of the Appendix] that

(7.3.9)    |V_ρ| = |(B_ρ'B_ρ)⁻¹| = |B_ρ⁻¹| · |(B_ρ')⁻¹| = |B_ρ|⁻¹ |B_ρ|⁻¹ = |B_ρ|⁻²

So by substituting these identities into (7.3.7) we obtain the simpler form of the concentrated likelihood function for ρ:

(7.3.10)   L_c(ρ | y) = -(n/2)[1 + log(2π)] + log|B_ρ| - (n/2) log(σ̂²_ρ)

________________________________________________________________________
ESE 502 III.7-9 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

With these results, the desired maximum-likelihood estimation procedure for SEM is now evident. In particular, we first maximize the concentrated likelihood function, L_c(ρ | y), to obtain the estimate, ρ̂, and then use (7.3.5) and (7.3.6) to obtain the remaining estimates, β̂ and σ̂², as:

(7.3.11)   β̂ = β̂_ρ̂ = (X'V_ρ̂⁻¹X)⁻¹X'V_ρ̂⁻¹y

and

(7.3.12)   σ̂² = σ̂²_ρ̂ = (1/n)(y - Xβ̂)'V_ρ̂⁻¹(y - Xβ̂)

Since L_c(ρ | y) is a smooth function in one variable, the first step can be accomplished by standard numerical “line search” methods. So for reasonably small sample sizes, n, this estimation procedure is very efficient.
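
To illustrate, the following MATLAB sketch carries out this line search with the built-in function fminbnd. It is only an illustration of the logic in (7.3.5) through (7.3.12) — the class program sem.m is the full implementation — and the function name sem_sketch, as well as the search interval (-0.99, 0.99), which presumes a row-normalized weight matrix W, are assumptions made here for concreteness:

   function [rhohat,beta,sig2] = sem_sketch(y,X,W)
   % Minimal sketch of SEM estimation by a concentrated-likelihood line search
   n      = length(y);
   rhohat = fminbnd(@(r) -Lc(r), -0.99, 0.99);   % maximize (7.3.10) over rho
   B      = eye(n) - rhohat*W;                   % B at the estimated rho
   beta   = (B*X) \ (B*y);                       % final estimate (7.3.11)
   sig2   = sum((B*y - B*X*beta).^2)/n;          % final estimate (7.3.12)

     function val = Lc(r)                        % concentrated likelihood (7.3.10)
       Br  = eye(n) - r*W;
       yt  = Br*y;  Xt = Br*X;                   % uses inv(V_r) = Br'*Br
       b   = (Xt'*Xt) \ (Xt'*yt);                % conditional estimate (7.3.5)
       s2  = sum((yt - Xt*b).^2)/n;              % conditional estimate (7.3.6)
       val = -(n/2)*(1 + log(2*pi)) + log(det(Br)) - (n/2)*log(s2);
     end
   end

Note that log(det(Br)) is recomputed from scratch at every trial value of r in this sketch; the eigenvalue device discussed below avoids exactly this cost.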

But for larger sample sizes (say, n > 500), an additional problem is created by the need to evaluate the determinant, |B_ρ|, of this n-square matrix at each step of this procedure. However, such computations can often be made more efficient by means of the following observation. Recall from the discussion of eigenvalues and eigenvectors in Section 3.3.1 above that nonsingular matrices such as B_ρ have a “spectral” representation in terms of the diagonal matrix, Λ_λ = diag(λ₁,..,λ_n), of their eigenvalues, together with the nonsingular matrix, X_λ = (x_λ1,..,x_λn), of their associated eigenvectors as:

(7.3.13)   B_ρ = X_λ Λ_λ X_λ⁻¹

So again by standard determinant identities [(A3.2.70) and (A3.2.72) in the Appendix], it follows that

(7.3.14)   |B_ρ| = |X_λ| · |Λ_λ| · |X_λ⁻¹| = |X_λ| · |Λ_λ| · |X_λ|⁻¹ = |Λ_λ| = Π_{i=1}^n λ_i

Moreover, if the eigenvalues of the weight matrix, W, in (7.3.3) are denoted by μ_i with associated eigenvectors, x_i, i = 1,..,n, so that

(7.3.15)   W x_i = μ_i x_i ,   i = 1,..,n

then it follows from (7.3.3) that

(7.3.16)   B_ρ x_i = (I_n - ρW) x_i = x_i - ρW x_i = x_i - ρμ_i x_i = (1 - ρμ_i) x_i ,   i = 1,..,n

Thus we see that the eigenvalues of B_ρ are obtainable from those of W by the identity

(7.3.17)   λ_i = 1 - ρμ_i ,   i = 1,..,n

(with corresponding eigenvector, x_λi = x_i). In particular, this implies from (7.3.13) that

(7.3.18)   |B_ρ| = Π_{i=1}^n (1 - ρμ_i)

and thus that the log determinant in (7.3.10) is given simply by

(7.3.19)   log|B_ρ| = Σ_{i=1}^n log(1 - ρμ_i)

So by calculating the eigenvalues (μ₁,..,μ_n) of the weight matrix, W, we can rapidly compute the determinant, |B_ρ|, for any value of ρ. While the computation of these eigenvalues can itself be time consuming, the key point is that this calculation need only be done once. This procedure is so useful that it is incorporated into almost all software packages for calculating such maximum-likelihood estimates (when n is sufficiently large).7
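
In MATLAB, this one-time calculation and the subsequent cheap evaluation of log|B_ρ| might be sketched as follows (the variable names are illustrative, and the real() guards against the tiny imaginary round-off that can arise when W is not symmetric):

   mu      = eig(W);                              % one-time eigenvalue computation for W
   logdetB = @(rho) real(sum(log(1 - rho*mu)));   % log|B_rho| via (7.3.19), cheap for any rho
   % quick check against direct evaluation at, say, rho = 0.5:
   % logdetB(0.5)  versus  log(det(eye(size(W,1)) - 0.5*W))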

7.4 Maximum-Likelihood Estimation for SLM

In most respects, maximum-likelihood estimation for SL-models is virtually identical to that for SE-models. To begin with, recall from expression (6.2.6) that SLM can be written as

(7.4.1)    Y = X_ρβ + u ,   u ~ N(0, σ²V_ρ)

where X_ρ = B_ρ⁻¹X and where V_ρ and B_ρ are again given by (7.3.2) and (7.3.3). So the only formal difference here is that for each given value of ρ, we now obtain a GLS model in which both V_ρ and X_ρ depend on ρ. So the corresponding log-likelihood function takes the form,

(7.4.2)    L(β, σ², ρ | y) = -(n/2) log(2π) - (n/2) log(σ²) - (1/2) log|V_ρ| - (1/(2σ²))(y - X_ρβ)'V_ρ⁻¹(y - X_ρβ)

which in turn implies that for the SLM case, the maximum-likelihood estimate for β conditional on ρ is given by:

(7.4.3)    β̂_ρ = (X_ρ'V_ρ⁻¹X_ρ)⁻¹X_ρ'V_ρ⁻¹y

7
However, it should also be noted that for extremely large sample sizes (say n > 1000) the numerical accuracy of such eigenvalue calculations becomes less reliable. In such cases, (7.3.19) is often approximated by using only those terms with eigenvalues of largest absolute magnitudes.


For computational purposes, it is often more convenient to reduce this expression by observing that

(7.4.4)    X_ρ'V_ρ⁻¹X_ρ = (B_ρ⁻¹X)'(B_ρ'B_ρ)B_ρ⁻¹X
                        = X'(B_ρ⁻¹)'(B_ρ'B_ρ)B_ρ⁻¹X
                        = X'[(B_ρ')⁻¹B_ρ'][B_ρB_ρ⁻¹]X = X'X

and similarly that

(7.4.5)    X_ρ'V_ρ⁻¹ = (B_ρ⁻¹X)'(B_ρ'B_ρ) = X'[(B_ρ')⁻¹B_ρ']B_ρ = X'B_ρ

So the maximum-likelihood estimate of β given ρ reduces to the simpler form

(7.4.6)    β̂_ρ = (X'X)⁻¹X'B_ρ y

Similarly, the maximum-likelihood estimate for σ² conditional on ρ is given by

(7.4.7)    σ̂²_ρ = (1/n)(y - X_ρβ̂_ρ)'V_ρ⁻¹(y - X_ρβ̂_ρ)

But by using the same arguments in (7.4.4) and (7.4.5) we see that

(7.4.8)    (y - X_ρβ̂_ρ)'V_ρ⁻¹(y - X_ρβ̂_ρ) = (y - B_ρ⁻¹Xβ̂_ρ)'(B_ρ'B_ρ)(y - B_ρ⁻¹Xβ̂_ρ)
                                           = [B_ρ⁻¹(B_ρy - Xβ̂_ρ)]'(B_ρ'B_ρ)[B_ρ⁻¹(B_ρy - Xβ̂_ρ)]
                                           = (B_ρy - Xβ̂_ρ)'[(B_ρ')⁻¹B_ρ'][B_ρB_ρ⁻¹](B_ρy - Xβ̂_ρ)
                                           = (B_ρy - Xβ̂_ρ)'(B_ρy - Xβ̂_ρ)

and thus that the maximum-likelihood estimate for σ² conditional on ρ for SLM reduces to:

(7.4.9)    σ̂²_ρ = (1/n)(B_ρy - Xβ̂_ρ)'(B_ρy - Xβ̂_ρ)

By substituting these expressions into (7.4.2), we again obtain a concentrated log-likelihood function for ρ, namely

(7.4.10)   L_c(ρ | y) = -(n/2) log(2π) - (n/2) log(σ̂²_ρ) - (1/2) log|V_ρ| - (1/(2σ̂²_ρ))(y - X_ρβ̂_ρ)'V_ρ⁻¹(y - X_ρβ̂_ρ)


As with SEM, this can be reduced by again observing from (7.4.7) that

(7.4.11)   -(1/(2σ̂²_ρ))(y - X_ρβ̂_ρ)'V_ρ⁻¹(y - X_ρβ̂_ρ) = -(1/(2σ̂²_ρ)) n σ̂²_ρ = -n/2

which together with (7.3.9) shows that the concentrated likelihood function for ρ has exactly the same form for SLM as for SEM, i.e.,

(7.4.12)   L_c(ρ | y) = -(n/2)[1 + log(2π)] + log|B_ρ| - (n/2) log(σ̂²_ρ)

So the only difference between (7.3.10) and (7.4.12) is the explicit form of σ̂²_ρ in (7.3.6) and (7.4.9), respectively. In particular, this implies that all the discussion about numerical maximization of concentrated likelihoods to obtain ρ̂ is identical for both models; the eigenvalue decomposition in (7.3.19), for example, is precisely the same. So to complete the estimation procedure, it remains only to substitute this estimate, ρ̂, into (7.4.6) and (7.4.9) to obtain the respective estimates,

(7.4.13)   β̂ = (X'X)⁻¹X'B_ρ̂ y

and

(7.4.14)   σ̂² = (1/n)(B_ρ̂y - Xβ̂)'(B_ρ̂y - Xβ̂)
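
In MATLAB terms, the only change to the line-search sketch given for SEM above is that the conditional estimates inside the concentrated likelihood now use (7.4.6) and (7.4.9). Once ρ̂ is found, the final estimates follow directly (again a minimal sketch, with y, X, W and rhohat assumed to be in the workspace):

   By   = (eye(length(y)) - rhohat*W) * y;    % B_rhohat * y
   beta = (X'*X) \ (X'*By);                   % final estimate (7.4.13)
   sig2 = sum((By - X*beta).^2)/length(y);    % final estimate (7.4.14)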

7.5 An Application to the Irish Blood Group Data

At this point, it is instructive to consider an application of these spatial regression models to an empirical example, namely the Irish Blood Group data in Section 1.2 above. To do so, we start with a standard OLS regression analysis in Section 7.5.1 below and test the residuals for spatial autocorrelation (as in Section 4.3 above). The spatial regression models, SEM and SLM, are then applied to this data in Section 7.5.2.

7.5.1 OLS Residual Analysis and Choice of Spatial Weights Matrices

Recall from Figures 1.7 and 1.8 above that the “footprint” of the 12th Century Anglo-Norman counties, known as the Pale, can still be seen in the spatial density pattern of Blood Group A in 1958. So an interesting question to explore is how much of this pattern can be statistically accounted for by this single explanatory variable.8 To do so, we now consider a simple regression

(7.5.1)    Y_i = β₀ + β₁ x_i + ε_i ,   i = 1,..,n

8
Note that the Irish Blood Group data set in [BG] contains one other potentially relevant explanatory
variable, namely the number of place names (per unit of area) ending in “town” within each county.
However, in the present example we focus only on the (marginal) effect of the Pale itself.


where the relevant dependent variable, Y_i, is the proportion of adults with Blood Group A in each county i, and the single explanatory variable, x_i, is taken to be the indicator (zero-one) variable for the Pale (corresponding to the red area in Figure 1.8 above), where

(7.5.2)    x_i =  { 1 , if i ∈ Pale
                  { 0 , if i ∉ Pale

To run this regression, we here use the ARCMAP version of OLS, and employ the
ARCMAP data set in Eire.mxd. While JMP is generally more suitable for such analyses,
performing OLS inside ARCMAP has the particular advantage of allowing the regression
residuals to be mapped directly. This program can be found on the ArcToolbox path:

Spatial Statistics Tools > Modeling Spatial Relationships > Ordinary Least Squares

In the window that opens, type the entries shown on the left in Figure 7.4 below (where
as usual, path names are machine specific):

Figure 7.4. Running Ordinary Least Squares in ArcToolbox

Notice that both the coefficient estimates and diagnostics are “optional” tables, which
should definitely be added. These will appear in the Table of Contents, as shown at the
bottom right in Figure 7.4. The relevant portion of eire_output (for our purposes)9 is
shown in Table 7.1 below:

9
Note in particular that the “robust” estimates and tests in this Table have not been shown. As with a
number of other statistical diagnostics in ARCMAP, these robust-estimation results are difficult to interpret
without further documentation.


Table 7.1. Coefficient Estimates and P-Values

So the “Pale effect” is seen to be positive and very significant, indicating that Blood
Group A levels are significantly higher inside the Pale than elsewhere in Eire. But as we
have seen many times before, this significance may well be inflated by the presence of
spatial dependencies among Blood Group levels that are not accounted for by the Pale
alone. So the remaining task is to test the regression residuals for spatial autocorrelation.
These residuals are shown graphically on the right in Figure 7.5 below, where the pattern
of Blood Group values in Figure 1.7 is reproduced on the left for ease of comparison.


Figure 7.5. Blood Group values and OLS Residuals

Before analyzing these residuals, it is important to emphasize that the “default” residuals
that appear in ARCMAP (as indicated on the right side of Figure 7.4) have been
normalized to Studentized Residuals (StdResiduals). So to be comparable with the rest of
our analysis, this plot must be redone in terms of the Residuals column in the Attribute
Table, as is done in Figure 7.5.10

10
Note that studentized residuals (again not documented in ARCMAP) are useful for many testing
purposes when the original assumption of independent residuals holds. But in the presence of possible
spatial dependencies, it is generally preferable to analyze the raw residuals themselves.


Note from the plot of these residuals in Figure 7.5 that (as with many linear regressions)
the highest Blood Group values in the Pale are underestimated (red residuals) and the
lowest values outside the Pale are overestimated (blue residuals). This by itself tends to
preserve a certain amount of the positive correlation seen in the original Blood Group
data.

But to determine the statistical significance of such residual correlations, we must of course employ an appropriate spatial weights matrix, W. Because the present Eire example provides a dramatic illustration of how important this choice of W can be, we now consider this choice in some detail. To do so, a number of candidate weight matrices from Sections 2.1.2 and 2.1.3 were applied to this residual data, with test results summarized in terms of p-values in Table 7.2 below. Here the first column, Asymp, displays the results of the standard asymptotic Moran test in Section 4.2.1 above. The remaining three columns, Moran, Rho, and Corr, show comparable results for the sac_perm test in Section 4.3.1 above (using 999 simulations).

              Asymp     Moran     Rho       Corr
   W_nn       0.540     0.595     0.504     0.589
   W_nns      0.235     0.282     0.280     0.285
   W_nn5      0.228     0.113     0.117     0.115
   W_queen    0.249     0.091     0.118     0.106
   W_share    0.016     0.035     0.058     0.039
   W          0.010     0.019     0.026     0.020

           Table 7.2. P-values for the Eire OLS Residuals

The first spatial weights matrix considered is the simple (centroid) nearest-neighbor
matrix, Wnn , which (as already mentioned above Figure 1.18) is very restrictive for areal
data in terms of potentially relevant neighbors ignored. Here it is clear that no spatial
autocorrelation is detected by any method using this matrix. A slightly more appropriate
version is the symmetric nearest-neighbor matrix, Wnns [expression (2.1.10) above with k = 1], shown in the next row. Here the results are all still very insignificant, but are nonetheless dramatically more significant than for the asymmetric case. The reason for this
in the case of Eire can be seen in Figure 7.6 below, where county centroids are shown as
blue dots, and where the red line emanating from each centroid is directed toward its
nearest neighbor. This figure (which extends the Laoghis County illustration in Figure
1.18 above) confirms that such neighbor relations are relatively sparse throughout Eire. In
particular, there are very few mutual nearest neighbors, i.e., red lines with both ends
connected to centroids. So when moving from nearest neighbors, Wnn , to symmetric
nearest neighbors, Wnns , it is now clear that many more relations are added to the matrix,
thus allowing many more possibilities for spatial correlation to be considered.


The third and fourth rows show respective results for the queen contiguity matrix, Wqueen ,
[expression (2.1.15)] and for one of its k-nearest-neighbor approximations, namely, the
five nearest neighbor version, Wnn5 [as in expression (2.1.9) with k = 5, and as also used
in Figure 1.18 for Laoghis County]. These two cases are of special interest, since they are
by far the most commonly used weights matrices for analyzing areal data. But in both
cases, spatial autocorrelation is at best seen to be weakly significant – and is totally
insignificant for the standard asymptotic Moran test.11

Figure 7.6. Nearest-Neighbor Relations in Eire

In view of this lack of significance, the results in the final two rows are quite striking.
These show respective results for the boundary shares matrix, Wshare [expression
(2.1.17)], and for the combined distance-shares matrix, W, of Cliff and Ord (1969)
[expression (2.1.18)]. Because we shall employ this latter matrix, W, in our subsequent
analyses, it is here convenient to reproduce its typical elements, wij , as follows,

(7.5.3)    w_ij = ( l_ij d_ij⁻¹ ) / ( Σ_{k≠i} l_ik d_ik⁻¹ )

11
In fairness, it should be pointed out (as is done for example in ARCMAP) that such asymptotic tests
typically require more samples (areal units) for statistical reliability. A common rule of thumb (that we
have seen already for the Central Limit Theorem) is that n be at least 30.


where lij is the fraction of the boundary of county i shared with county j , and dij is the
distance between their respective centroids.12 While it is difficult to explain exactly why
these two matrices capture so much more significance, one can gain insight by simply
noting the unusual complexity of county boundaries in Eire. These complexities have
most likely resulted from a long history of interactions between neighboring counties, so
that shared boundary lengths may well reflect the degree of such interactions. Moreover,
in so far as centroid distances tend to reflect relative travel distance between counties, it
is reasonable to suppose that such distances reflect other dimensions of interaction. In any
case, this example provides a clear case where it is prudent to consider a variety of tests
in terms of alternative spatial weights matrices before drawing any firm conclusions
about the presence of spatial autocorrelation. One rule of thumb is to try several (say
three) different matrices which exhibit sufficient qualitative differences to capture a range
of interaction possibilities. As stressed at the beginning of Part III, one of the most
perplexing features of areal data analysis is the absence of any clear notion of “spatial
separation” between areal units.

7.5.2 Spatial Regression Analyses

As stated above, we here employ the combined distance-shares matrix, W, in (7.5.3), which captures the most significant amount of spatial autocorrelation in Table 7.2 [and which constitutes the original matrix used by Cliff and Ord (1969) in their classic study of this Eire data]. To construct such a matrix, we first note that the procedure for constructing boundary-share weights is developed in Sections 3.2.2 and 3.2.3 of Part IV (as mentioned in Section 2.2.2 above), and is also discussed in more detail in Assignment 7. For the case of Eire, such boundary shares are given by matrix, W_share, in the MATLAB workspace, Eire.mat. Using the MATLAB script, eire_wts.m, these shares (l_ij) can be combined with centroid distances (d_ij) to yield the desired combined distance-shares weight matrix, W, in the workspace.13
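
The essential logic of this construction can be sketched in a few lines of MATLAB (this is only an illustration, not the eire_wts.m script itself; it assumes that a boundary-shares matrix L_share with zero diagonal and an n-by-2 matrix of county centroid coordinates, coords, are in the workspace, and it uses pdist from the Statistics Toolbox):

   n = size(L_share,1);
   D = squareform(pdist(coords));           % centroid distances d_ij
   D(1:n+1:end) = Inf;                      % avoid 0/0 on the diagonal
   A = L_share ./ D;                        % numerators  l_ij * d_ij^(-1)
   W = bsxfun(@rdivide, A, sum(A,2));       % row-normalize to obtain w_ij in (7.5.3)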

Given this weight matrix, we now employ the spatial regression models, SEM and SLM,
to capture the relation between Blood Group levels and the Pale in a manner that
accounts for the spatial autocorrelation detected in Table 7.2. The estimation procedures
for SEM and SLM are implemented in the MATLAB programs, sem.m and slm.m,
respectively. The inputs required for each program consist of a data vector, y, for the
dependent variable, a data matrix, X, for the explanatory variables, and an appropriate
spatial weights matrix, W, relating the relevant set of areal units. In the present case, y is
the vector of Blood Group proportions for each county, X is the vector, x, identifying
those counties in the Pale, and W is the combined distance-shares matrix above.

12
Further discussion of this weight matrix can be found in Upton and Fingleton (1985, pp.287-288) [see
Reference 18 in the class Reference Materials].
13
Here it is of interest to note that these weights differ slightly from those of Cliff and Ord (1969), which
can be found in Table 5.1 of Upton and Fingleton (1985), and which are also reproduced as matrix, W2, in
the workspace, Eire.mat. This illustrates the fact that such constructions will differ to some degree
depending on the particular map of Eire that is used. (Indeed, digital maps did not even exist in 1969 when
the original work was done.)


Before running these models, it should be noted that there are two additional inputs,
vnames and val (also described in the program documentation). We have already seen
vnames used as the list of variable names in previous applications (as for example in the Cobalt Example of Section 7.3.4 in Part II). For the present case of a single variable, one
need only write the variable name in single quotes, which here is ‘Pale’. The final input,
val, represents the optional input of eigenvalues for W used to calculate the log
determinant in (7.3.19) above. In the case of Eire with n = 26, this is hardly necessary.
But for very large weight matrices, W, it is worth noting that the corresponding vector of
eigenvalues is easily obtained in MATLAB with the command:

>> val = eig(W);

With these preliminary observations, we can now run both SEM and SLM, using the
respective commands:

>> sem(y,X,W,‘Pale’);

>> slm(y,X,W,‘Pale’);

It should also be noted that there are a number of data outputs given by these two models.
But for our present purposes, it is enough to examine their screen outputs, as shown in
Figure 7.7 below. Here it is clear that there is a strong parallel between the output formats
of each model. In particular, they are quite comparable in terms of both their output
results and diagnostics (as discussed in more detail below). Note also that these two
formats look very much the same as for OLS regression in the sense that significance
levels (p-values) are reported for each parameter estimate, together with various measures
of “goodness of fit”. But as we shall see below, the actual methods of obtaining these
results (and in some cases, even their meaning) differs substantially from OLS.
Nonetheless, the basic interpretations of parameter estimates and their significance levels
will remain the same as in OLS. So before getting into the details of calculation methods,
it is appropriate to begin by examining these results in a qualitative way.

With respect to SEM, notice first that while the Pale effect continues to be positive (as in Table 7.1 for OLS), this effect is now both smaller in magnitude (1.55 versus 4.25) and dramatically less significant (with a p-value of .0788 versus .000012). Notice also that the level of spatial autocorrelation, ρ̂ = 0.7885, is significantly positive. As we have seen before, this suggests that such differences are largely due to the presence of spatial autocorrelation. While the exact nature of these effects is difficult to identify in the present spatial regression setting, we can nonetheless make certain useful observations. First, if the relevant data matrix for this Eire example is denoted by X = [1_n, x], then it follows from expression (7.1.12) in Part II together with (7.3.11) above that the OLS and SEM estimates of β = (β₀, β₁)' are given respectively by

(7.5.4)    β̂_OLS = (X'X)⁻¹X'y


SEM OUTPUT:

   FINAL REGRESSION RESULTS:
      VAR        COEFF         Z-VAL        PROB
      const      28.82487      20.66107     0.000000
      Pale        1.553209      1.757660    0.078805
      Variance = 2.1251

   AUTOCORRELATION RESULTS:
                 VAL           Z-VAL        PROB
      rho        0.788456      7.466704     0.000000

   GOODNESS-OF-FIT RESULTS:
      Extended R-Square      =   0.3313
      Extended R-Square Adj  =   0.3034
      Squared_Correlation    =   0.5548
      Log Likelihood Value   = -49.8773
      AIC                    = 107.7546
      AIC_corrected          = 109.6593
      BIC                    = 112.7869

   TESTS OF SEM MODEL:
      TEST       VAL           PROB
      LR          7.374837     0.006614
      Com-LR     18.427035     0.000018

   MORAN z-score and p-val = (0.2741, 0.3920)

   SAC_PERM TEST (N = 999)
      INDEX      VALUE      SIGNIF
      Moran     -0.0252     0.4544
      corr      -0.0445     0.4534
      rho       -0.0784     0.4541

SLM OUTPUT:

   FINAL REGRESSION RESULTS:
      VAR        COEFF         Z-VAL        PROB
      const       7.130157      2.218746    0.026504
      Pale        2.014177      3.471544    0.000517
      Variance = 1.6146

   AUTOCORRELATION RESULTS:
                 VAL           Z-VAL        PROB
      rho        0.726419      6.466525     0.000000

   GOODNESS-OF-FIT RESULTS:
      Extended R-Square      =   0.7335
      Extended R-Square Adj  =   0.7224
      Squared_Correlation    =   0.7512
      Log Likelihood Value   = -45.6632
      AIC                    =  99.3263
      AIC_corrected          = 101.2311
      BIC                    = 104.3587

   TEST OF SLM MODEL:
      TEST       VAL           PROB
      LR         15.803078     0.000070

   MORAN z-score and p-val = (-0.7550, 0.7749)

   SAC_PERM TEST (N = 999)
      INDEX      VALUE      SIGNIF
      Moran     -0.1734     0.8135
      corr      -0.3110     0.8097
      rho       -0.5579     0.8086

Figure 7.7. Regression Results and Autocorrelation Tests for SEM and SLM

and,

(7.5.5)    β̂_SEM = (X'V_ρ̂⁻¹X)⁻¹X'V_ρ̂⁻¹y

In contrast to OLS, the beta estimates for SEM are thus seen to depend on the estimated level of spatial autocorrelation, ρ̂, together with the choice of spatial weights matrix, W, implicit in V_ρ̂. So while in theory such estimates are still unbiased [recall expression (7.1.26) in Part II], their sensitivity to ρ̂ tends to inflate the variance of these β estimates.

This can be seen in part by considering the standard errors of the estimated Pale parameter, β̂₁, for both OLS and SEM. To do so, recall first from Table 7.1 that the standard error for β̂₁ under OLS was given by,

(7.5.6)    s_OLS(β̂₁) = 0.7775

To derive the comparable standard error under SEM, we begin by noting that the appropriate “Z-VAL” for β̂₁ in Figure 7.7 is given [in a manner analogous to expression (7.3.26) in Part II] by

(7.5.7)    z_β̂₁ = β̂₁ / s_β̂₁ ,

so that the estimated standard error for β̂₁ under SEM is given from Figure 7.7 by,

(7.5.8)    s_SEM(β̂₁) = β̂₁ / z_β̂₁ = 1.553209 / 1.757660 = 0.88368

This shows that standard errors of beta estimates do indeed tend to be larger in the presence of spatial autocorrelation.

Before turning to the SL-model, it is important to note that while the estimated spatial autocorrelation level, ρ̂, for this SE-model is significantly positive, it is not evident that ρ̂ has successfully eliminated all spatial autocorrelation effects found for weight matrix, W, in Table 7.2. To address this issue, we may again appeal to the results developed for all GLS models in expressions (7.1.18) and (7.1.19) in Part II, which show that if the spatial covariance structure, V_ρ, [in (7.3.2) and (7.3.3)] has been correctly estimated, then the Cholesky reduction of this model to OLS form should yield residuals that exhibit no significant spatial autocorrelation (with respect to W). In the present case, however, there is no need for Cholesky decompositions, since V_ρ in (7.3.2) is already factorized in terms


of B_ρ⁻¹. In fact the reduction of SEM to an OLS form can be made even more transparent by simply recalling from expression (6.1.9) that

(7.5.9)    Y = Xβ + B_ρ⁻¹ε ,   ε ~ N(0, σ²I_n)
        ⇒  B_ρY = B_ρXβ + ε ,   ε ~ N(0, σ²I_n)
        ⇒  Ỹ_ρ = X̃_ρβ + ε ,   ε ~ N(0, σ²I_n)

where Ỹ_ρ = B_ρY and X̃_ρ = B_ρX. So to test the success of this SE-model it suffices to analyze the residuals:

(7.5.10)   ε̂ = Ỹ_ρ̂ - X̃_ρ̂ β̂_SEM

of the estimated OLS model in (7.5.9), by again using sac_perm.m. Since this procedure
is detailed in part (c) of Assignment 7, it suffices here to observe that the full command
for sem.m in Section 7.5.2 above is of the form:

>> [OUT,cov,DAT] = sem(y,X,W,‘Pale’);

where the matrix OUT contains a number of useful transformations of the regression
outputs. In particular, the residuals in (7.5.10) are contained in the third column, so that
the command,

>> res_SEM = OUT(:,3);

produces a copy, res_SEM, of these residuals that can be tested using sac_perm as
follows:

>> sac_perm(res_SEM,W,999);

The results of this test are shown in the lower left panel of Figure 7.7, and confirm that
this application of SEM has indeed been successful in removing the spatial
autocorrelation found under weight matrix, W.

Turning next to the SL-model, the most important difference to notice here is that while the Pale effect on Blood Group A is again positive, it is now vastly more significant than for the SE-model, with p-value = 0.0005. Moreover, by substituting the maximum-likelihood estimates (β̂, σ̂², ρ̂) for each model into their respective log-likelihood functions in (7.3.4) and (7.4.2), we obtain maximum log-likelihood values for SEM and SLM that constitute one possible measure of their goodness of fit to this Eire data (see Section 9 below for a more detailed discussion of goodness-of-fit measures). As seen in the GOODNESS-OF-FIT section for each model in Figure 7.7, these values are given respectively by,


(7.5.11)   L_SEM(β̂, σ̂², ρ̂) = -49.8773

and

(7.5.12)   L_SLM(β̂, σ̂², ρ̂) = -45.6632

So in terms of this likelihood comparison, it is clear that SLM also yields a much better fit to the Eire data than SEM (i.e., a much higher log-likelihood value).

This raises the natural question as to why SLM is so much more successful in capturing this spatial pattern of Blood Group A levels in Eire. Interestingly enough, the answer appears to lie in the ripple effect underlying the spatial autoregressive multiplier matrix, B_ρ⁻¹ = (I_n - ρW)⁻¹, for these models, as detailed in Section 3.3 above. The key point here is that while this ripple effect applies only to unobserved residuals in the SE-model, it also applies to the explanatory variables in the SL-model, as is evident in expression (6.2.4) above. More specifically, since our present weight matrix, W, in expression (7.5.3) is row normalized, it follows from expression (2.1.19) above that

(7.5.13)   W 1_n = ( Σ_j w_1j ,.., Σ_j w_nj )' = (1,..,1)' = 1_n

which in turn shows that

(7.5.14)   B_ρ 1_n = (I_n - ρW)1_n = 1_n - ρW 1_n = (1 - ρ)1_n
        ⇒  1_n = B_ρ⁻¹(B_ρ 1_n) = B_ρ⁻¹[(1 - ρ)1_n] = (1 - ρ)B_ρ⁻¹1_n
        ⇒  B_ρ⁻¹1_n = (1 - ρ)⁻¹ 1_n

So in the present case, expression (6.2.4) for SL-models now takes the form:

(7.5.15)   Y = B_ρ⁻¹[1_n, x](β₀, β₁)' + B_ρ⁻¹ε = B_ρ⁻¹1_n β₀ + B_ρ⁻¹x β₁ + B_ρ⁻¹ε
             = 1_n [β₀/(1 - ρ)] + (B_ρ⁻¹x) β₁ + B_ρ⁻¹ε
        ⇒  Y = β̃₀ 1_n + x̃ β₁ + B_ρ⁻¹ε


where β̃₀ = β₀/(1 - ρ) and x̃ = B_ρ⁻¹x. But since β̃₀ is essentially independent of ρ for estimation purposes (i.e., β̃₀ can assume any value given appropriate choices of β₀), it follows that the only difference between the SL-model (7.5.15) and the SE-model in (7.5.9) is that the Pale data vector, x, has now been transformed to x̃. Moreover, recalling from expression (3.3.8) that x̃ can be written as

(7.5.16)   x̃ = (I_n - ρW)⁻¹x = x + ρWx + ρ²W²x + ⋯

it is natural to designate this transformed vector as the rippled Pale.
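
As a quick illustration, this rippled Pale vector is easily computed in MATLAB directly from the SLM estimate of ρ (a sketch only, with x, W and rhohat assumed to be in the workspace):

   n        = length(x);
   x_ripple = (eye(n) - rhohat*W) \ x;     % x~ = inv(I - rhohat*W)*x, by a direct solve
   % the partial sums of the series in (7.5.16), x + rhohat*(W*x) + rhohat^2*(W*(W*x)) + ...,
   % converge to this same vector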

With these preliminary observations, it should now be clear that the relative success of the SL-model versus the SE-model in this Eire case can be attributed entirely to this rippled Pale effect. The dramatic nature of this effect in the Eire case is illustrated in Figure 7.8 below, where values of the rippled Pale are plotted on the far right (and where the maximum and minimum values of the rippled Pale have been rescaled to be the same as those of Blood Group A). Further reflection suggests that this remarkable fit may not be simply a coincidence. Indeed, the gradual intermingling of blood-group types between Anglo-Normans and the indigenous Eire population might well be viewed as a “rippling” of intermarriage effects over many generations.

[Map panels: Blood Group A, Original Pale, Rippled Pale]

Figure 7.8. Comparison of Pale Effects and Rippled Pale Effects

With this qualitative overview of SEM and SLM applications to Eire, we turn now to a
more detailed development of the many diagnostics displayed in Figure 7.7. To do so, we
start in Section 8 below with a development of the fundamental significance tests for
model parameters.


8. Parameter Significance Tests for Spatial Regression

Before developing significance tests of parameters for spatial regression, it is appropriate to begin by stating a few general properties of maximum-likelihood estimators that will be crucial for the analysis below. (These properties are developed in more detail in Section ?? of the Appendix.) Here we employ the following notational conventions. First, observe that since both log-likelihood functions and sampling distributions of maximum-likelihood estimators depend on the given sample size, n, we now make this explicit by writing L_n and θ̂_n, and replace expression (7.1.9) with the more sample-explicit form:

(8.1)      L_n(θ̂_n | y) = max_θ L_n(θ | y)

In addition, note that the symbol, θ, in (8.1) is treated as a variable which denotes possible parameter values. The desired estimator, θ̂_n, is then distinguished as the value of θ that maximizes the log-likelihood function, L_n(θ). But it is also important to distinguish the true value of θ, which we now denote by θ₀. In particular, note that all distributional properties of the random vector, θ̂_n, will necessarily depend on the true distribution of y, say with density f(y | θ₀).

In these terms, the single most important property of maximum-likelihood estimators is their consistency, namely that for sufficiently large sample sizes, the estimator, θ̂_n, is very likely to be close to the true value, θ₀. More precisely, as n becomes large, the chance of θ̂_n being further from θ₀ than any arbitrarily small amount, ε, shrinks to zero, i.e.,

(8.2)      lim_{n→∞} Pr( ||θ̂_n - θ₀|| > ε ) = 0   for all ε > 0

This is expressed more compactly by saying that θ̂_n converges in probability to θ₀, and is written as

(8.3)      θ̂_n →_prob θ₀

This consistency property ensures that given enough sample information, maximum-
likelihood estimators will eventually “learn” the true values of parameters. Without such
a guarantee, it is hard to consider any estimator as being statistically reliable.

The single most useful tool for establishing such consistency results is the classical Law
of Large Numbers (LLN), which states that for any sequence of independently and
identically distributed (iid) random variables, ( X 1 ,.., X n ) , from a statistical population, X,


with mean, E(X) = μ, the sample mean, X̄_n, converges in probability to this population mean as n increases, i.e.,

(8.4)      X̄_n = (1/n) Σ_{i=1}^n X_i →_prob E(X) = μ

Since this law is one of the two most important results in statistics (and will be used
several times below), it is worth pointing out that unlike the other major result, namely
the Central Limit Theorem, assertion (8.4) is obtainable by elementary means that are
completely intuitive. To do so, recall first from expression (3.1.18) of Part II that we have
already shown that X̄_n is an unbiased estimator of μ, i.e., that for all n,

(8.5)      E(X̄_n) = μ

Moreover, if we let var(X) = σ², then it was shown in expression (3.1.19) of Part II that for all n,

(8.6)      var(X̄_n) = E[(X̄_n - μ)²] = σ²/n

In particular, this implies that the expected squared deviation, E[(X̄_n - μ)²], of X̄_n from μ must shrink to zero as n becomes large. But since the mean (center of mass) of X̄_n is always the same, namely at μ, this implies that the probability distribution of X̄_n must eventually concentrate around μ, as shown schematically in Figure 8.1 below:

[Figure: densities of the sample means X̄₂ and X̄₂₀, with the latter concentrated much more tightly around μ]

Figure 8.1. Law of Large Numbers

Here sample sizes n = 2, 20 are shown with σ² = 1, so that the respective sample-mean variances are given by 1/2 and 1/20.1 For the particular epsilon interval [μ - ε, μ + ε]

1
The densities plotted here are for X ~ N(μ, 1).


shown, it is clear that almost all of the probability mass for X̄₂₀ is already inside this interval. So even without a formal proof, it should be clear that X̄_n must converge in probability to μ.2
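
This concentration is also easy to see by simulation. The following MATLAB sketch (purely illustrative; here μ = 1 and σ = 1) generates many sample means for n = 2 and n = 20 and compares their variances with (8.6):

   mu = 1;  sigma = 1;  reps = 10000;
   xbar2  = mean(mu + sigma*randn(2, reps));    % 10000 sample means with n = 2
   xbar20 = mean(mu + sigma*randn(20,reps));    % 10000 sample means with n = 20
   fprintf('var of sample means:  n = 2: %.3f   n = 20: %.3f\n', var(xbar2), var(xbar20));
   % the two reported variances should be close to 1/2 and 1/20, as in (8.6)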

Given the general consistency property in (8.3), the second major property of maximum-likelihood estimators is that their sampling distributions are always asymptotically normal with means given by the true parameter values. This can be expressed somewhat more formally (in a manner analogous to the Central Limit Theorems in Section 3 of Part II) by asserting that for sufficiently large sample sizes, n,

(8.7)      θ̂_n ~_d N[θ₀, cov(θ̂_n)]

where the relevant covariance matrix, cov(θ̂_n), is here left undefined, and will be developed in more detail below. Note in particular from (8.7) that θ̂_n is always an asymptotically unbiased estimator of θ₀, i.e., that 3

(8.8)      lim_{n→∞} E(θ̂_n) = θ₀

It is these asymptotic properties that make it possible to construct approximate significance tests for parameters even without knowing the exact distributions of maximum-likelihood estimators.

With respect to significance tests for SEM and SLM in particular, recall that all such tests
in Figure 7.7 above use z-values [as in expression (7.5.7) above] rather than the standard
t-values used for parameter significance tests in OLS models [as in expression (7.3.26) of
Part II]. The reason for this is that even though we are assuming multi-normally
distributed errors in both SEM and SLM, the exact distributions of estimators (β̂, σ̂², ρ̂)
for these models are not necessarily normal, or even expressible in closed form. So we
must appeal to the asymptotic normality of such estimators to carry out significance tests,
and it is for this reason that z-values are used. (See Section 8.4.1 below for further
discussion of z-values versus t-values).

It should be evident here that (with the notable exception of the Central Limit Theorems
developed in Section 3 of Part II) the present asymptotic analysis is the most technically
challenging material in this NOTEBOOK. In view of this, our present objective is simply
to illustrate these results by examples where these general asymptotic properties reduce to
more familiar results obtainable by elementary means. We start with the classic example

2
A formal proof amounts simply to Chebyshev's Inequality, which shows in the present case that for any k > 0, Pr( |X̄_n - μ| ≥ kσ/√n ) ≤ 1/k². So as long as k increases more slowly than √n, both kσ/√n and 1/k² can be made arbitrarily small.
3
It might seem obvious from (8.3) that condition (8.8) should hold. But in fact these two conditions are
generally quite independent (i.e., each can hold without the other).


of estimating the mean of a univariate normal random variable in Section 8.1 below, and
then proceed to a multivariate example in Section 8.2. This second example involves the
General Linear Model, and will provide a useful conceptual framework for the SEM and
SLM results to follow.

8.1 A Basic Example of Maximum Likelihood Estimation and Inference

To illustrate the general methods of parameter inference for maximum likelihood estimation it is instructive to begin with the single-parameter case of a normally distributed random variable, Y ~ N(μ, σ²), with known variance, σ², but with unknown mean, μ.4 For a given random sample, y = (y₁,..,y_n), the log likelihood of μ given y then takes the familiar form [recall expression (3.2.6) and (3.2.7) in Part II]:

(8.1.1)    L_n(μ | y, σ²) = log Π_{i=1}^n f(y_i | μ) = log Π_{i=1}^n { (1/(σ√(2π))) exp[ -(1/2)((y_i - μ)/σ)² ] }
                          = Σ_{i=1}^n { log(1/(σ√(2π))) - (1/2)((y_i - μ)/σ)² }
                          = -n log(σ√(2π)) - (1/(2σ²)) Σ_{i=1}^n (y_i - μ)²

If we now use the simplifying notation L_n' for first derivatives,5 and solve the usual first-order condition for a maximum with respect to μ, we see that

(8.1.2)    0 = (d/dμ) L_n(μ | y) = L_n'(μ | y) = (1/σ²) Σ_{i=1}^n (y_i - μ) = (1/σ²) Σ_{i=1}^n y_i - (n/σ²) μ
        ⇒  Σ_{i=1}^n y_i - nμ = 0  ⇒  μ̂_n = (1/n) Σ_{i=1}^n y_i = ȳ_n

and thus that μ̂_n is precisely the sample mean, ȳ_n. So the main advantage of this example is that the sampling distribution of this particular maximum-likelihood estimator is obtainable by elementary methods.
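
This closed-form solution is easily checked numerically. The following MATLAB sketch (illustrative only) maximizes the log-likelihood in (8.1.1) by a one-dimensional search and compares the result with the sample mean:

   sigma = 1;  y = 1 + sigma*randn(50,1);              % simulated sample with true mean 1
   negL  = @(m) -( -length(y)*log(sigma*sqrt(2*pi)) ...
                   - sum((y - m).^2)/(2*sigma^2) );    % negative of (8.1.1)
   muhat = fminbnd(negL, -10, 10);
   fprintf('muhat = %.6f   sample mean = %.6f\n', muhat, mean(y));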

4
In fact, this is one of the prime examples used by early contributors to Maximum Likelihood Estimation,
including Gauss (1896) and Edgeworth (1908), as well as in the subsequent definitive work of Fisher
(1922). For an interesting discussion of these early developments see Hald, A (1999) “On the History of
Maximum Likelihood in Relation to Inverse Probability and Least Squares”, Statistical Science, 14: 214-
222
5
Be careful not to confuse this use of primes with that of vector and matrix transposes, like A'.


8.1.1 Sampling Distribution by Elementary Methods

Note first that consistency of this estimator is precisely the Law of Large Numbers in
(8.4) with X replaced by the random variable Y in this case. As for the asymptotic
normality condition in (8.7), we have a much sharper result for the sample mean. In
particular, it follows as a very special case of the Linear Invariance property (Section
3.2.2 of Part II) of the multi-normal random vector, (Y₁,..,Y_n), that the sample mean, Ȳ_n = Σ_{i=1}^n (1/n)Y_i, is exactly normally distributed. In particular, if the true mean of Y is E(Y) = μ₀, so that Y ~ N(μ₀, σ²), then by linear invariance we obtain the exact sampling distribution of μ̂_n,

(8.1.3)    μ̂_n = Ȳ_n ~ N(μ₀, σ²/n)

8.1.2 Sampling Distribution by General Maximum-Likelihood Methods

Given these well-known results for μ̂_n, we now consider how they would be obtained within the general theory of maximum-likelihood estimation. In the present case, the general asymptotic normality result in (8.7) asserts that

(8.1.4)    μ̂_n ~_d N[μ₀, var(μ̂_n)]

which is clearly consistent with (8.1.3). We leave the derivation of the general asymptotic normality result for the Appendix, and focus here on the large-sample variance, var(μ̂_n), which remains to be determined. Thus our primary objective is to show that the value of var(μ̂_n) determined by the general theory is precisely σ²/n. In doing so, we shall also illustrate the general strategy for analyzing the large-sample properties of maximum-likelihood estimators. This will not only yield an asymptotic approximation to the variance of such estimators, but will also show why they are consistent.

The key observation to be made here is that by replacing data values, y_i, with their associated random variables, Y_i, the log-likelihood function in (8.1.1) can be viewed as a sum of iid random variables, X_i(μ) = log f(Y_i | μ),

(8.1.5)    L_n(μ | Y₁,..,Y_n) = Σ_{i=1}^n log f(Y_i | μ) = Σ_{i=1}^n X_i(μ)

[where we now suppress the given parameter, σ², except when needed]. So if we divide both sides by n and let L̄_n = (1/n)L_n, then this is seen to be the sample mean, X̄_n(μ), for a sample of size n from the random variable, X(μ) = log f(Y | μ), i.e.,


(8.1.6)    L̄_n(μ | Y₁,..,Y_n) = (1/n) Σ_{i=1}^n log f(Y_i | μ) = (1/n) Σ_{i=1}^n X_i(μ) = X̄_n(μ)

Thus, if we now denote the common mean of these random variables by

(8.1.7)    L̄(μ) = E[X(μ)] = E[log f(Y | μ)]

[where the expectation is with respect to Y ~ N(μ₀, σ²)], then it follows from the LLN that L̄_n(μ | Y₁,..,Y_n) converges in probability to this mean, i.e.,

(8.1.8)    L̄_n(μ | Y₁,..,Y_n) →_prob L̄(μ)

Notice also that since 1/n is simply a positive constant, this transformation of L_n has no effect on maxima. So the maximum-likelihood estimator, μ̂_n, for sample data, (y₁,..,y_n), must still be given by

(8.1.9)    μ̂_n = argmax_μ L̄_n(μ | y₁,..,y_n)

For purposes of analysis, this scaled version of L_n thus constitutes a perfectly good “likelihood” function, and will be treated as such. In these terms, the LLN ensures that the likelihood functions, L̄_n(· | y₁,..,y_n), must have a unique limiting form, L̄(·), given by (8.1.8), which may be designated as the limiting likelihood function. This implies that essentially all large-sample properties of maximum-likelihood estimators can be studied in terms of this limiting form, and in particular, that the large-sample distribution of μ̂_n can be obtained.

In the present case, we can learn a great deal by simply computing this limiting likelihood function. To do so, recall from (8.1.1) and (8.1.7) that

(8.1.10)   L̄(μ) = E[log f(Y | μ)] = E[ log(1/(σ√(2π))) - (1/2)((Y - μ)/σ)² ]
                = log(1/(σ√(2π))) - (1/(2σ²)) E[(Y - μ)²]
                = -log(σ√(2π)) - (1/(2σ²)) E[Y² - 2μY + μ²]
                = -log(σ√(2π)) - (1/(2σ²)) [E(Y²) - 2μ E(Y) + μ²]

But since E(Y) = μ₀ and E(Y²) = var(Y) + [E(Y)]² = σ² + μ₀², it then follows that


(8.1.11)   L̄(μ) = -log(σ√(2π)) - (1/(2σ²))(σ² + μ₀² - 2μμ₀ + μ²)
                = -[log(σ√(2π)) + 1/2] - (1/(2σ²))(μ₀² - 2μμ₀ + μ²)
                = c_σ - (1/(2σ²))(μ₀ - μ)²

where c_σ = -[1/2 + log(σ√(2π))] is a constant depending only on σ. So we see that in the present case, L̄ is a simple quadratic function, as shown by the solid black curve in Figure 8.2 below, where we have used the parameter values, μ₀ = 1 and σ² = 1.

[Figure panels: the limiting likelihood L̄ with an ε-band about it (left); one realized likelihood L̄_n lying inside this band, with its maximum μ̂_n contained in the interval [μ̂_min, μ̂_max] around μ₀ (right)]

Figure 8.2. Limit Curve and ε-Band
Figure 8.3. Estimate Interval for ε-Band

Notice also that this limiting function achieves its maximum at precisely the true mean
value, 0 , which can be seen from the following first order condition (that is shown in
the Appendix to hold for all limiting likelihood functions):

(8.1.12)   L̄'(μ) = (1/σ²)(μ₀ - μ)  ⇒  L̄'(μ₀) = 0

Next observe from expression (8.1.8) that for any given value of μ on the horizontal axis, the likelihood values, L̄_n(μ | y₁,..,y_n), should eventually be very close to the limiting likelihood value, L̄(μ), for all sufficiently large data samples, (y₁,..,y_n). However, this does not imply that the entire function, L̄_n(· | y₁,..,y_n), will be close to the limiting function, L̄(·). Here one requires a “uniform” version of probabilistic convergence (as detailed in the Appendix). For the present, it suffices to say that under mild regularity conditions, one can ensure uniform convergence in probability on any given interval containing the true mean, μ₀, such as the interval, I = [-0.5, 2.5], about μ₀ = 1 shown in

Figure 8.2. What this means is that as sample sizes increase, realized likelihood functions, L̄_n(· | y₁,..,y_n), will eventually be contained in any given ε-band on interval I (such as the one shown) with probability approaching one. One such realization, L̄_n, is shown (schematically) by the blue curve in Figure 8.3,6 with corresponding maximum-likelihood estimate, μ̂_n, also shown.

Consistency of μ̂_n

This convergence property of likelihood functions, L̄_n, in turn implies consistency of their associated maximum-likelihood estimates, μ̂_n. To see this, note that in order to stay inside this ε-band, each function, L̄_n, evaluated at μ₀ must achieve a value, L̄_n(μ₀), in the interval of values [L̄(μ₀) - ε, L̄(μ₀) + ε] shown on the left in Figure 8.2. Thus the maximum value of L̄_n must be at least L̄(μ₀) - ε, which means that this maximum (tangency) point on L̄_n must lie somewhere above the horizontal red line shown in Figure 8.3, as illustrated by the example in the figure. But since this maximum is by definition achieved at L̄_n(μ̂_n), this in turn implies that μ̂_n must lie somewhere in the corresponding ε-containment interval, [μ̂_min, μ̂_max], of μ-values on the horizontal axis. Finally, observe that as ε-bands are chosen to be smaller, their corresponding ε-containment intervals must eventually shrink to the single value, μ₀. So for sufficiently large sample sizes, n, we see that maximum-likelihood estimates, μ̂_n, must eventually be arbitrarily close to μ₀ (with probability approaching one), and thus that such estimators satisfy consistency. While this consistency argument is certainly more complex than the direct appeal to the Law of Large Numbers for the simple case of μ̂_n = Ȳ_n, it serves to illustrate the approach used for all maximum-likelihood estimators. Moreover, it helps to provide some geometric intuition for the large-sample variance of such estimators, to which we now turn.

Large-Sample Variance of μ̂_n

First observe that by taking the second derivative, L̄'' = (d/dμ)L̄', of the limiting likelihood function and evaluating this at μ₀, we see from (8.1.12) that

(8.1.13)   L̄''(μ) = (d/dμ) L̄'(μ) = -1/σ²  ⇒  L̄''(μ₀) = -1/σ²

But for sufficiently large sample sizes, n, the scaled likelihood functions, L̄_n, were seen to be uniformly close to L̄ in the neighborhood of μ₀, so that their shapes should be

6
Here it is worth noting from expression (8.1.1) that like the limiting curve, L̄, all such realizations, L̄_n, in the present case must be smooth quadratic functions (such as the one shown).


similar to L̄ in this neighborhood. Thus it is reasonable to expect that (8.1.13) should hold approximately for such functions, i.e., that

(8.1.14)   L̄_n''(μ₀) ≈ L̄''(μ₀) = -1/σ²

But this in turn implies from the definition of L̄_n that the original log-likelihood functions, L_n, must satisfy:

(8.1.15)   L̄_n(μ₀) = (1/n) L_n(μ₀)  ⇒  L_n''(μ₀) = n L̄_n''(μ₀) ≈ -n/σ²

By inverting this expression and multiplying by -1, we see that

(8.1.16)   -L_n''(μ₀)⁻¹ ≈ σ²/n

Finally, since we happen to know that the right hand side is precisely the variance of μ̂_n = Ȳ_n, we see that this variance is well approximated by the negative inverse of the second derivative of the log-likelihood function, L_n, evaluated at the true mean, μ₀, i.e.,

(8.1.17)   var(μ̂_n) ≈ -L_n''(μ₀)⁻¹

While all of this might seem to be purely coincidental, it is shown in the Appendix that this relation is always true for maximum-likelihood estimators. More importantly for our present purposes, this geometric argument actually suggests why this should be so. To begin with, note that while the first derivative, L̄'(μ), of the limiting likelihood function reveals its slope at each point, μ, the second derivative, L̄''(μ), reveals its curvature, i.e., the rate of change of slope at μ. So L̄''(μ₀) corresponds geometrically to the curvature of the limiting likelihood function at the true mean, μ₀. With this in mind we now illustrate the effects of such curvature in Figures 8.4 and 8.5 below.

[Figure panels: ε-containment intervals [μ̂_min, μ̂_max] around μ₀ for the two curvature levels described below]

Figure 8.4. Estimate Interval for σ² = 1
Figure 8.5. Estimate Interval for σ² = 1/4

Figure 8.4 simply repeats the relevant features of Figure 8.3. Here we used the variance parameter, σ² = 1, which implies from (8.1.13) that L̄''(μ₀) = -1. Note in particular that this negative sign reflects the concavity of L̄ required for a maximum at μ₀. In Figure 8.5 we have used the same value of ε for the ε-band around the function, L̄, but have now reduced the variance parameter to σ² = 1/4. This in turn is seen to yield more extreme curvature, L̄''(μ₀) = -4, at μ₀. But the key point to notice is that this sharper curvature necessarily compresses the corresponding ε-containment interval that delimits the feasible range of maximum-likelihood estimates, μ̂_n, for large n. By comparing these two figures, one can see that the permissible deviations from μ₀ in Figure 8.5 are only about half those in Figure 8.4 (decreasing from about 0.8 to 0.4). This in turn implies that the permissible squared deviations are only about a quarter as large. Moreover, the constancy of curvature in the present example implies that this same relation must hold for all ε, and thus that the expected squared deviations of μ̂_n should also be about a quarter as large. But this is precisely the relative variance of μ̂_n at each level of curvature. In short, we see that for large samples, n, with log-likelihoods close to the limiting likelihood, the desired variance of μ̂_n is indeed (inversely) proportional to negative curvature, as in (8.1.17).

Finally, it should be noted that while the constancy of curvature in this example makes
such relations easier to see, this is of course a very special case. More generally, all that
can be said is that for sufficiently large samples, n, almost all realizations of ˆ n will be so
close to 0 that curvature can be treated as constant over the relevant range of ˆ n .

Computation of Large-Sample Variance

Since direct computation of the variance in (8.1.17) requires knowledge of 0 , we must


estimate this quantity. But since ˆ n is always a consistent estimator of 0 , it is natural to
use the estimated variance:

(8.1.18)  ˆ )   L( ˆ )1


var( n n n

In practice this can be calculated by numerically approximating the second derivative of


the log-likelihood function at the maximum, i.e., Ln( ˆ n )  Ln( ˆ n | y1 ,.., yn ) . However,
when this second derivative can be computed as an explicit function of the sample data,
( y1 ,.., yn ) , it is often more appropriate to use mean values.

By way of motivation, it should be noted that perhaps the weakest link in the chain of
arguments above was the supposition that curvature of the likelihood function, Ln ( 0 ) , at

________________________________________________________________________
ESE 502 III.8-10 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

the true mean is well approximated by that of the limiting likelihood function, L ( 0 ) ,
i.e., that Ln( 0 )  L( 0 ) , as in expression (8.1.14) above. Since averaging produces
smoothing effects, it should thus be more reasonable to suppose that

(8.1.19) E  Ln( 0 )   E  Ln( 0 | Y1 ,..,Yn )   L( 0 )

and then to approximate the large-sample variance by:7

var( ˆ n )    E[ Ln( 0 )]    E[ Ln( 0 | Y1 ,.., Yn )]


1 1
(8.1.20)

Finally we note that the negative expected curvature value used in (8.1.20) is of much
wider interest, and (in honor of its discoverer) is usually designated as Fisher
information,

(8.1.21)  n ( 0 )   E  Ln( 0 | Y1 ,..,Yn )

In these terms, the variance approximation in (8.1.20) can be rewritten as

(8.1.22) var( ˆ n )  [  n ( 0 )]1

Note in particular that since higher values of  n ( 0 ) mean lower variances of ˆ n and
thus sharper estimates of 0 , this measure does indeed reflect the amount “information”
in Ln about 0 . For computational purposes, we must again substitute ˆ n for 0 , to
obtain the large-sample variance estimate,

(8.1.23)  ˆ )  [  ( ˆ )]1
var( n n n

In subsequent sections it will be shown that explicit expressions for Fisher information
can be obtained for both SEM and SLM. So we shall employ this expectation version of
variance estimates in our analyses of these models. Thus, to avoid any possible confusion
of (8.1.23) with (8.1.18) above, we now follow the standard convention of designating
the more direct estimate of negative curvature in (8.1.18) as observed Fisher information,

(8.1.24)  nobs ( 0 )   Ln( 0 )

and thus redesignate (8.1.18) as observed large-sample variance estimate,

7
As shown in the Appendix, it is this expected-curvature expression that is used in formal convergence
proofs. So while both approximations are used in practice, the main advantage of the (8.1.17) approach is
that it allows the role of geometric curvature to be seen more easily.

________________________________________________________________________
ESE 502 III.8-11 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

(8.1.25)  ( ˆ )  [  obs ( ˆ )]1


var obs n n n

8.2 Sampling Distributions for General Linear Models with Known Covariance

We next develop a multi-parameter example in which the sampling distributions of


parameter estimates can again be obtained by elementary methods. Here we start with
following General Linear Model

(8.2.1) Y  X    ,  ~ N (0,V )

where the covariance matrix, V, is assumed to be known. As in Section 7.2.2, this in turn
implies that

(8.2.2) Y ~ N ( X  ,V )

Before proceeding with this case, it is worth noting that (8.2.2) is in fact a direct
extension of our previous model in Section 8.2.1. In particular, that model is seen to be
the special case in which, V   2 I n with  2 known, and in which X  1n . Thus
 reduces to a single parameter,   (  0 )   , in this case, and we see that:

(8.2.3) Y ~ N (  1n , 2 I n )  Yi ~ N (  , 2 ) , i  1,.., n
iid

So it is perhaps not surprising that the same methods above can be applied to this more
general version.8

8.2.1 Sampling Distribution by Elementary Methods

As mentioned above, the only difference here is that we are now in a multi-parameter
setting where sampling distributions must be obtained for the vector of maximum-
likelihood estimators in (7.2.18) above, i.e., for

(8.2.4) ˆn  ( X V 1 X )1 X V 1Y

But since this is simply a linear transformation of the random vector, Y , we can obtain
the sampling distribution of ˆ by again appealing directly to expression (3.2.2) of the
Linear Invariance Theorem. To do so, note simply that if we let

(8.2.5) A  ( X V 1 X )1 X V 1

so that ˆn  AY , then it follows at once from (3.2.2) together with (8.2.2) above that

8
Here we ignore questions of consistency, which involve a somewhat more complex application of the
Law of Large Numbers [as for example in Theorem 10.2 in Green (2003)].

________________________________________________________________________
ESE 502 III.8-12 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

(8.2.6) ˆn ~ N ( AX  , AVA) ,

and thus that ˆn is exactly multi-normally distributed. Moreover, as we have already
shown in expressions (7.3.21) and (7.3.22) of Part II, ˆn has mean vector

(8.2.7) E ( ˆn )  AX   ( X V 1 X )1 ( X V 1 X )   

and covariance matrix,

(8.2.8) cov( ˆn )  cov ( X V 1 X )1 X V 1 

 ( X V 1 X ) 1 X V 1 cov( )V 1 X ( X V 1 X ) 1

 ( X V 1 X ) 1 X V 1V V 1 X ( X V 1 X ) 1

 ( X V 1 X ) 1 ( X V 1 X ) ( X V 1 X ) 1

 ( X V 1 X )1

so that the exact sampling distribution of ˆn is given by

(8.2.9) ˆn ~ N [  ,( X V 1 X )1 ]

As in the single-parameter case above, this distribution allows us to construct


significance tests for all  j coefficients.

8.2.2 Sampling Distribution by General Maximum-Likelihood Methods


Here we shall focus only on those aspects of the Maximum-Likelihood approach that are
needed for calculating the desired sampling distribution of ˆn , as in (8.2.9) above. Thus,
as a direct extension of expression (8.1.4), we start by assuming that ˆ is both n

asymptotically multi-normal and asymptotically unbiased, i.e.,

(8.2.10) ˆn  d N [  ,cov( ˆn )]

So the main task is to estimate the covariance matrix, cov( ˆn ) , in (8.2.10). To do so,
recall from expression (8.1.17) that the desired asymptotic variance estimate was
obtained in terms of the second derivative of the log-likelihood function evaluated at the
true parameter value. Exactly the same result is true in the multi-parameter case, except
that here we must calculate partial derivatives of the log-likelihood function with respect
to all parameters. The details of partial derivatives for both the scalar and multi-

________________________________________________________________________
ESE 502 III.8-13 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

dimensional case are developed in Sections A2.5 through A2.7 in the Appendix to Part II
(which we here designate as Appendix A2). In the present case, the log-likelihood
function is precisely the same as that in expression (7.2.14) above with  2  1 , i.e.,

(8.2.11) L(  | y )   n2 log(2 )  12 log | V |  12 ( y  X  )V 1 ( y  X  )

As shown for the OLS case in Section A2.7.3 of Appendix A2, maximizing this function
with respect to parameter vector,  , amounts to setting all partial derivatives of L(  | y )
equal to zero, where the vector of partial derivatives is called the gradient vector of
L(  | y ) , and is written as:

 1 L(  | y ) 
 
(8.2.12)   L(  | y )    
  L(  | y ) 
  k 

But since  only appears in the last term of (8.2.11), it follows that this first order
condition for a maximum reduces to:

(8.2.13) 0    L(  | y )    [ 12 ( y  X  )V 1 ( y  X  )]

  12   [ y V 1 y  2 y V 1 X   2   X V 1 X  ]

 X V 1 y  X V 1 X 

where the last line follows from expressions (A2.7.7) and (A2.7.11) of Appendix A2.
Notice that solving this expression for  yields precisely the maximum-likelihood
estimate in (8.2.4) above. But our present interest in the matrix of second partial
derivatives of L at the true value of  , say  0 ,9

  2 L(  0 | y )  2
L(  0 | y ) 
2

1 k
 1 
(8.2.14)   L(  0 | y )    [  L(  | y )]        
0
 2 
  k 1 L(  0 | y )  
L(  0 | y ) 
2

  k2

which (as in Section A2.7 of Appendix A2) is designated as the Hessian matrix for
L(  | y ) evaluated at  0 . So by the last line of (8.2.13) [together with (A2.7.7) in
Appendix A2]10, we see that

9
Note that the intercept coefficient in  is here designated as “  1 ” precisely to avoid notational conflicts
with this true coefficient vector,  0 .
10
In particular, it follows from (A2.7.7) that for any symmetric matrix, B  ( b1,.., bn ) , we must have
 x Bx  (  x b1x ,..,  x bn x )  B   B .

________________________________________________________________________
ESE 502 III.8-14 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

(8.2.15)   L(  0 | y )    [ X V 1 y  X V 1 X  ]   0

    ( X V 1 X  )

  X V 1 X

But, if we again replace data vector, y , by its corresponding random vector, Y , and take
expectations (under  0 ) then we see in this case that

(8.2.16) E[  L(  0 | Y )]   X V 1 X

Thus by (8.2.8) this is now seen to imply that

cov( ˆn )  ( X V 1 X )1    E[  L(  0 | Y )]


1
(8.2.17)

So if Fisher information in (8.1.21) is here replaced by the corresponding Fisher


Information matrix,

(8.2.18)  n (  0 )   E[  L(  0 | Y )]

then it follows from (8.2.17) that the covariance of ˆn is precisely the inverse of the
Fisher Information matrix, i.e.,

(8.2.19) cov( ˆn )   n (  0 )1

While this is of course a very special case in which covariance is exactly inverse Fisher
Information, it serves to motivate the general results to follow.

8.3 Asymptotic Sampling Distributions for the General Case

Given this multi-parameter example, it can be shown that the asymptotic sampling
distributions for general maximum likelihood estimates are essentially of the same form.
In particular, if the log-likelihood function for n samples, y  ( y1 ,.., yn ) from a
distribution with k unknown parameters,   (1 ,.., k ) , is denoted by L( | y ) [as in
(7.1.8) above], and if the maximum-likelihood estimator for  is denoted by ˆ [as in
n

(7.1.9), with sample size, n , made explicit], then (under mild regularity conditions) ˆn is
both asymptotically multi-normal and asymptotically unbiased, i.e.,

(8.3.1) ˆn  d N [0 ,cov(ˆn )]

________________________________________________________________________
ESE 502 III.8-15 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

where  0 is the true value of  . So expressions (8.1.4) and (8.2.10) are both seen to be
instances of this general expression. Moreover, the asymptotic covariance matrix,
cov(ˆn ) , takes the same form as expression (8.2.17) through (8.2.19). In particular, if we
again denote the relevant Fisher Information matrix by

(8.3.2)  n (0 )   E[ L(0 | Y )]

then the asymptotic covariance of ˆn is precisely the inverse of the Fisher Information
matrix, i.e.,

(8.3.3) cov(ˆn )   n (0 )1

So in these terms, the explicit form of (8.3.1) is simply

(8.3.4) ˆn  d N [0 ,  n (0 )1 ]

Finally, this distribution is again made operational by appealing to the consistency of ˆn
to replace  with ˆ and write
0 n

(8.3.5) ˆn  d N [0 ,  n (ˆn )1 ]

But taking this approximation to be the relevant sampling distribution for ˆn , one can
proceed with a range of statistical analyses regarding the nature of  0 . But rather than
developing testing procedures within this general framework, it is more convenient to do
so in the specific contexts of SEM and SAR, to which we now turn.

8.4 Parameter Significance Tests for SEM

For the SE-model in (7.3.1) through (7.3.3), it follows at once that the relevant parameter
vector is given by   (  ,  , 2 ) ,11 with likelihood function,

(8.4.1) L( | y )  L(  , 2 ,  | y )

  n2 log(2 )  n2 log( 2 )  12 log | B |  21 2 ( y  X  ) B B ( y  X  )

where B  I n  W . For this model, if we now designate the sum of diagonal elements
of any matrix, A  ( aij ) , as the trace of A, written tr ( A)  i aii , and if (for notational

11
For notational simplicity, we here drop transposes and implicitly assume that both  and  are column
vectors.

________________________________________________________________________
ESE 502 III.8-16 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

simplicity) we let G  WB1  B1W ,12 then it can be shown [see Ord (1975) and
Appendix B in Doreian (1980)] that the expected value of the Hessian matrix for L( | Y )
evaluated at the true value of  is given by 13

 E (  L) E ( 2 L) E (  L) 
 

(8.4.2) E[ L( | Y )]  E ( L)   E ( 2  L) E ( 2 2 L) E ( 2  L) 
 
 E (  L) E ( 2 L) E (  L) 
  

 12 X B B X 0  0
 
  0 n
2 4
1 tr (G )
2  

0 1 tr (G ) tr[G (G  G  )] 

 2     

where for simplicity we now drop the subscripts “0” denoting “true” values. It then
follows at once from (8.3.2) and (8.3.3) that asymptotic covariance matrix of the
maximum-likelihood estimators, ˆ  ( ˆ , ˆ ,ˆ 2 ) ,14 is given by

(8.4.3) cov(ˆ)   n ( )1   E[ L( | Y )]1


1
 12 X B B X 0  0
 
 0 n
2 4
1 tr (G )
2  

0 1 tr (G ) tr[G (G  G  )] 

 2     

Thus the desired asymptotic sampling distribution of ( ˆ ,ˆ 2 , ˆ ) for SEM is given by

 
1

 ˆ       2 X B B X
1 0 0
  
 2  2 
 ˆ  ~ N     ,  0  
n 1 tr (G )
(8.4.4) 2 4 2 
 ˆ      1 tr (G ) tr[G (G  G  )] 

       0   

  
 2 

The last equality follows from the fact that WB  W (  n 0  W )  ( n 0  W )W  B W . Because of
12 1  n n  n n 1

this, G is defined both ways in the literature [compare for example Ord (1975) and Doreian (1980)].
13
Note in the last line of (8.4.2) that 0 denotes a zero vector of the same length as  , together with its
transpose, 0 .
14
Again for notational simplicity, we now drop the sample-size subscripts (n) on estimators.

________________________________________________________________________
ESE 502 III.8-17 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

Before applying this distribution to construct specific tests, it is of interest to notice from
the pattern of zeros inside this covariance matrix that further simplifications are possible
here. In particular this covariance matrix is seen to be the inverse of a block diagonal
matrix. But just as in the case of simple diagonal matrices, matrix multiplication shows
that the inverse of any ( 2  2 ) block diagonal matrix is given by

1
A   A1 
  
 B 
(8.4.5)
  B 1 

so that (8.4.3) can be rewritten as

 2 ( X B B X )1 
 1 
(8.4.6) cov(ˆ)    2n 4 
1 tr (G )
2   
  1  
   2 tr (G ) tr[G (G  G )]  

In particular, this shows that ˆ is uncorrelated with either ˆ 2 or ̂ , so that by the


general properties of multi-normal distributions, we may conclude that ˆ is completely
independent of (ˆ 2 , ˆ ) . This in turn implies that the joint distribution of ( ˆ , ˆ ,ˆ 2 ) in
(8.4.4) can be factored into a product of the marginal distributions for ˆ and (ˆ 2 , ˆ ) .
With respect to ˆ in particular, this marginal distribution is seen to be of the form

(8.4.7) ˆ ~ N [  , 2 ( X B B X )1 ]

But since the covariance expression can be rewritten as

 2 ( X B B X )1   X [ 2 ( B B )1 ]1 X   ( X V1 X )1


1
(8.4.8)

where V   2 ( B B )1 , it follows from expressions (8.2.8) [together with (6.1.6) and
(6.1.7)] that expression (8.4.6) is precisely the instance of the GLS in Section 7.6.2 above
for the case of an SE-model with known spatial dependency parameter given by  . In
other words, the independency property of SEM allows all analyses of ˆ to be carried
out using the GLS model in (8.2.9) for any given values of the other parameters, ( 2 ,  ) .
As we shall see below, this simplification is not true for SLM.

8.4.1 Parametric Tests for SEM

To develop appropriate tests of parameters for SEM, recall that all unknown true
parameter values (  , 2 ,  ) in the above expressions are estimated using ( ˆ ,ˆ 2 , ˆ ) . So
in these terms, the estimated asymptotic covariance for purposes of statistical inference is

________________________________________________________________________
ESE 502 III.8-18 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

obtained by substituting these estimated values ( ˆ ,ˆ 2 , ˆ ) into (8.4.6). For notational
convenience, we denote this estimated covariance matrix by S SEM , which is now seen to
have the block diagonal form:

ˆ 2 ( X Bˆ Bˆ X )1 


  
 
1
(8.4.9) S SEM  n / 2 ˆ 2
tr ( G ˆ )  
 ˆ 4  2  
 ˆ tr (Gˆ ) ˆ tr[Gˆ (Gˆ  Gˆ )]  
4


 s2ˆ  sˆ ˆ 
 0 0 k

    
s 
  ˆk ˆ0  sˆk
2

 s2ˆ 2 sˆ 2 ˆ 
 
 sˆ ˆ 2 sˆ2 
 

Before using these estimates, notice from (8.4.6) that we have here factored out the
quantity, ̂ 4 , in the lower diagonal matrix [see also Ord (1975), expression (19) in
Doreian (1980), and expression (5.48) of Upton and Fingleton (1985) – which is class
reference number 18]. The reason for this can be seen from the first row of the
(unfactored) lower block-diagonal matrix in (8.4.6), which will have all elements close to
zero when variance,  2 , is large. So to avoid possible numerical stability problems when
computing the inverse of this matrix, it is convenient to introduce such a factorization.

To apply these estimates, we first consider the most important set of parameter tests,
namely tests for beta coefficients,  j , j  0,1,.., k , in the linear term, X  (where as
usual, the intercept,  0 , tends to be of less interest than the slope coefficients, 1 ,..,  k ,
for explanatory variables). While such analyses are is conceptually similar to those for
Geo-Regression in Section 7.3.2 above, there are sufficient differences to warrant a more
careful development here. For parameter,  j , it follows at once from (8.4.7) that the
marginal distribution of the estimator, ˆ , must be normal and of the form
j

(8.4.10) ˆ j ~ N [  j , var( ˆ j )] , j  0,1,.., k

where var( ˆ j ) is by definition the j th diagonal element of the covariance matrix,


 2 ( X B B X )1 . Here we have estimated this quantity by diagonal element, s2ˆ , of
j

matrix, S SEM , in (8.4.9). So by using this estimate, (8.4.10) can be rewritten as,

________________________________________________________________________
ESE 502 III.8-19 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

(8.4.11) ˆ j ~ N (  j , s2ˆ ) , j  0,1,.., k


j

This is the operational form of the sampling distribution that we shall employ for testing
the significance of  j . To employ the standard normal tables, one must first standardize
ˆ to obtain the corresponding z-statistic:
j

ˆ j   j
(8.4.12) zˆ  ~ N (0,1), j  0,1.., k
j
sˆ
j

where sˆ  s2ˆ denotes the standard deviation of ˆ j . So under the null hypothesis,
j j

 j  0 , it follows from (8.4.12) that the z-score, z j  ˆ j / sˆ , must be standard normal,
j

i.e., that

ˆ j
(8.4.13) zj  ~ N (0,1) , j  0,1,.., k
sˆ
j

15
Finally, if the observed z-score value is denoted by z obs
j , then the p-value for this (two-
sided) test is thus given by

p j  Pr | z j |  | z obs
j |
pj / 2 pj / 2
(8.4.14)
obs
| zj |

We shall illustrate this test for the Eire example below. But before doing so, it is
important to point out that this test treats s2ˆ in (8.4.11) as a “known” quantity, and in
j

particular ignores all variation in this estimator. But in OLS, for example, this estimator
can be shown to be both chi-square distributed and independent of ˆ j , so that the ratio,
ˆ / s 2 , is t-distributed. Thus, the relevant tests of  coefficients in OLS are t-tests. But
j ˆ j j

in more general settings such as SEM, the distribution of ˆ j / s2ˆ is unknown. So the
j

2
standard “fall back” position is to treat s as a constant (by appealing implicitly to its
ˆ j

large-sample consistency property), and to employ the normal distribution in (8.4.11) for
testing purposes.

The consequence of this convention is to inflate significance levels (i.e., reduce p-values)
to some degree. In the OLS case, this can be seen by noting that t-distributions have fatter

15
Here “observed” means the actual value calculated by maximum-likelihood estimation.

________________________________________________________________________
ESE 502 III.8-20 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

tails than the standard normal distribution, thus increasing the p-value in (8.4.14) for any
given observed value, z obsj . Because of this, some analysts prefer to use a more

conservative t-test based on the number of parameters in the model (as in the OLS case).
In particular, if we now re-designate the ratio in (8.4.13) as a pseudo t-statistic,

ˆ j
(8.4.15) tj 
sˆ
j

and let Tv denote the t-distribution with v degrees of freedom, then [following Davidson
and MacKinnon (1993, Section 3.6)] one can construct a corresponding pseudo t-test of
the null hypothesis,  j  0 , by assuming that t j is t-distributed with degrees of freedom
equal to n minus the number of model parameters. In this case, the relevant parameter
vector (  , 2 ,  ) is of length, k  3 , so that16

ˆ j
(8.4.16) tj  ~ Tn ( k 3)
sˆ
j

The appropriate p-value for this test is then given by the probability in (8.4.14) with
respect to the t-distribution in (8.4.16), and will be dented by p tj pseudo . While these values
are not reported in the screen output of sem.m or slm.m (as for the Eire example in
Figure 7.7), we shall report them here just to illustrate the types of significance inflation
that can occur.

But before turning to the Eire example, we first construct a test of the other key
parameter of the SE-model, namely the spatial dependence parameter,  .17 Here it is
important to note that unlike the simple (and appealing) rho statistic, ˆW , in Section 4.1.1
above, which unfortunately yields an inconsistent estimate of  , the present maximum-
likelihood estimator, ̂ , of  is consistent (assuming of course that the SE-model is
correctly specified). So from a theoretical perspective, formal hypothesis tests based on
this estimator are of great interest. Here the same testing procedure for  j coefficients
can be applied with the appropriate changes. First, it again follows from (8.4.4) that the
estimator, ̂ , is normally distributed, so that as a parallel to (8.4.11), we now have

(8.4.17) ˆ ~ N   , s2ˆ 

16
Given that ˆ in (7.3.11) is functionally independent of  , one could in principle use v  n  ( k  2) in
2

(8.4.16). However, we here adopt the (conservative) approach of using all parameters to calculate v.
17
Note that while the error variance,  , is also a model parameter, and indeed is also asymptotically
2

normally distributed by (8.4.4), one is rarely interested in testing specific hypotheses about  . So
2

following standard convention, we simply report the estimated value, ˆ , in screen outputs like Figure 7.7.
2

________________________________________________________________________
ESE 502 III.8-21 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

where sˆ  s2ˆ . Moreover, since the natural null hypothesis for this parameter is again,
  0 (here denoting the absence of spatial autocorrelation), we have the corresponding
z-score,
ˆ
(8.4.18) zˆ  ~ N (0,1)
sˆ

under this hypothesis. So the presence of non-zero spatial autocorrelation (either positive
or negative) is now gauged by the two-sided p-value,

(8.4.19) pˆ  Pr | z ˆ |  | z obs


ˆ |

where z obs
ˆ is again the observed z-score value. While this is always the default test
employed in spatial regression software, it should be noted that (as in Section 4 above) a
one-sided test for positive spatial autocorrelation is generally of more relevance. But we
choose to employ the (more conservative) two-sided test to maintain comparability with
other software. Finally, as with tests of  coefficients above, we shall also report the p-
value, ptˆ pseudo , for the corresponding pseudo t-test that captures at least some of the
statistical variation in the estimator, ̂ .

8.4.2 Application to the Irish Blood Group Data

To apply these results to the Eire case, we first note that the estimated covariance matrix,
S SEM , in (8.4.9) is one of the outputs of sem.m, denoted by cov. In the present case, this
matrix for the Eire data take the form in Figure 8.6 below, where each row and column is
labeled by its corresponding parameter estimator ( ˆ0 , ˆ1 ,ˆ 2 , ˆ ) :

ˆ0 ˆ1 ˆ 2 ̂
ˆ0 1.9464 ‐ 0.3061 0 0

ˆ1 ‐ 0.3061 0.7809 0 0

̂ 2 0 0 0.3874 ‐ 0.0211

̂ 0 0 ‐ 0.0211 0.0112

Figure 8.6. SEM Parameter Covariance Matrix

This clearly illustrates the block diagonal structure of S SEM . To relate this covariance
matrix to the SEM results in Figure 7.7, we focus on the important “Pale” coefficient, 1 .
By recalling that the standard error of ˆ in (8.4.12) is given from Figure 8.6 by
1

________________________________________________________________________
ESE 502 III.8-22 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

(8.4.20) sˆ  s2ˆ  0.7809  0.8837


1 1

we see that the z-score for a two-sided test of the hypothesis, 1  0 , is given by18

ˆ1 1.5532
(8.4.21) z1    1.7577
sˆ .8837
1

as in Figure 7.7. Finally, the desired p-value for this test is given by

(8.4.22) p1  Pr(| z1 |  1.7577)  2  ( 1.7577)  0.0788

as in Figure 7.7. To compare this results with the corresponding pseudo t-test for 1 ,
observe first that in this case there are 4 parameters, (  0 , 1 , 2 ,  ) , so that the for the
n  26 counties in Eire, the appropriate two-sided p-value is calculated with respect to a
t-distribution with v  26  4  22 degrees of freedom, yielding

(8.4.23) p1t  pseudo  0.0927

This is still weakly significant (  .10 ), but is noticeably less significant than the result in
(8.4.22) based on the normal distribution. However, it should be noted that the sample
size, n  26 , in this Eire example is quite small. So in larger sample sizes, where the tails
of Tn( k 3) are much closer to those of N (0,1) , this difference will be far less noticeable.

Finally, for completeness, we also calculate the corresponding tests for the spatial
dependency parameter,  . As in (8.4.21), the relevant z-score in (8.4.18) is seen from
Figures 7.7 and 8.6 to be

ˆ 0.7885
(8.4.24) zˆ    7.467
sˆ 0.0112

with corresponding p-values for the z-test and pseudo t-test given by:

(8.4.25) pˆ  Pr(| z ˆ |  7.467)  8.2  1014 and ptˆ pseudo  1.82  107

So while the pseudo t-test again yields a somewhat weaker result, both p-values are
vanishingly small,19 and confirm that spatial autocorrelation in this model is strongly
present.

18
Note that since all of the following calculation examples are done to a much higher degree of precision
than the numbers shown, the results on the right hand sides will not agree “exactly” with the indicated
operations on the left-hand sides.
19
Note that the reported value in Figure 7.7 is not zero, but rather is simply smaller than the number of
decimal places allowed in this (default) printing format.

________________________________________________________________________
ESE 502 III.8-23 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

8.5 Parametric Significance Tests for SLM

Using the same notation as above, recall that the log-likelihood function for the SL-model
in (6.2.2) through (6.2.4) is given by

(8.5.1) L( | y )  L(  , 2 ,  | y )

  n2 log(2 )  n2 log( 2 )  12 log | V |  21 2 ( y  X   )V1 ( y  X   )

where X   B1 X and where  is now the spatial dependency parameter for the
dependent variable, y , rather than for the residual errors,  . If for notational simplicity
we let H (  ,  )  tr[G (G  G )]   2   X  G G X  , and again let G  WB1 , then the
same analysis of this log-likelihood function as in (8.4.2) and (8.4.3) above [see
Appendix B in Doreian (1980)] yields the corresponding covariance matrix for SLM:

1
 1 X X
2
0 1
2
X  G X  
 
(8.5.2) cov( ˆ ,ˆ 2 , ˆ )   0 n
2 4 
1 tr (G )
2  
 1 
  2   X  G X
1
2
tr (G ) H (  ,  ) 

Thus the appropriate asymptotic sampling distribution for SLM is given by:

 X  G X  
1

 ˆ       2 X X
1 0 1
 2     2
 
 2
 ˆ  ~ N     ,  0  
n 1 tr (G )
(8.5.3) 2 4
 2 
 ˆ       
       12   X  G X 1 tr (G )
 H (  ,  )  
 2 

The key difference from SEM is that the present covariance matrix, cov( ˆ ,ˆ 2 , ˆ ) , is not
block diagonal. The essential reason for this can be seen by comparing the reduced forms
for SEM and SLM in (6.1.8) and (6.2.6), respectively, which we now reproduce:

(8.5.4) Y  X   u , u ~ N (0, 2V )

(8.5.5) Y  X    u , u ~ N (0, 2V )

These are seen to differ only in that X for SEM is replaced by X   B1 X for SLM. So
the difference here is that  in SLM is directly influencing the mean of Y while in SEM
it is not [i.e., E (Y | X )  B1 X  rather than E (Y | X )  X  ]. It is this linkage between

________________________________________________________________________
ESE 502 III.8-24 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

 and  in SLM that creates non-zero covariances between ̂ and the components of
ˆ .

8.5.1 Parametric Tests for SLM

If we now substitute consistent maximum-likelihood estimates ( ˆ ,ˆ 2 , ˆ ) for the true


parameter values in (8.5.2), and again factor out ˆ 4 (for numerical stability in calculating
the inverse), then the estimated covariance matrix, S SLM , is seen to have the form:

1
 ˆ 2 X X 0 ˆ 2 X  Gˆ X ˆ 
 
(8.5.6) S SLM  ˆ 4  0 n
2 ˆ 2tr (Gˆ ) 
 2 ˆ 
 ˆ   X  Gˆ X ˆ 2tr(Gˆ ) ˆ 4 H ( ˆ , ˆ ) 

Given this estimated matrix, it follows by using the same notation as for SEM that all
relations in expressions (8.4.11) through (8.4.19) continue to hold, where
({ˆ j : j  0,1,.., k },ˆ 2 , ˆ ) is now the vector of maximum-likelihood estimates for SLM
rather than SEM, and where the standard deviations, ({sˆ j : j  0,1,.., k }, sˆ 2 , sˆ ) , are now
the square roots of the diagonal elements of S SLM rather than S SEM . So aside from these
differences, all z-tests and pseudo t-tests are identical in form.

8.5.2 Application to the Irish Blood Group Data

As with sem.m, the MATLAB program slm.m offers an optional output of the S SLM
matrix, designated by cov, which in a manner to Figure 8.6 above, now has the form:

ˆ0 ˆ1 ̂ 2 ̂
ˆ0 10.327 0.8259 0.4129 ‐ 0.3589

ˆ1 0.8259 0.3366 0.0381 ‐ 0.0331

ˆ 2 0.4129 0.0381 0.2172 ‐ 0.0145

̂ ‐ 0.3589 ‐ 0.0331 ‐ 0.0145 0.0126

Notice in particularFigure 8.7.elements


that all SLM Parameter Covariance
of this covariance Matrix
matrix are nonzero. So while
there appear to be no direct links between  and  in the Fisher information matrix for
2

SLM [recall (8.3.2)], there are indirect links as seen in its inverse. More generally, only
block-diagonal patterns of zeros in the Fisher information matrix ensure independence in
the multi-normal case.

________________________________________________________________________
ESE 502 III.8-25 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

With these observations, the relevant test statistics can again be obtained from (the right
panel of) Figure 7.7 together with the diagonal elements of S SLM in Figure 8.7. For 1 we
see in this case that

ˆ1 2.0142
(8.5.7) z1    3.472
sˆ .3366
1

which in turn yields the following p-value for the z-test in Figure 7.7, together with the
corresponding pseudo t-test:

(8.5.8) p1  Pr(| z1 |  3.472)  2 ( 3.472)  0.00052 and p1t  pseudo  0.0022

So again we see that the significance level for the z-test is inflated. But even for the more
conservative pseudo t-test, the “Pale” effect is here vastly more significant than for SEM,
as was seen graphically in Figure 7.8 above.

Turning finally to the spatial dependency parameter,  , we may again use Figures 7.7
and 8.7 to obtain the following z-score,

ˆ 0.7264
(8.5.9) zˆ    6.467
sˆ 0.0126

and corresponding p-values,

(8.5.10) pˆ  Pr(| z ˆ |  6.467)  9.99  1011 and ptˆ pseudo  1.66  106

So even though the “Rippled Pale” in Figure 7.8 fits the Blood Group data far better than
the “Pale” itself, these results show that there remains a great deal of spatial
autocorrelation that is not accounted for by this single explanatory variable.

________________________________________________________________________
ESE 502 III.8-26 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis

9. Goodness-of-Fit Measures for Spatial Regression

Unlike Ordinary Least Squares, where there is a single dominant measure of goodness of
fit – namely R-squared (and adjusted R-squared), no such dominant measure exists for
more general linear models. So relative goodness of fit for models such as SEM and SLM
is best gauged by employing a variety of candidate measures, and attempting to establish
“dominance” in terms of multiple measures. Recall from Figure 7.7 that seven different
measures were reported for each of these models. So the main objective of this section is
to clarify the meaning and interpretation of these measures. To do so, we begin in Section
9.1 below with a detailed investigation of the classical R-squared measure. Our objective
here is to show why it is appropriate for classical OLS but not for more general models.
This will lead to “extended” R-squared measures that can be applied to both SEM and
SLM.

9.1 The R-Squared Measure for OLS

To motivate R-squared ( R 2 ) as a goodness-of-fit measure for OLS, we start with a


simplest case of a single explanatory variable, x, and consider a scatter plot of data points,
( yi , xi ), i  1,.., n , used to estimate a regression of y on x, as shown in Figure 9.1 below.

yi
y • y • yi
y i  yˆ i
• yi  y yˆ i
• yˆ i  y
y
• • • y
• •
• ˆ0  ˆ1 x

xi xi

Figure 9.1. Basic Data Plot Figure 9.2. Regression Line

From an estimation viewpoint, the regression problem for this data is to find a linear
function, y   0  1 x , which best fits this data. If we let ei denote the actual deviation
of point ( yi , xi ) from this function (or line), so that by definition,

(9.1.1) yi   0  1 xi  ei , i  1,.., n

then the regression line is defined to be that linear function, y  ˆ0  ˆ1 x , which
minimizes the sum of squared deviations, i ei2 . In this case, the desired regression line is
given by the blue line in Figure 9.2 [where only the single representative data point,
( yi , xi ) , from Figure 9.1 is shown here].

________________________________________________________________________
ESE 502 III.9-1 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

To evaluate “goodness of fit” for this line, we first construct an appropriate benchmark
for comparison. To do so, it is natural to ask how we might “fit” y-values if the
explanatory variable, x , were ignored altogether. This can be accomplished by simply
setting 1  0 , so that model (9.1.1) reduces to:

(9.1.2) yi   0  ei , i  1,.., n

In this setting the least-squares fit, ˆ0 , is now obtained by minimizing the sum of squares

(9.1.3) S (  0 )   i ( yi   0 ) 2

By solving the first-order condition for this problem, we see that

(9.1.4) 0  d
d 0 S ( ˆ0 )  2 i ( yi  ˆ0 )( 1)

 0   ( y  ˆ )  
i i 0 i
yi  nˆ0

 ˆ0  1
n  i
yi  y

and thus that the best least-squares fit to y in this case is precisely the sample mean, y .
[Recall also the arguments of expressions (7.1.35) and (7.1.36) in Part II]. In other words,
if one ignores possible relations with other variables, then the best predictor of y values
based only on data ( yi : i  1,.., n ) is given by the sample mean of this data. So the flat line
with value y in Figure 9.1 represents the natural benchmark (or null hypothesis) against
which to compare the performance of any other possible regression model, such as
(9.1.1). But for this benchmark case, it is clear that “goodness of fit” to the y-values can
be measured directly in terms of their squared deviations around y . This can be
summarized in terms of the sum of squared deviations,

S y2   i 1 ( yi  y )2
n
(9.1.5)

designated here as the total variation in y.1 Note in particular that with respect to this
measure, one has a perfect fit (i.e., yi  y for all i  1,.., n ) if and only if S y2  0 .

In this setting, candidate explanatory variables, x , for y only have substance in so far as
they can reduce this benchmark level of uncertainty in y. As we shall see, it is here that

1
Equivalently, one could take averages, and use the sample variance, s 2y  S y2 / ( n  1) , of y in model (9.2).
But as we shall see below, it turns out to be simpler and more direct to consider the fraction of total
variation in y that can be accounted for by a given regression model.

________________________________________________________________________
ESE 502 III.9-2 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

the R-squared measure ( R 2 ) comes into play. In short, R 2 captures the reduction in
uncertainty about y that can be achieved by regressing y on any given set of explanatory
variables. The key idea can be seen in an intuitive way by reconsidering the regression
shown in Figures 9.1 and 9.2 above. Note first that the full deviation, yi  y , of the
representative point, ( yi , xi ) , from the benchmark flat line, y , is shown explicitly in
Figure 9.1. In the presence of the regression line in Figure 9.2, this deviation can be
decomposed into two parts by using the predicted value, yˆ i , of yi for this regression.
The lower segment, yˆ i  y , reflects that part of the overall deviation, yi  y , that has
been “explained” by the regression line, and the upper segment, yi  yˆ i , reflects that part
left “unexplained” by the regression. In this context, the essential purpose of R 2 is to
yield a summary measure of the fractional deviations accounted for by the regression.
But notice that this example point, ( yi , xi ) , has been carefully chosen so that both the
deviation, yi  y , and its fractional parts are positive. To ensure positivity, it is more
appropriate to ask how much of the squared deviation, ( yi  y )2 , is accounted for by the
regression line. Note moreover that not all points will yield such “favorable” results for
this regression. For example, data points that happen to be very close to the y -line will
surely be better predicted by y than by the regression, so that ( yi  y )2  ( yi  yˆ i )2 .
Thus the key question to be addressed how well a given regression is doing with respect
to total variation of y in (9.1.5). In the context of Figure 9.2, the main result will be to
show that this total variation can be decomposed into the sum of squared deviations of
both yi  yˆ i and yˆ i  y , i.e., that

(9.1.6) S y2   i ( yˆ i  y )2   i ( yi  yˆ i )2   ( yˆ  y )   eˆ
i i
2 2
i i

If these terms are designated respectively as model variation and residual variation, then
this fundamental decomposition says that

(9.1.7) total variation  model variation  residual variation

In this setting, the desired R 2 measure (also called the Coefficient of Determination) is
taken to be the fraction of total variation accounted for by model variation, i.e.,

(9.1.8) R 
2 model variation

 ( yˆ  y )
i i
2

total variation  ( y  y)
i i
2

Note from (9.1.7) that this can equivalently be written as

(9.1.9) R 2  1
residual variation
 1
 eˆ 2
i i
total variation  ( y  y)
i i
2

where this ratio can be viewed as the fraction of “unexplained” variation.


________________________________________________________________________
ESE 502 III.9-3 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

The task remaining is to demonstrate that this decomposition holds for linear regressions
with any number of explanatory variables. To do so, we begin by developing a “dual”
representation of the regression problem which (among other things) will yield certain
key results for this construction.

9.1.1 The Regression Dual

To motivate this representation, we again begin with the simplest possible case of one
explanatory variable, x, together with only three samples, ( yi , xi ), i  1, 2,3 , as shown in
Figure 9.3 below.
y s3
 y1 
y3 • y   y2 
y 
 0  1 x  3
y1 •
s2  x1 
y2 • x   x2 
x 
 3

x1 x2 x3 x s1

Figure 9.3 Sample Plot Figure 9.4. Variable Plot

This sample plot is simply another instance of the scatter plot in Figure 9.1, where a
candidate line,  0  1 x , for fitting these three points is shown in blue. As in expression
(9.1.1), this yields the identity,

(9.1.10) yi   0  1 xi  ei , i  1,2,3

where again the desired regression line, ˆ0  ˆ1 x , minimizes the sum of squared
deviations,  i ei2  e12  e22  e32 . But recall that (9.1.6) can also be written in vector form
as,

 y1   1  x1   e1 
(9.1.11)  y2    0  1  1  x2    e2   y   0 13  1 x  e
y   1  x  e 
 3    3  3

where in particular, the vectors, y  ( y1 , y2 , y3 ) and x  ( x1 , x2 , x3 ) denote all data values


of the dependent variable and explanatory variable, respectively. These two vectors are
shown (in blue) in Figure 9.4, which is usually designated as the variable plot. Here the
three axes now represent “sample dimensions”, ( s1 , s2 , s3 ) . The two representations in
Figures 9.3 and 9.4 exhibit a certain duality property in that the roles of samples and
variables are reversed. For plots such as Figure 9.3, the axes are variables and the points
are samples. However, the axes in Figure 9.4 are samples and the points are variables

________________________________________________________________________
ESE 502 III.9-4 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

[here drawn as vectors from the origin]. Each of these representations has its own
advantages. For the present case of a single explanatory variable, x, the more standard
sample plot has the advantage of allowing any number of samples to be plotted and
displayed. The variable plot in Figure 9.2 is far more restrictive in this context, since the
present case of a single explanatory variable with three samples is essentially the only
instance in which a graphic representation is even possible.2 Nonetheless, this dual
representation, or regression dual, reveals key geometric properties of regression that
simply cannot be seen in any other way. This is more apparent in Figure 9.5 below,
where we have included the unit vector, 13  (1,1,1) from expression (9.1.11) as well.

s3 s3
y y
13 13

s2 x ŷ x

s1 s1

Figure 9.5. Regression Plane Figure 9.6. Regression as Projection

Note also that we have now colored the vectors, x and 13 , and have connected them with
a dashed line to emphasize that these two vectors define a two-dimensional plane called
the regression plane. In geometric terms, the linear combinations,  0 13  1 x , in
expression (9.1.10) above represent possible points on this plane (so for example,
 0  1  1/ 2 , corresponds to the point midway on dashed line joining x and 13 ). In
these terms, the regression problem of finding a point, ˆ0 13  ˆ1 x , in the regression
plane that minimizes the sum of squared deviations, i ei2 , has a very clear geometric
interpretation. In particular, since the relation,

(9.1.12) e  y  (  013  1 x )   i ei2  || e ||2  || y  (  013  1 x ) ||2

shows that this sum of squares is simply the squared distance from y to  0 13  1 x ,
the regression problem in this dual representation amounts geometrically to finding that
point, yˆ  ˆ013  ˆ1 x , in the regression plane which is closest to y. Without going into
further details, this closest point is precisely the orthogonal projection of y into this

2
Note that while more variables could in principle be included in Figure 9.4, the associated regression
would be completely overdetermined. More generally, when variables outnumber sample points, there are
generally infinitely many regression planes that all yield perfect fits to the data.

________________________________________________________________________
ESE 502 III.9-5 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

plane, as shown by the red arrow in Figure 9.6,3 where the red dashed line represents the
corresponding residual vector, ê , from (9.1.12), as defined by eˆ  y  yˆ .

This view of regression as an orthogonal projection also yields a number of insights into
the algebraic structure of regression.4 The most important of these follow from the
observation that since the residual vector, ê , is orthogonal to the regression plane, it must
necessarily be orthogonal to every vector in this plane. In particular, ê must be
orthogonal to both ŷ and 13 . Not surprisingly, the same is true for regressions in any
dimension, n (i.e., with n samples).5 So we can generalize these observations by first
extending the present case to multiple regressions with k explanatory variables and n
samples as,

y  yˆ  eˆ  X ˆ  eˆ  ˆ01n   j 1 ˆ j x j  eˆ
k
(9.1.13)

Here ŷ is now the orthogonal projection of y into the regression hyperplane spanned by
the vectors (1n , x1 ,.., xk ) in  n . Moreover (as shown in Section A2.4 of the Appendix to
Part II), orthogonality between vectors can be expressed algebraically as follows: vectors,
a, b   n , are orthogonal if and only if their inner product is zero, i.e., if and only if
ab  0 .6 So these observations yield the following two important inner product
conditions for any regression in  n :

(9.1.14) eˆyˆ  0  eˆ1n

As we shall see, it is precisely these two conditions that allow the total variation of y to
be decomposed as desired.

9.1.2 Decomposition of Total Variation

To develop this decomposition, we first obtain a vector representation of mean variation


by employing the following notational conventions. Each sample vector, y  ( y1 ,.., yn ) ,
can be transformed into deviation form about its about its sample mean,

3
Here the s2 axis has been hidden for visual clarity
4
An excellent discussion of all these ideas is given in Sections 3.2.4 and 3.5 of Green (2003). In particular,
his Figure 3.2 gives an alternative version of Figure 9.6. For a somewhat more advanced treatment, see
Section 1.2 in Davidson and MacKinnon (1993).
5
As an extension of footnote 2 above, it of interest to note that the present case of one explanatory variable
with n  3 (non-collinear) samples is in fact the unique case where all the relevant geometry can be seen.
On the one hand, three points are just enough to yield a non-trivial regression as in Figure 9.3, while at the
same time still allowing a graphical representation of variable vectors in Figure 9.4.
6
This is perhaps the most fundamental identity linking the algebra of Euclidean vector spaces to their
underlying geometry. As one simple illustrative example, note that any vectors, a  ( a1 , 0) and b  (0, b2 ) ,
on the horizontal and vertical axes in  2 must be orthogonal in geometric terms, and in algebraic terms,
must satisfy a b  a1  0  0  b2  0 .

________________________________________________________________________
ESE 502 III.9-6 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

 yi  n1 (1n y )
n
(9.1.15) y  1
n i 1

as follows,
 y1  y 
(9.1.16) y  y1n    
 
 yn  y 

This is in fact a linear transformation on n , as can be seen by defining the n-square


deviation matrix,

(9.1.17) D  I n  n1 (1n1n )

and observing that for all y   n ,

(9.1.18) Dy  ( I n  n1 1n1n ) y  y  n1 (1n1n ) y  y  n1 1n (1n y )  y  y1n

Like regression, this transformation is also an orthogonal projection, where in this case
D projects n onto the orthogonal complement of the unit vector, 1n , i.e., the subspace
of all vectors orthogonal to 1n . In algebraic terms, D sends 1n to the origin, i.e.,

(9.1.19) D1n  ( I n  n1 1n1n )1n  1n  n1 1n (1n1n )  1n  nn 1n  0 ,

and leaves all vectors orthogonal to 1n where they are. For example, the residual vector,
ê , for any regression is orthogonal to 1n by (9.1.10), and we see that,

(9.1.20) Deˆ  ( I n  n1 1n1n )eˆ  eˆ  n1 1n (1n eˆ )  eˆ  n1 1n (0)  eˆ

More generally, as with all orthogonal projections, the matrix D is symmetric ( D  D )


and idempotent ( DD  D ), i.e.,7

(9.1.21) DD  ( I n  n1 1n1n )( I n  n1 1n1n )  I n  n2 1n1n  n12 1n (1n1n )1n

 I n  n2 1n1n  nn2 1n1n  I n  n1 1n1n  D

These facts allow the total variation in (9.1.5) to be expressed directly in terms of D as,

S y2   i 1 ( yi  y )2  ( y  y1n )( y  y1n )


n
(9.1.22)

 ( Dy )( Dy )  y DDy  y DDy  y Dy

7
These two conditions in fact characterize the set of orthogonal projection matrices.

________________________________________________________________________
ESE 502 III.9-7 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

Moreover, by recalling from (9.1.13) that y  yˆ  eˆ , we may now employ (9.1.14),


(9.1.20) and (9.1.21) to obtain the following fundamental decomposition of S y2 :

(9.1.23) S y2  ( yˆ  eˆ )D( yˆ  eˆ )  ( yˆ Dyˆ  2 yˆ Deˆ  eˆDeˆ )

 yˆ Dyˆ  2 yˆ eˆ  eˆeˆ  yˆ Dyˆ  2(0)  eˆeˆ

 yˆ Dyˆ  eˆeˆ

To relate this decomposition to (9.1.6), we note first that if we now denote the residual
variation term in (9.1.6) by S ê2 then it follows at one that this is precisely the second term
in (9.1.23), i.e, that

 eˆ  eˆeˆ
n
(9.1.24) S eˆ2  2
i 1 i

Turning next to the model variation term in (9.1.6), notice again from (9.1.14) that

(9.1.25) 0  1n eˆ  1n ( y  yˆ )  1n y  1n yˆ  1n y  1n yˆ

and thus that the mean of the regression predictions, ( yˆ1 ,.., yˆ n ) , is precisely y , i.e.,

 yˆ i  n1 (1n yˆ )  n1 (1n y )  y
n
(9.1.26) 1
n i 1

Thus if we now denote model variation in (9.1.6) by S ŷ2 , then it follows from (9.1.17)
and (9.1.26), together with the above properties of D that

 ( yˆ i  y )2  ( yˆ  y1n )( yˆ  y1n )


n
(9.1.27) S y2ˆ  i 1

 ( yˆ  [ 1n 1n yˆ ]1n )( yˆ  [ n1 1n yˆ ]1n )  ( yˆ  n1 1n1n yˆ )( yˆ  1n 1n1n yˆ )

 ([ I n  n1 1n1n ] yˆ )([ I n  n1 1n1n ] yˆ )  ( Dyˆ )Dyˆ  yˆ DDyˆ

 yˆ Dyˆ

and thus that S ŷ2 is precisely the first term in (9.1.23). By putting these results together,
we may conclude that the desired decomposition of total variation for y is given by

(9.1.28) S y2  S y2ˆ  S eˆ2

In these terms, the R-squared measure in (9.1.8) and (9.1.9) can now be re-expressed as:

________________________________________________________________________
ESE 502 III.9-8 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

S y2ˆ S eˆ2
(9.1.29) 2
ROLS   1 
S y2 S y2

where the OLS subscript is here used to emphasize that this decomposition property
holds for OLS. Notice also from the nonnegativity of all terms in (9.1.28) that
0  ROLS
2
 1 , and thus that ROLS
2
can be interpreted as the fraction of total variation
explained by a given OLS regression. For computational purposes, it is more convenient
to express R-squared in vector terms as,

yˆ Dyˆ eˆeˆ
(9.1.30) 2
ROLS   1
y Dy y Dy

where the latter form, in terms of unexplained variation, is by far the most commonly
used in practice.

9.1.3 Adjusted R-Squared

2
While ROLS is intuitively very appealing as a measure of goodness of fit, it suffers from
certain drawbacks. Perhaps the single most important of these is that fact that the measure
can never decrease when more explanatory variables are added to the model, and in fact
it almost always increases. This can be most easily seen by relating residual variation to
the solution of the regression problem itself. Recall that if for any given set of data,
( yi , x1i ,.., xki ), i  1,.., n , we define the sum-of-squares function

 
2
S k (  0 , 1 ,..,  k )   i yi   j 0  j xij
k
(9.1.31)

over possible beta values (  0 , 1 ,..,  k ) [as in expression (7.1.9) of Part II], then the
regression problem is to find those values ( ˆ0 , ˆ1 ,.., ˆk ) that minimize this function. But
the residual variation for this regression problem, say eˆk eˆk , is precisely the value of S k at
the minimum, i.e.,

 
2
eˆk eˆk   i eˆik2   i yi   j 0 ˆ j xij  S k ( ˆ0 , ˆ1 ,.., ˆk )
k
(9.1.32)

 min (  0 ,1 ,.., k ) Sk (  0 , 1 ,..,  k )

So if we add another explanatory variable, xk 1 , and observe that by definition


S k (  0 , 1 ,..,  k ) is just the special case of S k 1 (  0 , 1 ,..,  k ,  k 1 ) with  k 1  0 , i.e., that

 
2
S k 1 (  0 , 1 ,..,  k ,0)   i yi   j 0  j xij  (0) xi ,k 1
k
(9.1.33)

________________________________________________________________________
ESE 502 III.9-9 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

 
2
  i yi   j 0 ˆ j xij  S k (  0 , 1 ,..,  k )
k

then it follows at once from (9.1.31) through (9.1.33) that

(9.1.34) eˆk 1eˆk 1  min (  0 ,.., k , k 1 ) Sk 1 (  0 ,..,  k ,  k 1 )

 min (  0 ,.., k ) Sk 1 (  0 ,..,  k ,0)

 min (  0 ,.., k ) Sk (  0 ,..,  k )

 eˆk eˆk

Thus, when a new explanatory variable is added to the regression, the resulting residual
variation never increases, and in fact must decrease unless the new variable, xk 1 , is
totally unrelated to y in the sense that ˆk 1  0 . Finally, since y Dy is the same in both
2
regressions, we may conclude from last term in (9.1.30) that ROLS never decreases, and
8
almost always increases.

2
This property creates serious problems when using ROLS as a criterion for model
2
selection. Since ROLS can always be increased by adding more variables to a given model,
this will lead inevitably to the classic problem of “overfitting the data”. Indeed, for
2
problems with n samples, it is easy to see that a perfect fit ( ROLS  1) can be guaranteed
by increasing the number of (non-collinear) explanatory variables, k , to n  1 . For
example, if there were only n  2 samples, then since two points define a unique line,
almost any simple regression ( k  1) must yield a perfect fit.

2
This serves to underscore the need to modify ROLS to reflect the number of explanatory
variables used in a given regression model. This can be accomplished by essentially
“penalizing” those models with larger numbers of explanatory variables. The standard
2 2
procedure for doing so is to replace ROLS by the following modification, ROLS ,
designated as adjusted R-squared:

eˆeˆ
(9.1.35) 2
ROLS  1   nn11k   1  nnk1  (1  ROLS
2
)
y Dy

2
Here the first equality is the standard definition of ROLS , and the second equality simply
2
re-expresses this measure directly in terms of ROLS . While this measure can be given

8
The exact magnitude of this increase is given in Green (2003, Theorem 3.6).

________________________________________________________________________
ESE 502 III.9-10 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

some theoretical justification,9 the popularity of ROLS 2


lies mainly in its simplicity and
2
ease of interpretation as a reasonable “penalized” version of ROLS . In particular, note that
the penalty factor, ( n  1) / ( n  1  k ) , must be greater than one in all cases of interest, and
always increases with k . This in turn implies that ROLS 2
 ROLS
2 2
, and that ROLS decreases as
2
k increases. Thus, ROLS does indeed penalize models with larger numbers of explanatory
variables. Moreover, since ROLS 2
approaches  as k approaches n  1 , it is clear that
models with numbers of variables anywhere close to the sample size will never be
2
considered. Note however that this last property also shows that ROLS need not be
positive, and thus cannot be given any interpretation relating to the “fraction of variation
2
explained”. About all that can be said is that models with negative ROLS can surely be
discarded from consideration. At the other extreme, notice that penalty factor,
( n  1) / ( n  1  k ) , shrinks rapidly to one as sample size, n , increases. So from a
practical viewpoint, this penalty has little effect whenever sample sizes are quite large
compared to the number of explanatory variables being considered. Because of this, it has
2
been argued that ROLS does not penalize models enough. But in any case, this measure is
2
unquestionably preferable to ROLS when comparing regression models of different sizes,
and is far and away the most popular measure of goodness of fit in this context.

9.2 Extended R-Squared Measures for GLS

2 2
In spite of the success of ROLS and ROLS for OLS models, their appropriateness as
goodness-of-fit measures for more general models is more problematic. Here it suffices
to consider the simplest possible extension involving the GLS model in Section 7.2.2
above,

(9.2.1) Y  X    ,  ~ N (0, 2V )

with known covariance structure, V . In this modeling context, the key difficulty is that
the resulting y-predictions obtained from (7.2.18) by

(9.2.2) yˆ  X ˆ  X ( X V 1 X )1 X V 1 y

are no longer orthogonal projections.10 So the fundamental decomposition of total


variation in (9.1.23) and (9.1.28) no longer holds, and the compelling interpretive

9
The standard theoretical justification relies on the fact that (i) y Dy / ( n  1) yields an unbiased estimate of
y variance in the null model (9.1.2), (ii) eˆeˆ / ( n  1  k ) yields an unbiased estimate of residual variance,
 2 , in the regression model, and (iii) the second term in (9.1.35) is precisely the ratio of these unbiased
estimates. But while this argument is appealing, it does not imply that this ratio is an unbiased estimate of
the fraction of unexplained variance. Indeed, the expectation of a ratio is almost never the same as the ratio
of expectations.
10
An excellent discussion of this issue is given in Davidson and MacKinnon (1993 ,Sections 1.2 and 9.3).

________________________________________________________________________
ESE 502 III.9-11 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

2
features of ROLS now vanish. In particular, the model-oriented and error-oriented
2
definitions of ROLS in (9.1.30) are no longer equivalent. So there is no unambiguous way
to define the “fraction of variation explained” by the given GLS model.

But as in the introductory discussion to Section 9.1 above, the residual vector, eˆ  y  yˆ ,
still captures the deviations of data, y , from their predicted values, ŷ , under any GLS
model. Moreover, since Dy  y  y1n still represents the y deviations from their least-
squares prediction, y , under the null model [as in (9.1.4) above], it is reasonable to
gauge the goodness of fit of this model by comparing its mean squared error:


n
(9.2.3) MSE  1
n i 1
( yi  yˆ i )2

with that under the null model, say


n
(9.2.4) MSE0  1
n i 1
( yi  y ) 2

This comparison is shown graphically in Figures 9.7 and 9.8 below:

y • y •

• •
• •
y
• • • y
• • •
• • ŷ
• •
• •

x x

Figure 9.7. Null Deviations Figure 9.8. Model Deviations

In particular, the positivity (and common units) of these measures suggests that their ratio
should provide an appropriate comparison, as given by


n
MSE ( yi  yˆ i )2 ( y  yˆ )( y  yˆ )
(9.2.5)  i 1

 ( y  y1n )( y  y1n )
n
MSE0
i 1
( yi  y ) 2

eˆeˆ eˆeˆ eˆeˆ


  
( Dy )( Dy ) y DDy y Dy

________________________________________________________________________
ESE 502 III.9-12 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

2
which is precisely the second term in the error-oriented version of ROLS . Finally, since
smaller values of this ratio indicate better average fit relative to the null model, it follows
that larger values of the difference,

MSE eˆDeˆ
(9.2.6) 2
RGLS  1  1
MSE0 y Dy

2
also indicate a better fit. To distinguish this general measure from ROLS , it is convenient
2
to designate (9.2.6) as extended R . This terminology also serves to emphasize that
(9.2.6) cannot be interpreted as “explained variation” outside of the OLS case. This is
made clear by the fact that extended R 2 can be negative. But as with adjusted R 2 for
OLS, it should be clear that negative values of extended R 2 are a strong indication of
poor fit. Indeed, models with higher mean squared error than y by itself can generally be
ruled out on this basis alone.

Finally, as with the OLS case, it should be clear that larger numbers of explanatory
variables must necessarily reduce MSE and thus increase the value of extended R 2 . So
goodness of fit for GLS models must be also be penalized for the addition of new
variables. While the penalty ratio, ( n  1) / ( n  1  k ) , in (9.1.35) is somewhat more
difficult to interpret in the GLS setting,11 it nonetheless continues to exhibit the same
appealing properties discussed in Section 9.1.3 above. So in the present GLS setting, we
now the designate

ˆ ˆ
(9.2.7) 2
RGLS 1   nn11k  yeDy
e

as the appropriate extended form of adjusted R 2 in (9.1.35).

Before applying these extended measures to SEM and SLM, it is also of interest to note
that there is an alternative approach which seeks to preserve the appealing properties of
2
ROLS . In particular, recall that one can convert any given GLS model to an OLS model
that is equivalent in terms of parameter estimation. In the present setting, it follows from
expressions (7.1.15) through (7.1.18) that if T is the Cholesky matrix for V, so that
V  TT  , then (9.2.1) can be converted to an OLS model

(9.2.8) Yo  X o    o ,  o ~ N (0, 2 I n )

where these new variables are defined by

11
While the simple “unbiasedness” argument in footnote 9 no longer holds, it can still be shown that
replacing n by n  1  k corrects bias in the GLS estimate of variance, ̂ 2 , in (7.2.20). So at least in these
terms, a justification in terms of “unbiasedness” can still be made.

________________________________________________________________________
ESE 502 III.9-13 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

(9.2.9) Yo  T 1Y , X o  T 1 X ,  o  T 1

So if goodness of fit for model (9.2.1) is now measured in terms of R 2 and R 2 for model
(9.2.8), then it would appear that all of the properties of these measures are preserved. In
particular, if for any given y data, we set yo  T 1 y , then the appropriate prediction, say
yˆ o , is given by

(9.2.10) yˆ o  X o ˆ  X o ( X o X o )1 X o yo

So by setting eˆo  yo  yˆ o , it follows that the appropriate R-squared measure, say Ro2 , is
given from (9.1.30) by

yˆ o Dyˆ o eˆ eˆ
(9.2.11) Ro2   1 o o
yo Dyo yo Dyo

Such measures are typically designated as pseudo R-squared measures for GLS models
[see for example, Buse (1973)]. However, the most serious limitation of such measures in
that they account for total variation in yo  T 1 y rather than in y itself. This is not only
difficult to interpret, but in fact can vary depending on the factorization of covariance
used. For example, the estimated SEM covariance matrix, Vˆ in (7.3.2) has a natural
factorization in terms of the matrix, Bˆ 1 , which will clearly yield different results than for
the Cholesky matrix. So the essential appeal of the extended R 2 and R 2 measures above
is that they are directly interpretable in terms of y and ŷ .

9.2.1 Extended R-Squared for SEM

Turning first to SEM, recall from expression (6.1.8) that for any given spatial weights
matrix, W, we can express SEM as a GLS model of the form:

(9.2.12) Y  X   u, u ~ N (0,  2V )

where the spatial covariance structure, V , is given by

(9.2.13) V  ( B B )1  B1 ( B1 )

with B given in terms of weight matrix, W, by

(9.2.14) B  I n   W

So for any given y data, the maximum-likelihood estimate, yˆ SEM , of the conditional
mean, E (Y | X )  X  , is given by

________________________________________________________________________
ESE 502 III.9-14 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

(9.2.15) yˆ SEM  X ˆ  X ( X Vˆ1 X )1 X Vˆ1 y  X ( X  Bˆ Bˆ X )1 X  Bˆ Bˆ y

Finally, letting

(9.2.16) eˆSEM  y  yˆ SEM

it follows from (9.2.6) that the extended R 2 measure for SEM is given by,

 eˆSEM
eˆSEM
(9.2.17) 2
RSEM 1 
y Dy

with associated extended R 2 measure,

(9.2.18) 2
RSEM  1  nn11k  (1  RSEM
2
)

These two values are reported for the Eire data in the left panel of Figure 7.7 as

(9.2.19) 2
RSEM  0.3313 2
( ROLS  0.5548)

and

(9.2.20) 2
RSEM  0.3034 2
( ROLS  0.5363)

where the corresponding OLS values are given in parentheses. As expected, these
extended measures for SEM are lower than for OLS since they incorporate more of the
true error variation due to spatial dependencies among residuals.12 So the main interest in
these goodness-of-fit measures is their relative magnitudes compared to SLM, or other
models which may serve to account for spatial dependencies (such as the spatial Durbin
model in Section 6.3.2).

9.2.2 Extended R-Squared for SLM

Turning next to SLM, recall from (6.2.6) that this can also be expressed as a GLS model
of the form:

12
This can be seen explicitly by observing from the SEM log likelihood function in (7.3.4) that for the OLS
case of   0 , the estimate, ˆ , is chosen precisely to minimize mean squared error. So whenever ˆ  0 ,
one can expect that the associated mean squared error for SEM will be larger than this global minimum.

________________________________________________________________________
ESE 502 III.9-15 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

(9.2.21) Y  X    u , u ~ N (0, 2V )

where V is again given by (9.2.13) and (9.2.14) for some choice of spatial weights
matrix, W, and where in this case,

(9.2.22) X   B1 X  ( I n  W )1 X

So for any given y data, the maximum-likelihood estimate, yˆ SLM , of the conditional
mean, E (Y | X )  X  , is given in terms of (7.4.13) by

(9.2.23) yˆ SLM  X ˆ ˆ  X ˆ ( X  X )1 X  Bˆ y  Bˆ 1 X ( X  X )1 X  Bˆ y

Thus, by now letting

(9.2.24) eˆSLM  y  yˆ SLM

it follows from (9.2.6) that the extended R 2 measure for SLM is given by,

 eˆSLM
eˆSLM
(9.2.25) 2
RSLM 1 
y Dy

with associated extended R 2 measure,

(9.2.26) 2
RSLM  1  nn11k  (1  RSLM
2
)

These two values are reported for the Eire data in the right panel of Figure 7.7 as

(9.2.27) 2
RSLM  0.7335 2
( ROLS  0.5548)

and

(9.2.28) 2
RSEM  0.7224 2
( ROLS  0.5363)

where the corresponding OLS values are again given in parentheses. So in contrast to
2 2
SEM, we see that both RSLM and RSLM for SLM are actually considerably higher than for
OLS. The reason for this is again explained by the contrast between the “pale” effect in
X and the “rippled pale” effect, X ˆ , as illustrated in Figure 7.8 above. However, this
appears to be a very exceptional case in which yˆ SLM (  X ˆ ˆ ) happens to yield an

________________________________________________________________________
ESE 502 III.9-16 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

extraordinarily good fit to y . More generally, one expects both SEM and SLM to yield
extended R 2 values that are lower than ROLS
2
, so that the spatial components W and 
serve mainly to capture the hidden variation arising from spatial autocorrelation effects.

9.3 The Squared Correlation Measure for GLS Models

A measure that turns out to be closely related to extended R 2 is the squared correlation
between y and its predicted value, ŷ , under any GLS model (including OLS). Here it is
again convenient to begin with the OLS case, where this measure is shown to be identical
to R 2 . We then proceed to the more general case of GLS models, including both SEM
and SLM. Finally, the correlation measure itself is given a geometrical interpretation in
terms of angle cosines in deviation subspaces, which helps to clarify its relevance for
measuring goodness of fit.
Let us begin by recalling that the sample correlation, r ( x, y ) , between any pair of data
vectors, x  ( x1 ,.., xn ) and y  ( y1 ,.., yn ) , can be expressed in vector form by employing
the properties of the deviation matrix, D, in (9.1.17) , (9.1.18) and (9.1.21) as follows:


n
( xi  xn )( yi  yn )
(9.3.1) r ( x, y )  i 1

 
n n
i 1
( xi  xn )2 i 1
( xi  xn )2

( x  xn 1n )( y  yn 1n )

( x  xn 1n )( x  xn 1n ) ( y  yn 1n )( y  yn 1n )

( Dx )Dy

( Dx )Dx ( Dy )Dy

xDDy

xDDx y DDy

xDy

xDx y Dy

so that squared correlation is always of the form

( xDy )2
(9.3.2) r 2 ( x, y ) 
( xDx )( y Dy )

Given this general expression, we now consider the correlation between data, y, and
model predictions, ŷ , for the case of OLS.

________________________________________________________________________
ESE 502 III.9-17 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

9.3.1 Squared Correlation for OLS


First recall from (7.2.6) that for any given data ( y , X ) , the predicted value, ŷ , of y is
given by

(9.3.3) yˆ OLS  X ˆ  X ( X X )1 X  y

In these terms, the squared correlation measure for OLS is given in terms of (9.3.2) by

( y Dyˆ OLS )2
(9.3.4) r 2 ( y , yˆ OLS ) 
( y Dy )( yˆ OLS
 Dyˆ OLS )

With this definition, our first objective is to show that (9.3.4) is precisely the same as
2
ROLS . If for notational simplicity we let yˆ  yˆ OLS and again denote the estimated residuals
for OLS by eˆ  y  yˆ , then it follows from expression (9.1.14) that

(9.3.5) 0  yˆ eˆ  yˆ yˆ  yˆ ( y  eˆ )  yˆ y  yˆ eˆ  yˆ y

and moreover that [see also (9.1.25)],

(9.3.6) 0  1n eˆ  1n ( y  yˆ )  1n y  1n yˆ

But given these two identities, we must have

(9.3.7) yˆ Dy  yˆ ( I n  n1 1n1n ) y


 yˆ y  n1 1n (1n y )
 yˆ yˆ  1n 1n (1n yˆ )  yˆ ( I n  n1 1n1n ) yˆ  yˆ Dyˆ

So it follows at once from (9.3.4) that

( y Dyˆ )2 ( yˆ Dyˆ )2 yˆ Dyˆ


(9.3.8) r 2 ( y , yˆ )   
( y Dy )( yˆ Dyˆ ) ( y Dy )( yˆ Dyˆ ) y Dy

2
which together with the first (model-oriented) representation of ROLS implies that

(9.3.9) r 2 ( y , yˆ OLS )  ROLS


2

For purposes of later comparison, it follows from (9.3.9) that for the Eire case

(9.3.10) r 2 ( y , yˆ OLS )  ROLS


2
 0.5548

________________________________________________________________________
ESE 502 III.9-18 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

9.3.2 Squared Correlation for SEM and SLM

By employing yˆ SEM in expression (9.2.15), it follows at once that the squared correlation
measure for SEM is given by,

( yDyˆ SEM )2
(9.3.11) r 2 ( y , yˆ SEM ) 
( y Dy )( yˆ SEM
 Dyˆ SEM )

Similarly, by employing yˆ SLM in expression (9.2.23), it follows that the corresponding


squared correlation measure for SLM is given by,

( y Dyˆ SLM )2
(9.3.12) r 2 ( y , yˆ SLM ) 
( yDy )( yˆ SLM
 Dyˆ SLM )

These values are reported in Figure 7.7 as

(9.3.13) r 2 ( y , yˆ SEM )  0.5548

and

(9.3.14) r 2 ( y , yˆ SLM )  0.7512

Notice first that the squared correlation for SEM is identical with that of OLS. This
appears somewhat surprising, given that their estimated beta coefficients are quite
different. But in fact, this is an instance of the strong scale invariance properties of
correlation. To see this, we again use the simplifying notation in (9.3.8),

( y Dyˆ )2
(9.3.15) r 2 ( y , yˆ ) 
( yDy )( yˆ Dyˆ )

and observe that for the case of only one explanatory variable, the ŷ values for both
SEM and OLS, must be linear combinations of 1n and x , i.e., must be of the form,

(9.3.16) yˆ  a1n  bx

for some scalars a and b. But note first from the properties of the deviation matrix, D, that

(9.3.17) Dyˆ  aD1n  bDx  bDx

and thus that Dyˆ is already independent of a. Moreover, (9.3.17) in turn implies both that

(9.3.18) y Dyˆ  by Dx and yˆ Dyˆ  ( Dyˆ )Dyˆ  b2 xDx

________________________________________________________________________
ESE 502 III.9-19 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

Thus by (9.3.15) we must have

(by Dx )2 b2 ( y Dx )2


(9.3.18) r 2 ( y , yˆ )    r 2 ( y, x )
( y Dy )(b xDx ) b ( y Dy )( xDx )
2 2

and may conclude that squared correlation depends only on y and x . So in particular,
the squared correlation of OLS and SEM must always be the same for the case of one
explanatory variable.
However, this is clearly not true for SLM, where X  [1n , x ] is transformed to

(9.3.19) X   B1 X  [ B11n , B1 x ]

so that ŷ is no longer of the form (9.3.16). Thus there is little relation between the
squared correlations for SLM and OLS, and as we have seen before, the squared
correlation fit for SLM in (9.3.14) is much higher than for OLS (and SEM).

9.3.3 A Geometric View of Squared Correlation

To gain further insight into the role of squared correlation as a general measure of
goodness-of-fit, it is instructive to start with the correlation coefficient itself. As we shall
show below, if one writes vectors, x, y   n , in deviation form as Dx  x  x1n and
Dy  y  y1n , then from a geometric viewpoint, the correlation coefficient, corr ( x, y ) , in
(9.3.1) turns out to be precisely the cosine of the angle,  ( Dx, Dy ) , between these
vectors, i.e.,

(9.3.20) r ( x, y )  cos[ ( Dx, Dy )]

This is most easily seen by first considering the cosine of the angle,    ( x, y ) , between
any pair of (nonzero) vectors, x, y   n , as shown for n  2 in Figure 9.9 below:

y y

x x

x

Figure 9.9. Vector Angle Figure 9.10. Right Triangle

________________________________________________________________________
ESE 502 III.9-20 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

To calculate the cosine of this angle, we first construct a right triangle by finding the
point,  x , on the x -vector for which the line segment, y   x , is orthogonal to x , as
shown by the red dotted line in Figure 9.10. Since vectors are orthogonal if and only if
their inner product is zero, this point can be identified by solving:

xy xy
(9.3.21) 0  x( y   x )  xy   xx    
xx || x ||2

Next, recall (from trigonometry) that for this right triangle, the desired cosine of  ( x, y )
is given by the (signed) length of the adjacent side, i.e.,  || x || , over the length of the
hypotenuse, || y || , so that

 || x ||  xy  || x ||
(9.3.22) cos[ ( x, y )]    2 
|| y ||  || x ||  || y ||

xy
 cos[ ( x, y )] 
|| x || || y ||

Before proceeding further, recall from expression (4.1.12) that this already establishes
(9.3.20) for the case of “zero mean” vectors. But the more general case is now obtained
by simply considering the vectors, Dx and Dy. In particular, since by definition,

(9.3.23) || Dx ||  ( Dx )( Dx )  xDDx  xDx

and similarly, || Dy ||  y Dy , it follows at once from (9.3.1) together with (9.3.22) and
(9.3.23) that

( Dx )Dy xDy
(9.3.24) cos[ ( Dx, Dy )]    r ( x, y )
|| Dx || || Dy || xDx xDx

and thus that (9.3.20) does indeed hold for all (nonzero) vectors, x, y   n . This in turn
implies that the squared correlation is simply the square of this cosine:

(9.3.25) r 2 ( x, y )  cos2 [ ( Dx, Dy )]

So in our case, if we now let ŷ denote the predicted value of data vector, y , for any
given model (whatsoever), then it follows at once that

(9.3.26) r 2 ( y, yˆ )  cos 2 [ ( Dy, Dyˆ )]

________________________________________________________________________
ESE 502 III.9-21 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

This geometric view of squared correlation helps to clarify the exact sense in which it
constitutes a robust goodness-of-fit measure. In particular, it yields a measure of
“similarity” between y and ŷ which is completely independent of the measurement
units employed. Indeed, this was already shown in arguments of (9.3.16) through (9.3.18)
above, where shifts of measurement origins were seen to be removed by the deviation
matrix, D, and where scale transformations were removed by the ratio form of squared
correlation itself. Even more important is the fact that since cos2 ( ) is close to one if and
only if  is close to 0 (or  ), the identity in (9.3.26) shows that r 2 ( y, yˆ ) is close to one
if and only if the vectors, Dy and Dyˆ , point in almost the same (or opposite) directions.
Algebraically, this implies they are almost exact linear multiples of one another, i.e., that
Dyˆ   Dy for some nonzero scalar,  . In practical terms, this means that the relative
sizes of all deviation components must be approximately the same, so that if ŷ denotes
the sample mean of ŷ , then

yˆ i  yˆ y y
(9.3.27)  i , i j
yˆ j  yˆ yj  y

Thus large (or small) deviations from the mean in components of y are reflected by
comparable large (or small) deviations the mean in components of ŷ . The shows exactly
the sense in which prediction, ŷ , is deemed to be similar to data, y , when r 2 ( y , yˆ )  1 .

9.4 Measures based on Maximum-Likelihood Values

Recall that our basic strategy for estimating model coefficients, (  , 2 ,  ) , was to find
values ( ˆ ,ˆ 2 , ˆ ) that maximized the likelihood of observed data, y, given explanatory
data values, X. This suggests that a natural measure of fit should be provided by the
maximum (log) likelihood value, L( ˆ ,ˆ 2 , ˆ | y, X ) , obtained. One difficulty here is that
since likelihood values themselves are probability density values, and not probabilities,
any direct interpretation of such values is tenuous at best. But the ratios of these values
for different models might still provide meaningful comparisons in terms of the limiting
probability-ratio arguments used in expressions (7.1.1) and (7.1.4) above.

However, there is a second more serious difficulty with likelihood values that is
reminiscent of R-squared values. Recall from the argument in expressions (9.1.31)
through (9.1.34) that R-squared essentially always increases when new explanatory
variables are added to the model. In fact, that argument really shows that the increase in
R-squared results from the addition of new beta parameters. But this argument is far
more general, and in fact shows that maximum values of functions are never decreased
when more parameters are added. In particular, if we consider the case of two likelihood

________________________________________________________________________
ESE 502 III.9-22 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

functions, say L( k ) (1 ,.., k | y , X ) and L( k 1) (1 ,.., k , k 1 | y , X ) , where the first is simply a
special case of the second with  k 1  0 , i.e., with

(9.4.1) L( k ) (1 ,.., k | y , X )  L( k 1) (1 ,.., k ,0 | y, X )

then the same argument shows that

(9.4.2) max (1 ,..,k ) L( k ) (1 ,.., k | y , X )  max (1 ,..,k ) L( k 1) (1 ,.., k ,0 | y, X )

 max (1 ,..,k ,k 1 ) L( k 1) (1 ,.., k , k 1 | y , X )

with strictly inequality almost always holding. What this means for our purposes is that
log likelihood functions suffer from exactly the same “inflation problem” as R-squared
whenever new parameters are added. So if one attempts to compare the goodness of fit
between models that are “nested” in the sense of (9.4.1), [i.e., where one is a special case
of the other with certain parameters set to zero (or otherwise constrained in value)], then
the larger model will always yield a better fit in terms of maximum-likelihood values.

This observation suggests that such likelihood comparisons must somehow be penalized
in terms of the numbers of parameters in a manner analogous to adjusted R-squared. If
we again let L(ˆ | y ) denote a general log likelihood function evaluated at its maximum
value, then the simplest of these penalized versions is Akaike’s Information Criterion
(AIC):

(9.4.3) AIC   2 L(ˆ | y )  2 K

where K now denotes the dimension of ˆ , i.e., the number of parameters being estimated
[and where factor “2” in AIC, as well as in the other measures to be developed, relates to
the form of the log likelihood ratio statistic in expression (10.1.7) below.] For both SEM
and SLM with parameters, ˆ  ( ˆ0 , ˆ1 ,.., ˆk ,ˆ 2 , ˆ ) , this implies in particular that
K  ( k  1)  2  k  3 . This measure is discussed in detail by Burnham and Anderson
(2002), where AIC is both defined (p.61) and later derived (Section 7.2). In addition,
these authors recommend a “corrected” version of AIC (p.66) for sample sizes that are
small relative to the number of parameters ( n / K  40 ). This is usually designated as
corrected AIC (AICc) and can be written in terms of (9.4.3) as

2 K ( K  1)
(9.4.4) AICc  AIC 
n  ( K  1)

________________________________________________________________________
ESE 502 III.9-23 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

An alternative penalized version of maximum likelihood which directly incorporates


sample size is the Bayes (or Schwarz) Information Criterion (BIC):

(9.4.5) BIC   2 L(ˆ | y )  K log( n )

While this measure is also developed in Burnham and Anderson (2002, Section 6.4.1), a
more lucid derivation can be found in Raftery (1995, section 4.1). Given its heavier
penalization term for model sizes, K [when log( n )  2 ], this measure is well known to
favor smaller models (i.e., with fewer parameters) than AIC in terms of goodness of fit.

Finally it should be noted that when comparing SEM and SLM for a given specification
of k explanatory variables, all such measures will differ only in terms of their
corresponding maximum-likelihood values, L(ˆ | y ) , for these two models. So in the
present case of Eire, where Figure 7.7 shows that

(9.4.6) LSEM (ˆ | y )   49.8773

(9.4.7) LSLM (ˆ | y )   45.6632

it is clear that SLM must continue to yield a better fit than SEM with respect to all of
these criteria.

________________________________________________________________________
ESE 502 III.9-24 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis

10. Comparative Tests among Spatial Regression Models

While the notion of relative likelihood values for different models is somewhat difficult
to interpret directly (as mentioned above), such likelihood ratios can in many cases
provide powerful test statistics for comparing models. In particular, when two models are
“nested” in the sense of expression (9.4.1) above, it turns out that the asymptotic
distribution of such ratios can be obtained under the (null) hypothesis that the simpler
model is the true model. To develop such tests, we begin in Section 10.1 below with a
simple one-parameter example where the general ideas to be developed can be given an
exact form.

10.1 A One-Parameter Example

Here we revisit the example in Section 8.1 of estimating the mean of a normal random
variable, Y  N (  , 2 ) , with known variance,  2 , given a sample, y  ( y1 ,.., yn ) , of size
n. The relevant likelihood function is then given by expression (8.1.1) as

(10.1.1)  
L(  )  Ln (  | y , 2 )   n log  2  21 2  i 1 ( yi   )2
n

and the resulting maximum-likelihood estimate of  , is again seen from expression


(8.1.2) to be precisely the sample mean, ˆ n  yn .

But rather than simply estimating  , suppose that we now want to test whether   0 , or
more generally to test the null hypothesis, H 0 :   0 , for some specified value, 0 .
Then under H 0 the likelihood value in (10.1.1) becomes:

(10.1.2)  
L( 0 )  Ln ( 0 | y , 2 )   n log  2  21 2  i 1 ( yi  0 )2
n

As shown in Figure 10.1 below, it seems reasonable to argue that the likelihood of 0
relative to the maximum likelihood at ˆ n should provide some indication of the strength

0 ˆ n
• •

L( ˆ ) n
•L(  )
0

Ln (  )

Figure 10.1 Likelihood Comparisons

________________________________________________________________________
ESE 502 III.10-1 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

of evidence in sample y for (or against) hypothesis H 0 . In terms of log likelihoods, such
relations are expressed in terms of the difference between L( ˆ n ) and L( 0 ) . But
following standard conventions, we here refer to such log-differences as likelihood ratios.
Moreover, since L( ˆ n )  L( 0 ) by definition, it is natural to focus on the nonnegative
difference, L( ˆ n )  L( 0 ) . If the distribution of L( ˆ n )  L( 0 ) can be determined under
H 0 , then this statistic can be used to test H 0 . In particular, if L( ˆ n )  L( 0 ) is
“sufficiently large”, then this should provide statistical grounds for rejecting H 0 . With
this in mind, observe that by canceling the common terms in the log likelihood
expressions, and recalling that ˆ n  yn , we see that this likelihood ratio can be written as

L( ˆ n )  L( 0 )   21 2  i 1 ( yi  ˆ n )2   i 1 ( yi  0 )2 
n n
(10.1.3)
 

  21 2  i 1[( yi  yn )2  ( yi  0 )2 ]
n

  21 2  i 1[( yi2  2 yi yn  yn2 )  ( yi2  2 yi 0  02 )]


n

  21 2  i 1[2 yi yn  yn2  2 yi 0  02 ]


n

  21 2  2 yn  i 1 yi  nyn2  2 0  i 1 yi  n 02 


n n

 

 n
2 2
 yn2  2 0 yn  02 

 n
2 2
( y n  0 ) 2

Thus it follows that

 y  0 
2

(10.1.4) 2[ L( ˆ n )  L( 0 )]   n 
 / n 

But under the null hypothesis, H 0 , the standardized mean in brackets is standard normal:

yn  0
(10.1.5) ~ N (0,1)
/ n

So the right-hand side of (10.1.4) is distributed as the square of a standard normal variate,
which is known to have a chi square distribution, 12 , with one degree of freedom, i.e.,

2.5

 yn  0 
2 2

(10.1.6)   ~ 12 1.5

/ n 
1

0.5

0
0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2

________________________________________________________________________
ESE 502 III.10-2 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

where the density of 12 is plotted on the right. So we may conclude that this likelihood-
ratio statistic is chi-square distributed (up to a factor of 2) as:

(10.1.7) 2[ L( ˆ n )  L( 0 )] ~ 12

[As mentioned in Section 9, this factor of 2 is closely related to the same factor appearing
in the penalized likelihood functions developed there.]
Note that we are implicitly comparing two models here, one with a single free parameter
(  ) and the other a “nested” special case where  has been assigned a specific value,
0 (typically, 0  0 ). But the same likelihood-ratio procedure can be used for much
more general comparisons between a “full” model and some special case, denoted as the
“restricted” model. Here we simply summarize the main result. Suppose that the full
model is represented by a log likelihood function, L( | y ) , with parameter vector,
  (1 ,.., K ) , and that the restricted model is defined by imposing a set of m  K
restrictions on these parameters that are representable by a vector, g  ( g j : j  1,.., m) , of
(smooth) functions as relations of the form,
(10.1.8) g j ( )  0 , j  1,.., m

In our simple example above, there is only one relation, namely, g1 (  )    0  0 . If


the maximum-likelihood estimate for full model is denoted by ˆ , and if the maximum-
likelihood estimate, ˆg , for the restricted model is taken to be the (unique) solution of the
constrained maximization problem,

(10.1.9) L(ˆg | y )  max{ : g ( )  0} L( | y )

then it again follows that the relevant likelihood-ratio statistic, L(ˆ | y )  L(ˆg | y ) , is
nonnegative. In this more general setting, if it is hypothesized that the restricted model is
true (i.e., that the true value of  satisfies restrictions, g ), then under this null hypothesis
it can be shown1 that L(ˆ | y )  L(ˆg | y ) is now asymptotically chi square distributed (up
to a factor of 2) with degrees of freedom, m , equal to the number of restrictions defined
by g :

(10.1.9) 2[ L(ˆ | y )  L(ˆg | y )] ~  m2

1
This result, known as Wilk’s Theorem, is developed, for example, in Section 3.9 of the online Lecture
Notes in Mathematical Statistics (2003) by R.S. Dudley at MIT (http://ocw.mit.edu/courses/mathematics/
18-466-mathematical-statistics-spring-2003/lecture-notes/).

________________________________________________________________________
ESE 502 III.10-3 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

This family of likelihood-ratio tests provides a general framework for comparing a wide
variety of “nested” models. Moreover, as in the one-parameter case of (10.1.7) above, the
basic intuition is essentially the same for all such tests. In particular, since the full
maximum likelihood, L(ˆn | y ) , is almost surely larger than the restricted maximum
likelihood, L(ˆg | y ) , the only question is whether it is “significantly larger”. If so, then it
can be argued that the restricted model should be rejected on these grounds. If not, then
this suggests that the full model adds little in the way of statistical substance, and thus (by
Occam’s razor) that the simpler restricted model should be preferred. For example, in the
OLS case above, the key question is whether a given parameter, such as 1 , is
significantly different from zero (all else being equal). If so, then this indicates that the
larger model including variable, x1 , yields a better predictor of y than the same model
without x1 .2 In the following sections, we shall employ this strategy to compare the SE-
model and SL-model from a number of perspectives.

10.2 Likelihood-Ratio Tests against OLS

Here we begin by observing that since SEM and SLM are “non-nested” models in the
sense that neither is a special case of the other, it is not possible to compare them directly
in terms of likelihood-ratio tests. But since OLS is precisely the “   0 ” case of each
model, both SEM and SLM can be compared with OLS in terms of such tests. Thus, by
using OLS as a “benchmark” model, we can construct an indirect comparison of SEM
and SLM. For example, if the improvement in likelihood of SEM over OLS is much
greater than that of SLM over OLS for a given data set, ( y, X ) , then in this sense it can
be argued that SEM provides a better model of ( y, X ) than does SLM.

To operationalize such comparisons, we start with SEM and for a given data set, ( y, X ) ,
let ( ˆSEM ,ˆ SEM
2
, ˆ SEM ) denote the maximum likelihood estimates obtained using the SEM
likelihood function, L(  , 2 ,  | y , X ) , in (7.3.4) above [as in expressions (7.3.10)
through (7.3.12)]. Then the corresponding SEM maximum-likelihood value can be
denoted by:

(10.2.1) LˆSEM  L( ˆSEM ,ˆ SEM


2
, ˆ SEM | y , X )

Similarly, if for OLS we let ( ˆOLS ,ˆ OLS


2
) denote the maximum-likelihood estimates in
(7.2.6) and (7.2.9) obtained for ( y, X ) by maximizing (7.2.4), then the corresponding
OLS maximum-likelihood value can be denoted by

(10.2.2) LˆOLS  L( ˆOLS ,ˆ OLS


2
| y, X )

2
One may ask how this likelihood-ratio test in the OLS case relates to the standard (Wald) tests of
significance, such as in expression (8.4.12) above (with   0 ). Here it can be shown [as for example in
Section 13.4 of Davidson and MacKinnon (1993)] that these tests are asymptotically equivalent.

________________________________________________________________________
ESE 502 III.10-4 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

Finally, since the likelihood function in (7.2.4) is clearly the special case of (7.3.4) with
  0 [or more precisely, with g1 (  , 2 ,  )   in (10.1.8) ], it follows from the general
discussion above that under the null hypothesis,   0 , it must be true that the likelihood
ratio, LRSEM /OLS  2[ LˆSEM  LˆOLS ] , is distributed as chi square with one degree of
freedom, i.e., that

(10.2.3) LRSEM /OLS  2[ LˆSEM  LˆOLS ] ~ 12

Similarly, if ( ˆSLM ,ˆ SLM


2
, ˆ SLM ) denotes the maximum likelihood estimates obtained using
the SLM likelihood function, L(  , 2 ,  | y , X ) , in (7.4.2) above [as in expressions
(7.4.12) through (7.4.14) ], then we may denote the resulting SLM maximum-likelihood
value by:

(10.2.4) LˆSEM  L( ˆSEM ,ˆ SEM


2
, ˆ SEM | y , X )

Then in the same manner as (10.2.3), it follows that under the null hypothesis that   0
for SLM, we also have

(10.2.5) LRSLM /OLS  2[ LˆSLM  LˆOLS ] ~ 12

For the Eire case, these two likelihood ratios and associated p-values are reported in
Figure 7.7 as

(10.2.6) LR  LRSEM /OLS  7.375 ( Pval  .0066)

and

(10.2.7) LR  LRSLM /OLS  15.803 ( Pval  .00007)

So for example, if OLS were the correct model, then the chance of obtaining a likelihood
ratio, LRSLM /OLS , as large as 15.803 would be less than 7 in 100,000. Moreover, while the
p-value for LRSEM /OLS is also quite small, it is relatively less significant than for SLM.
Thus a comparison of these p-values provides at least indirect evidence that SLM is more
appropriate than SEM for this Eire data.
But given the indirect nature of this comparison, it is natural to ask whether there are any
more direct comparisons. One possibility is developed below, which will be seen to be
especially appropriate for the case of row normalized spatial weights matrices.

________________________________________________________________________
ESE 502 III.10-5 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

10.3 The Common-Factor Hypothesis

Here we start by recalling from Section 6.3.2 that if X and  are partitioned as
X  [1n , X v ] and    (  0 ,  v ) , respectively, then an alternative modeling form is
provided by the Spatial Durbin model (SDM),

(10.3.1) Y  WY   01n  X v  v  WX v   ,  ~ N (0, 2 I n )

But this model can be viewed as a special case of the SLM model in the following way. If
we group terms in (10.3.1) by letting X SDM  [1n , X v ,WX v ] and  SDM
  (1n ,  v , ) so that

 0 
(10.3.2) X SDM  SDM  [1n X v WX v ]   v    01n  X v  v  WX v ,
 
 
then (10.3.1) can be rewritten as,

(10.3.3) Y  WY  X SDM  SDM   ,  ~ N (0, 2 I n )

which is seen to be an instance of SLM in expression (6.2.2).

Moreover, if W is row normalized, then SEM can in turn be viewed as a special case of
SDM. To see this, observe first that the reduced form of SEM in expression (6.1.9) can be
expanded and rewritten as follows:

(10.3.4) Y  X   ( I n  W )1

 ( I n  W )Y  ( I n  W ) X   

 Y  WY  ( X  WX )   

 Y  WY  X   WX   

So by employing the notation in (10.3.1), we see that

(10.3.5) Y  WY  [  01n  X v  v ]  W [  01n  X v  v ]  

 WY   01n  X v  v  [  0W 1n  WX v  v ]  

Finally, if W is row normalized, then by expression (3.3.30) it follows that W 1n  1n . So


by letting b0  (1   )  0 , and grouping the two unit vector terms, we see finally that the
SEM model in (10.3.4) becomes

________________________________________________________________________
ESE 502 III.10-6 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

(10.3.6) Y  WY  b01n  X v  v  WX v (  v )  

which is precisely SDM in (10.3.1) under the condition that

(10.3.7)     v

This condition is usually formulated as a null hypothesis, designated as the Common


Factor Hypothesis, and written as

(10.3.8) H CF :    v  0

Under this hypothesis, it follows that SEM is formally a restriction of SDM in the sense
of expression (10.1.8), where the relevant vector, g, of restriction functions is now given
by g (  ,  0 ,  v , , 2 )     v . The number of restrictions (i.e., dimension of g) is here
simply the number of explanatory variables, k . Given this relationship, one can then
employ likelihood-ratio methods to test the appropriateness of SDM versus SEM. To do
so for any given any data set, ( y, X ) , we now let ( ˆSDM ,ˆ SDM 2
, ˆ SDM ) denote the
maximum likelihood estimates obtained by applying the SLM likelihood function,
L(  SDM , 2 ,  | y , X ) , in (7.4.2) to the SLM form of SDM in (10.3.3) above. In these
terms, the resulting SDM maximum-likelihood value is then given by:

(10.3.9) LˆSDM  L( ˆSDM ,ˆ SDM


2
, ˆ SDM | y , X )

Finally, if we let LˆSEM  L( ˆSEM ,ˆ SEM


2
, ˆ SEM | y , X ) denote the maximum-likelihood value
of the SE-model in (10.3.6) [viewed as an SD-model restricted by (10.3.8)], then under
the SEM null hypothesis, we now have

(10.3.10) LRSDM / SEM  2[ LˆSDM  LˆSEM ] ~  k2

where again, k, is the number of explanatory variables in SEM. The results of this
comparative test are part of the SEM output, denoted by Com-LR. For the case of Eire,
the result reported in Figure 7.7 is

(10.3.11) Com-LR = 18.427035 ( Pval = 0.000018)

and shows that SDM fits this Blood Group data far better than SEM. This can largely be
explained by noting from (10.3.2) and (10.3.3) that the reduced form of the SDM model
is given by

(10.3.12) Y  B1 (  01n  X v  v  WX v )  B1

________________________________________________________________________
ESE 502 III.10-7 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

and thus contains the Rippled Pale term, B1 X v  v  ( B1 x ) 1 , which was shown to yield
a striking fit to this data. So a strong result is not surprising in this case.

Finally, it should be noted that while the above analysis has focused on row-normalized
matrices in order to interpret the “SLM version” of SEM as a Spatial Durbin model, this
restriction can in principle be relaxed. In particular, when W 1n  1n , it is possible to treat
the vector, W 1n , as representing the sample values of an additional “explanatory
variable” and thus modify (10.3.2) to

 0 
 
(10.3.13) X SDM  SDM  [1n X v W 1n WX v ]  v    01n  X v  v   0W 1n  WX v
 0 
 
 

With this addition, SEM can still be viewed formally as an instance of SLM. Moreover, if
the additional restriction,  0   0  0 , is added to yield a set of k  1 restrictions, then
this new likelihood ratio must now be distributed as  k21 under the null hypothesis of
SEM. So while the problematic nature of this artificial “explanatory variable”
complicates the interpretation of the resulting test, it can still be argued that the presence
of the spatial lag term, WY , suggests that SLM may yield a better fit to the given data
than SEM.

10.4 The Combined-Model Approach

A final method of comparing SEM and SLM is provided by the combined model (CM)
developed in Section 6.3.1 above, which for any given spatial weights matrix, W, can be
written as [see also expression (6.3.3) ]:

(10.4.1) Y  WY  X   u , u  Wu   ,  ~ N (0, 2 I n )

Here is clear that SEM is the special case with   0 , and SLM is the special case with
  0 . So these two models are seen to lie “between” OLS and the Combined Model, as
in Figure 10.2 below:

Combined Model

SEM SLM

OLS

Figure 10.2. Model Relations


________________________________________________________________________
ESE 502 III.10-8 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

In the same way that OLS served as a “lower” benchmark for comparing SEM and SLM,
the Combined Model can thus serve as an “upper” benchmark. Here the only issue is how
to estimate this more complex model. To do so, we start by observing from (6.3.4) that
the reduced form of this model can be written as:

(10.4.2) Y  X      ,   ~ N (0, 2V )

where

(10.4.3) X   ( I n  W )1 X

(10.4.4)    ( I n  W )1 ( I n  W )1

(10.4.5) V  ( I n  W )1 ( I n  W )1 ( I n  W )1 ( I n  W )1

So it should be clear that this model is simply another instance of GLS, where in this case
conditioning is on the pair of spatial dependence parameters,  and  . So for the
parameter vector,   (  , 2 ,  ,  ) , the corresponding likelihood function takes the form:

(10.4.6) L( | y )   n2 log(2 )  n2 log( 2 )  12 log | V |  21 2 ( y  X   )V1 ( y  X   )

and the corresponding conditional maximum-likelihood estimates for  and  2 given


 and  now take the respective forms:

(10.4.7) ˆ  ( X  V1 X  )1 X  V1 y

(10.4.8) ˆ 
2
 n1 ( y  X  ˆ )V1 ( y  X  ˆ )

By substituting (10.4.7) and (10.4.8) into (10.4.6), we may then obtain a concentrated
likelihood function for  and  , denoted by:

(10.4.9) Lc (  ,  | y )  L( ˆ ,ˆ 


2
, ,  | y)

Finally, by maximizing this two-dimensional function to obtain maximum-likelihood


estimates, ̂ and ̂ , we can substitute these into (10.4.7) and (10.4.8) to obtain the
corresponding maximum-likelihood estimates, ˆ and ˆ 2 . This estimation procedure

ˆˆ 
ˆˆ

is programed in the MATLAB program, sac.m, (Spatial Autocorrelation Combined)


written by James Lesage, and can be found in the class directory at:

>> sys502/Matlab/Lesage_7/spatial/sac_models

________________________________________________________________________
ESE 502 III.10-9 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

While the parameter estimates, ̂ and ̂ , obtained by this procedure often tend to be
collinear (in view of their common role in modifying the same weight matrix, W), the
corresponding maximum-likelihood value,

(10.4.10) LˆCM  L( ˆ


ˆˆ
,ˆ 
2
ˆˆ
, ˆ , ˆ | y )

continues to be well defined and numerically stable. This value can thus be used to test
the relative goodness of fit of the two restricted models, SEM and SLM. In particular, it
follows by the same arguments as above that under the SEM null hypothesis (   0 ) we
have

(10.4.11) LˆCM / SEM  2[ LˆCM  LˆSEM ] ~ 12

and similarly, that under the SLM null hypothesis (   0 ) we have

(10.4.11) LˆCM / SLM  2[ LˆCM  LˆSLM ] ~ 12

The results of these respective tests for the Eire case are as follows:

(10.4.12) LRCM / SEM  10.92 ( Pval  .0009)

(10.4.13) LRCM / SLM  2.49 ( Pval  .1145)

Thus the Combined Model is seen to yield a significantly better fit than SEM, but not
SLM. So relative to this CM benchmark, it can again be concluded that SLM yields a
better fit to the Eire data than does SEM.

________________________________________________________________________
ESE 502 III.10-10 Tony E.
Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

APPENDIX TO PART III


This Appendix, designated as A3, contains additional analytical results for Part III of the
NOTEBOOK, and follows the notational conventions in Appendices A1 and A2.

A3.1. The Geometry of Linear Transformations

The ultimate objective of this section of the appendix is to develop the Spectral
Decomposition Theorem for symmetric matrices, that illuminates many of the most
important properties of covariance matrices. But to gain an intuitive understanding of this
result, it is important to understand the geometry of linear transformations as represented
by matrices. A transformation, T , on  n is simply a mapping that assigns every vector,
x   n , to some other vector, T ( x)   n , called the image of x under T. A
transformation, T , is linear if and only if (iff ) it preserves vector addition, i.e., iff for
each pair of vectors, x, y   n , and scalars,  ,    ,

(A3.1.1) T ( x   y )   T ( x)   T ( y )

The intimate connection between matrices and linear transformations is seen most readily
in  2 . If we let e1  (1,0)   2 and e2  (0,1)   2 denote the so-called identity basis
vectors in  2 (shown in Figure A3.1 below)1,

x2e2 x
e2

e1 x1e1

Figure A3.1. Identity Basis Figure A3.2. Basis Representation

then by definition any vector, x  ( x1 , x2 )   2 can be represented as:

x  1  0
(A3.1.2) x   1   x1    x2    x1e1  x2e2
 x2  0 1

1
Note that we maintain the convention that all vectors are represented as column vectors, so that transpose
notation is used for all inline representations [as in expression (1.1.2) of Part II].

________________________________________________________________________
ESE 502 A3-1 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

This basis representation of x , shown in Figure A3.2, implies from (A3.1.1.) that the
image of x under any linear transformation, T, can be represented as

(A3.1.3) T ( x)  T ( x1e1  x2e2 )  x1T (e1 )  x2T (e2 )

So if we know where the identity basis vectors, (e1 , e2 ) , are sent by T, then we can
construct the entire transformation. In particular, if we now let

a  a 
(A3.1.4) T (e1 )  a1   11  , T (e2 )  a2   21 
 a12   a22 

then this transformation can be represented for all x  ( x1 , x2 )   2 by

a  a 
(A3.1.5) T ( x)  x1T (e1 )  x2T (e2 )  x1  11   x2  21 
 a12   a22 

a x a x  a a  x 
  11 1 21 2    11 12  1   A x
 a21 x1  a22 x2   a21 a22  x2 

where the matrix,

a a 
(A3.1.6) A  (a1 , a2 )   11 12 
 a21 a22 

is designated as the matrix representation of transformation T. This is the fundamental


relation between matrices and linear transformations. In fact, it is so fundamental that
linear transformations are usually defined by their matrix representations as in (A3.1.5).
So in the two-dimensional case, each linear transformation can be defined by its matrix
representation, A, for all x  ( x1 , x2 )   2 as in Figures A3.3 and A3.4 below:
Ax

x2 Ae2
Ae2
x2e2
e2 Ae1 x1 Ae1

e1 x2e2

Figure A3.3. Basis Image Vectors Figure A3.4. General Image Vectors
________________________________________________________________________
ESE 502 A3-2 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________

More generally, if the identity basis2 in  n is associated with the columns of the identity
matrix, I n  (e1 , e2 ,.., en ) , and if the images of these basis vectors under any linear
transformation, T, are denoted by T (ei )  ai  (ai1 ,.., ain )   n , i  1,.., n , then for all
x  ( x1 ,.., xn )   n , T again has the matrix representation

 ai1 
 
T ( x)   i1 xiT (ei )   i1 xi ai   i1 xi   
n n n
(A3.1.7)
a 
 in 

  n a1 j x j 
 j 1   a11  a1n  x1 
            A x
 n    
 
  j 1 anj x j   n1
a  ann  xn 
 

In the analysis to follow, we shall use the terms matrix and transformation
interchangeably. Note also that is this equivalence that motivates the basic multiplication
rules of matrix algebra. So the meaning of these rules is often best understood in this
way.

To examine some of the more important matrix properties, we begin by observing that
every matrix can be written in two equivalent ways. First there is a column representation
of A,

 a11  a1n    a11   a1n  


            (a ,.., a )
(A3.1.8) A            1 n
a  a    an1   ann  
 n1 n1  

where $a_j$ denotes the j-th column of A. There is also a row representation of A,

(A3.1.9)   $A = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix} = \begin{pmatrix} [\,a_{11} \cdots a_{1n}\,] \\ \vdots \\ [\,a_{n1} \cdots a_{nn}\,] \end{pmatrix} = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix}$

where $a_i$ denotes the i-th row of A.3 This in turn implies that matrix products, AB, can be
written in two ways:

2
A fuller discussion of vectors bases for linear spaces is given on page A3-16 below.
3
It is important to note, for example, that a1 in (A3.1.9) is not the transpose of a1 in (A3.1.10). To be
more precise here, one could use the “dot” notation, a j , for columns and ai  , for rows. However, we
choose not to add this notational complexity since the rows and columns of A will generally be clear in
context.


(A3.1.10)   $AB = (a_1, .., a_n) \begin{pmatrix} b_1 \\ \vdots \\ b_n \end{pmatrix} = \sum_{i=1}^n a_i b_i$

and

(A3.1.11)   $AB = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} (b_1, .., b_n) = \begin{pmatrix} a_1 b_1 & \cdots & a_1 b_n \\ \vdots & \ddots & \vdots \\ a_n b_1 & \cdots & a_n b_n \end{pmatrix}$

Both of these representations are very useful, and will be used throughout the analysis to
follow. As one immediate application, it is important to note that for every matrix,
$A = (a_{ij} : i, j = 1,..,n)$, the transpose matrix, $A' = (a_{ji} : i, j = 1,..,n)$, represents a linear
transformation closely related to that of A. In particular, the rows of A are the columns of
$A'$. So from a transformation viewpoint, $A'$ represents the "row space" of A. Moreover,
if for any matrices A and B we use the representations

(A3.1.12)   $A = \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} \;\Rightarrow\; A' = (a_1', .., a_n')\,, \qquad B = (b_1, .., b_n) \;\Rightarrow\; B' = \begin{pmatrix} b_1' \\ \vdots \\ b_n' \end{pmatrix}$

then (A3.1.11) together with the identity, $a\,b = b'a'$, imply that,

(A3.1.13)   $B'A' = \begin{pmatrix} b_1' \\ \vdots \\ b_n' \end{pmatrix} (a_1', .., a_n') = \begin{pmatrix} b_1'a_1' & \cdots & b_1'a_n' \\ \vdots & \ddots & \vdots \\ b_n'a_1' & \cdots & b_n'a_n' \end{pmatrix} = \begin{pmatrix} a_1 b_1 & \cdots & a_1 b_n \\ \vdots & \ddots & \vdots \\ a_n b_1 & \cdots & a_n b_n \end{pmatrix}' = (AB)'$

and hence that the transpose of a product, AB, is the product of their transposes in the
reverse order.
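As a quick numerical sanity check of (A3.1.13), one can verify in MATLAB (with arbitrary, hypothetical matrices) that the transpose of a product is indeed the product of the transposes in reverse order:

   A = randn(3); B = randn(3);          % two arbitrary 3 x 3 matrices
   disp(norm((A*B)' - B'*A'))           % essentially zero: (AB)' = B'A'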

A3.1.1 Nonsingular Transformations and Inverses

Perhaps the single most important feature of a linear transformation is whether or not it
has an "inverse". In particular, a linear transformation A is said to be nonsingular iff there
exists another linear transformation, $A^{-1}$, called the inverse of A, such that

(A3.1.14)   $A^{-1}A = I_n$

This inverse transformation can be equivalently defined by the requirement that for all
$x, y \in \mathbb{R}^n$,

(A3.1.15)   $A^{-1}y = x \;\Leftrightarrow\; Ax = y$


This version also shows that $AA^{-1} = I_n$. For if we let $X = (x_1,..,x_n)$ be defined by
$Ax_i = e_i$, $i = 1,..,n$, so that $AX = I_n$, then by (A3.1.15), $A^{-1}e_i = x_i$, $i = 1,..,n$, implies that
$A^{-1} = A^{-1}I_n = X$, and hence that $AA^{-1} = I_n$. Note also that since $A^{-1}$ is well defined as a
transformation (i.e., $A^{-1}y$ is uniquely defined), it must be true that A is a one-to-one
transformation, i.e., for all $x_1, x_2 \in \mathbb{R}^n$,

(A3.1.16)   $x_1 \neq x_2 \;\Rightarrow\; Ax_1 \neq Ax_2$

For if $Ax_1 = y = Ax_2$ then we would have $\{x_1, x_2\} \subseteq A^{-1}y$, so that $A^{-1}y$ is not uniquely
defined. As an additional consequence of (A3.1.14), note that for any pair of nonsingular
transformations, A and B, we must have

(A3.1.17)   $(B^{-1}A^{-1})AB = B^{-1}(A^{-1}A)B = B^{-1}I_n B = B^{-1}B = I_n$

Since the same argument shows that $AB(B^{-1}A^{-1}) = I_n$, it then follows from (A3.1.14) that
AB must also be nonsingular, and in particular, has a well defined inverse $(AB)^{-1}$ given
by

(A3.1.18)   $(AB)^{-1} = B^{-1}A^{-1}$

A similar argument shows that transposes, $A'$, of nonsingular matrices, A, must also be
nonsingular. To see this, observe that we may take transposes of the matrices in (A3.1.14)
and use (A3.1.13) to obtain

(A3.1.19)   $(AA^{-1})' = I_n' = (A^{-1}A)' \;\Rightarrow\; (A^{-1})'A' = I_n = A'(A^{-1})'$

So by again appealing to (A3.1.14), we see that $A'$ has a well-defined inverse, $(A')^{-1}$,
given by

(A3.1.20)   $(A')^{-1} = (A^{-1})'$

In other words, the inverse of $A'$ is the transpose of $A^{-1}$ (so that the operations of taking
transposes and inverses are said to commute).
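Both identities (A3.1.18) and (A3.1.20) are easily checked numerically; a minimal MATLAB sketch with hypothetical nonsingular matrices is:

   A = [2 1; 1 3];  B = [1 2; 0 1];      % two nonsingular matrices (illustrative values)
   disp(norm(inv(A*B) - inv(B)*inv(A)))  % (AB)^(-1) = B^(-1)*A^(-1), as in (A3.1.18)
   disp(norm(inv(A') - inv(A)'))         % (A')^(-1) = (A^(-1))', as in (A3.1.20)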

To examine some of the more geometric properties of nonsingular transformations,
observe that if for any set, $S \subseteq \mathbb{R}^n$, we let

(A3.1.21)   $A(S) = \{Ax : x \in S\}$


denote the image of S under transformation A, then nonsingular transformations A must
map $\mathbb{R}^n$ onto itself, i.e.,

(A3.1.22)   $A(\mathbb{R}^n) = \mathbb{R}^n$

Since $A(\mathbb{R}^n) \subseteq \mathbb{R}^n$ by definition, (A3.1.22) follows from the observation that for any
$x \in \mathbb{R}^n$, $A(A^{-1}x) = x \Rightarrow x \in A(\mathbb{R}^n)$, so that $\mathbb{R}^n \subseteq A(\mathbb{R}^n)$. In summary, every
nonsingular transformation is both one-to-one and onto as a mapping.

We next observe that for all transformations, A, the full image set, $A(\mathbb{R}^n)$, is of special
importance since it is always a linear subspace of $\mathbb{R}^n$, i.e., it is contained in $\mathbb{R}^n$ and is
closed under linear combinations [$x, y \in A(\mathbb{R}^n) \Rightarrow \alpha x + \beta y \in A(\mathbb{R}^n)$ for all scalars,
$\alpha, \beta$]. In particular, since $A(\mathbb{R}^n)$ contains all vectors that are expressible as linear
combinations of the columns of $A = (a_1,..,a_n)$, it is said to be spanned by these columns,
and is often written as:

(A3.1.23)   $\mathrm{span}(A) = A(\mathbb{R}^n) = \{Ax : x \in \mathbb{R}^n\} = \left\{ \sum_{i=1}^n x_i a_i : x = (x_1,..,x_n)' \in \mathbb{R}^n \right\}$
In these terms, we note that one final characterization of nonsingular transformations
(and perhaps the most basic characterization) is in terms of linearly independent vectors.
A set of vectors $\{z_1,..,z_k\} \subset \mathbb{R}^n$ is said to be linearly independent if and only if for all
scalars, $(\lambda_1,..,\lambda_k)$,4

(A3.1.24)   $\sum_{i=1}^k \lambda_i z_i = 0 \;\Rightarrow\; \lambda_i = 0\,,\; i = 1,..,k$

In these terms, a matrix, $A = (a_1,..,a_n)$, is nonsingular iff its columns $\{a_1,..,a_n\}$ are
linearly independent. So by replacing $z_i$ with $a_i$ and $\lambda_i$ with $x_i$ in this general
definition, we can write this nonsingularity condition for A in matrix form as follows. For
all $x \in \mathbb{R}^n$,

(A3.1.25)   $Ax = 0 \;\Rightarrow\; x = 0$

This characterization of nonsingularity is essentially equivalent to the uniqueness
condition in (A3.1.16) [since $Ax = 0$ for some $x \neq 0$ would imply that $Ax = 0 = A\,0$].

4 Note that for convenience we drop the subscript notation on the n-vector of zeros, $0_n = (0,..,0)'$, and write it simply as 0. The
dimension of 0 should always be clear in context. So in expression (A3.1.24) for example, the 0 on the left
is n-dimensional and the 0's on the right are scalars (one-dimensional).


These general properties of nonsingular transformations are well illustrated by the
transformation, $A = (a_1, a_2)$, in Figures A3.3 and A3.4 above. Here it is evident that every
vector in $\mathbb{R}^2$ is representable as a linear combination of $Ae_1 = a_1$ and $Ae_2 = a_2$, so that
$\mathrm{span}(A) = \mathbb{R}^2$. Similarly, the only linear combination of these columns which yields the zero vector is the pair
of zero scalars, $0 = (0,0)'$, so that (A3.1.25) holds. Hence, even without producing the
inverse transformation, $A^{-1}$, it should be clear that A is nonsingular.

In these notes, we shall deal almost exclusively with nonsingular transformations. But to
understand the full scope of the matrix decomposition theorems to follow, it is important
to consider all linear transformations on $\mathbb{R}^n$. In particular, those linear transformations,
A, for which no inverse exists are said to be singular transformations. In terms of
(A3.1.16) above, this means that there are distinct vectors, $x \neq y$, with $Ax = Ay$, so that
the transformation $A^{-1}$ is not well defined. In view of linearity, this in turn implies that
there is a nonzero vector, namely $x - y$, with $A(x - y) = Ax - Ay = 0$. This observation
shows that the characterizing property of singular transformations, A, is that there is a
nontrivial set of vectors mapped into zero by A. This set is designated as the null space
for A, written as

(A3.1.26)   $\mathrm{null}(A) = \{x \in \mathbb{R}^n : Ax = 0\}$

As the term "space" implies, $\mathrm{null}(A)$ is also a linear space, since

(A3.1.27)   $x, y \in \mathrm{null}(A) \;\Rightarrow\; Ax = 0 = Ay \;\Rightarrow\; A(\alpha x + \beta y) = \alpha Ax + \beta Ay = 0 \;\Rightarrow\; \alpha x + \beta y \in \mathrm{null}(A)$

For a nonsingular transformation this is trivially true, since $\mathrm{null}(A) = \{0\}$ by (A3.1.25).
But for singular transformations, $\mathrm{null}(A)$ is a proper linear space. In fact, the two linear
spaces, $\mathrm{span}(A)$ and $\mathrm{null}(A)$, completely characterize most of the geometric features of
every linear transformation. A simple example of a singular transformation, A, in $\mathbb{R}^2$ is
given by

(A3.1.28)   $A = \begin{pmatrix} 2 & 1 \\ 2 & 1 \end{pmatrix}$

where for the vector, $x = (1, -2)' \neq 0$, we see from (A3.1.28) that $Ax = 0$. Here $\mathrm{span}(A)$
and $\mathrm{null}(A)$ are shown in Figure A3.5 below.


[Figure graphics omitted]
Figure A3.5. Singular Transformation

The image vectors $Ae_1 = (2,2)'$ and $Ae_2 = (1,1)'$ are seen to be collinear, so that $\mathrm{span}(A)$
is reduced to a line, i.e., a one-dimensional subspace of $\mathbb{R}^2$. Similarly, the point x above
is also shown, and is seen to generate a one-dimensional subspace, $\mathrm{null}(A)$, which is
collapsed into 0 by A. [An example in $\mathbb{R}^3$ is given in Figure A3.16 below.] More
generally, the dimensions of these two subspaces always add to n. To be more precise, for
any linear subspace, $S \subseteq \mathbb{R}^n$, the dimension of S, denoted by $\dim(S)$, is the maximum
number of linearly independent vectors in S. So by (A3.1.23), the dimension of $\mathrm{span}(A)$
must be the maximum number of linearly independent columns $(a_1,..,a_n)$ of A. Moreover,
by (A3.1.26) the dimension of $\mathrm{null}(A)$ must be the maximum number of linearly
independent vectors mapped to zero by A. As seen in Figure A3.5,

(A3.1.29)   $\dim(\mathrm{span}(A)) + \dim(\mathrm{null}(A)) = n$

where in this case, n = 2. It turns out that this is always true. Since its validity will be
apparent from the Singular Value Decomposition Theorem below, we shall not offer a
proof of this "rank-nullity" theorem here.5

For our later purposes, it is important to note that the maximum number of linearly
independent columns of any matrix, A, is also called the rank of A, written as rank ( A) .
When matrices are not square, [as for example in the Linear Invariance Theorem for
multi-normal random vectors, stated both in expression (3.2.22) of Part II and in
expression (A3.2.121) below], then it is useful to distinguish between columns and rows
of matrix A by designating the column rank (row rank) of A to be the maximum number
of linearly independent columns (rows) of A. In these terms, matrix A is said to be of full
column rank (full row rank) iff all its columns (rows) are linearly independent, i.e., iff its
column rank (row rank) is equal to the number of columns (rows) of A. In terms of linear

5
For an elegant on line proof see http://en.wikipedia.org/wiki/Rank%E2%80%93nullity_theorem.


transformations, the row rank of A can also be viewed as the rank of the linear
transformation represented by $A'$.

With this general discussion of linear transformations, we now consider several specific
types of transformations that will play a central role in the decomposition theorems to
follow.

A3.1.2 Scale Transformations

While there are many different types of linear transformations, it turns out that from a
geometric view point there are essentially only two basic transformation types. The first,
and by far the simplest, are scale transformations that simply rescale the identity basis
vectors, as in Figures A3.6 and A3.7 below:

[Figure graphics omitted]
Figure A3.6. Positive Scalars     Figure A3.7. General Scalars

Figure A3.6 represents a positive scalar transformation in which all basis vectors are
scaled by positive multiples. In many cases, such transformations result from simply
changing the measurement units (dollars, meters, etc.) of the variables represented by
each axis. However, some scale transformations may involve negative multiples, as in
Figure A3.7. The matrix representations, A1 and A2 , of these respective transformations
are given by the diagonal matrices (with zeros omitted for visual clarity),

2  2 
(A3.1.30) A1    , A2   
 3  3 

More generally, every diagonal matrix,

(A3.1.31)   $A = \mathrm{diag}(a_{11},..,a_{nn}) = \begin{pmatrix} a_{11} & & \\ & \ddots & \\ & & a_{nn} \end{pmatrix}$


is the representation of a scale transformation on  n . A key feature of these simple


matrices is that multiplication of diagonal matrices is simply multiplication of their
corresponding diagonal elements. In (A3.1.30) for example,

4 
(A3.1.32) A1 A2     A2 A1
 9 

So like real numbers themselves, multiplication of diagonal matrices is commutative, i.e.,
in a sequence of successive scale transformations, the ordering of these transformations
makes no difference. One other key feature is that matrix inversion can be done by
inspection, since it is evident from (A3.1.14) that the inverse of A in (A3.1.31) must be:

(A3.1.32)   $A^{-1} = \begin{pmatrix} 1/a_{11} & & \\ & \ddots & \\ & & 1/a_{nn} \end{pmatrix} = \begin{pmatrix} a_{11}^{-1} & & \\ & \ddots & \\ & & a_{nn}^{-1} \end{pmatrix}$

In other words, undoing a scale transformation amounts to scaling by its reciprocals.
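A short MATLAB illustration (with purely hypothetical diagonal entries) of these two features of scale transformations:

   A1 = diag([2 3]);  A2 = diag([-2 4]);   % two scale transformations (illustrative values)
   disp(A1*A2 - A2*A1)                     % zero matrix: diagonal matrices commute
   disp(inv(A1))                           % diag([1/2 1/3]): inversion by reciprocal scalings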

A3.1.3 Orthonormal Transformations

The second important class of linear transformations is far richer, and in fact, is given
many different names, including isometric transformations, orthonormal transformations
and rigid motions. From a geometric viewpoint the term “isometric” is perhaps most
appropriate, since these transformations preserve both distances and angles (as we shall
see below). But from a matrix viewpoint, the term “orthonormal” is most useful since it
relates more directly to the corresponding matrix representations, U  (u1 ,.., un ) , of such
transformations. In particular, if both distances and angles are preserved, then since the
vectors in the identity basis, I n  (e1 ,.., en ) , are mutually orthogonal and of unit length, it
follows that their images

(A3.1.33)   $U(e_1,..,e_n) = (Ue_1,..,Ue_n) = (u_1,..,u_n)$

under U must necessarily have the same properties. More precisely, [recalling property
(A2.4.4) in Appendix A2] it must be true that

(A3.1.34)   $u_i'u_i = \|u_i\|^2 = 1\,, \quad i = 1,..,n\,,$ and

(A3.1.35)   $u_i'u_j = 0\,, \quad i \neq j = 1,..,n$

These defining conditions for orthonormality can be written in equivalent matrix form as


(A3.1.36)   $U'U = \begin{pmatrix} u_1' \\ \vdots \\ u_n' \end{pmatrix} (u_1,..,u_n) = \begin{pmatrix} u_1'u_1 & \cdots & u_1'u_n \\ \vdots & \ddots & \vdots \\ u_n'u_1 & \cdots & u_n'u_n \end{pmatrix} = \begin{pmatrix} 1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & 1 \end{pmatrix} = I_n$

Note also from (A3.1.25) that this condition implies that U must be nonsingular, since

(A3.1.37)   $Ux = 0 \;\Rightarrow\; U'Ux = U'0 = 0 \;\Rightarrow\; I_n x = 0 \;\Rightarrow\; x = 0$

Finally, since this in turn implies that

(A3.1.38)   $U' = U'(UU^{-1}) = (U'U)U^{-1} = U^{-1}$

we see that the inverse of U is simply its transpose. This is an equivalent form of the
defining condition in (A3.1.36), though the geometric argument above is far more
intuitive. All geometric and algebraic properties of such transformations are in turn
readily established from these equivalent conditions. The most immediate result is that all
inner products must be preserved, since for any vectors, $x, y \in \mathbb{R}^n$,

(A3.1.39)   $(Ux)'(Uy) = x'(U'U)y = x'y$

This in turn implies that all distances (lengths) are preserved, since

(A3.1.40)   $\|Ux\|^2 = (Ux)'(Ux) = x'x = \|x\|^2 \;\Rightarrow\; \|Ux\| = \|x\|$

Finally, if $\theta$ denotes the angle between any pair of vectors, x and y, as in Figure A3.8
below,

[Figure graphics omitted]
Figure A3.8. Vector Angles

then since the Law of Cosines asserts that

(A3.1.41)   $\cos(\theta) = \dfrac{\|x\|^2 + \|y\|^2 - \|y - x\|^2}{2\,\|x\|\,\|y\|}$

it follows at once from (A3.1.40) that U must also preserve angles. In other words, all
geometric figures are mapped into congruent copies by U.
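A minimal MATLAB check of these preservation properties, using a (hypothetical) planar rotation as the orthonormal matrix U:

   t = pi/6;                              % an arbitrary rotation angle
   U = [cos(t) -sin(t); sin(t) cos(t)];   % planar rotation, hence orthonormal
   disp(norm(U'*U - eye(2)))              % defining condition (A3.1.36): U'U = I
   x = [3; -1];  y = [1; 2];
   disp(abs(norm(U*x) - norm(x)))         % lengths preserved, as in (A3.1.40)
   disp(abs((U*x)'*(U*y) - x'*y))         % inner products (hence angles) preserved, as in (A3.1.39)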


Another natural consequence of the defining condition, $U'U = I_n$, is that compositions
(products) of orthonormal transformations, $U_1U_2$, must also be orthonormal, since

(A3.1.42)   $(U_1U_2)'(U_1U_2) = U_2'(U_1'U_1)U_2 = U_2'(I_n)U_2 = U_2'U_2 = I_n$

This same argument obviously holds for any finite product, $U_1U_2 \cdots U_n$.

Rotations and Reflections

Such orthonormal transformations can be further classified into rotations and reflections,
as illustrated in  2 by Figures A3.9 and A3.10 below:

Ue2 e2 e2
Ue1
 e1
e1

Ue2 Ue1

Figure A3.9. Rotation Figure A3.10. Reflection

In Figure A3.9, transformation U defines a counterclockwise rotation of the plane
through an angle $\theta$, and in Figure A3.10, transformation U reflects the plane about the
dashed line shown, so that images of all points above this line are their reflections below
the line, and vice versa. Clearly both distances and angles are preserved in both cases. But
one important difference is that clockwise orderings (called "orientations") are different.
In particular, planar rotations are seen to preserve orientation, while reflections do not.

Another key difference from a practical viewpoint relates to the extendibility of these
concepts to higher dimensions. In particular, while rotations are easily defined with
respect to angles in $\mathbb{R}^2$, the extension of this definition to $\mathbb{R}^n$ is highly complex (to say
the least). However, the extension of reflections is completely straightforward. For the
case of $\mathbb{R}^3$, the reflection line in Figure A3.10 is simply replaced by a reflection plane
through the origin. For example, the transformation, $U = [e_1, e_2, -e_3]$, is seen to reflect all
points in $\mathbb{R}^3$ about the $(e_1, e_2)$ plane. More generally, every reflection in $\mathbb{R}^n$ is uniquely
defined by an $(n-1)$-dimensional reflection hyperplane through the origin. In addition,
such reflections can be given a unified matrix representation, as we now show.


Householder Reflections

Observe that each $(n-1)$-dimensional hyperplane is in fact the orthogonal complement
of a single vector in $\mathbb{R}^n$. In the illustration above, the $(e_1, e_2)$ plane can be characterized
as the orthogonal complement of $e_3$. More generally, if for any vector, $v \in \mathbb{R}^n - \{0\}$, we
let

(A3.1.43)   $v^{\perp} = \{x \in \mathbb{R}^n : x'v = 0\}$

denote the orthogonal complement of v, then the reflection about this hyperplane through
the origin is representable by the Householder matrix,

(A3.1.44)   $H_v = I_n - \tfrac{2}{v'v}\, v v'$

To see this, note first that

(A3.1.45)   $H_v v = \left( I_n - \tfrac{2}{v'v}\, v v' \right) v = v - \tfrac{2}{v'v}\, v (v'v) = v - 2v = -v$

so that the image of v is precisely its reflection (through the origin) about $v^{\perp}$. Moreover,
for any $x \in v^{\perp}$ it also follows that

(A3.1.46)   $H_v x = \left( I_n - \tfrac{2}{v'v}\, v v' \right) x = x - \tfrac{2}{v'v}\, v (v'x) = x - 0 = x$

But since $H_v$ is completely defined by this set of images, it then follows that $H_v$ must be
the unique reflection in $\mathbb{R}^n$ about $v^{\perp}$. This is shown graphically by the $\mathbb{R}^2$ example in
Figure A3.11 below:

[Figure graphics omitted]
Figure A3.11. Householder Reflection


Finally, since every reflection has such a representation, it follows that all reflections are
representable by Householder matrices, as in (A3.1.44). So all reflections are easily
computable in $\mathbb{R}^n$.
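A minimal MATLAB sketch of a Householder reflection (A3.1.44), with a hypothetical generating vector v, confirming properties (A3.1.45) and (A3.1.46):

   v = [1; 2; 2];                    % hypothetical generating vector in R^3
   H = eye(3) - 2*(v*v')/(v'*v);     % Householder matrix H_v of (A3.1.44)
   disp(H*v)                         % = -v : v is reflected through the origin
   x = [2; -1; 0];                   % x'*v = 0, so x lies in the hyperplane v-perp
   disp(H*x - x)                     % = 0  : points of v-perp are left fixed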

From a geometric viewpoint, the importance of this fact is that all orthogonal
transformations on $\mathbb{R}^n$ are constructible as compositions of (at most n) reflections.
Alternatively phrased, every n-square orthonormal matrix is the product of at most n
Householder matrices. Since this fact will not actually be used in our subsequent
analyses, we will not prove it here (see footnote 6 below). Rather we simply illustrate this
general result by showing how all rotations in $\mathbb{R}^2$ (such as in Figure A3.9) are equivalent
to (at most) a pair of reflections in $\mathbb{R}^2$. For any given angle, $\theta$, let the corresponding
(counterclockwise) rotation be denoted by $R_\theta$, as in Figure A3.12.

R e2 e2
R e1

e1

Figure A3.12. Angular Rotation

This is clearly not a reflection, and moreover cannot be equivalent to any single
reflection, since this would necessarily reverse the clockwise order of the basis vectors,
$e_1$ and $e_2$, as mentioned above. But it can be represented as a composition of two
reflections as follows. Choose the first (Householder) reflection, $H_1 = H_{v_1}$, by setting
$v_1 = R_\theta e_1 - e_1$, and observe that by construction it reflects $R_\theta e_1$ back into $e_1$, as shown in
Figure A3.13 below:

[Figure graphics omitted]
Figure A3.13. First Reflection


Notice also that since every reflection is an orthonormal transformation, the image of
$R_\theta e_2$ under $H_1$ must continue to be orthogonal to that of $R_\theta e_1$. But in two dimensions,
there are only two possibilities (with unit length), namely $e_2$ and $-e_2$. In this case,
$H_1 R_\theta e_2 = -e_2$, as shown in the figure. Finally, this configuration is easily reflected back
into $(e_1, e_2)$ by simply choosing $H_2 = H_{v_2}$ with generating vector, $v_2 = e_2 - (-e_2) = 2e_2$,
so that the orthogonal complement, $v_2^{\perp}$, in this case is simply the horizontal axis, as
shown in Figure A3.14 below.

[Figure graphics omitted]
Figure A3.14. Second Reflection

Finally, since all Householder matrices in (A3.1.44) are seen to be symmetric, we may
then conclude that:

(A3.1.47)   $H_2 H_1 R_\theta [e_1, e_2] = [e_1, e_2] \;\Rightarrow\; H_2 H_1 R_\theta = I_2 \;\Rightarrow\; H_1 R_\theta = H_2 \;\Rightarrow\; R_\theta = H_1 H_2$

Hence, each such rotation is seen to be equivalent to this particular pair of reflections.6 So
in this sense, Householder reflections can be regarded as the fundamental "generator" of
all orthonormal transformations.

6 The proof of the general representation of orthonormal matrices by products of Householder matrices is
surprisingly difficult to find in standard references. But one can easily show this by extending the standard
Householder construction of QR decompositions (see for example the nice discussion by Tom Lyche
available on line at http://heim.ifi.uio.no/~tom/ortrans.pdf ), which shows in particular that every
orthonormal matrix, U, can be represented as $U = H_1 H_2 \cdots H_n T$, for some choice of Householder matrices,
$H_1, H_2, .., H_n$, together with an upper triangular matrix, T. But by successive multiplications of this
expression by $H_i$, $i = 1,..,n$, together with (A3.1.36), we obtain $T = H_n \cdots H_2 H_1 U$, which implies from
(A3.1.20) that T must also be orthonormal. Finally, since a simple inductive argument can be used to show
that the only orthonormal triangular matrix is the identity matrix, $I_n$, it then follows that $U = H_1 H_2 \cdots H_n$.
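The two-reflection construction above is also easily checked numerically. The following MATLAB sketch (for one hypothetical angle) builds $H_1$ and $H_2$ exactly as in Figures A3.13 and A3.14 and confirms (A3.1.47):

   t = pi/3;                                            % an arbitrary angle
   R = [cos(t) -sin(t); sin(t) cos(t)];                 % rotation R_theta
   e1 = [1; 0];  e2 = [0; 1];
   v1 = R*e1 - e1;  H1 = eye(2) - 2*(v1*v1')/(v1'*v1);  % reflects R*e1 back into e1
   v2 = 2*e2;       H2 = eye(2) - 2*(v2*v2')/(v2'*v2);  % reflects about the horizontal axis
   disp(norm(H2*H1*R - eye(2)))                         % essentially zero
   disp(norm(R - H1*H2))                                % hence R_theta = H1*H2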


Orthonormal Bases and Extensions

One final aspect of orthonormality is important to consider. Recall that we have often
referred to the (orthogonal) columns of the identity matrix, $I_n = (e_1,..,e_n)$, as the identity
basis for $\mathbb{R}^n$. So before proceeding to the Singular Value Decomposition Theorem, it is
appropriate to formalize the more general concept of orthonormal bases. First we extend
the notion of $\mathrm{span}(A)$ in expression (A3.1.23) to any set of vectors, $z_1,..,z_k \in \mathbb{R}^n$, as
follows:

(A3.1.48)   $\mathrm{span}(z_1,..,z_k) = \left\{ \sum_{i=1}^k \lambda_i z_i : \lambda_1,..,\lambda_k \in \mathbb{R} \right\}$
Hence a vector, x   n , lies in span( z1 ,.., zk ) iff x can be expressed as a linear
combination of ( z1 ,.., zk ) , i.e., iff x   ik1 i zi for some scalars, 1 ,.., k . Next, recalling
the definition of linear independence in expression (A3.1.24) above, we now say that a
set of linearly independent vectors, z1 ,.., zk , forms a basis for a given linear subspace, L ,
of  n iff

(A3.1.49) span( z1 ,.., zk )  L

The special feature of linear independence is that for each, x  L , the  - coefficients in
the representation, x   ik1 i zi , must be unique.7 So in geometric terms, these
coefficients (1 ,.., k ) yield a natural coordinate system for L. Notice also that if
( z1 ,.., zk ) is a basis for L, then no larger set ( z1 ,.., zk , zk 1 ) can be a basis since zk 1  L
implies that zk 1 must already be a linear combination of ( z1 ,.., zk ) , which would violate
linear independence. So the size, k, of each basis is a unique characteristic of L,
designated as the dimension of L, and often written as dim( L) .

The single most important example of these concepts is of course the identity basis
( e1 ,.., en ) for n itself. But this basis has the important additional feature that its
component vectors form an orthonormal set, i.e., they are each of unit length and are
mutually orthogonal [as we have already seen for the columns of orthonormal matrices in
(A3.1.34) and (A3.1.35) above]. Any basis with these properties is called an orthonormal
basis. The key feature of such bases is that coordinates of any vector, x  span( z1 ,.., zk ) ,
are immediately constructible as inner products with the basis vectors, i.e., for each
i  1,.., k ,

7 To see this, note simply that if $\sum_{i=1}^k \lambda_i z_i = x = \sum_{i=1}^k \mu_i z_i$ then $\sum_{i=1}^k (\lambda_i - \mu_i) z_i = 0$, so that by linear
independence, $\lambda_i = \mu_i$, $i = 1,..,k$.


zix  zi j 1 j z j    j ziz j  (1) i   j i (0)   i


k k
(A3.1.50) j 1

This is why orthonormal bases provide such useful representations of linear spaces. So it
is important to ask how such bases can be constructed.

In particular, for any given set of vectors, z1 ,.., zk   n , we next consider how to
construct an orthonormal basis for span( z1 ,.., zk ) . There is a remarkably simple
procedure for doing so, known as the Gram-Schmidt orthogonalization procedure.
Because the geometry of this procedure is of such fundamental importance, we begin by
considering orthogonal projections. Given two vectors, x, y   n (as illustrated for n  2
in Figure A3.15 below), one may ask what vector in the span of y is “closest” to x, or
equivalently, “best approximates” x ?

x  yx


span( y ) ● y yx

Figure A3.15. Simple Orthogonal Projection

If one were to imagine drawing circles around x, denoting points of equal distance from
x, then the smallest circle touching the line, $\mathrm{span}(y) = \{\lambda y : \lambda \in \mathbb{R}\}$, would be just
tangent to this line, and would identify the desired closest point, $y_x$ (shown in red in the
figure). Formally, this amounts to finding the $\lambda$ which minimizes the distance,
$\|x - \lambda y\|$, from x. But since minimizing distance is equivalent to minimizing squared
distance, it follows that if we now write $f(\lambda) = \|x - \lambda y\|^2$, as a function of $\lambda$, then we
can identify this point by solving the "least squares" minimization problem:

(A3.1.51)   $\min_\lambda \; f(\lambda) = \|x - \lambda y\|^2 = (x - \lambda y)'(x - \lambda y) = x'x - 2\lambda\, x'y + \lambda^2 y'y$

Since the last equality is just a quadratic function in $\lambda$, the desired "tangency" is given
precisely by the first-order condition:

(A3.1.52)   $0 = \tfrac{d}{d\lambda} f(\lambda) = -2\, x'y + 2\lambda\, y'y \;\Rightarrow\; \lambda\, y'y = x'y \;\Rightarrow\; \lambda = \dfrac{x'y}{y'y}$

So the vector closest to x in span( y ) is given by


 xy 
(A3.1.53) yx   y
 y y 

and is designated as the orthogonal projection of x on y. The term “orthogonal” is of


most importance for our present purposes, and is motivated by the fact that the difference
vector, x  y x (shown by the red dashed line in the figure) is necessarily orthogonal to y,
as can be seen by taking inner products:

 xy 
(A3.1.54) ( x  y x ) y  xy  y x y  xy    y y  xy  xy  0
 y y 

So if one starts with two vectors, ( x, y ) , and wishes to construct an orthonormal basis for
span(x, y), then this projection procedure yields a natural choice. In particular, since
$x - y_x = x - (x'y / y'y)\,y$ is automatically a linear combination of (x, y), it follows that
$(y,\, x - y_x)$ yields a pair of orthogonal vectors in span(x, y). Hence, by normalizing
these, we have found an orthonormal basis for span( x, y ) .

This argument implicitly assumes that x and y are linearly independent, so that the basis
will consist of two orthonormal vectors. But notice also that if x and y were linearly
dependent, so that x was already in span( y ) [i.e., x   y  0 for some  ], then the
solution in (A3.1.53) would automatically yield y x  x so that ( x  y x ) is simply the
zero vector. In other words, this procedure would identify this linear dependence, and tell
us that by normalizing only y we would obtain a natural orthonormal basis for
span( x, y )  span( y ) .

This two-vector example defines the simplest possible instance of the Gram-Schmidt
procedure. So all that remains to be done is to show how this procedure can be
extended to larger sets of vectors. This extension is extremely simple, and only uses the
two-vector procedure detailed above. To see this, let us proceed to a three-vector case.
Suppose we are given linearly independent vectors, $z_1, z_2, z_3 \in \mathbb{R}^n$ ($n \geq 3$), and wish to
construct an orthonormal basis (u1 , u2 , u3 ) for span( z1 , z2 , z3 ) . To do so, we first construct
an orthogonal basis (b1 , b2 , b3 ) as follows:

Step 1. Start by setting

(A3.1.55) b1  z1 .

Step 2. Project z2 on b1 and construct the difference vector,


z2b1  z z 
(A3.1.56) b2  z2  b1   z2  2 1 z1 
b1b1  z1 z1 

As in the example above, (b1 , b2 ) , are now orthogonal, and are both in span( z1 , z2 , z3 ) .

Step 3. Finally, project $z_3$ on $b_1$ and $b_2$ individually and take the vector difference:

(A3.1.57)   $b_3 = z_3 - \left( \dfrac{z_3'b_1}{b_1'b_1} \right) b_1 - \left( \dfrac{z_3'b_2}{b_2'b_2} \right) b_2$

Then by construction, $(b_1, b_2, b_3) \subset \mathrm{span}(z_1, z_2, z_3)$. Moreover, since $b_1$ and $b_2$ are already
orthogonal, it follows that $b_3$ must necessarily be orthogonal to both $b_1$ and $b_2$. To see
this, note simply that for $b_1$ we have

(A3.1.58)   $b_1'b_3 = b_1'z_3 - \left( \dfrac{z_3'b_1}{b_1'b_1} \right) b_1'b_1 - \left( \dfrac{z_3'b_2}{b_2'b_2} \right) b_1'b_2 = b_1'z_3 - (z_3'b_1) - \left( \dfrac{z_3'b_2}{b_2'b_2} \right)(0) = b_1'z_3 - b_1'z_3 = 0\,,$

and similarly for $b_2$. Given this orthogonal basis, it then follows by setting

(A3.1.59)   $u_i = b_i / \|b_i\|\,, \quad i = 1, 2, 3$

that we must obtain an orthonormal basis $(u_1, u_2, u_3)$ for $\mathrm{span}(z_1, z_2, z_3)$. Again, if
$(z_1, z_2, z_3)$ are not linearly independent, then we need only normalize the nonzero vectors
obtained. This will not only provide an orthonormal basis for $\mathrm{span}(z_1, z_2, z_3)$, but will
also indicate the dimension of this linear space.

The generalization of this stepwise procedure follows by simple induction. In particular,
to obtain an orthogonal basis for $\mathrm{span}(z_1,..,z_k)$, suppose we have already obtained an
orthogonal set $(b_1, b_2,..,b_m)$ in $\mathrm{span}(z_1,..,z_k)$ with $3 \leq m < k$. To extend this orthogonal
set, let

(A3.1.60)   $b_{m+1} = z_{m+1} - \sum_{i=1}^m \left( \dfrac{z_{m+1}'b_i}{b_i'b_i} \right) b_i \;\in\; \mathrm{span}(z_1,..,z_k)$


Then the argument in (A3.1.58) again shows that bm1 is orthogonal to each bi , i  1,.., m .
So by induction, we thus obtain an orthogonal basis (b1 ,.., bk ) for span( z1 ,.., zk ) . This in
turn yields an orthonormal basis (u1 ,.., uk ) by normalizing all nonzero vectors in
(b1 ,.., bk ) as in (A3.1.59). Moreover, the number of such vectors will again identify the
dimension of span( z1 ,.., zk ) .

One final possibility is of interest. Suppose that we are given an orthogonal basis,
(b1 ,.., bk ) for some span( z1 ,.., zk ) with k  n , and wish to extend this to an orthogonal
basis for all of  n . This is again quite simple, since we already have a basis for  n ,
namely the identity basis, ( e1 ,.., en ) . So to extend (b1 ,.., bk ) to a larger orthogonal basis,
(b1 ,.., bk , bk 1 ) we may proceed by setting m  k in (A3.1.60) and then successively letting
zk 1  ei for each i  1,.., n until a nonzero difference vector, bk 1 , is found. There must
be one, since not all ei can lie in the lower dimensional space, span( z1 ,.., zk ) . Once bk 1
is found, the procedure can be repeated by setting m  k  1 in (A3.1.60) and continuing
down the list of identity basis vectors, ei , until a new nonzero difference vector, bk 2 , is
found. Again by induction, this procedure must result in a full set of basis vectors,
(b1 ,.., bn ) , which yield the desired extension. These can in turn be normalized as in
(A3.1.59) to obtain an orthonormal basis, (u1 ,.., un ) , for  n . Finally, if the original basis
is already orthonormal, say (u1 ,.., uk ) , then this procedure is designated as an
orthonormal extension of (u1 ,.., uk ) to all of  n .
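The Gram-Schmidt steps above translate directly into MATLAB. A minimal sketch for three hypothetical (linearly independent) vectors in R^3:

   z1 = [1; 1; 0];  z2 = [1; 0; 1];  z3 = [0; 1; 2];        % illustrative vectors
   b1 = z1;                                                 % step 1, as in (A3.1.55)
   b2 = z2 - (z2'*b1)/(b1'*b1)*b1;                          % step 2, as in (A3.1.56)
   b3 = z3 - (z3'*b1)/(b1'*b1)*b1 - (z3'*b2)/(b2'*b2)*b2;   % step 3, as in (A3.1.57)
   U = [b1/norm(b1), b2/norm(b2), b3/norm(b3)];             % normalize, as in (A3.1.59)
   disp(norm(U'*U - eye(3)))                                % essentially zero: orthonormal basis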


A3.2 Singular Value Decomposition Theorem

While there are of course many special types of matrices that are of analytical interest [as
for example the triangular Cholesky decompositions of symmetric matrices in (A2.7.44)
of Appendix A2], our focus above on diagonal matrices and orthonormal matrices was
for a reason. In the same way that orthonormal matrices have a simple decomposition
into reflections, it turns out that every n-square matrix, A, is decomposable into a simple
product of orthonormal and diagonal matrices as follows:

(A3.2.1)   $A = U\,S\,V'$

where U and V are orthonormal and where $S = \mathrm{diag}(s_1,..,s_n)$ is a nonnegative diagonal
matrix with diagonal entries, $s_i$, called the singular values of matrix A. In geometric
terms, every linear transformation is constructible as a composition of a nonnegative
scale transformation together with two orthonormal transformations. This fundamental
result, known as the Singular Value Decomposition (SVD) Theorem, holds for all
matrices (even rectangular matrices). At this level of generality, it has been designated
by Gilbert Strang (1993,2009) as the Fundamental Theorem of Linear Algebra.

The main objective of the present section is to establish this theorem. By way of
motivation, recall from the beginning of these notes that our ultimate objective is to
establish the Spectral Decomposition (SPD) Theorem for symmetric matrices, which
asserts that every symmetric matrix, A, can be represented in terms of a single
orthonormal matrix, W, and diagonal matrix, $\Lambda = \mathrm{diag}(\lambda_1,..,\lambda_n)$, as

(A3.2.2)   $A = W \Lambda W'$

where the diagonal entries, $\lambda_i$, are called the eigenvalues of A (see Section A3.3 below).
So except for the nonnegativity of S in (A3.2.1), it would appear that this important
result is simply a special case of the SVD Theorem with W  U  V . As we shall see
below, this intuition is correct in many important cases. Moreover, it is essentially correct
in all cases in the sense that an SPD can always be constructed from any given SVD. It is
this relationship that provides the main motivation for our consideration of this more
general result. But as emphasized by Strang’s renaming of this result, anyone interested
in understanding linear transformations should try to gain some understanding of
(A3.2.1) in its own right.

While proofs of the SVD Theorem can be found in most standard texts on matrix algebra,
the most common approach is to start with the SPD Theorem and then apply this result to
the partitioned symmetric matrix,

(A3.2.3)   $M_A = \begin{pmatrix} & A \\ A' & \end{pmatrix}$


in order to establish the SVD Theorem. But this “trick” offers little insight into the
geometric origins of either result. So the specific objectives of this section are to illustrate
these origins with an easily visualized geometric example in  2 , and then use these
insights to motivate a constructive proof of the SVD Theorem.

To develop our geometric argument, we require one further characterization of


orthonormal transformations, V. Recall that all such transformations preserve distances.
Conversely, to guarantee that V is orthonormal, it is enough to require that all unit
distances be preserved by V, i.e., that for all x   n ,

(A3.2.4)   $\|x\| = 1 \;\Rightarrow\; \|Vx\| = 1$

To see this, note first that since any vector, $x \in \mathbb{R}^n$, can be transformed to have unit
length by the rescaling, $x \mapsto \tfrac{1}{\|x\|}\,x$, it follows from (A3.2.4) that all distances must be
preserved, since

(A3.2.5)   $\left\| \tfrac{1}{\|x\|}\,x \right\| = \tfrac{1}{\|x\|}\,\|x\| = 1 \;\Rightarrow\; 1 = \left\| V\!\left( \tfrac{1}{\|x\|}\,x \right) \right\| = \tfrac{1}{\|x\|}\,\|Vx\| \;\Rightarrow\; \|Vx\| = \|x\|$

Moreover, by observing from the identity

(A3.2.6)   $\|x - y\|^2 = (x - y)'(x - y) = x'x - 2x'y + y'y = \|x\|^2 - 2x'y + \|y\|^2 \;\Rightarrow\; x'y = \tfrac{1}{2}\left( \|x\|^2 + \|y\|^2 - \|x - y\|^2 \right)$

that inner products are entirely expressible in terms of distances, it then follows from
(A3.2.5) that all inner products must be preserved as well. Hence the defining conditions
for orthonormality in (A3.1.34) and (A3.1.35) must hold, and V is orthonormal.

Given this alternative characterization, we next observe that the product of matrices on
the right hand side of (A3.2.1) can be directly interpreted geometrically as an
orthonormal transformation, $V'$, followed by a rescaling, S, followed by a second
orthonormal transformation, U.8 But while this composite transformation is of course
linear, the key question remains as to why every linear transformation, A, can be so
represented. Assuming that A is nonsingular (so that its inverse exists), a more
informative geometric approach is to start with transformation, A, and see how to "undo
it" (i.e., invert it back to the identity) through a series of simple transformations. For the
two dimensional case, this process can be illustrated by the four panels shown in Figure
A3.16 below.

8
This is illustrated for example in Figure 6.8 of Strang (2009, p.366).


Starting from the upper left panel, suppose that a given transformation, A, maps the basis
vectors (e1 , e2 ) in  2 as shown in the upper right panel. In geometric terms, the key here
is to consider not only how these basis vectors are transformed, but also how the entire
unit circle (shown in blue) is transformed. In  2 the image of this circle is always some
ellipse, as shown (in blue) in the upper right panel. Since the unit circle consists of all
vectors of unit length, we see that some of these vectors will typically be “stretched”
more than others by transformation A. In particular, since the major axis and minor axis
of this ellipse (shown as thin blue lines) denote the directions of maximum and minimum
distances from the origin, it follows that the vector on the unit circle which is “maximally
stretched” by A must be the vector (not shown) that is mapped into the major axis of this
ellipse. Similarly, the vector that is “minimally stretched” is mapped into the minor axis.

[Figure graphics omitted]
Figure A3.16. Geometry of SVD

So to remove all stretch effects, the simplest procedure is to rotate these (orthogonal)
axes into the coordinate axes, and then rescale them back to unit lengths. The appropriate
rotation is shown in the lower right panel, and is represented by an orthonormal matrix,


U  . The rescaling back to unit lengths is then shown in the lower left panel, and is
represented by a positive diagonal matrix, S 1 . Notice also that by scaling the maximum
and minimum lengths to unity, all intermediate lengths must also be scaled to unity.9 So
the ellipse again becomes a unit circle. What this implies is that the transformation
represented by the product, S 1 U  A , has actually mapped the unit circle back into itself.
So if we now denote this product matrix by

(A3.2.7)   $V' = S^{-1} U' A$

then it follows that $V'$ must satisfy (A3.2.4), and hence must be orthonormal. In
particular, the images of $e_1$ and $e_2$ under this transformation (namely the two vectors,
$S^{-1}U'Ae_1$ and $S^{-1}U'Ae_2$, shown in the lower left panel of Figure A3.16) must be
orthogonal. So by construction we may use (A3.1.38) to conclude that

(A3.2.8)   $S^{-1}U'A = V' \;\Rightarrow\; U'A = SV' \;\Rightarrow\; A = USV'$

and thus that A is representable as in (A3.2.1) [where in this nonsingular case, S must be
a positive diagonal matrix].

While this argument is quite transparent in  2 , it is more complex in higher dimensions.


In particular, if the unit circle is now replaced by the unit sphere in $\mathbb{R}^n$,

(A3.2.9)   $\mathbb{S}_n = \{x \in \mathbb{R}^n : \|x\| = 1\}$

then one can in principle construct similar arguments for the ellipsoidal images,

(A3.2.10)   $A(\mathbb{S}_n) = \{Ax \in \mathbb{R}^n : x \in \mathbb{S}_n\}$

of $\mathbb{S}_n$ under linear transformations, A. The basic ideas can be illustrated for $\mathbb{R}^3$ as shown
in Figure A3.17 below.

9 To show this formally, observe first that the equation of the ellipse in the lower right panel must be of the
form $a_1 x_1^2 + a_2 x_2^2 = c$ for some positive constants, $a_1, a_2, c$. So for the principal axes of this ellipse, say
$(x_{01}, 0)$ and $(0, x_{02})$, it must be true that $a_1 x_{01}^2 = c = a_2 x_{02}^2$. But if the given scale transformation is denoted
by $S^{-1} = \mathrm{diag}(s_1^{-1}, s_2^{-1})$, so that $S^{-1}x = (s_1^{-1}x_1, s_2^{-1}x_2)'$, then for this unit scaling it must also be true that
$s_1^{-1}x_{01} = 1 = s_2^{-1}x_{02}$, so that $x_{01} = s_1$ and $x_{02} = s_2$. These two relations together imply that $a_1 = c\,s_1^{-2}$ and
$a_2 = c\,s_2^{-2}$, so that $c = a_1 x_1^2 + a_2 x_2^2 = (c\,s_1^{-2})x_1^2 + (c\,s_2^{-2})x_2^2$. Finally, by canceling c on both sides, we see that
$1 = (s_1^{-1}x_1)^2 + (s_2^{-1}x_2)^2 = \|(s_1^{-1}x_1, s_2^{-1}x_2)\|^2$, and may thus conclude that $\|(s_1^{-1}x_1, s_2^{-1}x_2)\| = 1$, i.e., that all
transformed vectors $(s_1^{-1}x_1, s_2^{-1}x_2)'$ have unit length.


[Figure graphics omitted]
Figure A3.17. Example in Three Dimensions

In this example, the unit sphere, $\mathbb{S}_3$, shown (in red) on the left is mapped by the linear
transformation,

(A3.2.11)   $A = \begin{pmatrix} 0.7 & 0 & 0 \\ 0 & 1.8 & 0 \\ 0 & 0.7 & 0.7 \end{pmatrix}$

into the ellipsoidal image set, $A(\mathbb{S}_3)$, shown (in red) on the right. The details of this
example will be discussed further as we proceed. But for the moment, it should be clear
that the first principal axis (major axis) of this ellipsoid is the line through the origin (not
shown) that connects the two ends of this "football-shaped" set. So the point labeled,
$Av_1$, (to be discussed below) is the image of a point, $v_1 \in \mathbb{S}_3$, which is "maximally
stretched" by transformation, A. The location of this particular point, $v_1$, is shown on $\mathbb{S}_3$
(just below the $x_2$ axis). So by linearity, the other maximally stretched point in $\mathbb{S}_3$ (not
shown) must be just opposite to $v_1$ on the line from $v_1$ through the origin. Note also that
the second principal axis is a line through the origin which is orthogonal to the first
principal axis and passes through the point labeled, $Av_2$, in the figure.

While it is possible to construct an orthonormal transformation that rotates these axes into
the coordinate axes, and then rescale the ellipsoidal image back to a sphere as in Figure
A3.16 above, the details of such a construction are extremely tedious (especially in
higher dimensions). Hence the two most important features of the argument in Figure
A3.16 are (i) its graphical simplicity in  2 , and (ii) its role in suggesting a more tractable
approach to the SVD Theorem in  n . In particular, this approach is motivated by the
observation that the critical task in the above argument is to identify those unit vectors in


$\mathbb{R}^n$ that are mapped by A into the principal axes of the ellipsoidal image, $A(\mathbb{S}_n)$, so that
the appropriate rotations can be defined. Note in particular that the vector mapped into
the major axis of the ellipse in Figure A3.16 (or ellipsoid in Figure A3.17) is by
definition that unit vector, $v_1$, with maximal image length, $\|Av_1\|$. So the most natural
procedure for identifying $v_1$ is to solve the maximization problem

(A3.2.12)   maximize: $\|Av\|$   subject to: $v \in \mathbb{S}_n$

There will of course be two solutions, corresponding to each end of the ellipse (or
ellipsoid). But this vector is essentially unique up to a choice of direction. The second
key point established for the case of  2 was that the vector mapped into the minor axis is
necessarily determined (up to a choice of direction) as one orthogonal to v. In the case of
 2 , this was established by verifying that the transformation, V  , in (A3.2.7) was
orthonormal. In higher dimensions, a direct proof of this fact is much more difficult. So
our approach will be to start by assuming that this is the case, and use this assumption to
construct a sequence of maximization problems similar to (A3.2.12). The final solutions
to these problems will be seen to yield precisely desired representation in (A3.2.1), and
thus show (among other things) that V  in (A3.2.7) is indeed orthonormal in all cases.

Before developing this sequential maximization procedure, it is appropriate to make a


few preliminary remarks. First of all, this approach to establishing the SVD Theorem is
known in the literature as the “variational” approach, and is in fact one of the oldest
approaches to this problem.10 Second, it turns out that there is a more useful way of
representing image lengths, || Av || , that will be seen to have added benefits in the
following analysis. In particular, if for any vector, v  n , the image vector, Av, is simply
rescaled to a vector of unit length, u   n , as shown (in red) for n  2 in Figure A3.18
below,
[Figure graphics omitted]
Figure A3.18. Rescaling Convention

then by construction

10
See Stewart (1993) for an interesting historical discussion of this variational approach, which goes back
to the work of Jordan in the 1870’s.


(A3.2.13)   $Av = s\,u$

for some scalar, s. Note that if $Av = 0$ then (A3.2.13) will hold trivially for $s = 0$. While
we shall eventually need to deal with this degenerate case, we focus for the present on
vectors, $v \in \mathbb{S}_n$, with $Av \neq 0$ [i.e., $v \notin \mathrm{null}(A)$] so that $s \neq 0$. Moreover, by replacing u
with $-u$ if necessary, we can always ensure that $s > 0$, so that by construction,
$\|Av\| = s\,\|u\| = s > 0$. Thus, as an alternative to (A3.2.12), one can find the direction, v,
of maximal stretch by solving the associated maximization problem:

(A3.2.14)   maximize: $s = s(v, u)$   subject to: $Av = su$, $\|u\| = 1$, $\|v\| = 1$

Note also that since $u'u = \|u\|^2$ for any vector, u, it follows from the first constraint that

(A3.2.15)   $u'Av = s\,u'u = s\,\|u\|^2 = s$

Hence (A3.2.14) can be simplified to

(A3.2.16)   maximize: $u'Av$   subject to: $u'u = 1$, $v'v = 1$

As we shall see below, the advantage of this alternative formulation is that it will allow us
to solve simultaneously for all three matrices, U, S, and V in (A3.2.1), where u and v
will turn out to be column vectors of U and V respectively, and where $s\;(= u'Av)$ will be
the diagonal elements of S. This constrained maximization problem thus constitutes the
centerpiece of the present analysis, and will be used recursively to construct the full
SVD representation for arbitrary linear transformations.
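Numerically, the solution of this maximization problem is exactly the leading singular value and singular vectors delivered by svd; the following MATLAB sketch (again with the matrix of (A3.2.11)) illustrates this:

   A = [0.7 0 0; 0 1.8 0; 0 0.7 0.7];
   [U, S, V] = svd(A);
   v1 = V(:,1);  u1 = U(:,1);  s1 = S(1,1);   % maximal-stretch direction, image direction, stretch
   disp(norm(A*v1 - s1*u1))                   % A v1 = s1 u1, as in (A3.2.13)
   disp(abs(norm(A) - s1))                    % max ||A v|| over ||v|| = 1 equals s1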

Before doing so, it is important to note finally that (A3.2.16) must always have a
solution. While this may seem obvious in our original two dimensional problem, it is less
so in higher dimensions. In particular, since the objective function, uAv , in (A3.2.16) is
a bilinear form in u and v (i.e., it is linear in u for each fixed v, and linear in v for each
fixed u) there are no natural maxima or minima for this function. But the existence of
such solutions follows from what is usually called the Generalized Extreme Value
Theorem. The classical Extreme Value Theorem simply states that every continuous
function, f ( x) , on a closed bounded interval, [a, b]   , has both a maximum and
minimum value. This can be seen intuitively as in Figure A3.19 below:

[Figure graphics omitted]
Figure A3.19. Extreme Values



The generalized version simply shows that the same is true for continuous functions on
nonempty closed bounded sets in any finite-dimensional space, $\mathbb{R}^N$.11 In the present case,
the bilinear form, $f(u,v) = u'Av$, is a continuous function on $\mathbb{R}^{2n}$ constrained to the
product of unit spheres, $\mathbb{S}_n \times \mathbb{S}_n = \{u \in \mathbb{R}^n : \|u\| = 1\} \times \{v \in \mathbb{R}^n : \|v\| = 1\} \subset \mathbb{R}^n \times \mathbb{R}^n = \mathbb{R}^{2n}$,
which is easily seen to be a nonempty closed bounded set in $\mathbb{R}^{2n}$. Hence there always
exists a maximum solution to (A3.2.16). Moreover, since both the objective function, $f(u,v)$,
and the constraint functions, $u'u$ and $v'v$, are continuously differentiable on $\mathbb{R}^{2n}$, this
maximum can be characterized by the first-order conditions of the associated Lagrangian
function [recall expression (A2.8.38) in Section 8 of the Appendix, A2, to Part II of these
notes]:

(A3.2.17)   $L(u, v, s, \lambda) = u'Av + \tfrac{1}{2}\left[ s(1 - u'u) + \lambda(1 - v'v) \right]$

(where the factor of ½ is introduced for notational convenience only). By using
expressions (A2.7.7) and (A2.7.11) [with $A = I_n$] in Appendix A2, we see that the first-order
conditions for u and v are given respectively by

(A3.2.18)   $0 = \nabla_u L = Av + \tfrac{1}{2}[-2su + 0] = Av - su \;\Rightarrow\; Av = su$

(A3.2.19)   $0 = \nabla_v L = A'u + \tfrac{1}{2}[0 - 2\lambda v] = A'u - \lambda v \;\Rightarrow\; A'u = \lambda v$

where (A3.2.19) also uses the identity, $\nabla_v(u'Av) = (u'A)' = A'u$. Similarly, the first-order
conditions for s and $\lambda$ reduce to the constraints

(A3.2.20)   $u'u = 1 = v'v$

At this point, notice that conditions (A3.2.18) and (A3.2.20) are simply the constraints in
(A3.2.14) that originally motivated this formulation. In particular, transformation A must
achieve its maximum stretch, s, at vector v. Hence the most important new information
provided by this solution is condition (A3.2.19), which shows that there is a parallel
relation for the transpose, $A'$, of A. In particular, the same argument leading to
(A3.2.14) shows that the maximum stretch, $\lambda$, of the transpose transformation, $A'$, must be
achieved at vector u, so that there is a clear duality between these two transformations.
Moreover, by the symmetry of inner products, it follows from (A3.2.18), (A3.2.19) and
(A3.2.20) that

(A3.2.21)   $s = s(u'u) = u'(su) = u'(Av) = v'(A'u) = v'(\lambda v) = \lambda(v'v) = \lambda$

So in fact this maximum stretch value must be the same for both A and $A'$.

11
Even more generally, this is true for continuous functions on compact sets in arbitrary topological spaces.
See for example the self-contained development of this general version in Murphy (2008).


Before extending this argument to obtain the SVD representation (A3.2.1) for all
matrices, A, it is essential to distinguish between the nonsingular and singular cases.
Figure A3.17 above illustrates a typical nonsingular case, which is by far the most
important case for all applications that we consider in these notes. However, since this
same representation also holds for singular matrices, it is instructive to see what this
means for the geometry of linear transformations. To illustrate the basic differences
between these two cases, we now consider the following modification of transformation,
A, in expression (A3.2.11) above

 0.7 0 0 
 
(A3.2.22) A0   0 1.8 0 
 0 0.7 0 
 

Here the matrix, $A_0$, differs from A in only the third column, which is now the zero
vector. This of course implies that $A_0 e_3 = 0$, and hence that $A_0$ is singular. The
corresponding modification of Figure A3.17 is shown in Figure A3.20 below.

[Figure graphics omitted]
Figure A3.20. A Singular Example in Three Dimensions

The key difference here is that span(A_0) is now a two-dimensional plane (shown in
blue). So the ellipsoid in Figure A3.17 has now been collapsed into an ellipse on this
plane. Notice also that while the image of the unit sphere, S_3, in Figure A3.17 was only the
surface of an ellipsoidal solid in ℝ^3, the present image set, A_0(S_3), consists of the full
area inside the ellipse on the right (including the origin). But the initial maximization
problem in (A3.2.12) above is still well defined, and is seen to have a solution very similar
to the full-dimensional case in Figure A3.17. Notice in particular that the analysis of this
ellipse in span(A_0) is qualitatively the same as that for the ellipse in the upper right panel
of Figure A3.17 for the ℝ^2 case. More generally, it will turn out that for any singular
matrix, A, one proceeds by first analyzing


the ellipsoid in span( A) , and then extending this analysis to the collapsed dimensions in
null ( A) in order to complete the SVD representation.

This extension process is most transparent in ℝ^2. So before proceeding with the formal
argument, it is instructive to reconsider the singular example in expression (A3.1.28)
together with Figure A3.5. This figure is reproduced in Figure A3.21 below, where the
unit circle, S_2, is now included. The image set, A(S_2), is given by the red line segment
shown, which by definition lies in span(A). So the possible solution vectors, v ∈ S_2, in
(A3.2.12) are seen to be either v_1 or -v_1, with images, Av_1 and -Av_1, corresponding to the
end points of the interval, A(S_2), as shown in the figure. For purposes of discussion, we now
focus on v_1. In this case, the full solution to this maximization problem is given by the
triple, (v_1, s_1, u_1), where u_1 is the unit-scaled version of Av_1 (shown by the red point),
with scale factor, s_1, denoting the maximum-stretch value, i.e., Av_1 = s_1 u_1.

[Figure: the unit circle S_2 with the vectors v_1, -v_1, v_2, -v_2 ; the image segment A(S_2) in span(A) with end points Av_1 and -Av_1 ; the unit vectors u_1 and u_2 ; and the line null(A).]
Figure A3.21 Singular Example in Two Dimensions

To complete the desired Singular Value Decomposition of matrix A in (A3.1.28), we
would like to find a unit vector, v_2 ∈ S_2, that is mapped by A into the minor axis of this
one-dimensional "ellipse". But while the major axis is well defined, there appears to be
no meaningful minor axis. Here is where we use our assumption above that the end points
of this "axis" must be the images of unit vectors orthogonal to v_1. If so, then there are
seen to be only two possible choices, namely the points v_2 and -v_2 shown in blue.
Moreover, since these points both lie in null(A), it follows by definition that
Av_2 = A(-v_2) = 0. So under this assumption, the origin must constitute the relevant "minor
axis". In addition, ||Av_2|| = ||A(-v_2)|| = 0 implies that both points are equally good
solutions. So if we now focus on v_2, then the solution value must be given by
s_2 = ||Av_2|| = 0. Finally, to complete this solution, observe that if we choose any unit


vector, u_2, orthogonal to u_1 (such as the point, u_2, just to the right of v_2 in the figure),
then it is automatically true that Av_2 = 0 = s_2 u_2. So this degenerate "maximal stretch"
solution is summarized by the triple (v_2, s_2, u_2) where s_2 = 0. Notice that when taken
together, these two solutions can be written as

(A3.2.23)   \left.\begin{array}{l} Av_1 = s_1 u_1 \\ Av_2 = s_2 u_2 \end{array}\right\}
            \;\Longrightarrow\; A\,(v_1, v_2) \;=\; (u_1, u_2)\begin{pmatrix} s_1 & \\ & s_2 \end{pmatrix}
            \;\Longrightarrow\; A\,V \;=\; U\,S

where V = (v_1, v_2) and U = (u_1, u_2) are orthonormal matrices by construction. So this in
turn implies that

(A3.2.24)   A \;=\; U\,S\,V'

and thus that (A3.2.1) holds for this choice of matrices. In the present case, it can readily
be verified that these matrices have the exact form:

(A3.2.25)   U\,S\,V' \;=\;
  \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}
  \begin{pmatrix} \sqrt{10} & 0 \\ 0 & 0 \end{pmatrix}
  \begin{pmatrix} 2/\sqrt{5} & 1/\sqrt{5} \\ 1/\sqrt{5} & -2/\sqrt{5} \end{pmatrix}
  \;=\; \begin{pmatrix} 2 & 1 \\ 2 & 1 \end{pmatrix} \;=\; A

where, for example, s_1 = \sqrt{10} \approx 3.16 is the length of the major-axis vector,
Av_1 = (\sqrt{5}, \sqrt{5})', in Figure A3.21. The most important feature of this singular example is
that all analysis of the collapsed "minor axis" in null(A) is formally identical to that of
the positive "major axis" in span(A). The only difference is that the solutions in the
collapsed case are nonunique, so that any choice of a unit vector, u_2, orthogonal to u_1
will work.
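
As a quick numerical check of (A3.2.25), the hand-constructed factors above can be compared with the output of MATLAB's svd function. The following is only a sketch (the variable names U0, S0, V0 are mine, not part of the notes):

    % singular matrix A = [2 1; 2 1] from (A3.1.28)
    A = [2 1; 2 1];
    [U,S,V] = svd(A);            % numerical SVD with A = U*S*V'
    disp(diag(S)')               % singular values: sqrt(10) and 0
    disp(norm(A - U*S*V'))       % reconstruction error (essentially zero)
    % hand-constructed factors from (A3.2.25)
    U0 = [1 1; 1 -1]/sqrt(2);
    S0 = diag([sqrt(10) 0]);
    V0 = [2 1; 1 -2]/sqrt(5);
    disp(norm(A - U0*S0*V0'))    % also essentially zero

Since the second singular value is zero, the signs of the second columns of U and V returned by svd may differ from those chosen above; any such choice is equally valid.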

Note finally that nonuniqueness of solutions is also possible for positive axes of the
ellipsoid in span(A). A simple example is provided by any orthonormal transformation,
A = U, where U(S_n) = S_n implies that all "axes" of this spherical image must have the
same length. In this extreme case, there are infinitely many SVD representations of U,
including the trivial one, U = U I_n I_n. A more interesting example is based on the matrix,
A, in (A3.2.11) with SVD given by12

 0 1 0   1.95   0 0.98924 0.14633 



(A3.2.26) U S V   0.9131 0 0.4076   0.7
 1 0 0

   
 0.4076 0 0.9131   
0.6462   0 0.14633 0.98924 

 

12
This solution was obtained numerically with the MATLAB program, svd.m.


 0.7 0 0 
 0 1.8 0   A

 
 0 0.7 0.7 

While all principal axes in this example are distinct, notice that the lengths of the second
and third axes (0.7 and 0.6462) are almost the same. Geometrically, this implies that the
intersection of the surface of the ellipsoid in the right panel of Figure A3.17 with the
plane orthogonal to the major axis vector, Av_1, must be almost circular (as shown by the
blue curve in the figure). So one can imagine that this intersection can be made exactly
circular by an appropriately small modification of the matrix, A.13 In this circular case, it
should be clear that while the principal axis vector, Av_1, is still unique (up to a choice of
sign), there is no unique choice of the second principal axis vector, Av_2, shown in Figure
A3.17. Any selection of a unit vector, v_2, orthogonal to v_1 will do. But, as we shall see
below, the actual maximization problem for identifying this principal axis is still well
defined, and most importantly, all such choices of v_2 must satisfy the corresponding
Lagrangian first-order conditions.
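
The numerical decomposition in (A3.2.26) is easy to reproduce. A minimal MATLAB sketch (columns of U and V are determined only up to sign, so the computed factors may differ from the displayed ones by sign changes of corresponding column pairs):

    A = [0.7 0 0; 0 1.8 0; 0 0.7 0.7];    % the matrix A in (A3.2.11)
    [U,S,V] = svd(A);
    disp(diag(S)')                         % approx [1.9500  0.7000  0.6462]
    disp(norm(A - U*S*V'))                 % reconstruction error (essentially zero)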

With these preliminary observations, we are now ready to extend the maximization
problem in (A3.2.16) in order to obtain a full singular value decomposition (SVD) of
matrices, A (which will further clarify the natural duality between A and A').14

Singular Value Decomposition Theorem. For any n-square matrix, A, there exist
orthonormal matrices, U = (u_i : i = 1,..,n), V = (v_i : i = 1,..,n), and a nonnegative diagonal
matrix, S = diag(s_i : i = 1,..,n), such that

(A3.2.27)   A \;=\; U\,S\,V'

Proof: To establish this result, we begin by observing that if (A3.2.27) holds, then [as an
extension of (A3.2.23) above] it follows by definition that,

(A3.2.28)   A \;=\; U\,S\,V' \;\;\Longrightarrow\;\; A\,V \;=\; U\,S

13
One such modification, A_o, is obtained by simply replacing S with S_o = diag(1.95, 0.7, 0.7) and using
U and V in (A3.2.27) to define A_o = U S_o V'.

14
As mentioned earlier, more compact versions of this SVD Theorem can be obtained by appealing to the
Spectral Decomposition Theorem and employing the symmetric-matrix device in (A3.2.3) above. [For a
"variational" version of this proof see Theorem 7.3.10 in Horn and Johnson (1985).] However, it should be
emphasized that essentially all direct proofs of the Spectral Decomposition Theorem implicitly embed ℝ in
the complex plane, ℂ, to ensure existence of such decompositions. Hence one of the objectives of the
present approach is to avoid any appeal to complex number theory whatsoever.


            \;\;\Longrightarrow\;\; (Av_1,.., Av_n) \;=\; (u_1,.., u_n)\begin{pmatrix} s_1 & & \\ & \ddots & \\ & & s_n \end{pmatrix} \;=\; (s_1 u_1,.., s_n u_n)

            \;\;\Longrightarrow\;\; Av_i \;=\; s_i u_i\,, \quad i = 1,..,n

where the last line is seen to have exactly the same form as (A3.2.18) above. Hence if we
now denote the solution to (A3.2.17) by (u1 , s1 , v1 ) , so that conditions (A3.2.18) through
(A3.2.20) imply

(A3.2.29)   Av_1 = s_1 u_1\,, \quad A'u_1 = s_1 v_1\,, \quad u_1'u_1 = 1 = v_1'v_1

then our objective is to extend this relation to a full SVD as in (A3.2.28) by generating
the successive triplets, (u_i, s_i, v_i), one at a time. Here it is instructive to generate the first
triplet, (u_2, s_2, v_2), in full detail, and then proceed by induction for the rest. To do so, we
begin by observing that in order for U and V to be orthonormal, we must require that
(u_2, v_2) satisfy the orthogonality conditions, u_2'u_1 = 0 = v_2'v_1. So if we now let
⊥(u_1) = {u ∈ S_n : u'u_1 = 0} and ⊥(v_1) = {v ∈ S_n : v'v_1 = 0} denote the vectors of unit length
orthogonal to u_1 and v_1 respectively, then in geometric terms, the task is to find
"maximal stretch" vectors, (u_2, v_2) ∈ ⊥(u_1) × ⊥(v_1), for transformation A which generate
the "second principal axes" of the ellipsoids, A(S_n) and A'(S_n), respectively. [For
example, the set ⊥(v_1) for the nonsingular illustration in Figure A3.17 above is shown by
the blue circle on S_3 in the left panel, with corresponding image, A[⊥(v_1)], shown by the
blue circle in the right panel. Similarly, for the singular illustration in Figure A3.20, the
set ⊥(v_1) is again shown on the left as a (different) blue circle on S_3, with associated
image now corresponding to the interval shown in dark blue in the right panel.] [Note that
(for the sake of visual clarity) neither the vector, u_1, nor its orthogonal set, ⊥(u_1), is shown in
these figures.] As a natural extension of (A3.2.17), the appropriate maximization problem
for determining (u_2, v_2) is given by

(A3.2.30)   maximize:\;\; u_2'Av_2 \qquad subject to:\;\; (u_2, v_2) \in ⊥(u_1) \times ⊥(v_1)

Moreover, ⊥(u_1) and ⊥(v_1) are again nonempty closed bounded subsets of ℝ^n for n ≥ 2
[implying that ⊥(u_1) × ⊥(v_1) must be a nonempty closed bounded subset of ℝ^{2n}]. So the
same argument using the Generalized Extreme Value Theorem again shows that a
solution to (A3.2.30) must exist. Since the above constraint conditions for (u_2, v_2) can be
equivalently stated as

(A3.2.31)   u_2'u_2 = 1\,, \quad v_2'v_2 = 1\,, \quad u_2'u_1 = 0\,, \quad v_2'v_1 = 0


it follows that the appropriate Lagrangian function for this problem takes the form:

(A3.2.32)   L(u_2, v_2, s_2, \sigma_2, \alpha_2, \beta_2) \;=\; u_2'Av_2 \;+\; \tfrac{1}{2}\,[\, s_2(1 - u_2'u_2) + \sigma_2(1 - v_2'v_2) \,]
            \;+\; \alpha_2\,(u_2'u_1) \;+\; \beta_2\,(v_2'v_1)

Here the first order conditions for u_2 and v_2 are given respectively by

(A3.2.33)   0 \;=\; \nabla_{u_2} L \;=\; Av_2 - s_2 u_2 + \alpha_2 u_1 \;\;\Longrightarrow\;\; Av_2 \;=\; s_2 u_2 - \alpha_2 u_1

(A3.2.34)   0 \;=\; \nabla_{v_2} L \;=\; A'u_2 - \sigma_2 v_2 + \beta_2 v_1 \;\;\Longrightarrow\;\; A'u_2 \;=\; \sigma_2 v_2 - \beta_2 v_1

with corresponding first order conditions for (s_2, \sigma_2, \alpha_2, \beta_2) given precisely by the
conditions in (A3.2.31) above. At this point it is important to recall from the discussion in
Section 8 of the Appendix to Part II that the validity of this Lagrangian formulation
requires that the constraint gradient vectors be linearly independent [recall expression
(A3.1.24) above]. But this is automatically guaranteed by the mutual orthogonality of
(u_2, u_1) and (v_2, v_1). Hence the task remaining in this second step is to show that
\alpha_2 = 0 = \beta_2, so that (A3.2.33) and (A3.2.34) will have the same form as (A3.2.18) and
(A3.2.19). But since the solution in (A3.2.32) is assumed to satisfy (A3.2.18) and
(A3.2.19), together with (A3.2.31), it follows by premultiplying (A3.2.33) by u_1' that

(A3.2.35)   0 \;=\; u_1'Av_2 - s_2(u_1'u_2) + \alpha_2(u_1'u_1) \;=\; u_1'Av_2 + \alpha_2
            \;\;\Longrightarrow\;\; \alpha_2 \;=\; -\,u_1'Av_2 \;=\; -\,(A'u_1)'v_2 \;=\; -\,(s_1 v_1)'v_2 \;=\; -\,s_1(v_1'v_2) \;=\; 0

Similarly, premultiplying (A3.2.34) by v_1', we see that

(A3.2.36)   0 \;=\; v_1'A'u_2 - \sigma_2(v_1'v_2) + \beta_2(v_1'v_1) \;=\; v_1'A'u_2 + \beta_2
            \;\;\Longrightarrow\;\; \beta_2 \;=\; -\,v_1'A'u_2 \;=\; -\,(Av_1)'u_2 \;=\; -\,(s_1 u_1)'u_2 \;=\; -\,s_1(u_1'u_2) \;=\; 0

Hence \alpha_2 = 0 = \beta_2, and conditions (A3.2.33) and (A3.2.34) reduce to

(A3.2.37)   Av_2 \;=\; s_2 u_2

(A3.2.38)   A'u_2 \;=\; \sigma_2 v_2

Moreover, exactly the same argument in (A3.2.21) with (u_2, s_2, v_2) replacing (u, s, v)
now shows that \sigma_2 = s_2, so that (A3.2.38) becomes


(A3.2.39)   A'u_2 \;=\; s_2 v_2

Hence the maximal stretch, s_2, for transformation A among vectors in ⊥(v_1) is achieved
at v_2, and similarly, the same maximal stretch for transformation A' among vectors in
⊥(u_1) is achieved at u_2. Most importantly for our present purposes, expression (A3.2.37)
shows that (u_2, s_2, v_2) yields the desired second row for the SVD in expression (A3.2.28).
Note finally that this solution (u_2, s_2, v_2) may not be unique, even when (u_1, s_1, v_1) is
unique [such as in the modification of example (A3.2.11) illustrated above]. But all such
solutions must necessarily satisfy conditions (A3.2.31), (A3.2.37) and (A3.2.39).

The task remaining is to extend this argument by induction to all rows of (A3.2.28). To
do so, we start with the inductive hypothesis that for a given k ≤ n, the first k−1 rows
have been filled with triplets, (u_i, s_i, v_i), i = 1,..,k−1, satisfying

(A3.2.40)   Av_i \;=\; s_i u_i\,, \quad i = 1,..,k-1

(A3.2.41)   A'u_i \;=\; s_i v_i\,, \quad i = 1,..,k-1

(A3.2.42)   u_i'u_i \;=\; 1 \;=\; v_i'v_i\,, \quad i = 1,..,k-1

(A3.2.43)   u_i'u_j \;=\; 0 \;=\; v_i'v_j\,, \quad i, j = 1,..,k-1\,,\; i \neq j

If we now let ⊥(u_1,..,u_{k-1}) = {u ∈ S_n : u'u_i = 0, i = 1,..,k−1} denote the set of unit vectors
orthogonal to (u_1,..,u_{k-1}), and similarly let ⊥(v_1,..,v_{k-1}) = {v ∈ S_n : v'v_i = 0, i = 1,..,k−1}
denote the unit vectors orthogonal to (v_1,..,v_{k-1}), then since these nonempty sets are again
closed and bounded, one final application of the Generalized Extreme Value Theorem
shows that the maximization problem

(A3.2.44)   maximize:\;\; u_k'Av_k \qquad subject to:\;\; (u_k, v_k) \in ⊥(u_1,..,u_{k-1}) \times ⊥(v_1,..,v_{k-1})

must have a solution. Moreover, as an extension of (A3.2.31) and (A3.2.32), it follows that
if the constraint conditions on (u_k, v_k) are written explicitly as

(A3.2.45)   u_k'u_k \;=\; 1 \;=\; v_k'v_k\,, \qquad u_k'u_i \;=\; 0 \;=\; v_k'v_i\,, \quad i = 1,..,k-1

then the appropriate Lagrangian function for (A3.2.44) is seen to have the form:

(A3.2.46)   L(u_k, v_k, s_k, \sigma_k, \alpha_1,..,\alpha_{k-1}, \beta_1,..,\beta_{k-1}) \;=\; u_k'Av_k \;+\; \tfrac{1}{2}\,[\, s_k(1 - u_k'u_k) + \sigma_k(1 - v_k'v_k) \,]
            \;+\; \sum_{i=1}^{k-1} \alpha_i\,(u_k'u_i) \;+\; \sum_{i=1}^{k-1} \beta_i\,(v_k'v_i)


Here the first order conditions for u_k and v_k have the respective forms

(A3.2.47)   0 \;=\; \nabla_{u_k} L \;=\; Av_k - s_k u_k + \sum_{i=1}^{k-1} \alpha_i u_i \;\;\Longrightarrow\;\; Av_k \;=\; s_k u_k - \sum_{i=1}^{k-1} \alpha_i u_i

(A3.2.48)   0 \;=\; \nabla_{v_k} L \;=\; A'u_k - \sigma_k v_k + \sum_{i=1}^{k-1} \beta_i v_i \;\;\Longrightarrow\;\; A'u_k \;=\; \sigma_k v_k - \sum_{i=1}^{k-1} \beta_i v_i

and the remaining first order conditions are now given by (A3.2.45). [Note again from
the orthogonality conditions in (A3.2.45) that the constraint gradient vectors in both
(A3.2.47) and (A3.2.48) are linearly independent, so that this Lagrangian formulation of
(A3.2.46) is indeed valid.] Next, to show that \alpha_j = 0 = \beta_j, j = 1,..,k−1, we again
premultiply (A3.2.47) by u_j' and use the inductive hypotheses (A3.2.40) through
(A3.2.43) together with (A3.2.45) to conclude that

(A3.2.49)   0 \;=\; u_j'Av_k - s_k(u_j'u_k) + \sum_{i=1}^{k-1} \alpha_i(u_j'u_i) \;=\; u_j'Av_k + [\,\alpha_j(u_j'u_j) + 0\,] \;=\; u_j'Av_k + \alpha_j
            \;\;\Longrightarrow\;\; \alpha_j \;=\; -\,u_j'Av_k \;=\; -\,(A'u_j)'v_k \;=\; -\,(s_j v_j)'v_k \;=\; -\,s_j(v_j'v_k) \;=\; 0

Similarly, by premultiplying (A3.2.48) by v_j', we see that

(A3.2.50)   0 \;=\; v_j'A'u_k - \sigma_k(v_j'v_k) + \sum_{i=1}^{k-1} \beta_i(v_j'v_i) \;=\; v_j'A'u_k + [\,\beta_j(v_j'v_j) + 0\,] \;=\; v_j'A'u_k + \beta_j
            \;\;\Longrightarrow\;\; \beta_j \;=\; -\,v_j'A'u_k \;=\; -\,(Av_j)'u_k \;=\; -\,(s_j u_j)'u_k \;=\; -\,s_j(u_j'u_k) \;=\; 0

Hence (A3.2.47) and (A3.2.48) reduce to

(A3.2.51)   Av_k \;=\; s_k u_k

(A3.2.52)   A'u_k \;=\; \sigma_k v_k

Finally, since the argument in (A3.2.21) with (u_k, s_k, v_k) replacing (u, s, v) again shows
that \sigma_k = s_k, we see that (A3.2.52) becomes

(A3.2.53)   A'u_k \;=\; s_k v_k

Thus the conditions in (A3.2.40) through (A3.2.43) hypothesized for i = 1,..,k−1 are seen
to hold for k as well, and it follows by induction that they must hold for all i = 1,..,n.
Most importantly for our purposes, conditions (A3.2.40) together with (A3.2.42) and
(A3.2.43) are now seen to yield a full SVD for A as in expression (A3.2.28). 


This particular proof of the SVD Theorem has a number of additional geometric
consequences. Note first from (A3.2.45) and (A3.2.51) that

(A3.2.54)   u_k'Av_k \;=\; s_k\,(u_k'u_k) \;=\; s_k\,, \quad k = 1,..,n

so that the stretch values, s_k, are indeed the maximum values of the objective function,
u_k'Av_k, at each step, k. Moreover, since this objective function is formally the same at
each step, and since the constraint sets form a nested decreasing sequence of sets, i.e.,

(A3.2.55)   ⊥(u_1,..,u_k) \times ⊥(v_1,..,v_k) \;\subseteq\; ⊥(u_1,..,u_{k-1}) \times ⊥(v_1,..,v_{k-1})\,, \quad k = 1,..,n

it follows that these maximal values must necessarily form a non-increasing sequence, so
that

(A3.2.56)   s_1 \;\geq\; s_2 \;\geq\; \cdots \;\geq\; s_n

In geometric terms, these singular values thus yield the successive lengths of the principal
axes corresponding to the ellipsoidal image, A(S_n), of the unit sphere, S_n, under the
linear transformation, A. In particular, if s_1 ≥ s_2 ≥ ··· ≥ s_n > 0, then A is nonsingular
and the n-dimensional ellipsoid, A(S_n), has a well defined set of principal axes.
However, if there are say k repetitions of a positive singular value, such as in the
modified version of Figure A3.17 illustrated above with k = 2, then a k-dimensional
"slice" through this ellipsoid will be spherical. Similarly, if the last k singular values are
zero, then A is singular and its null space, null(A), has exactly dimension k. So a great
deal of information about A is conveyed by these singular values.
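
These geometric facts are easily checked numerically. The following MATLAB sketch compares the singular values of the nonsingular matrix A in (A3.2.11) with those of the singular matrix A_0 in (A3.2.22); the number of (numerically) zero singular values gives the dimension of the null space:

    A  = [0.7 0 0; 0 1.8 0; 0 0.7 0.7];   % nonsingular example (A3.2.11)
    A0 = [0.7 0 0; 0 1.8 0; 0 0.7 0  ];   % singular example (A3.2.22)
    disp(svd(A)')      % all positive: full-dimensional ellipsoid
    s0 = svd(A0);      % last singular value is zero
    tol = 1e-10;
    rank_A0    = sum(s0 >  tol)   % = 2
    nullity_A0 = sum(s0 <= tol)   % = 1 = dim null(A0)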

However, it should also be emphasized that the programming formulation of this proof is
not meant to provide a method for computing the SVD of a matrix. This is particularly
evident when there are repeated singular values (either positive or zero). Here
there are infinitely many programming solutions, and procedures such as Gram-Schmidt
orthogonalization must be used to construct appropriate orthonormal sets of solution
vectors. While there exist very efficient methods for constructing such decompositions
(often based on the Householder representations in Section A3.1.2 above), such procedures
are beyond the scope of these notes.15

We now consider some of the more useful consequences of the SVD Theorem for our
purposes. As already mentioned, one direct consequence is to clarify the geometric
relation between A and A'. In particular, it follows at once from (A3.2.27) together with
(A3.1.13) that

(A3.2.57)   A \;=\; U\,S\,V' \;\;\Longrightarrow\;\; A' \;=\; V\,S\,U'

15
For a discussion of such methods as used by MATLAB, see Chapter 10 of Moler (2004).


So the singular values of A and A' must always be the same. Moreover, the above proof shows
that their respective ellipsoidal images, A(S_n) and A'(S_n), of the unit sphere, S_n, must
essentially be rotations of one another, where the roles of the orthonormal matrices, U
and V, are exactly reversed. A simple illustration of this relationship is given in Figure
A3.22 below, where the unit circle, S_2, is shown in black, and the elliptical images,
A(S_2) and A'(S_2), for a given matrix, A,16 are shown in blue and red, respectively.

[Figure: the unit circle S_2 together with its two elliptical images, A(S_2) and A'(S_2).]
Figure A3.22. Ellipsoidal Relations for Transposes

Next we consider a number of SVD consequences that will be used in our subsequent
analyses.

A3.2.1. Inverses and Pseudoinverses.

Note first that since the inverse of an orthonormal matrix is simply its transpose, it
follows at once from the SVD Theorem that for any nonsingular matrix, A,

(A3.2.58)   A \;=\; U\,S\,V' \;\;\Longrightarrow\;\; A^{-1} \;=\; V\,S^{-1}\,U' \;=\; (v_1,.., v_n)\begin{pmatrix} s_1^{-1} u_1' \\ \vdots \\ s_n^{-1} u_n' \end{pmatrix}

Thus, by recalling (A3.1.11), we see that the inverse, A^{-1}, can be determined from the
SVD of A almost by inspection. While this of course assumes that this SVD has already
been calculated, it nonetheless provides a powerful analytical tool in many contexts. For
example, it now reveals the behavior of "almost nonsingular" matrices, which by
16
The particular matrix used here was A = [1.0689, 2.9443 ; 0.8095, -1.4384].


definition have at least one singular value, s_i, very close to zero. But since this in turn
implies that 1/s_i must be very large, it can be seen from the last equality in (A3.2.58)
that vectors in the u_i direction are being stretched enormously. So this shows not only
that A^{-1} is becoming unstable, but also the directions in which this instability is worst.

Even more important is the fact that this SVD shows how to construct generalized
inverses for singular matrices. In particular, when no inverse exists for A, this SVD
representation suggests a very natural "best approximation" to such an inverse. The idea is
seen most clearly in trying to solve the associated linear equation system, Ax = b. If A^{-1}
exists, then there is an exact solution, x = A^{-1}b. But if A is singular, one would like to
find x so that Ax is as "close" to b as possible, i.e., so that ||Ax − b|| is minimized. But by
(A3.2.54),17

(A3.2.59)   Ax - b \;=\; (U\,S\,V')x - b \;=\; U\,S\,V'x - U\,U'b \;=\; U\,(S\,V'x - U'b) \;=\; U\,(S\,\tilde{x} - \tilde{b})

where \tilde{x} = V'x and \tilde{b} = U'b. But since U is orthonormal and hence preserves distances, it
follows that

(A3.2.60)   \|Ax - b\| \;=\; \|S\tilde{x} - \tilde{b}\| \;=\; \left\| \begin{pmatrix} s_1 \tilde{x}_1 - \tilde{b}_1 \\ \vdots \\ s_n \tilde{x}_n - \tilde{b}_n \end{pmatrix} \right\|

So this approximation problem has now been reduced to a diagonal form for which the
solution is seen to be trivial, namely, set

(A3.2.61)   \tilde{x}_i \;=\; \begin{cases} \tilde{b}_i / s_i\,, & s_i > 0 \\ 0\,, & s_i = 0 \end{cases}

Finally, if we assume (for convenience) that the first k components of S are the positive
ones, and set

(A3.2.62)   S^{+} \;=\; diag(\,s_1^{-1},.., s_k^{-1}, 0,..,0\,)

then it follows from (A3.2.61) together with the definitions of \tilde{x} and \tilde{b} that

(A3.2.63)   \tilde{x} \;=\; S^{+}\tilde{b} \;\;\Longrightarrow\;\; V'x \;=\; S^{+}U'b \;\;\Longrightarrow\;\; x \;=\; (V\,S^{+}\,U')\,b

Finally, since this argument is completely independent of the choice of b, it follows by
setting

17
The following argument is based on the excellent discussion of SVD properties in Kalman (1996).


(A3.2.64)   A^{+} \;=\; V\,S^{+}\,U'

that A^{+} yields a natural generalization of (A3.2.58), which is designated as the
pseudoinverse (or Moore-Penrose inverse) of A. Moreover, since minimizing distance is
the same as minimizing squared distance, this pseudoinverse, A^{+}, always provides the
least squares solution to linear equation systems.
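
A minimal MATLAB sketch of this construction is given below, where the pseudoinverse built directly from the SVD as in (A3.2.62)-(A3.2.64) is compared with the built-in function pinv (the right-hand side b is an arbitrary hypothetical choice):

    A = [2 1; 2 1];                  % singular matrix from (A3.2.25)
    b = [1; 3];                      % arbitrary right-hand side
    [U,S,V] = svd(A);
    s = diag(S);  tol = 1e-10;
    Splus = diag([1./s(s > tol); zeros(sum(s <= tol),1)]);   % S^+ as in (A3.2.62)
    Aplus = V*Splus*U';              % A^+ = V S^+ U' as in (A3.2.64)
    disp(norm(Aplus - pinv(A)))      % agrees with pinv (essentially zero)
    x = Aplus*b;                     % least squares solution of A x = b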

These observations are immediately applicable to OLS regression. In particular, recall
from expression (A2.7.69) that the least squares solution, \hat{\beta}, for estimating \beta satisfies
the linear equation system:

(A3.2.65)   (X'X)\,\hat{\beta} \;=\; X'y

So if (X'X)^{-1} exists (as was assumed) then \hat{\beta} = (X'X)^{-1}X'y. But in cases where X'X is
singular, one can still determine a least squares solution by setting

(A3.2.66)   \hat{\beta} \;=\; (X'X)^{+}\,X'y

Moreover, even in cases where X'X is technically nonsingular but is in fact "almost
singular" (i.e., exhibits strong multicollinearities), one can often obtain a more stable
estimate by using (A3.2.66). So the SVD Theorem is seen to have very practical
applications in such cases.
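
As an illustration of (A3.2.66), the following sketch constructs a hypothetical design matrix, X, with two nearly collinear columns and compares the usual inverse-based estimate with the pseudoinverse-based one. Note that the individual coefficients on the collinear columns are not separately identified here, so the pseudoinverse returns the (minimum-norm) least squares solution, which spreads their common effect across the two columns:

    rng(1);                           % reproducible hypothetical data
    n  = 100;
    x1 = randn(n,1);
    x2 = x1 + 1e-10*randn(n,1);       % nearly collinear with x1
    X  = [ones(n,1) x1 x2];
    y  = 1 + 2*x1 + randn(n,1);
    beta_inv  = (X'*X)\(X'*y);        % can be numerically unstable here
    beta_pinv = pinv(X'*X)*(X'*y);    % stabilized estimate as in (A3.2.66)
    disp([beta_inv beta_pinv])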

A3.2.2. Determinants and Volumes

Recall from expression (3.2.11) in Part II of this NOTEBOOK that we encountered


determinants in the density function of the multi-normal distribution. The main objective
of this section is to clarify the role of determinants in such densities, and to emphasize
their broader role in describing the volume changes associated with linear
transformations. To do so, we require some preliminary facts about matrix determinants.
For the simple case of a 2 × 2 matrix, A, recall that the determinant of A is given by

(A3.2.67)   A \;=\; \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \;\;\Longrightarrow\;\; |A| \;=\; a_{11}a_{22} - a_{12}a_{21}\,,

which in turn plays a critical role in calculating the inverse of A:

(A3.2.68)   A^{-1} \;=\; \frac{1}{|A|}\begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}

[In fact, the determinant itself originated as part of the first general solution of linear
equations (Cramer's Rule, 1750).] Note in particular from (A3.2.68) that such solutions
exist iff |A| ≠ 0. The geometric meaning of this relationship will become clear below.


But for the present, we simply note that the formula in (A3.2.67) offers little insight by
itself. Indeed, the general formula for determinants (in terms of alternating-signed sums
of products of matrix elements)18 is even more obtuse. But one important observation
about this formula can be made in terms of the following instance of a Householder
reflection, A = H_v, in ℝ^2 [recall expression (A3.1.44) above], where in this case
v = (1, -1)' [with v'v = 2], so that:

(A3.2.69)   A \;=\; I_2 - \tfrac{2}{v'v}\,vv' \;=\; I_2 - \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} \;=\; \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
By expression (A3.2.67) this matrix has a negative determinant, |A| = −1. To interpret
the meaning of this negative sign, note from Figure A3.23 below that this transformation
simply reverses the basis vectors, (e_1, e_2), so that A(e_1, e_2) = (e_2, e_1):

[Figure: the basis vectors e_1, e_2 and their images Ae_1 = e_2 and Ae_2 = e_1, reflected across the line orthogonal to v.]
Figure A3.23. Order Reversal

More generally, negative values of determinants are always associated with such
reversals of orientation. But this “sign” property of determinants is not of direct interest
for our purposes (even though the present Householder example will prove useful later).
Rather, we are primarily interested in the absolute value of determinants. As mentioned
above, these absolute values tell us exactly how volumes are transformed under linear
transformations. The standard example which is often shown in the literature is illustrated
in Figure A3.24 below:

[Figure: the unit square with vertices (0,0), (1,0), (0,1), (1,1) and area 1 (left) is mapped by A onto the parallelogram with vertices (0,0), (a_11, a_21), (a_12, a_22), (a_11 + a_12, a_21 + a_22) and area a_11 a_22 − a_12 a_21 (right).]
Figure A3.24. Volume Transformation

18
See for example section 0.3 in Horn and Johnson (1985)


In terms of Figure A3.3, we here set e_1 = (1,0)', e_2 = (0,1)', Ae_1 = (a_{11}, a_{21})' and
Ae_2 = (a_{12}, a_{22})' in order to emphasize the role of each matrix element. The key point is
that the unit area of the unit square on the left is transformed by A into a parallelogram on
the right with area given precisely by |A|, which in this case is seen to be positive (no
reversal of orientation). This in turn implies (from linearity) that every area on the left is
transformed by A into an area scaled by a factor of |A|. But even in this simple case, it is
not obvious that the parallelogram area should be given by a_{11}a_{22} − a_{12}a_{21}. While the
geometric proof in this case is not difficult, its generalization to linear transformations, A,
in ℝ^n is tedious, to say the least. So our first objective is to show that this relation
between volume and absolute determinant values can be made completely transparent in
terms of the SVD of A.

To do so, we must first deal with the (unfortunate) notational fact that the symbol, |·|, is
used both for determinants and absolute values. This is often resolved by using "det(A)"
for the determinant of A, so that its absolute value can be directly represented by
|det(A)|. But since the relevant determinants for our purposes will almost always be
nonnegative, we choose to stay with the simpler notation, |A|. Where it is essential to
specify absolute values of determinants (such as in the present section) we shall simply
write, |A|_{+}.

Aside from this notational convention, the only algebraic properties of determinants that
we require are the product rule,

(A3.2.70)   |AB| \;=\; |A|\,|B|

the symmetry rule,

(A3.2.71)   |A'| \;=\; |A|

and the diagonal rule,

(A3.2.72)   |diag(a_1, a_2,.., a_n)| \;=\; a_1 a_2 \cdots a_n

Note in particular that for absolute values, the product rule implies

(A3.2.73)   |AB|_{+} \;=\; |A|_{+}\,|B|_{+}

Together with the SVD Theorem, these properties of determinants imply that the absolute
determinant of any matrix is the product of its singular values, i.e., that for all
transformations, A, in (A3.2.28)

(A3.2.74)   A \;=\; U\,diag(s_1,.., s_n)\,V' \;\;\Longrightarrow\;\; |A|_{+} \;=\; \prod_{i=1}^{n} s_i


To see this, note first from (A3.2.72) that |I_n| = 1, so that by the defining property of
orthonormal transformations, U,

(A3.2.75)   1 \;=\; |I_n| \;=\; |U'U| \;=\; |U'|\,|U| \;=\; |U|^2 \;\;\Longrightarrow\;\; |U| \;=\; \pm 1

Hence the absolute determinant of U must be unity, i.e.,

(A3.2.76)   U orthonormal \;\;\Longrightarrow\;\; |U|_{+} \;=\; 1

It then follows from (A3.2.70) and (A3.2.72) that

(A3.2.77)   A \;=\; U\,diag(s_1,.., s_n)\,V' \;\;\Longrightarrow\;\; |A|_{+} \;=\; |U|_{+}\,|diag(s_1,.., s_n)|_{+}\,|V'|_{+}
            \;=\; (1)\,|diag(s_1,.., s_n)|_{+}\,(1) \;=\; \prod_{i=1}^{n} s_i

Using this result, it is a simple matter to show that for any linear transformation, A, on
ℝ^n, volumes are transformed by a factor of |A|_{+}. To do so, observe that if the unit cube
in ℝ^n is denoted by

(A3.2.78)   C_n \;=\; [0,1]^n \;=\; \{\,x = (x_1,.., x_n)' \in ℝ^n : 0 \leq x_i \leq 1,\; i = 1,..,n\,\}

and if we denote the volume of any set, T \subseteq ℝ^n, by vol(T),19 then clearly vol(C_n) = 1. So
if the image of C_n under transformation A is denoted by

(A3.2.79)   A(C_n) \;=\; \{\,Ax : x \in C_n\,\}

then it suffices to show that vol[A(C_n)] is always given by |A|_{+}. But since each linear
transformation scales all volumes by the same amount, if we now denote this common
scale factor by s(A) = vol[A(C_n)],20 then for all T \subseteq ℝ^n,

(A3.2.80)   \frac{vol[A(T)]}{vol(T)} \;=\; \frac{vol[A(C_n)]}{vol(C_n)} \;=\; s(A) \;\;\Longrightarrow\;\; vol[A(T)] \;=\; s(A)\,vol(T)

In these terms, our objective is to show that for any linear transformation, A, on ℝ^n,

19
The knowledgeable reader will note that technically we here refer to any measurable set, T \subseteq ℝ^n.

20
Be careful not to confuse scale factors, s(A), with singular values, s_i.


(A3.2.81)   s(A) \;=\; |A|_{+}

Here we need only appeal to certain elementary properties of volume itself. The most
fundamental property concerns scale transformations of individual coordinates. For
example, if a transformation scales all coordinate axes by 2, then volumes increase by a
factor of 2^n. More generally, since positive diagonal matrices, D = diag(d_1,.., d_n), scale
each coordinate, x_i, by a factor of d_i, i.e., since

(A3.2.82)   Dx \;=\; D\,(x_1,.., x_n)' \;=\; (d_1 x_1,.., d_n x_n)'\,,

it follows that vol[D(C_n)] = d_1 d_2 \cdots d_n, so that by definition

(A3.2.83)   D \;=\; diag(d_1, d_2,.., d_n) \;\;\Longrightarrow\;\; s(D) \;=\; \prod_{i=1}^{n} d_i

In fact, this is how volumes of n-dimensional “boxes” are computed. Note also that if
coordinates are scaled by factors, di , one at a time, then since the composition of these
transformations is precisely D, the cumulative effect of these scale changes is necessarily
multiplicative. More generally, the cumulative scale effect of any successive
transformations, say A followed by B , is always multiplicative. For example, if A doubles
volumes and B triples volumes, then the composite transformation, BA, increases
volumes by a factor of (2)(3) = 6. More generally, for all transformations, A and B,

(A3.2.84)   s(BA) \;=\; s(B)\,s(A)

The only other property of volume that we require is one we have already seen, namely
that orthonormal transformations preserve volumes. So by definition,

(A3.2.85)   U orthonormal \;\;\Longrightarrow\;\; s(U) \;=\; 1

Given these volume properties, it follows at once from the SVD Theorem together with
(A3.2.84) that

(A3.2.86)   A \;=\; U\,diag(s_1,.., s_n)\,V' \;\;\Longrightarrow\;\; s(A) \;=\; s(U)\,s[diag(s_1,.., s_n)]\,s(V')
            \;=\; (1)\,{\textstyle\prod_{i=1}^{n}} s_i\,(1) \;=\; |A|_{+}

This result has far reaching consequences for determinants, and shows why they play
such a fundamental role in linear algebra. With respect to matrix inverses in particular,
note that if |A| = 0 (so that |A|_{+} = 0) then s(A) = 0 implies that all volumes are


collapsed to zero. So from a geometric viewpoint, A must collapse the space into a lower
dimensional subspace, such as the examples in Figures A3.20 and A3.21 above.

A3.2.3 Linear Transformations of Random Vectors

The final objective of this section is to illustrate the consequences of these results for
linear transformations of random vectors. In particular, our objective is to complete the
derivation of the multi-normal distribution sketched in Section 3.2.1 of Part II in this
NOTEBOOK, and to show how the multi-normal density in (3.2.11) is derived. The key
element we focus on is the role of the determinant, |\Sigma|, of the covariance matrix, \Sigma. In
fact, this determinant reflects the volume transformation associated with a particular
linear transformation, as we now show. To do so, we start by considering the standard
normal random vector, X = (X_1,.., X_n)', of independent standard normal variates,
X_i \sim N(0,1), i = 1,..,n. Recall from the Linear Invariance Theorem of Section 3.2.2 of
Part II that if for some nonsingular matrix, A, the random vector, Y, is defined by

(A3.2.87)   Y \;=\; AX + \mu\,,

then since X \sim N(0, I_n), this theorem asserts that Y \sim N(\mu, \Sigma) with \Sigma = AA'. Moreover,
since all covariance matrices, \Sigma, are of this form for some A [as we have already seen
from the Cholesky Theorem in Appendix A2, below expression (A2.7.45)], it follows that
all multi-normal random vectors, Y, are derivable as linear transformations of the
standard normal vector, X. In fact, this is precisely how the general multi-normal
distribution is defined.

Our goal is to establish this result by starting with the probability density of the standard
normal random vector, X, and show how this density is transformed under (A3.2.87). To
do so, we first recall from the argument in (3.2.7) of Part II [with (\mu_i, \sigma_i) = (0,1),
i = 1,..,n] that the probability density, f(x) = f(x_1,.., x_n), of the standard normal random
vector, X, is necessarily of the form:

(A3.2.88)   f(x) \;=\; f(x_1) \cdots f(x_n) \;=\; \left(\tfrac{1}{\sqrt{2\pi}}\right) e^{-\frac{1}{2}x_1^2} \cdots \left(\tfrac{1}{\sqrt{2\pi}}\right) e^{-\frac{1}{2}x_n^2}
            \;=\; \left(\tfrac{1}{\sqrt{2\pi}}\right)^{n} e^{-\frac{1}{2}(x_1^2 + \cdots + x_n^2)} \;=\; (2\pi)^{-n/2}\, e^{-\frac{1}{2}x'x}

where x = (x_1,.., x_n)'. So to obtain the desired distribution of Y, it suffices to show that
this standard normal density is transformed by (A3.2.87) into a probability density,
g(y) = g(y_1,.., y_n), of the form (3.2.11) in Part II, i.e., that:

(A3.2.89)   g(y) \;=\; (2\pi)^{-n/2}\, |\Sigma|^{-1/2}\, e^{-\frac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)}


But before doing so, it is important to emphasize that even though expressions like
(A3.2.87) are usually referred to as "linear transformations", they technically involve
linear transformations, A, plus translation terms, \mu (and are properly classified as affine
transformations). Only in the case, \mu = 0, is this a linear transformation [as defined in
(A3.1.1) above]. So to simplify the present development further, we start with the case,
\mu = 0, where (A3.2.87) reduces to a proper linear transformation,

(A3.2.90)   Y \;=\; AX

It will be seen later that adding a nonzero translation term, \mu, is then a simple matter.

To begin this development, we start by observing that the role of the determinant, |\Sigma|, in
(A3.2.89) in fact has nothing to do with "normality" itself. To clarify this role, it is more
convenient to regard X in (A3.2.90) as a general continuous random vector with
density, f(x). To derive the associated density, g(y), of Y, we begin by recalling that all
probability densities are by definition representations of event probabilities in terms of
volumes. In particular, if for any selected value, y_0 = (y_{01},.., y_{0n})', of Y we consider small
intervals, C_{0i}(\varepsilon) = [y_{0i} - \varepsilon,\, y_{0i} + \varepsilon], about each component value, y_{0i}, for some \varepsilon > 0, and
denote the n-cube defined by these intervals as,

(A3.2.91)   C_0(\varepsilon) \;=\; C_{01}(\varepsilon) \times C_{02}(\varepsilon) \times \cdots \times C_{0n}(\varepsilon) \;\subseteq\; ℝ^n\,,

then the probability, Pr[Y \in C_0(\varepsilon)], of event C_0(\varepsilon) is represented by the integral of
density g over this region of ℝ^n, i.e.,

(A3.2.92)   Pr[\,Y \in C_0(\varepsilon)\,] \;=\; \int_{C_0(\varepsilon)} g(y)\, dy \;=\; \int_{C_{0n}(\varepsilon)} \!\!\cdots \int_{C_{01}(\varepsilon)} g(y_1,.., y_n)\, dy_1 \cdots dy_n

This is illustrated for the case of n = 2 on the right-hand side of Figure A3.25 below,
where the 2-cube, C_0(\varepsilon), is seen to be a square (shown in blue) about point, y_0:

[Figure: on the right, the square C_0(\varepsilon) = C_{01}(\varepsilon) \times C_{02}(\varepsilon) about y_0 in the (y_1, y_2)-plane, with the density height g(y_0) above it; on the left, its image A^{-1}[C_0(\varepsilon)] about A^{-1}y_0 in the (x_1, x_2)-plane, with density height f(A^{-1}y_0).]
Figure A3.25. Linear Transformation of Variables



So the probability integral in (A3.2.92) is simply the volume under that portion of
density, g, above this square (also shown in blue). The key point here is that if the value
of \varepsilon is sufficiently small, then this volume is well approximated by the box with base,
C_0(\varepsilon), and height, g(y_0). More precisely, if the area (more generally, volume) of this
base is denoted by vol[C_0(\varepsilon)], so that the volume of the box (height × base) is given by,
g(y_0)\,vol[C_0(\varepsilon)], then we obtain the approximation:

(A3.2.93)   Pr[\,Y \in C_0(\varepsilon)\,] \;=\; g(y_0)\,vol[C_0(\varepsilon)] \;+\; e_Y(\varepsilon)

where the magnitude of the error term, e_Y(\varepsilon), is assumed to be much smaller than
vol[C_0(\varepsilon)], so that as \varepsilon approaches zero,

(A3.2.94)   \lim_{\varepsilon \to 0} \frac{e_Y(\varepsilon)}{vol[C_0(\varepsilon)]} \;=\; 0

To gain some feeling for such error representations, observe that if we divide both sides
of (A3.2.93) by vol[C_0(\varepsilon)], and let \varepsilon \to 0, then we obtain

(A3.2.95)   \lim_{\varepsilon \to 0} \frac{Pr[\,Y \in C_0(\varepsilon)\,]}{vol[C_0(\varepsilon)]} \;=\; g(y_0)\,,

which is essentially the definition of the probability density, g(y_0), at y_0.

In order to associate these quantities with the random vector, X, observe from (A3.2.90)
that since y = Ax \Leftrightarrow x = A^{-1}y, it follows that Y-outcome, y, occurs iff X-outcome,
A^{-1}y, occurs. So by using the same image notation as in (A3.2.10) to write

(A3.2.96)   A^{-1}[C_0(\varepsilon)] \;=\; \{\,A^{-1}y : y \in C_0(\varepsilon)\,\}

we obtain the following probability identity

(A3.2.97)   Pr[\,Y \in C_0(\varepsilon)\,] \;=\; Pr\big[\,A^{-1}Y \in A^{-1}[C_0(\varepsilon)]\,\big] \;=\; Pr\big[\,X \in A^{-1}[C_0(\varepsilon)]\,\big]

This provides the key link between the X and Y distributions. The X-event in the last
equality is shown (for the n = 2 case) on the left-hand side of Figure A3.25 as a
parallelogram (in red) representing the image of C_0(\varepsilon) under A^{-1}. (Note also that the
bold red arrow shows the direction of this inverse relationship.) But since X is
continuously distributed with density, f, this probability again has a "box" approximation
with base, A^{-1}[C_0(\varepsilon)], and height, f(A^{-1}y_0), i.e.,


(A3.2.98)   Pr\big[\,X \in A^{-1}[C_0(\varepsilon)]\,\big] \;=\; f(A^{-1}y_0)\;vol\big[A^{-1}[C_0(\varepsilon)]\big] \;+\; e_X(\varepsilon)

where the error, e_X(\varepsilon), again satisfies

(A3.2.99)   \lim_{\varepsilon \to 0} \frac{e_X(\varepsilon)}{vol\big[A^{-1}[C_0(\varepsilon)]\big]} \;=\; 0

But now we are in a position to simplify (A3.2.98) by using (A3.2.86), together with
(A3.2.80), to obtain

(A3.2.100)   vol\big[A^{-1}[C_0(\varepsilon)]\big] \;=\; s(A^{-1})\,vol[C_0(\varepsilon)] \;=\; |A^{-1}|_{+}\,vol[C_0(\varepsilon)]

This can be further simplified by recalling from the same argument as (A3.2.75) that

(A3.2.101)   1 \;=\; |I_n| \;=\; |A A^{-1}| \;=\; |A|\,|A^{-1}| \;\;\Longrightarrow\;\; |A^{-1}| \;=\; |A|^{-1} \;\;\Longrightarrow\;\; |A^{-1}|_{+} \;=\; |A|_{+}^{-1}

so that (A3.2.100) becomes:

(A3.2.102)   vol\big[A^{-1}[C_0(\varepsilon)]\big] \;=\; |A|_{+}^{-1}\,vol[C_0(\varepsilon)]

By combining the results in (A3.2.93), (A3.2.97), (A3.2.98) and (A3.2.102), we obtain
the identity

(A3.2.103)   g(y_0)\,vol[C_0(\varepsilon)] + e_Y(\varepsilon) \;=\; Pr[\,Y \in C_0(\varepsilon)\,]
             \;=\; Pr\big[\,X \in A^{-1}[C_0(\varepsilon)]\,\big]
             \;=\; f(A^{-1}y_0)\,|A|_{+}^{-1}\,vol[C_0(\varepsilon)] + e_X(\varepsilon)

which after dividing by vol[C_0(\varepsilon)] and again using (A3.2.102) yields

(A3.2.104)   g(y_0) + \frac{e_Y(\varepsilon)}{vol[C_0(\varepsilon)]} \;=\; f(A^{-1}y_0)\,|A|_{+}^{-1} + \frac{e_X(\varepsilon)}{vol[C_0(\varepsilon)]}
             \;=\; |A|_{+}^{-1}\left[\, f(A^{-1}y_0) + \frac{e_X(\varepsilon)}{vol\big[A^{-1}[C_0(\varepsilon)]\big]} \,\right]

Finally, by letting \varepsilon \to 0 and using (A3.2.95) and (A3.2.99), we obtain the density
relation:


(A3.2.105)   g(y_0) + \lim_{\varepsilon \to 0} \frac{e_Y(\varepsilon)}{vol[C_0(\varepsilon)]} \;=\; |A|_{+}^{-1}\left[\, f(A^{-1}y_0) + \lim_{\varepsilon \to 0} \frac{e_X(\varepsilon)}{vol\big[A^{-1}[C_0(\varepsilon)]\big]} \,\right]
             \;\;\Longrightarrow\;\; g(y_0) \;=\; f(A^{-1}y_0)\,|A|_{+}^{-1}

But since this is an identity for all choices of y_0, we can now replace y_0 by y and write

(A3.2.106)   g(y) \;=\; f(A^{-1}y)\,|A|_{+}^{-1}

This is the key result for constructing density, g(y), from f(x) under linear
transformations, Y = AX, as in (A3.2.90). Essentially it asserts that the desired density,
g(y), at y is obtained by evaluating f at A^{-1}y and rescaling to adjust for the volume
changes created by A.
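
To make (A3.2.106) concrete, the following MATLAB sketch checks the relation in a hypothetical non-normal example: X is taken to be uniformly distributed on the unit square in ℝ^2 (so that f = 1 there), and Y = AX. By (A3.2.106), the density of Y on the image parallelogram should then be the constant 1/|A|_+ :

    A  = [2 1; 1 3];                    % a hypothetical nonsingular 2 x 2 matrix, |A| = 5
    N  = 1e6;
    X  = rand(2,N);                     % X uniform on the unit square, f(x) = 1 there
    Y  = A*X;                           % Y spreads this mass over the parallelogram A(C_2)
    y0 = A*[0.5; 0.5];  eps0 = 0.05;    % small box about an interior point y0
    inBox   = all(abs(Y - y0) < eps0, 1);
    empDens = mean(inBox)/(2*eps0)^2;   % Pr[Y in box] / vol(box)
    disp([empDens, 1/abs(det(A))])      % both approximately 0.2 = 1/|A|_+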

But before applying this result to the multi-normal case, we first extend (A3.2.106) to
include translations as in (A3.2.87). To do so, observe that if we now let Z = Y − \mu, so
that

(A3.2.107)   Y \;=\; AX + \mu \;\;\Longrightarrow\;\; Y - \mu \;=\; AX \;\;\Longrightarrow\;\; Z \;=\; AX

then Z is seen to be related to X by a linear transformation. Hence if h(z) denotes the
density of Z, then it follows from (A3.2.106) that

(A3.2.108)   h(z) \;=\; f(A^{-1}z)\,|A|_{+}^{-1}

But since Y is related to Z by a simple translation operator, T, defined by

(A3.2.109)   Y \;=\; T(Z) \;=\; Z + \mu

with associated inverse,

(A3.2.110)   Z \;=\; T^{-1}(Y) \;=\; Y - \mu

we can now use h(z) to obtain g(y) from these relations. Here the key point to note is
that translations on ℝ^n simply shift locations, and involve no rescaling of volumes.21 So
in fact, the relation between h and g in this case reduces simply to:

(A3.2.111)   g(y) \;=\; h(T^{-1}y) \;=\; h(y - \mu)

21
Here it is worth noting that the terms isometric transformations and rigid motions mentioned at the
beginning of Section A3.1.2 formally include translations as well as rotations and reflections, since all such
transformations preserve both distances and angles.


Finally, by combining (A3.2.108) and (A3.2.111), we obtain the desired general relation

(A3.2.112)   g(y) \;=\; f\big[A^{-1}(y - \mu)\big]\,|A|_{+}^{-1}

between densities g and f for the linear transformations plus translations in (A3.2.87).

Before applying this to the multi-normal case, we can make one additional observation
about covariances that is independent of normality. Recall from expression (3.2.21) in
Part II of the NOTEBOOK that

(A3.2.113)   Y \;=\; AX + \mu \;\;\Longrightarrow\;\; cov(Y) \;=\; A\,cov(X)\,A'

So if we let cov(Y) = \Sigma and assume that cov(X) = I_n, then as in the standard normal
case of (A3.2.87) we obtain the formal identity:

(A3.2.114)   \Sigma \;=\; A\,I_n\,A' \;=\; A\,A'

But by the determinantal identities in (A3.2.70) and (A3.2.71), this in turn implies that

(A3.2.115)   |\Sigma| \;=\; |A|\,|A'| \;=\; |A|^2 \;>\; 0

So (as we have already seen in Sylvester's Condition leading to the Cholesky Theorem in
Appendix A2) the determinant of every (nonsingular) covariance matrix is positive. This
means that "plus" subscripts can be dropped for determinants of covariance matrices. In
particular, by letting |\Sigma|^{1/2} denote the positive square root of |\Sigma|, it follows that

(A3.2.116)   |A|_{+} \;=\; |\Sigma|^{1/2}

and hence that (A3.2.112) can also be written as

(A3.2.117)   g(y) \;=\; f\big[A^{-1}(y - \mu)\big]\,|\Sigma|^{-1/2}

[with the understanding that \Sigma = cov(Y)].

Finally, to apply these results to the multi-normal case, we need only observe that if X is
standard normal, X \sim N(0, I_n), with density in (A3.2.88), then g(y) in (A3.2.117) takes
the form:

(A3.2.118)   g(y) \;=\; f\big[A^{-1}(y - \mu)\big]\,|\Sigma|^{-1/2}
             \;=\; (2\pi)^{-n/2}\, e^{-\frac{1}{2}[A^{-1}(y-\mu)]'[A^{-1}(y-\mu)]}\, |\Sigma|^{-1/2}


             \;=\; (2\pi)^{-n/2}\, |\Sigma|^{-1/2}\, e^{-\frac{1}{2}(y-\mu)'(A^{-1})'(A^{-1})(y-\mu)}

But by (A3.2.114) together with the matrix identities in (A3.1.18) and (A3.1.20) we see
that

(A3.2.119)   (A^{-1})'(A^{-1}) \;=\; (A')^{-1}A^{-1} \;=\; (A\,A')^{-1} \;=\; \Sigma^{-1}

so that (A3.2.118) becomes

(A3.2.120)   g(y) \;=\; (2\pi)^{-n/2}\, |\Sigma|^{-1/2}\, e^{-\frac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)}

Thus the resulting probability density is precisely that in (A3.2.89), and the multi-normal
case is established. In particular, the family of multi-normal random vectors,
Y \sim N(\mu, \Sigma), is seen to be generated by transformations, Y = AX + \mu, of the standard
normal random vector, X, satisfying \Sigma = AA', with A nonsingular. As an immediate
consequence of this, we have the following simple proof of the Linear Invariance
Theorem of Section 3.2.2 of Part II, which we now restate for convenience as:

Linear Invariance Theorem. For any multi-normal random vector, X \sim N(\mu, \Sigma),
and affine transformation, Y = AX + b, of X with A of full row rank, Y is also
multi-normally distributed as

(A3.2.121)   Y \;\sim\; N(A\mu + b,\; A\,\Sigma\,A')

Proof: If C denotes the Cholesky decomposition of \Sigma, so that \Sigma = CC', and if
we define the random vector, Z, by

(A3.2.122)   Z \;=\; C^{-1}(X - \mu)

so that by construction,

(A3.2.124)   X \;=\; CZ + \mu

then the argument above shows that Z \sim N(0, I_n). But since

(A3.2.125)   Y \;=\; AX + b \;=\; A(CZ + \mu) + b \;=\; (AC)\,Z + (A\mu + b)

shows that Y is an affine transformation of Z, the same argument shows that Y is multi-
normally distributed. Moreover, from expressions (3.2.18) and (3.2.21) in Part II we see
that the mean and covariance of Y are given respectively by
that the mean and covariance of Y are given respectively by


(A3.2.126)   E(Y) \;=\; AC\,E(Z) + (A\mu + b) \;=\; A\mu + b\,, \quad and

(A3.2.127)   cov(Y) \;=\; cov[\,AC\,Z + (A\mu + b)\,] \;=\; (AC)\,cov(Z)\,(AC)'
             \;=\; AC\,(I_n)\,C'A' \;=\; A\,(CC')\,A' \;=\; A\,\Sigma\,A'

Thus we must have Y \sim N(A\mu + b,\; A\,\Sigma\,A'), and the result is established. ∎

Finally, it is important to clarify the above requirement that A be of full row rank. Note in
particular that if A has fewer rows than columns, say m < n, then the random vector,
Y = AX + b, must be of dimension m (where b must also be m-dimensional so that vector
addition is well defined). So it is implicitly assumed that N(A\mu + b,\; A\Sigma A') is a multi-
normal distribution on ℝ^m with density given by replacing \mu and \Sigma in (A3.2.120) with
A\mu + b and A\Sigma A', respectively. With this in mind, it should be clear from (A3.2.120)
that such a density is only defined if A\Sigma A' is a nonsingular covariance matrix (i.e., with
a well defined inverse). As shown in Corollary 3 of Section A3.4.3 below, the condition
that A be of full row rank ensures that the m-square covariance matrix, A\Sigma A', will
indeed be nonsingular.
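
A simple simulation check of this theorem (and of the construction \Sigma = AA' above) is sketched below in MATLAB, using hypothetical choices of \mu, \Sigma, A and b:

    rng(2);
    mu = [1; 2];  Sigma = [2 1; 1 3];     % hypothetical mean and covariance
    C  = chol(Sigma,'lower');             % Cholesky factor: Sigma = C*C'
    N  = 1e5;
    X  = mu + C*randn(2,N);               % draws of X ~ N(mu, Sigma)
    A  = [1 2; 0 1];  b = [3; -1];        % hypothetical affine transformation
    Y  = A*X + b;
    disp([mean(Y,2), A*mu + b])           % sample mean vs A*mu + b
    disp(cov(Y'))                         % sample covariance vs A*Sigma*A'
    disp(A*Sigma*A')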


A3.3 Eigenvalues and Eigenvectors

As stated earlier, the single most important application of the Singular Value
Decomposition Theorem for our purposes is to provide a simple proof of the Spectral
Decomposition Theorem for symmetric matrices. Recall from (A3.2.2) that this theorem
asserts that if matrix A is symmetric (i.e., A' = A) then there exists an orthonormal
matrix, U, and a diagonal matrix, \Lambda, such that

(A3.3.1)   A \;=\; U\,\Lambda\,U'

The elements of \Lambda = diag(\lambda_1,.., \lambda_n) are called the eigenvalues of A and the columns of
U = (u_1,.., u_n) are the associated eigenvectors.
U  (u1 ,.., un ) are the associated eigenvectors. However, these concepts are much more
general, and indeed, provide additional geometric intuition about linear transformations
in general. In particular, it is useful to consider eigenvalues and eigenvectors for
nonnegative spatial weight matrices, W, which may possibly be non-symmetric (as for
example in the case of nearest-neighbor matrices). So it is convenient to start with a
broader consideration of these concepts, and then focus in on symmetric matrices.

For any given n-square matrix, A, and nonzero vector, x \in ℝ^n, if A maps x into a scalar
multiple of itself, i.e., if

(A3.3.2)   Ax \;=\; \lambda x

for some scalar, \lambda \in ℝ, then \lambda is designated as an eigenvalue of A with associated
eigenvector, x.22 In geometric terms, A simply "stretches" or "shrinks" each eigenvector,
x, by a factor, \lambda, with a reversal in direction if \lambda < 0.

Before analyzing these concepts, it is important to reiterate that our present view of (real-
valued) matrices, A \in ℝ^{n \times n}, is as representations of linear transformations on ℝ^n. So our
focus is naturally on the geometric properties of such transformations on the real vector
space, ℝ^n. But such matrices can also be viewed as representing a class of linear
transformations on the complex vector space, ℂ^n. This distinction is important for the
present discussion because the general theory of eigenvalues and eigenvectors treats such
matrices as linear transformations on ℂ^n. The reason for this can be seen by the
following equivalent view of eigenvalues. If we rewrite (A3.3.2) as,

(A3.3.3)   Ax - \lambda x \;=\; 0 \;\;\Longleftrightarrow\;\; (A - \lambda I_n)\,x \;=\; 0

then it becomes clear that the eigenvalues of A are precisely those values for which the
matrix, A - \lambda I_n, is singular. So, as was observed following expression (A3.2.86) above,
this is equivalent to the condition that

22
The word “eigen” is German for “own” as in belonging to. So the eigenvalues of A are also referred to as
its “own” values or “characteristic” values.


(A3.3.4)   |A - \lambda I_n| \;=\; 0

But by the definition of determinants, this is simply a polynomial equation in \lambda, called
the characteristic equation for A. For example, if n = 2, then by (A3.2.67) above,
expression (A3.3.4) takes the form,

(A3.3.5)   0 \;=\; \left| \begin{pmatrix} a_{11} - \lambda & a_{12} \\ a_{21} & a_{22} - \lambda \end{pmatrix} \right| \;=\; (a_{11} - \lambda)(a_{22} - \lambda) - a_{12}a_{21}

The eigenvalues of 2 × 2 matrices are thus seen to be the roots of a quadratic equation.
More generally, they are the roots of an nth-degree polynomial called the characteristic
polynomial for A. In this setting, the key result here is of course the Fundamental
Theorem of Algebra, which tells us that there are always exactly n roots to this equation
(counting repetitions) if we allow complex-valued roots. So if A is regarded as a linear
transformation on ℂ^n, where both \lambda and x in (A3.3.2) can be complex-valued, then one
obtains a very elegant and powerful theory of eigenvalues and eigenvectors. But from a
geometric viewpoint, there is a fundamental difference between the simple scaling of real
vectors in ℝ^n and the corresponding interpretation of expression (A3.3.2) in ℂ^n. In
particular, multiplication of complex numbers involves rotation as well as scaling. We
shall return to these issues in Section ?? below, where the geometric meaning of such
rotations will be interpreted in ℝ^n. But for the present our attention is restricted to the
case of real eigenvalues. Indeed, one major objective of these notes is to show that the
eigenvalues of symmetric matrices are always real – without any appeal to complex
numbers whatsoever. Hence, unless otherwise stated, we implicitly assume that the
relevant eigenvalues and eigenvectors for matrices, A, in this section are real valued, i.e.,
are meaningful for A as a linear transformation on ℝ^n.
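
The contrast between the real and complex cases is easy to see numerically. In the MATLAB sketch below (both matrices are hypothetical illustrations), a 2 × 2 rotation matrix, which turns every vector and so can have no real eigenvector, has a complex conjugate pair of eigenvalues, while a symmetric matrix has purely real ones:

    theta = pi/4;
    R = [cos(theta) -sin(theta); sin(theta) cos(theta)];   % rotation by 45 degrees
    disp(eig(R))        % complex pair cos(theta) +/- i*sin(theta)
    B = [2 1; 1 3];     % symmetric matrix
    disp(eig(B))        % real eigenvalues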

In this setting, we begin by noting from (A3.3.2) that each eigenvector, x, for \lambda is
inherently nonunique. In particular, every nonzero scalar multiple, \alpha x, of x is also an
eigenvector for \lambda, since

(A3.3.6)   A(\alpha x) \;=\; \alpha\,A x \;=\; \alpha\,\lambda x \;=\; \lambda\,(\alpha x)

To remove such obvious redundancies, representative eigenvectors are by convention
normalized to have unit length, \|x\| = 1.

With this normalization, the next question concerns the relation between eigenvectors for
distinct eigenvalues. Our objective is to show that such eigenvectors must always be
linearly independent. Here some geometric intuition can be gained by considering several
examples. We start with the simplest and most transparent example of eigenvalues and
associated eigenvectors, namely those for diagonal matrices, A = diag(a_{11},.., a_{nn}). Here it
is obvious that


(A3.3.7)   A\,I_n \;=\; A\,(e_1,.., e_n) \;=\; (e_1,.., e_n)\begin{pmatrix} a_{11} & & \\ & \ddots & \\ & & a_{nn} \end{pmatrix}
           \;\;\Longrightarrow\;\; A e_i \;=\; a_{ii}\, e_i\,, \quad i = 1,..,n

So if we now denote the set of eigenvalues for any matrix, A, by Eig(A), then for
diagonal matrices in (A3.3.7) it is clear that Eig(A) = {a_{ii} : i = 1,..,n}, with associated
eigenvectors, e_i, i = 1,..,n. This example shows that n-square matrices can indeed have n
distinct eigenvalues. Notice also that all eigenvectors in this case are in fact orthogonal,
and hence are necessarily linearly independent even if their eigenvalues are not distinct.
We shall see below that this property is shared by all symmetric matrices (of which
diagonal matrices are the simplest example).

Of course, the orthogonal basis in (A3.3.7) is a very special case. A more typical example
with a full set of eigenvalues is given by the following simple matrix

(A3.3.8)   A \;=\; \begin{pmatrix} 3 & 1 \\ 0 & 2 \end{pmatrix}

for which it can easily be verified that the eigenvalues of A are Eig(A) = \{\lambda_1, \lambda_2\} = \{3, 2\},
with associated eigenvectors given respectively by x_1 = e_1 = (1, 0)' and
x_2 = (1/\sqrt{2},\, -1/\sqrt{2})', as shown in Figure A3.26 below.

Figure A3.26. Non-orthogonal Example
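The small Python sketch below verifies this example numerically: the computed eigenvalues are 3 and 2, and the corresponding unit eigenvectors have a nonzero inner product, so they are not orthogonal, in contrast to the diagonal case above.

import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])
vals, vecs = np.linalg.eig(A)
print(vals)                        # the eigenvalues 3 and 2
print(vecs)                        # columns are unit-length eigenvectors
print(vecs[:, 0] @ vecs[:, 1])     # nonzero: the eigenvectors are not orthogonal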

In both these examples, the eigenvectors associated with distinct eigenvalues are indeed
linearly independent. But the question remains as to whether this is always true. To see
that it is, we now consider a general matrix, A, and suppose that λ₁ and λ₂ are two


distinct eigenvalues of A with associated eigenvectors, x₁ and x₂. Clearly, x₁ and x₂
cannot themselves be linearly dependent, since this would mean that x₁ = αx₂ for some
α ≠ 0. But the normalization condition, ||x₁|| = ||x₂|| = 1, would then imply that α = ±1,
and hence that x₁ = ±x₂, which is not possible for distinct eigenvalues λ₁ and λ₂.
However, we can still ask whether there could possibly be another vector, x₃, in
span(x₁, x₂) which is also an eigenvector of A with a distinct eigenvalue, λ₃ ≠ λᵢ, i = 1, 2.
A representation of span(x₁, x₂) is shown in Figure A3.27 below, where it is assumed for
the sake of illustration that 0 < λ₁ < λ₂ and that x₃ (shown in blue) is a positive linear
combination, x₃ = a x₁ + b x₂, of x₁ and x₂.

Figure A3.27. Linear Independence Example

Now if x₃ were an eigenvector of A with eigenvalue, λ₃, then by definition,

(A3.3.9)   Ax₃ = λ₃x₃ = λ₃(ax₁ + bx₂) = (λ₃a)x₁ + (λ₃b)x₂

So the coefficients (λ₃a, λ₃b) of this new linear combination of x₁ and x₂ would
necessarily be proportional to the original coefficients (a, b), as shown by all points on
the blue line in the figure. But by hypothesis,

(A3.3.10)   Ax₃ = A(ax₁ + bx₂) = a(Ax₁) + b(Ax₂) = (aλ₁)x₁ + (bλ₂)x₂

which together with 0 < λ₁ < λ₂, shows that in fact more weight is now placed on the
maximal eigenvector, x₂, and thus that proportionality cannot hold. More generally, the
same argument shows that the image of any vector, x₃ ∈ span(x₁, x₂) [not collinear with
either x₁ or x₂] is necessarily “pulled toward” this maximal eigenvector (shown by the
arrow in the figure), and cannot itself be an eigenvector. So we may conclude that no


eigenvector with eigenvalue distinct from λ₁ and λ₂ can be collinear with (x₁, x₂), i.e.,
can lie in span(x₁, x₂).

While this illustration involves only triples of distinct eigenvalues, the argument is in fact
quite general, and can be used to show that eigenvectors for distinct eigenvalues must
always be linearly independent.23 But since our main interest is in symmetric matrices,
where the argument will be seen to be even more transparent, the above example suffices
for our purposes.

A final property of eigenvalues relates to their possible repetitions, and can again be
illustrated most easily by diagonal matrices, A = diag(a₁₁,..,aₙₙ). Notice in particular that
this is the one case where the characteristic equation in (A3.3.4) is completely
transparent, since

(A3.3.11)   $0 \;=\; |\,A - \lambda I_n\,| \;=\; \begin{vmatrix} a_{11}-\lambda & & \\ & \ddots & \\ & & a_{nn}-\lambda \end{vmatrix} \;=\; (a_{11}-\lambda)\cdots(a_{nn}-\lambda)$

This implies at once that the diagonal elements of A are indeed the roots of its
characteristic equation. If some of these diagonal elements are the same, then such
repeated roots are designated as algebraic multiplicities. For example, the matrix
A = diag(1, 1, 3) has only two distinct eigenvalues, Eig(A) = {1, 3}, but since (1 − λ)
appears twice in (A3.3.11), this eigenvalue is said to have an algebraic multiplicity of two.
Notice also from (A3.3.7) that since there are two linearly independent eigenvectors for
this eigenvalue, namely e₁ and e₂, its geometric multiplicity (i.e., the maximum number
of its linearly independent eigenvectors) is also two. More generally, it follows at once
from (A3.3.7) that algebraic and geometric multiplicities of eigenvalues are always
identical for diagonal matrices.

But for general matrices, even when eigenvalues do exist, these two multiplicities need
not be the same. For example, while the algebraic and geometric multiplicities of λ = 2
in the diagonal matrix, A = diag(2, 2), are both equal to two, consider the following
(modest) variation of this matrix:

(A3.3.12)   $A = \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}$

This matrix is still nonsingular, and moreover, has the same characteristic equation, since
0 = |A − λI₂| = (2 − λ)(2 − λ). So the algebraic multiplicity of λ = 2 is two. But observe
that if x = (x₁, x₂) is any associated eigenvector, then

23
A simpler and more elegant proof of this fact is given in Lemma 1.3.8 of Horn and Johnson (1985). The
advantage of the present argument is that it provides some geometric intuition as to why this is true.


 2 1   x1   x1   2 x1  x2  2 x1 
(A3.3.13) Ax  3 x    x   2 x      x2  0
 0 2  2   2  2 x2  2 x2 

Moreover, there is only one eigenvector (up to a choice of sign) with this property,
namely, x = (1, 0). So the geometric multiplicity of this eigenvalue is one. Such matrices
are usually designated as defective matrices in the literature. The reason for this
“defective” property can be seen by plotting the transformation, as in Figure A3.28 below:

Figure A3.28. A “Defective” Transformation

Here we have used the vector notation, v = (x, y), for points in the plane, and have
displayed the unique eigenvector as v₀ = (1, 0), with associated image,
Av₀ = 2v₀ = (2, 0). To show where all other points are sent, we have fixed the y-
coordinate value at y = 1, and have plotted the four points, v₁ = (−2, 1), v₂ = (−1, 1),
v₃ = (0, 1), and v₄ = (1, 1), as shown in blue. Multiplying each of these four vectors by
the matrix A in (A3.3.12), we obtain the corresponding image vectors, Av₁ = (−3, 2),
Av₂ = (−1, 2), Av₃ = (1, 2), and Av₄ = (3, 2), shown in red. The key point to notice is
that all these image vectors are to the right of the original vectors, indicating that (along
with a certain amount of stretching) each vector has been rotated clockwise toward the
eigenvector, v₀. Similarly, by extending all vector arrows in the opposite direction
through the origin, it is clear that the vectors, −v₁, −v₂, −v₃, −v₄, are also rotated
clockwise toward the negative eigenvector, −v₀. This shows that all nonzero vectors
other than these unique eigenvectors are rotated clockwise to some degree, and thus
cannot be eigenvectors. So essentially, such matrices involve some form of non-rigid


rotations that can reduce the number of linearly independent eigenvectors associated with
repeated eigenvalues.
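A brief numerical sketch of this defective case (computing the nullity of A − 2I directly) confirms that the algebraic multiplicity of λ = 2 is two while its geometric multiplicity is only one:

import numpy as np

A = np.array([[2.0, 1.0],
              [0.0, 2.0]])
print(np.linalg.eigvals(A))                              # [2. 2.]  repeated root

# Geometric multiplicity = dimension of the null space of (A - 2I)
geom_mult = 2 - np.linalg.matrix_rank(A - 2.0 * np.eye(2))
print(geom_mult)                                         # 1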

Given these general properties of real eigenvalues and eigenvectors, our objective is to
apply these concepts to symmetric matrices in particular. But before doing so, it is
important to reiterate that not all matrices have a full complement of real eigenvalues.
The following simple orthonormal matrix will turn out to be a particularly important case
in point:

 0 1
(A3.3.14) U  
1 0 

Geometrically, this matrix rotates the plane counterclockwise through an angle of 90°, as
shown in Figure A3.29 below. Clearly no vector can possibly be mapped by this
transformation into a scalar times itself.

Figure A3.29. Example with No Eigenvalues

In algebraic terms, the characteristic equation of U takes the form,

(A3.3.15)   0 = |U − λI₂| = λ² + 1

which is seen to have only the “imaginary” solutions, λ = ±√−1. We shall return to this
example in Section ?? below.
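Numerically, asking a general-purpose eigenvalue routine for the eigenvalues of this rotation matrix simply returns the purely imaginary pair ±i from (A3.3.15), confirming that U has no real eigenvalues:

import numpy as np

U = np.array([[0.0, -1.0],
              [1.0,  0.0]])       # the rotation matrix in (A3.3.14)
print(np.linalg.eigvals(U))       # [0.+1.j  0.-1.j] : no real eigenvalues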

A3.4 Spectral Decomposition Theorem

We begin by recalling from the very beginning of Section A3.2 that there seems to be an
obvious relation between the Spectral Decomposition (SPD) Theorem for symmetric
matrices and the Singular Value Decomposition (SVD) Theorem for general matrices.
Since the SVD Theorem shows that for every matrix, A, there exist orthonormal matrices,
U, V, and a diagonal matrix of singular values, S , such that


(A3.4.1)   A = U S V′

it follows at once that for symmetric matrices, A, we must have

(A3.4.2)   U S V′ = A = A′ = V S U′

So at first glance, this identity would appear to suggest that U = V, and thus that (A3.3.1)
must hold with Λ = S. To see that this intuition is wrong, recall that |U| = ±1, which
together with the nonnegativity of the singular values, S, must imply that

(A3.4.3)   |A| = |U||S||U′| = |U||S||U| = |U|²|S| = |S| ≥ 0

and thus that the determinant of every symmetric matrix is nonnegative! But we have
already seen from (A3.2.69) that the symmetric (orthonormal) matrix

(A3.4.4)   $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$

has a negative determinant, |A| = −1. More generally, the fact that singular values are by
construction nonnegative shows that the relation between singular values and
eigenvalues for symmetric matrices is not immediately obvious.

This is made even more clear by a closer examination of this particular counterexample.
Here one can verify (by direct multiplication) that the SVD condition (A3.4.1) holds for
this matrix A with

(A3.4.5)   $U = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad S = \begin{pmatrix} 1 & \\ & 1 \end{pmatrix}, \qquad V = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$

Moreover, since U and V are easily seen to be orthonormal, this is indeed a singular value
decomposition of A with U ≠ V. Here it can also be verified by direct computation
that

(A3.4.6)   $U S U' = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \;\neq\; A \;\neq\; \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = V S V'$

so that neither U nor V yield spectral decompositions of A with Λ = S. But it turns out
that A does indeed have a unique spectral decomposition:

(A3.4.7)   A = W Λ W′

with orthonormal matrix, W, and diagonal matrix, Λ, given by


 1 1   1 
W   , 
(A3.4.8) 1
 
 1 1 
2 1

So at first glance, there would seem to be little relation between the decompositions if A
in (A3.4.5) and (A3.4.8). But closer inspection show that the absolute value of  is
precisely S. As we shall see below, this relationship is fundamental.
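The following short numerical sketch makes the same point for this counterexample: the singular values of A are both 1, while the symmetric eigendecomposition returns the eigenvalues −1 and 1, whose absolute values are precisely the singular values.

import numpy as np

A = np.array([[0.0, 1.0],
              [1.0, 0.0]])

s = np.linalg.svd(A, compute_uv=False)
print(s)                                         # [1. 1.]  singular values

lam, W = np.linalg.eigh(A)                       # spectral decomposition A = W diag(lam) W'
print(lam)                                       # [-1.  1.]
print(np.allclose(W @ np.diag(lam) @ W.T, A))    # True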

A3.4.1 Eigenvalues and Eigenvectors of Symmetric Matrices

To gain further insight, it is convenient for the moment to suppose that the SPD Theorem
is true, and to examine its geometric consequences. To do so, note first that if (A3.3.1)
holds for a symmetric matrix, A, then since U⁻¹ = U′, it follows at once that

(A3.4.9)   $A = U \Lambda U' \;\Rightarrow\; AU = U\Lambda(U'U) = U\Lambda \;\Rightarrow\; A(u_1,..,u_n) = (u_1,..,u_n)\begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix} \;\Rightarrow\; Au_i = \lambda_i u_i\,,\;\; i = 1,..,n$

Thus, as an extension of the diagonal-matrix case in (A3.3.7), we see that the diagonal
elements (λ₁,..,λₙ) of Λ must indeed be the eigenvalues of A with associated
orthonormal eigenvectors (u₁,..,uₙ), as asserted at the beginning of Section A3.3. Note
also that by definition this decomposition implies that all eigenvalues must be real.

Moreover, these eigenvalues and eigenvectors together imply that such matrices (like
diagonal matrices) are in fact representations of scale transformations with respect to
some appropriate coordinate system. This can be illustrated in two dimensions by the
symmetric matrix,

3 1
(A3.4.10) A 
 1 3

Here A does indeed have a spectral decomposition as in (A3.3.1) with Λ = diag(2, 4) and
orthonormal matrix,

(A3.4.11)   $U = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ -1 & 1 \end{pmatrix} = (u_1, u_2)$


[which is precisely W in (A3.4.8) above]. So the eigenvectors for this matrix are the two
diagonal vectors, u₁ and u₂, with corresponding eigenvalues, (λ₁ = 2, λ₂ = 4), as shown
in Figure A3.30 below:

Figure A3.30. Positive Eigenvalue Case

So if (u₁, u₂) are regarded as the coordinate axes, then A is seen to be a pure scale
transformation with respect to this coordinate system. More generally, if the spectral
decomposition of A is regarded as a composition of the respective transformations, U′, Λ,
and U, then we obtain a diagram very reminiscent of that in Figure A3.16, with V in the
last step replaced by U. In particular, the eigenvectors (i.e., columns of U) correspond
precisely to the principal axes of the ellipsoidal image of the unit circle under A, as
seen for n = 2 in the figure. So this shows geometrically that there must be an intimate
connection between the singular value decomposition (SVD) and spectral decomposition
(SPD) of symmetric matrices.

In fact for the present matrix, A, in (A3.4.10), these two decompositions are identical.
The special feature of this symmetric matrix that leads to this identity is that its
eigenvalues are all positive. What this implies is that these eigenvalues play exactly the
same role as singular values, i.e., they measure the lengths of these axes from the origin.
More generally, this suggests that the lengths of such axes for symmetric matrices, A,
should be precisely the absolute values of their eigenvalues. In other words, the
eigenvalues of A should differ only in sign from the associated singular values of A.
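A one-line numerical comparison (sketch) for the matrix in (A3.4.10) bears this out: its eigenvalues and singular values coincide (both are 2 and 4), so here the SVD and SPD agree.

import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])
print(np.sort(np.linalg.eigvalsh(A)))               # [2. 4.]
print(np.sort(np.linalg.svd(A, compute_uv=False)))  # [2. 4.]  same values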

All these conjectures will be shown to be true in the following sections. But for the
moment, we continue with our illustrations by reconsidering the counterexample in
expression (A3.4.4) above. As mentioned already, the eigenvectors here are precisely the


same as in (A3.4.11) above. So only the eigenvalues are different, as shown in Figure
A3.31 below.

Figure A3.31. Mixed Signs Case

The key point to notice [as was evident in the SVD of this matrix in expression (A3.4.5)
above] is that the unit circle is mapped onto itself by A. So any set of
orthonormal axes can be used for an SVD. However, this is not true for the SPD. In fact
there is exactly one eigenvector, u₁, associated with eigenvalue, λ₁ = −1, and exactly one
eigenvector, u₂, associated with eigenvalue, λ₂ = 1.24 So in contrast to the SVD, this
SPD is essentially unique. But it will be shown below that in spite of its nonuniqueness,
the SVD for A still contains enough information to allow the SPD to be constructed
explicitly. The essential reason for this is that the relations between U and V implicit in the
identity (A3.4.2) will yield additional analytical information.

Before proceeding to these analytical results, it should be noted that there is one
additional complication that cannot be illustrated in two dimensions. Consider the
following 4-dimensional version of the matrix in (A3.4.4) above:

 1
 
 1 
(A3.4.12) A
 1 
 
1 

which can be seen (by direct multiplication) to have eigenvalues, Λ = diag(1, 1, −1, −1),
with associated eigenvectors:

24 Again, remember that if uᵢ is an eigenvector for λᵢ, then so is −uᵢ. So we are implicitly ignoring this
trivial form of nonuniqueness.


 0 1 1 0
 
(A3.4.13) U  1  1 0 0 1
 (u1 , u2 , u3 , u4 )
2  1 0 0 1
 
 0 1 1 0

Note in particular that the unit sphere is again mapped onto itself by A, so that
any set of four mutually orthonormal axes can again be used to define an SVD. But now,
the two distinct eigenvalues, 1 and -1, both have two-dimensional spaces of eigenvectors,
namely span(u1 , u2 ) and span(u3 , u4 ) , respectively. So even the SPD is nonunique in this
example. Such cases require more effort to construct an admissible SPD from any given
SVD. So this general case will be treated by itself.
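A short numerical check of this 4-dimensional example confirms the repeated eigenvalues: both +1 and −1 occur twice, so each has a two-dimensional eigenspace and the SPD is no longer unique.

import numpy as np

# The 4 x 4 matrix of (A3.4.12), with ones on the anti-diagonal
A = np.fliplr(np.eye(4))
print(np.linalg.eigvalsh(A))      # [-1. -1.  1.  1.]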

With these examples in mind, we now proceed to establish the Spectral Decomposition
(SPD) Theorem in stages. The first task is to establish some general consequences of
singular value decompositions (SVD) for symmetric matrices. This will provide a general
foundation for the SPD results to follow.

A3.4.2 Some Consequences of SVD for Symmetric Matrices

Here we focus on the additional information contained in the identity (A3.4.2) for
symmetric matrices. These equalities can be rewritten in the following way:

(A3.4.14)   A = U S V′  ⟹  A V = U S

(A3.4.15)   A = A′ = V S U′  ⟹  A U = V S

By adding the right hand sides we obtain

(A3.4.16)   A(U + V) = U S + V S = (U + V) S

and similarly by subtracting the right hand sides we have

(A3.4.17)   A(U − V) = V S − U S = (U − V)(−S)

But if we now let

(A3.4.18)   X = U + V

(A3.4.19)   Y = U − V

then (A3.4.16) and (A3.4.17) yield the associated sets of eigenvalue equations:

(A3.4.20)   A X = X S


(A3.4.21)   A Y = Y (−S)

Note that neither matrix, X = (x₁,..,xₙ) nor Y = (y₁,..,yₙ), is orthonormal, or even
orthogonal. But nonetheless, the respective columns of each relation (A3.4.20) and
(A3.4.21) yield well defined eigenvalue relations:

(A3.4.22)   A xᵢ = sᵢ xᵢ ,  i = 1,..,n

(A3.4.23)   A yᵢ = (−sᵢ) yᵢ ,  i = 1,..,n

This shows us that each nonzero column of X and Y, namely each xᵢ = uᵢ + vᵢ ≠ 0 and
yᵢ = uᵢ − vᵢ ≠ 0, respectively, must yield a corresponding real eigenvalue, sᵢ or −sᵢ, for the
symmetric matrix, A. As we shall see below, many columns of X and/or Y must be zero.
But the key point to notice is that for all i = 1,..,n, the column pair (xᵢ, yᵢ) cannot both be
zero. For if so, then

uᵢ + vᵢ = xᵢ = 0 = yᵢ = uᵢ − vᵢ  ⟹  vᵢ = −vᵢ  ⟹  vᵢ = 0

which contradicts the normalization condition, ||vᵢ|| = 1. So conditions (A3.4.22) and
(A3.4.23) together will provide us with a full complement of real eigenvalues for A in
every case.
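These relations are easy to verify numerically. The sketch below (for a hypothetical symmetric matrix with eigenvalues 3 and −1, hence singular values 3 and 1) forms X = U + V and Y = U − V from a computed SVD and checks the eigenvalue equations (A3.4.20)-(A3.4.23); it also shows that, for each i, one of the columns xᵢ, yᵢ may vanish but not both.

import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])            # symmetric, eigenvalues 3 and -1
U, s, Vt = np.linalg.svd(A)
V = Vt.T
S = np.diag(s)

X = U + V
Y = U - V
print(np.allclose(A @ X, X @ S))      # True:  A X = X S
print(np.allclose(A @ Y, -Y @ S))     # True:  A Y = Y(-S)
print(np.linalg.norm(X, axis=0))      # column for s=1 is ~0 (since +1 is not an eigenvalue)
print(np.linalg.norm(Y, axis=0))      # column for s=3 is ~0 (since -3 is not an eigenvalue)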

Thus the first major consequence of these observations is that without loss of generality
we can focus our attention on real eigenvalues for symmetric matrices. This is of
sufficient importance to be stated formally. If the set of distinct eigenvalues for any matrix,
A, is denoted by Eig(A), and if we define symmetric matrices by the condition that
A = A′, then this first consequence of SVD can be stated as follows:25

(A3.4.24)   A = A′  ⟹  Eig(A) ⊂ ℝ

A second consequence (as suggested by the examples above) is that all eigenvalues in
(A3.4.22) and (A3.4.23) are either the singular values of A or their negatives. So the
absolute magnitudes of all eigenvalues can be determined by the SVD of A. To state this
more formally, let the set of distinct singular values of any matrix, A, be denoted by
Sing(A), and let the negatives of these values be denoted by −Sing(A). Then, in a
manner paralleling (A3.4.24), this second consequence of SVD can be stated as follows:

25
The standard proof of this fact is to show that eigenvalues of symmetric matrices must always be equal to
their complex conjugates, and hence must be real (see for example Theorem 4.1.3 in Horn and Johnson,
1985).


(A3.4.25)   A = A′  ⟹  Eig(A) ⊂ Sing(A) ∪ [−Sing(A)]

There is a third important consequence that relates to the eigenvectors associated with
distinct eigenvalues of symmetric matrices. Recall that in Figure A3.27 above a
geometric argument was sketched showing that eigenvectors for distinct (real)
eigenvalues are always linearly independent. For symmetric matrices we have the
stronger property that such eigenvectors must actually be orthogonal. This can be
demonstrated as follows:26

Orthogonality of Distinct Eigenvectors. For any symmetric matrix, A, and
eigenvectors, xᵢ, xⱼ, associated with distinct eigenvalues, λᵢ, λⱼ ∈ Eig(A),

(A3.4.26)   λᵢ ≠ λⱼ  ⟹  xᵢ′xⱼ = 0

Proof: By definition we must have,

(A3.4.27)   A xᵢ = λᵢ xᵢ , and

(A3.4.28)   A xⱼ = λⱼ xⱼ

But premultiplying (A3.4.27) by xⱼ′ and employing the symmetry of A, we see that,

(A3.4.29)   λᵢ xⱼ′xᵢ = xⱼ′(A xᵢ) = (A xⱼ)′xᵢ = (λⱼ xⱼ)′xᵢ = λⱼ xⱼ′xᵢ  ⟹  (xⱼ′xᵢ)(λᵢ − λⱼ) = 0

So if λᵢ ≠ λⱼ then we may conclude that xⱼ′xᵢ = 0 = xᵢ′xⱼ. ∎

Given these properties of eigenvalues and eigenvectors for symmetric matrices, the key
questions remaining are (i) how to identify which of the values on the right hand side of
(A3.4.25) are relevant in any particular case, and (ii ) how to construct their associated
eigenvectors in terms of the SVD of A. To answer these questions, we shall proceed on a
case-by-case basis from the simplest to the most general cases.

26 It should be noted that both the statement and proof of this result make constant use of property
(A3.4.24), since eigenvectors for real eigenvalues can always be restricted to ℝⁿ. This allows orthogonality
(and indeed all inner products) to be defined solely on ℝⁿ. While this same analysis can of course be
carried out on ℂⁿ using complex inner products, property (A3.4.24) shows that this is not necessary for real
symmetric matrices.


A3.4.3 Spectral Decomposition for Symmetric Definite and Semidefinite Matrices

The simplest and by far the most important cases for our purposes all involve symmetric
definite or semidefinite matrices. So this is the best place to begin. Recall from
expressions (A2.7.36) and (A2.7.67) in Appendix A2 that an n-square matrix, A, is
positive semidefinite iff for all x ∈ ℝⁿ,

(A3.4.30)   x ≠ 0  ⟹  x′Ax ≥ 0

and is positive definite iff this inequality is strict, i.e., iff

(A3.4.31)   x ≠ 0  ⟹  x′Ax > 0

Moreover, A is negative definite (semidefinite) iff −A is positive definite (semidefinite).


Since all results for symmetric positive definite and semidefinite matrices are
immediately extendable to their negative counterparts by just reversing signs, we focus
only on (A3.4.30) and (A3.4.31). Hence our first result is to show that for symmetric
positive semidefinite matrices, A, the SVD and SPD of A are essentially identical. In
particular, the eigenvalues of A are precisely its singular values, and their associated
eigenvectors can be taken directly from the SVD of A. Moreover, if A is positive definite
then all eigenvalues of A are positive, and each SVD for A is precisely an SPD for A.
These results can be stated more formally as follows:27

Spectral Decomposition of Symmetric Positive Semidefinite Matrices

(i) If A is a symmetric positive semidefinite matrix with SVD,

(A3.4.32)   A = U S V′ ,

then it must be true that

(A3.4.33)   A = U S U′ = V S V′

(ii) If in addition A is positive definite, then diag(S) > 0 and U = V.

Proof: (i) To establish the first equality, it must be shown that

(A3.4.34)   Auᵢ = sᵢuᵢ

27 It is of interest to note that a direct proof for this case follows from the standard construction of principal
components in multivariate analysis, which in fact closely parallels the above proof of the Singular Value
Decomposition Theorem. See for example the classic treatment in Anderson (1958, pp.273-275).


for all i = 1,..,n. But by applying the same column decomposition in (A3.2.28) to
(A3.4.15) for the SVD in (A3.4.32), it follows that

(A3.4.35)   Auᵢ = sᵢvᵢ ,  i = 1,..,n

Given this representation, there are two cases to consider. First, if sᵢ = 0, then it follows at
once from (A3.4.35) that,

(A3.4.36)   Auᵢ = sᵢvᵢ = 0 = sᵢuᵢ

On the other hand, if sᵢ > 0, then observe from (A3.4.23) that we must have yᵢ = 0. For if
not, then since yᵢ ≠ 0 ⟹ yᵢ′yᵢ > 0, it would follow from (A3.4.23) that

(A3.4.37)   Ayᵢ = −sᵢ yᵢ  ⟹  yᵢ′Ayᵢ = −sᵢ yᵢ′yᵢ < 0

which contradicts the positive semidefiniteness of A. Thus we must have

(A3.4.38)   0 = yᵢ = uᵢ − vᵢ  ⟹  uᵢ = vᵢ

and may conclude again from (A3.4.35) that

(A3.4.39)   Auᵢ = sᵢvᵢ = sᵢuᵢ

So (A3.4.34) must hold in all cases, and the first equality in (A3.4.33) is established. The
second equality follows in exactly the same way by replacing (A3.4.15) with (A3.4.14)
and thus switching the roles of uᵢ and vᵢ in (A3.4.35).

(ii) Finally, if A is positive definite, then since uᵢ′uᵢ = ||uᵢ||² = 1 for each i = 1,..,n, it
follows from (A3.4.34) that

(A3.4.40)   Auᵢ = sᵢuᵢ  ⟹  uᵢ′Auᵢ = sᵢ uᵢ′uᵢ = sᵢ

and hence from positive definiteness that sᵢ > 0. Thus diag(S) = diag(s₁,..,sₙ) > 0.
Moreover, since the argument in (A3.4.37) and (A3.4.38) now holds for all i = 1,..,n, it
also follows that U = V. ∎

For symmetric positive definite matrices, A, the above theorem (now referred to as SPD
Theorem 1), shows that the two decompositions, SVD and SPD, of A exhibit a one-to-one
correspondence. As a direct consequence of this correspondence, we now have the
following additional characterizations of positive definiteness:


Corollary 1. For any symmetric positive semidefinite matrix, A, the following three
properties are equivalent:

(A3.4.41) A is positive definite.


(A3.4.42) A has all positive eigenvalues.

(A3.4.43) A is nonsingular.

Proof: To establish this equivalence it suffices to show that (A3.4.41) ⟹ (A3.4.42)
⟹ (A3.4.43) ⟹ (A3.4.41). But if A is positive definite and λ is any eigenvalue of A
with eigenvector, x, then by (A3.4.31) together with x′x > 0 it follows that,

(A3.4.44)   Ax = λx  ⟹  x′Ax = λ x′x  ⟹  λ = x′Ax / x′x > 0

and thus that all eigenvalues must be positive. Next, to show that positive eigenvalues
imply nonsingularity, observe that since the symmetric positive semidefiniteness of A
implies from part (i) of SPD Theorem 1 that A has a spectral decomposition,

(A3.4.45)   A = U S U′

[given by the first equality in (A3.4.33)], it follows that if all eigenvalues are positive,
then the positive diagonal matrix, S, in (A3.4.45) must have a well defined inverse, S⁻¹.
But this together with the orthonormality of matrix, U, implies that

(A3.4.46)   U S⁻¹ U′ = (U S U′)⁻¹ = A⁻¹

and thus that A is nonsingular. Finally, to show that nonsingularity implies positive
definiteness, note first from the nonnegativity of the diagonal matrix, S, in (A3.4.45) that
S has a well defined square root,

(A3.4.47)   S^{1/2} = diag(s₁^{1/2},.., sₙ^{1/2})

satisfying S = S^{1/2} S^{1/2}. So for any x ∈ ℝⁿ it follows that,

(A3.4.48)   x′Ax = x′U S U′x = x′U S^{1/2} S^{1/2} U′x = (S^{1/2} U′x)′(S^{1/2} U′x) = ||S^{1/2} U′x||²

But since for any vector z, ||z||² = 0 ⟹ ||z|| = 0 ⟹ z = 0, we see from (A3.4.48) that

(A3.4.49)   x′Ax = 0  ⟹  S^{1/2} U′x = 0  ⟹  U S^{1/2}(S^{1/2} U′x) = 0  ⟹  Ax = 0

So if x′Ax = 0 for any x ≠ 0, then it would also be true that Ax = 0, which contradicts
the nonsingularity of A. Thus, nonsingularity together with the positive semidefiniteness

of A imply that x′Ax > 0 must hold whenever x ≠ 0, and it follows that A is positive
definite. ∎
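As a small numerical illustration (with a hypothetical symmetric positive definite matrix), the three equivalent properties of Corollary 1 can be checked directly:

import numpy as np

B = np.array([[2.0, 1.0],
              [1.0, 1.0]])                    # hypothetical symmetric positive definite matrix
print(np.linalg.eigvalsh(B))                  # all eigenvalues positive
print(abs(np.linalg.det(B)) > 1e-12)          # True: nonsingular
x = np.random.default_rng(0).normal(size=(5, 2))
print(all(v @ B @ v > 0 for v in x))          # True: x'Bx > 0 for randomly drawn x != 0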

For our present purposes, the single most important application of these results is to
characterize the spectral properties of covariance matrices. To begin with, we can now
give a more complete statement of the Positive Definiteness Property for nonsingular
covariance matrices stated in Appendix A2 (page A2-27):

Corollary 2. Every nonsingular covariance matrix, Σ, is positive definite with spectral
decomposition:

(A3.4.50)   Σ = U Λ U′ ,  diag(Λ) > 0

Proof: For convenience we start by repeating the argument in Appendix A2. First recall
that if the covariance matrix of a random vector, X = (X₁,..,Xₙ), is denoted by
Σ = cov(X), then the symmetry of covariances, σᵢⱼ = cov(Xᵢ, Xⱼ) = cov(Xⱼ, Xᵢ) = σⱼᵢ,
implies that Σ is symmetric. Moreover, since for any coefficient vector, a ≠ 0, we must
have

(A3.4.51)   a′Σa = var(a′X) ≥ 0

it follows that Σ is positive semidefinite. Hence by SPD Theorem 1 together with
Corollary 1 above, it follows at once that nonsingularity of Σ implies both positive
definiteness and (A3.4.50). ∎

In addition, recall from the discussion following the Linear Invariance Theorem in
Section A.2.3 that reduced covariance matrices of the form, AΣA′, were asserted to be
nonsingular whenever A is of full row rank. We are now in a position to establish this
result:

Corollary 3. For any nonsingular n-square covariance matrix, Σ, and any m × n
matrix, A, with 1 ≤ m ≤ n, if A is of full row rank then AΣA′ is also a nonsingular (m-
square) covariance matrix.

Proof: The matrix, AΣA′, has already been shown to be an m-square covariance matrix
in expression (3.2.21) of Part II of these notes. So it remains to be shown that AΣA′ is
nonsingular. To do so, recall first (from the end of Section A3.1.1) that A is of full row
rank iff its rows are linearly independent. But since these rows are precisely the columns
of A′ = (a₁,.., aₘ), it follows from the definition of linear independence [expression
(A3.1.24)] that for any x = (x₁,.., xₘ) ∈ ℝᵐ,

(A3.4.52)   A′x = 0  ⟹  Σᵢ₌₁ᵐ xᵢaᵢ = 0  ⟹  xᵢ = 0, i = 1,..,m  ⟹  x = 0


Moreover, if Σ is a nonsingular covariance matrix, then by (A3.4.50),

(A3.4.53)   Σ = U Λ U′ = U Λ^{1/2} Λ^{1/2} U′

where again Λ^{1/2} = diag(λ₁^{1/2},.., λₙ^{1/2}). So by essentially the same argument as in
(A3.4.48) and (A3.4.49) it follows that for any x ∈ ℝᵐ,

(A3.4.54)   x′AΣA′x = 0  ⟹  x′A U Λ^{1/2} Λ^{1/2} U′A′x = 0
⟹  (Λ^{1/2} U′A′x)′(Λ^{1/2} U′A′x) = 0
⟹  ||Λ^{1/2} U′A′x||² = 0
⟹  Λ^{1/2} U′A′x = 0
⟹  U Λ^{1/2}(Λ^{1/2} U′A′x) = 0
⟹  Σ A′x = 0

But this together with (A3.4.52) and the nonsingularity of Σ then shows that

(A3.4.55)   x′AΣA′x = 0  ⟹  Σ(A′x) = 0  ⟹  A′x = 0  ⟹  x = 0

Finally, since the covariance matrix, AΣA′, is symmetric and positive semidefinite by
(A3.4.51), it must then be true that

x ≠ 0  ⟹  x′(AΣA′)x > 0

for all x ∈ ℝᵐ. Thus AΣA′ is positive definite, and we may conclude from (A3.4.43) that
AΣA′ is also nonsingular. ∎
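The sketch below illustrates Corollary 3 with a hypothetical 3 × 3 covariance matrix Σ and a 2 × 3 matrix A of full row rank: the reduced covariance matrix AΣA′ is again symmetric with strictly positive eigenvalues, hence nonsingular.

import numpy as np

Sigma = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])        # hypothetical nonsingular covariance matrix
A = np.array([[1.0, 0.0,  1.0],
              [0.0, 1.0, -1.0]])           # 2 x 3 with full row rank

reduced = A @ Sigma @ A.T                  # the reduced covariance matrix A Sigma A'
print(np.linalg.eigvalsh(reduced))         # both eigenvalues positive
print(np.linalg.matrix_rank(reduced))      # 2 : nonsingular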

A3.4.4 Spectral Decompositions with Distinct Eigenvalues

There is a second class of symmetric matrices for which each SVD directly yields a
unique SPD, namely those symmetric matrices for which all eigenvalues are distinct.
Here it is of interest to recall the example given in expression (A3.4.4) above, i.e.,

(A3.4.56)   $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$

with distinct eigenvalues, Eig(A) = {1, −1}, but with (necessarily) repeating singular
values given by the absolute values of Eig(A), so that A has only one distinct singular
value, namely Sing(A) = {1}. Here the SVD in (A3.4.5) appeared to exhibit little direct


relation to the SPD in (A3.4.8). Indeed, the situation is even worse for this matrix. In
particular, since the unit circle is mapped onto itself by A, it follows that every pair of
orthogonal unit vectors can serve as principal axes for this “ellipse”. This nonuniqueness
can be seen algebraically by observing that since A is itself orthonormal, it follows that
for any other orthonormal matrix, V, the product, U = AV, must also be orthonormal. But
since the singular values of A are given by the identity matrix, S = I₂, we may then use
V to construct a distinct SVD for A by the product:

(A3.4.57)   A = AVV′ = (AV)(I₂)V′ = U S V′

Thus there are seen to be infinitely many SVDs for A. On the other hand, since the
eigenvalues of A are distinct, we have already seen from (A3.4.26) that their
corresponding eigenvectors must be orthogonal, and thus must form a basis for ℝ². So
these eigenvectors [in (A3.4.8)] must in fact be unique (up to a choice of signs). Given
this stark contrast, it would appear that there is little hope of constructing the unique SPD
for A from its highly nonunique SVDs. But as we now show, this can indeed be done so
long as eigenvalues are distinct in the sense that each has a geometric multiplicity of one
[as in the case of (A3.4.56)]. Note also from the orthogonality of eigenvectors for distinct
eigenvalues in (A3.4.26) that this in turn implies that the SPD for such symmetric
matrices must be unique. With these observations, we now show that:

Spectral Decomposition of Symmetric Matrices with Distinct Eigenvalues

If A is a symmetric matrix with distinct eigenvalues, then each SVD,

(A3.4.58)   $A = U S V' = (u_1,..,u_n)\begin{pmatrix} s_1 & & \\ & \ddots & \\ & & s_n \end{pmatrix}\begin{pmatrix} v_1' \\ \vdots \\ v_n' \end{pmatrix}$ ,

of A yields exactly the same SPD,

(A3.4.59)   $A = W \Lambda W' = (w_1,..,w_n)\begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix}\begin{pmatrix} w_1' \\ \vdots \\ w_n' \end{pmatrix}$

in terms of the relations in (A3.4.22) and (A3.4.23).

Proof: Our objective is to give an explicit construction of (A3.4.59) in terms of
(A3.4.58). To do so, we note first from the assumed distinctness of eigenvalues that there
can be at most one zero eigenvalue for A, say λᵢ = 0, with eigenvector satisfying,


Awᵢ = 0. But since this implies that A is singular, there must be at least one zero singular
value [for otherwise, we would have |A| ≠ 0 by (A3.2.77), which contradicts the
singularity of A]. Moreover, if there were more than one, the argument in (A3.4.26)
shows that there would be more than one zero eigenvalue for A, which contradicts the
distinctness assumption. So there is exactly one uᵢ with Auᵢ = 0 as in (A3.4.36). Thus we
may then set wᵢ = uᵢ, and conclude from (A3.4.26) that this will always form an
admissible entry in the orthonormal matrix, W, of (A3.4.59). For the positive singular
values, sᵢ ∈ diag(S), we now consider the possible distinct eigenvalues they can generate
by (A3.4.22) and (A3.4.23). In view of the distinctness assumption, either exactly one of
the values (sᵢ, −sᵢ) belongs to Eig(A) or both do. The first case is the simplest, and is
equivalent to the condition derived from (A3.4.58) that sᵢ only appear in one equation of
the equation systems (A3.4.22) and (A3.4.23) with a nonzero eigenvector. If for
notational simplicity we let (λᵢ, wᵢ) denote the associated eigenvalue-eigenvector pair to
be constructed in (A3.4.59),28 then by using the definitions of xᵢ = uᵢ + vᵢ and yᵢ = uᵢ − vᵢ
in (A3.4.22) and (A3.4.23), respectively, (and recalling that at least one of these vectors
must be nonzero) we may set

(A3.4.60)   $w_i = \begin{cases} \dfrac{u_i + v_i}{\|u_i + v_i\|}\,, & \text{if } u_i + v_i \neq 0 \\[3mm] \dfrac{u_i - v_i}{\|u_i - v_i\|}\,, & \text{if } u_i + v_i = 0 \end{cases}$

and similarly, set

(A3.4.61)   $\lambda_i = \begin{cases} \;\;\, s_i\,, & \text{if } u_i + v_i \neq 0 \\ -s_i\,, & \text{if } u_i + v_i = 0 \end{cases}$

Again the orthogonality of eigenvectors for distinct eigenvalues in (A3.4.26) guarantees


that these normalized vectors are automatically admissible components of W.

Turning to the second case, where both ( si ,  si ) appear in equation systems (A3.4.22)
and (A3.4.23) with nonzero eigenvectors, observe that si must appear twice in diag ( S ) ,
say in positions i and j . If we consider the values of x and y in columns i and j of
both (A3.4.22) and (A3.4.23), namely,

28 More formally, the rows and columns of A can always be permuted to satisfy this relation. The standard
convention in the literature is thus to say that “by relabeling if necessary” we can use i for both sᵢ and its
associated eigenvalue, λᵢ.


(A3.4.62)   xᵢ = uᵢ + vᵢ ,  yᵢ = uᵢ − vᵢ
            xⱼ = uⱼ + vⱼ ,  yⱼ = uⱼ − vⱼ

then it must be true that either xᵢ or xⱼ is nonzero, and similarly that either yᵢ or yⱼ is
nonzero. But if both xᵢ and xⱼ are nonzero, then they must be scalar multiples of one
another. For otherwise, columns i and j of equation system (A3.4.22) would yield two
linearly independent solutions, (Axᵢ = sᵢxᵢ, Axⱼ = sⱼxⱼ) with sᵢ = sⱼ, and it would follow
that the eigenvalue, sᵢ, has a multiplicity of two. But since this contradicts the assumption
of distinct eigenvalues, xᵢ and xⱼ must be linearly dependent, i.e., scalar multiples of one
another. This in turn implies that they must have the same normalizations (up to sign),
which can be written in terms of u and v by:

(A3.4.63)   $\dfrac{u_i + v_i}{\|u_i + v_i\|} = \pm\,\dfrac{u_j + v_j}{\|u_j + v_j\|}$

Moreover, exactly the same argument for yᵢ and yⱼ shows that if both are nonzero
then

(A3.4.64)   $\dfrac{u_i - v_i}{\|u_i - v_i\|} = \pm\,\dfrac{u_j - v_j}{\|u_j - v_j\|}$

With these observations, if we now denote the eigenvalues for sᵢ and sⱼ (= sᵢ) in
(A3.4.59) by λᵢ and λⱼ (again using the convention in footnote 28), then by definition,
λᵢ = sᵢ > 0 and λⱼ = −sᵢ < 0, with associated eigenvectors given respectively by

(A3.4.65)   $w_i = \begin{cases} \dfrac{u_i + v_i}{\|u_i + v_i\|}\,, & \text{if } u_i + v_i \neq 0 \\[3mm] \dfrac{u_j + v_j}{\|u_j + v_j\|}\,, & \text{if } u_i + v_i = 0 \end{cases}$

(A3.4.66)   $w_j = \begin{cases} \dfrac{u_j - v_j}{\|u_j - v_j\|}\,, & \text{if } u_j - v_j \neq 0 \\[3mm] \dfrac{u_i - v_i}{\|u_i - v_i\|}\,, & \text{if } u_j - v_j = 0 \end{cases}$

Again, it follows from (A3.4.63) and (A3.4.64) that these choices of wi and w j are
insensitive to whether the i th or j th quantities are used first on the right hand sides of


(A3.4.65) and (A3.4.66). Note also from the orthogonality of eigenvectors for distinct
eigenvalues that these normalized vectors will always yield admissible components of W.

Finally, since the multiplicity of each singular value, s ∈ diag(S), determines exactly the
number of eigenvalues generated by s (including the s = 0 case), it follows that this
procedure must generate precisely n eigenvalues (λ₁,..,λₙ) with corresponding
orthonormal eigenvectors (w₁,..,wₙ) generating a basis for ℝⁿ. So by construction, this
procedure must yield a complete representation of A as in (A3.4.59). ∎

So for the case of distinct eigenvalues, we see that the unique SPD for a symmetric matrix
A can be explicitly constructed from any of its possible SVDs. Here it is instructive to see
how this procedure works for the example in (A3.4.56). In this case, the SVD produced
in (A3.4.5) yields

(A3.4.67)   $X = U + V = (u_1 + v_1,\, u_2 + v_2) = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$

(A3.4.68)   $Y = U - V = (u_1 - v_1,\, u_2 - v_2) = \begin{pmatrix} -1 & 1 \\ 1 & -1 \end{pmatrix}$

So this is a case where all four elements of (A3.4.62) are nonzero, and thus where the
identities in (A3.4.63) and (A3.4.64) are seen to hold. Moreover, since the norms of all
these vectors are seen to be √2, it follows that they yield precisely the pair of normalized
eigenvectors in (A3.4.8). Moreover, one can verify by direct computation that any choice
of an orthonormal matrix, V, in (A3.4.57) will always produce vectors that are scalar
multiples of those in (A3.4.67) and (A3.4.68), and thus will yield the same eigenvectors
for W.
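For the simpler situation in which the singular values themselves are all distinct (so that the paired rule in (A3.4.65)-(A3.4.66) is not needed), the per-column construction in (A3.4.60)-(A3.4.61) can be coded directly. The sketch below applies it to a hypothetical symmetric matrix with eigenvalues 3, −1 and −4, recovering a valid spectral decomposition from a numerically computed SVD.

import numpy as np

A = np.array([[1.0, 2.0,  0.0],
              [2.0, 1.0,  0.0],
              [0.0, 0.0, -4.0]])   # hypothetical symmetric matrix, eigenvalues 3, -1, -4
U, s, Vt = np.linalg.svd(A)
V = Vt.T

W_cols, lams = [], []
for i in range(3):
    x = U[:, i] + V[:, i]
    y = U[:, i] - V[:, i]
    if np.linalg.norm(x) > 1e-10:                     # s_i is an eigenvalue of A
        W_cols.append(x / np.linalg.norm(x)); lams.append(s[i])
    else:                                             # otherwise -s_i is an eigenvalue
        W_cols.append(y / np.linalg.norm(y)); lams.append(-s[i])

W = np.column_stack(W_cols)
Lam = np.diag(lams)
print(np.allclose(W @ Lam @ W.T, A))                  # True:  A = W Lam W'
print(np.allclose(W.T @ W, np.eye(3)))                # True:  W is orthonormal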

Finally it is important to note that the case of distinct eigenvalues is overwhelmingly the
most common case observed in practice. Indeed, it is a simple matter to show that within
the space of all n -square symmetric matrices, the subset possessing two or more common
eigenvalues must have zero volume. So if one were to choose a symmetric matrix at
random, then with probability one, this matrix will have all distinct eigenvalues.

A3.4.5 General Spectral Decomposition Theorem

Nonetheless, it is clear that in numerous modeling contexts, theoretical considerations


can often lead to symmetric matrices with additional structure yielding repeated
eigenvalues. In particular, for singular matrices it is clear that the zero eigenvalue may be
repeated many times. So it is of practical interest to show that the information contained
in SVDs for such matrices can still be used to construct their SPDs. The major difference
in this general case with repeated eigenvalues is that the SPD itself is not unique. So there
will necessarily be some degree of nonuniqueness in the SVD construction of SPDs.


Here we begin with one preliminary result that will enable us to verify dimensional
consistency for all subspaces of repeated eigenvalues. In particular, note that for any pair
of n × k matrices, A and B, the matrix sum, A + B, is well defined. In addition, A and
B are said to be mutually orthogonal iff their columns are orthogonal, i.e., A′B = Oₖ.29
Hence, recalling that the rank of a matrix is by definition the dimension of its span
[rank(A) = dim(span(A))], we have the following useful rank equality:30

Rank Lemma. For any mutually orthogonal n × k matrices, A and B,

(A3.4.69)   rank(A + B) = rank(A) + rank(B)

Proof: By definition it suffices to show that

(A3.4.70)   dim(span(A + B)) = dim(span(A)) + dim(span(B))

But if we choose any bases [x₁,..,xₖ] and [y₁,..,y_h] for span(A) and span(B),
respectively, then by mutual orthogonality it follows that [x₁,..,xₖ, y₁,..,y_h] must
constitute a linearly independent set. To see this, note that since xᵢ ∈ span(A) ⟹ xᵢ = Az
for some z ∈ ℝᵏ, and similarly that yⱼ ∈ span(B) ⟹ yⱼ = Bw for some w ∈ ℝᵏ, this
together with the mutual orthogonality condition, A′B = Oₖ, implies that

(A3.4.71)   xᵢ′yⱼ = (Az)′(Bw) = z′(A′B)w = 0

and hence that [x₁,..,xₖ] and [y₁,..,y_h] are mutually orthogonal sets of vectors. This
together with the linear independence of basis vectors implies that the full set of vectors,
[x₁,..,xₖ, y₁,..,y_h], is linearly independent. Finally, since for any vector, v ∈ ℝⁿ,

(A3.4.72)   v ∈ span(A + B)  ⟹  v = (A + B)u for some u ∈ ℝᵏ
⟹  v = Au + Bu ∈ span(A) + span(B)
⟹  v = Σᵢ₌₁ᵏ αᵢxᵢ + Σⱼ₌₁ʰ βⱼyⱼ

for some coefficients (α₁,..,αₖ) and (β₁,..,β_h), it then follows that [x₁,..,xₖ, y₁,..,y_h]
must be a basis for span(A + B). Thus we must have

(A3.4.73)   dim(span(A + B)) = k + h = dim(span(A)) + dim(span(B))

29
As with the n-square identity matrix, I n , we here denote the n-square zero matrix by On .
30
A detailed development of other rank properties can be found in Chapter 6 of Searle (1982).


and may conclude that condition (A3.4.70) holds. 

With this preliminary result, we are now ready to establish the following general form of
the Spectral Decomposition Theorem:

Spectral Decomposition Theorem: For any symmetric matrix, A, there exists a


diagonal matrix, Λ, and an orthonormal matrix, W, such that

(A3.4.74)   A = W Λ W′

Proof: Our approach is again to start with any SVD,

(A3.4.75)   A = U S V′

for the n-square symmetric matrix, A, and to construct an SPD for A as in (A3.4.74). To
do so, we first note that by relabeling the rows and columns of A if necessary, we may
assume that the sets of common singular values (including singleton sets) are grouped
into blocks, Sᵢ = diag(s_{i1},.., s_{i n_i}), i = 1,..,m (≤ n), along the diagonal of matrix, S, where
each block has common value, sᵢ = s_{ij}, j = 1,..,nᵢ (≥ 1), and has associated orthonormal
sets of column vectors, Uᵢ = (u_{i1},.., u_{i n_i}) and Vᵢ = (v_{i1},.., v_{i n_i}). With this grouping,
expression (A3.4.75) can be written as

(A3.4.76)   $A = (U_1,..,U_m)\begin{pmatrix} S_1 & & \\ & \ddots & \\ & & S_m \end{pmatrix}\begin{pmatrix} V_1' \\ \vdots \\ V_m' \end{pmatrix} \;\Rightarrow\; A V_i = U_i S_i\,,\;\; i = 1,..,m$

Our objective is then to construct an SPD of A with corresponding block form:

(A3.4.77)   $A = W \Lambda W' = (W_1,..,W_m)\begin{pmatrix} \Lambda_1 & & \\ & \ddots & \\ & & \Lambda_m \end{pmatrix}\begin{pmatrix} W_1' \\ \vdots \\ W_m' \end{pmatrix} \;\Rightarrow\; A W_i = W_i \Lambda_i\,,\;\; i = 1,..,m$

As with U and V above, the key conditions to be satisfied by W are that each block, Wᵢ,
have orthonormal columns, and that the columns in different blocks be mutually
orthogonal. To construct (A3.4.77), we start by observing that there is one special case
that can be handled without further analysis. In particular, if matrix A is singular, then
exactly one block, Sᵢ, in (A3.4.76), will have sᵢ = 0. But since the vectors in this block
must satisfy,

(A3.4.78)   A Vᵢ = O_{nᵢ}  ⟹  A v_{ij} = 0 = 0·v_{ij} ,  j = 1,..,nᵢ

it follows [as in (A3.4.36) above] that these are automatically eigenvectors for sᵢ = 0, and
in addition, must be orthonormal by the properties of SVDs. Moreover, since all other
eigenvalues of A must be nonzero, it also follows from (A3.4.26) that the columns of Vᵢ
will automatically be orthogonal to all other eigenvectors of A. Thus by setting

(A3.4.79)   Λᵢ = Sᵢ = O_{nᵢ}  and  Wᵢ = Vᵢ ,

we are guaranteed to obtain an admissible “zero” block in (A3.4.74). So we may
henceforth assume (from the nonnegativity of singular values) that sᵢ > 0. For these
cases, recall from (A3.4.18) through (A3.4.21) that if we set

(A3.4.80)   Xᵢ = Uᵢ + Vᵢ ,  i = 1,..,m

(A3.4.81)   Yᵢ = Uᵢ − Vᵢ ,  i = 1,..,m

then by construction,

(A3.4.82)   A Xᵢ = sᵢ Xᵢ ,  i = 1,..,m

(A3.4.83)   A Yᵢ = −sᵢ Yᵢ ,  i = 1,..,m

So all columns in Xᵢ and Yᵢ are potential eigenvectors for A. Here there are three possible
cases to be considered, namely (i) Uᵢ = Vᵢ ⟹ Yᵢ = O, (ii) Uᵢ = −Vᵢ ⟹ Xᵢ = O, or (iii)
Xᵢ ≠ O and Yᵢ ≠ O.31 In case (i), all eigenvalues in block i are positive, and governed by
(A3.4.82). But since (A3.4.80) and (A3.4.81) together imply that Xᵢ + Yᵢ = 2Uᵢ, it follows
that

(A3.4.84)   Xᵢ = Xᵢ + O = Xᵢ + Yᵢ = 2Uᵢ

Thus, except for a factor of 2, the columns of Xᵢ automatically form a set of nᵢ
admissible eigenvectors for block i. So here we may set

(A3.4.85)   Λᵢ = Sᵢ = sᵢ I_{nᵢ}  and  Wᵢ = ½ Xᵢ = Uᵢ

to obtain the desired block i in (A3.4.77). The construction for case (ii) is essentially
identical except that now

31 For notational simplicity, we here take the common dimension of these zero matrices (namely n × nᵢ) to
be understood.


(A3.4.86)   Yᵢ = O + Yᵢ = Xᵢ + Yᵢ = 2Uᵢ

with all eigenvalues given by −sᵢ. So in this case, we can construct the desired block i
by setting

(A3.4.87)   Λᵢ = −Sᵢ = (−sᵢ) I_{nᵢ}  and  Wᵢ = ½ Yᵢ = Uᵢ

This leaves case (iii), in which both sᵢ and −sᵢ are eigenvalues for A.

This is by far the most complex case, and requires additional analysis. Here we start by
observing from the distinctness of the eigenvalues, sᵢ and −sᵢ, together with (A3.4.82),
(A3.4.83) and the orthogonality condition (A3.4.26), that the matrices Xᵢ and Yᵢ must
now be mutually orthogonal (i.e., Xᵢ′Yᵢ = O_{nᵢ}). So by the Rank Lemma above, we must
have

(A3.4.88)   rank(Xᵢ + Yᵢ) = rank(Xᵢ) + rank(Yᵢ)

Moreover, since it continues to be true that Xᵢ + Yᵢ = 2Uᵢ, it then follows that

(A3.4.89)   dim(span(Xᵢ)) + dim(span(Yᵢ)) = rank(Xᵢ) + rank(Yᵢ)
= rank(Xᵢ + Yᵢ) = rank(2Uᵢ) = nᵢ

So if we now use the Gram-Schmidt orthogonalization procedure [summarized by
expression (A3.1.60) above] to construct orthogonal bases [b₁,..,b_{kᵢ}] and [c₁,..,c_{hᵢ}] for
span(Xᵢ) and span(Yᵢ), respectively, then these basis vectors will constitute the desired
eigenvectors for this case. To verify this, observe first from (A3.4.89) that

(A3.4.90)   kᵢ + hᵢ = nᵢ

and hence that there are again exactly nᵢ of these basis vectors. Moreover, since
bⱼ ∈ span(Xᵢ) ⟹ bⱼ = Xᵢzⱼ for some zⱼ ∈ ℝ^{nᵢ}, it follows from (A3.4.82) that

(A3.4.91)   A Xᵢ = sᵢ Xᵢ  ⟹  A Xᵢzⱼ = sᵢ Xᵢzⱼ  ⟹  A bⱼ = sᵢ bⱼ ,  j = 1,..,kᵢ

and thus that the basis vectors [b₁,..,b_{kᵢ}] form an orthogonal set of eigenvectors for the
eigenvalue, sᵢ. Similarly, since cⱼ ∈ span(Yᵢ) ⟹ cⱼ = Yᵢuⱼ for some uⱼ ∈ ℝ^{nᵢ}, it
follows from (A3.4.83) that


A Yᵢ = (−sᵢ) Yᵢ  ⟹  A Yᵢuⱼ = (−sᵢ) Yᵢuⱼ  ⟹  A cⱼ = (−sᵢ) cⱼ ,  j = 1,..,hᵢ

and thus that the basis vectors [c₁,..,c_{hᵢ}] form an orthogonal set of eigenvectors for the
eigenvalue, −sᵢ. Finally, since the distinctness of sᵢ and −sᵢ again implies that [b₁,..,b_{kᵢ}]
and [c₁,..,c_{hᵢ}] are mutually orthogonal, and also orthogonal to the eigenvectors for all
other distinct eigenvalues of A, we may conclude that the normalizations of these
eigenvectors yield an admissible choice for Wᵢ. Hence, if we let 1ₘ = (1,..,1)′ denote the
unit vector of length m, then an admissible choice for block i in (A3.4.77) is now given
by:

(A3.4.92)   $W_i = \left( \tfrac{b_1}{\|b_1\|},.., \tfrac{b_{k_i}}{\|b_{k_i}\|}, \tfrac{c_1}{\|c_1\|},.., \tfrac{c_{h_i}}{\|c_{h_i}\|} \right)$  and  $\Lambda_i = \mathrm{diag}(s_i 1_{k_i},\; -s_i 1_{h_i})$

By way of summary, the blocks defined respectively by (A3.4.79), (A3.4.85), (A3.4.87)


and (A3.4.92) yield a full specification of expression (A3.4.77), and thus the desired
SPD for A is established. 

As one final comment, we begin by reiterating that our main objective in this section has
been to show that the spectral decomposition (SPD) of any symmetric matrix, A, can be
constructed from its singular value decomposition (SVD). However, this appears to leave
open the converse question of how to construct SVDs of symmetric matrices from their
SPDs. But since the singular values of A are simply the absolute values of its eigenvalues,
it turns out to be a simple matter to transform each SPD into a corresponding SVD. To do
so, recall from (A3.1.10) that the SPD in (A3.4.74) can be rewritten as:

(A3.4.93)   $A = W \Lambda W' = (w_1,..,w_n)\begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix}\begin{pmatrix} w_1' \\ \vdots \\ w_n' \end{pmatrix} = \sum_{i=1}^{n} \lambda_i w_i w_i'$

To convert these eigenvalues to absolute form, observe that if sgn(λ) denotes the sign of
any number, λ, then by definition, λ = |λ| sgn(λ), so that (A3.4.93) can be written as,

(A3.4.94)   $A = \sum_{i=1}^{n} |\lambda_i|\, \mathrm{sgn}(\lambda_i)\, w_i w_i'$

But if we define, U = (u₁,..,uₙ), S = diag(s₁,..,sₙ), and V = (v₁,..,vₙ) by

(A3.4.95)   uᵢ = sgn(λᵢ) wᵢ ,  sᵢ = |λᵢ| ,  vᵢ = wᵢ ,  i = 1,..,n

then by definition,

(A3.4.96)   $A = \sum_{i=1}^{n} s_i u_i v_i' = (u_1,..,u_n)\begin{pmatrix} s_1 & & \\ & \ddots & \\ & & s_n \end{pmatrix}\begin{pmatrix} v_1' \\ \vdots \\ v_n' \end{pmatrix} = U S V'$

where S is a nonnegative diagonal matrix and where V (= W) is orthonormal. Moreover,
since

(A3.4.97)   uᵢ′uᵢ = sgn(λᵢ)² wᵢ′wᵢ = 1 ,  i = 1,..,n
            uᵢ′uⱼ = sgn(λᵢ) sgn(λⱼ) wᵢ′wⱼ = 0 ,  i ≠ j

it also follows that U is orthonormal, and thus that (A3.4.96) is automatically an SVD
for A. However, the SPDs of symmetric matrices clearly contain more information, and
turn out to be far more useful than their corresponding SVDs. So this final result only
serves to complete the full correspondence between the two.
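As a closing numerical sketch, the conversion in (A3.4.95) is easy to carry out in code: starting from a computed spectral decomposition of a hypothetical symmetric matrix (with no zero eigenvalues), scaling the eigenvector columns by the signs of the eigenvalues produces an orthonormal U, and together with S = |Λ| and V = W this yields a valid SVD.

import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 1.0]])               # hypothetical symmetric matrix (no zero eigenvalues)
lam, W = np.linalg.eigh(A)               # spectral decomposition  A = W diag(lam) W'

U = W * np.sign(lam)                     # u_i = sgn(lambda_i) w_i  (column-wise scaling)
S = np.diag(np.abs(lam))                 # s_i = |lambda_i|
V = W                                    # v_i = w_i
print(np.allclose(U @ S @ V.T, A))       # True:  A = U S V'
print(np.allclose(U.T @ U, np.eye(2)))   # True:  U is orthonormal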

Tony E. Smith

Professor of Systems Engineering and Regional Science


Department of Electrical and Systems Engineering
University of Pennsylvania
Philadelphia, PA 19104

E‐Mail: [email protected]
