NOTEBOOK ON SPATIAL DATA ANALYSIS
Tony E. Smith
ESE 502: SPATIAL DATA ANALYSIS
Philadelphia
2014
ESE 502 COURSE DESCRIPTION
The course is designed to introduce students to modern statistical methods for analyzing spatial data.
These methods include nearest‐neighbor analyses of spatial point patterns, variogram and kriging
analyses of continuous spatial data, and autoregression analyses of areal data. The underlying statistical
theory of each method is developed and illustrated in terms of selected GIS applications. Students are
also given some experience with ARCMAP, JMP, and MATLAB software.
COURSE TOPICS
Spatial Point Pattern Analysis
• Nearest-Neighbor Methods
• K-Function Methods
INTRODUCTION
I. SPATIAL POINT PATTERN ANALYSIS
1. Examples of Point Patterns
1.1 Clustering versus Uniformity
1.2 Comparisons between Point Patterns
APPENDIX TO PART I
1.2 Basic Modeling Framework
4. Variograms
APPENDIX TO PART II
INTRODUCTION
In this NOTEBOOK we develop the elements of spatial data analysis. The analytical
methods are divided into three parts: Part I. Point Pattern Analysis, Part II. Continuous
Spatial Data Analysis, and Part III. Regional Data Analysis. This classification of spatial
data types essentially follows the course text by Bailey and Gatrell (1995),1 hereafter
referred to as [BG]. It should be noted that many of the examples and methods used in
these notes are drawn from [BG]. Additional materials are drawn from Cressie (1993)
and Anselin (1988).
This course is designed to introduce both the theory and practice of spatial data analysis.
The practice of spatial data analysis depends heavily on software applications. Here we
shall use ARCMAP for displaying and manipulating spatial data, and shall use both
JMPIN and MATLAB for statistical analyses of this data. Hence, while these notes
concentrate on the statistical theory of spatial data analysis, they also develop a number
of explicit applications using this software. Brief introductions to each of these software
packages are given in Part IV of this NOTEBOOK, along with numerous tips on useful
procedures.
These notes will make constant reference to files and programs that are available in the
Class Directory, which can be opened in the Lab with the menu sequence:
The relevant files are organized into three subdirectories: arcview, jmpin, and matlab.
These are the three software packages used in the course. The files in each subdirectory
are formatted as inputs to the corresponding software package. Instructions for opening
and using each of these packages can be found in the Software portion of this
NOTEBOOK.
To facilitate references to other parts of the NOTEBOOK, the following conventions are
used. A reference to expression (3.4.7) means expression (7) in Section 3.4 of the same
part of the NOTEBOOK. If a reference is made to an expression in another part of the
NOTEBOOK, say Part II, then this reference is preceded by the part number, in this case,
expression (II.3.4.7). Similar references are made to figures by replacing expressions
numbers in parentheses with figure numbers in brackets. For example, a reference to
figure II.3.4 means Figure 4 in Section 3 of Part II.
1 All references are listed in the Reference section at the end of this NOTEBOOK.
SPATIAL POINT PATTERN ANALYSIS
We begin by considering a range of point pattern examples that highlight the types of
statistical analyses to be developed. These examples can be found in ARCMAP map
documents that will be discussed later.
Consider the following two point patterns below. The first (Figure 1.1) represents the
locations of redwood seedlings in a section of forest.1 This pattern of points obviously
looks too clustered to be random. The second (Figure 1.2) represents the centers of
biological cells on a microscope slide, and looks too uniformly spaced (too dispersed) to
be random.

[Figure 1.1: Redwood seedlings (scale: 0 to 10 feet); Figure 1.2: Cell centers]
The approach adopted here is to begin by developing a statistical model of purely random
point patterns, and then attempt to test each of these patterns against that statistical
model. In this way, we will be able to conclude that the first is “significantly more
clustered than random” and the second is “significantly more dispersed than random”.
Figures 1.4 and 1.5 below show the locations of abandoned houses in central Philadelphia
for the year 2000.4 The first shows those abandonments for which the owner’s residence
is off site, and the second shows properties for which the owner’s residence is on site.

[Figures 1.4 and 1.5: Locations of abandoned houses in central Philadelphia (2000), with off-site owners and on-site owners, respectively]
If off-site ownership
tends to reflect abandoned rental properties, while on-site ownership reflects abandoned
residences, then one might hypothesize that different types of decisions were involved:
abandoning a rental property might be more directly an economic decision than
abandoning one’s home. However, these patterns look strikingly similar. So one may ask
whether there are any statistically significant differences between them.
Notice that there appears to be significant clustering in each pattern. But here it is
important to emphasize that one can only make this judgment by comparing these
4 This data was obtained from the Neighborhood Information System data base maintained by the
Cartographic Modeling Lab here on campus, http://www.cml.upenn.edu/. For further discussion of this data
see Hillier, Culhane, Smith and Tomlin (2003).
patterns with the pattern of all housing in this area. For example, there are surely very
few houses in Fairmount Park, while there are many houses in other areas. So here it is
important to treat the pattern of overall housing as the relevant reference pattern or
“backcloth” against which to evaluate the significance of any apparent clusters of
abandoned houses.
A final example, shown in Figures 1.6 and 1.7 below, involves the locations of larynx
and lung cancer cases in a section of Lancashire, England.5 Here again it should be clear
that clustering of such cancer cases is only meaningful relative to the distribution of
population in this area. The population densities in each parish are shown in Figure 1.8
below.
[Figures 1.6 and 1.7: Locations of cancer cases; Figure 1.8: Parish population densities]
5 This data first appeared in the paper by Diggle, Gatrell and Lovett (1990), which is included as Paper 12,
“Larynx Cancer”, in the Reference Materials on the class web page.
An examination of these population densities reveals that the clustering of cases in some
of the lower central parishes is now much less surprising. But certain other clusters do
not appear to be so easily explained. For example the central cluster in the far south
appears to be in an area of relatively sparse population. This cluster was in fact the center
of interest in this particular study. An enlargement of this southern portion in Figure 1.9
below indicates that a large incinerator6 is located just upwind of this cluster of cases.7
[Figure 1.9: Enlargement of the southern cluster of cases, with the incinerator located just upwind]
Moreover, an examination of the composition of this cluster suggests that there are
significantly more larynx cases present than one would expect, given the total distribution
of cases shown in Figures 1.7 and 1.8 above. This appears to be consistent with the fact
that large airborne particles such as incinerator ash are more likely to lodge in the larynx
rather than the lungs. So there is some suspicion that this incinerator may be a significant
factor contributing to the presence of this particular clustering of cases.
To analyze this question statistically, one may ask how likely it is that this could simply
be a coincidence. Here one must model the likelihood of such local clustering patterns.
6 According to Diggle, Gatrell and Lovett (1990), this incinerator burned industrial wastes, and was active
during the period from 1972-1980.
7 Prevailing winds are from the Atlantic ocean to the west, as seen in Figure 1.6 above.
As with most statistical analyses, cluster analysis of point patterns begins by asking:
What would point patterns look like if points were randomly distributed? This requires a
statistical model of randomly located points.
To develop such a model, we begin by considering a square region, S , on the plane and
divide it in half, as shown on the left in Figure 2.1 below:

[Figure 2.1: The square S divided into two halves (probability 1/2 each) and then into four quarters (probability 1/4 each), with sample cell C]

If a single point is located randomly in S , then by symmetry1 each half of S should have
the same chance, 1/2, of containing it, and each quarter should likewise have chance 1/4.
Repeating this subdivision argument2 shows that, in the limit, the probability that the
point lies in any cell C ⊆ S must be proportional to the area of C , i.e., that

(2.1.1)    Pr(C | S) = a(C)/a(S)
Finally, since this must hold for any pair of nested regions C ⊆ R ⊆ S , it follows that3
1 This is also known as Laplace’s “Principle of Insufficient Reason”.
2 This argument in fact simply repeats the construction of area itself in terms of Riemann sums [as for
example in Bartle (1975, Section 24)].
3 Expression (2.1.2) refers to equation (2) in Section 2.1. This convention will be followed throughout.
(2.1.2)    Pr(C | S) = Pr(C | R) · Pr(R | S)  ⟹  Pr(C | R) = Pr(C | S)/Pr(R | S) = [a(C)/a(S)] / [a(R)/a(S)] = a(C)/a(R)
and hence that the square in Figure 2.1 can be replaced by any bounded region, R , in the
plane. This fundamental proportionality result, which we designate as the Spatial Laplace
Principle, forms the basis for almost all models of spatial randomness.
To formalize this principle, let the Bernoulli (0-1) random variable X(C) denote the event
that a single randomly located point in R lies in cell C ⊆ R . Then the Spatial Laplace
Principle asserts that

(2.1.3)    Pr[X(C) = 1 | R] = a(C)/a(R) ,

so that Pr[X(C) = 0 | R] = 1 − Pr[X(C) = 1 | R] = 1 − [a(C)/a(R)] .
In this context, suppose now that n points are each located randomly in region R . Then
the second key assumption of spatial randomness is that the locations of these points have
no influence on one another. Hence if for each i = 1,..,n , the Bernoulli variable, X_i(C) ,
now denotes the event that point i is located in region C , then under spatial randomness
the random variables { X_i(C) : i = 1,..,n } are assumed to be statistically independent for
each region C . This together with the Spatial Laplace Principle above defines the
fundamental hypothesis of complete spatial randomness (CSR), which we shall usually
refer to as the CSR Hypothesis.
Observe next that in terms of the individual variables, X i (C ) , the total number of points
appearing in C , designated as the cell count, N (C ) , for C , must be given by the random
sum
(2.2.1)    N(C) = Σ_{i=1}^n X_i(C)

[It is this additive representation of cell counts that in fact motivates the Bernoulli (0-1)
characterization of location events above.] Note in particular that since the expected
value of each Bernoulli variable is simply the probability that it equals one,4 it follows
that the expected number of points in C is given by
(2.2.2)    E[N(C) | n, R] = Σ_{i=1}^n E[X_i(C) | R] = Σ_{i=1}^n Pr[X_i(C) = 1 | R] = Σ_{i=1}^n a(C)/a(R) = n · a(C)/a(R)
Finally, it follows from expression (2.1.3) that under the CSR Hypothesis, the sum of
independent Bernoulli variables in (2.2.1) is by definition a Binomial random variable
with distribution given by

(2.2.3)    Pr[N(C) = k | n, R] = [n!/(k!(n−k)!)] · [a(C)/a(R)]^k · [1 − a(C)/a(R)]^{n−k} ,  k = 0,1,..,n
For most practical purposes, this conditional cell-count distribution for the number of
points in a cell, C ⊆ R (given that n points are randomly located in R ), constitutes the
basic probability model for the CSR Hypothesis.
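Since MATLAB is used throughout these notes, a small simulation may help fix ideas. The following minimal sketch (not part of the original text; region, cell, and all parameter values are illustrative assumptions) locates n random points in a unit-square region R, takes C to be the lower-left quarter of R so that a(C)/a(R) = 1/4, and compares the empirical distribution of the cell count N(C) with the Binomial probabilities in (2.2.3):

    % Hedged sketch: CSR on the unit square R with C = lower-left quarter,
    % so a(C)/a(R) = 0.25. Compare simulated cell counts with (2.2.3).
    n = 50;                                     % points located randomly in R
    p = 0.25;                                   % a(C)/a(R) for this choice of C
    sims = 10000;                               % number of simulated patterns
    counts = zeros(sims,1);
    for s = 1:sims
        pts = rand(n,2);                        % n uniform points in R
        counts(s) = sum(pts(:,1) < 0.5 & pts(:,2) < 0.5);   % cell count N(C)
    end
    k = 0:n;
    binom = arrayfun(@(j) nchoosek(n,j)*p^j*(1-p)^(n-j), k);   % pmf (2.2.3)
    empir = histc(counts, k)/sims;              % empirical frequencies
    tab = [k(:), binom(:), empir(:)];
    disp(tab(1:10,:))                           % compare the first few values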
However, when the reference region R is large, the exact specification of this region and
the total number of points n it contains will often be of little interest. In such cases it is
convenient to remove these conditioning effects by applying the well-known Poisson
approximation to the Binomial distribution. To motivate this fundamental approximation
in the present setting, imagine that you are standing in a large tiled plaza when it starts to
rain. Now consider the number of rain drops landing on the tile in front of you during the
first ten seconds of rainfall. Here it is evident that this number should not depend on
either the size of the plaza itself or the total number of raindrops hitting the plaza. Rather,
it should depend on the intensity of the rainfall – which should be the same everywhere.
This can be modeled in a natural way by allowing both the reference region (plaza), R ,
and the total number of points (raindrops landing in the plaza), n , to become large in
such a way that the expected density of points (intensity of rainfall) in each unit area
remains the same. In our present case, this expected density is given by (2.1.2) as
(2.3.1)    λ(n, R) = n/a(R)
Hence to formalize the above idea, now imagine an increasing sequence of regions
R_1 ⊂ R_2 ⊂ ⋯ ⊂ R_m ⊂ ⋯ and corresponding point totals n_1 ≤ n_2 ≤ ⋯ ≤ n_m ≤ ⋯ that
expand in such a way that the limiting density
4 By definition, E(X) = Σ_x x·p(x) = 1·p(1) + 0·p(0) = p(1) .
(2.3.2)    λ = lim_{m→∞} λ(n_m, R_m) = lim_{m→∞} n_m/a(R_m)

exists and is positive. Under this assumption, it is shown in the Appendix (Section 1) that
the Binomial probabilities in (2.2.3) converge to simple Poisson probabilities,

(2.3.3)    Pr[N(C) = k | λ] = [λ a(C)]^k e^{−λ a(C)} / k! ,  k = 0,1,2,...
Moreover, by (2.2.2) and (2.3.2), the expected number of points in any given cell (plaza
tile), C , is now given by

(2.3.4)    E[N(C)] = λ a(C)

where the density λ becomes the relevant constant of proportionality. Finally, if the set of
random variables {N (C )} describing cell-counts for every cell of finite area in the plane
is designated as a spatial point process on the plane, then any process governed by the
Poisson probabilities in (2.3.3) is designated as a spatial Poisson process on the plane.
Hence, when extended to the entire plane, the basic model of complete spatial
randomness (CSR) above corresponds precisely to a spatial Poisson process.
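The Poisson limit is also easy to check numerically. In the hedged MATLAB sketch below (all settings are illustrative), the density λ and the cell C are held fixed while the region area a(R_m) and point total n_m = λ·a(R_m) grow; the Binomial probabilities (2.2.3) can be seen to approach the Poisson probability (2.3.3) with mean λ·a(C):

    % Hedged numerical check of the Poisson approximation (2.3.3).
    lam = 2; aC = 1; k = 3;                 % fixed density, cell area, count
    poiss = (lam*aC)^k * exp(-lam*aC) / factorial(k);       % (2.3.3)
    for aR = [10 100 1000 10000]            % growing region areas a(R_m)
        n = lam*aR;                         % n_m chosen so n_m/a(R_m) = lam
        p = aC/aR;                          % a(C)/a(R_m)
        binom = nchoosek(n,k) * p^k * (1-p)^(n-k);          % (2.2.3)
        fprintf('a(R) = %6d: Binomial = %.6f, Poisson = %.6f\n', aR, binom, poiss)
    end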
The basic notion of spatial randomness above was derived from the principle that regions
of equal area should have the same chance of containing any given randomly located
point. More formally, this Spatial Laplace Principle asserts that for any two subregions
(cells), C_1 and C_2 , in R ,

(2.4.1)    a(C_1) = a(C_2)  ⟹  Pr(C_1 | R) = Pr(C_2 | R)
However, as was noted in the Housing Abandonment example above, simple area may
not always be the most relevant reference measure (backcloth). In particular, while one
can imagine a randomly located abandoned house, such houses are very unlikely to
appear in the middle of a public park, let alone the middle of a street. So here it makes
much more sense to look at the existing housing distribution, and to treat a “randomly
located abandoned house” as a random sample from this distribution. Here the Laplace
principle is still at work, but now with respect to houses. For if housing abandonments
are spatially random, then each house should have the same chance of being abandoned.
Similarly, in the Larynx cancer example, if such cancers are spatially random, then each
individual should have the same chance of contracting this disease. So here, the existing
population distribution becomes the relevant reference measure.
To generalize the above notion of spatial randomness, we need only replace “area” with
the relevant reference measure, say ν(C) , which may be the “number of houses” in C or
the “total population” of C . As a direct extension of (2.4.1) above, we then have the
following Generalized Spatial Laplace Principle: For any two subregions (cells), C_1 and
C_2 , in R :

(2.4.2)    ν(C_1) = ν(C_2)  ⟹  Pr(C_1 | R) = Pr(C_2 | R)
If (2.4.1) is now replaced by (2.4.2), then one can essentially reproduce all of the results
above. Given this assumption, exactly the same arguments leading to (2.2.3) now show
that
(2.4.3)    Pr[N(C) = k | n, R] = [n!/(k!(n−k)!)] · [ν(C)/ν(R)]^k · [1 − ν(C)/ν(R)]^{n−k} ,  k = 0,1,..,n
To establish the Poisson approximation, there is one additional technicality that needs to
be mentioned. The basic Laplace argument in Figure 2.1 above required that we be able
to divide the square, S , into any number of equal-area cells. The simplest way to extend
this argument is to assume that the relevant reference measure, ν , is absolutely
continuous with respect to the area measure, a . In particular, it suffices to assume that
the relevant reference measure can be modeled in terms of a density function with respect
to area.5 So if housing (or population) is the relevant reference measure, then we can
model this in terms of a housing density (population density) with respect to area. In this
setting, if we now let λ(n, R) = n/ν(R) , and again assume the existence of a limiting
positive density,
(2.4.4)    λ = lim_{m→∞} λ(n_m, R_m) = lim_{m→∞} n_m/ν(R_m)
as the reference region becomes larger, then the same argument for (2.3.3) [in Section
A1.1 of the Appendix] now shows that

(2.4.5)    Pr[N(C) = k | λ] = [λ ν(C)]^k e^{−λ ν(C)} / k! ,  k = 0,1,2,...
Spatial point processes governed by Poisson probabilities of this type (i.e., with non-
uniform reference measures) are often referred to as nonhomogeneous spatial Poisson
processes. Hence we shall often refer to this as the nonhomogeneous CSR Hypothesis.
5 More formally, it is assumed that there is some “density” function, f , on R such that ν is the integral
of f , i.e., such that for any cell, C ⊆ R , ν(C) = ∫_C f(x) dx .
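To make the nonhomogeneous model concrete, the following MATLAB sketch generates a pattern that is “random” with respect to an assumed reference density f rather than area. The density f(x, y) = 2x on the unit square is purely hypothetical (a stand-in for, say, a housing-density gradient increasing from west to east), and points are drawn by standard rejection sampling:

    % Hedged sketch: sample n points from a nonhomogeneous model with an
    % assumed (hypothetical) reference density f on the unit square.
    f = @(x,y) 2*x;              % assumed density; integrates to 1 on the square
    fmax = 2;                    % upper bound on f, needed for rejection sampling
    n = 200; pts = zeros(n,2); i = 0;
    while i < n
        cand = rand(1,2);                      % uniform candidate location
        if rand < f(cand(1),cand(2))/fmax      % accept with probability f/fmax
            i = i + 1; pts(i,:) = cand;
        end
    end
    scatter(pts(:,1), pts(:,2), 10, 'filled')  % pattern is denser toward x = 1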
Finally we consider a number of weaker versions of the spatial randomness model that
will also prove to be useful. First observe that some processes may in fact be “Laplace
like” in the sense that they look the same everywhere, but may not be completely
random. A simple example is provided by the cell centers in Figure 1.2 of Section 1
above. Here one can imagine that if the microscope view were shifted to the left or right
on the given cell slide, the basic pattern of cell centers would look very similar. Such
point processes are said to be stationary. To make this notion more precise, it is
convenient to think of each subregion C ⊆ R as a “window” through which one can see
only part of a larger point process on all of region R . In these terms, the most important
notion of stationarity for our purposes is one in which the process seen in C remains the
same no matter how we move this window. Consider for example the pattern of trees in a
large rain-forest, R , part of which is shown in Figure 2.2 below. Here again this pattern
is much too dispersed to be completely random, but nonetheless appears to be the same
everywhere. Suppose that the relevant subregion, C , under study corresponds to the
small square in the lower left. In these terms, the appropriate notion of stationarity for our
purposes amounts to the assumption that the cell-count distribution in C will remain the
same no matter where this subregion is located.

[Figure 2.2: Dispersed pattern of trees in a section of rain-forest R, with sample cell C (small square, lower left) and a tilted copy of C]

For example the tilted square shown in
the figure is one possible relocation (or copy) of C in R . More generally, if cell C_2 is
simply a translation and/or rotation of cell C_1 , then these cells are said to be
geometrically congruent, written C_1 ≅ C_2 . Hence our formal definition of stationarity
asserts that the cell-count distributions for congruent cells are the same, i.e., that for any
C_1 , C_2 ⊆ R ,

(2.5.1)    C_1 ≅ C_2  ⟹  Pr[N(C_1) = k] = Pr[N(C_2) = k] ,  k = 0,1,2,...
Since the directional orientation of cells makes no difference, this is also called isotropic
stationarity. There is a weaker form of stationarity in which directional variations are
allowed, i.e., in which (2.5.1) is only required to hold for cells that are translations of one
another. This type of anisotropic stationarity is illustrated by the tree pattern in Figure
2.3, where the underlying point process tends to produce vertical alignments of trees
(more like an orchard than a forest). Here the variation in cell counts can be expected to
differ depending on cell orientation. For example the vertical cell in Figure 2.3 is more
likely to contain extreme point counts than its horizontal counterpart. (We shall see a
similar distinction made for continuous stationary processes in Part II of this
NOTEBOOK.)
One basic consequence of both forms of stationarity is that mean point counts continue to
be proportional to area, as in the case of complete randomness, i.e., that

(2.5.2)    E[N(C)] = λ a(C)

where λ is again the expected point density (i.e., expected number of points per unit
area). To see this, note simply that the basic Laplace argument in Figure 2.1 above
depends only on similarities among individual cells in uniform grids of cells. But since
such cells are all translations of one another, it now follows from (2.5.1) that they all
have the same cell-count distributions, and hence have the same means. So by the same
argument above (with cell occupancy probabilities now replaced by mean point counts) it
follows that such mean counts must again be proportional to area. Thus while there can be
many types of statistical dependencies between counts in congruent cells (as in the
dispersed tree patterns above), the expected numbers of points must be the same in each.
One final point should be made about stationarity. This concept implicitly assumes that
the reference region, R , is sufficiently large to ensure that the relevant cells C never
intersect the boundary of R . Since this rarely happens in practice, the present notion of
stationarity is best regarded as a convenient fiction. For example, suppose that in the rain-
forest illustrated in Figure 2.2 above there is actually a lake, as shown in Figure 2.4
below. In this case, any copies of the given (vertical) cell that lie in the lake will of course
contain no trees. More generally, those cells that intersect the lake are likely to have
fewer trees, such as the tilted cell in the figure. Here it is clear that condition (2.5.1)
cannot possibly hold. Such violations of (2.5.1) are often referred to as edge effects.
[Figure 2.4: Tree pattern from Figure 2.2 with a lake; cells intersecting the lake contain fewer trees]
Here there are two approaches that one can adopt. The first is to disallow any cells that
intersect the lake, and thus to create a buffer zone around the lake. While this is no doubt
effective, it has the disadvantage of excluding some points near the lake. If the forest, R,
is large, this will probably make little difference. But if R is small (say not much bigger
than the section shown) then this amounts to throwing away valuable data. An alternative
approach is to ignore the lake altogether and to imagine a “stationary version” of this
landscape, such as that shown in Figure 2.5. Here there are seen to be more points than
were actually counted in this cell. So the question is then how to estimate these missing
points. A method for doing so (known as Ripley’s correction) will be discussed further in
Section 4.3 below.
There are at least three approaches to testing the CSR hypothesis: the quadrat method, the
nearest-neighbor method, and the method of K-functions. We shall consider each of these
in turn.
This simple method is essentially a direct test of the CSR Hypothesis as stated in
expression (2.1.3) above. Given a realized point pattern from a point process in a
rectangular region, R , one begins by partitioning R into congruent rectangular
subcells (quadrats) C_1,..,C_m , as in Figure 3.1 below (where m = 16). Then, regardless
of whether the given pattern represents trees in a forest or beetles in a field, the CSR
Hypothesis asserts that the cell-count distribution for each C_i must be the same, as given
by (2.1.3). But rather than use this Binomial distribution, it is typically assumed that R is
large enough to use the Poisson approximation in (2.3.3). In the present case, if there are
n points in R , and if we let a = a(C_1) and estimate the expected point density by
(3.1.1)    λ̂ = n/a(R)

then the common cell-count distribution for each quadrat under CSR is approximated by
the Poisson distribution,

(3.1.2)    Pr[N_i = k | λ̂] = (λ̂a)^k e^{−λ̂a} / k! ,  k = 0,1,2,...
Moreover, since the CSR Hypothesis also implies that each of the cell counts,
N_i = N(C_i) , i = 1,..,m , is independent, it follows that (N_i : i = 1,..,m) must be an
independent random sample from this Poisson distribution. Hence the simplest test of
this hypothesis is to use the Pearson χ² goodness-of-fit test. Here the expected number of
points in each cell is given by the mean of the Poisson above, which (recalling that
a = a(R)/m by construction) is

(3.1.3)    E(N | λ̂) = λ̂a = [n/a(R)] · [a(R)/m] = n/m

Hence if the observed count in quadrat C_i is denoted by n_i , the appropriate chi-square
statistic is

(3.1.4)    χ² = Σ_{i=1}^m (n_i − n/m)² / (n/m)

which under the CSR Hypothesis should be approximately chi-square distributed with
m − 1 degrees of freedom. Letting n̄ = n/m denote the mean quadrat count, this can also
be written as

(3.1.5)    χ² = Σ_{i=1}^m (n_i − n̄)² / n̄ = (m − 1) s²/n̄

where s² = [1/(m−1)] Σ_{i=1}^m (n_i − n̄)² is the sample variance. But since the variance of the Poisson
distribution is exactly the mean, it follows that var(N)/E(N) = 1 under CSR. Moreover,
since s²/n̄ is the natural estimate of this ratio, this ratio is often designated as the index
of dispersion, and used as a rough measure of dispersion versus clustering. If s²/n̄ < 1
then there is too little variation among quadrat counts, suggesting possible “dispersion”
rather than randomness. Similarly, if s²/n̄ > 1 then there is too much variation among
counts, suggesting possible “clustering” rather than randomness.
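As a computational aside, these quadrat statistics are simple to form in MATLAB. The sketch below (pattern, region, and grid size are all illustrative assumptions) bins a point pattern in the unit square into a 4 × 4 grid of quadrats and reports both the Pearson statistic (3.1.4) and the index of dispersion s²/n̄:

    % Hedged sketch of the quadrat method on the unit square (m = 16 quadrats).
    pts = rand(100,2);                      % placeholder pattern (CSR here)
    m1 = 4; m = m1^2;                       % 4 x 4 grid of quadrats
    ix = min(floor(pts(:,1)*m1), m1-1);     % quadrat column index, 0..m1-1
    iy = min(floor(pts(:,2)*m1), m1-1);     % quadrat row index, 0..m1-1
    counts = accumarray(ix*m1 + iy + 1, 1, [m 1]);   % quadrat counts n_i
    n = size(pts,1); nbar = n/m;            % mean quadrat count n/m
    chi2 = sum((counts - nbar).^2)/nbar;    % Pearson statistic (3.1.4)
    s2 = var(counts);                       % sample variance, 1/(m-1) form
    fprintf('chi-square = %.2f (df = %d), index of dispersion = %.2f\n', ...
            chi2, m-1, s2/nbar)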
But this testing procedure is very restrictive in that it requires an equal-area partition of
the given region.1 More importantly, it depends critically on the size of the partition
chosen. As with all applications of Pearson’s goodness-of-fit test, if there is no natural
choice of partition size, then the results can be very sensitive to the partition chosen.
In view of these shortcomings, the quadrat method above has for the most part been
replaced by other methods. The simplest of these is based on the observation that if one
simply looks at distances between points and their nearest neighbors in R , then this
provides a natural test statistic that requires no artificial partitioning scheme. More
formally, if the Euclidean distance between points s = (s_1, s_2) and v = (v_1, v_2) is
denoted by2
1 More general “random quadrat” methods are discussed in Cressie (1995, Section 8.2.3).
(3.2.1)    d(s, v) = √[(s_1 − v_1)² + (s_2 − v_2)²]

and denote each point pattern of size n in R by S_n = (s_i : i = 1,..,n) , then for any point,
s_i ∈ S_n ,3 the nearest neighbor distance (nn-distance) from s_i to all other points in S_n is
given by4

(3.2.2)    d_i = d_i(S_n) = min{ d(s_i, s_j) : s_j ∈ S_n , j ≠ i }
In a manner similar to the index of dispersion above, the average magnitudes of these
nn-distances (relative to those expected under CSR) provide a direct measure of
“dispersion” or “clustering” in point patterns. This is seen clearly by comparing the
two figures below, each showing a pattern of 14 points.
In Figure 3.2 these points are seen to be very uniformly spaced, so that nn-distances tend
to be larger than what one would expect under CSR. In Figure 3.3 on the other hand, the
points are quite clustered, so that nn-distances tend to be smaller than under CSR.
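Computationally, the nn-distances in (3.2.2) require nothing more than pairwise distances. The following MATLAB sketch (base MATLAB only, with a placeholder random pattern standing in for real data) computes d_i for each point along with the mean nn-distance:

    % Hedged sketch: nn-distances (3.2.2) for a pattern pts of n points.
    pts = rand(14,2);                       % placeholder pattern of 14 points
    n = size(pts,1);
    D = zeros(n);                           % pairwise distances, as in (3.2.1)
    for i = 1:n
        for j = 1:n
            D(i,j) = sqrt(sum((pts(i,:) - pts(j,:)).^2));
        end
    end
    D(1:n+1:end) = inf;                     % exclude self-distances (j ~= i)
    d = min(D, [], 2);                      % nn-distance d_i for each point
    dbar = mean(d);                         % mean nn-distance for the pattern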
2 Throughout these notes we shall always take d(s, v) to be Euclidean distance. However there are many
other possibilities. At large scales it may be more appropriate to use great-circle distance on the globe.
Alternatively, one may take d(s, v) to be travel distance on some underlying transportation network. In
any case, most of the basic concepts developed here (such as nearest neighbor distances) are equally
meaningful for these definitions of distance.
3 The vector notation, S_n = (s_i : i = 1,..,n) , means that each point s_i is treated as a distinct component of
S_n . Hence (with a slight abuse of notation), we take s_i ∈ S_n to mean that s_i is a component of pattern S_n .
4 This is called the event-event distance in [BG] (p. 98). One may also consider the nn-distance from any
random point, x ∈ R , to the given pattern, as defined by d_x(S_n) = min{ d(x, s_i) : i = 1,..,n } . However, we
shall not make use of these point-event distances here. For a more detailed discussion see Cressie (1995,
Section 8.2.6).
To make these ideas precise, we must determine the probability distribution of nn-
distance under CSR, and compare the observed nn-distance with this distribution. To
begin with, suppose that the implicit reference region R is large, so that for any given
point density, λ , we may assume that cell-counts are Poisson distributed under CSR.
Now suppose that s is any randomly selected point in a pattern realization of this CSR
process, and let the random variable, D , denote the nn-distance from s to the rest of the
pattern. To determine the distribution of D , we next consider a circular region, C_d , of
radius d around s , as shown in Figure 3.4 below.

[Figure 3.4: Circular region C_d of radius d around point s in R]

Then by definition, the probability that D is at least equal to d is precisely the
probability that there are no other points in C_d . Hence if we now let C_d(s) = C_d − {s} ,
then this probability is given by

(3.2.3)    Pr(D ≥ d) = Pr[N(C_d(s)) = 0] = e^{−λ a[C_d(s)]} = e^{−λπd²}

where the last equality follows from the fact that a[C_d(s)] = a(C_d) = πd² . Hence it
follows by definition that the cumulative distribution function (cdf), F_D(d) , for D is
given by

(3.2.4)    F_D(d) = Pr(D ≤ d) = 1 − e^{−λπd²} ,  d ≥ 0
In Section 2 of the Appendix to Part I it is shown that this is an instance of the Rayleigh
distribution, and in Section 3 of the Appendix that for a random sample of m nearest-
neighbor distances (D_1,..,D_m) from this distribution, the scaled sum (known as Skellam’s
statistic),

(3.2.5)    S_m = 2λπ Σ_{i=1}^m D_i²

is chi-square distributed with 2m degrees of freedom (as on p. 99 in [BG]). Hence this
statistic provides a test of the CSR Hypothesis based on nearest neighbors.
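A hedged MATLAB version of this test is sketched below. The scaled-sum form S = 2λπ Σ D_i² follows (3.2.5); the pattern is a placeholder, and the commented P-value line assumes the chi2cdf function from the Statistics Toolbox:

    % Hedged sketch of Skellam's statistic: under CSR, S ~ chi-square(2m).
    % Small S suggests clustering; large S suggests dispersion.
    pts = rand(30,2); aR = 1;               % placeholder pattern, unit-area region
    n = size(pts,1); lam = n/aR;            % density estimate as in (3.1.1)
    D = sqrt((pts(:,1)-pts(:,1)').^2 + (pts(:,2)-pts(:,2)').^2);  % distances
    D(1:n+1:end) = inf;                     % exclude self-distances
    d = min(D, [], 2);                      % nn-distances (3.2.2), here m = n
    S = 2*lam*pi*sum(d.^2);                 % Skellam's statistic (3.2.5)
    % p = chi2cdf(S, 2*n);                  % lower-tail P-value (clustering),
    %                                       % if the Statistics Toolbox is available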
While Skellam’s statistic can be used to construct tests, it follows from the Central Limit
Theorem that independent sums of identically distributed random variables are
approximately normally distributed.5 Hence the most common test of the CSR
Hypothesis based on nearest neighbors involves a normal approximation to the sample
mean of D , as defined by

(3.2.7)    D̄_m = (1/m) Σ_{i=1}^m D_i

Here the mean and variance of D itself (computed from the Rayleigh distribution above)
are given by

(3.2.8)    E(D) = 1/(2√λ)

(3.2.9)    var(D) = (4 − π)/(4λπ)
To get some feeling for these quantities, observe that under the CSR Hypothesis, as the
point density, λ , increases, both the expected value and variance of nn-distances
decrease. This makes intuitive sense when one considers denser scatterings of random
points in R .
Next we observe from the properties of independently and identically distributed (iid)
random samples that for the sample mean, D̄_m , in (3.2.7) we must then have

(3.2.10)    E(D̄_m) = (1/m) Σ_{i=1}^m E(D_i) = (1/m)[m E(D_1)] = E(D_1) = 1/(2√λ)

(3.2.11)    var(D̄_m) = var(D_1)/m = (4 − π)/(4λπm)

But from the Central Limit Theorem it then follows that for sufficiently large sample
sizes,6 D̄_m must be approximately normally distributed under the CSR Hypothesis with
mean and variance given by (3.2.10) and (3.2.11), i.e., that:

(3.2.12)    D̄_m ~ N( 1/(2√λ) , (4 − π)/(4λπm) )
5 See Section 3.1.4 in Part II of this NOTEBOOK for further detail. Here we simply state those results
needed for the Clark-Evans test.
6 Here “sufficiently large” is usually taken to mean m ≥ 30 , as long as the distribution in (3.2.4) is not
“too skewed”. Later we shall investigate this by using simulations.
Hence this distribution provides a new test of the CSR Hypothesis, known as the Clark-
Evans Test [see Clark and Evans (1954) and [BG], p. 100]. If the standard error of D̄_m is
denoted by

(3.2.13)    σ(D̄_m) = √[(4 − π)/(4λπm)]

then to construct this test, one begins by standardizing the sample mean, D̄_m , in order to
use the standard normal tables.7 Hence, if we now denote the standardized sample mean
under the CSR Hypothesis by

(3.2.14)    Z_m = [D̄_m − E(D̄_m)] / σ(D̄_m) = [D̄_m − 1/(2√λ)] / √[(4 − π)/(4λπm)]

then under CSR,

(3.2.15)    Z_m ~ N(0, 1)
To construct a test of the CSR Hypothesis based on this distribution, suppose that one
starts with a sample pattern S_n = (s_i : i = 1,..,n) and constructs the nn-distance d_i for
each point, s_i ∈ S_n . Then it would seem most natural to use all these distances (d_1,..,d_n)
to construct the sample-mean statistic in (3.2.10) above. However, this would violate the
assumed independence of nn-distances on which this distribution theory is based. To see
this it is enough to observe that if s_i and s_j are mutual nearest neighbors, so that d_i = d_j ,
then these are obviously not independent. More generally, if s_j is the nearest neighbor of
s_i , then again d_i and d_j must be dependent.8
7 Recall that if a random variable X with mean μ and standard deviation σ is standardized as
Z = (X − μ)/σ , then E(Z) = [E(X) − μ]/σ = 0 and var(Z) = var(X)/σ² = 1 .
8 If D_j is the nn-distance for a point s_j whose nearest neighbor is s_i , then since D_j cannot be bigger than
d_i it follows that Pr(D_j ≤ d_i | D_i = d_i) = 1 , and hence that these nn-distances are statistically dependent.
Hence if one selects a subsample of only m (< n) of these points, with associated
nn-distances (d_1,..,d_m) , then the relevant sample mean is

(3.2.16)    d̄_m = (1/m) Σ_{i=1}^m d_i
The standard test of CSR in most software is a two-tailed test in which both the
possibility of “significantly small” values of d̄_m (clustering) and “significantly large”
values of d̄_m (dispersion) are considered. Hence it is appropriate to review the details of
such a testing procedure. First recall the notion of upper-tail points, z_α , for the standard
normal distribution, as defined by Pr(Z ≥ z_α) = α for Z ~ N(0,1) . In these terms, it
follows that for the standardized mean in (3.2.14),

(3.2.17)    Pr( |Z_m| ≥ z_{α/2} ) = Pr[ (Z_m ≥ z_{α/2}) or (Z_m ≤ −z_{α/2}) ] = α
under the CSR Hypothesis. Hence if one estimates point density as in (3.1.1), and
constructs corresponding estimates of the mean (3.2.10) and standard deviation (3.2.13)
under CSR by

(3.2.18)    μ̂ = 1/(2√λ̂) ,    σ̂_m = √[(4 − π)/(4λ̂πm)]

then one can test the CSR Hypothesis by constructing the following standardized sample
mean:

(3.2.19)    z_m = (d̄_m − μ̂)/σ̂_m

If the CSR Hypothesis is true, then by (3.2.14) and (3.2.15), z_m should be a sample from
N(0,1) .9 Hence a test of CSR at the α-level of significance10 is then given by the rule:
reject the CSR Hypothesis if and only if |z_m| > z_{α/2} .
The significance level, α , is also called the size of the test. Example results of this
testing procedure for a test of size α are illustrated in Figure 3.6 below. Here the two
9 Formally this assumes that λ̂ is a sufficiently accurate estimate of λ to allow any probabilistic variation
in λ̂ to be ignored.
10 By definition, the level of significance of a test is the probability, α , that the null hypothesis (in this
case the CSR Hypothesis) is rejected when it is actually true. This is discussed further below.
samples, z_m , in the tails of the distribution are seen to yield strong evidence against the
CSR Hypothesis, while the sample in between does not.

[Figure 3.6: Two-tailed test, with rejection regions of size α/2 below −z_{α/2} and above z_{α/2}]

As already noted, values of d̄_m (and hence z_m ) that are too low to be plausible under
CSR are indicative of patterns more clustered than random. Similarly, values too large
are indicative of patterns more dispersed than random. In many cases, one of these
alternatives is more relevant than the other. In the redwood seedling example of Figure
1.1 it is clear that trees appear to be clustered. Hence the only question is whether or not
this apparent clustering could simply have happened by chance. So the key question here
is whether this pattern is significantly more clustered than random. Similarly, one can ask
whether the pattern of Cell Centers in Figure 1.2 is significantly more dispersed than
random. Such questions lead naturally to one-tailed versions of the test above. First, a test
of clustering versus the CSR Hypothesis at the -level of significance is given by the
rule:
Example results of this testing procedure for a test of size α are illustrated in Figure 3.7
below. Here the standardized sample mean z_m to the left is sufficiently low to conclude
the presence of clustering (at the α-level of significance), and the sample toward the
middle is not.
[Figure 3.7: One-tailed clustering test; samples below −z_α show significant clustering, samples nearer zero do not]
In a similar manner, one can construct a test of dispersion versus the CSR Hypothesis at
the α-level of significance using the rule: conclude significant dispersion if and only if
z_m > z_α . Example results for a test of size α are illustrated in Figure 3.8 below, where
the sample z_m to the right is sufficiently high to conclude the presence of dispersion (at
the α-level of significance) and the sample toward the middle is not.
[Figure 3.8: One-tailed dispersion test; samples above z_α show significant dispersion, samples nearer zero do not]
While such tests are standard in the literature, it is important to emphasize that there is no
“best” choice of α . The typical values given by most statistical texts are listed in Tables
3.1 and 3.2 below:

Table 3.1 (two-tailed tests)        Table 3.2 (one-tailed tests)
Significance    α     z_{α/2}       Significance    α     z_α
“Strong”       .01    2.58          “Strong”       .01    2.33
“Standard”     .05    1.96          “Standard”     .05    1.65
“Weak”         .10    1.65          “Weak”         .10    1.28
So in the case of a two-tailed test, for example, the non-randomness of a given pattern is
considered “strongly” (“weakly”) significant if the CSR Hypothesis can be rejected at the
α = .01 (α = .10) level of significance.11 The same is true of one-tailed tests (where the
cutoff value, z_{α/2} , is now replaced by z_α ). In all cases, the value α = .05 is regarded
as a standard (default) value indicating “significance”.
However, since these distinctions are admittedly arbitrary, another approach is often
adopted in evaluating test results. The main idea is quite intuitive. In the one-tailed test of
clustering versus CSR above, suppose that for the observed standardized mean value, zm ,
one simply asks how likely it would be to obtain a value this low if the CSR Hypothesis
were true? This question is easily answered by simply calculating the probability of a
sample value as low as zm for the standard normal distribution N (0,1) . If the cumulative
distribution function for the normal distribution is denoted by
(3.2.20)    Φ(z) = Pr(Z ≤ z)

then this probability, known as the P-value for clustering, is given by

(3.2.21)    P(Z ≤ z_m) = Φ(z_m)

[Figure: The clustering P-value shown as the lower-tail area Φ(z_m) below z_m]
Notice that unlike the significance level, α , above, the P-value for a test depends on the
realized sample value, z_m , and hence is itself a random variable that changes from
sample to sample. However, it can be related to α by observing that if P(Z ≤ z_m) ≤ α ,
then for a test of size α , one would conclude that there is significant clustering. More
generally, the P-value, P(Z ≤ z_m) , can be defined as the largest level of significance
(smallest value of α ) at which CSR would be rejected in favor of clustering based on the
given sample value, z_m .
Similarly, one can define the P-value for a test of dispersion the same way, except that
now for a given observed standardized mean value, zm , one asks how likely it would be to
11 Note that lower values of α denote higher levels of significance.
obtain a value this large if the CSR Hypothesis were true. Hence the P-value in this case
is given simply by

(3.2.22)    P(Z ≥ z_m) = 1 − P(Z < z_m) = 1 − Φ(z_m)

where the first equality follows from the fact that Pr(Z = z_m) = 0 for continuous
distributions.12 This P-value is illustrated graphically below:

[Figure: The dispersion P-value shown as the upper-tail area 1 − Φ(z_m) above z_m]
Finally, the corresponding P-value for the general two-tailed test is given as the answer
to the following question: How likely would it be to obtain a value as far from zero as
z_m if the CSR Hypothesis were true? More formally, this P-value is given by

(3.2.23)    P(|Z| ≥ |z_m|) = 2 Φ(−|z_m|)

as shown below. Here the absolute value is used to ensure that −|z_m| is negative
regardless of the sign of z_m . Also the factor “2” reflects the fact that values in both tails
are further from zero than z_m .

[Figure: The two-tailed P-value shown as the two tail areas, Φ(−|z_m|) below −|z_m| and above |z_m|]
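All three P-values are one-line computations. The sketch below avoids any toolbox dependence by expressing the standard normal cdf through the base-MATLAB identity Φ(z) = erfc(−z/√2)/2; the value of z_m is an arbitrary example:

    % Hedged sketch of the P-value formulas (3.2.21)-(3.2.23) in base MATLAB.
    Phi = @(z) 0.5*erfc(-z/sqrt(2));    % standard normal cdf (3.2.20)
    zm = -2.1;                          % example standardized sample mean
    p_cluster  = Phi(zm);               % (3.2.21): Pr(Z <= zm)
    p_disperse = 1 - Phi(zm);           % (3.2.22): Pr(Z >= zm)
    p_twotail  = 2*Phi(-abs(zm));       % (3.2.23): Pr(|Z| >= |zm|)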
We now illustrate the Clark-Evans testing procedure in terms of the Redwood Seedling
example in Figure 1.1. This image is repeated in Figure 3.12a below, where it is
compared with a randomly generated point pattern of the same size in Figure 3.12b. Here
it is evident that the redwood seedlings are more clustered than the random point pattern.
12 By the symmetry of the normal distribution, this P-value is also given by Φ(−|z_m|) + [1 − Φ(|z_m|)] .
However, it is important to notice that there are indeed some apparent clusters in the
random pattern. In fact, if there were none then this pattern would be “too dispersed”. So
the key task is to distinguish between degrees of clustering that could easily occur by
chance and those that could not. This is the essence of statistical pattern analysis.
[Figure 3.12a: Redwood seedlings (scale: 0 to 10 feet); Figure 3.12b: Random point pattern of the same size]
To do so, we shall start by assuming that most of the necessary statistics have already
been calculated. (We shall return to the details of these calculations later.) Here the area,
a(R) = 44108 square meters, of this region R is given by ARCMAP. It appears in the
Attribute Table of the boundary file Redw_bnd.shp in the map document Redwoods.mxd.
The number of points, n = 62 , in this pattern is given in the Attribute Table of the data
file, Redw_pts.shp, in Redwoods.mxd. [The bottom of the Table shows “Records (0 out
of 62 Selected)”. Note that the last row appears to be row 61, because the row numbering
always starts with zero in ARCMAP.] Hence the estimated point density in (3.1.1) above
is given by
(3.3.1)    λ̂ = n/a(R) = 62/44108 = .00141

Similarly, the estimated mean and standard deviation in (3.2.18) (with m = n) are given by

(3.3.2)    μ̂ = 1/(2√λ̂) = 1/(2√.00141) = 13.336 meters

(3.3.3)    σ̂_n = √[(4 − π)/(n·4πλ̂)] = √[(4 − 3.14)/((62)(4)(3.14)(.00141))] = .8853
For the redwood seedling pattern, the mean nn-distance, d̄_n , turns out to be

(3.3.4)    d̄_n = 9.037 meters
At this point, notice already that this average distance is much smaller than the theoretical
value calculated in (3.3.2) under the hypothesis of CSR. So this already suggests that for
the given density of trees in this area, individual trees are much too close to their nearest
neighbors to be random. To verify this statistically, let us compute the standardized mean
(3.3.5)    z_n = (d̄_n − μ̂)/σ̂_n = (9.037 − 13.336)/.8853 = −4.855
Now recalling from Table 3.2 above that there is “strongly significant” clustering if
z_n < −z_{.01} = −2.33 , one can see from (3.3.5) that clustering in the present case is even
more significant. In fact the P-value in this case is given by13

(3.3.6)    P-value = P(Z ≤ z_n) = Φ(z_n) = Φ(−4.855) ≈ .0000006

So the chances of obtaining a mean nearest-neighbor distance this low under the CSR
hypothesis are less than one in a million. This is very strong evidence in favor of
clustering versus CSR.
However, one major difficulty with this conclusion is that we have used the entire point
pattern (m = n) , and have thus ignored the obvious dependencies between nn-distances
discussed above. Cressie (1993, pp. 609-10) calls this “intensive” sampling, and shows
with simulation analyses that this procedure tends to overestimate the significance of
clustering (or dispersion). The basic reason for this is that positive correlation among nn-
distances results in a larger variance of the test statistic, Z n , than would be expected
under independence (for a proof of this see Section 4 of the Appendix to Part I, and also
see p.99 in [BG]). Failure to account for this will tend to inflate the absolute value of the
standardized mean, thus exaggerating the significance of clustering (or dispersion). With
this in mind, we now consider two procedures for taking random subsamples of pattern
points that tend to minimize this dependence problem. These two approaches utilize
JMPIN and MATLAB, respectively, and thus provide convenient introductions to using
these two software packages.
One should begin here by reading the notes on opening JMPIN in section 2.1 of Part IV
in this NOTEBOOK.14 In the class subdirectory jmpin now open the file,
Redwood_data.jmp in JMPIN. (The columns nn-dist and area contain data exported
from MATLAB and ARCMAP, respectively, and are discussed later). The column
Rand_Relabel is a random ordering of labels with associated nn-distance values in the
13. Methods for obtaining P-values are discussed later.
14. This refers to section 2.1 in the Software portion (Part IV) of this NOTEBOOK. All other references to software procedures will be done similarly.
column, Sample. [These can be constructed using the procedure outlined in section
2.2(2) of Part IV in this NOTEBOOK.]
Now open a second file, labeled CE_Tests.jmp, which is a spreadsheet constructed for
this class that automates Clark-Evans tests. Here we shall use a random 50% subsample
of points from the Redwood Seedlings data set to carry out a test of clustering.15 To do
so, click Rows → Add Rows and add 31 rows ($= 62/2$). Next, copy-and-paste the first
31 rows of Redwood_data.jmp into these positions.
In Redwood_data.jmp :
(i) Select rows 1 to 31 (click Row 1, hold down shift, and click Row 31)
(ii) Select column heading Sample (this entire column is now selected)
(iii) Click Edit → Copy
Now in CE_Tests.jmp :
Finally, to activate this spreadsheet you must fill in the two parameters (area, n), starting
with area as follows:

The procedure for filling in the value n ($= 62$) is the same. Once these values are
registered, the spreadsheet does all remaining calculations. (Open the formula windows
for lam, mu, sig, s-mean, and Z as above, and examine the formulas used.) The results
are shown below (where only the first row is displayed).
Notice first that all values other than lam differ from the full-sample case $(m = n)$
calculated above, since we have only $m = 31$ samples. Next observe that the P-value for
clustering (.0000273) is a full order of magnitude larger than for the full-sample case. So
while clustering is still extremely significant (as it should be), this significance level has
been noticeably reduced by the subsampling.
15. In [BG] (p.99) it is reported that a common rule-of-thumb to ensure approximate independence is to take a random subsample of no more than 10% (i.e., $m \le n/10$). But even for large sample sizes, n, this tends to discard most of the information in the data. An alternative approach will be developed in the MATLAB application of Section 3.2.5 below.
While the procedure in JMPIN above does allow one to take random subsamples, and
thereby reduce the effect of positive dependencies among nn-distances, it only allows a
single sample to be taken. So the results obtained depend to some degree on the sample
selected. What one would like to do here is to take many subsamples of the same size
(say with $m = 31$) and look at the range of Z-values obtained. If almost all samples
indicate significant clustering, then this yields a much stronger result that is clearly
independent of the particular sample chosen. In addition, one might for example want to
use the P-value obtained for the sample mean of Z as a more representative estimate of
actual significance. But to do so in JMPIN would require many repetitions of the same
procedure, and would clearly be very tedious. Hence an advantage of programming
languages like MATLAB is that one can easily write a program to carry out such
repetitious tasks. With this in mind, we now consider an alternative approach to Clark-
Evans tests using MATLAB.
One should begin here by reading the notes on opening MATLAB in section 3.1 of Part
IV in this NOTEBOOK. Now open MATLAB, and set the Current Directory (at the top
of the MATLAB window) to the class subdirectory, T:/sys502/matlab, and open the data
file, Redwoods.mat.16 The Workspace window on the left will now display the data
matrices contained in this file. For example, area is seen to be a scalar with value
44108, which corresponds to the area value used in JMPIN above. [This number was
imported from ARCMAP, and can be obtained by following the ARCMAP procedure
outlined in Section 1.2(8) of Part IV.] Next consider the data matrix, Redwoods, which is
seen to be a 62 x 2 matrix, with each row denoting the (x,y) coordinates of one of the 62
redwood seedlings. You can display the first three rows of this matrix by typing
>> Redwoods(1:3,:).
I have written a program, ce_test.m,17 in MATLAB to carry out Clark-Evans tests. You
can display this program by clicking Edit → Open and selecting the file ce_test.m.18
The first few lines of this program are displayed below:
16. The extension .mat is used for data files in MATLAB.
17. The extension .m is used for all executable programs and scripts in MATLAB.
18. To view this program you can also type the command >> edit ce_test.
function OUT = ce_test(pts,a,m,test)
% INPUTS:
% (i) pts = file of point locations (xi,yi), i=1..n
% (ii) a = area of region
% (iii) m = sample size (m <= n)
% (iv) test = indicator of test to be used
% 0 = two-sided test for randomness
% 1 = one-sided test for clustering
% 2 = one-sided test for dispersion
%
% OUTPUTS: OUT = vector of all nearest-neighbor distances
%
% SCREEN OUTPUT: critical z-value and p-value for test
The first line defines this program to be a function called ce_test, with four inputs
(pts,a,m,test) and one output called OUT. The percent signs (%) on subsequent lines
indicate comments intended for the reader only. The next few comment lines describe
what the program does. In this case ce_test takes a subsample of size $m \le n$ and
performs a Clark-Evans test as in JMPIN. The next set of comment lines describes the four
inputs in detail. The first, pts, contains the (x,y) coordinates of the given point pattern,
and corresponds in our present case to Redwoods. The parameter a corresponds to area,
and m corresponds to the size of the random subsample to be taken (in this case $m = 31$). Finally,
test is an indicator denoting the type of test to be done, so that for a one-tailed test of
clustering we would give test the value 1. During the execution of this program, the
nearest-neighbor distance for each pattern point is calculated. Since this vector of nn-
distances is useful for other applications (such as the JMPIN spreadsheet above) it is
useful to save this vector. Hence the single output, OUT, is in this case the n x 1 vector
of nn-distances. The last comment line describes the screen output of this program,
which in the present case is simply a display of the Z-value obtained and its
corresponding P-value.
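The body of ce_test.m is not reproduced here, but a minimal sketch of how such a function might be implemented is given below. This is only an illustration consistent with the comments above, not the actual class program (the internal logic and variable names are my own, and normcdf requires the Statistics Toolbox):

function OUT = ce_test(pts,a,m,test)
% Sketch of a Clark-Evans test (illustration only)
n = size(pts,1);                      % number of pattern points
D = zeros(n,1);                       % nn-distance for each point
for i = 1:n
    d = sqrt(sum((pts - repmat(pts(i,:),n,1)).^2,2));
    d(i) = inf;                       % exclude the point itself
    D(i) = min(d);                    % nearest-neighbor distance
end
sub = randperm(n);                    % random subsample of size m
Dm = D(sub(1:m));
lam = n/a;                            % estimated point density
mu = 1/(2*sqrt(lam));                 % CSR mean nn-distance, as in (3.3.2)
sig = sqrt((4-pi)/(4*pi*lam*m));      % CSR std deviation, as in (3.3.3)
Z = (mean(Dm) - mu)/sig;              % standardized mean, as in (3.3.5)
if test == 1
    P = normcdf(Z);                   % one-sided P-value for clustering
elseif test == 2
    P = 1 - normcdf(Z);               % one-sided P-value for dispersion
else
    P = 2*(1 - normcdf(abs(Z)));      % two-sided P-value for randomness
end
disp(['Z_Value = ',num2str(Z)]); disp(['P_Value = ',num2str(P)]);
OUT = D;                              % vector of all nn-distances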
To run this program, suppose that you want to save the nn-distance output as a vector
called D (the names of inputs and outputs can be anything you choose). Then at the
command prompt you would type:
>> D = ce_test(Redwoods,area,31,1);
Here it is important to end this command statement with a semicolon (;), for otherwise,
all output will be displayed on the screen (in this case the contents of D). Hence by
hitting return after typing the above command, the program will execute and give a
screen display such as the following:
Z_Value = -3.3282
P_Value = .00043697
The results are now different from those of JMPIN above because a different random
subsample of size $m = 31$ was chosen. To display the first four rows of the output vector,
D, type19
>> D(1:4,:)
As with the Redwoods display above, the absence of a semicolon at the end will cause
the result of this command to be displayed. If you would like to save this output to your
home directory (S:) as a text file, say nn_dist.txt, then use the command sequence20

>> save S:\nn_dist.txt D -ascii
As was pointed out above, the results of this Clark-Evans test depend on the particular
sample chosen. Hence, each time the program is run there will be a slightly different
result (try it!). But in MATLAB it is a simple matter to embed ce_test in a slightly larger
program that will run ce_test many times, and produce whatever summary outputs are
desired. I have constructed a program to do this, called ce_test_distr.m. If you open this
program you will see that it has a similar format:
19. Since D is a vector, there is only a single column. So one could simply type D(1:4) in this case.
20. To save D in another directory, say with the path description, S:\path, you must use the full command: >> save S:\path\nn_dist.txt D -ascii.
function OUT = ce_test_distr(pts,a,m,test,N)
% INPUTS:
% (i) pts = file of point locations (xi,yi), i=1..n
% (ii) a = area of region
% (iii) m = sample size (m <= n)
% (iv) test = indicator of test to be used
% 0 = two-sided test for randomness
% 1 = one-sided test for clustering
% 2 = one-sided test for dispersion
% (v) N = number of sample tests.
%
% OUTPUTS: OUT = vector of Z-values for tests.
%
% SCREEN OUTPUT: (1) Normal fit of Histogram for OUT
% (2) Mean of OUT
% (3) P-value of mean (if normcdf present)
The only key difference is the new parameter, N, which specifies the number of point
pattern samples of size m to be simulated (i.e., the number of times ce_test is to be
run). The output chosen for this program is the vector of Z-values obtained. So if N =
1000, then OUT will be a vector of length 1000. The screen outputs now include
summary measures of this vector of Z-values, namely the histogram of Z-values in OUT,
along with the mean of these Z-values and the P-value for this mean. If this program is
run using the command
>> Z = ce_test_distr(Redwoods,area,31,1,1000);
then 1000 samples will be drawn, and the resulting Z-values will be saved in a vector, Z.
In addition, a histogram of these Z-values will be displayed, as illustrated in Figure 3.13
below. Notice that the results of this simulated sampling scheme yield a distribution of Z-
values that is approximately normal. While this normality property is again a
consequence of the Central Limit Theorem, it should not be confused with the normal
distribution in (3.2.12) upon which the Clark-Evans test is based (which requires n to be
sufficiently large). However, this normality property does suggest that a 50% sample
$(m = n/2)$ in this case yields a reasonable amount of independence among nn-distances,
as it was intended to do.21
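For concreteness, a minimal sketch of how such a wrapper might be organized is given below; it assumes a hypothetical silent variant of ce_test (here called ce_test_z) that simply returns the Z-value without screen display, and uses the Statistics Toolbox function histfit:

function OUT = ce_test_distr(pts,a,m,test,N)
% Sketch: run the Clark-Evans test N times and summarize (illustration only)
Z = zeros(N,1);
for i = 1:N
    Z(i) = ce_test_z(pts,a,m,test);   % hypothetical silent version of ce_test
end
histfit(Z);                           % histogram of Z-values with normal fit
Z_mean = mean(Z);
disp(['MEAN_Z = ',num2str(Z_mean)]);
disp(['P_Value = ',num2str(normcdf(Z_mean))]);  % P-value of mean (clustering)
OUT = Z;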
21. Hence this provides some evidence that the 10% rule of thumb in footnote 15 above is overly conservative.
[Figure 3.13: Histogram of the 1000 simulated Z-values (with fitted normal density)]
In particular, the mean of this distribution is now about -3.46 as shown by the program
output below:
Here the P-value, .000273, is of the same order of magnitude as the single sample above,
indicating that this single sample was fairly representative.22 However, it is of interest to
note that the single sample in JMPIN above, with a P-value of .0000546, is an order of
magnitude smaller. Hence this sample still indicates more significance than is warranted.
But nonetheless, a P-value of .000273 is still very significant – as it should be for this
redwood seedling example.
The Redwood Seedling example above is something of a “straw man” in that statistical
analysis is hardly required to demonstrate the presence of such obvious clustering. Rather
22. Again it should be emphasized that this P-value has nothing to do with the sampling distribution in Figure 3.13. Rather it is the P-value for the mean Z-value under the normal distribution in (3.2.12).
it serves as an illustrative case where we know what the answer should be.23 However,
the presence of significant clustering (or dispersion) is often not so obvious. Our second
example, again taken from [BG] (Figure 3.2), provides a good case in point. It also serves
to illustrate some additional limitations of the above analysis.
The map in Figure 3.14a below shows a portion of Bodmin Moor containing $n = 35$ tors. A
randomly generated pattern of 35 points is shown for comparison in Figure 3.14b.
[Figure 3.14: (a) Bodmin tors point pattern; (b) random point pattern of the same size (scale bar: 0 to 5 km)]
Here there does appear to be some clustering of tors relative to the random pattern on the
right. But it is certainly not as strong as in the redwood seedling example above. So it is of
interest to see what the Clark-Evans test says about clustering in this case (see also
exercise 3.5 on pp.114-15 in [BG]). The maps in Figures 3.14a and 3.14b appear in the
ARCMAP project, bodmin.mxd, in the directory arcview/project/Bodmin. The area,
$a(R) = 206.62$, of the region R in Figure 3.14a is given in the Attribute Table of the
shapefile, bod_bdy.24 This point pattern data was imported to MATLAB and appears in
the matrix, Bodmin, of the data file, bodmin.mat, in the matlab directory. For our
present purposes it is of interest to run the following full-sample version of the Clark-
Evans test for clustering:
23. Such examples are particularly useful for providing consistency checks on statistical methods for detecting clustering.
24. The area and distance scales for this pattern are not given in [BG].
>> D = ce_test(Bodmin,area,35,1);
Z_Value = -1.0346
P_Value = 0.15043
Hence even with the full sample of data points, the Clark-Evans test yields no significant
clustering. Moreover, since subsampling will only act to reduce the level of significance,
this tells us that there is no reason to proceed further. But for completeness, we include
the following results for a subsample of size $m = 18$ (approximately 50%):25

>> ce_test_distr(Bodmin,area,18,1,1000);

So even though there appears to be some degree of clustering, this is not detected by
Clark-Evans. It turns out that there are two key theoretical difficulties here that have yet
to be addressed. The first is that for point pattern samples as small as the Bodmin Tors
example, the assumption of asymptotic normality may be questionable. The second is
that nn-distances for points near the boundary of region R are not distributed the same as
those away from the boundary. We shall consider each of these difficulties in turn.
This type of skewness is typical of nn-distances – even under the CSR hypothesis. [Under
CSR, the theoretical distribution of nn-distances is given by the Rayleigh density in
expression (2) of Section 2 in the Appendix to Part I, which is seen to have the same
skewness properties.]
25. Here we are not interested in saving the Z-values, so we have specified no outputs for ce_test_distr.
The second theoretical difficulty concerns the special nature of nn-distances near the
boundary of region R. The theoretical development of the CSR hypothesis explicitly
assumed that the region R is of infinite extent, so that such “edge effects” do not arise.
But in practice, many point patterns of interest occur in regions R where a significant
portion of the points are near the boundary of R. Recall from the discussion in Section
2.4 that if region R is viewed as a “window” through which part of a larger (stationary)
point process is being observed, then points near the boundary will tend to have fewer
observed neighbors than points away from the boundary. So in cases where the nearest
neighbor of a point in the larger process is outside R, the observed nn-distance for that
point will be greater than it should be (such as the example shown in Figure 3.16 below).
Thus the distribution of nn-distances for such points will clearly have higher expected
values than for interior points. For samples from CSR processes, this will tend to inflate
mean nn-distances relative to their theoretical values under the CSR hypothesis. This
edge effect will be demonstrated more explicitly in the next section.
[Figure 3.16: A pattern point whose nearest neighbor in the larger process lies outside the boundary of region R]
Given these shortcomings, we now develop a testing procedure that simulates the true
distribution of $\bar D_n$ in region R for a given pattern size, n.26 While this procedure is
computationally more intensive, it will not only avoid the need for normal approxi-
mations, but will also avoid the need for subsampling altogether. The key to this
procedure lies in the fact that the actual distribution of a randomly located point in R can
easily be simulated on a computer. This procedure, known as rejection sampling, starts
by sampling random points from rectangles. Since each rectangle is the Cartesian product
of two intervals, $[a_1,b_1] \times [a_2,b_2]$, and since drawing a random number, $s_i$, from an
interval $[a_i,b_i]$ is a standard operation in any computer language, one can easily draw a
random point $s = (s_1,s_2)$ from $[a_1,b_1] \times [a_2,b_2]$. Hence for any given planar region, R, the
basic idea is to sample points from the smallest rectangle, rec(R), containing R, and then
to reject any points which are not in R.
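As an illustration, a minimal MATLAB sketch of this rejection-sampling idea might look as follows (this is not the class program; the function name csr_sample is my own, and inpolygon is the built-in MATLAB point-in-polygon test):

function pts = csr_sample(poly,n)
% Sketch: draw n random points from the region R bounded by poly (k x 2)
xmin = min(poly(:,1)); xmax = max(poly(:,1));   % smallest rectangle rec(R)
ymin = min(poly(:,2)); ymax = max(poly(:,2));
pts = zeros(n,2); k = 0;
while k < n
    s = [xmin + (xmax - xmin)*rand, ymin + (ymax - ymin)*rand]; % point in rec(R)
    if inpolygon(s(1),s(2),poly(:,1),poly(:,2))  % keep only points inside R
        k = k + 1; pts(k,:) = s;
    end
end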
26. Procedures for simulating distributions by random sampling are known as “Monte Carlo” procedures.
Hence expression (2.1.2) holds, and the CSR hypothesis is satisfied. More generally, for
any pattern of size n one can easily simulate as many samples of size n from R as
desired, and use these to estimate the sampling distribution of $\bar D_n$ under the CSR
hypothesis.
Here the first row gives information about the boundary, namely that there is one
polygon, and that this polygon consists of 144 points. Each subsequent row contains the
(x,y) coordinates for one of these points. Notice also that the second row and the last row
are identical, indicating that the polygon is closed (and thus that there are only 144
distinct points in the polygon). This boundary information for R is necessary in order to
define the rectangle, rec(R). It is also needed to determine whether a given point in
rec(R) is also in R or not. While this latter determination seems visually evident in the
present case, it turns out to be relatively complex from a programming viewpoint. A brief
description of this procedure is given in section 5 of the Appendix to Part I.
(3.5.2) $\widehat{\Pr}(\bar D_n \le \bar d_n) = \dfrac{N_0}{N+1}$

Here the denominator, $N+1$, includes the observed sample along with the simulated
samples. This estimate then constitutes the relevant P-value for a test of clustering
relative to the CSR hypothesis. Hence the testing procedure in clust_sim consists of the
following two steps (sketched in MATLAB after the list):

(i) Simulate N patterns of size n, and for each pattern $i = 1,\dots,N$ compute the
mean nn-distance, $\bar d_n^{(i)}$.

(ii) Determine the number of patterns, $N_0$, with $\bar d_n^{(i)} \le \bar d_n$, and calculate the
P-value for $\bar d_n$ using (3.5.2) above.
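The following minimal sketch shows these two steps, assuming the csr_sample helper from the earlier sketch and the Statistics Toolbox function pdist2; here m_dist denotes the observed mean nn-distance, $\bar d_n$, which is computed below:

% Sketch of steps (i) and (ii) above (illustration only)
N = 1000; n = 35;
dbar = zeros(N,1);
for i = 1:N
    P = csr_sample(Bod_poly,n);           % step (i): simulated CSR pattern in R
    d = pdist2(P,P); d(1:n+1:end) = inf;  % pairwise distances, excluding self
    dbar(i) = mean(min(d,[],2));          % mean nn-distance for this pattern
end
N0 = sum(dbar <= m_dist);                 % step (ii): count at or below observed
P_value = N0/(N+1)                        % estimated P-value, as in (3.5.2)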
To run this program we require one additional bit of information, namely the value of $\bar d_n$.
Given the output vector, D, of nn-distances for the Bodmin tors obtained above from the
program, ce_test, this mean value (say m_dist) can be calculated by using the built-in
function, mean, in MATLAB as follows:

>> m_dist = mean(D);
In the present case, m_dist = 1.1038. To input this value into clust_sim, we shall use a
MATLAB data array known as a structure. Among their many uses, structures offer a
convenient way to input optional arguments into MATLAB programs. In the present
case, we shall input the value m_dist together with the number of bins to be used in
constructing a histogram display of the simulated mean nn-distance values. [The default
value in MATLAB, bins = 10, is useful for moderate sample sizes, say $N \le 100$. But for
simulations with $N \ge 1000$, it is better to use bins = 20 or 25.] If you open the program,
clust_sim, you will see that the last input of this function is a structure, namely opts (for
“options”), that is described in more detail under INPUTS:
function clust_sim(poly,a,m,N,opts)
% INPUTS:
% (i) poly = boundary file of polygon
% (ii) a = area of polygon
% (iii) m = number of points in polygon
% (iv) N = number of simulations
% (v) opts = an (optional) structure with variable inputs:
% opts.bins = number of bins in histogram (default = 10)
% opts.m_dist = mean nearest-neighbor distance for testing
To define this structure in the present case, we shall use the value of m_dist just
calculated, and shall set bins = 20. This is accomplished by the two commands:

>> opts.m_dist = m_dist; opts.bins = 20;

Notice that opts is automatically defined by simply specifying its components.27 The key
point is that only the structure name, opts, needs to be specified in the command line.
The program clust_sim will look to see whether either of these components of opts has been
specified. So if you want to use the default value of bins, just leave out this command.
Moreover, if you just want to look at the histogram of simulated values (and not run a test
at all), simply leave opts out of the command line. This is what is meant in the
description above when opts is referred to as an “(optional) structure”.
Given these preliminaries, we are now ready to run the program, clust_sim, for Bodmin.
To do so, enter the command line:
>> clust_sim(Bod_poly,area,35,1000,opts);
Here we have specified n = 35 for the Bodmin case, and have specified that N = 1000
simulated patterns be constructed. The screen output will start with successive displays:
percent_done = 10
percent_done = 20
:
percent_done = 100
27. Note also that we have put both commands on the same line to save room. Just remember to separate each command by a semicolon (;).
that indicate how the simulations are proceeding. The final screen output will then
include both a histogram of mean nn-distance values, and some numerical outputs, as
described in the “SCREEN OUTPUT” section of the comments in clust_sim. The
histogram will be something like that shown in Figure 3.18 below (the red vertical bar
will be discussed below):
[Figure 3.18: Histogram of 1000 simulated mean nn-distance values for Bodmin, with the observed value marked by a red vertical bar]
Note first that in spite of the relatively skewed distribution of observed nn-distance
values for Bodmin, this simulated distribution of mean nn-distances appears to be
approximately normal. Hence, given the sample size, $n = 35$, it appears that the
dependencies between nn-distance values in this Bodmin region are not sufficient to rule
out the assumption of normality used in the Clark-Evans test.

But in spite of its normality, this distribution is noticeably different from that predicted
by the CSR hypothesis. To see this, recall first that for the given area of Bodmin,
$a(R) = 206.6$, the point density estimate is given by $\hat\lambda = 35/206.6 \approx .1694$. Hence the
theoretical mean nn-distance value predicted by the CSR hypothesis is

(3.5.3) $\hat\mu = \dfrac{1}{2\sqrt{\hat\lambda}} \approx 1.215$

However, if we now look at the numerical screen output for this simulation, we have
CLUST_SIM RESULTS
SIM_MEAN_DIST = 1.3087
M_DIST = 1.1038
Here the first line reports the mean value of the 1000 simulated mean nn-distances. But
since (by the Law of Large Numbers) a sample this large should give a fairly accurate
estimate of the true mean, $E(\bar D_n)$, we see that this true mean is considerably larger than
that predicted by the CSR hypothesis above.28 The key point to note here is that the edge
effects depicted in Figure 3.16 above are quite significant for pattern sizes as small as
$n = 35$ relative to the size of the Bodmin region, R.29 So this simulation procedure does
indeed give a more accurate distribution of nn-distances in the Bodmin region under the
CSR hypothesis.
Observe next that the second line of screen output above gives the value of opts.m_dist
as noted above (assuming this component of opts was included). The final line is the
critical one, and gives the P-value for opts.m_dist, as estimated by (3.5.2) above. Hence,
unlike the Clark-Evans test where no significant clustering was observed (even under full
sampling), the present procedure does reveal significant clustering.30 This is shown by
the position of the red vertical bar in Figure 3.18 above (at approximately a value of
m_dist = 1.1038). Here there are seen to be only a few simulated values lower than
m_dist. Moreover, the discussion above now shows why this result differs from Clark-
Evans. In particular, by accounting for edge effects, this procedure reveals that under the
CSR hypothesis, mean nn-distance values for Bodmin should be higher than those
predicted by the Clark-Evans model. Hence the observed value of m_dist is actually
quite low once this effect is taken into account.
28. You can convince yourself of this by running clust_sim a few times and observing that the variation in these estimated mean values is quite small.
29. Note that as the sample size n becomes larger, the expected nn-distance, $E(\bar D_n)$, for a given region, R, becomes smaller. Hence the fraction of points sufficiently close to the boundary of R to be subject to edge effects eventually becomes small, and this edge effect disappears.
30. Note again that this P-value will change each time clust_sim is run. However, by trying a few runs you will see that all values are close to .05.
In the Bodmin Tors example above, notice from Figure 3.14a (p.20) that the clustering
structure is actually quite different from that of the Redwood Seedling example in Figure
3.12a (p.12). Rather than small isolated clumps, there appear to be two large groups of
points in the northwest and southwest, separated by a large empty region. Moreover, the
points within each group are actually quite evenly spaced (locally dispersed). These
observations suggest that the pattern of tors exhibits different structures at different
scales. Hence the objective of the present section is to introduce a method of point pattern
analysis that takes such scale effects into account, and in fact allows “scale” to become a
fundamental variable in the analysis.
To motivate the main ideas, we begin with a new example involving wolf packs. A map
is shown in Figure 4.1a below representing the relative locations of wolf packs in a
portion of the Central Arctic Region in 1998.1 The enlarged portion in Figure 4.1b is a
schematic map depicting individual wolves in four of these packs.
[Figure 4.1: (a) Wolf-pack locations in a portion of the Central Arctic Region (scale bar: 0 to 50 km); (b) schematic enlargement of four packs]
At the level of individual wolf locations in Figure 4.1b, there is a pattern of isolated
clumps that bears a strong resemblance to that of the Redwood seedlings above.2
Needless to say, this pattern would qualify as strongly clustered. But if one considers the
larger map in Figure 4.1a, a different picture emerges. Here, the dominant feature is the
remarkable dispersion of wolf packs. Each pack establishes a hunting territory large
enough for its survival (roughly 15 to 20 km in diameter), and actively discourages other
1. This map is based on a more detailed map published in the Northwest Territories Wolf Notes, Winter 1998/99. See the class file: ese502/extra_materials/wolf_packs.jpg.
2. The spacing of individual wolves is of course exaggerated to allow a representation at this scale.
packs from invading its territory.3 Hence this pattern of wolf locations is very clustered at
small scales, and yet very dispersed at large scales.
But if one were to analyze this wolf-location pattern using any of the nearest-neighbor
techniques above, it is clear that only the small-scale clustering would be detected. Since
each wolf is necessarily close to other wolves in the same den, the spacing between dens
would never be observed. In this simple example one could of course redefine wolf dens
to be aggregate “points”, and analyze the spacing between these aggregates at a larger
scale. But there is no way to analyze multiple scales using nearest neighbors without
some form of re-aggregation.4
To capture a range of scales in a more systematic way, we now consider what amounts to
an extension of the quadrat (or cell-count) method discussed in section 1 above. In
particular, recall that the quadrat method was criticized for being too dependent on the
scale of individual cells. Hence the key idea of K-functions is to turn this dependency
into a virtue by explicitly incorporating “scale” as a variable in the analysis. Thus, rather
than fixing the scale and locations of cell grids, we now consider randomly sampled cells
of varying sizes. While many sampling schemes of this type can be defined, we shall
focus on the single most basic scheme, which is designed to answer the following
question for a given point process with density $\lambda$: What is the expected number of point
events within distance h of any randomly sampled point event? Note that this expected
number is not very meaningful without specifying the point density, $\lambda$, since it will of
course increase with $\lambda$. Hence if we divide by $\lambda$ in order to eliminate this obvious
“density effect”, then the quantities of interest take the form:

(4.2.1) $K(h) = \dfrac{1}{\lambda}\, E(\text{number of additional events within distance } h \text{ of an arbitrary event})$
If we allow the distance or scale, h, to vary, then expression (4.2.1) is seen to define a
function of h, designated as a K-function.5 As with nn-distances, these values, $K(h)$,
yield information about clustering and dispersion. In the wolf-pack example above, if one
were to compute $K(h)$ for small distances, h, around each wolf in Figure 4.1b,
then given the close proximity to other wolves in the same pack, these values would
surely be too high to be consistent with CSR for the given density of wolves in this area.
Similarly, if one were to compute $K(h)$ for much larger distances, h, around
each wolf in Figure 4.1a, then given the wide spacing between wolf packs (and the
relative uniformity of wolf-pack sizes6), these values would surely be too low to be
3. Since wolves are constantly on the move throughout their hunting territories, the actual locations shown in Figure 4.1a are roughly at the centers of these territories.
4. One could also incorporate larger scales by using higher-order nearest neighbors [as discussed for example in Ripley (1996, sec. 6.2)]. But these are not only more complex analytically, they are also difficult to associate with specific scales of analysis.
5. This concept was popularized by the work of Ripley (1976, 1977). Note also that, following standard convention, we now denote distance by h to distinguish it from nn-distance, d.
6. Wolf packs typically consist of six to eight wolves (see the references in footnote 1 above).
consistent with CSR for the given density of wolves. Hence if one can identify
appropriate benchmark values for $K(h)$ under CSR, then these K-functions can be used
to test for clustering and dispersion at various scales of analysis. We shall consider these
questions in more detail in Section 4.4 below.
But for the moment, there are several features of definition (4.2.1) that warrant further
discussion. First, while the distance metric in (4.2.1) is not specified, we shall always
refer to the Euclidean distance, $d(s,v)$, between pairs of points, as defined in expression (3.2.1)
above. Hence with respect to any given point event, s, the expected number of point
events within distance h of s is simply the expected number of such events in a circle of
radius h about s, as shown in Figure 4.2 below.
[Figure 4.2: $K(h)$ = expected number of additional points in a circle of radius h about an event s]
This graphical image helps to clarify several additional assumptions implicit in the
definition of $K(h)$. First, since this value is taken to depend only on the size of the circle
(i.e., the radius h) and not its position (i.e., the coordinates of s), there is an implicit
assumption of spatial stationarity [as in expression (2.5.1) above]. In other words, it is
assumed that the expected number of additional points in this circle is the same regardless
of where s is located. (This assumption will later be relaxed in our Monte Carlo
applications of K-functions.)
Observe next that the circularity of this region implicitly assumes that direction is not
important, and hence that the underlying point process is isotropic (as in Figure 2.2
above). On the other hand, if the point process of interest were to exhibit some clear
directionality, such as the vertical directionality shown in Figure 2.3 above, then it
might be more appropriate to use directional ellipses as defined by weighted Euclidean
distances of the form:

(4.2.2) $d(s,v) = \sqrt{w_1(s_1-v_1)^2 + w_2(s_2-v_2)^2}$

where the weights $w_1$ and $w_2$ reflect the relative sensitivities of point counts to movements
in the horizontal and vertical directions, respectively.7 More generally, if the relevant point
7. One can also use appropriate quadratic forms to define anisotropic distances with any desired directional orientations. We shall consider such distances in more detail in the analysis of spatial variograms in Part II of this NOTEBOOK.
(4.2.4) $K(h) = \dfrac{1}{\lambda}\, E[\,N(C_h(s) - \{s\}) \mid N(s) = 1\,]$

To see the importance of this conditioning, recall from expression (2.3.4) that for any
stationary process (not just CSR processes) it must be true that the expected number of
points in $C_h(s) - \{s\}$ is simply proportional to its area, i.e., that

$E[\,N(C_h(s) - \{s\})\,] = \lambda\, a(C_h(s))$

But this is not true of the conditional expectation above. Recall from the wolf-pack case,
for example, that for small circles around any given wolf, the expected number of
additional wolves is much larger than what would be expected based on area alone [i.e., is
larger than $\lambda\, a(C_h(s))$]. These ideas will be developed in more detail in Section 4.4,
where it is shown that such deviations from simple area proportionality form the basis for
all K-function tests of the CSR Hypothesis.
(4.3.1) $d_{ij} \equiv d(s_i,s_j)$

and for any distance, h, define the indicator function, $I_h$, for point pairs in $S_n$ by
8. Here it should be noted that tools are available in the Spatial Analyst extension of ARCMAP for constructing cost-weighted and shortest-path distances. However, we shall not do so in this NOTEBOOK.
(4.3.2) $I_h(d_{ij}) = I_h[d(s_i,s_j)] = \begin{cases} 1, & d_{ij} \le h \\ 0, & d_{ij} > h \end{cases}$
From this definition it follows at once that for any given point $s_i \in S_n$, the total number of
additional points, $s_j$, within distance h of $s_i$ is given by the sum $\sum_{j \ne i} I_h(d_{ij})$. Hence, if i
now refers to a randomly selected point generated by a point process on R, and if both the
number and locations of points in R are treated as random variables, then in terms of
(4.3.2) the K-function in (4.2.1) above can now be given the following equivalent
definition:

(4.3.3) $K(h) = \dfrac{1}{\lambda}\, E\Big[\sum_{j \ne i} I_h(d_{ij})\Big]$
Observe also that for stationary point processes the value of $K(h)$ must be independent
of the particular point event, i, chosen. So multiplying through by $\lambda$ in (4.3.3) and
summing over all point events $i = 1,\dots,n$ in region R, it follows that

(4.3.4) $K(h) = \dfrac{1}{\lambda n} \sum_{i=1}^{n} E\Big[\sum_{j \ne i} I_h(d_{ij})\Big]$
This “pooled” version of $K(h)$ motivates the following pooled estimate of $K(h)$,
designated as the sample K-function:

(4.3.5) $\hat K(h) = \dfrac{1}{\hat\lambda n} \sum_{i=1}^{n} \sum_{j \ne i} I_h(d_{ij})$
where again, $\hat\lambda = n/a(R)$.9 The advantage of this estimator is that it uses all points of the
given realized point pattern, $S_n$, in R. To interpret $\hat K(h)$, note that if we rewrite (4.3.5) as

(4.3.6) $\hat K(h) = \dfrac{1}{\hat\lambda}\left[\dfrac{1}{n}\sum_{i=1}^{n} \sum_{j \ne i} I_h(d_{ij})\right]$
then the expression in brackets is seen to be simply an average of the relevant point
counts for each of the pattern points, $s_i \in S_n$. Hence, if the underlying process were truly
stationary (and edge effects were small), then this sample K-function would be an
approximately unbiased estimate of the true K-function.10
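For concreteness, a minimal MATLAB sketch of the estimator (4.3.5) might look as follows (an illustration only; the function name k_hat is my own, and pdist2 is from the Statistics Toolbox):

function K = k_hat(pts,a,h)
% Sketch: sample K-function (4.3.5) for pattern pts (n x 2) in region of area a
n = size(pts,1);
d = pdist2(pts,pts);        % all pairwise distances d_ij
d(1:n+1:end) = inf;         % exclude the pairs with i = j
lam = n/a;                  % estimated density
K = sum(d(:) <= h)/(lam*n); % (1/(lam*n)) sum_i sum_{j~=i} I_h(d_ij)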
9. At this point it should be noted that our notation differs from [BG], where regions are denoted by a script R with area R. Here we use R for the region, and make the area function, $a(R)$, explicit. In these terms, (4.3.5) is seen to be identical to the estimate at the top of p. 93 in [BG], where $1/(\hat\lambda n) = a(R)/n^2$.
However, since this idealization can never hold exactly in bounded regions R, it is
necessary to take into account the edge effects created by the boundary of R. Unlike the
case of nn-distances, where the expected values of nn-distances are increased for points
near the boundary (as in Figure 3.16), the expected values of point counts are reduced for
these points, as shown in Figure 4.3a below.
[Figure 4.3: (a) reduced point counts in circles of radius h near the boundary of R; (b) Ripley's weight $w_{ij}$ = fraction of the circle about $s_i$ through $s_j$ lying inside R]
To counter this downward bias, Ripley (1976) proposed a “corrected” version of (4.3.5)
that is quite effective in practice. His correction consists of weighting each point, $s_j$, in
the count $\sum_{j \ne i} I_h(d_{ij})$ in a manner that inflates counts for points near the boundary. If one
considers the circle about $s_i$ passing through $s_j$ (as shown in Figure 4.3b) and defines
$w_{ij}$ to be the fraction of its circumference that lies inside R, then the appropriate
reweighting of $s_j$ in the count for $s_i$ is simply to divide $I_h(d_{ij})$ by $w_{ij}$, producing a new
estimate known as Ripley's correction:

(4.3.7) $\hat K(h) = \dfrac{1}{\hat\lambda n} \sum_{i=1}^{n} \sum_{j \ne i} \dfrac{I_h(d_{ij})}{w_{ij}}$
One can gain some intuition here by observing in Figure 4.3b that weights will be unity
unless the circle about $s_i$ passing through $s_j$ actually leaves R. So the only point pairs
involved are those that are close to the boundary of R, relative to distance h. Moreover, the
closer that $s_j$ is to the edge of R, the more of this circumference is outside R, and hence
the smaller $w_{ij}$ becomes. This means that the values $I_h(d_{ij})/w_{ij}$ are largest for points closest
10. For further discussion of this approximate unbiasedness see Ripley (1977, Section 6).
to the edge, thus inflating $\hat K(h)$ to correct the bias. [An explicit derivation of Ripley's
correction is given in Section 6 of the Appendix to Part I.]
It should be emphasized that while Ripley's correction is very useful for estimating the
true K-function for a given stationary process, this is usually not the question of most
interest. As we have seen above, the key questions relate to whether this process exhibits
structure other than what would be expected under CSR, and how this structure may vary
as the spatial scale of analysis is increased. Here it turns out that in most cases, Ripley's
correction is not actually needed. Hence this correction will not be used in the analysis to
follow.11
To develop K-function tests of CSR, suppose first that the circle of radius h about a point event, s, lies entirely within R, i.e., that

(4.4.1) $C_h(s) \equiv \{v : d(s,v) \le h\} \subseteq R$
Next recall from the basic independence assumption about individual point locations in
CSR processes (Section 2.2 above) that for such processes, the expected number of points
in $C_h(s) - \{s\}$ does not depend on whether or not there is a point event at s, so that

(4.4.2) $E[\,N(C_h(s) - \{s\}) \mid N(s) = 1\,] = E[\,N(C_h(s) - \{s\})\,]$

Hence from expression (4.2.3), together with the area formula for circles [and the fact
that $a(C_h(s)) = a(C_h)$], it follows that

(4.4.3) $E[\,N(C_h(s) - \{s\})\,] = \lambda\, a(C_h) = \lambda \pi h^2$

which together with expression (4.2.4) yields the following simple K-function values:
(4.4.4) $K(h) = \dfrac{1}{\lambda}(\lambda \pi h^2) = \pi h^2$

Thus by standardizing with respect to density, $\lambda$, and ignoring edge effects as in (4.4.1),
we see that the K-function reduces simply to circle area under the CSR Hypothesis. Note also
that when $K(h) > \pi h^2$, this implies a mean point count higher than would be expected
under CSR, and hence indicates some degree of clustering at scale h (as illustrated in
11. Readers interested in estimating the true K-function for a given process are referred to Section 8.4.3 in Cressie (1993), and to the additional references found therein.
Section 4.2 above). Similarly, a value $K(h) < \pi h^2$ implies a mean point count lower than
would be expected under CSR, and hence indicates some degree of dispersion at scale h.
Thus for any given $h > 0$,

(4.4.5) $K(h) > \pi h^2 \;\Rightarrow\; \text{clustering at scale } h$, and $K(h) < \pi h^2 \;\Rightarrow\; \text{dispersion at scale } h$

While these relations are adequate for testing purposes, area values are difficult to
interpret directly. Hence it is usually convenient to further standardize K-functions in a
manner that eliminates the need for considering these values. If for each h we let

(4.4.6) $L(h) \equiv \sqrt{\dfrac{K(h)}{\pi}} - h$

then it follows at once from (4.4.4) that under the CSR Hypothesis,

(4.4.7) $L(h) = \sqrt{\dfrac{\pi h^2}{\pi}} - h = h - h \equiv 0$
for all $h > 0$. In other words, this associated L-function is identically zero under CSR.
Moreover, since $L(h)$ is an increasing function of $K(h)$, it follows that $L(h)$ is positive
exactly when $K(h) > \pi h^2$, and is negative exactly when $K(h) < \pi h^2$. Hence the relations
in (4.4.5) can be given the following simpler form in terms of L-functions:

(4.4.8) $L(h) > 0 \;\Rightarrow\; \text{clustering at scale } h$, and $L(h) < 0 \;\Rightarrow\; \text{dispersion at scale } h$

Given the estimate, $\hat K(h)$, in (4.3.7) above, one can estimate $L(h)$ by

(4.4.9) $\hat L(h) = \sqrt{\dfrac{\hat K(h)}{\pi}} - h$
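As a quick illustration of this conversion, the following sketch computes and plots $\hat L(h)$ over a range of scales, using the k_hat sketch given earlier (again, this is an illustration, not the class program k_function.m):

% Sketch: sample L-function for the Bodmin tors (illustration only)
d = pdist2(Bodmin,Bodmin);              % all pairwise distances
hmax = max(d(:));                       % maximum pairwise distance
h = linspace(0,hmax,20);                % 20 bin values of h
Lhat = zeros(size(h));
for k = 1:numel(h)
    Lhat(k) = sqrt(k_hat(Bodmin,area,h(k))/pi) - h(k);  % (4.4.9), uncorrected
end
plot(h,Lhat,h,zeros(size(h)),'--');     % Lhat against the zero CSR benchmark
xlabel('h'); ylabel('L');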
We can apply these testing ideas to Bodmin by using the MATLAB program,
k_function.m. The first few lines of this program are shown below:
function C = k_function(loc,area,b,extent)
% INPUTS:
% (i) loc = file of locations (xi,yi), i=1..m
% (ii) area = area of region
% (iii) b = number of bins to use in CDF (and plot)
% (iv) extent = 1 if max h = half of max pairwise distance (typical case)
% = 2 if max h = max pairwise distance to be considered
% DATA OUTPUTS: C = (1:b) vector containing raw Point Count
% SCREEN OUTPUTS: Plot of L-Function over the specified extent.
To apply this program, again open the data file, Bodmin.mat, and recall that the tor
locations are given in the matrix, Bodmin. As seen above, the program first computes
$\hat K(h)$ for a range of distance values, h, and then converts this to $\hat L(h)$ and plots these
values against the reference value of zero. The maximum value of h for this illustration
is chosen to be the maximum pairwise distance between pattern points (tors), listed as
option 2 in input (iv) above. The number of intermediate distance values (bins) to be used
is specified by input (iii). Here we set b = 20. Hence to run this program, type:
>> k_function(Bodmin,area,20,2);

[Figure 4.4: Plot of the estimated L-function for Bodmin, with the region of clustering ($\hat L(h) > 0$) at smaller scales marked]
One approach is suggested by recalling that a random point pattern for Bodmin was also
generated in Figure 3.14b above. Hence if the L-function for such a random pattern is
plotted, then this can serve as a natural benchmark against which to compare the L-
function for tors. This random pattern is contained in the matrix, Bod_rn2, of the data file
Bodmin.mat (and is also shown again in Figure 4.7 below). Hence the corresponding
command, k_function(Bod_rn2,area,20,2), now yields a comparable plot of this
benchmark L-function, as shown in Figure 4.5 below.
[Figure 4.5: Random L-function.   Figure 4.6: L-function Overlay (L plotted against h from 0 to 20, with the region of relative clustering marked)]
Here it is clear that the L-function for this random pattern is not flat, but rather is
everywhere negative, and decreases at an increasing rate. Hence relative to zero, this
pattern appears to exhibit more and more dispersion as the scale increases.

[Figure 4.7: The random point pattern for Bodmin (cf. Figure 3.14b)]
12. A nice comparison of Ripley's correction with uncorrected L-functions (such as in Figure 4.4 above) is given in Figure 8.15 of Cressie (1993, p.617).
But if we now ignore the zero reference line and use this random L-function as a
benchmark, then a perfectly meaningful comparison can be made by overlaying these two
L-functions, as in Figure 4.6 above. Here one can see that the region of relative clustering
is now considerably larger than in Figure 4.4, and occurs up to a scale of about $h = 8$ (see
the scale shown in Figure 3.14). But observe that even these benchmark comparisons have
little meaning at scales so large that circles of radius h around all pattern points lie
mostly outside the relevant region R. For this reason, the commonly accepted rule-of-
thumb is that for any given point pattern, $S_n$, one should not consider h-values larger
than half the maximum pairwise distance between pattern points. Hence if we now denote
the maximum pairwise distance for $S_n$ by $h_{\max} = \max\{d(s_i,s_j) : s_i, s_j \in S_n\}$, and use $\bar h$ to
denote the largest value of h to be considered in a given case, then the standard rule-of-
thumb is to set

(4.5.1) $\bar h = h_{\max}/2$

This corresponds to option 1 for input (iv) of k_function above, and option 2 corresponds
to $\bar h = h_{\max}$. We shall have occasion to use (4.5.1) in many of our subsequent analyses,
and in fact this will usually denote the “default” value of $\bar h$.
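In MATLAB, this default extent is easily computed; for example (a sketch, again using the Statistics Toolbox function pdist2):

% Sketch: rule-of-thumb maximum scale (4.5.1) for a pattern pts
d = pdist2(pts,pts);    % all pairwise distances
h_bar = max(d(:))/2;    % half the maximum pairwise distance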
A more important limitation of this benchmark comparison is that (like the JMPIN
version of the Clark-Evans test in Section 3.3.1 above) the results necessarily depend on
the random point pattern that is chosen as a benchmark. Hence we now consider a much
more powerful testing procedure using Monte Carlo methods.
As we saw in Section 3.5 above, it is possible to use Monte Carlo methods to estimate the
sampling distribution of nn-distances for any pattern size in a given region of interest.
This same idea extends to the sampling distribution of any statistic derived from such
patterns, and is of sufficient importance to be stated as a general principle:
Essentially, this simulation procedure gives us a clear statistical picture of what realized
patterns from a CSR process on R should look like. In the case of K-function tests of
CSR, we first consider the standard application of these ideas in terms of “simulation
envelopes”. This method is then refined in terms of a more explicit P-value
representation.
The essential idea here is to simulate N random patterns as above, and to compare the
observed estimate $\hat L(h)$ with the range of estimates, $\hat L_i(h)$, $i = 1,\dots,N$, obtained from this
simulation. More formally, if one defines the lower-envelope and upper-envelope
functions respectively by

(4.6.1) $L_N(h) = \min\{\hat L_i(h) : i = 1,\dots,N\}$

(4.6.2) $U_N(h) = \max\{\hat L_i(h) : i = 1,\dots,N\}$

then $\hat L(h)$ is compared with $L_N(h)$ and $U_N(h)$ for each h. So for a given observed
pattern, $S_n$, in region R, the steps of this Monte Carlo testing procedure can be outlined as
follows:
The key difference between the resulting plot (Figure 4.8) and Figure 4.6 above is that, rather than a single
benchmark pattern, we now have a statistical population of patterns for gauging the
significance of $\hat L(\cdot)$. This plot in fact summarizes a series of statistical tests, one at each
scale of analysis, $h \in H$. In the case illustrated, if we consider any h under the blue
area in Figure 4.8, then by definition, $\hat L(h) > U_N(h)$. But if pattern $S_n$ were just
another sample from this population of random patterns, then every sample value in
$\{\hat L(h), \hat L_1(h),\dots,\hat L_N(h)\}$ would have the same chance of being the biggest. So the chance
that $\hat L(h)$ is the biggest is only $1/(N+1)$. More formally, if pattern $S_n$ is consistent
with CSR, then

(4.6.3) $\Pr[\hat L(h) > U_N(h)] = \dfrac{1}{N+1}, \quad h \in H$

(4.6.4) $\Pr[\hat L(h) < L_N(h)] = \dfrac{1}{N+1}, \quad h \in H$
These probabilities are thus seen to be precisely the P-values for one-tailed tests of the
CSR Hypothesis against clustering and dispersion, respectively. For example, if
$N = 99$ [as in step (i) above], then the chance that $\hat L(h) > U_N(h)$ is only $1/(99+1) = .01$.
Hence at scale, h, one can infer the presence of significant clustering at the .01-level.
Similarly, if there were any $h \in H$ with $\hat L(h) < L_N(h)$ in Figure 4.8, then at this scale
one could infer the presence of significant dispersion at the .01-level. Moreover, higher
levels of significance could easily be explored by simulating larger numbers of random
patterns, say $N = 999$.
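In code, these envelope tests amount to simple componentwise comparisons. A minimal sketch (where Lsim is assumed to be an N x b matrix of simulated L-values, with row i holding $\hat L_i$ over the b bin values of h, and Lobs is the corresponding 1 x b vector of observed values):

% Sketch: simulation envelopes and the one-tailed tests (4.6.3)-(4.6.4)
L_N = min(Lsim,[],1);     % lower envelope at each h
U_N = max(Lsim,[],1);     % upper envelope at each h
sig_clust = Lobs > U_N;   % scales with significant clustering [P = 1/(N+1)]
sig_disp  = Lobs < L_N;   % scales with significant dispersion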
This Monte Carlo test can be applied to the Bodmin example by using the MATLAB
program, k_function_sim.m, shown below.
function k_function_sim(loc,area,b,extent,sims,poly)
% INPUTS:
% (i) loc = file of locations (xi,yi), i=1..n
% (ii) area = area of region
% (iii) b = number of bins to use in CDF (and plot)
% (iv) extent = 2 if max h = max pairwise distance to be considered
% = 1 if max h = half of max pairwise distance (typical case)
% (v) sims = number of simulated random patterns
% (vi) poly = polygon boundary file
Note that the two key additional inputs are the number of simulations (here denoted by
sims rather than N) and the boundary file, poly, for the region, R. As with the program,
clust_sim, in Section 3.5 above, poly is needed in order to generate random points in R.
To apply this program to Bodmin with sims = 99, be sure the data file, Bodmin.mat, is
open in the Workspace, and write:

>> k_function_sim(Bodmin,area,20,1,99,Bod_poly);
However, this approach is still rather limited in the sense that it provides information
only about the relation of $\hat L(h)$ to the maximum and minimum simulated values,
$U_N(h)$ and $L_N(h)$, for each $h \in H$. Hence the following refinement of this approach is
designed to make fuller use of the information obtained from the above Monte Carlo
procedure.
By focusing on the maximum and minimum values, $U_N(h)$ and $L_N(h)$, for each
$h \in H$, the only P-values that can be obtained are those in (4.6.3) and (4.6.4) above.
But it is clear, for example, that values of $\hat L(h)$ just below $U_N(h)$ are probably
still very significant. Hence a natural extension of the above procedure is to focus
directly on P-values for clustering and dispersion, and attempt to estimate these values
on the basis of the given samples. Turning first to clustering, the appropriate P-value is
given by the answer to the following question: If the observed pattern were coming
from a CSR process in region R, then how likely would it be to obtain a value as large
as $\hat L(h)$? To answer this question, let the observed L-value be denoted by $l_0 = \hat L(h)$, and
let the random variable, $L_{CSR}(h)$, denote the L-value (at scale h) obtained from a
randomly sampled CSR pattern of size n on R. Then the answer to the above question
is given formally by the probability that $L_{CSR}(h)$ is at least as large as $l_0$, which we
designate as the clustering P-value, $P_{clustered}(h)$, at scale h for the observed pattern, $S_n$:

(4.6.5) $P_{clustered}(h) = \Pr[\,L_{CSR}(h) \ge l_0\,]$
To estimate this probability, observe that our simulation has by construction produced a
sample of N realized values, $l_i = \hat L_i(h)$, $i = 1,\dots,N$, of this random variable $L_{CSR}(h)$.
Moreover, under the CSR Hypothesis the observed value, $l_0$, is just another sample,
which for convenience we designate as sample $i = 0$. Hence the task is to estimate
(4.6.5) on the basis of a random sample, $(l_0, l_1,\dots,l_N)$, of size $N+1$. The standard
approach to estimating event probabilities is simply to count the number of times the
event occurs, and then to estimate its probability by the relative frequency of these
occurrences. In the present case, the relevant event is “$L_{CSR}(h) \ge l_0$”. Hence if we now
define the indicator variables for this event by

(4.6.6) $\delta_0(l_i) = \begin{cases} 1, & l_i \ge l_0 \\ 0, & l_i < l_0 \end{cases}, \quad i = 0,1,\dots,N$
then the relative-frequency estimator, Pˆclustered (h) , of the desired P-value is given by13
0 (li )
N
(4.6.7) Pˆclustered (h) Pr[ LCSR (h) l0 ] 1
N 1 i 0
To simplify this expression, observe that if m⁺(l₀) denotes the number of simulated
samples, i = 1,..,N, that are at least as large as l₀ [i.e., with δ₀(l_i) = 1], then this
estimated P-value reduces to14

(4.6.8)    P̂_clustered(h) = ( m⁺(l₀) + 1 ) / ( N + 1 )
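In computational terms, (4.6.7) and (4.6.8) amount to a simple counting operation. The
following is a minimal MATLAB sketch, where l0 (the observed L-value at scale h) and
lsim (an N-vector of simulated L-values) are hypothetical variable names, not those of
any program above:

   % Relative-frequency P-value estimate (4.6.8):
   N = length(lsim);
   m_plus = sum(lsim >= l0);              % simulated values at least as large as l0
   P_clustered = (m_plus + 1)/(N + 1);    % estimated clustering P-value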
Observe that expression (4.6.3) above is now the special case of (4.6.8) in which L̂(h)
happens to be bigger than all of the N simulated values. But (4.6.8) conveys a great
deal more information. For example, suppose that N = 99 and that L̂(h) is only the
fifth highest among these N + 1 values. Then in Figure 4.9 this value of L̂(h) would be
inside the envelope [probably much closer to U_N(h) than to L_N(h)]. But no further
information could be gained from this envelope analysis. However, in (4.6.8) the
estimated chance of observing a value as large as L̂(h) is 5/(99 + 1) = .05, so that
13 This is also the maximum-likelihood estimator of P_clustered(h). Such estimators will be considered in more
detail in Part III of this NOTEBOOK.
14 An alternative derivation of this P-value is given in Section 7 of the Appendix to Part I.
this L-value is still sufficiently large to imply some significant degree of clustering.
Such examples show that the P-values in (4.6.8) are considerably more informative
than the simple envelopes above.
Turning next to dispersion, the appropriate P-value is now given by the answer to the
following question: If the observed pattern were coming from a CSR process in region
R, then how likely would it be to obtain a value as small as L̂(h)? The answer to this
question is given by the dispersion P-value, P_dispersed(h), at scale h for the observed
pattern, S_n:

(4.6.9)    P_dispersed(h) = Pr[ L_CSR(h) ≤ l₀ ]
Here, if we let m⁻(l₀) denote the number of simulated L-values that are no larger than
l₀, then exactly the same argument above [with respect to the event "L_CSR(h) ≤ l₀"] now
shows that the appropriate relative-frequency estimate of P_dispersed(h) is given by

(4.6.10)    P̂_dispersed(h) = ( m⁻(l₀) + 1 ) / ( N + 1 )
To apply these concepts, observe first that (unless many l_i values are the same as l₀)15
it must be true that P̂_dispersed(h) ≈ 1 − P̂_clustered(h). So there is generally no need to
compute both. Hence we now focus on clustering P-values, P̂_clustered(h), for a given
point pattern, S_n, in region R. Observe next that to determine P̂_clustered(h), there is no
need to use L-values at all. One can equally well order the K-values. In fact, there is no
need to normalize by λ̂ since this value is the same for both the observed and simulated
patterns. Hence we need only compute "raw" K-function values, as given by the
bracketed part of expression (4.3.6). Finally, to specify an appropriate range of scales to
be considered, we take the maximum value of h to be the default value, h = h_max/2, in
(4.5.1), and specify a number, b, of equal divisions of h. The values of P̂_clustered(h) are
then computed for each of these h values, and the result is plotted.
15 The question of how to handle such ties is treated more explicitly in Section 7 of the Appendix to Part I.
(Simply ignore the fourth input "1" for the present.) The screen output of k_count_plot
gives the maximum value of h used by the program, which in this case is Dmax/2 = 8.6859.
The minimum pairwise distance between all pairs of points (Dmin = 0.5203) is also
shown. This value is useful for interpreting P-values at small scales, since all values of
h less than this minimum must have K̂(h) = 0, and hence must be "maximally
dispersed" by definition [since no simulated pattern can have smaller values of K̂(h)].
[Figure 4.10. Bodmin Cluster P-Values: estimated clustering P-values plotted against scale h, 0 ≤ h ≤ 9]

This plot of clustering P-values conveys more than the envelope approach above. In
particular, clustering at scales in the range 1.7 ≤ h ≤ 5.7 is now seen to be significant at
the .01 level, which is by definition the highest level of significance possible for N = 99.16
Here it is also worth noticing that the clustering P-value at scale h = .5 is so large (in
fact .93 in the above simulation) that it shows weakly significant dispersion (where the
upper dashed red line indicates significant dispersion at the .05 level). The statistical
reason for this can be seen from the screen output, which shows the minimum distance
between any two tors to be .52. Hence at scale h = .5 it must be true that no circle of
radius .5 about any tor can contain other tors, so that we must have K̂(.5) = 0. But since
random point patterns such as in Figure 3.14b often have at least one pair of points this
close together, it becomes clear that there is indeed some genuine local dispersion here.
Further reflection suggests that this is probably due to the nature of rock outcroppings,
which are often only the exposed portion of larger rock formations, and thus cannot be
too close together. So again we see that the P-value map adds information about this
pattern that may well be missed by simple visual inspection.
16 Simulations with N = 999 yield about the same results as Figure 4.10, so this appears to be a more
accurate range than given by the envelope in Figure 4.9.
While no explicit applications are given in [BG], we can illustrate the main ideas with
the following housing abandonment example.
As in the Philadelphia example of Section 1.2 above, suppose that we are given the
locations of n currently abandoned houses in a given city, R, such as in Figure 4.11a
below.
[Figure 4.11a. Locations of abandoned houses within the city boundary. Figure 4.11b. Census tracts with housing-unit counts, H_i]
In addition, suppose that data on the number of housing units, H_i = H(C_i), in each
census tract, C_i, i = 1,..,m, within city R is also available, as in Figure 4.11b. If the
total number of housing units in the city is denoted by

(4.7.1)    H(R) = Σ_{i=1}^{m} H(C_i) = Σ_{i=1}^{m} H_i

then the probability that a randomly sampled housing unit will be located in tract i is
given by

(4.7.2)    P_i = H_i / H(R) ,   i = 1,..,m
Thus if these n housing abandonments were completely random events (i.e., with no
housing unit more likely to be abandoned than any other), then one would expect the
distribution of abandoned houses across census tracts to be given by n independent
random samples from the distribution in (4.7.2).17 More formally, this is an example of
a nonhomogeneous CSR hypothesis with respect to a given reference measure, H.
17 In particular, this would yield a marginal distribution of abandonments in each tract, C_i, given by the
binomial distribution in expression (2.4.3) above with C = C_i.
To test such hypotheses, we proceed in exactly the same way as in the homogeneous case.
The only real difference here is that the probability distributions corresponding to
nonhomogeneous spatial hypotheses are somewhat more complex. Using the above
example as an illustration, we can simulate samples of n random abandonments from
the appropriate distribution by the following two-stage sampling procedure:

(i) Randomly sample a census tract, C_i, from the distribution in (4.7.2).

(ii) Randomly sample a location from the uniform distribution on this tract.

(iii) Repeat (i) and (ii) n times to obtain a point pattern, S_n^(i) = (s_j^(i) : j = 1,..,n).

The resulting pattern, S_n^(i), corresponds to the above hypothesis in the sense that
individual abandonment locations are independent, and the expected number of
abandonments in each tract, C_j, is proportional to the reference measure, H_j = H(C_j).
However, this reference measure is only an approximation to the ideal theoretical
measure, since the actual locations of individual housing units are not known. [This is
typical of situations where certain key spatial data is available only at some aggregate
level.18] Hence in step (ii) the location of a housing unit in C_i is taken to be uniformly
(homogeneously) distributed throughout this subregion. The consequences of this
"local uniformity" approximation to the ideal reference measure will be noted in
the numerical examples below.
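As a concrete illustration, the following MATLAB fragment sketches this two-stage
procedure. It is only a sketch, with hypothetical inputs: H, an m-vector of tract housing
counts, and tracts, an m-cell array of tract boundary polygons (it is not the sampling
code of any program in this NOTEBOOK):

   P = H / sum(H);                       % tract probabilities (4.7.2)
   cumP = cumsum(P);
   S = zeros(n,2);
   for j = 1:n
       i = find(rand <= cumP, 1);        % stage (i): sample a tract
       B = tracts{i};                    % tract boundary polygon [x y]
       xmin = min(B(:,1)); xmax = max(B(:,1));
       ymin = min(B(:,2)); ymax = max(B(:,2));
       accept = false;
       while ~accept                     % stage (ii): uniform point in tract (rejection)
           s = [xmin + rand*(xmax-xmin), ymin + rand*(ymax-ymin)];
           accept = inpolygon(s(1), s(2), B(:,1), B(:,2));
       end
       S(j,:) = s;                       % stage (iii): accumulate the pattern
   end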
18 Such aggregate data sets will be treated in more detail in Part III of this NOTEBOOK.
To illustrate this testing procedure, the following example has been constructed from
the Larynx and Lung Cancer example of Section 1.2 above. Here we focus only on
Lung Cancer, and for simplicity consider only a random subsample of n = 100 lung
cases, as shown in Figure 4.12 below.
[Figure 4.12. Subsample of 100 lung cancer cases (scale bar: 0–10 km)]
Note from Figures 1.7 and 1.8 that this is fairly representative of the full data set (917
lung cancers). To analyze this data set we begin by observing that in terms of area
alone, the point pattern in Figure 4.12 is obviously quite clustered.
But the single most important factor contributing to this clustering (as observed in
Section 2.4 above) is the conspicuous absence of an appropriate reference measure –
namely population. In Figure 4.15 below, the given subsample of lung cases in Figure
4.12 above is now depicted on the appropriate population backcloth of Figure 1.8.
[Figure 4.15. Lung cancer subsample on the population backcloth of Figure 1.8 (scale bar: 0–10 km)]
Here it is clear that much of the clustering in Figure 4.12 can be explained by variations
in population density. Notice also that the relative sparseness of points in the west and
east is also explained by the lower population densities in these areas (especially in
the east). For comparison, a random pattern generated using the two-stage sampling
procedure above is shown in Figure 4.16. Here there still appears to be somewhat less
clustering than in Figure 4.15, but the difference is now far less dramatic than above.
Up to this point we have only considered global properties of point patterns, namely the
overall clustering or dispersion of patterns at various scales. However, in many cases
interest focuses on more local questions of where significant clustering or dispersion is
occurring. Here we begin by constructing local versions of K-functions, and then apply
them to several examples.
Recall from expression (4.3.3) that K-functions were defined in terms of expected point
counts for a randomly selected point in a pattern. But exactly the same definitions can
be applied to each individual point in the pattern by simply modifying the interpretation
of (4.3.3) to be a given point, i, rather than a randomly sampled point, and rewriting
this expression as a local K-function for each point, i:

(4.8.1)    K_i(h) = (1/λ) E( Σ_{j≠i} I_h(d_ij) )

Moreover, if we now relax the stationarity assumption used in (4.3.4) above, then these
expected values may differ for each point, i. In this context, the pooled estimator
(4.3.5) for the stationary case now reduces to the corresponding local estimator:

(4.8.2)    K̂_i(h) = (1/λ̂) Σ_{j≠i} I_h(d_ij)
Hence to determine whether there is significant clustering about point i at scale h, one
can develop local Monte Carlo testing procedures using these statistics.

In the case of homogeneous CSR hypotheses, one can simply hold point i fixed in
region R and generate N random patterns of size n − 1 in R (corresponding to the
locations of all other points in the pattern). Note that in the present case, (4.8.2) is
simply a count of the number of points within distance h of point i, scaled by 1/λ̂. But
since this scaling has no effect on Monte Carlo tests of significance, one can focus
solely on point counts (which may be thought of as a "raw" K-function). For each
random pattern, one can then simply count the number of points within distance h of
point i. Finally, by comparing these counts with the observed point count, one can then
generate p-values for each point, i = 1,..,n, and distance, h [paralleling (4.6.8) above]:

(4.8.3)    P̂_i(h) = ( m_i(h) + 1 ) / ( N + 1 )
where m_i(h) now denotes the number of simulated patterns with counts at distance h
from i at least as large as the observed count. This testing procedure is operationalized
in the MATLAB program, k_count_loc.m, shown below:
Here the main output, Pval, is a matrix of P-values at each reference point and each
distance value under the CSR Hypothesis. (The point counts for each point-distance
pair are also given in the output matrix, C0.) Notice that since homogeneity is simply a
special case of heterogeneity, this program is designed to apply both to homogeneous
and nonhomogeneous CSR hypotheses.
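The core of this computation can be sketched in a few lines of MATLAB. This is only a
rough sketch (not the actual k_count_loc.m), where loc is the n × 2 matrix of pattern
locations, D a vector of k distance values, N the number of simulations, and sim_pattern
a hypothetical stand-in for the appropriate random-pattern generator on R (uniform in
the homogeneous case, two-stage in the nonhomogeneous case):

   n = size(loc,1);  k = length(D);
   Dx = loc(:,1) - loc(:,1)';  Dy = loc(:,2) - loc(:,2)';
   Dist = sqrt(Dx.^2 + Dy.^2);                 % pairwise distances among pattern points
   C0 = zeros(n,k);  M = zeros(n,k);
   for w = 1:k
       C0(:,w) = sum(Dist <= D(w), 2) - 1;     % observed counts (excluding the point itself)
   end
   for m = 1:N
       locm = sim_pattern(n-1);                % n-1 random points (point i held fixed)
       for i = 1:n
           d = sqrt(sum((locm - loc(i,:)).^2, 2));
           for w = 1:k
               M(i,w) = M(i,w) + (sum(d <= D(w)) >= C0(i,w));
           end
       end
   end
   Pval = (M + 1)/(N + 1);                     % local p-values (4.8.3)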
The homogeneous case can be illustrated by the following application to Bodmin tors.
Recall that the location pattern of tors is given by the matrix, Bodmin, in the workspace
Bodmin.mat. Here there is a single boundary polygon, Bod_poly. Hence the reference
measure can be set to a constant value, say M = 1. So the appropriate command for
999 simulations in this case is given by:
In view of Figure 4.10 above, one expects that the most meaningful distance range for
significant clustering will be somewhere between h = 1 and h = 5 kilometers. Hence
the selected range of distances was chosen to be D = [1,2,3,4,5]. One key advantage of
this type of local analysis is that since a p-value is now associated with each individual
point, it is now possible to map the results. In the present case, the results of this Monte
Carlo analysis were imported to ARCMAP, and are displayed in Bodmin.mxd. In
Figure 4.18 below, the p-value maps for selected radii of h = 2, 3, 5 km are shown. As
seen in the legend (lower right corner of the figure), the darker red values correspond to
lower p-values, and hence denote regions of more significant clustering. As expected,
there are basically two regions of significant clustering corresponding to the two large
groupings of tors in the Bodmin field.
[Figure 4.18. Local cluster P-value maps for Bodmin tors at radii h = 2 km, 3 km, and 5 km. Legend (P-VALUES): 0.001–0.005, 0.005–0.010, ...]
Notice here that clustering is much more pronounced at a radius of 3 km than at smaller
or larger radii. (The red circle in the figure shows the actual scale of a 3 km radius.)
This figure well illustrates the ability of local K-function analyses to pick up sharper
variations in scale than global analyses such as Figure 4.10 above (where there
appeared to be equally significant clustering at all three scales, h = 2, 3, 5 km). Hence it
should be clear from this example that local analyses are often much more informative
than their global counterparts.
The ability to map p-values in local analyses suggests one additional extension that is
often more appropriate than direct testing of clustering at each individual point. By way
of motivation, suppose that one is studying a type of tree disease by mapping the
locations of infected trees in a given forest. Here it may be of more interest to
distinguish diseased regions from healthy regions in the forest rather than to focus on
individual trees. A simple way to do so is to establish a reference grid of locations in
the forest, and then to estimate clustering p-values at each grid location rather than at
each tree. (The construction of reference grids is detailed in Section 4.8.3 below.) Such
a uniform grid of p-values can then be easily interpolated to produce a smoother visual
representation of local clustering.
[Figure 4.19. Reference Grid for Local Clustering: a uniform grid of points, with a circle of radius h about a typical grid point]
Assuming that the forest itself is reasonably uniform with respect to the spatial
distribution of trees, the homogeneous CSR hypothesis would again provide a natural
benchmark for identifying significant clustering of diseased trees. In this case, one
would simulate random patterns of diseased trees and compare disease counts with
those observed within various distances h of each grid point. Hence those grid points
with low p-values at distance h would denote locations where there is significant
disease clustering at scale h .
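For the Bodmin example below, a reference grid of this kind can be built in MATLAB
along the following lines. This is a minimal sketch only, assuming hypothetical
bounding-box values, Xmin, Xmax, Ymin, Ymax, and a cell size, c:

   xv = Xmin : c : Xmin + c*ceil((Xmax - Xmin)/c);   % upper limits adjusted upward to
   yv = Ymin : c : Ymin + c*ceil((Ymax - Ymin)/c);   % yield whole cells of the same size
   [X, Y] = meshgrid(xv, yv);
   ref = [X(:), Y(:)];                               % grid coordinates, one point per row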
This produces a 2-column matrix, ref, of grid point coordinates. (The upper left corner
of the grid is displayed on the screen for a consistency check.)

[Figure 4.20. Bodmin reference grid (left: full grid; right: grid points masked to the Bodmin boundary)]

A plot of the full grid, ref, is shown on the left in Figure 4.20.19 (In Section 8 of the Appendix to Part I a
procedure is developed for obtaining this full grid representation directly in MATLAB.)
While all of these grid points are used in the calculation, those outside of the Bodmin
boundary are only relevant for maintaining some degree of smoothness in the
interpolation constructed below. On the right, these grid points have been masked out in
order to display only those points inside the Bodmin boundary. (The construction of
such visual masks is quite useful for many displays, and is discussed in detail in Section
1.2.4 of Part IV in this NOTEBOOK.)
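At each grid point the relevant statistic is again just a raw point count. In MATLAB, for
example, the counts of pattern points within distance h of every grid point can be
obtained in two lines (with the grid matrix, ref, and a hypothetical n × 2 location
matrix, loc):

   d = sqrt((ref(:,1) - loc(:,1)').^2 + (ref(:,2) - loc(:,2)').^2);   % grid-to-point distances
   Cg = sum(d <= h, 2);        % count of pattern points within distance h of each grid point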
Given this reference grid, ref, the extension of k_count_loc.m that utilizes ref is
operationalized in the MATLAB program, k_count_loc_ref.m. This program is
essentially identical to k_count_loc.m except that ref is a new input. Here one obtains
p-values for Bodmin at each reference point in ref with the command:
19 Notice that the right side and top of the grid extend slightly further than the left and bottom. This is
because the Xmax and Ymax values in the program are adjusted upward to yield an integral number of
cells of the same size.
where the matrix, Pval, now contains one p-value for each grid point in ref and distance
radius in D. The results of this Monte Carlo simulation were exported to ARCMAP, and
the p-values at each grid point inside Bodmin are displayed for h = 3 km on the left in
Figure 4.21 below (again with a mask). By comparing this with the associated point
[Figure 4.21. Grid p-values for Bodmin at h = 3 km (left: masked grid points; right: interpolated surface). Legend (P-VALUES): 0.001–0.002, 0.002–0.005, 0.005–0.01, 0.01–0.02, 0.02–0.05, 0.05–0.10, 0.10–0.20, 0.20–1.00]
plot in the center of Figure 4.18, one can see that this is essentially a smoother version
of the results depicted there. However, this representation can be considerably
improved upon by interpolating these values using any of a number of standard
"smoothers" (discussed further in Part II). The interpolation shown on the right was
obtained by the method known as ordinary kriging. This method of (stochastic)
interpolation will be developed in detail in Section 6.3 of Part II of this NOTEBOOK.
Next we extend these methods to the more general case of nonhomogeneous CSR
hypotheses. As with all spatial Monte Carlo testing procedures, the key difference
between the homogeneous and nonhomogeneous cases is the way in which random
points are generated. As discussed in Section 4.7.2 above, this generation process for
the nonhomogeneous case amounts to a two-stage sampling procedure in which a
polygon is first sampled in a manner proportional to the given reference measure, M,
and then a random location in this polygon is selected. Since this procedure is already
incorporated into both the programs k_count_loc.m and k_count_loc_ref.m above,
there is little need for further discussion at this point.
The locations of these 500 incidents are shown on the left in Figure 4.22 below, and are
also displayed in the map document, Phil_igc.mxd, in ARCMAP. Here the natural null
hypothesis would be that every individual has the same chance of reporting an
"incident". But as with the housing abandonment example in Figure 4.11 above,
individual location data is not available. Hence census tract population levels must
serve as the relevant reference measure.
[Figure 4.22. Locations of the 500 reported incidents in Philadelphia]
Hence the apparent clusters of incidents on the left in Figure 4.22 are only significant if
they are more concentrated than would be expected under this
hypothesis. Hence, even though there is clearly a cluster of cases in South Philadelphia,
it is not clear that this is a significant cluster. Notice however that the Kensington area
just northeast of Center City does appear to be more concentrated than would be
expected under the given hypothesis. But no conclusion can be reached on the basis of
this visual comparison. Rather, we must simulate many realizations of random patterns
and determine statistical significance on this basis.
To do so, a reference grid for Philadelphia was constructed, and is shown (with
masking) on the left in Figure 4.23 below, in a manner similar to Figure 4.20 above.
Here a range of distances was tried, and clustering was most apparent at a radius of 500
meters (in a manner similar to the radius of 3 km in Figure 4.18 above for the Bodmin
example). The p-value results for this case are contained in the MATLAB workspace
for this example.
[Figure 4.23. Philadelphia reference grid (left, with mask) and p-value map at 500 meters. Legend (P-VALUES): 0.000–0.001, 0.001–0.005, 0.005–0.100, 0.100–0.200, 0.200–1.000]
Here loc contains the locations of the 500 IGC incidents, ref is the reference grid
shown above, D contains a range of distances including the 500-meter case,20 and pop
contains the populations of each census tract, with boundaries given by bnd. These
results were imported to ARCMAP as a point file, and are displayed as P-val.shp in the
data frame, “P-Values for Dist = .005”, of Phil_igc.mxd. Finally, these p-values were
interpolated using a different smoothing procedure than that of Figure 4.21 above. Here
the spline interpolator in Spatial Analyst was used, together with the contour option.
The details of this procedure are described in Section 8 of the Appendix to Part I.21
Here the red contours denote the most significant areas of clustering, which might be
interpreted as IGC “hotspots”. Notice in particular that the dominant hotspot is
precisely the Kensington area mentioned above. Notice also that the clustering in West
Philadelphia, for example, is now seen to be explained by population density alone, and
hence is not statistically significant.
It is also worth noticing that there is a small "hotspot" just to the west of Kensington
(toward the Delaware River) that appears hard to explain in terms of the actual IGC
incidents in Figure 4.22. The presence of this hotspot is due to the fact that while there
are only four incidents in this area, the population density is less than a quarter of that
in the nearby Kensington area. So this incidence number is unusually high given the low
density. This raises the practical question of how many incidents are required to
constitute a meaningful cluster. While there can be no definitive answer to this
question, it is important to emphasize that statistical analyses such as the present one
should be viewed as providing only one type of useful information for cluster
identification.22
20 The actual coordinates for this map were in decimal degrees, so that the value .005 corresponds roughly
to 500 meters.
21 Notice also that this contour map of P-values is an updated version of that in the graphic header for the
class web page. That version was based on only 99 simulations (run on a slower machine).
22 This same issue arises in regression, where there is a need to distinguish between the statistical
significance of coefficients (relative to zero) and the practical significance of their observed magnitudes in
any given context.
Up to this point, our analysis of point patterns has focused on single point patterns, such
as the locations of redwood seedlings or lung cancer cases. But often the relevant
questions of interest involve relationships between more than one pattern. For example,
if one considers a forest in which redwoods are found, there will invariably be other species
competing with redwoods for nourishment and sunlight. Hence this competition between
species may be of primary interest. In the case of lung cancers, recall from Section 1.2
that the lung cancer data for Lancashire was primarily of interest as a reference
population for studying the smaller pattern of larynx cancers. We shall return to this
example in Section 5.8 below. But for the moment we start with a simple forest example
involving two species.
The 600 foot square section of forest shown in Figure 5.1 below contains only two types
of trees. The large dots represent the locations of oak trees, and the small dots represent
locations of maple trees. Although this is a fairly small section of forest, it seems clear
that the pattern of oaks is much more clustered than that of maples. This is not surprising,
given the very different seed-dispersal patterns of these two types of trees.
[Figure 5.1. Oak and maple locations in a 600-foot-square section of forest (OAK = large dots, MAPLE = small dots)]
As shown in Figure 5.2, oaks produce large acorns that fall directly from the tree, and
are only partially dispersed by squirrels. Maples, on the other hand, produce seeds with
individual "wings" that can transport each seed a considerable distance with even the
slightest breeze. Hence there are clear biological reasons why the distribution of oaks
might be more clustered than that of maples. So how might we test this hypothesis
statistically?
As one approach to this question, observe that if oaks tend to occur in clusters, then one
should expect to find that the neighbors of oak trees tend to be other oaks, rather than
maples. Alternatively put, one should expect to find fewer maples near oak locations than
near other locations. While one could in principle test these ideas in terms of nearest-
neighbor statistics, we have already seen in the Bodmin tors example that this does not allow any
analysis of relationships between point patterns at different scales. Hence a more flexible
approach is to extend the above K-function analysis for single populations to a similar
method for comparing two populations.1
The idea is simple. Rather than looking at the expected number of oak trees within
distance h of a given oak, we look at the expected number of maple trees within distance
h of the oak. More generally, if we now consider two point populations, 1 and 2, with
respective intensities, λ₁ and λ₂, and denote the members of these two populations by i
and j, respectively, then the cross K-function, K₁₂(h), for population 1 with respect to
population 2 is given for each distance h by the following extension of expression (4.2.1)
above:

(5.2.1)    K₁₂(h) = (1/λ₂) E( number of population-2 events within distance h of a
           randomly sampled population-1 event )
Notice that there is an asymmetry in this definition, and that in general, K₁₂(h) ≠ K₂₁(h).
Notice also that the word "additional" in (4.2.1) is no longer meaningful, since
populations 1 and 2 are assumed to be distinct. This definition can be formalized in a
manner paralleling the single-population case as follows. First, for any realized point
patterns, S₁ = (s_i : i = 1,..,n₁) and S₂ = (s_j : j = 1,..,n₂), from populations 1 and 2 in region
R, let d_ij = d(s_i, s_j) denote the distance between member i of population 1 and member j of
population 2 in R. Then for each distance h we again employ the indicator function

(5.2.2)    I_h(d_ij) = I_h[d(s_i, s_j)] = 1 if d_ij ≤ h, and 0 otherwise
in terms of which the cross K-function in (5.2.1) takes the formal form

(5.2.3)    K₁₂(h) = (1/λ₂) E( Σ_{j=1}^{n₂} I_h(d_ij) )
1 Note that while our present focus is on two populations, analyses of more than two populations are
usually formulated either as (i) pairwise comparisons between these populations (as with correlation
analyses), or (ii) comparisons between each population and the aggregate of all other populations. Hence
the two-population case is the natural paradigm for both these approaches.
where both the size, n₂, of population 2 and the distances, (d_ij : j = 1,..,n₂), are here
regarded as random variables.2 This function plays a fundamental role in our subsequent
comparative analyses of populations.

Given the definition in (5.2.3), it is immediately apparent that cross K-functions can be
estimated in precisely the same way as K-functions. First, since the expectation in (5.2.3)
does not depend on which random reference point i is selected from population 1, the
same argument as in (4.3.4) now shows that for any given size, n₁, of population 1,3

(5.3.2)    K₁₂(h) = (1/(λ₂ n₁)) Σ_{i=1}^{n₁} E( Σ_{j=1}^{n₂} I_h(d_ij) )
In this form, it is again apparent that for any given realized patterns, S₁ = (s₁ᵢ : i = 1,..,n₁)
and S₂ = (s₂ⱼ : j = 1,..,n₂), the expected counts in (5.3.2) are naturally estimated by their
corresponding observed counts, and that the intensities, λ₁ and λ₂, are again estimated by
the observed intensities,

(5.3.3)    λ̂_k = n_k / a(R) ,   k = 1, 2

Thus the natural (maximum likelihood) estimate of K₁₂(h) is given by the sample cross
K-function:

(5.3.4)    K̂₁₂(h) = (1/(λ̂₂ n₁)) Σ_{i=1}^{n₁} Σ_{j=1}^{n₂} I_h(d_ij)
2 To be more precise, n₂ is a random integer (count), and for any given value of n₂, the conditional
distribution of [d_ij = d(s_i, s_j) : j = 1,..,n₂] is then determined by the conditional distribution of the
locations, [s_i, (s_j : j = 1,..,n₂)], in R, where s_i is implicitly taken to be the location of a randomly sampled
member of population 1.
3 Technically this should be written as a conditional expectation given n₁ [and (4.3.4) should be a
conditional expectation given n]. But for simplicity, we ignore this additional layer of notation.
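As an aside, (5.3.3) and (5.3.4) translate directly into a few lines of MATLAB. The
following minimal sketch assumes hypothetical inputs: location matrices S1 (n₁ × 2)
and S2 (n₂ × 2), the area, aR, of region R, and a scale, h:

   n1 = size(S1,1);  n2 = size(S2,1);
   lam2 = n2/aR;                            % estimated intensity of population 2 (5.3.3)
   Dx = S1(:,1) - S2(:,1)';  Dy = S1(:,2) - S2(:,2)';
   Dist = sqrt(Dx.^2 + Dy.^2);              % n1 x n2 cross-distance matrix
   K12 = sum(Dist(:) <= h)/(lam2*n1);       % sample cross K-function (5.3.4)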
We next use these sample cross K-functions as test statistics for comparing populations 1
and 2. Recall that in the single-population case, the fundamental question of interest was
whether or not the given population was more clustered (or more dispersed) than would
be expected if the population locations were completely random. This led to the CSR
hypothesis as a natural null hypothesis for testing purposes. However, when one
compares two populations of random events, the key question is usually whether or not
these events influence one another in some way. So here the natural null hypothesis takes
the form of statistical independence rather than randomness. In terms of cross K-
functions, if there are significantly more j-events close to i-events than would be
expected under independence, then one may infer that there is some "attraction" between
populations 1 and 2. Conversely, if there are significantly fewer j-events close to i-
events than expected, then one may infer that there is some "repulsion" between these
populations. These basic distinctions between the one-population and two-population
cases can be summarized as in Table 5.1 below:

Table 5.1. One-Population versus Two-Population Comparisons

                       Null Hypothesis                    Departures from the Null
   One population      complete spatial randomness (CSR)  clustering / dispersion
   Two populations     spatial independence               attraction / repulsion
Next we observe that from a testing viewpoint, the particular appeal of the CSR
hypothesis is that one can easily simulate location patterns under this hypothesis. Hence
Monte Carlo testing is completely straightforward. But the two-population hypothesis of
spatial independence is far more complex. In principle this would not be a problem if one
were able to observe many replications of these sets of events, i.e., many replications of
joint patterns from populations 1 and 2. But this is almost never the case. Typically we
are given a single joint pattern (such as the patterns of oaks and maples in Figure 5.1
above) and must somehow detect “departures from independence” using only this single
realization. Hence it is necessary to make further assumptions, and in particular, to define
“spatial independence” in a manner that allows the distribution of sample cross K-
functions to be simulated under this hypothesis. Here we consider two approaches,
designated respectively as the random-shift approach and the random-permutation
approach.
If we now model the processes generating each population as stationary point processes
(as in Section 2), and we again represent each process by the collection of cell counts in
R, say N_k = {N_k(C) : C ⊆ R}, k = 1, 2, then it follows in particular from (2.5.1) that the
marginal cell-count distribution, Pr[N_k(C_h)], for population k in any circular cell, C_h, of
radius h must be the same for all locations.4 Hence if we now focus on population 2 and
imagine a two-stage process in which (i) a point pattern for population 2 is generated, and
(ii) this pattern is then shifted by adding some constant vector, a, to each point,
s_j → s_j + a, then the expected number of points in C_h would be the same for both stage
(i) and stage (ii). Indeed this shift simply changes the location of C_h relative to the
pattern (as in Figure 5.5 below), so that by stationarity the expected point count must stay
the same.
In this context, the appropriate spatial independence hypothesis simply asserts that cell
counts for population 2 are not influenced by the locations of population 1, i.e., that for
all cells, C ⊆ R,5

(5.5.1)    Pr[ N₂(C) | N₁ ] = Pr[ N₂(C) ]

Now returning to the two-stage "shift" process described above, this process suggests a
natural way of testing the independence hypothesis in (5.5.1) using sample cross K-
functions. In particular, if the given realization of population 2 is randomly shifted in any
way, then this should not affect the expected counts defining the cross K-function in
(5.2.3).
4 For the present, we implicitly assume that region R is "sufficiently large" that edge effects can be ignored.
5 Note that while there is an apparent asymmetry in this definition between populations 1 and 2, the
definition of conditional probability implies that (5.5.1) must also hold with labels 1 and 2 reversed.
6 This is an instance of what is called a "hard-core" process in the literature (as for example in Ripley,
1977, section 3.2 and Cressie, 1993, section 8.5.4).
But in its present form, such a test is not practically possible, since we are only able to
observe these processes in a bounded region, R. Thus any attempt to "shift" the pattern
for population 2 will require knowledge of the pattern outside this window, as shown in
Figures 5.4 and 5.5 below. Here the black dots represent unknown sites of population-2
events. Hence any shift of the pattern relative to region R will allow the possible entry of
unknown population-2 events into the window defined by region R.
[Figure 5.4. Pattern for Population 2    Figure 5.5. Randomly Shifted Pattern]
However, it turns out that under certain conditions one can construct a reasonable
approximation to this ideal testing scheme. In particular, if the given region R is
rectangular, then there is indeed a way of approximating stationary point processes
outside the observable rectangular window. To see this, suppose we start with the two
point patterns in a rectangular boundary, R, as shown in Figure 5.6 below (with pattern 1
= white dots and pattern 2 = black dots).7 If these patterns are in fact generated by
stationary point processes on the plane, then in particular, the realized pattern,
S₂⁰ = (s₂ⱼ⁰ : j = 1,..,n₂), for population 2 (shown separately in Figure 5.7 below) could
equally well have occurred in any shifted version of region R.

But since the rectangularity of R implies that the entire plane can be filled by a "tiling" of
disjoint copies of region R (also called a "lattice" of translations of R), and since this same
point pattern can be treated as a typical realization in each copy of R, we can in principle
extend the given pattern in region R to the entire plane by simply reproducing this pattern
in each copy of R [as shown partially in Figure 5.8 below].8 We designate this infinite
version of pattern S₂⁰ by S̃₂⁰.
7 This example is taken from Smith (2004).
8 Such replications are also called "rectangular patterns with periodic boundary conditions" (see for
example Ripley, 1977 and Diggle, 1983, section 1.3).
In this way, we can effectively remove the "edge effects" illustrated in Figure 5.5 above.
Moreover, while the "replication process" that generates S̃₂⁰ must of course exhibit
stronger symmetry properties than the original process for population 2, it can be shown
that this process shares the same mean and covariance structure as the original process.
Moreover, it can also be shown that under the spatial independence hypothesis, the cross
K-function yielded by this process must be the same as for the original process.9 Hence
for the case of rectangular regions, R, it is possible to carry out this replicated version of
the "ideal" testing procedure described above.

To make this test explicit, we start by observing that it suffices to consider only local
random shifts. To see this, note first that if point pattern 1 in Figure 5.6 is designated by
S₁⁰ = (s₁ᵢ⁰ : i = 1,..,n₁), then shifting S̃₂⁰ relative to S₁⁰ on the plane is completely equivalent
to shifting S₁⁰ relative to S̃₂⁰. Hence we need only consider shifts of S₁⁰. Next observe by
symmetry that every distinct rectangular portion of S̃₂⁰ that can occur in shifted versions of
R (such as the pattern inside the blue box of Figure 5.8) can be obtained at some position
of R inside the red dotted boundary shown in Figure 5.8. Hence we need only consider
random shifts of S₁⁰ within this boundary. Again, the blue box in Figure 5.8 represents
one such shift (where the white dots for population 1 have been omitted for sake of visual
clarity). Hence to construct the desired random-shift test, we can use the following
procedure:
(i) Simulate N random shifts that will keep rectangle R inside the feasible region in
Figure 5.9, and in each case shift all coordinates in S₁⁰ by the same amount.

(ii) If S₂ᵐ = (s₂ⱼᵐ : j = 1,..,n₂ᵐ) denotes the pattern for population 2 occurring in random
shift m = 1,..,N of rectangle R (which will usually be of a slightly different size than
S₂⁰), then a sample cross K-function, K̂₁₂ᵐ(h), can be constructed from S₁⁰ and S₂ᵐ. In
particular, these functions can be evaluated at each of a selected set of radial distances,
(h_w : w = 1,..,k).

(iii) Finally, if the observed sample cross K-function, K̂₁₂⁰(h), is constructed in the
same way from S₁⁰ and S₂⁰ (where the latter pattern is equivalent to the "zero shift"
denoted by the central box in Figure 5.8), then under the spatial independence
hypothesis, (5.5.1), each observed value, K̂₁₂⁰(h_w), should be a "typical" sample from
the list of values [K̂₁₂ᵐ(h_w) : m = 0,1,..,N]. Hence (in a manner completely analogous
9 See the original paper by Lotwick and Silverman (1982) for proofs of these facts.
to the clustering and dispersion P-values of Section 4.6 above), if M₀⁺(h_w) denotes the
number of simulated random shifts, m = 1,..,N, with K̂₁₂ᵐ(h_w) ≥ K̂₁₂⁰(h_w), then the
estimated probability of obtaining a value as large as K̂₁₂⁰(h_w) under this spatial
independence hypothesis is given by the attraction p-value,

(5.5.3)    P̂_attraction(h_w) = ( M₀⁺(h_w) + 1 ) / ( N + 1 )

Similarly, if M₀⁻(h_w) denotes the number of simulated random shifts with
K̂₁₂ᵐ(h_w) ≤ K̂₁₂⁰(h_w), then the corresponding estimate is given by the repulsion
p-value,

(5.5.4)    P̂_repulsion(h_w) = ( M₀⁻(h_w) + 1 ) / ( N + 1 )
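One common way to realize such random shifts in practice (equivalent to shifting within
the periodic replication of Figure 5.8) is to wrap shifted coordinates around the rectangle
as on a torus. A minimal MATLAB sketch, assuming R = [0,a] × [0,b] and the
hypothetical location matrices S1 and S2 above:

   u = [a*rand, b*rand];          % random shift vector
   S1m = mod(S1 + u, [a b]);      % pattern 1 shifted with periodic (torus) wrapping
   % The sample cross K-function is then recomputed from S1m and S2 as in (5.3.4),
   % and the counts M0+ and M0- are accumulated over m = 1,..,N such shifts.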
Applying this procedure to the Oak–Maple example with the program k12_shift_plot.m
yields a vector of attraction p-values (5.5.3) at each radial distance in D,10 based on 999
simulated random shifts of the maples relative to the oaks. Recall that in this example, an
inspection of Figure 5.1 suggested that there are "island clusters" of oaks in a "sea" of
maples. Hence, in terms of attraction versus repulsion, this suggests that there is some
degree of repulsion between oaks and maples. Thus one must be careful when
interpreting the p-value output, PVal, of this program.

10 In MATLAB this yields a list D of values from 10 to 270 in increments of 20. (See also p.5-23 below.)

Recall that as with clustering versus dispersion, unless there are many simulated cross K-
function values exactly equal to K̂₁₂⁰(h_k), we will have P̂_repulsion(h_k) ≈ 1 − P̂_attraction(h_k).
Hence one can identify significant repulsion by plotting P̂_attraction(h_k) for k = 1,..,K and
looking for large p-values. This plot is given as screen output for k12_shift_plot.m, and
is illustrated in Figure 5.10 below for a simulation with N = 999:
[Figure 5.10. Attraction P-values for Oaks and Maples: P-Value (0 to 1, "attraction" at bottom, "repulsion" at top) plotted against Radius (0 to 300)]
Here the red dashed line on the bottom corresponds to an attraction p-value of .05, so that
values below this level denote significant attraction at the .05 level. Similarly, the red
dashed line at the top corresponds to an attraction p-value of .95, so that values above this
line denote significant repulsion at the .05 level. Hence there appears to be significant
repulsion between oaks and maples at scales 30 ≤ h ≤ 150. This is seen to be in
reasonable agreement with a visual inspection of Figure 5.1 above.
This island example also raises another important limitation of the random-shift approach
when comparing point patterns. Recall that this approach treats the given region, R, as a
sample "window" from a much larger realization of point patterns, so that the hypothesis
of stationarity is at least meaningful in principle. But the shoreline of an island is a
physical barrier between very different ecological systems. So if the point patterns were
trees (as in the oak-maple example), then the shoreline is not simply an "edge effect".
Indeed, the very concept of stationarity is at best artificial in such applications.
11 The following development is based on the treatment in Cox and Isham (1980). For a nice overview
discussion, see Diggle (2003, pp. 82-83), and for a deeper analysis of marked spatial point processes, see
Cressie (1993, section 8.7).
More generally, by conditioning on the observed set of locations, one can compare a wide
variety of point populations without the need to identify alternative locations at all. Not
only does this circumvent all problems related to the shape of the region, R, but it also avoids
the need to identify specific land-use constraints (such as street networks or zoning
restrictions) that may influence the locations of relevant point events (like housing sales
or traffic accidents).
In particular, the joint probability of locations and labels can now be written as

(5.6.1)  $\Pr[(s_1,..,s_n),(m_1,..,m_n)] = \Pr[(m_1,..,m_n) \mid (s_1,..,s_n)] \cdot \Pr(s_1,..,s_n)$

where $\Pr(s_1,..,s_n)$ denotes the marginal distribution of event locations, and where
Pr[(m1 ,.., mn ) | ( s1 ,.., sn )] denotes the conditional distribution of event labels given their
locations.12 If Pr(m1 ,.., mn ) denotes the corresponding marginal distribution of event
labels, then the relevant hypothesis of spatial independence for our present purposes
asserts simply that event labels are not influenced by their locations, i.e., that

(5.6.2)  $\Pr[(m_1,..,m_n) \mid (s_1,..,s_n)] = \Pr(m_1,..,m_n)$

for all locations $s_1,..,s_n \in R$ and labels $m_1,..,m_n \in \{1,2\}$. In the Forest example above, for
instance, the hypothesis that there is no spatial relationship between oaks and maples is
here taken to mean that the given set of tree locations, $(s_1,..,s_n)$, tells us nothing about
whether these locations are occupied by oaks or maples. Hence the only locational
assumption implicit in this hypothesis is that any observed tree location could be
occupied by either an oak or a maple. Note also that this doesn’t mean that oaks and
maples are equally likely events. Indeed if there are many more maples than oaks, then
all of this information is captured in the distribution of labels, Pr(m1 ,.., mn ) .
As with the random shift approach (where the marginal distributions of each population
were required to be stationary), we do require one additional assumption about the
marginal distribution of labels, Pr(m1 ,.., mn ) . Note in particular that the indexing of
events, 1, 2,.., n , only serves to distinguish them, and that their particular ordering has no
12
For simplicity we take the number of events, n, to be fixed. Alternatively, the distributions in (5.6.1) can
all be viewed as being conditioned on n.
relevance whatsoever.13 Hence the likelihood of labeling events, $(m_1,..,m_n)$, should not
depend on which event is called "1", and so on. This exchangeability condition can be
formalized by requiring that for all permutations, $(\pi_1,..,\pi_n)$, of the subscripts $(1,..,n)$,14

(5.6.3)  $\Pr(m_{\pi_1},..,m_{\pi_n}) = \Pr(m_1,..,m_n)$
These two conditions together imply that the point processes generating populations 1
and 2 are essentially indistinguishable. Hence we now designate the combination of
conditions, (5.6.2) and (5.6.3) as the spatial indistinguishability hypothesis for
populations 1 and 2. This hypothesis will form the basis for many of the tests to follow.
To do so, we begin by observing that in the same way that stationarity of marginal
distributions was inherited by conditional distributions in (5.5.1) above, it now follows
that exchangeability of labeling events in (5.6.3) is inherited by the corresponding
conditional events in (5.6.2). To see this, observe simply that for any given set of
locations, $(s_1,..,s_n)$, and subscript permutation, $(\pi_1,..,\pi_n)$, it follows at once from (5.6.2)
and (5.6.3) that

(5.6.4)  $\Pr[(m_{\pi_1},..,m_{\pi_n}) \mid (s_1,..,s_n)] = \Pr(m_{\pi_1},..,m_{\pi_n}) = \Pr(m_1,..,m_n) = \Pr[(m_1,..,m_n) \mid (s_1,..,s_n)]$
To complete the desired task, it is enough to observe that for any two labelings,
$(m_1,..,m_n)$ and $(m_1',..,m_n')$, consistent with $n_1$ and $n_2$, we must have

(5.6.5)  $(m_1',..,m_n') = (m_{\pi_1},..,m_{\pi_n})$

for some permutation, $(\pi_1,..,\pi_n)$. Hence if the conditional distribution of such labels
given both $(s_1,..,s_n)$ and $(n_1,n_2)$ is denoted by $\Pr[\,\cdot \mid (s_1,..,s_n), n_1, n_2]$, then it follows that:

(5.6.6)  $\Pr[(m_{\pi_1},..,m_{\pi_n}) \mid (s_1,..,s_n), n_1, n_2] = \Pr[(m_1,..,m_n) \mid (s_1,..,s_n), n_1, n_2]$
13
However, if one were to model the emergence of new events (such as new disease victims or new
housing sales), then this ordering would indeed play a significant role.
14
For example, possible permutations of $(1,2,3)$ include $(\pi_1,\pi_2,\pi_3) = (2,1,3)$ and $(\pi_1,\pi_2,\pi_3) = (3,2,1)$.
Moreover, since these conditional labeling events are mutually exclusive and collectively
exhaustive, it also follows that this set of permutations must yield a well-defined
conditional probability distribution, i.e., that:

(5.6.7)  $\sum_{(\pi_1,..,\pi_n)} \Pr[(m_{\pi_1},..,m_{\pi_n}) \mid (s_1,..,s_n), n_1, n_2] = 1$

Hence, since these probabilities are identical by (5.6.6), each of the $n!$ permutations must satisfy

(5.6.8)  $\Pr[(m_{\pi_1},..,m_{\pi_n}) \mid (s_1,..,s_n), n_1, n_2] = \dfrac{1}{n!}$
This provides us with the desired sampling distribution for testing this hypothesis.15 In
particular, the following procedure yields a random-labeling test of (5.6.2) that closely
parallels the random-shift test above:
(i) Given observed locations, $(s_1,..,s_n)$, and labels, $(m_1,..,m_n)$, with corresponding
population sizes, $n_1$ and $n_2$, simulate $N$ random permutations, $[\pi_1(\ell),..,\pi_n(\ell)]$,
$\ell = 1,..,N$, of $(1,..,n)$,16 and form the permuted labels, $(m_{\pi_1(\ell)},..,m_{\pi_n(\ell)})$, $\ell = 1,..,N$
[which is equivalent to taking a sample of size $N$ from the distribution in (5.6.8)].
(ii) If $S_1^\ell = (s_{1i}^\ell : i = 1,..,n_1)$ and $S_2^\ell = (s_{2j}^\ell : j = 1,..,n_2)$ denote the patterns for popu-
lations 1 and 2 obtained from the joint realization, $[(s_1,..,s_n),(m_{\pi_1(\ell)},..,m_{\pi_n(\ell)})]$, and if
$\hat{K}_{12}^\ell(h)$ denotes the sample cross K-function resulting from $(S_1^\ell, S_2^\ell)$, then choose a
relevant set of distance radii, $D = \{h_w : w = 1,..,W\}$, and calculate the sample cross K-
function values, $\{\hat{K}_{12}^\ell(h_w) : w = 1,..,W\}$, for each $\ell = 1,..,N$.
(iii) Finally, if the observed sample cross K-function, $\hat{K}_{12}^0(h)$, is constructed from the
observed patterns, $S_1^0$ and $S_2^0$, then under the spatial indistinguishability hypothesis
15
It should be noted that since $m_i \in \{1,2\}$ for each $i = 1,..,n$, many permutations, $(m_{\pi_1},..,m_{\pi_n})$, will in fact
be identical. Hence the probability of each distinct realization is $n_1!\,n_2!/n!$. But since it is easier to sample
random permutations (as discussed in the next footnote), we choose to treat each permutation as a realization.
16
This is in fact a standard procedure in most software. In MATLAB, a random permutation of the
integers (1,.., n) is obtained with the command randperm(n).
each observed value, $\hat{K}_{12}^0(h_w)$, should be a "typical" sample from the list of values
$[\hat{K}_{12}^\ell(h_w) : \ell = 0,1,..,N]$. Hence if we now let $M_0^+$ denote the number of simulated
random relabelings, $\ell = 1,..,N$, with $\hat{K}_{12}^\ell(h_w) \ge \hat{K}_{12}^0(h_w)$, then the estimated
probability of obtaining a value as large as $\hat{K}_{12}^0(h_w)$ under this hypothesis is again
given by the attraction p-value in (5.5.3) above.
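As a concrete sketch of steps (i) through (iii), the following MATLAB function (an illustration with assumed names, not one of the class programs) computes the attraction p-values. Since $n_1$, $n_2$, and the region are fixed under relabeling, the scale factors in the sample cross K-function cancel, so raw counts of (1,2) pairs can be compared directly:

function P = k12_label_test(loc, m, D, N)
% Random-labeling test for attraction: loc = n x 2 locations,
% m = n x 1 labels in {1,2}, D = vector of radii, N = number of relabelings
d = sqrt((loc(:,1)-loc(:,1)').^2 + (loc(:,2)-loc(:,2)').^2);   % interpoint distances
K0 = pair_counts(d, m, D);                  % observed cross pair counts
M = zeros(numel(D),1);
for s = 1:N
    Ms = m(randperm(numel(m)));             % one random relabeling, as in (5.6.8)
    M = M + (pair_counts(d, Ms, D) >= K0);  % count values as large as observed
end
P = (M + 1)/(N + 1);                        % attraction p-values, as in (5.5.3)

function c = pair_counts(d, m, D)
% number of (1,2) pairs within distance h, for each h in D
d12 = d(m==1, m==2);
c = arrayfun(@(h) sum(d12(:) <= h), D(:));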
Before applying this test it is of interest to ask why simulation is required at all. Since the
distribution in (5.6.8) is constant, why not simply calculate the values,
$\Pr[\hat{K}_{12}(h_w) \ge \hat{K}_{12}^0(h_w)]$, for each $w = 1,..,W$? The difficulty here is that since there is no
simple analytical expression for these probabilities, one must essentially enumerate the
sample space of relabelings and check these inequalities case by case. But even for
patterns as small as $n_1 = 10 = n_2$, the number of distinct relabelings to be checked is seen
to be $20!/(10!\,10!) = 184{,}756$. So even for small patterns, there are sufficiently many
distinct relabelings to make Monte Carlo simulation the most efficient procedure for
testing purposes.
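This count is easily verified in MATLAB:

>> nchoosek(20,10)   % = 184756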
Finally it is important to stress that while this random-labeling approach is clearly more
flexible than the random-shift approach above, this flexibility is not achieved without
some costs. In particular, the most appealing feature of the random shift test was its
ability to preserve many key properties of the marginal distributions for populations 1
and 2. In the present approach, where the joint distribution is recast in terms of a location
and labeling process, all properties of these marginal distributions are lost. So (as
observed by Diggle, 2003, p.83) the present marked-point-process approach is most
applicable in cases where there is a natural separation between location and labeling of
population types. In the context of the Forest example above, a simple illustration would
be the analysis of a disease affecting say maples. Here the two populations might be
“healthy” and “diseased” maples. So in this case there is a single location process
involving all maple trees, followed by a labeling process which represents the spread of
disease among these trees.17
17
An example of precisely this type involving “Myrtle Wilt”, a disease specific to myrtle trees, is part of
Assignment 2 in this course.
[Figure 5.11. Random Relabeling P-Values: attraction p-values plotted against Radius (0 to 350); values near 0 indicate attraction, values near 1 indicate repulsion]
Here we see that the results are qualitatively similar to the random-shift test for short
distances, but that repulsion is dramatically more extreme for long distances. Indeed
significant repulsion now persists up to the largest possible relevant scale of 330 feet (=
Dmax/2). Part of the reason for this can be seen in Figure 5.12 below, where a partial
tiling of the maple pattern in Figure 5.1 is shown.
[Figure 5.12. A partial tiling of the maple pattern in Figure 5.1]
Even this small portion of the tiling reveals an additional hidden problem with the
random-shift approach. For while this replication process statistically preserves the
means of sample cross K-functions, the variance of these functions tends to increase. The
reason for this is that tiling by its very nature tends to create new structure near the
boundaries of the rectangular region, R.18 In the present case, the red ellipses in Figure 5.12
represent larger areas devoid of maples than those in R itself (created mainly by the
combination of empty areas in the lower left and upper right corners of R). Similarly, the
blue ellipses represent new clusters of maples larger than those in R. The result of this
new structure in the present case is to make the tiled pattern of maples appear
somewhat more clustered at larger scales. This in turn yields higher levels of repulsion
between the oaks, $S_1^0$, and the shifted maple patterns at these larger scales for most
simulated shifts. The result is to make the observed level of repulsion between $S_1^0$ and $S_2^0$
appear relatively less significant at these larger scales, as reflected in the plot of Figure 5.10.19
The two procedures above allowed us to test whether there was significant “attraction” or
“repulsion” between two patterns. This focuses on their joint distribution. Alternatively,
we might simply compare their marginal distributions by asking: How similar are the
spatial point patterns S1 and S2 ? For instance, in the Forest example of Figure 5.1 we
started off with the observation that the oaks appear to be much more clustered than the
maples. Hence rather than characterizing this relative clustering as repulsion between the
two populations, we might simply ask whether the pattern of oaks, S1 , is more clustered
than the pattern of maples, S2 .
But while the original (univariate) sample K-functions, $\hat{K}_1(h)$ and $\hat{K}_2(h)$, provide
natural measures of individual population clustering, it is not clear how to compare these
two values statistically. Note that since the population values, $K_1(h)$ and $K_2(h)$, are
simply mean values (for any given $h$), one might be tempted to conduct a standard
difference-between-means test. But this could be very misleading, since such tests
assume that the two underlying populations (in this case $S_1$ and $S_2$) are independently
distributed. As we have seen above, this is generally false. Hence the key task here is to
characterize "complete similarity" in a way that will allow deviations from this
hypothesis to be tested statistically.
Here the basic strategy is to interpret “complete similarity” to mean that both point
patterns are generated by the same spatial point process. Hence if the sizes of S1 and S2
are given respectively by n1 and n2 , then our null hypothesis is simply that the
18
For additional discussion of this point see Diggle (2003, p.6).
19
Lotwick and Silverman noted this same phenomenon in their original paper (1982, p.410), where they
concluded that such added structure will tend to “show less discrepancy from independence” and thus yield
a relatively conservative testing procedure.
combination of these two patterns, $S = [(s_{1i} : i = 1,..,n_1),\,(s_{2j} : j = 1,..,n_2)]$, is in fact a single
population realization of size $n = n_1 + n_2$, i.e., $S = (s_1,..,s_{n_1}, s_{n_1+1},..,s_n)$. If this were true,
then it would not matter which subset of n1 samples was labeled as “population 1”. It
should be clear from the above discussion that a natural way to formulate this hypothesis
is to treat the combined process as a marked point process.20 In this framework, the
relevant null hypothesis is simply that given observed locations, ( s1 ,.., sn ) and labels
(m1 ,.., mn ) with n1 occurrences of “1” and n2 occurrences of “2”, each permutation of
these labels is equally likely. But this is precisely the assertion in expression (5.6.8)
above. Hence, in the context of marked point processes, this hypothesis on the joint
distribution of labels, $(m_1,..,m_n)$, given locations, $(s_1,..,s_n)$, and population sizes, $n_1$ and $n_2$,
is here seen to be precisely the spatial indistinguishability hypothesis.
However, the present focus is on the marginal distributions of populations 1 and 2 rather
than the dependency properties of their joint distribution. Hence the natural test statistics
are the sample K-functions, $\hat{K}_1(h)$ and $\hat{K}_2(h)$, for each marginal distribution rather than
the sample cross K-function. Note moreover that if both samples are indeed coming from
the same population, then $\hat{K}_1(h)$ and $\hat{K}_2(h)$ should be estimating the same K-function,
say $K(h)$, for this common population. Hence if these sample K-functions were unbiased
estimates, then by definition the individual K-functions, $K_i(h) = E[\hat{K}_i(h)]$, $i = 1,2$, would
be the same. In this context, "complete similarity" would thus reduce to the simple null
hypothesis, $H_0 : K_1(h) = K_2(h)$. However, as noted in Section 4.3, this simplification is
only appropriate for stationary isotropic processes with Ripley corrections. Thus, in view
of the fact that hypothesis (5.6.2) is perfectly meaningful for any point process, we
choose to adopt a more flexible approach.
To do so, we first note that even in the absence of stationarity, the sample K-functions,
$\hat{K}_1(h)$ and $\hat{K}_2(h)$, continue to be reasonable measures of clustering (or dispersion)
within populations. Hence to test for relative clustering (or dispersion) it is still natural to
focus on the difference between these sample measures,21 which we now define to be

(5.7.1)  $\hat{\Delta}(h) = \hat{K}_1(h) - \hat{K}_2(h)$

Hence the relevant spatial similarity hypothesis for our present purposes is that the
observed difference obtained from (5.7.1) is not statistically distinguishable from the
random differences obtained from realizations of the conditional distribution of labels
under the spatial indistinguishability hypothesis [(5.6.2),(5.6.3)]. This hypothesis can be
tested by the following procedure:
20
Indeed this is the reason why the analysis of joint distributions above was developed before considering
the present comparison of marginal distributions.
21
Note that one could equally well consider the ratio of these measures, or equivalently, the difference of
their logs.
(i) Given observed locations, $(s_1,..,s_n)$, and labels, $(m_1,..,m_n)$, with corresponding
population sizes, $n_1$ and $n_2$, simulate $N$ random permutations, $[\pi_1(\ell),..,\pi_n(\ell)]$,
$\ell = 1,..,N$, of $(1,..,n)$, and construct the corresponding label permutations,
$(m_{\pi_1(\ell)},..,m_{\pi_n(\ell)})$, $\ell = 1,..,N$.

(ii) If $S_1^\ell = (s_{1i}^\ell : i = 1,..,n_1)$ and $S_2^\ell = (s_{2j}^\ell : j = 1,..,n_2)$ denote the population patterns
obtained from the joint realization, $[(s_1,..,s_n),(m_{\pi_1(\ell)},..,m_{\pi_n(\ell)})]$, $\ell = 1,..,N$, and if the
corresponding sample difference function is denoted by $\hat{\Delta}^\ell(h) = \hat{K}_1^\ell(h) - \hat{K}_2^\ell(h)$, then
for the given set of relevant radial distances, $D = \{h_w : w = 1,..,W\}$, calculate the
sample difference values, $\{\hat{\Delta}^\ell(h_w) : w = 1,..,W\}$, for each $\ell = 1,..,N$.
(iii) Finally, if the observed sample difference function, $\hat{\Delta}^0(h) = \hat{K}_1^0(h) - \hat{K}_2^0(h)$, is
constructed from the observed patterns, $S_1^0$ and $S_2^0$, then under the spatial similarity
hypothesis, each observed value, $\hat{\Delta}^0(h_w)$, should be a "typical" sample from the list
of values $[\hat{\Delta}^\ell(h_w) : \ell = 0,1,..,N]$. Hence if we now let $m_0^+$ denote the number of
simulated random relabelings, $\ell = 1,..,N$, with $\hat{\Delta}^\ell(h_w) \ge \hat{\Delta}^0(h_w)$, then the probability
of obtaining a value as large as $\hat{\Delta}^0(h_w)$ under this hypothesis is estimated by the
following relative clustering p-value for population 1 versus population 2:

(5.7.2)  $\hat{P}^{12}_{r\text{-}clustered}(h_w) = \dfrac{m_0^+ + 1}{N + 1}$

(iv) Similarly, if $m_0^-$ denotes the number of simulated random relabelings, $\ell = 1,..,N$,
with $\hat{\Delta}^\ell(h_w) \le \hat{\Delta}^0(h_w)$, then the probability of obtaining a value as small as $\hat{\Delta}^0(h_w)$
under this hypothesis is estimated by the following relative dispersion p-value for
population 1 versus population 2:

(5.7.3)  $\hat{P}^{12}_{r\text{-}dispersed}(h_w) = \dfrac{m_0^- + 1}{N + 1}$
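A corresponding MATLAB sketch of this procedure (again with assumed names, not the class program k2_diff_plot.m) differs from the relabeling sketch above only in its statistic. Here the scale factors of the two sample K-functions do not cancel, so the uncorrected estimator, $\hat{K}_i(h) = [a(R)/n_i^2]\sum_i\sum_{j \ne i} I(d_{ij} \le h)$, is computed explicitly, with the area $a(R)$ supplied as an input:

function P = k2_diff_test(loc, m, D, N, aR)
% Relative clustering p-values (5.7.2): loc = n x 2 locations, m = labels in {1,2},
% D = radii, N = number of relabelings, aR = area a(R) of region R
d = sqrt((loc(:,1)-loc(:,1)').^2 + (loc(:,2)-loc(:,2)').^2);
d0 = delta(d, m, D, aR);                    % observed difference function (5.7.1)
cnt = zeros(numel(D),1);
for s = 1:N
    cnt = cnt + (delta(d, m(randperm(numel(m))), D, aR) >= d0);
end
P = (cnt + 1)/(N + 1);                      % r-clustered p-values (5.7.2)

function dl = delta(d, m, D, aR)
dl = khat(d(m==1,m==1), D, aR) - khat(d(m==2,m==2), D, aR);

function K = khat(dii, D, aR)
% sample K-function without edge correction: (a(R)/n^2) * within-population pair counts
n = size(dii,1);
dii(1:n+1:end) = inf;                       % exclude self-pairs on the diagonal
K = arrayfun(@(h) sum(dii(:) <= h), D(:)) * aR / n^2;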
>> k2_diff_plot(loc,n1,sims,D,1);
[Figure 5.13. Relative Clustering P-Values for oaks versus maples, plotted against Radius (0 to 350); values near 0 indicate relative clustering (r-clustered), values near 1 relative dispersion (r-dispersed)]
This confirms the informal observation above that oaks are indeed more clustered than
maples, for scales consistent with a visual inspection of Figure 5.1.
While the simple Forest example above was convenient for developing a wide range of
techniques for analyzing bivariate point populations, the comparison of Larynx and Lung
cancer cases in Lancashire discussed in Section 1 is a much richer example. Hence we
now explore this example in some detail. First we analyze the overall relation between
these two patterns, using a variation of the spatial similarity analysis above. Next we
restrict this analysis to the area most relevant for the Incinerator in Figure 1.9. Finally, we
attempt to isolate the cluster near this Incinerator by a new method of local K-function
analysis that provides a set of exact local clustering p-values.
Given the Larynx cancer population of $n_1 = 57$ cases, and the Lung cancer population of
$n_2 = 917$ cases, we could in principle use k2_diff_plot to compare these populations. But
the great difference in size between these populations makes this somewhat impractical.
Moreover, it is clear that the Larynx cancer population in Figure 1.7 above is of primary
interest in the present example, and that Lung cancers serve mainly as an appropriate
reference population for testing purposes. Hence we now develop an alternative testing
procedure that is designed precisely for this type of analysis.
To do so, we again start with the hypothesis that Larynx and Lung cancer cases are
samples from the same statistical population. But rather than directly compare the small
Larynx population with the much larger Lung population, we simply observe that if the
Larynx cases could equally well be any subsample of size $n_1$ from the larger joint
population, $n = n_1 + n_2$, then the observed sample K-function, $\hat{K}_1(h)$, should be typical of
the sample K-functions obtained in this way. Hence, in the context of marked point
processes, the present subsample similarity hypothesis asserts that for any given
realization, $[(s_1,..,s_n),(m_1,..,m_n)]$, the value of $\hat{K}_1(h)$ obtained from the $n_1$ locations with
$m_i = 1$ is statistically indistinguishable from the same sample K-function obtained by
randomly permuting these labels. This hypothesis can be tested by a procedure
paralleling those above:

(i) Given observed locations, $(s_1,..,s_n)$, and labels, $(m_1,..,m_n)$, with population sizes,
$n_1$ and $n_2$, simulate $N$ random permutations, $[\pi_1(\ell),..,\pi_n(\ell)]$, $\ell = 1,..,N$, of $(1,..,n)$,
and form the permuted labels, $(m_{\pi_1(\ell)},..,m_{\pi_n(\ell)})$, $\ell = 1,..,N$.
(ii) If $S_1^\ell = (s_{1i}^\ell : i = 1,..,n_1)$ denotes the population pattern obtained from the joint
realization, $[(s_1,..,s_n),(m_{\pi_1(\ell)},..,m_{\pi_n(\ell)})]$, and if the corresponding sample K-function is
$\hat{K}_1^\ell(h)$, then for the given set of relevant radial distances, $D = \{h_w : w = 1,..,W\}$,
calculate the sample K-function values, $\{\hat{K}_1^\ell(h_w) : w = 1,..,W\}$, for each $\ell = 1,..,N$.
(iii) Finally, if the observed sample K-function, $\hat{K}_1^0(h)$, is constructed from the
observed patterns, $S_1^0$ and $S_2^0$, then under the subsample similarity hypothesis, each
observed value, $\hat{K}_1^0(h_w)$, should be a "typical" sample from the list of values
$[\hat{K}_1^\ell(h_w) : \ell = 0,1,..,N]$. Hence if we now let $m_0^+$ denote the number of simulated
random relabelings, $\ell = 1,..,N$, with $\hat{K}_1^\ell(h_w) \ge \hat{K}_1^0(h_w)$, then the probability of
obtaining a value as large as $\hat{K}_1^0(h_w)$ under this hypothesis is estimated by the
following clustering p-value for population 1:
(5.8.1)  $\hat{P}^1_{clustered}(h_w) = \dfrac{m_0^+ + 1}{N + 1}$
(iv) In a similar manner, if $m_0^-$ denotes the number of simulated random relabelings,
$\ell = 1,..,N$, with $\hat{K}_1^\ell(h_w) \le \hat{K}_1^0(h_w)$, then the probability of obtaining a value as small
as $\hat{K}_1^0(h_w)$ under this hypothesis is estimated by the following dispersion p-value for
population 1:

(5.8.2)  $\hat{P}^1_{dispersed}(h_w) = \dfrac{m_0^- + 1}{N + 1}$
Hence under this testing procedure, significant clustering (dispersion) for population 1
means that the observed pattern of size $n_1$ is more clustered (dispersed) than would be
expected if it were a typical subsample from the larger pattern of size $n$. Note that while
this test is in principle possible for subpopulations of any size less than $n$, it only makes
statistical sense when $n_1$ is sufficiently small relative to $n$ to allow a meaningful sample
of alternative subpopulations. Moreover, when $n_1$ is much smaller than $n$, the present
Monte Carlo test is considerably more efficient in terms of computing time than the full
spatial similarity test above.
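For example, with the combined location matrix, loc (here 974 × 2), a random subsample of size 57 can be constructed with three commands of the following form (a sketch consistent with the description below):

>> list = randperm(974);
>> sublist = list(1:57);
>> sub_loc = loc(sublist,:);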
The first command produces a random permutation, list, of the indices (1,...,974) and the
second command selects the first 57 values of list and calls them sublist. Finally, the last
command creates a matrix, sub_loc, of the corresponding locations in loc. While this
procedure is a basic component of the program, k2_global_plot.m, it is useful to perform
these commands manually in order to see an explicit example. This coordinate data can
then be imported to ARCMAP and compared visually with the given Larynx pattern as
shown in Figures 5.14 and 5.15 below:22
[Figures 5.14 and 5.15. The observed Larynx pattern and a random subsample of size 57 from the combined population (map scale: 0 to 10 km)]
This visual comparison suggests that there may not be much difference between the
overall pattern of observed Larynx cancers and typical subsamples of the same size from
the combined population of Larynx and Lung cancers.
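The set of radial distances, D, for this test can be constructed with commands of the following form (a sketch matching the description below; dist_vec is the class program named in the text, and the intermediate variable names are assumptions):

>> dvec = dist_vec(loc);
>> Dmax = max(dvec);
>> d = Dmax/2;
>> D = (d/20 : d/20 : d)';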
The first command uses the program, dist_vec, to calculate the vector of $n(n-1)/2$
distinct pairwise distances among the $n$ locations. The second command identifies the
maximum, Dmax, of all these distances, and the third command uses the "Dmax/2" rule
of thumb in expression (4.5.1) above to calculate the maximum radial distance, d, for the
test. Finally, some experimentation with the test results suggests that the p-value plot
should include 20 equally spaced distance values up to Dmax/2. This is obtained by
the last command, which constructs a list of numbers starting at the value, d/20, and
proceeding in increments of size d/20 until the number d is reached.
22
Note also that these subpopulations can be constructed directly in MATLAB. The relevant boundary file
is stored in the matrix, larynx_bd, so that subpopulation, sub_loc, can be displayed with the command,
poly_plot(larynx_bd,sub_loc). See Section 9 of the Appendix to Part I for further details.
Given this set of distances, D, a statistical test of the subsample similarity hypothesis for
this example can be carried out with the command:
>> k2_global_plot(loc,n1,999,D,1);
[Figure 5.16. Global P-Values for Larynx cases: clustering p-values plotted against Radius (0 to 10,000); values near 0 indicate clustering, values near 1 dispersion]
Here we can see that, except at small distances, there is no significant difference between
the observed pattern of Larynx cases and random samples of the same size from the
combined population. Moreover, since the default p-values calculated in this program are
the clustering p-values in (5.8.1), the portion of the plot above .95 shows that Larynx
cases are actually significantly more dispersed at small distances than would be expected
from random subsamples. An examination of Figures 1.7 and 1.8 suggests that, unlike
Lung cancer cases which (as we have seen in Section 4.7.3) are distributed in a manner
roughly proportional to population, there appear to be somewhat more Larynx cases in
less populated outlying areas than would be expected for Lung cancers. This is
particularly true in the southern area, which contains the Incinerator. Hence we now
focus on this area more closely.
To focus in on the area closer to the Incinerator itself, we start with the observation that
heavier exhaust particles are more likely to affect the larynx (which is high in the throat).
Hence while little is actually known about either the exact composition of exhaust fumes
from this Incinerator or the exact coverage of the exhaust plume, it seems reasonable to
suppose that heavier exhaust particles are mostly concentrated within a few kilometers of
the source. Hence for purposes of the present analysis, a maximum range of 4000 meters
($\approx$ 2.5 miles) was chosen.23 This region is shown in Figure 5.17 below as a circle of
radius 4000 meters about the Incinerator (which is again denoted by a red cross as in
Figure 1.9):
[Figure 5.17. Larynx and Lung cancer cases within a circle of radius 4000 meters about the Incinerator (red cross)]
If the coordinate position of the Incinerator is denoted by Incin,24 then one can identify
those cases that are within 4000 meters of Incin by means of the customized MATLAB
program, Radius_4000.m. Open the workspace, larynx.mat, and use the command:
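A call of the following form serves here (the argument order is an assumption, not verified against the class program):

>> OUT = Radius_4000(Incin, Lung, Larynx);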
Here Lung and Larynx denote the locations of the Lung and Larynx cases, respectively.
The output structure, OUT, includes the locations of Lung and Larynx cases within 4000
meters of Incin, along with their respective distances from Incin. Here it can be seen by
inspection that the number of Larynx cases is n1 = 7. The total number of cases in this
area is n = 75. The appropriate inputs for k2_global_plot above can be obtained from
OUT as follows:
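For instance, commands of roughly the following form would serve (the field names of OUT here are assumptions based on the description above):

>> loc_4000 = [OUT.Larynx; OUT.Lung];   % assumed field names of OUT
>> n1_4000 = size(OUT.Larynx,1);        % = 7 Larynx cases
>> d = max(dist_vec(loc_4000))/2;       % Dmax/2 rule, as before
>> D_4000 = (d/20 : d/20 : d)';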
23
This is in rough agreement with the distance influence function, $f(d)$, estimated by Diggle, Gatrell and
Lovett (1990, Figure 7), which is essentially flat for $d \ge 4$ kilometers.
24
This position is given in the ARCMAP layer, incin_loc.shp, as Incin = (354850,413550) in meters.
>> k2_global_plot(loc_4000,n1_4000,999,D_4000,1);
[Figure 5.18. P-Values for Larynx cases within 4000 meters of the Incinerator: clustering p-values plotted against Radius (0 to 4000)]
This plot is seen to be quite different from the global plot of Figure 5.16 above. In
particular, there is now some weakly significant clustering at scales below 500 meters.
This suggests that while the global pattern of Larynx cases exhibits no significant
clustering relative to the combined population of Larynx and Lung cases, the picture is
quite different when cases are restricted to the vicinity of the Incinerator. In particular,
the strong cluster of three Larynx cases nearest to the Incinerator in Figure 5.17 would
appear to be a contributing factor here.
This leads to the third and final phase of our analysis of this problem. Here we consider a
local analysis of clustering which is a variation of the local K-function analysis in Section
4.8 above. We again adopt the spatial indistinguishability hypothesis that Larynx and
Lung cases are coming from the same point process, but now focus on each individual
Larynx case by considering the conditional distribution of all other labels given this
Larynx case.
[A representative Larynx case, $s_{1i}$, together with all other cases within distance $h$ of it]

In particular, if a random sample of $m$ items is drawn from a population of size $M$
containing $K$ items of a given type, then the probability of obtaining exactly $k$ items of
this type is given by the hypergeometric probability:

(5.8.3)  $p(k \mid m, K, M) = \dfrac{\binom{K}{k}\binom{M-K}{m-k}}{\binom{M}{m}} = \dfrac{\dfrac{K!}{k!(K-k)!} \cdot \dfrac{(M-K)!}{(m-k)!(M-K-m+k)!}}{\dfrac{M!}{m!(M-m)!}}$

Hence if there are $c$ other cases within distance $h$ of a given Larynx case, $s_{1i}$, of which
$c_1$ are Larynx cases, then under the spatial indistinguishability hypothesis, the probability
of finding at least $c_1$ Larynx cases among these $c$ cases is given by the cumulative
probability:

(5.8.4)  $P(c_1 \mid c, n_1, n) = \sum_{k = c_1}^{\min(c,\, n_1)} p(k \mid c, n_1, n)$
It is this cumulative probability, $P(c_1 \mid c, n_1, n)$, that yields the desired event probability.
In the specific case above, where $c_1 = 2$, $c = 6$, $n_1 = 57$, and $n = 974$, this probability
can be computed directly, as in the sketch below.
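For example, the following base-MATLAB commands evaluate the upper tail in (5.8.4) for these values (all names here are illustrative):

>> c1 = 2; c = 6; n1 = 57; n = 974;
>> k = c1:min(c,n1);
>> p = arrayfun(@(kk) nchoosek(n1,kk)*nchoosek(n-n1,c-kk)/nchoosek(n,c), k);
>> P = sum(p)   % upper-tail probability P(c1 | c, n1, n), roughly .04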
Hence if the subsample similarity hypothesis were true, then it would be quite surprising
to find at least two Larynx cases within this subpopulation of six cases. In other words, for
the given pattern of Larynx and Lung cases, there appears to be significant clustering of
Larynx cases near $s_{1i}$ at the $h = 400$ meter scale.
Thus to construct a general testing procedure for local clustering (or dispersion) of
Larynx cases, it suffices to calculate the event probabilities in (5.8.4) for every observed
Larynx location, $s_{1i}$, at every relevant radial distance, $h$. This procedure is implemented
in the MATLAB program, k2_local_exact.m.25 In the present case, if we consider only
the single radial distance, D = 400, and again use the location matrix, loc, then the set of
clustering p-values at each of the $n_1 = 57$ Larynx locations is obtained with the
command:
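A call of the following form serves here (the input order is an assumption; only the outputs, [P, C, C1], are described in the text):

>> D = 400;
>> [P,C,C1] = k2_local_exact(loc,n1,D);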
Here P is the vector of p-values at each location, and C and C1 are the corresponding
vectors of total counts, c , and population 1 counts, c1 , at each location.
To gain further perspective on the significance of the cluster in Figure 5.19 above, one
can compare distances of cases to the Incinerator with the corresponding p-values as
follows:
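The six commands described below take roughly the following form (a sketch using the variable names given in the text):

>> L = [Incin; Larynx];
>> dvec = dist_vec(L);
>> dist = dvec(1:57);
>> COMP = [P, dist];
>> COMP = sortrows(COMP,1);
>> COMP(1:7,:)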
The first command stacks the Incinerator location on top of the Larynx locations in a
matrix, L. The second and third commands then identify the relevant distances (i.e., from
Incin to all locations in Larynx ) as the first 57 distances, dist, produced by dist_vec(L).
The fourth and fifth commands combine P with dist in the matrix, COMP, and then sort
rows of COMP by P from low to high. Finally the last command displays the first seven
rows of this sorted version of COMP, as shown in the box on the right.
25
In the MATLAB directory for the class, there is also a Monte Carlo version of this program, k2_local.m.
By running these two programs for the same data set (say with 999 simulations) you can see that exact
calculations tend to be orders of magnitude faster than simulations – when they are possible.
The first three rows (in red) are the three closest Larynx cases to the Incinerator, as can
be verified in ARCMAP (and can also be seen in Figure 5.17 above).26 Moreover, the
ordering of p-values shows that these are the only three locations that exhibit significant
clustering. Hence this result suggests that there may indeed be some relation between the
Incinerator and nearby Larynx cases.
26
Note that the case just below these three is almost as close to the Incinerator as one of these three. But
this case has only a single Lung case within 400 meters, and hence exhibits no clustering at this scale.
Point events (such as crimes or disease cases) occur in time as well as space. If both time
and location data are available for these events, then one can in principle model this data
as the realization of a space-time point process. As a prime example, recall that the
Burkitt’s Lymphoma data (examined in Assignment 1 of this class) contains both onset
times and locations for 188 cases during the period 1961-1975. Moreover, the original
study of this data by Williams et al. (1978)1, (here referred to as [W]) focused precisely
on the question of identifying significant space-time clustering of these cases. Hence it is
of interest to consider this data in more detail.
The cases occurring in each five-year period of the study are displayed in Figure 6.1
below (with green shading reflecting relative population density in West Nile), and
correspond roughly to Figure 5 in [W].2 Here it does appear that cases in subsequent
periods tend to be clustered near cases in previous periods. But the inclusion of
population density in Figure 6.1 was done specifically to show that such casual
observations can be deceptive. Much of the new clustering is seen to occur in more
densely populated areas where one would expect new cases to be more likely based on
chance alone.
[Figure 6.1. Burkitt's Lymphoma cases in each five-year period of 1961-1975, with green shading reflecting relative population density in West Nile]
1
This is Paper 1 in the Reference Materials on the class web page.
2
These cases differ slightly from those in Figure 5 of [W]. The present approximation is based on the
counting convention stated in [BG, p.81] that time is “measured in days elapsed since January 1st, 1960”.
This rule does not quite agree with the actual dates in the Appendix of [W], but the difference is very slight.
The main objective of the present section is to develop an alternative "random labeling" test
that is more appropriate. But before doing so, we shall consider the general question of
space-time clustering more closely.
Event sequences exhibit space-time clustering if events that are close in space tend to be
closer in time than would be expected by chance alone. The most classic examples of
space-time clustering are spatial diffusion processes in which point events are propagated
from locations to neighbors through some form of local interactions. Obvious examples
include the spread of forest fires (where new trees are ignited by the heat from trees
burning nearby), or the spread of contagious diseases (where individuals in direct contact
with infected individuals also become infected). Here it is worth noting that cancers such
as Burkitt’s Lymphoma are not directly contagious. However, as observed in [W,p.116],
malaria infections may be a contributing factor leading to Burkitt’s Lymphoma, and the
spread of malaria itself involves a diffusion process in which mosquitoes transmit this
disease from existing victims to new victims.
But even with genuine diffusion processes one must be careful in analyzing space-time
clustering. Consider the onset of a new flu epidemic introduced into a region, R, by a
single carrier, c, and suppose that the cases occurring during the first few days are those
shown in Figure 6.2 below.
[Figure 6.2. Early cases of a flu epidemic introduced into region R by a single carrier, c]
Here there is a clear diffusion effect in which the initial cases involve contacts with c, and
are in turn propagated to others by secondary contacts. But notice that even though the
initial three cases shown are all close to c, this process spreads out quickly. So while the
six “second round” cases shown in the figure may all occur at roughly the same time,
they are already quite dispersed in space. This example shows that cases occurring close
in time need not occur close in space. However, this figure also suggests that cases
occurring close in space may indeed have a tendency to occur close in time.3 So there
3
Here we assume that most contacts involve individuals living in close spatial proximity – which may not
be the case. For example, some individuals have significant contact with co-workers at distant job sites.
appears to be some degree of asymmetry between space and time in such processes. We
shall return to this issue below.
While the early stages of this epidemic show clear propagation effects, this is not true at
later stages. After the first few weeks, such an epidemic may well have spread throughout
the region, so that the pattern of new cases occurring on each day may be very dispersed,
as shown in Figure 6.3. More importantly, this pattern will most likely be quite similar
from day to day. At this stage, the diffusion process is said to have reached a steady state
(or stationary state). In such a steady state it is clearly much harder to detect any space-
time clustering whatsoever. Diffusion is still at work, but the event pattern is no longer
changing in detectable ways.4 However, it may still be possible to detect such space-time
effects indirectly. For example, if one were to examine the distribution of cases on day t ,
and to identify the new cases on day t 1 , then it might still be possible to test whether
these new cases are “significantly closer” to the population of cases on day t than would
be expected by chance alone. We shall not pursue such questions here. Rather the intent
of this illustration is to show that space-time clustering can be subtle in even the clearest
examples of spatial diffusion.
For each event, $e_i = (s_i, t_i)$, the relevant space-time neighborhood of spatial radius $h$ and temporal half-width $\tau$ is then the cylinder set, $\{s : \|s_i - s\| \le h\} \times \{t : |t_i - t| \le \tau\}$, shown in Figure 6.4 below.
4
A more extreme example is provided by change in temperature distribution within a room after someone
has lit a match. While the match is burning, there is very sharp peak in the temperature distribution that
spreads out from this point source of heat. After the match has gone out, this heat is not lost. Rather it
continues to diffuse throughout the room until a new steady state is reached in which the temperature is
everywhere slightly higher than before.
5
For a more thorough treatment see Diggle, P., Chetwynd, A., Haggkvist, R. and Morris, S. (1995).
[Figure 6.4. The space-time cylinder of spatial radius h and temporal half-width τ about an event $e_i = (s_i, t_i)$]
As in two dimensions, one can define a relevant space-time region as the Cartesian
product, $R \times T$, of the given spatial region, R, and a relevant time interval, T. For a given
pattern of events, $\{e_i = (s_i, t_i) : i = 1,..,n\}$, the default time interval, T, for purposes of
analysis is usually taken to be the smallest time interval containing all event times, i.e.,
$T = [t_{\min}, t_{\max}]$, where $t_{\min} = \min_i t_i$ and $t_{\max} = \max_i t_i$.

[Figure 6.5. The space-time region $R \times T$]
If for a given space-time point process we let $\lambda_{st}$ denote the space-time (st) intensity of
events, i.e., the expected number of events per unit of space-time volume, then the
desired space-time K-function is again defined for each $h \ge 0$ and $\tau \ge 0$ to be the
expected number of additional events within space-time distance $(h, \tau)$ of a randomly
selected event, $e_i$ [so that $d_{ij} = \|s_i - s_j\| \le h$ and $t_{ij} = |t_i - t_j| \le \tau$], i.e.,

(6.2.4)  $K(h,\tau) = \dfrac{1}{\lambda_{st}}\, E\left[\sum_{j \ne i} I_{(h,\tau)}(d_{ij}, t_{ij})\right]$

So as in (4.3.4), for any given pattern size, $n$, the pooled form of this function,

(6.2.5)  $K(h,\tau) = \dfrac{1}{n\,\lambda_{st}} \sum_{i=1}^n E\left[\sum_{j \ne i} I_{(h,\tau)}(d_{ij}, t_{ij})\right]$

implies that the natural estimator of $K(h,\tau)$ is given by the sample space-time K-function:

(6.2.6)  $\hat{K}(h,\tau) = \dfrac{1}{n\,\hat{\lambda}_{st}} \sum_{i=1}^n \sum_{j \ne i} I_{(h,\tau)}(d_{ij}, t_{ij})$

with

(6.2.7)  $\hat{\lambda}_{st} = \dfrac{n}{a(R)\,(t_{\max} - t_{\min})}$

where the denominator is now seen to be the volume of the space-time region, $R \times T$, in
Figure 6.5 above.
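As a concrete illustration, the estimator in (6.2.6) and (6.2.7) can be sketched in a few lines of MATLAB (the variable names loc, t, and aR are assumptions, and no edge corrections are applied):

>> n = size(loc,1);                             % loc = n x 2 locations, t = n x 1 event times
>> d = sqrt((loc(:,1)-loc(:,1)').^2 + (loc(:,2)-loc(:,2)').^2);   % spatial distances d_ij
>> tt = abs(t - t');                            % temporal distances t_ij
>> lam = n/(aR*(max(t) - min(t)));              % intensity estimate (6.2.7), aR = a(R)
>> Khat = @(h,tau) (sum(d(:) <= h & tt(:) <= tau) - n)/(n*lam);   % (6.2.6); "- n" drops the i = j pairs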
To test for the presence of space-time clustering, one requires the specification of an
appropriate null hypothesis representing the complete absence of space-time clustering.
Here the natural null hypothesis to adopt is simply that there is no relation between the
locations and timing of events. Hence in a manner completely paralleling the treatment
of marked point processes in (5.6.1), it is convenient to separate space and time and write
the joint probability of space-time events as

(6.3.1)  $\Pr[(s_1,..,s_n),(t_1,..,t_n)] = \Pr[(t_1,..,t_n) \mid (s_1,..,s_n)] \cdot \Pr(s_1,..,s_n)$
where Pr( s1 ,.., sn ) again denotes the marginal distribution of event locations, and where
Pr[(t1 ,.., tn ) | ( s1 ,.., sn )] denotes the conditional distribution of event times given their
locations.7 In this context, if the marginal distribution of event times is denoted by
$\Pr(t_1,..,t_n)$, then as a parallel to (5.6.2), the relevant hypothesis of space-time
independence for our present purposes can be stated as follows:

(6.3.2)  $\Pr[(t_1,..,t_n) \mid (s_1,..,s_n)] = \Pr(t_1,..,t_n)$
Here it should be noted (as in footnote 5 of Section 5) that from a formal viewpoint, this
independence condition could equally well be stated by switching the roles of
locations, ( s1 ,.., sn ) , and times, (t1 ,.., tn ) , in (6.3.2). But as noted in Section 6.1 above,
there is a subtle asymmetry between space and time that needs to be considered here. In
particular, recall that event sequences are said to exhibit space-time clustering if events
that are close in space tend to be closer in time than would be expected by chance alone.
Hence it is somewhat more natural to condition on the spatial locations of events and
look for time similarities among those events that are close in space.
Note also that as with marked point processes, the indexing of events, $e_i$, is completely
arbitrary. Here it might be argued that the ordering of indices $i$ should reflect the ordering
of event occurrences. But this is precisely why event times have been listed as distinct
attributes of space-time events. Hence in the present formulation, it is again most
appropriate to treat space-time pairs, $(s_i, t_i)$ and $(s_j, t_j)$, as exchangeable events. In a
manner paralleling condition (5.6.3), this implies that for all permutations, $(\pi_1,..,\pi_n)$, of
the subscripts $(1,..,n)$, the marginal distribution of event times should satisfy the
exchangeability condition:

(6.3.3)  $\Pr(t_{\pi_1},..,t_{\pi_n}) = \Pr(t_1,..,t_n)$
These two conditions together constitute our null hypothesis that spatial events are
completely indistinguishable in terms of their occurrence times. Hence we now designate
the combination of conditions, (6.3.2) and (6.3.3), as the temporal indistinguishability
hypothesis.
In this setting, we next extend the argument in Section 5.6.2 to obtain an exact sampling
distribution for testing this temporal indistinguishability hypothesis. To do so, observe
first that the argument in (5.6.4) now shows that the conditional distribution in (6.3.2)
inherits exchangeability from (6.3.3), i.e., that for all permutations, $(\pi_1,..,\pi_n)$, of $(1,..,n)$,

(6.4.1)  $\Pr[(t_{\pi_1},..,t_{\pi_n}) \mid (s_1,..,s_n)] = \Pr[(t_1,..,t_n) \mid (s_1,..,s_n)]$
7
Again for simplicity we take the number of space-time events, n, to be fixed. Alternatively, the
distributions in (6.3.1) can all be conditioned on n.
Hence the only question is how to condition these permutations to obtain a well-defined
probability distribution. Recall that the appropriate conditional information shared by all
permutations of population labels, $(m_1,..,m_n)$, was precisely the number of instances of
each label, "1" and "2", i.e., the population sizes, $n_1$ and $n_2$. Here the set of label
frequencies, $\{n_1, n_2\}$, is now replaced by the set of time frequencies, $\{n_t : t \in T\}$, where $n_t$
is the number of times that $t$ occurs in the given set of event times, $(t_1,..,t_n)$, i.e.,8

(6.4.2)  $n_t = \#\{i : t_i = t,\ i = 1,..,n\}$

It is precisely this frequency distribution which is shared by all permutations, $(t_{\pi_1},..,t_{\pi_n})$,
in (6.4.1). Indeed, it follows [as a parallel to (5.6.5)] that for every list of times, $(t_1',..,t_n')$,
consistent with this distribution, there is some permutation, $(t_{\pi_1},..,t_{\pi_n})$, of $(t_1,..,t_n)$ with:

(6.4.3)  $(t_1',..,t_n') = (t_{\pi_1},..,t_{\pi_n})$
Hence if the conditional distribution of such times given both $(s_1,..,s_n)$ and $\{n_t : t \in T\}$ is
denoted by $\Pr[\,\cdot \mid (s_1,..,s_n), \{n_t : t \in T\}]$, then the same arguments in (5.6.6) through (5.6.8)
now yield the following exact conditional distribution for all permutations $(\pi_1,..,\pi_n)$ of
these occurrence times under the temporal indistinguishability hypothesis:

(6.4.4)  $\Pr[(t_{\pi_1},..,t_{\pi_n}) \mid (s_1,..,s_n), \{n_t : t \in T\}] = \dfrac{1}{n!}$
This distribution in turn yields the following random-relabeling test, paralleling those of
Section 5:

(i) Given observed locations, $(s_1,..,s_n)$, and occurrence times, $(t_1,..,t_n)$, simulate $N$
random permutations, $[\pi_1(\ell),..,\pi_n(\ell)]$, $\ell = 1,..,N$, of $(1,..,n)$, and form the permuted
times, $(t_{\pi_1(\ell)},..,t_{\pi_n(\ell)})$, $\ell = 1,..,N$ [which is now equivalent to taking a sample of size $N$
from the distribution in (6.4.4)].

(ii) If $\hat{K}^\ell(h,\tau)$ denotes the sample space-time K-function resulting from the joint
realization, $[(s_1,..,s_n),(t_{\pi_1(\ell)},..,t_{\pi_n(\ell)})]$, then choose relevant sets of distance radii,
8
Note that in most cases these frequencies will either be zero or one. But the present general formulation
allows for the possibility of simultaneous events, as for example Lymphoma cases reported on the same
day (or even instantaneous events, such as multiple casualties in the same auto accident).
$\{h_w : w = 1,..,W_R\}$, for R, and time intervals, $\{\tau_v : v = 1,..,V_T\}$, for T, and calculate the
sample space-time K-function values, $\{\hat{K}^\ell(h_w, \tau_v) : w = 1,..,W_R,\ v = 1,..,V_T\}$, for each
$\ell = 1,..,N$.

(iii) Finally, if the observed sample space-time K-function, $\hat{K}^0(h,\tau)$, is constructed from
the observed realization, then under the temporal indistinguishability hypothesis, each
observed value, $\hat{K}^0(h_w, \tau_v)$, should be a "typical"
sample from the list of values $[\hat{K}^\ell(h_w, \tau_v) : \ell = 0,1,..,N]$. Hence if $M_0^+$ denotes the
number of simulated random relabelings, $\ell = 1,..,N$, with $\hat{K}^\ell(h_w, \tau_v) \ge \hat{K}^0(h_w, \tau_v)$,
then the probability of obtaining a value as large as $\hat{K}^0(h_w, \tau_v)$ under this hypothesis
is estimated by the space-time clustering p-value:

(6.4.5)  $\hat{P}_{st\text{-}clustered}(h_w, \tau_v) = \dfrac{M_0^+ + 1}{N + 1}$

(iv) Similarly, if $M_0^-$ denotes the number of simulated random relabelings with
$\hat{K}^\ell(h_w, \tau_v) \le \hat{K}^0(h_w, \tau_v)$, then the probability of obtaining a value as small as
$\hat{K}^0(h_w, \tau_v)$ is estimated by the corresponding space-time dispersion p-value:

(6.4.6)  $\hat{P}_{st\text{-}dispersed}(h_w, \tau_v) = \dfrac{M_0^- + 1}{N + 1}$
Our primary interest here is of course in space-time clustering for relatively small values
of h and . But it is clear that a range of other questions could in principle be addressed
within the more general framework outlined above.
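To illustrate the full test, the following MATLAB sketch (assumed names; not the class program used below) computes the clustering p-values (6.4.5) on a grid of space-time scales. Since $n$ and $\hat{\lambda}_{st}$ are unchanged by permuting the times, raw pair counts can stand in for the $\hat{K}$ values:

function P = st_perm_test(loc, t, H, T, N)
% Random-relabeling test (6.4.5): loc = n x 2 locations, t = n x 1 times,
% H = vector of spatial radii, T = vector of time lags, N = number of relabelings
n = size(loc,1);
K0 = st_counts(loc, t, H, T);                          % observed pair counts
M = zeros(numel(H), numel(T));
for s = 1:N
    M = M + (st_counts(loc, t(randperm(n)), H, T) >= K0);
end
P = (M + 1)/(N + 1);                                   % space-time clustering p-values (6.4.5)

function C = st_counts(loc, t, H, T)
% number of pairs (i,j), i ~= j, with d_ij <= h and |t_i - t_j| <= tau, for each (h,tau)
d = sqrt((loc(:,1)-loc(:,1)').^2 + (loc(:,2)-loc(:,2)').^2);
tt = abs(t - t');
C = zeros(numel(H), numel(T));
for w = 1:numel(H)
    for v = 1:numel(T)
        C(w,v) = sum(d(:) <= H(w) & tt(:) <= T(v)) - numel(t);   % "- numel(t)" drops i = j pairs
    end
end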
increments, $\tau_j = (j/V_T)(\tau_{\max}/2)$, $j = 1,..,V_T$. So, for example, the following command uses
999 random relabelings of times to test for space-time clustering of the Lymphoma data,
LT, at each point on a grid of space-time neighborhoods, $(h_i, \tau_j)$, with $W_R = V_T = 20$:

The results of these $W_R \times V_T = 400$ tests are plotted on a grid and then interpolated in
MATLAB to obtain a p-value contour map such as the one shown in Figure 6.6 below:
Note first that each location in this region corresponds to the size of a space-time
neighborhood. Hence those areas with darker contours indicate space-time scales at
which there are significantly more cases in neighborhoods of this size (about randomly
selected cases) than would be expected under the temporal indistinguishability
hypothesis. In particular, the dark contours in the lower left corner show that there is very
significant concentration in small space-time neighborhoods, and hence significant space-
time clustering. This not only confirms the findings of the simple regression analysis
done in Assignment 1, but also conveys a great deal more information. In fact the darkest
contours show significance at the .001 level (which is the maximum significance
achievable with 999 simulations).9
Before discussing these results further, it is of interest to observe that while the direct plot
in MATLAB above is useful for obtaining visual results quickly, these p-values can also
be exported to ARCMAP and displayed in sharper and more vivid formats. For example,
9
Note also that these p-values can be retrieved in numerical form from the output structure, results, in the
command above.
the above results were exported to ARCMAP and smoothed by ordinary kriging to obtain
the sharper representation shown in Figure 6.7 below:
[Figure 6.7. Kriged P-Value contours: Time (days, 0 to 2500) versus Distance (km, 0 to 70), with p-value classes ranging from .001 (darkest) through .001-.002, .002-.005, .005-.010, .010-.050, .050-.100, and .100-.200 to .200-1.00 (lightest)]
Using this sharper image, notice first that the horizontal band of significance at the
bottom of the figure indicates significant clustering of cases within 500 days of each
other ($\approx$ 1.4 years) over a wide range of distances. This suggests the presence of short
periods (about 1.4 years) with unusually high numbers of cases over a wide region, i.e.,
local peaks in the frequency of cases over time. This can be confirmed by Figure 6.8
below, where a number of local peaks are seen, such as in years 7, 11, 13 and 15 (with
year 1 corresponding to 1961).
[Figure 6.8. Number of Cases per year, plotted against Time in years (1 to 16)]
Next observe that there is a secondary mode of significance at about 1500 days ($\approx$ 4
years) on the left edge of Figure 6.7. This indicates that many cases occurred close to one
another over a time lag of about 4 years. Note in particular that the peak years 7, 11, and
15 are spaced at 4-year intervals. This suggests that such peaks may represent new outbreaks of
Lymphoma cases in the same areas at intervals of about 4 years. Hence the p-value plots
in Figures 6.6 and 6.7 above do indeed yield more information than simple space-time
clustering of events.
APPENDIX TO PART I
In this Appendix, designated as A1 (appendices A2 and A3 are for Parts II and III,
respectively), we shall again refer to equations in the text by section and equation
number, so that (2.4.3) refers to expression (3) in section 2.4 of Part I. Also, references to
previous expressions in this Appendix (A1), will be written the same way, so that
(A1.1.3) refers to expression (3) of section 1 in Appendix A1.
This standard result appears in many elementary probability texts [such as Larsen and
Marx (2001, p.247)]. Here one starts with the fundamental limit identity,

(A1.1.1)  $\lim_{n \to \infty}\left(1 - \dfrac{x}{n}\right)^n = e^{-x}$
that defines the exponential function. Given this relation, observe that

(A1.1.3)  $\dfrac{n!}{k!(n-k)!}\left(\dfrac{a(C)}{a(R)}\right)^k\left(1 - \dfrac{a(C)}{a(R)}\right)^{n-k} = \dfrac{n(n-1)\cdots(n-k+1)}{n^k}\cdot\dfrac{[n\,a(C)/a(R)]^k}{k!}\left(1 - \dfrac{a(C)}{a(R)}\right)^{n-k}$

$= \left(\dfrac{n}{n}\right)\left(\dfrac{n-1}{n}\right)\cdots\left(\dfrac{n-k+1}{n}\right)\cdot\dfrac{[(n/a(R))\,a(C)]^k}{k!}\left(1 - \dfrac{a(C)}{a(R)}\right)^{n-k}$
But if we now evaluate expression (A1.1.3) at the sequence in (2.3.2) and recall that
$n_m/a(R_m) \to \lambda > 0$, then in the limit we can replace $n_m/a(R_m)$ by $\lambda$ in the second factor.
Moreover, since $(n_m - h)/n_m \to 1$ for all $h = 0,1,..,k-1$, it also follows that the first factor
in (A1.1.3) goes to one. In addition, the last factor also goes to one, since
$a(R_m) \to \infty$ implies $a(C)/a(R_m) \to 0$. Hence by taking limits we see that

(A1.1.4)  $\lim_m \dfrac{n_m!}{k!(n_m-k)!}\left(\dfrac{a(C)}{a(R_m)}\right)^k\left(1 - \dfrac{a(C)}{a(R_m)}\right)^{n_m - k}$
[ a (C )]k a (C )
nm
lim m 1
k! nm
lim m 1
k!
nm
[ a(C )]k a (C )
e
k!
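As a quick numerical check of this limit, the following MATLAB fragment (an illustrative script, not part of the class directory) evaluates the binomial probability in (A1.1.3) along a sequence with n / a(R) held fixed at \lambda , and compares it with the Poisson limit in (A1.1.4); the values lam = 2, aC = 1 and k = 3 are arbitrary choices:

% Illustrative check of (A1.1.4): binomial cell-count probabilities
% approach the Poisson form as n and a(R) grow with n/a(R) = lambda fixed.
lam = 2; aC = 1; k = 3;                        % arbitrary test values
for n = [10 100 1000 10000]
   aR = n/lam;                                 % keeps n/a(R) = lambda
   p = aC/aR;                                  % cell probability a(C)/a(R)
   binom = nchoosek(n,k) * p^k * (1-p)^(n-k);  % expression (A1.1.3)
   fprintf('n = %6d:  %.6f\n', n, binom);
end
poisson = (lam*aC)^k/factorial(k) * exp(-lam*aC)   % limit (A1.1.4)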
Given that the nn-distance, D , for a randomly selected point has cdf

(A1.2.1)    F_D(d) = 1 - \Pr(D > d) = 1 - e^{-\lambda \pi d^2}

the corresponding density is obtained by differentiation as

(A1.2.2)    f_D(d) = F_D'(d) = 2\lambda\pi d\, e^{-\lambda\pi d^2}
This distribution is thus seen to be an instance of the Rayleigh distribution (as for
example in Johnson and Kotz, 1970, p.197). This distribution is closely related to the
normal distribution, which can be used to calculate its moments. To do so, recall first that
since E(X) = 0 for any normal random variable, X ~ N(0, \sigma^2) , it follows that the
variance of X is simply its second moment, i.e.,

(A1.2.3)    \sigma^2 = \mathrm{var}(X) = E(X^2) - E(X)^2 = E(X^2)

But since the normal density, \phi(x) = (2\pi\sigma^2)^{-1/2} \exp(-x^2 / 2\sigma^2) , is symmetric about zero, we then see that

(A1.2.4)    \sigma^2 = E(X^2) = \int_{-\infty}^{\infty} \frac{x^2}{\sqrt{2\pi\sigma^2}}\, e^{-x^2/2\sigma^2}\, dx = \frac{2}{\sqrt{2\pi\sigma^2}} \int_0^{\infty} x^2 e^{-x^2/2\sigma^2}\, dx

Hence by choosing \sigma^2 = 1/(2\lambda\pi) , so that x^2/2\sigma^2 = \lambda\pi x^2 , it follows that

(A1.2.5)    \int_0^{\infty} x^2 e^{-\lambda\pi x^2}\, dx = \frac{\sigma^2 \sqrt{2\pi\sigma^2}}{2} = \frac{1}{4\lambda\pi} \cdot \frac{1}{\sqrt{\lambda}}
Hence it follows from (A1.2.2) and (A1.2.5) that

(A1.2.6)    E(D) = \int_0^{\infty} x\, f_D(x)\, dx = \int_0^{\infty} x\, (2\lambda\pi x\, e^{-\lambda\pi x^2})\, dx = 2\lambda\pi \int_0^{\infty} x^2 e^{-\lambda\pi x^2}\, dx = \frac{1}{2\sqrt{\lambda}}
To calculate the second moment, E(D^2) , recall next the integration-by-parts identity,

(A1.2.7)    \int_0^{\infty} f(x)\, g'(x)\, dx = -\int_0^{\infty} f'(x)\, g(x)\, dx - f(0)g(0) + \lim_{x \to \infty} f(x)g(x)

whenever these integrals and limits exist. Hence letting f(x) = x^2 and g(x) = -e^{-\lambda\pi x^2} , so that g'(x) = 2\lambda\pi x\, e^{-\lambda\pi x^2} , it follows that

(A1.2.8)    \int_0^{\infty} x^2 (2\lambda\pi x\, e^{-\lambda\pi x^2})\, dx = \int_0^{\infty} (2x)(e^{-\lambda\pi x^2})\, dx - (0) + \lim_{x \to \infty} x^2 e^{-\lambda\pi x^2} = \int_0^{\infty} 2x\, e^{-\lambda\pi x^2}\, dx

But since f_D in (A1.2.2) is a density, we must have

(A1.2.9)    1 = \int_0^{\infty} f_D(x)\, dx = \int_0^{\infty} 2\lambda\pi x\, e^{-\lambda\pi x^2}\, dx \;\Rightarrow\; \int_0^{\infty} 2x\, e^{-\lambda\pi x^2}\, dx = \frac{1}{\lambda\pi}

and may conclude from (A1.2.8) that

(A1.2.10)   E(D^2) = \int_0^{\infty} x^2 f_D(x)\, dx = \int_0^{\infty} x^2 (2\lambda\pi x\, e^{-\lambda\pi x^2})\, dx = \int_0^{\infty} 2x\, e^{-\lambda\pi x^2}\, dx = \frac{1}{\lambda\pi}

Hence the variance of D is given by

(A1.2.11)   \mathrm{var}(D) = E(D^2) - [E(D)]^2 = \frac{1}{\lambda\pi} - \frac{1}{4\lambda} = \frac{4 - \pi}{4\lambda\pi}

1
I am indebted to Christopher Jodice for pointing out several errors in my original posted derivations of
these moments.
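These moment formulas are easily checked by simulation. The following MATLAB fragment (again an illustrative script, with an arbitrary test intensity) samples nn-distances from the cdf in (A1.2.1) by inversion and compares the sample moments with (A1.2.6) and (A1.2.11):

% Monte Carlo check of E(D) and var(D) under CSR with intensity lambda.
lambda = 50;                          % arbitrary test intensity
u = rand(1e6,1);                      % uniform draws
D = sqrt(-log(1-u)/(lambda*pi));      % inversion of F_D(d) = 1 - exp(-lambda*pi*d^2)
[mean(D), 1/(2*sqrt(lambda))]         % sample mean vs. (A1.2.6)
[var(D), (4-pi)/(4*lambda*pi)]        % sample variance vs. (A1.2.11)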
For practical testing purposes, this is usually rescaled. Given that the gamma density for
Wm has the explicit form,
In this section it is shown that positive dependencies among nearest neighbors have the
effect of increasing the variance of the test statistic, Z_n , thus making outlier values more
likely than they would otherwise be. To show this, suppose first that the sample nn-
distances, (D_1,..,D_n) , share a common mean, \mu = E(D_i) , and variance, \sigma^2 = \mathrm{var}(D_i) .
Then the variance of the sample-mean statistic, \bar{D}_n = \frac{1}{n}\sum_{i=1}^n D_i , can be written as

\mathrm{var}(\bar{D}_n) = E\left[\left(\frac{1}{n}\sum_{i=1}^n D_i - \mu\right)^2\right] = E\left[\left(\frac{1}{n}\sum_{i=1}^n (D_i - \mu)\right)^2\right]

            = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n E[(D_i - \mu)(D_j - \mu)]

            = \frac{1}{n^2} \sum_{i=1}^n \sigma^2 + \frac{2}{n^2} \sum_{i=1}^{n-1} \sum_{j>i} \mathrm{cov}(D_i, D_j)

            = \frac{\sigma^2}{n} + \frac{1}{n^2} \sum_{i=1}^n \sum_{j \neq i} \mathrm{cov}(D_i, D_j)
Hence if there are some positive dependencies (i.e., positive covariances) among the
nearest-neighbor values (D_1,..,D_n) , then the second term of the last line will be positive,
so that in this case \mathrm{var}(\bar{D}_n) > \sigma^2 / n . Hence we must have

(A1.4.2)    E(Z_n^2) = E\left[\left(\frac{\bar{D}_n - \mu}{\sigma/\sqrt{n}}\right)^2\right] = \frac{E[(\bar{D}_n - \mu)^2]}{\sigma^2/n} = \frac{\mathrm{var}(\bar{D}_n)}{\sigma^2/n} > 1 \;\Rightarrow\; \mathrm{var}(Z_n) > 1
where the last implication follows from the fact that E(Z_n) = 0 regardless of any
dependencies among the nn-distances, so that \mathrm{var}(Z_n) = E(Z_n^2) . But since one should
have \mathrm{var}(Z_n) = 1 under independent random sampling, it then follows that realized
values of Z_n will tend to be farther away from zero than would be expected under
independence. Thus even those clustering or uniformity effects due to pure chance will
tend to look more significant than they actually are.
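This variance-inflation effect can be seen directly by simulation. The following MATLAB sketch (with hypothetical values, n = 30 and a common pairwise correlation of 0.3) generates equicorrelated nn-distance values and shows that the variance of the standardized mean Z_n greatly exceeds one:

% Variance inflation of Z_n under positive dependence (illustrative values).
n = 30; reps = 20000; rho = 0.3;              % rho = common pairwise correlation
Sigma = rho*ones(n) + (1-rho)*eye(n);         % equicorrelated covariance matrix
L = chol(Sigma,'lower');
Z = zeros(reps,1);
for r = 1:reps
   D = L*randn(n,1);                          % mean 0, variance 1, correlation rho
   Z(r) = mean(D)/(1/sqrt(n));                % standardized as if independent
end
var(Z)                                        % approx 1 + (n-1)*rho, far above 1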
The determination of whether a point, s , lies in a given polygon or not depends on certain
basic trigonometric facts. In Figure A1.1 below, the (hollow) point s is seen to lie inside
the polygon, R , determined by three boundary points {1,2,3}.
[Figure: boundary points 1, 2, 3 with the angles \theta_{12}, \theta_{23}, \theta_{31} subtended at s]
Fig.A1.1. Point Inside Polygon
If the angles (in radians) between successive points i and j , as seen from s , are denoted
by \theta_{ij} , then it should be clear that for any point s inside R these angles constitute a full
clockwise rotation through 2\pi radians, and hence that we must have \theta_{12} + \theta_{23} + \theta_{31} = 2\pi .
The situation can be more complex when the given polygon is not convex. But nonetheless, it
can easily be seen that if counterclockwise rotations are given negative values, then any
counterclockwise rotations are canceled out by additional clockwise rotations to yield the
same total, 2\pi . So if the polygon boundary points are numbered {1,2,..,N} proceeding
in a clockwise direction from any initial boundary point (with point N+1 identified with
point 1), then we must always have:2

(A1.5.1)    \sum_{i=1}^{N} \theta_{i,i+1} = 2\pi

On the other hand, if point s is outside of the polygon, R , then by cumulating angles
from s between each successive pair of points, the sum of clockwise and
counterclockwise rotations must cancel, leaving a total of zero radians, i.e.,

(A1.5.2)    \sum_{i=1}^{N} \theta_{i,i+1} = 0
In the case of the simple polygon, R = {1,2,3} , above, this is illustrated by the three
diagrams shown in Figure A1.2 below.

[Figure A1.2: three diagrams of the angles subtended at a point s outside the triangle {1,2,3}]

2
Certain additional complications are discussed at the end of this section.
Here the first two angles, \theta_{12} and \theta_{23} , are positive, and the angle \theta_{31} is precisely the
negative sum of \theta_{12} and \theta_{23} . By extending this idea, it is easy to see that a similar
argument holds for larger polygons.
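The following MATLAB sketch implements this cumulative-angle test under the stated assumptions (a single connected polygon with no holes). Here Poly is an (N+1) x 2 list of boundary points with the first point repeated at the end, and the function name is hypothetical (this is not the course's own point-in-polygon routine):

function inside = winding_test(s, Poly)
% Cumulative-angle (winding) test: the signed angle increments sum to
% +/-2*pi when s lies inside the polygon and to 0 when s lies outside.
   v = Poly - s;                        % vectors from s to each boundary point
   ang = atan2(v(:,2), v(:,1));         % direction of each vector
   d = diff(ang);                       % signed angle between successive points
   d(d >  pi) = d(d >  pi) - 2*pi;      % wrap each increment into (-pi, pi]
   d(d < -pi) = d(d < -pi) + 2*pi;
   inside = abs(sum(d)) > pi;           % total is 2*pi inside, 0 outside
end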
However, it is important to add here that this argument assumes that the polygon R is
connected and has no holes. Unfortunately, these conditions can sometimes fail to hold
when analyzing general map regions. For example, offshore islands are often included as
part of larger mainland regions, creating disconnected polygons. Also, certain small
regions are sometimes nested in larger regions, creating holes in these regions. For
example, military bases or Indian reservations within states are often given separate
regional designations. There are other examples, such as the lake in Figure 2.4 of Part I,
where one may wish to treat certain subregions as “holes”.
First observe that the circular cell, C , of radius h about point s_i can be partitioned into a
set of concentric rings, C_k , about s_i , each of thickness \delta_k , so that C = \cup_k C_k . One such
ring is shown in Figure A1.3 below.

[Figure A1.3: region R containing the circular cell C about s_i , with one concentric ring C_k shown]

Since these rings are disjoint, it follows that the number of points in C is identically
equal to the sum of the numbers of points in each ring, C_k , so that (in terms of the
notation in Section 2.2 in the text),

(A1.6.1)    E[N(C)] = \sum_k E[N(C_k)]
For each ring, the CSR Hypothesis implies that

(A1.6.2)    E[N(C_k)] = \lambda\, a(C_k) = \lambda\, a(C_k \cap R)\, \frac{a(C_k)}{a(C_k \cap R)}

Hence, in terms of the area ratios,

(A1.6.3)    w_{ik} = \frac{a(C_k \cap R)}{a(C_k)} \;\Leftrightarrow\; \frac{a(C_k)}{a(C_k \cap R)} = \frac{1}{w_{ik}}

when the ring partition in Figure A1.3 becomes very fine, so that the \delta_k 's
become small, one has the approximation

(A1.6.4)    E[N(C_k)] = \lambda\, a(C_k \cap R)\, \frac{a(C_k)}{a(C_k \cap R)} = E[N(C_k \cap R)]\, \frac{1}{w_{ik}}

and hence

(A1.6.5)    K(h) = \frac{1}{\lambda}\, E[N(C)] = \frac{1}{\lambda} \sum_k E[N(C_k)] = \frac{1}{\lambda} \sum_k \frac{E[N(C_k \cap R)]}{w_{ik}}

Note also that for sufficiently fine partitions it can be assumed that each ring contains at
most one of the observed points, s_j \in C \cap R , so that the point-count estimators,
\hat{E}[N(C_k \cap R)] , for E[N(C_k \cap R)] will have value one for those rings C_k containing a
point and zero otherwise. Hence, observing by definition that I_h(d_{ij}) = 1 for all such
points, it follows that

(A1.6.6)    \hat{E}[N(C_k \cap R)] = \begin{cases} I_h(d_{ij}) & , \; s_j \in C_k \cap R \\ 0 & , \; \text{otherwise} \end{cases}
Hence, substituting these estimates into (A1.6.5) (with \lambda replaced by its estimate, \hat{\lambda} ), we obtain the individual estimates,

(A1.6.7)    \hat{K}_i(h) = \frac{1}{\hat{\lambda}} \sum_k \frac{\hat{E}[N(C_k \cap R)]}{w_{ik}} = \frac{1}{\hat{\lambda}} \sum_{j \neq i} \frac{I_h(d_{ij})}{w_{ij}}

Finally, by averaging these estimates over all points s_i \in R as in the text, we obtain the
pooled estimate,

(A1.6.8)    \hat{K}(h) = \frac{1}{n} \sum_{i=1}^n \hat{K}_i(h) = \frac{1}{\hat{\lambda}\, n} \sum_{i=1}^n \sum_{j \neq i} \frac{I_h(d_{ij})}{w_{ij}}
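As a minimal MATLAB sketch of this pooled estimator (ignoring the edge corrections, i.e., setting all w_ij = 1, and using a hypothetical function name rather than the course's own K-function programs), one can compute (A1.6.8) directly from the interpoint distances:

function Khat = k_hat(pts, h, a_R)
% Pooled K-function estimate (A1.6.8) with all edge weights w_ij = 1.
% pts is an n x 2 matrix of point locations and a_R is the area of R.
   n = size(pts,1);
   lam = n/a_R;                                            % intensity estimate
   d = sqrt((pts(:,1)-pts(:,1)').^2 + (pts(:,2)-pts(:,2)').^2);
   I = (d <= h) & ~eye(n);                                 % I_h(d_ij) for j ~= i
   Khat = sum(I(:))/(lam*n);                               % expression (A1.6.8)
end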
The text derivation of the P-values in expressions (4.6.8) and (4.6.10) is appealing from a
conceptual viewpoint in that it focuses directly on the distribution of the test statistic,
\hat{K}(h) , under the CSR Hypothesis. But there is an alternative derivation of this expression
that has certain practical advantages discussed below. This approach is actually much
closer in spirit to the argument used in deriving the “envelope” P-values of expressions
(4.6.3) and (4.6.4), which we now make more precise as follows. Observe that if l_0 is
consistent with CSR then by construction (l_0, l_1,.., l_N) must be independently and
identically distributed (iid) samples from a common distribution. In the envelope case it
was then argued from the symmetry of iid samples that none is more likely to be the
highest (or lowest) than any other. More generally, suppose we now ask how likely it is
for the observed sample value, l_0 , to be the k-th largest among the N+1 samples,
(l_0, l_1,.., l_N) , i.e., to have rank, k , in the ordering of these values. Here it is important to
note that ranks are not well defined in the case of ties. So for the moment we avoid this
complication by assuming that there are no ties. In this case, observe that there must be
(N+1)! possible orderings of these iid samples, and again by symmetry, that each of
these orderings must be equally likely. But since exactly N! of these orderings have l_0 in
the k-th position (where N! is simply the number of ways of ranking the other values), it
follows that if the random variable, R_0 , denotes the rank of l_0 , then under H_0 we must
have:

(A1.7.1)    \Pr(R_0 = k) = \frac{N!}{(N+1)!} = \frac{N!}{(N+1)\,N!} = \frac{1}{N+1} \;, \;\; k = 1,..,N+1

which in turn implies that the chance of a rank as high as k is given by,3
3
Remember that “high” ranks mean low values of k .
(A1.7.2)    \Pr(R_0 \le k) = \sum_{r=1}^{k} \Pr(R_0 = r) = \sum_{r=1}^{k} \frac{1}{N+1} = \frac{k}{N+1} \;, \;\; k = 1,..,N+1
So rather than using the distribution of \hat{K}(h) under CSR to test this null hypothesis, we
can use the distribution of its rank, R_0 , in (A1.7.1) and (A1.7.2). But if we again let
m^+(l_0) denote the number of simulated samples at least as large as l_0 , then the observed
rank of l_0 (assuming no ties) is precisely m^+(l_0) + 1 . So to test the CSR Hypothesis we
now ask: How likely would it be to obtain an observed rank as high as m^+(l_0) + 1 if CSR
were true? Here the answer is given from (A1.7.2) by the clustering P-value:

(A1.7.3)    P_{cluster}(h) = \Pr[R_0 \le m^+(l_0) + 1] = \frac{m^+(l_0) + 1}{N+1}

which is seen to be precisely the same as expression (4.6.8). However there is one
important difference here, namely that we are no longer attempting to estimate a P-value.
The distribution in (A1.7.1) and (A1.7.2) is exact, so that there is no need for a “hat” on
P_{cluster} .
Turning finally to the case of ties, suppose that exactly q of the simulated values are tied
with l_0 , and let R_0(q) denote the number of these tied values that happen to be ranked
below l_0 in the overall ordering. Then conditional on R_0(q) , the rank of l_0 is
m^+(l_0) + 1 - R_0(q) , so that the corresponding conditional cluster P-value is given by

(A1.7.4)    P_{cluster}[h \,|\, R_0(q)] = \Pr[R_0 \le m^+(l_0) + 1 - R_0(q)] = \frac{m^+(l_0) + 1 - R_0(q)}{N+1}
Moreover, by the same symmetry of iid samples, each of the q+1 possible values of R_0(q) must be equally likely, so that

(A1.7.5)    \Pr[R_0(q) = i] = \frac{1}{q+1} \;, \;\; i = 0,1,..,q

Hence the appropriate generalized cluster P-value is given by the expected value,

(A1.7.6)    P_{cluster}(h \,|\, q) = \sum_{i=0}^{q} \frac{m^+(l_0) + 1 - i}{N+1} \cdot \frac{1}{q+1} = \frac{1}{(N+1)(q+1)} \sum_{i=0}^{q} [m^+(l_0) + 1 - i]

            = \frac{1}{(N+1)(q+1)} \left[ \{m^+(l_0) + 1\}(q+1) - \frac{(q+1)q}{2} \right] = \frac{m^+(l_0) + 1 - (q/2)}{N+1}

Hence this generalized cluster P-value amounts to replacing the rank, m^+(l_0) + 1 , of l_0 in
(A1.7.2) for the case of no ties with its average rank, m^+(l_0) + 1 - q/2 , for cases where q
values are tied with l_0 . So for example, if N = 3 and (l_0, l_1, l_2, l_3) = (5, 2, 5, 6) , so that
m^+(l_0) = 2 and q = 1 , and the possible ranks of l_0 are {2,3} , then its average rank is 2.5 and

(A1.7.7)    P_{cluster}(h) = \frac{(2+1) - 1/2}{4} = \frac{2.5}{4}

Note finally that (A1.7.3) above is now simply the special case of “no
ties”, so that P_{cluster}(h) = P_{cluster}(h \,|\, 0) .
The argument for uniform P-values is of course identical. Thus the corresponding
generalized uniform P-value in the presence of q ties is given by:

(A1.7.8)    P_{uniform}(h \,|\, q) = \frac{m^-(l_0) + 1 - (q/2)}{N+1}
where m^-(l_0) is again the number of simulated values, l_i , no larger than l_0 . Here it is
important to note that these P-values are “almost complements” in the sense that for all q
and h ,

(A1.7.9)    P_{cluster}(h \,|\, q) + P_{uniform}(h \,|\, q) = \frac{N+2}{N+1}

To see this, note simply that if we let N^< , N^= , N^> denote the number of simulated
samples that are less than, equal to, or greater than l_0 , then it follows by definition that
q = N^= , so that

(A1.7.10)   m^+(l_0) = N^> + N^= = N^> + q

(A1.7.11)   m^-(l_0) = N^< + N^= = N^< + q

(A1.7.12)   P_{cluster}(h \,|\, q) + P_{uniform}(h \,|\, q) = \frac{m^+(l_0) + 1 - (q/2)}{N+1} + \frac{m^-(l_0) + 1 - (q/2)}{N+1}

            = \frac{[(N^> + q) + 1 - (q/2)] + [(N^< + q) + 1 - (q/2)]}{N+1} = \frac{(N^< + N^= + N^>) + 2}{N+1} = \frac{N+2}{N+1}
so that we can essentially plot both P-values on one diagram. Hence all plots in K-
function programs such as k_function_plot focus on cluster P-values, P_{cluster}(h \,|\, q) ,
where P_{uniform}(h \,|\, q) is implicitly taken to be 1 - P_{cluster}(h \,|\, q) .
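In computational terms, this tie-corrected P-value reduces to a few lines of MATLAB. The following hypothetical helper (a sketch, not the actual code in k_function_plot) computes (A1.7.6) from the observed value l0 and the vector l of simulated values:

function P = p_cluster(l0, l)
% Generalized cluster P-value (A1.7.6) with correction for ties.
   N = numel(l);
   m_plus = sum(l >= l0);             % simulated values at least as large as l0
   q = sum(l == l0);                  % number of values tied with l0
   P = (m_plus + 1 - q/2)/(N + 1);    % average-rank P-value
end

For the example above, p_cluster(5, [2 5 6]) returns (2 + 1 - 1/2)/4 = 0.625.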
While the full grid, ref, can be represented in ARCMAP by exporting this grid from
MATLAB and displaying it as a point file, it is often more useful to construct this display
directly in MATLAB to obtain a quick check of whether or not the extent and grid size
are appropriate. Assuming that the boundary file exists in the MATLAB workspace, this
can be accomplished with the program poly_plot.m, which was written for this kind of
application. In the present case the boundary file, Bod_poly (shown on page 3-23 of Part
I), is the desired input. Hence to plot the grid, ref, with respect to this boundary, use the
command:
>> poly_plot(Bod_poly,ref);
and leave all other values as defaults. The value field, P_005, contains the desired p-
values in the file, P-val.shp. The weight 5 adds a degree of “stiffness” to the spline,
which yields a somewhat smoother result than the default value of .01. Now click OK and
a new layer appears called “Spline of P-val.shp”. Right click on this layer and select
“Make Permanent”. Save it to your home directory as, say, spline_pvals. This will not
change the layer, but will give it an editable form. You can alter the display by right
clicking on the layer, “Spline of P-val.shp”, selecting “Classified” (rather than
“Stretched”), and editing its properties. [Notice that the values are mostly negative, and
that the relevant range from 0 to 1 is only a very small portion of the values. This is due
to the extreme nonlinearity of the spline fit.]
Click OK and a new layer called “ctour” appears that shows the desired contours. This
layer is stored as a temporary file only, and you can edit its properties. So select “Classify” and
choose the “Manual” option with settings (.01, .05, 0.1, 0.2) and appropriate colors. This
should yield roughly the representation in Figure 4.23 above. You can keep trying
different interval and base contour values until you find values that capture the desired
regions of significance. Then use Data Export to save a permanent copy in your home
directory and edit as desired.
CONTINUOUS SPATIAL DATA ANALYSIS
The key difference between continuous spatial data and point patterns is that there is
now assumed to be a meaningful value, Y ( s ) , at every location, s , in the region of
interest. For example, Y ( s ) might be the temperature at s or the level of air pollution at
s . We shall consider a number of illustrative examples in the next section. But before
doing so, it is convenient to outline the basic analytical framework to be used throughout
this part of the NOTEBOOK.
If the region of interest is again denoted by R , and if the value, Y(s) , at each location,
s \in R , is treated as a random variable, then the collection of random variables,

(1.1)    \{Y(s) : s \in R\}

is designated as a spatial stochastic process on R .
Observe next that there is a clear parallel between spatial stochastic processes and
temporal stochastic processes,

(1.2)    \{Y(t) : t \in T\}
where the set, T , is some continuous (possibly unbounded) interval of time. In many
respects, the only substantive difference between (1.1) and (1.2) is the dimension of the
underlying domain. Hence it is not surprising that most of the assumptions and analytical
methods to be employed here have their roots in time series analysis. One key difference
that should be mentioned here is that time is naturally ordered (from “past” to “present”
to “future”) whereas physical space generally has no preferred directions. This will have
a number of important consequences that will be discussed as we proceed.
(1.1.1)    Y = \begin{pmatrix} Y(s_1) \\ \vdots \\ Y(s_n) \end{pmatrix} = \begin{pmatrix} Y_1 \\ \vdots \\ Y_n \end{pmatrix}
represent the possible list of values that may be observed at these locations. Note that
(following standard matrix conventions) we always take vectors to be column vectors
unless otherwise stated. The second representation in (1.1.1) will usually be used when
the specific locations of these samples are not relevant. Note also that it is often more
convenient to write vectors in transpose form, as Y' = (Y_1,..,Y_n) , thus yielding a more
compact in-line representation. Each possible realization,
(1.1.2)    y = (y_1,..,y_n)' = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}
of the random vector, Y , then denotes a possible set of specific observations (such as the
temperatures at each location, i = 1,..,n ).
Most of our analysis will focus on the means and variances of these random variables, as
well as the covariances between them. Again following standard notation, we shall
usually denote the mean of each random variable, Y(s_i) , by

(1.1.3)    E[Y(s_i)] = \mu(s_i) = \mu_i \;, \;\; i = 1,..,n

The last representation facilitates comparison with the covariance of two random
variables, Y(s_i) and Y(s_j) , as defined by

(1.1.4)    \mathrm{cov}[Y(s_i), Y(s_j)] = E[(Y(s_i) - \mu_i)(Y(s_j) - \mu_j)] = \sigma_{ij}
The full matrix of variances and covariances for the components of Y is then designated
as the covariance matrix for Y , and is written alternatively as
(1.1.7)    \mathrm{cov}(Y) = \begin{pmatrix} \mathrm{cov}(Y_1,Y_1) & \cdots & \mathrm{cov}(Y_1,Y_n) \\ \vdots & & \vdots \\ \mathrm{cov}(Y_n,Y_1) & \cdots & \mathrm{cov}(Y_n,Y_n) \end{pmatrix} = \begin{pmatrix} \sigma_{11} & \cdots & \sigma_{1n} \\ \vdots & & \vdots \\ \sigma_{n1} & \cdots & \sigma_{nn} \end{pmatrix}
As we shall see below, spatial stochastic processes can often be usefully studied in
terms of these first and second moments (means and covariances). This is especially true
for the important case of multivariate normally distributed random vectors that will be
discussed in some detail below. For the present, it suffices to say that much of our effort
to model spatial stochastic processes will focus on the structure of these means and
covariances for finite samples. To do so, it is convenient to start with the following
overall conceptual framework.
Essentially all spatial statistical models that we shall consider start by decomposing the
statistical variation of the random variables, Y(s) , into a deterministic trend term, \mu(s) , and
a stochastic residual term, \varepsilon(s) , as follows [see also Cressie (1993, p.113)]:

(1.2.1)    Y(s) = \mu(s) + \varepsilon(s) \;, \;\; s \in R

where the residual term is assumed to have mean zero,

(1.2.2)    E[\varepsilon(s)] = 0 \;, \;\; s \in R
Expressions (1.2.1) and (1.2.2) together constitute the basic modeling framework to be
used throughout the analyses to follow. It should be emphasized that this framework is
simply a convenient representation of Y(s) , and involves no substantive assumptions
whatsoever. But it is nonetheless very useful. In particular, since \mu(\cdot) defines a
deterministic function on R , it is often most appropriate to think of \mu(\cdot) as a spatial trend
function representing the typical values of the given spatial stochastic process over all of
R , i.e., the global structure of the Y-process. Similarly, since \varepsilon(\cdot) is by definition a
spatial stochastic process on R with mean identically zero, it is useful to think of \varepsilon(\cdot) as
a spatial residual process representing local variations about \mu(\cdot) , i.e., the local structure
of the Y-process.
Within this framework, our basic modeling strategy will be to identify a spatial trend
function, \mu(\cdot) , that fits the Y-process so well that the resulting residual process, \varepsilon(\cdot) , is
not statistically distinguishable from “random noise”. However, from a practical
viewpoint, the usual statistical model of such random effects as a collection of
independent random variables, \{\varepsilon(s) : s \in R\} , is somewhat too restrictive. In particular,
since most spatial variables tend to exhibit some degree of continuity over space (such as
average temperature or rainfall), one can expect these variables to exhibit similar values
at locations close together in space. Moreover, since the spatial residuals, \varepsilon(s) , by definition
consist of all unobserved spatial variables influencing Y(s) that are not captured by the
global trend, \mu(s) , one can also expect these residuals to exhibit similar values at
locations close together in space. In statistical terms, this means that for locations, s and
v , that are sufficiently close together, the associated residuals, \varepsilon(s) and \varepsilon(v) , will tend to
exhibit positive statistical dependence. Thus, in constructing statistical models of spatial
phenomena, it is essential to allow for such dependencies in the spatial residual process,
\{\varepsilon(s) : s \in R\} .
Before proceeding, it is important to emphasize that our basic measure of the degree of
dependency between spatial residuals -- and indeed between any random variables, X
and Y -- is in terms of their covariance,

\mathrm{cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]

[as in expression (1.1.6) above]. To gain further insight into the meaning of covariance,
observe that if \mathrm{cov}(X, Y) is positive, then by definition the deviations, X - \mu_X and
Y - \mu_Y , are expected to have the same sign (either both positive or both negative), so
that typical scatter plots of (x, y) points will have a positive slope, as shown in the first
panel of Figure 1.2 below.
[Figure 1.2. Three scatter plots illustrating positive, negative, and essentially zero covariance between X and Y]
Given these initial observations, our basic strategy will be to start in Section 3 below by
constructing an appropriate notion of spatially-dependent random effects. While it may
seem strange to begin by focusing on the residual process, \{\varepsilon(s) : s \in R\} , which simply
describes “everything left out” of the model of interest, this notion of spatially-dependent
random noise will play a fundamental role in all spatial statistical models to be
developed. In particular, this will form the basis for our construction of covariance
matrices [as in expression (1.1.7) above], which will effectively summarize all spatial
statistical relationships of interest. This will be followed in Section 4 with the development
of a statistical tool for estimating covariances, known as the variogram. This will also
provide a useful graphical device for summarizing spatially-dependent random effects.
Finally, in Section 5 we begin applying these tools to full spatial models as in (1.2.1)
above. In the simplest of these models, it will be assumed that the spatial trend is constant
[i.e., \mu(s) \equiv \mu ], so that (1.2.1) reduces to1

Y(s) = \mu + \varepsilon(s) \;, \;\; s \in R

As will be shown, this simple model is useful for stochastic spatial prediction, or kriging.
In Section 6 we then begin to consider models in which the spatial trend, \mu(s) , varies over
space, and in particular, depends on possible explanatory variables, [x_1(s),..,x_k(s)] ,
associated with each location, s \in R .
But before launching into these details, it is useful to begin with a number of motivating
examples which serve to illustrate the types of spatial phenomena that can be modeled.
1
Note that the symbol “ \equiv ” means that \mu(s) is identically equal to \mu for all s \in R .
Among the most common examples of continuous spatial data are environmental
variables such as temperature and rainfall, which can in principle be measured at each
location in space. The present example involves rainfall levels in central Sudan during
1942, and can be found in the ARCMAP file, arcview\Projects\Sudan\Sudan.mxd. The
Sudan population in 1942 was largely along the Nile River, as shown in Figure 2.1
below. The largest city, Khartoum, is at the fork of the Nile (White Nile to the west and
Blue Nile to the east). There is also a central band of cities extending to the west.1
Northern Sudan is largely desert with very few population centers. Hence it should be
clear that the information provided by rainfall measurements in the n = 31 towns shown
in the Figure will yield a somewhat limited picture of overall rainfall patterns in Sudan.
[Figure 2.1. Towns of central Sudan along the Nile River (with KHARTOUM at the fork), classed by 1942 RAINFALL (mm): 105–168, 168–272, 272–330, 330–384, 384–503, 503–744]
This implies that one must be careful in trying to predict rainfall outside this band
of cities. For example, suppose that one tries a simple “smoother” like Inverse Distance
Weighting (IDW) in ARCMAP (Spatial Analyst extension) [see Section 5.1 below for
additional examples of “smoothers”]. Here, if the rainfall level observed in each city, s_i ,
is denoted by y(s_i) , then the IDW prediction at any location, s , takes the form,
1
The population concentrations to the west are partly explained by higher elevations (with cooler climate)
and secondary river systems providing water.
(2.1.1)    \hat{y}(s) = \sum_{i=1}^{n(s)} w_i(s)\, y(s_i)

where n(s) is some specified number of points in \{s_i : i = 1,..,n\} that are closest to s , and
where the inverse distance weights have the form,

(2.1.2)    w_i(s) = \frac{d(s, s_i)^{-1}}{\sum_{j=1}^{n(s)} d(s, s_j)^{-1}}
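A minimal MATLAB sketch of this smoother (using all n data points rather than the n(s) nearest ones, and a hypothetical function name rather than the ARCMAP implementation) is as follows:

function yhat = idw(s, pts, y)
% Inverse Distance Weighting prediction (2.1.1)-(2.1.2) at location s,
% where pts is n x 2 and y is the n x 1 vector of observed values.
   d = sqrt(sum((pts - s).^2, 2));    % distances d(s,si)
   if any(d == 0)                     % exact interpolator at the data points
      yhat = y(find(d == 0, 1));
      return
   end
   w = (1./d)/sum(1./d);              % inverse-distance weights (2.1.2)
   yhat = w'*y;                       % weighted average (2.1.1)
end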
[Figure 2.2. IDW interpolation of the 1942 rainfall data, with the same RAINFALL (mm) classes as in Figure 2.1]
This is an “exact” interpolator in the sense that every data point, s_i , is assigned exactly
the measured value, \hat{y}(s_i) = y(s_i) . But in spite of this, it should be evident that this
interpolation exhibits considerably more variation in rainfall than is actually present. In
particular, one can see that there are small “peaks” around the highest values and small
“pits” around the lowest values. Mathematically, this is a clear example of what is called
“overfitting”, i.e., finding a surface sufficiently curvilinear that it passes exactly through
every data point.
2
See also Johnston et al. (2001, p.114).
3
The results for IDW in the Geostatistical Analyst extension of ARCMAP are essentially identical.
For the sake of comparison, a more recent and detailed map of rainfall in the same area for the
six-month period from March to August in 2006 is shown in Figure 2.3 below.4 Since
these are not yearly rainfall totals, the legend is only shown in ordinal terms. Moreover,
while there is a considerable difference in dates, it is not unreasonable to suppose that the
overall pattern of rainfall in 1942 was quite similar to that shown in the figure.
[Figure 2.3. 2006 rainfall map of the same area, with ordinal RAINFALL legend from None to Highest]
Here rainfall levels are seen to be qualitatively similar to Figure 2.2 in the sense that
rainfall is heavier in the south than in the north. But it is equally clear that the actual
variation in Figure 2.3 is much smoother than that in Figure 2.2. More generally, without
severe changes in elevation (as was seen for the California case in the Example
Assignment), it is natural to expect that variations in rainfall levels will be gradual.
This motivates a very different approach to interpolating the data in Figure 2.1. Rather
than focusing on the specific values at each of these 31 towns, suppose we concentrate on
the spatial trend in rainfall, corresponding to \mu(\cdot) in expression (1.2.1) above. Without
further information, one can attempt to fit trends as a simple function of the location
coordinates, s = (s_1, s_2) . Given the prior knowledge that rainfall trends tend to be smooth,
the most natural specification to start with is the smoothest possible (non-constant)
function, namely a linear function of (s_1, s_2) :

(2.1.3)    Y(s) = \mu(s) + \varepsilon(s) = \beta_0 + \beta_1 s_1 + \beta_2 s_2 + \varepsilon(s)
This can of course be fitted by a linear regression, using the above data, [y(s_i), s_{1i}, s_{2i}] ,
for the i = 1,..,31 towns. This data was imported to JMPIN as Sudan.jmp, and the
4
The source file here is Sudan_Rainfall_map_source.pdf in the class ArcMap directory, Sudan.
1942 rainfall data (R-42) was regressed on the town coordinates (X,Y). The estimates,
(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2) , were then imported to MATLAB in the workspace, sudan.mat. Here a grid,
G, of points covering the Sudan area was constructed using grid_form.m (as in Section
4.8.2 of Part I), and the predicted value, \hat{y}_g = \hat{\beta}_0 + \hat{\beta}_1 s_{g1} + \hat{\beta}_2 s_{g2} , at each grid point, g ,
was calculated. These results were then imported to Sudan.mxd in ARCMAP and were
interpolated using the spline interpolator in Spatial Analyst (Interpolate to Raster
Spline).5 The results of this procedure are shown in Figure 2.4 below:
[Figure 2.4. Linear trend-surface interpolation of the 1942 rainfall data, with the same RAINFALL (mm) classes as in Figure 2.1]
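In MATLAB terms, the fit-and-predict step just described amounts to a single least-squares regression followed by a matrix product. The following sketch assumes the town coordinates X, Y, the rainfall values R42, and the grid G from grid_form.m are already in the workspace (variable names here are illustrative):

% Linear trend surface (2.1.3): fit by OLS and predict at the grid points.
A = [ones(size(X)), X, Y];            % design matrix [1, s1, s2]
b = A\R42;                            % OLS estimates (b0, b1, b2)
yhat_g = [ones(size(G,1),1), G]*b;    % predicted trend at each grid point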
A visual comparison of Figure 2.4 with Figure 2.3 shows that this simple linear trend
model is qualitatively much more in agreement with actual rainfall patterns than the IDW
fit in Figure 2.2.6 The results of this linear regression are shown in Table 2.1 below.
Notice in particular that the Y-coordinate ( s_2 above) is very significant while the X-
coordinate ( s_1 above) is not. This indicates that most rainfall variation is from north
5
See section 5.5 below for further discussion of spline interpolations.
6
It should be emphasized here that we have only used the “default” settings in the IDW interpolator to
make a point about “overfitting”. One can in fact construct more reasonable IDW fits by using the many
options available in the Geostatistical Analyst version of this interpolator.
to south, as is clear from Figures 2.3 and 2.4. However, the adjusted R-square shows that
only about 57% of the variation in rainfall levels is being accounted for by this linear trend
model, so that there is still considerable room for improvement. With additional data
about other key factors (such as elevations) one could of course do much better. But even
without additional information, it is possible to consider more complex specifications of
coordinate functions to obtain a better fit. As stressed above, there is always a danger of
overfitting this data. But if adjusted R-square is used as a guide, then it is possible to
seek better polynomial fits within the context of linear regression. To do so, it is natural
to begin by examining the regression residuals, as shown in Figure 2.5 below.
[Figure 2.5. Two plots of the linear-model residuals (Linear_Resids, range −2000 to 3000), with horizontal scales 0–8000 and 200–350]
While these residuals show nothing out of the ordinary, a plot of the residuals against the
X-coordinate is much more revealing. As seen in Figure 2.6 there appears to be a clear
nonlinearity here, suggesting that perhaps a quadratic specification of X would yield a
better fit than the linear specification in (2.1.3) above. This can also be seen by plotting
the residuals spatially, as in Figure 2.7 below:
[Figure 2.7. Spatial plot of the linear-model residuals at the 31 towns (positive and negative residuals shown by contrasting symbols)]
If we focus on the heavy linear contour in the figure, then the residuals near the middle of
this line are seen to be negative (blue), indicating that observed rainfall is smaller than
predicted rainfall. Hence, recalling that higher rainfall values are to the south, these
predictions could be reduced by pulling this contour line further south in the middle.
Similarly, since the residuals near both ends of this line tend to be positive (red), a similar
correction could be made by moving the ends north, yielding a curved contour such as the
dashed curve shown in the figure.
Hence this visual analysis of spatial residuals again suggests that a quadratic specification
of the X-coordinate should yield a better fit. Thus, as an alternative model, we now
consider the following quadratic form:7

Y(s) = \beta_0 + \beta_1 s_1 + \beta_2 s_2 + \beta_3 s_1^2 + \varepsilon(s)
The results of this quadratic regression are shown in Table 2.2 below, and confirm that
this new specification does indeed yield a significantly better overall fit, with adjusted R-
square showing that an additional 10% of rainfall variation has been accounted for. In
addition, it is clear that both the linear and quadratic terms in X are very significant,
indicating that each is important.8
By employing exactly the same procedure outlined for the linear regression above, the
results of this regression can be used to predict values on a grid and then interpolated in
ARCMAP (again using a spline interpolator) to yield a plot similar to Figure 2.4 above.
The results of this procedure are shown in Figure 2.8 below. Here a comparison of Figure
2.8 with the more accurate rain map from 2006 in Figure 2.3 shows that in spite of its
mathematical simplicity, this quadratic trend surface gives a fairly reasonable picture of
the overall pattern of rainfall in Sudan.
7
Here one can also start with a general quadratic form including terms for s_2^2 and s_1 s_2 . But this more
general regression shows that neither of these coefficients is significant.
8
It is of interest to notice that over short ranges, the variables X and X^2 are necessarily highly correlated.
So the significance of both adds further confirmation to the appropriateness of this regression.
[Figure 2.8. Quadratic trend-surface interpolation of the 1942 rainfall data, with the same RAINFALL (mm) classes as in Figure 2.1]
Finally a plot of the spatial residuals for this quadratic model, as in Figure 2.9 below,
shows that much of the structure in the residuals for the linear model in Figure 2.7 has
now been removed.
[Figure 2.9. Spatial plot of the quadratic-model residuals at the 31 towns]
Among the most toxic industrial soil pollutants is the class of PCBs (polychlorinated
biphenyls). The following data set from [BG] consists of 70 PCB soil measurements from
the area surrounding an industrial site near the town of Pontypool, Wales, in 1991. The
locations and PCB levels for these 70 sites can be found in the JMPIN file, Pcbs.jmp. It is
clear from Figure 2.10 below that there is a significant concentration of PCB levels on
the eastern edge of this site. The task here is to characterize the spatial pattern of
variability in these levels surrounding the plant.
[Figure 2.10. PCB Levels at the 70 measurement sites around the Industrial Site, classed as 3–6, 6–23, 23–40, 40–58, 58–100, and 100–4620; scale bar 500 m]
A visual inspection suggests that the concentration falls off with distance from this area
of high concentration. To model this in a simple way, a representative location in this
site, designated as the “Center” in Figure 2.11 below,9 was chosen, and the distance from this
location to each measurement site was recorded (in the DIST column of Pcbs.jmp). Here
the simplest possible model is to assume that these PCB levels fall off linearly with
distance from this center. A plot of this regression is shown in Figure 2.11 below, and
9
The coordinates of this center location are given by (x, y) = (330064, 198822) .
look quite “reasonable” in terms of the concentric rings of decreasing PCB levels from
this center point.
[Figure residue: map of PCB Estimates classed from −248 to 283, with the CENTER location marked; scale bar 2000 (m)]
The reason for this is evident from an examination of the scatter plot on the left side of
this figure, which reveals the presence of two dramatic outliers, circled in red. One could
of course remove these outliers and produce a much better linear fit. But an examination
of their distances shows that both are close to the center point in Figure 2.11, and hence
are extremely important data points. So removing them would defeat the whole purpose
of the analysis.
then we obtain a vastly improved fit as well as more significant coefficients.11 (Note that
the positive coefficient on the quadratic term reflects the slight bowl shape seen in Figure
2.12 above.)
[Table/figure residue: regression results with RSquare 0.551, and a plot of the lnDIST residuals]
Moreover, the two outliers (again shown by red circles in Figure 2.13) have been
dramatically reduced by this data transformation. But while this transformed model of
PCBs seems to capture the spatial distribution in a more reasonable way, we cannot draw
sharp conclusions without an adequate statistical model of the residuals, (\varepsilon_i : i = 1,..,n) , in
(2.2.1). This is the task to which we now turn.
10
This is closely related to the translog specifications of commodity production functions often used in
economics. See for example http://www.egwald.ca/economics/cesdatatranslog.php.
11
The estimated intercept term has been omitted to save space.
Observe that all regressions in the illustrations above [starting with expression (2.1.3) in
the Sudan rainfall example] have relied on an implicit model of unobserved random
effects (i.e., regression residuals) as a collection, (\varepsilon_i : i = 1,..,n) , of independently and
identically distributed normal random variables [where for our purposes, individual
sample points, i , are taken to represent different spatial locations, s_i ]. But recall from the
introductory discussion in Section 1.2 above that for more realistic spatial statistical
models we must allow for possible spatial dependencies among these residuals. Hence
the main objective of the present section is to extend this model to one that is sufficiently
broad to cover the types of spatial dependencies we shall need. To do so, we begin in
Section 3.1 by examining random effects at a single location, and show that normality
can be motivated by the classical Central Limit Theorem. In Section 3.2, these results
will be extended to random effects at multiple locations by applying the Multivariate
Central Limit Theorem to motivate multivariate normality of such joint random effects.
This multi-normal model will form the statistical underpinning for all subsequent
analyses. Finally, in Section 3.3 we introduce the notion of spatial stationarity to model
covariances among these spatial random effects, (\varepsilon_i : i = 1,..,n) .
First recall that the unobserved random effects, \varepsilon_i , at each location (or sample point), s_i ,
are assumed to fluctuate around zero, with E(\varepsilon_i) = 0 . Now imagine that this overall
random effect, \varepsilon_i , is composed of many independent factors,

(3.1.1)    \varepsilon_i = e_{i1} + e_{i2} + \cdots + e_{im} = \sum_{k=1}^{m} e_{ik} ,

where in typical realizations some of these factors, e_{ik} , will be positive and others
negative. Suppose moreover that each individual factor contributes only a very small part
of the total. Then no matter how these individual random factors are distributed, their
cumulative effect, \varepsilon_i , must eventually have a “bell shaped” distribution centered around
zero. This can be illustrated by a simple example in which each random component, e_{ik} ,
assumes the values 1/m and -1/m with equal probability, so that E(e_{ik}) = 0 for all
k = 1,..,m . Then each is distributed as shown for the m = 1 case in Figure 3.1(a) below.
Now even though this distribution is clearly flat, if we consider the m = 2 case,

(3.1.2)    \varepsilon_i = e_{i1} + e_{i2}

then it is seen in Figure 3.1(b) that the distribution is already starting to be “bell shaped”
around zero. In particular, the value 0 is much more likely than either of the extremes, -1
and 1. The reason of course is that this value can be achieved in two ways, namely
(e_{i1} = \tfrac{1}{2}, e_{i2} = -\tfrac{1}{2}) and (e_{i1} = -\tfrac{1}{2}, e_{i2} = \tfrac{1}{2}) , whereas the extreme values can each occur in
only one way. This simple observation reveals a fundamental fact about sums of
independent random variables: intermediate values of sums can occur in more ways than
extreme values, and hence tend to be more likely. It is this property of independent sums
that gives rise to their “bell shaped” distributions, as can be seen in parts (c) and (d) of
Figure 3.1.
[Figure 3.1. Distribution of \varepsilon_i for (a) m = 1, (b) m = 2, (c) m = 10, and (d) m = 20]
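The histograms in Figure 3.1 are easily reproduced by simulation, as in the following illustrative MATLAB fragment (not part of the class directory):

% Distribution of eps_i = sum of m independent components taking +/-(1/m).
m = 10; reps = 100000;
e = (2*(rand(reps,m) > 0.5) - 1)/m;   % each entry is +1/m or -1/m
eps_i = sum(e,2);                     % cumulative random effect (3.1.1)
histogram(eps_i)                      % bell shape centered at zero, as in (c)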
But while this basic shape property is easily understood, the truly amazing fact is that the
limiting form of this bell shape always corresponds to essentially the same distribution,
namely the normal distribution. To state this precisely, it is important to notice first that
while the distributions in Figure 3.1 start to become bell shaped, they are also starting to
concentrate around zero. Indeed, the limiting form of this particular distribution must
necessarily be a unit point mass at zero,1 and is certainly not normally distributed. Here it
turns out that the individual values of these factors, (e_{ik} = 1/m or e_{ik} = -1/m) , become
“too small” as m increases, so that eventually even their sum, \varepsilon_i , will almost certainly
vanish. At the other extreme, suppose that these values are independent of m , say
(e_{ik} = 1 or e_{ik} = -1) . Then while these individual values will eventually become small
relative to their sum, \varepsilon_i , the variance of \varepsilon_i itself will increase without bound.2 In a
similar manner, observe that if the common means of these individual factors were not
identically zero, then the limiting mean of \varepsilon_i would also be unbounded.3 So it should be
clear that precise analysis of limiting random sums is rather delicate. The time-honored
approach is to standardize such sums. Recall that for any random variable, X , with mean,
\mu , and variance, \sigma^2 , the standardization of X is given by

(3.1.3)    Z = \frac{X - \mu}{\sigma}

so that by construction,

(3.1.4)    E(Z) = \frac{1}{\sigma}\,[E(X) - \mu] = 0

(3.1.5)    \mathrm{var}(Z) = E(Z^2) = E\left[\left(\frac{X - \mu}{\sigma}\right)^2\right] = \frac{E[(X - \mu)^2]}{\sigma^2} = \frac{\sigma^2}{\sigma^2} = 1
1
Simply observe that if x_{ik} is a binary random variable with \Pr(x_{ik} = 1) = .5 = \Pr(x_{ik} = -1) , then by
definition, e_{ik} = x_{ik}/m , so that \varepsilon_i = (x_{i1} + \cdots + x_{im})/m is seen to be the average of m samples from this
binary distribution. But by the Law of Large Numbers, such sample averages must eventually concentrate at
the population mean, E(x_{ik}) = 0 .
2
In particular, since \mathrm{var}(e_{ik}) = E(e_{ik}^2) = .5(1)^2 + .5(-1)^2 = 1 for all k , it would then follow from the
independence of these factors that \mathrm{var}(\varepsilon_i) = \sum_{k=1}^{m} \mathrm{var}(e_{ik}) = m\,\mathrm{var}(e_{i1}) = m , and hence that
\mathrm{var}(\varepsilon_i) \to \infty as m \to \infty .
3
Since E(\varepsilon_i) = \sum_{k=1}^{m} E(e_{ik}) = m\,E(e_{i1}) implies |E(\varepsilon_i)| = m\,|E(e_{i1})| , it follows that if |E(e_{i1})| > 0 then
|E(\varepsilon_i)| \to \infty as m \to \infty .
In particular, we can in principle use this standardization procedure to study the limiting
distributional shape of any sum of random variables, say

(3.1.6)    S_m = X_1 + \cdots + X_m = \sum_{k=1}^{m} X_k

As in our example, let us assume for the present that these variables are independently
and identically distributed (iid), with common mean, \mu , and variance, \sigma^2 [so that
(X_1,..,X_m) can be viewed as a random sample of size m from some common
distribution]. Then the mean and variance of S_m are given respectively by

(3.1.7)    E(S_m) = \sum_{k=1}^{m} E(X_k) = \sum_{k=1}^{m} \mu = m\,\mu

(3.1.8)    \mathrm{var}(S_m) = \sum_{k=1}^{m} \mathrm{var}(X_k) = m\,\sigma^2

so that the standardized sum takes the form,

(3.1.9)    Z_m = \frac{S_m - E(S_m)}{\sqrt{\mathrm{var}(S_m)}} = \frac{S_m - m\mu}{\sqrt{m\,\sigma^2}}

which by definition implies that E(Z_m) = 0 and \mathrm{var}(Z_m) = 1 for all m . The key property
of these standardized sums is that for large m the distribution of Z_m is approximately
normally distributed.
To state this precisely, we must first define the normal distribution. A random variable, X ,
with mean, \mu , and variance, \sigma^2 , is said to be normally distributed, written X ~ N(\mu, \sigma^2) ,
if and only if X has probability density given by

(3.1.10)    f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}
[where the first version shows f(x) as an explicit function of (\mu, \sigma^2) and the second
shows the more standard version of f(x) in terms of (\mu, \sigma) ]. This is the classical “bell-
shaped” curve, centered on the mean, \mu . A key property of normal
random variables (that we shall make use of many times) is that any linear function of a
normal random variable is also normally distributed. In particular, since the
standardization procedure in (3.1.3) is seen to be a linear function, it follows that the
standardization, Z , of any normal random variable must be normally distributed with
mean, E(Z) = 0 , variance, \mathrm{var}(Z) = 1 , and with density

(3.1.11)    \phi(z) = \frac{1}{\sqrt{2\pi}}\, \exp\left(-\frac{z^2}{2}\right)
For obvious reasons, this is called the standard normal distribution (or density), and is
generally denoted by \phi . The importance of this particular distribution is that all
probability questions about normal random variables can be essentially answered by
standardizing them and applying the standard normal distribution (so that all normal
tables are based entirely on this standardized form).

Next, if the cumulative distribution function (cdf) of any random variable, X , is denoted
for all values, x , by F(x) = \Pr(X \le x) , then for any standard normal random variable,
Z ~ N(0,1) , the cdf of Z is denoted by

(3.1.12)    \Phi(z) = \Pr(Z \le z) = \int_{-\infty}^{z} \phi(v)\, dv

Again, \Phi is usually reserved for this important cdf (that forms the basis of all normal
tables).
With these preliminaries, we can now give a precise statement of the limiting normal
property of standardized sums stated above. To do so, it is important to note first that the
distribution of any random variable is completely defined by its cdf. [For example, in the
standard normal case above it should be clear that the standard normal density, \phi , is
recovered by simply differentiating \Phi .] Hence, letting the cdf of the standardized sum,
Z_m , in (3.1.9) be denoted by F_{Z_m} , we now have the following classical form of the
Central Limit Theorem (CLT):

Central Limit Theorem (Classical). For any sequence of iid random variables,
(X_1,..,X_m) , with standardized sum, Z_m , in (3.1.9),

(3.1.13)    \lim_{m \to \infty} F_{Z_m}(z) = \Phi(z) \;, \;\; \text{for all } z
In other words, the cdf of iid standardized sums, Z_m , converges to the cdf of the standard
normal distribution. The advantage of this cdf formulation is that one obtains an exact
limit result. But in practical terms, the implication of the CLT is that for “sufficiently
large” m , the distribution of such standardized sums is approximately normally
distributed.4 Even more to the point, since (3.1.3) implies that iid sums, S_m , are linear
functions of their standardizations, Z_m , and since linear functions of normal random
variables are again normal, it may also be concluded that these sums are approximately
normal. If for convenience we now use the notation, X ~d N(\mu, \sigma^2) , to indicate that a
random variable, X , is approximately normally distributed with mean, \mu , and variance, \sigma^2 ,
and if we recall from (3.1.7) and (3.1.8) that the mean and variance of S_m are given by
m\mu and m\sigma^2 , respectively, then we have the following more useful form of the CLT:

Central Limit Theorem (Practical). For all sums, S_m , of iid random variables
with m sufficiently large,

(3.1.14)    S_m ~d N(m\mu, \, m\sigma^2)
This result can in principle be used to motivate the fundamental normality assumption
about random effects, \varepsilon_i . In particular, if \varepsilon_i is a sum of iid random components as in
(3.1.1), with zero means, then by (3.1.14) it follows that \varepsilon_i will also be approximately
normal with zero mean for sufficiently large m .
However, it should be emphasized here that in practical examples (such as the one
discussed in Section 3.2 below) the individual components, e_{ik} , of \varepsilon_i may not be fully
independent, and are of course not likely to be identically distributed. Hence it is
important to emphasize that the CLT is actually much more general than the classical
assertion above for iid random variables. While such generalizations require conditions
that are too technical to even be stated in a precise manner here,5 it is nonetheless useful
to give a very rough statement of the general version as follows:6
4
Recall from footnote 5 in Section 3.2.2 of Part I that “sufficiently large” is usually taken to mean m \ge 30 ,
as long as the common distribution of the underlying random variables, (X_k) , in (3.1.6) is not “too
skewed”.
5
For further details about such generalizations, an excellent place to start is the Wikipedia discussion of the
CLT at http://en.wikipedia.org/wiki/Central_limit_theorem.
6
The following version of the Central Limit Theorem (and the multivariate version of this theorem in
section 3.2.3 below) is based on Theorem 8.11 in Breiman (1969). The advantage of the present version is
that it directly extends the “iid” conditions of the classical CLT.
Central Limit Theorem (General). If the random variables, (X_1,..,X_m) , with means,
\mu_k = E(X_k) , and variances, \sigma_k^2 = \mathrm{var}(X_k) , have individual distributions that are not
too different, and statistical dependencies that are not too strong, then for m sufficiently large,

(3.1.15)    S_m ~d N(\mu, \sigma^2)

with \mu = \mu_1 + \cdots + \mu_m and \sigma^2 = \sigma_1^2 + \cdots + \sigma_m^2 . In particular, if the components, e_{ik} ,
of \varepsilon_i in (3.1.1) satisfy these conditions and have zero means, then for m sufficiently large,

(3.1.16)    \varepsilon_i ~d N(0, \sigma^2)
While the main application of the CLT for our present purposes is to motivate the
normality assumption about residuals in a host of statistical models (including linear
regression), it is important to add that perhaps the single most important application of
the CLT is for inference about population means. In particular, suppose one draws an iid
random sample, (X_1,..,X_m) , from a population with unknown mean, \mu , and constructs the
associated sample mean:

(3.1.17)    \bar{X}_m = \frac{1}{m} \sum_{k=1}^{m} X_k = \frac{1}{m}\, S_m ,
Then it follows at once from (3.1.7) that \bar{X}_m is an unbiased estimator of \mu , i.e., that

(3.1.18)    E(\bar{X}_m) = \frac{1}{m}\, E(S_m) = \frac{1}{m}\,(m\mu) = \mu

Similarly, the variance identity,

(3.1.19)    \mathrm{var}(\bar{X}_m) = \frac{1}{m^2}\, \mathrm{var}(S_m) = \frac{1}{m^2}\,(m\sigma^2) = \sigma^2 / m

implies that for large m this estimate has a small variance, and hence should be close to \mu
(which is of course precisely the Law of Large Numbers). But one can say even more
by the CLT. To do so, note first that the standardized sample mean,
(3.1.20)    Z_{\bar{X}_m} = \frac{\bar{X}_m - E(\bar{X}_m)}{\sigma(\bar{X}_m)} = \frac{\bar{X}_m - \mu}{\sqrt{\sigma^2/m}}

is exactly the same as the standardized sum, Z_m , since

(3.1.21)    Z_{\bar{X}_m} = \frac{\frac{1}{m} S_m - \mu}{\sqrt{\sigma^2/m}} = \frac{S_m - m\mu}{m\sqrt{\sigma^2/m}} = \frac{S_m - m\mu}{\sqrt{m\,\sigma^2}} = Z_m

and hence satisfies exactly the same limiting properties as the sample sum. In particular,
this yields the following version of the practical CLT in (3.1.14) above for sample means:
Central Limit Theorem (Sample Means). For sufficiently large iid random
samples, (X_1,..,X_m) , from any given statistical population with mean, \mu ,
and variance, \sigma^2 , the sample mean, \bar{X}_m , is approximately normal, i.e.,

(3.1.22)    \bar{X}_m ~d N(\mu, \, \sigma^2/m)
Note in particular that random samples from the same population are by definition
identically distributed. So as long as they are also independent, Corollary 2 is always
applicable. But the Clark-Evans test in Section 3.2.2 of Part I provides a classic example
where this latter assumption may fail to hold. More generally, the types of dependencies
inherent in spatial (or temporal) data require more careful analysis when applying the
CLT to sample means.
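For truly independent samples, however, the approximation in (3.1.22) is easy to see numerically. The following illustrative MATLAB fragment draws repeated iid exponential samples (a decidedly non-normal population with \mu = 1 and \sigma^2 = 1 ) and compares the histogram of sample means with the N(\mu, \sigma^2/m) density:

% CLT for sample means: iid exponential(1) draws, m = 50 per sample.
m = 50; reps = 20000;
X = -log(rand(reps,m));                    % iid exponential(1) samples
Xbar = mean(X,2);                          % one sample mean per row
histogram(Xbar,'Normalization','pdf'); hold on
x = linspace(min(Xbar), max(Xbar), 200);
plot(x, exp(-(x-1).^2*m/2)/sqrt(2*pi/m))   % N(1, 1/m) density from (3.1.22)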
Given the above results for random effects, $\varepsilon_i$, at individual locations, $s_i$, we now consider the vector, $\varepsilon$, of such random effects for a given set of sample locations, $\{s_i : i = 1,\dots,n\} \subset R$, i.e.,

(3.2.1)  $\varepsilon = (\varepsilon_1,\dots,\varepsilon_n)' = (\varepsilon(s_1),\dots,\varepsilon(s_n))'$

As a parallel to (3.1.1) we again assume that these random effects are the cumulative sum of independent factors,

(3.2.2)  $\varepsilon = e_1 + e_2 + \cdots + e_m = \sum_{k=1}^m e_k$

where by definition each independent factor, $e_k$, is itself a random vector over sample locations, i.e.,

(3.2.3)  $e_k = (e_{k1},\dots,e_{kn})' = (e_k(s_1),\dots,e_k(s_n))'$
As one illustration, recall the California rainfall example in which annual precipitation, $Y_i$, at each of the $n = 30$ sample locations in California was assumed to depend on four explanatory variables ($x_{i1} =$ "altitude", $x_{i2} =$ "latitude", $x_{i3} =$ "distance to coast", and $x_{i4} =$ "rain shadow"), as follows:

(3.2.4)  $Y_i = \beta_0 + \sum_{j=1}^4 \beta_j x_{ij} + \varepsilon_i\,, \quad i = 1,\dots,n$
Here the unobserved residuals, $\varepsilon_i$, are the random effects we wish to model. If we write (3.2.4) in vector form as

(3.2.5)  $Y = \beta_0 1_n + \sum_{j=1}^4 \beta_j x_j + \varepsilon$
[where $1_n = (1,\dots,1)'$ is the unit column vector], then the residual vector, $\varepsilon$, in (3.2.5) is an instance of (3.2.1) with $n = 30$. This random vector by definition contains all factors influencing precipitation other than the four "main" effects posited above. So the key assumption in (3.2.2) is that the influence of each unobserved factor is only a small additive part of the total residual effect, $\varepsilon$, not accounted for by the four main effects above.
For example, the first factor, $e_1$, might be a "cloud cover" effect. More specifically, the unobserved value, $e_{1i} = e_1(s_i)$, at each location, $s_i$, might represent fluctuations in cloud cover at $s_i$ [where higher (lower) levels of cloud cover tend to contribute positively (negatively) to precipitation at $s_i$]. Similarly, factor $e_2$ might be an "atmospheric pressure" effect, where $e_{2i} = e_2(s_i)$ now represents fluctuations in barometric pressure levels at $s_i$ [and where in this case higher (lower) pressure levels tend to contribute negatively (positively) to precipitation levels].
The key point to observe is that while fluctuations in factors like cloud cover or
atmospheric pressure will surely exhibit strong spatial dependencies, the dependency
between these factors at any given location is much weaker. In the present instance, while there may indeed be some degree of negative relation between fluctuations in pressure and cloudiness, $(e_{1i}, e_{2i})$, at any given location, $s_i$, this tends to be much weaker than the positive relations between either fluctuations in cloud cover, $(e_{1i}, e_{1j})$, or atmospheric pressure, $(e_{2i}, e_{2j})$, at locations, $s_i$ and $s_j$, that are in close proximity. Hence while the random vectors, $e_1$ and $e_2$, can each exhibit strong internal spatial dependencies, it is not unreasonable to treat them as mutually independent. More generally, as a parallel to Section 3.1.3 above, it will turn out that if (i) the individual distributions of the random component vectors, $e_1,\dots,e_m$, in (3.2.2) are not "too different", and (ii) the statistical dependencies between these components are not "too strong", then their sum, $\varepsilon$, will be approximately "normal" for $m$ sufficiently large.
But in order to make sense of this statement, we must first extend the normal distribution
in (3.1.10) to its multivariate version. This is done in the next section, where we also
develop its corresponding invariance property under linear transformations. This will be
followed by a development of the multivariate version of the Central Limit Theorem that
underscores the importance of this distribution.
To motivate the multivariate normal (or multi-normal) distribution, observe that there is one case in which we can determine the joint distribution of a random vector, $X = (X_1,\dots,X_n)'$, in terms of the marginal distributions of its components, $X_1,\dots,X_n$, namely when these components are independently distributed. In particular, suppose that each $X_i$ is independently normally distributed as in (3.1.10) with density

(3.2.6)  $f_i(x_i) = \dfrac{1}{\sqrt{2\pi\sigma_i^2}}\; e^{-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}}\,, \quad i = 1,\dots,n$

Then letting $\sigma_{ii} = \sigma_i^2$, and using the exponent notation, $\sqrt{a} = a^{1/2}$, it follows that the joint density, $f(x_1,\dots,x_n)$, of $X$ is given by the product of these marginals, i.e.,
(3.2.7)  $f(x_1,\dots,x_n) = f_1(x_1)\, f_2(x_2) \cdots f_n(x_n)$

$\qquad = \dfrac{1}{\sqrt{2\pi\sigma_{11}}}\, e^{-\frac{(x_1-\mu_1)^2}{2\sigma_{11}}} \cdot \dfrac{1}{\sqrt{2\pi\sigma_{22}}}\, e^{-\frac{(x_2-\mu_2)^2}{2\sigma_{22}}} \cdots \dfrac{1}{\sqrt{2\pi\sigma_{nn}}}\, e^{-\frac{(x_n-\mu_n)^2}{2\sigma_{nn}}}$

$\qquad = (2\pi)^{-n/2}\,(\sigma_{11}\sigma_{22}\cdots\sigma_{nn})^{-1/2}\; e^{-\frac{1}{2}\left[\frac{(x_1-\mu_1)^2}{\sigma_{11}} + \cdots + \frac{(x_n-\mu_n)^2}{\sigma_{nn}}\right]}$
where the last line uses the identity, $(e^{a_1})(e^{a_2})\cdots(e^{a_n}) = e^{a_1 + a_2 + \cdots + a_n}$. To write this in matrix form, observe first that if $x = (x_1,\dots,x_n)'$ now denotes a typical realization of the random vector, $X = (X_1,\dots,X_n)'$, then by (3.2.6) the associated mean vector of $X$ is given by $\mu = (\mu_1,\dots,\mu_n)'$ [as in expression (1.1.4)]. Moreover, since independence implies that $\mathrm{cov}(X_i, X_j) = \sigma_{ij} = 0$ for $i \neq j$, it follows that the covariance matrix of $X$ now takes the form [as in expression (1.1.7)],
$\mathrm{cov}(X) = \Sigma = \begin{pmatrix} \sigma_{11} & & 0\\ & \ddots & \\ 0 & & \sigma_{nn} \end{pmatrix}$
But since the inverse of a diagonal matrix is simply the diagonal matrix of inverse values,

(3.2.8)  $\Sigma^{-1} = \begin{pmatrix} \sigma_{11}^{-1} & & 0\\ & \ddots & \\ 0 & & \sigma_{nn}^{-1} \end{pmatrix}$
it follows that

(3.2.9)  $(x-\mu)'\,\Sigma^{-1}(x-\mu) = (x_1-\mu_1,\dots,x_n-\mu_n)\begin{pmatrix} \sigma_{11}^{-1} & & 0\\ & \ddots & \\ 0 & & \sigma_{nn}^{-1} \end{pmatrix}\begin{pmatrix} x_1-\mu_1\\ \vdots\\ x_n-\mu_n \end{pmatrix}$

$\qquad = (x_1-\mu_1,\dots,x_n-\mu_n)\begin{pmatrix} (x_1-\mu_1)/\sigma_{11}\\ \vdots\\ (x_n-\mu_n)/\sigma_{nn} \end{pmatrix} = \dfrac{(x_1-\mu_1)^2}{\sigma_{11}} + \dfrac{(x_2-\mu_2)^2}{\sigma_{22}} + \cdots + \dfrac{(x_n-\mu_n)^2}{\sigma_{nn}}$
which is precisely the exponent sum in (3.2.7). Finally, since the determinant, $|\Sigma|$, of a diagonal matrix, $\Sigma$, is simply the product of its diagonal elements, i.e.,

(3.2.10)  $|\Sigma| = \sigma_{11}\sigma_{22}\cdots\sigma_{nn}\,,$
we see from (3.2.9) and (3.2.10) that (3.2.7) can be rewritten in matrix form as
1 ( x ) 1 ( x )
(3.2.11) f ( x) (2 ) n/2 | |1/2 e 2
This is in fact an instance of the multi-normal density (or multivariate normal density). More generally, a random vector, $X = (X_1,\dots,X_n)'$, with associated mean vector, $\mu = (\mu_1,\dots,\mu_n)'$, and covariance matrix, $\Sigma = (\sigma_{ij} : i, j = 1,\dots,n)$, is said to be multi-normally distributed if and only if its joint density is of the form (3.2.11) for this choice of $\mu$ and $\Sigma$. As a generalization of the univariate case, this is denoted symbolically by $X \sim N(\mu, \Sigma)$.
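As a quick numerical check of (3.2.11), the following MATLAB fragment (a sketch with illustrative values; it assumes the Statistics Toolbox function normpdf) verifies that in the independent (diagonal) case the matrix form agrees with the product of univariate marginals in (3.2.7):

    % Compare the multi-normal density (3.2.11) with the product form (3.2.7)
    n = 2;                                   % dimension
    mu = [1; 2];  sig2 = [0.5; 2.0];         % means and variances of the components
    Sigma = diag(sig2);                      % diagonal covariance matrix (independence)
    x = [1.3; 1.1];                          % evaluation point
    f_matrix = (2*pi)^(-n/2) * det(Sigma)^(-1/2) * exp(-0.5*(x-mu)'/Sigma*(x-mu));
    f_product = prod(normpdf(x, mu, sqrt(sig2)));   % product of univariate marginals
    disp([f_matrix, f_product])              % the two values should agree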
While it is not possible to visualize this distribution in high dimensions, we can gain some insight by focusing on the 2-dimensional case, known as the bi-normal (or bivariate normal) distribution. If $X = (X_1, X_2)'$ is bi-normally distributed with mean vector, $\mu = (\mu_1, \mu_2)'$, and covariance matrix,

(3.2.12)  $\Sigma = \begin{pmatrix} \sigma_{11} & \sigma_{12}\\ \sigma_{21} & \sigma_{22} \end{pmatrix}$

then the basic shape of the density function in (3.2.11) is largely determined by the correlation between $X_1$ and $X_2$, i.e., by

(3.2.13)  $\rho(X_1, X_2) = \dfrac{\mathrm{cov}(X_1, X_2)}{\sigma(X_1)\,\sigma(X_2)} = \dfrac{\sigma_{12}}{\sqrt{\sigma_{11}\sigma_{22}}}$
[Figures: bi-normal density surfaces over the $(x_1, x_2)$ plane, for different correlation values]
For purposes of analysis, the single most useful feature of this distribution is that all linear transformations of multi-normal random vectors are again multi-normal. To state this precisely, we begin by calculating the mean and covariance matrix for general linear transformations of random vectors. Given a random vector, $X = (X_1,\dots,X_n)'$, with mean vector, $E(X) = \mu = (\mu_1,\dots,\mu_n)'$, and covariance matrix, $\mathrm{cov}(X) = \Sigma$, together with any compatible $(m \times n)$ matrix, $A = (a_{ij} : i = 1,\dots,m,\ j = 1,\dots,n)$, and $m$-vector, $b = (b_1,\dots,b_m)'$, of coefficients, consider the linear transformation of $X$ defined by

(3.2.14)  $Y = AX + b$

In particular, when $A$ consists of a single row of coefficients, $a' = (a_1,\dots,a_n)$, and $b$ is a scalar, (3.2.14) reduces to the linear compound

(3.2.15)  $Y = a'X + b$
For example, if the columns of the identity matrix, $I_n$, are denoted by

(3.2.16)  $I_n = \begin{pmatrix} 1 & 0 & \cdots & 0\\ 0 & 1 & & \vdots\\ \vdots & & \ddots & 0\\ 0 & \cdots & 0 & 1 \end{pmatrix} = [\,e_1, e_2,\dots,e_n\,]$

then each component, $X_i$, of $X$ is itself a linear compound of the form

(3.2.17)  $X_i = e_i'X\,, \quad i = 1,\dots,n$
So linear transformations provide a very flexible tool for analyzing random vectors.
Next recall from the linearity of expectations that by taking expectations in (3.2.14) we obtain

(3.2.18)  $E(Y) = E(AX + b) = A\,E(X) + b = A\mu + b$
By using this result, we can obtain the covariance matrix for $Y$ as follows. First note that by definition the expected value of a matrix of random variables is simply the matrix of their expectations, so that the covariance matrix of $Y$ can be written as

$\mathrm{cov}(Y) = E\left[\begin{pmatrix} Y_1 - \mu_1\\ \vdots\\ Y_n - \mu_n \end{pmatrix}(Y_1 - \mu_1,\dots,Y_n - \mu_n)\right]$
Hence, recalling that $E(AX) = A\mu$, we obtain

$\mathrm{cov}(AX) = E[(AX - A\mu)(AX - A\mu)']$
$\qquad = E[A(X - \mu)(X - \mu)'A']$
$\qquad = A\,E[(X - \mu)(X - \mu)']\,A'$
$\qquad = A\,\mathrm{cov}(X)\,A' = A\Sigma A'$

and since the constant vector, $b$, has no effect on covariances, $\mathrm{cov}(AX + b) = A\Sigma A'$ as well.
So both the mean and covariance matrix of $AX + b$ are directly obtainable from those of $X$. We shall use these properties many times in analyzing the multivariate spatial models of subsequent sections.
But for the moment, the key feature of these results is that the distribution of any linear transformation, $AX + b$, of a multi-normal random vector, $X \sim N(\mu, \Sigma)$, is obtained by simply replacing the mean and covariance matrix of $X$ in (3.2.11) with those of $AX + b$. The only requirement here is that the resulting covariance matrix, $A\Sigma A'$, be nonsingular, so that the inverse covariance matrix, $(A\Sigma A')^{-1}$, in (3.2.11) exists. This in turn is equivalent to the condition that the rows of $A$ be linearly independent vectors, so that $A$ is said to be of full row rank. With this stipulation, we have the following result [established in Section A3.2.3 of the Appendix to Part III in this NOTEBOOK]:7
Linear Invariance Theorem. For any multi-normal random vector, $X \sim N(\mu, \Sigma)$, and linear transformation, $Y = AX + b$, with $A$ of full row rank,

(3.2.22)  $Y \sim N(A\mu + b,\ A\Sigma A')$
What this means in practical terms is that if a given random vector, $X$, is known (or assumed) to be multi-normally distributed as $X \sim N(\mu, \Sigma)$, then we can immediately write down the exact distribution of essentially any linear function, $AX + b$, of $X$.
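The following MATLAB fragment (a sketch assuming the Statistics Toolbox function mvnrnd) illustrates (3.2.22) by simulation: the sample mean vector and sample covariance matrix of $AX + b$ should be close to $A\mu + b$ and $A\Sigma A'$, respectively:

    % Simulate a linear transformation of a multi-normal vector (3.2.22)
    mu = [0; 1];  Sigma = [2 .5; .5 1];      % mean vector and covariance matrix of X
    A = [1 1; 1 -1];  b = [3; 0];            % a full-row-rank transformation
    X = mvnrnd(mu', Sigma, 100000)';         % columns are draws of X ~ N(mu, Sigma)
    Y = A*X + repmat(b, 1, size(X,2));       % Y = AX + b for each draw
    disp([mean(Y,2), A*mu + b])              % sample mean vector vs A*mu + b
    disp(cov(Y'))                            % sample covariance matrix vs ...
    disp(A*Sigma*A')                         % ... A*Sigma*A'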
We are now ready to consider multivariate extensions of the univariate central limit
theorems above. Our objective here is to develop only those aspects of the multivariate
7 For an alternative development of this important result, see for example Theorem 2.4.4 in Anderson (1958).
case that are relevant for our present purposes. The first objective is to show that the multivariate case relates to the univariate case in a remarkably simple way. To do so, recall first from (3.2.17) above that for any random vector, $X = (X_1,\dots,X_n)'$, each of its components, $X_i$, can be represented as a linear transformation, $X_i = e_i'X$, of $X$. So each marginal distribution of $X$ is automatically the distribution of this linear compound. More generally, each linear compound, $a'X$, can be said to define a generalized marginal distribution of $X$.8 Now while the marginal distributions of $X$ only determine its joint distribution in the case of independence [as in (3.2.7) above], it turns out that the joint distribution of $X$ is always completely determined by its generalized marginal distributions.9 To appreciate the power of this result, recall from the Linear Invariance Theorem above that if $X$ is multi-normal with mean vector, $\mu$, and covariance matrix, $\Sigma$, then all of its linear compounds, $a'X$, are automatically univariate normally distributed with means, $a'\mu$, and variances, $a'\Sigma a$. But since these marginals in turn uniquely determine the distribution of $X$, it must necessarily be multi-normal. Thus we are led to the following fundamental correspondence:
(3.2.23)  $X \sim N(\mu, \Sigma) \iff a'X \sim N(a'\mu,\ a'\Sigma a)\ \text{ for all } a \neq 0$

This correspondence allows the multivariate Central Limit Theorem to be reduced to its univariate counterpart. In particular, consider sums of iid random vectors,

(3.2.25)  $S_m = X_1 + \cdots + X_m$
8 Since each marginal compound, $e_i'X$, has a coefficient vector of unit length, i.e., $\|e_i\| = 1$, it is formally more appropriate to restrict generalized marginals to linear compounds, $a'X$, of unit length ($\|a\| = 1$). But for our present purposes we need not be concerned with such scaling effects.
9 For a development of this idea (due to Cramér and Wold), see Theorem 29.4 in Billingsley (1979).
Multivariate Central Limit Theorem (Practical). For all sums of iid random vectors, $S_m = X_1 + \cdots + X_m$, with common mean vector, $\mu$, and covariance matrix, $\Sigma$, if $m$ is sufficiently large, then

(3.2.27)  $S_m \overset{d}{\sim} N(m\mu,\ m\Sigma)$
But since multivariate normality will almost always arise as a model assumption in our spatial applications, the most useful extension is the "General" Central Limit Theorem in (3.1.15), which may now be stated as follows:10

Multivariate Central Limit Theorem (General). For all sums, $S_m = X_1 + \cdots + X_m$, of independent random vectors with mean vectors, $\mu_k$, and covariance matrices, $\Sigma_k$, whose distributions are not "too different", if $m$ is sufficiently large, then

(3.2.28)  $S_m \overset{d}{\sim} N(\mu, \Sigma)$

with $\mu = \mu_1 + \cdots + \mu_m$ and $\Sigma = \Sigma_1 + \cdots + \Sigma_m$. In particular, applying this to the zero-mean random effects, $\varepsilon = \sum_{k=1}^m e_k$, in (3.2.2) yields the corresponding Spatial Random Effects Theorem:

(3.2.29)  $\varepsilon \overset{d}{\sim} N(0, \Sigma)$

with $\Sigma = \Sigma_1 + \cdots + \Sigma_m$.
It is this version of the Central Limit Theorem that will form the basis for essentially all
random-effects models in the analyses to follow.
10 For a similar (informal) statement of this general version of the Multivariate Central Limit Theorem, see Theorem 8.11 in Breiman (1969).
Given the Spatial Random Effects Theorem above, the task remaining is to specify the unknown covariance matrix, $\Sigma$, for these random effects. Since $\Sigma$ is in turn a sum of individual covariance matrices, $\Sigma_k$, for random factors $k = 1,\dots,m$, it might seem better to specify these individual covariance structures. But rather than attempt to identify such factors, our strategy will be to focus on general spatial dependencies that should be common to all these covariance structures, and hence should be exhibited by $\Sigma$. In doing so, it is also important to emphasize that such statistical dependencies often have little substantive relation to the main phenomena of interest. In terms of our basic modeling framework, $Y(s) = \mu(s) + \varepsilon(s)$, in (1.2.1) above, we are usually much more interested in the global structure of the spatial process, as represented by $\mu(s)$, than in the specific relations among unobserved residuals, $\{\varepsilon(s_i) : i = 1,\dots,n\}$, at sample locations, $\{s_i : i = 1,\dots,n\}$.
Indeed, these relations are typically regarded as "second-order" effects, in contrast to the "first-order" effects represented by $\mu(s)$. Hence it is desirable to model such second-order effects in a manner that will allow the analysis to focus on the first-order effects, while at the same time taking these unobserved dependencies into account. This general strategy can be illustrated by the following example.
Suppose that one is interested in mapping the depth of the sea floor over a given region. Typically this is done by taking echo soundings (sonar measurements) at regular intervals from a vessel traversing a system of paths over the ocean surface. This will yield a set of depth readings, $\{D_i = D(s_i) : i = 1,\dots,n\}$, such as the set of measurements shown in Figure 3.4 below:
[Figure 3.4: depth readings, $D_1, D_2,\dots,D_n$, at locations, $s_1, s_2,\dots,s_n$]
However, the ocean is not a homogeneous medium. In particular, it is well known that
such echo soundings can be influenced by the local concentration of zooplankton in the
region of each sounding. These clouds of zooplankton (illustrated in Figure 3.5 below)
create interference called “ocean volume reverberation”.
These interference patterns tend to vary from location to location, and even from day to day (much in the same way that sunlight is affected by cloud patterns).11 So actual readings are random variables of the form,

(3.3.1)  $D(s_i) = d(s_i) + \varepsilon(s_i)\,, \quad i = 1,\dots,n$

where $d(s_i)$ denotes the true depth at $s_i$, and where the residual, $\varepsilon(s_i)$, represents the measurement error at $s_i$.12
11 Actual variations in the distribution of zooplankton are more diffuse than the "clouds" depicted in Figure 3.5. Vertical movement of zooplankton in the water column is governed mainly by changes in sunlight, and horizontal movement by ocean currents.
12 In actuality, such measurement errors include many different sources, such as the reflective properties of the sea floor. Moreover, depth measurements are actually made indirectly in terms of the transmission loss, $L_i = L(s_i)$, between the signal sent and the echo received. The corresponding depth, $D_i$, is obtained from $L_i$ by a functional relation, $D_i = \phi(L_i, \theta)$, where $\theta$ is a vector of parameters that have been calibrated under "idealized" conditions. For further details, see Urick, R.J. (1983) Principles of Underwater Sound, 3rd ed., McGraw-Hill: New York, and in particular the discussion around p. 413.
13 Here it is important to note that such detailed models can be of great interest in other contexts. For example, acoustic signals are also used to estimate the volume of zooplankton available as a food source for sea creatures higher in the food chain. To do so, it is essential to relate acoustic signals to the detailed behavior of such microscopic creatures. See for example, Stanton, T.K. and D. Chu (2000) "Review and recommendations for the modeling of acoustic scattering by fluid-like elongated zooplankton: euphausiids and copepods", ICES Journal of Marine Science, 57: 793–807.
So what is needed here is a statistical model of spatial residuals that allows for local spatial dependencies, but is simple enough to be estimated explicitly. To do so, we will adopt the following basic assumptions of spatial stationarity:

(3.3.2)  Homogeneity: The marginal distributions of the residuals, $\varepsilon(s)$, are the same at every location, $s \in R$.

(3.3.3)  Isotropy: The joint distribution of each pair of residuals, $\varepsilon(s)$ and $\varepsilon(v)$, depends only on the distance, $\|s - v\|$, between their locations.
These assumptions are loosely related to the notion of “isotropic stationarity” for point
processes discussed in Section 2.5 of Part I. But here we focus on the joint distribution of
random variables at selected locations in space rather than point counts in selected
regions of space. To motivate the present assumptions in the context of our example,
observe first that while zooplankton concentrations at any point of time may differ
between locations, it can be expected that the range of possible concentration levels over
time will be quite similar at each location. More generally, the Homogeneity assumption
asserts that the marginal distributions of these concentration levels are the same at each
location. To appreciate the need for such an assumption, observe first that while it is in
principle possible to take many depth measurements at each location and employ these
samples to estimate location-specific distributions of each random variable, this is
generally very costly (or even infeasible). Moreover, the same is true of most spatial data
sets, such as the set of total rainfall levels or peak daily temperatures reported by regional
weather stations on a given day. So in terms of the present example, one typically has a single set of depth measurements, $[D(s_i) : i = 1,\dots,n]$, and hence only a single joint realization of the set of unobserved residuals, $[\varepsilon(s_i) : i = 1,\dots,n]$. Thus, without further assumptions, it is impossible to say anything statistically about these residuals. From this viewpoint, the fundamental role of the Homogeneity assumption is to allow the joint realizations, $[\varepsilon(s_i) : i = 1,\dots,n]$, to be treated as multiple samples from a common population that can be used to estimate parameters of this population.
The Isotropy assumption is very similar in spirit. But here the focus is on statistical dependencies between distinct random variables, $\varepsilon(s_i)$ and $\varepsilon(s_j)$. For even if their marginal distributions are known, one cannot hope to say anything further about their joint distribution on the basis of a single sample. But in the present example it is reasonable to assume that if a given cloud of zooplankton (in Figure 3.5) covers location, $s_i$, then it is very likely to cover locations, $s_j$, which are sufficiently close to $s_i$. Similarly, for locations that are very far apart, it is reasonable to suppose that clouds covering $s_i$ have little to do with those covering $s_j$. Hence the Isotropy assumption asserts more generally that similarities between concentration levels at different locations depend only on the distance between them. The practical implication of this assumption is that all pairs of locations separated by the same distance can be treated as samples from a common joint distribution.
But before proceeding, it should also be emphasized that while these assumptions are conceptually appealing and analytically useful, they may of course be wrong. For example, it can be argued in the present illustration that locations in shallow depths (Figure 3.5) will tend to experience lower concentration levels than locations in deeper waters. If so, then the Homogeneity assumption will fail to hold. Hence more complex models involving "nonhomogeneous" residuals may be required in some cases.14 As a second example, suppose that the spatial movement of zooplankton is known to be largely governed by prevailing ocean currents, so that clouds of zooplankton tend to be more elongated in the direction of the current. If so, then spatial dependencies will depend on direction as well as distance, and the Isotropy assumption will fail to hold. Such cases may require more complex "anisotropic" models of spatial dependencies.15
In many cases the assumptions above are stronger than necessary. In particular, recall from the Spatial Random Effects Theorem (together with the introductory discussion in Section 3.3) that such random effects are already postulated to be multi-normally distributed with zero means. So all that is required for our purposes is that these homogeneity and isotropy assumptions be reflected by the matrix, $\Sigma$, of covariances among these random effects.
To do so, it will be convenient for our later purposes to formulate such covariance properties in terms of more general spatial stochastic processes. A spatial stochastic process, $\{Y(s) : s \in R\}$, is said to be covariance stationary if and only if the following two conditions hold for all $s_1, s_2, v_1, v_2 \in R$:

(3.3.4)  $E[Y(s_1)] = E[Y(s_2)]$

(3.3.5)  $\|s_1 - v_1\| = \|s_2 - v_2\| \ \Rightarrow\ \mathrm{cov}[Y(s_1), Y(v_1)] = \mathrm{cov}[Y(s_2), Y(v_2)]$
These conditions can be stated more compactly by observing that (3.3.4) implies the existence of a common mean value, $\mu$, for all random variables. Moreover, (3.3.5)
14 For example, it might be postulated that the variance of $\varepsilon(s)$ depends on the unknown true depth, $d(s)$, at each location, $s$. Such nonstationary formulations are complex, and beyond the scope of these notes.
15 Such models are discussed for example by Waller and Gotway (2004, Section 2.8.5).
implies that covariance depends only on distance, so that for each distance, $h$, and pair of locations $s, v \in R$ with $\|s - v\| = h$, there exists a common covariance value, $C(h)$, such that $\mathrm{cov}[Y(s), Y(v)] = C(h)$. Hence, process $\{Y(s) : s \in R\}$ is covariance stationary if and only if (iff) the following two conditions hold for all $s, v \in R$,

(3.3.6)  $E[Y(s)] = \mu$

(3.3.7)  $\|s - v\| = h \ \Rightarrow\ \mathrm{cov}[Y(s), Y(v)] = C(h)$

Note in particular from (3.3.7) that since $\mathrm{var}[Y(s)] = \mathrm{cov}[Y(s), Y(s)]$ by definition, and since $\|s - s\| = 0$, it follows that these random variables must also have a common variance, $\sigma^2$, given by

(3.3.8)  $\sigma^2 = \mathrm{var}[Y(s)] = C(0)$

Finally, if such a process is written in terms of its common mean and residuals as

(3.3.9)  $Y(s) = \mu + \varepsilon(s)$

then these conditions are equivalent to requiring that the residual process, $\{\varepsilon(s) : s \in R\}$, satisfy

(3.3.10)  $E[\varepsilon(s)] = 0$

(3.3.11)  $\|s - v\| = h \ \Rightarrow\ \mathrm{cov}[\varepsilon(s), \varepsilon(v)] = C(h)$

These are the appropriate covariance stationarity conditions for residuals that correspond to the stronger Homogeneity (3.3.2) and Isotropy (3.3.3) conditions in Section 3.3.1 above.
Note finally that even these assumptions are too strong in many contexts. For example (as mentioned above), it is often convenient to relax the isotropy condition implicit in (3.3.7) and (3.3.11) to allow directional variations in covariances. This can be done by requiring that covariances depend only on the difference between locations, i.e., that for all $h = (h_1, h_2)$, $s - v = h \Rightarrow \mathrm{cov}[Y(s), Y(v)] = C(h)$. This weaker stationarity condition is often called intrinsic stationarity. See for example [BG] (p. 162), Cressie (1993, Sections
2.2.1 and 2.3) and Waller and Gotway (2004, p.273). However, we shall treat only the
isotropic case [(3.3.7),(3.3.11)], and shall use these assumptions throughout.
Note that since the above covariance values, C (h) , are unique for each distance value, h ,
in region R , they define a function, C , of these distances which is designated as the
covariogram for the given covariance stationary process.16 But as with all random
variables, the values of this covariogram are only meaningful with respect to the
particular units in which the variables are measured. Moreover, unlike mean values, the
values of the covariogram are actually in squared units, which are difficult to interpret in
any case. Hence it is often more convenient to analyze dependencies between random
variables in terms of (dimensionless) correlation coefficients. For any stationary process, $\{Y(s) : s \in R\}$, the (product moment) correlation between any $Y(s)$ and $Y(v)$ with $\|s - v\| = h$ is given by the ratio:

(3.3.12)  $\rho[Y(s), Y(v)] = \dfrac{\mathrm{cov}[Y(s), Y(v)]}{\sqrt{\mathrm{var}[Y(s)]}\sqrt{\mathrm{var}[Y(v)]}} = \dfrac{C(h)}{C(0)}$

which is simply a normalized version of the covariogram. Hence the correlations at every distance, $h$, for a covariance stationary process are summarized by a function called the correlogram for the process:

(3.3.13)  $\rho(h) = \dfrac{C(h)}{C(0)}\,, \quad h \in h(R)$
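As a simple illustration of (3.3.13), the following MATLAB fragment (a sketch using a hypothetical exponentially decaying covariogram) plots a covariogram together with its correlogram:

    C = @(h) 2.0 * exp(-h);          % a hypothetical covariogram with sigma^2 = C(0) = 2
    rho = @(h) C(h) ./ C(0);         % correlogram (3.3.13): the normalized covariogram
    h = linspace(0, 5, 100);
    plot(h, C(h), 'b-', h, rho(h), 'r--')
    legend('C(h)', '\rho(h)');  xlabel('distance h')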
16 To be more precise, if the set of all distances associated with pairs of locations in region $R$ is denoted by $h(R) = \{h : \|s - v\| = h \text{ for some } s, v \in R\}$, then the covariogram, $C$, is a numerical function on $h(R)$. Note also that for the weaker form of intrinsic stationarity discussed above, the covariogram depends on the differences in both coordinates, $h = (h_1, h_2)$, and hence is a two-dimensional function in this case.
4. Variograms
The covariogram and its normalized form, the correlogram, are by far the most intuitive
methods for summarizing the structure of spatial dependencies in a covariance stationary
process. However, from an estimation viewpoint such functions present certain
difficulties (as will be discussed further in Section 4.10 below). Hence it is convenient to
introduce a closely related function known as the variogram, which is widely used for
estimation purposes.
To relate such squared differences to the covariogram, observe first that for any locations, $s, v \in R$, with $\|s - v\| = h$, the expected squared difference between $Y_s = Y(s)$ and $Y_v = Y(v)$ can be expanded as

(4.1.1)  $E[(Y_s - Y_v)^2] = E(Y_s^2) - 2E(Y_s Y_v) + E(Y_v^2)$

Moreover, covariance stationarity implies both

(4.1.2)  $E(Y_s^2) = \sigma^2 + \mu^2 = E(Y_v^2)$

(4.1.3)  $E(Y_s Y_v) = C(h) + \mu^2$

Hence by substituting (4.1.2) and (4.1.3) into (4.1.1) we see that expected squared differences for all $s, v \in R$ with $\|s - v\| = h$ can be expressed entirely in terms of the covariogram, $C$, as

(4.1.4)  $E[(Y_s - Y_v)^2] = 2(\sigma^2 + \mu^2) - 2[C(h) + \mu^2] = 2[\sigma^2 - C(h)]$
One half of this expected squared difference defines the value

(4.1.5)  $\gamma(h) = \tfrac{1}{2}\, E[(Y_s - Y_v)^2]$

and we observe from (4.1.4) that with this definition we obtain the following simple identity for all distances, $h$:

(4.1.6)  $\gamma(h) = \sigma^2 - C(h)$
From (4.1.6) it is thus evident that the "scaled" expected squared differences in (4.1.5) define a unique function of distance which is intimately related to the covariogram. For any given covariance stationary process, this function is designated as the variogram, $\gamma$, of the process. Moreover, it is also evident that this variogram is uniquely constructible
from the covariogram. But the converse is not true. In particular, since (4.1.6) also implies that

(4.1.7)  $C(h) = \sigma^2 - \gamma(h)$

it is clear that in addition to the variogram, $\gamma$, one must also know the variance, $\sigma^2$, in order to construct the covariogram.1 Hence this variance will become an important parameter to be estimated in all models of variograms developed below.
Before proceeding further with our analysis of variograms, it is important to stress that the above terminology is not completely standard. In particular, the expected squared difference function in (4.1.4) is often designated as the "variogram" of the process, and its scaled version in (4.1.5) is called the "semivariogram" [as for example in Cressie (1993, pp. 58-59) and Waller and Gotway (2004, p. 274)]. (This same convention is used in the Geostatistical Analyst extension in ARCMAP.) But since the scaled version in (4.1.5) is the only form used in practice [because of the simple identity in (4.1.7)], it seems most natural to use the simple term "variogram" for this function, as for example in [BG, p. 162].2
To illustrate the relation in (4.1.7) it is most convenient to begin with the simplest and
most commonly employed model of spatial dependence. Recall from the Ocean Depth
Example in Section 3.3.1 above, that the basic hypothesis there was that nearby locations
tend to experience similar concentration levels of plankton, while those in more widely
separated locations have little to do with each other. This can be formalized most easily
in terms of correlograms by simply postulating that correlations are high (close to unity)
for small distances, and fall monotonely to zero as distance increases. This same general
hypothesis applies to a wide range of spatial phenomena, and shall be referred to here as
the standard model of spatial dependence. Given the relation between correlograms and
covariograms in (3.3.13), it follows at once that covariograms for the standard model, i.e.,
standard covariograms, must fall monotonely from C (0) 2 toward zero, as illustrated
1 However, assuming that $\lim_{h\to\infty} C(h) = 0$, it follows from (4.1.6) that $\lim_{h\to\infty} \gamma(h) = \sigma^2$. So $\sigma^2$ is in principle recoverable from the variogram in this case.
in Figure 4.1 below. The right end of this curve has intentionally been left rather vague. It
may reach zero at some point, in which case covariances will be exactly zero at all
greater distances. On the other hand, this curve may approach zero only asymptotically,
so that covariance is positive at all distances but becomes arbitrarily small. Both cases are
considered to be possible under the standard model (as will be illustrated in Section 4.6
below by the “spherical” and “exponential” variogram models).
[Figure 4.1. Standard covariogram: $C(h)$ falling monotonely from $\sigma^2$ toward zero. Figure 4.2. Standard variogram: $\gamma(h)$ rising monotonely from zero toward the sill at $\sigma^2$]
On the right in Figure 4.2 is the associated standard variogram, which by (4.1.6) above must necessarily start at zero and rise monotonely toward the value $\sigma^2$. Graphically this implies that the standard variogram must either reach the dashed line in Figure 4.2, designated as the sill, or must approach this sill asymptotically.3
But while this mathematical correspondence between the standard variogram and
covariogram is quite simple, there are subtle differences in their interpretation. The
interpretation of standard covariograms is straightforward, since decreases in (positive)
covariance at large distances are naturally associated with decreases in spatial
dependence. But the associated increase in the standard variogram is somewhat more
difficult to interpret in a simple way. If we recall from (4.1.5) that these variogram values
are proportional to expected squared differences, then is reasonable to conclude that such
differences should increase as variables become less similar (i.e., less positively
dependent). But as a general rule, it would still appear that the simplest approach to
interpreting variogram behavior is to describe this behavior in terms of the corresponding
covariogram.
Since the analysis to follow will focus almost entirely on the standard model, it is of
interest to consider one example of a naturally occurring stationary process that exhibits
non-standard behavior. As a more micro version of the Ocean Depth Example in Section
3.3.1 above, suppose that one is interested in measuring variations in ocean depth due to
wave action on the surface. Figure 4.3 below depicts an idealized measurement scheme
3 As noted by [BG, p. 162], the scaling by ½ in (4.1.5) is precisely to yield a "sill" which is associated with $\sigma^2$ rather than $2\sigma^2$.
involving a set of (yellow) corks at locations, $\{s_i : i = 1,\dots,n\}$, that are attached to vertical measuring rods, allowing them to bob up and down in the waves. The set of cork heights, $H_i = H(s_i)$, on these $n$ rods at any point of time can be treated as a sample of size $n$ from a spatial stochastic process, $\{H(s) : s \in R\}$, of wave heights defined with respect to some given ocean region, $R$.
[Figure 4.3. Measurement of Wave Heights: corks at locations $s_1, s_2,\dots,s_n$ record heights $H_1,\dots,H_n$ relative to water level, with $d$ denoting the spacing between wave crests]
Here the fluctuation behavior of corks should be essentially the same over time at each
location. Moreover, any dependencies among cork heights due to the smoothness of wave
actions should depend only on the spacing between their positions in Figure 4.3. Hence
the homogeneity and isotropy assumptions of spatial stationarity in Section 3.3.1 should
apply here as well, so that in particular, {H ( s ) : s R} can be treated as a covariance
stationary process.
But this process has additional structure implied by the natural spacing of waves. If this
spacing is denoted by d , then it is clear that for corks separated by distance d , such as
those at locations $s_2$ and $s_6$ in Figure 4.3, whenever a wave crest (or trough) occurs at one location it will tend to occur at the other as well. Hence pairs of locations separated by a distance $d$ should exhibit a positive correlation in wave heights, as shown in the
covariogram of Figure 4.4 below. However, for locations spaced at around half this
distance, such as s2 and s4 in Figure 4.3, the opposite should be true: whenever a crest
(or trough) occurs at one location, a wave trough (or crest) will tend to occur at the other.
Hence the wave heights at such locations can be expected to exhibit negative correlation,
as is also illustrated by the covariogram in Figure 4.4.
Finally, it should be clear that distances between wave crests are themselves subject to some random variation (so that distance $d$ in Figure 4.3 should be regarded as the expected distance between wave crests). Thus, in a manner similar to the standard model, one can expect that wave heights at distant locations will be statistically unrelated. This in turn implies that the positive and negative correlation effects above will gradually
dampen as distance increases. Hence this process should be well represented by the
“damped sine wave” covariogram shown in Figure 4.4.4
[Figure 4.4. Damped sine wave covariogram: positive correlation near the wave spacing, $d$, negative correlation near $d/2$, and dampening toward zero as distance increases]
Finally, the associated variogram for this process [as defined by (4.1.6)] is illustrated in Figure 4.5 for sake of comparison. If the variance, $\sigma^2$, in Figure 4.4 is again taken to define the appropriate sill for this variogram (as shown by the horizontal dashed line in Figure 4.5), then it is clear that the values of this variogram now oscillate around the sill rather than approaching it monotonely. Hence this sill is only meaningful at larger distances, where wave heights no longer exhibit any significant correlation.
[Figure 4.5. Wave variogram, $\gamma(h)$, oscillating around the sill at $\sigma^2$, shown together with the covariogram, $C(h)$]
4 A mathematical model of this type of covariogram is given in expression (4.6.9) below.
Such processes are of course only mathematical idealizations, since literally all physical processes must exhibit some degree of smoothness (even at small scales). But if independence holds at least approximately at sufficiently small scales, then this idealization may be reasonable. For example, if one considers a sandy desert region, $R$, and lets $D(s)$ denote the depth of sand at any location, $s \in R$, then this might well constitute a smooth covariance stationary process, $\{D(s) : s \in R\}$, which is quite consistent with the standard model of Section 4.2 (or perhaps even the "wave model" of Section 4.3 if wind effects tend to ripple the sand). But in contrast to this, suppose that one considers an alternative process, $\{W(s) : s \in R\}$, in which $W(s)$ now denotes the weight of the topmost grain of sand at location $s$ (or perhaps the diameter or quartz content of this grain). Then while it is reasonable to suppose that the distribution of these weights is the same at each location $s$ (and is thus a homogeneous process as in Section 3.3.1 above), there need be little relation whatsoever between the specific weights of adjacent grains of sand. So at this scale, the process $\{W(s) : s \in R\}$ is well modeled by pure spatial independence.
The standard model in Section 4.2 and the model of pure spatial independence in Section 4.4 can be viewed as two extremes: one with continuous positive dependence gradually falling to zero, and the other with zero dependence at all positive distances. However, many actual processes are well represented by a mixture of the two. This can be illustrated by a further refinement of the Ocean Depth Example in Section 3.3.1. Observe that while mobile organisms like zooplankton have some ability to cluster in response to various stimuli, the ocean also contains a host of inert debris (dust particles from the atmosphere, skeletal remains of organisms, etc.) which bear little relation to each other. Hence in addition to the spatially correlated errors in sonar depth measurements created by zooplankton, there is a general level of "background noise" created by debris particles that is best described in terms of spatially independent errors.
This suggests modeling the total error as the sum of these two components,5

(4.5.1)  $\varepsilon(s) = \varepsilon_1(s) + \varepsilon_2(s)\,, \quad s \in R$

where $\varepsilon_1(s)$ denotes the spatially dependent (zooplankton) component, and $\varepsilon_2(s)$ the spatially independent (background noise) component.
Moreover, it is also reasonable to assume that these error components are independent (i.e., that the distribution of zooplankton is not influenced by the presence or absence of debris particles). More formally, it may be assumed that $\varepsilon_1(s)$ and $\varepsilon_2(v)$ are independent random variables for every pair of locations, $s, v \in R$. With this assumption it then
follows (see Section A2.1 in Appendix A2) that the covariogram, $C$, of the error process, $\varepsilon$, must be the sum of the separate covariograms, $C_1$ and $C_2$, for the component processes, $\varepsilon_1$ and $\varepsilon_2$, i.e., that for any $h \geq 0$,

(4.5.2)  $C(h) = C_1(h) + C_2(h)$
To see the graphical form of this combined model, observe first that by setting $h = 0$ in (4.5.2) it also follows that

(4.5.3)  $\sigma^2 = C(0) = C_1(0) + C_2(0) = \sigma_1^2 + \sigma_2^2$

where $\sigma_1^2$ and $\sigma_2^2$ are the corresponding variances for the spatially dependent and independent components, respectively. Hence the covariogram for the combined process in (4.5.2) is given by Figure 4.7 below:
[Figure 4.7. Combined covariogram: $C_1$ (variance $\sigma_1^2$) plus $C_2$ (variance $\sigma_2^2$) yields $C$, with total variance $\sigma^2$ and a discontinuity of size $\sigma_2^2$ at the origin (the nugget effect)]
In this graphical form it is clear that the covariogram for the combined model is essentially the same as that of the standard model, except that there is now a discontinuity at the origin. This local discontinuity is called the nugget effect in the combined model,6 and the magnitude of this effect (which is simply the variance, $\sigma_2^2$, of the pure independent component) is called the nugget. Note that by definition the ratio, $\sigma_2^2/\sigma^2$,
5 This combined model is an instance of the more general decomposition in Cressie (1993, pp. 112-113), with $C_1$ reflecting the "smooth" component, $W$, and $C_2$ reflecting the "noise" component.
6 This term originally arose in mining applications where there are often microscale variations in ore deposits due to the presence of occasional nuggets of ore [as discussed in more detail by Cressie (1993, p. 59)]. In the present context, such a "nugget effect" would be modeled as an independent micro component of a larger (covariance stationary) process describing ore deposits.
gives the relative magnitude of this effect, and is designated as the relative nugget effect.
For example, if the relative nugget effect for a given covariogram is say .75, then this
would indicate that the underlying process exhibits relatively little spatial dependence.
Next we consider the associated variogram for the combined model. If $\gamma$ denotes the variogram of the combined process in (4.5.1), then we see from (4.1.6) together with (4.5.2) and (4.5.3) that

$\gamma(h) = \sigma^2 - C(h) = (\sigma_1^2 + \sigma_2^2) - [C_1(h) + C_2(h)] = [\sigma_1^2 - C_1(h)] + [\sigma_2^2 - C_2(h)] = \gamma_1(h) + \gamma_2(h)$

where $\gamma_1$ and $\gamma_2$ are the variograms for the spatially dependent and independent components, respectively. Hence it follows that variograms add as well, and yield a corresponding combined variogram as shown in Figure 4.8 below:
[Figure 4.8. Summary of the Combined Model: the variogram, $\gamma(h)$, rises from the nugget, $\sigma_2^2$, toward the sill at $\sigma^2$, while the covariogram, $C(h)$, falls from $\sigma_1^2$ toward zero]
While the combined model above provides a useful conceptual framework for variograms and covariograms, it is not sufficiently explicit to be estimated statistically. We require explicit mathematical models that are (i) qualitatively consistent with the combined model, and (ii) specified in terms of a small number of parameters that can be estimated.7
7 There is an additional technical requirement that covariograms yield well-defined covariance matrices, as detailed further in the Appendix to Part III (Corollary 2, p. A3-70).
The simplest and most widely used variogram model is the spherical variogram, defined for all $h \geq 0$ by:8

(4.6.1)  $\gamma(h; r, s, a) = \begin{cases} 0\,, & h = 0\\[4pt] a + (s - a)\left(\dfrac{3h}{2r} - \dfrac{h^3}{2r^3}\right), & 0 < h \leq r\\[4pt] s\,, & h > r \end{cases}$
[Figure 4.9. Spherical variogram, rising from the nugget, $a$, to the sill, $s$, at range $r$. Figure 4.10. Corresponding spherical covariogram, falling from $s - a$ to zero at range $r$]
A comparison of Figure 4.9 with the right hand side of Figure 4.8 shows that parameter, $s$, corresponds to the sill of the variogram, and parameter, $a$, corresponds to the nugget [as can also be seen by letting $h$ approach zero in (4.6.1)]. So for this particular example the relative nugget effect is $a/s = 1/4$. Note finally that since the spherical variogram reaches the sill at value, $r$ [as can also be seen by setting $h = r$ in (4.6.1)], this implies that the corresponding covariogram in Figure 4.10 falls to zero at $r$. Hence the parameter, $r$, denotes the maximum range of positive spatial dependencies, and is
8 More generally, the expression, $f(x_1,\dots,x_n; \theta_1,\dots,\theta_k)$, is taken to denote a function, $f$, with arguments, $(x_1,\dots,x_n)$, and parameters, $(\theta_1,\dots,\theta_k)$.
designated simply as the range of the variogram (and corresponding covariogram). These
same notational conventions for range, sill and nugget will be used throughout.9
By (4.1.7), the corresponding spherical covariogram is given by:

(4.6.2)  $C(h; r, s, a) = \begin{cases} s\,, & h = 0\\[4pt] (s - a)\left(1 - \dfrac{3h}{2r} + \dfrac{h^3}{2r^3}\right), & 0 < h \leq r\\[4pt] 0\,, & h > r \end{cases}$
Together, (4.6.1) and (4.6.2) will be called the spherical model. One can gain further insight into the nature of this model by differentiating (4.6.2) in the interval, $0 < h < r$, to obtain:

(4.6.3)  $\dfrac{dC}{dh} = -(s - a)\left(\dfrac{3}{2r} - \dfrac{3h^2}{2r^3}\right) = -\dfrac{3(s - a)}{2r}\left(1 - \dfrac{h^2}{r^2}\right) < 0\,, \quad 0 < h < r$

(4.6.4)  $\dfrac{d^2C}{dh^2} = (s - a)\,\dfrac{3h}{r^3} > 0$

whenever the sill is greater than the nugget (i.e., $s - a > 0$). Thus, except for the extreme case of pure independence, this function is always "bowl shaped" on the interval $0 < h \leq r$, and has a unique differentiable minimum at $h = r$. Hence this spherical covariogram yields a combined-model form with finite range that falls smoothly to zero. These properties (together with its mathematical simplicity) account for the popularity of the spherical model.
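The spherical model is also easily evaluated numerically. The following MATLAB function (a minimal sketch; it is not the class program variogram.m, and the function name is illustrative) implements (4.6.1):

    function gam = spherical_model(h, r, s, a)
    % SPHERICAL_MODEL  Spherical variogram (4.6.1) at distances h
    %   r = range, s = sill, a = nugget (with s > a >= 0 assumed)
    gam = s * ones(size(h));                    % value s for all h > r
    in = (h > 0) & (h <= r);                    % distances within the range
    gam(in) = a + (s-a) * (3*h(in)/(2*r) - h(in).^3/(2*r^3));
    gam(h == 0) = 0;                            % gamma(0) = 0 by definition
    % the corresponding covariogram (4.6.2) is C = s - gam, by (4.1.7)
    end

For example, plotting spherical_model(0:0.1:8, 6, 4, 1) yields a curve of the same general shape as Figure 4.9, with relative nugget effect a/s = 1/4.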
All explicit variogram applications in these notes will employ this spherical model.
However, it is of interest at this point to consider one alternative model which is also in
wide use.
9 Note that the use of "s" to denote the sill should not be confused with the use of "$s = (s_1, s_2)$" to denote spatial locations. Also, since the symbol, $n$, is used to denote sample size, we choose to denote the nugget by "a" rather than "n".
While the spherical model is smooth in the sense of continuous differentiability, it makes
the implicit assumption that correlations are exactly zero at all sufficiently large
distances. But in some cases it may be more appropriate to assume that while correlations
may become arbitrarily small at large distances, they never vanish. The simplest model
with this property is the exponential variogram, defined for all $h \geq 0$ by,

(4.6.5)  $\gamma(h; r, s, a) = \begin{cases} 0\,, & h = 0\\[4pt] a + (s - a)\left(1 - e^{-3h/r}\right), & h > 0 \end{cases}$

with corresponding exponential covariogram,

(4.6.6)  $C(h; r, s, a) = \begin{cases} s\,, & h = 0\\[4pt] (s - a)\, e^{-3h/r}\,, & h > 0 \end{cases}$
[Figure 4.11. Exponential variogram, rising from the nugget, $a$, toward the sill, $s$, with partial sill $s - a$. Figure 4.12. Exponential covariogram, falling from the maximal covariance, $s - a$, toward zero]
Here it is clear that the sill, s , and nugget, a , play the same role as in the spherical
model. However, the “range” parameter, r , is more difficult to interpret in this case since
spatial dependencies never fall to zero. To motivate the interpretation of this parameter,
observe first that since spatial dependencies are only meaningful at positive distances, it is natural to regard the quantity, $s - a$, in Figure 4.12 as the maximal covariance for the underlying process.10 In these terms, the practical range of spatial dependency is typically defined to be the smallest distance, $r$, beyond which covariances are no more than 5% of the maximal covariance. To see that $r$ in (4.6.6) is indeed the practical range for this covariogram, observe simply that since $e^{-x} \leq .05 \Leftrightarrow x \geq -\ln(.05) = 2.9957 \approx 3$, it follows that

(4.6.7)  $h \geq r \ \Rightarrow\ C(h; r, s, a) = (s - a)\, e^{-3h/r} \leq (s - a)\, e^{-3} \approx .05\,(s - a)$
Note finally that in terms of the corresponding variogram (which plays the primary role in statistical estimation of the exponential model), the quantity, $s - a$, in Figure 4.11 is usually called the partial sill.11
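A corresponding MATLAB sketch of the exponential variogram (4.6.5) (again with an illustrative function name) is:

    function gam = exponential_model(h, r, s, a)
    % EXPONENTIAL_MODEL  Exponential variogram (4.6.5) at distances h
    %   r = practical range: the covariogram falls to 5% of (s - a) at h = r
    gam = a + (s-a) * (1 - exp(-3*h/r));        % value for all h > 0
    gam(h == 0) = 0;                            % gamma(0) = 0 by definition
    % the corresponding covariogram (4.6.6) is C = s - gam = (s-a)*exp(-3h/r), by (4.1.7)
    end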
Finally, the wave (or hole-effect) model12 of Section 4.3 above allows covariances to oscillate between positive and negative values, with variogram defined for all $h \geq 0$ by,

(4.6.8)  $\gamma(h; w, s, a) = \begin{cases} 0\,, & h = 0\\[4pt] a + (s - a)\left(1 - w\,\dfrac{\sin(h/w)}{h}\right), & h > 0 \end{cases}$

where the parameter, $w$, denotes the wave intensity. Here the corresponding covariogram is given by:

(4.6.9)  $C(h; w, s, a) = \begin{cases} s\,, & h = 0\\[4pt] (s - a)\, w\,\dfrac{\sin(h/w)}{h}\,, & h > 0 \end{cases}$
The wave covariogram and variogram shown in Figures 4.4 and 4.5 above are in fact instances of this wave model with $(w = 0.6,\ a = 0,\ s = 0.6)$.
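For instance, a quick MATLAB sketch reproducing the general shape of these figures under the stated parameter values is:

    w = 0.6;  a = 0;  s = 0.6;               % parameter values cited above
    h = linspace(0.01, 5, 500);              % positive distances (h = 0 treated separately)
    C = (s-a) * w * sin(h/w) ./ h;           % wave covariogram (4.6.9)
    gam = s - C;                             % associated variogram, by (4.1.7)
    plot(h, C, 'b-', h, gam, 'r-')
    legend('C(h)', '\gamma(h)')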
10 More generally, this maximal covariance for any combined model in Figure 4.7 is seen to be given by the variance, $\sigma_1^2$, of the (continuous) spatially dependent component.
11 Indeed this quantity plays such a central role that variograms are often defined with the partial sill as an explicit parameter rather than the sill itself. See for example the spherical and exponential (semi)variogram models in Cressie (1993, p. 61). See also the Geostatistical Analyst example in Section 4.9.2 below.
12 This is also referred to as the hole-effect model [as in Cressie (1993, p. 623)], and in particular, is given this designation in the Geostatistical Analyst kriging option of ARCMAP.
There are many approaches to fitting variogram models to spatial data sets, as discussed at length in Cressie (1993, Section 2.4) and Schabenberger and Gotway (2005, Sections 4.4-4.6). Here we consider only the standard two-stage approach most commonly used in practice (as for example in Geostatistical Analyst). The basic idea of this approach is to begin by constructing a direct model-independent estimate of the variogram called the "empirical variogram". This empirical variogram is then used as intermediate data to fit specific variogram models. We consider each of these steps in turn.
An examination of (4.1.5) suggests that for any given set of spatial data, $\{y(s_i) : i = 1,\dots,n\}$, and distance, $h$, there is an obvious estimator of the variogram value, $\gamma(h)$, namely "half the average value of $[y(s_i) - y(s_j)]^2$ for all pairs of locations $s_i$ and $s_j$ separated by distance $h$". However, one problem with this estimator is that (unlike K-functions) the value $\gamma(h)$ refers to point pairs with distance $\|s_i - s_j\|$ exactly equal to $h$. Since in any finite sample there will generally be at most one pair that is separated by a given distance, $h$ (except for data points on regular grids, as discussed below), one must necessarily aggregate point pairs, $(s_i, s_j)$, with similar distances, and hence estimate $\gamma(h)$ at only a small number of representative distances for each aggregate. The simplest way to do so is to partition distances into intervals, called bins, and take the average distance, $h_k$, in each bin, $k$, to be the appropriate representative distance, called the lag distance, as shown in Figure 4.13 below:
[Figure 4.13. Lag distances, $h_1, h_2, h_3, h_4,\dots$, along the distance axis, up to the max-lag, $\bar{h}$]
More formally, if $N_k$ denotes the set of distance pairs, $(s_i, s_j)$, in bin $k$ [with the size (number of pairs) in $N_k$ denoted by $|N_k|$], and if the distance between each such pair is denoted by $h_{ij} = \|s_i - s_j\|$, then the lag distance, $h_k$, for bin $k$ is defined to be
(4.7.1)  $h_k = \dfrac{1}{|N_k|} \sum_{(s_i, s_j) \in N_k} h_{ij}$
To determine the size of each bin, the most common approach is to make all bins the same size, in order to ensure a uniform approximation of lag distances within each bin. However, there is an implicit tradeoff here between approximation of lag distances and the number of point pairs used to estimate the variogram at each lag distance. Here the standard rule of thumb is that each bin should contain at least 30 point pairs,13 i.e., that

(4.7.2)  $|N_k| \geq 30$
Next observe that the choice of the maximum lag distance (max-lag), $\bar{h}$ (in Figure 4.13), also involves some implicit restrictions. First, for any given set of sample points, $\{s_i : i = 1,\dots,n\} \subset R$, one cannot consider lag distances greater than the maximum pairwise distance,

(4.7.3)  $h_{\max} = \max\{\|s_i - s_j\| : i < j \leq n\}$
in this sample, since no observations are available. Moreover, practical experience has shown that even for lag distances close to $h_{\max}$ the resulting variogram estimates tend to be unstable [Cressie (1985, p. 575)]. Hence, in a manner completely analogous to the rule of thumb for K-functions [expression (4.5.1) of Part I], it is common practice to restrict $\bar{h}$ to be no greater than half of $h_{\max}$, i.e.,

(4.7.4)  $\bar{h} \leq \dfrac{h_{\max}}{2}$
Hence our basic rule for constructing bins is to choose a system of bins, $\{N_k : k = 1,\dots,\bar{k}\}$, of uniform size, such that the max-lag, $\bar{h} = h_{\bar{k}}$, is as large as possible subject to (4.7.3) and (4.7.4). More formally, if the biggest distance in each bin, $k$, is denoted by $d_k = \max_{(s_i, s_j) \in N_k} h_{ij}$, then our procedure (in the MATLAB program variogram.m discussed below) is to choose a maximum bin number, $\bar{k}$, and maximum distance (max-dist), $\bar{d}$, such that14

(4.7.5)  $|N_1| = \cdots = |N_{\bar{k}}| \geq 30$

(4.7.6)  $\bar{h} = h_{\bar{k}} \leq d_{\bar{k}} \leq \bar{d}$
13 Notice that this rule of thumb is reminiscent of that for the Central Limit Theorem used in the Clark-Evans test of Section 3.2.2 in Part I (and in Section 3.1.3 above). Note also that some authors recommend there be at least 50 pairs in each bin [as for example in Schabenberger and Gotway (2005, p. 153)].
14 This is essentially a variation on the "practical rule" suggested by Cressie (1985, p. 575).
(Here the default value of $\bar{d}$ is $h_{\max}/2$ and the default value of $\bar{k}$ is 100 bins.) With these rules for constructing bins and associated lag distances, it then follows from (4.1.5) that for any given set of sample points, $\{s_i : i = 1,\dots,n\} \subset R$, with associated data, $\{y(s_i) : i = 1,\dots,n\}$, an appropriate estimate of the variogram value, $\gamma(h_k)$, at each lag distance, $h_k \leq \bar{h}$, is given by half the average squared difference, $[y(s_i) - y(s_j)]^2$, over all point pairs in bin $k$:

(4.7.7)  $\hat{\gamma}(h_k) = \dfrac{1}{2|N_k|} \sum_{(s_i, s_j) \in N_k} [y(s_i) - y(s_j)]^2$
This set of estimates at each lag distance is designated as the empirical variogram.15 More formally, if for any given set of (ordered) lag distances, $\{h_k : k = 1,\dots,\bar{k}\}$, the associated variogram estimates in (4.7.7) are denoted simply by $\hat{\gamma}_k = \hat{\gamma}(h_k)$, then the empirical variogram is given by the set of pairs, $\{(h_k, \hat{\gamma}_k) : k = 1,\dots,\bar{k}\}$. A schematic example of this empirical variogram construction is given in Figure 4.14 below:
[Figure 4.14. Construction of the empirical variogram: squared differences, $[y(s_i) - y(s_j)]^2$, plotted against pairwise distances, $h_{ij}$, with bin averages, $(h_k, \hat{\gamma}_k)$ and $(h_{k+1}, \hat{\gamma}_{k+1})$, marked]
Here the blue dots correspond to squared-difference values, $[y(s_i) - y(s_j)]^2$, plotted against distances, $h_{ij} = \|s_i - s_j\|$, for each point pair, $(s_i, s_j)$ [as illustrated for one point in the lower left corner of the figure]. The vertical lines separate the bins, as shown for bins
15 The empirical variogram is also known as Matheron's estimator, in honor of its originator [Schabenberger and Gotway (2005, Section 4.4.1)].
$k$ and $k+1$. So in bin $k$, for example, there is one blue dot for every point pair, $(s_i, s_j) \in N_k$. The red dot in the middle of these points denotes the pair of average values, $(h_k, \hat{\gamma}_k)$, representing all points in that bin. Hence the empirical variogram consists of all these average points, one for each bin of points. [Schematics of such empirical variograms are shown (as blue dots) in Figure 4.15 below. An actual example of an empirical variogram is shown in Figure 4.19 below.]
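To make this construction concrete, the following MATLAB sketch computes an empirical variogram with equal pair counts per bin, in the spirit of (4.7.4) through (4.7.7). This is only an illustrative sketch: the function name, emp_variogram, is hypothetical, and the class program variogram.m may differ in its details.

    % Sketch of empirical-variogram construction with equal-count bins
    % [hypothetical function name; the class program variogram.m may differ]
    function [hk, gk] = emp_variogram(s, y, kbar)
    % s = (n x 2) matrix of sample locations, y = (n x 1) data vector
    n = size(s,1);
    [I,J] = find(triu(ones(n),1));             % all distinct point pairs i < j
    d  = sqrt(sum((s(I,:) - s(J,:)).^2, 2));   % pairwise distances d_ij
    g2 = (y(I) - y(J)).^2;                     % squared differences
    keep = d <= max(d)/2;                      % max-dist rule (4.7.4)
    [d, ord] = sort(d(keep));
    g2 = g2(keep);  g2 = g2(ord);
    m  = floor(numel(d)/kbar);                 % equal pair counts per bin (4.7.5)
    hk = zeros(kbar,1);  gk = zeros(kbar,1);
    for k = 1:kbar
        bin   = (k-1)*m + (1:m);               % pairs falling in bin N_k
        hk(k) = mean(d(bin));                  % average lag distance h_k
        gk(k) = sum(g2(bin)) / (2*m);          % variogram estimate (4.7.7)
    end

Plotting gk against hk then yields an empirical variogram display of the type shown in Figure 4.19 below.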
While this empirical variogram will be used to fit all variograms in these notes, it should
be mentioned that a number of modifications are possible. First of all, while the use of
average distances, $h_k$, in each bin $k$ has certain statistical advantages (to be discussed
below), one can also use the median distance, or simply the midpoint of the distance
range. Similarly, while uniformity of bin sizes in (4.7.5) will also turn out to have certain
statistical advantages for fitting variograms in our framework (as discussed below), one
can alternatively require uniform widths of bins.
In addition, it has been observed by Cressie and Hawkins (1980) [also Cressie (1993,
Section 2.4.3)] that estimates involving squared values such as (4.7.7) are often dominated by a few large values, and are thus sensitive to outliers. Hence these authors
propose several “robust” alternatives to (4.7.7) based on square roots and median values
of absolute differences.
Finally it should be noted that a number of fitting procedures in use actually drop this
initial stage altogether, and fit variogram models directly in terms of the original data,
$\{y(s_i) : i = 1,..,n\}$.16 In such approaches, the empirical variogram is essentially replaced by a completely disaggregated version called the variogram cloud,17 where each point pair $(s_i, s_j)$ is treated as a separate "bin", and where $\gamma(h_{ij})$, with $h_{ij} = \|s_i - s_j\|$, is estimated by the single squared difference, $[y(s_i) - y(s_j)]^2 / 2$.18
16 Most prominent among these is the method of maximum likelihood, as detailed for example in Schabenberger and Gotway (2005, Section 4.5.2). [This general method of estimation will also be developed in more detail in Part III of these notes for fitting spatial regression models.]

17 An example is given in Figure 4.19 below.

18 For additional discussion see the section on "Binning versus Not Binning" in Schabenberger and Gotway (2005, Section 4.5.4.3). See also the excellent discussion in Reilly and Gelman (2007).
Given an empirical variogram, $\{(h_k, \hat{\gamma}_k) : k = 1,..,\bar{k}\}$, together with a candidate variogram model, $\gamma(h; r, s, a)$ [such as the spherical model in (4.7.1)], the task remaining is to find parameter values, $(\hat{r}, \hat{s}, \hat{a})$, for this model that yield a "best fit" to the empirical variogram data. The simplest and most natural approach is to adopt a "least squares" strategy, i.e., to seek parameter values, $(\hat{r}, \hat{s}, \hat{a})$, that solve the following (nonlinear) least-squares problem:

(4.7.8)   $\min_{(r,s,a)} \; \sum_{k=1}^{\bar{k}} \left[ \hat{\gamma}_k - \gamma(h_k; r, s, a) \right]^2$
While this procedure will be used to fit all variograms in these notes, it is important to note some shortcomings of this approach. First of all, since squared deviations are being used in (4.7.8), it again follows that this least-squares procedure is sensitive to outliers. As with all least-squares procedures, one can attempt to mitigate this problem by using an appropriate weighting scheme, i.e., by considering the more general weighted least-squares problem:

(4.7.9)   $\min_{(r,s,a)} \; \sum_{k=1}^{\bar{k}} w_k \left[ \hat{\gamma}_k - \gamma(h_k; r, s, a) \right]^2$
for some set of appropriate nonnegative weights, $\{w_k : k = 1,..,\bar{k}\}$. A very popular choice for these weights [first proposed by Cressie (1985)] is to set:19

(4.7.10)   $w_k = \frac{|N_k|}{\gamma(h_k; r, s, a)^2}, \quad k = 1,..,\bar{k}$
Here the numerator simply places more weight on those terms with more samples. The denominator is approximately proportional to the variance of the estimates, $\hat{\gamma}_k$,20 so that the effect of both the numerator and denominator is to place more weight on those terms for which the estimates, $\hat{\gamma}_k$, are most reliable. However, it has been pointed out by others that the inclusion of the unknown parameters, $(r, s, a)$, in these weights can create certain instabilities in the estimation procedure [see for example Zhang et al. (1995) and Müller (1999, Section 4)]. Moreover, since our constant bin sizes in (4.7.5) eliminate variation in the sample weights, we choose to use the simpler unweighted least-squares procedure in (4.7.8).
19 In particular, this is the weighted least-squares procedure used in Geostatistical Analyst.

20 This approximation is based on the important case of normally distributed spatial data.
Finally it should also be noted that this least-squares procedure is implicitly a constrained minimization problem, since it is required that (i) $r \geq 0$ and (ii) $s \geq a \geq 0$. In the present setting, however, nonnegativity of both $r$ and $s$ is essentially guaranteed by the nonnegativity of the empirical variogram itself. But nonnegativity of the nugget, $a$, is much more problematic, and can in some cases fail to hold. This is illustrated by the schematic example shown on the left in Figure 4.15 below, where a spherical variogram model (red curve) has been fitted to a set of hypothetical empirical variogram data (blue dots). Here it is clear that the best fitting spherical variogram does indeed involve a negative value for the estimated nugget, $\hat{a}$.
[Figure 4.15. Schematic spherical variogram fits: an unconstrained fit with negative nugget $\hat{a}$ (left) and a constrained fit with $a = 0$ (right)]
Hence in such cases, it is natural to impose the additional constraint that $a = 0$, and then solve the reduced minimization problem in the remaining unknown parameters, $(r, s)$:

(4.7.11)   $\min_{(r,s)} \; \sum_{k=1}^{\bar{k}} \left[ \hat{\gamma}_k - \gamma(h_k; r, s, 0) \right]^2$

The solution to this reduced problem, shown schematically above, will yield the "closest approximation" to the solution of (4.7.8) with a feasible value for the nugget, $a$. It is this two-stage fitting procedure that will be used (implicitly) whenever nuggets are negative.
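As an illustration of this two-stage procedure, the following MATLAB sketch solves (4.7.8) with fminsearch, and refits with $a = 0$ whenever the unconstrained nugget estimate is negative. Here (hk, gk) denote an empirical variogram as in the sketch above, and the piecewise form of the spherical model is the standard one (an assumption on our part; var_spher_plot.m itself may be organized quite differently).

    % Sketch of the least-squares fit in (4.7.8) for a spherical model,
    % with the two-stage refit used when the estimated nugget is negative
    spher = @(h,r,s,a) (h > 0) .* ( a + (s - a) .* ...
            ( (h <= r) .* (1.5*(h/r) - 0.5*(h/r).^3) + (h > r) ) );
    sse = @(p) sum( (gk - spher(hk, p(1), p(2), p(3))).^2 );
    p0  = [max(hk)/2, max(gk), min(gk)];      % crude starting values
    p   = fminsearch(sse, p0);                % unconstrained (r,s,a) estimates
    if p(3) < 0                               % negative nugget: impose a = 0
        sse0 = @(q) sum( (gk - spher(hk, q(1), q(2), 0)).^2 );
        p = [fminsearch(sse0, p(1:2)), 0];    % reduced problem in (r,s)
    end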
(4.8.1)   $Y(s) = \mu(s) + \varepsilon(s), \quad s \in R$
that are in turn used to fit the variogram model. Much of the present section on Continuous Spatial Data Analysis will be devoted to this larger modeling-and-estimation problem. Hence to develop a meaningful example of variogram estimation at this point, it is necessary to make stronger assumptions about the general framework in (4.8.1) above.
In particular, suppose that the spatial trend is constant, so that (4.8.1) reduces to

(4.8.3)   $Y(s) = \mu + \varepsilon(s), \quad s \in R$

for some (possibly unknown) scalar, $\mu$. Under these conditions it follows at once that

(4.8.4)   $E\{[Y(s) - Y(v)]^2\} = E\{[\mu + \varepsilon(s) - \mu - \varepsilon(v)]^2\} = E\{[\varepsilon(s) - \varepsilon(v)]^2\}$

for all $s, v \in R$, so that by definition the variograms for the $Y$-process and the $\varepsilon$-process are identical:

(4.8.5)   $\gamma_Y(h) = \gamma_{\varepsilon}(h), \quad h \geq 0$
Hence, under these assumptions we see that for any given spatial data, $\{y(s_i) : i = 1,..,n\}$, the residual variogram, $\gamma_{\varepsilon}$, can be estimated directly in terms of the empirical variogram,

(4.8.6)   $\hat{\gamma}_Y(h_k) = \frac{1}{2|N(h_k)|} \sum_{(s_i,s_j) \in N(h_k)} [y(s_i) - y(s_j)]^2, \quad k = 1,..,\bar{k}$

for the observable $Y$-process. This approach will be illustrated in the following example.
The following example is taken from [BG, pp.150-151] and is based on sample data from Vancouver Island in British Columbia collected by the Geological Survey of Canada. This data set [contained in the ARCMAP file (…\projects\nickel\nickel.mxd)] extends over the area at the northern tip of the island shown in Figure 4.16 below. The area outlined in red denotes the full extent of the data site. For purposes of this illustration, a smaller set of 436 sample sites was selected, as shown by the dots in Figure 4.17.
[Figure 4.16. Vancouver Sample Area (with 50 km scale bar)]   [Figure 4.17. Vancouver Sample Sites]
Note the curvilinear patterns of these sample points. As with many geochemical surveys, samples are here taken mainly along stream beds and lake shores, where mineral deposits are more likely to be found. In particular, samples of five different ore types were collected. The present application will focus on deposits of Nickel ore. [In class Assignments 3 and 4 you will study deposits of Cobalt and Manganese at slightly different site selections.] This Nickel data is shown in the enlarged map below, where Nickel concentration in water samples is measured in parts per million (ppm).
[Figure 4.18. Nickel Data. Legend, Nickel (ppm): 1.00 - 19.00, 19.01 - 43.00, 43.01 - 78.00, 78.01 - 140.00, 140.01 - 340.00]
Since the mapped data exhibits strong similarities between neighboring values (at this
physical scale), we can expect to find a substantial range of spatial dependence in this
data. Notice however that the covariance-stationarity assumption of Isotropy in (3.3.5)
[and (3.3.3)] is much more questionable for this data. Indeed there appear to be diagonal
“waves” of high and low values rippling through the site. An examination of Figure 4.16
above shows that these waves are roughly parallel to the Pacific coastline, and would
seem to reflect the history of continental drift in this region.21 Hence our present
assumption of covariance stationarity is clearly an over-simplification of this spatial data
pattern. We shall see this more clearly in the variogram estimation procedure to follow.
In particular, by applying the MATLAB program, variogram_plot.m, to this data with the command,

>> variogram_plot(nickel);

one obtains a plot of the empirical variogram, as shown in Figure 4.19 below.
[Figure 4.19. Empirical Variogram for the Nickel Data]
21 In fact these waves are almost mirror images of the Cascadia subduction zone that follows the coastline immediately to the west of Vancouver Island.
Here the point scatter does rise toward a “sill”, as in the classical case illustrated in
Figure 4.8 above. So it appears that one should obtain a reasonable fit using the spherical
model in Figure 4.9 [from expression (4.6.1)]. But before fitting this model, there are a
number of additional observations to be made.
First, for purposes of comparison, the corresponding variogram cloud is plotted in Figure 4.20.22 Notice first that while the horizontal (distance) scales of these two figures are the same, the vertical (squared difference) scales are very different. In order to include the full point scatter in the variogram cloud, the maximum squared-difference value has been increased from 2000 in Figure 4.19 to around 120,000 ($\approx 12 \times 10^4$) in Figure 4.20. For visual comparison, the value 2000 is shown by a red arrow in both figures. So while the empirical variogram does indeed look "classical" in nature, it is difficult to draw many inferences about the shape of the true variogram from the wider scatter of points exhibited by the variogram cloud. The reason for this is that while the empirical variogram shows mean estimates of the variogram at $\bar{k} = 100$ selected lag distances, the variogram cloud contains the squared y-differences for each of the 70,687 individual pairs, $(s_i, s_j)$, with $d_{ij} \leq \bar{d}$. Hence about all that can be seen from this "cloud" of points
is that there are a considerable number of outliers that are very much larger than the mean
values at each distance. But fortunately this pattern of outliers is fairly uniform across the
distance spectrum, and hence should not seriously bias the final result in this particular
case. On the other hand, if outliers were more concentrated in certain distance ranges (as
is often typical for the larger distance values), then this might indicate the need to “trim”
some of these outliers before proceeding. In short, while the variogram cloud may
provide certain useful diagnostic information, the empirical variogram is usually far more
informative in terms of the possible shapes of the true variogram.
Next, it should be noted that in addition to the variogram plot, one obtains the following screen output,

MAXDIST = 48203.698

which is precisely $\bar{d}$ above. To compare this with the max-lag distance, $h$, note first that there are a number of optional outputs for this program as well. First, the actual values of the empirical variogram, $\{(h_k, \hat{\gamma}_k) : k = 1,..,\bar{k}\}$, are contained in the matrix, DAT, where each row contains one $(h_k, \hat{\gamma}_k)$ pair. This can be seen by running the full command, and then clicking on the matrix, DAT, in the workspace to display the empirical variogram. In particular, the value $h$ corresponds to the last element of the first column and can be obtained with the command [ >> DAT(end,1) ], yielding $h = 47984$. This is smaller than $\bar{d}$ since $h$ is somewhere in the middle of the last bin (as in Figure 4.13 above), and $\bar{d}$ is by definition the outer edge, $d_{\bar{k}}$, of this last bin.
22 This was constructed using the MATLAB program, variogram_cloud_plot.m.
As for the additional outputs, maxdist is precisely the screen output above, and the value, bin_size = 707, tells you how many point pairs there are in each bin [as in condition (4.7.5) above]. In this application there are many more than 30 point pairs in each bin, so that the maximum number of bins, $\bar{k} = 100$, is precisely the number realized. However, if the number of sample points had been sufficiently small, then bin_size = 30 would be a binding constraint in (4.7.5), and there could well be fewer than 100 bins.23 Finally, the value, bin_last, is simply a count of points in the last bin, to check whether it is significantly smaller than the rest. This will only occur if $\bar{d}$ is chosen to be very close to the maximum pairwise distance, $h_{\max}$, and hence will rarely occur in practice.24
As one last observation, recall the "wave" pattern in the Nickel data above, and consider whether this effect is picked up by the empirical variogram at larger distances. By using the measurement tool in ARCMAP and tracing a diagonal line in the direction of these waves (from lower left to upper right), it appears that a reasonable value of maxdist to try is $\bar{d} = 80{,}000$ meters. To do so, we can run the program with this option as follows:
>> opts.maxdist = 80000;
>> variogram_plot(nickel,opts);
We then obtain the empirical variogram in Figure 4.21b, where the previous variogram
has been repeated in Figure 4.21a for ease of comparison:
[Figure 4.21a. Max Distance = 48,203]   [Figure 4.21b. Max Distance = 80,000]
23 For example, if n = 50, so that the number of distinct point pairs is 50(49)/2 = 1225 < 30(100) = 3000, then there would surely be fewer than 100 bins.

24 For example, if one were to set opts.maxdist = 95000, which is very close to $h_{\max}$ in the present example, then the last bin will indeed have fewer points than the rest.
Notice that while the vertical (squared difference) scales for these two figures are the same, the horizontal distance scales are now different (reflecting the different maximum distances specified). Moreover, while the segment of Figure 4.21b up to 50,000 ($5 \times 10^4$) meters is qualitatively similar to Figure 4.21a, the bins and corresponding lag distances are not the same as in Figure 4.21a. Hence it is more convenient to show separate plots of these two empirical variograms rather than try to superimpose them on the same scale. Given this scale difference, it is nonetheless clear that the slight dip in the empirical variogram on the left, starting at about 40,000 meters, becomes much more pronounced at the larger lag distances shown on the right. Recall (from the corresponding covariograms) that this can be interpreted to mean that pairs of y-values (nickel measurements) separated by more than 40,000 meters tend to be more similar (positively correlated) than those separated by slightly smaller distances. Finally, by again using the measurement tool in ARCMAP, it can be seen that the spacing of successive waves is about 40,000 meters. So it does appear that this effect is being reflected in the empirical variogram.
As a final caveat, however, it should be emphasized that the most extreme dip in Figure 4.21b occurs at lag distances close to $h_{\max}$, where variogram estimates tend to be very unreliable. In addition, there are "edge effects" created by this rectangular sample region that may add to the unreliability of comparisons at larger distances.
Recall from Section 4.6.1 above that all variogram applications in these notes (as well as the class assignments) will involve fitting spherical variogram models to empirical-variogram data. [Other models can easily be fitted using the Geostatistical Analyst (GA) extension in ARCMAP, as illustrated below.] For purposes of the present application, we shall adhere to the restriction in (4.7.4) that $\bar{d}$ not exceed $h_{\max}/2$, and hence shall use only the empirical variogram in Figure 4.19 (and 4.21a) constructed under this condition. To fit a spherical variogram model to this empirical-variogram data, we shall use the simple nonlinear least-squares procedure in (4.7.8) above.25
25 One can also use the weighted nonlinear least-squares procedure in (4.7.9) and (4.7.10) above, which is programmed in var_spher_wtd_plot.m.
The resulting best fit among spherical variograms is shown in red. If you click Enter again you will see the associated covariogram plot, as shown in Figure 4.23 below.
[Figure 4.22. Variogram Plot]   [Figure 4.23. Covariogram Plot]
Here it must be emphasized that this covariogram is not being directly estimated. Rather, the estimates $(\hat{r}, \hat{s}, \hat{a})$ obtained for the spherical variogram are substituted into (4.6.2) in order to obtain the corresponding covariogram. Hence it is more properly designated as the derived spherical covariogram. Similarly, the blue dots shown in this figure are simply an inverted reflection of the empirical variogram shown in Figure 4.22. However, they can indeed be similarly interpreted as the derived empirical covariogram corresponding to the empirical variogram in Figure 4.22. To do so, recall first from (4.1.7) that for all distances, $h$, it must be true that $C(h) = \sigma^2 - \gamma(h)$. But since each empirical variogram point, $(h_k, \hat{\gamma}_k)$, by definition yields an estimate of $\gamma(h_k)$, namely $\hat{\gamma}_k = \hat{\gamma}(h_k)$, and since the sill value, $\hat{s}$, is by definition an estimate of $\sigma^2$, i.e., $\hat{s} = \hat{\sigma}^2$, it is natural to use (4.1.7) to estimate the covariogram at distance $h_k$ by

$\hat{C}(h_k) = \hat{s} - \hat{\gamma}_k$

Hence by letting $\hat{C}_k \equiv \hat{C}(h_k)$, it follows that the set of points, $\{(h_k, \hat{C}_k) : k = 1,..,\bar{k}\}$, obtained is precisely the derived empirical covariogram in Figure 4.23 corresponding to the empirical variogram, $\{(h_k, \hat{\gamma}_k) : k = 1,..,\bar{k}\}$, in Figure 4.22.26
26 In particular, the vertical component, $\hat{\gamma}_k$, of each variogram point, $(h_k, \hat{\gamma}_k)$, has simply been shifted to the new value, $\hat{C}_k = \hat{s} - \hat{\gamma}_k$.
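In MATLAB terms, footnote 26 amounts to a one-line shift of the DAT matrix described earlier (a sketch, where s_hat is assumed to hold the fitted sill value):

    % Derived empirical covariogram: C-hat_k = s-hat - gamma-hat_k
    % [sketch; DAT rows are the (h_k, gamma-hat_k) pairs described above]
    Chat = s_hat - DAT(:,2);
    plot(DAT(:,1), Chat, 'b.');    % the blue dots of Figure 4.23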
Clicking Enter one last time then yields a screen display of the parameter estimates, $(\hat{r}, \hat{s}, \hat{a})$ [along with maxdist, $\bar{d}$, and the number of iterations in the optimization procedure27], as shown in Figure 4.24 below.
SPHERICAL VARIOGRAM:
RANGE = 17769.160
SILL = 1554.658
NUGGET = 618.044
MAXDIST = 48203.698
ITERATIONS = 126

Figure 4.24. Screen Output of Parameter Estimates
In particular, the RANGE ($\hat{r} = 17769.160$ meters) denotes the distance beyond which there is estimated to be no statistical correlation between nickel values.28 In Figure 4.22, this corresponds to the distance at which the variogram first "reaches the sill". But this offers little in the way of statistical intuition. In Figure 4.23 on the other hand, it is clear that this is the distance at which covariance (and hence correlation) first falls to zero. This is the key difference between these two representations. Notice also that the vertical axis in Figure 4.23 has been shifted relative to Figure 4.22, in order to depict the negative covariance values in the cluster of values around the zero line.
Turning to the other estimated parameters, note first from Figure 4.23 that the SILL ($\hat{s} = 1554.658$) is seen to be precisely the estimated variance of individual nickel values (i.e., the estimated covariance at "zero distance"). Similarly, the NUGGET ($\hat{a} = 618.044$) is seen to be that part of the individual variance that is not related to spatial dependence among neighbors. Since in this case the relative nugget effect, 0.398 (= 618.044/1554.658), is well below 0.5, it is evident that there is a substantial degree of local spatial dependence among nickel values. So in summary, it should be clear that while the variogram model is useful for obtaining these parameter estimates, $(\hat{r}, \hat{s}, \hat{a})$, the derived covariogram model is far more useful for interpreting them.
27 Note that if ITERATIONS exceeds 600, you will get an error message telling you that the algorithm failed to converge in 600 iterations (which is the default maximum number of iterations allowed).

28 Notice also that this RANGE value is considerably below the MAXDIST (48203.698 meters), indicating that the range of spatial dependence among nickel values is well captured by this empirical variogram.
Note first that the title of this window is "Semivariogram" rather than "Variogram" (as discussed at the end of Section 4.1 above). Since the full details of this variogram fitting procedure are given in Assignment 3, it suffices here to concentrate on the estimated parameter values. However, it is important to point out one aspect of this procedure that is crucial for parameter estimation. Recall from the discussion of Figure 4.13 that one must define appropriate bins for the empirical variogram. Since the "default" option for bin definitions in GA is rather complex compared to var_spher_plot.m, it is most convenient to define the bin sizes in GA manually in order to make them (roughly) comparable to those in var_spher_plot. To do so, recall from Figure 4.24 that the MAXDIST value is close to 48000 meters. So by setting the number of lags to 12 and choosing a constant bin size of 4000 meters (as seen in the Lag window in the lower right of Figure 4.25), we will obtain a maximum distance of exactly 48000 meters (as seen on the distance axis of the variogram plot). Note also that in the Model #1 window we have chosen Type = "Spherical", indicating that a spherical variogram is to be fitted.
The fitted spherical variogram is shown by the blue curve in the figure, and the empirical variogram is shown by red dots. Note that while the number of lags (12) is considerably smaller than the number of bins (100) used in var_spher_plot, there actually appear to be
more red dots here than there are blue dots in Figure 4.22 above. The reason for this can be seen by considering the circular pattern of squares in the lower left corner of the figure. Starting from the center and moving to the right, one can count 12 squares, which denote the 12 lag distances. Hence as the figure shows, point pairs are here distinguished not only by the length of the line between them (distance) but also by the direction of this line (angle). Each square thus defines a "bin" of point pairs with similar separation distances and angles. So the number of bins here is much larger than 12.29 While these directional distinctions are important for fitting anisotropic variogram models, in which the isotropy assumption of covariance stationarity is relaxed, we shall not explore such models in these notes.30 Hence, under our present isotropy assumption, the appropriate empirical variogram in GA is constructed by using each of these squares as a separate bin with "lag distance" equal to the average distance between point pairs with distance-angle combinations in that square.
Next observe that in addition to the different binning conventions, the actual estimation procedure used in GA is more complex than the simple least-squares procedure used in var_spher_plot [and is essentially an elaboration of the weighted least-squares approach of Cressie shown in expressions (4.7.9) and (4.7.10) above]. So it should be clear that the resulting spherical model estimate will not be the same as in Figure 4.22 above. In particular, the estimated range and nugget in this case are given, respectively, by "Major Range" (= 17806.86) and "Nugget" (= 617.32). However, the "Sill" is here replaced by "Partial Sill" (= 943.17). Hence, recalling from the discussion at the end of Section 4.6.2 that "Sill = Partial Sill + Nugget", it follows that the corresponding sill is here given by 1560.5 (= 943.17 + 617.32). A comparison of the parameter estimates using both MATLAB and GA in this example (Figure 4.26 below) shows that in spite of the differences above, they are qualitatively very similar.
              MATLAB        GA
  Range      17769.2   17806.9
  Sill        1554.7    1552.6
  Nugget       618.0     617.3

  Figure 4.26. Comparison of MATLAB and GA Parameter Estimates
29 In this example, the number of bins is given approximately by $\pi \cdot 12^2 \approx 452$. However the number of red dots is actually half the number of bins, since each bin has a "twin" in the opposite direction. Hence the number of red dots in this case is given approximately by 226, which is still much larger than 100.
30 For a detailed discussion of such anisotropic models see Waller and Gotway (2004, Section 2.8.5).
First recall from expression (3.3.7) that for any distance $h > 0$ the covariogram value, $C(h)$, is by definition

(4.10.1)   $C(h) = \mathrm{cov}[Y(s_1), Y(s_2)]$

for any $s_1, s_2 \in R$ with $\|s_1 - s_2\| = h$. Hence suppose for sake of simplicity that we are able to draw $n$ sample pairs, $[y_1(s_{1i}), y_2(s_{2i})] = (y_{1i}, y_{2i})$, from this process with $\|s_{1i} - s_{2i}\| = h$ holding exactly for all $i = 1,..,n$. In this context, the standard sample estimator for the covariance value in (4.10.1) is given by

(4.10.2)   $\hat{C}(h) = \frac{1}{n-1} \sum_{i=1}^{n} (y_{1i} - \bar{y}_1)(y_{2i} - \bar{y}_2)$
with sample means denoted by $\bar{y}_j = (1/n)\sum_{i=1}^{n} y_{ji}$, $j = 1,2$. Here division by $n-1$ (rather than the seemingly more natural choice of division by $n$) ensures that if these sample pairs [$(y_{1i}, y_{2i})$, $i = 1,..,n$] were independent draws from jointly distributed random variables, $(Y_1, Y_2)$, with covariance given by (4.10.1), then $\hat{C}(h)$ in (4.10.2) would be an unbiased estimator of $C(h)$. However, if these pairs are not independent, then it is shown in Appendix A2.2 that the actual expectation of $\hat{C}(h)$ is given by

(4.10.3)   $E[\hat{C}(h)] = C(h) - \frac{1}{n(n-1)} \sum_{i \neq j} \mathrm{cov}(Y_{1i}, Y_{2j})$
Notice first that if these sample pairs were independent, then by definition each covariance, $\mathrm{cov}(Y_{1i}, Y_{2j})$, with $i \neq j$ must be zero, so that (4.10.3) would reduce to $E[\hat{C}(h)] = C(h)$, and $\hat{C}(h)$ would indeed be an unbiased estimator. But for the more classical case of nonnegative spatial dependencies, all covariances in the second term of (4.10.3) must either be positive or zero. Hence for this classical case it is clear that there will in general be a considerable downward bias in this estimator. Moreover, without prior knowledge of the exact nature of such dependencies, it is difficult to correct this bias in any simple way. It is precisely this difficulty that motivates the need for alternative approaches to modeling spatial dependencies.
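This downward bias is easy to see in a small simulation. The following MATLAB sketch assumes an exponential covariogram, $C(h) = e^{-h/\rho}$, on a line of unit-spaced locations (an assumption introduced purely for illustration, not part of the text), and compares the average of $\hat{C}(h)$ over repeated samples with the true value $C(h)$:

    % Monte Carlo illustration of the downward bias in C-hat(h)
    % [assumed exponential covariogram C(h) = exp(-h/rho), for illustration]
    rho = 5;  n = 100;  h = 2;  reps = 5000;
    [S1,S2] = meshgrid(1:n);
    Sig = exp(-abs(S1 - S2)/rho);        % covariances among all n locations
    L = chol(Sig, 'lower');
    Chat = zeros(reps,1);
    for r = 1:reps
        Y  = L * randn(n,1);             % one realization of the Y-process
        y1 = Y(1:n-h);  y2 = Y(1+h:n);   % sample pairs at exact distance h
        Chat(r) = sum((y1 - mean(y1)).*(y2 - mean(y2))) / (numel(y1) - 1);
    end
    fprintf('true C(h) = %.3f, mean C-hat(h) = %.3f\n', exp(-h/rho), mean(Chat))

With positively dependent pairs, the average of Chat falls visibly below the true value, exactly as (4.10.3) predicts.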
There is one case in which this is possible, namely when there exist multiple pairs, $[Y(s_{1i}), Y(s_{2i})] : i = 1,..,n_h$, each separated by the same distance $h$, i.e., satisfying the condition that $\|s_{1i} - s_{2i}\| = h$ for all $i = 1,..,n_h$. In particular, if spatial samples form a regular lattice, as illustrated by the small set of red dots in Figure 4.27 below, then there will generally be a set of representative distances for which this is true. Indeed, the symmetry of such lattices implies that distance values such as $h_1$, $h_2$, and $h_3$ in the figure will occur for many different point pairs.
[Figure 4.27. Regular Lattice Sample with representative distances $h_1$, $h_2$, $h_3$]
More generally, whenever there exists a representative range of distinct distance values, $\{h_k : k = 1,..,\bar{k}\}$, at which a substantial set of exact-distance pairs,

(4.10.4)   $N_k = \{(s_1, s_2) : \|s_1 - s_2\| = h_k\}$

can be sampled at each $h_k$, then the associated empirical variogram, $\{(h_k, \hat{\gamma}_k) : k = 1,..,\bar{k}\}$, in (4.7.7) will indeed provide a meaningful unbiased estimate of the true variogram, $\gamma(h_k)$, at each of these distance values.31 To see this, it is enough to recall from (4.1.5) that $E\{[Y(s_1) - Y(s_2)]^2\} = 2\gamma(h_k)$ for all $(s_1, s_2) \in N_k$, and hence that
31 Here the qualifier "meaningful" is meant to distinguish this estimator from one in which there is no possibility of eventually accumulating a large set of sample pairs, $N_k$, for each $h_k$.
(4.10.5)   $E[\hat{\gamma}(h_k)] = E\left[ \frac{1}{2|N_k|} \sum_{(s_1,s_2) \in N_k} [Y(s_1) - Y(s_2)]^2 \right]$
$\qquad = \frac{1}{2|N_k|} \sum_{(s_1,s_2) \in N_k} E\{[Y(s_1) - Y(s_2)]^2\}$
$\qquad = \frac{1}{2|N_k|} \sum_{(s_1,s_2) \in N_k} 2\gamma(h_k) \;=\; \frac{2|N_k|}{2|N_k|}\,\gamma(h_k) \;=\; \gamma(h_k)$
So regardless of the size of each exact-distance set, $N_k$, this empirical variogram will always yield an unbiased estimate of the true variogram, $\gamma(h_k)$, at each distance $h_k$, $k = 1,..,\bar{k}$. Hence if in addition it is true that each of these sets is sufficiently large, say $|N_k| \geq 30$, then this empirical variogram should provide a reliable estimate of the true variogram.
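A companion check, under the same assumed exponential model as in the covariance sketch above, confirms this unbiasedness numerically: the empirical variogram at an exact distance $h$ averages out to the true value $\gamma(h) = 1 - e^{-h/\rho}$, even though the sample pairs are strongly dependent.

    % Companion Monte Carlo check of the unbiasedness result in (4.10.5)
    % [reuses rho, n, h, reps and the factor L from the sketch above]
    ghat = zeros(reps,1);
    for r = 1:reps
        Y = L * randn(n,1);
        ghat(r) = mean( (Y(1:n-h) - Y(1+h:n)).^2 ) / 2;   % estimate (4.7.7)
    end
    fprintf('true gamma(h) = %.3f, mean gamma-hat(h) = %.3f\n', ...
            1 - exp(-h/rho), mean(ghat))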
Finally, it should be noted that if one is able to choose the pattern of samples to use in studying a given spatial stochastic process, $(Y(s) : s \in R)$, then such regular lattices have the practical advantage of providing a uniform coverage of region $R$. This is particularly desirable for interpolating unobserved values in $R$ (as discussed in detail in Section 6 below). It is for this reason that much attention is focused on regular lattice samples of such processes [as for example in Cressie (1993, p.69) and Waller and Gotway (2004, p.281)].32
For the general case of irregular samples, where exact-distance sets rarely contain more
than one observation, it is necessary to rely on the binning procedure developed in
Section 4.7.1 above. The “Nickel” example in Section 2.4 above provides a good
illustration of such a case where regular sample patterns are impractical if not impossible.
In this more typical setting, it is difficult to find much discussion in the literature about
the bias of empirical variogram estimates created by binning.33
However, it is not difficult to show that if the true variogram is reasonably smooth, then
one can at least bound the bias in a rather simple way. In particular, if by “smooth” we
32 It should be mentioned again that these references define empirical variograms with respect to the more general notion of stationarity mentioned in footnote 6 of Section 3.2 above. So the exact-distance sets used here are replaced by "exact-difference sets".

33 One noteworthy exception is the interesting analysis of "clustered" sampling schemes by Reilly and Gelman (2007).
mean that the variogram, $\gamma(h)$, is locally linear in the sense that its values are well approximated by linear functions on sufficiently small intervals, then one can bound the bias of the general empirical variogram in (4.7.7) in terms of these linear approximations. To be more specific, suppose that the true variogram is given by the red curve in Figure 4.28 below, and that the set of bins chosen for estimating this (unknown) function is shown schematically as in Figure 4.28 [where by definition each bin, $k = 1,..,\bar{k}$, is defined by the interval of separation distances, $d_{k-1} < h \leq d_k$ (with $d_0 = 0$)].

[Figure 4.28. Bins for Variogram Estimation]   [Figure 4.29. Local Linear Approximation of $\gamma(h)$ by $l_k(h)$ on bin $k$]
Here each linear function,

(4.10.6)   $l_k(h) = a_k h + b_k$

(with slope, $a_k$, and intercept, $b_k$) has been implicitly chosen to minimize the maximum deviation, $|\gamma(h) - l_k(h)|$, over the interval $d_{k-1} < h \leq d_k$. If this maximum deviation is denoted by $\varepsilon_k$, then the variogram, $\gamma(h)$, is said to have an $\varepsilon_k$-linear approximation on bin $k$. With these definitions, it is shown in Appendix A2.3 that in terms of this $\varepsilon_k$-linear approximation, the maximum bias in the empirical variogram estimate of $\gamma(h_k)$ can never exceed $2\varepsilon_k$, i.e.,

(4.10.7)   $\left| E[\hat{\gamma}(h_k)] - \gamma(h_k) \right| \leq 2\varepsilon_k, \quad k = 1,..,\bar{k}$
Of course one cannot know the value of $\varepsilon_k$ without knowing the true variogram itself. So the bound in (4.10.7) is simply a qualitative result showing that if $\gamma(h)$ is assumed to be sufficiently smooth to ensure that the maximum deviation, $\varepsilon = \max\{\varepsilon_k : k = 1,..,\bar{k}\}$, for the given bin partition is "small", then the bias in the empirical variogram, $\{(h_k, \hat{\gamma}_k) : k = 1,..,\bar{k}\}$, will also be "small". In other words, for variograms with good "piece-wise linear approximations" on the given set of bins, empirical variogram estimates can be expected to exhibit only minimal bias.
[Figure 5.1. Elevation Measurement Locations and Interpolation Point $s_0$]
Here it is assumed that elevations, $y(s)$, have been measured at a set of spatial locations, $\{s_i : i = 1,..,n\}$, in some relevant region, $R$, as shown by the dots outlined in white. Given these measurements, one would like to estimate the elevation, $y(s_0)$, at some new location, $s_0 \in R$, shown in the figure (outlined in black). Given the typical continuity properties of elevation, it is clear that those measurement locations closest to $s_0$ are the most relevant ones for estimating $y(s_0)$, as illustrated by the red dots lying in the neighborhood of $s_0$ denoted by the yellow circle. While it is not obvious how large this neighborhood should be, let us suppose for the moment that it has somehow been determined (we return to this question in Section 6.4 below). Then the question is how to use this set of five elevations at locations, say $s_1,..,s_5$, to estimate $y(s_0)$. These locations are displayed in more detail in Figure 5.2 below, where $d_{0i} = \|s_0 - s_i\|$ denotes the distance from $s_0$ to each point $s_i$, $i = 1,..,5$.
[Figure 5.2. Neighboring Locations $s_1,..,s_5$ and Distances $d_{01},..,d_{05}$ from $s_0$]
The general kernel smoother estimates $y(s_0)$ by a weighted average of the observed values in $S(s_0)$, with weights determined by a distance-decreasing weight function, $w$:

(5.2.1)   $\hat{y}(s_0) = \frac{\sum_{s_i \in S(s_0)} w(d_{0i})\, y(s_i)}{\sum_{s_j \in S(s_0)} w(d_{0j})} = \sum_{s_i \in S(s_0)} \left[ \frac{w(d_{0i})}{\sum_{s_j \in S(s_0)} w(d_{0j})} \right] y(s_i)$

Here the normalized weights sum to one,

(5.2.2)   $\sum_{s_i \in S(s_0)} \frac{w(d_{0i})}{\sum_{s_j \in S(s_0)} w(d_{0j})} = 1$
and are thus interpretable as the "fractional contribution" of each $y(s_i)$ to the estimate, $\hat{y}(s_0)$. Thus points closer to $s_0$ in $S(s_0)$ will have higher fractional contributions to $\hat{y}(s_0)$, since for all $s_i, s_j \in S(s_0)$,

(5.2.3)   $d_{0i} < d_{0j} \;\Rightarrow\; w(d_{0i}) \geq w(d_{0j}) \;\Rightarrow\; \frac{w(d_{0i})}{\sum_{s_k \in S(s_0)} w(d_{0k})} \geq \frac{w(d_{0j})}{\sum_{s_k \in S(s_0)} w(d_{0k})}$
We have already seen an example of a kernel smoother, namely the inverse distance weighting (IDW) smoother in Section 2.1 above. In this case, the weight function is a simple inverse power function of the form,

(5.2.4)   $w(d) = d^{-a}$

Another common choice is the exponential weight function,

(5.2.5)   $w(d) = e^{-d}$
[Figure 5.3. Comparison of Exponential ($e^{-d}$) and IDW ($d^{-2}$) Smoothers]
In this respect, exponential smoothers may be preferred. However, if one requires exact interpolation at data points [i.e., $\hat{y}(s_i) = y(s_i)$], then this is only possible if $w(d) \to \infty$ as $d \to 0$. So kernel smoothers like the exponential yield results that are actually smoother than the data itself. In summary, it is important to be aware of such differences between possible kernel smoothers, and to employ one that is most suitable for the purpose at hand.
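The following MATLAB sketch implements the general kernel smoother in (5.2.1), with the weight function passed in as a handle. [The function name, kernel_smooth, is hypothetical, and the neighborhood $S(s_0)$ is simply taken to be all points within a radius $d_0$ of $s_0$.]

    % Sketch of the kernel smoother in (5.2.1)
    % [hypothetical function; S(s0) = all points within radius d0 of s0]
    function yhat = kernel_smooth(s, y, s0, d0, w)
    % s = (n x 2) locations, y = (n x 1) values, s0 = (1 x 2) target point,
    % w = weight function handle, e.g.
    %     w = @(d) d.^(-2);     % IDW weights (5.2.4) with a = 2
    %     w = @(d) exp(-d);     % exponential weights (5.2.5)
    d  = sqrt(sum((s - s0).^2, 2));       % distances d_0i
    in = (d <= d0);                       % interpolation set S(s0)
    wi = w(d(in));
    yhat = sum(wi .* y(in)) / sum(wi);    % weighted average (5.2.1)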
1 It should be emphasized that while this estimation procedure is the same as in regression, there is no appeal to a linear statistical model, and in particular, no random error model.
(5.3.1)   $\sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 s_{i1} + \beta_2 s_{i2}) \right]^2$

Using these estimates, the interpolated value of $y(s_0)$ at point $s_0 = (s_{01}, s_{02})$ is given by

(5.3.2)   $\hat{y}(s_0) = \hat{\beta}_0 + \hat{\beta}_1 s_{01} + \hat{\beta}_2 s_{02}$

[Figure 5.4. Local Linear Interpolation at $s_0$, with data points $s_1, s_2, s_3, s_4$]
More generally, this defines a single point on the interpolation function defined for all points, as shown schematically by the solid curve in the figure.2 In practice this function would not be so smooth, and in fact would not even be continuous. In particular there would be jumps in this function at each location, $s_0$, where a data point, $s_i$, either enters or leaves the current interpolation set, $S(s_0)$. Such discontinuities can be removed by fixing the diameter (bandwidth) of interpolation sets, say

(5.3.3)   $S(s_0) = \{s_i : \|s_0 - s_i\| \leq d_0\}$
2 A particularly good discussion of this local linear polynomial case is given in ESRI Desktop Help at http://webhelp.esri.com/arcgisdesktop/9.2/index.cfm?TopicName=How_Local_Polynomial_interpolation_works .
and introducing a kernel smoothing weight function, $w$, similar to (5.2.1), which falls to zero at distance $d_0$. One then modifies the local least squares in (5.3.1) to a local weighted least squares of the form,3

(5.3.4)   $\sum_{i=1}^{n} w_{0i} \left[ y_i - (\beta_0 + \beta_1 s_{i1} + \beta_2 s_{i2}) \right]^2$
where $w_{0i} = w(\|s_0 - s_i\|)$. Hence points in $S(s_0)$ at greater distances from $s_0$ will have less weight in the interpolation (and will have no weight at distance $d_0$). This implies in particular that as $s_0$ moves along the axis in Figure 5.4, data points entering or leaving the interpolation set will initially have zero weight, thus preserving continuity of the interpolation function. An actual example of such a locally weighted linear interpolation function is shown (in red) on the left in Figure 5.5 below.
[Figure 5.5. Locally Weighted Linear Interpolation (left) and the Tri-Cube Weight Function, $w(d)$, with bandwidth $d_0$ (right)]
Here the kernel smoothing function used is the popular "tri-cube" function, which has the mathematical form:

(5.3.5)   $w(d) = \begin{cases} \left[1 - (d/d_0)^3\right]^3, & d \leq d_0 \\ 0, & d > d_0 \end{cases}$

where in this case a bandwidth of $d_0 = 1$ was used. The shape of this function is shown on the right in Figure 5.5, where the distance scale has been increased for visual clarity.
The one-dimensional example above nonetheless serves to illustrate the basic mechanism of locally weighted least squares used in GWR.4
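A one-dimensional sketch of this locally weighted procedure (in the spirit of Figure 5.5, though not the gwr.m program itself) can be written in a few lines of MATLAB, combining the tri-cube weights (5.3.5) with the weighted least-squares criterion (5.3.4):

    % Sketch of locally weighted linear interpolation at a point s0
    % [one-dimensional version; illustrative only, not the gwr.m program]
    function yhat0 = local_linear(s, y, s0, d0)
    % s, y = (n x 1) data vectors, s0 = target point, d0 = bandwidth
    d = abs(s - s0);                         % distances from s0
    w = ((1 - (d/d0).^3).^3) .* (d <= d0);   % tri-cube weights (5.3.5)
    X = [ones(size(s)), s];                  % local linear design matrix
    W = diag(w);
    b = (X' * W * X) \ (X' * W * y);         % weighted least squares (5.3.4)
    yhat0 = [1, s0] * b;                     % interpolated value at s0

Evaluating local_linear over a grid of s0 values traces out a continuous interpolation function like the red curve in Figure 5.5.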
Finally it should be mentioned that local polynomial interpolation can of course involve higher-order polynomials. As one illustration, suppose we consider local polynomial interpolation with second-order (quadratic) polynomials. Then the sum-of-squared deviations in (5.3.1) would now be replaced by the quadratic version,

(5.3.6)   $\sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 s_{i1} + \beta_2 s_{i2} + \beta_3 s_{i1}^2 + \beta_4 s_{i2}^2 + \beta_5 s_{i1} s_{i2}) \right]^2$
A schematic of such an interpolation paralleling Figure 5.4 is shown in Figure 5.6 below.

[Figure 5.6. Local Quadratic Interpolation]
In this schematic example, the red dashed curve is now the quadratic function in (5.3.7) evaluated at all points on the line, including $s_0$. Given the obvious nonlinear nature of this data, it would appear that a quadratic polynomial interpolation would yield a better fit to this data.
However, it should again be emphasized that the actual locus of interpolations obtained
would still be discontinuous for exactly the same reasons as in the linear interpolation
4 Here it should be noted that the type of one-dimensional example shown in Figure 5.5 is not readily implemented using GWR in ARCMAP. Rather this example was computed using the MATLAB program, gwr.m, in the suite of programs by James LeSage, available at http://www.spatial-econometrics.com/ .
case above. So when using these interpolators in GA, be aware that implicit smoothing procedures are being used, in a manner similar to the kernel smoothing procedure outlined above. Hence the numerical values obtained will differ slightly from the simple interpolations, $\hat{y}(s_0)$, in (5.3.2) and (5.3.7) above, depending on the spacing of actual data points. Also be aware that the smoothing procedures used are not documented in ARCMAP help. As with essentially all commercial GIS software, there are often hidden layers of calculations being done that are not fully documented.
where the normalizing factor, $(2\pi)^{-1/2}$, plays no role here, and has been removed. One then defines the interpolation function for this model to be a weighted combination of these basis functions:

(5.4.2)   $\hat{y}(s) = \sum_{i=1}^{n} a_i f_i(s)$
To choose an appropriate set of weights, one typically requires exact interpolation at each data point, $(s_i, y_i)$, i.e.,

(5.4.3)   $y_i = \hat{y}(s_i) = \sum_{j=1}^{n} a_j f_j(s_i), \quad i = 1,..,n$

Since this is simply a system of linear equations in the unknown weight vector, $a = (a_1,..,a_n)'$, one can easily solve for these weights. In particular, if we let $y = (y_1,..,y_n)'$ denote the vector of observed y-values, and let the n-square function matrix, $F$, be defined by

(5.4.4)   $F = (F_{ij})_{i,j=1,..,n}\,, \quad F_{ij} = f_j(s_i)$
5 Recall that the Euclidean distance between vectors $x$ and $y$ is denoted by $d(x,y) = \|x - y\|$.
then exact interpolation requires that

(5.4.5)   $y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} F_{11} & \cdots & F_{1n} \\ \vdots & & \vdots \\ F_{n1} & \cdots & F_{nn} \end{pmatrix} \begin{pmatrix} a_1 \\ \vdots \\ a_n \end{pmatrix} = Fa$

so that the desired weight vector is given by6

(5.4.6)   $a = F^{-1} y$

This set of weights necessarily yields a smooth interpolation function that passes through all of the data points.
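The entire procedure in (5.4.2) through (5.4.6) takes only a few lines of MATLAB. The sketch below uses, as one possible choice, the inverse multiquadratic basis discussed next [the variable names are hypothetical, with s an (n x 2) matrix of locations and y the data vector]:

    % Sketch of radial basis function interpolation, (5.4.2)-(5.4.6)
    % [illustrative only; f may be any radial basis function of distance]
    f = @(d) 1 ./ sqrt(1 + d.^2);        % e.g. inverse multiquadratic (5.4.7)
    D = sqrt(max(sum(s.^2,2) + sum(s.^2,2)' - 2*(s*s'), 0));  % distances
    F = f(D);                            % function matrix in (5.4.5)
    a = F \ y;                           % weights (5.4.6): a = F^(-1) y
    d0 = sqrt(sum((s - s0).^2, 2));      % distances from a new point s0
    yhat0 = f(d0)' * a;                  % interpolated value y-hat(s0)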
Although the normal density above is a very common choice for radial basis functions, this option is not available in GA. However, one option that is available, and which looks very similar to the standard bivariate normal density, is the so-called inverse multiquadratic function, defined for all $s \in \mathbb{R}^2$ by7

(5.4.7)   $f(s) = \frac{1}{\sqrt{1 + \|s\|^2}}$

While this function appears to be mathematically very different from the bivariate normal density, a two-dimensional plot of (5.4.7) shows that it is virtually indistinguishable from Figure 3.2 in a qualitative sense. About the only significant difference is that it falls to zero much more slowly than the normal density.
To gain some feeling for this type of interpolation, it is again convenient to develop a one-dimensional example paralleling the example for local polynomial interpolations above. In this case, (5.4.7) reduces to the function

(5.4.9)   $f(s) = \frac{1}{\sqrt{1 + s^2}}, \quad s \in \mathbb{R}$

[Plot of $f(s)$ for $-3 \leq s \leq 3$]
which is now seen to be qualitatively similar to the univariate normal density in
expression (3.1.11) of Section 3. Interpolation with these radial basis functions can be
6 This of course assumes that $F$ is nonsingular, which will hold in all but the most degenerate cases.

7 As with the standard normal density, this function can be generalized by adding a weight, $\eta$, to $s$, yielding the one-parameter family, $f(s\,|\,\eta) = \left(1 + \|\eta\, s\|^2\right)^{-1/2}$.
illustrated by the example shown in Figure 5.7 below, which involves only three data points, $(s_i, y_i)$, $i = 1, 2, 3$.

[Figure 5.7. Radial Basis Function Interpolation: fitted basis functions $a_1 f_1(s)$, $a_2 f_2(s)$, $a_3 f_3(s)$ and interpolation function $\hat{y}(s)$ through the data at $s_1, s_2, s_3$]
Here the fitted radial basis functions [$a_i f_i(s)$, $i = 1, 2, 3$] are shown in black, and the resulting interpolation function, $\hat{y}(s)$, is shown in red. Notice that unlike the kernel smoothers above, there is no need for interpolation sets in this case. Here the entire interpolation function, $\hat{y}(s)$, is determined simultaneously at all point locations, $s$. Notice also that this function is necessarily smooth (since it is a sum of smooth functions). Finally, note from the figure that $\hat{y}(s)$ is indeed an exact interpolator at data points, i.e., it passes through each of the data points shown.
While the above procedure offers a remarkably simple way to obtain smooth and exact interpolations, it can be argued that the choice of basis functions is rather arbitrary. Moreover, it is difficult to regard this fitting procedure as "optimal" in any sense. Rather, its weights are determined entirely by the exact-interpolation condition in (5.4.3) above. However, there is an alternative method of interpolation, known as spline interpolation, which is more appealing from a theoretical viewpoint. As with radial basis functions, the classical spline model seeks to find a prediction function, $\hat{y}(s)$, that satisfies the exact-interpolation condition,

(5.5.1)   $\hat{y}(s_i) = y_i, \quad i = 1,..,n$
There are of course infinitely many smooth functions that could satisfy this condition. Hence the unique feature of spline interpolation is that rather than simply pre-selecting a given set of smooth candidate functions, this approach seeks to find the smoothest possible function satisfying (5.5.1). To characterize "smoothness", recall that for one-dimensional functions, $f(s)$, the second derivative, $f''(s)$, measures the curvature of the function. In particular, linear functions, $f(s) = a + bs$, have zero curvature, as reflected by the fact that $f''(s) \equiv 0$. More generally, if we ignore signs and define the curvature of $f$ at $s$ by $f''(s)^2$, then to compare the curvature of functions $f$ on a given interval, say $[a, b]$, it is natural to consider their total curvature,8

(5.5.2)   $C(f) = \int_a^b f''(s)^2\, ds$
For two dimensions, the idea is basically the same. Here "curvature" at point, $s = (s_1, s_2)$, is defined in terms of the Hessian matrix of second partial derivatives,

(5.5.3)   $H_f(s) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial s_1^2} & \dfrac{\partial^2 f}{\partial s_1 \partial s_2} \\[2ex] \dfrac{\partial^2 f}{\partial s_1 \partial s_2} & \dfrac{\partial^2 f}{\partial s_2^2} \end{pmatrix}$
Again to ignore signs, one can define the size of a matrix, $M = (m_{ij})$, by its squared distance from the origin (as a vector), i.e.,

(5.5.4)   $\|M\|^2 = \sum_i \sum_j m_{ij}^2$

so that in particular,

(5.5.5)   $\|H_f(s)\|^2 = \left(\frac{\partial^2 f}{\partial s_1^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial s_1 \partial s_2}\right)^2 + \left(\frac{\partial^2 f}{\partial s_2^2}\right)^2$

[which is seen to be the natural generalization of the one-dimensional case, $\|f''(s)\|^2 = f''(s)^2$]. Hence to compare the curvature of functions, $f$, on a (bounded) two-dimensional region, $R \subset \mathbb{R}^2$, the natural extension of (5.5.2) is to define total curvature, $C(f)$, by
8 While it might be conceptually more appropriate to use average curvature, $\bar{C}(f) = \frac{1}{b-a} C(f)$, this simple rescaling has no effect on the ordering of smoothness among functions, $f$.
(5.5.6)   $C(f) = \int_R \|H_f(s)\|^2\, ds = \iint_R \left[ \left(\frac{\partial^2 f}{\partial s_1^2}\right)^2 + 2\left(\frac{\partial^2 f}{\partial s_1 \partial s_2}\right)^2 + \left(\frac{\partial^2 f}{\partial s_2^2}\right)^2 \right] ds_1\, ds_2$

Thus, for a given set of data points, $[(s_1, y_1),..,(s_n, y_n)]$, with $\{s_1,..,s_n\} \subset R \subset \mathbb{R}^2$, the corresponding spline interpolation problem is to find a function, $\hat{y}(s)$, on $R$ which minimizes total curvature (5.5.6) subject to the exact-interpolation condition (5.5.1).
While these interpolation problems are relatively simple to state, they can only be solved by very sophisticated mathematical methods. Hence for our purposes, it suffices to say that these solutions are themselves remarkably simple, and lead to optimally smooth interpolation functions, $\hat{y}(s)$. To illustrate the basic ideas, it is again convenient to focus on the one-dimensional case. Here it turns out that the basic interpolation functions are combinations of cubic functions,

(5.5.7)   $f(s) = a_3 s^3 + a_2 s^2 + a_1 s + a_0$
between every pair of adjacent data points. To gain some intuition here, suppose one considers a "partial" smooth curve with a gap between two data points, $s_1$ and $s_2$, as shown in Figure 5.8a below. To "complete" this curve in a smooth way, one must match both the end values (shown as black dots) and the end slopes (shown as dashed lines).

[Figure 5.8a. Partial Smooth Curve]   [Figure 5.8b. Completed Smooth Curve]

Since the cubic function is the smoothest function with an inflection point (where the second derivative changes sign), it should be clear from Figure 5.8b that this adds just enough flexibility to complete this curve in the smoothest possible way.9
As one example of this interpolation method, the data set in Figure 5.7 has been re-
interpolated using cubic spline interpolation in Figure 5.9 below.10 Here there appears to
be a dramatic difference between the two methods. But except for the slight scale
9 For a more general discussion of fitting cubic splines to (one-dimensional) data sets, see McKinley and Levine at http://online.redwoods.cc.ca.us/instruct/darnold/laproj/fall98/skymeg/proj.pdf .

10 This interpolation was computed in MATLAB using the package program, spline.m.
differences between these two figures, the qualitative "bowl" shape of $\hat{y}(s)$ within the interval defined by these three data points is roughly similar. Moreover, it should now be clear that this "bowl" is much smoother for the cubic spline case than for the (inverse multiquadratic) radial basis function case in Figure 5.7. Indeed, this cubic spline is the smoothest possible exact interpolator within this interval. But notice also that outside this interval, the two interpolations differ radically. Since the individual radial basis functions in Figure 5.7 all approach zero as distance from data points increases, the interpolation function also decreases to zero. But for the cubic spline case, this smooth "bowl" is only achieved by continuing the bowl shape outside the data interval.
[Figure 5.9. Cubic Spline Interpolation of the Three-Point Data Set, showing $\hat{y}(s)$ through $s_1, s_2, s_3$]
More generally, these cubic spline interpolations tend to diverge at the data boundaries, and are much less trustworthy in this range. A better example of this is shown in Figure 5.10 below, where the interpolation now involves five points. Here again the cubic spline interpolator, $\hat{y}(s)$, is seen to yield the smoothest possible exact interpolation of these five data points. But outside this range, $\hat{y}(s)$ now diverges downward on both sides in order to achieve this degree of smoothness.
[Figure 5.10. Cubic Spline Interpolation]
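One-dimensional interpolations like those in Figures 5.9 and 5.10 are easily reproduced with the MATLAB package program, spline.m, mentioned in footnote 10 (a sketch with made-up data values):

    % Cubic spline interpolation with the package program spline.m
    % [the five data points below are made up, for illustration only]
    si = [2 4 6 8 10];   yi = [1 -2 3 0 -1];
    ss = linspace(0, 12, 200);            % grid extending beyond the data
    plot(ss, spline(si, yi, ss), 'r-', si, yi, 'bo')

Extending the evaluation grid beyond the data range makes the boundary divergence noted above clearly visible.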
While these one-dimensional examples serve to illustrate the main ideas of spline
interpolation, the solution in two dimensions [i.e., the function yˆ ( s) minimizing (5.5.6)
subject to (5.5.1)] is mathematically quite different than the one-dimensional case. From
an intuitive viewpoint, the basic reason for this is that in two dimensions it is not possible
to “piece together” solutions between data points. In fact, the solution in two dimensions
is formally much closer to the radial basis function approach above. In particular, the
optimal interpolation function, yˆ ( s) , designated as a thin-plate spline function, takes
essentially the same form as (5.4.2), namely
(5.5.8)   $\hat{y}(s) \;=\; \hat{y}(s_1,s_2) \;=\; (\beta_0 + \beta_1 s_1 + \beta_2 s_2) \;+\; \sum_{i=1}^{n} a_i\, f_i(s\,|\,\tau)$

where the first term is a linear function of the coordinates,11 and the radial basis
functions now take the form

(5.5.9)   $f_i(s\,|\,\tau) \;=\; f(d_{is}) \;=\; d_{is}^2 \,\log(d_{is}/\tau)\;, \quad i = 1,..,n$
[Figure: plot of f(d_is) against d_is for τ = 10, dipping below zero and rising back to zero at d_is = 10]
Figure 5.11. Radial Shape of Spline Basis Functions
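The shape in Figure 5.11 can be reproduced with a few lines of MATLAB (a minimal
sketch, assuming the form of (5.5.9) with τ = 10):

   tau = 10;                      % basis parameter, as in Figure 5.11
   d = linspace(0.01, 12, 300);   % radial distances d_is from a data point
   f = d.^2 .* log(d ./ tau);     % thin-plate basis values f(d_is) in (5.5.9)
   plot(d, f)                     % dips below zero, returns to zero at d = tau
   xlabel('d_{is}'), ylabel('f(d_{is})')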
11 In fact, this linear part has zero curvature by definition, and hence has no effect on the curvature of
yˆ(s).
Here the value τ = 10 was used, so that by definition f(d_is | τ) rises back up to zero at
exactly d_is = 10. This also makes it clear that beyond radial distance τ from data points,
si, these functions diverge rapidly, and can produce rapid changes in yˆ(s). So larger
values of τ tend to produce “stiffer” surfaces with less variation. This can also be seen in
the full two-dimensional representation of fi(s | τ) shown in Figure 5.12 below [again
with τ = 10].
As in the one-dimensional case, the local flexibility of this “Mexican hat” function allows
much more rapid changes in value than, say, the multi-quadratic basis function above. So
for example in the case of elevations, these thin-plate splines will do a much better job of
capturing rapid changes in elevation than will the multi-quadratic functions.
Finally, it should again be noted that (as mentioned for kernel smoothers at the end of
Section 5.2 above) the two-dimensional spline interpolation methods employed in the
Spatial Analyst (SA) extension of ARCMAP are considerably more complex than the
basic thin-plate spline model developed above. To describe the main differences, we
focus on the regularized spline option (rather than the less commonly used “tension”
spline option). While this model is based essentially on thin-plate splines, the radial basis
functions in (5.5.9) above are “augmented” by an additional term that reflects third-
derivative effects as well as second-derivative (curvature) effects.12 So the “τ”
12 This use of third derivatives is appropriate if continuity of second derivatives is required for the
prediction function [see Mitáš and Mitášová (1988, Section 6)]. But continuity of first derivatives is usually
sufficient for adequate smoothness. So from a practical viewpoint, the simpler approach of thin-plate spline
interpolation is in many ways more appealing. [See for example the constructive development of this
method in Franke (1982)].
parameter in (5.5.9) plays a somewhat more complex role in these functions. However,
its basic interpretation remains the same. In particular, larger values of τ still produce
“stiffer” surfaces with less variation.
But a key point here is that the exact interpolation constraint, yˆ(si) = y(si), is dropped. So
the surface need not pass through each data point. In addition, the regularized spline
interpolation procedure in SA is “localized” by partitioning R into smaller cells to
simplify the calibration procedure. So in addition to the τ parameter, the user is also
asked to choose the “number of points”, say η, to be used in the interpolation of each
cell. Again, larger numbers of points produce smoother interpolations with less variation.
While the above models were motivated in terms of the “elevation” example in Figure
5.1, most spatial data sets tend to exhibit more local variation than elevations (or, say, the
mean temperature levels studied in Assignments 3 and 5). A typical example is provided
by the Nickel data displayed in Figure 4.18 of Section 4.9 above. Here we start with the
regularized spline tool discussed above, and compare interpolations for two different
parameter settings, (τ, η):
[Maps: two spline-interpolated nickel surfaces, with data-point locations overlaid]
Figure 5.13a. Spline with (τ = 0.1, η = 12)        Figure 5.13b. Spline with (τ = 10, η = 50)
Here the spline interpolation shown in Figure 5.13a uses the default parameter settings
for the spline tool, namely τ = 0.1 and η = 12. However, a comparison with Figure 4.18
shows that this interpolation does a rather poor job of capturing overall variations in
nickel deposits. Much like the IDW interpolation of rainfall in Figure 2.2 of Section 2
above, this interpolation shows far too many local extremes, both high and low. This can
also be seen numerically by noting that while the actual values of nickel deposits are of
course nonnegative (starting from 1.0 ppm), the values in this spline interpolation go
down as low as −700 ppm, which is of course totally meaningless. As
described above, the key problem here is that this spline function is attempting to
interpolate between points that exhibit extreme local variation in values. Hence while the
values, τ = 0.1 and η = 12, are reasonable choices for very smooth surfaces, they
create far too much variation between data locations in this case. In fact, to achieve an
interpolation that is sufficiently “stiff” to provide a reasonable overall approximation to
this surface, it is necessary to use much larger settings, such as the values τ = 10 and
η = 50 shown in Figure 5.13b.
Panel (a) shows the results of a local linear polynomial interpolation in Geostatistical
Analyst (GA). Panel (b) shows a radial basis function interpolation in GA with the
inverse multiquadratic option described above. Panel (c) shows the same regularized
spline interpolation in Figure 5.13b above (minus the data points). Finally, the last panel
compares these deterministic methods with the stochastic method of ordinary kriging in
GA, which will be developed fully in Section 6.
For the present, the main purpose of this graphical comparison is to show that in spite of
their mathematical differences, the actual results of these different interpolation methods
are qualitatively quite similar. However, it should be emphasized that considerable
experimentation was required to find parameter settings for each method that yielded this
degree of visual similarity. For example, as already mentioned above, the spline
interpolation in panel (c) required the use of very large values of (τ, η) to achieve a
sufficient degree of overall smoothness. As for panel (b), note that while the inverse
multiquadratic function is itself very smooth, the variation in y-values here leads to the
least smooth interpolation of the four panels shown. The key factor appears to be the lack
of flexibility in fitting, which in this case is determined entirely by the exact-interpolation
condition. Finally, as was pointed out at the end of Section 5.2, the local linear
interpolation in panel (a) involves a number of internal smoothing procedures to remove
the discontinuities created by shifting interpolation sets from point to point. So here it is
not even clear how such smoothness was achieved. The same is in fact true of the
ordinary kriging results shown in panel (d). As we shall see below, this procedure
involves “prediction sets” identical in spirit to the “interpolation sets” of local polynomial
interpolations. So again, internal smoothing procedures have been employed in GA to
obtain continuous results.
Finally, it is important to emphasize once again that all interpolation models developed
in this section are completely deterministic. In other words, the surface being
approximated is treated simply as an unknown function, y(s), on region R, rather than
as the realization of an unknown spatial stochastic process, Y(s) = μ(s) + ε(s), on R. The
deterministic approach may be quite appropriate for applications such as “filling in”
surface elevations from a subset of observed values, where such elevations can
reasonably be assumed to vary continuously. But for spatial data such as the nickel
deposits example, where local variations can be quite substantial, it is generally more
useful to treat such variations as random fluctuations, ε(s), about a deterministic trend
function, μ(s), representing the mean values of a stochastic process, Y(s) = μ(s) + ε(s).
Hence we now turn to a consideration of this stochastic approach to spatial prediction.
From a formal viewpoint, kriging models are closely related to the kernel smoothing
models developed in Sections 5.1 and 5.2 above. In particular, the fundamental idea of
predicting values based on local information is exactly the same. In fact, a slight
modification of Figure 5.2, as in Figure 6.1 below, serves to illustrate the main ideas.
[Figure 6.1. Prediction of Yˆ(s0) from neighboring values Y(s1),..,Y(s5), with weights λ01,..,λ05]
Given spatial data, y(s), at a set of locations, {si : i = 1,..,n} ⊂ R, we again consider the
prediction of the unobserved value at some location, s0 ∈ R. The first key difference is
that we now treat the observed data as a finite sample from a spatial stochastic process,
{Y(s) : s ∈ R}. As in the case of deterministic interpolation, not all sample data is
necessarily relevant for prediction at s0. Hence, for the present, we again assume that
some appropriate subset of sample locations, S(s0) = {s1,..,sn0} ⊆ {s1,..,sn},
1 For further background discussion of kriging methods see Cressie (1990) and (1993, p.106).
has been chosen for prediction, which for convenience we here designate as the
prediction set at s0 (rather than “interpolation set”). The choice of S ( s0 ) will of course
play a major role in determining the final prediction value at s0 . But it will turn out that
the best way to choose these sets is first to determine a “best prediction” for any given
set, S ( s0 ) , and then determine a “best prediction set” by comparing these predictions.
This procedure, known as cross validation, will be developed in Section 6.4 below.
So given a prediction set, S(s0) = {s1,..,sn0}, the next question is how to determine a
prediction, yˆ(s0), based on the sample data, {y(s1),..,y(sn0)}. Given the present
stochastic framework, this question is more properly posed by treating this prediction as a
random variable, Yˆ(s0), and asking how it can be determined as a function of the random
variables, {Y(s1),..,Y(sn0)}, associated with the observed data. As with kernel smoothers,
we again hypothesize that Yˆ(s0) can be represented as some linear combination of these
variables,

(6.1.2)   $\hat{Y}(s_0) \;=\; \sum_{i=1}^{n_0} \lambda_{0i}\, Y(s_i)$
where the weights λ0i are yet to be determined. This fundamental hypothesis shall be
referred to as the linear prediction hypothesis.
In contrast to kernel smoothing, the unknown weights λ0i in (6.1.2) need not be simple
functions of distance (so that the weights λ0i in Figure 6.1 now replace the distances d0i
in Figure 5.2).2 In any
case, the key strategy of kriging models is to choose weights that are “statistically
optimal” in an appropriate sense. To motivate this approach in the simplest way, we
begin by designating the difference between Yˆ(s0) and the unknown true random
variable, Y(s0), as the prediction error,

(6.1.3)   $e(s_0) \;=\; Y(s_0) - \hat{Y}(s_0)$
This prediction error will play a fundamental role in the analysis to follow. But before
proceeding, it is important to distinguish prediction error, e(s0), from the random effect,
ε(s0). By definition, this random effect has zero mean, E[ε(s0)] = 0, but there is no a
priori reason why prediction error should share this property.
2 One would expect that points si closer to s0 will tend to have larger weights, λ0i. However, we shall see
in Section 6.2.3 below that this is not true, even when spatial correlations decrease with distance.
However, it is clearly desirable that prediction errors satisfy this zero-mean property, i.e.,
that prediction error on average be zero. Indeed, this is our first statistical optimality
criterion, usually referred to as the unbiasedness criterion:

(6.1.4)   $E[e(s_0)] \;=\; E[Y(s_0) - \hat{Y}(s_0)] \;=\; 0$

All predictors, Yˆ(s0), satisfying both (6.1.2) and (6.1.4) are referred to as linear unbiased
predictors of Y(s0). In these terms, our single most important optimality criterion is that
among all possible linear unbiased predictors, the prediction error of Yˆ(s0) should be as
“close to zero” as possible. While there are many ways to define “closeness to zero”, for
the case of random prediction error it is natural to require that the mean squared error,
E[e(s0)²], be as small as possible.3 Hence our third criterion, designated as the efficiency
criterion, is that Yˆ(s0) have minimum mean squared error among all linear unbiased
predictors.
This criterion is so pervasive in the statistical literature that it is given many different
names. On the one hand, if we abbreviate “minimum mean squared error” as MMSE,
then such predictors are often called MMSE predictors. In addition, notice that since
unbiasedness (E[e(s0)] = 0) implies

(6.1.5)   $\operatorname{var}[e(s_0)] \;=\; E[e(s_0)^2] - \big(E[e(s_0)]\big)^2 \;=\; E[e(s_0)^2]$

it follows that mean squared error is here identical to the variance of prediction error, so
that such predictors are also designated as Best Linear Unbiased (BLU) predictors.
Within this general framework we consider four different kriging models, proceeding
from simpler to more general models. These models are each characterized by the
specific assumptions made about the properties of the underlying spatial stochastic
process, {Y(s) = μ(s) + ε(s) : s ∈ R}. For all such models, we start with a fundamental
3 Another possibility would be to require that the mean absolute error, E[|e(s0)|], be as small as possible.
However, since the absolute-value function is not differentiable at zero, this criterion turns out to be much
more difficult to analyze.
normality assumption about spatial random effects. In particular, for each finite set of
locations, {si : i = 1,..,n}, in region R, it will be assumed that the associated spatial random
effects, [ε(si) : i = 1,..,n], are multi-normally distributed.4 Since E[ε(s)] = 0 by definition,
this distribution is determined entirely by the covariances, cov[ε(si), ε(sj)], i, j = 1,..,n.
Hence the assumptions characterizing each model can be summarized in terms of
assumptions about (i) the spatial trend, μ(s), and (ii) the covariances, cov[ε(s), ε(s′)],
between pairs of random errors.
Here “simple” refers to the (rather heroic!) assumption that the underlying stochastic process
itself is entirely known. In addition, it is also assumed that the spatial trend is constant.
More formally, this amounts to the assumptions:

(6.1.6)   $\mu(s) \equiv \mu\,,\ \text{known}\,, \quad s \in R$

(6.1.7)   $\operatorname{cov}[\varepsilon(s), \varepsilon(s')]\ \text{known}\,, \quad s, s' \in R$
Before proceeding, it is reasonable to ask why one would even want to consider this
model. Since all parameters of the stochastic process are determined outside the model, it
would appear that there is nothing left to be done. But remember that the underlying
stochastic process model serves only as a statistical framework for carrying out spatial
prediction. In particular, given any location, s0 ∈ R, and associated prediction set,
S(s0) = {s1,..,sn0}, the basic task is to predict a value for Y(s0) given observed values of
{Y(s1),..,Y(sn0)}. So in terms of the linear prediction hypothesis in (6.1.2), the key
prediction weights, (λ0i : i = 1,..,n0), are still unknown, i.e., are yet to be determined.
Hence the chief advantage of this simple kriging model from a conceptual viewpoint is to
4 In addition there is an obvious “consistency” condition that must also be satisfied. For example, if
{Y(s1), Y(s2)} is bivariate normal, then the univariate normal distributions for subsets {Y(s1)} and {Y(s2)}
must of course be the marginal distributions of {Y(s1), Y(s2)}. More generally, each subset of size k from
the n-variate normal, {Y(s1),..,Y(sn)}, must have precisely the corresponding k-variate marginal normal
distribution.
5 A somewhat more accurate terminology would be to use “exogenous” and “endogenous”. But the
terms “known” and “unknown” are so widely used that we choose to stay with this convention.
allow us to derive optimal prediction weights without having to worry about estimating
other unknown parameters at the same time.
The only difference between this model and simple kriging is that the constant mean, μ,
is now assumed to be unknown, and hence must be estimated within the model. More
formally, it is assumed that

(6.1.8)   $\mu(s) \equiv \mu\,,\ \text{unknown}\,, \quad s \in R$

This ordinary kriging model is in fact the simplest kriging model that is actually used in
practice. As will be seen below, the constant-mean assumption (6.1.8) allows both the
mean and covariances to be estimated in a direct way from observed data. So a practical
estimation procedure is available for this model. However, one may still ask why this
model is of any interest from a spatial viewpoint when all variations in spatial trends are
assumed away. The key point to keep in mind here is that spatial variation is still present
in this model, but all such variation is assumed to be captured by the covariance structure
of the model. We shall return to this issue in Section 6.3 below.
We turn now to kriging models that do allow for explicit variation in the trend function,
μ(s). The simplest of these, designated as the universal kriging model, allows the trend
function to be modeled as a linear function of spatial attributes, but maintains the
assumption that all covariances are known. More formally, if we now let
x(s) = [x1(s),..,xk(s)]′ denote a (column) vector of spatial attributes [which may include
the coordinate attributes, s = (s1,s2), themselves], and let β = (β1,..,βk)′ denote a
corresponding vector of coefficients, then this model is characterized by the assumptions
that μ(s) = x(s)′β, with β unknown, and that all covariances, cov[ε(s), ε(s′)], are known.
Here it should be emphasized that “linear” means linear in the parameters (β). For example,
if x(s) = [1, s1, s2, s1², s2², s1s2]′, so that μ(s) = x(s)′β is a quadratic function of the
coordinates, then this trend function is still linear in β.
Our final kriging model relaxes the assumption that covariances are known. More
formally, this geostatistical kriging model (or simply, geo-kriging model) is characterized
by the assumptions that μ(s) = x(s)′β, with β unknown, and that the covariances,
cov[ε(s), ε(s′)], are also unknown.
In this model, the spatial trend parameters, β, as well as all covariance parameters, must
be simultaneously estimated. While this procedure is clearly more complex from an
estimation viewpoint, it provides the most general framework for spatial prediction in
terms of prior assumptions. Hence our ultimate goal in this part of the NOTEBOOK is to
develop this geostatistical kriging model in full, and show how it can be estimated.
To develop the basic idea of kriging, we start by assuming, as in (6.1.6) and (6.1.7) above,
that the relevant spatial stochastic process, {Y(s) = μ(s) + ε(s) : s ∈ R}, has a constant
mean, E[Y(s)] = μ(s) ≡ μ, and that this mean value, μ, together with all covariances,
cov[ε(s), ε(s′)], s, s′ ∈ R, have already been estimated. We shall return to such estimation
questions below. But for the present we simply take all these values to be given. In this
setting, observe that if we want to predict a value, Y(s0), at some location, s0 ∈ R, then
since μ(s0) = μ is already known, we see from the identity,

(6.2.1)   $Y(s_0) \;=\; \mu + \varepsilon(s_0)$

that it suffices to predict the associated error, ε(s0). Moreover, if we are given a finite
set of sample points, {s1,..,sn} ⊂ R, where observations, {y(s1),..,y(sn)}, have been made,
then in fact we have already “observed” values of the associated errors, namely,

(6.2.2)   $\varepsilon(s_i) \;=\; y(s_i) - \mu \;, \quad i = 1,..,n$

Hence if S(s0) = {s1,..,sn0} ⊆ {s1,..,sn} denotes the relevant prediction set at s0, then the
linear prediction hypothesis for ε(s0) in this setting reduces to finding a linear
combination,

(6.2.3)   $\hat{\varepsilon}(s_0) \;=\; \sum_{i=1}^{n_0} \lambda_{0i}\, \varepsilon(s_i)$
of these observed error values, and then setting

(6.2.4)   $\hat{Y}(s_0) \;=\; \mu + \hat{\varepsilon}(s_0)$

Note that since by definition all errors, ε(s), s ∈ R, have zero means, it then follows at once
from (6.2.1) and (6.2.4) together with the linearity of expectations that

(6.2.5)   $E[Y(s_0) - \hat{Y}(s_0)] \;=\; E\Big[\varepsilon(s_0) - \sum\nolimits_{i=1}^{n_0} \lambda_{0i}\,\varepsilon(s_i)\Big] \;=\; 0$

and hence that the unbiasedness condition is automatically satisfied for Yˆ(s0) [and
indeed, for every possible linear estimator given by (6.2.3) and (6.2.4)]. This means that
for simple kriging, BLU prediction reduces precisely to Minimum Mean Squared Error
(MMSE) prediction. So the task remaining is to find the vector of weights,
λ0 = (λ0i : i = 1,..,n0), in (6.2.3) that minimizes mean squared error:

(6.2.6)   $MSE(\lambda_0) \;=\; E\big\{[Y(s_0) - \hat{Y}(s_0)]^2\big\} \;=\; E\big\{[\varepsilon(s_0) - \hat{\varepsilon}(s_0)]^2\big\}$
Here it might seem that without further information about the distributions of these
errors, one could say very little. But surprisingly, it is enough to know their first and
second moments [as assumed in (6.1.6) and (6.1.7) above]. To see this, we begin by
introducing some simplifying notation. First, as in (1.1.1) above, we drop the explicit
reference to locations and now write simply

(6.2.7)   $\varepsilon(s_i) \;=\; \varepsilon_i \;, \quad i = 0,1,..,n_0$

[Here it is worth noting that the choice of “0” for the prediction location is very
convenient in that it often allows this location to be indexed together with its predictor
locations, as in (6.2.7).] Next, recalling that E(εi) = 0, it follows that variances and
covariances for the predictor variables can be represented, respectively, as

(6.2.8)   $\sigma_{ii} \;=\; \operatorname{var}(\varepsilon_i) \;=\; E(\varepsilon_i^2) \;, \quad i = 1,..,n_0$

(6.2.9)   $\sigma_{ij} \;=\; \operatorname{cov}(\varepsilon_i, \varepsilon_j) \;=\; E(\varepsilon_i \varepsilon_j) \;, \quad i \neq j$
Similarly, the variance of ε0 and its covariances with the predictor variables are given by

(6.2.10)   $\operatorname{var}(\varepsilon_0) \;=\; E(\varepsilon_0^2) \;=\; \sigma^2$

(6.2.11)   $\sigma_{0i} \;=\; \operatorname{cov}(\varepsilon_0, \varepsilon_i) \;=\; E(\varepsilon_0 \varepsilon_i) \;, \quad i = 1,..,n_0$
Notice in particular that in the variance expression (6.2.10) we have omitted subscripts
and written simply σ00 = σ². This variance will play a special role in many of the
expressions to follow. Moreover, since only stationary models of covariance will actually
be used in our kriging applications, this variance will be independent of location s0.6 In
these terms, we can now write mean squared error explicitly in terms of these parameter
values as follows:
(6.2.12)   $MSE(\lambda_0) \;=\; E\big\{[\varepsilon(s_0) - \hat{\varepsilon}(s_0)]^2\big\} \;=\; E\Big[\Big(\varepsilon_0 - \sum\nolimits_{i=1}^{n_0} \lambda_{0i}\varepsilon_i\Big)^2\Big]$

$\qquad\qquad\;=\; E\Big[\varepsilon_0^2 \;-\; 2\,\varepsilon_0 \sum\nolimits_{i=1}^{n_0} \lambda_{0i}\varepsilon_i \;+\; \Big(\sum\nolimits_{i=1}^{n_0} \lambda_{0i}\varepsilon_i\Big)^2\Big]$

$\qquad\qquad\;=\; E(\varepsilon_0^2) \;-\; 2\,E\Big(\varepsilon_0 \sum\nolimits_{i=1}^{n_0} \lambda_{0i}\varepsilon_i\Big) \;+\; E\Big[\Big(\sum\nolimits_{i=1}^{n_0} \lambda_{0i}\varepsilon_i\Big)^2\Big]$

But since

(6.2.13)   $E\Big(\varepsilon_0 \sum\nolimits_{i=1}^{n_0} \lambda_{0i}\varepsilon_i\Big) \;=\; E\Big(\sum\nolimits_{i=1}^{n_0} \lambda_{0i}\,\varepsilon_0\varepsilon_i\Big) \;=\; \sum\nolimits_{i=1}^{n_0} \lambda_{0i}\,E(\varepsilon_0\varepsilon_i) \;=\; \sum\nolimits_{i=1}^{n_0} \lambda_{0i}\,\sigma_{0i}$

and since the identity,

(6.2.14)   $\Big(\sum\nolimits_{i=1}^{n} x_i\Big)^2 \;=\; \Big(\sum\nolimits_{i=1}^{n} x_i\Big)\Big(\sum\nolimits_{j=1}^{n} x_j\Big) \;=\; \sum\nolimits_{i=1}^{n}\sum\nolimits_{j=1}^{n} x_i x_j$

implies that

(6.2.15)   $E\Big[\Big(\sum\nolimits_{i=1}^{n_0} \lambda_{0i}\varepsilon_i\Big)^2\Big] \;=\; \sum\nolimits_{i=1}^{n_0}\sum\nolimits_{j=1}^{n_0} \lambda_{0i}\lambda_{0j}\,E(\varepsilon_i\varepsilon_j) \;=\; \sum\nolimits_{i=1}^{n_0}\sum\nolimits_{j=1}^{n_0} \lambda_{0i}\lambda_{0j}\,\sigma_{ij}$

it then follows that mean squared error can be written as

(6.2.16)   $MSE(\lambda_0) \;=\; \sigma^2 \;-\; 2\sum\nolimits_{i=1}^{n_0} \lambda_{0i}\,\sigma_{0i} \;+\; \sum\nolimits_{i=1}^{n_0}\sum\nolimits_{j=1}^{n_0} \lambda_{0i}\lambda_{0j}\,\sigma_{ij}$
6 Note that this also implies that subscripts could be dropped on all predictor variances, σii. But here it is
convenient to maintain these subscripts so that expressions involving all predictor variances and
covariances can be stated more easily.
Thus mean squared error, MSE(λ0), is seen to be a simple quadratic function of the
unknown vector of weights, λ0 = (λ0i : i = 1,..,n0), with known coefficients given by the
variance-covariance parameters in (6.2.8) and (6.2.9). This means that one can actually
minimize this function explicitly and determine the desired unknown weights. As shown
in Appendix A2, such quadratic minimization problems are easily solved in terms of
vector partial differentiation. But to illustrate the main ideas, it is instructive to consider a
simple case not requiring vector analysis.
Consider the one-predictor case shown in Figure 6.2 below. Here the task is to predict
Y(s0) on the basis of a single observation, Y(s1), at a nearby location, s1 [so that the
relevant prediction set is simply S(s0) = {s1}].
[Figure 6.2. One-predictor case: Yˆ(s0) predicted from Y(s1) with weight λ01]
While such “sparse” predictions are of little interest from a practical viewpoint, the
derivation of a BLU predictor in this case is completely transparent. If we let

(6.2.17)   $Y(s_i) \;=\; \mu + \varepsilon_i \;, \quad i = 0,1$
so that the expression for mean squared error takes the simple form

$MSE(\lambda_{01}) \;=\; \sigma^2 \;-\; 2\,\lambda_{01}\sigma_{01} \;+\; \lambda_{01}^2\,\sigma_{11}$

[Figure: convex parabola MSE(λ01) plotted against λ01, with minimum at λ̂01]
Figure 6.3. Optimal Weight Estimate
Hence the second derivative is positive everywhere (as in the figure), and it follows that
the unique optimal weight, λ̂01, is given by the solution of the first-order condition,
−2σ01 + 2λ01σ11 = 0, i.e., by λ̂01 = σ01/σ11. In this example, σ11 = 1 and σ01 = 0.5, so
that the resulting optimal estimate in (6.2.21) is λ̂01 = 0.5.
In this simple case, the interpretation of this optimal weight is also clear. Note first that if
the covariance, σ01 = cov(ε0, ε1), between ε0 and ε1 is zero (so that these random
variables are uncorrelated), then λ̂01 = 0. In other words, if they are uncorrelated, then ε1
provides no information for predicting ε0, and one can do no better than to ignore ε1
altogether.8 Moreover, as this covariance increases, ε1 is expected to provide more
information about ε0, and the optimal weight on ε1 increases. On the other hand, as the
variance, σ11 = var(ε1), of this predictor increases, the optimal weight, λ̂01, decreases.
This reflects the fact that a larger variance in ε1 decreases its reliability as a predictor.
Finally, given this optimal weight, λ̂01, it then follows from (6.2.4) together with (6.2.18)
that the resulting optimal prediction, Yˆ(s0), in Figure 6.2 is given by

(6.2.22)   $\hat{Y}(s_0) \;=\; \mu + \hat{\lambda}_{01}\,\varepsilon_1 \;=\; \mu + \frac{\sigma_{01}}{\sigma_{11}}\,[\,y(s_1) - \mu\,]$
As we shall see below, these results are mirrored in the general case of more than one
predictor.
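As a quick numerical check of this one-predictor case, the following MATLAB fragment
(a minimal sketch using the illustrative values σ11 = 1 and σ01 = 0.5 from above, together
with hypothetical values for μ and y(s1)) computes λ̂01 and the prediction in (6.2.22):

   sig11 = 1;  sig01 = 0.5;       % predictor variance and covariance with eps0
   mu = 10;    y1 = 12;           % hypothetical mean and observed value y(s1)
   lam01 = sig01 / sig11          % optimal weight (= 0.5, as in the text)
   Yhat0 = mu + lam01*(y1 - mu)   % simple kriging prediction (6.2.22) (= 11)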
Given the above results for a single predictor, we now generalize this setting to many
predictors. The main objective of this section is to reformulate (6.2.16) in vector terms,
and to use this formulation to extend expression (6.2.22) to the general vector of
optimal prediction weights, λ̂0 = (λ̂0i : i = 1,..,n0), for Simple Kriging. A complete
mathematical derivation of this result is given in Section A2.7.1 of Appendix A2. To
begin with, let the full covariance matrix for ε0 = ε(s0), together with its corresponding
prediction set of error values, εi = ε(si), be denoted by
(6.2.23)   $C_0 \;=\; \begin{bmatrix} \sigma^2 & \sigma_{01} & \cdots & \sigma_{0n_0} \\ \sigma_{01} & \sigma_{11} & \cdots & \sigma_{1n_0} \\ \vdots & \vdots & & \vdots \\ \sigma_{0n_0} & \sigma_{n_0 1} & \cdots & \sigma_{n_0 n_0} \end{bmatrix}$
8 Note in particular that for the present case of multi-normally distributed errors, zero correlation is
equivalent to statistical independence.
The partitioning shown in this matrix identifies its relevant components. Given the
ordering, i = 0,1,..,n0, of both rows and columns, the upper left hand corner denotes the
variance of ε0. The column vector below this value (and the row vector to the right)
identifies the covariances of ε0 with each predictor variable, εi, i = 1,..,n0, and is now
denoted by

(6.2.24)   $c_0 \;=\; \begin{bmatrix} \sigma_{01} \\ \vdots \\ \sigma_{0n_0} \end{bmatrix}$

Finally, the matrix to the lower right is the covariance matrix for all predictor variables,
εi, i = 1,..,n0, and is now denoted by

(6.2.25)   $V_0 \;=\; \begin{bmatrix} \sigma_{11} & \cdots & \sigma_{1n_0} \\ \vdots & & \vdots \\ \sigma_{n_0 1} & \cdots & \sigma_{n_0 n_0} \end{bmatrix}$

In these terms, the full covariance matrix, C0, can be given the compact form,

(6.2.26)   $C_0 \;=\; \begin{bmatrix} \sigma^2 & c_0' \\ c_0 & V_0 \end{bmatrix}$
It is the components of this partitioned matrix that form the basic elements of all kriging
analysis. In particular, for the vector of unknown weights, λ0 = (λ0i : i = 1,..,n0), the mean
squared error function, MSE(λ0), in (6.2.16) can now be written in vector terms as
follows:

(6.2.27)   $MSE(\lambda_0) \;=\; \sigma^2 \;-\; 2\,\lambda_0' c_0 \;+\; \lambda_0' V_0\, \lambda_0$

[which can be checked by applying (6.2.24) and (6.2.25) together with the rules of matrix
multiplication]. By minimizing this function with respect to the components of λ0, it is
shown in expression (A2.7.20) of the Appendix that the optimal weight vector,
λ̂0 = (λ̂0i : i = 1,..,n0), is given by

(6.2.28)   $\hat{\lambda}_0 \;=\; V_0^{-1} c_0$

Hence, letting ε = (ε1,..,εn0)′ denote the vector of predictors for ε0, it follows that the
BLU predictor of ε0 is given by

(6.2.29)   $\hat{\varepsilon}_0 \;=\; \hat{\lambda}_0' \varepsilon \;=\; c_0' V_0^{-1} \varepsilon$

so that the corresponding Simple Kriging prediction of Y(s0) is given by

(6.2.30)   $\hat{Y}_0 \;=\; \mu + \hat{\varepsilon}_0 \;=\; \mu + c_0' V_0^{-1} \varepsilon$
By way of comparison with the single-predictor case above, note that in the present
setting, this case takes the form,

(6.2.31)   $C_0 \;=\; \begin{bmatrix} \sigma^2 & \sigma_{01} \\ \sigma_{01} & \sigma_{11} \end{bmatrix}$
predictors provide no information. More generally, suppose that all predictors are
uncorrelated, i.e., that σij = 0 for all i, j = 1,..,n0 (i ≠ j). Then V0 reduces to a positive
diagonal matrix with inverse given by the diagonal of reciprocals, i.e.,

(6.2.32)   $V_0 \;=\; \begin{bmatrix} \sigma_{11} & & \\ & \ddots & \\ & & \sigma_{n_0 n_0} \end{bmatrix} \;\;\Rightarrow\;\; V_0^{-1} \;=\; \begin{bmatrix} \sigma_{11}^{-1} & & \\ & \ddots & \\ & & \sigma_{n_0 n_0}^{-1} \end{bmatrix}$
So if all predictors are uncorrelated, then the contribution of each predictor, εi, to ε̂0 in
(6.2.3) is the same as if it were a single predictor. In particular, it has zero contribution if
and only if it is uncorrelated with ε0.
However, if such predictors are to some degree correlated, then optimal prediction
involves a rather complex interaction between the covariances, V0, among predictors and
their covariances, c0, with ε0. In particular, if σ0i = 0, then it is possible that interactions
between both ε0 and εi with other predictors may result in either positive or negative
values for λ̂0i. As one illustration, suppose there are two predictors, (ε1, ε2), with

(6.2.34)   $V_0 \;=\; \begin{bmatrix} 1 & 1/2 \\ 1/2 & 1 \end{bmatrix} \;, \qquad c_0 \;=\; \begin{bmatrix} 0 \\ 1/2 \end{bmatrix}$

so that ε1 is uncorrelated with ε0, but both have positive covariance (1/2) with ε2. Then
it can be verified in this case that

(6.2.35)   $\hat{\lambda}_0 \;=\; V_0^{-1} c_0 \;=\; \begin{bmatrix} 4/3 & -2/3 \\ -2/3 & 4/3 \end{bmatrix} \begin{bmatrix} 0 \\ 1/2 \end{bmatrix} \;=\; \begin{bmatrix} -1/3 \\ 2/3 \end{bmatrix}$
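This calculation is easily checked in MATLAB (a one-line numerical verification of
(6.2.35)):

   V0 = [1, 1/2; 1/2, 1];   % predictor covariance matrix in (6.2.34)
   c0 = [0; 1/2];           % covariances of the predictors with eps0
   lam0 = V0 \ c0           % optimal weights: returns [-1/3; 2/3], as in (6.2.35)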
So even though all covariances (and hence correlations) are nonnegative, the optimal
weight on ε1 is actually negative. This shows that in the general case the interpretation of
individual weights is much more complex. Indeed, it turns out in this case that the only
quantity that can meaningfully be interpreted is the full linear combination of predictors
in (6.2.29), i.e.,

$\hat{\varepsilon}_0 \;=\; \hat{\lambda}_{01}\varepsilon_1 + \hat{\lambda}_{02}\varepsilon_2 \;=\; -\tfrac{1}{3}\,\varepsilon_1 + \tfrac{2}{3}\,\varepsilon_2$
As expected, we see that ε2 contributes positively to the prediction, ε̂0, and makes a
more influential contribution than ε1. But the negative influence of ε1 is less intuitive.
To gain further insight here, notice that by definition,

(6.2.38)   $MSE(\lambda_0) \;=\; \sigma^2 \;-\; 2\operatorname{cov}(\hat{\varepsilon}_0, \varepsilon_0) \;+\; \operatorname{var}(\hat{\varepsilon}_0)$
But since σ² is a constant not involving ε̂0, it becomes clear that minimization of
MSE(λ0) essentially involves a tradeoff between the covariance of the predictor ε̂0 with
ε0 and the variance of the predictor itself. Indeed, this is the proper generalization of the
original interpretation given in the single-predictor case, where the relevant covariance
and variance in that case were simply σ01 and σ11, respectively. Moreover, the form of
this tradeoff in (6.2.38) makes it clear that to minimize MSE(λ0), one needs a predictor,
ε̂0, with positive covariance, cov(ε̂0, ε0), as large as possible, while at the same time
having a variance, var(ε̂0), as small as possible. It is from this viewpoint that the
negativity of λ̂01 in (6.2.35) can be made clear. To see this, observe that since σ01 = 0,
covariance in this case takes the form

$\operatorname{cov}(\hat{\varepsilon}_0, \varepsilon_0) \;=\; \lambda_{01}\sigma_{01} + \lambda_{02}\sigma_{02} \;=\; \lambda_{02}\,\sigma_{02}$

But since σ02 = 1/2 > 0, it follows that this covariance can only be positive if λ02 > 0.
Turning next to variance, observe that for any two-predictor case,

(6.2.42)   $\operatorname{var}(\hat{\varepsilon}_0) \;=\; \lambda_0' V_0 \lambda_0 \;=\; (\lambda_{01}, \lambda_{02}) \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix} \begin{bmatrix} \lambda_{01} \\ \lambda_{02} \end{bmatrix} \;=\; \lambda_{01}^2\sigma_{11} + 2\,\lambda_{01}\lambda_{02}\,\sigma_{12} + \lambda_{02}^2\sigma_{22}$

But since the first term is always positive and since σ12 = 1/2 > 0, we see from the
positivity of λ02 above that var(ε̂0) can only be made small by requiring that λ01 < 0. In
short, since ε1 has no effect on the correlation of the predictor, ε̂0, with ε0, its best use
for prediction is to shrink the variance of ε̂0 by setting λ01 < 0.
Before using these kriging weights for prediction, it is of natural interest to consider their
spatial nature. In particular, referring again to our initial illustration in Figure 6.1, it
would seem reasonable that points, si, closer to s0 should have larger weights, λ̂0i. In
particular, if we invoke the “standard covariogram” assumption of Figure 4.1 in Section 4,
namely that covariances decrease with distance, then points further away should
contribute less to the prediction of Y(s0). But for Simple Kriging predictors this is simply
not the case. One simple example is shown in Figure 6.4 below:
[Figure 6.4. Map of the prediction point s0 and data points s1,..,s4, with the following table:]

   point      s1      s2      s3      s4
   distance   1.41    2.24    2.83    3.16
   weight     .555    .045    .306    .095
   rank       1       4       2       3
Here the points (s1, s2, s3, s4) are ordered in terms of their distance from the prediction
point, s0, as shown in the second row of the table.9 To calculate weights in this case, a simple
exponential covariogram was used.10 So in this spatially stationary setting, covariances
are strictly decreasing in distance. Hence the key point to notice is that the kriging
weights (λ̂01, λ̂02, λ̂03, λ̂04) in the third row of the table are not decreasing in distance.
Indeed, the second closest point, s2, is here the least influential of the four (as depicted by
the ranking of weights in the last row of the table). Notice that since s1 and s2 are closer
to each other than to s0, and since distances are in this case inversely related to
correlations,11 the errors ε(s1) and ε(s2) are more correlated with each other than either
is with ε(s0). So it might be argued here that ε(s2) adds little prediction information
for ε(s0) beyond that in ε(s1). But notice that the influence of points s3 and s4 is also
reversed, and that no such relative correlation effects are present here. So even in this
simple monotone-covariance setting, it is difficult to draw general conclusions about the
exact relation between distance and kriging weights.
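Computations of this type are easy to set up in MATLAB. The following minimal sketch
uses the point coordinates of footnote 9 and the parameters of footnote 10; the exact
exponential covariogram form in (4.6.6) is given in Section 4, and the common
specification C(h) = s·exp(−3h/r) is assumed here purely for illustration:

   S  = [1 1; 2 1; -2 2; -1 3];            % data points s1,..,s4 (footnote 9)
   s0 = [0 0];                             % prediction point
   r = 30; sill = 1;                       % range and sill (footnote 10; nugget = 0)
   Cfun = @(h) sill * exp(-3*h/r);         % assumed exponential covariogram form
   dx = S(:,1) - S(:,1)'; dy = S(:,2) - S(:,2)';
   V0 = Cfun(sqrt(dx.^2 + dy.^2));         % covariances among the predictors
   c0 = Cfun(sqrt(sum((S - s0).^2, 2)));   % covariances with the prediction point
   lam0 = V0 \ c0                          % simple kriging weights, as in (6.2.28)

With these (or similar) settings, the computed weights need not decrease with distance
from s0, in the spirit of the table above.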
While these illustrations are necessarily selective in nature, they do serve to emphasize
the complexity of possible interaction effects in MMSE prediction. Given this
development of Simple Kriging predictors, we turn now to the single most important
justification for such stochastic predictors, namely the construction of meaningful
prediction intervals for possible realized values of Y ( s0 ) .
Note that up to this point we have relied only on knowledge of the means and
covariances of the spatial error process, {ε(s) : s ∈ R}, to derive optimal predictors. But to
develop prediction intervals for these errors, we must now make explicit use of the
9 The actual point coordinates are s0 = (0,0), s1 = (1,1), s2 = (2,1), s3 = (−2,2), and s4 = (−1,3).
10 With respect to the notation in expression (4.6.6) of Section 4, the range, sill, and nugget parameters used
were (r = 30, s = 1, a = 0).
11 Recall from (3.3.13) in Section 3 that spatially stationary correlations are proportional to covariances.
normality assumption introduced above. In particular, if we write the relevant vector of
errors as

(6.2.43)   $\begin{bmatrix} \varepsilon(s_0) \\ \varepsilon(s_1) \\ \vdots \\ \varepsilon(s_{n_0}) \end{bmatrix} \;=\; \begin{bmatrix} \varepsilon_0 \\ \varepsilon \end{bmatrix}$

then this multi-normality assumption [together with (6.2.23) and (6.2.26)] implies that12

(6.2.44)   $\begin{bmatrix} \varepsilon_0 \\ \varepsilon \end{bmatrix} \;\sim\; N\!\left( \begin{bmatrix} 0 \\ 0_{n_0} \end{bmatrix},\; \begin{bmatrix} \sigma^2 & c_0' \\ c_0 & V_0 \end{bmatrix} \right)$
Our primary application of this distribution will be to derive the distribution of the
associated prediction error in (6.1.3), which we now write simply as e0 = e(s0). But
before proceeding, it is important to emphasize once again the distinction between ε0 and
e0. Recall that ε0 is the deviation of Y(s0) about its mean [ε0 = Y(s0) − μ], while e0 is
the difference between Y(s0) and its predicted value [e0 = Y(s0) − Yˆ(s0)].
To derive the distribution of e0 from that of the random error vector, (ε0, ε),13 we begin by
using (6.2.1), (6.2.4) and (6.2.29) to write e0 in terms of (ε0, ε) as follows:

(6.2.45)   $e_0 \;=\; Y(s_0) - \hat{Y}(s_0) \;=\; \varepsilon(s_0) - \hat{\varepsilon}(s_0) \;=\; \varepsilon_0 - \hat{\lambda}_0'\varepsilon \;=\; (1,\, -\hat{\lambda}_0') \begin{bmatrix} \varepsilon_0 \\ \varepsilon \end{bmatrix}$
12 Here 0n0 denotes the n0-dimensional zero vector.
13 Note that technically this vector should be written as (ε0, ε′)′ to indicate that it is a column vector.
But for the sake of visual clarity, we write simply (ε0, ε).
In view of the importance of this particular variance, we derive it in two ways. First we
derive it directly from the covariance-transformation identity in (3.2.21) of Section 3. In
particular, for any linear compound, a′X, of a random vector, X, with covariance
matrix, Σ, it follows at once from (3.2.21) [with A = a′] that

(6.2.46)   $\operatorname{var}(a'X) \;=\; a' \Sigma\, a$

Hence by letting

(6.2.47)   $X \;=\; \begin{bmatrix} \varepsilon_0 \\ \varepsilon \end{bmatrix}, \qquad a \;=\; \begin{bmatrix} 1 \\ -\hat{\lambda}_0 \end{bmatrix}, \qquad \Sigma \;=\; \begin{bmatrix} \sigma^2 & c_0' \\ c_0 & V_0 \end{bmatrix}$

we see that

(6.2.48)   $\operatorname{var}(e_0) \;=\; (1,\, -\hat{\lambda}_0') \begin{bmatrix} \sigma^2 & c_0' \\ c_0 & V_0 \end{bmatrix} \begin{bmatrix} 1 \\ -\hat{\lambda}_0 \end{bmatrix} \;=\; (1,\, -\hat{\lambda}_0') \begin{bmatrix} \sigma^2 - c_0'\hat{\lambda}_0 \\ c_0 - V_0\hat{\lambda}_0 \end{bmatrix} \;=\; \sigma^2 - c_0'\hat{\lambda}_0 - \hat{\lambda}_0' c_0 + \hat{\lambda}_0' V_0 \hat{\lambda}_0$
But since for any vectors, x = (x1,..,xn)′ and y = (y1,..,yn)′, it must be true that
x′y = Σi xi yi = Σi yi xi = y′x, we see that (6.2.48) can be reduced to

(6.2.49)   $\operatorname{var}(e_0) \;=\; \sigma^2 \;-\; 2\,\hat{\lambda}_0' c_0 \;+\; \hat{\lambda}_0' V_0 \hat{\lambda}_0$

The form of the right hand side should look familiar. In particular, the representation of
mean squared error, MSE(λ0), in (6.2.27) now yields the identity,

$\operatorname{var}(e_0) \;=\; MSE(\hat{\lambda}_0)$

This relation is no coincidence. Indeed, recall from (6.1.5) that for any unbiased
predictor, ε̂0,

$E[e(s_0)^2] \;=\; \operatorname{var}[e(s_0)]$

so that its mean squared error is identically equal to the variance of its associated
prediction error. So for the optimal predictor in particular, this variance must be given by
the mean squared error evaluated at λ̂0. Indeed, we could have derived (6.2.49) through
this line of reasoning. Hence the direct derivation in (6.2.45) through (6.2.48) offers an
instructive confirmation of this fact.
To complete this derivation, it suffices to substitute the solution for λ̂0 in (6.2.28) [i.e.,
λ̂0 = V0⁻¹c0] into (6.2.49) to obtain

$\operatorname{var}(e_0) \;=\; \sigma^2 \;-\; 2\,c_0'V_0^{-1}c_0 \;+\; c_0'V_0^{-1}c_0$

By combining the last two terms, we obtain the final expression for prediction error
variance (also called kriging variance),

(6.2.53)   $\sigma_0^2 \;=\; \operatorname{var}(e_0) \;=\; \sigma^2 - c_0'V_0^{-1}c_0$

where we have now introduced the simplifying notation (σ0²) for this important quantity.
While this expression for σ0² is most useful for computational purposes, it is of interest to
develop an alternative expression that is easier to interpret. To do so, if we now use the
simplifying notation, Y0 = Y(s0), for the variable to be predicted at location s0 ∈ R, then
[as a consequence of (3.2.21)] it follows that the first term in (6.2.53) is simply the
variance of Y0, since

$\operatorname{var}(Y_0) \;=\; \operatorname{var}(\mu + \varepsilon_0) \;=\; \operatorname{var}(\varepsilon_0) \;=\; \sigma^2$

while the second term is precisely the variance of the predictor itself, i.e.,
$\operatorname{var}(\hat{Y}_0) = c_0'V_0^{-1}c_0$, so that $\sigma_0^2 = \operatorname{var}(Y_0) - \operatorname{var}(\hat{Y}_0)$.
In these terms, it is clear that the prediction error variance, σ0², is smaller than the original
variance of Y0. Moreover, the amount of this reduction is seen to be precisely the
variance “explained” by the predictor, Yˆ0. Indeed, it can be argued that this reduction in
variance is the essential gain from using the predictor Yˆ0 at all.
Given this expression for prediction error variance, it follows at once from the arguments
above that the prediction error, e0, must be normally distributed as

(6.2.57)   $e_0 \;\sim\; N(0, \sigma_0^2)$

Hence the task remaining is to use this normal distribution of e0 [= Y0 − Yˆ0] to construct
prediction intervals for Y0 in terms of Yˆ0 and σ0². To do so, we first recall from Sections
3.1.1 and 3.1.2 above that the standardization of e0 must be distributed as N(0,1). In
particular, since the mean of e0 is zero, and since the standard deviation of e0 is given
from (6.2.53) by $\sqrt{\operatorname{var}(e_0)} = \sigma_0$, it follows that

(6.2.58)   $\frac{Y_0 - \hat{Y}_0}{\sigma_0} \;=\; \frac{e_0}{\sigma_0} \;\sim\; N(0,1)$
Hence it now becomes clear that, together with Yˆ0, the key distributional parameter is the
standard deviation, σ0, of e0, which is usually designated as the standard error of
prediction. Indeed, as will be seen below, the fundamental outputs of all kriging software
are precisely estimates of the kriging prediction, Yˆ0, and the standard error of prediction,
σ0, at all relevant prediction locations, s0.
In particular, if for any confidence level, 1 − α, we let z_{α/2} denote the upper
α/2-quantile of the standard normal distribution, then

(6.2.59)   $\Pr\!\left( -z_{\alpha/2} \;\leq\; \frac{Y_0 - \hat{Y}_0}{\sigma_0} \;\leq\; z_{\alpha/2} \right) \;=\; 1 - \alpha$

[Figure: standard normal density with tail points −z_{α/2} and z_{α/2}]
Moreover, since the inequalities

(6.2.60)   $-z_{\alpha/2} \leq \frac{Y_0 - \hat{Y}_0}{\sigma_0} \leq z_{\alpha/2} \;\;\Leftrightarrow\;\; -\sigma_0 z_{\alpha/2} \leq Y_0 - \hat{Y}_0 \leq \sigma_0 z_{\alpha/2} \;\;\Leftrightarrow\;\; \hat{Y}_0 - \sigma_0 z_{\alpha/2} \leq Y_0 \leq \hat{Y}_0 + \sigma_0 z_{\alpha/2}$

are all equivalent, it follows that their probabilities must be the same, and hence from (6.2.59) that

(6.2.61)   $\Pr\big( \hat{Y}_0 - \sigma_0 z_{\alpha/2} \;\leq\; Y_0 \;\leq\; \hat{Y}_0 + \sigma_0 z_{\alpha/2} \big) \;=\; 1 - \alpha$

In other words, the probability that the value of Y0 lies between Yˆ0 − σ0 z_{α/2} and
Yˆ0 + σ0 z_{α/2} is 1 − α. In terms of confidence levels, this means that we can be 100(1−α)%
confident that Y0 lies in the corresponding prediction interval.
The single most common instance of (6.2.62) is the case α = 0.05, with
corresponding critical value z_{α/2} = z_{.025} = 1.96. In this case, one can thus be 95%
confident that Y0 lies in the prediction interval,

$\big[\, \hat{Y}_0 - 1.96\,\sigma_0 \;,\; \hat{Y}_0 + 1.96\,\sigma_0 \,\big]$
As with all statistical confidence statements, the phrase “95% confident” here means that
if we were able to carry out this same prediction procedure many times (i.e., to take many
random samples from the joint distribution of Y0 and its kriging prediction, Yˆ0), then we
would expect the realized values of Y0 to lie in the corresponding realized intervals,
[Yˆ0 ± (1.96)σ0], about 95% of the time.
Finally, it should again be emphasized that it is the ability to make confidence statements
of this type that distinguishes stochastic prediction methods from the deterministic
methods of spatial interpolation developed in Section 5.
In the following development, we again postulate that the values of some variable Y
defined over a relevant region R can be modeled by a spatial stochastic process,
{Y(s) = μ + ε(s) : s ∈ R}, with constant mean, μ. In addition, we assume the existence of
a given set of n observations (data points), {yi = y(si) : i = 1,..,n}, in R, where of course
each data point, yi, is taken to be a realization of the corresponding random variable,
Yi = Y(si), in this spatial stochastic process. Also, for purposes of illustration, we shall
again consider the problem of predicting Y(s0) at a single given location, s0 ∈ R, with
respect to a given prediction set, S(s0) = {s1,..,sn0} ⊆ {s1,..,sn}. Within this framework,
we can operationalize the Simple Kriging model as follows:
Recall from the assumption in (6.1.6) that our first task is to produce an estimate of the
mean, μ, outside the Simple Kriging model. Here the obvious choice is just to use the
sample mean of the given data, i.e.,

(6.2.62)   $\hat{\mu} \;=\; \bar{y}_n \;=\; \frac{1}{n}\sum\nolimits_{i=1}^{n} y(s_i) \;=\; \frac{1}{n}\sum\nolimits_{i=1}^{n} y_i$

Note that this estimator is unbiased, since

(6.2.63)   $E(\hat{\mu}) \;=\; E\Big(\frac{1}{n}\sum\nolimits_{i=1}^{n} Y_i\Big) \;=\; \frac{1}{n}\sum\nolimits_{i=1}^{n} E(Y_i) \;=\; \frac{1}{n}\sum\nolimits_{i=1}^{n} \mu \;=\; \mu$
So even though these random variables are spatially correlated, this has no effect on
unbiasedness. What spatial correlation does imply is that the variance of this estimator is
much larger than that of the classical sample mean under independence. We shall return
to this issue in the development of Ordinary Kriging in Section 6.3 below.
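The effect of spatial correlation on the variance of μ̂ is easy to see numerically. Since
var(μ̂) = (1/n²)·1′V1 for any covariance matrix V among the n observations, positively
correlated data inflate this variance relative to the independent-sample value σ²/n. A
minimal MATLAB sketch, using a hypothetical equicorrelated V purely for illustration:

   n = 100; sigma2 = 1; rho = 0.5;                % hypothetical settings
   V = sigma2 * ((1-rho)*eye(n) + rho*ones(n));   % equicorrelated covariance matrix
   var_indep = sigma2 / n                         % classical value: 0.01
   var_mu = ones(1,n) * V * ones(n,1) / n^2       % correlated value: 0.505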
Recall next from assumption (6.1.7) that the covariances, cov[ε(s), ε(s′)], are assumed to
be given for all locations, s, s′ ∈ R. But we must of course provide some prior estimates
of these covariances. This was in fact one of the primary motivations for the assumption
of covariance stationarity in Section 3.3.2 above. Hence we now invoke this assumption
in order to estimate spatial covariances in a manner that accounts for spatial correlation
effects. Recall also from Section 4.10.1 that, unlike the mean above, the classical estimate
of covariance is biased in the presence of spatial correlation. So our estimation procedure
here will always start with variograms rather than covariograms. Fortunately, this basic
estimation procedure is exactly the same as that used for Ordinary Kriging, and indeed,
for all more advanced kriging models. So it is worthwhile to develop this procedure in
detail here.
To do so, we begin by recalling from (3.3.7) and (3.3.11) in Section 3 that under
covariance stationarity, all covariances can be summarized by a covariogram, C(h). As
emphasized in Section 4, this is best estimated by first estimating a variogram,
γ(h; r, s, a), with parameters, r = range, s = sill, and a = nugget. Since the common
variance, C(0) = σ², is precisely the sill parameter, s, one can then obtain the desired
covariogram from the identity in (4.1.7) of Section 4, namely14

(6.2.64)   $C(h) \;=\; s - \gamma(h)$

Hence, the estimation procedure starts by using the MATLAB program, var_spher_plot,
together with the full sample data set above to obtain estimates, (r̂, ŝ, â), of the spherical
variogram parameters. The estimated spherical variogram, γ(h; r̂, ŝ, â), is then used
together with (6.2.64) to obtain an estimate, Ĉ(h), of the desired covariogram as follows:

(6.2.65)   $\hat{C}(h) \;=\; \hat{s} - \gamma(h;\, \hat{r}, \hat{s}, \hat{a})$

Recall that for any pair of points, s, s′ ∈ R, separated by distance ||s − s′|| = h, the quantity
Ĉ(h) then yields an estimate of cov[ε(s), ε(s′)], i.e.,

(6.2.66)   $\widehat{\operatorname{cov}}[\varepsilon(s), \varepsilon(s')] \;=\; \hat{C}(\|s - s'\|)$

Using this identity, we can then estimate the full covariance matrix, C0, relevant for
prediction at s0 [as in (6.2.23) above]. In particular, if we let dij = ||si − sj|| for each pair
of points, si, sj ∈ {s0, s1,..,sn0}, and [as instances of (6.2.66)] set

(6.2.67)   $\hat{\sigma}_{ij} \;=\; \hat{C}(d_{ij})$
14 Again, remember not to confuse the symbol, s, for “sill” with points, s = (s1,s2) ∈ R.
then the desired estimate of C0 is given by

(6.2.68)   $\hat{C}_0 \;=\; \begin{bmatrix} \hat{\sigma}^2 & \hat{\sigma}_{01} & \cdots & \hat{\sigma}_{0n_0} \\ \hat{\sigma}_{01} & \hat{\sigma}_{11} & \cdots & \hat{\sigma}_{1n_0} \\ \vdots & \vdots & & \vdots \\ \hat{\sigma}_{0n_0} & \hat{\sigma}_{n_0 1} & \cdots & \hat{\sigma}_{n_0 n_0} \end{bmatrix} \;=\; \begin{bmatrix} \hat{\sigma}^2 & \hat{c}_0' \\ \hat{c}_0 & \hat{V}_0 \end{bmatrix}$

Note in particular that the common variance, σ², of all random variables is again
estimated by the sill, since

(6.2.69)   $\hat{\sigma}^2 \;=\; \hat{C}(0) \;=\; \hat{s}$
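In MATLAB, this construction amounts to evaluating Ĉ(h) at all relevant pairwise
distances. The following minimal sketch assumes the standard spherical variogram form,
together with hypothetical arrays coords (n0 × 2) and s0 (1 × 2) holding the prediction-set
and prediction-point locations; the parameter values are illustrative:

   rhat = 21631; shat = 1.356; ahat = 0.340;   % e.g., fitted (range, sill, nugget)
   % Spherical covariogram: Chat(0) = shat, and Chat(h) = 0 for h > rhat
   Chat = @(h) (h==0).*shat + (h>0 & h<=rhat).*(shat-ahat) ...
               .*(1 - 1.5*(h/rhat) + 0.5*(h/rhat).^3);
   dx = coords(:,1) - coords(:,1)'; dy = coords(:,2) - coords(:,2)';
   V0hat = Chat(sqrt(dx.^2 + dy.^2));          % covariances among predictors
   d0 = sqrt(sum((coords - s0).^2, 2));        % distances from prediction point s0
   c0hat = Chat(d0);                           % covariances with s0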
Finally, given these parameter estimates, we are ready to estimate the Simple Kriging
prediction, Yˆ(s0), of Y(s0). To do so, begin by recalling that the deviation error,
εi = yi − μ, at each data point, i = 1,..,n0, can now be estimated in terms of (6.2.62) by

(6.2.70)   $\hat{\varepsilon}_i \;=\; y_i - \hat{\mu}$

Finally, by using (6.2.30) together with these estimates, it follows that the Simple Kriging
prediction of Y0 = Y(s0) is given by15

$\hat{Y}_0 \;=\; \hat{\mu} + \hat{c}_0' \hat{V}_0^{-1} \hat{\varepsilon} \;, \qquad \hat{\varepsilon} \;=\; (\hat{\varepsilon}_1,..,\hat{\varepsilon}_{n_0})'$
15 Here it should be noted that, for simplicity, we have used the same notation for the theoretical and
estimated Simple Kriging predictions, Yˆ0 (and ε̂0).
Similarly, as in (6.2.53), we may take

(6.2.74)   $\hat{\sigma}_0^2 \;=\; \hat{s} - \hat{c}_0' \hat{V}_0^{-1} \hat{c}_0$

or, more precisely, its square root,

(6.2.75)   $\hat{\sigma}_0 \;=\; \sqrt{\hat{s} - \hat{c}_0' \hat{V}_0^{-1} \hat{c}_0}$

to be the relevant estimate of the standard error of prediction at location s0. The pair of
values (Yˆ0, σ̂0) can then be used as in (6.2.61) to estimate the (default) 95% prediction
interval for Y0, namely,

$\big[\, \hat{Y}_0 - 1.96\,\hat{\sigma}_0 \;,\; \hat{Y}_0 + 1.96\,\hat{\sigma}_0 \,\big]$
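Continuing the MATLAB sketch above (and assuming a hypothetical data-value vector
yvals aligned with coords), these remaining estimates take only a few lines:

   muhat = mean(yvals);                     % sample-mean estimate (6.2.62)
   epshat = yvals - muhat;                  % estimated deviations (6.2.70)
   lamhat = V0hat \ c0hat;                  % estimated kriging weights (6.2.28)
   Yhat0 = muhat + lamhat' * epshat;        % simple kriging prediction
   sighat0 = sqrt(shat - c0hat' * lamhat);  % standard error of prediction (6.2.75)
   PI95 = [Yhat0 - 1.96*sighat0, Yhat0 + 1.96*sighat0]   % 95% prediction interval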
One final comment should be made about these estimates. In the theoretical development
of Section 6.2.2, the predictors ε̂0 and Yˆ0 were derived as Best Linear Unbiased (BLU)
predictors. This is only accurate if the true mean, μ, and covariances, C0, are known –
which is of course almost never the case. So to be accurate, the above values ε̂0 and
Yˆ0 are in fact only estimates of BLU predictors. This distinction is often formalized by
referring to such estimated predictors as empirical BLU predictors.
Given the estimation procedure above, we now illustrate an application of Simple Kriging
in terms of the Vancouver Nickel data in Section 4.9 above. But before developing this
example, it is important to emphasize that the underlying normality assumption on all
spatially-dependent random effects, ε(s), is crucial for the estimation of meaningful
prediction intervals. Moreover, since these random effects are not directly observable,
this distributional assumption can only be checked indirectly. But by assuming that there
are no global trends (as in Simple and Ordinary Kriging), it should be clear from the
identity

(6.2.77)   $Y(s_i) \;=\; \mu + \varepsilon(s_i) \;, \quad i = 1,..,n$

that these random effects differ from the observed data, {y(si) : i = 1,..,n}, only by a
(possibly unobserved) constant, μ. Moreover, since the variance, σ² = var[ε(si)] =
var[Y(si)], is constant for all covariance-stationary processes, it follows that under this
additional assumption, the marginal distributions must be the same for all Y data,
namely
(6.2.78)   $Y(s_i) \;\sim\; N(\mu, \sigma^2) \;, \quad i = 1,..,n$
So even though these are not independent samples from this common distribution, it is
still reasonable to expect that the histogram of this data should look approximately
normal. This motivates the following simple test of normality.
A very simple and appealing test of normality is available in JMP, known as the Normal
Quantile Plot (also called a Normal Probability Plot). The main appeal of this test is that
it is graphical and, in addition, provides global information about possible failures of
normality. The idea is very simple. Given a set of data, (y1,..,yn), from an unknown
distribution, one first reorders the data (if necessary) so that y1 ≤ y2 ≤ ... ≤ yn, and then
standardizes it by subtracting the sample mean, $\bar{y}_n = \frac{1}{n}\sum_{i=1}^{n} y_i$, and dividing by the sample
standard deviation, $s_n = \big[\frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y}_n)^2\big]^{1/2}$, to obtain:

(6.2.79)   $z_i \;=\; \frac{y_i - \bar{y}_n}{s_n} \;, \quad i = 1,..,n$

Now if (y1,..,yn) were coming from a normal distribution, then (z1,..,zn) should be
approximately distributed as Zi ~ N(0,1), i = 1,..,n [only approximately, since we are using
estimated means and standard deviations here]. So for an independent sample (Z1,..,Zn)
of size n from N(0,1), if we compute the theoretical expected values, ξi = E(Zi), i = 1,..,n,
of the ordered sample, then we would expect on average that the observed values zi in
(6.2.79) should be reasonably close to their expected values, ξi. This in turn implies that if
we plot zi against ξi, the points should lie close to the 45° line. This is illustrated in
Figure 6.5 below, where a sample of size n = 100 has been simulated in JMP (using
Formula → Random → Random Normal).
[Figure 6.5. Normal Quantile Plot of a simulated N(0,1) sample (n = 100), with top-axis values
−2.33, −1.64, −1.28, −0.67, 0.0, 0.67, 1.28, 1.64, 2.33]
The values on the vertical axis are exactly the zi values, together with their histogram
shown on the left. The Normal Quantile Plot is displayed on the right (using the
procedure detailed in Assignment 4). The values on the horizontal axis at the top of the
figure are precisely the expected values, ξi, for each zi, i = 1,..,100.16 Here it is clear
that all point pairs are indeed close to the 45° line (shown in red). The dashed lines
denote 95% probability intervals on the realized values zi, so that if the sample were
normal (as in this simulation), then each dot should lie between these bands about 95% of
the time.17 For example, the middle sample value, z50, with expected value,
ξ50 = E(Z50) ≈ 0, should lie in the interval between these two bands on the vertical green
center line about 95% of the time. So this plot provides compelling evidence that this
sample is indeed coming from a normal distribution.
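A rough MATLAB analogue of this plot is easy to construct for any data vector y. In the
sketch below, the expected values ξi are approximated by standard normal quantiles at
Blom-type plotting positions (an assumption; JMP's exact formula may differ):

   z = sort((y - mean(y)) ./ std(y));    % ordered standardized data, as in (6.2.79)
   n = numel(z);
   p = ((1:n)' - 0.375) / (n + 0.25);    % Blom plotting positions (assumed)
   xi = sqrt(2) * erfinv(2*p - 1);       % approximate expected values xi_i
   plot(xi, z, '.', xi, xi, 'r-')        % points should hug the 45-degree line
   xlabel('expected normal quantile'), ylabel('z_i')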
We now apply this tool to the Nickel data, as shown in Figure 6.6 below. For ease of
comparison with Figure 6.7, the histogram and corresponding normal quantile plot are
shown using the horizontal display option.18 (The only difference here is that the Normal
Quantile Plot is now above the histogram, with ξi values on the vertical axis to the right.)
Since most data observed in practice is nonnegative (i.e., is truncated at zero), the
corresponding histograms tend to be “skewed to the right”, as illustrated by this Nickel
data.
16 The values on the bottom horizontal axis are the associated cumulative probabilities, so that “0” on the
top corresponds to “Φ(0) = .5” on the bottom.
17 Note that such probability intervals are different from confidence intervals. In particular, their end points
are fixed. Note also that these (Lilliefors) probability bounds actually account for the estimated mean and
standard-deviation values used [for more information, Google “Lilliefors test”].
18 Right click on the label bar above the histogram and select Display Options → Horizontal Layout.
The degree of non-normality of this data is even more evident from the Normal Quantile
Plot. Here the mid-range values are well above the 45° line (slightly distorted in this
plot), indicating that there is "too much probability mass to the left of center" relative to
the normal distribution. Hence it is difficult to krige this data directly, since the
corresponding prediction intervals would have little validity.
However, if this data is transformed to natural logs, then the familiar "bell shaped" curve
starts to appear, as seen in Figure 6.7 above. What is happening is that the log
transformation "shrinks" the upper range of the distribution (above value one) and
"expands" the lower range (below value one). While other transformations are possible
here (such as taking square roots rather than logs), the log transformation is by far the
most common. It is also used for regression residuals, as we shall see in later sections.
To perform this log transformation in MATLAB, we start with the original data set, nickel,
in the MATLAB file, nickel.mat. Next we replace the data column, nickel(:,3), with log
data, and save the result as log_nickel using the command:
>> log_nickel = [nickel(:,1:2),log(nickel(:,3))];
This makes a new matrix consisting of the first two columns of nickel and the log of the
third column.19
Figure 6.8. Log Nickel Variogram Figure 6.9. Log Nickel Covariogram
19
Note that the log command uses natural logs by default. Logs to the base 10 are obtained with the
command, log10.
The corresponding covariogram estimate is on the right in Figure 6.9. Here we again see
a wave effect which is qualitatively very similar to that in Figure 4.22 for the raw nickel
data. Here the reported maxdist value is 48,204. However, it appears that up to about
30,000 meters the empirical variogram is reasonably consistent with a classical spherical
variogram. Hence to capture this range, we now rerun var_spher_plot with this specified
maxdist value as follows:
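(The rerun command itself is not reproduced in this extraction. A plausible form, assuming var_spher_plot.m accepts the data matrix together with a maxdist argument and returns the output structure, OUT, referred to below:)

>> OUT = var_spher_plot(log_nickel, 30000);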
The new covariogram is plotted in Figure 6.10 below, and is seen to be quite in keeping
with the classical model.
[Figure 6.10. Log Nickel Covariogram with fitted spherical model: RANGE = 21630.857, SILL = 1.356, NUGGET = 0.340, MAXDIST = 30000]
Here we no longer show the variogram, since its main purpose was to estimate the
desired covariogram. By using the estimated range, sill and nugget parameters,
$(\hat{r}, \hat{s}, \hat{a}) = (21631,\ 1.356,\ 0.340)$, shown on the right, we can now construct estimates of all
desired covariances as in (6.2.65) and (6.2.66) above.
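For concreteness, these covariance estimates can be coded directly from the spherical parameters. The sketch below assumes (6.2.65) takes the standard spherical-covariogram form, $\hat{C}(h) = \hat{s} - \hat{\gamma}(h)$, so that $\hat{C}(h) = (\hat{s}-\hat{a})\big[1 - \tfrac{3}{2}(h/\hat{r}) + \tfrac{1}{2}(h/\hat{r})^3\big]$ for $0 < h \leq \hat{r}$ and $\hat{C}(h) = 0$ beyond the range:

>> r = 21631;  s = 1.356;  a = 0.340;    % estimated range, sill, and nugget (Figure 6.10)
>> C_hat = @(h) (s - a)*(1 - 1.5*(h/r) + 0.5*(h/r).^3).*(h <= r);   % covariogram for h > 0
   % (by convention, the covariogram at h = 0 is the full sill, s)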
To use these parameters in MATLAB, recall that the first cell of the OUT structure above
contains these parameter values. So we may identify these for later use as:
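(A minimal sketch, assuming the parameters occupy the first cell of OUT; the vector name, rsa, is ours:)

>> rsa = OUT{1}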
Note that by leaving off the semicolon on the command line, the new vector is
automatically displayed on the screen, so that the correctness of this command is easily
checked against the parameter values shown in Figure 6.10 above.
Given this covariogram estimate, we first apply simple kriging to a single point in order
to illustrate the procedure. In particular we choose the point, s0 = (659000, 586000),²⁰
shown as a red dot in Figure 6.11 below. Here the nickel values in Figure 4.18 have been
replaced by log-nickel values. Notice that while the values have changed, the overall
pattern is essentially the same. With respect to the particular point, s0, it appears that a
bandwidth of h0 = 5000 meters is sufficient to capture the 12 most important neighbors
of this point, as shown in the enlarged portion of the map. So for purposes of this
illustration we take the relevant prediction set, $S(s_0)$, to be given by these 12 points.
[Figure 6.11. Log Nickel Values, with the prediction point s0 and its 5000-meter neighborhood shown in an enlarged inset]
The rest of the simple kriging procedure is operationalized in the MATLAB program,
krige_simple.m. So to obtain the desired simple kriging prediction and an associated
estimate of the standard error of prediction at s0, one can use the command:
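(The command is not reproduced in this extraction. A plausible form, with the signature our own reconstruction, assuming krige_simple.m takes the prediction location, the data matrix, the bandwidth, and the covariogram parameter vector:)

>> OUT = krige_simple([659000 586000], log_nickel, 5000, rsa)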
Here the OUT matrix lists the krige prediction in the first column and the standard errors
in the second column (see also the documentation at the beginning of the program). So in
the present case, we can simply leave off the semicolon again and see the screen display
of these two values.
If we now denote nickel values by the random variable, $Y$, and log_nickel values by
$\log Y$, the kriging prediction of log_nickel at the point s0 is seen to be

(6.2.80) $\widehat{\log Y}(s_0) = 3.0488$
20
In the following discussion we shall refer to the given location as s0 when discussing input/output for MATLAB programs, and as $s_0$ when referring to the formal development above. The same is true of bandwidths, where h0 and $h_0$ will be used, respectively.
where the "hat" notation, $\widehat{\log Y}$, is used to denote a prediction (or estimate) of the
random variable, $\log Y$. The corresponding estimate of the standard error of prediction
at location $s_0$ is then given by,

(6.2.81) $\hat{\sigma}_0 = 0.76697$
For our later purposes, it is important to note that as in Step 1 of the estimation procedure
for simple kriging, this program uses the sample mean of the log_nickel data, which can
be obtained directly in MATLAB with the command
>> mean(log_nickel(:,3))
Before analyzing this simple kriging output further, it is instructive to compare it with the
output obtained by using the simple kriging procedure in Geostatistical Analyst. First it is
necessary to construct log-nickel values in ARCMAP. This is easily accomplished by
opening the attribute table for the Vancouver_dat shapefile, making a new field, say
LOGNI, and using the Calculator to create the logs of Nickel values [written in the
calculator window as log([NI]) ].21 [These log values are shown in Figure 6.11 above.]
To perform simple kriging, start with the Geostatistical Analyst → Geostatistical Wizard
path, and use attribute LOGNI for input data Vancouver_dat. In the next window, select
the Simple Kriging option. Notice that the mean value is displayed as 3.2515, which is
precisely the (rounded) MATLAB value above. In the next window, be sure to select the
"Semivariogram" option, to obtain a variogram plot. Recall that the maxdist above was
chosen to be 30,000 meters.
To obtain a fit that is roughly comparable in this case, set the number of lags to 15 with a
lag size of 2000 meters (yielding a maxdist of $15 \times 2000 = 30{,}000$ meters) as shown in
Figure 6.12 below. Here the estimated range of 21706 meters is remarkably close to the
MATLAB value of 21630 meters in Figure 6.10 above. Similarly, the estimated nugget
value, 0.3409, and sill value, $(.3409 + 1.0206 = 1.3615)$, are also very close to those in
Figure 6.10. So in this case one expects the simple kriging results to be quite similar as
well.
21
As with MATLAB, the “log( )” function in ARCMAP calculates natural logs.
This can be verified in the next window, shown in Figure 6.13 below. Here the sample
point coordinates have been set to X = 659000 and Y = 586000 to agree with the point s0
above. Similarly, to produce a circular neighborhood of 5000 meters, the “Sector type” is
set to the simple ellipse form shown, and the axes are both set to 5000 to yield a circle.22
22
Be sure to set Copy from Variogram = “False” in order to set these axis values.
Notice also that the maximum "Neighbors to include" has been set to 15 to ensure that all
points in the circle around point (X,Y) in the preview window will be included.²³
The kriging prediction for log nickel is then displayed in Figure 6.13 as "Prediction =
3.0504", located below the (X,Y) coordinate values. [Notice also that exactly the 12
points inside the circle have been used for this kriging prediction.] As expected, this
value is seen to be quite close to the MATLAB prediction in (6.2.80) above.
Finally, to produce an estimate of the standard error of prediction at (X,Y), click "Back"
twice to return to the "Step 1" window and now select the Prediction Standard Error
option. With this selection, return to the "Step 3" window by clicking "Next" twice. Notice
that all settings in Steps 2 and 3 have remained constant, so that prediction standard
errors are now being calculated under the same settings as the kriging prediction. The
only change is that "Prediction = 3.0504" is now replaced by "Error = 0.7676". Again,
this value is quite close to the MATLAB standard error estimate in (6.2.81) above. As
mentioned above, this close agreement is largely due to the similarity of the variogram
parameter estimates in this case. Hence such close agreement cannot be expected in
general.
To obtain the corresponding prediction interval for nickel levels themselves, observe that
if $g$ is any strictly increasing function (so that $g^{-1}$ is also increasing), then for any random
variables, $Z_1, Z_2, Z_3$,

(6.2.82) $Z_1 \leq g(Z_2) \leq Z_3 \;\Leftrightarrow\; g^{-1}(Z_1) \leq g^{-1}[g(Z_2)] \leq g^{-1}(Z_3) \;\Leftrightarrow\; g^{-1}(Z_1) \leq Z_2 \leq g^{-1}(Z_3)$
23
Note also in Figure 6.13 that the “Enlarge” tool for the preview window has been used to focus in on the
point (X,Y).
where the last line follows from the identity, $g^{-1}[g(Z_2)] = Z_2$. This in turn implies that
the probabilities of these events must be identical, so that

(6.2.83) $\Pr[Z_1 \leq g(Z_2) \leq Z_3] = \Pr[g^{-1}(Z_1) \leq Z_2 \leq g^{-1}(Z_3)]$
Now in the present case, recall from (6.2.59) that the 95% prediction interval for
$\log Y(s_0)$ is defined by the relation:

(6.2.84) $\Pr\big[\widehat{\log Y}(s_0) - (1.96)\hat{\sigma}_0 \,\leq\, \log Y(s_0) \,\leq\, \widehat{\log Y}(s_0) + (1.96)\hat{\sigma}_0\big] = .95$

Hence if we now let $Z_1 = \widehat{\log Y}(s_0) - (1.96)\hat{\sigma}_0$, $Z_2 = Y(s_0)$, $Z_3 = \widehat{\log Y}(s_0) + (1.96)\hat{\sigma}_0$, and
let $g(\cdot) = \log(\cdot)$ so that $g^{-1}(\cdot) = \exp(\cdot)$, then it follows at once from (6.2.83) and (6.2.84)
that

(6.2.85) $\Pr\big[\exp\big(\widehat{\log Y}(s_0) - (1.96)\hat{\sigma}_0\big) \,\leq\, Y(s_0) \,\leq\, \exp\big(\widehat{\log Y}(s_0) + (1.96)\hat{\sigma}_0\big)\big] = .95$
This yields the desired prediction interval for $Y(s_0)$. In the present case we have the
estimated values,

(6.2.86) $\big[\exp\big(\widehat{\log Y}(s_0) - (1.96)\hat{\sigma}_0\big),\ \exp\big(\widehat{\log Y}(s_0) + (1.96)\hat{\sigma}_0\big)\big] = \big[\exp\big(3.0504 - (1.96)(.7676)\big),\ \exp\big(3.0504 + (1.96)(.7676)\big)\big]$

and hence can be 95% confident that the true value of $Y(s_0)$ lies in the interval
$[4.6922,\ 95.097]$. Note finally that [as stated following expression (6.2.61)] this result
can be interpreted to mean that if we were able to perform this same estimation procedure
many times, then $Y(s_0)$ would lie in the estimated interval about 95% of the time. So in
the present case, one can be reasonably confident that the interval obtained (namely
$[4.6922,\ 95.097]$) does indeed contain $Y(s_0)$.
While the restriction to a single point, s0, was valuable as an illustration of the Simple
Kriging procedure, typically one wishes to predict (estimate) the entire sample area based
on the observed data points, $\{y(s_i) : i = 1, \ldots, N\}$. In ARCMAP this is precisely the "default"
option (where predictions are restricted to the smallest box in the sample area containing
the observed data). But in MATLAB one must actually specify the set of points where
predictions are desired. So a simple procedure here is to use the program, grid_form.m,
to construct a reasonably fine grid of points in the smallest box containing the data. To
display this visually, one can then import this data to ARCMAP and use some
appropriate (non-statistical) interpolation method to interpolate this grid to every pixel. In
the MATLAB file, nickel.mat, the coordinates of all 437 data points are in the matrix,
L0. So to form a bounding box, write:
>> Xmin=min(L0(:,1));
>> Xmax=max(L0(:,1));
>> Ymin=min(L0(:,2));
>> Ymax=max(L0(:,2));
Next, to choose a grid cell size, observe from the map display in ARCMAP that a
division of the box sides into about 25 segments yields a reasonably fine grid for
interpolation. So we now set the cell sizes accordingly:
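(These cell-size commands are not reproduced in this extraction; a minimal sketch, assuming each box side is divided into 25 equal segments:)

>> Xcell = (Xmax - Xmin)/25;
>> Ycell = (Ymax - Ymin)/25;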
and use the command (recall the application on p.4-26 of Part I):
>> G = grid_form(Xmin,Xmax,Xcell,Ymin,Ymax,Ycell);
to construct an appropriate grid, G. This grid is shown in Figure 6.14 below, and is seen
to just cover the region of the data points. Using grid G as an input rather than the single
point, s0, we can then obtain a full kriging of all grid points with the command:
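(The command itself is not reproduced here; a plausible form, using the same hypothetical signature as before, with the grid, G, replacing the single point:)

>> OUT = krige_simple(G, log_nickel, 5000, rsa);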
[Here we use the semicolon to avoid screen output of all kriging values.] This data can
then be imported to ARCMAP by making a data table in which the first two columns
include the grid coordinate points and the last two include the krige and standard error
estimates at each grid point. By saving this as an ASCII file (and editing the file in
EXCEL to include column labels), one can then import DAT_G.txt into ARCMAP, make
a shapefile, Simple_Krige_Grid.shp, and display this layer as shown in Figure 6.15
below. (A sketch of these table-construction and export commands is given below.)
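A minimal sketch of these commands (the table name, DAT_G, follows the text above; the column layout matches the description given there):

>> DAT_G = [G, OUT];              % grid coordinates in columns 1-2; predictions and standard errors in 3-4
>> save DAT_G.txt DAT_G -ascii    % write as an ASCII file for import into ARCMAP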
[Figure 6.14. Prediction Grid for the Log-Nickel Data]
To display the simple kriging results from MATLAB, we can then use any of the
interpolators in Geostatistical Analyst. The contours shown in Figure 6.15 are obtained
by first interpolating the kriging data in Simple_Krige_Grid with the radial basis
functions option, and then using the command, Data → Export to Vector. The layer
produced contains precisely these contours. The reason why contours are used here is to
allow a visual comparison with a simple kriging of log-nickel in Geostatistical Analyst.
This is accomplished by completing the simple kriging procedure outlined above [that we
terminated with Step 3 (Searching Neighborhood) shown in Figure 6.13]. If one places
the contours above the kriging map displayed, then both can be seen together.24
Finally, this visual comparison shows that while these two kriging surfaces are not in
perfect agreement, they are qualitatively very similar. Moreover, while the Geostatistical
Analyst procedure is clearly easier to perform in this case, the MATLAB “grid”
procedure will prove to be very useful for universal kriging, where the Geostatistical
Analyst version is very limited in terms of applications. This will be illustrated by the
“Venice example” in Section 7.3.5 below.
24
To make the boundaries of the kriging map agree exactly with the contours (as seen in Figure 6.15),
open the “properties” of the kriging map layer, select “Extent” and set this to “the rectangular extent of
Simple_Krige_Grid”.
6.3 Ordinary Kriging

The procedural details of Ordinary Kriging are almost identical to those of Simple
Kriging. Hence the present development focuses on those aspects that extend the above
analysis by internalizing the estimation of the unknown mean, $\mu$. Here again we start
with a spatial stochastic process, $\{Y(s) = \mu + \varepsilon(s) : s \in R\}$, where each finite set of sample
variates, $\{Y(s_i) = \mu + \varepsilon(s_i) : i = 1, \ldots, n\}$, is assumed to be multi-normally distributed with
known covariances, $\mathrm{cov}[\varepsilon(s_i), \varepsilon(s_j)]$, $i, j = 1, \ldots, n$. Given such a sample, we again
consider the problem of predicting $Y(s_0)$ at some location, $s_0 \in R$, not in this sample. It
is also assumed that the relevant prediction set, $S(s_0) = \{s_1, \ldots, s_{n_0}\}$, for location $s_0$ has
been identified within this set of sample locations. Hence the basic task is to predict a
value for $Y(s_0)$ in terms of observed values of the variates, $\{Y(s_1), \ldots, Y(s_{n_0})\}$. By the linear
prediction hypothesis in (6.1.2) we then seek a best linear unbiased (BLU) predictor,

(6.3.1) $\hat{Y}(s_0) = \sum_{i=1}^{n_0} \lambda_{i0}\, Y(s_i)$
Since the mean, $\mu$, is assumed to be constant throughout region $R$, it is natural to use the
entire set of sample observations, $\{Y(s_i) = \mu + \varepsilon(s_i) : i = 1, \ldots, n\}$, to estimate $\mu$. To do so,
we again start with the linear hypothesis that the desired estimate, $\hat{\mu}_n$, can be written
as a linear combination of these observations, say

(6.3.2) $\hat{\mu}_n = a' Y_n$

where $Y_n = [Y(s_1), \ldots, Y(s_n)]'$ denotes the full sample vector of $Y$-variates, and where
$a = (a_1, \ldots, a_n)'$ denotes the vector of unknown coefficients. To ensure that this linear
estimator is unbiased, we then require that

(6.3.3) $\mu = E(\hat{\mu}_n) = a' E(Y_n) = a'(\mu 1_n) = \mu\,(1_n' a)$

where $1_n = (1, \ldots, 1)'$ is the unit vector of length $n$. Hence unbiasedness for all values of $\mu$
will be guaranteed if and only if these unknown coefficients sum to one, i.e.,

(6.3.4) $1_n' a = 1$
Among all such linear unbiased estimators, we seek the one with minimum variance. To
calculate the variance of linear estimators, we start by letting

(6.3.5) $V = \mathrm{cov}(Y_n) = \begin{pmatrix} \sigma^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma^2 \end{pmatrix}$

denote the full sample covariance matrix (in contrast to the smaller covariance matrices,
$V_0$, for each prediction set, $S(s_0) = \{s_1, \ldots, s_{n_0}\}$). With this definition, it follows at once from
(3.2.21) that

(6.3.6) $\mathrm{var}(\hat{\mu}_n) = \mathrm{var}(a' Y_n) = a' V a$
Hence to determine the linear unbiased estimator of $\mu$ with smallest variance, we seek to
find that coefficient vector, $\hat{a}$, that yields a minimum value of (6.3.6) subject to the unit-
sum condition in (6.3.4), i.e., which solves the following constrained minimization
problem in $a$:

(6.3.7) $\min_a\; a' V a \quad \text{subject to:}\quad 1_n' a = 1$

In expression (A2.8.23) of the Appendix it is shown that the unique solution of this
problem is given by the coefficient vector:

(6.3.8) $\hat{a} = \dfrac{1}{1_n' V^{-1} 1_n}\; V^{-1} 1_n$
Hence for each possible vector of sample variates, $Y_n = [Y(s_1), \ldots, Y(s_n)]'$, the unique BLU
estimator for $\mu$ is given by:

(6.3.9) $\hat{\mu}_n = \hat{a}' Y_n = \dfrac{1}{1_n' V^{-1} 1_n}\; 1_n' V^{-1} Y_n = \dfrac{1_n' V^{-1} Y_n}{1_n' V^{-1} 1_n}$
To gain some feeling for this estimator, consider the classical case of uncorrelated
samples, namely where the covariance matrix in (6.3.5) reduces to

(6.3.10) $V = \mathrm{cov}(Y_n) = \begin{pmatrix} \sigma^2 & & 0 \\ & \ddots & \\ 0 & & \sigma^2 \end{pmatrix} = \sigma^2 I_n$

with $I_n$ denoting the $n$-square identity matrix. In this case we see that
(6.3.11) $\hat{\mu}_n = \dfrac{1_n' (\sigma^2 I_n)^{-1} Y_n}{1_n' (\sigma^2 I_n)^{-1} 1_n} = \dfrac{1_n' Y_n}{1_n' 1_n}$

But since $1_n' 1_n = \sum_{i=1}^n (1) = n$ and $1_n' Y_n = \sum_{i=1}^n Y(s_i)$, it follows that

(6.3.12) $\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^n Y(s_i) = \bar{Y}_n$

Thus $\hat{\mu}_n$ reduces to the sample mean, $\bar{Y}_n$, which is of course the unique BLU estimator of
$\mu$ for uncorrelated samples. Hence in the presence of spatial correlation, the optimal
weights in the coefficient vector, $\hat{a}$, reflect the covariances among these correlated
samples. In the case of Simple Kriging, the use of $\bar{Y}_n$ to estimate $\mu$ necessarily results in
a linear unbiased estimator with higher variance than $\hat{\mu}_n$.
Given this intermediate result, we now formulate the Best Linear Unbiased prediction
problem for $Y(s_0)$. Here we again stress that the prediction set, $S(s_0) = \{s_1, \ldots, s_{n_0}\}$, for $s_0$
is generally smaller than the full sample of size $n$. So here we focus on the smaller vector
of sample variates, $Y = [Y(s_1), \ldots, Y(s_{n_0})]'$, used for predicting $Y(s_0)$ in (6.3.1) above. As in
the case of Simple Kriging, if we again denote the desired vector of prediction weights by
$\lambda_0 = (\lambda_{01}, \ldots, \lambda_{0 n_0})'$, then the desired linear predictor of $Y_0 = Y(s_0)$ can be written in vector
form as

(6.3.13) $\hat{Y}_0 = \lambda_0' Y$

For purposes of prediction, recall from (6.1.4) that the desired unbiasedness criterion for
$\hat{Y}_0$ is that expected prediction error be zero, i.e., that

(6.3.14) $0 = E(Y_0 - \hat{Y}_0) = E(Y_0) - E(\lambda_0' Y) = \mu - \lambda_0'(\mu 1_{n_0}) = \mu\,(1 - \lambda_0' 1_{n_0})$

So, as a parallel to (6.3.4) above, it follows that $\lambda_0$ will yield an unbiased predictor for all
possible values of $\mu$ if and only if the bracketed expression is zero, i.e.,

(6.3.15) $1_{n_0}' \lambda_0 = 1$
Moreover, to satisfy the efficiency criterion it is required that among all linear unbiased
predictors, $\hat{Y}_0$ should yield the smallest prediction error variance, which in view of
(6.3.15) together with (6.2.12) is again seen to be precisely the residual mean squared error,

(6.3.16) $\mathrm{var}(e_0) = E(e_0^2) = E[(Y_0 - \hat{Y}_0)^2] = E[(Y_0 - \lambda_0' Y)^2] = E\big[\big((\mu + \varepsilon_0) - \lambda_0'(\mu 1_{n_0} + \varepsilon)\big)^2\big] = E\big[\big(\mu(1 - \lambda_0' 1_{n_0}) + (\varepsilon_0 - \lambda_0' \varepsilon)\big)^2\big] = E[(\varepsilon_0 - \lambda_0' \varepsilon)^2] = MSE(\lambda_0)$
But since all covariances in (6.2.26) continue to be given (i.e., are assumed to be known)
for the case of Ordinary Kriging, the argument leading to (6.2.27) for Simple Kriging still
holds. Hence we again seek to minimize

(6.3.17) $MSE(\lambda_0) = \sigma^2 - 2\lambda_0' c_0 + \lambda_0' V_0 \lambda_0$

but now subject to the unit-sum condition in (6.3.15). Hence the desired weights, $\hat{\lambda}_0$, for
Ordinary Kriging are given by the solution of the constrained minimization problem:

(6.3.18) $\min_{\lambda_0}\; MSE(\lambda_0) \quad \text{subject to:}\quad 1_{n_0}' \lambda_0 = 1$
The solution to this problem is shown in the Appendix [expression (A2.8.26)] to be given
by

(6.3.19) $\hat{\lambda}_0 = V_0^{-1} c_0 + \left(\dfrac{1 - 1_{n_0}' V_0^{-1} c_0}{1_{n_0}' V_0^{-1} 1_{n_0}}\right) V_0^{-1} 1_{n_0}$
By substituting this solution into (6.3.13), one then obtains the following BLU predictor
of $Y_0$ [see also expression (A2.8.28) in the Appendix]:

(6.3.20) $\hat{Y}_0 = \hat{\lambda}_0' Y = c_0' V_0^{-1} Y + \left(\dfrac{1 - 1_{n_0}' V_0^{-1} c_0}{1_{n_0}' V_0^{-1} 1_{n_0}}\right) 1_{n_0}' V_0^{-1} Y$

At first glance, this expression appears rather formidable. But by using the results of
Section 6.3.1 above, it can be made quite transparent. In particular, suppose that the
samples available for mean estimation are taken to be given by the prediction sample, $Y$,
at $s_0$ rather than the full sample, $Y_n$. Then it follows at once from (6.3.9) that this BLU
estimator must be of the form

(6.3.21) $\hat{\mu}_{n_0} = \dfrac{1_{n_0}' V_0^{-1} Y}{1_{n_0}' V_0^{-1} 1_{n_0}}$
Hence by substituting (6.3.21) into (6.3.20) and rearranging terms, this predictor can be
rewritten as

(6.3.22) $\hat{Y}_0 = \hat{\mu}_{n_0} + c_0' V_0^{-1} (Y - \hat{\mu}_{n_0} 1_{n_0})$
Finally, if we treat $\hat{\mu}_{n_0}$ as a prior estimate of $\mu$, and [as in (6.2.2)] take the corresponding
sample residuals based on this prior estimate to be

(6.3.23) $\hat{\varepsilon}_i = Y(s_i) - \hat{\mu}_{n_0}\,, \quad i = 1, \ldots, n_0$

(6.3.24) $\hat{\varepsilon} = \begin{pmatrix} \hat{\varepsilon}_1 \\ \vdots \\ \hat{\varepsilon}_{n_0} \end{pmatrix} = \begin{pmatrix} Y(s_1) - \hat{\mu}_{n_0} \\ \vdots \\ Y(s_{n_0}) - \hat{\mu}_{n_0} \end{pmatrix} = Y - \hat{\mu}_{n_0} 1_{n_0}$
Similarly, if we let $\hat{\varepsilon}_0 = \hat{Y}_0 - \hat{\mu}_{n_0}$ denote the residual predictor corresponding to $\hat{Y}_0$, then
(6.3.22) is further reduced to

(6.3.25) $\hat{\varepsilon}_0 = c_0' V_0^{-1}\, \hat{\varepsilon}$
In retrospect, this procedure seems quite natural. Since all covariance information is
assumed to be given (as in Simple Kriging), the first step simply uses this information to
obtain a BLU estimator for $\mu$. The second step then uses Simple Kriging to construct the
predictor. What is remarkable here is that this ad hoc procedure actually yields the Best
Linear Unbiased predictor for $Y(s_0)$ based solely on the prediction sample, $Y$.
The only shortcoming of this procedure is that it does not use all sample information
available for estimating $\mu$. For since this mean is assumed to be constant over the entire
region $R$, it should be clear that a better estimate can be obtained by using the BLU
estimator, $\hat{\mu}_n$, based on the full sample, $Y_n$. It is this modified procedure that constitutes
the most commonly used form of Ordinary Kriging.²⁵ To formalize this procedure, it thus
suffices to modify the two steps above as follows:

(1). Construct the BLU estimator, $\hat{\mu}_n$, of $\mu$ based on the full sample data, $Y_n$, as in (6.3.9).

(2). Use the sample residuals, $\hat{\varepsilon} = Y - \hat{\mu}_n 1_{n_0}$, to obtain the Simple Kriging predictor, $\hat{\varepsilon}_0$, of $\varepsilon_0$ as in (6.3.25), and set $\hat{Y}_0 = \hat{\mu}_n + \hat{\varepsilon}_0$.
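As a schematic illustration of this two-step procedure (a sketch only, with our own variable names, and assuming the covariance estimates V (full n-by-n), V0, c0, the data vector y, and the prediction-set index vector, idx, have already been constructed from the covariogram):

>> one_n  = ones(length(y),1);
>> mu_hat = (one_n' * (V\y)) / (one_n' * (V\one_n));   % Step (1): BLU mean estimate, as in (6.3.9)
>> e_hat  = y(idx) - mu_hat;                           % residuals on the prediction set
>> Y0_hat = mu_hat + c0' * (V0\e_hat);                 % Step (2): Simple Kriging of residuals, as in (6.3.25)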
Recall that to obtain prediction intervals, one requires an estimate of the standard error
of prediction, $\sigma_0$, as well as $\hat{Y}_0$. To do so, recall from the argument in (6.3.16) and
(6.3.17) that the prediction error variance for any weight vector, $\lambda_0$, has the same form as for
Simple Kriging, i.e.,

(6.3.26) $\mathrm{var}(e_0) = \sigma^2 - 2\lambda_0' c_0 + \lambda_0' V_0 \lambda_0$

So all that is required to obtain the desired prediction error variance is to substitute the
optimal weight vector, $\hat{\lambda}_0$, into this expression. After some manipulation, it can be shown
[see expression (A2.8.69) in the Appendix] that the desired value, $\hat{\sigma}_0^2$, is given by:

(6.3.27) $\hat{\sigma}_0^2 = (\sigma^2 - c_0' V_0^{-1} c_0) + \dfrac{(1 - 1_{n_0}' V_0^{-1} c_0)^2}{1_{n_0}' V_0^{-1} 1_{n_0}}$
The key point to notice is that the first bracketed expression is precisely the prediction
error variance for Simple Kriging in expression (6.2.53). But since the second term is
25
It should be noted, however, that one may consider "local" versions of ordinary kriging in which the mean is re-estimated at each prediction site, $s_0$. This yields a set of local mean estimates, $\hat{\mu}(s_0)$, which can be regarded as local estimates of a possibly non-constant trend surface. See for example [BG], pp. 195-196. This idea is also implicit in Section 5.4.2 of Schabenberger and Gotway (2005).
always positive,²⁶ it follows that the prediction error variance for Ordinary Kriging is always
larger than for Simple Kriging. The additional positive term turns out to be precisely the
addition to prediction error variance created by the internal estimation of the mean, $\mu$.
Finally, given this expression for prediction error variance, the desired standard error of
prediction is simply the square root of this expression, namely,
(6.3.28) $\hat{\sigma}_0 = \sqrt{(\sigma^2 - c_0' V_0^{-1} c_0) + \dfrac{(1 - 1_{n_0}' V_0^{-1} c_0)^2}{1_{n_0}' V_0^{-1} 1_{n_0}}}$
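This standard error is equally direct to compute; a minimal sketch (with s2 denoting the variance estimate, $\sigma^2$, and with V0 and c0 as above):

>> one_0 = ones(length(c0),1);
>> sig_0 = sqrt( (s2 - c0'*(V0\c0)) + (1 - one_0'*(V0\c0))^2 / (one_0'*(V0\one_0)) );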
From the development above, it should be evident how to implement Ordinary Kriging
by a direct modification of the three-step procedure for Simple Kriging in Section 6.2.5.
To do so, we again start by assuming the existence of a given set of $n$ observations (data
points), $\{y_i = y(s_i) : i = 1, \ldots, n\}$, in $R$, where each $y_i$ is a realization of the corresponding
random variable, $Y(s_i)$, in the full sample vector, $Y_n = [Y(s_i) : i = 1, \ldots, n]'$, in (6.3.2) above.
In this context, we again consider the prediction of $Y_0 = Y(s_0)$ at a single given location,
$s_0 \in R$, with respect to a given prediction set, $S(s_0) = \{s_1, \ldots, s_{n_0}\} \subseteq \{s_1, \ldots, s_n\}$. Within this
framework, we can operationalize the Ordinary Kriging model by re-ordering the three
steps of the Simple Kriging implementation in Section 6.2.5 as follows:
Step 1. This step amounts essentially to a reinterpretation of Step 2 for Simple Kriging, where
here we focus on the $Y$-process rather than the $\varepsilon$-process. To do so, simply recall from
Section 4.8 that (as with Simple Kriging) Ordinary Kriging assumes a constant-mean
model, $[Y(s) = \mu + \varepsilon(s) : s \in R]$, so that the variograms for the $Y$-process and $\varepsilon$-process
are identical. Hence we can again use the sample data, $(y_1, \ldots, y_n)$, in var_spher_plot.m to
obtain a spherical variogram estimate, $\hat{\gamma}(h; \hat{r}, \hat{s}, \hat{a})$, and derived covariogram estimate as
in (6.2.65), i.e.,

(6.3.29) $\hat{C}(h) = \hat{s} - \hat{\gamma}(h; \hat{r}, \hat{s}, \hat{a})$
The only difference in the present setting is that we treat the covariances between $Y$
values rather than $\varepsilon$ values. In particular, we now require estimates of the covariances,
$\sigma_{ij} = \mathrm{cov}[Y(s_i), Y(s_j)]$, for all sample pairs, $Y(s_i)$ and $Y(s_j)$, in $Y_n$. Using (6.3.29), these
can be estimated precisely as in (6.2.66) by setting,

(6.3.30) $\hat{\sigma}_{ij} = \hat{C}(\|s_i - s_j\|)\,, \quad i, j = 1, \ldots, n$

which together yield the full covariance matrix estimate,

(6.3.31) $\hat{V} = (\hat{\sigma}_{ij} : i, j = 1, \ldots, n)$
26
Positivity of the denominator follows from the fact that it is the variance of the linear compound, $1_{n_0}' V_0^{-1} Y$, since $\mathrm{var}(1_{n_0}' V_0^{-1} Y) = 1_{n_0}' V_0^{-1}\, \mathrm{cov}(Y)\, V_0^{-1} 1_{n_0} = 1_{n_0}' V_0^{-1} (V_0) V_0^{-1} 1_{n_0} = 1_{n_0}' V_0^{-1} 1_{n_0}$.
By the same procedure, we can obtain estimates of all covariances between the variable,
$Y_0 = Y(s_0)$, to be predicted and the given set of prediction variates, $Y = [Y(s_1), \ldots, Y(s_{n_0})]'$,
namely,

(6.3.32) $\hat{\sigma}_{0i} = \hat{C}(\|s_0 - s_i\|)\,, \quad i = 1, \ldots, n_0$

By again letting $\hat{c}_0 = (\hat{\sigma}_{0i} : i = 1, \ldots, n_0)'$, we can use these together with the appropriate
sub-matrix, $\hat{V}_0$, of the covariance estimates in (6.3.31) to obtain an estimate,

(6.3.33) $\hat{C}_0 = \begin{pmatrix} \hat{\sigma}^2 & \hat{c}_0' \\ \hat{c}_0 & \hat{V}_0 \end{pmatrix}$
Step 2. This step involves the main departure from Simple Kriging. Here we replace the sample-
mean estimator, $\bar{Y}_n$, of $\mu$ with the BLU estimator, $\hat{\mu}_n$, in expression (6.3.9) above. By
using the covariance estimates in (6.3.31) together with the sample data vector,
$y = (y_1, \ldots, y_n)'$, this estimate can be calculated as

(6.3.34) $\hat{\mu}_n = \dfrac{1_n' \hat{V}^{-1} y}{1_n' \hat{V}^{-1} 1_n}$
Step 3. As emphasized in the final two-step procedure of Section 6.3.2 above, this step is
identical to that in the Simple Kriging procedure. All that is required at this point is to
replace the sample-mean estimate, $\bar{y}_n$, with the BLU estimate, $\hat{\mu}_n$, and redefine the
appropriate residual estimates in (6.2.70) by

(6.3.35) $\hat{\varepsilon}_i = y_i - \hat{\mu}_n\,, \quad i = 1, \ldots, n_0$

and again use (6.2.71) and (6.2.72) to construct the desired prediction, $\hat{Y}_0$, and standard
error of prediction, $\hat{\sigma}_0$, by

(6.3.36) $\hat{Y}_0 = \hat{\mu}_n + \hat{c}_0' \hat{V}_0^{-1} \hat{\varepsilon}$

(6.3.37) $\hat{\sigma}_0 = \sqrt{(\hat{\sigma}^2 - \hat{c}_0' \hat{V}_0^{-1} \hat{c}_0) + \dfrac{(1 - 1_{n_0}' \hat{V}_0^{-1} \hat{c}_0)^2}{1_{n_0}' \hat{V}_0^{-1} 1_{n_0}}}$
The pair, $(\hat{Y}_0, \hat{\sigma}_0)$, can then be used to construct prediction intervals for $Y_0 = Y(s_0)$
precisely as in (6.2.62) and (6.2.63) above.
With these general observations, we can now sketch how Ordinary Kriging is done in
both MATLAB and ARCMAP. Starting with MATLAB, Ordinary Kriging is
operationalized in the program, o_krige.m. The inputs are essentially the same as
krige_simple.m, except that values are made distinct from locations. So here, values are
given by y = log_nickel(:,3) and locations by L0 = log_nickel(:,1:2). To obtain a
prediction at the given location, s0 = (659000,586000), in Figure 6.11 above, one now
uses the command:
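(The command is not reproduced here; a plausible form, with the signature our own reconstruction:)

>> OUT = o_krige([659000 586000], y, L0, 5000);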
Here the prediction and standard error are given in the last two cells of the output
structure (see also the documentation at the beginning of the program).
A comparison with the results on p.6-30 above shows that (as expected) the Ordinary
Kriging results are virtually the same. The same procedure can also be carried out in
ARCMAP, in a manner paralleling the Simple Kriging steps above. Hence, as expected,
those results are again seen to be virtually the same as those for MATLAB.

One seemingly natural choice of search neighborhoods is to include all sample points
within the estimated range, $\hat{r}$, of each prediction point, $s_0$, i.e., to set

(6.3.40) $S(s_0) = \{s_i \in S_n : \|s_0 - s_i\| \leq \hat{r}\}$
[In fact, this option for defining search neighborhoods is available in “Kriging step 4 of
5” in ARCMAP, as denoted by “Copy from Variogram”.] However, in spite of its
apparent theoretical appeal, this option generally tends to include “too much”. This will
become evident in the simulation analysis below.
To determine a "best" size for prediction sets, one first defines a set of candidate sizes. In
the present case, we shall focus on circular prediction sets of the form (6.3.40) for a
selection of bandwidths, $H = \{h_1, \ldots, h_k\}$, and let

(6.3.41) $S_h(s_0) = \{s_i \in S_n : \|s_0 - s_i\| \leq h\}\,, \quad h \in H$
The standard procedures for doing so, known as cross validation procedures, leave out
part of the data and attempt to predict these values with the rest of the data. By
calculating the average prediction error for this data, one can then find the bandwidth that
minimizes this value. The most commonly used procedure, known as leave-one-out cross
validation, is to systematically omit single data points one at a time, and predict these
using the rest of the data. Hence, given a candidate bandwidth, $h \in H$, one will obtain for
each data point, $y_i = y(s_i)$, a predicted value, say $\hat{y}_i(h)$, by using all other sample data in
$y = (y_1, \ldots, y_n)$ together with the prediction set $S_h(s_i)$. By squaring these prediction
errors, $y_i - \hat{y}_i(h)$, $i = 1, \ldots, n$, and taking the average, one obtains a summary measure that
can be viewed as a sample version of mean squared error. But in order to preserve units
(so that values, for example, are in terms of Nickel rather than Nickel-squared) the most
commonly used measure of performance is root mean squared error, as defined for each
candidate bandwidth, $h \in H$, by:
(6.3.42) $RMSE(h) = \sqrt{\dfrac{1}{n}\sum_{i=1}^n [y_i - \hat{y}_i(h)]^2}$
Hence by systematically calculating $RMSE(h)$ for all $h \in H$, one can define the best
bandwidth, $h^*$, to be the one that minimizes (6.3.42), i.e.,

(6.3.43) $RMSE(h^*) = \min_{h \in H} RMSE(h)$
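Schematically, this bandwidth search can be sketched as follows (a sketch only, not the code of the course program itself; predict_loo is a hypothetical stand-in for the kriging prediction of point i from all remaining data within bandwidth h):

n = length(y);
RMSE = zeros(length(H),1);
for k = 1:length(H)
    err = zeros(n,1);
    for i = 1:n
        err(i) = y(i) - predict_loo(i, H(k));   % leave out point i and predict it from the rest
    end
    RMSE(k) = sqrt(mean(err.^2));               % root mean squared error, as in (6.3.42)
end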
For the case of Ordinary Kriging, this leave-one-out cross validation procedure is
operationalized in the MATLAB program, o_krige_cross_val.m. To apply this program
to the log-nickel example, recall that the estimated range was $\hat{r} = 21{,}631$ meters, and that
the bandwidth chosen for kriging at $s_0$ was $h_0 = 5000$ meters. Hence we choose $H$ to
be the set of bandwidths increasing from 1000 to 25,000 in increments of 1000, i.e.,
>> H = [1000:1000:25000];
>> L = log_nickel(:,1:2);
>> y = log_nickel(:,3);
then the above program can be run for this example using the command
>> o_krige_cross_val(y,L,H);
The output is a graph that plots the values of $RMSE(h)$ against bandwidths, $h$, as shown
in Figure 6.16(a) below. Here the best bandwidth (shown by the red arrow in the figure)
is seen to be 11,000 meters, which is roughly twice the value chosen for kriging at
point s0 in the examples above. This larger bandwidth is shown by the larger circle in
Figure 6.16(b), with the smaller circle denoting the original choice of 5000 meters. Notice
that many more data points are now included (33 versus 12 in the original analysis). The
predictions obtained by o_krige.m using this larger bandwidth show a predicted value
that is somewhat higher, and a standard error of prediction that is slightly smaller.²⁷ Since
the latter implies a slightly tighter prediction interval, this larger bandwidth may indeed
be preferable.
[Figure 6.16. (a) Cross-Validation RMSE versus Bandwidth for Log Nickel; (b) the 5000- and 11,000-meter Bandwidths around s0]
27
The values obtained in ARCMAP are 3.1237 and 0.7643, respectively, and are again seen to be in close
agreement.
But the most important point to note here is that this best bandwidth is much smaller than
the estimated range ($\hat{r} = 21{,}631$). It can of course be argued that in this particular example,
the estimated range may not be very accurate. Indeed, it is well known that estimates of
the range tend to be the least stable (most variable) of the three parameter estimates,
$(\hat{r}, \hat{s}, \hat{a})$. Hence it is useful to consider this question in simulated data sets where the true
range is known.
For this purpose, 20 realizations of a covariance-stationary process with a known range of
25 km were simulated at a set of sample locations, L, using the program cov_stat.m with
covariogram parameter vector, p (one such realization is shown in Figure 6.17 below):

>> Y = cov_stat(p,L,20);
[Figure 6.17. One Simulated Realization at the Sample Locations (scale bar: 25 km)]
Notice that spatial correlation is indeed evident at scales smaller than the 25 km range
shown. Hence the question of interest is whether bandwidths less than this range value do
a better job of prediction. The above program, o_krige_cross_val.m, was run for each of
these 20 simulations. Based on this limited sample, the answer is definitely yes. The
cross-validation plot for the realization in Figure 6.17 is shown in Figure 6.18 below:
[Figure 6.18. Cross Validation Plot (RMSE versus bandwidth, with the best bandwidth and the range marked)]
So again the best bandwidth is seen to be about half the true range value. It is also
important to note that the estimates of the constant mean and covariogram parameters for
this realization are actually quite reasonable.
So it cannot be argued that this is a result of parameter-estimation error. Indeed, given the
moderate overestimation of the true range in this case, one might have expected larger
bandwidths to do quite well here.
Finally, it should be added that these best bandwidths showed considerable variation over
the 20 simulated realizations. The lowest was 5 km, and one was actually above the true
range (27 km), even though the range estimate for this case was almost exactly 25 km. So
a great deal seems to depend on the spatial structure of the particular pattern realized. But
this limited investigation does support the commonly held belief that the best bandwidths
for kriging predictions are generally less than the estimated range value.
We begin by developing the types of trend functions to be considered. Recall from the
Sudan Rainfall example in Section 2.1 that a number of such trend functions were
developed. Here the simplest of these postulated that there was some linear trend over
space expressible entirely in terms of the spatial coordinates, $s = (s_1, s_2)$, i.e.,

(7.1) $\mu(s) = \beta_0 + \beta_1 s_1 + \beta_2 s_2$

More generally, one may consider polynomial trend functions of the form,

(7.3) $\mu(s) = \beta_0 + \sum_{j=1}^k \beta_j\, s_1^{n_j} s_2^{m_j}$
where $n_j$ and $m_j$ are nonnegative integer values. Spatial trends in phenomena that vary
smoothly over space tend to be well approximated (locally) by such polynomial
functions. A good example is elevation in hilly terrain. The advantage of these functions
is that they require nothing more than the coordinate data in the map itself. Hence the
data for constructing such functions is essentially always available. It is for this reason
that ARCMAP uses polynomial functions as built-in options for modeling spatial trends
(including all polynomials up to order three, i.e., with $n_j + m_j \leq 3$, $j = 1, \ldots, k$). A second
advantage of these functions is that even though they may involve many spatially
nonlinear terms, they are still linear in parameters. In other words, such functional forms
are linear in all parameters to be estimated, namely $\beta_0, \beta_1, \ldots, \beta_k$. So unlike the nonlinear
least squares estimation procedure required for the standard variogram models in Section
4.7.2 above, these models can be estimated by ordinary least squares (OLS).
But while such functions are sufficiently general to fit many types of spatial trends, they
offer little in the way of explanation regarding the nature of these trends. For example,
we saw in the introductory California Rainfall example that variables such as “altitude”
and “rain shadow” were useful predictors of average rainfall that are not captured by
coordinate positions. Even in the case of Vancouver Nickel used for Simple and Ordinary
Kriging above, it may well be that local soil types, as well as concentrations of other
mineral types, might yield better predictions of nickel deposits than simple location
coordinates. So, in the spirit of the regression model used in the California Rainfall
example, it is of interest to consider linear-in-parameter spatial trend functions involving
many possible spatial attributes:

(7.4) $\mu(s) = \beta_0 + \sum_{j=1}^k \beta_j\, x_j(s)$

This is seen to include all examples above, where for example one may have polynomial
terms, $x_j(s) = s_1^{n_j} s_2^{m_j}$, or more general spatial attributes such as $x_j(s) =$ "altitude at $s$".
Such trend functions yield spatial models of the form,

(7.5) $Y(s) = \beta_0 + \sum_{j=1}^k \beta_j\, x_j(s) + \varepsilon(s)\,, \quad s \in R$

which appear to be simply instances of classical linear regression models like the
California Rain example. However, there is one important difference, namely that the
spatial random effects, $\varepsilon(s)$, are allowed to exhibit nonzero covariances. More formally,
such models are instances of the general linear regression model that allows for nonzero
covariances between residuals. Hence to develop spatial prediction models with non-
constant trends, we turn first to a consideration of the general linear regression model
itself.
To formalize such models in the simplest way, it is essential to use vector representations.
We start with a given finite sample,¹ $Y = [Y(s_i) : i = 1, \ldots, n]' = (Y_i : i = 1, \ldots, n)'$, from a spatial
stochastic process with global trend of the form (7.5). Let

(7.1.1) $x_i = [x_{i0}, x_{i1}, \ldots, x_{ik}] = [1, x_1(s_i), \ldots, x_k(s_i)]$

denote the vector of relevant attributes at each sample location, $i = 1, \ldots, n$, where by
convention the "attribute", $x_{i0} \equiv 1$, corresponds to the intercept term ($\beta_0$) in (7.5). With
this convention, the integer $k$ denotes the actual number of spatial attributes used in the
model. If $\beta = (\beta_0, \beta_1, \ldots, \beta_k)'$ denotes the corresponding vector of coefficients, then (7.5)
can be rewritten as

(7.1.2) $Y_i = x_i \beta + \varepsilon_i\,, \quad i = 1, \ldots, n$
1
Notice that we now drop the notation, $Y_n$, used for this sample in Section 6 [in order to avoid confusion with the data point, $Y_n = Y(s_n)$].
Hence by setting

(7.1.3) $X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{pmatrix} = \begin{pmatrix} x(s_1) \\ \vdots \\ x(s_n) \end{pmatrix}\,, \quad \varepsilon = \begin{pmatrix} \varepsilon(s_1) \\ \vdots \\ \varepsilon(s_n) \end{pmatrix} = \begin{pmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{pmatrix}$

this model takes the matrix form,

(7.1.4) $Y = X\beta + \varepsilon$
Our primary interest for the moment focuses on the residual vector, $\varepsilon$. Recall from
Section 3 that $\varepsilon$ is assumed to be multi-normally distributed with mean zero. Moreover,
the usual multiple regression model (as for example in the California Rain case) assumes
that the individual components of $\varepsilon$ are statistically independent, and hence have zero
covariance. Thus [as in (6.3.10) above] this covariance matrix has the form:

(7.1.5) $\mathrm{cov}(\varepsilon) = \begin{pmatrix} \sigma^2 & & 0 \\ & \ddots & \\ 0 & & \sigma^2 \end{pmatrix} = \sigma^2 I_n$

In this spatial context, the classical regression model can be formally summarized as
follows:

(7.1.6) $Y = X\beta + \varepsilon\,, \quad \varepsilon \sim N(0, \sigma^2 I_n)$
But as in Section 3.3 above, we wish to extend this model by allowing covariance-
stationary spatial dependencies between the individual components of $\varepsilon$. Hence, while
all diagonal elements will continue to have the constant value, $\sigma^2$, many of the off-
diagonal elements will now be nonzero. If we now let $\sigma_{ij}$ and $\rho_{ij}$ denote, respectively,
the covariance and correlation between residuals $\varepsilon_i$ and $\varepsilon_j$, and recall that [as in
expression (3.3.13)], $\rho_{ij} = \sigma_{ij}/\sigma^2$, then the general covariance matrix, $V$, for $\varepsilon$ can be
written as follows:

(7.1.7) $V = \mathrm{cov}(\varepsilon) = \begin{pmatrix} \sigma^2 & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 \begin{pmatrix} 1 & \cdots & \rho_{1n} \\ \vdots & \ddots & \vdots \\ \rho_{n1} & \cdots & 1 \end{pmatrix} = \sigma^2 C$
where $C$ is the corresponding correlation matrix for $\varepsilon$. The advantage of this particular
representation is that the important variance parameter, $\sigma^2$, is made explicit. Moreover,
(7.1.7) is now more easily related to the classical case in (7.1.5) where $C$ reduces to the
identity matrix, $I_n$. In this setting, the general linear regression model can be formally
summarized for our purposes by simply replacing $I_n$ with $C$ in (7.1.6), i.e.,²

(7.1.8) $Y = X\beta + \varepsilon\,, \quad \varepsilon \sim N(0, \sigma^2 C)$
OLS Estimators
Given a sample realization, $y = (y_1, \ldots, y_n)'$, of $Y$ in model (7.1.6), the method of ordinary
least squares (OLS) seeks to determine an estimate of the unknown coefficient vector,
$\beta$, that minimizes the sum of squared deviations between the $y_i$ values and their
estimated mean values, $x(s_i)\beta$. More formally, if the sum-of-squares function ($S$) is
defined for all possible $\beta$ values by:

(7.1.9) $S(\beta) = \sum_{i=1}^n [y_i - x(s_i)\beta]^2$

then the OLS estimator, $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k)'$, is taken to be the minimizer of (7.1.9), i.e.,

(7.1.10) $S(\hat{\beta}) = \min_\beta S(\beta)$

To determine this estimator, we begin by using (7.1.3) to rewrite this function in matrix
terms as,

(7.1.11) $S(\beta) = (y - X\beta)'(y - X\beta) = y'y - 2y'X\beta + \beta'X'X\beta$
Notice that this is again a quadratic form in the unknown value, $\beta$, that is similar to the
mean squared error function, $MSE(\lambda_0)$, in expression (6.2.27) above. So the solutions for
these two problems are also similar. In the present case, it is shown in Section A2.7.3 of
the Appendix that the solution to (7.1.10) is given by
2
In Part III of this NOTEBOOK we shall return to this general linear regression model in a somewhat different context. So both covariance representations, $V$ and $\sigma^2 C$, will be useful. For similar treatments see expression (9.11) in Gotway and Waller and Section 10.1 in Green (2003).
(7.1.12) $\hat{\beta} = (X'X)^{-1}X'Y$

Notice that we have used the random vector, $Y$, rather than the realized sample data, $y$, in
(7.1.12) in order to view $\hat{\beta}$ as a random vector defined for all realizations. [In statistical
terms, the distinction here is between $\hat{\beta}$ as an estimate of $\beta$ for a given data set, $y$, and
its role as an estimator of $\beta$ for all sample realizations of $Y$.] Notice also that for this
OLS estimator to be well defined, it is necessary that the matrix $X'X$ be nonsingular.
This will almost surely be guaranteed whenever the number of samples is substantially
greater than the number of parameters to be estimated, i.e., whenever $n \gg k + 1$.³ More
generally, statistical estimation of any set of parameters can only be reliable when the
number of data points well exceeds the number of parameters. In the case of classical
linear regression, a common rule of thumb is that there be at least 10 samples for every
parameter, i.e., $n \geq 10(k+1)$.
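In MATLAB, for example, (7.1.12) is computed directly from a design matrix, X, and data vector, y (a minimal sketch):

>> beta_hat = (X'*X)\(X'*y);   % OLS estimator (7.1.12); the backslash form, beta_hat = X\y, is numerically preferable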
Before proceeding to the more general case, it is important to point out that $\hat{\beta}$ is an
unbiased estimator, since under model (7.1.6), $E(Y) = X\beta$ implies that

(7.1.13) $E(\hat{\beta}) = E[(X'X)^{-1}X'Y] = (X'X)^{-1}X'E(Y) = (X'X)^{-1}X'X\beta = \beta$

What is equally important is the fact that (like the sample mean used in Simple Kriging
predictions) this unbiasedness property is independent of $\mathrm{cov}(\varepsilon)$. All that is required is
that the linear trend specification, $X\beta$, is correct [i.e., that $E(\varepsilon) = 0$]. So in the case of
California Rainfall, for example, if the four final variables used were a correct
specification of the model, then regardless of possible spatial dependencies among
residuals ignored in this model, the estimated beta coefficients would still be unbiased.
GLS Estimators
To extend these results to generalized linear regression, we employ the fact that every
(nonsingular) covariance matrix can be decomposed in a very simple way. For the
covariance matrix, $C$, in (7.1.7) it is shown in the Appendix [by combining the Positive
Definiteness Property above expression (A2.7.67) with the Cholesky Theorem following
expression (A2.7.45)] that there exists a Cholesky decomposition of $C$, i.e., there exists a
lower triangular matrix,

(7.1.14) $T = \begin{pmatrix} t_{11} & 0 & \cdots & 0 \\ t_{21} & t_{22} & & \vdots \\ \vdots & & \ddots & 0 \\ t_{n1} & t_{n2} & \cdots & t_{nn} \end{pmatrix}$

such that
3
The symbol "$\gg$" is conventionally used to mean "substantially greater than".
(7.1.15) $C = TT'$

The matrix, $T$, is usually called the Cholesky matrix for $C$. While we require no detailed
knowledge of such matrices here, it is of interest to point out that the desired Cholesky
matrix is easily obtained in MATLAB by the command,⁴

>> T = chol(C)';
Perhaps the most attractive feature of the lower triangular matrices is that they are
extremely easy to invert (and indeed first appeared as part of the classical “Gaussian
elimination” method for solving systems of linear equations). Moreover, it is this inverse,
T 1 , which is directly useful for out purposes. In particular, since C is given, we can
compute T 1 prior to any analysis of model (7.1.8). But if we then premultiply both sides
of the equation in (7.1.8) to obtain,
(7.1.18) Y X
T 1 ( 2C )(T ) 1
2 T 1 (T T )(T ) 1
2 (In )
4
Note the transpose operation here. MATLAB for some reason has chosen to produce T rather than T.
(7.1.20) $\tilde{Y} = \tilde{X}\beta + \tilde{\varepsilon}\,, \quad \tilde{\varepsilon} \sim N(0, \sigma^2 I_n)$
Finally, by comparing this with (7.1.6) we see that the generalized linear regression
model in (7.1.8) has been transformed into a classical linear regression model. This may
seem a bit like “magic”. But it is simply one of the many consequences of the Linear
Invariance Theorem for multi-normal random vectors, and serves to underscore the
power of this theorem.
Given this equivalence, we may again use OLS to estimate $\beta$. In particular, by using the
transformed data, $(\tilde{X}, \tilde{Y})$, in (7.1.17), it follows at once from (7.1.12) that the desired OLS
estimator is given by

(7.1.21) $\hat{\beta} = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{Y}$
To distinguish this from the classical linear regression model, it is customary to transform
this estimator back into the form of the generalized linear regression model. This amounts
simply to substituting the above relations into (7.1.21) [and using the matrix identity,
$(TT')^{-1} = (T')^{-1}T^{-1} = (T^{-1})'(T^{-1})$] to obtain

(7.1.23) $\hat{\beta} = (X'C^{-1}X)^{-1}X'C^{-1}Y$
which is entirely independent of Cholesky matrices or transformed models. For our later
purposes, it is convenient to rewrite (7.1.23) using the full covariance matrix, $V$, for $\varepsilon$ in
(7.1.7), i.e.,

(7.1.24) $\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}Y$

The latter version is typically designated as the generalized least squares (GLS) estimator
of $\beta$. However, these two versions are in fact equivalent representations of the GLS
estimator, since the substitution, $V = \sigma^2 C$, in (7.1.24) shows that

(7.1.25) $(X'V^{-1}X)^{-1}X'V^{-1}Y = \big[X'(\sigma^2 C)^{-1}X\big]^{-1}X'(\sigma^2 C)^{-1}Y = (X'C^{-1}X)^{-1}X'C^{-1}Y$
Note that this identity also shows that the GLS estimator, $\hat{\beta}$, is functionally independent
of $\sigma^2$. This independence will prove to be enormously useful in later applications.
Note also that even though $\hat{\beta}$ is still dependent on the covariance matrix, $V$, this
dependence has no effect on the unbiasedness of $\hat{\beta}$. This should be obvious from its
equivalence to an OLS estimator. But in any case, by taking expectations in (7.1.24) we
see that

(7.1.26) $E(\hat{\beta}) = (X'C^{-1}X)^{-1}X'C^{-1}E(Y) = (X'C^{-1}X)^{-1}X'C^{-1}(X\beta) = (X'C^{-1}X)^{-1}(X'C^{-1}X)\beta = \beta$
Finally, it should be noted that by letting $\tilde{y} = T^{-1}y$, one can also transform the sum-of-
squares function, $S$, (by using the same matrix identities above) to obtain:

(7.1.27) $\tilde{S}(\beta) = (\tilde{y} - \tilde{X}\beta)'(\tilde{y} - \tilde{X}\beta) = [T^{-1}(y - X\beta)]'[T^{-1}(y - X\beta)] = (y - X\beta)'(T^{-1})'(T^{-1})(y - X\beta) = (y - X\beta)'(TT')^{-1}(y - X\beta) = (y - X\beta)'\,C^{-1}(y - X\beta)$

Note again that since $C$ differs from $V = \sigma^2 C$ by a positive scalar, it can be replaced by
$V$ in (7.1.27) without altering the solution. Both forms are seen to be weighted versions of
(7.1.11). For this reason, GLS estimation is often referred to as weighted least squares. In
any case, it should be clear that by minimizing (7.1.27) to obtain (7.1.23) [or (7.1.24)],
one need never mention Cholesky matrices or transformed models. But this underlying
equivalence between OLS and GLS has many consequences that are not readily
perceived otherwise (as will be seen, for example, in Section 7.3.4 below).
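For completeness, both routes to the GLS estimator are easily sketched in MATLAB (a sketch, assuming the design matrix, X, data vector, y, and correlation matrix, C, or full covariance matrix, V, are given):

>> beta_gls = (X'*(V\X)) \ (X'*(V\y));   % direct form (7.1.24)
>> T  = chol(C)';                        % lower triangular Cholesky matrix (note the transpose; see footnote 4)
>> Xt = T\X;  yt = T\y;                  % transformed data, as in (7.1.17)
>> beta_gls2 = Xt\yt;                    % OLS on the transformed model, as in (7.1.21)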
Here the standard regression procedure is simply to plug in the beta estimators, $\hat{\beta}$, and use the derived "Yhat" estimator,

(7.1.28)   $\hat{Y}(s) = x(s)'\hat{\beta} = \hat{\beta}_0 + \sum_{j=1}^{k}\hat{\beta}_j\, x_j(s)$

Hence, even if one were able to establish optimality properties for individual estimators, $\hat{\beta}_j$, there would remain the question as to whether linear combinations of estimators, such as in (7.1.28), were still optimal in any sense.
It is for this reason that a much more powerful way to characterize optimality properties of vector estimators is in terms of all possible linear combinations of these estimators. In the present case, observe that if we now focus on GLS estimators and consider any linear combination of the unknown vector, $\beta$, say $a'\beta$, then by (7.1.24) the corresponding estimator, $a'\hat{\beta}$, takes the form

(7.1.30)   $a'\hat{\beta} = a'(X'V^{-1}X)^{-1}X'V^{-1}Y = \alpha' Y$

where $\alpha' = a'(X'V^{-1}X)^{-1}X'V^{-1}$. But since $(a, X, V)$ are all known values, this estimator is indeed seen to be a linear estimator of $a'\beta$, i.e., a linear function of the $Y$ vector (in a manner completely analogous to Simple and Ordinary Kriging weights). Moreover, by the argument in (7.1.26) it follows at once that

(7.1.31)   $E(a'\hat{\beta}) = a'E(\hat{\beta}) = a'\beta$
Hence the "plug-in" estimator, $a'\hat{\beta}$, is seen to be a linear unbiased estimator of $a'\beta$, for all possible choices of $a$. But the real power of this "linear compound" approach is that it provides a natural definition of best linear unbiased estimators in this vector setting. In particular, we now say that $\hat{\beta}$ is a Best Linear Unbiased (BLU) estimator of $\beta$ if and only if, in addition to (7.1.30) and (7.1.31), it is also true that the variance of $a'\hat{\beta}$ is smallest among all such linear unbiased estimators. More formally, if we now denote the class of all linear unbiased estimators, $\tilde{\beta}$, of $\beta$ by

(7.1.32)   $LU(\beta) = \bigl\{\,\tilde{\beta}(X,V,Y) : [\,a'\tilde{\beta} = \alpha'Y\,]\ \&\ [\,E(a'\tilde{\beta}) = a'\beta\,]\,,\ \forall\, a \in \mathbb{R}^{k+1}\bigr\}$
then $\hat{\beta}$ is said to be a Best Linear Unbiased (BLU) estimator of $\beta$ if and only if for all linear compounds, $a \in \mathbb{R}^{k+1}$,

(7.1.33)   $\mathrm{var}(a'\hat{\beta}) \le \mathrm{var}(a'\tilde{\beta})\,, \quad \tilde{\beta} \in LU(\beta)$

While this definition looks rather ambitious, it is shown in the Appendix (see the first subsection of Section A2.8.3) that the unique estimator in $LU(\beta)$ satisfying this minimum variance condition for all $a \in \mathbb{R}^{k+1}$ is precisely the GLS estimator in (7.1.24).
As discussed in detail in Section 3 above, our primary interest in GLS models is to allow
covariance structures to reflect spatially dependent random effects. We are now in a
position to see the consequences of such effects in more detail. To do so, we begin with
the simplest possible spatial regression model, where such effects can be seen explicitly.
We then examine these effects in a more complex setting by means of simulation.
Here we start with the simplest possible spatial regression model, one with a constant mean, i.e., with no "explanatory variables" at all:

(7.1.34)   $Y(s) = \mu + \epsilon(s)$
In this context, suppose we ignore possible spatial dependencies among residuals, and assume simply that the residuals in (7.1.34) are independent, say $\epsilon(s) \overset{iid}{\sim} N(0, \sigma^2)$. Then the model takes the matrix form

(7.1.35)   $Y = 1_n\,\mu + \epsilon\,, \quad \epsilon \sim N(0,\,\sigma^2 I_n)$

where in this case, $X = 1_n = (1,..,1)'$ and $\beta = \mu$. Hence for this case it follows from (7.1.24) that the BLU estimator of $\mu$ is given by:

(7.1.36)   $\hat{\mu} = (1_n' 1_n)^{-1}\, 1_n' Y = \tfrac{1}{n}\sum_{i=1}^{n} Y_i = \bar{Y}$

which is of course simply the sample mean, $\bar{Y}$. [Recall also expressions (6.3.11) and (6.3.12)]. Moreover, recall from (3.1.19) that the variance of this estimator must be given by

(7.1.37)   $\mathrm{var}(\bar{Y}) = \dfrac{\sigma^2}{n}$
So all inferences about the true value of $\mu$ will be based on the estimator, $\bar{Y}$, and its variance in (7.1.37).
But suppose that in reality there are positive spatial dependencies among the residuals in (7.1.35), so that the covariance of $\epsilon$ in fact has the form

(7.1.38)   $\mathrm{cov}(\epsilon) = \begin{pmatrix} \sigma^2 & \cdots & \mathrm{cov}(\epsilon_1, \epsilon_n) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(\epsilon_n, \epsilon_1) & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 \begin{pmatrix} 1 & \cdots & \rho_{1n} \\ \vdots & \ddots & \vdots \\ \rho_{n1} & \cdots & 1 \end{pmatrix}$

with $\rho_{ij} > 0$ for many distinct $(i, j)$ pairs. Then, in a manner similar to expression (4.10.3) above, it follows that since $\mathrm{cov}(Y) = \mathrm{cov}(\epsilon)$, the true variance of $\bar{Y}$ is given by

(7.1.39)   $\mathrm{var}(\bar{Y}) = \frac{1}{n^2}\Bigl(n\,\sigma^2 + \sigma^2\sum_i \sum_{j \ne i}\rho_{ij}\Bigr) = \frac{\sigma^2}{n}\Bigl(1 + \frac{1}{n}\sum_i \sum_{j \ne i}\rho_{ij}\Bigr) \;>\; \frac{\sigma^2}{n}$

and hence that the true standard deviation, $\sigma(\bar{Y})$, is much larger than assumed, i.e.,

(7.1.40)   $\sigma(\bar{Y}) \;>\; \dfrac{\sigma}{\sqrt{n}}$
This means, for example, that if we consider a 95% confidence interval for the true mean, $\mu$, then the actual interval is given by

(7.1.41)   $\bar{y} \pm 1.96\,\sigma(\bar{Y})$
So for any given estimate, $\bar{y}$, this implies from (7.1.40) that the actual confidence intervals for $\mu$ are much larger than those calculated, as depicted schematically below:

[Schematic: the assumed CI around $\bar{y}$ lies well inside the much wider actual CI.]
Thus if such spatial dependencies are not accounted for, then the results obtained will
tend to look “too good”. It is this type of false significance that motivates the need to
remove the effects of spatial dependencies in residuals before attempting to draw
statistical inferences.
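The magnitude of this effect is easily illustrated. The following MATLAB sketch (with a purely hypothetical covariance structure in which every distinct pair of residuals has correlation .1) computes the assumed and actual variances of $\bar{Y}$ as in (7.1.37) and (7.1.39):

   % Variance inflation of the sample mean under positive correlation
   n = 100;  sigma = 1;
   R = .1*ones(n) + .9*eye(n);     % rho_ij = .1 for all i ~= j, rho_ii = 1
   V = sigma^2 * R;                % cov(eps), as in (7.1.38)
   w = ones(n,1)/n;                % so that Ybar = w'*Y
   sigma^2/n                       % assumed variance:  0.0100
   w'*V*w                          % actual  variance:  0.1090

Here the actual variance, $(1 + (n-1)(.1))\,\sigma^2/n = .109$, is more than ten times the assumed value, so the corresponding confidence intervals are more than three times too narrow.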
To examine these effects in a more complex setting, suppose now that the data are generated by a linear model with two explanatory variables, $y(s_i) = \beta_0 + \beta_1 x_1(s_i) + \beta_2 x_2(s_i) + u_i$, $i = 1,..,100$, with specific parameter values, $\beta_0 = 1$, $\beta_1 = .04$, and $\beta_2 = .08$. Suppose moreover that the residuals $\{u_i : i = 1,..,100\}$ are part of an underlying covariance-stationary spatial stochastic process with covariogram, $C(h)$, parameterized by $[r = 5,\ s = 1,\ a = 0]$, as shown in Figure 7.1 below.
[Figure 7.1. Example Covariogram: a spherical covariogram $C(h)$ with sill $s = 1$ and range $r = 5$]
Given this model, one can in principle calculate the theoretical estimates and standard
errors for any given set of data, $\{y_i : i = 1,..,100\}$, under the (OLS) assumption of
independent errors, and under the true (GLS) model itself. But it is more instructive to
simulate this model many times and compare the OLS and GLS estimates of beta
parameters. In Table 7.1 below, the average results of 100 simulations are shown, where
the “GLS Est” column shows the average GLS estimates of each beta parameter, the
“GLS Std Err” column shows the corresponding average standard errors of these
estimates, and similarly for the OLS columns.
            GLS Est    GLS Std Err    OLS Est    OLS Std Err
   const    0.9284     0.4802         0.9156     0.2396
   X1       0.0564     0.0565         0.0568     0.0289
   X2       0.0897     0.0565         0.0934     0.0289

Table 7.1. Average Estimates over 100 Simulations
Notice first that while the GLS estimates are on average slightly better than the OLS estimates, both sets of estimates are unbiased (regardless of the true covariance) and should tend to be roughly the same. The real difference lies in the estimated standard errors for each of these models. Here it is clear that the GLS standard errors are about twice as large as the OLS standard errors. So as a direct parallel to expression (7.1.40) above, it is now clear that by ignoring the true spatial dependencies, OLS is severely underestimating the true standard deviations. Hence the confidence intervals on the true beta values are again much tighter than they should be.
            GLS Est    GLS Std Err    OLS Est    OLS Std Err
   const    1.5197     0.6754         2.0143     0.2669
   X1      -0.0062     0.0789        -0.1228     0.0352
   X2       0.0913     0.0789         0.0981     0.0352

Table 7.2. Estimates for a Single Selected Simulation
This example illustrates a particularly bad case, in which the estimates of $\beta_1$ actually have the wrong sign under both OLS and GLS. But if we display the 95% confidence intervals for each case, we can see a substantial difference in the conclusions reached. First, for the GLS case we have:

$-.0062 \pm (1.963)(.0789) \;=\; (-.1611,\ .1487)$
In particular, since the true value, $\beta_1 = .04$, is contained in this interval, this value cannot be ruled out by these results. More generally, since zero is also contained in this interval, it certainly cannot be concluded that $x_1$ is negatively related to $y$. On the other hand, since the corresponding OLS confidence interval is given by:

$-.1228 \pm (1.963)(.0352) \;=\; (-.1919,\ -.0537)$

it must here be concluded that $\beta_1 \le -.0537$, and thus that $x_1$ is significantly negatively related to $y$. This is precisely the type of false significance that one seeks to avoid by allowing for the possibility of spatially-dependent errors in estimation procedures.
Given this general linear regression framework, together with our present emphasis on
modeling spatially-dependent errors, the task remaining is to develop specific methods
for spatial prediction within this setting. Recall from our general classification of Kriging
models in Section 6.1.2 that the method for doing so is known as Universal Kriging.
Hence we now develop this spatial prediction model in more detail.
Recall from (6.1.10) and (6.1.11) that the basic probability model underlying Universal
Kriging is essentially the general linear regression model in (7.1.2) above. Within this
probabilistic framework, the task of spatial prediction (as in both Simple and Ordinary
Kriging) is to determine a BLU predictor for values, $Y(s_0)$, at locations $s_0$ not in the given sample data set, $Y_n = [Y(s_i) : i = 1,..,n]$. As we shall see in the next section, this
essentially amounts to an appropriate extension of the analysis for Ordinary Kriging.
Following this development we derive the appropriate standard error of prediction for
Universal Kriging. As with Simple Kriging, our main interest in Universal Kriging is that
it provides the simplest setting within which one can include the types of spatial trend
models developed above. Because this model is included as part of ARCMAP, we also
outline the procedure for implementing this model. However, the main role of this model
for our present purposes is to serve as an introduction to Geostatistical Regression and
Kriging, as developed in Section 7.3 below.
Here we again start with a given prediction set, $S(s_0) = \{s_i : i = 1,..,n_0\} \subseteq \{s_1,..,s_n\}$, for $s_0$, together with corresponding prediction samples, $Y = [Y(s_i) : i = 1,..,n_0]$.5 Moreover, by

5 Note that we have now returned to the convention that $Y_n$ denotes the full sample vector and $Y$ is the prediction sample vector for $s_0$. As with Ordinary Kriging, both random vectors will be used here.
again appealing to the linear prediction hypothesis, it is assumed that the desired predictor, $\hat{Y}_0 = \hat{Y}(s_0)$, is of the form:

(7.2.1)   $\hat{Y}_0 = \lambda_0' Y = \sum_{i=1}^{n_0}\lambda_{0i}\, Y(s_i)$

for some appropriate weight vector, $\lambda_0 = (\lambda_{01},..,\lambda_{0n_0})'$. Turning next to the unbiasedness condition, it follows from condition (6.3.14) for Ordinary Kriging that this unbiasedness condition again takes the basic form:

$E(\hat{Y}_0) = E(Y_0)$
But now these expectations are more complex. By (7.1.2) we see that $E[Y(s)] = x(s)'\beta$ at each location, $s$, where the attribute vectors at $s_0$ and at the prediction sites are given respectively by

(7.2.5)   $x_0 = x(s_0) = \begin{pmatrix} 1 \\ x_{01} \\ \vdots \\ x_{0k} \end{pmatrix}$

(7.2.6)   $X_0 = \begin{pmatrix} x(s_1)' \\ \vdots \\ x(s_{n_0})' \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n_0 1} & \cdots & x_{n_0 k} \end{pmatrix}$

Then since

(7.2.7)   $E(\hat{Y}_0) = \sum_{i=1}^{n_0}\lambda_{0i}\, x(s_i)'\beta = (\lambda_{01},..,\lambda_{0n_0})\begin{pmatrix} x(s_1)' \\ \vdots \\ x(s_{n_0})' \end{pmatrix}\beta = \lambda_0' X_0\, \beta$
But since this unbiasedness condition is required to hold for all $\beta$, it should be clear that this is only possible if $x_0'\beta - \lambda_0' X_0\,\beta = 0$ for all $\beta$, or equivalently, if and only if

(7.2.9)   $X_0'\lambda_0 = x_0$
Turning finally to the efficiency condition, the argument in (6.3.17) for Ordinary Kriging can now be extended by using (7.2.8) to show that the prediction error variance continues to be the same as the residual mean squared error:

(7.2.10)   $\mathrm{var}(Y_0 - \hat{Y}_0) = E[(Y_0 - \hat{Y}_0)^2]$

But since all covariances are given, it follows by setting $V_0 = \mathrm{cov}(Y)$ and $c_0 = \mathrm{cov}(Y, Y_0)$ that (as with both Simple and Ordinary Kriging) the prediction error variance must again be given by

(7.2.11)   $E[(Y_0 - \hat{Y}_0)^2] = \mathrm{var}(Y_0) - 2\,\lambda_0' c_0 + \lambda_0' V_0\, \lambda_0$

Hence the optimal weight vector, $\hat{\lambda}_0$, for the case of Universal Kriging must be the solution to the following constrained minimization problem:

(7.2.12)   $\min_{\lambda_0}\ \bigl[\mathrm{var}(Y_0) - 2\,\lambda_0' c_0 + \lambda_0' V_0\, \lambda_0\bigr] \quad \text{subject to} \quad X_0'\lambda_0 = x_0$
At this point, it should be clear that Ordinary Kriging is simply a special case of Universal Kriging. Indeed, if one eliminates all explanatory variables and keeps only the intercept term in (7.1.2), then by (7.2.5) and (7.2.6), $x_0$ reduces to $1$ and $X_0$ reduces to $1_{n_0}$, so that the constraint in (7.2.12) reduces to $1_{n_0}'\lambda_0 = 1$, which is precisely (6.3.18). This is a consequence of the fact that under the assumptions of Ordinary Kriging, this reduced model implies a constant mean, i.e., $E[Y(s)] = \beta_0 = \mu$ for all $s$.
Turning now to the solution, $\hat{\lambda}_0$, of (7.2.12), it is shown in the Appendix [expression (A2.8.58)] that

(7.2.14)   $\hat{\lambda}_0 = V_0^{-1}\bigl[\,c_0 + X_0\,(X_0' V_0^{-1} X_0)^{-1}(x_0 - X_0' V_0^{-1} c_0)\,\bigr]$
By substituting $\hat{\lambda}_0$ into (7.2.1), we then obtain the following BLU predictor of $Y_0 = Y(s_0)$ for Universal Kriging [see also expression (A2.8.59) in the Appendix]:

(7.2.15)   $\hat{Y}_0 = \hat{\lambda}_0' Y = \bigl[\,c_0 + X_0\,(X_0' V_0^{-1} X_0)^{-1}(x_0 - X_0' V_0^{-1} c_0)\,\bigr]' V_0^{-1} Y$
While this solution appears to be even more complex than expression (6.3.20) for the Ordinary Kriging case, it turns out to have an equally simple interpretation. To show this, we start by noting that, as a parallel to (6.2.21), if we now estimate $\beta$ based solely on the prediction sample, $Y = [Y(s_i) : i = 1,..,n_0]$, for $Y_0$ (with attribute data, $X_0$, and covariance matrix, $V_0$), then it follows from (7.1.24) that the resulting GLS estimator of $\beta$, say $\hat{\beta}_{n_0}$, is given by

(7.2.16)   $\hat{\beta}_{n_0} = (X_0' V_0^{-1} X_0)^{-1} X_0' V_0^{-1} Y$

Moreover, by the results of Section 7.1.2 above, this must be the BLU estimator of $\beta$ based on this sample data. But by substituting (7.2.16) into (7.2.15), we then see that $\hat{Y}_0$ can be rewritten as

(7.2.17)   $\hat{Y}_0 = x_0'\hat{\beta}_{n_0} + c_0' V_0^{-1}\bigl(Y - X_0\,\hat{\beta}_{n_0}\bigr)$
Finally, since the last expression in brackets is simply the vector of estimated residuals,

(7.2.18)   $\hat{\epsilon} = Y - X_0\,\hat{\beta}_{n_0}$

this predictor can be constructed by the following two-step procedure:

(i). Construct the BLU estimator, $\hat{\beta}_{n_0}$, of $\beta$ based on the prediction sample data, $Y$, as in (7.2.16).

(ii). Use the sample residuals, $\hat{\epsilon}$, in (7.2.18) to obtain the Universal Kriging predictor, $\hat{\epsilon}_0 = c_0' V_0^{-1}\hat{\epsilon}$, of $\epsilon_0$, and set $\hat{Y}_0 = x_0'\hat{\beta}_{n_0} + \hat{\epsilon}_0$.
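The following MATLAB fragment is a minimal sketch of this two-step procedure, assuming that the prediction-sample data Y, design matrix X0, attribute vector x0, covariance matrix V0, and covariance vector c0 have already been constructed:

   % Two-step Universal Kriging predictor at s0
   b_n0 = (X0'/V0*X0)\(X0'/V0*Y);   % (i)  local GLS estimate, as in (7.2.16)
   e    = Y - X0*b_n0;              %      sample residuals, as in (7.2.18)
   e0   = c0'/V0*e;                 % (ii) kriged residual at s0
   Y0   = x0'*b_n0 + e0;            %      Universal Kriging predictor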
But as with Ordinary Kriging, it can also be argued that if $\beta$ characterizes the global trend over the entire region, $R$, then a better estimate can be obtained by using the GLS estimator,

(7.2.20)   $\hat{\beta}_n = (X' V^{-1} X)^{-1} X' V^{-1} Y_n$

based on the full set of samples, $Y_n$, with attribute data, $X$. It is this modified procedure that constitutes the most commonly used form of Universal Kriging.6 To formalize this procedure, it thus suffices to modify the two steps above as follows:
(1). Construct the BLU estimator, $\hat{\beta}_n$, of $\beta$ based on the full sample data, $Y_n$, as in (7.2.20).

(2). Use the sample residuals, $\hat{\epsilon} = Y - X_0\,\hat{\beta}_n$, to obtain the Universal Kriging predictor, $\hat{\epsilon}_0 = c_0' V_0^{-1}\hat{\epsilon}$, of $\epsilon_0$, and set $\hat{Y}_0 = x_0'\hat{\beta}_n + \hat{\epsilon}_0$.
As with Ordinary Kriging, one can obtain the prediction error variance for the optimal weight vector, $\hat{\lambda}_0$, by substituting (7.2.14) into (7.2.11). As shown in the Appendix [see expression (A2.8.69)], this yields the following explicit expression for the prediction error variance in the general case of Universal Kriging:

(7.2.21)   $\sigma_0^2 = \bigl[\mathrm{var}(Y_0) - c_0' V_0^{-1} c_0\bigr] + \bigl[(x_0 - X_0' V_0^{-1} c_0)'\,(X_0' V_0^{-1} X_0)^{-1}\,(x_0 - X_0' V_0^{-1} c_0)\bigr]$

Paralleling the interpretation of $\hat{\sigma}_0^2$ for Ordinary Kriging, the first bracketed expression in (7.2.21) is again the prediction error variance for Simple Kriging, and the second expression is again positive. This second term now accounts for the additional variance created by estimating $\beta$ internally. Finally, the resulting standard error of prediction for Universal Kriging is by definition the square root of (7.2.21), i.e., $\sigma_0 = \sqrt{\sigma_0^2}$.
6 As with Ordinary Kriging, there are again arguments for using the local version in [(i),(ii)] above. In fact, many treatments of Universal Kriging implicitly use this local version, as for example in Section 5.3.3 of Schabenberger and Gotway (2005).
With these preliminary observations, the implementation procedure for Universal Kriging can be specified as follows. We again start with a given set of sample data, $y_n = (y(s_i) : i = 1,..,n)$, in $R$, where each $y_i$ is taken to be a realization of the corresponding random variable, $Y(s_i)$, in a sample vector, $Y_n = [Y(s_i) : i = 1,..,n]$. This sample vector, $Y_n$, is now hypothesized to satisfy the generalized linear regression model in (7.1.8) with attribute data, $X$, and covariance matrix, $V$. In this context, we again consider the prediction of $Y_0 = Y(s_0)$ at a given location, $s_0 \in R$. This prediction is carried out through the following series of steps:
(7.2.27)   $\hat{V} = \begin{pmatrix} \hat{\sigma}^2 & \cdots & \hat{\sigma}_{1n} \\ \vdots & \ddots & \vdots \\ \hat{\sigma}_{n1} & \cdots & \hat{\sigma}^2 \end{pmatrix}$
Given the development of prediction-set selection in Section 6.4 above, we can now consider this selection problem more explicitly for Universal Kriging. In particular, we now assume that the appropriate prediction set, $S(s_0)$, is defined by an appropriate bandwidth, $h_0$, as follows:

(7.2.29)   $S(s_0) = \{\,s_i \in S_n : \|s_0 - s_i\| \le h_0\,\}$

where $S_n = \{s_1,..,s_n\}$ is again the full sample set of locations. Ideally, this bandwidth should be selected by a cross-validation procedure such as in Section 6.4. But given the computational intensity of such procedures, we here assume that $h_0$ is selected simply by a visual inspection of the mapped data surrounding site $s_0$.
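A minimal MATLAB sketch of this selection step, assuming the full location set is stored row-wise in a matrix Sn and that s0 is a row vector, might take the form:

   % Select the prediction set S(s0) in (7.2.29) by bandwidth h0
   d  = sqrt(sum((Sn - s0).^2, 2));   % distances ||s0 - si|| for all sites
   S0 = find(d <= h0);                % indices of sites within bandwidth h0
   n0 = numel(S0);                    % resulting prediction-sample size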
However, there is one additional requirement that must be met by prediction sets, $S(s_0) = \{s_1,..,s_{n_0}\}$, in the case of Universal Kriging. Recall that if the attribute vector at $s_0$ is denoted by $x_0$, as in (7.2.5), then the unbiasedness condition for Universal Kriging in (7.2.9) requires that

(7.2.30)   $X_0'\lambda_0 = x_0$
Given a prediction set, $S(s_0) = \{s_1,..,s_{n_0}\}$, one can then use (7.2.26) above to construct estimates of the set of covariances,

(7.2.31)   $\hat{C}_0 = \begin{pmatrix} \hat{\sigma}^2 & \hat{c}_0{}' \\ \hat{c}_0 & \hat{V}_0 \end{pmatrix}$

relevant for prediction of $Y(s_0)$ [as in (6.3.33)]. These can in turn be used to krige the residuals, as in the second step of the basic two-step procedure for Universal Kriging above. Here the procedure is as follows:
If the prediction sample data relevant for $s_0$ is denoted by $y = (y_1,..,y_{n_0})'$, and if the corresponding prediction residuals are estimated by

(7.2.32)   $\hat{\epsilon} = y - X_0\,\hat{\beta}_n$

then the kriged residual estimate at $s_0$ is given [as in step (2) above] by

(7.2.33)   $\hat{\epsilon}_0 = \hat{c}_0{}'\,\hat{V}_0^{-1}\,\hat{\epsilon}$

Finally, (7.2.33) can be combined with (7.2.28) to obtain the desired prediction of the unobserved value, $Y(s_0) = x_0'\beta + \epsilon_0$, at $s_0$, namely

$\hat{y}_0 = x_0'\,\hat{\beta}_n + \hat{\epsilon}_0$
7 More precisely, $x_0$ is required to lie in the span of these column vectors. Hence there must be at least $k+1$ linearly independent columns of $X_0$ to ensure this condition. But if this number were exactly $n_0 = k+1$, then $\lambda_0$ would be uniquely determined by $\lambda_0 = (X_0')^{-1}x_0$. So for nontrivial solutions one must require that $n_0 \ge k+2$.
one can use the pair $(\hat{Y}_0, \hat{\sigma}_0)$ to construct prediction intervals for $Y(s_0)$. As in (6.2.63), the default 95% interval takes the form:

$\hat{Y}_0 \pm 1.96\,\hat{\sigma}_0$
As mentioned at the beginning of Section 7.2.3 above, the estimation of variograms for Universal Kriging is somewhat problematic. In particular, observe that the OLS residuals in (7.2.24) used for estimation of variograms are generally not consistent with the final GLS residuals in (7.2.32). So if the variogram were re-estimated on the basis of these final residuals, the result would generally not agree with the variogram actually used. This inconsistency is simply ignored in the implementation of Universal Kriging outlined above, which renders the procedure somewhat ad hoc. To be more precise, if we now denote the parameter vector for the spherical variogram by

(7.3.1)   $\theta = (r, s, a)$

then on the one hand, if $\theta$ were known (as is implicit in the "known covariance" assumption of Universal Kriging), one could employ GLS estimation to determine $\hat{\beta}$. On the other hand, if $\beta$ were known, then the residual "data", $Y - X\beta$, could be used to construct a consistent estimate, $\hat{\theta}$, of the variogram parameters, $\theta$. Hence the real difficulty here lies in trying to obtain simultaneous estimates, $(\hat{\beta}, \hat{\theta})$, of these two sets of parameters. In Schabenberger and Gotway (2005, p.257) this circular argument is aptly described as the "cat and mouse game of Universal Kriging". While it is possible to reformulate this entire estimation problem in terms of more general maximum-likelihood methods,8 a more practical approach is simply to construct an iterative estimation procedure in which each parameter vector is estimated given some current value of the other. It is this procedure that we now develop in more detail.9
8 For further discussion of such methods, see Section 9.2.1 in Waller and Gotway (2004). Here it should also be noted that a maximum-likelihood estimation approach of this type will be developed to estimate spatial autoregressive models in Part III of this NOTEBOOK.

9 This procedure is also developed in Section 9.2.1 in Waller and Gotway (2004), where it is designated as the Iteratively Re-Weighted Generalized Least Squares (IRWGLS) procedure. A less formal presentation of the same idea is given in [BG], p.189.
Before doing so, it is important to emphasize that the type of spatial model developed
here has uses other than simply predicting values of Y at unobserved locations. A good
example is the California Rainfall study, already used to motivate the present class of
more general spatial trend functions. In this study, the main focus was on identifying
spatial attributes that are significant predictors of rainfall at each data location. While one
could also attempt to predict rainfall levels at locations not in the data set, this was not the
main objective. Hence it is useful to distinguish between two types of spatial applications
here. We begin with a general linear regression model as in (7.1.8), where it is now assumed that the covariance matrix, $V$, is generated by an underlying covariogram with parameter vector, $\theta$, in (7.3.1), which we now write explicitly as

(7.3.2)   $Y = X\beta + \epsilon\,, \quad \epsilon \sim N[0,\ V(\theta)]$
This is of course precisely the type of model postulated for Universal Kriging above.
However, since the iterative estimation procedure developed below differs from the
implementation of Universal Kriging as developed in Section 7.2.3, it is convenient to
distinguish between these two models. Hence we now designate model (7.3.2) [together
with its iterative implementation developed below] as a Geostatistical Regression model.
In the California Rainfall example, such a model might well be used to incorporate
possible spatial dependencies between rainfall in cities close to one another. The
emphasis here is on estimating $\beta$ in a manner that will allow proper statistical inferences
to be drawn about each of its components. On the other hand, such a model might also be
used for prediction purposes. Hence when such geostatistical regression models are used
for spatial prediction, they will be designated as Geostatistical Kriging models.10
We first give an overview of the estimation procedure, and then formalize its individual steps. Every iterative estimation procedure must start with some initial value. Here, as with Universal Kriging, the initialization used (step [1] below) is to estimate $\beta$ by OLS, which we designate as $\hat{\beta}_0$. The residuals, $\hat{\epsilon}_0$, generated by $\hat{\beta}_0$ are then used to obtain an
10 It should be noted that in other treatments, such as Schabenberger and Gotway (2005), all such implementations are regarded simply as different ways of estimating the same "Universal Kriging model". However, for our purposes it seems best to avoid confusion by reserving the term "Universal Kriging" for the implementation adopted in ARCMAP, as outlined in Section 7.2.3 above.

11 Note again that we here use $Y$ for the full sample rather than $Y_n$. The latter is only required when we need to distinguish between the full sample and subsamples used for prediction at each location.
estimate, $\hat{\theta}_0$, of the spherical variogram parameters in (7.3.1). These are in turn used (in steps [2] to [6] below) to obtain a GLS estimate, $\hat{\beta}_1$, of $\beta$ using the covariance matrix, $V(\hat{\theta}_0)$. Up to this point, the implementation is identical with that in Section 7.2.3. But the procedure now continues by re-estimating the variogram parameters from the new GLS residuals. If the resulting estimates, $(\hat{\beta}_1, \hat{\theta}_1)$, are deemed (as in steps [8] to [9] below) to be "sufficiently similar" to $(\hat{\beta}_0, \hat{\theta}_0)$, then the estimation procedure terminates with these as final values. Otherwise it continues until such values are found. With this overview, we now formalize these steps as follows:
[1] First estimate $\beta$ by OLS,

(7.3.3)   $\hat{\beta}_0 = (X'X)^{-1} X' y$

and construct the associated OLS residuals, $\hat{\epsilon}_0 = y - X\hat{\beta}_0$.
[2] Use these residuals to estimate an empirical variogram, $\hat{\gamma}_0(h)$, at some set of selected distance values, $(h_i : i = 1,..,q)$.

[3] Next use this empirical variogram data, $\{(\hat{\gamma}_{0i}, h_i) : i = 1,..,q\}$, to fit (by nonlinear least squares) a spherical variogram, $\gamma(h;\hat{\theta}_0)$, with parameter vector, $\hat{\theta}_0 = (\hat{r}_0, \hat{s}_0, \hat{a}_0)$.

[4] Then use the identity, $C(h) = \sigma^2 - \gamma(h)$, to construct the corresponding spherical covariogram, $C(h;\hat{\theta}_0)$.

[5] If the distance between each pair of data points, $s_i$ and $s_j$, is denoted by $h_{ij}$, then the covariance, $\sigma_{ij} = \mathrm{cov}(\epsilon_i, \epsilon_j)$, between the residuals at $s_i$ and $s_j$ is estimated by $\hat{\sigma}_{0ij} = C(h_{ij};\hat{\theta}_0)$, yielding the estimated covariance matrix,12
(7.3.6)   $\hat{V}_0 = V(\hat{\theta}_0) = \begin{pmatrix} \hat{\sigma}_0^2 & \cdots & \hat{\sigma}_{01n} \\ \vdots & \ddots & \vdots \\ \hat{\sigma}_{0n1} & \cdots & \hat{\sigma}_0^2 \end{pmatrix}$
[6] Using this covariance matrix, now apply GLS to obtain a new estimate of $\beta$:

(7.3.7)   $\hat{\beta}_1 = (X'\hat{V}_0^{-1}X)^{-1} X'\hat{V}_0^{-1} y$
[7] Then replace $\hat{\beta}_0$ by $\hat{\beta}_1$ and apply steps [2] and [3] to obtain a new spherical variogram, $\gamma(h;\hat{\theta}_1)$, with parameter vector, $\hat{\theta}_1 = (\hat{r}_1, \hat{s}_1, \hat{a}_1)$.
[8] At this point, one can check to see if there are any "significant" differences between the initial parameter estimates, $(\hat{\beta}_0, \hat{\theta}_0)$, and the new estimates, $(\hat{\beta}_1, \hat{\theta}_1)$. Here there are many possible criteria for checking such differences. If one is primarily interested in the $\beta$ parameters (as is typical in regression), the simplest approach is to focus on fractional changes in these estimates by letting13

(7.3.10)   $\Delta_1 = \max\left\{\left|\dfrac{\hat{\beta}_{1j} - \hat{\beta}_{0j}}{\hat{\beta}_{0j}}\right| : j = 0,1,..,k\right\}$

One may then choose an appropriate threshold value, $\delta$ (say $\delta = .001$), and define a significant change to be $\Delta_1 > \delta$. If one is also interested in the variogram parameters, $(r, s, a)$, then one may replace (7.3.10) by the broader set of fractional changes,
12 Be careful not to confuse this initial estimate, $\hat{V}_0$, with the estimated sub-matrix of covariances, $\hat{V}_0$, used to predict $Y(s_0)$ in previous sections.

13 For a possible modification of this simple criterion, see Schabenberger and Gotway (2005, p.259).
(7.3.11)   $\bar{\Delta}_1 = \max\left\{\Delta_1\,,\ \left|\dfrac{\hat{r}_1 - \hat{r}_0}{\hat{r}_0}\right|,\ \left|\dfrac{\hat{s}_1 - \hat{s}_0}{\hat{s}_0}\right|,\ \left|\dfrac{\hat{a}_1 - \hat{a}_0}{\hat{a}_0}\right|\right\}$
[9] If there is no significant change, i.e., if $\Delta_1 \le \delta$ (or $\bar{\Delta}_1 \le \delta$), then stop the iterative estimation procedure and set the final parameter estimates to be $(\hat{\beta}, \hat{\theta}) = (\hat{\beta}_1, \hat{\theta}_1)$.

[10] On the other hand, if $\Delta_1 > \delta$ (or $\bar{\Delta}_1 > \delta$), then continue the iterative estimation procedure by replacing $\hat{\theta}_0$ with $\hat{\theta}_1$ in steps [4] through [7] to obtain a new estimate, $\hat{\beta}_2$ [based on the new covariance matrix, $\hat{V}_1 = V(\hat{\theta}_1)$], and new variogram parameter estimates, $\hat{\theta}_2 = (\hat{r}_2, \hat{s}_2, \hat{a}_2)$.
[11] With these new parameters, define $\Delta_2$ (or $\bar{\Delta}_2$) as in step [8]. If $\Delta_2 \le \delta$ (or $\bar{\Delta}_2 \le \delta$), then stop the procedure and set the final parameter estimates to $(\hat{\beta}, \hat{\theta}) = (\hat{\beta}_2, \hat{\theta}_2)$.

[12] On the other hand, if $\Delta_2 > \delta$ (or $\bar{\Delta}_2 > \delta$), then continue the iterative estimation procedure by replacing $(\hat{\beta}_1, \hat{\theta}_1)$ with $(\hat{\beta}_2, \hat{\theta}_2)$ in steps [4] through [7].

[13] Continue in the same way until a set of parameters, $(\hat{\beta}_m, \hat{\theta}_m)$, is found for which $\Delta_m \le \delta$ (or $\bar{\Delta}_m \le \delta$). Then stop the procedure and set the final estimates to $(\hat{\beta}, \hat{\theta}) = (\hat{\beta}_m, \hat{\theta}_m)$.
These final parameter estimates are said to be mutually consistent in the sense that the covariance matrix, $\hat{V} = V(\hat{\theta})$, will (approximately) reproduce $\hat{\beta}$ as

$\hat{\beta} \approx (X'\hat{V}^{-1}X)^{-1} X'\hat{V}^{-1} y$
Here it should be emphasized that while this mutual consistency property is certainly desirable from a conceptual viewpoint, there is no guarantee that any of the Best Linear Unbiased estimation properties for GLS estimators will continue to hold for $\hat{\beta}$. Hence, as discussed at the end of the implementation for Simple Kriging in Section 6.2.5 above, these are often designated as Empirical GLS estimators.14
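To make the flow of steps [1] through [13] concrete, the following MATLAB sketch outlines one possible form of the iteration. It is only a schematic stand-in for the actual course implementation in geo_regr.m: the helper functions emp_variogram, fit_spherical, and spher_cov used below are hypothetical placeholders for the empirical-variogram, nonlinear-least-squares, and covariogram-construction steps.

   % Schematic IRWGLS loop (hypothetical helpers; see geo_regr.m)
   function [b, theta] = irwgls_sketch(y, X, H, delta)
   % y: data, X: design matrix, H: pairwise distances h_ij, delta: threshold
   b = (X'*X)\(X'*y);                 % [1] OLS start, as in (7.3.3)
   for m = 1:100                      % cap on the number of iterations
       e     = y - X*b;               % current residuals
       gh    = emp_variogram(e, H);   % [2] empirical variogram (assumed helper)
       theta = fit_spherical(gh);     % [3] fit (r,s,a) by NLS (assumed helper)
       V     = spher_cov(H, theta);   % [4]-[5] covariance matrix V(theta)
       b1    = (X'/V*X)\(X'/V*y);     % [6]-[7] GLS update, as in (7.3.7)
       Del   = max(abs((b1 - b)./b)); % [8] fractional change, as in (7.3.10)
       b = b1;
       if Del <= delta, break; end    % [9] stop: estimates mutually consistent
   end                                % [10]-[13] otherwise iterate
   end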
Given the final regression estimates, $\hat{\beta}$, one can use the parameter estimates, $\hat{\theta} = (\hat{r}, \hat{s}, \hat{a})$, to construct the final covariance matrix estimate as follows:

(7.3.19)   $\hat{V} = V(\hat{\theta}) = \begin{pmatrix} \hat{\sigma}^2 & \cdots & \hat{\sigma}_{1n} \\ \vdots & \ddots & \vdots \\ \hat{\sigma}_{n1} & \cdots & \hat{\sigma}^2 \end{pmatrix}$

To develop the sampling distribution of $\hat{\beta}$, observe that by substituting the model (7.3.2) for $Y$ we obtain

(7.3.20)   $\hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}Y = (X'V^{-1}X)^{-1}X'V^{-1}(X\beta + \epsilon)$
$\qquad\quad = (X'V^{-1}X)^{-1}(X'V^{-1}X)\,\beta + (X'V^{-1}X)^{-1}X'V^{-1}\epsilon = \beta + (X'V^{-1}X)^{-1}X'V^{-1}\epsilon$
14 See for example the discussion in Waller and Gotway (2004, p.337).
But by the Linear Invariance Theorem for multi-normal random vectors [in (3.2.22)], it then follows that $\hat{\beta}$ is multi-normally distributed with mean

$E(\hat{\beta}) = \beta$

and covariance,

(7.3.23)   $\mathrm{cov}(\hat{\beta}) = (X'V^{-1}X)^{-1}X'V^{-1}\,\mathrm{cov}(\epsilon)\,V^{-1}X(X'V^{-1}X)^{-1} = (X'V^{-1}X)^{-1}X'V^{-1}V\,V^{-1}X(X'V^{-1}X)^{-1}$
$\qquad\qquad = (X'V^{-1}X)^{-1}(X'V^{-1}X)(X'V^{-1}X)^{-1} = (X'V^{-1}X)^{-1}$

Hence by replacing $V$ with its final estimate, $\hat{V}$, we obtain the estimated covariance matrix,

(7.3.24)   $\widehat{\mathrm{cov}}(\hat{\beta}) = (X'\hat{V}^{-1}X)^{-1} = \begin{pmatrix} \hat{v}_{11} & \cdots & \hat{v}_{1k} \\ \vdots & \ddots & \vdots \\ \hat{v}_{k1} & \cdots & \hat{v}_{kk} \end{pmatrix}$

with associated standard errors,

(7.3.25)   $s_j = \sqrt{\hat{v}_{jj}}$

for each beta parameter estimate, $\hat{\beta}_j$, $j = 0,1,..,k$. These standard errors can then be used to construct p-values for significance tests of these coefficients based on the t-ratios:

(7.3.26)   $t_j = \hat{\beta}_j / s_j\,, \quad j = 0,1,..,k$
Hence, standard tests of significance can be carried out in terms of these estimates.15 This
procedure is implemented in the MATLAB program, geo_regr.m, and will be illustrated
in Section 7.3.4 below.
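Given final estimates b and Vhat from such a procedure, the computations in (7.3.24) through (7.3.26) amount to only a few lines of MATLAB (a sketch; tcdf is the Statistics Toolbox t-distribution function):

   % Standard errors, t-ratios and p-values, as in (7.3.24)-(7.3.26)
   covb = inv(X'/Vhat*X);            % estimated covariance of beta-hat
   s    = sqrt(diag(covb));          % standard errors (7.3.25)
   t    = b ./ s;                    % t-ratios (7.3.26)
   df   = length(y) - length(b);     % n - (k+1) degrees of freedom
   p    = 2*(1 - tcdf(abs(t), df));  % two-sided p-values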
Recall that Universal Kriging used a prior estimate of the variogram parameters based on
OLS residuals. But one can now improve this procedure by using the mutually consistent
estimates obtained above. In doing so, we must again distinguish between the full sample
15 As with OLS, $t_j$ is t-distributed with $n - (k+1)$ degrees of freedom under the null hypothesis, $\beta_j = 0$. See also expressions (9.16) through (9.18) in Waller and Gotway (2004).
to emphasize that this model refers to the full sample. Hence for the mutually consistent estimates, $(\hat{\beta}, \hat{\theta}) = [\hat{\beta}, (\hat{r}, \hat{s}, \hat{a})]$, obtained from the iterative procedure above, the estimate, $\hat{\theta}$, now yields the (full sample) GLS estimate, $\hat{\beta}_n$, in (7.2.28), together with the estimated covariance matrix, $\hat{V} = V(\hat{\theta})$, in (7.3.19).

At this point, Steps 4 through 8 in the implementation of Universal Kriging can now be carried out intact [where the prediction covariance estimates, $\hat{C}_0$, in (7.2.31) are again assumed to be constructed using the variogram parameters, $\hat{\theta} = (\hat{r}, \hat{s}, \hat{a})$, from the iterative estimation procedure].
(7.3.29)   $Co = \beta_0 + \beta_1 x + \beta_2 y + \beta_3 xy + \beta_4 x^2 + \beta_5 y^2 + \epsilon$

The Cobalt data for this example is in the JMP file, Cobalt_1.JMP. Before proceeding, it is worth noticing from this data that the coordinate locations are in feet, so that
16 Here the equality in (7.3.28) is implicitly taken to be "approximately equal" in the sense defined by the mutual consistency condition in the iterative estimation procedure above.
their values are quite large. For example, the first point is $(x_1, y_1) = (651612,\ 566520)$. More importantly, when one forms a quadratic function, the orders of magnitude of these values are squared. So, for example, the cross-product term in (7.3.29) is $x_1 y_1 \approx 3.69 \times 10^{11}$. Since the cobalt magnitudes are drastically smaller (in this case, $Co_1 = 36$), it should be clear that some of the beta slope coefficients in (7.3.29) will be vanishingly small (roughly of order $10^{-8}$). Such values are so close to zero that they are awkward to analyze. More importantly, since the intercept column is by definition a data vector of ones, $1_n = (1,..,1)'$, this column in the data matrix, $X$, is vanishingly small compared to other data columns, like $xy$. This can create numerical instabilities in the regression itself.17 So before beginning the present analysis, it is advisable to rescale the coordinate data to a more reasonable range. In the present case, we have divided all coordinate values by 10,000, so that terms like the cross product above now have more tractable values ($x_1 y_1 \approx 3691.5$). With these values, the OLS regression in (7.3.29) yields the following results (where xx denotes $x^2$, and so on):
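In MATLAB, this rescaling and the construction of the quadratic trend terms can be sketched as follows (assuming, as in the MATLAB discussion below, that the raw coordinates are stored row-wise in the matrix L0):

   % Rescale coordinates and form the quadratic trend terms of (7.3.29)
   L  = L0/10000;                   % rescale coordinates to a tractable range
   x  = L(:,1);  y = L(:,2);
   Xq = [x, y, x.*y, x.^2, y.^2];   % full quadratic specification of (7.3.29)

(The insignificant y column is dropped below, leaving the matrix X0 = [x, xy, xx, yy] used in the geo-regression.)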
17 Software such as JMP is usually sophisticated enough to employ internal rescaling procedures to avoid such obvious instabilities. But this is not true of all regression software.
Notice that $y$ is not significant, and that $y^2$ is only weakly significant. But since there are clear nonlinearities in the $y$ direction, this suggests that the collinearity between $y$ and $y^2$ in this region is masking the effect of $y^2$. If the insignificant $y$ variable is removed, then one obtains the new regression shown below.

Notice that $y^2$ is now very significant, and moreover, that the adjusted $R^2$ value has increased by removing $y$. This is a clear indication that the present model is capturing this spatial trend more accurately. Note finally that the coefficients on $x^2$ and $y^2$ have opposite signs. This is characteristic of hyperbolic paraboloids.18
However, there still remains the question of possible spatial dependencies among the unobserved residuals, $\epsilon$, in (7.3.29). We can check this in the usual way by regressing these residuals on their nearest-neighbor residuals. The results of this regression are shown below:

[Scatterplot: OLS residuals versus nearest-neighbor residuals (nn_res), with fitted regression line]
Here it is clear that there does indeed exist significant spatial dependency among these residuals. As discussed in Section 7.1.3, this can in turn inflate the significance levels of the beta estimates, and thus calls for a geo-regression of this data.
18 See for example http://mathworld.wolfram.com/HyperbolicParaboloid.html.
To do so, this cobalt data has been transported to MATLAB, and is found in the workspace, Cobalt_1.mat. Here the 176 locations are stored in the matrix, L0, with corresponding cobalt values in y0 and data [x, xy, xx, yy] in the matrix, X0. The geo-regression is run with the command,

>> geo_regr(y0,X0,L0,vnames);

where vnames contains the variable names, and is constructed by a command of the form shown below.
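One plausible form of this construction (the exact command is an assumption here, since only its purpose is described) is a character array of the four variable names:

   % Hypothetical construction of the variable-name array
   vnames = strvcat('x','xy','xx','yy');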
The actual regression portion of the screen output for this iterative estimation procedure
is as follows:
Notice first that the basic signs of all beta coefficients are the same, so that this new spatial trend is again a "saddle" shape. In fact, this is precisely the saddle shape plotted in Figure 7.2(b) above. But the main thing to notice is that all variables are now less significant than they were under OLS. In particular, $x^2$ is no longer even weakly significant. However, the relative ordering among the p-values (as seen more clearly from the absolute t-values) is essentially the same. So there appears to have been a fairly uniform deflation of the OLS significance levels. While this will certainly not always be true, in the present case it suggests that the spatial dependencies in these OLS residuals are relatively isotropic (i.e., the same in the x and y directions), and hence are consistent with the covariance stationarity assumption underlying geo-regression.
Before interpreting these results, it is important to check to see whether this geo-
regression has in fact removed the spatial dependencies among residuals. Here it is
important to stress that this cannot be done by simply examining the residuals of the geo-
regression. Indeed, these residuals exhibit precisely the spatial covariance structure estimated by the geo-regression, as displayed in Figure 7.4 below:

[Figure 7.4. Covariogram plot of the geo-regression residuals, with fitted spherical variogram-covariogram parameters: Range = 16265.94, Sill = 73.25, Nugget = 41.21]
So the task remaining is to remove this estimated spatial covariance structure and determine whether any spatial dependencies remain. This can be accomplished by recalling that every GLS model can be reduced to an equivalent OLS model by the Cholesky procedure in (7.1.15) through (7.1.20) above. By way of review, let us now write the appropriate GLS model as

(7.3.30)   $Y = X\beta + \epsilon\,, \quad \epsilon \sim N(0, V)$

where in this case, $Y$ is the random vector of $n = 176$ cobalt levels, $X$ is the $(n \times 4)$ matrix of coordinate variables (labeled as X0 above), and $\epsilon$ is the spatially dependent residual vector with unknown covariance matrix, $V$. As in (7.1.15), if $T$ denotes the Cholesky matrix for $V$, so that $V = TT'$, then as in (7.1.16) and (7.1.17), if we multiply both sides of (7.3.30) by $T^{-1}$, and let $Y_T = T^{-1}Y$, $X_T = T^{-1}X$, and $\epsilon_T = T^{-1}\epsilon$, then we obtain a new linear model,

(7.3.31)   $Y_T = X_T\beta + \epsilon_T\,, \quad \epsilon_T \sim N(0, V_T)$

where $\beta$ is exactly the same as in (7.3.30), but where the argument in (7.1.19) now shows that the covariance matrix, $V_T$, is simply the identity matrix, i.e.,

(7.3.32)   $V_T = T^{-1}V\,(T^{-1})' = T^{-1}(TT')(T^{-1})' = I_n$

In particular, this implies that the components of the transformed residual vector, $\epsilon_T$, are independent. Of course, the true covariance matrix, $V$, and its Cholesky matrix, $T$, are unknown. But if the geo-regression above was successful, then the covariogram estimate in Figure 7.4 should generate a reasonably good estimate, $\hat{V}$, of this covariance matrix [by the same procedure as in (7.2.25) through (7.2.27) above]. If so, then by letting $\hat{T}$
denote the Cholesky matrix for $\hat{V}$, we can use this to transform the given data into an OLS regression problem. In particular, if $[y, X]$ denotes the given cobalt and coordinate data (represented by [y0,X0] above), then the transformed data for the present case is given by

(7.3.33)   $\hat{y}_T = \hat{T}^{-1} y\,, \quad \hat{X}_T = \hat{T}^{-1} X$
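In MATLAB, this transformation is again only a few lines (a sketch, assuming Vhat is the covariance matrix estimate built from the Figure 7.4 covariogram, and that the first column of X is the unit vector):

   % Whitening transformation, as in (7.3.33)
   That = chol(Vhat,'lower');   % Cholesky factor: Vhat = That*That'
   yT   = That\y;               % y_T = T^(-1) y
   XT   = That\X;               % X_T = T^(-1) X  (intercept column included)
   bT   = (XT'*XT)\(XT'*yT);    % no-intercept OLS; should match the GLS betas
   eT   = yT - XT*bT;           % residuals: approximately iid with unit variance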
Hence if the geo-regression above was successful, then this data should yield an OLS regression with approximately independent residuals. This can be checked by the nearest-neighbor regression procedure above, and provides a useful diagnostic for geo-regression. To do so, the transformed data in (7.3.33) is saved as part of the output of geo-regression. By examining the program description of geo_regr.m, it can be seen that the fifth component, OUT{5}, of the output cell structure, OUT, contains precisely the matrix $[\hat{y}_T, \hat{X}_T]$. This can be imported to JMP and run as a regression. In doing so, it is important to note that the first column of the data matrix, $X$, in (7.3.30) is necessarily the unit vector, $1_n$, corresponding to the intercept coefficient, $\beta_0$. But in $\hat{X}_T$ this is transformed to the vector, $\hat{T}^{-1}1_n$, which is not a unit vector. So if this regression were run in JMP without modification, then JMP would add a unit vector which is not present in (7.3.30). This means that JMP must be run using the "No Intercept" option (at the bottom of the Fit Model window).19 The results of this no-intercept regression must produce exactly the same beta estimates as the geo-regression output above (except for possible rounding errors in transporting the data from MATLAB). So this in itself is a good check to be sure that the data has been transported properly. The results of this nearest-neighbor residual regression are shown below:
[Scatterplot: transformed residuals (Residual Co*) versus nearest-neighbor residuals (nn_res*)]
19 We shall see this option used again in Section 4.1.1 of Part III.
Here it should be clear that the geo-regression above has indeed been successful in removing any trace of spatial dependencies among residuals. However, there is one additional check that is worth mentioning. Notice in (7.3.32) that these transformed residuals are not only independent, but in fact all have unit variance ($\sigma^2 = 1$), so that the associated standard deviation is also one ($\sigma = 1$). This means that the estimated standard deviation, $\hat{\sigma}$, known as the "Root Mean Squared Error", should be close to one. This value is reported in the regression output right under Adjusted $R^2$. In the present case, $\hat{\sigma} = 0.995$, which provides additional support for the success of this geo-regression.
By way of summary, this cobalt example provides a simple illustration of the use of geo-
regression. Here the objective has been simply to capture the overall shape of spatial
trends in this data. (A more substantive example will be given in the next section.) But
aside from the geo-regression procedure itself, this example serves to illustrate a number
of more general issues that are common to all spatial regressions. First notice from the
initial OLS regression itself that this spatial trend captures less than 20% of the overall
variation in this cobalt data (with an adjusted R 2 of 0.188). So even though a visual
inspection of Figure 7.2(a) suggests an overall “saddle” shape for these trends, the
present quadratic specification is at best only a rough approximation. Thus for purposes
of spatial prediction, it is vital that the residual structure be modeled in a careful way.
This is a further motivation for techniques like geo-regression.
From an even more general perspective, this example illustrates the fundamental problem of separating "trends" from "residuals". To what extent is the spatial pattern of cobalt values in Figure 7.2(a) the result of some underlying trend, and to what extent is it simply the result of correlations between cobalt values at nearby locations? If one were able to examine many "replications" of the underlying spatial process, then such separation would be a relatively simple matter. Indeed, if most replications produced similar "saddle-like" patterns, then this would suggest the presence of a dominant spatial trend along the lines that we have modeled. On the other hand, if such replications produced a wide variety of similarly correlated patterns (including "mountains" and "valleys" as well as "saddles"), then this would suggest the presence of a dominant covariance-stationary process, possibly even with a constant mean (as postulated in Ordinary Kriging, for example). But since direct replications are not possible, the best that one can do is to be aware of these problems, and to treat all model specifications with some degree of suspicion. To paraphrase the famous remark of George Box,20 "all models are wrong, but some are more useful than others".
20 See for example http://en.wikipedia.org/wiki/George_E._P._Box.
References 7 and 8 in the class reference material].21 The area around Venice Island in Italy is shown (schematically) in Figure 7.5 below.
[Figure 7.5. Schematic map of the Venice Lagoon, showing the Industrial Area (INDUSTRY) to the west, Venice Island (VENICE), and the well sites; scale 0 to 5 miles]
Venice Island (shown in red) lies in a shallow lagoon, and has been slowly sinking for
many decades. In 1973 there was a suspicion that the Puerto Marghera industrial area to
the west of Venice was contributing to this rate of sinking. The reason for this suspicion
can be seen from the schematic depiction of the groundwater structure underlying the
Venice Lagoon shown in Figure 7.6 below.
[Figure 7.6. Schematic cross-section of the groundwater structure beneath the Venice Lagoon: aquifers separated by aquitards, with an industrial well shaft]
21 This paper also contains an excellent overview of Kriging methods, as well as the groundwater problem in Venice.
Here the blue bands denote porous water-filled layers of soil called aquifers that are
separated by denser layers called aquitards. Industry consumes water by drilling wells
into the aquifer layers (as depicted by the red shaft in the figure). This lowered the level
of the water table, potentially contributing to the sinking of Venice. Thus the question in
1973 was whether or not this industrial draw-down of water was a significant factor in
the sinking of Venice.
Geo-Regression Model

To study this question, data was gathered on water table levels, $L_i$, from 40 bore-hole sites, $i = 1,..,40$, in existing wells throughout the Venice Lagoon area (shown by the dots in Figure 7.5 above, with colors ranging from red to blue denoting higher to lower levels). [This data, along with the coordinate locations of the well sites, can be found in the $(40 \times 3)$ matrix, venice, in the workspace, venice.mat.] The objective of this study was to identify the key factors influencing these water table levels by applying geo-regression methods. Here it was hypothesized that the key factors influencing the water table level, $L(s)$, at any location, $s = (s_1, s_2)$, were the elevation, $Ev(s)$, above sea level at $s$, together with local draw-down effects both from industry, $D_I(s)$, and from local water consumption, $D_V(s)$, in Venice itself. To model $D_I$, a convenient coordinate system was chosen, with origin centered in the Industrial Area, as shown in Figure 7.7 below.
[Figure 7.7. Coordinate system $(c_1, c_2)$ with origin centered in the Industrial Area; scale 0 to 5 miles]
The orientation of these axes is designed to simplify the model representation of both elevation and industrial draw-down effects. Starting with the industrial draw-down function, $D_I$, this can be essentially approximated by a decreasing function with elliptical contours centered on the axes. The equation used here is the following:22

[Equation (7.3.35): the industrial draw-down function, $D_I(s_1, s_2)$]

A similar draw-down function, $D_V$, was constructed for Venice Island, and has the following form:
(7.3.36)   $D_V(s) = D_V(s_1, s_2) = \exp\!\left\{-\left[\dfrac{(s_1 - 560)^2 + (s_2 - 390)^2}{35}\right]^{8}\right\}$
Here the large exponent, $(\cdot)^8$, is designed to drive this function to zero outside of Venice Island, where local water consumption has little effect. The procedure for calculating these functions (as well as the elevation function below) can be found in the MATLAB script, venice_funcs.m. The resulting contours of these two functions are shown in Figures 7.8 and 7.9 below.
[Figures 7.8 and 7.9. Contour plots of the industrial draw-down function, $D_I$, and the Venice draw-down function, $D_V$, respectively, overlaid on the 40 well sites]
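For reference, a one-line MATLAB version of the Venice draw-down function in (7.3.36) (as reconstructed above; the actual implementation is in venice_funcs.m) might read:

   % Venice Island draw-down function, as in (7.3.36)
   DV = @(s1,s2) exp(-(((s1-560).^2 + (s2-390).^2)/35).^8);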
As mentioned above, there is a third effect that cannot be overlooked, namely elevation.
Though detailed data on elevation was not available in this data set, the elevation
contours are roughly parallel to the $c_2$ axis in Figure 7.7, with elevations increasing more
22 The actual functions used in Gambolati and Volpi (1979) are based on more complex hydrological models. So the present simplified functions are for illustrative purposes only.
rapidly to the west. So the following simple (local) approximation to the elevation, $Ev(s)$, at locations, $s$, was adopted:23

[Equation (7.3.37): the elevation function, $Ev(s)$]

If the data sites (well locations) are denoted by $s_i = (s_{i1}, s_{i2})$, $i = 1,..,40$, and if the computed values of the above functions at these locations are denoted by $(D_{Ii}, D_{Vi}, Ev_i) = [D_I(s_i), D_V(s_i), Ev(s_i)]$, $i = 1,..,40$, then these values can now serve as the explanatory variables in a linear regression model of this water table data as follows:

(7.3.38)   $L_i = \beta_0 + \beta_1 D_{Ii} + \beta_2 D_{Vi} + \beta_3 Ev_i + \epsilon_i\,, \quad i = 1,..,40$
As with the Cobalt example above, this model was run using both OLS and the iterative Geostatistical Regression procedure implemented in geo_regr.m, with the command

>> geo_regr(y0,X0,L0,vnames);

where y0 is the $L$ data, X0 the computed $(D_I, D_V, Ev)$ data, and L0 the coordinate data at each of the 40 well sites. A comparison of the parameter estimates and significance levels is shown in Tables 7.6 and 7.7 below:
Note that as in the Cobalt case above, the signs of all coefficients are consistent in both
procedures, but the t-ratios are generally lower (in absolute magnitude) for GLS. Notice
however that the Venice drawdown effect provides an exception to this rule, and shows
that significance levels need not always be higher for OLS. As a final consistency check,
note that the signs of these coefficients are as expected, namely that mean water table
levels rise with higher elevations and that greater levels of water drawdown lower the
mean water table level.
23 This approximation produces a maximum elevation of about 30 meters at the western edge of the Industrial Area, where the water table level is about 7 meters.
Rather than repeat the nearest-neighbor residual analysis done for the Cobalt case, it is of interest to consider a different approach here. In particular, one can compare the (spherical) covariogram for the original OLS residuals with that of the residuals from the final transformed model in expressions (7.3.31) and (7.3.32) above. If the procedure has been successful, then the final covariogram should be much closer to pure independence. But it is important to note here that since the transformed data is quite different from that of the original model, there is a problem in comparing these residual covariograms directly. In fact, this provides us with an important case where it is more appropriate to compare the correlograms derived from these covariograms, as defined in expression (3.3.13) above.
These correlograms are free from any dimensional restrictions, and hence are directly comparable. In particular, since $\rho(0) = 1$ for all correlograms, their scales must be identical. This allows one to focus entirely on their relative shapes. In the present case, the original correlogram and the final correlogram of the transformed data are shown in Figures 7.10 and 7.11, respectively. Notice first that in the original correlogram the relative nugget effect (defined in Section 4.5 above) is zero, indicating that this process exhibits no spatial independence whatsoever. In contrast, the relative nugget effect in the final correlogram is close to one, indicating that the process is now almost completely spatially independent. In other words, very little spatial correlation remains in this transformed data. Notice also that the fluctuation of nonzero correlation values is much smaller, indicating that spatial correlations are uniformly closer to zero at all scales.24 These two observations provide convincing evidence that this geo-regression has indeed been successful in accounting for almost all spatial correlation in the original OLS model.
[Figures 7.10 and 7.11. Correlograms of the original OLS residuals and of the final transformed residuals, respectively]
Given these preliminary findings, recall that the main purpose of this model is to analyze the impact of industrial water drawdown on the water table level in Venice. To estimate this impact, observe first from the geo-regression results above that we can
24 This is due in part to the larger bin sizes used in this figure (50 rather than 30 points per bin).
obtain an upper 95% confidence bound on the beta coefficient, $\beta_I$, for $D_I$ in model (7.3.38) as follows. First note that if the standard error of $\hat{\beta}_I$ is denoted by $s_I$, then for any level of significance, $\alpha$, the $100(1-\alpha)\%$ upper confidence bound for $\beta_I$ can be obtained from the probability identity,

(7.3.39)   $\Pr\bigl(\beta_I \le \hat{\beta}_I + t_{\alpha,\, n-(k+1)}\, s_I\bigr) = 1 - \alpha$

where $t_{\alpha,\, n-(k+1)}$ is the t-critical value at level $\alpha$ for degrees of freedom, $n - (k+1)$ [where $n$ = sample size and $k$ = number of explanatory variables]. To obtain the desired standard error, recall that by definition the t-ratio, $t_I$, for $\hat{\beta}_I$ in Table 7.7 is given by $t_I = \hat{\beta}_I / s_I$, so that $s_I = \hat{\beta}_I / t_I$.
Next observe that for the representative location, $s = (s_1, s_2) = (555, 390)$, in the middle of Venice Island (shown by the red dot in Figure 7.9 above), the transformed coordinates in (7.3.34) are seen to be $(c_1, c_2) = (-1.572,\ 0.075)$, so that the value of the industrial drawdown in (7.3.35) is given by:
Thus, for each additional meter of industrial water drawdown, one can be 95% confident that the expected decrease in the water table level at location $s$ will be bounded below by the corresponding confidence bound computed above. In particular, based on the above model, one can be 95% confident that the mean industrial drawdown effect on Venice Island is at least 15%.
While this model is only a rough approximation to the analysis of Gambolati and Volpi
(1979),25 it serves to illustrate how geo-regression can actually be used to address
substantive spatial issues. According to these authors, water pumping in Puerto Marghera
25 Aside from their more elaborate drawdown functions, Gambolati and Volpi also used a universal kriging approach rather than our present application of geo-regression.
was in fact reduced by 60% after 1973, and their subsequent analysis of 1977 data
showed that the “subsurface flow field had substantially recovered, and the land
settlement had been arrested”. So their post-analysis confirmed that this industrial water
drawdown was indeed a major contributing factor to the sinking of Venice. Of course, in
more recent times, Venice has once again started to sink from more natural causes. But
this is another story.
An Application of Geo-Kriging

To construct kriging predictions over the study area, a grid of prediction locations was first generated with coordinate increments

(7.3.44)  $s_1 = [150:25:900]\,, \qquad s_2 = [200:25:650]$

(where the cell size, 25, is roughly a third of a mile in terms of Figure 7.5). This grid was then used as input to the program, geo_krige.m, with a command in which (y0,X0,L0) is the same as for geo_regr above, and in which (X1,L1) are the computed values of $(D_I, D_V, E_v)$ and coordinate values at each of the 589 grid points from (7.3.44). Finally, the bandwidth used was h = 50 (around two thirds of a mile).
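For reference, a minimal MATLAB sketch of this grid construction is given below. The call to geo_krige.m is shown only schematically (commented out), since its exact argument list is specific to the class programs.

% Construct the prediction grid in (7.3.44):
s1 = 150:25:900;                 % 31 coordinate values
s2 = 200:25:650;                 % 19 coordinate values
[S1, S2] = meshgrid(s1, s2);
L1 = [S1(:), S2(:)];             % 589 x 2 matrix of grid coordinates

% Schematic call (assumed argument order, following geo_regr above):
% OUT = geo_krige(y0, X0, L0, X1, L1, 50);   % bandwidth h = 50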
The resulting kriging values for all locations, s, were then assembled from the above output
and exported from MATLAB to ARCMAP. These values were then interpolated using
the Spline option in ArcToolbox, with contour displays then applied to the spline rasters. A comparison of the fitted values, $\hat{L}$, and kriging values, $\hat{Y}$, is shown in Figures 7.12 and 7.13 below:
[Figure 7.12. Geo-Regression $\hat{L}$ Values    Figure 7.13. Geo-Kriging $\hat{Y}$ Values]
Notice that the $\hat{L}$ values are essentially a weighted combination of the drawdown effects, $D_I$ and $D_V$, in Figures 7.8 and 7.9 respectively (as captured by their values at the 40 well-site data points). The kriged values, $\hat{Y}$, also reflect these underlying drawdown effects,
but to a lesser extent. By construction, these values also include stochastic interpolations
of the regression residuals, and thus should reflect water table levels more accurately than
the simpler regression predictions. Note however that alternative models of drawdown
functions and fitting procedures will of course produce somewhat different results, as can
be seen by comparing Figure 7.13 with Figure 5.21(a) in [BG, p.199] and Figure 2(a) in
Part 2 of Gambolati and Volpi (1979, p.292).
Finally, the main advantage of this stochastic interpolation procedure is that it allows
prediction intervals to be constructed for actual water table levels in terms of estimated
standard errors of prediction. A plot of these standard errors around Venice Island is
shown in Figure 7.14 below (with the 0.4 and 0.7 contours labeled to indicate
representative values). A much finer grid of kriging locations was used here (with increments of about a tenth of a mile) in order to show the details of these standard error contours.
[Figure 7.14. Kriging standard errors around Venice Island, with the 0.4 and 0.7 contours labeled.]
Notice in particular that these standard errors fall to zero at each of the five data points (well sites) on Venice Island [in a manner similar to Figure 2(b) in Part 2 of Gambolati and Volpi (1979), though Venice Island itself is rather difficult to see in their figure]. This reflects the fact that geo-kriging (along with simple and ordinary kriging) is an exact interpolator that goes through every data point. This can be seen most easily from expression (7.2.17) above, together with the fact that if point $s_0$ is actually a data point, then it must always be a member of its own prediction set, $S(s_0)$, and hence the covariance vector, $c_0$, must correspond to one of the columns of the covariance matrix, $V_0$. But since $V_0^{-1}V_0 = I_{n_0} = (e_1,..,e_{n_0})$, it follows that if $c_0$ is the i-th column of $V_0$, then $c_0'V_0^{-1} = e_i'$, so that (7.2.17) reduces to the observed value, $\hat{Y}(s_0) = Y(s_0)$.
This same argument also shows that the kriging standard error in (7.2.22) is identically
zero.
Finally, it is of interest to consider the kriged values on Venice Island. Though the
specific kriging contour values are not shown in Figure 7.13, these values yield water
table predictions of around $\hat{Y}(s_0) = 3.0$ for points, $s_0$, on Venice Island (i.e., about 3 meters below sea level). Moreover, while not all standard error contours are shown in Figure 7.14, the 0.7 contour is roughly the average value, so that $\hat{\sigma}_0 = \hat{\sigma}(s_0) \approx 0.7$. Thus a typical 95% prediction interval for points $s_0$ on Venice is about $\hat{Y}(s_0) \pm 1.96\,\hat{\sigma}_0 \approx 3.0 \pm 1.4$. While such intervals are not extremely sharp, one must take into account the fact that only 5 of the 40 data points are actually on Venice Island. So this is probably about the best that can be expected from such a small data set.
APPENDIX TO PART II
This Appendix, designated as A2, contains additional analytical results for Part II of the
NOTEBOOK, and follows the notational conventions in Appendix A1.
First recall that the covariance of any random variables, $Z_1$ and $Z_2$, with respective means, $\mu_1$ and $\mu_2$, is given by $\operatorname{cov}(Z_1, Z_2) = E(Z_1Z_2) - \mu_1\mu_2$. Hence if a given spatial process is the sum of two independent processes, i.e.,

(A2.1.3)  $Y(s) = Y_1(s) + Y_2(s)\,, \quad s \in R$,

with respective means, $\mu_1$ and $\mu_2$, then it follows by definition that $\mu = \mu_1 + \mu_2$, and that $Y_1(s)$ and $Y_2(v)$ are independent for all $s, v \in R$. Hence for any $h \geq 0$ and $s, v \in R$ with $\|s - v\| = h$, we see that the covariogram, $C$, of the $Y$-process must satisfy,

$C(h) = E[Y(s)Y(v)] - \mu^2 = E\big\{[Y_1(s) + Y_2(s)][Y_1(v) + Y_2(v)]\big\} - (\mu_1^2 + 2\mu_1\mu_2 + \mu_2^2)$
$\phantom{C(h)} = E[Y_1(s)Y_1(v)] + E[Y_1(s)]E[Y_2(v)] + E[Y_2(s)]E[Y_1(v)] + E[Y_2(s)Y_2(v)] - (\mu_1^2 + 2\mu_1\mu_2 + \mu_2^2)$
$\phantom{C(h)} = \big\{E[Y_1(s)Y_1(v)] - \mu_1^2\big\} + \big\{E[Y_2(s)Y_2(v)] - \mu_2^2\big\} = C_1(h) + C_2(h)$

where $C_1$ and $C_2$ are the respective covariograms for the $Y_1$ and $Y_2$ components of $Y$.
Turning next to the bias properties of sample covariances, recall that for samples $\{(Y_{1i}, Y_{2i}): i = 1,..,n\}$ the sample covariance is given by

(A2.2.1)  $\hat{\sigma}_{12} = \frac{1}{n-1}\sum_{i=1}^{n}(Y_{1i} - \bar{Y}_1)(Y_{2i} - \bar{Y}_2)$

Letting $\tilde{\sigma}_{12} = \frac{n-1}{n}\hat{\sigma}_{12}$ denote the corresponding normalized sum, and expanding the product terms, we obtain

(A2.2.3)  $\tilde{\sigma}_{12} = \frac{1}{n}\sum_{i=1}^{n}\big(Y_{1i}Y_{2i} - Y_{1i}\bar{Y}_2 - \bar{Y}_1Y_{2i} + \bar{Y}_1\bar{Y}_2\big)$
$= \frac{1}{n}\sum_{i=1}^{n}Y_{1i}Y_{2i} - \bar{Y}_2\Big(\frac{1}{n}\sum_{i=1}^{n}Y_{1i}\Big) - \bar{Y}_1\Big(\frac{1}{n}\sum_{i=1}^{n}Y_{2i}\Big) + \frac{n}{n}\,\bar{Y}_1\bar{Y}_2$
$= \frac{1}{n}\sum_{i=1}^{n}Y_{1i}Y_{2i} - \bar{Y}_1\bar{Y}_2 - \bar{Y}_1\bar{Y}_2 + \bar{Y}_1\bar{Y}_2 = \frac{1}{n}\sum_{i=1}^{n}Y_{1i}Y_{2i} - \bar{Y}_1\bar{Y}_2$

But since

(A2.2.4)  $\bar{Y}_1\bar{Y}_2 = \Big(\frac{1}{n}\sum_{i=1}^{n}Y_{1i}\Big)\Big(\frac{1}{n}\sum_{i=1}^{n}Y_{2i}\Big) = \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}Y_{1i}Y_{2j}$

it follows by taking expectations that

$E(\tilde{\sigma}_{12}) = \frac{1}{n}\sum_{i=1}^{n}E(Y_{1i}Y_{2i}) - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}E(Y_{1i}Y_{2j})$
$= \frac{1}{n}\sum_{i=1}^{n}E(Y_{1i}Y_{2i}) - \frac{1}{n^2}\Big[\sum_{i=1}^{n}E(Y_{1i}Y_{2i}) + \sum_{i=1}^{n}\sum_{j\neq i}E(Y_{1i}Y_{2j})\Big]$
$= \frac{n-1}{n^2}\sum_{i=1}^{n}E(Y_{1i}Y_{2i}) - \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j\neq i}E(Y_{1i}Y_{2j})$

Hence, by recalling that $\hat{\sigma}_{12} = \frac{n}{n-1}\tilde{\sigma}_{12}$, and that $E(Y_{1i}Y_{2i}) = \sigma_{12} + \mu_1\mu_2$ and $E(Y_{1i}Y_{2j}) = \operatorname{cov}(Y_{1i},Y_{2j}) + \mu_1\mu_2$ for $j \neq i$, we see that

$E(\hat{\sigma}_{12}) = \frac{n}{n-1}E(\tilde{\sigma}_{12}) = \sigma_{12} - \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}\operatorname{cov}(Y_{1i}, Y_{2j})$

In particular, for the special case in which $Y_1 = Y_2 = Y$, so that $\sigma_{12} = \sigma^2$, this reduces to

(A2.2.7)  $E(\hat{\sigma}^2) = \sigma^2 - \frac{1}{n(n-1)}\sum_{i=1}^{n}\sum_{j\neq i}\operatorname{cov}(Y_i, Y_j)$
Here it suffices to consider the variogram, $\gamma(h)$, on the interval of distance values, $d_{k-1} \leq h < d_k$, for a typical bin, $k$. Recall from (4.7.1) that for a given sample of values, $\{Y(s_i): i = 1,..,n\}$, if $N_k$ denotes the set of distance pairs, $(s_i, s_j)$, in bin $k$, and if the distance between each such pair is denoted by $h_{ij} = \|s_i - s_j\|$, then the lag distance, $h_k$, for bin $k$ is defined to be

(A2.3.1)  $h_k = \frac{1}{|N_k|}\sum_{(s_i, s_j)\in N_k} h_{ij}$
Recall also that if the $\varepsilon_k$-linear approximation to $\gamma(h)$ on this interval is denoted by

(A2.3.2)  $l_k(h) = a_k h + b_k$

then by definition,

(A2.3.3)  $|\gamma(h) - l_k(h)| \leq \varepsilon_k\,, \quad h \in [d_{k-1}, d_k)$

In this context we have the following bound on the bias of the empirical variogram estimates,

(A2.3.4)  $\hat{\gamma}(h_k) = \frac{1}{2|N_k|}\sum_{(s_i, s_j)\in N_k}\big[Y(s_i) - Y(s_j)\big]^2$

at lag distance, $h_k$:

Proposition A2.1. If for any bin, $k$, the true variogram, $\gamma(h)$, has an $\varepsilon_k$-linear approximation on $[d_{k-1}, d_k)$, then at lag distance, $h_k$, it must be true that

(A2.3.5)  $\big|E[\hat{\gamma}(h_k)] - \gamma(h_k)\big| \leq 2\varepsilon_k$

To establish this result, recall first that the variogram values at each pair distance are by definition

(A2.3.6)  $\gamma_{ij} = \gamma(h_{ij}) = \tfrac{1}{2}E\big\{[Y(s_i) - Y(s_j)]^2\big\}$

so that by (A2.3.4),

(A2.3.7)  $E[\hat{\gamma}(h_k)] = E\Big\{\frac{1}{2|N_k|}\sum_{(s_i, s_j)\in N_k}[Y(s_i) - Y(s_j)]^2\Big\} = \frac{1}{|N_k|}\sum_{(s_i, s_j)\in N_k}\tfrac{1}{2}E\big\{[Y(s_i) - Y(s_j)]^2\big\} = \frac{1}{|N_k|}\sum_{(s_i, s_j)\in N_k}\gamma_{ij}$

But since $h_{ij} \in [d_{k-1}, d_k)$ for all $(s_i, s_j) \in N_k$, we see from (A2.3.3) that $|\gamma_{ij} - l_k(h_{ij})| \leq \varepsilon_k$, and thus that

(A2.3.8)  $-\varepsilon_k \leq \gamma_{ij} - l_k(h_{ij}) \leq \varepsilon_k\,, \quad (s_i, s_j) \in N_k$
Hence by summing this set of inequalities and taking averages [with the observation that $(1/|N_k|)\sum_{(s_i, s_j)\in N_k}\varepsilon_k = (|N_k|/|N_k|)\,\varepsilon_k = \varepsilon_k$], we have

(A2.3.9)  $-\varepsilon_k \leq \frac{1}{|N_k|}\sum_{(s_i, s_j)\in N_k}[\gamma_{ij} - l_k(h_{ij})] \leq \varepsilon_k$

Next, by using (A2.3.1), (A2.3.2) and (A2.3.7), the middle expression of (A2.3.9) can be rewritten as,

(A2.3.10)  $\frac{1}{|N_k|}\sum_{(s_i, s_j)\in N_k}[\gamma_{ij} - l_k(h_{ij})] = \frac{1}{|N_k|}\sum_{(s_i, s_j)\in N_k}\gamma_{ij} - \frac{1}{|N_k|}\sum_{(s_i, s_j)\in N_k}l_k(h_{ij})$
$= E[\hat{\gamma}(h_k)] - \frac{1}{|N_k|}\sum_{(s_i, s_j)\in N_k}[a_k h_{ij} + b_k] = E[\hat{\gamma}(h_k)] - a_k\Big(\frac{1}{|N_k|}\sum_{(s_i, s_j)\in N_k}h_{ij}\Big) - b_k$
$= E[\hat{\gamma}(h_k)] - (a_k h_k + b_k) = E[\hat{\gamma}(h_k)] - l_k(h_k)$

so that (A2.3.9) yields $\big|E[\hat{\gamma}(h_k)] - l_k(h_k)\big| \leq \varepsilon_k$. But since $h_k \in [d_{k-1}, d_k)$, it also follows from (A2.3.3) that $|l_k(h_k) - \gamma(h_k)| \leq \varepsilon_k$, and hence that

$\big|E[\hat{\gamma}(h_k)] - \gamma(h_k)\big| \leq \big|E[\hat{\gamma}(h_k)] - l_k(h_k)\big| + \big|l_k(h_k) - \gamma(h_k)\big| \leq 2\varepsilon_k$

which establishes (A2.3.5).
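As a computational companion to (A2.3.1) and (A2.3.4), the following MATLAB sketch computes lag distances and empirical variogram values for a given set of bin break points. The data here are hypothetical, and empty bins would yield NaN values.

% Hypothetical data: site coordinates, s (n x 2), and values, Y (n x 1):
s = 10*rand(50,2);  Y = randn(50,1);
d = 0:1:10;                                    % bin break points d_0 < ... < d_kmax
n = size(s,1);
[I,J]  = find(triu(true(n),1));                % all distinct pairs (i < j)
hij    = sqrt(sum((s(I,:) - s(J,:)).^2, 2));   % pair distances, h_ij
sqdiff = (Y(I) - Y(J)).^2;                     % squared value differences
hk   = zeros(1, length(d)-1);
gamk = zeros(1, length(d)-1);
for k = 1:length(d)-1
    in      = (hij >= d(k)) & (hij < d(k+1));  % pairs (s_i, s_j) in bin k
    hk(k)   = mean(hij(in));                   % lag distance, h_k, as in (A2.3.1)
    gamk(k) = sum(sqdiff(in)) / (2*sum(in));   % empirical variogram, as in (A2.3.4)
end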
In order to understand multidimensional analysis, one must begin with vector geometry. In particular, all matrix manipulations are interpretable geometrically. If for any vector, $x = (x_1,..,x_n)' \in \mathbb{R}^n$, we denote the (Euclidean) length of $x$ by

(A2.4.1)  $\|x\| = \sqrt{x'x} = \sqrt{\textstyle\sum_{i=1}^n x_i^2}$

then for any two vectors, $x = (x_1,..,x_n)'$, $y = (y_1,..,y_n)' \in \mathbb{R}^n$, the distance between $x$ and $y$ is just the length of the vector $x - y = (x_1 - y_1,..,x_n - y_n)' \in \mathbb{R}^n$, i.e.,

(A2.4.2)  $\|x - y\| = \sqrt{(x-y)'(x-y)} = \sqrt{\textstyle\sum_{i=1}^n (x_i - y_i)^2}$

This is illustrated for two dimensions ($\mathbb{R}^2$) in Figures A2.1 and A2.2 below.
[Figure A2.1. Vector lengths, $\|x\|$ and $\|y\|$    Figure A2.2. Distance, $\|y - x\|$, between vectors $x$ and $y$]
These distances in turn define the angles that complete the geometry of Euclidean spaces, $\mathbb{R}^n$. All that is really required here is the notion of orthogonal vectors, which constitute the sides of a right triangle, as shown for $\mathbb{R}^2$ in Figure A2.2. Recall from the Pythagorean Theorem that such triangles are characterized by the familiar identity that the square of the hypotenuse equals the sum of the squares of the sides, i.e.,

(A2.4.3)  $\|x - y\|^2 = \|x\|^2 + \|y\|^2$

Hence if we now write this orthogonality relation as, $x \perp y$, then by expanding $\|x - y\|^2 = \|x\|^2 - 2x'y + \|y\|^2$ in terms of the notation above, this identity is seen to imply that

(A2.4.4)  $x \perp y \iff x'y = 0$
Hence we are led to the fundamental geometric relation that orthogonality between
vectors is equivalent to zero inner products. This essentially defines vector geometry in
Euclidean spaces. (A somewhat sharper derivation of this result is given in terms of
cosines in a later section.)
[Figure A2.3. The derivative of $f$ at $x_0$ as the limiting slope of chords through $[x_0, f(x_0)]$]

In particular, for any smooth function, $f(x)$, and small increments, $\Delta$, the derivative of $f$ at $x_0$ is defined by the limiting slope,1

(A2.5.1)  $\frac{d}{dx}f(x_0) = \lim_{\Delta\to 0}\frac{f(x_0 + \Delta) - f(x_0)}{\Delta}$

The example in Figure A2.3 is a simple parabolic function, $f(x) = x^2$, for which the derivative is given explicitly by

(A2.5.2)  $\frac{d}{dx}f(x_0) = \lim_{\Delta\to 0}\frac{(x_0 + \Delta)^2 - x_0^2}{\Delta} = \lim_{\Delta\to 0}\frac{x_0^2 + 2x_0\Delta + \Delta^2 - x_0^2}{\Delta} = \lim_{\Delta\to 0}(2x_0 + \Delta) = 2x_0$
Such limiting slope values cannot usually be obtained so easily. But this case serves to
illustrate the basic idea.
1 In Figure A2.3 we have implicitly assumed that increments are positive ($\Delta > 0$). But for smooth functions, the same limiting slope results for negative increments as well.
From a geometric viewpoint, this limiting slope defines the unique tangent line to f at
x0 (shown in red in Figure A2.3). More importantly, the linear function defined by this
line yields the best linear approximation to function f in small intervals around x0
(since by construction it has the same value and slope as f at x0 ).
[Figure A2.4. The surface, $z = f(x_1, x_2)$, above the point $x_0 = (x_{01}, x_{02})$]
In higher dimensions, partial derivatives are defined in precisely the same way, by varying one coordinate at a time. For the example function, $f(x_1, x_2) = 2x_1^2 + x_2^2$, shown in Figure A2.4, the partial derivative of $f$ with respect to $x_1$ at $x_0 = (x_{01}, x_{02})$ is given explicitly by

$\frac{\partial f(x_0)}{\partial x_1} = \lim_{\Delta_1\to 0}\frac{\big[2(x_{01}^2 + 2x_{01}\Delta_1 + \Delta_1^2) + x_{02}^2\big] - \big[2x_{01}^2 + x_{02}^2\big]}{\Delta_1} = \lim_{\Delta_1\to 0}(4x_{01} + 2\Delta_1) = 4x_{01}$
These partial derivatives can in turn be used to define differential changes in any direction. The key point to note is that for smooth functions, $f(x) = f(x_1,..,x_n)$, in higher dimensions, the unique tangent line defining the scalar derivative in Figure A2.3 is replaced by a unique tangent plane. This is again illustrated by the two-dimensional function,2 $f(x) = f(x_1, x_2)$, shown in Figure A2.5 below:
[Figure A2.5. The tangent plane to $f$ at $z_0$, together with the gradient arrow, $\nabla f(x_0)$, in the $(x_1, x_2)$-plane]
As in the scalar case, the plane tangent to $f$ at a given point, $x_0 = (x_{01},..,x_{0n})'$, is essentially the "best linear approximation" to $f$ in small neighborhoods of $x_0$. In geometric terms, this tangent plane is more accurately described as the n-dimensional (hyper)plane tangent to the surface (or graph) of $f$ at the point $[x_0, f(x_0)] \in \mathbb{R}^{n+1}$, as illustrated by the 2-dimensional plane tangent to $f$ at $z_0 = [x_0, f(x_0)] \in \mathbb{R}^3$ in the figure (where the "red arrows" can be ignored for the moment).
2 The actual function plotted is the quadratic function, $f(x) = f(x_1, x_2) = 10 - [2y_1^2 + y_1y_2 + y_2^2]$, with $y_i = x_i - 10$, $i = 1, 2$.
If we continue to focus on this two-dimensional case for the present, and consider any small change in $x_0$, say $x_0 \to x_0 + \Delta = (x_{01} + \Delta_1,\, x_{02} + \Delta_2)$, then the corresponding change in $f$, denoted by $\Delta f(x_0)$, is well approximated by the corresponding movement on this tangent plane. As we have already seen, movement in the $x_1$ direction (with $\Delta_2 = 0$) yields changes governed entirely by the partial derivative of $f$ with respect to $x_1$ at $x_0$. This can now be depicted graphically as in Figure A2.6 below, where for notational simplicity we have represented the partial derivative of $f$ with respect to $x_i$ at $x_0$ by $a_i = \partial f(x_0)/\partial x_i$, $i = 1, 2$. Here we have also shifted the origin up to the point, $z_0 = [x_0, f(x_0)]$, so that local movements away from $x_0$ can be represented simply by pairs $(\Delta_1, \Delta_2)$. [Note that the sizes of these shifts (relative to the "red arrow" from Figure A2.5) have been exaggerated for visual clarity.]
[Figure A2.6. Tangent-plane approximation, $\Delta f(x_0) \approx a_1\Delta_1 + a_2\Delta_2$, of changes in $f$ near $x_0$]

In particular, the total change in $f$ resulting from the shift $\Delta = (\Delta_1, \Delta_2)$ is then approximated by3

(A2.5.4)  $\Delta f(x_0) \;\approx\; a_1\Delta_1 + a_2\Delta_2 \;=\; \frac{\partial f(x_0)}{\partial x_1}\Delta_1 + \frac{\partial f(x_0)}{\partial x_2}\Delta_2$
Finally, if these $\Delta$-shifts are allowed to become "arbitrarily small", then we obtain the
limiting differential relation
3 Here the symbol, $\approx$, can be loosely read as "is approximately equal to".
(A2.5.5)  $df(x_0) = \frac{\partial f(x_0)}{\partial x_1}dx_1 + \frac{\partial f(x_0)}{\partial x_2}dx_2$

But for our present purposes, the key property of total derivatives is what they imply about partial derivatives in particular. Here we use some vector geometry by first writing the vector of differential elements in (A2.5.5) as $dx = (dx_1, dx_2)'$. In geometric terms, this can be viewed as a directional vector of small movements from any given point. Similarly, if we designate the vector of partial derivatives of $f$ at $x = (x_1, x_2)'$ as,

(A2.6.1)  $\nabla f(x) = \begin{pmatrix} \partial f(x)/\partial x_1 \\ \partial f(x)/\partial x_2 \end{pmatrix}$

then the total derivative in (A2.5.5) can be written as the inner product,

(A2.6.2)  $df(x_0) = \nabla f(x_0)'\,dx$

In particular, the directions, $dx$, of "no change" in $f$ at $x_0$ are precisely those satisfying

(A2.6.3)  $0 = df(x_0) = \nabla f(x_0)'\,dx$
Hence, by recalling (A2.4.4), we see that the key geometric consequence of this zero-inner-product condition is that the vector of partial derivatives, $\nabla f(x_0)$, must necessarily be orthogonal to the directions of no change in $f$. In Figure A2.5, $\nabla f(x_0)$ thus corresponds to the red arrow on the $(x_1, x_2)$-plane starting at $x_0$. Moreover, since its three-dimensional counterpart starting at $z_0$ on the tangent plane (in both Figures A2.5 and A2.6) is necessarily the steepest direction of movement on this plane, it is commonly designated as the direction of steepest ascent of $f$ at $x_0$.
Finally, while the $n = 2$ case is extremely useful for gaining geometric intuition, it should be emphasized that all relationships above are immediately extendable to general functions, $f(x) = f(x_1,..,x_n)$. In particular, if we let $dx = (dx_1,..,dx_n)'$ and define the general gradient vector at $x = (x_1,..,x_n)' \in \mathbb{R}^n$ by

(A2.6.4)  $\nabla f(x) = \begin{pmatrix} \partial f(x)/\partial x_1 \\ \vdots \\ \partial f(x)/\partial x_n \end{pmatrix}$

then the differential relations (A2.6.2) and (A2.6.3), together with their geometric interpretations, continue to hold without change.
Given these key geometric results, we can now consider optimization problems involving smooth multidimensional functions, $f(x) = f(x_1,..,x_n)$. These amount to finding points, $x$, in some specified region, $R \subseteq \mathbb{R}^n$, with either maximum or minimum values, $f(x)$, in $R$, depending on the given problem. Here it is important to emphasize that maximizing the function, $f(x)$, over $R$ is equivalent to minimizing the function, $-f(x)$, over $R$. For this reason, it suffices to consider only maximization problems (which are usually easier to depict graphically for the $n = 2$ case).4
4 Here it is also worth noting that optimization software (such as the MATLAB optimization toolbox) is typically designed to do only minimization problems. So all maximization problems must be reformulated as minimization problems.
5 An inflection point, x, for f is a point at which the second derivative of f changes sign.
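As a concrete illustration of this max-to-min reformulation, the following MATLAB sketch maximizes a smooth function by minimizing its negative with the base MATLAB minimizer, fminsearch. The particular function shown is illustrative only.

f    = @(x) -((x(1)-1)^2 + (x(2)+2)^2);   % f has a unique maximum at (1,-2)
negf = @(x) -f(x);                        % reformulated (minimization) objective
xmax = fminsearch(negf, [0; 0]);          % returns approximately (1,-2)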
eliminated by requiring that the second derivative be negative at all singular points, so that the unique maximum is always characterized by the zero-slope condition. This is precisely analogous to the one-dimensional kriging problem in Section 6.2.1 of the text. Here a global minimum was ensured for the simple quadratic function in (6.2.19) with the positive second derivative in (6.2.20).

[Figures A2.7 and A2.8. One-dimensional illustrations of the zero-slope condition, $\frac{d}{dx}f(x_0) = 0$]
The situation is more complex for multidimensional functions. Here the first-order "zero-slope" condition, $\frac{d}{dx}f(x_0) = 0$, is replaced by a more general "zero-gradient" condition, $\nabla f(x_0) = 0$, which ensures that the total derivative in (A2.6.2) is zero in all directions, $dx$.6 Geometrically, this first-order condition requires that the tangent plane at $x_0$ be flat, as is illustrated in Figure A2.9 below.
[Figure A2.9. A smooth function, $f(x)$, over a region $R$, with flat tangent plane ($\nabla f(x_0) = 0$) at its maximum point, $x_0 \in R$]
6 Note that since $\nabla f(x_0)$ is an n-vector, the "0" here is also an n-vector, $0 = (0,..,0)'$. While we could write this as $0_n$, standard practice is to take the dimension of zero vectors as understood by context.
As an illustration, consider the quadratic function,

(A2.7.1)  $f(x_1, x_2) = 928 + 26x_1 + 20x_2 - 3x_1^2 - x_1x_2 - 4x_2^2$

So by taking the partial derivatives of this function and setting them equal to zero, we obtain the relations,

(A2.7.2)  $0 = \frac{\partial}{\partial x_1}f(x_0) = 26 - 6x_{01} - x_{02}$

(A2.7.3)  $0 = \frac{\partial}{\partial x_2}f(x_0) = 20 - x_{01} - 8x_{02}$

These linear equations can easily be solved to yield the unique solution point, $x_0 = (x_{01}, x_{02}) = (4, 2)$, shown in the figure. However, when the dimension, $n$, is much larger than two, it is practically impossible to write down the full expression for $f(x) = f(x_1,..,x_n)$, let alone the simultaneous equation system corresponding to the first-order condition. Here is where the power of matrix algebra takes full force. If we let

(A2.7.4)  $A = \begin{pmatrix} 3 & 2 \\ -1 & 4 \end{pmatrix}, \quad b = \begin{pmatrix} 26 \\ 20 \end{pmatrix}, \quad c = 928$
then it can easily be verified (by matrix multiplication) that the function in (A2.7.1) can be equivalently written in matrix form for all $x = (x_1, x_2)'$ as,

(A2.7.5)  $f(x) = c + b'x - x'Ax$

Notice the similarity of this quadratic form to the general expression for mean squared error, $MSE(\lambda_0)$, in expression (6.2.27), where $x$ now plays the role of the weight vector, $\lambda_0$.7 The power of this notation is that the quadratic form in (A2.7.5) can be analyzed in the same way regardless of the dimension, $n$. All that is required here is that we formalize the vector versions of the partial derivatives in (A2.7.2) and (A2.7.3). To do so, notice first that for any coefficient vector, $b = (b_1, b_2)'$, such as in (A2.7.4), if we now employ the gradient notation in (A2.6.4), then it follows that,

(A2.7.6)  $\nabla(b'x) = \begin{pmatrix} \frac{\partial}{\partial x_1}(b'x) \\ \frac{\partial}{\partial x_2}(b'x) \end{pmatrix} = \begin{pmatrix} \frac{\partial}{\partial x_1}(b_1x_1 + b_2x_2) \\ \frac{\partial}{\partial x_2}(b_1x_1 + b_2x_2) \end{pmatrix} = \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} = b$
7 It is also worth noticing the difference in signs of the quadratic term, where MSE was to be minimized, and f is to be maximized. We shall return to this distinction below.
More generally, for any linear compound, $b'x = \sum_{i=1}^n b_ix_i$, exactly the same argument shows that

(A2.7.7)  $\nabla(b'x) = b$

Turning next to the quadratic term in (A2.7.5), observe that for any $2\times 2$ matrix, $A$,

(A2.7.8)  $x'Ax = (x_1\; x_2)\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = (x_1\; x_2)\begin{pmatrix} a_{11}x_1 + a_{12}x_2 \\ a_{21}x_1 + a_{22}x_2 \end{pmatrix} = a_{11}x_1^2 + a_{12}x_1x_2 + a_{21}x_2x_1 + a_{22}x_2^2$

so that by differentiating with respect to $x_1$ and $x_2$,

(A2.7.9)  $\nabla(x'Ax) = \begin{pmatrix} 2a_{11}x_1 + (a_{12} + a_{21})x_2 \\ (a_{12} + a_{21})x_1 + 2a_{22}x_2 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \begin{pmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = Ax + A'x$

More generally, for any quadratic expression, $x'Ax = \sum_{i=1}^n\sum_{j=1}^n a_{ij}x_ix_j$, essentially the same argument shows that

(A2.7.10)  $\nabla(x'Ax) = (A + A')x$

Here there is one important special case, namely when the matrix $A$ is symmetric, i.e., when $A = A'$. For this case it follows at once from (A2.7.10) that

(A2.7.11)  $\nabla(x'Ax) = 2Ax$

To see the special relevance of this case, notice that every square matrix, $A$, has an associated symmetrization,

(A2.7.12)  $A_s = \tfrac{1}{2}(A + A')$

But since $x'y = y'x$ for all vectors, it then follows that
(A2.7.13)  $x'A_sx = \tfrac{1}{2}\big[x'(Ax) + (Ax)'x\big] = \tfrac{1}{2}\big[x'(Ax) + x'(Ax)\big] = x'Ax$

In particular, for the matrix, $A$, in (A2.7.4) we obtain the symmetrization,

(A2.7.14)  $A_s = \tfrac{1}{2}\left[\begin{pmatrix} 3 & 2 \\ -1 & 4 \end{pmatrix} + \begin{pmatrix} 3 & -1 \\ 2 & 4 \end{pmatrix}\right] = \begin{pmatrix} 3 & 1/2 \\ 1/2 & 4 \end{pmatrix}$
Using these identities, we can now establish first-order conditions for any quadratic maximization problem as follows. If $f(x)$ is assumed to have the general quadratic form,

(A2.7.15)  $f(x) = c + b'x + x'Ax$

with $A$ symmetric, then by (A2.7.7) and (A2.7.11) the first-order condition for a maximum at $x_0$ takes the form,

(A2.7.16)  $0 = \nabla f(x_0) = b + 2Ax_0$

so that for nonsingular $A$,

(A2.7.17)  $x_0 = -\tfrac{1}{2}A^{-1}b$

In the present case, where (symmetric) $A$ is given by the negative of (A2.7.14) [to be consistent with (A2.7.15)], it follows that

(A2.7.18)  $x_0 = -\tfrac{1}{2}(-A_s)^{-1}b = \tfrac{1}{2}A_s^{-1}b = \tfrac{1}{2}\begin{pmatrix} 3 & 1/2 \\ 1/2 & 4 \end{pmatrix}^{-1}\begin{pmatrix} 26 \\ 20 \end{pmatrix} = \begin{pmatrix} 4 \\ 2 \end{pmatrix}$
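This solution is easily checked numerically; a minimal MATLAB sketch, using the example values in (A2.7.4), is as follows:

A  = [3 2; -1 4];   b = [26; 20];   % parameters in (A2.7.4)
As = (A + A')/2;                    % symmetrization, as in (A2.7.14)
x0 = 0.5 * (As \ b);                % first-order solution in (A2.7.18): x0 = (4,2)'
grad = b - 2*As*x0;                 % gradient of f at x0 (should be the zero vector)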
Similarly, by applying the same identities to the mean squared error function,

(A2.7.19)  $MSE(\lambda_0) = \sigma_0^2 - 2\lambda_0'c_0 + \lambda_0'V_0\lambda_0$

in expression (6.2.25), we can now solve the corresponding first-order condition for the optimal weight vector, $\hat{\lambda}_0$, as follows:

(A2.7.20)  $0 = \nabla MSE(\hat{\lambda}_0) = -2c_0 + 2V_0\hat{\lambda}_0 \;\Rightarrow\; \hat{\lambda}_0 = V_0^{-1}c_0$

But while these first-order conditions are necessary for optimal solutions, they are not sufficient. In particular, (A2.7.18) is claimed to be the solution of a maximization problem, and (A2.7.20) is claimed to be the solution of a minimization problem. Hence to check whether either of these is actually a solution of its respective problem, we must develop appropriate second-order conditions.
Recall that in the scalar case, the second-order condition for a maximum (or minimum) of $f(x)$ at $x_0$ is that the second derivative, $\frac{d^2}{dx^2}f(x_0)$, be negative (or positive), as seen for the case of a maximum in Figure A2.7 above. In the multidimensional case the conditions are similar in nature, but are necessarily somewhat more complex. The simplest way to motivate the basic idea here is to reduce the problem to "one dimension" in the following way. For a two-dimensional function, $f(x) = f(x_1, x_2)$, with a maximum at point, $x_0$, such as in Figure A2.9 above, consider a one-dimensional "slice" through this function, such as the one shown in Figure A2.10 below.
[Figure A2.10. A one-dimensional slice, $g_{\Delta x}(t) = f(x_0 + t\,\Delta x)$, through $f$ at $x_0$]
Such a slice can be defined formally by choosing any fixed nonzero vector, $\Delta x$, and considering all linear combinations, $x_0 + t\,\Delta x$. As the scalar, $t$, increases from zero, one moves away from $x_0$ in "direction" $\Delta x$. Similarly, as $t$ decreases from zero, one moves
in the opposite direction. The one-dimensional slice through $f$ shown in the figure thus corresponds precisely to the scalar function of $t$ defined by $g_{\Delta x}(t) = f(x_0 + t\,\Delta x)$. So if $f$ achieves its maximum at $x_0$, then in particular, it must exhibit a maximum along this slice at $t = 0$. This of course implies that $\frac{d}{dt}g_{\Delta x}(0) = 0$, and more importantly for our present purposes, that $\frac{d^2}{dt^2}g_{\Delta x}(0) \leq 0$. To analyze this latter condition more explicitly, we introduce the following simplifying notation. For any function, $f(x) = f(x_1,..,x_n)$, of $n$ arguments, let

(A2.7.21)  $f_i(x) = \frac{\partial}{\partial x_i}f(x_1,..,x_i,..,x_n)$

denote the partial derivative of $f$ with respect to its i-th argument, and for each $i, j = 1,..,n$ let

(A2.7.22)  $f_{ij}(x) = \frac{\partial}{\partial x_i}f_j(x_1,..,x_i,..,x_n) = \frac{\partial^2}{\partial x_i\,\partial x_j}f(x_1,..,x_i,..,x_j,..,x_n)$

denote the cross partial derivative of $f$ with respect to its i-th and j-th arguments (so that in particular, $f_{ii}(x)$ is the second partial derivative of $f$ with respect to its i-th argument). In terms of this notation, if we consider a compound function, $g(t) = f[h_1(t), h_2(t)]$, and recall from the chain rule for derivatives that

(A2.7.23)  $\frac{d}{dt}g(t) = f_1[h_1(t), h_2(t)]\frac{d}{dt}h_1(t) + f_2[h_1(t), h_2(t)]\frac{d}{dt}h_2(t)$

then for the slice function, $g_{\Delta x}$, we obtain

(A2.7.24)  $\frac{d}{dt}g_{\Delta x}(t) = \frac{d}{dt}f(x_0 + t\,\Delta x) = \frac{d}{dt}f(x_{01} + t\,\Delta x_1,\; x_{02} + t\,\Delta x_2) = f_1(x_0 + t\,\Delta x)\,\Delta x_1 + f_2(x_0 + t\,\Delta x)\,\Delta x_2$

and similarly,

(A2.7.25)  $\frac{d^2}{dt^2}g_{\Delta x}(t) = \frac{d}{dt}\Big[\frac{d}{dt}g_{\Delta x}(t)\Big] = \frac{d}{dt}[f_1(x_0 + t\,\Delta x)]\,\Delta x_1 + \frac{d}{dt}[f_2(x_0 + t\,\Delta x)]\,\Delta x_2$

So by applying the chain rule to the first term on the right, we obtain

(A2.7.26)  $\frac{d}{dt}[f_1(x_0 + t\,\Delta x)]\,\Delta x_1 = \frac{d}{dt}[f_1(x_{01} + t\,\Delta x_1,\; x_{02} + t\,\Delta x_2)]\,\Delta x_1 = \big[f_{11}(x_0 + t\,\Delta x)\,\Delta x_1 + f_{12}(x_0 + t\,\Delta x)\,\Delta x_2\big]\Delta x_1$
and similarly for the second term,

(A2.7.27)  $\frac{d}{dt}[f_2(x_0 + t\,\Delta x)]\,\Delta x_2 = \big[f_{21}(x_0 + t\,\Delta x)\,\Delta x_1 + f_{22}(x_0 + t\,\Delta x)\,\Delta x_2\big]\Delta x_2$

By combining these, we can now write the second derivative in (A2.7.25) more explicitly as

(A2.7.28)  $\frac{d^2}{dt^2}g_{\Delta x}(t) = f_{11}(x_0 + t\,\Delta x)\,\Delta x_1^2 + f_{12}(x_0 + t\,\Delta x)\,\Delta x_1\Delta x_2 + f_{21}(x_0 + t\,\Delta x)\,\Delta x_2\Delta x_1 + f_{22}(x_0 + t\,\Delta x)\,\Delta x_2^2$

so that the second-order condition at $t = 0$ becomes

(A2.7.29)  $\frac{d^2}{dt^2}g_{\Delta x}(0) = f_{11}(x_0)\,\Delta x_1^2 + f_{12}(x_0)\,\Delta x_1\Delta x_2 + f_{21}(x_0)\,\Delta x_2\Delta x_1 + f_{22}(x_0)\,\Delta x_2^2 \leq 0$

This second-order condition can be written more compactly in matrix form as follows. If we now designate the matrix of cross partial derivatives of $f$ at point $x_0$ as the Hessian matrix,

(A2.7.30)  $H_f(x_0) = \begin{pmatrix} f_{11}(x_0) & f_{21}(x_0) \\ f_{12}(x_0) & f_{22}(x_0) \end{pmatrix}$

then the right-hand side of (A2.7.29) can be written in matrix terms as

(A2.7.31)  $\frac{d^2}{dt^2}g_{\Delta x}(0) = (\Delta x_1,\, \Delta x_2)\begin{pmatrix} f_{11}(x_0) & f_{21}(x_0) \\ f_{12}(x_0) & f_{22}(x_0) \end{pmatrix}\begin{pmatrix} \Delta x_1 \\ \Delta x_2 \end{pmatrix} = \Delta x'\,H_f(x_0)\,\Delta x$

Hence the desired second-order condition for a maximum of $f$ at $x_0$ with respect to direction $\Delta x$ takes the simple form:

(A2.7.32)  $\Delta x'\,H_f(x_0)\,\Delta x \leq 0$
More generally, if the Hessian of $f(x) = f(x_1,..,x_n)$ at $x_0$ is defined by

(A2.7.33)  $H_f(x_0) = \begin{pmatrix} f_{11}(x_0) & \cdots & f_{n1}(x_0) \\ \vdots & & \vdots \\ f_{1n}(x_0) & \cdots & f_{nn}(x_0) \end{pmatrix}$

then the argument leading to (A2.7.32) continues to hold for any direction vector, $\Delta x \in \mathbb{R}^n$, and Hessian matrix given by (A2.7.33).

Given this "one-dimensional" condition, it remains only to observe that for a true maximum at $x_0$, this same condition must hold in all directions with respect to $x_0$. So if we now designate an n-square matrix, $A$, to be negative definite if and only if

(A2.7.34)  $\Delta x'\,A\,\Delta x < 0\,, \quad \text{for all } \Delta x \neq 0$

then it follows at once from (A2.7.32) and (A2.7.34) that the desired full-dimensional condition for a maximum of $f$ at $x_0$ is precisely that the Hessian matrix, $H_f(x_0)$, be negative definite.
This condition for a maximum also yields a corresponding condition for a minimum of $f$ at $x_0$. For the $n = 2$ case, simply observe that if the "mountain" shape of $f(x)$ in Figure A2.9 is inverted to a "bowl" shape, then it is clear that the function, $g_{\Delta x}(t) = f(x_0 + t\,\Delta x)$, corresponding to each slice in Figure A2.10 must now have a positive second derivative at $t = 0$, i.e., $\frac{d^2}{dt^2}g_{\Delta x}(0) > 0$. Hence the same argument leading to (A2.7.32) now shows that

(A2.7.35)  $\Delta x'\,H_f(x_0)\,\Delta x > 0$

must hold in each nonzero direction, $\Delta x$. This argument is again directly extendable to $n$ dimensions (but without pictures). So if we now designate an n-square matrix, $A$, as positive definite if and only if

(A2.7.36)  $\Delta x'\,A\,\Delta x > 0\,, \quad \text{for all } \Delta x \neq 0$

then the desired full-dimensional condition for a minimum of $f$ at $x_0$ is precisely that $H_f(x_0)$ be positive definite.
The task remaining is to establish readily testable conditions for determining when a matrix is positive or negative definite. Here we begin by observing from (A2.7.34) and (A2.7.36) that a matrix, $A$, is positive definite if and only if $-A$ is negative definite. Hence, it suffices to consider only one of these two conditions. Following standard practice, we here focus on positive definiteness. Next recall from the identity in (A2.7.13) that to establish positive definiteness, we may assume that the matrix $A$ is symmetric (for if not, then use its symmetrization, $A_s$). For Hessian matrices in particular, it turns out that such matrices are guaranteed to be symmetric, i.e., $f_{ij}(x) = f_{ji}(x)$, whenever these cross partial derivatives are continuous.8 So we shall focus on conditions for establishing that a symmetric matrix is positive definite.
The most basic characterization here is that a matrix, $A$, is symmetric positive definite (SPD) if and only if it can be written in the form,

(A2.7.37)  $A = BB'$

for some nonsingular n-square matrix, $B$. To see this, observe first that since

(A2.7.38)  $(BB')' = (B')'B' = BB'$

it follows that $A$ must be symmetric. More importantly, observe that since the inner product of a nonzero vector, $x$, with itself is always positive, and since nonsingularity of $B$ implies that $B'x \neq 0$ whenever $x \neq 0$, it follows that

(A2.7.39)  $x \neq 0 \;\Rightarrow\; x'Ax = x'(BB')x = (B'x)'(B'x) > 0$

and hence that $A$ is SPD. This characterization helps to clarify the real meaning of positive definiteness. In particular, if we consider the simplest case, $n = 1$, and let $a$ denote the scalar matrix, $A$, then the positive definiteness condition simply says that for all nonzero scalars, $x$, we must have $x(a)x = ax^2 > 0$, which of course simply characterizes positivity of the scalar, $a$. So again letting $b$ denote the scalar matrix, $B$, condition (A2.7.39) simply says that $a$ is positive if and only if it can be written as
8 This result is usually known as Young's Theorem, and can be found in most calculus textbooks.
$a = b^2$ for some scalar, $b$, i.e., if and only if it has a real square root, $b$.9 So in this sense, positive definite matrices are the natural generalization of positive numbers. But while this decomposition characterization is very informative, it is no more "testable" than positive definiteness itself. However, there do exist testable conditions for ensuring the existence of such decompositions, as we now show.

The simplest and most commonly used test for positive definiteness is based on the properties of certain determinants. If the determinant of an n-square matrix, $A$, is denoted by $\det(A)$, then this condition involves positivity of the determinants of certain sub-matrices of $A$. In particular, for each $k = 1,..,n$ we now designate the k-square matrix, $A_k = (a_{ij}: i, j = 1,..,k)$, in the "upper left-hand corner" of $A = (a_{ij}: i, j = 1,..,n)$, as the kth leading principal sub-matrix of $A$, and designate its determinant, $\det(A_k)$, as the kth leading principal minor of $A$. Then the following condition, known as Sylvester's Condition, is both necessary and sufficient for positive definiteness:

$A$ is positive definite $\;\iff\; \det(A_k) > 0\,, \quad k = 1,..,n$

This result will be shown later to be a simple consequence of the Spectral Decomposition Theorem for symmetric matrices. To illustrate its application, consider the symmetrized matrix in (A2.7.14) above, i.e.,

(A2.7.42)  $A_s = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = \begin{pmatrix} 3 & 1/2 \\ 1/2 & 4 \end{pmatrix}$

Observe that since the leading principal minors are $\det(a_{11}) = a_{11} = 3 > 0$ and $\det(A_s) = (3)(4) - (1/2)(1/2) = 47/4 > 0$, it follows from Sylvester's Condition that $A_s$ is indeed positive definite.
9 Later we shall see that SPD matrices, A, actually have square roots as well, i.e., can be written as $A = B^2$ for a nonsingular symmetric matrix, B. But this requires the Spectral Decomposition Theorem for symmetric matrices.
But our main interest in Sylvester's Condition is that it provides the basis for establishing a more useful testable condition that has many applications of its own. In particular, it yields a simple decomposition of SPD matrices known as the Cholesky decomposition. In particular, if $T$ is a nonsingular lower triangular matrix, i.e., a matrix with zeros everywhere above the diagonal, of the form

(A2.7.44)  $T = \begin{pmatrix} t_{11} & 0 & \cdots & 0 \\ t_{21} & t_{22} & & \vdots \\ \vdots & & \ddots & 0 \\ t_{n1} & t_{n2} & \cdots & t_{nn} \end{pmatrix}$

then the product,

(A2.7.45)  $A = TT'$

is, by the argument above, an SPD matrix. Moreover, this again turns out to completely characterize SPD matrices, as we now show:10

Cholesky Theorem. Every SPD matrix, $A$, has a decomposition of the form (A2.7.45) for some nonsingular lower triangular matrix, $T$.

To construct such a decomposition, write $A$ in the partitioned form,

(A2.7.46)  $A = \begin{pmatrix} A_{n-1} & a_{n-1} \\ a_{n-1}' & a_{nn} \end{pmatrix}$

where $A_{n-1}$ denotes the (n−1)st leading principal sub-matrix of $A$. Proceeding inductively, suppose that $A_{n-1}$ has a Cholesky decomposition, $A_{n-1} = T_{n-1}T_{n-1}'$, and consider a candidate extension of $T_{n-1}$ of the form,
10 The following proof is based on an argument given by Prof. David Hill that is available online at: http://astro.temple.edu/~dhill001/course/math254/CHOLESKYDECOMPOSITION_stu.pdf
(A2.7.48)  $T = \begin{pmatrix} T_{n-1} & 0 \\ h' & c \end{pmatrix}$

for some unknown (n−1)-vector, $h$, and scalar, $c$. Hence by (A2.7.46) and (A2.7.48), we seek values for $h$ and $c$ such that,

$TT' = \begin{pmatrix} T_{n-1} & 0 \\ h' & c \end{pmatrix}\begin{pmatrix} T_{n-1}' & h \\ 0 & c \end{pmatrix} = \begin{pmatrix} T_{n-1}T_{n-1}' & T_{n-1}h \\ h'T_{n-1}' & h'h + c^2 \end{pmatrix} = \begin{pmatrix} A_{n-1} & a_{n-1} \\ a_{n-1}' & a_{nn} \end{pmatrix}$

Since $A_{n-1} = T_{n-1}T_{n-1}'$ by hypothesis, the off-diagonal condition, $T_{n-1}h = a_{n-1}$, can be solved by setting

(A2.7.52)  $h = T_{n-1}^{-1}a_{n-1}$

and the remaining diagonal condition, $h'h + c^2 = a_{nn}$, can then be solved by setting

(A2.7.53)  $c = \sqrt{a_{nn} - h'h}$

Hence to complete this construction, it remains only to show that the last operation is legitimate, i.e., that $a_{nn} - h'h > 0$. But by the determinant rule for partitioned matrices, it follows from (A2.7.46) that

(A2.7.55)  $\det(A) = \det\begin{pmatrix} A_{n-1} & a_{n-1} \\ a_{n-1}' & a_{nn} \end{pmatrix} = \det(A_{n-1})\,\big(a_{nn} - a_{n-1}'A_{n-1}^{-1}a_{n-1}\big)$

(since $a_{nn} - a_{n-1}'A_{n-1}^{-1}a_{n-1}$ is a scalar).11 Moreover, since the hypothesis of positive leading principal minors for $A$ implies in particular that $\det(A) > 0$ and $\det(A_{n-1}) > 0$, we see from (A2.7.55) that
11 To gain some intuition for this determinant rule, observe simply that for the case of $n = 2$, we must have $\det\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} = a_{11}a_{22} - a_{12}a_{21} = (a_{11})(a_{22} - a_{12}\,a_{11}^{-1}a_{21})$.
$a_{nn} - a_{n-1}'A_{n-1}^{-1}a_{n-1} = \det(A)/\det(A_{n-1}) > 0$. But by (A2.7.52), $a_{n-1}'A_{n-1}^{-1}a_{n-1} = h'T_{n-1}'(T_{n-1}T_{n-1}')^{-1}T_{n-1}h = h'h$, so that indeed $a_{nn} - h'h > 0$, and the construction is complete.
Remark: It should also be noted that this proof yields a recursive construction for $T$, and in particular shows that it is unique. This is obvious for $n = 1$, where $T = \sqrt{A}$ is the only possible choice. Moreover, by recursive use of the constructions in (A2.7.52) and (A2.7.53), one must obtain a unique extension, $T$, for each $n > 1$.
As noted above, the most attractive feature of Cholesky decompositions is their ease of calculation. As mentioned in the text, this is easily accomplished in MATLAB with the command,

>> T = chol(A);

If this algorithm fails, then one obtains the error message "Matrix must be positive definite". So by the Cholesky Theorem above, this procedure yields a practical test of positive definiteness, which can be designated as the Cholesky Test. In summary, while Sylvester's Condition provides a useful test for relatively small matrices, such as (A2.7.42), the calculation of principal minors is very time consuming for larger matrices. Here the Cholesky Test is much faster and more practical.12 If the algorithm succeeds, then the matrix is SPD, and otherwise, it is not.13
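This test can also be carried out without triggering an error by requesting the second output of chol, which is zero exactly when the factorization succeeds; a minimal MATLAB sketch using the matrix in (A2.7.42) is:

A = [3 1/2; 1/2 4];         % the symmetrized matrix in (A2.7.42)
[T, p] = chol(A);           % p == 0  <=>  A is (numerically) positive definite
if p == 0
    disp('A is positive definite');    % here A = T'*T with T upper triangular
else
    disp('A is not positive definite');
end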
Calculation of Hessians

To see how these conditions can be applied in practice, it is instructive to analyze the maximization example in (A2.7.4) and (A2.7.5). While in this simple case the desired Hessian can of course be calculated term by term (i.e., each cross partial derivative), for larger problems it is much more efficient to do the calculations in matrix terms. So it is appropriate to see how this can be accomplished. To do so, we begin by rewriting the gradient vector of first partial derivatives in expression (A2.6.4) in terms of our present notation as follows:
12 Even for n-square SPD matrices, A, as large as n = 1000, the MATLAB command, chol(A), produces the unique Cholesky decomposition in about 0.03 seconds.
13 Care must be taken for "almost singular" SPD matrices, where rounding errors can sometimes lead to failure. Methods of numerical analysis must then be used to check whether this is the case.
(A2.7.58)  $\nabla f(x) = \begin{pmatrix} \partial f(x)/\partial x_1 \\ \vdots \\ \partial f(x)/\partial x_n \end{pmatrix} = \begin{pmatrix} f_1(x) \\ \vdots \\ f_n(x) \end{pmatrix}$

This can be viewed as a vector of functions, $f_i(x),\; i = 1,..,n$. Notice that in the Hessian of (A2.7.33) the i-th column is just the gradient of the i-th function, $f_i(x)$, in (A2.7.58). So if we now define the gradient of a vector of smooth functions, $[g(x),..,h(x)]$, with common arguments, $x = (x_1,..,x_n)'$, by

(A2.7.59)  $\nabla[g(x),..,h(x)] = [\nabla g(x),..,\nabla h(x)] = \begin{pmatrix} g_1(x) & \cdots & h_1(x) \\ \vdots & & \vdots \\ g_n(x) & \cdots & h_n(x) \end{pmatrix}$

then the Hessian of $f$ can be written as the gradient of its (transposed) gradient, i.e.,

(A2.7.60)  $H_f(x_0) = \nabla[\nabla f(x_0)'] = [\nabla f_1(x_0),..,\nabla f_n(x_0)] = \nabla^2 f(x_0)$

As a second application of (A2.7.59), note that if the i-th row of a matrix, $A$, is denoted by $a_i'$, then the linear expression, $Ax$, can be written as a vector of linear functions as follows,

(A2.7.61)  $Ax = \begin{pmatrix} a_1' \\ \vdots \\ a_n' \end{pmatrix}x = \begin{pmatrix} a_1'x \\ \vdots \\ a_n'x \end{pmatrix}$

so that by (A2.7.7) and (A2.7.59),

(A2.7.62)  $\nabla[(Ax)'] = \nabla[(a_1'x),..,(a_n'x)] = (a_1,..,a_n) = A'$

With these preliminaries, we can now reconsider the maximization of the general quadratic expression in (A2.7.5), $f(x) = c + b'x - x'Ax$.
By (A2.7.7) and (A2.7.10), the gradient of this function is given by $\nabla f(x) = b - (A + A')x = b - 2A_sx$, so that by (A2.7.60) and (A2.7.62), its Hessian is the constant matrix,

(A2.7.64)  $H_f(x) = \nabla[\nabla f(x)'] = \nabla[(b - 2A_sx)'] = -2A_s$

Hence any point, $x_0$, satisfying the first-order condition for $f(x)$ will be a maximum if and only if the matrix $A_s$ is positive definite (so that the associated matrix, $-2A_s$, is negative definite). But for the specific maximum problem with parameters in (A2.7.4), we have already seen that the symmetrized matrix, $A_s$, in (A2.7.42) above is positive definite. Thus the unique point, $x_0 = (4, 2)$, satisfying the first-order conditions is indeed a maximum (which was already evident in Figure A2.9).
Recall also from (A2.7.20) that the first-order condition for minimizing the mean squared error function, $MSE(\lambda_0)$, in (A2.7.19) showed that the optimal weight vector was given by $\hat{\lambda}_0 = V_0^{-1}c_0$. We are now in a position to complete that analysis. If the Hessian for this function is denoted by $H_{MSE}$, then by recalling that every covariance matrix is symmetric, it follows from the same analysis as in (A2.7.64) that

$H_{MSE}(\lambda_0) = \nabla[(-2c_0 + 2V_0\lambda_0)'] = 2V_0$

Thus to ensure that $\hat{\lambda}_0 = V_0^{-1}c_0$ is the unique minimum of (A2.7.19), it remains only to show that $V_0$ is positive definite. In fact, it turns out that every covariance matrix is positive semidefinite, and is positive definite whenever it is nonsingular.14 While we don't yet have all the tools to show this fully, we can establish the most essential part of this condition as follows. Recall from the covariance result in (3.2.21) that for any random vector, $X$, with covariance matrix, $\Sigma = \operatorname{cov}(X)$, the variance of each linear compound, $a'X$, is given by $\operatorname{var}(a'X) = a'\Sigma a$. So it must certainly be true that

$a'\Sigma a = \operatorname{var}(a'X) \geq 0$

for all $a$, i.e., that $\Sigma$ is at least positive semidefinite.
Non-Definite Hessians

Before proceeding to applications, it should be noted that Hessians need not be either positive or negative definite at points satisfying the first-order conditions. This is illustrated in Figure A2.11 below, which depicts a smooth function of $(x_1, x_2)$ with two local maxima.

[Figure A2.11. A function with two local maxima separated by a saddle point (shown in red)]

However, there is seen to be a third point (shown in red) between these two local maxima which also satisfies the first-order condition that the gradient be zero. Notice also that movement from this point toward either maximum point must go "uphill", so that the second derivative is positive in these directions. But movement orthogonal to these directions leads "downhill", and hence yields negative second derivatives. At such saddle point locations, the Hessian is neither positive nor negative definite. Note finally
14 This can actually be shown without the Spectral Decomposition Theorem. For a simple proof that positive semidefiniteness plus nonsingularity implies positive definiteness, see Horn and Johnson (1985, p.400).
that such saddle points are not rare. Indeed, whenever there are multiple maxima one
can expect to find intermediate saddle points.
Turning finally to ordinary least squares, observe that for the linear model, $y = X\beta + \varepsilon$, the sum of squared deviations function,

$SSD(\beta) = (y - X\beta)'(y - X\beta) = y'y - 2\beta'X'y + \beta'X'X\beta$

is seen to be a quadratic form very similar in nature to the mean squared error function, $MSE(\lambda_0)$, in (A2.7.19) above. Thus, as in (A2.7.2), we see from the symmetry of the matrix $X'X$ that the first-order condition for this minimization problem takes the form:

(A2.7.69)  $0 = \nabla SSD(\beta) = -2X'y + 2X'X\beta \;\Rightarrow\; X'X\beta = X'y$

But if it is assumed that there are no collinearities between the columns of $X$ (so that $X$ is of full column rank), then the (k+1)-square matrix, $X'X$, is nonsingular. Hence the unique solution to (A2.7.69), designated as the ordinary least squares (OLS) estimator of $\beta$, is given by

(A2.7.70)  $\hat{\beta} = (X'X)^{-1}X'y$
The only question remaining is whether this yields a proper minimum. Here we can answer this question definitively. In particular, recall first from (A2.7.64) that in this case,

(A2.7.71)  $H_{SSD}(\beta) = \nabla[(-2X'y + 2X'X\beta)'] = 2X'X$

so that it remains only to show that $X'X$ is positive definite. But in the argument of (A2.7.37) through (A2.7.40) above it was shown that for any nonsingular matrix, $B$, the matrix $BB'$ is necessarily positive definite. Hence it is enough to observe that this continues to hold as long as $B'$ is of full column rank. For if it were true that $0 = x'(BB')x = (B'x)'(B'x)$ for some $x \neq 0$, then the same argument shows that

(A2.7.72)  $0 = B'x = (b_1,..,b_m)\begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix} = \sum_{j=1}^m x_jb_j$

where $(b_1,..,b_m)$ denotes the columns of $B'$, so that these columns would be linearly dependent. Hence by setting $B' = X$, the assumed absence of collinearities among the columns of $X$ implies that $X'X$ must be positive definite, and thus that $\hat{\beta}$ in (A2.7.70) is indeed the unique minimizer of $SSD(\beta)$.
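A minimal MATLAB sketch of this OLS computation, with the positive definiteness of $X'X$ verified by the Cholesky Test, might read as follows (the data here are hypothetical):

% Hypothetical data matrix (with intercept column) and response vector:
n = 100;  X = [ones(n,1), randn(n,2)];  y = randn(n,1);
[~, p] = chol(X'*X);             % Cholesky Test: p == 0 => X'X positive definite
if p ~= 0
    error('Columns of X are collinear: X''X is not positive definite');
end
beta_hat = (X'*X) \ (X'*y);      % OLS estimator (A2.7.70)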
To motivate the main ideas, we again begin with a two-dimensional example in which the relevant tangency conditions can be depicted graphically. For ease of visualization, it is convenient to switch to a minimization problem. So consider minimizing the quadratic objective function defined for each $x = (x_1, x_2)'$ by,

(A2.8.1)  $f(x) = c + b'x + x'Ax$

with $c = 20$, $b = (1, 2)'$, and

(A2.8.2)  $A = \begin{pmatrix} 25 & 1 \\ 1 & 15 \end{pmatrix}$

Since $A$ is positive definite, $f$ has a unique (unconstrained) global minimum at the zero-gradient point,

(A2.8.3)  $x^* = -\tfrac{1}{2}A^{-1}b$

In particular, we now suppose that feasible values of $x$ for this minimization problem are also required to satisfy a linear constraint of the following form,

(A2.8.4)  $d'x = \beta$

with $d = (5, 4)'$ and $\beta = 13$. In other words, the only relevant values of $x$ for this problem are those lying on the blue line shown in Figure A2.12.
[Figure A2.12. Contours of $f(x)$ together with the constraint line, $d'x = \beta$, and the constrained minimum point, $x_0$]
To put this problem in more standard form, let the function $g(x)$ be defined by

(A2.8.5)  $g(x) = d'x$

so that (A2.8.4) is equivalent to the condition that $g(x) = \beta$. In these terms, the present problem is formally a constrained minimization problem of the form,

(A2.8.6)  minimize: $f(x)$  subject to: $g(x) = \beta$
To solve this problem, observe next that [in a manner similar to Figure A2.5 (and
Figure A2.11) above] the contours of the function f ( x) are shown on the ( x1 , x2 ) plane
in Figure A2.12. Moreover, we know from (A2.8.3) above that this function decreases
toward its global minimum, x * , in the negative quadrant. So the lowest contour
touching the blue line in Figure A2.12 clearly defines the desired constrained minimum
point, x0 , solving problem (A2.8.6).
With these observations, the key question is how to identify this point analytically. Here it is convenient to give a planar representation of these contours, as in Figure A2.13 below [where the $(x_1, x_2)$ plane has now been rotated to place the origin in its more natural position at the lower left corner of the figure].15

[Figure A2.13. Planar contour representation, with the constraint line tangent to the lowest contour of $f$ at $x_0$, the constraint gradient, $\nabla g(x_0)$ (blue arrow), and the negative gradient, $-\nabla f(x_0)$ (red arrow)]
Here the solution point, $x_0$, is again identified by a tangency between the linear constraint, $g(x) = \beta$ (blue line), and the lowest contour of $f(x)$. But recall from (A2.6.3) that the gradient, $\nabla f(x_0)$, of $f$ at $x_0$ must be orthogonal to this tangent line, which by definition defines the directions of "no change" in $f$ at $x_0$. Recall also that gradients point in the direction of maximum increase in $f$. But since we are here interested in minimizing $f$, it is more appropriate to consider the (opposite) direction of maximum decrease in $f$ at $x_0$, as given by the negative gradient, $-\nabla f(x_0)$. This negative gradient is shown by the red arrow in Figure A2.13.

Similarly, since the blue tangent line is also a constant-value contour for the constraint function, $g$ [i.e., the set of $x$ values where $g(x) = \beta$], it then follows that the gradient, $\nabla g(x_0)$, of $g$ at $x_0$ must be orthogonal to this same tangent line, as shown by the blue arrow in Figure A2.13. [Since the positivity of the coefficient vector, $d$, in this case
15 Note also that for compatibility with Figure A2.12, the horizontal axis is x2 rather than x1.
ensures that $\nabla g(x_0) = d$ points in the direction of increasing constraint values, these two gradients here point in opposite directions.] Finally, since there is only a single line in the plane that is orthogonal to this blue line at $x_0$, it follows that the two gradients, $-\nabla f(x_0)$ and $\nabla g(x_0)$, must both lie on this same line, i.e., must be collinear. Since this implies that $-\nabla f(x_0)$ and $\nabla g(x_0)$ must be scalar multiples of one another, the fundamental tangency condition in Figure A2.13 implies that for some scalar, $\lambda_0$, it must be true that $-\nabla f(x_0) = \lambda_0\nabla g(x_0)$, or equivalently that

(A2.8.7)  $\nabla f(x_0) + \lambda_0\nabla g(x_0) = 0$

Moreover, since the solution point must also lie on the constraint line, it must satisfy

(A2.8.8)  $g(x_0) = \beta$

This equation system allows all unknowns to be solved for. But before doing so, it is important to note that while the above derivation is geometric in nature, and hence can be illustrated graphically, there is a mathematically more powerful way of deriving the same conditions. In particular, if we now combine the functions, $f$ and $g$, into a single function of the form

(A2.8.9)  $L(x, \lambda) = f(x) + \lambda[g(x) - \beta]$

then this augmented function, called the Lagrangian function, actually yields conditions (A2.8.7) and (A2.8.8) as first-order conditions. In particular, if for any function, $h(y, z)$, of vectors, $y = (y_1,..,y_k)'$ and $z = (z_1,..,z_m)'$, we write the gradients of $h$ with respect to $y$ and $z$ as,

(A2.8.10)  $\nabla_y h(y, z) = \begin{pmatrix} \frac{\partial}{\partial y_1}h(y,z) \\ \vdots \\ \frac{\partial}{\partial y_k}h(y,z) \end{pmatrix} \quad\text{and}\quad \nabla_z h(y, z) = \begin{pmatrix} \frac{\partial}{\partial z_1}h(y,z) \\ \vdots \\ \frac{\partial}{\partial z_m}h(y,z) \end{pmatrix}$

then the gradients of $L$ with respect to $x$ and $\lambda$ are given respectively by

(A2.8.11)  $\nabla_x L(x, \lambda) = \nabla f(x) + \lambda\nabla g(x)$

(A2.8.12)  $\nabla_\lambda L(x, \lambda) = g(x) - \beta$
So (A2.8.7) and (A2.8.8) are seen to be precisely the first-order conditions of $L$ with respect to $(x, \lambda)$ evaluated at $(x_0, \lambda_0)$, i.e.,

(A2.8.13)  $0 = \nabla_x L(x_0, \lambda_0) = \nabla f(x_0) + \lambda_0\nabla g(x_0)$

(A2.8.14)  $0 = \nabla_\lambda L(x_0, \lambda_0) = g(x_0) - \beta$

We shall consider a general Lagrangian problem of this type below. But for the present, it is instructive to complete the solution of our particular example. First, recall from expressions (A2.8.1) and (A2.8.5) that (A2.8.9) can be written more explicitly as follows:

(A2.8.15)  $L(x, \lambda) = c + b'x + x'Ax + \lambda(d'x - \beta)$

Hence by employing the gradient identities in (A2.7.7) and (A2.7.11) together with (A2.8.11) and (A2.8.12), we see that (A2.8.13) and (A2.8.14) take the explicit form:

(A2.8.16)  $0 = \nabla_x L(x_0, \lambda_0) = b + 2Ax_0 + \lambda_0 d$

(A2.8.17)  $0 = \nabla_\lambda L(x_0, \lambda_0) = d'x_0 - \beta$

Since $A$ is nonsingular, (A2.8.16) can be solved for $x_0$ as

(A2.8.18)  $x_0 = -\tfrac{1}{2}A^{-1}(b + \lambda_0 d)$

and substitution of (A2.8.18) into (A2.8.17) then yields the solution for $\lambda_0$,

(A2.8.19)  $\lambda_0 = -\,\frac{2\beta + d'A^{-1}b}{d'A^{-1}d}$
Finally, substitution of (A2.8.19) into (A2.8.18) yields the following explicit solution for $x_0$:

(A2.8.20)  $x_0 = -\tfrac{1}{2}A^{-1}\left[b - \left(\frac{2\beta + d'A^{-1}b}{d'A^{-1}d}\right)d\right]$
Substitution of the values, $c = 20$, $b = (1, 2)'$, $\beta = 13$, $d = (5, 4)'$, together with $A$ in (A2.8.2), then yields the final solution,

(A2.8.21)  $x_0 \approx (1.27,\; 1.66)'$

as shown by the tangency point in Figure A2.13.
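These closed-form expressions are easily evaluated numerically. The following MATLAB sketch uses the example values in (A2.8.1)-(A2.8.4) and verifies that the solution satisfies the constraint:

A = [25 1; 1 15];  b = [1; 2];  d = [5; 4];  beta = 13;   % example values
lam0  = -(2*beta + d'*(A\b)) / (d'*(A\d));   % multiplier, as in (A2.8.19)
x0    = -0.5 * (A \ (b + lam0*d));           % solution, as in (A2.8.20)
check = d'*x0 - beta;                        % constraint residual (numerically zero)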
Finally, we apply these results to the case of ordinary kriging. Here we proceed in two steps. First we derive a BLU estimator for the unknown mean parameter, $\mu$, and then use this to interpret the solution to the optimal weight vector problem. Turning first to the BLU estimator for $\mu$, recall from expression (6.3.7) of the text that the optimal coefficient vector, $\hat{a}$, is given by the solution of the constrained minimization problem:

(A2.8.22)  minimize: $a'Va$  subject to: $1_n'a = 1$

But this is a special case of the constrained minimization problem in (A2.8.15) with $(c = 0,\; b = 0,\; A = V,\; \beta = 1,\; d = 1_n)$. Hence by setting $x_0 = \hat{a}$ in (A2.8.20), it follows that

(A2.8.23)  $\hat{a} = -\tfrac{1}{2}V^{-1}\left[(0) - \left(\frac{2(1) + 1_n'V^{-1}(0)}{1_n'V^{-1}1_n}\right)1_n\right] = \frac{V^{-1}1_n}{1_n'V^{-1}1_n}$

This in turn implies that the unique BLU estimator, $\hat{\mu}_n$, of $\mu$ given sample vector $Y$ is given by

(A2.8.24)  $\hat{\mu}_n = \hat{a}'Y = \frac{1_n'V^{-1}Y}{1_n'V^{-1}1_n}$
Turning next to the optimal weight vector problem for ordinary kriging [with mean squared error objective, $MSE(\lambda_0)$, as in (A2.7.19)], observe that this is again a special case of the constrained minimization problem in (A2.8.15), now with $(c = \sigma_0^2,\; b = -2c_0,\; A = V_0,\; \beta = 1,\; d = 1_{n_0})$. Hence by now setting $x_0 = \hat{\lambda}_0$ in (A2.8.20), it follows that

(A2.8.26)  $\hat{\lambda}_0 = -\tfrac{1}{2}V_0^{-1}\left[-2c_0 - \left(\frac{2 - 2\cdot 1_{n_0}'V_0^{-1}c_0}{1_{n_0}'V_0^{-1}1_{n_0}}\right)1_{n_0}\right] = \left(\frac{1 - 1_{n_0}'V_0^{-1}c_0}{1_{n_0}'V_0^{-1}1_{n_0}}\right)V_0^{-1}1_{n_0} + V_0^{-1}c_0$

so that the resulting prediction, $\hat{Y}_0$, is given by

(A2.8.27)  $\hat{Y}_0 = \hat{\lambda}_0'Y = \left(\frac{1 - 1_{n_0}'V_0^{-1}c_0}{1_{n_0}'V_0^{-1}1_{n_0}}\right)1_{n_0}'V_0^{-1}Y + c_0'V_0^{-1}Y$

For purposes of interpreting this expression, observe that since $1_{n_0}'V_0^{-1}c_0 = c_0'V_0^{-1}1_{n_0}$, we may rewrite (A2.8.27) as

(A2.8.28)  $\hat{Y}_0 = \hat{\mu}_{n_0} + c_0'V_0^{-1}(Y - \hat{\mu}_{n_0}1_{n_0})\,, \qquad \hat{\mu}_{n_0} = \frac{1_{n_0}'V_0^{-1}Y}{1_{n_0}'V_0^{-1}1_{n_0}}$

so that ordinary kriging is seen to be precisely simple kriging with the unknown mean, $\mu$, replaced by its BLU estimate, $\hat{\mu}_{n_0}$.
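In computational terms, both the BLU mean estimate in (A2.8.24) and the ordinary kriging predictor in (A2.8.28) can be evaluated directly; a minimal MATLAB sketch with hypothetical inputs is:

% Hypothetical covariance structure and data:
V0 = [1 .5 .2; .5 1 .5; .2 .5 1];   % covariance matrix of prediction-set data
c0 = [.4; .6; .3];                  % covariances with the prediction point
Y  = [2.1; 1.7; 2.4];               % observed values
one   = ones(length(Y),1);
mu    = (one' * (V0\Y)) / (one' * (V0\one));   % BLU mean estimate (A2.8.24)
Y0hat = mu + c0' * (V0 \ (Y - mu*one));        % kriging predictor (A2.8.28)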
Given the results above for a single constraint, we now proceed to the case of multiple constraints. For purposes of illustration, we begin with the case of two (linear) constraints on functions of three variables, $f(x) = f(x_1, x_2, x_3)$, where it is still possible to obtain some geometric intuition. As an extension of (A2.8.6) we thus consider the following constrained minimization problem:

(A2.8.29)  minimize: $f(x)$  subject to: $g_1(x) = \beta_1\,,\;\; g_2(x) = \beta_2$
[Figure A2.14. Two linear constraint planes, $g_1(x) = \beta_1$ and $g_2(x) = \beta_2$, in $\mathbb{R}^3$, intersecting in a one-dimensional constraint space containing $x_0$, with the contour surface $f(x) = f_0$ tangent at $x_0$ and negative gradient, $-\nabla f(x_0)$, shown in red]

[Figure A2.15. Decomposition of the negative gradient, $-\nabla f(x_0) = \lambda_{01}\nabla g_1(x_0) + \lambda_{02}\nabla g_2(x_0)$]
________________________________________________________________________
ESE 502 A2-37 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part II. Continuous Spatial Data Analysis
______________________________________________________________________________________
To compare these figures with the single-constraint case above, we start by restricting attention to the $x$-space in Figure A2.12, i.e., the $(x_1, x_2)$-plane. Recall that the single linear constraint corresponds to the blue line in this plane, and the critical tangency condition for a minimum is shown in terms of the contour representation of $f(x)$ on this plane. The situation in Figure A2.14 is conceptually the same, except that the $x$-space is now three-dimensional. Here the two linear constraints, $g_1(x) = \beta_1$ and $g_2(x) = \beta_2$, are shown, respectively, by the blue and black planes in this space. Note that these planes constitute constant-value contour surfaces for the functions $g_1$ and $g_2$. Hence, like Figure A2.12, the constraint space defined by the intersection of these two planes is again one-dimensional, as shown by the heavy blue line. With respect to the objective function, $f(x) = f(x_1, x_2, x_3)$, constant-value contour surfaces in this space are curvilinear. Hence for visual clarity, only the single contour surface, $f(x) = f(x_0) = f_0$, tangent to the constraint space at point $x_0$ is shown. As in Figure A2.13, the negative gradient vector, $-\nabla f(x_0)$, at $x_0$ must be orthogonal to the constraint space, as shown by the red arrows in both Figures A2.13 and A2.14. So the tangency conditions in these two cases are seen to be conceptually the same.

Turning next to the relation between this gradient vector and those for the constraints, recall that in Figure A2.13 the single gradient vector, $\nabla g(x_0)$, was also orthogonal to the constraint space as defined by a constant-value contour of $g$. Moreover, since all vectors orthogonal to this constraint line at $x_0$ must necessarily be collinear, this in turn implied that $-\nabla f(x_0)$ must be a scalar multiple of $\nabla g(x_0)$. But in higher dimensions this is no longer true. In the present case, the set of vectors orthogonal to the blue line at $x_0$ must define a plane (not shown), which is called the orthogonal complement of this line at $x_0$. So all that can be said is that these three gradient vectors, $-\nabla f(x_0)$, $\nabla g_1(x_0)$ and $\nabla g_2(x_0)$, must all lie in this plane. But assuming that the two constraint planes [$g_1(x) = \beta_1$ and $g_2(x) = \beta_2$] have a well-defined linear intersection (and hence are not parallel), it follows that $\nabla g_1(x_0)$ and $\nabla g_2(x_0)$ cannot themselves be collinear. Hence they must span this plane, which means that every vector in the plane can be written as a unique linear combination of $\nabla g_1(x_0)$ and $\nabla g_2(x_0)$. In particular, this implies that for the negative gradient vector, $-\nabla f(x_0)$, there must exist unique scalars, $\lambda_{01}$ and $\lambda_{02}$, such that $-\nabla f(x_0) = \lambda_{01}\nabla g_1(x_0) + \lambda_{02}\nabla g_2(x_0)$, or equivalently,

(A2.8.30)  $\nabla f(x_0) + \lambda_{01}\nabla g_1(x_0) + \lambda_{02}\nabla g_2(x_0) = 0$

as shown in Figure A2.15. This is the fundamental constrained-gradient condition that generalizes (A2.8.7) for the single-constraint case. Hence, as an extension of (A2.8.9), if we now consider the Lagrangian function:

(A2.8.31)  $L(x, \lambda_1, \lambda_2) = f(x) + \lambda_1[g_1(x) - \beta_1] + \lambda_2[g_2(x) - \beta_2]$

with first-order conditions,
(A2.8.32)  $0 = \nabla_x L(x_0, \lambda_{01}, \lambda_{02}) = \nabla f(x_0) + \lambda_{01}\nabla g_1(x_0) + \lambda_{02}\nabla g_2(x_0)$

(A2.8.33)  $0 = \nabla_{\lambda_1} L(x_0, \lambda_{01}, \lambda_{02}) = g_1(x_0) - \beta_1$

(A2.8.34)  $0 = \nabla_{\lambda_2} L(x_0, \lambda_{01}, \lambda_{02}) = g_2(x_0) - \beta_2$

then it is clear that the minimum for this function satisfies both the constrained-gradient condition in (A2.8.30) together with the two constraints in (A2.8.29).
More generally, for a problem with k such constraints, we may summarize the constraint functions by the vector function,

(A2.8.35)   G(x) = [g1(x) ,.., gk(x)]′ ,   x = (x1 ,.., xn) ∈ ℝⁿ

so that the minimization problem takes the form,

(A2.8.36)   min f(x)  subject to  G(x) = b = (b1 ,.., bk)′

Then letting λ = (λ1 ,.., λk)′ denote a vector of Lagrange multipliers, we may again form the corresponding Lagrangian function,

(A2.8.37)   L(x, λ) = f(x) + Σ_{j=1}^k λj [gj(x) − bj] = f(x) + λ′[G(x) − b]

In these terms, the first-order conditions for a minimum at (x0, λ0) are given by

(A2.8.38)   0 = ∇x L(x0, λ0) = ∇f(x0) + Σ_{j=1}^k λ0j ∇gj(x0)
              = ∇f(x0) + [∇g1(x0) ,.., ∇gk(x0)] (λ01 ,.., λ0k)′
              = ∇f(x0) + ∇G(x0)′ λ0
and,

(A2.8.39)   0 = ∇λ L(x0, λ0) = G(x0) − b
In terms of Figures A2.14 and A2.15, condition (A2.8.38) again reflects the constrained gradient condition that the negative gradient, −∇f(x0), be a linear combination of the constraint gradients. As a generalization of the constraint space in these figures (with dimension 3 − 2 = 1), it is implicitly assumed here that the relevant constraint set (i.e., the intersection of k constraint surfaces) is a well-defined surface of dimension n − k, so that the orthogonal complement to this surface at x0 has dimension k. This is equivalent to assuming that the constraint gradients are linearly independent. If so, then they must span this complement, so that (A2.8.38) must hold for some unique vector of multipliers, λ0 = (λ01 ,.., λ0k)′.
Our objective is to apply this general formulation to the case of quadratic objective functions, f(x) = x′Ax − 2b′x, subject to k linear constraints, dj′x = cj, j = 1,.., k, which in matrix form become

(A2.8.41)   Dx = (d1′x ,.., dk′x)′ = (c1 ,.., ck)′ = c

where the above constrained gradient condition is guaranteed to hold as long as these k constraints are linearly independent (i.e., D is of full row rank, k). Here the minimization problem in (A2.8.36) takes the form:

min_x  x′Ax − 2b′x  subject to  Dx = c

with corresponding Lagrangian function,

L(x, λ) = x′Ax − 2b′x + λ′(Dx − c)
Assuming that A is symmetric positive definite, this problem always has a unique solution, (x0, λ0), which is characterized by the first-order conditions,

(A2.8.44)   0 = ∇x L(x0, λ0) = 2Ax0 − 2b + D′λ0

(A2.8.45)   0 = ∇λ L(x0, λ0) = Dx0 − c
which are seen to reduce precisely to (A2.8.16) and (A2.8.17) for the case of a single constraint. Hence the solution is quite similar. Again we start by using the nonsingularity of A to solve for x0 in (A2.8.44) as

(A2.8.46)   x0 = A⁻¹(b − ½ D′λ0)

Substituting this expression into (A2.8.45) and solving for λ0 then yields,

(A2.8.47)   λ0 = (DA⁻¹D′)⁻¹(2DA⁻¹b − 2c)

Substitution of (A2.8.47) into (A2.8.46) then yields the following solution for x0:

(A2.8.48)   x0 = A⁻¹b − A⁻¹D′(DA⁻¹D′)⁻¹(DA⁻¹b − c)
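As a quick numerical check of these closed-form solutions, the following MATLAB fragment solves a small instance of this quadratic problem directly from (A2.8.46) and (A2.8.47); the particular values of A, b, D and c are arbitrary illustrative choices, not from the text:

% Check of (A2.8.46)-(A2.8.48):  min x'Ax - 2b'x  subject to  Dx = c
A = [2 0 0; 0 3 0; 0 0 5];              % symmetric positive definite
b = [1; 2; 3];
D = [1 1 0; 0 1 1];  c = [1; 2];        % two linearly independent constraints
lam0 = (D*(A\D')) \ (2*D*(A\b) - 2*c);  % multipliers, as in (A2.8.47)
x0   = A \ (b - (D'*lam0)/2);           % solution, as in (A2.8.46)
disp(norm(D*x0 - c))                    % constraint check: should be ~0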
We now apply these results to the case of Universal Kriging. As with Ordinary Kriging above, we proceed in two steps. Given the linear model

(A2.8.49)   Y = Xβ + ε ,   ε ~ N(0, V)

we first determine the unique BLU estimator of β, and then use this to interpret the solution of the optimal weight vector problem. But in this case, the first step is of major interest in itself, and in fact yields an important characterization of Generalized Least Squares estimation.
Here we proceed to show that the GLS estimator for β as developed in Section 7.1.2 of the text is a BLU estimator as defined there. Moreover, since this argument is required to hold for all possible linear compounds, a′β, with a ∈ ℝ^(k+1), it suffices to pick a representative compound, a′β, and consider the problem of finding that estimator of a′β in the set of linear unbiased estimators,

(A2.8.50)   LU_a(β) = { β̃ = β̃(X, V, Y) : [β̃ = λ′Y] & [E(β̃) = a′β] }

with smallest variance. The solution to this problem will show that this estimator is always given by the GLS estimator,

(A2.8.51)   β̂ = (X′V⁻¹X)⁻¹X′V⁻¹Y
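In computational terms, this GLS estimator is easily evaluated. A minimal MATLAB sketch (with illustrative variable names, and avoiding explicit inversion of V) might read:

% GLS estimator (A2.8.51):  betahat = (X'V^(-1)X)^(-1) X'V^(-1)Y
XtVi    = X' / V;                   % computes X'V^(-1) by solving, not inverting
betahat = (XtVi*X) \ (XtVi*Y);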
Note first that unbiasedness of such a linear estimator, β̃ = λ′Y, requires that E(λ′Y) = λ′Xβ = a′β. But this can only hold for all possible values of β if λ′X = a′, or equivalently,

(A2.8.54)   X′λ = a

Moreover, since var(λ′Y) = λ′Vλ, it follows that the weight vector, λ, of the desired BLU estimator must solve the constrained minimization problem:

(A2.8.57)   min_λ  λ′Vλ  subject to  X′λ = a

But this is precisely an instance of the quadratic problem above [with x = λ, A = V, b = 0, D = X′ and c = a], so that the optimal weight vector is given by

λ0 = −½ V⁻¹X(X′V⁻¹X)⁻¹(0 − 2a) = V⁻¹X(X′V⁻¹X)⁻¹a
and hence that the corresponding linear estimator in (A2.8.50), say β̃_a = λ0′Y, satisfies

(A2.8.58)   β̃_a = λ0′Y = a′(X′V⁻¹X)⁻¹X′V⁻¹Y = a′β̂

Finally, since this holds identically for all linear compounds, a′β, we see that the unique estimator satisfying all these conditions is given precisely by the GLS estimator. To make this precise, observe that by setting a equal to the i-th column, ei, of I_(k+1) for each i = 1,.., k+1 [as in (3.2.16) of the text], it must follow from (A2.8.58) that

(A2.8.59)   β̃_{ei} = ei′β̂ = β̂i ,   i = 1,.., k+1
16 Our present approach is based on the development in Searle (1971, Section 3.3.d).
and hence that all components of β̂ are uniquely identified by these particular choices of a.
Finally, it should be noted that this result is usually referred to as the Gauss-Markov
Theorem in the literature.17 The above constrained minimization approach thus yields a
constructive proof of this theorem.
Next we derive the solution of the constrained minimization problem for Universal Kriging in expression (7.2.12) of the text. Applying the same general results [now with A = V0, b = c0, D = X0′ and c = x0], the optimal weight vector, λ̂0, yields the Universal Kriging predictor:

(A2.8.62)   Ŷ0 = λ̂0′Y = (x0 − X0′V0⁻¹c0)′(X0′V0⁻¹X0)⁻¹X0′V0⁻¹Y + c0′V0⁻¹Y
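A direct MATLAB transcription of this predictor might take the following form, where V0, c0, X0, x0 and Y are assumed to be given in the workspace with dimensions as in the text:

% Universal Kriging predictor (A2.8.62)
ViX   = V0 \ X0;                          % V0^(-1) X0
Vic0  = V0 \ c0;                          % V0^(-1) c0
del0  = (X0'*ViX) \ (x0 - X0'*Vic0);      % (X0'V0^(-1)X0)^(-1)(x0 - X0'V0^(-1)c0)
lam0  = ViX*del0 + Vic0;                  % optimal weight vector
Yhat0 = lam0' * Y;                        % predicted value at the new site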
Finally, to determine the prediction error variance for Universal Kriging, one must substitute λ̂0 into the general expression for the prediction error variance [as given by the objective function in (A2.8.60)], to obtain:

(A2.8.63)   σ̂0² = σ² − 2 λ̂0′c0 + λ̂0′V0 λ̂0

where, as in (A2.8.62), λ̂0 = V0⁻¹X0 δ0 + V0⁻¹c0 with δ0 = (X0′V0⁻¹X0)⁻¹(x0 − X0′V0⁻¹c0).

17 See for example Section 4.4 in Greene (2003).
In particular, the last term in (A2.8.63) can be expanded as

λ̂0′V0 λ̂0 = (V0⁻¹X0 δ0 + V0⁻¹c0)′(X0 δ0 + c0)
          = δ0′(X0′V0⁻¹X0) δ0 + δ0′X0′V0⁻¹c0 + c0′V0⁻¹X0 δ0 + c0′V0⁻¹c0

But since the two center terms are the same, and since (X0′V0⁻¹X0) δ0 = x0 − X0′V0⁻¹c0 by the definition of δ0, we see that,

(A2.8.66)   λ̂0′V0 λ̂0 = δ0′x0 + δ0′X0′V0⁻¹c0 + c0′V0⁻¹c0

and similarly that,

(A2.8.67)   λ̂0′c0 = δ0′X0′V0⁻¹c0 + c0′V0⁻¹c0

Finally, by substituting (A2.8.66) and (A2.8.67) into (A2.8.63) and cancelling terms, we obtain an explicit expression for the prediction error variance:

σ̂0² = σ² − c0′V0⁻¹c0 + (x0 − X0′V0⁻¹c0)′(X0′V0⁻¹X0)⁻¹(x0 − X0′V0⁻¹c0)

In particular, for the case of Ordinary Kriging [where X0 = 1n0 and x0 = 1], this reduces to
(A2.8.70)   σ̂0² = σ² − c0′V0⁻¹c0 + (1 − 1n0′V0⁻¹c0)′(1n0′V0⁻¹1n0)⁻¹(1 − 1n0′V0⁻¹c0)

                 = σ² − c0′V0⁻¹c0 + (1 − 1n0′V0⁻¹c0)² / (1n0′V0⁻¹1n0)
AREAL DATA ANALYSIS
The key difference between areal data and continuous data is basically in terms of the
form of the data itself. While continuous data involves point samples from a continuous
spatial distribution (such as temperature readings at various point locations), areal data
involves aggregated quantities for each areal unit within some relevant spatial partition
of a given region (such as census tracts within a city, or counties within a state). Such
differences are illustrated in Figures 1.1 and 1.2 below.
[Figure 1.1. Sample points s1,..,s4 in region R    Figure 1.2. Partition of R into areal units R1,..,R4 with central locations c1,..,c4]
Here Figure 1.1 shows four sample points, si, in region R (which is qualitatively the same as Figure 1.1 of Part II), and Figure 1.2 represents a partition of region R into four areal units {R1, R2, R3, R4}. Such areal units, Ri, are often represented by appropriate central locations, ci ∈ Ri, such as major cities, or geometric "centroids" (to be defined
below).1 But the data values associated with these points represent summary measures for
the areal unit as a whole. For example, rather than measuring the temperature at location,
ci , one could assign (an estimate of) the average temperature over all points in areal unit
Ri . More importantly, one can represent values that have no particular point locations at
all, such as the population of Ri or the average income of all household units in Ri .
The practical significance of areal data for purposes of analysis is that most socio-economic data comes in this form. For example, while individual income data is generally regarded as proprietary in nature, such data is often made publicly available in terms of averages (such as per capita income at the state or county level). More generally, most publicly available data (such as US Census data) is only of this type.2
1 This type of representation in terms of point locations has led to the alternative description of areal data as "lattice data", as for example in Cressie (1993, Section 6.1).
2 There are exceptions however, such as the Center for Economic Studies (CES) run by the Census Bureau, which allows restricted access to individual micro data by qualified researchers.
As in Parts I and II above, it is appropriate to illustrate some of the key features of areal
data in terms of specific examples (again drawn from [BG, Part D]).
Areal data is most easily represented visually in terms of choropleth maps. A good example is provided by the child mortality data for each of the 167 Census Districts in the city of Auckland, New Zealand over the nine-year period from 1977 to 1985 (taken from [BG, pp.249, 300-303]). Here we focus only on the populations "at risk", i.e., children under the age of 5 in
each district. Two possible representations of this data are shown in Figures 1.3 and 1.4
below.
[Figure 1.3. Raw Population Data    Figure 1.4. Population Density Data]
The representation in Figure 1.3, which shows the actual number of children under 5 in
each district, appears to suggest that the most substantial concentration of these children
lies in districts to the southeast of the Central Business District (CBD). But it is important
to note that census districts are specifically designed to include roughly the same
population totals in each district. So the smaller districts around the CBD indicate that
population densities are much higher in this area.
An alternative representation of this population is given in Figure 1.4, which displays the density of such children in each district, i.e., the number of children per square mile (approximately). Here it is clear that the densest concentrations are precisely in the smallest
districts, including the CBD. So this representation suggests (not surprisingly) that
children under five are in fact quite evenly spread throughout the population as a whole.
This example serves to underscore the fact that the distribution of areal data is usually more accurately represented in terms of density values. More generally, representations in terms of actual data totals (such as population counts) are designated as extensive representations of areal data, and representations in terms of densities (such as population densities) are designated as intensive representations of areal data. The key difference is that intensive representations allow more direct comparisons between values in each areal unit. For example, "population per square mile" has the same meaning everywhere, and is independent of the size of each areal unit.3
The above example also demonstrates that when intensive data representations are used, choropleth maps can serve to reveal meaningful patterns in areal data. In this case, children under five (and indeed all people) are more concentrated around the CBD than in outlying areas. But there are many more interesting pattern examples than this. One striking example is provided by the per capita GDP levels in China for 1984 and 1994, shown in Figures 1.5 and 1.6 below.4

[Figure 1.5. 1984 Per Capita GDP    Figure 1.6. 1994 Per Capita GDP]
Here it is clear at a glance that the coastal region of China has been the high-growth area.
Statistical analysis can of course be applied to confirm this. But the key point here is that
visual pattern analysis is a powerful heuristic tool for discerning relations that may not
be immediately evident in the data itself.
3 For a more detailed discussion of intensive versus extensive data representations, see the classic paper by Goodchild and Lam (1980).
4 To allow direct comparison, data on both maps has been normalized to have unit maximum values.
A second example is provided by the Irish blood group data from [BG, p.253] for the 26
counties of Eire. From an historical perspective, there is strong reason to believe that the
Anglo-Norman colonization of Ireland in the 12th Century had a lasting effect on the
population composition. Figure 1.7 below shows the estimated proportion of adults in
each county with blood group A in 1958 (where values increase from blue to red). Figure
1.8 shows the original colonized area of Eire, known as the “Pale”. Since blood group A
is much more common among individuals with Anglo-Norman heritage, a visual
comparison of these two figures strongly suggests a continued pattern of Anglo-Norman
influence in the region around the Pale. We shall later confirm these findings with spatial
regression.
[Figure 1.7. Blood Group A Percentages    Figure 1.8. Counties in the Pale]
A final example of areal data is provided by the English Mortality data from [BG,
pp.252-253]. Here the areal units are the 190 Health Authority Districts throughout
England, and the data used involve deaths from myocardial infarctions among males (35-
64), as shown in Figure 1.9 below. This data is in standardized rates, defined here to be
the number of deaths in the period 1984-1989 divided by the expected number of deaths
during that period based on national averages. Such standardized rates are quite typical
for medical data, and help to identify those areas where death rates are much higher than
expected relative to national averages. In particular, the darkest areas on the map indicate
rates well above average. While such higher rates may be influenced by many factors, the
present analysis focuses on aspects of “social deprivation” as summarized by the “Jarman
underprivileged areas score”, or Jarman score (which is a weighted average of factors
including levels of unemployment and overcrowding). This measure for each Health
Authority District is shown in Figure 1.10, where darker areas here show higher levels of
“social deprivation”.
[Figure 1.9. Standardized MI Death Rates    Figure 1.10. Jarman Scores]
A visual comparison of these two maps suggests that there may indeed be some positive
correlation between these two patterns, especially in Northern England where the highest
levels of both death rates and social deprivation seem to occur.
However, it is also clear from Figures 1.9 and 1.10 above that both MI rates and Jarman scores are strongly correlated among neighboring districts. Moreover, since it is highly unlikely that the correlations among Jarman scores could completely account for those among MI rates, one can expect there to be strong spatial autocorrelation among the regression residuals. This is confirmed by the simple nearest-neighbor analysis of these regression residuals shown in Figure 1.12 (where nearest neighbors are here defined with respect to centroid distances between districts). In fact the correlation among these residuals is even stronger than that between lnMI and lnJARMAN (as can be seen by comparing the t-ratios of Res_NN versus lnJARMAN). While much of this residual
correlation could in principle be removed by including a range of other relevant
explanatory variables, it is quite apparent from Figure 1.9 that significant autocorrelation
will remain.
With these observations, our ultimate objective is to extend this simple nearest-neighbor
analysis to a broader and more rigorous framework for spatial autocorrelation analyses of
areal data. But to do so, we must first address the difficult issue of defining appropriate
measures of “distance” between areal units.
Aside from data aggregations, the second major difference between continuous and areal
data models concerns the representation of spatial structure itself. In particular, while
“distance between points” for any given units of measure (straight-line distance, travel
distance, travel time, etc.) is fairly unambiguous, the same is not true for “distance
between areal units”. As mentioned above, the standard convention here is to identify
representative points for areal units, the most typical being areal centroids (as defined
formally below). In fact, these centroids serve as the default option in ARCMAP for
constructing such representative points [refer to Section 1.2.9 in Part IV of this
NOTEBOOK]. But in spite of the fact that these points constitute the so-called
“geometric centers” of each areal unit, they can sometimes be quite misleading in terms
of distance relations between areal units.
An example is given in Figure 1.13 below, which involves three areal units, R1 , R2 , and
R3 . Here it might be argued that since units R2 and R3 are spatially separated, but are
each adjacent to R1 , they are both “closer” to (or exhibit a “stronger tie” to) unit R1 than
to each other. However, the centroids of these three units, shown by the black dots in
Figure 1.14, are equidistant from one another. Thus all of these spatial relations are lost
when “closeness” is summarized by centroid distances.
[Figure 1.13. Three areal units R1, R2, R3    Figure 1.14. Their centroids c1, c2, c3]
In particular, this suggests that the shapes of areal units also contain important
information about their relative proximities, even though they are much more difficult to
quantify. We shall return to this question below.
In addition to these geometric issues, there are other non-spatial properties of areal units
that influence their “closeness” in terms of human interactions. For example, it is often
observed that the opposite coasts in the US are relatively “close” to one another in terms
of human interactions (such as phone calls or emails). More generally, there tends to be
more interaction between states with large cities (such as those shown in Figure 1.15)
than would be expected on the basis of their separation in geographical space. For
example, such cities tend to contain relatively large professional populations conducting
business between cities.
[Figure 1.15. States with Large Cities]
But while such socio-economic linkages between areal units may indeed be relevant for many applications, we shall restrict our present analysis to purely geometric notions of "closeness". The main justification for this is that we are primarily interested in modeling unobserved residual effects in regression models involving areal units. So these measures of closeness are designed solely to capture possible spatial autocorrelation effects. Indeed, it can be argued that potentially relevant socio-economic interactions between units (such as communication and travel flows) should be part of the model, and not the residuals.
To model spatial relations between areal units, we now let n denote the number of units to be considered, so that the region of interest, say R = the continental US, is partitioned into areal units, R = {Ri : i = 1,.., n}, say the n = 48 states in R (as in Figure 1.15 above). Our basic hypothesis is that the relevant degree of "closeness" or "proximity" of each areal unit Rj to unit Ri (or alternatively, the "spatial influence" of Rj on Ri) can be represented by a numerical weight, wij ≥ 0, where higher values of wij denote higher levels of proximity or spatial influence. Under this hypothesis, the full set of such spatial relations can be represented by a single nonnegative weight matrix:

(2.1.1)   W = [ w11 ⋯ w1n ]
              [  ⋮   ⋱  ⋮ ]
              [ wn1 ⋯ wnn ]
Notice in particular that while the distance between a point and itself is naturally zero, this need not be true for areal units. For example, if wij were to represent the average distance between all cities in states i and j (possibly weighted by population sizes), then since the average distance between cities within each state i is certainly positive, one
must have wii > 0 for all i = 1,.., n. So in general, any nonnegative matrix can be a spatial weights matrix.
However, certain special structural properties of such matrices are quite common. For example, if distance itself is measured symmetrically, i.e., if d(x, y) = d(y, x) for all locations x and y (as with Euclidean distance), then weight measures such as the average distance between cities in states i and j will also be symmetric, i.e., wij = wji. So, much like covariance matrices, many spatial weights matrices will be symmetric matrices. Moreover, while diagonal weights, wii, can in principle be positive (as in the city example above), it will often be convenient for analysis to set wii = 0 for all i = 1,.., n. In particular, when wij is taken to reflect some notion of the "spatial influence" of unit j on unit i, then we set wii = 0 in order to avoid "self-influence". This will become clearer in the development of spatial autocorrelation models in Section 3 below.
Many of the most common spatial weights are based on distances between point
representations of areal units. So before developing these weight functions, it is
convenient to begin with a more detailed consideration of point representations
themselves.
Spatial Medians

If for any given region, R, we denote its area by

(2.1.2)   a(R) = ∫_R dx ,

then the spatial median, c, of R (with respect to Euclidean distance) is given by the solution to

(2.1.3)   min_c  (1/a(R)) ∫_R ‖x − c‖ dx
But while this point is well defined and is easily shown (from the convexity properties of
this programming problem) to be unique, it is not identifiable in closed form. Even if R
is approximated by a finite grid of points, the solution algorithms for determining spatial
1 Here we ignore other possible reference points (such as the capital cities of states or countries) that might be relevant in certain applications.
medians are computationally intensive. For this reason, we shall not use spatial medians
for reference points. However, it is still of interest to note that if R were approximated by
some finite grid of points, Rn {xi : i 1,.., n} , (say the set of raster pixels inside an
ARCMAP representation of R ), then the spatial median of this set, Rn , can in fact be
calculated in ARCMAP using the ArcToolbox command: Spatial Statistics Tools >
Measuring Geographic Distributions > Median Center.
Spatial Centroids
But in view of these computational complexities, a far more popular choice is the spatial
“centroid” of R , which minimizes the average squared distance to all points in R . More
formally, the centroid, c, of R is given by the solution to:
(2.1.4)   min_c  (1/a(R)) ∫_R ‖x − c‖² dx
The advantage of using squared distances is that this minimization problem is actually solvable in closed form. In particular, by recalling that ‖x − c‖² = x′x − 2x′c + c′c, and that the minimum of (2.1.4) is given by its first-order conditions [as for example in Section A2.7 of the Appendix to Part II], we see in particular that

0 = ∫_R (2x − 2c) dx  ⟹  ∫_R x dx = ∫_R c dx = c·a(R)

and hence that

(2.1.5)   c = (1/a(R)) ∫_R x dx
which is simply the average over all locations, x ∈ R. So the coordinate values of c = (c1, c2) are precisely the average values of the coordinates, x = (x1, x2), over R. In more practical terms, if one were to approximate R by a finite grid of points, Rn = {xi : i = 1,.., n}, in R as mentioned above for spatial medians, then the centroid coordinates, c = (c1, c2), are well approximated by

(2.1.6)   ci = (1/n) Σ_{x∈Rn} xi ,   i = 1, 2
For this reason, the centroid of R is also called the spatial mean of R . Such spatial means
can be calculated (for finite sets of points) in ARCMAP using the ArcToolbox
command: Spatial Statistics Tools > Measuring Geographic Distributions > Mean
Center.
Computation of Centroids
But while this view of centroids is conceptually very simple and intuitive, there is in fact
a much more efficient and exact way to calculate centroids in ARCMAP. In particular,
since areal units R are defined as polygon features with finite sets of vertices (in a
manner paralleling the matrix representations of polygon boundaries in MATLAB
discussed in Section 3.5 of Part I), one can actually calculate the exact centroids of these
polygons with rather simple geometric formulas. Since the derivation of these formulas is
well beyond the scope of these notes, we simply record them for completeness.2 If we proceed in a clockwise direction around a given polygon, R, and denote its vertex points by (x1i, x2i), i = 1,.., n [where by definition, (x1n, x2n) = (x11, x21)], then the area of R is given by

(2.1.7)   a(R) = ½ Σ_{i=1}^{n−1} (x1,i+1 x2i − x1i x2,i+1)

and the corresponding centroid coordinates, c = (c1, c2), by

(2.1.8)   cj = [1/(6·a(R))] Σ_{i=1}^{n−1} (xj,i+1 + xji)(x1,i+1 x2i − x1i x2,i+1) ,   j = 1, 2
These are precisely the same formulas used for calculating areas and centroids in the
“Calculate Geometry” option in ARCMAP, using the procedures outlined in Sections
1.2.8 and 1.2.9 of Part IV of this NOTEBOOK.
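For completeness, a minimal MATLAB sketch of these formulas is given below (the function name is ours); it assumes the vertices are listed clockwise in an n x 2 matrix P, with the last row repeating the first, as in (2.1.7) and (2.1.8):

function [a, c] = poly_centroid(P)
% Area and centroid of a closed polygon via (2.1.7)-(2.1.8).
% P = n x 2 matrix of clockwise vertices with P(n,:) = P(1,:).
x1i = P(1:end-1,1);  x2i = P(1:end-1,2);   % vertices i = 1,..,n-1
x1p = P(2:end,1);    x2p = P(2:end,2);     % vertices i+1
s = x1p.*x2i - x1i.*x2p;                   % x1,i+1*x2i - x1i*x2,i+1
a = sum(s)/2;                              % area, as in (2.1.7)
c = [sum((x1p+x1i).*s), sum((x2p+x2i).*s)]/(6*a);   % centroid, as in (2.1.8)
end

For example, the clockwise unit square P = [0 0; 0 1; 1 1; 1 0; 0 0] yields a = 1 and c = (0.5, 0.5).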
Displaying Centroids
One can display these centroids in ARCMAP by opening the Attribute Table containing
the centroids calculated above and using Table Options > Export… to save this table as
2 Full derivations of (2.1.7) and (2.1.8) require an application of Green's Theorem, and are given in expressions (31), (33) and (34) of Steger (1996). Here it should be noted that the signs in Steger are reversed, since it is there assumed that vertices proceed in a counterclockwise direction.
3 A more general MATLAB program of this type can be downloaded at the web site: http://www.mathworks.com/matlabcentral/fileexchange/319-polygeom-m.
say centroids.dbf. When prompted to add this data to the existing map, click OK. If you right click on this new entry in the Table of Contents and select Display XY Data, then the centroids will now appear on the map. If you wish to save these centroids, right click on the new "centroid events" entry in the Table of Contents and use Data > Export Data. Finally, if you save this layer to the map as centroids.shp, then you can edit this copy as a permanent file. This procedure was carried out for the Eire map in Figure 1.7 above, and is shown in Figure 1.16 below.
[Figure 1.16. County Centroids of Eire]
While we shall implicitly assume that point representations, ci, of areal units, Ri, are based on centroids, the following definitions hold intact for any relevant sets of points (such as state capitals or county seats). Moreover, while centroid distances, dij = d(ci, cj), are implicitly assumed to be Euclidean distances, ‖ci − cj‖, the present definitions of spatial weights are readily extendable to other relevant notions of distance (such as travel distance or travel time). But it should also be stressed that our present conventions are in fact used in most areal data analyses. The following examples of spatial weights based on centroid distances extend the list given in [BG, p.261].
k-Nearest-Neighbor Weights
Recall from Section 3.2 in Part I that the nearest-neighbor distances defined within and
between point patterns are readily extendable to centroid distances. However, such
distance relations can be very restrictive for modeling spatial relations between areal
units. This is again well illustrated by the Eire example above, where the neighbors of
Laoghis county are shown in Figure 1.18 below.
[Figure 1.18. Laoghis County and its Neighbors]
Here it turns out that the nearest neighbor to Laoghis county in centroid distance is Offaly county to the north (shown by the red arrow). But it is clear that the neighbors adjacent to Laoghis county in all other directions may be of equal importance in terms of spatial relations. We shall be more explicit about such adjacency relations below. But in the present case, it is clear that we can achieve the same effect by considering the five nearest neighbors to this county.
So to formalize such multiple-neighbor relations, let the centroid distances from each areal unit i to all other units j ≠ i be ranked as follows: dij(1) ≤ dij(2) ≤ ⋯ ≤ dij(n−1). Then for each k = 1,.., n−1, the set Nk(i) = {j(1), j(2),.., j(k)} contains the k areal units closest to i (where for simplicity we ignore ties). For each given value of k, the k-nearest-neighbor weight matrix, W, is then defined to have spatial weights of the form:

(2.1.9)   wij = 1 ,  j ∈ Nk(i)
              = 0 ,  otherwise
Note in particular that the values of wij for the k-nearest neighbors of i are higher than
for other areal units, signifying that these neighbors are deemed to have greater proximity
to i (or greater spatial influence on i ) than other spatial units. Similar conventions will
be used for all weights discussed below. Note also that the common value of these
weights implicitly assumes that levels of proximity or influence are the same for all k-
nearest neighbors. This constancy assumption will be relaxed for other types of spatial
weights.
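Though the class program dist_wts.m discussed later in this section automates such constructions, definition (2.1.9) is simple enough to sketch directly in MATLAB (this sketch ignores ties, as in the text, and uses implicit expansion, so it assumes MATLAB R2016b or later):

% k-nearest-neighbor weights (2.1.9) from an n x 2 centroid matrix L
n = size(L,1);  k = 5;
D = sqrt((L(:,1) - L(:,1)').^2 + (L(:,2) - L(:,2)').^2);  % centroid distances
D(1:n+1:end) = inf;             % exclude each unit from its own neighbor set
[~, ord] = sort(D, 2);          % ord(i,1:k) indexes the k nearest units to i
W = sparse(n, n);
for i = 1:n
    W(i, ord(i,1:k)) = 1;       % wij = 1 for j in N_k(i), as in (2.1.9)
end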
Before proceeding to other weighting schemes, it is also important to note that such
nearest-neighbor relations are generally asymmetric in nature. For if j is a k-nearest
neighbor of i , then it need not be true that i is a k-nearest neighbor of j , i.e., one may
have wij w ji . As seen in Figure 1.19 below, this is true even for k 1 , where R2 is the
nearest neighbor of R1 , but R3 is the nearest neighbor of R2 :
[Figure 1.19. R1, R2, R3 arranged in a row, with centroids c1, c2, c3]
In cases where such symmetry is desired, these relations can be symmetrized by requiring only that one of the two units be a k-nearest neighbor of the other, yielding symmetric k-nearest-neighbor weights of the form:

(2.1.10)   wij = 1 ,  j ∈ Nk(i) or i ∈ Nk(j)
               = 0 ,  otherwise
Radial Distance Weights

In some cases, distance itself is an important criterion of spatial influence. For example, locations "within walking distance" or "within one-hour driving distance" may be relevant. Such proximity criteria are usually more relevant for comparing actual point locations (such as distances to shopping opportunities or medical services), but are sometimes also used for areal data. If d̄ denotes some threshold distance (or bandwidth) beyond which there is no "direct spatial influence" between spatial units, then the corresponding radial distance weight matrix, W, has spatial weights of the form:

(2.1.11)   wij = 1 ,  0 ≤ dij ≤ d̄
               = 0 ,  dij > d̄
Power-Distance Weights

Another common approach is to let weights decline continuously with centroid distance, as in the (inverse) power-distance weights,

(2.1.12)   wij = dij^(−α)

where α is some positive exponent, typically α = 1 (as in the graph) or α = 2. Note that expression (2.1.12) is precisely the same as expression (5.2.4) in the interpolation discussion of Section 5.2 in Part II. Thus all of the discussion in that section is relevant here as well.
Exponential-Distance Weights
As in expression (5.2.5) of Part II, the negative exponential alternative to negative power
functions is also relevant here, and is again defined by:
(2.1.13)   wij = exp(−α dij)

for some positive exponent, α (such as α = 1 in the graph). As discussed in that section, the negative exponential version is better behaved for short distances, but converges rapidly to zero for larger distances.
Double-Power-Distance Weights
A somewhat more flexible family incorporates finite bandwidths with "bell shaped" taper functions. If d̄ again denotes the maximum radius of influence (bandwidth), then the class of double-power distance weights is defined for each positive integer k by

(2.1.14)   wij = [1 − (dij/d̄)^k]^k ,  0 ≤ dij ≤ d̄
               = 0 ,  dij > d̄

where typical values of k are 2, 3 and 4. Note that wij falls continuously to zero as dij approaches d̄, and is defined to be zero beyond d̄. The graph shows the case of a quadratic distance function with k = 2 (see also [BG, p.85]).
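In MATLAB, these three distance-decay families can be written as simple anonymous functions and compared graphically (a sketch; the parameter names alpha, dbar and k follow (2.1.12)-(2.1.14)):

% Distance-decay weight functions (2.1.12)-(2.1.14)
w_pow = @(d,alpha)  d.^(-alpha);                           % power-distance
w_exp = @(d,alpha)  exp(-alpha*d);                         % exponential-distance
w_dbl = @(d,dbar,k) ((1 - (d/dbar).^k).^k).*(d <= dbar);   % double-power
d = linspace(0.01, 2, 200);
plot(d, w_pow(d,1), d, w_exp(d,1), d, w_dbl(d,2,2))
legend('power', 'exponential', 'double power')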
The advantage of the distance weights above is that such distances are easily computed. But in many cases the boundaries shared between spatial units can play an important role in determining the degree of "spatial influence". The case of Eire in Figure 1.16 is a good example. In particular, recall that k-nearest-neighbor weights were in fact motivated by an effort to capture the counties surrounding Laoghis county in Figure 1.18. But such neighbor distances can at best only approximate spatial contiguity relations (especially since areal units can each have different numbers of contiguous neighbors). A better approach is of course to identify such contiguities directly. The main difficulty here is that the identification of contiguities requires the manipulation of boundary files, which are considerably more complex than simple point coordinates. We shall return to this issue below. But for the moment, we focus on the formal task of defining contiguity relations.
The simplest contiguity weights indicate only whether pairs of areal units share a boundary or not. If the set of boundary points of unit Ri is denoted by bnd(i), then the so-called queen contiguity weights are defined by

(2.1.15)   wij = 1 ,  bnd(i) ∩ bnd(j) ≠ ∅
               = 0 ,  bnd(i) ∩ bnd(j) = ∅
However, this allows the possibility that spatial units share only a single boundary point (such as a corner point shared by diagonally adjacent cells on a chess board).4 Hence a stronger condition is to require that some positive portion of their boundary be shared. If lij denotes the length of the shared boundary, bnd(i) ∩ bnd(j), between i and j, then the so-called rook contiguity weights are defined by

(2.1.16)   wij = 1 ,  lij > 0
               = 0 ,  lij = 0
Shared-Boundary Weights

As a sharper form of comparison, note that if li = Σ_{j≠i} lij denotes the total boundary length of bnd(i) that is shared with other spatial units, then the fraction of this length shared with any particular unit j is given by lij/li. These fractions themselves yield a potentially relevant set of shared-boundary weights, defined by

(2.1.17)   wij = lij/li = lij / Σ_{k≠i} lik
4 In fact, the present use of the terms "queen" and "rook" in expressions (2.1.15) and (2.1.16) refers precisely to the possible moves of queen and rook pieces on a chess board, where rooks can only move through faces between adjacent squares, but the queen can also move diagonally through corners.
Finally, it should be evident that in many situations spatial closeness or influence may exhibit aspects of both distance and boundary relations. One classical example of this is given in the original study of spatial autocorrelation by Cliff and Ord (1969). In analyzing the Eire blood-group data, they found that the best weighting scheme for capturing spatial autocorrelation effects was given by the following combination of power-distance and boundary-share weights,

(2.1.18)   wij = lij dij^(−α) / Σ_{k≠i} lik dik^(−α)
Having defined a variety of spatial weights, we next observe that for modeling purposes it is generally convenient to normalize these weights in order to remove dependence on extraneous scale factors (such as the particular units of distance employed in exponential and power weights). Here there are two standard approaches:

Row-Normalized Weights

Recall that the i-th row of W contains all spatial weights influencing areal unit, i, namely (wij : j = 1,.., n) [possibly with wii = 0]. So if the positive weights in each row are normalized to have unit sum, i.e., if

(2.1.19)   Σ_{j=1}^n wij = 1 ,   i = 1,.., n
then this produces what is called the row normalization of W.5 Note that each row-normalized weight, wij, can then be interpreted as the fraction of all spatial influence on unit i attributable to unit j. The appeal of this interpretation has led to the current widespread use of row-normalized weight matrices. In fact, many of the spatial weight definitions above are often implicitly defined to be row normalized. The most obvious example is that of the shared-boundary weights in (2.1.17), which by definition are seen to be row normalized. [Also the combined example in (2.1.18) was defined by Cliff and Ord to be row normalized.]
5 In cases where wii = 0 by definition, it is possible that isolated units, i, may have all-zero rows in W. So condition (2.1.19) is only required to hold for those rows, i, with Σ_j wij > 0.
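In MATLAB, row normalization amounts to dividing each row of W by its row sum. A minimal sketch (which, following footnote 5, leaves any all-zero rows unchanged):

% Row normalization (2.1.19) of an n x n spatial weights matrix W
n = size(W,1);
rs = full(sum(W, 2));                 % row sums
rs(rs == 0) = 1;                      % leave isolated (all-zero) rows as they are
W_rn = spdiags(1./rs, 0, n, n) * W;   % divide each row i by rs(i)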
A further example is provided by the row-normalized version of the power-distance weights in (2.1.12):

(2.1.20)   wij = dij^(−α) / Σ_{k≠i} dik^(−α)

These normalized weights are seen to be precisely the Inverse Distance Weighting (IDW) scheme employed in Spatial Analyst for spatial interpolation (as mentioned in Section 5.2 of Part II). A similar example is provided by the exponential distance weights, with row-normalized form,

(2.1.21)   wij = exp(−α dij) / Σ_{k≠i} exp(−α dik)
These weights are also used for spatial interpolation. In addition, it should be noted that
such normalized weights are commonly used in spatial interaction modeling, where
(2.1.20) and (2.1.21) are often designated, respectively, as Newtonian and exponential
models of spatial interaction intensities or probabilities.
In spite of its popularity, row-normalized weighting has its drawbacks. In particular, row
normalization alters the internal weighting structure of W so that comparisons between
rows become somewhat problematic. For example, consider spatial contiguity weighting
with respect to the simple three-unit example shown below:
(2.1.22)   Ri | Rj | Rk ,   W = [ 0  wij  wik ;  wji  0  wjk ;  wki  wkj  0 ] = [ 0 1 0 ; 1 0 1 ; 0 1 0 ]
The row-normalized version of W is then given by:

(2.1.23)   Wrn = [ 0  1  0 ;  1/2  0  1/2 ;  0  1  0 ]
Here the “total” influence on each unit is by definition the same, so that unit i now
influences j only “half as much” as j influences i . While the exact meaning of
“influence” is necessarily vague in most applications, this effect of row-normalization
can hardly be considered as neutral.6
An alternative approach that avoids this difficulty is to rescale all weights by a single common scalar, τ > 0, i.e., to replace W with τW. If wmax = max{wij : i, j = 1,.., n} denotes the largest individual weight in W, then the choice

(2.1.24)   0 < τ ≤ 1/wmax

provides one such normalization that has the advantage of ensuring that the resulting spatial weights, τwij, are all between 0 and 1, and thus can still be interpreted as relative influence intensities. A second common choice is given in terms of a matrix norm of W, say ‖W‖, with

(2.1.25)   0 < τ ≤ 1/‖W‖
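A sketch of these scalar normalizations in MATLAB (the variable names here are ours):

% Scalar normalizations of a spatial weights matrix W
wmax = max(W(:));             % largest individual weight
W1   = W / wmax;              % all weights now lie in [0,1], as in (2.1.24)
Wn   = norm(full(W));         % matrix (2-)norm of W
W2   = W / Wn;                % norm-based normalization, as in (2.1.25)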
Our primary interest here is to show how spatial weight matrices can be constructed for
applications in MATLAB. We begin with those spatial weights based on centroid
distances as in Section 2.1.2 above, and illustrate their construction in MATLAB. Next
6 A more detailed discussion of this problem can be found in Kelejian and Prucha (2010).
we consider certain of the spatial contiguity weights in Section 2.1.3, which require initial
calculations to be made on shapefiles in ARCMAP.
All spatial weights defined in Section 2.1.2 can be constructed in MATLAB using the
program dist_wts.m. By opening this program, one can see that the inputs include a
matrix, L, of centroid coordinates together with a MATLAB structure, info, containing
information about the specific spatial weights desired. The use of this program can be
illustrated by an application to the Eire centroid data in the workspace, eire.mat.
Here L is a 26 × 2 matrix containing the centroid coordinates for the 26 counties in Eire. If one chooses to construct a weight matrix containing the five nearest neighbors for each county, say W_nn5, then by looking at the top of the program, one sees that k-nearest neighbors corresponds to the first of six types of spatial weights that can be created. In particular, by setting info.type = [1,5], one specifies a 5-nearest-neighbor matrix. Thus the appropriate commands for this case are of the following form (the exact calling syntax is documented at the top of dist_wts.m):

>> load eire
>> info.type = [1,5];
>> W_nn5 = dist_wts(L,info);
To understand the matrix which is produced, we again consider the case of Laoghis
county in Figure 1.18 above. By using the Identify tool in ARCMAP, one sees that the
FID of Laoghis county is 10, so that its centroid coordinates correspond to row 11 in L
(remember that FID numbers start at 0 rather than 1). Similarly, one can verify that the
five surrounding counties (which are also its five nearest neighbors) have FID values
(0,8,9,18,21). So their row numbers in L are given by (1,9,10,19,22). These numbers
should thus correspond to the "1" values in row 11. This can be verified by displaying the column numbers of all positive elements in row number 11 of W_nn5, using the find command in MATLAB as follows:

>> find(W_nn5(11,:))
ans =
1 9 10 19 22
It is also important to emphasize that this matrix is constructed to be in sparse form, which means that only nonzero values are recorded. This can be seen by attempting to display the first 5 rows and columns of W_nn5 as follows:
>> W_nn5(1:5,1:5)
ans =
(5,2) 1
The result displayed says that the only nonzero element here is in (row 5, column 2) and
has value 1. This is a particularly powerful format in MATLAB since spatial weight
matrices tend to have many zero values, and can thus be stored and manipulated very
efficiently in sparse form. If one wants to obtain a full matrix version of W_nn5, say Wnn5, then use the command:

>> Wnn5 = full(W_nn5);

Displaying the same first 5 rows and columns of this full version,
>> Wnn5(1:5,1:5)
ans =
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 0 0 0 0
0 1 0 0 0
and shows in particular that all elements other than (5,2) are indeed zero.
Here again we use the Eire data as an example, and assume that the shapefile, Eire.shp,
is currently displayed in ARCMAP. The desired spatial weights can be obtained in
ArcToolbox using the command sequence:
where “Path” here represents the full path to the directory containing Eire.shp.
(ii) One then needs to have a unique identifier for each boundary polygon (county) in Eire. If none are present, then the simplest procedure is to construct a new field, ID, calculated as "[FID] + 1", and to set:

Unique ID Field = ID
(iii) Here we calculate queen contiguity weights, and thus name the Output Spatial Weights Matrix File as:

Path\Queen_W.swm

where "Path" now represents the full path to the directory where the output should be placed.

(iv) For the Conceptualization of Spatial Relationships, choose the queen contiguity option (CONTIGUITY_EDGES_CORNERS in current versions of ARCMAP). Note that a number of other spatial weight matrices can also be constructed here, such as rook contiguity, k nearest neighbors, and inverse-distance weights.
► Before leaving this window be sure to remove the check on Row Standardization,
unless you want row standardized values.
(v) Now click OK and the file Queen_W.swm should appear in the directory specified.
► Note that this file is a binary file that is only useful inside ARCMAP. To use this data in MATLAB, it must be transformed into a suitable text file. To do so, use the command sequence:
Spatial Statistics Tools > Utilities > Convert Spatial Weights Matrix to Table
(i) In the window that opens, set the Input Spatial Weights Matrix File to Queen_W.swm.

(ii) Name the Output Table, say Queen_W_Table, and choose a directory for it.

► Note that this output path can have no spaces (or you will get an error message). So you may have to choose a higher-level directory that can be reached without using spaces.

(iii) Click OK, and the file Queen_W_Table.dbf should appear in this directory.
(iv) Since MATLAB cannot (yet) import .dbf files, you must transform this to a text file.
To do so, open the file in EXCEL as a .dbf file. Now delete the first column (containing
zeros), so that only three columns remain (“ID” “NID” “WEIGHT”). Save this as a tab-
delimited text file, Queen_W_Table.txt.
(v) Now open this text file with MATLAB's Import Data tool (for example, by double-clicking Queen_W_Table.txt in the Current Folder window).

(vi) In the IMPORT Window, change the default "Column vector" setting in the IMPORTED DATA box to "Matrix", and click Import Selection.
The file will now appear in the workspace as a 112x3 matrix, QueenWTable. You can
rename this as W_queen by right clicking on the workspace entry.
As a check to be sure this procedure was successful, one may compare W_queen with the ARCMAP representation. In particular, by repeating the procedure for W_nn5 above, and listing the neighbor IDs (NID values) recorded for Laoghis county (ID = 11), with a command of the form:

>> W_queen(W_queen(:,1) == 11, 2)'

we now see that:

ans =

1 9 10 19 22
so that, as seen in Figure 1.18 above, the five contiguous neighbors to Laoghis county are indeed its five nearest neighbors with respect to centroid distance.
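Finally, to use this imported list as an actual weight matrix in MATLAB, one can convert the three columns (ID, NID, WEIGHT) of W_queen into a sparse 26 × 26 matrix; a minimal sketch:

% Convert the 112 x 3 edge list W_queen to a sparse 26 x 26 weight matrix
Wq = sparse(W_queen(:,1), W_queen(:,2), W_queen(:,3), 26, 26);
find(Wq(11,:))                % again lists the neighbors of Laoghis county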
Given the above formulation of spatial structure in terms of weights matrices, our objective in this section is to develop the basic model of areal-unit dependencies that will be used to capture possible spatial correlations between such units. Unless otherwise stated, we shall implicitly represent the relevant set of areal units, {R1,.., Rn}, by their indices, i = 1,.., n. In particular, these areal units will almost always represent the sample units of interest. To put this spatial-dependency model in proper perspective, we begin with a typical linear model of the form

(3.1)   Yi = β0 + Σ_{j=1}^k βj xij + ui ,   i = 1,.., n
where Yi is taken to represent some relevant attribute of each spatial unit, i, and where (xij : j = 1,.., k) represents a set of "explanatory" attributes of i that are postulated to influence Yi. For example, if Yi is the Myocardial Infarction rate of each English Health District, i = 1,..,190, in Section 1.3 above, then xi1 might correspond to the Jarman score for District i, together with other possible attributes of that district.
an obvious similarity to expression (7.5) in Part II. The key difference is in terms of their
respective spatial sample units, where the point locations ( s ) in expression (7.5) are here
replaced by areal units ( R) that partition this space. As mentioned in the introduction,
this change in spatial sample units reflects the type of spatial data being analyzed. For example, while, say, temperature is meaningful at each point in space, this is not true of Myocardial Infarction rates.1 But much more important for our present purposes is the way in which the unobserved errors (or residuals) are treated in each model. Notice in particular that we have switched notation in (3.1), and are now representing such residuals by ui rather than εi. The reason for this is that we shall proceed to develop an explicit linear model of these spatial residuals themselves.
In matrix terms, (3.1) can be written as

(3.2)   Y = Xβ + u

where as usual, Y = (Y1,..,Yn)′, X = [1n, x1,.., xk], β = (β0, β1,.., βk)′ and u = (u1,.., un)′. We again assume that the random vector, u, of residuals is multinormally distributed with mean, E(u) = 0, so that by construction,

(3.3)   E(Y) = Xβ
1 Note however that in cases such as the California rainfall example, where cities were treated as points, the relevant data implicitly involves "local" spatial averages. So in this setting, for example, it would be perfectly meaningful to compare the Myocardial Infarction rates of San Francisco and Los Angeles.
But rather than postulate spatial stationarity properties of u (as was done for spatially continuous data in Part II), we must now rely on discrete spatial structure as summarized by a given spatial weights matrix, W = (wij : i, j = 1,.., n). In terms of our Myocardial example above, wij may represent some measure of the spatial proximity of Health District j to (or influence on) Health District i, where higher values of wij denote greater spatial proximity or influence. In this setting, it seems reasonable to postulate that each unobserved residual, ui, in (3.1) is influenced by those residuals, uj, in neighboring areal units j, i.e., with positive spatial weights, wij. As a parallel to (3.1), such influences might also be represented by a linear "spatial error" model of the form:

(3.4)   ui = Σ_{j≠i} ρ(wij) uj + εi

where ρ(wij) is some appropriate "influence" function depending on wij, and where εi represents that part of residual ui that is not influenced by other areal units. But as we have seen in Section 3.2, there is already great flexibility in the specification of spatial weights, wij, and hence no need for further functional elaborations. Rather, the strategy here is to use the simplest possible specification in terms of a common scale factor, ρ, so that ρ(wij) takes the form ρ·wij, and (3.4) reduces to2

(3.5)   ui = ρ Σ_{j≠i} wij uj + εi
To interpret (3.5), note first that (except for the absence of an intercept term) this relation is essentially a type of linear regression model in which each residual, ui, is regressed on its neighbors, uj (with coefficients ρwij). Moreover, since this effectively implies that the full set of residuals is being regressed on itself, model (3.5) is designated as a spatial autoregressive model of residual dependencies. In this context, the summation over all j ≠ i ensures that no individual residual is "regressed on itself". But even with this restriction, it will be shown below that the estimation of such autoregressive models is far more subtle than that of standard regression models.

For the present however, we focus only on the basic meaning of (3.5). First consider the parameter, ρ, which plays a very special role in this model. At one extreme, if ρ = 0 then each residual, ui, reduces to its own intrinsic component, εi, and all spatial dependencies vanish. More formally, if we now assume that these individual components are independently and identically normally distributed as,

(3.6)   εi ~ N(0, σ²) ,   i = 1,.., n
2 Here the notation, Σ_{j≠i}, means summation over all units, j, other than unit i.
then model (3.1) is seen to reduce to a standard linear regression model when ρ = 0. At the other extreme, when |ρ| becomes large, the strength of all spatial dependencies (positive or negative) must also become large. This suggests that ρ be designated as the spatial dependency parameter for the model.
Note also that for any pairs of areal units, ij and kh, with positive spatial weights, wij, wkh > 0, and any nonzero level of spatial dependence, ρ ≠ 0, it must always be true that

(3.7)   (ρ wij)/(ρ wkh) = wij/wkh

Thus the relative strength of these spatial dependencies is determined entirely by their spatial weights. In summary, this model provides a natural "division of responsibilities" in which ρ governs the overall strength of spatial dependencies, and in which the spatial weight structure governs their relative strength among individual areal-unit pairs.
Finally, to write this model in more compact matrix form, it is convenient to assume that wii = 0 in the given spatial weights matrix, W, so that (3.5) can be rewritten in more standard terms as

(3.8)   ui = ρ Σ_{j=1}^n wij uj + εi ,   i = 1,.., n
In this form, if we now let ε = (ε1,.., εn)′ denote the random vector of intrinsic components, then expressions (3.8) and (3.6) together yield the following Spatial Autoregressive Model of residual dependencies:3

(3.9)   u = ρWu + ε ,   ε ~ N(0, σ²In)

where in addition it is assumed that the diagonal elements of W are zero, written as

(3.10)   diag(W) = 0 .
Like most of the spatial dependency models considered in these notes, model (3.9) was
originally inspired by a time series model [as in Whittle (1954)]. In the present case, this
3 This model was originally proposed by Whittle (1954). But the present matrix formulation was first given by Ord (1975), who designated (3.9) as a first-order spatial autoregressive process.
is the autoregressive model of order one [AR(1)], defined for time periods t = 1,.., T by

(3.1.1)   ut = ρ u_{t−1} + εt ,   t = 2,.., T

with initial condition,4

(3.1.2)   u1 = ε1

Given this initial condition, the AR(1) model can be viewed formally as a special case of model (3.9). To see this, observe simply that if the T × T weights matrix, W = (wts : t, s = 1,.., T), is defined by

(3.1.3)   wts = 1 ,  t = 2,.., T , s = t − 1
              = 0 ,  otherwise
then the full set of relations in (3.1.1) and (3.1.2) can be written in matrix form as

(3.1.4)   [ u1 ]       [ 0  0  ⋯  0 ] [ u1 ]   [ ε1 ]
          [ u2 ]  = ρ  [ 1  0  ⋯  0 ] [ u2 ] + [ ε2 ]     ⟺   u = ρWu + ε
          [ ⋮  ]       [    ⋱  ⋱    ] [ ⋮  ]   [ ⋮  ]
          [ uT ]       [ 0  ⋯  1  0 ] [ uT ]   [ εT ]
But this particular instance of (3.9) has the important property that time dependencies
flow in only one direction – namely from the past to the present. Formally, this is
reflected by the so-called “lower triangular” structure of W in (3.1.4).
4 While (3.1.2) can be replaced by more standard "steady state" initial conditions, the present simpler form is most appropriate for our purposes.
To appreciate the significance of this unidirectional flow, it is instructive to ask how one might simulate this model. Here the answer is almost self-evident from (3.1.1) and (3.1.2): first simulate the independent components, ε1,.., εT; then set u1 = ε1, and construct u2,.., uT recursively from (3.1.1), so that each ut is obtained from the already-simulated value u_{t−1}.
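In MATLAB, this sequential procedure takes only a few lines (a sketch with illustrative parameter values):

% Sequential simulation of the AR(1) model (3.1.1)-(3.1.2)
T = 100;  rho = 0.8;  sigma = 1;
eps = sigma*randn(T,1);            % intrinsic components
u = zeros(T,1);
u(1) = eps(1);                     % initial condition (3.1.2)
for t = 2:T
    u(t) = rho*u(t-1) + eps(t);    % recursion (3.1.1)
end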
However, for more general examples of model (3.9), this simple process of simulation is
not possible.
As a simple spatial example, consider a set of n households arranged along a single street, as depicted below:

[Figure: households 1, 2, 3, .., n−1, n arranged along a street]
In particular, suppose that household i's opinion, ui, on how much each house should contribute to their annual street party is influenced both by i's initial opinion, εi, and by the opinions of i's immediate neighbors, including u_{i−1} and/or u_{i+1}. Then a natural spatial model of opinion formation by these residents might well take the form:

(3.2.1)   ui = ρ u_{i+1} + εi ,                 i = 1
          ui = ρ (u_{i−1} + u_{i+1}) + εi ,     2 ≤ i ≤ n−1
          ui = ρ u_{i−1} + εi ,                 i = n

where ρ now reflects how influential the opinions of these neighbors are. Note in particular that the "edge" residents 1 and n have only one neighbor, while all other residents have two neighbors.
Given this spatial model of opinion formation,⁵ one may again ask: how might we simulate this model? Here the key question is where to start the simulation. For if we start with edge resident 1, then it is clear from the first line of (3.2.1) that we must know the opinion, $u_2$, of 1's neighbor in order to simulate $u_1$. Similarly, if we start with edge resident $n$, then the last line of (3.2.1) shows that the opinion, $u_{n-1}$, of $n$'s neighbor is required to simulate $u_n$. Moreover, the situation is even worse for intermediate residents, $i$, where both neighboring opinions, $u_{i-1}$ and $u_{i+1}$, are required in order to simulate $u_i$. So it would appear that there is no way to simulate this process at all. But to be more precise, this argument shows that there is no possible sequential simulation procedure for realizing samples of (3.2.1). Rather, the full set of opinions, $(u_1, u_2,..,u_n)$, must somehow be simulated simultaneously.
Here it turns out that there is a remarkably simple procedure for doing so. In particular,
let us again formulate (3.2.1) as an instance of (3.9) where W now takes the form:
(3.2.2) $w_{ts} = \begin{cases} 1, & |t - s| = 1 \\ 0, & \text{otherwise} \end{cases}$

so that (3.2.1) becomes

(3.2.3) $\begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{n-1} \\ u_n \end{pmatrix} = \rho \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 1 & 0 & 1 & & \vdots \\ & \ddots & \ddots & \ddots & \\ \vdots & & 1 & 0 & 1 \\ 0 & \cdots & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} u_1 \\ u_2 \\ \vdots \\ u_{n-1} \\ u_n \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_{n-1} \\ \varepsilon_n \end{pmatrix} \;\;\Leftrightarrow\;\; u = \rho W u + \varepsilon$
But given this matrix formulation, observe that we may solve for $u$ in terms of $\varepsilon$ as follows:

(3.2.4) $u = \rho W u + \varepsilon \;\Rightarrow\; u - \rho W u = \varepsilon \;\Rightarrow\; (I_n - \rho W)\,u = \varepsilon$

So assuming for the moment that the inverse matrix, $(I_n - \rho W)^{-1}$, exists, we can multiply both sides of (3.2.4) by $(I_n - \rho W)^{-1}$ to obtain the following reduced form solution for $u$ in terms of $\varepsilon$,

(3.2.5) $u = (I_n - \rho W)^{-1}\varepsilon$
⁵ Formally, expression (3.2.1) is an instance of the bilateral autoregressive process proposed by Whittle (1954). Indeed, this is precisely the one-dimensional example that motivated his original analysis of spatial autoregressive processes.
Given this existence assumption, observe that if "intrinsic opinions" are again assumed (for sake of illustration) to be independently and identically normally distributed about some average opinion level, $\mu$, as $\varepsilon_i \sim N(\mu, \sigma^2)$, $i = 1,..,n$, then we can now simulate (3.2.5) in essentially only two steps: first simulate the full vector, $\varepsilon = (\varepsilon_1,..,\varepsilon_n)'$, of intrinsic opinions, and then compute $u = (I_n - \rho W)^{-1}\varepsilon$.
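This two-step simulation can be sketched in MATLAB as follows (a minimal sketch; the values n = 10, rho = 0.4, mu = 0, and sigma = 1 are purely illustrative):

   n = 10; rho = 0.4; mu = 0; sigma = 1;
   W = diag(ones(n-1,1),1) + diag(ones(n-1,1),-1);  % tridiagonal W of (3.2.3)
   e = mu + sigma*randn(n,1);                       % step one: simulate eps
   u = (eye(n) - rho*W) \ e;                        % step two: solve (I - rho*W)u = eps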
Our objective in this section is to obtain conditions for the existence of $(I_n - \rho W)^{-1}$ and to give an intuitive spatial interpretation to this inverse matrix. To do so, we start by recalling that for any number, $a$, the basic geometric series:

(3.3.1) $S = 1 + a + a^2 + a^3 + \cdots = \sum_{k=0}^{\infty} a^k$

represents the simplest example of an infinite summation that can be given a closed form solution in an elementary way. For if one considers the partial sum,

(3.3.2) $S_k = 1 + a + a^2 + a^3 + \cdots + a^k$

and multiplies this sum by $a$,

(3.3.3) $a\,S_k = a + a^2 + a^3 + \cdots + a^k + a^{k+1}$

then subtracting (3.3.3) from (3.3.2) yields

$S_k - a\,S_k = (1 + a + a^2 + \cdots + a^k) - (a + a^2 + \cdots + a^k + a^{k+1}) = 1 - a^{k+1}$

and hence,

(3.3.4) $S_k = \dfrac{1 - a^{k+1}}{1 - a}$

But since by definition, $S = \lim_{k\to\infty} S_k$, it follows at once from (3.3.4) that this limiting sum exists if and only if $\lim_{k\to\infty} a^k = 0$, and must have the closed-form solution:
(3.3.5) $S = \lim_{k\to\infty} S_k = \dfrac{1}{1-a} = (1-a)^{-1}$

or equivalently,

(3.3.6) $(1-a)^{-1} = 1 + a + a^2 + a^3 + \cdots = \sum_{k=0}^{\infty} a^k$
The point of this exercise for our purposes is that exactly the same argument can be applied to matrices, by simply replacing the scalar, $a$, with an $n$-square matrix, $A$. In particular, if $O_n$ denotes the $n$-square zero matrix, then it is shown in Section A3.5 of the Appendix that

(3.3.7) $(I_n - A)^{-1} = I_n + A + A^2 + A^3 + \cdots = \sum_{k=0}^{\infty} A^k$

if and only if $\lim_{k\to\infty} A^k = O_n$. So in our case, by setting $A = \rho W$, it follows that the inverse $(I_n - \rho W)^{-1}$ will exist and have the limiting form

(3.3.8) $(I_n - \rho W)^{-1} = I_n + \rho W + \rho^2 W^2 + \rho^3 W^3 + \cdots = \sum_{k=0}^{\infty} \rho^k W^k$

if and only if

(3.3.9) $\lim_{k\to\infty} \rho^k W^k = O_n$
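This convergence is easy to observe numerically by comparing partial sums of the series with the exact inverse (a minimal sketch, using the illustrative tridiagonal W and rho from above, for which (3.3.9) does hold):

   n = 10; rho = 0.4;
   W = diag(ones(n-1,1),1) + diag(ones(n-1,1),-1);
   S = eye(n); P = eye(n);
   for k = 1:50
       P = rho*(W*P);               % P now equals rho^k * W^k
       S = S + P;                   % partial sum of the series (3.3.8)
   end
   norm(S - inv(eye(n) - rho*W))    % close to zero when (3.3.9) holds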
To gain further insight into this condition, it is convenient to view each $n$-square matrix, $A$, as a linear transformation on $\mathbb{R}^n$, as illustrated for the two-dimensional case by

(3.3.10) $A = (a_1, a_2) = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$

which maps the unit basis vectors, $e_1$ and $e_2$, into its columns, $Ae_1 = a_1$ and $Ae_2 = a_2$, and thus maps each point, $x = x_1 e_1 + x_2 e_2$, into $Ax = x_1 Ae_1 + x_2 Ae_2$.

[Figure 3.2. Basis Image Vectors]  [Figure 3.3. General Image Vectors]
From a geometrical viewpoint, it is of interest to ask whether there exist any vectors, $x \in \mathbb{R}^n$, that are simply "stretched" by $A$ into (possibly negative) multiples of themselves, i.e., whether

(3.3.11) $Ax = \lambda x$

for some scalar, $\lambda$. Such vectors, $x$, are designated as eigenvectors of $A$, with corresponding eigenvalues, $\lambda$. As a simple example, consider the two-region case, $R_1 \leftrightarrow R_2$, with spatial weights matrix

(3.3.12) $W = \begin{pmatrix} 0 & w_{12} \\ w_{21} & 0 \end{pmatrix}$
If $W$ represents a simple contiguity relation with $w_{12} = 1 = w_{21}$ [as in the 3-unit example of expression (2.1.22) above], and if we let $x_1 = (1,1)'$ and $x_2 = (-1,1)'$, then simple matrix multiplication shows that $Wx_1 = x_1$ and $Wx_2 = -x_2$, so that these are both eigenvectors of $W$ with corresponding eigenvalues, $\mathrm{Eig}(W) = \{\lambda_1, \lambda_2\} = \{1, -1\}$. This is shown graphically in Figure 3.4 below (where $x_1$ and $Wx_1$ are slightly offset so that both can be seen):

[Figure 3.4. The eigenvectors, x1 and x2, of W together with their images, Wx1 = x1 and Wx2 = −x2.]
More generally (as shown in Section A3.3 of the Appendix), each $n$-square matrix, $A$, possesses at most $n$ distinct eigenvalues. To see that there may be fewer than $n$, consider the identity matrix, $I_n$, which has only one distinct eigenvalue ($\lambda = 1$) since by definition, $I_n x = x$ for all $x \in \mathbb{R}^n$. This example also shows that eigenvectors in such cases can be chosen in many ways. There also exist matrices with no (real) eigenvalues, as illustrated by the matrix

(3.3.13) $A = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$

As seen in Figure 3.5 below, this matrix rotates the plane by $90°$, so that no vector can be sent into a scalar multiple of itself.
[Figure 3.5. Rotation of the plane by 90°: e1 ↦ Ae1 and e2 ↦ Ae2.]
But for the sake of simplicity, we focus here on $n$-square matrices, $A$, with a full set of eigenvalues, $\mathrm{Eig}(A) = \{\lambda_1,..,\lambda_n\}$, and associated eigenvectors, $x_1,..,x_n$, that are linearly independent.⁶ In geometric terms, this means that every point, $x \in \mathbb{R}^n$, can be written as a linear combination of these eigenvectors, as illustrated by the point, $x$, in Figure 3.4. In algebraic terms, it means that the $n$-square matrix, $X = [x_1,..,x_n]$, defined by these eigenvectors is nonsingular, so that the inverse matrix, $X^{-1}$, exists. We may thus write out the relations among these eigenvalues and eigenvectors as follows,

(3.3.14) $A x_i = \lambda_i x_i\,,\; i = 1,..,n \;\;\Rightarrow\;\; AX = [Ax_1,..,Ax_n] = [\lambda_1 x_1,..,\lambda_n x_n] = [x_1,..,x_n]\begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix} \;\;\Rightarrow\;\; AX = X\Lambda$

where $\Lambda = \mathrm{diag}(\lambda_1,..,\lambda_n)$ is the diagonal matrix of eigenvalues. So (post) multiplying both sides of (3.3.14) by $X^{-1}$, we obtain the following "spectral" representation of $A$,

(3.3.15) $AXX^{-1} = X\Lambda X^{-1} \;\Rightarrow\; A = X\Lambda X^{-1}$
To see the power of this representation, observe that if we multiply $A$ by itself, then:

(3.3.16) $A^2 = (X\Lambda X^{-1})(X\Lambda X^{-1}) = X\Lambda(X^{-1}X)\Lambda X^{-1} = X\Lambda^2 X^{-1}$

By comparing this with (3.3.15), it follows at once that the eigenvalues of $A^2$ are precisely the squares of the eigenvalues of $A$, and moreover that the associated eigenvectors remain the same. By simply repeating this argument $k$ times, it follows more generally that

(3.3.17) $A^k = X\Lambda^k X^{-1} = X\begin{pmatrix} \lambda_1^k & & \\ & \ddots & \\ & & \lambda_n^k \end{pmatrix} X^{-1}\,, \quad k = 1,2,...$

So the eigenstructure of $A$ tells us a great deal about how the associated powers, $A^k$, of $A$ must behave. In particular, the limiting behavior of these powers as $k \to \infty$ for any matrix, $A$, is governed entirely by the maximum size of its eigenvalues, which we denote by

$|\lambda|_A = \max\{\,|\lambda| : \lambda \in \mathrm{Eig}(A)\,\}$
⁶ In fact the eigenvectors for distinct eigenvalues are always linearly independent, as illustrated in Figure A3.27 of the Appendix.
To see this, note simply from (3.3.17) that these powers will converge to the zero matrix if and only if $\lambda^k \to 0$ for all $\lambda \in \mathrm{Eig}(A)$. Because this is equivalent to the single condition, $|\lambda|_A < 1$, it then follows that

(3.3.18) $\lim_{k\to\infty} A^k = O_n \;\Leftrightarrow\; |\lambda|_A < 1$
For the important case of nonnegative matrices, it is shown in Section ?? of the Appendix that this maximum always corresponds to the largest positive eigenvalue of $A$, denoted here by $\lambda_A$, so that $\lambda_A = |\lambda|_A$. As an illustrative example, the eigenstructure of the nonnegative matrix,

(3.3.19) $A = \begin{pmatrix} 2/3 & 1/3 \\ 1/6 & 1/2 \end{pmatrix}$

is easily seen to be given by

(3.3.20) $\Lambda = \begin{pmatrix} 5/6 & 0 \\ 0 & 1/3 \end{pmatrix}\,, \qquad X = [x_1, x_2] = \begin{pmatrix} 2 & 1 \\ 1 & -1 \end{pmatrix}$
[Figure 3.6. Shrinking of the eigenvectors, x1 and x2, under A: Ax1 = (5/6)x1 and Ax2 = (1/3)x2.]
Since all points are linear combinations of the eigenvectors, $x_1$ and $x_2$, and since $|\lambda|_A = \lambda_A = 5/6 < 1$ implies that both these eigenvectors shrink toward zero, we see that
all points are shrunk towards zero (as illustrated by the parallelogram in the figure). In
other words, by using the coordinate system created by these eigenvectors, we see that
the shrinking behavior of these eigenvectors is inherited by all points with respect to this
coordinate system. While not every case is so simply illustrated, Figure 3.6 helps to
provide some geometric intuition for the general result in (3.3.18).7
By combining (3.3.9) and (3.3.18), we see that a necessary and sufficient condition for the geometric-series representation in (3.3.8) to hold is that the maximum absolute eigenvalue of the matrix $\rho W$ be less than one. But for each eigenvalue, $\lambda$, of $W$, say with eigenvector, $x$, it follows at once from (3.3.11) that

(3.3.21) $Wx = \lambda x \;\Rightarrow\; \rho\,Wx = \rho\lambda\, x \;\Rightarrow\; (\rho W)\,x = (\rho\lambda)\,x$

so that the eigenvalues of $\rho W$ are precisely $\rho$ times the eigenvalues of $W$. Hence by recalling that for nonnegative $W$ the maximum absolute eigenvalue is the largest positive eigenvalue, $\lambda_W$,

(3.3.23) $|\lambda|_{\rho W} = |\rho| \cdot |\lambda|_W = |\rho|\,\lambda_W$

it follows that

(3.3.24) $|\lambda|_{\rho W} < 1 \;\Leftrightarrow\; |\rho|\,\lambda_W < 1 \;\Leftrightarrow\; |\rho| < \dfrac{1}{\lambda_W}$

So for the present case of spatial weight matrices, $W$, the general convergence condition in (3.3.18) now takes the form

(3.3.25) $\lim_{k\to\infty} \rho^k W^k = O_n \;\Leftrightarrow\; |\rho| < 1/\lambda_W$

and hence, from (3.3.8),

(3.3.26) $(I_n - \rho W)^{-1} = \sum_{k=0}^{\infty} \rho^k W^k \;\Leftrightarrow\; |\rho| < 1/\lambda_W$
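This bound is easy to check numerically for any given weights matrix (a minimal sketch, again using the illustrative tridiagonal W from above):

   n = 10;
   W = diag(ones(n-1,1),1) + diag(ones(n-1,1),-1);
   lamW = max(eig(W));     % largest (positive) eigenvalue, lambda_W
   rhoMax = 1/lamW         % (3.3.26) holds if and only if |rho| < rhoMax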
⁷ See Section ?? in the Appendix for a general development of this result.
⁸ Here it must be stressed that in spite of the apparent similarity of the condition, $|\rho| < 1$, to the properties of correlation coefficients, this spatial dependency parameter, $\rho$, is not a correlation coefficient.
In particular, for the max-eigenvalue normalized weights matrix, $W^* = W/\lambda_W$, we have $\lambda_{W^*} = 1$ by construction, so that

(3.3.27) $(I_n - \rho W^*)^{-1} = \sum_{k=0}^{\infty} \rho^k (W^*)^k \;\Leftrightarrow\; |\rho| < 1$⁸

and thus that (3.3.27) always holds for $W = W^*$ with $|\rho| < 1$. In fact, this is the primary motivation for the normalizing convention in expression (2.1.25) of Section 2 above.
Before proceeding, it is important to note that row-normalized weight matrices, $W_{rn}$, must also exhibit this same property. This can be seen in part by observing that the normalizing condition (2.1.19) for $W_{rn}$ in Section 2 can be written as
(3.3.29) $1 = \sum_{j=1}^{n} w_{ij} = [\,w_{i1},..,w_{in}\,]\,1_n = w_i\,1_n\,, \quad i = 1,..,n$

where $w_i$ is the $i$th row of $W_{rn}$. This set of conditions can in turn be written in matrix form as

(3.3.30) $1_n = \begin{pmatrix} w_1 1_n \\ \vdots \\ w_n 1_n \end{pmatrix} = W_{rn}\,1_n$
which shows that $1_n$ must always be an eigenvector of $W_{rn}$ with unit eigenvalue. Thus for the row normalization of any spatial weights matrix, $W$, we must have $1 \in \mathrm{Eig}(W_{rn})$. In addition, it is shown in Section ?? of the Appendix that this unit eigenvalue is necessarily the maximum eigenvalue of $W_{rn}$, and thus that (3.3.27) must always hold for row-normalized matrices.
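This property is also easy to verify numerically (a minimal sketch with an illustrative contiguity matrix):

   W = [0 1 0; 1 0 1; 0 1 0];            % illustrative contiguity matrix
   Wrn = diag(1./sum(W,2)) * W;          % row normalization: rows sum to one
   Wrn * ones(3,1)                       % returns 1_n, confirming (3.3.30)
   eig(Wrn)                              % the maximal eigenvalue is 1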
To interpret this series representation spatially, consider the following three-unit example, $R_1 \leftrightarrow R_2 \leftrightarrow R_3$, with contiguity weights matrix

(3.3.31) $W = \begin{pmatrix} 0 & w_{12} & w_{13} \\ w_{21} & 0 & w_{23} \\ w_{31} & w_{32} & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$

In this example, the only direct influences are between unit 2 and each of the other units, 1 and 3. This can be represented by the following graph, with areal units as "nodes" and positive weights as directed "links":

(3.3.32) $1 \rightleftarrows 2 \rightleftarrows 3$
So, for example, the top two arrows show that unit 2 directly influences both units 1 and 3. Now consider the square of this weight matrix,

(3.3.33) $W^2 = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}\begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 2 & 0 \\ 1 & 0 & 1 \end{pmatrix}$
If one thinks of direct links as influence paths of length 1, then the $ij$ elements of $W^2 = (w_{ij}^{(2)})$ are precisely the numbers of influence paths of length 2 from $j$ to $i$. In particular, each $m$th term of the $ij$-value, $w_{ij}^{(2)} = \sum_{m=1}^{3} w_{im} w_{mj}$, of $W^2$ contributes a value of 1 to this sum if and only if both $w_{im}$ and $w_{mj}$ are 1, i.e., if and only if there is a path, $j \to m \to i$, of length 2. For example, while unit 3 does not directly influence unit 1, there is an indirect influence on the path, $3 \to 2 \to 1$, seen in (3.3.32). This single influence path of length 2 corresponds to the 1 in the upper right hand corner of $W^2$. Notice also that while the diagonal elements of $W$ are zero by construction, this is not true of $W^2$. For example, there is now an influence path of length 2 from unit 1 to itself, namely the path $1 \to 2 \to 1$, in which 1's influence on 2 is "echoed back" as a second order influence on 1.
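These path counts are easily reproduced by direct computation (a minimal sketch for the three-unit example):

   W = [0 1 0; 1 0 1; 0 1 0];    % weights matrix (3.3.31)
   W^2       % = [1 0 1; 0 2 0; 1 0 1], the path counts of (3.3.33)
   W^3       % e.g. entry (1,2) = 2: the length-3 paths 2->1->2->1 and 2->3->2->1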
In a similar manner, the $ij$ elements of the $k$th power, $W^k = (w_{ij}^{(k)})$, of $W$ indicate the number of length-$k$ paths from $j$ to $i$. But notice in the present example that these relations depend explicitly on the fact that $W$ consists entirely of zeroes and ones. More generally, for any $n$-square weights matrix, $W$, the $ij$ elements of the $k$th power, $W^k = (w_{ij}^{(k)})$, of $W$ take the form⁹

(3.3.34) $w_{ij}^{(k)} = \sum_{m_1} \cdots \sum_{m_{k-1}} w_{i m_1} w_{m_1 m_2} \cdots w_{m_{k-1} j}$
⁹ For a deeper discussion of such influence paths see Martellosio (2012).
where each positive product, $w_{i m_1} w_{m_1 m_2} \cdots w_{m_{k-1} j}$, in $w_{ij}^{(k)}$ still corresponds to a unique path, $j \to m_{k-1} \to \cdots \to m_1 \to i$, of positive influences – but where this product need not be unity. Moreover, if we now introduce the spatial dependency parameter, $\rho$, and consider the $k$th power, $\rho^k W^k$, then (3.3.34) becomes

(3.3.35) $\rho^k w_{ij}^{(k)} = \sum_{m_1} \cdots \sum_{m_{k-1}} (\rho\, w_{i m_1})(\rho\, w_{m_1 m_2}) \cdots (\rho\, w_{m_{k-1} j})$

In this form, it is clear that the $w$-values along each path reflect only the relative influences of each link, where typically such influences will be smaller on links between more widely separated units. The full influences of these links are then determined by $\rho$.
With these preliminary observations, it should now be clear that the geometric sum in (3.3.8) represents the cumulative effect of all these direct and indirect spatial influences among units. This can be seen more explicitly by using (3.3.8) to expand (3.2.5) as follows:

(3.3.36) $u = (I_n - \rho W)^{-1}\varepsilon = (I_n + \rho W + \rho^2 W^2 + \cdots)\,\varepsilon = \varepsilon + \rho W\varepsilon + \rho^2 W^2\varepsilon + \cdots$
So for any given vector of intrinsic effects, $\varepsilon = (\varepsilon_1,..,\varepsilon_n)'$, expression (3.3.36) displays the accumulation of all direct and indirect effects of $\varepsilon$ that define the vector, $u = (u_1,..,u_n)'$, of autoregressive residuals. This is illustrated graphically in Figure 3.7 below for the "over the fence" communications example in Figure 3.1 (for the case of $n = 7$ neighbors):
[Figure 3.7. Cumulative effects for the n = 7 linear neighborhood: the intrinsic effects, ε = (ε1,..,ε7), plus the first-order effects, ρWε, plus the second-order effects, ρ²W²ε, and so on.]
Here we only show the first three terms of (3.3.36), where the first term reflects the initial (intrinsic) opinions of each neighbor, and where subsequent terms represent the cumulative indirect influences on these opinions resulting from over-the-fence communications. Alternatively, if one were to imagine each initial opinion as a pebble falling into water, then the influences of these opinions spread out like "ripples" in all directions. (An empirical example of such a ripple effect is given in Figure 7.8 below.)
More generally, this example suggests that spatial autoregressive residuals, $u$, can be viewed as the steady state of an implicit spatial diffusion process generated by a random vector of intrinsic effects, $\varepsilon$. Of course, the spatial autoregressive model in (3.9) is static in nature, and involves no explicit notion of time. But such cumulative effects can nonetheless be usefully represented as a steady state over virtual time periods, as shown in Figure 3.8 below.
[Figure 3.8. Steady-state representation over virtual time periods ..., t−4, t−3, t−2, t−1, t0: in every period the total effect received, ε + ρWε + ρ²W²ε + ρ³W³ε + ⋯, is the same.]
Here, for example, $\rho W\varepsilon$ in the "current" state, $t_0$, is interpreted as the direct effect of $\varepsilon$ in the "previous" state, $t_{-1}$, and similarly, $\rho^2 W^2\varepsilon$ in $t_0$ is the indirect effect of $\varepsilon$ in $t_{-2}$. The main feature of this representation is that the total effect in each state resulting from all previous states remains the same, thus yielding a "steady state" independent of time. But regardless of whether or not this steady state interpretation is used, the essential result here is that the reduced-form representation of spatial autoregressive residuals, $u$, in (3.3.36) does indeed incorporate all direct and indirect effects generated by $\varepsilon$ in the presence of spatial structure, $W$.
One final point needs to be made about this reduced-form representation. It is often observed that this representation is not essential for the existence of the inverse $(I_n - \rho W)^{-1}$. For example, if $W$ is given by (3.3.31), and say, $\rho = 2$, then it may be
verified (by direct matrix multiplication) that the inverse of this matrix exists, and is given (approximately) by

(3.3.37) $(I_3 - 2W)^{-1} \approx \begin{pmatrix} 0.43 & -0.29 & -0.57 \\ -0.29 & -0.14 & -0.29 \\ -0.57 & -0.29 & 0.43 \end{pmatrix}$

But while this inverse exists, it is far more difficult to interpret in a meaningful way. In particular, the negative elements in this matrix are rather questionable. Note in particular from the positivity of $\rho$ that $\rho W$ must be a nonnegative matrix. So it seems clear from the basic autoregressive relation, $u = \rho W u + \varepsilon$, that a positive increase in the components of $\varepsilon$ should certainly not decrease any component of $u$. However, (3.3.37) and (3.2.5) together imply, for example, that the second component, $u_2$, is related linearly to $(\varepsilon_1, \varepsilon_2, \varepsilon_3)$ by

(3.3.38) $u_2 \approx -0.29\,\varepsilon_1 - 0.14\,\varepsilon_2 - 0.29\,\varepsilon_3$

so that $u_2$ is actually decreasing in every component of $\varepsilon$.
But such problems do not arise when this inverse is representable as in (3.3.36). In the present case, observe that since $\lambda_W = \sqrt{2} \approx 1.414$ for this $W$ matrix, it follows that if $|\rho| < 1/\lambda_W \approx 0.707$, then (3.3.36) must hold. But in this case, the nonnegativity of $\rho W$ ensures that every term of the expansion, $\sum_{k=0}^{\infty} \rho^k W^k$, must be nonnegative, so that $(I_n - \rho W)^{-1}$ is always nonnegative. For example, if $\rho = .5$ then it can again be verified that

(3.3.39) $(I_3 - \tfrac{1}{2}W)^{-1} = \begin{pmatrix} 1.5 & 1 & .5 \\ 1 & 2 & 1 \\ .5 & 1 & 1.5 \end{pmatrix}$
So positive spatial dependencies here imply that spatial autoregressive residuals, $u$, are always monotone nondecreasing in the components of $\varepsilon$. Finally, it should be emphasized that the negative signs in (3.3.37) are no accident. In fact, it is shown in Section ?? of the Appendix that all elements of $(I_n - \rho W)^{-1}$ are nonnegative if and only if $|\rho| < 1/\lambda_W$. So while the steady-state representation in (3.3.36) is not strictly necessary for the existence of a reduced form solution for $u$, it characterizes those cases where a meaningful spatial interpretation of these residuals can be given.
To apply the spatial autoregressive model above, we start by restating the linear model (for $n$ areal units) in expression (3.2) above, where the residuals, $u$, are now specified more explicitly as:

(4.1) $Y = X\beta + u\,, \quad u \sim N(0, V)$
with both the parameter vector, $\beta$, and covariance matrix, $V$, unknown. The simplest procedure for specifying the residual covariance is to start by assuming that

(4.2) $V = \sigma^2 I_n\,,$

so that $\beta$ can be estimated by OLS. Given this estimate, one can then test to see whether there is sufficient spatial autocorrelation in the residuals to warrant more elaborate specifications of $V$. If for a given sample, $y$ (i.e., observed realization of $Y$), we denote the OLS estimate of $\beta$ by

(4.3) $\hat{\beta} = (X'X)^{-1}X'y$

with associated OLS residuals,

(4.4) $\hat{u} = y - X\hat{\beta}$
then our objective is to develop statistical tests for the presence of spatial autocorrelation using these residuals. To do so, we assume that the underlying spatial structure of these $n$ areal units is representable by a given spatial weights matrix, $W = (w_{ij} : i,j = 1,..,n)$. In terms of $W$, it is then hypothesized that all relevant spatial autocorrelation among the residuals, $u$, in (4.1) can be captured by the spatial autoregressive model,

(4.5) $u = \rho W u + \varepsilon\,, \quad \varepsilon \sim N(0, \sigma^2 I_n)$
The key feature of this hypothesis is that testing for spatial autocorrelation then reduces to testing the null hypothesis:

(4.6) $H_0 : \rho = 0$

For if $H_0$ is true, then (4.5) reduces to

(4.7) $u = \varepsilon \sim N(0, \sigma^2 I_n)\,,$

so that the OLS specification of $V$ in (4.2) above is appropriate. If not, then some more elaborate specification of $V$ needs to be considered.
In this context, our main objective is to construct appropriate test statistics based on $\hat{u}$ and $W$ for testing $H_0$. In the following subsections, we shall consider three alternative test statistics that are in common use.
Given model (4.5), one natural approach is simply to treat the OLS residuals, $\hat{u}$, as a sample of $u$, and use model (4.5) to obtain a corresponding OLS estimate of $\rho$. To do so, recall the "one variable regression" illustration given in class, where we started with a linear model:

(4.1.1) $Y_i = x_i b + \varepsilon_i\,, \quad i = 1,..,n$

In vector form, this is seen to yield the special case of (4.1) where $X = x$ and $\beta = b$ is a scalar, i.e.,

(4.1.2) $Y = xb + \varepsilon$

Hence, as a special case of (4.3), the OLS estimate of the scalar, $b$, in (4.1.2) is given by

(4.1.3) $\hat{b} = (x'x)^{-1}x'y = \dfrac{x'y}{x'x} = \dfrac{x'y}{\|x\|^2}$

But (4.5) can be viewed as a model of the form (4.1.2), where $b = \rho$, $Y = u$ and $x = Wu$. Hence, for our present "data", $y = \hat{u}$, the corresponding OLS estimate of $\rho$ is given by

(4.1.4) $\hat{\rho}_W = \dfrac{(W\hat{u})'\hat{u}}{(W\hat{u})'(W\hat{u})} = \dfrac{\hat{u}'W'\hat{u}}{\|W\hat{u}\|^2}$
This yields our first test statistic for $H_0$, which we designate as the rho statistic. Note also that we use the subscript "$W$" to emphasize that this statistic (and those below) depends explicitly on the choice of $W$.
Having constructed this statistic, it is of interest to observe that the basic spatial autocorrelation test we have been using so far, namely regressing residuals on nearest-neighbor residuals, is essentially a special case of this rho statistic. To see this, observe that the $i$th row of (4.5) is of the form:

(4.1.5) $u_i = \rho\,(Wu)_i + \varepsilon_i = \rho \sum_{j=1}^{n} w_{ij}\, u_j + \varepsilon_i$

But if $W$ is chosen to be the first nearest-neighbor matrix (i.e., $w_{ij} = 1$ if $j$ is the nearest neighbor of $i$, and $w_{ij} = 0$ otherwise), and if we let $nn(i)$ denote the first nearest neighbor of each point $i$, then by construction,
(4.1.6) $\sum_{j=1}^{n} w_{ij}\, u_j = u_{nn(i)}$

so that (4.1.5) reduces to

(4.1.7) $u_i = \rho\, u_{nn(i)} + \varepsilon_i$
But this is almost exactly the regression we have been using, where the important slope coefficient is now precisely $\rho$. So the test for significance of this slope is based on the estimator, $\hat{\rho}_W$. Notice however that, unlike our regression, there is no intercept in (4.1.7). This makes sense theoretically since¹

(4.1.8) $E(u_i) = \rho\, E(u_{nn(i)}) + E(\varepsilon_i) = 0\,,$

which in turn implies that the intercept term must also be zero in this model. So in fact, (4.1.7) is the model we should have been using. But since regression residuals must sum to zero by the construction of OLS,² the intercept is usually not statistically significant. This is well illustrated by the regression of Myocardial Infarctions on the Jarman Index in Section 1.3 above. Residual regressions with and without the intercept are compared in Figures 4.1 and 4.2 below. Notice in Figure 4.1 that the intercept is close to zero and completely insignificant. More importantly, notice that the t-values for the slopes in both figures are virtually identical.
[Figure 4.1. Regression with Intercept]  [Figure 4.2. Regression with No Intercept]

Parameter Estimates (with intercept):
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   0.0058051   0.015586    0.37      0.7100
nn-Resids   0.5111284   0.063697    8.02      <.0001*

Parameter Estimates (no intercept):
Term        Estimate    Std Error   t Ratio   Prob>|t|
nn-Resids   0.5100272   0.063483    8.03      <.0001*
² This is established in expression (4.2.9) below.
However, it is also important to notice that, from a regression viewpoint, models like (4.1.7) are seriously flawed. In particular, since the same random vector, $u$, appears on both the left and right hand sides, this regression suffers from what is called an "endogeneity problem". Here it can be shown that $\hat{\rho}_W$ is actually an inconsistent estimator of $\rho$, which means that even for very large sample sizes, there is no guarantee that $\hat{\rho}_W$ will eventually be close to the true value. Nevertheless, we have already seen that the p-values derived from this regression are generally quite reasonable. So even though we will develop a more satisfactory Monte Carlo approach using this $\hat{\rho}_W$ statistic, the regression approach in (4.1.7) is generally quite robust and easy to perform.
Hence if the OLS regression residuals, $\hat{u}$, are taken to be a sample of $u$ (so that $W\hat{u}$ is a sample of $Wu$), then all sample pairs $[\hat{u}_i, (W\hat{u})_i]$ must be correlated with the same sign. This suggests that, as a summary measure, the sample correlation between the vectors, $\hat{u}$ and $W\hat{u}$, should reflect this common sign. Since all these random variables have zero means by construction,³ we start by observing that the correlation between any zero-mean random variables, $X$ and $Y$, is given by⁴

(4.1.9) $\rho(X,Y) = \dfrac{\mathrm{cov}(X,Y)}{\sigma(X)\,\sigma(Y)} = \dfrac{E(XY)}{\sqrt{E(X^2)}\,\sqrt{E(Y^2)}}$
³ Again, $(Wu)_i = \sum_j w_{ij} u_j$ implies from (4.1.8) that $E[(Wu)_i] = \sum_j w_{ij} E(u_j) = 0$.
⁴ Recall that $\mathrm{cov}(X,Y) = E(XY) - E(X)E(Y) = E(XY)$, so that $\mathrm{var}(X) = \mathrm{cov}(X,X) = E(X^2)$.
Given sample data, $(x_i, y_i)$, $i = 1,..,n$, for $X$ and $Y$, the expectations in (4.1.9) can be estimated by the corresponding sample moments,

(4.1.10) $\hat{E}(XY) = \frac{1}{n}\sum_{i=1}^{n} x_i y_i\,, \quad \hat{E}(X^2) = \frac{1}{n}\sum_{i=1}^{n} x_i^2\,, \quad \hat{E}(Y^2) = \frac{1}{n}\sum_{i=1}^{n} y_i^2$

so that the resulting estimate of $\rho(X,Y)$ is given by

(4.1.11) $\hat{\rho}(X,Y) = \dfrac{\hat{E}(XY)}{\sqrt{\hat{E}(X^2)}\,\sqrt{\hat{E}(Y^2)}} = \dfrac{\frac{1}{n}\sum_{i=1}^{n} x_i y_i}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2}\,\sqrt{\frac{1}{n}\sum_{i=1}^{n} y_i^2}} = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$

In vector terms, this sample correlation takes the compact form

(4.1.12) $r(x,y) = \dfrac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} = \dfrac{x'y}{\sqrt{x'x}\,\sqrt{y'y}} = \dfrac{x'y}{\|x\|\,\|y\|}$
In these terms, our second test statistic is given by the sample correlation between $\hat{u}$ and $W\hat{u}$, i.e.,

(4.1.13) $r_W = r(\hat{u}, W\hat{u}) = \dfrac{\hat{u}'W\hat{u}}{\|\hat{u}\|\,\|W\hat{u}\|}$
Up to this point we have focused mainly on constructing statistics for estimating the value (or at least the sign) of $\rho$ in model (4.5). But we have given little attention to how these statistics behave under the null hypothesis, $H_0$, in (4.6). One might suspect from the inconsistency of $\hat{\rho}_W$ that this statistic exhibits little in the way of "optimal" behavior under $H_0$. The sample correlation, $r_W$, does somewhat better in this respect. But from a statistical viewpoint, it again suffers from another type of "inconsistency". For while the classical sample correlation statistic assumes that $(x_i, y_i)$, $i = 1,..,n$, are independent random samples from the same statistical population $(X,Y)$, this is not true of the samples $[\hat{u}_i, (W\hat{u})_i]$, $i = 1,..,n$. Even under the null hypothesis, where (4.7) implies that the components $(u_i = \varepsilon_i : i = 1,..,n)$ are independently and identically distributed, this is not true of the samples, $(W\varepsilon)_i$, $i = 1,..,n$, which are neither independent nor identically distributed. So
there remains a question as to how well either of these statistics behaves with respect to testing $H_0$. In this context, we now introduce our final test statistic,

(4.1.14) $I_W = I(\hat{u}, W\hat{u}) = \dfrac{\hat{u}'W\hat{u}}{\hat{u}'\hat{u}} = \dfrac{\hat{u}'W\hat{u}}{\|\hat{u}\|^2}$

designated as the Moran statistic, or more simply as Moran's I.⁵ Here it is important to emphasize that expression (4.1.14) is different from the version of Moran's I in [BG, p.270] (also used in ARCMAP), which is designed for detecting autocorrelation in $Y$ itself. This can in fact be viewed as the special case of (4.1) in which there is only an "intercept" term with coefficient, $\mu$, representing the common mean of the $Y$ components, i.e.,

(4.1.15) $Y = 1_n \mu + u$
If $u$ is again assumed to satisfy (4.5), then under the null hypothesis, $\rho = 0$, the "OLS" estimate in (4.3) reduces to the sample mean of $y$, i.e.,

(4.1.16) $\hat{\mu} = (1_n'1_n)^{-1} 1_n' y = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}$

Thus the "residuals" in (4.4) are here seen to be simply the deviations of the $y$ components about this sample mean, i.e.,

(4.1.17) $\hat{u} = y - \bar{y}\,1_n$

So the appropriate version of Moran's I in this special case is seen to have the form,

(4.1.18) $I_W = \dfrac{(y - \bar{y}1_n)'W(y - \bar{y}1_n)}{(y - \bar{y}1_n)'(y - \bar{y}1_n)} = \dfrac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(y_i - \bar{y})(y_j - \bar{y})}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$
which is essentially the version used in [BG, p.270], except for the normalizing constant

(4.1.19) $\dfrac{n}{\sum_{i \neq j} w_{ij}}$

For simplicity we have simply dropped this constant [as, for example, in Tiefelsdorf (2000, p.48)].⁶
⁵ Be careful not to confuse this use of "$I$" with the $n$-square identity matrix, $I_n$.
⁶ Notice that for the common case of row normalized $W$ (with zero diagonal) it must be true that $\sum_{i \neq j} w_{ij} = \sum_{i=1}^{n}\left(\sum_{j \neq i} w_{ij}\right) = \sum_{i=1}^{n}(1) = n$, so this constant is unity.
While this statistic is more difficult to motivate by simple arguments,⁷ it turns out to exhibit better statistical behavior with respect to testing $H_0$ than either of the statistics above. Indeed, if we compare the three statistics,

$\hat{\rho}_W = \dfrac{\hat{u}'W\hat{u}}{\|W\hat{u}\|^2}\,, \qquad r_W = \dfrac{\hat{u}'W\hat{u}}{\|\hat{u}\|\,\|W\hat{u}\|}\,, \qquad I_W = \dfrac{\hat{u}'W\hat{u}}{\|\hat{u}\|^2}$

we see that they exhibit striking similarities. Indeed, since the numerators are identical, and since all denominators are positive, it follows that these three statistics must always have the same sign. Hence the differences between them are not at all obvious.
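For computation, all three statistics require only a few lines of MATLAB (a minimal sketch; the residual vector, uhat, and weights matrix, W, are assumed to be given):

   Wu   = W*uhat;
   rhoW = (Wu'*uhat)/(Wu'*Wu);               % rho statistic (4.1.4)
   rW   = (uhat'*Wu)/(norm(uhat)*norm(Wu));  % sample correlation (4.1.13)
   IW   = (uhat'*Wu)/(uhat'*uhat);           % Moran's I (4.1.14)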
But for testing purposes, the key issue is their relative behavior under the null hypothesis, $H_0$. To study this behavior, it is necessary to express $\hat{u}$ more explicitly as a random vector. To do so, observe first from (4.3) and (4.4) that

(4.1.21) $\hat{u} = y - X\hat{\beta} = [I_n - X(X'X)^{-1}X']\,y$

so that under model (4.1),

$\hat{u} = [I_n - X(X'X)^{-1}X'](X\beta + u) = (X\beta - X\beta) + [I_n - X(X'X)^{-1}X']\,u = [I_n - X(X'X)^{-1}X']\,u$

Hence, letting

(4.1.22) $M = I_n - X(X'X)^{-1}X'$

we see that $\hat{u} = Mu$, where $M$ is symmetric ($M' = M$) and idempotent ($MM = M$).⁸
⁷ However, a compelling motivation of this statistic can be given in terms of the "concentrated likelihood function" used in maximum likelihood estimation of $\rho$. We shall return to this question in Section (??) after maximum likelihood estimation has been introduced.
Finally, to study the relative behavior of these estimators under $H_0$, recall from (4.5) that $\rho = 0$ implies $u = \varepsilon \sim N(0, \sigma^2 I_n)$, so that $\hat{u}$ now takes the form

(4.1.24) $\hat{u} = M\varepsilon$
[Figure 4.3. Estimated sampling densities of the Moran, corr, and rho statistics under H0.]
Note first that while all three distributions are roughly centered on the true value, $\rho = 0$, there is actually some degree of bias in all three. The simulated means for these three
⁸ The conditions $M' = M$ and $MM = M$ together imply that $M$ is an orthogonal projection matrix.
⁹ Since the row sums are always 5, i.e., since $W 1_n = (5)1_n$, it turns out that $\lambda_W = 5$, and thus that the max-eigenvalue normalization [(2.1.25) of Section 2 above] and row normalization for this particular $W$ matrix are the same.
¹⁰ Exact distribution results for Moran's I have been obtained by Tiefelsdorf (2000, Chap. 7).
¹¹ Density estimation was done using the kernel-smoothing procedure, ksdensity.m, in MATLAB.
statistics are displayed in Table 4.1 below, and show that the mean of Moran's I is in fact an order of magnitude closer to zero than the other two. Moreover, the exact theoretical mean for Moran's I in this case [expression (4.1.29) below] can be calculated to be $-0.00655$, which shows that for a sample of size 10,000 these simulated values are quite accurate.

Table 4.1. Simulated means of the three statistics
Moran   -0.0067
corr    -0.0198
rho     -0.0851
But Figure 4.3 also suggests that relative bias among these three estimators is far less important than their relative variances. Indeed, it is here that the real superiority of $I_W$ is evident. While the variance of $I_W$ under $H_0$ is known [see expression (4.1.31) below], its exact relation to the variances of $r_W$ and $\hat{\rho}_W$ under $H_0$ is difficult to obtain analytically. But simulations with many examples show that these English Mortality results are quite typical. In fact, even for individual realizations of $\varepsilon$ it is generally true that

(4.1.26) $|I_W(\varepsilon)| \le |r_W(\varepsilon)| \le |\hat{\rho}_W(\varepsilon)|$

While counterexamples show that (4.1.26) need not hold in all cases, this ordering was exhibited by all 10,000 simulations in the English Mortality example.
In summary, this example shows why Moran's I is by far the most widely used statistic for testing spatial autocorrelation. Given its relative unbiasedness and efficiency (small variance) properties under $H_0$, Moran's I tends to be the most reliable tool for detecting spatial autocorrelation.¹²
Given the superiority of Moran's $I$, the most common procedure for testing $H_0$ is to use the asymptotic normality of $I_W$ under $H_0$, first established by Cliff and Ord (1973) [see also Cliff and Ord (1981, pp. 47-51), [BG, p.281], and Cressie (1993, p.442)]. Since an asymptotic testing procedure using Moran's I is available in ARCMAP, it is of interest to develop this procedure here. But before doing so, it must be emphasized that the test used in ARCMAP is based on the version of Moran's I in expressions (4.1.18) and (4.1.19) above, which we here denote by

(4.2.1) $\tilde{I}_W = \dfrac{n}{\sum_{i \neq j} w_{ij}} \cdot \dfrac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}(y_i - \bar{y})(y_j - \bar{y})}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$
¹² For a deeper discussion of its optimality properties, see Section 4.3.2 in Tiefelsdorf (2000).
The mean and variance of this statistic under $H_0$ are given in [BG, p.281]. For our present purposes, it is enough to observe that the mean of $\tilde{I}_W$ has a very simple form,

(4.2.2) $E(\tilde{I}_W) = -\dfrac{1}{n-1}$

which in particular approaches zero as $n$ becomes large. For the residual version of Moran's I in (4.1.14), the corresponding mean under $H_0$ is given by¹³

(4.2.3) $E(I_W) = \dfrac{1}{n-k}\,\mathrm{tr}(MW)$

where $M$ is given in terms of the $n \times k$ data matrix, $X$, by (4.1.22), and where the trace, $\mathrm{tr}(A)$, of any $n$-square matrix, $A = (a_{ij})$, is defined to be the sum of its diagonal elements, i.e.,

(4.2.4) $\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii}$

So in the case of regression, $E(I_W)$ is seen to depend not only on $W$ but also on the particular data matrix, $X$. This is also true for the variance of $I_W$, which has the more complex form:

(4.2.5) $\mathrm{var}(I_W) = \dfrac{\mathrm{tr}(MWMW') + \mathrm{tr}(MWMW) + [\mathrm{tr}(MW)]^2}{(n-k)(n-k+2)} - [E(I_W)]^2$
¹³ The mean (4.2.3) and variance (4.2.5) of $I_W$ under $H_0$ are taken from Tiefelsdorf (2000, p.48). The original derivations of these moments [using the normalizing factor in (4.1.19)] can be found in Cliff and Ord (1981, Sections 8.3.1 and 8.3.2).
In terms of these moments, the standardized Moran statistic is then given by

(4.2.6) $Z_W = \dfrac{I_W - E(I_W)}{\sqrt{\mathrm{var}(I_W)}}$

and under appropriate conditions it can be shown that, for large $n$,

(4.2.7) $Z_W \;\overset{d}{\sim}\; N(0,1)$

where (as in Section 3.1.3 of Part II) the notation, $\overset{d}{\sim}$, means "is approximately distributed as". A more detailed description of these "appropriate conditions" will be given at the end of Section 4.2.2 below. So for the present, we simply assume that the approximation in (4.2.7) is valid.
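The core of such a test can be sketched directly from (4.1.22) and (4.2.3)-(4.2.6) (a minimal sketch, assuming the data vector y, the data matrix X with a leading column of ones, and the weights matrix W are given; the two-sided normal p-value shown here is an illustrative reporting convention, not necessarily that of the class program):

   [n,k] = size(X);
   M    = eye(n) - X*((X'*X)\X');            % projection matrix (4.1.22)
   uhat = M*y;                               % OLS residuals
   IW   = (uhat'*W*uhat)/(uhat'*uhat);       % Moran's I (4.1.14)
   EI   = trace(M*W)/(n-k);                  % mean (4.2.3)
   VI   = (trace(M*W*M*W') + trace(M*W*M*W) + trace(M*W)^2) ...
          / ((n-k)*(n-k+2)) - EI^2;          % variance (4.2.5)
   Z    = (IW - EI)/sqrt(VI);                % standardized statistic (4.2.6)
   Pval = erfc(abs(Z)/sqrt(2))               % two-sided normal p-value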
Given this assumption, the appropriate testing procedure is operationalized in the
MATLAB program, moran_test_asymp.m. To apply this program to the English
Mortality example, let y = “lnMI” and x = “lnJarman” (as in Section 1.3 above) and
denote the 5-nearest neighbor weight matrix by W. This test can then be run with the
command:
>> moran_test_asymp(y,x,W);
Note that this program actually runs the OLS regression, calculates the residuals, û , and
then calculates ZW in (4.2.6). The test results are reported as screen outputs:
Moran = 0.46405
Zval = 11.3044
Pval < 10^(-323)
Here the calculated value, ZW , is denoted by Zval and is seen to be more than 11
standard deviations above the mean. This suggests that there is simply no chance at all
that these residual values (shown in Figure 4.4 below) could be spatially independent.14
As mentioned above, the Moran test used in ARCMAP relies on $\tilde{I}_W$ in (4.2.1) rather than $I_W$, and essentially tests whether a given set of spatial data, $y$, can be distinguished from independent normal samples. This procedure can again be illustrated using the English
¹⁴ In fact, the p-value here is so small that it is reported as "0" in MATLAB. In such cases, the program simply reports "Pval < 10^(-323)", which is roughly the smallest number treated as nonzero in MATLAB.
Mortality data. But in doing so, it must be borne in mind that these regression residuals, $\hat{u}$, are now treated as the basic data set "$y$" itself. This is always possible in the case of OLS residuals since by definition the "sample mean", $\frac{1}{n} 1_n'\hat{u}$, of such residuals is identically zero. This is a consequence of the following property of the projection matrix, $M$,

(4.2.8) $MX = [I_n - X(X'X)^{-1}X']\,X = X - X = O_n$

which, together with the definition, $X = (1_n, x_1,..,x_k)$, implies in particular that $M 1_n = 0$. But for any realized value, $y$, of $Y$ it then follows from (4.1.21) that

(4.2.9) $1_n'\hat{u} = 1_n' M y = (M 1_n)' y = 0$

and thus that the sample mean $\frac{1}{n} 1_n'\hat{u}$ is always zero. This means that we can set $\hat{u} = y$ in (4.1.17) with $\bar{y} = 0$ and obtain no immediate contradictions. However, it is important to emphasize that the mean and variance of $I_W$, which depend on the explanatory data, $X$, are in principle very different from those of $\tilde{I}_W$.
Given this observation, we now proceed to test for spatial autocorrelation in $\hat{u}$ by using $\tilde{I}_W$ rather than $I_W$. To do so, the OLS residuals of the regression of Myocardial Infarction rates on the Jarman Index must first be imported to ARCMAP and joined to the Eng_Mort.shp file as a new column, say resid, and saved as a new shapefile, say OLS_resids. These residuals are shown in Figure 4.4 below, with positive residuals in red, negative in blue, and with all values "close to zero" (i.e., within half a standard deviation) shown in white. Here it is clear that while the Jarman Index is certainly a significant predictor of Myocardial Infarction, these unexplained residuals are highly correlated in space. Recall from the simple nearest-neighbor test that this correlation was more significant than the Jarman Index itself. We now show that this degree of significance is in fact even greater than in that simple heuristic test.
[Figure 4.4. OLS residuals for the English Mortality data (scale: 0–100 km).]
For this illustration, we again use the above weight matrix, W, consisting of the first 5
nearest neighbors of each district. To use this weight matrix in ARCMAP, it must first be
converted to a text file using the MATLAB program, arcmap_wt_matrix.m, with the
command:
>> L = arcmap_wt_matrix(W);
Here you must be sure that W is not in “sparse” form (which can be seen by displaying
the first row of W). If it is, then use the command, W = full(W), to convert it to a full
matrix. The matrix output, L, should then have initial rows of the form:
1 2 0.2
1 3 0.2
1 8 0.2
1 15 0.2
1 16 0.2
This shows in particular that the first 5 nearest neighbors of district 1 are districts
(2,3,8,15,16). To import this matrix to ARCMAP, first open it in EXCEL and “clean” the
numerical format to look like the above. ARCMAP also requires an ID for these values,
which can be accomplished in three steps:
(i) First add a new column to the attribute table (say next to resid) labeled ID and
use the calculator (with “short integer”) to create values (1 2 3 …) by setting
ID = [FID] + 1.
(ii) Now add a new row at the top of the matrix, L, in EXCEL, and write the
identifier name, ID, so that L is now of the form:
ID
1 2 0.2
1 3 0.2
1 8 0.2
(iii) Finally, save this as a text file, say Wnn_5.txt, (to indicate that it includes the
first 5 nearest neighbors). This file will be used by ARCMAP below.
To apply the Moran test to the OLS residuals, resid, in ARCMAP, follow the path:
In the “Spatial Autocorrelation” window that opens, fill in the shapefile and field, and be
sure to check the “Generate Report” box. You are now going to use the option,
“GET_SPATIAL_WEIGHTS_FROM_FILE”
Click OK, and when the procedure terminates, you will get a report displayed. The most
relevant portion of this report is shown in Figure 4.7 below:
Before proceeding further, notice that while the value of Moran's I (Index = 0.464048) is the same as in Section 4.2.1 above, the Z value (ZScore = 11.188206) is slightly different. This is because the mean and variance used to standardize Moran's I are different
in these two tests. Rather than using (4.2.3) and (4.2.5), the values used are those in [BG,
p.281], including (4.2.2) for the mean. In the present case, spatial autocorrelation is so
strong that there is little difference between these results. But this need not always be the
case.
It is also important to note that while this report contains all key test information, there is
a much better graphical representation that can be obtained by clicking the “HTML
Report File” that is shown here in blue. This graphic is shown in Figure 4.9 below.
This graphic facilitates the interpretation of the results by making it abundantly clear (in
the present case) that these test results show positive spatial correlation that is even more
significant than that of the heuristic nearest-neighbor approach used previously.
But it should be emphasized that while spatial correlation is visually evident in Figure 4.5 above, this will not always be the case. Moreover, it should also be stressed that the Moran statistics, $I_W$ and $\tilde{I}_W$ (as well as $\hat{\rho}_W$ and $r_W$), are defined only with respect to a given weight matrix, $W$. Hence it is advisable to use a number of alternative weight matrices when testing for spatial autocorrelation. For example, one might try alternative numbers of neighbors (say 4, 5, and 6), or more generally, weight matrices involving both distance-based and boundary-based notions of spatial proximity. A general rule-of-thumb is to try three substantially different matrices, $(W_1, W_2, W_3)$, that cover a range of potentially relevant types of proximity. If the results for all three matrices are comparable (as will surely be the case in the English Mortality example), then this will help to substantiate these results. On the other hand, if there are significant differences in these
results, then an analysis of these differences may serve to yield further information about
the underlying structure of the unobserved spatial dependencies.
Finally, it should be emphasized that, as with all asymptotic tests, these asymptotic Moran tests require that the number of samples (areal units) be "sufficiently large". Moreover, it is also required that the $W$ matrix be "sufficiently sparse" (i.e., consist mostly of zero-valued cells) to ensure that the Central Limit Theorem is working properly. In the present case, with $n = 190$ spatial units and with each row of $W$ containing only 5 nonzero entries, this should be a reasonable assumption. But as with the Clark-Evans tests for random point patterns, it is often difficult to know how well this normal approximation is working.¹⁵
With this in mind, we now develop an alternative testing procedure based on Monte Carlo methods that is more computationally intensive, but requires essentially no assumptions about the distribution of test statistics under $H_0$. The basic idea is very similar to the "random relabeling" test of independence for point patterns in Section 5.6 of Part I. There we approximated the hypothesis of statistical independence by "spatial indistinguishability". Here we adopt the same approach by postulating that if the particular spatial arrangement of sample points doesn't matter, then neither should the labeling of these points. More specifically, in the vector of regression residuals, $\hat{u} = (\hat{u}_1,..,\hat{u}_n)$, it shouldn't matter which residual is labeled as "$\hat{u}_1$", "$\hat{u}_2$", or "$\hat{u}_n$". If so, then regardless of what the joint distribution of these residuals $(\hat{u}_1,..,\hat{u}_n)$ actually is, each relabeling, $(\hat{u}_{\pi(1)},..,\hat{u}_{\pi(n)})$, of these residuals should constitute an equally likely sample from this distribution. So under this spatial invariance hypothesis, $H_{SI}$, we may generate the sampling distribution for any statistic, say $S(\hat{u}_1,..,\hat{u}_n)$, under $H_{SI}$ by simply evaluating $S(\hat{u}_{\pi(1)},..,\hat{u}_{\pi(n)})$ for many random relabelings, $\pi$, of $(\hat{u}_1,..,\hat{u}_n)$. As we have seen for point-pattern tests, this hypothesis can then be rejected if the observed value, $S(\hat{u}_1,..,\hat{u}_n)$, appears to be an unusually high or low value from this sampling distribution.
¹⁵ For further discussion of these issues, see for example Tiefelsdorf (2000, Section 9.4.1), Anselin and Rey (1991), and Anselin and Florax (1995).
Among the three statistics above, Moran's I also appears to be the least sensitive to violations of normality.¹⁶ So while we shall report results for all three statistics, Moran's I tends to be the most reliable of these three.
With this overview, we now outline the steps of this testing procedure for $H_{SI}$, designated as the permutation test of spatial autocorrelation, or more simply the sac-perm test. For convenience we maintain the general notation, $S$, which can stand for either $I$, $\hat{\rho}$, or $r$. Since higher positive values of each of these three statistics correspond to higher levels of positive spatial autocorrelation, we assume that $S$ exhibits this same ordering property. In this setting, our test is designed as a one-tailed test of positive spatial autocorrelation (paralleling the one-tailed test of clustering for K-functions). In particular, significant positive (negative) spatial autocorrelation will again be reflected by low (high) p-values. As with the asymptotic Moran tests above, this sac-perm test is defined with respect to a given spatial weight matrix, $W$. Finally, it should be noted that while the notation, $u$, will be used to represent the given residual data in this procedure, virtually all applications will be in terms of OLS residuals, i.e., $u = \hat{u}$. With these preliminary observations, the steps of this testing procedure are as follows (a MATLAB sketch is given after the list):
Step 1. Let $u^0 = (u_1,..,u_n)$ denote the vector of observed residuals, and construct the corresponding value, $S^0 = S(u^0)$, of the statistic, $S$.

Step 2. Generate $N$ random permutations, $\pi_j = [\pi_j(1),..,\pi_j(n)]$, $j = 1,..,N$, of the index set $(1,..,n)$.¹⁷

Step 3. For each permutation, $\pi_j$, construct the corresponding permuted data vector, $u(\pi_j) = (u_{\pi_j(1)},..,u_{\pi_j(n)})$, and the resulting value of $S$, denoted by $S^j = S[u(\pi_j)]$, $j = 1,..,N$.

Step 4. Rank the values $(S^0, S^1,..,S^N)$ from high to low, so that if $S^j$ is the $k$th highest value then $\mathrm{rank}(S^j) = k$.

Step 5. If $\mathrm{rank}(S^0) = k$, then the p-value for $S^0$ is given by

(4.3.1) $P = k\,/\,(N+1)$

and the results can be interpreted as follows:

(i) If $P$ is low (say $P \le 0.05$) then conclude that there is significantly positive spatial autocorrelation at the $P$-level of significance.

(ii) If $P$ is high (say $P \ge 0.95$) then conclude that there is significantly negative spatial autocorrelation.
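The following fragment sketches this procedure with Moran's I as the statistic $S$ (a minimal sketch; the residual vector, u, and weights matrix, W, are assumed given, and this is not the class program sac_perm itself):

   N  = 9999;                           % number of random permutations
   S0 = (u'*W*u)/(u'*u);                % Step 1: observed value S^0
   S  = zeros(N,1);
   for j = 1:N
       up   = u(randperm(length(u)));   % Steps 2-3: random relabeling
       S(j) = (up'*W*up)/(up'*up);      % permuted statistic S^j
   end
   k = 1 + sum(S >= S0);                % Step 4: rank of S^0
   P = k/(N+1)                          % Step 5: p-value (4.3.1)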
¹⁶ See for example the Monte Carlo results in Anselin and Rey (1991) and Anselin and Florax (1995).
¹⁷ So if $n = 3$ then the first permutation of $(1,2,3)$ might be $[\pi_1(1), \pi_1(2), \pi_1(3)] = (2,3,1)$.
The "cutoff" levels for significantly positive or negative spatial autocorrelation are intentionally left rather vague. Indeed, this sac-perm test is meant to be exploratory in nature.
Recall from the asymptotic Moran tests in Section 4.2 that there was extremely strong autocorrelation in the OLS residuals of the England Myocardial Infarction data when regressed on the Jarman Index. We now reproduce those results in MATLAB using sac_perm. Here the data can be found in the workspace, eng_mort.mat, where the OLS residuals are in the vector, Res. Finally, recall that the desired weight matrix, Wnn_5, was already constructed for Eire in Section 2.2.1 above. So the only difference here is that L is now a 190x2 matrix of coordinates for English Health Districts. To construct a sac-perm test of the residual data, Res, using this weight matrix, we can employ 9999 random permutations with the command:
Here the key outputs are the significance levels for the three test statistics (Moran, corr, rho). Notice that (as expected) these values are each maximally significant, i.e., they are higher than the values for all of the 9999 random permutations simulated. In fact, they are much higher, as can be seen by comparing them with the range of values displayed above. For example, the important Moran value, 0.4640, is seen to be well above the range of values, −0.1444 to 0.1939, reported for all 9999 permutations. Note also that the
ranges of corr and rho are successively larger than this, in a manner consistent with
expression (4.1.26) for the asymptotic Moran test.
As expected, the Moran value, 0.4640 , is the same as that for the asymptotic tests above,
confirming that the same weight matrix and calculation procedure are being used.
Moreover, the extremely significant p-value reported for those tests is consistent with the
present fact that this Moran value is way above the simulated range. This shows that if
the number of permutations were increased well beyond, 9999, the same maximally-
significant results would almost surely persist.
Finally, just to show that approximate normality of $I_W$ persists under random permutations for samples this large, we have plotted the histogram of the 9999 simulated values of $I_W$ (ranging from −0.1444 to 0.1939), together with the observed value, 0.4640, shown in red.
[Figure: Histogram of the 9999 simulated Moran values, with the observed value, 0.4640, shown in red.]
This plot also serves to further dramatize the significance of spatial autocorrelation for
these particular regression residuals.
The above testing procedures are all motivated by the spatial autoregressive model of residual errors. So before moving on to spatial regression analyses of areal data, it is appropriate to consider certain alternative measures of spatial association that are also based on spatial weights matrices. By far the most important of these for our purposes are the so-called G-statistics, developed by Getis and Ord (1992, 1995).¹ These statistics focus on direct associations among (nonnegative) spatial attributes rather than spatial residuals from some underlying explanatory model. For any given set of nonnegative data, $x = (x_1,..,x_n)$, associated with $n$ areal units, together with an appropriate spatial weights matrix, $W = (w_{ij} : i,j = 1,..,n)$, the $G^*$ statistic for $x$ is defined to be:²
(5.1) $G_W^*(x) = \dfrac{\sum_{i=1}^{n}\sum_{j=1}^{n} x_i\, w_{ij}\, x_j}{\sum_{i=1}^{n}\sum_{j=1}^{n} x_i\, x_j} = \dfrac{x'Wx}{(1_n'x)^2}$
As discussed further below, the diagonal elements of $W$ are allowed to be nonzero (since no autoregressive-type relations are involved). However, if one is only interested in relations between distinct areal units, $i \neq j$, so that the diagonal elements of $W$ are treated as zeros, then the resulting statistic is called simply the $G$ statistic, and is given by:
(5.2) $G_W(x) = \dfrac{\sum_{i=1}^{n}\sum_{j \neq i} x_i\, w_{ij}\, x_j}{\sum_{i=1}^{n}\sum_{j \neq i} x_i\, x_j} = \dfrac{x'W^0 x}{(1_n'x)^2 - x'x}$

where $W^0$ denotes $W$ with its diagonal elements set to zero.
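In matrix terms, both statistics are one-line computations (a minimal sketch; the nonnegative data vector, x, and weights matrix, W, are assumed given):

   sx    = sum(x);
   Gstar = (x'*W*x)/sx^2;               % G* statistic (5.1)
   W0    = W - diag(diag(W));           % W with zero diagonal
   G     = (x'*W0*x)/(sx^2 - x'*x);     % G statistic (5.2)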
While the definitions in (5.1) and (5.2) serve to clarify the formal similarities between these indices and those of the previous section, there is an alternative representation which suggests a more meaningful interpretation of these indices. Here we focus on $G^*$. First observe that since $x_i \ge 0$, if we let

(5.1.1) $p_i = \dfrac{x_i}{\sum_{j=1}^{n} x_j} = \dfrac{x_i}{1_n'x}$
¹ The 1992 paper is Reference 7 in the class Reference Materials.
² While our present focus is on areal units, it should be noted that these G-statistics are equally applicable to sets of point locations, such as hospitals or supermarkets within a given urban area.
³ It should be clear from these definitions that a better choice of notation would have been to use $G$ with $W$ and $G^0$ with $W^0$. But at this point, it is best to stay with the standard notation in the literature.
denote the proportion (or fraction) of $x$ in unit $i$, and let $p = (p_1,..,p_n)$ denote the corresponding vector of proportions, then $G^*$ can be rewritten as

(5.1.2) $G_W^* = \sum_{ij} \dfrac{x_i\, w_{ij}\, x_j}{(1_n'x)^2} = \sum_{ij} w_{ij}\left(\dfrac{x_i}{1_n'x}\right)\left(\dfrac{x_j}{1_n'x}\right) = \sum_{ij} w_{ij}\, p_i\, p_j$
Next observe (from the title of their 1992 paper) that Getis and Ord are primarily interested in distance-based measures of proximity or accessibility. In particular, if we let $d_{ij}$ denote some appropriate notion of distance between units $i$ and $j$, and let $a(d)$ denote an appropriate (nonincreasing) accessibility function of distance [such as $a(d) = d^{-\alpha}$ or $a(d) = \exp(-\beta d)$], then we may now interpret each spatial weight as an accessibility measure

(5.1.3) $w_{ij} = a(d_{ij})$

and write

(5.1.4) $G_a^* = \sum_{ij} p_i\, p_j\, a(d_{ij})$
To give a concrete interpretation to $G_a^*$, let us assume for the moment that $x_i$ represents the population in areal unit $i$, so that $p_i$ is the fraction of population in $i$, and $p = (p_1,\ldots,p_n)'$ is the population distribution among areal units. In this context one may ask: What is the expected accessibility between two randomly sampled individuals from this distribution? To answer this question, observe that since $p_i$ is by definition the probability that a randomly sampled individual is from unit $i$, it follows by independence that $p_i p_j$ must be the joint probability that these two random samples are from units $i$ and $j$, respectively. So if accessibility is treated as a random variable with values, $a(d_{ij})$, for each pair of areal units, then it follows from (5.1.4) that $G_a^*$ must be the expected value of this random variable, i.e.,

(5.1.5)   $G_a^* \;=\; E(a)$

Thus the value of $G_a^*$ is precisely the answer to the question above, i.e., the expected accessibility between two randomly sampled individuals in this population.
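To make this interpretation concrete, the following MATLAB sketch (not one of the class programs; all inputs here are simulated purely for illustration) computes $G_a^*$ directly as the expected accessibility in (5.1.4) and (5.1.5):

n = 50;                                  % hypothetical number of areal units
L = rand(n,2);                           % hypothetical centroid coordinates
x = rand(n,1);                           % hypothetical nonnegative data (e.g., populations)
D = sqrt((L(:,1)-L(:,1)').^2 + (L(:,2)-L(:,2)').^2);   % centroid distances d_ij
theta = 10;                              % exponent in a(d) = exp(-theta*d)
A = exp(-theta*D);                       % accessibility values a(d_ij); diagonal = exp(0) = 1
p = x/sum(x);                            % proportions p_i = x_i/(1n'x), as in (5.1.1)
G_star = p'*A*p;                         % G_a* = sum_ij p_i p_j a(d_ij) = E(a)

Note that the within-unit accessibilities, $a(d_{ii}) = 1$, on the diagonal of A are deliberately retained here, in keeping with the discussion of $G^*$ versus $G$ in the next paragraph.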
In terms of this particular example, there are several additional features that should be
noted. First it should be clear that two individuals in the same areal unit are by definition
maximally accessible to one another. So any measure of overall accessibility will surely
be distorted if these relations are omitted – as in G statistics. It is for this reason that our
focus is almost exclusively on G * statistics. Notice also from the definitions of a and p
that $G_a^*$ must achieve its maximum value when all population is concentrated in the smallest of these $n$ areal units. This suggests that $G_a^*$ is more accurately described as a measure of spatial concentration than of association.
More generally, these interpretations carry over to essentially any nonnegative data. For example, if $x_i$ denotes income or crime levels, then $G_a^*$ represents the spatial concentration of income or crime. But here one must be careful to distinguish between extensive and intensive quantities. For example, while the proportion of total income (dollars) in areal unit $i$ is straightforward, the "proportion" of per capita income is less clear. Hence one must treat such intensive quantities in terms of density units that can be added. So, for example, if per capita income is twice as high in $i$ as in $j$, this would here be taken to mean that the income density in $i$ is twice that in $j$. So a better interpretation of $G_a^*$ in this case would be in terms of the spatial concentration of income density. In any case, it is certainly meaningful to ask whether certain spatial patterns of per capita income are more concentrated than others.
Finally, we should add that even for spatial weights matrices, $W$, that are not distance based (such as spatial contiguity matrices), such weights can still be viewed as measures of "closeness" in an appropriate sense. So in the analyses to follow, we shall continue to interpret $G_W^*$ in (5.1.2) as measuring the degree of spatial concentration of the quantities, $x = (x_1,\ldots,x_n)'$.
As one application of this testing procedure, we again consider the English Mortality data in Figure 1.9 above (p.III.1-5). For purposes of illustration, we here consider a new type of spatial weights matrix, namely exponential-distance weights [expression (2.1.13)], which is also constructed by using the MATLAB program, dist_wts.m. Starting with exponential-distance weights, say

(5.2.1)   $w_{ij} \;=\; \exp(-\theta\,d_{ij})$

we first note that since the negative exponential function approaches zero very rapidly, it is often advisable to normalize distance data to the unit interval to avoid vanishingly
small values.4 To do so, we first identify the largest possible centroid distance, $d_{\max}$, between all pairs of Health Districts, and then convert centroid distances, $d_{ij}$, to the unit interval by setting

(5.2.2)   $d_{ij}^* \;=\; d_{ij}/d_{\max}$

so that $0 \le d_{ij}^* \le 1$. Using this normalization, we can then design exponential-distance weights to yield some appropriate "effective bandwidth" by simply plotting the function $\exp(-\theta d)$, $0 \le d \le 1$, for various choices of $\theta$. For our present purposes, the value $\theta = 10$ yields the plot shown in Figure 5.1 below,5 which is seen to yield an effective bandwidth of about $d = 1/2$ (shown by the red arrow). In terms of our normalization in (5.2.2), this yields the familiar value, $d_{\max}/2$:
[Figure 5.1. Plot of $\exp(-10\,d)$ for $0 \le d \le 1$, with the effective bandwidth, $d = 1/2$, marked by a red arrow.]
Using the workspace, eng_mort.mat, the corresponding spatial weights matrix, W1, is constructed by using dist_wts.m. Here L is the 190 x 2 matrix of centroid coordinates, '4' indicates that exponential-distance weights are option 4 in dist_wts.m, '10' denotes the exponent value, and (most importantly) '1' denotes the option to leave all diagonal elements as calculated [in this case, $\exp(0) = 1$]. Note also that since these weights are already guaranteed to lie in the unit interval (as in Figure 5.1), there is no need to consider any additional normalizations (as provided by the info.norm option).
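For concreteness, the following sketch reproduces this construction directly (rather than through dist_wts.m, whose calling conventions are described above but whose code is not reproduced here):

D = sqrt((L(:,1)-L(:,1)').^2 + (L(:,2)-L(:,2)').^2);   % centroid distances from L
Dstar = D/max(D(:));                 % normalized distances d*_ij, as in (5.2.2)
theta = 10;                          % exponent value from Figure 5.1
W1 = exp(-theta*Dstar);              % exponential-distance weights; diagonal = exp(0) = 1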
4 For example, if distance were in meters, then while a distance of 800 meters is not very large, you will discover that MATLAB yields the negative exponential value, exp(-800) = 0. Moreover, this is not "rounded" to zero, but is actually so small a number that it is beyond the limits of double-precision arithmetic to detect.
5 This plot is obtained with the commands: x = [0:.01:1]; y = exp(-10*x); plot(x,y,'k','Linewidth',5);

Finally, denoting the myocardial infarction rates
by z = mort(:,3), the test of spatial concentration using g_perm.m is performed with the
command:
>> g_perm(z,W1,999);
The results of this test (with 999 random permutations of Health Districts) are shown below:
Notice first that both $G$ and $G^*$ values are reported, even though $G^*$ is of primary interest for our purposes. Next observe that, not surprisingly, these myocardial infarction rates are maximally significant given 999 permutations, and that in this case there is very little disagreement between $G$ and $G^*$.
For purposes of comparison, we also try the more local spatial weights matrix, Wnn_5, already employed in Section 4.3.2 above to test for spatial autocorrelation in the regression residuals for this same data. Here the results of using the command,

>> g_perm(z,Wnn_5,999);

are again shown below:
As with spatial autocorrelation, it is always a good idea to use several spatial weights matrices to check the robustness of the results. Here it is clear from the very different (implicit) bandwidths used in these two examples that the significance of spatial concentration in this case is firmly established.
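The logic of this permutation test is itself simple to sketch in MATLAB. The following is only an illustration of what g_perm.m presumably computes for the $G^*$ case (z, W and the permutation count are assumed to be in the workspace; the program itself also reports G):

N = 999;                                 % number of random permutations
n = length(z);
Gfun = @(x) (x'*W*x)/(sum(x)^2);         % G* statistic, as in (5.1)
G_obs = Gfun(z);                         % observed value
G_sim = zeros(N,1);
for k = 1:N
    G_sim(k) = Gfun(z(randperm(n)));     % value under a random relabeling of units
end
k_obs = 1 + sum(G_sim >= G_obs);         % rank of observed value (rank 1 = highest)
p_value = k_obs/(N+1)                    % p-value, as in (5.3.6) below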
6 To see this, simply Google "How High/Low Clustering (Getis-Ord General G) works".
For the sake of comparison with the MATLAB results above, we have used exactly the same procedure developed in Section 4.2.2 above for testing spatial autocorrelation in terms of Wnn_5. Here the only difference is that General G is used rather than Moran's I. The graphical output for this application is shown in Figure 5.2 below:
Given the z-score of 9.05, there is a less than 1% likelihood that this
high-clustered pattern could be the result of random chance.
Notice from the value of G = 0.005676 that this is the same value (when rounded) as that obtained in MATLAB above. Notice also that the result here is in terms of the asymptotic normal approximation of this G statistic (obtained by Getis and Ord, 1992, under the same random-permutation hypothesis as above), and is thus reported as a z-score (9.0538) with an extremely small p-value. This again suggests that the MATLAB results would continue to exhibit maximal significance for many more permutations than 999.
Observe that both $G_W^*$ and $G_W$ are decomposable into local measures of concentration about each location $i$ as follows. Let the local $G_W^*$ value at $i$ be defined by

(5.3.1)   $G_W^*(i) \;=\; \dfrac{\sum_{j=1}^n w_{ij}\,x_j}{\sum_{j=1}^n x_j} \;=\; \sum_{j=1}^n p_j\,w_{ij}$

and similarly, let the corresponding local $G_W$ value at $i$ be defined by

(5.3.2)   $G_W(i) \;=\; \dfrac{\sum_{j\neq i} w_{ij}\,x_j}{\sum_{j\neq i} x_j}$
where, again, our interest focuses almost entirely on $G_W^*(i)$. Note in particular from (5.1.2) that these local measures of concentration are related to $G_W^*$ by the identity,7

(5.3.3)   $G_W^* \;=\; \sum_{i=1}^n p_i \sum_{j=1}^n p_j\,w_{ij} \;=\; \sum_{i=1}^n p_i\,G_W^*(i)$

Thus $G_W^*$ can be viewed as a weighted average of these local concentration measures, where the weights, $p_i$, are simply the proportions of $x$ in each areal unit $i$. In terms of the probability interpretation above, if we again consider accessibility weights of the form, $w_{ij} = a(d_{ij})$, then $G_a^*(i)$ is precisely the expected accessibility from a randomly sampled unit of $x$ in $i$ to any other randomly sampled unit, i.e., the conditional expected accessibility,

(5.3.4)   $G_a^*(i) \;=\; \sum_{j=1}^n p_j\,a(d_{ij}) \;=\; E(a\,|\,i)$

In these terms, it follows from (5.1.5) together with (5.3.4) that the decomposition in (5.3.3) is simply an instance of the standard conditional-expectation identity:

(5.3.5)   $E(a) \;=\; \sum_i p_i\,E(a\,|\,i)$
But the real interest in these local measures is that they provide information about where
concentration is and is not occurring.8 In particular, by assigning p-values indicating the
significance of local concentration at each areal unit, one can map the results and
visualize the pattern of these significance levels. Those areas of high concentration are
generally referred to as “hot spots” (in a manner completely analogous to strong clusters
in point patterns).
7 It is of interest to note that this decomposition is an instance of what Anselin (1995) has called Local Indicators of Spatial Association (LISA).
8 Indeed, the original paper by Getis and Ord (1992) starts with these local indices, and only groups them into a "General G" statistic in a later section of the paper.
In this setting, one may test for the presence of such "hot spots" with respect to a data set, $(x_i : i = 1,\ldots,n)$, by employing essentially the same random-permutation test as above. In particular, for any random permutation, $\pi = (\pi_1,\ldots,\pi_n)$, of the areal unit indices $(1,\ldots,n)$, one may compute for each unit $i$ the associated statistic, $G_W^*(i)$, and compare this observed value with the distribution of values, $G_W^*(i,\pi^k)$, for $N$ random permutations, $\pi^k = (\pi_1^k,\ldots,\pi_n^k)$, $k = 1,\ldots,N$. Here it is important to note that the index $i$ is itself included in this permutation. For if the value of $x_i$ is relatively large, then to reflect the significance of this local concentration at $i$, it is important to allow smaller values to appear at $i$ in other random permutations.

If the observed value of $G_W^*(i)$ has rank $k_i$ among all values $[G_W^*(i), G_W^*(i,\pi^1),\ldots,G_W^*(i,\pi^N)]$ (with rank 1 denoting the highest value), then the significance of concentration at $i$ is again represented by the p-value,

(5.3.6)   $P_i \;=\; \dfrac{k_i}{N+1}\,,\quad i = 1,\ldots,n$
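In MATLAB, this local testing procedure can be sketched as follows (again only an illustration of what the class program presumably does, with z and W assumed to be in the workspace):

N = 999;
n = length(z);
Gloc = @(x) (W*x)/sum(x);            % vector of local values G_W*(i), as in (5.3.1)
G_obs = Gloc(z);
k_i = ones(n,1);                     % each rank k_i starts at 1 (the observed value itself)
for k = 1:N
    k_i = k_i + (Gloc(z(randperm(n))) >= G_obs);   % count permuted values at least as large
end
P = k_i/(N+1);                       % local p-values P_i, as in (5.3.6)
GP = [G_obs, P];                     % (n x 2) output matrix of local statistics and p-values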
This testing procedure is implemented for local $G^*$-statistics in the MATLAB program, g_perm_loc.m. Here it is assumed that tests for all areal units, $i = 1,\ldots,n$, are to be done. Hence the outputs contain the local $G^*$-statistic and P-value for each areal unit. To illustrate the use of this local-testing procedure, it is convenient to continue with the English Mortality example above. For the exponential-distance weights matrix, W1, constructed above, together with the myocardial infarction data, z, the command:

>> GP1 = g_perm_loc(z,W1,999);
yields a (190 x 2) output matrix, GP1 $= [(G_i^*, P_i) : i = 1,\ldots,190]$, containing the local $G^*$-statistic, $G_i^*$ $[= G_{W1}^*(i)]$, and P-value, $P_i$, for each of the 190 districts, based on 999 random permutations. These values were imported to ARCMAP and displayed in the map document, Eng_mort.mxd, as shown in Figures 5.3 and 5.4 below. Figure 5.3 plots the actual values of $G_i^*$ in each areal unit, $i$, with darker green areas denoting higher values. The corresponding P-values are shown in Figure 5.4, where darker red shows the areas of most significance (and where only the legend for P-values is shown). As expected, there is seen to be a rough correspondence between high local $G^*$ values and more significant areas of concentration.
[Figures 5.3 and 5.4. Local $G^*$-values and associated P-values for W1. P-value legend: .001-.01, .01-.05, .05-.10, .10-.20, .20-1.00.]
Notice in particular that the local $G^*$-values reflect the general concentration of myocardial infarction rates in the north that is seen in the original data set [Figure 1.9 (p.III.1-5)], but are now smoothed by the exponentially weighted averages in the local $G^*$ statistics. However, this "north-south" divide ([BG], p. 279) is seen to be much more dramatic in the associated P-values, where the darkest region, denoting P-values less than .01, now covers all of Northern England.
Turning next to the nearest-neighbor weights matrix, Wnn_5, the test results are now obtained with the command,

>> GP2 = g_perm_loc(z,Wnn_5,999);

which again yields a (190 x 2) output matrix, GP2 $= [(G_i^*, P_i) : i = 1,\ldots,190]$, containing the local $G^*$-statistics and P-values for this case. By again importing these values to ARCMAP, we obtain the comparable displays shown in Figures 5.5 and 5.6 below.
Notice that the key difference between these two sets of results is the additional local variation in values created by the smaller number of neighbors used by Wnn_5. For example, while each areal unit has only 5 neighbors in Wnn_5, if we approximate the bandwidth of the exponential matrix, W1, by counting only weights with $w_{ij} \ge .01$, then some areal units $i$ still have more than 70 neighbors. So the degree of smoothing is much greater in the $G_i^*$ values associated with W1. But still, the highest $G_i^*$ values (and most significant $P_i$ values) continue to be in the north, and in fact are seen to agree more closely with those concentrations of myocardial infarction rates seen in the original data, such as the concentration seen around Lancashire county [compare Figure 1.6 (p.I.1-3) with Figure 1.9 (p.III.1-5)]. So it would appear that 5 nearest neighbors yield a more appropriate scale for this analysis.
[Figures 5.5 and 5.6. Local $G^*$-values and associated P-values for Wnn_5, with the same P-value legend: .001-.01, .01-.05, .05-.10, .10-.20, .20-1.00.]
An alternative test using $G^*$ is available in ARCMAP. This procedure can be found on the ArcToolbox path:

Spatial Statistics Tools > Mapping Clusters > Hot Spot Analysis (Getis-Ord Gi*)

To employ this procedure, we will again use the English Mortality data with the nearest-neighbor spatial weights matrix, Wnn_5, already constructed for ARCMAP in Section 4.3.2. In the Hot Spot window that opens, type the appropriate entries,
where the specific path names will of course vary. Click OK, and a shapefile will be
constructed and added to the Table of Contents in your map document. The result
displayed is shown in Figure 5.7 below (where the legend from the Table of Contents has
been added).
As with the General G test in Figure 5.2 above, this test is based on the asymptotic normal approximation of the local $G^*$-statistics under the same random-permutation hypothesis as above. So the values shown in the legend above are actually in terms of the z-scores obtained for each test. For example, the familiar "1.96-2.58" value in the second-to-last red entry indicates that myocardial infarction rates for districts with this color are significantly concentrated at between the .05 and .01 levels. (The actual p-values are listed in the Attribute Table for this map.) Here it is important to note that two-sided tests are being performed. So for a corresponding one-sided test (as done above), these values are actually twice as significant (i.e., with one-sided p-values between .025 and .005). So even though the red areas look slightly "smaller" than those in Figure 5.6, the results are actually more significant than those of MATLAB, in a manner consistent with all of the asymptotic tests we have seen so far. Notice also that because two-sided tests are being done, it is also appropriate to show areas with significantly less concentration than would be expected under the null hypothesis. These districts are shown in blue.
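The correspondence between these z-score ranges and p-values is easily checked in MATLAB (a sketch only; normcdf is a Statistics Toolbox function, and the z-score here is purely illustrative):

z = 1.96;                            % illustrative z-score from the legend range
p_two = 2*(1 - normcdf(z))           % two-sided p-value, as reported by ARCMAP (.05 here)
p_one = 1 - normcdf(z)               % corresponding one-sided p-value (.025 here)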
Before leaving this topic, it is instructive to consider an additional example that illustrates
the advantage of local G * -statistics over G-statistics for the analysis of spatial
concentration. Here we construct a fictitious population distribution for the case of Eire in
which it is assumed that there is a single major concentration of population in one county
(FID 18 = “Offaly” County), as shown in Figure 5.8 below.9
9 In particular, about 25% of the population has been placed in this county, and the rest has been distributed randomly (under the additional condition that no other county contains more than 5% of the population).
Here an exponential-distance matrix has been constructed similar to W1 above (to ensure a smooth representation), and the local $G^*$-statistics for this case are shown in Figure 5.9. Notice that these $G^*$-values roughly approximate the concentration of the original data, but are somewhat smoother (as was also seen for the myocardial infarction data above using W1). The corresponding P-values (again for 999 simulations) are shown in Figure 5.10 below.

These results confirm that Offaly County is the overwhelmingly most significant concentration of population (P-value $\le$ .02), with several of the surrounding counties
gaining significance from their proximity to Offaly. However, if one carries out the same test procedure using local $G$-statistics, then a substantially different picture emerges. Here Offaly County is not in the least significant, but two of its immediate neighbors are. The reason, of course, is that by setting the matrix diagonal to zero, the population of Offaly itself is ignored in the local $G$-test for this county. Moreover, since its neighbors do not exhibit unusually high population concentrations, the local $G$-value for Offaly will not be unusually high compared to the corresponding values for random permutations of county populations. However, its neighbors are still likely to exhibit significantly high values, because their proximity to the population concentration in Offaly yields unusually high local $G$-values compared to those for random permutations. Hence the anticipated result here is something like a "donut hot spot", with the "donut hole" corresponding to Offaly. This is basically what is seen in Figure 5.10, except that some neighbors are closer (in exponential proximities) to Offaly than others. This extreme example serves to underscore the difference between these two local statistics, and shows that local $G^*$-statistics are far more appropriate for identifying significant local concentrations.
The primary models of interest for areal data analysis are regression models. In the same way that geo-regression models were used to study relations among continuous-data attributes of selected point locations (such as the California rainfall example), the present spatial regression models are designed to study relations among attributes of areal units (such as the English Mortality example in Section 1.3 above). The key difference is of course the underlying spatial structure of this data. In the case of geo-regression, the fundamental spatial assumption was in terms of covariance stationarity, which together with multi-normality, enabled the full distribution of spatial residuals to be modeled by means of variograms and their associated covariograms. In the present case, this stationarity assumption is replaced by spatial autoregressive hypotheses that are based on specific choices of spatial weights matrices, as developed in Section 5. Here we start with the most fundamental spatial autoregressive hypothesis in terms of the regression residuals themselves.
The most direct analogue to geo-regression is the spatial regression already developed in Section 3 above. In particular, if we start with the regression model in (3.1) above, i.e.,

(6.1.1)   $Y_i \;=\; \beta_0 + \sum_{j=1}^k \beta_j\,x_{ij} + u_i\,,\quad i = 1,\ldots,n$

and postulate that dependencies among the regression residuals (errors), $u_i$, at each areal unit $i$ are governed by the spatial autoregressive model in (3.5) and (3.6), i.e., by

(6.1.2)   $u_i \;=\; \rho \sum_{j=1}^n w_{ij}\,u_j + \varepsilon_i\,,\quad \varepsilon_i \sim N(0,\sigma^2)\,,\quad i = 1,\ldots,n$

for some choice of spatial weights matrix, $W = (w_{ij} : i,j = 1,\ldots,n)$ [with $\mathrm{diag}(W) = 0$], then the resulting model is summarized in matrix form by (3.2) and (3.9) as:

(6.1.3)   $Y \;=\; X\beta + u\,,\quad u \;=\; \rho W u + \varepsilon\,,\quad \varepsilon \sim N(0,\sigma^2 I_n)$

and is now designated as the Spatial Errors Model (also denoted as the SE-model or simply SEM).1
As mentioned above, this constitutes the most direct application of the spatial autoregressive model in Section 3. In essence, it is hypothesized here that all spatial
dependencies are among the unobserved errors in the model (and hence the name, SEM).
In the case of the English Mortality data for example, it is clear that while the Jarman
index includes many socio-economic and demographic factors influencing rates of
myocardial infarctions, there are surely other factors involved. Moreover, since many of
these excluded factors will exhibit spatial dependencies, such dependencies will necessarily be reflected by the corresponding residual errors, $u$, in (6.1.3).

1 See footnote 3 below for further discussion of this terminology.
Before considering other types of autoregressive dependencies, it is of interest to reformulate this model as an instance of the General Linear Regression Model. First, if for notational convenience we now let

(6.1.4)   $B_\rho \;=\; I_n - \rho W$

then the residual relation in (6.1.3) can be solved to give

(6.1.5)   $u \;=\; B_\rho^{-1}\varepsilon$

Thus by the Invariance Theorem for multi-normal distributions, it follows at once from the multi-normality of $\varepsilon$ that $u$ is also multi-normal with covariance given by2

(6.1.6)   $\mathrm{cov}(u) \;=\; \sigma^2 (B_\rho' B_\rho)^{-1}$

So if we denote the corresponding spatial covariance structure by

(6.1.7)   $V_\rho \;=\; (B_\rho' B_\rho)^{-1}$

then the SE-model takes the General Linear Model form,

(6.1.8)   $Y \;=\; X\beta + u\,,\quad u \sim N(0,\sigma^2 V_\rho)$

Finally, there is a third equivalent way of writing this SE-model which is also useful for analysis. If we simply substitute (6.1.5) directly into (6.1.3) and eliminate $u$ altogether, then this same model can be written as

(6.1.9)   $Y \;=\; X\beta + B_\rho^{-1}\varepsilon\,,\quad \varepsilon \sim N(0,\sigma^2 I_n)$

Since all simultaneous relations, $u = \rho W u + \varepsilon$, have been eliminated, expression (6.1.9) is usually called the reduced form of (6.1.3).
2 Here we have used the matrix identities, $(A')^{-1} = (A^{-1})'$ and $A^{-1}B^{-1} = (BA)^{-1}$, which are established in the Appendix.
(6.2.1)   $Y_i \;=\; \beta_0 + \sum_{j=1}^k \beta_j\,x_{ij} + \rho \sum_{h=1}^n w_{ih}\,Y_h + \varepsilon_i\,,\quad i = 1,\ldots,n$

where again $\varepsilon_i \sim N(0,\sigma^2)$, $i = 1,\ldots,n$. Here the autoregressive term, $\rho\sum_{h=1}^n w_{ih}Y_h$, reflects possible dependencies of $Y_i$ on values, $Y_h$, in other areal units. A standard example of (6.2.1) is in terms of housing prices. If the relevant areal units are, say, city blocks within a metropolitan area, and if $Y_i$ is interpreted as the average price (per square foot) of housing on block $i$, then in addition to other housing attributes, $(x_{ij} : j = 1,\ldots,k)$, of block $i$, such prices may well be influenced by prices in surrounding blocks. So the relevant autoregressive relations here are among the housing prices, $Y$, and not the spatial residuals, $\varepsilon$. Such relations are typically called spatial lag relations, which motivates the name spatial lag model (SLM).4
4 At this point, it should be emphasized that (much like "variograms" versus "semivariograms" in the Kriging models of Part II), there is no general agreement regarding the names of various spatial regression models. For example, while we have reserved the term Spatial Autoregressive Model (SAR) for the basic residual process in expression (3.9) above, this term is used by LeSage and Pace (2009) for the spatial lag model (SLM). Our present terminology follows that of the open-source software, GEODA (to be discussed later), and has the advantage of clarifying where the basic spatial autoregressive model is being applied, i.e., to the error terms in SEM and to the dependent variable in SLM.
5 Relaxations of this assumption will be considered in the "combined model" of Section 6.3.1 below.
This can be seen more clearly by formalizing this model in matrix terms and solving for its reduced form. By employing the same notation as in (6.1.3), the Spatial Lag Model (SL-model or simply SLM) can be written as

(6.2.3)   $Y \;=\; \rho WY + X\beta + \varepsilon \;\;\Rightarrow\;\; (I_n - \rho W)Y \;=\; X\beta + \varepsilon \;\;\Rightarrow\;\; B_\rho Y \;=\; X\beta + \varepsilon$

with corresponding reduced form,

(6.2.4)   $Y \;=\; B_\rho^{-1}X\beta + B_\rho^{-1}\varepsilon$

In this reduced form, it should now be clear that the spatial lag term, $\rho WY$, in (6.2.2) is not simply another "regression term".
Finally, one can also view this model as an instance of the Generalized Linear Regression Model, though the correspondence is not as simple as that of SEM. In particular, if we now treat the spatial dependency parameter, $\rho$, as a known quantity, or more properly, if we condition (6.2.4) on a given value of $\rho$, then [in a manner similar to the Cholesky transformation in expression (7.1.16) of Part II] we can treat

(6.2.5)   $X_\rho \;=\; B_\rho^{-1}X$

as a transformed data set, and again use (6.1.5) through (6.1.7) to write (6.2.4) as

(6.2.6)   $Y \;=\; X_\rho\beta + u\,,\quad u \sim N(0,\sigma^2 V_\rho)$

with spatial covariance structure, $V_\rho$, again given by (6.1.7). The key difference here is that $\rho$ is no longer simply an unknown parameter in the covariance matrix, $V_\rho$, but now also appears in $X_\rho$. So while (6.2.6) does permit the GLS methods in Section 7.1.1 in
Part II to also be applied to SL-models, these applications are somewhat more restrictive
than for SE-models.
One final difference between SE-models and SL-models that needs to be emphasized is the interpretation of the standard beta coefficients, $\beta$, in (6.1.8) versus (6.2.4) [or equivalently, (6.1.9) versus (6.2.6)]. Recall that one of the appealing features of OLS is the simple interpretation of beta coefficients. For example, consider an OLS version of the housing price example above, namely

(6.2.7)   $Y_i \;=\; \beta_0 + \sum_{j=1}^k \beta_j\,x_{ij} + \varepsilon_i\,,\quad i = 1,\ldots,n$

with $\varepsilon_i \sim N(0,\sigma^2)$, $i = 1,\ldots,n$. If, say, $x_{i1}$ denotes the average age of housing on block $i$ (as a surrogate for structural quality), then one would expect that $\beta_1$ is negative. In particular, since

(6.2.8)   $E(Y_i\,|\,x_{i1},\ldots,x_{ik}) \;=\; \beta_0 + \sum_{j=1}^k \beta_j\,x_{ij}$

the value of $\beta_1$ should indicate the expected decrease in mean housing prices on block $i$ resulting from a one-year increase in the average age of houses on block $i$. More generally, these marginal changes can be expressed as partial derivatives of the form:

(6.2.9)   $\dfrac{\partial}{\partial x_{ij}}\,E(Y_i\,|\,x_{i1},\ldots,x_{ik}) \;=\; \beta_j\,,\quad j = 1,\ldots,k$
Of course this OLS model ignores spatial dependencies between blocks. So if (6.2.7) is reformulated as an SE-model to account for such dependencies, say of the form in (6.1.8):

(6.2.10)   $Y_i \;=\; \beta_0 + \sum_{j=1}^k \beta_j\,x_{ij} + u_i\,,\quad (u_1,\ldots,u_n)' \sim N(0,V)$

then since $E(u_i) = 0$, $i = 1,\ldots,n$, it follows that (6.2.8) and (6.2.9) continue to hold. Thus, while certain types of spatial dependencies have been accounted for, the interpretation of betas (such as $\beta_1$ above) continues to hold.
However, if the major spatial dependencies are among these price levels themselves, so that an SL-model is more appropriate, then the situation is far more complex. This can be seen by observing from the reduced form in (6.2.4), together with the "ripple" decomposition of $(I_n - \rho W)^{-1}$ in expression (3.3.26) above, that6
6 Here it is implicitly assumed that the convergence condition, $|\rho| < 1/\|W\|$, holds for $\rho$ and $W$.
(6.2.11)   $E(Y\,|\,X) \;=\; B_\rho^{-1}X\beta \;=\; X\beta + \rho WX\beta + \rho^2 W^2 X\beta + \cdots$
So the partial derivative in (6.2.9) cannot even be defined without specifying all attributes on all blocks. Moreover, while (6.2.8) implies that there are no interaction effects between blocks, i.e., that the partial derivatives of $E(Y_i\,|\,x_{i1},\ldots,x_{ik})$ with respect to housing attributes on any other block are identically zero, this is no longer true in (6.2.11). For example, if the age of housing on block $i$ is increased, then this not only has a direct effect on expected mean prices in block $i$, but also has indirect effects on prices in all other blocks. Moreover, such indirect effects in turn affect prices in $i$. So this spatial ripple effect leads to complex interdependencies that must be taken into account when interpreting each beta coefficient. These effects can be summarized by analyzing (6.2.11) in more detail. To do so, we now employ the following notation. For any $n \times m$ matrix, $A = (a_{ij} : i = 1,\ldots,n,\ j = 1,\ldots,m)$, let $A(i,j) = a_{ij}$ denote the $(ij)$-th element of $A$, and let $A(\cdot,j)$ denote the $j$-th column of $A$. In these terms, (6.2.11) can be decomposed as follows:

(6.2.12)   $E(Y\,|\,X) \;=\; B_\rho^{-1}X\beta \;=\; \sum_j \beta_j\,B_\rho^{-1}X(\cdot,j)$

so that, for each areal unit $i$,

(6.2.13)   $E(Y_i\,|\,X) \;=\; \sum_j \beta_j \sum_{h=1}^n B_\rho^{-1}(i,h)\,X(h,j)$
In terms of this decomposition, it now follows that the desired partial derivatives can be obtained directly. First, as a parallel to (6.2.9), we see that

(6.2.14)   $\dfrac{\partial}{\partial x_{ij}}\,E(Y_i\,|\,X) \;=\; \beta_j\,B_\rho^{-1}(i,i)$

So this marginal effect depends not just on $\beta_j$ but also on the $i$-th diagonal element of $B_\rho^{-1}$, which has the more explicit form,

(6.2.15)   $B_\rho^{-1}(i,i) \;=\; [I_n + \rho W + \rho^2 W^2 + \cdots](i,i) \;=\; 1 + \rho^2 W^2(i,i) + \cdots$
where the last line follows from the zero-diagonal assumption on $W$. But since $\rho^2 W^2(i,i)$ together with all higher-order effects are positive, it is clear that the effect of each $\beta_j$ is being inflated by these spatial effects, as described informally above. Moreover, it is also clear from (6.2.13) that expected mean prices in $i$ are affected by housing attribute changes in other blocks. In particular, for attribute $j$ in block $h$, it now follows that

(6.2.16)   $\dfrac{\partial}{\partial x_{hj}}\,E(Y_i\,|\,X) \;=\; \beta_j\,B_\rho^{-1}(i,h)$

Total effects on $E(Y_i\,|\,X)$ of attributes in the same areal unit $i$ are designated as direct effects by LeSage and Pace (2009, Section 2.7.1), and similarly, the total effects of attributes in different areal units are designated as indirect effects. For further analysis of these effects, see LeSage and Pace (2009).
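As a minimal sketch of these summary measures (assuming the convergence condition of footnote 6 holds, and that illustrative values of rho, W, and a given coefficient beta_j are in the workspace), the average direct and indirect effects of LeSage and Pace can be computed in MATLAB as:

n = size(W,1);
Binv = (eye(n) - rho*W)\eye(n);      % B_rho^{-1} = (I_n - rho*W)^{-1}
S = beta_j*Binv;                     % matrix of effects of attribute j on E(Y|X)
direct = mean(diag(S));              % average direct (own-unit) effect
total = mean(sum(S,2));              % average total effect across all units
indirect = total - direct;           % average indirect (spillover) effect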
While there are many variations on the SE-model and SL-model above, we focus only on those that are of particular interest for our purposes.

When developing the SL-model above, a question that naturally arises is why all unobserved factors should be treated as spatially independent. Clearly it is possible to have spatial autoregressive dependencies both among the $Y$ variables and the residuals, $\varepsilon$. If we now distinguish between these by letting $M$ and $\lambda$ denote the spatial weights matrix and spatial dependency parameter for the spatial-error component, then one may combine these two models as follows,7

(6.3.1)   $Y \;=\; \rho WY + X\beta + u\,,\quad u \;=\; \lambda M u + \varepsilon\,,\quad \varepsilon \sim N(0,\sigma^2 I_n)$

which together yield

(6.3.2)   $(I_n - \rho W)Y \;=\; X\beta + (I_n - \lambda M)^{-1}\varepsilon$

However, our primary interest in this model will be to construct comparative tests of SEM versus SLM as instances of the same model structure. Hence we shall focus on the special case with $M = W$,

(6.3.3)   $(I_n - \rho W)Y \;=\; X\beta + (I_n - \lambda W)^{-1}\varepsilon$
7 This model has been designated by Kelejian and Prucha (2010) as the SARAR(1,1) model, standing for Spatial Autoregressive Model with Autoregressive disturbances of order (1,1).
which we now designate as the combined model, with corresponding reduced form:

(6.3.4)   $Y \;=\; (I_n - \rho W)^{-1}\left[X\beta + (I_n - \lambda W)^{-1}\varepsilon\right]$

So for any given spatial weights matrix, $W$, the corresponding SE-model (SL-model) is seen to be the special case of (6.3.3) with $\rho = 0$ ($\lambda = 0$).

One additional point worth noting here is that while this combined model is mathematically well defined, and can in principle be used to obtain joint estimates of both $\rho$ and $\lambda$, these joint estimates are in practice often very unstable. In particular, since both $\rho$ and $\lambda$ serve as dependency parameters for the same matrix, $W$, they in fact play very similar roles in (6.3.4). But, as will be seen in Section 10.4 below, this instability will turn out to have little effect on the usefulness of this model for comparing SEM and SLM.
A second model that will prove useful for our comparisons of SEM and SLM can again be motivated by the housing price example above. In particular, if housing prices, $Y_i$, in block group $i$ are influenced by housing prices in neighboring block groups, then it is not unreasonable to suppose that they may also be influenced by other housing attributes in these block groups. If so, then a natural extension of the SL-model in (6.2.1) would be to include these spatial effects as additional terms, i.e.,

(6.3.5)   $Y_i \;=\; \beta_0 + \sum_{j=1}^k \beta_j\,x_{ij} + \rho \sum_{h=1}^n w_{ih}\,Y_h + \sum_{j=1}^k \gamma_j \sum_{h=1}^n w_{ih}\,x_{hj} + \varepsilon_i\,,\quad i = 1,\ldots,n$

Following Anselin (1988), this extended model is designated as the Spatial Durbin Model (also SD-model or simply SDM). This SD-model can be written in matrix form by letting $\gamma = (\gamma_1,\ldots,\gamma_k)'$. However, one important additional difference is that (as in all previous models) the matrix, $X$, is defined to include the intercept term in (6.3.5). So here it is convenient to introduce the more specific notation,

(6.3.6)   $X \;=\; [\mathbf{1}_n,\,X_v] \quad\text{and}\quad \beta \;=\; \begin{pmatrix}\beta_0\\ \beta_v\end{pmatrix}$

where both $X_v$ $[= (x_1,\ldots,x_k)]$ and $\beta_v$ now refer explicitly to the explanatory variables. With this additional notation, (6.3.5) can be written in matrix form as follows:8

(6.3.7)   $Y \;=\; \rho WY + X\beta + WX_v\gamma + \varepsilon\,,\quad \varepsilon \sim N(0,\sigma^2 I_n)$
8 It is of interest to note here that in many ways it seems more natural to use $X$ for the $x$ variables, and to employ separate notation for the intercept. But while some authors have chosen to do so, including LeSage and Pace (2009) [compare (6.3.7) above with their expression (2.34)], the linear-model notation ($Y = X\beta + \varepsilon$) is so standard that it seems awkward at this point to attempt to introduce new conventions.
As pointed out by LeSage and Pace (2009, Sections 2.2, 6.1) this model is also useful for
capturing omitted explanatory variables that may be correlated with the x variables. In
this sense, it may serve to make the SL-model somewhat more robust. However, as
developed more fully in Section 10.3 below, our main interest in this model is that it
provides an alternative method for comparing SLM and SEM.
There is one additional spatial regression model that should be mentioned in view of its
wide application in the literature. While this model is conceptually similar to the SE-
model, it involves a fundamentally different approach from a statistical viewpoint. In
terms of our housing price example, rather than modeling the joint distribution of all
housing prices (Y1 ,..,Yn ) among block groups, this approach focuses on the conditional
distributions of each housing price, Yi , given all the others. The advantage of this
approach is that it avoids all of the simultaneity issues that we have thus far encountered.
In particular, since all univariate conditional distributions derivable from a multi-normal distribution are themselves normal, this approach starts off by assuming only that the conditional distribution of each price, $Y_i$, given any values $(y_h : h \neq i)$ of all other prices $(Y_h : h \neq i)$, is normally distributed. So these distributions are completely determined by their conditional means and variances. To construct these moments, we start by rewriting the reduced SE-model in (6.1.9) as follows:

$\varepsilon \;=\; B_\rho(Y - X\beta) \;=\; (I_n - \rho W)(Y - X\beta)$
$\Rightarrow\quad Y - X\beta \;=\; \rho W(Y - X\beta) + \varepsilon$
$\Rightarrow\quad Y \;=\; X\beta + \rho W(Y - X\beta) + \varepsilon$
But if we now denote the $i$-th row of $W$ by $w_i = (w_{i1},\ldots,w_{in})$, then the $i$-th line of this relation can be written as (with $x_i'$ denoting the $i$-th row of $X$),

$Y_i \;=\; x_i'\beta + \rho\,\textstyle\sum_{h=1}^n w_{ih}(Y_h - x_h'\beta) + \varepsilon_i \;=\; x_i'\beta + \rho\,\textstyle\sum_{h\neq i} w_{ih}(Y_h - x_h'\beta) + \varepsilon_i$

where the last equality follows from the assumption that $w_{ii} = 0$. This suggests that if we now condition $Y_i$ on given values $(y_h : h \neq i)$ of $(Y_h : h \neq i)$, then the natural conditional model of $Y_i$ to consider is the following:

$E(Y_i\,|\,y_h : h \neq i) \;=\; x_i'\beta + \rho\,\textstyle\sum_{h\neq i} w_{ih}(y_h - x_h'\beta)$
Thus our present analysis will focus on the Spatial Errors Model (SEM) and the Spatial
Lag Model (SLM), which are by far the most commonly used spatial regression models.
In the next section, we shall develop the basic methods for estimating the parameters of
these models. This will be followed in Section 8 with a development of the standard
regression diagnostics for these models.
Recall from the specification of both SEM in (6.1.3) and SLM in (6.2.2) above that the parameters, $(\beta, \sigma^2, \rho)$, are essentially the same for both. As mentioned already, the key
difference is how the spatial autoregressive hypothesis is applied (namely to the
unobserved errors in SEM and to the observed dependent variable itself in SLM). So it is
not surprising that the method of estimation is very similar for both of these models. But
unlike the iterative estimation scheme employed for geo-kriging models in Section 7.3.1
of Part II (based on iteratively reweighted least-squares), the present method involves the
simultaneous estimation of all model parameters. So our first objective is to develop this
general method of maximum-likelihood estimation, and then to apply this method to both
SEM and SLM.
[Figure 7.1. Two candidate normal densities, $f_1(y)$ and $f_2(y)$, evaluated at an observed sample value, $y$.]
To do so, observe that while the density values, $f_1(y)$ and $f_2(y)$, are not themselves probabilities, their ratio is approximately the relative likelihood of observing values from these two populations in any sufficiently small neighborhood, $[y - \varepsilon,\, y + \varepsilon]$, of $y$, as
shown in Figure 7.2 below. In particular, the area under each density, $f_i$, is seen to be well approximated by a rectangle with base length, $2\varepsilon$, and height, $f_i(y)$, $i = 1,2$.
[Figure 7.2. Rectangle approximations, $(2\varepsilon)f_1(y)$ and $(2\varepsilon)f_2(y)$, of the areas under $f_1$ and $f_2$ over the interval $[y-\varepsilon,\,y+\varepsilon]$.]
This figure shows that for any sufficiently small positive increment, $\varepsilon$,

(7.1.1)   $\dfrac{\Pr(y-\varepsilon \le Y \le y+\varepsilon \,|\, \mu_1)}{\Pr(y-\varepsilon \le Y \le y+\varepsilon \,|\, \mu_2)} \;\approx\; \dfrac{f_1(y)}{f_2(y)}$

so that the maximum-likelihood estimate, $\hat\mu$, of the unknown mean is given by

(7.1.2)   $\hat\mu \;=\; \mu_i \;\Leftrightarrow\; f_i(y) \;\ge\; f_j(y)\,,\quad i,j \in \{1,2\}\,,\; i \neq j$
Next suppose that nothing is known about the mean of this population, so that $\mu$ could in principle be any real value. In this case, there is a continuum of possible normal populations, $\{f(\cdot\,|\,\mu) : \mu \in \mathbb{R}\}$, to be considered. But it should still be clear that $y$ is most likely to have come from that population for which the probability density, $f(y\,|\,\mu)$, is largest. Thus the maximum-likelihood estimate, $\hat\mu$, is now given by the condition that,

(7.1.3)   $f(y\,|\,\hat\mu) \;=\; \max_{\mu}\, f(y\,|\,\mu)$
contain all mean parameters, $\mu = (\mu_1,\ldots,\mu_n)$, together with all covariance parameters, $(\sigma_{ij} : i,j = 1,\ldots,n)$, defining $f$ [as in expression (3.2.11) of Part II]. But more typically, $\theta$ will contain a much smaller set of parameters that are assumed to completely specify both $\mu$ and $\Sigma$ in any given model (as will be illustrated by the many examples to follow). Even in this general setting, the above notion of maximum-likelihood estimator continues to be perfectly meaningful. For example, suppose that $n = 2$, so that each candidate population is representable by a bivariate normal density similar to that in Figure 3.2 of Part II. Then as a two-dimensional analogue to Figure 7.2 above, one can imagine the portion of density above a small rectangular neighborhood of $y_0 = (y_{01}, y_{02})$, as shown schematically on the left side of Figure 7.3 below.
[Figure 7.3. Density volume, $f(y_0\,|\,\theta)$, above a small rectangular neighborhood of $y_0 = (y_{01}, y_{02})$.]
Here again, it is clear that for sufficiently small positive increments, $\varepsilon$, this density volume is well approximated by the box with base area, $(2\varepsilon)^2$, and height, $f(y_0\,|\,\theta)$, so that for any candidate parameter vectors, $\theta_1$ and $\theta_2$, we again have the approximation1

(7.1.4)   $\dfrac{\Pr(y_0 - \varepsilon\mathbf{1}_2 \le Y \le y_0 + \varepsilon\mathbf{1}_2 \,|\, \theta_1)}{\Pr(y_0 - \varepsilon\mathbf{1}_2 \le Y \le y_0 + \varepsilon\mathbf{1}_2 \,|\, \theta_2)} \;\approx\; \dfrac{f(y_0\,|\,\theta_1)}{f(y_0\,|\,\theta_2)}$
1 Recall that $\mathbf{1}_n$ is the unit vector in $\mathbb{R}^n$.
Thus, in this general setting, the maximum-likelihood estimate, $\hat\theta$, of $\theta$ is again given by the condition that

(7.1.5)   $f(y\,|\,\hat\theta) \;=\; \max_{\theta \in \Theta} f(y\,|\,\theta)$

Given the fact that the sample, $y$, is the known quantity and $\theta$ is unknown, it is usually more convenient to define the corresponding likelihood function, $l(\theta\,|\,y)$, by

(7.1.6)   $l(\theta\,|\,y) \;=\; f(y\,|\,\theta)\,,\quad \theta \in \Theta$2

Finally, because densities are positive (in the range of realizable samples, $y$), and because the log function is strictly increasing, the log-likelihood function,3

(7.1.7)   $L(\theta\,|\,y) \;=\; \ln l(\theta\,|\,y)$

achieves its maximum at precisely the same parameter values as $l(\theta\,|\,y)$. The reason for this transformation is that multivariate density functions often involve products, as exemplified by the important case of independent random sampling, $f(y\,|\,\theta) = f(y_1,\ldots,y_n\,|\,\theta) = \prod_{i=1}^n f(y_i\,|\,\theta)$. Moreover, since logs convert products to sums, this representation is often simpler to analyze (as for example when differentiating likelihood functions).
To apply this estimation procedure, we start in Section 7.2.1 by considering the most
familiar case of Ordinary Least Squares (OLS). By applying essentially the same
arguments as in Section 7.1 of Part II, we then extend these results to Generalized Least
Squares (GLS) in Section 7.2.2. These maximum-likelihood estimates for GLS will then serve as the general framework for obtaining comparable results for the spatial regression models, SEM and SLM, in Sections 7.3 and 7.4 below.

2 For example, if $\theta = (\theta_1,\theta_2) = (\mu,\sigma^2)$ for $N(\mu,\sigma^2)$, then $\Theta = \{(\theta_1,\theta_2) : \theta_2 > 0\}$.
3 In these notes "log" always means natural log, so the symbols ln and log may be used interchangeably.
(7.2.1)   $Y \;=\; X\beta + \varepsilon\,,\quad \varepsilon \sim N(0,\sigma^2 I_n)$

(7.2.2)   $f(y\,|\,\beta,\sigma^2) \;=\; (2\pi)^{-n/2}\,|\sigma^2 I_n|^{-1/2}\exp\!\left[-\tfrac{1}{2}(y - X\beta)'(\sigma^2 I_n)^{-1}(y - X\beta)\right]$

[where the parameter vector, $\theta$, for the general version in (7.1.5) above is here given by $\theta = (\beta,\sigma^2) = (\beta_0,\beta_1,\ldots,\beta_k,\sigma^2)$]. By observing that $|\sigma^2 I_n|^{1/2} = (\sigma^{2n})^{1/2}|I_n| = (\sigma^2)^{n/2}$ and $(\sigma^2 I_n)^{-1} = \sigma^{-2}I_n^{-1} = \sigma^{-2}I_n$, we see that this density can be simplified to:

(7.2.3)   $f(y\,|\,\beta,\sigma^2) \;=\; (2\pi\sigma^2)^{-n/2}\exp\!\left[-\tfrac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)\right]$

so that the appropriate log-likelihood function for the OLS model is given by:

(7.2.4)   $L(\beta,\sigma^2\,|\,y) \;=\; -\tfrac{n}{2}\ln(2\pi) - \tfrac{n}{2}\ln\sigma^2 - \tfrac{1}{2\sigma^2}(y - X\beta)'(y - X\beta)$

For any given value of $\sigma^2$, maximizing $L$ with respect to $\beta$ is thus equivalent to minimizing the sum of squared deviations, i.e.,

(7.2.5)   $\hat\beta \;=\; \operatorname*{argmin}_{\beta}\,(y - X\beta)'(y - X\beta)$

in a manner identical to expressions (7.1.10) and (7.1.11) in Part II. Thus expression (7.1.12) of Part II shows that this solution is again given by:

(7.2.6)   $\hat\beta \;=\; (X'X)^{-1}X'y$
While this simple identity might appear to suggest that there is really no need for
maximum likelihood estimation in the case of OLS, the real power of this method
becomes evident when we turn to the estimation of $\sigma^2$. Indeed, the method of least squares used for OLS is not directly extendable to $\sigma^2$, so that other methods must be employed. Even in the case of geostatistical regression, where a comparable estimate of $\sigma^2$ was developed in expression (7.3.19) of Part II, the actual estimation procedure involved a rather ad hoc application of a nonlinear least-squares procedure for fitting spherical variograms to data. But in the present setting, we can now obtain a theoretically more meaningful estimate. In particular, by substituting $\hat\beta$ from (7.2.5) into (7.2.4), we can derive the exact maximum-likelihood estimate, $\hat\sigma^2$, of $\sigma^2$ by maximizing the reduced function,

(7.2.7)   $L_c(\sigma^2\,|\,y) \;=\; L(\hat\beta,\sigma^2\,|\,y)$

where the subscript "c" reflects the common designation of this function as the concentrated likelihood function of the parameter $\sigma^2$ [also called a profile likelihood function]. But since the first-order condition for a maximum yields:4

(7.2.8)   $0 \;=\; \dfrac{d}{d\sigma^2}L_c(\sigma^2\,|\,y) \;=\; -\dfrac{n}{2}\cdot\dfrac{1}{\sigma^2} + \dfrac{1}{2}\left(\dfrac{1}{\sigma^2}\right)^2(y - X\hat\beta)'(y - X\hat\beta) \;\;\Rightarrow\;\; n \;=\; \dfrac{1}{\sigma^2}(y - X\hat\beta)'(y - X\hat\beta)$

it follows that the maximum-likelihood estimate of $\sigma^2$ is given by

(7.2.9)   $\hat\sigma^2 \;=\; \tfrac{1}{n}(y - X\hat\beta)'(y - X\hat\beta)$
This can be given a more familiar form in terms of the estimated residuals, $\hat\varepsilon = (\hat\varepsilon_1,\ldots,\hat\varepsilon_n)' = y - X\hat\beta$, as

(7.2.10)   $\hat\sigma^2 \;=\; \tfrac{1}{n}\hat\varepsilon'\hat\varepsilon \;=\; \tfrac{1}{n}\sum_{i=1}^n \hat\varepsilon_i^2$
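In MATLAB, these OLS maximum-likelihood estimates amount to only a few lines (a minimal sketch, assuming y and X, with X including its intercept column of ones, are in the workspace):

beta_hat = (X'*X)\(X'*y);             % beta_hat = (X'X)^{-1} X'y, as in (7.2.6)
e_hat = y - X*beta_hat;               % estimated residuals
sig2_hat = (e_hat'*e_hat)/length(y);  % sigma^2 estimate, as in (7.2.10)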
To extend these estimation results to GLS, we start with the general linear model,

(7.2.11)   $Y \;=\; X\beta + \varepsilon\,,\quad \varepsilon \sim N(0,\sigma^2 V)$
4 One may also check that the second derivative of $L_c$ evaluated at $\hat\sigma^2$ is negative, and thus yields a maximum.
where the matrix, $V$, is assumed to be known.5 So in this setting, OLS is seen to be the special case with $V = I_n$. The key feature of this model is that, like the OLS model in (7.2.3) above, the only unknown parameters are the beta coefficients, $\beta$, together with the positive variance parameter, $\sigma^2$ [so that again, $\theta = (\beta,\sigma^2)$]. As with OLS, this implies that $Y$ is again multi-normally distributed, where in this case, $Y \sim N(X\beta,\sigma^2 V)$, with density:

(7.2.12)   $f(y\,|\,\beta,\sigma^2) \;=\; (2\pi)^{-n/2}\,|\sigma^2 V|^{-1/2}\exp\!\left[-\tfrac{1}{2}(y - X\beta)'(\sigma^2 V)^{-1}(y - X\beta)\right]$

Moreover, since $V$ is positive definite, there is a nonsingular matrix, $T$, with $V = TT'$,6 so that the quadratic form in this density can be rewritten as

$(y - X\beta)'V^{-1}(y - X\beta) \;=\; (T^{-1}y - T^{-1}X\beta)'(T^{-1}y - T^{-1}X\beta) \;=\; (\tilde{y} - \tilde{X}\beta)'(\tilde{y} - \tilde{X}\beta)$

But this is precisely the squared-deviation function in (7.2.5) for the new data set, $\tilde{y} = T^{-1}y$ and $\tilde{X} = T^{-1}X$. So it follows at once from (7.2.6) that the GLS maximum-likelihood estimate, $\hat\beta$, of $\beta$ is given [as in expressions (7.1.21) through (7.1.24) in Part II] by $\hat\beta = (\tilde{X}'\tilde{X})^{-1}\tilde{X}'\tilde{y}$,
5 Unlike the model specification in expression (7.1.8) of Part II, the matrix $V$ need not be a correlation matrix (i.e., its diagonal elements need not be all ones). However, since $\sigma^2 V$ is required to be a nonsingular covariance matrix, $V$ must be symmetric and positive definite (as in Section A2.7.2 of the Appendix to Part II).
6 Here the existence of $T$ is ensured by the Cholesky Theorem in Section A2.7.2 of the Appendix to Part II.
so that by (7.2.15),

(7.2.18)   $\hat\beta \;=\; (X'V^{-1}X)^{-1}X'V^{-1}y$

Moreover, precisely the same maximization arguments for $\sigma^2$ in (7.2.8) and (7.2.9) above now show that the GLS maximum-likelihood estimate for $\sigma^2$ is given by

(7.2.20)   $\hat\sigma^2 \;=\; \tfrac{1}{n}(y - X\hat\beta)'(T^{-1})'(T^{-1})(y - X\hat\beta) \;=\; \tfrac{1}{n}(y - X\hat\beta)'V^{-1}(y - X\hat\beta)$
Thus the maximum-likelihood estimation results for OLS are seen to be directly
extendable to the class of GLS models (7.2.11).
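The corresponding GLS computations are equally direct in MATLAB (a sketch assuming y, X, and a known positive definite matrix V are in the workspace):

ViX = V\X;                            % V^{-1} X
Viy = V\y;                            % V^{-1} y
beta_hat = (X'*ViX)\(X'*Viy);         % (X'V^{-1}X)^{-1} X'V^{-1} y, as in (7.2.18)
r = y - X*beta_hat;
sig2_hat = (r'*(V\r))/length(y);      % (1/n)(y - Xb)'V^{-1}(y - Xb), as in (7.2.20)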
(7.3.3)   $B_\rho \;=\; I_n - \rho W$

So SEM can be viewed as an instance of the GLS model in (7.2.11), where $V$ now takes the specific form $V_\rho$ in (7.3.2). However, it must be emphasized that unlike (7.2.11), the matrix $V_\rho$ involves an unknown parameter, $\rho$. So to be precise, (7.3.1) should be viewed
as a GLS model conditioned on $\rho$. But nonetheless, we can still employ (7.2.14) to write down the appropriate log-likelihood function for SEM as

(7.3.4)   $L(\beta,\sigma^2,\rho\,|\,y) \;=\; -\tfrac{n}{2}\ln(2\pi) - \tfrac{n}{2}\ln\sigma^2 - \tfrac{1}{2}\ln|V_\rho| - \tfrac{1}{2\sigma^2}(y - X\beta)'V_\rho^{-1}(y - X\beta)$

In particular, we now know from (7.2.18) and (7.2.20) that for any given value of $\rho$, the maximum-likelihood estimates for $\beta$ and $\sigma^2$, conditional on $\rho$, are given respectively by

(7.3.5)   $\hat\beta_\rho \;=\; (X'V_\rho^{-1}X)^{-1}X'V_\rho^{-1}y$

and

(7.3.6)   $\hat\sigma^2_\rho \;=\; \tfrac{1}{n}(y - X\hat\beta_\rho)'V_\rho^{-1}(y - X\hat\beta_\rho)$

where the $\rho$ subscript on these estimates reflects their dependency on the value of $\rho$. But since these conditional estimates are expressible as explicit (closed-form) functions of $\rho$, we can substitute these results into (7.3.4) and obtain a concentrated likelihood function for $\rho$ in a manner similar to that of $\sigma^2$ in the case of OLS [in expression (7.2.7) above]. In the present case, this concentrated likelihood takes the following form:

(7.3.7)   $L_c(\rho\,|\,y) \;=\; L(\hat\beta_\rho,\hat\sigma^2_\rho,\rho\,|\,y) \;=\; -\tfrac{n}{2}\ln(2\pi) - \tfrac{n}{2}\ln\hat\sigma^2_\rho - \tfrac{1}{2}\ln|V_\rho| - \tfrac{1}{2\hat\sigma^2_\rho}(y - X\hat\beta_\rho)'V_\rho^{-1}(y - X\hat\beta_\rho)$

To further simplify this expression, we first note from (7.3.6) that the last term in (7.3.7) reduces to a constant, since

(7.3.8)   $\tfrac{1}{2\hat\sigma^2_\rho}(y - X\hat\beta_\rho)'V_\rho^{-1}(y - X\hat\beta_\rho) \;=\; \tfrac{n}{2}$

Moreover, it follows from standard properties of matrix inverses and determinants [as in expressions (A3.1.18), (A3.1.20), (A3.2.70) and (A3.2.71) of the Appendix] that

(7.3.9)   $|V_\rho| \;=\; |(B_\rho'B_\rho)^{-1}| \;=\; |B_\rho^{-1}|\,|(B_\rho')^{-1}| \;=\; |B_\rho|^{-1}|B_\rho|^{-1} \;=\; |B_\rho|^{-2}$

So by substituting these identities into (7.3.7), we obtain the simpler form of the concentrated likelihood function for $\rho$:

(7.3.10)   $L_c(\rho\,|\,y) \;=\; \ln|B_\rho| - \tfrac{n}{2}\ln\hat\sigma^2_\rho - \tfrac{n}{2}\left[\ln(2\pi) + 1\right]$
With these results, the desired maximum-likelihood estimation procedure for SEM is now evident. In particular, we first maximize the concentrated likelihood function, $L_c(\rho\,|\,y)$, to obtain the estimate, $\hat\rho$, and then use (7.3.5) and (7.3.6) to obtain the remaining estimates, $\hat\beta$ and $\hat\sigma^2$, as:

(7.3.11)   $\hat\beta \;=\; \hat\beta_{\hat\rho} \;=\; (X'V_{\hat\rho}^{-1}X)^{-1}X'V_{\hat\rho}^{-1}y$

and

(7.3.12)   $\hat\sigma^2 \;=\; \hat\sigma^2_{\hat\rho} \;=\; \tfrac{1}{n}(y - X\hat\beta)'V_{\hat\rho}^{-1}(y - X\hat\beta)$

Since $L_c(\rho\,|\,y)$ is a smooth function in one variable, the first step can be accomplished by standard numerical "line search" methods. So for reasonably small sample sizes, $n$, this estimation procedure is very efficient.
But for larger sample sizes (say, $n > 500$), an additional problem is created by the need to evaluate the $n \times n$ determinant, $|B_\rho|$, at each step of this procedure. However, such computations can often be made more efficient by means of the following observation. Recall from the discussion of eigenvalues and eigenvectors in Section 3.3.1 above that nonsingular matrices such as $B_\rho$ have a "spectral" representation in terms of the diagonal matrix, $\Lambda = \mathrm{diag}(\nu_1,\ldots,\nu_n)$, of their eigenvalues, together with the nonsingular matrix, $X = (x_1,\ldots,x_n)$, of their associated eigenvectors, as:

(7.3.13)   $B_\rho \;=\; X\Lambda X^{-1}$

Moreover, if the eigenvalues of the weight matrix, $W$, in (7.3.3) are denoted by $\lambda_i$, with associated eigenvectors, $x_i$, $i = 1,\ldots,n$, so that

(7.3.15)   $Wx_i \;=\; \lambda_i x_i\,,\quad i = 1,\ldots,n$

then it follows that

$B_\rho x_i \;=\; (I_n - \rho W)x_i \;=\; x_i - \rho\lambda_i x_i \;=\; (1 - \rho\lambda_i)x_i\,,\quad i = 1,\ldots,n$

Thus we see that the eigenvalues of $B_\rho$ are obtainable from those of $W$ by the identity, $\nu_i = 1 - \rho\lambda_i$ (with corresponding eigenvector, $x_i$). In particular, this implies from (7.3.13) that

(7.3.18)   $|B_\rho| \;=\; \prod_{i=1}^n (1 - \rho\lambda_i)$

or equivalently, in the log form used in (7.3.10),

(7.3.19)   $\ln|B_\rho| \;=\; \sum_{i=1}^n \ln(1 - \rho\lambda_i)$

So by calculating the eigenvalues, $(\lambda_1,\ldots,\lambda_n)$, of the weight matrix, $W$, we can rapidly compute the determinant, $|B_\rho|$, for any value of $\rho$. While the computation of these eigenvalues can itself be time consuming, the key point is that this calculation need only be done once. This procedure is so useful that it is incorporated into almost all software packages for calculating such maximum-likelihood estimates (when $n$ is sufficiently large).7
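The full SEM procedure can thus be sketched as a single MATLAB function. This is illustrative only (not one of the class programs), and it assumes that the eigenvalues of W are real, as holds in particular for symmetric weight matrices:

function [beta_hat, sig2_hat, rho_hat] = sem_ml(y, X, W)
% SEM_ML  Sketch of SEM maximum-likelihood estimation using the
% concentrated likelihood (7.3.10) and the eigenvalue identity (7.3.19).
n = length(y);
lam = eig(W);                                    % eigenvalues of W, computed only once
negLc = @(rho) -conc_lik(rho, y, X, W, lam, n);  % negative concentrated likelihood
rho_hat = fminbnd(negLc, 1/min(lam)+1e-5, 1/max(lam)-1e-5);  % line search over feasible rho
B = eye(n) - rho_hat*W;                          % B_rho at the estimate
Vi = B'*B;                                       % V_rho^{-1} = B_rho'*B_rho
beta_hat = (X'*Vi*X)\(X'*Vi*y);                  % final estimate, as in (7.3.11)
r = y - X*beta_hat;
sig2_hat = (r'*Vi*r)/n;                          % final estimate, as in (7.3.12)
end

function val = conc_lik(rho, y, X, W, lam, n)
B = eye(n) - rho*W;
Vi = B'*B;
b = (X'*Vi*X)\(X'*Vi*y);                         % conditional estimate (7.3.5)
r = y - X*b;
s2 = (r'*Vi*r)/n;                                % conditional estimate (7.3.6)
val = sum(log(1 - rho*lam)) - (n/2)*log(s2);     % ln|B_rho| - (n/2)ln(s2), as in (7.3.10)
end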
where $X_\rho = B_\rho^{-1}X$, and where $V_\rho$ and $B_\rho$ are again given by (7.3.2) and (7.3.3). So the only formal difference here is that for each given value of $\rho$, we now obtain a GLS model in which both $V_\rho$ and $X_\rho$ depend on $\rho$. So the corresponding log-likelihood function takes the form,

$L(\beta,\sigma^2,\rho\,|\,y) \;=\; -\tfrac{n}{2}\ln(2\pi) - \tfrac{n}{2}\ln\sigma^2 - \tfrac{1}{2}\ln|V_\rho| - \tfrac{1}{2\sigma^2}(y - X_\rho\beta)'V_\rho^{-1}(y - X_\rho\beta)$

which in turn implies that for the SLM case, the maximum-likelihood estimate for $\beta$ conditional on $\rho$ is given by:

(7.4.6)   $\hat\beta_\rho \;=\; (X'X)^{-1}X'B_\rho y$
7 However, it should also be noted that for extremely large sample sizes (say $n > 1000$), the numerical accuracy of such eigenvalue calculations becomes less reliable. In such cases, (7.3.19) is often approximated by using only those terms with eigenvalues of largest absolute magnitudes.
Similarly, the conditional maximum-likelihood estimate of $\sigma^2$ is now given by

(7.4.7)   $\hat\sigma^2_\rho \;=\; \tfrac{1}{n}(y - X_\rho\hat\beta_\rho)'V_\rho^{-1}(y - X_\rho\hat\beta_\rho)$

But by using the same arguments as in (7.4.4) and (7.4.5), we see that

(7.4.8)   $(y - X_\rho\hat\beta_\rho)'V_\rho^{-1}(y - X_\rho\hat\beta_\rho) \;=\; (y - B_\rho^{-1}X\hat\beta_\rho)'(B_\rho'B_\rho)(y - B_\rho^{-1}X\hat\beta_\rho)$

and thus that the maximum-likelihood estimate for $\sigma^2$ conditional on $\rho$ for SLM reduces to:

(7.4.9)   $\hat\sigma^2_\rho \;=\; \tfrac{1}{n}(B_\rho y - X\hat\beta_\rho)'(B_\rho y - X\hat\beta_\rho)$
As with SEM, this can be reduced by again observing from (7.4.7) that the last term of the log-likelihood again reduces to the constant, $n/2$, which together with (7.3.9) shows that the concentrated likelihood function for $\rho$ has exactly the same form for SLM as for SEM, i.e.,

(7.4.12)   $L_c(\rho\,|\,y) \;=\; \ln|B_\rho| - \tfrac{n}{2}\ln\hat\sigma^2_\rho - \tfrac{n}{2}\left[\ln(2\pi) + 1\right]$

So the only difference between (7.3.10) and (7.4.12) is the explicit form of $\hat\sigma^2_\rho$ in (7.3.6) and (7.4.9), respectively. In particular, this implies that all of the discussion about numerical maximization of concentrated likelihoods to obtain $\hat\rho$ is identical for both models. In particular, the eigenvalue decomposition in (7.3.19) is precisely the same. So to complete the estimation procedure, it remains only to substitute this estimate, $\hat\rho$, into (7.4.6) and (7.4.9) to obtain the respective estimates,

(7.4.13)   $\hat\beta \;=\; \hat\beta_{\hat\rho} \;=\; (X'X)^{-1}X'B_{\hat\rho}y$

and

(7.4.14)   $\hat\sigma^2 \;=\; \tfrac{1}{n}(B_{\hat\rho}y - X\hat\beta)'(B_{\hat\rho}y - X\hat\beta)$
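In a sketch like sem_ml above, the SLM case thus differs only in the concentrated-likelihood subfunction, which now uses (7.4.6) and (7.4.9) in place of (7.3.5) and (7.3.6) (same assumptions as before):

function val = conc_lik_slm(rho, y, X, W, lam, n)
B = eye(n) - rho*W;
b = (X'*X)\(X'*(B*y));                     % conditional estimate, as in (7.4.6)
r = B*y - X*b;
s2 = (r'*r)/n;                             % conditional estimate, as in (7.4.9)
val = sum(log(1 - rho*lam)) - (n/2)*log(s2);   % same form as (7.3.10), cf. (7.4.12)
end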
Recall from Figures 1.7 and 1.8 above that the "footprint" of the 12th-century Anglo-Norman counties, known as the Pale, can still be seen in the spatial density pattern of Blood Group A in 1958. So an interesting question to explore is how much of this pattern can be statistically accounted for by this single explanatory variable.8 To do so, we now consider a simple regression,

(7.5.1)   $Y_i \;=\; \beta_0 + \beta_1 x_i + \varepsilon_i\,,\quad i = 1,\ldots,n$
8 Note that the Irish Blood Group data set in [BG] contains one other potentially relevant explanatory variable, namely the number of place names (per unit of area) ending in "town" within each county. However, in the present example we focus only on the (marginal) effect of the Pale itself.
where the relevant dependent variable, $Y_i$, is the proportion of adults with Blood Group A in each county $i$, and the single explanatory variable, $x_i$, is taken to be the indicator (zero-one) variable for the Pale (corresponding to the red area in Figure 1.8 above), where

(7.5.2)  $x_i = \begin{cases} 1, & \text{if } i \in \text{Pale} \\ 0, & \text{if } i \notin \text{Pale} \end{cases}$
To run this regression, we here use the ARCMAP version of OLS, and employ the
ARCMAP data set in Eire.mxd. While JMP is generally more suitable for such analyses,
performing OLS inside ARCMAP has the particular advantage of allowing the regression
residuals to be mapped directly. This program can be found on the ArcToolbox path:
Spatial Statistics Tools > Modeling Spatial Relationships > Ordinary Least Squares
In the window that opens, type the entries shown on the left in Figure 7.4 below (where, as usual, path names are machine specific):

[Figure 7.4. Ordinary Least Squares input window (left) and resulting Table of Contents entries (bottom right)]
Notice that both the coefficient estimates and diagnostics are “optional” tables, which
should definitely be added. These will appear in the Table of Contents, as shown at the
bottom right in Figure 7.4. The relevant portion of eire_output (for our purposes)9 is
shown in Table 7.1 below:
9 Note in particular that the "robust" estimates and tests in this Table have not been shown. As with a number of other statistical diagnostics in ARCMAP, these robust-estimation results are difficult to interpret without further documentation.
So the “Pale effect” is seen to be positive and very significant, indicating that Blood
Group A levels are significantly higher inside the Pale than elsewhere in Eire. But as we
have seen many times before, this significance may well be inflated by the presence of
spatial dependencies among Blood Group levels that are not accounted for by the Pale
alone. So the remaining task is to test the regression residuals for spatial autocorrelation.
These residuals are shown graphically on the right in Figure 7.5 below, where the pattern
of Blood Group values in Figure 1.7 is reproduced on the left for ease of comparison.
[Figure 7.5. Blood Group A pattern (left) and OLS regression residuals (right); scale bars: 0–50 miles]
Before analyzing these residuals, it is important to emphasize that the “default” residuals
that appear in ARCMAP (as indicated on the right side of Figure 7.4) have been
normalized to Studentized Residuals (StdResiduals). So to be comparable with the rest of
our analysis, this plot must be redone in terms of the Residuals column in the Attribute
Table, as is done in Figure 7.5.10
10 Note that studentized residuals (again not documented in ARCMAP) are useful for many testing purposes when the original assumption of independent residuals holds. But in the presence of possible spatial dependencies, it is generally preferable to analyze the raw residuals themselves.
Note from the plot of these residuals in Figure 7.5 that (as with many linear regressions)
the highest Blood Group values in the Pale are underestimated (red residuals) and the
lowest values outside the Pale are overestimated (blue residuals). This by itself tends to
preserve a certain amount of the positive correlation seen in the original Blood Group
data.
The first spatial weights matrix considered is the simple (centroid) nearest-neighbor matrix, $W_{nn}$, which (as already mentioned above Figure 1.18) is very restrictive for areal data in terms of potentially relevant neighbors ignored. Here it is clear that no spatial autocorrelation is detected by any method using this matrix. A slightly more appropriate version is the symmetric nearest-neighbor matrix, $W_{nns}$, [expression (2.1.10) above with $k = 1$] shown in the next row. Here the results are all still very insignificant, but are nonetheless dramatically more significant than for the asymmetric case. The reason for this
in the case of Eire can be seen in Figure 7.6 below, where county centroids are shown as
blue dots, and where the red line emanating from each centroid is directed toward its
nearest neighbor. This figure (which extends the Laoghis County illustration in Figure
1.18 above) confirms that such neighbor relations are relatively sparse throughout Eire. In
particular, there are very few mutual nearest neighbors, i.e., red lines with both ends
connected to centroids. So when moving from nearest neighbors, $W_{nn}$, to symmetric nearest neighbors, $W_{nns}$, it is now clear that many more relations are added to the matrix,
thus allowing many more possibilities for spatial correlation to be considered.
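In matrix terms, this symmetrization is easy to carry out. The following one-line MATLAB sketch (assuming $W_{nn}$ is stored as a 0-1 nearest-neighbor matrix Wnn) adds the transposed relations, so that i and j are treated as neighbors whenever either is the nearest neighbor of the other:

    % Symmetric nearest-neighbor matrix: union of Wnn with its transpose
    Wnns = double(Wnn | Wnn');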
The third and fourth rows show respective results for the queen contiguity matrix, $W_{queen}$, [expression (2.1.15)] and for one of its k-nearest-neighbor approximations, namely, the five nearest neighbor version, $W_{nn(5)}$ [as in expression (2.1.9) with $k = 5$, and as also used in Figure 1.18 for Laoghis County]. These two cases are of special interest, since they are by far the most commonly used weights matrices for analyzing areal data. But in both cases, spatial autocorrelation is at best seen to be weakly significant – and is totally insignificant for the standard asymptotic Moran test.11
[Figure 7.6. County centroids (blue dots) in Eire, with red lines directed from each centroid toward its nearest neighbor]
In view of this lack of significance, the results in the final two rows are quite striking.
These show respective results for the boundary shares matrix, $W_{share}$ [expression (2.1.17)], and for the combined distance-shares matrix, W, of Cliff and Ord (1969) [expression (2.1.18)]. Because we shall employ this latter matrix, W, in our subsequent analyses, it is here convenient to reproduce its typical elements, $w_{ij}$, as follows,13

(7.5.3)  $w_{ij} = \dfrac{l_{ij}\, d_{ij}^{-1}}{\sum_{k \neq i} l_{ik}\, d_{ik}^{-1}}$
11 In fairness, it should be pointed out (as is done for example in ARCMAP) that such asymptotic tests typically require more samples (areal units) for statistical reliability. A common rule of thumb (that we have seen already for the Central Limit Theorem) is that n be at least 30.
where $l_{ij}$ is the fraction of the boundary of county $i$ shared with county $j$, and $d_{ij}$ is the distance between their respective centroids.12 While it is difficult to explain exactly why
these two matrices capture so much more significance, one can gain insight by simply
noting the unusual complexity of county boundaries in Eire. These complexities have
most likely resulted from a long history of interactions between neighboring counties, so
that shared boundary lengths may well reflect the degree of such interactions. Moreover,
insofar as centroid distances tend to reflect relative travel distance between counties, it is reasonable to suppose that such distances reflect other dimensions of interaction. In any
case, this example provides a clear case where it is prudent to consider a variety of tests
in terms of alternative spatial weights matrices before drawing any firm conclusions
about the presence of spatial autocorrelation. One rule of thumb is to try several (say
three) different matrices which exhibit sufficient qualitative differences to capture a range
of interaction possibilities. As stressed at the beginning of Part III, one of the most
perplexing features of areal data analysis is the absence of any clear notion of “spatial
separation” between areal units.
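As a concrete sketch of the construction in (7.5.3) (with hypothetical variable names LIJ and DIJ; this is not the code actually used to build the matrix W in Eire.mat), one might proceed in MATLAB as follows:

    % Combined distance-shares weights (7.5.3):
    %   w_ij = (l_ij * d_ij^(-1)) / sum_k (l_ik * d_ik^(-1))
    % LIJ : n-by-n matrix of boundary-share fractions l_ij
    % DIJ : n-by-n matrix of centroid distances d_ij
    A = LIJ ./ DIJ;          % elementwise l_ij / d_ij
    n = size(A,1);
    A(1:n+1:end) = 0;        % no self-neighbors (also removes 0/0 on the diagonal)
    W = A ./ sum(A,2);       % row-normalize so that each row sums to one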
Given this weight matrix, we now employ the spatial regression models, SEM and SLM, to capture the relation between Blood Group levels and the Pale in a manner that accounts for the spatial autocorrelation detected in Table 7.2. The estimation procedures for SEM and SLM are implemented in the MATLAB programs, sem.m and slm.m,
respectively. The inputs required for each program consist of a data vector, y, for the
dependent variable, a data matrix, X, for the explanatory variables, and an appropriate
spatial weights matrix, W, relating the relevant set of areal units. In the present case, y is
the vector of Blood Group proportions for each county, X is the vector, x, identifying
those counties in the Pale, and W is the combined distance-shares matrix above.
12 Further discussion of this weight matrix can be found in Upton and Fingleton (1985, pp. 287-288) [see Reference 18 in the class Reference Materials].
13 Here it is of interest to note that these weights differ slightly from those of Cliff and Ord (1969), which can be found in Table 5.1 of Upton and Fingleton (1985), and which are also reproduced as matrix, W2, in the workspace, Eire.mat. This illustrates the fact that such constructions will differ to some degree depending on the particular map of Eire that is used. (Indeed, digital maps did not even exist in 1969 when the original work was done.)
Before running these models, it should be noted that there are two additional inputs, vnames and val (also described in the program documentation). We have already seen vnames used as the list of variable names in previous applications (as for example in the Cobalt Example of Section 7.3.4 in Part II). For the present case of a single variable, one need only write the variable name in single quotes, which here is 'Pale'. The final input, val, represents the optional input of eigenvalues for W used to calculate the log determinant in (7.3.19) above. In the case of Eire with $n = 26$, this is hardly necessary. But for very large weight matrices, W, it is worth noting that the corresponding vector of eigenvalues is easily obtained in MATLAB with the command:

>> val = eig(W);
With these preliminary observations, we can now run both SEM and SLM, using the
respective commands:
>> sem(y,X,W,'Pale');
>> slm(y,X,W,'Pale');
It should also be noted that there are a number of data outputs given by these two models.
But for our present purposes, it is enough to examine their screen outputs, as shown in
Figure 7.7 below. Here it is clear that there is a strong parallel between the output formats
of each model. In particular, they are quite comparable in terms of both their output
results and diagnostics (as discussed in more detail below). Note also that these two
formats look very much the same as for OLS regression in the sense that significance
levels (p-values) are reported for each parameter estimate, together with various measures
of “goodness of fit”. But as we shall see below, the actual methods of obtaining these
results (and in some cases, even their meaning) differ substantially from OLS.
Nonetheless, the basic interpretations of parameter estimates and their significance levels
will remain the same as in OLS. So before getting into the details of calculation methods,
it is appropriate to begin by examining these results in a qualitative way.
With respect to SEM, notice first that while the Pale effect continues to be positive (as in Table 7.1 for OLS), this effect is now both smaller in magnitude (1.55 versus 4.25) and dramatically less significant (with a p-value of .0788 versus .000012). Notice also that the level of spatial autocorrelation, $\hat{\rho} = 0.7885$, is significantly positive. As we have seen before, this suggests that such differences are largely due to the presence of spatial autocorrelation. While the exact nature of these effects is difficult to identify in the present spatial regression setting, we can nonetheless make certain useful observations. First, if the relevant data matrix for this Eire example is denoted by $X = [1_n, x]$, then it follows from expression (7.1.12) in Part II together with (7.3.11) above that the OLS and SEM estimates of $(\beta_0, \beta_1)$ are given respectively by

$\hat{\beta}_{OLS} = (X'X)^{-1}X'y$
[Figure 7.7. Regression Results and Autocorrelation Tests for SEM and SLM. MORAN z-score and p-val = (0.2741, 0.3920) for SEM; MORAN z-score and p-val = (-0.7550, 0.7749) for SLM]
and,

$\hat{\beta}_{SEM} = (X'\hat{V}^{-1}X)^{-1}X'\hat{V}^{-1}y$

In contrast to OLS, the beta estimates for SEM are thus seen to depend on the estimated level of spatial autocorrelation, $\hat{\rho}$, together with the choice of spatial weights matrix, W, implicit in $\hat{V}$. So while in theory such estimates are still unbiased [recall expression (7.1.26) in Part II], their sensitivity to $\hat{\rho}$ tends to inflate the variance of these estimates.
This can be seen in part by considering the standard errors of the estimated Pale parameter, $\hat{\beta}_1$, for both OLS and SEM. To do so, recall first that the standard error for $\hat{\beta}_1$ under OLS was reported in Table 7.1. To derive the comparable standard error under SEM, we begin by noting that the appropriate "Z-VAL" for $\hat{\beta}_1$ in Figure 7.7 is given [in a manner analogous to expression (7.3.26) in Part II] by

(7.5.7)  $z_{\hat{\beta}_1} = \dfrac{\hat{\beta}_1}{s_{\hat{\beta}_1}}$,

so that the estimated standard error for $\hat{\beta}_1$ under SEM is given from Figure 7.7 by,

(7.5.8)  $s_{SEM}(\hat{\beta}_1) = \dfrac{\hat{\beta}_1}{z_{\hat{\beta}_1}} = \dfrac{1.553209}{1.757660} = 0.88368$
This shows that standard errors of beta estimates do indeed tend to be larger in the
presence of spatial autocorrelation.
Before turning to the SL-model, it is important to note that while the estimated spatial autocorrelation level, $\hat{\rho}$, for this SE-model is significantly positive, it is not evident that $\hat{\rho}$ has successfully eliminated all spatial autocorrelation effects found for weight matrix, W, in Table 7.2. To address this issue, we may again appeal to the results developed for all GLS models in expressions (7.1.18) and (7.1.19) in Part II, which show that if the spatial covariance structure, V, [in (7.3.2) and (7.3.3)] has been correctly estimated, then the Cholesky reduction of this model to OLS form should yield residuals that exhibit no significant spatial autocorrelation (with respect to W). In the present case, however, there is no need for Cholesky decompositions, since V in (7.3.2) is already factorized in terms
of $B^{-1}$. In fact, the reduction of SEM to an OLS form can be made even more transparent by simply recalling from expression (6.1.9) that

$BY = BX\beta + \varepsilon\,, \quad \varepsilon \sim N(0, \sigma^2 I_n)$
$\Rightarrow\ \tilde{Y} = \tilde{X}\beta + \varepsilon\,, \quad \varepsilon \sim N(0, \sigma^2 I_n)$

with $\tilde{Y} = BY$ and $\tilde{X} = BX$.
One can then test the residuals of the estimated OLS model in (7.5.9) by again using sac_perm.m. Since this procedure is detailed in part (c) of Assignment 7, it suffices here to observe that the full command for sem.m in Section 7.5.2 above is of the form:

>> OUT = sem(y,X,W,'Pale');

where the matrix OUT contains a number of useful transformations of the regression outputs. In particular, the residuals in (7.5.10) are contained in the third column, so that the command,

>> res_SEM = OUT(:,3);

produces a copy, res_SEM, of these residuals that can be tested using sac_perm as follows:
>> sac_perm(res_SEM,W,999);
The results of this test are shown in the lower left panel of Figure 7.7, and confirm that
this application of SEM has indeed been successful in removing the spatial
autocorrelation found under weight matrix, W.
Turning next to the SL-model, the most important difference to notice here is that while the Pale effect on Blood Group A is again positive – it is now vastly more significant than for the SE-model, with p-value = 0.0005. Moreover, by substituting the maximum-likelihood estimates $(\hat{\beta}, \hat{\sigma}^2, \hat{\rho})$ for each model into their respective log-likelihood functions in (7.3.4) and (7.4.2), we obtain maximum log-likelihood values for SEM and SLM that constitute one possible measure of their goodness of fit to this Eire data (see Section 9 below for a more detailed discussion of goodness-of-fit measures). These values are reported in the GOODNESS-OF-FIT section for each model in Figure 7.7.
So in terms of this likelihood comparison, it is clear that SLM also yields a much better
fit to the Eire data than SEM (i.e., a much higher log-likelihood value).
This raises the natural question as to why SLM is so much more successful in capturing
this spatial pattern of Blood Group A levels in Eire. Interestingly enough, the answer
appears to lie in the ripple effect underlying the spatial autoregressive multiplier matrix, $B^{-1} = (I_n - \rho W)^{-1}$, for these models, as detailed in Section 3.3 above. The key point here
is that while this ripple effect applies only to unobserved residuals in the SE-model, it
also applies to the explanatory variables in the SL-model, as is evident in expression
(6.2.4) above. More specifically, since our present weight matrix, W, in expression (7.5.3)
is row normalized, it follows from expression (2.1.19) above that
(7.5.13)  $W 1_n = \begin{pmatrix} w_{11} & \cdots & w_{1n} \\ \vdots & & \vdots \\ w_{n1} & \cdots & w_{nn} \end{pmatrix}\begin{pmatrix} 1 \\ \vdots \\ 1 \end{pmatrix} = \begin{pmatrix} \sum_j w_{1j} \\ \vdots \\ \sum_j w_{nj} \end{pmatrix} = 1_n$

and hence that

(7.5.14)  $B^{-1} 1_n = (1 - \rho)^{-1}\, 1_n$
So in the present case, expression (6.2.4) for SL-models now takes the form:

(7.5.15)  $Y = B^{-1}[1_n, x]\begin{pmatrix}\beta_0 \\ \beta_1\end{pmatrix} + B^{-1}\varepsilon = (B^{-1}1_n)\,\beta_0 + (B^{-1}x)\,\beta_1 + B^{-1}\varepsilon = \dfrac{\beta_0}{1-\rho}\,1_n + (B^{-1}x)\,\beta_1 + B^{-1}\varepsilon$

which may be written as

$Y = \tilde{\beta}_0\, 1_n + \tilde{x}\,\beta_1 + B^{-1}\varepsilon$

with $\tilde{\beta}_0 = \beta_0/(1-\rho)$ and rippled Pale variable, $\tilde{x} = B^{-1}x$.
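To make this rippled variable concrete, the following MATLAB sketch (not part of the course programs) computes $\tilde{x} = B^{-1}x$ from the estimated SL-model, and rescales it to the range of the Blood Group A data as in Figure 7.8:

    % Rippled Pale variable: x_tilde = (I - rho_hat*W)^(-1) * x,
    % computed by solving a linear system rather than forming the inverse
    n = length(x);
    x_tilde = (eye(n) - rho_hat*W) \ x;

    % Rescale so its min and max match those of the Blood Group data, y
    x_plot = min(y) + (x_tilde - min(x_tilde)) ...
                  *(max(y) - min(y))/(max(x_tilde) - min(x_tilde));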
With these preliminary observations, it should now be clear that the relative success of the SL-model versus the SE-model in this Eire case can be attributed entirely to this rippled Pale effect. The dramatic nature of this effect in the Eire case is illustrated in Figure 7.8 below, where values of the rippled Pale are plotted on the far right (and where the maximum and minimum values of the rippled Pale have been rescaled to be the same
as those of Blood Group A). Further reflection suggests that this remarkable fit may not
be simply a coincidence. Indeed, the gradual intermingling of blood-group types between
Anglo-Normans and the indigenous Eire population might well be viewed as a “rippling”
of intermarriage effects over many generations.
With this qualitative overview of SEM and SLM applications to Eire, we turn now to a
more detailed development of the many diagnostics displayed in Figure 7.7. To do so, we
start in Section 8 below with a development of the fundamental significance tests for
model parameters.
In addition, note that the symbol, $\theta$, in (8.1) is treated as a variable which denotes possible parameter values. The desired estimator, $\hat{\theta}_n$, is then distinguished as the value of $\theta$ that maximizes the log-likelihood function, $L_n(\theta)$. But it is also important to distinguish the true value of $\theta$, which we now denote by $\theta_0$. In particular, note that all distributional properties of the random vector, $\hat{\theta}_n$, will necessarily depend on this true value, $\theta_0$. The first major property of maximum-likelihood estimators is that they are consistent, i.e.,

(8.2)  $\lim_{n\to\infty} \Pr\big(\|\hat{\theta}_n - \theta_0\| > \varepsilon\big) = 0\,, \ \text{for all}\ \varepsilon > 0$
This is expressed more compactly by saying that $\hat{\theta}_n$ converges in probability to $\theta_0$, and is written as

(8.3)  $\hat{\theta}_n \xrightarrow{\ \text{prob}\ } \theta_0$
This consistency property ensures that given enough sample information, maximum-
likelihood estimators will eventually “learn” the true values of parameters. Without such
a guarantee, it is hard to consider any estimator as being statistically reliable.
The single most useful tool for establishing such consistency results is the classical Law
of Large Numbers (LLN), which states that for any sequence of independently and
identically distributed (iid) random variables, ( X 1 ,.., X n ) , from a statistical population, X,
(8.4)  $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i \xrightarrow{\ \text{prob}\ } E(X)$
Since this law is one of the two most important results in statistics (and will be used several times below), it is worth pointing out that unlike the other major result, namely the Central Limit Theorem, assertion (8.4) is obtainable by elementary means that are completely intuitive. To do so, recall first from expression (3.1.18) of Part II that we have already shown that $\bar{X}_n$ is an unbiased estimator of $\mu = E(X)$, i.e., that for all n,

(8.5)  $E(\bar{X}_n) = \mu$

Moreover, if we let $\mathrm{var}(X) = \sigma^2$, then it was shown in expression (3.1.19) of Part II that for all n,

(8.6)  $\mathrm{var}(\bar{X}_n) = \sigma^2 / n$

In particular, this implies that the expected squared deviation, $E[(\bar{X}_n - \mu)^2]$, of $\bar{X}_n$ from $\mu$ must shrink to zero as n becomes large. But since the mean (center of mass) of $\bar{X}_n$ is always the same, namely at $\mu$, this implies that the probability distribution of $\bar{X}_n$ must eventually concentrate around $\mu$, as shown schematically in Figure 8.1 below:
[Figure 8.1. Densities $f(\bar{X}_2)$ and $f(\bar{X}_{20})$ concentrating around $\mu$]
Here sample sizes $n = 2, 20$ are shown with $\sigma^2 = 1$, so that the respective sample-mean variances are given by 1/2 and 1/20.1 For the particular epsilon interval, $[\mu - \varepsilon,\ \mu + \varepsilon]$,

1 The densities plotted here are for $X \sim N(\mu, 1)$.
shown, it is clear that almost all of the probability mass for $\bar{X}_{20}$ is already inside this interval. So even without a formal proof, it should be clear that $\bar{X}_n$ must converge in probability to $\mu$.2
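This concentration effect is also easy to see by simulation. The following minimal MATLAB sketch (purely illustrative, not part of the course programs) reproduces the behavior in Figure 8.1 for normal samples with $\mu = 1$ and $\sigma^2 = 1$:

    % Sample means concentrate around mu as n grows (Law of Large Numbers)
    rng(1);                                   % for reproducibility
    mu = 1; reps = 10000;
    for n = [2 20 200]
        Xbar = mean(mu + randn(n, reps), 1);  % reps sample means of size n
        fprintf('n = %3d: var of sample mean = %.4f (theory: %.4f)\n', ...
                n, var(Xbar), 1/n);
    end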
Given the general consistency property in (8.3), the second major property of maximum-likelihood estimators is that their sampling distributions are always asymptotically normal with means given by the true parameter values. This can be expressed somewhat more formally (in a manner analogous to the Central Limit Theorems in Section 3 of Part II) by asserting that for sufficiently large sample sizes, n,

(8.7)  $\hat{\theta}_n \overset{d}{\sim} N\big[\theta_0,\ \mathrm{cov}(\hat{\theta}_n)\big]$

where the relevant covariance matrix, $\mathrm{cov}(\hat{\theta}_n)$, is here left undefined, and will be developed in more detail below. Note in particular from (8.7) that $\hat{\theta}_n$ is always an asymptotically unbiased estimator of $\theta_0$, i.e., that3

(8.8)  $\lim_{n\to\infty} E(\hat{\theta}_n) = \theta_0$
With respect to significance tests for SEM and SLM in particular, recall that all such tests in Figure 7.7 above use z-values [as in expression (7.5.7) above] rather than the standard t-values used for parameter significance tests in OLS models [as in expression (7.3.26) of Part II]. The reason for this is that even though we are assuming multi-normally distributed errors in both SEM and SLM, the exact distributions of the estimators $(\hat{\beta}, \hat{\sigma}^2, \hat{\rho})$ for these models are not necessarily normal, or even expressible in closed form. So we must appeal to the asymptotic normality of such estimators to carry out significance tests, and it is for this reason that z-values are used. (See Section 8.4.1 below for further discussion of z-values versus t-values.)
It should be evident here that (with the notable exception of the Central Limit Theorems
developed in Section 3 of Part II) the present asymptotic analysis is the most technically
challenging material in this NOTEBOOK. In view of this, our present objective is simply
to illustrate these results by examples where these general asymptotic properties reduce to
more familiar results obtainable by elementary means. We start with the classic example4
2 A formal proof amounts simply to Chebyshev's Inequality, which shows in the present case that for any $k > 0$, $\Pr\big(|\bar{X}_n - \mu| \geq k\,(\sigma/\sqrt{n})\big) \leq 1/k^2$. So as long as k increases more slowly than $\sqrt{n}$, both $k\sigma/\sqrt{n}$ and $1/k^2$ can be made arbitrarily small.
3 It might seem obvious from (8.3) that condition (8.8) should hold. But in fact these two conditions are generally quite independent (i.e., each can hold without the other).
of estimating the mean of a univariate normal random variable in Section 8.1 below, and
then proceed to a multivariate example in Section 8.2. This second example involves the
General Linear Model, and will provide a useful conceptual framework for the SEM and
SLM results to follow.
For a given sample, $y = (y_1,..,y_n)$, from a normal random variable, $Y \sim N(\mu, \sigma^2)$, with $\sigma^2$ known, the log-likelihood function for $\mu$ takes the form:

(8.1.1)  $L_n(\mu \mid y, \sigma^2) = \log \prod_{i=1}^{n} f(y_i \mid \mu) = \log \prod_{i=1}^{n} \tfrac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y_i - \mu)^2}$
$\qquad = \sum_{i=1}^{n} \log\!\big(\tfrac{1}{\sqrt{2\pi\sigma^2}}\big) - \tfrac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2 = -\tfrac{n}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2$
If we now use the simplifying notation $L_n'$ for first derivatives,5 and solve the usual first-order condition for a maximum with respect to $\mu$, we see that

(8.1.2)  $0 = L_n'(\mu \mid y) = \tfrac{d}{d\mu} L_n(\mu \mid y) = \tfrac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu) = \tfrac{1}{\sigma^2}\Big(\sum_{i=1}^{n} y_i - n\mu\Big)$
$\Rightarrow\ \hat{\mu}_n = \tfrac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}_n$
and thus that $\hat{\mu}_n$ is precisely the sample mean, $\bar{y}_n$. So the main advantage of this example is that the sampling distribution of this particular maximum-likelihood estimator is obtainable by elementary methods.
4 In fact, this is one of the prime examples used by early contributors to Maximum Likelihood Estimation, including Gauss (1896) and Edgeworth (1908), as well as in the subsequent definitive work of Fisher (1922). For an interesting discussion of these early developments see Hald, A. (1999) "On the History of Maximum Likelihood in Relation to Inverse Probability and Least Squares", Statistical Science, 14: 214-222.
5 Be careful not to confuse this use of primes with that of vector and matrix transposes, like $A'$.
Note first that consistency of this estimator is precisely the Law of Large Numbers in (8.4) with X replaced by the random variable Y in this case. As for the asymptotic normality condition in (8.7), we have a much sharper result for the sample mean. In particular, it follows as a very special case of the Linear Invariance property (Section 3.2.2 of Part II) of the multi-normal random vector, $(Y_1,..,Y_n)$, that the sample mean, $\bar{Y}_n = \sum_{i=1}^{n}\big(\tfrac{1}{n}\big) Y_i$, is exactly normally distributed. In particular, if the true mean of Y is $E(Y) = \mu_0$, so that $Y \sim N(\mu_0, \sigma^2)$, then by linear invariance we obtain the exact sampling distribution of $\hat{\mu}_n$,

(8.1.3)  $\hat{\mu}_n = \bar{Y}_n \sim N(\mu_0,\ \sigma^2/n)$
Given these well-known results for $\hat{\mu}_n$, we now consider how they would be obtained within the general theory of maximum-likelihood estimation. In the present case, the general asymptotic normality result in (8.7) asserts that

(8.1.4)  $\hat{\mu}_n \overset{d}{\sim} N\big[\mu_0,\ \mathrm{var}(\hat{\mu}_n)\big]$

which is clearly consistent with (8.1.3). We leave the derivation of the general asymptotic normality result for the Appendix, and focus here on the large-sample variance, $\mathrm{var}(\hat{\mu}_n)$, which remains to be determined. Thus our primary objective is to show that the value of $\mathrm{var}(\hat{\mu}_n)$ determined by the general theory is precisely $\sigma^2/n$. In doing so, we shall also illustrate the general strategy for analyzing the large sample properties of maximum-likelihood estimators. This will not only yield an asymptotic approximation to the variance of such estimators, but will also show why they are consistent.
The key observation to be made here is that by replacing data values, $y_i$, with their associated random variables, $Y_i$, the log-likelihood function in (8.1.1) can be viewed as a sum of iid random variables, $X_i(\mu) = \log f(Y_i \mid \mu)$,

(8.1.5)  $L_n(\mu \mid Y_1,..,Y_n) = \sum_{i=1}^{n} \log f(Y_i \mid \mu) = \sum_{i=1}^{n} X_i(\mu)$

[where we now suppress the given parameter, $\sigma^2$, except when needed]. So if we divide both sides by n and let $\bar{L}_n = \tfrac{1}{n} L_n$, then this is seen to be the sample mean, $\bar{X}_n(\mu)$, for a sample of size n from the random variable, $X(\mu) = \log f(Y \mid \mu)$, i.e.,

(8.1.6)  $\bar{L}_n(\mu \mid Y_1,..,Y_n) = \tfrac{1}{n}\sum_{i=1}^{n} X_i(\mu) = \bar{X}_n(\mu)$
So if we now denote the mean of this random variable by

(8.1.7)  $\bar{L}(\mu) = E[X(\mu)] = E[\log f(Y \mid \mu)]$

[where the expectation is with respect to $Y \sim N(\mu_0, \sigma^2)$], then it follows from the LLN that $\bar{L}_n(\mu \mid Y_1,..,Y_n)$ converges in probability to this mean, i.e.,

(8.1.8)  $\bar{L}_n(\mu \mid Y_1,..,Y_n) \xrightarrow{\ \text{prob}\ } \bar{L}(\mu)$
Notice also that since 1/n is simply a positive constant, this transformation of $L_n$ has no effect on maxima. So the maximum-likelihood estimator, $\hat{\mu}_n$, for sample data, $(y_1,..,y_n)$, must still be given by

(8.1.9)  $\hat{\mu}_n = \operatorname{argmax}_\mu\, \bar{L}_n(\mu \mid y_1,..,y_n)$

For purposes of analysis, this scaled version of $L_n$ thus constitutes a perfectly good "likelihood" function, and will be treated as such. In these terms, the LLN ensures that the likelihood functions, $\bar{L}_n(\mu \mid y_1,..,y_n)$, must have a unique limiting form, $\bar{L}(\cdot)$, given by (8.1.8), which may be designated as the limiting likelihood function. This implies that essentially all large-sample properties of maximum-likelihood estimators can be studied in terms of this limiting form, and in particular, that the large-sample distribution of $\hat{\mu}_n$ can be obtained.
In the present case, we can learn a great deal by simply computing this limiting likelihood
function. To do so, recall from (8.1.1) and (8.1.7) that
(8.1.10)  $\bar{L}(\mu) = E[\log f(Y \mid \mu)] = E\Big[\log\big(\tfrac{1}{\sqrt{2\pi\sigma^2}}\big) - \tfrac{1}{2\sigma^2}(Y - \mu)^2\Big]$
$\qquad = -\tfrac{1}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\, E\big[(Y - \mu)^2\big]$
$\qquad = -\tfrac{1}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\, E\big[Y^2 - 2\mu Y + \mu^2\big]$
$\qquad = -\tfrac{1}{2}\log(2\pi\sigma^2) - \tfrac{1}{2\sigma^2}\big[E(Y^2) - 2\mu\, E(Y) + \mu^2\big]$
$\qquad = c - \tfrac{1}{2\sigma^2}(\mu - \mu_0)^2$

where $c = -\tfrac{1}{2} - \tfrac{1}{2}\log(2\pi\sigma^2)$ is a constant depending only on $\sigma$ [here we have used the facts that $E(Y^2) = \mu_0^2 + \sigma^2$ and $E(Y) = \mu_0$]. So we see that in the present case, $\bar{L}$ is a simple quadratic function, as shown by the solid black curve in Figure 8.2 below, where we have used the parameter values, $\mu_0 = 1$ and $\sigma^2 = 1$.
[Figure 8.2. Limit Curve and ε-Band]    [Figure 8.3. Estimate Interval for ε-Band]
Notice also that this limiting function achieves its maximum at precisely the true mean value, $\mu_0$, which can be seen from the following first-order condition (that is shown in the Appendix to hold for all limiting likelihood functions):

(8.1.12)  $\bar{L}'(\mu) = -\tfrac{1}{\sigma^2}(\mu - \mu_0)\ \Rightarrow\ \bar{L}'(\mu_0) = 0$
Next observe from expression (8.1.8) that for any given value of $\mu$ on the horizontal axis, the likelihood values, $\bar{L}_n(\mu \mid y_1,..,y_n)$, should eventually be very close to the limiting likelihood value, $\bar{L}(\mu)$, for all sufficiently large data samples, $(y_1,..,y_n)$. However, this does not imply that the entire function, $\bar{L}_n(\cdot \mid y_1,..,y_n)$, will be close to the limiting function, $\bar{L}(\cdot)$. Here one requires a "uniform" version of probabilistic convergence (as detailed in the Appendix). For the present, it suffices to say that under mild regularity conditions, one can ensure uniform convergence in probability on any given interval containing the true mean, $\mu_0$, such as the interval, $I = [-0.5,\ 2.5]$, about $\mu_0 = 1$ shown in
Figure 8.2. What this means is that as sample sizes increase, realized likelihood functions, $\bar{L}_n(\mu \mid y_1,..,y_n)$, will eventually be contained in any given ε-band on interval I (such as the one shown) with probability approaching one. One such realization, $\bar{L}_n$, is shown (schematically) by the blue curve in Figure 8.3,6 with corresponding maximum-likelihood estimate, $\hat{\mu}_n$, also shown.
Consistency of $\hat{\mu}_n$

Large-Sample Variance of $\hat{\mu}_n$
First observe that by taking the second derivative, $\bar{L}'' = (d/d\mu)\,\bar{L}'$, of the limiting likelihood function and evaluating this at $\mu_0$, we see from (8.1.12) that

(8.1.13)  $\bar{L}''(\mu) = \tfrac{d}{d\mu}\,\bar{L}'(\mu) = -\tfrac{1}{\sigma^2}\ \Rightarrow\ \bar{L}''(\mu_0) = -\tfrac{1}{\sigma^2}$
But for sufficiently large sample sizes, n, the scaled likelihood functions, $\bar{L}_n$, were seen to be uniformly close to $\bar{L}$ in the neighborhood of $\mu_0$, so that their shapes should be essentially the same in this neighborhood, i.e., that

(8.1.14)  $\bar{L}_n''(\mu_0) \approx \bar{L}''(\mu_0)$
6 Here it is worth noting from expression (8.1.1) that like the limiting curve, $\bar{L}$, all such realizations, $\bar{L}_n$, in the present case must be smooth quadratic functions (such as the one shown).
But this in turn implies from the definition of $\bar{L}_n$ that the original log-likelihood functions, $L_n = n\bar{L}_n$, must satisfy:

$-\big[L_n''(\mu_0)\big]^{-1} \approx -\big[n\,\bar{L}''(\mu_0)\big]^{-1} = \sigma^2/n$

Finally, since we happen to know that the right hand side is precisely the variance of $\hat{\mu}_n = \bar{Y}_n$, we see that this variance is well approximated by the negative inverse of the second derivative of the log-likelihood function, $L_n$, evaluated at the true mean, $\mu_0$, i.e.,

(8.1.17)  $\mathrm{var}(\hat{\mu}_n) \approx -\big[L_n''(\mu_0)\big]^{-1}$
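As a quick numerical illustration (again purely a sketch, not part of the course programs), one can verify for this normal-mean example that the negative inverse curvature of a realized log-likelihood matches $\sigma^2/n$:

    % Numerical check of (8.1.17): curvature of L_n at mu0 for one realization,
    % versus the theoretical value -n/sigma^2 (so that -1/L2 = sigma^2/n)
    rng(2);
    mu0 = 1; sigma = 1; n = 20;
    y  = mu0 + sigma*randn(n,1);
    Ln = @(m) sum(-0.5*log(2*pi*sigma^2) - (y - m).^2/(2*sigma^2));
    h  = 1e-4;                                      % finite-difference step
    L2 = (Ln(mu0+h) - 2*Ln(mu0) + Ln(mu0-h))/h^2;   % approx L_n''(mu0)
    disp([-1/L2, sigma^2/n])                        % both approx 0.05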
While all of this might seem to be purely coincidental, it is shown in the Appendix that this relation is always true for maximum-likelihood estimators. More importantly for our present purposes, this geometric argument actually suggests why this should be so. To begin with, note that while the first derivative, $\bar{L}'(\mu)$, of the limiting likelihood function reveals its slope at each point, $\mu$, the second derivative, $\bar{L}''(\mu)$, reveals its curvature, i.e., the rate of change of slope at $\mu$. So $\bar{L}''(\mu_0)$ corresponds geometrically to the curvature of the limiting likelihood function at the true mean, $\mu_0$. With this in mind we now illustrate the effects of such curvature in Figures 8.4 and 8.5 below.
[Figure 8.4. Estimate Interval for σ² = 1]    [Figure 8.5. Estimate Interval for σ² = 1/4]
Figure 8.4 simply repeats the relevant features of Figure 8.3. Here we used the variance parameter, $\sigma^2 = 1$, which implies from (8.1.13) that $\bar{L}''(\mu_0) = -1$. Note in particular that this negative sign reflects the concavity of $\bar{L}$ required for a maximum at $\mu_0$. In Figure 8.5 we have used the same value of ε for the ε-band around the function, $\bar{L}$, but have now reduced the variance parameter to $\sigma^2 = 1/4$. This in turn is seen to yield more extreme curvature, $\bar{L}''(\mu_0) = -4$, at $\mu_0$. But the key point to notice is that this sharper curvature necessarily compresses the corresponding ε-containment interval that delimits the feasible range of maximum-likelihood estimates, $\hat{\mu}_n$, for large n. By comparing these two figures, one can see that the permissible deviations from $\mu_0$ in Figure 8.5 are only about half those in Figure 8.4 (decreasing from about 0.8 to 0.4). This in turn implies that the permissible squared deviations are only about a quarter as large. Moreover, the constancy of curvature in the present example implies that this same relation must hold for all ε, and thus that the expected squared deviations of $\hat{\mu}_n$ should also be about a quarter as large. But this is precisely the relative variance of $\hat{\mu}_n$ at each level of curvature. In short, we see that for large samples, n, with log-likelihoods close to the limiting likelihood, the desired variance of $\hat{\mu}_n$ is indeed (inversely) proportional to negative curvature, as in (8.1.17).
Finally, it should be noted that while the constancy of curvature in this example makes
such relations easier to see, this is of course a very special case. More generally, all that
can be said is that for sufficiently large samples, n, almost all realizations of ˆ n will be so
close to 0 that curvature can be treated as constant over the relevant range of ˆ n .
By way of motivation, it should be noted that perhaps the weakest link in the chain of arguments above was the supposition that the curvature of the likelihood function, $L_n''(\mu_0)$, at the true mean is well approximated by that of the limiting likelihood function, $\bar{L}''(\mu_0)$, i.e., that $L_n''(\mu_0) \approx \bar{L}''(\mu_0)$, as in expression (8.1.14) above. Since averaging produces smoothing effects, it should thus be more reasonable to suppose that

(8.1.20)  $\mathrm{var}(\hat{\mu}_n) \approx \big(\!-E[L_n''(\mu_0)]\big)^{-1}$
Finally we note that the negative expected curvature value used in (8.1.20) is of much wider interest, and (in honor of its discoverer) is usually designated as Fisher information,

$I_n(\mu_0) = -E\big[L_n''(\mu_0)\big]$

Note in particular that since higher values of $I_n(\mu_0)$ mean lower variances of $\hat{\mu}_n$, and thus sharper estimates of $\mu_0$, this measure does indeed reflect the amount of "information" in $L_n$ about $\mu_0$. For computational purposes, we must again substitute $\hat{\mu}_n$ for $\mu_0$, to obtain the large-sample variance estimate,

(8.1.23)  $\widehat{\mathrm{var}}(\hat{\mu}_n) = \big[I_n(\hat{\mu}_n)\big]^{-1}$
In subsequent sections it will be shown that explicit expressions for Fisher information can be obtained for both SEM and SLM. So we shall employ this expectation version of variance estimates in our analyses of these models. Thus, to avoid any possible confusion of (8.1.23) with (8.1.18) above, we now follow the standard convention of designating the more direct estimate of negative curvature in (8.1.18), namely $-L_n''(\hat{\mu}_n)$, as observed Fisher information.
7 As shown in the Appendix, it is this expected-curvature expression that is used in formal convergence proofs. So while both approximations are used in practice, the main advantage of the (8.1.17) approach is that it allows the role of geometric curvature to be seen more easily.
8.2 Sampling Distributions for General Linear Models with Known Covariance

Here we consider the general linear model,

(8.2.1)  $Y = X\beta + \varepsilon\,, \quad \varepsilon \sim N(0, V)$

where the covariance matrix, V, is assumed to be known. As in Section 7.2.2, this in turn implies that

(8.2.2)  $Y \sim N(X\beta,\ V)$

Before proceeding with this case, it is worth noting that (8.2.2) is in fact a direct extension of our previous model in Section 8.1. In particular, that model is seen to be the special case in which $V = \sigma^2 I_n$ with $\sigma^2$ known, and in which $X = 1_n$. Thus $\beta$ reduces to a single parameter, $\mu$ ($= \beta_0$), in this case, and we see that:

(8.2.3)  $Y \sim N(\mu 1_n,\ \sigma^2 I_n)\ \Leftrightarrow\ Y_i \overset{iid}{\sim} N(\mu, \sigma^2)\,,\ i = 1,..,n$

So it is perhaps not surprising that the same methods above can be applied to this more general version.8
As mentioned above, the only difference here is that we are now in a multi-parameter setting where sampling distributions must be obtained for the vector of maximum-likelihood estimators in (7.2.18) above, i.e., for

(8.2.4)  $\hat{\beta}_n = (X'V^{-1}X)^{-1}X'V^{-1}y$

But since this is simply a linear transformation of the random vector, Y, we can obtain the sampling distribution of $\hat{\beta}_n$ by again appealing directly to expression (3.2.2) of the Linear Invariance Theorem. To do so, note simply that if we let

(8.2.5)  $A = (X'V^{-1}X)^{-1}X'V^{-1}$

so that $\hat{\beta}_n = AY$, then it follows at once from (3.2.2) together with (8.2.2) above that $\hat{\beta}_n$ is multi-normally distributed,
8 Here we ignore questions of consistency, which involve a somewhat more complex application of the Law of Large Numbers [as for example in Theorem 10.2 in Green (2003)].
and thus that $\hat{\beta}_n$ is exactly multi-normally distributed. Moreover, as we have already shown in expressions (7.3.21) and (7.3.22) of Part II, $\hat{\beta}_n$ has mean vector, $\beta$, and covariance matrix,

$\mathrm{cov}(\hat{\beta}_n) = (X'V^{-1}X)^{-1}X'V^{-1}\,\mathrm{cov}(\varepsilon)\,V^{-1}X\,(X'V^{-1}X)^{-1}$
$\qquad = (X'V^{-1}X)^{-1}X'V^{-1}\,V\,V^{-1}X\,(X'V^{-1}X)^{-1}$
$\qquad = (X'V^{-1}X)^{-1}(X'V^{-1}X)(X'V^{-1}X)^{-1}$
$\qquad = (X'V^{-1}X)^{-1}$

so that

(8.2.10)  $\hat{\beta}_n \sim N\big[\beta,\ (X'V^{-1}X)^{-1}\big]$
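In computational terms, both the estimator (8.2.4) and its covariance are easy to obtain. The following minimal MATLAB sketch (assuming y, X, and a known positive-definite V are in the workspace) uses linear solves rather than explicit inverses where possible:

    % ML/GLS estimate (8.2.4) and its exact covariance, for known V
    ViX   = V \ X;                      % V^(-1)*X via a linear solve
    b_hat = (X'*ViX) \ (ViX'*y);        % (X'V^(-1)X)^(-1) X'V^(-1) y
    cov_b = inv(X'*ViX);                % (X'V^(-1)X)^(-1), as in (8.2.10)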
So the main task is to estimate the covariance matrix, $\mathrm{cov}(\hat{\beta}_n)$, in (8.2.10). To do so, recall from expression (8.1.17) that the desired asymptotic variance estimate was obtained in terms of the second derivative of the log-likelihood function evaluated at the true parameter value. Exactly the same result is true in the multi-parameter case, except that here we must calculate partial derivatives of the log-likelihood function with respect to all parameters. The details of partial derivatives for both the scalar and multi-
dimensional case are developed in Sections A2.5 through A2.7 in the Appendix to Part II (which we here designate as Appendix A2). In the present case, the log-likelihood function is precisely the same as that in expression (7.2.14) above with $\sigma^2 = 1$, i.e.,

(8.2.11)  $L(\beta \mid y) = -\tfrac{n}{2}\log(2\pi) - \tfrac{1}{2}\log|V| - \tfrac{1}{2}(y - X\beta)'V^{-1}(y - X\beta)$

As shown for the OLS case in Section A2.7.3 of Appendix A2, maximizing this function with respect to the parameter vector, $\beta$, amounts to setting all partial derivatives of $L(\beta \mid y)$ equal to zero, where the vector of partial derivatives is called the gradient vector of $L(\beta \mid y)$, and is written as:

(8.2.12)  $\nabla_\beta L(\beta \mid y) = \begin{pmatrix} \tfrac{\partial}{\partial \beta_1} L(\beta \mid y) \\ \vdots \\ \tfrac{\partial}{\partial \beta_k} L(\beta \mid y) \end{pmatrix}$
But since $\beta$ only appears in the last term of (8.2.11), it follows that this first order condition for a maximum reduces to:

(8.2.13)  $0 = \nabla_\beta L(\beta \mid y) = \nabla_\beta\big[-\tfrac{1}{2}(y - X\beta)'V^{-1}(y - X\beta)\big]$
$\quad = -\tfrac{1}{2}\,\nabla_\beta\big[y'V^{-1}y - 2\beta'X'V^{-1}y + \beta'X'V^{-1}X\beta\big]$
$\quad = X'V^{-1}y - X'V^{-1}X\beta$

where the last line follows from expressions (A2.7.7) and (A2.7.11) of Appendix A2. Notice that solving this expression for $\beta$ yields precisely the maximum-likelihood estimate in (8.2.4) above. But our present interest is in the matrix of second partial derivatives of L at the true value of $\beta$, say $\beta_0$,9
(8.2.14)  $\nabla^2_\beta L(\beta_0 \mid y) = \nabla_\beta\big[\nabla_\beta L(\beta \mid y)\big]_{\beta_0} = \begin{pmatrix} \tfrac{\partial^2 L(\beta_0 \mid y)}{\partial \beta_1^2} & \cdots & \tfrac{\partial^2 L(\beta_0 \mid y)}{\partial \beta_1 \partial \beta_k} \\ \vdots & & \vdots \\ \tfrac{\partial^2 L(\beta_0 \mid y)}{\partial \beta_k \partial \beta_1} & \cdots & \tfrac{\partial^2 L(\beta_0 \mid y)}{\partial \beta_k^2} \end{pmatrix}$
which (as in Section A2.7 of Appendix A2) is designated as the Hessian matrix for $L(\beta \mid y)$ evaluated at $\beta_0$. So by the last line of (8.2.13) [together with (A2.7.7) in Appendix A2],10 we see that
9 Note that the intercept coefficient in $\beta$ is here designated as "$\beta_1$" precisely to avoid notational conflicts with this true coefficient vector, $\beta_0$.
10 In particular, it follows from (A2.7.7) that for any symmetric matrix, $B = (b_1,..,b_n)$, we must have $\nabla_x(x'Bx) = (B + B')x = 2Bx$.
(8.2.15)  $\nabla^2_\beta L(\beta_0 \mid y) = \nabla_\beta\big[X'V^{-1}y - X'V^{-1}X\beta\big]_{\beta = \beta_0} = -(X'V^{-1}X)$

But, if we again replace the data vector, y, by its corresponding random vector, Y, and take expectations (under $\beta_0$) then we see in this case that

(8.2.16)  $E\big[\nabla^2_\beta L(\beta_0 \mid Y)\big] = -\,X'V^{-1}X$
So if we now denote the corresponding Fisher Information matrix by

(8.2.18)  $I_n(\beta_0) = -E\big[\nabla^2_\beta L(\beta_0 \mid Y)\big] = X'V^{-1}X$

then it follows from (8.2.17) that the covariance of $\hat{\beta}_n$ is precisely the inverse of the Fisher Information matrix, i.e.,

(8.2.19)  $\mathrm{cov}(\hat{\beta}_n) = I_n(\beta_0)^{-1}$

While this is of course a very special case in which covariance is exactly inverse Fisher Information, it serves to motivate the general results to follow.
Given this multi-parameter example, it can be shown that the asymptotic sampling distributions for general maximum-likelihood estimates are essentially of the same form. In particular, if the log-likelihood function for n samples, $y = (y_1,..,y_n)$, from a distribution with k unknown parameters, $\theta = (\theta_1,..,\theta_k)$, is denoted by $L(\theta \mid y)$ [as in (7.1.8) above], and if the maximum-likelihood estimator for $\theta$ is denoted by $\hat{\theta}_n$ [as in (7.1.9), with sample size, n, made explicit], then (under mild regularity conditions) $\hat{\theta}_n$ is both asymptotically multi-normal and asymptotically unbiased, i.e.,

(8.3.1)  $\hat{\theta}_n \overset{d}{\sim} N\big[\theta_0,\ \mathrm{cov}(\hat{\theta}_n)\big]$
where $\theta_0$ is the true value of $\theta$. So expressions (8.1.4) and (8.2.10) are both seen to be instances of this general expression. Moreover, the asymptotic covariance matrix, $\mathrm{cov}(\hat{\theta}_n)$, takes the same form as in expressions (8.2.17) through (8.2.19). In particular, if we again denote the relevant Fisher Information matrix by

(8.3.2)  $I_n(\theta_0) = -E\big[\nabla^2_\theta L(\theta_0 \mid Y)\big]$

then the asymptotic covariance of $\hat{\theta}_n$ is precisely the inverse of the Fisher Information matrix, i.e.,

(8.3.3)  $\mathrm{cov}(\hat{\theta}_n) = I_n(\theta_0)^{-1}$

Finally, this distribution is again made operational by appealing to the consistency of $\hat{\theta}_n$ to replace $\theta_0$ with $\hat{\theta}_n$ and write

(8.3.4)  $\hat{\theta}_n \sim N\big[\theta_0,\ I_n(\hat{\theta}_n)^{-1}\big]$

By taking this approximation to be the relevant sampling distribution for $\hat{\theta}_n$, one can proceed with a range of statistical analyses regarding the nature of $\theta_0$. But rather than developing testing procedures within this general framework, it is more convenient to do so in the specific contexts of SEM and SLM, to which we now turn.
For the SE-model in (7.3.1) through (7.3.3), it follows at once that the relevant parameter vector is given by $\theta = (\beta, \rho, \sigma^2)$,11 with likelihood function,

(8.4.1)  $L(\theta \mid y) = L(\beta, \sigma^2, \rho \mid y) = -\tfrac{n}{2}\log(2\pi) - \tfrac{n}{2}\log(\sigma^2) + \log|B| - \tfrac{1}{2\sigma^2}(y - X\beta)'B'B(y - X\beta)$

where $B = I_n - \rho W$. For this model, if we now designate the sum of diagonal elements of any matrix, $A = (a_{ij})$, as the trace of A, written $\mathrm{tr}(A) = \sum_i a_{ii}$, and if (for notational

11 For notational simplicity, we here drop transposes and implicitly assume that both $\theta$ and $\beta$ are column vectors.
simplicity) we let $G = WB^{-1} = B^{-1}W$,12 then it can be shown [see Ord (1975) and Appendix B in Doreian (1980)] that the expected value of the Hessian matrix for $L(\theta \mid Y)$ evaluated at the true value of $\theta$ is given by13

(8.4.2)  $E\big[\nabla^2 L(\theta \mid Y)\big] = \begin{pmatrix} E(\nabla^2_{\beta\beta} L) & E(\nabla^2_{\beta\sigma^2} L) & E(\nabla^2_{\beta\rho} L) \\ E(\nabla^2_{\sigma^2\beta} L) & E(\nabla^2_{\sigma^2\sigma^2} L) & E(\nabla^2_{\sigma^2\rho} L) \\ E(\nabla^2_{\rho\beta} L) & E(\nabla^2_{\rho\sigma^2} L) & E(\nabla^2_{\rho\rho} L) \end{pmatrix} = -\begin{pmatrix} \tfrac{1}{\sigma^2}X'B'BX & 0 & 0 \\ 0 & \tfrac{n}{2\sigma^4} & \tfrac{1}{\sigma^2}\mathrm{tr}(G) \\ 0 & \tfrac{1}{\sigma^2}\mathrm{tr}(G) & \mathrm{tr}[G'(G + G')] \end{pmatrix}$
where for simplicity we now drop the subscripts "0" denoting "true" values. It then follows at once from (8.3.2) and (8.3.3) that the asymptotic covariance matrix of the maximum-likelihood estimators, $\hat{\theta} = (\hat{\beta}, \hat{\rho}, \hat{\sigma}^2)$,14 is given by

(8.4.3)  $\mathrm{cov}(\hat{\theta}) = \big(\!-E[\nabla^2 L(\theta \mid Y)]\big)^{-1}$

Thus the desired asymptotic sampling distribution of $(\hat{\beta}, \hat{\sigma}^2, \hat{\rho})$ for SEM is given by

(8.4.4)  $\begin{pmatrix} \hat{\beta} \\ \hat{\sigma}^2 \\ \hat{\rho} \end{pmatrix} \sim N\left[\begin{pmatrix} \beta \\ \sigma^2 \\ \rho \end{pmatrix},\ \begin{pmatrix} \tfrac{1}{\sigma^2}X'B'BX & 0 & 0 \\ 0 & \tfrac{n}{2\sigma^4} & \tfrac{1}{\sigma^2}\mathrm{tr}(G) \\ 0 & \tfrac{1}{\sigma^2}\mathrm{tr}(G) & \mathrm{tr}[G'(G + G')] \end{pmatrix}^{-1}\right]$
12 The last equality follows from the fact that $WB^{-1} = W(I_n - \rho W)^{-1} = (I_n - \rho W)^{-1}W = B^{-1}W$. Because of this, G is defined both ways in the literature [compare for example Ord (1975) and Doreian (1980)].
13 Note in the last line of (8.4.2) that 0 denotes a zero vector of the same length as $\beta$, together with its transpose, $0'$.
14 Again for notational simplicity, we now drop the sample-size subscripts (n) on estimators.
Before applying this distribution to construct specific tests, it is of interest to notice from the pattern of zeros inside this covariance matrix that further simplifications are possible here. In particular, this covariance matrix is seen to be the inverse of a block diagonal matrix. But just as in the case of simple diagonal matrices, matrix multiplication shows that the inverse of any (2×2) block diagonal matrix is given by

(8.4.5)  $\begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} & 0 \\ 0 & B^{-1} \end{pmatrix}$

so that in the present case,

(8.4.6)  $\mathrm{cov}(\hat{\theta}) = \begin{pmatrix} \sigma^2\,(X'B'BX)^{-1} & 0 \\ 0 & \sigma^4 \begin{pmatrix} \tfrac{n}{2} & \sigma^2\,\mathrm{tr}(G) \\ \sigma^2\,\mathrm{tr}(G) & \sigma^4\,\mathrm{tr}[G'(G + G')] \end{pmatrix}^{-1} \end{pmatrix}$
In particular, since the upper-left block can be written as $\sigma^2(X'B'BX)^{-1} = (X'V^{-1}X)^{-1}$, where $V = \sigma^2(B'B)^{-1}$, it follows from expression (8.2.8) [together with (6.1.6) and (6.1.7)] that this block of (8.4.6) is precisely the instance of the GLS covariance in Section 7.6.2 above for the case of an SE-model with known spatial dependency parameter given by $\rho$. In other words, the independence property of SEM allows all analyses of $\hat{\beta}$ to be carried out using the GLS model in (8.2.9) for any given values of the other parameters, $(\sigma^2, \rho)$. As we shall see below, this simplification is not true for SLM.
To develop appropriate tests of parameters for SEM, recall that all unknown true parameter values, $(\beta, \sigma^2, \rho)$, in the above expressions are estimated using $(\hat{\beta}, \hat{\sigma}^2, \hat{\rho})$. So in these terms, the estimated asymptotic covariance for purposes of statistical inference is
obtained by substituting these estimated values, $(\hat{\beta}, \hat{\sigma}^2, \hat{\rho})$, into (8.4.6). For notational convenience, we denote this estimated covariance matrix by $S_{SEM}$, which is now seen to have the block diagonal form:

(8.4.9)  $S_{SEM} = \begin{pmatrix} s^2_{\hat{\beta}_0} & \cdots & s_{\hat{\beta}_0 \hat{\beta}_k} & 0 & 0 \\ \vdots & & \vdots & \vdots & \vdots \\ s_{\hat{\beta}_k \hat{\beta}_0} & \cdots & s^2_{\hat{\beta}_k} & 0 & 0 \\ 0 & \cdots & 0 & s^2_{\hat{\sigma}^2} & s_{\hat{\sigma}^2 \hat{\rho}} \\ 0 & \cdots & 0 & s_{\hat{\rho} \hat{\sigma}^2} & s^2_{\hat{\rho}} \end{pmatrix}$
Before using these estimates, notice from (8.4.6) that we have here factored out the quantity, $\hat{\sigma}^4$, in the lower diagonal matrix [see also Ord (1975), expression (19) in Doreian (1980), and expression (5.48) of Upton and Fingleton (1985) – which is class reference number 18]. The reason for this can be seen from the first row of the (unfactored) lower block-diagonal matrix in (8.4.6), which will have all elements close to zero when the variance, $\sigma^2$, is large. So to avoid possible numerical stability problems when computing the inverse of this matrix, it is convenient to introduce such a factorization.
To apply these estimates, we first consider the most important set of parameter tests, namely tests for beta coefficients, $\beta_j$, $j = 0,1,..,k$, in the linear term, $X\beta$ (where as usual, the intercept, $\beta_0$, tends to be of less interest than the slope coefficients, $\beta_1,..,\beta_k$, for explanatory variables). While such analyses are conceptually similar to those for Geo-Regression in Section 7.3.2 above, there are sufficient differences to warrant a more careful development here. For parameter, $\beta_j$, it follows at once from (8.4.7) that the marginal distribution of the estimator, $\hat{\beta}_j$, must be normal and of the form

(8.4.10)  $\hat{\beta}_j \sim N\big(\beta_j,\ \sigma^2_{\hat{\beta}_j}\big)$

where the relevant variance, $\sigma^2_{\hat{\beta}_j}$, is estimated by the corresponding diagonal element, $s^2_{\hat{\beta}_j}$, of the covariance matrix, $S_{SEM}$, in (8.4.9). So by using this estimate, (8.4.10) can be rewritten as,

(8.4.11)  $\hat{\beta}_j \sim N\big(\beta_j,\ s^2_{\hat{\beta}_j}\big)$
ESE 502 III.8-19 Tony E. Smith
NOTEBOOK FOR SPATIAL DATA ANALYSIS Part III. Areal Data Analysis
______________________________________________________________________________________
This is the operational form of the sampling distribution that we shall employ for testing the significance of $\beta_j$. To employ the standard normal tables, one must first standardize $\hat{\beta}_j$ to obtain the corresponding z-statistic:

(8.4.12)  $z_{\hat{\beta}_j} = \dfrac{\hat{\beta}_j - \beta_j}{s_{\hat{\beta}_j}} \sim N(0,1)\,,\quad j = 0,1,..,k$

where $s_{\hat{\beta}_j} = \sqrt{s^2_{\hat{\beta}_j}}$ denotes the standard deviation of $\hat{\beta}_j$. So under the null hypothesis, $\beta_j = 0$, it follows from (8.4.12) that the z-score, $z_j = \hat{\beta}_j / s_{\hat{\beta}_j}$, must be standard normal, i.e., that

(8.4.13)  $z_j = \dfrac{\hat{\beta}_j}{s_{\hat{\beta}_j}} \sim N(0,1)\,,\quad j = 0,1,..,k$
Finally, if the observed z-score value is denoted by $z_j^{obs}$,15 then the p-value for this (two-sided) test is thus given by

(8.4.14)  $p_j = \Pr\big(|z_j| \geq |z_j^{obs}|\big)$

(with probability mass $p_j/2$ in each tail beyond $\pm|z_j^{obs}|$).
We shall illustrate this test for the Eire example below. But before doing so, it is important to point out that this test treats $s^2_{\hat{\beta}_j}$ in (8.4.11) as a "known" quantity, and in particular ignores all variation in this estimator. But in OLS, for example, this estimator can be shown to be both chi-square distributed and independent of $\hat{\beta}_j$, so that the ratio, $\hat{\beta}_j / s_{\hat{\beta}_j}$, is t-distributed. Thus, the relevant tests of coefficients in OLS are t-tests. But in more general settings such as SEM, the distribution of $\hat{\beta}_j / s_{\hat{\beta}_j}$ is unknown. So the standard "fall back" position is to treat $s^2_{\hat{\beta}_j}$ as a constant (by appealing implicitly to its large-sample consistency property), and to employ the normal distribution in (8.4.11) for testing purposes.
The consequence of this convention is to inflate significance levels (i.e., reduce p-values)
to some degree. In the OLS case, this can be seen by noting that t-distributions have fatter
15 Here "observed" means the actual value calculated by maximum-likelihood estimation.
tails than the standard normal distribution, thus increasing the p-value in (8.4.14) for any given observed value, $z_j^{obs}$. Because of this, some analysts prefer to use a more conservative t-test based on the number of parameters in the model (as in the OLS case). In particular, if we now re-designate the ratio in (8.4.13) as a pseudo t-statistic,

(8.4.15)  $t_j = \dfrac{\hat{\beta}_j}{s_{\hat{\beta}_j}}$

and let $T_v$ denote the t-distribution with v degrees of freedom, then [following Davidson and MacKinnon (1993, Section 3.6)] one can construct a corresponding pseudo t-test of the null hypothesis, $\beta_j = 0$, by assuming that $t_j$ is t-distributed with degrees of freedom equal to n minus the number of model parameters. In this case, the relevant parameter vector, $(\beta, \sigma^2, \rho)$, is of length $k + 3$, so that16

(8.4.16)  $t_j = \dfrac{\hat{\beta}_j}{s_{\hat{\beta}_j}} \sim T_{n-(k+3)}$
The appropriate p-value for this test is then given by the probability in (8.4.14) with respect to the t-distribution in (8.4.16), and will be denoted by $p_j^{t\text{-}pseudo}$. While these values are not reported in the screen output of sem.m or slm.m (as for the Eire example in Figure 7.7), we shall report them here just to illustrate the types of significance inflation that can occur.
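In MATLAB terms, both p-values are easy to compute for any coefficient estimate. The following is a sketch using the Statistics Toolbox functions normcdf and tcdf (these are not calls made by sem.m itself, and b_hat, s_b, n, and k are generic placeholders):

    % Two-sided z-test (8.4.14) and pseudo t-test (8.4.16) p-values,
    % given an estimate b_hat with standard error s_b, n observations,
    % and k explanatory variables:
    z   = b_hat / s_b;
    p_z = 2*(1 - normcdf(abs(z)));        % normal-based p-value
    v   = n - (k + 3);                    % degrees of freedom in (8.4.16)
    p_t = 2*(1 - tcdf(abs(z), v));        % pseudo t-test p-value (always larger)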
But before turning to the Eire example, we first construct a test of the other key parameter of the SE-model, namely the spatial dependence parameter, $\rho$.17 Here it is important to note that unlike the simple (and appealing) rho statistic, $\hat{\rho}_W$, in Section 4.1.1 above, which unfortunately yields an inconsistent estimate of $\rho$, the present maximum-likelihood estimator, $\hat{\rho}$, of $\rho$ is consistent (assuming of course that the SE-model is correctly specified). So from a theoretical perspective, formal hypothesis tests based on this estimator are of great interest. Here the same testing procedure for $\beta_j$ coefficients can be applied with the appropriate changes. First, it again follows from (8.4.4) that the estimator, $\hat{\rho}$, is normally distributed, so that as a parallel to (8.4.11), we now have

(8.4.17)  $\hat{\rho} \sim N\big(\rho,\ s^2_{\hat{\rho}}\big)$
16 Given that $\hat{\beta}$ in (7.3.11) is functionally independent of $\sigma^2$, one could in principle use $v = n - (k+2)$ in (8.4.16). However, we here adopt the (conservative) approach of using all parameters to calculate v.
17 Note that while the error variance, $\sigma^2$, is also a model parameter, and indeed is also asymptotically normally distributed by (8.4.4), one is rarely interested in testing specific hypotheses about $\sigma^2$. So following standard convention, we simply report the estimated value, $\hat{\sigma}^2$, in screen outputs like Figure 7.7.
where $s_{\hat{\rho}} = \sqrt{s^2_{\hat{\rho}}}$. Moreover, since the natural null hypothesis for this parameter is again, $\rho = 0$ (here denoting the absence of spatial autocorrelation), we have the corresponding z-score,

(8.4.18)  $z_{\hat{\rho}} = \dfrac{\hat{\rho}}{s_{\hat{\rho}}} \sim N(0,1)$

under this hypothesis. So the presence of non-zero spatial autocorrelation (either positive or negative) is now gauged by the two-sided p-value,

(8.4.19)  $p_{\hat{\rho}} = \Pr\big(|z_{\hat{\rho}}| \geq |z^{obs}_{\hat{\rho}}|\big)$

where $z^{obs}_{\hat{\rho}}$ is again the observed z-score value. While this is always the default test
employed in spatial regression software, it should be noted that (as in Section 4 above) a one-sided test for positive spatial autocorrelation is generally of more relevance. But we choose to employ the (more conservative) two-sided test to maintain comparability with other software. Finally, as with tests of $\beta$ coefficients above, we shall also report the p-value, $p_{t_{\hat{\rho}}}^{pseudo}$, for the corresponding pseudo t-test that captures at least some of the statistical variation in the estimator, $\hat{\rho}$.
To apply these results to the Eire case, we first note that the estimated covariance matrix, $\hat{S}_{SEM}$, in (8.4.9) is one of the outputs of sem.m, denoted by cov. In the present case, this matrix for the Eire data takes the form in Figure 8.6 below, where each row and column is labeled by its corresponding parameter estimator $(\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2, \hat{\rho})$:

             β̂₀        β̂₁        σ̂²        ρ̂
  β̂₀     1.9464   −0.3061         0         0
  β̂₁    −0.3061    0.7809         0         0
  σ̂²          0         0    0.3874   −0.0211
  ρ̂           0         0   −0.0211    0.0112

Figure 8.6. Estimated covariance matrix, $\hat{S}_{SEM}$, for the Eire data.
This clearly illustrates the block-diagonal structure of $\hat{S}_{SEM}$. To relate this covariance matrix to the SEM results in Figure 7.7, we focus on the important “Pale” coefficient, $\beta_1$. By recalling that the standard error of $\hat{\beta}_1$ in (8.4.12) is given from Figure 8.6 by

(8.4.20)  $s_{\hat{\beta}_1} = \sqrt{0.7809} = 0.8837$

we see that the z-score for a two-sided test of the hypothesis, $\beta_1 = 0$, is given by¹⁸

(8.4.21)  $z_1 = \dfrac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \dfrac{1.5532}{.8837} = 1.7577$
as in Figure 7.7. Finally, the desired p-value for this test is given by

(8.4.22)  $p_1 = \Pr(|z_1| \ge 1.7577) = 2\,\Phi(-1.7577) \approx 0.0788$

as in Figure 7.7. To compare these results with the corresponding pseudo t-test for $\beta_1$, observe first that in this case there are 4 parameters, $(\beta_0, \beta_1, \sigma^2, \rho)$, so that for the $n = 26$ counties in Eire, the appropriate two-sided p-value is calculated with respect to a t-distribution with $v = 26 - 4 = 22$ degrees of freedom, yielding

(8.4.23)  $p_{t_1}^{pseudo} = \Pr(|T_{22}| \ge 1.7577) \approx 0.092$
This is still weakly significant ($\alpha = .10$), but is noticeably less significant than the result in (8.4.22) based on the normal distribution. However, it should be noted that the sample size, $n = 26$, in this Eire example is quite small. So in larger sample sizes, where the tails of $T_{n-(k+3)}$ are much closer to those of $N(0,1)$, this difference will be far less noticeable.
Finally, for completeness, we also calculate the corresponding tests for the spatial dependency parameter, $\rho$. As in (8.4.21), the relevant z-score in (8.4.18) is seen from Figures 7.7 and 8.6 to be

(8.4.24)  $z_{\hat{\rho}} = \dfrac{\hat{\rho}}{s_{\hat{\rho}}} = \dfrac{0.7885}{\sqrt{0.0112}} = 7.467$

with corresponding p-values for the z-test and pseudo t-test given by:

(8.4.25)  $p_{\hat{\rho}} = \Pr(|z_{\hat{\rho}}| \ge 7.467) \approx 8.2 \times 10^{-14}$  and  $p_{t_{\hat{\rho}}}^{pseudo} \approx 1.82 \times 10^{-7}$

So while the pseudo t-test again yields a somewhat weaker result, both p-values are vanishingly small,¹⁹ and confirm that spatial autocorrelation in this model is strongly present.
¹⁸ Note that since all of the following calculation examples are done to a much higher degree of precision than the numbers shown, the results on the right-hand sides will not agree “exactly” with the indicated operations on the left-hand sides.
¹⁹ Note that the reported value in Figure 7.7 is not zero, but rather is simply smaller than the number of decimal places allowed in this (default) printing format.
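To make these testing conventions concrete, the following MATLAB sketch (illustrative only, and not part of sem.m; it assumes the Statistics Toolbox functions normcdf and tcdf are available) reproduces the z-test and pseudo t-test p-values for the “Pale” coefficient from the numbers above:

    % z-test vs pseudo t-test for a single SEM coefficient (illustrative sketch)
    n = 26;                          % number of Eire counties
    k = 1;                           % number of explanatory variables
    beta_hat = 1.5532;               % ML estimate of beta_1 (Figure 7.7)
    s2_beta  = 0.7809;               % corresponding diagonal element of S_SEM
    z   = beta_hat/sqrt(s2_beta);    % z-score as in (8.4.21)
    p_z = 2*(1 - normcdf(abs(z)));   % two-sided normal p-value, (8.4.14)
    v   = n - (k + 3);               % degrees of freedom, (8.4.16)
    p_t = 2*(1 - tcdf(abs(z), v));   % pseudo t-test p-value, (8.4.23)
    fprintf('z = %.4f, p_z = %.4f, p_t = %.4f\n', z, p_z, p_t)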
Using the same notation as above, recall that the log-likelihood function for the SL-model in (6.2.2) through (6.2.4) is given by

(8.5.1)  $L(\theta \,|\, y) = L(\beta, \sigma^2, \rho \,|\, y)$

where the conditional mean is now given by $B_\rho^{-1}X\beta$, and where $\rho$ is now the spatial dependency parameter for the dependent variable, $y$, rather than for the residual errors, $\varepsilon$. If for notational simplicity we let $H(\rho,\beta) = \operatorname{tr}[G_\rho(G_\rho + G_\rho')] + \sigma^{-2}\beta' X' G_\rho' G_\rho X \beta$, and again let $G_\rho = W B_\rho^{-1}$, then the same analysis of this log-likelihood function as in (8.4.2) and (8.4.3) above [see Appendix B in Doreian (1980)] yields the corresponding covariance matrix for SLM:

(8.5.2)  $\operatorname{cov}(\hat{\beta}, \hat{\sigma}^2, \hat{\rho}) = \begin{pmatrix} \sigma^{-2} X'X & 0 & \sigma^{-2} X' G_\rho X \beta \\ 0 & \frac{n}{2\sigma^4} & \frac{1}{\sigma^2}\operatorname{tr}(G_\rho) \\ \sigma^{-2} \beta' X' G_\rho' X & \frac{1}{\sigma^2}\operatorname{tr}(G_\rho) & H(\rho,\beta) \end{pmatrix}^{-1}$
Thus the appropriate asymptotic sampling distribution for SLM is given by:

(8.5.3)  $\begin{pmatrix} \hat{\beta} \\ \hat{\sigma}^2 \\ \hat{\rho} \end{pmatrix} \sim N\left[ \begin{pmatrix} \beta \\ \sigma^2 \\ \rho \end{pmatrix},\; \begin{pmatrix} \sigma^{-2} X'X & 0 & \sigma^{-2} X' G_\rho X \beta \\ 0 & \frac{n}{2\sigma^4} & \frac{1}{\sigma^2}\operatorname{tr}(G_\rho) \\ \sigma^{-2} \beta' X' G_\rho' X & \frac{1}{\sigma^2}\operatorname{tr}(G_\rho) & H(\rho,\beta) \end{pmatrix}^{-1} \right]$
The key difference from SEM is that the present covariance matrix, $\operatorname{cov}(\hat{\beta}, \hat{\sigma}^2, \hat{\rho})$, is not block diagonal. The essential reason for this can be seen by comparing the reduced forms for SEM and SLM in (6.1.8) and (6.2.6), respectively, which we now reproduce:

(8.5.4)  SEM:  $Y = X\beta + B_\rho^{-1}\varepsilon$
(8.5.5)  SLM:  $Y = B_\rho^{-1}X\beta + B_\rho^{-1}\varepsilon$

These are seen to differ only in that $X\beta$ for SEM is replaced by $B_\rho^{-1}X\beta$ for SLM. So the difference here is that $\rho$ in SLM is directly influencing the mean of $Y$ while in SEM it is not [i.e., $E(Y|X) = B_\rho^{-1}X\beta$ rather than $E(Y|X) = X\beta$]. It is this linkage between $\rho$
and $\beta$ in SLM that creates non-zero covariances between $\hat{\rho}$ and the components of $\hat{\beta}$. In particular, the corresponding estimated covariance matrix now takes the form:

(8.5.6)  $\hat{S}_{SLM} = \begin{pmatrix} \hat{\sigma}^{-2} X'X & 0 & \hat{\sigma}^{-2} X' G_{\hat{\rho}} X \hat{\beta} \\ 0 & \frac{n}{2\hat{\sigma}^4} & \frac{1}{\hat{\sigma}^2}\operatorname{tr}(G_{\hat{\rho}}) \\ \hat{\sigma}^{-2} \hat{\beta}' X' G_{\hat{\rho}}' X & \frac{1}{\hat{\sigma}^2}\operatorname{tr}(G_{\hat{\rho}}) & H(\hat{\rho}, \hat{\beta}) \end{pmatrix}^{-1}$
Given this estimated matrix, it follows by using the same notation as for SEM that all relations in expressions (8.4.11) through (8.4.19) continue to hold, where $(\{\hat{\beta}_j : j = 0,1,..,k\}, \hat{\sigma}^2, \hat{\rho})$ is now the vector of maximum-likelihood estimates for SLM rather than SEM, and where the standard deviations, $(\{s_{\hat{\beta}_j} : j = 0,1,..,k\}, s_{\hat{\sigma}^2}, s_{\hat{\rho}})$, are now the square roots of the diagonal elements of $\hat{S}_{SLM}$ rather than $\hat{S}_{SEM}$. So aside from these differences, all z-tests and pseudo t-tests are identical in form.
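As a sketch of how the matrix in (8.5.6) can be assembled (illustrative only; this is not the actual code of slm.m, and the data and estimates below are stand-ins):

    % Assemble the estimated SLM covariance matrix S_SLM of (8.5.6)
    n = 26;
    W = diag(ones(n-1,1),1); W = W + W'; W = W./sum(W,2); % stand-in weights
    X = [ones(n,1), rand(n,1)];                           % stand-in data
    b = [0.5; 2.0]; sig2 = 0.4; rho = 0.7;                % stand-in estimates
    G   = W/(eye(n) - rho*W);            % G_rho = W*inv(I - rho*W)
    GXb = G*X*b;                         % G_rho*X*beta_hat
    H   = trace(G*(G + G')) + (GXb'*GXb)/sig2;  % H(rho,beta) as defined above
    kc  = size(X,2);                     % number of beta coefficients (k+1)
    Info = [ (X'*X)/sig2,    zeros(kc,1),   (X'*GXb)/sig2;
             zeros(1,kc),    n/(2*sig2^2),  trace(G)/sig2;
             (X'*GXb)'/sig2, trace(G)/sig2, H ];
    S_SLM = inv(Info);                   % estimated covariance matrix (8.5.6)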
As with sem.m, the MATLAB program slm.m offers an optional output of the $\hat{S}_{SLM}$ matrix, designated by cov, which in a manner similar to Figure 8.6 above, now has the form:

[Figure 8.7: estimated covariance matrix, $\hat{S}_{SLM}$, for the Eire data, with rows and columns labeled by $(\hat{\beta}_0, \hat{\beta}_1, \hat{\sigma}^2, \hat{\rho})$; the first row is (10.327, 0.8259, 0.4129, −0.3589).]

In contrast to $\hat{S}_{SEM}$, this matrix is no longer block diagonal. Note that while there are zero entries in the information matrix for SLM [recall (8.3.2)], there are indirect links as seen in its inverse. More generally, only block-diagonal patterns of zeros in the Fisher information matrix ensure independence in the multi-normal case.
With these observations, the relevant test statistics can again be obtained from (the right panel of) Figure 7.7 together with the diagonal elements of $\hat{S}_{SLM}$ in Figure 8.7. For $\beta_1$ we see in this case that

(8.5.7)  $z_1 = \dfrac{\hat{\beta}_1}{s_{\hat{\beta}_1}} = \dfrac{2.0142}{\sqrt{0.3366}} = 3.472$

which in turn yields the following p-value for the z-test in Figure 7.7, together with the corresponding pseudo t-test:

(8.5.8)  $p_1 = \Pr(|z_1| \ge 3.472) = 2\,\Phi(-3.472) = 0.00052$  and  $p_{t_1}^{pseudo} = 0.0022$
So again we see that the significance level for the z-test is inflated. But even for the more
conservative pseudo t-test, the “Pale” effect is here vastly more significant than for SEM,
as was seen graphically in Figure 7.8 above.
Turning finally to the spatial dependency parameter, $\rho$, we may again use Figures 7.7 and 8.7 to obtain the following z-score,

(8.5.9)  $z_{\hat{\rho}} = \dfrac{\hat{\rho}}{s_{\hat{\rho}}} = \dfrac{0.7264}{\sqrt{0.0126}} = 6.467$

with corresponding p-values for the z-test and pseudo t-test given by:

(8.5.10)  $p_{\hat{\rho}} = \Pr(|z_{\hat{\rho}}| \ge 6.467) \approx 9.99 \times 10^{-11}$  and  $p_{t_{\hat{\rho}}}^{pseudo} \approx 1.66 \times 10^{-6}$
So even though the “Rippled Pale” in Figure 7.8 fits the Blood Group data far better than
the “Pale” itself, these results show that there remains a great deal of spatial
autocorrelation that is not accounted for by this single explanatory variable.
Unlike Ordinary Least Squares, where there is a single dominant measure of goodness of fit, namely R-squared (and adjusted R-squared), no such dominant measure exists for more general linear models. So relative goodness of fit for models such as SEM and SLM is best gauged by employing a variety of candidate measures, and attempting to establish “dominance” in terms of multiple measures. Recall from Figure 7.7 that seven different measures were reported for each of these models. So the main objective of this section is to clarify the meaning and interpretation of these measures. To do so, we begin in Section 9.1 below with a detailed investigation of the classical R-squared measure. Our objective here is to show why it is appropriate for classical OLS but not for more general models. This will lead to “extended” R-squared measures that can be applied to both SEM and SLM.
[Figures 9.1 and 9.2: scatter plot of data points $(x_i, y_i)$ together with the flat line at the sample mean, $\bar{y}$, and the fitted regression line, $y = \hat{\beta}_0 + \hat{\beta}_1 x$; the deviation $y_i - \bar{y}$ of a representative point is decomposed into the segments $\hat{y}_i - \bar{y}$ and $y_i - \hat{y}_i$.]
From an estimation viewpoint, the regression problem for this data is to find a linear function, $y = \beta_0 + \beta_1 x$, which best fits this data. If we let $e_i$ denote the actual deviation of point $(y_i, x_i)$ from this function (or line), so that by definition,

(9.1.1)  $y_i = \beta_0 + \beta_1 x_i + e_i\,, \quad i = 1,..,n$

then the regression line is defined to be that linear function, $y = \hat{\beta}_0 + \hat{\beta}_1 x$, which minimizes the sum of squared deviations, $\sum_i e_i^2$. In this case, the desired regression line is given by the blue line in Figure 9.2 [where only the single representative data point, $(y_i, x_i)$, from Figure 9.1 is shown here].
To evaluate “goodness of fit” for this line, we first construct an appropriate benchmark for comparison. To do so, it is natural to ask how we might “fit” y-values if the explanatory variable, $x$, were ignored altogether. This can be accomplished by simply setting $\beta_1 = 0$, so that model (9.1.1) reduces to:

(9.1.2)  $y_i = \beta_0 + e_i\,, \quad i = 1,..,n$

In this setting the least-squares fit, $\hat{\beta}_0$, is now obtained by minimizing the sum of squares

(9.1.3)  $S(\beta_0) = \sum_i (y_i - \beta_0)^2$

Here the first-order condition,

(9.1.4)  $0 = \frac{d}{d\beta_0} S(\hat{\beta}_0) = 2\sum_i (y_i - \hat{\beta}_0)(-1) \;\Rightarrow\; 0 = \sum_i (y_i - \hat{\beta}_0) = \sum_i y_i - n\hat{\beta}_0 \;\Rightarrow\; \hat{\beta}_0 = \frac{1}{n}\sum_i y_i = \bar{y}$

shows that the unique solution is the sample mean, and thus that the best least-squares fit to $y$ in this case is precisely the sample mean, $\bar{y}$.
[Recall also the arguments of expressions (7.1.35) and (7.1.36) in Part II]. In other words, if one ignores possible relations with other variables, then the best predictor of y-values based only on data $(y_i : i = 1,..,n)$ is given by the sample mean of this data. So the flat line with value $\bar{y}$ in Figure 9.1 represents the natural benchmark (or null hypothesis) against which to compare the performance of any other possible regression model, such as (9.1.1). But for this benchmark case, it is clear that “goodness of fit” to the y-values can be measured directly in terms of their squared deviations around $\bar{y}$. This can be summarized in terms of the sum of squared deviations,

(9.1.5)  $S_y^2 = \sum_{i=1}^n (y_i - \bar{y})^2$

designated here as the total variation in y.¹ Note in particular that with respect to this measure, one has a perfect fit (i.e., $y_i = \bar{y}$ for all $i = 1,..,n$) if and only if $S_y^2 = 0$.
In this setting, candidate explanatory variables, $x$, for $y$ only have substance insofar as they can reduce this benchmark level of uncertainty in $y$. As we shall see, it is here that

¹ Equivalently, one could take averages, and use the sample variance, $s_y^2 = S_y^2/(n-1)$, of $y$ in model (9.1.2). But as we shall see below, it turns out to be simpler and more direct to consider the fraction of total variation in $y$ that can be accounted for by a given regression model.
the R-squared measure ($R^2$) comes into play. In short, $R^2$ captures the reduction in uncertainty about $y$ that can be achieved by regressing $y$ on any given set of explanatory variables. The key idea can be seen in an intuitive way by reconsidering the regression shown in Figures 9.1 and 9.2 above. Note first that the full deviation, $y_i - \bar{y}$, of the representative point, $(y_i, x_i)$, from the benchmark flat line, $\bar{y}$, is shown explicitly in Figure 9.1. In the presence of the regression line in Figure 9.2, this deviation can be decomposed into two parts by using the predicted value, $\hat{y}_i$, of $y_i$ for this regression. The lower segment, $\hat{y}_i - \bar{y}$, reflects that part of the overall deviation, $y_i - \bar{y}$, that has been “explained” by the regression line, and the upper segment, $y_i - \hat{y}_i$, reflects that part left “unexplained” by the regression. In this context, the essential purpose of $R^2$ is to yield a summary measure of the fractional deviations accounted for by the regression. But notice that this example point, $(y_i, x_i)$, has been carefully chosen so that both the deviation, $y_i - \bar{y}$, and its fractional parts are positive. To ensure positivity, it is more appropriate to ask how much of the squared deviation, $(y_i - \bar{y})^2$, is accounted for by the regression line. Note moreover that not all points will yield such “favorable” results for this regression. For example, data points that happen to be very close to the $\bar{y}$-line will surely be better predicted by $\bar{y}$ than by the regression, so that $(y_i - \bar{y})^2 < (y_i - \hat{y}_i)^2$. Thus the key question to be addressed is how well a given regression is doing with respect to total variation of $y$ in (9.1.5). In the context of Figure 9.2, the main result will be to show that this total variation can be decomposed into the sum of squared deviations of both $y_i - \hat{y}_i$ and $\hat{y}_i - \bar{y}$, i.e., that

(9.1.6)  $S_y^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i (y_i - \hat{y}_i)^2 = \sum_i (\hat{y}_i - \bar{y})^2 + \sum_i \hat{e}_i^2$
If these terms are designated respectively as model variation and residual variation, then this fundamental decomposition says that

(9.1.7)  total variation = model variation + residual variation
In this setting, the desired $R^2$ measure (also called the Coefficient of Determination) is taken to be the fraction of total variation accounted for by model variation, i.e.,

(9.1.8)  $R^2 = \dfrac{\text{model variation}}{\text{total variation}} = \dfrac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}$

or equivalently, in terms of residual variation,

(9.1.9)  $R^2 = 1 - \dfrac{\text{residual variation}}{\text{total variation}} = 1 - \dfrac{\sum_i \hat{e}_i^2}{\sum_i (y_i - \bar{y})^2}$
The task remaining is to demonstrate that this decomposition holds for linear regressions
with any number of explanatory variables. To do so, we begin by developing a “dual”
representation of the regression problem which (among other things) will yield certain
key results for this construction.
To motivate this representation, we again begin with the simplest possible case of one explanatory variable, $x$, together with only three samples, $(y_i, x_i)\,,\; i = 1,2,3$, as shown in Figure 9.3 below.
[Figure 9.3: sample plot of the three points $(x_i, y_i)$ together with a candidate line, $\beta_0 + \beta_1 x$. Figure 9.4: the dual variable plot, in which $y = (y_1, y_2, y_3)$ and $x = (x_1, x_2, x_3)$ appear as single points on sample axes $(s_1, s_2, s_3)$.]
This sample plot is simply another instance of the scatter plot in Figure 9.1, where a candidate line, $\beta_0 + \beta_1 x$, for fitting these three points is shown in blue. As in expression (9.1.1), this yields the identity,

(9.1.10)  $y_i = \beta_0 + \beta_1 x_i + e_i\,, \quad i = 1,2,3$

where again the desired regression line, $\hat{\beta}_0 + \hat{\beta}_1 x$, minimizes the sum of squared deviations, $\sum_i e_i^2 = e_1^2 + e_2^2 + e_3^2$. But recall that (9.1.10) can also be written in vector form as,

(9.1.11)  $\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \beta_0 \begin{pmatrix} 1 \\ 1 \\ 1 \end{pmatrix} + \beta_1 \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix} + \begin{pmatrix} e_1 \\ e_2 \\ e_3 \end{pmatrix} \;\Leftrightarrow\; y = \beta_0 1_3 + \beta_1 x + e$
In this form, the data vectors, $y$ and $x$, can be viewed as single points in the three-dimensional sample space of Figure 9.4 [here drawn as vectors from the origin]. Each of these representations has its own advantages. For the present case of a single explanatory variable, $x$, the more standard sample plot has the advantage of allowing any number of samples to be plotted and displayed. The variable plot in Figure 9.4 is far more restrictive in this context, since the present case of a single explanatory variable with three samples is essentially the only instance in which a graphic representation is even possible.² Nonetheless, this dual representation, or regression dual, reveals key geometric properties of regression that simply cannot be seen in any other way. This is more apparent in Figure 9.5 below, where we have included the unit vector, $1_3 = (1,1,1)'$, from expression (9.1.11) as well.
[Figure 9.5: the vectors $y$, $x$, and $1_3$ in sample space. Figure 9.6: the orthogonal projection, $\hat{y}$, of $y$ into the plane spanned by $x$ and $1_3$, together with the residual vector, $\hat{e}$.]
Note also that we have now colored the vectors, $x$ and $1_3$, and have connected them with a dashed line to emphasize that these two vectors define a two-dimensional plane called the regression plane. In geometric terms, the linear combinations, $\beta_0 1_3 + \beta_1 x$, in expression (9.1.10) above represent possible points on this plane (so for example, $\beta_0 = \beta_1 = 1/2$ corresponds to the point midway on the dashed line joining $x$ and $1_3$). In these terms, the regression problem of finding a point, $\hat{\beta}_0 1_3 + \hat{\beta}_1 x$, in the regression plane that minimizes the sum of squared deviations, $\sum_i e_i^2$, has a very clear geometric interpretation. In particular, since the relation,

(9.1.12)  $\sum_i e_i^2 = \|y - (\beta_0 1_3 + \beta_1 x)\|^2$

shows that this sum of squares is simply the squared distance from $y$ to $\beta_0 1_3 + \beta_1 x$, the regression problem in this dual representation amounts geometrically to finding that point, $\hat{y} = \hat{\beta}_0 1_3 + \hat{\beta}_1 x$, in the regression plane which is closest to $y$. Without going into further details, this closest point is precisely the orthogonal projection of $y$ into this
² Note that while more variables could in principle be included in Figure 9.4, the associated regression would be completely overdetermined. More generally, when variables outnumber sample points, there are generally infinitely many regression planes that all yield perfect fits to the data.
plane, as shown by the red arrow in Figure 9.6,³ where the red dashed line represents the corresponding residual vector, $\hat{e}$, from (9.1.12), as defined by $\hat{e} = y - \hat{y}$.

This view of regression as an orthogonal projection also yields a number of insights into the algebraic structure of regression.⁴ The most important of these follow from the observation that since the residual vector, $\hat{e}$, is orthogonal to the regression plane, it must necessarily be orthogonal to every vector in this plane. In particular, $\hat{e}$ must be orthogonal to both $\hat{y}$ and $1_3$. Not surprisingly, the same is true for regressions in any dimension, $n$ (i.e., with $n$ samples).⁵ So we can generalize these observations by first extending the present case to multiple regressions with $k$ explanatory variables and $n$ samples as,

(9.1.13)  $y = \hat{y} + \hat{e} = X\hat{\beta} + \hat{e} = \hat{\beta}_0 1_n + \sum_{j=1}^k \hat{\beta}_j x_j + \hat{e}$

Here $\hat{y}$ is now the orthogonal projection of $y$ into the regression hyperplane spanned by the vectors $(1_n, x_1,.., x_k)$ in $\mathbb{R}^n$. Moreover (as shown in Section A2.4 of the Appendix to Part II), orthogonality between vectors can be expressed algebraically as follows: vectors, $a, b \in \mathbb{R}^n$, are orthogonal if and only if their inner product is zero, i.e., if and only if $a'b = 0$.⁶ So these observations yield the following two important inner product conditions for any regression in $\mathbb{R}^n$:

(9.1.14)  $1_n'\hat{e} = 0 \quad \text{and} \quad \hat{y}'\hat{e} = 0$

As we shall see, it is precisely these two conditions that allow the total variation of $y$ to be decomposed as desired.
³ Here the $s_2$ axis has been hidden for visual clarity.
⁴ An excellent discussion of all these ideas is given in Sections 3.2.4 and 3.5 of Green (2003). In particular, his Figure 3.2 gives an alternative version of Figure 9.6. For a somewhat more advanced treatment, see Section 1.2 in Davidson and MacKinnon (1993).
⁵ As an extension of footnote 2 above, it is of interest to note that the present case of one explanatory variable with $n = 3$ (non-collinear) samples is in fact the unique case where all the relevant geometry can be seen. On the one hand, three points are just enough to yield a non-trivial regression as in Figure 9.3, while at the same time still allowing a graphical representation of variable vectors in Figure 9.4.
⁶ This is perhaps the most fundamental identity linking the algebra of Euclidean vector spaces to their underlying geometry. As one simple illustrative example, note that any vectors, $a = (a_1, 0)$ and $b = (0, b_2)$, on the horizontal and vertical axes in $\mathbb{R}^2$ must be orthogonal in geometric terms, and in algebraic terms must satisfy $a'b = a_1 \cdot 0 + 0 \cdot b_2 = 0$.
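These two conditions are easy to verify numerically; a minimal MATLAB sketch with simulated data (all names here are illustrative):

    % Verify the inner-product conditions (9.1.14) for an OLS regression
    n = 50;
    X = [ones(n,1), randn(n,2)];       % regressors, with 1_n as first column
    y = X*[1; 2; -1] + randn(n,1);     % simulated data
    bhat = X \ y;                      % least-squares coefficient estimates
    yhat = X*bhat;                     % orthogonal projection of y, (9.1.13)
    ehat = y - yhat;                   % residual vector
    disp([ones(1,n)*ehat, yhat'*ehat]) % 1n'*ehat and yhat'*ehat: both ~ 0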
To exploit these conditions, we first express the sample mean of $y$ in vector terms as

(9.1.15)  $\bar{y} = \frac{1}{n}\sum_{i=1}^n y_i = \frac{1}{n}(1_n'y)$

so that the deviations of $y$ about its mean can be written as follows,

(9.1.16)  $y - \bar{y}1_n = \begin{pmatrix} y_1 - \bar{y} \\ \vdots \\ y_n - \bar{y} \end{pmatrix}$

In particular, if we now define the deviation matrix,

(9.1.17)  $D = I_n - \frac{1}{n}(1_n 1_n')$

then these deviations take the form, $Dy = y - \bar{y}1_n$. Like regression, this transformation is also an orthogonal projection, where in this case $D$ projects $\mathbb{R}^n$ onto the orthogonal complement of the unit vector, $1_n$, i.e., the subspace of all vectors orthogonal to $1_n$. In algebraic terms, $D$ sends $1_n$ to the origin, i.e.,

(9.1.18)  $D 1_n = 1_n - \frac{1}{n}1_n(1_n'1_n) = 1_n - 1_n = 0$

and leaves all vectors orthogonal to $1_n$ where they are. For example, the residual vector, $\hat{e}$, for any regression is orthogonal to $1_n$ by (9.1.14), and we see that,

(9.1.19)  $D\hat{e} = \hat{e} - \frac{1}{n}1_n(1_n'\hat{e}) = \hat{e}$

These facts allow the total variation in (9.1.5) to be expressed directly in terms of $D$ as,

(9.1.20)  $S_y^2 = (y - \bar{y}1_n)'(y - \bar{y}1_n) = (Dy)'(Dy) = y'D'Dy = y'Dy$

where the last equality follows since $D$ is both symmetric and idempotent, i.e.,⁷

(9.1.21)  $D' = D \quad \text{and} \quad DD = D$
⁷ These two conditions in fact characterize the set of orthogonal projection matrices.
Moreover, by combining these properties with (9.1.13) and (9.1.14), this total variation can now be decomposed as

(9.1.22)  $y'Dy = (\hat{y} + \hat{e})'D(\hat{y} + \hat{e}) = \hat{y}'D\hat{y} + 2\,\hat{y}'D\hat{e} + \hat{e}'D\hat{e}$

(9.1.23)  $y'Dy = \hat{y}'D\hat{y} + \hat{e}'\hat{e}$

since $\hat{y}'D\hat{e} = \hat{y}'\hat{e} = 0$ and $\hat{e}'D\hat{e} = \hat{e}'\hat{e}$ by (9.1.19) and (9.1.14). To relate this decomposition to (9.1.6), we note first that if we now denote the residual variation term in (9.1.6) by $S_{\hat{e}}^2$ then it follows at once that this is precisely the second term in (9.1.23), i.e., that

(9.1.24)  $S_{\hat{e}}^2 = \sum_{i=1}^n \hat{e}_i^2 = \hat{e}'\hat{e}$
Turning next to the model variation term in (9.1.6), notice again from (9.1.14) that

(9.1.25)  $1_n'\hat{y} = 1_n'(y - \hat{e}) = 1_n'y - 1_n'\hat{e} = 1_n'y$

and thus that the mean of the regression predictions, $(\hat{y}_1,..,\hat{y}_n)$, is precisely $\bar{y}$, i.e.,

(9.1.26)  $\frac{1}{n}\sum_{i=1}^n \hat{y}_i = \frac{1}{n}(1_n'\hat{y}) = \frac{1}{n}(1_n'y) = \bar{y}$

Thus if we now denote model variation in (9.1.6) by $S_{\hat{y}}^2$, then it follows from (9.1.17) and (9.1.26), together with the above properties of $D$, that

(9.1.27)  $S_{\hat{y}}^2 = \sum_i (\hat{y}_i - \bar{y})^2 = (D\hat{y})'(D\hat{y}) = \hat{y}'D\hat{y}$

and thus that $S_{\hat{y}}^2$ is precisely the first term in (9.1.23). By putting these results together, we may conclude that the desired decomposition of total variation for $y$ is given by

(9.1.28)  $S_y^2 = S_{\hat{y}}^2 + S_{\hat{e}}^2$

In these terms, the R-squared measure in (9.1.8) and (9.1.9) can now be re-expressed as:
(9.1.29)  $R_{OLS}^2 = \dfrac{S_{\hat{y}}^2}{S_y^2} = 1 - \dfrac{S_{\hat{e}}^2}{S_y^2}$

where the OLS subscript is here used to emphasize that this decomposition property holds for OLS. Notice also from the nonnegativity of all terms in (9.1.28) that $0 \le R_{OLS}^2 \le 1$, and thus that $R_{OLS}^2$ can be interpreted as the fraction of total variation explained by a given OLS regression. For computational purposes, it is more convenient to express R-squared in vector terms as,

(9.1.30)  $R_{OLS}^2 = \dfrac{\hat{y}'D\hat{y}}{y'Dy} = 1 - \dfrac{\hat{e}'\hat{e}}{y'Dy}$

where the latter form, in terms of unexplained variation, is by far the most commonly used in practice.
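Both forms of (9.1.30), and the decomposition behind them, are easily checked numerically; a small MATLAB sketch with simulated data:

    % Check the two forms of R-squared in (9.1.30) on simulated OLS data
    n = 50; k = 2;
    X = [ones(n,1), randn(n,k)];
    y = X*[1; 2; -1] + randn(n,1);
    yhat = X*(X \ y);  ehat = y - yhat;
    D = eye(n) - ones(n)/n;                 % deviation matrix, (9.1.17)
    R2_model = (yhat'*D*yhat)/(y'*D*y);     % model-oriented form
    R2_error = 1 - (ehat'*ehat)/(y'*D*y);   % error-oriented form
    disp([R2_model, R2_error])              % identical up to rounding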
While $R_{OLS}^2$ is intuitively very appealing as a measure of goodness of fit, it suffers from certain drawbacks. Perhaps the single most important of these is the fact that the measure can never decrease when more explanatory variables are added to the model, and in fact it almost always increases. This can be most easily seen by relating residual variation to the solution of the regression problem itself. Recall that if for any given set of data, $(y_i, x_{1i},.., x_{ki})\,,\; i = 1,..,n$, we define the sum-of-squares function

(9.1.31)  $S_k(\beta_0, \beta_1,.., \beta_k) = \sum_i \left( y_i - \sum_{j=0}^k \beta_j x_{ij} \right)^2$

over possible beta values $(\beta_0, \beta_1,.., \beta_k)$ [as in expression (7.1.9) of Part II], then the regression problem is to find those values $(\hat{\beta}_0, \hat{\beta}_1,.., \hat{\beta}_k)$ that minimize this function. But the residual variation for this regression problem, say $\hat{e}_k'\hat{e}_k$, is precisely the value of $S_k$ at the minimum, i.e.,

(9.1.32)  $\hat{e}_k'\hat{e}_k = \sum_i \hat{e}_{ik}^2 = \sum_i \left( y_i - \sum_{j=0}^k \hat{\beta}_j x_{ij} \right)^2 = S_k(\hat{\beta}_0, \hat{\beta}_1,.., \hat{\beta}_k)$
If a new explanatory variable, $x_{k+1}$, is now added to the regression, then by setting $\beta_{k+1} = 0$ we see that for all values of $(\beta_0, \beta_1,.., \beta_k)$,

(9.1.33)  $S_{k+1}(\beta_0, \beta_1,.., \beta_k, 0) = \sum_i \left( y_i - \sum_{j=0}^k \beta_j x_{ij} - (0)\,x_{i,k+1} \right)^2 = S_k(\beta_0, \beta_1,.., \beta_k)$
and hence in particular that the minimizing residual variation, $\hat{e}_{k+1}'\hat{e}_{k+1}$, for the expanded regression satisfies

(9.1.34)  $\hat{e}_{k+1}'\hat{e}_{k+1} \;\le\; S_{k+1}(\hat{\beta}_0, \hat{\beta}_1,.., \hat{\beta}_k, 0) = S_k(\hat{\beta}_0, \hat{\beta}_1,.., \hat{\beta}_k) = \hat{e}_k'\hat{e}_k$
Thus, when a new explanatory variable is added to the regression, the resulting residual variation never increases, and in fact must decrease unless the new variable, $x_{k+1}$, is totally unrelated to $y$ in the sense that $\hat{\beta}_{k+1} = 0$. Finally, since $y'Dy$ is the same in both regressions, we may conclude from the last term in (9.1.30) that $R_{OLS}^2$ never decreases, and almost always increases.⁸
This property creates serious problems when using $R_{OLS}^2$ as a criterion for model selection. Since $R_{OLS}^2$ can always be increased by adding more variables to a given model, this will lead inevitably to the classic problem of “overfitting the data”. Indeed, for problems with $n$ samples, it is easy to see that a perfect fit ($R_{OLS}^2 = 1$) can be guaranteed by increasing the number of (non-collinear) explanatory variables, $k$, to $n - 1$. For example, if there were only $n = 2$ samples, then since two points define a unique line, almost any simple regression ($k = 1$) must yield a perfect fit.
This serves to underscore the need to modify $R_{OLS}^2$ to reflect the number of explanatory variables used in a given regression model. This can be accomplished by essentially “penalizing” those models with larger numbers of explanatory variables. The standard procedure for doing so is to replace $R_{OLS}^2$ by the following modification, $\bar{R}_{OLS}^2$, designated as adjusted R-squared:

(9.1.35)  $\bar{R}_{OLS}^2 = 1 - \dfrac{\hat{e}'\hat{e}/(n-1-k)}{y'Dy/(n-1)} = 1 - \dfrac{n-1}{n-1-k}\,(1 - R_{OLS}^2)$

Here the first equality is the standard definition of $\bar{R}_{OLS}^2$, and the second equality simply re-expresses this measure directly in terms of $R_{OLS}^2$. While this measure can be given a theoretical justification in terms of unbiased variance estimation,⁹ it is perhaps best viewed as a practical penalization of larger models.
⁸ The exact magnitude of this increase is given in Green (2003, Theorem 3.6).
In spite of the success of $R_{OLS}^2$ and $\bar{R}_{OLS}^2$ for OLS models, their appropriateness as goodness-of-fit measures for more general models is more problematic. Here it suffices to consider the simplest possible extension involving the GLS model in Section 7.2.2 above,

(9.2.1)  $Y = X\beta + \varepsilon\,, \quad \varepsilon \sim N(0, \sigma^2 V)$

with known covariance structure, $V$. In this modeling context, the key difficulty is that the resulting y-predictions obtained from (7.2.18) by

(9.2.2)  $\hat{y} = X\hat{\beta} = X(X'V^{-1}X)^{-1}X'V^{-1}y$
⁹ The standard theoretical justification relies on the fact that (i) $y'Dy/(n-1)$ yields an unbiased estimate of $y$ variance in the null model (9.1.2), (ii) $\hat{e}'\hat{e}/(n-1-k)$ yields an unbiased estimate of residual variance, $\sigma^2$, in the regression model, and (iii) the second term in (9.1.35) is precisely the ratio of these unbiased estimates. But while this argument is appealing, it does not imply that this ratio is an unbiased estimate of the fraction of unexplained variance. Indeed, the expectation of a ratio is almost never the same as the ratio of expectations.
¹⁰ An excellent discussion of this issue is given in Davidson and MacKinnon (1993, Sections 1.2 and 9.3).
are no longer orthogonal projections of $y$, so that the appealing geometric features of $R_{OLS}^2$ now vanish. In particular, the model-oriented and error-oriented definitions of $R_{OLS}^2$ in (9.1.30) are no longer equivalent. So there is no unambiguous way to define the “fraction of variation explained” by the given GLS model.
But as in the introductory discussion to Section 9.1 above, the residual vector, $\hat{e} = y - \hat{y}$, still captures the deviations of data, $y$, from their predicted values, $\hat{y}$, under any GLS model. Moreover, since $Dy = y - \bar{y}1_n$ still represents the y deviations from their least-squares prediction, $\bar{y}$, under the null model [as in (9.1.4) above], it is reasonable to gauge the goodness of fit of this model by comparing its mean squared error,

(9.2.3)  $MSE = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2$

with the corresponding mean squared error of the null model,

(9.2.4)  $MSE_0 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2$
[Figure: two scatter plots comparing the deviations of the data, $y$, about the model predictions, $\hat{y}$, with their deviations about the null prediction, $\bar{y}$.]
In particular, the positivity (and common units) of these measures suggests that their ratio should provide an appropriate comparison, as given by

(9.2.5)  $\dfrac{MSE}{MSE_0} = \dfrac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2} = \dfrac{(y - \hat{y})'(y - \hat{y})}{(y - \bar{y}1_n)'(y - \bar{y}1_n)}$
which is precisely the second term in the error-oriented version of $R_{OLS}^2$. Finally, since smaller values of this ratio indicate better average fit relative to the null model, it follows that larger values of the difference,

(9.2.6)  $R_{GLS}^2 = 1 - \dfrac{MSE}{MSE_0} = 1 - \dfrac{\hat{e}'\hat{e}}{y'Dy}$
2
also indicate a better fit. To distinguish this general measure from ROLS , it is convenient
2
to designate (9.2.6) as extended R . This terminology also serves to emphasize that
(9.2.6) cannot be interpreted as “explained variation” outside of the OLS case. This is
made clear by the fact that extended R 2 can be negative. But as with adjusted R 2 for
OLS, it should be clear that negative values of extended R 2 are a strong indication of
poor fit. Indeed, models with higher mean squared error than y by itself can generally be
ruled out on this basis alone.
Finally, as with the OLS case, it should be clear that larger numbers of explanatory variables must necessarily reduce MSE and thus increase the value of extended $R^2$. So goodness of fit for GLS models must also be penalized for the addition of new variables. While the penalty ratio, $(n-1)/(n-1-k)$, in (9.1.35) is somewhat more difficult to interpret in the GLS setting,¹¹ it nonetheless continues to exhibit the same appealing properties discussed in Section 9.1.3 above. So in the present GLS setting, we now designate

(9.2.7)  $\bar{R}_{GLS}^2 = 1 - \dfrac{n-1}{n-1-k} \cdot \dfrac{\hat{e}'\hat{e}}{y'Dy}$

as the corresponding adjusted extended $R^2$.
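In computational terms these measures depend only on $y$ and $\hat{y}$; a minimal MATLAB sketch (the function handles here are illustrative, not part of sem.m or slm.m):

    % Extended R-squared (9.2.6) and adjusted extended R-squared (9.2.7),
    % computed directly from the data y and any model prediction yhat
    ext_R2 = @(y,yhat) 1 - sum((y - yhat).^2)/sum((y - mean(y)).^2);
    adj_ext_R2 = @(y,yhat,k) 1 - ((numel(y)-1)/(numel(y)-1-k)) ...
                                 * sum((y - yhat).^2)/sum((y - mean(y)).^2);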
Before applying these extended measures to SEM and SLM, it is also of interest to note that there is an alternative approach which seeks to preserve the appealing properties of $R_{OLS}^2$. In particular, recall that one can convert any given GLS model to an OLS model that is equivalent in terms of parameter estimation. In the present setting, it follows from expressions (7.1.15) through (7.1.18) that if $T$ is the Cholesky matrix for $V$, so that $V = TT'$, then (9.2.1) can be converted to an OLS model

(9.2.8)  $Y_o = X_o\beta + \varepsilon_o\,, \quad \varepsilon_o \sim N(0, \sigma^2 I_n)$

where

(9.2.9)  $Y_o = T^{-1}Y\,, \quad X_o = T^{-1}X\,, \quad \varepsilon_o = T^{-1}\varepsilon$
¹¹ While the simple “unbiasedness” argument in footnote 9 no longer holds, it can still be shown that replacing $n$ by $n-1-k$ corrects bias in the GLS estimate of variance, $\hat{\sigma}^2$, in (7.2.20). So at least in these terms, a justification in terms of “unbiasedness” can still be made.
So if goodness of fit for model (9.2.1) is now measured in terms of $R^2$ and $\bar{R}^2$ for model (9.2.8), then it would appear that all of the properties of these measures are preserved. In particular, if for any given $y$ data, we set $y_o = T^{-1}y$, then the appropriate prediction, say $\hat{y}_o$, is given by

(9.2.10)  $\hat{y}_o = X_o\hat{\beta} = X_o(X_o'X_o)^{-1}X_o'y_o$

So by setting $\hat{e}_o = y_o - \hat{y}_o$, it follows that the appropriate R-squared measure, say $R_o^2$, is given from (9.1.30) by

(9.2.11)  $R_o^2 = \dfrac{\hat{y}_o'D\hat{y}_o}{y_o'Dy_o} = 1 - \dfrac{\hat{e}_o'\hat{e}_o}{y_o'Dy_o}$
Such measures are typically designated as pseudo R-squared measures for GLS models [see for example, Buse (1973)]. However, the most serious limitation of such measures is that they account for total variation in $y_o = T^{-1}y$ rather than in $y$ itself. This is not only difficult to interpret, but in fact can vary depending on the factorization of covariance used. For example, the estimated SEM covariance matrix, $\hat{V}$, in (7.3.2) has a natural factorization in terms of the matrix, $\hat{B}^{-1}$, which will clearly yield different results than for the Cholesky matrix. So the essential appeal of the extended $R^2$ and $\bar{R}^2$ measures above is that they are directly interpretable in terms of $y$ and $\hat{y}$.
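For completeness, a sketch of this pseudo R-squared computation in MATLAB (the stand-in data and covariance below are purely illustrative):

    % Pseudo R-squared (9.2.11) via the Cholesky transform of a GLS model
    n = 26;
    X = [ones(n,1), rand(n,1)];  y = rand(n,1);  % stand-in data
    A = randn(n);  V = A*A' + n*eye(n);          % stand-in SPD covariance
    T  = chol(V, 'lower');                       % Cholesky matrix: V = T*T'
    yo = T \ y;   Xo = T \ X;                    % transformed model (9.2.8)
    eo = yo - Xo*(Xo \ yo);                      % transformed OLS residuals
    D  = eye(n) - ones(n)/n;
    R2_o = 1 - (eo'*eo)/(yo'*D*yo);              % pseudo R-squared (9.2.11)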
Turning first to SEM, recall from expression (6.1.8) that for any given spatial weights matrix, $W$, we can express SEM as a GLS model of the form:

(9.2.12)  $Y = X\beta + \varepsilon\,, \quad \varepsilon \sim N(0, \sigma^2 V)$

where

(9.2.13)  $V = (B_\rho' B_\rho)^{-1}$
(9.2.14)  $B_\rho = I_n - \rho W$

So for any given $y$ data, the maximum-likelihood estimate, $\hat{y}_{SEM}$, of the conditional mean, $E(Y|X) = X\beta$, is given by
(9.2.15)  $\hat{y}_{SEM} = X\hat{\beta} = X(X'\hat{V}^{-1}X)^{-1}X'\hat{V}^{-1}y = X(X'\hat{B}'\hat{B}X)^{-1}X'\hat{B}'\hat{B}y$

where $\hat{B} = I_n - \hat{\rho}W$. Finally, letting

(9.2.16)  $\hat{e}_{SEM} = y - \hat{y}_{SEM}$

it follows from (9.2.6) that the extended $R^2$ measure for SEM is given by,

(9.2.17)  $R_{SEM}^2 = 1 - \dfrac{\hat{e}_{SEM}'\hat{e}_{SEM}}{y'Dy}$

with corresponding adjusted version,

(9.2.18)  $\bar{R}_{SEM}^2 = 1 - \dfrac{n-1}{n-1-k}\,(1 - R_{SEM}^2)$
These two values are reported for the Eire data in the left panel of Figure 7.7 as

(9.2.19)  $R_{SEM}^2 = 0.3313 \quad (R_{OLS}^2 = 0.5548)$

and

(9.2.20)  $\bar{R}_{SEM}^2 = 0.3034 \quad (\bar{R}_{OLS}^2 = 0.5363)$
where the corresponding OLS values are given in parentheses. As expected, these extended measures for SEM are lower than for OLS, since they incorporate more of the true error variation due to spatial dependencies among residuals.¹² So the main interest in these goodness-of-fit measures is their relative magnitudes compared to SLM, or other models which may serve to account for spatial dependencies (such as the spatial Durbin model in Section 6.3.2).
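A sketch of the computation in (9.2.15) through (9.2.18), given estimates of the kind produced by sem.m (all inputs below are stand-ins):

    % Extended R-squared for SEM, following (9.2.15)-(9.2.18)
    n = 26; k = 1; rho = 0.78;                            % stand-in estimate
    W = diag(ones(n-1,1),1); W = W + W'; W = W./sum(W,2); % stand-in weights
    X = [ones(n,1), rand(n,1)];  y = rand(n,1);           % stand-in data
    B = eye(n) - rho*W;                      % B_rho as in (9.2.14)
    A = B'*B;                                % proportional to inv(V hat)
    bhat = (X'*A*X) \ (X'*A*y);              % ML/GLS coefficients, (9.2.15)
    ehat = y - X*bhat;                       % SEM residuals, (9.2.16)
    tv   = sum((y - mean(y)).^2);            % total variation, y'*D*y
    R2_SEM  = 1 - (ehat'*ehat)/tv;                        % (9.2.17)
    R2a_SEM = 1 - ((n-1)/(n-1-k))*(1 - R2_SEM);           % (9.2.18)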
Turning next to SLM, recall from (6.2.6) that this can also be expressed as a GLS model of the form:

(9.2.21)  $Y = B_\rho^{-1}X\beta + u\,, \quad u \sim N(0, \sigma^2 V)$
¹² This can be seen explicitly by observing from the SEM log likelihood function in (7.3.4) that for the OLS case of $\rho = 0$, the estimate, $\hat{\beta}$, is chosen precisely to minimize mean squared error. So whenever $\hat{\rho} \ne 0$, one can expect that the associated mean squared error for SEM will be larger than this global minimum.
where $V$ is again given by (9.2.13) and (9.2.14) for some choice of spatial weights matrix, $W$, and where in this case the conditional mean of $Y$ is given by

(9.2.22)  $E(Y|X) = B_\rho^{-1}X\beta$

So for any given $y$ data, the maximum-likelihood estimate, $\hat{y}_{SLM}$, of this conditional mean is given in terms of (7.4.13) by

(9.2.23)  $\hat{y}_{SLM} = B_{\hat{\rho}}^{-1}X\hat{\beta}$

Finally, letting

(9.2.24)  $\hat{e}_{SLM} = y - \hat{y}_{SLM}$

it follows from (9.2.6) that the extended $R^2$ measure for SLM is given by,

(9.2.25)  $R_{SLM}^2 = 1 - \dfrac{\hat{e}_{SLM}'\hat{e}_{SLM}}{y'Dy}$

with corresponding adjusted version,

(9.2.26)  $\bar{R}_{SLM}^2 = 1 - \dfrac{n-1}{n-1-k}\,(1 - R_{SLM}^2)$
These two values are reported for the Eire data in the right panel of Figure 7.7 as

(9.2.27)  $R_{SLM}^2 = 0.7335 \quad (R_{OLS}^2 = 0.5548)$

and

(9.2.28)  $\bar{R}_{SLM}^2 = 0.7224 \quad (\bar{R}_{OLS}^2 = 0.5363)$
where the corresponding OLS values are again given in parentheses. So in contrast to SEM, we see that both $R_{SLM}^2$ and $\bar{R}_{SLM}^2$ for SLM are actually considerably higher than for OLS. The reason for this is again explained by the contrast between the “pale” effect in $X\hat{\beta}$ and the “rippled pale” effect, $B_{\hat{\rho}}^{-1}X\hat{\beta}$, as illustrated in Figure 7.8 above. However, this appears to be a very exceptional case in which $\hat{y}_{SLM}$ ($= B_{\hat{\rho}}^{-1}X\hat{\beta}$) happens to yield an
extraordinarily good fit to $y$. More generally, one expects both SEM and SLM to yield extended $R^2$ values that are lower than $R_{OLS}^2$, so that the spatial components, $W$ and $\rho$, serve mainly to capture the hidden variation arising from spatial autocorrelation effects.
A measure that turns out to be closely related to extended $R^2$ is the squared correlation between $y$ and its predicted value, $\hat{y}$, under any GLS model (including OLS). Here it is again convenient to begin with the OLS case, where this measure is shown to be identical to $R^2$. We then proceed to the more general case of GLS models, including both SEM and SLM. Finally, the correlation measure itself is given a geometrical interpretation in terms of angle cosines in deviation subspaces, which helps to clarify its relevance for measuring goodness of fit.
Let us begin by recalling that the sample correlation, $r(x,y)$, between any pair of data vectors, $x = (x_1,.., x_n)'$ and $y = (y_1,.., y_n)'$, can be expressed in vector form by employing the properties of the deviation matrix, $D$, in (9.1.17), (9.1.18) and (9.1.21) as follows:

(9.3.1)  $r(x,y) = \dfrac{\sum_{i=1}^n (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sqrt{\sum_{i=1}^n (x_i - \bar{x}_n)^2}\,\sqrt{\sum_{i=1}^n (y_i - \bar{y}_n)^2}} = \dfrac{(x - \bar{x}_n 1_n)'(y - \bar{y}_n 1_n)}{\sqrt{(x - \bar{x}_n 1_n)'(x - \bar{x}_n 1_n)}\,\sqrt{(y - \bar{y}_n 1_n)'(y - \bar{y}_n 1_n)}}$
$= \dfrac{(Dx)'Dy}{\sqrt{(Dx)'Dx}\,\sqrt{(Dy)'Dy}} = \dfrac{x'D'Dy}{\sqrt{x'D'Dx}\,\sqrt{y'D'Dy}} = \dfrac{x'Dy}{\sqrt{x'Dx}\,\sqrt{y'Dy}}$

so that the squared correlation takes the form,

(9.3.2)  $r^2(x,y) = \dfrac{(x'Dy)^2}{(x'Dx)(y'Dy)}$
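The vector form (9.3.2) is easily checked against MATLAB's built-in sample correlation; a minimal sketch:

    % Squared correlation via the deviation matrix, checked against corrcoef
    n = 40;
    x = randn(n,1);  y = 2*x + randn(n,1);   % simulated data pair
    D = eye(n) - ones(n)/n;                  % deviation matrix, (9.1.17)
    r2 = (x'*D*y)^2/((x'*D*x)*(y'*D*y));     % squared correlation, (9.3.2)
    C  = corrcoef(x,y);                      % built-in correlation matrix
    disp([r2, C(1,2)^2])                     % identical up to rounding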
Given this general expression, we now consider the correlation between data, y, and
model predictions, ŷ , for the case of OLS.
In these terms, the squared correlation measure for OLS, with prediction $\hat{y}_{OLS} = X\hat{\beta}$, is given in terms of (9.3.2) by

(9.3.4)  $r^2(y, \hat{y}_{OLS}) = \dfrac{(y'D\hat{y}_{OLS})^2}{(y'Dy)(\hat{y}_{OLS}'D\hat{y}_{OLS})}$

With this definition, our first objective is to show that (9.3.4) is precisely the same as $R_{OLS}^2$. If for notational simplicity we let $\hat{y} = \hat{y}_{OLS}$ and again denote the estimated residuals for OLS by $\hat{e} = y - \hat{y}$, then it follows from expression (9.1.14) that

$y'D\hat{y} = (\hat{y} + \hat{e})'D\hat{y} = \hat{y}'D\hat{y} + (D\hat{e})'\hat{y} = \hat{y}'D\hat{y} + \hat{e}'\hat{y} = \hat{y}'D\hat{y}$

which together with the first (model-oriented) representation of $R_{OLS}^2$ implies that

(9.3.9)  $r^2(y, \hat{y}) = \dfrac{(\hat{y}'D\hat{y})^2}{(y'Dy)(\hat{y}'D\hat{y})} = \dfrac{\hat{y}'D\hat{y}}{y'Dy} = R_{OLS}^2$
For purposes of later comparison, it follows from (9.3.9) that for the Eire case,

(9.3.10)  $r^2(y, \hat{y}_{OLS}) = R_{OLS}^2 = 0.5548$
By employing $\hat{y}_{SEM}$ in expression (9.2.15), it follows at once that the squared correlation measure for SEM is given by,

(9.3.11)  $r^2(y, \hat{y}_{SEM}) = \dfrac{(y'D\hat{y}_{SEM})^2}{(y'Dy)(\hat{y}_{SEM}'D\hat{y}_{SEM})}$

and similarly, by employing $\hat{y}_{SLM}$ in (9.2.23), the squared correlation measure for SLM is given by,

(9.3.12)  $r^2(y, \hat{y}_{SLM}) = \dfrac{(y'D\hat{y}_{SLM})^2}{(y'Dy)(\hat{y}_{SLM}'D\hat{y}_{SLM})}$

with corresponding Eire values denoted by (9.3.13) and (9.3.14), respectively.
Notice first that the squared correlation for SEM is identical with that of OLS. This appears somewhat surprising, given that their estimated beta coefficients are quite different. But in fact, this is an instance of the strong scale invariance properties of correlation. To see this, we again write $\hat{y}$ for the prediction of a given model, so that (9.3.2) yields

(9.3.15)  $r^2(y, \hat{y}) = \dfrac{(y'D\hat{y})^2}{(y'Dy)(\hat{y}'D\hat{y})}$

and observe that for the case of only one explanatory variable, the $\hat{y}$ values for both SEM and OLS must be linear combinations of $1_n$ and $x$, i.e., must be of the form,

(9.3.16)  $\hat{y} = a\,1_n + b\,x$

for some scalars $a$ and $b$. But note first from the properties of the deviation matrix, $D$, that

(9.3.17)  $D\hat{y} = a\,D1_n + b\,Dx = b\,Dx$

and thus that $D\hat{y}$ is already independent of $a$. Moreover, (9.3.17) in turn implies both that $y'D\hat{y} = b\,(y'Dx)$ and that $\hat{y}'D\hat{y} = b^2(x'Dx)$, so that
(9.3.18)  $r^2(y, \hat{y}) = \dfrac{b^2(y'Dx)^2}{(y'Dy)\,b^2(x'Dx)} = \dfrac{(y'Dx)^2}{(y'Dy)(x'Dx)} = r^2(y, x)$

and we may conclude that squared correlation depends only on $y$ and $x$. So in particular, the squared correlation of OLS and SEM must always be the same for the case of one explanatory variable.
However, this is clearly not true for SLM, where $X\hat{\beta} = \hat{\beta}_0 1_n + \hat{\beta}_1 x$ is transformed to

$\hat{y}_{SLM} = B_{\hat{\rho}}^{-1}X\hat{\beta} = \hat{\beta}_0\,B_{\hat{\rho}}^{-1}1_n + \hat{\beta}_1\,B_{\hat{\rho}}^{-1}x$

so that $\hat{y}$ is no longer of the form (9.3.16). Thus there is little relation between the squared correlations for SLM and OLS, and as we have seen before, the squared correlation fit for SLM in (9.3.14) is much higher than for OLS (and SEM).
To gain further insight into the role of squared correlation as a general measure of goodness-of-fit, it is instructive to start with the correlation coefficient itself. As we shall show below, if one writes vectors, $x, y \in \mathbb{R}^n$, in deviation form as $Dx = x - \bar{x}1_n$ and $Dy = y - \bar{y}1_n$, then from a geometric viewpoint, the correlation coefficient, $r(x,y)$, in (9.3.1) turns out to be precisely the cosine of the angle, $\theta(Dx, Dy)$, between these vectors, i.e.,

(9.3.20)  $r(x,y) = \cos[\theta(Dx, Dy)]$

This is most easily seen by first considering the cosine of the angle, $\theta(x,y)$, between any pair of (nonzero) vectors, $x, y \in \mathbb{R}^n$, as shown for $n = 2$ in Figure 9.9 below:
[Figure 9.9: the angle $\theta(x,y)$ between two vectors, $x$ and $y$, in $\mathbb{R}^2$. Figure 9.10: the right triangle formed by projecting $y$ orthogonally onto the $x$-vector.]
To calculate the cosine of this angle, we first construct a right triangle by finding the point, $\lambda x$, on the $x$-vector for which the line segment, $y - \lambda x$, is orthogonal to $x$, as shown by the red dotted line in Figure 9.10. Since vectors are orthogonal if and only if their inner product is zero, this point can be identified by solving:

(9.3.21)  $0 = x'(y - \lambda x) = x'y - \lambda\,x'x \;\Rightarrow\; \lambda = \dfrac{x'y}{x'x} = \dfrac{x'y}{\|x\|^2}$

Next, recall (from trigonometry) that for this right triangle, the desired cosine of $\theta(x,y)$ is given by the (signed) length of the adjacent side, i.e., $\lambda\|x\|$, over the length of the hypotenuse, $\|y\|$, so that

(9.3.22)  $\cos[\theta(x,y)] = \dfrac{\lambda\|x\|}{\|y\|} = \dfrac{x'y}{\|x\|^2} \cdot \dfrac{\|x\|}{\|y\|} = \dfrac{x'y}{\|x\|\,\|y\|}$
Before proceeding further, recall from expression (4.1.12) that this already establishes (9.3.20) for the case of “zero mean” vectors. But the more general case is now obtained by simply considering the vectors, $Dx$ and $Dy$. In particular, since by definition,

(9.3.23)  $\|Dx\| = \sqrt{(Dx)'(Dx)} = \sqrt{x'Dx}$

and similarly, $\|Dy\| = \sqrt{y'Dy}$, it follows at once from (9.3.1) together with (9.3.22) and (9.3.23) that

(9.3.24)  $\cos[\theta(Dx, Dy)] = \dfrac{(Dx)'Dy}{\|Dx\|\,\|Dy\|} = \dfrac{x'Dy}{\sqrt{x'Dx}\,\sqrt{y'Dy}} = r(x,y)$

and thus that (9.3.20) does indeed hold for all (nonzero) vectors, $x, y \in \mathbb{R}^n$. This in turn implies that the squared correlation is simply the square of this cosine:

(9.3.25)  $r^2(x,y) = \cos^2[\theta(Dx, Dy)]$

So in our case, if we now let $\hat{y}$ denote the predicted value of data vector, $y$, for any given model (whatsoever), then it follows at once that

(9.3.26)  $r^2(y, \hat{y}) = \cos^2[\theta(Dy, D\hat{y})]$
This geometric view of squared correlation helps to clarify the exact sense in which it constitutes a robust goodness-of-fit measure. In particular, it yields a measure of “similarity” between $y$ and $\hat{y}$ which is completely independent of the measurement units employed. Indeed, this was already shown in the arguments of (9.3.16) through (9.3.18) above, where shifts of measurement origins were seen to be removed by the deviation matrix, $D$, and where scale transformations were removed by the ratio form of squared correlation itself. Even more important is the fact that since $\cos^2(\theta)$ is close to one if and only if $\theta$ is close to $0$ (or $\pi$), the identity in (9.3.26) shows that $r^2(y, \hat{y})$ is close to one if and only if the vectors, $Dy$ and $D\hat{y}$, point in almost the same (or opposite) directions. Algebraically, this implies they are almost exact linear multiples of one another, i.e., that $D\hat{y} \approx \lambda Dy$ for some nonzero scalar, $\lambda$. In practical terms, this means that the relative sizes of all deviation components must be approximately the same, so that if $\bar{\hat{y}}$ denotes the sample mean of $\hat{y}$, then

(9.3.27)  $\dfrac{\hat{y}_i - \bar{\hat{y}}}{\hat{y}_j - \bar{\hat{y}}} \approx \dfrac{y_i - \bar{y}}{y_j - \bar{y}}\,, \quad i \ne j$

Thus large (or small) deviations from the mean in components of $y$ are reflected by comparably large (or small) deviations from the mean in components of $\hat{y}$. This shows exactly the sense in which the prediction, $\hat{y}$, is deemed to be similar to the data, $y$, when $r^2(y, \hat{y}) \approx 1$.
Recall that our basic strategy for estimating model coefficients, $(\beta, \sigma^2, \rho)$, was to find values $(\hat{\beta}, \hat{\sigma}^2, \hat{\rho})$ that maximized the likelihood of observed data, $y$, given explanatory data values, $X$. This suggests that a natural measure of fit should be provided by the maximum (log) likelihood value, $L(\hat{\beta}, \hat{\sigma}^2, \hat{\rho} \,|\, y, X)$, obtained. One difficulty here is that since likelihood values themselves are probability density values, and not probabilities, any direct interpretation of such values is tenuous at best. But the ratios of these values for different models might still provide meaningful comparisons in terms of the limiting probability-ratio arguments used in expressions (7.1.1) and (7.1.4) above.
However, there is a second more serious difficulty with likelihood values that is
reminiscent of R-squared values. Recall from the argument in expressions (9.1.31)
through (9.1.34) that R-squared essentially always increases when new explanatory
variables are added to the model. In fact, that argument really shows that the increase in
R-squared results from the addition of new beta parameters. But this argument is far
more general, and in fact shows that maximum values of functions are never decreased
when more parameters are added. In particular, if we consider the case of two likelihood
functions, say $L^{(k)}(\theta_1,..,\theta_k \,|\, y, X)$ and $L^{(k+1)}(\theta_1,..,\theta_k, \theta_{k+1} \,|\, y, X)$, where the first is simply a special case of the second with $\theta_{k+1} = 0$, i.e., with

(9.4.1)  $L^{(k)}(\theta_1,..,\theta_k \,|\, y, X) = L^{(k+1)}(\theta_1,..,\theta_k, 0 \,|\, y, X)$

then it follows that

(9.4.2)  $\max_{(\theta_1,..,\theta_k)} L^{(k)}(\theta_1,..,\theta_k \,|\, y, X) \;\le\; \max_{(\theta_1,..,\theta_k,\theta_{k+1})} L^{(k+1)}(\theta_1,..,\theta_k,\theta_{k+1} \,|\, y, X)$

with strict inequality almost always holding. What this means for our purposes is that
log likelihood functions suffer from exactly the same “inflation problem” as R-squared
whenever new parameters are added. So if one attempts to compare the goodness of fit
between models that are “nested” in the sense of (9.4.1), [i.e., where one is a special case
of the other with certain parameters set to zero (or otherwise constrained in value)], then
the larger model will always yield a better fit in terms of maximum-likelihood values.
This observation suggests that such likelihood comparisons must somehow be penalized
in terms of the numbers of parameters in a manner analogous to adjusted R-squared. If
we again let $L(\hat{\theta} \,|\, y)$ denote a general log likelihood function evaluated at its maximum value, then the simplest of these penalized versions is Akaike's Information Criterion (AIC):

(9.4.3)  $AIC = -2L(\hat{\theta} \,|\, y) + 2K$

where $K$ now denotes the dimension of $\hat{\theta}$, i.e., the number of parameters being estimated [and where the factor “2” in AIC, as well as in the other measures to be developed, relates to the form of the log likelihood ratio statistic in expression (10.1.7) below]. For both SEM and SLM with parameters, $\hat{\theta} = (\hat{\beta}_0, \hat{\beta}_1,.., \hat{\beta}_k, \hat{\sigma}^2, \hat{\rho})$, this implies in particular that $K = (k+1) + 2 = k + 3$. This measure is discussed in detail by Burnham and Anderson (2002), where AIC is both defined (p.61) and later derived (Section 7.2). In addition, these authors recommend a “corrected” version of AIC (p.66) for sample sizes that are small relative to the number of parameters ($n/K < 40$). This is usually designated as corrected AIC (AICc) and can be written in terms of (9.4.3) as

(9.4.4)  $AICc = AIC + \dfrac{2K(K+1)}{n - (K+1)}$
A related criterion with a heavier penalty for model size is the Bayesian Information Criterion (BIC),

(9.4.5)  $BIC = -2L(\hat{\theta} \,|\, y) + K\log(n)$

While this measure is also developed in Burnham and Anderson (2002, Section 6.4.1), a more lucid derivation can be found in Raftery (1995, Section 4.1). Given its heavier penalization term for model size, $K$ [when $\log(n) > 2$], this measure is well known to favor smaller models (i.e., with fewer parameters) than AIC in terms of goodness of fit.
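All three criteria are immediate to compute from the maximized log likelihood; a MATLAB sketch (the log likelihood value below is a stand-in for the value reported by sem.m or slm.m):

    % Penalized likelihood measures for model comparison
    L = -43.2;   % stand-in maximized log likelihood, L(theta_hat | y)
    n = 26;      % sample size
    K = 4;       % number of estimated parameters (k+3 for SEM/SLM with k = 1)
    AIC  = -2*L + 2*K;                  % Akaike Information Criterion, (9.4.3)
    AICc = AIC + 2*K*(K+1)/(n-(K+1));   % corrected AIC, (9.4.4)
    BIC  = -2*L + K*log(n);             % Bayesian Information Criterion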
Finally, it should be noted that when comparing SEM and SLM for a given specification of $k$ explanatory variables, all such measures will differ only in terms of their corresponding maximum-likelihood values, $L(\hat{\theta} \,|\, y)$, for these two models. So in the present case of Eire, where Figure 7.7 shows that the maximized log likelihood for SLM is larger than that for SEM, it is clear that SLM must continue to yield a better fit than SEM with respect to all of these criteria.
While the notion of relative likelihood values for different models is somewhat difficult
to interpret directly (as mentioned above), such likelihood ratios can in many cases
provide powerful test statistics for comparing models. In particular, when two models are
“nested” in the sense of expression (9.4.1) above, it turns out that the asymptotic
distribution of such ratios can be obtained under the (null) hypothesis that the simpler
model is the true model. To develop such tests, we begin in Section 10.1 below with a
simple one-parameter example where the general ideas to be developed can be given an
exact form.
Here we revisit the example in Section 8.1 of estimating the mean of a normal random variable, $Y \sim N(\mu, \sigma^2)$, with known variance, $\sigma^2$, given a sample, $y = (y_1,.., y_n)$, of size $n$. The relevant likelihood function is then given by expression (8.1.1) as

(10.1.1)  $L(\mu) = L_n(\mu \,|\, y, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2$

But rather than simply estimating $\mu$, suppose that we now want to test whether $\mu = 0$, or more generally to test the null hypothesis, $H_0\!: \mu = \mu_0$, for some specified value, $\mu_0$. Then under $H_0$ the likelihood value in (10.1.1) becomes:

(10.1.2)  $L(\mu_0) = L_n(\mu_0 \,|\, y, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu_0)^2$
As shown in Figure 10.1 below, it seems reasonable to argue that the likelihood of $\mu_0$ relative to the maximum likelihood at $\hat{\mu}_n$ should provide some indication of the strength

[Figure 10.1: the log likelihood function, $L_n(\mu)$, with its maximum value, $L(\hat{\mu}_n)$, at $\hat{\mu}_n$ compared to the value, $L(\mu_0)$, at $\mu_0$.]
of evidence in sample $y$ for (or against) hypothesis $H_0$. In terms of log likelihoods, such relations are expressed in terms of the difference between $L(\hat{\mu}_n)$ and $L(\mu_0)$. But following standard conventions, we here refer to such log-differences as likelihood ratios. Moreover, since $L(\hat{\mu}_n) \ge L(\mu_0)$ by definition, it is natural to focus on the nonnegative difference, $L(\hat{\mu}_n) - L(\mu_0)$. If the distribution of $L(\hat{\mu}_n) - L(\mu_0)$ can be determined under $H_0$, then this statistic can be used to test $H_0$. In particular, if $L(\hat{\mu}_n) - L(\mu_0)$ is “sufficiently large”, then this should provide statistical grounds for rejecting $H_0$. With this in mind, observe that by canceling the common terms in the log likelihood expressions, and recalling that $\hat{\mu}_n = \bar{y}_n$, we see that this likelihood ratio can be written as

(10.1.3)  $L(\hat{\mu}_n) - L(\mu_0) = -\frac{1}{2\sigma^2}\left[\sum_{i=1}^n (y_i - \hat{\mu}_n)^2 - \sum_{i=1}^n (y_i - \mu_0)^2\right]$
$= -\frac{1}{2\sigma^2}\sum_{i=1}^n \left[(y_i - \bar{y}_n)^2 - (y_i - \mu_0)^2\right]$
$= \frac{n}{2\sigma^2}\left(\bar{y}_n^2 - 2\mu_0\bar{y}_n + \mu_0^2\right) = \frac{n}{2\sigma^2}(\bar{y}_n - \mu_0)^2$

so that multiplying through by 2 yields

(10.1.4)  $2[L(\hat{\mu}_n) - L(\mu_0)] = \left(\dfrac{\bar{y}_n - \mu_0}{\sigma/\sqrt{n}}\right)^2$
But under the null hypothesis, $H_0$, the standardized mean in (10.1.4) is standard normal:

(10.1.5)  $\dfrac{\bar{y}_n - \mu_0}{\sigma/\sqrt{n}} \sim N(0,1)$

So the right-hand side of (10.1.4) is distributed as the square of a standard normal variate, which is known to have a chi square distribution, $\chi_1^2$, with one degree of freedom, i.e.,

(10.1.6)  $\left(\dfrac{\bar{y}_n - \mu_0}{\sigma/\sqrt{n}}\right)^2 \sim \chi_1^2$

[Figure: plot of the $\chi_1^2$ density.]
where the density of $\chi_1^2$ is shown in the figure above. So we may conclude that this likelihood-ratio statistic is chi-square distributed (up to a factor of 2) as:

(10.1.7)  $2[L(\hat{\mu}_n) - L(\mu_0)] \sim \chi_1^2$

[As mentioned in Section 9, this factor of 2 is closely related to the same factor appearing in the penalized likelihood functions developed there.]
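The exact result in (10.1.7) is easy to confirm by simulation; a minimal MATLAB sketch (assuming the Statistics Toolbox function chi2cdf):

    % Simulate the likelihood-ratio statistic 2[L(mu_hat) - L(mu_0)] under H0
    n = 25; mu0 = 0; sigma = 1; reps = 10000;
    LR = zeros(reps,1);
    for r = 1:reps
        y = mu0 + sigma*randn(n,1);            % sample generated under H0
        LR(r) = n*(mean(y) - mu0)^2/sigma^2;   % statistic (10.1.4)
    end
    % Simulated rejection rate at the 5% chi-square(1) critical value (3.841):
    disp([mean(LR > 3.841), 1 - chi2cdf(3.841,1)])  % both near 0.05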
Note that we are implicitly comparing two models here, one with a single free parameter ($\mu$) and the other a “nested” special case where $\mu$ has been assigned a specific value, $\mu_0$ (typically, $\mu_0 = 0$). But the same likelihood-ratio procedure can be used for much more general comparisons between a “full” model and some special case, denoted as the “restricted” model. Here we simply summarize the main result. Suppose that the full model is represented by a log likelihood function, $L(\theta \,|\, y)$, with parameter vector, $\theta = (\theta_1,..,\theta_K)$, and that the restricted model is defined by imposing a set of $m \le K$ restrictions on these parameters that are representable by a vector, $g = (g_j : j = 1,..,m)$, of (smooth) functions as relations of the form,

(10.1.8)  $g_j(\theta) = 0\,, \quad j = 1,..,m$

If $\hat{\theta}$ and $\hat{\theta}_g$ denote the maximum-likelihood estimates for the full and restricted models, respectively, then it again follows that the relevant likelihood-ratio statistic, $L(\hat{\theta} \,|\, y) - L(\hat{\theta}_g \,|\, y)$, is nonnegative. In this more general setting, if it is hypothesized that the restricted model is true (i.e., that the true value of $\theta$ satisfies restrictions, $g$), then under this null hypothesis it can be shown¹ that $L(\hat{\theta} \,|\, y) - L(\hat{\theta}_g \,|\, y)$ is now asymptotically chi square distributed (up to a factor of 2) with degrees of freedom, $m$, equal to the number of restrictions defined by $g$:

(10.1.9)  $2[L(\hat{\theta} \,|\, y) - L(\hat{\theta}_g \,|\, y)] \sim \chi_m^2$
¹ This result, known as Wilks' Theorem, is developed, for example, in Section 3.9 of the online Lecture Notes in Mathematical Statistics (2003) by R.M. Dudley at MIT (http://ocw.mit.edu/courses/mathematics/18-466-mathematical-statistics-spring-2003/lecture-notes/).
This family of likelihood-ratio tests provides a general framework for comparing a wide variety of “nested” models. Moreover, as in the one-parameter case of (10.1.7) above, the basic intuition is essentially the same for all such tests. In particular, since the full maximum likelihood, $L(\hat{\theta} \,|\, y)$, is almost surely larger than the restricted maximum likelihood, $L(\hat{\theta}_g \,|\, y)$, the only question is whether it is “significantly larger”. If so, then it can be argued that the restricted model should be rejected on these grounds. If not, then this suggests that the full model adds little in the way of statistical substance, and thus (by Occam's razor) that the simpler restricted model should be preferred. For example, in the OLS case above, the key question is whether a given parameter, such as $\beta_1$, is significantly different from zero (all else being equal). If so, then this indicates that the larger model including variable, $x_1$, yields a better predictor of $y$ than the same model without $x_1$.² In the following sections, we shall employ this strategy to compare the SE-model and SL-model from a number of perspectives.
Here we begin by observing that since SEM and SLM are “non-nested” models in the sense that neither is a special case of the other, it is not possible to compare them directly in terms of likelihood-ratio tests. But since OLS is precisely the “$\rho = 0$” case of each model, both SEM and SLM can be compared with OLS in terms of such tests. Thus, by using OLS as a “benchmark” model, we can construct an indirect comparison of SEM and SLM. For example, if the improvement in likelihood of SEM over OLS is much greater than that of SLM over OLS for a given data set, $(y, X)$, then in this sense it can be argued that SEM provides a better model of $(y, X)$ than does SLM.
To operationalize such comparisons, we start with SEM and for a given data set, $(y, X)$, let $(\hat{\beta}_{SEM}, \hat{\sigma}^2_{SEM}, \hat{\rho}_{SEM})$ denote the maximum likelihood estimates obtained using the SEM likelihood function, $L(\beta, \sigma^2, \rho \,|\, y, X)$, in (7.3.4) above [as in expressions (7.3.10) through (7.3.12)]. Then the corresponding SEM maximum-likelihood value can be denoted by:

(10.2.1)  $\hat{L}_{SEM} = L(\hat{\beta}_{SEM}, \hat{\sigma}^2_{SEM}, \hat{\rho}_{SEM} \,|\, y, X)$

with the corresponding OLS maximum-likelihood value, $\hat{L}_{OLS}$, defined in the same way from the OLS likelihood function in (7.2.4).
² One may ask how this likelihood-ratio test in the OLS case relates to the standard (Wald) tests of significance, such as in expression (8.4.12) above (with null value, $0$). Here it can be shown [as for example in Section 13.4 of Davidson and MacKinnon (1993)] that these tests are asymptotically equivalent.
Finally, since the likelihood function in (7.2.4) is clearly the special case of (7.3.4) with $\lambda = 0$ [or more precisely, with $g_1(\beta, \sigma^2, \lambda) = \lambda$ in (10.1.8)], it follows from the general discussion above that under the null hypothesis, $\lambda = 0$, it must be true that the likelihood ratio, $LR_{SEM/OLS} = 2[\hat{L}_{SEM} - \hat{L}_{OLS}]$, is distributed as chi square with one degree of freedom, i.e., that

(10.2.3) $LR_{SEM/OLS} \sim \chi^2_1$

Then in the same manner as (10.2.3), it follows that under the null hypothesis that $\rho = 0$ for SLM, we also have

$LR_{SLM/OLS} = 2[\hat{L}_{SLM} - \hat{L}_{OLS}] \sim \chi^2_1$

For the Eire case, these two likelihood ratios and their associated p-values are reported in Figure 7.7.
So, for example, if OLS were the correct model, then the chance of obtaining a likelihood ratio, $LR_{SLM/OLS}$, as large as 15.803 would be less than 7 in 100,000. Moreover, while the p-value for $LR_{SEM/OLS}$ is also quite small, it is less significant than that for SLM. Thus a comparison of these p-values provides at least indirect evidence that SLM is more appropriate than SEM for this Eire data.

But given the indirect nature of this comparison, it is natural to ask whether there are any more direct comparisons. One possibility is developed below, which will be seen to be especially appropriate for the case of row-normalized spatial weight matrices.
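To make these comparisons concrete, the following minimal MATLAB sketch computes such likelihood ratios and their p-values (using chi2cdf from the Statistics Toolbox). The log-likelihood values shown here are hypothetical placeholders for the maximized values $\hat{L}_{OLS}$, $\hat{L}_{SEM}$, and $\hat{L}_{SLM}$ produced by the estimation procedures above.

   % Hypothetical maximized log-likelihood values (placeholders only):
   L_ols = -310.2;  L_sem = -305.8;  L_slm = -302.3;
   LR_sem = 2*(L_sem - L_ols);        % likelihood ratio, SEM versus OLS
   LR_slm = 2*(L_slm - L_ols);        % likelihood ratio, SLM versus OLS
   p_sem  = 1 - chi2cdf(LR_sem, 1);   % p-value under the chi-square(1) null
   p_slm  = 1 - chi2cdf(LR_slm, 1);
   fprintf('SEM/OLS: LR = %6.3f, p = %.5f\n', LR_sem, p_sem);
   fprintf('SLM/OLS: LR = %6.3f, p = %.5f\n', LR_slm, p_slm);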
Here we start by recalling from Section 6.3.2 that if $X$ and $\beta$ are partitioned as $X = [1_n, X_v]$ and $\beta = (\beta_0, \beta_v')'$, respectively, then an alternative modeling form is provided by the Spatial Durbin model (SDM),

(10.3.1) $Y = \rho WY + X\beta + WX_v\gamma + \varepsilon$

But this model can be viewed as a special case of the SLM model in the following way. If we group terms in (10.3.1) by letting $X_{SDM} = [1_n, X_v, WX_v]$ and $\beta_{SDM} = (\beta_0, \beta_v', \gamma')'$, so that

(10.3.2) $X_{SDM}\,\beta_{SDM} = [1_n,\, X_v,\, WX_v]\begin{pmatrix} \beta_0 \\ \beta_v \\ \gamma \end{pmatrix} = \beta_0 1_n + X_v\beta_v + WX_v\gamma$

then (10.3.1) can be rewritten as,

(10.3.3) $Y = \rho WY + X_{SDM}\,\beta_{SDM} + \varepsilon$
Moreover, if $W$ is row normalized, then SEM can in turn be viewed as a special case of SDM. To see this, observe first that the reduced form of SEM in expression (6.1.9) can be expanded and rewritten as follows:

(10.3.4) $Y = X\beta + (I_n - \lambda W)^{-1}\varepsilon$
$\;\Rightarrow\; (I_n - \lambda W)Y = (I_n - \lambda W)X\beta + \varepsilon$
$\;\Rightarrow\; Y = \lambda WY + (X - \lambda WX)\beta + \varepsilon$
$\;\Rightarrow\; Y = \lambda WY + X\beta - \lambda WX\beta + \varepsilon$
In particular, for row-normalized $W$ (so that $W1_n = 1_n$), the last line of (10.3.4) is seen to be an instance of the SDM form (10.3.3), with $\rho = \lambda$ and with the $WX_v$ coefficients restricted to

(10.3.7) $\gamma = -\rho\,\beta_v$

This "common factor" restriction can equivalently be stated as the hypothesis,

(10.3.8) $H_{CF}\!: \gamma + \rho\,\beta_v = 0$

Under this hypothesis, it follows that SEM is formally a restriction of SDM in the sense of expression (10.1.8), where the relevant vector, $g$, of restriction functions is now given by $g(\rho, \beta_0, \beta_v, \gamma, \sigma^2) = \gamma + \rho\,\beta_v$. The number of restrictions (i.e., the dimension of $g$) is here simply the number of explanatory variables, $k$. Given this relationship, one can then employ likelihood-ratio methods to test the appropriateness of SDM versus SEM. To do so for any given data set, $(y, X)$, we now let $(\hat\beta_{SDM}, \hat\sigma^2_{SDM}, \hat\rho_{SDM})$ denote the maximum-likelihood estimates obtained by applying the SLM likelihood function, $L(\beta_{SDM}, \sigma^2, \rho \mid y, X)$, in (7.4.2) to the SLM form of SDM in (10.3.3) above. In these terms, the resulting SDM maximum-likelihood value is then given by:

$\hat{L}_{SDM} = L(\hat\beta_{SDM}, \hat\sigma^2_{SDM}, \hat\rho_{SDM} \mid y, X)$

so that under the null hypothesis of SEM [i.e., under $H_{CF}$ in (10.3.8)], the likelihood ratio, $LR_{SDM/SEM} = 2[\hat{L}_{SDM} - \hat{L}_{SEM}]$, must be asymptotically distributed as $\chi^2_k$, where again, $k$, is the number of explanatory variables in SEM. The results of this comparative test are part of the SEM output, denoted by Com-LR. For the case of Eire, the result reported in Figure 7.7 shows that SDM fits this Blood Group data far better than SEM. This can largely be explained by noting from (10.3.2) and (10.3.3) that the reduced form of the SDM model is given by

$Y = (I_n - \rho W)^{-1}(X_{SDM}\,\beta_{SDM} + \varepsilon) = B_\rho^{-1}X_{SDM}\,\beta_{SDM} + B_\rho^{-1}\varepsilon\,, \qquad B_\rho = I_n - \rho W$
and thus contains the rippled Pale term, $B_\rho^{-1}X_v\beta_v = (B_\rho^{-1}x)\beta_1$, which was shown to yield a striking fit to this data. So a strong result is not surprising in this case.
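In computational terms, this test requires nothing more than applying an SLM estimation procedure to an expanded regressor matrix. The following MATLAB sketch assumes that the data, $y$, $X_v$, and $W$, are in memory, and that an SLM estimator (such as sar.m in the LeSage toolbox used in this course) is available; the output field names shown are assumptions and may vary across toolbox versions.

   n = size(Xv,1);                 % Xv = matrix of explanatory variables
   X_sdm = [ones(n,1), Xv, W*Xv];  % X_SDM = [1_n, X_v, W X_v] as in (10.3.2)
   % Estimate SDM by applying an SLM routine to (y, X_sdm, W) and extract
   % its maximized log-likelihood, L_sdm (hypothetical field name):
   % results = sar(y, X_sdm, W);   L_sdm = results.lik;
   % LR = 2*(L_sdm - L_sem);             % Com-LR statistic
   % p  = 1 - chi2cdf(LR, size(Xv,2));   % chi-square test with k d.f.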
Finally, it should be noted that while the above analysis has focused on row-normalized weight matrices in order to interpret the "SLM version" of SEM as a Spatial Durbin model, this restriction can in principle be relaxed. In particular, when $W1_n \ne 1_n$, it is possible to treat the vector, $W1_n$, as representing the sample values of an additional "explanatory variable" and thus modify (10.3.2) to

(10.3.13) $X_{SDM}\,\beta_{SDM} = [1_n,\, X_v,\, W1_n,\, WX_v]\begin{pmatrix} \beta_0 \\ \beta_v \\ \gamma_0 \\ \gamma \end{pmatrix} = \beta_0 1_n + X_v\beta_v + \gamma_0\,W1_n + WX_v\gamma$

With this addition, SEM can still be viewed formally as an instance of SLM. Moreover, if the additional restriction, $\gamma_0 + \rho\,\beta_0 = 0$, is added to yield a set of $k + 1$ restrictions, then this new likelihood ratio must now be distributed as $\chi^2_{k+1}$ under the null hypothesis of SEM. So while the problematic nature of this artificial "explanatory variable" complicates the interpretation of the resulting test, it can still be argued that the presence of the spatial lag term, $WY$, suggests that SLM may yield a better fit to the given data than SEM.
A final method of comparing SEM and SLM is provided by the combined model (CM) developed in Section 6.3.1 above, which for any given spatial weights matrix, $W$, can be written as [see also expression (6.3.3)]:

(10.4.1) $Y = \rho WY + X\beta + \varepsilon\,, \qquad \varepsilon = \lambda W\varepsilon + u$

Here it is clear that SEM is the special case with $\rho = 0$, and SLM is the special case with $\lambda = 0$. So these two models are seen to lie "between" OLS and the Combined Model, as in Figure 10.2 below:

[Figure 10.2. Nesting relations: OLS is nested in both SEM and SLM, each of which is in turn nested in the Combined Model]
In the same way that OLS served as a "lower" benchmark for comparing SEM and SLM, the Combined Model can thus serve as an "upper" benchmark. Here the only issue is how to estimate this more complex model. To do so, we start by observing from (6.3.4) that the reduced form of this model can be written as:

(10.4.2) $Y = X_\rho\beta + (I_n - \rho W)^{-1}(I_n - \lambda W)^{-1}u$

where

(10.4.3) $X_\rho = (I_n - \rho W)^{-1}X$

So it should be clear that this model is simply another instance of GLS, where in this case the conditioning is on the pair of spatial dependence parameters, $\rho$ and $\lambda$. So for the parameter vector, $(\beta, \sigma^2, \rho, \lambda)$, the corresponding likelihood function takes the form of a GLS log-likelihood, $L(\beta, \sigma^2, \rho, \lambda \mid y, X)$, conditioned on $(\rho, \lambda)$. In particular, for each given pair, $(\rho, \lambda)$, the maximizing values of $\beta$ and $\sigma^2$ again have closed-form GLS solutions, with $\hat\beta_{\rho\lambda}$ as in (10.4.7) and

(10.4.8) $\hat\sigma^2_{\rho\lambda} = \tfrac{1}{n}\,(y - X_\rho\hat\beta_{\rho\lambda})'\,V^{-1}(y - X_\rho\hat\beta_{\rho\lambda})$

where $V$ denotes the covariance matrix (up to the factor $\sigma^2$) of the reduced-form errors in (10.4.2). By substituting (10.4.7) and (10.4.8) into (10.4.6), we may then obtain a concentrated likelihood function for $\rho$ and $\lambda$, which can be maximized numerically, as for example with the MATLAB routines in:

>> sys502/Matlab/Lesage_7/spatial/sac_models
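To illustrate what such routines do, the following MATLAB sketch profiles out $\beta$ and $\sigma^2$ for each $(\rho, \lambda)$ pair and then maximizes the resulting concentrated log-likelihood by a coarse grid search. This is only a simple stand-in for the numerical optimizers actually used; the data names $y$, $X$, $W$ are assumed to be in memory, the grid bounds are arbitrary, and $W$ is assumed scaled so that the determinants stay positive on this grid.

   % Coarse grid search for the combined model, given data y, X and weights W.
   n = length(y);  grid = -0.9:0.05:0.9;  bestL = -Inf;
   for rho = grid
       for lam = grid
           Br = eye(n) - rho*W;  Bl = eye(n) - lam*W;
           ys = Bl*(Br*y);  Xs = Bl*X;        % spatially filtered data
           b  = Xs\ys;                        % profiled GLS estimate of beta
           e  = ys - Xs*b;  s2 = (e'*e)/n;    % profiled estimate of sigma^2
           L  = -(n/2)*log(2*pi*s2) + log(abs(det(Br))) + log(abs(det(Bl))) - n/2;
           if L > bestL
               bestL = L;  rho_hat = rho;  lam_hat = lam;
           end
       end
   end
   fprintf('rho = %.2f, lambda = %.2f, max logL = %.3f\n', rho_hat, lam_hat, bestL);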
While the parameter estimates, $\hat\rho$ and $\hat\lambda$, obtained by this procedure often tend to be collinear (in view of their common role in modifying the same weight matrix, $W$), the corresponding maximum-likelihood value,

$\hat{L}_{CM} = L(\hat\beta_{CM}, \hat\sigma^2_{CM}, \hat\rho_{CM}, \hat\lambda_{CM} \mid y, X)$

continues to be well defined and numerically stable. This value can thus be used to test the relative goodness of fit of the two restricted models, SEM and SLM. In particular, it follows by the same arguments as above that under the SEM null hypothesis ($\rho = 0$) we have

$LR_{CM/SEM} = 2[\hat{L}_{CM} - \hat{L}_{SEM}] \sim \chi^2_1$

and similarly, under the SLM null hypothesis ($\lambda = 0$),

$LR_{CM/SLM} = 2[\hat{L}_{CM} - \hat{L}_{SLM}] \sim \chi^2_1$

For the Eire case, these two tests show that the Combined Model yields a significantly better fit than SEM, but not than SLM. So relative to this CM benchmark, it can again be concluded that SLM yields a better fit to the Eire data than does SEM.
The ultimate objective of this section of the appendix is to develop the Spectral Decomposition Theorem for symmetric matrices, which illuminates many of the most important properties of covariance matrices. But to gain an intuitive understanding of this result, it is important to understand the geometry of linear transformations as represented by matrices. A transformation, $T$, on $\mathbb{R}^n$ is simply a mapping that assigns to every vector, $x \in \mathbb{R}^n$, some other vector, $T(x) \in \mathbb{R}^n$, called the image of $x$ under $T$. A transformation, $T$, is linear if and only if (iff) it preserves linear combinations, i.e., iff for each pair of vectors, $x, y \in \mathbb{R}^n$, and scalars, $\alpha, \beta$,

(A3.1.1) $T(\alpha x + \beta y) = \alpha\,T(x) + \beta\,T(y)$

The intimate connection between matrices and linear transformations is seen most readily in $\mathbb{R}^2$. If we let $e_1 = (1,0)' \in \mathbb{R}^2$ and $e_2 = (0,1)' \in \mathbb{R}^2$ denote the so-called identity basis vectors in $\mathbb{R}^2$ (shown in Figure A3.1 below)¹,
[Figures A3.1 and A3.2. The identity basis vectors, $e_1$ and $e_2$, and the basis representation of a point, $x = x_1e_1 + x_2e_2$]

then every vector, $x = (x_1, x_2)'$, has the basis representation,

(A3.1.2) $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = x_1\begin{pmatrix} 1 \\ 0 \end{pmatrix} + x_2\begin{pmatrix} 0 \\ 1 \end{pmatrix} = x_1e_1 + x_2e_2$
1
Note that we maintain the convention that all vectors are represented as column vectors, so that transpose
notation is used for all inline representations [as in expression (1.1.2) of Part II].
This basis representation of $x$, shown in Figure A3.2, implies from (A3.1.1) that the image of $x$ under any linear transformation, $T$, can be represented as

(A3.1.3) $T(x) = x_1\,T(e_1) + x_2\,T(e_2)$

So if we know where the identity basis vectors, $(e_1, e_2)$, are sent by $T$, then we can construct the entire transformation. In particular, if we now let
(A3.1.4) $T(e_1) = a_1 = \begin{pmatrix} a_{11} \\ a_{21} \end{pmatrix}, \qquad T(e_2) = a_2 = \begin{pmatrix} a_{12} \\ a_{22} \end{pmatrix}$

then it follows from (A3.1.3) that

(A3.1.5) $T(x) = x_1T(e_1) + x_2T(e_2) = x_1\begin{pmatrix} a_{11} \\ a_{21} \end{pmatrix} + x_2\begin{pmatrix} a_{12} \\ a_{22} \end{pmatrix} = \begin{pmatrix} a_{11}x_1 + a_{12}x_2 \\ a_{21}x_1 + a_{22}x_2 \end{pmatrix} = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = A\,x$

So every such transformation, $T$, is representable by the matrix of its basis image vectors,

(A3.1.6) $A = (a_1, a_2) = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$

as illustrated in Figures A3.3 and A3.4 below:
[Figure A3.3. Basis Image Vectors]   [Figure A3.4. General Image Vectors]
More generally, if the identity basis² in $\mathbb{R}^n$ is associated with the columns of the identity matrix, $I_n = (e_1, e_2,\dots,e_n)$, and if the images of these basis vectors under any linear transformation, $T$, are denoted by $T(e_i) = a_i = (a_{1i},\dots,a_{ni})' \in \mathbb{R}^n$, $i = 1,\dots,n$, then for all $x = (x_1,\dots,x_n)' \in \mathbb{R}^n$, $T$ again has the matrix representation

(A3.1.7) $T(x) = \sum_{i=1}^n x_i\,T(e_i) = \sum_{i=1}^n x_i\,a_i = \begin{pmatrix} \sum_{j=1}^n a_{1j}x_j \\ \vdots \\ \sum_{j=1}^n a_{nj}x_j \end{pmatrix} = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \end{pmatrix}\begin{pmatrix} x_1 \\ \vdots \\ x_n \end{pmatrix} = A\,x$
In the analysis to follow, we shall use the terms matrix and transformation interchangeably. Note also that it is this equivalence that motivates the basic multiplication rules of matrix algebra. So the meaning of these rules is often best understood in this way.
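As a small numerical illustration (a MATLAB sketch with an arbitrarily chosen matrix), the columns of $A$ are precisely the images of the identity basis vectors, and $Ax$ agrees with the linear combination in (A3.1.7):

   A  = [2 -1; 2 -1];       % arbitrary example matrix [reused in (A3.1.28) below]
   e1 = [1;0];  e2 = [0;1]; % identity basis vectors in R^2
   A*e1                     % = first column of A, the image of e1
   A*e2                     % = second column of A, the image of e2
   x = [1;2];
   [A*x, x(1)*(A*e1) + x(2)*(A*e2)]   % the two columns agree, as in (A3.1.7)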
To examine some of the more important matrix properties, we begin by observing that every matrix can be written in two equivalent ways. First there is a column representation of $A$,

(A3.1.8) $A = (a_1,\dots,a_n)$

where $a_i$ denotes the $i$th column of $A$. Second, there is a row representation,

(A3.1.9) $A = \begin{pmatrix} a_1' \\ \vdots \\ a_n' \end{pmatrix}$

where $a_i'$ denotes the $i$th row of $A$.³ This in turn implies that matrix products, $AB$, can be written in two ways:
² A fuller discussion of vector bases for linear spaces is given on page A3-16 below.
³ It is important to note, for example, that $a_1$ in (A3.1.9) is not the transpose of $a_1$ in (A3.1.10). To be more precise here, one could use the "dot" notation, $a_{\cdot j}$, for columns and $a_{i\cdot}$, for rows. However, we choose not to add this notational complexity, since the rows and columns of $A$ will generally be clear in context.
(A3.1.10) $AB = (a_1,\dots,a_n)\begin{pmatrix} b_1' \\ \vdots \\ b_n' \end{pmatrix} = \sum_{i=1}^n a_i\,b_i'$

and

(A3.1.11) $AB = \begin{pmatrix} a_1' \\ \vdots \\ a_n' \end{pmatrix}(b_1,\dots,b_n) = \big(a_i'b_j : i, j = 1,\dots,n\big)$
Both of these representations are very useful, and will be used throughout the analysis to follow. As one immediate application, it is important to note that for every matrix, $A = (a_{ij} : i, j = 1,\dots,n)$, the transpose matrix, $A' = (a_{ji} : i, j = 1,\dots,n)$, represents a linear transformation closely related to that of $A$. In particular, the rows of $A$ are the columns of $A'$. So from a transformation viewpoint, $A'$ represents the "row space" of $A$. Moreover, if for any matrices $A$ and $B$ we use the representations

(A3.1.12) $A = \begin{pmatrix} a_1' \\ \vdots \\ a_n' \end{pmatrix} \;\Rightarrow\; A' = (a_1,\dots,a_n)\,, \qquad B = (b_1,\dots,b_n) \;\Rightarrow\; B' = \begin{pmatrix} b_1' \\ \vdots \\ b_n' \end{pmatrix}$

then (A3.1.11) together with the identity, $a'b = b'a$, imply that,

(A3.1.13) $(AB)' = B'A'$

and hence that the transpose of a product, $AB$, is the product of their transposes in the reverse order.
Perhaps the single most important feature of a linear transformation is whether or not it has an "inverse". In particular, a linear transformation, $A$, is said to be nonsingular iff there exists another linear transformation, $A^{-1}$, called the inverse of $A$, such that

(A3.1.14) $A^{-1}A = I_n$

This inverse transformation can be equivalently defined by the requirement that for all $x, y \in \mathbb{R}^n$,

(A3.1.15) $A^{-1}y = x \;\Leftrightarrow\; Ax = y$
This version also shows that $AA^{-1} = I_n$. For if we let $X = (x_1,\dots,x_n)$ be defined by $Ax_i = e_i$, $i = 1,\dots,n$, so that $AX = I_n$, then by (A3.1.15), $A^{-1}e_i = x_i$, $i = 1,\dots,n$, implies that $A^{-1} = A^{-1}I_n = X$, and hence that $AA^{-1} = I_n$. Note also that since $A^{-1}$ is well defined as a transformation (i.e., $A^{-1}y$ is uniquely defined), it must be true that $A$ is a one-to-one transformation, i.e., for all $x_1, x_2 \in \mathbb{R}^n$,

(A3.1.16) $x_1 \ne x_2 \;\Rightarrow\; Ax_1 \ne Ax_2$

For if $Ax_1 = y = Ax_2$, then we would have $\{x_1, x_2\} \subseteq A^{-1}y$, so that $A^{-1}y$ is not uniquely defined. As an additional consequence of (A3.1.14), note that for any pair of nonsingular transformations, $A$ and $B$, we must have

$(B^{-1}A^{-1})(AB) = B^{-1}(A^{-1}A)B = B^{-1}B = I_n$

Since the same argument shows that $(AB)(B^{-1}A^{-1}) = I_n$, it then follows from (A3.1.14) that $AB$ must also be nonsingular, and in particular, has a well-defined inverse, $(AB)^{-1}$, given by

(A3.1.18) $(AB)^{-1} = B^{-1}A^{-1}$
In other words, the inverse of $A'$ is the transpose of $A^{-1}$ (so that the operations of taking transposes and inverses are said to commute). Finally, if for any set, $S \subseteq \mathbb{R}^n$, we denote its image under $A$ by

(A3.1.21) $A(S) = \{Ax : x \in S\}$

then every nonsingular transformation, $A$, is also onto in the sense that
(A3.1.22) $A(\mathbb{R}^n) = \mathbb{R}^n$

Since $A(\mathbb{R}^n) \subseteq \mathbb{R}^n$ by definition, (A3.1.22) follows from the observation that for any $x \in \mathbb{R}^n$, $A(A^{-1}x) = x \Rightarrow x \in A(\mathbb{R}^n)$, so that $\mathbb{R}^n \subseteq A(\mathbb{R}^n)$. In summary, every nonsingular transformation is both one-to-one and onto as a mapping.
We next observe that for all transformations, $A$, the full image set, $A(\mathbb{R}^n)$, is of special importance since it is always a linear subspace of $\mathbb{R}^n$, i.e., it is contained in $\mathbb{R}^n$ and is closed under linear combinations [$x, y \in A(\mathbb{R}^n) \Rightarrow \alpha x + \beta y \in A(\mathbb{R}^n)$ for all scalars, $\alpha, \beta$]. In particular, since $A(\mathbb{R}^n)$ contains all vectors that are expressible as linear combinations of the columns of $A = (a_1,\dots,a_n)$, it is said to be spanned by these columns, and is often written as:

(A3.1.23) $A(\mathbb{R}^n) = \mathrm{span}(A) = \{\textstyle\sum_{i=1}^n x_ia_i : x \in \mathbb{R}^n\}$

Recall also that a set of vectors, $z_1,\dots,z_k \in \mathbb{R}^n$, is said to be linearly independent iff the only linear combination yielding the zero vector⁴ is the trivial one, i.e., iff

(A3.1.24) $\textstyle\sum_{i=1}^k \alpha_iz_i = 0 \;\Rightarrow\; \alpha_i = 0\,, \; i = 1,\dots,k$
In these terms, a matrix, $A = (a_1,\dots,a_n)$, is nonsingular iff its columns, $\{a_1,\dots,a_n\}$, are linearly independent. So by replacing $z_i$ with $a_i$ and $\alpha_i$ with $x_i$ in this general definition, we can write this nonsingularity condition for $A$ in matrix form as follows. For all $x \in \mathbb{R}^n$,

(A3.1.25) $Ax = 0 \;\Rightarrow\; x = 0$
⁴ Note that for convenience we drop the subscript notation, $0_n = (0,\dots,0)'$, on the $n$-vector of zeros. The dimension of $0$ should always be clear in context. So in expression (A3.1.24), for example, the $0$ on the left is $n$-dimensional and the $0$'s on the right are scalars (one-dimensional).
In these notes, we shall deal almost exclusively with nonsingular transformations. But to understand the full scope of the matrix decomposition theorems to follow, it is important to consider all linear transformations on $\mathbb{R}^n$. In particular, those linear transformations, $A$, for which no inverse exists are said to be singular transformations. In terms of (A3.1.16) above, this means that there are distinct vectors, $x \ne y$, with $Ax = Ay$, so that the transformation $A^{-1}$ is not well defined. In view of linearity, this in turn implies that there is a nonzero vector, namely $x - y$, with $A(x - y) = Ax - Ay = 0$. This observation shows that the characterizing property of singular transformations, $A$, is that there is a nontrivial set of vectors mapped into zero by $A$. This set is designated as the null space of $A$, written as

(A3.1.26) $\mathrm{null}(A) = \{x \in \mathbb{R}^n : Ax = 0\}$

Note that $\mathrm{null}(A)$ is itself a linear subspace of $\mathbb{R}^n$, since for any scalars, $\alpha, \beta$,

(A3.1.27) $x, y \in \mathrm{null}(A) \;\Rightarrow\; Ax = 0 = Ay \;\Rightarrow\; A(\alpha x + \beta y) = \alpha\,Ax + \beta\,Ay = 0 \;\Rightarrow\; \alpha x + \beta y \in \mathrm{null}(A)$
For a nonsingular transformation this is trivially true, since $\mathrm{null}(A) = \{0\}$ by (A3.1.25). But for singular transformations, $\mathrm{null}(A)$ is a nontrivial linear subspace. In fact, the two linear spaces, $\mathrm{span}(A)$ and $\mathrm{null}(A)$, completely characterize most of the geometric features of every linear transformation. A simple example of a singular transformation, $A$, in $\mathbb{R}^2$ is given by

(A3.1.28) $A = \begin{pmatrix} 2 & -1 \\ 2 & -1 \end{pmatrix}$

where for the vector, $x = (1, 2)' \ne 0$, we see from (A3.1.28) that $Ax = 0$. Here $\mathrm{span}(A)$ and $\mathrm{null}(A)$ are shown in Figure A3.5 below.
[Figure A3.5. The spaces $\mathrm{span}(A)$ and $\mathrm{null}(A)$ for the singular transformation in (A3.1.28)]
The image vectors, $Ae_1 = (2, 2)'$ and $Ae_2 = (-1, -1)'$, are seen to be collinear, so that $\mathrm{span}(A)$ is reduced to a line, i.e., a one-dimensional subspace of $\mathbb{R}^2$. Similarly, the point $x$ above is also shown, and is seen to generate a one-dimensional subspace, $\mathrm{null}(A)$, which is collapsed into $0$ by $A$. [An example in $\mathbb{R}^3$ is given in Figure A3.20 below.] More generally, the dimensions of these two subspaces always add to $n$. To be more precise, for any linear subspace, $S \subseteq \mathbb{R}^n$, the dimension of $S$, denoted by $\dim(S)$, is the maximum number of linearly independent vectors in $S$. So by (A3.1.23), the dimension of $\mathrm{span}(A)$ must be the maximum number of linearly independent columns, $(a_1,\dots,a_n)$, of $A$. Moreover, by (A3.1.26) the dimension of $\mathrm{null}(A)$ must be the maximum number of linearly independent vectors mapped to zero by $A$. As seen in Figure A3.5,

$\dim[\mathrm{span}(A)] + \dim[\mathrm{null}(A)] = n$

where in this case, $n = 2$. It turns out that this is always true. Since its validity will be apparent from the Singular Value Decomposition Theorem below, we shall not offer a proof of this "rank-nullity" theorem here.⁵
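These dimensions are easy to check numerically; a MATLAB sketch using the singular example in (A3.1.28):

   A = [2 -1; 2 -1];            % the singular matrix in (A3.1.28)
   rank(A)                      % = 1 = dim[span(A)]
   null(A)                      % orthonormal basis for null(A), proportional to (1,2)'
   rank(A) + size(null(A),2)    % = 2 = n, illustrating the rank-nullity theorem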
For our later purposes, it is important to note that the maximum number of linearly independent columns of any matrix, $A$, is also called the rank of $A$, written as $\mathrm{rank}(A)$. When matrices are not square [as for example in the Linear Invariance Theorem for multi-normal random vectors, stated both in expression (3.2.22) of Part II and in expression (A3.2.121) below], it is useful to distinguish between the columns and rows of a matrix, $A$, by designating the column rank (row rank) of $A$ to be the maximum number of linearly independent columns (rows) of $A$. In these terms, matrix $A$ is said to be of full column rank (full row rank) iff all its columns (rows) are linearly independent, i.e., iff its column rank (row rank) is equal to the number of columns (rows) of $A$.
⁵ For an elegant online proof, see http://en.wikipedia.org/wiki/Rank%E2%80%93nullity_theorem.
In terms of linear transformations, the row rank of $A$ can also be viewed as the rank of the linear transformation represented by $A'$.
With this general discussion of linear transformations in hand, we now consider several specific types of transformations that will play a central role in the decomposition theorems to follow.
While there are many different types of linear transformations, it turns out that from a geometric viewpoint there are essentially only two basic transformation types. The first, and by far the simplest, are scale transformations, which simply rescale the identity basis vectors, as in Figures A3.6 and A3.7 below:

[Figure A3.6. A positive scale transformation]   [Figure A3.7. A scale transformation with a negative multiple]
Figure A3.6 represents a positive scale transformation in which all basis vectors are scaled by positive multiples. In many cases, such transformations result from simply changing the measurement units (dollars, meters, etc.) of the variables represented by each axis. However, some scale transformations may involve negative multiples, as in Figure A3.7. The matrix representations, $A_1$ and $A_2$, of these respective transformations are given by the diagonal matrices (with zeros omitted for visual clarity),

(A3.1.30) $A_1 = \begin{pmatrix} 2 & \\ & 3 \end{pmatrix}, \qquad A_2 = \begin{pmatrix} 2 & \\ & -3 \end{pmatrix}$

More generally, every scale transformation on $\mathbb{R}^n$ is representable by a diagonal matrix,

(A3.1.31) $A = \mathrm{diag}(a_{11},\dots,a_{nn}) = \begin{pmatrix} a_{11} & & \\ & \ddots & \\ & & a_{nn} \end{pmatrix}$
One convenient feature of scale transformations is that they commute with one another; for the example above,

(A3.1.32) $A_1A_2 = \begin{pmatrix} 4 & \\ & -9 \end{pmatrix} = A_2A_1$
The second important class of linear transformations is far richer, and in fact is given many different names, including isometric transformations, orthonormal transformations, and rigid motions. From a geometric viewpoint the term "isometric" is perhaps most appropriate, since these transformations preserve both distances and angles (as we shall see below). But from a matrix viewpoint, the term "orthonormal" is most useful, since it relates more directly to the corresponding matrix representations, $U = (u_1,\dots,u_n)$, of such transformations. In particular, if both distances and angles are preserved, then since the vectors in the identity basis, $I_n = (e_1,\dots,e_n)$, are mutually orthogonal and of unit length, it follows that their images,

$u_i = Ue_i\,, \quad i = 1,\dots,n$

under $U$ must necessarily have the same properties. More precisely, [recalling property (A2.4.4) in Appendix A2] it must be true that

(A3.1.34) $u_i'u_i = 1\,, \quad i = 1,\dots,n$

(A3.1.35) $u_i'u_j = 0\,, \quad i \ne j$

These defining conditions for orthonormality can be written in equivalent matrix form as

(A3.1.36) $U'U = I_n$
Note also from (A3.1.25) that this condition implies that $U$ must be nonsingular, since

(A3.1.37) $Ux = 0 \;\Rightarrow\; U'Ux = U'0 = 0 \;\Rightarrow\; I_nx = 0 \;\Rightarrow\; x = 0$

Hence, by observing from (A3.1.14) and (A3.1.36) that

(A3.1.38) $U' = U'(UU^{-1}) = (U'U)U^{-1} = U^{-1}$

we see that the inverse of $U$ is simply its transpose. This is an equivalent form of the defining condition in (A3.1.36), though the geometric argument above is far more intuitive. All geometric and algebraic properties of such transformations are in turn readily established from these equivalent conditions. The most immediate result is that all inner products must be preserved, since for any vectors, $x, y \in \mathbb{R}^n$,

(A3.1.39) $(Ux)'(Uy) = x'(U'U)y = x'I_ny = x'y$

This in turn implies that all distances (lengths) are preserved, since

(A3.1.40) $\|Ux\| = \sqrt{(Ux)'(Ux)} = \sqrt{x'x} = \|x\|$
Finally, if $\theta$ denotes the angle between any pair of vectors, $x$ and $y$, as in Figure A3.8 below,

[Figure A3.8. The angle, $\theta$, between vectors $x$ and $y$, in the triangle with side lengths $\|x\|$, $\|y\|$, and $\|y - x\|$]

then since the law of cosines shows that $\theta$ is completely determined by the side lengths of this triangle, it follows at once from (A3.1.40) that $U$ must also preserve angles. In other words, all geometric figures are mapped into congruent copies by $U$.
Moreover, such transformations are closed under composition: for any orthonormal matrices, $U_1$ and $U_2$, we have $(U_1U_2)'(U_1U_2) = U_2'(U_1'U_1)U_2 = U_2'U_2 = I_n$, so that the product, $U_1U_2$, is again orthonormal. This same argument obviously holds for any finite product, $U_1U_2\cdots U_n$.
Such orthonormal transformations can be further classified into rotations and reflections, as illustrated in $\mathbb{R}^2$ by Figures A3.9 and A3.10 below:

[Figure A3.9. A rotation in $\mathbb{R}^2$]   [Figure A3.10. A reflection in $\mathbb{R}^2$]
Another key difference from a practical viewpoint relates to the extendibility of these concepts to higher dimensions. In particular, while rotations are easily defined with respect to angles in $\mathbb{R}^2$, the extension of this definition to $\mathbb{R}^n$ is highly complex (to say the least). However, the extension of reflections is completely straightforward. For the case of $\mathbb{R}^3$, the reflection line in Figure A3.10 is simply replaced by a reflection plane through the origin. For example, the transformation, $U = [e_1, e_2, -e_3]$, is seen to reflect all points in $\mathbb{R}^3$ about the $(e_1, e_2)$ plane. More generally, every reflection in $\mathbb{R}^n$ is uniquely defined by an $(n-1)$-dimensional reflection hyperplane through the origin. In addition, such reflections can be given a unified matrix representation, as we now show.
Householder Reflections

In particular, if for any given nonzero vector, $v \in \mathbb{R}^n$, we let

(A3.1.43) $v^\perp = \{x \in \mathbb{R}^n : x'v = 0\}$

denote the orthogonal complement of $v$, then the reflection about this hyperplane through the origin is representable by the Householder matrix,

(A3.1.44) $H_v = I_n - \tfrac{2}{v'v}\,vv'$

To see this, note first that

$H_vv = v - \tfrac{2}{v'v}\,v(v'v) = v - 2v = -v$

so that the image of $v$ is precisely its reflection (through the origin) about $v^\perp$. Moreover, for any $x \in v^\perp$ it also follows that

$H_vx = x - \tfrac{2}{v'v}\,v(v'x) = x - 0 = x$

But since $H_v$ is completely defined by this set of images, it then follows that $H_v$ must be the unique reflection in $\mathbb{R}^n$ about $v^\perp$. This is shown graphically by the $\mathbb{R}^2$ example in Figure A3.11 below:

[Figure A3.11. Householder reflection about $v^\perp$, mapping $v$ into $H_vv = -v$ and a point $x$ into $H_vx$]
Finally, since every reflection has such a representation, it follows that all reflections are representable by Householder matrices, as in (A3.1.44). So all reflections are easily computable in $\mathbb{R}^n$.
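A small MATLAB sketch (with an arbitrarily chosen vector $v$) confirms both the defining images of such a reflection and its orthonormality:

   v = [1; -1];                        % arbitrary nonzero vector in R^2
   H = eye(2) - (2/(v'*v))*(v*v');     % Householder matrix, as in (A3.1.44)
   H*v                                 % = -v : v is reflected through the origin
   H*[1; 1]                            % = (1,1)' : vectors in v-perp are left fixed
   H'*H                                % = I_2 : H is orthonormal (and symmetric)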
From a geometric viewpoint, the importance of this fact is that all orthonormal transformations on $\mathbb{R}^n$ are constructible as compositions of (at most $n$) reflections. Alternatively phrased, every $n$-square orthonormal matrix is the product of at most $n$ Householder matrices. Since this fact will not actually be used in our subsequent analyses, we will not prove it here (see footnote 6 below). Rather, we simply illustrate this general result by showing how all rotations in $\mathbb{R}^2$ (such as in Figure A3.9) are equivalent to (at most) a pair of reflections in $\mathbb{R}^2$. For any given angle, $\theta$, let the corresponding (counterclockwise) rotation be denoted by, $R_\theta$, as in Figure A3.12.
[Figure A3.12. The rotation, $R_\theta$, mapping $e_1, e_2$ into $R_\theta e_1, R_\theta e_2$]
This is clearly not a reflection, and moreover cannot be equivalent to any single reflection, since a reflection would necessarily reverse the clockwise order of the basis vectors, $e_1$ and $e_2$, as mentioned above. But it can be represented as a composition of two reflections as follows. Choose the first (Householder) reflection, $H_1 = H_{v_1}$, by setting $v_1 = R_\theta e_1 - e_1$, and observe that by construction it reflects $R_\theta e_1$ back into $e_1$, as shown in Figure A3.13 below:
[Figure A3.13. The first reflection, $H_1$, mapping $R_\theta e_1$ into $e_1 = H_1R_\theta e_1$ and $R_\theta e_2$ into $H_1R_\theta e_2$]
Notice also that since every reflection is an orthonormal transformation, the image of $R_\theta e_2$ under $H_1$ must continue to be orthogonal to that of $R_\theta e_1$. But in two dimensions, there are only two possibilities (with unit length), namely $e_2$ and $-e_2$. In this case, $H_1R_\theta e_2 = -e_2$, as shown in the figure. Finally, this configuration is easily reflected back into $(e_1, e_2)$ by simply choosing $H_2 = H_{v_2}$ with generating vector, $v_2 = -e_2 - e_2 = -2e_2$, so that the orthogonal complement, $v_2^\perp$, in this case is simply the horizontal axis, as shown in Figure A3.14 below.

[Figure A3.14. The second reflection, $H_2$, about the horizontal axis, $v_2^\perp$, mapping $H_1R_\theta e_2 = -e_2$ into $e_2$ while leaving $e_1 = H_1R_\theta e_1$ fixed]
Finally, since all Householder matrices in (A3.1.44) are seen to be symmetric as well as orthonormal (so that each is its own inverse), and since by construction, $H_2H_1R_\theta = I_2$, we may then conclude that:

$R_\theta = (H_2H_1)^{-1} = H_1^{-1}H_2^{-1} = H_1H_2$

Hence, each such rotation is seen to be equivalent to this particular pair of reflections.⁶ So in this sense, Householder reflections can be regarded as the fundamental "generator" of all orthonormal transformations.
⁶ The proof of the general representation of orthonormal matrices by products of Householder matrices is surprisingly difficult to find in standard references. But one can easily show this by extending the standard Householder construction of QR decompositions (see for example the nice discussion by Tom Lyche available online at http://heim.ifi.uio.no/~tom/ortrans.pdf), which shows in particular that every orthonormal matrix, $U$, can be represented as, $U = H_1H_2\cdots H_nT$, for some choice of Householder matrices, $H_1, H_2,\dots,H_n$, together with an upper triangular matrix, $T$. By successive multiplications of this expression by $H_i$, $i = 1,\dots,n$, together with (A3.1.36), we obtain $T = H_n\cdots H_2H_1U$, which implies from (A3.1.20) that $T$ must also be orthonormal. Finally, since a simple inductive argument can be used to show that the only orthonormal triangular matrix is the identity matrix, $I_n$, it then follows that $U = H_1H_2\cdots H_n$.
One final aspect of orthonormality is important to consider. Recall that we have often referred to the (orthogonal) columns of the identity matrix, $I_n = (e_1,\dots,e_n)$, as the identity basis for $\mathbb{R}^n$. So before proceeding to the Singular Value Decomposition Theorem, it is appropriate to formalize the more general concept of orthonormal bases. First we extend the notion of $\mathrm{span}(A)$ in expression (A3.1.23) to any set of vectors, $z_1,\dots,z_k \in \mathbb{R}^n$, as follows:

$\mathrm{span}(z_1,\dots,z_k) = \{\textstyle\sum_{i=1}^k \alpha_iz_i : \alpha_1,\dots,\alpha_k \in \mathbb{R}\}$

In these terms, a set of linearly independent vectors, $(z_1,\dots,z_k)$, is said to form a basis for the linear subspace, $L = \mathrm{span}(z_1,\dots,z_k)$. The special feature of linear independence is that for each $x \in L$, the $\alpha$-coefficients in the representation, $x = \sum_{i=1}^k \alpha_iz_i$, must be unique.⁷ So in geometric terms, these coefficients, $(\alpha_1,\dots,\alpha_k)$, yield a natural coordinate system for $L$. Notice also that if $(z_1,\dots,z_k)$ is a basis for $L$, then no larger set, $(z_1,\dots,z_k,z_{k+1})$, can be a basis, since $z_{k+1} \in L$ implies that $z_{k+1}$ must already be a linear combination of $(z_1,\dots,z_k)$, which would violate linear independence. So the size, $k$, of each basis is a unique characteristic of $L$, designated as the dimension of $L$, and often written as $\dim(L)$.
The single most important example of these concepts is of course the identity basis, $(e_1,\dots,e_n)$, for $\mathbb{R}^n$ itself. But this basis has the important additional feature that its component vectors form an orthonormal set, i.e., they are each of unit length and are mutually orthogonal [as we have already seen for the columns of orthonormal matrices in (A3.1.34) and (A3.1.35) above]. Any basis with these properties is called an orthonormal basis. The key feature of such bases is that the coordinates of any vector, $x \in \mathrm{span}(z_1,\dots,z_k)$, are immediately constructible as inner products with the basis vectors, i.e., for each $i = 1,\dots,k$,

$x = \textstyle\sum_{i=1}^k \alpha_iz_i \;\Rightarrow\; \alpha_i = z_i'x$
⁷ To see this, note simply that if $\sum_{i=1}^k \alpha_iz_i = x = \sum_{i=1}^k \beta_iz_i$, then $\sum_{i=1}^k (\alpha_i - \beta_i)z_i = 0$, so that by linear independence, $\alpha_i = \beta_i$, $i = 1,\dots,k$.
This is why orthonormal bases provide such useful representations of linear spaces. So it is important to ask how such bases can be constructed. In particular, for any given set of vectors, $z_1,\dots,z_k \in \mathbb{R}^n$, we next consider how to construct an orthonormal basis for $\mathrm{span}(z_1,\dots,z_k)$. There is a remarkably simple procedure for doing so, known as the Gram-Schmidt orthogonalization procedure. Because the geometry of this procedure is of such fundamental importance, we begin by considering orthogonal projections. Given two vectors, $x, y \in \mathbb{R}^n$ (as illustrated for $n = 2$ in Figure A3.15 below), one may ask what vector in the span of $y$ is "closest" to $x$, or equivalently, "best approximates" $x$?
[Figure A3.15. The orthogonal projection, $y_x$, of $x$ onto $\mathrm{span}(y)$]
If one were to imagine drawing circles around $x$, denoting points of equal distance from $x$, then the smallest circle touching the line, $\mathrm{span}(y) = \{\alpha y : \alpha \in \mathbb{R}\}$, would be just tangent to this line, and would identify the desired closest point, $y_x$ (shown in red in the figure). Formally, this amounts to finding the $\alpha$ which minimizes the distance, $\|x - \alpha y\|$, from $x$. But since minimizing distance is equivalent to minimizing squared distance, it follows that if we now write $f(\alpha) = \|x - \alpha y\|^2$ as a function of $\alpha$, then we can identify this point by solving the "least squares" minimization problem:

$\min_\alpha f(\alpha) = \min_\alpha \|x - \alpha y\|^2 = \min_\alpha\,\big(x'x - 2\alpha\,x'y + \alpha^2\,y'y\big)$

Since the last equality is just a quadratic function in $\alpha$, the desired "tangency" is given precisely by the first-order condition:

(A3.1.52) $0 = \frac{d}{d\alpha}f(\alpha) = -2\,x'y + 2\alpha\,y'y \;\Rightarrow\; \alpha\,y'y = x'y \;\Rightarrow\; \alpha = \frac{x'y}{y'y}$
This solution in turn yields the desired projection,

(A3.1.53) $y_x = \left(\frac{x'y}{y'y}\right)y$

Moreover, the resulting "residual" vector, $x - y_x$, is orthogonal to $y$, since

(A3.1.54) $(x - y_x)'y = x'y - y_x'y = x'y - \left(\frac{x'y}{y'y}\right)y'y = x'y - x'y = 0$
So if one starts with two vectors, $(x, y)$, and wishes to construct an orthonormal basis for $\mathrm{span}(x, y)$, then this projection procedure yields a natural choice. In particular, since $x - y_x = x - (x'y/y'y)\,y$ is automatically a linear combination of $(x, y)$, it follows that $(y, x - y_x)$ yields a pair of orthogonal vectors in $\mathrm{span}(x, y)$. Hence, by normalizing these, we have found an orthonormal basis for $\mathrm{span}(x, y)$.

This argument implicitly assumes that $x$ and $y$ are linearly independent, so that the basis will consist of two orthonormal vectors. But notice also that if $x$ and $y$ were linearly dependent, so that $x$ was already in $\mathrm{span}(y)$ [i.e., $x = \alpha y \ne 0$ for some $\alpha$], then the solution in (A3.1.53) would automatically yield $y_x = x$, so that $x - y_x$ is simply the zero vector. In other words, this procedure would identify this linear dependence, and tell us that by normalizing only $y$ we would obtain a natural orthonormal basis for $\mathrm{span}(x, y) = \mathrm{span}(y)$.
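A short MATLAB sketch (with arbitrary vectors) confirms this projection geometry:

   x = [3; 1];  y = [2; 0];     % arbitrary vectors in R^2
   a  = (x'*y)/(y'*y);          % least-squares coefficient from (A3.1.52)
   yx = a*y;                    % projection of x onto span(y), as in (A3.1.53)
   (x - yx)'*y                  % = 0 : the residual is orthogonal to y, (A3.1.54)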
This two-vector example defines the simplest possible instance of the Gram-Schmidt procedure. So all that remains to be done is to show how this procedure can be extended to larger sets of vectors. This extension is extremely simple, and uses only the two-vector procedure detailed above. To see this, let us proceed to a three-vector case. Suppose we are given linearly independent vectors, $z_1, z_2, z_3 \in \mathbb{R}^n$ ($n \ge 3$), and wish to construct an orthonormal basis, $(u_1, u_2, u_3)$, for $\mathrm{span}(z_1, z_2, z_3)$. To do so, we first construct an orthogonal basis, $(b_1, b_2, b_3)$, as follows:

Step 1. Start by setting

(A3.1.55) $b_1 = z_1$
Step 2. Next, project $z_2$ on $b_1$ and take the vector difference:

(A3.1.56) $b_2 = z_2 - \frac{z_2'b_1}{b_1'b_1}\,b_1 = z_2 - \frac{z_2'z_1}{z_1'z_1}\,z_1$

As in the example above, $(b_1, b_2)$ are now orthogonal, and are both in $\mathrm{span}(z_1, z_2, z_3)$.

Step 3. Finally, project $z_3$ on $b_1$ and $b_2$ individually and take the vector difference:

(A3.1.57) $b_3 = z_3 - \frac{z_3'b_1}{b_1'b_1}\,b_1 - \frac{z_3'b_2}{b_2'b_2}\,b_2$

By construction, $b_3$ is orthogonal to $b_1$, since

(A3.1.58) $b_1'b_3 = b_1'z_3 - \frac{z_3'b_1}{b_1'b_1}\,b_1'b_1 - \frac{z_3'b_2}{b_2'b_2}\,b_1'b_2 = b_1'z_3 - (z_3'b_1) - \frac{z_3'b_2}{b_2'b_2}\,(0) = b_1'z_3 - b_1'z_3 = 0$

and similarly for $b_2$. Given this orthogonal basis, it then follows by setting

(A3.1.59) $u_i = b_i/\|b_i\|\,, \quad i = 1, 2, 3$

that $(u_1, u_2, u_3)$ is the desired orthonormal basis for $\mathrm{span}(z_1, z_2, z_3)$. More generally, given any vectors, $z_1,\dots,z_k \in \mathbb{R}^n$, if orthogonal vectors, $(b_1,\dots,b_m)$, with $m < k$, have been constructed in this way, then one continues by setting

(A3.1.60) $b_{m+1} = z_{m+1} - \sum_{i=1}^m \frac{z_{m+1}'b_i}{b_i'b_i}\,b_i \;\in\; \mathrm{span}(z_1,\dots,z_k)$
Then the argument in (A3.1.58) again shows that $b_{m+1}$ is orthogonal to each $b_i$, $i = 1,\dots,m$. So by induction, we thus obtain an orthogonal basis, $(b_1,\dots,b_k)$, for $\mathrm{span}(z_1,\dots,z_k)$. This in turn yields an orthonormal basis, $(u_1,\dots,u_k)$, by normalizing all nonzero vectors in $(b_1,\dots,b_k)$ as in (A3.1.59). Moreover, the number of such nonzero vectors will again identify the dimension of $\mathrm{span}(z_1,\dots,z_k)$.
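For completeness, the following MATLAB sketch implements this procedure for the columns of an (arbitrarily chosen) matrix $Z$, skipping any near-zero residual vectors that signal linear dependence:

   Z = [1 1 0; 1 0 1; 0 1 1];     % columns z_1, z_2, z_3 (arbitrary example)
   B = [];                        % orthogonal basis, built one column at a time
   for j = 1:size(Z,2)
       b = Z(:,j);
       for i = 1:size(B,2)        % subtract projections, as in (A3.1.60)
           b = b - ((Z(:,j)'*B(:,i))/(B(:,i)'*B(:,i)))*B(:,i);
       end
       if norm(b) > 1e-10         % keep only nonzero residual vectors
           B = [B, b];
       end
   end
   U = B./vecnorm(B);             % normalize, as in (A3.1.59)
   U'*U                           % = identity matrix: columns are orthonormal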
One final possibility is of interest. Suppose that we are given an orthogonal basis, $(b_1,\dots,b_k)$, for some $\mathrm{span}(z_1,\dots,z_k)$ with $k < n$, and wish to extend this to an orthogonal basis for all of $\mathbb{R}^n$. This is again quite simple, since we already have a basis for $\mathbb{R}^n$, namely the identity basis, $(e_1,\dots,e_n)$. So to extend $(b_1,\dots,b_k)$ to a larger orthogonal basis, $(b_1,\dots,b_k,b_{k+1})$, we may proceed by setting $m = k$ in (A3.1.60) and then successively letting $z_{k+1} = e_i$ for each $i = 1,\dots,n$ until a nonzero difference vector, $b_{k+1}$, is found. There must be one, since not all $e_i$ can lie in the lower-dimensional space, $\mathrm{span}(z_1,\dots,z_k)$. Once $b_{k+1}$ is found, the procedure can be repeated by setting $m = k+1$ in (A3.1.60) and continuing down the list of identity basis vectors, $e_i$, until a new nonzero difference vector, $b_{k+2}$, is found. Again by induction, this procedure must result in a full set of basis vectors, $(b_1,\dots,b_n)$, which yields the desired extension. These can in turn be normalized as in (A3.1.59) to obtain an orthonormal basis, $(u_1,\dots,u_n)$, for $\mathbb{R}^n$. Finally, if the original basis is already orthonormal, say $(u_1,\dots,u_k)$, then this procedure is designated as an orthonormal extension of $(u_1,\dots,u_k)$ to all of $\mathbb{R}^n$.
While there are of course many special types of matrices that are of analytical interest [as for example the triangular Cholesky decompositions of symmetric matrices in (A2.7.44) of Appendix A2], our focus above on diagonal matrices and orthonormal matrices was for a reason. In the same way that orthonormal matrices have a simple decomposition into reflections, it turns out that every $n$-square matrix, $A$, is decomposable into a simple product of orthonormal and diagonal matrices as follows:

(A3.2.1) $A = U\,S\,V'$

where $U$ and $V$ are orthonormal and where $S = \mathrm{diag}(s_1,\dots,s_n)$ is a nonnegative diagonal matrix with diagonal entries, $s_i$, called the singular values of matrix $A$. In geometric terms, every linear transformation is constructible as a composition of a nonnegative scale transformation together with two orthonormal transformations. This fundamental result, known as the Singular Value Decomposition (SVD) Theorem, holds for all matrices (even rectangular matrices). At this level of generality, it has been designated by Gilbert Strang (1993, 2009) as the Fundamental Theorem of Linear Algebra.
The main objective of the present section is to establish this theorem. By way of motivation, recall from the beginning of these notes that our ultimate objective is to establish the Spectral Decomposition (SPD) Theorem for symmetric matrices, which asserts that every symmetric matrix, $A$, can be represented in terms of a single orthonormal matrix, $W$, and a diagonal matrix, $\Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_n)$, as

(A3.2.2) $A = W\,\Lambda\,W'$

where the diagonal entries, $\lambda_i$, are called the eigenvalues of $A$ (see Section A3.3 below). So except for the nonnegativity of $S$ in (A3.2.1), it would appear that this important result is simply the special case of the SVD Theorem with $W = U = V$. As we shall see below, this intuition is correct in many important cases. Moreover, it is essentially correct in all cases, in the sense that an SPD can always be constructed from any given SVD. It is this relationship that provides the main motivation for our consideration of this more general result. But as emphasized by Strang's renaming of this result, anyone interested in understanding linear transformations should try to gain some understanding of (A3.2.1) in its own right.
While proofs of the SVD Theorem can be found in most standard texts on matrix algebra, the most common approach is to start with the SPD Theorem and then apply this result to the partitioned symmetric matrix,

(A3.2.3) $M_A = \begin{pmatrix} 0 & A \\ A' & 0 \end{pmatrix}$
in order to establish the SVD Theorem. But this "trick" offers little insight into the geometric origins of either result. So the specific objectives of this section are to illustrate these origins with an easily visualized geometric example in $\mathbb{R}^2$, and then use these insights to motivate a constructive proof of the SVD Theorem.
To do so, we begin with a simple characterization of orthonormal matrices, $V$, namely that they are precisely those linear transformations which preserve unit length:

(A3.2.4) $\|x\| = 1 \;\Rightarrow\; \|Vx\| = 1$

To see this, note first that since any vector, $x \in \mathbb{R}^n$, can be transformed to have unit length by the rescaling, $x \mapsto \tfrac{1}{\|x\|}x$, it follows from (A3.2.4) that all distances must be preserved, since

(A3.2.5) $\left\|V\!\left(\tfrac{1}{\|x\|}x\right)\right\| = 1 = \left\|\tfrac{1}{\|x\|}x\right\| \;\Rightarrow\; \tfrac{1}{\|x\|}\|Vx\| = 1 \;\Rightarrow\; \|Vx\| = \|x\|$

Moreover, by recalling from the identity,

$x'y = \tfrac{1}{2}\big(\|x\|^2 + \|y\|^2 - \|x - y\|^2\big)$

that inner products are entirely expressible in terms of distances, it then follows from (A3.2.5) that all inner products must be preserved as well. Hence the defining conditions for orthonormality in (A3.1.34) and (A3.1.35) must hold, and $V$ is orthonormal.
Given this alternative characterization, we next observe that the product of matrices on the right-hand side of (A3.2.1) can be directly interpreted geometrically as an orthonormal transformation, $V'$, followed by a rescaling, $S$, followed by a second orthonormal transformation, $U$.⁸ But while this composite transformation is of course linear, the key question remains as to why every linear transformation, $A$, can be so represented. Assuming that $A$ is nonsingular (so that its inverse exists), a more informative geometric approach is to start with transformation, $A$, and see how to "undo it" (i.e., invert it back to the identity) through a series of simple transformations. For the two-dimensional case, this process can be illustrated by the four panels shown in Figure A3.16 below.
⁸ This is illustrated for example in Figure 6.8 of Strang (2009, p. 366).
Starting from the upper left panel, suppose that a given transformation, $A$, maps the basis vectors, $(e_1, e_2)$, in $\mathbb{R}^2$ as shown in the upper right panel. In geometric terms, the key here is to consider not only how these basis vectors are transformed, but also how the entire unit circle (shown in blue) is transformed. In $\mathbb{R}^2$ the image of this circle is always some ellipse, as shown (in blue) in the upper right panel. Since the unit circle consists of all vectors of unit length, we see that some of these vectors will typically be "stretched" more than others by transformation $A$. In particular, since the major axis and minor axis of this ellipse (shown as thin blue lines) denote the directions of maximum and minimum distances from the origin, it follows that the vector on the unit circle which is "maximally stretched" by $A$ must be the vector (not shown) that is mapped into the major axis of this ellipse. Similarly, the vector that is "minimally stretched" is mapped into the minor axis.
[Figure A3.16. Undoing the transformation $A$: the unit circle and basis vectors (upper left) are mapped by $A$ to an ellipse (upper right); the rotation, $U'$, aligns the principal axes of this ellipse with the coordinate axes (lower right); and the rescaling, $S^{-1}$, maps the result back to the unit circle, yielding the images $S^{-1}U'Ae_1$ and $S^{-1}U'Ae_2$ (lower left)]
So to remove all stretch effects, the simplest procedure is to rotate these (orthogonal)
axes into the coordinate axes, and then rescale them back to unit lengths. The appropriate
rotation is shown in the lower right panel, and is represented by an orthonormal matrix,
$U'$. The rescaling back to unit lengths is then shown in the lower left panel, and is represented by a positive diagonal matrix, $S^{-1}$. Notice also that by scaling the maximum and minimum lengths to unity, all intermediate lengths must also be scaled to unity.⁹ So the ellipse again becomes a unit circle. What this implies is that the transformation represented by the product, $S^{-1}U'A$, has actually mapped the unit circle back into itself. So if we now denote this product matrix by

(A3.2.7) $V' = S^{-1}U'A$

then it follows that $V'$ must satisfy (A3.2.4), and hence must be orthonormal. In particular, the images of $e_1$ and $e_2$ under this transformation (namely the two vectors, $S^{-1}U'Ae_1$ and $S^{-1}U'Ae_2$, shown in the lower left panel of Figure A3.16) must be orthogonal. So by construction we may use (A3.1.38) to conclude that

(A3.2.8) $S^{-1}U'A = V' \;\Rightarrow\; U'A = S\,V' \;\Rightarrow\; A = U\,S\,V'$

and thus that $A$ is representable as in (A3.2.1) [where in this nonsingular case, $S$ must be a positive diagonal matrix].
If we now denote the unit sphere in $\mathbb{R}^n$ by

(A3.2.9) $\mathcal{S}_n = \{x \in \mathbb{R}^n : \|x\| = 1\}$

then one can in principle construct similar arguments for the ellipsoidal images,

(A3.2.10) $A(\mathcal{S}_n) = \{Ax \in \mathbb{R}^n : x \in \mathcal{S}_n\}$

of $\mathcal{S}_n$ under linear transformations, $A$. The basic ideas can be illustrated for $\mathbb{R}^3$ as shown in Figure A3.17 below.
⁹ To show this formally, observe first that the equation of the ellipse in the lower right panel must be of the form, $a_1x_1^2 + a_2x_2^2 = c$, for some positive constants, $a_1, a_2, c$. So for the principal axes of this ellipse, say $(x_{01}, 0)$ and $(0, x_{02})$, it must be true that $a_1x_{01}^2 = c = a_2x_{02}^2$. But if the given scale transformation is denoted by, $S^{-1} = \mathrm{diag}(s_1^{-1}, s_2^{-1})$, so that $S^{-1}x = (s_1^{-1}x_1, s_2^{-1}x_2)'$, then for this unit scaling it must also be true that $s_1^{-1}x_{01} = 1 = s_2^{-1}x_{02}$, so that $x_{01} = s_1$ and $x_{02} = s_2$. These two relations together imply that $a_1 = s_1^{-2}c$ and $a_2 = s_2^{-2}c$, so that $c = a_1x_1^2 + a_2x_2^2 = (s_1^{-2}c)x_1^2 + (s_2^{-2}c)x_2^2$. Finally, by canceling $c$ on both sides, we see that $1 = (s_1^{-1}x_1)^2 + (s_2^{-1}x_2)^2 = \|(s_1^{-1}x_1, s_2^{-1}x_2)'\|^2$, and may thus conclude that $\|(s_1^{-1}x_1, s_2^{-1}x_2)'\| = 1$, i.e., that all transformed vectors, $(s_1^{-1}x_1, s_2^{-1}x_2)'$, have unit length.
[Figure A3.17. The unit sphere, $\mathcal{S}_3$ (left), and its ellipsoidal image, $A(\mathcal{S}_3)$ (right), with unit vectors $v_1, v_2$ mapped into the principal-axis points $Av_1, Av_2$]
In this example, the unit sphere, $\mathcal{S}_3$, shown (in red) on the left is mapped by the linear transformation,

(A3.2.11) $A = \begin{pmatrix} 0.7 & 0 & 0 \\ 0 & 1.8 & 0 \\ 0 & 0.7 & 0.7 \end{pmatrix}$
into the ellipsoidal image set, $A(\mathcal{S}_3)$, shown (in red) on the right. The details of this example will be discussed further as we proceed. But for the moment, it should be clear that the first principal axis (major axis) of this ellipsoid is the line through the origin (not shown) that connects the two ends of this "football-shaped" set. So the point labeled, $Av_1$ (to be discussed below), is the image of a point, $v_1 \in \mathcal{S}_3$, which is "maximally stretched" by transformation, $A$. The location of this particular point, $v_1$, is shown on $\mathcal{S}_3$ (just below the $x_2$ axis). So by linearity, the other maximally stretched point in $\mathcal{S}_3$ (not shown) must be just opposite to $v_1$ on the line from $v_1$ through the origin. Note also that the second principal axis is a line through the origin which is orthogonal to the first principal axis and passes through the point labeled, $Av_2$, in the figure.
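As a numerical aside (a sketch using MATLAB's built-in svd, noted also in footnote 12 below), one can compute this decomposition for the matrix in (A3.2.11) and verify both the representation (A3.2.1) and the maximal-stretch interpretation of $v_1$:

   A = [0.7 0 0; 0 1.8 0; 0 0.7 0.7];   % the matrix in (A3.2.11)
   [U,S,V] = svd(A);                    % orthonormal U, V; nonnegative diagonal S
   norm(A - U*S*V')                     % ~ 0 : A = U S V', as in (A3.2.1)
   diag(S)'                             % singular values s_1 >= s_2 >= s_3 > 0
   norm(A*V(:,1))                       % = s_1 : v_1 is maximally stretched by A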
While it is possible to construct an orthonormal transformation that rotates these axes into the coordinate axes, and then rescale the ellipsoidal image back to a sphere as in Figure A3.16 above, the details of such a construction are extremely tedious (especially in higher dimensions). Hence the two most important features of the argument in Figure A3.16 are (i) its graphical simplicity in $\mathbb{R}^2$, and (ii) its role in suggesting a more tractable approach to the SVD Theorem in $\mathbb{R}^n$. In particular, this approach is motivated by the observation that the critical task in the above argument is to identify those unit vectors in $\mathbb{R}^n$ that are mapped by $A$ into the principal axes of the ellipsoidal image, $A(\mathcal{S}_n)$, so that the appropriate rotations can be defined. Note in particular that the vector mapped into the major axis of the ellipse in Figure A3.16 (or the ellipsoid in Figure A3.17) is by definition that unit vector, $v_1$, with maximal image length, $\|Av_1\|$. So the most natural procedure for identifying $v_1$ is to solve the maximization problem:¹⁰

(A3.2.12) $\max\;\{\|Av\| : \|v\| = 1\}$
There will of course be two solutions, corresponding to each end of the ellipse (or ellipsoid). But this vector is essentially unique up to a choice of direction. The second key point established for the case of $\mathbb{R}^2$ was that the vector mapped into the minor axis is necessarily determined (up to a choice of direction) as one orthogonal to $v_1$. In the case of $\mathbb{R}^2$, this was established by verifying that the transformation, $V'$, in (A3.2.7) was orthonormal. In higher dimensions, a direct proof of this fact is much more difficult. So our approach will be to start by assuming that this is the case, and use this assumption to construct a sequence of maximization problems similar to (A3.2.12). The final solutions to these problems will be seen to yield precisely the desired representation in (A3.2.1), and thus show (among other things) that $V'$ in (A3.2.7) is indeed orthonormal in all cases.
To formalize this, observe that if for any unit vector, $v$, with nonzero image we let $u$ denote the unit vector in the direction of $Av$ (i.e., $u = Av/\|Av\|$), then by construction,
¹⁰ See Stewart (1993) for an interesting historical discussion of this variational approach, which goes back to the work of Jordan in the 1870s.
(A3.2.13) $Av = s\,u$

for some scalar, $s$. Note that if $Av = 0$ then (A3.2.13) will hold trivially for $s = 0$. While we shall eventually need to deal with this degenerate case, we focus for the present on vectors, $v \in \mathbb{R}^n$, with $Av \ne 0$ [i.e., $v \notin \mathrm{null}(A)$], so that $s \ne 0$. Moreover, by replacing $u$ with $-u$ if necessary, we can always ensure that $s > 0$, so that by construction, $\|Av\| = s\,\|u\| = s > 0$. Thus, as an alternative to (A3.2.12), one can find the direction, $v$, of maximal stretch by solving the associated maximization problem:

(A3.2.14) $\max\;\{s : Av = su,\; u'u = 1,\; v'v = 1\}$

Note also that since $u'u = \|u\|^2 = 1$ for any such vector, $u$, it follows from the first constraint that

(A3.2.15) $u'Av = s\,u'u = s$

so that (A3.2.14) can equivalently be written as the constrained maximization problem:

(A3.2.16) $\max\;\{u'Av : u'u = 1,\; v'v = 1\}$

As we shall see below, the advantage of this alternative formulation is that it will allow us to solve simultaneously for all three matrices, $U$, $S$, and $V$ in (A3.2.1), where $u$ and $v$ will turn out to be column vectors of $U$ and $V$ respectively, and where the values, $s = u'Av$, will be the diagonal elements of $S$. This constrained maximization problem thus constitutes the centerpiece of the present analysis, and will be used recursively to construct the full SVD representation for arbitrary linear transformations.
Before doing so, it is important to note finally that (A3.2.16) must always have a solution. While this may seem obvious in our original two-dimensional problem, it is less so in higher dimensions. In particular, since the objective function, $u'Av$, in (A3.2.16) is a bilinear form in $u$ and $v$ (i.e., it is linear in $u$ for each fixed $v$, and linear in $v$ for each fixed $u$), there are no natural maxima or minima for this function. But the existence of such solutions follows from what is usually called the Generalized Extreme Value Theorem. The classical Extreme Value Theorem simply states that every continuous function, $f(x)$, on a closed bounded interval, $[a, b]$, has both a maximum and a minimum value. This can be seen intuitively as in Figure A3.19 below:

[Figure A3.19. A continuous function on the interval $[a, b]$, with boundary values $f(a)$ and $f(b)$, attaining both a maximum and a minimum]
The generalized version simply shows that the same is true for continuous functions on nonempty closed bounded sets in any finite-dimensional space, $\mathbb{R}^N$.¹¹ In the present case, the bilinear form, $f(u, v) = u'Av$, is a continuous function on $\mathbb{R}^{2n}$ constrained to the product of unit spheres, $\mathcal{S}_n \times \mathcal{S}_n = \{u \in \mathbb{R}^n : \|u\| = 1\} \times \{v \in \mathbb{R}^n : \|v\| = 1\} \subset \mathbb{R}^{2n}$, which is easily seen to be a nonempty closed bounded set in $\mathbb{R}^{2n}$. Hence there always exists a maximum solution to (A3.2.16). Moreover, since both the objective function, $f(u, v)$, and the constraint functions, $u'u$ and $v'v$, are continuously differentiable on $\mathbb{R}^{2n}$, this maximum can be characterized by the first-order conditions of the associated Lagrangian function [recall expression (A2.8.38) in Section 8 of the Appendix, A2, to Part II of these notes]:

(A3.2.17) $\mathcal{L}(u, v, s, \lambda) = u'Av + \tfrac{s}{2}\,(1 - u'u) + \tfrac{\lambda}{2}\,(1 - v'v)$

In particular, the first-order conditions for $u$ and $v$ reduce to

(A3.2.18) $0 = \nabla_u\mathcal{L} = Av + \tfrac{1}{2}[-2s\,u] = Av - s\,u \;\Rightarrow\; Av = s\,u$

(A3.2.19) $0 = \nabla_v\mathcal{L} = A'u + \tfrac{1}{2}[-2\lambda\,v] = A'u - \lambda\,v \;\Rightarrow\; A'u = \lambda\,v$
where (A3.2.19) also uses the identity, $\nabla_v(u'Av) = \nabla_v[(A'u)'v] = A'u$. Similarly, the first-order conditions for $s$ and $\lambda$ reduce to the constraints,

(A3.2.20) $u'u = 1\,, \quad v'v = 1$
At this point, notice that conditions (A3.2.18) and (A3.2.20) are simply the constraints in (A3.2.14) that originally motivated this formulation. In particular, transformation $A$ must achieve its maximum stretch, $s$, at vector $v$. Hence the most important new information provided by this solution is condition (A3.2.19), which shows that there is a parallel relation for the transpose, $A'$, of $A$. In particular, the same argument leading to (A3.2.14) shows that the maximum stretch, $\lambda$, of the transpose transformation, $A'$, must be achieved at vector $u$, so that there is a clear duality between these two transformations. Moreover, by the symmetry of inner products, it follows from (A3.2.18), (A3.2.19) and (A3.2.20) that

$s = s\,u'u = u'(Av) = v'(A'u) = \lambda\,v'v = \lambda$

So in fact this maximum stretch value must be the same for both $A$ and $A'$.
¹¹ Even more generally, this is true for continuous functions on compact sets in arbitrary topological spaces. See for example the self-contained development of this general version in Murphy (2008).
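This duality is easy to verify numerically; a MATLAB sketch using the matrix from (A3.2.11):

   A = [0.7 0 0; 0 1.8 0; 0 0.7 0.7];
   norm(A)          % maximal stretch of A (its largest singular value)
   norm(A')         % maximal stretch of A' : the same value
   max(svd(A))      % equals both of the above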
Before extending this argument to obtain the SVD representation (A3.2.1) for all matrices, $A$, it is essential to distinguish between the nonsingular and singular cases. Figure A3.17 above illustrates a typical nonsingular case, which is by far the most important case for all applications that we consider in these notes. However, since this same representation also holds for singular matrices, it is instructive to see what this means for the geometry of linear transformations. To illustrate the basic differences between these two cases, we now consider the following modification of transformation, $A$, in expression (A3.2.11) above:

(A3.2.22) $A_0 = \begin{pmatrix} 0.7 & 0 & 0 \\ 0 & 1.8 & 0 \\ 0 & 0.7 & 0 \end{pmatrix}$

Here the matrix, $A_0$, differs from $A$ in only the third column, which is now the zero vector. This of course implies that $A_0e_3 = 0$, and hence that $A_0$ is singular. The corresponding modification of Figure A3.17 is shown in Figure A3.20 below.
[Figure A3.20. The unit sphere, $\mathcal{S}_3$ (left), and its collapsed image, $A_0(\mathcal{S}_3)$, an ellipse in the two-dimensional plane $\mathrm{span}(A_0)$ (right), with image points $A_0v_1$ and $A_0v_2$]
The key difference here is that $\mathrm{span}(A_0)$ is now a two-dimensional plane (shown in blue). So the ellipsoid in Figure A3.17 has now been collapsed into an ellipse on this plane. Notice also that even though $\mathcal{S}_3$ is only the surface of the unit ball in $\mathbb{R}^3$, the image set, $A_0(\mathcal{S}_3)$, consists of the full area inside the ellipse on the right (including the origin). But the initial maximization problem in (A3.2.12) above is still well defined, and is seen to have a solution very similar to the full-dimensional case in Figure A3.17. Notice in particular that the analysis of this ellipse in $\mathrm{span}(A_0)$ is qualitatively the same as that for the ellipse in the upper right panel of Figure A3.16 for the $\mathbb{R}^2$ case. More generally, it will turn out that for any singular matrix, $A$, one proceeds by first analyzing
the ellipsoid in span($A$), and then extending this analysis to the collapsed dimensions in null($A$) in order to complete the SVD representation.

This extension process is most transparent in $\mathbb{R}^2$. So before proceeding with the formal argument, it is instructive to reconsider the singular example in expression (A3.1.28) together with Figure A3.5. This figure is reproduced in Figure A3.21 below, where the unit circle, $S_2$, is now included. The image set, $A(S_2)$, is given by the red line segment shown, which by definition lies in span($A$). So the possible solution vectors, $v \in S_2$, in (A3.2.12) are seen to be either $v_1$ or $-v_1$, with images, $Av_1$ and $-Av_1$, corresponding to the end points of the interval, $A(S_2)$, as shown in the figure. For purposes of discussion, we now focus on $v_1$. In this case, the full solution to this maximization problem is given by the triple, $(v_1, s_1, u_1)$, where $u_1$ is the unit-scaled version of $Av_1$ (shown by the red point), with scale factor, $s_1$, denoting the maximum-stretch value, i.e., $Av_1 = s_1u_1$.
[Figure A3.21. The singular example: the unit circle, $S_2$, with vectors $\pm v_1$ and $\pm v_2$; the image segment, $A(S_2)$, with end points $\pm Av_1$ and unit vector $u_1$ in span($A$); and the null space, null($A$), with orthogonal unit vectors $\pm u_2$.]
Turning next to $v_2$, note that since $v_2$ lies in null($A$), if we choose any unit vector, $u_2$, orthogonal to $u_1$ (such as the point, $u_2$, just to the right of $v_2$ in the figure), then it is automatically true that $Av_2 = 0 = 0\cdot u_2$. So this degenerate "maximal stretch" solution is summarized by the triple, $(v_2, s_2, u_2)$, where $s_2 = 0$. Notice that when taken together, these two solutions can be written as

(A3.2.23) $\begin{matrix} Av_1 = s_1u_1 \\ Av_2 = s_2u_2 \end{matrix} \;\Rightarrow\; A(v_1, v_2) = (u_1, u_2)\begin{pmatrix} s_1 & 0 \\ 0 & s_2 \end{pmatrix} \;\Rightarrow\; AV = US$
and hence that

(A3.2.24) $A = USV'$

and thus that (A3.2.1) holds for this choice of matrices. In the present case, it can readily be verified that these matrices have the exact form:

(A3.2.25) $U = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ 1/\sqrt{2} & -1/\sqrt{2} \end{pmatrix}, \quad S = \begin{pmatrix} \sqrt{10} & 0 \\ 0 & 0 \end{pmatrix}, \quad V = \begin{pmatrix} 2/\sqrt{5} & 1/\sqrt{5} \\ 1/\sqrt{5} & -2/\sqrt{5} \end{pmatrix} \;\Rightarrow\; USV' = \begin{pmatrix} 2 & 1 \\ 2 & 1 \end{pmatrix} = A$

where, for example, $s_1 = \sqrt{10} \approx 3.16$ is the length of the major-axis vector, $Av_1 = (\sqrt{5}, \sqrt{5})$, in Figure A3.21.
that all analysis of the collapsed “minor axis” in null ( A) is formally identical to that of
the positive “major axis” in span( A) . The only difference is that the solutions in the
collapsed case are nonunique, so that any choice of a unit vector, u2 , orthogonal to u1
will work.
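This singular example is easily checked numerically. The following MATLAB sketch verifies (A3.2.25) (note that svd.m may return columns whose signs differ from those displayed above):

% Singular example from (A3.2.25): A = [2 1; 2 1] has singular values (sqrt(10), 0)
A = [2 1; 2 1];
[U, S, V] = svd(A);
disp(diag(S)')            % singular values: 3.1623  0
disp(norm(A - U*S*V'))    % reconstruction error: ~1e-16
disp(norm(A*V(:,1)))      % maximal stretch ||A*v1|| = s1 = sqrt(10)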
Note finally that nonuniqueness of solutions is also possible for positive axes of the ellipsoid in span($A$). A simple example is provided by any orthonormal transformation, $A = U$, where $U(S_n) = S_n$ implies that all "axes" of this spherical image must have the same length. In this extreme case, there are infinitely many SVD representations of U, including the trivial one, $U = U I_n I_n'$. A more interesting example is based on the matrix, $A$, in (A3.2.11) with SVD given by:12
12
This solution was obtained numerically with the MATLAB program, svd.m.
(A3.2.26) $A = \begin{pmatrix} 0.7 & 0 & 0 \\ 0 & 1.8 & 0 \\ 0 & 0.7 & 0.7 \end{pmatrix} = USV'\,, \quad S \approx \mathrm{diag}\,(1.9500,\; 0.7000,\; 0.6462)$
While all principal axes in this example are distinct, notice that the lengths of the second and third axes (0.7 and 0.6462) are almost the same. Geometrically, this implies that the intersection of the surface of the ellipsoid in the right panel of Figure A3.17 with the plane orthogonal to the major axis vector, $Av_1$, must be almost circular (as shown by the blue curve in the figure). So one can imagine that this intersection can be made exactly circular by an appropriately small modification of the matrix, $A$.13 In this circular case, it should be clear that while the principal axis vector, $Av_1$, is still unique (up to a choice of sign), there is no unique choice of the second principal axis vector, $Av_2$, shown in Figure A3.17. Any selection of a unit vector, $v_2$, orthogonal to $v_1$ will do. But, as we shall see below, the actual maximization problem for identifying this principal axis is still well defined, and most importantly, all such choices of $v_2$ must satisfy the corresponding Lagrangian first-order conditions.
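This near-circularity is easily checked numerically; the following MATLAB sketch computes the singular values of the matrix above:

% Matrix A from (A3.2.11): second and third singular values nearly equal
A = [0.7 0 0; 0 1.8 0; 0 0.7 0.7];
disp(svd(A)')    % approx. 1.9500  0.7000  0.6462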
With these preliminary observations, we are now ready to extend the maximization problem in (A3.2.16) in order to obtain a full singular value decomposition (SVD) of matrices, A (which will further clarify the natural duality between $A$ and $A'$).14
Singular Value Decomposition Theorem. For any $n$-square matrix, $A$, there exist orthonormal matrices, $U = (u_i : i = 1,..,n)$, $V = (v_i : i = 1,..,n)$, and a nonnegative diagonal matrix, $S = \mathrm{diag}(s_i : i = 1,..,n)$, such that

(A3.2.27) $A = USV'$
Proof: To establish this result, we begin by observing that if (A3.2.27) holds, then [as an extension of (A3.2.23) above] it follows by definition that,

(A3.2.28) $A = USV' \;\Rightarrow\; AV = US$
13 One such modification, $A_o$, is obtained by simply replacing $S$ with $S_o = \mathrm{diag}(1.95, 0.7, 0.7)$ and using $A_o = US_oV'$.
14 As mentioned earlier, more compact versions of this SVD Theorem can be obtained by appealing to the Spectral Decomposition Theorem and employing the symmetric-matrix device in (A3.2.3) above. [For a "variational" version of this proof see Theorem 7.3.10 in Horn and Johnson (1985).] However, it should be emphasized that essentially all direct proofs of the Spectral Decomposition Theorem implicitly embed $\mathbb{R}$ in the complex plane, $\mathbb{C}$, to ensure existence of such decompositions. Hence one of the objectives of the present approach is to avoid any appeal to complex number theory whatsoever.
$(Av_1,..,Av_n) = (u_1,..,u_n)\begin{pmatrix} s_1 & & \\ & \ddots & \\ & & s_n \end{pmatrix} = (s_1u_1,..,s_nu_n) \;\Leftrightarrow\; Av_i = s_iu_i\,,\; i = 1,..,n$

where the last line is seen to have exactly the same form as (A3.2.18) above. Hence if we now denote the solution to (A3.2.17) by $(u_1, s_1, v_1)$, so that conditions (A3.2.18) through (A3.2.20) imply

(A3.2.29) $Av_1 = s_1u_1\,, \quad A'u_1 = s_1v_1\,, \quad u_1'u_1 = 1 = v_1'v_1$
then our objective is to extend this relation to a full SVD as in (A3.2.28) by generating the successive triplets, $(u_i, s_i, v_i)$, one at a time. Here it is instructive to generate the next triplet, $(u_2, s_2, v_2)$, in full detail, and then proceed by induction for the rest. To do so, we begin by observing that in order for U and V to be orthonormal, we must require that $(u_2, v_2)$ satisfy the orthogonality conditions, $u_2'u_1 = 0 = v_2'v_1$. So if we now let $S_n(u_1) = \{u \in S_n : u'u_1 = 0\}$ and $S_n(v_1) = \{v \in S_n : v'v_1 = 0\}$ denote the vectors of unit length orthogonal to $u_1$ and $v_1$ respectively, then in geometric terms, the task is to find "maximal stretch" vectors, $(u_2, v_2) \in S_n(u_1) \times S_n(v_1)$, for transformation A which generate the "second principal axes" of the ellipsoids, $A(S_n)$ and $A'(S_n)$, respectively.
[For example, the set $S_n(v_1)$ for the nonsingular illustration in Figure A3.17 above is shown by the blue circle on $S_3$ in the left panel, with corresponding image, $A[S_n(v_1)]$, shown by the blue circle in the right panel. Similarly, for the singular illustration in Figure A3.20, the set $S_n(v_1)$ is again shown on the left as a (different) blue circle on $S_3$, with associated image now corresponding to the interval shown in dark blue in the right panel. Note that (for the sake of visual clarity) neither the vector, $u_1$, nor its orthogonal set, $S_n(u_1)$, is shown in these figures.] As a natural extension of (A3.2.17), the appropriate maximization problem for determining $(u_2, v_2)$ is given by

(A3.2.30) maximize: $u_2'Av_2$  subject to: $(u_2, v_2) \in S_n(u_1) \times S_n(v_1)$
Moreover, $S_n(u_1)$ and $S_n(v_1)$ are again nonempty closed bounded subsets of $\mathbb{R}^n$ for $n \geq 2$ [implying that $S_n(u_1) \times S_n(v_1)$ must be a nonempty closed bounded subset of $\mathbb{R}^{2n}$]. So the same argument using the Generalized Extreme Value Theorem again shows that a solution to (A3.2.30) must exist. Since the above constraint conditions for $(u_2, v_2)$ can be equivalently stated as

(A3.2.31) $u_2'u_2 = 1 = v_2'v_2\,, \quad u_2'u_1 = 0 = v_2'v_1$
it follows that the appropriate Lagrangian function for this problem takes the form:

(A3.2.32) $L_2 = u_2'Av_2 - \tfrac{s_2}{2}(u_2'u_2 - 1) - \tfrac{\sigma_2}{2}(v_2'v_2 - 1) - \alpha_2(u_2'u_1) - \beta_2(v_2'v_1)$

Here the first order conditions for $u_2$ and $v_2$ are given respectively by

(A3.2.33) $0 = \nabla_{u_2}L_2 = Av_2 - s_2u_2 - \alpha_2u_1$

(A3.2.34) $0 = \nabla_{v_2}L_2 = A'u_2 - \sigma_2v_2 - \beta_2v_1$

Premultiplying (A3.2.33) by $u_1'$ and (A3.2.34) by $v_1'$, and using (A3.2.29) together with the orthogonality conditions in (A3.2.31), shows that $\alpha_2 = 0 = \beta_2$, so that these conditions reduce to

(A3.2.37) $Av_2 = s_2u_2$

(A3.2.38) $A'u_2 = \sigma_2v_2$

Moreover, exactly the same argument in (A3.2.21) with $(u_2, s_2, v_2)$ replacing $(u, s, v)$ now shows that $\sigma_2 = s_2$, so that (A3.2.38) becomes

(A3.2.39) $A'u_2 = s_2v_2$
Hence the maximal stretch, $s_2$, for transformation A among vectors in $S_n(v_1)$ is achieved at $v_2$, and similarly, the same maximal stretch for the transpose transformation, $A'$, among vectors in $S_n(u_1)$ is achieved at $u_2$. Most importantly for our present purposes, expression (A3.2.37) shows that $(u_2, s_2, v_2)$ yields the desired second row for the SVD in expression (A3.2.28). Note finally that this solution, $(u_2, s_2, v_2)$, may not be unique, even when $(u_1, s_1, v_1)$ is unique [such as in the modification of example (A3.2.11) illustrated above]. But all such solutions must necessarily satisfy conditions (A3.2.31), (A3.2.37) and (A3.2.39).
The task remaining is to extend this argument by induction to all rows of (A3.2.28). To do so, we start with the inductive hypothesis that for a given $k \leq n$, the first $k-1$ rows have been filled with triplets, $(u_i, s_i, v_i)$, $i = 1,..,k-1$, satisfying

(A3.2.40) $Av_i = s_iu_i\,, \quad i = 1,..,k-1$

(A3.2.41) $A'u_i = s_iv_i\,, \quad i = 1,..,k-1$

(A3.2.42) $u_i'u_j = \delta_{ij}\,, \quad i, j = 1,..,k-1$

(A3.2.43) $v_i'v_j = \delta_{ij}\,, \quad i, j = 1,..,k-1$
If we now let $S_n(u_1,..,u_{k-1}) = \{u \in S_n : u'u_i = 0,\; i = 1,..,k-1\}$ denote the set of unit vectors in $\mathbb{R}^n$ orthogonal to $(u_1,..,u_{k-1})$, and similarly let $S_n(v_1,..,v_{k-1}) = \{v \in S_n : v'v_i = 0,\; i = 1,..,k-1\}$ denote the unit vectors orthogonal to $(v_1,..,v_{k-1})$, then since these nonempty sets are again closed and bounded, one final application of the Generalized Extreme Value Theorem shows that the maximization problem

(A3.2.44) maximize: $u_k'Av_k$  subject to: $(u_k, v_k) \in S_n(u_1,..,u_{k-1}) \times S_n(v_1,..,v_{k-1})$
must have a solution. Moreover, as an extension of (A3.2.31) and (A3.2.32), it follows that if the constraint conditions on $(u_k, v_k)$ are written explicitly as

(A3.2.45) $u_k'u_k = 1 = v_k'v_k\,, \quad u_k'u_i = 0 = v_k'v_i\,,\; i = 1,..,k-1$

then the appropriate Lagrangian function for (A3.2.44) is seen to have the form:

(A3.2.46) $L_k = u_k'Av_k - \tfrac{s_k}{2}(u_k'u_k - 1) - \tfrac{\sigma_k}{2}(v_k'v_k - 1) - \sum_{j=1}^{k-1}\alpha_j(u_k'u_j) - \sum_{j=1}^{k-1}\beta_j(v_k'v_j)$
Here the first order conditions for $u_k$ and $v_k$ have the respective forms

(A3.2.47) $0 = \nabla_{u_k}L_k = Av_k - s_ku_k - \sum_{j=1}^{k-1}\alpha_ju_j$

(A3.2.48) $0 = \nabla_{v_k}L_k = A'u_k - \sigma_kv_k - \sum_{j=1}^{k-1}\beta_jv_j$
and the remaining first order conditions are now given by (A3.2.45). [Note again from the orthogonality conditions in (A3.2.45) that the constraint gradient vectors in both (A3.2.47) and (A3.2.48) are linearly independent, so that this Lagrangian formulation in (A3.2.46) is indeed valid.] Next, to show that $\alpha_j = 0 = \beta_j$, $j = 1,..,k-1$, we again premultiply (A3.2.47) by $u_j'$ and use the inductive hypotheses (A3.2.40) through (A3.2.43) together with (A3.2.45) to conclude that

(A3.2.49) $0 = u_j'Av_k - s_k(u_j'u_k) - \alpha_j = (A'u_j)'v_k - \alpha_j = s_j(v_j'v_k) - \alpha_j = -\alpha_j$

and similarly, by premultiplying (A3.2.48) by $v_j'$, that

(A3.2.50) $0 = v_j'A'u_k - \sigma_k(v_j'v_k) - \beta_j = (Av_j)'u_k - \beta_j = s_j(u_j'u_k) - \beta_j = -\beta_j$

So conditions (A3.2.47) and (A3.2.48) now reduce to

(A3.2.51) $Av_k = s_ku_k$

(A3.2.52) $A'u_k = \sigma_kv_k$

Finally, since the argument in (A3.2.21) with $(u_k, s_k, v_k)$ replacing $(u, s, v)$ again shows that $\sigma_k = s_k$, we see that (A3.2.52) becomes

(A3.2.53) $A'u_k = s_kv_k$
Thus the conditions in (A3.2.40) through (A3.2.43) hypothesized for $i = 1,..,k-1$ are seen to hold for $k$ as well, and it follows by induction that they must hold for all $i = 1,..,n$.
Most importantly for our purposes, conditions (A3.2.40) together with (A3.2.42) and
(A3.2.43) are now seen to yield a full SVD for A as in expression (A3.2.28).
This particular proof of the SVD Theorem has a number of additional geometric consequences. Note first from (A3.2.45) and (A3.2.51) that

(A3.2.54) $u_k'Av_k = s_k(u_k'u_k) = s_k\,, \quad k = 1,..,n$

so that the stretch values, $s_k$, are indeed the maximum values of the objective function, $u_k'Av_k$, at each step, $k$. Moreover, since this objective function is formally the same at each step, and since the constraint sets form a nested decreasing sequence, i.e.,

(A3.2.55) $S_n(u_1,..,u_k) \times S_n(v_1,..,v_k) \subseteq S_n(u_1,..,u_{k-1}) \times S_n(v_1,..,v_{k-1})\,, \quad k = 1,..,n$

it follows that these maximal values must necessarily form a non-increasing sequence, so that

(A3.2.56) $s_1 \geq s_2 \geq \cdots \geq s_n$
In geometric terms, these singular values thus yield the successive lengths of the principal axes corresponding to the ellipsoidal image, $A(S_n)$, of the unit sphere, $S_n$, under the linear transformation, A. In particular, if $s_1 > s_2 > \cdots > s_n > 0$, then A is nonsingular and the n-dimensional ellipsoid, $A(S_n)$, has a well defined set of principal axes. However, if there are say $k$ repetitions of a positive singular value, such as in the modified version of Figure A3.17 illustrated above with $k = 2$, then a k-dimensional "slice" through this ellipsoid will be spherical. Similarly, if the last $k$ singular values are zero, then A is singular and its null space, null($A$), has exactly dimension $k$. So a great deal of information about A is conveyed by these singular values.
However, it should also be emphasized that the programming formulation of this proof is not meant to provide a method for computing the SVD of a matrix. This is particularly evident when there are repeated singular values (either positive or zero). Here there are infinitely many programming solutions, and procedures such as Gram-Schmidt orthogonalization must be used to construct appropriate orthonormal sets of solution vectors. While there exist very efficient methods for constructing such decompositions (often based on the Householder representations in Section A3.1.2 above), such procedures are beyond the scope of these notes.15
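In practice one simply uses built-in procedures such as MATLAB's svd.m. As a minimal sketch, the following fragment illustrates the ordering property (A3.2.56) and the relation between zero singular values and null($A_0$) for the singular matrix in (A3.2.22):

% Singular values are returned in non-increasing order, as in (A3.2.56)
A0 = [0.7 0 0; 0 1.8 0; 0 0.7 0];
s = svd(A0);
disp(s')                     % s1 >= s2 >= s3, with s3 = 0
disp(nnz(s < 1e-12))         % one zero singular value ...
disp(size(null(A0), 2))      % ... matching dim[null(A0)] = 1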
We now consider some of the more useful consequences of the SVD Theorem for our purposes. As already mentioned, one direct consequence is to clarify the geometric relation between $A$ and $A'$. In particular, it follows at once from (A3.2.27) together with (A3.1.13) that

(A3.2.57) $A = USV' \;\Rightarrow\; A' = VSU'$
15
For a discussion of such methods as used by MATLAB, see Chapter 10 of Mohler (2004).
So the singular values of $A$ and $A'$ must always be the same. Moreover, the above proof shows that their respective ellipsoidal images, $A(S_n)$ and $A'(S_n)$, of the unit sphere, $S_n$, must essentially be rotations of one another, where the roles of the orthonormal matrices, U and V, are exactly reversed. A simple illustration of this relationship is given in Figure A3.22 below, where the unit circle, $S_2$, is shown in black, and the elliptical images, $A(S_2)$ and $A'(S_2)$, for a given matrix, A,16 are shown in blue and red, respectively.
[Figure A3.22. The unit circle, $S_2$ (black), together with its elliptical images, $A(S_2)$ (blue) and $A'(S_2)$ (red).]
Next we consider a number of SVD consequences that will be used in our subsequent
analyses.
Note first that since the inverse of an orthonormal matrix is simply its transpose, it follows at once from the SVD Theorem that for any nonsingular matrix, A,

(A3.2.58) $A = USV' \;\Rightarrow\; A^{-1} = VS^{-1}U' = (v_1,..,v_n)\begin{pmatrix} s_1^{-1}u_1' \\ \vdots \\ s_n^{-1}u_n' \end{pmatrix}$
Thus, by recalling (A3.1.11), we see that the inverse, $A^{-1}$, can be determined from the SVD of A almost by inspection. While this of course assumes that this SVD has already been calculated, it nonetheless provides a powerful analytical tool in many contexts. For example, it now reveals the behavior of "almost nonsingular" matrices, which by
16
The particular matrix used here was A = [1.0689, 2.9443 ; 0.8095, -1.4384].
definition have at least one singular value, $s_i$, very close to zero. But since this in turn implies that $1/s_i$ must be very large, it can be seen from the last equality in (A3.2.58) that vectors in the $u_i$ direction are being stretched enormously. So this shows not only that $A^{-1}$ is becoming unstable, but also the directions in which this instability is worst.
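This instability is easily seen numerically. The following MATLAB sketch (with an arbitrary illustrative choice of singular values) builds a nearly singular matrix and computes its inverse by the formula in (A3.2.58):

% An "almost nonsingular" matrix with one tiny singular value
[Q1, ~] = qr(randn(2));  [Q2, ~] = qr(randn(2));   % random orthonormal U, V
A = Q1 * diag([2, 1e-8]) * Q2';
[U, S, V] = svd(A);
Ainv = V * diag(1 ./ diag(S)) * U';   % inverse "by inspection", as in (A3.2.58)
disp(norm(Ainv * U(:,2)))             % vectors in the u2 direction are
                                      % stretched by 1/s2 = 1e8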
Even more important is the fact that this SVD shows how to construct generalized inverses for singular matrices. In particular, when no inverse exists for A, this SVD representation suggests a very natural "best approximation" to such an inverse. The idea is seen most clearly in trying to solve the associated linear equation system, $Ax = b$. If $A^{-1}$ exists, then there is an exact solution, $x = A^{-1}b$. But if A is singular, one would like to find $x$ so that $Ax$ is as "close" to $b$ as possible, i.e., so that $\|Ax - b\|$ is minimized. But by (A3.2.27),17

(A3.2.59) $Ax - b = (USV')x - b = USV'x - UU'b = U(SV'x - U'b) = U(S\tilde{x} - \tilde{b})$

where $\tilde{x} = V'x$ and $\tilde{b} = U'b$. Since orthonormal transformations preserve lengths, it follows that

(A3.2.60) $\|Ax - b\| = \|S\tilde{x} - \tilde{b}\|$

So this approximation problem has now been reduced to a diagonal form for which the solution is seen to be trivial, namely, set

(A3.2.61) $\tilde{x}_i = \begin{cases} \tilde{b}_i / s_i\,, & s_i > 0 \\ 0\,, & s_i = 0 \end{cases}$
Finally, if we assume (for convenience) that the first $k$ components of S are the positive ones, and set

(A3.2.62) $S^+ = \mathrm{diag}\,(s_1^{-1},..,s_k^{-1}, 0,..,0)$

then it follows from (A3.2.61) together with the definitions of $\tilde{x}$ and $\tilde{b}$ that

(A3.2.63) $\tilde{x} = S^+\tilde{b} \;\Rightarrow\; V'x = S^+U'b \;\Rightarrow\; x = (VS^+U')\,b$
17 The following argument is based on the excellent discussion of SVD properties in Kalman (1996).
Thus the desired generalized inverse of A is given by

(A3.2.64) $A^+ = VS^+U'$

Such generalized inverses arise, for example, in least-squares regression contexts, where the normal equations,

(A3.2.65) $(X'X)\hat{\beta} = X'y$

may fail to have a unique solution when $X'X$ is singular, and where a natural solution is then given by

(A3.2.66) $\hat{\beta} = (X'X)^{+}X'y$
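This construction agrees with MATLAB's built-in pinv function, as the following sketch (reusing the singular matrix from (A3.2.25)) illustrates:

% Generalized inverse A+ = V*S+*U' for the singular matrix in (A3.2.25)
A = [2 1; 2 1];
[U, S, V] = svd(A);
s = diag(S);
splus = zeros(size(s));
splus(s > 1e-12) = 1 ./ s(s > 1e-12);   % invert only the positive singular values
Aplus = V * diag(splus) * U';
disp(norm(Aplus - pinv(A)))             % ~1e-16: agrees with pinv(A)
x = Aplus * [1; 0];                     % minimizes ||A*x - b|| for b = (1,0)'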
We turn finally to the determinants of square matrices, and to their geometric interpretation in terms of volumes. For the simplest case of 2-square matrices,

(A3.2.67) $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix} \;\Rightarrow\; |A| = a_{11}a_{22} - a_{12}a_{21}\,,$

and the inverse of A (when it exists) is given by

(A3.2.68) $A^{-1} = \frac{1}{|A|}\begin{pmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{pmatrix}$

[In fact, the determinant itself originated as part of the first general solution of linear equations (Cramer's Rule, 1750).] Note in particular from (A3.2.68) that such solutions exist iff $|A| \neq 0$. The geometric meaning of this relationship will become clear below.
But for the present, we simply note that the formula in (A3.2.67) offers little insight by itself. Indeed, the general formula for determinants (in terms of alternating-signed sums of products of matrix elements)18 is even more obtuse. But one important observation about this formula can be made in terms of the following instance of a Householder reflection, $A = H_v$, in $\mathbb{R}^2$ [recall expression (A3.1.44) above], where in this case $v = (1, -1)'$ [with $v'v = 2$], so that:

(A3.2.69) $A = I_2 - \frac{2}{v'v}vv' = I_2 - \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix} = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$
By expression (A3.2.67) this matrix has a negative determinant, $|A| = -1$. To interpret the meaning of this negative sign, note from Figure A3.23 below that this transformation simply reverses the basis vectors, $(e_1, e_2)$, so that $A(e_1, e_2) = (e_2, e_1)$:
[Figure A3.23. The Householder reflection in (A3.2.69): the basis vectors are exchanged, $Ae_1 = e_2$ and $Ae_2 = e_1$, by reflection about the line orthogonal to $v$.]
More generally, negative values of determinants are always associated with such
reversals of orientation. But this “sign” property of determinants is not of direct interest
for our purposes (even though the present Householder example will prove useful later).
Rather, we are primarily interested in the absolute value of determinants. As mentioned
above, these absolute values tell us exactly how volumes are transformed under linear
transformations. The standard example which is often shown in the literature is illustrated
in Figure A3.24 below:
[Figure A3.24. The unit square, with corners $(1,0)$ and $(0,1)$, is mapped by A into the parallelogram with sides $Ae_1 = (a_{11}, a_{21})$ and $Ae_2 = (a_{12}, a_{22})$.]
18 See for example Section 0.3 in Horn and Johnson (1985).
In terms of Figure A3.3, we here set $e_1 = (1,0)$, $e_2 = (0,1)$, $Ae_1 = (a_{11}, a_{21})$ and $Ae_2 = (a_{12}, a_{22})$ in order to emphasize the role of each matrix element. The key point is that the unit area of the unit square on the left is transformed by A into a parallelogram on the right with area given precisely by $|A|$, which in this case is seen to be positive (no reversal of orientation). This in turn implies (from linearity) that every area on the left is transformed by A into an area scaled by a factor of $|A|$. But even in this simple case, it is not obvious that the parallelogram area should be given by $a_{11}a_{22} - a_{12}a_{21}$. While the geometric proof in this case is not difficult, its generalization to linear transformations, A, in $\mathbb{R}^n$ is tedious, to say the least. So our first objective is to show that this relation between volume and absolute determinant values can be made completely transparent in terms of the SVD of A.
To do so, we must first deal with the (unfortunate) notational fact that the symbol, $|\cdot|$, is used both for determinants and absolute values. This is often resolved by using "$\det(A)$" for the determinant of A, so that its absolute value can be directly represented by $|\det(A)|$. But since the relevant determinants for our purposes will almost always be nonnegative, we choose to stay with the simpler notation, $|A|$. Where it is essential to specify absolute values of determinants (such as in the present section) we shall simply write, $|A|_+$.

Aside from this notational convention, the only algebraic properties of determinants that we require are the product rule,

(A3.2.70) $|AB| = |A|\,|B|$

the transpose rule,

(A3.2.71) $|A'| = |A|$

and the fact that determinants of diagonal matrices are simply the products of their diagonal elements, so that in particular,

(A3.2.72) $|I_n| = 1$
Note in particular that for absolute values, the product rule implies

(A3.2.73) $|AB|_+ = |A|_+\,|B|_+$

Together with the SVD Theorem, these properties of determinants imply that the absolute determinant of any matrix is the product of its singular values, i.e., that for all transformations, A, in (A3.2.28),

(A3.2.74) $|A|_+ = \prod_{i=1}^n s_i$
To see this, note first from (A3.2.72) that $|I_n| = 1$, so that by the defining property of orthonormal transformations, U,

(A3.2.75) $1 = |I_n| = |U'U| = |U'|\,|U| = |U|^2 \;\Rightarrow\; |U| = \pm 1$

(A3.2.76) $U$ orthonormal $\;\Rightarrow\; |U|_+ = 1$

Hence for any SVD of A, as in (A3.2.27),

(A3.2.77) $|A|_+ = |USV'|_+ = |U|_+\,|S|_+\,|V'|_+ = |S|_+ = \prod_{i=1}^n s_i$
Using this result, it is a simple matter to show that for any linear transformation, A, on $\mathbb{R}^n$, volumes are transformed by a factor of $|A|_+$. To do so, observe that if the unit cube in $\mathbb{R}^n$ is denoted by

(A3.2.78) $C_n = \{x \in \mathbb{R}^n : 0 \leq x_i \leq 1\,,\; i = 1,..,n\}$

and if we denote the volume of any set, $T \subseteq \mathbb{R}^n$, by $\mathrm{vol}(T)$,19 then clearly $\mathrm{vol}(C_n) = 1$. So if the image of $C_n$ under transformation A is denoted by

(A3.2.79) $A(C_n) = \{Ax : x \in C_n\}$

then it suffices to show that $\mathrm{vol}[A(C_n)]$ is always given by $|A|_+$. But since each linear transformation scales all volumes by the same amount, if we now denote this common scale factor by $s(A) = \mathrm{vol}[A(C_n)]$,20 then for all $T \subseteq \mathbb{R}^n$,

(A3.2.80) $\mathrm{vol}[A(T)] = s(A)\,\mathrm{vol}(T)$
In these terms, our objective is to show that for any linear transformation, A, on $\mathbb{R}^n$,
19 The knowledgeable reader will note that technically we here refer to any measurable set, $T \subseteq \mathbb{R}^n$.
20
Be careful not to confuse scale factors, s ( A) , with singular values, s.
(A3.2.81) $s(A) = |A|_+$
Here we need only appeal to certain elementary properties of volume itself. The most fundamental property concerns scale transformations of individual coordinates. For example, if a transformation scales all coordinate axes by 2, then volumes increase by a factor of $2^n$. More generally, since positive diagonal matrices, $D = \mathrm{diag}(d_1,..,d_n)$, scale each coordinate, $x_i$, by a factor of $d_i$, i.e., since

(A3.2.82) $Dx = (d_1x_1,..,d_nx_n)'$

it follows that such transformations scale volumes by the product of these factors, so that

(A3.2.83) $s(D) = \prod_{i=1}^n d_i = |D|_+$
In fact, this is how volumes of n-dimensional "boxes" are computed. Note also that if coordinates are scaled by factors, $d_i$, one at a time, then since the composition of these transformations is precisely D, the cumulative effect of these scale changes is necessarily multiplicative. More generally, the cumulative scale effect of any successive transformations, say A followed by B, is always multiplicative. For example, if A doubles volumes and B triples volumes, then the composite transformation, BA, increases volumes by a factor of $(2)(3) = 6$. More generally, for all transformations, A and B,

(A3.2.84) $s(BA) = s(B)\,s(A)$
The only other property of volume that we require is one we have already seen, namely
that orthonormal transformations preserve volumes. So by definition,
(A3.2.85) $U$ orthonormal $\;\Rightarrow\; s(U) = 1$
Given these volume properties, it follows at once from the SVD Theorem together with (A3.2.84) that

(A3.2.86) $s(A) = s(USV') = s(U)\,s(S)\,s(V') = (1)\,s(S)\,(1) = \prod_{i=1}^n s_i = |A|_+$
This result has far reaching consequences for determinants, and shows why they play such a fundamental role in linear algebra. With respect to matrix inverses in particular, note that if $|A| = 0$ (so that $|A|_+ = 0$), then $s(A) = 0$ implies that all volumes are
collapsed to zero. So from a geometric viewpoint, A must collapse the space into a lower dimensional subspace, such as in the examples in Figures A3.20 and A3.21 above.
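The identity in (A3.2.86) is easily checked numerically; the following MATLAB sketch compares both sides for a random matrix:

% Absolute determinants equal products of singular values, as in (A3.2.86)
A = randn(4);
disp(abs(det(A)))      % |A|+
disp(prod(svd(A)))     % s1*s2*s3*s4 : the same value (up to rounding)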
The final objective of this section is to illustrate the consequences of these results for linear transformations of random vectors. In particular, our objective is to complete the derivation of the multi-normal distribution sketched in Section 3.2.1 of Part II in this NOTEBOOK, and to show how the multi-normal density in (3.2.11) is derived. The key element we focus on is the role of the determinant, $|\Sigma|$, of the covariance matrix, $\Sigma$. In fact, this determinant reflects the volume transformation associated with a particular linear transformation, as we now show. To do so, we start by considering the standard normal random vector, $X = (X_1,..,X_n)'$, of independent standard normal variates, $X_i \sim N(0,1)$, $i = 1,..,n$. Recall from the Linear Invariance Theorem of Section 3.2.2 of Part II that if for some nonsingular matrix, A, the random vector, Y, is defined by

(A3.2.87) $Y = AX + \mu\,,$

then since $X \sim N(0, I_n)$, this theorem asserts that $Y \sim N(\mu, \Sigma)$ with $\Sigma = AA'$. Moreover, since all covariance matrices, $\Sigma$, are of this form for some A [as we have already seen from the Cholesky Theorem in Appendix A2, below expression (A2.7.45)], it follows that all multi-normal random vectors, Y, are derivable as linear transformations of the standard normal vector, X. In fact, this is precisely how the general multi-normal distribution is defined.
Our goal is to establish this result by starting with the probability density of the standard normal random vector, X, and showing how this density is transformed under (A3.2.87). To do so, we first recall from the argument in (3.2.7) of Part II [with $(\mu_i, \sigma_i) = (0,1)$, $i = 1,..,n$] that the probability density, $f(x) = f(x_1,..,x_n)$, of the standard normal random vector, X, is necessarily of the form:

(A3.2.88) $f(x) = f(x_1)\cdots f(x_n) = \left(\tfrac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}x_1^2}\right)\cdots\left(\tfrac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}x_n^2}\right) = \left(\tfrac{1}{\sqrt{2\pi}}\right)^n e^{-\frac{1}{2}(x_1^2 + \cdots + x_n^2)} = (2\pi)^{-n/2}\,e^{-\frac{1}{2}x'x}$
In these terms, our task is to show that the corresponding density, $g(y)$, of Y in (A3.2.87) is given by

(A3.2.89) $g(y) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\,e^{-\frac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)}$
But before doing so, it is important to emphasize that even though expressions like (A3.2.87) are usually referred to as "linear transformations", they technically involve linear transformations, A, plus translation terms, $\mu$ (and are properly classified as affine transformations). Only in the case, $\mu = 0$, is this a linear transformation [as defined in (A3.1.1) above]. So to simplify the present development further, we start with the case, $\mu = 0$, where (A3.2.87) reduces to a proper linear transformation,

(A3.2.90) $Y = AX$

It will be seen later that adding a nonzero translation term, $\mu$, is then a simple matter.
To derive the density, g, of Y at any given point, $y_0$, we start by considering small cubes about $y_0$. In particular, for each $\epsilon > 0$, let $C_{0i}(\epsilon)$ denote the interval of length $\epsilon$ centered at the $i$th coordinate, $y_{0i}$, of $y_0$, and let

(A3.2.91) $C_0(\epsilon) = C_{01}(\epsilon) \times C_{02}(\epsilon) \times \cdots \times C_{0n}(\epsilon) \subset \mathbb{R}^n\,,$

so that the probability that Y lies in this cube is given by

(A3.2.92) $\Pr[\,Y \in C_0(\epsilon)\,] = \int_{C_0(\epsilon)} g(y)\,dy = \int_{C_{0n}(\epsilon)}\cdots\int_{C_{01}(\epsilon)} g(y_1,..,y_n)\,dy_1\cdots dy_n$
This is illustrated for the case of $n = 2$ on the right-hand side of Figure A3.25 below, where the 2-cube, $C_0(\epsilon)$, is seen to be a square (shown in blue) about the point, $y_0$:
[Figure A3.25. Right: the $(y_1, y_2)$-plane showing the point, $y_0$, the square, $C_0(\epsilon) = C_{01}(\epsilon) \times C_{02}(\epsilon)$, and the density height, $g(y_0)$. Left: the $(x_1, x_2)$-plane showing the point, $A^{-1}y_0$, its parallelogram neighborhood, $A^{-1}[C_0(\epsilon)]$, and the density height, $f(A^{-1}y_0)$.]
So the probability integral in (A3.2.92) is simply the volume under that portion of density, g, above this square (also shown in blue). The key point here is that if the value of $\epsilon$ is sufficiently small, then this volume is well approximated by the box with base, $C_0(\epsilon)$, and height, $g(y_0)$. More precisely, if the area (more generally, volume) of this base is denoted by $\mathrm{vol}[C_0(\epsilon)]$, so that the volume of the box (height $\times$ base) is given by, $g(y_0)\,\mathrm{vol}[C_0(\epsilon)]$, then we obtain the approximation:

(A3.2.93) $\Pr[\,Y \in C_0(\epsilon)\,] = g(y_0)\,\mathrm{vol}[C_0(\epsilon)] + e_Y(\epsilon)$

where the error term, $e_Y(\epsilon)$, becomes negligible relative to $\mathrm{vol}[C_0(\epsilon)]$ as $\epsilon \to 0$, i.e.,
(A3.2.94) $\lim_{\epsilon \to 0} \dfrac{e_Y(\epsilon)}{\mathrm{vol}[C_0(\epsilon)]} = 0$
To gain some feeling for such error representations, observe that if we divide both sides of (A3.2.93) by $\mathrm{vol}[C_0(\epsilon)]$, and let $\epsilon \to 0$, then we obtain

(A3.2.95) $\lim_{\epsilon \to 0} \dfrac{\Pr[\,Y \in C_0(\epsilon)\,]}{\mathrm{vol}[C_0(\epsilon)]} = g(y_0)\,,$
In order to associate these quantities with the random vector, X, observe from (A3.2.90) that since, $y = Ax \Leftrightarrow x = A^{-1}y$, it follows that Y-outcome, $y$, occurs iff X-outcome, $A^{-1}y$, occurs. So by using the same image notation as in (A3.2.10) to write

(A3.2.96) $A^{-1}[C_0(\epsilon)] = \{A^{-1}y : y \in C_0(\epsilon)\}$

we see that

(A3.2.97) $\Pr[\,Y \in C_0(\epsilon)\,] = \Pr\{\,X \in A^{-1}[C_0(\epsilon)]\,\}$
This provides the key link between the X and Y distributions. The X-event in the last equality is shown (for the $n = 2$ case) on the left-hand side of Figure A3.25 as a parallelogram (in red) representing the image of $C_0(\epsilon)$ under $A^{-1}$. (Note also that the bold red arrow shows the direction of this inverse relationship.) But since X is continuously distributed with density, f, this probability again has a "box" approximation with base, $A^{-1}[C_0(\epsilon)]$, and height, $f(A^{-1}y_0)$, i.e.,

(A3.2.98) $\Pr\{\,X \in A^{-1}[C_0(\epsilon)]\,\} = f(A^{-1}y_0)\,\mathrm{vol}\,(A^{-1}[C_0(\epsilon)]) + e_X(\epsilon)$

where again,
(A3.2.99) $\lim_{\epsilon \to 0} \dfrac{e_X(\epsilon)}{\mathrm{vol}\,(A^{-1}[C_0(\epsilon)])} = 0$
But now we are in a position to simplify (A3.2.98) by using (A3.2.86), together with (A3.2.80), to obtain

(A3.2.100) $\mathrm{vol}\,(A^{-1}[C_0(\epsilon)]) = s(A^{-1})\,\mathrm{vol}[C_0(\epsilon)] = |A^{-1}|_+\,\mathrm{vol}[C_0(\epsilon)]$

This can be further simplified by recalling from the same argument as (A3.2.75) that

(A3.2.101) $1 = |I_n| = |A^{-1}A| = |A^{-1}|\,|A| \;\Rightarrow\; |A^{-1}|_+ = |A|_+^{-1}$

and hence that

(A3.2.102) $\mathrm{vol}\,(A^{-1}[C_0(\epsilon)]) = |A|_+^{-1}\,\mathrm{vol}[C_0(\epsilon)]$

so that (A3.2.98) becomes

(A3.2.103) $\Pr\{\,X \in A^{-1}[C_0(\epsilon)]\,\} = f(A^{-1}y_0)\,|A|_+^{-1}\,\mathrm{vol}[C_0(\epsilon)] + e_X(\epsilon)$

Combining this with (A3.2.93) and (A3.2.97), and dividing through by $\mathrm{vol}[C_0(\epsilon)]$, we see that

(A3.2.104) $g(y_0) + \dfrac{e_Y(\epsilon)}{\mathrm{vol}[C_0(\epsilon)]} = f(A^{-1}y_0)\,|A|_+^{-1} + \dfrac{e_X(\epsilon)}{\mathrm{vol}[C_0(\epsilon)]} = f(A^{-1}y_0)\,|A|_+^{-1} + |A|_+^{-1}\,\dfrac{e_X(\epsilon)}{\mathrm{vol}\,(A^{-1}[C_0(\epsilon)])}$
Finally, by letting $\epsilon \to 0$ and using (A3.2.95) and (A3.2.99), we obtain the density relation:

(A3.2.105) $g(y_0) + \lim_{\epsilon \to 0}\dfrac{e_Y(\epsilon)}{\mathrm{vol}[C_0(\epsilon)]} = |A|_+^{-1}\,f(A^{-1}y_0) + |A|_+^{-1}\lim_{\epsilon \to 0}\dfrac{e_X(\epsilon)}{\mathrm{vol}\,(A^{-1}[C_0(\epsilon)])} \;\Rightarrow\; g(y_0) = f(A^{-1}y_0)\,|A|_+^{-1}$
But since this is an identity for all choices of $y_0$, we can now replace $y_0$ by $y$ and write

(A3.2.106) $g(y) = f(A^{-1}y)\,|A|_+^{-1}$

This is the key result for constructing the density, $g(y)$, from $f(x)$ under linear transformations, $Y = AX$, as in (A3.2.90). Essentially it asserts that the desired density, $g(y)$, at $y$ is obtained by evaluating $f$ at $A^{-1}y$ and rescaling to adjust for the volume changes created by A.
But before applying this result to the multi-normal case, we first extend (A3.2.106) to include translations as in (A3.2.87). To do so, observe that if we now let $Z = Y - \mu$, so that

(A3.2.107) $Y = AX + \mu \;\Rightarrow\; Y - \mu = AX \;\Rightarrow\; Z = AX$

then it follows at once from (A3.2.106) that the density, $h(z)$, of Z is given by

(A3.2.108) $h(z) = f(A^{-1}z)\,|A|_+^{-1}$

Moreover, since by definition,

(A3.2.109) $Y = T(Z) = Z + \mu$

(A3.2.110) $Z = T^{-1}(Y) = Y - \mu$
we can now use $h(z)$ to obtain $g(y)$ from these relations. Here the key point to note is that translations on $\mathbb{R}^n$ simply shift locations, and involve no rescaling of volumes.21 So in fact, the relation between h and g in this case reduces simply to:

(A3.2.111) $g(y) = h(T^{-1}y) = h(y - \mu)$
21
Here it is worth noting that the terms isometric transformations and rigid motions mentioned at the
beginning of Section A3.1.2 formally include translations as well as rotations and reflections, since all such
transformations preserve both distances and angles.
Finally, by combining (A3.2.108) and (A3.2.111), we obtain the desired general relation:

(A3.2.112) $g(y) = f[\,A^{-1}(y - \mu)\,]\,|A|_+^{-1}$

Before applying this to the multi-normal case, we can make one additional observation about covariances that is independent of normality. Recall from expression (3.2.21) in Part II of the NOTEBOOK that

(A3.2.113) $\mathrm{cov}(AX + \mu) = A\,\mathrm{cov}(X)\,A'$

So if we let $\Sigma = \mathrm{cov}(Y)$ and assume that $\mathrm{cov}(X) = I_n$, then as in the standard normal case of (A3.2.87) we obtain the formal identity:

(A3.2.114) $\Sigma = A\,I_n\,A' = AA'$

But by the determinantal identities in (A3.2.70) and (A3.2.71), this in turn implies that

(A3.2.115) $|\Sigma| = |A|\,|A'| = |A|^2 > 0$
So (as we have already seen in Sylvester's Condition leading to the Cholesky Theorem in Appendix A2) the determinant of every (nonsingular) covariance matrix is positive. This means that "plus" subscripts can be dropped for determinants of covariance matrices. In particular, by letting $|\Sigma|^{1/2}$ denote the positive square root of $|\Sigma|$, it follows that

(A3.2.116) $|A|_+ = |\Sigma|^{1/2}$
Finally, to apply these results to the multi-normal case, we need only observe that if X is standard normal, $X \sim N(0, I_n)$, with density in (A3.2.88), then $g(y)$ in (A3.2.112) takes the form:

(A3.2.117) $g(y) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\,e^{-\frac{1}{2}[A^{-1}(y-\mu)]'[A^{-1}(y-\mu)]}$
But by (A3.2.114) together with the matrix identities in (A3.1.18) and (A3.1.20) we see that

(A3.2.118) $[A^{-1}(y-\mu)]'[A^{-1}(y-\mu)] = (y-\mu)'(AA')^{-1}(y-\mu) = (y-\mu)'\Sigma^{-1}(y-\mu)$

so that

(A3.2.120) $g(y) = (2\pi)^{-n/2}\,|\Sigma|^{-1/2}\,e^{-\frac{1}{2}(y-\mu)'\Sigma^{-1}(y-\mu)}$
Thus the resulting probability density is precisely that in (A3.2.89), and the multi-normal case is established. In particular, the family of multi-normal random vectors, $Y \sim N(\mu, \Sigma)$, is seen to be generated by transformations, $Y = AX + \mu$, of the standard normal random vector, X, satisfying $\Sigma = AA'$, with A nonsingular. As an immediate consequence of this, we have the following simple proof of the Linear Invariance Theorem of Section 3.2.2 of Part II, which we now restate for convenience as: if $X \sim N(\mu, \Sigma)$ and $Y = AX + b$ for some matrix, A, of full row rank, then

(A3.2.121) $Y \sim N(A\mu + b\,,\; A\Sigma A')$

To establish this result, recall from the Cholesky Theorem that $\Sigma = CC'$ for some nonsingular matrix, C. So if we let

(A3.2.122) $Z = C^{-1}(X - \mu)$

so that by construction,

(A3.2.123) $Z \sim N(0, I_n)$

(A3.2.124) $X = CZ + \mu$

then the identity,

(A3.2.125) $Y = AX + b = A(CZ + \mu) + b = (AC)Z + (A\mu + b)$

shows that Y is an affine transformation of Z, so that the same argument shows that Y is multi-normally distributed. Moreover, from expressions (3.2.18) and (3.2.21) in Part II we see that the mean and covariance of Y are given respectively by
(A3.2.126) $E(Y) = AC\,E(Z) + (A\mu + b) = A\mu + b\,,$ and

(A3.2.127) $\mathrm{cov}(Y) = AC\,(I_n)\,C'A' = A(CC')A' = A\Sigma A'$
Finally, it is important to clarify the above requirement that A be of full row rank. Note in particular that if A has fewer rows than columns, say $m < n$, then the random vector, $Y = AX + b$, must be of dimension m (where b must also be m-dimensional so that vector addition is well defined). So it is implicitly assumed that $N(A\mu + b, A\Sigma A')$ is a multi-normal distribution on $\mathbb{R}^m$ with density given by replacing $\mu$ and $\Sigma$ in (A3.2.120) with $A\mu + b$ and $A\Sigma A'$, respectively. With this in mind, it should be clear from (A3.2.120) that such a density is only defined if $A\Sigma A'$ is a nonsingular covariance matrix (i.e., with a well defined inverse). As shown in Corollary 3 of Section A3.4.3 below, the condition that A be of full row rank ensures that the m-square covariance matrix, $A\Sigma A'$, will indeed be nonsingular.
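These moment relations are easily checked by simulation; the following MATLAB sketch (with arbitrary illustrative choices of A and mu) compares the sample moments of Y = A*X + mu with mu and A*A':

% Simulate Y = A*X + mu for standard normal X, and check cov(Y) = A*A'
A  = [1 0.5; 0 2];   mu = [3; -1];    % illustrative choices
n  = 1e5;
X  = randn(2, n);                     % columns are draws of X ~ N(0, I2)
Y  = A*X + mu;                        % columns are draws of Y ~ N(mu, A*A')
disp(mean(Y, 2)')                     % approx. (3, -1)
disp(cov(Y'))                         % approx. A*A' = [1.25 1; 1 4]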
As stated earlier, the single most important application of the Singular Value Decomposition Theorem for our purposes is to provide a simple proof of the Spectral Decomposition Theorem for symmetric matrices. Recall from (A3.2.2) that this theorem asserts that if matrix A is symmetric (i.e., $A = A'$) then there exists an orthonormal matrix, U, and diagonal matrix, $\Lambda$, such that

(A3.3.1) $A = U\Lambda U'$
The elements of $\Lambda = \mathrm{diag}(\lambda_1,..,\lambda_n)$ are called the eigenvalues of A, and the columns of $U = (u_1,..,u_n)$ are the associated eigenvectors. However, these concepts are much more general, and indeed, provide additional geometric intuition about linear transformations in general. In particular, it is useful to consider eigenvalues and eigenvectors for nonnegative spatial weight matrices, W, which may possibly be non-symmetric (as for example in the case of nearest-neighbor matrices). So it is convenient to start with a broader consideration of these concepts, and then focus in on symmetric matrices.
For any given n-square matrix, A, and nonzero vector, $x \in \mathbb{R}^n$, if A maps x into a scalar multiple of itself, i.e., if

(A3.3.2) $Ax = \lambda x$

then $\lambda$ is designated as an eigenvalue of A, with associated eigenvector, x.22
Before analyzing these concepts, it is important to reiterate that our present view of (real-valued) matrices, $A \in \mathbb{R}^{n \times n}$, is as representations of linear transformations on $\mathbb{R}^n$. So our focus is naturally on the geometric properties of such transformations on the real vector space, $\mathbb{R}^n$. But such matrices can also be viewed as representing a class of linear transformations on the complex vector space, $\mathbb{C}^n$. This distinction is important for the present discussion because the general theory of eigenvalues and eigenvectors treats such matrices as linear transformations on $\mathbb{C}^n$. The reason for this can be seen by the following equivalent view of eigenvalues. If we rewrite (A3.3.2) as,

(A3.3.3) $Ax - \lambda x = 0 \;\Leftrightarrow\; (A - \lambda I_n)x = 0$

then it becomes clear that the eigenvalues of A are precisely those values, $\lambda$, for which the matrix, $A - \lambda I_n$, is singular. So, as was observed following expression (A3.2.86) above, this is equivalent to the condition that
22
The word “eigen” is German for “own” as in belonging to. So the eigenvalues of A are also referred to as
its “own” values or “characteristic” values.
(A3.3.4) $|A - \lambda I_n| = 0$

For the 2-square case in (A3.2.67), this condition takes the form

(A3.3.5) $0 = \begin{vmatrix} a_{11} - \lambda & a_{12} \\ a_{21} & a_{22} - \lambda \end{vmatrix} = (a_{11} - \lambda)(a_{22} - \lambda) - a_{12}a_{21}$
So the eigenvalues of $2 \times 2$ matrices are thus seen to be the roots of a quadratic equation. More generally, they are the roots of an nth-degree polynomial called the characteristic polynomial for A. In this setting, the key result here is of course the Fundamental Theorem of Algebra, which tells us that there are always exactly n roots to this equation (counting repetitions) if we allow complex-valued roots. So if A is regarded as a linear transformation on $\mathbb{C}^n$, where both $\lambda$ and $x$ in (A3.3.2) can be complex-valued, then one obtains a very elegant and powerful theory of eigenvalues and eigenvectors. But from a geometric viewpoint, there is a fundamental difference between the simple scaling of real vectors in $\mathbb{R}^n$ and the corresponding interpretation of expression (A3.3.2) in $\mathbb{C}^n$. In particular, multiplication of complex numbers involves rotation as well as scaling. We shall return to these issues in Section ?? below, where the geometric meaning of such rotations will be interpreted in $\mathbb{R}^n$. But for the present, our attention is restricted to the case of real eigenvalues. Indeed, one major objective of these notes is to show that the eigenvalues of symmetric matrices are always real – without any appeal to complex numbers whatsoever. Hence, unless otherwise stated, we implicitly assume that the relevant eigenvalues and eigenvectors for matrices, A, in this section are real valued, i.e., are meaningful for A as a linear transformation on $\mathbb{R}^n$.
In this setting, we begin by noting from (A3.3.2) that each eigenvector, x, for $\lambda$ is inherently nonunique. In particular, every nonzero scalar multiple, $\alpha x$, of x is also an eigenvector for $\lambda$, since

(A3.3.6) $A(\alpha x) = \alpha Ax = \alpha\lambda x = \lambda(\alpha x)$

For this reason, eigenvectors are conventionally normalized to have unit length (which still leaves their signs indeterminate).
With this normalization, the next question concerns the relation between eigenvectors for distinct eigenvalues. Our objective is to show that such eigenvectors must always be linearly independent. Here some geometric intuition can be gained by considering several examples. We start with the simplest and most transparent example of eigenvalues and associated eigenvectors, namely those for diagonal matrices, $A = \mathrm{diag}(a_{11},..,a_{nn})$. Here it is obvious that
(A3.3.7) $AI_n = I_nA = A \;\Rightarrow\; A(e_1,..,e_n) = (e_1,..,e_n)\begin{pmatrix} a_{11} & & \\ & \ddots & \\ & & a_{nn} \end{pmatrix}$
So if we now denote the set of eigenvalues for any matrix, A, by Eig($A$), then for diagonal matrices in (A3.3.7) it is clear that Eig($A$) $= \{a_{ii} : i = 1,..,n\}$, with associated eigenvectors, $e_i$, $i = 1,..,n$. This example shows that n-square matrices can indeed have n distinct eigenvalues. Notice also that all eigenvectors in this case are in fact orthogonal, and hence are necessarily linearly independent even if their eigenvalues are not distinct. We shall see below that this property is shared by all symmetric matrices (of which diagonal matrices are the simplest example).
Of course, the orthogonal basis in (A3.3.7) is a very special case. A more typical example with a full set of eigenvalues is given by the following simple matrix

(A3.3.8) $A = \begin{pmatrix} 3 & 1 \\ 0 & 2 \end{pmatrix}$

for which it can easily be verified that the eigenvalues of A are Eig($A$) $= \{\lambda_1, \lambda_2\} = \{3, 2\}$, with associated eigenvectors given respectively by $x_1 = e_1 = (1, 0)$ and $x_2 = \left(-1/\sqrt{2},\; 1/\sqrt{2}\right)$, as shown in Figure A3.26 below.
[Figure A3.26. The eigenvectors, $x_1$ and $x_2$, of the matrix in (A3.3.8), together with their scaled images, $Ax_1 = 3x_1$ and $Ax_2 = 2x_2$.]
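This example can be verified with MATLAB's eig function (a minimal sketch; note that eig normalizes eigenvectors only up to sign):

% Eigenvalues and eigenvectors of the matrix in (A3.3.8)
A = [3 1; 0 2];
[X, L] = eig(A);
disp(diag(L)')    % eigenvalues: 3  2
disp(X)           % columns: (1,0)' and (-1,1)'/sqrt(2), up to sign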
In both these examples, the eigenvectors associated with distinct eigenvalues are indeed linearly independent. But the question remains as to whether this is always true. To see that it is, we now consider a general matrix, A, and suppose that $\lambda_1$ and $\lambda_2$ are two distinct eigenvalues of A, say with $0 < \lambda_1 < \lambda_2$, and with associated unit eigenvectors, $x_1$ and $x_2$. If some vector, $x_3 = a\,x_1 + b\,x_2$ (with $a, b \neq 0$), in span($x_1, x_2$) were also an eigenvector of A, say for eigenvalue $\lambda_3$, then its image,

(A3.3.9) $Ax_3 = A(ax_1 + bx_2) = \lambda_1a\,x_1 + \lambda_2b\,x_2 = \lambda_3(ax_1 + bx_2)$

would have to be proportional to $x_3$ itself, as illustrated in Figure A3.27 below.
[Figure A3.27. The vectors, $x_1$, $x_2$, and $x_3 = ax_1 + bx_2$, together with their images, $Ax_1$, $Ax_2$, and $Ax_3$, showing how $Ax_3$ is pulled toward the maximal eigenvector, $x_2$.]
So the coefficients, $(\lambda_3a, \lambda_3b)$, of this new linear combination of $x_1$ and $x_2$ would necessarily be proportional to the original coefficients, $(a, b)$, as shown by all points on the blue line in the figure. But by hypothesis,

(A3.3.10) $\dfrac{\lambda_1a}{a} = \lambda_1 < \lambda_2 = \dfrac{\lambda_2b}{b}$

which together with $0 < \lambda_1 < \lambda_2$, shows that in fact more weight is now placed on the maximal eigenvector, $x_2$, and thus that proportionality cannot hold. More generally, the same argument shows that the image of any vector, $x_3 \in \mathrm{span}(x_1, x_2)$ [not collinear with either $x_1$ or $x_2$] is necessarily "pulled toward" this maximal eigenvector (shown by the arrow in the figure), and cannot itself be an eigenvector. So we may conclude that no
eigenvector with eigenvalue distinct from $\lambda_1$ and $\lambda_2$ can be collinear with $(x_1, x_2)$, i.e., can lie in span($x_1, x_2$).
While this illustration involves only triples of distinct eigenvalues, the argument is in fact quite general, and can be used to show that eigenvectors for distinct eigenvalues must always be linearly independent.23 But since our main interest is in symmetric matrices, where the argument will be seen to be even more transparent, the above example suffices for our purposes.
A final property of eigenvalues relates to their possible repetitions, and can again be illustrated most easily by diagonal matrices, $A = \mathrm{diag}(a_{11},..,a_{nn})$. Notice in particular that this is the one case where the characteristic equation in (A3.3.4) is completely transparent, since

(A3.3.11) $0 = |A - \lambda I_n| = \begin{vmatrix} a_{11} - \lambda & & \\ & \ddots & \\ & & a_{nn} - \lambda \end{vmatrix} = (a_{11} - \lambda)\cdots(a_{nn} - \lambda)$
This implies at once that the diagonal elements of A are indeed the roots of its characteristic equation. If some of these diagonal elements are the same, then the number of such repeated roots is designated as the algebraic multiplicity of the corresponding eigenvalue. For example, the matrix $A = \mathrm{diag}(1, 1, 3)$ has only two distinct eigenvalues, Eig($A$) $= \{1, 3\}$, but since $(1 - \lambda)$ appears twice in (A3.3.11), this eigenvalue is said to have an algebraic multiplicity of two. Notice also from (A3.3.7) that since there are two linearly independent eigenvectors for this eigenvalue, namely $e_1$ and $e_2$, its geometric multiplicity (i.e., the maximum number of its linearly independent eigenvectors) is also two. More generally, it follows at once from (A3.3.7) that algebraic and geometric multiplicities of eigenvalues are always identical for diagonal matrices.
But for general matrices, even when eigenvalues do exist, these two multiplicities need not be the same. For example, while the algebraic and geometric multiplicities of $\lambda = 2$ in the diagonal matrix, $A = \mathrm{diag}(2, 2)$, are both equal to two, consider the following (modest) variation of this matrix:

(A3.3.12) $A = \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}$

This matrix is still nonsingular, and moreover, has the same characteristic equation, since $0 = |A - \lambda I_2| = (2 - \lambda)(2 - \lambda)$. So the algebraic multiplicity of $\lambda = 2$ is two. But observe that if $x = (x_1, x_2)$ is any associated eigenvector, then
23
A simpler and more elegant proof of this fact is given in Lemma 1.3.8 of Horn and Johnson (1985). The
advantage of the present argument is that it provides some geometric intuition as to why this is true.
(A3.3.13) $Ax = \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 2x_1 + x_2 \\ 2x_2 \end{pmatrix} = \begin{pmatrix} 2x_1 \\ 2x_2 \end{pmatrix} \;\Rightarrow\; x_2 = 0$

Moreover, there is only one such eigenvector (up to a choice of sign), namely, $x = (1, 0)$. So the geometric multiplicity of this eigenvalue is one. Such matrices are usually designated as defective matrices in the literature. The reason for this "defective" property can be seen by plotting the transformation, as in Figure A3.28 below:
[Figure A3.28. The defective matrix in (A3.3.12) acting on the plane: the points, $v_1,..,v_4$, along the line $y = 1$ (blue), together with their images, $Av_1,..,Av_4$ (red), and the unique eigenvector, $v_0$, with image, $Av_0$, on the x-axis.]
Here we have used the vector notation, $v = (x, y)$, for points in the plane, and have displayed the unique eigenvector by $v_0 = (1, 0)$, with associated image, $Av_0 = 2v_0 = (2, 0)$. To show where all other points are sent, we have fixed the y-coordinate value at $y = 1$, and have plotted the four points, $v_1 = (-2, 1)$, $v_2 = (-1, 1)$, $v_3 = (0, 1)$, and $v_4 = (1, 1)$, as shown in blue. Multiplying each of these four vectors by matrix A in (A3.3.12), we obtain the corresponding image vectors, $Av_1 = (-3, 2)$, $Av_2 = (-1, 2)$, $Av_3 = (1, 2)$, and $Av_4 = (3, 2)$, shown in red. The key point to notice is that all these image vectors are to the right of the original vectors, indicating that (along with a certain amount of stretching) each vector has been rotated clockwise toward the eigenvector, $v_0$. Similarly, by extending all vector arrows in the opposite direction through the origin, it is clear that the vectors, $-v_1, -v_2, -v_3, -v_4$, are also rotated clockwise toward the negative eigenvector, $-v_0$. This shows that all nonzero vectors other than these unique eigenvectors are rotated clockwise to some degree, and thus cannot be eigenvectors. So essentially, such matrices involve some form of non-rigid
rotations that can reduce the number of linearly independent eigenvectors associated with
repeated eigenvalues.
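This defectiveness also shows up numerically. In the following MATLAB sketch, eig returns the repeated eigenvalue, but its two eigenvector columns are (numerically) parallel:

% The defective matrix in (A3.3.12): algebraic multiplicity 2, geometric multiplicity 1
A = [2 1; 0 2];
[X, L] = eig(A);
disp(diag(L)')            % repeated eigenvalue: 2  2
disp(rank(X))             % 1: the eigenvector columns are parallel
disp(rank(A - 2*eye(2)))  % 1, so null(A - 2*I) is one-dimensional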
Given these general properties of real eigenvalues and eigenvectors, our objective is to apply these concepts to symmetric matrices in particular. But before doing so, it is important to reiterate that not all matrices have a full complement of real eigenvalues. The following simple orthonormal matrix will turn out to be a particularly important case in point:

(A3.3.14) $U = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$

Geometrically, this matrix rotates the plane counterclockwise through an angle of $90^\circ$, as shown in Figure A3.29 below. Clearly no vector can possibly be mapped by this transformation into a scalar times itself.
[Figure A3.29. The $90^\circ$ rotation in (A3.3.14): the basis vectors, $e_1$ and $e_2$, are mapped to $Ae_1 = e_2$ and $Ae_2 = -e_1$.]
This is confirmed by its characteristic equation,

(A3.3.15) $0 = |U - \lambda I_2| = \lambda^2 + 1$

which is seen to have only the "imaginary" solutions, $\lambda = \pm\sqrt{-1}$. We shall return to this example in Section ?? below.
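MATLAB confirms this behavior: eig treats U as a transformation on the complex space and returns the complex conjugate pair of roots.

% The 90-degree rotation in (A3.3.14) has no real eigenvalues
U = [0 -1; 1 0];
disp(eig(U))    % 0+1i and 0-1i: the "imaginary" roots of lambda^2 + 1 = 0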
We begin by recalling from the very beginning of Section A3.2 that there seems to be an
obvious relation between the Spectral Decomposition (SPD) Theorem for symmetric
matrices and the Singular Value Decomposition (SVD) Theorem for general matrices.
Since the SVD Theorem shows that for every matrix, A, there exist orthonormal matrices,
U, V, and a diagonal matrix of singular values, S , such that
(A3.4.1) $A = USV'$

it follows for symmetric matrices ($A = A'$) that

(A3.4.2) $USV' = A = A' = VSU'$
So at first glance, this identity would appear to suggest that $U = V$, and thus that (A3.3.1) must hold with $\Lambda = S$. To see that this intuition is wrong, recall that $|U| = \pm 1$, which together with the nonnegativity of the singular values, S, must imply that

(A3.4.3) $|A| = |U|\,|S|\,|U'| = |U|^2\,|S| = |S| \geq 0$
and thus that the determinant of every symmetric matrix is nonnegative! But we have already seen from (A3.2.69) that the symmetric (orthonormal) matrix

(A3.4.4) $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$
has a negative determinant, $|A| = -1$. More generally, the fact that singular values are by construction nonnegative shows that the relation between singular values and eigenvalues for symmetric matrices is not immediately obvious.
This is made even more clear by a closer examination of this particular counterexample. Here one can verify (by direct multiplication) that (A3.4.1) holds for this matrix A with

(A3.4.5) $U = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \quad S = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad V = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$
Moreover, since U and V are easily seen to be orthonormal, this is indeed a singular value decomposition of A with $U \neq V$. Here it can also be verified by direct computation that

(A3.4.6) $USU' = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \neq A \neq \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = VSV'$
so that neither U nor V yields a spectral decomposition of A with $\Lambda = S$. But it turns out that A does indeed have a unique spectral decomposition:

(A3.4.7) $A = W\Lambda W'$
with

(A3.4.8) $W = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & -1 \\ 1 & 1 \end{pmatrix}\,, \quad \Lambda = \begin{pmatrix} 1 & 0 \\ 0 & -1 \end{pmatrix}$
So at first glance, there would seem to be little relation between the decompositions of A in (A3.4.5) and (A3.4.8). But closer inspection shows that the absolute values of the diagonal elements of $\Lambda$ are precisely the singular values in S. As we shall see below, this relationship is fundamental.
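This relation is easily seen numerically (an illustrative sketch; eig and svd may order their outputs differently than in the text):

% SVD versus spectral decomposition for A in (A3.4.4)
A = [0 1; 1 0];
disp(svd(A)')         % singular values: 1  1
disp(eig(A)')         % eigenvalues: -1  1
disp(abs(eig(A))')    % = the singular values of A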
To gain further insight, it is convenient for the moment to suppose that the SPD Theorem is true, and to examine its geometric consequences. To do so, note first that if (A3.3.1) holds for a symmetric matrix, A, then since $U^{-1} = U'$, it follows at once that

(A3.4.9) $A = U\Lambda U' \;\Rightarrow\; AU = U\Lambda(U'U) = U\Lambda \;\Rightarrow\; A(u_1,..,u_n) = (u_1,..,u_n)\begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix}$
Thus, as an extension of the diagonal-matrix case in (A3.3.7), we see that the diagonal elements, $(\lambda_1,..,\lambda_n)$, of $\Lambda$ must indeed be the eigenvalues of A with associated orthonormal eigenvectors, $(u_1,..,u_n)$, as asserted at the beginning of Section A3.3. Note also that by definition this decomposition implies that all eigenvalues must be real. Moreover, these eigenvalues and eigenvectors together imply that such matrices (like diagonal matrices) are in fact representations of scale transformations with respect to some appropriate coordinate system. This can be illustrated in two dimensions by the symmetric matrix,
(A3.4.10) $A = \begin{pmatrix} 3 & 1 \\ 1 & 3 \end{pmatrix}$
Here A does indeed have a spectral decomposition as in (A3.3.1) with $\Lambda = \mathrm{diag}(2, 4)$ and orthonormal matrix,

(A3.4.11) $U = \begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix} = (u_1, u_2)$
[whose columns are, up to sign and ordering, precisely those of W in (A3.4.8) above]. So the eigenvectors for this matrix are the two diagonal vectors, $u_1$ and $u_2$, with corresponding eigenvalues, $(\lambda_1 = 2, \lambda_2 = 4)$, as shown in Figure A3.29 below:
[Figure A3.29. The unit circle, $S_2$, and its elliptical image, $A(S_2)$, for the matrix in (A3.4.10), with diagonal eigenvectors, $u_1$ and $u_2$, and scaled images, $Au_1 = 2u_1$ and $Au_2 = 4u_2$.]
So if $(u_1, u_2)$ are regarded as the coordinate axes, then A is seen to be a pure scale transformation with respect to this coordinate system. More generally, if the spectral decomposition of A is regarded as a composition of the respective transformations, $U'$, $\Lambda$, and $U$, then we obtain a diagram very reminiscent of that in Figure A3.16, with V in the last step replaced by U. In particular, the eigenvectors (i.e., columns of U) correspond precisely to the principal axes of the ellipsoidal image, $A(S_n)$, of the unit sphere, $S_n$, as seen for $n = 2$ in the figure. So this shows geometrically that there must be an intimate connection between the singular value decomposition (SVD) and spectral decomposition (SPD) of symmetric matrices.
In fact, for the present matrix, A, in (A3.4.10), these two decompositions are identical. The special feature of this symmetric matrix that leads to this identity is that its eigenvalues are all positive. What this implies is that these eigenvalues play exactly the same role as singular values, i.e., they measure the lengths of these axes from the origin. More generally, this suggests that the lengths of such axes for symmetric matrices, A, should be precisely the absolute values of their eigenvalues. In other words, the eigenvalues of A should differ only in sign from the associated singular values of A.
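For the matrix in (A3.4.10) this identity can be checked directly (a sketch; note that MATLAB orders eigenvalues increasingly and singular values decreasingly):

% Eigenvalues and singular values of the positive definite matrix in (A3.4.10)
A = [3 1; 1 3];
disp(eig(A)')    % 2  4
disp(svd(A)')    % 4  2 : the same values, in decreasing order
[U, L] = eig(A);
disp(norm(A - U*L*U'))   % ~1e-15: the spectral decomposition A = U*L*U'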
All these conjectures will be shown to be true in the following sections. But for the
moment, we continue with our illustrations by reconsidering the counterexample in
expression (A3.4.4) above. As mentioned already, the eigenvectors here are precisely the
same as in (A3.4.11) above. So only the eigenvalues are different, as shown in Figure
A3.30 below.
[Figure A3.30. For the counterexample, A, in (A3.4.4), the unit circle is mapped onto itself, $A(S_2) = S_2$, with eigenvectors, $u_1$ and $u_2$, and images, $Au_1$ and $Au_2$.]
The key point to notice [as was evident in the SVD of this matrix in expression (A3.4.5) above] is that the unit circle is mapped onto itself, i.e., $A(S_2) = S_2$. So any set of orthonormal axes can be used for an SVD. However, this is not true for the SPD. In fact, there is exactly one eigenvector, $u_1$, associated with eigenvalue, $\lambda_1 = -1$, and exactly one eigenvector, $u_2$, associated with eigenvalue, $\lambda_2 = 1$.24 So in contrast to the SVD, this SPD is essentially unique. But it will be shown below that in spite of its nonuniqueness, the SVD for A still contains enough information to allow the SPD to be constructed explicitly. The essential reason for this is that relations between U and V implicit in the identity (A3.4.2) will yield additional analytical information.
Before proceeding to these analytical results, it should be noted that there is one additional complication that cannot be illustrated in two dimensions. Consider the following 4-dimensional version of the matrix in (A3.4.4) above:

(A3.4.12) $A = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}$

which can be seen (by direct multiplication) to have eigenvalues, $\Lambda = \mathrm{diag}(1, 1, -1, -1)$, with associated eigenvectors:
24 Again, remember that if $u_i$ is an eigenvector for $\lambda_i$, then so is $-u_i$. So we are implicitly ignoring this trivial form of nonuniqueness.
(A3.4.13) $U = (u_1, u_2, u_3, u_4) = \dfrac{1}{\sqrt{2}}\begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & -1 \\ 0 & 1 & -1 & 0 \end{pmatrix}$
Note in particular that the unit sphere is again mapped onto itself, i.e., $A(\mathbb{S}_4) = \mathbb{S}_4$, so that any set of four mutually orthonormal axes can again be used to define an SVD. But now the two distinct eigenvalues, 1 and -1, both have two-dimensional spaces of eigenvectors, namely $\mathrm{span}(u_1, u_2)$ and $\mathrm{span}(u_3, u_4)$, respectively. So even the SPD is nonunique in this example. Such cases require more effort to construct an admissible SPD from any given SVD. So this general case will be treated by itself.
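Since these notes rely on MATLAB for computations, the example above is easy to check directly. The following minimal sketch (ours, not part of the original text) verifies the assertions in (A3.4.12) and (A3.4.13):

    % Numerical check of the example in (A3.4.12)-(A3.4.13)
    A = fliplr(eye(4));            % the 4-by-4 antidiagonal matrix in (A3.4.12)
    U = (1/sqrt(2)) * [ 0  1  1  0 ;
                        1  0  0  1 ;
                        1  0  0 -1 ;
                        0  1 -1  0 ];    % the eigenvector matrix in (A3.4.13)
    disp(norm(U'*U - eye(4)))      % = 0 (up to rounding): U is orthonormal
    disp(U'*A*U)                   % = diag(1,1,-1,-1), the asserted eigenvalues
    disp(norm(A'*A - eye(4)))      % = 0: A maps the unit sphere onto itself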
With these examples in mind, we now proceed to establish the Spectral Decomposition
(SPD) Theorem in stages. The first task is to establish some general consequences of
singular value decompositions (SVD) for symmetric matrices. This will provide a general
foundation for the SPD results to follow.
Here we focus on the additional information contained in the identity (A3.4.2) for
symmetric matrices. These equalities can be rewritten in the following way:
(A3.4.14) $A = USV' \;\Rightarrow\; AV = USV'V = US$

(A3.4.15) $A = A' = VSU' \;\Rightarrow\; AU = VSU'U = VS$

Adding and subtracting these two identities then yields

(A3.4.16) $A(U + V) = US + VS = (U + V)S$

(A3.4.17) $A(U - V) = VS - US = (U - V)(-S)$

So if we now define the matrices,

(A3.4.18) $X = U + V$

(A3.4.19) $Y = U - V$

then (A3.4.16) and (A3.4.17) yield the associated sets of eigenvalue equations:

(A3.4.20) $AX = XS$
(A3.4.21) $AY = Y(-S)$

or, in terms of the columns of X and Y,

(A3.4.22) $Ax_i = s_i x_i\,,\quad i = 1,..,n$

(A3.4.23) $Ay_i = (-s_i)\, y_i\,,\quad i = 1,..,n$
This shows us that each nonzero column of X and Y, namely each $x_i = u_i + v_i \neq 0$ and $y_i = u_i - v_i \neq 0$, respectively, must yield a corresponding real eigenvalue, $s_i$ or $-s_i$, for the symmetric matrix, A. As we shall see below, many columns of X and/or Y must be zero. But the key point to notice is that for each $i = 1,..,n$, the column pair $(x_i, y_i)$ cannot both be zero. For if so, then

$x_i = 0 \;\Rightarrow\; v_i = -u_i \quad\text{and}\quad y_i = 0 \;\Rightarrow\; v_i = u_i$

which together would imply that $v_i = -v_i$, and hence that $v_i = 0$, contradicting the fact that $v_i$ is a unit vector.
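Before continuing, it is worth noting that the eigenvalue equations (A3.4.20) and (A3.4.21) are easily checked numerically. A minimal MATLAB sketch (ours, with an arbitrarily chosen symmetric matrix):

    % Check of (A3.4.20)-(A3.4.21) for a symmetric matrix A
    A = [1 2 0; 2 -1 3; 0 3 2];    % an arbitrary symmetric example
    [U,S,V] = svd(A);              % any SVD of A
    X = U + V;                     % as in (A3.4.18)
    Y = U - V;                     % as in (A3.4.19)
    disp(norm(A*X - X*S))          % = 0 (up to rounding): A*X = X*S
    disp(norm(A*Y + Y*S))          % = 0 (up to rounding): A*Y = Y*(-S)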
Thus the first major consequence of these observations is that without loss of generality we can focus our attention on real eigenvalues for symmetric matrices. This is of sufficient importance to be stated formally. If the set of distinct eigenvalues for any matrix, A, is denoted by $\mathrm{Eig}(A)$, and if we define symmetric matrices by the condition that $A' = A$, then this first consequence of SVD can be stated as follows:25

(A3.4.24) $A' = A \;\Rightarrow\; \mathrm{Eig}(A) \subseteq \mathbb{R}$
A second consequence (as suggested by the examples above) is that all eigenvalues in (A3.4.22) and (A3.4.23) are either the singular values of A or their negatives. So the absolute magnitudes of all eigenvalues can be determined from the SVD of A. To state this more formally, let the set of distinct singular values of any matrix, A, be denoted by $\mathrm{Sing}(A)$, and let the negatives of these values be denoted by $-\mathrm{Sing}(A)$. Then, in a manner paralleling (A3.4.24), this second consequence of SVD can be stated as follows:

(A3.4.25) $A' = A \;\Rightarrow\; \mathrm{Eig}(A) \subseteq \mathrm{Sing}(A) \cup [-\mathrm{Sing}(A)]$
25 The standard proof of this fact is to show that eigenvalues of symmetric matrices must always be equal to their complex conjugates, and hence must be real (see for example Theorem 4.1.3 in Horn and Johnson, 1985).
There is a third important consequence that relates to the eigenvectors associated with distinct eigenvalues of symmetric matrices. Recall that in Figure A3.27 above a geometric argument was sketched showing that eigenvectors for distinct (real) eigenvalues are always linearly independent. For symmetric matrices we have the stronger property that such eigenvectors must actually be orthogonal. This can be demonstrated as follows:26

(A3.4.26) $\lambda_i \neq \lambda_j \;\Rightarrow\; x_i' x_j = 0$

To see this, suppose that $x_i$ and $x_j$ are eigenvectors for the respective eigenvalues, $\lambda_i$ and $\lambda_j$, so that

(A3.4.27) $A x_i = \lambda_i x_i$ , and

(A3.4.28) $A x_j = \lambda_j x_j$

But premultiplying (A3.4.28) by $x_i'$ and employing the symmetry of A, we see that $\lambda_j\, x_i' x_j = x_i' A x_j = (A x_i)' x_j = \lambda_i\, x_i' x_j$, and hence that

$(x_j' x_i)(\lambda_i - \lambda_j) = 0$

So $\lambda_i \neq \lambda_j$ implies $x_i' x_j = 0$, and (A3.4.26) is established.
Given these properties of eigenvalues and eigenvectors for symmetric matrices, the key questions remaining are (i) how to identify which of the values on the right-hand side of (A3.4.25) are relevant in any particular case, and (ii) how to construct their associated eigenvectors in terms of the SVD of A. To answer these questions, we shall proceed on a case-by-case basis, from the simplest to the most general cases.
26 It should be noted that both the statement and proof of this result make constant use of property (A3.4.24), since eigenvectors for real eigenvalues can always be restricted to $\mathbb{R}^n$. This allows orthogonality (and indeed all inner products) to be defined solely on $\mathbb{R}^n$. While this same analysis can of course be carried out using complex inner products, property (A3.4.24) shows that this is not necessary for real symmetric matrices.
The simplest, and by far the most important, cases for our purposes all involve symmetric positive definite or positive semidefinite matrices. So this is the best place to begin. Recall from expressions (A2.7.36) and (A2.7.67) in Appendix A2 that an n-square matrix, A, is positive semidefinite iff for all $x \in \mathbb{R}^n$,

(A3.4.30) $x \neq 0 \;\Rightarrow\; x'Ax \geq 0$

and is positive definite iff for all $x \in \mathbb{R}^n$,

(A3.4.31) $x \neq 0 \;\Rightarrow\; x'Ax > 0$

For such matrices, every SVD turns out to be essentially an SPD as well. More precisely, we now show that:27

SPD Theorem 1: If A is a symmetric positive semidefinite n-square matrix, then for any SVD of A,

(A3.4.32) $A = USV'$

it must be true that (i) A has the spectral decompositions,

(A3.4.33) $A = USU' = VSV'$

and (ii) if A is in fact positive definite, then $\mathrm{diag}(S) > 0$ and $U = V$.
27 It is of interest to note that a direct proof for this case follows from the standard construction of principal components in multivariate analysis, which in fact closely parallels the above proof of the Singular Value Decomposition Theorem. See for example the classic treatment in Anderson (1958, pp. 273-275).
Proof: (i) To establish the first equality in (A3.4.33), it suffices to show that

(A3.4.34) $Au_i = s_i u_i$

for all $i = 1,..,n$. But by applying the same column decomposition as in (A3.2.28) to (A3.4.15) for the SVD in (A3.4.32), it follows that

(A3.4.35) $Au_i = s_i v_i\,,\quad i = 1,..,n$

Given this representation, there are two cases to consider. First, if $s_i = 0$, then it follows at once from (A3.4.35) that

(A3.4.36) $Au_i = 0 = s_i u_i$

On the other hand, if $s_i > 0$, then observe from (A3.4.23) that we must have $y_i = 0$. For if not, then since $y_i \neq 0 \Rightarrow y_i' y_i > 0$, it would follow from (A3.4.23) that

(A3.4.37) $y_i' A y_i = (-s_i)\, y_i' y_i < 0$

which would contradict the positive semidefiniteness of A in (A3.4.30). Hence

(A3.4.38) $0 = y_i = u_i - v_i \;\Rightarrow\; v_i = u_i$

and (A3.4.35) again reduces to (A3.4.34). So (A3.4.34) must hold in all cases, and the first equality in (A3.4.33) is established. The second equality follows in exactly the same way by replacing (A3.4.15) with (A3.4.14), and thus switching the roles of $u_i$ and $v_i$ in (A3.4.35).
(ii) Finally, if A is positive definite, then since $u_i'u_i = \|u_i\|^2 = 1$ for each $i = 1,..,n$, it follows from (A3.4.34) that $u_i' A u_i = s_i\, u_i'u_i = s_i$, and hence from positive definiteness that $s_i > 0$. Thus $\mathrm{diag}(S) = (s_1,..,s_n) > 0$. Moreover, since the argument in (A3.4.37) and (A3.4.38) now holds for all $i = 1,..,n$, it also follows that $U = V$.
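This theorem is also easy to illustrate numerically. A small MATLAB sketch (ours), using a randomly generated positive definite matrix:

    % SPD Theorem 1: for symmetric positive definite A, every SVD is an SPD
    rng(1);                        % arbitrary seed, for reproducibility
    B = randn(5);
    A = B*B' + eye(5);             % symmetric positive definite by construction
    [U,S,V] = svd(A);
    disp(norm(U - V))              % = 0 (up to rounding): U = V, as asserted
    disp(norm(A - U*S*U'))         % = 0: the SVD is already an SPD of A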
For symmetric positive definite matrices, A, the above SPD Theorem 1 thus shows that the two decompositions, SVD and SPD, of A exhibit a one-to-one correspondence. As a direct consequence of this correspondence, we now have the following additional characterizations of positive definiteness:
Corollary 1. For any symmetric positive semidefinite matrix, A, the following three
properties are equivalent:
(A3.4.41) A is positive definite.

(A3.4.42) All eigenvalues of A are positive.

(A3.4.43) A is nonsingular.

Proof: To show first that positive definiteness implies positive eigenvalues, observe that if $Ax = \lambda x$ for some $x \neq 0$, then positive definiteness implies

(A3.4.44) $Ax = \lambda x \;\Rightarrow\; x'Ax = \lambda\, x'x \;\Rightarrow\; \lambda = \dfrac{x'Ax}{x'x} > 0$

and thus that all eigenvalues must be positive. Next, to show that positive eigenvalues imply nonsingularity, observe that the symmetric positive semidefiniteness of A implies from part (i) of SPD Theorem 1 that A has a spectral decomposition,

(A3.4.45) $A = USU'$

[given by the first equality in (A3.4.33)]. It then follows that if all eigenvalues are positive, the positive diagonal matrix, S, in (A3.4.45) must have a well-defined inverse, $S^{-1}$. But this together with the orthonormality of the matrix, U, implies that

(A3.4.46) $US^{-1}U' = (USU')^{-1} = A^{-1}$

and thus that A is nonsingular. Finally, to show that nonsingularity implies positive definiteness, note first from the nonnegativity of the diagonal matrix, S, in (A3.4.45) that S has a well-defined square root,

(A3.4.47) $S^{1/2} = \mathrm{diag}(s_1^{1/2},..,s_n^{1/2})$

so that for all $x \in \mathbb{R}^n$,

(A3.4.48) $x'Ax = x'USU'x = x'US^{1/2}S^{1/2}U'x = (S^{1/2}U'x)'(S^{1/2}U'x) = \|S^{1/2}U'x\|^2$

But since for any vector, z, $\|z\|^2 = 0 \Rightarrow \|z\| = 0 \Rightarrow z = 0$, we see from (A3.4.48) that

(A3.4.49) $x'Ax = 0 \;\Rightarrow\; S^{1/2}U'x = 0 \;\Rightarrow\; Ax = (US^{1/2})(S^{1/2}U'x) = 0$

So if $x'Ax = 0$ for some $x \neq 0$, then it would also be true that $Ax = 0$, which contradicts the nonsingularity of A. Thus, nonsingularity together with the positive semidefiniteness
of A imply that $x'Ax > 0$ must hold whenever $x \neq 0$, and it follows that A is positive definite.
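In computational terms, (A3.4.46) gives a direct recipe for inverting a symmetric positive definite matrix from its spectral decomposition. A brief MATLAB sketch (ours):

    % Inversion via the spectral decomposition, as in (A3.4.46)
    B = randn(4);
    A = B*B' + eye(4);             % symmetric positive definite
    [U,S] = eig(A);                % spectral decomposition: A = U*S*U'
    Ainv = U * diag(1./diag(S)) * U';    % A^{-1} = U*S^{-1}*U'
    disp(norm(Ainv - inv(A)))      % = 0 (up to rounding)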
For our present purposes, the single most important application of these results is to characterize the spectral properties of covariance matrices. To begin with, we can now give a more complete statement of the Positive Definiteness Property for nonsingular covariance matrices stated in Appendix A2 (page A2-27): every nonsingular covariance matrix, $\Sigma$, is positive definite, and thus has a spectral decomposition of the form,

(A3.4.50) $\Sigma = U \Lambda U'\,,\quad \mathrm{diag}(\Lambda) > 0$
Proof: For convenience we start by repeating the argument in Appendix A2. First recall that if the covariance matrix of a random vector, $X = (X_1,..,X_n)'$, is denoted by $\Sigma = \mathrm{cov}(X)$, then the symmetry of covariances, $\sigma_{ij} = \mathrm{cov}(X_i, X_j) = \mathrm{cov}(X_j, X_i) = \sigma_{ji}$, implies that $\Sigma$ is symmetric. Moreover, since for any coefficient vector, $a \neq 0$, we must have

$a'\Sigma a = \mathrm{var}(a'X) \geq 0$

it follows that $\Sigma$ is positive semidefinite. So if $\Sigma$ is also nonsingular, then Corollary 1 shows that $\Sigma$ is positive definite, and the spectral representation in (A3.4.50), with all eigenvalues positive, follows at once from SPD Theorem 1.
In addition, recall from the discussion following the Linear Invariance Theorem in Section A.2.3 that reduced covariance matrices of the form, $A \Sigma A'$, were asserted to be nonsingular whenever A is of full row rank. We are now in a position to establish this result: if $\Sigma$ is an n-square nonsingular covariance matrix, and if the $m \times n$ matrix, A, is of full row rank, then $A \Sigma A'$ is a nonsingular covariance matrix.
Proof: The matrix, $A \Sigma A'$, has already been shown to be an m-square covariance matrix in expression (3.2.21) of Part II of these notes. So it remains to be shown that $A \Sigma A'$ is nonsingular. To do so, recall first (from the end of Section A3.1.1) that A is of full row rank iff its rows are linearly independent. But since these rows are precisely the columns of $A' = (a_1,..,a_m)$, it follows from the definition of linear independence [expression (A3.1.24)] that for any $x = (x_1,..,x_m)' \in \mathbb{R}^m$,

(A3.4.52) $A'x = \sum_{i=1}^{m} x_i a_i = 0 \;\Rightarrow\; x_i = 0\,,\; i = 1,..,m \;\Rightarrow\; x = 0$
Moreover, by writing $\Sigma = U \Lambda U'$ as in (A3.4.50), we see that for any $x \in \mathbb{R}^m$,

(A3.4.53) $x'(A \Sigma A')x = (A'x)'\,U \Lambda U'\,(A'x) = \|\Lambda^{1/2} U' A' x\|^2$

where again $\Lambda^{1/2} = \mathrm{diag}(\lambda_1^{1/2},..,\lambda_n^{1/2})$. So by essentially the same argument as in (A3.4.48) and (A3.4.49), it follows that for any $x \in \mathbb{R}^m$,

(A3.4.54) $x'(A \Sigma A')x = 0 \;\Rightarrow\; \Lambda^{1/2} U' A' x = 0 \;\Rightarrow\; A'x = 0$

But this together with (A3.4.52) and the nonsingularity of $\Lambda^{1/2}$ then shows that

(A3.4.55) $x \neq 0 \;\Rightarrow\; x'(A \Sigma A')x > 0$

for all $x \in \mathbb{R}^m$. Thus $A \Sigma A'$ is positive definite, and we may conclude from (A3.4.43) that $A \Sigma A'$ is also nonsingular.
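This result is again simple to illustrate numerically. A quick MATLAB check (ours), with randomly generated $\Sigma$ and A:

    % Nonsingularity of A*Sigma*A' when A has full row rank
    n = 5;  m = 3;
    B = randn(n);
    Sigma = B*B' + eye(n);         % a nonsingular covariance matrix
    A = randn(m,n);                % of full row rank with probability one
    disp(rank(A))                  % = m = 3
    disp(min(eig(A*Sigma*A')))     % > 0: A*Sigma*A' is positive definite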
There is a second class of symmetric matrices for which each SVD directly yields a unique SPD, namely those symmetric matrices for which all eigenvalues are distinct. Here it is of interest to recall the example given in expression (A3.4.4) above, i.e.,

(A3.4.56) $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$

with distinct eigenvalues, $\mathrm{Eig}(A) = \{1, -1\}$, but with (necessarily) repeated singular values given by the absolute values of $\mathrm{Eig}(A)$, so that A has only one distinct singular value, namely $\mathrm{Sing}(A) = \{1\}$. Here the SVD in (A3.4.5) appeared to exhibit little direct
relation to the SPD in (A3.4.8). Indeed, the situation is even worse for this matrix. In particular, since the unit circle is mapped onto itself by A, it follows that every pair of orthogonal unit vectors can serve as principal axes for this "ellipse". This nonuniqueness can be seen algebraically by observing that since A is itself orthonormal, it follows that for any other orthonormal matrix, V, the product, $U = AV$, must also be orthonormal. But since the singular values of A are given by the identity matrix, $S = I_2$, we may then use V to construct a distinct SVD for A by the product:

(A3.4.57) $A = AVV' = (AV)(I_2)V' = USV'$

Thus there are seen to be infinitely many SVDs for A. On the other hand, since the eigenvalues of A are distinct, we have already seen from (A3.4.26) that their corresponding eigenvectors must be orthogonal, and thus must form a basis for $\mathbb{R}^2$. So these eigenvectors [in (A3.4.8)] must in fact be unique (up to a choice of signs). Given this stark contrast, it would appear that there is little hope of constructing the unique SPD for A from its highly nonunique SVDs. But as we now show, this can indeed be done so long as the eigenvalues are distinct, in the sense that each has a geometric multiplicity of one [as in the case of (A3.4.56)]. Note also from the orthogonality of eigenvectors for distinct eigenvalues in (A3.4.26) that this in turn implies that the SPD for such symmetric matrices must be unique. With these observations, we now show that for any symmetric matrix, A, with all eigenvalues distinct, and for any given SVD of A,
(A3.4.58) $A = USV' = (u_1,..,u_n)\begin{pmatrix} s_1 & & \\ & \ddots & \\ & & s_n \end{pmatrix}\begin{pmatrix} v_1' \\ \vdots \\ v_n' \end{pmatrix}$

the unique SPD of A,

(A3.4.59) $A = W \Lambda W' = (w_1,..,w_n)\begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix}\begin{pmatrix} w_1' \\ \vdots \\ w_n' \end{pmatrix}$

can be constructed explicitly from (A3.4.58), as follows.
First consider the possibility of a zero eigenvalue, $\lambda_i = 0$, with unit eigenvector, $w_i$, so that $Aw_i = 0$. But since this implies that A is singular, there must be at least one zero singular value [for otherwise we would have $|A| \neq 0$ by (A3.2.77), which contradicts the singularity of A]. Moreover, if there were more than one, the argument in (A3.4.26) shows that there would be more than one zero eigenvalue for A, which contradicts the distinctness assumption. So there is exactly one $u_i$ with $Au_i = 0$, as in (A3.4.36). Thus we may set $w_i = u_i$ (with $\lambda_i = 0$), and conclude from (A3.4.26) that this will always form an admissible entry in the orthonormal matrix, W, of (A3.4.59). For the positive singular values, $s_i \in \mathrm{diag}(S)$, we now consider the possible distinct eigenvalues they can generate by (A3.4.22) and (A3.4.23). In view of the distinctness assumption, either exactly one of the values, $(s_i, -s_i)$, belongs to $\mathrm{Eig}(A)$, or both do. The first case is the simplest, and is equivalent to the condition derived from (A3.4.58) that $s_i$ appears in only one equation of the equation systems (A3.4.22) and (A3.4.23) with a nonzero eigenvector. If for notational simplicity we let $(\lambda_i, w_i)$ denote the associated eigenvalue-eigenvector pair to be constructed in (A3.4.59),28 then by using the definitions of $x_i = u_i + v_i$ and $y_i = u_i - v_i$ in (A3.4.22) and (A3.4.23), respectively (and recalling that at least one of these vectors must be nonzero), we may set
(A3.4.60) $w_i = \begin{cases} \dfrac{u_i + v_i}{\|u_i + v_i\|}\,, & \text{if } u_i + v_i \neq 0 \\[2ex] \dfrac{u_i - v_i}{\|u_i - v_i\|}\,, & \text{if } u_i + v_i = 0 \end{cases}$

(A3.4.61) $\lambda_i = \begin{cases} s_i\,, & \text{if } u_i + v_i \neq 0 \\[1ex] -s_i\,, & \text{if } u_i + v_i = 0 \end{cases}$
Turning to the second case, where both $(s_i, -s_i)$ appear in the equation systems (A3.4.22) and (A3.4.23) with nonzero eigenvectors, observe that $s_i$ must appear twice in $\mathrm{diag}(S)$, say in positions $i$ and $j$. If we consider the values of x and y in columns $i$ and $j$ of both (A3.4.22) and (A3.4.23), namely,
28 More formally, the rows and columns of A can always be permuted to satisfy this relation. The standard convention in the literature is thus to say that "by relabeling if necessary" we can use the index $i$ for both $s_i$ and its associated eigenvalue, $\lambda_i$.
(A3.4.62) $x_i = u_i + v_i\,,\; y_i = u_i - v_i \qquad\text{and}\qquad x_j = u_j + v_j\,,\; y_j = u_j - v_j$
then it must be true that either $x_i$ or $x_j$ is nonzero, and similarly that either $y_i$ or $y_j$ is nonzero. But if both $x_i$ and $x_j$ are nonzero, then they must be scalar multiples of one another. For otherwise, columns $i$ and $j$ of the equation system (A3.4.22) would yield two linearly independent solutions, $(Ax_i = s_i x_i\,,\; Ax_j = s_j x_j)$ with $s_i = s_j$, and it would follow that the eigenvalue, $s_i$, has a multiplicity of two. But since this contradicts the assumption of distinct eigenvalues, $x_i$ and $x_j$ must be linearly dependent, i.e., scalar multiples of one another. This in turn implies that they must have the same normalizations (up to sign), which can be written in terms of u and v as:

(A3.4.63) $\dfrac{u_i + v_i}{\|u_i + v_i\|} = \dfrac{u_j + v_j}{\|u_j + v_j\|}$
Moreover, since exactly the same argument for $y_i$ and $y_j$ shows that if both are nonzero, then

(A3.4.64) $\dfrac{u_i - v_i}{\|u_i - v_i\|} = \dfrac{u_j - v_j}{\|u_j - v_j\|}$

it follows that in this second case we may set $\lambda_i = s_i$ and $\lambda_j = -s_i$, with associated eigenvectors given by:

(A3.4.65) $w_i = \begin{cases} \dfrac{u_i + v_i}{\|u_i + v_i\|}\,, & \text{if } u_i + v_i \neq 0 \\[2ex] \dfrac{u_j + v_j}{\|u_j + v_j\|}\,, & \text{if } u_i + v_i = 0 \end{cases}$

(A3.4.66) $w_j = \begin{cases} \dfrac{u_j - v_j}{\|u_j - v_j\|}\,, & \text{if } u_j - v_j \neq 0 \\[2ex] \dfrac{u_i - v_i}{\|u_i - v_i\|}\,, & \text{if } u_j - v_j = 0 \end{cases}$
Again, it follows from (A3.4.63) and (A3.4.64) that these choices of $w_i$ and $w_j$ are insensitive to whether the $i$th or $j$th quantities are used first on the right-hand sides of
(A3.4.65) and (A3.4.66). Note also from the orthogonality of eigenvectors for distinct eigenvalues that these normalized vectors will always yield admissible components of W. Finally, since the multiplicity of each singular value, $s \in \mathrm{diag}(S)$, determines exactly the number of eigenvalues generated by s (including the $s = 0$ case), it follows that this procedure must generate precisely n eigenvalues, $(\lambda_1,..,\lambda_n)$, with corresponding orthonormal eigenvectors, $(w_1,..,w_n)$, forming a basis for $\mathbb{R}^n$. So by construction, this procedure must yield a complete representation of A as in (A3.4.59).
So for the case of distinct eigenvalues, we see that the unique SPD for a symmetric matrix, A, can be explicitly constructed from any of its possible SVDs. Here it is instructive to see how this procedure works for the example in (A3.4.56). In this case, the SVD produced in (A3.4.5) yields

(A3.4.67) $X = U + V = (u_1 + v_1\,,\; u_2 + v_2) = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$

(A3.4.68) $Y = U - V = (u_1 - v_1\,,\; u_2 - v_2) = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$

So this is a case where all four elements of (A3.4.62) are nonzero, and thus where the identities in (A3.4.63) and (A3.4.64) are seen to hold. Moreover, since the norms of all these vectors are seen to be $\sqrt{2}$, it follows that they yield precisely the pair of normalized eigenvectors in (A3.4.8). Finally, one can verify by direct computation that any choice of an orthonormal matrix, V, in (A3.4.57) will always produce vectors that are scalar multiples of those in (A3.4.67) and (A3.4.68), and thus will yield the same eigenvectors for W.
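This invariance can also be checked numerically. The following MATLAB sketch (ours) generates a random SVD of A by the device in (A3.4.57), and recovers the unique eigenvectors via (A3.4.65) and (A3.4.66):

    % Constructing the unique SPD of A in (A3.4.56) from a random SVD
    A = [0 1; 1 0];
    [V,~] = qr(randn(2));          % a random orthonormal matrix V
    U = A*V;  S = eye(2);          % then A = U*S*V' is an SVD of A
    X = U + V;  Y = U - V;         % candidate eigenvectors
    w1 = X(:,1)/norm(X(:,1));      % eigenvalue +1 (x1 nonzero with prob. one)
    w2 = Y(:,2)/norm(Y(:,2));      % eigenvalue -1 (y2 nonzero with prob. one)
    W = [w1, w2];
    disp(norm(A - W*diag([1 -1])*W'))    % = 0: the unique SPD of A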
Finally, it is important to note that the case of distinct eigenvalues is overwhelmingly the most common case observed in practice. Indeed, it is a simple matter to show that within the space of all n-square symmetric matrices, the subset possessing two or more equal eigenvalues must have zero volume. So if one were to choose a symmetric matrix at random, then with probability one, this matrix would have all distinct eigenvalues.
Here we begin with one preliminary result that will enable us to verify dimensional consistency for all subspaces of repeated eigenvalues. In particular, note that for any pair of $n \times k$ matrices, A and B, the matrix sum, $A + B$, is well defined. In addition, A and B are said to be mutually orthogonal iff their columns are orthogonal, i.e., iff $A'B = O_k$.29 Hence, recalling that the rank of a matrix is by definition the dimension of its span [$\mathrm{rank}(A) = \dim(\mathrm{span}(A))$], we have the following useful rank equality:30

Rank Lemma: If the $n \times k$ matrices, A and B, are mutually orthogonal, then $\mathrm{rank}(A + B) = \mathrm{rank}(A) + \mathrm{rank}(B)$.
Proof: If we choose any bases, $[x_1,..,x_k]$ and $[y_1,..,y_h]$, for $\mathrm{span}(A)$ and $\mathrm{span}(B)$, respectively, then by mutual orthogonality it follows that $[x_1,..,x_k, y_1,..,y_h]$ must constitute a linearly independent set. To see this, note that since $x_i \in \mathrm{span}(A) \Rightarrow x_i = Az$ for some $z \in \mathbb{R}^k$, and similarly that $y_j \in \mathrm{span}(B) \Rightarrow y_j = Bw$ for some $w \in \mathbb{R}^k$, this together with the mutual orthogonality condition, $A'B = O_k$, implies that

$x_i'\, y_j = (Az)'(Bw) = z'(A'B)\, w = 0$

and hence that $[x_1,..,x_k]$ and $[y_1,..,y_h]$ are mutually orthogonal sets of vectors. This together with the linear independence of basis vectors implies that the full set of vectors, $[x_1,..,x_k, y_1,..,y_h]$, is linearly independent. Finally, since for any vector, $v \in \mathrm{span}(A + B)$,

$v = (A + B)\,u = Au + Bu \in \mathrm{span}(A) + \mathrm{span}(B) \;\Rightarrow\; v = \sum_{i=1}^{k} \alpha_i x_i + \sum_{j=1}^{h} \beta_j y_j$

for some coefficients, $(\alpha_1,..,\alpha_k)$ and $(\beta_1,..,\beta_h)$, it then follows that $[x_1,..,x_k, y_1,..,y_h]$ must be a basis for $\mathrm{span}(A + B)$. Thus we have

(A3.4.73) $\dim(\mathrm{span}(A + B)) = k + h = \dim(\mathrm{span}(A)) + \dim(\mathrm{span}(B))$

and the lemma is established.
29 As with the n-square identity matrix, $I_n$, we here denote the n-square zero matrix by $O_n$.

30 A detailed development of other rank properties can be found in Chapter 6 of Searle (1982).
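This rank equality can be illustrated numerically in exactly the form in which it will be used below, i.e., with $X = U + V$ and $Y = U - V$ constructed from an SVD of a symmetric matrix. A small MATLAB check (ours):

    % Rank equality for the mutually orthogonal matrices X and Y
    A = [2 3; 3 2];                % symmetric, with eigenvalues 5 and -1
    [U,S,V] = svd(A);
    X = U + V;  Y = U - V;
    disp(norm(X'*Y))               % = 0 (up to rounding): mutually orthogonal
    disp([rank(X+Y), rank(X) + rank(Y)])   % both = 2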
With this preliminary result, we are now ready to establish the following general form of the Spectral Decomposition Theorem: every n-square symmetric matrix, A, has a spectral decomposition,

(A3.4.74) $A = W \Lambda W'$

with an orthonormal matrix, W, and a real diagonal matrix, $\Lambda$.

Proof: As in the cases above, our strategy is to start with any given SVD,

(A3.4.75) $A = USV'$

for the n-square symmetric matrix, A, and to construct an SPD for A as in (A3.4.74). To do so, we first note that by relabeling the rows and columns of A if necessary, we may assume that the sets of common singular values (including singleton sets) are grouped into blocks, $S_i = \mathrm{diag}(s_{i1},..,s_{i n_i})\,,\; i = 1,..,m\,(\leq n)$, along the diagonal of the matrix, S, where each block has common value, $s_i = s_{ij}\,,\; j = 1,..,n_i\,(\geq 1)$, and has associated orthonormal sets of column vectors, $U_i = (u_{i1},..,u_{i n_i})$ and $V_i = (v_{i1},..,v_{i n_i})$. With this grouping, expression (A3.4.75) can be written as

(A3.4.76) $A = (U_1,..,U_m)\begin{pmatrix} S_1 & & \\ & \ddots & \\ & & S_m \end{pmatrix}\begin{pmatrix} V_1' \\ \vdots \\ V_m' \end{pmatrix} \;\Rightarrow\; AV_i = U_i S_i\,,\; i = 1,..,m$

and the desired SPD in (A3.4.74) can be given the corresponding block form,

(A3.4.77) $A = W \Lambda W' = (W_1,..,W_m)\begin{pmatrix} \Lambda_1 & & \\ & \ddots & \\ & & \Lambda_m \end{pmatrix}\begin{pmatrix} W_1' \\ \vdots \\ W_m' \end{pmatrix} \;\Rightarrow\; AW_i = W_i \Lambda_i\,,\; i = 1,..,m$
As with U and V above, the key conditions to be satisfied by W are that each block, $W_i$, have orthonormal columns, and that the columns in different blocks be mutually orthogonal. To construct (A3.4.77), we start by observing that there is one special case which can be handled without further analysis. In particular, if the matrix A is singular, then exactly one block, $S_i$, in (A3.4.76) will have $s_i = 0$. But since the vectors in this block must satisfy
(A3.4.78) $AV_i = U_i S_i = O$

we may simply set

(A3.4.79) $\Lambda_i = S_i = O_{n_i}$ and $W_i = V_i$

to obtain an admissible block for (A3.4.77). So we may henceforth restrict attention to blocks with common value, $s_i > 0$. For these blocks, we again employ the matrices,

(A3.4.80) $X_i = U_i + V_i\,,\quad i = 1,..,m$

(A3.4.81) $Y_i = U_i - V_i\,,\quad i = 1,..,m$

and observe from (A3.4.20) and (A3.4.21) that by construction,

(A3.4.82) $AX_i = s_i X_i\,,\quad i = 1,..,m$

(A3.4.83) $AY_i = (-s_i)\, Y_i\,,\quad i = 1,..,m$

So all columns in $X_i$ and $Y_i$ are potential eigenvectors for A. Here there are three possible cases to be considered, namely (i) $U_i = V_i \Rightarrow Y_i = O$, (ii) $U_i = -V_i \Rightarrow X_i = O$, or (iii) $X_i \neq O$ and $Y_i \neq O$.31 In case (i), all eigenvalues in block $i$ are positive, and are governed by (A3.4.82). But since (A3.4.80) and (A3.4.81) together imply that $X_i + Y_i = 2U_i$, it follows that

(A3.4.84) $X_i = X_i + O = X_i + Y_i = 2U_i$

so that we may set

(A3.4.85) $\Lambda_i = S_i = s_i I_{n_i}$ and $W_i = \tfrac{1}{2} X_i = U_i$
to obtain the desired block $i$ in (A3.4.77). The construction for case (ii) is essentially identical, except that now

31 For notational simplicity, we here take the common dimension of these zero matrices (namely $n \times n_i$) to be understood.
(A3.4.86) $Y_i = O + Y_i = X_i + Y_i = 2U_i$

with all eigenvalues given by $-s_i$. So in this case, we can construct the desired block $i$ by setting

(A3.4.87) $\Lambda_i = -S_i = (-s_i)\, I_{n_i}$ and $W_i = \tfrac{1}{2} Y_i = U_i$
This leaves case (iii), in which both $s_i$ and $-s_i$ are eigenvalues for A. This is by far the most complex case, and requires additional analysis. Here we start by observing from the distinctness of the eigenvalues, $s_i$ and $-s_i$, together with (A3.4.82), (A3.4.83) and the orthogonality condition (A3.4.26), that the matrices $X_i$ and $Y_i$ must now be mutually orthogonal (i.e., $X_i'Y_i = O_{n_i}$). So by the Rank Lemma above, we must have $\mathrm{rank}(X_i) + \mathrm{rank}(Y_i) = \mathrm{rank}(X_i + Y_i) = \mathrm{rank}(2U_i) = n_i$. Hence, if we choose orthonormal bases, $[b_1,..,b_{k_i}]$ of $\mathrm{span}(X_i)$ and $[c_1,..,c_{h_i}]$ of $\mathrm{span}(Y_i)$, with $k_i = \mathrm{rank}(X_i)$ and $h_i = \mathrm{rank}(Y_i)$, then

(A3.4.90) $k_i + h_i = n_i$

and hence there are again exactly $n_i$ of these basis vectors. Moreover, since $b_j \in \mathrm{span}(X_i) \Rightarrow b_j = X_i z_j$ for some $z_j \in \mathbb{R}^{n_i}$, it follows from (A3.4.82) that

(A3.4.91) $Ab_j = A X_i z_j = s_i X_i z_j = s_i b_j$

and thus that the basis vectors $[b_1,..,b_{k_i}]$ form an orthonormal set of eigenvectors for the eigenvalue, $s_i$. Similarly, since $c_j \in \mathrm{span}(Y_i) \Rightarrow c_j = Y_i u_j$ for some $u_j \in \mathbb{R}^{n_i}$, it follows from (A3.4.83) that

(A3.4.92) $Ac_j = A Y_i u_j = (-s_i)\, Y_i u_j = (-s_i)\, c_j$

so that $[c_1,..,c_{h_i}]$ forms an orthonormal set of eigenvectors for the eigenvalue, $-s_i$. Thus, by setting $W_i = (b_1,..,b_{k_i}, c_1,..,c_{h_i})$ and $\Lambda_i = \mathrm{diag}(s_i I_{k_i}, -s_i I_{h_i})$, we obtain the desired block $i$ in (A3.4.77), and the construction of the SPD in (A3.4.74) is complete.
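The block construction above is straightforward to carry out in MATLAB. The following sketch (ours, not part of the original text) applies it to the 4-by-4 example in (A3.4.12), where all singular values share the common value $s = 1$ and case (iii) applies:

    % Constructing an SPD from an SVD in case (iii), for A in (A3.4.12)
    A = fliplr(eye(4));
    [U,S,V] = svd(A);              % any SVD of A; here S is the identity
    X = U + V;  Y = U - V;         % as in (A3.4.80)-(A3.4.81)
    B = orth(X);                   % orthonormal basis of span(X): eigenvalue +1
    C = orth(Y);                   % orthonormal basis of span(Y): eigenvalue -1
    W = [B, C];
    Lam = diag([ones(1,size(B,2)), -ones(1,size(C,2))]);
    disp(norm(A - W*Lam*W'))       % = 0 (up to rounding): an admissible SPD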
As one final comment, we begin by reiterating that our main objective in this section has
been to show that the spectral decomposition (SPD) of any symmetric matrix, A, can be
constructed from its singular value decomposition (SVD). However, this appears to leave
open the converse question of how to construct SVDs of symmetric matrices from their
SPDs. But since the singular values of A are simply the absolute values of its eigenvalues,
it turns out to be a simple matter to transform each SPD into a corresponding SVD. To do
so, recall from (A3.1.10) that the SPD in (A3.4.74) can be rewritten as:
(A3.4.93) $A = W \Lambda W' = (w_1,..,w_n)\begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix}\begin{pmatrix} w_1' \\ \vdots \\ w_n' \end{pmatrix} = \sum_{i=1}^{n} \lambda_i\, w_i w_i'$

To convert these eigenvalues to absolute form, observe that if $\mathrm{sgn}(\lambda)$ denotes the sign of any number, $\lambda$ (where for definiteness we take $\mathrm{sgn}(\lambda) = 1$ for $\lambda \geq 0$ and $\mathrm{sgn}(\lambda) = -1$ otherwise), then by definition, $\lambda = |\lambda|\,\mathrm{sgn}(\lambda)$, so that (A3.4.93) can be written as

(A3.4.94) $A = \sum_{i=1}^{n} |\lambda_i|\; w_i\,[\mathrm{sgn}(\lambda_i)\, w_i]'$

Hence, if we now set

(A3.4.95) $s_i = |\lambda_i|\,,\quad u_i = w_i\,,\quad v_i = \mathrm{sgn}(\lambda_i)\, w_i\,,\quad i = 1,..,n$

then by definition,
(A3.4.96) $A = \sum_{i=1}^{n} s_i\, u_i v_i' = (u_1,..,u_n)\begin{pmatrix} s_1 & & \\ & \ddots & \\ & & s_n \end{pmatrix}\begin{pmatrix} v_1' \\ \vdots \\ v_n' \end{pmatrix} = USV'$

Moreover, since each $v_i = \mathrm{sgn}(\lambda_i)\, w_i$ differs from $u_i = w_i$ at most in sign, it also follows that V (like U) is orthonormal, and thus that (A3.4.96) is automatically an SVD for A. However, the SPDs of symmetric matrices clearly contain more information, and turn out to be far more useful than their corresponding SVDs. So this final result only serves to complete the full correspondence between the two.
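In MATLAB terms, this final conversion amounts to a single sign adjustment. A closing sketch (ours):

    % Converting an SPD to an SVD, as in (A3.4.93)-(A3.4.96)
    A = [2 3; 3 2];                % symmetric, with eigenvalues 5 and -1
    [W,L] = eig(A);                % SPD: A = W*L*W'
    lam = diag(L);
    sg = ones(size(lam));  sg(lam < 0) = -1;   % sgn, with sgn(0) taken as +1
    U = W;  S = diag(abs(lam));  V = W*diag(sg);
    disp(norm(A - U*S*V'))         % = 0 (up to rounding): an SVD of A
    disp(norm(V'*V - eye(2)))      % = 0: V is orthonormal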
Tony E. Smith
E‐Mail: [email protected]