Notes On Spatial Econometrics

Mauricio Sarrias
Universidad de Talca

October 6, 2020
Contents

2 Spatial Models
 2.1 Taxonomy of Models
  2.1.1 Spatial Lag Model
  2.1.2 Spatial Durbin Model
  2.1.3 Spatial Error Model
  2.1.4 Spatial Autocorrelation Model
 2.2 Motivation of Spatial Models
  2.2.1 SLM as a Long-run Equilibrium
  2.2.2 SEM and Omitted Variables Motivation
  2.2.3 SDM and Omitted Variables Motivation
 2.3 Interpreting Spatial Models
  2.3.1 Measuring Spillovers
  2.3.2 Marginal Effects
  2.3.3 Partitioning Global Effects Estimates Over Space
  2.3.4 LeSage's Book Example

II Estimation Methods

3 Maximum Likelihood Estimation
 3.1 What Are The Consequences of Applying OLS?
  3.1.1 Finite and Asymptotic Properties
  3.1.2 Illustration of Bias
 3.2 Maximum Likelihood Estimation of SLM
  3.2.1 Maximum Likelihood Function
  3.2.2 Score Vector and Estimates
  3.2.3 Ord's Jacobian
  3.2.4 Hessian
 3.3 ML Estimation of the SEM Model
  3.3.1 What Are The Consequences of Applying OLS on a SEM Model?
  3.3.2 Log-likelihood Function
  3.3.3 Score Function and ML Estimates
 3.4 Computing the Standard Errors For The Marginal Effects
 3.5 Spillover Effects on Crime: An Application in R
  3.5.1 Estimation of Spatial Models in R
  3.5.2 Estimation of Marginal Effects in R
 3.6 Asymptotic Properties
  3.6.1 Triangular Arrays
  3.6.2 Consistency of QMLE
  3.6.3 Asymptotic Normality
 Appendix 3.A Terminology in Asymptotic Theory
 Appendix 3.B A Function to Estimate the SLM in R

4 Hypothesis Testing
 4.1 Test for Residual Spatial Autocorrelation Based on the Moran I Statistic
  4.1.1 Cliff and Ord Derivation
  4.1.2 Kelejian and Prucha (2001) Derivation of Moran's I
  4.1.3 Example
 4.2 Common Factor Hypothesis
 4.3 Hausman Test: OLS vs SEM
 4.4 Tests Based on ML
  4.4.1 Likelihood Ratio Test
  4.4.2 Wald Test
  4.4.3 Lagrange Multiplier Test
  4.4.4 Anselin and Florax Recipe
  4.4.5 Lagrange Multiplier Test Statistics in R

Index
List of Figures

3.1 Distribution of ρ̂
3.2 Spatial Distribution of Crime in Columbus, Ohio Neighborhoods
3.3 Effects of a Change in Region 30: Categorization
3.4 Effects of a Change in Region 30: Magnitude
3.5 Distances from R3 to all Regions

List of Tables
Part I

1 Introduction to Spatial Econometrics
1.1 Why Do We Need Spatial Econometrics?
An important aspect of any study involving spatial units (cities, regions, countries, etc.) is the potential relationships and interactions between them. For example, when modeling pollution at the regional level it is awkward to treat each region as an independent unit. In fact, regions cannot be analyzed as isolated entities since they are spatially interrelated by ecological and economic interactions. Therefore, environmental externalities are highly probable: an increase in region i's pollution will affect pollution in neighboring regions, but the impact will be lower for more distant regions. Consider Figure 1.1, where region 3 is highly industrialized, whereas regions 1, 2, 4 and 5 are residential areas. If region 3 increases its economic activity, then pollution will increase not only in that region, but also in the neighboring regions. It is also expected that contamination will increase in regions 1 and 5, but by a lower magnitude. We might think that the environmental externality in R3 causes environmental degradation in other regions through both spatial-economic interactions (e.g., transportation of inputs and outputs from region 3) and spatial-ecological interactions (e.g., carbon emissions).
[Figure 1.1: Five regions R1–R5 arranged in a row; R3 is the industrialized region.]
In the same vein, if we study crime at the city level, then we should somehow incorporate the possibility that crime is localized. For example, the identification of concentrations or clusters of greater criminal activity has emerged as a central mechanism for targeting criminal justice and crime prevention responses to crime problems. These clusters of crime are commonly referred to as hotspots: geographic locations of high crime concentration, relative to the distribution of crime across the whole region of interest.
Both examples implicitly state that geographic location and distance matter. In fact, they reflect the importance of the first law of geography. According to Waldo Tobler: "everything is related to everything else, but near things are more related than distant things." This first law is the foundation of the fundamental concepts of spatial dependence and spatial autocorrelation.
[Figure: Spatial distribution of poverty in the Metropolitan Region, Chile (legend from 10 to 25). Notes: This graph shows the spatial distribution of poverty in the Metropolitan Region, Chile.]
Formally, the existence of spatial autocorrelation may be expressed by the following moment condition:

Cov(y_i, y_j) = E(y_i y_j) - E(y_i)E(y_j) \neq 0 \quad \text{for } i \neq j,

where i and j refer to individual spatial units and y_i, y_j are the values of a random variable at those locations.
Notes: Spatial autocorrelation among 400 spatial units arranged in a 20-by-20 regular square lattice grid. Different gray tones refer to different values of the variable, ranging from low values (white) to high values (black). The left plot shows positive spatial autocorrelation, whereas the right plot shows negative spatial autocorrelation.
Positive autocorrelation is much more common, but negative autocorrelation does exist, for example, in studies of welfare competition or federal grant competition among local governments (Saavedra, 2000; Boarnet and Glazer, 2002), studies of regional employment (Filiztekin, 2009; Pavlyuk, 2011), cross-border lottery shopping (Garrett and Marsh, 2002), foreign direct investment in OECD countries (Garretsen and Peeters, 2009), and the location of Turkish manufacturing industry (Basdas, 2009). In short, we are interested in studying non-random spatial patterns and in trying to explain this non-randomness. Possible causes of non-randomness are (Gibbons et al., 2015):
1. Firms may be randomly allocated across space, but some characteristics of locations vary across space and influence outcomes.
2. Location may have no causal effect on outcomes, but outcomes may be correlated across space because heterogeneous individuals or firms are non-randomly allocated across space.
3. Individuals or firms may be randomly allocated across space, but they interact, so that decisions by one agent affect the outcomes of other agents.
4. Individuals or firms may be non-randomly allocated across space, and the characteristics of others nearby directly influence individual outcomes.
Non-stochastic means that the researcher takes W as known a priori; therefore, all results are conditional upon the specification of W.
Note also that the definition of W requires a rule for w_ij. In other words, we need to figure out how to assign a real number to w_ij, for i ≠ j, representing the strength of the spatial relationship between i and j. There are several ways of doing that, but, in general, there are two basic criteria. The first type establishes a relationship based on shared borders or vertices of lattice or irregular polygon data (contiguity). The second type establishes a relationship based on the distance between locations. Generally speaking, contiguity is most appropriate for geographic data expressed as polygons (so-called areal units), whereas distance is suited for point data, although in practice the distinction is not that absolute.
Rook Contiguity
In this case, two locations are neighbors if they share at least part of a common border or side. In Figure 1.4 we have a regular grid with 9 regions: each square represents a region. If, for example, we want to define the neighbors of region 5 using the rook criterion, then its neighbors will be regions 2, 4, 6 and 8 (the regions filled in red in the figure).
If we continue with this reasoning, then the 9 × 9 W matrix will be:
[Figure 1.4: A 3 × 3 regular grid of regions numbered 1–9; the rook neighbors of region 5 (regions 2, 4, 6 and 8) are shaded.]
W = \begin{pmatrix}
0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 & 0
\end{pmatrix} \quad (1.5)
Bishop Contiguity
In bishop contiguity (which is seldom used in practice), region i’s neighbors are located at its corners.
Figure 1.5 shows the neighbors of region 5 under this scheme. The neighbors are regions 1, 3, 7 and 9. Note
that regions in the interior will have more neighbors than those in the periphery.
[Figure 1.5: The same 3 × 3 grid; the bishop neighbors of region 5 (regions 1, 3, 7 and 9) are shaded.]
W = \begin{pmatrix}
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0
\end{pmatrix}
Queen Contiguity
In queen contiguity, any region that touches the boundary of region i, whether on a side or at a single point, is considered a neighbor. Under this criterion, the neighbors of region 5 are regions 1, 2, 3, 4, 6, 7, 8 and 9.
[Figure: The same 3 × 3 grid; under the queen criterion all eight surrounding regions are neighbors of region 5.]
d_{ij}^m = |x_i - x_j| + |y_i - y_j|.
All three measures presented above are useful if we consider the earth as a plane. For example, the Euclidean distance is the length of a straight line on a map, which is not necessarily the shortest distance once the curvature of the earth is taken into account. The great circle distance takes the curvature of the Earth into account. Ships and aircraft usually follow great circle routes to minimize distance and save time and money. In particular, the great circle distance is computed using the standard formula:

d_{ij}^{gc} = R \cdot \arccos\left[\sin(\phi_i)\sin(\phi_j) + \cos(\phi_i)\cos(\phi_j)\cos(\lambda_j - \lambda_i)\right],

where \phi and \lambda denote latitude and longitude, respectively, and R is the Earth's radius.
Inverse Distance
Now we have to transform the information about the distances among spatial points into a weighting scheme. The idea is that w_ij → 0 as d_ij → ∞. In other words, the closer j is to i, the larger w_ij should be, to conform to Tobler's first law.
In the inverse distance weighting scheme, the weights are inversely related to the separation distance, as shown below:
w_{ij} = \begin{cases} 1/d_{ij}^{\alpha} & \text{if } i \neq j \\ 0 & \text{if } i = j, \end{cases}
where the exponent α is a parameter that is usually set by the researcher. In practice, this parameter is seldom estimated, but typically set to α = 1 or α = 2. With α = 1, the weights are given by the reciprocal of the distance: the larger the distance between two spatial units, the lower the spatial weight or spatial connection. Finally, by convention, the diagonal elements of the spatial weights are set to zero and not computed: plugging in a value of d_ii = 0 would yield division by zero for inverse distance weights.
k-nearest Neighbors
An alternative type of spatial weights that avoids the problem of isolates is to select the k-nearest neighbors. In contrast to the distance band, this is not a symmetric relation. A potential problem with this type of neighbors, however, is the occurrence of ties, i.e., when more than one location j has the same distance from i. A number of solutions exist to break the tie, from randomly selecting one of the k-th order neighbors to including all of them.
A common transformation of W is row standardization, in which each weight is divided by its row sum: w^s_{ij} = w_{ij} / \sum_j w_{ij}. This ensures that all weights are between 0 and 1 and facilitates the interpretation of operations with the weights matrix as an averaging of neighboring values, as we will see below. The row-standardized weights matrix also ensures that the spatial parameters in many spatial stochastic processes are comparable between models (Anselin and Bera, 1998).
Another important feature is that, under row standardization, the elements of each row sum to unity, and the sum of all weights is S_0 = \sum_i \sum_j w_{ij} = n, the total number of observations. This is a nice interpretation that we will explore later.
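As a quick numerical illustration, the following sketch (using a small hypothetical binary matrix) verifies both properties in R:

# A small hypothetical binary contiguity matrix
W <- matrix(c(0, 1, 0,
              1, 0, 1,
              0, 1, 0), nrow = 3, byrow = TRUE)
Ws <- W / rowSums(W)  # row standardization: w_ij / sum_j w_ij
rowSums(Ws)           # each row sums to unity
sum(Ws)               # S0 equals n = 3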
Another important issue is symmetry. As we have learned, some spatial weight matrices are symmetric. An important characteristic of a symmetric matrix is that all its characteristic roots are real. However, after row standardization the matrices are no longer symmetric.
The row-standardized matrix is also known in the literature as the row-stochastic matrix:

Definition 1.2.2 — Row-stochastic Matrix. A real n × n matrix A is called a Markov matrix, or row-stochastic matrix, if
1. a_{ij} ≥ 0 for 1 ≤ i, j ≤ n;
2. \sum_{j=1}^n a_{ij} = 1 for 1 ≤ i ≤ n.

Theorem 1.1 — Eigenvalues of a Row-stochastic Matrix. Every eigenvalue ω_i of a row-stochastic matrix satisfies |ω_i| ≤ 1.
Therefore, the eigenvalues of the row-stochastic (i.e., row-normalized, row-standardized or Markov) neighborhood matrix W^s are in the range [−1, +1].
Finally, the behavior of W^s is important for the asymptotic properties of estimators and test statistics (Anselin and Bera, 1998, pp. 244). In particular, the W matrix should also be exogenous, unless endogeneity is considered explicitly in the model specification.
The spatial lag of y at location i is

[W y]_i = \sum_{j=1}^n w_{ij} y_j,

where the weights w_{ij} consist of the elements of the ith row of the matrix W, matched up with the corresponding elements of the vector y. In other words, this is a weighted sum of the values observed at neighboring locations, since the non-neighbors are not included.
R As stated by Anselin (1988, pp. 23–24), standardization must be done with caution. For example, when the weights are based on an inverse distance function (or a similar concept of distance decay), which has a meaningful economic interpretation, scaling the rows so that the weights sum to one may result in a loss of that interpretation. Can you give an example?
W = \begin{pmatrix}
0 & 1 & 0 & 0 & 0 \\
1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 1 & 0
\end{pmatrix} \quad (1.6)
Then W 2 = W W based on the 5 × 5 first-order contiguity matrix W from (1.6) is:
W^2 = \begin{pmatrix}
1 & 0 & 1 & 0 & 0 \\
0 & 2 & 0 & 1 & 0 \\
1 & 0 & 2 & 0 & 1 \\
0 & 1 & 0 & 2 & 0 \\
0 & 0 & 1 & 0 & 1
\end{pmatrix} \quad (1.7)
Note that for region R1, the second-order neighbors are regions R1 and R3. That is, region R1 is a
second-order neighbor to itself as well as to region R3, which is a neighbor to the neighboring region R2.
Now consider R2. The first panel of Figure 1.7 shows the first-order neighbors of R2 given by the spatial
weight matrix in (1.6): the first-order neighbors are R1 and R3. Panel B considers the second-order neighbors
of R2: the second-order neighbors are R2 itself and R4. To understand this, note that there is a feedback effect on R2 coming from R1 and R3 (the first-order neighbors of R2). This explains why the element w^2_{22} = 2. Moreover, there is an indirect effect coming from R4 through R3 that finally impacts R2. This is represented by the value of 1 for the element w^2_{24}.
Similarly, for region R3, the second-order neighbors are regions R1 (which is a neighbor to the neighboring
region R2), R3 (a second-order neighbor to itself), and R5 (which is a neighbor to the neighboring region
R4).
[Figure 1.7: Panel A shows the first-order neighbors of R2 (R1 and R3); Panel B shows its second-order neighbors (R2 itself and R4).]
Continuing, the third-order matrix W^3 = W W^2 is:

W^3 = \begin{pmatrix}
0 & 2 & 0 & 1 & 0 \\
2 & 0 & 3 & 0 & 1 \\
0 & 3 & 0 & 3 & 0 \\
1 & 0 & 3 & 0 & 2 \\
0 & 1 & 0 & 2 & 0
\end{pmatrix}
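These powers can be verified numerically. The following sketch builds the 5 × 5 chain contiguity matrix from (1.6) and computes its second and third powers:

# Build the 5x5 chain contiguity matrix from (1.6)
W <- matrix(0, 5, 5)
W[cbind(1:4, 2:5)] <- 1   # link each region i to region i + 1
W <- W + t(W)             # symmetrize: first-order contiguity
W %*% W                   # second-order connections, W^2
(W %*% W) %*% W           # third-order connections, W^3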
1.3 Examples of Weight Matrices in R
For simplicity in showing how to create neighbor objects in R, we work with the map of the communes of the Metropolitan Region in Chile.
We first need to load the Metropolitan Region shapefile into R. To do so, we will use the maptools package (Bivand and Lewin-Koh, 2015), which allows us to read and handle spatial objects.
#Load package
library("maptools")
If the shape file mr_chile.shp is in the same working directory, then we can load it into R using the
command readShapeSpatial:
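# Read the shapefile into a Spatial object (call reconstructed from the
# text above; the object name mr is the one used throughout below)
mr <- readShapeSpatial("mr_chile.shp")
class(mr)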
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
The function readShapeSpatial reads data from the shapefile into a Spatial object of class "sp". The function names gives us the names of the variables in the .dbf file associated with the shapefile.
We can plot the shapefile using the generic function plot in the following way
# Plot shapefile
plot(mr, main = "Metropolitan Region-Chile", axes = TRUE)
[Figure: Plot of the mr shapefile, titled "Metropolitan Region-Chile", with latitude axis running from −34.5 to −33.0.]
#Load package
library("spdep")
In the spdep package, neighbor relationships between n observations are represented by an object of class "nb". This object is a list of length n, with the index numbers of the neighbors of each component recorded as an integer vector. If any observation has no neighbors, the component contains the integer zero.
The function poly2nb is used to construct weight matrices based on contiguity. Specifically, it creates a "neighbors list" of class "nb" based on regions with contiguous boundaries. Check out help(poly2nb) to see all the details and options.
First, we create a neighbor list based on the ‘Queen’ criteria for the communes of the Metropolitan Region:
# Create queen W
queen.w <- poly2nb(mr, row.names = mr$NAME, queen = TRUE)
Since we have an nb object to examine, we can present the standard methods for these objects. There are
print, summary, plot, and other methods. The characteristics of the weights are obtained with the usual
summary command:
# Summary of W
summary(queen.w)
## Number of regions: 52
## Number of nonzero links: 292
## Percentage nonzero weights: 10.79882
## Average number of links: 5.615385
## Link number distribution:
##
## 2 3 4 5 6 7 8 9 10 12
## 3 2 7 15 10 10 2 1 1 1
## 3 least connected regions:
## Tiltil San Pedro Maria Pinto with 2 links
## 1 most connected region:
## San Bernardo with 12 links
The output presents important information about the neighbors: it shows the number of regions, which
corresponds to 52 in this example; the number of nonzero links; the percentage of nonzero weights; the
average number of links, and so on.
The commune of San Bernardo is the most connected region, with 12 neighbors under the queen scheme. The least connected regions are Tiltil, San Pedro, and Maria Pinto, each with 2 neighbors. The output also shows the distribution of neighbors. For example, 7 out of 52 regions have 4 neighbors, and only 2 communes have 8 neighbors.
To transform the list into an actual matrix W , we can use the function nb2listw:
An important argument of the function is style. This argument indicates what type of matrix to create. For example, style = "W" creates a row-standardized matrix, so that w^s_{ij} = w_{ij} / \sum_j w_{ij}. After normalization, each row of W^s will sum to 1. "B" is the basic binary coding; "C" is globally standardized, that is, w^s_{ij} = w_{ij} \cdot (n / \sum_i \sum_j w_{ij}). If style = "U", then w^s_{ij} = w_{ij} / \sum_i \sum_j w_{ij}. In a minmax matrix, the (i, j)th element of W^s becomes w^s_{ij} = w_{ij} / \min\{\max_i(\tau_i), \max_i(c_i)\}, with \max_i(\tau_i) being the largest row sum of W and \max_i(c_i) being the largest column sum of W (Kelejian and Prucha, 2010). Finally, "S" is the variance-stabilizing coding scheme, where w^s_{ij} = w_{ij} / \sqrt{\sum_j w_{ij}^2} (Tiefelsdorf et al., 1999).
Furthermore, the summary function reports constants used in the inference for global spatial autocorrela-
tion statistics, which we will discuss later.
We can also see the attributes of the object using the function attributes:
# Attributes of wlist
attributes(queen.w)
## $class
## [1] "nb"
##
## $region.id
## [1] Santiago Cerillos Cerro Navia
## [4] Conchali El Bosque Estacion Central
## [7] La Cisterna La Florida La Granja
## [10] La Pintana La Reina Lo Espejo
## [13] Lo Prado Macul Nunoa
## [16] Pedro Aguirre Cerda Penalolen Providencia
## [19] Quinta Normal Recoleta Renca
## [22] San Joaquin San Miguel San Ramon
## [25] Independencia Puente Alto Las Condes
## [28] Vitacura Quilicura Huechuraba
## [31] Maipu Pudahuel San Bernardo
## [34] Tiltil Lampa Colina
## [37] Lo Barnechea Pirque Paine
## [40] Buin Alhue Melipilla
## [43] San Pedro Maria Pinto Curacavi
## [46] Penaflor Calera de Tango Padre Hurtado
## [49] El Monte Talagante Isla de Maipo
## [52] San Jose de Maipo
## 52 Levels: Alhue Buin Calera de Tango Cerillos Cerro Navia Colina ... Vitacura
##
## $call
## poly2nb(pl = mr, row.names = mr$NAME, queen = TRUE)
##
## $type
## [1] "queen"
##
## $sym
## [1] TRUE
# Symmetric W
is.symmetric.nb(queen.w)
## [1] TRUE
As we previously discussed, weight matrices based on boundaries are generally symmetric. Now, we construct a binary matrix using the rook criterion:
# Rook W
rook.w <- poly2nb(mr, row.names = mr$NAME, queen = FALSE)
summary(rook.w)
## 2 3 4 5 6 7 8 9 10
## 3 3 12 16 7 6 2 1 2
## 3 least connected regions:
## Tiltil San Pedro Maria Pinto with 2 links
## 2 most connected regions:
## Santiago San Bernardo with 10 links
Finally, we can plot the weight matrices using the following set of commands (see Figure 1.9).
# K-neighbors
coords <- coordinates(mr) # coordinates of centroids
head(coords, 5) # show coordinates
## [,1] [,2]
## 0 -70.65599 -33.45406
## 1 -70.71742 -33.50027
## 2 -70.74504 -33.42278
## 3 -70.67735 -33.38372
## 4 -70.67640 -33.56294
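The k-nearest neighbor objects used below can be created along the following lines (a sketch; the exact calls are not shown here, and the use of longlat = TRUE is an assumption consistent with the discussion that follows):

# k-nearest neighbor objects for k = 1 and k = 2
k1neigh <- knearneigh(coords, k = 1, longlat = TRUE)
k2neigh <- knearneigh(coords, k = 2, longlat = TRUE)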
The function coordinates extracts the spatial coordinates from the shapefile, whereas the function knearneigh returns a matrix with the indices of points belonging to the set of the k-nearest neighbors of each other. The argument k indicates the number of nearest neighbors to be returned. If the point coordinates are longitude-latitude decimal degrees and longlat = TRUE, great circle distances are used and distances are measured in kilometers. Note that the objects k1neigh and k2neigh are of class knn.
Weight matrices based on inverse distance can be computed in the following way (see Section 1.2.2):
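# A sketch consistent with the description below; the exact calls are not
# shown in the original, and the object names mirror the plotting code
dist.mat <- as.matrix(dist(coords))       # distances between centroids
dist.mat.inv <- 1 / dist.mat              # inverse distance weights
diag(dist.mat.inv) <- 0                   # convention: w_ii = 0
dist.mat.inve <- mat2listw(dist.mat.inv)  # convert the matrix to a listw object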
## 0 1 2 3 4
## 0 0.00000000 0.07687010 0.09438408 0.07350782 0.11078109
## 1 0.07687010 0.00000000 0.08226867 0.12324109 0.07489489
## 2 0.09438408 0.08226867 0.00000000 0.07814455 0.15606360
## 3 0.07350782 0.12324109 0.07814455 0.00000000 0.17922003
## 4 0.11078109 0.07489489 0.15606360 0.17922003 0.00000000
## 0 1 2 3 4
## 0 0.000000 13.008960 10.595007 13.603994 9.026811
## 1 13.008960 0.000000 12.155295 8.114177 13.352046
## 2 10.595007 12.155295 0.000000 12.796797 6.407644
## 3 13.603994 8.114177 12.796797 0.000000 5.579733
## 4 9.026811 13.352046 6.407644 5.579733 0.000000
The function dist from the stats package computes and returns the distance matrix obtained by applying the specified distance measure—euclidean distance in this example—to the rows of a data matrix. The other methods that can be used are maximum, manhattan, canberra, binary or minkowski. Finally, the mat2listw function converts a square spatial weights matrix to a listw object, labeling the regions with the sequence 1:nrow(x).2
2 For more about spatial weight matrices see (Stewart and Zhukov, 2010).
# Plot Weights
par(mfrow = c(3, 2))
plot(mr, border = "grey", main = "Queen")
plot(queen.w, coordinates(mr), add = TRUE, col = "red")
plot(mr, border = "grey", main = "1-Neigh")
plot(knn2nb(k1neigh), coords, add = TRUE, col = "red")
plot(mr, border = "grey", main = "2-Neigh")
plot(knn2nb(k2neigh), coordinates(mr), add = TRUE, col = "red")
plot(mr, border = "grey", main = "Inverse Distance")
plot(dist.mat.inve, coordinates(mr), add = TRUE, col = "red")
[Figure 1.9: Plots of the weight structures: Queen, 1-Neigh, 2-Neigh, and Inverse Distance.]
# X matrix
X <- cbind(mr$POVERTY, mr$URB_POP)
head(X, 5)
## [,1] [,2]
## [1,] 8 159919
## [2,] 9 65262
## [3,] 18 131850
## [4,] 12 104634
## [5,] 14 166514
Now, we can construct a spatially lagged version of this matrix, using the queen.w weights:
# Create WX
WX <- lag.listw(nb2listw(queen.w), X)
head(WX)
## [,1] [,2]
## [1,] 9.10000 100138.9
## [2,] 12.40000 299498.4
## [3,] 14.00000 144756.5
## [4,] 14.60000 121974.2
## [5,] 18.25000 170266.5
## [6,] 10.42857 236231.1
I = \frac{n}{S_0} \frac{z' W z}{z' z}, \quad (1.9)

where z = x - \bar{x}. If the W matrix is row standardized, then:

I = \frac{z' W^s z}{z' z}, \quad (1.10)
because S0 = n. Values range from -1 (perfect dispersion) to +1 (perfect correlation). A zero value indicates
a random spatial pattern.
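To make formula (1.9) concrete, a minimal hand-rolled version in R is given below (the helper name moran_I is ours, not part of spdep):

# Moran's I computed directly from (1.9)
moran_I <- function(x, W) {
  z  <- x - mean(x)  # deviations from the mean
  n  <- length(x)
  S0 <- sum(W)       # sum of all weights
  (n / S0) * sum(z * (W %*% z)) / sum(z^2)
}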
A very useful tool for understanding the Moran’s I test is the Moran Scatterplot. The idea of the Moran
scatterplot is to display the variable for each region (on the horizontal axis) against the standardized spatial
weighted average (average of the neighbors’ x, also called spatial lag) on the vertical axis (See Figure 1.11).
As pointed out by Anselin (1996), expressing the variables in standardized form (i.e., with mean zero and standard deviation equal to one) allows assessment of both the global spatial association, since the slope of the line is the Moran's I coefficient, and the local spatial association (the quadrant in the scatterplot). The Moran scatterplot is therefore divided into four different quadrants corresponding to the four types of local spatial association between a region and its neighbors:
• Quadrant I displays the regions with high x (above the average) surrounded by regions with high x (above the average). This quadrant is usually denoted High-High.
• Quadrant II shows the regions with low values surrounded by regions with high values. This quadrant is usually denoted Low-High.
• Quadrant III displays the regions with low values surrounded by regions with low values, and is denoted Low-Low.
• Quadrant IV shows the regions with high values surrounded by regions with low values. It is denoted High-Low.
Regions located in quadrants I and III reflect positive spatial autocorrelation, the spatial clustering of similar values, whereas quadrants II and IV represent negative spatial autocorrelation, the spatial clustering of dissimilar values.
To understand Moran’s I, it is important to note the similarity of the Moran’s I with the OLS coefficient.
Recall that
\hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} \quad (1.11)
Then, looking at (1.8), Moran's I is equivalent to the slope coefficient of a linear regression of the spatial lag W x on the observation vector x, both measured in deviations from their means. It is, however, not equivalent to the slope of x on W x, which would be a more natural way to think of such a regression.
The hypothesis tested by the Moran’s I is the following:
• H0 : x is spatially independent; the observed x is assigned at random among locations. In this case I
is close to zero.
• H1 : x is not spatially independent. In this case I is statistically different from zero.
What is the distribution of Moran's I? We are interested in the distribution of:

\frac{I - E[I]}{\sqrt{Var(I)}}.
There are two ways to compute the mean and variance of Moran's I: under the normality assumption for x_i, or under randomization of x_i. Under the normality assumption, the random variables x_i are assumed to be the result of n independent drawings from a normal population. Under the randomization assumption, no matter what the underlying distribution of the population, we treat the observed values of x_i as having been repeatedly randomly permuted across locations.
[Figure 1.11: Moran scatterplot. The horizontal axis shows the standardized variable x and the vertical axis its spatial lag W x; Quadrant I is the upper right, Quadrant II the upper left, Quadrant III the lower left and Quadrant IV the lower right.]
Theorem 1.2 — Moran’s I Under Normality. Assume that {xi } = {x1 , x2 , ..., xn } are independent and dis-
tributed as N(µ, σ 2 ), but µ and σ 2 are unknown. Then:
1
E (I) = − (1.12)
n−1
and
n2 S1 − nS2 + 3S02
E I2 = (1.13)
S02 (n2 − 1)
Pn Pn Pn Pn Pn Pn
where S0 = i=1 j=1 wij , S1 = i=1 j=1 (wij +wji )2 /2, S2 = i=1 (wi. +w.i )2 , where wi. = j=1 wij
Pn
and wi. = j=1 wji Then:
2
Var (I) = E I 2 − E (I) (1.14)
To derive these results, the following moments of the deviations z_i = x_i - \bar{x} are used (for distinct indices i, j, k, l):

E[z_i] = 0
E[z_i^2] = \sigma^2 - \frac{\sigma^2}{n}
E[z_i z_j] = -\frac{\sigma^2}{n}
E[z_i^2 z_j^2] = \frac{(n^2 - 2n + 3)\sigma^4}{n^2}
E[z_i^2 z_j z_k] = -\frac{(n - 3)\sigma^4}{n^2}
E[z_i z_j z_k z_l] = \frac{3\sigma^4}{n^2}
Then:
E[I] = \frac{n}{S_0} \frac{E\left[\sum_{i=1}^n \sum_{j=1}^n w_{ij} z_i z_j\right]}{E\left[\sum_{i=1}^n z_i^2\right]} = \frac{n}{S_0} \sum_{i=1}^n \sum_{j=1}^n w_{ij} \frac{E[z_i z_j]}{\sum_{i=1}^n E[z_i^2]}
     = \frac{-n S_0 \frac{\sigma^2}{n}}{S_0 \, n(1 - 1/n)\sigma^2} \quad (1.15)
     = -\frac{\sigma^2/n}{(1 - 1/n)\sigma^2}
     = -\frac{1}{n - 1}
and

E[I^2] = \frac{n^2}{S_0^2} \frac{E\left[\left(\sum_{i=1}^n \sum_{j=1}^n w_{ij} z_i z_j\right)^2\right]}{E\left[\left(\sum_{i=1}^n z_i^2\right)^2\right]}. \quad (1.16)

Expanding the squared double sum, the numerator collects three types of terms: (1/2)\sum (w_{ij} + w_{ji})^2 z_i^2 z_j^2, terms in (w_{ij} + w_{ji})(w_{ik} + w_{ki}) z_i^2 z_j z_k, and terms in w_{ij} w_{kl} z_i z_j z_k z_l, where the sums run over distinct pairs, triples and quadruples of indices. Substituting the moments given above yields (1.13).
Under randomization, the expectation is likewise:

E(I) = -\frac{1}{n - 1} \quad (1.17)

while the variance takes a longer expression that also involves the sample kurtosis.
It is important to note that the expected value of Moran's I under normality and randomization is the same.
• To test a null hypothesis H_0 (no spatial autocorrelation in our case), we specify a test statistic T such that large values of T are evidence against H_0.
• Let T have observed value t_obs. We generally want to compute the p-value p = Pr(T ≥ t_obs | H_0).
The algorithm for the Morans’ I Monte Carlo test is the following:
Algorithm 1.4 — Moran’s’ I Monte Carlo Test. The procedure is the following:
1. Rearrange the spatial data by shuffling their location and compute the Moran’s I S times. This will
create the distribution under H0 . This operationalizes spatial randomness.
2. Let I1∗ , I2∗ , ..., IS∗ be the Moran’s I for each time. A consistent Monte Carlo p-value is then:
PS
1+ 1(Is∗ ≥ Iobs )
pb = s=1
(1.21)
S+1
3. For tests at the α level or at 100(1 − α)% confidence intervals, there are reasons for choosing S so
that α(S + 1) is an integer. For example, use S = 999 for confidence intervals and hypothesis tests
when α = 0.05.
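A minimal R sketch of this algorithm, reusing the hand-rolled moran_I helper from above (the name moran_mc is hypothetical), is:

# Permutation-based Moran's I test, as in Algorithm 1.4
moran_mc <- function(x, W, S = 999) {
  I_obs  <- moran_I(x, W)                        # observed statistic
  I_star <- replicate(S, moran_I(sample(x), W))  # shuffle locations S times
  (1 + sum(I_star >= I_obs)) / (S + 1)           # p-value as in (1.21)
}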
[Figure 1.12: Quantile map of poverty in the Metropolitan Region (legend from 10 to 25).]
Figure 1.12 provides some useful insights. First, it clearly shows that the spatial pattern of poverty in the MR is not spatially homogeneous; rather, the intensity of poverty varies across space. Second, it provides an example of how disaggregated poverty indicators can reveal information beyond aggregate indicators: it shows that poverty intensity is lower in peripheral communes than in central communes.
How to interpret quantile maps? A quantile classification scheme is an ordinal ranking of the data values,
dividing the distribution into intervals that have an equal number of data values. Quantile classification
ensures maps are easily comparable and can be ‘easy to read’.
We can also plot the data using the equal interval classification. Equal interval divides the data into equal
size classes (e.g., 0-10, 10-20, 20-30, etc) and works best on data that is generally spread across the entire
range. In the following example we use a defined interval classification.
However, regarding the possible spatial association that seems to emerge from the above figures for the poverty variable, it is necessary to note that the results are sensitive to the number of defined intervals (among other things). Therefore, a comprehensive and formal analysis of the potential presence of spatial dependence is needed to ascertain whether there exists a pattern of statistically significant spatial autocorrelation in the spatial distribution of poverty. That is why we now compute the Moran's I test.
# Generate W matrices
queen.w <- poly2nb(mr, row.names = mr$NAME, queen = TRUE)
rook.w <- poly2nb(mr, row.names = mr$NAME, queen = FALSE)
[Figure 1.13: Choropleth map: Poverty in the Metropolitan Region (Equal Interval); classes: <9, 9–12, 12–16, >16.]
Moran's I test statistic for spatial autocorrelation is implemented in spdep (Bivand and Piras, 2015). There are mainly two functions for computing this test: moran.test, where the inference is based on a normality or randomization assumption, and moran.mc, for a permutation-based test.
# Moran's I test
moran.test(mr$POVERTY, listw = nb2listw(queen.w), randomisation = FALSE,
alternative = 'two.sided')
##
## Moran I test under normality
##
## data: mr$POVERTY
## weights: nb2listw(queen.w)
##
## Moran I statistic standard deviate = 4.0453, p-value = 5.225e-05
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic Expectation Variance
## 0.306497992 -0.019607843 0.006498517
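# The analogous call with the rook-based weights (reconstructed to match
# the output below)
moran.test(mr$POVERTY, listw = nb2listw(rook.w), randomisation = FALSE,
           alternative = 'two.sided')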
##
## Moran I test under normality
##
## data: mr$POVERTY
## weights: nb2listw(rook.w)
##
## Moran I statistic standard deviate = 4.3309, p-value = 1.485e-05
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic Expectation Variance
## 0.342282943 -0.019607843 0.006982432
The randomisation option is set to TRUE by default, which implies that in order to get inference based on a normal approximation, it must be explicitly set to FALSE, as in our case. Similarly, the default is a one-sided test, so that in order to obtain the results for the more commonly used two-sided test, the option alternative must be explicitly set to 'two.sided'. Note also that the zero.policy option is set to FALSE by default, which means that islands result in the missing value code NA. Setting this option to TRUE will set the spatial lag for islands to the customary zero value.
The results show that the Moran's I statistics are ≈ 0.30 and 0.34, respectively, and highly significant. This implies that there is evidence of robust positive spatial autocorrelation in the poverty variable (since we reject the null hypothesis of a random spatial distribution).
R If you compute the Moran’s I test for two different variables, but using the same spatial weight
matrix, the expectation and variance of the Moran’s I test statistic will be the same under the normal
approximation. Why?
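# Inference under the randomization assumption (the default; call
# reconstructed to match the output below)
moran.test(mr$POVERTY, listw = nb2listw(queen.w), alternative = 'two.sided')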
##
## Moran I test under randomisation
##
## data: mr$POVERTY
## weights: nb2listw(queen.w)
##
## Moran I statistic standard deviate = 4.0689, p-value = 4.723e-05
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic Expectation Variance
## 0.306497992 -0.019607843 0.006423226
Note how the value of the statistic and its expectation do not change relative to the normal case, only
the variance is different.
We can carry out a Moran's I test based on random permutations with the function moran.mc. Unlike the previous tests, it needs the number of permutations, nsim. Since the rank of the observed statistic is computed relative to the reference distribution of statistics for the permuted data sets, it is good practice to set this number to something ending in 9 (such as 99 or 999). This will lead to rounded pseudo p-values like 0.01 or 0.001.
# Moran's Test
set.seed(1234)
moran.mc(mr$POVERTY, listw = nb2listw(queen.w),
nsim = 99)
##
## Monte-Carlo simulation of Moran I
##
## data: mr$POVERTY
## weights: nb2listw(queen.w)
## number of simulations + 1: 100
##
## statistic = 0.3065, observed rank = 100, p-value = 0.01
## alternative hypothesis: greater
Note that none of the permuted data sets yielded a Moran’s I greater than the observed value of 0.3065,
hence a pseudo p-value of (0 + 1)/(99 + 1) = 0.01.
The Moran scatter plot can also be obtained using the function moran.plot of spdep:
# Moran's plot
moran.plot(mr$POVERTY, listw = nb2listw(queen.w))
[Figure 1.14: Moran scatterplot of the poverty rate (horizontal axis: Poverty Rate; vertical axis: Spatially Lagged Poverty Rate), with communes such as El Bosque, San Ramon, La Pintana, San Miguel, La Granja and Vitacura labeled.]
Figure 1.14 displays the Moran scatterplot of poverty with the queen weight matrix. Positive spatial autocorrelation, detected by the value of Moran's I, is reflected by the fact that most of the communes are located in quadrants I and III. However, there are some exceptions, such as the communes located in quadrants II and IV. For example, San Miguel is a commune with a low poverty rate, but it is surrounded by communes with high poverty.
A major limitation of Moran’s I is that it cannot provide information on the specific locations of spatial
patterns; it only indicates the presence of spatial autocorrelation globally. A single overall indication is given
of whether spatial autocorrelation exists in the dataset, but no indication is given of whether local variations
exist in spatial autocorrelation (e.g., concentrations, outliers) across the spatial extent of the data.
2 Spatial Models
In the previous chapter we learned some preliminary concepts in spatial econometrics, such as spatial dependence and spatial autocorrelation, and we learned how to carry out an exploratory analysis of spatial data.
In this chapter we show the formulation of spatial models. In particular, in Section 2.1 we derive a complete taxonomy of spatial models, including the Spatial Lag Model, Spatial Durbin Model, Spatial Error Model and the Spatial Autocorrelation Model. We give a brief motivation for each of them and some examples. In Section 2.3 we show how to understand 'spillover' effects and how to interpret marginal effects in the spatial model framework.
2.1 Taxonomy of Models
2.1.1 Spatial Lag Model
Consider first the pure spatial lag model:

y_i = \alpha + \rho \sum_{j=1}^n w_{ij} y_j + \varepsilon_i, \quad (2.1)

where w_{ij} is the (i, j)th element of the W matrix; y_i is the dependent variable for spatial unit i, so that \sum_{j=1}^n w_{ij} y_j is the weighted average of the dependent variable for the neighbors of i (or spatial lag); \varepsilon_i is the error term, such that E(\varepsilon_i) = 0; and \rho is the spatial autoregressive parameter, which measures the intensity of the spatial interdependence. \rho > 0 indicates positive spatial dependence, whereas \rho < 0 indicates negative spatial dependence. It should be clear that if \rho = 0, we have a conventional regression model. By including the spatial lag variable we make explicit the existence of spatial spillover effects due to, for example, geographical proximity. This data generating process is known as a Spatial Autoregressive Process, also labeled SAR or Spatial Lag Model (SLM). Since model (2.1) does not include explanatory variables, it is known as the pure SLM.
Figure 2.1 represents the spatial autoregressive model in (2.1) for two regions. The variables (x_1, x_2) and unobserved terms (\varepsilon_1, \varepsilon_2) have a direct effect on y in both regions. Note that the model incorporates spatial spillover effects through the effect of y_1 on y_2 and vice versa. That is, the model reflects the 'simultaneity' inherent in spatial autocorrelation.
The model can also be written in vector form as:

y_i = \alpha + \rho \, w_i^{\top} y + \varepsilon_i, \quad i = 1, ..., n,

where w_i^{\top} is 1 × n and y is n × 1,
[Figure 2.1: The SLM for two regions. Solid arrows (→) denote non-spatial effects of x_i and ε_i on y_i; dashed arrows (⇢) denote the spatial effects between y_1 and y_2.]
where w_i is the ith row of W. A full SLM specification with covariates in matrix form can be written as:

y = \alpha \imath_n + \rho W y + X\beta + \varepsilon, \quad (2.2)

where y is an n × 1 vector of observations on the dependent variable, X is an n × K matrix of observations on the explanatory variables, \beta is the K × 1 vector of parameters, and \imath_n is an n × 1 vector of ones.
R The reduced form of a system of equations is the result of solving the system for the endogenous
variables. This gives the latter as functions of the exogenous variables, if any. For example, the
general expression of a structural form is f (y, X, ε) = 0, whereas the reduced form of this model is
given by y = g(X, ε), with g as function.
Without restrictions on (I_n - \rho W)—and (\alpha \imath_n + X\beta)—the coefficients cannot be identified from data. In other words, in order to obtain the reduced form we need (I_n - \rho W) to be invertible. From standard linear algebra, a matrix A is invertible if \det(A) \neq 0. Thus, we require that \det(I_n - \rho W) \neq 0. The question is: which values of \rho lead to a non-singular (I_n - \rho W)? For symmetric matrices, \rho in the open interval (\omega_{min}^{-1}, \omega_{max}^{-1}) will lead to a symmetric positive definite (I_n - \rho W), where \omega_{min} and \omega_{max} are the minimum and maximum eigenvalues of W, respectively. This gives rise to the following theorem:
Theorem 2.1 — Invertibility. Let W be a weighting matrix such that w_ii = 0 for all i = 1, ..., n, and assume that all of the roots of W are real. Assume also that W is not row normalized. Let \omega_{min} and \omega_{max} be the minimum and maximum eigenvalues of W, and assume that \omega_{max} > 0 and \omega_{min} < 0. Then (I_n - \rho W) is nonsingular for all:

\omega_{min}^{-1} < \rho < \omega_{max}^{-1}.
Recall that for ease of interpretation, it is common practice to normalize W such that the elements of
each row sum to unity. Since W is nonnegative, this ensures that all weights are between 0 and 1, and has
the effect that the weighting operation can be interpreted as an averaging of neighboring values.
According to our Theorem 1.1 (Eigenvalues of a Row-stochastic Matrix), the eigenvalues of the row-stochastic (i.e., row-normalized, row-standardized or Markov) neighborhood matrix W are in the range [−1, +1]. In this case \rho \in (−1, 1); however, it is misleading to consider \rho as a conventional correlation coefficient between the vector y and its spatial lag W y. This interval is only the result of considering the standard row-standardized matrix. Other standardization methods will lead to other potential parameter spaces for \rho.
Theorem 2.2 — Invertibility of a Row-Normalized W Matrix. If W is row-normalized, then (I_n - \rho W)^{-1} exists for all |\rho| < 1.
In spite of its popularity, row-normalized weighting has its drawbacks. As we suggested in the remark in Section 1.2.4, row normalization alters the internal weighting structure of W, so that comparisons between rows become somewhat problematic. In view of this limitation, it is natural to consider a simple scalar normalization, which multiplies W by a single number, say a · W, removing any measure-unit effect while preserving the relations between all rows of W.
In particular, let

a = \min\{r, c\},
r = \max_i \sum_j |w_{ij}| \quad \text{(maximal row sum of the absolute values)}, \quad (2.4)
c = \max_j \sum_i |w_{ij}| \quad \text{(maximal column sum of the absolute values)}.
Then, assuming that the elements of W are nonnegative, (I_n - \rho W) will be nonsingular for all |\rho| < 1/a. Note that this normalization has the advantage of ensuring that the resulting spatial weights w_{ij} are all between 0 and 1, and hence can still be interpreted as relative influence intensities. This interval could be taken as the parameter space.
This is an important result, because a model whose weighting matrix is not row normalized can always be normalized in such a way that the inverse needed to solve the model will exist in an easily established region.
R For more about normalizing W and the parameter space of ρ see Elhorst (2014, section 2.4) and
Kelejian and Prucha (2010, section 2.2)
Considering the reduced form in Equation (2.3), we can find the mean and variance-covariance matrix of the complete system as a function of the exogenous variables. The expectation is given by:

E(y|X, W) = E\left[(I_n - \rho W)^{-1}(\alpha \imath_n + X\beta) + (I_n - \rho W)^{-1}\varepsilon \,\middle|\, X, W\right]
          = (I_n - \rho W)^{-1}(\alpha \imath_n + X\beta). \quad (2.5)
When |\rho| < 1, (I_n - \rho W)^{-1} can be expressed as an infinite series (also called the Leontief expansion), given in the following lemma.

Lemma 2.3 — Leontief Expansion. If |\rho| < 1 and the eigenvalues of W are bounded by 1 in absolute value, then

(I_n - \rho W)^{-1} = I_n + \rho W + \rho^2 W^2 + \rho^3 W^3 + \cdots = \sum_{l=0}^{\infty} \rho^l W^l.
Then, using Lemma 2.3 (Leontief Expansion), the reduced model given in Equation (2.3) can be written as:

y = (I_n + \rho W + \rho^2 W^2 + \cdots)(\alpha \imath_n + X\beta + \varepsilon),

since \alpha is a scalar, |\rho| < 1, and W is row-stochastic. By definition W \imath_n = \imath_n, and therefore W W \imath_n = W \imath_n = \imath_n. Consequently, W^l \imath_n = \imath_n for l \geq 0 (recall that W^0 = I_n). This allows us to write:

y = \frac{1}{1 - \rho} \imath_n \alpha + X\beta + \rho W X\beta + \rho^2 W^2 X\beta + \cdots + \varepsilon + \rho W \varepsilon + \rho^2 W^2 \varepsilon + \cdots
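The Leontief expansion can be checked numerically. The following sketch, using an arbitrary small row-stochastic matrix, compares the exact inverse with a truncated series:

# Numerical sketch of the Leontief expansion
W   <- matrix(c(0,   1,   0,
                0.5, 0,   0.5,
                0,   1,   0), nrow = 3, byrow = TRUE)  # row-stochastic
rho <- 0.4
direct <- solve(diag(3) - rho * W)                  # (I_n - rho W)^{-1}
series <- diag(3) + rho * W + rho^2 * (W %*% W) +
  rho^3 * (W %*% W %*% W)                           # truncated expansion
max(abs(direct - series))                           # small truncation error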
This expression allows us to define two effects: a multiplier effect affecting the explanatory variables and a spatial diffusion effect affecting the error terms. On the one hand, with respect to the explanatory variables, this expression means that, on average, the value of y at one location i is explained not only by the values of the explanatory variables associated with that location, but also by those associated with all other locations (neighbors or not) via the inverse spatial transformation (I_n - \rho W)^{-1}. This spatial multiplier effect decreases with distance, that is, with the powers of W in the series expansion of (I_n - \rho W)^{-1}.
On the other hand, with respect to the error process, this expression means that a random shock in a location i not only affects the value of y in that location, but also has an impact on the values of y in all other locations via the same inverse spatial transformation. To see this, recall that W^2 reflects second-order contiguous neighbors, those that are neighbors to the first-order neighbors (review Section 1.2.5). Since the neighbor of the neighbor (second-order neighbor) of an observation i includes observation i itself, W^2 has positive elements on the diagonal when each observation has at least one neighbor. That is, higher-order spatial lags can lead to a connectivity relation for an observation i such that W\varepsilon will extract observations from the vector \varepsilon that point back to observation i itself. This implies that there exists a simultaneous feedback. This is the diffusion effect, which also declines with distance. We will explore this mechanism more deeply in Section 2.3.
From Equation (2.3), we derive the variance-covariance matrix of y:

Var(y|W, X) = E\left[(I_n - \rho W)^{-1} \varepsilon \varepsilon^{\top} (I_n - \rho W^{\top})^{-1} \,\middle|\, W, X\right]
            = (I_n - \rho W)^{-1} E(\varepsilon \varepsilon^{\top} | W, X)(I_n - \rho W^{\top})^{-1} \quad (2.7)
This variance-covariance matrix is full, which implies that each location is correlated with every other location in the system; however, this correlation decreases with distance. The equation also shows that the covariance between each pair of outcomes is not null and decreases with the order of proximity. Moreover, the elements of the diagonal of Var(y|W, X) are not constant, implying heteroskedasticity. Since we have not assumed anything about the error variance, we can say that E(\varepsilon \varepsilon^{\top}|W, X) is a full matrix, say \Omega. This covers the possibility of heteroskedasticity, spatial autocorrelation, or both. In the absence of either of these complications, the error variance matrix simplifies to the usual \sigma^2 I_n.
Example 2.1 — County Homicide Rates in the US. In the criminology literature there has been great emphasis on the spatial diffusion of crime. The idea is that criminal violence may spread geographically via a diffusion process. For example, the literature suggests that certain social processes, such as illegal drug markets and gang rivalries, may be important for explaining the pattern and mechanisms of the spread of homicides (Cohen and Tita, 1999).
In particular, the empirical literature has focused on homicide rates and their determinants using the following OLS specification:

y_i = x_i^{\top}\beta + \varepsilon_i,
where y_i is the homicide rate in spatial unit i and x_i is a set of covariates that explain homicide rates across spatial units. However, this model does not capture the idea of spatial diffusion and spatial effects of homicide rates. Furthermore, it has generally been found that the homicide rate follows a spatially autocorrelated process. Given this, Baller et al. (2001), after rejecting the null hypothesis of spatial randomness in homicide rates, propose (among other spatial models) the following SLM process for modeling homicide rates using county-level data for the decennial years in the 1960 to 1990 time period:

y = \alpha \imath_n + \rho W y + X\beta + \varepsilon,

where y contains the homicide rates for the US counties, and X includes a deprivation measure, population density, median age, the unemployment rate, percent divorced, and a Southern dummy variable based on census definitions. As explained by Baller et al. (2001), if homicide rates are determined solely by the structural factors included in the X matrix, there should be no spatial patterning of homicide beyond that created by the socio-demographic similarities of geographically proximate counties. If this is the case, once all x_k are included in the model, the spatial relationship between y_i and y_j will become nonsignificant. This implies that \rho = 0.
This is the model most compatible with common notions of diffusion processes, because it implies an influence of neighbors' homicide rates that is not simply an artifact of measured or unmeasured independent variables. Rather, homicide events in one place actually increase the likelihood of homicides in nearby locales.
2.1.2 Spatial Durbin Model
The Spatial Durbin Model (SDM) adds spatially lagged explanatory variables to the SLM:

y = \rho W y + \alpha \imath_n + X\beta + W X\gamma + \varepsilon, \quad (2.8)

with reduced form:

y = (I_n - \rho W)^{-1}(\alpha \imath_n + X\beta + W X\gamma + \varepsilon). \quad (2.9)
The SDM is a spatial autoregressive model of a special form, including not only the spatially lagged dependent variable and the explanatory variables, but also the spatially lagged explanatory variables W X: y depends on own-region factors from matrix X, plus the same factors averaged over the neighboring regions. This idea is shown in Figure 2.2. Note that Region 1 not only exerts an impact on Region 2 (and vice versa) via y, but also via the independent variable x.
[Figure 2.2: The SDM for two regions. Solid arrows (→) denote non-spatial effects; dashed arrows (⇢) denote spatial effects running through both y and x.]
As an example, suppose that y is some measure of air pollution in each region. Then W y states that air pollution in region 1 might affect pollution in region 2, and vice versa. If X contains a measure of population density, the variable W X would indicate that density in region 1 (2) also affects air pollution in region 2 (1). This model also has very good properties in terms of the calculation of marginal effects, which we will explore later.
2.1.3 Spatial Error Model
The Spatial Error Model (SEM) is:

y = \alpha \imath_n + X\beta + u,
u = \lambda W u + \varepsilon, \quad (2.10)

where \lambda is the autoregressive parameter for the error lag W u (the notation distinguishes it from the spatial autoregressive coefficient \rho in a spatial lag model), and \varepsilon is generally i.i.d. noise. Figure 2.3 visualizes the SEM for two regions. Note that the error terms of the two regions are related, and the only spatial effect runs from \varepsilon_1 to \varepsilon_2 and vice versa.
As stated by Anselin and Bera (1998), spatial error dependence may be interpreted as a nuisance (and the parameter \lambda as a nuisance parameter) in the sense that it reflects spatial autocorrelation in measurement errors or in variables that are otherwise not crucial to the model (i.e., the "ignored" variables spill over across the spatial units of observation).
Unlike previous models, interaction effects among the error terms do not require a theoretical model for a spatial or social interaction process; instead, they are consistent with a situation where determinants of the dependent variable omitted from the model are spatially autocorrelated, or with a situation where unobserved shocks follow a spatial pattern.
[Figure 2.3: The SEM for two regions. Solid arrows (→) denote non-spatial effects; dashed arrows (⇢) denote the spatial relationship between the error terms ε_1 and ε_2.]
The spatial diffusion implied by this model can be analyzed by considering the reduced form. If the matrix (I_n - \lambda W) is not singular, then (2.10) can be written in the following reduced form:

y = \alpha \imath_n + X\beta + (I_n - \lambda W)^{-1}\varepsilon. \quad (2.11)

The variance-covariance matrix is then:

Var(y|W, X) = E\left[(I_n - \lambda W)^{-1} \varepsilon \varepsilon^{\top} (I_n - \lambda W^{\top})^{-1} \,\middle|\, W, X\right]
            = (I_n - \lambda W)^{-1} E(\varepsilon \varepsilon^{\top}|W, X)(I_n - \lambda W^{\top})^{-1} \quad (2.12)
R Interaction effects among the unobserved terms may also be interpreted to reflect a mechanism to
correct rent-seeking politicians for unanticipated fiscal policy changes. See for example Allers and
Elhorst (2005).
2.1.4 Spatial Autocorrelation Model
The Spatial Autocorrelation (SAC) model combines a spatial lag of the dependent variable with spatially autocorrelated disturbances. In matrix form:

y = \alpha \imath_n + \rho W y + X\beta + u,
u = \lambda M u + \varepsilon, \quad (2.13)
where the matrices W and M are n × n spatial weighting matrices.1 In this model, spatial interactions in both the dependent variable and the disturbances are considered. As is standard, the spatial weight matrices W and M are taken to be known and nonstochastic. These matrices are part of the model definition, and in many applications M = W. When \rho = 0, the model reduces to the SEM. When \lambda = 0, the model reduces to the SLM (SAR) specification. Setting \rho = 0 and \lambda = 0 reduces the model to a linear regression model with exogenous variables.
1 This model is also known as the SARAR(1, 1) model, or as a Cliff-Ord model because of the impact that Cliff and Ord (1973) had on the subsequent literature. Note that SARAR(1, 1) is a special case of the more general SARAR(p, q) model.
[Figure: Taxonomy of spatial models. The nodes of the diagram are the general specification y = ρW y + Xβ + W Xγ + u with u = λW u + ε; the SAC (y = ρW y + Xβ + u, u = λW u + ε); the Spatial Lag Model (y = ρW y + Xβ + ε); the Spatial Durbin Model (y = ρW y + Xβ + W Xγ + ε); the SLX model (y = Xβ + W Xγ + ε); the linear regression model (y = Xβ + ε); the Spatial Durbin Error Model (y = Xβ + W Xγ + u, u = λW u + ε); and the Spatial Error Model (y = Xβ + u, u = λW u + ε). The arrows correspond to the parameter restrictions γ = 0, λ = 0, ρ = 0 and γ = −ρβ described below.]
• Imposing the restriction γ = 0 leads to the SAC model that includes both a spatial lag for the dependent
variable and spatial lag for the error term, but excludes the influence of the spatially lagged explanatory
variables.
• Imposing the restriction λ = 0 leads to the SDM.
• Imposing the restriction ρ = 0 leads to the Spatial Durbin Error Model (SDEM).
• The so-called common factor parameter restriction (γ = −ρβ) yields the spatial error model (SEM) specification, which assumes that externalities across spatial units are mostly a nuisance spatial dependence problem caused by the regional transmission of random shocks.
• Imposing the restriction γ = 0 leads to the spatial lag model (SLM), whereas the restriction ρ = 0
results in a least-squares spatially lagged X regression model (labeled SLX) that assumes independence
between regions in the dependent variable, but includes characteristics from neighboring regions in the
form of spatially lagged explanatory variables.
2.2 Motivation of Spatial Models
2.2.1 SLM as a Long-run Equilibrium
To illustrate this motivation, consider y_t, which represents some dependent variable vector at time t. Assume that this variable is determined by a spatial autoregressive scheme that depends on space-time lagged values of the dependent variable from neighboring observations. This leads to a time lag of the average neighboring values of the dependent variable observed during the previous period, W y_{t-1}. We can also include current-period own-region characteristics X_t in the model. If the characteristics of regions remain relatively fixed over time, we can write X_t = X. As an illustration, consider a model involving pollution as the dependent variable y_t, which depends on the past-period pollution of neighboring regions, W y_{t-1}. Then the appropriate process is the following:

y_t = \rho W y_{t-1} + X\beta + \varepsilon_t. \quad (2.15)
Note that we can replace y_{t-1} on the right-hand side of (2.15) with:

y_{t-1} = \rho W y_{t-2} + X\beta + \varepsilon_{t-1}, \quad (2.16)

which yields:

y_t = \rho^2 W^2 y_{t-2} + (I_n + \rho W)X\beta + \varepsilon_t + \rho W \varepsilon_{t-1}. \quad (2.17)
Recursive substitution for past values of the vector y_{t-r} on the right-hand side of (2.17) over q periods leads to:

y_t = (I_n + \rho W + \rho^2 W^2 + \cdots + \rho^{q-1} W^{q-1}) X\beta + \rho^q W^q y_{t-q} + u, \quad (2.18)

with u = \sum_{r=0}^{q-1} \rho^r W^r \varepsilon_{t-r},
where we use the fact that E(εt−r ) = 0, r = 0, ..., q − 1, which also implies that E(u) = 0. Finally, taking the
limit of (2.18),
\lim_{q \to \infty} E(y_t) = (I_n - \rho W)^{-1} X\beta. \quad (2.19)
Note that we use the fact that the magnitude of \rho^q W^q y_{t-q} tends to zero for large q, under the assumption that |\rho| < 1 and that W is row-stochastic, so that the matrix W has a principal eigenvalue of 1.
Equation (2.19) states that we can interpret the observed cross-sectional relation as the outcome, or expectation, of a long-run equilibrium or steady state. This provides a dynamic motivation for the data generating process of the cross-sectional SLM that serves as the workhorse of spatial regression modeling. That is, a cross-sectional SLM relation can arise from the time dependence of decisions made by economic agents located at various points in space, when those decisions depend on the decisions of neighbors.
Consider now an omitted-variable motivation for the SEM. Suppose the DGP is:

y = xβ + zθ,
where x and z are uncorrelated vectors of dimension n × 1, and the vector z follows the following spatial
autoregressive process:
z = ρWz + r
z = (In − ρW)−1 r,

where r ∼ N(0, σ²In). Examples of z are culture, social capital, and neighborhood prestige.
If z is not observed, then:
y = xβ + u
u = (In − ρW)−1 ε (2.20)
where ε = θr. Then, we have the DGP for the spatial error model. Now suppose instead that the omitted variable is correlated with the included regressor x, according to:

ε = xγ + v
v ∼ N(0, σ²In) (2.21)

where the scalar parameters γ and σ² govern the strength of the relationship between x and z = (In − ρW)−1 r. Inserting (2.21) into (2.20), we obtain:
y = xβ + (In − ρW)−1 ε
  = xβ + (In − ρW)−1 (xγ + v)
  = xβ + (In − ρW)−1 xγ + (In − ρW)−1 v. (2.22)

Premultiplying both sides by (In − ρW) and rearranging:

(In − ρW)y = (In − ρW)xβ + xγ + v
y = ρWy + x(β + γ) + Wx(−ρβ) + v
This is the Spatial Durbin Model (SDM), which includes a spatial lag of the dependent variable y as well as a spatial lag of the explanatory variable x.
Spatial spillovers arise in many empirical settings. For example:

• Changes in the tax rate of one spatial unit might exert an impact on the tax-rate-setting decisions of nearby regions, a phenomenon that has been labeled tax mimicking and yardstick competition between local governments (see our example below).
• Situations where home improvements made by one homeowner exert a beneficial impact on selling
prices of neighboring homes.
• Innovation by university researchers diffuses to nearby firms.
• Air or water pollution generated in one region spills over to nearby regions.
The models reviewed in the previous section can be used to formally define the concept of a spatial spillover and, more importantly, to provide estimates of the quantitative magnitude of spillovers and to test for their statistical significance. There is, however, a distinction between global and local spillovers, which is discussed in Anselin (2003) and LeSage and Pace (2014).
We start our discussion about spillovers by formally defining global spillovers.
Definition 2.3.1 — Global Spillovers. Global spillovers arise when changes in a characteristic of one region
impact all regions’ outcomes. This applies even to the region itself since impacts can pass to the neighbors
and back to the own region (feedback). Specifically, global spillovers impact the neighbors, neighbors to
the neighbors, neighbors to the neighbors to the neighbors and so on.
The endogenous interactions produced by global spillovers lead to a scenario where changes in one region
set in motion a sequence of adjustments in (potentially) all regions in the sample such that a new long-run
steady state equilibrium arises (LeSage, 2014).
As explained by LeSage (2014), global spillovers might arise when considering interactions among local policies.
For example: “it seems plausible that changes in levels of public assistance (cigarette taxes) in state A would
lead to a reaction by neighboring states B to change their levels of assistances (taxes), which in turn produces a
game-theoretic (feedback) response of state A, and also responses of states C who are neighbors to neighboring
states B, and so on.”
The following definition corresponds to local spillovers.
Definition 2.3.2 — Local Spillovers. Local spillovers represent a situation where the impacts fall only on nearby or immediate neighbors, dying out before they reach regions that are neighbors to the neighbors.
As can be noted from the previous definitions, the main difference is that feedback or endogenous interaction is only possible for global spillovers.
R According to Anselin (2003) and LeSage and Pace (2014), different spatial models give rise to different
measures of spillovers.
To make these measures operational, consider the SDM:

(In − ρW)y = Xβ + WXθ + ε
y = (In − ρW)−1 Xβ + (In − ρW)−1 WXθ + (In − ρW)−1 ε.

Defining A(W) = In − ρW, this becomes:

y = A(W)−1 Xβ + A(W)−1 WXθ + A(W)−1 ε
  = A(W)−1 (Xβ + WXθ) + A(W)−1 ε
  = Σ_{r=1}^{K} A(W)−1 (In βr + W θr) xr + A(W)−1 ε
  = Σ_{r=1}^{K} Sr(W) xr + A(W)−1 ε,

where Sr(W) = A(W)−1 (In βr + W θr) is an (n × n) matrix and xr is the (n × 1) rth column of X. The effect on the expected value of yi of a change in the rth variable in region j is therefore:

∂E(yi)/∂xjr = Sr(W)ij (2.25)
where Sr(W)ij represents the (i, j)th element of the matrix Sr(W). This result implies that, unlike in the OLS model, a change in some variable in a given region will potentially affect the expected value of the dependent variable in all other regions. Given this characteristic, this type of effect is known as an indirect effect.
The impact on the expected value of yi of a change in the same variable in region i itself is given by:

∂E(yi)/∂xir = Sr(W)ii. (2.26)
This impact includes the effect of feedback loops where observation i affects observation j and observation j also affects observation i: a change in xir will affect the expected value of the dependent variable in i, which then passes through the neighbors of i and back to region i itself. To shed more light on this, let us write all the marginal effects in matrix notation as follows:
∂E(y)/∂x>r =

[ ∂E(y1)/∂x1r  ∂E(y1)/∂x2r  ...  ∂E(y1)/∂xnr ]
[ ∂E(y2)/∂x1r  ∂E(y2)/∂x2r  ...  ∂E(y2)/∂xnr ]
[     ...           ...      ...      ...     ]
[ ∂E(yn)/∂x1r  ∂E(yn)/∂x2r  ...  ∂E(yn)/∂xnr ]

= A(W)−1 (In βr + W θr) = Sr(W) (2.27)

= (In − ρW)−1 [ βr      w12 θr  ...  w1n θr ]
              [ w21 θr  βr      ...  w2n θr ]
              [ ...     ...     ...  ...    ]
              [ wn1 θr  wn2 θr  ...  βr     ]

which is an (n × n) matrix.
This expression is somewhat difficult to understand. To provide a better understanding we follow Elhorst
(2010) and consider a model with 3 regions arranged linearly2 with the following matrices:
W = [ 0    1   0   ]
    [ w21  0   w23 ] (2.28)
    [ 0    1   0   ]

A(W)−1 = 1/(1 − ρ²) × [ 1 − w23 ρ²   ρ   ρ² w23      ]
                      [ ρ w21        1   ρ w23       ] (2.29)
                      [ ρ² w21       ρ   1 − w21 ρ²  ]
where w12 = w32 = 1, since units 1 and 3 have only one neighbor, and w21 + w23 = 1, so we explicitly consider a row-standardized matrix. Substituting Equations (2.28) and (2.29) into Equation (2.27), we get:

Sk(W) = 1/(1 − ρ²) × [ (1 − w23 ρ²)βk + ρ w21 θk   ρβk + θk   ρ² w23 βk + ρ w23 θk      ]
                     [ ρ w21 βk + w21 θk           βk + ρθk   ρ w23 βk + w23 θk         ]
                     [ ρ² w21 βk + ρ w21 θk        ρβk + θk   (1 − w21 ρ²)βk + ρ w23 θk ]
Every diagonal element of this matrix represents a direct effect. Consequently, indirect effects do not occur if both ρ = 0 and θk = 0, since all non-diagonal elements will then be zero. Another important insight is that direct and indirect effects differ across the spatial units in the sample. Direct effects are different because the diagonal elements of the matrix (In − ρW)−1 are different for different units, provided that ρ ≠ 0. Indirect effects are different because both the non-diagonal elements of the matrix (In − ρW)−1 and of the matrix W are different for different units, provided that ρ ≠ 0 and/or θk ≠ 0. Finally, note that the indirect effects that occur if θk ≠ 0 are local effects, whereas the indirect effects that occur if ρ ≠ 0 are global effects.
2 Unit 1 is neighbor of unit 2, unit 2 is a neighbor of both units 1 and 3, and unit 3 is a neighbor of unit 2.
Summary Measures
In general, a change of each variable in each region implies n² potential marginal effects. If we have K variables in our model, this implies K × n² potential measures. Even for small values of n and K, it may be rather difficult to report these results compactly. To overcome this problem, LeSage and Pace (2010, pp. 36-37) propose the following scalar summary measures:
Definition 2.3.3 — Average Direct Impact. Let Sr(W) = A(W)−1 (In βr + W θr) for variable r. The impact of changes in the ith observation of xr, which is denoted xir, on yi can be summarized by averaging the diagonal elements Sr(W)ii, which equals:

ADI = (1/n) tr(Sr(W)). (2.30)
Averaging over the direct impacts associated with all observations i is similar in spirit to typical regression coefficient interpretations, which represent the average response of the dependent variable to the independent variables over the sample of observations.
Definition 2.3.4 — Average Total Impact to an Observation. Let Sr(W) = A(W)−1 (In βr + W θr) for variable r. The sum across the ith row of Sr(W) represents the total impact on individual observation yi resulting from changing the rth explanatory variable by the same amount across all n observations. There are n of these sums, given by the column vector cr = Sr(W)ın, so an average of these total impacts is:

ATIT = (1/n) ı>n cr. (2.31)
Definition 2.3.5 — Average Total Impact from an Observation. Let Sr(W) = A(W)−1 (In βr + W θr) for variable r. The sum down the jth column of Sr(W) yields the total impact over all yi from changing the rth explanatory variable by an amount in the jth observation. There are n of these sums, given by the row vector rr = ı>n Sr(W), so an average of these total impacts is:

ATIF = (1/n) rr ın. (2.32)
Definition 2.3.5 relates how changes in a single observation j influence all observations. In contrast, Definition 2.3.4 considers how changes in all observations influence a single observation i. In both cases, averaging over all n observations leads to the same numerical result. The implication of this interesting result is that the average total impact is the average of all derivatives of yi with respect to xjr for any i, j. Therefore:

ATI = (1/n) ı>n Sr(W) ın.

For the three-region example above, averaging the diagonal elements of Sk(W) gives a direct effect of

[(3 − ρ²)/(3(1 − ρ²))] βk + [2ρ/(3(1 − ρ²))] θk,

and an indirect effect of

[(3ρ + ρ²)/(3(1 − ρ²))] βk + [(3 + ρ)/(3(1 − ρ²))] θk.
Unfortunately, since every application will have its own unique number of observations n and spatial
weight matrix (W ), these formulae cannot be generalized.
Example 2.2 — The Effect of Number of workers on Commuting Times. Kirby and LeSage (2009) use an
SDM specification to consider changes in the (logged) number of workers in the US census tracts with
commuting times exceeding 45 minutes one way, between 1990 and 2000 (See also the example in Section
2.3.4). The motivation of this investigation is the fact that the percentage of the US workers with these
long commute times in 1990 was 12.5% compared to 15.4% in 2000, an increase of more than 10%. When
deciding which model to estimate, they note that spillover impacts from an increase in commuters traveling
long distances to work would seem global in nature, since the congestion effects of more travelers on one
segment of a metropolitan area roadway network impact travel times of other travelers on the entire network.
Furthermore, they state that feedback effects seem likely, since congestion arising from commuting decisions by workers in one tract will spill over to neighboring tracts, which in turn creates congestion feedback to the own tract. These two observations led the authors to specify the following SDM:

y = ρWy + αın + Xβ + WXθ + ε,

where y is the (logged) number of workers with long commute times, X includes variables related to the location decisions of households (the age, gender and income distribution of the resident population, and geographical characteristics of the tract), and WX includes these same characteristics of neighboring census tracts. Based on a comparison of direct, indirect and total effects estimates from the 1990 and 2000 models, they conclude that the suite of variables reflecting the age and gender distribution of population in the tracts represents the primary explanation for changes in the number of workers with long commute times between 1990 and 2000. The spillover impacts of the number of employed females in the 1990 model were positive, suggesting that more employed females in a tract produced an increase in long commute times for neighboring tract commuters. In contrast, for the 2000 model, spillovers associated with employed females were negative, so that more employed females in a tract reduced long commute times for workers located in neighboring tracts.
Example 2.3 — Effect of Pollution on Housing Prices. Kim et al. (2003) use a spatial-lag hedonic model in order to assess the direct and indirect effects of air quality on housing prices. The main model is the following:
p = ρW p + X1 β1 + X2 β2 + X3 β3 + ε,
where p is the vector of housing prices, ρ is a spatial autocorrelation parameter, W is the n × n spatial weight
matrix, X1 is a matrix with observations on structural characteristics, X2 is a matrix with observations on
neighborhood characteristics, and X3 is a matrix with observations on environmental quality (SO2 and NOx ).
The marginal implicit price (marginal benefit) of the hedonic equation is derived as:

[∂E(p)/∂x1r  ∂E(p)/∂x2r  ...  ∂E(p)/∂xnr] = A(W)−1 In βr, where A(W)−1 = (In − ρW)−1.
Focusing on the first row, the interpretation is the following: the housing price at location 1 is not only affected by a marginal change in air quality at location 1, but is also affected by marginal changes in air quality at other locations. That is, the total impact of a change in air quality on the housing price at location 1 is the sum of the direct impact ∂p1/∂x1k plus the induced impacts Σ_{i=2}^{n} ∂p1/∂xik (see our Definition 2.3.4).
An important point evidenced by Kim et al. (2003) is that, if the row-sums of W are less than or equal to one and ρ is in the proper parameter space, i.e., ρ < 1, then the total average effect can be computed as βr/(1 − ρ). To see this, note that for a row-stochastic W we have Wın = ın, so that (In − ρW)−1 ın = (1 + ρ + ρ² + ...)ın = ın/(1 − ρ), and therefore:

(1/n) ı>n [A(W)−1 In βr] ın = βr/(1 − ρ). (2.36)
The model is estimated in a semi-log functional form, therefore the estimated coefficients can be inter-
preted as semi-elasticities. In particular, note that the elasticity for SO2 is given by:
εSO2 = (SO2/p) · (dp/dSO2)
     = (SO2/p) · [βr/(1 − ρ)] · p   (since the model is log-lin) (2.37)
     = [βr/(1 − ρ)] · SO2
Using the estimated ρ̂ = 0.549 and replacing SO2 by its mean value, they obtain an elasticity of housing prices with respect to air quality of about 0.348. The marginal benefit per household of a permanent 4% improvement in air quality, computed using βSO2(In − ρW)−1p, is about $2,333 (1.43% of mean house value) for owners.
Example 2.4 — Human Capital and Labor Productivity. Fischer et al. (2009) analyze the role of human capital
in explaining labor productivity variation among European regions. In particular, they estimate the following
model:
y = ρW y + Xβ + W Xγ + ε
where y is the vector of observations on the (log of) labor productivity level at the end of the sample period
(2004) and X contains (the log of) labor productivity and human capital at the beginning of the sample
period (1995). The parameter ρ is expected to be positive indicating that regional productivity levels are
positively related to a linear combination of neighboring regions’ productivity. The parameter vector γ
captures two types of spatial externalities: spatial effects working through the level of labor productivity, and spatial effects working through the level of human capital, both at the beginning of the sample period.
The estimate of the spatial autoregressive parameter is ρ̂ = 0.664, providing evidence for the existence of significant spatial effects working through the dependent variable.
The mean direct impact for the human capital is 0.1317, whereas the indirect impact is -0.1968. They
interpret the indirect impact in two ways. First, they argue that the indirect impact reflects how a change
in the human capital level of all regions by some constant would impact the labor productivity of a typical
region (observation). The sign of the estimated mean indirect impact implies that an increase in the initial
level of human capital of all other regions would decrease the productivity level of a typical region. This
indirect impact takes into account the fact that the change in initial human capital level negatively impacts
other regions’ labor productivity, which in turn negatively influences our typical region’s labor productivity
due to the presence of positive spatial dependence on neighboring regions’ labor productivity levels.
Second, Fischer et al. (2009) measure the cumulative impact of a change in region i's initial level of human capital, averaged over all other regions. The impact from changing a single region's initial level of human
capital on each of the other region’s labor productivity is small, but cumulatively the impact measures
-0.1968.
R A very good paper for those interested in making the connection between global/local spillovers and different spatial model specifications is LeSage (2014). This is a must-read paper.
R Cross-sectional observations can be viewed as reflecting a (comparative static) slice, at one point in time, of a long-run steady-state equilibrium relationship, and the partial derivatives can be viewed as a comparative static analysis of changes that represent the new steady-state relationship that would arise (LeSage, 2014).
Intuition tells us that impacts arising from a change in the explanatory variables will influence low-order neighbors more than higher-order neighbors. Therefore, we would expect a decline in the impacts' magnitude as we move from lower- to higher-order neighbors. To get a better idea of this process, it is necessary to consider the matrix Sr(W) and recognize, by Lemma 2.3, that this matrix can be expressed as a linear combination of powers of the weight matrix W. In particular, recall that if W is a row-standardized matrix such that ρ ∈ (−1, 1), then by Lemma 2.3:
[∂E(y)/∂x1r  ∂E(y)/∂x2r  ...  ∂E(y)/∂xnr] ≈ (In + ρW + ρ²W² + ρ³W³ + ... + ρ^l W^l) In βr (2.38)
This expression allows us to observe the impact associated with each power of W, where these powers correspond to the observations themselves (zero-order), immediate neighbors (first-order), neighbors of neighbors (second-order), and so on. Using this expansion we can decompose both the marginal and the cumulative direct, indirect and total effects associated with different orders of neighbors.
We observe the following set of sample data for these regions, which relates travel times to the CBD (in minutes), contained in the dependent variable vector y, to population density (population per square block) and distance (in miles), contained in the two columns of the matrix X.³
³ This example is further explored in Kirby and LeSage (2009) with a real application.
According to LeSage and Pace (2010), the pattern of longer travel times for more distant regions R1
and R7 versus nearer R3 and R5 found in vector y seems to clearly violate independence, since travel times
appear similar for neighboring regions (see also Example 2.2). However, one can argue that the observed pattern is not due to spatial dependence, but rather is explained by the variables Distance and Density associated with each region, since these also appear similar for neighboring regions. Note that even for individuals residing in the CBD, it takes time to travel somewhere else within the CBD; therefore, the travel time for intra-CBD travel is 26 minutes despite a distance of 0 miles.
If we assume that the observed data were collected on a given day and averaged over a 24-hour period, it can be hypothesized that congestion effects arising from the shared highway explain the observed pattern of travel times. It is reasonable to claim that longer travel times in one region should lead to longer
travel times in neighboring regions on any given day. This is because commuters pass from one region to
another as they travel along the highway to the CBD.
Congestion effects represent one type of spatial spillover; they do not occur simultaneously, but require some time for the traffic delay to arise. From a modeling point of view, this effect cannot be captured by an OLS model with distance and density as independent variables. These are dynamic feedback effects from travel
time on a particular day that impact travel times of neighboring regions in the short time interval required
for the traffic delay to occur. Since the explanatory variable distance would not change from day to day, and
population density would change very slowly on a daily time scale, these variables would not be capable of
explaining daily delay phenomena.
A better way of explaining congestion is by the following DGP:
y = ρ0 W y + Xβ0 + ε,
such that:
ŷ = (In − ρ̂W)−1 Xβ̂,

where the estimated parameters are β̂ = (0.135, 0.561)> and ρ̂ = 0.642 (assume that somehow we have estimated these parameters). Note that the estimated spatial autoregressive parameter indicates positive spatial dependence in the commuting times.
Computing Effects in R
Now think about the following question: What would be the estimated spillovers if region R2 doubles its
population density? To answer this question we first obtain the predicted values of travel times before the
change.4 That is, we first obtain:
ŷ(1) = (In − ρ̂W)−1 Xβ̂.
4 Note that there is a typo in LeSage and Pace (2010), because in their equation (1.19) they double distance, not density.
# Estimated coefficients
b <- c(0.135, 0.561)
rho <- 0.642
# W and X
X <- cbind(c(10, 20, 30, 50, 30, 20, 10),
c(30, 20, 10, 0, 10, 20, 30))
W <- cbind(c(0, 1, 0, 0, 0, 0, 0),
c(1, 0, 1, 0, 0, 0, 0),
c(0, 1, 0, 1, 0, 0, 0),
c(0, 0, 1, 0, 1, 0, 0),
c(0, 0, 0, 1, 0, 1, 0),
c(0, 0, 0, 0, 1, 0, 1),
c(0, 0, 0, 0, 0, 1, 0))
Ws <- W / rowSums(W)
# Prediction
yhat_1 <- solve(diag(nrow(W)) - rho * Ws) %*% crossprod(t(X), b)
Now we estimate the predicted values of travel times after the change in population density in R2 using:
ŷ(2) = (In − ρ̂W)−1 X̃β̂, (2.39)

where X̃ is the new matrix reflecting a doubling of the population density of region R2.⁵ A comparison of the predictions ŷ(1) and ŷ(2) illustrates how the model generates spatial spillovers.
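The construction of X̃ (called X_d below) and the second prediction are omitted above; a minimal sketch, assuming that population density is stored in the first column of X (as in the vector b above, whose first element is the density coefficient):

# Create the modified matrix: density of R2 doubled
X_d <- X
X_d[2, 1] <- 2 * X[2, 1]
# Prediction after the change
yhat_2 <- solve(diag(nrow(W)) - rho * Ws) %*% crossprod(t(X_d), b)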
# Results
result <- cbind(yhat_1, yhat_2, yhat_2 - yhat_1)
colnames(result) <- c("y1", "y2", "y2 - y1")
round(result, 2)
## y1 y2 y2 - y1
## [1,] 41.90 44.46 2.56
## [2,] 36.95 40.93 3.99
## [3,] 29.84 31.28 1.45
## [4,] 25.90 26.43 0.53
## [5,] 29.84 30.03 0.19
## [6,] 36.95 37.03 0.08
## [7,] 41.90 41.95 0.05
sum(yhat_2 - yhat_1)
## [1] 8.846915
The two sets of predictions show that the change in region R2's population density has a direct effect that increases the commuting times for residents of region R2 by ≈ 4 minutes. It also has an indirect or spillover effect that produces an increase in commuting times for the other six regions. Furthermore, it can be noticed that the increases in commuting times for neighboring regions R1 and R3 are the greatest, and these spillovers
5 For more about prediction in the spatial context see Kelejian and Prucha (2007).
decline as we move to regions in the sample that are located farther away from region R2 where the change
in population density occurred.
What are the cumulative indirect impacts? Adding up the increased commuting times across all other regions (excluding the own-region change in commuting time), we find that they equal ≈ 4.86 (= 2.56 + 1.45 + 0.53 + 0.19 + 0.08 + 0.05) minutes, which is larger than the direct (own-region) impact of 4 minutes. Finally, the total impact on all residents of the seven regions from the change in population density of region R2 is the sum of the direct and indirect effects, an increase of 8.85 minutes in travel times to the CBD.
Now assume that the OLS estimates for the example above are β̂OLS = (0.55, 1.25)>. Using these estimates we compute the OLS predictions based on the matrices X and X̃ as shown above.
# Ols prediction
b_ols <- c(0.55, 1.25)
yhat_1 <- crossprod(t(X), b_ols)
yhat_2 <- crossprod(t(X_d), b_ols)
result <- cbind(yhat_1, yhat_2, yhat_2 - yhat_1)
colnames(result) <- c("y1", "y2", "y2 - y1")
round(result, 2)
## y1 y2 y2 - y1
## [1,] 43.0 43.0 0
## [2,] 36.0 47.0 11
## [3,] 29.0 29.0 0
## [4,] 27.5 27.5 0
## [5,] 29.0 29.0 0
## [6,] 36.0 36.0 0
## [7,] 43.0 43.0 0
The results show no spatial spillovers: only the travel time of R2 is affected by the change in population density of region R2. It can also be observed that the OLS prediction for R2 is upward biased. This is the main message here: an OLS model does not allow for spatial spillover impacts and generates biased marginal effects.
Now we further explore our formulas and definitions from the previous section. As we showed in Equation (2.26), the impact of changes in the ith observation of xr on yi is Sr(W)ii. Given the SLM structure of our example, this is equivalent to:

∂E(CTi)/∂densityi = Sdensity(W)ii, where Sdensity = (In − ρW)−1 In βdensity.

We can compute our Sdensity in the following way.
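The computation itself is not shown above; a minimal sketch, reusing rho and Ws from the previous code (b_dens = 0.135 is the density coefficient, the first element of b):

# S matrix for density: (I - rho*W)^(-1) * I * beta_density
b_dens <- 0.135
S <- solve(diag(nrow(Ws)) - rho * Ws) * b_dens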
Then, the direct impact of doubling population density of R2 on the expected value of commuting time
for R2 is given by
# Direct impact of R2 on R2
round(S[2,2] * 20, 2)
## [1] 3.99
Note that this value is the same as that found using the predicted-values procedure: doubling the population density in R2 increases the commuting times for residents of region R2 by ≈ 4 minutes.
Finding the indirect impact on region R1 is similar, given Equation (2.25). The indirect impact on region R1 is given by:
# Indirect impact of R2 on R1
round(S[1,2] * 20, 2)
## [1] 2.56
Again, note that this is the same value computed before: an increase of 100% in the population density of R2 implies an increase in the travel time of region R1 to the CBD of about 2.56 minutes, after considering all feedback effects.
An interesting question is the following: what would be the impact on the commuting time of R1 if population density increased by 20 in all the regions? To answer this question, recall that our Definition 2.3.4 states that the sum across the ith row of Sr(W) represents the total impact on individual observation yi resulting from changing the rth explanatory variable by the same amount across all n observations.
# Total impact to R1 (row sum of S)
round(sum(S[1, ]) * 20, 2)
## [1] 7.54
This number implies that the total impact to R1 will be an increase of commuting time of ≈ 7.5 minutes.
Using the formula for ATIT gives the same result:
# ATIT
n <- nrow(W)
vones <- rep(1, n)
round(((t(vones) %*% S %*% vones) / n ) * 20, 2)
## [,1]
## [1,] 7.54
Similarly, we could ask: what would be the impact of increasing density by 20 in R1 on all the regions? This is equivalent to our Definition 2.3.5, which states that the sum down the jth column of Sr(W) yields the total impact over all yi from changing the rth explanatory variable by an amount in the jth observation.
# ATIF
round(sum(S[, 1]) * 20, 2)
## [1] 5.54
In words, increasing density by 20 in R1 would imply a total effect across all the regions of about 5.54 minutes.
Imagine that you are a policy maker and you are considering implementing a policy to reduce population density and hence reduce commuting times in the regions. However, given that resources are scarce, you must select one region in which to implement this policy. To produce the greatest overall effect of the policy (considering feedback effects), you could use the estimated spatial model and look for the region with the greatest overall impact. Basically, this involves calculating the column sum of Sr(W) for each region in the following way:
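A minimal sketch of this computation, reusing the S matrix from above (the region names are added only for readability):

# Total impact from each region: column sums of S
cs <- colSums(S)
names(cs) <- paste0("R", 1:7)
round(cs, 2)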
## R1 R2 R3 R4 R5 R6 R7
## 0.28 0.44 0.40 0.39 0.40 0.44 0.28
Note that decreasing population density by 1 will produce a greater reduction in commuting time if applied in regions R2 and R6 (why?).
Finally, the average direct, indirect and total effects of an increase of 1 in population density in all the regions can be computed as follows.
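One way to obtain these three summary measures, reusing S, n, and vones from the code above (the original code is omitted):

# Average direct impact (ADI)
sum(diag(S)) / n
# Average total impact (ATI)
(t(vones) %*% S %*% vones) / n
# Average indirect impact (AII = ATI - ADI)
(t(vones) %*% S %*% vones) / n - sum(diag(S)) / n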
## [1] 0.1837
## [,1]
## [1,] 0.3771
## [,1]
## [1,] 0.1934
In Equation (2.36) of Example 2.3 we showed that the total effect can also be computed as βr/(1 − ρ). We now show that this proposition is true for our example:
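The one-line check (omitted in the original) is:

# Total effect computed as beta_r / (1 - rho)
b_dens / (1 - rho)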
## [1] 0.377095
Cumulative Effects
The main idea of this exercise is to show how a change in some explanatory variable produces changes in the dependent variable in all the spatial units, by decomposing the effects into cumulative and marginal impacts for different orders of neighbors, as explained in Section 2.3.3.
First, we load the package expm, which allows us to compute powers of matrices in a loop. Then we create the estimated coefficients along with the W matrix:
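A minimal sketch of this setup, reusing Ws from the example above (the out matrix is a container for the results printed below):

# Packages and estimated coefficients
library("expm")   # provides the %^% matrix-power operator
b_dens <- 0.135   # density coefficient
rho <- 0.642      # estimated spatial autoregressive parameter
n <- nrow(Ws)
# Container for the partitioned impacts at orders 0 to 10
out <- matrix(NA, 11, 3,
              dimnames = list(paste0("W^", 0:10),
                              c("direct", "indirect", "total")))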
In order to create the decomposition for the ADI, AII and ATI, we create the following loop from q = 0
to q = 10:
for (q in 0:10) {
  if (q == 0) { # If q = 0, then Sr = I * beta
    S <- diag(n) * b_dens
  } else {
    S <- (rho ^ q * (Ws %^% q)) * b_dens
  }
  adi <- sum(diag(S)) / n   # partitioned direct impact at order q
  ati <- sum(S) / n         # partitioned total impact at order q
  out[q + 1, ] <- c(adi, ati - adi, ati)
}
# Print results
round(out, 4)
round(colSums(out), 4)
This table shows both the cumulative and partitioned direct, indirect and total impacts associated with orders 0 to 10 for the SLM. The cumulative direct impact from the previous section is equal to 0.1837, which, given the coefficient 0.1350, indicates that there is a feedback equal to (0.1837 − 0.1350) = 0.0487 arising from each region impacting neighbors that in turn impact neighbors to neighbors, and so on.
The column sums of the matrix out show that by the time we reach 10th-order neighbors we have accounted for 0.1834 of the 0.1837 cumulative direct effect. It is important to note that for W⁰ there is no indirect effect, only direct effects, and for W¹ there is no direct effect, only indirect. To see this, note that when q = 0 we obtain W⁰ = In:
Ws %^% 0
Thus, we have Sr(W) = In βr = 0.1350 In. When q = 1 we have only indirect effects, since all diagonal elements of the matrix W are zero. This also occurs for q = 3, 5, 7, 9:
Ws %^% 1
Ws %^% 3
Also, the row-stochastic nature of W leads to an average row sum of βr × ρ = 0.135 × 0.642 = 0.0867 when q = 1.
The matrix out also shows that both direct and indirect effects die out as the order of neighbors increases; however, the indirect (spatial spillover) effects decay more slowly as we move to higher-order neighbors.
Part II

Estimation Methods

3 Maximum Likelihood Estimation
In this chapter we begin the study of estimation methods for spatial models. In particular, we focus on the maximum likelihood estimation method. However, it is important to know some basics of the different estimation methods first.
Spatial econometric models can be estimated by maximum likelihood (ML) (Ord, 1975), quasi-maximum
likelihood (QML) (Lee, 2004), instrumental variables (IV) (Anselin, 1988, pp. 82-86), generalized method
of moments (GMM) (Kelejian and Prucha, 1998, 1999), or by Bayesian Markov Chain Monte Carlo method
(Bayesian MCMC) (LeSage, 1997).
As we will see in this chapter, the main drawback of ML estimation is the assumption of normality of the error terms. QML and IV/GMM have the advantage that they do not rely on the assumption of normality of the disturbances. However, both estimators assume that the disturbance terms are independently and identically distributed for all i with zero mean and variance σ². The IV/GMM estimator has the disadvantage that the estimate for ρ or λ may fall outside the parameter space; in ML estimation these coefficients are restricted to the interval (1/ωmin, 1/ωmax) by the Jacobian term, where ωmin and ωmax denote the smallest and largest eigenvalues of W. This issue motivated the development of the IV/GMM approaches, which do not require the Jacobian term. To instrument the spatially lagged dependent variable, Kelejian et al. (2004) suggest [X, WX, ..., W^g X], where g is a pre-selected constant.
To see the consequences of applying OLS to a spatial process, consider first the pure SLM (a model without exogenous covariates):

y = ρ0 Wy + ε, (3.1)
where ρ0 is the true population parameter of the data generating process (DGP). The reduced form for the
pure SLM in (3.1) is:
y = (In − ρ0 W)−1 ε. (3.2)
As a result, the spatial lag term equals:
Wy = W(In − ρ0 W)−1 ε. (3.3)
This result will be useful later. Now, recall that if the model is y = Xβ + ε, then the OLS estimator is β̂ = (X>X)−1 X>y. Then, considering (3.1), the OLS estimator for ρ0 is:
ρ̂OLS = [(Wy)>(Wy)]−1 (Wy)>y. (3.4)
Substituting the expression for y in the population equation (3.1) into (3.4) gives us the following sampling
error equation:
ρ̂OLS = ρ0 + [(Wy)>(Wy)]−1 (Wy)>ε
     = ρ0 + (Σ_{i=1}^{n} y²Li)−1 (Σ_{i=1}^{n} yLi εi),
where yLi is the ith element of the spatial lag operator W y = yL . Assuming that W is nonstochastic, the
mathematical expectation of ρbOLS is
E(ρ̂OLS | W) = ρ0 + E{[(Wy)>(Wy)]−1 (Wy)>ε | W}
            = ρ0 + E[(Σ_{i=1}^{n} y²Li)−1 (Σ_{i=1}^{n} yLi εi) | W]. (3.5)
From (3.5) it is clear that if the expectation of the last term is zero, then ρbOLS will be unbiased. However,
note that
E(Σ_{i=1}^{n} yLi εi | W) = E[(Wy)>ε | W]
 = E[ε>(In − ρW>)−1 W>ε | W]   using (3.3)
 = E[ε>C>ε | W]
 = E[tr(ε>C>ε) | W]
 = E[tr(C>εε>) | W]
 = σ² tr(C)
 ≠ 0, (3.6)
where C = W(In − ρW)−1. Therefore, given the result in (3.6), we have that E(ρ̂OLS | W) = ρ0 if and only if tr(C) = 0, which occurs only if ρ0 = 0: if ρ = 0, then C = W and tr(C) = tr(W) = 0, because the diagonal elements of W are zeros (see Definition 3.1.1 for properties of the trace). In other words, if the true model follows a spatial autoregressive structure, the OLS estimate of ρ will be biased.
Definition 3.1.1 — Some useful results on the trace. The trace of a square matrix A, denoted tr(A), is defined as the sum of the elements on the main diagonal of A:

tr(A) = Σ_{i=1}^{n} aii = a11 + a22 + ... + ann, (3.7)
where aii denotes the entry on the ith row and ith column of A.
Some properties. Let A and B be square matrices and c a scalar. Then:
1. tr(A + B) = tr(A) + tr(B) and tr(cA) = c · tr(A).
2. tr(A) = tr(A>).
3. tr(AB) = tr(BA).
4. Trace of an idempotent matrix: Let A be an idempotent matrix, then tr(A) = rank(A).
What about consistency? Note that we can write:
ρ̂OLS = ρ0 + [(1/n) Σ_{i=1}^{n} y²Li]−1 [(1/n) Σ_{i=1}^{n} yLi εi]. (3.10)
Under ‘some conditions’ we can show that:
(1/n) Σ_{i=1}^{n} y²Li →p q, (3.11)

where q is some finite scalar (we need some assumptions here about ρ and the structure of the spatial weight matrix). However, for the second term we obtain:
(1/n) Σ_{i=1}^{n} yLi εi →p E(yLi εi) = (1/n) tr(C) · σ² ≠ 0. (3.12)
As a result, the presence of the spatial weight matrix produces a quadratic form in the error terms, which in turn introduces a form of endogeneity, because the spatial lag Wy is correlated with the disturbance vector ε. Therefore ρ̂OLS is inconsistent, and we need to account for the simultaneity either within a maximum likelihood estimation framework or by using a proper set of instrumental variables.
R Lee (2002) shows that in some cases the OLS estimator may still be consistent and even asymptotically efficient relative to some other estimators.
To illustrate this bias, consider a small Monte Carlo experiment with the DGP:

y = ρ0 Wy + ε, (3.13)

where the true value is ρ0 = 0.7; the sample size for each sample is n = 225; εi ∼ N(0, 1); and W is an artificial n × n weight matrix, constructed from a neighbor list for rook contiguity on a 15 × 15 regular lattice.
The syntax for creating the global parameters for the simulation in R is the following:
# Global parameters
library("spdep") # Load package
set.seed(123) # Set seed
S <- 100 # Number of simulations
n <- 225 # Spatial units
rho <- 0.7 # True rho
w <- cell2nb(sqrt(n), sqrt(n)) # Create artificial W matrix
iw <- invIrM(w, rho) # Compute inverse of (I - rho*W)
rho_hat <- vector(mode = "numeric", length = S) # Vector to save results.
The function cell2nb creates a list of neighbors for a grid of cells; by default it creates neighbors based on the rook criterion. The invIrM function generates the full weight matrix W, checks that ρ lies in its feasible range between 1/ωmin and 1/ωmax, where ω denotes the eigenvalues of W, and returns the n × n inverted matrix (In − ρW)−1.
The loop for the simulation is the following:
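The loop itself is omitted in the original; a minimal sketch, assuming a row-standardized listw object (an assumption) and an OLS regression without intercept, as in Equation (3.4):

# Simulation loop (W is fixed, so everything W-related is created outside)
lw <- nb2listw(w, style = "W")
for (s in 1:S) {
  eps <- rnorm(n)                     # draw the disturbances
  y <- as.vector(iw %*% eps)          # DGP: y = (I - rho*W)^(-1) * eps
  Wy <- lag.listw(lw, y)              # spatial lag of y
  rho_hat[s] <- coef(lm(y ~ Wy - 1))  # OLS estimate of rho
}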
Note that since W is considered fixed (nonstochastic), it is created outside the simulation loop. The results are the following:
# Summary of rho_hat
summary(rho_hat)
It can be noticed that the estimated ρ ranges roughly from 0.8 to 1.2; that is, the range does not include the true parameter ρ0 = 0.7. Moreover, the mean of the estimated parameters is about 1, which is very far from 0.7! We can conclude that the OLS estimator of the pure SLM is highly biased.
Finally, we can plot the sampling distribution of the estimated parameters in the following way:
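A minimal sketch of the plotting code (omitted in the original), using a kernel density estimate:

# Sampling distribution of the OLS estimates of rho
plot(density(rho_hat), main = "", xlab = "OLS estimates of rho")
abline(v = rho, col = "red", lty = 2)  # true value rho0 = 0.7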
Figure 3.1 presents the sampling distribution of ρ estimated by OLS for each sample in the Monte Carlo simulation study. The observed pattern is the same as previously discussed: the distribution does not contain ρ0 = 0.7.
Notes: This graph shows the sampling distribution of ρ estimated by OLS for each sample in the Monte Carlo
simulation study. The true DGP follows a pure Spatial Lag Model where the true parameter is ρ0 = 0.7
Consider now the SLM with exogenous covariates:

y = ρ0 Wy + Xβ0 + ε,
ε ∼ N(0n, σ0² In), (3.14)

where y is an n × 1 vector collecting the dependent variable for the spatial units; W is an n × n spatial weight matrix; X is an n × K matrix of independent variables; β0 is a K × 1 vector of parameters; ρ0 measures the degree of spatial correlation; and ε is an n × 1 vector of error terms. Note that we are making the explicit assumption that the error terms follow a multivariate normal distribution with mean 0 and variance-covariance matrix σ0² In. That is, we are assuming that all spatial units have the same error variance.
Since we are explicitly assuming the distribution of the error term, we will be able to use the maximum likelihood estimation procedure. Under the maximum likelihood criterion, the parameter estimates θ̂ = (β̂>, ρ̂, σ̂²)> are chosen so as to maximize the probability of generating or obtaining the observed sample.
However, it should be noted that ML estimation is a highly parametric approach, which means that it is based
on strong assumptions. We will see that within these assumptions, it has optimal asymptotic properties (such
as consistency and asymptotic efficiency), but when the assumptions are violated, the optimal properties may
no longer hold.
How can we estimate θ0? Note that we could rearrange the model as:

y − ρ0 Wy = Xβ0 + ε.

Following the derivation of the linear model, an estimate for β0 would be:

β̂(ρ0) = (X>X)−1 X>(In − ρ0 W)y,

which depends on ρ0. Given this, an estimate for the variance parameter would be σ̂²(ρ0) = (1/n)[(In − ρ0 W)y − Xβ̂(ρ0)]>[(In − ρ0 W)y − Xβ̂(ρ0)]. Concentrating out β and σ² in this fashion will reduce maximum likelihood to a univariate optimization problem in the parameter ρ. This will be very useful later in order to derive the ML algorithm.
In order to derive the joint distribution of the data, we need to find the probability density function f(y1, y2, ..., yn | X; θ) = f(y | X; θ), that is, the joint conditional distribution of y given X. Using the Transformation Theorem, we need the following transformation:

f(y | X; θ) = f(ε(y) | X; θ) · |det(∂ε/∂y)|.
Recall that the model can be written as ε = Ay − Xβ with A = In − ρW, where Ay is the spatially filtered dependent variable, i.e., with the effect of spatial autocorrelation taken out. Note that ε = f(y); that is, the unobserved error is a function of the observed y.¹ To move from the distribution of the error term to the distribution of the observable random variable y we need the Jacobian of the transformation:
¹ Since the yi, and not the εi, are the observed quantities, the parameters must be estimated by maximizing L(y), not L(ε). For more details about this, see Mead (1967) and Doreian (1981).
det(∂ε/∂y) = det(J) = det(A) = det(In − ρW),

where J = ∂ε/∂y is the n × n Jacobian matrix, and det(In − ρW) is the determinant of an n × n matrix. In contrast to the time-series case, the spatial Jacobian is not the determinant of a triangular matrix, but of a full matrix, which may complicate its computation considerably. Recall that this Jacobian reduces to a scalar 1 in the standard regression model, since the partial derivative becomes |∂(y − Xβ)/∂y| = |In| = 1.
Using the density function of the multivariate normal distribution, we can find the joint pdf of ε | X. Recognizing that ε ∼ N(0, σ² In), we can write:

f(ε | X) = (2π · σ²)−n/2 exp[−(1/2σ²) ε>ε].
Given an iid sample of n observations, y and X, the joint density of the observed sample is:

f(y | X; θ) = (2π · σ²)−n/2 exp[−(1/2σ²)(Ay − Xβ)>(Ay − Xβ)] · |det(∂(Ay − Xβ)/∂y)|.
Note that the likelihood function is defined as the joint density treated as a function of the parameters:
L(θ|y, X) = f (y|X; θ). Finally, the log-likelihood function, which will be maximized, takes the form:2
log L(θ) = log |A| − (n/2) log(2π) − (n/2) log(σ²) − (1/2σ²)(Ay − Xβ)>(Ay − Xβ)
         = log |A| − (n/2) log(2π) − (n/2) log(σ²) − (1/2σ²)[y>A>Ay − 2(Ay)>Xβ + β>X>Xβ], (3.15)
where this development uses the fact that the transpose of a scalar is the scalar itself, i.e., y>A>Xβ = (y>A>Xβ)> = β>X>Ay. This is similar to the typical linear-normal likelihood, except that the transformation from ε to y contributes not the usual factor of 1, but the extra term log |A|.
Before deriving the score vector, the following derivatives will be useful:

∂(ρW)/∂ρ = W (3.16)

∂A/∂ρ = ∂(In − ρW)/∂ρ = −W (3.17)

∂ log |A|/∂ρ = tr(A−1 ∂A/∂ρ) = tr[A−1(−W)] (3.18)

Let ε = Ay − Xβ; then:

∂ε/∂ρ = ∂(Ay − Xβ)/∂ρ = −Wy (3.19)
² Since the constant −(n/2) log(2π) is not a function of any of the parameters, some software programs do not include it when reporting the maximized log-likelihood. See Bivand and Piras (2015).
∂(ε>ε)/∂ρ = ε>(∂ε/∂ρ) + (∂ε>/∂ρ)ε = 2ε>(∂ε/∂ρ) = −2ε>Wy (3.20)

∂A−1/∂ρ = −A−1(∂A/∂ρ)A−1 = A−1 W A−1 (3.21)

∂ tr(A−1 W)/∂ρ = tr[(∂A−1/∂ρ)W] (3.22)
Taking the derivative of Equation (3.15) with respect to β, we obtain:

∂ log L(θ)/∂β = −(1/2σ²)[−2X>(Ay) + 2X>Xβ] = (1/σ²) X>(Ay − Xβ), (3.23)
and with respect to σ²:

∂ log L(θ)/∂σ² = −n/(2σ²) + (1/2σ⁴)(Ay − Xβ)>(Ay − Xβ). (3.24)

Setting these derivatives to zero and solving yields:

β̂ML(ρ) = (X>X)−1 X>Ay (3.25)

σ̂²ML(ρ) = (1/n)(Ay − Xβ̂ML)>(Ay − Xβ̂ML). (3.26)

Note that, conditional on ρ (assuming we know ρ), these estimates are simply OLS applied to the spatially filtered dependent variable Ay and the explanatory variables X. Moreover, after some manipulation, Equation (3.25) can be re-written as:
β̂ML(ρ) = β̂O − ρβ̂L, (3.27)

where β̂O = (X>X)−1 X>y and β̂L = (X>X)−1 X>Wy are the estimates from the auxiliary regressions of y and Wy on X, with residuals:

eO = y − Xβ̂O and eL = Wy − Xβ̂L. (3.28)

Then, plugging (3.27) into (3.26):

σ̂²ML[β̂ML(ρ), ρ] = (1/n)(eO − ρeL)>(eO − ρeL). (3.29)
Note that both (3.27) and (3.29) rely only on observables, except for ρ, and so are readily calculable given some estimate of ρ. Therefore, plugging (3.27) and (3.29) back into the likelihood (3.15), we obtain the concentrated log-likelihood function:

log Lc(θ) = −n/2 − (n/2) log(2π) − (n/2) log[(1/n)(eO − ρeL)>(eO − ρeL)] + log |In − ρW|, (3.30)
which is a nonlinear function of a single parameter ρ. A ML estimate of ρ, ρ̂ML, is obtained from a numerical optimization of the concentrated log-likelihood function (3.30). Once we obtain ρ̂, we can easily obtain β̂. The procedure can be summarized in the following steps:
Algorithm 3.1 — ML estimation of SLM. The algorithm to perform the ML estimation of the SLM is the
following:
1. Perform the two auxiliary regressions of y and Wy on X to obtain β̂O and β̂L as in Equation (3.27).
2. Compute the residuals eO = y − Xβ̂O and eL = Wy − Xβ̂L as in Equation (3.28).
3. Maximize the concentrated likelihood given in Equation (3.30) by numerical optimization to obtain an estimate ρ̂.
4. Plug ρ̂ back into the expressions for β (Equation 3.25) and σ² (Equation 3.26).
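As a concrete illustration, the following is a minimal sketch of Algorithm 3.1 (ml_slm is a hypothetical helper, not the spdep implementation; it assumes a row-standardized weight matrix Ws with real eigenvalues, and uses Ord's eigenvalue expression for the Jacobian discussed in the next section):

ml_slm <- function(y, X, Ws) {
  n <- length(y)
  Wy <- as.vector(Ws %*% y)
  bO <- solve(crossprod(X), crossprod(X, y))   # step 1: auxiliary regressions
  bL <- solve(crossprod(X), crossprod(X, Wy))
  eO <- as.vector(y - X %*% bO)                # step 2: residuals
  eL <- as.vector(Wy - X %*% bL)
  omega <- Re(eigen(Ws, only.values = TRUE)$values)
  cloglik <- function(rho) {                   # concentrated log-likelihood (3.30)
    e <- eO - rho * eL
    -(n / 2) * log(sum(e^2) / n) + sum(log(1 - rho * omega))
  }
  rho <- optimize(cloglik, interval = c(1 / min(omega) + 1e-6, 1 - 1e-6),
                  maximum = TRUE)$maximum      # step 3: numerical optimization
  beta <- as.vector(bO - rho * bL)             # step 4: plug rho back in
  sigma2 <- sum((eO - rho * eL)^2) / n
  list(rho = rho, beta = beta, sigma2 = sigma2)
}

In practice one would use lagsarlm from spdep (see Section 3.5); the sketch only mirrors the algebra above.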
Since the score function will be important for understanding the asymptotic theory of the MLE, we will also derive ∂ log L(θ)/∂ρ. Taking the derivative of Equation (3.15) with respect to ρ, we obtain:

∂ log L(θ)/∂ρ = ∂ log |A|/∂ρ − (1/2σ²) ∂(ε>ε)/∂ρ
             = −tr(A−1 W) + (1/2σ²) · 2ε>Wy   using (3.18) and (3.20)
             = −tr(A−1 W) + (1/σ²) ε>Wy. (3.31)
Thus the complete gradient (or score function) is:

∇θ log L(θ) = [ ∂ log L(θ)/∂β  ]   [ (1/σ²) X>ε               ]
              [ ∂ log L(θ)/∂σ² ] = [ (1/2σ⁴)(ε>ε − nσ²)       ]
              [ ∂ log L(θ)/∂ρ  ]   [ −tr(A−1 W) + (1/σ²) ε>Wy ]

Alternatively, substituting Wy = C(Xβ + ε), where C = W A−1, the score with respect to ρ can be written as:

∂ log L(θ)/∂ρ = (1/σ²)(CXβ)>ε + (1/σ²)[ε>Cε − σ² tr(C)].
The main computational burden in maximizing (3.30) is the repeated evaluation of the Jacobian term log |In − ρW|. Ord (1975) suggested expressing the determinant in terms of the eigenvalues ω1, ..., ωn of W:

|In − ρW| = Π_{i=1}^{n} (1 − ρωi),

so that:

log |In − ρW| = Σ_{i=1}^{n} log(1 − ρωi). (3.32)

The advantage of this approach is that the eigenvalues only need to be computed once, which carries some overhead, but greatly speeds up the calculation of the log-likelihood at each iteration. In practice, in all but the smallest data sets (< 4000 observations), Ord's approach will be faster than the brute-force approach.
This formulation also gives us the feasible domain of ρ: we need 1 − ρωi ≠ 0 for all i, which holds for 1/ωmin < ρ < 1/ωmax. For a row-standardized matrix, the largest eigenvalue is 1. With this expression for the Jacobian, the concentrated log-likelihood function becomes:

log Lc(θ) = −n/2 − (n/2) log(2π) − (n/2) log[(1/n)(eO − ρeL)>(eO − ρeL)] + Σ_{i=1}^{n} log(1 − ρωi)
          = const − (n/2) log[(1/n)(e>O eO − 2ρ e>L eO + ρ² e>L eL)] + Σ_{i=1}^{n} log(1 − ρωi). (3.33)
Another approach is the characteristic root method outlined in Smirnov and Anselin (2001). This approach allows for the estimation of spatial lag models for very large data sets (> 100,000 observations) in a very short time. However, it is limited by the requirement that the weight matrix be intrinsically symmetric, which precludes the use of asymmetric weights such as k-nearest-neighbor weights. For other approximations see LeSage and Pace (2010, chapter 4).
3.2.4 Hessian
The Hessian matrix will be very important in the following sections to obtain the asymptotic variance-
covariance matrix. For this reason we devote a complete section in order to derive this matrix for the SLM.
In this case, the Hessian is a (K + 2) × (K + 2) matrix of second derivatives given by:

H(β, σ², ρ) = [ ∂²ℓ/∂β∂β>   ∂²ℓ/∂β∂σ²   ∂²ℓ/∂β∂ρ  ]
              [ ∂²ℓ/∂σ²∂β>  ∂²ℓ/∂(σ²)²  ∂²ℓ/∂σ²∂ρ ]
              [ ∂²ℓ/∂ρ∂β>   ∂²ℓ/∂ρ∂σ²   ∂²ℓ/∂ρ²   ]

where ℓ = log L(β, σ², ρ). The blocks involving β are:
∂² log L(θ)/∂β∂β> = −(1/σ²) X>X (3.34)

∂² log L(θ)/∂β∂σ² = −(1/σ⁴) X>ε (3.35)

∂² log L(θ)/∂β∂ρ = −(1/σ²) X>Wy. (3.36)
Using the first derivative (3.24) and working on the cross-derivatives for σ², we obtain:

∂² log L(θ)/∂(σ²)² = n/(2σ⁴) − (1/σ⁶) ε>ε, (3.37)
and:

∂² log L(θ)/∂σ²∂ρ = (1/2σ⁴) · 2ε>(∂ε/∂ρ)   using Equation (3.20)
                  = −(1/σ⁴) ε>Wy. (3.38)
Finally, working on the second derivative with respect to ρ, and using (3.31), we obtain:

∂² log L(θ)/∂ρ² = −(∂/∂ρ) tr(A−1 W) + (1/σ²)(∂/∂ρ) ε>Wy
               = −tr(A−1 W A−1 W) + (1/σ²)(∂/∂ρ)(Ay − Xβ)>Wy
               = −tr(A−1 W A−1 W) − (1/σ²) y>W>Wy
               = −tr[(W A−1)²] − (1/σ²) y>W>Wy. (3.39)
Collecting the blocks (3.34)–(3.39) gives the full Hessian matrix. To obtain the information matrix, we take expectations of the second derivatives. Recall that ε = Ay − Xβ. It follows that, in terms of expected values:

E[ε | W, X] = 0 (3.40)
E[εε> | W, X] = σ² In, (3.41)

and, for y, E[y | W, X] = A−1 Xβ, so that Wy = W A−1 Xβ + W A−1 ε. Then, for the cross-derivative (3.36):
E[∂² log L(θ)/∂β∂ρ | W, X] = −(1/σ²) X> E[Wy | W, X]
 = −(1/σ²) X> E[W A−1 Xβ + W A−1 ε | W, X]
 = −(1/σ²) X> W A−1 Xβ
 = −(1/σ²) X> CXβ. (3.45)
For (3.37) we obtain:

E[∂² log L(θ)/∂(σ²)² | W, X] = n/(2σ⁴) − (1/σ⁶) E[ε>ε | W, X]
 = n/(2σ⁴) − (1/σ⁶) E[tr(ε>ε) | W, X]
 = n/(2σ⁴) − (1/σ⁶) tr(E[εε> | W, X])
 = n/(2σ⁴) − (1/σ⁶) tr(σ² In)
 = n/(2σ⁴) − n/σ⁴
 = −n/(2σ⁴). (3.46)
From (3.38):

E[∂² log L(θ)/∂σ²∂ρ | W, X] = −(1/σ⁴) E[ε>Wy | W, X]
 = −(1/σ⁴) E[ε>(CXβ + Cε) | W, X]
 = −(1/σ⁴) σ² tr(C)
 = −(1/σ²) tr(C). (3.47)

Finally, for (3.39), use Wy = CXβ + Cε to write y>W>Wy = β>X>C>CXβ + 2ε>C>CXβ + ε>C>Cε, where ε>A−1>W>W A−1 Xβ = (β>X>A−1>W>W A−1 ε)> because it is a scalar. Taking expectations:

E[∂² log L(θ)/∂ρ² | W, X] = −tr[(W A−1)²] − (1/σ²) E[β>X>C>CXβ + 2ε>C>CXβ + ε>C>Cε | W, X]
 = −tr[(W A−1)²] − (1/σ²)[β>X>C>CXβ + tr(C>C) σ²]
 = −tr(C²) − tr(C>C) − (1/σ²)(CXβ)>(CXβ)
 = −tr(Cs C) − (1/σ²)(CXβ)>(CXβ), (3.48)

where Cs = C + C> and C = W A−1, so that tr(Cs C) = tr(C²) + tr(C>C).
Collecting these expectations, the information matrix is:

I(θ) = [ (1/σ²) X>X      0             (1/σ²) X>(CXβ)                  ]
       [ 0>               n/(2σ⁴)       (1/σ²) tr(C)                    ]
       [ (1/σ²)(CXβ)>X   (1/σ²) tr(C)   tr(Cs C) + (1/σ²)(CXβ)>(CXβ)   ]

The asymptotic variance matrix follows as the inverse of the information matrix: Var(β, σ², ρ) = I(θ)−1.
An important feature is that, while the covariance between β and the error variance is zero, as in the standard regression model, this is not the case for ρ and the error variance. This lack of block diagonality in the information matrix for the spatial lag model will lead to some interesting results on the structure of specification tests.
To evaluate the trace terms, we can use the eigenvalues once more. Recall that:

∂ log |A|/∂ρ = −Σ_{i=1}^{n} ωi/(1 − ρωi), (3.51)

so that the asymptotic variance matrix can be computed as:

Var(β, σ², ρ) = [ (1/σ²) X>X   0          (1/σ²) X>(CXβ)                        ]−1
                [ 0>            n/(2σ⁴)    (1/σ²) tr(C)                          ]
                [ ·             ·          α + tr(C>C) + (1/σ²)(CXβ)>(CXβ)       ]

where α = Σ_{i=1}^{n} ωi²/(1 − ρωi)² (the eigenvalue expression for tr(C²)), and the matrix is symmetric.
Consider now the spatial error model (SEM):

y = Xβ0 + u
u = λ0 Wu + ε (3.52)
ε ∼ N(0, σ0² In),

where λ0 is the spatial autoregressive coefficient for the error lag Wu (to distinguish the notation from the spatial autoregressive coefficient ρ in the spatial lag model), and W is the spatial weight matrix. This model does not require a theoretical model for a spatial process but, instead, is consistent with a situation where determinants of the dependent variable omitted from the model are spatially autocorrelated, or with a situation where unobserved shocks follow a spatial pattern (Elhorst, 2014). In summary, the SEM treats spatial correlation primarily as a nuisance.
If λ > 0, then we face positive spatial correlation. This implies clustering of similar values: the errors for spatial unit i tend to vary systematically with the errors for other nearby observations j, so that smaller/larger errors for i tend to go together with smaller/larger errors for j. This violates the typical assumption of no autocorrelation in the error term of OLS.
Under the assumption that the spatial weights matrix is row-standardized and the parameter is less than one in absolute value, the model can also be expressed as:

y = Xβ + (In − λW)−1 ε.

Since u = (In − λW)−1 ε, it can be shown that E(u) = 0. Furthermore, the variance-covariance matrix of u is:

E(uu>) = σ² Ωu−1, (3.53)

where Ωu = (In − λW)>(In − λW). The variance-covariance matrix in (3.53) is a full matrix, implying a spatial autoregressive error process with nonzero error covariance between every pair of observations, decreasing in magnitude with the order of contiguity (Anselin and Bera, 1998). Furthermore, the complex structure of the inverse matrix in (3.53) yields nonconstant diagonal elements in the error covariance matrix, thus inducing heteroskedasticity in u, irrespective of the heteroskedasticity of ε. Finally, u ∼ N(0, σ² Ωu−1).
R The OLS estimates of the model in Equation (3.52) are unbiased, but inefficient if λ ≠ 0.
Given the previous remark, we might use generalized least squares (GLS) for more efficient parameter estimation. Recall that the inefficiency of the OLS estimates of the regression coefficients would invalidate statistical inference in the spatial error model: the invalidity of significance tests arises from biased estimation of the variance and standard errors of the OLS estimates for β.
To set up the ML estimation, write the error term as:

ε = (In − λW)y − (In − λW)Xβ = By − BXβ,

where B = (In − λW). The new error term indicates that ε is a function of y. Recall that in order to create the log-likelihood function we need the joint density function; using the Transformation Theorem, the log-likelihood is:

log L(θ) = −(n/2) log(2π) − (n/2) log(σ²) + log |B| − (1/2σ²)(By − BXβ)>(By − BXβ). (3.54)

Maximizing (3.54) with respect to β yields:
β̂ML(λ) = [X>Ω(λ)X]−1 X>Ω(λ)y
       = [(BX)>(BX)]−1 (BX)>By (3.55)
       = [X(λ)>X(λ)]−1 X(λ)>y(λ),
where:

X(λ) = BX = (In − λW)X = (X − λWX)
y(λ) = By = (y − λWy). (3.56)

If λ is known, this estimator equals the GLS estimator—β̂ML = β̂GLS—and it can be thought of as the OLS estimator resulting from a regression of y(λ) on X(λ). In other words, for a known value of the spatial autoregressive coefficient λ, this is equivalent to OLS on the transformed variables.
In the same way, the first-order condition resulting from the partial derivative of (3.54) with respect to σ² gives the ML estimator for the error variance:

σ̂²ML(λ) = (1/n) ε̂>B>Bε̂ = (1/n) ε̂(λ)>ε̂(λ), (3.57)

where ε̂ = y − Xβ̂ML and ε̂(λ) = B(λ)(y − Xβ̂ML) = B(λ)y − B(λ)Xβ̂ML.
The first-order conditions derived from the likelihood are highly nonlinear, and therefore the likelihood in Equation (3.54) cannot be maximized directly. Again, a concentrated likelihood approach is necessary. The estimators for β and σ² are both functions of the value of λ. A concentrated log-likelihood can then be obtained as:
log Lc(θ) = const − (n/2) log[(1/n) ε̂>B>Bε̂] + log |B|. (3.58)
The residual vector in the concentrated likelihood is itself, indirectly, a function of the spatial autoregressive parameter. A one-time optimization will therefore in general not be sufficient to obtain maximum likelihood estimates of all the parameters: an iterative procedure is needed, alternating back and forth between the estimation of the spatial autoregressive coefficient conditional upon the residuals (for a given value of β), and the estimation of the parameter vector β conditional upon the spatial autoregressive coefficient.
Algorithm 3.2 — ML estimation of SEM. Following Anselin (1988), the procedure can be summarized in the following steps:
1. Carry out an OLS regression of y on X to obtain an initial estimate β̂ and residuals ε̂ = y − Xβ̂.
2. Given the residuals, find the λ̂ that maximizes the concentrated log-likelihood (3.58).
3. Given λ̂, obtain β̂(λ̂) by the spatially weighted least squares in (3.55), and update the residuals.
4. Repeat steps 2 and 3 until numerical convergence of the estimates.
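A minimal sketch of this iterative procedure (ml_sem is a hypothetical helper; it assumes a weight matrix Ws with real eigenvalues and uses the eigenvalue form of log |B|):

ml_sem <- function(y, X, Ws, tol = 1e-8, maxit = 100) {
  n <- length(y)
  In <- diag(n)
  omega <- Re(eigen(Ws, only.values = TRUE)$values)
  beta <- solve(crossprod(X), crossprod(X, y))  # step 1: OLS start
  lambda <- 0
  for (it in 1:maxit) {
    e <- as.vector(y - X %*% beta)              # residuals given beta
    We <- as.vector(Ws %*% e)
    cloglik <- function(l) {                    # concentrated log-likelihood (3.58)
      u <- e - l * We
      -(n / 2) * log(sum(u^2) / n) + sum(log(1 - l * omega))
    }
    lambda_new <- optimize(cloglik, c(1 / min(omega) + 1e-6, 1 - 1e-6),
                           maximum = TRUE)$maximum      # step 2
    B <- In - lambda_new * Ws                   # step 3: filtered least squares
    Xl <- B %*% X
    beta <- solve(crossprod(Xl), crossprod(Xl, B %*% y))
    if (abs(lambda_new - lambda) < tol) break   # step 4: convergence check
    lambda <- lambda_new
  }
  u <- as.vector(B %*% (y - X %*% beta))
  list(lambda = lambda_new, beta = as.vector(beta), sigma2 = sum(u^2) / n)
}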
The marginal effects are functions of the estimated parameters which are highly nonlinear due to Sr(θ).³ Therefore, a procedure such as the Delta Method is not feasible. Instead, we use a Monte Carlo approximation which takes into account the sampling distribution of θ̂. To show this procedure, consider the SDM, where:

Sr(θ) = (In − ρW)−1 (In βr + W γr).
Let g(θ) = M̄ (θ) be a function representing the marginal (direct, indirect or total) effect that depends
on the population parameters θ. If N(θ|θ̄, Σθ ) denotes the multivariate normal density of θ with mean θ̄
and asymptotic variance-covariance matrix Σθ , then the expected value of the marginal effects conditional
on the population parameters θ̄ and Σθ is:
E[g(θ) | θ̄, Σθ] = ∫ E[g(θ) | y, X, θ] N(θ | θ̄, Σθ) dθ. (3.60)
A Monte Carlo approximation to this expectation is obtained by calculation of the empirical marginal
effects evaluated at pseudo draws of θ from the asymptotic distribution of the estimator. The algorithm is
the following:
Algorithm 3.3 — Standard Errors of the Marginal Effects. Estimate the model using MLE. Consider s = 1, ..., S, and start with s = 1.
1. Take a random draw θs from N(θ̂, Σ̂θ), the estimated asymptotic distribution of θ̂.
2. Compute the empirical marginal effects (direct, indirect and total) implied by Sr(θs).
3. Store the resulting effects.
4. Set s = s + 1 and repeat steps 1-3 until s = S.
5. Calculate the empirical mean of the marginal effects across the S draws. The standard deviation of the marginal effects across the S draws is the standard error.
³ Note that we have replaced the parameter for the spatially lagged independent variables so that θ denotes the vector of parameters of the model.
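A minimal sketch of Algorithm 3.3 for the SLM case (me_mc is a hypothetical helper; its inputs theta_hat, a named vector with elements "beta" and "rho", and V_hat, the estimated asymptotic covariance matrix, are assumptions):

library("MASS")  # for mvrnorm
me_mc <- function(theta_hat, V_hat, Ws, S = 1000) {
  n <- nrow(Ws)
  In <- diag(n)
  draws <- MASS::mvrnorm(S, mu = theta_hat, Sigma = V_hat)  # step 1: pseudo draws
  effs <- apply(draws, 1, function(th) {
    Sr <- solve(In - th["rho"] * Ws) * th["beta"]  # Sr(W) = (I - rho*W)^(-1) * I * beta_r
    adi <- sum(diag(Sr)) / n                       # average direct impact
    ati <- sum(Sr) / n                             # average total impact
    c(direct = adi, indirect = ati - adi, total = ati)  # steps 2-3: store effects
  })
  cbind(mean = rowMeans(effs), se = apply(effs, 1, sd))  # step 5: means and SEs
}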
We illustrate the methods of this chapter using Anselin (1988)'s dataset on 49 neighborhoods in Columbus, Ohio. The dependent variable is:
• CRIME: residential burglaries and vehicle thefts per thousand households in the neighborhood.
The explanatory variables are neighborhood income (INC) and housing value (HOVAL), both measured in thousands of dollars.
# Load packages
library("spdep")
library("spatialreg")
library("memisc") # Package for tables
library("maptools")
library("RColorBrewer")
library("classInt")
source("getSummary.sarlm.R") # Function for spdep models
The dataset is currently part of the spdep package. We load the data using the following commands.
# Load data
columbus <- readShapePoly(system.file("etc/shapes/columbus.shp",
package = "spdep")[1])
col.gal.nb <- read.gal(system.file("etc/weights/columbus.gal",
package = "spdep")[1])
As usual in applied work, we start the analysis by asking whether there exists a spatial pattern in the variable we are interested in. To get some insight into the spatial distribution of CRIME we use the following quantile choropleth map:
Figure 3.2 shows the spatial pattern of crime. It can be observed that the spatial distribution of crime follows a clear pattern of positive autocorrelation. However, we must corroborate this statement by using a global test of spatial autocorrelation. To do so, we use a row-normalized binary contiguity matrix W, col.gal.nb, based on the queen criterion, and carry out a Moran's I test. In particular, we use a Moran test based on Monte Carlo simulations using the moran.mc function with 99 simulations.
# Moran's I test
set.seed(1234)
listw <- nb2listw(col.gal.nb, style = "W")
moran.mc(columbus$CRIME, listw = listw,
nsim = 99, alternative = 'greater')
##
## Monte-Carlo simulation of Moran I
##
Notes: This graph shows the spatial distribution of crime on the 49 Columbus, Ohio neighborhoods. Darker color
indicates greater rate of crime.
## data: columbus$CRIME
## weights: listw
## number of simulations + 1: 100
##
## statistic = 0.48577, observed rank = 100, p-value = 0.01
## alternative hypothesis: greater
The results show that the Moran's I statistic is 0.49 with a p-value of 0.01. This implies that we reject the null hypothesis of a random spatial distribution; there is evidence of positive global spatial autocorrelation in the crime variable: places with high (low) crime rates are surrounded by places with high (low) crime rates.
Our next step is to estimate different spatial models using the functions already programmed in spdep.
First, we estimate the classical OLS model followed by the SLX, SLM, SDM, SEM and SAC models. The
functions used for each model are the following:
• OLS: lm function.
• SLX: lm function, where W X is constructed using the function lag.listw from spdep package. This
model can also be estimated using the function lmSLX from spdep package as shown below.
• SLM: lagsarlm from spdep package.
• SDM: lagsarlm from spdep package, using the argument type = "mixed". Note that type = "Durbin"
may be used instead of type = "mixed".
• SEM: errorsarlm from spdep package. Note that the Spatial Durbin Error Model (SDEM)—not shown
here— can be estimated by using type = "emixed".
• SAC: sacsarlm from spdep package.
All models are estimated using the ML procedure outlined in the previous section. In order to compute the determinant of the Jacobian we use Ord (1975)'s procedure, by explicitly using the argument method = "eigen" in each spatial model. That is, the Jacobian is computed as in (3.32).
# Models
columbus$lag.INC <- lag.listw(listw,
columbus$INC) # Create spatial lag of INC
columbus$lag.HOVAL <- lag.listw(listw,
columbus$HOVAL) # Create spatial lag of HOVAL
ols <- lm(CRIME ~ INC + HOVAL,
data = columbus)
slx <- lm(CRIME ~ INC + HOVAL + lag.INC + lag.HOVAL,
data = columbus)
slm <- lagsarlm(CRIME ~ INC + HOVAL,
data = columbus,
listw,
method = "eigen")
sdm <- lagsarlm(CRIME ~ INC + HOVAL,
data = columbus,
listw,
method = "eigen",
type = "mixed")
sem <- errorsarlm(CRIME ~ INC + HOVAL,
data = columbus,
listw,
method = "eigen")
sac <- sacsarlm(CRIME ~ INC + HOVAL,
data = columbus,
listw,
method = "eigen")
The models are presented in Table 3.1. The OLS estimates are presented in the first column. The results show that an increase of 1 thousand dollars in the income of the neighborhood is correlated, on average, with a decrease of 1.6 crimes per thousand households. Similarly, an increase of 1 thousand dollars in the housing value of the neighborhood is correlated, on average, with a decrease of 0.3 crimes per thousand households. Both correlations are statistically significant.4 Both results imply that crimes (residential burglaries and vehicle thefts) are lower in richer neighborhoods.
Column 2 of Table 3.1 shows the results for the SLX. In particular, the model is given by y = Xβ + W Xγ + ε, where W X is a 49 × 2 matrix whose columns correspond to the spatial lags of INC and HOVAL.
The coefficient for the spatial lag of INC, W.INC, is negative and significant. This implies that crime in spatial unit i is correlated with the income in its neighborhood: the higher the income of the neighbors of i, the lower the crime in i. This result does not, however, hold for the housing value of the neighbors of i, whose coefficient is positive but not statistically different from zero.
The results for the SLM are shown in column 3. The spatial autoregressive parameter ρ is positive and significant, indicating strong spatial autocorrelation. This provides evidence of spillover effects on crime. The coefficients for the other variables in the regression are similar to the OLS results, though smaller in absolute value.
The results for the SDM are presented in column 4. Whereas the estimated ρ parameter is positive and significant, the coefficients of the lagged explanatory variables are not. This indicates that once we have taken into account the endogenous interaction effects of crime, the neighbors' characteristics do not matter in explaining the crime in each location. Moreover, the spatial lag of income has the wrong sign: the common factor hypothesis would imply a positive sign, given a positive estimate for ρ and the negative sign for INC. This provides some evidence that an omitted spatial lag may be the main spatial effect, rather than spatial dependence in the error term.
4 Note that we refer to correlation since there may still be some sort of endogeneity problem with either of the two variables.
Column 5 shows the results for the SEM model, which confirm the conclusions from the previous models. It can be noticed that the autoregressive parameter for W u is positive and significant, indicating an important spatial transmission of random shocks. This result may be explained by the omission of important variables that are spatially correlated.
The SAC model, presented in column 6, considers both endogenous interaction effects and interaction effects among the error terms. From the results, we observe that the SAC model produces coefficient estimates for the W y and W u variables that are not significantly different from zero. However, if endogenous interaction effects and interaction effects among the error terms are considered separately, both coefficients turn out to be significant. This might be explained by the fact that the model is overparameterized, as a result of which the significance levels of all variables tend to go down.
For those interested in programming the SLM in R, I provide short code estimating this model using Ord's approximation in Appendix 3.B.
We now estimate the predicted values pre- and post- the change in the income variable. In the following lines we use the reduced-form predictor and the observed values of the exogenous variables to obtain the predicted values for CRIME, ŷ1, using the SLM model previously estimated.
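A minimal sketch of this computation (assumed code: the object names Wm, Xmat and y_hat_pre are hypothetical, and listw2mat() is used to obtain a dense W):
# Reduced-form predictions: y_hat = (I - rho W)^{-1} X beta
n     <- length(listw$neighbours)
Wm    <- listw2mat(listw)
Xmat  <- cbind(1, columbus$INC, columbus$HOVAL)
A_inv <- solve(diag(n) - slm$rho * Wm)
y_hat_pre <- A_inv %*% (Xmat %*% slm$coefficients)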
Next we increase INC by 1 in spatial unit 30 and calculate the reduced-form predictions, ŷ2.
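Under the same assumptions, the post-change predictions might be computed as:
# Raise INC by 1 (US$1,000) in spatial unit 30 and re-predict
Xmat2 <- Xmat
Xmat2[30, 2] <- Xmat2[30, 2] + 1
y_hat_post <- A_inv %*% (Xmat2 %*% slm$coefficients)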
Finally, we compute the difference between pre- and post-predictions, ŷ2 − ŷ1:
# The difference
delta_y <- y_hat_post - y_hat_pre
col_new$delta_y <- delta_y
summary(delta_y)
## V1
## Min. :-1.1141241
## 1st Qu.:-0.0074114
## Median :-0.0012172
## Mean :-0.0336341
## 3rd Qu.:-0.0002604
## Max. :-0.0000081
sum(delta_y)
## [1] -1.648071
According to the result from sum(delta_y), the predicted effect of the change would be a decrease of 1.65 in the crime rate, considering both direct and indirect effects. That is, increasing income by US$1,000 in region 30 generates effects that transmit through the whole system of regions, resulting in a new equilibrium where total crime falls by about 1.65 crimes per thousand households.
Sometimes we would like to plot these effects. Suppose we wanted to show the regions that had low and high impacts due to the increase in INC. Let's define "high-impacted regions" as those regions whose crime rate decreases by more than 0.05. The following code produces Figure 3.3.
# Breaks
breaks <- c(min(col_new$delta_y), -0.05, max(col_new$delta_y))
labels <- c("High-Impacted Regions", "Low-Impacted Regions")
np <- findInterval(col_new$delta_y, breaks)
colors <- c("red", "blue")
# Draw Map
plot(col_new, col = colors[np])
legend("topleft", legend = labels, fill = colors, bty = "n")
points(38.29, 30.35, pch = 19, col = "black", cex = 0.5)
Now we map the magnitude of the changes caused by altering INC in region 30. The code is the following
and the graph is presented in Figure 3.4.
Figure 3.3 — Notes: This graph shows the regions with low and high impacts due to the increase in INC in region 30. Red-colored regions are those with a decrease in the crime rate larger than 0.05, whereas blue-colored regions are those with a smaller decrease.
In the rest of this section we use the impacts() function from the spdep package to understand the direct (local), indirect (spillover), and total effects of a unit change in each of the predictor variables. This function returns the direct, indirect and total impacts for the variables in the model. The spatial lag impact measures are computed using the reduced form:

y = Σ_{r=1}^{K} A(W)^{-1} (I_n β_r) x_r + A(W)^{-1} ε   (3.61)

where

A(W)^{-1} = I_n + ρW + ρ²W² + ...
The exact A(W)^{-1} is computed when listw is given. When the traces are created by powering sparse matrices, the approximation I_n + ρW + ρ²W² + ... is used instead. The exact and the trace-based methods should give very similar results, unless the number of powers used is very small or the spatial coefficient is close to its bounds.
Figure 3.4 — Notes: This graph shows the spatial distribution of the changes caused by altering INC in region 30.
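The call producing the impact estimates discussed below is not reproduced in the text; a sketch of the assumed call is:
# Impact measures using the exact A(W)^{-1} (listw given)
im <- impacts(slm, listw = listw)
im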
The output says that an increase of US$1,000 in income leads to a total decrease of 1.8 crimes per thousand households. The direct effect of the income variable in the SLM model amounts to -1.123, while the coefficient estimate of this variable is -1.074. This implies that the feedback effect is -1.123 - (-1.074) = -0.049. This feedback effect corresponds to about 4.5% of the coefficient estimate.
Let’s corroborate these results by computing the impacts using matrix operations:
## [1] -1.122516
n <- length(listw$neighbours)
Total <- crossprod(rep(1, n), S) %*% rep(1, n) / n
Total
## [,1]
## [1,] -1.800897

The indirect (spillover) effect is the difference between the total and the direct effect:

## [,1]
## [1,] -0.6783818
Note that the results are the same as those computed by impacts().
We can also obtain the p-values of the impacts by using the argument R. This argument indicates the number of simulations used to create distributions for the impact measures, provided that the fitted model object contains a coefficient covariance matrix. Now with p-values:
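A sketch of the assumed call (the choice R = 200 is illustrative):
# Simulation-based inference for the impact measures
im_R <- impacts(slm, listw = listw, R = 200)
summary(im_R, zstats = TRUE, short = TRUE)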
The results show that the variable that exerts the largest negative direct impact is INC. That is, INC exerts the largest reduction on the own-region crime rate. The indirect effects are presented in the second column. These effects help identify which variables produce the largest spatial spillovers. Negative effects could be considered spatial benefits, since they indicate variables that lead to a reduction in the crime rate. Positive indirect effects would represent a negative externality, since they indicate that neighboring regions suffer an increase in the crime rate when these variables increase. From the results we observe that INC has the largest, and significant, negative indirect effect.
The indirect effect for HOVAL is not significant. The weakly significant effect in the SLM model can be explained by the fact that this model imposes the same ratio between the spillover effect and the direct effect for every explanatory variable; the model is therefore too rigid to capture spillover effects adequately. The total effect takes into account both the direct and indirect effects, allowing us to draw inferences regarding which variables are important to reduce the crime rate. We can observe that INC has the largest total effect.
Now we convert the spatial weight matrix into a sparse matrix and power it up using the trW function, as sketched below.
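A sketch of the assumed conversion and trace computation:
# Sparse W and Monte Carlo estimates of the traces of its powers
W_sp <- as(listw, "CsparseMatrix")
trMC <- trW(W_sp, type = "MC")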
We can also examine the cumulative impacts using the argument Q. When Q and tr are given in the impacts function, the output will present the impact components for each step in the traces of the powers of the weight matrix, up to and including the Qth power.
# Cumulative impacts
im2 <- impacts(slm, tr = trMC, R = 100, Q = 5)
sums2 <- summary(im2, zstats = TRUE, reportQ = TRUE, short = TRUE)
sums2
##
## ========================================================
## Simulation results (asymptotic variance matrix):
## ========================================================
## Simulated standard errors
## Direct Indirect Total
## INC 0.34631256 0.4029543 0.6401131
## HOVAL 0.08921795 0.1241116 0.1807874
##
## Simulated z-values:
## Direct Indirect Total
## INC -3.233305 -1.853701 -2.916189
## HOVAL -3.239503 -1.585160 -2.686904
##
## Simulated p-values:
## Direct Indirect Total
## INC 0.0012237 0.063782 0.0035434
## HOVAL 0.0011974 0.112930 0.0072118
## ========================================================
## Simulated impact components z-values:
## $Direct
## INC HOVAL
## Q1 -3.167835 -3.1796183
## Q2 NaN NaN
## Q3 -1.703030 -1.5687862
## Q4 -1.272087 -1.0947684
## Q5 -1.002337 -0.8174607
##
## $Indirect
## INC HOVAL
## Q1 NaN NaN
## Q2 -2.465562 -2.4657025
## Q3 -1.703030 -1.5687862
## Q4 -1.272087 -1.0947684
## Q5 -1.002337 -0.8174607
##
## $Total
## INC HOVAL
## Q1 -3.167835 -3.1796183
## Q2 -2.465562 -2.4657025
## Q3 -1.703030 -1.5687862
## Q4 -1.272087 -1.0947684
## Q5 -1.002337 -0.8174607
##
##
## Simulated impact components p-values:
## $Direct
## INC HOVAL
## Q1 0.0015358 0.0014747
## Q2 NA NA
## Q3 0.0885624 0.1166978
## Q4 0.2033424 0.2736181
## Q5 0.3161810 0.4136652
##
## $Indirect
## INC HOVAL
## Q1 NA NA
## Q2 0.013680 0.013674
## Q3 0.088562 0.116698
## Q4 0.203342 0.273618
## Q5 0.316181 0.413665
##
## $Total
## INC HOVAL
## Q1 0.0015358 0.0014747
## Q2 0.0136799 0.0136745
## Q3 0.0885624 0.1166978
## Q4 0.2033424 0.2736181
## Q5 0.3161810 0.4136652
Central limit theorems applied to triangular arrays of random variables are concerned with the limiting distributions of appropriately defined functions of the row average S_n = n^{-1} Σ_{i=1}^n X_{ni}. For example, for n = 3 (third row) we have S_3 = (1/3)(X_{31} + X_{32} + X_{33}). Note that traditional CLTs deal with functions of averages of the type n^{-1} Σ_{i=1}^n X_i, the X_i's being elements of the sequence {X_n}. However, the triangular array {X_{ni}} is more general than a sequence {X_n} in the sense that the random variables in a row of the array need not be the same as the random variables in other rows. Thus, the triangular nature of the random variables leads to certain statistical problems, especially with respect to the relevant CLT that should be applied. In other words, we will need a CLT applicable to triangular arrays. Both the LLN and the CLT for triangular arrays require slightly stronger conditions than the LLN and CLT for i.i.d. sequences of random variables.
What are the conditions on the random variables such that a properly normalized S_n converges to a normal distribution as n → ∞? In a nutshell, assume:

y_n = A_n^{-1}(X_n β_0 + ε_n)   (3.62)

where A_n = A_n(ρ_0) = I_n − ρ_0 W_n is nonsingular. Let ε_n(δ) = y_n − X_n β − ρW_n y_n, where δ = (β^T, ρ)^T. Thus, ε_n = ε_n(δ_0).
Since the matrices (I_n − ρW)^{-1} generally depend upon the sample size n, the vectors y and ε will also depend upon n, and they will form triangular arrays. This is due to the fact that for the "boundary" elements the sample weights matrix changes as new spatial units — or new data points — are added. That is, new spatial units change the structure for the existing spatial units (see for example Kelejian and Prucha, 1999, 2001; Anselin, 2007). For example, the outcome for the first spatial unit, y_{1,n}, will be different if we consider a total of n = 10 or n = 15 observations, because of the changing nature of W as n changes and given the DGP in Equation (3.62). This implies that these elements and the vector y should be indexed by n:
n = 1 ⟹ y_{11}
n = 2 ⟹ y_{12} y_{22}
n = 3 ⟹ y_{13} y_{23} y_{33}
⋮
n = n ⟹ y_{1n} y_{2n} y_{3n} ... y_{nn}
where y_{11} ≠ y_{12} ≠ y_{13} and y_{22} ≠ y_{23}. Note that the dependent variables in the same row are mutually independent (spatial units are independent) and have the same distribution, but the distributions of the random variables y (and ε) in different rows are allowed to differ.
The triangular array structure of y is partly a consequence of allowing a triangular array structure for the
disturbances in the model. But there is a more fundamental reason for it, and for treating the X observations
as a triangular array also. In allowing for the elements of Xn to depend on n we allow explicitly for some of
the regressors to be spatial lags.
We can identify each of the indices i = 1, ..., n with a location in space. In regularly-observed time series settings, these indices correspond to equidistant points on the real line, and it is evident what we usually mean by letting n increase. However, there is ambiguity when these points lie in space. For example, consider n points on a two-dimensional regularly-spaced lattice, where both the number of rows (n_1) and the number of columns (n_2) increase with n = n_1 · n_2. If we choose to list these points in lexicographic order (say, first row left to right, then the second row, etc.), then as n increases there would have to be some re-labeling, as the triangular array permits. Another consequence of this listing is that dependence between locations i and j is not always naturally expressed as a function of the difference i − j; for example, this is the case when the dependence is isotropic.
Assumption 3.4 1. The disturbances {ε_{i,n} : 1 ≤ i ≤ n, n ≥ 1} are identically distributed. Furthermore, for each sample size n, they are jointly independently distributed with mean E(ε_{i,n}) = 0 and E(ε_{i,n}²) = σ_{ε,n}², where 0 < σ_{ε,n}² < b.
Note that Assumption 3.4(1) allows the error term to depend on the sample size n, i.e., to form a triangular array. (For simplicity of notation we will, for the most part, again drop the subscript n in what follows.) Moreover, because statistics involving quadratic forms of ε_n will be present in the estimation, the existence of the fourth-order moment of ε_{i,n} will guarantee finite variances for the quadratic forms, and we will be able to apply a CLT.
In order to understand the asymptotic behavior of W_n under some regularity conditions, we first need some useful terminology.
Definition 3.6.2 — Triangular array of constants. Let {b_{ni}}, i = 1, ..., n, be a triangular array of constants.
1. {b_{ni}} are at most of order 1/h_n, denoted O(1/h_n) uniformly in i, if there exists a finite constant c independent of i and n such that |b_{ni}| ≤ c/h_n for all i and n.
2. {b_{ni}} are bounded away from zero uniformly in i at rate h_n if there exist a positive sequence {h_n} and a constant c > 0 independent of i and n such that c ≤ h_n |b_{ni}| for all i, for sufficiently large n.
Again, we must think of the W matrices as triangular arrays of constants. Recall that the elements of W are denoted w_{ij}. However, since the spatial structure changes as we add more spatial units, it might be the case that the element w_{ij} is not the same when n = 50 as when n = 55. Therefore, we need triangular arrays in order to make this possibility explicit. That is why we will index the elements of W_n as w_{n,ij}.
Another question is whether the elements of W_n — viewed as sequences — are bounded, that is, whether they remain limited as n → ∞. In this context, Definition 3.6.2 provides a specific setting for sequences bounded away from zero; if sequences are divergent, the definition describes how fast they tend to infinity. Now, we apply this definition to the spatial weight matrices:
Assumption 3.5 — Weight Matrix. The elements w_{n,ij} of W_n are at most of order h_n^{-1}, denoted O(1/h_n), uniformly in all i, j, where the rate sequence {h_n} can be bounded or divergent. As a normalization, w_{n,ii} = 0 for all i.
Recall that in econometrics we are often interested in the asymptotic behavior of variables. For example, we say that:

X_n = O(b_n) ⟹ lim_{n→∞} X_n/b_n = c, with −∞ < c < ∞.
This implies that X_n is a bounded sequence of rate b_n. You probably recall from your econometrics class that we can write:

√n(β̂ − β) = (X^T X/n)^{-1} (1/√n) X^T ε,

and we usually state that X^T X = O(n) and X^T ε = O_p(n^{1/2}). That is, the sequence (1/n) X^T X is a bounded sequence, and (1/n^{1/2}) X^T ε is a bounded sequence in terms of probability (it converges to something as fast as rate 1/√n). Assumption 3.5 states that the elements of W_n are sequences that are at most of order 1/h_n, where the rate sequence {h_n} itself may be bounded or divergent.
Assumptions 3.5 and 3.6 link the spatial weight matrix directly to the sample size n. The intuition tells us that as the sample size n increases, the row sums of the weight matrices will also tend to increase, since one region could have more neighbors (see our discussion in Section 3.6.1). The rate at which the spatial weights w_{n,ij} change as n increases can be bounded (a limit on the number of neighbors) or divergent (no limit on the number of neighbors). Therefore, Assumptions 3.5 and 3.6 are intended to cover weight matrices whose elements are not restricted to be nonnegative and those that might not be row-standardized.
What are the implications of these assumptions? They have to do with the row and column sums of the matrix W. In particular, the row and column sums of W, before W is row-normalized, should not diverge to infinity at a rate equal to or faster than the rate of the sample size n. This condition is slightly different in Kelejian and Prucha (1998, 1999). Their condition states that the row and column sums of the matrices W and (I_n − ρW)^{-1}, before W is row-normalized, should be uniformly bounded in absolute value as n goes to infinity. In both cases these conditions limit the cross-sectional correlation to a manageable degree, i.e., the correlation between two spatial units should converge to zero as the distance separating them increases to infinity.
In addition to their technical role, these assumptions have practical implications. Normally, no spatial unit is assumed to be a neighbor to more than a given number, say q, of other units. In that case the number of neighbors is limited, and Lee (2004)'s and Kelejian and Prucha (1998, 1999)'s assumptions are satisfied.
By contrast, when the spatial weights matrix is an inverse distance matrix, Kelejian and Prucha (1998, 1999)'s condition may not be satisfied. To see this, consider an infinite number of spatial units arranged linearly. Let the distance of each spatial unit to its first left- and right-hand neighbors be d; to its second left- and right-hand neighbors, 2d; and so on. See for example Figure 3.5.
Figure 3.5: Spatial units R1, ..., R5 arranged on a line (the distances between consecutive units shown are 2d, d, d, 2d).
When W is an inverse distance matrix and its off-diagonal elements are of the form 1/d_{ij}, where d_{ij} is the distance between two spatial units i and j, each row sum is a partial sum of a multiple of the harmonic series, which diverges (slowly) as n → ∞. Consider instead a matrix in which every unit is a neighbor of every other unit with equal weight. Since the row and column sums are then n − 1, these sums also diverge to infinity as n → ∞. In contrast to the previous case, however, (n − 1)/n → 1 instead of 0 as n → ∞. This implies that a spatial weight matrix that has equal weights, w_{ij} = 1/(n − 1) after row-normalization, must be excluded for reasons of consistency, since it satisfies neither Lee (2004)'s nor Kelejian and Prucha (1998, 1999)'s condition.
The alternative is a group interaction matrix, introduced by Case (1991). Here "neighbors" refers to farmers who live in the same district. Suppose that there are R districts and m farmers in each district, so the sample size is n = mR. Case assumed that, within a district, each neighbor of a farmer is given equal weight. In that case, W_n = I_R ⊗ B_m, where B_m = (ι_m ι_m^T − I_m)/(m − 1). For this example, h_n = m − 1 and h_n/n = (m − 1)/(mR) = O(1/R). If the sample size n increases by increasing both R and m, then h_n goes to infinity and h_n/n goes to zero as n tends to infinity. Thus, this matrix satisfies Lee (2004)'s condition.
Remark: Whether {h_n} is a bounded or a divergent sequence has interesting implications for OLS estimation. The OLS estimators of β and ρ are inconsistent when {h_n} is bounded, but they can be consistent when {h_n} is divergent (see Lee, 2002).
In summary, when {h_n} is a bounded sequence, each cross-sectional unit has only a small number of neighbors, where the spatial dependence is usually defined based on geographical considerations. When {h_n} is divergent, each unit has a large number of neighbors, a scenario that often emerges in empirical studies of social interactions or cluster-sampled data.
Under Assumption 3.7, the SLM model (system) has the reduced form (equilibrium) given by Equation (3.62), and:

E(y_n) = (I_n − ρ_0 W_n)^{-1} X_n β_0 = A_n^{-1} X_n β_0   (3.63)

Var(y_n) = σ_0² (I_n − ρ_0 W_n)^{-1} (I_n − ρ_0 W_n)^{-T} = σ_0² A_n^{-1} A_n^{-T}   (3.64)
Before explaining the rest of the assumptions, we need the notion of bounded matrices.
Definition 3.6.3 — Bounded Matrices. Let {A_n} be a sequence of n-dimensional square matrices, where A_n = [a_{n,ij}].
1. The column sums of {A_n} are uniformly bounded (in absolute value) if there exists a finite constant c that does not depend on n such that

‖A_n‖_1 = max_{1≤j≤n} Σ_{i=1}^n |a_{n,ij}| ≤ c.

2. The row sums of {A_n} are uniformly bounded (in absolute value) if there exists a finite constant c that does not depend on n such that

‖A_n‖_∞ = max_{1≤i≤n} Σ_{j=1}^n |a_{n,ij}| ≤ c.

Then {A_n} is said to be uniformly bounded in row sums if {‖A_n‖_∞} is a bounded sequence, and uniformly bounded in column sums if {‖A_n‖_1} is a bounded sequence.
The following lemmas will be very useful:

Lemma 3.8 If {A_n} and {B_n} are uniformly bounded in row sums (column sums), then {A_n B_n} is also uniformly bounded in row sums (column sums).
Lemma 3.9 If {A_n} is absolutely summable and Z_n has bounded elements, then the elements of Z_n^T A_n Z_n are O(n).
Assumption 3.10 The sequences of matrices {W_n} and {A_n^{-1}} are uniformly bounded in both row and column sums.
The uniform boundedness of the matrices is a condition to limit the spatial correlation to a manageable
degree. For example, it guarantees that the variances of yn are bounded as n goes to infinity.
Technically, this assumes that {‖W_n‖_1} and {‖W_n‖_∞} are bounded sequences. Formally, using Definition 3.6.3, the row and column sums of a sequence of square matrices {A_n} are bounded uniformly in absolute value if there exists a constant c < ∞ that does not depend on n such that

‖A_n‖_∞ = max_{1≤i≤n} Σ_{j=1}^n |a_{n,ij}| < c  and  ‖A_n‖_1 = max_{1≤j≤n} Σ_{i=1}^n |a_{n,ij}| < c  for all n.
Why do we care about these conditions? Because we need the variance to go to zero as the sample size goes to infinity in order to apply a consistency theorem.5
Lemma 3.11 — Uniform Boundedness of Matrices in Row and Column Sums. Suppose that the spatial weights matrix W_n is a non-negative matrix with (i, j)th element

w_{n,ij} = d_{ij} / Σ_{l=1}^n d_{il},

and d_{ij} > 0 for all i, j.
1. If the row sums Σ_{j=1}^n d_{ij} are bounded away from zero at rate h_n uniformly in i, and the column sums Σ_{i=1}^n d_{ij} are O(h_n) uniformly in j, then {W_n} are uniformly bounded in column sums.
2. (Symmetric Matrix) If d_{ij} = d_{ji} for all i and j, and the row sums Σ_{j=1}^n d_{ij} are O(h_n) and bounded away from zero at rate h_n uniformly in i, then {W_n} are uniformly bounded in column sums.
Assumption 3.12 The elements of X_n are uniformly bounded constants for all n. The limit lim_{n→∞} X_n^T X_n/n exists and is nonsingular.

This rules out multicollinearity among the regressors. Note also that we are assuming that X_n is non-stochastic; if X_n were stochastic, analogous conditions would be required to hold in probability.
Assumption 3.13 {A_n^{-1}(ρ)} are uniformly bounded in either row or column sums, uniformly in ρ in a compact parameter space P. The true parameter ρ_0 is in the interior of P.
−1
This assumption is needed to deal with the nonlinearity of log (In − ρW ) in the log-likelihood function.
Recall
that if kW
k < 1, then In −ρWn is invertible for all n. Then if kW k < 1, then the sequence of matrices
−1
(In − Wn )
are uniformly bounded in any subset of (−1, 1) bounded away from the boundary. As we
−1
previously see, if W is row-standardized (In − W ) is uniformly bounded in row sums norm uniformly in
any closed subset of (−1, 1). Therefore, P from Assumption 3.13 can be considered as a single closed set
contained in (-1, 1).
−1
What if W is not row-normalized but its eigenvalues are real? Then, the Jacobian of (In − W ) will
be positive if −1/ωmin < ρ < 1/ωmax , where ωmin and ωmax are the minimum and maximum eigenvalues of
W , and P will be a closed interval contained in (−1/ωmin , 1/ωmax ) for all n. Thus, Assumption 3.13 rules
out models where ρ0 is close to -1 and 1.
Now, noting that:
5 Equivalently, this assumption rules out the unit root case in time series.
y_n = X_n β_0 + ρ_0 W_n y_n + ε_n
    = X_n β_0 + ρ_0 W_n A_n^{-1}(X_n β_0 + ε_n) + ε_n
    = X_n β_0 + ρ_0 W_n A_n^{-1} X_n β_0 + ρ_0 W_n A_n^{-1} ε_n + ε_n
    = X_n β_0 + ρ_0 W_n A_n^{-1} X_n β_0 + (I_n + ρ_0 W_n A_n^{-1}) ε_n   (3.65)
    = X_n β_0 + ρ_0 C_n X_n β_0 + (I_n + ρ_0 C_n) ε_n
    = X_n β_0 + ρ_0 C_n X_n β_0 + A_n^{-1} ε_n

because I_n + ρ_0 C_n = A_n^{-1} (show this), where C_n = W_n A_n^{-1}.
Theorem 3.15 — Consistency. Under Assumptions 3.4–3.14, θ_0 is globally identifiable and θ̂_n is a consistent estimator of θ_0.

The proof is given in Lee (2004). Identification of ρ_0 can be based on the maximum values of the concentrated log-likelihood function Q_n(ρ)/n. With identification and uniform convergence of [log L_n(ρ) − Q_n(ρ)]/n to zero on P, consistency of the QMLE θ̂_n follows.
(1/√n) ∂log L_n(θ_0)/∂β = (1/(σ_0² √n)) X_n^T ε_n   (3.68)

(1/√n) ∂log L_n(θ_0)/∂σ² = (1/(2σ_0⁴ √n)) (ε_n^T ε_n − nσ_0²)   (3.69)

and

(1/√n) ∂log L_n(θ_0)/∂ρ = (1/(σ_0² √n)) (C_n X_n β_0)^T ε_n + (1/(σ_0² √n)) (ε_n^T C_n ε_n − σ_0² tr(C_n))   (3.70)
As explained by Lee (2004, p. 1905), these are linear and quadratic functions of ε_n. In particular, the asymptotic distribution of (3.70) may be derived from a central limit theorem for linear-quadratic forms. The matrix C_n is uniformly bounded in row sums. As the elements of X_n are bounded, the elements of C_n X_n β_0 are uniformly bounded for all n by Lemma 3.8. With the existence of higher-order moments of ε in Assumption 3.4, the central limit theorem for quadratic forms of double arrays of Kelejian and Prucha (2001) can be applied, and the limit distribution of the score vector follows.
Since E[(1/√n) ∂log L_n/∂θ] = 0, the variance matrix of (1/√n) ∂log L_n/∂θ is:

−E[(1/n) ∂² log L_n(θ)/∂θ∂θ^T] =
[ (1/(σ² n)) X^T X        0                  (1/(σ² n)) X^T (CXβ)                                ]
[ 0                       1/(2σ⁴)            (1/(σ² n)) tr(C)                                    ]   (3.72)
[ (1/(σ² n)) (CXβ)^T X    (1/(σ² n)) tr(C)   (1/n) tr(C^s C) + (1/(σ² n)) (CXβ)^T (CXβ)         ]

which represents the average Hessian matrix (or information matrix when the ε's are normal). The matrix Ω_{θ,n} collects the second, third, and fourth moments of ε; if ε_n is normally distributed, then Ω_{θ,n} = 0.
Σ_θ = − lim_{n→∞} E[(1/n) ∂² log L_n(θ_0)/∂θ∂θ^T],   (3.74)

which is assumed to exist. If the ε_i's are normally distributed, then:

√n(θ̂_n − θ_0) →d N(0, Σ_θ^{-1}).   (3.75)
The following lemmas and statements summarize some basic properties of spatial weight matrices and some laws of large numbers and central limit theorems for linear and quadratic forms. For proofs of these lemmas see the appendix of Lee (2004). The error terms ε_n are assumed to be i.i.d. with zero mean and finite variance σ_0², according to Assumption 3.4. For quadratic forms involving ε, the fourth moment μ_4 of the ε's is assumed to exist.
Consider the following properties:
Lemma 3.17 — Limiting Distribution. Suppose that A_n is an n × n matrix with its column sums uniformly bounded, and that the elements of the n × K matrix C_n are uniformly bounded. The elements ε_i of ε_n = (ε_1, ..., ε_n)^T are i.i.d.(0, σ²). Then:

(1/√n) C_n^T A_n ε_n = O_p(1).   (3.76)

That is, it is bounded in probability. Furthermore, if the limit of (1/n) C_n^T A_n A_n^T C_n exists and is positive definite, then:

(1/√n) C_n^T A_n ε_n →d N(0, σ_0² lim_{n→∞} (1/n) C_n^T A_n A_n^T C_n).   (3.77)
Lemma 3.18 — First and Second Moments. Let A_n = [a_{ij}] be an n-dimensional square matrix. Then:
1. E(ε_n^T A_n ε_n) = σ_0² tr(A_n),
2. E[(ε_n^T A_n ε_n)²] = (μ_4 − 3σ_0⁴) Σ_{i=1}^n a_{ii}² + σ_0⁴ [tr²(A_n) + tr(A_n A_n^T) + tr(A_n²)], and
3. Var(ε_n^T A_n ε_n) = (μ_4 − 3σ_0⁴) Σ_{i=1}^n a_{ii}² + σ_0⁴ [tr(A_n A_n^T) + tr(A_n²)].
In particular, if the ε's are normally distributed, then μ_4 = 3σ_0⁴ and:
• Var(ε_n^T A_n ε_n) = σ_0⁴ [tr(A_n A_n^T) + tr(A_n²)].
Lemma 3.19 Suppose that {A_n} is uniformly bounded in either row or column sums, and that the elements a_{n,ij} of A_n are O(1/h_n) uniformly in all i and j. Then:
• E(ε_n^T A_n ε_n) = O(n/h_n),
• Var(ε_n^T A_n ε_n) = O(n/h_n), and
• ε_n^T A_n ε_n = O_p(n/h_n).
Furthermore, if lim_{n→∞} h_n/n = 0, then:

(h_n/n) ε_n^T A_n ε_n − (h_n/n) E(ε_n^T A_n ε_n) = o_p(1).
Sketch of Proof of Asymptotic Normality. We will sketch the proof of asymptotic normality assuming consistency (Theorem 3.15). The sketch consists of the following steps:

1. Show that

Σ_θ = − lim_{n→∞} E[(1/n) ∂² log L_n(θ_0)/∂θ∂θ^T]

is non-singular. Showing this is beyond the scope of these class notes; we take it as given.

2. Show that:

(1/n) X_n^T C_n^T ε_n = o_p(1)
(1/n) X_n^T C_n^T C_n ε_n = o_p(1)

It follows that:

(1/n) X_n^T W_n y_n = (1/n) X_n^T C_n X_n β_0 + o_p(1)
(1/n) y_n^T W_n^T ε_n = (1/n) ε_n^T C_n^T ε_n + o_p(1)
(1/n) y_n^T W_n^T W_n y_n = (1/n) (X_n β_0)^T C_n^T C_n X_n β_0 + (1/n) ε_n^T C_n^T C_n ε_n + o_p(1)
As X_n^T W_n y_n/n = O_p(1) (it is bounded in probability), it follows from Equation (3.36):
Then, taking into account Equation (3.35) and using our result in Equation (3.78), yields:

∂² log L_n(θ)/∂ρ² = −tr[(C_n(ρ))²] − (1/σ²) (y_n^T W_n^T W_n y_n),  where C_n(ρ) = W_n A_n(ρ)^{-1}.

A mean value expansion gives:

tr[(C_n(ρ̃_n))²] = tr[(C_n(ρ_0))²] + 2 tr[(C_n(ρ̄))³] (ρ_0 − ρ̃_n)
tr[(C_n(ρ̃_n))²] − tr[(C_n(ρ_0))²] = 2 tr[(C_n(ρ̄))³] (ρ_0 − ρ̃_n)

Note that C_n(ρ̄) is uniformly bounded in row and column sums uniformly in a neighborhood of ρ_0 by Assumptions 3.10 and 3.13, and that tr[(C_n(ρ̄))³] = O(n/h_n) by Lemma 3.19.
Considering Equation (3.38):

(1/n) ∂² log L(θ̃)/∂(σ²)² − (1/n) ∂² log L(θ_0)/∂(σ²)²
  = [1/(2(σ̃²)²) − (1/(σ̃²)³) (1/n) ε(δ̃)^T ε(δ̃)] − [1/(2(σ_0²)²) − (1/(σ_0²)³) (1/n) ε^T ε]
  = [1/(2(σ̃²)²) − 1/(2(σ_0²)²)] − [(1/(σ̃²)³) − (1/(σ_0²)³)] (1/n) ε^T ε + o_p(1)
  = o_p(1),

since σ̃² →p σ_0² and (1/n) ε(δ̃)^T ε(δ̃) = (1/n) ε^T ε + o_p(1).
3. The conditional expectations of the second derivatives are:

E[(1/n) ∂² log L(θ_0)/∂β∂β^T | W, X] = −(1/(σ_0² n)) X^T X
E[(1/n) ∂² log L(θ_0)/∂β∂σ² | W, X] = 0
E[(1/n) ∂² log L(θ_0)/∂β∂ρ | W, X] = −(1/(σ_0² n)) X^T C X β_0
E[(1/n) ∂² log L(θ_0)/∂(σ²)² | W, X] = −1/(2σ_0⁴)
E[(1/n) ∂² log L(θ_0)/∂σ²∂ρ | W, X] = −(1/n) tr(C)/σ_0²
E[(1/n) ∂² log L(θ_0)/∂ρ² | W, X] = −(1/n) tr(C^s C) − (1/(σ_0² n)) (C X β_0)^T (C X β_0)
All these expectations exist in the limit by Assumption 3.14 and Lemmas 3.18–3.19. Then, by nonsingularity of E[H(w_i; θ)], we can say that

[(1/n) H(w; θ̂)]^{-1} →p (E[H(w; θ_0)])^{-1}.
4. Recall that the first-order derivatives of the log-likelihood function at θ_0 are given by (see Section 3.2.2):

(1/√n) ∂log L_n(θ_0)/∂θ =
[ (1/(σ_0² √n)) X_n^T ε_n                                                        ]
[ (1/(2σ_0⁴ √n)) (ε_n^T ε_n − nσ_0²)                                             ]
[ (1/(σ_0² √n)) (C_n X_n β_0)^T ε_n + (1/(σ_0² √n)) (ε_n^T C_n ε_n − σ_0² tr(C_n)) ]
As explained by Lee (2004, p. 1905), these are linear and quadratic functions of ε_n. In particular, the asymptotic distribution of (1/√n) ∂log L_n(θ_0)/∂θ may be derived from a central limit theorem for linear-quadratic forms. The matrix C_n is uniformly bounded in row sums. As the elements of X_n are bounded, the elements of C_n X_n β_0 are uniformly bounded for all n by Lemma 3.8. With the existence of higher-order moments of ε in Assumption 3.4, the central limit theorem for quadratic forms of double arrays of Kelejian and Prucha (2001) can be applied, and the limit distribution of the score vector follows.
Since E[(1/√n) ∂log L_n/∂θ] = 0, the variance matrix of (1/√n) ∂log L_n/∂θ under normality is:

−E[(1/n) ∂² log L_n(θ)/∂θ∂θ^T] =
[ (1/(σ² n)) X^T X        0                  (1/(σ² n)) X^T (CXβ)                                ]
[ 0                       1/(2σ⁴)            (1/(σ² n)) tr(C)                                    ]   (3.81)
[ (1/(σ² n)) (CXβ)^T X    (1/(σ² n)) tr(C)   (1/n) tr(C^s C) + (1/(σ² n)) (CXβ)^T (CXβ)         ]
which represents the average Hessian matrix (or information matrix when the ε's are normal). Then:

(1/√n) ∂log L_n(θ_0)/∂θ →d N(0, −E[H(w_i; θ)]),   (3.82)

and:

√n(θ̂_n − θ_0) →d (−E[H(w_i; θ)])^{-1} N(0, −E[H(w_i; θ)]) = N(0, Σ_θ^{-1}).
Appendix
3.A Terminology in Asymptotic Theory
3.B A function to estimate the SLM in R
For those interested in programming spatial models via ML, here I provide a small function to estimate the SLM, based on the spdep package and Algorithm 3.1.
##################################
# Spatial Lag Model Estimated via Maximum Likelihood
# By: Mauricio Sarrias
# Based on spdep code
#################################
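# NOTE (assumed context): the function header is not reproduced here. The
# lines below presuppose that it defines y, X (with an intercept column),
# the weight matrix W, n = nrow(X), k = ncol(X), omega (the eigenvalues of
# W), and rho_hat, the estimate of rho obtained by maximizing the
# concentrated log-likelihood (Algorithm 3.1).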
# Generate estimates
A <- (diag(n) - rho_hat * W)
Ay <- crossprod(t(A), y)
beta_hat <- solve(crossprod(X)) %*% crossprod(X, Ay) # See Equation (3.25)
error <- Ay - crossprod(t(X), beta_hat)
sigma2_hat <- crossprod(error) / n # See Equation (3.26)
# Hessian
C <- crossprod(t(W), solve(A)) # C = WA^{-1}
alpha <- sum(omega ^ 2 / ((1 - rho_hat * omega) ^ 2))
if (is.complex(alpha)) alpha <- Re(alpha)
b_b <- drop(1 / sigma2_hat) * crossprod(X) # k * k
b_rho <- drop(1 / sigma2_hat) * (t(X) %*% C %*% X %*% beta_hat) # k * 1
sig_sig <- n / (2 * sigma2_hat ^ 2) # 1 * 1
sig_rho <- drop(1 / sigma2_hat) * sum(diag(C)) # 1 * 1
rho_rho <- sum(diag(crossprod(C))) + alpha +
drop(1 / sigma2_hat) * crossprod(C %*% X %*% beta_hat) # 1*1
row_1 <- cbind(b_b, rep(0, k), b_rho)
row_2 <- cbind(t(rep(0, k)), sig_sig, sig_rho)
row_3 <- cbind(t(b_rho), sig_rho, rho_rho)
Hessian <- rbind(row_1, row_2, row_3)
std.err <- sqrt(diag(solve(Hessian)))
# Table of coefficients
all_names <- c(colnames(X), "sigma2", "rho")
all_coef <- c(beta_hat, sigma2_hat, rho_hat)
z <- all_coef / std.err
p <- pnorm(abs(z), lower.tail = FALSE) * 2
sar_table <- cbind(all_coef, std.err, z, p)
cat(paste("\nEstimates from SAR Model \n\n"))
colnames(sar_table) <- c("Estimate", "Std. Error", "z-value", "Pr(>|z|)")
rownames(sar_table) <- all_names
printCoefmat(sar_table)
}
# Concentrated log-likelihood function; det is the Jacobian |I - rho*W| (see Equation (3.32))
l_c <- - (n / 2) - (n / 2) * log(2 * pi) - (n / 2) * log(sigma2) + log(det)
return(l_c)
}
4 Hypothesis Testing
In the previous chapter we presented the spatial autoregressive models, the intuition underlying their DGP, and their estimation by ML. At this stage the following question arises: which model is more convenient for empirical analysis? There exist two ways to proceed. The first is to choose a spatial model according to theoretical considerations. The second approach suggests that a series of statistical tests should be carried out on the different specifications of the spatial autocorrelation models, to adopt the one that best controls for spatial autocorrelation among the residuals.
In this chapter we present some approaches to test whether the true spatial parameters are zero or not. In other words, we would like to assess the null H_0: λ = 0 or H_0: ρ = 0, against the alternative H_1: λ ≠ 0 or H_1: ρ ≠ 0. We first start with the Moran's I statistic, used to test whether there is evidence of spatial autocorrelation in the error term. Then, we present several tests based on the ML principle.
The Moran's I statistic based on the OLS residuals ε̂ is:

I = (ε̂^T W ε̂)/(ε̂^T ε̂)   (4.1)
The asymptotic distribution of the Moran statistic with regression residuals was developed by Cliff and Ord (1972, 1973). In particular, the following theorem gives us the moments of the Moran's I statistic and its distribution.
Theorem 4.1 — Moran’s I. Consider H0 : no spatial autocorrelation, and assume that ε ∼ N(0, σ 2 In ). Let
the Moran’s I statistic be:
I = (n/S_0) (ε̂^T W ε̂)/(ε̂^T ε̂)   (4.2)

where ε̂ = y − Xβ̂ is the vector of OLS residuals, β̂ = (X^T X)^{-1} X^T y, W is a spatial weight matrix, n is the number of observations, and S_0 is a standardization factor equal to the sum of all elements in the weight matrix. Then, the moments under the null are:

E(I) = (n/S_0) tr(MW)/(n − K)

E(I²) = (n/S_0)² [tr(MWMW^T) + tr((MW)²) + (tr(MW))²] / [(n − K)(n − K + 2)]   (4.3)

where M = I − X(X^T X)^{-1} X^T. Then:

z_I = (I − E(I))/Var(I)^{1/2} ∼ N(0, 1)   (4.4)

where Var(I) = E(I²) − E(I)².
According to Anselin (1988, p. 102), the interpretation of this test is not always straightforward, even though it is by far the most widely used approach. While the null hypothesis is obviously the absence of spatial dependence, a precise expression for the alternative hypothesis does not exist. Intuitively, the spatial weight matrix is taken to represent the pattern of potential spatial interaction that causes dependence, but the nature of the underlying DGP is not specified. Usually it is assumed to be of a spatial autoregressive form. However, the coefficient in (4.1) is mathematically equivalent to the coefficient from an OLS regression of W ε̂ on ε̂, rather than of ε̂ on W ε̂, which would correspond to an autoregressive process as in the SEM. In other words, Moran's I is a misspecification test that has power against a host of alternatives. These include spatial error autocorrelation, but also residual correlation caused by a spatial lag alternative, and even heteroskedasticity! Thus, the rejection of the null hypothesis of no spatial autocorrelation does not imply the alternative of spatial error autocorrelation, which is how this result is typically (and incorrectly) interpreted. Specifically, Moran's I also has considerable power against a spatial lag alternative, so rejection of the null does not provide any guidance in the choice between a spatial error and a spatial lag as the alternative spatial regression specification.
Ī = (ε̂^T W ε̂)/σ̃²,   (4.5)

with σ̃² being a normalizing factor that depends on the particular model chosen as the alternative hypothesis. In particular, if the alternative hypothesis is a SEM, the normalizing factor assumes the expression:

σ̃² = (ε̂^T ε̂/n) {tr[(W^T + W) W]}^{1/2}.   (4.6)

As a consequence, the test statistic can be defined as:

Ī = (n ε̂^T W ε̂) / (ε̂^T ε̂ {tr[(W^T + W) W]}^{1/2}).   (4.7)
The two expressions reported in Equations (4.2) and (4.7) coincide if the weight matrix has dichotomous entries, in which case w_{ij} = w_{ij}² and, therefore,

Σ_i Σ_j w_{ij} = {tr[(W^T + W) W]}^{1/2}.
In their paper, Kelejian and Prucha (2001) prove that the modified Moran test Ī converges in distribution to a standard normal distribution even when the a priori assumption of normality of the errors is not satisfied. Even if in large samples Ī ∼ N(0, 1), in small samples its expected value and variance may differ.
4.1.3 Example
We continue here with Anselin (1988)'s example (see Section 3.5) and analyze whether the regression residuals from an OLS model show evidence of spatial autocorrelation. To carry out the Moran's I test on the residuals in R we need to pass the regression object and the spatial weights object (listw) to the lm.morantest function, as sketched below.
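A sketch of the call (the objects ols and listw come from the running example):
# Moran's I test on the OLS residuals (two-sided)
lm.morantest(ols, listw, alternative = "two.sided")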
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = CRIME ~ INC + HOVAL, data = columbus)
## weights: listw
##
## Moran I statistic standard deviate = 2.681, p-value = 0.00734
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I Expectation Variance
## 0.212374153 -0.033268284 0.008394853
The default setting in this function is to compute the p-value for a one-sided test; to get a two-sided test, the alternative argument must be specified explicitly. The results show a Moran's I statistic of 0.212, which is highly significant and rejects the null hypothesis of uncorrelated error terms.
Recall that the Moran’s I statistic has high power against a range of alternatives. However, it does not
provide much help in terms of which alternative model would be most appropriate.
y = Xβ + (I_n − λW)^{-1} ε
(I_n − λW) y = (I_n − λW) Xβ + ε
y − λWy = (X − λWX)β + ε   (4.8)
y = λWy + Xβ − WX(λβ) + ε
resulting in a model that includes not only the spatially lagged dependent variable, W y, but also the spatially lagged explanatory variables, W X. Under some nonlinear restrictions we can see that (4.8) is equivalent to the SDM. Writing the unconstrained form of the model—the SDM—as y = γ_1 W y + Xγ_2 + W Xγ_3 + ε, the common factor hypothesis is:

H_0: γ_3 + γ_1 γ_2 = 0.   (4.11)

If the constraint holds, it follows that the SDM is equivalent to the SEM model.
Let q̂ = β̂_OLS − β̂_SEM, with

Var(q̂) = Var(β̂_OLS) − Var(β̂_SEM).   (4.12)

Then the Hausman statistic:

H = q̂^T [Var(q̂)]^{-1} q̂,   (4.13)

is distributed asymptotically chi-square with #β degrees of freedom.
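In R, a spatial Hausman test along these lines can be computed from the fitted SEM; the call below is a sketch and assumes the Hausman.test method for spatial error models is available:
# Spatial Hausman test contrasting OLS and SEM estimates of beta
Hausman.test(sem)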
β̂_OLS = (X^T X)^{-1} X^T y
       = (X^T X)^{-1} X^T [Xβ_0 + (I − λW)^{-1} ε]
β̂_OLS − β_0 = (X^T X)^{-1} X^T Bε

where B = (I − λW)^{-1}. Taking expectations, we get:

E[β̂_OLS − β_0] = E[(X^T X)^{-1} X^T Bε] = (X^T X)^{-1} X^T B E(ε) = 0
So the OLS estimator is unbiased. For the variance, we obtain:

Var(β̂_OLS) = E[(β̂ − E(β̂))(β̂ − E(β̂))^T]
            = E[(X^T X)^{-1} X^T B εε^T B^T X (X^T X)^{-1}]
            = (X^T X)^{-1} X^T B E(εε^T) B^T X (X^T X)^{-1}   (4.15)
            = σ² (X^T X)^{-1} X^T B B^T X (X^T X)^{-1}
Under the null of the spatial error process, the ML estimate σ̂², based on the variance of the residuals from the SEM, provides a consistent estimate of σ². The ML estimate λ̂ provides a consistent estimate of λ. With these estimates, we can compute the variance of the OLS estimates as in Equation (4.15) (Pace and LeSage, 2008).
Definition 4.4.1 — Likelihood Ratio Test. The Likelihood Ratio (LR) test is formally defined as:

LR = 2 · n [ (1/n) Σ_{i=1}^n log L_i(θ̂) − (1/n) Σ_{i=1}^n log L_i(θ̃) ] →d χ²(r)   (4.19)

where θ̂ is the unrestricted estimator, θ̃ the restricted estimator, and r is the number of constraints.
The number of constraints imposed may vary depending on the specification. In spatial models, the number of constraints is generally one or two, since we have the restriction ρ = 0, λ = 0, or λ = ρ = 0. The likelihood ratio test is designed to evaluate the distance that separates the values of the two likelihoods: if the distance is small, then the constrained model is comparable to the unconstrained model. In this case, the constrained version is "acceptable" and does not reduce the performance of the model. It is thus statistically possible not to reject the null hypothesis (the postulated constraints prove to be credible). In other words, if the likelihood value of the unconstrained model strays too far from that of the constrained model, we cannot accept the null hypothesis: the gap is too large for the constraint to be considered realistic.
log L(θ) = log|A| − (n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) (Ay − Xβ)^T (Ay − Xβ)   (4.20)
The log-likelihood for the constrained model is found by setting ρ = 0 in Equation (4.20). Recall that if ρ = 0, then A = I − ρW = I, so:

log L(θ) = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) (y − Xβ)^T (y − Xβ)   (4.21)
Therefore, the LR statistic for the SLM follows from our definition in Equation (4.19). For the SEM, the log-likelihood of the unconstrained model is:
log L(θ) = log|B| − (n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) (y − Xβ)^T Ω(λ) (y − Xβ)   (4.23)
Then the LR for the SEM model is:

LR = 2 log|B| + (1/σ²) [(y − Xβ)^T (y − Xβ) − (y − Xβ)^T Ω(λ) (y − Xβ)]   (4.24)
which is also distributed as χ²(1). We can use the formulae above or use the following algorithm:
1. Compute the restricted MLE θ̃ and record the value of the log-likelihood function at convergence, log L(θ̃).
2. Compute the unrestricted MLE θ̂ and record the value of the log-likelihood function at convergence, log L(θ̂).
3. Compute

LR = 2 [log L(θ̂) − log L(θ̃)].

This statistic is always positive because the unrestricted maximum value always exceeds the restricted one.
4. Compare LR with the critical value of the chi-square distribution with 1 degree of freedom.
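A minimal sketch of this algorithm in R, using the models fitted in Section 3.5 (objects slm and ols are assumed):
# LR test of the SLM (unrestricted) against OLS (restricted, rho = 0)
LR <- 2 * (as.numeric(logLik(slm)) - as.numeric(logLik(ols)))
LR
pchisq(LR, df = 1, lower.tail = FALSE)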
Consider a null hypothesis expressed as a set of r (possibly nonlinear) restrictions:

r(θ_0) = 0.

Let also

R(θ) = ∂r(θ)/∂θ^T.

The Wald test is given by:

W = n · r(θ̂)^T [R(θ̂) V̂ R(θ̂)^T]^{-1} r(θ̂) →d χ²(r)   (4.25)
For the SLM, the Wald test of H_0: ρ = 0 is:

W_ρ = ρ̂² / V̂ar(ρ̂)   (4.26)

where V̂ar(ρ̂) can be obtained from Equation 3.50 as:

V̂ar(ρ̂) = [tr(C^s C) + (1/σ²) (CXβ)^T (CXβ)]^{-1}   (4.27)

Clearly,

ρ̂/se(ρ̂) ∼a N(0, 1)   (4.28)

with se(ρ̂) the estimated standard deviation.
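A minimal sketch in R, using the SLM fitted in Section 3.5 (it assumes, as sarlm objects do, that the fit stores the estimate rho and its standard error rho.se):
# Wald test of rho = 0 from the fitted SLM
W_rho <- (slm$rho / slm$rho.se)^2
pchisq(W_rho, df = 1, lower.tail = FALSE)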
Extensions to hypotheses consisting of linear and nonlinear combinations of model parameters can be obtained in a straightforward way. Computationally, the Wald and LR tests are more demanding since they require ML estimation under the alternative, and the explicit forms of the tests are more complicated.
Analogously, for the SEM, the Wald test of H_0: λ = 0 is:

W_λ = λ̂² / V̂ar(λ̂)   (4.29)

where V̂ar(λ̂) can be obtained from Equation 3.59 as:

V̂ar(λ̂) = [tr(W_B²) + tr(W_B^T W_B) − (2/n) (tr(W_B))²]^{-1}   (4.30)
Algorithm 4.3 — Wald Test. Let θ = (θ_1^T, θ_2^T)^T. In general, to compute the Wald test statistic for H_0: θ_{02} = 0, use

V̂_w = [I_{22}(θ̂) − I_{21}(θ̂) I_{11}(θ̂)^{-1} I_{12}(θ̂)]^{-1}   (4.32)
Theorem 4.4 — Lagrange Multiplier Test. The Lagrange multiplier test statistic is:

LM = (∂log L(θ̃)/∂θ)^T [I(θ̃)]^{-1} (∂log L(θ̃)/∂θ) →d χ²(r)   (4.33)

Under the null hypothesis, LM has a limiting chi-square distribution with degrees of freedom equal to the number of restrictions. All terms are computed at the restricted estimator.
The main advantage of the LM statistic is that it only requires the constrained model to be estimated, and it is very often less complex since it mainly relies on OLS. This is one of the reasons that has led to the widespread use of this approach.
The construction of the LM test depends on the postulated specification of the spatial autoregressive DGP: SEM or SLM. The usual practice is to initially use a general test for detecting residual spatial autocorrelation (Moran's I test, for example) and then carry out the LM tests to identify the specific type of autoregressive process.
where W_B = W(I − λW)^{-1}. Under the null, E_{H0}[∂² ln L/∂β∂λ] = 0 and E_{H0}[∂² ln L/∂σ²∂λ] = 0, and

E_{H0}[∂² log L(θ)/∂λ²] = −tr(W² + W^T W)   (4.38)
Then the expression for the LM test for a SEM specification is:

LM_ERR = (1/T) [ε̂^T W ε̂ / σ̂²]²   (4.39)

where T = tr[(W + W^T) W]. Therefore, the test requires only OLS estimates. Under the null hypothesis, this statistic converges asymptotically to a χ²(1). For example, at the 5% significance level the critical value is 3.84; thus, we reject the null hypothesis if the value of the LM_ERR statistic is greater than 3.84. In that case we conclude that spatial autocorrelation is present in the standard linear model residuals, and we should proceed to estimate the SEM specification.
Note also that it is similar in expression to Moran’s I: except for the scaling factor T , this statistic is
essentially the square of Moran’s I.
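A minimal sketch computing LM_ERR by hand from the OLS residuals (the objects ols and listw come from the running example; listw2mat() is used to obtain a dense W):
# LM_ERR = (e'We / sigma2)^2 / T, with T = tr[(W + W')W]
e  <- residuals(ols)
s2 <- sum(e^2) / length(e)
Wm <- listw2mat(listw)
Tc <- sum(diag((Wm + t(Wm)) %*% Wm))
LM_err <- (drop(crossprod(e, Wm %*% e)) / s2)^2 / Tc
LM_err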
s_{ρ=0} = ∂log L(θ)/∂ρ |_{ρ=0} = (1/σ²) ε^T W y   (4.40)
The inverse of the information matrix is given in (3.50). The complicating feature of this matrix is that, even under ρ = 0, it is not block diagonal; the (ρ, β) term is equal to (X^T W Xβ)/σ², obtained by inserting ρ = 0, i.e., C = W. The main consequence is that, even under ρ = 0, we cannot ignore one of the off-diagonal terms. This is not the case for s_{λ=0}, whose asymptotic variance was obtained using just the (2, 2) element of the corresponding information matrix. For the spatial lag model, the asymptotic variance of s_{ρ=0} is obtained from the reciprocal of the last element of:
Var(β, σ², ρ)|_{ρ=0} =
[ (1/σ²) X^T X   0          (1/σ²) X^T W Xβ                                  ]^{-1}
[ ·              n/(2σ⁴)    0                                                 ]
[ ·              ·          tr(W² + W^T W) + (1/σ²) (W Xβ)^T (W Xβ)          ]

where the dots denote the symmetric entries.
Under ρ = 0 we have C = W and tr(W) = 0. Recalling that T = tr[(W^T + W) W], we can write:

LM_SAR = (1/T_1) [ε̂^T W y / σ̂²]²   (4.41)

where T_1 = [(W Xβ̂)^T M (W Xβ̂) + T σ̂²]/σ̂², with M = I − X(X^T X)^{-1} X^T. Under the null hypothesis, the test converges asymptotically to a χ² distribution with 1 degree of freedom.
• When the LM_LAG test value is significant and LM_ERR is insignificant, the most appropriate model is the SLM;
• in the same vein, when the LM_ERR test is significant and the LM_LAG value is insignificant, the most appropriate model is the SEM.

As you can guess, sometimes both statistical tests are significant. In this case, one decision rule can be as follows:

• when the LM_LAG test value is higher than the LM_ERR test value, it would be best to consider the SLM model;
• when the LM_ERR test value is higher than the LM_LAG test value, it would be best to consider the SEM model.

Of course, if both statistics are significant, it could also be appropriate to estimate a general autoregressive model (SAC).
# LM test
lm.LMtests(ols, listw,
test = c("LMerr", "RLMerr", "LMlag", "RLMlag"))
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = CRIME ~ INC + HOVAL, data = columbus)
## weights: listw
##
## LMerr = 4.6111, df = 1, p-value = 0.03177
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = CRIME ~ INC + HOVAL, data = columbus)
## weights: listw
##
## RLMerr = 0.033514, df = 1, p-value = 0.8547
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = CRIME ~ INC + HOVAL, data = columbus)
## weights: listw
##
## LMlag = 7.8557, df = 1, p-value = 0.005066
##
##
## Lagrange multiplier diagnostics for spatial dependence
##
## data:
## model: lm(formula = CRIME ~ INC + HOVAL, data = columbus)
## weights: listw
##
## RLMlag = 3.2781, df = 1, p-value = 0.07021
Note that both LMerr and LMlag are significant. However, the robust statistics point to the lag model as the proper alternative. With this information in hand, we can select the spatial lag model as the proper model.
5 Instrumental Variables and GMM
In the previous chapter, we learnt how to estimate spatial models using ML. One of the main disadvantages of this method is that it may be computationally intensive when the number of spatial units is large. Recall that this procedure requires the manipulation of n × n matrices: matrix multiplication, matrix inversion, the computation of characteristic roots, and so on.
In this chapter, we will study the instrumental variables and generalized method of moments (IV/GMM) estimators. One of the reasons for developing IV/GMM estimators was as a response to the perceived computational difficulties of the ML method (Kelejian and Prucha, 1998, 1999). Unlike ML, the IV/GMM procedure does not require the computation of the Jacobian, and it does not rely on the normality assumption.
θ̂_n = argmin_θ g_n(w_1, ..., w_n, θ)^T Υ g_n(w_1, ..., w_n, θ),   (5.1)

where g_n(·) is (S × 1) and Υ is (S × S). If S = K, the weighting matrix is irrelevant and θ̂_n can be found as a solution to the moment condition:

g_n(w_1, ..., w_n, θ̂) = 0.   (5.2)
The classical GMM literature exploits linear moment conditions of the form

E[(1/n) Σ_{i=1}^n h_i^T u_i] = 0,

which holds since E[h_i^T u_i] = h_i^T E[u_i] = 0 under the maintained assumptions. The spatial literature frequently considers quadratic moment conditions. Let A_q, with elements (a_{ijq}), be some n × n matrix with tr(A_q) = 0, and assume for ease of exposition that A_q is non-stochastic. Then the quadratic moment conditions considered in the spatial literature are of the form:

E[(1/n) Σ_{i=1}^n Σ_{j=1}^n a_{ijq} u_i u_j] = 0,   (5.3)
which clearly holds under the maintained assumptions. To see this, let u = [u_1, ..., u_n]^T; then the moment conditions in (5.3) can be rewritten as:

E[u^T A_q u / n] = tr(A_q E[uu^T])/n = σ² tr(A_q)/n = 0,

since under the maintained assumptions E[uu^T] = σ² I_n.
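A quick Monte Carlo illustration of this property (a sketch; the trace-zero matrix A below is hypothetical):
# E[u' A u]/n = sigma^2 tr(A)/n, which is zero when tr(A) = 0
set.seed(42)
n <- 100
A <- matrix(rnorm(n * n), n, n)
diag(A) <- 0                                   # enforce tr(A) = 0
qm <- replicate(2000, { u <- rnorm(n); drop(t(u) %*% A %*% u) / n })
mean(qm)                                       # approximately zero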
Now let θ_0 = [λ_0, δ_0^T]^T and suppose the sample moment vector in (5.2) can be decomposed into:

g_n(w_1, ..., w_n, θ) = [ g_n^λ(w_1, ..., w_n, λ, δ) ; g_n^δ(w_1, ..., w_n, λ, δ) ],

where λ is, for example, the spatial autoregressive parameter and δ collects the rest of the parameters of the model, such that:
and that some easily (and consistently) computable initial estimator for δ_0, say δ̂_n, is available. In this case we may consider the following GMM estimator for λ_0 corresponding to some weighting matrix Υ_n^{λλ}:

Utilizing λ̂_n, we may further consider the following estimator for δ_0 corresponding to some weighting matrix Υ_n^{δδ}:
GMM estimators like θ̂_n in Equation (5.1) are often referred to as one-step estimators. Estimators like λ̂_n and δ̂_n in Equations (5.4) and (5.5) above, where the sample moments depend on some initial estimator, are often referred to as two-step estimators.
If the model conditions are valid, we would expect the most efficient one-step estimator to be more efficient than the most efficient two-step estimator. However, as usual, there are trade-offs. One trade-off is in terms of computation. Recall that for small sample sizes ML is available as an alternative to GMM; for large sample sizes, statistical efficiency may be less important than computational efficiency and feasibility, and thus the use of two-step GMM estimators may be attractive. Also, Monte Carlo studies suggest that in many situations the loss of efficiency may be relatively small. Another trade-off is that the misspecification of one moment condition will typically result in inconsistent estimates of all model parameters.
G_n(θ) ≡ ∂g_n(θ)/∂θ^T.

Now, applying a Taylor expansion to g_n(θ̂) yields:

g_n(θ̂) = g_n(θ_0) + G_n(θ̄)(θ̂_n − θ_0),   (5.7)

where g_n is (S × 1), G_n is (S × K), and θ̄ lies between θ̂_n and θ_0. Assume that:

G_n(θ̂) →p G by some LLN   (5.8)
Υ_n →p Υ by some LLN   (5.9)
√n g_n(θ_0) →d N(0, Ψ) by some CLT   (5.10)
where Ψ is some positive definite matrix. Then, applying traditional asymptotic arguments:

√n(θ̂_n − θ_0) →d N(0, Φ),

where:

Φ = (G^T Υ G)^{-1} G^T Υ Ψ Υ G (G^T Υ G)^{-1}.
It can be seen that if we choose Υ = Ψ̂_n^{-1} (weights given by the inverse of the variance-covariance matrix of the moment conditions), where Ψ̂_n →p Ψ, the variance-covariance matrix simplifies to

Var(θ̂_n) = Φ = (G^T Ψ^{-1} G)^{-1}.
Since (G^T Υ G)^{-1} G^T Υ Ψ Υ G (G^T Υ G)^{-1} − (G^T Ψ^{-1} G)^{-1} is positive semidefinite, it follows that Υ = Ψ̂_n^{-1} gives the optimal GMM estimator (the smallest asymptotic variance).
However, note that we need a CLT applicable to triangular arrays; in particular, we need a CLT for linear-quadratic forms. The following theorems will be useful when deriving the asymptotic properties of spatial GMM estimators.
Theorem 5.1 — CLT for triangular arrays with homoskedastic errors (Kelejian and Prucha, 1998). Let {v_{i,n}, 1 ≤ i ≤ n, n ≥ 1} be a triangular array of identically distributed random variables. Assume that the random variables {v_{i,n}, 1 ≤ i ≤ n} are jointly independently distributed for each n, with E(v_{i,n}) = 0 and E(v_{i,n}²) = σ² < ∞. Let {a_{ij,n}, 1 ≤ i ≤ n, n ≥ 1}, j = 1, ..., k, be triangular arrays of real numbers that are bounded in absolute value, and let

v_n = (v_{1,n}, ..., v_{n,n})^T,  A_n = [a_{ij,n}] an n × k matrix.

Then, provided Q_{AA} = lim_{n→∞} (1/n) A_n^T A_n exists and is nonsingular,

(1/√n) A_n^T v_n →d N(0, σ² Q_{AA}).
Theorem 5.2 — CLT for Vectors of Linear Quadratic Forms with Heteroskedastic Innovations. Assume the following:
1. For r = 1, \ldots, m let A_{r,n} with elements (a_{ijr})_{i,j=1,\ldots,n} be an n × n non-stochastic symmetric real matrix with \sup_{1\le j\le n, n\ge 1} \sum_{i=1}^n |a_{ijr}| < \infty,
2. and let a_r = (a_{1r}, \ldots, a_{nr})^\top be an n × 1 non-stochastic real vector with \sup_n n^{-1}\sum_{i=1}^n |a_{ir}|^{\delta_1} < \infty for some \delta_1 > 2.
3. Let \varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top be an n × 1 random vector with the \varepsilon_i distributed totally independently with E[\varepsilon_i] = 0, E(\varepsilon_i^2) = \sigma_i^2, and \sup_{1\le i\le n, n\ge 1} E|\varepsilon_i|^{\delta_2} < \infty for some \delta_2 > 4.
Consider the m × 1 vector of linear quadratic forms v_n = [Q_{1n}, \ldots, Q_{mn}]^\top with:
\[ Q_{rn} = \varepsilon^\top A_r \varepsilon + a_r^\top \varepsilon = \sum_{i=1}^n \sum_{j=1}^n a_{ijr}\varepsilon_i\varepsilon_j + \sum_{i=1}^n a_{ir}\varepsilon_i. \tag{5.11} \]
Let \mu_{v_n} = E[v_n] = [\mu_{Q_1}, \ldots, \mu_{Q_m}]^\top and \Sigma_{v_n} = [\sigma_{Q_{rs}}]_{r,s=1,\ldots,m} denote the mean and VC matrix of v_n, respectively. Then:
\[ \mu_{Q_r} = \sum_{i=1}^n a_{iir}\sigma_i^2, \]
\[ \sigma_{Q_{rs}} = 2\sum_{i=1}^n\sum_{j=1}^n a_{ijr}a_{ijs}\sigma_i^2\sigma_j^2 + \sum_{i=1}^n a_{ir}a_{is}\sigma_i^2 + \sum_{i=1}^n a_{iir}a_{iis}\left[\mu_i^{(4)} - 3\sigma_i^4\right] + \sum_{i=1}^n \left(a_{ir}a_{iis} + a_{is}a_{iir}\right)\mu_i^{(3)}, \]
with \mu_i^{(3)} = E(\varepsilon_i^3) and \mu_i^{(4)} = E(\varepsilon_i^4). Furthermore, given that n^{-1}\lambda_{\min}(\Sigma_{v_n}) \ge c for some c > 0, then
\[ \Sigma_{v_n}^{-1/2}\left(v_n - \mu_{v_n}\right) \xrightarrow{d} N(0, I_m), \]
and thus:
\[ n^{-1/2}\left(v_n - \mu_{v_n}\right) \overset{a}{\sim} N\left(0, n^{-1}\Sigma_{v_n}\right). \]
Kelejian and Prucha (2001) introduced a CLT for a single quadratic form under the assumptions useful
for spatial models. The generalization to vectors of linear quadratic forms is given in Kelejian and Prucha
(2010).
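To make Theorem 5.2 concrete, the following R sketch simulates a single linear quadratic form Q = ε'Aε + a'ε with independent heteroskedastic normal errors and checks the analytic mean and variance against Monte Carlo draws; for normal errors the third- and fourth-moment terms vanish, since μ(3) = 0 and μ(4) = 3σ⁴. The matrix A and vector a are arbitrary illustrative choices.
# Monte Carlo check of the mean/variance of Q = e'Ae + a'e with
# independent heteroskedastic normal errors (illustrative A and a).
set.seed(42)
n  <- 50
A  <- matrix(rnorm(n * n), n, n); A <- (A + t(A)) / 2   # symmetrize
a  <- rnorm(n)
s2 <- runif(n, 0.5, 2)                                  # sigma_i^2

mu_Q  <- sum(diag(A) * s2)                              # sum_i a_ii sigma_i^2
var_Q <- 2 * sum((A^2 %*% s2) * s2) + sum(a^2 * s2)     # normal-error case

Q <- replicate(20000, {
  e <- rnorm(n, sd = sqrt(s2))
  drop(t(e) %*% A %*% e) + sum(a * e)
})
c(analytic_mean = mu_Q, simulated_mean = mean(Q))
c(analytic_var = var_Q, simulated_var = var(Q))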
Consider the two-step GMM estimator for \lambda_0 defined in Equation (5.4). Applying this approach, and assuming typical regularity conditions, we get:
\[ \sqrt{n}\left(\hat\lambda_n - \lambda_0\right) = -\left[\left(G_n^{\lambda\lambda}\right)^\top \Upsilon_n^{\lambda\lambda} G_n^{\lambda\lambda}\right]^{-1}\left(G_n^{\lambda\lambda}\right)^\top \Upsilon_n^{\lambda\lambda}\left[\sqrt{n}\,g_n^\lambda(\lambda_0, \delta_0) + G_n^{\lambda\delta}\sqrt{n}\left(\hat\delta_n - \delta_0\right)\right] + o_p(1), \tag{5.12} \]
where
\[ \frac{\partial g_n^\lambda(\lambda_0,\delta_0)}{\partial\lambda} \xrightarrow{p} G^{\lambda\lambda}, \qquad \frac{\partial g_n^\lambda(\lambda_0,\delta_0)}{\partial\delta} \xrightarrow{p} G^{\lambda\delta}, \qquad \Upsilon_n^{\lambda\lambda} \xrightarrow{p} \Upsilon^{\lambda\lambda}. \]
In many cases the estimator \hat\delta_n will be asymptotically linear in the sense that
\[ \sqrt{n}\left(\hat\delta_n - \delta_0\right) = \frac{1}{\sqrt{n}} T_n^\top u_n + o_p(1), \]
where T_n is a non-stochastic n × k_\delta matrix, k_\delta is the dimension of \delta_0, and u_n = (u_1, \ldots, u_n)^\top. Now define:
\[ g_{*n}^\lambda(\lambda_0, \delta_0) = g_n^\lambda(\lambda_0, \delta_0) + \frac{1}{n} G^{\lambda\delta} T_n^\top u_n. \]
Then Equation (5.12) can be rewritten as:
\[ \sqrt{n}\left(\hat\lambda_n - \lambda_0\right) = -\left[\left(G^{\lambda\lambda}\right)^\top \Upsilon^{\lambda\lambda} G^{\lambda\lambda}\right]^{-1}\left(G^{\lambda\lambda}\right)^\top \Upsilon^{\lambda\lambda}\,\sqrt{n}\,g_{*n}^\lambda(\lambda_0, \delta_0) + o_p(1), \tag{5.13} \]
with:
\[ \Phi_*^{\lambda\lambda} = \left[\left(G^{\lambda\lambda}\right)^\top \Upsilon^{\lambda\lambda} G^{\lambda\lambda}\right]^{-1}\left(G^{\lambda\lambda}\right)^\top \Upsilon^{\lambda\lambda}\,\Psi_*^{\lambda\lambda}\,\Upsilon^{\lambda\lambda} G^{\lambda\lambda}\left[\left(G^{\lambda\lambda}\right)^\top \Upsilon^{\lambda\lambda} G^{\lambda\lambda}\right]^{-1}, \]
where \Psi_*^{\lambda\lambda} denotes the limiting VC matrix of \sqrt{n}\,g_{*n}^\lambda(\lambda_0,\delta_0). From this it is seen that if we choose \Upsilon_n^{\lambda\lambda} = \left(\hat\Psi_{*n}^{\lambda\lambda}\right)^{-1}, where \hat\Psi_{*n}^{\lambda\lambda} \xrightarrow{p} \Psi_*^{\lambda\lambda}, then the variance-covariance matrix simplifies to:
\[ \Phi_*^{\lambda\lambda} = \left[\left(G^{\lambda\lambda}\right)^\top \left(\Psi_*^{\lambda\lambda}\right)^{-1} G^{\lambda\lambda}\right]^{-1}. \]
So, using as weighting matrix \Upsilon_n^{\lambda\lambda} a consistent estimator of the inverse of the limiting variance-covariance matrix \Psi_*^{\lambda\lambda} yields the efficient two-step GMM estimator.
Suppose that Equation (5.10) holds and:
\[ \Psi = \begin{bmatrix} \Psi^{\lambda\lambda} & \Psi^{\lambda\delta} \\ \Psi^{\delta\lambda} & \Psi^{\delta\delta} \end{bmatrix}; \]
then the limiting distribution of the sample moment vector g_n^\lambda evaluated at the true parameters is given by:
\[ \sqrt{n}\,g_n^\lambda(\lambda_0, \delta_0) \xrightarrow{d} N\left(0, \Psi^{\lambda\lambda}\right). \]
Note that in general \Psi_*^{\lambda\lambda} \neq \Psi^{\lambda\lambda} unless G^{\lambda\delta} = 0, and that in general \Psi_*^{\lambda\lambda} will depend on T_n, which in turn will depend on the employed estimator \hat\delta_n. In other words, unless G^{\lambda\delta} = 0, for a two-step GMM estimator we cannot simply use the variance-covariance matrix \Psi^{\lambda\lambda} of the sample moment vector g_n^\lambda(\lambda_0, \delta_0); rather, we need to work with the variance-covariance matrix \Psi_*^{\lambda\lambda}.
Prucha (2014) illustrates the difference between \Psi^{\lambda\lambda}, with elements \Psi_{rs}^{\lambda\lambda}, and \Psi_*^{\lambda\lambda}, with elements \Psi_{*rs}^{\lambda\lambda}, for the important special case where the moment conditions are quadratic and u_i is i.i.d. N(0, \sigma^2). For simplicity assume that:
\[ g_n^\lambda(\lambda_0, \delta_0) = \begin{bmatrix} \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n a_{ij1}u_iu_j \\ \frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n a_{ij2}u_iu_j \end{bmatrix}. \]
Now, for r = 1, 2, let a_{ir} denote the (r, i)th element of G^{\lambda\delta}T_n^\top. Then, by the definition of g_{*n}^\lambda:
\[ g_{*n}^\lambda(\lambda_0, \delta_0) = \begin{bmatrix} \frac{1}{n}\sum_i\sum_j a_{ij1}u_iu_j + \frac{1}{n}\sum_i a_{i1}u_i \\ \frac{1}{n}\sum_i\sum_j a_{ij2}u_iu_j + \frac{1}{n}\sum_i a_{i2}u_i \end{bmatrix}, \]
but
\[ \Psi_{*rs}^{\lambda\lambda} = 2\sigma^4\frac{1}{n}\sum_{i=1}^n\sum_{j=1}^n a_{ijr}a_{ijs} + \sigma^2\frac{1}{n}\sum_{i=1}^n a_{ir}a_{is}. \]
Note that a_{ir} and a_{is} in the last sum on the RHS of the expression for \Psi_{*rs}^{\lambda\lambda} depend on which estimator \hat\delta_n is employed in the sample moment vector g_n^\lambda(\lambda_0, \hat\delta) used to form the objective function for the two-step GMM estimator \hat\lambda_n defined in Equation (5.4). It is for this reason that, in the literature on two-step GMM estimation, users are often advised to follow a specific sequence of steps to ensure proper estimation of the respective variance-covariance matrices.
5.2 Spatial Two Stage Estimation of SLM
Recall the SLM:²
\[ y = \rho W y + X\beta + \varepsilon. \]
A more concise way to express the model is:
\[ y = Z\delta + \varepsilon, \]
where Z = [X, W y] and the (K + 1) × 1 coefficient vector is rearranged as \delta = (\beta^\top, \rho)^\top. As we have previously shown in Section 3.1, the presence of the spatially lagged dependent variable on the right-hand side of the equation induces endogeneity, or simultaneous equation bias; therefore the OLS estimates are inconsistent.
Instead of applying the QML or ML estimation procedure, we might rely on the instrumental variables approach to deal with the endogeneity caused by the spatial lag variable. The principle of instrumental variables estimation is the existence of a set of instruments H that are strongly correlated with Z but asymptotically uncorrelated with \varepsilon.
At this point it is important to stress that the only endogenous variable in this model is the spatially lagged dependent variable. Therefore, the matrix H should contain all the predetermined variables, that is, X and the instrument(s) for W y. As we will see later, an important feature of this estimation procedure is that it does not require computing the Jacobian term. Another important feature is that it does not impose the strong assumption of normality on the error terms.
2 In particular, Kelejian and Prucha (1998) derived this model as the first step in their Generalized S2SLS.
In principle, the problem is to approximate E(y|X) as closely as possible without computing the inverse of (I_n - \rho_0 W). Note that (5.14) can be expressed as a linear function of X, W X, W^2X, \ldots. As a result, and given that the characteristic roots of \rho W are less than one in absolute value, the conditional expectation can also be written as:
\[ \begin{aligned} E(W y|X) &= W\,E(y|X) \\ &= W\left(I_n - \rho W\right)^{-1}X\beta \\ &= W\left(I_n + \rho W + \rho^2 W^2 + \rho^3 W^3 + \cdots\right)X\beta \\ &= W\left[\sum_{l=0}^{\infty}\rho_0^l W^l\right]X\beta \\ &= W X\beta + W^2 X(\rho\beta) + W^3 X(\rho^2\beta) + W^4 X(\rho^3\beta) + \cdots \end{aligned} \]
To avoid issues associated with the computation of the inverse of the n × n matrix (I_n - \rho_0 W), Kelejian and Prucha (1998, 1999) suggest using an approximation of the best instruments. More specifically, since E(y|X) is linear in X, W X, W^2X, \ldots, they suggest using a set of instruments H which contains, say, X, W X, W^2X, \ldots, W^lX, and computing approximations of the best instruments from a regression of the right-hand-side variables on H, where l is a pre-selected finite constant, generally set to 2 in applied studies. Thus, in general we can write the instrument matrix as:
\[ H = \left(X, W X, W^2 X\right). \]
R The intuition behind the instruments is the following: since X determines y, it must be true that W X, W^2X, \ldots determine W y. Furthermore, since X is uncorrelated with \varepsilon, W X must also be uncorrelated with \varepsilon.
In the theoretical literature, some other suggestions for so-called optimal instruments have been made. For example, using the conditional expectation in (5.14), Lee (2003) suggested the instrument matrix:
\[ H = \left[X,\; W\left(I - \rho W\right)^{-1}X\beta\right], \]
which requires consistent first-round estimates of \rho and \beta. In Kelejian et al. (2004), a similar approach is outlined in which the matrix inverse is replaced by its power expansion. This yields an instrument matrix of the form:
\[ H = \left[X,\; W\left(\sum_{l=0}^{\infty}\rho_0^l W^l\right)X\beta\right]. \]
In any practical implementation, the power expansion must be truncated at some point.
Assumption 5.3 (Heteroskedastic Errors) states the first two moments of the error terms; we do not assume that they are normally distributed. We also allow the error terms to be heteroskedastic, i.e., the unobservables may have a different variance for each spatial unit. Finally, this assumption also allows the innovations to depend on the sample size n, i.e., to form a triangular array. See our discussion of triangular arrays in Section 3.6.1.
Now, we state some assumptions about the behavior of the spatial weight matrix W .
Assumption 5.4 — Diagonal elements of Wn (Kelejian and Prucha, 1998). All diagonal elements of the spatial weighting matrix Wn are zero.
Assumption 5.4 (Diagonal elements of Wn) is a normalization of the model; it also implies that no spatial unit is viewed as its own neighbor.
Assumption 5.5 — Nonsingularity (Kelejian and Prucha, 1998). The matrix (In − ρ0Wn) is nonsingular with |ρ0| < 1.
Under the Nonsingularity Assumption 5.5, we can write the reduced form of the true model as:
\[ y_n = \left(I_n - \rho_0 W_n\right)^{-1}X_n\beta_0 + \left(I_n - \rho_0 W_n\right)^{-1}\varepsilon_n. \tag{5.15} \]
Assumption 5.6 — Bounded matrices (Kelejian and Prucha, 1998). The row and column sums of the matrices Wn and (In − ρ0Wn)⁻¹ are bounded uniformly in absolute value.
This assumption guarantees that the variance of yn in Equation (5.15), which depends on Wn and (In − ρ0Wn)⁻¹, is uniformly bounded in absolute value as n goes to infinity, thus limiting the degree of correlation between the elements of εn and yn. The assumption is technical and will be used in the large-sample derivations of the regression parameter estimator.
R Applied to W, Assumption 5.6 (Bounded matrices) means that each cross-sectional unit can have only a limited number of neighbors; applied to (I − ρW)⁻¹, it limits the degree of correlation across units.
Assumption 5.7 — No Perfect Multicollinearity (Kelejian and Prucha, 1998). The regressor matrices Xn have full column rank (for n large enough). Furthermore, the elements of the matrices Xn are uniformly bounded in absolute value.
Assumption 5.8 — Rank Instruments, (Kelejian and Prucha, 1998). The instrument matrices Hn have full
column rank L ≥ K + 1 for all n large enough. Furthermore, the elements of the matrices Hn are
uniformly bounded in absolute value. They are composed of a subset of the linearly independent columns
of (X, W X, W 2 X, ...).
Assumption 5.9 — Limits of Instruments (Kelejian and Prucha, 1998) . Let Hn be a matrix of instruments,
then:
1. limn→∞ n−1 Hn> Hn = QHH where QHH is finite and nonsingular (full rank).
2. plimn→∞ n−1 Hn> Zn = QHZ where QHZ is finite and has full column rank.
Since the instrument matrix Hn contains the spatially lagged explanatory variables, the first condition in Assumption 5.9 (Limits of Instruments), limn→∞ n⁻¹Hn>Hn = QHH, implies that WnXn and Xn cannot be linearly dependent. This condition would be violated if, for example, WnXn included a spatial lag of the constant term, or if the model were the pure SLM. The second condition in Assumption 5.9 requires a non-null correlation between the instruments and the original variables.
Given all these assumptions, we can define the S2SLS estimator as follows.
Definition 5.2.1 — Spatial Two Stage Least Squares Estimator. Let Hn be the n × L matrix of instruments. Then the S2SLS estimator is given by:
\[ \hat\delta_{S2SLS} = \left(\hat Z_n^\top Z_n\right)^{-1}\hat Z_n^\top y_n, \tag{5.16} \]
where:
\[ \hat Z_n = P_H Z_n = H_n\left(H_n^\top H_n\right)^{-1}H_n^\top Z_n. \tag{5.17} \]
Note that the S2SLS estimator in (5.16) is similar to the standard 2SLS. In the first stage we need the predicted values for Z based on the OLS regression of Z on H. Consider this first stage as the regression Z = H\theta + \xi, so that \hat\theta = (H^\top H)^{-1}H^\top Z. Then the predicted values \hat Z are obtained using Equation (5.17), where P_H is the projection matrix, which is symmetric and idempotent (and hence singular). Note also that H is an n × L matrix which includes the exogenous variables X. It is also important to note that the projection matrix does not affect X, but it does affect the endogenous variable W y:
\[ P_H Z = \left[X, P_H W y\right] = \left[X, \widehat{W y}\right]. \tag{5.18} \]
Note that this approach is in the same spirit as the traditional treatment in a simultaneous equations setting, where each endogenous variable (including the spatial lag) is regressed on the complete set of exogenous variables to form its instrument.
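As a minimal sketch of Definition 5.2.1, the following R code computes the S2SLS estimator "by hand" on simulated data; the data-generating process and all object names here are illustrative assumptions rather than part of the formal results.
# S2SLS by hand on a simulated SLM (illustrative DGP).
set.seed(123)
n <- 400
W <- matrix(0, n, n)
W[cbind(1:n, c(2:n, 1))] <- 1                 # simple "ring" neighbours (assumption)
W <- W / rowSums(W)                           # row-standardize
X <- cbind(1, rnorm(n))
beta <- c(1, 2); rho <- 0.4
y <- solve(diag(n) - rho * W, X %*% beta + rnorm(n))  # reduced form

Z  <- cbind(X, W %*% y)                               # Z = [X, Wy]
H  <- cbind(X, W %*% X[, 2], W %*% (W %*% X[, 2]))    # H = [X, Wx, W^2 x]
Zh <- H %*% solve(crossprod(H), crossprod(H, Z))      # first stage: P_H Z
solve(crossprod(Zh, Z), crossprod(Zh, y))             # (Zhat'Z)^{-1} Zhat'y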
The S2SLS estimator can also be given a GMM interpretation:
\[ \hat\delta_{GMM} = \underset{\delta}{\arg\min}\; g_n(\delta)^\top \Upsilon_n^{-1} g_n(\delta), \]
where the L × 1 sample moment vector is:
\[ g_n(\delta) = \frac{1}{n}H^\top\varepsilon = \frac{1}{n}H^\top\left(y - Z\delta\right). \]
The matrix \Upsilon_n^{-1} is the optimal weighting matrix, which corresponds to the inverse of the covariance matrix of the sample moments:
\[ \Upsilon_n = \frac{1}{n}\hat\sigma^2 H^\top H. \]
Then the function to minimize is:
\[ Q = \frac{1}{n\hat\sigma^2}\left(H^\top y - H^\top Z\delta\right)^\top\left(H^\top H\right)^{-1}\left(H^\top y - H^\top Z\delta\right). \]
Obtaining the first-order conditions and solving for \delta, we obtain:
\[ \hat\delta_{GMM} = \left(Z^\top P_H Z\right)^{-1}Z^\top P_H y. \tag{5.19} \]
Sometimes the model includes additional endogenous variables. A well-known example is the hedonic house price model of Anselin and Lozano-Gracia (2008):
\[ y_i = x_i^\top\beta + \gamma_1\,pol1_i + \gamma_2\,pol2_i + \varepsilon_i, \]
where y_i is the house price, x_i is a vector of controls, pol1_i and pol2_i are the air quality variables, and \varepsilon_i is the error term. Since actual pollution is not observed at location i of the house transaction, it is replaced by a spatially interpolated value, such as a kriging prediction. This interpolated value measures the true pollution with error, causing simultaneous equation bias, so proper instruments are needed for these variables. The authors instrument these endogenous variables using latitude, longitude, and their product.
In particular, we can write the general model with additional endogenous variables as:
\[ y = \rho W y + X\beta + Y\gamma + \varepsilon, \]
where Y collects the endogenous explanatory variables. In a spatial lag model, an additional question is whether these instruments (for the endogenous explanatory variables) should also be included in spatially lagged form, similar to what is done for the exogenous variables. As before, the rationale for this comes from the structure of the reduced form, which in this case is:
\[ E\left[W y|Z\right] = W\left(I - \rho W\right)^{-1}X\beta + W\left(I - \rho W\right)^{-1}Y\gamma, \]
where Z = [X, Y]. The problem here is that the Y are endogenous, and thus they do not belong on the right-hand side of the reduced form! If they are replaced by their instruments, the presence of the term W(I − ρW)⁻¹ would suggest the need to include spatial lags as well. In other words, since the system determining y and Y is not completely specified, the optimal instruments are not known (Bivand and Piras, 2015).
We now sketch the consistency argument. First, note that H_n(H_n^\top H_n)^{-1}H_n^\top is symmetric and idempotent, so that \hat Z_n^\top Z_n = \hat Z_n^\top \hat Z_n. As usual, we first write the estimator in terms of the population error term:
\[ \begin{aligned} \hat\delta_n &= \delta_0 + \left(\hat Z_n^\top \hat Z_n\right)^{-1}\hat Z_n^\top\varepsilon_n \\ &= \delta_0 + \left[Z_n^\top H_n\left(H_n^\top H_n\right)^{-1}H_n^\top Z_n\right]^{-1}Z_n^\top H_n\left(H_n^\top H_n\right)^{-1}H_n^\top\varepsilon_n, \end{aligned} \tag{5.20} \]
where we used Assumption 5.8 (Rank Instruments). Solving for \hat\delta_n - \delta_0, the key term is n^{-1}H_n^\top\varepsilon_n, for which:
\[ E\left(\frac{1}{n}H_n^\top\varepsilon_n\right) = 0, \qquad \operatorname{Var}\left(\frac{1}{n}H_n^\top\varepsilon_n\right) = \frac{1}{n^2}H_n^\top\Sigma_n H_n. \]
Since \operatorname{Var}\left(n^{-1}H_n^\top\varepsilon_n\right) \to 0, by Chebyshev's inequality (Theorem 5.10) n^{-1}H_n^\top\varepsilon_n \xrightarrow{p} 0, and hence \hat\delta_n \xrightarrow{p} \delta_0.
Theorem 5.11 — Spatial 2SLS Estimator for SLM. Suppose that Assumptions 5.3 to 5.9 hold. Then the S2SLS estimator defined as
\[ \hat\delta_n = \left(\hat Z_n^\top \hat Z_n\right)^{-1}\hat Z_n^\top y_n \tag{5.25} \]
is consistent and asymptotically normal.
Theorem 5.11 gives us a very general asymptotic distribution for the S2SLS estimator. The estimator of Σ will be based on HAC estimators. However, under certain conditions the asymptotic variance-covariance matrix of the estimator can be simplified. For example, under homoskedasticity the asymptotic variance-covariance matrix reduces to:
\[ \operatorname{Var}\left(\hat\delta_{2SLS}\right) = \sigma^2\left(Q_{HZ}^\top Q_{HH}^{-1}Q_{HZ}\right)^{-1}. \tag{5.29} \]
A good estimator of the asymptotic variance is:
\[ \widehat{\operatorname{Var}}\left(\hat\delta_{2SLS}\right) = \hat\sigma^2\left[Z^\top H\left(H^\top H\right)^{-1}H^\top Z\right]^{-1}, \tag{5.30} \]
where:
\[ \hat\sigma^2 = \frac{\hat\varepsilon^\top\hat\varepsilon}{n}, \qquad \hat\varepsilon = y - \hat y. \tag{5.31} \]
To illustrate, consider again the SLM
\[ y = \rho W y + X\beta + \varepsilon, \]
where y is our crime variable and X contains a vector of ones and the variables INC and HOVAL. We estimate this model again by the ML procedure and then compare it with the S2SLS procedure. In R there exist two functions to compute the S2SLS estimates: stsls from spdep, and stslshac from the sphet package (Piras, 2010). The latter also allows estimating S2SLS under heteroskedasticity using HAC standard errors.
We first load the required packages and dataset:
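A minimal sketch, assuming the classic Columbus data and GAL neighbours file distributed with spData (lagsarlm and friends now live in spatialreg, which spdep re-exports):
# Load packages and the Columbus crime data
library(spdep)
library(spatialreg)
library(sphet)

data(columbus, package = "spData")
col.gal.nb <- read.gal(system.file("weights/columbus.gal", package = "spData"))
listw <- nb2listw(col.gal.nb, style = "W")   # row-standardized weights list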
Now we estimate the SLM by ML, using Ord's eigen approximation of the determinant, and by S2SLS with homoskedastic and robust standard errors.
# Estimate models
slm <- lagsarlm(CRIME ~ INC + HOVAL,
data = columbus,
listw,
method = "eigen")
s2sls <- stsls(CRIME ~ HOVAL + INC,
data = columbus,
listw = listw,
robust = FALSE,
W2X = TRUE)
s2sls_rob <- stsls(CRIME ~ HOVAL + INC,
data = columbus,
listw = listw,
robust = TRUE,
W2X = TRUE)
s2sls_pir <- stslshac(CRIME ~ INC + HOVAL,
data = columbus,
listw = listw,
HAC = FALSE)
The stsls function fits the SLM by S2SLS, with the option of adjusting the results for heteroskedasticity. Note that the arguments are similar to those of lagsarlm. The robust argument of stsls defaults to FALSE; if TRUE, the function applies a heteroskedasticity correction to the coefficient covariances. Note that the third model, s2sls_rob, uses this option. The argument W2X controls the number of instruments: when W2X = FALSE only W X is used as instruments, whereas when W2X = TRUE both W X and W²X are used as instruments for W y. The function stslshac from sphet with the argument HAC = FALSE computes the S2SLS estimates with homoskedastic standard errors, without adjusting for heteroskedasticity.
Some caution should be exercised regarding the standard errors. When the argument robust = FALSE is used, the variance-covariance matrix is computed as:
\[ \widehat{\operatorname{Var}}\left(\hat\delta_{2SLS}\right) = \hat\sigma^2\left(\hat Z^\top \hat Z\right)^{-1}, \]
where:
\[ \hat\sigma^2 = \frac{\hat\varepsilon^\top\hat\varepsilon}{n - K}, \qquad \hat\varepsilon = y - \hat y. \]
Note that the error variance is calculated with a degrees-of-freedom correction (i.e., dividing by n − K). When robust = TRUE, the variance-covariance matrix is computed as we have previously stated, that is:
\[ \widehat{\operatorname{Var}}\left(\hat\delta_{2SLS}\right) = \hat\sigma^2\left[Z^\top H\left(H^\top H\right)^{-1}H^\top Z\right]^{-1}. \]
5.3 Generalized Moment Estimation of SEM Model
As Kelejian and Prucha (1999) state, the generalized moments estimator they suggest is computationally simple irrespective of the sample size, which makes it very attractive for very large spatial databases. Since IV/GMM estimators ignore the Jacobian term, many of the problems related to matrix inversion, the computation of characteristic roots, and/or Cholesky decompositions can be avoided. Another motivation was that, at the time, no formal results were available regarding the consistency and asymptotic normality of the ML estimator (Prucha, 2014, p. 1608). Recall that Lee formally derived the asymptotic properties of the ML estimator for the SLM only in 2004.3
Recall that the SEM model is given by:
\[ \begin{aligned} y &= X\beta + u, \\ u &= \lambda M u + \varepsilon. \end{aligned} \tag{5.32} \]
In brief, Kelejian and Prucha (1999) suggest using nonlinear least squares to obtain a consistent generalized moments estimator for λ, which can then be used to obtain consistent estimates of β via an FGLS approach. The main difference between the Generalized Moments (GM) estimation discussed here and the Generalized Method of Moments (GMM) estimation discussed later is that in the former there is no inference on the spatial autoregressive coefficient. In other words, λ is viewed purely as a nuisance parameter whose only function is to aid in obtaining consistent estimates of β.
R The GM procedure proposed by Kelejian and Prucha (1999) was originally motivated by the computational difficulties of ML.
R Kelejian and Prucha (1999) do not provide an asymptotic variance for λ. Thus, some software provides the estimate λ̂ but not its standard error.
One advantage of the GM estimator (and of QML) is that it does not rely on the assumption of normality of the disturbances ε. Nonetheless, both estimators assume that the εi are independently and identically distributed for all i with zero mean and variance σ². To begin, we state the same assumption about the error terms as in Kelejian and Prucha (1999).
Assumption 5.12 — Homoskedastic Errors (Kelejian and Prucha, 1999). The innovations {εi,n, 1 ≤ i ≤ n, n ≥ 1} are independently and identically distributed for all n with zero mean and variance σ², where 0 < σ² < b with b < ∞. Additionally, the innovations are assumed to possess finite fourth moments.
Assumption 5.13 — Weight Matrix Mn (Kelejian and Prucha, 1999). Assume the following:
Given Equation (5.32) and Assumption 5.13 (Weight Matrix Mn), we can write u = (I − λM)⁻¹ε. Therefore, the expectation and variance of u are E(u) = 0 and E(uu>) = Ω(λ), respectively, where:
\[ E\left(uu^\top\right) = \Omega = \sigma^2\left[\left(I_n - \lambda M\right)^\top\left(I_n - \lambda M\right)\right]^{-1}, \tag{5.34} \]
and the corresponding generalized least squares (GLS) estimator of β (assuming we know λ0) is:
\[ \hat\beta_{GLS} = \left(X^\top\Omega^{-1}X\right)^{-1}X^\top\Omega^{-1}y. \]
From Equation (5.34) it can be observed that Ω contains a matrix inverse, so its inverse Ω⁻¹ is simply the product of the two spatial filters scaled by 1/σ². Thus, the expression for the GLS estimator simplifies to:
\[ \begin{aligned} \hat\beta_{GLS} &= \left[X^\top\frac{1}{\sigma^2}\left(I_n - \lambda M\right)^\top\left(I_n - \lambda M\right)X\right]^{-1}X^\top\frac{1}{\sigma^2}\left(I_n - \lambda M\right)^\top\left(I_n - \lambda M\right)y \\ &= \left[X^\top\left(I_n - \lambda M\right)^\top\left(I_n - \lambda M\right)X\right]^{-1}X^\top\left(I_n - \lambda M\right)^\top\left(I_n - \lambda M\right)y. \end{aligned} \]
The FGLS estimator substitutes a consistent estimate of λ into this expression:
\[ \hat\beta_{FGLS} = \left[X^\top\left(I_n - \hat\lambda M\right)^\top\left(I_n - \hat\lambda M\right)X\right]^{-1}X^\top\left(I_n - \hat\lambda M\right)^\top\left(I_n - \hat\lambda M\right)y. \]
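Given a consistent λ̂, the FGLS estimator amounts to spatially filtering y and X and running OLS on the filtered variables. A minimal sketch, assuming y, X, M, and lambda_hat already exist in the workspace:
# FGLS for the SEM given a consistent estimate lambda_hat (assumed available)
fgls_sem <- function(y, X, M, lambda_hat) {
  A  <- diag(nrow(X)) - lambda_hat * M      # spatial filter (I - lambda M)
  ys <- A %*% y                             # filtered dependent variable
  Xs <- A %*% X                             # filtered regressors
  solve(crossprod(Xs), crossprod(Xs, ys))   # OLS on the filtered variables
}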
Solving the second equation of (5.32) for the innovations gives:
\[ \varepsilon = u - \lambda M u, \]
where ε is the idiosyncratic error and u is the regression error. The GM estimation approach employs the following simple quadratic moment conditions:⁴
\[ E\left(n^{-1}\varepsilon^\top\varepsilon\right) = \sigma^2, \qquad E\left(n^{-1}\varepsilon^\top M^\top M\varepsilon\right) = \frac{1}{n}E\left[\operatorname{tr}\left(M^\top M\varepsilon\varepsilon^\top\right)\right] = \frac{\sigma^2}{n}\operatorname{tr}\left(M^\top M\right), \qquad E\left(n^{-1}\varepsilon^\top M\varepsilon\right) = 0. \]
The GM estimator of λ in Kelejian and Prucha (1999) is based on these three moments, where we use the fact that tr(X>AX) = tr(AXX>). The value of E(n⁻¹ε>M>Mε) depends on the assumption made about the variance of ε; under homoskedasticity:
\[ E\left(n^{-1}\varepsilon^\top M^\top M\varepsilon\right) = \frac{\sigma^2}{n}\operatorname{tr}\left(M^\top M\right). \]
Definition 5.3.1 — Moment Conditions. Under homoskedasticity (Kelejian and Prucha, 1999) the moment conditions are:
\[ E\left(n^{-1}\varepsilon^\top\varepsilon\right) = \sigma^2, \qquad E\left(n^{-1}\varepsilon^\top M^\top M\varepsilon\right) = \frac{\sigma^2}{n}\operatorname{tr}\left(M^\top M\right), \qquad E\left(n^{-1}\varepsilon^\top M\varepsilon\right) = 0. \]
Under heteroskedasticity (Kelejian and Prucha, 2010) the moment conditions are:
\[ E\left(n^{-1}\varepsilon^\top\varepsilon\right) = \sigma^2, \qquad E\left(n^{-1}\varepsilon^\top M\varepsilon\right) = 0. \]
In order to operationalize the moment conditions, we need to convert conditions on ε into conditions on u (since ε is not observed). Since u = λMu + ε, it follows that ε = u − λMu, i.e., ε is the spatially filtered regression error. Then:
\[ \begin{aligned} \varepsilon^\top\varepsilon &= \left(u - \lambda Mu\right)^\top\left(u - \lambda Mu\right) = u^\top u - 2\lambda u^\top Mu + \lambda^2 u^\top M^\top Mu, \\ \varepsilon^\top M^\top M\varepsilon &= \left(u - \lambda Mu\right)^\top M^\top M\left(u - \lambda Mu\right). \end{aligned} \tag{5.35} \]
Let uL = M u and uLL = M M u.⁵ Taking expectations over (5.35) and assuming homoskedasticity (Assumption 5.12), the condition E(n⁻¹ε>ε) = σ² gives:
\[ \sigma^2 = \frac{1}{n}E\left(u^\top u\right) - \frac{2\lambda}{n}E\left(u^\top u_L\right) + \frac{\lambda^2}{n}E\left(u_L^\top u_L\right), \]
which can be rearranged as:
\[ 0 = \frac{2\lambda}{n}E\left(u^\top u_L\right) - \frac{\lambda^2}{n}E\left(u_L^\top u_L\right) + \sigma^2 - \frac{1}{n}E\left(u^\top u\right). \tag{5.38} \]
In similar fashion, the second and third moment conditions yield:
\[ 0 = \frac{2\lambda}{n}E\left(u_{LL}^\top u_L\right) - \frac{\lambda^2}{n}E\left(u_{LL}^\top u_{LL}\right) + \frac{\sigma^2}{n}\operatorname{tr}\left(M^\top M\right) - \frac{1}{n}E\left(u_L^\top u_L\right), \tag{5.39} \]
\[ 0 = \frac{\lambda}{n}E\left(u^\top u_{LL} + u_L^\top u_L\right) - \frac{\lambda^2}{n}E\left(u_L^\top u_{LL}\right) - \frac{1}{n}E\left(u^\top u_L\right). \tag{5.40} \]
5 Spatially lagged variables are denoted by bar superscripts in the articles. Instead, we will use the L subscript throughout.
That is, a first order spatial lag of y, W y, is denoted by yL . Higher order spatial lags are symbolized by adding additional L
subscripts.
At this point it is important to realize that we have three equations in three unknowns: λ, λ², and σ². Consider the following three-equation system implied by Equations (5.38), (5.39), and (5.40):
\[ \Gamma_n\alpha = \gamma_n, \tag{5.41} \]
where Γn is given in Equation (5.42) and α = (λ, λ², σ²)>.⁶ If Γn were known, Assumption 5.16 (Identification) would imply that Equation (5.41) determines α as:
\[ \alpha = \Gamma_n^{-1}\gamma_n, \]
where:
\[ \Gamma_n = \begin{bmatrix} \frac{2}{n}E\left(u^\top u_L\right) & -\frac{1}{n}E\left(u_L^\top u_L\right) & 1 \\ \frac{2}{n}E\left(u_{LL}^\top u_L\right) & -\frac{1}{n}E\left(u_{LL}^\top u_{LL}\right) & \frac{1}{n}\operatorname{tr}\left(M^\top M\right) \\ \frac{1}{n}E\left(u^\top u_{LL} + u_L^\top u_L\right) & -\frac{1}{n}E\left(u_L^\top u_{LL}\right) & 0 \end{bmatrix} \tag{5.42} \]
and
\[ \gamma_n = \begin{bmatrix} \frac{1}{n}E\left(u^\top u\right) \\ \frac{1}{n}E\left(u_L^\top u_L\right) \\ \frac{1}{n}E\left(u^\top u_L\right) \end{bmatrix}. \tag{5.43} \]
Now we express the moment conditions γn = Γnα as sample averages of observable spatial lags of OLS residuals:
\[ g_n = G_n\alpha + \upsilon_n\left(\lambda, \sigma^2\right), \tag{5.44} \]
where
\[ G_n = \begin{bmatrix} \frac{2}{n}\hat u^\top\hat u_L & -\frac{1}{n}\hat u_L^\top\hat u_L & 1 \\ \frac{2}{n}\hat u_{LL}^\top\hat u_L & -\frac{1}{n}\hat u_{LL}^\top\hat u_{LL} & \frac{1}{n}\operatorname{tr}\left(M^\top M\right) \\ \frac{1}{n}\hat u^\top\hat u_{LL} + \frac{1}{n}\hat u_L^\top\hat u_L & -\frac{1}{n}\hat u_L^\top\hat u_{LL} & 0 \end{bmatrix} \]
and
\[ g_n = \begin{bmatrix} \frac{1}{n}\hat u^\top\hat u \\ \frac{1}{n}\hat u_L^\top\hat u_L \\ \frac{1}{n}\hat u^\top\hat u_L \end{bmatrix}, \]
where Gn is a 3 × 3 matrix and υn(λ, σ²) can be viewed as a vector of residuals. This can be thought of as an OLS regression, where (Kelejian and Prucha, 1998):
\[ \tilde\alpha_n = G_n^{-1}g_n. \tag{5.45} \]
However, the estimator in (5.45) is based on an overparameterization, in the sense that it does not use the information that the second element of α, λ², is the square of the first. Given this, Kelejian and Prucha (1998, 1999) define the GM estimators of λ and σ² as the nonlinear least squares estimators corresponding to Equation (5.44):⁷
\[ \left(\hat\lambda_{NLS,n}, \hat\sigma^2_{NLS,n}\right) = \underset{\lambda,\sigma^2}{\operatorname{argmin}}\left\{\upsilon_n\left(\lambda, \sigma^2\right)^\top\upsilon_n\left(\lambda, \sigma^2\right) : \lambda \in [-a, a],\ \sigma^2 \in [0, b]\right\}. \tag{5.46} \]
Note that \left(\hat\lambda_{NLS,n}, \hat\sigma^2_{NLS,n}\right) are defined as the minimizers of
\[ \left[g_n - G_n\begin{pmatrix}\lambda \\ \lambda^2 \\ \sigma^2\end{pmatrix}\right]^\top\left[g_n - G_n\begin{pmatrix}\lambda \\ \lambda^2 \\ \sigma^2\end{pmatrix}\right]. \]
⁶ Note that we are treating λ² as if it were a new parameter.
⁷ They state that this estimator is more efficient than the OLS estimator; however, both estimators are consistent. See Theorem 2 in Kelejian and Prucha (1998).
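A minimal sketch of the GM machinery: given OLS residuals u_hat and a weights matrix M (both assumed to already exist), the function below builds Gn and gn and minimizes the NLS criterion (5.46) with optim. The parameter bounds are illustrative.
# GM estimator of (lambda, sigma^2) from residuals (sketch).
gm_sem <- function(u_hat, M) {
  n   <- length(u_hat)
  uL  <- drop(M %*% u_hat)                  # first spatial lag of residuals
  uLL <- drop(M %*% uL)                     # second spatial lag
  G <- rbind(c(2 * sum(u_hat * uL) / n,  -sum(uL^2) / n,       1),
             c(2 * sum(uLL * uL) / n,    -sum(uLL^2) / n,      sum(M * M) / n),
             c((sum(u_hat * uLL) + sum(uL^2)) / n, -sum(uL * uLL) / n, 0))
  g <- c(sum(u_hat^2) / n, sum(uL^2) / n, sum(u_hat * uL) / n)
  obj <- function(p) {                      # p = (lambda, sigma^2)
    v <- g - G %*% c(p[1], p[1]^2, p[2])    # residuals of the moment system
    sum(v^2)
  }
  optim(c(0, var(u_hat)), obj, method = "L-BFGS-B",
        lower = c(-0.99, 1e-8), upper = c(0.99, Inf))$par
}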
Assumption 5.14 — Bounded Matrices (Kelejian and Prucha, 1999). The row and column sums of the matrices Mn and (I − λMn)⁻¹ are bounded uniformly in absolute value.
Assumption 5.16 — Identification (Kelejian and Prucha, 1999). Let Γn be the matrix in Equation (5.42). The smallest eigenvalue of Γn>Γn is bounded away from zero; that is, ωmin(Γn>Γn) ≥ ω∗ > 0, where ω∗ may depend on λ and σ².
Theorem 5.17 — Consistency of the GM Estimator (Kelejian and Prucha, 1999). Let
\[ \left(\hat\lambda_{NLS,n}, \hat\sigma^2_{NLS,n}\right) = \underset{\lambda,\sigma^2}{\operatorname{argmin}}\left\{\upsilon_n\left(\lambda, \sigma^2\right)^\top\upsilon_n\left(\lambda, \sigma^2\right) : \lambda \in [-a, a],\ \sigma^2 \in [0, b]\right\}. \]
Then, given Assumptions 5.12 (Homoskedastic Errors), 5.13 (Weight Matrix Mn), 5.14 (Bounded Matrices), 5.15 (Residuals), and 5.16 (Identification),
\[ \left(\hat\lambda_{NLS,n}, \hat\sigma^2_{NLS,n}\right) \xrightarrow{p} \left(\lambda, \sigma^2\right) \quad \text{as } n \to \infty. \tag{5.48} \]
An important remark is that Theorem 5.17 only states that the NLS estimates are consistent; it does not tell us the asymptotic distribution of λ̂NLS,n.
The following theorem is very useful for deriving the asymptotic results:
Theorem 5.18 — Consistency of quadratic forms in spatial models. Let S be an n × n nonstochastic matrix whose row and column sums are uniformly bounded in absolute value. Let v> = (v1, ..., vn), where the vi are iid(0, σ²) with E(vi⁴) < ∞. Then:
\[ \frac{v^\top Sv}{n} - E\left(\frac{v^\top Sv}{n}\right) \xrightarrow{p} 0, \qquad E\left(\frac{v^\top Sv}{n}\right) = \sigma^2\frac{\operatorname{tr}(S)}{n}. \]
If the limit of tr(S)/n exists, then:
\[ \lim_{n\to\infty}\frac{\operatorname{tr}(S)}{n} = S^*, \qquad \frac{v^\top Sv}{n} \xrightarrow{p} \sigma^2 S^*. \]
Sketch of proof for the GM estimator of λ̂. The proof is based on Kelejian and Piras (2017) and consists of two steps. First, we prove consistency of λ̂ for the OLS estimate of α (which is simpler), assuming that the vector u is observed. We then show that u can be replaced by û in the GM estimator of λ. For a more general proof see Kelejian and Prucha (1998, 1999).
1. Assuming that u is observed. Recall that in Equation (5.44) the sample moments are based on the estimated û. If u were observed, we would instead use the following sample moments:
\[ g_n^* = G_n^*\alpha, \]
where
\[ G_n^* = \begin{bmatrix} \frac{2}{n}u^\top u_L & -\frac{1}{n}u_L^\top u_L & 1 \\ \frac{2}{n}u_{LL}^\top u_L & -\frac{1}{n}u_{LL}^\top u_{LL} & \frac{1}{n}\operatorname{tr}\left(M^\top M\right) \\ \frac{1}{n}u^\top u_{LL} + \frac{1}{n}u_L^\top u_L & -\frac{1}{n}u_L^\top u_{LL} & 0 \end{bmatrix} \]
and
\[ g_n^* = \begin{bmatrix} \frac{1}{n}u^\top u \\ \frac{1}{n}u_L^\top u_L \\ \frac{1}{n}u^\top u_L \end{bmatrix}. \]
Recall that:
\[ u = \left(I_n - \lambda M\right)^{-1}\varepsilon, \qquad u_L = M\left(I_n - \lambda M\right)^{-1}\varepsilon, \qquad u_{LL} = MM\left(I_n - \lambda M\right)^{-1}\varepsilon, \]
so the first and second columns of G_n^* are quadratic forms in ε. Then, using Theorem 5.18, we can state that:
\[ G_n^* \xrightarrow{p} \Gamma_n. \]
Also:
\[ \tilde\alpha = \left(G_n^*\right)^{-1}g_n^*, \]
since G_n^* is a nonsingular 3 × 3 matrix. Thus, using our previous results:
\[ \operatorname{plim}\tilde\alpha = \operatorname{plim}\left(G_n^*\right)^{-1}\operatorname{plim}g_n^* = \Gamma_n^{-1}\gamma_n = \alpha. \tag{5.49} \]
2. Replacing u by û. Now suppose u is estimated by OLS residuals, with
\[ \hat\beta = \beta_0 + \Delta_n, \qquad \Delta_n \xrightarrow{p} 0. \]
Then:
\[ \hat u = y - X\hat\beta = y - X\left(\beta_0 + \Delta_n\right) = y - X\beta_0 - X\Delta_n = u - X\Delta_n. \]
Note that, with the exception of the constants in the third column of G_n^*, every element of G_n^* and g_n^* can be expressed as a quadratic form ε>Sε/n, where S is an n × n matrix whose row and column sums are uniformly bounded in absolute value given Assumption 5.14. For example:
\[ \frac{1}{n}u^\top u_L = \frac{1}{n}\varepsilon^\top\left(I_n - \lambda M\right)^{-\top}M\left(I_n - \lambda M\right)^{-1}\varepsilon = \frac{1}{n}\varepsilon^\top S\varepsilon. \]
Then:
\[ \frac{\hat u^\top S\hat u}{n} = \frac{\left(u - X\Delta_n\right)^\top S\left(u - X\Delta_n\right)}{n} = \frac{u^\top Su}{n} - \frac{2\Delta_n^\top X^\top Su}{n} + \frac{\Delta_n^\top X^\top SX\Delta_n}{n}. \]
We need to show that (this would be part of your homework):
\[ \frac{2\Delta_n^\top X^\top Su}{n} \xrightarrow{p} 0, \qquad \frac{\Delta_n^\top X^\top SX\Delta_n}{n} \xrightarrow{p} 0, \]
so that we can say that:
\[ \frac{\hat u^\top S\hat u}{n} \xrightarrow{p} \frac{1}{n}u^\top Su, \]
and finally:
\[ g_n \xrightarrow{p} g_n^* \xrightarrow{p} \gamma_n, \qquad G_n \xrightarrow{p} G_n^* \xrightarrow{p} \Gamma_n. \]
Since we now have a consistent estimate of λ, we can estimate β using the FGLS estimator:
\[ \hat\beta_{FGLS}(\hat\lambda) = \left[X^\top\Omega(\hat\lambda)^{-1}X\right]^{-1}X^\top\Omega(\hat\lambda)^{-1}y, \tag{5.51} \]
where Ω(λ) = (I − λM)⁻¹(I − λM>)⁻¹.
Assumption 5.19 — Limiting Behavior. The elements of X are non-stochastic and bounded in absolute value by cX, 0 < cX < ∞. Also, X has full rank, and the matrix QX = limn→∞ n⁻¹X>X is finite and nonsingular. Furthermore, the matrix QX(λ) = limn→∞ n⁻¹X>Ω(λ)⁻¹X is finite and nonsingular for all |λ| < 1.
The following theorem gives the asymptotic distribution of the FGLS estimator:
Theorem 5.20 — Asymptotic Properties of the FGLS Estimator. If Assumptions 5.12 (Homoskedastic Errors), 5.13 (Weight Matrix Mn), 5.14 (Bounded Matrices), and 5.19 (Limiting Behavior) hold:
1. \[ \sqrt{n}\left(\hat\beta_{GLS} - \beta\right) \xrightarrow{d} N\left(0, \sigma^2 Q_X(\lambda)^{-1}\right). \tag{5.52} \]
2. Let λ̂n be a consistent estimator of λ. Then the true GLS estimator β̂GLS and the feasible GLS estimator β̂FGLS have the same asymptotic distribution.
3. Suppose further that σ̂n² is a consistent estimator of σ². Then σ̂n²[n⁻¹X>Ω(λ̂n)⁻¹X]⁻¹ is a consistent estimator of the asymptotic variance-covariance matrix.
Note that Theorem 5.20 assumes the existence of consistent estimators of λ and σ². It can be shown that the OLS estimator:
\[ \hat\beta_n = \left(X^\top X\right)^{-1}X^\top y \]
is √n-consistent. Thus, the OLS residuals ũi = yi − xi>β̂n satisfy Assumption 5.15 with di,n = |xi| and ∆n = β̂n − β, so OLS residuals can be used to obtain consistent estimators of λ and σ².
Then the feasible GLS estimator is given by:
\[ \hat\beta_{FGLS} = \left[X(\tilde\lambda)^\top X(\tilde\lambda)\right]^{-1}X(\tilde\lambda)^\top y(\tilde\lambda), \]
where:
\[ X(\tilde\lambda) = \left(I - \tilde\lambda M\right)X, \qquad y(\tilde\lambda) = \left(I - \tilde\lambda M\right)y, \]
and:
\[ \hat\sigma^2 = \frac{\hat\varepsilon(\tilde\lambda)^\top\hat\varepsilon(\tilde\lambda)}{n}, \qquad \hat\varepsilon(\tilde\lambda) = y(\tilde\lambda) - X(\tilde\lambda)\hat\beta_{FGLS} = \left(I - \tilde\lambda M\right)\hat u, \qquad \hat u = y - X\hat\beta_{FGLS}. \]
Sketch of Proof of Theorem 5.20. We first prove part (a). Recall that the GLS and FGLS estimators are given by:
\[ \hat\beta_{GLS} = \left(X^\top\Omega(\lambda)^{-1}X\right)^{-1}X^\top\Omega(\lambda)^{-1}y, \qquad \hat\beta_{FGLS} = \left[X^\top\Omega(\hat\lambda)^{-1}X\right]^{-1}X^\top\Omega(\hat\lambda)^{-1}y. \]
Since y = Xβ + u = Xβ + (In − λM)⁻¹ε, the sampling error of β̂GLS is:
\[ \begin{aligned} \hat\beta - \beta &= \left(X^\top\Omega(\lambda)^{-1}X\right)^{-1}X^\top\Omega(\lambda)^{-1}u \\ &= \left(X^\top\Omega(\lambda)^{-1}X\right)^{-1}X^\top\left(I_n - \lambda M\right)^\top\left(I_n - \lambda M\right)\left(I_n - \lambda M\right)^{-1}\varepsilon \\ &= \left(X^\top\Omega(\lambda)^{-1}X\right)^{-1}X^\top\left(I_n - \lambda M\right)^\top\varepsilon, \end{aligned} \]
so that
\[ \sqrt{n}\left(\hat\beta - \beta\right) = \left(\frac{1}{n}X^\top\Omega(\lambda)^{-1}X\right)^{-1}\frac{1}{\sqrt n}A^\top\varepsilon, \]
where A = (In − λM)X. By Assumption 5.19 (Limiting Behavior):
\[ \frac{1}{n}X^\top\Omega(\lambda)^{-1}X \to Q_X(\lambda), \]
and since Q_X(\lambda) is nonsingular:
\[ \left(\frac{1}{n}X^\top\Omega(\lambda)^{-1}X\right)^{-1} \to Q_X(\lambda)^{-1}. \]
Since the elements of A are bounded in absolute value, by Theorem 5.1 it follows that:
\[ \frac{1}{\sqrt n}A^\top\varepsilon \xrightarrow{d} N\left(0, \lim_{n\to\infty}n^{-1}\sigma^2 A^\top A\right), \tag{5.53} \]
where limn→∞ n⁻¹σ²A>A = σ² limn→∞ n⁻¹X>(In − λM)>(In − λM)X = σ²QX(λ). Consequently:
\[ \sqrt n\left(\hat\beta - \beta\right) = \underbrace{\left(\frac{1}{n}X^\top\Omega(\lambda)^{-1}X\right)^{-1}}_{\to Q_X(\lambda)^{-1}}\underbrace{\frac{1}{\sqrt n}A^\top\varepsilon}_{\xrightarrow{d} N(0,\,\sigma^2 Q_X(\lambda))} \xrightarrow{d} N\left(0, Q_X(\lambda)^{-1}\sigma^2 Q_X(\lambda)Q_X(\lambda)^{-1}\right) = N\left(0, \sigma^2 Q_X(\lambda)^{-1}\right). \]
This also implies that β̂GLS is consistent. To show part (b), we can show that:
\[ \sqrt n\left(\hat\beta_{GLS} - \hat\beta_{FGLS}\right) \xrightarrow{p} 0. \]
Following Kelejian and Prucha (1999), it suffices to show that:
\[ \frac{1}{n}X^\top\left[\Omega(\hat\lambda_n)^{-1} - \Omega(\lambda)^{-1}\right]X \xrightarrow{p} 0 \tag{5.54} \]
and
\[ \frac{1}{\sqrt n}X^\top\left[\Omega(\hat\lambda_n)^{-1} - \Omega(\lambda)^{-1}\right]u \xrightarrow{p} 0. \]
Note that:
\[ \Omega(\hat\lambda_n)^{-1} - \Omega(\lambda)^{-1} = \left(\lambda - \hat\lambda_n\right)\left(M + M^\top\right) + \left(\hat\lambda_n^2 - \lambda^2\right)M^\top M, \]
so that
\[ \frac{1}{n}X^\top\left[\Omega(\hat\lambda_n)^{-1} - \Omega(\lambda)^{-1}\right]X = \underbrace{\left(\lambda - \hat\lambda_n\right)}_{o_p(1)}\underbrace{n^{-1}X^\top\left(M + M^\top\right)X}_{O(1)} + \underbrace{\left(\hat\lambda_n^2 - \lambda^2\right)}_{o_p(1)}\underbrace{n^{-1}X^\top M^\top MX}_{O(1)} \xrightarrow{p} 0, \]
where (λ − λ̂n) = op(1) since λ̂n is a consistent estimate of λ, and:
\[ \frac{1}{\sqrt n}X^\top\left[\Omega(\hat\lambda_n)^{-1} - \Omega(\lambda)^{-1}\right]u = \underbrace{\left(\lambda - \hat\lambda_n\right)}_{o_p(1)}\underbrace{n^{-1/2}X^\top\left(M + M^\top\right)u}_{O_p(1)} + \underbrace{\left(\hat\lambda_n^2 - \lambda^2\right)}_{o_p(1)}\underbrace{n^{-1/2}X^\top M^\top Mu}_{O_p(1)} = o_p(1). \tag{5.55} \]
The Op(1) claims follow since:
\[ E\left[n^{-1/2}X^\top\left(M + M^\top\right)u\right] = 0, \qquad \operatorname{Var}\left[n^{-1/2}X^\top\left(M + M^\top\right)u\right] = n^{-1}X^\top\left(M + M^\top\right)\Omega\left(M^\top + M\right)X = O(1), \]
where the variance is O(1) because the matrix in the middle is absolutely summable (O(n) overall, scaled by n⁻¹).
R Any random variable X with cdf F is Op(1) (White, 2014, p. 28).
A feasible GLS (FGLS) estimator can be obtained with the following steps:
Algorithm 5.21 — FGLS Algorithm for the SEM. The steps are the following:
1. Obtain a consistent estimate of β, say β̃, using either OLS or NLS.
2. Use this estimate to obtain an estimate of u, say û.
3. Use û to estimate λ, say λ̂, using (5.46).
4. Use λ̂ to construct Ω(λ̂)⁻¹ and compute the FGLS estimator (5.51).
5.3.4 FGLS in R
The GM estimation procedure is carried out by the GMerrorsar function from the spatialreg package. To show its functionality, we first load the required packages and dataset:
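As before, a minimal sketch assuming the Columbus data shipped with spData:
# Load packages and the Columbus crime data
library(spdep)
library(spatialreg)

data(columbus, package = "spData")
col.gal.nb <- read.gal(system.file("weights/columbus.gal", package = "spData"))
listw <- nb2listw(col.gal.nb, style = "W")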
Now we estimate the SEM model by ML, using Ord's eigen approximation of the determinant, and by the Kelejian and Prucha (1999) GM procedure:
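A sketch consistent with the Call shown in the output below; returnHcov = TRUE is required for the Hausman test:
# SEM by ML (Ord's eigen method) and by the GM procedure
sem_ml <- errorsarlm(CRIME ~ HOVAL + INC,
                     data = columbus, listw = listw, method = "eigen")
sem_mm <- GMerrorsar(CRIME ~ HOVAL + INC,
                     data = columbus, listw = listw, returnHcov = TRUE)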
A Hausman test comparing an OLS and SEM model can be obtained using
# Hausman test
summary(sem_mm, Hausman = TRUE)
##
## Call:GMerrorsar(formula = CRIME ~ HOVAL + INC, data = columbus, listw = listw,
## returnHcov = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.8212 -6.8764 -2.1781 9.5693 28.5779
##
## Type: GM SAR estimator
## Coefficients: (GM standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 63.487150 5.083612 12.4886 < 2.2e-16
## HOVAL -0.300365 0.096799 -3.1030 0.0019160
## INC -1.180414 0.341788 -3.4536 0.0005531
##
## Lambda: 0.3643 (standard error): 0.4318 (z-value): 0.84366
## Residual variance (sigma squared): 109.37, (sigma: 10.458)
The default model specification is shown above; the output follows the familiar R format. Note that even though the estimation procedure is GM, the output presents inference for λ. In this case, the inference is based on the analytical method described in http://econweb.umd.edu/~prucha/STATPROG/OLS/desols.pdf. The output also shows the Hausman test. Recall that this test can be used whenever there are two estimators, one of which is inefficient but consistent (OLS in this case, under the maintained hypothesis of the SEM), while the other is efficient (SEM in this case). The null hypothesis is that the SEM and OLS estimates are not significantly different (see LeSage and Pace, 2010, p. 62). We reject the null hypothesis; thus the SEM model is more appropriate. Table 5.2 compares the estimates.
Table 5.2: ML and GM estimates

              ML           GM
Constant      61.054***    63.487***
              (5.315)      (5.084)
INC           -0.995**     -1.180***
              (0.337)      (0.342)
HOVAL         -0.308***    -0.300**
              (0.093)      (0.097)
λ             0.521***     0.364
              (0.141)      (0.432)
N             49           49

Significance: *** ≡ p < 0.001; ** ≡ p < 0.01; * ≡ p < 0.05
5.4 Estimation of SAC Model: The Feasible Generalized Two Stage Least Squares Estimator Procedure
Consider now the SAC model:
\[ \begin{aligned} y &= X\beta + \rho Wy + u = Z\delta + u, \\ u &= \lambda Mu + \varepsilon, \end{aligned} \tag{5.56} \]
where Z = [X, W y], δ = (β>, ρ)>, y is the n × 1 vector of observations on the dependent variable, and X is the matrix of exogenous regressors.
R This model is generally referred to as the Spatial-ARAR(1, 1) model, to emphasize its autoregressive structure in both the dependent variable and the error term.
The SAC model can be estimated by ML (see Anselin, 1988). However, the estimation process requires the inversion of A = (In − ρW) and B = (In − λM), which can be very costly computationally in large samples. Furthermore, ML relies on the normality assumption for the error terms. One way of dealing with this issue is to incorporate the estimation ideas from the S2SLS and GM procedures previously presented. To see this, we can rewrite the first equation of model (5.56) by applying the following spatial Cochrane-Orcutt transformation:
\[ \begin{aligned} y &= Z\delta + \left(I - \lambda M\right)^{-1}\varepsilon \\ \left(I - \lambda M\right)y &= \left(I - \lambda M\right)Z\delta + \varepsilon \\ y_s(\lambda) &= Z_s(\lambda)\delta + \varepsilon, \end{aligned} \tag{5.57} \]
where the spatially filtered variables are given by:
\[ y_s(\lambda) = y - \lambda My = y - \lambda y_L = \left(I - \lambda M\right)y, \qquad Z_s(\lambda) = Z - \lambda MZ = Z - \lambda Z_L = \left(I - \lambda M\right)Z. \]
If we knew λ, we could apply an IV approach to the transformed model (5.57). For the discussion below, assume that we do. The ideal instruments in this case are:
\[ H = \left(X, WX, \ldots, W^lX, MX, MWX, \ldots, MW^lX\right), \]
which are used to project the endogenous variables:
\[ P_H Z = \left(X, P_H Wy\right), \qquad P_H MZ = \left(MX, P_H MWy\right). \]
Since we have the instruments H, and we have assumed that a consistent estimator λ̂ with λ̂ →p λ0 is available, we might apply a GMM-type procedure using the following moment conditions:
\[ m(\lambda_0, \delta_0) = E\left(\frac{1}{n}H^\top u\right) = 0. \]
Obviously, the corresponding GMM estimator is just the 2SLS estimator. Note that for the transformed model (5.57) the moment conditions would be:
\[ m(\lambda_0, \delta_0) = E\left(\frac{1}{\sqrt n}H^\top\varepsilon\right) = 0. \]
Now let λ̃ be some consistent estimator of λ0, obtained in a previous step; then the sample moment vector is:
\[ m^\delta(\tilde\lambda, \delta) = \frac{1}{\sqrt n}H^\top\underbrace{\left[y_s(\tilde\lambda) - Z_s(\tilde\lambda)\delta\right]}_{\tilde\varepsilon}, \]
where we explicitly state that the moments depend on δ (which will be estimated) and on a consistent estimate of λ. Under homoskedasticity, the variance-covariance matrix of the moment vector is proportional to n⁻¹H>H, so the weighting matrix is taken as:
\[ \Upsilon_n^{\delta\delta} = \left(\frac{1}{n}H^\top H\right)^{-1}. \]
Note that the GMM objective function is:
\[ \begin{aligned} J_n &= \frac{1}{\sqrt n}\left[H^\top\left(y_s(\tilde\lambda) - Z_s(\tilde\lambda)\delta\right)\right]^\top\left(\frac{1}{n}H^\top H\right)^{-1}\frac{1}{\sqrt n}H^\top\left(y_s(\tilde\lambda) - Z_s(\tilde\lambda)\delta\right) \\ &= \left[y_s(\tilde\lambda) - Z_s(\tilde\lambda)\delta\right]^\top H\left(H^\top H\right)^{-1}H^\top\left[y_s(\tilde\lambda) - Z_s(\tilde\lambda)\delta\right] \\ &= \left[y_s(\tilde\lambda) - Z_s(\tilde\lambda)\delta\right]^\top P_H\left[y_s(\tilde\lambda) - Z_s(\tilde\lambda)\delta\right]. \end{aligned} \]
Minimizing Jn with respect to δ yields:
\[ \hat\delta_{FGS2SLS} = \left(\hat Z_s^\top Z_s\right)^{-1}\hat Z_s^\top y_s, \]
where Ẑs = H(H>H)⁻¹H>Zs. This estimator has been called the feasible generalized spatial two-stage least squares (FGS2SLS) estimator (Kelejian and Prucha, 1998). However, this estimator is not fully efficient. The remaining question is how to obtain a consistent estimator λ̂. As you can probably guess, this consistent estimator is obtained in a previous step by GM.
Recall the homoskedastic moment conditions:
\[ E\left(n^{-1}\varepsilon^\top\varepsilon\right) = \sigma^2, \qquad E\left(n^{-1}\varepsilon^\top M^\top M\varepsilon\right) = \frac{\sigma^2}{n}\operatorname{tr}\left(M^\top M\right), \qquad E\left(n^{-1}\varepsilon^\top M\varepsilon\right) = 0. \]
Combining the first two conditions gives:
\[ E\left[\frac{1}{n}\varepsilon^\top\left(M^\top M - \frac{1}{n}\operatorname{tr}\left(M^\top M\right)I\right)\varepsilon\right] = 0, \quad\text{i.e.,}\quad E\left(\frac{1}{n}\varepsilon^\top A_1\varepsilon\right) = 0. \]
Generalizing this expression, we end up with two (instead of three) quadratic moment conditions:
\[ E\left(\frac{1}{n}\varepsilon^\top A_1\varepsilon\right) = 0, \qquad E\left(\frac{1}{n}\varepsilon^\top A_2\varepsilon\right) = 0, \tag{5.59} \]
with
\[ A_1 = M^\top M - \frac{1}{n}\operatorname{tr}\left(M^\top M\right)I, \qquad A_2 = M. \]
Note that A1 is symmetric with tr(A1) = 0 (you should be able to prove this), but its diagonal elements are nonzero (in the heteroskedastic case the diagonal must be zero!). In Drukker et al. (2013), an additional scaling factor is included:
\[ \nu = 1\Big/\left[1 + \left(\frac{1}{n}\operatorname{tr}\left(M^\top M\right)\right)^2\right]. \]
In this case the weighting matrices for the quadratic moments are:
\[ A_1 = \nu\left[M^\top M - \frac{1}{n}\operatorname{tr}\left(M^\top M\right)I\right], \qquad A_2 = M. \]
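A short R sketch of these weighting matrices, assuming an n × n weights matrix M is already in memory; the exact form of the scaling factor ν follows the expression above, which is itself a reconstruction:
# Quadratic-moment weighting matrices for the homoskedastic case
build_A <- function(M) {
  n    <- nrow(M)
  MtM  <- crossprod(M)                       # M'M
  trMM <- sum(diag(MtM))                     # tr(M'M)
  nu   <- 1 / (1 + (trMM / n)^2)             # scaling factor (assumed form)
  A1   <- nu * (MtM - (trMM / n) * diag(n))  # tr(A1) = 0 by construction
  list(A1 = A1, A2 = M)
}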
If the errors are heteroskedastic, then:
\[ A_1 = M^\top M - \operatorname{diag}\left(m_i^\top m_i\right), \qquad A_2 = M, \]
where mi is the ith column of the weights matrix M. Note that diag(mi>mi) consists of the sums of squared elements of the columns of M, so that A1 has zero diagonal elements. For a (possibly asymmetric) matrix Aq, the following symmetrization identity is useful:
\[ u^\top A_q u_L + u_L^\top A_q u = u_L^\top A_q^\top u + u_L^\top A_q u = u_L^\top\left(A_q + A_q^\top\right)u = 2u_L^\top A_q u \quad\text{(if } A_q \text{ is symmetric)}. \tag{5.63} \]
Here it is important to note that in some cases A2 = M might not be symmetric; however, we can always work with the symmetrized form using Equation (5.63). Substituting ε = (I − λM)u into the moment conditions (5.59) yields, for r = 1, 2:
\[ \frac{1}{n}E\left(u^\top A_r u\right) - \lambda\frac{1}{n}E\left[u^\top\left(A_r + A_r^\top\right)Mu\right] + \lambda^2\frac{1}{n}E\left(u^\top M^\top A_r Mu\right) = 0, \]
which can be stacked as:
\[ \gamma_n - \Gamma_n\alpha_n = 0, \tag{5.64} \]
with αn = (λ, λ²)>,
\[ \Gamma_n = \begin{bmatrix} \frac{1}{n}E\left[u^\top\left(A_1 + A_1^\top\right)Mu\right] & -\frac{1}{n}E\left(u^\top M^\top A_1 Mu\right) \\ \frac{1}{n}E\left[u^\top\left(A_2 + A_2^\top\right)Mu\right] & -\frac{1}{n}E\left(u^\top M^\top A_2 Mu\right) \end{bmatrix}, \qquad \gamma_n = \begin{bmatrix} \frac{1}{n}E\left(u^\top A_1 u\right) \\ \frac{1}{n}E\left(u^\top A_2 u\right) \end{bmatrix}, \]
where we used Equation (5.63) for the cross terms. Now we can express the sample moment conditions as in Section 5.3.2:
\[ \underset{2\times 1}{\tilde m} = \underset{2\times 1}{\tilde g} - \underset{2\times 2}{\tilde G}\begin{bmatrix}\lambda \\ \lambda^2\end{bmatrix} = 0, \]
with
\[ \tilde g_1 = \frac{1}{n}\tilde u^\top A_1\tilde u, \qquad \tilde g_2 = \frac{1}{n}\tilde u^\top A_2\tilde u = \frac{1}{n}\tilde u^\top\tilde u_L. \]
The G̃ matrix is given by:
\[ \tilde G_{11} = \frac{2}{n}\tilde u^\top M^\top A_1\tilde u, \tag{5.65} \]
\[ \tilde G_{12} = -\frac{1}{n}\tilde u^\top M^\top A_1 M\tilde u, \tag{5.66} \]
\[ \tilde G_{21} = \frac{1}{n}\tilde u^\top M^\top\left(A_2 + A_2^\top\right)\tilde u, \tag{5.67} \]
\[ \tilde G_{22} = -\frac{1}{n}\tilde u^\top M^\top A_2 M\tilde u. \tag{5.68} \]
R Kelejian and Prucha (1999) show consistency of the Method of Moments estimator of λ, but not asymptotic normality of the estimator.
The variance of the moment conditions will be useful later for the GMM procedure. For this we need some results on quadratic forms. It can be shown that:
\[ E\left(\varepsilon^\top A\varepsilon\right) = \operatorname{tr}\left(A\Sigma\right) + \mu^\top A\mu, \]
where μ and Σ are the expected value and variance-covariance matrix of ε, respectively. This result only depends on the existence of μ and Σ; it does not require normality of ε. For the moment, assume that A is symmetric and ε is normally distributed; then:
\[ \operatorname{Var}\left(\varepsilon^\top A\varepsilon\right) = 2\operatorname{tr}\left(A\Sigma A\Sigma\right) + 4\mu^\top A\Sigma A\mu. \]
If A is not symmetric, we can use the trick in Equation (5.63), which yields:
\[ \operatorname{Cov}\left(\varepsilon^\top A_1\varepsilon, \varepsilon^\top A_2\varepsilon\right) = 2\operatorname{tr}\left[\frac{A_1 + A_1^\top}{2}\Sigma\frac{A_2 + A_2^\top}{2}\Sigma\right] + 4\mu^\top\frac{A_1 + A_1^\top}{2}\Sigma\frac{A_2 + A_2^\top}{2}\mu. \tag{5.69} \]
Now, let Ψ be the 2 × 2 variance-covariance matrix of the moment conditions n⁻¹ε>A_rε, r = 1, 2. In the heteroskedastic case the unknown variances are estimated from the spatially filtered residuals:
\[ \hat\varepsilon_i^2 = \left(\tilde u_i - \tilde\lambda\tilde u_{Li}\right)^2 = \tilde u_{si}^2. \]
5.4.3 Assumptions
Now we state the assumptions for the SAC model under heteroskedasticity, following Arraiz et al. (2010). The assumptions regarding the spatial weight matrices are the following:
Assumption 5.22 — Spatial Weights Matrices (Arraiz et al., 2010). Assume the following:
Assumption 5.22(a) is a normalization rule: a region cannot be a neighbor of itself. Assumption 5.22(b) has to do with the parameter space; this assumption is discussed by Kelejian and Prucha (2010, Section 2.2). Assumption 5.22(c) ensures that y and u are uniquely defined. Thus, under Assumption 5.22 (Spatial Weights Matrices), the reduced form of the model is:
\[ y_n = \left(I_n - \rho W_n\right)^{-1}\left[X_n\beta + u_n\right], \qquad u_n = \left(I_n - \lambda M_n\right)^{-1}\varepsilon_n. \]
Assumption 5.23 — Heteroskedastic Errors (Arraiz et al., 2010). The error terms {εi,n : 1 ≤ i ≤ n, n ≥ 1} satisfy E(εi,n) = 0 and E(ε²i,n) = σ²i,n, with 0 < a ≤ σ²i,n ≤ ā < ∞ for some constants a and ā. Furthermore, for each n ≥ 1 the random variables ε1,n, ..., εn,n are totally independent.
Assumption 5.23 allows the innovations to be heteroskedastic with uniformly bounded variances. This assumption also allows the innovations to depend on the sample size n, i.e., to form a triangular array.
Assumption 5.24 — Bounded Spatial Weight Matrices (Arraiz et al., 2010). The row and column sums of the matrices Wn and Mn are bounded uniformly in absolute value by, respectively, one and some finite constant, and the row and column sums of the matrices (In − ρWn)⁻¹ and (In − λMn)⁻¹ are bounded uniformly in absolute value by some finite constant.
This is a technical assumption used in the large-sample derivation of the regression parameter estimator. It limits the extent of spatial autocorrelation between u and y, and ensures that the disturbance process and the process of the dependent variable exhibit a "fading" memory.
Note that:
\[ E\left[u_n\right] = E\left[\left(I_n - \lambda M_n\right)^{-1}\varepsilon_n\right] = \left(I_n - \lambda M_n\right)^{-1}E\left[\varepsilon_n\right] = 0 \quad\text{by Assumption 5.23 (Heteroskedastic Errors)}, \tag{5.70} \]
\[ E\left[u_n u_n^\top\right] = \left(I_n - \lambda M_n\right)^{-1}E\left[\varepsilon_n\varepsilon_n^\top\right]\left(I_n - \lambda M_n^\top\right)^{-1} = \left(I_n - \lambda M_n\right)^{-1}\Sigma\left(I_n - \lambda M_n^\top\right)^{-1}, \tag{5.71} \]
where Σ = diag(σ²i,n).
Assumption 5.25 — Regressors (Arraiz et al., 2010). The regressor matrices Xn have full column rank (for
n large enough). Furthermore, the elements of the matrices Xn are uniformly bounded in absolute value.
This assumption rules out multicollinearity problems, as well as unbounded exogenous variables.
Assumption 5.26 — Instruments I (Arraiz et al., 2010). The instruments matrices Hn have full column rank
L ≥ K + 1 (for all n large enough). Furthermore, the elements of the matrices Hn are uniformly bounded
in absolute value. Additionally, Hn is assumed to, at least, contain the linearly independent columns of
(Xn , Mn Xn )
There are several papers that discuss the use of optimal instruments for spatial models (see, for example, Lee, 2003; Das et al., 2003; Kelejian et al., 2004; Lee, 2007).
R The effect of the selection of instruments on the efficiency of the estimators remains to be further investigated.
Assumption 5.27 — Instruments II (Identification) (Arraiz et al., 2010). The instruments Hn satisfy further-
more:
The first-step S2SLS residuals are:
\[ \tilde u_{2SLS} = y - Z\tilde\delta_{2SLS}. \tag{5.73} \]
The following theorem states that δ̃2SLS is consistent:
Theorem 5.28 — Consistency of δ̃2SLS (Kelejian and Prucha, 2010). Suppose the assumptions above hold. Then δ̃2SLS = δ0 + Op(n⁻¹ᐟ²), and hence δ̃2SLS is consistent for δ0:
\[ \tilde\delta_{2SLS} \xrightarrow{p} \delta_0. \]
Recall the model:
\[ y_n = Z_n\delta + u_n, \qquad u_n = \lambda M_n u_n + \varepsilon_n. \]
The sampling error is given by:
\[ \begin{aligned} \hat\delta_n &= \delta_0 + \left(\hat Z_n^\top\hat Z_n\right)^{-1}\hat Z_n^\top u_n \\ &= \delta_0 + \left[Z_n^\top H_n\left(H_n^\top H_n\right)^{-1}H_n^\top Z_n\right]^{-1}Z_n^\top H_n\left(H_n^\top H_n\right)^{-1}H_n^\top\left(I_n - \lambda M_n\right)^{-1}\varepsilon_n. \end{aligned} \]
Solving for δ̂n − δ0 and multiplying by √n we obtain:
\[ \sqrt n\left(\hat\delta_n - \delta_0\right) = \left[\frac{Z_n^\top H_n}{n}\left(\frac{H_n^\top H_n}{n}\right)^{-1}\frac{H_n^\top Z_n}{n}\right]^{-1}\frac{Z_n^\top H_n}{n}\left(\frac{H_n^\top H_n}{n}\right)^{-1}\frac{1}{\sqrt n}F_n^\top\varepsilon_n. \]
Before turning to the asymptotic argument, the overall estimation procedure can be summarized in three steps:
1. Estimate δ by S2SLS, δ̃2SLS = (Z̃>Z)⁻¹Z̃>y, and obtain the residuals û2SLS.
2. Obtain the efficient GMM estimator of λ: use ûGMM to compute the weighting matrix Ψ̃ and obtain λ̃OGMM.
3. Estimate the FGS2SLS using λ̃OGMM, δ̂FGS2SLS = (Ẑs>Zs)⁻¹Ẑs>ys, and obtain ûFGS2SLS.
In the expansion above:
\[ F_n^\top = H_n^\top\left(I_n - \lambda M_n\right)^{-1}, \]
whose elements are bounded in absolute value. Assumption 5.27 implies that:
\[ \lim_{n\to\infty}\frac{1}{n}H_n^\top H_n = Q_{HH}, \qquad \operatorname*{plim}_{n\to\infty}\frac{1}{n}H_n^\top Z_n = Q_{HZ}, \]
which are finite and nonsingular.
Furthermore, note that E(n⁻¹ᐟ²Fn>εn) = 0 and
\[ E\left[\left(n^{-1/2}F_n^\top\varepsilon_n\right)\left(n^{-1/2}F_n^\top\varepsilon_n\right)^\top\right] = \frac{1}{n}E\left[H_n^\top\left(I - \lambda M_n\right)^{-1}\varepsilon\varepsilon^\top\left(I - \lambda M_n^\top\right)^{-1}H_n\right] = \sigma^2\frac{1}{n}H_n^\top\left(I - \lambda M_n\right)^{-1}\left(I - \lambda M_n^\top\right)^{-1}H_n. \]
Assume that
\[ \lim_{n\to\infty}\frac{1}{n}H_n^\top\left(I - \lambda M_n\right)^{-1}\left(I - \lambda M_n^\top\right)^{-1}H_n = \lim_{n\to\infty}\frac{1}{n}F_n^\top F_n = \Phi \quad\text{exists}. \]
Then, assuming homoskedasticity and applying the CLT in Theorem 5.1:
\[ n^{-1/2}F_n^\top\varepsilon_n \xrightarrow{d} N\left(0, \sigma^2\Phi\right). \]
Therefore:
\[ \sqrt n\left(\hat\delta_n - \delta_0\right) \xrightarrow{d} N\left(0, \Delta\right), \]
with
\[ \Delta = \sigma^2\left(Q_{HZ}^\top Q_{HH}^{-1}Q_{HZ}\right)^{-1}Q_{HZ}^\top Q_{HH}^{-1}\Phi Q_{HH}^{-1}Q_{HZ}\left(Q_{HZ}^\top Q_{HH}^{-1}Q_{HZ}\right)^{-1}. \]
The GM estimator of λ is then based on the sample moment vector evaluated at the S2SLS residuals:
\[ m\left(\lambda, \tilde\delta_{2SLS}\right) = \frac{1}{n}\begin{bmatrix} \tilde u_{2SLS}^\top\left(I - \lambda M^\top\right)A_1\left(I - \lambda M\right)\tilde u_{2SLS} \\ \tilde u_{2SLS}^\top\left(I - \lambda M^\top\right)A_2\left(I - \lambda M\right)\tilde u_{2SLS} \end{bmatrix} = \tilde G\begin{bmatrix}\lambda \\ \lambda^2\end{bmatrix} - \tilde g, \tag{5.74} \]
where G̃ and g̃ are built from ũ2SLS as above, and the initial GMM estimator of λ minimizes m(λ, δ̃2SLS)>Υλλm(λ, δ̃2SLS) with Υλλ = I. This estimator is consistent but not efficient; for efficiency we need to replace Υλλ by the inverse of the variance-covariance matrix of the sample moments. Furthermore, the expression above can be interpreted as a nonlinear least squares system of equations; the initial estimate is thus obtained as the solution of the above system.
Now we need to define the expressions for the matrices Ar. For the homoskedastic case, Drukker et al. (2013) suggest:
\[ A_1 = \upsilon\left[M^\top M - \frac{1}{n}\operatorname{tr}\left(M^\top M\right)I\right], \qquad A_2 = M, \]
where υ is the scaling factor needed to obtain the same estimator as in Kelejian and Prucha (1998, 1999).
On the other hand, when heteroskedasticity is assumed, Kelejian and Prucha (2010) recommend estimating the (r, s) element of the variance-covariance matrix of the sample moments as:
\[ \tilde\Psi_{rs} = \frac{1}{2n}\operatorname{tr}\left[\left(A_r + A_r^\top\right)\tilde\Sigma\left(A_s + A_s^\top\right)\tilde\Sigma\right] + \frac{1}{n}\tilde a_r^\top\tilde\Sigma\,\tilde a_s, \]
where:
\[ \tilde\Sigma = \operatorname{diag}_{i=1,\ldots,n}\left(\tilde\varepsilon_i^2\right), \qquad \tilde\varepsilon = \left(I - \breve\lambda_{gmm}M\right)\tilde u, \]
\[ \tilde a_r = \left(I - \breve\lambda_{gmm}M\right)H\tilde P\tilde\alpha_r, \qquad \tilde\alpha_r = -\frac{1}{n}Z^\top\left(I - \breve\lambda_{gmm}M\right)^\top\left(A_r + A_r^\top\right)\left(I - \breve\lambda_{gmm}M\right)\tilde u, \tag{5.78} \]
\[ \tilde P = \left(\frac{1}{n}H^\top H\right)^{-1}\left(\frac{1}{n}H^\top Z\right)\left[\left(\frac{1}{n}Z^\top H\right)\left(\frac{1}{n}H^\top H\right)^{-1}\left(\frac{1}{n}H^\top Z\right)\right]^{-1}. \]
It is important to note that this step is not necessary since the previous estimator of λ is already consistent.
where
\[ y_s = y - \tilde\lambda_{ogmm}My, \qquad Z_s = Z - \tilde\lambda_{ogmm}MZ, \qquad \hat Z_s = P_H Z_s, \qquad P_H = H\left(H^\top H\right)^{-1}H^\top. \tag{5.80} \]
where Ψ̂λ̂λ̂ is an estimator of the variance-covariance matrix of the (normalized) sample moment vector based on the GS2SLS residuals. This estimator differs between the cases of homoskedastic and heteroskedastic errors. For the homoskedastic case, the (r, s) element (r, s = 1, 2) of Ψ̂λ̂λ̂ is given by:
\[ \hat\Psi_{rs}^{\hat\lambda\hat\lambda} = \hat\sigma^4\left(2n\right)^{-1}\operatorname{tr}\left[\left(A_r + A_r^\top\right)\left(A_s + A_s^\top\right)\right] + \hat\sigma^2 n^{-1}\hat a_r^\top\hat a_s + n^{-1}\left[\hat\mu^{(4)} - 3\hat\sigma^4\right]\operatorname{vec}_D\left(A_r\right)^\top\operatorname{vec}_D\left(A_s\right) + n^{-1}\hat\mu^{(3)}\left[\hat a_r^\top\operatorname{vec}_D\left(A_s\right) + \hat a_s^\top\operatorname{vec}_D\left(A_r\right)\right], \tag{5.82} \]
where:
\[ \hat a_r = \hat T\hat\alpha_r, \qquad \hat T = H\hat P, \qquad \hat P = \hat Q_{HH}^{-1}\hat Q_{HZ}\left[\hat Q_{HZ}^\top\hat Q_{HH}^{-1}\hat Q_{HZ}\right]^{-1}, \]
\[ \hat Q_{HH} = \frac{1}{n}H^\top H, \qquad \hat Q_{HZ} = \frac{1}{n}H^\top\bar Z, \qquad \bar Z = \left(I - \hat\lambda M\right)Z, \tag{5.83} \]
\[ \hat\alpha_r = -\frac{1}{n}\bar Z^\top\left(A_r + A_r^\top\right)\hat\varepsilon, \qquad \hat\sigma^2 = \frac{1}{n}\hat\varepsilon^\top\hat\varepsilon, \qquad \hat\mu^{(3)} = \frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i^3, \qquad \hat\mu^{(4)} = \frac{1}{n}\sum_{i=1}^n\hat\varepsilon_i^4. \]
For the heteroskedastic case, the same expressions are used with Σ̂ in place of σ̂²I, where Σ̂ is a diagonal matrix whose ith diagonal element is ε̂i².
5.5 Application in R
In this example we use the simulated US Driving Under the Influence (DUI) county data set used in Drukker et al. (2011). The dependent variable dui is defined as the alcohol-related arrest rate per 100,000 daily vehicle miles traveled (DVMT). The explanatory variables include police, nondui, vehicles, dry, and elect (the latter is used below as an additional instrument; see the regression output for the full specification).
library("maptools")
library("spdep")
library("sphet")
# Load Data
us_shape <- readShapeSpatial("ccountyR") # Load shape file
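The construction of the weights list lw and the model call are not shown in the extracted source; the sketch below mirrors the Call in the output that follows, with the contiguity-based neighbour construction being an assumption (gstsls is provided by spatialreg, formerly spdep):
# Build a weights list and estimate the SARAR model by GM/IV
nb <- poly2nb(us_shape)              # queen-contiguity neighbours (assumption)
lw <- nb2listw(nb, style = "W")      # row-standardized weights list

sarar <- gstsls(dui ~ police + nondui + vehicles + dry,
                data = us_shape, listw = lw)
summary(sarar)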
##
## Call:gstsls(formula = dui ~ police + nondui + vehicles + dry, data = us_shape,
## listw = lw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.655535 -0.362165 -0.070363 0.277261 2.418849
##
## Type: GM SARAR estimator
## Coefficients: (GM standard errors)
## Estimate Std. Error z value Pr(>|z|)
## Rho_Wy 0.04692763 0.01698220 2.7633 0.005721
## (Intercept) -6.40991922 0.41836312 -15.3214 < 2.2e-16
## police 0.59810726 0.01491778 40.0936 < 2.2e-16
## nondui 0.00024688 0.00108699 0.2271 0.820328
## vehicles 0.01571247 0.00066881 23.4933 < 2.2e-16
## dry 0.10608849 0.03496242 3.0344 0.002410
##
## Lambda: 0.00095701
## Residual variance (sigma squared): 0.31811, (sigma: 0.56402)
## GM argmin sigma squared: 0.31789
## Number of observations: 3109
## Number of parameters estimated: 8
The results show that all the variables are significant except nondui. Importantly, a higher number of sworn officers is positively correlated with the DUI arrest rate, after controlling for nondui, vehicles, and dry! The spatial autoregressive coefficient ρ is positive and significant, indicating autocorrelation in the dependent variable. Drukker et al. (2011) give some theoretical explanations for this result. On the one hand, the positive coefficient may be explained in terms of coordination efforts among police departments in different counties. On the other hand, it might well be that an enforcement effort in one county leads people living close to the border to drink in neighboring counties. The estimate of λ is close to zero, and the output does not produce inference for it. Lastly, it is important to stress that the standard errors include a degrees-of-freedom correction in the variance-covariance matrix.
##
## Generalized stsls
##
## Call:
## spreg(formula = dui ~ nondui + vehicles + dry, data = us_shape,
## listw = lw, endog = ~police, instruments = ~elect, lag.instr = TRUE,
## model = "sarar", het = FALSE)
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -6.1862 -0.8838 0.0147 -0.0161 0.9213 8.3616
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) 11.60596811 1.66674437 6.9633 3.325e-12 ***
## nondui -0.00019624 0.00275912 -0.0711 0.943299
## vehicles 0.09299562 0.00564911 16.4620 < 2.2e-16 ***
## dry 0.39825983 0.09090201 4.3812 1.180e-05 ***
## police -1.35130834 0.14101772 -9.5825 < 2.2e-16 ***
## lambda 0.19319018 0.04431011 4.3600 1.301e-05 ***
## rho -0.08597523 0.03018333 -2.8484 0.004393 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An important issue here is that the optimal instruments are unknown. Including the spatial lags of these additional exogenous variables in the matrix of instruments is not generally recommended; however, results reported in ? do consider the spatial lags of elect.
Now we assume that the errors are heteroskedastic of unknown form.
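A sketch of the heteroskedasticity-robust variant, changing only het = TRUE relative to the spreg call shown above:
# Same SARAR specification, now allowing heteroskedastic innovations
sarar_het <- spreg(dui ~ nondui + vehicles + dry,
                   data = us_shape, listw = lw,
                   endog = ~ police, instruments = ~ elect,
                   lag.instr = TRUE, model = "sarar", het = TRUE)
summary(sarar_het)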
Appendix
5.A Proof of Theorem 3 in Kelejian and Prucha (1998)
Recall that the GS2SLS estimator is given by:
\[ \hat\delta_n = \left[\hat Z_s(\lambda)^\top\hat Z_s(\lambda)\right]^{-1}\hat Z_s(\lambda)^\top y_s(\lambda), \tag{5.85} \]
whereas the FGS2SLS estimator is given by:
\[ \hat\delta_{F,n} = \left[\hat Z_s(\hat\lambda)^\top\hat Z_s(\hat\lambda)\right]^{-1}\hat Z_s(\hat\lambda)^\top y_s(\hat\lambda), \tag{5.86} \]
where
\[ \hat Z_s(\hat\lambda_n) = P_{H_n}Z_s(\hat\lambda_n), \qquad Z_s(\hat\lambda_n) = Z_n - \hat\lambda_n M_n Z_n, \qquad y_s(\hat\lambda_n) = y_n - \hat\lambda_n M_n y_n, \tag{5.87} \]
so that
\[ \hat Z_s(\hat\lambda_n) = \left[X_n - \hat\lambda_n M_n X_n,\; P_{H_n}\left(W_n y_n - \hat\lambda_n M_n W_n y_n\right)\right]. \]
Also note that, adding and subtracting εn,
\[ u_s(\hat\lambda_n) = \left(I - \hat\lambda_n M_n\right)u_n = \varepsilon_n - \left(\hat\lambda_n - \lambda\right)M_n u_n. \]
Then:
\[ \begin{aligned} \hat\delta_{F,n} - \delta &= \left[\hat Z_s(\hat\lambda)^\top\hat Z_s(\hat\lambda)\right]^{-1}\hat Z_s(\hat\lambda)^\top\left[\varepsilon_n - \left(\hat\lambda_n - \lambda\right)M_n u_n\right] \\ &= \left[\hat Z_s(\hat\lambda)^\top\hat Z_s(\hat\lambda)\right]^{-1}\hat Z_s(\hat\lambda)^\top\varepsilon_n - \left[\hat Z_s(\hat\lambda)^\top\hat Z_s(\hat\lambda)\right]^{-1}\hat Z_s(\hat\lambda)^\top\left(\hat\lambda_n - \lambda\right)M_n u_n, \end{aligned} \]
so that
\[ \sqrt n\left(\hat\delta_{F,n} - \delta\right) = \left[\frac{1}{n}\hat Z_s(\hat\lambda)^\top\hat Z_s(\hat\lambda)\right]^{-1}\frac{1}{\sqrt n}\hat Z_s(\hat\lambda)^\top\varepsilon_n - \left[\frac{1}{n}\hat Z_s(\hat\lambda)^\top\hat Z_s(\hat\lambda)\right]^{-1}\left(\hat\lambda_n - \lambda\right)\frac{1}{\sqrt n}\hat Z_s(\hat\lambda)^\top M_n u_n. \tag{5.90} \]
By consistency, λ̂n − λ = op(1). Now we need to show that:
\[ \frac{1}{n}\hat Z_s(\hat\lambda)^\top\hat Z_s(\hat\lambda) \xrightarrow{p} \bar Q, \tag{5.91} \]
\[ \frac{1}{\sqrt n}\hat Z_s(\hat\lambda)^\top\varepsilon_n \xrightarrow{d} N\left(0, \sigma^2\bar Q\right), \tag{5.92} \]
\[ \left(\hat\lambda_n - \lambda\right)\frac{1}{\sqrt n}\hat Z_s(\hat\lambda)^\top M_n u_n \xrightarrow{p} 0, \tag{5.93} \]
where:
\[ \bar Q = \left[Q_{HZ} - \lambda Q_{HMZ}\right]^\top Q_{HH}^{-1}\left[Q_{HZ} - \lambda Q_{HMZ}\right] \tag{5.94} \]
is finite and nonsingular. For (5.91), note that:
\[ \begin{aligned} \frac{1}{n}\hat Z_s(\hat\lambda)^\top\hat Z_s(\hat\lambda) &= \frac{1}{n}\left(Z_n - \hat\lambda_n M_n Z_n\right)^\top P_H\left(Z_n - \hat\lambda_n M_n Z_n\right) \\ &= \left(\frac{1}{n}Z_n^\top H_n - \hat\lambda_n\frac{1}{n}Z_n^\top M_n^\top H_n\right)\left(\frac{1}{n}H_n^\top H_n\right)^{-1}\left(\frac{1}{n}H_n^\top Z_n - \hat\lambda_n\frac{1}{n}H_n^\top M_n Z_n\right), \end{aligned} \tag{5.95} \]
and since \( n^{-1}Z_n^\top H_n \xrightarrow{p} Q_{HZ}^\top \), \( n^{-1}Z_n^\top M_n^\top H_n \xrightarrow{p} Q_{HMZ}^\top \), \( n^{-1}H_n^\top H_n \to Q_{HH} \), and \( \hat\lambda_n \xrightarrow{p} \lambda \), the whole expression converges in probability to \( \bar Q \).
Similarly,
\[ \frac{1}{\sqrt n}\hat Z_s(\hat\lambda)^\top\varepsilon_n = \left(\frac{1}{n}Z_n^\top H_n - \hat\lambda_n\frac{1}{n}Z_n^\top M_n^\top H_n\right)\left(\frac{1}{n}H_n^\top H_n\right)^{-1}\frac{1}{\sqrt n}H_n^\top\varepsilon_n, \tag{5.96} \]
where \( n^{-1/2}H_n^\top\varepsilon_n \xrightarrow{d} N\left(0, \sigma^2 Q_{HH}\right) \); combined with the limits above, this yields (5.92).
Finally,
\[ \left(\hat\lambda_n - \lambda\right)\frac{1}{\sqrt n}\hat Z_s(\hat\lambda)^\top M_n u_n = \underbrace{\left(\hat\lambda_n - \lambda\right)}_{o_p(1)}\left(\frac{1}{n}Z_n^\top H_n - \hat\lambda_n\frac{1}{n}Z_n^\top M_n^\top H_n\right)\left(\frac{1}{n}H_n^\top H_n\right)^{-1}\frac{1}{\sqrt n}H_n^\top M_n u_n. \tag{5.97} \]
Note that \( E\left(n^{-1/2}H_n^\top M_n u_n\right) = 0 \) and \( E\left(n^{-1}H_n^\top M_n u_n u_n^\top M_n^\top H_n\right) = n^{-1}H_n^\top M_n\Sigma_u M_n^\top H_n \), whose elements are uniformly bounded; hence \( n^{-1/2}H_n^\top M_n u_n = O_p(1) \) and (5.93) follows.
Finally, the error variance is estimated as:
\[ \hat\sigma^2 = \frac{\hat\varepsilon^\top\hat\varepsilon}{n}, \qquad \hat\varepsilon = y_s(\hat\lambda) - Z_s(\hat\lambda)\hat\delta_F. \tag{5.100} \]
Bibliography
Allers, M. A. and Elhorst, J. P. (2005). Tax Mimicking and Yardstick Competition Among Local Governments
in The Netherlands. International tax and public finance, 12(4):493–513.
Anselin, L. (1988). Spatial Econometrics: Methods and Models, volume 4. Springer.
Anselin, L. (1996). Chapter Eight: The Moran Scatterplot as an ESDA Tool to Assess Local Instability in
Spatial Association. Spatial Analytical, 4:121.
Anselin, L. (2003). Spatial Externalities, Spatial Multipliers, and Spatial Econometrics. International regional
science review, 26(2):153–166.
Anselin, L. (2007). Spatial Econometrics, pages 310–330. Blackwell Publishing Ltd.
Anselin, L. and Bera, A. (1998). Spatial Dependence in Linear Regression Models with an Introduction to
Spatial Econometrics. In Ullah, A. and Giles, D., editors, Handbook of Applied Economic Statistics, pages
237–289. Marcel Dekker, New York.
Anselin, L. and Lozano-Gracia, N. (2008). Errors in Variables and Spatial Effects in Hedonic House Price
Models of Ambient Air Quality. Empirical economics, 34(1):5–34.
Anselin, L. and Rey, S. (1991). Properties of Tests for Spatial Dependence in Linear Regression Models.
Geographical analysis, 23(2):112–131.
Anselin, L. and Rey, S. (2014). Modern Spatial Econometrics in Practice: A Guide to Geoda, Geodaspace
and Pysal. GeoDa Press LLC.
Arraiz, I., Drukker, D. M., Kelejian, H. H., and Prucha, I. R. (2010). A Spatial Cliff-Ord-Type Model with
Heteroskedastic Innovations: Small and Large Sample Results. Journal of Regional Science, 50(2):592–614.
Baller, R. D., Anselin, L., Messner, S. F., Deane, G., and Hawkins, D. F. (2001). Structural Covariates of
US County Homicide Rates: Incorporating Spatial Effects. Criminology, 39(3):561–588.
Basdas, U. (2009). Spatial Econometric Analysis of the Determinants of Location in Turkish Manufacturing
Industry. Available at SSRN 1506888.
Bivand, R., Hauke, J., and Kossowski, T. (2013). Computing the Jacobian in Gaussian Spatial Autoregressive
Models: An Illustrated Comparison of Available Methods. Geographical Analysis, 45(2):150–179.
Bivand, R. and Lewin-Koh, N. (2015). maptools: Tools for Reading and Handling Spatial Objects. R package
version 0.8-36.
Bivand, R. and Piras, G. (2015). Comparing Implementations of Estimation Methods for Spatial Economet-
rics. Journal of Statistical Software, 63(1):1–36.
Boarnet, M. G. and Glazer, A. (2002). Federal Grants and Yardstick Competition. Journal of urban Eco-
nomics, 52(1):53–64.
Cliff, A. and Ord, K. (1972). Testing for Spatial Autocorrelation Among Regression Residuals. Geographical
analysis, 4(3):267–284.
Cliff, A. D. and Ord, J. K. (1973). Spatial Autocorrelation. London:Pion.
Cohen, J. and Tita, G. (1999). Diffusion in Homicide: Exploring a General Method for Detecting Spatial
Diffusion Processes. Journal of Quantitative Criminology, 15(4):451–493.
Cordy, C. B. and Griffith, D. A. (1993). Efficiency of least squares estimators in the presence of spatial
autocorrelation. Communications in Statistics-Simulation and Computation, 22(4):1161–1179.
Das, D., Kelejian, H. H., and Prucha, I. R. (2003). Finite Sample Properties of Estimators of Spatial
Autoregressive Models with Autoregressive Disturbances. Papers in Regional Science, 82(1):1–26.
Doreian, P. (1981). Estimating Linear Models with Spatially Distributed Data. Sociological methodology,
pages 359–388.
Drukker, D. M., Egger, P., and Prucha, I. R. (2013). On Two-step Estimation of a Spatial Autoregressive
Model with Autoregressive Disturbances and Endogenous Regressors. Econometric Reviews, 32(5-6):686–
733.
Drukker, D. M., Prucha, I. R., and Raciborski, R. (2011). A Command for Estimating Spatial-autoregressive
Models with Spatial-autoregressive Disturbances and Additional Endogenous Variables. Econometric Re-
views, 32:686–733.
Elhorst, J. P. (2010). Applied Spatial Econometrics: Raising the Bar. Spatial Economic Analysis, 5(1):9–28.
Elhorst, J. P. (2014). Spatial Econometrics: From Cross-Sectional Data to Spatial Panels. Springer.
Filiztekin, A. (2009). Regional Unemployment in Turkey. Papers in Regional Science, 88(4):863–878.
Fischer, M. M., Bartkowska, M., Riedl, A., Sardadvar, S., and Kunnert, A. (2009). The Impact of Human
Capital on Regional Labor Productivity in Europe. Letters in Spatial and Resource Sciences, 2(2-3):97–108.
Garretsen, H. and Peeters, J. (2009). FDI and the Relevance of Spatial Linkages: Do Third-Country Effects
Matter for Dutch FDI? Review of World Economics, 145(2):319–338.
Garrett, T. A. and Marsh, T. L. (2002). The Revenue Impacts of Cross-Border Lottery Shopping in the
Presence of Spatial Autocorrelation. Regional Science and Urban Economics, 32(4):501–519.
Gibbons, S., Overman, H. G., and Patacchini, E. (2015). Spatial Methods. In Handbook of Regional and
Urban Economics, page 115.
Kelejian, H. and Piras, G. (2017). Spatial Econometrics. Academic Press.
Kelejian, H. H. and Prucha, I. R. (1998). A Generalized Spatial Two-Stage Least Squares Procedure for
Estimating a Spatial Autoregressive Model with Autoregressive Disturbances. The Journal of Real Estate
Finance and Economics, 17(1):99–121.
Kelejian, H. H. and Prucha, I. R. (1999). A Generalized Moments Estimator for the Autoregressive Parameter
in a Spatial Model. International Economic Review, 40(2):509–533.
Kelejian, H. H. and Prucha, I. R. (2001). On the Asymptotic Distribution of the Moran I Test Statistic with
Applications. Journal of Econometrics, 104(2):219–257.
Kelejian, H. H. and Prucha, I. R. (2007). The Relative Efficiencies of Various Predictors in Spatial Econo-
metric Models Containing Spatial Lags. Regional Science and Urban Economics, 37(3):363–374.
Kelejian, H. H. and Prucha, I. R. (2010). Specification and Estimation of Spatial Autoregressive Models with
Autoregressive and Heteroskedastic Disturbances. Journal of Econometrics, 157(1):53–67.
Kelejian, H. H., Prucha, I. R., and Yuzefovich, Y. (2004). Instrumental Variable Estimation of a Spatial
Autoregressive Model with Autoregressive Disturbances: Large and Small Sample Results. In LeSage, J.
and Pace, R., editors, Spatial and Spatiotemporal Econometrics, pages 163–198. Emerald Group Publishing
Limited.
Kim, C. W., Phipps, T. T., and Anselin, L. (2003). Measuring the Benefits of Air Quality Improvement: A
Spatial Hedonic Approach. Journal of Environmental Economics and Management, 45(1):24–39.
Kirby, D. K. and LeSage, J. P. (2009). Changes in Commuting to Work Times Over the 1990 to 2000 Period.
Regional Science and Urban Economics, 39(4):460–471.
Lee, L.-F. (2002). Consistency and Efficiency of Least Squares Estimation for Mixed Regressive, Spatial
Autoregressive Models. Econometric Theory, 18(2):252–277.
Lee, L.-F. (2003). Best Spatial Two-Stage Least Squares Estimators for a Spatial Autoregressive Model with
Autoregressive Disturbances. Econometric Reviews, 22(4):307–335.
Lee, L.-F. (2004). Asymptotic Distributions of Quasi-Maximum Likelihood Estimators for Spatial Autore-
gressive Models. Econometrica, 72(6):1899–1925.
Lee, L.-F. (2007). GMM and 2SLS Estimation of Mixed Regressive, Spatial Autoregressive Models. Journal
of Econometrics, 137(2):489–514.
LeSage, J. and Pace, R. K. (2010). Introduction to Spatial Econometrics. CRC Press.
LeSage, J. P. (1997). Bayesian Estimation of Spatial Autoregressive Models. International Regional Science
Review, 20(1-2):113–129.
LeSage, J. P. (2014). What Regional Scientists Need to Know about Spatial Econometrics. The Review of
Regional Studies, 44(1):13–32.
LeSage, J. P. and Pace, R. K. (2014). Interpreting Spatial Econometric Models. In Fischer, M. M. and
Nijkamp, P., editors, Handbook of Regional Science, pages 1535–1552. Springer Berlin Heidelberg, Berlin,
Heidelberg.
Mead, R. (1967). A Mathematical Model for the Estimation of Inter-Plant Competition. Biometrics, pages
189–205.
Newey, W. K. and McFadden, D. (1994). Large Sample Estimation and Hypothesis Testing. Handbook of
Econometrics, 4:2111–2245.
Ord, J. K. (1975). Estimation Methods for Models of Spatial Interaction. Journal of the American Statistical
Association, 70(349):120–126.
Pace, R. K. and LeSage, J. P. (2008). A Spatial Hausman Test. Economics Letters, 101(3):282–284.
Pavlyuk, D. (2011). Spatial Analysis of Regional Employment Rates in Latvia.
Piras, G. (2010). sphet: Spatial Models with Heteroskedastic Innovations in R. Journal of Statistical Software,
35(1):1–21.
Prucha, I. (2014). Instrumental Variables/Method of Moments Estimation. In Fischer, M. M. and Nijkamp,
P., editors, Handbook of Regional Science, pages 1597–1617. Springer Berlin Heidelberg.
Saavedra, L. A. (2000). A Model of Welfare Competition with Evidence from AFDC. Journal of Urban
Economics, 47(2):248–279.
Smirnov, O. and Anselin, L. (2001). Fast Maximum Likelihood Estimation of Very Large Spatial Autoregres-
sive Models: A Characteristic Polynomial Approach. Computational Statistics & Data Analysis, 35(3):301–
319.
Stewart, B. M. and Zhukov, Y. (2010). Choosing Your Neighbors: The Sensitivity of Geographical Diffusion
in International Relations. In APSA 2010 Annual Meeting Paper.
Tiefelsdorf, M., Griffith, D., and Boots, B. (1999). A Variance-Stabilizing Coding Scheme for Spatial Link
Matrices. Environment and Planning A, 31(1):165–180.
White, H. (2014). Asymptotic Theory for Econometricians. Academic Press.