A Full-Scale Approximation of Covariance Functions For Large Spatial Data Sets
Huiyan Sang†
Department of Statistics, Texas A&M University, College Station, USA.
Jianhua Z. Huang
Department of Statistics, Texas A&M University, College Station, USA.
Summary. Gaussian process models have been widely used in spatial statistics but face
tremendous computational challenges for very large data sets. The model fitting and spatial
prediction of such models typically require O(n3 ) operations for a data set of size n. Various
approximations of the covariance functions have been introduced to reduce the computational
cost. However, most existing approximations cannot simultaneously capture both the large
and small scale spatial dependence. A new approximation scheme is developed in this paper
to provide a high quality approximation to the covariance function at both the large and
small spatial scales. The new approximation is the summation of two parts: a reduced rank
covariance and a compactly supported covariance obtained by tapering the covariance of the
residual of the reduced rank approximation. While the former part mainly captures the large
scale spatial variation, the latter part captures the small scale, local variation that is
unexplained by the former part. By combining the reduced rank representation and sparse matrix
techniques, our approach allows for efficient computation for maximum likelihood estimation,
spatial prediction and Bayesian inference. We illustrate the new approach with simulated and
real data sets.
Keywords: Covariance function; Gaussian processes; Geostatistics; Kriging; Large spatial
dataset; Spatial processes
1. Introduction
Gaussian process models have been widely used in modeling spatial data (see, e.g., Diggle
et al., 1998; Banerjee et al., 2004). However, large data sets pose tremendous computational
challenges to the application of these models. In particular, spatial model fitting and spatial
prediction (e.g., kriging) both involve inversion of an n × n covariance matrix for a data
set of size n, which typically requires O(n3 ) operations and O(n2 ) memory, and is thus
computationally intractable for very large n.
Various approximations of the spatial likelihood have been developed for efficient computation
with large spatial data sets. Vecchia (1988) and Stein et al. (2004) used a product
of conditional densities, which requires a careful choice of suitable conditioning sets. The
Gaussian Markov random field approximation by Rue and Tjelmeland (2002) and Rue and
Held (2005) works best for gridded data and may have difficulty in prediction with massive
†Address for correspondence: Huiyan Sang, Department of Statistics, Texas A&M University,
College Station, TX 77843, USA.
Email: [email protected]
data sets. Fuentes (2007) worked in the spectral domain of spatial processes, a strategy
suited mainly to stationary processes.
Two recently developed approaches have shown great appeal as general-purpose methodologies,
but each has its own drawbacks. The first approach is based on a reduced rank
approximation of the underlying process. Methods following this path include kernel
convolutions (see, e.g., Higdon, 2002), low rank splines or basis functions (e.g., Wikle and
Cressie, 1999; Ver Hoef et al., 2004; Kammann and Wand, 2003; Cressie and Johannesson,
2008), and predictive process models (Banerjee et al., 2008; Finley et al., 2009). Reduced
rank based methods have proven successful in capturing the large scale structure of spatial
processes. However, they usually fail to accurately capture the local, small scale dependence
structure (see, e.g., Stein, 2008; Finley et al., 2009).
The second approach seeks a sparse approximation of the covariance function and
achieves computational efficiency through sparse matrix techniques. In particular, by setting
to zero the covariances of distant pairs of observations, covariance tapering has recently
been introduced as a way of constructing sparse covariance matrix approximations, and
efficient algorithms have been developed for spatial prediction and parameter estimation
(Furrer et al., 2006; Kaufman et al., 2008). The covariance tapering method works well
in handling short range dependence, but it may not be effective in accounting for long
range spatial dependence: a tapered covariance function with a relatively small
taper range fails to provide a good approximation to the original covariance function and
may lead to bias in spatial prediction and parameter estimation.
We propose a full-scale approximation to the covariance function of a spatial process that
facilitates efficient computation with very large spatial data sets while avoiding
the pitfalls of the two approaches mentioned above. The new approach uses a reduced rank
process to capture the large scale spatial variation and a process with compactly supported
covariance function to capture the small scale, local variation that is unexplained by the
reduced rank process. The compactly supported covariance function is obtained by tapering
the covariance function of the residual process from the reduced rank process approximation.
By utilizing the reduced rank representation and sparse matrix techniques, the new approach
significantly reduces the computational burden associated with very large data sets. The
full-scale approximation works well with both frequentist and Bayesian approaches to
spatial modeling. It can be conveniently applied to expedite computation for both
maximum likelihood and Bayesian inference of model parameters, and to carry out spatial
prediction through either best linear unbiased prediction (i.e., kriging) or Bayesian
prediction.
The remainder of the paper is organized as follows. Section 2 reviews the Gaussian
process models and two existing approximation methods for fast computation: the reduced
rank and tapering methods. Section 3 presents the proposed new approximation method.
Section 4 gives details of model fitting and spatial prediction using the new approximation.
We then illustrate our method in Section 5 with a simulation study and a rainfall data
analysis. Section 6 discusses some possible extensions and other applications.
2. Gaussian process models and existing covariance approximations
In this section, we present a summary of Gaussian process models for spatial data sets, and
review two existing approaches to approximating the covariance function that allow
rapid computation for likelihood-based parameter estimation and spatial prediction. We
point out the drawbacks of these two approaches to motivate our new approach to be
introduced in the next section. Our presentation of Gaussian process models is based on
the standard treatment in Banerjee et al. (2004) and Schabenberger and Gotway (2005).
where $h(s_0) = (C(s_0, s_1), \ldots, C(s_0, s_n))^T$. The mean squared prediction error is
\[
E\{Y(s_0) - \hat{Y}(s_0)\}^2 = C(s_0, s_0) + \tau^2 - h^T(s_0)\,(C_{n,n} + \tau^2 I)^{-1}\, h(s_0).
\]
The computational bottleneck in applying the kriging equations (3) and (4) is the inversion
of the n × n matrix $C_{n,n} + \tau^2 I$, which typically has computational cost $O(n^3)$. On the
other hand, Bayesian prediction draws samples from the predictive distribution
$p(Y(s_0) \mid Y) = \int p(Y(s_0) \mid \Omega, Y)\, p(\Omega \mid Y)\, d\Omega$ at a new site $s_0$ by composition. Again, sampling
from the conditional distribution $p(Y(s_0) \mid \Omega, Y)$ involves the inversion of the n × n matrix
$C_{n,n} + \tau^2 I$, which is a computational burden for large data sets, as shown in the previous
paragraph.
2.2. Reduced rank approximation
Let w̄(s) denote the rank-m process obtained by truncating the Karhunen–Loève (K-L) expansion of w(s) at the first m eigenfunctions $\psi_1(s), \ldots, \psi_m(s)$ of the covariance function C(·, ·), with corresponding eigenvalues $\lambda_1, \ldots, \lambda_m$. The covariance function of w̄(s) is $C_l(s, s') = \psi^T(s)\,\Lambda\,\psi(s')$, where $\psi(s) = (\psi_1(s), \ldots, \psi_m(s))^T$
and Λ is the m × m diagonal matrix with entries $\lambda_1, \ldots, \lambda_m$.
Application of the above reduced rank approximation relies on the ability to solve the
integral equation, typically a hard task. Williams and Seeger (2001) proposed to solve the
integral equation using the Nyström method. Consider a set of knots S ∗ = {s∗1 , . . . , s∗m }.
Let $C^*$ denote the m × m matrix whose (k, l) entry is $C(s_k^*, s_l^*)$. By discretizing the integral,
the Nyström method transforms (6) into
\[
\frac{1}{m} \sum_{k=1}^{m} C(s, s_k^*)\, \psi_i(s_k^*) \approx \lambda_i\, \psi_i(s). \tag{8}
\]
Plugging the points of $S^*$ into (8) for s, we obtain a matrix eigenequation $C^* u_i^{(m)} = \lambda_i^{(m)} u_i^{(m)}$, the solution of which is linked to the solution of (8) through
\[
\psi_i(S^*) \approx \sqrt{m}\; u_i^{(m)}, \qquad \lambda_i \approx \frac{\lambda_i^{(m)}}{m},
\]
where $\psi_i(S^*) = (\psi_i(s_1^*), \ldots, \psi_i(s_m^*))^T$ and the normalization $|u_i^{(m)}|^2 = (1/m)\,|\psi_i(S^*)|^2 = 1$
is used. By (8), the Nyström approximation of the ith eigenfunction is
\[
\psi_i(s) \approx \frac{\sqrt{m}}{\lambda_i^{(m)}} \sum_{k=1}^{m} C(s, s_k^*)\, u_{i,k}^{(m)} = \frac{\sqrt{m}}{\lambda_i^{(m)}}\, C(s, S^*)^T u_i^{(m)},
\]
where $C(s, S^*) = (C(s, s_1^*), \ldots, C(s, s_m^*))^T$ and $u_{i,k}^{(m)}$ denotes the kth entry of $u_i^{(m)}$. The covariance function of the rank-m process
w̄(s) is $\bar{C}(s, s') = C(s, S^*)^T C^{*-1} C(s', S^*)$, which is a finite rank approximation to the
covariance function C(s, s′) of the original process w(s). Applying this approximation,
the matrix inversion in (2)–(4) can be computed efficiently using the Sherman–Woodbury–
Morrison formula: $(A + UBV)^{-1} = A^{-1} - A^{-1}U(B^{-1} + VA^{-1}U)^{-1}VA^{-1}$.
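To make the computational gain concrete, the following Matlab sketch (variable names are illustrative, not code from the paper) checks the formula numerically for a reduced-rank-plus-diagonal matrix of the kind that arises here:

% A minimal numerical check of the Sherman-Woodbury-Morrison formula for
% a reduced-rank-plus-diagonal matrix; all variable names are illustrative.
n = 500; m = 25;
U = randn(n, m);                  % n x m factor, playing the role of C(s, S*)
B = eye(m);                       % m x m middle matrix
A = 2 * eye(n);                   % cheap-to-invert diagonal part
Ainv = 0.5 * eye(n);
direct = inv(A + U * B * U');                                          % O(n^3)
woodbury = Ainv - Ainv * U * ((inv(B) + U' * Ainv * U) \ (U' * Ainv)); % O(n m^2)
disp(norm(direct - woodbury, 'fro'))   % should be numerically negligible

Only the m × m matrix B⁻¹ + VA⁻¹U needs a dense factorization, which is the source of the speed-up when m is much smaller than n.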
The reduced rank approximation given above is based on truncating the K-L expansion
of the process and the Nyström approximation of the eigensystem. Banerjee et al. (2008)
proposed to construct a reduced rank approximation using spatial interpolation which,
interestingly, yields the same approximation of the covariance function. Specifically, let
$w^* = [w(s_i^*)]_{i=1}^m$ denote the realization of w(s) at the m knots in $S^*$. The BLUP of w(s)
at any fixed site s based on $w^*$ is $\tilde{w}(s) = C(s, S^*)^T C^{*-1} w^*$. According to the classical
theory of kriging, w̃(s) minimizes the mean squared prediction error E{[w(s) − f (w∗ )]2 }
over all linear functions f (w∗ ), and over all square integrable functions if the process is
Gaussian. Because of this interpretation as the best prediction, w̃(s) is called the predictive
process. Banerjee et al. (2008) demonstrated the utility of the predictive process
approximation in achieving computational efficiency in Bayesian hierarchical modeling
of large spatial data sets. Since the covariance matrix of $w^*$ is $C^*$, the process w̃(s) has the
same covariance function as the rank-m process w̄(s) defined in (9). Thus the approaches
of Williams and Seeger (2001) and Banerjee et al. (2008) are equivalent.
Although methods based on the reduced rank approximation have proven successful in
capturing large-scale variation of spatial processes, they share one common disadvantage:
inaccuracy in representing local/small scale dependence (Stein, 2008; Finley et al., 2009).
The top panel of Fig. 1 shows a typical example in which the predictive process approximation is
poor at short distances. For spatial processes with relatively fine scale spatial dependence,
the reduced rank approximation generally requires a relatively high rank m in order to
preserve more complete information about the fine scale spatial pattern, and hence loses
the computational advantage. Banerjee et al. (2008) pointed out that the performance of
the predictive process approximation depends on the size of the spatial dependence range
relative to the spacing of the knots. The quality of the predictive process approximation
usually gets worse as the spatial dependence range gets shorter. In this regard, the
predictive process with a limited number of knots will not be able to make reliable inferences
about the dependence between pairs of sites that are very close to each other (relative to the spacing of
the knots). Some numerical examples will be presented in Section 3 when the predictive
process is compared with our new approximation scheme.
The predictive process model requires the selection of knot locations. Banerjee et al.
(2010) briefly discussed several possible strategies for knot selection. For fairly evenly
distributed data locations, they suggested selecting knots on a uniform grid overlaid on
the domain. For highly irregularly distributed locations, they suggested either using
popular clustering algorithms such as k-means, or the more robust median-based partitioning-around-medoids
algorithm (e.g., Kaufman and Rousseeuw, 1990). One may also choose
the knots following some formal design-based approaches based upon minimization of a
spatially averaged predictive variance criterion (e.g., Diggle and Lophaven, 2006; Finley
et al., 2009).
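As a minimal sketch of the two simplest strategies (assuming the data locations are stored in an n × 2 matrix S; kmeans is the Statistics Toolbox function):

% Knot selection on a uniform grid over a square domain [0, L]^2 ...
g = 10; L = 100;                       % g^2 = 100 knots
[gx, gy] = meshgrid(linspace(0, L, g));
Sknot = [gx(:), gy(:)];
% ... or, for irregularly distributed locations, at k-means centroids
m = 100;
[~, Sknot] = kmeans(S, m);             % rows of Sknot are the cluster centroids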
2.3. Covariance tapering
Covariance tapering approximates the covariance function C(s, s′) by the tapered function
$C_{\mathrm{tap}}(s, s'; \gamma) = C(s, s')\, K_{\mathrm{taper}}(s, s'; \gamma)$, where the taper $K_{\mathrm{taper}}(s, s'; \gamma)$ is a compactly supported
correlation function that is identically zero whenever |s − s′| > γ. According to the Schur
product theorem (Horn and Johnson, 1985, Section 7.5), the tapered
covariance function is positive semi-definite and thus a valid covariance function. Note that
the tapered covariance is exactly zero for data at any two locations whose distance is larger
than γ. Thus the tapering range parameter γ can be viewed as the effective range for
the spatial phenomenon being studied. By assigning γ a small value, we obtain a sparse
covariance matrix and can use efficient sparse matrix algorithms for likelihood inference
and spatial prediction. Many compactly supported correlation functions constructed in the
literature can serve as a taper function (see, e.g., Wendland, 1995, 1998; Gneiting, 2002).
Examples include the spherical covariance function, defined as
\[
K_{\mathrm{spherical}}(h; \gamma) = \Big(1 - \frac{h}{\gamma}\Big)_+^{2} \Big(1 + \frac{h}{2\gamma}\Big), \qquad h > 0, \tag{10}
\]
the Wendland function, defined as
\[
K_{\mathrm{wendland},1}(h; \gamma) = \Big(1 - \frac{h}{\gamma}\Big)_+^{4} \Big(1 + \frac{4h}{\gamma}\Big), \qquad h > 0, \tag{11}
\]
and
\[
K_{\mathrm{wendland},2}(h; \gamma) = \Big(1 - \frac{h}{\gamma}\Big)_+^{6} \Big(1 + \frac{6h}{\gamma} + \frac{35h^2}{3\gamma^2}\Big), \qquad h > 0. \tag{12}
\]
See Furrer et al. (2006) for some suggestions on choosing the tapering function for the
covariance tapering.

Fig. 1. The exponential covariance function C(h) = exp(−3h/50) and its approximations. The
solid lines represent the true covariance function. The top panel plots the approximated covariances
against distance for 500 random locations in [0, 100] × [0, 100] under the predictive process
approximation with m = 100 evenly placed knots. The middle panel displays the tapered covariance
function using the spherical correlation taper with taper range γ = 20. The bottom panel plots the
approximated covariances at the same 500 locations under the proposed full-scale approximation
with m = 100 evenly placed knots and taper range γ = 20.
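As a sketch, the tapers in (10)–(12) can be written as vectorized Matlab functions of distance, with the logical factor (h < gamma) enforcing both the truncation (·)₊ and the compact support:

% Compactly supported taper correlations (10)-(12), vectorized over h >= 0
spherical = @(h, gamma) (1 - h/gamma).^2 .* (1 + h/(2*gamma)) .* (h < gamma);
wendland1 = @(h, gamma) (1 - h/gamma).^4 .* (1 + 4*h/gamma) .* (h < gamma);
wendland2 = @(h, gamma) (1 - h/gamma).^6 ...
    .* (1 + 6*h/gamma + 35*h.^2/(3*gamma^2)) .* (h < gamma);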
Covariance tapering does a good job of capturing small scale spatial dependence,
but may not be effective in accounting for large scale dependence. The middle panel of
Fig. 1 shows that the quality of the approximation at long range can be rather poor. Because
of the way tapering works, a tapered covariance function with a relatively small
tapering range fails to provide a good approximation to the original covariance function
at long range, and hence may lead to serious bias in parameter estimation and inaccuracy
in prediction. Using a larger tapering range may improve the quality of the approximation,
but sacrifices the computational advantage of tapering. To adjust for the bias in parameter
estimation, Kaufman et al. (2008) proposed an alternative tapering method, referred to as
the two-taper approximation, which approximates the log likelihood function by tapering both
the model and sample covariance matrices. Although the two-taper estimates do not have
large bias, Kaufman et al. (2008) indicated that they are not suitable for plugging into
the kriging procedure in prediction applications.
3. The full-scale approximation
Our new approach combines the ideas of the reduced rank process approximation and the
sparse covariance approximation, and has the advantages of both approaches while overcoming
their individual shortcomings. We first decompose the spatial Gaussian process into two
parts: a reduced rank process to characterize the large scale dependence and a residual
process to capture the small scale spatial dependence that is unexplained by the reduced
rank process. We then obtain sparse covariance approximation of the residual process using
covariance tapering. Since the residual process mainly captures the small scale dependence
and the tapering has little impact on such dependence other than introducing sparsity, the
error of the new approximation is expected to be small. We refer to our new approach as the
full-scale approximation because it provides high quality approximations
at both the small and large spatial scales.
Specifically, for the spatial process w(s) in (1), consider the decomposition
\[
w(s) = w_l(s) + w_s(s), \tag{13}
\]
where wl (s) is a reduced rank approximation of w(s) and ws (s) = w(s) − wl (s) is the
residual of the approximation. We specialize wl (s) to be the predictive process introduced
in Section 2.2. The subscripts of wl (s) and ws (s) indicate respectively that they primarily
capture the long range and short range dependence. For a fixed set S ∗ of m knots and the
corresponding vector $w^*$ of process realizations, the predictive process can be expressed as
\[
w_l(s) = C^T(s, S^*)\, C^{*-1} w^*, \tag{14}
\]
whose finite-rank covariance function is
\[
C_l(s, s') = \mathrm{Cov}\{w_l(s), w_l(s')\} = C^T(s, S^*)\, C^{*-1} C(s', S^*). \tag{15}
\]
The predictive process is less variable than the original process in the sense that the marginal
variance of the predictive process at a fixed location is equal to (when the location coincides
with a knot) or smaller than that of the original process (Finley et al., 2009). It can capture
reasonably well the long range but not the short range dependence.
The reason that the predictive process approach cannot capture the short range dependence
is that it discards entirely the residual process ws(s). Our novelty here is a
more careful treatment of the covariance function that both preserves most of the information
present in the residual process and achieves computational efficiency. The covariance
function of the residual process is
\[
\mathrm{Cov}\{w_s(s), w_s(s')\} = C(s, s') - C_l(s, s').
\]
We propose a sparse approximation of this function using tapering. The tapering
function, denoted $K_{\mathrm{taper}}(s, s'; \gamma)$, is chosen to be a compactly supported correlation
function that is identically zero whenever $|s - s'| \geq \gamma$ for a positive taper range γ (Gneiting,
2002). We approximate the covariance function of the residual process ws(s) by the following
tapered function
\[
C_s(s, s') = \{C(s, s') - C_l(s, s')\}\; K_{\mathrm{taper}}(s, s'; \gamma), \tag{16}
\]
which is a valid covariance function with compact support. By assigning the taper range
parameter γ a small value, one obtains sparse covariance matrices, which can be manipulated
using efficient sparse matrix algorithms.
Putting things together, our approximation of the original covariance function has the
form
\[
C^{\dagger}(s, s') = C_l(s, s') + C_s(s, s'), \tag{17}
\]
where Cl (s, s′ ) and Cs (s, s′ ) are given in (15) and (16) respectively. Our formulation includes
existing approaches as special cases. If the tapered part is void, we get the predictive process
of Banerjee et al. (2008); if the set S ∗ of locations is empty, then the predictive process part
is void and we get the tapered version of the original covariance function as used in Furrer
et al. (2006) and Kaufman et al. (2008).
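A minimal Matlab sketch of this construction for an exponential covariance with the spherical taper follows; the function name fsa_cov is ours (it would be saved as fsa_cov.m), and pdist2 assumes the Statistics Toolbox:

function [Cl, Cs] = fsa_cov(S, Sknot, sigma2, phi, gamma)
% Full-scale approximation C-dagger = Cl + Cs of the exponential covariance
% C(h) = sigma2*exp(-h/phi), using the spherical taper with range gamma.
Dnn = pdist2(S, S);             % n x n distances among data locations
Dnm = pdist2(S, Sknot);         % n x m distances from data locations to knots
Dmm = pdist2(Sknot, Sknot);     % m x m distances among knots
Cnn = sigma2 * exp(-Dnn/phi);
Cnm = sigma2 * exp(-Dnm/phi);
Cmm = sigma2 * exp(-Dmm/phi);
Cl = Cnm * (Cmm \ Cnm');        % predictive process part, as in (15)
T = (1 - Dnn/gamma).^2 .* (1 + Dnn/(2*gamma)) .* (Dnn < gamma); % taper (10)
Cs = sparse((Cnn - Cl) .* T);   % tapered residual part, as in (16)
end

In a genuinely large-n implementation one would not form the dense n × n matrices Cnn and Cl at all; only the sparse pattern of the taper and the n × m factor Cnm are needed, as in Section 4.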
The approximated covariance function given in (17) is indeed a valid covariance function
provided both C(·, ·) and Ktaper(·, ·) are valid covariance functions. To give a more precise
statement, we introduce some notation. Let k(·, ·) be a function on D × D. Given n points
s1, . . . , sn in D, the n × n matrix K with elements Kij = k(si, sj) is called the Gram matrix
of k with respect to s1, . . . , sn; the Gram matrix is called positive semi-definite if $c^T K c \geq 0$
for all vectors $c \in \mathbb{R}^n$, and positive definite if, in addition, $c^T K c = 0$ only when c is the zero
vector. The function k(·, ·) is positive (semi-)definite if the corresponding Gram matrix is
positive (semi-)definite for all possible choices of s1, . . . , sn.
Proposition 1.
(i) If Ktaper (·, ·) is positive semi-definite, then C † (·, ·) is positive semi-definite if and
only if C(·, ·) is positive semi-definite.
(ii) If Ktaper (·, ·) is positive definite, then C † (·, ·) is positive definite if and only if C(·, ·)
is positive definite.
The proof is given in the Appendix.
A positive semi-definite function of two arguments is a valid covariance function. Part
(i) of Proposition 1 suggests that the proposed full-scale approximation provides a valid
covariance function. Part (ii) of the proposition indicates that the full-scale approximation
is not of reduced rank, in contrast to the predictive process approximation, which is. If C(·, ·)
is positive definite, then the Gram matrix produced by the full-scale approximation is always
of full rank provided that the taper function is positive definite, whereas the Gram matrix
produced by the predictive process approximation is singular whenever the number of locations
exceeds the number of knots.
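A quick numerical illustration of this contrast, reusing the hypothetical fsa_cov sketch above with the settings of Fig. 1:

% With n > m, the predictive process Gram matrix is singular, while the
% full-scale Gram matrix keeps full rank (Proposition 1 (ii)).
S = 100 * rand(500, 2);                      % n = 500 random sites in [0,100]^2
[gx, gy] = meshgrid(linspace(0, 100, 10));   % m = 100 grid knots
[Cl, Cs] = fsa_cov(S, [gx(:), gy(:)], 1, 50/3, 20);
disp(rank(Cl))                               % at most m = 100
disp(rank(full(Cl + Cs)))                    % n = 500, i.e., full rank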
To compare the approximation properties of the new approach with those of the
reduced rank and covariance tapering approaches, we employ the Matérn family of
stationary correlation functions (Stein, 1999):
\[
\rho(s, s'; \nu, \phi) = \frac{1}{2^{\nu-1}\Gamma(\nu)} \left( \frac{(2\nu)^{1/2}\, \|s - s'\|}{\phi} \right)^{\nu} J_\nu\!\left( \frac{(2\nu)^{1/2}\, \|s - s'\|}{\phi} \right), \qquad \nu > 0,\; \phi > 0, \tag{18}
\]
where Γ(·) is the gamma function, Jν is the modified Bessel function of the second kind with
order ν, and k · k denotes the Euclidean distance. The parameter ν is called the smoothness
parameter, controlling the degree of smoothness (i.e., the degree of differentiability of the
sample paths) of the random field, and φ is a spatial range parameter. The Matérn family is
perhaps the most widely used covariance family in spatial statistics because it flexibly
encompasses several classes of valid covariance functions, including the exponential (ν = .5)
and the Gaussian (ν → ∞). Furrer et al. (2006) gave some suggestions for choosing different
tapering functions for the members of the Matérn family according to their smoothness.
A special case of the Matérn family was used to generate Fig. 1, where we illustrated
the approximation properties of the reduced rank and tapering approximations. The true
covariance function presented in Fig. 1 corresponds to model (1) with β = 0, σ² = 1, τ² =
0, and the correlation function of the spatial random effects belonging to the Matérn family
with ν = .5 and φ = 50/3. We considered the square domain [0, 100] × [0, 100] in R² and
depicted the approximated covariance functions by evaluating 3,000 of the 125,250 possible
pairwise covariance values generated from 500 random locations in the square. We
used a 10×10 equally spaced grid as the set of knots for the predictive process approximation
and chose the spherical taper with γ = 20 for the covariance tapering. The same set of
knots and choice of taper range were used for the full-scale approximation. The bottom
panel of Fig. 1 shows the result of using the full-scale approximation. It is clear that the
full-scale approximation stays close to the original covariance function at both short and
long distances, and does not exhibit the serious biases that appear in the reduced rank and
covariance tapering approximations shown in the other two panels of Fig. 1.
Next, we compare the approximation properties of the different covariance approximation
approaches under various choices of the smoothness and spatial range parameters of the
Matérn family. To make our presentation concise, we assess covariance approximation quality by
means of the Kullback-Leibler (K-L) divergence between distributions and the Frobenius
distance between covariance matrices. Specifically, consider two multivariate normal distributions
$L_1 = \mathrm{MVN}(\mu_1, \Sigma_1)$ and $L_2 = \mathrm{MVN}(\mu_2, \Sigma_2)$ on $\mathbb{R}^n$. The K-L divergence from
L1 to L2 is defined as
\[
\mathrm{KL}(L_1, L_2) = \frac{1}{2}\left\{ \log\det(\Sigma_1^{-1}\Sigma_2) + \mathrm{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_2 - \mu_1)^T \Sigma_2^{-1} (\mu_2 - \mu_1) - n \right\}.
\]
The Frobenius distance between the covariance matrices Σ1 and Σ2 is defined as
\[
F(\Sigma_1, \Sigma_2) = \Big\{ \sum_{i,j} (\Sigma_{1,ij} - \Sigma_{2,ij})^2 \Big\}^{1/2}.
\]
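Both measures are straightforward to compute; a minimal sketch for zero-mean Gaussians follows (as used below, where µ1 = µ2 = 0; for large n the log-determinants should instead be computed from Cholesky factors to avoid overflow; the helper names are ours):

% K-L divergence from MVN(0, S1) to MVN(0, S2), and the Frobenius measures
kl_div = @(S1, S2) 0.5 * (log(det(S2)) - log(det(S1)) ...
                          + trace(S2 \ S1) - size(S1, 1));
frob = @(S1, S2) sqrt(sum((S1(:) - S2(:)).^2));
rel_frob = @(Sapprox, Strue) frob(Sapprox, Strue) / norm(Strue, 'fro');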
We considered Model (1) with β = 0, σ 2 = 1, τ 2 = 0.01, and the Matérn correlation
function for the spatial random effects for a set of varying spatial range parameters and
smoothness parameters. We adopted the same settings of the knot intensity, the taper
range, and the sampling locations as in Fig. 1. Fig. 2 shows the K-L divergence from the
approximated distribution to the true distribution under each approach, and the relative
Frobenius distance between the approximated and the true covariance matrices, defined as
F (Σapprox , Σtrue )/F (Σtrue ). It is not surprising to see that, when the smoothness parameter
is fixed and the range parameter increases, the K-L divergence and the relative Frobenius
distance increase under the covariance tapering approximation, while the same measures
decrease under the predictive process approximation. Given the same spatial range
parameter and smoothness parameter, it is evident that the full-scale approximation produces
substantially smaller K-L divergence and relative Frobenius distance than the other two
existing approaches, indicating that the new approximation is more accurate.
Fig. 2. The K-L divergence and the relative Frobenius distance from the approximated covariance
matrix to the true covariance matrix for the Matérn family with different smoothness and spatial
range parameters. The relative Frobenius distance is defined as F(Σapprox, Σtrue)/F(Σtrue).
We adopt the same settings of knot intensity, taper range, and sampling locations as in
Fig. 1. The spherical function in (10) is used for ν = .5, the Wendland function in (11) for
ν = 1, and the Wendland function in (12) for ν = 2.
Next, we examine how the approximation quality changes as we vary the knot intensity
and taper range for the full-scale approximation. We fix the smoothness parameter at
ν = .5 and the range parameter at φ = 10. Fig. 3 displays the K-L divergence and the
relative Frobenius distance for varying knot intensities and taper ranges. We considered four
knot intensities, m = 64, 100, 225, and 400, and a dense grid of taper ranges.
We observed that higher knot intensity and larger taper range are associated with better
approximation quality. More precisely, when the knot intensity is fixed and the taper range
increases, both the K-L divergence and the relative Frobenius distance decrease; the same
holds when the taper range is fixed and the knot intensity increases. Various combinations of
knot intensity and taper range can therefore achieve similar approximation quality. For
example, for the K-L divergence to be 50, one may choose a high knot intensity (m = 400)
combined with a small taper range (γ ≈ 2), a medium knot intensity (m = 225) with a medium
taper range (γ ≈ 4), or a low knot intensity (m = 64) with a large taper range (γ ≈ 15).
Fig. 3. The K-L divergence and the relative Frobenius distance from the approximated covariance
matrix to the true covariance matrix for the Matérn family with four knot intensities (m =
64, 100, 225, 400) and a dense grid of taper ranges. The true correlation function is exponential
with φ = 10.
Remark. It is easy to see that (17) is the covariance function of the process
\[
w^{\dagger}(s) = w_l(s) + w_s(s)\, \xi(s), \tag{19}
\]
where ξ(s) is a zero-mean spatial random field independent of w(s) with $K_{\mathrm{taper}}(s, s'; \gamma)$ as its
covariance function. However, the process w†(s) is not Gaussian and so cannot be used for
likelihood approximation of the spatial random effects w(s) of the original model. Likelihood
approximation using (17) based on the original model will be discussed in detail in the next
section.
4. Model fitting and spatial prediction
We discuss in this section how the covariance approximation proposed in the previous
section can be used to develop efficient computational algorithms for parameter estimation
and spatial prediction. Both maximum likelihood and Bayesian inference of model
parameters are considered (Cressie, 1993; Banerjee et al., 2004).
Let S = {s1, . . . , sn} denote the set of observed locations, and write $C_{n,n} = [C(s_i, s_j)]_{i=1:n, j=1:n}$,
$C_{n,m} = [C(s_i, s_j^*)]_{i=1:n, j=1:m}$, and $C^*_{m,m} = [C(s_i^*, s_j^*)]_{i=1:m, j=1:m}$. The two parts of the
approximated covariance matrix are
\[
C_l = C_{n,m}\, C^{*-1}_{m,m}\, C^T_{n,m} \tag{20}
\]
and
\[
C_s = \{C_{n,n} - C_l\} \circ T(\gamma), \tag{21}
\]
where $T(\gamma) = [K_{\mathrm{taper}}(s_i, s_j; \gamma)]_{i=1:n, j=1:n}$, and the "◦" notation refers to the element-wise
matrix product, also called the Schur or Hadamard product. It follows from (17) that the
covariance matrix of {w(s), s ∈ S} is approximated by C† = Cl + Cs, and the approximate
covariance matrix of Y is C† + τ²I, where C† depends on the parameters θ and σ². The
approximate log likelihood function is
\[
\ell_n(\beta, \theta, \sigma^2, \tau^2) = -\frac{n}{2}\log(2\pi) - \frac{1}{2}\log\det\{C_l(\theta, \sigma^2) + C_s(\theta, \sigma^2) + \tau^2 I_n\} - \frac{1}{2}(Y - X\beta)^T \{C_l(\theta, \sigma^2) + C_s(\theta, \sigma^2) + \tau^2 I_n\}^{-1} (Y - X\beta). \tag{22}
\]
Evaluation of the log likelihood requires calculation of the inverse and determinant of
the n × n matrix
\[
C_l + C_s + \tau^2 I_n = C_{n,m}\, C^{*-1}_{m,m}\, C^T_{n,m} + C_s + \tau^2 I_n.
\]
Applying the Sherman–Woodbury–Morrison formula gives
\[
(C_{n,m} C^{*-1}_{m,m} C^T_{n,m} + C_s + \tau^2 I_n)^{-1} = (C_s + \tau^2 I_n)^{-1} - (C_s + \tau^2 I_n)^{-1} C_{n,m} \{C^*_{m,m} + C^T_{n,m} (C_s + \tau^2 I_n)^{-1} C_{n,m}\}^{-1} C^T_{n,m} (C_s + \tau^2 I_n)^{-1}, \tag{23}
\]
and the companion determinant identity gives
\[
\det(C_{n,m} C^{*-1}_{m,m} C^T_{n,m} + C_s + \tau^2 I_n) = \det\{C^*_{m,m} + C^T_{n,m} (C_s + \tau^2 I_n)^{-1} C_{n,m}\}\; \det(C^*_{m,m})^{-1}\; \det(C_s + \tau^2 I_n). \tag{24}
\]
According to (21), the n × n matrix $C_s + \tau^2 I_n$ is sparse. The right hand sides
of (23) and (24) involve only inversions of, determinants of, and multiplications with sparse
n × n matrices, together with inversions and determinants of m × m matrices. Thus the
computational complexity of the log likelihood calculation is of the order $O(nm^2 + nk^2)$,
where m is the number of knots and k is the average number of nonzero entries per row
in Cs . By using a small number m and a short taper range γ (which results in a small
k), the computational cost in fitting the spatial model can be greatly reduced relative
to the expensive computational cost of using the original covariance function, where the
computational complexity is typically of the order O(n3 ).
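The following Matlab sketch shows one way to organize this computation, assuming the n × m matrix Cnm, the m × m knot matrix Cmm, the sparse matrix Cs, the nugget tau2, and resid = Y - X*beta are already formed (variable names are ours, not the paper's code):

% O(n m^2 + n k^2) evaluation of the approximate log likelihood (22)
n = size(Cnm, 1);
A = Cs + tau2 * speye(n);            % sparse n x n matrix Cs + tau^2 I
R = chol(A, 'lower');                % sparse Cholesky factor, A = R R'
AiC = R' \ (R \ Cnm);                % A^{-1} Cnm via two triangular solves
M = Cmm + Cnm' * AiC;                % m x m capacitance matrix from (23)
Air = R' \ (R \ resid);              % A^{-1} (Y - X*beta)
v = Cnm' * Air;
quad = resid' * Air - v' * (M \ v);  % quadratic form in (22), via (23)
logdet = 2*sum(log(diag(R))) + log(det(M)) - log(det(Cmm));  % via (24)
loglik = -0.5 * (n*log(2*pi) + logdet + quad);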
By likelihood theory under the increasing domain asymptotic framework, the MLE
is asymptotically normal with covariance matrix equal to the inverse of the information
matrix (Mardia and Marshall, 1984). In particular, the standard errors of the MLEs are
estimated from the inverse of the Fisher information matrix or the observed information
matrix. The derivation of the Fisher information is similar to that in Mardia and Marshall
(1984) and is omitted. Our simulation study shows that this information based variance
estimation works well (results not shown to save space).
In the Markov chain Monte Carlo algorithm, the range parameter φ can be updated with a
random walk Metropolis step whose normal proposal is centred at the current log value of φ.
Log-normal proposals can also be used for the variance parameters σ² and τ².
Following the MCMC sampling, posterior inferences
such as posterior means and credible intervals are then made by computing summaries of
the posterior samples.
Efficient computation can be achieved by using (23) and (24) for likelihood evaluation,
Fisher information matrix calculation, and sampling from the posterior distribution.
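A minimal sketch of such a Metropolis update for φ, with hypothetical helpers loglik (evaluating (22) at the current values of the other parameters) and logprior:

% Random walk Metropolis on log(phi); the log(prop) and log(phi) terms
% are the Jacobian adjustment for proposing on the log scale.
prop = exp(log(phi) + 0.1 * randn);          % log-normal proposal
logratio = loglik(prop) + logprior(prop) + log(prop) ...
         - loglik(phi)  - logprior(phi)  - log(phi);
if log(rand) < logratio
    phi = prop;                              % accept; otherwise keep phi
end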
The approximate kriging predictor of Y(s0) at a new site s0 is
\[
\hat{Y}(s_0) = x^T(s_0)\,\hat{\beta} + h^T(s_0)\,(C_l + C_s + \tau^2 I_n)^{-1}(Y - X\hat{\beta}), \tag{26}
\]
where $h(s_0) = [C_l(s_0, s_i) + C_s(s_0, s_i)]_{i=1}^n$, and the associated approximate mean squared
prediction error is
\[
E\{Y(s_0) - \hat{Y}(s_0)\}^2 \approx \sigma^2 + \tau^2 - h^T(s_0)\,(C_l + C_s + \tau^2 I_n)^{-1}\, h(s_0). \tag{27}
\]
In practice, one needs to substitute estimates of the unknown model parameters in the
above expressions.
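Continuing the likelihood sketch above, the predictor (26) and prediction error (27) at a p × 2 matrix S0 of new sites can be computed without forming any dense n × n inverse; X0, beta, and the remaining quantities are as in the earlier sketches (again illustrative, for the exponential/spherical case):

% Approximate BLUP (26) and MSPE (27) at new sites S0, via (23)
D0 = pdist2(S0, S);                           % p x n distances to data sites
C0n = sigma2 * exp(-D0/phi);
Cl0 = (sigma2 * exp(-pdist2(S0, Sknot)/phi)) * (Cmm \ Cnm');
T0 = (1 - D0/gamma).^2 .* (1 + D0/(2*gamma)) .* (D0 < gamma);
h0 = Cl0 + (C0n - Cl0) .* T0;                 % rows are h(s0)' for each new site
Kir = Air - AiC * (M \ (Cnm' * Air));         % (Cl+Cs+tau2 I)^{-1}(Y - X*beta)
Yhat = X0 * beta + h0 * Kir;                  % predictor (26)
Ah = R' \ (R \ h0');                          % (Cs+tau2 I)^{-1} h(s0)
Kih = Ah - AiC * (M \ (Cnm' * Ah));           % (Cl+Cs+tau2 I)^{-1} h(s0)
mspe = sigma2 + tau2 - sum(h0' .* Kih, 1)';   % MSPE (27)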
Bayesian inference seeks the predictive distribution p[Y(s0) | Y] at a new
site s0 . Generically denoting the set of all model parameters by Ω, we first obtain a set of
posterior samples {Ω(l) , l = 1, . . . , L} from the posterior distribution p[Ω | Y]. For a given
Ω value, p[Y(s0) | Ω, Y] is a Gaussian distribution with mean and variance given by
\[
E[Y(s_0) \mid \Omega, Y] = x^T(s_0)\,\beta + h^T(s_0)\,(C_l + C_s + \tau^2 I_n)^{-1}(Y - X\beta) \tag{28}
\]
and
\[
\mathrm{Var}[Y(s_0) \mid \Omega, Y] = \sigma^2 - h^T(s_0)\,(C_l + C_s + \tau^2 I_n)^{-1}\, h(s_0) + \tau^2. \tag{29}
\]
The predictive distribution is then sampled by composition, drawing $Y^{(l)}(s_0) \sim p[Y(s_0) \mid \Omega^{(l)}, Y]$
for each $\Omega^{(l)}$, l = 1, . . . , L, where $\Omega^{(l)}$ is the lth sample from the posterior distribution
p[Ω | Y]. Again, the inversion of the matrix $C_l + C_s + \tau^2 I$ appearing in (26)–(29) can be
computed efficiently using (23).
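In code, the composition step is a loop over the stored posterior draws; cond_moments is a hypothetical helper returning the mean (28) and variance (29) at s0 for one draw:

% Composition sampling from the predictive distribution p[Y(s0) | Y]
Ypred = zeros(L, 1);
for l = 1:L
    [mu_l, v_l] = cond_moments(Omega(l));   % conditional mean and variance
    Ypred(l) = mu_l + sqrt(v_l) * randn;    % one Gaussian draw per posterior sample
end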
5. Illustrations
In this section, we use one simulation example and one real data example to illustrate
the full-scale approximation and compare it with the predictive process and the covariance
tapering approaches. The implementation of all methods was written in Matlab and run on a
processor with dual 2.8 GHz Xeon CPUs and 12GB memory. For sparse matrix calculations,
we used the Matlab function sparse. For the likelihood function optimization, we used the
Matlab function fmincon which implements a Broyden-Fletcher-Goldfarb-Shanno (BFGS)
based Quasi-Newton method. The R package spam (Furrer and Sain, 2010) for sparse matrix
calculation is also available at http://cran.r-project.org/web/packages/spam/.
Fig. 4. The simulated data locations (Easting versus Northing on [0, 1000] × [0, 1000]).
\[
C(s, s'; \theta) = \sigma^2\, \frac{1}{2^{\nu-1}\Gamma(\nu)}\; |\Sigma_{D(s)}|^{1/4}\, |\Sigma_{D(s')}|^{1/4} \left| \frac{\Sigma_{D(s)} + \Sigma_{D(s')}}{2} \right|^{-1/2} \left( 2\sqrt{\nu\, Q(s, s')} \right)^{\nu} J_\nu\!\left( 2\sqrt{\nu\, Q(s, s')} \right),
\]
where
\[
Q(s, s') = (s - s')^T \left( \frac{\Sigma_{D(s)} + \Sigma_{D(s')}}{2} \right)^{-1} (s - s')
\]
with D(s) indicating the subregion to which s belongs. Anisotropy is introduced into the
covariance function by letting Σ depend on D(s). We reparameterize
\[
\Sigma_{D(s)} = G(\psi_{D(s)}) \begin{pmatrix} \lambda^2_{D(s),1} & 0 \\ 0 & \lambda^2_{D(s),2} \end{pmatrix} G^T(\psi_{D(s)}),
\]
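A minimal sketch of this reparameterization, assuming G(ψ) denotes the standard 2 × 2 rotation matrix (an assumption on our part):

% Anisotropy matrix for one subregion: rotation times squared scale lengths;
% treating G(psi) as a rotation matrix is our assumption here.
G = @(psi) [cos(psi), -sin(psi); sin(psi), cos(psi)];
SigmaD = @(lam1, lam2, psi) G(psi) * diag([lam1^2, lam2^2]) * G(psi)';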
Predictions were computed by plugging the true parameter values into the BLUP equation.
The prediction accuracy of each method is evaluated by the estimated MSPE based on the
1,000 testing data points at the hold-out locations. For each set of parameter values, we
recorded the MSPE and the prediction run time under the three approaches for various
choices of knot numbers and taper ranges. The
results are shown in Fig. 5. The full-scale approximation clearly outperforms the other two
approximations: It requires substantially less run time to reach the same MSPE. On the
other hand, there is no clear winner between the predictive process and covariance tapering:
the covariance tapering performs better than the predictive process for the first set of
parameter values, but the opposite is true for the second set of parameter values.
Fig. 5. The MSPE versus run time for the simulation example under the full-scale approximation
(diamond), the predictive process (circle), and the covariance tapering (plus). Panel (a) shows the
result using λ11 = 16.69, λ12 = 66.7, λ21 = 5, λ22 = 50, λ31 = 66.7, λ32 = 16.69; panel (b)
shows the result using λ11 = 33.3, λ12 = 100, λ21 = 50, λ22 = 50, λ31 = 100, and λ32 = 33.3, with the
remaining parameters given in the second column of Table 1.
Table 3. MSPE, likelihood evaluation time, and prediction time for the full-scale approximation,
the predictive process, and the covariance tapering.
To achieve a comparable MSPE, either the knot intensity has to be increased for the
predictive process approximation or the taper range has to be expanded for the covariance
tapering approximation, both of which typically yield rapid growth in computational time.
For the covariance tapering approach, a similar MSPE (.2351) is achieved when the taper
range is increased to 100, but the run time for the likelihood evaluation is then more than
10 times that of our approach. For the predictive process, a similar MSPE (.2400) is
obtained when the knot intensity is 1,705, and the run time is nearly double that of our
approach.
6. Discussion
We have proposed a new approximation of covariance functions for modeling and analysis of
very large point-referenced spatial data sets. This “full-scale” approximation can effectively
capture both the large scale and small scale spatial variations. Through simulation studies,
we have shown that the new approximation generally provides more efficient computation
and substantially better performance in both model inference and prediction, compared
with the reduced rank approximation and the covariance tapering approximation. While
each of the two existing approaches has its own failure modes, our new approximation
consistently performs well regardless of the dependence properties of the spatial covariance
functions.
The full-scale approximation has two tuning parameters: the knot intensity m and the
taper range γ. As demonstrated in Section 5, larger m and longer γ offer a better approximation
to the original covariance function but, unfortunately, result in higher computational cost.
In the real data example, we used a subset of the data to decide how to weigh the tradeoff
between inference accuracy and computational cost. A more comprehensive study of the
selection of knots and taper range is left for future research. It is also interesting to extend
the current work to other contexts of spatial statistics.
Acknowledgement
The research of Huiyan Sang and Jianhua Z. Huang was partially sponsored by NSF grant
DMS-1007618. Jianhua Z. Huang's work was also partially supported by NSF grant DMS-
09-07170 and NCI grant CA57030. Both authors were supported by Award Number
KUS-CI-016-04, made by King Abdullah University of Science and Technology (KAUST).
The authors thank the referees and the editors for valuable comments. The authors also
thank Dr. Sudipto Banerjee, Dr. Reinhard Furrer, and Dr. Lan Zhou for several useful
discussions regarding this work, and thank Dr. Cari Kaufman for providing the precipitation
data set.
References
Banerjee, S., B. Carlin, and A. Gelfand (2004). Hierarchical Modeling and Analysis for
Spatial Data. Boca Raton: Chapman & Hall-CRC.
Banerjee, S., A. Finley, P. Waldmann, and T. Ericsson (2010). Hierarchical spatial process
models for multiple traits in large genetic trials. Journal of the American Statistical
Association 105 (490), 506–521.
Banerjee, S., A. Gelfand, A. Finley, and H. Sang (2008). Gaussian predictive process models
for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 70 (4), 825–848.
Chiles, J. and P. Delfiner (1999). Geostatistics: Modeling Spatial Uncertainty. New York:
Wiley.
Cressie, N. (1993). Statistics for Spatial Data, 2nd edn. New York: Wiley.
Cressie, N. and G. Johannesson (2008). Fixed rank kriging for very large spatial data
sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70 (1),
209–226.
Diggle, P., J. Tawn, and R. Moyeed (1998). Model-based geostatistics. Journal of the Royal
Statistical Society: Series C (Applied Statistics) 47 (3), 299–350.
Finley, A., H. Sang, S. Banerjee, and A. Gelfand (2009). Improving the performance of
predictive process modeling for large datasets. Computational Statistics and Data Analysis 53 (8), 2873–2884.
Fuentes, M. (2007). Approximate likelihood for large irregularly spaced spatial data. Journal
of the American Statistical Association 102 (477), 321–331.
Furrer, R., M. Genton, and D. Nychka (2006). Covariance tapering for interpolation of large
spatial datasets. Journal of Computational and Graphical Statistics 15 (3), 502–523.
Furrer, R. and S. Sain (2010). spam: A sparse matrix R package with emphasis on MCMC
methods for Gaussian Markov random fields. Journal of Statistical Software 36 (10), 1–25.
Gelman, A., J. Carlin, H. Stern, and D. Rubin (2004). Bayesian Data Analysis. Boca
Raton: Chapman & Hall.
Johns, C., D. Nychka, T. Kittel, and C. Daly (2003). Infilling sparse records of spatial
fields. Journal of the American Statistical Association 98 (464), 796–806.
Kammann, E. and M. Wand (2003). Geoadditive models. Journal of the Royal Statistical
Society: Series C (Applied Statistics) 52 (1), 1–18.
Kaufman, C., M. Schervish, and D. Nychka (2008). Covariance tapering for likelihood-
based estimation in large spatial data sets. Journal of the American Statistical Associa-
tion 103 (484), 1545–1555.
Kaufman, L. and P. Rousseeuw (1990). Finding Groups in Data. New York: Wiley.
Mardia, K. and R. Marshall (1984). Maximum likelihood estimation of models for residual
covariance in spatial regression. Biometrika 71 (1), 135–146.
Paciorek, C. and M. Schervish (2006). Spatial modelling using a new class of nonstationary
covariance functions. Environmetrics 17 (5), 483–506.
Rue, H. and L. Held (2005). Gaussian Markov Random Fields: Theory and Applications.
Boca Raton: Chapman & Hall-CRC.
Rue, H. and H. Tjelmeland (2002). Fitting Gaussian Markov random fields to Gaussian
fields. Scandinavian Journal of Statistics 29, 31–49.
Schabenberger, O. and C. Gotway (2005). Statistical Methods for Spatial Data Analysis.
Boca Raton: Chapman & Hall.
Spiegelhalter, D., N. Best, B. Carlin, and A. van der Linde (2002). Bayesian measures of
model complexity and fit. Journal of the Royal Statistical Society. Series B (Statistical
Methodology) 64 (4), 583–639.
Stein, M. (1999). Interpolation of Spatial Data: Some Theory for Kriging. New York:
Springer.
Stein, M. (2008). A modeling approach for large spatial datasets. Journal of the Korean
Statistical Society 37 (1), 3–10.
Stein, M., Z. Chi, and L. Welty (2004). Approximating likelihoods for large spatial data sets.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66, 275–296.
Vecchia, A. (1988). Estimation and model identification for continuous spatial processes.
Journal of the Royal Statistical Society. Series B (Methodological) 50, 297–312.
Ver Hoef, J., N. Cressie, and R. Barry (2004). Flexible spatial models for kriging and
cokriging using moving averages and the Fast Fourier Transform (FFT). Journal of
Computational and Graphical Statistics 13 (2), 265–282.
Wendland, H. (1995). Piecewise polynomial, positive definite and compactly supported
radial functions of minimal degree. Advances in Computational Mathematics 4 (1), 389–396.
Wendland, H. (1998). Error estimates for interpolation by compactly supported radial basis
functions of minimal degree. Journal of Approximation Theory 93 (2), 258–272.
Wikle, C. and N. Cressie (1999). A dimension-reduced approach to space-time Kalman
filtering. Biometrika 86 (4), 815–829.
Williams, C. and M. Seeger (2001). Using the Nyström method to speed up kernel machines.
In T. K. Leen, T. G. Dietterich, and V. Tresp (Eds.), Advances in Neural Information
Processing Systems 13, pp. 682–688. MIT Press.
Appendix: Proof of Proposition 1
We need only show that the Gram matrix of C†(·, ·) with respect to a set of spatial locations
is positive semi-definite or positive definite if the corresponding Gram matrix of C(·, ·) has
the same property. We can assume that this set of locations contains all m knots used to
define the finite-rank covariance function Cl(·, ·) given in (15) because, if the Gram matrix
with respect to a set of locations is positive semi-definite or positive definite, then the Gram
matrix with respect to any subset has the same property.
Partition the Gram matrix of C(·, ·) into 2 × 2 blocks
\[
\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix},
\]
where the block C11 is the covariance matrix of $(w(s_1), \ldots, w(s_m))^T$ and $s_1, \ldots, s_m$ are
the m knots. The Gram matrix of the full-scale approximated covariance function (17) can
be written as
\[
\begin{pmatrix} C_{11} \\ C_{21} \end{pmatrix} C_{11}^{-1} \begin{pmatrix} C_{11} & C_{12} \end{pmatrix} + \left\{ \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} - \begin{pmatrix} C_{11} \\ C_{21} \end{pmatrix} C_{11}^{-1} \begin{pmatrix} C_{11} & C_{12} \end{pmatrix} \right\} \circ \begin{pmatrix} T_{11} & T_{12} \\ T_{21} & T_{22} \end{pmatrix} = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & \widetilde{C}_{22} \end{pmatrix},
\]
where $\widetilde{C}_{22} = C_{21} C_{11}^{-1} C_{12} + \{C_{22} - C_{21} C_{11}^{-1} C_{12}\} \circ T_{22}$, and the $T_{ij}$'s correspond to the partition
of the Gram matrix of the tapering function.
Standard matrix calculation yields
\[
\begin{pmatrix} I & 0 \\ -C_{21} C_{11}^{-1} & I \end{pmatrix} \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} \begin{pmatrix} I & -C_{11}^{-1} C_{12} \\ 0 & I \end{pmatrix} = \begin{pmatrix} C_{11} & 0 \\ 0 & C_{22} - C_{21} C_{11}^{-1} C_{12} \end{pmatrix}
\]
and
\[
\begin{pmatrix} I & 0 \\ -C_{21} C_{11}^{-1} & I \end{pmatrix} \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & \widetilde{C}_{22} \end{pmatrix} \begin{pmatrix} I & -C_{11}^{-1} C_{12} \\ 0 & I \end{pmatrix} = \begin{pmatrix} C_{11} & 0 \\ 0 & \{C_{22} - C_{21} C_{11}^{-1} C_{12}\} \circ T_{22} \end{pmatrix}.
\]