A Regression-Kriging Model For Estimation of Rainf
Article in Proceedings of SPIE - The International Society for Optical Engineering · October 2009
DOI: 10.1117/12.838036
3 authors, including:
Hong Wang
Hohai University
ABSTRACT
This paper presents a multivariate geostatistical algorithm called regression-kriging (RK) for predicting the spatial
distribution of rainfall by incorporating five topographic/geographic factors of latitude, longitude, altitude, slope and
aspect. The technique is illustrated using rainfall data collected at 52 rain gauges from the Laohahe basin in northeast
China during 1986-2005. Rainfall data from 44 stations were selected for modeling and the remaining 8 stations were
used for model validation. To eliminate multicollinearity, the five explanatory factors were first transformed using factor
analysis with three Principal Components (PCs) extracted. The rainfall data were then fitted using step-wise regression
and the residuals were interpolated using simple kriging (SK). The regression coefficients were estimated by generalized least squares (GLS),
which takes the spatial heteroskedasticity between rainfall and PCs into account. Finally, the rainfall prediction based on
RK was compared with that predicted from ordinary kriging (OK) and ordinary least squares (OLS) multiple regression
(MR). Because correlated topographic factors are taken into account, RK improves the efficiency of predictions. RK
achieved a lower relative root mean square error (RMSE) (44.67%) than MR (49.23%) and OK (73.60%) and a lower
bias than MR and OK (23.82 versus 30.89 and 32.15 mm) for annual rainfall. It is much more effective for the wet
season than for the dry season. RK is suitable for estimation of rainfall in areas where there are no stations nearby and
where topography has a major influence on rainfall.
Keywords: regression-kriging, factor analysis, GLS, rainfall
1. INTRODUCTION
Effective watershed management strategies depend on accurate model results. Rainfall is the most important input for
watershed modeling, including hydrology modeling. Rainfall characteristics are usually spatially varying, even in a small
watershed; so accurate description of the spatial variation of rainfall is quite important for predicting water movement in
a watershed. The accurate estimation of the spatial distribution of rainfall requires a very dense network of instruments,
which entails large installation and operational costs. Therefore, it is essential to estimate point rainfall at unrecorded
locations from values at surrounding sites.
A number of methods have been proposed for the interpolation of rainfall [1], including global interpolation methods (trend
surface and multiple regression), local interpolation methods (Thiessen polygons, inverse distance weighting, kriging and
splines), and mixed methods (combined global and local methods). Rainfall generally increases with elevation [2-3], and so
many authors have incorporated elevation into its prediction using geostatistical approaches [1, 4-5]. Others have developed
relationships between rainfall and various topographic variables such as altitude, latitude, continentality, slope,
orientation or exposure, using regression approaches [6-11]. In recent years, there has been an increasing interest in hybrid
interpolation techniques in which two conceptually different approaches are combined to model and map rainfall: (1)
interpolation relying solely on point observations of rainfall; and (2) interpolation based on regression of rainfall on
spatially exhaustive auxiliary information (such as topographic variables). Several studies have been conducted in a
range of locations and environments around the world (for example, in various European countries, the USA, Australia
International Symposium on Spatial Analysis, Spatial-Temporal Data Modeling, and Data Mining, edited by Yaolin Liu,
Xinming Tang, Proc. of SPIE Vol. 7492, 74924I · © 2009 SPIE · CCC code: 0277-786X/09/$18 · doi: 10.1117/12.838036
2. CASE STUDY
The Laohahe basin lies in southeastern Inner Mongolia Autonomous Region, China, with an area of approximately
1.86×10⁴ km². The 52 rain gauges (Fig. 1) with daily readings were randomly divided into interpolation (44) and
validation (8) sets in this study. The monthly and annual rainfall data have been averaged over the period of January
1986-December 2005. Rainfall is strongly seasonal (Fig. 2), and so in the analysis mean monthly rainfall was calculated
for the wet season (June-September), the dry season (the rest of the year), as well as the whole year. The basic rainfall
statistics (mean, standard deviation, minimum, and maximum) are presented in Table 1.
Another source of information is an SRTM DEM of 90 m resolution downloaded from the International Agriculture
Research Consortium for Spatial Information server (http://srtm.csi.cgiar.org). Elevation (ELE), slope (SLP) and aspect
(ASP) were derived from the SRTM DEM, which had been resampled to a cell size of 200 m in ArcGIS 9.1. Elevation
ranges from 433 to 2047 m. It has an apparent W-E gradient, with the highest elevations in the northwestern part of the study area, as shown
in Fig. 3. The correlations between rainfall and both latitude and slope are significant at the 0.01 level in all periods
(Table 1), whereas the correlations with longitude and aspect are not significant. Apart from the dry season, the correlation between
rainfall and elevation is also significant.
3. METHODS
3.1 Transformation of predictors
Factor analysis was used to produce standardized Principal Components (PCs) prior to regression in order to eliminate
multicollinearity. Unlike the original predictors, these standardized PCs are mutually uncorrelated and therefore share no
collinearity. The five predictors (ELE, SLP, ASP, X and Y) at the 44 interpolation stations were transformed, and the
first three PCs were used for stepwise regression.
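The transformation can be illustrated with a minimal sketch. It assumes a plain principal component analysis of the predictor correlation matrix as a stand-in for the factor analysis routine used in the study; the function name `standardized_pcs` and the synthetic 44 x 5 predictor table are hypothetical and are not data from the paper.

```python
import numpy as np

def standardized_pcs(X, n_components=3):
    """Return standardized principal-component scores of the columns of X."""
    X = np.asarray(X, dtype=float)
    # Standardize each predictor to zero mean and unit (sample) variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Eigen-decomposition of the correlation matrix of the predictors.
    corr = np.corrcoef(Z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    # Keep the components with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Project onto the eigenvectors and rescale each score to unit variance.
    scores = Z @ eigvecs / np.sqrt(eigvals)
    return scores, eigvals

rng = np.random.default_rng(0)
# Hypothetical stand-in for the predictor table (ELE, SLP, ASP, X, Y):
# two pairs of deliberately correlated columns plus one independent column.
base = rng.normal(size=(44, 2))
X = np.column_stack([
    base[:, 0], base[:, 0] + 0.3 * rng.normal(size=44),
    base[:, 1], base[:, 1] + 0.3 * rng.normal(size=44),
    rng.normal(size=44),
])
pcs, eigvals = standardized_pcs(X, n_components=3)
pc_cov = np.cov(pcs, rowvar=False)  # close to the identity matrix
```

Because the scores have an identity covariance matrix, regressing rainfall on them avoids the unstable coefficients that correlated predictors such as elevation and slope would otherwise produce.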
where $\hat{m}(s_0)$ is the fitted drift, $\hat{e}(s_0)$ is the interpolated residual, $\hat{\beta}_k$ is the estimated drift model coefficient, $\lambda_i$ is the
kriging weight determined by the spatial dependence structure of the residual, and $e(s_i)$ is the residual at location $s_i$.
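For orientation, the symbols defined above match the regression-kriging predictor as it is commonly written in the geostatistical literature. The form below is that general form, given here as an assumption rather than as a quotation from this paper; $p$ (the number of predictors), $n$ (the number of interpolation samples) and the convention $q_0(s_0) = 1$ for the intercept are notation introduced for this illustration.

```latex
\hat{z}(s_0) = \hat{m}(s_0) + \hat{e}(s_0)
             = \sum_{k=0}^{p} \hat{\beta}_k \, q_k(s_0) + \sum_{i=1}^{n} \lambda_i \, e(s_i)
```

The first sum is the regression (drift) part evaluated at the prediction location, and the second is the kriged residual.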
The regression coefficient $\hat{\beta}_k$ is estimated from the samples by the GLS method in order to take heteroskedasticity or
spatial correlation between individual observations into account. The GLS coefficient $\hat{\beta}_k$ is computed using the EViews
package [21] in the following steps:
(1) Determine a linear model of the variable as predicted by the auxiliary map q (here it is two PCs). Then the most
suitable predictors are selected and OLS regression coefficients are obtained;
(2) Derive OLS residuals at interpolation sample locations as
$e^{*}(s_i) = z(s_i) - [b_0 + b_1 \cdot q(s_i)]$ (2)
(3) Test the normality, heteroskedasticity and spatial autocorrelation of the OLS residuals using the
Jarque-Bera, Goldfeld-Quandt and Lagrange Multiplier tests [22], respectively, in the EViews package;
(4) Calculate the weight for weighted least squares (WLS). Because the residuals are heteroskedastic, OLS is
not BLUE (Best Linear Unbiased Estimator). The original predictors should therefore be transformed so that the
error variance becomes constant, and OLS is then applied to the transformed predictors. This is called GLS.
In general, GLS is OLS applied to transformed variables that satisfy the standard assumptions of OLS. In order to transform
the original predictors, the relationship between residual variance and predictors should be explored first. The best
regression model was determined using curve estimation in the SPSS package. If the model is linear regression, then the
weight = $q^{-2}$; if the model is quadratic regression, then the weight = $q^{-1}$. Both sides of the regression equation are
multiplied by this weight to make the estimated coefficients the GLS coefficients;
(5) Derive the GLS residuals and test their normality, heteroskedasticity and spatial autocorrelation
using the same techniques as in step (3);
(6) Interpolate the GLS residuals using simple kriging (SK) with a known expected mean of residuals (by definition 0) and transform the
interpolated map into a raster layer of 200 m resolution in ArcGIS; and
(7) Overlay the GLS surface with the interpolated GLS residuals at each prediction point in ArcGIS. This processing
comprises three steps: (i) the GLS coefficients were used to compute the rainfall in the different periods at each prediction point
on the 200 m grid; (ii) interpolated rainfall maps were produced using the inverse distance weighting (IDW) method and
resampled to 200 m resolution; and (iii) the GLS surface was added to the interpolated GLS residuals raster layer.
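Steps (1), (2), (4), (6) and (7) can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the EViews/SPSS/ArcGIS workflow of the study: the gauge coordinates, the single positive predictor `q`, the weight `1/q`, and the exponential covariance with its `sill` and `corr_len` parameters are all hypothetical choices made only so that the example runs.

```python
import numpy as np

def ols(Q, z):
    """Ordinary least squares coefficients for design matrix Q."""
    return np.linalg.lstsq(Q, z, rcond=None)[0]

def wls(Q, z, w):
    """Multiply both sides of z = Q b + e by the weight w, then apply OLS."""
    return ols(Q * w[:, None], z * w)

def exp_cov(h, sill, corr_len):
    """Exponential covariance model (an assumption, not the paper's fitted model)."""
    return sill * np.exp(-h / corr_len)

def simple_krige(xy, e, xy0, sill, corr_len):
    """Simple kriging of residuals e (known mean 0) from sites xy to targets xy0."""
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=2)
    d0 = np.linalg.norm(xy[:, None, :] - xy0[None, :, :], axis=2)
    # Kriging weights solve C * lambda = c0 for every target location.
    lam = np.linalg.solve(exp_cov(d, sill, corr_len), exp_cov(d0, sill, corr_len))
    return lam.T @ e

rng = np.random.default_rng(1)
n = 44
xy = rng.uniform(0.0, 100.0, size=(n, 2))         # hypothetical gauge coordinates
q = rng.uniform(1.0, 3.0, size=n)                 # hypothetical positive predictor
Q = np.column_stack([np.ones(n), q])
z = 400.0 + 30.0 * q + rng.normal(scale=5.0 * q)  # noise spread grows with q

b_ols = ols(Q, z)          # step (1): OLS coefficients
e_ols = z - Q @ b_ols      # step (2): OLS residuals, as in Eq. (2)
w = 1.0 / q                # step (4): hypothetical weight for this synthetic noise
b_gls = wls(Q, z, w)       # weighted fit standing in for the GLS coefficients
e_gls = z - Q @ b_gls

# Steps (6)-(7): krige the residuals to new points and add the drift.
xy0 = rng.uniform(0.0, 100.0, size=(8, 2))
q0 = rng.uniform(1.0, 3.0, size=8)
drift0 = np.column_stack([np.ones(8), q0]) @ b_gls
z_rk = drift0 + simple_krige(xy, e_gls, xy0, sill=e_gls.var(), corr_len=20.0)
```

With no nugget in the covariance model, simple kriging reproduces the residuals exactly at the gauges, so the combined prediction returns the observed rainfall there. That property is a convenient sanity check for any implementation.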
3.4 Evaluation
The performance of RK is assessed and compared with OK and OLS MR using the validation dataset of eight stations.
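The true prediction accuracy is evaluated by comparing the estimated values $\hat{z}(s_j)$ with the observed ones at the
validation points $z^{*}(s_j)$. Systematic error is assessed by the mean absolute error (MAE) and accuracy by the root mean
square error (RMSE):
$\mathrm{MAE} = \frac{1}{l} \sum_{j=1}^{l} \left| \hat{z}(s_j) - z^{*}(s_j) \right|$ (3)
$\mathrm{RMSE} = \sqrt{\frac{1}{l} \sum_{j=1}^{l} \left[ \hat{z}(s_j) - z^{*}(s_j) \right]^{2}}$ (4)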
where $l$ is the number of validation points. In order to compare accuracy of prediction between variables of different
type, the RMSE can be normalized by the total standard deviation ($s_z$) of observed samples:
$\mathrm{RMSE}_r = \frac{\mathrm{RMSE}}{s_z}$ (5)
As a rule of thumb, an RMSEr value close to 40% is considered a fairly satisfactory accuracy of prediction. By
comparison, a value >70% means that the model accounts for less than 50% of variability at the validation points and the
prediction is deemed unsatisfactory.
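The three validation measures are simple to compute. The sketch below uses four made-up observed and predicted values purely for illustration; only the standard deviation of 60.5 mm is taken from the annual row of Table 1. The 70% threshold can also be checked by hand: an RMSEr of 0.70 leaves roughly 1 − 0.70² ≈ 0.51 of the variance explained, which is where the "about 50%" figure comes from.

```python
import numpy as np

def mae(z_hat, z_obs):
    """Mean absolute error between predictions and observations."""
    return float(np.mean(np.abs(np.asarray(z_hat) - np.asarray(z_obs))))

def rmse(z_hat, z_obs):
    """Root mean square error between predictions and observations."""
    diff = np.asarray(z_hat, dtype=float) - np.asarray(z_obs, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def rmse_r(z_hat, z_obs, s_z):
    """RMSE normalized by the standard deviation s_z of the observed samples."""
    return rmse(z_hat, z_obs) / s_z

# Hypothetical validation values (mm); not results from the paper.
z_obs = np.array([400.0, 450.0, 500.0, 380.0])
z_hat = np.array([410.0, 440.0, 520.0, 380.0])

err_mae = mae(z_hat, z_obs)                # 10.0
err_rmse = rmse(z_hat, z_obs)              # sqrt(150), about 12.25
err_rel = rmse_r(z_hat, z_obs, s_z=60.5)   # about 0.20, i.e. 20%
```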
5. CONCLUSIONS
RK is a powerful prediction technique that can be used to interpolate sampled environmental variables from large point
datasets with the aid of auxiliary predictors (such as elevation, slope, aspect, longitude and latitude). Factor analysis on the predictors is efficient at
removing multicollinearity and reducing asymmetry in their distribution. GLS coefficient estimation ensures normality,
randomness and homoskedasticity of the prediction residuals. RK achieves a higher prediction accuracy than OK and
MR (relative RMSE of 44.67% versus 73.60% and 62.27%) and produces a smaller bias (23.82 versus 32.15 and 30.89 mm) for annual
rainfall. It is much more effective for the wet season than for the dry season. RK enables the estimation of rainfall in
areas where there are no stations in the vicinity and where topography exerts a major influence on rainfall. Further
research should investigate whether other environmental descriptors, such as the Tropical Rainfall Measuring Mission
(TRMM) [24-25], the Normalized Difference Vegetation Index (NDVI) [26] or local prevailing winds (especially their direction
and force), are able to account for a larger proportion of the spatial variability displayed by rainfall.
ACKNOWLEDGEMENTS
This research was supported financially by the Natural Science Foundation of China (Grant No. 40871230) and the
National Key Basic Research Program of China (Grant No. 2006CB400502).
REFERENCES
[1] Goovaerts, P., “Geostatistical approaches for incorporating elevation into the spatial interpolation of rainfall”, J.
Hydrol., 228, 113-129 (2000).
[2] Spreen, W. C., “A determination of the effect of topography upon precipitation”, Trans. Am. Geophys. Union, 28,
285-290 (1947).
[3] Smith, R. B., [The influence of mountains on the atmosphere], Adv. Geophys., 21, Academic Press, 87-230 (1979).
[4] Martínez-Cob, A., “Multivariate geostatistical analysis of evapotranspiration and precipitation in mountainous
terrain”, J. Hydrol., 174, 19-35 (1996).
[5] Prudhomme, C. and Reed, D. W., “Mapping extreme rainfall in a mountainous region using geostatistical
techniques: a case study in Scotland”, Int. J. Climatol., 19, 1337-1356 (1999).
[6] Basist, A., Bell, G. D. and Meentemeyer, V., “Statistical relationships between topography and precipitation patterns”,
J. Clim., 7, 1305-1315 (1994).
Table 1. Descriptive statistics and Kolmogorov-Smirnov’s test for the annual period, the dry and the wet season: M-mean,
MIN-minimum, MAX-maximum, SD-standard deviation, K-S-Kolmogorov-Smirnov’s test value. The last eight
columns give the linear correlation coefficient between rainfall and predictors and their transforms.
        Rainfall (mm)               Correlation
Period  M    MIN   MAX  SD    K-S   X      Y        ELE     SLP     ASP    PC1     PC2      PC3
Wet     297  221   410  39.4  0.75  -0.20  -0.70**  0.42**  0.55**  0.12   0.48**  -0.66**  0.04
Dry     136  88.4  221  26.5  0.86  -0.21  -0.46**  0.23    0.36**  -0.01  0.33*   -0.42**  -0.07
Annual  434  331   599  60.5  1.05  -0.22  -0.65**  0.37**  0.51**  0.07   0.46**  -0.61**  -0.00
** correlation is significant at the 0.01 level (2-tailed), * correlation is significant at the 0.05 level (2-tailed)
Period  Selected predictors  Regression coefficients (OLS†)  Regression coefficients (GLS‡)
Annual  intercept            417.6972                        409.6250
        PC1                  26.89074                        41.90602
        PC2                  -31.13085                       -16.73457
        PC2²                 16.15373                        13.67581
Wet     intercept            285.9751                        282.9170
        PC1                  18.89804                        26.52168
        PC2                  -22.32165                       -19.40965
        PC2²                 11.50941                        11.65545
Dry     intercept            136.2771                        128.5539
        PC1                  9.201015                        16.36250
        PC2                  -8.373646                       3.030121
†OLS-ordinary least squares estimation, ‡GLS-generalized least squares estimation based on the weighted least squares estimation of
residuals.
Table 3. Results of the Jarque-Bera, Lagrange Multiplier and Goldfeld-Quandt tests for the OLS and GLS residuals in the
annual period, the wet and the dry season: Prob.-probability (F-statistic).
Table 4. The regression model between the OLS residual variance and rainfall and predictor’s transforms: Prob.-probability
(F-statistic).
Table 5. Comparison of interpolation methods for bias (MAE) and accuracy of the prediction (RMSE) at 8 validation points,
OK-ordinary kriging, MR-OLS multiple regression, RK-GLS regression kriging, MAE-mean absolute error (mm),
RMSE-root mean square errors (mm), RMSEr-relative prediction error (%).
Fig. 2. Mean monthly rainfall of the stations over the period of January 1986-December 2005. Bars represent standard
deviation.
Fig. 4. Comparison of prediction methods: ordinary kriging (a), OLS multiple regression (b), and regression-kriging (c) for
the annual period (_1), the wet (_2) and the dry season (_3).