Environmental Data Mining and Modeling Based on Machine Learning Algorithms and Geostatistics
Environmental Modelling & Software 19 (2004) 845–855
Received 30 September 2002; received in revised form 13 January 2003; accepted 5 March 2003
Abstract
The paper presents some contemporary approaches to the analysis of spatial environmental data. The main topics concern the decision-oriented problems of environmental spatial data mining and modeling: valorization and representativity of data with the help of exploratory data analysis, spatial predictions, probabilistic and risk mapping, and the development and application of conditional stochastic simulation models. The innovative part of the paper presents an integrated/hybrid model—machine learning (ML) residuals sequential simulations (MLRSS). The models are based on multilayer perceptron and support vector regression ML algorithms used for modeling long-range spatial trends, followed by sequential simulations of the residuals. ML algorithms deliver non-linear solutions to spatially non-stationary problems, which are difficult for the geostatistical approach. Geostatistical tools (variography) are used to characterize the performance of the ML algorithms by analyzing the quality and quantity of the spatially structured information extracted from the data by the ML algorithms. Sequential simulations provide an efficient assessment of uncertainty and spatial variability. A case study on the Chernobyl fallout illustrates the performance of the proposed model. It is shown that probability mapping, provided by the combination of ML data-driven and geostatistical model-based approaches, can be used efficiently in the decision-making process.
© 2003 Elsevier Ltd. All rights reserved.
Keywords: Environmental data mining and assimilation; Geostatistics; Machine learning; Stochastic simulation; Radioactive pollution
doi:10.1016/j.envsoft.2003.03.004
…and Jacobson, 1984; Gambolati and Galeati, 1987; Haas, 1996). All these approaches imply a certain formula-based trend model, which is not necessarily in good agreement with the data. An alternative way for trend modeling is to use a data-driven approach, which relies only on the data. One such approach was developed by partitioning a heterogeneous study area into smaller homogeneous subareas and analyzing the spatial structure within them separately (Pélissier and Goreaud, 2001).

In the present paper, we propose a newly developed model—machine learning residuals sequential Gaussian simulations (MLRSGS)—as an extension of the ideas presented by Kanevsky et al. (1996b) and Demyanov et al. (2000). In these papers, a hybrid model—neural network residuals kriging (NNRK)—was first introduced and then extended for use in combination with different geostatistical models. The basic idea is to use a feedforward neural network (FFNN), which is a well-known global universal approximator, to model large-scale non-linear trends, and then to apply geostatistical estimators/simulators to the residuals. Machine learning algorithms unite a wide family of data-driven models. Here, we will focus on two of them: the multilayer perceptron (MLP) and support vector regression (SVR). Another type of hybrid model (expert systems) was developed by using geographical information systems and modeling integrated into a decision support system for environmental and technological risk assessment and management (see Fedra and Winkelbauer, 1999).

One of the principal advantages of machine learning algorithms is their ability to discover patterns in data which exhibit significant unpredictable non-linearity. Being a data-driven approach ("black-box" models), ML depends only on the quality of the data and the architecture of the model—for the MLP, in particular, the number of hidden neurons, the activation functions and the types of connections. ML can capture spatial peculiarities of the pattern at different scales, describing both linear and non-linear effects. The performance of ML algorithms (MLA) rests on solid theoretical foundations, which were considered by Bishop (1995), Haykin (1999) and Vapnik (1998).

Stochastic simulation is an intensively developed and used approach to provide uncertainty and risk assessment for spatial and spatio-temporal problems. Stochastic simulation models are preferable over estimators as they are able to provide joint probabilistic distributions rather than single-value estimates. Sequential Gaussian simulation (SGS) is one of the widely used methods; it is able to handle highly variable data but is still sensitive to trend, and thus formally requires spatial stationarity to some extent. SGSs are based on the modeling of spatial correlation structures—variography.

A mixture of ML data-driven and geostatistical model-based approaches is also attractive for the decision-making process because of its interpretability.

The real case study on soil pollution from the Chernobyl fallout illustrates the application of the proposed model. The accident at the Chernobyl nuclear power plant caused large-scale contamination of the environment by radiologically important radionuclides. Large-scale consequences of the Chernobyl fallout were considered in the past decade, and one comprehensive mapping work was presented in De Cort and Tsaturov (1996). Geostatistical analysis and prediction modeling of radioactive soil contamination data was presented in Kanevsky et al. (1996a).

2. Machine learning residual Gaussian simulations

2.1. Methodology of ML residual Gaussian simulations

The basic idea is to use ML to develop a non-parametric, robust model to extract large-scale non-linear structures from the data (detrending) and then to use geostatistical models to simulate the residuals at local scales. In brief, the MLRSGS algorithm follows the steps given below (extended after Kanevsky et al., 1996b); a schematic code sketch of the whole pipeline follows the list.
1. Data preprocessing and exploratory analysis: in general, split the data into training, testing and validation sets; check for outliers; perform exploratory data analysis; and estimate and model the spatial correlation—experimental and theoretical variography. The training set is used to train the ML algorithms, the validation set is used to tune hyperparameters (e.g. the number of hidden neurons), while the testing set is applied to assess the generalization ability of the MLA.
2. Training and testing of the ML algorithm. In the present paper, MLP and SVR are used. They are well-known function approximators and are described briefly below.
3. Accuracy test—a comprehensive analysis of the residuals provides the ML residuals at the training points (measured minus estimated), which are the basis for further analysis. Two further cases are possible:
● the residuals are not correlated with the measurements (both 1D and 2D), which means that the MLA has modeled all spatial structures present in the raw data;
● the residuals show some correlation with the samples; then further analysis should be performed on the residuals to model this correlation.
4. The ML residuals are explored using variography. The remaining spatial correlation presents short-range correlation structures, once the long-range correlation (trend) over the whole area has been modeled by the MLA.
5. Normal score transformation (a non-linear transformation from raw data to Nscore values, distributed N(0,1)) is performed to prepare the data for further Gaussian simulations. An Nscore variogram model describing the spatial correlations of the Nscore values is built. SGS is then applied to the MLA residuals, and stochastic realizations are generated using the training dataset.

The idea of stochastic simulation is to develop a spatial Monte Carlo model that will be able to generate many, in some sense equally probable, realizations of a random function (in general, described by a joint probability density function). Any realization of the random function is called an unconditional simulation. Realizations that honor the data are called conditional simulations. Basically, the simulations try to reproduce the first (univariate global distribution) and the second (variogram) moments. The similarities and dissimilarities between the realizations describe spatial variability and uncertainty. The simulations bring valuable information to the decision-oriented mapping of pollution. Postprocessing of the simulated realizations provides probabilistic maps: maps of the probability of the function value being above/below some predefined decision levels. Gaussian random function models are widely used in statistics and simulations due to their analytical simplicity; they are well understood and are the limit distributions of many theoretical results. The SGS algorithm used in this work was described in detail in Deutsch and Journel (1998).

6. Simulated values of the residuals appear after the back normal score transformation. The final ML residual simulation value is the sum of the ML estimate and the SGS realization.
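The sketch below shows, under loudly stated simplifications, how steps 1–6 chain together; it is not the authors' implementation. The synthetic data, the scikit-learn MLPRegressor trend model, and the histogram-only stand-in for the SGS step are all assumptions for illustration; a real study would use a conditional sequential Gaussian simulator (e.g. GSLIB's sgsim).

```python
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(500, 2))                 # synthetic locations
z = 50 * np.sin(xy[:, 0] / 20) + rng.normal(0, 5, 500)  # synthetic field

# Step 1: split the data (validation set omitted here for brevity).
xy_tr, xy_te, z_tr, z_te = train_test_split(xy, z, random_state=0)

# Step 2: small MLP as the large-scale non-linear trend model.
trend = MLPRegressor(hidden_layer_sizes=(5,), max_iter=5000, random_state=0)
trend.fit(xy_tr, z_tr)

# Step 3: accuracy test - residuals (measured minus estimated).
resid = z_tr - trend.predict(xy_tr)

# Step 5: normal score transform of the residuals (empirical quantile
# mapping); its variogram would be modeled before simulation.
nscore = norm.ppf(rankdata(resid) / (len(resid) + 1))

# Stand-in for SGS: independent N(0,1) draws back-transformed through the
# empirical residual quantiles. This honors the histogram only, NOT the
# variogram; a real application must simulate sequentially and conditionally.
def one_realization():
    sim = rng.standard_normal(len(resid))
    back = np.quantile(resid, norm.cdf(sim))   # step 6: back transform
    return trend.predict(xy_tr) + back         # ML estimate + residual draw

realizations = np.array([one_realization() for _ in range(100)])
```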
2.2. Description of multilayer perceptron model

The MLP is a type of artificial neural network with a specific structure and training procedure, described in Bishop (1995) and Haykin (1999).

The key component of the MLP is the formal neuron, which sums the inputs and performs a non-linear transform via the activation function f (Fig. 1). The activation function (or non-linear transformer) can be any continuous, bounded and non-decreasing function. The exponential sigmoid or hyperbolic tangent are commonly used in practice. The weights W(w_0, …, w_n) are adaptive parameters which are optimized by minimizing the following quadratic cost function:

\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (t_i - o_i)^2   (2)

where MSE is the mean square error, N is the number of samples, o_i is the net output (prediction) and t_i is the desired value of the real function. The backpropagation algorithm is applied to calculate the gradient of the MSE with respect to the adaptive weights, ∂E/∂W. Various optimization algorithms which employ backpropagation can be used, such as the conjugate gradient descent method, the second-order pseudo-Newton Levenberg-Marquardt method, or the resilient propagation method.
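As a toy illustration of the formal neuron and the cost function (2) — an assumption-laden sketch, not the authors' code — consider a one-hidden-layer network with tanh activations:

```python
import numpy as np

def mlp_forward(xy, W1, b1, W2, b2):
    """One hidden layer of formal neurons: weighted sum plus tanh
    activation, followed by a linear output neuron."""
    h = np.tanh(xy @ W1 + b1)
    return h @ W2 + b2

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 5)), np.zeros(5)   # five hidden neurons
W2, b2 = rng.normal(size=5), 0.0

xy = rng.uniform(0, 100, size=(200, 2))         # (X, Y) inputs
t = np.sin(xy[:, 0] / 20)                       # desired values t_i
o = mlp_forward(xy, W1, b1, W2, b2)             # network outputs o_i
mse = np.mean((t - o) ** 2)                     # the cost of Eq. (2)
```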
In a standard MLP, the neurons are arranged in input, hidden and output layers. The values of the explanatory variables (the X and Y co-ordinates) are presented to the input layer; the output layer produces the estimate of the target function value (the 137Cs concentration); and the hidden layers (one or two) allow the network to handle non-linearity (Fig. 2). The number of neurons in the hidden layers can vary and is the subject of configuration optimization. Since the aim of the MLP in the present work is to extract a large-scale trend, as few hidden neurons as possible were chosen to extract the non-linear trends. A further increase in the number of hidden neurons leads to the extraction of more detailed local peculiarities and even noise from the pattern: choosing too many hidden neurons will lead to over-fitting (or over-learning), when the MLP loses its ability to generalize the information from the samples. On the other hand, using too few hidden neurons does not provide explicit extraction of the trend; hence some large-scale correlations will remain in the residuals, restricting the further procedure. Thus, geostatistical variogram analysis becomes the key tool to control the MLP performance for trend extraction.
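A minimal sketch of that control step — an isotropic experimental semivariogram of the residuals — follows; the locations and residuals are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(300, 2))    # hypothetical residual locations
resid = rng.normal(0, 1, size=300)         # hypothetical MLP residuals

def experimental_variogram(xy, v, lag_edges):
    """gamma(h): mean of 0.5*(v_i - v_j)^2 over pairs in each distance bin."""
    d = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)
    sq = 0.5 * (v[:, None] - v[None, :]) ** 2
    iu = np.triu_indices(len(v), k=1)      # count each pair once
    d, sq = d[iu], sq[iu]
    return np.array([sq[(d >= lo) & (d < hi)].mean()
                     for lo, hi in zip(lag_edges[:-1], lag_edges[1:])])

# A flat (pure nugget) gamma(h) suggests the MLP captured all spatial
# structure; a gamma(h) still rising at long lags flags trend left behind.
gamma = experimental_variogram(xy, resid, np.linspace(0, 50, 11))
```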
2.3. Description of support vector regression model

SVR is a recent development of statistical learning theory (SLT) (Vapnik, 1998). It is based on structural risk minimization and seems to be a promising approach for spatial data analysis and processing (see Scholkopf and Smola, 1998; Gilardi and Bengio, 2000; Kanevski et al., 2001). There are several attractive properties of the SVR: robustness of the solution, sparseness of the regression, automatic control of the solution's complexity, and good generalization performance (Vapnik, 1998). In general, by tuning the SVR hyper-parameters, it is possible to cover a wide range of spatial regression functions, from over-fitting to over-smoothing (Kanevski et al., 2001).
First, we state the general problem of regression estimation as it is presented in the scope of SLT. Let {(x_1, y_1), …, (x_N, y_N)} be a set of observations generated from an unknown probability distribution P(x, y), with x_i ∈ R^n, y_i ∈ R, and let F = {f | R^n → R} be a class of functions. The task is to find a function f from the given class of functions that minimizes the risk functional:

R[f] = \int Q(y - f(x), x) \, dP(x, y)   (3)

where Q is a loss function indicating how the difference between the measured value and the model's prediction is penalized.

As P(x, y) is unknown, one can compute an empirical risk:

R_{emp} = \frac{1}{N} \sum_{i=1}^{N} Q(y_i - f(x_i), x_i)   (4)
When it is only known that the noise-generating distribution is symmetric, the use of a linear loss function is preferable and results in a model from the robust regression family. For simplicity, we also assume the loss to be the same for all spatial locations.

The support vector regression model is based on a new type of loss function, the so-called ε-insensitive loss functions. The symmetric linear ε-insensitive loss is defined as:

Q(y - f(x), x) = \begin{cases} |y - f(x)| - \varepsilon, & \text{if } |y - f(x)| > \varepsilon \\ 0, & \text{otherwise} \end{cases}   (5)

An asymmetrical loss function can be used in applications where underestimations and overestimations are not equivalent.
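The loss of Eq. (5) is a one-liner in code; a minimal sketch (the function name and test values are illustrative):

```python
import numpy as np

def eps_insensitive_loss(residual, eps):
    """Zero inside the eps-tube, linear |r| - eps outside it (Eq. (5))."""
    return np.maximum(np.abs(residual) - eps, 0.0)

r = np.linspace(-3, 3, 7)
print(eps_insensitive_loss(r, eps=1.0))  # [2. 1. 0. 0. 0. 1. 2.]
```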
Let us start with the estimation of a regression function in the class of linear functions F = {f(x) | f(x) = (w, x) + b}. Support vector regression is based on the structural risk minimization principle, which results in penalization of the model complexity simultaneously with keeping the empirical risk (training error) small. The complexity of linear functions can be controlled by the term ||w||², see Eq. (6) (Vapnik, 1998). Also, we have to minimize the empirical risk. With the selected symmetric linear ε-insensitive loss, empirical risk minimization is equivalent to adding the slack variables ξ_i, ξ_i* into the functional with the linear constraints (7). Introducing the trade-off constant C, we arrive at the following optimization problem:

\text{minimize } \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)   (6)

\text{subject to } \begin{cases} f(x_i) - y_i - \varepsilon \le \xi_i \\ -f(x_i) + y_i - \varepsilon \le \xi_i^* \\ \xi_i, \xi_i^* \ge 0, \quad i = 1, \dots, N \end{cases}   (7)

The slack variables ξ_i, ξ_i* measure the distance between the observation and the ε-tube. The relation between ε and ξ_i, ξ_i* is illustrated by the following example: imagine you have great confidence in your measurement process, but the variance of the measured phenomenon is large. In this case, ε has to be chosen a priori very small, while the slack variables ξ_i, ξ_i* are optimized and thus can be large. Remember that inside the ε-tube ([f(x) − ε, f(x) + ε]) the loss function is zero.

Note that by introducing the couple (ξ_i, ξ_i*), the problem now has 2N unknown variables. But these variables are linked, since one of the two values is necessarily equal to zero: either the slack is positive (ξ_i* = 0) or negative (ξ_i = 0). Thus, y_i ∈ [f(x_i) − ε − ξ_i, f(x_i) + ε + ξ_i*].
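A small sketch of the optimal slack values implied by the constraints (7); the function name and test data are illustrative assumptions:

```python
import numpy as np

def slacks(f_x, y, eps):
    """Optimal slacks for Eq. (7): xi is the overshoot of f(x) above
    y + eps, xi_star the overshoot of y above f(x) + eps; at most one of
    the pair is non-zero, and both vanish inside the eps-tube."""
    xi = np.maximum(f_x - y - eps, 0.0)
    xi_star = np.maximum(y - f_x - eps, 0.0)
    return xi, xi_star

f_x = np.zeros(3)
y = np.array([-2.0, 0.5, 2.0])
print(slacks(f_x, y, eps=1.0))   # (array([1., 0., 0.]), array([0., 0., 1.]))
```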
A classical way to reformulate a constraint-based minimization problem is to look for the saddle point of the Lagrangian L:

L(w, \xi, \xi^*, a) = \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*) - \sum_{i=1}^{N} a_i (y_i - f(x_i) + \varepsilon + \xi_i) - \sum_{i=1}^{N} a_i^* (f(x_i) - y_i + \varepsilon + \xi_i^*) - \sum_{i=1}^{N} (\eta_i \xi_i + \eta_i^* \xi_i^*)   (8)

where a_i, a_i*, η_i, η_i* are the Lagrange multipliers associated with the constraints; a_i, a_i* can be roughly interpreted as a measure of the influence of the constraints on the solution. A solution with a_i = a_i* = 0 can be interpreted as "the corresponding data point has no influence on this solution". The other points, with non-zero a_i or a_i*, are the "support vectors (SVs)" of the problem.

The dual formulation of the optimization problem, which is solved in practice, is:
\text{maximize } -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (a_i^* - a_i)(a_j^* - a_j)(x_i \cdot x_j) - \varepsilon \sum_{i=1}^{N} (a_i^* + a_i) + \sum_{i=1}^{N} y_i (a_i^* - a_i)   (9)

\text{subject to } \sum_{i=1}^{N} (a_i^* - a_i) = 0, \quad 0 \le a_i, a_i^* \le C

The regression estimate is then written in terms of the multipliers:

f(x) = \sum_{i=1}^{N} (a_i^* - a_i)(x_i \cdot x) + b   (10)

Note that both the solution (10) and the optimization problem (9) are written in terms of dot products. Hence, we can use the so-called "kernel trick" to achieve a non-linear regression model. We substitute the dot products (x_i \cdot x_j) with a suitable kernel function K: R^n × R^n → R. If the kernel function satisfies Mercer's condition

\int \int K(x', x'') g(x') g(x'') \, dx' \, dx'' > 0   (11)

it can be expanded as

K(x', x'') = \sum_j \lambda_j \Phi_j(x') \Phi_j(x'')   (12)

and the dual optimization problem becomes:

\text{maximize } -\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (a_i^* - a_i)(a_j^* - a_j) K(x_i, x_j) - \varepsilon \sum_{i=1}^{N} (a_i^* + a_i) + \sum_{i=1}^{N} y_i (a_i^* - a_i)   (13)

\text{subject to } \sum_{i=1}^{N} (a_i^* - a_i) = 0, \quad 0 \le a_i, a_i^* \le C
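Mercer's condition (11) can be probed numerically: for a valid kernel the Gram matrix K(x_i, x_j) is positive semi-definite, so every quadratic form g'Kg is non-negative. A sketch with synthetic points and the Gaussian kernel of Eq. (16) below (all values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=(50, 2))          # synthetic sample locations
s = 20.0                                       # kernel bandwidth
d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
K = np.exp(-d2 / (2 * s**2))                   # Gram matrix, cf. Eq. (16)

g = rng.normal(size=50)                        # arbitrary test vector
print(g @ K @ g >= -1e-9)                      # True: quadratic form >= 0
print(np.linalg.eigvalsh(K).min() >= -1e-9)    # True: eigenvalues >= 0
```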
A Gaussian radial basis function (RBF) kernel is used in this work:

K(x, x') = \exp\left( -\frac{|x - x'|^2}{2 s^2} \right)   (16)

The kernel parameter—the bandwidth s—is related to the characteristic correlation scales of the trend model. A kernel bandwidth of 20 km is used for the presented model. The other parameters were defined as: C = 20, ε = 200.
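A hedged sketch of an SVR trend model with these reported settings, using scikit-learn's SVR (the synthetic data are an assumption; scikit-learn's `gamma` corresponds to 1/(2s²) for bandwidth s, with coordinates in km):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
xy = rng.uniform(0, 100, size=(684, 2))                 # synthetic stand-in
z = 400 + 300 * np.sin(xy[:, 0] / 30) + rng.normal(0, 150, 684)

s = 20.0                                                # bandwidth, km
svr = SVR(kernel="rbf", gamma=1.0 / (2 * s**2), C=20.0, epsilon=200.0)
svr.fit(xy, z)

trend = svr.predict(xy)          # large-scale trend estimate
resid = z - trend                # residuals passed on to variography/SGS
print(f"support vectors: {len(svr.support_)} of {len(z)}")
```

Sparseness shows up directly here: only the points outside the ε-tube become support vectors and carry the solution.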
3. Case study

The selected region is rectangular, covering 7428 km² with 845 populated sites. The basic statistical parameters of the data (137Cs concentration at 684 points) presented in Fig. 3 are the following: minimum value 5.9 kBq/m², …
Fig. 6. 137Cs: artificial neural network (one hidden layer with five neurons) spatial predictions.
Fig. 7. 137Cs: support vector regression trend modeling.
Fig. 8. Scatterplot of the MLP and SVR residuals vs. 137Cs sample values.
Fig. 9. Scatterplot of the MLP and SVR residuals vs. MLP and SVR estimates, respectively.

… in Figs. 10 and 11. Variograms of the Nscore-transformed residuals can be easily modeled (fitted to a theoretical model) and SGSs can be applied (the variogram reaches a sill and levels off). The final ML residual sequential Gaussian simulation results are presented as equiprobable realizations in Figs. 12 and 13. They keep the large-scale trend structure (from Figs. 6 and 7) and also feature dis…

Fig. 12. Mapping of 137Cs with the neural network residual sequential Gaussian simulations model (NNRSGS).
Fig. 13. Mapping of 137Cs with the support vector regression residual sequential Gaussian simulations model (SVRRSGS).
… of simulated models (realizations) are generated. Postprocessing of the realizations gives a rich variety of outputs; one of them is the probability/risk map. Probability maps of exceeding the level of 800 kBq/m², obtained with the neural network and support vector regression residual sequential Gaussian simulation models, are presented in Figs. 14 and 15, respectively. This is important advanced information for the real decision-making process.

Fig. 14. Probability of exceeding the level of 800 kBq/m² for the NNRSGS model.
Fig. 15. Probability of exceeding the level of 800 kBq/m² for the SVRRSGS model.
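The postprocessing behind such maps reduces to a per-location frequency over the stack of realizations; a sketch with a synthetic stand-in stack (shapes and values are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in stack: (n_realizations, n_locations) of simulated 137Cs values.
realizations = rng.lognormal(mean=6.0, sigma=0.8, size=(100, 2500))

level = 800.0                                    # kBq/m^2 decision level
p_exceed = (realizations > level).mean(axis=0)   # per-location frequency
risk_map = p_exceed.reshape(50, 50)              # grid for mapping
```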
4. Discussion

The final stage deals with the validation of the ML residual sequential Gaussian simulation results. Comparisons with geostatistical prediction models were carried out. The proposed models give comparable or better results on different data sets. A comparison between the proposed models (NNRSGS and SVRRSGS) was also carried out at the testing points. As a result, the NNRSGS model gives better results than the SVRRSGS model in terms of the testing error and the summary statistics of the testing distribution. A comprehensive comparison with other ML methods is a topic for further research.

Several important points should be mentioned:

(1) Analysis of the residuals is important also in the case when only ML mapping is applied. This helps to understand the quality of the results. If there is no spatial correlation in the residuals, it means that all spatial information has been extracted from the data, and ML can be used for prediction mapping as well.

(2) Robustness of the approach: how sensitive it is to the selection of the ML architecture and learning algorithm. Chernov et al. (1999) demonstrated the robustness of the MLP with a varying number of neurons on validation data. It was also shown that the MLP is more sensitive to the selection of the training set than to the number of neurons. The same robust behavior has been obtained, both for the MLP and the SVR (varying model parameters), in the case presented in this study. So, we can choose the simplest ML models capable of learning and catching the non-linear trends.

Usually, the accuracy test (analysis of the residuals) has been used for the analysis and description of what was learned by the ML. The accuracy test measures the correlation between the training data and the MLA predictions at the same points.

(3) Data clustering is a well-known problem in spatial data analysis (Deutsch and Journel, 1998). This …
References

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
Chernov, S., Demyanov, V., Grachev, N., Kanevski, M., Kravetski, A., Savelieva, E., Timonin, V., Maignan, M., 1999. Multiscale Pollution Mapping with Artificial Neural Networks and Geostatistics. In: Lippard, S.J., Nass, A., Sinding-Larsen, R. (Eds.), Proceedings of the 5th Annual Conference of the International Association for Mathematical Geology (IAMG'99), August 1999, pp. 325–330.
Cressie, N., 1991. Statistics for Spatial Data. John Wiley & Sons, New York.
De Cort, M., Tsaturov, Yu.S., 1996. Atlas on caesium contamination of Europe after the Chernobyl nuclear plant accident. European Commission, Report EUR 16542 EN.
Demyanov, V., Kanevski, M., Savelieva, E., Timonin, V., Chernov, S., Polishuk, V., 2000. Neural Network Residual Stochastic Cosimulation for Environmental Data Analysis. Proceedings of the Second ICSC Symposium on Neural Computation (NC'2000), May 2000, Berlin, Germany, pp. 647–653.
Deutsch, C.V., Journel, A.G., 1998. GSLIB Geostatistical Software Library and User's Guide. Oxford University Press, New York, Oxford.
Dowd, P.A., 1994. The Use of Neural Networks for Spatial Simulation. In: Dimitrakopoulos, R. (Ed.), Geostatistics for the Next Century. Kluwer Academic Publishers, pp. 173–184.
Fedra, K., Winkelbauer, L., 1999. A hybrid expert system, GIS and simulation modeling for environmental and technological risk management. Environmental Decision Support Systems and Artificial Intelligence, Technical Report WS-99-07. AAAI Press, Menlo Park, CA, pp. 1–7.
Gambolati, G., Galeati, G., 1987. Comment on "Analysis of nonintrinsic spatial variability by residual kriging with application to regional groundwater levels" by Neuman and Jacobson. Mathematical Geology 19, 249–257.
Gilardi, N., Bengio, S., 2000. Local machine learning models for spatial data analysis. IDIAP-RR 00-34.
Haas, T.C., 1996. Multivariate spatial prediction in the presence of nonlinear trend and covariance nonstationarity. Environmetrics 7.
Haykin, S., 1999. Neural Networks. A Comprehensive Foundation, second ed. Prentice Hall International, Inc.
Isaaks, E.H., Srivastava, R.M., 1989. An Introduction to Applied Geostatistics. Oxford University Press, Oxford.
Kanevski, M., Demyanov, V., Chernov, S., Savelieva, E., Serov, A., Timonin, V., 1999. Geostat Office for Environmental and Pollution Spatial Data Analysis. Mathematische Geologie, band 3. CPress Publishing House, April, pp. 73–83.
Kanevski, M., Pozdnukhov, A., Canu, S., Maignan, M., Wong, P., Shibli, S., 2001. Support vector machines for classification and mapping of reservoir data. In: Soft Computing for Reservoir Characterization and Modeling. Springer-Verlag, pp. 531–558.
Kanevsky, M., Arutyunyan, R., Bolshov, L., Demyanov, V., Linge, I., Savelieva, E., Shershakov, V., Haas, T., Maignan, M., 1996a. Geostatistical Portrayal of the Chernobyl fallout. In: Baafi, E.Y., Schofield, N.A. (Eds.), Geostatistics '96, Wollongong, vol. 2. Kluwer Academic Publishers, pp. 1043–1054.
Kanevsky, M., Arutyunyan, R., Bolshov, L., Demyanov, V., Maignan, M., 1996b. Artificial neural networks and spatial estimations of Chernobyl fallout. Geoinformatics 7, 5–11.
Masters, T., 1995. Advanced Algorithms for Neural Networks. A C++ Sourcebook. John Wiley & Sons, Inc.
Neuman, S.P., Jacobson, E.A., 1984. Analysis of nonintrinsic spatial variability by residual kriging with application to regional groundwater levels. Mathematical Geology 16, 499–521.
Pélissier, R., Goreaud, F., 2001. A practical approach to the study of spatial structure in simple cases of heterogeneous vegetation. Journal of Vegetation Science 12, 99–108.
Scholkopf, B., Smola, A., 1998. Learning with Kernels. MIT Press, Cambridge, MA.
Vapnik, V., 1998. Statistical Learning Theory. John Wiley & Sons, New York.