
Environmental Modelling & Software 19 (2004) 845–855

www.elsevier.com/locate/envsoft

Environmental data mining and modeling based on machine learning algorithms and geostatistics

M. Kanevski a,b, R. Parkin c,∗, A. Pozdnukhov a,c,d, V. Timonin c, M. Maignan b, V. Demyanov c, S. Canu e

a IDIAP Dalle Molle Institute for Perceptual Artificial Intelligence, Simplon 4, 1920 Martigny, Switzerland
b Lausanne University, Lausanne, Switzerland
c IBRAE, Nuclear Safety Institute, Russian Academy of Sciences, Environmental Modelling and System Analysis Laboratory, 52 B, Tulskaya, Moscow 113191, Russia
d Physics Department, Moscow State University, Math. Division, Moscow, Russia
e INSA, Rouen, France

Received 30 September 2002; received in revised form 13 January 2003; accepted 5 March 2003

Abstract

The paper presents some contemporary approaches to spatial environmental data analysis. The main topics concern the decision-oriented problems of environmental spatial data mining and modeling: valorization and representativity of data with the help of exploratory data analysis, spatial predictions, probabilistic and risk mapping, and the development and application of conditional stochastic simulation models. The innovative part of the paper presents an integrated/hybrid model—machine learning residuals sequential simulations (MLRSS). The models are based on multilayer perceptron and support vector regression ML algorithms, used for modeling long-range spatial trends, followed by sequential simulations of the residuals. ML algorithms deliver non-linear solutions for spatially non-stationary problems, which are difficult for the geostatistical approach. Geostatistical tools (variography) are used to characterize the performance of ML algorithms by analyzing the quality and quantity of the spatially structured information extracted from the data with ML algorithms. Sequential simulations provide an efficient assessment of uncertainty and spatial variability. A case study on the Chernobyl fallout illustrates the performance of the proposed model. It is shown that probability mapping, provided by the combination of ML data-driven and geostatistical model-based approaches, can be used efficiently in the decision-making process.
© 2003 Elsevier Ltd. All rights reserved.

Keywords: Environmental data mining and assimilation; Geostatistics; Machine learning; Stochastic simulation; Radioactive pollution
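The hybrid scheme summarized in the abstract—an ML trend model plus stochastic simulation of the residuals—can be sketched in a few lines. This is a hedged illustration on synthetic data, not the authors' implementation: a Gaussian-kernel regression stands in for the MLP/SVR trend model, and independent Gaussian draws stand in for conditional sequential Gaussian simulation of the residuals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic observations: coordinates in [0, 1]^2, values with a smooth
# large-scale trend plus small-scale noise.
xy = rng.uniform(0.0, 1.0, size=(200, 2))
trend_true = np.sin(2 * np.pi * xy[:, 0]) + xy[:, 1]
z = trend_true + 0.1 * rng.standard_normal(200)

def kernel_trend(xy_train, z_train, xy_query, bandwidth=0.1):
    """Gaussian-kernel regression: a stand-in for the ML trend model."""
    d2 = ((xy_query[:, None, :] - xy_train[None, :, :]) ** 2).sum(axis=2)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return (w @ z_train) / w.sum(axis=1)

# Detrending: the residuals carry much less variance than the raw data.
trend_hat = kernel_trend(xy, z, xy)
residuals = z - trend_hat

# Stand-in for SGS: many equally probable realizations of the residual
# field, each added back to the trend estimate.
realizations = trend_hat[None, :] + residuals.std() * rng.standard_normal((100, 200))

# Post-processing: probability of exceeding a decision level at each location.
p_exceed = (realizations > 1.0).mean(axis=0)
print(residuals.std() < z.std(), p_exceed.shape)
```

Post-processing the stack of realizations into an exceedance-probability map is the step that turns the simulations into decision-oriented output.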

1. Introduction

Environmental data feature complex spatial patterns at different scales due to a combination of several spatial phenomena or various influencing factors of different origins. In some cases, the original observations are taken with significant measurement errors and may contain significant uncertainty as well as a number of outliers. Non-linear spatial trends corresponding to large-scale processes complicate geostatistical modeling as long as stationary models (e.g. ordinary kriging) are concerned. Trend removal is also necessary for comprehensive spatial correlation analysis and modeling (variography). Variogram modeling for such data using common geostatistical approaches will produce incorrect results. In the presence of trends, the data can be decomposed into two parts:

Z(x) = M(x) + e(x)   (1)

where M(x) represents large-scale deterministic spatial variations (trends), and e(x) represents small-scale spatial stochastic variations. Contemporary geostatistics offers several possible approaches to handle spatial trends (spatial non-stationarity): universal kriging (implying a polynomial trend model), residual kriging, moving window regression residual kriging (see Cressie, 1991; Deutsch and Journel, 1998; Dowd, 1994; Neuman

∗ Corresponding author. Tel.: +7-095-955-2231; fax: +7-095-958-1151.
E-mail addresses: [email protected] (M. Kanevski); [email protected] (R. Parkin); [email protected] (A. Pozdnukhov).

1364-8152/$ - see front matter © 2003 Elsevier Ltd. All rights reserved.
doi:10.1016/j.envsoft.2003.03.004

and Jacobson, 1984; Gambolati and Galeati, 1987; Hass, 1996). All these approaches imply a certain formula-based trend model, which is not necessarily in good agreement with the data. An alternative way for trend modeling is to use a data-driven approach, which relies only on the data. One such approach was developed by partitioning a heterogeneous study area into smaller homogeneous subareas and analyzing the spatial structure within them separately (Pélissier and Goreaud, 2001).

In the present paper, we propose a newly developed model—machine learning residuals sequential Gaussian simulations (MLRSGS)—as an extension of the ideas presented by Kanevsky et al. (1996b) and Demyanov et al. (2000). In these papers, a hybrid model—neural network residuals kriging (NNRK)—was first introduced and then extended for use in combination with different geostatistical models. The basic idea is to use a feedforward neural network (FFNN), which is a well-known global universal approximator, to model large-scale non-linear trends, and then to apply geostatistical estimators/simulators to the residuals. Machine learning algorithms unite a wide family of data-driven models. Here, we will focus on two of them: the multilayer perceptron (MLP) and support vector regression (SVR). Another type of hybrid model (expert systems) was developed by using geographical information systems and modeling integrated into a decision support system for environmental and technological risk assessment and management (see Fedra and Winkelbauer, 1999).

One of the principal advantages of machine learning algorithms is their ability to discover patterns in data which exhibit significant unpredictable non-linearity. Being a data-driven ("black-box") approach, ML depends only on the quality of the data and the architecture of the model—for MLP, the number of hidden neurons, the activation functions and the types of connections. ML can capture spatial peculiarities of the pattern at different scales, describing both linear and non-linear effects. The performance of MLA rests on solid theoretical foundations, which were considered by Bishop (1995), Haykin (1999) and Vapnik (1998).

Stochastic simulation is an intensively developed and widely used approach to provide uncertainty and risk assessment for spatial and spatio-temporal problems. Stochastic simulation models are preferable over estimators, as they are able to provide joint probabilistic distributions rather than single-value estimates. Sequential Gaussian simulation (SGS) is one of the widely used methods; it is able to handle highly variable data but is still sensitive to trends, and thus formally requires spatial stationarity to some extent. SGSs are based on the modeling of spatial correlation structures—variography.

A mixture of ML data-driven and geostatistical model-based approaches is also attractive for the decision-making process because of its interpretability.

The real case study on soil pollution from the Chernobyl fallout illustrates the application of the proposed model. The accident at the Chernobyl nuclear power plant caused large-scale contamination of the environment by radiologically important radionuclides. Large-scale consequences of the Chernobyl fallout were considered in the past decade, and one comprehensive mapping work was presented in De Cort and Tsaturov (1996). Geostatistical analysis and prediction modeling of radioactive soil contamination data was presented in Kanevsky et al. (1996a).

2. Machine learning residual Gaussian simulations

2.1. Methodology of ML residual Gaussian simulations

The basic idea is to use ML to develop a non-parametric, robust model to extract large-scale non-linear structures from the data (detrending) and then to use geostatistical models to simulate the residuals at local scales. In brief, the MLRSGS algorithm follows the steps given below (extended after Kanevsky et al., 1996b):

1. Data preprocessing and exploratory analysis: in general, split the data into training, testing and validation sets; check for outliers; perform exploratory data analysis; estimate and model the spatial correlation—experimental and theoretical variography. The training set is used for ML algorithm training, the validation set is used to tune hyperparameters (e.g. the number of hidden neurons), while the testing set is applied to assess MLA generalization ability.
2. Training and testing of the ML algorithm. In the present paper, MLP and SVR are used. They are well-known function approximators and are described briefly below.
3. Accuracy test—comprehensive analysis of the residuals. It provides the ML residuals at the training points (measured minus estimated), which are the basis for further analysis. Two further cases are possible:
   • the residuals are not correlated with the measurements (both 1D and 2D), which means that the MLA has modeled all spatial structures present in the raw data;
   • the residuals show some correlation with the samples; then further analysis should be performed on the residuals to model this correlation.
4. The ML residuals are explored using variography. The remaining spatial correlation presents short-range correlation structures, once the long-range correlation (trend) over the whole area has been modeled by the MLA.
5. Normal score transformation (a non-linear transformation from the raw data to Nscore values, distributed N(0,1)) is performed to prepare the data for further Gaussian simulations. An Nscore variogram model describing the spatial correlations of the Nscore values is built. SGS is then applied to the MLA residuals, and stochastic realizations are generated using the training dataset.

The idea of stochastic simulation is to develop a spatial Monte Carlo model that will be able to generate many, in some sense equally probable, realizations of a random function (in general, described by a joint probability density function). Any realization of the random function is called an unconditional simulation. Realizations that honor the data are called conditional simulations. Basically, the simulations try to reproduce the first (univariate global distribution) and the second (variogram) moments. The similarities and dissimilarities between the realizations describe spatial variability and uncertainty. The simulations bring valuable information for the decision-oriented mapping of pollution. Postprocessing of the simulated realizations provides probabilistic maps: maps of the probability of the function value being above/below some predefined decision levels. Gaussian random function models are widely used in statistics and simulations due to their analytical simplicity; they are well understood and are limit distributions of many theoretical results. The SGS algorithm used in this work was described in detail in Deutsch and Journel (1998).

6. Simulated values of the residuals appear after the back normal score transformation. The final ML residual simulation value is the sum of the ML estimate and the SGS realization.

2.2. Description of multilayer perceptron model

MLP is a type of artificial neural network with a specific structure and training procedure described in Bishop (1995) and Haykin (1999).

The key component of MLP is the formal neuron, which sums the inputs and performs a non-linear transform via the activation function f (Fig. 1). The activation function (or non-linear transformer) can be any continuous, bounded and non-decreasing function. The exponential sigmoid or the hypertangent are commonly used in practice.

Fig. 1. Formal neuron.

The weights W(w0, …, wn) are adaptive parameters which are optimized by minimizing the following quadratic cost function:

MSE = (1/N) Σ_{i=1..N} (t_i − o_i)²   (2)

where MSE is the mean square error, N is the number of samples, o_i is the net output (prediction) and t_i is the desired value of the real function. The backpropagation error algorithm is applied to calculate the gradient of the MSE with respect to the adaptive weights, ∂E/∂W. Various optimization algorithms which employ backpropagation can be used, such as the conjugate gradient descent method, the second-order pseudo-Newton Levenberg-Marquardt method, or the resilient propagation method.

In a standard MLP, the neurons are arranged in input, hidden and output layers. The values of the exploratory variables (X and Y co-ordinates) are exposed to the input layer, the output layer produces the estimate of the target function value (137Cs concentration), and the hidden layers (one or two) allow handling of non-linearity (Fig. 2). The number of neurons in the hidden layers can vary and is subject to optimum configuration. As long as the aim of the MLP in the present work is to extract a large-scale trend, as few hidden neurons as possible were chosen to extract the non-linear trends. Further increase of the number of hidden neurons leads to extracting more detailed local peculiarities and even noise from the pattern: choosing too many hidden neurons will lead to over-fitting (or over-learning), when the MLP loses its ability to generalize the information from the samples. On the other hand, using too few hidden neurons does not provide explicit extraction of the trend; hence some large-scale correlations will remain in the residuals, restricting the further procedure. Thus, geostatistical variogram analysis becomes the key tool to control the MLP performance for trend extraction.

Fig. 2. Multilayer perceptron.
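The trend-extraction step, with the cost function of Eq. (2), can be sketched as a tiny MLP trained by plain gradient descent. This is a hedged, self-contained illustration on synthetic data, not the software used in the paper (which relied on Levenberg-Marquardt and conjugate gradient training): two coordinate inputs, five tanh hidden neurons, and one linear output.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "contamination" surface with a smooth large-scale trend.
X = rng.uniform(-1.0, 1.0, size=(300, 2))   # (x, y) coordinates
t = np.sin(np.pi * X[:, 0]) + 0.5 * X[:, 1]  # target values

n_hidden = 5  # deliberately few neurons: capture only the trend
W1 = 0.5 * rng.standard_normal((2, n_hidden))
b1 = np.zeros(n_hidden)
W2 = 0.5 * rng.standard_normal(n_hidden)
b2 = 0.0

def forward(X):
    H = np.tanh(X @ W1 + b1)        # hidden-layer activations
    return H, H @ W2 + b2           # linear output o

_, o0 = forward(X)
mse0 = np.mean((t - o0) ** 2)       # cost of Eq. (2) before training

lr = 0.05
for _ in range(2000):
    H, o = forward(X)
    err = o - t                      # dMSE/do (up to the constant 2)
    gW2 = H.T @ err / len(X)
    gb2 = err.mean()
    dH = np.outer(err, W2) * (1.0 - H ** 2)   # backprop through tanh
    gW1 = X.T @ dH / len(X)
    gb1 = dH.mean(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

_, o1 = forward(X)
mse1 = np.mean((t - o1) ** 2)
print(mse1 < mse0)  # training reduces the quadratic cost
```

With only five hidden neurons the fitted surface stays smooth, mirroring the paper's choice of a deliberately small hidden layer so that only the large-scale trend, not local noise, is captured.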



2.3. Description of support vector regression model

SVR is a recent development of statistical learning theory (SLT) (Vapnik, 1998). It is based on structural risk minimization and seems to be a promising approach for spatial data analysis and processing (see Scholkopf and Smola, 1998; Gilardi and Bengio, 2000; Kanevski et al., 2001). There are several attractive properties of the SVR: robustness of the solution, sparseness of the regression, automatic control of the solution complexity, and good generalization performance (Vapnik, 1998). In general, by tuning the SVR hyper-parameters, it is possible to cover a wide range of spatial regression functions from over-fitting to over-smoothing (Kanevski et al., 2001).

First, we state the general problem of regression estimation as it is presented in the scope of SLT. Let {(x1,y1),…,(xN,yN)} be a set of observations generated from an unknown probability distribution P(x, y) with xi ∈ Rⁿ, yi ∈ R, and F = {f | Rⁿ → R} a class of functions. The task is to find a function f from the given class of functions that minimizes a risk functional:

R[f] = ∫ Q(y − f(x), x) dP(x, y)   (3)

where Q is a loss function indicating how the difference between the measured value and the model's prediction is penalized.

As P(x, y) is unknown, one can compute an empirical risk:

R_emp = (1/N) Σ_{i=1..N} Q(y_i − f(x_i), x_i)   (4)

When it is only known that the noise-generating distribution is symmetric, the use of a linear loss function is preferable and results in a model from the robust regression family. For simplicity, we also assume the loss to be the same for all spatial locations.

The support vector regression model is based on a new type of loss function, the so-called ε-insensitive loss functions. The symmetric linear ε-insensitive loss is defined as:

Q(y − f(x), x) = |y − f(x)| − ε, if |y − f(x)| > ε; 0, otherwise   (5)

An asymmetrical loss function can be used in applications where underestimations and overestimations are not equivalent.

Let us start from the estimation of the regression function in a class of linear functions F = {f(x) | f(x) = (w,x) + b}. Support vector regression is based on the structural risk minimization principle, which results in penalization of the model complexity simultaneously with keeping the empirical risk (training error) small. The complexity of linear functions can be controlled by the term ||w||², see Eq. (6) (Vapnik, 1998). Also, we have to minimize the empirical risk (training error). With the selected symmetric linear ε-insensitive loss, empirical risk minimization is equivalent to introducing the slack variables ξi, ξi* into the functional with the linear constraints (7). Introducing the trade-off constant C, we arrive at the following optimization problem:

minimize   (1/2)||w||² + C Σ_{i=1..N} (ξi + ξi*)   (6)

subject to   f(xi) − yi − ε ≤ ξi,
             −f(xi) + yi − ε ≤ ξi*,
             ξi, ξi* ≥ 0, for i = 1,…,N   (7)

The slack variables ξi, ξi* measure the distance between the observation and the ε tube. The relation between ε and ξi, ξi* is illustrated by the following example: imagine you have great confidence in your measurement process, but the variance of the measured phenomenon is large. In this case, ε has to be chosen a priori very small, while the slack variables ξi, ξi* are optimized and thus can be large. Remember that inside the ε tube ([f(x) − ε, f(x) + ε]) the loss function is zero.

Note that by introducing the couple (ξi, ξi*), the problem now has 2N unknown variables. But these variables are linked, since one of the two values is necessarily equal to zero: either the slack is positive (ξi* = 0) or negative (ξi = 0). Thus, yi ∈ [f(xi) − ε − ξi, f(xi) + ε + ξi*].

A classical way to reformulate the constraint-based minimization problem is to look for the saddle point of the Lagrangian L:

L(w, ξ, ξ*, α) = (1/2)||w||² + C Σ_{i=1..N} (ξi + ξi*) − Σ_{i=1..N} αi(yi − f(xi) + ε + ξi) − Σ_{i=1..N} αi*(f(xi) − yi + ε + ξi*) − Σ_{i=1..N} (ηi ξi + ηi* ξi*)   (8)

where αi, αi*, ηi, ηi* are Lagrange multipliers associated with the constraints; αi, αi* can be roughly interpreted as a measure of the influence of the constraints on the solution. A solution with αi = αi* = 0 can be interpreted as "the corresponding data point has no influence on this solution". The other points, with non-zero αi or αi*, are the "support vectors (SVs)" of the problem.

In practice, the dual formulation of the optimization problem is solved:

maximise   −(1/2) Σ_{i=1..N} Σ_{j=1..N} (αi* − αi)(αj* − αj)(xi·xj) − ε Σ_{i=1..N} (αi* + αi) + Σ_{i=1..N} yi(αi* − αi)   (9)

subject to   Σ_{i=1..N} (αi* − αi) = 0,   0 ≤ αi*, αi ≤ C, for i = 1,…,N

This is a quadratic programming (QP) problem and hence has a unique solution. It can be solved numerically by a number of methods. After we get the values αi and αi*, we can compute b from the constraints of the primal problem (7) and make predictions:

f(x) = Σ_{i=1..N} (αi* − αi)(xi·x) + b   (10)

Note that both the solution (10) and the optimization problem (9) are written in terms of dot products. Hence, we can use the so-called "kernel trick" to achieve a non-linear regression model. We substitute the dot products (xi·xj) with a suitable function K ∈ L2(Rⁿ)⊕L2(Rⁿ), K | (Rⁿ⊕Rⁿ) → R. If the kernel function satisfies Mercer's conditions:

∫∫ K(x′,x″) g(x′) g(x″) dx′ dx″ > 0   (11)

for any g(x) ∈ L2(Rⁿ), then it can be expanded in a uniformly converging series

K(x′,x″) = Σ_j λj Φj(x′) Φj(x″)   (12)

where {λj, Φj(.)} is an eigensystem of K. We may regard Φj(x) as the j-th feature of the vector x; then the kernel K is a dot product in some feature space. As (11) determines positively defined kernels, the substitution of K for the dot products in (9) results in a still convex QP problem:

maximise   −(1/2) Σ_{i=1..N} Σ_{j=1..N} (αi* − αi)(αj* − αj) K(xi,xj) − ε Σ_{i=1..N} (αi* + αi) + Σ_{i=1..N} yi(αi* − αi)   (13)

subject to   Σ_{i=1..N} (αi* − αi) = 0,   0 ≤ αi*, αi ≤ C, for i = 1,…,N

and the prediction is a non-linear regression function:

f(x) = Σ_{i=1..N} (αi* − αi) K(xi,x) + b   (14)

3. Case study

Radioactive soil contamination caused by the Chernobyl fallout features an anisotropic, highly variable and spotty spatial pattern. The multiscale character of the pattern is due to numerous influencing factors: the source term, weather conditions (especially rainfall), dry and damp precipitation, and surface properties (orography, ground cover, soil type, land use, etc.). The most significant influence on the long-term contamination was provided by the radionuclide cesium 137Cs. The half-life of this isotope is about 30 years.

The selected region is rectangular, covering 7428 km² with 845 populated sites. The basic statistical parameters of the data (137Cs concentration at 684 points) presented in Fig. 3 are the following: minimum value 5.9 kBq/m²,

Fig. 3. Raw data on 137Cs concentration in the Bryansk region.

mean value 571.8 kBq/m², maximum value 4334 kBq/m², variance 315,372 kBq²/m⁴, skewness 2.7 and kurtosis 16.9. As usual, the environmental data are positively skewed and their distribution is far from normal.

The samples reflecting the spatial contamination pattern are the subject of exploratory spatial data analysis to address spatial continuity. Spatial continuity is a feature of spatial processes which have some underlying origin in the physics of the process. The presence of spatial continuity means that closer samples are more likely to be similar than farther ones (Isaaks and Srivastava, 1989). Because the samples represent only one realization of the spatial process, some kind of stationarity assumption is required in order to use statistical methods. Strong stationarity means that for any finite number n of sample points xi (i = 1,…,n) and any lag h, the joint finite-dimensional distribution functions of Z(x1), Z(x2),…,Z(xn) are the same as those of Z(x1 + h), Z(x2 + h),…,Z(xn + h). In practice, this proposition is very hard to verify, and as we are usually interested only in the first two moments, the second-order stationarity assumption is enough. It is stationarity of the first two moments only: the mean is constant (E[Z(x)] = m = const) and the covariance (Cov(x1,x2) = E[Z(x1)Z(x2)] − m² = C(h)) exists and does not depend on x, but only on h.

Rather often, real data do not follow even the second-order stationarity model. The intrinsic hypothesis, which is weaker than second-order stationarity, is enough to apply geostatistical tools. The intrinsic hypothesis is a process with second-order stationarity applied to the increments. It means that the mean of the increments (also named the drift):

D(h) = E[Z(x) − Z(x + h)]   (15)

is constant with D(h) = 0 and does not depend on x or h, and the variance of the increments (2γ(h) = var[Z(x + h) − Z(x)]) exists and does not depend on x, only on h.

The drift D(h) can be an indicator of the data's obedience to the intrinsic hypothesis. Such a deduction can be made, for example, when the value of the drift D(h) fluctuates around zero (the drift is supposed to be zero whatever the position of h in the domain). If D(h) increases/decreases with the length of the separation vector h (see Fig. 4), then the data do not follow the intrinsic hypothesis. It can mean that the data have a systematic trend. In such cases, variogram modeling and the subsequent common geostatistical prediction will produce misleading results. To handle this problem, the trend must be removed from the data in the first place. Here, the machine learning algorithms have been used to model the trend in the data.

Fig. 4. The drift of 137Cs data.

Cell declustering was used for splitting the data into training and testing sets to provide efficient ML learning. The region was divided into rectangular cells by a regular grid, and one or several points were selected at random from each cell. The testing dataset was obtained in this way, and the rest of the data formed the training set. Thus, the testing set represents regional data. Of course, the training set in this case is somewhat clustered, which is not so good for MLA training. However, using the backward selection (i.e. picking out the points for the training dataset by declustering and considering the rest as the testing set), it is impossible to obtain a representative testing set. The selection procedure was carried out several times with different cell sizes and with varying numbers of points selected from each cell. Since it is difficult to control both the testing and training datasets, more attention was paid to the similarity of the training dataset to the structure of all the initial data. The similarity was controlled by comparing summary statistics, histograms and spatial correlation structures. Similarity of the spatial structures of the obtained datasets to the initial data is even more important than the statistical factors. Comparison of the spatial structure was carried out with the help of variogram roses, which show anisotropy. Such a comparison provides grounds that the split (see Fig. 5) with 484 training and 200 testing points is quite suitable for the following ML modeling, and it is the best of all obtained.

In the present study, MLP models with the following parameters were used: two input neurons describing the spatial co-ordinates (X, Y), one hidden layer, and an output neuron describing the 137Cs contamination. Backpropagation training with Levenberg-Marquardt followed by the conjugate gradient algorithm was used in order to avoid local minima (Masters, 1995).

The variogram analysis of the residuals obtained for the trained neural networks with varying numbers of neurons in the hidden layer showed that the optimal results (in the sense of modeling non-linear trends) were obtained by using an MLP with five neurons in a single hidden layer. Further increase of the number of hidden neurons leads to extracting more detailed local peculiarities of the pattern, reflected by multiple correlation range

of the variogram of the trend estimates. Then, the MLP is used for 137Cs spatial prediction mapping. Predictions were performed on a rectangular regular grid with a cell size of 1 × 1 km. The result for the 137Cs MLP large-scale mapping is presented in Fig. 6.

Fig. 5. Location of the training and testing points.

Fig. 6. 137Cs, artificial neural network (one hidden layer with five neurons) spatial predictions.

Let us present the results of the large-scale modeling using the support vector regression approach. Several user-defined (hyper) parameters influence the SVR model: the kernel function, C and ε. Gaussian radial basis functions (RBF) were found to be well suited for spatial environmental modeling:

K(x,x′) = exp(−|x − x′|² / (2σ²))   (16)

The kernel parameter—bandwidth σ—is related to the characteristic correlation scales of the trend model. A kernel bandwidth of 20 km is used for the presented model. The other parameters were defined as: C = 20, ε = 200. This choice is based both on the analysis of the training and testing errors and on the analysis of the variogram of the resulting trend model. A detailed description of the influence of the parameters on the solution and of the tuning procedure can be found in Kanevski et al. (2001). The mapping results (trend model) are presented in Fig. 7.

Fig. 7. 137Cs, support vector regression trend modeling.

The trained multilayer perceptron and support vector regression were able to extract some information from the data described by large-scale spatial correlations. The rest of the information—small-scale spatially structured residuals—was analyzed and modeled using geostatistical conditional stochastic simulations. The obtained residuals are correlated with the original data and are not correlated with the MLA estimates (see Figs. 8 and 9). The correlation coefficients between the residuals and the 137Cs sample values are equal to 0.77 (for the MLP residuals) and 0.79 (for the SVR residuals).

Fig. 8. Scatterplot of the MLP and SVR residuals vs. 137Cs sample values.

Fig. 9. Scatterplot of the MLP and SVR residuals vs. MLP and SVR estimates, respectively.

Exploratory variography of the spatial correlation structures of the Nscore transformed residuals is presented in Figs. 10 and 11. The variograms of the Nscore transformed residuals can be easily modeled (fitted to a theoretical model) and SGSs can be applied (the variogram reaches a sill and levels off). The final ML residual sequential Gaussian simulation results are presented as equiprobable realizations in Figs. 12 and 13. They keep the large-scale trend structure (from Figs. 6 and 7) and also feature distinctive spatial variability and small-scale effects ignored by the ML models.

Fig. 10. Nscore omni-directional variogram and the variogram model of the MLP residuals.

Fig. 11. Nscore omni-directional variogram and the variogram model of the SVR residuals.

The similarity and dissimilarity between the realizations describe spatial variability and uncertainty. The next step deals with the probabilistic mapping: mapping of the probability of being above/below some predefined decision level. This topic relates to the decision-oriented mapping of contaminated territories. Usually, hundreds

Fig. 12. Mapping of 137Cs with the neural network residual sequential Gaussian simulations model (NNRSGS).

Fig. 13. Mapping of 137Cs with the support vector regression residual sequential Gaussian simulations model (SVRRSGS).

of simulated models (realizations) are generated. Postprocessing of the realizations gives a rich variety of outputs, one of them being the probability/risk map. Probability maps of exceeding the level of 800 kBq/m², obtained with the neural network/support vector regression residual sequential Gaussian simulation models, are presented in Figs. 14 and 15, respectively. This is important advanced information for the real decision-making process.

4. Discussion

The final stage deals with the validation of the ML residual sequential Gaussian simulation results. Comparisons with geostatistical prediction models were carried out. The proposed models give comparable or better results on different data sets. A comparison between the proposed models (NNRSGS and SVRRSGS) was also carried out at the testing points. As a result, the NNRSGS model gives better results than the SVRRSGS model in terms of the testing error and the summary statistics of the testing distribution. A comprehensive comparison with other ML methods is a topic of further research.

Several important points should be mentioned:

(1) Analysis of the residuals is important also in the case when only ML mapping is applied. This helps to understand the quality of the results. If there is no spatial correlation in the residuals, it means that all the spatial information has been extracted from the data, and ML can be used for prediction mapping as well.
(2) Robustness of the approach: how sensitive it is to the selection of the ML architecture and the learning algorithm. Chernov et al. (1999) demonstrated the robustness of MLP with a varying number of neurons on validation data. Also, it was shown that MLP is more sensitive to the selection of the training set than to the number of neurons. The same robust behavior has been obtained in the case presented in this study, both for MLP and SVR (varying model parameters). So, we can choose the simplest ML models capable of learning and catching the non-linear trends.

Usually, the accuracy test (analysis of the residuals) has been used for the analysis and description of what was learned by ML. The accuracy test measures the correlation between the training data and the MLA predictions at the same points.

(3) Data clustering is a well-known problem in spatial data analysis (Deutsch and Journel, 1998). This
854 M. Kanevski et al. / Environmental Modelling & Software 19 (2004) 845–855

Fig. 14. Probability of exceeding level of 800 kBq/m2 for Fig. 15. Probability of exceeding level of 800 kBq/m2 for
NNRSGS model. SVRRSGS model.

problem is related to the spatial representativity of


data. The influence of clustering on the efficiency of approach may deal with multivariate cases as long as
ML algorithms should be studied in detail. ML algorithms are capable of dealing with multivariate
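The residual check described in point (1) amounts to computing an experimental semivariogram of the ML residuals: a flat, pure-nugget curve suggests no remaining spatial correlation. The sketch below is a minimal omnidirectional version; the function and variable names, lag binning, and synthetic white-noise residuals are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def empirical_variogram(coords, residuals, lags):
    """Omnidirectional experimental semivariogram of ML residuals.
    A flat, pure-nugget curve suggests the ML model has extracted
    all spatially structured information from the data."""
    coords = np.asarray(coords, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    # Pairwise separation distances and semivariogram increments
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    g = 0.5 * (residuals[:, None] - residuals[None, :]) ** 2
    iu = np.triu_indices(len(residuals), k=1)  # each pair once
    d, g = d[iu], g[iu]
    gamma = []
    for lo, hi in zip(lags[:-1], lags[1:]):
        mask = (d >= lo) & (d < hi)
        gamma.append(g[mask].mean() if mask.any() else np.nan)
    return np.array(gamma)

# Illustrative check: white-noise residuals at random 2-D locations
rng = np.random.default_rng(1)
xy = rng.uniform(0.0, 100.0, size=(200, 2))
res = rng.normal(0.0, 1.0, size=200)
gamma = empirical_variogram(xy, res, lags=np.linspace(0.0, 50.0, 6))
```

For spatially uncorrelated residuals, gamma fluctuates around the residual variance at all lags; a clear rise with distance would instead indicate structure left unexplained by the ML trend model.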
5. Conclusions

New non-stationary NNRSGS and SVRRSGS models for the analysis and mapping of spatially distributed data were developed. Non-linear trends in environmental data can be efficiently modeled by a three-layer perceptron. Combinations of ML and geostatistical models gave rise to decision-oriented risk and probabilistic mapping. The promising results presented are based on a unique case study: soil contamination by the most radiologically important Chernobyl radionuclide. Other kinds of ANN models (in particular, local approximators) can be used, with possible modifications, in the proposed framework.

ML-based models are preferable to pure geostatistical methods because the latter are limited by the presence of non-linear trends in the data, which are difficult to model. The computational costs of the method are rather low for a typical geostatistical problem, but its application requires deep expert knowledge of geostatistical modeling. Further, extensions of the approach may deal with multivariate cases, as long as ML algorithms are capable of handling multivariate information and can integrate different types of data. Extension of the model to image processing requires improvement and adaptation of the algorithms, especially on the ML side. Recent developments in ML algorithm implementations (see, e.g., http://www.torch.ch) are promising from the computational point of view.

The analysis and presentation of the results, as well as the MLP and Gaussian simulation modeling, were performed with the help of the GEOSTAT OFFICE software (Kanevski et al., 1999). Support vector regression modeling was carried out with the help of GeoSVM (http://www.ibrae.ac.ru/~mkanev).

Acknowledgements

The work was supported in part by the INTAS grants 99-00099, 97-31726, INTAS Aral Sea project #72, CRDF grant RG2-2236, and Russian Academy of Sciences grant for young scientists research N84, 1999.

References

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford.

Chernov, S., Demyanov, V., Grachev, N., Kanevski, M., Kravetski, A., Savelieva, E., Timonin, V., Maignan, M., 1999. Multiscale Pollution Mapping with Artificial Neural Networks and Geostatistics. In: Lippartd, S.J., Nass, A., Sinding-Larsen, R. (Eds.), Proceedings of the 5th Annual Conference of the International Association for Mathematical Geology (IAMG'99), August 1999, pp. 325–330.

Cressie, N., 1991. Statistics for Spatial Data. John Wiley & Sons, New York.

De Cort, M., Tsaturov, Yu.S., 1996. Atlas on caesium contamination of Europe after the Chernobyl nuclear plant accident. European Commission, Report EUR 16542 EN.

Demyanov, V., Kanevski, M., Savelieva, E., Timonin, V., Chernov, S., Polishuk, V., 2000. Neural Network Residual Stochastic Cosimulation for Environmental Data Analysis. Proceedings of the Second ICSC Symposium on Neural Computation (NC'2000), May 2000, Berlin, Germany, pp. 647–653.

Deutsch, C.V., Journel, A.G., 1998. GSLIB Geostatistical Software Library and User's Guide. Oxford University Press, New York, Oxford.

Dowd, P.A., 1994. The Use of Neural Networks for Spatial Simulation. In: Dimitrakopoulos, R. (Ed.), Geostatistics for the Next Century. Kluwer Academic Publishers, pp. 173–184.

Fedra, K., Winkelbauer, L., 1999. A hybrid expert system, GIS and simulation modeling for environmental and technological risk management. Environmental Decision Support Systems and Artificial Intelligence, Technical Report WS-99-07. AAAI Press, Menlo Park, CA, pp. 1–7.

Gambolati, G., Galeati, G., 1987. Comment on "Analysis of nonintrinsic spatial variability by residual kriging with application to regional groundwater levels" by Neuman and Jacobson. Mathematical Geology 19, 249–257.

Gilardi, N., Bengio, S., 2000. Local machine learning models for spatial data analysis. IDIAP-RR 00-34.

Haas, T.C., 1996. Multivariate spatial prediction in the presence of nonlinear trend and covariance nonstationarity. Environmetrics 7.

Haykin, S., 1999. Neural Networks. A Comprehensive Foundation, second ed. Prentice Hall International, Inc.

Isaaks, E.H., Srivastava, R.M., 1989. An Introduction to Applied Geostatistics. Oxford University Press, Oxford.

Kanevski, M., Demyanov, V., Chernov, S., Savelieva, E., Serov, A., Timonin, V., 1999. Geostat Office for Environmental and Pollution Spatial Data Analysis. Mathematische Geologie. CPress Publishing House, band 3, April, pp. 73–83.

Kanevski, M., Pozdnukhov, A., Canu, S., Maignan, M., Wong, P., Shibli, S., 2001. Support vector machines for classification and mapping of reservoir data. In: Soft Computing for Reservoir Characterization and Modeling. Springer-Verlag, pp. 531–558.

Kanevsky, M., Arutyunyan, R., Bolshov, L., Demyanov, V., Linge, I., Savelieva, E., Shershakov, V., Haas, T., Maignan, M., 1996a. Geostatistical Portrayal of the Chernobyl fallout. In: Baafi, E.Y., Schofield, N.A. (Eds.), Geostatistics '96, Wollongong, vol. 2. Kluwer Academic Publishers, pp. 1043–1054.

Kanevsky, M., Arutyunyan, R., Bolshov, L., Demyanov, V., Maignan, M., 1996b. Artificial neural networks and spatial estimations of Chernobyl fallout. Geoinformatics 7, 5–11.

Masters, T., 1995. Advanced Algorithms for Neural Networks. A C++ Sourcebook. John Wiley & Sons, Inc.

Neuman, S.P., Jacobson, E.A., 1984. Analysis of nonintrinsic spatial variability by residual kriging with application to regional groundwater levels. Mathematical Geology 16, 499–521.

Pélissier, R., Goreaud, F., 2001. A practical approach to the study of spatial structure in simple cases of heterogeneous vegetation. Journal of Vegetation Science 12, 99–108.

Scholkopf, B., Smola, A., 1998. Learning with Kernels. MIT Press, Cambridge, MA.

Vapnik, V., 1998. Statistical Learning Theory. John Wiley & Sons, New York.
