
This is a repository copy of Neural networks in geophysical applications.

White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/325/

Article:
Van der Baan, M. and Jutten, C. (2000) Neural networks in geophysical applications. Geophysics, 65 (4). pp. 1032-1047. ISSN 0016-8033. https://doi.org/10.1190/1.1444797

GEOPHYSICS, VOL. 65, NO. 4 (JULY-AUGUST 2000); P. 1032–1047, 7 FIGS., 1 TABLE.

Neural networks in geophysical applications

Mirko van der Baan∗ and Christian Jutten‡

ABSTRACT

Neural networks are increasingly popular in geophysics. Because they are universal approximators, these tools can approximate any continuous function with an arbitrary precision. Hence, they may yield important contributions to finding solutions to a variety of geophysical applications.

However, knowledge of many methods and techniques recently developed to increase the performance and to facilitate the use of neural networks does not seem to be widespread in the geophysical community. Therefore, the power of these tools has not yet been explored to their full extent. In this paper, techniques are described for faster training, better overall performance, i.e., generalization, and the automatic estimation of network size and architecture.

INTRODUCTION

Neural networks have gained in popularity in geophysics this last decade. They have been applied successfully to a variety of problems. In the geophysical domain, neural networks have been used for waveform recognition and first-break picking (Murat and Rudman, 1992; McCormack et al., 1993); for electromagnetic (Poulton et al., 1992), magnetotelluric (Zhang and Paulson, 1997), and seismic inversion purposes (Röth and Tarantola, 1994; Langer et al., 1996; Calderón-Macías et al., 1998); for shear-wave splitting (Dai and MacBeth, 1994), well-log analysis (Huang et al., 1996), trace editing (McCormack et al., 1993), seismic deconvolution (Wang and Mendel, 1992; Calderón-Macías et al., 1997), and event classification (Dowla et al., 1990; Romeo, 1994); and for many other problems.

Nevertheless, most of these applications do not use more recently developed techniques which facilitate their use. Hence, expressions such as "designing and training a network is still more an art than a science" are not rare. The objective of this paper is to provide a short introduction to these new techniques. For complete information covering the whole domain of neural network types, refer to the excellent reviews by Lippmann (1987), Hush and Horne (1993), Hérault and Jutten (1994), and Chentouf (1997).

The statement that "designing and training a network is still more an art than a science" is mainly attributable to several well-known difficulties related to neural networks. Among these, the problem of determining the optimal network configuration (i.e., its structure), the optimal weight distribution of a specific network, and the guarantee of a good overall performance (i.e., good generalization) are most eminent. In this paper, techniques are described to tackle most of these well-known difficulties.

Many types of neural networks exist. Some of these have already been applied to geophysical problems. However, we limit this tutorial to static, feedforward networks. Static implies that the weights, once determined, remain fixed and do not evolve with time; feedforward indicates that the output is not fed back, i.e., refed, to the network. Thus, this type of network does not iterate to a final solution but directly translates the input signals to an output independent of previous input.

Moreover, only supervised neural networks are considered—in particular, those suited for classification problems. Nevertheless, the same types of neural networks can also be used for function approximation and inversion problems (Poulton et al., 1992; Röth and Tarantola, 1994). Supervised classification mainly consists of three different stages (Richards, 1993): selection, learning or training, and classification. In the first stage, the number and nature of the different classes are defined and representative examples for each class are selected. In the learning phase, the characteristics of each individual class must be extracted from the training examples. Finally, all data can be classified using these characteristics.

Nevertheless, many other interesting networks exist—unfortunately, beyond the scope of this paper. These include the self-organizing map of Kohonen (1989), the adaptive resonance theory of Carpenter and Grossberg (1987), and the Hopfield network (Hopfield, 1984) and other recurrent networks. See Lippmann (1987) and Hush and Horne (1993) for a partial taxonomy.

Manuscript received by the Editor January 20, 1999; revised manuscript received February 3, 2000.

∗Formerly Université Joseph Fourier, Laboratoire de Géophysique Interne et Tectonophysique, BP 53, 38041 Grenoble Cedex, France; currently University of Leeds, School of Earth Sciences, Leeds LS2 9JT, UK. E-mail: [email protected].
‡Laboratoire des Images et des Signaux, Institut National Polytechnique, 46 av. Félix Viallet, 38031 Grenoble Cedex, France. E-mail: chris@lis-viallet.inpg.fr.
© 2000 Society of Exploration Geophysicists. All rights reserved.


This paper starts with a short introduction to two types of static, feedforward neural networks and explains their general way of working. It then proceeds with a description of new techniques to increase performance and facilitate their use. Next, a general strategy is described to tackle geophysical problems. Finally, some of these techniques are illustrated on a real data example—namely, the detection and extraction of reflections, ground roll, and other types of noise in a very noisy common-shot gather of a deep seismic reflection experiment.

NEURAL NETWORKS: STRUCTURE AND BEHAVIOR

The mathematical perceptron was conceived some 55 years ago by McCulloch and Pitts (1943) to mimic the behavior of a biological neuron (Figure 1a). The biological neuron is mainly composed of three parts: the dendrites, the soma, and the axon. A neuron receives an input signal from other neurons connected to its dendrites by synapses. These input signals are attenuated with an increasing distance from the synapses to the soma. The soma integrates its received input (over time and space) and thereafter activates an output depending on the total input. The output signal is transmitted by the axon and distributed to other neurons by the synapses located at the tree structure at the end of the axon (Hérault and Jutten, 1994).

The mathematical neuron proceeds in a similar but simpler way (Figure 1b) as integration takes place only over space. The weighted sum of its inputs is fed to a nonlinear transfer function (i.e., the activation function) to rescale the sum (Figure 1c). A constant bias θ is applied to shift the position of the activation function independent of the signal input. Several examples of such activation functions are displayed in Figure 1d.

Historically, the Heaviside or hard-limiting function was used. However, this particular activation function gives only a binary output (i.e., 1 or 0, meaning yes or no). Moreover, the optimum weights were very difficult to estimate since this particular function is not continuously differentiable. Thus, e.g., first-order perturbation theory cannot be used. Today, the sigmoid is mostly used. This is a continuously differentiable, monotonically increasing function that can best be described as a smooth step function (see Figure 1d). It is expressed by f_s(α) = (1 + e^(−α))^(−1).

To gain some insight in the working of static feedforward networks and their ability to deal with classification problems, two such networks will be considered: one composed of a single neuron and a second with a single layer of hidden neurons. Both networks will use a hard-limiting function for simplicity.

Figure 2a displays a single neuron layer. Such a network can classify data in two classes. For a 2-D input, the two distributions are separated with a line (Figure 2b). In general, the two classes are separated by an (n − 1)-dimensional hyperplane for an n-dimensional input.

FIG. 1. The biological and the mathematical neuron. The mathematical neuron (b) mimics the behavior of the biological neuron
(a). The weighted sum of the inputs is rescaled by an activation function (c), of which several examples are shown in (d). Adapted
from Lippmann (1987), Hérault and Jutten (1994), and Romeo (1994).
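As an illustration of the mathematical neuron just described, the following short sketch (a Python/NumPy addition of ours, not part of the original paper; the weights, bias, and input are arbitrary example values) evaluates a single neuron with the sigmoid and the hard-limiting activation functions.

```python
import numpy as np

def sigmoid(a):
    # f_s(alpha) = 1 / (1 + exp(-alpha)), the smooth step function of Figure 1d
    return 1.0 / (1.0 + np.exp(-a))

def heaviside(a):
    # hard-limiting (Heaviside) activation: binary yes/no output
    return (a >= 0.0).astype(float)

def neuron(x, w, theta, activation=sigmoid):
    # weighted sum of the inputs, shifted by the bias theta,
    # then rescaled by the activation function
    return activation(np.dot(w, x) - theta)

# example: a 2-D input classified by a single neuron
x = np.array([0.8, -0.3])      # input signal
w = np.array([1.5, 2.0])       # synaptic weights
theta = 0.5                    # bias

print(neuron(x, w, theta, sigmoid))    # graded output in (0, 1)
print(neuron(x, w, theta, heaviside))  # binary output, 0 or 1
```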

More complex distributions can be handled if a hidden layer of neurons is added. Such layers lie between the input and output layers, connecting them indirectly. However, the general way of working does not change at all, as shown in Figures 3a and 3b. Again, each neuron in the hidden layer divides the input space in two half-spaces. Finally, the last neuron combines these to form a closed shape or subspace. With the addition of a second hidden layer, quite complex shapes can be formed (Romeo, 1994). See also Figure 14 in Lippmann (1987).

Using a sigmoidal instead of a hard-limiting function does not change the general picture. The transitions between classes are smoothed. On the other hand, the use of a Gaussian activation function implicates major changes, since it has a localized response. Hence, the sample space is divided in two parts. The part close to the center of the Gaussian with large outputs is enveloped by the subspace at its tails showing small output values. Thus, only a single neuron with a Gaussian activation function and constant variance is needed to describe the gray class in Figure 3 instead of the depicted three neurons with hard-limiting or sigmoidal activation functions. Moreover, the Gaussian will place a perfect circle around the class in the middle (if a common variance is used for all input parameters).

This insight into the general way neural networks solve classification problems enables a user to obtain a first notion of the structure required for a particular application. In the case of very complicated problems with, say, skewed, multimodal distributions, one will probably choose a neural network structure with two hidden layers. However, Cybenko (1989) shows that neural networks using sigmoids are able to approximate asymptotically any continuous function with arbitrarily close precision using only a single nonlinear, hidden layer and linear output units. Similarly, Park and Sandberg (1991) show that, under mild conditions, neural networks with localized activation functions (such as Gaussians) are also universal approximators. Unfortunately, neither theorem is able to predict the exact number of neurons needed since these are asymptotic results. Moreover, applications exist where neural networks with two hidden layers produce similar results as a single-hidden-layer neural network with a strongly reduced number of links and, therefore, a less complicated weight optimization problem, i.e., making training much easier (Chentouf, 1997).

FIG. 2. (a) Single perceptron layer and (b) associated decision boundary. Adapted from Romeo (1994).

FIG. 3. (a) Single hidden perceptron layer and (b) associated decision boundary. Adapted from Romeo (1994).
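The way a single hidden layer combines half-spaces into a closed subspace (Figure 3) can be mimicked in a few lines of code. In the sketch below (not from the paper; the weights are chosen by hand purely for illustration), three hard-limiting hidden neurons define three half-planes and the output neuron fires only where all three agree, i.e., inside a triangle.

```python
import numpy as np

def hard_limit(a):
    # Heaviside activation: 1 inside the half-space, 0 outside
    return (a >= 0.0).astype(float)

# three hidden neurons; each row of W1 and entry of theta1 defines one
# decision line w . x - theta = 0, i.e., one half-plane (cf. Figure 2b)
W1 = np.array([[ 0.0,  1.0],    # y >= 0
               [ 1.0, -1.0],    # x - y >= 0
               [-1.0, -1.0]])   # x + y <= 1
theta1 = np.array([0.0, 0.0, -1.0])

# output neuron: fires only if all three hidden neurons fire,
# i.e., the input lies inside the triangle (the closed subspace of Figure 3b)
w2 = np.array([1.0, 1.0, 1.0])
theta2 = 2.5

def classify(x):
    h = hard_limit(W1 @ x - theta1)            # divide the plane into half-spaces
    return hard_limit(np.dot(w2, h) - theta2)  # intersect them

print(classify(np.array([0.3, 0.2])))   # 1.0: inside the triangle
print(classify(np.array([2.0, 2.0])))   # 0.0: outside
```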

Two types of activation functions are used in Figure 1d. The hard-limiter and the sigmoid are monotonically increasing functions, whereas the Gaussian has a localized activation. Both types are commonly used in neural network applications. In general, neural networks with monotonically increasing activation functions are called multilayer perceptrons (MLP) and neural networks with localized activation functions are called radial basis functions (RBF) (Table 1).

Table 1. Abbreviations.

MLP   Multilayer Perceptrons
NCU   Noncontributing Units
OBD   Optimal Brain Damage
OBS   Optimal Brain Surgeon
PCA   Principal Component Analysis
RBF   Radial Basis Functions
SCG   Scaled Conjugate Gradient

Hence, MLP networks with one output perceptron and a single hidden layer are described by

f_MLP(x) = σ( Σ_{k=1}^{n_h1} w_k σ(w^(k) · x − θ^(k)) − θ ),    (1)

with σ(·) the sigmoidal activation function, x the input, w_k the weight of link k to the output node, n_h1 the number of nodes in the hidden layer, w^(k) the weights of all links to node k in the hidden layer, and θ the biases. Boldface symbols indicate vectors. Equation (1) can be extended easily to contain several output nodes and more hidden layers.

Likewise, RBF networks with a single hidden layer and one output perceptron are described by

f_RBF(x) = σ( Σ_{k=1}^{n_h1} w_k K(s_k ‖x − c^(k)‖) − θ ),    (2)

with K(·) the localized activation function, ‖·‖ a (distance) norm, c^(k) the center of the localized activation function in hidden node k, and s_k its associated width (spread).

It is important to be aware of the total number (n_tot) of internal variables determining the behavior of the neural network structure used, as we show hereafter. Fortunately, this number is easy to calculate from equations (1) and (2). For MLP networks it is composed of the number of links plus the number of perceptrons to incorporate the number of biases. If n_i denotes the number of input variables, n_hi the number of perceptrons in the ith hidden layer, and n_o the number of output perceptrons, then n_tot is given by

n_tot = (n_i + 1) ∗ n_h1 + (n_h1 + 1) ∗ n_o    (3)

for an MLP with a single hidden layer and

n_tot = (n_i + 1) ∗ n_h1 + (n_h1 + 1) ∗ n_h2 + (n_h2 + 1) ∗ n_o    (4)

for an MLP with two hidden layers. The number of internal variables is exactly equal for isotropic RBF networks since each Gaussian is described by n_i + 1 variables for its position and variance, i.e., width. Moreover, in this paper only RBF networks with a single hidden layer are considered. In addition, only RBF neurons in the hidden layer have Gaussian activation functions. The output neurons have sigmoids as activation functions. Hence, n_tot is also given by equation (3).

As we will see, the ratio n_tot/m determines if an adequate network optimization can be hoped for, where m defines the number of training samples.
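A direct transcription of equations (1) through (4) may help fix the notation. The following sketch (ours, not the authors'; network sizes and weights are arbitrary) evaluates f_MLP and f_RBF for a single hidden layer, using a Gaussian kernel K(r) = exp(−r²) as the localized activation, and counts the internal variables n_tot.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def f_mlp(x, W, theta_h, w_out, theta_o):
    # equation (1): sigma( sum_k w_k * sigma(w^(k) . x - theta^(k)) - theta )
    hidden = sigmoid(W @ x - theta_h)          # one row of W per hidden node
    return sigmoid(np.dot(w_out, hidden) - theta_o)

def f_rbf(x, C, s, w_out, theta_o):
    # equation (2): sigma( sum_k w_k * K(s_k ||x - c^(k)||) - theta ),
    # here with a Gaussian kernel K(r) = exp(-r^2) as localized activation
    r = s * np.linalg.norm(x - C, axis=1)      # one row of C per hidden node
    return sigmoid(np.dot(w_out, np.exp(-r**2)) - theta_o)

def n_tot_mlp(n_i, n_hidden, n_o):
    # equations (3) and (4): links plus one bias per perceptron,
    # written for any number of hidden layers
    sizes = [n_i] + list(n_hidden) + [n_o]
    return sum((sizes[k] + 1) * sizes[k + 1] for k in range(len(sizes) - 1))

# example: 4 inputs, 5 hidden nodes, 1 output
n_i, n_h1, n_o = 4, 5, 1
rng = np.random.default_rng(0)
x = rng.normal(size=n_i)
W, theta_h = rng.normal(size=(n_h1, n_i)), rng.normal(size=n_h1)
w_out, theta_o = rng.normal(size=n_h1), 0.0
C, s_width = rng.normal(size=(n_h1, n_i)), np.ones(n_h1)

print(f_mlp(x, W, theta_h, w_out, theta_o))
print(f_rbf(x, C, s_width, w_out, theta_o))
print(n_tot_mlp(n_i, [n_h1], n_o))   # (4+1)*5 + (5+1)*1 = 31 internal variables
```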

NETWORK OPTIMIZATION

Known problems

The two most important steps in applying neural networks to recognition problems are the selection and learning stages, since these directly influence the overall performance and thus the results obtained. Three reasons can cause a bad performance (Romeo, 1994): an inadequate network configuration, the training algorithm being trapped in a local minimum, or an unsuitable learning set.

Let us start with the network configuration. As shown in Figures 2 and 3, the network configuration should allow for an adequate description of the underlying statistical distribution of the spread in the data. Since the number of input and output neurons is fixed in many applications, our main concern is with the number of hidden layers and the number of neurons therein.

No rules exist for determining the exact number of neurons in a hidden layer. However, Huang and Huang (1991) show that the upper bound of the number of neurons needed to reproduce exactly the desired outputs of the training samples is on the order of m, the number of training samples. Thus, the number of neurons in the hidden layer should never exceed the number of training samples. Moreover, to keep the training problem overconstrained, the number of training samples should always be larger than the number of internal weights. In practice, m ≈ 10 n_tot is considered a good choice. Hence, the number of neurons should be limited; otherwise, the danger exists that the training set is simply memorized by the network (overfitting). Classically, the best configuration is found by trial and error, starting with a small number of nodes.

A second reason why the network may not obtain the desired results is that it may become trapped in a local minimum. The misfit function is very often extremely complex (Hush et al., 1992). Thus, the network can easily be trapped in a local minimum instead of attaining the sought-for global one. In that case even the training set cannot be fit properly.

Remedies are simple. Either several minimization attempts must be done, each time using a different (random or nonrandom) initialization of the weights, or other inversion algorithms must be considered, such as global search.

Finally, problems can occur with the selected training set. The two most frequent problems are overtraining and a bad, i.e., unrepresentative, learning set. In the latter case, either too many bad patterns are selected (i.e., patterns attributed to the wrong class) or the training set does not allow for a good generalization. For instance, the sample space may be incomplete, i.e., samples needed for an adequate training of the network are simply missing.

Overtraining of the learning set may also pose a problem. Overtraining means the selected training set is memorized such that performance is only excellent on this set but not on other data. To circumvent this problem, the selected set of examples is often split into a training and a validation set. Weights are optimized using the training set. However, crossvalidation with the second set ensures an overall good performance.

In the following subsections, all of these problems are considered in more detail, and several techniques are described to facilitate the use of neural networks and to enhance their performance.

Network training/weight estimation: An optimization problem

If a network configuration has been chosen, an optimal weight distribution must be estimated. This is an inversion or optimization problem. The most common procedure is a so-called localized inversion approach. In such an approach, we first assume that the output y can be calculated from the input x using some kind of function f, i.e., y = f(x). Output may be contaminated by noise, which is assumed to be uncorrelated to the data and to have zero mean. Next, we assume that the function can be linearized around some initial estimate x_0 of the input vector x using a first-order Taylor expansion, i.e.,

y = f(x_0) + [∂f(x_0)/∂x] Δx.    (5)

If we write y_0 = f(x_0), Δy = y − y_0, and A^(x) = ∂f/∂x, equation (5) can also be formulated as

Δy = A^(x) Δx,    (6)

where the Jacobian A^(x) = ∇_x f contains the first partial derivatives with respect to x. To draw an analogy with a better known inversion problem, in a tomography application Δy would contain the observed traveltimes, Δx the desired slowness model, and A_ij the path lengths of ray i in cell j.

However, there exists a fundamental difference with a tomography problem. In a neural network application, both the output y and the input x are known, since y^(i) represents the desired output for training sample x^(i). Hence, the problem is not the construction of a model x explaining the observations, but the construction of the approximation function f. Since this function is described by its internal variables, it is another linear system that must be solved, namely,

Δy = A^(w) Δw,    (7)

where the Jacobian A^(w) = ∇_w f contains the first partial derivatives with respect to the internal variables w. The vector w contains the biases and weights for MLP networks and the weights, variances, and centers for RBF networks. For the exact expression of A^(w), we refer to Hush and Horne (1993) and Hérault and Jutten (1994). Nevertheless, all expressions can be calculated analytically. Moreover, both the sigmoid and the Gaussian are continuously differentiable, which is the ultimate reason for their use. Thus, no first-order perturbation theory must be applied to obtain estimates of the desired partial derivatives, implying a significant gain in computation time for large neural networks.

In general, the optimization problem will be ill posed since A^(w) suffers from rank deficiency, i.e., rank(A^(w)) ≤ n_tot. Thus, system (7) is underdetermined. However, at the same time, any well-formulated inversion problem will be overconstrained because m ≫ n_tot, yielding that there are more training samples than internal variables.

Since system (7) is ill posed, a null space will exist. Hence, the internal variables cannot be determined uniquely. If, in addition, n_tot ≫ m, then the danger of overtraining, i.e., memorization, increases considerably, resulting in suboptimal performance. Two reasons cause A to be rank deficient. First, the sample space may be incomplete, i.e., some samples needed for an accurate optimization are simply missing and some training samples may be erroneously attributed to a wrong class. Second, noise contamination will prevent a perfect fit of both provided and nonprovided data. For example, in a tomographic problem, rank deficiency will already occur if nonvisited cells are present, making a correct estimate of the true velocities in these cells impossible.

To give an idea of the number of training samples required, the theoretical study of Baum and Haussler (1989) shows that for a desired accuracy level of (1 − ε), at least n_tot/ε examples must be provided, i.e., m ≥ n_tot/ε. Thus, to classify 90% of the data correctly, at least 10 times more samples must be provided than internal variables are present, i.e., m ≥ 10 n_tot.

How can we solve equation (7)? A possible method of estimating the optimal Δw is by minimizing the sum of the squared differences between the desired and the actual output of the network. This leads to the least-mean-squares solution, i.e., the weights are determined by solving the normal equations

Δw = (A^t A)^(−1) A^t Δy,    (8)

where the superscript (w) is dropped for clarity.

This method, however, has the well-known disadvantage that singularities in A^t A cause the divergence of the Euclidean norm |Δw| of the weights, since this norm is inversely proportional to the smallest singular value of A. Moreover, if A is rank deficient, then this singular value will be zero or at least effectively zero because of a finite machine precision. The squared norm |Δw|^2 is also often called the variance of the solution.

To prevent divergence of the solution variance, very often a constrained version of equation (8) is constructed using a positive damping variable β. This method is also known as Levenberg–Marquardt or Tikhonov regularization, i.e., system (8) is replaced by

Δw = (A^t A + βI)^(−1) A^t Δy,    (9)

with I the identity matrix (Lines and Treitel, 1984; Van der Sluis and Van der Vorst, 1987).

The matrix A^t A + βI is not rank deficient in contrast to A^t A. Hence, the solution variance does not diverge but remains constrained. Nevertheless, the method comes at an expense: the solution will be biased because of the regularization parameter β. Therefore, it does not provide the optimal solution in a least-mean-squares sense. The exact value of β must be chosen judiciously to optimize the trade-off between variance and bias (see Van der Sluis and Van der Vorst, 1987; Geman et al., 1992).

More complex regularization can be used on both Δw and Δy. For instance, if uncertainty bounds on the output are known (e.g., their variances), then these can be used to rescale the output. A similar rescaling can also be applied on the input and/or weights. This method allows for incorporating any a priori information available. Hence, a complete Bayesian inversion problem can be formulated. See Tarantola (1987) for details on this approach.
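To make equations (8) and (9) concrete, the sketch below (an illustration under our own assumptions, not code from the paper) computes one damped least-squares estimate of Δw from a Jacobian A and a residual vector Δy; the rank-deficient Jacobian is random and serves only to show how the damping β keeps the solution norm bounded.

```python
import numpy as np

def damped_lsq_update(A, dy, beta=0.0):
    # equation (8) for beta = 0 (normal equations) and
    # equation (9) for beta > 0 (Levenberg-Marquardt / Tikhonov damping)
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + beta * np.eye(n), A.T @ dy)

# demonstration with a random, rank-deficient Jacobian (m rows, n_tot columns)
rng = np.random.default_rng(1)
m, n_tot = 50, 20
A = rng.normal(size=(m, 5)) @ rng.normal(size=(5, n_tot))  # rank 5 < n_tot
dy = rng.normal(size=m)                                    # residual: desired minus actual output

# the undamped system is singular (A^t A is rank deficient);
# increasing beta constrains the solution "variance" |dw|^2 at the cost of bias
for beta in (1e-8, 1e-2, 1.0):
    dw = damped_lsq_update(A, dy, beta)
    print(beta, np.linalg.norm(dw))
```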

Just as in tomography problems, equations (8) and (9) are rarely solved directly. More often an iterative approach is applied. The best known method in neural network applications is the gradient back-propagation method of Rumelhart et al. (1986) with or without a momentum term, i.e., a term analogous to the function of the regularization factor β. It is a so-called first-order optimization method which approximates (A^t A)^(−1) in equations (8) and (9) by αI with β = 0.

This method is basically a steepest descent algorithm. Hence, all disadvantages of such gradient descent techniques apply. For instance, in the case of curved misfit surfaces, the gradient will not always point to the desired global minimum. Therefore, convergence may be slow (see Lines and Treitel, 1984). To accelerate convergence, the calculated gradients are multiplied with a constant factor α, 0 < α < 2. However, a judicious choice of α is required, since nonoptimal choices will have exactly the opposite effect, i.e., convergence will slow even further. For instance, if α is too large, strongly oscillating misfits are obtained that do not converge to a minimum; choosing too small a value will slow convergence and possibly hinder the escape from very small local minima. Furthermore, convergence is not guaranteed within a certain number of iterations. In addition, previous ameliorations in the misfit can be partly undone by the next iterations.

Although several improvements have been proposed concerning adaptive modifications of both α (Dahl, 1987; Jacobs, 1988; Riedmiller and Braun, 1993) and more complex regularization terms (Hanson and Pratt, 1989; Weigend et al., 1991; Williams, 1995), the basic algorithm remains identical. Fortunately, other algorithms can be applied to solve the inversion problem. As a matter of fact, any method can be used which solves the normal equations (8) or (9), such as Gauss–Newton methods. Particularly suited are scaled conjugate gradient (SCG) methods, which are proven to converge within min(m, n_tot) iterations, automatically estimate (A^t A + βI)^(−1) without an explicit calculation, and have a memory of previous search directions, since the present gradient is always conjugate to all previously computed ones (Møller, 1993; Masters, 1995).

Furthermore, in the case of strongly nonlinear error surfaces with, for example, several local minima, both genetic algorithms and simulated annealing (Goldberg, 1989; Hertz et al., 1991; Masters, 1995) offer interesting alternatives, and hybrid techniques can be considered (Masters, 1995). For instance, simulated annealing can be used to obtain several good initial weight distributions, which can then be optimized by an SCG method. A review of learning algorithms including second-order methods can be found in Battiti (1992). Reed (1993) gives an overview of regularization methods.

A last remark concerns the initialization of the weights. Equation (5) clearly shows the need to start with a good initial guess of these weights. Otherwise, training may become very slow and the risk of falling in local minima increases significantly. Nevertheless, the most commonly used procedure is to apply a random initialization, i.e., w_i ∈ [−r, r]. Even some optimum bounds for r have been established [see, for example, Nguyen and Widrow (1990)].

As mentioned, an alternative procedure is to use a global training scheme first to obtain several good initial guesses to start a localized optimization. However, several theoretical methods have also been developed. The interested reader is referred to the articles of Nguyen and Widrow (1990), who use a linearization by parts of the produced output of the hidden neurons; Denœux and Lengellé (1993), who use prototypes (selected training examples) for an adequate initialization; and Sethi (1990, 1995), who uses decision trees to implement a four-layer neural network. Another interesting method is given by Karouia et al. (1994) using the theoretical results of Gallinari et al. (1991), who show that a formal equivalence exists between linear neural networks and discriminant or factor analyses. Hence, they initialize their neural networks so that such an analysis is performed and start training from there on.

All of these initialization methods make use of the fact that although linear methods may not be capable of solving all considered applications, they constitute a good starting point for a neural network. Hence, a linear initialization is better than a random initialization of weights.

Generalization

Now that we are able to train a network, a new question arises: When should training be stopped? It would seem to be a good idea to stop training when a local minimum is attained or when the convergence rate has become very small, i.e., improvement from iteration to iteration is zero or minimal. However, Geman et al. (1992) show that this leads to overtraining, i.e., memorization of the training set: now the noise is fitted, not the global trend. Hence, the obtained weight distribution will be optimal for the training samples, but it will result in bad performance in general. A similar phenomenon occurs in tomography problems, where it is known as overfit (Scales and Snieder, 1998).

Overtraining is caused by the fact that system (7) is ill posed, i.e., a null space exists. The least-mean-squares solution of system (7), equation (8), will result in optimal performance only if a perfect and complete training set is used without any noise contamination. Otherwise, any solution is nonunique because of the existence of this null space. Regularization with equation (9) reduces the influence of the null space but also results in a biased solution, as mentioned earlier.

The classical solution to this dilemma is to use a split set of examples. One part is used for training; the other part is used as a reference set to quantify the general performance (Figure 4). Training is stopped when the misfit of the reference set reaches a minimum. This method is known as holdout crossvalidation.

FIG. 4. Generalization versus training error. Adapted from Moody (1994).

Although this method generally produces good results, it results in a reduced training set that may pose a problem if only a limited number of examples is available. Because this method requires subdivision of the number of existing examples, the final number of used training samples is reduced even further. Hence, the information contained in the selected examples is not optimally used and the risk of underconstrained training increases.

It is possible to artificially increase the number of training samples m by using noise injection or synthetic modeling to generate noise-free data. However, caution should be used when applying such artificial methods. In the former, small, random perturbations are superimposed on the existing training data. Mathematically, this corresponds to weight regularization (Matsuoka, 1992; Bishop, 1995; Grandvalet and Canu, 1995), thereby only reducing the number of effective weights. Moreover, the noise parameters must be chosen judiciously to optimize again the bias/variance trade-off. In addition, a bad noise model could introduce systematic errors. In the latter case, the underlying model may inadequately represent the real situation, thus discarding or misinterpreting important mechanisms.

To circumvent the problem of split data sets, some other techniques exist: generalized crossvalidation methods, residual analysis, and theoretical measures which examine both obtained output and network complexity.

The problem of holdout crossvalidation is that information contained in some examples is left out of the training process. Hence, this information is partly lost, since it is only used to measure the general performance but not to extract the fundamentals of the considered process. As an alternative, v-fold crossvalidation can be considered (Moody, 1994; Chentouf, 1997).

In this method, the examples are divided into v sets of (roughly) equal size. Training is then done v times on v − 1 sets, in which each time another set is excluded. The individual misfit is defined as the misfit of the excluded set, whereas the total misfit is defined as the average of the v individual misfits. Training is stopped when the minimum of the total misfit is reached or convergence has become very slow. In the limit of v = m, the method is called leave one out. In that case, training is done on m − 1 examples and each individual misfit is calculated on the excluded example.

The advantage of v-fold crossvalidation is that no examples are ultimately excluded in the learning process. Therefore, all available information contained in the training samples is used. Moreover, training is performed on a large part of the data, namely on (m − m/v) examples. Hence, the optimization problem is more easily kept overconstrained. On the other hand, training is considerably slower because of the repeated crossvalidations. For further details refer to Stone (1974) and Wahba and Wold (1975). Other statistical methods can also be considered, such as the jackknife or the bootstrap (Efron, 1979; Efron and Tibshirani, 1993; Masters, 1995)—two statistical techniques that try to obtain the true underlying statistical distribution from the finite amount of available data without posing a priori assumptions on this distribution. Moody (1994) also describes a method called nonlinear crossvalidation.
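The v-fold crossvalidation bookkeeping described above can be written down generically. In the sketch below (ours; a linear least-squares fit stands in for the network training procedure, and the data are synthetic), the total misfit is the average of the v individual misfits, each computed on the excluded set.

```python
import numpy as np

def v_fold_misfit(X, y, v, fit, misfit):
    """Total misfit of v-fold crossvalidation: the average of the v individual
    misfits, each measured on the set excluded from training."""
    m = len(y)
    folds = np.array_split(np.arange(m), v)
    individual = []
    for k in range(v):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(v) if j != k])
        model = fit(X[train], y[train])                      # train on v - 1 sets
        individual.append(misfit(model, X[test], y[test]))   # misfit of excluded set
    return np.mean(individual)

# demonstration with a linear least-squares "network" as the fit procedure;
# v = m would correspond to leave one out
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=30)

fit = lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0]
misfit = lambda w, A, b: np.mean((A @ w - b) ** 2)

print(v_fold_misfit(X, y, v=5, fit=fit, misfit=misfit))
```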

Another possible way to avoid a split training set is to minimize theoretical criteria relating the network complexity and misfit to the general performance (Chentouf, 1997). Such criteria are based on certain theoretical considerations that must be satisfied. Some well-known measures are the AIC and BIC criteria of Akaike (1970). Others can be found in Judge et al. (1980). For instance, the BIC criterion is given by

BIC = ln(σ_r^2 / m) + n_tot (ln m) / m,    (10)

where σ_r^2 denotes the variance of the error residuals (misfits). The first term is clearly related to the misfit; the second is related to the network complexity.

These criteria, however, have been developed for linear systems and are not particularly suited to neural networks because of their nonlinear activation functions. Hence, several theoretical criteria had to be developed for such nonlinear systems (MacKay, 1992; Moody, 1992; Murata et al., 1994). Like their predecessors, they are composed of a term related to the misfit and a term describing the complexity of the network. Hence, these criteria also try to minimize both the misfit and the complexity of a network simultaneously.

However, these criteria are extremely powerful only if the underlying theoretical assumptions are satisfied and in the limit of an infinite training set, i.e., m ≫ n_tot. Otherwise they may yield erroneous predictions that can decrease the general performance of the obtained network. Moreover, these criteria can be used only if the neural network is trained and its structure is adapted simultaneously.

A third method has been proposed in the neural network literature by Jutten and Chentouf (1995), inspired by statistical optimization methods. It consists of a statistical analysis of the error residuals, i.e., an analysis of the misfit for all output values of all training samples is performed. It states that an optimally trained network has been obtained if the residuals and the noise have the same characteristics. For example, if noise is assumed to be white, training is stopped if the residuals have zero mean and exhibit no correlations (as measured by a statistical test). The method can be extended to compensate for nonwhite noise (Hosseini and Jutten, 1998). The main drawback of this method is that a priori assumptions must be made concerning the characteristics of the noise.

Configuration optimization: Preprocessing and weight regularization

The last remaining problem concerns the construction of a network configuration yielding optimal results. Insight in the way neural networks tackle classification problems already allows for a notion of the required number of hidden layers and the type of neural network. Nevertheless, in most cases only vague ideas of the needed number of neurons per hidden layer exist.

Classically, this problem is solved by trial and error, i.e., several structures are trained and their performances are examined. Finally, the best configuration is retained. The main problem with this approach is its need for extensive manual labor, which may be very costly, although automatic scripts can be written for construction, training, and performance testing.

In addition, the specific application and its complexity are not the only factors of influence. As shown above, the ratio of the number of total internal variables to the number of training samples is of direct importance to prevent an underconstrained optimization problem. This problem is of immediate concern for applications disposing of large input vectors, i.e., n_i is large, although regularization may help limit the number of effective weights (Hush and Horne, 1993). Very often the number of required links and nodes can be reduced easily using preprocessing techniques to highlight the important information contained in the input or by using local connections and weight sharing.

Many different preprocessing techniques are available. However, one of the best known is principal component analysis, or the Karhunen–Loève transform. In this approach, training samples are placed as column vectors in a matrix X. The covariance matrix XX^t is then decomposed in its eigenvalues and eigenvectors. Finally, training samples and, later, data are projected upon the eigenvectors of the p largest eigenvalues (p < m). These eigenvectors span a new set of axes displaying a decreasing order of linear correlation between the training samples. In this way, any redundancy in the input may be reduced. Moreover, only similarities are extracted, which may reduce noise contamination. The ratio of the sum of the p largest eigenvalues (squared) over the total sum of squared eigenvalues yields an accurate estimate of the information contained in the projected data. More background is provided in Richards (1993) and Van der Baan and Paul (2000).

The matrix X may contain all training samples, the samples of only a single class, or individual matrices for each existing class. In the latter case, each class has its own network and particular preprocessing of the data. The individual networks are often called expert systems, only able to detect a single class and therefore requiring repeated data processing to extract all classes.

Use of the Karhunen–Loève transform may pose problems if many different classes exist because it will become more difficult to distinguish between classes using their common features. As an alternative, a factor or canonical analysis may be considered. This method separates the covariance matrix of all data samples into two covariance matrices of training samples within classes and between different classes. Next, a projection is searched that simultaneously yields minimum distances within classes and maximum distances between classes. Hence, only a single projection is required. A more detailed description can be found in Richards (1993).

The reason why principal component and factor analyses may increase the performance of neural networks is easy to explain. Gallinari et al. (1991) show that a formal equivalence exists between linear neural networks (i.e., with linear activation functions) and discriminant or factor analyses. Strong indications exist that nonlinear neural networks (such as MLP and RBF networks) are also closely related to discriminant analyses. Hence, the use of a principal component or a factor analysis allows for a simplified network structure, since part of the discrimination and data handling has already been performed. Therefore, local minima are less likely to occur.
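The Karhunen–Loève transform described above reduces to a few lines of linear algebra. The following sketch (our illustration; the random 20-component input is arbitrary) projects the training samples upon the eigenvectors of the p largest eigenvalues of XX^t and reports the fraction of information retained, computed from the squared eigenvalues as in the text.

```python
import numpy as np

def pca_projection(X, p):
    """Karhunen-Loeve transform: training samples are the columns of X.
    Returns the eigenvectors of the covariance matrix X X^t belonging to the
    p largest eigenvalues, plus the fraction of information they carry."""
    cov = X @ X.T
    eigval, eigvec = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:p]          # p largest eigenvalues
    retained = np.sum(eigval[order] ** 2) / np.sum(eigval ** 2)
    return eigvec[:, order], retained

# training samples (and, later, data) are projected upon these eigenvectors
rng = np.random.default_rng(4)
X = rng.normal(size=(20, 200))        # 20-component inputs, 200 training samples
basis, retained = pca_projection(X, p=5)
X_reduced = basis.T @ X               # 5-component inputs fed to the network

print(X_reduced.shape, round(retained, 3))
```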

Other interesting preprocessing techniques to reduce input can be found in Almeida (1994). All of these are cast in the form of neural network structures. Notice, however, that nearly always the individual components of the input are scaled to lie within well-defined ranges (e.g., between −1 and 1) to put the dynamic range of the input values within the most sensitive part of the activation functions. This often results in a more optimal use of the input. Hence, it may reduce the number of hidden neurons. For instance, Le Cun et al. (1991) show that correcting each individual input value for the mean and standard deviation of this component in the training set will increase the learning speed. Furthermore, for data displaying a large dynamic range, often the use of log(x) instead of x is recommended.

Another possible way to limit the number of internal variables is to make a priori assumptions about the neural network structure and, in particular, about the links between the input and the first hidden layer. For instance, instead of using a fully connected input and hidden layer, only local connections may be allowed for, i.e., it is assumed that only neighboring input components are related. Hence, links between these input nodes and a few hidden neurons will be sufficient. The disadvantage is that this method may force the number of hidden neurons to increase for an adequate description of the problem.

However, if the use of local connections is combined with weight sharing, then a considerable decrease of n_tot may be achieved. Thus, grouped input links to a hidden node will have identical weights. Even grouped input links to several nodes may be forced to have identical weights. For large networks, this method may considerably decrease the total number of free internal variables (see Le Cun et al., 1989). Unfortunately, results depend heavily on the exact neural network structure, and no indications exist for the optimal architecture.

The soft weight-sharing technique of Nowlan and Hinton (1992) constitutes an interesting alternative. In this method it is assumed that weights may be clustered in different groups exhibiting Gaussian distributions. During training, network performance, centers and variances of the Gaussian weight distributions, and their relative occurrences are optimized simultaneously. Since one of the Gaussians is often centered around zero, the method combines weight sharing with Tikhonov regularization. One of the disadvantages of the method is its strong assumption concerning weight distributions. Moreover, no method exists for determining the optimal number of Gaussians, again yielding an architecture problem.

Configuration optimization: Simplification methods

This incessant architecture problem can be solved in two different ways, using either constructive or destructive, i.e., simplification, methods. The first method starts with a small network and simultaneously adds and trains neurons. The second method starts with a large, trained network and progressively removes redundant nodes and links. First, some simplification methods are described. These methods can be divided into two categories: those that remove only links and those that remove whole nodes. All simplification methods are referred to as pruning techniques.

The simplest weight pruning technique is sometimes referred to as magnitude pruning. It consists of removing the smallest present weights and thereafter retraining the network. However, this method is not known to produce excellent results (Le Cun et al., 1990; Hassibi and Stork, 1993) since such weights, though small, may have a considerable influence on the performance of the neural network.

A better method is to quantify the sensitivity of the misfit function to the removal of individual weights. The two best known algorithms proceeding in such a way are optimal brain damage or OBD (Le Cun et al., 1990) and optimal brain surgeon or OBS (Hassibi and Stork, 1993).

Both techniques approximate the variation δE of the least-mean-squares misfit E attributable to removal of a weight w_i by a second-order Taylor expansion, i.e.,

δE = Σ_i (∂E/∂w_i) Δw_i + (1/2) Σ_i (∂^2 E/∂w_i^2) (Δw_i)^2 + (1/2) Σ_{i≠j} (∂^2 E/∂w_i ∂w_j) Δw_i Δw_j.    (11)

Higher order terms are assumed to be negligible. Removal of weight w_i implies Δw_i = −w_i. Since all pruning techniques are only applied after neural networks are trained and a local minimum has been attained, the first term on the right-hand side can be neglected. Moreover, the OBD algorithm assumes that the off-diagonal terms (i ≠ j) of the Hessian ∂^2 E/∂w_i ∂w_j are zero. Hence, the sensitivity (or saliency) s_i of the misfit function to removal of weight w_i is expressed by

s_i = (1/2) (∂^2 E/∂w_i^2) w_i^2.    (12)

Weights with the smallest sensitivities are removed, and the neural network is retrained. Retraining must be done after suppressing a single or several weights. The exact expression for the diagonal elements of the Hessian is given by Le Cun et al. (1990).

The OBS technique is an extension of OBD, in which the need for retraining no longer exists. Instead of neglecting the off-diagonal elements, this technique uses the full Hessian matrix H, which is composed of both the second and third terms in the right-hand side in equation (11). Again, suppression of weight w_i yields Δw_i = −w_i, which is now formulated as e_i^t Δw + w_i = 0, where the vector e_i represents the ith column of the identity matrix. This leads to a variation δE_i,

δE_i = (1/2) Δw^t H Δw + λ (e_i^t Δw + w_i)    (13)

(with λ a Lagrange multiplier). Minimizing expression (13) yields

δE_i = (1/2) w_i^2 / [H^(−1)]_ii    (14)

and

Δw = −(w_i / [H^(−1)]_ii) H^(−1) e_i.    (15)

The weight w_i resulting in the smallest variation in misfit δE_i in equation (14) is eliminated. Thereafter, equation (15) tells how all the other weights must be adapted to circumvent the need for retraining the network. Yet, after the suppression of several weights, the neural network is usually retrained to increase performance.

Although the method is well based on mathematical principles, it does have a disadvantage: not only the full Hessian but also its inverse must be calculated. Particularly for large networks, this may require intensive calculations and may even pose memory problems. However, the use of OBS becomes very interesting if the inverse of the Hessian or (A^t A + βI)^(−1) has already been approximated from the application of a second-order optimization algorithm for network training. The exact expression for the full Hessian matrix can be found in Hassibi and Stork (1993).

Finally, note that equation (11) is only valid for small perturbations Δw_i. Hence, OBD and OBS should not be used to remove very large weights. Moreover, Cottrell et al. (1995) show that both OBD and OBS amount to removal of statistically null weights. Furthermore, their statistical approach can be used to obtain a clear threshold to stop pruning with the OBD and OBS techniques because they propose not to remove weights beyond a Student's t threshold, which has clear statistical significance (Hérault and Jutten, 1994).
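The OBD saliency of equation (12) is easily computed once the diagonal of the Hessian is available. The sketch below (ours; the weights and Hessian terms are invented for illustration) ranks the weights by saliency and zeroes the least salient ones, which is not the same as removing the smallest weights.

```python
import numpy as np

def obd_saliencies(weights, hessian_diag):
    # equation (12): s_i = 0.5 * (d^2 E / d w_i^2) * w_i^2
    return 0.5 * hessian_diag * weights ** 2

def prune_smallest(weights, hessian_diag, n_remove):
    # remove (zero out) the weights with the smallest saliencies;
    # the network would then be retrained before pruning further
    s = obd_saliencies(weights, hessian_diag)
    pruned = weights.copy()
    pruned[np.argsort(s)[:n_remove]] = 0.0
    return pruned

# demonstration with made-up weights and diagonal Hessian terms
w = np.array([0.05, -1.3, 0.4, 2.1, -0.02, 0.7])
h = np.array([8.0, 0.5, 1.2, 0.3, 5000.0, 0.9])
# the smallest-magnitude weight (index 4) has a large curvature and is
# therefore not among the least salient; magnitude pruning would remove it
print(prune_smallest(w, h, n_remove=2))
```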

Nowadays, constructive algorithms exist for both MLP and During training, the algorithm presents the training samples
RBF networks and even combinations of these, i.e., neu- consecutively to the network. If a training sample cannot be
ral networks using mixed activation functions. Probably the correctly classified by the existing network, then this training
best known constructive algorithm is the cascade correlation sample is used as a new prototype. Otherwise, the weight of
method of Fahlman and Lebiere (1990). It starts with a fully the nearest prototype is increased to increase its relative oc-
connected and trained input and output layer. Next, a hidden currence. The variances of all Gaussians describing conflicting
node is added which initially is connected only to the input classes are reduced such that no conflicting class produces val-
layer. To obtain a maximum decrease of the misfit, the output ues larger than the negative threshold for this training sample.
of the hidden node and the prediction error of the trained net- Output is not bounded because of the linear activation func-
work are maximally correlated. Next, the node is linked to the tions which exist in the output nodes. Hence, this is a decision-
output layer, weights from the input layer to the hidden node making network, i.e., it only gives the most likely class for a
are frozen (i.e., no longer updated), and all links to the output given training sample but not its exact likelihood.
layer are optimized. In the next iteration, a new hidden node is The dynamic decay adjustment algorithm of Berthold and
added, which is linked to the input layer and the output of all Diamond (1995) has some resemblance to the probabilistic
previously added nodes. Again, the absolute covariance of its neural network of Specht (1990). This network creates a Gaus-
output and the prediction error of the neural networks is maxi- sian centered at each training sample. During training, only
mized, after which its incoming links are again kept frozen and the optimum, common variance for all Gaussians must be esti-
all links to the output nodes are retrained. This procedure con- mated. However, the fact that a hidden node is created for each
tinues until convergence. Each new node forms a new hidden training sample makes the network more or less a referential
layer. Hence, the algorithm constructs very deep networks in memory scheme and will render the use of large training sets
which each node is linked to all others. Moreover, the original very cumbersome. Dynamic decay adjustment, on the other
algorithm does not use any stopping criterion because input is hand, creates new nodes only when necessary.
assumed to be noiseless. Other incremental algorithms include orthogonal least
Two proposed techniques that do not have these drawbacks (the very deep structure and the lack of a stopping criterion) are the incremental algorithms of Moody (1994) and Jutten and Chentouf (1995). These algorithms differ from cascade correlation in that only a single hidden layer is used and all links are updated. The two methods differ in the number of neurons added per iteration [one (Jutten and Chentouf, 1995) or several (Moody, 1994)] and their stopping criteria. Whereas Moody (1994) uses the generalized prediction error criterion of Moody (1992), Jutten and Chentouf (1995) analyze the misfit residuals. Further construction is ended if the characteristics of the measured misfit resemble the assumed noise characteristics.

A variant (Chentouf and Jutten, 1996b) of the Jutten and Chentouf algorithm also allows for the automatic creation of neural networks with several hidden layers. Its general way of proceeding is identical to the original algorithm. However, it evaluates whether a new neuron must be placed in an existing hidden layer or whether a new layer must be created. Another variant (Chentouf and Jutten, 1996a) allows for incorporating both sigmoidal and Gaussian neurons. It evaluates which type of activation function yields the largest reduction in the misfit (see also Chentouf, 1997).

The dynamic decay adjustment method of Berthold and Diamond (1995) is an incremental method for RBF networks that automatically estimates the number of neurons and the centers and variances of the Gaussian activation functions that best provide an accurate classification of the training samples. It uses selected training samples as prototypes. These training samples define the centers of the Gaussian activation functions in the hidden-layer neurons. The weight of each Gaussian represents its relative occurrence, and the variance represents its region of influence. To determine these weights and variances, the method uses both a negative and a positive threshold. The negative threshold forms an upper limit for the output of wrong classes, whereas the positive threshold indicates a minimum value of confidence for correct classes. That is, after training, training samples will at least produce an output exceeding the positive threshold for the correct class, and no output of the wrong classes will be larger than the negative threshold. Hence, the trained network indicates the correct class of a given training sample but not its exact likelihood.

The dynamic decay adjustment algorithm of Berthold and Diamond (1995) has some resemblance to the probabilistic neural network of Specht (1990). This network creates a Gaussian centered at each training sample. During training, only the optimum, common variance for all Gaussians must be estimated. However, the fact that a hidden node is created for each training sample makes the network more or less a referential memory scheme and will render the use of large training sets very cumbersome. Dynamic decay adjustment, on the other hand, creates new nodes only when necessary.
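The training pass below sketches the flavor of dynamic decay adjustment under the two thresholds discussed above. The function and variable names, the threshold values, and the simple radius-shrinking rule are illustrative simplifications of ours and do not reproduce every detail of Berthold and Diamond (1995).

```python
import numpy as np

def dda_epoch(samples, labels, prototypes, theta_plus=0.4, theta_minus=0.2):
    """One simplified training pass of dynamic decay adjustment.

    prototypes : list of dicts {"center", "sigma", "weight", "label"}
    A prototype fires as exp(-||x - center||^2 / sigma^2).
    """
    def fire(p, x):
        return np.exp(-np.sum((x - p["center"]) ** 2) / p["sigma"] ** 2)

    for x, y in zip(samples, labels):
        same_class = [p for p in prototypes if p["label"] == y]
        if any(fire(p, x) >= theta_plus for p in same_class):
            # a prototype of the correct class already covers this sample;
            # only its weight (relative occurrence) is increased
            best = max(same_class, key=lambda p: fire(p, x))
            best["weight"] += 1.0
        else:
            # commit: introduce a new prototype centered at the sample
            prototypes.append({"center": np.asarray(x, float), "sigma": 1.0,
                               "weight": 1.0, "label": y})
        # shrink: prototypes of conflicting classes must respond below the
        # negative threshold at this sample
        for p in prototypes:
            if p["label"] != y and fire(p, x) > theta_minus:
                d2 = np.sum((x - p["center"]) ** 2)
                p["sigma"] = max(np.sqrt(d2 / -np.log(theta_minus)), 1e-6)
    return prototypes
```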
Other incremental algorithms include orthogonal least squares of Chen et al. (1991), resource allocating network of Platt (1991), and projection pursuit learning of Hwang et al. (1994). A recent review of constructive algorithms can be found in Kwok and Yeung (1997).

PRACTICE

A general strategy

How can these methods and techniques be used in a geophysical application? The following list contains some relevant points to be considered for any application. Particular attention should be paid to the following.

Choice of neural network.—For static problems, a preliminary data analysis or general considerations may already indicate the optimum choice of whether to use an MLP or an RBF network. For instance, clusters in classification problems are often thought to be localized in input space. Hence, RBF networks may yield better results than MLP networks. However, both types of neural networks are universal approximators, capable of producing identical results. Nevertheless, one type may be better suited for a particular application than the other type because these predictions are asymptotic results. If no indications exist, both must be tried.

Choice of input parameters.—In some problems, this may be a trivial question. In extreme cases, any parameter that can be thought of may be included, after which a principal component analysis (PCA) or factor analysis may be used to reduce input space and thereby remove any redundancy and irrelevant parameters. Nevertheless, an adequate selection of parameters significantly increases performance and quality of final results.

Suitable preprocessing techniques.—Any rescaling, filtering, or other means allowing for a more effective use of the input parameters should be considered. Naturally, PCA or a factor analysis can be included here; a brief sketch of such a PCA-based reduction is given after this list.
Training set and training samples.—The number of training samples directly influences the total number of internal variables allowed in the neural network to keep training overconstrained. The total number of internal variables should never exceed the number of training samples. The largest fully connected neural networks can be calculated using equations (3) and (4). Naturally, such limitations will not exist if a very large training set is available.

Training algorithm and generalization measure.—Naturally, a training algorithm has to be chosen. Conjugate gradient methods yield better performance than the standard backpropagation algorithm since the former is proven to converge within a limited number of iterations, whereas the latter is not. Furthermore, a method has to be chosen to guarantee a good performance in general. This can be any general method—crossvalidation, a theoretical measure, or residual analysis. However, these measures must be calculated during training and not after convergence.

Configuration estimation.—The choice between the use of a constructive or a simplification method is important. An increasingly popular choice is using any constructive algorithm to obtain a suitable network configuration and thereafter applying a pruning technique for a minimal optimum configuration. Sometimes, reinitialization and retraining of a neural network may improve the misfit and allow for continued pruning of the network. In such cases, the reinitialization has allowed for an escape from a local minimum.
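As a brief illustration of the PCA-based reduction of input space mentioned under Choice of input parameters and Suitable preprocessing techniques, the sketch below projects the training samples onto the leading principal components. The 95% retained-variance criterion is an arbitrary illustrative choice, not a recommendation from the text.

```python
import numpy as np

def pca_reduce(samples, retained_variance=0.95):
    """Project training samples onto the leading principal components.

    samples : (n_samples, n_params) array of raw input parameters
    Returns the reduced samples, the component matrix, and the mean,
    so that the same transform can be applied to new data.
    """
    mean = samples.mean(axis=0)
    centered = samples - mean
    # eigen-decomposition of the covariance matrix via an SVD
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    var = s**2 / np.sum(s**2)
    n_keep = int(np.searchsorted(np.cumsum(var), retained_variance)) + 1
    components = vt[:n_keep]                 # (n_keep, n_params)
    reduced = centered @ components.T        # (n_samples, n_keep)
    return reduced, components, mean

# new data must be reduced with the same mean and components:
# reduced_new = (new_samples - mean) @ components.T
```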
An example

To illustrate how some of these methods and techniques can be put into a general methodology, we consider a single example that can be solved relatively easily using a neural network without the need for complicated processing schemes. Our example concerns the detection and extraction of reflections, ground roll, and other types of noise in a deep seismic reflection experiment to enhance data quality.

FIG. 5. Common shot gather plus 31 pick positions. (a) Original data, (b) fourteen reflection picks, (c) three prearrival noise picks plus six ground roll picks, (d) six picks on background noise plus bad traces, and (e) two more picks on bad traces.
Van der Baan and Paul (2000) have shown that the application of Gaussian statistics on local amplitude spectra after the application of a PCA allows for an efficient estimate of the presence of reflections and therefore their extraction. They used a very simple procedure to extract the desired reflections. In a common shot gather (Figure 5a) a particular reflection was picked 14 times on adjacent traces (Figure 5b). Local amplitude spectra were calculated using 128-ms (16 points) windows centered around the picks. These local amplitude spectra were put as column vectors in a matrix X, and a PCA was applied using only a single eigenvector. The first eigenvector of XX^T was calculated, and henceforth all amplitude spectra were projected upon this vector to obtain a single scalar indicating resemblance to the 14 training samples. Once all 14 amplitude spectra were transformed into scalars, their average and variance were calculated. The presence of reflection energy was then estimated by means of (1) a sliding window to calculate the local amplitude spectra, (2) a projection of this spectrum upon the first eigenvector, and (3) Gaussian statistics described by the scalar mean and variance to determine the likelihood of the presence of a reflection for this particular time and offset. Amplitude spectra were used because it was assumed that a first distinction between signal types could be made on their frequency content. In addition, the samples were thereby insensitive to phase perturbations. Extraction results (obtained by means of a multiplication of the likelihood distribution with the original data) are shown in Figure 6a. More details can be found in Van der Baan and Paul (2000).
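The detection procedure just described can be summarized compactly. The sketch below merely restates the three steps in code form and assumes a 16-point real Fourier transform for the local amplitude spectra; the windowing, tapering, and normalization details of Van der Baan and Paul (2000) are omitted.

```python
import numpy as np

def reflection_likelihood(trace, picks_spectra, win=16):
    """Estimate the likelihood of reflection energy along a single trace.

    picks_spectra : (n_freq, n_picks) matrix X of amplitude spectra of the
                    training picks as column vectors (here 9 x 14)
    trace         : the seismic trace scanned with a sliding window
    """
    X = picks_spectra
    # first eigenvector of X X^T (largest eigenvalue)
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)
    v1 = eigvecs[:, -1]
    # scalar statistics of the training picks after projection
    scalars = v1 @ X
    mu, var = scalars.mean(), scalars.var()

    likelihood = np.zeros(len(trace))
    for t in range(0, len(trace) - win):
        spectrum = np.abs(np.fft.rfft(trace[t:t + win]))   # local amplitude spectrum
        s = v1 @ spectrum                                  # projection onto v1
        # unnormalized Gaussian likelihood from the scalar mean and variance
        likelihood[t + win // 2] = np.exp(-0.5 * (s - mu) ** 2 / var)
    return likelihood

# the extracted reflections then follow from multiplying the likelihood
# distribution with the original data, as described in the text
```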
To obtain a good idea of the possible power of neural networks, we extended their method to detect and extract two other categories of signal: ground roll and all remaining types of noise (including background noise, bad traces, and prearrival noise). To this end, more picks were done on such nonreflections. Hence, ground roll (Figure 5c), background and prearrival noise (Figures 5c and 5d), and bad traces (Figures 5d and 5e) were selected. This resulted in a total training set containing 14 reflections [identical to those used in Van der Baan and Paul (2000)] and 17 nonreflections.

Next, the two other categories of signal were extracted in a similar way as the reflections, i.e., Gaussian statistics were applied to local amplitude spectra after a PCA. Figures 6b and 6c show the results. A comparison of these figures with Figure 5a shows that good results are obtained for the extracted ground roll. However, in Figure 6c many laterally coherent reflection events are visible. Hence, the proposed extraction method did not discern the third category of signals, i.e., all types of noise except ground roll.

Fortunately, the failure to extract all remaining types of noise is easy to explain. Whereas both reflections and ground roll are characterized by a specific frequency spectrum, the remaining types of noise display a large variety of frequency spectra, containing signals with principally only high or only low frequencies. Therefore, the remaining types of noise have a multimodal distribution that cannot be handled by a simple Gaussian distribution. To enhance extraction results, the remaining types of noise should be divided into several categories such that no multimodal distributions will exist.

In the following we show how different neural networks are able to produce similar and better results using both MLP and RBF networks. However, we did not want to test the influence of different generalization measures. Hence, results were selected manually—a procedure we do not recommend for general use.

As input parameters, the nine frequencies in the local amplitude spectra are used. These nine frequencies resulted from the use of 16 points (128 ms) in the sliding window. All simulations are performed using the SNNS V.4.1 software package [available from ftp.informatik.uni-stuttgart.de (129.69.211.2)], which is suited for those who do not wish to program their own applications and algorithms.

Equations (3) and (4) indicate that for a training set containing 31 samples and having nine input parameters and three output nodes, already three hidden nodes result in an underconstrained training problem. Two hidden layers are out of the question. Even if expert systems are used (networks capable of recognizing only a single type of signal), then three hidden nodes also result in an underconstrained training problem. On the other hand, expert systems for extracting reflections and ground roll may benefit from PCA data preprocessing because it significantly reduces the number of input parameters and thereby allows for a larger number of hidden neurons.
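Equations (3) and (4) are given earlier in this paper; assuming they amount to counting the weights and bias terms of a fully connected network, the bookkeeping behind the statement above can be verified in a few lines.

```python
def n_internal_variables(layer_sizes):
    """Weights plus bias terms of a fully connected feedforward network,
    e.g. layer_sizes = (9, 3, 3) for nine input, three hidden, and three
    output nodes."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

n_samples = 31
print(n_internal_variables((9, 3, 3)))   # 42 > 31: underconstrained
print(n_internal_variables((9, 3, 1)))   # 34 > 31: even an expert system is underconstrained
print(n_internal_variables((9, 5, 3)))   # 68 > 31: the 9-5-3 network below may overfit
```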
The first network we used was a so-called 9-5-3 MLP network, i.e., nine input, five hidden, and three output nodes. The network was trained until convergence. The fact that this may have resulted in an overfit is unimportant since the network obtained was pruned using the NCU method of Sietsma and Dow (1991). This particular method was chosen because it removes whole nodes at a time (including input nodes), resulting in a 4-2-3 neural network. The four remaining input nodes contained the second to the fifth frequency component of the amplitude spectra. The resulting signal extractions are displayed in Figure 7. A comparison with the corresponding extraction results of the method of Van der Baan and Paul (2000) shows that more reflection energy has been extracted (Figure 6a versus 7a). Similar results are found for the ground roll (Figure 6b versus 7b). However, the extraction results for the last category containing all remaining types of noise have been greatly improved (compare Figures 6c and 7c). Nevertheless, some laterally coherent energy remains visible in Figure 7c, which may be attributable to undetected reflective energy. Hence, results are amenable to some improvement, e.g., by including lateral information.
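The pruning step can be mimicked by flagging hidden nodes that carry no independent information over the training set. The thresholds below and the restriction to constant or duplicated node outputs are a simplification in the spirit of, but not identical to, the full procedure of Sietsma and Dow (1991), which also adjusts the remaining weights after each removal.

```python
import numpy as np

def noncontributing_units(hidden_outputs, tol_std=1e-2, tol_corr=0.98):
    """Flag hidden nodes whose outputs carry no independent information
    over the training set.

    hidden_outputs : (n_samples, n_hidden) node activations
    """
    stds = hidden_outputs.std(axis=0)
    # nodes with an effectively constant output contribute nothing
    flagged = {j for j in range(hidden_outputs.shape[1]) if stds[j] < tol_std}

    active = [j for j in range(hidden_outputs.shape[1]) if j not in flagged]
    if len(active) > 1:
        corr = np.corrcoef(hidden_outputs[:, active].T)
        for a in range(len(active)):
            for b in range(a):
                # a node duplicating (or mirroring) an earlier one adds nothing
                if abs(corr[a, b]) > tol_corr and active[a] not in flagged:
                    flagged.add(active[a])
    return sorted(flagged)
```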
The second network to be trained was a 9-5-3 RBF network. Again, the network was trained until convergence and thereafter pruned using NCU. The final network structure consisted of a 5-2-3 neural network, producing slightly worse results than Figure 7c for the remaining noise category, as some ground roll was still visible after extraction. The five remaining input nodes were connected to the first five frequency components of the local amplitude spectra.

Hence, both MLP and RBF networks can solve this particular problem conveniently and efficiently, whereas a more conventional approach encountered problems. In this particular application, results did not benefit from PCA data preprocessing because the distributions were too complicated (mixed and multimodal). However, the use of a factor analysis might have been an option.

Although neither network was able to produce extraction results identical to those obtained by Van der Baan and Paul (2000) for the reflection energy, highly similar results could be obtained using different expert systems with either four input and a single output node or a single input and output node (after PCA preprocessing of data). Thus, similar extraction results could be obtained using very simple expert systems without hidden layers.
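As an illustration of how simple such an expert system is, the single-input, single-output configuration without hidden layer reduces to a logistic regression on the PCA-projected scalar. The gradient-descent fit below is a generic sketch and not the configuration actually trained in SNNS.

```python
import numpy as np

def train_expert(scalars, targets, n_iter=5000, lr=0.1):
    """Single input node, single sigmoidal output node, no hidden layer.

    scalars : (n_samples,) PCA projections of the local amplitude spectra
    targets : (n_samples,) 1 for the class of interest, 0 otherwise
    """
    w, b = 0.0, 0.0
    for _ in range(n_iter):
        y = 1.0 / (1.0 + np.exp(-(w * scalars + b)))   # network output
        err = y - targets                              # gradient factor of a cross-entropy misfit
        w -= lr * np.mean(err * scalars)
        b -= lr * np.mean(err)
    return w, b

# detection on new data: output = 1 / (1 + exp(-(w * scalar + b)))
```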
DISCUSSION AND CONCLUSIONS

Neural networks are universal approximators. They can obtain an arbitrarily close approximation to any continuous function, be it associated with a direct or an inverse problem.
FIG. 6. Extraction results using Gaussian statistics and a PCA for (a) reflections, (b) ground roll, and (c) other types of noise. Compare with Figure 5a.
FIG. 7. Extraction results using an MLP network for (a) reflections, (b) ground roll, and (c) other types of noise. Notice the improvement of extraction results for the
remaining types of noise.

Therefore, they constitute a powerful tool for the geophysical community to solve problems for which no or only very complicated solutions exist.

We have described many different methods and techniques to facilitate their use and to increase their performance. The last principal issue is related to the training set. As in many other methods, the quality of the obtained results stands or falls with the quality of the training data.

Furthermore, one should first consider whether the use of neural networks for the intended application is worth the expense in research time. Generally, this question reduces to the practical issue of whether enough good training samples can be obtained to guarantee an overconstrained training procedure. This problem may hinder their successful application even after significant preprocessing of the data and reduction of the number of input parameters. If a negative answer must be given to this pertinent question, then a better alternative is the continued development of new and sound mathematical foundations for the particular application.

ACKNOWLEDGMENTS

M. v. d. B. thanks Philippe Lesage for an introduction to the domain of neural networks and for pointing to the existence of SNNS. In addition, discussions with Shahram Hosseini are acknowledged. We are grateful for the reviews of Bee Bednar, an anonymous reviewer, and S. A. Levin, to whom the note of caution about the use of synthesized data is due.
REFERENCES

Akaike, H., 1970, Statistical predictor identification: Ann. Inst. Statist. Math., 22, 203–217.
Almeida, L. B., 1994, Neural preprocessing methods, in Cherkassy, V., Frieman, J. H., and Wechsler, H., Eds., From statistics to neural networks: Theory and pattern recognition applications: Springer-Verlag, 213–225.
Battiti, R., 1992, First and second order methods for learning between steepest descent and Newton’s methods: Neural Comp., 4, 141–166.
Baum, E. B., and Haussler, D., 1989, What size network gives valid generalization?: Neural Comp., 1, 151–160.
Berthold, M. R., and Diamond, J., 1995, Boosting the performance of RBF networks with dynamic decay adjustment, in Tesauro, G., Touretzky, D. S., and Leen, T. K., Eds., Advances in neural processing information systems 7: MIT Press, 521–528.
Bishop, C. M., 1995, Training with noise is equivalent to Tikhonov regularization: Neural Comp., 7, 108–116.
Calderón-Macías, C., Sen, M. K., and Stoffa, P. L., 1997, Hopfield neural networks, and mean field annealing for seismic deconvolution and multiple attenuation: Geophysics, 62, 992–1002.
——— 1998, Automatic NMO correction and velocity estimation by a feedforward neural network: Geophysics, 63, 1696–1707.
Carpenter, G. A., and Grossberg, S., 1987, Art2: Self-organization of stable category recognition codes for analog input patterns: Appl. Optics, 26, 4919–4930.
Chen, S., Cowan, C. F. N., and Grant, P. M., 1991, Orthogonal least squares learning algorithm for radial basis function networks: IEEE Trans. Neural Networks, 2, 302–309.
Chentouf, R., 1997, Construction de réseaux de neurones multicouches pour l’approximation: Ph.D. thesis, Institut National Polytechnique, Grenoble.
Chentouf, R., and Jutten, C., 1996a, Combining sigmoids and radial basis functions in evolutive neural architectures: Eur. Symp. Artificial Neural Networks, D Facto Publications, 129–134.
——— 1996b, DWINA: Depth and width incremental neural algorithm: Int. Conf. Neural Networks, IEEE, Proceedings, 153–158.
Cottrell, M., Girard, B., Girard, Y., Mangeas, M., and Muller, C., 1995, Neural modeling for time series: A statistical stepwise method for weight elimination: IEEE Trans. Neural Networks, 6, 1355–1364.
Cybenko, G., 1989, Approximation by superpositions of a sigmoidal function: Math. Control, Signals and Systems, 2, 303–314.
Dahl, E. D., 1987, Accelerated learning using the generalized delta rule: Int. Conf. Neural Networks, IEEE, Proceedings, 2, 523–530.
Dai, H., and MacBeth, C., 1994, Split shear-wave analysis using an artificial neural network?: First Break, 12, 605–613.
Denœux, T., and Lengellé, R., 1993, Initializing back-propagation networks with prototypes: Neural Networks, 6, 351–363.
Dowla, F. U., Taylor, S. R., and Anderson, R. W., 1990, Seismic discrimination with artificial neural networks: Preliminary results with regional spectral data: Bull. Seis. Soc. Am., 80, 1346–1373.
Efron, B., 1979, Bootstrap methods: Another look at the Jackknife: Ann. Statist., 7, 1–26.
Efron, B., and Tibshirani, R. J., 1993, An introduction to the bootstrap: Chapman and Hall.
Fahlman, S. E., and Lebiere, C., 1990, The cascade-correlation learning architecture, in Touretzky, D. S., Ed., Advances in neural information processing systems 2: Morgan Kaufmann, 524–532.
Gallinari, P., Thiria, S., Badran, F., and Fogelman–Soulie, F., 1991, On the relations between discriminant analysis and multilayer perceptrons: Neural Networks, 4, 349–360.
Geman, S., Bienenstock, E., and Doursat, R., 1992, Neural networks and the bias/variance dilemma: Neural Comp., 4, 1–58.
Goldberg, D. E., 1989, Genetic algorithms in search, optimization and machine learning: Addison-Wesley Publ. Co.
Grandvalet, Y., and Canu, S., 1995, Comments on “Noise injection into inputs in back propagation learning”: IEEE Trans. Systems, Man, and Cybernetics, 25, 678–681.
Hanson, S. J., and Pratt, L. Y., 1989, Comparing biases for minimal network construction with backpropagation, in Touretzky, D. S., Ed., Advances in neural information processing systems 1: Morgan Kaufmann, 177–185.
Hassibi, B., and Stork, D. G., 1993, Second order derivatives for network pruning: Optimal brain surgeon, in Hanson, S. J., Cowan, J. D., and Giles, C. L., Eds., Advances in neural information processing systems 5: Morgan Kaufmann, 164–171.
Hérault, J., and Jutten, C., 1994, Réseaux neuronaux et traitement de signal: Hermès édition, Traitement du signal.
Hertz, J., Krogh, A., and Palmer, R. G., 1991, Introduction to the theory of neural computation: Addison-Wesley Publ. Co.
Hopfield, J. J., 1984, Neurons with graded response have collective computational properties like those of two-state neurons: Proc. Natl. Acad. Sci. USA, 81, 3088–3092.
Hosseini, S., and Jutten, C., 1998, Simultaneous estimation of signal and noise in constructive neural networks: Proc. Internat. ICSC/IFAC Symp. Neural Computation, 412–417.
Huang, S. C., and Huang, Y. F., 1991, Bounds on the number of hidden neurons in multilayer perceptrons: IEEE Trans. Neur. Networks, 2, 47–55.
Huang, Z., Shimeld, J., Williamson, M., and Katsube, J., 1996, Permeability prediction with artificial neural network modeling in the Ventura gas field, offshore eastern Canada: Geophysics, 61, 422–436.
Hush, D. R., and Horne, B. G., 1993, Progress in supervised neural networks—What’s new since Lippmann?: IEEE Sign. Process. Mag., 10, No. 1, 8–39.
Hush, D., Horne, B., and Salas, J. M., 1992, Error surfaces for multilayer perceptrons: IEEE Trans. Systems, Man and Cybernetics, 22, 1152–1161.
Hwang, J. N., Lat, S. R., Maechler, M., Martin, D., and Schimert, J., 1994, Regression modeling in back-propagation and projection pursuit learning: IEEE Trans. Neural Networks, 5, 342–353.
Jacobs, R. A., 1988, Increased rates of convergence through learning rate adaptation: Neural Networks, 1, 295–308.
Judge, G. G., Griffiths, W. E., Hill, R. C., and Lee, T., 1980, The theory and practice of econometrics: John Wiley & Sons, Inc.
Jutten, C., and Chentouf, R., 1995, A new scheme for incremental learning: Neural Proc. Letters, 2, 1–4.
Karnin, E., 1990, A simple procedure for pruning backpropagation trained neural networks: IEEE Trans. Neural Networks, 1, 239–242.
Karouia, M., Lengellé, R., and Denœux, T., 1994, Weight initialization in BP networks using discriminant analysis techniques: Neural Networks and Their Applications, Proceedings, 171–180.
Kohonen, T., 1989, Self-organization and associative memory, 3rd ed.: Springer-Verlag New York, Inc.
Kwok, T.-Y., and Yeung, D.-Y., 1997, Constructive algorithms for structure learning in feedforward neural networks for regression problems: IEEE Trans. Neural Networks, 8, 630–645.
Langer, H., Nunnari, G., and Occhipinti, L., 1996, Estimation of seismic waveform governing parameters with neural networks: J. Geophys. Res., 101, 20 109–20 118.
Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D., 1989, Backpropagation applied to handwritten zip code recognition: Neural Comp., 1, 541–551.
Le Cun, Y., Denker, J. S., and Solla, S. A., 1990, Optimal brain damage, in Touretzky, D., Ed., Advances in neural information processing systems 2: Morgan Kaufmann, 598–605.
Le Cun, Y., Kanter, I., and Solla, S., 1991, Eigenvalues of covariance matrices: Application to neural network learning: Phys. Rev. Lett., 66, 2396–2399.
Lines, L. R., and Treitel, S., 1984, Tutorial: A review of least-squares inversion and its application to the geophysical domain: Geophys. Prosp., 32, 159–186.
Lippmann, R. P., 1987, An introduction to computing with neural networks: IEEE ASSP Mag., 4, No. 2, 4–22.
MacKay, D. J. C., 1992, Bayesian interpolation: Neural Comp., 4, 415–447.
Masters, T., 1995, Advanced algorithms for neural networks—A C++ sourcebook: John Wiley & Sons, Inc.
Matsuoka, K., 1992, Noise injection into inputs in back-propagation learning: IEEE Trans. Systems, Man, and Cybernetics, 22, 436–440.
McCormack, M. D., Zaucha, D. E., and Dushek, D. W., 1993, First-break refraction event picking and seismic data trace editing using neural networks: Geophysics, 58, 67–78.
McCulloch, W. S., and Pitts, W., 1943, A logical calculus of the ideas immanent in nervous activity: Bull. Math. Biophys., 5, 115–133.
Møller, M. F., 1993, A scaled conjugate gradient algorithm for fast supervised learning: Neural Networks, 6, 525–533.
Moody, J. E., 1992, The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems, in Moody, J. E., Hanson, S. J., and Lippmann, R. P., Eds., Advances in neural information processing systems 4: Morgan Kaufmann, 847–854.
——— 1994, Prediction risk and architecture selection for neural networks, in Cherkassy, V., Frieman, J. H., and Wechsler, H., Eds., From statistics to neural networks: Theory and pattern recognition applications: Springer-Verlag, 213–225.
Mozer, M. C., and Smolensky, P., 1989, Skeletonization: A technique for trimming the fat from a network via relevance assessment, in Touretzky, D. S., Ed., Advances in neural information processing systems 1: Morgan Kaufmann, 107–115.
Murat, M. E., and Rudman, A. J., 1992, Automated first arrival picking: A neural network approach: Geophys. Prosp., 40, 587–604.
Murata, N., Yoshizawa, S., and Amari, S., 1994, Network information criterion—Determining the number of hidden units for an artificial neural network model: IEEE Trans. Neural Networks, 5, 865–872.
Nguyen, D., and Widrow, B., 1990, Improving the learning speed of 2-layer neural networks by choosing initial values of the adaptive weights: Int. Joint Conf. Neural Networks, Proceedings, III, 2063–2068.
Nowlan, S. J., and Hinton, G. E., 1992, Simplifying neural networks using soft weight-sharing: Neural Comp., 4, 473–493.
Park, J., and Sandberg, I. W., 1991, Universal approximation using radial-basis-function networks: Neural Comp., 3, 246–257.
Pellilo, M., and Fanelli, A. M., 1993, A method of pruning layered feedforward neural networks, in Prieto, A., Ed., International workshop on artificial neural networks: Springer-Verlag, 278–283.
Platt, J., 1991, A resource-allocating network for function interpolation: Neural Comp., 3, 213–225.
Poulton, M. M., Sternberg, B. K., and Glass, C. E., 1992, Location of subsurface targets in geophysical data using neural networks: Geophysics, 57, 1534–1544.
Reed, R., 1993, Pruning algorithms—A survey: IEEE Trans. Neural Networks, 4, 740–747.
Richards, J., 1993, Remote sensing digital image analysis, an introduction: Springer-Verlag New York, Inc.
Riedmiller, M., and Braun, H., 1993, A direct adaptive method for faster backpropagation learning: The RPROP algorithm: Int. Conf. Neural Networks, IEEE, Proceedings, 1, 586–591.
Romeo, G., 1994, Seismic signals detection and classification using artificial neural networks: Annali di Geofisica, 37, 343–353.
Röth, G., and Tarantola, A., 1994, Neural networks and inversion of seismic data: J. Geophys. Res., 99, 6753–6768.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J., 1986, Learning internal representation by backpropagating errors: Nature, 332, 533–536.
Scales, J. A., and Snieder, R., 1998, What is noise?: Geophysics, 63, 1122–1124.
Sethi, I. K., 1990, Entropy networks: From decision trees to neural networks: Proc. IEEE, 78, 1605–1613.
——— 1995, Neural implementation of tree classifiers: IEEE Trans. Systems, Man and Cybernetics, 25, 1243–1249.
Sietsma, J., and Dow, R. D. F., 1991, Creating artificial neural networks that generalize: Neural Networks, 4, 67–79.
Specht, D., 1990, Probabilistic neural network: Neural Networks, 3, 109–118.
Stone, M., 1974, Cross-validatory choice and assessment of statistical predictions: J. Roy. Statist. Soc., 36, 111–147.
Tarantola, A., 1987, Inverse problem theory—Methods for data fitting and model parameter estimation: Elsevier Science Publ. Co.
Van der Baan, M., and Paul, A., 2000, Recognition and reconstruction of coherent energy with application to deep seismic reflection data: Geophysics, 65, 656–667.
Van der Sluis, A., and Van der Vorst, H. A., 1987, Numerical solutions of large, sparse linear algebraic systems arising from tomographic problems, in Nolet, G., Ed., Seismic tomography: D. Reidel Publ. Co., 49–83.
Wahba, G., and Wold, S., 1975, A completely automatic French curve: Fitting spline functions by cross-validation: Comm. Stati., 4, 1–17.
Wang, L.-X., and Mendel, J. M., 1992, Adaptive minimum prediction-error deconvolution and source wavelet estimation using Hopfield neural networks: Geophysics, 57, 670–679.
Weigend, A. S., Rumelhart, D. E., and Huberman, B. A., 1991, Generalization by weight-elimination with application to forecasting, in Lippmann, R. P., Moody, J. E., and Touretzky, D. S., Eds., Advances in neural information processing systems 3: Morgan Kaufmann, 875–882.
Williams, P. M., 1995, Bayesian regularization and pruning using a Laplace prior: Neural Comp., 7, 117–143.
Zhang, Y., and Paulson, K. V., 1997, Magnetotelluric inversion using regularized Hopfield neural networks: Geophys. Prosp., 45, 725–743.
