
NEURAL NETWORKS AS TOOLS TO SOLVE PROBLEMS IN PHYSICS AND CHEMISTRY

Wlodzislaw Duch
Department of Computer Methods, Nicholas Copernicus University,
ul. Grudziadzka 5, 87-100 Toruń, Poland
e-mail: duch@phys.uni.torun.pl

Geerd H. F. Diercksen
Max-Planck-Institut für Astrophysik,
Karl-Schwarzschild-Straße 1, 85740 Garching bei München, Germany
e-mail: ghd@mpa-garching.mpg.de

(May 18, 1994)
Application of neural network methods to problems in physics and chemistry has rapidly gained popularity in recent years. We show here that for many applications the standard methods of data fitting and approximation techniques are much better than neural networks, in the sense of giving more accurate results with a lower number of adjustable parameters. Learning in neural networks is identified with the reconstruction of hypersurfaces based on a knowledge of sample points, and generalization with interpolation. Neural networks use sigmoidal functions for these reconstructions, giving for most physics and chemistry problems results far from optimal. An arbitrary data fitting problem may be solved using a single-layer network architecture, provided that there is no restriction on the type of functions performed by the processing elements. A simple example illustrating the unreliability of interpolation and extrapolation by typical backpropagation neural network learning of a smooth function is presented. Some results from approximation theory are quoted, giving a rigorous foundation to applications requiring correlation of numerical results with a set of parameters.

I. INTRODUCTION

Neural computing and the field of neural network modeling has become very fashionable in the last decade. Availability of general neural network simulators [1] has encouraged many scientists to try these new techniques for solving their physics and chemistry problems. Therefore it is of great importance to understand what neural networks can do and when their application may lead to new results, hard to obtain with standard methods. A number of good books and review articles on neural network models [2] have appeared in recent years, unfortunately rarely giving a good mathematical perspective on the relevant theories, such as the theory of statistical decisions or approximation theory.

Artificial neural networks (ANNs) are networks of simple processing elements (called "neurons") operating on their local data and communicating with other elements. Thanks to this global communication the ANN has stable states consistent with the current input and output values. The design of ANNs was motivated by the structure of a real brain, but the processing elements and the architectures used in artificial neural networks frequently have nothing in common with their biological inspiration. The weights of connections between the elements are adjustable parameters. Their modification allows the network to realize a variety of functions. In the typical case of supervised learning a set of input and output patterns is shown to the network and the weights are adjusted (this is called "learning" or "adaptation") until the outputs given by the network are identical to the desired ones. In principle an ANN has the power of a universal computer, i.e. it can realize an arbitrary mapping of one vector space to another vector space. Since physicists and chemists deal with such problems quite often, one of the applications of ANNs in these fields is to correlate parameters with some numerical results, in the hope that the network, given a set of examples, or a statistical sample of data points, will somehow acquire an idea of what the global mapping looks like. One of the goals of this paper is to investigate whether such a hope is justified.

ANNs are especially suitable for problems where a high error rate is acceptable, the conditions are ill-defined and the problem is mathematically ill-posed. The brain has evolved to process the data from the senses and works much better at solving problems requiring perception and pattern recognition than problems involving logical steps and data manipulation. Most ANN architectures share this quality with real brains. They are not suited for the tasks that sequential computer programs can do well, such as the manipulation of symbols, logical analysis or solving numerical problems.

The most common architecture of ANNs is of the multilayered feedforward type. The signals are propagated from the input to the output layer, each processing element being responsible for integration of the signals coming from the lower layer and influencing all processing elements of the next layer to which it is connected. If all possible connections between the consecutive layers are allowed the network is called "fully connected". In some cases it is better to use an ANN that is not fully connected: reduction of the number of adjustable weights may improve not only the timing of computations for training the network but also the accuracy of learning.

Backpropagation of errors (BP) is the most commonly used algorithm to train such ANNs [3] (for classification problems feedforward learning vector quantization and counterpropagation networks are most commonly used) [2]. Although the BP learning rule is rather universal and can be applied to a number of different architectures of neural nets, the term "backpropagation net" is commonly used to designate those nets that are trained using the BP algorithm. This learning rule compares the desired output with the achieved output, and error signals (differences between desired and achieved outputs) are propagated layer by layer from the output back to the input layer. The weights are changed using gradient descent or some other minimization method in such a way that the error should be reduced after the next presentation of the same input. Although the BP algorithm is rather slow and requires many iterations, it enables the learning of arbitrary mappings and therefore it is widely used. Over 40 other learning rules for different network architectures exist and new rules are still being discovered [2].

ANNs are interesting to physicists as an example of complex systems that are more general than the Ising or the spin glass models. From this point of view, as interesting dynamical systems, their evolution is investigated and the methods of statistical physics are applied to such problems as network capacity, efficiency of various learning rules or chaotic behavior [4]. ANNs are also interesting as models of various sensory subsystems and as simplified models of the nervous system.

In this paper we are concerned only with applications of neural networks as tools that can help to solve real physical problems. The number of papers in the section "neural networks" in Physics Abstracts has approximately doubled comparing the 1992 and 1991 entries. This is a reflection of the enthusiasm with which ANNs are received by the scientific community. Is this enthusiasm well founded? Since the field is not well known among physicists and chemists, in this paper we are going to set neural network applications in the perspective of better established mathematical theories.

In the next section we critically analyze a few recent applications of ANNs to problems in chemical physics in which various parameters are correlated with output data. We are going to summarize the idea behind such applications and in the third section elucidate what ANNs are really doing. In the fourth section we describe some alternative approaches and give an illustrative example of learning a simple functional dependence by a backpropagation ANN. In the last section we present a general discussion on the use and misuse of neural networks in
physics and chemistry.

II. ASSOCIATIONS USING NEURAL NETWORKS

The typical architecture of a neural network is presented in Fig. 1. The input signals x_i are received by the first layer of processing elements (called "neurons"), the input layer. The results are obtained from the output layer, in Fig. 1 consisting of only one neuron. Between the input and the output layers there are a number of "hidden" layers; in the case of Fig. 1 only one such layer is present. These hidden layers of neurons are not directly accessible to the user, who gives the inputs and sees the outputs from the network. Connections are allowed only between the layers, not within the layers. The input signals are propagated in one direction, from the input layer to the output layer, hence such an architecture is called "a feedforward network", in contrast to the "recurrent networks" with feedback connections between the layers and within the layers. If all possible connections between the layers are allowed the network is called "fully connected". For a large number of neurons, to avoid an excessive number of connections, partial connectivity is assumed (if each neuron in the brain were connected with all others the brain would have to be about 10 km in diameter).

The strength of the connection between neuron number i and number j is a variable parameter W_ij, corresponding to the strength of the synaptic connections in real nervous tissue, called "the weight" of the connection. Adjustment of these weights allows the network to perform a variety of mappings of input to output signals. Each neuron performs a weighted sum of the incoming signals

X_i = \sum_j W_{ij} \sigma_j    (1)

and processes the result via a function σ. Because of the biological motivations most of the feedforward networks assume for the output function of a neuron a sigmoidal function

\sigma(X) = \frac{1}{1 + e^{-(X - \theta)/T}}    (2)

where T is a global parameter, usually fixed for all processing elements (neurons) of the network, determining the degree of non-linearity of the neurons, and θ is called the threshold and is usually also fixed for all processing elements. Thus the number of parameters adjusted during the adaptation of a network to a given set of data is equal to the number of network weights.

This short introduction should be sufficient to understand the applications described below. One of the pioneers in the field of neural modeling, T. Kohonen, wrote in his 1984 book on associative memory [5]:

"... arithmetic problems are seldom solved in biological tasks... Any attempts to build neural models for the adder, subtractor, multiplier, and other computing circuits, or even analog-to-digital converters are therefore based on immature reasoning."

Unfortunately many applications in physics and chemistry are of this type. Here we shall write only about applications that use neural networks in the "most proper" way, to form associations between the input and the output values. The papers of Darsey et al. [6] and Androsiuk et al. [7] are rather typical in this regard. The neural network is taught solutions of the Schrödinger equation, i.e. the correlation between the parameters of some Hamiltonian and the energy. In the case of the two papers quoted above a two-dimensional harmonic oscillator Hamiltonian was used. The potential is quadratic and the lowest eigenvalue of this Hamiltonian is linear in both frequency parameters:

V(x, y; \omega_x, \omega_y) = \frac{1}{2} m (\omega_x^2 x^2 + \omega_y^2 y^2)    (3)

E(\omega_x, \omega_y) = A (\omega_x + \omega_y)    (4)

where A = \hbar/2 is constant. A set of training data for different (ω_x, ω_y), consisting of values of V(x, y; ω_x, ω_y) at a rectangular (x, y) mesh taken as inputs and the corresponding energy values E taken as outputs, is given to the network. The backpropagation algorithm is used to change the weights of the ANN, bringing the responses of the net, identified with the E values, for the given input V(x_i, y_i; ω_x, ω_y), as close as possible to the desired E.

Perfect agreement is usually not possible, or even not desirable from the point of view of "generalization", or prediction of the unknown E values corresponding to new (ω_x, ω_y) parameters. When fitting the data one rarely requires that the approximation function should pass exactly through the given data points, because this may lead to "overfitting" or oscillatory behavior. The accuracy and the speed of learning depend on the number of hidden neurons and the architecture of the ANN. After the training phase is finished a number of new parameters (ω_x, ω_y) are selected and the values of V(x, y; ω_x, ω_y) are taken as test data to check the ability of the trained network to guess the new values of E. The accuracy in the learning and testing phases does not exceed a few percent.

In the paper of Darsey et al. a fully connected network with 40 hidden neurons was used, 49 values of V(x, y) on a 7x7 mesh were given, the number of data sets for training was 50 and the maximum error for the training data was around 5%. Slightly better results were obtained by Androsiuk et al. [7] with a backpropagation network that was not fully connected: only 3 hidden neurons, 1 output and 49 input neurons (one for each point on the potential surface) were used, the number of data sets for training was also 50, and the maximum error for the training data was about 2.5%. The extrapolation of the results to other values of (ω_x, ω_y) gives errors gradually increasing as the parameter values move outside the test range.
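The essential structure of this learning task is easy to see in a few lines of code. The sketch below is a minimal illustration, not the setup of Refs. [6,7]: units with ħ = m = 1, the frequency range and the mesh extent are arbitrary assumptions. It generates training pairs of the kind described above and shows that a plain three-parameter linear least-squares fit of E against (ω_x, ω_y) already reproduces the target exactly, since by Eq. (4) the energy is linear in the frequencies.

```python
import numpy as np

rng = np.random.default_rng(0)
hbar = m = 1.0                               # assumed units, not taken from Refs. [6,7]

# 50 training sets: frequencies drawn from an arbitrarily chosen range
omega = rng.uniform(0.5, 2.0, size=(50, 2))          # columns: omega_x, omega_y
E = 0.5 * hbar * (omega[:, 0] + omega[:, 1])         # Eq. (4) with A = hbar/2

# The 49-dimensional network input of Refs. [6,7]: V(x,y) on a 7x7 mesh, Eq. (3)
x = y = np.linspace(-1.0, 1.0, 7)                    # mesh extent is an assumption
X, Y = np.meshgrid(x, y)
V = 0.5 * m * (omega[:, 0, None]**2 * X.ravel()**2 +
               omega[:, 1, None]**2 * Y.ravel()**2)  # shape (50, 49)

# Three-parameter linear fit E ~ a + b*omega_x + c*omega_y
design = np.column_stack([np.ones(len(E)), omega])
coeff, *_ = np.linalg.lstsq(design, E, rcond=None)
print(coeff)                                  # approx [0, 0.5, 0.5]
print(np.max(np.abs(design @ coeff - E)))     # residual at machine precision
```

The 49 potential values V only illustrate the dimensionality of the input presented to the networks in the quoted papers; the target energy itself depends on just two numbers.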
Interpolation and extrapolation is discussed in these papers in terms of "generalization of acquired knowledge". One may be easily misled by the use of such concepts. For example, the authors of the papers quoted above conclude [6] that "neural networks can be used to investigate the more perplexing questions related to basic issues of physics and chemistry", and [7] claim: "... we presented studies of a neural network capable of performing the transformations generated by the Schrödinger equation required to find eigenenergies of a two-dimensional harmonic oscillator". In fact in both papers the authors have trained a network to recognize points on a plane in 3 dimensions (ω_x, ω_y, E), and tested whether the network can interpolate the data. Certainly such an ability has nothing to do with "the basic issues of physics and chemistry" or with "transformations generated by the Schrödinger equation", but rather with data interpolation and extrapolation techniques.

A conceptually very similar, although computationally more ambitious, example of using backpropagation ANNs for correlation of data is found in a series of papers of Sumpter et al. [8]. Internal energy flow in molecular systems was studied using the data from molecular dynamics calculations. The ANN was taught the relationship between phase-space points along a classical trajectory and the energies of stretch, bend and torsion vibrations. The network used after some experimentation had 84 nodes in 4 layers, with a total of 1648 connections. The input vectors were 24-dimensional (coordinates and momenta for 4 atoms) and the output was 4-dimensional (energies of 4 modes). The accuracy of energy prediction after training on 2000 examples of data points was between 5-20%. The authors conclude that "a trained neural network is able to carry out qualitative mode-energy calculation for a variety of tetratomic systems".

Many other examples of this sort may be found in the literature. Although superficially they are similar to the papers quoted below, there is a crucial difference that we will point out in the summary.

Peterson has used a BP network for classification of atomic levels in Cm I, Cm II and Pu I ions [9] according to their electronic configuration. Each level was described by 4 numbers: energy, angular momentum, g factor and isotope shift, that should be correlated with a small number (4 to 8) of electronic configurations. The network used had less than 100 neurons and the accuracy of classification was between 64 and 100%.

Many papers have been published on applications of neural networks to various problems in protein chemistry [10]. The ANNs are used here for classification purposes, to find the correlation between the 3D structure and the sequence of aminoacids. Since an average protein has a few hundred aminoacids, the ANNs used in this case have tens of thousands of processing elements and hundreds of thousands of weights, and their simulation requires a large amount of supercomputer resources.

The time to train a fully connected feedforward network is very long (thousands or hundreds of thousands of iterations may be necessary) and selection of the architecture for networks that are not fully connected is a long trial-and-error procedure. However, once the feedforward network has been trained it gives the answers very quickly, in one iteration, since the computation is reduced to optimized function evaluation.

III. WHAT ARTIFICIAL NEURAL NETWORKS REALLY DO

Comments in this section are relevant mainly for the most commonly used feedforward ANNs of the backpropagation type. Applications of ANNs to various problems in physics and chemistry are frequently not based on solid mathematical foundations, but rather on the availability of the software to simulate neural networks. In the applications mentioned in the previous section the ability of neural networks to learn from examples, associating a set of parameters with the output values (cf. [6]-[10]) and subsequently generalizing to new values, was used. The accuracy of interpolation did not exceed a few percent and the number of adjustable parameters (weights of the network) was in most cases quite high, ranging from 10 to almost 10^6.

The quotations from the papers reviewed in the previous section may mislead some readers into believing that neural networks really solve physical problems or carry out transformations of the Schrödinger equation, i.e. do something intelligent. Therefore we should stress that the problem of learning associations between parameters and output values is equivalent to data fitting, i.e. to the problem of approximation of an unknown input-output mapping. Vice versa, all data fitting problems may be presented in the ANN form. Therefore the question one should ask is: are neural networks efficient in fitting the data, and what is the functional form they are using?

The general approximation problem is stated as follows [11]: if f(X) is a continuous function defined for X = (x_1, ..., x_n) ∈ 𝒳, and F_W(X) is a continuous approximating function of X and some parameters W ∈ 𝒲, then finding the best approximation amounts to finding W* such that the distance ||F_W(X) - f(X)|| is the smallest for all W in the parameter space 𝒲.

Consider the linear expansion of F_W(X) in a set of basis functions φ_i(X), approximating an unknown function f(X) of many variables X:

F_W(X) = \sum_{i}^{m} W_i \phi_i(X)    (5)

Given a set of sample points (X_k, f_k = f(X_k)) our task is to find the best possible expansion coefficients W_i.
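In practice the linear scheme of Eq. (5) reduces to a linear least-squares problem. The sketch below is a generic illustration under assumptions (an arbitrarily chosen Gaussian basis and test function, not a method taken from any of the quoted papers): it builds the design matrix Φ_ki = φ_i(X_k) and solves for the weights W_i in one step.

```python
import numpy as np

def gaussian_basis(X, centers, width=1.0):
    """phi_i(X) = exp(-||X - c_i||^2 / (2*width^2)); an assumed example basis."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width**2))

def fit_linear_expansion(X, f, centers, width=1.0):
    """Least-squares solution of Eq. (5): minimize ||Phi W - f||."""
    Phi = gaussian_basis(X, centers, width)        # design matrix, shape (N, m)
    W, *_ = np.linalg.lstsq(Phi, f, rcond=None)
    return W

def predict(Xnew, W, centers, width=1.0):
    return gaussian_basis(Xnew, centers, width) @ W

# Example: N sample points of an "unknown" function of two variables
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(100, 2))
f = np.sin(X[:, 0]) * np.cos(2 * X[:, 1])
centers = X[:10]                                   # m = 10 basis functions
W = fit_linear_expansion(X, f, centers)
print(np.abs(predict(X, W, centers) - f).max())    # residual on the sample points
```

Any of the classical expansions mentioned below (Fourier, splines, polynomials) can be plugged in by replacing the basis routine; only the design matrix changes.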
We may imagine a network realization of the approximation problem in several ways; for example, a single-layer network that solves the problem is shown in Fig. 2, with m hidden nodes (m = 5 in this example), a single input and a single output node. Each input node is connected with a weight equal to 1 to the hidden nodes, and each hidden node with a weight equal to W_i to the output node. The input node sends unmodified signals to the hidden nodes, which have output functions equal to φ_i(X), and the output node performs the summation of all weighted contributions. Any linear or iterative method of data fitting may be used as a learning algorithm to find the best "weights" or expansion coefficients W_i. This scheme covers many approximation methods, including Fourier transforms, spline interpolations and polynomial fits. Generalization to vector functions (many components) requires many output neurons and realizes a general mapping between two vector spaces.

Most of the feedforward networks assume for the output function of a neuron a sigmoidal function, Eq. (2). Since this function has as an argument x_i, a weighted sum of all signals σ_j received by a given neuron i from the neurons j that are connected to its inputs,

x_i = \sum_j W_{ij} \sigma_j    (6)

the number of parameters adjusted during learning is equal to the number of non-zero connections of the network. In the examples presented in the previous section the ANNs had hundreds of adjustable parameters. The task required from these networks during training was to fit the available (training) data to some functional form by adjusting these parameters. Although one may think that in using ANNs no functional form is a priori assumed, this is obviously not true. How does an explicit function realized by the backpropagation network look? It is usually presented in the recursive way

F_W(X) = \sigma\Big(\sum_{i_1} W_{i_1}^{(1)} \sigma\Big(\sum_{i_2} W_{i_2}^{(2)} \sigma\big(\ldots \sigma\big(\sum_{i_k} W_{i_k}^{(k)} x_{i_k}\big)\ldots\big)\Big)\Big)    (7)

where σ is the sigmoidal output function of the node and the upper indices 1...k refer to the network layer. For a network with 2 active layers (Fig. 1) the explicit output function is not difficult to write:

F_W(X) = \sigma\Big(\sum_i W_i^{(1)} \sigma\Big(\sum_j W_{ij}^{(2)} x_j\Big)\Big)
       = \sigma\Big(\sum_i W_i^{(1)} \frac{1}{1 + \exp\big(-\sum_j W_{ij}^{(2)} x_j / T\big)}\Big)
       = \frac{1}{1 + \exp\Big(-\frac{1}{T}\sum_i W_i^{(1)} \frac{1}{1 + \exp\big(-\sum_j W_{ij}^{(2)} x_j / T\big)}\Big)}    (8)

The formulas for the backpropagation of errors training algorithm [3] may easily be obtained by computing the gradient of this function with respect to the weights (the approximation errors are proportional to this gradient). If the signals x_i are small and T is large we can expand the exponential function and the geometrical series, leaving the linear approximation

F_W(x_1, ..., x_n) \approx \sum_i W_i \sigma(x_i)    (9)

which has the same form as Eq. (5), with σ(x_i) as the basis functions for the expansion. However, in the usual case the dependence on the parameters W is non-linear.

In general sigmoidal functions can approximate any continuous multivariate function [12], although high quality of this approximation requires in most cases a rather large number of parameters W. Conditions for convergence of such expansions have been found only recently [13]. Neural networks may therefore serve as universal approximators. Many other neuron output functions lead to networks that may serve as universal approximators. Radial basis functions [14], and even more general kernel basis functions [15], can be used for uniform approximation by neural networks. In fact many other types of functions may be used; for example, rational functions [16] for some approximation problems lead to networks of lower complexity (smaller number of parameters, i.e. faster convergence) than the networks based on sigmoidal or radial functions.

For many problems linear approximation is the most appropriate method, and an attempt to use neural networks or any other nonlinear approximators will lead to low accuracy and lengthy computations. Periodic functions (covering the prediction of time series in physics, chemistry, economics and other fields) are much better represented via Fourier, wavelet [17] or similar expansions rather than by sigmoidal functions. Many other functions useful in physics are of gaussian type. Why should we hope that an ANN based on sigmoidal functions will give us better results than the fitting procedures?

The training algorithm modifies the weights (or the logical functions of the nodes in the case of logical networks [18]), slowly changing the landscape of the F_W(X) function realized by the net until it approximately gives in the N sample points the same values as the f(X) function. Since the initial form of the mapping depends on the random set of weights W at the start of the training period, the final form of the function F_W after training is to a large degree arbitrary, except for the close neighborhood of the N training points. Hoping that the ANN will find an approximation far from the sample points (X_i, f(X_i)) is unreasonable. At most the accuracy that one may obtain is equal to that of a fit with W_ij parameters using sigmoidal functions.

Fitting a set of data points to a linear combination of basis functions by the least-squares procedure, in the case when the function is smooth, is not difficult. Formulated as a neural network learning problem with just one output neuron, such an approach is known as the functional link neural network [19].
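For contrast with such a linear formulation, the non-linear function of Eq. (8) can be written out directly in code. The following sketch is only an illustration (the layer sizes, weight values and T are arbitrary assumptions); it evaluates the explicit two-active-layer form that BP training adjusts.

```python
import numpy as np

def sigmoid(z, T=1.0, theta=0.0):
    """Eq. (2): sigma(z) = 1 / (1 + exp(-(z - theta)/T))."""
    return 1.0 / (1.0 + np.exp(-(z - theta) / T))

def two_layer_net(x, W2, W1, T=1.0):
    """Explicit form of Eq. (8): F_W(X) = sigma(sum_i W1_i * sigma(sum_j W2_ij x_j))."""
    hidden = sigmoid(W2 @ x, T)        # hidden-layer activations
    return sigmoid(W1 @ hidden, T)     # single output neuron

# Arbitrary example: 3 inputs, 4 hidden neurons, 1 output
rng = np.random.default_rng(2)
W2 = rng.normal(size=(4, 3))           # input -> hidden weights W^(2)
W1 = rng.normal(size=(4,))             # hidden -> output weights W^(1)
x = np.array([0.2, -0.5, 1.0])
print(two_layer_net(x, W2, W1))        # one number: F_W(X) of Eq. (8)
```

Training never changes this nested-sigmoid form; it only moves the values of W within it, which is why the network is just one particular choice of fitting function.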
Nonlinear fitting problems are difficult and the neural network approach is not an exception; in fact even quite simple networks lead to NP-complete problems [20]. A very interesting solution to this problem is based on the idea of growing and shrinking networks, allocating resources to new data and constructing the approximating function incrementally [21].

The problem of fitting non-smooth functions in many dimensions, as in clustering analysis, is also difficult. Fitting in a multidimensional space is much harder than in one or two dimensions, and it remains to be seen if neural networks can compete with least-squares fits to expansions in some basis set. Comparison of various nonlinear approaches to classification for real-world data shows that the accuracy of neural networks is similar to nonlinear regression and tree-induction statistical methods [22].

IV. EXAMPLE OF ANN BEHAVIOR

From the point of view of approximation theory, physical and chemical problems treated in the literature by ANNs are in some cases trivial. Consider for example the results of [6] and [7]: we do not need 50 sample data points to conclude that the dependence of E on (ω_x, ω_y) is linear. If there is no way of guessing the functional dependence, or if there is no global fit that will give small errors, we may try to use various spline functions to get the correct local behavior and glue them together to get a global description of the data. However, functional dependencies for most physical problems are known (in contrast to problems in pattern recognition, to which ANNs are applied with some success, for example in high-energy physics [23]). They result from the underlying simplified physical models; parametrization of functions and data fitting is a well-established and highly accurate method for creating mathematical models.

Perhaps an explicit example will clarify the need for rigorous methods in the reconstruction of functional dependencies from sample data. In Fig. 3 we have presented the function f(x) = sin(√2 x) sin(2x) and various fits and approximations based on 11 sample points taken every 2π/10 in the (0, 2π) region. The original function f(x) is left for comparison on all drawings (thin line). Polynomial fits of at least 8-th order have to be used to give a sufficient number of minima and maxima, and such polynomials lead to large oscillations between the sample points. Fig. 3a) shows the original function and its approximation by a Fourier series based on an 11-parameter expansion in the functions 1, sin(x), cos(x), ..., sin(5x), cos(5x). It is a global fit, good for interpolation in the whole (0, 2π) region but completely failing outside of it. In the second drawing we see the fit based on the highly non-linear local sigmoidal functions with T = 0.01. It has the following form:

F(x) = \sum_{i=1}^{11} W_i \sigma(x - x_i)    (10)

Results that one can obtain with this type of fitting function represent the limit that an ANN of any architecture with 11 linearly adjustable parameters and sigmoidal output functions of neurons may achieve (numerical experiments with other, multi-layer networks with non-linear parameters always gave worse results). Variable thresholds for individual neurons, equivalent to centering of the sigmoidal functions around the data points, are a necessary prerequisite for a good local approximation in this case. For T = 0.01 the non-linearity of the sigmoidal functions in the range of π/5 is rather strong and the overall fit is poor. In Fig. 3b) one can see how the approximating function is combined from steep sigmoids; the shape of the sigmoid functions is also shown in this figure. Networks with high non-linearities are simply not capable of a smooth modeling of the data in this example.

The behavior of the approximating function for parameters outside the training range strongly depends on the value of T. This is illustrated by the next two drawings, Fig. 3c) and d), obtained with local sigmoidal fits for T = 0.5 and T = 1. The accuracy of these fits is even better than that of the Fourier series fit! In this case the sigmoidal functions, shown in these figures, could be replaced by semi-linear functions, and the approximating function by a semi-linear spline function. In our experience it is hard to find a function f(x) for which this would not be the case.

The last drawing shows the results obtained by a backpropagation network with 11 non-zero weights trained on the 11 input points and tested on 100 points in the (-π, 3π) range. The curve gives an overall idea of the quality of the ANN approximation; the results can vary, depending on the randomly set starting weights and the parameter T. Fig. 3e) shows the best results we could find after many experiments with different network architectures, hundreds of thousands of iterations and essentially forcing the network to learn some points (giving the values of these points more frequently than the others). The network learned the values of the 11 training points to a very high precision (0.001).

Although the trained network is useful for interpolation of the data, the extrapolation properties of this network are completely unreliable. A simple way to improve the extrapolation is to use auto-regression, i.e. instead of using pairs of points (x_i, y_i) for fitting or network training one may use a set of y_i, y_{i-1}, ... points. The use of autoregression does not change the overall conclusion concerning the fitting procedures and the network behavior.
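The Fourier and sigmoidal expansions of Fig. 3 are easy to reproduce approximately. The sketch below works under stated assumptions (both sets of 11 coefficients are obtained here by linear least squares, the sigmoid centers are fixed at the data points as in Eq. (10), and T is taken from the figure; the original fits may have been computed differently), and compares the behavior of the expansions inside and outside the training region.

```python
import numpy as np

def f(x):
    return np.sin(np.sqrt(2.0) * x) * np.sin(2.0 * x)

# 11 training points taken every 2*pi/10 in (0, 2*pi)
x_train = np.arange(11) * 2.0 * np.pi / 10.0
y_train = f(x_train)

def fourier_design(x):
    """Basis 1, sin(kx), cos(kx) for k = 1..5 (11 parameters)."""
    cols = [np.ones_like(x)]
    for k in range(1, 6):
        cols += [np.sin(k * x), np.cos(k * x)]
    return np.column_stack(cols)

def sigmoid_design(x, centers, T):
    """Basis sigma(x - x_i) of Eq. (10), centered on the data points."""
    z = np.clip((x[:, None] - centers[None, :]) / T, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))

x_test = np.linspace(-np.pi, 3.0 * np.pi, 200)      # includes extrapolation regions
inside = (x_test > 0) & (x_test < 2 * np.pi)
for name, design in [("Fourier", lambda x: fourier_design(x)),
                     ("sigmoid T=1", lambda x: sigmoid_design(x, x_train, 1.0)),
                     ("sigmoid T=0.01", lambda x: sigmoid_design(x, x_train, 0.01))]:
    W, *_ = np.linalg.lstsq(design(x_train), y_train, rcond=None)
    err_in = np.abs(design(x_train) @ W - y_train).max()
    err_out = np.abs(design(x_test[~inside]) @ W - f(x_test[~inside])).max()
    print(f"{name}: max train error {err_in:.2e}, max error outside (0, 2pi) {err_out:.2f}")
```

The printed numbers illustrate the qualitative point of this section: all three 11-parameter expansions can reproduce the sample points, while none of them is trustworthy outside the sampled interval.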
The strength of ANNs for some problems, where non-linear relations make it hard to find a global approximation, lies in the local description of the approximated function around the training data vectors. The accuracy of such a description is rather low. Looking at the function realized by the untrained net, starting with random weights, with the net outputting one real number for an n-dimensional input vector, we get a hypersurface in n + 1 dimensions, rather smooth but irregular. Learning input-output associations changes this hypersurface to reproduce the output values around the training input values, but most of the random structure of the initial hypersurface, including local minima and regions far from the input data, remains unchanged. It is instructive to look at this function during training, and to notice how the model of the data that the network has is changing.

V. ALTERNATIVES TO ANNS

Approximation theory leads to regularization of functions, a very important concept [24], especially for applications with noisy data (as usually obtained from experiments). In essence regularization takes into account additional information in the form of constraints that should be fulfilled by the approximating function, defining a functional

H[f] = \sum_{i=1}^{N} \big(y_i - f(X_i)\big)^2 + \lambda \|\hat{P} f\|^2    (11)

where (X_i, y_i) are known data points, P̂ is the constraint operator and λ is a real parameter determining how important the constraints should be. This functional is minimized over all functions f(X) belonging to some class of trial functions. The approximation in the least-squares sense follows for λ = 0 (no regularization). Elegant solutions are known for many constraint operators, allowing the avoidance of "data overfitting" effects in the case of noisy data inputs. For example, to smooth the approximating function by minimizing the rapid variation of its curvature, P̂ should include second derivatives:

\|\hat{P} f\|^2 = \sum_{i=1}^{N} \left( \sum_{j=1}^{n} \frac{\partial^2 f(X_i)}{\partial x_j^2} \right)^2    (12)

Regularization may also be applied to backpropagation ANNs [25] and can be implemented by ANNs with one hidden layer [26]. Poggio and Girosi [26] show how regularization theory may be extended to what they call the "theory of Hyper Basis Functions", containing the Radial Basis Function (RBF) method [14] as a special case. The RBF approach allows for multivariate interpolation in a way that is better for most chemistry and physics problems than the ANN interpolation with sigmoidal functions. The RBF method assumes the following functional form for the approximation to the function f(X), given its values at N points X_i = (x_1^{(i)}, x_2^{(i)}, ..., x_n^{(i)}):

F(X) = \sum_{i=1}^{N} C_i h(\|X - X_i\|) + \sum_{i=1}^{m} D_i p_i(X)    (13)

where h is a continuous function centered at X_i and the p_i are polynomials of some low order (in particular constants). Some of the h functions used with very good results for problems in physics [27] include linear and Gaussian functions:

r = \|X - X_i\|,  h(r) = r    (14)
h(r) = \frac{1}{(c^2 + r^2)^{\alpha}},  \alpha > 0
h(r) = (c^2 + r^2)^{\beta},  1 > \beta > 0
h(r) = e^{-(r/c)^2}

Another branch of mathematics on which the approximation of multidimensional functions may be based is statistical decision theory (cf. [5] and references therein), probabilistic Bayesian classifiers [28] and regression theory [29]. These approaches should be preferred for classification problems [9], [10] and if the data clustering is rather strong. The best known programs based on this theory are variants of the Learning Vector Quantization (LVQ) algorithm of Kohonen [30]. Other relevant methods were developed by the high-energy and plasma physics communities and are known as function parametrization methods [31].

A recent comparison of various non-linear approaches to classification and data modeling, such as neural networks, statistical pattern recognition, MARS and BRUTO [22], shows that all these methods have their weak and strong points, depending on the particular application. Fuzzy sets theory [32] and local coordinate transformation methods based on differential geometry are also strong competitors of ANN algorithms [33], offering a well-defined mathematical background for local data approximation. ANNs are but one family of systems among many types of adaptive systems reconstructing hypersurfaces from the sample data by adjusting internal parameters.
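A Gaussian RBF approximant of the form (13) takes only a few lines to set up. The sketch below is a generic illustration, not code from Refs. [14,26,27]: the polynomial term is omitted, the width c is an arbitrary assumption, and the optional parameter lam adds a diagonal regularization in the spirit of Eq. (11).

```python
import numpy as np

def rbf_fit(X, y, c=1.0, lam=0.0):
    """Solve Eq. (13) with Gaussian h(r) = exp(-(r/c)^2) and no polynomial term.
    lam > 0 gives a regularized (non-interpolating) fit for noisy data."""
    r2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-r2 / c**2)                         # N x N interpolation matrix
    return np.linalg.solve(H + lam * np.eye(len(y)), y)

def rbf_eval(Xnew, X, C, c=1.0):
    r2 = ((Xnew[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-r2 / c**2) @ C

# Example: interpolate a smooth function of two variables from 50 samples
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(50, 2))
y = np.exp(-(X**2).sum(axis=1))
C = rbf_fit(X, y, c=0.5, lam=1e-10)
Xnew = rng.uniform(-1, 1, size=(5, 2))
print(np.abs(rbf_eval(Xnew, X, C, c=0.5) - np.exp(-(Xnew**2).sum(axis=1))))
```

With lam > 0 the same code gives the regularized, smoothed fit appropriate for noisy measurements, while lam = 0 reproduces exact interpolation at the sample points.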
VI. SUMMARY

From the preceding sections it is clear that there are many alternatives to the use of neural networks for complex approximation problems. There are obvious cases when the use of neural networks is quite inappropriate: whenever linear methods are sufficient or whenever least-squares fits in some basis functions work well. Given a sufficient number of network parameters (weights and processing functions) and a sufficient number of data points, an approximation to an arbitrary mapping may be obtained [12]. For some problems approximation via sigmoidal functions, especially with strong non-linearity, is slowly converging, a reflection of the fact that no physical insight is used in the construction of the approximating mapping of parameters on the results. The number of adjustable parameters (weights) in an ANN is usually quite large. The time needed for training the ANN, the tedious selection of network architecture, neuron output function and global learning parameters, plus the dependence of the results on the initial state of the network, make the use of neural networks for solving physical problems a very unreliable method.

Recently Bishop [25] proposed a fast curve fitting procedure based on neural networks. A trained ANN is used to give quickly an approximation to the non-linear parameters of the iterative fitting procedure. It has already been suggested in the context of Adaptive Logical Networks [18] that after training the ALN net the function that it has learned should be extracted: isn't it better to construct such approximating functions directly? Alternative approaches, based on approximation theory and the theory of statistical decisions, have more rigorous mathematical foundations and, properly applied, should lead to better results with a smaller number of parameters.

Consider for example theoretical molecular physics and quantum chemistry. We are trying to associate, using some complicated computational machinery, certain parameters, such as geometric or one-electron basis set information, with the values of energy and other properties. Does a global function of these parameters giving the energy E(p_1, ..., p_n) and other properties exist? Can we create an approximating function containing a large number of adjustable parameters from computational results plus the empirical data that will allow us to guess new values? If good empirical or ab initio data is available for construction of such a mapping it is to some degree possible, and this approach has been used for many years to obtain reliable molecular (and more recently also solid state) potential energy surfaces [34]. The results of such mappings based on physical models for the underlying expansions are much more reliable than the results one may obtain using ANNs [35].

Will neural networks have a significant impact on the methods of solving physics and chemistry problems? Our conclusion is that using feedforward neural networks to solve some of these problems is a rather inefficient way of fitting the data to a specific functional form. However, if there is no approximate theory but the data is not completely chaotic, as for example in the case of proteins [10], QSAR and other problems in physics and chemistry [38], or time series forecasting in economics or physics [36], [37], any data modeling tools are worth using, including neural networks. There is no need to insist on sigmoidal processing functions; usually more accurate results are obtained with a smaller number of parameters using approximating functions based on gaussian or other localized functions [14]. Moreover, various methods such as resource allocating neural networks and other constructive algorithms [21], automatically adding more network nodes to describe the data with higher accuracy, may be more convenient for data modeling. Another case when neural network methods may have strong advantages is when a large amount of data coming from experiments or computations should be processed. Neural network learning algorithms lead to small changes of the network parameters for each input data item presented, while global fitting methods require access to all data.

Approximation theory and statistical decision theory, especially if approximate functional dependencies are known, give a firm mathematical background to the treatment of many problems to which neural network techniques are applied in an ad hoc manner. Instead of backpropagation neural networks, applications based on explicit, well-controlled construction of approximate mappings should be preferred. Papers on applications of ANNs to problems of physics and chemistry should at least compare the results with the results obtained using statistical methods and data fitting procedures.

ACKNOWLEDGMENTS

W.D. gratefully acknowledges a grant of the Research Council of the Polish Ministry of National Education and the support of the Max-Planck Institute during his visits in Garching. We also thank prof. M. Zerner and prof. B. Wybourne for reading and correcting the manuscript and dr W. Kraemer for pointing out some references.

[1] For a short description of available hardware and software packages please see the Frequently Asked Questions of the Usenet comp.ai.neural-nets discussion group (available by anonymous FTP as file "neural-nets-faq" from many Internet nodes).
[2] J.A. Anderson and E. Rosenfeld (Eds.), Neurocomputing: Foundations of Research (The MIT Press, Cambridge, MA 1988); J.A. Anderson, A. Pellionisz and E. Rosenfeld (Eds.), Neurocomputing 2: Directions for Research (The MIT Press, Cambridge, MA 1990); R. Hecht-Nielsen, Neurocomputing (Addison Wesley, 1990); J. Hertz, A. Krogh and R. Palmer, Introduction to the Theory of Neural Computation (Addison-Wesley, Redwood City, California 1991); J.L. McClelland and D.E. Rumelhart, Explorations in Parallel Distributed Processing: Computational Models of Cognition and Perception (software manual) (The MIT Press, Cambridge, MA 1988); P.D. Wasserman, Advanced Methods in Neural Computing (Van Nostrand Reinhold, New York, 1993); T. Kohonen, An Introduction to Neural Computing, Neural Networks 1 (1988) 3-16
[3] D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning representations by back-propagating errors, Nature 323 (1986) 533-536; D.E. Rumelhart and J.L. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition (The MIT Press, Cambridge, MA 1986), Vol. 1, pp 318-362; P.J. Werbos, Proc. of IEEE 78 (1990) 1550
[4] T.L.H. Watkin, A. Rau and M. Biehl, Rev. Mod. Phys. 65 (1993) 499
[5] T. Kohonen, Self-organization and Associative Memory (Springer-Verlag, New York, 1984; 2nd edition 1988, 3rd edition 1989)
[6] J.A. Darsey, D.W. Noid and B.R. Upadhyaya, Chem. Phys. Lett. 177 (1991) 189; 181 (1991) 386
[7] J. Androsiuk, L. Kulak and K. Sienicki, Chem. Phys. 173 (1993) 377
[8] B.G. Sumpter, C. Getino and D.W. Noid, J. Chem. Phys. 97 (1992) 293; J. Phys. Chem. 96 (1992) 2761; B.G. Sumpter and D.W. Noid, Chem. Phys. Lett. 192 (1992) 455
[9] K.L. Peterson, Phys. Rev. A 41 (1990) 2457; 44 (1991) 126
[10] M.O. Poliac, G.L. Wilcox, Y. Xin, T. Carmeli and M. Liebman, IEEE International Joint Conference on Neural Networks (New York, NY, IEEE 1991); G.L. Wilcox, M. Poliac and M. Liebman, Proceedings of the INNC 90, Paris (Kluwer, Dordrecht, Netherlands 1990), p. 365; E.A. Ferran and P. Ferrara, Physica A 185 (1992) 395; M. Blazek and P. Pancoska, Neurocomputing 3 (1991) 247; M. Compiani, P. Fariselli and R. Casadio, in: Proceedings of the Fourth Italian Workshop on Parallel Architectures and Neural Networks, Vietri sul Mare, Italy (World Scientific, Singapore 1991), p. 227
[11] J.R. Rice, The Approximation of Functions (Addison Wesley, Reading, MA 1964); G.G. Lorentz, Approximation of Functions (Chelsea Publishing Co, New York, 1986)
[12] G. Cybenko, Math. Control Systems Signals 2 (1989) 303; K. Funahashi, Neural Networks 2 (1989) 183; K. Hornik, M. Stinchcombe and H. White, Neural Networks 2 (1989) 359; L.K. Jones, Proc. of the IEEE 78 (1990) 1586
[13] H. White, Neural Networks 3 (1990) 535; A.R. Barron, IEEE Transactions on Information Theory 39 (1993) 930
[14] M.J.D. Powell, "Radial basis functions for multivariable interpolation: a review", in: J.C. Mason and M.G. Cox, eds, Algorithms for Approximation (Clarendon Press, Oxford 1987)
[15] V. Kurkova and K. Hlavackova, Proceedings of the Neuronet'93, Prague (1993)
[16] H. Leung and S. Haykin, Neural Computation 5 (1993) 928
[17] C.K. Chui, Ed., Wavelet Analysis and Its Applications (Academic Press, Boston, 1992)
[18] W.W. Armstrong, A. Dwelly, R. Manderscheid and T. Monroe, An Implementation of Adaptive Logic Networks, Univ. of Alberta, Comp. Science Dept., technical report, November 11, 1990
[19] Y-H. Pao, Adaptive Pattern Recognition and Neural Networks (Addison-Wesley, Reading, MA 1989)
[20] A. Blum and R. Rivest, Training a 3-node neural network is NP-complete, in: COLT '88, Proc. of the 1988 Workshop on Computational Learning Theory, MIT 1988 (M. Kaufmann, San Mateo, CA); S. Judd, in: Proc. IEEE Intern. Conference on Neural Networks, San Diego, CA, 1987, vol. II, p. 685
[21] S.I. Gallant, Neural Network Learning and Expert Systems (Bradford Book, MIT Press 1993); J. Platt, Neural Comput. 3 (1991) 213; V. Kadirkamanathan and M. Niranjan, Neural Comput. 5 (1993) 954; B. Fritzke, Vector quantization with growing and splitting elastic net, in: ICANN '93: Proceedings of the International Conference on Artificial Neural Networks, Amsterdam 1993; S.E. Fahlman and C. Lebiere, Tech. Rep. CMU-CS-90-100, Carnegie-Mellon School of Comp. Sci. (1990)
[22] B.D. Ripley, Neural networks and flexible regression and discrimination, in: Statistics and Images, ed. K.V. Mardia (Abingdon, Carfax, 1993); B.D. Ripley, Flexible non-linear approaches to classification, in: From Statistics to Neural Networks. Theory and Pattern Recognition Applications, eds. V. Cherkassky, J.H. Friedman and W. Wechsler (Springer Verlag 1994); B.D. Ripley, J. Roy. Stat. Soc. B 56 (1994) (in press)
[23] P. Treleaven and M. Vellasco, Comp. Phys. Comm. 57 (1989) 543; B. Humpert, Comp. Phys. Comm. 58 (1990) 223
[24] A.N. Tikhonov, Soviet Math. Dokl. 4 (1963) 1035; A.N. Tikhonov and V.Y. Arsenin, Solutions of Ill-Posed Problems (W.H. Winston, Washington D.C. 1977); V.A. Morozov, Methods for Solving Incorrectly Posed Problems (Springer Verlag, Berlin, 1984)
[25] C.M. Bishop, in: Proceedings of the INNC, Paris, Vol. 2, p. 749 (Kluwer, Dordrecht, Netherlands 1990); AEA Technology Report AEA FUS 162
[26] T. Poggio and F. Girosi, Proc. of the IEEE 78 (1990) 1481
[27] E.J. Kansa, Computers Math. Appl. 19 (1990) 127; R. Franke, Math. Comp. 38 (1982) 181
[28] R.O. Duda and P.E. Hart, Pattern Classification and Scene Analysis (Wiley, New York, 1973)
[29] D.F. Specht, Proc. of IEEE Intern. Conference on Neural Networks 1 (1988) 525; D.F. Specht, IEEE Transactions on Neural Networks 2 (1991) 568
[30] T. Kohonen, Proceedings of the IEEE 78 (1990) 1464; A. Cherubini and R. Odorico, Comp. Phys. Comm. 72 (1992) 249
[31] B. Ph. van Milligen and N.J. Lopes Cardozo, Comp. Phys. Comm. 66 (1991) 243
[32] G.J. Klir and T.A. Folger, Fuzzy Sets, Uncertainty and Information (Prentice Hall, NJ 1988)
[33] W. Duch, Department of Computer Methods, Technical Reports 1-3/1992 (1992)
[34] J.N. Murrell, S. Carter, S.C. Farantos, P. Huxley and A.J.C. Varandas, Molecular Potential Energy Functions (Wiley, NY 1984); B.R. Eggen, R.L. Johnston, S. Li and J.N. Murrell, Molecular Physics 76 (1992) 619
[35] A. Lee, S.K. Rogers, G.L. Tarr, M. Kabrisky and D. Norman, Proc. SPIE - Int. Soc. Opt. Eng. 1294 (1990) 138
[36] R.R. Trippi and E. Turban, Neural Networks in Finance and Investing (Probus, Chicago, Ill 1992); H.C. Koons and D.J. Gorney, J. Geophys. Res. 96 (1991) 5549
[37] J.D. Farmer and J.J. Sidorowich, Phys. Rev. Letters 59 (1987) 845-848; M. Casdagli, Physical Review A 35 (1989) 335-356; M. Casdagli, Physica D 35 (1989) 335; J.B. Elsner, J. Phys. A 25 (1992) 843-850
[38] J. Zupan and J. Gasteiger, Neural Networks for Chemists: An Introduction (VCH, Weinheim, Germany 1993); G.M. Maggiora, D.W. Elrod and R.G. Trenary, J. Chem. Information and Comp. Sciences 32 (1992) 732
FIG. 1. The typical architecture of a feedforward neural network: an input layer, an output layer with a single neuron, and a hidden layer, with two sets of weights, between the input and hidden layer and between the hidden and output layer. This network performs the function given explicitly in Eq. (8) in the text.

FIG. 2. A neural network architecture for fitting the data to a set of basis functions φ_i.

FIG. 3. An example of the effectiveness of various interpolation and extrapolation techniques for the function f(x) = sin(√2 x) sin(2x), shown as a thin line on all drawings, with 11 uniformly spaced (x, f(x)) data points taken from (0, 2π) as the training data. Approximations by: a) Fourier series, 5-th order fit; b) fit using 11 strongly non-linear sigmoidal functions centered on the data points, T = 0.01; c) as above, but with much lower non-linearity, T = 0.2; d) as above, with T = 1.0, equivalent to linear splines; e) the best back-propagation neural network with one input, one output and 10 hidden neurons (20 non-zero weights) we have found.

[Fig. 3a]

[Fig. 3b]

[Fig. 3c]

[Fig. 3d]

[Fig. 3e]
