Hugh Cartwright - Using Artificial Intelligence in Chemistry and Biology - A Practical Guide (Chapman & Hall CRC Research No) - CRC Press (2008)
Hugh Cartwright
Nawwaf Kharma
Associate Professor, Department of Electrical and Computer Engineering,
Concordia University, Montreal, Canada
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and
are used only for identification and explanation without intent to infringe.
QD39.3.E46C375 2008
540.285’63--dc22 2007047462
Preface........................................................................................................ xiii
The Author...................................................................................................xv
1. Artificial Intelligence......................................................................... 1
1.1 What Is Artificial Intelligence?.............................................................2
1.2 Why Do We Need Artificial Intelligence and What Can We
Do with It?...............................................................................................3
1.2.1 Classification...............................................................................5
1.2.2 Prediction.....................................................................................5
1.2.3 Correlation...................................................................................5
1.2.4 Model Creation...........................................................................6
1.3 The Practical Use of Artificial Intelligence Methods........................6
1.4 Organization of the Text.......................................................................7
References........................................................................................................8
3. Self-Organizing Maps...................................................................... 51
3.1 Introduction.......................................................................................... 52
3.2 Measuring Similarity..........................................................................54
3.3 Using a Self-Organizing Map............................................................ 56
3.4 Components in a Self-Organizing Map............................................ 57
3.5 Network Architecture......................................................................... 57
3.6 Learning................................................................................................ 59
3.6.1 Initialize the Weights............................................................... 60
3.6.2 Select the Sample...................................................................... 62
3.6.3 Determine Similarity............................................................... 62
3.6.4 Find the Winning Node..........................................................63
3.6.5 Update the Winning Node......................................................64
3.6.6 Update the Neighborhood......................................................65
3.6.7 Repeat......................................................................................... 67
3.7 Adjustable Parameters in the SOM................................................... 71
3.7.1 Geometry................................................................................... 71
3.7.2 Neighborhood........................................................................... 73
3.7.3 Neighborhood Functions........................................................ 73
3.8 Practical Issues.....................................................................................80
3.8.1 Choice of Parameters...............................................................80
3.8.2 Visualization............................................................................. 81
3.8.3 Wraparound Maps...................................................................85
3.8.4 Maps of Other Shapes.............................................................. 87
3.9 Drawbacks of the Self-Organizing Map........................................... 88
3.10 Applications.......................................................................................... 89
3.11 Where Do I Go Now?.......................................................................... 93
3.12 Problems................................................................................................ 93
References...................................................................................................... 94
Index........................................................................................................... 329
Preface
Errors are almost unavoidable in a text of this sort and I shall be grateful to
be notified of any that may be spotted by readers.
Hugh Cartwright
1
Artificial Intelligence
Computers don’t understand. At least, they don’t understand in the way that we do. Of course, there are some things that are almost beyond comprehension: Why do women collect shoes? What are the rules of cricket? Why do Americans enjoy baseball? And why are lawyers paid so much?
These are tough questions. As humans, we can attempt to answer them,
but (at least for the next few years) computers cannot help us out. Humans
possess far more “intelligence” than computers, so it is to be expected that if
a human struggles to understand shoe collecting, a computer will be totally
baffled by it. This makes it all the more surprising that computers that use
“artificial intelligence” can solve a variety of scientific tasks much more
effectively than a human.
Artificial Intelligence (AI) tools are problem-solving algorithms. This book provides an introduction to the wide range of methods within this area that are being developed as scientific tools. The application of AI methods in science is a young field, hardly out of diapers in fact. However, its potential is huge. As a result, the popularity of these methods among physical and life scientists is increasing rapidly.
In the conventional image of AI, we might expect to see scientists building
laboratory robots or developing programs to translate scientific papers from
Russian to English or from English to Russian, but this does not accurately
reflect the current use of AI in science. The creation of autonomous robots
and the perfection of automatic translators are among the more important
areas of research among computer scientists, but the areas in which most
experimental scientists are involved are very different.
Science is dominated by problems; indeed, without them there would be
hardly any science. When tackling a new problem, scientists are generally
less concerned about how they solve it, provided that they get a solution, than
they are about the quality of the solution that is obtained. Thus, the pool of
mathematical and logical algorithms that scientists use for problem solving
continues to grow. That pool is now being augmented by AI methods.
Until the second half of the 1990s, experimental scientists, by and large, knew little about AI, suspecting (without much evidence) that it might be largely irrelevant in practical science. This turned out to be an unduly pessimistic view. AI algorithms are in reality widely applicable and the power of these methods, allied to their simplicity, makes them an attractive proposition in the analysis of scientific data and for scientific simulation. The purpose of this book is to introduce these methods to those who need to solve scientific problems, but do not (yet) know enough about AI to take advantage
of it. The text includes sufficient detail that those who have some modest
programming skills, but no prior contact with AI, can rapidly set about using
the techniques.
Figure 1.1
Progressive improvements in the attempts by a genetic algorithm to find a curve that fits a set of data points well (lines 1 to 3). The solid line is fitted using standard least squares.
But, you may have spotted a problem here. If an AI program has to learn,
does that not suggest that its performance might be fairly hopeless when we
first ask it to solve a problem? Indeed it does, as Figure 1.1 indicates.
The figure shows a polynomial fit to a dataset calculated according to a
standard least squares algorithm (solid line); this is compared with a series
of attempts to find a fit to the same data using a genetic algorithm.
The genetic algorithm is unimpressive, at least to begin with (line 1). The
quality of the fit that it finds does improve (lines 2 and 3) and eventually
it will reach a solution that matches the least squares fit, but the algorithm
takes far longer to find the solution than standard methods do, and the fit is
no better.
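The comparison in Figure 1.1 can be reproduced in miniature. The sketch below (the data, names, and parameter choices are invented for illustration, not taken from the book) fits a straight line to a small dataset both by closed-form least squares and by a bare-bones genetic algorithm; the evolutionary search slowly approaches the least squares answer but cannot better it.

```python
import random

# Toy data: points near the line y = 2x + 1, with fixed "noise" so the
# example is reproducible.
xs = [0, 1, 2, 3, 4, 5]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]

def sse(a, b):
    """Sum of squared errors of the line y = a*x + b over the data."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

# Standard least squares fit (closed form) for comparison.
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
a_ls = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b_ls = my - a_ls * mx

# A minimal genetic algorithm: a population of candidate (a, b) pairs is
# repeatedly ranked on fitness, truncated, and mutated.
random.seed(0)
pop = [(random.uniform(-10, 10), random.uniform(-10, 10)) for _ in range(30)]
for generation in range(200):
    pop.sort(key=lambda ab: sse(*ab))
    survivors = pop[:10]                       # keep the best third unchanged
    children = [(a + random.gauss(0, 0.3),     # mutated copies of survivors
                 b + random.gauss(0, 0.3))
                for a, b in survivors for _ in range(2)]
    pop = survivors + children

best = min(pop, key=lambda ab: sse(*ab))
print("least squares:", round(a_ls, 2), round(b_ls, 2), "error", round(sse(a_ls, b_ls), 3))
print("genetic alg.: ", round(best[0], 2), round(best[1], 2), "error", round(sse(*best), 3))
```

Running this confirms the point made above: the evolutionary search needs many generations to approach a fit that the closed-form method produces at once, and its final error is never smaller.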
Figure 1.2
The drug development process: a space of some 10^40 potential drugs is narrowed to around 10^6 compounds in high-throughput screening, roughly 10^2 positive leads, and finally about 10 candidates that enter trials.
that, for a typical shop producing twenty chemicals, is much too large to
allow every sequence to be investigated individually.
A second reason why AI is of value to scientists is that it offers powerful tools to cope with complexity. In favorable circumstances, the solutions to problems can be expressed by rules or by a well-defined, possibly trivial, model. If we want to know whether a compound contains a carbonyl group, we could record its infrared spectrum and check for a peak near 1760 cm⁻¹. The spectrum, paired with the rule that ketones generally show an absorption in this region, is all that we need. But other correlations are more difficult to express by rules or parametrically. What makes a good wine? We may (or may not) be able to recognize a superior wine by its taste, but would have considerable difficulty in determining whether a wine is good, or even if it is palatable, if all we had to go on was a list of the chemicals of which it is composed.
We shall meet in this book a number of examples of the sort of scientific problems in which AI can be valuable, but it will be helpful first to look at a few typical cases.
1.2.1 Classification
Scientists need to classify and organize complex data, such as that yielded by medical tests or analysis via GC-MS (gas chromatography-mass spectrometry). The data may be multifaceted and difficult to interpret, as different tests may conflict or yield inconclusive results. Growing cell structures may be used, for example, to assess medical data such as that obtained from patient biopsies, and to determine whether the test results are consistent with a diagnosis of breast cancer.1
1.2.2 Prediction
The prediction of stable structures that can be formed by groups of a few
dozen atoms is computationally expensive because of the time required to
determine the energy of each structure quantum mechanically, but such
studies are increasingly valuable because of the need in nanochemistry to
understand the properties of these small structures. The genetic algorithm is
now widely used to help predict the stability of small atomic clusters.2
1.2.3 Correlation
The relationships between the molecular structure of environmental pollutants, such as polychlorinated biphenyls (PCBs), and their rate of biodegradation are still not well understood, though some empirical relationships have been established. Self-organizing maps (SOMs) have been used to rationalize the resistance of PCBs to biodegradation and to predict the susceptibility to degradation of those compounds for which experimental data are lacking.3 The same technique has been used to analyze the behavior of lipid bilayers, following a
molecular dynamics simulation, to learn more about how these layers behave
both in natural systems and when used as one component in biosensors.4
Figure 1.3
Relationships between the AI methods of the most current value in science (the diagram groups methods such as neural networks, agent-based systems, and enhanced algorithms).
References
1. Walker, A.K., Cross, S.S., and Harrison, R.F., Visualisation of biomedical datasets by use of growing cell structure networks: A novel diagnostic classification technique, Lancet, 354, 1518, 1999.
2. Johnston, R.L., et al., Application of genetic algorithms in nanoscience: Cluster geometry optimisation, Applications of Evolutionary Computing, Proceedings, Lecture Notes in Computer Science, Springer, Berlin, 2279, 92, 2002.
3. Cartwright, H.M., Investigation of structure-biodegradability relationships in
polychlorinated biphenyls using self-organising maps, Neural Comput. Apps.,
11, 30, 2002.
4. Murtola, T., et al., Conformational analysis of lipid molecules by self-organiz-
ing maps, J. Chem. Phys., 125, 054707, 2007.
5. Escuderos, M.E., et al., Instrumental technique evolution for olive oil sensory
analysis, Eur. J. Lipid Sci. Tech., 109, 536, 2007.
6. Goudos, S.K. and Sahalos, J.N., Microwave absorber optimal design using multi-
objective particle swarm optimisation, Microwave Optic. Tech. Letts., 48, 1553,
2006.
2
Artificial Neural Networks
2.1 Introduction
Pattern matching may sound dull or esoteric, but we all use an excellent,
free pattern-matching tool every day—the human brain excels at just this
sort of task. Babies are not taught how to recognize their mother, yet without
instruction quickly learn to do so. By the time we are adults, we have become
Figure 2.1
The handwriting of a typical chemist is (just) readable.
Figure 2.2
The complex pattern of interneuron connections within the brain.
Figure 2.3
A novice user’s view of the operation of an artificial neural network: a black box with an input and an output.
Any software model of the brain that we could create using a current PC could emulate only an infinitesimal amount of it, so it might seem that our expectations of what an ANN could accomplish should be set at a thoroughly modest level. Happily, the reality is different. Though computer-based neural networks are minute compared with the brain, they can still outperform the brain in solving many types of problems.
The first contact that many scientific users have with neural networks is
through commercially available software. The manuals and Help systems in
off-the-shelf neural network software may not offer much on the principles
of the method, so, on a first encounter, ANNs might appear to be black boxes
whose inner workings are both mysterious and baffling (Figure 2.3).
This cautious and distant view of the innards of the box is no disadvantage
to the user if the network is already primed for use in a well-defined domain
and merely waiting for the user to feed it suitable data. Scientists, though,
will want to employ ANNs in innovative ways, taking full advantage of their
potential and flexibility; this demands an appreciation of what lies within
the box. It is the purpose of this chapter to lift up the lid.
Figure 2.4
A schematic view of two neurons in the brain, showing the axon and synapse.
so considerable that, even though the model that it generates cannot always
be translated into plain English rules and equations, its development is still
justified by superior performance.
1. Some nodes.
2. The connections between them.
3. A weight associated with each connection.
4. A recipe that defines how the output from a node is determined
from its input.
2.4.1 Nodes
ANNs are built by linking together a number of discrete nodes (Figure 2.5). Each node receives and integrates one or more input signals, performs some simple computations on the sum using an activation function, then outputs the result of its work. Some nodes take their input directly from the outside world; others may have access only to data generated internally within the network, so each node works only on its local data. This parallels the operation of the brain, in which some neurons may receive sensory data directly from nerves, while others, deeper within the brain, receive data only from other neurons.
2.4.2 Connections
Small networks that contain just one node can manage simple tasks (we shall
see an example shortly). As the name implies, most networks contain several
nodes; between half a dozen and fifty nodes is typical, but even much larger
networks are tiny compared with the brain. As we shall see in this chapter
and the next, the topology adopted by a network of nodes has a profound
influence on the use to which the network can be put.
Figure 2.5
A node, the basic component in an artificial neural network: input signals enter the node and output signals leave it.
Nodes are joined by virtual highways along which messages pass from one node to another. As well as possessing internal connections, the network must be able to accept data from the outside world, so some network connections act as inputs, providing a pipeline through which data arrive. The input to the network can take many forms. It might consist of numerical data drawn from a database, streams of numbers generated by a piece of simulation software, the RGB (red, green, blue) values of pixels from a digital image, or the real-time signal from a sensor, such as a pH electrode or a temperature probe.
A network that computed, but kept the result of its deliberations to itself, would be valueless; thus, in every network at least one output connection is present that feeds data back to the user. Like the input data, this output can be of several different types; it could be numerical, such as a predicted boiling point or an LD50 value. Alternatively, the network might generate a Boolean value to indicate the presence or absence of some condition, such as whether the operating parameters in a reactor were within designated limits. The network might output a string to be interpreted as an instruction to turn on a piece of equipment, such as a heater. Networks may have more than one output connection and then are not limited to a single type of output. A Boolean output which, if set, would sound an alarm to alert the user to a drop in the temperature of a reaction cell, could be combined with a real-valued output that indicated the level to which a heater in the cell should be set if this event occurred.
Figure 2.6
The calculations performed by an artificial neuron.
net = Σᵢ₌₀ⁿ wᵢxᵢ    (2.1)

The summation in equation (2.1) is over all n + 1 inputs to the node. xᵢ is the signal traveling into the node along connection i and wᵢ is the weight on that connection. Most networks contain more than one node, so equation (2.1) can
be written in the slightly more general form:

netⱼ = Σᵢ₌₀ⁿ wᵢⱼxᵢⱼ    (2.2)

in which wᵢⱼ is the weight on the connection between nodes i and j, and the signal travels down that connection in the direction i → j.
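In code, equations (2.1) and (2.2) amount to a single weighted sum. A minimal sketch (the function name and example numbers are mine, not the book’s):

```python
def net_input(weights, signals):
    """Weighted sum of the signals arriving at a node, as in equation (2.1):
    net is the sum of w_i * x_i over all n + 1 inputs, index 0 being the
    bias input, whose signal is fixed at 1."""
    assert len(weights) == len(signals)
    return sum(w * x for w, x in zip(weights, signals))

# Two ordinary inputs plus a bias input fixed at 1:
print(net_input([0.5, -0.25, 0.1], [2.0, 4.0, 1.0]))  # 0.5*2 - 0.25*4 + 0.1*1
```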
Figure 2.7
A simple artificial neural network: a single summation node with inputs X and Y (connection weights 0.31 and 0.96) and a bias input with weight 0.1.
nonzero output from the node. It is in this sense that input from the bias node can be regarded as providing a threshold signal.
For the node shown, if X = 3 and Y = 7, the total input signal at the node is

net = 0.31 × 3 + 0.96 × 7 + 0.1 × 1 = 7.75    (2.3)

The output of the node, yⱼ, is obtained by passing this net input through the node’s activation function, fⱼ:

yⱼ = fⱼ(netⱼ)    (2.4)
Several types of activation functions have been used in ANNs; the step or binary threshold activation function is one of the simplest (Figure 2.8).** A node that employs the binary activation function first integrates the incoming signals as shown in equation (2.1). If the total input, including that coming in on the connection from the bias node, is below a threshold level, T, the node’s output is a fixed small value Φ, which is often chosen to be zero. If the summed input is equal to or greater than this threshold, the output is a fixed, larger value, usually set equal to one.
* The curious name “squashing function” reflects the action of the function. Activation functions can take as input any real number between –∞ and +∞; that value is then squashed down for output into a narrow range, such as {0, +1} or {–1, +1}.
** The step function is also known as a Heaviside function.
fstep(net) = 1, if net ≥ T
fstep(net) = φ, if net < T    (2.5)

Figure 2.8
A binary or step activation function.
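A node of this kind is easy to sketch in code (the names and the illustrative threshold are mine; T and φ are as in equation (2.5)):

```python
def step(net, T=0.0, phi=0.0):
    """Binary (Heaviside) activation, equation (2.5): 1 if net >= T, else phi."""
    return 1.0 if net >= T else phi

def tlu(weights, signals, T=0.5):
    """A threshold logic unit: a weighted sum passed through the step function."""
    net = sum(w * x for w, x in zip(weights, signals))
    return step(net, T)

# Inputs just above and far above the threshold give the same response,
# illustrating how the step function discards the size of the input signal.
print(tlu([1.0], [0.51]), tlu([1.0], [5.1]), tlu([1.0], [0.49]))
```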
fsign(net) = +1, if net ≥ 0
fsign(net) = −1, if net < 0    (2.6)

Figure 2.9
The sign function.
The binary threshold activation function was used in early attempts to create ANNs because of a perceived parallel between this function and the way that neurons operate in the brain. Neurons require a certain level of activation before they will “fire,” otherwise they are quiescent. A TLU (threshold logic unit) functions in the same way.
Although a TLU has features in common with a neuron, it is incapable of acting as the building block of a versatile computer-based learning network. The reason is that the output from it is particularly uninformative. The output
signal tells us only that the sum of the weighted input signals did, or did not, exceed the threshold T. As the response from the unit is the same whether the input is fractionally above the threshold or well beyond it, almost all information about the size of the input signal is destroyed at the node. A network of TLUs has few applications in science unless the input signals are themselves binary, when the use of a binary activation function does not unavoidably cause loss of information.
If the input data are not binary, an activation function should be chosen
that allows the node to generate an output signal that is related in some way
to the size of the input signal. This can be accomplished through several
types of activation function.
The linear activation function passes the summed input signal directly
through to the output, possibly after multiplication by a scaling factor.
yⱼ = fⱼ(netⱼ) = k × netⱼ = k × Σᵢ₌₀ⁿ wᵢⱼxᵢⱼ    (2.7)
In equation (2.7), k is a scaling factor and usually 0 < k ≤ 1.0. The simplest
linear activation function is the identity function, in which k is 1.0, thus the
output from a node that uses the identity function is equal to its input.
yⱼ = fⱼ(netⱼ) = netⱼ = Σᵢ₌₀ⁿ wᵢⱼxᵢⱼ    (2.8)
The output from a linear function may be capped so that input and output are
linearly related within a restricted range. Beyond these limits, either positive or
negative, the output does not vary with the size of the incoming signal, but has
a fixed value (Figure 2.10). Capping is used to prevent signals from becoming
y = k × net,  xmin ≤ x ≤ xmax
y = −Φ,  x < xmin
y = +Φ,  x > xmax    (2.9)

Figure 2.10
A capped linear scaling activation function.
unreasonably large as they pass through the network, which may happen in a
large network if the connection weights are much greater than one.
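Capping is simple to express in code. The sketch below clips the scaled output to a symmetric range ±cap; this is one common way of realizing the limits in equation (2.9), and the parameter names and bounds are mine:

```python
def capped_linear(net, k=1.0, cap=1.0):
    """Linear activation y = k*net, clipped so that |y| never exceeds cap.
    Within the linear region the output tracks the input; beyond it, the
    output is pinned at a fixed value, as in equation (2.9)."""
    return max(-cap, min(cap, k * net))

# In range, above range, and far below range:
print(capped_linear(0.3), capped_linear(7.5), capped_linear(-42.0))
```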
Figure 2.11
A one-node network that can tackle a classification task: the node has weights 0.4 and −1.0 on its two inputs and −2.0 on the bias input.
Figure 2.12
A set of data points that can be separated into two groups by a single straight line.
How did the network do this? Encoded in the network’s connection weights is the equation of the line that separates the two groups of points; this is

Y = 0.4 × X − 2.0    (2.10)

For each point, the node computes the quantity (0.4 × X − 2.0) − Y. The term in brackets is the value of Y calculated using equation (2.10). The actual value of Y for the point is subtracted from this, and, if the result is negative, the point lies above the line that divides the two groups and, therefore, is in group 1, while, if the value is greater than zero, the point lies below the line and is in group 0. If the value is zero, the point lies exactly on the line.
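Reading the weights from Figure 2.11 (0.4 and −1.0 on the X and Y connections, −2.0 on the bias), the whole classifier fits in a few lines; the function name and test points are mine:

```python
def classify(x, y):
    """One-node classifier with the Figure 2.11 weights. The node's net input
    is 0.4*x - 1.0*y - 2.0, i.e. (0.4*x - 2.0) - y: a negative value means
    the point lies above the line y = 0.4x - 2 (group 1), otherwise group 0."""
    net = 0.4 * x - 1.0 * y - 2.0
    return 1 if net < 0 else 0

print(classify(0.0, 3.0))    # above the line -> group 1
print(classify(10.0, -5.0))  # below the line -> group 0
```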
2.5 Training
The network in Figure 2.11 can allocate any two-dimensional data point to
the correct group, but it is only able to do this because it was prepared in
advance with suitable connection weights. This classification problem is so
simple that the connection weights can be found “by inspection,” either by
calculating them directly or by testing a few straight lines until one is found
that correctly separates the points. Not every problem that we meet will be
as straightforward as this.
Indeed, if the problem is simple enough that the connection weights can be found by a few moments’ work with pencil and paper, there are other computational tools that would be more appropriate than neural networks. It is in more complex problems, in which the relationships that exist between data points are unknown so that it is not possible to determine the connection weights by hand, that an ANN comes into its own. The ANN must then discover the connection weights for itself through a process of supervised learning.
The ability of an ANN to learn is its greatest asset. When, as is usually the case, we cannot determine the connection weights by hand, the neural network can do the job itself. In an iterative process, the network is shown a sample pattern, such as the X, Y coordinates of a point, and uses the pattern to calculate its output; it then compares its own output with the correct output for the sample pattern, and, unless its output is perfect, makes small adjustments to the connection weights to improve its performance. The training process is shown in Figure 2.13.
Figure 2.13
The steps in training an artificial neural network: choose the network size; initialize the weights; select a random sample pattern; determine the network output; calculate the error and adjust the weights; repeat until performance is satisfactory, then end.
δ = t − y    (2.12)
If the actual output and the target output are identical, the network has
worked perfectly for this sample, therefore, no learning need take place.
Another sample is drawn from the database and fed into the network and
the process is continued. If the match is not perfect, the network needs to
learn to do better. This is accomplished by adjusting the connection weights
so that the next time the network is shown the same sample it will provide
an output that is closer to the target response.
The size of the adjustment, ∆w, that is made to the weight on a connection
into the node is proportional both to the input to the node, x, and to the size
of the error:
∆w = ηδx (2.13)
The input to the node enters into this expression because a connection that has sent a large signal into the node bears a greater responsibility for any error in the output from the node than a connection that has provided only a small signal. In equation (2.13), η is the learning rate, which determines whether changes to the connection weights should be large or small in comparison with the weight itself; its value is chosen by the user and typically is 0.1 or less. The connection weight is then updated:

wnew = wold + ∆w    (2.14)

This process is known as the delta rule, or the Widrow–Hoff rule, and is a type of gradient descent because the size of the change made to the connection weights is proportional to the difference between the actual output and the target output. Once the weights on all connections into the node have
been adjusted, another pattern is taken from the database and the process is repeated. Multiple passes through the database are made, every sample pattern being included once in each cycle or epoch, until the error in the network’s predictions becomes negligible, or until further training produces no perceptible improvement in performance.
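The whole training loop fits in a few lines of code. The sketch below (names, data, and parameter values are mine) trains a single identity-activation node with the delta rule of equations (2.12) and (2.13) until it recovers a known linear target:

```python
import random

def train_delta(samples, eta=0.05, epochs=200):
    """Train a single linear node with the delta rule: for each sample,
    delta = target - output (eq. 2.12), and each weight is changed by
    eta * delta * (its own input signal) (eq. 2.13)."""
    w = [0.0, 0.0, 0.0]           # weights for x, y and the bias input
    for _ in range(epochs):
        for (x, y), target in samples:
            inputs = [x, y, 1.0]  # the bias input is fixed at 1
            out = sum(wi * xi for wi, xi in zip(w, inputs))
            delta = target - out
            w = [wi + eta * delta * xi for wi, xi in zip(w, inputs)]
    return w

# Teach the node the function t = 0.3x + 0.9y + 0.1 from twenty samples.
random.seed(1)
points = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(20)]
samples = [((x, y), 0.3 * x + 0.9 * y + 0.1) for x, y in points]
w = train_delta(samples)
print([round(wi, 3) for wi in w])  # approaches [0.3, 0.9, 0.1]
```

Because the target is exactly linear and the learning rate is small, the weights converge on the underlying coefficients; with noisy data they would instead settle near the least squares solution.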
Example 3: Training
Suppose that Figure 2.7 shows the initial connection weights for a network that we wish to train. The first sample taken from the database is X = 0.16, Y = 0.23, with a target response of 0.27. The node uses the identity function to determine its output, which is therefore:

y = 0.31 × 0.16 + 0.96 × 0.23 + 0.1 = 0.3704

As the target response is 0.27, the error is –0.1004. If the learning rate is 0.05, the adjustment to the weight on the x connection is

∆w = 0.05 × (–0.1004) × 0.16 ≈ –0.0008
Figure 2.14
The Boolean AND function, with points plotted at (0, 0), (0, 1), (1, 0), and (1, 1). A single straight line can separate the open and filled points.

Figure 2.15
The Boolean XOR function. Two straight lines are required to divide the area into regions that contain only one type of point.
Figure 2.16
Scientific data points usually require (a) curves, or (b) several lines, or both, to divide them into homogeneous groups.
instance, two straight lines are required to separate the space into regions in
which Y has the same value.
Most scientific problems are nonlinear, requiring (if the data points are two-dimensional) either a curve (Figure 2.16a) or more than one line (Figure 2.16b) to separate the classes.
Although some problems in more than two dimensions are linearly separable (in three dimensions, the requirement for linear separability is that the points are separated by a single plane, Figure 2.17), almost all problems of scientific interest are not linearly separable and, therefore, cannot be solved by a one-node network; thus more sophistication is needed. The necessary additional power in the network is gained by making two enhancements: (1) the number of nodes is increased and (2) each node is permitted to use a more flexible activation function.
Figure 2.17
Linear separability in three dimensions.
Figure 2.18
A parallel arrangement of nodes.
Figure 2.19
A serial arrangement of nodes.
Figure 2.20
A network that contains a recursive (backward) link.
In the brain, connections between neurons are quite numerous and apparently random, at least at the local level. The impressive logical power of the brain might suggest that we should aim for a network designed along similarly random lines, but this type of disorganized network is hard to use: Where do we feed data in? Where are the outputs? What is to stop the data from just going round and round in an endless loop? Even worse, such a network is very difficult to train. Instead, the most common type of artificial neural network is neither entirely random nor completely uniform; the nodes are arranged in layers (Figure 2.21). A network of this structure is simpler to train than networks that contain only random links, but is still able to tackle challenging problems.
One layer of input nodes and another of output nodes form the bookends
to one or more layers of hidden nodes.* Signals flow from the input layer to the
hidden nodes, where they are processed, and then on to the output nodes,
which feed the response of the network out to the user. There are no recur-
sive links in the network that could feed signals from a “later” node to an
“earlier” one or return the output from a node to itself. Because the messages
in this type of layered network move only in the forward direction when
input data are processed, this is known as a feedforward network.
* So-called because nodes in the hidden layer have no direct contact with the outside world.
Figure 2.21
A feedforward neural network.
Figure 2.22
The logistic activation function.

f(net) = \frac{1}{1 + e^{-net}}    (2.16)
From any real input, this provides an output in the range (0, 1). The Tanh
function (Figure 2.23) has a similar shape, but an output that covers the
range (–1, +1). Both functions are differentiable in closed form, which confers
a speed advantage during training, a topic we turn to now.
Figure 2.23
The Tanh activation function.
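The closed-form derivatives that make these two activation functions attractive can be sketched in a few lines. (The book gives no code for this; Python is used here purely for illustration.)

```python
import math

def logistic(net):
    # Equation (2.16): output lies in the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-net))

def logistic_deriv(net):
    # Closed form: f'(net) = f(net) * (1 - f(net)); no numerical
    # differentiation is needed, which speeds up training.
    f = logistic(net)
    return f * (1.0 - f)

def tanh_deriv(net):
    # Closed form for Tanh: f'(net) = 1 - tanh(net)**2.
    return 1.0 - math.tanh(net) ** 2
```

A finite-difference estimate of either derivative agrees with the closed-form expression to many decimal places, which is exactly why these functions are convenient during training.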
A hidden node that sends a large signal to an output node is more respon-
sible for any error at that node than a hidden node that sends a small signal,
so changes to the connection weights from the former hidden node should be
larger.
Backpropagation has two phases. In the first, an input pattern is presented
to the network and signals move through the entire network from its inputs
to its outputs so the network can calculate its output. In the second phase, the
error signal, which is a measure of the difference between the target response
and actual response, is fed backward through the network, from the outputs
to the inputs and, as this is done, the connection weights are updated.
The recipe for performing backpropagation is given in Box 1 and illus-
trated in Figure 2.24 and Figure 2.25. Figure 2.24 shows the first stage of the
process, the calculation of the network output; this is the forward pass. Input
signals of 0.8 and 0.3 are fed into the network and pass through it, being
multiplied by connection weights before being summed and fed through a
sigmoidal activation function at each node.
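For a single node, the forward pass amounts to a weighted sum followed by the activation function. A minimal sketch (the weights below are arbitrary illustrations, not those of Figure 2.24):

```python
import math

def node_output(inputs, weights):
    # Multiply each input by its connection weight, sum the products,
    # then squash the result with the logistic (sigmoidal) function.
    net = sum(x * w for x, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-net))
```

With a net input of 0.5, for example, the node outputs 1/(1 + e^{-0.5}) ≈ 0.6225.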
net_j = \sum_i x_i w_{ij}

\delta_j = (t_j - o_j)\, o_j (1 - o_j)

9. Calculate the error for each node in the final hidden layer:

\delta_j = o_j (1 - o_j) \sum_k \delta_k w_{kj}
Figure 2.24
Backpropagation: The forward pass.
The error signal for node j is defined as:

\delta_j = -\frac{\partial E}{\partial net_j}    (2.17)

and the gradient of this error with respect to weight w_{ij} as:

\Delta w_{ij} = -\frac{\partial E}{\partial w_{ij}}    (2.18)
Figure 2.25
Backpropagation: Updating the weights.
The gradient of the weights can be expanded using the chain rule:

\Delta w_{ij} = -\frac{\partial E}{\partial net_i} \frac{\partial net_i}{\partial w_{ij}}    (2.19)

The first term is the error in unit i, while the second can be written as:

\frac{\partial net_i}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \sum_{k \in A_i} w_{ik} y_k = y_j    (2.20)

so that

\Delta w_{ij} = \delta_i y_j    (2.21)
To find this term, we need to calculate the activity and error for all relevant
network nodes. For input nodes, this activity is merely the input signal x. For
all other nodes, the activity is propagated forward:
y_i = f_i \left( \sum_{j \in A_i} w_{ij} y_j \right)    (2.22)
Since the activity of unit i depends on the activity of all nodes closer to the
input, we need to work through the layers one at a time, from input to output.
As feedforward networks contain no loops that feed the output of one node
back to a node earlier in the network, there is no ambiguity in doing this.
The error at the output is calculated in the normal fashion, summing the
contributions across all of the output nodes:
E = \frac{1}{2} \sum_{o} (t_o - y_o)^2    (2.23)

\delta_o = t_o - y_o    (2.24)
To determine the error at the hidden nodes, we backpropagate the error from
the output nodes. We can expand this error in terms of the posterior nodes:
\delta_j = -\sum_{i \in P_j} \frac{\partial E}{\partial net_i} \frac{\partial net_i}{\partial y_j} \frac{\partial y_j}{\partial net_j}    (2.25)
The first factor is as before, just the error of node i. The second is
\frac{\partial net_i}{\partial y_j} = \frac{\partial}{\partial y_j} \sum_{k \in A_i} w_{ik} y_k = w_{ij}    (2.26)
while the third is the derivative of the activation function for node j:

\frac{\partial y_j}{\partial net_j} = f_j'(net_j)    (2.27)
Equation (2.27) reveals why some activation functions are more convenient
computationally than others. In order to apply BP, the derivative of the acti-
vation function must be determined. If no closed form expression for this
derivative exists, it must be calculated numerically, which will slow the algo-
rithm. Since training is in any case a slow process because of the number of
samples that must be inspected, the advantage of using an activation func-
tion whose derivative is quick to calculate is considerable.
When the logistic function is used we can make use of the identity that:

f'(net_j) = f(net_j) \left[ 1 - f(net_j) \right]    (2.28)

so this derivative is obtained directly from the node output. Combining the
three factors gives the error at a hidden node:

\delta_j = f_j'(net_j) \sum_{i \in P_j} \delta_i w_{ij}    (2.29)
To calculate the error for unit j, the error for all nodes that lie in later layers
must previously have been calculated. The updating of weights, therefore,
must begin at the nodes in the output layer and work backwards.
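The whole recipe — forward pass, backpropagated error signals, then weight updates starting at the output layer — can be sketched for a small network with one hidden layer and a single output node. (This is an illustrative sketch of the algorithm with made-up weights, not code from the book.)

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hid, w_out):
    # Forward pass: activities flow from the inputs, through the
    # hidden layer, to the single output node.
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hid]
    y = sigmoid(sum(w * hj for w, hj in zip(w_out, h)))
    return h, y

def backward(x, t, w_hid, w_out, eta=0.5):
    # Backward pass: output error first, then the hidden errors,
    # then the weight updates (output layer before hidden layer).
    h, y = forward(x, w_hid, w_out)
    delta_o = (t - y) * y * (1.0 - y)                 # output-node error signal
    delta_h = [hj * (1.0 - hj) * delta_o * w_out[j]   # backpropagated to hidden
               for j, hj in enumerate(h)]
    new_w_out = [w + eta * delta_o * hj for w, hj in zip(w_out, h)]
    new_w_hid = [[w + eta * dj * xi for w, xi in zip(ws, x)]
                 for ws, dj in zip(w_hid, delta_h)]
    return new_w_hid, new_w_out
```

A single update with a small learning rate reduces the squared error for the pattern just presented, and the analytic error signals agree with a numerical estimate of the gradient.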
Figure 2.26
Typical variation of the training rate with epoch number.
2.10 Momentum
Networks with a sufficiently large number of nodes are able to model any
continuous function. When the function to be modeled is very complex,
we can expect that the error function, which represents the way that the
discrepancy between the target output and the actual output varies with
the values of the connection weights, will be highly corrugated, displaying
numerous minima and maxima. We have just argued that toward the end of
training connection weights should be adjusted by only small amounts, but
once adjustments to the weights become small there is a real danger that the
network may get trapped, so that the connection weights continue to adjust
as each sample is presented, but do so in a fashion that is cyclic and so makes
no progress. Each weight then oscillates about some average value; once this
kind of oscillation sets in, the network ceases to learn.
This problem is more severe in multilayer networks that, though more pow-
erful and flexible than their single-layer counterparts, are also vulnerable to
trapping because the error surface, whose dimensionality equals the number
of connection weights, is complex. The scale of the problem is related to the
size of the network. Each connection weight is a variable whose value can be
adjusted; in a large network, there will be scores or hundreds of weights that
can be varied independently. As the set of connection weights defines a high
dimensional space, the greater the number of weights, the more minima and
maxima are likely to exist in the error surface and the more important it is
that the system be able to escape from local minima during training.
To reduce the chance that the network will be trapped by a set of end-
lessly oscillating connection weights, a momentum term can be added to
the update of the weights. This term adds a proportion of the update of the
weight in the previous epoch to the weight update in the current epoch:
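In the usual notation, the step becomes Δw(t) = −η·∂E/∂w + μ·Δw(t−1), where μ is the momentum constant (the symbols and the values below are conventional illustrative choices, not taken from the book). A minimal sketch:

```python
def momentum_update(w, gradient, prev_step, eta=0.5, mu=0.9):
    # New step = the plain gradient-descent step plus a fraction (mu)
    # of the step taken in the previous epoch; this memory of earlier
    # steps damps the cyclic oscillations described above.
    step = -eta * gradient + mu * prev_step
    return w + step, step
```

When successive gradients alternate in sign, the momentum term partially cancels the swings instead of letting the weight bounce endlessly between two values.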
Figure 2.27
The relationship between the boiling point of a material in kelvin and its
molecular weight in daltons.
Figure 2.28
Variation of error in the training set (solid line) and test set (broken line) with epoch.
as the network adjusts its connection weights to learn specific examples, its
knowledge of the general rules will be diluted and, therefore, the overall
quality of its knowledge will diminish.
A Goldilocks step is required—training must continue until the network
has learned just the right amount of knowledge, but no more. Judging when
the network has a satisfactory grasp of the general rules, but has not yet turned
its attention to learning specific patterns is not as difficult as it may seem. A
convenient way to avoid the perils of overfitting is to divide the dataset into
two parts: a training set and a test set. The training set provides the patterns
that the network uses to establish the correct connection weights; the test set
is used to measure how far training has progressed. Both sets should contain
examples of all the rules. For large sample databases, this can be achieved by
selecting the members of the two sets from the database at random.
The network is trained using the training set and, as it learns, the error
on this set decreases, as shown by the solid line in Figure 2.28. Periodically,
the network is shown the test set and the total error for samples in that set is
determined. When the network is shown the test set, the connection weights
remain untouched, whether or not performance on the set is satisfactory; the
test set is used only to assess the current ability of the network. As the net-
work learns, performance on both sets will improve, but once the network
starts to learn specific examples in the training set, the network’s under-
standing of the general rules will begin to degrade, and the prediction error
for the test set will rise. The optimal stopping point is then at the minimum
of the broken curve in Figure 2.28, at which point the error on the test set is
at its lowest.
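The stopping rule can be expressed as a small generic loop. Here `train_step` and `test_error` are hypothetical stand-ins for whatever training and evaluation routines are in use; the patience threshold is likewise an illustrative choice.

```python
def train_with_early_stopping(train_step, test_error,
                              max_epochs=1000, patience=20):
    # Train until the test-set error has failed to improve for
    # `patience` consecutive epochs; report the epoch at which the
    # test error was lowest (the minimum of the broken curve).
    best_err, best_epoch = float("inf"), -1
    for epoch in range(max_epochs):
        train_step()            # connection weights change here only
        err = test_error()      # weights remain untouched here
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_err
```

Fed a U-shaped sequence of test errors, the loop halts shortly after the minimum and reports the epoch at which it occurred.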
This method for preventing overfitting requires that there are enough sam-
ples so that both training and test sets are representative of the dataset. In
fact, it is desirable to have a third set known as a validation set, which acts as a
secondary test of the quality of the network. The reason is that, although the
test set is not used to train the network, it is nevertheless used to determine
at what point training is stopped, so to this extent the form of the trained
network is not completely independent of the test set.
Table 2.1
Boiling points, molecular weights, and electronegativity differences for some
diatomic molecules.
Source: Atkins, P. and De Paula, J., Physical Chemistry, 8th ed., Oxford University Press,
Oxford, U.K., 2006. With permission.
sigmoidal function does not become insensitive to the input, which would
happen if the integrated input signal was very large or very small.
2.12.4 Random Noise
A quite different way to reduce overfitting is to use random noise. A random
signal is added to each data point as it is presented to the network, so that a
data pattern

x = (x_1, x_2, \ldots, x_n)

becomes

x' = (x_1 + \varepsilon_1, x_2 + \varepsilon_2, \ldots, x_n + \varepsilon_n)

in which each \varepsilon_j is a small random number. At first sight, deliberately adding
random noise looks ill-advised, but, provided that it only slightly perturbs
the input pattern, the presence of this background noise will not mask the
general rules in the dataset since those rules are manifest in many samples.
On the other hand, the network is now less likely to learn a specific pattern
because that pattern will be different on each presentation.
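A sketch of the idea, using uniform noise of small amplitude (the amplitude value here is purely illustrative):

```python
import random

def with_noise(pattern, amplitude=0.02, rng=random):
    # Add a small random perturbation to every entry, so the network
    # never sees exactly the same version of a pattern twice.
    return [x + rng.uniform(-amplitude, amplitude) for x in pattern]
```

Each presentation of a pattern is therefore slightly different, yet every noisy copy stays within the chosen amplitude of the original.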
w_{ij} \leftarrow (1 - d)\, w_{ij}

in which d is a decay parameter in the range 0.0 < d < 1.0 and is usually very
close to zero. Although this procedure does reduce curvature, the decay of
the weights constantly pushes the network away from the weights on which
it would like to converge, thus the benefits of reduced overfitting must be
balanced against some loss of quality in the finished network.
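In code, the decay is a one-line rescaling applied after each weight update (the value of d below is illustrative only):

```python
def decay(weights, d=0.001):
    # Multiply every connection weight by (1 - d), nudging all weights
    # toward zero and so reducing the curvature of the fitted function.
    return [w * (1.0 - d) for w in weights]
```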
may have a very destructive effect on performance, but the importance of a node
is not determined by the magnitude of the connection weights alone. In addition,
judging when performance “begins to deteriorate” is a qualitative decision and
it may be hard to assess at what point the diminution in performance becomes
damaging. The various methods available for network pruning have intriguing
names, such as “optimal brain damage,” but are beyond the scope of this book.
A promising alternative to incremental networks or pruning is a growing
cell structure network, in which not only the size of the network but also its
geometry are evolved automatically as the calculation proceeds. Growing
cell structures, which form the subject of Chapter 4, are effective as a means
of creating self-organizing maps, but their use in generating feedforward
networks is in its infancy.
Figure 2.29
The herd effect.
All sections of the network will simultaneously try to adjust their weights
so as to find this rule. Until this rule has been satisfactorily modeled, the sec-
ond rule will be largely ignored, but once the first rule has been dealt with,
errors arising from the second rule predominate and all sections of the net-
work will then switch attention to solving that rule. In so doing, the network
may start to forget what has been learned about the first one. The network,
thus, switches herd-like between a focus on one rule and on the other and
can oscillate for some time before gradually settling into a mode in which it
addresses both rules.
Various strategies can be used to distract the herd. The simplest is to run
the same set of data through several different network geometries, changing
the number of hidden layers as well as the number of hidden nodes in each
layer and see what provides the best network measured by performance on
the test set. This simple strategy can be effective if the time required in train-
ing the network is low, but for complex data many different networks would
need to be tried and the time required may be excessive. Fortunately, the
herd effect, if it cannot be avoided entirely, can generally be overcome by
allowing training to run for an extended period.
2.13 Applications
Artificial neural networks are now widely used in science. Not only are they
able to learn by inspection of data rather than having to be told what to do,
but they can construct a suitable relationship between input data and the tar-
get responses without any need for a theoretical model with which to work.
For example, they are able to assess absorption spectra without knowing
about the underlying line shape of a spectral feature, unlike many conven-
tional methods.
Most recent scientific applications involve the determination of direct
relationships between input parameters and a known target response. For
example, Santana and co-workers have used ANNs to relate the structure of
a hydrocarbon to its cetane number,4 while Berdnik’s group used a theoreti-
cal model of light scattering to train a network that was then tested on flow
cytometry data.5
QSAR studies are a fertile area for ANNs and numerous papers have been
published in the field. Katritzky’s group has a range of interests in this area,
particularly related to compounds of biological importance; see for example
Reference 6. Some QSAR studies have been on a heroic scale. Molnar’s group
has used training sets of around 13,000 compounds and a total database con-
taining around 30,000 to try to develop meaningful links between cytotoxic-
ity and molecular descriptors.7
ANNs are the favorite choice as tools to monitor electronic noses,8 where
the target response may be less tangible than in other studies (although, of
course, it is still necessary to be able to define it). Many applications in which
a bank of sensors is controlled by a neural network have been published and,
as sensors diminish in size and cost while rising in utility, sensors on a chip
with a built-in ANN show considerable promise. Together, QSARs and electronic
noses currently represent two of the most productive areas in science for the
use of these tools.
The ability of ANNs to model nonlinear data is often crucial. Antoniewicz,
Stephanopoulos, and Kelleher have studied the use of ANNs in the estima-
tion of physiological parameters relevant to endocrinology and metabolism.9
2.15 Problems
1. Applicability of neural networks
Consider whether a neural network would be an efficient way of
tackling each of the following tasks:
a. Predicting the direction of the stock market (1) in the next twenty-
four hours, (2) over the next two years.
b. Predicting the outcome of ice hockey matches.
c. Predicting the frequency of the impact of meteorites on the Earth.
d. Deriving rules that link diet to human health.
e. Automatic extraction of molecular parameters, such as bond
lengths and bond angles, from the gas-phase high resolution
spectra of small molecules (which are of particular importance
in interstellar space).
f. Linking the rate of an enzyme-catalyzed reaction to the reaction
conditions, such as temperature and concentration of substrate.
References
1. Sharda, R. and Delen, D., Predicting box-office success of motion pictures with
neural networks, Exp. Sys. Apps., 30, 243, 2006.
2. Rumelhart, D.E., Hinton, G.E., and Williams, R.J., Learning representations by
back-propagating errors, Nature, 323, 533, 1986.
3. Atkins, P. and De Paula, J., Physical Chemistry, 8th ed., Oxford University Press,
Oxford, U.K., 2006.
4. Santana, R.C. et al., Evaluation of different reaction strategies for the improve-
ment of cetane number in diesel fuels, Fuel, 85, 643, 2006.
5. Berdnik, V.V. et al., Characteristics of spherical particles using high-order neu-
ral networks and scanning flow cytometry, J. Quant. Spectrosc. Radiat. Transf.,
102, 62, 2006.
6. Katritzky, A.R. et al., QSAR studies in 1-phenylbenzimidazoles as inhibitors of
the platelet-derived growth factor, Bioorg. Med. Chem., 13, 6598, 2005.
7. Molnar, L. et al., A neural network-based classification scheme for cytotoxicity
predictions: Validation on 30,000 compounds, Med. Chem. Letts., 16, 1037, 2006.
8. Fu, J., et al., A pattern recognition method for electronic noses based on an
olfactory neural network, Sens. Actuat. B: Chem., 125, 489, 2007.
9. Antoniewicz, M.R., Stephanopoulos, G., and Kelleher, J.K., Evaluation of regres-
sion models in metabolic physiology: Predicting fluxes from isotopic data with-
out knowledge of the pathway, Metabolomics 2, 41, 2006.
10. Beale, R. and Jackson, T., Neural Computing: An Introduction, Institute of Physics,
Bristol, U.K., 1991.
11. Zupan, J. and Gasteiger, J., Neural Networks in Chemistry and Drug Design, Wiley-
VCH, Chichester, U.K., 1999.
12. National Institute of Standards and Technology (NIST). Chemistry WebBook.
http://webbook.nist.gov/chemistry.
3
Self-Organizing Maps
There are many situations in which scientists need to know how alike a num-
ber of samples are. A quality control technician working on the synthesis of
a biochemical will want to ensure that each batch of product is of compara-
ble purity. An astronomer with access to a large database of radiofrequency
spectra, taken from observation of different parts of the interstellar medium,
might need to arrange the spectra into groups to determine whether there is
any correlation between the characteristics of the spectrum and the direction
of observation.
If sample patterns in a large database are each defined by just two val-
ues, a two-dimensional plot may reveal clustering that can be detected by
the eye (Figure 3.1). However, in science our data often have many more
than two dimensions. An analytical database might contain information
on the chemical composition of samples of crude oil extracted from differ-
ent oilfields. Oils are complex mixtures containing hundreds of chemicals
at detectable levels; thus, the composition of an oil could not be represented
by a point in a space of two dimensions. Instead, a space of several hun-
dred dimensions would be needed. To determine how closely oils in the
database resembled one another, we could plot the composition of every
oil in this high-dimensional space, and then measure the distance between
the points that represent two oils; the distance would be a measure of the
difference in composition. Similar oils would be “close together” in space,
Figure 3.1
Clustering of two-dimensional points.
while dissimilar oils would be “far apart” and we could check for clusters
by seeking out groups of points that lie close together.
This sounds rather alarming. Few of us are entirely comfortable work-
ing with four dimensions; something a hundred times larger is a good deal
worse. Fortunately, a self-organizing map (SOM) can take the pain away. Its
role is to identify similar samples within a large set and group them together
so that the similarities are apparent. Data that are represented by points in
these spaces, whether they cover a few or several hundred dimensions, are
handled in a simple way and the algorithm also provides a means to visu-
alize the data in which these clusters of points can be seen, so we need not
worry about the difficulties of spotting an object in a space that spans hun-
dreds of dimensions.
3.1 Introduction
There are good reasons why an oil company might want to know how simi-
lar two samples of crude oil are. Operating margins in the refineries that
process crude oil are low and the most profitable product mix is critically
dependent on how the nature of the feedstock is taken into account when
setting the refinery operating conditions. A minor change in the properties
of a feedstock, such as its viscosity or corrosivity, may affect the way that
it should be processed and the relative quantities of the chemicals that the
refinery produces. The change in product mix may, in turn, significantly
increase the refinery profit or, if the wrong choices of operating conditions
are made, remove it completely. Thus, operating conditions are fine-tuned to
the composition of each oil and if two successive batches of oil are not suf-
ficiently similar, adjustments will need to be made to those conditions.
It is not only in oil processing that it is valuable to identify similarities
among samples. In fact, the organization of samples into groups or classes
with related characteristics is a common challenge in science. By grouping
samples in this way, we can accomplish a couple of objectives, as is illus-
trated in the first two examples below.
Within the database there may be many chemicals for which butyl rubber
gloves provide good protection and we could reasonably anticipate that
at least some structural similarities would exist among these chemicals
that would help us rationalize the choice of butyl rubber.
It is found that many nitro compounds fall into the group for which
butyl rubber is an appropriate choice, so if a new nitro-containing com-
pound had been synthesized and we wished to choose a glove to provide
protection, inspection of the members of the class would suggest butyl
rubber as a suitable candidate. In this application, we are using the obser-
vation of similarities within a class (the presence of many nitro com-
pounds) as a predictive tool (best handled using butyl rubber gloves).
We have already met one tool that can be used to investigate the links that
exist among data items. When the features of a pattern, such as the infrared
absorption spectrum of a sample, and information about the class to which
it belongs, such as the presence in the molecule of a particular functional
group, are known, feedforward neural networks can create a computational
model that allows the class to be predicted from the spectrum. These net-
works might be effective tools to predict suitable protective glove material
from a knowledge of molecular structure, but they cannot be used if the
classes to which the samples belong are unknown because, in that case,
a conventional neural network cannot be trained.
* Also known as a Self-Organizing Feature Map or SOFM, or a Kohonen map after its inventor.
** The projection need not necessarily be onto two dimensions, but projections onto a different
number of dimensions are less common.
Figure 3.2
Projection of points that are clustered in three dimensions onto a two-dimensional plane.
it is the ideas that matter rather than rigor, so we shall replace these qual-
itative judgments with numerical approximations.
d_{ij}^2 = c_1 \times (legs_i - legs_j)^2 + c_2 \times (covering_i - covering_j)^2 + c_3 \times (speed_i - speed_j)^2 + \ldots    (3.1)
in which the constants c1, c2, c3, … are chosen to reflect how important we
think each factor ought to be in determining similarity.
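Equation (3.1) translates directly into code. The feature values and the importance constants c_1, c_2, … in the example below are invented for illustration:

```python
def squared_similarity_distance(a, b, c):
    # Weighted squared distance between two feature vectors, with one
    # importance constant per feature, as in equation (3.1).
    return sum(ck * (ak - bk) ** 2 for ck, ak, bk in zip(c, a, b))
```

The larger a constant c_k, the more that feature contributes to the perceived dissimilarity of the two samples.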
Once the “distances” between the animals have been calculated using
equation (3.1), we lay the animals out on a piece of paper, so that those that
share similar characteristics, as measured by the distance between them, are
close together on the map, while those whose characteristics are very differ-
ent are far apart. A typical result is shown in Figure 3.3. What we have done
in this exercise is to squash down the many-dimensional vectors that repre-
sent the different features of the animals into two dimensions.
This two-fold procedure—calculate similarity among the samples, then
spread the samples across a plane in a way that reflects that similarity—is
exactly the task for which the SOM is designed.
Figure 3.3
A map that organizes some living creatures by their degree of similarity.
Figure 3.4
The layout of nodes in a one-dimensional SOM.
* The nodes in a SOM are sometimes referred to as neurons, to emphasize that the SOM is a
type of neural network.
Not only are the lengths of the pattern and weights vectors identical, the
individual entries in them share the same interpretation.
Figure 3.5
The layout of nodes in a two-dimensional SOM.
Figure 3.6
Node weights in a two-dimensional SOM. Each node has its own independent set of weights.
Figure 3.7
A SOM that organizes scientists according to their physical characteristics.
3.6 Learning
The aim of the SOM is to categorize patterns. In order to do this successfully,
it must first engage in a period of learning. The class to which a sample pat-
tern within the database belongs is unknown, thus it is not possible to offer
the algorithm any help in judging the significance of the pattern. Learning is
therefore unsupervised and the SOM must find out about the characteristics
of the sample patterns without guidance from the user. That the SOM can
learn to organize data in a meaningful way when it has no idea what the data
mean is one of the more intriguing aspects of the self-organizing map.
Training of a SOM is an iterative, straightforward process; its steps are
described in the sections that follow.
Figure 3.8
A 10 × 10 SOM trained on values taken at random from a 1 × 1 square; (a) at the start of training,
(b) when training is complete.
d_{pq} = \sum_{j=1}^{n} (w_{pj} - x_{qj})^2    (3.2)
Both the sample vector and the node vector contain n entries; xqj is the
j-th entry in the pattern vector for sample q, while wpj is the j-th entry in the
weights vector at node p. This comparison of pattern and node weights is
made for each node in turn across the entire map.
If, for example, the input pattern and the weights vectors at four nodes were

Input = (7, 1, 2, 9), w_1 = (6, 4, -3, 9), w_2 = (-5, 0, 0, 7), w_3 = (2, 7, -4, -1), w_4 = (7, -9, 2, 9),

the squared Euclidean distances between the input pattern and the
weights vectors would be

d_1^2 = 35, d_2^2 = 153, d_3^2 = 197, d_4^2 = 100,

so the first node provides the closest match.
* The distance between the node weights and the input vector calculated in this way is also
known as the node’s activation level. This terminology is widespread, but counterintuitive,
since a high activation level sounds desirable and might suggest a good match, while the
reverse is actually the case.
Figure 3.9
The input pattern is compared with the weights vector at every node to determine which set of
node weights it most strongly resembles. In this example, the height, hair length, and waistline of
each sample pattern will be compared with the equivalent entries in each node weight vector.
Since the node weights are initially seeded with random values, at the start
of training no node is likely to be much like the input pattern. Although the
match between pattern and weights vectors will be poor at this stage, deter-
mination of the winning node is simply a competition among nodes and the
absolute quality of the match is unimportant.
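The competition is simply an argmin over the squared distances of equation (3.2). Applied to the vectors of the example above:

```python
def squared_distance(w, x):
    # Squared Euclidean distance between a weights vector and an input.
    return sum((wj - xj) ** 2 for wj, xj in zip(w, x))

def winning_node(nodes, x):
    # The winner is the node whose weights lie closest to the input;
    # the absolute quality of the match does not matter.
    return min(range(len(nodes)), key=lambda p: squared_distance(nodes[p], x))
```

For the input (7, 1, 2, 9), the four distances come out as 35, 153, 197, and 100, so the first node wins.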
Suppose the node whose weights vector is (6, 4, -3, 9) provided the closest
match to the input pattern of (7, 1, 2, 9). If the learning rate was 0.05, the
node weights after updating would be

6 + 0.05 × 1 = 6.05
4 + 0.05 × (−3) = 3.85
−3 + 0.05 × 5 = −2.75
9 + 0.05 × 0 = 9
As the adjustments at the winning node move each of its weights slightly
toward the corresponding element of the sample vector, the node learns
a little about this sample pattern and, thus, is more likely to again be the
winning node when this pattern is presented to the map at some later stage
during training.
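The update moves each weight of the winning node a fraction of the way toward the corresponding input entry, w_new = w + η(x − w). Reproducing the numbers above:

```python
def update_node(weights, x, eta):
    # Each weight steps a fraction eta of the way toward the input:
    # w_new = w + eta * (x - w).
    return [w + eta * (xi - w) for w, xi in zip(weights, x)]
```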
Initially, the node weights are far from their optimum values. To bring them
rapidly into the right general range, the weights are at first changed rapidly
through the use of a large learning rate. As training progresses, the weights
become a better match to sample data, so the changes can be made smaller
and the algorithm settles into a fine-tuning mode. The learning rate is, there-
fore, a diminishing function of the number of cycles that have passed.
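Any smoothly diminishing function will do; an exponential decay is one common choice (the constants below are illustrative, not prescribed by the book):

```python
import math

def learning_rate(cycle, eta0=0.5, tau=500.0):
    # Large adjustments early in training, fine tuning later.
    return eta0 * math.exp(-cycle / tau)
```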
Figure 3.10
The neighborhood around a winning node (shown shaded).
by the user, is large, possibly large enough to cover the entire lattice. As train-
ing proceeds, the neighborhood shrinks and adjustments to the node weights
become more localized (Figure 3.11). This encourages different regions of the
map to develop independently and become specialists in recognizing certain
types of sample patterns. After many cycles, the neighborhood shrinks until
it covers only the nodes immediately around the winning node, or possibly
just the winning node itself.
As the weights at every node in the neighborhood are adjusted, it will be
apparent that if, firstly, the neighborhood is very large and, secondly, the
weights at every node are adjusted by the same amount, the weights vectors
at every node across the entire lattice will eventually become very similar, if
not identical. A SOM in this state would be no more informative than one in
which all weights were completely uncorrelated. Therefore, to prevent this
happening, the weights are changed by an amount that depends on how
Figure 3.11
The size of the neighborhood around the winning node decreases as training progresses.
close a node in the neighborhood is to the winning node. The greatest change
to the weights is made at the winning node; the weights at all other nodes in
the neighborhood are adjusted by an amount that depends on, and generally
diminishes with, their distance from that node.
In equation (3.4), f(d) is a function that describes how the size of the adjust-
ment to the weights depends on the distance that a particular node in the
neighborhood is from the winning node. This function might be the recip-
rocal of the distance between the winning node and a neighborhood node,
measured across the lattice, or any other appropriate function that ensures
that nearby nodes are treated differently from those that are far away. We
shall consider several possible forms for this function in section 3.7.3.
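As a minimal illustration of one such choice, f(d) could be the reciprocal of the lattice distance, with f(0) taken as 1 so that the winning node itself receives the full adjustment (an assumption made here to avoid division by zero):

```python
def f_reciprocal(d):
    """Scale factor for the weight adjustment at a node a lattice
    distance d from the winning node: full strength at d = 0,
    falling off as 1/d elsewhere."""
    return 1.0 if d == 0 else 1.0 / d
```

Nearby nodes then receive much larger adjustments than distant ones, as required.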
Once the winning node has been identified, adjustments to the weights rip-
ple out across the map from it. Because the adjustments are less pronounced
far from the winning node, this process leads to the development of similar
weights among nodes that are close to each other on the map, which will be
a crucial step on the road to meaningful clustering. The gradual reduction of
the neighborhood size as training proceeds helps to preserve this knowledge,
which would otherwise continually be degraded as the weights of nodes in
one part of the map were repeatedly pushed away from the values on which
they would like to settle by updates forced upon them from winning nodes
in other regions of the map.
3.6.7 Repeat
The winning node has been found, its weights have been updated, as have
those of the nodes in its neighborhood. This completes the updating of the
network that follows the presentation of one input pattern. A second pat-
tern is now selected from the sample database and the process is repeated.
Once all patterns have been shown to the SOM, a single cycle has passed;
the order of samples in the database is randomized and the process begins
again. Many cycles will be required before the weights at all the nodes settle
down to stable values or until the network meets some performance crite-
rion. Once this point is reached, the winning node for any particular sample
pattern should be in the same region of the map in every cycle; we say that
the sample “points to” a region of the map or to a particular node.
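The whole cycle described above — find the winner, update it and its neighborhood, reshuffle the database, repeat — can be sketched for a one-dimensional map of angles like that of Figure 3.12. The map size, decay schedules, and Gaussian neighborhood used here are illustrative assumptions, not values from the text:

```python
import math
import random

def train_1d_som(samples, n_nodes=15, cycles=100, eta0=0.5, r0=8.0, seed=0):
    """Train a one-dimensional SOM whose nodes each hold a single weight
    (an angle in degrees): present every pattern, update the winner and
    its neighborhood, reshuffle, and repeat for many cycles."""
    rng = random.Random(seed)
    weights = [rng.uniform(0.0, 360.0) for _ in range(n_nodes)]
    data = list(samples)
    for cycle in range(cycles):
        eta = eta0 * math.exp(-cycle / cycles)                   # diminishing learning rate
        radius = max(1.0, r0 * math.exp(-3.0 * cycle / cycles))  # shrinking neighborhood
        rng.shuffle(data)                                        # randomize sample order
        for x in data:
            # winning node: the weight closest to the sample pattern
            win = min(range(n_nodes), key=lambda i: abs(weights[i] - x))
            for i in range(n_nodes):
                d = abs(i - win)                                 # distance across the lattice
                if d <= radius:
                    h = math.exp(-d * d / (2.0 * radius * radius))
                    weights[i] += eta * h * (x - weights[i])
    return weights

rng = random.Random(1)
angles = [rng.uniform(0.0, 360.0) for _ in range(50)]
trained = train_1d_som(angles)
```

After training, nearby nodes should hold similar angles, while distant nodes should differ, which is the behavior illustrated in Figure 3.12.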
Training brings about two sorts of changes to the set of weights. At the
level of the individual node, the vector of weights gradually adjusts so that
each vector begins to resemble a blend of a selection of sample patterns. On
a broader level, the vectors at nodes that are close together on the map start
to resemble one another, and this development of correlation between nodes
that are close together is illustrated in the one-dimensional SOM whose
training is depicted in Figure 3.12.
68 Using Artificial Intelligence in Chemistry and Biology: A Practical Guide
(a)
(b)
(c)
(d)
Figure 3.12
Training a one-dimensional SOM using a set of random angles. The starting configuration is
shown at the top.
Although it is clear that from a random starting point the final weights
have become highly ordered, you may wonder why it is that, if the input
data contain values that cover the range from 1 to 360, this range is not fully
reflected in the final map, in which no arrows are pointing directly down. We
might have expected that at the end of training the weights at neighboring
nodes would differ by approximately 24°, so that the whole range of possible
values of angles was covered, since the weights at fifteen nodes can be cho-
sen (15 × 24 = 360°). The reason that has not happened is that in this run the
neighborhood around each node extends across the complete set of nodes, so
even at the end of the run the weight of one node is influenced by that of oth-
ers around it. The final arrow at the right-hand end of the map, for example,
might “want” to become vertical so as to be able to properly represent an
input pattern of 360°, but is constrained to be a bit like its left-hand neighbor
because each time that neighbor is the winning node, some adjustment is
made to the weight at the rightmost node also. The weight at the neighbor is
in turn affected to some extent by the weight at its left-hand neighbor and so
on. If we reduce the size of the neighborhood with time so that the neighbor-
hood is eventually diminished to a size of two (which includes the winning
node and a single neighbor), we see a greater range of weights, as expected,
though the full range of angles is still not covered (Figure 3.13).
Figure 3.13
The result of training a one-dimensional SOM using a set of random angles. The final neigh-
borhood includes only the winning node and one neighbor.
Figure 3.15 shows the result of running the same dataset through a SOM of
the same geometry, but starting from a different set of random node weights.
The two maps are strikingly different, even though they have been created
from the same set of data. The difference is so marked that it might seem to
throw into doubt the value of the SOM. How can we trust the algorithm if
successive runs on identical data give such different output?
This question of reproducibility is an important one; we can understand
why the lack of reproducibility is not the problem it might seem to be by
considering how the SOM is used as a diagnostic tool. A sample pattern
that the map has not seen before is fed in and the winning node (the node
to which that sample “points”) is determined. By comparing the properties
of the unknown sample with patterns in the database that point to the same
winning node or to one of the nodes nearby, we can learn the type of samples
in the database that the unknown sample most strongly resembles.
Figure 3.14
The result of training a two-dimensional SOM with a set of angles in the range 1 to 360°.
Figure 3.15
The result of training a two-dimensional SOM with a set of angles in the range 1 to 360°. The
data used for Figure 3.14 and Figure 3.15 are identical. The same geometry was used on each
occasion, but with two different sets of random initial weights.
Let us do this using the map shown in Figure 3.14. We feed in some value that
was not contained in the original sample dataset, say 79.1°, and find the win-
ning node, in other words, the node whose weight is closest to the value 79.1°.
The weight at the winning node for the sample pattern 79.1° will have a
value that is close to 79°. The figure reveals that this node is surrounded
by other nodes whose weights correspond to similar angles, so the pattern
“79.1” will point to a region on the map that we might characterize as being
“angles around 80°.” If 79.1° were fed into the second map, Figure 3.15, the
position of the node on the map to which it would point, as defined by the
node’s Cartesian coordinates, would be different from the position of the
winning node in Figure 3.14, but the area in which that winning node is
situated could still be described by the phrase “angles around 80°,” thus the
pattern is still correctly classified.
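Using a finished map as a diagnostic tool then amounts to a nearest-weight lookup. The map below is a small hypothetical set of trained weights, not the map of Figure 3.14:

```python
def winning_node(weights, sample):
    """Return the index of the node whose weight is closest to the sample."""
    return min(range(len(weights)), key=lambda i: abs(weights[i] - sample))

# a hypothetical trained one-dimensional map of angle weights
trained = [12.0, 45.5, 79.0, 81.3, 150.2, 210.7, 300.1]
bmu = winning_node(trained, 79.1)  # the node whose weight is closest to 79.1
```

The sample "points to" the node holding 79.0°, so it would be classified with the other "angles around 80°" patterns, regardless of where that node happens to sit on the map.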
We conclude that the absolute position of the node on the map to which the
sample pattern points is not important; neither of the maps in Figure 3.14 and
Figure 3.15 is better than the other. It is the way that samples are clustered
on the map that is significant. It is, in fact, common to discover when using a
SOM that there are several essentially equivalent, but visually very different,
clusterings that can be generated.
A comparable situation may arise in the training of an ANN. If an ANN is
trained twice using the same dataset, starting from two different sets of ran-
dom connection weights, the sets of weights on which the algorithm eventu-
ally converges will not necessarily be identical in the two networks, although
if training has been successful, the output from both networks in response
to a given input pattern should be very similar. Provided that the resulting
networks function reliably and accurately, this unpredictability in the way
that the ANN and the SOM train does not cast into doubt their value.
Figure 3.16
Hexagonal and triangular alternatives to a square lattice for the layout of nodes.
to separate classes. If the number of nodes in the map exceeds the number
of sample patterns, all sample patterns will have their own node by the end
of training. Because the neighborhood shrinks as the cycles pass, in the later
stages of training there will be little interaction between the winning node
and any neighbors other than those that are very close. The node weights
will then adjust independently and the forces that lead to the partial homog-
enization of weights in neighboring regions of the map will die away.
This does not mean that no clustering occurs if the number of nodes is
large compared to the number of classes. Later in this chapter we shall see
an example in which the number of nodes has been set to a large value as an
aid in visualization. Some broad-brush clustering is formed when the neigh-
borhood is large, but if the map contains too many nodes, clustering may
be less well defined than would be the case in a map of smaller dimension.
There is, therefore, for any given dataset some optimum size of map (or more
realistically, a range of sizes) in which enough room is available on the map
that very different patterns can move well away from each other, but there
is sufficient pressure to place some patterns close together so that clustering
will still be clearly defined. This optimum size depends on the number of
classes that exist within the data.
3.7.2 Neighborhood
The adjustment of weights is spread across a neighborhood. The idea of a
neighborhood may seem simple enough, but it is only fully defined once
we have specified two features: its extent and how the updating of the node
weights should depend on the location of the node within it. The size of
the neighborhood is determined by choosing some cut-off distance beyond
which nodes lie outside the neighborhood, whatever the geometry of the lat-
tice. This distance will diminish as training progresses.
w(t) = (1/(σ√2π)) × e^(−t²/2σ²)    (3.5)
where t = x – μ.
The Gaussian function does not show the abrupt change in value at the
edge of the neighborhood that the linear function does because there is no
longer any edge other than the boundary of the map; instead the adjustment
Figure 3.17
A linear neighborhood function; x denotes the number of nodes to the right or left of the win-
ning node.
Figure 3.18
A Gaussian neighborhood function.
can extend arbitrarily far from the winning node (Figure 3.18). The width of
the function can be adjusted to make the weight changes more or less strongly
focused around the winning node. To increase computational speed, this
function is applied to only those nodes that fall within a neighborhood of
limited size, even though the function itself extends to infinity.
Functions that determine how large the updates to the weights should be,
such as the Gaussian, fall off with distance from the winning node. An almost equivalent
procedure to lessening the size of the neighborhood as the cycles pass is to use
a neighborhood function that becomes more localized as the algorithm runs so
that the weights at distant nodes are only slightly perturbed later in training.
In a small map, the Gaussian function may lead to weight changes in the
neighborhood being larger than needed. To narrow down the region in
which large weight changes are made, back-to-back exponentials may be
used (Figure 3.19).
Each of the functions shown in Figure 3.17 to Figure 3.19 adjusts the
weights at every node within the neighborhood in a way that increases the
similarity between the weights vector and the sample pattern. However, in
the completed SOM the weights at nodes in regions of the map that are far
apart should be very different and this suggests that the weights of nodes
that are distant from the winning node, yet still in the neighborhood, should
perhaps be adjusted in the opposite direction to the adjustment that is applied
to the winning node and its close neighbors.
In the Mexican Hat function (Figure 3.20), the weights of the winning
node and its close neighbors are adjusted to increase their resemblance to
the sample pattern (an excitatory effect), but the weights of nodes that are
Figure 3.19
Back-to-back exponentials used as a neighborhood function.
Figure 3.20
The Mexican hat neighborhood function.
farther away are adjusted in the opposite direction, making them less like
the sample pattern (an inhibitory effect). This long-distance inhibition tends
to emphasize differences in weights across the network.
The Mexican Hat is the negative of the second derivative of a Gaussian function:

Ψ(t) = (1/(σ³√2π)) × (1 − t²/σ²) × e^(−t²/2σ²)    (Mexican hat) (3.6)
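The neighborhood functions of this section can be written down directly; the Gaussian and Mexican hat follow equations (3.5) and (3.6), while the exact form of the back-to-back exponentials is not given in the text and is assumed here to be a two-sided decaying exponential. In practice each would be applied only to nodes within the cut-off neighborhood:

```python
import math

def gaussian(t, sigma=1.0):
    """Equation (3.5): a Gaussian neighborhood function."""
    return (1.0 / (sigma * math.sqrt(2.0 * math.pi))) \
        * math.exp(-t * t / (2.0 * sigma * sigma))

def back_to_back_exponential(t, sigma=1.0):
    """Back-to-back exponentials (Figure 3.19); exact form assumed."""
    return 0.5 * math.exp(-abs(t) / sigma) / sigma

def mexican_hat(t, sigma=1.0):
    """Equation (3.6): excitatory near t = 0, inhibitory farther out."""
    return (1.0 / (sigma ** 3 * math.sqrt(2.0 * math.pi))) \
        * (1.0 - t * t / sigma ** 2) \
        * math.exp(-t * t / (2.0 * sigma * sigma))
```

Note that the Mexican hat changes sign at t = ±σ, which is what produces the long-distance inhibition described above.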
(a)
(b)
Figure 3.21
The node weights in a square SOM after (a) generation 27, (b) generation 267, (c) generation 414,
(d) generation 2,027, and (e) generation 100,000.
(c)
(d)
Figure 3.21 (continued)
(e)
Figure 3.21 (continued)
Figure 3.22
Points in a one-dimensional SOM attempting to cover the space defined by a 2 × 1 rectangle.
If the input data are not spread evenly across the x/y plane, but are concen-
trated in particular regions, the SOM will try to reproduce the shape that is
mapped by the input data (Figure 3.23), though the requirement that a rect-
angular lattice of nodes be used to mimic a possibly nonrectangular shape
may leave some nodes stranded in the “interior” of the object.
Figure 3.23
A familiar shape from the laboratory, learned by a rectangular SOM. Sample patterns are
drawn from the outline of an Erlenmeyer flask.
3.8.2 Visualization
At the start of this chapter, we noted that the SOM has two roles: to cluster
data and to display the result of that clustering. The finished map should
divide into a series of regions, each corresponding to a different class of
sample pattern.
Even when the clustering brought about by the SOM has been technically
successful, so that samples in different classes are allocated to different areas
of the map, there is no guarantee that an informative, readily interpretable
display will be produced; there may still be work to be done. The reason is
that, if each node stores many weights, not all of those weights can be plot-
ted simultaneously on a two-dimensional map, thus decisions must be made
about how the weights can best be used to visualize the data. If the underly-
ing data are of high dimensionality, some experimentation may be required
to get the most helpful map.
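For example, when each node stores at least three weights, the first three can be interpreted as RGB values — the approach taken later for Figure 3.25a. A minimal sketch of that mapping, assuming the weights lie in a known range:

```python
def weights_to_rgb(node_weights, lo=0.0, hi=1.0):
    """Map the first three weights at a node onto 0-255 RGB components,
    clipping values that fall outside the assumed range [lo, hi]."""
    rgb = []
    for w in node_weights[:3]:
        frac = (w - lo) / (hi - lo)
        frac = min(1.0, max(0.0, frac))  # clip to [0, 1]
        rgb.append(int(round(255 * frac)))
    return tuple(rgb)
```

Applying this to every node yields a color image of the map in which regions of similar weights appear as patches of similar color.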
When the weights are one-dimensional, as in the angles data (see Fig-
ure 3.12), a display that shows the node weight as an arrow is effective. If
the data are two-dimensional, interpreting the two values at each node as
either a vector with magnitude and direction and so providing a display
Figure 3.24
The node weights for the Erlenmeyer flask, interpreted as a force field.
(a)
(b)
Figure 3.25
A SOM trained on some trigonometric data. In Figure 3.25a, the SOM is shown with the first
three weights interpreted as RGB values. Figure 3.25b shows the same map with the color of
each point determined by the difference in weights between one node and those in its immedi-
ate neighborhood.
nodes represent the structure of the map at an early stage of training, not when
the map has converged, so while some parts of the map are fully representa-
tive of the dataset, others are not. This does not mean that the map cannot be
used, but a run with a much smaller map size will reveal the extent to which
(a)
(b)
Figure 3.26
The same SOM as in Figure 3.25, but at a later stage of development.
the structure seen in these figures reflects real differences in the data. It will
be discovered that a map of far smaller size is, in fact, optimum.
Figure 3.27
A SOM trained on the same dataset as the SOM shown in Figure 3.25 and Figure 3.26, but start-
ing from a different random set of initial weights.
Figure 3.28
A (one-dimensional) ring of nodes.
Figure 3.29
A torus, which is the result of linking opposite sides of a two-dimensional rectangular SOM.
Figure 3.30
A tangled map.
hand side moderately well, it is clear that it is unable to do so toward the left.
Similar problems can arise with three- and higher-dimensional data and are
more difficult to detect.
3.10 Applications
How might we use a SOM in a scientific context? Suppose that we prepared a
map in which the molecular descriptors of many compounds that have been
used as herbal remedies provided the input patterns. The map would group
these materials according to the values of the molecular descriptors. If some
correlation existed between molecular structure and therapeutic activity, we
might hope to find that remedies that were active against migraine, for exam-
ple, were clustered together in some region of the map. The clustering might
then provide information about what molecular features we might expect in
a new drug that could be effective in treatment of the same condition.
Scientists, of course, do not create a map out of curiosity; the SOM must
have a purpose. The map created by a SOM may be used in several ways,
which usually lie within one of four broad categories:
A typical use of the SOM is in the classification of crude oil samples, men-
tioned at the start of this chapter. If crude oils were grouped by composition,
we might find that oils cluster together because they have a similar composi-
tion having been derived from wells in the same field. More interestingly,
we might find that clustered oils in some cases come from quite different
fields, but from similar geological formations. The most powerful way to
characterize the very complex mixtures that are crude oils is to use mass
spectrometry. This provides a detailed fingerprint for each sample, but the
data are complex. Fonseca et al. have used GC-MS (gas chromatography-
mass spectrometry) data as input into a SOM to characterize the geographic
origin of oils, with a success rate of about two-thirds of their samples on the
basis of a single SOM.1 Recognizing that an individual SOM may not be fully
optimized, they also investigated the use of ensembles of maps and were
able to lift the identification rate to about 90 percent using this technique.
In an additional step, they investigated whether a SOM would still be able
to identify the geographic origin of the samples after a period of simulated
weathering, obtaining encouraging results.
There have also been studies of the use of SOMs in tasks that are well
beyond the capabilities of most other methods. Kohonen and his group have
used SOMs to investigate the self-organization of very large document col-
lections on the basis of the presence of words or short phrases.2 The signifi-
cance of work such as this is more in the general principles that underlie it
rather than in the specific application. As the volume of scientific information
available through the Internet grows, sophisticated tools are needed that can
unearth publications that may be relevant to a particular, specialized area
of research. Because of the volume of scientific research, few scientists now
have the time to finger through Chemical Abstracts or similar publications
(“finger through” is, in fact, often no longer an option because of the move
to electronic publishing). Thus, tools that are able to combine a broad survey
of many millions of documents with a focused output of results are increas-
ingly valuable.
It has been noted elsewhere in this book that increasing computing power
has led to the linking of AI methods with other computational tools, either
a second AI method to form a hyphenated tool or a simulation to generate
data, which can then be analyzed via an AI tool. Recent work on the confor-
mational analysis of lipids by Murtola, Kupiainen, Falck, and Vattulainen3 falls
in the latter category. Lipid bilayers are of central interest in biochemistry
and other areas of science because of the broad range of processes in which
they are involved. The structural properties of lipids are crucial in determin-
ing the behavior of biological membranes and are an important consider-
ation in the rapidly growing field of biosensors.
[Panels (a), (b), and (c) plot the B, F, and G dye concentrations, respectively, each on a scale from 0.00 to 0.25.]
Figure 3.31
Three layers of a SOM trained on ternary mixtures of fluorescent dyes. Each layer corresponds
to the concentration of a different dye. (Leung, A. and Cartwright, H. M., unpublished work.)
3.12 Problems
1. The Periodic Table
The Periodic Table forms one of the most remarkable, concise, and
valuable tabulations of data in science. Its power lies in the regulari-
ties that it reveals, thus, in some respects, it has the same role as the
SOM. Construct a SOM in which the input consists of a few proper-
ties of some elements, such as electronegativity, atomic mass, atomic
radius, and electron affinity. Does the completed map show the kind
of clustering of elements that you would expect? What is the effect of
varying the weight given to the different molecular properties that
you are using?
2. Colorful arrows
Not every value in a sample pattern needs to be given equal impor-
tance. For example, we might feel that the waistline of scientists was
a more important factor in sorting them than their hairstyle. Write a
SOM that sorts out angles, as in Figure 3.12, but this time each sample
point should have both a value for the angle and a randomly cho-
sen RGB color. Investigate the effect on the degree of ordering on a
large map (say, 30 × 30) of varying the weight attached to each of
these two characteristics from 0 to 100 percent. What happens in a
large map if the weights of red, green, and blue are allowed to vary
independently?
3. Sorting whole numbers
Interesting effects may be obtained by using a SOM to sort integers.
Create a SOM in which each input pattern equals the factors of an
integer between 1 and 100. A simple way to do this would be to create
a list of all the prime factors below 100 and prepare a binary vector
that indicates whether the number is a factor. Alternatively, a binary
input could be chosen that indicates which integers are divisors of
the chosen number. For example, the divisors of 24 are 1, 2, 3, 4, 6, 8,
and 12, so we could specify its divisors as the vector:
{1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, …}
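The divisor encoding can be generated as follows (the vector length is an arbitrary choice):

```python
def divisor_vector(n, length=50):
    """Binary input vector whose i-th entry (1-indexed) is 1 when i divides n."""
    return [1 if n % i == 0 else 0 for i in range(1, length + 1)]
```

For n = 24 the first thirteen entries reproduce the vector shown above.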
You will need to consider how best to display the SOM weights.
References
1. Fonseca, A.M., et al., Geographical classification of crude oil by Kohonen self-
organizing maps, Analytica Chimica Acta, 556, 374, 2006.
2. Kohonen, T., Spotting relevant information in extremely large document collec-
tions, Computational Intelligence: Theory and Applications, Lecture Notes in Com-
puter Science, (LNCS), Springer, Berlin, 1625, 59, 1999.
3. Murtola, T., et al., Conformational analysis of lipid molecules by self-organiz-
ing maps, J. Chem. Phys., 126, 054707, 2007.
4. Leung, A. and Cartwright, H.M., unpublished work.
5. Wang, H.Y., Azuaje, F., and Black, N., Interactive GSOM-based approaches
for improving biomedical pattern discovery and visualization, Proceedings
Computer and Information Science, Lecture Notes in Computer Science (LNCS),
Springer, Berlin, 3314, 556, 2004.
6. Zupan, J. and Gasteiger, J., Neural Networks for Chemists: An Introduction, VCH,
Weinheim, 1993.
4
Growing Cell Structures
Table 4.1
Number of Pattern–Weight Comparisons Required in
Training a SOM, as a Function of the Size of the Problem
get even worse because the more samples there are in the database and the
greater their dimensionality, the more features probably exist that distinguish
one sample from another. As the number of features increases, the number of
cycles needed for the SOM to converge will increase in proportion, so it might
be impossible to run the algorithm to completion, except on a parallel proces-
sor machine.
A second difficulty that arises in the use of a SOM relates to the number
of classes into which it can reliably divide a dataset. This is closely linked
with the size of the map. A small map will be able to separate data into only
a limited number of classes; on the other hand, a larger map, though able to
separate more classes, is more expensive to train. It is important to choose a
suitable number of nodes to form the map if the correct degree of clustering
is to be revealed, but the number of classes in the database is almost certainly
unknown at the start of a run; thus it is possible only to guess at what might
be an appropriate map size. A computationally expensive period of experi-
mentation may be required, during which maps of different sizes are created
and tested in order to find one that shows the desired degree of clustering.
Although this process is a reasonable way to determine the size of a map that
best reveals clustering, it is time-consuming and is still dependent on a mea-
sure of guesswork in determining what level of clustering is appropriate.
A method exists that largely overcomes the problems of computational
expense and uncertainty in the size of the map. This is the growing cell
structure algorithm, which we explore in this chapter.
4.1 Introduction
Constructive clustering is an ingenious alternative to the creation of many
SOMs of different size to find out which is the most effective in separating
classes. Several constructive clustering algorithms exist. They are all related
to the SOM, but they have the crucial difference from it that the size of the
map is not specified in advance; instead, the geometry is adjustable and
evolves as the algorithm learns about the data, eventually settling down to
a dimension that, one hopes, best suits the complexity of the data that are
being analyzed.
Initially a map of minimal size is prepared that consists of as few as three
or four nodes. Since the map at this stage is so small, it is very quick to train.
As training continues and examples of different classes are discovered in the
database, the map spreads itself out by inserting new nodes to provide the
extra flexibility that will be needed to accommodate these classes. The map
continues to expand until it reaches a size that offers an acceptable degree of
separation of samples among the different classes. As in a SOM, on the fin-
ished map, input patterns that are similar to one another should be mapped
onto topologically close nodes, and nodes that are close in the map will have
evolved similar weights vectors.
Methods in which the geometry of the map adjusts as the algorithm runs
are known as growing cell algorithms. Several growing cell methods exist; they
differ in the constraints imposed on the geometry of the map and the mecha-
nism that is used to evolve it.
In the growing grid method, the starting point is a square 2 × 2 grid of
units. During a preliminary growth phase, the map expands by the periodic
addition of complete rows or columns of units, thus retaining rectangular
geometry (Figure 4.1).
When a new row or column is inserted into the grid, all statistical informa-
tion previously gathered about winning units is thrown away before learning
resumes. This simplifies the algorithm, but increases computation time. As
the grid expands, the neighborhood around each cell also grows; therefore,
unlike a SOM, the number of units whose weights are adapted on presenta-
tion of each sample pattern increases as the cycles pass. Once a previously
selected grid size has been reached, or some performance criterion has been
met, growth of the network is halted and the algorithm enters a fine-tuning
mode, during which further adjustments are made to the node weights to
optimize the network’s interpretation of the dataset.
A growing neural gas has an irregular structure. A running total is main-
tained of the local error at each unit, which is calculated as the absolute dif-
ference between the sample pattern and the unit weights when the unit wins
the competition to match a sample pattern. Periodically, a new unit is added
close to the one that has accumulated the greatest error, and the neighbors of
this unit share some of their error with it. The aim is to generate a network
in which the errors at all units are approximately equal.
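The periodic insertion step might be sketched as follows. Placing the new unit midway between the worst unit and its worst-error neighbor, and halving their accumulated errors, are illustrative assumptions rather than prescriptions from the text:

```python
def insert_near_max_error(units, alpha=0.5):
    """units: dict id -> {'w': weight list, 'err': float, 'nbrs': set of ids}.
    Insert a new unit between the unit with the largest accumulated error
    and its worst-error neighbor, sharing error with the new unit."""
    q = max(units, key=lambda u: units[u]['err'])              # largest error
    f = max(units[q]['nbrs'], key=lambda u: units[u]['err'])   # worst neighbor
    new_id = max(units) + 1
    w_new = [(a + b) / 2.0 for a, b in zip(units[q]['w'], units[f]['w'])]
    units[q]['err'] *= alpha                                   # reduce errors...
    units[f]['err'] *= alpha
    units[new_id] = {'w': w_new, 'err': units[q]['err'], 'nbrs': {q, f}}
    # rewire so that the new unit sits between q and f
    units[q]['nbrs'].discard(f)
    units[q]['nbrs'].add(new_id)
    units[f]['nbrs'].discard(q)
    units[f]['nbrs'].add(new_id)
    return new_id
```

Repeated insertions of this kind drive the network toward a state in which error is spread evenly across the units.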
Similar aims underlie the growing cell structure (GCS) approach, which
relies on the use of lines, triangles, or, in more general terms, "k-dimensional
hypertetrahedrons" (which, as we shall see, are far easier to use than the
name suggests).
Figure 4.1
Expansion of a growing grid.
Figure 4.2
The building blocks of one-, two-, and three-dimensional growing cell structures.
Figure 4.3
The starting configuration for a two-dimensional GCS. Each unit stores a vector of weights.
The difference between the signal counter and the local error is that the first measures how frequently a unit is chosen
from among all the units in the map to represent a sample pattern, while the
second measures the quality of the match between unit weights and sample
pattern when a unit has won the competition to represent a sample. In both
cases, as the network is trained, a record is kept of how successful each part
of it has been. When the time comes to expand the network, a new unit is
inserted into the network at the location where it can be most useful.
when it does win. Every time a unit wins the competition to represent a
sample pattern, the squared distance between the unit weights and the input
pattern is added to a local error variable Ea at that unit.
Ea = Ea + Σj (xjk − wja)²    (4.1)
In equation (4.1), xjk is the j-th data point in sample pattern k and wja is the j-th
weight at network unit a. After many sample patterns have been fed through
a GCS that is being trained, some units will have accumulated a much larger
local error than others; this might be for either of the following reasons:
1. The unit wins the competition to be the BMU only rarely and, when
it does win, the match between its weights and the sample pattern
is poor. Although the unit rarely is the BMU, the poor match when
it is chosen means that its local error grows to a large value. If the
match at this unit is the best for a particular pattern, even though the
match is poor, it is evident that no unit in the network is capable of
representing the pattern adequately.
The database cannot contain many samples that are similar to this
particular pattern, for if it did the unit would match many patterns
and, in due course, the weights at the BMU would match them well;
therefore, the poor match indicates that those samples for which this
is the winning unit must be very different from each other and also
few in number. As no unit in the network represents the samples
well, the error in this region of the map is high and at least one new
unit is required to represent a part of the variability in the samples.
Alternatively:
2. The unit is the BMU for a large number of samples and is a good
match to nearly all of them. It has accumulated a large error because
it is so often the winning unit; the small errors gradually add up.
Since this unit wins frequently, the network is failing to differentiate
between many of the patterns in the database, so a new unit should
be added near the unit that has the greatest error to allow the net-
work to better distinguish among these samples.
If alternative 2 applies, the same unit might be selected by the local error
measure for insertion of a new unit as would be picked by the signal counter
because, in both cases, the unit is frequently chosen as BMU. Alternative 1,
however, picks out units that have a low signal counter rather than a high
one. It follows that the course of evolution of a GCS will depend on the type
of local measure of success that is used.
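In code, the accumulation of equation (4.1) at the winning unit might look like the following minimal Python sketch (the function and variable names are my own, and the unit weights are stored as plain lists of floats):

```python
def accumulate_local_error(units, errors, x):
    """Find the best-matching unit (BMU) for sample pattern x and add the
    squared distance to that unit's local error, as in equation (4.1).

    units  : list of weight vectors, one per network unit
    errors : list of local error accumulators Ea, one per unit
    x      : the sample pattern xk
    """
    # squared Euclidean distance from x to every unit's weight vector
    sq_dists = [sum((wj - xj) ** 2 for wj, xj in zip(w, x)) for w in units]
    bmu = min(range(len(units)), key=sq_dists.__getitem__)
    errors[bmu] += sq_dists[bmu]          # Ea = Ea + sum_j (xjk - wja)^2
    return bmu

units = [[0.0, 0.0], [1.0, 1.0]]
errors = [0.0, 0.0]
bmu = accumulate_local_error(units, errors, [0.9, 1.2])
```

Only the winning unit's accumulator changes; after many samples, comparing the accumulators picks out the region of the map that represents the data worst.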
If the local measure of success is the signal counter, then each time unit b
wins the competition its counter τb is incremented:
τb = τb + 1    (4.2)
If the local measure is the error, this is increased by adding to it the Euclid-
ean distance between the sample pattern and the weights at the unit.
4.3.4 Weights
The next step is to update the network weights. The weights at the winning
unit are updated by an amount Δwja, as in a standard SOM:
∆wja = εb (xjk − wja)    (4.3)
εb is a learning rate, which determines how rapidly the weights at the BMU
move toward the sample pattern.
The weights of neighboring units must also be updated, but the way that
this is done differs from the procedure used in a SOM. In that algorithm, all
nodes within some defined distance of the winning node are considered to
be neighbors, thus the updating may cover many nodes or even the whole
lattice. In addition, the size of the adjustment made to the weights depends
on the distance between the winning node and the neighbor. By contrast, in
the GCS, the lattice is irregular, so distance is not an appropriate way to iden-
tify neighbors and does not enter into the updating algorithm. The neighbor-
hood is instead defined to include only those units to which the winning
unit is directly connected. The weights of the neighbors are adjusted by a
smaller amount than are the weights of the winning unit, determined by the
parameter εn.
∆wjn = εn (xjk − wjn)    (4.4)
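Equations (4.3) and (4.4) together give the following update step, sketched here in Python (the function name, the dictionary of connections, and the particular values of εb and εn are my assumptions, not the book's):

```python
def update_weights(units, connections, bmu, x, eps_b=0.1, eps_n=0.01):
    """Move the BMU toward sample x at rate eps_b (eq. 4.3), and each unit
    directly connected to it at the smaller rate eps_n (eq. 4.4).

    In a GCS the neighbourhood is the set of directly connected units,
    not a distance-based region as in a SOM.
    """
    units[bmu] = [w + eps_b * (xj - w) for w, xj in zip(units[bmu], x)]
    for nb in connections[bmu]:       # direct connections only
        units[nb] = [w + eps_n * (xj - w) for w, xj in zip(units[nb], x)]

units = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
connections = {0: [1, 2], 1: [0], 2: [0]}
update_weights(units, connections, bmu=0, x=[1.0, 1.0])
```

Note that, unlike a SOM, no distance-dependent factor appears: every direct neighbour is moved by the same small amount.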
After each sample pattern has been processed, the signal counter τn at every
unit in the network is reduced slightly:
τn = τn × α    (4.5)
α is a real-valued number in the range 0 < α < 1.0 and is generally very close
to 1, so that the signal counter declines only slowly.
This slight, but regular, chipping away at the value of the counter serves
two purposes. Suppose that a unit won the competition to represent sample
patterns many times in the early period of training, but now wins the com-
petition only rarely. It is clear that its value to the network, though it might
once have been considerable, has fallen to a low level. Because this region
of the network is of limited value, it should not be chosen as an area for
the insertion of a new unit and the reduction in the signal counter helps to
ensure this.
The slow decay of the signal counter, unless it is boosted by fresh wins,
serves a second purpose. While a large counter indicates a suitable region
of the map for the insertion of a new unit, a very small value indicates the
opposite. The unit may be of so little value to the network that it is a candi-
date for deletion (section 4.5). Unlike the SOM, not only can units be added
in the GCS, they can also be removed, so the signal counter can be used
to identify redundant areas of the network where pruning of a unit may
enhance efficiency.
If the local error is used to measure success, it too is decreased by a small amount
after each sample pattern has been processed or each cycle is complete.
En = En (1 − β) (4.6)
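The two decay rules, equations (4.5) and (4.6), can be sketched together (the particular values of α and β below are illustrative assumptions; the book requires only that α be just below 1 and β be small):

```python
def decay_success_measures(counters, errors, alpha=0.995, beta=0.005):
    """After each sample pattern (or each cycle), let every unit's signal
    counter decay by the factor alpha (eq. 4.5) and every local error
    shrink by the factor (1 - beta) (eq. 4.6), so units that stop winning
    slowly lose their claim on new units."""
    for i in range(len(counters)):
        counters[i] *= alpha           # tau_n = tau_n * alpha
        errors[i] *= (1.0 - beta)      # En = En * (1 - beta)

counters = [10.0, 2.0]
errors = [4.0, 0.5]
decay_success_measures(counters, errors)
```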
4.3.6 Repeat
Once the weights at the winning node and its neighbors have been updated,
another pattern is selected at random from the database and the BMU once
again determined. The process continues for a large, predetermined number
of cycles.
1. To the unit um
2. To the neighbor
3. To any other nodes that already have connections to both um and to
the neighbor
The reason for selecting the unit that is most unlike um is that we require a
wide range of weights vectors across the map if it is to adequately represent
the full variety of sample patterns. By choosing to average the weights at the
two units that are least alike, we introduce the greatest possible degree of
variety into the map.
Finally, the local measures at um and all its neighbors are reduced, with
the new unit taking a share from every node to which it is attached. Several
recipes exist for doing this. Typically, we determine the similarity between
the weights at the new node and the weights of each unit to which it is joined
and then assign to it a share of the neighbors’ signal counters in proportion
to the degree of similarity. The share is inversely proportional to the number
of neighbors that are sharing, so that the total local success measure over all
nodes remains unchanged. Thus, were the weights at the new node to be
identical to each of its five new neighbors (an unlikely event), each neigh-
bor would give up one-sixth of its signal counter so that all six units subse-
quently had the same signal counter.
If the local error is used instead of the signal counter, it is common to set
the initial local error at the new unit to the average of the errors at all units
to which it is connected:
Enew = (1/n) Σi=1..n Ei    (4.8)
The local error at each neighbor is then reduced by the amount of error that
it has relinquished.
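The equal-share variant of this redistribution can be sketched as follows (the book also describes similarity-weighted shares; the function name is my own):

```python
def init_new_unit_error(errors, new_unit, neighbours):
    """Give a freshly inserted unit the average of its neighbours' local
    errors (eq. 4.8), then take that donated amount back from the
    neighbours in equal parts, so the total error over all units is
    conserved."""
    n = len(neighbours)
    share = sum(errors[i] for i in neighbours) / n   # Enew, eq. (4.8)
    errors[new_unit] = share
    for i in neighbours:
        errors[i] -= share / n   # each neighbour relinquishes an equal part

errors = [4.0, 2.0, 0.0]         # unit 2 is the newly inserted unit
init_new_unit_error(errors, new_unit=2, neighbours=[0, 1])
```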
The effect of this expansion process is illustrated in Figure 4.4, in which
all points in the sample database are drawn from a donut-shaped object. The
outline of that object is quickly reproduced, with increasing fidelity as the
number of units increases. It is notable that fewer units lie in the interior of
the object once the map is complete than was the case when the SOM was
trained on a set of points that defined a conical flask. This is a consequence of
the flexibility of the geometry of the lattice in which the GCS grows.
As with any computational method, it is important that the parameters
that govern operation of the method are chosen appropriately. In the GCS,
one of the key parameters is the frequency with which new units are added.
Each time a unit is added, the map should be given sufficient time to adjust
all the unit weights so that the resulting map is a fair representation of the
data. If this is not done, new units may be added in inappropriate positions,
which will compromise the quality of the map.
Figure 4.5 illustrates this. It shows several stages in the development of
a GCS that has been trained with the same dataset as was used to prepare
Figure 4.4. In this case, though, a new unit has been added every twenty
cycles, which is much too frequently, and it can be seen that, although the
resulting map bears some resemblance to the underlying data, the fit to that
data is poor.
Figure 4.4
The evolution of a GCS fitting data to a donut-shaped object.
Figure 4.5
The evolution of a GCS fitting data to a donut-shaped object. A new node has been added at the
rate of one unit per twenty cycles.
The local success measures of the excised cells are then shared out among their
neighbors, either equally, or in inverse proportion to the difference in weights
between the cell that is removed and each neighbor to which it was joined.
Pruning units out of a network that has only recently been grown sounds
like a curious tactic. Why expand the network in the first place if later on bits
of it will be removed? However, the ability to delete units as well as add them
maximizes the utility of the network by spreading knowledge as evenly as
possible across the units.
In addition, deletion of units allows a network to divide into multiple inde-
pendent portions, without interconnections, and thus describe more accu-
rately datasets in which one part consists of sample patterns that are very
unlike the remainder. Once the performance of the network is not noticeably
improved by the addition of a unit, the expansion and contraction may be
brought to a halt. The network can then be used as an interpretive tool in
exactly the same fashion as a SOM.
4.7 Applications
Wherever a SOM has been used to analyze scientific data, a GCS could be
used instead. However, the GCS has a further advantage only alluded to
above. Not only does the GCS include only as many units as are needed to
properly describe the dataset, but, in addition, when units are added, they are
positioned in the regions of the map in which they can be of greatest value,
i.e., in the region where the dataset is most detailed. As an example, the GCS
has been used to generate meshes to describe complex three-dimensional
objects from a list of scanned data points.1 The parts of an object where there
is little detail and which can therefore be described by a small number of
data points will be mapped by only a few units, while those regions in which
the surface is very varied will be represented by a high density of units, and,
therefore, with high fidelity. Averaging is built in, so if the signals are noisy,
automatic smoothing takes place.
The GCS shows a great deal of potential as a tool for visualization. Wong
and Cartwright have investigated the use of GCS to help in the visualiza-
tion of large high-dimensionality datasets,2 and Walker and co-workers have
used the method to analyze biomedical data.3 Applications in the field are
starting to increase in number, but at present the potential of the method far
exceeds its use.
4.9 Problems
1. In the last chapter, a SOM was trained with data points that lay on
the outline of an Erlenmeyer flask. Construct a GCS that is trained
using points drawn from the outline of a figure 8.
2. Repeat exercise 1, but use two identical figure 8s with a gap between
them. Check that, as training proceeds, the removal of units which
are rarely the BMU leads eventually to two separate regions on the
map that are not connected, each of which defines the points in one
object only.
3. The CD that accompanies this book contains a well-studied set of
data known as the Iris dataset. A brief description of the data is
included on the CD. Write a GCS to analyze the Iris data. If you have
SOM software available, compare the performance and execution
time of the SOM and the GCS on this dataset.
References
1. Ivrissimtzis, I.P., et al., unpublished work. http://www.mpi-sb.mpg.de/ivirssim/neural.pdf.
2. Wong, J.W.H. and Cartwright, H.M., Deterministic projection by growing cell
structure networks for visualization of high-dimensionality datasets. J. Biomed.
Inform., 38, 322, 2005.
3. Walker, A.J., Cross, S.S., and Harrison, R.F., Visualization of biomedical data-
sets by use of growing cell structure networks: A novel diagnostic classification
technique, Lancet, 354, 1518, 1999.
5
Evolutionary Algorithms
5.1 Introduction
It is a common experience in science that, when we try to solve a new
problem, an exact solution proves elusive. If no closed-form solution to the
problem seems to be available, some process of iterative improvement may
be needed to move from the best approximate solution to one of acceptable
quality. With persistence and perhaps a little luck, a suitable solution will
emerge through this process, although many cycles of refinement may be
required (Figure 5.1).
Figure 5.1
The process of iterative improvement of an initially modest solution to a problem.
Figure 5.2
The growth in numbers of a population of animals in the absence of any environmental pressure (population plotted against time).
Figure 5.3
The genetic algorithm (GA).
Figure 5.4
Daisy.
H1
H2 C
Cl
Br
Figure 5.5
Bromochloromethane.
{ 0001010100011011010100 }
{ 1, 5, 66, –2, 0, 0, 3, 4, –2, –110, –2, 1 }
{ 71.7, –11.4, 3.14159, 0.0, 78.22, –25.0, –21.4, 2.3, 119.2 }
Figure 5.6
Some typical genetic algorithm (GA) strings.
can also be used. In this chapter, we shall be working with strings that con-
sist of integers (Figure 5.6).
The scientific problem that we are trying to solve enters into the algorithm
in two ways. First, we must encode the problem, that is, choose a way in
which potential solutions to it can be expressed in the form of strings that
the algorithm can manipulate. Second, the algorithm must be given a way to
evaluate the quality of any possible string so that it can determine how good
each string is.
To understand how the GA manages to evolve a solution, it is helpful to
focus on a particular example; therefore, we shall use the algorithm to solve
a straightforward problem.
Figure 5.7
Ten randomly oriented dipoles.
4. Selection
5. Crossover
6. Mutation
5.5.2 Initial Population
Once the population size has been chosen, the process of generating and
manipulating solutions can begin. In preparation for evolution, an initial
population of potential solutions is created. In some problems, domain-spe-
cific knowledge may be available that can be used to restrict the range of
these solutions. It might be known in advance that the solution must have a
particular form; for example, we might loosely tie together opposite ends of
neighboring dipoles in our example, so that, instead of being able to adopt
any arbitrary relative orientation, the difference in angle between two neigh-
boring dipoles could not be more than 60º. If we have advance knowledge
of the format of a good solution, this can be taken into account when the
individuals in the first population are created. Within the limits of any such
constraints, each solution is created at random.
In the present problem, we shall assume that nothing is known in advance
of the form of a good solution. A typical string, created at random, is given
in Figure 5.8; shown below the string are the orientations of the dipoles that
it codes for.
The initial population of ten random strings is shown in Table 5.1.
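A population like that of Table 5.1 can be generated with a few lines of Python (a minimal sketch; the function name and the fixed seed are my assumptions):

```python
import random

def initial_population(n_pop=10, n_dipoles=10, seed=0):
    """Create n_pop random strings, each holding n_dipoles integer angles
    in the range 0-359 degrees, as in generation 0 of the dipoles
    problem."""
    rng = random.Random(seed)
    return [[rng.randrange(360) for _ in range(n_dipoles)]
            for _ in range(n_pop)]

pop = initial_population()
```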
Figure 5.8
A random string from generation 0.
Table 5.1
Generation 0: The Initial Set of Strings, Their Energies E, and Their Fitness f
String θ1 θ2 θ3 θ4 θ5 θ6 θ7 θ8 θ9 θ10 E f
0 103 176 266 239 180 215 69 217 85 296 1.052 0.1169
1 357 295 206 55 266 180 166 95 131 293 0.178 0.1302
2 141 335 116 132 346 119 275 278 255 26 –1.016 0.1542
3 182 267 168 353 90 36 82 251 160 178 –0.047 0.1341
4 292 102 41 179 191 286 73 336 273 257 0.288 0.1283
5 109 5 180 287 23 301 30 185 117 101 0.059 0.1322
6 217 308 234 87 13 239 25 127 304 3 –0.374 0.1403
7 294 13 170 142 45 313 91 278 235 228 0.454 0.1257
8 102 105 198 148 61 316 15 117 323 143 –0.839 0.1501
9 69 95 278 50 216 18 0 328 357 355 0.554 0.1241
Figure 5.9
The surface defined by an arbitrary fitness function, across which the genetic algorithm (GA)
searches for a maximum or minimum. The surface may be complex and contain numerous
minima and maxima.
The energy of interaction between two point charges q1 and q2 separated by
a distance r is given by Coulomb's law:
E = q1q2 / (4πε0r)    (5.1)
Figure 5.10
Distances used in the calculation of the interaction energy of two dipoles.
Interactions between two unlike ends of the dipoles are negative and,
therefore, attractive, while those between two like ends are positive, and
thus are repulsive. The total interaction energy is a summation over all ten
dipoles, and if we assume that the calculation can be simplified by includ-
ing only interactions between neighboring dipoles, the total energy can be
calculated from equation (5.2).
E = const × Σi=1..9 ( qiqi+1/a + qiqi+1/b − qiqi+1/c − qiqi+1/d )    (5.2)
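Equation (5.2) can only be evaluated once a geometry is fixed. The sketch below assumes unit charges, dipoles of length 1 whose centres sit 2 apart along a line, and an angle convention in which 90° points along the row (all of these are my assumptions for illustration; the book does not specify them); it sums the four charge–charge terms over neighbouring pairs only:

```python
import math

def dipole_energy(angles, spacing=2.0, length=1.0, const=1.0):
    """Total interaction energy of a row of dipoles, per eq. (5.2):
    neighbouring pairs only, four charge-charge terms per pair.
    Angles are in degrees."""
    def ends(i):
        # positions and signs of the (+) and (-) charges of dipole i;
        # the angle is measured so that 90 degrees lies along the row
        th = math.radians(angles[i])
        cx = i * spacing
        dx = 0.5 * length * math.sin(th)
        dy = 0.5 * length * math.cos(th)
        return ((cx + dx, dy, +1.0), (cx - dx, -dy, -1.0))

    total = 0.0
    for i in range(len(angles) - 1):
        for (x1, y1, q1) in ends(i):
            for (x2, y2, q2) in ends(i + 1):
                r = math.hypot(x1 - x2, y1 - y2)
                total += const * q1 * q2 / r   # negative if ends are unlike
    return total

head_to_tail = dipole_energy([90] * 10)   # dipoles aligned along the row
parallel = dipole_energy([0] * 10)        # dipoles side by side, parallel
```

Under these assumptions the head-to-tail arrangement gives a negative (attractive) total energy, while the side-by-side parallel arrangement is repulsive.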
One simple way to turn the energy Ei of string i into a fitness would be to
take a reciprocal:
fi = c / Ei    (5.3)
Figure 5.11
A possible, but unsatisfactory, relationship between string energy and string fitness, calculated using equation (5.3).
The fitness calculated in this way becomes infinite at zero energy, the point at
which attraction and repulsion cancel. This defect in the fitness function is
easily remedied by making a minor modification to equation (5.3):
fi = 1 / (Ei + 6.0)    (5.4)
where the value 6.0 has been chosen because, with the dipole spacing and
constant used in this problem, the minimum possible energy is –5.5, so the
fitness cannot become infinite. Figure 5.12 shows that the fitness
calculated with equation (5.4) is better behaved; it is nowhere infinite, and
the lower the total energy of the set of dipoles defined by a string, the higher
the corresponding fitness.
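Equation (5.4) is a one-liner in practice; the offset of 6.0 below is the value chosen for this particular problem:

```python
def fitness(energy, offset=6.0):
    """Map a string's energy to a positive fitness via eq. (5.4).
    The offset must exceed the magnitude of the lowest possible energy
    (about -5.5 here), so the denominator can never reach zero and lower
    energies always give higher fitnesses."""
    return 1.0 / (energy + offset)
```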
You will notice how remarkably arbitrary this process is. The constant
value 6.0 was not chosen because it has some magical properties in the GA,
but because it seemed to be a value that would be “pretty good” in this par-
ticular application. The lack of sensitivity in the operation of the GA to the
precise way in which the calculation is set up and run can seem unsettling at
first, but this tolerance of the way that the user decides to use the algorithm
is one of its important attractions, giving even novice users every chance of
achieving good results.
Because the orientations of the dipoles defined by the strings in generation
0 have been chosen at random, none of these strings is likely to have a high
fitness. The initial points in any GA population will be scattered randomly
across the fitness surface (Figure 5.13).
The energy of each of the ten strings in the initial population is shown in
Table 5.1, together with the fitness calculated from equation (5.4).
5.5.4 Selection
The fitness of a string is an essential piece of information on the road to
the creation of solutions of higher quality. The first step in the process of
improving the strings is to use the fitness to select the better strings from the
Figure 5.12
A possible relationship between string energy and fitness, calculated from equation (5.4).
Figure 5.13
The strings in an initial population, having been created randomly, will have fitnesses spread
widely across the fitness surface.
population so that they can act as parents for the next generation; selection
employs a survival-of-the-fittest competition.
The selection process, which is run as soon as the fitnesses have been cal-
culated, must be biased toward the fitter strings if the algorithm is to make
progress. If selection did not have at least some preference for the fitter
strings, the search in which the algorithm engages would be largely random.
On the other hand, if the selection of strings was completely deterministic,
so that the best strings were picked to be further upgraded, but the poorer
strings were always ignored, the population would soon be dominated by
the fittest strings and would quickly become homogeneous.
The selection procedure, therefore, is only biased toward the fitter strings;
it does not choose them to the inevitable exclusion of poorer strings. To
accomplish this, the selection procedure is partly stochastic (random).
Several alternative methods of selection exist; in this example, we shall use
Binary Tournament Selection. Later we shall encounter other methods. Two
strings are chosen at random from the current population, say, strings 3 and
6 from Table 5.1. Their fitness is compared (f3 = 0.1341, f6 = 0.1403) and the
string with the higher fitness, in this case string 6, is selected as a parent and
becomes the first entry in the parent pool. Another pair of strings is drawn at
random from the population and the fitter of the two becomes the next entry
in the parent pool. The process is repeated npop times to yield a group of par-
ent strings equal in number to the set of strings in the starting population.
Some strings will participate more than once in the binary tournament
because the fitnesses of twenty randomly selected strings must be compared
in order to select ten parents. Uniquely, the weakest string cannot win against any string,
even if it is picked to participate, so it is sure to disappear from the population,
leaving a place open for a second copy of one of the other strings; consequently,
at least one string will appear in the parent pool more than once.
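Binary tournament selection can be sketched in a few lines of Python (the function name is my own; the example fitnesses are toy values, not those of Table 5.1):

```python
import random

def binary_tournament(population, fitnesses, rng=random):
    """Build a parent pool the same size as the population; each entry is
    the fitter of two distinct strings drawn at random from the current
    population (Binary Tournament Selection)."""
    n = len(population)
    parents = []
    for _ in range(n):
        i, j = rng.sample(range(n), 2)            # two different strings
        winner = i if fitnesses[i] >= fitnesses[j] else j
        parents.append(list(population[winner]))  # copy into the pool
    return parents

# toy example: four strings with known fitnesses
population = [[0], [1], [2], [3]]
fitnesses = [0.1, 0.4, 0.3, 0.9]
parents = binary_tournament(population, fitnesses, rng=random.Random(7))
```

Because the two contestants are always distinct, the weakest string can never win a tournament, exactly as described above.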
The selection of strings for the tournament is random, thus the results may
vary each time the process is performed on a given starting population. This
unpredictable selection of strings is the second time that random numbers
have been used in the algorithm (the first was in the construction of the
strings in the initial population) and random numbers will return shortly.
It may seem surprising that random numbers are needed when the problem
to be solved has a definite (albeit, still unknown) solution, but their use is a
common feature in AI methods and, in fact, is fundamental to the operation
of several of them.
The strings chosen to be parents by the binary tournament are shown in
Table 5.2.*
In an animal population, individuals grow old. Aging, though it changes
the outward appearance of an animal, does not alter its genetic makeup,
and since the GA only manipulates the equivalent of the genes, “age” has
no meaning in a standard GA. In common with the evolution of a population
of animals, the algorithm operates over a succession of generations, during
which some individuals are lost from the population and others created; as
soon as a GA population has been constructed, the algorithm sets about
destroying it and giving birth to its successor.
* The “string numbers” shown in Table 5.1 and Table 5.2 are of no significance within the
algorithm; they are shown to make it simpler for the reader to keep track of the operation
of the GA.
Table 5.2
Strings Chosen by Binary Tournament Selection to Participate in the First Parent Pool
Parent Old
String # θ1 θ2 θ3 θ4 θ5 θ6 θ7 θ8 θ9 θ10 String #
Note: The “old string” number refers to the labeling in Table 5.1. The “parent string” number
will be used in Table 5.3.
5.5.5 Crossover
The average fitness of the parent strings in Table 5.2 is higher than the aver-
age fitness of the strings in the starting population (the parent pool has an
average fitness of 0.1380, compared with an average fitness of 0.1336 for the
population in generation 0). Although the rise in fitness is neither guaran-
teed nor substantial, it is to be expected because the selection process was
biased in favor of the fitter strings. We cannot be too pleased with ourselves
though; this increase in fitness is not yet evidence of real progress. No new
strings are created by selection and at least one string has been removed
by it. If the selection process was repeated a few times, the average fitness
would probably continue to improve, but the algorithm would quickly find
its way down a dead-end, converging on a uniform set of strings in which
each was a copy of one from the first population.
It is clear that selection on its own cannot generate a high quality solution;
to make further progress, a mechanism is needed to modify strings. Two
tools exist for this purpose: mating and mutation.
In most living things, reproduction is a team sport for two participants.
By combining genes from a couple of parents, the genetic variety that can
be created in the offspring is far greater than that which would be possible
if all genes were derived from a single parent. The creation of individuals
whose genes contain information from both parents induces greater vari-
ability in the population and increases the chance that individuals which are
well adapted to the environment will arise. The same reproductive principle,
that two parents trump one, is used in the GA, in which mating is used to
generate new strings from old strings.
The mechanism for accomplishing this is crossover. In one-point crossover
(Figure 5.14), two strings chosen at random from the freshly created parent
pool are cut at the same randomly chosen position into a head section and a
tail section. The heads are then swapped, so that two offspring are created,
each having genetic material from both parents.
Figure 5.14
One-point crossover of parent strings 3 and 6, cut to the right of the arrows, between genes 7
and 8.
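The swap shown in Figure 5.14 can be sketched as follows, using the parent strings 3 and 6 from Table 5.1, though with a random cut rather than the specific cut between genes 7 and 8 shown in the figure (the function name is my own):

```python
import random

def one_point_crossover(parent1, parent2, rng=random):
    """Cut both parents at the same randomly chosen point and swap heads,
    giving two children that each carry genes from both parents."""
    point = rng.randrange(1, len(parent1))   # cut strictly inside the string
    child1 = parent2[:point] + parent1[point:]
    child2 = parent1[:point] + parent2[point:]
    return child1, child2

p1 = [182, 267, 168, 353, 90, 36, 82, 251, 160, 178]   # string 3, Table 5.1
p2 = [217, 308, 234, 87, 13, 239, 25, 127, 304, 3]     # string 6, Table 5.1
c1, c2 = one_point_crossover(p1, p2, rng=random.Random(0))
```

Whatever the cut point, crossover only reshuffles existing genes: at every position the two children between them hold exactly the two parental values.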
5.5.6 Mutation
The crossover operator swaps genetic material between two parents, so unless
the segment to be swapped is the same in both strings (as has happened
with strings 0 and 7 in Table 5.3 because the two parents that were crossed
were identical*), child strings differ from their parents. Now we are start-
ing to make progress. Crossover is creating new strings, which introduces
information that was not in the original population. However, crossover is
* Although strings 0 and 7 were identical after crossover, the first angle in string 0 was then
mutated, changing from 217 to 359, so in Table 5.3 the two child strings differ at the first
angle.
Table 5.3
The Strings from Generation 0, after Selection, Crossover, and Mutation
Created Mutated
from at
String# θ1 θ2 θ3 θ4 θ5 θ6 θ7 θ8 θ9 θ10 Parents Position
not powerful enough on its own to enable the algorithm to locate optimum
solutions. This is because crossover can only shuffle between strings what-
ever values for angles are already present; it cannot generate new angles. We
might guess that the optimum value for the angle of the first dipole is 90°.
Table 5.1 shows that in no string in the initial population does the first dipole
have an angle of 90°, so no matter how enthusiastically we apply crossover,
the operator is incapable of inserting the value of 90° into a string at the cor-
rect position or, indeed, at any position. To make further progress, an opera-
tor is needed that can generate new information.
Mutation is this operator. At a random position in a randomly chosen
string, it inserts a new, randomly selected value (more random numbers!).
One of the two strings produced by crossing parents 4 and 8 is
292 102 41 179 191 316 15 117 323 143
The string is picked out by the mutation operator and mutated at position 6
(chosen at random) to give the new string
292 102 41 179 191 131 15 117 323 143
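An operator of this kind can be sketched as follows (the function name and the choice of mutating at most one position per string are my assumptions; many variants exist):

```python
import random

def mutate(string, p_m=0.2, rng=random):
    """With probability p_m, replace one randomly chosen angle in the
    string by a fresh random value in the range 0-359 degrees."""
    s = list(string)
    if rng.random() < p_m:
        position = rng.randrange(len(s))
        s[position] = rng.randrange(360)
    return s

before = [292, 102, 41, 179, 191, 316, 15, 117, 323, 143]
after = mutate(before, p_m=1.0, rng=random.Random(4))  # force a mutation
```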
Mutation refreshes the population through the insertion of new data into a
small fraction of the strings. This introduction of fresh data is essential if the
algorithm is to find an optimum solution, since, in the absence of mutation, all
variety in the population would eventually disappear and all strings would
become identical. However, the news is not all good. Although mutation is
essential, it is also disruptive: applied too liberally, it destroys good strings
as fast as selection can assemble them, and the search degenerates toward a
random walk.
Table 5.4
The Genetic Algorithm Parameters Used in the Dipoles Problem
npop 10
Crossover type 1-point
pc 1.0
pm 0.2
5.6 Evolution
Updating of the population is now complete—strings were assessed for
quality and the better strings chosen semistochastically through npop binary
tournaments. The strings lucky enough to have been selected as parents
were paired off and segments swapped between them using a one-point
crossover. Finally, a couple of strings, on average, were mutated. The new
generation of strings is now complete and the cycle repeats.
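The whole cycle just described can be drawn together in one short main loop. This is a minimal sketch, not the book's implementation: the function name, the fixed seed, and the toy fitness function (which simply rewards angles near 90°) are my assumptions, and it uses the parameter values of Table 5.4 (npop = 10, one-point crossover, pc = 1.0, pm = 0.2):

```python
import random

def evolve(fitness_fn, n_pop=10, n_genes=10, n_generations=200, seed=1):
    """A minimal GA main loop: evaluate, select parents by binary
    tournament, cross every consecutive pair at one random point
    (p_c = 1.0), then mutate a small fraction of the children."""
    rng = random.Random(seed)
    pop = [[rng.randrange(360) for _ in range(n_genes)] for _ in range(n_pop)]
    for _ in range(n_generations):
        fits = [fitness_fn(s) for s in pop]
        # selection: the fitter of two distinct random strings, n_pop times
        parents = []
        for _ in range(n_pop):
            i, j = rng.sample(range(n_pop), 2)
            parents.append(list(pop[i] if fits[i] >= fits[j] else pop[j]))
        # one-point crossover on consecutive pairs of parents
        nxt = []
        for a, b in zip(parents[::2], parents[1::2]):
            cut = rng.randrange(1, n_genes)
            nxt += [b[:cut] + a[cut:], a[:cut] + b[cut:]]
        # mutation: each child has a small chance of one new random angle
        for s in nxt:
            if rng.random() < 0.2:
                s[rng.randrange(n_genes)] = rng.randrange(360)
        pop = nxt
    return max(pop, key=fitness_fn)

# toy fitness surface with its optimum at all angles equal to 90 degrees
best = evolve(lambda s: -sum((a - 90) ** 2 for a in s))
```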
We can monitor progress of the algorithm by observing how the fitness
of the best string in the population, or alternatively the average fitness of
the whole population, changes as the cycles pass. Table 5.5 lists the fitness
of the best string in the population at a few points during the first fifty
generations.
The table reveals an encouraging rise in fitness, which is confirmed in Fig-
ure 5.15. This shows how the fitness of the best string in each generation varies
over a longer period. It is clear that the algorithm is making steady progress,
although the occasional spikes in the plot show that on a few occasions a good
string found in one generation fails to survive for long, thus the fitness of the
best string, rather than marching steadily upward, temporarily declines.
Table 5.5
The Fitness of the Best String at Several Points in the First Fifty Cycles
Generation 0 5 10 15 20 30 40 50
Figure 5.15
Fitness of the best string in the population during the first five hundred generations.
By generation 100 (Table 5.6) the dipoles are starting to align and there are
hints that a good solution is starting to emerge.*
It is clear that the population is homogenizing as a result of the removal of
the poorer solutions. This homogeneity brings both benefits and disadvan-
tages. A degree of homogeneity is beneficial because it provides protection
against loss of information. If there are several copies of a good string in the
population, the danger of that string being lost because it was not chosen
by the selection operator or has been destroyed by crossover or mutation, is
reduced. As Figure 5.15 shows, the fitness of the best strings takes several
short-lived steps backwards in the early part of the run, but these setbacks
are rare once a couple of hundred generations have passed and several cop-
ies of the best string are present.
On the other hand, the algorithm requires a degree of diversity in the
population to make continued progress. If every string in the population is
identical, one-point crossover has no effect and any evolutionary progress
* The best solution consists of a head-to-tail arrangement of all ten dipoles. There are two such
arrangements, one in which all the dipoles are pointing to the left and the other in which they
are all pointing to the right. Both solutions are equally likely to be found by the algorithm
and both have the same energy, but any single run will eventually converge on only one or
the other. Owing to the stochastic nature of the algorithm and the fact that the solutions are
equivalent, if the GA was run many times, it would converge on each solution in 50 percent
of the runs.
Table 5.6
The Population, Generation 100
String θ1 θ2 θ3 θ4 θ5 θ6 θ7 θ8 θ9 θ10
Table 5.7
The Population, Generation 1000
String θ1 θ2 θ3 θ4 θ5 θ6 θ7 θ8 θ9 θ10
0 96 92 90 93 93 97 95 88 88 88
1 96 92 90 93 93 97 95 88 88 88
2 96 92 90 93 93 97 95 88 88 88
3 96 92 90 93 44 97 95 88 88 88
4 96 92 90 93 93 97 95 88 88 88
5 96 92 90 93 93 97 95 88 88 25
6 96 92 90 93 93 97 95 88 88 88
7 96 92 90 93 93 97 95 88 88 88
8 96 92 90 93 93 97 95 88 88 88
9 96 92 90 93 93 97 95 88 88 88
Figure 5.16
Variation of the fitness of the best string (upper line) and the average population fitness
(lower line) with generation number.
Figure 5.17
Roulette wheel selection. The area of the slot on the virtual roulette wheel for each string is
proportional to the string’s fitness.
In tournament selection, a string can become a parent only if it is among
those picked to participate in the tournament. The fittest string might never
be picked and, without being invited to the dance, it cannot become a parent.
If the population is comprised of a large number of average strings with just
one or two strings of superior quality, the loss of the best strings may hinder
progress toward an optimum solution. We can see evidence of stalled prog-
ress in Figure 5.16 at generation 784. The fitness of the best string collapses
from 1.9 to less than 1.5 before making a partial recovery a few generations
later. Stochastic remainder is a hybrid method for the selection of parents that
combines a stochastic element with a deterministic step to ensure that the
best string is never overlooked.
There are three steps in stochastic remainder selection. First, the fitnesses
are scaled so that the average string fitness is 1.0.
f_{i,scaled} = ( f_{i,raw} × n_pop ) / Σ_{j=1}^{n_pop} f_{j,raw}    (5.5)
As a result of this scaling, all strings that previously had a raw fitness
above the mean will have a scaled fitness greater than 1, while the fitness
of every below-average string will be less than 1. Each string is then copied
into the parent pool a number of times equal to the integer part of its scaled
fitness (Table 5.8).
Thus, two copies of string 2 in Table 5.8, whose scaled fitness is 2.037, are
certain to be made in the first round of selection using stochastic remainder,
while at this stage no copies of string 7, whose fitness is 0.850, are made.
This deterministic step ensures that every above-average string will appear
Table 5.8
Stochastic Remainder Selection
String 1 2 3 4 5 6 7 8 9 10
fi,raw 0.471 1.447 0.511 0.208 1.002 0.791 0.604 0.604 0.985 0.444
fi,scaled 0.663 2.037 0.719 0.293 1.411 1.114 0.850 0.850 1.387 0.625
Ci,certain 0 2 0 0 1 1 0 0 1 0
fi,residual 0.663 0.037 0.719 0.293 0.411 0.114 0.850 0.850 0.387 0.625
Ci,roulette 1 0 1 0 0 0 1 1 0 1
Ci,total 1 2 1 0 1 1 1 1 1 1
Note: fi,raw is the raw fitness of a string, fi,scaled is its scaled fitness, Ci,certain is the number of parent
strings guaranteed to be made by stochastic remainder, fi,residual is the residual fitness after
the number of parents made has been subtracted from the scaled fitness, Ci,roulette is the
number of copies made by a typical roulette wheel selector, and Ci,total is the total number
of parents prepared from the string.
at least once in the parent pool. Any string whose fitness is at least twice the
mean fitness will be guaranteed more than one place in the pool.
Finally, the number of copies made of each string is subtracted from its
scaled fitness to leave a residual fitness, which for every string must be less
than 1.0. A modified roulette wheel or tournament selection is then run
using these residual fitnesses to fill any remaining places in the population.
In this modification, once a copy has been made of a string as a result of its
selection by the roulette wheel operator, or in the tournament selection, its
residual fitness is reduced to zero to prevent it being chosen again. Stochastic
remainder neatly combines a deterministic step that ensures that every good
string is granted the opportunity to act as a parent for the new population,
with a stochastic step that offers all strings a chance, no matter how poor
they may be, to also pass on their genes.
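The two-step procedure can be sketched as follows, using the raw fitnesses of Table 5.8. The function names are mine; the remaining places are filled by a modified roulette spin that zeroes a residual fitness once its string has been chosen, as described above.

```python
import random

def stochastic_remainder(fitnesses, rng):
    """Build a parent pool: a deterministic step copies each string
    int(scaled fitness) times, then a modified roulette wheel on the
    residual fitnesses fills the remaining places."""
    n = len(fitnesses)
    total = sum(fitnesses)
    scaled = [f * n / total for f in fitnesses]      # Equation (5.5)
    pool, residual = [], []
    for i, s in enumerate(scaled):
        copies = int(s)                              # guaranteed copies
        pool.extend([i] * copies)
        residual.append(s - copies)
    while len(pool) < n:                             # stochastic step
        r = rng.uniform(0.0, sum(residual))
        running = 0.0
        for i, res in enumerate(residual):
            running += res
            if res > 0.0 and running >= r:
                pool.append(i)
                residual[i] = 0.0                    # cannot be chosen again
                break
    return pool

# Raw fitnesses from Table 5.8
raw = [0.471, 1.447, 0.511, 0.208, 1.002, 0.791, 0.604, 0.604, 0.985, 0.444]
pool = stochastic_remainder(raw, random.Random(0))
```

Every string whose scaled fitness exceeds 1 is guaranteed at least one place in the pool, and string 2, whose scaled fitness exceeds 2, is guaranteed two.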
Table 5.9 shows how the number of copies of a set of strings may vary
depending on the selection method chosen. Because each method contains a
stochastic element, the values in the table would change if the algorithm was
run a second time.
Table 5.9
The Number of Copies of a Set of Strings Made by Running Different Selection Strategies
String fi,raw fi,scaled Copies made under each of four selection strategies
1 0.471 0.663 1 0 1 1
2 1.447 2.037 2 3 2 2
3 0.511 0.719 1 0 0 1
4 0.208 0.293 0 0 1 0
5 1.002 1.411 2 3 2 1
6 0.791 1.114 1 1 1 1
7 0.604 0.850 1 1 1 1
8 0.604 0.850 0 1 0 1
9 0.985 1.387 1 1 2 1
10 0.444 0.625 1 0 0 1
Note: Since each of the selection methods contains a stochastic element, different results would
be obtained if the selection were run a second time.
Figure 5.18
Fitness scaling. The original range of fitnesses from 1.21 to 1.27 is stretched to cover a wider range.
the most effective for a given problem. A typical procedure would be to scale
all fitnesses so that they cover the range of, say, 1.0 to 3.0. This can be accom-
plished by applying the linear transformation:
f_{i,scaled} = 1.0 + ( f_i − f_min ) × 2.0 / ( f_max − f_min )    (5.6)
where fi is the fitness of string i, and fmax and fmin are the fitnesses of the best
and worst strings, respectively, in the population. Scaling is unlikely to be
necessary in the early part of a run, but may be turned on once the popula-
tion begins to homogenize.
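Equation (5.6) amounts to a small helper function. This is a sketch; the guard for a fully homogeneous population is my addition, since the formula divides by f_max − f_min.

```python
def scale_fitness(fitnesses, lo=1.0, hi=3.0):
    """Linear scaling of Equation (5.6): stretch the raw fitnesses so
    that the worst string maps to `lo` and the best to `hi`."""
    fmin, fmax = min(fitnesses), max(fitnesses)
    if fmax == fmin:
        return [lo] * len(fitnesses)   # homogeneous population: nothing to stretch
    return [lo + (f - fmin) * (hi - lo) / (fmax - fmin) for f in fitnesses]
```

With the narrow range of Figure 5.18, `scale_fitness([1.21, 1.24, 1.27])` stretches the fitnesses to [1.0, 2.0, 3.0].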
The relative orientations of most of the dipoles in this string are unprom-
ising, but dipoles 7 and 8 (angles shown in bold) are pointing almost in the
Figure 5.19
The growing clustering of strings in regions of higher fitness.
same direction and are roughly aligned along the axis. This head-to-tail
arrangement brings opposite ends of these two neighboring dipoles into close
proximity and, thus, is favorable energetically, and although the remaining
dipoles in the string are arranged almost randomly, this small segment of
the string serves to give the string a fitness that is slightly superior to that of
a string in which every dipole adopts a random orientation. Another string
in the same population, string 8, has two short segments in which dipoles
are in a favorable alignment:
102 105 198 148 61 316 73 336 273 257
Dipoles 1 and 2 are roughly aligned along the axis, as are those in positions 9
and 10, though the first pair point in the opposite direction to the second pair.
Suppose that these two strings are selected as parents and then paired
together for crossover between positions 8 and 9 (Figure 5.20).
Figure 5.20
Action by the crossover operator that brings together two promising segments.
In the first child string, all of dipoles 7 to 10 now have a similar orienta-
tion and, consequently, we would expect that the new string will have a fit-
ness that exceeds that of either parent.* Because of this improved fitness, the
string will be more likely to be chosen as a parent when the next generation
is processed by the selection operator, so both the string and the good seg-
ments that it contains will begin to spread through the population.
By contrast, if the crossover or mutation operators produce a string that
contains few useful segments or none at all, either because there were no
useful segments in the parent strings or because the evolutionary operators
disrupted these segments, the string will have relatively poor fitness and, in
due course, will be culled by the selection operator. Thus, good strings that
contain chunks of useful information in the form of a group of high-quality
* Although the string has only four dipoles with a favorable orientation, just like one of its
parents, all four are contiguous, so there are three favorable dipole–dipole interactions: 7 to
8, 8 to 9 and 9 to 10. Parent string 5 had only one such interaction and parent 8 had two.
genes tend to proliferate in the population, carrying their good genes with
them, while poorer strings are weeded out and discarded. Because of the pre-
sumed importance of segments that confer above-average fitness on a string,
they are given a special name: building blocks or schemata (sing. schema).*
This discovery of schemata sounds promising; it seems that building a
high-quality string should be simple. All we need to do is to create a large
number of random strings, figure out where these valuable building blocks
are located in each string, snip them out, and bolt them together to form a
super string. Though an appealing idea, this is not feasible. The algorithm
does not set out to identify good schemata — indeed it cannot do so because
the high fitness calculated by the algorithm is associated with the string in
its entirety, not with identifiable segments within it. During the evolutionary
process, the GA has no means by which it can pick out only good schemata
and leave behind those of little value, so explicitly cutting and pasting the
best parts of different strings is not possible. However, although GA selec-
tion chooses strings, not schemata, it is automatically biased toward good
schemata as they are more likely to be contained within superior strings.
Therefore, even though the algorithm is schemata-blind, it still, in effect,
identifies the schemata needed to build good solutions.**
Following this brief discussion of how the GA finds good solutions, we
consider now how to oil the gears of the algorithm to make it function as
effectively as possible.
that are of high fitness and that will, therefore, proliferate in the population are also con-
tained in the global optimum string. During evolution, schemata will proliferate if their aver-
age fitness exceeds the average fitness of all the schemata in the population, but problems
exist in which the optimum solution is built from schemata that, when found within a typical
string, are of below average fitness; only when all the necessary schemata required to form
the perfect string come together do they cast off their frog’s clothing and reappear as a string
of high quality.
The proliferation in the population of highly fit schemata that do not appear in the optimum
string, at the expense of less fit schemata that do, is known as deception and renders the
problem difficult to solve. Although there are many deceptive problems in the literature, the
number of different examples is itself deceptive, as a large proportion were specifically cre-
ated to be deceptive, so that ways of dealing with deception could be investigated. Neverthe-
less, deception is a serious challenge for the genetic programming algorithm (see later).
mutation plays its part by creating new valuable building blocks that con-
fer high fitness on the string that contains them. However, both operators
are also disruptive: If a string contains a single short building block of
value, it is unlikely to be disrupted by one-point crossover, but if two build-
ing blocks are far apart in a string, crossover will probably separate these
segments and, thereby, reduce the quality of the string, not enhance it. Sev-
eral alternative crossover operators are available to help us circumvent this
difficulty.
Figure 5.21
Uniform crossover.
there is only limited correlation between genes that are more than one or two
places apart in the string.*
Figure 5.22
Standard two-point crossover.
parents. Earlier we noted that, if no string in the initial population has the
value of 90° for the first dipole, one-point crossover could not move it there.
Two-point crossover, because it shuffles angles around in the strings, is
able to manage this, provided that the value of 90° exists somewhere in the
population.
As a result of these advantages, two-point crossover is more valuable than
its one-point cousin, but, in the form described above, it hides a subtle prob-
lem. When one-point crossover is applied, every gene has an equal chance
of being in the segment that is moved. (There is no distinction between the
segment that is swapped between strings and the segment that is left behind;
thus it is inevitable that the swapping probability is the same at all positions
along the string.) However, in two-point crossover, genes in the middle of
the string are more likely than genes at the extremities to be included within
the segment that is selected for swapping; this difference becomes more pro-
nounced as the size of the string increases (Table 5.10).
As genes in the center of the strings are frequently swapped by two-point
crossover, the algorithm is constantly creating new combinations of genes
in this part of the strings. Through this shuffling of genes and retesting as
the generations pass, the algorithm may make good progress optimizing the
middle of the string, but leave the ends of long strings relatively untouched
over many generations; as a consequence, the outer regions of the strings
will be slower to improve than the midsection.
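The bias toward the middle of the string is easy to confirm with a short Monte Carlo estimate. This is a sketch under the usual convention that the two cut points are distinct and the segment between them is swapped.

```python
import random

def swap_probability(length, position, trials, rng):
    """Estimate the chance that the gene at `position` falls inside the
    swapped segment of naive two-point crossover."""
    hits = 0
    for _ in range(trials):
        a, b = rng.sample(range(length + 1), 2)   # cuts lie between genes
        lo, hi = min(a, b), max(a, b)
        if lo <= position < hi:                   # gene inside the segment
            hits += 1
    return hits / trials

rng = random.Random(1)
p_mid = swap_probability(11, 5, 20000, rng)   # gene at the midpoint
p_end = swap_probability(11, 0, 20000, rng)   # gene at the end
```

For a string of length 11 the exact values are 36/66 ≈ 0.55 for the midpoint and 11/66 ≈ 0.17 for the end gene, and the gap widens as the string grows.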
A simple solution to this unequal treatment of different sections of the
string exists. The order in which the cut points for crossover are generated is
noted. If the second cut is “earlier” in the string than the first cut, the cross-
over wraps around at the end of the string. Suppose that the two strings
shown at the top of Figure 5.23 are to be crossed. The two cut points selected
at random in the first string are between genes 2 and 3 and between genes 5
and 6. This defines a segment of length 4 that will be swapped into the other
string. A cut point is chosen in the second string, say between genes 8 and
9. The second cut point in this string is chosen with equal probability either
four genes to the left of this point or four genes to the right. If the second
cut is to the right, this takes us beyond the end of the string, so the cut will
be between genes 2 and 3 and the cut wraps around. This will yield the two
strings shown in Figure 5.23.
Table 5.10
Probability That a Gene at the Midpoint of a String, pmid, or at the
End of a String, pend, Will Be Swapped during Two-Point Crossover
String Length 5 11 17 51
Figure 5.23
Wraparound crossover.
Figure 5.24 illustrates wraparound in the first string. The cut points in the
first string are chosen as 8/9 and 2/3 (in that order). The segment that is
swapped from this string is still four genes in length, but this time comprises
genes 9, 10, 1, and 2.
Figure 5.24
Wraparound crossover: a second example.
The wraparound operator gives every position in the string an equal chance
of being selected because, on average, half of the time the second cut point
will lie to the left of the first, so wraparound applies and it is the outer
parts of the string, rather than an interior segment, that are swapped. This
is the equivalent of joining the two ends of the string and choosing cut
points at random from within the circular string thus formed (Figure 5.25).
This demonstrates that wraparound treats all genes equally.
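One way to implement the operator, treating each string as the circle of Figure 5.25, is sketched below. This is an illustration rather than the text's own code; the segment start in each parent is chosen independently, as in the worked example.

```python
import random

def wraparound_crossover(p1, p2, rng):
    """Swap a segment of common length between two strings; the segment
    may wrap around the end, so every gene is equally likely to move."""
    n = len(p1)
    seg_len = rng.randint(1, n - 1)                 # length of swapped segment
    s1, s2 = rng.randrange(n), rng.randrange(n)     # segment start in each parent
    c1, c2 = list(p1), list(p2)
    for k in range(seg_len):
        i, j = (s1 + k) % n, (s2 + k) % n           # indices wrap past the end
        c1[i], c2[j] = p2[j], p1[i]
    return c1, c2
```

Because the indices are taken modulo the string length, a segment starting near the end simply continues from position 1, exactly as in Figures 5.23 and 5.24.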
Wraparound is not limited to one-dimensional strings. In this chapter, we
will not be dealing in detail with two-dimensional GA strings, which are
strings that take the form of an array rather than a simple list of values.
Figure 5.25
Interpretation of a genetic algorithm (GA) string as a circle.
S = [ 45    7   388   …
      71  −11    46   …
       0   71     0   …          (5.7)
      34 −100     …   … ]
Figure 5.26
The fitness of the best string in an attempt to solve the dipoles problem using a population of
four strings: pm = 0.2, pc = 1.0.
be just one. Crossover or mutation may easily destroy it and, with no backup
copy available, the algorithm will repeatedly need to retrace its steps to try
to rediscover information in a string that has been lost.
By contrast, when the algorithm is run with a population of one hundred
strings (Figure 5.27), the fitness of the best string increases with almost no
setbacks, and a high-quality solution emerges in fewer than two hundred
generations. The optimum solution in this run is found (and then lost) twice
around generation 770, but the algorithm settles permanently on the opti-
mum solution at generation 800.
If a modest increase in the number of strings, say, from ten to one hundred,
is helpful, perhaps a more dramatic increase, to fifty thousand or one hun-
dred thousand, would be even better. However, it is rarely beneficial to use
huge populations. Duplicates of a string provide protection against its pos-
sible destruction by the genetic operators, but a very large population may
contain hundreds or thousands of identical strings whose presence does
nothing to increase diversity; it merely slows the calculation.
The success of the algorithm relies on the GA processing many genera-
tions; solutions can only evolve if evolution is possible. Excessively large
populations are especially problematic if evaluation of the fitness function
is computationally expensive as it is, for example, in the chemical flowshop
problem (section 5.10) in which evaluating the total time required for all
chemicals to be processed by the flowshop in a defined order may take far
more time than the execution of all other parts of the GA combined.
When the fitness function is challenging to evaluate, the use of a very large
population will restrict operation of the algorithm to just a few generations.
During these generations, the fitness of the best string will improve, which
may give the impression that the algorithm is effective, but the improve-
ment occurs because in each generation very many essentially random new
Figure 5.27
The fitness of the best string in an attempt to solve the dipoles problem with a population of
one hundred strings: pm = 0.2; pc = 1.0.
In this string one of the dipoles has the 90° alignment required for all the
dipoles in the perfect solution, but for each of the remaining dipoles, the
required value is present neither in this string nor in any of the other strings
in the population. We can estimate how many generations must pass on
average before an angle of 90° is created by the mutation operator for the first
dipole in string 0, assuming that the string survives the selection process
each generation.
The probability of mutation per string per generation is 0.2; on aver-
age, therefore, two strings are mutated each generation. Taking all strings
together, there is a 2/10 probability that a particular position is chosen for
mutation,* and, if it is, the probability that the required value of dipole
orientation (90°) will be generated is 1/360. The probability that the
required value will be generated in the desired position in the string in
which it is required is thus about 2/10 × 1/360, substantially less than one
chance in a thousand per generation.
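The arithmetic can be checked directly, using the estimates from the text:

```python
# Probability, per generation, that mutation places the value 90 in the
# first position of string 0 (estimates from the text).
p_position = 2 / 10       # a given position is chosen for mutation
p_value = 1 / 360         # the new angle happens to be exactly 90
p = p_position * p_value
expected_wait = 1 / p     # mean number of generations before it happens
```

Here p ≈ 0.00056, roughly one chance in 1,800 per generation, so on average about 1,800 generations must pass before the required mutation arrives.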
It is frustrating that the angle of 96° is nearly correct, and yet the prob-
ability that the algorithm can move it by just a few degrees to the optimum
value seems to be so low, but there is a solution. When only small changes
need to be made to the angles to arrive at the optimum solution, the muta-
tion operator is a blunt tool with which to make those changes. The values
generated by mutation span the entire range of angles, so the operator may
replace a near-optimum value such as 96° with 211° or some other unsuit-
able angle, which is a move in the wrong direction. One way of proceeding
is to adjust the mutation operator so that, at the start of the calculation, it
generates any angle with equal probability, while in the later stages of a run
the angle generated is not picked at random from across the entire range of
possible values, but is instead related to the value already in place. Rather
* Roughly. Two strings are mutated each generation on average, but since the process is sto-
chastic fewer or more strings may be mutated during a given cycle. Furthermore, there is a
chance that the same position will be chosen for mutation in each string.
than generating a completely new value for the angle, mutation generates
an offset, which adjusts the current value by a small amount. The size of the
offset is chosen randomly from within a normal distribution whose width
diminishes slowly with generation number; the effect is to focus the search
increasingly on a region around the current solution.
This is focused mutation, also known as creep, a name that is suggestive of
the way in which the mutated gene investigates the region around its cur-
rent value. Focused mutation is one of a number of ways in which we can
adapt crossover and mutation. In crossover, for example, the genes in the
child strings might be created not by swapping values between parents, but
by taking an arithmetic or geometric average of those values (provided that
the genes are real numbers). Such operators are not widely used, but the fact
that they have been proposed indicates the extent to which one has the free-
dom to “invent” operators that might be of value in solving a problem while
still remaining within the realm of the GA.
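A sketch of focused (creep) mutation for the dipole angles follows. The starting and final widths of the normal distribution are assumptions for illustration, not values from the text.

```python
import random

def creep_mutate(angle, generation, max_gen, rng,
                 start_sigma=120.0, end_sigma=5.0):
    """Add a normally distributed offset to `angle`; the width of the
    distribution shrinks linearly with generation number, so late in the
    run the search is focused on a region around the current value."""
    frac = min(generation / max_gen, 1.0)
    sigma = start_sigma + (end_sigma - start_sigma) * frac
    return (angle + rng.gauss(0.0, sigma)) % 360.0
```

Late in a run, an angle of 96° is nudged by only a few degrees rather than being replaced by an arbitrary value such as 211°.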
5.9 Encoding
5.9.1 Real or Binary Coding?
In any genetic algorithm application, the physical problem must be trans-
lated into a form suitable for manipulation by the evolutionary operators.
Choice of coding is an important part of this process.
In early work, GA strings were binary coded. Computer scientists are com-
fortable with binary representations and the problems tackled at that time
could be easily expressed using this type of coding. Binary coding is some-
times appropriate in scientific applications, but it is less easy to interpret than
alternative forms, as most scientific problems are naturally expressed using
real numbers.
When deciding what type of coding to use, one has to consider the effect
that binary coding may have on the effectiveness of the algorithm as well as
whether one type of coding is more difficult to interpret than another. For
example, suppose that we decided to use binary coding for the dipoles prob-
lem. String 1 from the original population was
103 176 266 239 180 215 69 217 85 296
{001100111 010110000 100001010 011101111 010110100 011010111 001000101
011011001 001010101 100101000}
where the gaps have been introduced to show the breaks between successive
angles, but do not form part of the string itself. We notice immediately that,
whereas in the real-valued format it was simple to spot whether two dipoles
pointed in approximately the same direction, in binary format this is a good deal
harder. The situation is worse still if the gaps between the angles are omitted:
{00110011101011000010000101001110111101011010001101011100100010101101100
1001010101100101000}
The GA operates upon binary strings using selection, mating, and muta-
tion operators in much the same way as when real number coding is used,
but a binary representation brings with it several difficulties. Consider a sec-
ond string with which we can cross the string given above:
and
When these new binary strings are translated back into real values so that
we can check what has happened, the result is
and
We see that crossover has swapped some values between the strings, just
as happened when real-valued strings were used, but at the crossing point
the values of two angles have not been swapped but changed because the cut
fell within the binary coded value for a single number; where before the real-
valued strings contained 239 and 64, they now contain 224 and 79. This altera-
tion of the values in the string could be avoided by allowing the crossover
operator to cut the binary string only at the position where one real-valued
number is divided from another, but if cuts are only permitted at those points,
the algorithm closely resembles one in which real-value coding is used
directly, and the procedure is reduced to a disguised real-value algorithm.
The evolution of the algorithm using binary strings will not be identical to
the evolution when real-valued strings are used because the effect of muta-
tion on an angle represented in binary will be different from the effect of
mutation on a real-valued angle. A binary string is mutated by selecting a
gene at random and flipping the bit, turning 0 → 1 or 1 → 0. Suppose we flip
the eighth bit from the left in this string:
In the mutated string, the first angle, in real values, has changed from 103
to 101, so the value of the angle has changed only slightly. At the start of the
GA run, large changes in the values of the genes are desirable because many
genes will be far from their optimum values, but if the groups of binary bits
that encode for a real number are long, most random changes to a binary
representation will alter the real-number value of the gene by only a small
fraction of the total possible range. In the present example, if any one of the
five rightmost bits is changed by mutation, the greatest possible change in the
angle is just 16°. Binary coding thus biases mutation toward small changes
in angle, in contrast to the mutation of real-valued angles in which mutation
changes one angle to any other with equal probability.
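This bias is easy to verify with the 9-bit encoding of 103 from the example; the helper name is mine.

```python
def flip_bit(bits, k):
    """Flip bit k (0 = leftmost) of a binary string such as '001100111'."""
    return bits[:k] + ('1' if bits[k] == '0' else '0') + bits[k + 1:]

angle = format(103, '09b')          # '001100111'
mutated = flip_bit(angle, 7)        # the eighth bit from the left
new_angle = int(mutated, 2)         # 101: a change of only 2 degrees

# largest change obtainable by flipping any one of the five rightmost bits
changes = [abs(int(flip_bit(angle, k), 2) - 103) for k in range(4, 9)]
```

Flipping any of the five rightmost bits moves the angle by at most 16°, whereas flipping the leftmost bit would move it by 256°: most single-bit mutations make only small changes.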
Table 5.11
Gray Codes of the Real Numbers 0 to 7
0 000
1 001
2 011
3 010
4 110
5 111
6 101
7 100
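Table 5.11 is the standard reflected binary Gray code, which can be generated with a single exclusive-or; consecutive integers then always differ in exactly one bit.

```python
def gray(n):
    """Reflected binary Gray code of n (n XOR n >> 1)."""
    return n ^ (n >> 1)

codes = [format(gray(n), '03b') for n in range(8)]
# ['000', '001', '011', '010', '110', '111', '101', '100'], as in Table 5.11
```

Because neighboring values differ in a single bit, a one-bit mutation can always move the decoded value to an adjacent integer, which ordinary binary coding cannot guarantee.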
Figure 5.28
A possible (nonoptimum) tour of some two- and three-starred Michelin restaurants in the
British Isles.
Figure 5.29
A schematic of a small chemical flowshop: precursors enter at one end and products leave at
the other. The circles and squares represent units, such as reactors, dryers, or centrifuges.
found, eating in each restaurant exactly once before returning to the starting
point. In order to save enough money to pay for the meals, the shortest pos-
sible route that includes every establishment must be taken. The route can be
expressed as a vector listing the restaurants in the order to be visited. The
number of possible orders is n!; thus, when n is large, an enumerative search
(in which all the routes are inspected one by one to see which is the shortest)
is not possible.
The traveling gourmet problem is in itself not of much interest to physical
and life scientists (or, more accurately, is probably of acute interest to many
of us, but well beyond our means to investigate in any practical way), but
problems that bear a formal resemblance to this problem do arise in science.
The chemical flowshop is an example (Figure 5.29).*
Chemical flowshops are widely used in industry for the medium-scale
synthesis of chemicals. In a flowshop, a number of different physical units,
such as reactor vessels, ovens, dryers, and distillation columns are placed
in a serial, or largely serial, arrangement. Rather than being devoted to the
preparation of a single chemical, many different chemicals are synthesized
one after another in a flowshop by feeding in the appropriate precursors
and reagents as the intermediates pass through most or all of the units. This
mode of operation is an effective use of the physical plant because no part
* Scheduling problems are extraordinarily common within and beyond science, so much so
that the TSP has been used as a test problem for almost every form of AI algorithm. The prob-
lem itself can be formulated in many different ways; the most obvious is to require, as here,
that the sum of the distances traveled be minimized. An equivalent requirement, but one
that hints at the many other existing ways of tackling the problem, is to regard the complete
tour as being defined by an enormous rubber band stretched from city to city. If the tour is
long, the band is stretched so its energy is high; as the length of the tour decreases so does
the energy, so the optimum tour is that of lowest energy. An advantage of this interpretation
is that by introducing “friction” at the position where the rubber band is wrapped around
an imaginary post in the center of each city, the tension in the band need not be the same
in each segment of the tour, so different portions of the tour can be partially optimized by
minimizing their energy without having to worry about what is happening in other parts of
the tour.
of the plant should stand idle for long, but it creates scheduling problems.
When making different products, the residence times of chemicals in each
unit may be very different. The synthesis of one material might require that
a particular unit be used for a period of several hours, while the preparation
of a second material may require use of the immediately preceding unit for
only a matter of minutes, thus the long-residence chemical blocks the prog-
ress of the chemical that follows it through the line.
Because the residence times of different chemicals in the units vary, the
order in which chemicals are made in a flowshop has a profound effect on
the efficiency of its operation. If the optimum production order can be identi-
fied and used, chemicals will move through the flowshop in something that
approaches lock-step fashion, with the contents of every unit in the flowshop
being ready to move on to the next unit at about the same time. Such per-
fect matching of the movement of chemicals throughout the flowshop will
almost never be achievable in practice, but if something approaching this
sort of coordinated movement can be managed, all units would be used for
a high proportion of the available time and the flowshop would be running
at high efficiency.
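For a simple serial flowshop, the time for a given production order to clear the line (the makespan, which a GA fitness function would seek to minimize) can be computed by tracking when each unit falls free. This is a minimal model of my own for illustration: it ignores intermediate storage and assumes every chemical visits every unit in sequence.

```python
def makespan(order, times):
    """times[c][u]: residence time of chemical c in unit u. A chemical
    enters a unit only when it has left the previous unit and the unit
    has been vacated by the chemical ahead of it in the order."""
    n_units = len(times[0])
    free_at = [0.0] * n_units             # when each unit next falls free
    for c in order:
        t = 0.0                           # when this chemical leaves a unit
        for u in range(n_units):
            t = max(t, free_at[u]) + times[c][u]
            free_at[u] = t
    return free_at[-1]

# Two chemicals, two units: the production order changes the makespan.
times = [[3.0, 2.0],    # chemical 0
         [1.0, 4.0]]    # chemical 1
```

Here `makespan([1, 0], times)` is 7.0 while `makespan([0, 1], times)` is 9.0: scheduling the short-first-unit chemical first keeps the second unit busier, which is exactly the effect the GA exploits on an industrial scale.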
Choice of the optimum production order is a scheduling problem. Like the
problem of the traveling gourmet, a GA solution for the flowshop consists of
an ordered list. In the flowshop, this list specifies the order in which chemi-
cals are to be made:
{6, 1, 9, 10, 7, 3, 2, 5, 4, 8}
were subjected to two-point crossover at genes 3/4 and 7/8 in both strings to
give the two new strings
After crossover, neither string is valid because both specify that some
products must be synthesized twice and other products not at all. It is tempt-
ing to allow the GA itself to sort out this problem. A large penalty could be
imposed on the fitness of any string that contains duplicate products, thus
ensuring that the string would soon be removed from the population by
the selection operator. However, in a flowshop of industrial scale, operating
a schedule that calls for the synthesis of fifteen to twenty products, a large
majority of the strings created by the evolutionary operators will be invalid,
therefore, the algorithm will spend much of its time identifying, process-
ing, and destroying strings of no value. This flooding of the population with
invalid strings will significantly slow evolution.
Instead of relying on the selection process within the GA to filter out invalid
strings, it is more efficient to inspect new strings as they are created by cross-
over and to repair any damage before the strings are fed back into the GA.
This can be done by sweeping through pairs of strings as they are formed by
crossover and checking for duplicate entries. When the first repeated chemi-
cal is found in one string, it is swapped with the first duplicate in the second
string (Figure 5.30), and the procedure is continued until all duplicate entries
have been repaired.
This procedure must always succeed because the same number of dupli-
cates will appear in both strings and, if a chemical appears twice in one
string, it must be missing entirely from the other string.
Similar repair work following mutation is rendered unnecessary if a small
change is made to the way in which mutation is applied. Instead of chang-
ing a single entry in the string to a random new value, the positions of two
randomly chosen entries are swapped, thus automatically leaving a valid
string (Figure 5.31).
Before repair:
6 1 9 1 2 10 5 5 4 8
7 3 9 10 7 3 2 4 8 6
After repair:
6 1 9 7 2 10 5 3 4 8
7 3 9 10 1 5 2 4 8 6
Figure 5.30
String repair after crossover.
6 1 9 10 7 3 2 5 4 8
6 1 5 10 7 3 2 9 4 8
Figure 5.31
Mutation in a permutation string.
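Both operators can be sketched directly. The repair function assumes, as in the text, that the two strings came from crossing two valid permutations, so a chemical duplicated in one string is missing from the other; the example strings are those of Figures 5.30 and 5.31.

```python
import random

def first_duplicate(s):
    """Index of the first entry that repeats an earlier one, else None."""
    seen = set()
    for i, x in enumerate(s):
        if x in seen:
            return i
        seen.add(x)
    return None

def repair(s1, s2):
    """Swap the first duplicate in one string with the first duplicate in
    the other until both strings are valid permutations again."""
    s1, s2 = list(s1), list(s2)
    while True:
        d1, d2 = first_duplicate(s1), first_duplicate(s2)
        if d1 is None:                     # duplicates occur in pairs, so
            return s1, s2                  # d2 is None at the same time
        s1[d1], s2[d2] = s2[d2], s1[d1]

def swap_mutation(s, rng):
    """Exchange two randomly chosen entries; the string stays valid."""
    s = list(s)
    i, j = rng.sample(range(len(s)), 2)
    s[i], s[j] = s[j], s[i]
    return s
```

Applied to the invalid pair of Figure 5.30, `repair` reproduces the two valid strings shown there, and `swap_mutation` always returns a valid permutation, so no repair step is needed after mutation.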
Figure 5.32
The effect of elitism on the fitness as a function of generation number.
new solutions are generated readily, so that even without elitism the fitness
rises rapidly, while toward the end of the run, the population contains many
similar strings of high fitness, therefore, it is not detrimental to lose one of
them.
Since elitism guarantees that the best string in any generation cannot be
lost, it can be safely combined with an increased mutation rate, especially in
the later stages of a run. A higher mutation rate toward the end of a GA run
helps to promote diversity and encourages a wider exploration of the search
space at a stage when many strings in the population will be similar, thus the
disruption caused by mutation is less of a concern than it would be earlier
in the run.
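An elitist generational step is a one-line addition to the main loop. In this sketch, `breed` stands in for the usual selection-crossover-mutation pipeline and is an assumption of the example, not part of the text.

```python
def next_generation(population, fitness, breed):
    """Copy the best string unchanged into the new population (elitism),
    then fill the remaining places with `breed`, which should apply
    selection, crossover, and mutation to the old population."""
    best = max(population, key=fitness)
    new_pop = [list(best)]                  # the elite copy survives intact
    while len(new_pop) < len(population):
        new_pop.append(breed(population))
    return new_pop
```

Because the elite copy bypasses crossover and mutation entirely, the best fitness in the population can never fall from one generation to the next, which is what makes a raised late-run mutation rate safe.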
Figure 5.33
The string fridge: Strings of various ages are stored in the fridge to provide a random selection
of information from previous generations.
It is easy to discover good building blocks, but difficult to retain them while at the same time discovering other good building blocks. A string stored in the fridge at an early point in the algorithm that contains a nugget of valuable information may be resurrected thousands of generations later and brought out to inoculate the GA population with information that, having been discovered near the start of the run, could subsequently have been lost.
String fridges contain much outdated information, and most strings that are defrosted find themselves back in a population whose average fitness far exceeds that of the population when the string was shifted into the fridge, so they are culled without delay. However, because the fridge door is rarely opened and its use involves only copying a single string in and out, its computational cost is almost zero, and for a complex problem this approach can be of value.
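One possible implementation of the fridge, assuming a simple store-and-defrost schedule; the interval, the defrost probability, and the function name are invented for illustration:

```python
import random

def maybe_use_fridge(population, fridge, best_string, gen, rng,
                     store_every=50, defrost_prob=0.02):
    """Sketch of the string fridge: occasionally copy the current best
    string into an archive, and occasionally copy a randomly chosen
    archived string back into the population, where normal selection
    will keep or cull it.  The schedule is an illustrative assumption."""
    if gen % store_every == 0:
        fridge.append(best_string[:])      # shift a copy into the fridge
    if fridge and rng.random() < defrost_prob:
        old = rng.choice(fridge)[:]        # defrost a string of random age
        population[rng.randrange(len(population))] = old
    return population, fridge
```

The fridge itself is just a list; because strings are only copied in and out, the overhead per generation is negligible, as the text notes.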
5.12 Traps
As in the application of any nondeterministic Artificial Intelligence method,
it is not safe to assume that all that is needed for a successful GA calculation
is to open the jaws of the algorithm, drop in the problem, and then wait.
Various factors must be considered to determine whether the GA might be
effective.
Most fundamental of all is the need for evolution, which requires that the
population must be permitted to run for a significant number of generations
during which it can change. In some applications it is difficult to run the
GA for long enough for this requirement to be satisfied. Typical of situations
when evolution is difficult to bring about are attempts to combine the GA
with experimental measurements, for example, by using the GA to propose
a formulation of materials that might be fused together at high temperature
to create a solid possessing potentially interesting or valuable properties.
The number of generations may be severely restricted by the need to evalu-
ate the fitness of each potential solution by performing an experiment in the
laboratory. Although in principle an experimental approach is productive, if
it is possible to run the algorithm only for a few generations, such problems
may be better tackled by other methods, and there may be little justification
for regarding the calculation as having converged.
Equally, problems encoded by very short strings that contain only two or three genes are not suitable for a GA; better methods are usually available for the optimization of a function of two or three variables. It should also
be clear that converting a very short real-valued string into its much longer
binary-coded equivalent merely disguises the underlying problem of a short
string. Writing it in binary form so that it appears more complex does not
alter the inherent simplicity of the string.
5.13.1 Evolutionary Strategies
In evolutionary strategies, a parent string produces λ offspring; the fittest of the
1 + λ individuals is selected to be the single parent for the next generation of off-
spring. There is no crossover operator in evolutionary strategies, only mutation.
Before the algorithm is allowed to start, a reasonable range for each param-
eter in the string is established, such that we are confident that the optimum
solution lies within the range covered by the parameters:
X = f(x1, x2, x3, …, xn)    (5.9)

Each offspring is then created by adding to every parameter of the parent a random value drawn from a normal distribution of mean zero and small, fixed standard deviation. This reflects the observation that genetic change among living species is based on the current chromosome (hence, the mean of zero) and that small changes to it are more likely than large ones (hence, the dependence of the size of the change on a normal distribution with a limited standard deviation). The solution Xnew, which corresponds to the new set of parameters, is then calculated.
In a (1 + 1) algorithm, the two solutions, X and Xnew, are compared, the better solution is kept, and the process repeats until a solution of sufficient quality emerges.
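A sketch of this (1 + λ) strategy for minimizing a function, with Gaussian mutation of mean zero and fixed standard deviation; the values of sigma, lambda, and the generation count are illustrative:

```python
import random

def one_plus_lambda(f, x0, sigma=0.1, lam=10, n_gens=200, seed=0):
    """(1 + lambda) evolution strategy sketch, minimizing f.  Each
    offspring is the parent plus a normally distributed change (mean
    zero, standard deviation sigma), so small changes are more likely
    than large ones; the fittest of the 1 + lambda individuals becomes
    the next parent.  Parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    parent = list(x0)
    for _ in range(n_gens):
        offspring = [[xi + rng.gauss(0.0, sigma) for xi in parent]
                     for _ in range(lam)]
        parent = min([parent] + offspring, key=f)   # "+" selection keeps the parent
    return parent

# Minimize a simple quadratic whose optimum lies at (1, 2).
best = one_plus_lambda(lambda x: (x[0] - 1) ** 2 + (x[1] - 2) ** 2, [0.0, 0.0])
```

Because the parent competes with its own offspring, the quality of the retained solution can never decrease from one generation to the next.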
5.13.2 Genetic Programming
Genetic programming (GP) should be the Holy Grail of scientific computing,
and indeed of many other sorts of problem solving. The goal of GP is not to
evolve the solution to a problem, but to evolve a computer program that when
executed will yield that solution.
The potential of such a method is very considerable since, if it can be made
to work, there would no longer be a need to write an explicit program to
solve a problem. Instead a description of the problem would be fed into the
GP program, the user would retire from the scene, and in due course a com-
plete program to solve the problem would emerge.
A computer program can be written as a sequence of operations to be
applied to a number of arguments. A GP algorithm manipulates strings just
as the GA does, but in a GP the strings are composed of fragments of com-
puter code or, more accurately, a sequence of instructions to prepare those
fragments. Each instruction codes for an operation which is applied to zero
or more variables that also form part of the string, so by constructing a GP
string as a sequence of operations and arguments, the entire string may be
interpreted as a (possibly) fully functioning program. This program is often
shown in the GP literature as a tree diagram. Generally the program is under-
stood to be in the LISP language, although, in principle, a GP could be used
to construct a program in any computer language. However, LISP programs
have special abilities, such as being able to operate on themselves to make
changes or to generate further LISP programs, which makes them well suited
to a GP application.
Unlike the GA, GP strings do not have a fixed length, so can grow or shrink
under the influence of the genetic operators. The quality of each string is
measured by unraveling it, turning it into a computer program, running the
program, and comparing the output of the program with a solution to the
problem set by the scientist. Notice that this process is almost the reverse
of the procedure that we use to run a GA. In the GA, we have available a
method by which the quality of any arbitrary solution can be calculated, but
usually have no way of knowing what the optimum value for the solution to
a problem is. In the GP, the optimum value that solves the problem is known,
but the means by which that value should be calculated is unknown.
The GP is initialized with strings that code for random sequences of
instructions and the fitness of the strings is assessed. Those strings that
generate a program whose execution gives a value closest to the correct one
receive the highest fitness; programs that offer poor solutions, or generate no
output at all, are allocated a low fitness. The genetic operators of selection,
crossover, and mutation are then applied to generate a new population of
candidate solutions.
In constructing a GP program, it is necessary not only to define GA-like
parameters, such as the population size, the type of crossover, and the num-
ber of generations, but also parameters that relate specifically to the compo-
nents of the candidate programs that the GP will build. These include:
• The terminals: These correspond to the inputs that the program will
need. Even though each candidate program will be manipulated
by the evolutionary operators, if we know in advance that pieces of
input data are certain to be required for any GP-generated program
to function correctly, we must at least ensure that the program has
access to them.
• The primitive functions: These comprise the various mathematical
and logical operations that the program may need. They will usually
include mathematical functions such as +, −, /, and *, logical functions, programming constructs such as loops, and possibly other
mathematical functions, such as trigonometric, exponential, and
power functions.
• The fitness function: This is defined by the problem to be solved. The
function is determined not by how closely the GP-generated pro-
gram gets to the correct solution when run once, but how close the
program comes to the correct answer when tackling many examples
of the same problem. The reason for this will become clear from the
example below.
Program 3: b × d + c – a = 10
If only one correct function exists, at most one of these equations can
be right, yet each program that returned the value of 10 for its output
would be allocated the same fitness. By providing a second set of input
data, say a = 2, b = 4, c = 7, d = 0, and the associated solution (–8), the fit-
ness of those programs that had found an incorrect function would be
diminished when the fitness was calculated over a range of examples.
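The idea of scoring candidate programs over several input sets can be illustrated with hand-written functions standing in for evolved programs; the candidates and the data below are invented for illustration and are not the programs of the example above:

```python
def fitness(program, cases):
    """Score a candidate program over many input/target cases.  The total
    absolute error is negated so that bigger is better; a program that
    crashes is given a very low fitness."""
    error = 0.0
    for inputs, target in cases:
        try:
            error += abs(program(*inputs) - target)
        except Exception:
            error += 1e9
    return -error

# Two hypothetical candidates that happen to agree on one input set:
candidates = {
    "b*d + c - a": lambda a, b, c, d: b * d + c - a,
    "c + d - a":   lambda a, b, c, d: c + d - a,
}

# With a single case both candidates look equally good; several cases
# (targets here follow b*d + c - a) separate the true function from an
# impostor that merely matched the first case.
single_case = [((2, 4, 7, 0), 5)]
several_cases = [((2, 4, 7, 0), 5), ((3, 2, 2, 5), 9), ((1, 2, 3, 4), 10)]
```

Evaluated on `single_case`, the two candidates tie; evaluated on `several_cases`, only the correct function keeps a perfect score.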
Example 4: Deception
A fundamental equation in biosensor analysis is

Req = Rmax aoK/(aoK + 1)    (5.12)

which gives the equilibrium response Req for the binding reaction A + B ⇌ AB. Suppose that the GP proposes instead the expression

Req = Rmax/(aoK − 1)    (5.13)
This has a form similar in some respects to the correct expression, but
several changes are required to reach that expression. Depending on the
values of the parameters, making one of these changes (e.g., changing
the minus to a plus in the denominator) might improve or worsen the
fitness. There is often very little information that indicates to the algo-
rithm whether a proposed change will be beneficial in moving toward
the optimum solution.
Analogies to this exist in nature: Human and chimp genomes are approxi-
mately 99 percent identical, so the change in the genome required to turn
a chimp into a human is 1 percent. Suppose that we were trying to evolve
the chromosome of an object that would perform consistently well when set
Figure 5.34
Potential differential equation solvers.
in this way, the entire search covers a large area, but once one member of the
flock discovers the food, the message will quickly spread throughout the
entire flock. Each bird constitutes what is known as an agent, carrying out
local searches that are loosely correlated.
A similar approach is used computationally in particle swarm optimiza-
tion (PSO). Each particle or agent within a large group moves semirandomly
across the search space, but with a bias in the direction of the best solution
found so far by that agent, pbest, and a bias also in the direction of the best
global solution found by the entire swarm, gbest. As in the GA, we need a
means by which the solution for which an agent encodes can be assessed.
Each move made by the agent is calculated from its current position and
velocity, modified by a randomly chosen amount of movement towards both
gbest and pbest, together with a small element of movement in an entirely ran-
dom direction.
v(k + 1) = v(k) + cp rp[pbest − x(k)] + cg rg[gbest − x(k)] + cr rr    (5.14)

x(k + 1) = x(k) + v(k + 1)    (5.15)

In these equations, v(k) and v(k + 1) are the particle's velocity in step k and step (k + 1), respectively; cp is the learning factor for movement toward the personal best solution; cg is the learning factor for movement toward the global best solution; cr is a random learning rate; rp, rg, and rr are random numbers chosen afresh at each step; and x(k) and x(k + 1) are the current and next positions of the particle. cp and cg typically have values in the range 1 to 2.5, while cr is in the range 0 to 0.2. To prevent particles oscillating with large amplitude in the search space (a potential problem if cp or cg is > 2), a maximum velocity is imposed on all particles.
Each agent therefore moves in a way that is determined both by the success
of its own search and the success of the entire flock. When a good solution is
found, the flock moves toward it and, because many agents are present, the
general area is then investigated thoroughly.
The effect of this process is rather similar to evolution in a GA. The initial
search covers a wide area, but is unfocused. The information that is passed
between agents causes the search to become focused around the better solu-
tions. However, the mechanism by which evolution occurs is quite different
from the GA. There is no selection or crossover, and mutation exists only
inasmuch as each particle can adjust the direction and speed with which it
moves. There is also no removal of particles, the population remains of con-
stant size and always features the same particles, but the particles do have an
individual memory of where they have been and where the most promising
positions visited by the swarm are located, thus each functions as a semi-
independent entity.
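A minimal PSO sketch along these lines, with pbest and gbest biases, a small random kick scaled by cr, and a velocity cap; all parameter values and the test function are illustrative:

```python
import random

def pso(f, dim=2, n_particles=15, n_steps=150, cp=1.5, cg=1.5, cr=0.1,
        v_max=0.5, seed=0):
    """Particle swarm optimization sketch (minimizing f).  Each particle
    is biased toward its personal best, pbest, and the swarm's global
    best, gbest, plus a small random kick scaled by cr; velocities are
    clipped to v_max to prevent large oscillations."""
    rng = random.Random(seed)
    x = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    pbest = [xi[:] for xi in x]
    gbest = min(pbest, key=f)[:]
    for _ in range(n_steps):
        for i in range(n_particles):
            for d in range(dim):
                v[i][d] += (cp * rng.random() * (pbest[i][d] - x[i][d])
                            + cg * rng.random() * (gbest[d] - x[i][d])
                            + cr * rng.uniform(-1, 1))     # random kick
                v[i][d] = max(-v_max, min(v_max, v[i][d])) # velocity limit
                x[i][d] += v[i][d]
            if f(x[i]) < f(pbest[i]):
                pbest[i] = x[i][:]
                if f(pbest[i]) < f(gbest):
                    gbest = pbest[i][:]
    return gbest

# Minimize a quadratic bowl whose minimum lies at (3, -1).
best = pso(lambda p: (p[0] - 3) ** 2 + (p[1] + 1) ** 2)
```

Note that no particle is ever removed: the same agents persist throughout, each remembering its personal best, exactly as the text describes.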
5.14 Applications
The GA is now widely used in science, and this discussion of typical appli-
cations can just touch upon a few representative examples of its use. No
attempt is made to provide a comprehensive review of published applica-
tions, but this section will give a flavor of the wide range of areas in which
the GA is now used.
There has been a notable change in the way that the GA has been used as
it has become more popular in science. Some of the earliest applications of
the GA in chemistry (for example, the work of Hugh Cartwright and Robert
Long, and Andrew Tuson and Hugh Cartwright on chemical flowshops) used
the GA as the sole optimization tool; many more recent applications combine
the GA with a second technique, such as an artificial neural network.
The use of hyphenated algorithms has been encouraged by the rapid
growth in computer power. For example, Hemmateenejad and co-workers
have combined a binary-coded GA with ab initio calculations in a study of
blood–brain partitioning of solvents.3 It is now feasible to use the GA to opti-
mize molecular structures by interpreting each GA string as a recipe for
building a molecule, generating multiple strings corresponding to possible
structures, running a quantum mechanical calculation on each structure to
determine a fitness for each molecule, and then using the normal evolution-
ary operators to select and change the structures. Numerous workers have
used the GA in combination with some means of calculating molecular ener-
gies as a route to the prediction of molecular structure. For clusters of large
numbers of atoms, the quantum mechanical calculations are demanding,
which makes the use of the GA to guide these searches only barely feasible;
this has led some workers to run the GA for only a few generations, which
renders it less effective. Among the more interesting applications are those
by Djurdjevic and Biggs,4 who have used the GA in protein fold prediction,
as have Cox and Johnston,5 while Wei and Jensen6 have addressed the ques-
tion of how to find optimal motifs in DNA sequences. Abraham and Probert7
have used the GA for polymorph prediction. Johnston’s group has been par-
ticularly active in this area, especially in the study of metal clusters.8
A number of workers have used the GA to find values for the set of param-
eters in some predefined model. Hartfield’s work on the interpretation of the
spectrum of molecular iodine is typical.9 A particular difficulty with such an
application is that the problem may well be deceptive, but because this may
not be obvious from the form of the problem, the fit may be of lower quality
than anticipated.
Computer screening allows pharmaceutical companies to test virtual drugs
against known protein structures to decide whether synthesis is financially
worthwhile. This screening takes place at all stages of drug development
because the later a drug is abandoned the greater the cost. GP has been used
to select a small group of molecular descriptors from a large set that best
explains the propensity of a drug to degradation by P450 (which is respon-
sible for the degradation of most drugs in the body). This is significant as an
important cause of the withdrawal of a drug from trials is the prevalence of
adverse side effects. One advantage of the GP approach over a neural net-
work is that the model that the GP creates is open to inspection, so it is pos-
sible to determine whether the molecular descriptors that the GP believes to
be important are in agreement with what human experts would expect.
Applications of swarm models are just starting to appear in science. In
a recent application, each agent encodes a molecular structure, initially randomly generated, from which a nuclear magnetic resonance (NMR) spectrum is calculated. This spectrum is compared with the experimentally determined spectrum of a protein, and the degree of match indicates how close
the agent is to the correct protein structure. The movement of one agent
toward another is accomplished through the sharing of structural informa-
tion, which ensures that agents that satisfactorily reflect at least part of the
structure can spread this information through the population.
The book by Kennedy and Eberhart2 is one of the few recent introductions to the field of swarm intelligence, but it also covers
genetic algorithms and genetic programming by way of setting the scene.
5.16 Problems
1. Aligning quadrupoles
Consider a quadrupole, in which positive charges and negative
charges are arranged at the ends of a cross. In the dipole problems
discussed in this chapter, it was immediately clear that a head-to-
tail arrangement of dipoles would give the optimum energy, but the
lowest energy orientation for a line of quadrupoles is less obvious.
They might orient themselves as a set of pluses so that positive and
negative charges lie as close as possible to each other along the x-axis,
or as a series of multiplication signs, which allows a greater number
of favorable interactions, each of them weaker because the charges are farther apart (Figure 5.35).
Write a GA to investigate which arrangement is the more stable and
investigate whether there is some critical ratio of the dimension of
the quadrupole to the distance apart that determines which geom-
etry is of lower energy.
2. A field of dipoles
Use two-dimensional strings to construct a GA that can find the low-
est energy arrangement of 100 identical dipoles positioned evenly
across a square.
3. Universal Indicator
Consider the formulation of a universal indicator; this is a solu-
tion whose color is pH-dependent across a wide pH range. Univer-
sal indicators usually contain several weak acids, in each of which
Figure 5.35
Two possible ways in which quadrupoles might align when positioned at equal distances along a straight line.
either the protonated or the unprotonated form (or both) are colored.
The solution must have the following two characteristics:
a. The indicator solution must be colored at all pHs and change
color to a noticeable extent when the pH changes by a small
amount, say 0.5 units.
b. Every pH must correspond to a different color.
Consider what information you would need to evolve a recipe for a good universal indicator. The visible spectra of many suitable weak acids and their pKa values can be found on the Web if you want to test your algorithm. The formulations of some typical commercial universal indicators can be found on the Web sites of chemical suppliers.
References
1. Koza, J.R., Keane, M.A., and Streeter, M.J., What’s AI done for me recently?
Genetic Programming’s human-competitive results, IEEE Intell. Systs. 18, 25,
2003.
2. Kennedy, J. and Eberhart, R.C., Swarm Intelligence, Morgan Kaufmann, San
Francisco, 2001.
3. Hemmateenejad, B., et al., Accurate prediction of the blood–brain partitioning of a large set of solutes using ab initio calculations and genetic neural network modeling, J. Comp. Chem., 27, 1125, 2006.
4. Djurdjevic, D.P. and Biggs, M.J., Ab initio protein fold prediction using evolutionary algorithms: Influence of design and control parameters on performance, J. Comp. Chem., 27, 1177, 2006.
5. Cox, G.A. and Johnston, R.L., Analyzing energy landscapes for folding model
proteins, J. Chem. Phys., 124, 204714, 2006.
6. Wei, Z. and Jensen, S.T., GAME: Detecting cis-regulatory elements using a
genetic algorithm, Bioinformatics, 22, 1577, 2006.
7. Abraham, N.L. and Probert, M.I.J., A periodic genetic algorithm with real space representation for crystal structure and polymorph prediction, Phys. Rev. B, 73, 224106, 2006.
8. Curley, B.C. et al., Theoretical studies of structure and segregation in 38-atom
Ag-Au nanoalloys, Eur. Phys. J. D. Atom. Mol., Opt. Plasma Phys., 43, 53, 2007.
9. Hartfield, R.J., Interpretation of spectroscopic data from the iodine molecule
using a genetic algorithm, Appl. Math. Comp., 177, 597, 2006.
10. Goldberg, D.E., Genetic Algorithms in Search, Optimization and Machine Learning,
Addison Wesley, Reading, MA, 1989.
11. Davis, L. (ed.), Handbook of Genetic Algorithms, Van Nostrand Reinhold, New
York, 1991.
12. De Jong, K.A., Evolutionary Computation: A Unified Approach, MIT Press, Cambridge, MA, 2006.
6
Cellular Automata
Numerous systems in science change with time or in space: plants and bacterial colonies grow, chemicals react, gases diffuse. The conventional way to model time-dependent processes is through sets of differential equations, but if no analytical solution to the equations is known and numerical integration must be used instead, these may be computationally expensive to solve.
In the cellular automata (CA) algorithm, many small cells replace the dif-
ferential equations by discrete approximations; in suitable applications the
approximate description that the cellular automata method provides can be
every bit as effective as a more abstract equation-based approach. In addi-
tion, while differential equations rarely help a novice user to develop an
intuitive understanding of a process, the visualization provided by CA can
be very informative.
Cellular automata models can be used to model many systems that vary in
both space and time, for example, the growth of a bacterial colony (Figure 6.1),
Figure 6.1
A simulation of the filamentary growth of a bacterial colony.
Figure 6.2
A computer model of the development of chemical waves in the Zaikin–Zhabotinsky reaction.1
6.1 Introduction
Cellular automata model scientific phenomena by breaking up the region
of simulation into many cells, each of which describes a small part of the
overall system. The state of each cell is defined by a limited number of vari-
ables whose values change according to a set of rules that are applied repeat-
edly to every cell. Even though the rules that are used within the algorithm
are usually simple, complex behavior may still emerge as cells alter their
state in a way that depends both on the current state of the cells and that of
their neighbors. CA are conceptually the most straightforward of the meth-
ods covered in this book, yet, despite their simplicity, they are versatile and
powerful. In fact, it is possible to demonstrate that a CA model is capable of
Cellular Automata 175
Finite
Clock state Feedback
variables
Output
Figure 6.3
A finite state automaton.
“on” or “off.” The cell also incorporates feedback, thus some of the input is
generated by the cell reading its own output from the previous cycle.
The cells in a CA use transition rules to update their states. Every cell uses an
identical set of rules, each of which is easy to express and is computationally
simple. Even though transition rules are not difficult either to understand or
to implement, it does not follow that they give rise to dull or pointless behav-
ior, as the examples in this chapter illustrate.
0 0 0 1 0 0 0
1. The state of the cell in the next cycle is determined by adding together
the current values of the states of the two cells immediately above it.
0 0 1 1 0 0
So as to make clear the evolution of the CA, we will write out all of
the cycles, one after the other, so the first two cycles are
0 0 0 1 0 0 0
0 0 1 1 0 0
0 0 0 1 0 0 0
0 0 1 1 0 0
0 0 1 2 1 0 0
0 0 0 1 0 0 0
0 0 1 1 0 0
0 0 1 2 1 0 0
0 1 3 3 1 0
0 1 4 6 4 1 0
1 5 10 10 5 1
1 6 15 20 15 6 1
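The summation rule can be sketched as follows; the lattice is simply made wide enough that the edges never interfere over the cycles run:

```python
def next_cycle(row):
    """Each cell in the next cycle is the sum of the two cells immediately
    above it (the rule given in the text)."""
    return [row[i] + row[i + 1] for i in range(len(row) - 1)]

# A single cell in state 1, padded with zeros on either side:
row = [0] * 8 + [1] + [0] * 8
for _ in range(6):
    row = next_cycle(row)

nonzero = [x for x in row if x != 0]
```

After six cycles the nonzero states are 1 6 15 20 15 6 1, the binomial coefficients of the final row shown above.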
H(v+1) − 2yH(v) + 2vH(v−1) = 0
Figure 6.4
Cellular automata: the algorithm.
Figure 6.5
The arrangement of cells in a one-dimensional cellular automaton.
around each cell. Since all cells in a regular lattice are equivalent, the neigh-
borhood in such a lattice must always include the same number of cells. In
an irregular lattice, this requirement is difficult to meet.
In the simplest CA, cells are evenly spaced along a straight line (Figure 6.5)
and can adopt one of only two possible states, 0 and 1. More flexibility is
gained if the cells are arranged in a two-dimensional rectangular or hex-
agonal lattice and the number of permissible states is increased. We could
further increase flexibility by including in the cell’s state parameters whose
values can vary continuously. This is a natural extension to the model to
consider in scientific applications, in which the data very often can take any
value within a real number range. The number of possible states for each
cell is then infinitely large. A model in which there is no limit to the number
of possible states that a cell may access is not a true CA, but is nevertheless
often still referred to, and can be treated as, a CA.
1. Sum the current state of a cell and that of its two neighbors;
2. If the sum equals 0 or 3 the state of the cell in the next cycle will be 0,
otherwise it will be 1.
Figure 6.6 shows the first few cycles in the evolution of this CA. Each
successive cycle is drawn higher up the page, thus the starting state is at
Figure 6.6
The evolution of a one-dimensional, two-state cellular automaton. The starting state is at the
bottom of the figure and successive generations are drawn one above the other. The transition
rules are given in the text.
the bottom of the figure and “time” (the cycle number) increases from
the bottom upward. Cells that are “on” are shown in black, while those
that are “off” are shown in white.
From a single starting cell, an intricate pattern emerges.*
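The two transition rules can be sketched as follows, treating cells beyond the edges of the lattice as permanently "off" (an assumption about the boundaries):

```python
def step(cells):
    """Apply the two transition rules to every cell of a one-dimensional,
    two-state CA: sum the cell and its two neighbors; the new state is 0
    if the sum is 0 or 3, otherwise 1."""
    padded = [0] + cells + [0]          # cells beyond the edge count as 0
    new = []
    for i in range(1, len(padded) - 1):
        s = padded[i - 1] + padded[i] + padded[i + 1]   # rule 1
        new.append(0 if s in (0, 3) else 1)             # rule 2
    return new

width = 31
cells = [0] * width
cells[width // 2] = 1                   # a single "on" cell to start
rows = [cells]
for _ in range(10):
    cells = step(cells)
    rows.append(cells)

# Draw the history with the starting state at the bottom, as in Figure 6.6:
picture = "\n".join("".join("#" if c else "." for c in r)
                    for r in reversed(rows))
```

Printing `picture` reproduces the triangular pattern of Figure 6.6 from a single starting cell.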
6.3.3 Neighborhoods
6.3.3.1 The Neighborhood in a One-Dimensional Cellular Automaton
The future state of any cell upon which we focus our attention, known as the
target cell, is determined from its current state and that of the cells that lie in
its neighborhood, hence, we must define what is meant by the neighborhood.
In one dimension, it is easy to identify neighbors: If we assume that the lat-
tice is infinite, each cell in a one-dimensional CA has both a left and a right
neighbor, so the number of immediate neighbors is two. The neighborhood
* The pattern is a set of Sierpinski triangles; these are triangles in which an inner triangle,
formed by connecting the midpoints of each side, has been removed.
** The Game of Life is run on a two-dimensional square lattice; each cell is either "dead" or "alive." Among its rules: 3. If an unoccupied cell has exactly three neighbors, it becomes a birth cell and is occupied in the next cycle.
Figure 6.7
The neighborhood in a one-dimensional cellular automaton. Usually this includes only the
immediate neighbors, but it can extend farther out to include more distant cells.
includes the target cell itself, so that the state of the target cell is fed back to
help determine its state in the next cycle (Figure 6.7).
The neighborhood may be expanded to include cells that do not actually
touch the target cell, so that the states of more distant cells can influence
the future state of the target. The farther away a cell is from the target cell,
the less will be its influence on the target. Therefore, some function, which
might be linear, Gaussian, or exponential, can be used to define how much
influence cells have depending on their distance away. In the majority of
applications, only neighbors that actually touch the target cell appear in the
transition rules.
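In two dimensions, the common neighborhoods of Figure 6.8 can be expressed as lists of relative offsets; the definitions of "axial" and "radial" below are assumptions about the figure:

```python
def neighborhood(kind, r=1):
    """Relative (dx, dy) offsets of the cells in a named neighborhood of
    range r about a target cell at (0, 0); the target itself is excluded.
    The 'axial' and 'radial' definitions are illustrative assumptions."""
    offsets = []
    for dx in range(-r, r + 1):
        for dy in range(-r, r + 1):
            if dx == 0 and dy == 0:
                continue                   # exclude the target cell
            if ((kind == "moore")
                    or (kind == "von_neumann" and abs(dx) + abs(dy) <= r)
                    or (kind == "axial" and (dx == 0 or dy == 0))
                    or (kind == "radial" and dx * dx + dy * dy <= r * r)):
                offsets.append((dx, dy))
    return offsets
```

With r = 1 the von Neumann neighborhood contains the four axial neighbors and the Moore neighborhood all eight surrounding cells.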
Figure 6.8
Possible neighborhoods in a two-dimensional cellular automaton: (a) von Neumann, (b) Moore, (c) axial, and (d) radial.
The state of the cell in the next cycle is selected at random from all
possible states.
Figure 6.9
Successive states of a two-dimensional cellular automaton in which the state of every cell varies randomly.
When such a transition rule is applied, the state of each cell and, there-
fore, of the entire system varies completely unpredictably from one cycle to
the next (Figure 6.9), which is unlikely to be of much scientific interest. No
information is stored in the model about the values of the random numbers
used to determine the next state of a cell, thus once a new pattern has been
created using this rule there is no turning back: All knowledge of what has
gone before has been destroyed. This irreversibility, when it is impossible to
determine what the states of the CA were in the last cycle by inspecting the
current state of all cells, is a common feature if the transition rules are partly
stochastic. It also arises when deterministic rules are used if two different
starting patterns can create the same pattern in the next cycle.
The behavior of CA is linked to the geometry of the lattice, though the
difference between running a simulation on a lattice of one geometry and a
different geometry may be computational speed, rather than effectiveness.
There has been some work on CA of dimensionality greater than two, but the
behavior of three-dimensional CA is difficult to visualize because of the need
for semitransparency in the display of the cells. The problem is, understand-
ably, even more severe in four dimensions. If we concentrate on rectangular
lattices, the factors that determine the way that the system evolves are the
permissible states for the cells and the transition rules between those states.
It might seem that transition rules that are predominantly random would
not give rise to interesting behavior, but this is not entirely true. Semiran-
dom rules have a role in adding noise to deterministic simulations and, thus,
leading to a simulation that is closer to reality, but even without this role
such rules can be of interest.
Figure 6.10
Random rules giving rise to a propagating wave front.
Figure 6.11
A random walk by a particle across a square grid. The most recent movements are shown
in white.
states of all its equivalent neighbors, but to the state of a randomly chosen
neighbor. An alternative would be to move away from an entirely synchro-
nous update of every cell and instead update each cell after some random
time step, chosen separately for each cell, has passed.
Figure 6.12
The many-to-one mapping that may arise in voting rules. Several different combinations of states may give rise to the same state in the next cycle.
Figure 6.13
The coagulation of cells in a five-state cellular automata model as a result of the use of voting rules.
Figure 6.14
Coagulation in a two-state system with random voting.
Figure 6.15
Diffusion-limited aggregation (DLA).
Figure 6.16
Cellular automata (CA) using the transition rules given in Example 2, but without special steps
being taken at the boundaries to permit development of the pattern to continue.
a very large number of generations, the colony may grow to such an extent
that it eventually comes into contact with a boundary, but by starting with
a sufficiently large grid, the simulation can be run without the boundaries
being of consequence.
If the boundaries might affect the simulation, several approaches are pos-
sible. In the first, “brute force” method, we simply try to move the boundaries
still farther away, accepting as we do so that this increase in the scale of the
CA will slow the execution. This is cheating because we are dodging the issue
of how to deal properly with the boundaries, but may be feasible if the key
region of the simulation can be positioned close to the center of the lattice so
that any anomalies that could arise at the boundaries will not affect the global
behavior of the CA. The growth of bacterial colonies falls into this category.
A second way to deal with boundaries is to eliminate them. This can be
achieved by wrapping the lattice around on itself as we did with the
self-organizing map, generating a ring in one dimension or a torus in
two dimensions. This is an initially attractive approach, but potentially problematic.
Figure 6.17
A predator and prey simulation. The population of gazelles (the higher line) has been scaled
down since, in reality, it would be many times that of the lions.
Figure 6.18
A cellular automata simulation box using periodic boundaries.
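Periodic boundaries of the kind shown in Figure 6.18 are usually implemented with modular arithmetic on the cell indices. A minimal sketch follows; the function name and the choice of a Moore neighborhood are assumptions for illustration:

```python
def neighbors_periodic(i, j, n):
    """Moore neighborhood of cell (i, j) on an n x n lattice with
    periodic boundaries: indices wrap modulo n, so cells on one edge
    are neighbors of cells on the opposite edge."""
    return [((i + di) % n, (j + dj) % n)
            for di in (-1, 0, 1)
            for dj in (-1, 0, 1)
            if (di, dj) != (0, 0)]

corner = neighbors_periodic(0, 0, 5)  # the corner wraps to row/column 4
```

The modulo operation is all that is needed to turn a bounded grid into a torus: a particle leaving the right edge simply reappears at the left.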
1. If the current state of a cell is k, where 0 < k < N – 1, the state of the
cell in the next cycle is k + 1.
2. If the current state of a cell is N – 1, its state in the next cycle is 0.
These rules mimic the spontaneous first order reaction of a chemical (a cell
in state x) that transforms into a different chemical (the cell in state x + 1).
Diffusion rules relate the future state of the target cell to those of neighbor-
ing cells. This mimics a second order reaction:
Figure 6.19 shows an example of the sort of pattern that results from the
application of these rules, with a disturbance moving out from a central
region in a sequence of waves with constant wavelength (equal to the number
of accessible states). In a lattice in which most cells were given an initial state
of 0, a small number in a block at the center were given an initial state of 1. A
small amount of random noise has been added to the rules by applying them
at each cell with a probability of 0.98 and the effect of this can be seen as the
waves gradually become more disorganized and diffuse; eventually they will
dissolve into the background noise. In this simulation, infinity pool bound-
ary conditions mean that cells at the edge of the simulation are lost, thus the
simulation has the appearance of being run on a lattice of infinite size.
With careful choice of rules, excited state CA can produce convincing simula-
tions of the spirals and waves that can be produced by oscillating reactions.
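The rule cycle described above can be sketched as follows. This is a Greenberg-Hastings-style approximation, not the book's exact rule set; the von Neumann neighborhood, the wrap-around boundaries (in place of the infinity pool boundaries used for Figure 6.19), and all parameter values are illustrative assumptions:

```python
import random

def excitable_step(grid, n_states, p_apply=0.98):
    """One cycle of a simple excitable-media CA.  A resting cell
    (state 0) becomes excited (state 1) if any von Neumann neighbor is
    excited; a cell in state k (0 < k < N-1) advances to k+1; a cell in
    state N-1 returns to rest.  Each rule fires with probability
    p_apply, which injects the random noise described in the text."""
    size = len(grid)
    new = [row[:] for row in grid]
    for i in range(size):
        for j in range(size):
            if random.random() > p_apply:
                continue                       # noise: rule not applied
            k = grid[i][j]
            if k == 0:
                nbrs = [grid[(i - 1) % size][j], grid[(i + 1) % size][j],
                        grid[i][(j - 1) % size], grid[i][(j + 1) % size]]
                if 1 in nbrs:
                    new[i][j] = 1              # excited by a neighbor
            elif k < n_states - 1:
                new[i][j] = k + 1              # advance through states
            else:
                new[i][j] = 0                  # return to rest
    return new

random.seed(0)
N = 5
grid = [[0] * 21 for _ in range(21)]
grid[10][10] = 1                               # central disturbance
for _ in range(40):
    grid = excitable_step(grid, N)
```

Run for enough cycles, the excitation propagates outward as rings whose spacing is set by the number of states N, and the 0.02 chance of a rule not firing gradually roughens the wave fronts, as in Figure 6.19.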
(a) (b)
Figure 6.19
A disturbance in an excited state cellular automata (CA).
(c) (d)
Figure 6.19 (Continued)
A disturbance in an excited state cellular automata.
6.8 Applications
Because CA are well suited to the description of temporal phenomena, it
will be no surprise that an early application of CA in science was in hydrodynamic
flow. Wolfram, who has been an enthusiastic advocate of CA, has
used them successfully in the simulation of turbulent flow, a demanding
area for simulations of any type. Indeed, Wolfram's work covers a range of
topics in physics, and uptake of the method among scientists owes much to
his work.4
The CA is a powerful paradigm for pattern formation and self-organization,
an area of increasing importance in nanotechnology. CA have not yet
been extensively used in nanotechnology applications, though their use in
quantum dot applications is growing.
6.10 Problems
1. Bacterial remediation
CA models are well suited to the modeling of bacterial remediation.
The first step is to build a model of a colony of bacterial cells that can
reproduce and grow. In doing so, you will need to take into account
several factors such as:
1. The need of each cell to consume a certain amount of nutrient
per time step to stay alive.
2. Diffusion of nutrient from the surrounding environment into the
region of the bacterial cells to replenish that consumed by the
bacteria.
3. The time required before a cell is mature enough to divide and
create a new cell.
4. Any dependence of the rate of aging of the cells on the level of
nutrient; in particular, the possibility that a bacterial cell may
enter a state of “suspended animation” if the nutrient concentra-
tion falls below a certain level.
Construct a two-dimensional CA to model bacterial growth and
investigate the conditions under which your model will give rise to:
1. Rapid, regular growth leading to a circular colony.
2. Irregular growth leading to a fractal colony.
3. Sparse growth leading eventually to the colony dying.
Also investigate the behavior of the colony in the presence of a non-
uniform concentration of pollutant. Assume that the colony can
metabolize the pollutant and that the presence of the pollutant (1)
inhibits or (2) promotes growth of the colony.
You may wish to compare the results of your simulation with Ben-
Jacob’s experimental results.9 Virtually all of the behavior described
by Ben-Jacob and co-workers can be effectively modeled using CA.
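As a starting point for this problem, the factors listed above might be combined into a sketch such as the following. Every rule and parameter here (the consumption rate, division age, dormancy threshold, and the crude relaxation scheme standing in for nutrient diffusion) is a hypothetical choice to be refined, not part of the problem statement:

```python
import random

def diffuse(nutrient, rate=0.2):
    """Crude nutrient diffusion: each site relaxes toward the mean of
    its four neighbors (periodic boundaries)."""
    n = len(nutrient)
    new = [row[:] for row in nutrient]
    for i in range(n):
        for j in range(n):
            mean = (nutrient[(i - 1) % n][j] + nutrient[(i + 1) % n][j] +
                    nutrient[i][(j - 1) % n] + nutrient[i][(j + 1) % n]) / 4.0
            new[i][j] += rate * (mean - nutrient[i][j])
    return new

def step(age, nutrient, eat=0.05, divide_age=5):
    """One generation: a live cell (age >= 0) eats and ages; with too
    little nutrient it goes dormant (stops aging) rather than dying;
    once mature it divides into a randomly chosen empty neighbor."""
    n = len(age)
    new_age = [row[:] for row in age]
    for i in range(n):
        for j in range(n):
            if age[i][j] < 0:
                continue                      # empty site
            if nutrient[i][j] < eat:
                continue                      # "suspended animation"
            nutrient[i][j] -= eat             # consume nutrient
            new_age[i][j] += 1                # age by one time step
            if new_age[i][j] >= divide_age:   # mature enough to divide
                di, dj = random.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
                ni, nj = (i + di) % n, (j + dj) % n
                if age[ni][nj] < 0:
                    new_age[ni][nj] = 0       # new daughter cell
                    new_age[i][j] = 0         # parent starts over
    return new_age, diffuse(nutrient)

random.seed(2)
n = 15
age = [[-1] * n for _ in range(n)]            # -1 marks an empty site
age[7][7] = 0                                 # a single founder cell
nutrient = [[1.0] * n for _ in range(n)]
for _ in range(60):
    age, nutrient = step(age, nutrient)
colony_size = sum(1 for row in age for a in row if a >= 0)
```

Varying the initial nutrient field, the diffusion rate, and the consumption rate is one route to exploring the regular, fractal, and dying regimes asked for above; a pollutant field that adds to or subtracts from the effective nutrient level covers the final part of the problem.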
References
1. Field, R.J., Noyes, R.M., and Körös, E., Oscillations in chemical systems. II. Thorough
analysis of temporal oscillation in the bromate-cerium-malonic acid system,
J. Am. Chem. Soc., 94, 8649, 1972.
2. Gardner, M., Wheels, Life and Other Mathematical Amusements, Freeman, New
York, 1993.
3. Wolfram, S., Statistical mechanics of cellular automata, Rev. Mod. Phys., 55, 601,
1983.
4. Wolfram, S., Computation theory of cellular automata, Commun. Math. Phys.,
96, 15, 1984.
5. Chuong, C.M., et al., What is the biological basis of pattern formation of skin
lesions? Exp. Dermatol., 15, 547, 2006.
6. Deutsch, A. and Dormann, S., Cellular Automaton Modeling of Biological Pattern
Formation, Birkhauser, Boston, 2005.
7. Wu, P.F., Wu, X.P., and Wainer, G., Applying cell-DEVS in 3D free-form shape
modeling, Cellular Automata, Proceedings Lecture Notes in Computer Science
(LNCS), Springer, Berlin, 3305, 81, 2004.
8. Kier, L.B., Seybold, P.G., and Cheng, C-K, Modeling Chemical Systems Using Cel-
lular Automata, Springer, Dordrecht, 2005.
9. Ben-Jacob, E., Cohen, I., and Levine, H., Cooperative self-organization of microorganisms,
Adv. Phys., 49, 395, 2000.
7
Expert Systems
SHRDLU knew enough about the properties of blocks and what could be
done with them, including the meaning of phrases such as "on top of" and
"pick up," that it could not only determine whether a move of a block
proposed by it or by a user was feasible, given the current positions of all
the blocks, but could also respond correctly to queries from the user
about the current location and relative positions of the blocks. This led some
researchers to argue that, in a sense, the program showed "understanding."
However, this view suggests that understanding can be measured by observing
behavior, which is rather a weak requirement for intelligence,* and few in
the AI community would now argue that SHRDLU understood at all.
* The way in which birds flock together seems to imply some sophistication in both decision
making and communication, but in fact, as we saw in Chapter 5, flocking can arise from the
application of a few very simple rules and requires no intelligence. This is not to suggest that
the mechanism by which real birds flock is through application of the three rules mentioned
in that chapter, but does tell us that what we perceive as intelligent behavior need not be
evidence of deep thinking.
Programs such as SHRDLU were written at a time when some people felt
that the long-term goal for AI-based software should be the development
of a tool that could solve a problem that was completely new to it, work-
ing virtually “from scratch.” This is a hugely ambitious goal, and even now,
several decades after the emergence of the first primitive expert systems,
computer scientists are far from achieving it. Though researchers have made
rapid advances in building multipurpose AI-driven robots equipped with
sensors that provide good vision, and endowing the computers that run
them with advanced reasoning, no silicon being comes near to matching the
performance of humans.
In fact, expert systems are almost the exact opposite of a machine that
could solve a problem in some arbitrary area from scratch. They focus on
a single specialist topic and know almost nothing outside it. They make up
for this blinkered outlook on life by possessing a knowledge of the topic
that few humans could match and then complement it with a reasoning abil-
ity that mimics that of a human expert, allowing them to make deductions
from data supplied by the user, so that their naming as an “expert” system
is appropriate.
7.1 Introduction
Expert systems are problem-solving tools, so are very practical in nature.
Each ES is written to meet the need for advice in a specific area. Software
whose role is to provide practical advice, be it advising a worker on the choice
of a pension plan or helping a scientist to interpret nuclear magnetic reso-
nance spectra, must contain at least two components: a database that con-
tains the relevant background information, and rules that define the context
of the problem and provide a framework that allows the software to manipu-
late data. Given both components, computers can manage many tasks that
require specialized knowledge, such as providing investment advice or play-
ing a competent game of chess.
Chess intriguingly combines simplicity in its rules and goal with complex-
ity in its tactics and play, so it is a favorite topic of many in the AI community.
An average Computer Science undergraduate could write an ES whose role
was to suggest the next move to a novice chess player during a game. As
both the rules of chess and its aim are simply stated and unambiguous, it is
straightforward from the current state of the playing pieces on a chessboard
to work out all the possible moves. For each legal move, the rules of chess
can then be applied a second time to work out an opponent’s next move.
Each combination of a user move followed by an opponent’s response can be
assessed to see whether the user’s move is advantageous, perhaps as judged
by whether the user gains an opponent’s piece.
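The move-and-response assessment just described is a two-ply lookahead. A generic sketch follows, shown on a toy take-1-2-or-3-counters game rather than chess; all the function names and the toy game itself are illustrative assumptions:

```python
def best_move(state, legal_moves, apply_move, score):
    """Two-ply lookahead: for each of our legal moves, assume the
    opponent replies with the move that is worst for us, and choose
    the move whose worst-case outcome scores best."""
    best, best_value = None, None
    for move in legal_moves(state):
        after = apply_move(state, move)
        replies = legal_moves(after)
        if replies:
            value = min(score(apply_move(after, r)) for r in replies)
        else:
            value = score(after)              # opponent has no reply
        if best_value is None or value > best_value:
            best, best_value = move, value
    return best

# Toy game standing in for chess: a pile of counters, each move takes
# 1, 2, or 3, and whoever takes the last counter wins.  A pile that is
# empty after the opponent's move means the opponent has just won, so
# it scores badly from our point of view.
def moves(pile):
    return [t for t in (1, 2, 3) if t <= pile]

def take(pile, t):
    return pile - t

def evaluate(pile):
    return -1 if pile == 0 else 0

choice = best_move(5, moves, take, evaluate)  # takes 1, leaving a pile of 4
```

From a pile of five, taking one counter is the only move that denies the opponent an immediate win, and the two-ply search finds exactly that.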
Expert Systems 205
person in the field. The knowledge that a human expert possesses may be
sufficiently specialized that it cannot readily be transferred to other peo-
ple. In addition, experts in some fields may be so rare that it is prudent to
take precautions against the possibility that their knowledge might be lost
if the expert retires, dies, or runs off to join one’s competitors. Therefore, it
is not surprising that computer scientists long ago spotted an opportunity;
if a human expert is not always available when needed, it might be possible
instead to access expert knowledge through a computer keyboard if a silicon
equivalent could be constructed.
Expert systems are the most important application within a field known
as knowledge-based systems. This broad area includes tools that manipulate
symbolic knowledge as well as those that can handle only statistical or
mathematical data. At the dumb end of the spectrum, the simplest knowl-
edge-based systems have a limited reasoning ability, or none at all, and
consist of little more than a large database of information about some par-
ticular topic, accessible through a simple query system. They behave like
an electronic card index that can instantly reveal which team won the Stan-
ley Cup in 1980.* A true expert system has reasoning ability; this allows it
to interact with the user in what appears to be an intelligent manner within
the domain of its expertise, and take advantage of rules and relationships
in its database to make deductions and provide advice in response to user
input.
Example 1: Holidays
Here is an imagined conversation between an ES and a scientist:
Biologist: I’ve been working too hard. I need a break.
Expert System: Have you chosen where to go?
B: No
ES: Do you like warm weather?
B: Yes
ES: Do you like to lie on a beach?
B: No
ES: Do you like good food and wine?
B: Definitely
ES: France or Italy would be suitable. Do you like art?
B: Yes
ES: I suggest Florence for your holiday.
This is a fairly mundane exchange. You would not need years of experi-
ence before you were able to decide for yourself where you should go on a
holiday, so there is little commercial potential for an expert system advising
overworked biologists on how to spend their summer break. Nevertheless,
this simple interaction gives us an early insight into how an expert system
appears to the user. The system has provided one half of a conversation, elic-
iting information from the user through a sequence of focused questions as
* New York Islanders.
the options of a holiday location are narrowed down. The user is eventually
offered a recommendation based on the system’s analysis of the information
that the user has provided, combined with whatever information the soft-
ware already contained.
The interaction between a user and the ES resembles the kind of conver-
sation that might take place between a nonspecialist and a human expert.*
Within the task domain, not only should the conversation proceed along
much the same lines as it would if two humans were talking, but an ES
should also approximately mirror the decision-making processes of a
human expert and at least match their level of competence. Some research-
ers have taken the view that it is not sufficient that the ES should only give
the appearance of working like a human; they argue that emulation requires
that the software should as far as possible actually follow the same sort of
deductive pathway that a person would take. This is a far stiffer require-
ment than that an ES should appear from the outside to be reasoning “along
human lines” and is rather difficult to implement, given that our knowledge
of how the human brain operates is still fragmentary, thus making impos-
sible a proper judgment of the degree to which the two types of reasoning
match. Nevertheless, if we at least try to reproduce human reasoning in the
ES, this may make it more likely that the reasoning of the system will be
robust and predictable.
Notice how the conversation above reveals that the system has made an
attempt to reason logically by finding out what the user considers to be
important before it makes any recommendations. This approach is much
more productive than a random search in which the interaction might run
as follows:
with possible holiday destinations extracted by the “expert” system from its
database and thrown out at random. An exhaustive search is no better. In
this approach, the system might first propose all the locations it knows about
that start with A, then suggest all those that start with B, and so on. It is clear
* Although most expert systems are designed to advise fairly unsophisticated users, some act
instead as assistants to established human experts; DENDRAL, which we introduce shortly,
falls in this category. Rather than replacing the expert, these systems enhance their pro-
ductivity. Assistants of this sort have particular value in fields, such as law or medicine, in
which the amount of potentially relevant background material may be so large that not even
a human expert may fully be cognizant with it.
• Provide advice that at least matches in reliability the advice that would
be provided upon analysis of the same data by a human expert.
• Interact with the user through a natural language interface.
• When required, provide justification for, and an explanation of, its
decisions and advice.
• Deal effectively with uncertainty and incomplete information.
Unlike methods, such as neural networks, which can tackle any one of a
wide range of tasks provided that they are given suitable training, each ES
can work in just one field, but the range of possible fields is large. Working
expert systems provide instruction for car mechanics, remote medical diag-
noses for nurses, and help in the diagnosis of faults in machinery. They are
valuable in control, interpretation, prediction, and other types of applica-
tions; in each case though, an ES must be built specifically for the desired
application.
As Table 7.1 shows, the advantages of a well-constructed expert system are
considerable, but anyone choosing to build an ES faces significant challenges.
The goal is to construct a software tool that emulates human reasoning, is
fast in operation, reliable, and at least matches the performance of human
experts in an area where few humans are experts. This is asking a lot. No ES
is yet able to match humans across a large and random selection of areas, but
a well-constructed ES can offer a passable impression of a human expert in
a particular area.
Table 7.1
Some Advantages and Disadvantages of Silicon-Based Experts
Figure 7.1
The components of an expert system: user interface, interpreter, inference engine, knowledge base, knowledge base editor, and case-based data.
Figure 7.2
The knowledge domain is an area within what is usually a much broader problem domain.
pV = nRT
The American Civil War lasted from 1861 until 1865
The world is round
π = 3.14159
Streetlights show D-line emission from excited sodium atoms at
wavelengths of 589.0 nm and 589.6 nm.
Table 7.2
Typical Nondomain Data in an Expert System Based in a Laboratory

Instrument availability: Types of analytical instruments within the laboratory and their current status.
Instrument specification: Operating ranges of instruments and their maximum throughput in samples per hour.
Instrument configuration: Types of solvents that are available for running HPLC, sampling accessories on hand for IR, detectors available for GC, and the sensitivity toward particular analytes of each detector and instrument.
Regulatory procedures: Information relating to standards and procedures for analytical measurements that determine how analyses must be performed to meet legal requirements.
Scheduling data: Time required for each analysis.
boundaries of that domain. The reason is not only the need to include the gen-
eral knowledge mentioned above, but also that the ES may need to tell the user
how to use the expert advice that it will provide. An ES that advises techni-
cians in an analytical laboratory in which river water samples are analyzed for
their metal ion content might contain a variety of operational data that relate
not to the general problem of determining heavy metal content, but to the spe-
cific problem of doing so in the laboratory in which the ES is operating. This
might include detailed data on instrumentation and procedures (Table 7.2).
may propose a solution with some degree of confidence. Heuristics are closely
linked with fuzzy data, which we shall cover in the next chapter.
7.5.1.3 Rules
An expert system does much more than extract information from a database,
format it, and offer it up to the user; it analyzes and processes the infor-
mation to make deductions and generate recommendations. Because an ES
may be required to present alternative strategies and give an estimate of the
potential value of different courses of action, it must contain a reasoning
capacity, which relies on some sort of general problem-solving method.
One of the simplest and most powerful ways to manipulate information in
the knowledge base is through rule-based reasoning, which relies on production
rules. A production rule links two or more items of information in a structure
of the form:
IF <condition> THEN <conclusion>
The information to which the rule is applied might be extracted from the
knowledge base, it might be provided by the user in response to questions
from the ES, or it may be provided by combining the two. An expert system
that uses rule-based reasoning is, quite reasonably, known as a rule-based
system. This is the most widely used form of expert system in science, and it
is on this type of system that this chapter concentrates.
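The machinery of a rule-based system can be sketched with a simple forward-chaining loop. The engine below is a minimal illustration, and the two sample rules (a low boiling point implies volatility, volatility implies low molecular weight) are illustrative wording, not a real knowledge base:

```python
def forward_chain(rules, facts):
    """Minimal forward chaining over production rules.  Each rule is a
    (conditions, conclusion) pair; any rule whose conditions are all in
    the fact pool fires and adds its conclusion, and the loop repeats
    until no new fact can be deduced."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conclusion not in facts and all(c in facts for c in conditions):
                facts.add(conclusion)         # the rule fires
                changed = True
    return facts

# Two illustrative IF ... THEN ... rules in (conditions, conclusion) form.
rules = [
    (["boiling point below 50C"], "sample is volatile"),
    (["sample is volatile"], "sample has low molecular weight"),
]
deduced = forward_chain(rules, ["boiling point below 50C"])
```

Note that the rules are plain data: one can be added, deleted, or edited without touching the engine, which is exactly the standalone property of production rules described later in the chapter.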
atypical behavior. Such examples could be used for comparison with current
data to determine whether a similar situation has arisen in the past. If simi-
larities are found, the case library can provide information about the action
that was taken in the past example and the result of that action.
1. There is a backlog of samples for analysis. To reduce the
backlog, first process those samples whose analysis can be
completed in the shortest time.
2. All samples sent by the hospital should be processed without
delay.
the ES has used if the user is suspicious of, or baffled by, the advice that the
ES has offered and wants to discover the logic behind it.
The dialogue in which the ES engages is not preprogrammed; there is no
set sequence of questions that the system is sure to ask, no matter how the
user responds. Instead, after each response, the system analyzes the infor-
mation that it now possesses and then decides what to do next.
Although the user interface just acts as the messenger for the interpreter,
sound and logical design of the interface is crucial in giving the user confi-
dence in the system. The user interface employs a conversational format, so
the interpreter and user interface must between them have some knowledge
about how information should be presented to humans. In the most effective
expert systems, the interactions between software and user take into account
the level of expertise of the user.
The interface may be self-adaptive, noting the identity of the user each
time they log in to use the system and tracking the interactions that follow,
learning about the level of expertise of the user, and building up a personal
profile of them so that the approach of the ES can be tailored to them indi-
vidually. This is sometimes apparent in the Help system on a PC, which is
usually based around a type of ES. Since the level of knowledge, experience,
and confidence among PC users varies widely, if a Help system is to be of the
greatest value, it is wise to construct it so that it can adjust its approach and
presentation to match the level of expertise of the user.
Together, the user interface and the explanation system, therefore, are
important components in the ES. A user who finds the software difficult to
use, or confusing, may become irritated with the system or be misled by it and
stop using it. Worse still, they may misinterpret the advice that it provides.
IF <the patient's Body Mass Index is well above recommended levels>
THEN <recommend weight loss to reduce the risk of a heart attack>
Lose weight.
Buy an umbrella.
Use GC-MS (gas chromatography-mass spectrometry) to analyze the
sample.
Or a conclusion:
The conclusion could be that a certain piece of data can now be demon-
strated to be true (“The sample has a melting point below 10°C”) and this
might just be added to the ES’s own database. Alternatively, the conclusion
could be an instruction to the scheduler to change the order in which it will
assess rules in the database. It follows that not every conclusion will be
reported. Most will form just one step in a chain of reasoning, only the end
of which will be seen by the user.
Using its pool of facts, the ES can attempt to reason. Humans use several
different methods to reach a conclusion. We may reason logically:
Eating fish tonight? You should have white wine with the meal.
The sample of river water comes from an area that once housed
a sheet metal works. Heavy metal contamination is common in
And case-based reasoning, in which the decision on how to act in one situ-
ation is guided by what action was successful when a similar situation was
encountered in the past, is also of potential value:
An ES may use all of these techniques to deal with a query from a user,
but is likely to rely primarily on production rules. These rules are a type of
knowledge, so are stored in the knowledge base, often using a format that
makes their meaning obvious.* Each rule is “standalone,” hence, although its
execution may affect the conclusion reached by applying it in combination
with other rules, a rule can be added, deleted, or modified without this action
directly affecting any other rule in the database. Correction and updating of
the expert system, therefore, is much more straightforward than would be
the case if different rules were linked together using program code.
7.6.1 Rule Chaining
No sensible expert system, when it receives a query from the user, blindly
executes every rule in the database in the hope that the sum total of all its
conclusions might provide an appropriate response to the user. Even assum-
ing that the output of large numbers of rules executed simultaneously could
be massaged into some meaningful format (which is optimistic), a plan of
action is needed. The paradigm or problem-solving model is this plan. It sets out
the steps that the inference engine should take to solve the query posed by
the user. In a rule-based system, the paradigm often relies on identifying a
group of connected production rules that, when applied in sequence, will
yield a line of reasoning that leads to the desired conclusion.
Figure 7.3
A tree diagram that illustrates forward chaining. (The tree branches on successive questions: "User hungry?", "Has money?", "Rich?", and "Favorite food?", leading to outcomes such as burgers or Chinese/Indian food.)
Figure 7.4
Backward chaining. (The goal "Wedding anniversary" branches into the alternative steps "Remember it" and "Forget it"; the two lines are joined by an arc if both steps must be taken.)
Since several rules have the same conclusion, the ES would take the first
rule it encounters and test the condition (<spend nothing>) to see whether
it is true. If it is not known at this point whether the condition is true, the
condition provides a new goal for the ES. It will now attempt to find out
whether the amount spent on a present was different from zero. The sys-
tem will again search the knowledge base to see if it contains a rule whose
conclusion is the amount spent. If it finds no such rule, it will ask the user
whether she/he can provide that information. The ES thus starts at the final
conclusion to be reached and works backward through a sequence of goals
until it either proves the entire set of conclusions or runs out of data. Because
the route that the ES takes is at each stage determined by the need to meet
some goal, this is also known as goal-driven reasoning.
A prime advantage of backward chaining or goal-driven reasoning is
that the questioning in which the ES engages is focused from the very start
because it begins at the final step of a possibly long chain of reasoning. The
questions that are asked, therefore, should all be relevant. Contrast this with
forward chaining, in which the early stages of the interaction may comprise
vague, unfocused questions because the user might be interested in any-
thing across the entire range in which the system has expertise:
From such a woolly starting point, time may be wasted while the ES tries
to find out what the user really wants. On the other hand, if there are many
different rules that lead to the action <go to live with mother>, it is pos-
sible that an ES using backward chaining will pursue many fruitless chains
before reaching a satisfactory conclusion.
Whether it follows forward or backward chaining, the system will pose a
series of questions that allow it to gather more information about the prob-
lem until it successfully answers the user’s query or runs out of alternatives
and admits defeat. Just like the way that the system chooses to respond to
a query, the exact sequence of steps that it takes is not preprogrammed into
the knowledge base, but is determined “on the fly” by the scheduler, which
reassesses the options after each response that the user provides.
Although the ES can be constructed using rules with only one action and
one condition, rules need not be restricted to a single condition or action.
They may contain several parts joined by logical operators, such as AND,
OR, and NOT, so that the conclusions depend on more than one premise
or test.
Example 2
Suppose that we have the following rules that are to be used to work out
what might be a suitable solvent for some material.
R1: IF the sample has a low molecular weight AND IF the sample is polar, THEN try water as the solvent.
R2: IF the sample has a low molecular weight AND IF the sample is not polar, THEN try toluene as the solvent.
R3: IF the sample has a moderate molecular weight AND IF the sample is ionic, THEN try water as the solvent ELSE suspect insoluble in water and toluene.
R4: IF the sample is volatile, THEN the sample has low molecular weight.
R5: IF the sample boiling point is <50°C, THEN the sample is volatile.
In a goal-driven system, the inference engine would start with the first
rule because the conclusion of that rule is the recommendation of a solvent.
This forms its first goal. In other words, it would try to demonstrate that
water is a suitable solvent for the sample. In order to test this rule, the ES
needs to know whether the molecular weight is low.* If this information is
not already available within the database, the next step is to check whether
the knowledge base contains a rule, the conclusion of which is that the sam-
ple has a low molecular weight. In this instance, R4 provides this informa-
tion, but this rule, in turn, depends on whether or not the sample is volatile,
so this gives the ES a secondary goal: to show that the sample is volatile. It
would then find a further rule (R5) that states that a sample is volatile if its
boiling point is less than 50°C. If the required boiling point is not in the
knowledge base, the ES would turn to the user.
ES: What is the boiling point of the sample?
U: 43°C
and the ES can now turn to the next step of determining whether the
sample is polar.
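The goal-driven search just described, restricted to rules R4 and R5, can be sketched as follows; the dictionary encoding of the rules and the simulated user (the boiling point is 43°C, so the one question asked is answered "yes") are illustrative assumptions:

```python
# Rules R4 and R5 from Example 2, stored as goal -> conditions.
RULES = {
    "low molecular weight": ["volatile"],         # R4
    "volatile": ["boiling point is below 50C"],   # R5
}

def prove(goal, known, ask):
    """Goal-driven (backward-chaining) proof of a single goal: look for
    a rule whose conclusion is the goal and recursively try to prove
    its conditions; a condition no rule concludes is asked of the
    user."""
    if goal in known:
        return True
    if goal in RULES:
        if all(prove(cond, known, ask) for cond in RULES[goal]):
            known.add(goal)
            return True
        return False
    return ask(goal)                              # turn to the user

# Simulated user: the boiling point is 43C, so the single question the
# system needs to ask is answered "yes".
asked = []
def user(question):
    asked.append(question)
    return True

result = prove("low molecular weight", set(), user)
```

The chain of secondary goals (low molecular weight, then volatility, then the boiling point) is exactly the sequence the walk-through above traces, and the engine asks the user only the one question it cannot answer from its rules.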
Comments from the system help to give the user confidence that the ES
knows what it is doing. They are particularly valuable if the ES has engaged in
some complicated reasoning or if its conclusions are unexpected (though, in
the example given above, the replies from the ES hardly fall in that category).
The explanatory dialogue that the system offers is limited, however, for
several reasons. First, the evidence that the ES draws on to support its con-
clusions will consist of the logic that it used to derive those conclusions.
Consequently, any explanation that it provides will be a restatement or
rephrasing of the rules and facts that it has already used in generating its
advice. It is true that much of the reasoning may have been hidden from
the user — the holiday-advising expert did not explain its reasoning until
prompted — but, if within the chain of reasoning is a doubtful step, there
may be no way that the user can test it. Recall that the rules that an ES uses
are unlikely to be causal; thus, if the conclusions of an ES appear strange and
there is a rule about which the user has concerns, it is unlikely that the ES
will contain sufficient information to independently justify the rule.
Furthermore, an ES is unable to step outside the boundaries of what is
contained in its knowledge base. This knowledge base may, knowingly or
otherwise, incorporate the biases of the person who built the system and
any bias or prejudice will color its view of how to interpret responses from
a user. For example, the ES interprets the positive answer to the question:
“Do you like good food and wine?” in a particular way, that is, that the user
will appreciate French or Italian cooking. However, for the user, a visit to
McDonald’s® or Kentucky Fried Chicken® may be high on their list of culi-
nary treats, so for them the expert system is going off on the wrong track if
it proposes a holiday in Florence when the biologist would really be happier
in South Dakota.
In addition, most expert systems are unable to rephrase an explanation. If
the user does not understand the reasoning of the system after being given
an initial explanation, there is no Plan B that the ES could use to explain its
conclusions in another way.
Paradoxically, the explanation system is actually of the greatest use to the
person who might be expected to least need it (the expert who built the sys-
tem) before the end-user even gets a look in. This is because the flow of con-
trol in an expert system is not linear — rules are not executed in the order
in which they are entered into the database, but in an order determined by
the scheduler, so following the flow of control and thus understanding the
system’s reasoning may be difficult. During building and testing, if the ES
provides an invalid or unexpected response, it can be asked to explain itself.
This can be a significant aid in tracking any faults and ambiguities.
As well as being able to provide some explanation of its reasoning, it is
important also that an ES should know and report its limitations. This is
routine for a human:
I suspect that the reason your car won’t start is a flat bat-
tery, but until I check it, I can’t be sure.
Not only does this human know the limits of his knowledge (“I suspect
that . . . ,” “I can’t be sure . . .”), but he makes it clear to the listener that his
conclusion about the stranded car is only provisional (“. . . until I check it. . . .”),
so the listener can judge how much weight to attach to the opinion. Expert
systems are far less able in both respects. Expert systems report every con-
clusion with the same degree of confidence unless they are explicitly pro-
grammed to deal with uncertainty, so it is important that, as an ES bumps up
against the boundaries of its knowledge, it should be able to give some hints
that it is venturing into unreliable territory. This requires “knowledge about
knowledge.” Knowing what it knows, its limitations and the circumstances
within which it is applicable, is an important attribute of the more flexible
expert system. This type of knowledge is meta-knowledge and is an active area
of research across many AI fields.
Expert Systems 225
While some knowledge may be unambiguous (e.g., the result of a test for blood sugar level or a measurement of body temperature), other forms of knowledge may be hard to
encode in a silicon expert. A doctor might feel that a patient consulting her
is looking “unwell” and, on the basis of their clinical experience, be able
to determine what potency of drug to prescribe to best help the patient. A
lay person might also recognize that the person is unwell, but be unable
to draw reliable conclusions about the level of medication required. Thus,
the question of how to extract expert knowledge and how readily it can be
converted into production rules is central to any decision on whether to
try to create a new ES. Tools to make this less painful and more efficient
are discussed in the next section.
Figure 7.5
The iterative process by which the first version of an expert system is built: the knowledge engineer, in dialogue with the human expert, codes knowledge into the knowledge base, which is then refined by inspection.
* In broad terms, the aim of knowledge engineering is to integrate human knowledge with
computer systems.
users of the finished system, will have a feel for what type of representation
will be most effective. Once a suitable representation has been selected, the
next step in building the knowledge base is a detailed conversation between
the knowledge engineer and the expert, so that the knowledge engineer can
gain an overview of the subject area and develop an understanding of the
role that the ES is expected to fill. Once some initial data have been fed into
the system, the conversation between expert and software engineer becomes
more wide-ranging, and often is full of open-ended questions from the
knowledge engineer such as:
Figure 7.6
An “aggressive stance.”
Figure 7.7
Going long on stock.
• Is the human expert able to explain how to manage the tasks that the ES will be expert in?
Computers are well suited to the manipulation of numbers, but the ES relies
on symbolic computation, in which symbols stand for properties, concepts,
and relationships. The degree to which an ES can manage a task may depend
on the complexity of the problem. For example, computer vision is an area of
great interest within AI and many programs exist that can, without human
assistance, use the output from a digital camera to extract information, such
as the characters on a car number plate. However, automatic analysis of more
complex images, such as a sample of soil viewed through a microscope, is far
less simple even though the underlying problem is fundamentally the same.
ES are useful in recognizing car number plates or widgets on a conveyor belt
in an industrial production line; they are not (yet) of much use in analyzing
complex images with unpredictable structure and definition.
Figure 7.8
The electronic nose.
is strong evidence for its value.* The need for a panel of experts increases the
number of people who will be involved in and, hence, the cost of the develop-
ment of the system.
7.12 Applications
Experimental scientists, if they write computer programs, tend to be fond of
FORTRAN, C, C++, Visual Basic, or Java. None of these is well suited to the
construction of an ES, for which LISP and PROLOG are the favored languages.
LISP (LISt Processing) is a simple and elegant language, which is still widely
used in AI, while PROLOG (PROgramming in LOGic) uses easy-to-understand
* This is related to the Turing test. A program that converses with a user about a particular
topic with such conviction and knowledge that the user is unable to determine whether he is
dealing with computer software or another human, has passed the Turing Test.
assertions and rules, which makes programs written in the latter language
straightforward to interpret. Fortunately for those who would like to build
an ES but have no desire to learn further computer languages, a knowledge of
neither LISP nor PROLOG is required to use most commercial ES shells.
The range of applications of expert systems in science is very wide, limited
primarily by the considerable effort that is often needed to prepare one. Typi-
cal applications include the work of Pole, Ando, and Murphy on the predic-
tion of the degradation of pharmaceuticals;1 Nuzillard and Emerenciano’s
work on structure elucidation from two-dimensional NMR data;2 the work
of Patlewicz and co-authors on sensitization;3 and of Prine in another area
connected to safety, that of MSDS data.4 The number of publications runs to
several hundred per year, covering a great diversity of topics.
7.14 Problems
1. Types of tasks suited to expert systems
Consider the following tasks. Which, if any, of them would be suit-
able candidates for the construction of an expert system?
a. Determining the chemical composition of different fuel oils
through analysis of their infrared absorption spectra.
b. Checking the shapes of cells seen through a microscope for pos-
sible abnormalities.
c. Determining the Latin name for plants in a nursery.
d. Finding a rich partner to marry.
e. Preparing for an undergraduate examination in thermo-
dynamics.
f. Buying a sports car.
Figure 7.9
Tower of Hanoi.
2. Tower of Hanoi
The Tower of Hanoi is a well-known problem in logic (Figure 7.9).
A set of rings of different diameters, say, ten in number, is placed
on one of three vertical rods, with the smallest ring on the top and
rings of increasing diameter underneath it. The object is to transfer the complete set of rings to a second rod, moving one ring at a time and using the third rod as necessary, without a ring of larger diameter ever being positioned above one of smaller diameter. Propose a set of expert system rules to accomplish this.
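Before drafting rules, it may help to see the recursive strategy that they must capture. The sketch below is the standard textbook solution in Python; the function and variable names are ours, not from the text, and an expert system would express the same ordering constraints declaratively.

```python
def hanoi(n, source, target, spare, moves):
    """Move n rings from source to target using spare; record each move."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the n-1 smaller rings
    moves.append((source, target))              # move the largest free ring
    hanoi(n - 1, spare, target, source, moves)  # restack the smaller rings

moves = []
hanoi(3, "A", "B", "C", moves)
print(len(moves))  # a tower of n rings needs 2**n - 1 moves; here 7
```

A rule-based formulation must achieve the same effect without explicit recursion, for example by rules that select the smallest legally movable ring.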
3. Funding research
One of the most demanding exercises that scientists engage in is
writing research proposals to obtain funding to support their work.
Sketch out the structure of an expert system that would be able to
help in this task.
References
1. Pole, D.L., Ando, H.Y., and Murphy, S.T., Prediction of drug degradants using
DELPHI: An expert system for focusing knowledge, Mol. Pharm. 4, 539, 2007.
2. Nuzillard, J-M. and Emerenciano, V.D.P., Automatic structure elucidation
through data base search and 2D NMR spectral analysis, Nat. Prod. Commun., 1,
57, 2006.
3. Patlewicz, G., et al., TIMES-SS — A promising tool for the assessment of skin
sensitization hazard. A characterization with respect to the OECD validation
principles for (Q)SARs and an external evaluation for predictivity, Regul. Toxicol.
Pharmacol., 48, 225, 2007.
4. Prine, B., Expert system for fire and reactivity MSDS text, Proc. Safety Prog., 26,
123, 2006.
5. Jackson, P., Introduction to Expert Systems, Addison-Wesley, Reading, MA, 1998.
6. Liebowitz, J., Introduction to Expert Systems, Mitchell Publishing, Santa Cruz,
1988.
7. Ignizio, J.P., An Introduction to Expert Systems: The Development and Implementa-
tion of Rule-Based Expert Systems, McGraw Hill, New York, 1991.
8. Lucas, P. and van der Gaag, L., Principles of Expert Systems, Addison-Wesley,
Wokingham, U.K., 1991.
8
Fuzzy Logic
In the previous chapter, we met expert systems. Life is easy for an expert
system (ES) when knowledge is perfect. In the rule:

IF the pH is less than 1 THEN add a specified quantity of alkali of a specified concentration
both the condition and the action are unambiguous. There is no uncertainty
in application of the rule because, assuming that the pH can be measured
with reasonable precision, it either is or is not less than 1, and if it is less than
1, the quantity and concentration of alkali to be added are precisely defined.
By contrast, suppose that the rule was instead:

IF the pH is low THEN add a little alkali
We now would need guesswork, prior knowledge, or just good luck to add
the correct amount of alkali of the right concentration and to do so at the
correct time. We can imagine an expert system struggling to deal with the
imprecision of this rule and coming up with advice along the lines of:
"I'm not sure myself about the pH, but if you have the feeling
that it might be rather low, why not toss in a bit of alkali?
I'm fairly hopeful that this will do the job, but I don't really
know how much to suggest; a little bit sounds about right to
me, unless you have any better ideas."
8.1 Introduction
The vagueness in the second rule above and, consequently, the unhelpful
imagined response of the ES, reflect the way that we use woolly language in
everyday conversation:
The statements are now more precise, but in adding numerical data some-
thing has been lost. Neither of the first two statements has retained any judg-
mental quality. Does the speaker believe that Sally’s waistline of 28 inches is
thin, anorexic, or just slimmer than average? Does a 91 percent recognition
rating qualify as very well known?
One might think that the loss of such judgmental content should not concern us as scientists. After all, science is based overwhelmingly on measurement and quantitative information rather than on
opinion, so perhaps scientists should be able to sidestep the problems caused
by ill-defined statements. But science does not operate in vacuum-packed
isolation, cut off from the rest of life. The points of contact between scientists
and those working outside science are many and varied, hence, as normal
conversation is vague, scientists must be able to handle that vagueness.
In fact, the problem of imprecision is far more widespread than might at
first be apparent. It would be naive to imagine that vagueness arises only
at the interface where the scientific and nonscientific worlds rub together.
Vague language is common in science itself: reactions between ions in
solution are recognized by chemists to be “very fast”; overtones in infra-
red absorption spectra are typically “much weaker” than the fundamental
absorption; many colloids are stable “almost indefinitely”; lipid bilayers are
an “important” factor in determining the functioning of cells.
To bridge the gap between the descriptive but imprecise language that we
use in conversation and the more quantitative tools that are used in science, a
computational method is required that can analyze and make deductions from
imprecise statements and uncertain or fuzzy data. Fuzzy logic is this tool.
Fuzzy Logic 239
E_J = BJ(J + 1) − DJ²(J + 1)²
Figure 8.1
The range of logical values in Boolean logic: only the values 0 and 1 are allowed.
The boiling points of all the liquids in which we are interested could also
be shown on a diagram in which the volatile liquids lay within some closed
boundary and all others lay outside of it (Figure 8.3).
Figure 8.2
A crisp membership function for the determination of volatile liquids: membership (0 or 1) in the sets “volatile” and “not volatile” as a function of temperature, with a sharp boundary at 40°C.
Figure 8.3
A representation of several liquids defined by their boiling points. Liquids within the shaded area are (defined to be) volatile; those outside it are not.
Now that the definition of a volatile liquid has been settled, the expert sys-
tem could apply the rule. However, this approach is clearly unsatisfactory.
The all-or-nothing crisp set that defines “volatile” does not allow for degrees
of volatility. This conflicts with our common sense notion of volatility as a
description, which changes smoothly from low-boiling liquids, like diethyl
ether (boiling point = 34.6°C), which are widely accepted to be volatile, to
materials like graphite or steel that are nonvolatile. If a human expert used
the rule:
Figure 8.4
The range of logical values in fuzzy logic: any value between 0 and 1 is allowed.
reach the same conclusion about a liquid whose boiling point is 41°C. The
hard boundaries imposed by Boolean logic force an abrupt, and unreason-
able, change in the recommended method of analysis when the 40°C bound-
ary is crossed.
Fuzzy logic gets around this difficulty by replacing hard boundaries
between sets with soft divisions. Objects are allocated to fuzzy sets, which
are sets with fuzzy boundaries. The membership value of an object within
a fuzzy set can lie anywhere within the real number range of 0.0 to 1.0
(Figure 8.4).
An object may lie entirely within one fuzzy set and, thus, have a member-
ship in that set of 1.0; entirely outside of it and, thus, have a membership
of 0; or it may belong to two or more fuzzy sets simultaneously and have a
membership between 0 and 1 in each. A liquid whose boiling point is 46°C
may belong to the volatile class with a membership 0.45, and the not volatile
class with a membership of 0.55.
The NOT function is defined by:
µA(NOT X) = 1.0 − µA(X)
This statement might belong to the “True” set with a membership of, say, 0.9,
and to the “False” set with a membership of 0.1. Statements that in Boolean
logic are either true or false, but not both, can in fuzzy logic be simultaneously
Table 8.1
Fuzzy and Crisp Membership Values in the “Volatile” Set for Some Liquids
Figure 8.5
The boiling points of substances can be used to define the degree of membership (shown here as depth of shading) that a liquid has in a set.
true and false with a membership value in each set that measures the level of
confidence that we have in the statement.
Table 8.1 compares the fuzzy membership and crisp membership values in
the volatile set for a few liquids (Figure 8.5).
Figure 8.6
Membership functions for the sets “very volatile,” “volatile,” and “slightly volatile” (membership as a function of temperature).
How should memberships be chosen for a liquid with a boiling point of 46°C? A membership of about 0.5 in each of the volatile class and the
not volatile class might seem about right, but the choice appears arbitrary.
Although it is the purpose of fuzzy systems to handle ill-defined informa-
tion, this does not mean that we can get away with uncertainty in the alloca-
tion of membership values. If some of the membership values for liquids in a
database were proposed by one person and the rest by a second person, the
two groups of memberships could well be inconsistent unless both people
used the same recipe for determining membership. Any deductions of the
fuzzy system would then be open to doubt. In fact, even the membership
values determined by just one person might be unreliable unless they had
used a properly defined method to set membership values. The hold-a-wet-
finger-in-the-air style of finding a membership value is not supportable.
To deal with this difficulty, we construct a membership function plot, from
which memberships can be determined directly (Figure 8.6). The member-
ship function defines an unambiguous relationship between boiling point
and membership value, so the latter can then be determined consistently,
given the boiling point.
The x-axis in a plot of a membership function represents the universe of
discourse. This is the complete range of values that the independent variable
can take; the y-axis is the membership value of the fuzzy set.
Membership can also be expressed in equation form. The statement that carbon disulfide is a member of the set of volatile compounds with a membership of 0.45 becomes:

µvolatile(carbon disulfide) = 0.45

Figure 8.6 shows membership functions for the sets “very volatile,” “volatile,” and “slightly volatile,” but it might seem as though the functions
themselves have been pulled out of thin air. In a sense they have. There is
no recipe that must be followed to turn a concept that is as loosely defined
as “very volatile” into a deterministic membership function. The shape and
extent of these functions are both chosen in a way that is essentially arbi-
trary, but seems “reasonable” to the user.
Triangular (Figure 8.7), piecewise linear (Figure 8.8), or trapezoidal (Figure 8.9) functions are commonly used as membership functions because they are easily prepared and computationally fast.
However, there is no theoretical justification for using one of these shapes
rather than another. One might suspect that there could be more of an argu-
ment for using a normal distribution, as that function arises naturally in the
treatment of errors (Figure 8.10), but there is no theoretical justification for
preferring this either over triangular or other shapes of function. Provided
that the profile used correlates the membership value with the user’s percep-
tion of what that ought to be to an acceptable degree, the shape of the func-
tion is not an important factor in the operation of a fuzzy logic system.
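The simple shapes above are easy to implement directly. The following minimal Python sketch defines triangular and trapezoidal membership functions; the numerical breakpoints in the example call are purely illustrative and do not come from the text.

```python
def triangular(x, a, b, c):
    """Membership rising linearly from a to a peak at b, falling to zero at c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def trapezoidal(x, a, b, c, d):
    """Flat top between b and c, with linear shoulders over [a, b] and [c, d]."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

# Illustrative numbers only: a set peaking at x = 40, vanishing at 30 and 50
print(triangular(46.0, 30.0, 40.0, 50.0))  # 0.4
```

Either function returns a value in [0, 1] for any input, which is all that the fuzzy machinery downstream requires.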
Figure 8.7
A triangular membership function.
Figure 8.8
A piecewise linear membership function.
Figure 8.9
A trapezoidal membership function.
Figure 8.10
A membership function that resembles a normal distribution.
Suppose we are told that some chemical X has a membership of 0.9 in the fuzzy set “very
toxic.” If X rapidly decomposes in basic solution, it might be a member of the
“rapidly degrades” set with a membership of 0.8. Using probability theory, we
would determine that the probability that X is a rapidly degraded, very toxic
chemical is 0.9 × 0.8 = 0.72. By contrast, fuzzy logic would yield the conclusion
that, if X is in the very toxic class with a membership of 0.9 and in the rapidly
degraded class with a membership of 0.8, then X is a member of the set of rap-
idly degraded, very toxic materials with a membership value of the lesser of
the two memberships in the individual classes to which X belongs, not their
product, that is 0.8. We shall learn more of this in section 8.9.
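The contrast between the probabilistic product and the fuzzy minimum is a one-line calculation. The sketch below uses the memberships given in the text for chemical X.

```python
# Probability-style conjunction multiplies memberships;
# the fuzzy AND used here (the Mamdani convention) takes the minimum.
mu_very_toxic = 0.9        # membership of X in "very toxic" (from the text)
mu_rapidly_degrades = 0.8  # membership of X in "rapidly degrades"

probabilistic = round(mu_very_toxic * mu_rapidly_degrades, 2)
fuzzy_and = min(mu_very_toxic, mu_rapidly_degrades)

print(probabilistic)  # 0.72
print(fuzzy_and)      # 0.8
```

The minimum never drops below the weaker of the two pieces of evidence, whereas repeated multiplication drives a probability-style score toward zero.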
8.7 Hedges
Suppose that we have defined a membership function for the “Low pH” set.
Most acid solutions would be, to some degree, members of this fuzzy set. We may
want to be able to qualify the description by adding modifiers, such as “very,”
“slightly,” or “extremely” whose use allows us to retain close ties to natural lan-
guage. The qualifiers that modify the shape of a fuzzy set are known as hedges.
We can see the effect of the hedges “very” and “very very” in Figure 8.11.
The application of both of these hedges concentrates the region to which
the hedge applies. Nonzero membership of the fuzzy set “very low pH” cov-
ers the same overall range of pH as membership of the group “low pH,” even
with the hedge in place, but the effect of the hedge is to reduce membership
in the “low pH” set for those pHs that also have some membership in the
“medium pH” set. The effect of “very very” is even more marked, as we
would expect.
In order to ensure consistency among users of fuzzy logic, hedges have
been defined mathematically. Thus, “very” is defined (arbitrarily but consistently within the field) as µA(x)², “more or less” is defined as √µA(x), and so on. Some examples of the mathematical form of hedges are given in Table 8.2.
Figure 8.11
The effect of applying the hedges “very” (dotted line) and “very very” (dashed line) to the “low pH” set.
250 Using Artificial Intelligence in Chemistry and Biology: A Practical Guide
Table 8.2
Mathematical Equivalences of Some Hedges

Hedge        Mathematical form
A little     µA(x)^1.3
Somewhat     µA(x)^1.7
Very         µA(x)^2
Extremely    µA(x)^3
Very very    µA(x)^4
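Because hedges are simple exponents, applying them takes only a dictionary lookup. This sketch encodes Table 8.2; the starting membership value 0.8 is illustrative.

```python
# Hedges reshape an existing membership value by exponentiation (Table 8.2).
HEDGES = {
    "a little": 1.3,
    "somewhat": 1.7,
    "very": 2.0,
    "extremely": 3.0,
    "very very": 4.0,
}

def hedge(mu, name):
    """Apply a named hedge to a membership value mu in [0, 1]."""
    return mu ** HEDGES[name]

mu_low_pH = 0.8
print(hedge(mu_low_pH, "very"))       # about 0.64: membership is concentrated
print(hedge(mu_low_pH, "very very"))  # about 0.41: even more marked
print(mu_low_pH ** 0.5)               # "more or less" widens instead (about 0.89)
```

Since any membership lies between 0 and 1, exponents above 1 shrink it (concentration) and exponents below 1 enlarge it (dilation), exactly the behavior shown in Figure 8.11.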
ν = kb[E]o / (1 + KM/[S]o),  where KM = (ka′ + kb)/ka   (8.3)

ka and ka′ are the rate constants for the formation and dissociation of the enzyme-substrate complex, and kb is the rate constant for the step in which this complex collapses to give product.
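Equation (8.3) is easy to evaluate directly. The sketch below is a minimal check; the rate constants and concentrations are invented for illustration and are not values from the text.

```python
def michaelis_menten_rate(kb, E0, S0, ka, ka_rev):
    """Rate from equation (8.3): v = kb[E]o / (1 + KM/[S]o), KM = (ka' + kb)/ka."""
    KM = (ka_rev + kb) / ka
    return kb * E0 / (1.0 + KM / S0)

# Illustrative values only: KM = (3 + 2)/1 = 5, so v = 2*1/(1 + 5/5) = 1.0
print(michaelis_menten_rate(kb=2.0, E0=1.0, S0=5.0, ka=1.0, ka_rev=3.0))
```

When [S]o is much larger than KM the rate approaches kb[E]o, the familiar saturation limit.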
Figure 8.12
An enzyme in which two ionizing groups are present.
Figure 8.13
Activity of the enzyme fumarase (log reaction rate) as a function of solution pH.
R1:
IF the pH is medium AND the enzyme concentration is high
THEN the reaction rate is high.
R2:
IF {the pH is low OR the pH is high} AND the enzyme con-
centration is high THEN the reaction rate is low.
R3:
IF the enzyme concentration is low THEN the reaction rate
is low.
8.8.1 Input Data
The inputs into the system are the pH and the concentration of enzyme, mea-
sured across all forms, both active and inactive. From these pieces of data, the
system is required to provide an estimate of the reaction rate. Let us assume
that the total concentration of enzyme is 3.5 mmol dm–3 and that the pH is 5.7
and use these values to estimate the rate of reaction.
Table 8.3
Relationship between pH and Membership of the Low, Medium, and High pH Sets

pH     Low    Medium   High
5.0    1.0    0        0
5.4    1.0    0        0
5.8    0.63   0.37     0
6.2    0.11   0.89     0
6.6    0      1        0
7.0    0      1        0
7.4    0      0.64     0.36
7.8    0      0.12     0.88
If the pH of the solution is 5.7, from Figure 8.14 we can determine that the
pH has a membership of 0.76 in the low pH set, 0.24 in the medium pH set,
and 0.0 in the high pH set.
Similarly, we can use Figure 8.15 to determine that the enzyme concentra-
tion of 3.5 mmol dm–3 translates into a membership of 0.80 in the low concen-
tration set and 0.20 in the high concentration set (Table 8.4).
Figure 8.14
Membership of the pH in the sets low, medium, and high.
Figure 8.15
The relationship between concentration and membership in the low and high concentration sets.
Table 8.4
Relationship between Concentration of Enzyme and Set Membership

Concentration/mmol dm–3   Low    High
2                         1      0
3                         0.94   0.06
4                         0.64   0.36
5                         0.33   0.67
6                         0      1
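Fuzzification can be sketched as a piecewise-linear lookup over the tabulated breakpoints. The code below interpolates Tables 8.3 and 8.4; note that straight lines between tabulated points only approximate the curves of Figures 8.14 and 8.15 (the text reads 0.80 from the figure at 3.5 mmol dm–3, while linear interpolation of Table 8.4 gives 0.79).

```python
def interp(x, xs, ys):
    """Piecewise-linear interpolation, clamped at the ends of the table."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Breakpoints from Table 8.3 (pH) and Table 8.4 (enzyme concentration)
pH_pts   = [5.0, 5.4, 5.8, 6.2, 6.6, 7.0, 7.4, 7.8]
low_pH   = [1.0, 1.0, 0.63, 0.11, 0.0, 0.0, 0.0, 0.0]
med_pH   = [0.0, 0.0, 0.37, 0.89, 1.0, 1.0, 0.64, 0.12]
conc_pts = [2.0, 3.0, 4.0, 5.0, 6.0]
low_conc = [1.0, 0.94, 0.64, 0.33, 0.0]

print(interp(5.8, pH_pts, low_pH))      # 0.63, matching the table
print(interp(3.5, conc_pts, low_conc))  # midway between 0.94 and 0.64
```

The same helper serves for every input variable, so adding a new measurement to the system means adding one more table of breakpoints.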
R1 OR R2 OR R3 OR ...
In order to reason using fuzzy data, a way must be found to express rules
so that the degree of certainty in knowledge can be taken into account and a
level of certainty can be ascribed to conclusions. This is done through fuzzy
rules. A fuzzy rule has the form:
If A is x then B is y
where A and B are linguistic variables and x and y are linguistic values.
For example: IF the pH is low THEN the reaction rate is low.
Once the input data have been used to find fuzzy memberships, in the
second step we compute the membership value for each part in the condition
or antecedent of a rule. These are then combined to determine a value for the
conclusion or consequent of that rule. If the antecedent is true to some degree,
the consequent will be true to the same degree.
We have seen above that the fuzzified inputs are: pH: low 0.76, medium 0.24, high 0.0; enzyme concentration: low 0.80, high 0.20. Consider first
R3:
IF the enzyme concentration is low THEN the reaction rate
is low.
Because the enzyme concentration has a membership of 0.8 in the low set,
the antecedent has a membership value of 0.8, and we can conclude that the
consequent of this rule has a membership of 0.8.
The remaining two rules have multiple parts, which must be combined to
find a value that specifies the degree to which the entire condition holds.
Let us first tackle rule R1:
R1:
IF the pH is medium AND the concentration is high THEN the
rate is high.
If two antecedents are linked by AND, the degree to which the entire statement holds is found by forming their conjunction. This degree cannot exceed the degree to which either part of the statement is true, thus the evaluation required is the minimum of the two memberships:

µA(X AND Y) = minimum(µA(X), µB(Y)) (8.4)
R2:
IF {the pH is low OR the pH is high} AND the enzyme con-
centration is high THEN the rate is low
Here we have both OR and AND operators. If two antecedents are linked
by OR, the degree to which the entire statement holds is determined by how
much the object belongs to either set, therefore is given by the maximum of
μA(X) and μB(Y).
Thus,

µA(X OR Y) = maximum(µA(X), µB(Y)) (8.5)
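With min for AND and max for OR, evaluating all three rules is a few lines. This sketch uses the fuzzified inputs given in the text (pH 5.7; enzyme concentration 3.5 mmol dm–3).

```python
# Mamdani-style evaluation of the three enzyme rules, using the
# fuzzified inputs from the text.
pH = {"low": 0.76, "medium": 0.24, "high": 0.0}
conc = {"low": 0.80, "high": 0.20}

AND, OR = min, max  # fuzzy conjunction and disjunction

r1_rate_high = AND(pH["medium"], conc["high"])              # min(0.24, 0.2) = 0.2
r2_rate_low = AND(OR(pH["low"], pH["high"]), conc["high"])  # min(max(0.76, 0), 0.2) = 0.2
r3_rate_low = conc["low"]                                   # single-part antecedent: 0.8

print(r1_rate_high, r2_rate_low, r3_rate_low)
```

The three consequents (rate high to degree 0.2; rate low to degrees 0.2 and 0.8) are exactly the values carried forward into the aggregation step.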
8.9.1 Aggregation
It may happen that only a single rule provides information about a particular
output variable. When this is true, that rule can be used immediately as a
measure of the membership for the variable in the corresponding set. In the
enzyme problem, only one rule predicts that the rate is high, therefore, we
can provisionally assign a membership of 0.2 for the rate in this fuzzy class.
Often though, several rules provide fuzzy information about the same vari-
able and these different outputs must be combined in some way. This is done
by aggregating the outputs of all rules for each output variable.
When we have evaluated all the rules, an output variable might belong to
two or more fuzzy subsets to different degrees. For example, in the enzyme
problem one rule might conclude that the rate is low to a degree of 0.2 and
another that the rate is low to a degree of 0.8. In aggregation, all the fuzzy
values that have been calculated for each output variable are combined to
provide a consensus value for the membership of the output variable in each
fuzzy set to which it has access. This consensus is determined by both the
degree of membership in each of the output sets and the width of that set in
the universe of discourse, as we illustrate below.
The outputs of the three rules have been found to be: the rate is high with degree 0.2 (R1), the rate is low with degree 0.2 (R2), and the rate is low with degree 0.8 (R3).
Several recipes exist for aggregation. One of the simplest and most easily
applied, due to Mamdani, is as follows. We note that, of the three rules that
predict the rate, two of them give different degrees of membership in the
“low rate” set. Should we use the higher of these two values or the lower to
determine the overall prediction of the system? Let us initially skirt around
the question by taking them both, using their average value of 0.5 to give a
consensus membership. (You may spot the flaw in this procedure. We shall
return to this point in a moment.)
This presumed membership of 0.5 in the “low” set must now be combined
with the output from R1, which was that the rate is “high” to a degree of 0.2.
To combine these, we require a membership function that relates the actual
reaction rate to the fuzzy descriptors “low” and “high” (Figure 8.16).
The memberships in each category are combined by determining the area
within each membership function that lies below the line defined by the
membership value (Figure 8.17). Thus, the degree to which the rate is low, 0.5,
defines a region in Figure 8.17 of a size determined by the membership of the
rate in the “low” set. Similarly, the membership in the “high” set determines
a second region in the figure. The figure can now be used to defuzzify the
conclusions of the fuzzy system and turn them back into crisp numbers.
A defuzzifier is the opposite of a fuzzifier; it maps the output sets into crisp
numbers, which is essential if a fuzzy logic system is to be used to control an instrument or process.
Figure 8.16
Membership functions (“low,” “high,” and “very high”) to convert between experimental reaction rates and fuzzy sets.
Figure 8.17
Defuzzification of set memberships to give a predicted reaction rate.
The most widely used defuzzification method is the center of gravity (cog), which for a membership function µA(x) defined on the interval [a, b] is:

cog = ∫a^b µA(x)·x dx / ∫a^b µA(x) dx   (8.6)

The expression in equation (8.6) is applied to all the shaded areas in Figure 8.17. We can break these areas into rectangles and triangles. Starting from the left, we can identify a rectangle that starts at rate = 0 and finishes at rate = 4; this gives the first term in equation (8.7). The shaded area in the “low” membership set is completed by a triangle from rate = 4 to rate = 5. Moving to the “high” set, we must first find the area of a triangle that runs from rate = 3 to rate = 3.4, and so on. The numerator in equation (8.6) thus is given by:

∫0^4 0.5x dx + ∫4^5 (−0.5x + 2.5)x dx + ∫3^3.4 (0.5x − 1.5)x dx + ∫3.4^7.6 0.2x dx + ∫7.6^8 (−0.5x + 4)x dx   (8.7)

The denominator is the same sum of integrals with the factor x removed from each integrand. Evaluating numerator and denominator gives

10.14/3.17 = 3.2
and the predicted rate of reaction is 3.2. The value generated by this pro-
cedure seems to be reasonable, but our initial averaging of the two mem-
bership values for the membership in the “low” set was a dubious way to
proceed. This is because the result of that averaging, a membership of 0.5 in
the “low” set, was given exactly the same weighting as that allocated to the
membership of 0.2 in the “high” set generated by R1, even though two rules
contributed to the low set membership and only one rule to the high. This
is hard to justify. If a dozen rules predicted a low rate, with varying degrees
of confidence, and just one rule predicted a high rate, we could be fairly sure
that the rate really is low because so many rules were voting for this. We
would be much less sure if only one rule predicted a low rate. Therefore,
somehow we must take into account each rule individually.
This is easily done. Rather than averaging the values of all consequents
that predict membership of the same fuzzy set, we include separately the
membership values for every rule in the calculation of equation (8.6). As Fig-
ure 8.18 shows, there are now three areas to consider, so the numerator in
equation (8.6) becomes:
∫0^3.4 0.8x dx + ∫3.4^5 (−0.5x + 2.5)x dx + ∫0^4.6 0.2x dx + ∫4.6^5 (−0.5x + 2.5)x dx + ∫3^3.4 (0.5x − 1.5)x dx + ∫3.4^7.6 0.2x dx + ∫7.6^8 (−0.5x + 4)x dx

while the denominator becomes:

∫0^3.4 0.8 dx + ∫3.4^5 (−0.5x + 2.5) dx + ∫0^4.6 0.2 dx + ∫4.6^5 (−0.5x + 2.5) dx + ∫3^3.4 (0.5x − 1.5) dx + ∫3.4^7.6 0.2 dx + ∫7.6^8 (−0.5x + 4) dx
Figure 8.18
A better defuzzification method to give a predicted reaction rate.
The predicted rate is, consequently, 14.5/5.24 = 2.77, and we note that in the revised procedure the overall rate is found to be lower, as we might have expected because two rules are voting for a low rate.
The center of gravity is a reliable and well-used method. Its principal dis-
advantage is that it is computationally slow if the membership functions
are complex (though it will be clear that if the membership functions are as
simple as the ones we have used here, the computation is quite straightfor-
ward). Other methods for determining the nonfuzzy output include center-
of-largest-area and first-of-maxima.
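Because the membership functions used here are piecewise linear, the center-of-gravity calculation is easy to check numerically. The sketch below is illustrative only: the function names and step size are our own, and the two clipped membership functions are read off the integration limits in the first calculation above ("low" clipped at 0.5, "high" clipped at 0.2).

```python
def mu_low(x):
    # "low" set clipped at membership 0.5: flat at 0.5 out to rate 4,
    # then the falling edge -0.5x + 2.5, reaching zero at rate 5
    return max(0.0, min(0.5, -0.5 * x + 2.5))

def mu_high(x):
    # "high" set clipped at membership 0.2: rising edge 0.5x - 1.5,
    # a plateau at 0.2, then the falling edge -0.5x + 4, zero at rate 8
    return max(0.0, min(0.2, 0.5 * x - 1.5, -0.5 * x + 4.0))

def centre_of_gravity(dx=0.001, x_max=10.0):
    # Midpoint-rule approximation to equation (8.6); each clipped set
    # contributes its own integral, exactly as in the sums of integrals above
    num = den = 0.0
    for i in range(int(x_max / dx)):
        x = (i + 0.5) * dx
        m = mu_low(x) + mu_high(x)
        num += m * x * dx
        den += m * dx
    return num / den

print(round(centre_of_gravity(), 2))   # approximately 3.2
```

Replacing the two functions with the three clipped areas of Figure 8.18 reproduces the revised estimate in the same way.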
8.10 Applications
An early application of fuzzy logic was in the control of a cement kiln and
applications to similar or identical systems have continued to appear (see, for
example, Jarvensivu et al.1). This is an example of a practical application in
which generation of high quality product may depend on the skill and expe-
rience of a small number of workers. Even when computer control is avail-
able, workers may describe the way that they adjust conditions to provide a
high-quality product by statements such as, “The kiln rotation rate should be
lowered slightly,” and that, in order to compensate, “The temperature should
be diminished just a bit.” Fuzzy logic now has a lengthening track record of
use in such situations.
Control problems represent a major area of application for fuzzy logic
since reliable process control may rely on the long-term expertise of one or
a few people, and those people may be able to frame their knowledge of the
system only in imprecise terms. A typical example is the control of pH in a
crystallization reactor.2 A similar application was described by Puig and co-
workers, who used this method to control the amount of dissolved oxygen
in a wastewater treatment plant.3 In other control problems that are rarely
under the control of people but for which the control problem may be dif-
ficult to program in exact terms, fuzzy logic can again be of value, e.g., to
control autofocus in cameras4 or to reduce the effect of camcorder vibration
on picture quality.
Biological applications represent a rich field for fuzzy
logic, as a recent review suggests.5 Du and co-workers have used fuzzy
systems to model gene regulation,6 and Marengo’s group has considered the
use of fuzzy logic in analysis of electrophoresis maps.7 Medical applications
are also increasingly common, as exemplified by the work of Chomej and
co-authors.8
8.12 Problems
1. The rate of enzyme-mediated reactions, like most other types of reac-
tion, depends on temperature. Over a limited temperature range, the
reaction may follow the Arrhenius equation:
$$k = A\,e^{-E_a/RT}$$
References
1. Jarvensivu, M., Saari, K., and Jamsa-Jounela, S.L., Intelligent control of an indus-
trial lime kiln process, Contr. Eng. Prac., 9, 589, 2001.
2. Chanona, J., et al., Application of a fuzzy algorithm for pH control in a struvite
crystallisation reactor, Water Sci. Tech., 53, 161, 2006.
3. Puig, S., et al., An on-line optimisation of a SBR cycle for carbon and nitrogen
removal based on on-line pH and OUR: The role of dissolved oxygen control,
Water Sci. Tech., 53, 171, 2006.
4. Malik, A.S., and Choi, T.S., Application of passive techniques for three-dimen-
sional cameras, IEEE Trans. Cons. Elec. 53, 258, 2007.
5. Torres, A. and Nieto, J.J., Fuzzy logic in medicine and bioinformatics, J. Biomed.
Biotech., 91908, 2006.
6. Du, P., et al., Modeling gene expression networks using fuzzy logic, IEEE Trans.
Sys. Man Cybernet. Part B – Cybernet., 35, 1351, 2005.
7. Marengo, E., et al., A new integrated statistical approach to the diagnostic use
of two-dimensional maps, Electrophoresis, 24, 225, 2003.
8. Chomej, P., et al., Differential analysis of pleural effusions by fuzzy logic based
analysis of cytokines, Respir. Med., 98, 308, 2004.
9. Negnevitsky, M., Artificial Intelligence: A Guide to Intelligent Systems, 2nd ed.,
Addison-Wesley, Reading, MA, 2005.
9
Learning Classifier Systems
Classifier systems are software tools that can learn to control or interpret
complex environments without help from the user. This is the sort of task
to which artificial neural networks are often applied, but both the internal
structure of a classifier system and the way that it learns are very different
from those of a neural network. The “environment” that the classifier system
attempts to learn about might be a physical entity, such as a biochemical fer-
mentor, or it might be something less palpable, such as a scientific database
or a library of scientific papers.
Classifier systems are at their most valuable when the rules that describe
the behavior of the environment or its structure are opaque and there is no
effective means to determine them from theory. An understanding of the
environment is then difficult or impossible without the help of some learn-
ing tool. Classifier systems behave as a kind of self-tutoring expert system,
generating and testing rules that might be of value in controlling or inter-
preting the environment. Rules within the system that testing reveals are
of value are retained, while ineffective rules are discarded. In a classifier
system, the rules evolve, whereas in an expert system the rules would be
integrated into the software.
Classifier systems are in some respects a scientific solution looking for a
problem because they are currently the least used of the methods discussed
in this book. However, their potential as a disguised expert system is sub-
stantial and they are starting to make inroads into the fields of chemical and
biochemical control and analysis.
9.1 Introduction
Scientific control problems are widespread in both academia and industry.
Large segments of the chemical industry are devoted to the synthesis of bulk
chemicals, and the plants in which synthesis is performed run with a high
degree of automation. For most practical purposes, any chemical plant that
operates on an industrial scale runs completely under automatic control.
A similar level of automation is found in the biochemical industry.
Although the volumes of production of biochemicals are smaller by several
orders of magnitude than those of bulk chemicals, companies that operate
fermentors and other types of biochemical reactors must still work within
the constraints imposed by multiple goals: the products must have a min-
imum level of purity, their composition must be consistent from batch to
batch, and they must be produced with the maximum possible yield. At the
same time the operation must be safe, have the minimum possible need for
human intervention, and have a low demand on power and other services.
The key to meeting these requirements is reliable process control.
Not long ago, evidence of sound process control would be the sight of a
few computers running part of the operation and some engineers standing
around with clipboards who would oversee what was going on. The extent
to which process control has gained in sophistication since then is exem-
plified by the “lights out” laboratories that large pharmaceutical companies
now operate. In these laboratories everything one would expect to see in a
modern science laboratory is present, except that there are no people. The
laboratories are packed with scientific instruments, which run automatically
and continuously, robotic samplers that feed the instruments, and comput-
ers to ensure that everything runs smoothly; yet no technicians are needed
to feed in samples or control instruments. Humans do enter the laboratory,
but only to complete tasks like refilling reagent containers or dealing with
faulty equipment. Every other operation, from sample preparation or choice
of reaction conditions to the interpretation of spectra, is performed under the
direction of computers.
The development of the lights out laboratory has been driven by econom-
ics: computers are cheaper than a technician, will work long hours with-
out complaining or asking for a comfort break, do not need to spend time
away at Disney World® every August, and may be more consistent in their
performance. The laboratories are a notable modern example of the way in
which process control is being taken over by computers, but automatic con-
trol has been long recognized as one of the best ways to increase efficiency
and ensure safety.
The principal reason why automatic control has these advantages over
human control is that the synthesis or isolation of even a simple chemical,
such as ethylene oxide, on an industrial scale is a complex technical problem.
Figure 9.1 shows a schematic of a section of an industrial unit for the produc-
tion of this compound. Whenever the plant is operating, a stream of mes-
sages floods into the control center, reporting the temperature in the reactors,
the pressure in the flowlines, the purity of the product, and other data. The
rate at which messages appear may be so great that no single human could
even read them all, let alone process them efficiently and effectively; thus
messages from sensors are collected and analyzed by computers in real time
in order to assess the current state of the plant. Necessary action, such as
turning on a heater or reducing the pressure in a flowline, can then be taken
by the controlling computers so that the process proceeds smoothly, safely,
and efficiently.
Computers can provide a near instantaneous response to any change
in operational conditions, thus enhancing operating efficiency. However,
while a quick response is desirable, making the wrong decision is not. The
Figure 9.1
A portion of an industrial plant that produces ethylene oxide.
response must be appropriate so the computers that control the plant need to
know what they are doing. Some algorithmic understanding of the processes
that are occurring within the plant in the form of a model of those processes
is required, in order that any adjustments to operating conditions made in
response to messages coming from the sensors are suitable.
The understanding that the computers controlling the equipment might
possess could be contained within a kinetic and thermodynamic model that
encapsulates the detailed chemistry of the process. This would include a
description of all the reactions that might be expected to occur under the con-
ditions achievable in the plant, together with a list of the relevant rate constants
and activation energies for each reaction. In addition, process variables, such as
maximum flow rates or pump pressures that are needed for a full description
of the behavior of the system under all feasible conditions, would be provided.
Even for a quite simple process, this might be a fearsome model, but in
favorable circumstances, when changes in operating conditions within the
plant are at least an order of magnitude slower than the rate at which the
kinetic and thermodynamic computer model runs, it should be possible
to use the model to predict the behavior of the reactor. Assuming that it is
possible to build such a chemical model, the controlling computers might
rely upon an expert system to monitor and direct the process; such a system
would incorporate a set of rules derived from an inspection of, and inter-
action with, the model so that the appropriate response is provided to any
possible combination of system conditions. The reactor in this scenario is
controlled by explicit programming for every possible combination of condi-
tions that might arise in the reactor. If the model on which the expert system
relies is sufficiently detailed and robust, software built on that foundation
should be adequate to control the plant.
finds itself, or by changing its own behavior in a way that the agent believes
will be beneficial.
The CS is aware of what is happening in its surroundings through mes-
sages that are sent to it from detectors (pressure, temperature, or pH sensors
in a reaction vessel, perhaps). It responds by sending requests into the envi-
ronment, asking that changes, such as turning on a heater or adding some
acid, be made to meet a defined goal. The changes are put into effect in the
environment, which then returns a reward whose size is proportional to the
degree to which the changes were successful in meeting the goal. The soft-
ware operates in such a way that it tries to maximize the reward it receives,
and because a high reward is provided when the environment is properly
controlled, the interests of the software and of the environment are aligned.
This trial-and-error approach to training software is known as reinforce-
ment learning, a fundamental paradigm in machine learning. In reinforce-
ment learning, a system does exactly what we have just described, interacting
with its environment in such a way that it tries to maximize any payoff that
the environment can provide. The system knows nothing about the details
of the task that it is carrying out, but concentrates entirely on maximizing
its reward; thus the same basic software could be used to run a chemical
plant through control of valves and heaters, or stabilize the temperature in a
greenhouse by opening and closing windows.*
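The point that a reinforcement learner knows nothing about the task itself, only the payoff, can be illustrated with one of the simplest settings in the field, a two-action bandit. Everything in this sketch is an illustrative stand-in (the payoff values, the epsilon-greedy rule, and the function names are our own); it is not a classifier system, just the bare payoff-maximizing loop.

```python
import random

def run(payoffs, episodes=1000, eps=0.1, seed=0):
    # Keep a running payoff estimate for each action; usually exploit the
    # best current estimate, occasionally explore an action at random.
    rng = random.Random(seed)
    estimates = [0.0] * len(payoffs)
    counts = [0] * len(payoffs)
    for _ in range(episodes):
        if rng.random() < eps:
            action = rng.randrange(len(payoffs))                    # explore
        else:
            action = max(range(len(payoffs)),
                         key=lambda i: estimates[i])                # exploit
        reward = payoffs[action] + rng.gauss(0, 0.1)  # noisy environmental payoff
        counts[action] += 1
        # incremental average of the rewards seen for this action
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates

est = run([0.2, 0.8])   # the agent never sees these payoff values directly
```

After enough episodes the estimate for the better-rewarded action should dominate, even though the software has no idea what the actions "mean"; the same indifference to meaning is what lets a classifier system run a chemical plant or a greenhouse with the same machinery.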
Computer scientists use reinforcement learning to train robots equipped
with a simple vision system that are seeking a source of electric power to
recharge themselves. The inputs to the software that resides in the robot
include signals from position and light sensors on the robot and these inputs
prompt the software to generate instructions to actuators that move the robot
or change the direction in which it faces. If the robot moves toward, and
eventually reaches, a power outlet, it receives a reward. The expectation is
that in due course the robot will learn to associate an image of the power out-
let that its sensors provide and movement toward that outlet with a reward.
A chemist or a biochemist might want this sort of software to do something
a little more practical, like controlling a biochemical reactor or analyzing a
medical database. In the former case, the inputs would provide details of the
progress of the reaction and conditions within the reactor, while the outputs
would be instructions to bring about physical changes, such as opening a
valve to add reagent or increasing the flow of coolant through a reaction
vessel jacket.
* We have been here before. The genetic algorithm attempts to find the string of highest qual-
ity without knowing anything about what the string actually “means” in the outside world.
An artificial neural network attempts to find the set of connection weights that minimize
the difference between the network output for a test set and the target responses. And a
self-organizing map modifies its weights vectors so that every sample pattern can find a
node whose weights resemble those of the pattern, though the algorithm knows nothing of
the significance of its weights. In each case, the scientific interpretation of the sample data is
of no consequence to the AI algorithm. It is evident that this underlying indifference to the
meaning of the sample data is one of the reasons that AI systems are so flexible.
The most useful classifier systems learn without any direct help or instruc-
tion from the user. We shall meet these in the second part of this chapter, but
first we consider how a CS programmed in advance could be used to control
the temperature of a reacting mixture in a vessel.
Figure 9.2
The components that comprise, and the flow of messages within, a classifier system.
* In this discussion, we assume that the environment is a physical entity, but other environ-
ments may be used. It could be a database, a stream of messages, or any object that can pro-
vide input to the classifier system, accept output from it, and pass judgment on the quality or
value of that output.
Steps 2 and 3, in which the environment plays no direct part, can be run
independently for each prototype rule in the CS; thus the system is well
suited to implementation on parallel processor machines. In the next section,
we consider the components that form the system in more detail.
Figure 9.3
Schematic of a biochemical reactor.
Figure 9.4
The format of an input message.
Table 9.1
The Meaning of Messages Sent by the Digital Thermometer
Message Condition
Input Message List

1000101
0110101
1101010
1101010
1001010
0111011
.....

Figure 9.5
The input message list.
away. We shall soon need to compare the binary input messages with other
binary strings, however, and use of this format makes this comparison sim-
ple, fast, and unambiguous.
rules.* The classifier list is the only part of the system that can be modified,
thus this is where the memory of a CS resides. Rules in a CS have the same
format as those used by expert systems:
condition/action
The condition part of the classifier is constructed from the ternary
alphabet [0, 1, #], so each of the following is a valid condition string:*

01101   01##1   #####
A classifier (often) contains more than one condition and (rarely) contains
more than one action, thus the most general form of a classifier is

condition1, condition2, … /action1, action2, …
When the classifier system is ready to tackle messages on the input list, the
first message is extracted and processed by the first classifier. Each digit of
the entire input message, including the identifier tag, is compared with the
corresponding digit in the condition of the classifier to determine whether
the digits in the message and the condition match in every position. Because
all digits are compared, position-by-position, the length of the condition
string in the classifier must equal that of the input message. If the # sign
appears in the condition, this is interpreted as a “don’t care” symbol, so a #
in the classifier matches either a 1 or a 0 in the message.
Thus, the input message 01101 matches each of the conditions 01101, 01##1,
and #####, but does not match #1110. Note that the presence of the identifier
tag at the start of a message and the requirement that this must match the
corresponding digits in the classifier conditions means that some classifiers
will match and, therefore, deal with only messages from a particular probe.
Multiple condition classifiers are satisfied only when all of the conditions
are met by messages on the current input list. It is not necessary that the dif-
ferent conditions are satisfied by different messages, although they may be.
Thus, the message 11001 will satisfy both conditions in the classifier 110##,
1##01/action.
One or more conditions in a classifier may be preceded by a “~” character;
this is interpreted as negation of the condition, thus the condition is satisfied
only if there is no matching message on the input list. Hence, the classifier:

11001, ~01###/action
* This ternary coding has some limitations; in particular, it may affect the ability of the system
to derive general rules. The coding described in this chapter, which relates to work by John
Holland, has been successfully used in a variety of applications and has the advantage of
simplicity. Readers who wish to explore alternative ways of representing classifiers will find
them described in recent papers on classifier systems.
will be satisfied only if (1) the message 11001 does appear on the input list
and (2) there are no messages on the input list that start with 01.
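The matching rules just described are compact enough to express directly. The following is a minimal sketch (the function names are our own) of wildcard matching and multi-condition satisfaction, including the “~” negation:

```python
def matches(condition, message):
    # '#' is the "don't care" symbol; every other digit must match exactly,
    # and the condition must be the same length as the message
    return len(condition) == len(message) and all(
        c == '#' or c == m for c, m in zip(condition, message))

def satisfied(conditions, input_list):
    # A multi-condition classifier fires only when every condition is met
    # by some message on the input list; a '~'-prefixed condition is met
    # only if NO message on the list matches it.
    for cond in conditions:
        if cond.startswith('~'):
            if any(matches(cond[1:], msg) for msg in input_list):
                return False
        elif not any(matches(cond, msg) for msg in input_list):
            return False
    return True

print(matches('01##1', '01101'))                            # True
print(satisfied(['11001', '~01###'], ['11001', '10010']))   # True
print(satisfied(['11001', '~01###'], ['11001', '01010']))   # False
```

The last two calls reproduce the negated-classifier example in the text: the classifier fires when 11001 is present and no message starting with 01 appears, and fails as soon as one does.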
Example 1
If the input message list consists of the messages:
01010 10010 00111 00010
and the classifiers are
0010#/#0001, #001#/11111, 00101/10##1, 0##10/1##10,
01##0/1#001
the output messages shown in Table 9.2 will be generated.
The messages 11010, 11001, 11111 (twice) and 10010 will then be posted
onto the output list. A single input message may be, and often will be,
matched by more than one classifier. Equally, a single classifier may be satis-
fied by several input messages. If the classifier has more than one condition,
each input message may give rise to a separate output message. If the input
messages were
Table 9.2
Output Messages Generated by a Small Group of Classifiers (see text)

Input message    Is matched by the classifier    And generates the output
01010            0##10/1##10                     11010
01010            01##0/1#001                     11001
10010            #001#/11111                     11111
00010            #001#/11111                     11111
00010            0##10/1##10                     10010
The classifier
11#01, #0010/01###
It is easy to see that multiaction classifiers could give rise to large numbers of
output messages, so these are used sparingly.
The order in which output messages are generated and placed on the out-
put list is of no consequence, so the system can process information in paral-
lel without worrying about whether one classifier should inspect the input
list before another gets the chance. The number of messages posted to the
output list is, in Example 1, equal to the number of classifiers, but in general
the input and the output lists need not be equal in length either to each other
or to the classifier list.
Output messages have a format that is similar to input messages; they are
binary strings and are divided into an identifier and a data portion. We shall
assume that any output message that starts with 01 is an instruction to a
heater in the reactor. The remaining digits in the message indicate the power
to be supplied to the heater, as specified in Table 9.3.
Thus, input message 00111 (interpreted as “the temperature is below
Ttoo_cold ”) would satisfy the classifier 00111/01111. The classifier would out-
put the message 01111, which would be interpreted as “turn on the heater”
(the 01 part of the action) “to maximum power” (the 111 part). In this way,
an input message that indicates a need for action has generated a suitable
response.

Table 9.3
Relationship between an Output Message and the Action Taken in the Environment

In effect, the identifier tag ensures that the message is delivered to
the classifier(s) designed to cope with it; therefore, the classifier 01##0/101#1
will only read and respond to messages sent from the temperature probe
whose tag is 01.
Classifiers such as

#####/10010   ##1#1/00###

whose conditions are rich in # symbols, are generalists, satisfied by many
different input messages. By contrast, classifiers whose conditions contain
few # symbols compared to the number of 0s and 1s, or contain none at all,
are specialists, triggered only by one or a few messages. Although these
classifiers will rarely be called into action, they may nevertheless be of particular
value, providing the right response to deal with specific environmental con-
ditions. We can expect them to be a valuable component in the classifier set.
Example 2
Table 9.1 shows that both the input messages 00111 and 00100 indicate
that the temperature is too low. The classifier 001##/01### would be
satisfied by either of these messages. In response to the message 00111,
which indicates that the temperature is below the safe limit so must
immediately be raised, the classifier would output 01111 (turn on the
heater to full power). The message 00100 indicates that the temperature
is above the lowest limit, but below the optimum range, and the classi-
fier would generate the message 01100 to turn on the heater to a medium
power. The # symbol, therefore, allows for the creation of multipurpose
rules, whose action is able to adapt to different input messages. The abil-
ity to match a number of different input messages has the advantage that
several classifiers can be replaced by just one, thus increasing efficiency.
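Example 2 relies on the convention, consistent with the outputs quoted in the text, that a # in the action part passes the corresponding digit of the input message through to the output. A one-function sketch of that behavior (the function name is our own):

```python
def fire(action, message):
    # A '#' in the action copies the corresponding digit of the input
    # message into the output, so a single rule adapts its response
    # to different input messages.
    return ''.join(m if a == '#' else a for a, m in zip(action, message))

print(fire('01###', '00111'))   # 01111: heater to full power
print(fire('01###', '00100'))   # 01100: heater to medium power
```

This is how the single classifier 001##/01### replaces two fixed-output classifiers in the example.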
Figure 9.6
The variation with time of an environment parameter such as pH. Horizontal line: represents
perfect control; bold line: environment under the control of specialist classifiers only; dot–
dashed line: environment under the control of generalist classifiers only; dashed line: environ-
ment under the control of both generalist and specialist classifiers.
9.5.1 Bids
The crucial first step in identifying useful classifiers is to require classifiers to
take a risk when they post a message to the output list. When the input message
list is checked against the classifiers, classifiers that are satisfied will attempt to
generate at least one output message.* So far we have assumed that, if a classifier
is satisfied by an input message, it always posts an output message. However,
* Any classifier with more than one action part could post more than one output message per
input message.
in a full classifier system a satisfied classifier must bid for the right to post,
and the chance that classifier i succeeds depends on the size of its bid relative
to all the bids made in the same cycle:

$$p_i(t) = \frac{bid_i(t)}{\sum_j bid_j(t)} \qquad (9.3)$$
Since generalists have low specificity, their bids for a given classifier strength
will be lower than bids from specialists, and their bids may fail unless the
generalists are of such general utility that they often receive a reward and,
thus, have high strength. Notice that the generalist #####/action has a specificity of zero,
so if this classifier is a member of the classifier list, any bid it makes cannot
succeed.
The classifier requires some resources that can be used in the bidding and
for this purpose it uses a portion of its strength when it bids. If the bid is
large enough to succeed, the message appears on the output list and the bid
that has been made is deducted from the classifier strength. If the bid fails,
no message is posted and the strength remains unaltered. In due course, if
the combined effect of all output messages on the environment is favorable,
the reward that the environment returns will enhance the strength of the
successful bidders. The next time the classifier bids, it will be stronger, its bid
will be more likely to succeed and the environment will benefit again. On the
other hand, if the message that a classifier posts is not helpful, the classifier
may receive no reward, its strength will fall because part of it has been used
to make the bid and eventually its bids to post its unproductive messages
will fail. In this way the output of the system is gradually filtered so that use-
ful messages predominate while unwanted messages are eliminated.
If conflicts arise when the messages are sent to the effectors, recourse is
made to the bids that each classifier made to post the message. The probabil-
ity that one message is accepted rather than another is proportional to the
ratio of the bids made by the classifiers when the messages were posted:
$$B_i(t)\,/\,\left\{B_i(t) + B_j(t)\right\}$$
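This conflict-resolution step can be sketched as a bid-proportional selection. The function name and the use of Python's random module are our own; the message names are purely illustrative.

```python
import random

def choose_message(bids, rng=random):
    # Accept one of the competing effector messages with probability
    # proportional to the bid made by the classifier that posted it.
    total = sum(bids.values())
    r = rng.uniform(0, total)
    acc = 0.0
    for msg, bid in bids.items():
        acc += bid
        if r <= acc:
            return msg
    return msg   # guard against floating-point shortfall

choice = choose_message({'open_valve': 3.0, 'close_valve': 1.0})
```

With bids of 3.0 and 1.0, the first message should be accepted about three times as often as the second over many trials.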
9.5.2 Rewards
Once the output list is complete, messages that are intended for effectors are
sent into the environment where they are acted on and, as a consequence,
the environment changes. If the combined actions of the effectors produce
a desirable change, for example, by lowering the pressure in a pipeline that
had been identified by pressure sensors as operating above the prescribed
pressure limits, the environment provides a positive reward to the classifier
system, which is shared among all the classifiers that succeeded in posting
a message.
In the simplest of classifier systems, the reward from the environment is
divided up equally among all those classifiers that were successful in post-
ing output messages (recognizing that classifiers that post several messages
will get a proportionately larger share of the pool). It may seem unfair that
all classifiers that made successful bids get the same share of the payoff when
some of the messages will be more helpful in bringing about the necessary
change in the environment than others, which will be of lesser value or be
counterproductive. However, the CS has no way of knowing which messages
were most beneficial as the environment returns a single, global reward, thus
the system cannot selectively reward some classifiers and ignore others.
If classifiers are coupled (see section 9.5.5), one of the classifiers is not sat-
isfied by a message from the environment, but by a message from a second
classifier generated during the previous major cycle. When this happens,
after receiving its share of the reward from the environment, the first clas-
sifier then passes an amount equal to its bid to the second classifier, whose
message satisfied it.
After many cycles, productive classifiers will have gained significant
strength, but those classifiers that are not of value in controlling the envi-
ronment will have accumulated little. The reward scheme thus allows us to
identify the more useful classifiers, but how can we persuade the system as
a whole to learn?
To bring about learning, there must be a mechanism that can modify the
contents of the classifier list so that classifiers that serve no useful purpose
can be removed and possibly replaced by others of greater value. This can be
accomplished in two ways: rule culling and rule evolution.
Even for short classifiers, the number of distinct possibilities runs into the
tens of thousands (Figure 9.7), which, by several orders of magnitude, is too
large a pool with which to deal. As a consequence, the creation of an exhaustive
list of classifiers is rarely a realistic option.
Figure 9.7
The number of distinct classifiers as a function of classifier length.
1.
a. Randomize all classifiers.
b. Place messages from the environment on the message list.
c. Compare each message with the condition part of each classifier.
If a classifier is satisfied, place the output on the output message
list, but only if the bid it makes is sufficient.
d. Tax all bidding classifiers.
e. Once all messages have been read, delete the input message list.
f. Sweep through the output message list checking for any messages
to effectors.
g. Request the effectors to carry out the appropriate action.
h. Receive a payoff from the environment and distribute this to clas-
sifiers that have posted to the message list.
i. Occasionally run an evolutionary algorithm (step 2) to generate
new classifiers from the old based on classifier fitness.
j. Return to step 1b.
2. Operate the GA by:
a. Picking classifiers semistochastically from the classifier list to act
as parents, using an algorithm in which the fitness of a classifier
is based on its accumulated reward.
b. Copying the chosen classifiers into a parent population, then
using the genetic operators of crossover and mutation to prepare
new classifiers.
c. Replacing some of the old classifiers by new classifiers and
continuing.
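The major cycle listed above (steps 1b to 1h) can be sketched end to end. The classifier format follows this chapter, but the tax rates, the bid formula (a constant times strength times specificity), and the use of a fixed minimum bid in place of probabilistic bidding are all illustrative simplifications of our own.

```python
class Classifier:
    def __init__(self, condition, action, strength=10.0):
        self.condition, self.action, self.strength = condition, action, strength

    def specificity(self):
        # Fraction of positions in the condition that are not '#'
        return sum(c != '#' for c in self.condition) / len(self.condition)

    def matches(self, message):
        return all(c == '#' or c == m
                   for c, m in zip(self.condition, message))

def major_cycle(classifiers, input_list, reward,
                bid_coeff=0.1, head_tax=0.005, bid_tax=0.01, min_bid=0.01):
    """One pass through steps 1b-1h: match, bid, tax, post, share payoff."""
    output_list, posters = [], []
    for cl in classifiers:
        cl.strength -= head_tax * cl.strength          # head tax on everyone
        for msg in input_list:
            if cl.matches(msg):
                bid = bid_coeff * cl.strength * cl.specificity()
                cl.strength -= bid_tax * cl.strength   # bid tax, win or lose
                if bid >= min_bid:                     # too-small bids fail
                    cl.strength -= bid                 # pay the bid to post
                    out = ''.join(m if a == '#' else a
                                  for a, m in zip(cl.action, msg))
                    output_list.append(out)
                    posters.append(cl)
    if posters:                                        # global payoff is shared
        for cl in posters:                             # among successful posters
            cl.strength += reward / len(posters)
    return output_list
```

Run on the thermometer example, a specialist such as 00111/01111 posts its message, pays its bid, and recoups the cost from the reward, while the zero-specificity generalist #####/… bids nothing, posts nothing, and slowly loses strength to the taxes, exactly the filtering behavior described in the text.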
The first classifier is satisfied by the message 01001 and posts the message
00001 to the output list. If this is not a message to any effector, it will be trans-
ferred to the input list on the next major cycle and read by all the classifiers
in that cycle. The message 00001 satisfies the second classifier, which would
post the message 00111; in turn, this message activates the third classifier in
the next cycle.
9.5.6 Taxation
One final element is needed to ensure smooth evolution of a CS. In a large
set of classifiers, there may be a number that are never satisfied by an input
message and so never bid; others may be so indiscriminate that they put in a
bid every cycle, but have so little strength that they never succeed in posting
any message. Neither type of classifier will contribute much to the smooth
working of the overall system, except possibly by providing a pool of genetic
material that may be accessed by the GA. Their presence will only expand
the classifier list and slow operation of the algorithm.
It is, therefore, common practice to impose a tax on the strength of the clas-
sifiers each cycle. Taxes may include a head tax, which is applied to every
classifier, and a bid tax, which applies to all those classifiers that make a bid
during the current cycle, irrespective of whether or not the bid succeeds. The
bid tax is designed to slowly weaken those classifiers that bid repeatedly, but
never do so with enough conviction to succeed. The head tax applies to every
classifier so that those that are so hopeless that they never even bid also lose
strength and will eventually be removed.
Tax rates are kept low; if they were not, too many classifiers would be lost.
Especially vulnerable would be specialist classifiers, which may need to wait
many cycles before being triggered by any input message and thereby having
the chance to gain a reward. By the imposition of taxes, the number of
classifiers is kept to a workable level, which both increases execution speed
and makes the manipulations of the genetic algorithm more efficient.
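The two taxes can be sketched as below. The rates and the culling threshold are illustrative values, not ones taken from the text; real systems would tune them so that specialists survive long waits between rewards.

```python
def apply_taxes(classifiers, bidders, head_tax=0.001, bid_tax=0.01):
    """Each cycle, every classifier pays a small head tax; those that
    bid (whether or not the bid succeeds) pay a bid tax as well."""
    for c in classifiers:
        c["strength"] -= head_tax * c["strength"]      # head tax: everyone
        if c in bidders:
            c["strength"] -= bid_tax * c["strength"]   # bid tax: bidders only
    return classifiers

def cull(classifiers, threshold=0.05):
    """Remove classifiers whose strength has decayed below a threshold,
    keeping the classifier list to a workable size."""
    return [c for c in classifiers if c["strength"] >= threshold]
```

Under these rates a classifier that never bids loses 0.1 percent of its strength per cycle, while a perpetual unsuccessful bidder loses roughly 1.1 percent, so hopeless generalists decay faster than patient specialists.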
9.6 Applications
At one stage in the early 1990s it appeared that, after a promising start, clas-
sifier systems would quietly fade away from the AI landscape. Although
their principles are relatively straightforward, their behavior is complex
and does not lend itself easily to a theoretical analysis. In addition, like
neural networks, they are computationally greedy, and long periods may pass
during training in which they seem to make only slow progress.
Furthermore, although this chapter has concentrated on the use of a CS in
the context of control problems, no chemical company synthesizing ethylene
oxide in bulk, or any other chemical for that matter, would allow a CS to play
around with its chemical plant for the days or weeks that might be neces-
sary for the system to learn how to control the plant safely (bearing in mind
the need to clear up the occasional mess after the CS makes a mistake in its
learning and the plant explodes or burns down).
So what are the prospects for the CS, if indeed it has any?
Recently there has been a resurgence of interest as their operation is better
understood and as the range of areas to which they could be applied starts
to grow. Increases in computer power have also helped, since while a three-
day wait to see if a CS is making progress may be excessive, a wait over the
lunch hour to see what is happening is quite reasonable. The area in which
they have the greatest potential, process control, is just the area in which
training is most problematic. However, alternatives exist to giving a CS a
complete, functional process line to investigate. The classifier list could be
built incrementally, by running it successively on subsections of the entire
system. It may be able to learn by interacting with a simulation, so it can get a
satisfactory understanding of the process without destroying what it should
be controlling. It may be possible for portions of a CS found useful in one
application to be used as the core for a second.
Despite the revival in interest and what appears to be considerable poten-
tial, the number of applications remains low; there is considerable scope for
development in this area. Attempts to use CS in the control of traffic lights,
sequence prediction in proteins, and the analysis of medical data all suggest
that the range of possible applications is wide — the field is just opening up.
A good way to keep up with recent developments in this field is to search for papers by, and references to,
Rick Riolo, Lashon Booker, Richard Belew, and Stephanie Forrest, who have
been active in the field for a number of years.
9.8 Problems
1. The rates of many enzyme reactions are strongly dependent on both
pH and temperature. Construct a CS that learns to keep conditions
within a simple reactor within the limits 6 < pH < 9 and 27 < T < 41.
You will need both to write the classifier system itself and a small
routine that represents the environment. Test the operation of your
system by including a method that periodically adds a random
amount of acid or base, or turns on a heater or chiller for a short
period.
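A toy environment routine for this problem might look as follows. All of the dynamics, ranges, and the 2-bit sensor encoding are invented here as a starting point; the classifier system itself, and the effectors that respond to its messages, are left to the reader.

```python
import random

def make_reactor(pH=7.5, T=34.0):
    """A toy reactor state for the exercise: pH and temperature (deg C)."""
    return {"pH": pH, "T": T}

def disturb(reactor, rng):
    """Periodically add a random amount of acid or base, or run the
    heater or chiller for a short period (modelled as random jumps)."""
    reactor["pH"] += rng.uniform(-1.5, 1.5)
    reactor["T"] += rng.uniform(-4.0, 4.0)

def sense(reactor):
    """Encode the state as a 2-bit message: is each variable in range?"""
    ok_pH = 6 < reactor["pH"] < 9
    ok_T = 27 < reactor["T"] < 41
    return f"{int(ok_pH)}{int(ok_T)}"
```

The classifier system would read messages such as "01" (pH out of range, temperature acceptable) and learn, via rewards, which corrective effector to trigger.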
10
Evolvable Developmental Systems
Nawwaf Kharma
Associate Professor, Department of Electrical and Computer Engineering,
Concordia University, Montreal, Canada
Figure 10.1
Relationship between genome and phenome and how each is adapted.
(1) the only feedback from the phenome to the genome comes in the form of a single quality associated with
the behavior of the completely developed phenome in a given environment,
and (2) any adaptation that the phenome (or individual) experiences during
its lifetime is not passed forward to any future generations (via the genome)
(Figure 10.1). This is in keeping with current theories of Darwinian evolu-
tion, as opposed to Lamarckian inheritance of acquired characteristics.
A made-up example of a developmental mapping may be presented using
production (or rewrite) rules.
X → LBR
L → A2
R → 1C
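The made-up rules above can be applied mechanically: every symbol with a rule is rewritten, and symbols without one (here B, A, 2, 1, C) pass through unchanged. A minimal sketch:

```python
def expand(axiom, rules, depth):
    """Repeatedly rewrite every symbol that has a production rule;
    symbols without a rule (terminals) are copied through unchanged."""
    s = axiom
    for _ in range(depth):
        s = "".join(rules.get(ch, ch) for ch in s)
    return s

# The made-up rules from the text:
rules = {"X": "LBR", "L": "A2", "R": "1C"}
```

One rewriting step turns the axiom X into LBR; a second step turns LBR into A2B1C, the fully developed string.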
Indeed, the last point highlights the fact that there exists no method that
would allow one to go back from (1) a well-defined phenomic form and (2) a
particular developmental approach (e.g., cellular automata) to a developmen-
tal program, which is embeddable within the genome. The process of devis-
ing a developmental mapping, for a given problem, has thus far been an art
practiced by the few, rather than a technique taught to the many. Neverthe-
less, having a developmental mapping almost always involves:
In one view,1 there are three types of mappings, which the authors call
embryogenies: external, explicit, and implicit. Though this taxonomy focuses
on the essential distinguishing feature of an EDS (i.e., its mapping), it fails to
provide a clear division between explicit and implicit mappings. The other
taxonomy is that of Stanley and Miikkulainen.2 It takes as its basis the under-
lying biological features of the developmental process. It is a valiant attempt
at proposing something truly novel and inspiring. However, we feel that the
taxonomy is somewhat arbitrary, complex, and too “biological” for it to be of
immediate use to all parties interested in EDS.
The rest of the chapter is mainly composed of five diverse examples of
application of evolutionary developmental methods to problems in engineer-
ing and computer science. The chapter concludes with an attempt to foresee
some future trends in EDS research and application, followed by a very short
story that we hope will entertain and perhaps inspire.
Figure 10.2
A block diagram of an Sblock. (Haddow, P.C. and Tufte, G. [2001] Bridging the genotype-phenotype mapping for digital FPGAs. In proceedings of the Third NASA/DoD Workshop on Evolvable Hardware, IEEE Computer Society.)
Figure 10.3
The truth tables of two Sblocks. (Haddow, P.C. and Tufte, G. [2001] Bridging the genotype-phenotype mapping for digital FPGAs. In proceedings of the Third NASA/DoD Workshop on Evolvable Hardware, IEEE Computer Society.)
Figure 10.4
L-systems type rules of change and growth. (Haddow, P.C. and Tufte, G. [2001] Bridging the genotype-phenotype mapping for digital FPGAs. In proceedings of the Third NASA/DoD Workshop on Evolvable Hardware, IEEE Computer Society.)
when these conditions are satisfied, the RHS of the growth rule (= 32 bits) is
placed in one of the available free neighbors; which one depends on a preset
priority scheme. North has first priority, followed by south, west, and east.
As a result of the successful application of a growth rule, an Sblock that
was free but neighbored an initial seed gains a fully specified truth table
and becomes a functioning connected cell.
As stated, rules are applied in order of their priority. Additionally, when
a rule is matched and applied to a certain location, that location is protected
from any further change until every other rule (in the rules’ set) is tested for
application. Once all the rules have been tested, all protection is removed
from the array, and the whole cycle of selection, application, and protection
of affected locations is repeated. Typically, this iterative application of rules is
repeated for a predetermined number of cycles. It is, however, conceivable to
use a different criterion for termination of growth, such as lack of change.
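The placement priority (north, then south, west, east) can be sketched as below. The grid representation and the treatment of missing cells as free are our own simplifications; the 32-bit RHS is represented here by an arbitrary configuration value.

```python
# Priority order for placing the RHS of a growth rule into a free
# neighbor of a matched Sblock: north first, then south, west, east.
PRIORITY = ["N", "S", "W", "E"]

def place_growth(grid, pos, rhs):
    """Put the RHS of a growth rule into the highest-priority free
    neighbor of `pos`; return the chosen direction, or None if no
    neighbor is free. `grid` maps (row, col) -> configuration or None."""
    r, c = pos
    offsets = {"N": (r - 1, c), "S": (r + 1, c),
               "W": (r, c - 1), "E": (r, c + 1)}
    for d in PRIORITY:
        cell = offsets[d]
        if grid.get(cell) is None:      # free Sblock found
            grid[cell] = rhs            # it becomes a configured cell
            return d
    return None
```

If the north neighbor is already configured, the rule's output falls through to the south neighbor, and so on down the priority list.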
10.3.6 Experimental Results
Figure 10.5 illustrates three snapshots of a typical run achieved by Haddow
et al.3 In the figure, a square grid of 16 × 16 Sblocks is shown. A black box
represents a 0-input box or an unconfigured (or free) Sblock. The decrease in
the number of inputs to Sblocks necessary for routing to occur is expressed
by a change in the color from dark grey to white. White Sblocks stand for
routing blocks. The experiments evolved 30 to 50 percent routers in less than
one hundred evolutionary generations.
The trend shown toward increasing routing blocks demonstrates that the
application of artificial evolution (i.e., a GA) to the problem of finding rules
that can grow effective (though not optimal) digital circuits is a real techno-
logical possibility.
Figure 10.5
A grid of developing Sblocks after 3, 23, and 57 generations. (Haddow, P.C. and Tufte, G. [2001] Bridging the genotype-phenotype mapping for digital FPGAs. In proceedings of the Third NASA/DoD Workshop on Evolvable Hardware, IEEE Computer Society.)
Figure 10.6
Two plane trusses; the left is stable, the right unstable.
attempts to design a truss that can maximize those criteria while being both
stable and as resistant to external forces as possible. It will be assumed, for
all future discussions, that the trusses are topologically connected, pin-con-
nected, friction free, and that force is applied only at joints.
Truss Stress Analysis: The computation of member forces in an arbitrary
plane truss is now examined. There exist some simple counting tests that
may determine if a given truss is unstable. Failing that, one must attempt
to compute the equilibrium state given some external forces; in the process,
one obtains values for all member forces. In this example, all truss mem-
bers are identical in terms of material and area, grown in a developmental
space where units are measured in meters; EA is set to 1.57 × 10^4 N, corresponding
to a modulus of elasticity for steel and a cylindrical member of
diameter 1 cm. Consider a general truss with n joints and m beams; exter-
nal forces are applied at joints and the member forces are computed. Let the
structure forces be

{P} = {P_1, …, P_n}^T,

the structure displacements be

{Δ} = {Δ_1, …, Δ_n}^T,

and the member forces be

{F} = {F_1, …, F_m}^T.
One may relate the individual member forces to displacement and struc-
ture forces as follows:
{F}_i = [k]_i [β]_i {Δ}
where [β]i is the connectivity matrix for the i th member beam and [k]i is its
stiffness matrix, relating the deformation of the beam under a given force to
the displacement at the joint. Hence, to solve for forces, it suffices to compute
the displacements. The displacements may be computed through a truss
stiffness matrix, a combination of the individual member stiffness matrices:
{Δ} = [K]^(-1) {P}
Therefore, given a plane truss, one may first compute the stiffness matrix,
then compute the displacements, then the individual member forces. The
entire process is bounded by the calculation of a matrix inversion (or
LU-decomposition), and, hence, has running time O(m^3).
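The central computation, solving {Δ} = [K]^(-1){P}, can be sketched with plain Gaussian elimination. This is an illustrative solver only (assembly of [K] from the member stiffness and connectivity matrices is omitted); in practice one would use a library LU-decomposition.

```python
def solve(K, P):
    """Solve K . delta = P by Gaussian elimination with partial pivoting,
    i.e., compute the displacements delta = inverse(K) . P.  The cost is
    O(n^3), matching the bound quoted for the whole analysis."""
    n = len(K)
    A = [row[:] + [p] for row, p in zip(K, P)]   # augmented matrix [K | P]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]      # partial pivoting
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= f * A[col][c]         # eliminate below pivot
    delta = [0.0] * n
    for r in range(n - 1, -1, -1):               # back-substitution
        s = sum(A[r][c] * delta[c] for c in range(r + 1, n))
        delta[r] = (A[r][n] - s) / A[r][r]
    return delta
```

With the displacements in hand, each member force follows from {F}_i = [k]_i [β]_i {Δ} as described above.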
Evolvable Developmental Systems 301
Time t ← 0
Initialize developmental space D_t
while D_t ≠ D_(t−1) do
    t ← t + 1
    D_t ← D_(t−1)
    for all Cell c ∈ D_(t−1) do
        if c has sufficient age and r_c then
            Action a ← φ(μ_c)
            Decrement r_c appropriately for a
            Execute a in D_t
        end if
    end for
end while
(c, h_1, …, h_(n_c), a)
color = 2^4·g_4 + 2^3·g_3 + 2^2·g_2 + 2^1·g_1 + 2^0·g_0 + 1
The zero cell type is reserved for the empty cell, the one value is for a joint
with no beams, and all other combinations exist in the set {2, . . . , 32}.
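The encoding from five genome bits to a cell type is a direct binary read-out plus one, which can be written as:

```python
def cell_type(g):
    """Map five genome bits g = (g4, g3, g2, g1, g0) to a cell type in
    {1, ..., 32}; type 0 is reserved for the empty cell and is never
    produced by this encoding."""
    g4, g3, g2, g1, g0 = g
    return 16 * g4 + 8 * g3 + 4 * g2 + 2 * g1 + g0 + 1
```

All-zero bits give the joint-with-no-beams type 1; all-one bits give the maximum type 32.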
Finally, one may allow cells to be elongated in one direction, by an arbitrary
number of cell lengths. For example, a cell of type 9 has an angle of 3π/4 with
the x-axis and a length of √2. A single elongation in the y-direction would
lead to a length of √5 and an angle of 7π/8 with the x-axis. Hence, excluding
elongation, any two-dimensional lattice of integers may be mapped to (some)
truss. One such mapping, including elongations, is shown in Figure 10.7.
In Figure 10.8, the growth of an agent is shown (in grey), whereas the final
organism (in black) is much smaller. This is the result of a trimming process
applied to every organism following development. The trimming process
serves to (1) remove obviously unstable sections, such as beams which do not
Figure 10.7
Example of a mapping between a lattice of integers and a plane truss.

Figure 10.8
Example of growth achieved by a Deva1 algorithm.
connect to joints at both ends; (2) to remove sections that are not connected to
the base of the structure; and (3) to remove redundant joints, replacing them
with longer beams. All three of these can be accomplished in a single pass of
the untrimmed truss structure, allowing for processing in O(n) time, where
n is the number of beams.
The output of φ may be computed as follows, given a current cell c0 and its
neighborhood µ c0 :
The best free location is defined as the empty adjacent location (in the
von Neumann neighborhood surrounding the cell), which lies opposite to
the greatest mass of nonempty cells (in the Moore neighborhood). Most cell
actions come with a cost, decrementing a cell’s rc ; this is meant to incorporate
the notion of finite resources. If a cell cannot execute an action (no best free
location, insufficient resources), it does nothing. A Deva1 growth is controlled
then through a genome (transition function), and several system parameters
(number of cell types, n_c; initial setting of the resource counter, r_c). Figure 10.9
illustrates one possible example of Deva1 growth.

Figure 10.9
Another example of growth in developmental space.
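The "best free location" rule can be sketched as follows. The vector-sum scoring used here (pick the free von Neumann neighbor whose direction is most opposed to the summed directions of occupied Moore neighbors) is our own reading of the prose, not the authors' published code.

```python
def best_free_location(grid, pos):
    """Among the empty von Neumann neighbors of `pos`, pick the one that
    lies opposite the greatest mass of occupied cells in the Moore
    neighborhood. `grid` maps (row, col) -> occupied value or nothing."""
    r, c = pos
    moore = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
             if (dr, dc) != (0, 0)]
    # Mass vector: sum of directions toward occupied Moore neighbors.
    mr = sum(dr for dr, dc in moore if grid.get((r + dr, c + dc)))
    mc = sum(dc for dr, dc in moore if grid.get((r + dr, c + dc)))
    von_neumann = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    free = [(dr, dc) for dr, dc in von_neumann
            if not grid.get((r + dr, c + dc))]
    if not free:
        return None   # no best free location: the cell does nothing
    # Most-opposite free direction = most negative projection onto mass.
    dr, dc = min(free, key=lambda d: d[0] * mr + d[1] * mc)
    return (r + dr, c + dc)
```

With a single occupied cell to the north, the mass vector points north, so the chosen free location is the southern neighbor.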
p = 1/2 + (1/2) · (165 MPa − M) / (165 MPa)
The second fitness function, f_stoch, is similar to f_mat in all ways except that
rather than apply external forces at the highest joint, the forces are applied
randomly. Hence, three nonbase joints are selected at random, and at each a
force of 5 kN is applied down and 500 N either right or left (with equal prob-
ability at each joint).
Our final fitness function, f_base, is again similar to f_mat in that a force of 20 kN
is applied down and 5 kN right divided between the highest joints. However,
the length minimization factor is removed and instead the base minimization
306 Nawwaf Kharma
factor is included. Therefore, f_base selects for tall trusses capable of supporting a
load at the peak while occupying as little ground space as possible:
10.4.5 Experimental Results
For the fitness trials, the population size is two hundred; initMult is twenty;
the probability of crossover is 0.8; the rate of elitism is 0.01; the probability of
copy mutation is 0.05; the probability of power mutation is 0.05; useSeed is set
to true; φ is one hundred.
There were ten runs of the fitness trials for each fitness and rc setting, rc =
16, 24. These runs are referred to as fit.x.y.z, where x is a fitness function, y is
an rc value, and z ∈ {0, …, 9} is an index. Hence, fit.stoch.24.3 is the third run of
the rc = 24 trial using the fitness function f_stoch.
In all trials, successful trusses were evolved; all runs found stable trusses,
and fifty-six of the sixty found trusses capable of supporting the applied
external forces. In general, heights of approximately 9 m were found in the
rc = 16 trials, and heights of 18 m when rc = 24. There were several general
trends in evolved solutions for each fitness function. For the f_mat function, all
high fitness population members somewhat resembled the exemplar, a simple
triangle-shaped truss. Organisms varied greatly, however, in the f_stoch and
f_base runs. For the f_stoch function, sparse pyramids were common. Also, there
were many agents with thin, tall upper sections and large bases. For the f_base
function, some tall trusses with small, central bases were found. Additionally,
large pyramid trusses with sparse bases were also common. Figure 10.10
shows exemplar population members illustrating these phenomena.
The maximum fitness of agents in the rc = 16 fitness trials is graphed in
Figure 10.11. Note the frequent plateaus found in each run, also present
in the rc = 24 runs. This suggests that the genetic operators are more
frequently impotent or destructive, relative to more typical experiments in
GAs. It is also reminiscent of theories of evolution via punctuated equilibria,
where (real world) evolution is believed to work via infrequent jumps.
As illustrated in Figure 10.12, a visual continuity between the phenotypes
of members could typically be seen. In the example, agents show many simi-
lar qualities, including the presence of single-unit length crossbeams, hol-
lowed-out centers, and elongated base supports.
Figure 10.10
Exemplar organisms from the fitness trials.
Figure 10.11
Plot of fitness (y) against generation number (x).
Figure 10.12
The fittest ten individuals (excluding repeats) in the hundredth generation; all are stable.
Genome (1 chromosome):

Position: 0  1    2    3    4    5    6    7
Gene:     x  001  101  111  220  341  510  600

Phenome (directed graph): see Figure 10.13.

Figure 10.13
A genome and its corresponding phenome in Cartesian genetic programming (CGP). (Miller, J.F. and Thomson, P. [2003] A developmental method for growing graphs and circuits. In proceedings of the Fifth International Conference on Evolvable Systems, Springer-Verlag, Berlin, pp. 93–104.)
The first part of the genome represents the external input (x), which has position
0. Next comes node 1, then node 2, and so on up to node 7. The inputs of
each node can only connect to outputs of nodes preceding them or to external
inputs. All nodes are described similarly, so only the part of the genome
corresponding to node 5 will be described; the rest of the genome can be
interpreted in the same way. Node 5 has two inputs connected to the outputs
of nodes 3 and 4, respectively, and it has function 1, which is multiplication (0
stands for addition).
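Decoding and evaluating such a genome is a single feed-forward pass. The sketch below assumes, as in the example, that each gene is a three-digit string "abf" (two input positions and a function code) and that only addition (0) and multiplication (1) are available.

```python
def evaluate_cgp(genome, x):
    """Decode the example genome and evaluate the feed-forward graph.
    Each gene 'abf' wires a node to the outputs of positions a and b
    (which must precede it) and applies function f (0 = add, 1 = mul)."""
    out = {0: x}                      # position 0 is the external input x
    for pos, gene in enumerate(genome, start=1):
        a, b, f = int(gene[0]), int(gene[1]), int(gene[2])
        out[pos] = out[a] * out[b] if f == 1 else out[a] + out[b]
    return out[len(genome)]           # output of the last node

# The genome of Figure 10.13:
genome = ["001", "101", "111", "220", "341", "510", "600"]
```

Tracing the genes by hand shows node 5 computing the product of nodes 3 and 4, exactly as described in the text.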
state of the cell, i.e., its current function, connectivity, and position. These are
the program's inputs, and its outputs are the cell's next function and connectivity,
and whether or not it will divide. The position of a node does not change. Nodes
operate synchronously (i.e., they have a clock that controls transitions from
one state to the next) (Figure 10.14).
At every clock tick, the developmental program of each cell (in the graph
of interconnected cells) computes its next state and whether it will pro-
duce a new cell or not. Division produces a new cell with exactly the same
(unchangeable) developmental program as the mother cell, but with a new
location = location of the mother cell + 1. Because CGP and by extension
DCGP only allow feed-forward graphs, the inputs of all cells will come from
external inputs and/or the outputs of other cells, which are directly/indi-
rectly connected to the external inputs. Hence, if the external inputs stay
stable, then so will the outputs of all the cells in the graph.
This stability means two things: (1) the functionality and connectivity of
the cell will stay constant (positions, as we said, never change) and (2) the
cell will either always divide or always not divide. This means that there
is a need for an external mechanism that determines when division (of all
dividing cells) is to halt. At present, this is a parameter value determined
beforehand by the human experimenter. All in all, one cell with a DP and
external inputs to that DP may divide into a graph of cells, all with the same
DP, but with potentially different functionalities and connectivities. After
a predetermined number of iterations, all dividing cells will stop dividing,
and the result of development will be a fixed graph of interconnected cells
with one overall functionality. This functionality is realized via a multicel-
lular “organism” lying between the external inputs and the output of the
final node.
One issue remains. How is the DP actually implemented? We know that
the DP is basically a mapping that accepts four integers as inputs (the cell's
two connections, its function, and its position) and returns the cell's next
connections and function, together with a decision on whether or not it will
divide (Figure 10.14).

Figure 10.14
The general model of a node in developmental Cartesian genetic programming (CGP). (Miller, J.F. and Thomson, P. [2003] A developmental method for growing graphs and circuits. In proceedings of the Fifth International Conference on Evolvable Systems, Springer-Verlag, Berlin, pp. 93–104.)
Z = 2A + B
Y = Z + F
X = Y + P
W = ZX
Figure 10.15
The unmodulated functions of the developmental program of a developmental Cartesian
genetic programming (DCGP) node.
Z = (2A + B) mod P
Y = (Z + F) mod P
X = (Y + P) mod N_f
W = (ZX) mod 2
Figure 10.16
The modulated functions of the developmental program of a DCGP node.
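The modulated update of Figure 10.16 can be sketched directly. The roles assumed here are our reading of the figures: Z and Y are the next connections (the mod P keeps them below the position, preserving feed-forwardness), X is the next function index among N_f available functions, and W is the divide bit. Note the sketch assumes P > 0.

```python
def dp(a, b, f, p, n_f):
    """One application of the modulated developmental program to a node
    with connections (a, b), function f, and position p."""
    z = (2 * a + b) % p   # next first connection, kept < p (feed-forward)
    y = (z + f) % p       # next second connection
    x = (y + p) % n_f     # next function index
    w = (z * x) % 2       # 1 -> the cell divides on this clock tick
    return z, y, x, w
```

Because the output depends only on the four inputs, a cell whose inputs are stable will, as the text notes, either always divide or never divide.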
Figure 10.17
The final phenome resulting from the application of the developmental program (DP) in every node. (Miller, J.F. and Thomson, P. [2003] A developmental method for growing graphs and circuits. In proceedings of the Fifth International Conference on Evolvable Systems, Springer-Verlag, Berlin, pp. 93–104.)
Termination criteria usually combine two terms: one relating to the actual
ultimate objective of the exercise, such as a perfectly fitting function, and
another setting an upper limit on the number of generations that the
evolutionary algorithm can run.
10.5.6 Experimental Results
There are a few real-world experimental results for the DCGP approach to
function synthesis. However, this approach was modified in order to apply it
to a pattern-formation task, the growth of a French flag (Figure 10.18).
Figure 10.18
Snapshots of the development of a cell program into a French flag.
Figure 10.19
A graphical representation of genomic regulation in the complex model of development.
Figure 10.20
Genomic regulation in the simple model of development.
Naturally, the state of an RBN will change with time. If there are no exter-
nal influences, then it has been shown that the state of any RBN will settle
into either a point attractor (i.e., steady state) or a cycle.
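This settling behavior is easy to demonstrate: since an RBN's update is deterministic and its state space finite, iterating from any start state must eventually revisit a state, closing a cycle. A generic sketch (the update rule is passed in as a function):

```python
def run_rbn(step, initial, max_steps=1000):
    """Iterate a deterministic update rule until a state repeats.
    Returns (transient_length, cycle_length); a point attractor is
    simply a cycle of length 1."""
    seen = {initial: 0}
    state, t = initial, 0
    while t < max_steps:
        state = step(state)
        t += 1
        if state in seen:                    # state revisited: attractor found
            return seen[state], t - seen[state]
        seen[state] = t
    raise RuntimeError("no attractor found within max_steps")
```

A rule that cycles through four states yields cycle length 4; a rule that saturates at a fixed state yields a point attractor (cycle length 1) after a short transient.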
Figure 10.21
The initial cell dividing twice into three cells.
At the organism level, there are three issues that must be tackled, besides
that of visual representation: the first division, breaking of symmetry, and
intercellular communication. The simulation forces the first cell to divide
by initializing the node (gene) signifying division to 1. This ensures that the
organism divides at least once.
In order to ensure long-term asymmetric divisions, each cell is provided with
information about its spatial positioning with respect to the external boundary
and the midline of the organism. This information is provided to the growing
organism via a special mechanism, which also deals with intercellular com-
munication, i.e., information signals coming from neighboring cells.
A great deal of thought has gone into ways of simulating intercellular com-
munication (or induction). Induction is concerned with the means by which
one cell or group of cells can affect the developmental fate (or settled cell
type) of another cell or group of cells via passing signals. A direct way to
simulate this is to implement a modification, which would allow for some of
the edges of the random Boolean network of a cell to come from a neighbor-
hood vector. The neighborhood vector is the logical OR of the state vectors
of all neighboring cells. To differentiate between an edge coming from the
cell’s own state vector and an edge coming from the neighborhood vector, a
minus sign is attached to every input coming from the neighborhood vector.
Finally, specific bits in the neighborhood vector of a cell are used as binary
(ON/OFF) indicators of midline and externality. This whole arrangement is
shown in Figure 10.22.
Figure 10.22
A simplified graphical representation of intercellular communication.
The fact that the sheer number of cells is rewarded positively encourages
division. If cells of the right type grow in the wrong place (i.e., anywhere), then
that also is rewarded. However, if cells of the right type grow in the right
place then the reward is highest. Finally, the square root term is meant to
balance matching percentages when all wanted cell types are found. The
ideal organism has the highest fitness value of thirty, and only an organism
with two cells (as one is not possible) of unwanted types would receive the
lowest fitness value.
Ideal organism (left):    Evolved organism (right):
4 4 1 1 1 1 2 2           6 6 6 6 6 6 2 2
4 4 1 1 1 1 2 2           4 0 0 0 0 0 0 2
1 1 1 1 1 1 1 1           4 0 0 0 0 0 0 4
1 1 1 1 1 1 1 1           1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1           1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1           4 0 0 0 0 0 0 4
4 4 1 1 1 1 2 2           4 0 0 0 0 0 0 2
4 4 1 1 1 1 2 2           6 6 6 6 6 6 2 2

Figure 10.23
An ideal (left) and an actual evolved organism (right).
10.6.6 Extended Example
The genome of seeker represents the core control circuitry, so to speak, of each
cell of the organism, not just at the start of its development, when it is made
of exactly one cell, but also after development has concluded, resulting in a
functional multicellular organism. The genome is the only part of the organism
that undergoes evolution; all other mechanisms (e.g., the specific way
in which cells are allowed to communicate via a neighborhood vector) are
fixed throughout the simulation. The genome describes the RBN of a cell: the
number of nodes it has, the Boolean logical function of each node, and how
the nodes connect to each other (as well as to external signal sources).

Figure 10.24
The steady-state genetic algorithm: two high-fitness individuals are selected and undergo crossover and mutation; the result replaces a low-fitness individual.
The genome in Table 10.1 describes an RBN with N (number of nodes) = 6
and K (node connectivity) = 2. Each node is fully described by its inputs and
the logic function applied to them. Because each node has 2 inputs, it requires
a 2^2 (= 4) row truth table to fully specify its output (state). Since the truth table
has a fixed format (00/10/01/11), it is sufficient to list the output column of
the truth table of a node to describe its functionality. For example, node #
3 has “0001” as its logic function. This is read: Input values of 00 return an
output of 0, 10 return a 0, 01 return a 0, and 11 return a 1. This is equivalent
to the AND function, applied in this case to inputs (–5) AND (5). The rest of
the rows, representing nodal functions, are read similarly. It is important to
note, however, that the logic function at node #4 will always result in a value
of 1 (or TRUE) making this genome one that favors cell division, something
that was favored in the fitness function seen in equation (10.2) earlier.
It is essential to note that input (5) comes from node #5 of the RBN of the
cell itself. In contrast, the minus sign of input (–5) indicates an external input
coming from the neighborhood vector, where the (–5) and (–6) inputs are
reserved inputs. The value of (–5) equals 1 if the cell is on the boundary of
the organism; it is 0 otherwise. Another external input, (–6) equals 1 if the
cell borders the midline of the organism; it is 0 otherwise. If the information
in Table 10.1 is used to construct the RBN of a seeker cell, then the graphical
representation in Figure 10.25 would emerge.
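The update of such a cell can be sketched as below. The encoding is reconstructed from Table 10.1 and the text: truth-table rows are in the fixed 00/10/01/11 order, bit k of each 6-character vector belongs to node k, and a negative input −k reads bit k of the neighborhood vector. Our synchronous reading reproduces the first transition tabulated for Cell L (000100 → 000101); later steps depend on update details the text leaves open.

```python
GENOME = {   # node: (logic function over rows 00/10/01/11, (input1, input2))
    1: ("0010", (3, -6)), 2: ("1100", (-2, -1)), 3: ("0001", (-5, 5)),
    4: ("1101", (4, 4)),  5: ("0110", (6, -6)),  6: ("0111", (6, -1)),
}

def step(state, neigh):
    """One synchronous update of the seeker RBN of Table 10.1.
    `state` and `neigh` are 6-character bit strings indexed by node
    number; a negative input -k reads bit k of the neighborhood vector."""
    def val(i):
        return int(neigh[-i - 1]) if i < 0 else int(state[i - 1])
    new = []
    for node in range(1, 7):
        tt, (i1, i2) = GENOME[node]
        new.append(tt[val(i1) + 2 * val(i2)])   # row index in 00/10/01/11 order
    return "".join(new)
```

Note how node 4, wired to itself with function 1101, outputs 1 whenever its own bit is stable, which is why this genome favors division.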
In order to understand the mechanism that results in a fully developed
organism, let us look at the start of the process after having — through
Table 10.1

Node #   Logic Function   Inputs
1        0010             3, −6
2        1100             −2, −1
3        0001             −5, 5
4        1101             4, 4
5        0110             6, −6
6        0111             6, −1
Figure 10.25
The random Boolean network of a seeker cell showing internal and external connections (via edges).
Cell L:
Neighborhood vector = 100110
State vector = 000100 → 000101 → 001111
Cell type = 4
State vector bit [4] = 1 (Divide)

Cell R:
Neighborhood vector = 000110
State vector = 100100 → 010100
Cell type = 2
State vector bit [4] = 1 (Divide)
The tabulated form above accurately describes the process all cells go
through. What follows is a crude step-by-step explanation for one of the
daughter cells.
Cells LT and LB:
Neighborhood vector = 011110
State vector = 001111 → 011111
Cell type = 6
State vector bit [4] = 1 → Divide
10.7 Summary
This chapter began with an introduction to the often-neglected mapping
between the genome and the phenome of an individual within the context
of evolutionary computation. In most evolutionary algorithms (EAs), this
mapping is seen simply as a way of converting the genome of an individual
into an entity that can be evaluated for fitness. However, mappings can be
very elaborate, and in natural organisms such mappings take a considerable
amount of time and resources to unfold. It is the time necessary for a seed to
develop into a tree and an embryo into an adult. In evolvable developmental
systems (EDS), development is simulated using mappings and evolution is
used to find a satisfactory mapping or an appropriate configuration for a
generic mapping.
The field of EDS is relatively new and expanding. There are not yet stan-
dard techniques or standard methodologies to apply, not even in specialized
fields of application (e.g., digital chip design). Nevertheless, there are a num-
ber of approaches that seem to be attracting a fair bit of attention by EA prac-
titioners and researchers. These include production rules (e.g., L-systems),
cellular automata (e.g., Deva1), cellular growth mechanisms (e.g., developmental
CGP), genomic regulatory networks (e.g., RBNs), and others. Sections
10.3 to 10.6 of this chapter present examples of each one of these approaches.
They have different areas of “application,” but since real-world applications
are not yet the main aim, the emphasis in our explanations is focused on
methodology.
In section 10.3, Haddow et al. show how they successfully evolved a set
of production rules, which grew a digital circuit that operated as a router. It
was not an optimal router, but it showed the possibility of using evolution
and development to design circuits. In section 10.4, Kowaliw et al. evolved
single cells with the ability to grow into different types of plane trusses.
None of the evolved designs are optimal by civil engineering standards, but
they all appeared to substantially satisfy whatever fitness function was used
in evolving them. Section 10.5 presents an incremental growth (= developmental)
version of Cartesian genetic programming, as proposed by Miller
et al. This approach was used to evolve both mathematical functions and
colored two-dimensional images (flags). In the case of functions, develop-
ment starts with a single node (which contains the developmental program)
and ends with a completely specified and interconnected network of nodes,
324 Nawwaf Kharma
10.8.1 Epilogue
It was a beautiful summer day of the year 16042 when the expansion of the
universe halted and, for a moment, time stood still. It was well known that
the spatial expansion of the universe was slowing down, and at an increasing rate. Scientists had even been able to narrow the window within which the great reversal would start down to five years. Only a small minority, however, had argued that time, and not just space, would reverse direction. They had warned of the dangers of completely neglecting an improbable, but highly significant, scenario. Then it came. Rivers flowed backward from
the sea to their source and plants grew back into seeds. Nevertheless, the
animals did not seem to mind this unusual change of order. Only humans fretted over their impending doom, as their minds grew dimmer and their bodies younger. The governments of the richest nations came
together in an urgent World Summit. For days on end, they produced and
debated possible solutions to their unusual condition. Their main concern
was the survival of the species — especially the human one. They knew that
they had a fair bit of time to agree on a realizable solution, but not infinity.
For all humanity was degenerating, de-evolving, in slow motion, and at some point in the future-past, Man himself was bound to go extinct. Twenty-one and a half days later, the Subcommittee for Biocomputing Alternatives proposed a detailed remedy: match the great reversal with a reversal of development. Gather all the knowledge that humanity has been able to generate about biological development, especially human development. Engineer a
virtual machine — not a living creature, but a software entity, an entity with
a thousand times the intellect of a genetically enhanced human. But this
would be no ordinary entity; it would have the ability to grow, to develop,
as humans do, or rather used to. The entity would “live” and develop in a
virtual world of prereversal laws, where flowers gave rise to fruits and the
sun rose from the east. The entity would also have access to an unlimited
supply of energy and would be linked to the online crucibles (libraries) of the
advanced world. Critically, however, this entity would develop the old-fash-
ioned way: From a single cell to a fully formed and superintelligent adult. A
team of nine top scientists, one hundred and twenty-one engineers and an
army of technicians and support staff were put to work. No cost was spared,
and all distracting voices were kept at bay. The raging discussions and even
confrontations about the possible futility of this expensive technology were
kept off the team’s airwaves. Indeed, the team was totally preoccupied with
the success of the first simulation. So, it was no surprise that the sudden and
inexplicable “death” of EV1 (what the entity came to be called) almost killed
the whole project. Seven of the nine original scientists persisted, carefully developing software modules and reverting to safer means of implementing increasingly biological simulations. They launched one trial after another, mostly in secret, using e-GRNs, PBNs, and even G-Nets. Five days shy of two
years of development, on an equally gorgeous summer day of the year 16039,
EV5 was unleashed — its (hardware) blue, green, and red lights flickering in
seemingly random patterns. The entity grew into a bulb of triangular then
hexagonal “cells.” The bulb morphed into a slithering snake-like “worm”
with hundreds of pulsating cells, emitting high-frequency shrieks. The cells
divided again and again, always changing shape, color, and behavior. This
continued for about 32 weeks; by then EV5 had developed into a beautiful
and hyperintelligent baby girl. Everyone was anxiously awaiting “her” first
words. What would they be? Would they be “milk” or perhaps “mama” or a
word in some ancient language, which none of them understood? The days
went by, and suddenly, a completely formed paragraph came out of the girl’s
virtual mouth:
Acknowledgments
In order to complete this chapter, I had to call on the help of three friends, whom I would like to acknowledge. First, I thank Dr. Luc Varin for providing
me with unfettered access to his office at the Loyola campus of Concordia,
away from the distractions of Montreal’s downtown. I am also thankful to
Imad Hoteit for reading and providing detailed feedback on the first draft of
the chapter. Last, but not least, I would like to thank Mohammad Ebne-Alian
for the excellent technical drawings and recreations used to illustrate parts
of this chapter.
References
1. Kumar, S. and Bentley, P.J. (1999) Computational embryology: Past, present and future. In Ghosh, A. and Tsutsui, S. (eds.) Theory and Application of Evolutionary Computation: Recent Trends, Springer-Verlag, Berlin.
2. Stanley, K.O. and Miikkulainen, R. (2003) A taxonomy for artificial embryogeny. Artif. Life, 9:93–130, MIT Press, Cambridge, MA.
3. Haddow, P.C., Tufte, G., and Van Remortel, P. (2003) Evolving hardware: Pumping life into dead silicon. In Kumar, S. and Bentley, P.J. (eds.) On Growth, Form and Computers, Elsevier Academic Press, Amsterdam, pp. 405–423.
4. Haddow, P.C. and Tufte, G. (2001) Bridging the genotype-phenotype mapping for digital FPGAs. In Proceedings of the Third NASA/DoD Workshop on Evolvable Hardware, IEEE Computer Society, Washington, D.C., pp. 109–115.
5. Lindenmayer, A. (1975) Developmental algorithms for multicellular organisms: A survey of L-systems. J. Theoret. Biol., 54:3–22, Elsevier, Amsterdam.
6. Kowaliw, T., Grogono, P., and Kharma, N. (2007) The evolution of structural design through artificial embryogeny. In Proceedings of the IEEE Symposium on Artificial Life (ALIFE’07), IEEE Computer Society, Washington, D.C., pp. 425–432.
7. Eiben, A.E. and Smith, J.E. (2007) Introduction to Evolutionary Computing, Springer-Verlag, Berlin.
8. Miller, J.F. and Thomson, P. (2003) A developmental method for growing graphs and circuits. In Proceedings of the Fifth International Conference on Evolvable Systems, Springer-Verlag, Berlin, pp. 93–104.
9. Miller, J.F. and Thomson, P. (2000) Cartesian genetic programming. In Proceedings of the Third European Conference on Genetic Programming, LNCS, 1802:121–132, Springer-Verlag, Berlin.
10. Dellaert, F. (1995) Towards a Biologically Defensible Model of Development. Master’s thesis, Department of Computer Engineering and Science, Case Western Reserve University, Cleveland, OH.
11. Kauffman, S. (1969) Metabolic stability and epigenesis in randomly constructed genetic nets. J. Theoret. Biol., 22:437–467, Elsevier, Amsterdam.