DEVELOPMENTS IN CONNECTIONIST THEORY
David E. Rumelhart, Editor
Ramsey/Stich/Rumelhart • Philosophy
and Connectionist Theory
Chauvin/Rumelhart • Backpropagation:
Theory, Architectures, and Applications
BACKPROPAGATION
Edited by
Yves Chauvin
Stanford University
and Net-ID, Inc.
David E. Rumelhart
Department of Psychology
Stanford University
Copyright © 1995 by Lawrence Erlbaum Associates, Inc.
All rights reserved. No part of this book may be reproduced in
any form, by photostat, microform, retrieval system, or any other
means, without the prior written permission of the publisher.
Backpropagation : theory, architectures, and applications / edited by Yves Chauvin and David E.
Rumelhart.
p. cm.
Includes bibliographical references and index.
ISBN 0-8058-1258-X (alk. paper). — ISBN 0-8058-1259-8 (pbk. : alk. paper)
1. Backpropagation (Artificial intelligence) I. Chauvin, Yves, Ph. D. II. Rumelhart,
David E.
Q327.78.B33 1994
006.3—dc20 94-24248
CIP
Books published by Lawrence Erlbaum Associates are printed on acid-free paper, and their
bindings are chosen for strength and durability.
Contents
Preface
Preface
Almost ten years have passed since the publication of the now classic volumes Parallel Distributed Processing: Explorations in the Microstructure of Cognition. These volumes marked a renewal in the study of brain-inspired computations as models of human cognition. Since the publication of
these two volumes, thousands of scientists and engineers have joined the
study of Artificial Neural Networks (or Parallel Distributed Processing) to
attempt to respond to three fundamental questions: (1) how does the brain
work? (2) how does the mind work? (3) how could we design machines with
equivalent or greater capabilities than biological (including human) brains?
Progress in the last 10 years has given us a better grasp of the complexity
of these three problems. Although connectionist neural networks have shed
a feeble light on the first question, it has become clear that biological
neurons and computations are more complex than their metaphorical con-
nectionist equivalent by several orders of magnitude. Connectionist models
of various brain areas, such as the hippocampus, the cerebellum, the olfac-
tory bulb, or the visual and auditory cortices have certainly helped our
understanding of their functions and internal mechanisms. But by and large,
the biological metaphor has remained a metaphor. And neurons and synapses
still remain much more mysterious than hidden units and weights.
Artificial neural networks have inspired not only biologists but also
psychologists, perhaps more directly interested in the second question. Al-
though the need for brain-inspired computations as models of the workings
of the mind is still controversial, PDP models have been successfully used to
model a number of behavioral observations in cognitive, and more rarely,
clinical or social psychology. Most of the results are based on models of
perception, language, memory, learning, categorization, and control. These
results, however, cannot pretend to represent the beginning of a general
understanding of the human psyche. First, only a small fraction of the large
quantity of data amassed by experimental psychologists has been examined
by neural network researchers. Second, some higher levels of human cogni-
tion, such as problem solving, judgment, reasoning, or decision making
rarely have been addressed by the connectionist community. Third, most
models of experimental data remain qualitative and limited in scope: No
general connectionist theory has been proposed to link the various aspects of
cognitive processes into a general computational framework.
Acknowledgments
It would probably take another volume just to thank all the people who
contributed to the existence of this volume. The first editor would like to
mention two of them: Marie-Thérèse and René Chauvin.
BACKPROPAGATION
1 Backpropagation:
The Basic Theory
David E. Rumelhart
Richard Durbin
Richard Golden
Yves Chauvin
Department of Psychology, Stanford University
INTRODUCTION
"Beyond Regression," and David Parker and David Rumelhart apparently devel-
oped the idea at about the same time in the spring of 1982. It was, however, not
until the publication of the paper by Rumelhart, Hinton, and Williams in 1986
explaining the idea and showing a number of applications that it reached the field
of neural networks and connectionist artificial intelligence and was taken up by a
large number of researchers.
Although the basic character of the back-propagation algorithm was laid out
in the Rumelhart, Hinton, and Williams paper, we have learned a good deal more
about how to use the algorithm and about its general properties. In this chapter
we develop the basic theory and show how it applies in the development of new
network architectures.
We will begin our analysis with the simplest case, namely that of the feedforward network. The pattern of connectivity may be arbitrary (i.e., there need not be a notion of a layered network), but for our present analysis we will eliminate cycles. An example of such a network is illustrated in Figure 1.²
For simplicity, we will also begin with a consideration of a training set which consists of a set of ordered pairs $\{(\mathbf{x}, \mathbf{d})_i\}$, where we understand each pair to represent an observation in which outcome $\mathbf{d}$ occurred in the context of event $\mathbf{x}$. The goal of the network is to learn the relationship between $\mathbf{x}$ and $\mathbf{d}$. It is useful to imagine that there is some unknown function relating $\mathbf{x}$ to $\mathbf{d}$, and we are trying to
find a good approximation to this function. There are, of course, many standard
methods of function approximation. Perhaps the simplest is linear regression. In
that case, we seek the best linear approximation to the underlying function. Since
multilayer networks are typically nonlinear it is often useful to understand feed-
forward networks as performing a kind of nonlinear regression. Many of the
issues that come up in ordinary linear regression also are relevant to the kind of
nonlinear regression performed by our networks.
One important example comes up in the case of "overfitting." We may have
too many predictor variables (or degrees of freedom) and too little training data.
In this case, it is possible to do a great job of "learning" the data but a poor job of
generalizing to new data. The ultimate measure of success is not how closely we
approximate the training data, but how well we account for as yet unseen cases.
It is possible for a sufficiently large network to merely "memorize" the training
data. We say that the network has truly "learned" the function when it performs
well on unseen cases. Figure 2 illustrates a typical case in which accounting
exactly for noisy observed data can lead to worse performance on the new data.
Combating this "overfitting" problem is a major problem for complex networks
with many weights.
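To make the overfitting point concrete, here is a small numerical sketch (ours, not the authors'); polynomial degree stands in for network size, and the data, degrees, and noise level are arbitrary illustrative choices:

# Overfitting illustration: a model with too many free parameters fits the
# noisy training data exactly but generalizes worse than a simpler one.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(0, 1, 10)
x_test = np.linspace(0, 1, 100)
true_fn = lambda x: np.sin(2 * np.pi * x)
y_train = true_fn(x_train) + 0.2 * rng.standard_normal(x_train.size)

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - true_fn(x_test)) ** 2)
    print(f"degree {degree}: train MSE {train_err:.4f}, test MSE {test_err:.4f}")

With 10 training points, the degree-9 fit passes through the data almost exactly yet does far worse on fresh points, just as the smooth curve in Figure 2 would beat the oscillating one.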
Given the interpretation of feedforward networks as a kind of nonlinear re-
gression, it may be useful to ask what features the networks have which might
² As we indicate later, the same analysis can be applied to networks with cycles (recurrent networks), but it is easiest to understand in the simpler case.
[Figure 1: a feedforward network, with input units (X) at the bottom, a layer of hidden units, and output units (Y) at the top.]

Figure 2. Even though the oscillating line passes directly through all of the data points, the smooth line would probably be the better predictor if the data were noisy.
[Figure 3: panels (a) and (b), two network architectures over inputs x1, x2, …, xn; panel (b) includes product terms such as x1x2 and x1x3.]
In layered networks the constraints are very different. Rather than limiting the order of the interactions, we limit only the number of interactions and let the network select the appropriate combinations of units. In many real-world situations the representation of the signal in physical terms (for example, in terms of the pixels of an image or the acoustic representation of a speech signal) may require looking at the relationships among many input variables at a time, but there may exist a description in terms of relatively few variables if only we knew what they were. The idea is that the multilayer network is trying to find a low-order representation (a few hidden units), but that representation itself is, in general, a nonlinear function of the physical input variables which allows for the interactions of many terms.

Before we turn to the substantive issues of this chapter, it is useful to ask for what kinds of applications neural networks would be best suited. Figure 4 provides a framework for understanding these issues. The figure has two dimensions, "Theory Richness" and "Data Richness." The basic idea is that different kinds of systems are appropriate for different kinds of problems. If we have a good theory it is often possible to develop a specific "physical model" to describe the phenomena.

[Figure 4: a framework with two axes, Data Richness and Theory Richness.]

Such a "first-principles" model is especially valuable when we
have little data. Sometimes we are "theory poor" and also "data poor." In such a case, a good model may best be determined by asking "experts" in a field and, on the basis of their understanding, devising an "expert system." The cases where networks are particularly useful are domains where we have lots of data (so we can train a complex network) but not much theory, so we cannot build a first-principles model. Note that when a situation gets sufficiently complex and we have enough data, it may be that so many approximations to the first-principles models are required that, in spite of a good deal of theoretical understanding, better models can be constructed through learning than by application of our theoretical models.
There are three major issues we must address when considering networks such as
these. These are:
Representation
The original critique by Minsky and Papert was primarily concerned with the representational capacity of the perceptron. They showed (among other things) that certain functions were simply not representable with single-layer perceptrons. It has been shown that multilayered networks do not have these limitations. In particular, we now know that, with enough hidden units, essentially any function can be approximated as closely as desired (cf. Hornik et al., 1989). There still is a question about how the size of the network must scale with the complexity of the function to be approximated. There are results which indicate that smooth, continuous functions require, in general, simpler networks than functions with discontinuities.
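As a rough illustration of this approximation property (our sketch, not from the chapter), the code below fits a one-hidden-layer net of sigmoidal units to a smooth target, solving only the output weights by least squares; the target function and the hidden-unit counts are arbitrary choices:

# More sigmoidal hidden units -> closer approximation of a smooth function.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)[:, None]
target = np.sin(x).ravel()

for n_hidden in (2, 50):
    w = rng.normal(scale=2.0, size=(1, n_hidden))   # input-to-hidden weights
    b = rng.normal(scale=2.0, size=n_hidden)        # hidden biases
    h = 1.0 / (1.0 + np.exp(-(x @ w + b)))          # sigmoidal hidden layer
    out_w, *_ = np.linalg.lstsq(h, target, rcond=None)
    mse = np.mean((h @ out_w - target) ** 2)
    print(f"{n_hidden} hidden units: MSE {mse:.5f}")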
Learning
Although there are results that indicate that the general learning problem is extremely difficult—certain representable functions may not be learnable at all—empirical results indicate that the "learning" problem is much easier than expected. Most real-world problems seem to be learnable in a reasonable time. Moreover, learning normally seems to scale linearly; that is, as the size of real problems increases, the training time seems to go up linearly (i.e., it scales with the number of patterns in the training set). Note that these results were something of a surprise. Much of the early work with the back-propagation algorithm was done with artificial problems, and there was some concern about the time that some problems, such as the parity problem, required. It now appears that these early results were unduly pessimistic. It is rare that more than 100 passes through the training set are required.
Generalization
Whereas the learning problem has turned out to be simpler than expected, the
generalization problem has turned out to be more difficult than expected. It
appears to be possible to easily build networks capable of learning fairly large
data sets. Learning a data set turns out to be little guarantee of being able to
generalize to new cases. Much of the most important work during recent years
has been focused on the development of methods to attempt to optimize general-
ization rather than just the learning of the training set.
with than products, we will maximize the log of this probability. Since the log is a monotonic transformation, maximizing the log is equivalent to maximizing the probability itself. In this case we have

$$\ln P(D \mid N) = \ln P(\{(\mathbf{x}, \mathbf{d})_i\} \mid N).$$
The Gaussian Case

$$P(\mathbf{d}_i \mid \mathbf{x}_i \wedge N) = K \exp\left(-\sum_j \frac{(y_{ij} - d_{ij})^2}{2\sigma^2}\right),$$

$$\ln P(\mathbf{d}_i \mid \mathbf{x}_i \wedge N) = \ln K - \sum_j \frac{(y_{ij} - d_{ij})^2}{2\sigma^2},$$

$$l = -\sum_i \sum_j \frac{(y_{ij} - d_{ij})^2}{2\sigma^2}.$$
³ Note, this is not necessary. The output units could have a variety of forms, but the quasi-linear class is simple and useful.

⁴ Note that $\eta_j$ itself is a function of the input vector $\mathbf{x}_i$ and the weights and biases of the entire network.
function with respect to its net input. As we shall see, this is a very general form which occurs often.

Now what form should the output function take? It has been conventional to take it to be a sigmoidal function of its net input, but under the Gaussian assumption of error, in which the mean can, in principle, take on any real value, it makes more sense to let $y$ be linear in its net input. Thus, for an assumption of Gaussian error and linear output functions we get the following very simple form of the learning rule:

$$\frac{\partial l}{\partial \eta_j} \propto (d_{ij} - y_{ij}).$$
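A quick numerical check of this rule (our sketch, with arbitrary values): for a Gaussian error model with a linear output, the derivative of the log-likelihood with respect to the net input is exactly $d - y$, up to the constant $1/\sigma^2$:

# Finite-difference check: dl/deta equals (d - y) for linear outputs.
import numpy as np

def log_lik(eta, d, sigma=1.0):
    y = eta                      # linear output unit: y = eta
    return -np.sum((y - d) ** 2) / (2 * sigma ** 2)

eta = np.array([0.3, -1.2, 0.8])
d = np.array([0.0, -1.0, 1.0])
eps = 1e-6
numeric = np.array([
    (log_lik(eta + eps * np.eye(3)[j], d) - log_lik(eta - eps * np.eye(3)[j], d)) / (2 * eps)
    for j in range(3)
])
print(numeric)        # matches d - eta
print(d - eta)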
The Binomial Case

$$\frac{\partial l}{\partial \eta_j} = \frac{d_j - y_j}{y_j(1 - y_j)} \cdot \frac{\partial F(\eta_j)}{\partial \eta_j}.$$

Again, the derivative has the same form as before—the difference between the predicted and observed values divided by the variance (in this case the variance
of the binomial). With the logistic output function, $F'(\eta_j) = y_j(1 - y_j)$ cancels the variance term, and we are again left with the same simple form:

$$\frac{\partial l}{\partial \eta_j} \propto (d_j - y_j).$$
The Multinomial Case

$$F_j(\boldsymbol{\eta}) = \frac{e^{\eta_j}}{\sum_k e^{\eta_k}},$$

$$l = \sum_i \sum_j d_{ij} \ln \frac{e^{\eta_j}}{\sum_k e^{\eta_k}},$$

$$\frac{\partial l}{\partial \eta_j} \propto (d_{ij} - y_{ij}).$$
⁵ This is sometimes called the "soft-max" or "Potts" unit. As we shall see, however, it is a simple generalization of the ordinary sigmoid and has a simple interpretation as representing the posterior probability of event j out of a set of n possible events.
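The same check works for the multinomial case (again our illustrative sketch): with the normalized-exponential output and the log-likelihood above, the numerical gradient with respect to each net input matches $d_j - y_j$:

# Finite-difference check for the soft-max / cross-entropy pairing.
import numpy as np

def log_lik(eta, d):
    y = np.exp(eta) / np.sum(np.exp(eta))   # "soft-max" (Potts) unit
    return np.sum(d * np.log(y))

eta = np.array([0.5, -0.3, 1.1])
d = np.array([0.0, 1.0, 0.0])               # one-of-n target
eps = 1e-6
numeric = np.array([
    (log_lik(eta + eps * np.eye(3)[j], d) - log_lik(eta - eps * np.eye(3)[j], d)) / (2 * eps)
    for j in range(3)
])
y = np.exp(eta) / np.sum(np.exp(eta))
print(numeric)       # matches d - y
print(d - y)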
The General Case

The fact that these cases all end up with essentially the same learning rule in spite of different models is not accidental. It requires exactly the right choice of output function for each class of problem. It turns out that this result will occur whenever we choose a probability function from the exponential family of probability distributions. This family includes, in addition to the normal and the binomial, the gamma distribution, the exponential distribution, the Poisson distribution, the negative binomial distribution, and most other familiar probability distributions. The general form of the exponential family of probability distributions is

$$P(\mathbf{d} \mid \mathbf{x} \wedge N) = \exp\left(\sum_i \frac{d_i\theta_i - B(\theta_i)}{a(\phi)} + C(d_i, \phi)\right),$$

where θ is the "sufficient statistic" of the distribution and is related to the mean of the distribution, φ is a measure of the overall variance of the distribution, and B(·), C(·), and a(·) are different for each member of the family. It is beyond the scope of this chapter to develop the general results of this model.⁶ Suffice it to say that for all members of the exponential family we get

$$\frac{\partial l}{\partial \eta_j} = \frac{d_j - y_j}{\mathrm{var}(y_j)}.$$

⁶ See McCullagh and Nelder (1989, pp. 28–30) for a more complete description.
SOME EXAMPLES
$$P(\mathbf{x}_i \mid N) = \sum_k P(\mathbf{x}_i \mid c_k)\,P(c_k),$$

where c_k indexes the kth cluster. For simplicity we can assume that the clusters are equally probable, so P(c_k) = 1/N, where N is the number of clusters. Now, the probability of the data given the cluster is simply the output of the kth hidden unit, h_k. Therefore, the value of the output unit is $\frac{1}{N}\sum_k h_k$ and the log-likelihood of the data given the input is

$$l = \ln\left(\frac{1}{N}\sum_k h_k\right).$$

[Figure 5: the clustering network, with input units feeding a layer of cluster units, which feed a single output unit.]

$$\frac{\partial l}{\partial w_{jk}} \propto p_k\,(x_j - w_{jk}).$$
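The sketch below (ours) implements the resulting update under the stated assumptions: spherical Gaussian cluster units, equal priors, and arbitrary data and learning rate. Each center moves toward the input in proportion to its posterior responsibility p_k, a form of soft competitive learning (cf. Nowlan, 1991):

# Soft competitive clustering: dl/dw_k is proportional to p_k * (x - w_k).
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(3, 2))                  # 3 cluster centers in 2-D

def update(x, W, lr=0.1, sigma=1.0):
    sq = np.sum((x - W) ** 2, axis=1)
    h = np.exp(-sq / (2 * sigma ** 2))       # Gaussian cluster-unit activations
    p = h / np.sum(h)                        # posterior p_k (equal priors)
    W += lr * p[:, None] * (x - W)           # move each center toward x
    return W

for x in rng.normal(size=(200, 2)) + np.array([2.0, 0.0]):
    W = update(x, W)
print(W)   # the centers drift toward the data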
Society of Experts
$$y_{ij} = \sum_k r_k\, y_{ijk},$$
[Figure 6: the "society of experts" architecture. Expert networks operate on the input units and produce semi-output units; relevance units weight those outputs to form the weighted output units.]
$$l = \ln P(\mathbf{d}_i \mid \mathbf{x}_i \wedge N) = \ln \sum_k P(\mathbf{d}_i \mid \mathbf{x}_i \wedge s_k)\,r_k,$$

where s_k represents the kth subnet. We must now make some specific assumptions about the form of $P(\mathbf{d}_i \mid \mathbf{x}_i \wedge s_k)$. For concreteness, we assume a Gaussian distribution, but we could have chosen any of the other probability distributions we have discussed. In this case

$$l = \ln \sum_k K r_k \exp\left(-\sum_j \frac{(d_j - y_{jk})^2}{2\sigma^2}\right).$$
In this case, the probabilities are input dependent. It is slightly more difficult to calculate, but it turns out that the derivative for the relevance units also has the simple form

$$\frac{\partial l}{\partial \eta_k} = p_k - r_k,$$

the difference between the posterior and the prior probability that subnetwork k is the correct network.
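A small sketch (ours, under the Gaussian assumption above) of how the relevance-unit error signal p_k − r_k is computed; the expert outputs and relevances are placeholder values rather than a full network:

# Posterior responsibility of each expert for the observed target.
import numpy as np

def expert_posteriors(d, y_experts, r, sigma=1.0):
    # likelihood of the target under each expert (Gaussian error model)
    lik = np.exp(-np.sum((d - y_experts) ** 2, axis=1) / (2 * sigma ** 2))
    p = r * lik
    return p / np.sum(p)

y_experts = np.array([[0.9, 0.1], [0.1, 0.8]])   # outputs of two experts
r = np.array([0.5, 0.5])                          # prior relevances r_k
d = np.array([0.0, 1.0])                          # observed target
p = expert_posteriors(d, y_experts, r)
print("relevance-unit error signal:", p - r)      # dl/deta_k = p_k - r_k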
This example, although somewhat complex, is useful for seeing how we can
use our general theory to determine a learning rule in a case where it might not be
immediately obvious and in which the general idea of just taking the difference
between the output of the network and the target and using that as an error signal
is probably the wrong thing to do. We now turn to one final example.
⁷ The algorithm and network design presented here were first proposed by Rumelhart in a presentation entitled "Learning and generalization in multilayer networks" given at the NATO Advanced Research Workshop on Neurocomputing, Algorithms, Architecture and Applications held in Les Arcs, France, in February 1989. The algorithm can be considered a generalization and refinement of the TDNN network developed by Waibel et al. (1989). A version of the algorithm was first published in Keeler, Rumelhart, and Loew (1991).
$$p_i = 1 - \prod_j (1 - p_{ij})$$

Figure 7. The basic recognition network, with input and hidden layers feeding a grid of detection units. See text for detailed network description.
$$p_{ij} = \frac{1}{1 + e^{-\eta_{ij}}},$$

where

$$\eta_{ij} = \sum_k w_{ik} h_{kj} + \beta_i$$
and w_ik is the weight from hidden unit h_kj to the detector p_ij. Note that since the weights from the hidden units to the detection units are linked, this same weight will connect each feature unit in a row with the corresponding detection unit in the row above. Since we have built translational independence into the structure of the network, anything we learn about features or characters at any given location is, through the linking of weights, automatically transferred to every location.
If we were willing, or able, to carefully segment the input and tell the network exactly where each character was, we could use a standard training technique to train the network to recognize any character at any location. However, we are interested in a training algorithm in which we do not have to provide the network with specific training information. We are interested in simply telling the network which characters were present in the input, not where each character is. To implement this idea, we have built an additional network which takes the output of the p_ij units and computes, through a fixed output network, the probability that at least one character of a given type is present anywhere in the input field. We do this by computing the probability that at least one unit of a particular type is on. This can simply be written as

$$y_i = 1 - \prod_j (1 - p_{ij}).$$
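As a brief illustration (ours), this fixed output layer can be computed directly from a grid of detector probabilities; the array sizes and values below are arbitrary:

# "At least one character of type i anywhere": y_i = 1 - prod_j (1 - p_ij).
import numpy as np

rng = np.random.default_rng(3)
p = rng.uniform(0.0, 0.2, size=(4, 10))   # 4 character types, 10 positions
p[2, 7] = 0.95                            # a confident detection of type 2

y = 1.0 - np.prod(1.0 - p, axis=1)        # per-type "present anywhere" output
print(y)                                  # y[2] is near 1, the others stay low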
Note that exactly the same target would be given for the word "dog" and the word "god." Nevertheless, the network learns to properly localize the units in the p_ij layer. The reason is, simply, that the individual characters occur in many combinations and the only way that the network can learn to discriminate correctly is to actually detect the particular letter. The localization that occurs in the p_ij layer depends on each character unit seeing only a small part of the input field and on each unit of type i being constrained to respond in the same way.

Important in the design of the network was an assumption as to the meaning of the individual units in the network. We will show why we make these interpretations and how the learning rule we derive depends on these interpretations.

To begin, we want to interpret each output unit as the probability that at least one of that character is in the input field. Assuming that the letters occurring in a given word are approximately independent of the other letters in the word, we can also assume that the probability of the target vector given the input is

$$P(\mathbf{d} \mid \mathbf{x} \wedge N) = \prod_i y_i^{d_i}\,(1 - y_i)^{1 - d_i}.$$

This is obviously an example of the binomial multiclassification model. Therefore, we get the following form of our log-likelihood function:

$$l = \sum_i d_i \ln y_i + (1 - d_i)\ln(1 - y_i),$$
and, taking the derivative with respect to the net input of a detection unit,

$$\frac{\partial l}{\partial \eta_{ij}} = \frac{(d_i - y_i)}{y_i}\, p_{ij}.$$

This is a kind of competitive rule in which the learning is proportional to the relative strength of the activation of the unit at a location in the ith row to the strength of activation in the entire row. This ratio is the conditional probability that the target was at position j under the assumption that the target was, in fact, presented. This convenient interpretation is not accidental. By assigning the output units their probabilistic interpretations and by selecting the appropriate, though unusual, output unit $y_i = 1 - \prod_j (1 - p_{ij})$, we were able to ensure a plausible interpretation and behavior of our character detection units.
Concluding Remarks
PRIORS
$$l = \sum_i \sum_j \ln P(d_{ij} \mid \mathbf{x}_i \wedge N) + \ln P(N).$$
Weight Decay

$$\ln P(N) = \ln \exp\left(-\frac{\sum_{ij} w_{ij}^2}{2\sigma^2}\right) = -\frac{1}{2\sigma^2}\sum_{ij} w_{ij}^2,$$

$$\frac{\partial l}{\partial w_{ij}} \propto -\frac{1}{\sigma^2}\, w_{ij}.$$
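In practice this prior adds a shrink-toward-zero term to each weight update. A minimal sketch (ours), assuming gradient ascent on log-likelihood plus log-prior, with arbitrary values:

# Weight decay: each step pulls every weight toward zero in proportion to its size.
import numpy as np

def decayed_step(w, data_grad, lr=0.01, sigma2=10.0):
    # gradient ascent on log-likelihood plus the Gaussian log-prior term
    return w + lr * (data_grad - w / sigma2)

w = np.array([2.0, -0.5, 0.01])
print(decayed_step(w, data_grad=np.zeros(3)))  # with no data signal, weights shrink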
Weight Elimination

$$\ln P(N) = -\frac{1}{\sigma_1^2}\sum_{ij} \frac{\sigma_2^2\, w_{ij}^2}{\sigma_2^2 + w_{ij}^2}.$$

The derivative is

$$\frac{\partial l}{\partial w_{ij}} \propto -\frac{\sigma_2^2\, w_{ij}}{\sigma_1^2\,(\sigma_2^2 + w_{ij}^2)^2}.$$

Note that this has the property that for small weights (weights in which w_ij is small relative to σ₂), the denominator is approximately constant and the change in weights is simply proportional to the numerator w_ij, as in weight decay. For large weights, by contrast, the penalty saturates and the derivative approaches zero, so well-established large weights are left essentially untouched.
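The contrast with plain weight decay is easy to see numerically. The sketch below (ours) evaluates the exact derivative of the penalty term above, with multiplicative constants absorbed into the learning rate; σ₁ and σ₂ are arbitrary:

# Weight-elimination penalty gradient: decay-like for small w, vanishing for large w.
import numpy as np

def elimination_grad(w, sigma1=1.0, sigma2=1.0):
    # d/dw of -(1/sigma1^2) * sigma2^2 w^2 / (sigma2^2 + w^2)
    return -(2 * sigma2 ** 4 * w) / (sigma1 ** 2 * (sigma2 ** 2 + w ** 2) ** 2)

w = np.array([0.01, 0.1, 1.0, 10.0])
print(elimination_grad(w))   # large weights feel almost no pull toward zero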
Weight Symmetries
In w e i g h t d e c a y the idea w a s to have a prior such that the weight distribution has
a z e r o m e a n and is normally distributed. T h e weight elimination p a r a d i g m is
m o r e general in that it distinguishes t w o classes of w e i g h t s , of which o n e i s , like
the w e i g h t d e c a y c a s e , c e n t e r e d on zero and normally distributed, and the other is
uniformly distributed. In weight s y m m e t r i e s there is a small set of normally
distributed weight clusters. T h e p r o b l e m is to simultaneously estimate the m e a n
of the priors and the weights t h e m s e l v e s . In this case the priors are
[ (wi - μk)2 ]
P(N) = Σ exp - 2 P(ck),
2σk
i k
w h e r e P(ck) is the probability of the kth weight cluster and μk is its center. To
d e t e r m i n e h o w the weights are to b e c h a n g e d , w e must c o m p u t e the derivative of
the log of this probability. We get
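The derivative the text calls for does not survive on this page, but it follows by differentiating the log of the mixture. The sketch below (ours) computes it for a single weight under fixed cluster parameters; all values are arbitrary:

# Mixture-of-Gaussians weight prior: d(ln P)/dw is a responsibility-weighted
# pull of the weight toward the cluster centers mu_k.
import numpy as np

def prior_grad(w, mu, sigma, pi):
    # responsibilities of each cluster k for weight w
    lik = pi * np.exp(-(w - mu) ** 2 / (2 * sigma ** 2)) / sigma
    resp = lik / np.sum(lik)
    return np.sum(resp * (mu - w) / sigma ** 2)   # pull toward nearby centers

mu = np.array([0.0, 1.0])        # two weight clusters
sigma = np.array([0.1, 0.1])
pi = np.array([0.5, 0.5])
for w in (0.05, 0.5, 0.95):
    print(w, prior_grad(w, mu, sigma, pi))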
Elastic Network and the Traveling Salesman Problem

$$\cdots + \frac{1}{\lambda}\left(\mu_{x,k+1} - \mu_{x,k}\right).$$
(i.e., the means μ_{x,k} and μ_{d,k}) are adjusted until the network stabilizes. At this point it is likely that none of the cluster centers is located at any of the cities. We then decrease σ and present the cities again until the network stabilizes. Then σ is decreased again. This process is repeated until there is a cluster mean located at each city. At this point we can simply follow the cluster means in order and read off the solution to the problem.
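A compact sketch (ours) of the annealing loop just described, in the spirit of the elastic net of Durbin and Willshaw (1987); since the update equation above is only partially legible, the specific pull and tension terms here are assumptions, and all constants are arbitrary:

# Elastic-net-style annealing for a small TSP instance.
import numpy as np

rng = np.random.default_rng(4)
cities = rng.uniform(size=(10, 2))
angles = np.linspace(0, 2 * np.pi, 25, endpoint=False)
M = cities.mean(0) + 0.1 * np.stack([np.cos(angles), np.sin(angles)], axis=1)

sigma = 0.2
while sigma > 0.01:
    for _ in range(50):                       # relax until roughly stable
        for x in cities:
            sq = np.sum((x - M) ** 2, axis=1)
            p = np.exp(-sq / (2 * sigma ** 2))
            p /= p.sum()
            M += 0.05 * p[:, None] * (x - M)  # pull responsible means toward the city
        # neighbor tension keeps the ring short
        M += 0.1 * (np.roll(M, 1, 0) - 2 * M + np.roll(M, -1, 0))
    sigma *= 0.8                              # anneal: decrease sigma and repeat
print(M)   # visiting the means in ring order approximates a tour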
Concluding Comments
In this section we showed how knowledge and constraints can be added to the network with well-chosen priors. So far the priors have been of two types. (1) We have used priors to constrain the set of networks explored by the learning algorithm. By adding such "regularization" terms we have been able to design networks which provide much better generalization. (2) We have been able to add further constraints among the network parameter relationships. These constraints allow us to force the network to a particular set of possible solutions, such as those which minimize the tour in the traveling salesman problem.
Although not discussed here, it is possible to add knowledge to the network in
another way by expressing priors about the behavior of different parts of the
network. It is possible to formulate priors that, for example, constrain the output
of units on successive presentations to be as similar or as dissimilar as possible to
one another. The general procedure can dramatically affect the solution the
network achieves.
HIDDEN UNITS
Thus far, we have focused our attention on log-likelihood cost functions, appropriate interpretation of the output units, and methods of introducing additional constraints in the network. The final section focuses on the hidden units of the network.⁸ There are at least four distinct ways of viewing hidden units.
⁸ As an historical note, the term "hidden unit" is used to refer to those units lying between the input and output layers. The name was coined by Geoffrey Hinton, inspired by the notion of "hidden states" in hidden Markov models.
[Figure 8: panels (a)–(d), four arrangements of points labeled A and B in the plane, illustrating classification problems of differing difficulty.]
[Figure 9: a network with inputs X and Y, a first and a second hidden layer, and a classifier output separating the A and B regions.]

Figure 9. The first layer works by putting in the vertical and horizontal lines and moving the points to the corners of the region. This means that at the second level the problem is convex and two further hidden units divide the space and make it linearly separable.
As we will see, sigmoidal units are somewhat better behaved than many others with respect to the smoothness of the error surface. Durbin and Rumelhart (1989), for example, have found that, although "product units" ($F_j(\mathbf{x} \mid \mathbf{w}_j) = \prod_i x_i^{w_{ij}}$) are much more powerful than conventional sigmoidal units (in that fewer parameters were required to represent more functions), the resulting space was much more difficult to search and there were more problems with local minima.

Another important consideration is the nature of the extrapolation to data points outside the local region from which the data were collected. Radial basis functions have the advantage that they go to zero as you extend beyond the region where the data were collected. Polynomial units $F_j(\eta_j) = \eta_j^p$ are very ill behaved outside of the training region and for that reason are not especially good choices. Sigmoids are well behaved outside of their local region in that they saturate and are constant at 0 or 1 outside of the training region.
The first property has to do with the learning process itself. As noted from Figure 10, the sigmoidal unit is roughly linear for small weights (a net input near zero) and gets increasingly nonlinear in its response as it approaches its points of maximum curvature on either side of the midpoint. Thus, at the beginning of learning, when the weights are small, the system is mainly in its linear range and is seeking an essentially linear solution. As the weights grow, the network becomes increasingly nonlinear and begins to move toward the nonlinear solution to the problem. This property of initial linearity makes the units rather robust and allows the network to reliably attain the same solution.
Sigmoidal hidden units have a useful interpretation as the posterior probability
of the presence of some feature given the input. To see this, think of a sigmoidal
hidden unit connected directly to the input units. Suppose that the input vectors
are random variables drawn from one of two probability distributions and that the
job of the hidden unit is to determine which of the two distributions is being
observed. The role of the hidden unit is to give an output value equal to the
probability that the input vector was drawn from distribution 1 rather than distri-
bution 2. If drawn from distribution 1 we say that some "hidden feature" was
present; otherwise we say it was absent. Denoting the hidden feature for the jth
hidden unit as f_j, we have

$$P(f_j = 1 \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid f_j = 1)\,P(f_j = 1)}{P(\mathbf{x} \mid f_j = 1)\,P(f_j = 1) + P(\mathbf{x} \mid f_j = 0)\,P(f_j = 0)}.$$
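This claim is easy to verify numerically. The sketch below (ours) compares the Bayes posterior for two equal-variance Gaussian input distributions with the equivalent logistic sigmoid of a linear function of the input; the means, variance, and prior are arbitrary:

# The two-Gaussian posterior is exactly a logistic sigmoid of a linear net input.
import numpy as np

mu1, mu0, sigma, prior1 = 1.0, -1.0, 1.0, 0.5

def posterior(x):
    g1 = np.exp(-(x - mu1) ** 2 / (2 * sigma ** 2)) * prior1
    g0 = np.exp(-(x - mu0) ** 2 / (2 * sigma ** 2)) * (1 - prior1)
    return g1 / (g1 + g0)

def sigmoid_form(x):
    w = (mu1 - mu0) / sigma ** 2                       # implied weight
    b = (mu0 ** 2 - mu1 ** 2) / (2 * sigma ** 2) \
        + np.log(prior1 / (1 - prior1))                # implied bias
    return 1.0 / (1.0 + np.exp(-(w * x + b)))

x = np.linspace(-4, 4, 9)
print(np.max(np.abs(posterior(x) - sigmoid_form(x))))  # ~0: the two agree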
[Figure 10: plot of the logistic sigmoid function, with output values running from 0.0 to 1.0.]

Figure 10. The logistic sigmoid is roughly linear near the middle of its range and reaches its maximum curvature on either side of the midpoint.
So we finally have

$$P(f_j = 1 \mid \mathbf{x}) = \frac{\prod_i p_{ij}^{x_i}(1 - p_{ij})^{1 - x_i}\; P(f_j = 1)}{\prod_i p_{ij}^{x_i}(1 - p_{ij})^{1 - x_i}\; P(f_j = 1) + \prod_i q_{ij}^{x_i}(1 - q_{ij})^{1 - x_i}\; P(f_j = 0)}.$$

Rewriting this in the sigmoidal form $P(f_j = 1 \mid \mathbf{x}) = 1/(1 + e^{-\eta_j})$ with $\eta_j = \sum_i w_{ij}x_i + \beta_j$, where

$$\beta_j = \sum_i \ln\frac{1 - p_{ij}}{1 - q_{ij}} + \ln\frac{P(f_j = 1)}{P(f_j = 0)}$$

and

$$w_{ij} = \ln\frac{p_{ij}(1 - q_{ij})}{(1 - p_{ij})\,q_{ij}},$$

we can see the similarity. We can thus see that the sigmoid is properly understood as representing the posterior probability that some hidden feature is present given the input.
$$P(f_j = 1 \mid \mathbf{x}) = \frac{\exp\left\{\sum_i \frac{x_i\theta_i - B(\theta_i)}{a(\phi)} + C(x_i, \phi) + \ln P(f_j = 1)\right\}}{\exp\left\{\sum_i \frac{x_i\theta_i - B(\theta_i)}{a(\phi)} + C(x_i, \phi) + \ln P(f_j = 1)\right\} + \exp\left\{\sum_i \frac{x_i\theta_i^* - B(\theta_i^*)}{a(\phi)} + C(x_i, \phi) + \ln P(f_j = 0)\right\}}$$

$$= \frac{1}{1 + \exp\left(-\left(\sum_i x_i w_i + \beta\right)\right)}.$$
Thus, under the assumption that the input variables are drawn from some member of the exponential family and differ only in their means (represented by θ_i), the sigmoidal hidden unit can be interpreted as the probability that the hidden feature is present. Note that the very same form is derived whether the underlying distributions are Gaussian, binomial, or any other member of the exponential family. It can readily be seen that, whereas the sigmoid represents the two-alternative case, the normalized exponential clearly represents the multialternative case. Thus, we derive the normalized exponential in exactly the same way as we derive the sigmoid:
Hidden-Unit Layers as Representations of the Input Stimuli
[Figure 11: a two-layer network mapping similar inputs to similar outputs.]

Figure 11. Similar inputs produce similar outputs.
described as sets of microfeatures), then this network will exhibit the property
that "similar inputs yield similar outputs," along with the accompanying general-
ization and transfer of learning. Two-layer networks behave this way because the
activation of an output unit is given by a relatively smooth function of the
weighted sum of its inputs. Thus, a slight change in the value of an input unit will
generally yield a similarly slight change in the values of the output units.
Although this similarity-based processing is mostly useful, it does not always
yield the correct generalizations. In particular, in a simple two-layer network, the
similarity metric employed is determined by the nature of the inputs themselves.
And the "physical similarity" we are likely to have at the inputs (based on the
structure of stimuli from the physical world) may not be the best measure of the
"functional" or "psychological" similarity we would like to employ at the output
(to group appropriately similar responses). For example, it is probably true that a lowercase a is physically less similar to an uppercase A than to a lowercase o, but functionally and psychologically a and A are more similar to one another than are the two lowercase letters. Thus, physical relatedness is an inadequate similarity metric for modeling human responses to letter-shaped visual inputs. It is
therefore necessary to transform these input patterns from their initial physically
derived format into another representational form in which patterns requiring
similar (output) responses are indeed similar to one another. This involves learn-
ing new representations.
Figure 12 illustrates a layered feedforward network in which information
(activation) flows up from the input units at the bottom through successive layers
of hidden units, to create the final response at the layer of output units on top.
Such a network is useful for illustrating how an appropriate psychological or
functional representation can be created. If we think of each input vector as a
point in some multidimensional space, we can think of the similarity between
[Figure 12: a layered network in which the input passes through successive transformations (T1, T2), creating intermediate representations, up to a final representation and then the output.]
two such vectors as the distance between their two corresponding points. Further-
more, we can think of the weighted connections from one layer of units to the
next as implementing a transformation that maps each original input vector into
some new vector. This transformation can create a new vector space in which the
relative distances among the points corresponding to the input vectors are differ-
ent from those in the original vector space, essentially rearranging the points.
And if we use a sequence of such transformations, each involving certain non-
linearities, by "stacking" them between successive layers in the network, we can
entirely rearrange the similarity relations among the original input vectors.
Thus, a layered network can be viewed simply as a mechanism for transform-
ing the original set of input stimuli into a new similarity space with a new set of
distances among the input points. For example, it is possible to move the initially
distant "physical" input representations of a and A so that they are very close to
one another in a transformed "psychological" output representation space, and
simultaneously transform the distance between a and o output representations so
that they are rather distant from one another. (Generally, we seek to attain a
representation in the second-to-last layer which is sufficiently transformed that
we can rely on the principle that similar patterns yield similar outputs at the final
layer.) The problem is to find an appropriate sequence of transformations that
accomplish the desired input-to-output change in similarity structures.
The back-propagation learning algorithm can be viewed, then, as a procedure for discovering such a sequence of transformations.
CONCLUSION
In this chapter we have tried to provide a kind of overview and rationale for the
design and understanding of networks. Although it is possible to design and use
interesting networks without any of the ideas presented here, it is, in our experi-
ence, very valuable to understand networks in terms of these probabilistic inter-
pretations. The value is primarily in providing an understanding of the networks
and their behavior so that one can craft an appropriate network for an appropriate
problem. Although it has been commonplace to view networks as kinds of black
boxes, this leads to inappropriate applications which may fail not because such
networks cannot work but because the issues are not well understood.
REFERENCES
Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14, 326–334.
Durbin, R., & Rumelhart, D. E. (1989). Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Computation, 1, 133–142.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using
an elastic net method. Nature, 326, 689-691.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359–366.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local
experts. Neural Computation, 3(1).
Jordan, M. I., & Rumelhart, D. E. (1992). Forward models: Supervised learning with a distal
teacher. Cognitive Science, 16, pp. 307-354.
Keeler, J. D., Rumelhart, D. E., & Loew, W. (1991). Integrated segmentation and recognition of
hand-printed numerals. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky (Eds.), Neural
information processing systems (Vol. 3). San Mateo, CA: Morgan Kaufmann.
Kolmogorov, A. N. (1991). Selected works of A. N. Kolmogorov. Dordrecht: Kluwer Academic.
Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. In D. S. Touretzky (Ed.), Advances in neural information processing systems (Vol. 2). San Mateo, CA: Morgan Kaufmann.
McCullagh, P. & Nelder, J. A. (1989). Generalized linear models. London: Chapman and Hall.
Mitchison, G. J., & Durbin, R. M. (1989). Bounds on the learning capacity of some multi-layer
networks. Biological Cybernetics, 60, 345-356.
Nowlan, S. J. (1991). Soft competitive adaptation: Neural network learning algorithms based on fitting statistical mixtures. Unpublished doctoral dissertation, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Parker, D. B. (1982). Learning-logic (Invention Report S81-64, File 1). Stanford, CA: Office of
Technology Licensing, Stanford University.
Poggio, T., & Girosi, F. (1989). A theory of networks for approximation and learning (A.I. Memo No. 1140). Cambridge, MA: Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Rosenblatt, F. (1962). Principles of neurodynamics. New York: Spartan.
Rumelhart, D. E. (1990). Brain style computation: Learning and generalization. In S. F. Zornetzer,
J. L. Davis, and C. Lau (Eds.), An introduction to neural and electronic networks. San Diego:
Academic Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by
error propagation. In D. E. Rumelhart and J. L. McClelland (Eds.), Parallel Distributed Process-
ing: Explorations in the Microstructure of Cognition (Vol. 1). Cambridge, MA: Bradford Books.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37, 328–338.
Weigend, A. S., Huberman, B. A., & Rumelhart, D. E. (1990). Predicting the future: A connectionist approach. International Journal of Neural Systems, 1, 193–209.
Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991). Generalization by weight-elimination with application to forecasting. In R. P. Lippmann, J. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing systems (Vol. 3, pp. 875–882). San Mateo, CA: Morgan Kaufmann.
Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Unpublished doctoral dissertation, Harvard University.
2 Phoneme Recognition Using
Time-Delay Neural Networks*
Alexander Waibel
Computer Science Department, Carnegie Mellon University
Toshiyuki Hanazawa
ATR Interpreting Telephony Research Laboratories, Osaka, Japan
Geoffrey Hinton
University of Toronto
Kiyohiro Shikano
ATR Interpreting Telephony Research Laboratories, Osaka, Japan
Kevin J. Lang
Carnegie Mellon University
ABSTRACT
tained by the best of our HMM's was only 93.7 percent. Closer inspection reveals
that the network "invented" well-known acoustic-phonetic features (e.g., F2-rise,
F2-fall, vowel-onset) as useful abstractions. It also developed alternate internal
representations to link different acoustic realizations to the same concept.
I. INTRODUCTION
In recent years, the advent of new learning procedures and the availability of high-speed parallel supercomputers have given rise to a renewed interest in connectionist models of intelligence [1]. Sometimes also referred to as artificial neural
networks or parallel distributed processing models, these models are particularly
interesting for cognitive tasks that require massive constraint satisfaction, i.e.,
the parallel evaluation of many clues and facts and their interpretation in the light
of numerous interrelated constraints. Cognitive tasks, such as vision, speech,
language processing, and motor control, are also characterized by a high degree
of uncertainty and variability and it has proved difficult to achieve good perfor-
mance for these tasks using standard serial programming methods. Complex
networks composed of simple computing units are attractive for these tasks not
only because of their "brain-like" appeal¹ but because they offer ways for auto-
matically designing systems that can make use of multiple interacting con-
straints. In general, such constraints are too complex to be easily programmed
and require the use of automatic learning strategies. Such learning algorithms
now exist (for an excellent review, see Lippmann [2]) and have been demonstrated
to discover interesting internal abstractions in their attempts to solve a given
problem [1], [3]-[5]. Learning is most effective, however, when used in an
architecture that is appropriate for the task. Indeed, applying one's prior knowl-
edge of a task domain and its properties to the design of a suitable neural network
model might well prove to be a key element in the successful development of
connectionist systems.
Naturally, these techniques will have far-reaching implications for the design
of automatic speech recognition systems, if proven successful in comparison to
already-existing techniques. Lippmann [6] has compared several kinds of neural
networks to other classifiers and evaluated their ability to create complex deci-
sion surfaces. Other studies have investigated actual speech recognition tasks and
compared them to psychological evidence in speech perception [7] or to existing
speech recognition techniques [8], [9]. Speech recognition experiments using
neural nets have so far mostly been aimed at isolated word recognition (mostly
the digit recognition task) [10]–[13] or phonetic recognition with predefined
constant [14], [15] or variable phonetic contexts [16], [14], [17].
A number of these studies report very encouraging recognition performance
[16], but only few comparisons to existing recognition methods exist. Some of
these comparisons found performance similar to existing methods [9], [11], but
others found that networks perform worse than other techniques [8]. One might
argue that this state of affairs is encouraging considering the amount of fine-
tuning that has gone into optimizing the more popular, established techniques.
Nevertheless, better comparative performance figures are needed before neural
networks can be considered as a viable alternative for speech recognition sys-
tems.

¹ The uninitiated reader should be cautioned not to overinterpret the now-popular term "neural network." Although these networks appear to mimic certain properties of neural cells, no claim can be made that present exploratory attempts simulate the complexities of the human brain.
One possible explanation for the mixed performance results obtained so far
may be limitations in computing resources leading to shortcuts that limit perfor-
mance. Another more serious limitation, however, is the inability of most neural
network architectures to deal properly with the dynamic nature of speech. Two
important aspects of this are for a network to represent temporal relationships
between acoustic events, while at the same time providing for invariance under
translation in time. The specific movement of a formant in time, for example, is
an important cue to determining the identity of a voiced stop, but it is irrelevant
whether the same set of events occurs a little sooner or later in the course of time.
Without translation invariance, a neural net requires precise segmentation to
align the input pattern properly. Since this is not always possible in practice,
learned features tend to get blurred (in order to accommodate slight misalign-
ments) and their performance deteriorates. In general, shift invariance has been
recognized as a critically important property for connectionist systems and a
number of promising models have been proposed for speech and other domains
[18]-[21], [14], [17], [22].
In the present paper, we describe a Time-Delay Neural Network (TDNN)
which addresses both of these aspects of speech and demonstrate through exten-
sive performance evaluation that superior recognition results can be achieved
using this approach. In the following section, we begin by introducing the
architecture and learning strategy of a TDNN aimed at phoneme recognition.
Next, we compare the performance of our TDNN's to one of the more popular
current recognition techniques: Hidden Markov Models (HMM). In Section III,
we start by describing an HMM, under development at ATR [23], [24]. Both
techniques, the TDNN and the HMM, are then evaluated over a testing database
and we report the results. We show that substantially higher recognition perfor-
mance is achieved by the TDNN than by the best of our HMM's. In Section IV,
we then take a closer look at the internal representation that the TDNN learns for
this task. It discovers a number of interesting linguistic abstractions which we
show by way of examples. The implications of these results are then discussed
and summarized in the final section of this paper.
interconnections between units in each of these layers. This is to ensure that the
network will have the ability to learn complex nonlinear decision surfaces [2],
[6]. Second, the network should have the ability to represent relationships be-
tween events in time. These events could be spectral coefficients, but might also
be the output of higher level feature detectors. Third, the actual features or
abstractions learned by the network should be invariant under translation in time.
Fourth, the learning procedure should not require precise temporal alignment of
the labels that are to be learned. Fifth, the number of weights in the network
should be sufficiently small compared to the amount of training data so that the
network is forced to encode the training data by extracting regularity. In the
following, we describe a TDNN architecture that satisfies all of these criteria and
is designed explicitly for the recognition of phonemes, in particular, the voiced
stops "B," "D," and "G."
[Figure 1: a TDNN unit. Each input U1 … UJ passes through delays D1 … DN; the current and delayed copies are multiplied by separate weights (Wi … Wi+N, …, Wj … Wj+N), summed, and passed through the nonlinearity F.]
Adjacent coefficients in time were collapsed for further data reduction, resulting in an overall 10 ms frame rate. All coefficients of an input token (in this case, 15 frames of speech centered around the hand-labeled vowel onset) were then normalized. This was accomplished by subtracting from each coefficient the average coefficient energy computed over all 15 frames of an input token and then normalizing each coefficient to lie between −1 and +1. All tokens in our database were preprocessed in the same fashion. Fig. 2 shows the resulting coefficients for the speech token "BA" as input to the network, where positive values are shown as black squares and negative values as gray squares.
This input layer is then fully interconnected to a layer of 8 time-delay hidden
units, where J = 16 and N = 2 (i.e., 16 coefficients over 3 frames with time
delay 0, 1, and 2). An alternative way of seeing this is depicted in Fig. 2. It
shows the inputs to these time-delay units expanded out spatially into a 3 frame
window, which is passed over the input spectrogram. Each unit in the first hidden
[Figure 2: the TDNN architecture used for phoneme recognition. The input layer holds 16 melscale filterbank coefficients (center frequencies 141–5437 Hz) over 15 frames at a 10 ms frame rate; Hidden Layer 1 has 8 units, Hidden Layer 2 has 3 units, and the output layer has three units (B, D, G) obtained by integration over time.]
layer now receives input (via 48 weighted connections) from the coefficients in the 3 frame window. The particular choice of 3 frames (30 ms) was motivated by earlier studies [26]–[29] that suggest that a 30 ms window might be sufficient to represent low-level acoustic-phonetic events for stop consonant recognition. It was also the optimal choice among a number of alternative designs evaluated by Lang [21] on a similar task.

In the second hidden layer, each of 3 TDNN units looks at a 5 frame window of activity levels in hidden layer 1 (i.e., J = 8, N = 4). The choice of a larger 5 frame window in this layer was motivated by the intuition that higher level units
should learn to make decisions over a wider range in time based on more local
abstractions at lower levels.
Finally, the output is obtained by integrating (summing) the evidence from each of the 3 units in hidden layer 2 over time and connecting it to its pertinent output unit (shown in Fig. 2 over 9 frames for the "B" output unit). In practice, this summation is implemented simply as another nonlinear TDNN unit (the sigmoid function is applied here as well) which has fixed equal weights to a row of unit firings over time in hidden layer 2.⁴
When the TDNN has learned its internal representation, it performs recogni-
tion by passing input speech over the TDNN units. In terms of the illustration of
Fig. 2, this is equivalent to passing the time-delay windows over the lower level
units' firing patterns.⁵ At the lowest level, these firing patterns simply consist of
the sensory input, i.e., the spectral coefficients.
Each TDNN unit outlined in this section has the ability to encode temporal
relationships within the range of the N delays. Higher layers can attend to larger
time spans, so local short duration features will be formed at the lower layer and
more complex longer duration features at the higher layer. The learning proce-
dure ensures that each of the units in each layer has its weights adjusted in a way
that improves the network's overall performance.
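For readers who find code clearer, here is a compact sketch (ours, not from the paper) of the forward pass just described: because the same weights are applied at every time shift, each layer is a one-dimensional convolution over frames followed by a sigmoid. Layer sizes follow the text; the weights are random placeholders, and the final integration is simplified to an average rather than the fixed-weight sigmoid unit described in footnote 4:

# TDNN forward pass: shared weights over a sliding window of frames.
import numpy as np

rng = np.random.default_rng(5)

def tdnn_layer(x, W, b):
    # x: (frames, coeffs); W: (window, coeffs, units), shared across all shifts
    window = W.shape[0]
    frames_out = x.shape[0] - window + 1
    out = np.empty((frames_out, W.shape[2]))
    for t in range(frames_out):
        out[t] = np.einsum('wc,wcu->u', x[t:t + window], W) + b
    return 1.0 / (1.0 + np.exp(-out))         # sigmoid nonlinearity

x = rng.normal(size=(15, 16))                 # 15 frames of 16 coefficients
h1 = tdnn_layer(x, rng.normal(size=(3, 16, 8)) * 0.1, np.zeros(8))   # -> 13 frames
h2 = tdnn_layer(h1, rng.normal(size=(5, 8, 3)) * 0.1, np.zeros(3))   # -> 9 frames
y = h2.mean(axis=0)                           # integrate evidence over time
print(h1.shape, h2.shape, y)                  # (13, 8) (9, 3) and B/D/G scores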
B. Learning in a TDNN
Several learning techniques exist for optimization of neural networks [1], [2],
[30]. For the present network, we adopt the Backpropagation Learning Proce-
dure [18], [5]. Mathematically, backpropagation is gradient descent of the mean-
squared error as a function of the weights. The procedure performs two passes
through the network. During the forward pass, an input pattern is applied to the
network with its current connection strengths (initially small random weights).
The outputs of all the units at each level are computed starting at the input layer
and working forward to the output layer. The output is then compared to the
desired output and its error calculated. During the backward pass, the derivative
of this error is then propagated back through the network, and all the weights are
adjusted so as to decrease the error [18], [5]. This is repeated many times for all
the training tokens until the network converges to producing the desired output.
In the previous section, we described a method of expressing temporal struc-
ture in a TDNN and contrasted this method to training a network on a static input
pattern (spectrogram), which results in shift sensitive networks (i.e., poor perfor-
mance for slightly misaligned input patterns) as well as less crisp decision mak-
ing in the units of the network (caused by misaligned tokens during training).

⁴ Note, however, that as for all units in this network (except the input units), the output units are also connected to a permanently active threshold unit. In this way, both an output unit's one shared connection to a row in hidden layer 2 and its dc-bias are learned and can be adjusted for optimal classification.

⁵ Thus, 13 frames of activations in hidden layer 1 are generated when scanning the 15 frames of input speech with a 3 frame time delay window. Similarly, 9 frames are produced in hidden layer 2 from the 13 frames of activation in the layer below.
To achieve the desired learning behavior, we need to ensure that the network is
exposed to sequences of patterns and that it is allowed (or encouraged) to learn
about the most powerful cues and sequences of cues among them. Conceptually,
the backpropagation procedure is applied to speech patterns that are stepped
through in time. An equivalent way of achieving this result is to use a spatially
expanded input pattern, i.e., a spectrogram plus some constraints on the weights.
Each collection of TDNN units described above is duplicated for each one frame
shift in time. In this way, the whole history of activities is available at once.
Since the shifted copies of the TDNN units are mere duplicates and are to look
for the same acoustic event, the weights of the corresponding connections in the
time shifted copies must be constrained to be the same. To implement this, we
first apply the regular backpropagation forward and backward pass to all time-
shifted copies as if they were separate events. This yields different error deriva-
tives for corresponding (time shifted) connections. Rather than changing the
weights on time-shifted connections separately, however, we actually update
each weight on corresponding connections by the same value, namely by the
average of all corresponding time-delayed weight changes.⁶ Fig. 2 illustrates this
by showing in each layer only two connections that are linked to (constrained to
have the same value as) their time-shifted neighbors. Of course, this applies to all
connections and all time shifts. In this way, the network is forced to discover
useful acoustic-phonetic features in the input, regardless of when in time they
actually occurred. This is an important property, as it makes the network inde-
pendent of error-prone preprocessing algorithms that otherwise would be needed
for time alignment and/or segmentation. In Section IV-C, we will show exam-
ples of grossly misaligned patterns that are properly recognized due to this
property.

⁶ Note that in the experiments reported below, these weight changes were actually carried out each time the error derivatives from all training samples had been computed [5].
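The weight-tying step itself is tiny; here is a sketch (ours) of the shared-weight update, with arbitrary numbers:

# One constrained update: average the gradients from all time-shifted copies.
import numpy as np

def tied_update(shared_w, per_shift_grads, lr=0.002):
    # per_shift_grads: one error derivative per time-shifted copy of this connection
    return shared_w - lr * np.mean(per_shift_grads, axis=0)

w = 0.3
grads = np.array([0.12, -0.05, 0.08])   # derivatives from three time shifts
print(tied_update(w, grads))            # the single value used at every shift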
The procedure described here is computationally rather expensive, due to the
many iterations necessary for learning a complex multidimensional weight space
and the number of learning samples. In our case, about 800 learning samples
were used, and between 20 000 and 50 000 iterations of the backpropagation
loop were run over all training samples. Two steps were taken to perform learn-
ing within reasonable time. First, we have implemented our learning procedure
in C and Fortran on a 4 processor Alliant supercomputer. The speed of learning
can be improved considerably by computing the forward and backward sweeps
for several different training samples in parallel on different processors. Further
improvements can be gained by vectorizing operations and possibly assembly
coding the innermost loop. Our present implementation achieves about a factor
6 Note that in the experiments reported below, these weight changes were actually carried out each time the error derivatives from all training samples had been computed [5].
Figure 3. Learning curve of a typical training run: normalized error versus number of iterations (0 to 10,000), with the training set enlarged in stages of 3, 6, 9, 24, 99, 249, and 780 tokens.
of 9 speedup over a VAX 8600, but still leaves room for further improvements (Lang [21], for example, reports a speedup of a factor of 120 over a VAX 11/780 for an implementation running on a Convex supercomputer). The second step taken to improve learning time is a staged learning strategy. In
this approach, we start optimizing the network based on 3 prototypical training
tokens only.7 In this case, convergence is achieved rapidly, but the network will
have learned a representation that generalizes poorly to new and different pat-
terns. Once convergence is achieved, the network is presented with approx-
imately twice the number of tokens and learning continues until convergence.
Fig. 3 shows the progress during a typical learning run. The measured error is
½ the squared error of all the output units, normalized for the number of training
tokens. In this run, the number of training tokens used were 3, 6, 9, 24, 99, 249,
and 780. As can be seen from Fig. 3, the error briefly jumps up every time more
variability is introduced by way of more training data. The network is then forced
to improve its representation to discover clues that generalize better and to
deemphasize those that turn out to be merely irrelevant idiosyncrasies of a
limited sample set. Using the full training set of 780 tokens, this particular run
7 Note that for optimal learning, the training data are presented by always alternating tokens for each class. Hence, we start the network off by presenting 3 tokens, one for each class.
was continued until iteration 35 000 (Fig. 3 shows the learning curve only up to
15 000 iterations). With this full training set, small learning steps have to be
taken and learning progresses slowly. In this case, a step size of 0.002 and a
momentum [5] of 0.1 were used. The staged learning approach was found to be
useful to move the weights of the network rapidly into the neighborhood of a
reasonable solution, before the rather slow fine tuning over all training tokens
begins.
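In outline, the staged strategy can be written as follows (a sketch of ours; `train_until_converged` stands in for the full backpropagation loop described above):

```python
def staged_training(network, tokens, stages=(3, 6, 9, 24, 99, 249, 780)):
    """Train to convergence on a small token set, then roughly double
    the set and continue, following the stage sizes quoted in the text.
    Tokens are assumed to be ordered so that classes alternate
    (footnote 7)."""
    for size in stages:
        network.train_until_converged(tokens[:size])
    return network
```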
Despite these speedups, learning runs still take on the order of several days. A
number of programming tricks [21] as well as modifications to the learning
procedure [31] are not implemented yet and could yield another factor of 10 or
more in learning time reduction. It is important to note, however, that the amount
of computation considered here is necessary only for learning of a TDNN and not
for recognition. Recognition can easily be performed in better than real time on a
workstation or personal computer. The simple structure also makes TDNN's well
suited for standardized VLSI implementation. The detailed knowledge could be
learned "off-line" using substantial computing power and then downloaded in the
form of weights onto a real-time production network.
Figure 4. Hidden Markov Model (states s0, s1, s2, s3).
per speaker and phoneme (see details of the training database below). Fig. 5
shows for a typical training run the average log probability normalized by the
number of frames. Training was continued until the increase of the average log
probability between iterations became less than 2 × 10⁻³.
Typically, about 10–20 learning iterations are required for 256 tokens. A
training run takes about 1 h on a VAX 8700. Floor values8 were set on the output
probabilities to avoid errors caused by zero probabilities. We have experimented
with composite models, which were trained using a combination of context-
independent and context-dependent probability values as suggested by Schwartz
et al. [35], [36]. In our case, no significant improvements were attained.
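The stopping rule and probability flooring can be summarized as below (our sketch; the three HMM methods are hypothetical stand-ins for the re-estimation steps, and the floor value is illustrative):

```python
def train_hmm(hmm, tokens, tol=2e-3, floor=1e-5):
    """Iterate re-estimation until the average log probability per
    frame improves by less than tol. The floor value is illustrative;
    the text says only that an optimal value was chosen empirically."""
    prev = float("-inf")
    while True:
        hmm.reestimate(tokens)                 # one re-estimation pass
        hmm.floor_output_probabilities(floor)  # avoid zero probabilities
        avg = hmm.average_log_probability(tokens)  # normalized by frames
        if avg - prev < tol:
            return hmm
        prev = avg
```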
B. Experimental Conditions
For performance evaluation, we have used a large vocabulary database of 5240
common Japanese words [44]. These words were uttered in isolation by three
male native Japanese speakers (MAU, MHT, and MNM, all professional an-
nouncers) in the order they appear in a Japanese dictionary. All utterances were
recorded in a sound-proof booth and digitized at a 12 kHz sampling rate. The
database was then split into a training set (the even numbered files as derived
from the recording order) and a testing set (the odd numbered files). A given
speaker's training and testing data, therefore, consisted of 2620 utterances each,
from which the actual phonetic tokens were extracted.
The phoneme recognition task chosen for this experiment was the recognition
of the voiced stops, i.e., the phonemes "B," "D," and "G." The actual tokens
were extracted from the utterances using manually selected acoustic-phonetic
labels provided with the database [44]. For speaker MAU, for example, a total of
219 "B's," 203 "D's," and 260 "G's" were extracted from the training and 227
"B's," 179 "D's," and 252 "G's," from the testing data. Both recognition
schemes, the TDNN's and the HMM's, were trained and tested speaker depen-
dently. Thus, in both cases, a separate recognizer was trained for each speaker.
In our database, no preselection of tokens was performed. All tokens labeled
as one of the three voiced stops were included. It is important to note that since
the consonant tokens were extracted from entire utterances and not read in
isolation, a significant amount of phonetic variability exists. Foremost, there is
the variability introduced by the phonetic context out of which a token is extrac-
ted. The actual signal of a "BA" will therefore look significantly different from a
"BI" and so on. Second, the position of a phonemic token within the utterance
introduces additional variability. In Japanese, for example, a "G" is nasalized,
when it occurs embedded in an utterance, but not in utterance initial position.
Both of our recognition algorithms are only given the phonemic identity of a
token and must find their own ways of representing the fine variations of speech.
8 Here, once again, the optimal value out of a number of alternative choices was selected.
Figure 5. Average log probability (normalized by the number of frames) versus training iterations (0 to 50) for a typical HMM training run.
C. Results
Table I shows the results from the recognition experiments described above as
obtained from the testing data. As can be seen, for all three speakers, the TDNN
yields considerably higher performance than our HMM. Averaged over all three
speakers, the error rate is reduced from 6.3 to 1.5 percent—a more than fourfold
reduction in error.
While it is particularly important here to base performance evaluation on
testing data,9 a few observations can be made from recognition runs over the
training data. For the training data set, recognition rates were 99.6 percent
(MAU), 99.7 percent (MHT), and 99.7 percent (MNM) for the TDNN, and 96.9
percent (MAU), 99.1 percent (MHT), and 95.7 percent (MNM) for the HMM.
Comparison of these results to those from the testing data in Table I indicates that
both methods achieved good generalization from the training set to unknown
data. The data also suggest that better classification rather than better generaliza-
tion might be the cause of the TDNN's better performance shown in Table I.
Figs. 6–11 show scatter plots of the recognition outcome for the test data for
speaker MAU, using the HMM and the TDNN. For the HMM (see Figs. 6–8),
the log probability of the next best matching incorrect token is plotted against the
log probability10 of the correct token, e.g., "B," "D," and "G." In Figs. 9–11,
the activation levels from the TDNN's output units are plotted in the same
fashion. Note that these plots are not easily comparable, as the two recognition
methods have been trained in quite different ways. They do, however, represent
the numerical values that each method's decision rule uses to determine the
9 If the training data are insufficient, neural networks can in principle learn to memorize training patterns rather than finding generalizations of speech.
10 Normalized by number of frames.
TABLE I
Recognition Results for Three Speakers Over Test Data Using TDNN and HMM
Figure 6. Scatter plot showing log probabilities for the best matching
incorrect case versus the correctly recognized "B's" using an HMM.
Figure 7. Scatter plot showing log probabilities for the best matching
incorrect case versus the correctly recognized "D's" using an HMM.
Figure 8. Scatter plot showing log probabilities for the best matching
incorrect case versus the correctly recognized "G's" using an HMM.
Figure 9. Scatter plot showing activation levels for the best matching
incorrect case versus the correctly recognized "B's" using a TDNN.
Figure 10. Scatter plot showing activation levels for the best match-
ing incorrect case versus the correctly recognized "D's" using a TDNN.
Figure 11. Scatter plot showing activation levels for the best match-
ing incorrect case versus the correctly recognized "G's" using a TDNN.
Given the encouraging performance of our TDNN's, a close look at the learned
internal representation of the network is warranted. What are the properties or
abstractions that the network has learned that appear to yield a very powerful
description of voiced stops? Figs. 12 and 13 show two typical instances of a "D"
out of two different phonetic contexts ("DA" and "DO," respectively). In both
cases, only the correct unit, the "D-output unit," fires strongly, despite the fact
that the two input spectrograms differ considerably from each other. If we study
the internal firings in these two cases, we can see that the network has learned to
Figure 12. TDNN activation patterns for "DA." (Input spectrogram: 16 spectral coefficients from 141 to 5437 Hz.)
use alternate internal representations to link variations in the sensory input to the
same higher level concepts. A good example is given by the firings of the third
and fourth hidden unit in the first layer above the input layer. As can be seen
from Fig. 13, the fourth hidden unit fires particularly strongly after vowel onset
in the case of "DO," while the third unit shows stronger activation after vowel
onset in the case of "DA."
Fig. 14 shows the significance of these different firing patterns. Here the
connection strengths for the eight moving TDNN units are shown, where white
and black blobs represent positive and negative weights, respectively, and the
magnitude of a weight is indicated by the size of the blob. In this figure, the time
delays are displayed spatially as a 3 frame window of 16 spectral coefficients.
Conceptually, the weights in this window form a moving acoustic–phonetic
feature detector that fires when the pattern for which it is specialized is encoun-
Figure 13. TDNN activation patterns for "DO." (Input spectrogram: 16 spectral coefficients from 141 to 5437 Hz.)
tered in the input speech. In our example, we can see that hidden unit number 4
(which was activated for "DO") has learned to fire when a falling (or rising)
second formant starting at around 1600 Hz is found in the input (see filled arrow
in Fig. 14). As can be seen in Fig. 13, this is the case for "DO" and hence the
firing of hidden unit 4 after voicing onset (see row pointed to by the filled arrow
in Fig. 13). In the case of "DA" (see Fig. 12), in turn, the second formant does
not fall significantly, and hidden unit 3 (pointed to by the filled arrow) fires
instead. From Fig. 14 we can verify that TDNN unit 3 has learned to look for a
steady (or only slightly falling) second formant starting at about 1800 Hz. The
connections in the second and third layer then link the different firing patterns
observed in the first hidden layer into one and the same decision.
Another interesting feature can be seen in the bottom hidden unit in hidden
layer number 1 (see Figs. 12 and 13, and compare them to the weights of hidden
unit 1 displayed in Fig. 14). This unit has learned to take on the role of finding
the segment boundary of the voiced stop. It does so in reverse polarity, i.e., it is
always on except when the vowel onset of the voiced stop is encountered (see
unfilled arrow in Figs. 13 and 12). Indeed, the higher layer TDNN units subse-
quently use this "segmenter" to base the final decision on the occurrence of the
right lower features at the right point in time.
In the previous example, we have seen that the TDNN can account for varia-
tions in phonetic context. Figs. 15 and 16 show examples of variability caused by
the relative position of a phoneme within a word. In Japanese, a "G" embedded
in a word tends to be nasalized as seen in the spectrum of a "GA" in Fig. 15. Fig.
16 shows a word initial "GA." Despite the striking differences between these two
Figure 15. TDNN activation patterns for "GA" embedded in an utterance. (Input spectrogram: 16 spectral coefficients from 141 to 5437 Hz.)
Figure 16. TDNN activation patterns for "GA" in utterance initial position. (Input spectrogram: 16 spectral coefficients from 141 to 5437 Hz.)
percent increase in error rate when all tokens from the training data were arti-
ficially shifted by 20 ms. Such residual time-shift sensitivities are due to the edge
effects at the token boundaries and can probably be removed by training the
TDNN using randomly shifted training tokens.11 We also consider the formation
of shift-invariant internal features to be the important desirable property we
observe in the TDNN. Such internal features could be incorporated into larger
speech recognition systems using more sophisticated search techniques or a
syllable or word level TDNN, and hence could replace the simple integration
layer we have used here for training and evaluation.
Three important properties of the TDNN's have been observed. First, our
TDNN was able to learn, without human interference, meaningful linguistic
11 We gratefully acknowledge one of the reviewers for suggesting this idea.
Figure 17. TDNN activation patterns for "DO" misaligned by +30 ms. (Input spectrogram: 16 spectral coefficients from 141 to 5437 Hz.)
Figure 18. TDNN activation patterns for "DO" misaligned by −30 ms. (Input spectrogram: 16 spectral coefficients from 141 to 5437 Hz.)
evaluation over testing data from three speakers, the TDNN achieved an average
recognition score of 98.5 percent. For comparison, we have applied various
Hidden Markov Models to the same task and only been able to recognize 93.7
percent of the tokens correctly. We would like to note that many variations of
HMM's have been attempted, and many more variations of both HMM's and
TDNN's are conceivable. Some of these variations could potentially lead to
significant improvements over the results reported in this study. Our goal here is
to present TDNN's as a new and successful approach for speech recognition.
Their power lies in their ability to develop shift-invariant internal abstractions of
speech and to use them in trading relations for making optimal decisions. This
holds significant promise for speech recognition in general, as it could help
overcome the representational weaknesses of speech recognition systems faced
with the uncertainty and variability in real-life signals.
ACKNOWLEDGMENT
The authors would like to express their gratitude to Dr. A. Kurematsu, President
of ATR Interpreting Telephony Research Laboratories, for his enthusiastic en-
couragement and support which made this research possible. We are also in-
debted to the members of the Speech Processing Department at ATR and Mr.
Fukuda at Apollo Computer, Tokyo, Japan, for programming assistance in the
various stages of this research.
REFERENCES
[11] R. P. Lippmann and B. Gold, "Neural-net classifiers useful for speech recognition," in Proc.
IEEE Int. Conf. Neural Networks, June 1987.
[12] D. J. Burr, "A neural network digit recognizer," in Proc. IEEE Int. Conf. Syst., Man, Cybern., Oct. 1986.
[13] D. Lubensky, "Learning spectral-temporal dependencies using connectionist networks," in
Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Apr. 1988.
[14] R. L. Watrous and L. Shastri, "Learning phonetic features using connectionist networks: An
experiment in speech recognition," in Proc. IEEE Int. Conf. Neural Networks, June 1987.
[15] R. W. Prager, T. D. Harrison, and F. Fallside, "Boltzmann machines for speech recognition,"
Comput., Speech, Language, vol. 3, no. 27, Mar. 1986.
[16] J. L. Elman and D. Zipser, "Learning the hidden structure of speech," Tech. Rep., Univ. Calif., San Diego, Feb. 1987.
[17] R. L. Watrous, L. Shastri, and A. H. Waibel, "Learned phonetic discrimination using connec-
tionist networks," in Proc. Euro. Conf. Speech Technol., Edinburgh, Sept. 1987, pp. 377-380.
[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning Internal Representations by Error
Propagation. Cambridge, MA: M.I.T. Press, 1986, ch. 8, pp. 318-362.
[19] J. S. Bridle and R. K. Moore, "Boltzmann machines for speech pattern processing," in Proc. Inst. Acoust., 1984, pp. 315-322.
[20] D. W. Tank and J. J. Hopfield, "Neural computation by concentrating information in time," in
Proc. Nat. Academy Sci., Apr. 1987, pp. 1896-1900.
[21] K. Lang, "Connectionist speech recognition," Ph.D. dissertation proposal, Carnegie-Mellon
Univ., Pittsburgh, PA.
[22] K. Fukushima, S. Miyake, and T. Ito, "Neocognitron: A neural network model for a mechanism
of visual pattern recognition," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, pp. 826-834,
Sept./Oct. 1983.
[23] T. Hanazawa, T. Kawabata, and K. Shikano, "Discrimination of Japanese voiced stops using Hidden Markov Model," in Proc. Conf. Acoust. Soc. Japan, Oct. 1987, pp. 19-20 (in Japanese).
[24] ——, "Recognition of Japanese voiced stops using Hidden Markov Models," IEICE Tech. Rep., Dec. 1987 (in Japanese).
[25] A. Waibel and B. Yegnanarayana, "Comparative study of nonlinear time warping techniques in
isolated word speech recognition systems," Tech. Rep., Carnegie-Mellon Univ., June 1981.
[26] S. Makino and K. Kido, "Phoneme recognition using time spectrum pattern," Speech Com-
mun., pp. 225-237, June 1986.
[27] S. E. Blumenstein and K. N. Stevens, "Acoustic invariance in speech production: Evidence
from measurements of the spectral characteristics of stop consonants," J. Acoust. Soc. Amer.,
vol. 66, pp. 1001-1017, 1979.
[28] ——, "Perceptual invariance and onset spectra for stop consonants in different vowel environments," J. Acoust. Soc. Amer., vol. 67, pp. 648-662, 1980.
[29] D. Kewley-Port, "Time varying features as correlates of place of articulation in stop conso-
nants," J. Acoust. Soc. Amer., vol. 73, pp. 322-335, 1983.
[30] G. E. Hinton, "Connectionist learning procedures," Artificial Intelligence, 1987.
[31] M. A. Franzini, "Speech recognition with back propagation," in Proc. 9th Annu. Conf. IEEE Eng. Med. Biol. Soc., Nov. 1987.
[32] F. Jelinek, "Continuous speech recognition by statistical methods," Proc. IEEE, vol. 64,
pp. 532-556, Apr. 1976.
[33] J. K. Baker, "Stochastic modeling as a means of automatic speech recognition," Ph.D. disserta-
tion, Carnegie-Mellon Univ., Apr. 1975.
[34] L. R. Bahl, S. K. Das, P. V. de Souza, F. Jelinek, S. Katz, R. L. Mercer, and M. A. Picheny,
"Some experiments with large-vocabulary isolated-word sentence recognition," in Proc. IEEE
Int. Conf. Acoust., Speech, Signal Processing, Apr. 1984.
3 Automated Aircraft Flare and
Touchdown Control Using
Neural Networks
Charles Schley
Yves Chauvin
Van Henkle
Thomson-CSF, Inc., Palo Alto Research Operation
ABSTRACT
INTRODUCTION
The process is observed through a set of sensors. Finally, the process is subjected
to disturbances from its external operating environment.
Controller Design
What is the best method of designing a control law? Two types of methods, often
called classical and modern methods, are described in the literature. Classical
methods look at linearized versions of the plant to be controlled and some loosely
defined response specifications such as bandwidth (i.e., speed of response) and
phase margin (i.e., degree of stability). These methods make use of time-domain
or frequency-domain mathematical tools such as root-loci methods or Bode plots.
Modern methods generally assume that a performance index for the process is
specified and provide controllers that optimize this performance index. Optimal
control theory attempts to find the parameters of the controller such that the
performance measure (possibly with added performance constraint penalties) is
minimized. In many standard optimal control problems, the controller network is
linear, the plant is linear, and the performance measure is quadratic. In this case
and when the process operation is over infinite time, the parameters of the
controller may be explicitly derived. In general, however, if either the controller
is nonlinear, the plant is nonlinear, or the performance measure is not quadratic,
closed-form solutions for the controller's parameters are not available. Neverthe-
less, numerical methods may be used to compute an optimal control law. Modern
methods make use of sophisticated mathematical tools such as the calculus of
variations, Pontryagin's maximum principle, or dynamic programming.
Although modern methods are more universal, classical methods are widely
used in practice, even with sophisticated control problems (McRuer, Ashkenas,
& Graham, 1973). We feel that the differences between classical and modern
methods can be summarized as follows:
Narendra and Parthasarathy (1990) and others have noted that recurrent back-
propagation networks implement gradient descent algorithms that may be used to
optimize the performance index of a plant. The essence of such methods is to
propagate performance errors back through the process and then back through the
controller to give error signals for updating the controller parameters. Figure 2
provides an overview of the interaction of a neural control law with a complex
system and possible performance indices for evaluating various control laws.
Figure 2. Neural network control of a complex system. Task inputs and the process outputs feed the controller (a neural net); its control signals drive the process to be controlled, which is also subject to disturbances. The controller training function comprises an optimization procedure driven by an objective performance measure and by performance constraints.
The functional components needed to train the controller are shown within the
shaded box of Figure 2. The objective performance measure contains factors that
are written mathematically and usually represent terms such as weighted square
error or other quantifiable measures. The performance constraints are often more
subjective in nature and can be formulated as reward or penalty functions on
categories such as "good" or "bad." The controller training function illustrated in
Figure 2 also contains an optimization procedure used to adjust the parameters of
the controller.
The network controller may be interpreted as a "neural" network when its
architecture and the techniques employed during control and parameter change
resemble techniques inspired from brain-style computations (e.g., Rumelhart &
McClelland, 1986). In the present case, these techniques are (i) parallel computations, (ii) local computations during control and learning, and (iii) the use of "neural" network learning algorithms. Narendra and Parthasarathy (1990) provide a more
extensive review of the common properties between neural networks and control
theories.
A GENERAL-PURPOSE NONLINEAR
CONTROL ARCHITECTURE
Figure 3. Basic controller block: a weighted sum (weights w) of the task inputs and process outputs is multiplied by the output of a sigmoidal switch unit (weights V); the block is replicated for i = 1, . . . , n, and the replicated outputs are summed (Σ) to form the control signals.
switch has zero output, the weighted sum of task inputs and process outputs does
not appear in the output. When these basic controller blocks are replicated and
their outputs are added, control signals then consist of weighted sums of the
controller inputs. Moreover, these weighted sums can be selected and/or blended
by the saturating switches to yield the control signals. The overall effect is a
prototypical feedforward and feedback controller with selectable gains and multi-
ple pathways where the overall equivalent gains are a function of the task and
process outputs. The resulting architecture yields a sigma-pi processing unit in
the final controller (Rumelhart, Hinton, & Williams, 1986).
Note that the controller of Figure 3 is composed of multiple parallel computa-
tion paths and can thus be implemented with parallel hardware to facilitate fast
computation. Also, should one or more of the parallel paths fail, some level of
performance would remain, providing a degree of fault tolerance. In addition,
since the controller can select one or more weighted mappings, it can operate in
multiple modes depending on conditions within the process to be controlled or
upon environmental conditions. This property can result in a controller finely
tuned to a process having several different regimes of behavior (nonlinear pro-
cess).
Presented here is the development of a model for the glideslope tracking and flare
phases of aircraft flight immediately prior to landing. The model is in the form of
Figure 4. Glideslope and flare geometry. The pitch angle is the angle
between the aircraft body and the ground measured in the vertical
plane.
Figure 6. Inertial coordinate frame (x, y, z) and transformation to the body-fixed frame (XB, YB, ZB) via the Euler angles; gravity vector = mg.
$$\dot\psi = \frac{Q\sin\phi + R\cos\phi}{\cos\theta}, \qquad
\dot\theta = Q\cos\phi - R\sin\phi, \qquad
\dot\phi = P + Q\tan\theta\sin\phi + R\tan\theta\cos\phi,$$

$$\begin{bmatrix}\dot X\\ \dot Y\\ \dot Z\end{bmatrix} =
\begin{bmatrix}\cos\psi & -\sin\psi & 0\\ \sin\psi & \cos\psi & 0\\ 0 & 0 & 1\end{bmatrix}
\begin{bmatrix}\cos\theta & 0 & \sin\theta\\ 0 & 1 & 0\\ -\sin\theta & 0 & \cos\theta\end{bmatrix}
\begin{bmatrix}1 & 0 & 0\\ 0 & \cos\phi & -\sin\phi\\ 0 & \sin\phi & \cos\phi\end{bmatrix}
\begin{bmatrix}U\\ V\\ W\end{bmatrix} \tag{3}$$
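As a numerical check on the composition of rotations in Equation 3, the following sketch (ours, in Python with numpy; not part of the original chapter) builds the body-to-inertial transformation:

```python
import numpy as np

def body_to_inertial(phi, theta, psi):
    """Compose the three elementary rotations of Equation 3 to map
    body-axis velocities [U, V, W] into inertial rates
    [Xdot, Ydot, Zdot]. Angles are in radians."""
    cph, sph = np.cos(phi), np.sin(phi)
    cth, sth = np.cos(theta), np.sin(theta)
    cps, sps = np.cos(psi), np.sin(psi)
    Rz = np.array([[cps, -sps, 0.0], [sps, cps, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cth, 0.0, sth], [0.0, 1.0, 0.0], [-sth, 0.0, cth]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cph, -sph], [0.0, sph, cph]])
    return Rz @ Ry @ Rx

# Example: level flight at 100 ft/sec gives Xdot = 100, Ydot = Zdot = 0.
rates = body_to_inertial(0.0, 0.0, 0.0) @ np.array([100.0, 0.0, 0.0])
```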
perturbation effects (lateral to longitudinal, and vice versa) are ignored. Trim
values are assumed as constant true airspeed Vtas with U0 = Vtas, W0 = 0.
Angular velocity is Q0 = 0; angular position is θ0 = Γ0, where Γ0 is the trim
flight path angle. The body-fixed coordinate frame thus consists of the aircraft
stability axes. Since the aircraft is trimmed at constant speed, the aircraft nose (or
wing chord) direction is actually elevated with respect to the body-fixed x axis.
Additionally, there is no rotation of the vertical relative to the inertial frame (i.e.,
no hypervelocity orbital flights). Then, the linearized equations are obtained as
follows:
$$dX_F = \frac{\partial X_F}{\partial U}(u - u_g) + \frac{\partial X_F}{\partial W}(w - w_g) + \frac{\partial X_F}{\partial \dot W}(\dot w - \dot w_g) + \frac{\partial X_F}{\partial Q}(q - q_g) + \frac{\partial X_F}{\partial \delta}\,\delta,$$

$$dZ_F = \frac{\partial Z_F}{\partial U}(u - u_g) + \frac{\partial Z_F}{\partial W}(w - w_g) + \frac{\partial Z_F}{\partial \dot W}(\dot w - \dot w_g) + \frac{\partial Z_F}{\partial Q}(q - q_g) + \frac{\partial Z_F}{\partial \delta}\,\delta,$$
$$dM = \frac{\partial M}{\partial U}(u - u_g) + \frac{\partial M}{\partial W}(w - w_g) + \frac{\partial M}{\partial \dot W}(\dot w - \dot w_g) + \frac{\partial M}{\partial Q}(q - q_g) + \frac{\partial M}{\partial \delta}\,\delta, \tag{5}$$

$$q_g = \frac{\partial w_g}{\partial x} = \frac{\partial w_g/\partial t}{\partial x/\partial t} = -\frac{1}{V_{tas}}\,\dot w_g, \tag{6}$$
where qg = spatial distribution of wind gusts in pitch direction.
The force equations can be divided by mass and the moment equation by the
moment of inertia. Neglecting the qg term and the velocity derivative terms,
collecting other terms, and simplifying yields the complete longitudinal linearized equations in terms of stability derivatives. Note that the angle of attack α has been introduced, where α = (180/π)(w/Vtas). Also, note that the initial flight path angle Γ0 is assumed to be 0, implying a true straight and level trim condition.
$$\dot u = X_u u + \frac{V_{tas}\pi}{180}X_w\alpha + \frac{\pi}{180}X_q q - \frac{\pi}{180}g\theta + X_E\delta_E + X_T\delta_T - X_u u_g - X_w w_g,$$

$$\dot\alpha = \frac{180}{V_{tas}\pi}Z_u u + Z_w\alpha + \frac{1}{V_{tas}}(V_{tas} + Z_q)q + \frac{180}{V_{tas}\pi}\left(Z_E\delta_E + Z_T\delta_T - Z_u u_g - Z_w w_g\right),$$

$$\dot q = \frac{180}{\pi}M_u u + V_{tas}M_w\alpha + M_q q + \frac{180}{\pi}\left(M_E\delta_E + M_T\delta_T - M_u u_g - M_w w_g\right),$$

$$\dot\theta = q,$$

$$\dot x = (V_{tas} + u)\cos\theta + \frac{V_{tas}\pi}{180}\alpha\sin\theta \approx V_{tas} + u,$$

$$\dot h = (V_{tas} + u)\sin\theta - \frac{V_{tas}\pi}{180}\alpha\cos\theta \approx -\frac{V_{tas}\pi}{180}\alpha + \frac{V_{tas}\pi}{180}\theta. \tag{7}$$
Figure 7. Pitch stability augmentation system: the pitch command θcmd is converted through the gain Kθ into the elevator command δE.
Control Laws
Longitudinal controllers must be specified for both the glideslope tracking and
the flare modes. This involves determination of the pitch command value (θcmd)
referenced in Figure 7. Other than parameter values and an open-loop pitchup
command for flare, both of these controllers are constructed similarly. Figure 9
illustrates the architecture of the PID controller for both glideslope tracking and
flare.
Figure 8. Autothrottle: the throttle command δT is computed from the difference between the reference airspeed Vtas and the estimated airspeed, with integrator state xVS1. Typical values: KT = 3.0, ωT = 0.1.
Figure 9. Glideslope and flare controller: a PID operation on the altitude error hcmd − hest producing θcmd, with integrator state xVC1 and open-loop pitchup θpitchup. Typical values: glideslope θpitchup = 0; flare θpitchup = 3°; Kh = 0.20; ωh = 0.10; Kḣ = 0.32; flare begins when h ≤ hf (45 ft).
$$\theta_{cmd} = K_h\omega_h x_{VC1} + K_h h_{est} + K_{\dot h}\dot h_{est} - K_h h_{cmd} - K_{\dot h}\dot h_{cmd} + \theta_{pitchup}, \tag{9}$$

where
xVC1 = glideslope and flare controller integrator,
θcmd = incremental pitch angle command (deg),
hcmd, ḣcmd = altitude (ft) and altitude rate (ft/sec) commands,
hest, ḣest = altitude (ft) and altitude rate (ft/sec) estimates obtained from the complementary filter,
θpitchup = open-loop pitchup command (deg) for flare.
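A direct transcription of Equation 9, using the typical gain values listed with Figure 9 (our sketch; argument names are illustrative):

```python
def pitch_command(x_vc1, h_est, hdot_est, h_cmd, hdot_cmd, flare,
                  K_h=0.20, K_hdot=0.32, omega_h=0.10):
    """Equation 9: incremental pitch angle command (deg); flare adds
    the 3-degree open-loop pitchup."""
    theta_pitchup = 3.0 if flare else 0.0
    return (K_h * omega_h * x_vc1
            + K_h * h_est + K_hdot * hdot_est
            - K_h * h_cmd - K_hdot * hdot_cmd
            + theta_pitchup)
```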
Figure 10. Glideslope complementary filter: integrator states xVC2, xVC3, xVC4 driven by altitude h and vertical acceleration Nz, producing the estimates hest and ḣest. Typical values: LXIRS = 74.2, k′ = 1.431 − (π/180)LXIRS, k1 = 0.15, k2 = 0.0075, k3 = 0.000125.
$$\dot x_{VC3} = -k_2 x_{VC2} + x_{VC4} + k_2 k'\theta + k_2 h + N_z,$$

$$N_z = \frac{V_{tas}\pi}{180}\left(-\dot\alpha + \dot\theta\right) = -Z_u u - \frac{\pi}{180}\left(V_{tas}Z_w\alpha + Z_q q\right).$$
Figure 11 shows the complementary filter used for flare. Its functions are
implemented by means of Equations 11.
$$\dot x_{VC2} = -k_1 x_{VC2} + x_{VC3} + k_1 h, \qquad h_{est} = x_{VC2}, \quad \dot h_{est} = x_{VC3}, \tag{11}$$

where
xVC2, xVC3, xVC4 = glideslope complementary filter integrators,
Nz = incremental vertical acceleration (ft/sec/sec),
hest, ḣest = altitude (ft) and altitude rate (ft/sec) estimates,
θ, h = incremental pitch angle (deg), altitude (ft),
u, α, q = incremental speed (ft/sec), angle of attack (deg), pitch rate (deg/sec).
Figure 11. Flare complementary filter: integrator states xVC2, xVC3 driven by altitude h and vertical acceleration Nz, producing the estimates hest and ḣest. Typical values: k1 = 2.0, k2 = 1.0.
where
hcmd, ḣcmd = altitude (ft) and altitude rate (ft/sec) commands; the glideslope begins at tgs = 0,
x = negative of the ground range to the GPIP (ft),
γgs = desired glideslope angle, γgs ≈ 2.75°,
hgs = altitude at beginning of the glideslope, hgs ≈ 300 ft,
xgs = x distance at beginning of the glideslope, xgs ≈ 6245.65 ft.
Note: Values for h and x at the beginning of the glideslope are for simulation purposes only; the glideslope actually begins at about 1500 ft altitude.

$$x_f = -h_f/\tan\gamma_{gs}, \qquad \tau_x = \frac{h_f V_{tas}}{V_{tas}\tan\gamma_{gs} + \dot h_{TD}}.$$
Wind Disturbances
The environment influences the process through wind disturbances, which are modeled with two components: a constant-velocity component and turbulence. The
magnitude of the constant-velocity component is a function of altitude (wind
shear). Turbulence is more complex and is a temporal and spatial function as an
aircraft flies through an airspace region. The constant velocity wind component
exists only in the horizontal plane (i.e., combination of headwind and cross wind)
and its value is given in Equation 14 as a logarithmic variation with altitude. The
quantity H is a value representing the constant wind component at an altitude of
510 ft. A typical value of H is 20 ft/sec. In the next section we explain that the
network was trained with a distribution of constant wind components from H =
–10 ft/sec to H = 40 ft/sec. Note that a wind model with H = 40 represents a
very strong turbulent wind.
$$u_{gc} = -H\left[1 + \frac{\log_e(h/510)}{\log_e(51)}\right], \tag{14}$$

where
ugc = constant (altitude shear) component of ug, zero at 10-ft altitude,
H = wind speed at 510-ft altitude (typical value = 20 ft/sec),
h = aircraft altitude.
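Equation 14 can be checked numerically; the sketch below (ours) reproduces the stated endpoints, zero at 10 ft and magnitude H at 510 ft:

```python
import math

def constant_wind(h, H=20.0):
    """Equation 14: constant (shear) wind component ugc (ft/sec) at
    altitude h (ft). H is the wind speed at 510 ft; the component
    vanishes at 10 ft by construction."""
    return -H * (1.0 + math.log(h / 510.0) / math.log(51.0))

assert abs(constant_wind(10.0)) < 1e-12          # zero at 10 ft
assert abs(constant_wind(510.0) + 20.0) < 1e-12  # equals -H at 510 ft
```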
For the horizontal and vertical wind turbulence velocities, the Dryden spectra
(Neuman and Foster, 1970) for spatial turbulence distribution are assumed.
These spectra involve a wind disturbance model frozen in time. This is not
seriously limiting since the aircraft moves in time through the wind field and thus
experiences temporal wind variation. The Dryden spectra are also amenable to
simulation and show reasonable agreement with measured data. The generation
of turbulence velocities is effected by the application of Gaussian white noise to
coloring filters. This provides the proper correlation to match the desired spectra.
Figures 12 and 13 summarize the turbulence calculations, while Equations 15 implement them.
$$\dot x_{dry1} = -a_u x_{dry1} + a_u\sigma_u\sqrt{\frac{2}{a_u\Delta t}}\;N(0,1),$$

$$\dot x_{dry2} = -a_w x_{dry2} + a_w x_{dry3} + \frac{a_w}{b_w}\,\sigma_w b_w\sqrt{\frac{3}{a_w^3\Delta t}}\;N(0,1),$$

$$\dot x_{dry3} = -a_w x_{dry3} + a_w\left(1 + \frac{a_w}{b_w}\right)\sigma_w b_w\sqrt{\frac{3}{a_w^3\Delta t}}\;N(0,1),$$

$$u_g = u_{gc} + x_{dry1}, \qquad w_g = x_{dry2}. \tag{15}$$
Figure 12. Horizontal turbulence coloring filter: white noise N(0,1) is scaled by σu√(2/(auΔt)) and passed through a first-order filter with break frequency au; the filter state xdry1 is added to ugc to give ug. Here σu = 0.2|ugc|; au = Vtas/Lu; Lu = 100 h^(1/3) for h > 230, Lu = 600 for h ≤ 230; Vtas, h, Δt = nominal aircraft speed (ft/sec), aircraft altitude (ft), and simulation time step; Lu, σu = scale length (ft) and turbulence standard deviation (ft/sec); N(0,1) = Gaussian white noise with zero mean and unit standard deviation.
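As one illustration, an Euler discretization of the horizontal coloring filter (the first of Equations 15, with the constants of Figure 12) might look as follows; given the state of the original text, this is strictly our own reading, not the authors' implementation:

```python
import math
import random

def horizontal_gust(h, V_tas, dt, n_steps, H=20.0):
    """Sketch: Euler integration of the horizontal Dryden coloring
    filter (the x_dry1 equation of Equations 15), with the constants
    shown in Figure 12. Altitude h is held frozen for simplicity."""
    u_gc = -H * (1.0 + math.log(h / 510.0) / math.log(51.0))  # Eq. 14
    L_u = 100.0 * h ** (1.0 / 3.0) if h > 230.0 else 600.0    # scale length
    a_u = V_tas / L_u
    sigma_u = 0.2 * abs(u_gc)
    x_dry1, u_g = 0.0, []
    for _ in range(n_steps):
        noise = sigma_u * math.sqrt(2.0 / (a_u * dt)) * random.gauss(0.0, 1.0)
        x_dry1 += dt * (-a_u * x_dry1 + a_u * noise)
        u_g.append(u_gc + x_dry1)  # ug = ugc + x_dry1
    return u_g
```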
Initial Conditions
In order to begin a simulation of the flight of the aircraft, all dynamic variables
must have initial conditions specified. For the longitudinal/vertical equations of
motion (Equations 7) for the bare airframe, initial conditions are specified by
placing the aircraft on the glideslope in a steady-state condition. This means that
initial values for the u, α, q, θ, x, and h variables of Relations 7 are obtained according to the following assumptions: u0 = ug0, u̇0 = 0, α̇0 = 0, q0 = 0, θ̇0 = 0, and ḣ0 = −ẋ0 tan γgs, where ug0 is the initial longitudinal constant wind speed and γgs is the glideslope angle. Substituting these conditions into Relations 7
Figure 13. Vertical turbulence coloring filter: white noise N(0,1) is scaled and passed through the second-order filter defined by aw and bw (states xdry3, xdry2) to give wg. Here σw = 0.2|ugc|(0.5 + 0.00098h) for 0 ≤ h ≤ 500, and σw = 0.2|ugc| for h > 500; aw = Vtas/Lw, bw = Vtas/(√3 Lw), Lw = h; Vtas, h, Δt = nominal aircraft speed (ft/sec), aircraft altitude (ft), and simulation time step; Lw, σw = scale length (ft) and turbulence standard deviation (ft/sec); N(0,1) = Gaussian white noise with zero mean and unit standard deviation.
provides the results of Relations 16, whose solution determines the initial conditions for the variables u0, α0, q0, θ0, x0, h0, δE0, and δT0:

$$u_0 = u_{g0}, \qquad q_0 = 0, \qquad x_0 = -h(t_0)/\tan\gamma_{gs}, \qquad h_0 = h(t_0),$$

$$\begin{bmatrix}
\dfrac{V_{tas}\pi}{180}X_w & -\dfrac{\pi}{180}g & X_E & X_T\\[4pt]
Z_w & 0 & \dfrac{180}{V_{tas}\pi}Z_E & \dfrac{180}{V_{tas}\pi}Z_T\\[4pt]
V_{tas}M_w & 0 & \dfrac{180}{\pi}M_E & \dfrac{180}{\pi}M_T\\[4pt]
-\dfrac{V_{tas}\pi}{180} & \dfrac{V_{tas}\pi}{180} & 0 & 0
\end{bmatrix}
\begin{bmatrix}\alpha_0\\ \theta_0\\ \delta_{E0}\\ \delta_{T0}\end{bmatrix} =
\begin{bmatrix}0\\ 0\\ 0\\ -(V_{tas} + u_{g0})\tan\gamma_{gs}\end{bmatrix}, \tag{16}$$

where ug0 = ugc(h0) = longitudinal constant wind speed (ft/sec).
Simulation
To simulate the flight of the aircraft, the aircraft states (i.e., u, α, q, θ, x, and h) given in Equations 7 must be solved by means of some method for the solution of differential equations (e.g., Runge-Kutta). The values for the elevator, throttle, rudder, and aileron angles for Equations 7 (i.e., δE and δT) are obtained by implementing Equations 8, which are implied by the stability augmentation diagrams of Figures 7 and 8. Here, the symbol s indicates the Laplace operator (derivative with respect to time). Input to the pitch stability augmentation system (θcmd) is calculated by implementing the controller shown in Figure 9, where flare takes place at about 45 ft of altitude. Inputs to the glideslope and flare controller (hest and ḣest) are determined by means of the glideslope and flare complementary filters given by Equations 10 and 11. The hcmd input to the glideslope and flare controller is provided by Equations 12 and 13. Finally, the wind disturbance functions for Equations 7 (i.e., ug and wg) are generated by means of the average horizontal speed given by Equation 14 and the Dryden spectra turbulence calculations of Equations 15.
A sample scenario for the generation of a trajectory involves a descent along a
glideslope from an initial altitude of 300 ft followed by flare at 45 ft altitude and
then touchdown. Both nominal (disturbance-free) conditions and logarithmic
altitude shear turbulent conditions should be considered for completeness.
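The simulation loop itself reduces to a standard integration scheme; in the sketch below (ours), `aircraft_derivatives` and `controller` stand in for Equations 7 through 15 and the controllers of Figures 7 through 9:

```python
import numpy as np

def simulate(aircraft_derivatives, controller, x0, dt=0.02, t_end=60.0):
    """Close the loop and integrate the aircraft state forward with a
    fixed step; a Runge-Kutta step could replace the Euler step."""
    x, t = np.asarray(x0, dtype=float), 0.0
    trajectory = [x.copy()]
    while t < t_end:
        u = controller(x, t)                        # e.g., theta_cmd, delta_T
        x = x + dt * aircraft_derivatives(x, u, t)  # one Euler step of Eqs. 7
        t += dt
        trajectory.append(x.copy())
    return trajectory
```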
NEURAL NETWORK LEARNING IMPLEMENTATION
In this section, we present our approach for developing control laws for the
aircraft model described in the preceding section and relate it to classical and
modern control theories. We confine our attention to the glideslope and flare and
do not consider the lateral control system. Additionally, we do not model the
complementary filters and assume that their function is perfect. At any rate, in
the classical approach, controllers are often specified as PID devices for both the
glideslope and flare modes. Other than parameter values and an open loop
pitchup command for flare, both of these controllers are constructed similarly.
Recall that Figure 9 illustrates the conventional controller architecture for both
glideslope and flare.
As previously noted, modern control theory suggests that a performance index
for evaluating control laws should first be constructed, and then the control law
should be computed to optimize the performance index. When closed form
solutions are not available, numerical methods for estimating the parameters of a
control law may be developed. Neural network algorithms can actually be seen
as constituting such numerical methods (Narendra and Parthasarathy, 1990;
Bryson and Ho, 1969; Le Cun, 1989). We present here an implementation of a
neural network algorithm to address the aircraft landing problem.
Difference Equations
The state of the aircraft (including stability augmentation and autothrottle) can be
represented by the following eight-dimensional vector:
Xt = [ut αt qt θt xt ht x7t x8t]T. (17)
State variables ut, αt, qt, θt, xt, and ht correspond to the aircraft state variables per se. Variable x7t originates from the autothrottle. Variable x8t computes the integral of the difference between actual altitude h and desired altitude hcmd over the entire trajectory. Alternatively, the state variables x7t and x8t can be considered as being internal to the network controller (see below). The difference equations describing the dynamics of the controlled plant can be written as
Xt+1 = AtXt + BtUt + CDt + Nt, (18)
At = ZtAgs + (1 − Zt)Afl, (19)
Bt = ZtBgs + (1 − Zt)Bfl, (20)
Zt = S((ht − hf)σf), (21)
S(x) = 1/(1 + exp(−x)), (22)
where A represents the plant dynamics and B represents the aircraft response to
the control U. The matrix C is used to compute x8 from the desired state Dt
containing desired altitude, desired altitude rate, and desired ground position, as
obtained from nominal glideslope and flare trajectories. N is the additive noise
computed from the wind model. The matrices Ags, Afl, Bgs, and Bfl are constant.
The variable Zt generates a smooth transition between glideslope and flare dy
namics and makes the cost function J differentiable over the whole trajectory.
The switching controller described in Section 2 can be written as
Ut = PtT Lt, where Pt = S([VXt + q]σ) and Lt = W[Xt − RDt] + r, (23)
where the function S(x) is the logistic function 1/(1 + exp (–x)) taken over each
element of the vector x and σ is an associated slope parameter. The weight matrix
V links actual altitude h to each switch unit in Pt (the switch is static). The weight
matrix W links altitude error, altitude rate error, and altitude integral error to each
linear unit in Lt.
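Equation 23 is compact enough to transcribe directly; the following sketch (ours, with shapes chosen for illustration) computes one control step:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def control_step(X, D, V, q, W, R, r, slope=1.0):
    """Equation 23: U = P^T L. The switch units P gate (select or
    blend) the linear control laws L formed from the state error."""
    P = sigmoid((V @ X + q) * slope)  # one sigmoidal switch per block
    L = W @ (X - R @ D) + r           # one linear (PID-like) law per block
    return float(P @ L)               # blended scalar control signal
```

With the two basic controller blocks used here, V, q, W, and r each have two rows (or entries), one per block.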
Figure 14. Network implementation of the plant, controller, and desired-trajectory dynamics; cost terms JE and JP (the damping judge) evaluate the generated trajectory.
Figure 14 shows a network implementation of these equations, where each unit computes a linear mapping of ht − hcmdt, x8t, and ḣt − ḣcmdt. Thus, the network
controller forms a PID dynamic weighting of the altitude error ht – hcmdt.
Initially, we chose two basic controller blocks (see Figure 3) to represent
glideslope and flare. The sigmoidal selection for each block is based on altitude
alone. In this sense, we use our a priori knowledge of the physics of the plant to
adapt the complexity of the network controller to the control task. The task of the
network is then to learn the state-dependent PID controller gains that optimize a
cost function in given environmental conditions.
with respect to V, r, W, and q. Note that the cost function E[J] assigns a cost to a particular control law parameterized by V, r, W, and q using knowledge of the stochastic plant model described by the distribution p(X1, . . . , XT). Note that the quadratic cost function J is parametrized by an arbitrary choice of the parameters ah, aḣ (we used ah = aḣ = 1).
When σ is large, the slope of the sigmoid S in Equation 23 becomes large and the associated switching response of the units P becomes sharp. This solution is equivalent to dividing the entire trajectory into a set of linear plants and finding the optimal linear control law for each individual linear plant. If σ is of moderate magnitude (e.g., σ ≈ 1), nonlinear switching and blending of linear control laws
via the sigmoidal switching unit is a priori permitted by the architecture. One of
the main points of our approach resides in this switching/blending solution
(through learning of V and q) as a possible minimum of the optimized cost
function.
Equations 18 and 23 describing plant and controller dynamics can be repre
sented in a network, as well as the desired trajectory dynamics (see Figure 14).
Note that the resulting network is composed of six learnable weights for each
basic controller block: three PID weights plus one bias weight (W weights) and
two switch unit weights (V weights). Actual and desired state vectors at time t + 1
are fed back to the input layers. Thus, with recurrent connections between output
and input layers, the network generates entire trajectories and can be seen as a
recurrent back-propagation network (Rumelhart, Hinton & Williams, 1986;
Nguyen & Widrow, 1990; Jordan & Jacobs, 1990). The network is then trained
using the back-propagation algorithm given wind distributions.
Performance Constraints
As previously noted, the optimization procedure can also depend on performance
constraints. The example we use involves relative stability constraints. As will be
seen in the fifth section, the unconstrained solution shows lightly damped aircraft
responses during glideslope. Here, a penalty on "bad" damping is formulated.
In order to perform a stability analysis, the controlled autoland system is
structured as shown in Figure 15. The aircraft altitude responds to values of θcmd
which are generated by the controller. The controller, whether conventional or
neural network, can be represented as a PID operation.
The aircraft response denoted in Figure 15 (h in response to θcmd) consists of
the dynamics and kinematics of the airframe along with the pitch stability aug
mentation system and the autothrottle. This response can be represented as a set
of differential equations or, more conveniently, as a Laplace transform transfer
function. Equations 25 and 26 provide the transfer functions during glideslope
and flare.
Glideslope:

$$F(s) = \frac{0.409736\,(s + 3.054256)(s - 2.288485)(s + 0.299794)(s + 0.154471)}{s(s^2 + 2.216836s + 2.407746)(s^2 + 0.673172s + 0.114482)(s + 0.091075)} \tag{25}$$

Flare:

$$F(s) = \frac{1.671273\,(s + 3.049899)(s - 2.294112)(s + 0.561642)(s + 0.123544)}{s(s^2 + 3.498162s + 6.184131)(s^2 + 1.057216s + 0.280124)(s + 0.103319)} \tag{26}$$
In Equations 25 and 26, note the appearance of the short period and the
damped phugoid responses (the two quadratic terms in the denominators of the
transfer functions). Note also that there is a real positive term in the numerators,
leading to stability concerns since the closed-loop roots (eigenvalues) could have
positive real parts for some gain values. A positive eigenvalue means instability.
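To see the damping of these modes numerically, one can expand the denominator of Equation 25 and inspect its roots (a numpy sketch of ours, not part of the original analysis):

```python
import numpy as np

# Denominator of the glideslope transfer function F(s) in Equation 25,
# expanded from its factored form.
den = np.polymul([1.0, 0.0],                    # free integrator, s
      np.polymul([1.0, 2.216836, 2.407746],     # short-period pair
      np.polymul([1.0, 0.673172, 0.114482],     # phugoid pair
                 [1.0, 0.091075])))
for p in np.roots(den):
    p = complex(p)
    wn = abs(p)                                  # natural frequency
    zeta = -p.real / wn if wn > 1e-12 else 1.0   # damping ratio
    print(f"pole {p:.4f}  damping ratio {zeta:.3f}")
```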
A schematic of a two-unit neural network controller is shown in Figure 16.
Closed-loop transfer function: h(s)/hcmd(s) = G(s)F(s)/[1 + G(s)F(s)].
Figure 15. Closed-loop autoland system.
Figure 16. Two-unit neural network controller. Each controller block forms a weighted sum of bias (w0i), proportional (w1i), integral (w2i), and derivative (w3i) terms of the altitude error hcmd − h, multiplied by the output of a sigmoidal switch unit with bias v0i and altitude weight v1i; the outputs of the two blocks are summed to produce θcmd.
In function, the network can be viewed as performing a PID operation on the altitude error hcmd − h. However, the gains on each term are determined by altitude through the sigmoidal switch. Thus, for a particular frozen altitude point along the aircraft flight trajectory, the controller can be represented by the transfer function of Equation 27. Equation 27 also represents the function of the conventional controller.