
BACKPROPAGATION

Theory, Architectures, and Applications

DEVELOPMENTS IN CONNECTIONIST THEORY
David E. Rumelhart, Editor

Gluck/Rumelhart • Neuroscience and Connectionist Theory

Ramsey/Stich/Rumelhart • Philosophy and Connectionist Theory

Chauvin/Rumelhart • Backpropagation: Theory, Architectures, and Applications

BACKPROPAGATION

Theory, Architectures, and Applications

Edited by
Yves Chauvin
Stanford University
and Net-ID, Inc.
David E. Rumelhart
Department of Psychology
Stanford University

LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS
1995    Hillsdale, New Jersey    Hove, UK

Copyright © 1995 by Lawrence Erlbaum Associates, Inc.
All rights reserved. No part of this book may be reproduced in
any form, by photostat, microform, retrieval system, or any other
means, without the prior written permission of the publisher.

Lawrence Erlbaum Associates, Inc., Publishers
365 Broadway
Hillsdale, New Jersey 07642

Library of Congress Cataloging-in-Publication Data

Backpropagation : theory, architectures, and applications / edited by Yves Chauvin and David E.
Rumelhart.
p. cm.
Includes bibliographical references and index.
ISBN 0-8058-1258-X (alk. paper). — ISBN 0-8058-1259-8 (pbk. : alk. paper)
1. Backpropagation (Artificial intelligence) I. Chauvin, Yves, Ph. D. II. Rumelhart,
David E.
Q327.78.B33 1994
006.3—dc20 94-24248
CIP

Books published by Lawrence Erlbaum Associates are printed on acid-free paper, and their
bindings are chosen for strength and durability.

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

Contents

Preface vii

1. Backpropagation: The Basic Theory 1
   David E. Rumelhart, Richard Durbin, Richard Golden, and Yves Chauvin

2. Phoneme Recognition Using Time-Delay Neural Networks 35
   Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J. Lang

3. Automated Aircraft Flare and Touchdown Control Using Neural Networks 63
   Charles Schley, Yves Chauvin, and Van Henkle

4. Recurrent Backpropagation Networks 99
   Fernando J. Pineda

5. A Focused Backpropagation Algorithm for Temporal Pattern Recognition 137
   Michael C. Mozer

6. Nonlinear Control with Neural Networks 171
   Derrick H. Nguyen and Bernard Widrow

7. Forward Models: Supervised Learning with a Distal Teacher 189
   Michael I. Jordan and David E. Rumelhart

8. Backpropagation: Some Comments and Variations 237
   Stephen Jose Hanson

9. Graded State Machines: The Representation of Temporal Contingencies in Feedback Networks 273
   Axel Cleeremans, David Servan-Schreiber, and James L. McClelland

10. Spatial Coherence as an Internal Teacher for a Neural Network 313
    Suzanna Becker and Geoffrey E. Hinton

11. Connectionist Modeling and Control of Finite State Systems Given Partial State Information 351
    Jonathan R. Bachrach and Michael C. Mozer

12. Backpropagation and Unsupervised Learning in Linear Networks 389
    Pierre Baldi, Yves Chauvin, and Kurt Hornik

13. Gradient-Based Learning Algorithms for Recurrent Networks and Their Computational Complexity 433
    Ronald J. Williams and David Zipser

14. When Neural Networks Play Sherlock Holmes 487
    Pierre Baldi and Yves Chauvin

15. Gradient Descent Learning Algorithms: A Unified Perspective 509
    Pierre Baldi

Author Index 543

Subject Index 549

Preface

Almost ten years have passed since the publication of the now classic
volumes Parallel Distributed Processing: Explorations in the Microstruc-
ture of Cognition. These volumes marked a renewal in the study of brain-in-
spired computations as models of human cognition. Since the publication of
these two volumes, thousands of scientists and engineers have joined the
study of Artificial Neural Networks (or Parallel Distributed Processing) to
attempt to respond to three fundamental questions: (1) how does the brain
work? (2) how does the mind work? (3) how could we design machines with
equivalent or greater capabilities than biological (including human) brains?
Progress in the last 10 years has given us a better grasp of the complexity
of these three problems. Although connectionist neural networks have shed
a feeble light on the first question, it has become clear that biological
neurons and computations are more complex than their metaphorical con-
nectionist equivalent by several orders of magnitude. Connectionist models
of various brain areas, such as the hippocampus, the cerebellum, the olfac-
tory bulb, or the visual and auditory cortices have certainly helped our
understanding of their functions and internal mechanisms. But by and large,
the biological metaphor has remained a metaphor. And neurons and synapses
still remain much more mysterious than hidden units and weights.
Artificial neural networks have inspired not only biologists but also
psychologists, perhaps more directly interested in the second question. Al-
though the need for brain-inspired computations as models of the workings
of the mind is still controversial, PDP models have been successfully used to
model a number of behavioral observations in cognitive, and more rarely,
clinical or social psychology. Most of the results are based on models of
perception, language, memory, learning, categorization, and control. These
results, however, cannot pretend to represent the beginning of a general
understanding of the human psyche. First, only a small fraction of the large
quantity of data amassed by experimental psychologists has been examined
by neural network researchers. Second, some higher levels of human cogni-
tion, such as problem solving, judgment, reasoning, or decision making
rarely have been addressed by the connectionist community. Third, most
models of experimental data remain qualitative and limited in scope: No
general connectionist theory has been proposed to link the various aspects of
cognitive processes into a general computational framework. Overall, the
possibility of an artificial machine that could learn how to function in the
world with a reasonable amount of intelligence, communication, or "com-
mon sense" remains far away from our current state of knowledge.
It is perhaps on the third problem, the design of artificial learning sys-
tems, expert in specific tasks, that connectionist approaches have made their
best contribution. Such models have had an impact in many different dis-
ciplines, most of them represented in this volume. This trend is in part the
result of advances in computer, communication, and data acquisition tech-
nologies. As databases of information are becoming ubiquitous in many
fields, corresponding accurate models of the data-generating process are
often unavailable. It is in these areas that machine learning approaches are
making their greatest impact. And it is here that connectionist approaches are
beneficially interbreeding with several other related disciplines such as
statistical mechanics, statistical pattern recognition, signal processing, sta-
tistical inference, and information and decision theory.
It may be seen as somewhat of a disappointment to the great excitement
of the late 1980s that the idea of "intelligent general learning systems" has
to yield to local, specialized, often handcrafted neural networks with limited
generalization capabilities. But it is also interesting to realize that prior
domain knowledge needs to be introduced to constrain network architectures
and statistical performance measures if these networks are to learn and
generalize. With hindsight, this realization certainly appears to be a sign of
maturity in the field.
The most influential piece of work in the PDP volumes was certainly
Chapter 8, "Learning Interal Representations by Error Propagation." Since
the original publication of the PDP volumes, the back propagation algorithm
has been implemented in many different forms by many different researchers
in different fields. The algorithm showed that complex mappings between
input and target patterns could be learned in an elegant and practical way by
non-linear connectionist networks. It also overcame many limitations asso-
ciated with neural network learning algorithms of the previous generation,
such as the perceptron algorithm. At the same time, the back-propagation
algorithm includes the basic ingredients of the general connectionist recipe:
local computations, global optimization, and parallel operation. But most
interestingly, the algorithm showed that input-output mappings could be
created during learning by the discovery of internal representations of the
training data. These representations were sometimes clever, nontrivial, and
not originally intended or even imagined by the human designer of the
back-propagation network architectures. In the 1960s and 1970s, the cogni-
tive psychology revolution was partially triggered by the realization that
such internal representations were necessary to explain intelligent behavior
beyond the scope of stimulus-response theory. The internal representations
learned by the back-propagation algorithm had an "intelligent flavor" that
was difficult for artificial intelligence researchers to ignore. Altogether,
these features contributed to the success of back propagation as a versatile
tool for computer modellers, engineers, and cognitive scientists in general.

The present volume can be seen as a progress report on the third problem,
achieved through a deeper exploration of the back-propagation algorithm.
The volume contains a variety of new articles that represent a global per-
spective on the algorithm and show new practical applications. We have also
included a small number of articles that appeared over the last few years and
had an impact on our understanding of the back-propagation mechanism.
The chapters distinguish the theory of back propagation from architectures
and applications. The theoretical chapters relate back-propagation principles
to statistics, pattern recognition, and dynamical system theory. They show
that back-propagation networks can be viewed as nonparametric, non-
linear, structured statistical models. The architectures and applications
chapters then show successful implementations of the algorithm for speech
processing, fingerprint recognition, process control, etc.
We intend this volume to be useful not only to students in the field of
artificial neural networks, but also to professionals who are looking for
concrete applications of machine learning systems in general and of the
back-propagation algorithm in particular. From the theory section, readers
should be able to relate neural networks to their own background in physics,
statistics, information or control theory. From the examples, they should be
able to generalize the design principles and construct their own architectures
optimally adapted to their problems, from medical diagnosis to financial
prediction to protein analysis.
Considering our current stage of knowledge, there is still a lot of terrain
to be explored in the back-propagation landscape. The future of back propa-
gation and of related machine learning techniques resides in their effective-
ness as practical solutions to real world problems. The recent creation of
start-up companies with core technologies based on these mechanisms shows
that the engineering world is paying attention to the computational advan-
tages of these algorithms. Success in the competition for cost effective
solutions to real-world problems will probably determine if back-propaga-
tion learning techniques are mature enough to survive. We believe it will be
the case.

Yves Chauvin and David E. Rumelhart

Acknowledgments

It would probably take another volume just to thank all the people who
contributed to the existence of this volume. The first editor would like to
mention two of them: Marie-Thérèse and René Chauvin.


Editors' note: Recent usage of the term "backpropagation" in neural networks
research appears to favor treating it as one word. While the editors acknowledge
this trend with the title of this book, we also respect the fact that different
researchers in different disciplines have tended to handle the term differently.
Therefore, for the purposes of this volume we have respectfully allowed the
contributing authors free license to their own preferred usage.

BACKPROPAGATION

Theory, Architectures, and Applications

1 Backpropagation:
The Basic Theory

David E. Rumelhart
Richard Durbin
Richard Golden
Yves Chauvin
Department of Psychology, Stanford University

INTRODUCTION

Since the publication of the PDP volumes in 1986,¹ learning by backpropagation
has become the most popular method of training neural networks. The reason for
the popularity is the underlying simplicity and relative power of the algorithm.
Its power derives from the fact that, unlike its precursors, the perceptron learning
rule and the Widrow-Hoff learning rule, it can be employed for training nonlinear
networks of arbitrary connectivity. Since such networks are often required for
real-world applications, such a learning procedure is critical. Nearly as important
as its power in explaining its popularity is its simplicity. The basic idea is old and
simple; namely define an error function and use hill climbing (or gradient descent
if you prefer going downhill) to find a set of weights which optimize perfor-
mance on a particular task. The algorithm is so simple that it can be implemented
in a few lines of code, and there have been no doubt many thousands of imple-
mentations of the algorithm by now.
The name back propagation actually comes from the term employed by
Rosenblatt (1962) for his attempt to generalize the perceptron learning algorithm
to the multilayer case. There were many attempts to generalize the perceptron
learning procedure to multiple layers during the 1960s and 1970s, but none of
them were especially successful. There appear to have been at least three inde-
pendent inventions of the modern version of the back-propagation algorithm:
Paul Werbos developed the basic idea in 1974 in a Ph.D. dissertation entitled

¹Parallel distributed processing: Explorations in the microstructure of cognition. Two volumes by
Rumelhart, McClelland, and the PDP Research Group.


"Beyond Regression," and David Parker and David Rumelhart apparently devel-
oped the idea at about the same time in the spring of 1982. It was, however, not
until the publication of the paper by Rumelhart, Hinton, and Williams in 1986
explaining the idea and showing a number of applications that it reached the field
of neural networks and connectionist artificial intelligence and was taken up by a
large number of researchers.
Although the basic character of the back-propagation algorithm was laid out
in the Rumelhart, Hinton, and Williams paper, we have learned a good deal more
about how to use the algorithm and about its general properties. In this chapter
we develop the basic theory and show how it applies in the development of new
network architectures.
We will begin our analysis with the simplest cases, namely that of the feedfor-
ward network. The pattern of connectivity may be arbitrary (i.e., there need not
be a notion of a layered network), but for our present analysis we will eliminate
cycles. An example of such a network is illustrated in Figure 1.²
For simplicity, we will also begin with a consideration of a training set which
consists of a set of ordered pairs {(x_i, d_i)}, where we understand each pair to
represent an observation in which outcome d occurred in the context of event x.
The goal of the network is to learn the relationship between x and d. It is useful to
imagine that there is some unknown function relating x to d, and we are trying to
find a good approximation to this function. There are, of course, many standard
methods of function approximation. Perhaps the simplest is linear regression. In
that case, we seek the best linear approximation to the underlying function. Since
multilayer networks are typically nonlinear it is often useful to understand feed-
forward networks as performing a kind of nonlinear regression. Many of the
issues that come up in ordinary linear regression also are relevant to the kind of
nonlinear regression performed by our networks.
One important example comes up in the case of "overfitting." We may have
too many predictor variables (or degrees of freedom) and too little training data.
In this case, it is possible to do a great job of "learning" the data but a poor job of
generalizing to new data. The ultimate measure of success is not how closely we
approximate the training data, but how well we account for as yet unseen cases.
It is possible for a sufficiently large network to merely "memorize" the training
data. We say that the network has truly "learned" the function when it performs
well on unseen cases. Figure 2 illustrates a typical case in which accounting
exactly for noisy observed data can lead to worse performance on the new data.
Combating this "overfitting" problem is a major problem for complex networks
with many weights.
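The effect is easy to reproduce numerically. The short sketch below (our
illustration, not from the chapter) fits both a smooth low-order polynomial and
a high-order polynomial that passes almost exactly through every training
point, then compares their errors on held-out data.

import numpy as np

rng = np.random.default_rng(1)
f = np.sin                                     # the "true" underlying function

x_train = rng.uniform(-3, 3, 10)
d_train = f(x_train) + 0.2 * rng.normal(size=10)   # noisy observations
x_test = np.linspace(-3, 3, 200)

for degree in (2, 9):
    coeffs = np.polyfit(x_train, d_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - d_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - f(x_test)) ** 2)
    print(degree, train_err, test_err)

# The degree-9 fit drives the training error to (nearly) zero but typically
# generalizes worse than the smoother degree-2 fit, as in Figure 2.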
Given the interpretation of feedforward networks as a kind of nonlinear re-
gression, it may be useful to ask what features the networks have which might

²As we indicate later, the same analysis can be applied to networks with cycles (recurrent
networks), but it is easiest to understand in the simpler case.

[Figure 1 shows input units X at the bottom, a layer of hidden units, and
output units Y at the top.]

Figure 1. A simple three-layer network. The key to the effectiveness
of the multilayer network is that the hidden units learn to represent the
input variables in a task-dependent way.

Figure 2. Even though the oscillating line passes directly through all
of the data points, the smooth line would probably be the better pre-
dictor if the data were noisy.

give them an advantage over other methods. For these purposes it is useful to
compare the simple feedforward network with one hidden layer to the method of
polynomial regression. In the case of polynomial regression we imagine that we
transform the input variables into a large number of variables by adding a
number of the cross terms x_1x_2, x_1x_3, . . . , x_1x_2x_3, x_1x_2x_4, . . . . We can also add
terms with higher powers x_1², x_1³, . . . , as well as cross terms with higher powers.
In doing this we can, of course, approximate any output surface we please. Given
that we can produce any output surface with a simple polynomial regression
model, why should we want to use a multilayer network? The structures of these
two networks are shown in Figure 3.

We might suppose that the feedforward network would have an advantage in
that it might be able to represent a larger function space with fewer parameters.
This does not appear to be true. Roughly, it seems to be that the "capacity" of
both networks is proportional to the number of parameters in the network (cf.
Cover, 1965; Mitchison & Durbin, 1989). The real difference is in the different
kinds of constraints the two representations impose. Notice that for the poly-
nomial network the number of possible terms grows rapidly with the size of the
input vector. It is not, in general, possible even to use all of the first-order cross
terms, since there are n(n + 1)/2 of them. Thus, we need to be able to select that
subset of input variables that are most relevant, which often means selecting the
lower-order cross terms and thereby representing only the pairwise or, perhaps,
three-way interactions.

[Figure 3 shows (a) a multilayer network over inputs x_1, x_2, . . . , x_n and (b) a
polynomial network over the same inputs together with cross terms x_1x_2, x_1x_3, . . . .]

Figure 3. Two networks designed for nonlinear regression problems.
The multilayer network has a set of hidden units designed to discover
a "low-order" representation of the input variables. In the polynomial
network the number of terms expands exponentially.

In layered networks the constraints are very different. Rather than limiting the
order of the interactions, we limit only the number of interactions and let the
network select the appropriate combinations of units. In many real-world situa-
tions the representation of the signal in physical terms (for example, in terms of
the pixels of an image or the acoustic representation of a speech signal) may
require looking at the relationships among many input variables at a time, but
there may exist a description in terms of relatively few variables if only we
knew what they were. The idea is that the multilayer network is trying to find a
low-order representation (a few hidden units), but that representation itself is, in
general, a nonlinear function of the physical input variables which allows for the
interactions of many terms.

[Figure 4 plots Theory Richness against Data Richness, with regions labeled
First Principles Models, Expert Systems, and Neural Networks and Related
Statistical Models.]

Figure 4. Neural networks and back propagation can be of the most
value for a problem relatively poor in theory and relatively rich in data.

Before we turn to the substantive issues of this chapter, it is useful to ask for
what kinds of applications neural networks would be best suited. Figure 4 pro-
vides a framework for understanding these issues. The figure has two dimen-
sions, "Theory Richness" and "Data Richness." The basic idea is that different
kinds of systems are appropriate for different kinds of problems. If we have a
good theory it is often possible to develop a specific "physical model" to describe
the phenomena. Such a "first-principles" model is especially valuable when we
have little data. Sometimes we are "theory poor" and also "data poor." In such a
case, a good model may be best determined by asking "experts" in a field
and, on the basis of their understanding, devising an "expert system." The cases
where networks are particularly useful are domains where we have lots of data
(so we can train a complex network) but not much theory, so we cannot build a
first-principles model. Note that when a situation gets sufficiently complex and
we have enough data, it may be that so many approximations to the first princi-
ples models are required that in spite of a good deal of theoretical understanding
better models can be constructed through learning than by application of our
theoretical models.

SOME PRELIMINARY CONSIDERATIONS

There are three major issues we must address when considering networks such as
these. These are:

1. The representation problem. What is the representational capacity of
   networks of this sort? How must the size of the network grow as the
   complexity of the function we are attempting to approximate grows?

2. The learning problem. Given that a function can be approximated reason-
   ably closely by the network, can the function be learned by the network?
   How does the training time scale with the size of the network and the
   complexity of the problem?

3. The generalization problem. Given a network which has learned the train-
   ing set, how certain can we be of its performance on new cases? How must
   the size of the data set grow as the complexity of the to-be-approximated
   function grows? What strategies can be employed for improving general-
   ization?

Representation
The original critique by Minsky and Papert was primarily concerned with the
representational capacity of the perceptron. They showed (among other things)
that certain functions were simply not representable with single-layer per-
ceptrons. It has been shown that multilayered networks do not have these limita-
tions. In particular, we now know that with enough hidden units essentially any
function can be approximated as closely as desired (cf. Hornik et al., 1989).
There still is a question about the way the size of the network must scale with the
complexity of the function to be approximated. There are results which indicate
that smooth, continuous functions require, in general, simpler networks than
functions with discontinuities.


Learning
Although there are results that indicate that the general learning problem is
extremely difficult—certain representable functions may not be learnable at
all—empirical results indicate that the "learning" problem is much easier than
expected. Most real-world problems seem to be learnable in a reasonable time.
Moreover, learning normally seems to scale linearly; that is, as the size of real
problems increases, the training time seems to go up linearly (i.e., it scales with
the number of patterns in the training set). Note that these results were something
of a surprise. Much of the early work with the back-propagation algorithm was
done with artificial problems, and there was some concern about the time that
some problems, such as the parity problem, required. It now appears that these
results were unduly pessimistic. It is rare that more than 100 passes through the
training set are required.

Generalization
Whereas the learning problem has turned out to be simpler than expected, the
generalization problem has turned out to be more difficult than expected. It
appears to be possible to easily build networks capable of learning fairly large
data sets. Learning a data set turns out to be little guarantee of being able to
generalize to new cases. Much of the most important work during recent years
has been focused on the development of methods to attempt to optimize general-
ization rather than just the learning of the training set.

A PROBABILISTIC MODEL FOR
BACK-PROPAGATION NETWORKS

The goal of the analysis which follows is to develop a theoretical framework
which will allow for the development of appropriate networks for appropriate
problems while optimizing generalization. The back-propagation algorithm in-
volves specifying a cost function and then modifying the weights iteratively
according to the gradient of the cost function. In this section we develop a
rationale for an appropriate cost function. We propose that the goal is to find that
network which is the most likely explanation of the observed data sequence. We
can express this as trying to maximize the term
P(D|N)P(N)
P(D|N) =
P(D) ,
where N represents the network (with all of the weights and biases specified), D
represents the observed data, and P(D|N) is the probability that the network N
would have produced the observed data D. Now since sums are easier to work

with than products, we will maximize the log of this probability. Since the log is
a monotonic transformation, maximizing the log is equivalent to maximizing the
probability itself. In this case we have

ln P(N|D) = ln P(D|N) + ln P(N) − ln P(D).

Finally, since the probability of the data is not dependent on the network, it is
sufficient to maximize ln P(D|N) + ln P(N).

Now, it is useful to understand the meaning of these two terms. The first term
represents the probability of the data given the network; that is, it is a measure of
how well the network accounts for the data. The second term is a representation
of the probability of the network itself; that is, it is a prior probability or a prior
constraint on the network. Although it is often difficult to specify the prior, doing
so is an important way of inserting knowledge into the learning procedure. More
will be said about this later. For the time being, however, we focus on the first
term, the performance.

It is useful to begin by noticing that the data can be broken down into a set of
observations, each, we will assume, chosen independently of the others. Thus,
we can write the probability of the data given the network as

ln P(D|N) = ln P({(x_i, d_i)}|N) = ln Π_i P((x_i, d_i)|N) = Σ_i ln P((x_i, d_i)|N).

Note that again this assumption allows us to express the probability of the data
given the network as the sum of terms, each term representing the probability of
a single observation given the network. We can take still another step. We can
break the data into two parts: the outcome d_i and the observed event x_i. We can
write

ln P(D|N) = Σ_i ln P(d_i | x_i Λ N) + Σ_i ln P(x_i).

Now, since we suppose that the event x_i does not depend on the network, the last
term of the equation will not affect the determination of the optimal network.
Therefore, we need only maximize the term Σ_i ln P(d_i | x_i Λ N).

So far we have been very general; the only real assumption made is the
independence of the observed data points. In order to get further, however, we
need to make some specific assumptions, particularly about the relationship
between the output of the network y_i and the observed outcome d_i.
First, we assume that the relationship between y_i and d_i is not deter-
ministic, but that, for any given x_i, there is a distribution of possible values of d_i.
The network, however, is deterministic, so rather than trying to predict the actual
outcome we are only trying to predict the expected value of d_i given x_i. Thus, the
network output y_i is to be interpreted as the mean of the actual observed value.
This is, of course, the standard assumption.

The Gaussian Case

To proceed further, we must specify the form of the distribution of which the
network output is the mean. To decide which distribution is most appropriate, it
is necessary to consider the nature of the outcomes, d. In ordinary linear regres-
sion, there is an underlying assumption that the noise is normally distributed
about the predicted values. In situations in which this is so, a Gaussian proba-
bility distribution is appropriate, even for nonlinear regression problems in which
nonlinear networks are required. We begin our analysis in this simple case.
Under the assumption of normally distributed noise in the observations we can
write

P(d_i | x_i Λ N) = K exp( −Σ_j (y_ij − d_ij)² / 2σ² ),

where K is the normalization term for the Gaussian distribution. Now we take the
log of the probability:

ln P(d_i | x_i Λ N) = ln K − Σ_j (y_ij − d_ij)² / 2σ².

Under the assumption that σ is fixed, we want to maximize the following term,
where l is the function to be maximized:

l = −Σ_i Σ_j (y_ij − d_ij)² / 2σ².

Now we must consider the appropriate transfer functions for the output units.
For the moment, we will consider the case of what we have termed quasi-linear
output units, in which the output is a function of the net input of the unit, where³
the net input is simply a weighted sum of the inputs to the unit. That is, the net
input for unit j, η_j, is given by η_j = Σ_k w_jk h_k + β_j. Thus, we have y_j = F(η_j).⁴
Recall that the back-propagation learning rule is determined by the derivative
of the cost function with respect to the parameters of the network. In this case we
can write

∂l/∂η_j = ((d_ij − y_ij) / σ²) ∂F(η_j)/∂η_j.

This has the form of the difference between the predicted and observed values
divided by the variance of the error term times the derivative of the output

³Note, this is not necessary. The output units could have a variety of forms, but the quasi-linear
class is simple and useful.
⁴Note that η_j itself is a function of the input vector x_i and the weights and biases of the entire
network.


function with respect to its net input. As w e shall see, this is a very general form
w h i c h occurs often.
Now what form should the output function take? It has been conventional to
take it to be a sigmoidal function of its net input, but under the Gaussian
assumption of error, in which the mean can, in principle, take on any real value,
it makes more sense to let F be linear in its net input. Thus, for an assumption of
Gaussian error and linear output functions we get the following very simple form
of the learning rule:

∂l/∂η_j ∝ (d_ij − y_ij).

The change in η_j should be proportional to the difference between the observed
output and its predicted value. This model is frequently appropriate for predic-
tion problems in which the error can reasonably be normally distributed. As we
shall see, classification problems in which the observations are binary are a
different situation and generate a different model.

The Binomial Case

Often we use networks for classification problems—that is, for problems in
which the goal is to provide a binary classification of each input vector for each
of several classes. This class of problems requires a different model. In this case
the outcome vectors normally consist of a sequence of 0's and 1's. The "error"
cannot be normally distributed, but would be expected to be binomially distrib-
uted. In this case, we imagine that each element of y represents the probability
that the corresponding element of the outcome vector d takes on the value 0 or 1.
In this case we can write the probability of the data given the network as

P(d | x Λ N) = Π_j y_j^(d_j) (1 − y_j)^(1−d_j).

The log of the probability is Σ_j d_j ln y_j + (1 − d_j) ln(1 − y_j) and, finally,

l = Σ_i Σ_j d_ij ln y_ij + (1 − d_ij) ln(1 − y_ij).

In the neural network world, this has been called the cross-entropy error term. As
we shall see, this is just one of many such error terms. Now, the derivative of this
function is

∂l/∂η_j = ((d_j − y_j) / (y_j(1 − y_j))) ∂F(η_j)/∂η_j.

Again, the derivative has the same form as before—the difference between the
predicted and observed values divided by the variance (in this case the variance

of the binomial) times the derivative of the transfer function with respect to its
net input.

We must now determine the meaning of the form of the output function. In
this case, we want it to range between 0 and 1, so a sigmoidal function is natural.
Interestingly, we see that if we choose the logistic F(η_j) = 1/(1 + e^(−η_j)), we find
an interesting result. The derivative of the logistic is F(η_j)(1 − F(η_j)) or y_j(1 −
y_j). It happens that this is the variance of the binomial, so it cancels the denomi-
nator in the previous equation, leaving the same simple form as we had for the
Gaussian case:

∂l/∂η_j ∝ (d_j − y_j).

In the work on generalized linear models (cf. McCullagh & Nelder, 1989) such
functions are called linking functions, and they point out that different linking
functions are appropriate for different sorts of problems. It turns out to be useful
to see feedforward networks as a generalization into the nonlinear realm of the
work on generalized linear models. Much of the analysis given in McCullagh
and Nelder applies directly to such networks.
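The cancellation can be checked numerically. In the sketch below (ours), a
finite-difference derivative of the cross-entropy log-likelihood taken through
logistic output units matches d_j − y_j directly:

import numpy as np

rng = np.random.default_rng(3)
d = (rng.random(4) > 0.5).astype(float)        # binary targets

def log_lik(eta):
    y = 1.0 / (1.0 + np.exp(-eta))             # logistic output units
    return np.sum(d * np.log(y) + (1 - d) * np.log(1 - y))

eta = rng.normal(size=4)
y = 1.0 / (1.0 + np.exp(-eta))
eps = 1e-6
numeric = np.array([(log_lik(eta + eps * e) - log_lik(eta - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(d - y, numeric))             # True: the variance term cancels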

The Multinomial Case

In many applications we employ not multiple classification or binary classifica-
tion, but "1-of-n" classification. Here we must employ still another transfer
function. In this case, we choose the normalized exponential output function⁵

F_j(η) = e^(η_j) / Σ_k e^(η_k).

In this case, the d vector consists of exactly one 1 and the remaining digits are
zeros. We can then interpret the output unit y_j, for example, as representing the
probability that the input vector was a member of class j. In this case we can
write the cost function as

l = Σ_i Σ_j d_ij ln( e^(η_j) / Σ_k e^(η_k) )

and, again, after computing the derivative we get

∂l/∂η_j ∝ (d_ij − y_ij).

⁵This is sometimes called the "soft-max" or "Potts" unit. As we shall see, however, it is a simple
generalization of the ordinary sigmoid and has a simple interpretation as representing the posterior
probability of event j out of a set of n possible events.
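The footnote's remark that this unit generalizes the ordinary sigmoid is easy
to confirm (a small check of ours): with two classes, the normalized
exponential reduces to a logistic function of the difference of the two net
inputs.

import numpy as np

eta = np.array([0.7, -0.3])                    # net inputs for two classes
softmax0 = np.exp(eta[0]) / np.exp(eta).sum()
logistic = 1.0 / (1.0 + np.exp(-(eta[0] - eta[1])))
print(np.isclose(softmax0, logistic))          # True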

The General Case

The fact that these cases all end up with essentially the same learning rule in spite
of different models is not accidental. It requires exactly the right choice of output
functions for each class of problem. It turns out that this result will occur
whenever we choose a probability function from the exponential family of proba-
bility distributions. This family includes, in addition to the normal and the
binomial, the gamma distribution, the exponential distribution, the Poisson dis-
tribution, the negative binomial distribution, and most other familiar probability
distributions. The general form of the exponential family of probability distribu-
tions is

P(d | x Λ N) = exp( Σ_i [ (d_i θ − B(θ)) / a(φ) + C(d_i, φ) ] ),

where θ is the "sufficient statistic" of the distribution and is related to the mean of
the distribution, φ is a measure of the overall variance of the distribution, and
B(·), C(·), and a(·) are different for each member of the family. It is beyond the
scope of this chapter to develop the general results of this model.⁶ Suffice it to
say that for all members of the exponential family we get

∂l/∂η_j ∝ (d_j − y_j) / var(y_j).

We then choose as an output function one whose derivative with respect to η is
equal to the variance. For members of the exponential family of probability
distributions we can always do this.

The major point of this analysis is that by using one simple cost function, a
log-likelihood function, and by looking carefully at the problem at hand, we see
that, unlike the original work, in which the squared error criterion was normally
employed in nearly all cases, different cost functions are appropriate for different
cases—prediction, cross-classification, and 1-of-n classification all require dif-
ferent forms of output units. The major advantage of this is not so much that the
squared error criterion is wrong, but that by making specific probability assump-
tions we can get a better understanding of the meaning of the output units. In
particular, we can interpret them as the means of underlying probability distribu-
tions. As we shall show, this understanding allows for the development of rather
sophisticated architecture in a number of cases.

⁶See McCullagh and Nelder (1989, pp. 28-30) for a more complete description.
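For readers who want the omitted step, this is the standard generalized-linear-
model computation (our rendering, following McCullagh and Nelder, not text
from the chapter). With l(θ) = (dθ − B(θ))/a(φ), the mean and variance of d are
y = B′(θ) and var(d) = B″(θ) a(φ), so

∂l/∂θ = (d − B′(θ)) / a(φ) = (d − y) / a(φ).

For an output function y = F(η), the chain rule through θ = (B′)⁻¹(y) gives

∂l/∂η = ((d − y) / a(φ)) · F′(η) / B″(θ) = ((d − y) / var(y)) F′(η),

so choosing F so that F′(η) equals the variance leaves simply ∂l/∂η = d − y.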

SOME EXAMPLES

Before discussing priors and analyzing hidden units, we sketch how to use this
method of analysis to design appropriate learning rules for complex networks.
A Simple Clustering Network

Consider the following problem. Suppose that we wish to build a network which
receives a sequence of data and attempts to cluster the data points into some
predefined number of clusters. The basic structure of the network, illustrated in
Figure 5, consists of a set of input units, one for each element of the data vector
x, a set of hidden "cluster" units, and a single linear output unit. In this case, we
suppose that the hidden units are "Gaussian"; that is, their output values are
given by h_k = K exp( −Σ_j (x_j − w_jk)² / 2σ² ). In this case, the weights w_k can
be viewed as the center of the kth cluster. The parameter σ, constant in this case,
determines the spread of the cluster. We want the output unit to represent the
probability of the data given the network, P(x_i | N).

Now if we assume that the clusters are mutually exclusive and exhaustive we
can write

P(x_i | N) = Σ_k P(x_i | x_i ∈ c_k) P(c_k),

where c_k indexes the kth cluster. For simplicity we can assume that the clusters
are to be equally probable, so P(c_k) = 1/N, where N is the number of clusters.
Now, the probability of the data given the cluster is simply the output of the kth
hidden unit, h_k. Therefore, the value of the output unit is (1/N) Σ_k h_k and the log-
likelihood of the data given the input is

l = ln( (1/N) Σ_k h_k ).

[Figure 5 shows a network with a layer of input units, a layer of cluster units,
and a single output unit.]

Figure 5. A simple network for clustering input vectors.

The derivative of l with respect to the weights is

∂l/∂w_jk = ( (K/N) exp[−Σ_j (x_j − w_jk)²/2σ²] / Σ_k (K/N) exp[−Σ_j (x_j − w_jk)²/2σ²] ) (x_j − w_jk)/σ².

The term in parentheses represents the posterior probability that the correct
cluster is c_k given that the input is a member of one of the clusters. We can call
this posterior probability p_k. We can now see that the learning rule is again very
simple:

∂l/∂w_jk ∝ p_k(x_j − w_jk).

This is a slight modification of the general form already discussed. It is the
difference between the observed value x_j and the estimated mean value w_jk,
weighted by the probability that cluster k was the correct cluster, p_k.

This simple case represents a classic mixture of Gaussians model. We assumed
fixed probabilities per cluster and a fixed variance. It is not difficult to estimate
the probabilities of each cluster and the variance associated with the clusters. It is
also possible to add priors of various kinds. As we will explain, it is possible to
order the clusters and add constraints that nearby clusters ought to have similar
means. In this case, this feedforward network can be used to implement the
elastic network of Durbin and Willshaw (1987) and can, for example, be used to
find a solution to the traveling salesman problem.
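A minimal numerical sketch of this rule (ours; the data, learning rate, and
number of clusters are illustrative assumptions) shows the posterior-weighted
delta rule pulling the cluster centers onto two well-separated blobs of data:

import numpy as np

rng = np.random.default_rng(5)

# Two blobs of 2-D data centered at (-2, -2) and (2, 2).
X = np.vstack([rng.normal(-2, 0.3, (50, 2)), rng.normal(2, 0.3, (50, 2))])
W = rng.normal(0, 1, (2, 2))                   # rows are the cluster centers w_k
sigma, lr = 1.0, 0.1

for epoch in range(50):
    for x in X:
        # Gaussian "cluster" unit activations h_k.
        h = np.exp(-np.sum((x - W) ** 2, axis=1) / (2 * sigma ** 2))
        p = h / h.sum()                        # posterior p_k for each cluster
        W += lr * p[:, None] * (x - W)         # dl/dw_k proportional to p_k (x - w_k)
print(W)                                       # rows approach (-2, -2) and (2, 2)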

Society of Experts

Consider the network proposed by Jacobs, Jordan, Nowlan, and Hinton (1991)
and illustrated in Figure 6. The idea of this network is that instead of having a
single network to solve every problem, we have a set of networks which learn to
subdivide a task and thereby solve it more efficiently and elegantly. The architec-
ture allows for all networks to look at the input units and make their best guess,
but a normalized exponential "gating" network is used to weight the outputs of
the individual networks, providing an overall best guess. The gating network
also looks at the input vector.

[Figure 6 shows a set of expert networks and a relevance (gating) network, all
looking at the input units; the relevance units weight the experts' semi-output
units to produce the weighted output units.]

Figure 6. The society-of-experts network: a relevance network gates
the outputs of a set of expert networks.

We must train both the gating network and the individual "expert" networks.
As before, we wish to maximize the log-likelihood of the data given the network.
The final output of the network is

y_ij = Σ_k r_k y_ijk,

where r_k is the probability estimated by the normalized exponential "relevance"
network that subnetwork k is the correct network for the current input. At first, it
may not be obvious how to train this network. Perhaps we should look at the
difference between the output of the network and the observed outcome and use

that as the error signal to be propagated back through the network. It turns out
that the probabilistic analysis we have been discussing offers a different, more
principled solution. We should, of course, maximize the log-likelihood—the
probability of the data given the network. On the assumption that each input
vector should be processed by one network and that the relevance network
provides the probability that it should be network k, we can write

l = ln P(d_i | x_i Λ N) = ln Σ_k P(d_i | x_i Λ s_k) r_k,

where s_k represents the kth subnet.

We must now make some specific assumptions about the form of P(d_i | x_i Λ
s_k). For concreteness, we assume a Gaussian distribution, but we could have
chosen any of the other probability distributions we have discussed. In this case

l = ln Σ_k K r_k exp( −Σ_j (d_j − y_jk)² / 2σ² ).

We now must compute the derivative of the log-likelihood function with
respect to η_jk for each subnetwork and with respect to η_k for the relevance
network. In the first case we get

∂l/∂η_jk = ( K r_k exp[−Σ_j (d_j − y_jk)²/2σ²] / Σ_i K r_i exp[−Σ_j (d_j − y_ji)²/2σ²] ) (d_j − y_jk)/σ² = p_k (d_j − y_jk)/σ².

Note that this is precisely the same form as for the clustering network. The only
real difference is that the probabilities of each class given the input were indepen-


dent of the input. In this case, the probabilities are input dependent. It is slightly
more difficult to calculate, but it turns out that the derivative for the relevance
units also has the simple form

∂l/∂η_k = p_k − r_k,

the difference between the posterior and the prior probability that subnetwork k is
the correct network.
This example, although somewhat complex, is useful for seeing how we can
use our general theory to determine a learning rule in a case where it might not be
immediately obvious and in which the general idea of just taking the difference
between the output of the network and the target and using that as an error signal
is probably the wrong thing to do. We now turn to one final example.
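The two derived gradients, p_k(d_j − y_jk)/σ² for the experts and p_k − r_k for the
relevance units, can be exercised together in a small sketch (ours, with
deliberately tiny linear experts and a linear relevance layer; all names and
settings are assumptions):

import numpy as np

rng = np.random.default_rng(6)
sigma, lr = 0.5, 0.05

# A piecewise-linear target: a different linear expert suits each half-line.
X = rng.uniform(-1, 1, (200, 1))
D = np.where(X < 0, -2 * X, 3 * X)

We = rng.normal(0, 0.1, 2)                     # two linear experts, y_k = We[k] * x
Wg = rng.normal(0, 0.1, 2)                     # relevance net inputs, eta_k = Wg[k] * x

for epoch in range(300):
    for x, d in zip(X[:, 0], D[:, 0]):
        y_k = We * x                           # expert outputs
        r = np.exp(Wg * x) / np.exp(Wg * x).sum()   # relevance probabilities r_k
        lik = r * np.exp(-(d - y_k) ** 2 / (2 * sigma ** 2))
        p = lik / lik.sum()                    # posterior that expert k produced d
        We += lr * p * (d - y_k) / sigma ** 2 * x   # expert gradient
        Wg += lr * (p - r) * x                 # relevance gradient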

Integrated Segmentation and Recognition Network


A major problem with standard back-propagation algorithms is that they seem to
require carefully segmented and localized input patterns for training. This is a
problem for two reasons: first, it is often a labor-intensive task to provide this
information and, second, the decision as to how to segment often depends on
prior recognition. It is possible, however, to design a network and corresponding
back-propagation learning algorithm in which we simultaneously learn to identi-
fy and segment a pattern.⁷
There are two important aspects to many pattern recognition problems which
we have built directly into our network and learning algorithm. The first is that
the exact location of the pattern, in space or time, is irrelevant to the classifica-
tion of the pattern. It should be recognized as a member of the same class
wherever or whenever it occurs. This suggests that we build translation inde-
pendence directly into our network. The second aspect we wish to build into the
network is that feedback about whether or not a pattern is present is all that
should be required for training. Information about the exact location and relation-
ship to other patterns ought not be required. The target information thus does not
include information about where the patterns occur, but only about whether a
pattern occurs.
We have incorporated two basic tricks into our network design to deal with
these two aspects of the problem. The first is to build the assumption of transla-
tion independence into the network by using local linked receptive fields, and the

⁷The algorithm and network design presented here were first proposed by Rumelhart in a presen-
tation entitled "Learning and generalization in multilayer networks" given at the NATO Advanced
Research Workshop on Neurocomputing, Algorithms, Architecture and Applications held in Les
Arcs, France, in February 1989. The algorithm can be considered a generalization and refinement of
the TDNN network developed by Waibel et al. (1989). A version of the algorithm was first published
in Keeler, Rumelhart, and Loew (1991).

second is to build a fixed "forward model" (cf. Jordan & Rumelhart, 1992) which
translates a location-specific recognition process into a location-independent out-
put value and then is used to back-propagate the nonspecific error signal back
through this fixed network to train the underlying location-specific network. The
following sections show how these features can be realized and provide a ratio-
nale for the exact structure and assumptions of the network. The basic organiza-
tion of the network is illustrated in Figure 7.

[Figure 7 shows, from bottom to top: an input field of up to 60 input strokes, a
layer of hidden units, a layer of position-specific semi-output units p_ij =
p(letter i at location j), and output units p_i = 1 − Π_j (1 − p_ij), which are
compared against the target (for example, the letters d, o, g).]

Figure 7. The basic recognition network. See text for detailed net-
work description.

We designate the stimulus pattern by the vector x. We assume that any charac-
ter may occur in any position. The input features then project to a set of hidden
units which are assumed to abstract hidden features from the input field. These
feature abstraction units are organized into rows, one for each feature type. Each
unit within a row is constrained to have the same pattern of weights as every
other unit in the row. The units are thus simply translated versions of one another.
This is enforced by "linking" the weights of all units in a given row, and

whenever one weight is changed all linked weights are changed. This is the same
trick used by Rumelhart et al. (1986) to solve the so-called T/C problem and by
LeCun et al. (1990) in their work on zip code recognition. We let the activation
value of a hidden unit of type i at location j be a sigmoidal function of its net
input and designate it h_ij. We interpret the activation of hidden unit h_ij as the
probability that hidden feature f_i is present in the input at position j. The hidden
units themselves have the conventional logistic sigmoidal transfer functions.

The hidden units then project onto a set of position-specific letter detection
units. There is a row of position-specific units for each character type. Each unit
in a row receives inputs from the feature units located in the immediate vicinity
of the recognition unit. As with the hidden units, the units in a given row are
translated versions of one another. We designate the unit for detecting character i
at location j as p_ij. We let

p_ij = 1 / (1 + e^(−η_ij)),

where

η_ij = Σ_k w_ik h_kj + β_i

and w_ik is the weight from hidden unit h_kj to the detector p_ij. Note that since the
weights from the hidden unit to the detection units are linked, this same weight
will connect each feature unit in the row with a corresponding detection unit in
the row above. Since we have built translational independence into the structure
of the network, anything we learn about features or characters at any given
location is, through the linking of weights, automatically transferred to every
location.

If we were willing, or able, to carefully segment the input and tell the network
exactly where each character was, we could use a standard training technique to
train the network to recognize any character at any location. However, we are
interested in a training algorithm in which we do not have to provide the network
with specific training information. We are interested in simply telling the net-
work which characters were present in the input, not where each character is. To
implement this idea, we have built an additional network which takes the output
of the p_ij units and computes, through a fixed output network, the probability that
at least one character of a given type is present anywhere in the input field. We do
this by computing the probability that at least one unit of a particular type is on.
This can simply be written as

y_i = 1 − Π_j (1 − p_ij).

Thus, y_i is interpreted as representing directly the probability that character i
occurred at least once in the input field.


Note that exactly the same target would be given for the word "dog" and the word "god." Nevertheless, the network learns to properly localize the units in the pij layer. The reason is, simply, that the individual characters occur in many combinations, and the only way the network can learn to discriminate correctly is to actually detect the particular letter. The localization that occurs in the pij layer depends on each character unit seeing only a small part of the input field and on each unit of type i being constrained to respond in the same way.
Important in the design of the network was an assumption as to the meaning of the individual units in the network. We will show why we make these interpretations and how the learning rule we derive depends on these interpretations.
To begin, we want to interpret each output unit as the probability that at least one of that character is in the input field. Assuming that the letters occurring in a given word are approximately independent of the other letters in the word, we can also assume that the probability of the target vector given the input is

P(d|x) = Πj yj^dj (1 − yj)^(1−dj).

This is obviously an example of the binomial multiclassification model. Therefore, we get the following form of our log-likelihood function:

l = Σj [dj ln yj + (1 − dj) ln(1 − yj)],

where dj equals 1 if character j is presented, and zero otherwise.


Having set up our network and determined a reasonable performance criterion, we straightforwardly compute the derivative of the error function with respect to ηij, the net input into the detection unit pij. We get

∂l/∂ηij = (di − yi) pij / yi.
This is a kind of competitive rule in which the learning is proportional to the relative strength of the activation of the unit at a location in the ith row to the strength of activation in the entire row. This ratio is the conditional probability that the target was at position j under the assumption that the target was, in fact, presented. This convenient interpretation is not accidental. By assigning the output units their probabilistic interpretations and by selecting the appropriate, though unusual, output unit yi = 1 − Πj (1 − pij), we were able to ensure a plausible interpretation and behavior of our character detection units.
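Continuing the sketch above (d is an assumed target vector marking which characters are present in the word), the gradient with respect to the net inputs ηij can be computed directly from this formula:

# d[i] = 1 if character i appears somewhere in the word, else 0
d = np.array([1.0, 0.0, 1.0])

# dl/d(eta[i, j]) = (d[i] - y[i]) * p[i, j] / y[i].
# Within row i the update is proportional to p[i, j] / y[i], so the
# positions carrying most of the row's evidence learn the most.
grad_eta = ((d - y) / y)[:, None] * p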

Concluding Remarks

In this section we have shown three cases in which our ability to provide useful analyses of our networks has given us important insights into network design. It is the general theory that allows us to see what assumptions to make, how to put our networks together, and how to interpret the outputs of the networks. We have, however, attended only to one portion of the problem, namely the measure of performance. Equally important is the other term of our cost function, namely the priors over the networks. We now turn to this issue.

PRIORS

Recall that the general form of the posterior likelihood function is

l = Σi Σj ln P(dij | xi ∧ N) + ln P(N).

In the previous section we focused on the performance term. Now we focus on the priors term. As indicated, the major point of this term is to get information and constraints into the learning procedure. The basic procedure is to modify the parameters of the network based on the derivatives of both terms of the entire cost function, not just the performance term.

W e i g h t Decay

Perhaps the simplest case to understand is the "weight decay" term. In this case we assume that the weights are distributed normally about a zero mean. We can write this term as

ln P(N) = ln exp(−Σij wij² / 2σ²) = −(1/2σ²) Σij wij².

This amounts to a penalty for large weights. The term σ determines how important the small weight constraint is. If σ is large, the penalty term will not be very important. If it is small, then the penalty term will be heavily weighted. The derivative then is given by

∂l/∂wij = −(1/σ²) wij.

Thus, every time we see a new pattern the weights should be modified in two ways: first, they should be modified so as to reduce the overall error (as in the first term of Equation 1); then they should be moved toward zero by an amount proportional to the magnitude of the weight. The term σ² determines the amount of movement that should take place.
Why should we think that the weights should be small and centered around zero? Of course, this could be a bad assumption, but it is one way of limiting the space of possible functions that the network can explore. All things being equal, the network will select a solution with small weights rather than large ones. In linear problems this is often a useful strategy. The addition of this penalty term is known as "ridge regression," a kind of "regularization" term which limits the space of possible solutions to those with smaller weights. This is an important strategy for dealing with the overfitting problem. Weight decay was first proposed for connectionist networks by Geoffrey Hinton.

Weight Elimination

A general strategy for dealing with overfitting involves a simple application of Occam's Razor: of all the networks which will fit the training data, find the simplest. The idea is to use the prior term to measure the complexity of the network and "prefer" simpler networks to more complex ones. But how do we measure the complexity of the network? The basic idea, due to Kolmogorov (cf. Kolmogorov, 1991), is that the complexity of a function is measured by the number of bits required to communicate the function. This is, in general, difficult to measure, but it is possible to find a set of variables which vary monotonically with the complexity of a network. For example, the more weights a network has, the more complex it is; each weight has to be described. The more hidden units a network has, the greater the complexity of the network; the more bits per weight, the more complex the network; the more symmetries there are among the weights, the simpler the network; and so on.
Weigend et al. (1990) proposed a set of priors, each of which led to a reduction in network complexity; of these, the weight elimination procedure has been the most useful. The idea is that the weights are not drawn from a single distribution around zero, as in weight decay; rather, we assume that they are drawn either from a normal distribution centered at zero or from a uniform distribution between, say, ±20. It is possible to express this prior roughly as

P(N) = exp[−Σij (wij/σ1)² / (1 + (wij/σ2)²)],

or, taking the log and multiplying through by the sigmas, as

ln P(N) = −(σ2²/σ1²) Σij wij² / (σ2² + wij²).

The derivative is

∂l/∂wij = −(2σ2⁴/σ1²) wij / (σ2² + wij²)².

Note that this has the property that for small weights (weights for which wij is small relative to σ2), the denominator is approximately constant and the change in weights is simply proportional to the numerator wij, as in weight decay. For large weights (wij large relative to σ2) the change is proportional to 1/w³; in other words, very little change occurs. Thus, this penalty function causes small weights to move toward zero, eliminates them, and leaves large weights alone. This has the effect of removing unneeded weights. In the nonlinear case, there is reason to believe that the weight elimination strategy is a more useful prior than weight decay, since large weights are required to establish the nonlinearities. A number of successful experiments have been carried out using the strategy (cf. Weigend, Huberman, & Rumelhart, 1990, and Weigend, Rumelhart, & Huberman, 1991).
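A minimal sketch of the penalty and its derivative, assuming parameters sigma1 and sigma2 standing for σ1 and σ2 above (the factor of 2 comes from differentiating the quadratic):

import numpy as np

def elimination_log_prior(w, sigma1=1.0, sigma2=1.0):
    # ln P(N) = -(sigma2^2 / sigma1^2) * sum_ij wij^2 / (sigma2^2 + wij^2)
    return -(sigma2**2 / sigma1**2) * np.sum(w**2 / (sigma2**2 + w**2))

def elimination_grad(w, sigma1=1.0, sigma2=1.0):
    # Roughly linear in w for |w| << sigma2 (like weight decay);
    # falls off as 1 / w^3 for |w| >> sigma2 (large weights left alone).
    return -(2 * sigma2**4 / sigma1**2) * w / (sigma2**2 + w**2)**2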
Although similar strategies have been suggested for eliminating unneeded hidden units and for reducing the information content of weights, these have been studied very little. Perhaps the most successful paradigm, however, is a generalization of the weight elimination paradigm which imposes important weight symmetries. This work has been done by Nowlan (1991) and is described here.

Weight Symmetries

In weight decay the idea was to have a prior such that the weight distribution has a zero mean and is normally distributed. The weight elimination paradigm is more general in that it distinguishes two classes of weights, of which one is, like the weight decay case, centered on zero and normally distributed, and the other is uniformly distributed. In weight symmetries there is a small set of normally distributed weight clusters. The problem is to simultaneously estimate the means of the priors and the weights themselves. In this case the priors are

P(N) = Πi Σk exp[−(wi − μk)²/2σk²] P(ck),

where P(ck) is the probability of the kth weight cluster and μk is its center. To determine how the weights are to be changed, we must compute the derivative of the log of this probability. We get

∂l/∂wi = Σk ( P(ck) exp[−(wi − μk)²/2σk²] / Σj P(cj) exp[−(wi − μj)²/2σj²] ) ((μk − wi)/σk²).

We can similarly estimate μk, σk, and P(ck) by gradient methods as well. For example, we write the derivative of the error with respect to μk:

∂l/∂μk = Σj ( P(ck) exp[−(wj − μk)²/2σk²] / Σi P(ci) exp[−(wj − μi)²/2σi²] ) ((wj − μk)/σk²).

By similar methods, it is possible to estimate the other parameters of the network. Nowlan (1991) has shown that these priors go far toward solving the overfitting problem.
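A sketch of the weight gradient under this mixture prior; mu, var, and pi are assumed arrays holding the cluster centers μk, the variances σk², and the mixing probabilities P(ck):

import numpy as np

def mixture_grad_w(w, mu, var, pi):
    # diff[i, k] = w[i] - mu[k]
    diff = w[:, None] - mu[None, :]
    # r[i, k]: responsibility of cluster k for weight i, i.e. the
    # bracketed ratio in the derivative above.
    dens = pi * np.exp(-diff**2 / (2 * var))
    r = dens / dens.sum(axis=1, keepdims=True)
    # d ln P(N) / dw[i] = sum_k r[i, k] * (mu[k] - w[i]) / var[k]
    return (r * -diff / var).sum(axis=1)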


Elastic Network and the Traveling Salesman Problem

Earlier we showed how one could develop a clustering algorithm by using Gaussian hidden units and optimizing the log-likelihood of the data given the network. It turns out that by adding priors to the cost function we can put constraints on the clusters. Imagine that we are trying to solve the traveling salesman problem, in which we are to find the shortest path through a set of cities. In this case, we represent the cities in terms of their (x, y) coordinate values, so that there is a two-dimensional input vector for each city. The method is the same as that proposed by Durbin and Willshaw (1987). There is a set of clusters, each cluster located at some point in a two-dimensional space. We want to move the means of the clusters toward the cities until there is one cluster for each city and adjacent clusters are as close to one another as possible. This provides for the following cost function:

l = Σi ln Σj exp{[−(xi − μx,j)² − (yi − μy,j)²]/2σ²} + ln Πj exp{[−(μx,j − μx,j+1)² − (μy,j − μy,j+1)²]/λ}

  = Σi ln Σj exp{[−(xi − μx,j)² − (yi − μy,j)²]/2σ²} + Σj [−(μx,j − μx,j+1)² − (μy,j − μy,j+1)²]/λ.

Note that the constraint concerning the distance between successive cities is encoded in the prior term by the assumption that adjacent cluster means are near one another.
We next must compute the derivative of the likelihood function with respect to the parameters μx,j and μy,j; by now this should be a familiar form. We can write

∂l/∂μx,k = Σi ( exp{[−(xi − μx,k)² − (yi − μy,k)²]/2σ²} / Σj exp{[−(xi − μx,j)² − (yi − μy,j)²]/2σ²} ) ((xi − μx,k)/σ²) + (1/λ)(μx,k+1 − 2μx,k + μx,k−1).

In order to make this work, we must imagine that the clusters form a ring, so that the cluster before cluster 1 is the last cluster and cluster n + 1 (where we have n clusters) is the same as cluster 1. Now we proceed to solve the traveling salesman problem in the following way. We start out with a rather large value of σ. The cities are presented to the network one at a time, and the weights (i.e., the means μx,k and μy,k) are adjusted until the network stabilizes. At this point it is likely that none of the cluster centers are located at any of the cities. We then decrease σ and present the cities again until the network stabilizes. Then σ is decreased again. This process is repeated until there is a cluster mean located at each city. At this point we can simply follow the cluster means in order and read off the solution to the problem.
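A compact sketch of this annealing procedure; cities and mu are assumed arrays of city coordinates and ring-ordered cluster means, lam stands for λ, and the step sizes are illustrative rather than taken from the text:

import numpy as np

def elastic_net_step(cities, mu, sigma, lam, lr=0.1):
    # Squared distances and soft assignments of cities to means.
    d2 = ((cities[:, None, :] - mu[None, :, :])**2).sum(axis=-1)
    r = np.exp(-d2 / (2 * sigma**2))
    r /= r.sum(axis=1, keepdims=True)
    # Likelihood term: pull each mean toward the cities it explains.
    pull = (r[:, :, None] * (cities[:, None, :] - mu[None, :, :])).sum(axis=0) / sigma**2
    # Prior term: keep neighboring means on the ring close together.
    ring = (np.roll(mu, 1, axis=0) - 2 * mu + np.roll(mu, -1, axis=0)) / lam
    return mu + lr * (pull + ring)

# Annealing: shrink sigma gradually, iterating to stability each time.
# for sigma in np.geomspace(1.0, 0.01, num=50):
#     for _ in range(100):
#         mu = elastic_net_step(cities, mu, sigma, lam=1.0)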

Concluding Comments
In this section we showed how knowledge and constraints can be added to the
network with well-chosen priors. So far the priors have been of two types. (1) We
have used priors to constrain the set of networks explored by the learning algo­
rithm. By adding such "regularization" terms we have been able to design net­
works which provide much better generalization. (2) We have been able to add
further constraints among the network parameter relationships. These constraints
allow us to force the network to a particular set of possible solutions, such as
those which minimize the tour in the traveling salesman problem.
Although not discussed here, it is possible to add knowledge to the network in
another way by expressing priors about the behavior of different parts of the
network. It is possible to formulate priors that, for example, constrain the output
of units on successive presentations to be as similar or as dissimilar as possible to
one another. The general procedure can dramatically affect the solution the
network achieves.

HIDDEN UNITS

Thus far, we have focused our attention on log-likelihood cost functions, appro­
priate interpretation of the output units, and methods of introducing additional
constraints in the network. The final section focuses on the hidden units of the
network.8 There are at least four distinct ways of viewing hidden units.

1. Sigmoidal hidden units can be viewed as approximations to linear threshold functions which divide the space into regions which can then be combined to approximate the desired function.
2. Hidden units may be viewed as a set of basis functions, linear combina­
tions of which can be used to approximate the desired output function.
3. Sigmoidal hidden units can be viewed probabilistically as representing the
probability that certain "hidden features" are present in the input.

8
As an historical note, the term "hidden unit" is used to refer to those units lying between the
input and output layers. The name was coined by Geoffrey Hinton, inspired by the notion of "hidden
states" in hidden Markov models.


4. Layers of hidden units can be viewed as a mechanism for transforming


stimuli from one representation to another from layer to layer until those
stimuli which are functionally similar are near one another in hidden-unit
space.

In the following sections we treat each of these conceptions.

Sigmoidal Units as Continuous Approximations to


Linear Threshold Functions
Perhaps the simplest way to view sigmoidal hidden units is as continuous approx-
imations to the linear threshold function. The best way to understand this is in
terms of a two-dimensional stimulus space populated by stimuli which are la-
beled as members of class A or class B. Figure 8 illustrates such a space. We can
imagine a single output unit which is to classify the input vectors (stimuli) as
being in one or the other of these classes. A simple perceptron will be able to
solve this problem if the stimuli are linearly separable; that is, we can draw a
line which puts all of the A stimuli on one side of the line and all of the B stimuli
on the other side. This is illustrated in part (a) of Figure 8. In part (b) we see that
replacing the sharp line of the threshold function with a "fuzzy" line of the
sigmoid causes little trouble. It tends to lead to a condition in which stimuli near
the border are classified less certainly than those far from the border. This may not
be a bad thing since stimuli near the border may be more ambiguous.
When the stimuli are not linearly separable (as illustrated in panel (c) of the
figure), the problem is more difficult and hidden units are required. In this case,
each hidden unit can be seen as putting down a dividing line segmenting the input
field into regions. Ideally each region will contain stimuli of the same kind. Then
the weights from the hidden-unit layer to the output units are used to combine the
regions which go together to form the final classification. If the stimuli are
binary, or more generally if the regions in which they lie are convex (as they are
in panel (c)), a single layer of hidden threshold units will always be sufficient. If
the space is concave, as illustrated in panel (d), then two layers of threshold units
may be necessary so that the right regions can be combined. It is nevertheless
possible to "approximate" the regions arbitrarily closely with a single hidden
layer if enough hidden units are employed. Figure 9 shows how the problem
illustrated can be solved exactly with two hidden units in the two-layer case and
be approximated arbitrarily closely by many hidden units.

Hidden Units as Basis Functions for


Function Approximation
It is also possible to see the hidden layers as forming a set of "basis functions"
and see the output units as approximating the function through a linear combina-



Figure 8. (a) A simple example of a linearly separable set of points.


Perceptrons are capable of classifying such data sets. (b) How the
same data would be classified by a sigmoid. The density of the dots
indicates the magnitude of the sigmoid. If the problem is really linearly
separable, the weights on the sigmoid can grow and it can act just like
a perceptron. (c) A set of lines can be used to segregate a convex
region. The hidden units put down a set of lines and make space that is
originally not linearly separable into one that is. (d) In a concave space
it might not be possible to find a set of lines which divide the two
regions. In such a case two hidden layers are sometimes convenient.

tion of the hidden units. This is a view espoused by Poggio and Girosi (1989) and others employing the "radial basis function" approach. Typically, this approach involves simply substituting a Gaussian or similar radially symmetric function for the conventional sigmoidal hidden units. Of course, there is no limit to the kind of transfer function the hidden units might employ. The only real constraint (as far as back propagation is concerned) is that the functions are differentiable in their inputs and parameters. So long as this is true, any hidden-unit type is possible. Certain unit types may have advantages over others, however. Among the important considerations are the problems of local minima.


Figure 9. The first layer works by putting in the vertical and horizon­
tal lines and moving the points to the corners of the region. This
means that at the second level the problem is convex and two further
hidden units divide the space and make it linearly separable.

As we will see, sigmoidal units are somewhat better behaved than many
others with respect to the smoothness of the error surface. Durbin and Rumelhart
(1989), for example, have found that, although "product units" (Fj = Πi xi^wij) are much more powerful than conventional sigmoidal units (in that fewer
parameters were required to represent more functions), it was a much more
difficult space to search and there were more problems with local minima.
Another important consideration is the nature of the extrapolation to data
points outside the local region from which the data were collected. Radial basis
functions have the advantage that they go to zero as you extend beyond the region
where the data were collected. Polynomial units Fj(ηj) = ηj^p are very ill behaved
outside of the training region and for that reason are not especially good choices.
Sigmoids are well behaved outside of their local region in that they saturate and
are constant at 0 or 1 outside of the training region.

Sigmoidal Hidden Units as Representing Hidden


Feature Probabilities
The sigmoidal hidden unit has turned out to be a serendipitous choice. It has a
number of nice properties and interpretations which make it rather useful. The


first property has to do with the learning process itself. As noted from Figure 10,
the sigmoidal unit is roughly linear for small weights (a net input near zero) and
gets increasingly nonlinear in its response as it approaches its points of maximum
curvature on either side of the midpoint. Thus, at the beginning of learning,
when the weights are small, the system is mainly in its linear range and is seeking
an essentially linear solution. As the weights grow, the network becomes increas-
ing nonlinear and begins to move toward the nonlinear solution to the problem.
This property of initial linearity makes the units rather robust and allows the
network to reliably attain the same solution.
Sigmoidal hidden units have a useful interpretation as the posterior probability
of the presence of some feature given the input. To see this, think of a sigmoidal
hidden unit connected directly to the input units. Suppose that the input vectors
are random variables drawn from one of two probability distributions and that the
job of the hidden unit is to determine which of the two distributions is being
observed. The role of the hidden unit is to give an output value equal to the
probability that the input vector was drawn from distribution 1 rather than distri-
bution 2. If drawn from distribution 1 we say that some "hidden feature" was
present; otherwise we say it was absent. Denoting the hidden feature for the jth
hidden unit as fj we have
P(fj = 1|x) = P(x|fj = 1)P(fj = 1) / [P(x|fj = 1)P(fj = 1) + P(x|fj = 0)P(fj = 0)].

Figure 10. The logistic sigmoid is roughly linear near the middle of its
range and reaches its maximum curvature.


Now, on the assumption that the x's are conditionally independent (i.e., if we know which distribution they were drawn from, there is a fixed probability for each input element that it will occur), we can write

P(x|fj = 1) = Πi P(xi|fj = 1) and P(x|fj = 0) = Πi P(xi|fj = 0).

Now, on the further assumption that the x's are binomially distributed, we get

P(x|fj = 1) = Πi pij^xi (1 − pij)^(1−xi),
P(x|fj = 0) = Πi qij^xi (1 − qij)^(1−xi).

So we finally have

P(fj = 1|x) = Πi pij^xi (1 − pij)^(1−xi) P(fj = 1) / [Πi pij^xi (1 − pij)^(1−xi) P(fj = 1) + Πi qij^xi (1 − qij)^(1−xi) P(fj = 0)].

Taking logs and exponentiating gives

P(fj = 1|x)
= exp{Σi [xi ln pij + (1 − xi) ln(1 − pij)] + ln P(fj = 1)} / (exp{Σi [xi ln pij + (1 − xi) ln(1 − pij)] + ln P(fj = 1)} + exp{Σi [xi ln qij + (1 − xi) ln(1 − qij)] + ln P(fj = 0)})
= exp{Σi [xi ln(pij/(1 − pij)) + ln(1 − pij)] + ln P(fj = 1)} / (exp{Σi [xi ln(pij/(1 − pij)) + ln(1 − pij)] + ln P(fj = 1)} + exp{Σi [xi ln(qij/(1 − qij)) + ln(1 − qij)] + ln P(fj = 0)})
= 1 / (1 + exp{−Σi xi ln[pij(1 − qij)/((1 − pij)qij)] − Σi ln[(1 − pij)/(1 − qij)] − ln[P(fj = 1)/P(fj = 0)]}).

Now, it is possible to interpret the exponent as representing ηj, the net input to the unit. If we let

βj = Σi ln[(1 − pij)/(1 − qij)] + ln[P(fj = 1)/P(fj = 0)]

and

wij = ln[pij(1 − qij)/((1 − pij)qij)],

we can see the similarity. We can thus see that the sigmoid is properly understood as representing the posterior probability that some hidden feature is present given

the input. Note that, as before, the binomial assumption is not necessary. It is possible to assume that the underlying input vectors are members of any of the exponential family of probability distributions. It is easy to write the general form

P(fj = 1|x) = exp{Σi [(xiθi − B(θi)) + C(xi, φ)]/a(φ) + ln P(fj = 1)} / (exp{Σi [(xiθi − B(θi)) + C(xi, φ)]/a(φ) + ln P(fj = 1)} + exp{Σi [(xiθi* − B(θi*)) + C(xi, φ)]/a(φ) + ln P(fj = 0)}).

By rearranging terms and dividing through by the numerator, we obtain the simple form

P(fj = 1|x) = ( 1 + exp{−[Σi xi(θi − θi*)/a(φ) + Σi (B(θi*) − B(θi))/a(φ) + ln(P(fj = 1)/P(fj = 0))]} )^(−1)
= 1 / (1 + exp[−(Σi xi wi + βj)]).
Thus, under the assumption that the input variables are drawn from some member of the exponential family and differ only in their means (represented by the θi), the sigmoidal hidden unit can be interpreted as the probability that the hidden feature is present. Note that the very same form is derived whether the underlying distributions are Gaussian, binomial, or any member of the exponential family. It can readily be seen that, whereas the sigmoid represents the two-alternative case, the normalized exponential clearly represents the multialternative case. Thus, we derive the normalized exponential in exactly the same way as we derive the sigmoid:

P(cj = 1|x) = exp{Σi [(xiθij − B(θij)) + C(xi, φ)]/a(φ) + ln P(cj = 1)} / Σk exp{Σi [(xiθik − B(θik)) + C(xi, φ)]/a(φ) + ln P(ck = 1)}

= exp{Σi xiθij/a(φ) − Σi B(θij)/a(φ) + ln P(cj = 1)} / Σk exp{Σi xiθik/a(φ) − Σi B(θik)/a(φ) + ln P(ck = 1)}

= exp(Σi xi wij + βj) / Σk exp(Σi xi wik + βk).
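The two-alternative result can be checked numerically. The sketch below, using assumed binomial parameters p and q and equal priors, verifies that the Bayes posterior equals a sigmoid with the weights and bias given above:

import numpy as np

p = np.array([0.8, 0.3, 0.6])   # P(x_i = 1 | f_j = 1), assumed values
q = np.array([0.2, 0.5, 0.4])   # P(x_i = 1 | f_j = 0), assumed values
x = np.array([1.0, 0.0, 1.0])   # an example input vector

# Bayes posterior with equal priors P(f_j = 1) = P(f_j = 0)
lik1 = np.prod(p**x * (1 - p)**(1 - x))
lik0 = np.prod(q**x * (1 - q)**(1 - x))
posterior = lik1 / (lik1 + lik0)

# Sigmoid with w_i = ln[p(1-q) / ((1-p)q)] and beta = sum ln[(1-p)/(1-q)]
w = np.log(p * (1 - q) / ((1 - p) * q))
beta = np.sum(np.log((1 - p) / (1 - q)))
sigmoid = 1.0 / (1.0 + np.exp(-(x @ w + beta)))

assert np.isclose(posterior, sigmoid)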

Hidden-Unit Layers as Representations of the Input Stimuli

Figure 11 illustrates a very simple connectionist network consisting of two layers of units, the input units and output units, connected by a set of weights. As a result of the particular connectivity and weights of this network, each pattern of activation presented at the input units will induce another specific pattern of activation at the output units. This simple architecture is useful in various ways. If the input and output patterns all use distributed representations (i.e., can all be

Figure 11. Similar inputs produce similar outputs.

described as sets of microfeatures), then this network will exhibit the property
that "similar inputs yield similar outputs," along with the accompanying general-
ization and transfer of learning. Two-layer networks behave this way because the
activation of an output unit is given by a relatively smooth function of the
weighted sum of its inputs. Thus, a slight change in the value of an input unit will
generally yield a similarly slight change in the values of the output units.
Although this similarity-based processing is mostly useful, it does not always
yield the correct generalizations. In particular, in a simple two-layer network, the
similarity metric employed is determined by the nature of the inputs themselves.
And the "physical similarity" we are likely to have at the inputs (based on the
structure of stimuli from the physical world) may not be the best measure of the
"functional" or "psychological" similarity we would like to employ at the output
(to group appropriate similar responses). For example, it is probably true that a
lowercase a is physically less similar to an uppercase A than to a lowercase o, but
functionally and psychologically a and A are more similar to one another than
are the two lowercase letters. Thus, physical relatedness is an inadequate sim-
ilarity metric for modeling human responses to letter-shaped visual inputs. It is
therefore necessary to transform these input patterns from their initial physically
derived format into another representational form in which patterns requiring
similar (output) responses are indeed similar to one another. This involves learn-
ing new representations.
Figure 12 illustrates a layered feedforward network in which information
(activation) flows up from the input units at the bottom through successive layers
of hidden units, to create the final response at the layer of output units on top.
Such a network is useful for illustrating how an appropriate psychological or
functional representation can be created. If we think of each input vector as a
point in some multidimensional space, we can think of the similarity between



Figure 12. We can think of multilayer networks as transforming the


input through a series of successive transformations so as to create a
representation in which "functionally" similar stimuli are near one
another when viewed as points in a multidimensional space.

two such vectors as the distance between their two corresponding points. Further-
more, we can think of the weighted connections from one layer of units to the
next as implementing a transformation that maps each original input vector into
some new vector. This transformation can create a new vector space in which the
relative distances among the points corresponding to the input vectors are differ-
ent from those in the original vector space, essentially rearranging the points.
And if we use a sequence of such transformations, each involving certain non-
linearities, by "stacking" them between successive layers in the network, we can
entirely rearrange the similarity relations among the original input vectors.
Thus, a layered network can be viewed simply as a mechanism for transform-
ing the original set of input stimuli into a new similarity space with a new set of
distances among the input points. For example, it is possible to move the initially
distant "physical" input representations of a and A so that they are very close to
one another in a transformed "psychological" output representation space, and
simultaneously transform the distance between a and o output representations so
that they are rather distant from one another. (Generally, we seek to attain a
representation in the second-to-last layer which is sufficiently transformed that
we can rely on the principle that similar patterns yield similar outputs at the final
layer.) The problem is to find an appropriate sequence of transformations that
accomplish the desired input-to-output change in similarity structures.
The back-propagation learning algorithm can be viewed, then, as a procedure
for discovering such a sequence of transformations. In fact, we can see the role


of learning in general as a mechanism for constructing the transformations which


will convert the original physically based configuration of the input vectors into
an appropriate functional or psychological space, with the proper similarity
relationships between concepts for making generalizations and transfer of learn-
ing occur automatically and correctly.

CONCLUSION

In this chapter we have tried to provide a kind of overview and rationale for the
design and understanding of networks. Although it is possible to design and use
interesting networks without any of the ideas presented here, it is, in our experi-
ence, very valuable to understand networks in terms of these probabilistic inter-
pretations. The value is primarily in providing an understanding of the networks
and their behavior so that one can craft an appropriate network for an appropriate
problem. Although it has been commonplace to view networks as kinds of black
boxes, this leads to inappropriate applications which may fail not because such
networks cannot work but because the issues are not well understood.

REFERENCES

Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Transactions on Electronic Computers, 14, 326-334.
Durbin, R., & Rumelhart, D. E. (1989). Product units: A computationally powerful and biologically plausible extension to backpropagation networks. Neural Computation, 1, 133-142.
Durbin, R., & Willshaw, D. (1987). An analogue approach to the travelling salesman problem using
an elastic net method. Nature, 326, 689-691.
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-366.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., & Hinton, G. E. (1991). Adaptive mixtures of local
experts. Neural Computation, 3(1).
Jordan, M. I., & Rumelhart, D. E. (1992). Forward models: Supervised learning with a distal
teacher. Cognitive Science, 16, pp. 307-354.
Keeler, J. D., Rumelhart, D. E., & Loew, W. (1991). Integrated segmentation and recognition of
hand-printed numerals. In R. P. Lippmann, J. E. Moody, and D. S. Touretzky (Eds.), Neural
information processing systems (Vol. 3). San Mateo, CA: Morgan Kaufmann.
Kolmogorov, A. N. (1991). Selected works of A. N. Kolmogorov. Dordrecht: Kluwer Academic.
Le Cun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1990). Handwritten digit recognition with a back-propagation network. In D. S. Touretzky (Ed.), Advances in neural information processing systems (Vol. 2). San Mateo, CA: Morgan Kaufmann.
McCullagh, P. & Nelder, J. A. (1989). Generalized linear models. London: Chapman and Hall.
Mitchison, G. J., & Durbin, R. M. (1989). Bounds on the learning capacity of some multi-layer
networks. Biological Cybernetics, 60, 345-356.
Nowlan, S. J. (1991). Soft competitive adaptation: Neural network learning algorithm based on fitting statistical mixtures. Ph.D. thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
Parker, D. B. (1982). Learning-logic (Invention Report S81-64, File 1). Stanford, CA: Office of
Technology Licensing, Stanford University.
Poggio, T., & Girosi, F. (1989). A theory of networks for approximation and learning (A.I. Memo No. 1140). Artificial Intelligence Laboratory, Massachusetts Institute of Technology.
Rosenblatt, F. (1962). Principles of neurodynamics. New York: Spartan.
Rumelhart, D. E. (1990). Brain style computation: Learning and generalization. In S. F. Zornetzer,
J. L. Davis, and C. Lau (Eds.), An introduction to neural and electronic networks. San Diego:
Academic Press.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by
error propagation. In D. E. Rumelhart and J. L. McClelland (Eds.), Parallel Distributed Process-
ing: Explorations in the Microstructure of Cognition (Vol. 1). Cambridge, MA: Bradford Books.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech and Signal Processing, 37, 328-338.
Weigend, A. S., Huberman, B. A., & Rumelhart, D. E. (1990). Predicting the future: A connectionist approach. International Journal of Neural Systems, 1, 193-209.
Weigend, A. S., Rumelhart, D. E., & Huberman, B. A. (1991). Generalization by weight-elimination with application to forecasting. In R. P. Lippmann, J. Moody, & D. S. Touretzky (Eds.), Advances in neural information processing (Vol. 3, pp. 875-882). San Mateo, CA: Morgan Kaufmann.
Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in the behavioral
sciences. Unpublished dissertation, Harvard University.

2 Phoneme Recognition Using
Time-Delay Neural Networks*

Alexander Waibel
Computer Science Department, Carnegie Mellon University

Toshiyuki Hanazawa
ATR Interpreting Telephony Research Laboratories, Osaka, Japan

Geoffrey Hinton
University of Toronto

Kiyohiro Shikano
ATR Interpreting Telephony Research Laboratories, Osaka, Japan

Kevin J. Lang
Carnegie Mellon University

ABSTRACT

In this paper we present a Time-Delay Neural Network (TDNN) approach to


phoneme recognition which is characterized by two important properties. 1) Using
a 3 layer arrangement of simple computing units, a hierarchy can be constructed
that allows for the formation of arbitrary nonlinear decision surfaces. The TDNN
learns these decision surfaces automatically using error backpropagation [1]. 2)
The time-delay arrangement enables the network to discover acoustic-phonetic
features and the temporal relationships between them independent of position in
time and hence not blurred by temporal shifts in the input.
As a recognition task, the speaker-dependent recognition of the phonemes "B,"
"D," and "G" in varying phonetic contexts was chosen. For comparison, several
discrete Hidden Markov Models (HMM) were trained to perform the same task.
Performance evaluation over 1946 testing tokens from three speakers showed that
the TDNN achieves a recognition rate of 98.5 percent correct while the rate ob-

*Reprinted from IEEE TRANSACTIONS ON ACOUSTICS, SPEECH AND SIGNAL PRO-


CESSING Vol. 37, No. 3, March 1989.



tained by the best of our HMM's was only 93.7 percent. Closer inspection reveals
that the network "invented" well-known acoustic-phonetic features (e.g., F2-rise,
F2-fall, vowel-onset) as useful abstractions. It also developed alternate internal
representations to link different acoustic realizations to the same concept.

I. INTRODUCTION

In recent years, the advent of new learning procedures and the availability of high
speed parallel supercomputers have given rise to a renewed interest in connec-
tionist models of intelligence [1]. Sometimes also referred to as artificial neural
networks or parallel distributed processing models, these models are particularly
interesting for cognitive tasks that require massive constraint satisfaction, i.e.,
the parallel evaluation of many clues and facts and their interpretation in the light
of numerous interrelated constraints. Cognitive tasks, such as vision, speech,
language processing, and motor control, are also characterized by a high degree
of uncertainty and variability and it has proved difficult to achieve good perfor-
mance for these tasks using standard serial programming methods. Complex
networks composed of simple computing units are attractive for these tasks not
only because of their "brain-like" appeal1 but because they offer ways for auto-
matically designing systems that can make use of multiple interacting con-
straints. In general, such constraints are too complex to be easily programmed
and require the use of automatic learning strategies. Such learning algorithms
now exist (for an excellent review, see Lippmann [2]) and have been demonstrated
to discover interesting internal abstractions in their attempts to solve a given
problem [1], [3]-[5]. Learning is most effective, however, when used in an
architecture that is appropriate for the task. Indeed, applying one's prior knowl-
edge of a task domain and its properties to the design of a suitable neural network
model might well prove to be a key element in the successful development of
connectionist systems.
Naturally, these techniques will have far-reaching implications for the design
of automatic speech recognition systems, if proven successful in comparison to
already-existing techniques. Lippmann [6] has compared several kinds of neural
networks to other classifiers and evaluated their ability to create complex deci-
sion surfaces. Other studies have investigated actual speech recognition tasks and
compared them to psychological evidence in speech perception [7] or to existing
speech recognition techniques [8], [9]. Speech recognition experiments using
neural nets have so far mostly been aimed at isolated word recognition (mostly
the digit recognition task) [10]–[13] or phonetic recognition with predefined
constant [14], [15] or variable phonetic contexts [16], [14], [17].
A number of these studies report very encouraging recognition performance
1
The uninitiated reader should be cautioned not to overinterpret the now-popular term "neural
network." Although these networks appear to mimic certain properties of neural cells, no claim can
be made that present exploratory attempts simulate the complexities of the human brain.


[16], but only few comparisons to existing recognition methods exist. Some of
these comparisons found performance similar to existing methods [9], [11], but
others found that networks perform worse than other techniques [8]. One might
argue that this state of affairs is encouraging considering the amount of fine-
tuning that has gone into optimizing the more popular, established techniques.
Nevertheless, better comparative performance figures are needed before neural
networks can be considered as a viable alternative for speech recognition sys-
tems.
One possible explanation for the mixed performance results obtained so far
may be limitations in computing resources leading to shortcuts that limit perfor-
mance. Another more serious limitation, however, is the inability of most neural
network architectures to deal properly with the dynamic nature of speech. Two
important aspects of this are for a network to represent temporal relationships
between acoustic events, while at the same time providing for invariance under
translation in time. The specific movement of a formant in time, for example, is
an important cue to determining the identity of a voiced stop, but it is irrelevant
whether the same set of events occurs a little sooner or later in the course of time.
Without translation invariance, a neural net requires precise segmentation to
align the input pattern properly. Since this is not always possible in practice,
learned features tend to get blurred (in order to accommodate slight misalign-
ments) and their performance deteriorates. In general, shift invariance has been
recognized as a critically important property for connectionist systems and a
number of promising models have been proposed for speech and other domains
[18]-[21], [14], [17], [22].
In the present paper, we describe a Time-Delay Neural Network (TDNN)
which addresses both of these aspects of speech and demonstrate through exten-
sive performance evaluation that superior recognition results can be achieved
using this approach. In the following section, we begin by introducing the
architecture and learning strategy of a TDNN aimed at phoneme recognition.
Next, we compare the performance of our TDNN's to one of the more popular
current recognition techniques: Hidden Markov Models (HMM). In Section III,
we start by describing an HMM, under development at ATR [23], [24]. Both
techniques, the TDNN and the HMM, are then evaluated over a testing database
and we report the results. We show that substantially higher recognition perfor-
mance is achieved by the TDNN than by the best of our HMM's. In Section IV,
we then take a closer look at the internal representation that the TDNN learns for
this task. It discovers a number of interesting linguistic abstractions which we
show by way of examples. The implications of these results are then discussed
and summarized in the final section of this paper.

II. TIME-DELAY NEURAL NETWORKS (TDNN)


To be useful for speech recognition, a layered feedforward neural network must
have a number of properties. First, it should have multiple layers and sufficient


interconnections between units in each of these layers. This is to ensure that the
network will have the ability to learn complex nonlinear decision surfaces [2],
[6]. Second, the network should have the ability to represent relationships be-
tween events in time. These events could be spectral coefficients, but might also
be the output of higher level feature detectors. Third, the actual features or
abstractions learned by the network should be invariant under translation in time.
Fourth, the learning procedure should not require precise temporal alignment of
the labels that are to be learned. Fifth, the number of weights in the network
should be sufficiently small compared to the amount of training data so that the
network is forced to encode the training data by extracting regularity. In the
following, we describe a TDNN architecture that satisfies all of these criteria and
is designed explicitly for the recognition of phonemes, in particular, the voiced
stops "B," "D," and "G."

A. A TDNN Architecture for Phoneme Recognition


The basic unit used in many neural networks computes the weighted sum of its
inputs and then passes this sum through a nonlinear function, most commonly a
threshold or sigmoid function [2], [1]. In our TDNN, this basic unit is modified
by introducing delays D1 through DN as shown in Fig. 1. The J inputs of such a
unit now will be multiplied by several weights, one for each delay and one for the
undelayed input. For N = 2 and J = 16, for example, 48 weights will be needed
to compute the weighted sum of the 16 inputs, with each input now measured at
three different points in time. In this way, a TDNN unit has the ability to relate
and compare current input to the past history of events. The sigmoid function
was chosen as the nonlinear output function F due to its convenient mathematical
properties [18], [5].
For the recognition of phonemes, a three layer net is constructed.2 Its overall
architecture and a typical set of activities in the units are shown in Fig. 2.
At the lowest level, 16 normalized melscale spectral coefficients serve as
input to the network. Input speech, sampled at 12 kHz, was Hamming windowed
and a 256-point FFT computed every 5 ms. Melscale coefficients were computed
from the power spectrum by computing log energies in each melscale energy
band [25], where adjacent coefficients in frequency overlap by one spectral
sample and are smoothed by reducing the shared sample by 50 percent [25].3
2
Lippmann [2], [6] demonstrated recently that three layers can encode arbitrary pattern recogni-
tion decision surfaces. We believe that complex nonlinear decision surfaces are necessary to properly
perform classification in the light of considerable acoustic variability as reported in the experiments
below.
3
Naturally, a number of alternative signal representations could be used as input, but have not
been tried in this study. Filterbank coefficients were chosen as they are simple to compute and readily
interpretable in the light of acoustic-phonetics. The melscale is a physiologically motivated frequency
scale that provides better relative frequency resolution for lower frequency bands. Our implementa-
tion resulted in coefficients with a band-width of approximately 190 Hz up to 1400 Hz, and with
increasing band-widths thereafter.


Figure 1. A Time-Delay Neural Network (TDNN) unit.

Adjacent coefficients in time were collapsed for further data reduction resulting
in an overall 10 ms frame rate. All coefficients of an input token (in this case, 15
frames of speech centered around the hand-labeled vowel onset) were then nor­
malized. This was accomplished by subtracting from each coefficient the average
coefficient energy computed over all 15 frames of an input token and then
normalizing each coefficient to lie between – 1 and +1. All tokens in our
database were preprocessed in the same fashion. Fig. 2 shows the resulting
coefficients for the speech token "BA" as input to the network, where positive
values are shown as black squares and negative values as gray squares.
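One plausible reading of this normalization step, sketched in Python (coeffs is an assumed array of shape (15, 16) holding the melscale coefficients of one token):

import numpy as np

def normalize_token(coeffs):
    # Subtract the average coefficient energy computed over all
    # 15 frames of the token, then rescale so that every
    # coefficient lies between -1 and +1.
    c = coeffs - coeffs.mean()
    return c / np.abs(c).max()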
This input layer is then fully interconnected to a layer of 8 time-delay hidden
units, where J = 16 and N = 2 (i.e., 16 coefficients over 3 frames with time
delay 0, 1, and 2). An alternative way of seeing this is depicted in Fig. 2. It
shows the inputs to these time-delay units expanded out spatially into a 3 frame
window, which is passed over the input spectrogram. Each unit in the first hidden


[Figure 2 shows the input layer of 16 melscale filterbank coefficients (141-5437 Hz) spanning 15 frames at a 10 ms frame rate, hidden layer 1 (8 units), hidden layer 2 (3 units), and an integration stage feeding the B, D, and G units of the output layer.]
Figure 2. The architecture of the TDNN.

layer now receives input (via 48 weighted connections) from the coefficients in
the 3 frame window. The particular choice of 3 frames (30 ms) was motivated by
earlier studies [26]–[29] that suggest that a 30 ms window might be sufficient to
represent low level acoustic-phonetic events for stop consonant recognition. It
was also the optimal choice among a number of alternative designs evaluated by
Lang [21] on a similar task.
In the second hidden layer, each of 3 TDNN units looks at a 5 frame window
of activity levels in hidden layer 1 (i.e., J = 8, N = 4). The choice of a larger 5
frame window in this layer was motivated by the intuition that higher level units


should learn to make decisions over a wider range in time based on more local
abstractions at lower levels.
Finally, the output is obtained by integrating (summing) the evidence from
each of the 3 units in hidden layer 2 over time and connecting it to its pertinent
output unit (shown in Fig. 2 over 9 frames for the "B" output unit). In practice,
this summation is implemented simply as another nonlinear (sigmoid function is
applied here as well) TDNN unit which has fixed equal weights to a row of unit
firings over time in hidden layer 2.4
When the TDNN has learned its internal representation, it performs recogni-
tion by passing input speech over the TDNN units. In terms of the illustration of
Fig. 2, this is equivalent to passing the time-delay windows over the lower level
units' firing patterns.5 At the lowest level, these firing patterns simply consist of
the sensory input, i.e., the spectral coefficients.
Each TDNN unit outlined in this section has the ability to encode temporal
relationships within the range of the N delays. Higher layers can attend to larger
time spans, so local short duration features will be formed at the lower layer and
more complex longer duration features at the higher layer. The learning proce-
dure ensures that each of the units in each layer has its weights adjusted in a way
that improves the network's overall performance.
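A minimal sketch of one such layer's forward pass (not the authors' implementation); x is an assumed (n_frames, J) array of activations from the layer below, and W holds one weight per unit, delay, and input:

import numpy as np

def tdnn_layer(x, W, b):
    # x: (n_frames, J); W: (n_units, N + 1, J); b: (n_units,)
    n_frames, J = x.shape
    n_units, width, _ = W.shape
    out = np.empty((n_frames - width + 1, n_units))
    for t in range(out.shape[0]):
        window = x[t:t + width]          # the same weights are applied
        s = np.tensordot(W, window, 2)   # at every time shift t
        out[t] = 1.0 / (1.0 + np.exp(-(s + b)))
    return out

# With J = 16, N = 2 (a 3 frame window), 15 input frames yield
# 13 output frames, as noted in footnote 5 below.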

B. Learning in a TDNN
Several learning techniques exist for optimization of neural networks [1], [2],
[30]. For the present network, we adopt the Backpropagation Learning Proce-
dure [18], [5]. Mathematically, backpropagation is gradient descent of the mean-
squared error as a function of the weights. The procedure performs two passes
through the network. During the forward pass, an input pattern is applied to the
network with its current connection strengths (initially small random weights).
The outputs of all the units at each level are computed starting at the input layer
and working forward to the output layer. The output is then compared to the
desired output and its error calculated. During the backward pass, the derivative
of this error is then propagated back through the network, and all the weights are
adjusted so as to decrease the error [18], [5]. This is repeated many times for all
the training tokens until the network converges to producing the desired output.
In the previous section, we described a method of expressing temporal struc-
ture in a TDNN and contrasted this method to training a network on a static input

4
Note, however, that as for all units in this network (except the input units), the output units are
also connected to a permanently active threshold unit. In this way, both an output unit's one shared
connection to a row in hidden layer 2 and its dc-bias are learned and can be adjusted for optimal
classification.
5
Thus, 13 frames of activations in hidden layer 1 are generated when scanning the 15 frames of
input speech with a 3 frame time delay window. Similarly, 9 frames are produced in hidden layer 2
from the 13 frames of activation in the layer below.


pattern (spectrogram), which results in shift sensitive networks (i.e., poor perfor-
mance for slightly misaligned input patterns) as well as less crisp decision mak-
ing in the units of the network (caused by misaligned tokens during training).
To achieve the desired learning behavior, we need to ensure that the network is
exposed to sequences of patterns and that it is allowed (or encouraged) to learn
about the most powerful cues and sequences of cues among them. Conceptually,
the backpropagation procedure is applied to speech patterns that are stepped
through in time. An equivalent way of achieving this result is to use a spatially
expanded input pattern, i.e., a spectrogram plus some constraints on the weights.
Each collection of TDNN units described above is duplicated for each one frame
shift in time. In this way, the whole history of activities is available at once.
Since the shifted copies of the TDNN units are mere duplicates and are to look
for the same acoustic event, the weights of the corresponding connections in the
time shifted copies must be constrained to be the same. To implement this, we
first apply the regular backpropagation forward and backward pass to all time-
shifted copies as if they were separate events. This yields different error deriva-
tives for corresponding (time shifted) connections. Rather than changing the
weights on time-shifted connections separately, however, we actually update
each weight on corresponding connections by the same value, namely by the
average of all corresponding time-delayed weight changes.6 Fig. 2 illustrates this
by showing in each layer only two connections that are linked to (constrained to
have the same value as) their time-shifted neighbors. Of course, this applies to all
connections and all time shifts. In this way, the network is forced to discover
useful acoustic-phonetic features in the input, regardless of when in time they
actually occurred. This is an important property, as it makes the network inde-
pendent of error-prone preprocessing algorithms that otherwise would be needed
for time alignment and/or segmentation. In Section IV-C, we will show exam-
ples of grossly misaligned patterns that are properly recognized due to this
property.
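A sketch of this weight-tying constraint; grads is an assumed array of error derivatives, one per time-shifted copy of the same underlying weight:

import numpy as np

def tied_weight_update(w, grads, lr):
    # The time-shifted copies are duplicates looking for the same
    # acoustic event, so each is updated by the same amount: the
    # average of all corresponding time-delayed weight changes.
    return w - lr * np.mean(grads, axis=0)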
The procedure described here is computationally rather expensive, due to the
many iterations necessary for learning a complex multidimensional weight space
and the number of learning samples. In our case, about 800 learning samples
were used, and between 20 000 and 50 000 iterations of the backpropagation
loop were run over all training samples. Two steps were taken to perform learn-
ing within reasonable time. First, we have implemented our learning procedure
in C and Fortran on a 4 processor Alliant supercomputer. The speed of learning
can be improved considerably by computing the forward and backward sweeps
for several different training samples in parallel on different processors. Further
improvements can be gained by vectorizing operations and possibly assembly
coding the innermost loop. Our present implementation achieves about a factor

6 Note that in the experiments reported below, these weight changes were actually carried out each time the error derivatives from all training samples had been computed [5].


Figure 3. TDNN output error versus number of learning iterations (increasing training set size).

of 9 speedup over a VAX 8600, but still leaves room for further improvements
(Lang [21], for example, reports a speedup of a factor of 120 over a VAX11/780
for an implementation running on a Convex supercomputer). The second step
taken toward improving learning time is given by a staged learning strategy. In
this approach, we start optimizing the network based on 3 prototypical training
tokens only.7 In this case, convergence is achieved rapidly, but the network will
have learned a representation that generalizes poorly to new and different pat-
terns. Once convergence is achieved, the network is presented with approx-
imately twice the number of tokens and learning continues until convergence.
Fig. 3 shows the progress during a typical learning run. The measured error is
½ the squared error of all the output units, normalized for the number of training
tokens. In this run, the number of training tokens used were 3, 6, 9, 24, 99, 249,
and 780. As can be seen from Fig. 3, the error briefly jumps up every time more
variability is introduced by way of more training data. The network is then forced
to improve its representation to discover clues that generalize better and to
deemphasize those that turn out to be merely irrelevant idiosyncrasies of a
limited sample set. Using the full training set of 780 tokens, this particular run

7 Note that for optimal learning, the training data are presented by always alternating tokens for each class. Hence, we start the network off by presenting 3 tokens, one for each class.


was continued until iteration 35 000 (Fig. 3 shows the learning curve only up to
15 000 iterations). With this full training set, small learning steps have to be
taken and learning progresses slowly. In this case, a step size of 0.002 and a
momentum [5] of 0.1 were used. The staged learning approach was found to be
useful to move the weights of the network rapidly into the neighborhood of a
reasonable solution, before the rather slow fine tuning over all training tokens
begins.
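The schedule itself is compact; the fragment below (ours) sketches it, with the training-set sizes taken from the run in Fig. 3 and train_until_converged standing in hypothetically for the backpropagation loop described above.

    stage_sizes = [3, 6, 9, 24, 99, 249, 780]

    def staged_training(network, all_tokens, train_until_converged):
        # Grow the training set stage by stage; each stage starts from the
        # weights left by the previous one. The alternating presentation
        # order keeps the three classes ("B", "D", "G") balanced.
        for size in stage_sizes:
            subset = all_tokens[:size]
            train_until_converged(network, subset)
        return network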
Despite these speedups, learning runs still take on the order of several days. A
number of programming tricks [21] as well as modifications to the learning
procedure [31] are not implemented yet and could yield another factor of 10 or
more in learning time reduction. It is important to note, however, that the amount
of computation considered here is necessary only for learning of a TDNN and not
for recognition. Recognition can easily be performed in better than real time on a
workstation or personal computer. The simple structure makes TDNN's also well
suited for standardized VLSI implementation. The detailed knowledge could be
learned "off-line" using substantial computing power and then downloaded in the
form of weights onto a real-time production network.

III. RECOGNITION EXPERIMENTS

We now turn to an experimental evaluation of the TDNN's recognition performance. In particular, we would like to compare the TDNN's performance to the
performance of the currently most popular recognition method: Hidden Markov
Models (HMM). For the performance evaluation reported here, we have chosen
the best of a number of HMM's developed in our laboratory. Several other
HMM-based variations and models have been tried in an effort to optimize our
HMM, but we make no claim that an exhaustive evaluation of all HMM-based
techniques was accomplished. We should also point out that the experiments
reported here were aimed at evaluating two different recognition philosophies.
Each recognition method was therefore implemented and optimized using its
preferred representation of the speech signal, i.e., a representation that is well
suited and most commonly used for the method evaluated. Evaluation of both
methods was of course carried out using the same speech input data, but we
caution the reader that due to the differences in representation, the exact contri-
bution to overall performance of the recognition strategy as opposed to its signal
representation is not known. It is conceivable that improved front end processing
might lead to further performance improvements for either technique. In the
following sections, we will start by introducing the best of our Hidden Markov
Models. We then describe the experimental conditions and the database used for
performance evaluation and conclude with the performance results achieved by
our TDNN and HMM.


A. A Hidden Markov Model (HMM) for Phoneme Recognition
HMM's are currently the most successful and promising approach [32]–[34] in
speech recognition as they have been successfully applied to the whole range of
recognition tasks. Excellent performance was achieved at all levels from the
phonemic level [35]–[38] to word recognition [39], [34] and to continuous
speech recognition [40]. The success of HMM's is partially due to their ability to
cope with the variability in speech by means of stochastic modeling. In this
section, we describe an HMM developed in our laboratory that was aimed at
phoneme recognition, more specifically the voiced stops "B," "D," and "G." The
model described was the best of a number of alternate designs developed in our
laboratory [23], [24].
The acoustic front end for Hidden Markov Modeling is typically a vector
quantizer that classifies sequences of short-time spectra. Such a representation
was chosen as it is highly effective for HMM-based recognizers [40].
Input speech was sampled at 12 kHz, preemphasized by (1 − 0.97z⁻¹), and windowed using a 256-point Hamming window every 3 ms. Then a 12th-order LPC analysis was carried out. A codebook of 256 LPC spectrum envelopes was
generated from 216 phonetically balanced words. The Weighted Likelihood Ra-
tio [41], [42] augmented with power values (PWLR) [43], [42] was used as LPC
distance measure for vector quantization.
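Structurally, this front end maps every spectral frame to the index of its nearest codebook entry. The sketch below (ours, in Python with NumPy) shows the mechanics only; a plain Euclidean distance stands in for the PWLR measure actually used.

    import numpy as np

    def quantize(frames, codebook):
        # Assign each frame to its nearest codebook entry; the resulting
        # index sequence is the discrete observation string for the HMM.
        dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
        return np.argmin(dists, axis=1)

    codebook = np.random.randn(256, 12)   # 256 entries of, e.g., 12 LPC-based coefficients
    frames = np.random.randn(100, 12)     # one row per 3 ms analysis frame
    symbols = quantize(frames, codebook)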
A fairly standard HMM was adopted in this paper as shown in Fig. 4. It has
four states and six transitions and was found to be the best of a series of alternate
models tried in our laboratory. These included models with two, three, four, and
five states and with tied arcs and null arcs [23], [24].
The HMM probability values were trained using vector sequences of
phonemes according to the forward-backward algorithm [32]. The vector se-
quences for "B," "D," and "G" include a consonant part and five frames of the
following vowel. This is to model important transient information, such as
formant movement, and has led to improvements over context insensitive mod-
els [23], [24]. Again, variations on these parameters have been tried for the
discrimination of these three voiced stop consonants. In particular, we have used
10 and 15 frames (i.e., 30 and 45 ms) of the following vowel in a 5 state HMM,
but no performance improvements over the model described were obtained.
The HMM was trained using about 250 phoneme tokens of vector sequences

Figure 4. Hidden Markov Model (states s0, s1, s2, s3).


per speaker and phoneme (see details of the training database below). Fig. 5
shows for a typical training run the average log probability normalized by the
number of frames. Training was continued until the increase of the average log
probability between iterations became less than 2 × 10⁻³.
Typically, about 10–20 learning iterations are required for 256 tokens. A
training run takes about 1 h on a VAX 8700. Floor values8 were set on the output
probabilities to avoid errors caused by zero probabilities. We have experimented
with composite models, which were trained using a combination of context-
independent and context-dependent probability values as suggested by Schwartz
et al. [35], [36]. In our case, no significant improvements were attained.
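The flooring mentioned above amounts to a clip-and-renormalize step; in the sketch below (ours, in Python with NumPy), the floor of 1e-5 is an arbitrary placeholder for the empirically selected value.

    import numpy as np

    def apply_floor(output_probs, floor=1e-5):
        # Clip each state's output distribution at the floor, then
        # renormalize so rows still sum to one; this keeps an unseen
        # codebook symbol from zeroing out a whole sequence likelihood.
        floored = np.maximum(output_probs, floor)
        return floored / floored.sum(axis=-1, keepdims=True)

    B = np.random.dirichlet(np.ones(256), size=4)   # 4 states x 256 VQ symbols
    B = apply_floor(B)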

B. Experimental Conditions
For performance evaluation, we have used a large vocabulary database of 5240
common Japanese words [44]. These words were uttered in isolation by three
male native Japanese speakers (MAU, MHT, and MNM, all professional an-
nouncers) in the order they appear in a Japanese dictionary. All utterances were
recorded in a sound-proof booth and digitized at a 12 kHz sampling rate. The
database was then split into a training set (the even numbered files as derived
from the recording order) and a testing set (the odd numbered files). A given
speaker's training and testing data, therefore, consisted of 2620 utterances each,
from which the actual phonetic tokens were extracted.
The phoneme recognition task chosen for this experiment was the recognition
of the voiced stops, i.e., the phonemes "B," "D," and "G." The actual tokens
were extracted from the utterances using manually selected acoustic-phonetic
labels provided with the database [44]. For speaker MAU, for example, a total of
219 "B's," 203 "D's," and 260 "G's" were extracted from the training and 227
"B's," 179 "D's," and 252 "G's," from the testing data. Both recognition
schemes, the TDNN's and the HMM's, were trained and tested speaker depen-
dently. Thus, in both cases, separate networks were trained for each speaker.
In our database, no preselection of tokens was performed. All tokens labeled
as one of the three voiced stops were included. It is important to note that since
the consonant tokens were extracted from entire utterances and not read in
isolation, a significant amount of phonetic variability exists. Foremost, there is
the variability introduced by the phonetic context out of which a token is extrac-
ted. The actual signal of a "BA" will therefore look significantly different from a
"BI" and so on. Second, the position of a phonemic token within the utterance
introduces additional variability. In Japanese, for example, a "G" is nasalized,
when it occurs embedded in an utterance, but not in utterance initial position.
Both of our recognition algorithms are only given the phonemic identity of a
token and must find their own ways of representing the fine variations of speech.

8 Here, once again, the optimal value out of a number of alternative choices was selected.


Figure 5. Learning in the Hidden Markov Model (average log probability versus number of learning iterations).

C. Results
Table I shows the results from the recognition experiments described above as
obtained from the testing data. As can be seen, for all three speakers, the TDNN
yields considerably higher performance than our HMM. Averaged over all three
speakers, the error rate is reduced from 6.3 to 1.5 percent—a more than fourfold
reduction in error.
While it is particularly important here to base performance evaluation on
testing data,9 a few observations can be made from recognition runs over the
training data. For the training data set, recognition rates were: 99.6 percent
(MAU), 99.7 percent (MHT), and 99.7 percent (MNM) for the TDNN, and 96.9
percent (MAU), 99.1 percent (MHT), and 95.7 percent (MNM) for the HMM.
Comparison of these results to those from the testing data in Table I indicates that
both methods achieved good generalization from the training set to unknown
data. The data also suggest that better classification rather than better generaliza-
tion might be the cause of the TDNN's better performance shown in Table I.
Figs. 6–11 show scatter plots of the recognition outcome for the test data for
speaker MAU, using the HMM and the TDNN. For the HMM (see Figs. 6–8),
the log probability of the next best matching incorrect token is plotted against the
log probability10 of the correct token, e.g., "B," "D," and "G." In Figs. 9–11,
the activation levels from the TDNN's output units are plotted in the same
fashion. Note that these plots are not easily comparable, as the two recognition
methods have been trained in quite different ways. They do, however, represent
the numerical values that each method's decision rule uses to determine the
9 If the training data are insufficient, neural networks can in principle learn to memorize training patterns rather than finding generalizations of speech.
10 Normalized by number of frames.


TABLE 1
Recognition Results for Three Speakers Over Test Data Using TDNN and HMM

                                TDNN                        HMM
speaker  phoneme (tokens)  errors  rate (%)  overall   errors  rate (%)  overall
MAU      b (227)              4     98.2                 18     92.1
         d (179)              3     98.3      98.8        6     96.7      92.9
         g (252)              1     99.6                 23     90.9
MHT      b (208)              2     99.0                  8     96.2
         d (170)              0    100.0      99.1        3     98.2      97.2
         g (254)              4     98.4                  7     97.2
MNM      b (216)             11     94.9                 27     87.5
         d (178)              1     99.4      97.5       13     92.7      90.9
         g (256)              4     98.4                 19     92.6

recognition outcome. We present these plots here to show some interesting properties of the two techniques. The most striking observation that can be made
from these plots is that the output units of a TDNN have a tendency to fire with
high confidence as can be seen from the cluster of dots in the lower right-hand
corner of the scatter plots. Most output units tend to fire strongly for the correct
phonemic class and not at all for any other, a property that is encouraged by the

Figure 6. Scatter plot showing log probabilities for the best matching incorrect case versus the correctly recognized "B's" using an HMM.


Figure 7. Scatter plot showing log probabilities for the best matching incorrect case versus the correctly recognized "D's" using an HMM.

Figure 8. Scatter plot showing log probabilities for the best matching incorrect case versus the correctly recognized "G's" using an HMM.


Figure 9. Scatter plot showing activation levels for the best matching incorrect case versus the correctly recognized "B's" using a TDNN.

Figure 10. Scatter plot showing activation levels for the best matching incorrect case versus the correctly recognized "D's" using a TDNN.


Figure 11. Scatter plot showing activation levels for the best matching incorrect case versus the correctly recognized "G's" using a TDNN.

learning procedure. One possible consequence of this is that rejection thresholds could be introduced to improve recognition performance. If one were to eliminate among speaker MAU's tokens all those whose highest activation level is less than 0.5 and those which result in two or more closely competing activations (i.e., are near the diagonal in the scatter plots), 2.6 percent of all tokens would be rejected, while the remaining substitution error rate would be less than 0.46 percent.
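Such a rejection rule is straightforward to state; the sketch below (ours) uses the 0.5 activation threshold from the text together with an illustrative margin of 0.1 to operationalize "closely competing" activations.

    import numpy as np

    def classify_with_rejection(activations, min_top=0.5, min_margin=0.1):
        # Return the winning class index, or None to reject the token as
        # weak (low top activation) or ambiguous (close runner-up).
        order = np.argsort(activations)[::-1]
        top, runner_up = activations[order[0]], activations[order[1]]
        if top < min_top or (top - runner_up) < min_margin:
            return None
        return int(order[0])

    print(classify_with_rejection(np.array([0.92, 0.05, 0.03])))  # 0 (accept)
    print(classify_with_rejection(np.array([0.45, 0.40, 0.15])))  # None (reject)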

IV. THE LEARNED INTERNAL REPRESENTATIONS OF A TDNN

Given the encouraging performance of our TDNN's, a close look at the learned
internal representation of the network is warranted. What are the properties or
abstractions that the network has learned that appear to yield a very powerful
description of voiced stops? Figs. 12 and 13 show two typical instances of a "D"
out of two different phonetic contexts ("DA" and "DO," respectively). In both
cases, only the correct unit, the "D-output unit," fires strongly, despite the fact
that the two input spectrograms differ considerably from each other. If we study
the internal firings in these two cases, we can see that the network has learned to


Figure 12. TDNN activation patterns for "DA."

use alternate internal representations to link variations in the sensory input to the
same higher level concepts. A good example is given by the firings of the third
and fourth hidden unit in the first layer above the input layer. As can be seen
from Fig. 13, the fourth hidden unit fires particularly strongly after vowel onset
in the case of "DO," while the third unit shows stronger activation after vowel
onset in the case of "DA."
Fig. 14 shows the significance of these different firing patterns. Here the
connection strengths for the eight moving TDNN units are shown, where white
and black blobs represent positive and negative weights, respectively, and the
magnitude of a weight is indicated by the size of the blob. In this figure, the time
delays are displayed spatially as a 3 frame window of 16 spectral coefficients.
Conceptually, the weights in this window form a moving acoustic–phonetic
feature detector that fires when the pattern for which it is specialized is encoun-


Figure 13. TDNN activation patterns for "DO."

tered in the input speech. In our example, we can see that hidden unit number 4
(which was activated for "DO") has learned to fire when a falling (or rising)
second formant starting at around 1600 Hz is found in the input (see filled arrow
in Fig. 14). As can be seen in Fig. 13, this is the case for "DO" and hence the
firing of hidden unit 4 after voicing onset (see row pointed to by the filled arrow
in Fig. 13). In the case of "DA" (see Fig. 12), in turn, the second formant does
not fall significantly, and hidden unit 3 (pointed to by the filled arrow) fires
instead. From Fig. 14 we can verify that TDNN unit 3 has learned to look for a
steady (or only slightly falling) second formant starting at about 1800 Hz. The
connections in the second and third layer then link the different firing patterns
observed in the first hidden layer into one and the same decision.
Another interesting feature can be seen in the bottom hidden unit in hidden
layer number 1 (see Figs. 12 and 13, and compare them to the weights of hidden


Figure 14. Weights on connections from 16 coefficients over 3 time frames to each of the 8 hidden units in the first layer.

unit 1 displayed in Fig. 14). This unit has learned to take on the role of finding
the segment boundary of the voiced stop. It does so in reverse polarity, i.e., it is
always on except when the vowel onset of the voiced stop is encountered (see
unfilled arrow in Figs. 13 and 12). Indeed, the higher layer TDNN units subse-
quently use this "segmenter" to base the final decision on the occurrence of the
right lower features at the right point in time.
In the previous example, we have seen that the TDNN can account for varia-
tions in phonetic context. Figs. 15 and 16 show examples of variability caused by
the relative position of a phoneme within a word. In Japanese, a "G" embedded
in a word tends to be nasalized as seen in the spectrum of a "GA" in Fig. 15. Fig.
16 shows a word initial "GA." Despite the striking differences between these two


Figure 15. TDNN activation patterns for "GA" embedded in an utterance.

input spectrograms, the network's internal alternate representations manage to produce in both cases crisp output firings for the right category.
Figs. 17 and 18, finally, demonstrate the shift invariance of the network. They
show the same token "DO" of Fig. 13, misaligned by +30 ms and –30 ms,
respectively. Despite the gross misalignment (note that significant transitional
information is lost by the misalignment in Fig. 18), the correct result was ob-
tained reliably. A close look at the internal activation patterns reveals that the
hidden units' feature detectors do indeed fire according to the events in the input
speech, and are not negatively affected by the relative shift with respect to the
input units. Naturally, error rates will gradually increase when the tokens are
artificially shifted to an extent that important features begin to fall outside the 15
frame data window considered here. We have observed, for example, a 2.6


Figure 16. TDNN activation patterns for "GA" in utterance initial position.

percent increase in error rate when all tokens from the training data were arti-
ficially shifted by 20 ms. Such residual time-shift sensitivities are due to the edge
effects at the token boundaries and can probably be removed by training the
TDNN using randomly shifted training tokens.11 We also consider the formation
of shift-invariant internal features to be the important desirable property we
observe in the TDNN. Such internal features could be incorporated into larger
speech recognition systems using more sophisticated search techniques or a
syllable or word level TDNN, and hence could replace the simple integration
layer we have used here for training and evaluation.
Three important properties of the TDNN's have been observed. First, our
TDNN was able to learn, without human interference, meaningful linguistic

11 We gratefully acknowledge one of the reviewers for suggesting this idea.


Figure 17. TDNN activation patterns for "DO" misaligned by +30 ms.

abstractions such as formant tracking and segmentation. Second, we have demonstrated that it has learned to form alternate representations linking different
acoustic events with the same higher level concept. In this fashion, it can imple-
ment trading relations between lower level acoustic events leading to robust
recognition performance. Third, we have seen that the network is shift invariant
and does not rely on precise alignment or segmentation of the input.

V. CONCLUSION AND SUMMARY

In this paper we have presented a Time-Delay Neural Network (TDNN) approach to phoneme recognition. We have shown that this TDNN has two desir-
able properties related to the dynamic structure of speech. First, it can learn the


Figure 18. TDNN activation patterns for "DO" misaligned by –30 ms.

temporal structure of acoustic events and the temporal relationships between such events. Second, it is translation invariant, that is, the features learned by the
network are insensitive to shifts in time. Examples demonstrate that the network
was indeed able to learn acoustic-phonetic features, such as formant movements
and segmentation, and use them effectively as internal abstractions of speech.
The TDNN presented here has two hidden layers and has the ability to learn
complex nonlinear decision surfaces. This could be seen from the network's
ability to use alternate internal representations and trading relations among lower
level acoustic-phonetic features, in order to arrive robustly at the correct final
decision. Such alternate representations have been particularly useful for repre-
senting tokens that vary considerably from each other due to their different
phonetic environment or their position within the original speech utterance.
Finally, we have evaluated the TDNN on the recognition of three acoustically
similar phonemes, the voiced stops "B," "D," and "G." In extensive performance


evaluation over testing data from three speakers, the TDNN achieved an average
recognition score of 98.5 percent. For comparison, we have applied various
Hidden Markov Models to the same task and only been able to recognize 93.7
percent of the tokens correctly. We would like to note that many variations of
HMM's have been attempted, and many more variations of both HMM's and
TDNN's are conceivable. Some of these variations could potentially lead to
significant improvements over the results reported in this study. Our goal here is
to present TDNN's as a new and successful approach for speech recognition.
Their power lies in their ability to develop shift-invariant internal abstractions of
speech and to use them in trading relations for making optimal decisions. This
holds significant promise for speech recognition in general, as it could help
overcome the representational weaknesses of speech recognition systems faced
with the uncertainty and variability in real-life signals.

ACKNOWLEDGMENT

The authors would like to express their gratitude to Dr. A. Kurematsu, President
of ATR Interpreting Telephony Research Laboratories, for his enthusiastic en-
couragement and support which made this research possible. We are also in-
debted to the members of the Speech Processing Department at ATR and Mr.
Fukuda at Apollo Computer, Tokyo, Japan, for programming assistance in the
various stages of this research.

REFERENCES

[1] D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. I and II. Cambridge, MA: M.I.T. Press, 1986.
[2] R. P. Lippmann, "An introduction to computing with neural nets," IEEE ASSP Mag., vol. 4,
Apr. 1987.
[3] D. C. Plaut, S. J. Nowlan, and G. E. Hinton, "Experiments on learning by back propagation,"
Tech. Rep. CMU-CS-86-126, Carnegie-Mellon Univ., June 1986.
[4] T. J. Sejnowski and C. R. Rosenberg, "NETtalk: A parallel network that learns to read aloud,"
Tech. Rep. JHU/EECS-86/01, Johns Hopkins Univ., June 1986.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-
propagating errors," Nature, vol. 323, pp. 533–536, Oct. 1986.
[6] W. Y. Huang and R. P. Lippmann, "Comparison between neural net and conventional classi-
fiers," in Proc. IEEE Int. Conf. Neural Networks, June 1987.
[7] J. L. McClelland and J. L. Elman, Interactive Processes in Speech Perception: The TRACE
Model. Cambridge, MA: M.I.T. Press, 1986, ch. 15, pp. 58–121.
[8] S. M. Peeling, R. K. Moore, and M. J. Tomlinson, "The multi-layer perceptron as a tool for
speech pattern processing research," in Proc. IoA Autumn Conf. Speech Hearing, 1986.
[9] H. Bourlard and C. J. Wellekens, "Multilayer perceptrons and automatic speech recognition,"
in Proc. IEEE Int. Conf. Neural Networks, June 1987.
[10] B. Gold, R. P. Lippmann, and M. L. Malpass, "Some neural net recognition results on isolated
words," in Proc. IEEE Int. Conf. Neural Networks, June 1987.


[11] R. P. Lippmann and B. Gold, "Neural-net classifiers useful for speech recognition," in Proc.
IEEE Int. Conf. Neural Networks, June 1987.
[12] D. J. Burr, "A neural network digit recognizer," in Proc. IEEE Int. Conf. Syst., Man, Cybern.,
Oct. 1986.
[13] D. Lubensky, "Learning spectral-temporal dependencies using connectionist networks," in
Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Apr. 1988.
[14] R. L. Watrous and L. Shastri, "Learning phonetic features using connectionist networks: An
experiment in speech recognition," in Proc. IEEE Int. Conf. Neural Networks, June 1987.
[15] R. W. Prager, T. D. Harrison, and F. Fallside, "Boltzmann machines for speech recognition,"
Comput., Speech, Language, vol. 3, no. 27, Mar. 1986.
[16] J. L. Elman and D. Zipser, "Learning the hidden structure of speech," Tech. Rep., Univ. Calif.,
San Diego, Feb. 1987.
[17] R. L. Watrous, L. Shastri, and A. H. Waibel, "Learned phonetic discrimination using connec-
tionist networks," in Proc. Euro. Conf. Speech Technol., Edinburgh, Sept. 1987, pp. 377-380.
[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, Learning Internal Representations by Error
Propagation. Cambridge, MA: M.I.T. Press, 1986, ch. 8, pp. 318-362.
[19] J. S. Bridle and R. K. Moore, "Boltzmann machines for speech pattern processing," in Proc. Inst. Acoust., 1984, pp. 315-322.
[20] D. W. Tank and J. J. Hopfield, "Neural computation by concentrating information in time," in
Proc. Nat. Academy Sci., Apr. 1987, pp. 1896-1900.
[21] K. Lang, "Connectionist speech recognition," Ph.D. dissertation proposal, Carnegie-Mellon
Univ., Pittsburgh, PA.
[22] K. Fukushima, S. Miyake, and T. Ito, "Neocognitron: A neural network model for a mechanism
of visual pattern recognition," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, pp. 826-834,
Sept./Oct. 1983.
[23] T. Hanazawa, T. Kawabata, and K. Shikano, "Discrimination of Japanese voiced stops using Hidden Markov Model," in Proc. Conf. Acoust. Soc. Japan, Oct. 1987, pp. 19-20 (in Japanese).
[24] T. Hanazawa, T. Kawabata, and K. Shikano, "Recognition of Japanese voiced stops using Hidden Markov Models," IEICE Tech. Rep., Dec. 1987 (in Japanese).
[25] A. Waibel and B. Yegnanarayana, "Comparative study of nonlinear time warping techniques in
isolated word speech recognition systems," Tech. Rep., Carnegie-Mellon Univ., June 1981.
[26] S. Makino and K. Kido, "Phoneme recognition using time spectrum pattern," Speech Com-
mun., pp. 225-237, June 1986.
[27] S. E. Blumenstein and K. N. Stevens, "Acoustic invariance in speech production: Evidence
from measurements of the spectral characteristics of stop consonants," J. Acoust. Soc. Amer.,
vol. 66, pp. 1001-1017, 1979.
[28] S. E. Blumenstein and K. N. Stevens, "Perceptual invariance and onset spectra for stop consonants in different vowel environments," J. Acoust. Soc. Amer., vol. 67, pp. 648-662, 1980.
[29] D. Kewley-Port, "Time varying features as correlates of place of articulation in stop conso-
nants," J. Acoust. Soc. Amer., vol. 73, pp. 322-335, 1983.
[30] G. E. Hinton, "Connectionist learning procedures," Artificial Intelligence, 1987.
[31] M. A. Franzini, "Speech recognition with back propagation," in Proc. 9th Annu. Conf. IEEE/Eng. Med. Biol. Soc., Nov. 1987.
[32] F. Jelinek, "Continuous speech recognition by statistical methods," Proc. IEEE, vol. 64,
pp. 532-556, Apr. 1976.
[33] J. K. Baker, "Stochastic modeling as a means of automatic speech recognition," Ph.D. disserta-
tion, Carnegie-Mellon Univ., Apr. 1975.
[34] L. R. Bahl, S. K. Das, P. V. de Souza, F. Jelinek, S. Katz, R. L. Mercer, and M. A. Picheny,
"Some experiments with large-vocabulary isolated-word sentence recognition," in Proc. IEEE
Int. Conf. Acoust., Speech, Signal Processing, Apr. 1984.


[35] R. Schwartz, Y. Chow, O. Kimball, S. Roucos, M. Krasner, and J. Makhoul, "Context-dependent modeling for acoustic-phonetic recognition of continuous speech," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Apr. 1985.
[36] A.-M. Derouault, "Context-dependent phonetic Markov models for large vocabulary speech
recognition," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Apr. 1987, pp. 360-
363.
[37] K. F. Lee and H. W. Hon, "Speaker-independent phoneme recognition using hidden Markov
models," Tech. Rep. CMU-CS-88-121, Carnegie-Mellon Univ., Pittsburgh, PA, Mar. 1988.
[38] P. Brown, "The acoustic-modeling problem in automatic speech recognition," Ph.D. disserta-
tion, Carnegie-Mellon Univ., May 1987.
[39] L. R. Rabiner, B. H. Juang, S. E. Levinson, and M. M. Sondhi, "Recognition of isolated digits
using hidden Markov models with continuous mixture densities," AT&T Tech. J., vol. 64,
no. 6, pp. 1211–1233, July–Aug. 1985.
[40] Y. L. Chow, M. O. Dunham, O. A. Kimball, M. A. Krasner, G. F. Kubala, J. Makhoul,
S. Roucos, and R. M. Schwartz, "BYBLOS: The BBN continuous speech recognition system,"
in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Apr. 1987, pp. 89–92.
[41] M. Sugiyama and K. Shikano, "LPC peak weighted spectral matching measures," Inst. Elec.
Commun. Eng. Japan, vol. 64-A, no. 5, pp. 409–416, 1981 (in Japanese).
[42] K. Shikano, "Evaluation of LPC spectral matching measures for phonetic unit recognition,"
Tech. Rep., Carnegie-Mellon Univ., May 1985.
[43] K. Aikawa and K. Shikano, "Spoken word recognition using vector quantization in power-
spectrum vector space," Inst. Elec. Commun. Eng. Japan, vol. 68-D, no. 3, Mar. 1985 (in
Japanese).
[44] Y. Sagisaka, K. Takeda, S. Katagiri, and H. Kuwabara, "Japanese speech database with fine
acoustic-phonetic transcriptions," Tech. Rep., ATR Interpreting Telephony Res. Lab., May
1987.

3 Automated Aircraft Flare and
Touchdown Control Using
Neural Networks

Charles Schley
Yves Chauvin
Van Henkle
Thomson-CSF, Inc., Palo Alto Research Operation

ABSTRACT

We present a general-purpose neural network architecture capable of controlling nonlinear plants. The network is composed of dynamic, parallel, linear maps gated
by nonlinear switches. Using a recurrent form of the back-propagation algorithm,
we achieve control by optimizing the linear gains and the nonlinear switch parame-
ters. A mean quadratic cost function computed across a nominal plant trajectory is
minimized along with performance constraint penalties. The approach is demon-
strated for a control task consisting of landing a commercial aircraft from a position
on the glideslope to the ground position in difficult wind conditions. We show that
the network learns how to control the aircraft in extreme wind conditions, yielding
performance comparable to or better than that of a "traditional" autoland system
while remaining within acceptable response characteristics constraints. Further-
more, we show that this performance is achieved not only through learning of
control gains in the linear maps but also through learning of task-adapted gain
schedules in the nonlinear switches.

INTRODUCTION

This chapter illustrates how a recurrent back-propagation neural network algorithm (Rumelhart, Hinton, & Williams, 1986) may be exploited as a procedure
for controlling complex systems. In particular, a general-purpose task-adapted
network architecture was devised for the control of nonlinear systems. To apply
the technique, a simplified mathematical model of aircraft landing in the pres-
ence of severe wind gusts was developed and simulated. A recurrent back-


propagation neural network architecture was then designed to numerically esti-


mate the parameters of an optimal nonlinear control law for landing the aircraft.
The performance of the network was then evaluated and compared to "more
traditional" methods of designing control laws.
We have exploited a general neural network approach that might be used
successfully in a variety of control system applications. For this reason, we
present in this section the general problem of controlling a physical system and
our approach to its solution. In the next section we describe the basic neural
controller architecture we devised and explain its unique features. The third
section develops the aircraft model along with a conventional control system.
The fourth section describes the recurrent implementation of the plant, the con-
troller, and the performance indices to be minimized with performance con-
straints. The fifth section presents and evaluates simulation statistics. The last
section reviews our basic findings, discusses their applicability to other types of
control problems, and considers some possible extensions of this research.

A Typical Control System


A typical control system consists of a controller and a process to be controlled, as
illustrated in Figure 1. The controller's function is to accept task inputs along
with process outputs and to determine control signals tailored to the response
characteristics of the process. Frequently, it is desired that the process outputs
closely follow the task inputs. The physical process to be controlled (enclosed
within the shaded box of Figure 1) can be electromechanical, aerodynamic,
chemical, or other. It generally has well-defined behavior following physical
principles such as Newton's laws or laws of thermodynamics. The process states
are physical variables such as acceleration, velocity or temperature. Their dy-
namics are usually expressed in the form of differential equations relating their
interdependence and their dependence on time, position, and other quantities.

Figure 1. Typical control system.


The process is observed through a set of sensors. Finally, the process is subjected
to disturbances from its external operating environment.

Controller Design
What is the best method of designing a control law? Two types of methods, often
called classical and modern methods, are described in the literature. Classical
methods look at linearized versions of the plant to be controlled and some loosely
defined response specifications such as bandwidth (i.e., speed of response) and
phase margin (i.e., degree of stability). These methods make use of time-domain
or frequency-domain mathematical tools such as root-loci methods or Bode plots.
Modern methods generally assume that a performance index for the process is
specified and provide controllers that optimize this performance index. Optimal
control theory attempts to find the parameters of the controller such that the
performance measure (possibly with added performance constraint penalties) is
minimized. In many standard optimal control problems, the controller network is
linear, the plant is linear, and the performance measure is quadratic. In this case
and when the process operation is over infinite time, the parameters of the
controller may be explicitly derived. In general, however, if either the controller
is nonlinear, the plant is nonlinear, or the performance measure is not quadratic,
closed-form solutions for the controller's parameters are not available. Neverthe-
less, numerical methods may be used to compute an optimal control law. Modern
methods make use of sophisticated mathematical tools such as the calculus of
variations, Pontryagin's maximum principle, or dynamic programming.
Although modern methods are more universal, classical methods are widely
used in practice, even with sophisticated control problems (McRuer, Ashkenas,
& Graham, 1973). We feel that the differences between classical and modern
methods can be summarized as follows:

• Classical techniques use ad hoc methods based on engineering judgement that is usually aimed at system robustness. No explicit unified performance index is optimized.
• Modern techniques use principled methods based on optimizing a perfor-
mance index. However, the choice of a performance index is often some-
what ad hoc.

Narendra and Parthasarathy (1990) and others have noted that recurrent back-
propagation networks implement gradient descent algorithms that may be used to
optimize the performance index of a plant. The essence of such methods is to
propagate performance errors back through the process and then back through the
controller to give error signals for updating the controller parameters. Figure 2
provides an overview of the interaction of a neural control law with a complex
system and possible performance indices for evaluating various control laws.


Figure 2. Neural network controller design. At the top, the process to be controlled is the same as in Figure 1 with the controller replaced by a neural net.

The functional components needed to train the controller are shown within the
shaded box of Figure 2. The objective performance measure contains factors that
are written mathematically and usually represent terms such as weighted square
error or other quantifiable measures. The performance constraints are often more
subjective in nature and can be formulated as reward or penalty functions on
categories such as "good" or "bad." The controller training function illustrated in
Figure 2 also contains an optimization procedure used to adjust the parameters of
the controller.
The network controller may be interpreted as a "neural" network when its
architecture and the techniques employed during control and parameter change
resemble techniques inspired from brain-style computations (e.g., Rumelhart &
McClelland, 1986). In the present case, these techniques are (i) parallel computa-
tions (ii) local computations during control and learning and (iii) use of "neural"
network learning algorithms. Narendra and Parthasarathy (1990) provide a more
extensive review of the common properties between neural networks and control
theories.

A GENERAL-PURPOSE NONLINEAR CONTROL ARCHITECTURE

The Switching Principle


Many complex systems are in fact nonlinear or "multimodal." That is, their
behavior changes in fundamental ways as a function of their position in the state-
space. In practice, controllers are often designed for such systems by treating
them as a collection of linear systems, each of which is linearized about a "set
point" in state-space. A linear controller can then be determined separately for
each of these system "modes." These observations suggest that a reasonable
approach for controlling nonlinear or multimodal systems would be to design a
multimodal control law.
The architecture of our proposed general nonlinear control law for multimodal
plants is shown in Figure 3. Task inputs and process outputs are entered into
multiple basic controller blocks (shown within the shaded box of Figure 3). Each
basic controller block first determines a weighted sum of the task inputs and
process outputs (multiplication by weights W). Then, the degree to which the
weighted sum passes through the block is modified by means of a saturating
switch and multiplier. The input to the switch is itself another weighted sum of
the task inputs and process outputs (multiplication by weights V). If the input to
the saturating switch is large, its output is unity and the weighted sum (weighted
by W) is passed through unchanged. At the other extreme, if the saturating

Figure 3. Architecture of the neural network controller.


switch has zero output, the weighted sum of task inputs and process outputs does
not appear in the output. When these basic controller blocks are replicated and
their outputs are added, control signals then consist of weighted sums of the
controller inputs. Moreover, these weighted sums can be selected and/or blended
by the saturating switches to yield the control signals. The overall effect is a
prototypical feedforward and feedback controller with selectable gains and multi-
ple pathways where the overall equivalent gains are a function of the task and
process outputs. The resulting architecture yields a sigma-pi processing unit in
the final controller (Rumelhart, Hinton, & Williams, 1986).
Note that the controller of Figure 3 is composed of multiple parallel computa-
tion paths and can thus be implemented with parallel hardware to facilitate fast
computation. Also, should one or more of the parallel paths fail, some level of
performance would remain, providing a degree of fault tolerance. In addition,
since the controller can select one or more weighted mappings, it can operate in
multiple modes depending on conditions within the process to be controlled or
upon environmental conditions. This property can result in a controller finely
tuned to a process having several different regimes of behavior (nonlinear pro-
cess).
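A minimal numerical sketch of one such layer of gated blocks follows (ours, in Python with NumPy; names and sizes are illustrative). Each block passes one weighted sum of the inputs (weights W) through a multiplier driven by a saturating switch fed by a second weighted sum (weights V), and block outputs are summed into the control signal, giving the sigma-pi form noted above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def controller(x, V, W):
        # x: task inputs and process outputs; V, W: (n_blocks, n_inputs).
        gates = sigmoid(V @ x)            # one saturating switch per block
        branches = W @ x                  # one weighted sum per block
        return np.sum(gates * branches)   # blended control signal

    x = np.array([0.2, -0.1, 1.0])
    V = np.random.randn(4, 3)             # four basic controller blocks
    W = np.random.randn(4, 3)
    u = controller(x, V, W)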

Modeling Dynamic Mappings


The weights shown in Figure 3 may be constant and represent a static relation-
ship between input and control. However, further controller functionality is
obtained by considering the weights V and W as implementing dynamic map-
pings. For example, proportional plus integral plus derivative (PID) feedback
may be used to ensure that process outputs follow task inputs with adequate
steady-state error and transient damping. A PID relationship can be expressed as
a Laplace transform of the ratio y(s)/x(s) = Kp + Ki(1/s) + Kds, where Kp, Ki,
and Kd are the actual weights to be adjusted during controller design. Thus, the
weights express differential equations in the time domain or transfer functions in
the frequency domain. Standard first- and second-order filters can be combined
to yield a wide variety of functionality. In general, the complexity of the dynam-
ics of the controller depends on the complexity of the control task and on the
dynamics of the plant. Therefore, the design of the architecture of the network
requires knowledge of the physics of the plant and of the control requirements. In
Section 4, we explain how this dynamic control is implemented in a neural
network architecture for the application of aircraft landing.
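As an illustration of such a dynamic weight, the fragment below (ours; gains and time step are arbitrary) realizes the PID mapping y(s)/x(s) = Kp + Ki(1/s) + Kds in discrete time, with the integral taken by accumulation and the derivative by first difference.

    def make_pid(kp, ki, kd, dt):
        # Discrete-time PID; the integrator and previous input are kept
        # in a closure so the mapping is stateful, like a dynamic weight.
        state = {"integral": 0.0, "prev": None}
        def pid(x):
            state["integral"] += x * dt
            deriv = 0.0 if state["prev"] is None else (x - state["prev"]) / dt
            state["prev"] = x
            return kp * x + ki * state["integral"] + kd * deriv
        return pid

    pid = make_pid(kp=2.0, ki=0.5, kd=0.1, dt=0.02)
    outputs = [pid(err) for err in (1.0, 0.8, 0.5, 0.2)]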

AIRCRAFT AUTOLAND SYSTEM

Presented here is the development of a model for the glideslope tracking and flare
phases of aircraft flight immediately prior to landing. The model is in the form of


incremental equations of motion in aircraft stability axes along with typical control loops. These equations are suitable for a simulation that can be used to obtain incremental trajectory data just prior to landing. Their presentation assumes some familiarity with control systems. They are intended to specify a well-behaved aircraft plant that, although simple, responds in a realistic way to environment and control parameters. Linearized longitudinal/vertical perturbation equations of motion are used for the aircraft which is assumed to be initially trimmed in wings level flight at low altitude. Data and control system designs provide a realistic representation of a large commercial transport aircraft. High-frequency hardware and actuator dynamics are neglected. However, simplifications retain the overall quality of system response.

During aircraft landing, the final two phases of a landing trajectory consist of a "glideslope" phase and a "flare" phase. Figure 4 shows these two phases. Flare occurs at about 45 feet. Glideslope is characterized by a linear downward slope; flare by a negative exponential. When the aircraft reaches flare, its response characteristics are changed to make it more sensitive to the pilot's actions, making the process nonlinear over the whole trajectory.

In addition to linearized equations of motion for the bare airframe, the baseline system has the following components:

1. Stability augmentation control systems for the bare airframe to provide adequate damping and speed of response concerning principal aircraft dynamics (speed, phugoid, short period)
2. Glideslope tracking and flare controllers to produce commands in response to desired positions during descent

Figure 4. Glideslope and flare geometry, showing the flare initiation conditions (altitude hf, altitude rate, speed vf) and the touchdown constraints on sink rate, position, and pitch angle. The pitch angle is the angle between the aircraft body and the ground measured in the vertical plane.


3. Desired position commands based on instrument landing system (ILS) tracking and an adaptive ground speed exponential flare law to provide a smooth descent
4. Altitude wind shear and turbulent gusts modeled using a Dryden distribution (frozen field model for generating gusty winds)

Vehicle Equations of Motion


The vehicle equations of motion are derived in a general sense from Newtonian
mechanics. Two preliminary assumptions are made. Only rigid-body mechanics
are used and the earth is fixed in space. This means that vehicle motion may be
described as translation of and rotation about the vehicle center of mass and that
the vehicle inertial frame is fixed at or moves at constant velocity with respect to
the earth.
Newton's law is then phrased so that the sum of forces (torques) acting on a
body is equal to the time rate of change in inertial space of the body's linear
(angular) momentum.
\frac{d}{dt}(m\mathbf{V}) = \mathbf{F}, \qquad \frac{d}{dt}(\mathbf{I}\,\boldsymbol{\Omega}) = \mathbf{M}, \qquad (1)
where m = vehicle mass
V = inertial velocity vector
F = externally applied force vector
I = moment of inertia dyad
Ω = angular velocity vector
M = applied torque vector
Two coordinate frames of reference are important. The first is chosen to be
body fixed with its origin at the vehicle center of mass. This is the frame in which
thrust and aerodynamic forces and torques exist. Figure 5 illustrates the body-
fixed frame with the components of the vehicle translation and rotation velocities.

Figure 5. Body-fixed coordinate frame (stability axes): XB forward, YB right, ZB down, with velocity vector V = [U, V, W]ᵀ and angular velocity vector Ω = [P, Q, R]ᵀ.


The second coordinate frame of interest is the inertial or earth-fixed frame in which the gravity force and the vehicle kinematics (navigation or location variables) exist. Figure 6 illustrates the inertial frame and its relationship with the body-fixed frame.
Two assumptions are made at this point. Vehicle mass and mass distribution
are constant (i.e., rates of change of mass and moments of inertia are zero). Also,
the XB, ZB plane of Figure 5 is a plane of symmetry. This means that the left side
of the aircraft is the same as the right side and the cross moments of inertia Ixy, Iyz
are zero.
It is noted that the body-fixed frame is rotating with respect to the inertial
frame which gives rise to centripetal terms. Accounting for these along with
gravitational acceleration and the expansion of the angular momentum terms, the
following equations describe vehicle dynamics in the body-fixed frame:
\begin{aligned}
m a_x &= m(\dot U + QW - RV + g\sin\Theta) = X_F,\\
m a_y &= m(\dot V + RU - PW - g\cos\Theta\sin\Phi) = Y_F,\\
m a_z &= m(\dot W + PV - QU - g\cos\Theta\cos\Phi) = Z_F,\\
I_{xx}\dot P - I_{xz}\dot R + (I_{zz} - I_{yy})QR - I_{xz}PQ &= L,\\
I_{yy}\dot Q + (I_{xx} - I_{zz})PR - I_{xz}R^2 + I_{xz}P^2 &= M,\\
I_{zz}\dot R - I_{xz}\dot P + (I_{yy} - I_{xx})PQ + I_{xz}QR &= N, \qquad (2)
\end{aligned}

Figure 6. Inertial coordinate frame and transformation to body fixed (Euler angles; gravity vector mg).

Copyrighted Material
72 SCHLEY, CHAUVIN, VAN HENKLE

where
ax, ay, az = vehicle acceleration components in body-fixed frame
U, V, W = vehicle velocity components in body-fixed frame
P, Q, R = vehicle angular velocity components in body-fixed frame
Φ, Θ, Ψ = Euler transformation angles from inertial to body-fixed frames
XF, YF, ZF = thrust and aerodynamic forces applied to vehicle
Ixx, Ixz, Iyy, Izz = moment of inertia components
L, M, N = thrust and aerodynamic torques applied to vehicle
m, g = vehicle mass, gravitational acceleration
Projecting the vehicle translational and angular velocities onto the inertial
frame, the following equations are obtained describing the Euler angles and
vehicle kinematics:

\begin{aligned}
\dot\Psi &= \frac{Q\sin\Phi + R\cos\Phi}{\cos\Theta},\\
\dot\Theta &= Q\cos\Phi - R\sin\Phi,\\
\dot\Phi &= P + Q\tan\Theta\sin\Phi + R\tan\Theta\cos\Phi,\\
\begin{bmatrix}\dot X\\ \dot Y\\ \dot Z\end{bmatrix} &=
\begin{bmatrix}\cos\Psi & -\sin\Psi & 0\\ \sin\Psi & \cos\Psi & 0\\ 0 & 0 & 1\end{bmatrix}
\begin{bmatrix}\cos\Theta & 0 & \sin\Theta\\ 0 & 1 & 0\\ -\sin\Theta & 0 & \cos\Theta\end{bmatrix}
\begin{bmatrix}1 & 0 & 0\\ 0 & \cos\Phi & -\sin\Phi\\ 0 & \sin\Phi & \cos\Phi\end{bmatrix}
\begin{bmatrix}U\\ V\\ W\end{bmatrix}, \qquad (3)
\end{aligned}

where X, Y, Z = vehicle position components in inertial frame.

Linearized Aircraft Equations of Motion


The full 6-degree-of-freedom equations of motion, (2) and (3), are highly non-
linear and depend on complex aerodynamic effects. In order to perform analysis
and stability designs efficiently, the relations are linearized. From here on, we
confine our attention to the longitudinal/vertical plane of motion; that is, the XB,
ZB plane of Figure 5 and the X, Z plane of Figure 6.
The linearized equations of motion define incremental aircraft dynamics in the
longitudinal/vertical plane. They constitute the bare airframe velocity compo-
nents, the pitch rate and angle along with the aircraft position. They are devel-
oped by first assuming that the aircraft is flying in a trimmed condition (i.e., zero
translational and rotational accelerations). This gives rise to mean velocities for
the motion values (i.e., U0, W0, Q0). Small perturbations u, w, q about the mean
values are considered and the equations of motion are expanded to first order. For
the purposes at hand, the trim condition is assumed to consist of straight, wings-
level flight. Thus, the following additional assumptions are invoked. Perturba-
tions are small enough so that small angle approximations are valid (i.e., sin A =
A and cos A = 1). Also, products and squares of terms are negligible. Cross


perturbation effects (lateral to longitudinal, and vice versa) are ignored. Trim
values are assumed as constant true airspeed Vtas with U0 = Vtas, W0 = 0.
Angular velocity is Q0 = 0; angular position is Θ0 = Γ0 where Γ0 is the trim
flight path angle. The body-fixed coordinate frame thus consists of the aircraft
stability axes. Since the aircraft is trimmed at constant speed, the aircraft nose (or
wing chord) direction is actually elevated with respect to the body-fixed x axis.
Additionally, there is no rotation of the vertical relative to the inertial frame (i.e.,
no hypervelocity orbital flights). Then, the linearized equations are obtained as
follows:

\begin{aligned}
m[\dot u + (g\cos\Theta_0)\theta] &= dX_F,\\
m[\dot w - U_0 q + (g\sin\Theta_0)\theta] &= dZ_F,\\
I_{yy}\dot q &= dM, \qquad \dot\theta = q,\\
\dot x &= (U_0 + u)\cos\Theta_0 + (w - U_0\theta)\sin\Theta_0,\\
\dot h &= -\dot z = (U_0 + u)\sin\Theta_0 - (w - U_0\theta)\cos\Theta_0, \qquad (4)
\end{aligned}

where
u, w, q, θ, x, z = incremental aircraft dynamic and kinematic values
dXF, dZF, dM = incremental aerodynamic forces and moments
U0, Θ0, g = nominal speed, pitch angle, gravity
The aerodynamic forces and moments are typically functions of airspeed, air
density, and various aircraft parameters (e.g., lift and drag coefficients, surface
area, etc.). The following two assumptions are made at this time. Air-mass flow
is assumed quasisteady. This means that the aerodynamic forces and moments
depend only on the velocities of the vehicle and not on the rates of change of the
velocities. Also, atmospheric properties are constant and there are no Mach
number or altitude dependencies.
Then, the aerodynamic forces and moments can be expanded to first order by
considering partial derivatives with respect to each principal variable. In addi­
tion, since the airmass flow is also a function of wind gusts, these latter are
introduced into the force and moment expansions as follows:

\begin{aligned}
dX_F &= \frac{\partial X_F}{\partial U}(u - u_g) + \frac{\partial X_F}{\partial W}(w - w_g) + \frac{\partial X_F}{\partial \dot W}(\dot w - \dot w_g) + \frac{\partial X_F}{\partial Q}(q - q_g) + \frac{\partial X_F}{\partial \Delta}\,\delta,\\
dZ_F &= \frac{\partial Z_F}{\partial U}(u - u_g) + \frac{\partial Z_F}{\partial W}(w - w_g) + \frac{\partial Z_F}{\partial \dot W}(\dot w - \dot w_g) + \frac{\partial Z_F}{\partial Q}(q - q_g) + \frac{\partial Z_F}{\partial \Delta}\,\delta,\\
dM &= \frac{\partial M}{\partial U}(u - u_g) + \frac{\partial M}{\partial W}(w - w_g) + \frac{\partial M}{\partial \dot W}(\dot w - \dot w_g) + \frac{\partial M}{\partial Q}(q - q_g) + \frac{\partial M}{\partial \Delta}\,\delta, \qquad (5)
\end{aligned}

where
ug, wg, ẇg, qg = wind gust components
Δ, δ = terms due to control inputs
The pitch gust disturbance qg depends on spatial distributions of the wind
speeds around the aircraft. This can be simplified to

q_g = \frac{\partial w_g}{\partial x} = \frac{\partial w_g/\partial t}{\partial x/\partial t} = -\frac{1}{V_{tas}}\,\dot w_g, \qquad (6)
where qg = spatial distribution of wind gusts in pitch direction.

The force equations can be divided by mass and the moment equation by the moment of inertia. Neglecting the qg term and the velocity derivative terms, collecting other terms, and simplifying yields the complete longitudinal linearized equations in terms of stability derivatives. Note that the angle of attack α has been introduced, where α = (180/π)(w/Vtas). Also, note that the initial flight path angle Γ0 is assumed to be 0, implying a true straight and level trim condition.

$$
\begin{aligned}
\dot{u} &= X_u u + \frac{V_{tas}\pi}{180}X_w a + \frac{\pi}{180}X_q q - \frac{\pi}{180}g\theta + X_E\delta_E + X_T\delta_T - X_u u_g - X_w w_g,\\
\dot{a} &= \frac{180}{V_{tas}\pi}Z_u u + Z_w a + \frac{V_{tas} + Z_q}{V_{tas}}\,q + \frac{180}{V_{tas}\pi}\left(Z_E\delta_E + Z_T\delta_T - Z_u u_g - Z_w w_g\right),\\
\dot{q} &= \frac{180}{\pi}M_u u + V_{tas}M_w a + M_q q + \frac{180}{\pi}\left(M_E\delta_E + M_T\delta_T - M_u u_g - M_w w_g\right),\\
\dot{\theta} &= q,\\
\dot{x} &= (V_{tas} + u)\cos\theta + \frac{V_{tas}\pi}{180}\,a\sin\theta \approx V_{tas} + u,\\
\dot{h} &= (V_{tas} + u)\sin\theta - \frac{V_{tas}\pi}{180}\,a\cos\theta \approx -\frac{V_{tas}\pi}{180}\,a + \frac{V_{tas}\pi}{180}\,\theta,
\end{aligned}
\tag{7}
$$

where u, a, q, θ = incremental speed (ft/sec), angle of attack (deg), pitch rate (deg/sec), pitch angle (deg)
x, h = aircraft position (ft) along ground, altitude
u_g, w_g = wind gust velocity components (ft/sec)
δ_E, δ_T = incremental elevator, throttle settings
X_u, X_w, X_q, Z_u, Z_w, Z_q, M_u, M_w, M_q = stability derivatives of aircraft at trim condition
X_E, X_T, Z_E, Z_T, M_E, M_T = control derivatives of aircraft at trim condition
V_tas, g = nominal speed (ft/sec), gravity (ft/sec²)
Relations 7 provide the time behavior of speed (u), angle of attack (a), pitch rate (q), pitch angle (θ), and positions (x and h) in response to elevator and thrust commands (δ_E and δ_T) and horizontal and vertical wind gust speeds u_g and w_g. The parameter values listed below are typical of a large commercial transport in a configuration just before landing.

V_tas = 235.6 ft/sec, g = 32.2 ft/sec²
X_u = −0.03829, X_w = 0.051342, X_q = 0.08709, X_E = −0.00005, X_T = 0.15781
Z_u = −0.31338, Z_w = −0.60538, Z_q = 2.34657, Z_E = −0.14643, Z_T = −0.030963
M_u = −0.00036861, M_w = −0.0027481, M_q = −0.61162, M_E = −0.0080083, M_T = 0.00094824
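As a concrete illustration (ours, not from the chapter), the following Python sketch implements Relations 7 as a state-derivative function using the parameter values above. The state ordering, constant names, and function name are our own conventions, not the authors'.

```python
import numpy as np

# Parameter values for a large commercial transport just before landing.
VTAS, G = 235.6, 32.2
XU, XW, XQ, XE, XT = -0.03829, 0.051342, 0.08709, -0.00005, 0.15781
ZU, ZW, ZQ, ZE, ZT = -0.31338, -0.60538, 2.34657, -0.14643, -0.030963
MU, MW, MQ, ME, MT = -0.00036861, -0.0027481, -0.61162, -0.0080083, 0.00094824
D2R = np.pi / 180.0  # the pi/180 conversion factor appearing in Relations 7

def longitudinal_derivatives(state, dE, dT, ug, wg):
    """Right-hand side of Relations 7. state = [u, a, q, theta, x, h],
    with u in ft/sec, a, q, theta in degrees, x and h in ft."""
    u, a, q, theta, x, h = state
    udot = (XU * u + VTAS * D2R * XW * a + D2R * XQ * q - D2R * G * theta
            + XE * dE + XT * dT - XU * ug - XW * wg)
    adot = (ZU * u / (VTAS * D2R) + ZW * a + (VTAS + ZQ) / VTAS * q
            + (ZE * dE + ZT * dT - ZU * ug - ZW * wg) / (VTAS * D2R))
    qdot = (MU * u / D2R + VTAS * MW * a + MQ * q
            + (ME * dE + MT * dT - MU * ug - MW * wg) / D2R)
    thetadot = q
    xdot = VTAS + u                        # small-angle approximation
    hdot = VTAS * D2R * (theta - a)        # small-angle approximation
    return np.array([udot, adot, qdot, thetadot, xdot, hdot])
```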

Stability Augmentation Systems


The transient response characteristics of the longitudinal equations are typical of standard bare-airframe behavior. Two types of response are noticeable: a low-frequency, very lightly damped phugoid response and a higher-frequency, well-damped short-period response. This response pattern is entirely too oscillatory to apply glideslope and flare control laws. Hence, the longitudinal aircraft dynamics are provided with a stability augmentation system (pitch control system and autothrottle).
The function of the pitch augmentation system is to significantly damp the oscillatory phugoid behavior while providing reasonably fast pitch response to commands. The pitch stability augmentation system consists of proportional plus rate feedback combined with a pitch command to develop the aircraft elevator angle, as shown in Figure 7.
The function of the autothrottle, as shown in Figure 8, is to maintain constant airspeed. The aircraft throttle setting δ_T is based on proportional plus integral feedback of the airspeed error.

Figure 7. Pitch stability augmentation system. (Typical values: glideslope K_θ = 2.7982, K_q = 2.7982; flare K_θ = 11.414, K_q = 5.9729.)

The result is that the incremental speed u is commanded to be equal to the constant component of the horizontal wind, u_gc cos(ψ_wind) (see Equation 14 and Figure 8).
Relations 8 provide the equations representing the functions of the longitudinal/vertical stability augmentation system. These equations implement the diagrams of Figures 7 and 8.

$$
\begin{aligned}
\dot{x}_{VS1} &= -u + u_{gc}\cos(\psi_{wind}),\\
\delta_E &= K_q q + K_\theta\theta + K_\theta\theta_{cmd},\\
\delta_T &= K_T\omega_T x_{VS1} - K_T u + K_T u_{gc}\cos(\psi_{wind}),
\end{aligned}
\tag{8}
$$

where x_VS1 = autothrottle integrator
δ_E, δ_T = incremental elevator, throttle settings
u, q, θ = incremental speed (ft/sec), pitch rate (deg/sec), pitch angle (deg)
u_gc = constant component of the horizontal wind (ft/sec)
ψ_wind = direction of the constant wind
θ_cmd = incremental pitch angle command (deg)
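A minimal sketch (ours) of Equations 8 follows, with the Figure 8 autothrottle constants and the Figure 7 glideslope gains as defaults. All names, including `psi_wind`, are our own.

```python
import numpy as np

K_T, OMEGA_T = 3.0, 0.1   # autothrottle constants (Figure 8)

def stability_augmentation(u, q, theta, theta_cmd, x_vs1, ugc,
                           psi_wind=0.0, K_THETA=2.7982, K_Q=2.7982):
    """Equations 8: pitch SAS elevator law plus proportional-integral
    autothrottle. Returns the integrator derivative and the settings."""
    xdot_vs1 = -u + ugc * np.cos(psi_wind)        # airspeed-error integral
    dE = K_Q * q + K_THETA * theta + K_THETA * theta_cmd
    dT = K_T * OMEGA_T * x_vs1 - K_T * u + K_T * ugc * np.cos(psi_wind)
    return xdot_vs1, dE, dT
```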

Control Laws
Longitudinal controllers must be specified for both the glideslope tracking and
the flare modes. This involves determination of the pitch command value (θcmd)
referenced in Figure 7. Other than parameter values and an open-loop pitchup
command for flare, both of these controllers are constructed similarly. Figure 9
illustrates the architecture of the PID controller for both glideslope tracking and
flare.

Figure 8. Autothrottle. (Reference airspeed V_tas is compared with estimated airspeed V_tas + u − u_gc cos(ψ_wind); the error drives proportional plus integral feedback through the integrator x_VS1. Typical values: K_T = 3.0, ω_T = 0.1.)


Figure 9. Glideslope and flare controller architecture. (Inputs: h_cmd, h_est, ḣ_est; output: θ_cmd. Typical values: K_h = 0.20, ω_h = 0.10, K_ḣ = 0.32; θ_pitchup = 0 during glideslope and θ_pitchup = 3° during flare; flare begins when h ≤ h_f = 45 ft.)

Relations 9 comprise the equations representing the function of the glideslope and flare controller. They implement the diagram of Figure 9.

$$
\begin{aligned}
\dot{x}_{VC1} &= h_{est} - h_{cmd},\\
\theta_{cmd} &= K_h\omega_h x_{VC1} + K_h h_{est} + K_{\dot{h}}\dot{h}_{est} - K_h h_{cmd} - K_{\dot{h}}\dot{h}_{cmd} + \theta_{pitchup},
\end{aligned}
\tag{9}
$$

where x_VC1 = glideslope and flare controller integrator
θ_cmd = incremental pitch angle command (deg)
h_cmd, ḣ_cmd = altitude (ft) and altitude rate (ft/sec) commands
h_est, ḣ_est = altitude (ft) and altitude rate (ft/sec) estimates obtained from the complementary filter
θ_pitchup = open-loop pitchup command (deg) for flare
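The controller law of Equation 9 is simple enough to state directly in code. The following sketch (ours) uses the typical gains of Figure 9; the function and variable names are our own.

```python
K_H, OMEGA_H, K_HDOT = 0.20, 0.10, 0.32   # typical gains from Figure 9

def pitch_command(h_est, hdot_est, h_cmd, hdot_cmd, x_vc1, theta_pitchup=0.0):
    """Equation 9: PID pitch command from altitude and altitude-rate errors.
    Returns the integrator derivative and theta_cmd (deg)."""
    xdot_vc1 = h_est - h_cmd                  # altitude-error integral input
    theta_cmd = (K_H * OMEGA_H * x_vc1
                 + K_H * (h_est - h_cmd)
                 + K_HDOT * (hdot_est - hdot_cmd)
                 + theta_pitchup)
    return xdot_vc1, theta_cmd
```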

As shown in Figure 9, controller inputs consist of altitude commands along with aircraft altitude and altitude rate estimates. These estimates are obtained by complementary filtering of sensor data (accelerometer, altimeter, etc.). During glideslope tracking a third-order complementary filter is used, while a second-order filter is applied during flare. Figure 10 illustrates the complementary filter used during glideslope tracking. The functions of the glideslope complementary filter are implemented via Equations 10.

Figure 10. Glideslope complementary filter. (The filter blends the altimeter signal h, the pitch angle θ, and the vertical acceleration N_z to produce the estimates h_est and ḣ_est. Constants: L_XIRS = 74.2, K′ = 1.431 − (π/180)L_XIRS, k_1 = 0.15, k_2 = 0.0075, k_3 = 0.000125.)


$$
\begin{aligned}
\dot{x}_{VC2} &= -k_1 x_{VC2} + x_{VC3} + k_1 K'\theta + k_1 h,\\
\dot{x}_{VC3} &= -k_2 x_{VC2} + x_{VC4} + k_2 K'\theta + k_2 h + N_z,\\
\dot{x}_{VC4} &= -k_3 x_{VC2} + k_3 K'\theta + k_3 h,\\
N_z &= \frac{V_{tas}\pi}{180}(\dot{\theta} - \dot{a}) = -Z_u u - \frac{\pi}{180}\left(V_{tas}Z_w a + Z_q q\right) - Z_E\delta_E - Z_T\delta_T + Z_u u_g + Z_w w_g,\\
h_{est} &= x_{VC2}, \qquad \dot{h}_{est} = x_{VC3},
\end{aligned}
\tag{10}
$$

where x_VC2, x_VC3, x_VC4 = glideslope complementary filter integrators
N_z = incremental vertical acceleration (ft/sec²)
h_est, ḣ_est = altitude (ft) and altitude rate (ft/sec) estimates
θ, h = incremental pitch angle (deg), altitude (ft)
u, a, q = incremental speed (ft/sec), angle of attack (deg), pitch rate (deg/sec)
δ_E, δ_T = incremental elevator, throttle settings
u_g, w_g = longitudinal and vertical wind gust velocities (ft/sec)

Figure 11 shows the complementary filter used for flare. Its functions are
implemented by means of Equations 11.
$$
\begin{aligned}
\dot{x}_{VC2} &= -k_1 x_{VC2} + x_{VC3} + k_1 h,\\
\dot{x}_{VC3} &= -k_2 x_{VC2} + k_2 h + N_z,\\
N_z &= \frac{V_{tas}\pi}{180}(\dot{\theta} - \dot{a}) = -Z_u u - \frac{\pi}{180}\left(V_{tas}Z_w a + Z_q q\right) - Z_E\delta_E - Z_T\delta_T + Z_u u_g + Z_w w_g,\\
h_{est} &= x_{VC2}, \qquad \dot{h}_{est} = x_{VC3},
\end{aligned}
\tag{11}
$$

where x_VC2, x_VC3 = flare complementary filter integrators
N_z = incremental vertical acceleration (ft/sec²)
h_est, ḣ_est = altitude (ft) and altitude rate (ft/sec) estimates
θ, h = incremental pitch angle (deg), altitude (ft)
u, a, q = incremental speed (ft/sec), angle of attack (deg), pitch rate (deg/sec)
δ_E, δ_T = incremental elevator, throttle settings
u_g, w_g = longitudinal and vertical wind gust velocities (ft/sec)

Figure 11. Flare complementary filter. (The filter blends the altimeter signal h and the vertical acceleration N_z to produce h_est and ḣ_est. Constants: k_1 = 2.0, k_2 = 1.0.)
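A sketch (ours) of the second-order flare filter of Equations 11; names are our own.

```python
K1, K2 = 2.0, 1.0   # flare filter constants (Figure 11)

def flare_filter_derivatives(x_vc2, x_vc3, h, Nz):
    """Equations 11: blend altitude h and vertical acceleration Nz so that
    x_vc2 tracks altitude and x_vc3 tracks altitude rate."""
    xdot_vc2 = -K1 * x_vc2 + x_vc3 + K1 * h
    xdot_vc3 = -K2 * x_vc2 + K2 * h + Nz
    h_est, hdot_est = x_vc2, x_vc3
    return xdot_vc2, xdot_vc3, h_est, hdot_est
```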

The value of the altitude command (h_cmd) for the glideslope and flare controller of Figure 9 must also be determined. The glideslope command is developed so that a constant descent is commanded along an ILS path assumed fixed in space. Commanded altitude is thus a function of ground distance, which is defined as the distance to the glideslope predicted intercept point (GPIP) on the ground. The flare command consists of an adaptive ground speed exponential law (fixed in space rather than in time). Equations 12 and 13 show the glideslope and flare altitude command computation.

Glideslope (for x ≤ x_f = −h_f/tan γ_gs):

$$
\begin{aligned}
h_{cmd} &= -x\tan\gamma_{gs}, \qquad x(t_{gs}) = x_{gs} = -h_{gs}/\tan\gamma_{gs},\\
\dot{h}_{cmd} &= -\dot{x}\tan\gamma_{gs} = -(V_{tas} + u)\tan\gamma_{gs},
\end{aligned}
\tag{12}
$$

where h_cmd, ḣ_cmd = altitude (ft), altitude rate (ft/sec) commands; glideslope begins at t_gs = 0
x = negative of the ground range to the GPIP (ft)
γ_gs = desired glideslope angle, γ_gs = 2.75°
h_gs = altitude at beginning of the glideslope, h_gs = 300 ft
x_gs = x distance at beginning of the glideslope, x_gs = −6245.65 ft
Note: Values for h and x at the beginning of the glideslope are for simulation purposes only. Glideslope actually begins at about 1500 ft altitude.

Flare (for x > x_f = −h_f/tan γ_gs):

$$
\begin{aligned}
h_{cmd} &= \frac{h_f}{V_{tas}\tan\gamma_{gs} + \dot{h}_{TD}}\left[V_{tas}\tan\gamma_{gs}\,e^{-(x - x_f)/\tau_x} + \dot{h}_{TD}\right],\\
\dot{h}_{cmd} &= -\dot{x}\tan\gamma_{gs}\,e^{-(x - x_f)/\tau_x} = -(V_{tas} + u)\tan\gamma_{gs}\,e^{-(x - x_f)/\tau_x},\\
x_f &= -h_f/\tan\gamma_{gs}, \qquad \tau_x = \frac{h_f V_{tas}}{V_{tas}\tan\gamma_{gs} + \dot{h}_{TD}}.
\end{aligned}
\tag{13}
$$

When h_cmd(t_TD) = 0, x(t_TD) = x_f − τ_x log_e(−ḣ_TD/(V_tas tan γ_gs)) and ḣ_cmd(t_TD) = ((V_tas + u)/V_tas)ḣ_TD ≈ ḣ_TD,

where h_cmd, ḣ_cmd = altitude (ft), altitude rate (ft/sec) commands; flare begins at t = t_f
h_f = altitude at beginning of flare (ft), h_f = 45 ft
ḣ_TD = altitude rate (ft/sec) at touchdown time t_TD, ḣ_TD ≈ −1.5 ft/sec
x = negative of the ground range to the GPIP (ft)
x_f = x distance at beginning of flare
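The command laws of Equations 12 and 13 can be sketched as follows (our code; constant and function names are ours, and the numerical values are the ones quoted above).

```python
import numpy as np

VTAS = 235.6
GAMMA_GS = np.deg2rad(2.75)     # glideslope angle
H_F = 45.0                      # flare initiation altitude (ft)
HDOT_TD = -1.5                  # desired sink rate at touchdown (ft/sec)
X_F = -H_F / np.tan(GAMMA_GS)
TAU_X = H_F * VTAS / (VTAS * np.tan(GAMMA_GS) + HDOT_TD)

def altitude_command(x, u):
    """Equations 12 and 13: commanded altitude and altitude rate as a
    function of ground position x (negative range to the GPIP)."""
    if x <= X_F:                                  # glideslope segment
        h_cmd = -x * np.tan(GAMMA_GS)
        hdot_cmd = -(VTAS + u) * np.tan(GAMMA_GS)
    else:                                         # flare segment
        decay = np.exp(-(x - X_F) / TAU_X)
        h_cmd = H_F * (VTAS * np.tan(GAMMA_GS) * decay + HDOT_TD) \
                / (VTAS * np.tan(GAMMA_GS) + HDOT_TD)
        hdot_cmd = -(VTAS + u) * np.tan(GAMMA_GS) * decay
    return h_cmd, hdot_cmd
```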

Wind Disturbances
The environment influences the process through wind disturbances having two components: a constant-velocity component and turbulence. The magnitude of the constant-velocity component is a function of altitude (wind shear). Turbulence is more complex and is a temporal and spatial function as an aircraft flies through an airspace region. The constant-velocity wind component exists only in the horizontal plane (i.e., a combination of headwind and crosswind), and its value is given in Equation 14 as a logarithmic variation with altitude. The quantity H represents the constant wind component at an altitude of 510 ft. A typical value of H is 20 ft/sec. In the next section we explain that the network was trained with a distribution of constant wind components from H = −10 ft/sec to H = 40 ft/sec. Note that a wind model with H = 40 represents a very strong turbulent wind.

$$
u_{gc} = -H\left[1 + \frac{\log_e(h/510)}{\log_e(51)}\right],
\tag{14}
$$

where u_gc = constant (altitude shear) component of u_g, zero at 10-ft altitude
H = wind speed at 510-ft altitude (typical value = 20 ft/sec)
h = aircraft altitude
For the horizontal and vertical wind turbulence velocities, the Dryden spectra
(Neuman and Foster, 1970) for spatial turbulence distribution are assumed.
These spectra involve a wind disturbance model frozen in time. This is not
seriously limiting since the aircraft moves in time through the wind field and thus
experiences temporal wind variation. The Dryden spectra are also amenable to
simulation and show reasonable agreement with measured data. The generation
of turbulence velocities is effected by the application of Gaussian white noise to
coloring filters. This provides the proper correlation to match the desired spectra.
Figures 12 and 13 summarize the turbulence calculations, while Equations 15 implement them.
$$
\begin{aligned}
\dot{x}_{dry1} &= -a_u x_{dry1} + a_u\sigma_u\sqrt{\frac{2}{a_u\Delta t}}\,N(0,1),\\
\dot{x}_{dry2} &= -a_w x_{dry2} + a_w x_{dry3} + a_w\frac{a_w}{b_w}\,\sigma_w b_w\sqrt{\frac{3}{a_w^3\Delta t}}\,N(0,1),\\
\dot{x}_{dry3} &= -a_w x_{dry3} + a_w\left(1 - \frac{a_w}{b_w}\right)\sigma_w b_w\sqrt{\frac{3}{a_w^3\Delta t}}\,N(0,1),\\
u_g &= u_{gc} + x_{dry1}, \qquad w_g = x_{dry2},
\end{aligned}
\tag{15}
$$


Figure 12. Horizontal plane wind disturbance. (σ_u = 0.2|u_gc|; a_u = V_tas/L_u; L_u = 100 h^(1/3) for h > 230 ft, L_u = 600 for h ≤ 230 ft; where V_tas, h, Δt = nominal aircraft speed (ft/sec), aircraft altitude (ft), and simulation time step; L_u, σ_u = scale length (ft) and turbulence standard deviation (ft/sec); N(0,1) = Gaussian white noise with zero mean and unity standard deviation.)

where x_dry1, x_dry2, x_dry3 = Dryden wind disturbance integrators
u_g, w_g = turbulent wind speed components (ft/sec)
N(0,1) = Gaussian white noise with zero mean and unity standard deviation
ψ_wind = direction of the constant wind
u_gc = constant wind speed (ft/sec)
Δt = simulation time step (sec)
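A sketch (ours) of the wind model: the shear law of Equation 14 plus a simple Euler discretization of the Dryden coloring filters of Equations 15. The placement of the a_w/b_w coefficients in the vertical filter follows our reconstruction of Equations 15, which is uncertain in this scan and should be checked against the original; everything else follows Figures 12 and 13.

```python
import numpy as np

VTAS = 235.6

def ugc_shear(h, H=20.0):
    """Equation 14: constant (altitude-shear) horizontal wind, zero at 10 ft."""
    return -H * (1.0 + np.log(max(h, 10.0) / 510.0) / np.log(51.0))

def dryden_step(x1, x2, x3, h, ugc, dt, rng):
    """One Euler step of the Dryden filters (Equations 15, Figures 12-13)."""
    Lu = 100.0 * h ** (1.0 / 3.0) if h > 230.0 else 600.0
    Lw = max(h, 1.0)                    # vertical scale length equals altitude
    au, aw = VTAS / Lu, VTAS / Lw
    bw = VTAS / (np.sqrt(3.0) * Lw)
    su = 0.2 * abs(ugc)
    sw = su * (0.5 + 0.00098 * h) if h <= 500.0 else su
    n1, n2 = rng.standard_normal(2)
    gw = sw * bw * np.sqrt(3.0 / (aw ** 3 * dt)) * n2
    x1 += dt * (-au * x1 + au * su * np.sqrt(2.0 / (au * dt)) * n1)
    x2 += dt * (-aw * x2 + aw * x3 + aw * (aw / bw) * gw)
    x3 += dt * (-aw * x3 + aw * (1.0 - aw / bw) * gw)
    return x1, x2, x3, ugc + x1, x2     # ug = ugc + x_dry1, wg = x_dry2
```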

Initial Conditions
In order to begin a simulation of the flight of the aircraft, all dynamic variables must have initial conditions specified. For the longitudinal/vertical equations of motion (Equations 7) for the bare airframe, initial conditions are specified by placing the aircraft on the glideslope in a steady-state condition. This means that initial values for the u, a, q, θ, x, and h variables of Relations 7 are obtained according to the following assumptions: u_0 = u_g0, q_0 = 0, u̇_0 = ȧ_0 = q̇_0 = θ̇_0 = 0, and ḣ_0 = −ẋ_0 tan γ_gs, where u_g0 is the initial longitudinal constant wind speed and γ_gs is the glideslope angle. Substituting these conditions into Relations 7

Figure 13. Vertical plane wind disturbance. (σ_w = 0.2|u_gc|(0.5 + 0.00098h) for 0 ≤ h ≤ 500, σ_w = 0.2|u_gc| for h > 500; a_w = V_tas/L_w; b_w = V_tas/(√3 L_w); L_w = h; where V_tas, h, Δt = nominal aircraft speed (ft/sec), aircraft altitude (ft), and simulation time step; L_w, σ_w = scale length (ft) and turbulence standard deviation (ft/sec); N(0,1) = Gaussian white noise with zero mean and unity standard deviation.)


provides the results of Relations 16, whose solution determines the initial conditions for the variables u_0, a_0, q_0, θ_0, x_0, h_0, δ_E0, and δ_T0.

Initial Conditions for Longitudinal/Vertical Aircraft Variables

u_0 = u_g0, q_0 = 0, x_0 = −h(t_0)/tan γ_gs, h_0 = h(t_0),

$$
\begin{bmatrix}
\dfrac{V_{tas}\pi}{180}X_w & -\dfrac{\pi}{180}g & X_E & X_T\\[4pt]
Z_w & 0 & \dfrac{180}{V_{tas}\pi}Z_E & \dfrac{180}{V_{tas}\pi}Z_T\\[4pt]
V_{tas}M_w & 0 & \dfrac{180}{\pi}M_E & \dfrac{180}{\pi}M_T\\[4pt]
-\dfrac{V_{tas}\pi}{180} & \dfrac{V_{tas}\pi}{180} & 0 & 0
\end{bmatrix}
\begin{bmatrix} a_0\\ \theta_0\\ \delta_{E0}\\ \delta_{T0} \end{bmatrix}
=
\begin{bmatrix} 0\\ 0\\ 0\\ -(V_{tas} + u_{g0})\tan\gamma_{gs} \end{bmatrix},
\tag{16}
$$

where u_g0 = u_gc(h_0) = longitudinal constant wind speed (ft/sec).
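Relations 16 amount to a 4 × 4 linear solve, sketched here (our code) with numpy; constant and function names are our own.

```python
import numpy as np

VTAS, G = 235.6, 32.2
D2R = np.pi / 180.0
GAMMA_GS = np.deg2rad(2.75)
XW, XE, XT = 0.051342, -0.00005, 0.15781
ZW, ZE, ZT = -0.60538, -0.14643, -0.030963
MW, ME, MT = -0.0027481, -0.0080083, 0.00094824

def glideslope_trim(h0, ug0):
    """Relations 16: solve for [a0, theta0, dE0, dT0] given initial altitude
    h0 and the constant wind speed ug0 at that altitude."""
    A = np.array([
        [VTAS * D2R * XW, -D2R * G,    XE,                XT],
        [ZW,              0.0,         ZE / (VTAS * D2R), ZT / (VTAS * D2R)],
        [VTAS * MW,       0.0,         ME / D2R,          MT / D2R],
        [-VTAS * D2R,     VTAS * D2R,  0.0,               0.0],
    ])
    b = np.array([0.0, 0.0, 0.0, -(VTAS + ug0) * np.tan(GAMMA_GS)])
    a0, theta0, dE0, dT0 = np.linalg.solve(A, b)
    u0, q0 = ug0, 0.0
    x0 = -h0 / np.tan(GAMMA_GS)
    return np.array([u0, a0, q0, theta0, x0, h0]), dE0, dT0
```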

Simulation
To simulate the flight of the aircraft, the aircraft states (i.e., u, a, q, θ, x, and h) given in Equations 7 must be solved by some method for the solution of differential equations (e.g., Runge-Kutta). The values of the elevator and throttle settings in Equations 7 (i.e., δ_E and δ_T) are obtained by implementing Equations 8, which are implied by the stability augmentation diagrams of Figures 7 and 8. Here, the symbol s indicates the Laplace operator (derivative with respect to time). Input to the pitch stability augmentation system (θ_cmd) is calculated by implementing the controller shown in Figure 9, where flare takes place at about 45 ft of altitude. Inputs to the glideslope and flare controller (h_est and ḣ_est) are determined by means of the glideslope and flare complementary filters given by Equations 10 and 11. The h_cmd input to the glideslope and flare controller is provided by Equations 12 and 13. Finally, the wind disturbance inputs to Equations 7 (i.e., u_g and w_g) are generated by means of the average horizontal speed given by Equation 14 and the Dryden spectra turbulence calculations of Equations 15.
A sample scenario for the generation of a trajectory involves a descent along the glideslope from an initial altitude of 300 ft, followed by flare at 45 ft altitude and then touchdown. Both nominal (disturbance-free) conditions and logarithmic altitude shear turbulent conditions should be considered for completeness.
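Putting the pieces together, a simulation loop in the spirit of this section might look as follows. This sketch (ours) uses simple Euler integration where the chapter suggests Runge-Kutta, assumes perfect altitude and altitude-rate estimates (as the learning section below also does), and relies on the helper functions sketched earlier (`longitudinal_derivatives`, `stability_augmentation`, `pitch_command`, `altitude_command`, `ugc_shear`, `dryden_step`), all of which are our own names.

```python
import numpy as np

def simulate(state, dt=0.05, t_end=60.0, H=20.0, seed=0):
    """Euler integration of one landing trajectory until touchdown."""
    rng = np.random.default_rng(seed)
    x_vs1 = x_vc1 = x1 = x2 = x3 = 0.0      # controller and wind integrators
    for _ in range(int(t_end / dt)):
        u, a, q, theta, x, h = state
        if h <= 0.0:                        # touchdown
            break
        ugc = ugc_shear(h, H)
        x1, x2, x3, ug, wg = dryden_step(x1, x2, x3, h, ugc, dt, rng)
        h_cmd, hdot_cmd = altitude_command(x, u)
        pitchup = 3.0 if h <= 45.0 else 0.0           # flare pitchup command
        hdot = VTAS * np.pi / 180.0 * (theta - a)     # perfect rate estimate
        xdot_vc1, theta_cmd = pitch_command(h, hdot, h_cmd, hdot_cmd,
                                            x_vc1, pitchup)
        xdot_vs1, dE, dT = stability_augmentation(u, q, theta, theta_cmd,
                                                  x_vs1, ugc)
        state = state + dt * longitudinal_derivatives(state, dE, dT, ug, wg)
        x_vc1 += dt * xdot_vc1
        x_vs1 += dt * xdot_vs1
    return state
```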

NEURAL NETWORK LEARNING IMPLEMENTATION

In this section, we present our approach for developing control laws for the
aircraft model described in the preceding section and relate it to classical and
modern control theories. We confine our attention to the glideslope and flare and
do not consider the lateral control system. Additionally, we do not model the
complementary filters and assume that their function is perfect. At any rate, in
the classical approach, controllers are often specified as PID devices for both the
glideslope and flare modes. Other than parameter values and an open-loop
pitchup command for flare, both of these controllers are constructed similarly.
Recall that Figure 9 illustrates the conventional controller architecture for both
glideslope and flare.
As previously noted, modern control theory suggests that a performance index
for evaluating control laws should first be constructed, and then the control law
should be computed to optimize the performance index. When closed form
solutions are not available, numerical methods for estimating the parameters of a
control law may be developed. Neural network algorithms can actually be seen
as constituting such numerical methods (Narendra and Parthasarathy, 1990;
Bryson and Ho, 1969; Le Cun, 1989). We present here an implementation of a
neural network algorithm to address the aircraft landing problem.

Difference Equations
The state of the aircraft (including stability augmentation and autothrottle) can be represented by the following eight-dimensional vector:

X_t = [u_t a_t q_t θ_t ḣ_t h_t x_7t x_8t]^T. (17)

State variables u_t, a_t, q_t, θ_t, ḣ_t, and h_t correspond to the aircraft state variables per se. Variable x_7t originates from the autothrottle. Variable x_8t computes the integral of the difference between actual altitude h and desired altitude h_cmd over the entire trajectory. Alternatively, the state variables x_7t and x_8t can be considered as being internal to the network controller (see below). The difference equations describing the dynamics of the controlled plant can be written as

X_{t+1} = A_t X_t + B_t U_t + C D_t + N_t, (18)
A_t = Z_t A^gs + (1 − Z_t)A^fl, (19)
B_t = Z_t B^gs + (1 − Z_t)B^fl, (20)
Z_t = S((h_t − h_f)σ_f), (21)
S(x) = 1/(1 + exp(−x)), (22)

where A represents the plant dynamics and B represents the aircraft response to the control U. The matrix C is used to compute x_8 from the desired state D_t

containing desired altitude, desired altitude rate, and desired ground position, as
obtained from nominal glideslope and flare trajectories. N is the additive noise
computed from the wind model. The matrices Ags, Af1, Bgs, and Bf1 are constant.
The variable Zt generates a smooth transition between glideslope and flare dy­
namics and makes the cost function J differentiable over the whole trajectory.
The switching controller described in Section 2 can be written as
Ut = PTLt where Pt = S([VXt + q]σ) and Lt = W[Xt –
RDt] + r, (23)
where the function S(x) is the logistic function 1/(1 + exp (–x)) taken over each
element of the vector x and σ is an associated slope parameter. The weight matrix
V links actual altitude h to each switch unit in Pt (the switch is static). The weight
matrix W links altitude error, altitude rate error, and altitude integral error to each
linear unit in Lt.
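Equation 23 is compact enough to transcribe directly. In this sketch (ours), `qb` stands for the bias vector q of Equation 23, renamed only to avoid a clash with pitch rate; the shapes are our assumptions about a reasonable encoding.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def controller(Xt, Dt, V, qb, W, r, R, sigma=1.0):
    """Equation 23: Ut = Pt^T Lt, with Pt = S((V Xt + qb) * sigma) and
    Lt = W (Xt - R Dt) + r. V, W: (blocks, n_state); R: (n_state, n_desired)."""
    Pt = sigmoid((V @ Xt + qb) * sigma)   # switch activation, one per block
    Lt = W @ (Xt - R @ Dt) + r            # linear (PID-style) term per block
    return float(Pt @ Lt)                 # blended scalar control
```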
Figure 14. Architecture of the neural network. On the left, implementation of the plant dynamics and controller; in the center, the desired trajectory; on the right, implementation of the performance constraints. (The network propagates X_t and D_t forward to X_{t+1} and D_{t+1}; the outputs h_{t+1} and ḣ_{t+1} are compared with their commands to form the error cost J_E and the damping-judge penalty J_P.) Connections A, B, V, and W are explained in the text along with other connections.

Figure 14 shows a network implementation of the equations, where each unit computes a linear mapping of h_t − h_cmd,t, x_8t, and ḣ_t − ḣ_cmd,t. Thus, the network controller forms a PID dynamic weighting of the altitude error h_t − h_cmd,t.
Initially, we chose two basic controller blocks (see Figure 3) to represent
glideslope and flare. The sigmoidal selection for each block is based on altitude
alone. In this sense, we use our a priori knowledge of the physics of the plant to
adapt the complexity of the network controller to the control task. The task of the
network is then to learn the state-dependent PID controller gains that optimize a
cost function in given environmental conditions.

Performance Index Optimization


The main optimization problem is now stated. Given an initial state X_1 and initial desired state D_1 = [h_cmd,1 ḣ_cmd,1]^T, minimize the expected value

$$
E[J] = \int J\,p(X_1\cdots X_T)\,dX_1\cdots dX_T,
\qquad
J = J_E = \sum_{t=1}^{T}\left(a_h[h_{cmd,t} - h_t]^2 + a_{\dot h}[\dot{h}_{cmd,t} - \dot{h}_t]^2\right),
\tag{24}
$$

with respect to V, r, W, and q. Note that the cost function E[J] assigns a cost to a particular control law parameterized by V, r, W, and q using knowledge of the stochastic plant model described by the distribution p(X_1 ⋯ X_T). Note also that the quadratic cost function J is parameterized by an arbitrary choice of the weighting parameters a_h and a_ḣ (we used a_h = a_ḣ = 1).
When σ is large, the slope of the sigmoid S in Equation 23 becomes large and the associated switching response of the units P becomes sharp. This solution is equivalent to dividing the entire trajectory into a set of linear plants and finding the optimal linear control law for each individual linear plant. If σ is of moderate magnitude (e.g., σ ≈ 1), nonlinear switching and blending of linear control laws via the sigmoidal switching units is a priori permitted by the architecture. One of the main points of our approach resides in this switching/blending solution (through learning of V and q) as a possible minimum of the optimized cost function.
Equations 18 and 23 describing plant and controller dynamics can be represented in a network, as well as the desired trajectory dynamics (see Figure 14). Note that the resulting network is composed of six learnable weights for each basic controller block: three PID weights plus one bias weight (W weights) and two switch unit weights (V weights). Actual and desired state vectors at time t + 1 are fed back to the input layers. Thus, with recurrent connections between output and input layers, the network generates entire trajectories and can be seen as a recurrent back-propagation network (Rumelhart, Hinton, & Williams, 1986; Nguyen & Widrow, 1990; Jordan & Jacobs, 1990). The network is then trained using the back-propagation algorithm given wind distributions.
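The chapter computes the gradient of E[J] by back-propagating through the unrolled trajectory network. As a compact stand-in for that analytic backward pass, the following sketch (ours) estimates the same gradient by simultaneous perturbation (SPSA); `rollout_cost` is a hypothetical function that simulates one trajectory under a sampled wind condition and returns J from Equation 24.

```python
import numpy as np

def train(rollout_cost, params, lr=1e-3, iters=1000, eps=1e-2, seed=0):
    """Stochastic gradient descent on E[J] with an SPSA gradient estimate
    (standing in for the analytic back-propagation gradient)."""
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        wind_seed = int(rng.integers(1 << 30))     # one wind sample per step
        delta = rng.choice([-1.0, 1.0], size=params.shape)
        jp = rollout_cost(params + eps * delta, wind_seed)
        jm = rollout_cost(params - eps * delta, wind_seed)
        grad_est = (jp - jm) / (2.0 * eps) * delta
        params = params - lr * grad_est
    return params
```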


Performance Constraints
As previously noted, the optimization procedure can also depend on performance
constraints. The example we use involves relative stability constraints. As will be
seen in the fifth section, the unconstrained solution shows lightly damped aircraft
responses during glideslope. Here, a penalty on "bad" damping is formulated.
In order to perform a stability analysis, the controlled autoland system is structured as shown in Figure 15. The aircraft altitude responds to values of θ_cmd which are generated by the controller. The controller, whether conventional or neural network, can be represented as a PID operation.
The aircraft response denoted in Figure 15 (h in response to θ_cmd) consists of the dynamics and kinematics of the airframe along with the pitch stability augmentation system and the autothrottle. This response can be represented as a set of differential equations or, more conveniently, as a Laplace transform transfer function. Equations 25 and 26 provide the transfer functions during glideslope and flare.

Glideslope:

$$
F(s) = \frac{0.409736\,(s + 3.054256)(s - 2.288485)(s + 0.299794)(s + 0.154471)}{s\,(s^2 + 2.216836s + 2.407746)(s^2 + 0.673172s + 0.114482)(s + 0.091075)}.
\tag{25}
$$

Flare:

$$
F(s) = \frac{1.671273\,(s + 3.049899)(s - 2.294112)(s + 0.561642)(s + 0.123544)}{s\,(s^2 + 3.498162s + 6.184131)(s^2 + 1.057216s + 0.280124)(s + 0.103319)}.
\tag{26}
$$

In Equations 25 and 26, note the appearance of the short period and the damped phugoid responses (the two quadratic terms in the denominators of the transfer functions). Note also that there is a real positive zero in the numerators, leading to stability concerns, since the closed-loop roots (eigenvalues) could have positive real parts for some gain values. A positive eigenvalue means instability.
Figure 15. Closed-loop autoland system. (The controller G(s) maps the altitude error h_cmd − h to θ_cmd; the aircraft response F(s) maps θ_cmd to h. Closed-loop transfer function: h(s)/h_cmd(s) = G(s)F(s)/[1 + G(s)F(s)].)

Figure 16. Two-unit neural network autoland controller. (Each controller block i contains a sigmoidal switch unit with bias v_0i and altitude weight v_1i, and a linear unit applying bias w_0i and proportional, integral, and derivative weights w_1i, w_2i, w_3i to the altitude error; each block's linear output is multiplied by its switch activation, and the two products are summed to form θ_cmd.)

A schematic of a two-unit neural network controller is shown in Figure 16. In function, the network can be viewed as performing a PID operation on the altitude error h_cmd − h. However, the gains on each term are determined by altitude through the sigmoidal switch. Thus, for a particular frozen altitude point along the aircraft flight trajectory, the controller can be represented by the transfer function of Equation 27. Equation 27 also represents the function of the conventional controller.

$$
G(s) = \frac{\theta_{cmd}(s)}{h(s) - h_{cmd}(s)} = K_h + K_{Ih}\,\frac{1}{s} + K_{\dot h}\,s = \frac{K_{\dot h}s^2 + K_h s + K_{Ih}}{s}.
\tag{27}
$$
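As a concrete illustration of this frozen-point analysis, the following sketch (ours) forms the closed-loop characteristic polynomial 1 + G(s)F(s) = 0 from Equations 25 and 27 and reports the smallest damping ratio among the oscillatory roots. The gain values are illustrative only (taken from Figure 9's typical glideslope values, with K_Ih = K_h ω_h), and the loop sign follows Equations 25 through 27 as printed.

```python
import numpy as np

# Glideslope airframe transfer function F(s), Equation 25.
num = 0.409736 * np.poly([-3.054256, 2.288485, -0.299794, -0.154471])
den = np.polymul(np.polymul([1.0, 2.216836, 2.407746],
                            [1.0, 0.673172, 0.114482]),
                 np.poly([0.0, -0.091075]))

def min_damping(Kh, KIh, Khd):
    """Smallest damping ratio zeta = -Re(s)/|s| over the oscillatory
    closed-loop roots, for frozen PID gains in G(s) of Equation 27."""
    g_num = np.array([Khd, Kh, KIh])      # Khd*s^2 + Kh*s + KIh
    g_den = np.array([1.0, 0.0])          # controller denominator: s
    char = np.polyadd(np.polymul(den, g_den), np.polymul(num, g_num))
    roots = np.roots(char)
    osc = roots[np.abs(roots.imag) > 1e-9]
    return (-osc.real / np.abs(osc)).min() if osc.size else 1.0

print(min_damping(Kh=0.20, KIh=0.020, Khd=0.32))  # gains illustrative only
```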
Taking account of the piecewise linear plant and assuming for the moment constant weights, the aircraft can have transient oscillatory responses of the form e^(−ζωt) cos(√(1 − ζ²) ωt + φ), where ζ is a damping factor, ω is a frequency, and φ is a phase angle. The damping factor ζ is of particular importance, since a small value gives rise to oscillations that persist in time. An analysis of the smallest damping factor was performed by probing the plant with a range of values for the controller weight parameters. These data were then used to train a standard feedfor-