Learning Piecewise Control Strategies in a Modular Neural Network Architecture


IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, VOL. 23, NO. 2, MARCH/APRIL 1993

Robert A. Jacobs and Michael I. Jordan

Abstract-The dynamics of nonlinear systems often vary qualitatively over their parameter space. Methodologies for designing piecewise control laws for dynamical systems, such as gain scheduling, are useful because they circumvent the problem of determining a single global model of the plant dynamics. Instead, the dynamics are approximated using local models that vary with the plant's operating point. When a controller is learned instead of designed, analogous issues arise. This article describes a multinetwork, or modular, neural network architecture that learns to perform control tasks using a piecewise control strategy. The architecture's networks compete to learn the training patterns. As a result, a plant's parameter space is adaptively partitioned into a number of regions, and a different network learns a control law in each region. This learning process is described in a probabilistic framework and learning algorithms that perform gradient ascent in a log likelihood function are discussed. Simulations show that the modular architecture's performance is superior to that of a single network on a multipayload robot motion control task.

Fig. 1. A plant controlled by a gain scheduling controller. [The scheduled feedback gains depend on the operating conditions; the controller issues a control command to the plant, whose output is fed back to the controller.]

I. INTRODUCTION

WHEN attempting to satisfy the requirements of nonlinear control system design, it is often difficult to find continuous control laws that are useful in all the relevant regions of a plant's parameter space. If it is known how the dynamics of a plant change with its operating conditions, then it may be possible to design a piecewise controller that uses different control laws when the plant is operating under different conditions. This principle underlies the design methodology called gain scheduling. In a classical gain scheduled design procedure [1], the designer judiciously selects several operating points that cover the range of the plant's dynamics. At each point, the designer constructs a linear time-invariant approximation to the plant and a linear compensator for the linearized plant. Between operating points the gains of the compensators are interpolated or scheduled. As is illustrated in Fig. 1, gain scheduling results in a feedback control system in which the feedback gains are adjusted using feedforward compensation [2].

An advantage of gain scheduling over more conventional design methodologies is that it circumvents the problem of determining a fixed global model of the plant dynamics. Instead, the designer approximates the plant dynamics using local models that vary with the plant's operating point in a predetermined manner. Unfortunately, the open-loop nature of gain scheduling means that it shares many of the limitations of other model-based feedforward compensation approaches. Accurate models of nonlinear plants, whether global or local, are often hard to formulate and model parameters are often difficult to determine. Additionally, such models may be inflexible in the sense that they are closely tied to a particular parameterization of the plant and require significant modification if the plant is altered or if the class of reference trajectories is changed. Consequently, it may be desirable to utilize learning control techniques that permit the implementation of feedforward compensators without detailed prior knowledge of the plant dynamics. Such techniques effectively place the compensator in an error-correcting feedback loop.

Nonlinear plants may be difficult to control regardless of whether the control law is designed by a human or learned by a machine. When the control law is designed, gain scheduling is an effective design methodology because it decomposes a task into sub-tasks; the designer using this methodology partitions a complex control task into a number of simpler control tasks. In this article we argue that task decomposition is also a useful technique when the control law is learned. We suggest that an ideal controller is one that uses local models of the plant dynamics, like gain scheduling controllers, and learns useful control laws despite initial uncertainties about the plant or environment, like learning controllers.

This article presents a multinetwork, or modular, neural network architecture that learns to perform nonlinear control tasks using a piecewise control strategy. Section II motivates the need for modular neural network architectures. Section III gives a probabilistic characterization of the class of control tasks that we are interested in and presents the modular

Manuscript received July 16, 1991; revised November 20, 1992. This work was supported in part by a postdoctoral fellowship from the McDonnell-Pew Program in Cognitive Neuroscience, in part by ATR Auditory and Visual Perception Research Laboratories, in part by the Siemens Corporation, in part by the McDonnell-Pew Foundation, in part by the Human Frontier Science Program, in part by ONR grant N00014-90-J-1942, and in part by NSF grant IRI-9013991.
R. A. Jacobs is with the Department of Psychology, University of Rochester, Rochester, NY 14627.
M. I. Jordan is with the Department of Brain & Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139.
IEEE Log Number 9207057.

0018-9472/93$03.00 © 1993 IEEE

Authorized licensed use limited to: Peking University. Downloaded on December 09,2024 at 08:47:48 UTC from IEEE Xplore. Restrictions apply.

architecture that we have developed. The architecture's learning algorithm is shown to perform gradient ascent in an appropriate log likelihood function. Sections IV and V report the results of training the architecture on a multipayload robotic motion control task. Section VI draws final conclusions and suggests possible extensions.

II. MOTIVATION FOR MODULAR NEURAL NETWORK ARCHITECTURES

Fully connected layered neural networks are capable in principle of approximating any well-behaved function [3]. This does not imply, however, that it is equally easy to learn to represent any function from a finite amount of training data. Indeed, because of the global nature of the approximations obtained in fully connected networks, it is generally difficult to train such networks when the data are sampled from an underlying function that has significant variation on a local or intermediate scale. In such cases the network may require an excessive amount of training data in order to yield reasonable generalization.

Sutton [4] identified a particular instance of this problem that we refer to as "temporal crosstalk" [5]. Suppose that a network is trained on one task until some criterion is reached and then is switched to a second task that is incompatible with the first. As Sutton pointed out, fully connected networks tend to alter the weights of the hidden units utilized for the first task rather than recruit new hidden units. Consequently, after learning to perform the second task, the network is no longer able to perform the first task. If the training regime alternates between tasks, the network may eventually learn both tasks; however, its learning speed and generalization ability are adversely affected by the blocked presentation of incompatible training data.

The issue of temporal crosstalk is particularly salient in control problems. Because controlled dynamical systems tend to move relatively slowly through state space, the training data are inherently available in blocked format. Temporal crosstalk is inevitable if the dynamics of the plant vary at different operating points.

As is demonstrated in the following, modular networks deal better with temporal crosstalk than do fully connected networks. If a modular architecture is presented with blocks of incompatible training data, it tends to allocate different networks to different blocks. Each network receives data only from a single task and is therefore immune to temporal crosstalk between tasks.

An additional advantage of modular networks is that they can be structured more easily than fully connected networks. A modular architecture may contain a variety of types of network modules (e.g., networks with different topologies) that are more or less appropriate for particular tasks. By matching the module to the task, the system can achieve generalization superior to what can be achieved in a single multipurpose network. Designers of learning controllers often have some knowledge, albeit incomplete, of the plant dynamics; thus it may be feasible to choose particular classes of network modules that are appropriate for particular tasks. Also, by partitioning a complex mapping, modular architectures tend to find representations that are more easily interpretable than those in fully connected networks. This is useful for analysis and can be useful in incremental design procedures.

In the next section we present a probabilistic characterization of the tasks for which a modular architecture is expected to perform well. The probabilistic model is a regression model that relies on a partitioning of the input space. Within each partition a particular stochastic process describes the most probable mapping from inputs to outputs. If these processes each have a relatively simple characterization in terms of network parameterizations, then an architecture that allocates training data from different partitions of the input space to different networks should be expected to perform well. The problem of deciding how to allocate training data to particular networks is a significant part of the problem of learning in such a modular architecture.

III. A MODULAR NEURAL NETWORK ARCHITECTURE

We have developed a modular neural network architecture that learns to partition a task into two or more functionally independent tasks and allocates different networks to learn each task. An appropriate task decomposition is discovered by forcing the networks comprising the architecture to compete to learn the training patterns. As a result of the competition, different networks learn different training patterns and, thus, learn to compute different functions. The architecture was first presented in Jacobs, Jordan, Nowlan, and Hinton [6], and combines earlier work on learning task decompositions by Jacobs, Jordan, and Barto [5] with the mixture models view of competitive learning advocated by Nowlan [7] and Hinton and Nowlan [8].

The architecture, which is illustrated in Fig. 2, consists of two types of networks: expert networks and a gating network. The expert networks compete to learn the training patterns and the gating network mediates this competition. The expert networks and the gating network are layered or recurrent neural networks with arbitrary connectivity. The gating network is restricted to have as many output units as there are expert networks, and the activations of these output units must be nonnegative and sum to one. To meet these constraints, we use the "softmax" activation function [9]; specifically, the activation of the ith output unit of the gating network, denoted g_i, is

    g_i = exp(s_i / T) / Σ_{j=1}^{n} exp(s_j / T)    (1)

where s_i denotes the weighted sum of unit i's inputs, T denotes a "temperature" parameter [10], and n denotes the number of expert networks. The output vector of the entire architecture, denoted y, is

    y = Σ_{i=1}^{n} g_i y_i    (2)

where y_i denotes the output vector of the ith expert network. During training, the weights of the expert and gating net-


works are adjusted simultaneously using the backpropagation algorithm [11]-[14] so as to maximize the cost functional

    ln L = ln Σ_{i=1}^{n} g_i exp( -‖y* − y_i‖² / (2σ_i²) )    (3)

where y* denotes the target output vector and σ_i² denotes a scaling parameter associated with the ith expert network.

Fig. 2. A modular neural network architecture. [Two expert networks and a gating network; the architecture's output is y = g_1 y_1 + g_2 y_2.]

This architecture is best understood if it is given a probabilistic interpretation as an "associative gaussian mixture model" (see Duda and Hart [15] and McLachlan and Basford [16] for a discussion of non-associative gaussian mixture models). Under this interpretation, the training patterns are assumed to be generated by a number of different probabilistic processes. At each time step, a process is selected with probability g_i and a training pattern is generated by the process. Each process is characterized by a statistical model of the form y* = f_i(x) + ε_i, where f_i(x) is a fixed nonlinear function of the input vector, denoted x, and ε_i is a random variable. If it is assumed that ε_i is gaussian with covariance matrix σ_i²I, then the residual vector y* − y_i is also gaussian and the cost function in (3) is the log likelihood of generating a particular target vector y*.

The goal of the architecture is to model the distribution of training patterns. This is achieved by gradient ascent in the log likelihood function. To compute the gradient, consider first the partial derivative of the log likelihood with respect to the weighted sum s_i at the ith output unit of the gating network. Using the chain rule and (1) we find that this derivative is given by

    ∂ ln L / ∂s_i = h_i − g_i    (4)

where h_i is the a posteriori probability that the ith expert network generates the target vector:

    h_i = g_i exp( -‖y* − y_i‖² / (2σ_i²) ) / Σ_{j=1}^{n} g_j exp( -‖y* − y_j‖² / (2σ_j²) )    (5)

Thus the weights of the gating network are adjusted so that the network's outputs, the a priori probabilities g_i, move toward the a posteriori probabilities.

Consider now the gradient of the log likelihood with respect to the output of the ith expert network. Differentiation of ln L with respect to y_i yields

    ∂ ln L / ∂y_i = (h_i / σ_i²)(y* − y_i).    (6)

These derivatives involve the error term y* − y_i, weighted by the a posteriori probability associated with the ith expert network. Thus the ith expert network's weights are adjusted to correct the error between the network's output and the global target vector, but only in proportion to the a posteriori probability. For each input vector, typically only one expert network has a large a posteriori probability. Consequently, only one expert network tends to learn each training pattern. In general, different expert networks learn different training patterns and, thus, learn to compute different functions.

Finally we compute the derivative of the log likelihood with respect to σ_i², the variance associated with the ith expert network. Differentiation of ln L yields

    ∂ ln L / ∂σ_i² = (h_i / (2σ_i⁴)) ( ‖y* − y_i‖² − σ_i² )    (7)

This expression implies that the parameter σ_i² is adjusted toward the sample variance ‖y* − y_i‖², with a step size that is weighted by the a posteriori probability.

IV. A MULTIPAYLOAD ROBOTICS TASK

We have trained several neural network systems to serve as feedforward controllers for a simulated robot arm when a variety of payloads, each of a different mass, must be moved along a specified trajectory. In our simulations, we assume that the controller can detect only the identity of a payload (e.g., payload A or payload B) and not its mass. Six payloads were used with masses of 0, 2, 10, 15, 22, and 27 kg respectively.

Before describing the neural network systems, we detail the training procedure used to provide error information to the systems. The procedure involves training an adaptive feedforward controller to control a robot arm in conjunction with a fixed feedback controller. The feedback controller aids in generating training data that the feedforward controller uses to learn a model of the arm's inverse dynamics. This direct approach to training a neural network controller has been studied by Atkeson and Reinkensmeyer [17], Kawato, Furukawa, and Suzuki [18], and Miller, Glanz, and Kraft [19].

Let the state of the arm be represented in joint space by a position vector θ(t) and a velocity vector θ̇(t). In order to achieve a desired acceleration θ̈(t), an appropriate torque τ(t) must be applied to the arm. The relationship between acceleration and torque is the inverse dynamics of the arm and is written:

    τ(t) = ρ(θ(t), θ̇(t), θ̈(t)).    (8)

The goal of the learning procedure is to train a feedforward controller to model this relationship.

The feedforward controller is trained on-line in the following manner. At each time step, the control signal is obtained by summing the outputs of the feedforward controller and the feedback controller (see Fig. 3(a)). The inputs to the feedforward controller are the desired joint positions, velocities, and accelerations for the current time step, as specified by the reference trajectory. The inputs to the feedback controller (a PID controller) are the desired and actual joint positions and velocities. The sum of the feedforward and
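As an illustration of the competitive updates, the posterior weighting of (5) and one gradient-ascent step following (4) and (6) can be sketched in NumPy. This is our sketch, not the authors' code; the learning rate and the toy values are assumptions, and the step adjusts the gating sums and expert outputs directly rather than backpropagating into network weights.

```python
import numpy as np

def posteriors(g, expert_outputs, target, var):
    """A posteriori probabilities h_i that expert i generated the target (eq. 5)."""
    sq_err = np.sum((target - expert_outputs) ** 2, axis=1)
    lik = g * np.exp(-sq_err / (2.0 * var))
    return lik / lik.sum()

def gradient_step(s, g, expert_outputs, target, var, lr=0.1):
    """One gradient-ascent step on ln L for the gating sums and expert outputs."""
    h = posteriors(g, expert_outputs, target, var)
    s_new = s + lr * (h - g)                                    # eq. (4)
    y_new = expert_outputs + lr * (h / var)[:, None] * (target - expert_outputs)  # eq. (6)
    return s_new, y_new, h

# Toy example: two experts, expert 0 already close to the target.
g = np.array([0.5, 0.5])
experts = np.array([[0.9, 0.1],
                    [0.1, 0.9]])
target = np.array([1.0, 0.0])
var = np.array([0.25, 0.25])
s = np.zeros(2)
s_new, y_new, h = gradient_step(s, g, experts, target, var)
```

The expert that best explains the target receives nearly all of the posterior mass, so it alone is pulled further toward the target, which is the mechanism behind the competitive task decomposition described above.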


Fig. 3. (a) A feedforward controller and a feedback controller compute torques used to control the robot arm. (b) To train the feedforward controller, the actual joint variables are provided as inputs. The target output for the controller is the torque applied to the arm.

Fig. 4. Two-joint planar arm.

TABLE I
PARAMETERS OF THE ROBOT ARM

    Parameter                       Link 1            Link 2
    Length                          1.0 m             0.8 m
    Mass                            10 kg             10 kg
    Viscous friction coefficient    30 kg·m²/s/rad    20 kg·m²/s/rad

feedback control signals is a torque vector that is applied to the arm. The resulting joint accelerations are observed and the feedforward controller then receives the actual joint positions, velocities, and accelerations as inputs and computes new outputs (see Fig. 3(b)). An error is computed between this output and the actual torques applied to the arm, and the error is used to change the weights in the controller using the backpropagation algorithm [11]-[14]. In this manner, the feedforward controller receives samples of the inverse dynamic mapping and can learn a model of this mapping by minimizing its prediction error. Early in the training session, the feedback controller dominates and the arm follows the desired trajectory imprecisely. As the feedforward controller learns to model the arm's inverse dynamics, it begins to generate torques that allow the arm to follow the desired trajectory more faithfully.

We trained several neural network systems to serve as feedforward controllers for a robot arm when a variety of payloads, each of a different mass, must be moved along a specified trajectory. One complete simulation of the training of a system is called a run and is described using four time scales. Each run was divided into ten epochs, each of which was divided into six bins corresponding to the six payloads. At the start of each bin, a payload was selected randomly from a uniform distribution such that each payload was selected exactly once in each epoch. During a bin, the system attempted to direct the robot arm with the selected payload along the desired trajectory five times. Torques were applied at a rate of 100 Hz during a two second period. In summary, a run comprised 10 epochs, an epoch comprised six bins, a bin comprised five traversals of the desired trajectory, and a traversal comprised 200 time steps.

The robot arm is the two-joint planar manipulator shown in Fig. 4. Each link of the arm was modeled as a point mass located at its distal end. The payload was modeled as attached rigidly to the second link. The dynamic equations were integrated using a fourth-order Runge-Kutta method sampled at 100 Hz. The parameters of the arm are shown in Table I.

The feedback controller was a position controller with a gain of 1000 kg·m²/s²/rad. The parameters of the arm and of the feedback controller were obtained from Miller, Glanz, and Kraft [19]. The desired trajectory was a straight-line horizontal movement with a velocity of zero at the endpoints and maximum velocity at the midpoint of the trajectory. If the origin of a Cartesian coordinate system lies at the axis of joint 1, then the coordinates of the desired trajectory, expressed in meters, are

    x(t) = −cos(πt / 2)    (9)

and a constant vertical coordinate y(t) (10), where t ∈ [0, 2].

Four neural network systems were trained to serve as feedforward controllers. All systems received input vectors with 19 components. Ten components were binary and represented ten possible payloads. The remaining nine components were real-valued and represented the robot arm's joint positions, velocities, and accelerations. These nine components were the transformations of the positions, velocities, and accelerations shown in Table II. They were selected so that, for a fixed payload, the inverse dynamics are linear in the transformations (cf. [18]).

The specifications of the four neural network systems trained to serve as feedforward controllers are listed in Table III.
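The on-line training procedure above can be illustrated with a deliberately simplified scalar plant whose inverse dynamics reduce to tau = M * acceleration. This is our toy sketch of the signal flow in Fig. 3, not the authors' simulation: the plant, the proportional feedback law, the gains, and LMS in place of backpropagation are all assumptions.

```python
import numpy as np

# Scalar stand-in for the arm: "inverse dynamics" are tau = M_TRUE * acceleration.
M_TRUE = 2.0

def plant_acceleration(tau):
    return tau / M_TRUE

m_hat = 0.0   # feedforward "network": a single adjustable parameter
K_FB = 5.0    # feedback gain (assumed toy value)
LR = 0.05     # learning rate (assumed toy value)

for step in range(500):
    acc_des = np.sin(0.05 * step)           # reference acceleration
    tau_ff = m_hat * acc_des                # feedforward torque
    acc_open = plant_acceleration(tau_ff)   # response to feedforward alone
    tau_fb = K_FB * (acc_des - acc_open)    # feedback corrects the tracking error
    tau = tau_ff + tau_fb                   # composite control signal
    acc_actual = plant_acceleration(tau)    # observed acceleration
    # Direct inverse-model learning: the applied torque is the target for the
    # feedforward controller's prediction at the observed acceleration.
    prediction = m_hat * acc_actual
    m_hat += LR * (tau - prediction) * acc_actual
```

As in the paper's procedure, the feedback path supplies useful torques early on, and the feedforward parameter converges toward the true inverse dynamics, after which the feedback correction shrinks.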


TABLE II
[The transformation entries are illegible in this scan.]
Note: f_i are the transformations of the robot arm's joint positions, velocities, and accelerations. The subscripts on θ, θ̇, and θ̈ are the joint number.

The single network, labeled SN, has 19 input units (corresponding to the joint variables and payload identity) that are connected to 10 hidden units that are connected to two output units (corresponding to the feedforward torques). The modular architecture, labeled MA, has six expert networks and one gating network (see Fig. 5). Each expert network receives the joint variables and has two output units. The gating network receives the payload identity and has six output units. The last two systems, labeled MAS and CMAS, are referred to as modular architectures with a share network. This type of architecture is formed by modifying the standard modular architecture and its use requires additional explanation.

The modular architecture with a share network is a system that dedicates one network to learning the common features of a set of tasks and dedicates other networks to learning the features that are unique to each task. Such an architecture is illustrated in Fig. 6.¹ It consists of a share network as well as a set of expert networks and a gating network. During training, the share network learns a global strategy that is common to all tasks and the expert networks learn modifications to the global strategy that are particular to individual tasks. The output of the architecture, y, is the sum of the output of the share network, y_s, and the gated outputs of the expert networks:

    y = y_s + Σ_{i=1}^{n} g_i y_i    (11)

The training procedure for this architecture is identical to the training procedure for the standard modular architecture with the exception that the networks' weights are adjusted so as to maximize the log likelihood

    ln L = ln Σ_{i=1}^{n} g_i exp( -‖y* − y_s − y_i‖² / (2σ_i²) )    (12)

When controlling a robot arm to move a variety of payloads, there are many possible decompositions of the task into a shared strategy and a set of modifications to this strategy. The first modular architecture with a share network that we simulated, system MAS, was unconstrained as to the decomposition it could discover. It had a share network, six expert networks, and a gating network. The share and expert networks received the kinematic joint variables, and the gating network received the payload identities. A second modular architecture with a share network, system CMAS, was constrained to discover a particular type of decomposition. Specifically, it was constrained so that the share network learns to produce the correct feedforward torques to control the arm with no payload, and the expert networks learned to add the extra torques required to compensate for the payloads' masses. This constraint was imposed by setting one of the expert networks' outputs to be the zero vector, and fixing the gating network's weights so that this expert network was "gated on" in the absence of a payload.

Fig. 7 shows the learning curves for the four systems. The horizontal axis gives the number of epochs. The vertical axis gives the joint root mean square error (RMSE) in radians averaged over 25 runs. For many parameters of each system (e.g., step size), we used the values that appeared to give the best performance. These values and additional details of the simulations are provided in the Appendix. The curves show that the modular architectures MA, MAS, and CMAS achieve significantly better performance than the single network SN (at epoch 10, the difference between the performance of any of the modular architectures and the single network is statistically significant at the p < 0.01 level: for the comparison between SN and MA, t = 5.84; for the comparison between SN and MAS, t = 16.04; for the comparison between SN and CMAS, t = 19.16). The modular architectures MA, MAS, and CMAS show similar levels of performance. The superior performances of the modular architectures compared to the single network are due in part to the fact that the multipayload robotics task is characterized by a high degree of temporal crosstalk. Each system effectively attempts to model the inverse dynamics of the robot arm with a payload that is fixed within each bin of 1000 consecutive time steps, but varied between bins. Whereas the rate of learning of the single network is diminished by temporal crosstalk, the modular architectures' performances are robust in the face of temporal crosstalk. This is because the modular architectures learn to allocate different expert networks to model the inverse dynamics of the arm at different operating points.

There are at least three ways that the modular architectures could allocate their expert networks so as to learn to control the robot arm with different payloads. One of the expert networks could learn the appropriate feedforward torques for all payloads. Clearly, this solution doesn't decompose the task into subtasks. A second possibility is that the architectures could use a different expert network to learn the appropriate torques to control the arm with each of the different payloads. This solution involves extensive task decomposition, but fails to take advantage of the similarities between the control laws needed for payloads of similar masses. A third and possibly best solution would be one in which different expert networks learn the appropriate control laws for payloads from different mass categories (e.g., light, medium, or heavy payloads). This solution takes advantage of task decomposition and of the similarities between the control laws needed for payloads of similar masses.

¹A similar system was proposed by Kawato, Furukawa, and Suzuki [18].
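The combination rule (11) and the modified likelihood (12) can be sketched as follows. This is our illustration; the toy vectors and variances are assumptions, and the exponential factor follows the gaussian-mixture form of (3).

```python
import numpy as np

def share_architecture_output(y_share, g, expert_outputs):
    """Output of a modular architecture with a share network (eq. 11)."""
    return y_share + g @ expert_outputs

def log_likelihood(y_share, g, expert_outputs, target, var):
    """ln L for the share-network architecture (eq. 12): the experts model the
    residual target y* - y_s rather than the full target."""
    residual_sq = np.sum((target - y_share - expert_outputs) ** 2, axis=1)
    return np.log(np.sum(g * np.exp(-residual_sq / (2.0 * var))))

# Toy values: expert 0 exactly models the residual target - y_s.
y_s = np.array([1.0, -0.5])             # share network output
g = np.array([0.8, 0.2])                # gating activations
experts = np.array([[0.2, 0.1],
                    [-0.3, 0.4]])
target = np.array([1.2, -0.4])
var = np.array([0.25, 0.25])

y = share_architecture_output(y_s, g, experts)
L = log_likelihood(y_s, g, experts, target, var)
```

Because the share network's output is added before the gating competition, each expert only has to account for what the shared strategy leaves unexplained, which is exactly the decomposition the CMAS constraint enforces.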


TABLE III
A. SINGLE NETWORK SN
Topology: 19 → 10 → 2. Input: payload identity and f_i(θ, θ̇, θ̈).

B. MODULAR ARCHITECTURE MA
Expert networks: number 6; topology 9 → 2; input f_i(θ, θ̇, θ̈).
Gating network: topology 10 → 6; input payload identity.

C. MODULAR ARCHITECTURE WITH A SHARE NETWORK MAS
Share network, expert networks, and gating network. [The individual entries are illegible in this scan.]

D. CONSTRAINED MODULAR ARCHITECTURE WITH A SHARE NETWORK CMAS
Expert networks: include one with topology 0 → 2 and no input.
Gating network: topology 10 → 6; input payload identity.
[The remaining entries are illegible in this scan.]

Fig. 5. Modular architecture MA used in the multipayload robotics experiment. [The expert networks receive the joint variables; the gating network receives the payload identity.]

[Fig. 7 (learning curves): joint RMSE (radians) versus training epoch for single network SN, modular architecture MA, modular architecture with a share network MAS, and constrained modular architecture with a share network CMAS.]

Fig. 6. A modular architecture with a share network. [Its output is y = y_s + g_1 y_1 + g_2 y_2.]

Figs. 8-10 show how the systems MA, MAS, and CMAS allocate their expert networks to the various payloads over different runs. Each line of these figures corresponds to a different allocation. The allocation is given by the arrangement of the six numbers on the left side of the figures, where the numbers are the masses in kilograms of the six payloads. Payloads that are allocated to the same expert network have their masses bracketed together.² The bars on the right side of the figures give the number of runs in which an architecture allocates its expert networks in the corresponding manner. For example, the top line of Fig. 8 shows that on 12 of the 25 runs, the modular architecture MA allocates one expert network to control the arm when the payload is either 0 or 2 kg, a second expert network when the payload is 10 or 15 kg, and a third expert network when the payload is 22 or 27 kg. The remaining expert networks are unused.

In general, the modular architectures tend to use different control laws for payloads from different mass categories. In the case of the architectures with a share network, different expert networks learn to add the appropriate extra torques

²Recall that after training, in response to an input pattern, one of the output units of the gating network tends to have a large activation and all other output units tend to have a small activation. For the purposes of making Figs. 8-10, we considered the output unit with the largest activation to have an activation of one and all other output units to have an activation of zero.


Grouping of payloads Frequency of grouping Grouping of payloads Frequency of grouping


[ 0 2 ] [ 10 15 ] [ 22 27 ] 1 12 [ 0 2 ] [ 10 ] [ 15 ] [ 22 27 ] 9
[ 0 2 ] [ 10 15 ] [ 22 J [ 27 1 6 [ 0 21 [ 10 151 [ 22 271 6
[ 0 2 ] [ 10 ] [ 15 ] [ 22 27 ] El [ 0 2 10 ] [ 15 ] [ 22 27 ] 131
[ 0 2 ] [ 10 ] [ 15 22 ] [ 27 ] El [ 0 2 ] [ 10 ] [ 15 22 27 ] 121
[ 0 2 ] [ 10 15 22 ] [ 27 ] El [ 0 2 ] [ 10 15 ] [ 22 ] [ 27 ] El
[ 0 2 ] [ 10 ] [ 15 22 27 ] [ 0 ] [ 2 ] [ 10 15 ] [ 22 27 ] El
[ 0 2 10 ] [ 15 22 27 ] El [ 0 2 I O ] [ 15 ] [ 22 ] [ 27 ] El
[021[101~151[221[271 El [ 0 10 ] [ 2 ] [ 15 22 ] [ 27 ]
[ 0 2 ] [ 10 22 27 ] [ 15 ] El [ 0 10 15 ] [ 2 ] [ 22 ] [ 27 ]
Fig. 8. Allocation of modular architecture MA’s expert networks to the Fig. 10. Allocation of constrained modular architecture with a share net-
payloads. work CMAS’s expert networks to the payloads.

Grouping of payloads Frequency of grouping 0.20

I
[ 0 2 ] [ 10 15 ] [ 22 27 10
02][10][15][2227 15) 0.15
02][1015][22][27 (31
0 2 1 [ 101 [ 15 221 [ 27 El
[ 0 2 ] [ 10 ] [ 15 22 27 Joint

ll fi i
RMSE 0.10
(radians)
0 ] [ 2 10 ] [ 15 ] [ 22 27 El 0.0854
El 0.0778

0.05 0.0544

Fig. 9. Allocation of modular architecture with a share network MAS’S ex-


pert networks to the payloads. 0.00
SN MA MAS CMAS

Fig. 11. The joint root mean square error on the first trajectory traversal with a novel payload (6 kg) for single network SN, modular architecture MA, modular architecture with a share network MAS, and constrained modular architecture with a share network CMAS.

[Plots omitted: joint RMSE (radians) versus trajectory traversals.]

Fig. 12. Learning curve for single network SN trained in the temporal crosstalk experiment.

Fig. 13. Learning curve for modular architecture MA trained in the temporal crosstalk experiment.

to compensate for the mass of payloads from different mass categories. The modular architectures tend to allocate one expert network to learn the control law when the payload is absent or light, another network when the payload is medium weight, and a third network when the payload is heavy. This occurs despite the fact that the architectures are only provided with the identity of each payload, not each payload's mass. The tendency to allocate the same expert network to control the arm with payloads of similar masses, and to allocate different expert networks when the payloads' masses are dissimilar, results from the competition between the networks to learn the training patterns. If, for example, an expert network wins the competition to learn to control the arm with no payload, then it is likely to also win the competition when the payload is 2 kg, but lose the competition when the payload is 27 kg.

In summary, the results show that suitably designed modular architectures can learn to perform nonlinear control tasks using a piecewise control strategy. Due to the competition between the architecture's networks to learn the training patterns, different networks learn to model the plant dynamics at different operating points. Between operating points, the networks' outputs are smoothly interpolated by the softmax function (1). The use of a piecewise control strategy, at least in the experiments reported here, leads to faster rates of learning than the use of a single global control strategy. Furthermore, the results show that the modular architecture can be easily modified so that one network learns a shared strategy that is useful at all operating points and other networks learn modifications to this strategy that are applied in a context-sensitive manner.

After training the systems as described previously, we conducted an additional experiment in which the systems were tested for their ability to learn to control the robot arm with a novel payload. The novel payload's mass was 6 kg and the training period was only a single traversal of the desired trajectory. Fig. 11 shows the joint root mean square error for the four systems. Based on their experiences with the previous six payloads, all systems performed well. In general, the modular architectures performed better than the single network (the differences in performance between the single network SN and the modular architectures MA, MAS, and CMAS are statistically significant at the p < 0.01 level). Because the expert networks learned local models of the robot arm's inverse dynamics at different operating points, it is surprising that the modular architectures achieve such good performance.
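One way to see how frozen experts can nevertheless serve a novel payload is to treat their outputs as fixed basis functions and adapt only the softmax gating parameters so that the blended output matches the control signal the novel payload requires. The following toy sketch is an illustrative assumption, not the paper's algorithm; the scalar "torques" and learning rate are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Fixed scalar expert outputs, e.g. torques tuned to light, medium, and
# heavy payloads; only the gating parameters z are adapted.
experts = np.array([1.0, 3.0, 8.0])
target = 2.0                      # torque required by a novel payload

z = np.zeros(3)
lr = 0.5
for _ in range(200):
    g = softmax(z)
    y = g @ experts               # blended control signal
    err = y - target
    z -= lr * err * g * (experts - y)   # gradient of 0.5*err**2 w.r.t. z

y_final = softmax(z) @ experts    # close to the target; experts untouched
```

The gradient uses the identity dy/dz_i = g_i (e_i − y) for a softmax-weighted sum, so the gating network alone steers the mixture toward the new operating point.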


However, an examination of the gating networks' outputs shows that these architectures utilize a weighted combination of the expert networks' outputs that yields an appropriate control signal for the novel payload. This suggests that a modular architecture can use the functions computed by its expert networks as a set of "basis functions" that the gating network can learn to combine in order to control the arm with novel payloads.

V. ROBUSTNESS TO TEMPORAL CROSSTALK

The previous experiments show that modular architectures can allocate different networks to approximate a robot arm's inverse dynamics at different operating points. As a result, modular architectures tend to be more robust to temporal crosstalk than single networks. In order to more precisely characterize the effects of temporal crosstalk, we trained the single network SN and the modular architecture MA to direct the robot arm along the desired trajectory under two conditions. In the first condition, the arm didn't carry a payload, whereas in the second condition, it carried a 22 kg payload. These two conditions were alternated every 20 traversals of the trajectory.

The learning curves for single network SN and modular architecture MA are shown in Figs. 12 and 13. The horizontal axis gives the number of trajectory traversals. The vertical axis gives the joint root mean square error in radians averaged over 25 runs. Clearly the single network is not robust to temporal crosstalk. Training with the 22 kg payload causes the network to "forget" how to control the arm in the absence of a payload, and vice versa. In contrast, the modular architecture MA is robust to temporal crosstalk. It learned to allocate different expert networks to control the arm with and without the payload. Consequently, training with and without the payload are uncoupled and, thus, training in one condition doesn't degrade performance in the other condition.

VI. CONCLUSION

To summarize, the dynamics of a plant frequently change with its operating conditions. Design methodologies, such as gain scheduling, allow a designer to construct different control laws for different regions of a plant's parameter space. When a controller is learned instead of designed, analogous issues arise. Temporal crosstalk, which retards learning, is particularly salient in control problems. Because controlled dynamical systems tend to move relatively slowly through state space, learning controllers receive training data from a local region for long periods of time. Temporal crosstalk is therefore inevitable if the dynamics of the plant vary at different operating points. An advantage of modular neural network architectures is that they are able to partition a plant's parameter space into a number of regions and can allocate different networks to learn a control law for each region. As a result, they are relatively robust to temporal crosstalk.

In conclusion, we note that the modular architecture is not limited to learning a piecewise feedforward control law as has been presented here, but can be usefully applied to a variety of other control problems. For example, the architecture is equally applicable to the problem of learning a piecewise feedback control law. Furthermore, we believe that the architecture may be useful for piecewise state reconstruction, system identification, and for indirect approaches to learning control [21], [23]. Also, some control problems may require learning models of different complexity in different regions of the plant's parameter space. In this case, a modular architecture may contain expert networks with different characteristics (e.g., networks with different topologies or different regularizers) so that the architecture can allocate a network with an appropriate complexity to each region. Finally, a multiresolution control law may be useful for some problems. This can be achieved by extending the modular architecture presented here to a multiresolution hierarchy that partitions the input space recursively [22].

APPENDIX

Step sizes: In all simulations, the modular architectures were initialized with a step size of 0.01 for modifying σ². During the course of learning, this step size was set to the product of 2σ⁴ and its initial value (see (7) for why this was done). The step size for modifying system SN's weights was 0.002. The step sizes for modifying the modular architectures' weights were a function of the σᵢ². At the start of a simulation,


each gating network's step size was set to 0.01. The expert networks of system MA were initialized with a step size of 0.001 and systems MAS and CMAS's expert networks were initialized with step sizes of 0.0003. At each time step, the ith expert network's step size was set to the product of σᵢ² and its initial step size (see (6) for why this was done). Each share network's step size was set to the product of the average σᵢ² and its initial step size. Each gating network's step size was set to the product of the square root of the average σᵢ² and its initial step size.

Temperature: The temperature parameter, T, in the softmax function (1) was initialized to 1 and slowly decreased during a simulation. Specifically, T was decreased by 25% at the start of each epoch. However, in the simulation of system MA on the temporal crosstalk task we simply set T = 0.075.

Joint Root Mean Square Error: The joint RMSE was computed as follows. At each time step, the squared differences between the desired and actual joint positions were determined. The joint RMSE was determined for each traversal of the trajectory by averaging the squared differences over all 200 time steps of the traversal, and then taking the square root of this average. The joint RMSE for each epoch was determined by averaging the joint RMSE of the 30 traversals (six payloads times five traversals per payload) that occurred in each epoch. The joint RMSEs reported in the figures were determined by averaging the joint RMSE of each traversal or epoch over the 25 runs.

Miscellaneous: The hidden units of system SN used the logistic activation function with asymptotes at 0 and 1. All networks were initialized with weights selected from a uniform distribution over the interval (-0.1, 0.1), except the gating networks, which were initialized with zero weights.

REFERENCES

[1] J. S. Shamma and M. Athans, "Analysis of gain scheduled control for nonlinear plants," IEEE Trans. Automat. Contr., vol. 35, pp. 898-907, 1990.
[2] K. J. Åström and B. Wittenmark, Adaptive Control. Reading, MA: Addison-Wesley, 1989.
[3] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
[4] R. S. Sutton, "Two problems with backpropagation and other steepest-descent learning procedures for networks," Proc. Eighth Annual Conf. Cognitive Sci. Soc., 1986, pp. 823-831.
[5] R. A. Jacobs, M. I. Jordan, and A. G. Barto, "Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks," Cognitive Sci., vol. 15, no. 2, 1991.
[6] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, "Adaptive mixtures of local experts," Neural Computation, vol. 3, pp. 79-87, 1991.
[7] S. J. Nowlan, "Maximum likelihood competitive learning," in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990.
[8] G. E. Hinton and S. J. Nowlan, "The bootstrap Widrow-Hoff rule as a cluster-formation algorithm," Neural Computation, vol. 2, pp. 355-362, 1990.
[9] J. Bridle, "Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition," in Neurocomputing: Algorithms, Architectures, and Applications, F. Fogelman-Soulie and J. Herault, Eds. New York: Springer-Verlag, 1989.
[10] R. Durbin and D. J. Willshaw, "An analogue approach to the travelling salesman problem using an elastic net method," Nature, vol. 326, pp. 689-691, 1987.
[11] Y. le Cun, "Une procédure d'apprentissage pour réseau à seuil asymétrique (A learning procedure for asymmetric threshold networks)," Proc. Cognitiva, vol. 85, pp. 599-604, 1985.
[12] D. B. Parker, "Learning logic," Tech. Rep. TR-47, Mass. Inst. Technol., Cambridge, MA, 1985.
[13] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1: Foundations. Cambridge, MA: The MIT Press, 1986.
[14] P. J. Werbos, "Beyond regression: New tools for prediction and analysis in the behavioral sciences," Ph.D. dissertation, Harvard Univ., Cambridge, MA, 1974.
[15] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[16] G. J. McLachlan and K. E. Basford, Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker, 1988.
[17] C. G. Atkeson and D. J. Reinkensmeyer, "Using associative content-addressable memories to control robots," Proc. IEEE Conf. Decision and Control, pp. 792-797, 1988.
[18] M. Kawato, K. Furukawa, and R. Suzuki, "Hierarchical neural-network model for control and learning of voluntary movement," Biological Cybern., vol. 57, pp. 169-185, 1987.
[19] W. T. Miller, F. H. Glanz, and L. G. Kraft, "Application of a general learning algorithm to the control of robotic manipulators," The Int. J. Robotics Res., vol. 6, pp. 84-98, 1987.
[20] A. G. Barto, "Connectionist learning for control: An overview," in T. Miller, R. S. Sutton, and P. J. Werbos, Eds., Neural Networks for Control. Cambridge, MA: The MIT Press, 1990.
[21] M. I. Jordan and D. E. Rumelhart, "Forward models: Supervised learning with a distal teacher," Cognitive Sci., vol. 16, pp. 307-354, 1992.
[22] M. I. Jordan and R. A. Jacobs, "Hierarchies of adaptive experts," in Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, Eds. San Mateo, CA: Morgan Kaufmann, 1992.
[23] K. S. Narendra and K. Parthasarathy, "Identification and control of dynamical systems using neural networks," IEEE Trans. Neural Networks, vol. 1, pp. 4-27, 1990.

Robert A. Jacobs received the Ph.D. degree in computer science from the University of Massachusetts at Amherst.
He is an Assistant Professor in the Department of Psychology at the University of Rochester, Rochester, NY. Previously, he served as a postdoctoral fellow in cognitive neuroscience at the Massachusetts Institute of Technology, Cambridge, and at Harvard University, Cambridge, MA. His research has focused on learning in modular and hierarchical neural networks, and on functional specialization in high-level vision.

Michael I. Jordan received the Master's degree in mathematics from Arizona State University, Tempe, and the Ph.D. degree in cognitive science from the University of California at San Diego.
He is an Associate Professor in the Department of Brain and Cognitive Sciences at the Massachusetts Institute of Technology, Cambridge. His research has focused on nonlinear adaptive control using neural networks, learning in recurrent networks, and learning in modular and hierarchical neural networks.
Dr. Jordan is the recipient of a Presidential Young Investigator Award from the National Science Foundation.
