
Learning to Control an Inverted Pendulum Using Neural Networks

Charles W. Anderson

ABSTRACT: An inverted pendulum is simulated as a control task with the goal of learning to balance the pendulum with no a priori knowledge of the dynamics. In contrast to other applications of neural networks to the inverted pendulum task, performance feedback is assumed to be unavailable on each step, appearing only as a failure signal when the pendulum falls or reaches the bounds of a horizontal track. To solve this task, the controller must deal with issues of delayed performance evaluation, learning under uncertainty, and the learning of nonlinear functions. Reinforcement and temporal-difference learning methods are presented that deal with these issues in order to avoid unstable conditions and balance the pendulum.

Introduction

The inverted pendulum is a classic example of an inherently unstable system. Its dynamics are basic to tasks involving the maintenance of balance, such as walking and the control of rocket thrusters. A number of control design techniques have been investigated using the inverted pendulum [1]-[4]. The successful application of these design techniques requires considerable knowledge of the system to be controlled, including an accurate model of the dynamics of the system and an expression of the system's desired behavior, usually in the form of an objective function.

How can control be accomplished when such knowledge is not available? This question is addressed here by considering the inverted pendulum control problem when the dynamics are not known a priori and an analytical objective function is not given. All that is known are the values and ranges of the state variables of the inverted pendulum system and that a negative failure signal is to be maximized over time. A function that selects control actions given the current state of the pendulum must be learned through experience by trying various actions and noting the results, starting with no hints as to which actions are correct.

Without an objective function to evaluate states and actions, modifications to the controller can be based only on the occurrence of failure signals. A long sequence of actions can develop before a failure signal is encountered, resulting in the difficult assignment-of-credit problem, where it is necessary to decide which actions in the sequence contributed to the failure.

In this paper, neural network learning methods are described that learn to generate successful action sequences by acquiring two functions: an action function, which maps the current state into control actions, and an evaluation function, which maps the current state into an evaluation of that state. The evaluation function is used to assign credit to individual actions. Two networks having a similar structure are used to learn the action and evaluation functions. They will be referred to as the action network and the evaluation network.

As shown in later sections, the desired evaluation function for the inverted pendulum task is nonlinear; a single-layer neural network cannot form this map. One solution to this problem is to transform the original state variables into a new representation with which a single-layer network can form the evaluation function. Barto et al. [5] demonstrated a quantization of the state space of the inverted pendulum with which single-layer networks could learn to balance the pendulum. A second solution is to add a second adaptive layer that learns such a representation. Anderson [6] extended the work of Barto et al. by applying a form of the popular error back-propagation method to two-layered networks that learn to balance the pendulum given the actual state variables of the inverted pendulum as input.

In this paper, the work of Barto et al. and Anderson is summarized by discussing the neural network structures and learning methods from a functional viewpoint and by presenting the experimental results. First, the inverted pendulum task and previous applications of neural networks to this task are described.

Presented at the 1988 American Control Conference, Atlanta, Georgia, June 15-17, 1988. Charles W. Anderson is with the Self-Improving Systems Department of GTE Laboratories, Inc., Waltham, MA 02254.

Inverted Pendulum

The inverted pendulum task involves a pendulum hinged to the top of a wheeled cart that travels along a track, as shown in Fig. 1. The cart and pendulum are constrained to move within the vertical plane. The state at time t is specified by four real-valued variables: the angle between the pendulum and vertical and the angular velocity (θ_t and θ̇_t), and the horizontal position and velocity of the cart (h_t and ḣ_t). The inverted pendulum system was simulated using the following equations of motion, where the units of θ, h, and time t are radians, meters, and seconds, respectively, and where g is the acceleration due to gravity (9.8 m/sec²), F_t the output of the action network (±10 N), m_c the mass of the cart (1.0 kg), m the mass of the pendulum plus the cart (1.1 kg), m_p the mass of the pendulum alone (0.1 kg), and l the distance from the pivot to the pendulum's center of mass (0.5 m):

  θ̈_t = [m g sin θ_t - cos θ_t (F_t + m_p l θ̇_t² sin θ_t)] / [(4/3) m l - m_p l cos² θ_t]

  ḧ_t = {F_t + m_p l [θ̇_t² sin θ_t - θ̈_t cos θ_t]} / m

This system was simulated by numerically approximating the equations of motion using Euler's method with a time step of τ = 0.02 sec and discrete-time state equations of the form θ[t + 1] = θ[t] + τ θ̇[t]. The sampling rate of the inverted pendulum's state and the rate at which control forces are applied are the same as the basic simulation rate, i.e., 50 Hz.

The goal of the inverted pendulum task is to apply a sequence of right and left forces of fixed magnitude to the cart such that the pendulum is balanced and the cart does not hit the edge of the track.

Fig. 1. The inverted pendulum.
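As a concrete illustration of the simulation just described, the following Python sketch integrates the two equations of motion with Euler's method at τ = 0.02 sec. The function and variable names are mine, the pendulum mass m_p = m - m_c = 0.1 kg is inferred from the listed masses, and the failure bounds (±12 deg on the angle, ±2.4 m on the cart position) are taken from the task description in the next section.

```python
import math

# Constants from the paper's simulation
GRAVITY = 9.8      # m/s^2
M_TOTAL = 1.1      # mass of pendulum plus cart (kg)
M_CART = 1.0       # mass of the cart (kg)
M_POLE = 0.1       # pendulum mass alone, M_TOTAL - M_CART (kg)
LENGTH = 0.5       # pivot to the pendulum's center of mass (m)
FORCE_MAG = 10.0   # fixed force magnitude (N)
TAU = 0.02         # Euler time step (sec), i.e., 50 Hz

def step(state, force):
    """One Euler step of the inverted pendulum for an applied force of +/-10 N."""
    theta, theta_dot, h, h_dot = state
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    # Angular acceleration (first equation of motion)
    theta_ddot = (M_TOTAL * GRAVITY * sin_t
                  - cos_t * (force + M_POLE * LENGTH * theta_dot**2 * sin_t)) \
                 / ((4.0 / 3.0) * M_TOTAL * LENGTH - M_POLE * LENGTH * cos_t**2)
    # Cart acceleration (second equation of motion)
    h_ddot = (force + M_POLE * LENGTH
              * (theta_dot**2 * sin_t - theta_ddot * cos_t)) / M_TOTAL
    # Euler integration: x[t+1] = x[t] + tau * xdot[t]
    theta += TAU * theta_dot
    theta_dot += TAU * theta_ddot
    h += TAU * h_dot
    h_dot += TAU * h_ddot
    return (theta, theta_dot, h, h_dot)

def failed(state):
    """Failure signal: pendulum past +/-12 deg or cart past +/-2.4 m."""
    theta, _, h, _ = state
    return abs(theta) > 12.0 * math.pi / 180.0 or abs(h) > 2.4
```

A balancing episode then alternates step() with the controller's choice of force = +FORCE_MAG or -FORCE_MAG until failed() returns True.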

A zero-magnitude force is not permitted. Bounds on the angle and on the cart's horizontal position specify the states for which a failure signal occurs. There is no unique solution; any trajectory through the state space that does not result in a failure signal is acceptable. The only information regarding the goal of the task is provided by a failure signal, which signals either the pendulum falling past ±12 deg or the cart hitting the bounds of the track at ±2.4 m. These two kinds of failure are not distinguishable in the case considered herein.

The goal as just stated makes this task very difficult; the failure signal is a delayed and rare performance measure. Before describing a solution to this formulation of the inverted pendulum task, we briefly discuss other approaches that assume the existence of additional task-specific knowledge.

A traditional control approach when the dynamic equations of motion are known is to assume that the control force F_t is a linear function of the four state variables (θ_t, θ̇_t, h_t, ḣ_t) with constant coefficients b_1, ..., b_4:

  F_t = b_1 θ_t + b_2 θ̇_t + b_3 h_t + b_4 ḣ_t

The coefficients b_i are chosen to stabilize the linearized version of the system differential equations for θ_t and h_t small in magnitude. This is the approach followed by Cheok and Loh [1] in a similar problem; they use linear feedback in three (of the four) variables to obtain stable control for a ball-balancing experiment. The success of this approach depends heavily on the match between the actual system dynamics and the linearized approximation.

The earliest application of neural networks to the inverted pendulum task is that of Widrow and Smith [7] and Widrow [8]. They approached the problem as described earlier, using traditional control methods to derive a control law to stabilize the linearized system for small θ_t and h_t. Then they trained a network to mimic the output of the control law by observing the input-output behavior of the control law as it balanced the pendulum. Guez and Selinsky [9] extended this approach to include multilayer networks trained by observing a nonlinear control law.

When the dynamics are not known, it is necessary to use some adaptive or learning approach to obtain a stable control, which is some unknown function U of the four state variables:

  F_t = U(θ_t, θ̇_t, h_t, ḣ_t)

With unknown dynamics, a controller cannot be designed to provide examples of desired behavior. A human controller who is able to stabilize an inverted pendulum is an alternative source of desired behavior, as demonstrated by Widrow and Smith [7], [8] and Guez and Selinsky [9]. Tolat and Widrow [10] provide a third demonstration of training with a human controller, with the novel use of feedback based on pixel values derived from a visual image of the pendulum location rather than the pendulum's state variables.

If neither a designed controller nor a human expert is available, learning must be guided by some measure of actual performance. For example, Connell and Utgoff [11] measured performance by the distance from the current state to the (0, 0, 0, 0) state, taking the action on each step that most reduced this difference. As is the case for traditional control methods, this amounts to adding the knowledge that the system must be stabilized about a particular state. Rosen et al. [12] assumed a different type of knowledge, the knowledge that the inverted pendulum task is a failure-avoidance task. For avoidance tasks, longer trajectories through state space between failures are more desirable, so Rosen et al. identified actions that resulted in cycles in state space as the preferred actions.

The only performance measure present in our formulation of the inverted pendulum task is the failure signal. The approach to this task, as summarized in the next two sections, is an example of how successful control can be learned when limited task-specific information is available.

Solution Using Two Single-Layer Networks

The architecture of a network and the computations performed by each unit specify a function from input to output vectors. The function is parameterized by the numerical connection weights between units and on the inputs to the network. The function is altered by a learning method that adjusts the values of the weights.

Learning can be based on several forms of evaluative feedback (see Hinton [13] for a review). Supervised learning methods, the most commonly used in neural networks, require a training set of data consisting of input vectors and corresponding desired output vectors. Such methods cannot be applied to tasks for which the desired output of the network is not known. The inverted pendulum, as we have defined it, is such a task: the correct action for most states is not even well-defined, since many trajectories are possible that indefinitely avoid failure.

If the desired output is not available, the performance of the network must be evaluated indirectly by considering the effect of its output on the environment with which the network interacts. Reinforcement learning methods can be applied when this effect is measured by changes in an evaluation signal, or reinforcement, a term borrowed from theories of animal learning from which reinforcement learning methods originated [14].

Next we describe the experiments by Barto et al. [5] involving two single-layer networks. See Sutton [15] for a more thorough, extended treatment of this approach to the inverted pendulum task and of reinforcement learning methods in general.

We distinguish the two networks by calling one the action network and the other the evaluation network. The action network learns to select actions as a function of states. It consists of a single unit having two possible outputs, one for each of the two allowable control actions of pushing left or right on the cart with a fixed-magnitude force. The output of the unit is probabilistic: the probability of generating each action depends on the weighted sum of the unit's inputs, i.e., the inner product of the input vector and the unit's weight vector.

Initial values of the weights are zero, making the two actions equally probable. The action unit learns via a reinforcement learning method. It tries actions at random and makes incremental adjustments to its weights, and, thus, to its action probabilities, after receiving nonzero reinforcements. The only nonzero reinforcement present in the inverted pendulum task is a failure signal. Learning good actions is extremely slow when based on this rare and delayed signal.

A second mechanism is needed to apportion the blame for the failure among the actions in the sequence leading to the failure. This mechanism is provided by the evaluation network, which also consists of a single unit. The evaluation unit learns the expected value of a discounted sum of future failure signals by means of a temporal-difference, or TD, method of prediction, developed by Sutton [16]. TD methods learn associations among signals separated in time, such as the inverted pendulum state vectors and failure signals. Through learning, the output of the evaluation network comes to predict the failure signal, with the strength of the prediction indicating how soon failure can be expected to occur. The predictions are adjusted after each step by an amount proportional to the network's input and the difference between the new prediction, based on the current state of the inverted pendulum, and the previous prediction, based on the previous state, i.e., the temporal difference or change in prediction of failure.
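A minimal sketch of the two single-layer units described above is given below. The article specifies the structure (an inner product feeding a probabilistic action unit, and a TD-trained evaluation unit whose change in output acts as reinforcement) but not these exact update rules; the logistic squashing, the discount factor, the step sizes, and the treatment of failure as a terminal state with prediction zero are my assumptions.

```python
import math
import random

def inner(w, x):
    """Inner product of a weight vector and an input vector."""
    return sum(wi * xi for wi, xi in zip(w, x))

def choose_action(a_weights, x):
    """Stochastic action unit: probability of a right push grows with w.x (logistic assumed)."""
    p_right = 1.0 / (1.0 + math.exp(-inner(a_weights, x)))
    action = +1 if random.random() < p_right else -1
    return action, p_right

def learn_step(v_weights, a_weights, x_prev, x_now, action, p_right, failed,
               gamma=0.95, beta=0.5, alpha=1.0):
    """One learning step: TD update of the evaluation unit, then the change in evaluation
    (plus the failure signal) serves as reinforcement for the action unit."""
    pred_prev = inner(v_weights, x_prev)
    pred_now = 0.0 if failed else inner(v_weights, x_now)   # failure treated as terminal
    r = -1.0 if failed else 0.0                             # failure signal: -1 on failure, else 0
    td_error = r + gamma * pred_now - pred_prev             # temporal difference in predictions
    for i, xi in enumerate(x_prev):                         # prediction update proportional to
        v_weights[i] += beta * td_error * xi                # the input and the TD error
    # Make the preceding action more likely when the evaluation improved (td_error > 0),
    # less likely when it worsened; the gradient-of-log-probability factor is my choice.
    grad = (1.0 - p_right) if action == +1 else -p_right
    for i, xi in enumerate(x_prev):
        a_weights[i] += alpha * td_error * grad * xi
    return td_error
```

With zero initial weights, choose_action returns each push with probability 0.5, matching the equally probable starting actions described above. With the 162-component binary representation introduced in the next section, x has exactly one nonzero component, so each update touches a single weight per network.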

This sequence of prediction changes ends with the occurrence of failure, and the final prediction change is dependent on the difference between the failure signal and the previous prediction. Convergence theorems for the TD class of algorithms have been proven by Sutton [16]. Ongoing work by Sutton includes the investigation of relationships between TD methods and dynamic programming.

The output of the evaluation unit is the inner product of the input vector and the unit's weight vector. Assuming the input vector is a representation of the inverted pendulum's state (discussed subsequently) and that the weights are developed by the TD method, the scalar output of the evaluation unit provides a ranking of the states. The difference in the unit's output on the transition from one state to another is used to judge the effectiveness of the previous action. An increase in the evaluation signifies a transition to a state having a weaker prediction of failure and that the probability of the preceding action should be increased. Similarly, the probability of repeating an action that precedes an evaluation decrease should be lowered. In this way, the change in the evaluation network's output serves as a reinforcement during the possibly long periods between failures. However, the learned evaluation function is not always helpful, particularly before much experience has been gained. The learning methods for updating both the evaluation and action networks must deal with this uncertainty.

The performance of any learning system is highly dependent on its input representation. The four real-valued state variables are an adequate representation for the action unit, since the optimal control law for a similar inverted pendulum task is linear and can be approximated by the stochastic action unit. However, using this representation would prevent the evaluation unit from being able to form a good prediction of failure for the following reason.

Consider evaluations as a function of just θ and θ̇, the pendulum's angle and angular velocity. The failure signal is defined to have the value -1 on failure and 0 for all other states. A failure occurs when the value of θ is less than -12 deg or greater than 12 deg. States that occur just prior to failure typically have either high θ and high θ̇, or low θ and low θ̇, i.e., the pendulum is falling in the same direction in which it is leaning. These states should produce an evaluation near -1, a strong prediction of failure. Other states, such as those for which the pendulum is moving toward the balanced position, should have an evaluation closer to 0, a weak prediction of failure. Thus, the shape of this evaluation function as the state moves from -θ, -θ̇ to +θ, +θ̇ is nonmonotonic, first rising from -1 toward 0, then falling back toward -1.

Since the output of the evaluation unit is a linear function of its input, this evaluation function cannot be formed. A different representation of the state must be used for which this function is linear. One alternative is to adopt a table lookup strategy and divide the state space into discrete, nonoverlapping regions, associating a unique input component and weight with each region. The input components then could be binary-valued, with only one being nonzero at a time, signaling which region the current state of the pendulum is in. A unique evaluation can be assigned to each region by adjusting the corresponding weight, approximating any function to an accuracy determined by the coarseness of the state-space partitioning.

The experiments described here were motivated by the work of Michie and Chambers [17], who devised a learning system called BOXES which learned to control an inverted pendulum using a state representation of discrete regions as described earlier. To compare with the performance of the BOXES learning system, the same representation was used. The regions of the state space were formed by the intersections of six intervals along the θ dimension and three intervals along the θ̇, h, and ḣ dimensions, making a total of 162 regions. The resulting networks are shown in Fig. 2. Each unit receives the 162 binary input components, and the evaluation unit's output directs the learning process for both units.

Fig. 2. Single-layer networks with state-space quantization (the four state variables are quantized into 162 binary state components that feed both the evaluation network and the action network; the failure signal and the evaluation output drive learning).

The primary purpose of these experiments was to compare the learning performances of the TD prediction method (called the Adaptive Critic in [5]) and the method used by Michie and Chambers to assign credit to actions. The key difference between the methods is that Michie and Chambers used counters in each region to remember past states and times between state occurrences and failure, and that learning occurred only on failure. The TD method allows learning to occur continuously using the learned evaluation function and differences in its output as reinforcement, rather than waiting for further failures.

The results in Fig. 3 (from [5]) show steps between failures versus failures for the single-layer network and for Michie and Chambers' BOXES system. The curves in the figure are averaged over 10 experiments, each starting with the learning system in a completely naive state and terminated either after 100 failures or 500,000 action steps (a simulated time of almost 3 hr). Balancing time increases with experience for both learning systems, but the networks attain a much longer balancing time, demonstrating the superiority of the TD method of learning between failures for this task. The final flattening of the networks' curve is a ceiling effect due to the termination of runs longer than 500,000 steps; the length of the final balancing period for each run was assigned to the remaining failures. If run longer, this curve would continue to increase. After 500,000 steps, the probability that the network will generate actions leading to failure becomes very small and continues to decrease with additional experience.

Fig. 3. Learning curves for single-layer networks and BOXES using quantized representation of state space (time steps until failure versus failures).
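The quantized representation described above can be sketched as follows. The article gives only the number of intervals per dimension (six for θ and three each for θ̇, h, and ḣ, for 6 × 3 × 3 × 3 = 162 regions); the cut points used below are illustrative placeholders, not the boundaries used in [5] or [17].

```python
import bisect

# Illustrative interval boundaries; only the counts (6, 3, 3, 3) come from the article.
THETA_CUTS = [-0.10, -0.02, 0.0, 0.02, 0.10]   # radians -> 6 intervals
THETA_DOT_CUTS = [-0.87, 0.87]                 # rad/s   -> 3 intervals
H_CUTS = [-0.8, 0.8]                           # m       -> 3 intervals
H_DOT_CUTS = [-0.5, 0.5]                       # m/s     -> 3 intervals

def box_index(state):
    """Map a state to one of the 6 * 3 * 3 * 3 = 162 regions."""
    theta, theta_dot, h, h_dot = state
    i = bisect.bisect(THETA_CUTS, theta)
    j = bisect.bisect(THETA_DOT_CUTS, theta_dot)
    k = bisect.bisect(H_CUTS, h)
    m = bisect.bisect(H_DOT_CUTS, h_dot)
    return ((i * 3 + j) * 3 + k) * 3 + m

def binary_representation(state):
    """162-component binary input vector with exactly one nonzero component."""
    x = [0.0] * 162
    x[box_index(state)] = 1.0
    return x
```

Feeding binary_representation(state) to the units sketched earlier reproduces the table-lookup behavior: each region receives its own evaluation weight and its own action-probability weight.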

Solution Using Two-Layer Networks

In deciding how to divide the state space, one must strike a balance between generality and learning speed. A very fine quantization with many regions permits accurate approximation of complex functions, but learning the correct output for each of the many regions requires much experience. Learning can be faster with a coarse quantization because learning for one state in a region is transferred to all states in the region, but only functions whose output remains relatively constant over regions can be represented. Clearly, an adaptive representation that learns a quantization or other form of feature set based on experience is needed. It should learn to make fine discriminations among some states and coarse generalizations among others, as appropriate for a given task.

The experiments with two-layer networks by Anderson [6], described in this section, are a step in this direction. A second layer of adaptive units is added to the single units described in the previous section. The real-valued state variables are given as input to every unit in both layers, and the outputs from the new units become additional inputs to the original units. This structure is shown in Fig. 4.

Fig. 4. Two-layer networks receiving unquantized state variables (the four state variables go to every unit in both networks, and the hidden-unit outputs feed the output units).

The original units are called the output units of the networks. The new units are called hidden units because their outputs do not have a direct effect on the network's environment. Whereas output-unit learning can be based on evaluation differences, there is no analogous signal on which to base learning in the hidden units. After assigning credit to an individual action, there remains the problem in a multilayer network of distributing this credit among the hidden units that influenced the selection of that action by the output unit.

This is one of the major problems that slowed developments in adaptive networks after the original work in the 1950s and 1960s. However, recent work has suggested that gradient descent techniques, even with the well-known problems of plateaus and local minima, may be feasible solutions to learning in hidden units for some problems. A gradient descent technique for learning in hidden units having particular nonlinear, differentiable, output functions has been studied and given the name error back-propagation [18]. Anderson [6] used variants of error back-propagation to learn in the hidden units of the evaluation and action networks. The errors propagated to hidden evaluation units were based on the differences in the evaluation network's output, whereas errors for hidden action units were based on this difference and on which action was taken.

The results of Anderson's two-layer experiments are shown in Fig. 5. The curves are averaged results of 10 experiments, each starting with a naive network and terminated after 10,000 failures or 500,000 steps. The two-layer networks learned to balance the pendulum for an average of about 24 min of simulated real time, before runs were terminated at the 500,000th step. Single-layer networks, when given the state variables as input, were unable to learn to balance the pendulum, actually doing only slightly better than the nonlearning strategy of choosing actions randomly with equal probability. Even though a good control law can be represented by the single-layer action network, the fact that the linear evaluation network cannot form useful evaluations prevented the control law from being learned.

Fig. 5. Learning curves for two-layer and single-layer networks receiving unquantized state variables (time steps until failure, log scale, versus failures; a random-action baseline is included).

A direct comparison between these results and the single-layer results of the previous section cannot be made. In the single-layer experiments, the state variables of the pendulum were reset to zero after every failure. When this was done for the two-layer experiments, the networks learned to balance the pendulum but did not learn to keep the cart in the center of the track. Generalization of what was learned in the center of the track prevented the learning of different, more appropriate, actions at the ends of the track. To ensure richer experience early in a training run, the state variables were reset to random values after failure. With this change, the networks learned actions for centering the cart and for balancing the pendulum. Another possible solution to this difficulty is to increase the importance of centering the cart by using a larger, more negative, failure signal on bumping the track bounds relative to the failure signal for the pendulum exceeding its angular bounds. This was not tested.
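The two-layer structure of Fig. 4 and the back-propagation-style updates described above can be sketched as follows for the evaluation network (the action network is analogous). The number of hidden units (five) and the use of evaluation differences as the errors propagated to hidden units come from the text; the logistic hidden nonlinearity, the small random initialization, and the step sizes are my assumptions about the variant used in [6].

```python
import math
import random

class TwoLayerEvaluationNet:
    """Hidden units and the output unit all receive the four state variables;
    the hidden outputs become additional inputs to the output unit (Fig. 4)."""

    def __init__(self, n_inputs=4, n_hidden=5):
        rand = lambda: random.uniform(-0.1, 0.1)         # small random initial weights (assumed)
        self.hidden_w = [[rand() for _ in range(n_inputs)] for _ in range(n_hidden)]
        self.out_w = [rand() for _ in range(n_inputs + n_hidden)]

    def forward(self, state):
        # Nonlinear, differentiable hidden outputs (logistic assumed)
        hidden = [1.0 / (1.0 + math.exp(-sum(w * s for w, s in zip(ws, state))))
                  for ws in self.hidden_w]
        inputs = list(state) + hidden                     # state variables plus hidden outputs
        return sum(w * x for w, x in zip(self.out_w, inputs)), hidden

    def update(self, state, td_error, beta=0.2, beta_hidden=0.05):
        """Propagate the evaluation difference (TD error) back to the hidden units."""
        _, hidden = self.forward(state)
        inputs = list(state) + hidden
        # Hidden-unit errors: TD error passed back through the output weights, scaled by
        # the derivative of the logistic output (standard back-propagation form [18]).
        deltas = [td_error * self.out_w[len(state) + j] * y * (1.0 - y)
                  for j, y in enumerate(hidden)]
        for i, x in enumerate(inputs):                    # output-unit update
            self.out_w[i] += beta * td_error * x
        for j, delta in enumerate(deltas):                # hidden-unit updates
            for i, s in enumerate(state):
                self.hidden_w[j][i] += beta_hidden * delta * s
```

For the action network, the text indicates that the hidden-unit errors also depend on which action was taken; that variant is omitted from this sketch.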

Analysis of what the networks have learned is aided by plotting output values of units as surfaces with respect to two of the four state variables. Plotting the output of the evaluation network with respect to θ and θ̇ for fixed values of h and ḣ, as shown in Fig. 6, produces a surface with a ridge running from -θ, +θ̇ to +θ, -θ̇, with lowest values for -θ, -θ̇ and +θ, +θ̇. This function ranks states for which the pendulum is moving toward vertical as more favorable than other states. The ridge is shifted for different values of h and ḣ: when the cart is approaching the right end of the track (+h and +ḣ), states for which the pendulum is to the left of vertical are evaluated more favorably than the states with the pendulum straight up. This is exactly what is required; the pendulum must be "balanced" a bit left of vertical to permit a larger proportion of pushes to the left to bring the pendulum back to the center of the track. The reverse situation exists on the left side of the track.

Fig. 6. Output of evaluation network.

To understand what is learned by the hidden units, their outputs can be similarly plotted. Figure 7 shows the output of one hidden unit from the evaluation network. This unit has learned a function that is simply a positively sloped ramp from -θ, -θ̇ toward +θ, +θ̇, with the slope decreasing to zero near the midrange of θ and θ̇. The contribution that this unit makes to the evaluation network's output depends on the value of the weight with which its output is connected to the output unit and on the values of the other output-unit weights. Recall that, in addition to the hidden units' outputs, the output unit receives the four state variables as input. The output unit simply computes a weighted sum of its inputs, so it can represent any linear function of the state variables and hidden-unit outputs. For the learning run from which these figures are generated, we find that the output unit has used the four state-variable inputs to form one side of the ridge in Fig. 6 as a positive linear ramp from +θ, +θ̇ toward -θ, -θ̇. The addition of the hidden unit's output (Fig. 7) pulls down the -θ, -θ̇ corner of the evaluation network surface. This hidden unit function, which we can call a feature, is sufficient to form the ridge. In fact, all five hidden units tend to redundantly develop the same function. If more features were required, the hidden units would probably develop different functions.

Fig. 7. Output of hidden unit in evaluation network (panel shown for h = 1.6, ḣ = 1.0).

The output of the action network is stochastic; therefore, its output is represented by the probability of an action being a push to the right. Figure 8 is a plot of this probability. States for which the probability is near 1 will result mostly in pushes to the right, whereas pushes to the left will result from states for which the probability is near 0. The surface shows a quick transition from left pushes to right pushes as the state shifts from -θ, -θ̇ to +θ, +θ̇. This transition is analogous to a switching curve for a deterministic controller. The location of the transition shifts as the cart's position and velocity change in order to maintain balance to the left of vertical when at the right side of the track and to the right of vertical when at the left side.

The output unit of the action network can form the function in Fig. 8 without the aid of hidden units. The action network's hidden units tend to evolve very little from their initial states and do not develop significant weights connecting them to the output unit. For different initial weight values and different seeds for the random number generator, the learned evaluation and action functions will differ only slightly. The hidden unit functions and weight values, however, do differ significantly. For example, for some runs, the hidden units of the evaluation network learn functions that provide the +θ, +θ̇ side of the evaluation hill, with the output unit forming the -θ, -θ̇ side as a function of the direct state-variable inputs, a dual solution to that shown in the preceding figures.

Fig. 8. Output of action network (panels for h = -1.6, ḣ = -1.0; h = 0.0, ḣ = 0.0; and h = 1.6, ḣ = 1.0).
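Surfaces such as those in Figs. 6-8 are obtained by sweeping two state variables while holding the other two fixed. The short sketch below tabulates the push-right probability over a grid of θ and θ̇ for a fixed cart position and velocity; the grid ranges are illustrative, and push_right_probability stands for whatever trained action network is being examined.

```python
import math

def probability_surface(push_right_probability, h=0.0, h_dot=0.0, n=21):
    """Tabulate P(push right | theta, theta_dot, h, h_dot) over a grid of theta (within the
    +/-12 deg failure bounds) and theta_dot (illustrative +/-1.5 rad/s range)."""
    thetas = [math.radians(-12.0 + 24.0 * i / (n - 1)) for i in range(n)]
    theta_dots = [-1.5 + 3.0 * j / (n - 1) for j in range(n)]
    return [[push_right_probability((th, thd, h, h_dot)) for thd in theta_dots]
            for th in thetas]
```

Repeating the sweep for other fixed (h, ḣ) pairs, as in the panels of Fig. 8, shows how the switching region shifts with the cart's position and velocity.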

Discussion

In many real-world situations, a control objective cannot be expressed as a function defined over all states, but only for a relatively small subset of states. For some control tasks, such a minimally defined objective is perhaps even desirable. Requiring a controller to bring the value of a state variable as close to zero as possible when the true objective is just to avoid extreme values might interfere with the control of other state variables.

Learning from experience during periods of no performance feedback is difficult. Neural networks learning via reinforcement learning and temporal difference methods deal with this problem by simultaneously learning a probabilistic action-generating function and a state-evaluation function. This approach to the inverted pendulum task is unique; all other learning systems designed for this task assume more a priori knowledge, such as an explicit teacher providing correct actions [7], [9]. The applicability of this approach to other tasks has been demonstrated by Helferty et al. [19] for a one-legged hopping machine and by Hoskins and Himmelblau [20] in a process control situation.

The difficulties of motor control and other low-level control tasks may yield to the careful application of neural network learning methods. The ability of neural networks to handle multiple-input and -output variables, nonlinear functions, and delayed feedback, as well as their potential for fast, parallel implementation, warrants further investigation of neural network learning methods in control domains. It is important to realize that these methods are not special, magical, stand-alone techniques, but outgrowths of long lines of research in function approximation, optimization, signal processing, and pattern classification, and can be combined with existing control techniques in straightforward ways. For example, Miller [21] and Franklin [22] have added networks trained by supervised and reinforcement learning methods, respectively, to refine the performance of predefined controllers.

Experiments in learning control with neural networks may shed some light on how to deal with real-world uncertainties and complexities in control. However, many issues remain unresolved, such as how the performance of learning methods scales up to larger, more complex tasks than those currently being studied. For more difficult problems, a training curriculum progressing from simple to difficult parts of the problem might greatly reduce overall learning time [23].

Acknowledgments

This work was partially supported by the Air Force Office of Scientific Research and the Avionics Laboratory (Air Force Wright Aeronautical Laboratories) through Contract F33615-83-C-1078.

References

[1] K. C. Cheok and N. K. Loh, "A Ball-Balancing Demonstration of Optimal and Disturbance-Accommodating Control," IEEE Contr. Syst. Mag., vol. 7, no. 1, pp. 54-57, Feb. 1987.
[2] E. Eastwood, "Control Theory and the Engineer," Proc. IEE, vol. 115, no. 1, pp. 203-211, Jan. 1968.
[3] R. H. Cannon, Jr., Dynamics of Physical Systems, McGraw-Hill, 1967.
[4] J. K. Roberge, "The Mechanical Seal," S.B. Thesis, Massachusetts Institute of Technology, Cambridge, MA, May 1960.
[5] A. G. Barto, R. S. Sutton, and C. W. Anderson, "Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems," IEEE Trans. Syst., Man, Cybern., vol. SMC-13, pp. 834-846, Sept.-Oct. 1983.
[6] C. W. Anderson, "Strategy Learning with Multilayer Connectionist Representations," Tech. Rept. TR87-509.3, GTE Laboratories, Waltham, MA, 1987. (This is a corrected version of the report published in Proc. Fourth International Workshop on Machine Learning, Irvine, CA, pp. 103-114, June 1987.)
[7] B. Widrow and F. W. Smith, "Pattern-Recognizing Control Systems," 1963 Computer and Information Sciences (COINS) Symp. Proc., Washington, DC: Spartan, pp. 288-317, 1964.
[8] B. Widrow, "The Original Adaptive Neural Net Broom-Balancer," Int. Symp. Circuits and Syst., pp. 351-357, May 1987.
[9] A. Guez and J. Selinsky, "A Trainable Neuromorphic Controller," J. Robotic Syst., vol. 5, no. 4, pp. 363-388, Aug. 1988.
[10] V. V. Tolat and B. Widrow, "An Adaptive 'Broom Balancer' with Visual Inputs," Proc. IEEE Int. Conf. on Neural Networks, San Diego, CA, pp. II-641-II-647, July 1988.
[11] M. E. Connell and P. E. Utgoff, "Learning to Control a Dynamic Physical System," Proc. AAAI-87, vol. 2, pp. 456-460, American Association for Artificial Intelligence, Seattle, WA, 1987.
[12] B. E. Rosen, J. M. Goodwin, and J. J. Vidal, "State Recurrence Learning," First Annual Int. Neural Network Society Meeting, Boston, MA, Sept. 1988 (abstract appears in Neural Networks, vol. 1, Suppl. 1, p. 48, 1988).
[13] G. E. Hinton, "Connectionist Learning Procedures," Tech. Rept. CMU-CS-87-115, Carnegie-Mellon Univ., Pittsburgh, PA, 1987; to appear in Artificial Intelligence.
[14] R. S. Sutton and A. G. Barto, "Toward a Modern Theory of Adaptive Networks: Expectation and Prediction," Psychol. Rev., vol. 88, no. 2, pp. 135-170, 1981.
[15] R. S. Sutton, "Temporal Credit Assignment in Reinforcement Learning," Doctoral Dissertation, COINS Tech. Rept. 84-02, Univ. of Massachusetts, Amherst, 1984.
[16] R. S. Sutton, "Learning to Predict by the Methods of Temporal Differences," Machine Learning, vol. 3, pp. 9-44, 1988.
[17] D. Michie and R. A. Chambers, "BOXES: An Experiment in Adaptive Control," Machine Intelligence 2, E. Dale and D. Michie, eds., Edinburgh: Oliver and Boyd, pp. 137-152, 1968.
[18] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning Internal Representations by Error Propagation," in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, and The PDP Research Group, Cambridge, MA: Bradford, 1986.
[19] J. J. Helferty, J. B. Collins, and M. Kam, "A Learning Strategy for the Control of a Mobile Robot That Hops and Runs," Proc. IASTED-88, Galveston, TX, pp. 7-11, International Association of Science and Technology for Development, 1988.
[20] J. C. Hoskins and D. M. Himmelblau, "Automatic Chemical Process Control Using Reinforcement Learning in Artificial Neural Networks," First Annual Int. Neural Network Society Meeting, Boston, MA, Sept. 1988 (abstract appears in Neural Networks, vol. 1, Suppl. 1, p. 446, 1988).
[21] W. T. Miller, "Sensor-Based Control of Robotic Manipulators Using a General Learning Algorithm," IEEE J. Robotics Automat., vol. RA-3, no. 2, pp. 157-165, Apr. 1987.
[22] J. A. Franklin, "Learning Control in a Robotic System," Proc. IEEE Int. Conf. Syst., Man, Cybern., Alexandria, VA, pp. 466-470, Oct. 1987.
[23] O. G. Selfridge, R. S. Sutton, and A. G. Barto, "Training and Tracking in Robotics," Proc. IJCAI-85, pp. 670-672.

Charles W. Anderson received the B.S. degree in computer science from the University of Nebraska in 1978 and the M.S. and Ph.D. degrees in computer science from the University of Massachusetts, Amherst, in 1982 and 1986, respectively. He is currently a Senior Member of the Technical Staff in the Self-Improving Systems Department of GTE Laboratories, Waltham, Massachusetts, where he is studying learning methods for multilayer connectionist networks with a focus on learning methods for control domains, including problems in process and robotic control. In addition to connectionist learning methods and control, his interests include optimization, pattern classification, computer graphics, and simulation.


