Applied Optimal Control for Dynamically Stable Legged Locomotion
Russell L. Tedrake
Submitted to the Department of Electrical Engineering and Computer Science
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
at the
Massachusetts Institute of Technology
September 2004
Applied Optimal Control for
Dynamically Stable Legged Locomotion
by
Russell L. Tedrake
Abstract
Online learning and controller adaptation will be an essential component for legged robots
in the next few years as they begin to leave the laboratory setting and join our world. I
present the first example of a learning system which is able to quickly and reliably acquire
a robust feedback control policy for 3D dynamic bipedal walking from a blank slate using
only trials implemented on the physical robot. The robot begins walking within a minute
and learning converges in approximately 20 minutes. The learning works quickly enough
that the robot is able to continually adapt to the terrain as it walks. This success can be
attributed in part to the mechanics of our robot, which is capable of stable walking down
a small ramp even when the computer is turned off.
In this thesis, I analyze the dynamics of passive dynamic walking, starting with reduced
planar models and working up to experiments on our real robot. I describe, in detail,
the actor-critic reinforcement learning algorithm that is implemented on the return map
dynamics of the biped. Finally, I address issues of scaling and controller augmentation
using tools from optimal control theory and a simulation of a planar one-leg hopping robot.
These learning results provide a starting point for the production of robust and energy
efficient walking and running robots that work well initially, and continue to improve with
experience.
Thesis Supervisor:
H. Sebastian Seung
Professor, Department of Brain & Cognitive Sciences and Department of Physics
Thesis Committee:
Emilio Bizzi
Institute Professor, Massachusetts Institute of Technology
Leslie P. Kaelbling
Professor, Department of Electrical Engineering and Computer Science
Jovan Popovic
Assistant Professor, Department of Electrical Engineering and Computer Science
Jean-Jacques E. Slotine
Professor, Department of Mechanical Engineering & Information Sciences and
Department of Brain & Cognitive Sciences
Acknowledgments
First and foremost, I would like to thank a number of outstanding MIT undergraduates that
I have been able to work with thanks to the MIT Undergraduate Research Opportunities
Program (UROP). Ming-fai Fong worked on the mechanical design and construction of our
first passive walker and the first version of our actuated walker. Derrick Tan made the
robot autonomous for the first time by building the computer and batteries onto the robot.
Andrew Baines is currently building the kneed version of the robot. Each of them has
contributed immeasurably to the success of this project. In particular, I want to thank
Teresa Weirui Zhang, a mechanical engineer who worked on this project for over a year.
Teresa is largely responsible for the current mechanical design of the robot, which is in every
way superior to its predecessors.
I want to thank a number of my friends and colleagues that I have learned from and
collaborated with over the last few years. Andrew Richardson, Max Berniker, and Dan
Paluska have been a constant source for new ideas and for feedback on existing ideas.
Matt Malchano and I talked about building an actuated version of a passive walker on the
weekends, just because we thought it would be a fun project. It turned out to be the idea
that fueled this thesis. Jerry Pratt has continued to give me a lot of great feedback over the
phone, even now that he has graduated and moved into a faculty position of his own. Alan
Chen helped with the linux installations on the robots. Josh Presseisen, who I originally
met on the commuter rail, has been helping me design and build a cool body for the robot.
I also want to thank all of the members of the Seung Lab, the former Leg Lab, the new
Biomechatronics Lab, and the Bizzi Lab. These outstanding labs provided a community
for me that was supportive, friendly, and constructive.
I've also received guidance and support from a number of faculty. First I want to thank
my thesis committee: Emilio Bizzi, Leslie Kaelbling, Jovan Popovic, and Jean-Jacques
Slotine. Their time and their feedback on this document was greatly appreciated. I also
want to thank Gill Pratt, Hugh Herr, and Neville Hogan for the feedback and support on
numerous occasions throughout my graduate career.
A very humble and special thank you goes to my advisor, Sebastian. Over the last
4 1/2 years, Sebastian has given me the support and the independence that I needed to
thrive and taught me many of the mathematical tools that I missed out on during my
undergraduate career as a computer programmer. The day that he suggested that we could
buy some table-top machining equipment for the lab changed my entire research direction,
and changed my life.
Most importantly, I want to thank my wife, Rachel, who has always been there for me.
This thesis is dedicated to her.
Contents
1 Introduction 11
1.1 The State-of-the-Art 11
1.2 Machine Learning and Optimal Control 12
1.3 Passive Dynamic Walking 13
1.4 Contributions and Organization of Chapters 14
4.2 The Compass Gait 42
4.3 The Curved Foot Model 43
4.4 The Frontal Plane Model 46
4.5 Experiments 47
4.6 Adding Arms 50
4.7 Discussion 50
5 Actuating a Simple 3D Passive Dynamic Walker 51
5.1 Toddler 51
5.2 The Optimal Control Problem 52
5.3 Hand-designed controllers 54
5.3.1 Feed Forward Ankle Trajectories 54
5.3.2 Feedback Ankle Trajectories 55
5.3.3 Velocity Control 56
5.3.4 Summary 57
5.8 Discussion 63
6.5 Results 69
6.6 Discussion 70
List of Figures
2-1 Example 1a: Infinite-horizon, discounted LQR 27
2-2 Example 1b: Optimal limit cycles 28
2-3 Example 2: Infinite-horizon, discounted non-linear quadratic regulator 29
3-1 A seven link planar biped 32
3-2 Return map analysis of the Van der Pol oscillator 36
4-1 The 3D passive dynamic walker 37
4-2 The rimless wheel 38
4-3 Limit cycle trajectory of the rimless wheel 39
4-4 Return map, fixed points, and basins of attraction of the rimless wheel 41
4-5 The compass gait 42
4-6 Limit cycle trajectory of the compass gait 44
4-7 The curved foot model 45
4-8 Limit cycle trajectory of the curved foot model 46
4-9 The frontal plane model 46
4-10 Passive dynamic walking experiments 48
5-1 The 3D passive dynamic walker and Toddler 52
5-4 A typical learning curve, plotted as the average error on each step 62
5-5 θroll trajectory of the robot starting from standing still 62
7-1 A prototype knee for the next version of the Toddler robot 74
Chapter 1
Introduction
1 [Muybridge and Mozley, 1979] is an excellent pictorial reference of the huge variety of gaits in birds and
mammals.
CoP is located at:

\frac{\int_{foot} (x' - x)\, F(x')\, dx'}{\int_{foot} F(x')\, dx'}.
As we lean forward, the CoP moves from the middle of the foot toward the front of the
foot. As soon as it reaches the front edge of the foot, the foot begins to roll.
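As a rough numerical illustration of this definition (the foot geometry and pressure distributions below are hypothetical, not measurements from ASIMO or our robot), the CoP can be computed by discretizing the integrals:

import numpy as np

# Hypothetical foot: 20 cm long, with the reference point x = 0 at the ankle.
xs = np.linspace(-0.05, 0.15, 200)          # contact points along the foot [m]

def center_of_pressure(F):
    """CoP relative to the reference point: integral of x'*F(x') over integral of F(x')."""
    return np.trapz(xs * F, xs) / np.trapz(F, xs)

# Upright stance: pressure concentrated near the middle of the foot.
F_upright = np.exp(-((xs - 0.05) / 0.03) ** 2)
# Leaning forward: pressure shifts toward the toes.
F_leaning = np.exp(-((xs - 0.12) / 0.03) ** 2)

print(center_of_pressure(F_upright))   # ~0.05 m, mid-foot
print(center_of_pressure(F_leaning))   # ~0.12 m, near the front edge of the foot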
The most successful bipedal robot to date is Honda Motor Company's humanoid robot,
ASIMO. ASIMO's control system is based on tracking a trajectory which maintains the
location of the CoP near the middle of the foot. This ensures that the robot can apply
corrective ankle torques as it walks. Combined with a superb mechanical design and a
traditional high gain adaptive trajectory tracking controller, this algorithm allows ASIMO
to walk across a flat factory floor or even up stairs.
Despite these impressive accomplishments, this state-of-the-art robotic walking cannot
compare to a human in measures of speed, efficiency, or robustness. ASIMO's top walking
speed is 0.44 m/s [Honda Motor Co., 2004] and it cannot run, whereas humans don't even
transition from walking to running until about 2.0 m/s [Alexander, 1996]. The gold medal-
ist in the 100m sprint at the 2004 Summer Olympics in Athens finished in 9.85 seconds,
averaging over 10 m/s; the marathon gold medalist in the same Olympics finished 26.2
miles in 2:10:55, averaging well over 5 m/s. ASIMO's large 38.4 volt, 10 amp-hour battery
pack powers the robot for only 26 minutes [Honda Motor Co., 2004], making it roughly 20
times less efficient than human walking [Collins et al., 2004]. There are no demonstrations of
ASIMO walking on uneven or unmodelled terrain.
ASIMO's performance is limited because it is commanded to follow trajectories that
conform to an overly restrictive measure of dynamic balance. ASIMO always walks with
flat feet, but when humans walk, our CoP does reach the edge of our foot, and our feet do
roll. The situation is even more exaggerated when we jog or run - no joint torques that we
apply during the aerial phase will alter the ballistic trajectory of our CoM. What ASIMO is
missing is the ability to plan a feasible trajectory through the uncontrolled portion of state
space with the knowledge that control will be regained at the other side.
on the real robot, the learning algorithms must perform well after obtaining a very limited
amount of data. Finally, learning algorithms for dynamic walking must deal with dynamic
discontinuities caused by collisions with the ground and with the problem of delayed reward
- motor torques applied at one moment can affect the trajectory of the robot far into the
future.
In order to successfully apply learning to a real bipedal robot, I built a robot that min-
imizes many of these difficulties. The robot pictured in Figure 1-1 is capable of dynamic
walking, but has a total of only nine degrees of freedom and only four actuators. Further-
more, the mechanical design of this robot has been optimized so that the robot is actually
capable of stable walking down a small ramp even when the computer is turned off. This
dramatically improves the performance of the learning system by causing the robot to follow
reasonable trajectories even when the learning policy is initialized to a blank slate.
adding a small number of actuators to a passive dynamic walker, and the control system
was designed using machine learning. This turned out to be a winning combination; the
robot achieves robust and efficient walking on a variety of terrains after only a few minutes
of training on the physical robot.
allowed me to focus on the essence of the learning problem, instead of dealing with robot-
specific implementation details. The examples demonstrate the power of using simulation,
analytical methods, and real robots together to effectively study the problem of locomotion,
and lay the ground-work for augmenting real robots with a learning component.
Chapter 2
The goal of optimal control is to specify a metric which evaluates the performance of
the controlled system, and then to find the control inputs at each time, un, which optimize
this performance. The optimal control problem can be formulated as either finite-horizon or
infinite-horizon.
Define an instantaneous scalar cost g(x, u, n) and terminal cost h(x). The finite-horizon
cost is given by
C = h(x_N) + \sum_{n=0}^{N-1} g(x_n, u_n, n),
subject to the dynamics given some distribution of initial conditions P{xo}. For our pur-
poses, I will always assume that g and h are self-administered measures of cost, and that
they are deterministic and known. If the control inputs are generated by a deterministic1
policy

u_n = \pi(x_n, n),
1 Oftentimes, the optimal control action at each time is not unique, in which case a policy which selects
stochastically among the optimal actions is still optimal. In partially observable domains, or in game theory
where you have an intelligent opponent, it may actually be possible that a stochastic policy outperforms a
deterministic policy. For the problems that we consider in this thesis, deterministic policies are sufficient.
then the expected cost-to-go function, or value function, is given by

J^\pi(x, n) = E\left\{ h(x_N) + \sum_{m=n}^{N-1} g(x_m, u_m, m) \,\middle|\, x_n = x \right\}, \qquad J^\pi(x, N) = h(x).

The goal of optimal control is to find the optimal policy, π*, which minimizes the cost-to-go. Therefore [Bertsekas, 2000], the optimal cost-to-go satisfies

J^*(x, N) = h(x), \qquad J^*(x, n) = \min_u E\left\{ g(x, u, n) + J^*(x_{n+1}, n+1) \,\middle|\, x_n = x, u_n = u \right\},

with the optimal policy given by

\pi^*(x, n) = \arg\min_u E\left\{ g(x, u, n) + J^*(x_{n+1}, n+1) \,\middle|\, x_n = x, u_n = u \right\}.

These conditions on J* and π* are necessary and sufficient conditions for optimality; if we
find a policy and corresponding cost-to-go function which satisfy these equations, then that
policy is optimal.
An example of a finite-horizon problem is the execution of point-to-point reaching move-
ments as studied in Emilio Bizzi's lab (i.e., [Li et al., 2001]). These experiments are often
done with rhesus-macaque monkeys, which are given drops of juice as reward if they move a
manipulandum from point A to point B in a specified amount of time. The monkey learns
what motor commands (muscle activations) it needs to execute in order to maximize the
reward (apple juice).
Another interesting example of a finite-horizon problem is the trajectory learning prob-
lem for neural networks. Consider the dynamical system
u_n = \pi_{W,b}(x_n, n) = W x_n + b.
This is a common discrete time representation of a neural network. The trajectory learning
problem is formulated [Pearlmutter, 1995, Williams and Zipser, 1989] as an optimal control
problem where the goal is to find the W and b to minimize the cost
C = \frac{1}{2} \sum_n \left| x_n - x_n^d \right|^2,

where x_n^d is the desired trajectory.
gradient algorithm, which will be discussed in section 2.3.
subject to the dynamics and some distribution on the initial conditions xO. For infinite-
horizon problems, if the instantaneous cost is not a function of time and can be written as
g(xn, un), then the optimal policy does not depend on time [Bertsekas, 2000], and can be
written 2
u_n = \pi(x_n).
This property makes the infinite-horizon formulation more elegant, and is one of the primary
reasons that we use it in this thesis. The cost-to-go is
C = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N} g(x_n, u_n).
The classic example of an infinite-horizon problem is investing in the stock market. For
robots, the infinite-horizon formulation allows us to reward or penalize a robot for accom-
plishing a task without placing strict bounds on the time allowed for completion. Legged
locomotion, including both walking and running, falls nicely into this formulation.
2 For finite-horizon problems, the policy still depends on time even if the instantaneous cost does not.
2.1.3 Continuous Time Optimal Control
The optimality conditions for both finite and infinite horizon problems can be extended to
the continuous time case. For the deterministic system

\dot{x} = f(x, u),

the optimality condition becomes

0 = \min_u \left[ g(x, u) + \frac{\partial J^*}{\partial x} f(x, u) + \frac{\partial J^*}{\partial t} \right].
This famous equation is called the Hamilton-Jacobi-Bellman (HJB) equation. The optimal
policy is simply

\pi^*(x, t) = \arg\min_u \left[ g(x, u) + \frac{\partial J^*}{\partial x} f(x, u) \right].
Recall that for infinite-horizon problems, the dependence on t drops out, further simplifying
the optimality conditions.
The HJB for the discounted case is a little more complicated. We define the continuous
time, infinite-horizon, discounted cost as
C = \int_0^\infty e^{-t/\tau}\, g(x(t), u(t))\, dt.
On most robots, the control equations are evaluated at discrete intervals, but these
intervals may not be evenly spaced in time 3 . Understanding the continuous time optimal
control equations allows us to generalize the discrete time updates to handle a variable
amount of time between updates.
3 On my robots, the control equations are evaluated as quickly as possible, and the interval between updates depends primarily on the CPU load.
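A minimal sketch of one such generalization, under the assumption that the discount is parameterized by a time constant τ (this particular parameterization is an assumption, not taken from the text): each control update recomputes its discount factor from the actual elapsed time, so that two short intervals discount the future exactly as much as one long one.

import numpy as np

TAU = 1.0   # assumed discount time constant [s]

def per_step_discount(dt):
    """Discrete discount factor for an update interval of dt seconds."""
    return np.exp(-dt / TAU)

# Two updates 10 ms apart discount the future by the same total amount
# as a single update 20 ms later.
assert np.isclose(per_step_discount(0.01) ** 2, per_step_discount(0.02))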
2.1.4 Solving the Optimal Control Problem
In the next few sections I will quickly describe a number of tools for solving optimal control
problems. For some simple systems, such as linear systems with quadratic cost functions,
it is possible to find an optimal policy analytically. For more complicated systems, we must
resort to computational tools that attempt to approximate an optimal control solution.
A notable subset of these algorithms are the "reinforcement learning" algorithms, which
attempt to solve the optimal control problem using trial and error. Reinforcement learning
algorithms have the advantage that they can be applied to online learning on a real robot.
There are three fundamentally different approaches taken in reinforcement learning al-
gorithms. In policy gradient algorithms, the control policy is parameterized by some vector
w, and the gradient ∂E{C}/∂w is approximated directly and used to perform gradient de-
scent. Policy gradient methods have become popular again recently because it is possible
to prove convergence to a locally optimal policy. Unfortunately, they often suffer from slow
convergence rates.
In the second approach, the policy is not represented explicitly. Instead, these value
function algorithms attempt to learn the cost-to-go function, J. As we have seen, once
we have the cost-to-go function, the policy is completely determined. These methods are
potentially more efficient than policy gradient methods, but in general they are not guar-
anteed to converge to even a locally optimal policy4. In fact, there are illustrations of
the basic algorithms failing to converge on very simple problems.
The third approach combines the policy gradient methods with the value function meth-
ods, and is called the actor-critic approach. These methods attempt to combine the strengths
of each of the previous approaches: efficient learning and provable convergence.
For some systems, it may be possible to solve for, or at least approximate, the optimal
controller analytically. The most famous example of this is the linear-quadratic regulator
(LQR). Consider the linear system
\dot{x} = A x + B u,

and the quadratic cost function

C = x^T(T) Q x(T) + \int_0^T \left[ x^T Q x + u^T R u \right] dt,

where A, B, Q, R are constant matrices, and Q, R are symmetric with R invertible. In the
finite-horizon formulation, the HJB equation is satisfied by a quadratic cost-to-go of the form

J^*(x, t) = x^T K(t) x,
4 They are known to converge for problems with fully observable, finite state and action spaces.
where K is also symmetric. The optimal policy is then

u = \pi^*(x, t) = -R^{-1} B^T K(t) x,

where K(t) is the solution of the Riccati equation

\dot{K}(t) = -K(t) A - A^T K(t) + K(t) B R^{-1} B^T K(t) - Q, \qquad K(T) = Q.
For the infinite-horizon case, K(t) is a steady state solution to the equation above,
which can be solved for K with some algebraic tricks. The linear quadratic regulator with
Gaussian noise (LQRG) is an extension of LQR control to stochastic systems. Although
most real-world systems are not linear, the infinite-horizon LQR and LQRG gain matrices
are widely used in control system applications where it is possible to linearize a system
around the set point. This is an important idea - performing approximate optimal control
by acting optimally on a simplified model of the dynamics of the system.
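As a concrete sketch of that recipe, the infinite-horizon LQR gain can be computed numerically from the steady-state Riccati solution. The double-integrator model below is a stand-in for a linearized system, not one of the robot models in this thesis.

import numpy as np
from scipy.linalg import solve_continuous_are

# Double integrator: x = [position, velocity], x_dot = A x + B u.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)          # state cost
R = np.array([[1.0]])  # control cost (symmetric, invertible)

# Steady-state solution K of the Riccati equation, then the optimal feedback gain.
K = solve_continuous_are(A, B, Q, R)
gain = np.linalg.solve(R, B.T @ K)   # optimal policy: u = -R^{-1} B^T K x

print(gain)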
\frac{\partial}{\partial w} E\{C(w)\},
and then perform standard gradient descent on the policy parameters.
We know the equations for the parametric form of the policy exactly. In some cases we
know the dynamics of the environment exactly, and we can compute the policy gradient
analytically. For the finite-horizon case, the gradient can be written as
\frac{\partial}{\partial w_i} E\{C(w)\} = E\left\{ \frac{\partial h}{\partial x_N} \frac{\partial x_N}{\partial w_i} + \sum_{n=0}^{N-1} \left[ \frac{\partial g}{\partial x_n} \frac{\partial x_n}{\partial w_i} + \frac{\partial g}{\partial u_n} \frac{\partial u_n}{\partial w_i} \right] \right\}.
can estimate the policy gradient numerically. Suppose that we add a small Gaussian random vector z, with E{z_i} = 0 and E{z_i z_j} = σ²δ_ij, to w and compute the value

a_i(w) = \frac{1}{\sigma^2} \left[ C(w + z) - C(w) \right] z_i.

For small z, we can do a Taylor expansion to see that a(w) is a sample of the desired gradient:

E\{a_i(w)\} \approx \frac{1}{\sigma^2} E\left\{ \left[ C(w) + \sum_j \frac{\partial C}{\partial w_j} z_j - C(w) \right] z_i \right\} = \frac{1}{\sigma^2} E\left\{ \sum_j \frac{\partial C}{\partial w_j} z_j z_i \right\} = \frac{\partial}{\partial w_i} E\{C(w)\}.
The simple update

\Delta w_i = -\eta\, a_i(w)

does not necessarily move in the direction of the true gradient on every step, but can be shown to move in the direction of the true gradient on average:

E\{\Delta w_i\} \propto \frac{\partial}{\partial w_i} E\{C(w)\}.
Therefore, for small η, the algorithm is guaranteed to converge to a local minimum5. This
algorithm, called "weight perturbation", and others like it (i.e., Williams' REINFORCE
[Williams, 1992], [Baxter and Bartlett, 2001]), all try to estimate the gradient online and
perform gradient descent on average.
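A minimal sketch of the weight perturbation update described above; the quadratic cost function here is a stand-in for the true E{C(w)}, which on a robot would be measured from trials.

import numpy as np

def cost(w):
    """Stand-in for E{C(w)}: any scalar function of the policy parameters."""
    return np.sum((w - np.array([1.0, -2.0])) ** 2)

def weight_perturbation_step(w, sigma=0.01, eta=0.05):
    """One update: estimate the gradient from a single perturbed evaluation
    and take a small step against it."""
    z = sigma * np.random.randn(*w.shape)          # E{z_i}=0, E{z_i z_j}=sigma^2 delta_ij
    a = (cost(w + z) - cost(w)) * z / sigma ** 2   # gradient sample (to first order)
    return w - eta * a

w = np.zeros(2)
for _ in range(2000):
    w = weight_perturbation_step(w)
print(w)   # moves toward the minimizer [1, -2] on average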
The strength of the policy gradient methods is this guaranteed convergence to a locally
optimal policy. Their major weakness is that it can be difficult to calculate the performance
gradient of a real system analytically, and methods based on numerical estimates of the
gradient can require a very large number of samples to acquire an accurate estimate.
Let π* = [π*(x_0, 0), π*(x_1, 1), ..., π*(x_N, N)] be an optimal policy for the N-step
finite horizon problem. Now consider the subproblem where we are at x_i at time
i and wish to minimize the cost from time i to time N:

E\left\{ h(x_N) + \sum_{n=i}^{N-1} g(x_n, u_n, n) \right\}.
5 Technically, C and the second derivatives of C with respect to w must be bounded, and η must be decreased to 0 to obtain true convergence [Bertsekas and Tsitsiklis, 1996, section 3.2].
The truncated policy [π*(x_i, i), ..., π*(x_N, N)] is optimal for this subproblem.
The intuitive justification of this principle is that if we could find a better policy for the
subproblem, then we could have also found a better policy for the complete problem. The
principle of optimality tells us that it is possible to solve the complete reinforcement learning
problem by iteratively solving more and more complicated subproblems.
Dynamic programming is an efficient algorithm for solving the finite-horizon optimal control
problem when you have finite states and finite actions, and a perfect model of the dynamics
of the environment. It uses the principle of optimality to compute the optimal cost-to-go
recursively backward, starting from the terminal cost:
J^*(x, N) = h(x),
J^*(x, n) = \min_{u_n} \left[ g(x, u_n, n) + E\left\{ J^*(x_{n+1}, n+1) \mid x_n = x \right\} \right].
With a finite state and action space, this minimization can be done by searching over all
possible un and xn+1.
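A sketch of this backward recursion on a small, randomly generated finite MDP (the states, actions, transition probabilities, and costs below are placeholders):

import numpy as np

n_states, n_actions, N = 4, 2, 10

rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))        # P[u, x, x'] = Pr{x' | x, u}
P /= P.sum(axis=2, keepdims=True)
g = rng.random((n_states, n_actions))                  # instantaneous cost g(x, u)
h = rng.random(n_states)                               # terminal cost h(x)

J = np.zeros((N + 1, n_states))
policy = np.zeros((N, n_states), dtype=int)
J[N] = h
for n in range(N - 1, -1, -1):                         # work backward from the terminal cost
    # Q[x, u] = g(x, u) + E{ J*(x', n+1) | x, u }
    Q = g + np.einsum('uxy,y->xu', P, J[n + 1])
    policy[n] = np.argmin(Q, axis=1)
    J[n] = Q.min(axis=1)

print(J[0])        # optimal cost-to-go from each state at time 0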
To actually solve this problem, the algorithm needs to find the cost-to-go for every x
before proceeding to the previous n. This can be computationally prohibitive as the state
space becomes large, a problem known as the "curse of dimensionality". It is not a true
reinforcement learning algorithm, as it is not based on trial and error data and therefore is
not suitable for online learning on a robot.
Unlike policy gradient methods, dynamic programming does find a globally optimal
policy. For problems with small, finite state and action spaces, it is actually quite efficient;
it solves the problem without repeating any computations as a policy gradient method
would.
The value iteration algorithm extends the basic dynamic programming algorithm to infinite
horizon problems. The algorithm begins with an arbitrary initial estimate Ĵ(x) of J*, which
is stored in a lookup table for finite state spaces or in a function approximator for continuous
state spaces. The estimate is refined by performing the update

\hat{J}(x) \leftarrow \min_u \left[ g(x, u) + \gamma\, E\left\{ \hat{J}(x_{n+1}) \mid x_n = x, u_n = u \right\} \right]

until convergence. In the synchronous version of the algorithm, this update is performed
in a sweep over the entire state space. In the asynchronous version of the algorithm, the
update is performed on the states visited during learning trials. Assuming that all states
are visited with some non-zero probability during the learning trials, then both versions of
the algorithm can be shown to converge, Ĵ → J*, for lookup table representations as time
goes to infinity. When Ĵ is stored in a function approximator, all guarantees of convergence
disappear.
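A lookup-table sketch of the synchronous version of this update for the discounted, infinite-horizon case (the asynchronous version would apply the same backup only to the states visited during trials); the finite MDP format is the same placeholder used above.

import numpy as np

def value_iteration(P, g, gamma=0.9, tol=1e-8):
    """Synchronous value iteration on a finite MDP with a lookup-table estimate.
    P[u, x, x'] are transition probabilities, g[x, u] the instantaneous cost."""
    J = np.zeros(g.shape[0])                     # arbitrary initial estimate of J*
    while True:
        Q = g + gamma * np.einsum('uxy,y->xu', P, J)
        J_new = Q.min(axis=1)
        if np.max(np.abs(J_new - J)) < tol:      # converged: J -> J*
            return J_new, Q.argmin(axis=1)       # optimal cost-to-go and greedy policy
        J = J_new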
Unlike dynamic programming, the asynchronous value iteration algorithm forces us to
select actions before the learning has converged. Our choice of actions must balance explo-
ration with exploitation. A typical action selection strategy is to select the greedy action
with high probability, and to occasionally select a random action. For continuous action
spaces, it is sometimes possible to solve for the optimal action given the current estimate
of the value function. [Doya, 1999] observed that this is particularly easy if f(x, u) is linear in u.
Another way to obtain the cost-to-go of the current policy, J^π, is through Monte-Carlo simulation. Recall that the definition of the cost-to-go for discounted, infinite-horizon problems is

J^\pi(x) = E\left\{ \sum_{n=0}^{\infty} \gamma^n g(x_n, u_n) \,\middle|\, x_0 = x,\ u_n = \pi(x_n) \right\}.
This expectation can be approximated by running T trials and taking the average return.
Temporal difference learning, or TD(λ), is a way to combine value iteration and Monte-
Carlo estimation. Define the n-step prediction of J^π as

\tilde{J}_n(x_0) = \sum_{m=0}^{n-1} \gamma^m g(x_m, u_m) + \gamma^n \hat{J}(x_n).

Notice that the asynchronous value iteration algorithm performs the update \hat{J}(x_0) \leftarrow \tilde{J}_1(x_0), and that Monte-Carlo updating uses the update \hat{J}(x_0) \leftarrow \tilde{J}_\infty(x_0). The TD(λ) algorithm uses a weighted combination of n-step predictions to update Ĵ:

\hat{J}(x_0) \leftarrow (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} \tilde{J}_n(x_0).
Therefore, TD(0) is simply value iteration, and TD(1) works out to be the Monte-Carlo
update. The elegance of the TD update, though, is that it can be implemented very
efficiently online using an eligibility trace [Sutton and Barto, 1998].
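A tabular sketch of this online implementation using an accumulating eligibility trace; the three-state episode at the bottom is hypothetical.

import numpy as np

def td_lambda_episode(J, e, episode, gamma=0.9, lam=0.7, alpha=0.1):
    """One episode of tabular TD(lambda) policy evaluation.  `episode` is a list
    of (state, cost, next_state) triples generated by following the policy."""
    e[:] = 0.0
    for x, g, x_next in episode:
        delta = g + gamma * J[x_next] - J[x]   # temporal-difference error
        e *= gamma * lam                       # decay all eligibility traces
        e[x] += 1.0                            # bump the trace of the visited state
        J += alpha * delta * e                 # update every state in proportion to its trace
    return J

# Hypothetical 3-state chain visited under some fixed policy.
J = np.zeros(3)
e = np.zeros(3)
J = td_lambda_episode(J, e, [(0, 1.0, 1), (1, 1.0, 2), (2, 0.0, 2)])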
In general, we cannot prove that TD(λ) will converge Ĵ to J^π when Ĵ is represented in
a function approximator. [Tsitsiklis and Van Roy, 1997] proved convergence for a class of linear
function approximators. TD(1) can also be shown to converge [Bertsekas and Tsitsiklis, 1996]
for arbitrary approximators, but there is significant experimental evidence that algorithms
using λ < 1 work more quickly in practice [Sutton and Barto, 1998].
2.4.4 Q-learning
In some situations, it is not practical to solve for the optimal policy by evaluating

\arg\min_u E\left\{ g(x, u) + \gamma \hat{J}(x_{n+1}) \right\}

online. A common solution to this problem is to learn a Q-function instead of the value
function. The Q-function is the expected cost-to-go from taking action u in state x. This
is closely related to the cost-to-go function:

J^\pi(x) = Q^\pi(x, \pi(x)), \qquad J^*(x) = \min_u Q^*(x, u).

There are a variety of Q-learning algorithms which extend TD(λ) to learning a Q-function.
Learning a Q function instead of a cost-to-go function has some subtle implications.
The disadvantage is that it increases the dimensionality of the function that you are trying
to learn. One advantage is that you reduce the action selection problem to a lookup in the
function approximator (instead of computing the argmin online). The primary advantage,
though, is that it allows us to evaluate one policy while following another (called off-policy
evaluation), because our update only changes our model for the current action. Off-policy
evaluation can be desirable when the optimal policy does a poor job of exploring the state
space.
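A tabular sketch of these ideas (with hypothetical state and action indices): a one-step Q-learning backup, which is off-policy because it bootstraps from the greedy value at the next state regardless of the action actually taken, and an ε-greedy selection rule that reduces action selection to a lookup in the table.

import numpy as np

def q_learning_step(Q, x, u, g, x_next, gamma=0.9, alpha=0.1):
    """One off-policy Q-learning update: only the (x, u) entry changes, and the
    backup uses the greedy value at x_next, whatever behavior generated the data."""
    target = g + gamma * Q[x_next].min()
    Q[x, u] += alpha * (target - Q[x, u])
    return Q

def epsilon_greedy(Q, x, epsilon=0.1, rng=np.random.default_rng()):
    """Exploring behavior policy: mostly greedy, occasionally random."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(Q[x].argmin())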
2.4.5 Summary
In general, value function methods are very powerful for small problems with finite state
and action spaces. They take advantage of the underlying structure in an optimal control
problem, the HJB equation, and therefore have the potential to be much more efficient than
a policy gradient algorithm.
When the value estimate has to be stored in a function approximator instead of a lookup
table, the convergence guarantees of the algorithms are weakened dramatically. There are
a handful of great success stories using this approach, including TD-Gammon by Tesauro
[Tesauro, 1995], but there are also trivial examples of problems for which TD(λ) fails to
converge (i.e., [Boyan and Moore, 1995]). The intuitive problem is that a small change in
the parameters of the value estimate can produce large changes in the greedy policy. This
causes instability in the update.
Actor-critic methods attempt to combine the fast, efficient learning of value function meth-
ods with the guaranteed convergence of policy gradient methods. The goal of these algo-
rithms, which have recently begun to receive more attention in the reinforcement learning
literature, is to use a value estimate to reduce the variance of the gradient update.
The conventional [Sutton et al., 1999, Konda and Tsitsiklis, 1999] actor-critic update
uses an acquired estimate of the Q-function to estimate a policy gradient. For a stochastic
policy π_w(x, u) = Pr{u_n = u | x_n = x, w}:

\frac{\partial}{\partial w_i} E\{C(w)\} = \frac{\partial}{\partial w_i} E\{J^w(x_0)\} = \frac{\partial}{\partial w_i} E\left\{ \int \pi_w(x_0, u')\, Q^w(x_0, u')\, du' \right\}

= E\left\{ \int \left[ \frac{\partial \pi_w(x_0, u')}{\partial w_i}\, Q^w(x_0, u') + \pi_w(x_0, u')\, \gamma \frac{\partial}{\partial w_i} E\left\{ J^w(x_1) \mid x_0, u' \right\} \right] du' \right\},
where J^w and Q^w are shorthand for J^{π_w} and Q^{π_w}, respectively. For most systems, these
functions must also be estimated online using a function approximator. By properly param-
eterizing the function approximators for the actor, π_w, and the critic, Q^w, relative to each
other, it is possible to update the policy using this gradient estimate and prove that the
algorithm performs stochastic gradient descent for arbitrary differentiable function approx-
imators [Sutton et al., 1999]. The claim is that an estimate of the policy gradient using a
learned Q function should have less variance than an estimate based on samples taken from
only a single trajectory.
I use a slightly different actor-critic algorithm for the online optimization on the robot.
That algorithm is described in detail in chapter 5.
These dynamics could, for instance, represent pushing a one kilogram box with u Newtons
of force on a horizontal frictionless surface. Because these dynamics are linear, we can solve
quadratic cost functions analytically using the LQR results. For arbitrary cost functions,
we can discretize the state and action spaces and apply value iteration to efficiently solve
for the optimal policy. This works well for this problem because the system has only two
state variables (x, ẋ) and one action variable (u).
Second-order systems move clockwise in state space. The cost-to-go function effectively
rewinds the instantaneous cost function backward in time. Figure 2-1 shows the instanta-
neous cost, value function, and optimal policy for a quadratic regulator cost function on
this simple system. The white line represents a sample trajectory with a cross at the initial
conditions. The upper right and lower left corners of the policy are distorted by edge effects
(the true solution is linear), but otherwise the value iteration algorithm has done an excel-
Figure 2-1: Example 1a: Infinite-horizon, discounted LQR. ẍ = u, g(x, ẋ, u) = ½x² + ½ẋ² + ½u², γ = 0.9. The three panels show the instantaneous cost function (at u = 0), the value function, and the optimal policy.
lent job of matching the analytical LQR solution. If we were to remove the penalty on u
in the cost function, then the optimal policy would be bang-bang control: apply maximum
force toward the desired position, and then apply maximum braking force so that the state
ends up exactly at the origin as quickly as possible.
In locomotion, the desired behavior is a stable limit cycle. What instantaneous cost
function for this system would generate a limit cycle as the optimal behavior? Figure 2-2
shows one solution.
There is an interesting relationship between optimal control and stability theory. For
finite-horizon problems, the value function can also serve as a Lyapunov function [Slotine and Li, 1990]
for the system, since the cost-to-go monotonically decreases as the system trajectories
evolve. Unfortunately, the same can not be said for infinite-horizon discounted value func-
tions, as illustrated by the limit cycle example.
Figure 2-2: Example 1b: Optimal limit cycles. ẍ = u, g(x, ẋ, u) = x² + (|ẋ| − 5.0)² + u², γ = 0.9.
2.6.2 Example 2: Nonlinear, Second-Order System
Consider a simple pendulum:
m l^2 \ddot{\theta} - m g l \sin(\theta) = u,
where m is a point mass, l is the length, g is gravity, and θ represents the state of the
system. This system cannot be solved using LQR6, even when the cost function is quadratic.
However, since the state and action spaces are still small, we can once again apply the value
iteration algorithm. Although the optimal value function resembles the value function for
the linear case, the optimal policy, shown in figure 2-3, is clearly nonlinear.
Figure 2-3: Example 2: Infinite-horizon, discounted nonlinear quadratic regulator.
6 Recall, however, that a common approach to deriving controllers for non-linear systems is to linearize the system over small regions and apply local linear controllers.
Chapter 3
Figure 3-1: A seven link planar biped
els of walking and running dynamics (i.e., [Kajita et al., 1992, Schwind, 1998]). Many of
these models come from the extensive biomechanics literature describing kinetic and kine-
matic properties of gait in humans and animals [McMahon, 1984, Alexander, 1996]. For the
trajectory following robots, these models are useful in designing the desired trajectories.
In a category by themselves are the controllers based on coupling "neural" oscillators or
pattern generators to a walking robot (i.e. [Taga, 1995]). There is definitive experimental
evidence in lower vertebrates, and suggestive evidence in higher mammals, that pattern
generators like these can be found in our spinal cords and are used to coordinate movement
between multiple limbs. It is amazing that these generators can be tuned by hand to
construct a detailed enough feedback response to allow dynamically stable walking, but it
is not clear that this approach will scale to humanoids performing a wide class of movements.
The final, and often distinct, class of controllers are those acquired automatically using
machine learning.
Although not directly related to legged locomotion, [Ng et al., 2003] presents some re-
lated work on learning control of an autonomous helicopter. The learning idea was to
obtain a linear stochastic model of the helicopter dynamics using traditional model estima-
tion algorithms. A policy for this model was optimized offline using the Pegasus algorithm
[Ng and Jordan, 2000], which approximates the stochastic system with a deterministic one
by using the same random number generator on each iteration of the policy gradient evalua-
tion. Using a very simple policy gradient algorithm, Ng provided impressive demonstrations
of helicopter control. I actually use an algorithm that, in retrospect, is very similar to Pe-
gasus for the hopping results in Chapter 6.
A number of researchers have considered the problem of using machine learning to tune
the parameters of a hand-designed controller. [Smith, 1998] introduces a method combining
CMACs and eligibility traces to train a feedback controller online. The algorithm was
applied to the control of a physical cart-pole system as well as a simulation of the Raibert
3D one-legged hopper and of a 3D biped. For the hopper, Smith's networks accepted a
small number of inputs (speed, body angle, and slope of the ground) and output 2 of the
gains for a modified Raibert controller. His reward function compared desired and actual
speeds, and he reports good performance after only 20 iterations of the algorithm. Similarly,
[Chew, 2000] used Q-learning with CMAC networks to find key parameters of swing leg
strategy to augment the controller by [Pratt and Pratt, 1998]. One or two parameters
were tuned at a time, and the policy was evaluated once per step. The objective function
included a desired forward velocity and lateral stability. This controller reportedly produced
100 seconds of walking by the 16th trial.
Another common approach is to approximate an optimal solution by solving a reduced-
order model of the system. [Larin, 1999] and [Seyfarth et al., 2002] reduce the dynamics
of the planar one-leg hopper down to the spring-loaded inverted pendulum (SLIP) model,
which consists of a point mass on a springy, massless leg. On this model, they could predict
exactly the amount of thrust required to put the hopper back at the desired hopping height
in one hop. [Maier et al., 1999] took a similar approach, but used radial basis functions
(RBFs) to automatically acquire the controller. The network inputs were the leg angle
of attack at touchdown and at lift-off, the x and y components of the COM velocity at
touchdown and at lift-off, and the desired horizontal velocity. The output of the network is
the angle of attack of the leg. The results presented in Chapter 6 of this thesis are similar,
but are implemented on a complete model of the hopper instead of the reduced SLIP model.
Learning control has also been successfully implemented on Sony's quadrupedal robot
AIBO (i.e., [Kohl and Stone, 2004]). This algorithm optimizes a parameterized open-loop
trajectory for fast walking, using speed as the only objective function. The "policy gradient"
algorithm used here is naive numerical gradient descent. This approach appears to work
because AIBO is robust enough to reproduce trajectories with reasonable fidelity using only
open-loop control (stability is not a concern), and because the experimenters automated the
learning process with the robot walking back and forth across the field between beacons.
Trajectory optimization as seen on AIBO has been studied extensively in the robot arm
literature. It is the constraints of dynamic stability that distinguish the formulation of
the legged locomotion trajectory optimization problem from its predecessors.
Similar trajectory optimization routines have been described for bipeds by [Yamasaki et al., 2002]
and [Channon et al., 1990], to name a few. [Yamasaki et al., 2002] used genetic algorithms
to learn open-loop trajectory commands for PINO, a little hobby-servo based biped. They
claimed that by tuning trajectories (parameterized by Fourier coefficients) they were able
to compensate for the unmodelled dynamics of the servo motors to produce a smoother and
more efficient gait. [Channon et al., 1990] reduced gait optimization to fitting the coeffi-
cients of a polynomial for the feed forward trajectory of the swing foot and of the hip for a
five link biped with the stance leg bolted to the ground. The remaining joint angles were
computed using inverse kinematics. They used an interesting cost function that actually
attempted to model some of the motor properties ("armature losses"). Their final plot
shows the minimal energy for a given walking speed, which they claim correlates nicely with
expectations. [Juang and Lin, 1996] took a similar approach using the Backpropagation-
Through-Time algorithm.
One of the most principled examples of applying numerical optimal control to bipedal
locomotion comes from [Hardt et al., 1999]. They develop a tool for efficiently computing
the dynamics and gradients of a rigid-body mechanism, even through a discrete impulse
model. They then apply an off-the-shelf optimal control solver (DIRCOL) to a simulation of
a five-link biped model to solve for the feed-forward torques of each joint. The optimization
attempted to minimize f u(t)Tu(t)dt while satisfying boundary constraints. The stated goal
of this work was to model human walking, which presumably minimizes energy consumption.
There have been a number of other modelling studies of human locomotor "optimal"
control. [Kuo, 1995] argued that LQRG control could explain the selection of control
strategies that humans use in response to small perturbations to stable upright balance.
[Todorov, 2002] argues that stochastic optimal feedback control can explain a wide vari-
ety of motor control and motor coordination tasks in biology. [Anderson and Pandy, 2001]
used brute-force dynamic optimization packages to optimize energy consumption via the
feed-forward muscle-activation trajectories for a three-dimensional, neuro-musculo-skeletal
model of the human body. By enforcing only kinematic boundary conditions, they claim
that their model reproduces the salient features of normal gait, and justifies the use of
minimum metabolic energy per unit distance traveled as a measure of walking performance.
Finally, and most recently, [Vaughan, 2003, Vaughan et al., 2004] took an approach that
is very similar to the one used in this thesis - learning a control strategy to actuate a
simulation of a passive dynamic walker. They used genetic algorithms to tune both an open-
loop central pattern generator and a neural network feedback controller. They also made
an effort to evaluate the robustness of the learned controller to a variety of parameters, in
an attempt to derive a controller that could potentially work on a real robot in the near
future.
During stable walking on flat terrain, the limit cycle trajectories of a legged robot can be ana-
lyzed using a Poincaré map, also known as a return map [Strogatz, 1994]. This tool has been
used extensively in the legged robot literature (i.e., [McGeer, 1990, Koditschek and Buehler, 1991]).
For an n-dimensional dynamical system ẋ = f(x), a Poincaré section S is an n − 1 dimensional surface that the system crosses exactly once during each period. By integrating the dynamics forward from one intersection of S to the next, we can describe the discrete return map dynamics of the system as

x_{n+1} = r(x_n).

The advantage of doing this is that the limit cycle stability of the periodic trajectory in x can now be studied by looking at the fixed point stability of the discrete system in x_n.
As an example, consider the classical Van der Pol oscillator:

\ddot{x} + \mu (x^2 - 1) \dot{x} + x = 0, \qquad \mu > 0.

Figure 3-2 plots a sample trajectory of this dynamics and the associated return map. Fixed
points of the return map are located at the intersection of the return map and the line of
slope one. The local stability of this fixed point can be determined by an eigenvalue analysis
of the return map dynamics linearized around this point.
Figure 3-2: Return map analysis of the Van der Pol oscillator. On the left is a sample trajectory of the oscillator plotted in state space. The vertical line represents the Poincaré section. On the right is the associated return map plotted through that section; the diagonal line on this plot is the line of slope one. The points are return map data acquired by numerically simulating the oscillator dynamics.
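Return-map data of this kind can be generated with a few lines of numerical integration. The sketch below assumes the standard Van der Pol form with μ = 1 and takes the Poincaré section at the line x = 0 crossed with ẋ > 0; the parameter values and section placement are assumptions, not necessarily those used for the figure.

import numpy as np
from scipy.integrate import solve_ivp

MU = 1.0   # assumed Van der Pol parameter

def vdp(t, s):
    x, xdot = s
    return [xdot, -MU * (x ** 2 - 1.0) * xdot - x]

def section(t, s):
    return s[0]             # Poincare section: the line x = 0
section.direction = 1.0     # record only upward (xdot > 0) crossings

def return_map(xdot_n):
    """One iteration of the return map: integrate from the section back to itself."""
    sol = solve_ivp(vdp, [0.0, 20.0], [0.0, xdot_n], events=section,
                    max_step=0.01, rtol=1e-8)
    t_ev, y_ev = sol.t_events[0], sol.y_events[0]
    crossings = y_ev[t_ev > 1e-6]       # drop the trivial crossing at t = 0
    return crossings[0, 1]              # xdot at the next crossing of the section

# Sample the map; the fixed point is where r(xdot) crosses the line of slope one.
for xd in np.linspace(0.5, 3.5, 7):
    print(f"{xd:4.2f} -> {return_map(xd):4.2f}")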
Although it is rare that we are able to obtain an analytical form for the return map
dynamics, r, I make extensive use of numerical return map analysis throughout this thesis.
For a walking system, the return map is often taken at the point of collision between one
of the feet and the ground, and is often called the step-to-step return map [McGeer, 1990].
This is a natural configuration to slice the trajectory because it simplifies the analytical
analysis and because it is an event that is easy to measure using contact sensors that are
present on most walking robots.
Chapter 4
In order to build an actuated walker capable of stable passive walking down a ramp, I
first needed to understand the dynamics of passive walking. The passive dynamic walker
shown in Figure 4-1 represents the simplest machine that I could build which captures the
essence of stable dynamic walking in three dimensions. It has only a single passive pin
joint at the hip. When placed at the top of a small ramp and given a small push sideways,
the walker will rock onto a single stance leg, allowing the opposite leg to leave the ground
and swing forward down the ramp. Upon landing, the robot rocks onto the opposite foot,
and the cycle continues. This walker is morphologically equivalent to the Tinkertoy walker
[Coleman and Ruina, 1998], except that on our robot the center of the radius of curvature
of the feet is higher than the center of mass. Because of this, standing is a statically stable
configuration.
The energetics of this passive walker are common to all passive walkers: the energy lost
due to friction and collisions when the swing leg returns to the ground are balanced by the
gradual conversion of potential energy into kinetic energy as the walker moves down the
slope. The characteristics of the walker's gait are determined by the mass distribution and
by the shape of the large curved feet. In order to understand this relationship, I began by
analyzing two of the simplest walking models: the rimless wheel and the compass gait. By
working up to a dynamic model of our curved foot robot, I was able to design the curvature
for the feet to tune the step length and step frequency of the passive gait.
In this chapter, I will present an analysis of these passive walking models, working up
to the design and experimental stability analysis of the physical 3D passive dynamic walker
pictured in Figure 4-1.
The most elementary model of passive dynamic walking, first used in the context of
walking by [McGeer, 1990], is the rimless wheel. This simplified system has rigid legs and
only a point mass at the hip as illustrated in Figure 4-2. To further simplify the analysis,
we make the following modelling assumptions:
* Collisions with the ground are inelastic and impulsive (only angular momentum is conserved around the point of collision).
* The stance foot acts as a pin joint and does not slip.
* The transfer of support at the time of contact is instantaneous (no double support phase).
The most comprehensive analysis of the rimless wheel was done by [Coleman, 1998]. The
analysis presented here simplifies and extends that work with a novel method for deriving
the basins of attraction using contraction analysis [Lohmiller and Slotine, 1998].
The dynamics of the system when one leg is on the ground are given by

\ddot{\theta} = \frac{g}{l} \sin(\theta + \gamma).

If we assume that the system is started in a configuration directly after a transfer of support (θ(0) = −α), then forward walking occurs when the system has an initial velocity, θ̇(0) > ω_1, where

\omega_1 = \sqrt{\frac{2g}{l}\left[1 - \cos(\alpha - \gamma)\right]}.

ω_1 is the threshold at which the system has enough kinetic energy to vault the mass over the stance leg and take a step. This threshold is zero for γ = α and does not exist for γ > α.
The next foot touches down when θ(t) = α, at which point the conversion of potential energy into kinetic energy yields the velocity

\dot{\theta}(t^-) = \sqrt{\dot{\theta}^2(0) + \frac{4g}{l} \sin\alpha \sin\gamma}.

The angular momentum around the point of collision at time t just before the next foot collides with the ground is

L(t^-) = m l^2 \dot{\theta}(t^-) \cos(2\alpha).

The angular momentum at the same point immediately after the collision is

L(t^+) = m l^2 \dot{\theta}(t^+).
Simulations of these dynamics reveal a stable limit cycle solution with a continuous
phase punctuated by a discrete collision, as shown in Figure 4-3. The red dot on this graph
represents the initial conditions, and this limit cycle actually moves counter-clockwise in
phase space because for this trial the velocities were always negative. The collision represents
an instantaneous change of velocity, and a transfer of the coordinate system to the new point
of contact.
We can now derive the angular velocity at the beginning of each stance phase as a function
of the angular velocity of the previous stance phase. First we will handle the case where
γ < α. For θ̇_n > ω_1, this "step-to-step return map" is simply

\dot{\theta}_{n+1} = \cos(2\alpha) \sqrt{\dot{\theta}_n^2 + \frac{4g}{l} \sin\alpha \sin\gamma}.

Using the same analysis for the other cases, we can complete the return map. For 0 < θ̇_n < ω_1, we have

\dot{\theta}_{n+1} = -\dot{\theta}_n.
Using the convention that negative initial velocities incur a collision immediately, the threshold for this branch is

\omega_2 = -\frac{1}{\cos(2\alpha)} \sqrt{\frac{2g}{l}\left[1 - \cos(\alpha + \gamma)\right]}.

Figure 4-3: Limit cycle trajectory of the rimless wheel (m = 1, l = 1, g = 9.8, α = π/8, γ = 0.08). Notice that the output coordinate system has been changed so that the trajectory can be compared directly to the models presented in the remainder of this chapter.
Notice that the return map is undefined for θ̇_n ∈ {ω_1, ω_2}, because from these configurations the wheel will end up in the (unstable) equilibrium point where θ = −γ and θ̇ = 0, and will therefore never return to the map.

This return map blends smoothly into the case where γ > α. In this regime, ω_1 no longer exists because it is kinematically impossible to have the wheel statically balancing on a single leg. For this case, the region 0 < θ̇_n < ω_1 disappears, and the return map presented for θ̇_n > ω_1 is valid for θ̇_n > 0. The negative half of the return map remains mostly intact, because ω_2 is simply the threshold for transfer onto the rear leg. The only difference is that the system is now well-defined at the point ω_2, and the conditions for the last region become θ̇_n < ω_2.
For a fixed point, we require that θ̇_{n+1} = θ̇_n = ω*. Our system has two possible fixed points, depending on the parameters:

\omega^*_{stand} = 0, \qquad \omega^*_{roll} = \cot(2\alpha) \sqrt{\frac{4g}{l} \sin\alpha \sin\gamma}.

The limit cycle plotted in Figure 4-3 illustrates a state-space trajectory in the vicinity of the rolling fixed point. ω*_stand is a fixed point whenever γ < α, and ω*_roll is a fixed point whenever ω*_roll > ω_1. It is interesting to view these bifurcations in terms of γ. For small γ, ω*_stand is the only fixed point, because energy lost from collisions with the ground is not compensated for by gravity. As we increase γ, we obtain a stable rolling solution, where the collisions with the ground exactly balance the conversion of gravitational potential to kinetic energy. As we increase γ further to γ > α, it becomes impossible for the center of mass of the wheel to be inside the support polygon, making standing an unstable configuration.
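A short numerical check of these expressions, using the parameters of Figure 4-3 and the step-to-step return map as reconstructed above; iterating the map from any initial velocity above ω_1 converges to the rolling fixed point.

import numpy as np

# Parameters from Figures 4-3 and 4-4.
l, g, alpha, gamma = 1.0, 9.8, np.pi / 8, 0.08

w1 = np.sqrt(2 * g / l * (1 - np.cos(alpha - gamma)))   # threshold to vault forward
w_roll = (np.cos(2 * alpha) / np.sin(2 * alpha)) * np.sqrt(4 * g / l * np.sin(alpha) * np.sin(gamma))

def step(w):
    """Step-to-step return map for forward rolling (w > w1)."""
    return np.cos(2 * alpha) * np.sqrt(w ** 2 + 4 * g / l * np.sin(alpha) * np.sin(gamma))

w = 5.0
for _ in range(20):
    w = step(w)          # collisions contract neighboring trajectories
print(w1, w, w_roll)     # the iterated map settles onto the rolling fixed point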
Figure 4-4: Return map, fixed points, and basins of attraction of the rimless wheel (m = 1, l = 1, g = 9.8, α = π/8, γ = 0.08). The gray area is the basin of attraction of ω*_roll, and the white area is the basin of attraction of ω*_stand.
To understand the stability of these fixed points, we first observe that the return map
dynamics are a contraction mapping [Lohmiller and Slotine, 1998]. Briefly, a system's dy-
namics are said to be contracting if initial conditions are forgotten exponentially. A sufficient
condition for the discrete system x_{n+1} = f(x_n) to be contracting is that

F = \frac{\partial f}{\partial x}^T \frac{\partial f}{\partial x} - I

is uniformly negative definite.
Intuitively, although the rimless wheel is modelled as conservative during the stance phase,
foot collisions cause neighboring trajectories to converge, causing the dynamics to contract
on the return map. For O, > w1 , we have
#2
A =cos2 (2a) - 1 < 0.
2 + 49 sinasinY
Scos 4 (2a)
#2 cos 2 (2a) - 4 sin a sin-y
Finally, by convention the one-step dynamics for 0 < θ̇_n < ω_1 do not include a foot collision and therefore are not contracting. However, with the exception of the fixed point at θ̇_n = 0, points in this region are mapped in one step into ω_2 < θ̇_{n+1} < 0, making the two-step dynamics contracting. Therefore, except for the undefined points ω_1 and ω_2, we have the global property that neighboring trajectories will converge and initial conditions will be exponentially forgotten.
Consider the parameter regime where both fixed points exist. First, observe that the region θ̇_n > ω_1 is closed (trajectories inside the region will stay in that region). In other words, θ̇_{n+1} > ω_1 when θ̇_n > ω_1 and ω*_roll > ω_1. Since the region is closed and contracting with a fixed point ω*_roll, all trajectories in the region θ̇_n > ω_1 converge exponentially to ω*_roll. By iterating the return map backward, we can map out all of the initial conditions that will cause the system to enter this contracting region. This set of initial conditions (plotted in gray in Figure 4-4) defines the basin of attraction of the rolling fixed point, and includes a series of stripes with increasing width in the region θ̇_n < 0. The boundaries of this basin of attraction are formed by the points ω_1, ω_2, and all initial conditions that lead to ω_1 and ω_2, all of which will eventually lead to the unstable fixed point of the continuous dynamics where θ = −γ and θ̇ = 0. For convenience, we will refer to the set of boundary points as B.
The remaining regions (outside of the basin of attraction of the rolling fixed point and not in B) form the basin of attraction of the standing fixed point, and are plotted in white in Figure 4-4. To see this, observe that this is by definition a closed region (if a trajectory were to ever leave the region, then it would have been colored gray). Since all trajectories in this region converge, they must converge to the only fixed point in that region, ω*_stand.

When ω*_roll < ω_1, only the standing fixed point exists. The region θ̇_n > ω_1 is still contracting, but is no longer a closed set and no longer has a fixed point. Therefore all initial conditions θ̇_0 ∉ B will converge exponentially to ω*_stand.
As γ approaches α, the width of the gray stripes increases until, when γ > α, the white region and the standing fixed point cease to exist. For this regime, the system is well defined at all of the boundary conditions, and the fixed point ω*_roll is globally exponentially stable.
This model was recently implemented in a physical robot which could control its speed
by actively changing the leg length [Yan and Agrawal, 2004].
The rimless wheel models only the dynamics of the stance leg, and simply assumes that there
will always be a swing leg in position at the time of collision. To remove this assumption,
we take away all but two of the spokes, and place a pin joint at the hip. To model the
dynamics of swing, we add point masses to each of the legs.
In addition to the modelling assumptions used for the rimless wheel, we also assume that
the swing leg retracts in order to clear the ground without disturbing the position of the
mass of that leg. This model, known as the compass gait, is well studied in the literature
using numerical methods [Goswami, 1999, Spong and Bhatia, 2003], but relatively little is
known about it analytically.
The state of this robot can be described by 4 variables: θ_st, θ_sw, θ̇_st, and θ̇_sw. The abbreviation st is shorthand for the stance leg and sw for the swing leg. Using q = [θ_sw, θ_st]^T, the swing phase dynamics can be written as

H(q)\ddot{q} + B(q, \dot{q})\dot{q} + G(q) = 0,
Figure 4-5: The compass gait.
with
H = \begin{bmatrix} m b^2 & -m l b \cos(\theta_{st} - \theta_{sw}) \\ -m l b \cos(\theta_{st} - \theta_{sw}) & (m_h + m) l^2 + m a^2 \end{bmatrix},

B = \begin{bmatrix} 0 & m l b \sin(\theta_{st} - \theta_{sw})\, \dot{\theta}_{st} \\ m l b \sin(\theta_{st} - \theta_{sw})\, \dot{\theta}_{sw} & 0 \end{bmatrix},

G = \begin{bmatrix} m b g \sin(\theta_{sw}) \\ -(m_h l + m a + m l)\, g \sin(\theta_{st}) \end{bmatrix}.
Q^+(\alpha)\, \dot{q}^+ = Q^-(\alpha)\, \dot{q}^-,

where the matrices Q^+(α) and Q^-(α) follow from conservation of angular momentum of the whole walker about the point of impact and of the swing leg about the hip, and α = (θ_sw − θ_st)/2.
Numerical integration of these equations reveals a stable limit cycle, plotted in Figure
4-6. The cycle is composed of a swing phase (top) and a stance phase (bottom), punctuated
by two instantaneous changes in velocity which correspond to the ground collisions. The
dependence of this limit cycle on the system parameters has been studied extensively in
[Goswami et al., 1996].
The basin of attraction of the stable limit cycle is a narrow band of states surrounding
Figure 4-6: Limit cycle trajectory of the compass gait (m = 5 kg, m_h = 10 kg, a = b = 0.5 m, γ = 0.03 deg, q(0) = [0, 0, 2, −0.4]^T). θ_lPitch is the pitch angle of the left leg, which is recovered from θ_st and θ_sw in the simulation with some simple book-keeping.
the steady state trajectory. Although the simplicity of this model makes it analytically
attractive, this lack of stability makes it difficult to implement on a physical device.
Figure 4-7: The curved foot model.
To model the sagittal plane dynamics of the passive walker in Figure 4-1, we need to
extend the compass gait model in two important ways. First, the addition of the large
curved feet will dramatically increase the basin of attraction of the stable walking solution.
We assume that one of the feet is always in contact with the ground at exactly one point.
We also assume that the feet do not slip, but we neglect rolling friction. Second, we replace
the point masses in the compass gait model with more realistic mass and moments of inertia
for each link. A complete listing of the parameters of this new model can be found in Figure
4-7, and the dynamics are given by:

H_{22} = I + m_l (b - d)^2,
B_{11} = m_l R_s (b + d) \sin(\theta_{st} - \gamma)\, \dot{\theta}_{st} + \tfrac{1}{2} m_l a (b - d) \sin(\theta_{st} - \theta_{sw})\, \dot{\theta}_{sw},
B_{12} = m_l (b - d) \left[ d \sin(\theta_{st} - \theta_{sw}) (\dot{\theta}_{sw} - \tfrac{1}{2}\dot{\theta}_{st}) + R_s \sin(\theta_{sw} - \gamma)\, \dot{\theta}_{sw} \right],
B_{21} = m_l (b - d) \left[ d \sin(\theta_{st} - \theta_{sw}) (\dot{\theta}_{st} - \tfrac{1}{2}\dot{\theta}_{sw}) - \tfrac{1}{2} R_s \sin(\theta_{sw} - \gamma)\, \dot{\theta}_{sw} \right],
B_{22} = \tfrac{1}{2} m_l (b - d) \left[ d \sin(\theta_{st} - \gamma) + R_s \sin(\theta_{sw} - \gamma) \right] \dot{\theta}_{st},
G_1 = m_l g (b + d) \sin\theta_{st} - 2 m_l g R_s \sin\gamma,
G_2 = m_l g (b - d) \sin\theta_{sw}.
As with the compass gait model, we assume the swing foot can swing through the
ground. Only the transfer of support as the swing leg becomes the stance leg is modeled as
a collision. Because our model has large curved feet rather than point feet, some portion of
the swing foot remains below the ground for the portion of the swing phase after the swing
leg moves past the stance leg. Therefore, we are not able to model the time of collision as
the time of the second impact. Instead, we couple this sagittal plane model to the frontal
plane model discussed in the next section by estimating that a collision occurs once every
half period of the frontal plane oscillations. At this moment, the collision is again modeled
as an angular momentum conserving impulse, analogous to the compass gait collision.
This simulation generates stable trajectories (see Figure 4-8) that are extremely similar
to those generated by the compass gait (recall Figure 4-6), except that they are much more
stable. Our dynamics do not model the edges of the feet, so our simulation actually models
a passive walker shaped like two halves of an ellipsoid. Nevertheless, we have not been able
to find any initial conditions from which the system does not return to a stable gait. Figure
4-8 was generated using the initial conditions with all variables set to zero and a slope of
0.027 radians, which is approximately the starting configuration that we use on our passive
walker.
Figure 4-8: Limit cycle trajectory of the curved foot model. θ_lPitch is the pitch angle of the left leg, which is recovered from θ_st and θ_sw in the simulation with some simple book-keeping.
The step length and the forward velocity of the steady state gait can be tuned by
adjusting the radius of curvature, R. Smaller radii cause the robot to fall forward more
quickly. For the slope of 0.027 radians, a simulation of our model predicts that our robot
will take steps of 2.56 inches (6.50 cm) and walk with an average forward velocity of 8.81
inches/second (22.38 cm/s).
When |θ| > φ, the ground contact point is in the curved portion of one of the feet (the boundary condition on the outside of the foot is not modeled). When |θ| < φ, the ground contact is along the inside edge of the foot, and the frontal plane dynamics switch to this second contact condition.

¹Other mechanisms for achieving foot clearance for the swing leg include bending at the knees, or cutting a hole in the ramp where the collision would have occurred.
4.5 Experiments
The planar models from the last two sections can be used as tools for designing the curvature
of the feet to approximately tune the step frequency and step length of our robot. The
coupling between these models is more complicated, and we are currently studying them in
a simulation of the full 3D dynamics. The most important characteristic of this coupling is
that energy from the sagittal plane stabilizes oscillations in the frontal plane. Consequently,
we should expect to observe smaller steps than those predicted by the sagittal plane model.
Using the curvature of the feet in the frontal (Rf) and sagittal (R,) planes determined
from the planar models, we machined experimental feet on our CNC milling machine. The
surfaces of the feet are given by:
z = \sqrt{R_f^2 - x^2} - R_f + \sqrt{R_s^2 - y^2} - R_s.
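As a small worked example of this surface, the sketch below evaluates z(x, y) on a grid, e.g. for generating a CNC toolpath. The grid extents and the radii R_f and R_s used here are illustrative values only, not the robot's actual dimensions.

```python
# Sketch: evaluate the machined foot surface z(x, y) on a grid.
import numpy as np

def foot_surface(x, y, Rf, Rs):
    # z = sqrt(Rf^2 - x^2) - Rf + sqrt(Rs^2 - y^2) - Rs
    return np.sqrt(Rf**2 - x**2) - Rf + np.sqrt(Rs**2 - y**2) - Rs

xs = np.linspace(-0.03, 0.03, 61)     # hypothetical 6 cm wide foot blank [m]
ys = np.linspace(-0.05, 0.05, 101)    # hypothetical 10 cm long foot blank [m]
X, Y = np.meshgrid(xs, ys)
Z = foot_surface(X, Y, Rf=0.10, Rs=0.20)   # example radii of curvature [m]
```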
Using these feet, our robot produces stable periodic trajectories when placed on a small
decline. Figure 4-10 demonstrates this stability with a sample trajectory of the machine
walking down a slope of 0.027 radians. θ_roll is the roll angle of the robot body, in radians, which was simply called θ in the frontal plane model. This data was recorded from the
actuated version of the robot with its ankle joints mechanically locked in place, because the
passive robot is not equipped with the necessary sensors.
The limit cycle displayed in Figure 4-10 is fairly typical. Notice that the initial conditions
for the robot are slightly outside the steady state trajectory, but that trajectories converge
to a very reproducible cycle in roll and pitch. The robot has an uncompensated moment
about the yaw axis - it will twist and turn whenever the point contact of the foot slips on
the ground. This can be seen in the wandering trace of the yaw variable.
Upon close inspection, we determined that the majority of the noise visible in our
unfiltered data is actually due to a mechanical vibration of the leg at approximately 10
Hz. For this reason, we have decided to low-pass filter our limit cycle plots at 8 Hz with a
4th-order Butterworth filter (sampling rate is 100 Hz).
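For reference, the filtering described here can be reproduced with a few lines of SciPy; the synthetic stand-in signal below is only an assumption used to make the snippet self-contained, and the zero-phase filtfilt call is one reasonable choice for offline plotting rather than necessarily the thesis's exact implementation.

```python
# Sketch: 4th-order Butterworth low-pass at 8 Hz, 100 Hz sampling.
import numpy as np
from scipy.signal import butter, filtfilt

fs, fc = 100.0, 8.0                          # sample rate and cut-off [Hz]
b, a = butter(4, fc / (fs / 2.0))            # 4th-order low-pass coefficients
t = np.arange(0.0, 16.0, 1.0 / fs)
roll_raw = 0.2 * np.sin(2 * np.pi * 0.8 * t) + 0.01 * np.random.randn(t.size)  # stand-in trace
roll_filtered = filtfilt(b, a, roll_raw)     # zero-phase filtering for the plots
```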
The local stability of the passive walkers is traditionally quantified by examining the eigen-
values of the linearized step-to-step return map [McGeer, 1990], taken around a point in
the period either immediately preceding or immediately following the collision. While we
are currently designing a foot contact switch that will not interfere with the curved feet of
the robot, the return map for this analysis is evaluated through the hyperplane θ_roll = 0, when θ̇_roll > 0. The point in the cycle when the robot passes through the vertical position
in the frontal plane is the expected point of impact.
The state of our passive robot is described by 4 variables (θ_yaw, θ_lPitch, θ_rPitch, θ_roll) and their derivatives; therefore the return map has dimension 7. lPitch is short for left leg pitch and rPitch for right leg pitch. In a simulation, we could find the eigenvalues
of the return map numerically by perturbing the robot from the fixed point by ε in each
dimension [Goswami et al., 1996]. Unfortunately, on the real robot we do not have accurate
enough position control nor accurate enough sensors to perform this idealized analysis. To
evaluate the eigenvalues of the return map experimentally on our robot, we ran the robot
from a large number of initial conditions and created the 7 × 1 vectors x_i^j which represent the state of the system on the ith crossing of the return map during the jth trial. For each trial we estimated x*, the equilibrium of the return map. Finally, we performed a least squares fit of the matrix A to satisfy the relation

x_{i+1}^j - x^* \approx A\,(x_i^j - x^*).

This was accomplished by accumulating the data from all trials into matrices X and Y, whose columns are the vectors (x_i^j - x^*) and (x_{i+1}^j - x^*) respectively, and computing

A = Y X^T (X X^T)^{-1}.

Figure 4-10: Passive dynamic walking experiments. The top plot shows the raw (unfiltered) yaw, pitch, and roll sensors for walking down a slope of 0.027 radians. The bottom plot is the state space trajectory of the resulting limit cycle in the roll axis, low-pass filtered with a cut-off at 8 Hz.
After 63 trials with the robot walking down a ramp of 0.027 radians, our linear approximation of the return map had the following eigenvalues: 0.88, 0.87, 0.75, 0.70, 0.34 ± 0.1i. 61 trials on a slope of 0.035 radians produce very similar eigenvalues (0.88, 0.70, 0.43 ± 0.01i, 0.36 ± 0.08i, 0.21). We have also studied the associated eigenvectors, but find them difficult to interpret since they are sensitive to the units and scale of our data. The largest eigenvalue magnitude of 0.88, safely less than one, indicates that this system is locally stable.
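A compact sketch of this return-map fit is given below. The data layout and the use of the per-trial mean as the estimate of x* are assumptions for illustration; the thesis does not specify how x* was estimated.

```python
# Sketch: least-squares linear return map fit, A = Y X^T (X X^T)^{-1}.
import numpy as np

def fit_return_map(crossings):
    """crossings[j] holds the sequence of 7-dim section states from trial j."""
    X_cols, Y_cols = [], []
    for trial in crossings:
        trial = np.asarray(trial)             # shape (num_crossings, 7)
        x_star = trial.mean(axis=0)           # crude per-trial fixed point estimate
        X_cols.append((trial[:-1] - x_star).T)
        Y_cols.append((trial[1:] - x_star).T)
    X = np.hstack(X_cols)
    Y = np.hstack(Y_cols)
    A = Y @ X.T @ np.linalg.inv(X @ X.T)
    return A, np.linalg.eigvals(A)            # local stability from the eigenvalues
```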
The distribution of equilibrium trajectories was unimodal and narrow for both slopes
(examined separately). We believe that most of the variance in the distribution of equilib-
rium trajectories can be accounted for by sensor noise, small disturbances, and changes in
the effective ramp angle when the robot yaws to one side or the other.
Domain of Attraction
In practice, the robot can be initialized from a large range of initial conditions and can
recover from relatively large perturbations. Because the return map has some eigenvalues
close to 1, this recovery takes many steps. Walking trials are nearly always started with the
robot tilted sideways to a non-zero θ_roll position but with θ_yaw, θ_lPitch, and θ_rPitch close to
zero. It is not necessary to give the robot any initial forward velocity.
When started in this configuration, one of three things happens. If |θ_roll| is too small, approximately less than 1.25, then the robot converges to a stable fixed point at θ_roll = θ̇_roll = 0. If 2φ < |θ_roll| < ψ, where ψ is the angle at which the center of mass of the robot is directly above the outside edge of the foot, then the robot returns to a stable gait. For larger |θ_roll|, the robot falls over sideways. On our robot, φ = 0.03 radians and ψ = 0.49
radians, which makes for a very large basin of attraction in this dimension. Compare this
with the predictions of the compass gait model, which must be initialized much closer to
the steady state trajectory in order to produce stable walking.
4.7 Discussion
The rimless wheel represents the simplest model of passive dynamic walking, and the only
model for which we have a complete analytical understanding. Until we find an analytical
expression for the return map of the compass gait², we must resort to numerical analysis
techniques which include local stability analysis using eigenvalues and a more global analysis
by plotting the basins of attraction of the limit cycle.
²[Coleman, 1998] approximates the compass gait map by taking the limits as the mass of the legs goes to zero and linearizing around the fixed point.
The planar models that were constructed of the curved foot walker pictured in Figure
4-1 ignored rolling friction, assumed that the collisions with the ground were completely
inelastic, and most importantly, ignored the dynamic coupling between the frontal and
sagittal planes. Despite these assumptions, the models were accurate enough to predict the
step frequency and step length of the walker with a high enough fidelity to use these models
in the design process.
Simulation code for all of the models presented in this chapter is available online at
http://hebb.mit.edu/people/russt/
Chapter 5
Ideas from passive dynamic walking are only beginning to have an impact on the way that
fully actuated bipedal robots are designed and controlled (e.g., [Pratt, 2000, Spong and Bhatia, 2003]).
To bridge the gap between passive and active walkers, a number of researchers have inves-
tigated the problem of adding a small number of actuators to an otherwise passive de-
vice ([Camp, 1997, van der Linde, 1998, Kuo, 2002, Wisse and van Frankenhuyzen, 2003]).
There are two major advantages to this approach. First, actuating a few degrees of freedom
on an otherwise passive walker is a way to capitalize on the energy efficiency of passive
walking and the robustness of actively controlled systems. Second, by allowing the dy-
namics of the system to solve a large portion of the control problem, it may be possible
to simplify the control problem that is solved by the actuators. In this chapter, I will
present Toddler - one of the first robots capable of both passive walking down a ramp and
actuated walking on the flat. Two other robots, developed independently at Cornell Uni-
versity [Collins et al., 2001] and at the Delft University of Technology in the Netherlands
[Wisse and van Frankenhuyzen, 2003], also crossed this milestone, all at approximately the
same time [Collins et al., 2004].
The passive dynamics of the robot can be thought of as an approximate solution to the
optimal control problem, which works well for walking on a ramp, but works poorly on flat
terrain. Using this philosophy, I will begin this chapter by formulating this optimal control
problem on the step-to-step return map. Next, I will present two hand-designed controllers
which improve the stability of the passive gait and produce stable walking on flat terrain.
Finally, I will present a learning system which is able to quickly acquire a controller, using
only trials implemented on the physical robot, that is quantifiably more stable than any
controller I could design by hand.
5.1 Toddler
The passive dynamic walker described in the last chapter and shown on the left in Figure
5-1 represents the simplest machine that we could build which captures the essence of stable
dynamic walking in three dimensions. It has only a single passive pin joint at the hip. We
designed our learning robot by adding a small number of actuators to this passive design.
The robot shown on the right in Figure 5-1 has passive joints at the hip and 2 degrees of
actuation (roll and pitch) at each ankle. The ankle actuators are position controlled servo motors which, when commanded to hold their zero position, allow the actuated robot to walk stably down a small ramp, "simulating" the passive walker.

Figure 5-1: The robot on the left is a simple passive dynamic walker. The robot on the right is our actuated version of the same robot.

The shape of the feet is
designed using the planar walking models from the last chapter to make the robot walk
passively at 0.8Hz, and to take steps of approximately 6.5 cm when walking down a ramp of
0.03 radians. The robot stands 44 cm tall and weighs approximately 2.9 kg, which includes
the CPU and batteries that are carried on-board. This photo also illustrates the passive
arms which are mechanically coupled to the opposite leg, and therefore do not add any
degrees of freedom to the robot. We call this robot "Toddler" because the word is normally
used to describe a child during the time that they are learning to walk, and this robot is
primarily designed to investigate learning algorithms for dynamic walking. The name is
also appropriate because the robot literally toddles back and forth when it walks.
When placed on flat terrain, the passive walker waddles back and forth, slowly losing
energy, until it comes to rest standing still. In order to achieve stable walking on flat terrain,
the actuators on our learning robot must restore energy into the system that would have
been restored by gravity when walking down a slope.
¹6 internal DOFs and 3 DOFs for the robot's orientation. We assume that the robot is always in contact with the ground at a single point, and infer the robot's absolute (x, y) position in space directly from the remaining variables.
The dynamics of the robot between collisions can be written in the standard manipulator form

H(q)\ddot{q} + B(q,\dot{q})\dot{q} + G(q) = \tau + D,   (5.1)

and the ground collisions are modeled as an instantaneous velocity reset

\dot{q}^+ = f(q, \dot{q}^-),   (5.2)

where H is the state dependent inertia matrix, B contains interaction torques between the links, G represents the effect of gravity, τ are the motor torques, D are random disturbances to the system, and f represents the collision model. Our shorthand lPitch, bPitch, and
rPitch refer to left leg pitch, body pitch, and right leg pitch, respectively. raRoll, laRoll,
raPitch, and laPitch are short for right and left ankle roll and pitch. The robot uses a
control policy
u = \pi(\hat{x}, t), \quad \text{with } x = [q, \dot{q}]^T.   (5.3)
The notation k represents a noisy estimate of the state x. The actual output of the controller
is a motor command vector
u = [u_{raRoll}, u_{laRoll}, u_{raPitch}, u_{laPitch}]^T.   (5.4)
F_π(x', x) represents the probability density function over the state space which contains the dynamics in equations 5.1 - 5.4 integrated over one cycle. We do not make any assumptions about its form, except that it is Markov. Note that for this robot, the element of F_π representing θ_roll is a delta function, independent of x. The stochasticity in F_π comes from the random disturbances D(t) and from sensor noise, x̂ − x.
The instantaneous cost that we would like to minimize penalizes the deviation from a constant desired value, x_d, on the return map.
Our goal is to find the policy π which minimizes

\lim_{N \to \infty} E\{C(\hat{X})\}.   (5.8)

By minimizing this error, we are effectively minimizing the eigenvalues of the return map, and maximizing the stability of the desired limit cycle.
The frontal plane model of the actuated robot takes the same form, with q = [θ, θ_la, θ_ra]^T and u = [0, u_la, u_ra]^T. The abbreviations la and ra are short for left and right ankle, respectively. At each collision with the ground, the kinetic energy of the system, T, drops discontinuously.
In order to stabilize the oscillations, the control torques, u, must restore the energy lost
from these collisions.
The first actuation strategy that I experimented with used the idea of coupled oscillators.
The mechanical system in the frontal plane is a damped oscillator. By producing a second
oscillator in the control system which can apply forces to the robot, we hope that the control
oscillator can entrain the dynamics of the mechanical oscillator - applying forces at just the
right time to produce stable walking. This is a degenerate form of true oscillator-based
control where we only consider a one-way coupling between the oscillators.
The oscillator we used is given by:
u_{ra} = a \sin(2\pi\omega t),
u_{la} = -u_{ra},

where ω is the oscillator frequency and a is a control gain. Surprisingly, this oscillation was sufficient to stabilize the walking when ω was properly related to the natural stepping frequency of the robot. The best entrainment occurred when the controller was a little slower than the passive step frequency (ω = 0.55 Hz for our robot stepping at 0.8 Hz). The success of this simple controller is our first piece of evidence that mechanical designs based on passive dynamic walking can simplify control design for actuated walkers.
Using the techniques presented in the last chapter, we performed an online experimental
local stability analysis by fitting a linear model to data collected on the return map. Unlike
the passive robot, the actuated robot has 9 total degrees of freedom: 4 degrees of freedom in common with the passive robot, 2 extra degrees of freedom in each ankle, and an extra degree of freedom for the computer which hangs passively from the hip. The 4 degrees of freedom in the ankles are rigidly connected to our position controlled actuators and produce repeatable trajectories from step to step. For that reason, I don't even have the ankle potentiometers wired to the main computer, and I have left them out of the stability analysis. The return map data is recorded and fit to a linear model in 9 dimensions (5 states + 5 velocities − 1 Poincare section).
The local stability analysis of the limit cycles generated by the feed-forward controller suggests that this controller is more stable than the passive system. The eigenvalues of the return map evaluated from 89 trials on flat terrain were 0.80, 0.60, 0.49 ± 0.04i, 0.36, 0.25, 0.20 ± 0.01i, 0.01. Practically, this controller converges from a variety of initial configurations
of the robot, but is very sensitive to disturbances in phase. The robot must be initialized
in phase with the controller, and relatively small perturbations can knock it out of phase.
A more direct approach to stabilizing the roll oscillations is to build a controller which,
on every cycle, injects exactly the amount of energy into the system that was lost during
that cycle. Even simpler is the idea implemented by Marc Raibert's height controller
for hopping robots ([Raibert, 1986, Buehler and Koditschek, 1988]): if we inject a roughly
constant amount of energy into the system during every cycle, then system will settle into
a stable oscillation with an amplitude that is monotonically related to the energy injected.
Our heuristic for injecting a roughly constant amount of energy on each cycle is implemented using a two-state state machine on the right ankle; the left ankle controller is symmetric. Here a is the desired ankle thrust position, θ_1 is the roll threshold, and ω_1 is the angular roll velocity threshold. Typical values are a = 0.08 radians, θ_1 = 0.1 radians, and ω_1 = 0.5 radians/second.
With this state machine, as the robot rolls from an upright position onto one foot, it
crosses a threshold position and velocity at which the stance ankle is servoed by a small
fixed amount, causing the robot to accelerate further in the direction that it was moving.
As the robot is moving back toward the upright position, the ankle is servoed back to
the original zero position, which further accelerates the center of mass toward the upright
position. Both ankles are at their zero position when the robot is transferring support to
the opposite foot in order to minimize the energy lost by the collision with the ground. The
desired angle in the non-zero state is monotonically related to the resulting amplitude of
oscillation.
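A minimal sketch of this two-state heuristic for the right ankle is given below (the left ankle is the mirror image). The exact switching conditions and sign conventions are assumptions based on the description above; only the parameters a, θ_1, and ω_1 and their typical values come from the text.

```python
class RightAnkle:
    """Two-state energy-injection heuristic for the right ankle (sketch)."""
    def __init__(self, a=0.08, theta_1=0.1, omega_1=0.5):
        self.a, self.theta_1, self.omega_1 = a, theta_1, omega_1
        self.thrusting = False

    def update(self, theta_roll, dtheta_roll):
        if not self.thrusting:
            # rolled past the threshold position while still rolling outward
            if theta_roll > self.theta_1 and dtheta_roll > self.omega_1:
                self.thrusting = True
        else:
            # rolling back toward upright: return the ankle to its zero position
            if dtheta_roll < 0.0:
                self.thrusting = False
        return self.a if self.thrusting else 0.0
```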
Figure 5-3: Limit cycle trajectory using the feedback controller. Notice that this trajectory demonstrates a gradual convergence to the limit cycle from small initial conditions.
The local stability analysis reveals that this controller converges more quickly than both
the feed-forward controller and the purely passive device. After 58 trials on flat terrain, our
linear return map analysis produced the eigenvalues: 0.78, 0.69 ± 0.03i, 0.44, 0.36 ± 0.04i, 0.13 ± 0.06i, 0.13. The controller is able to slowly recover from a range of perturbations, but requires many steps to return to steady state. It is not able to initiate walking when the robot is standing still, and never applies any braking forces to slow the robot down and prevent it from falling over sideways.
We control the dynamics of the sagittal plane independently using a simple velocity control
algorithm. The forward velocity of the robot, regardless of the slope of terrain, depends
on the location of the center of mass relative to the ground contact. When the center of
mass is out in front of the ground contact point, the robot will lean forward. As soon as
one leg leaves the ground, the passive joint at the hip allows it to swing forward, and the
robot begins walking. The farther the center of mass is from the ground contact point,
the faster the robot will move in that direction. On Toddler, the operator specifies the
desired forward speed by joystick, and the corresponding placement of the center of mass is
controlled by actuating the ankle pitch actuators. The heuristic makes it easy for Toddler
to walk on flat terrain, and even up small inclines. For large changes in slope, the gains of
the roll stabilizing controllers must also adapt.
The direction of the robot can also be controlled, to a limited degree, by differentially
actuating the right and left ankles (either pitch or roll). The effectiveness of this direction
control is limited, however, by the uncontrolled moment in the yaw axis generated by the
swinging leg. When the frictional contact between the robot and the ground does not
balance this moment, then the robot turns. To minimize this effect, we have added rubber
soles to the robot's feet to increase the surface contact with the ground. With the rubber
soles, the robot walks well on the lab's linoleum flooring, but it still walks much better on
a firm carpet or rubber surface.
5.3.4 Summary
To control this under-actuated system, we have developed two simple hand-designed control
algorithms which stabilize the passive gait on a small slope or even on flat terrain. When
walking down the linoleum hallways of MIT, which are not nearly as flat and even as they
appear, the feed-forward controller needs to be restarted fairly frequently. The feedback
controller rarely needs to be restarted on the linoleum, but does not adjust well to different
surfaces, like carpet or wooden tiles. The velocity controller works well in a variety of
situations, and will be used for the remainder of the experiments described in this thesis.
The robot could walk faster by using a more sophisticated controller in the sagittal plane,
potentially by pushing off with the toes at the end of every step, but this improvement has
been left for future work.
The feedback control policy is represented using a linear function approximator,

u = \pi_w(\hat{x}) = \sum_i w_i \phi_i(\hat{x}).

Before learning, w is initialized to all zeros, making the policy outputs zero everywhere, so that the robot simulates the passive walker.
The learning algorithm is an actor-critic algorithm (recall section 2.5) which makes small
changes to the control parameters w on each step and uses correlations between changes
in w and changes in the return map error to climb the performance gradient. This policy
gradient update is augmented with an estimate of the value function, J(x), which is used
to reduce the variance of the policy update. Recall that the value of state x is the expected
average cost to be incurred by following policy π, starting from state x:

J(x) = \lim_{N \to \infty} \frac{1}{N} \sum_{n=0}^{N} E\{ g(x_n) \}, \quad \text{with } x_0 = x.
Ĵ_v(x) is an estimate of the value function parameterized by the vector v. This value estimate is represented in another linear function approximator:

\hat{J}_v(x) = \sum_i v_i \phi_i(x).   (5.9)
The particular algorithm that we present here was originally proposed by [Kimura and Kobayashi, 1998].
We present a thorough derivation of this algorithm in the next section.
59
During learning, we add stochasticity to our deterministic control policy by varying w. Let Z_n be a Gaussian random vector with E{Z_{i,n}} = 0 and E{Z_{i,n} Z_{j,n'}} = σ² δ_{ij} δ_{nn'}. During the nth step that the robot takes, we evaluate the controller using the parameter vector w'_n = w_n + z_n. The algorithm uses a storage variable, e_n, which we call the eligibility
trace. We begin with w_0 = e_0 = 0. At the end of the nth step, we make the updates:

e_{i,n} = \gamma e_{i,n-1} + b_{i,n} z_{i,n},   (5.10)

d_n = g(\hat{x}_n) + \gamma \hat{J}_v(\hat{x}_{n+1}) - \hat{J}_v(\hat{x}_n),   (5.11)

w_{i,n+1} = w_{i,n} - \eta_w d_n e_{i,n},   (5.12)

v_{i,n+1} = v_{i,n} + \eta_v d_n \phi_i(\hat{x}_n).   (5.13)

η_w > 0 and η_v > 0 are the learning rates and γ is the discount factor of the eligibility trace, which will be discussed in more detail in the algorithm derivation. b_{i,n} is a boolean one-step eligibility, which is 1 if the parameter w_i is activated (φ_i(x̂) > 0) at any point during step n and 0 otherwise. d_n is called the one-step temporal difference error.
The algorithm can be understood intuitively. On each step the robot receives some cost
g(x̂_n). This cost is compared to the cost that we expect to receive, as estimated by Ĵ_v(x̂). If the cost is lower than expected, then −η_w d_n is positive, so we add a scaled version of the noise terms, z_i, into w_i. Similarly, if the cost is higher than expected, then we move in the
opposite direction. This simple online algorithm performs approximate stochastic gradient
descent on the expected value of the average infinite-horizon cost.
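The per-step update can be written in a few lines of code. The sketch below follows the updates as reconstructed above; the feature function name, the learning rates, and the value of γ are illustrative assumptions, not the values used on the robot.

```python
# Sketch: one actor-critic update at the end of step n.
import numpy as np

def learning_step(w, v, e, x_hat, x_hat_next, z, b, g_cost, phi_v,
                  eta_w=0.01, eta_v=0.1, gamma=0.2):
    """w, v, e, z, b are NumPy vectors; phi_v(x) returns the value features."""
    e = gamma * e + b * z                                   # eligibility trace
    d = g_cost + gamma * (v @ phi_v(x_hat_next)) - (v @ phi_v(x_hat))  # TD error
    w = w - eta_w * d * e                                   # actor (policy) update
    v = v + eta_v * d * phi_v(x_hat)                        # critic (TD(0)) update
    return w, v, e
```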
E\{C(\hat{X})\} = \int C(\hat{x})\, P_w\{\hat{X} = \hat{x}\}\, d\hat{x},

where the trajectory probability factors through the return map density and F_w is shorthand for F_{π_w}. Taking the gradient of E{C(X̂)} with respect to w_i we find

\frac{\partial}{\partial w_i} E\{C(\hat{X})\} = \int C(\hat{x})\, \frac{\partial}{\partial w_i} P_w\{\hat{X} = \hat{x}\}\, d\hat{x}
= \int C(\hat{x})\, P_w\{\hat{X} = \hat{x}\}\, \frac{\partial}{\partial w_i} \log P_w\{\hat{X} = \hat{x}\}\, d\hat{x}
= E\Big\{ C(\hat{X})\, \frac{\partial}{\partial w_i} \log P_w\{\hat{X} = \hat{x}\} \Big\},

with

\frac{\partial}{\partial w_i} \log P_w\{\hat{X} = \hat{x}\} = \sum_{m=0}^{N-1} \frac{\partial}{\partial w_i} \log F_w(\hat{x}_{m+1}, \hat{x}_m).
Recall that F_w(x', x) is a complicated function which includes the integrated dynamics of the controller and the robot. Nevertheless, ∂/∂w_i log F_w is simply

\frac{\partial}{\partial w_i} \log F_w(\hat{x}_{m+1}, \hat{x}_m) = \frac{1}{\sigma^2} b_{i,m} z_{i,m},

which reduces the gradient to

\frac{\partial}{\partial w_i} E\{C(\hat{X})\} = \frac{1}{N\sigma^2} E\Big\{ \sum_{n=0}^{N} g(\hat{x}_n) \sum_{m=0}^{n} b_{i,m} z_{i,m} \Big\}.

This final reduction is based on the observation that E{ g(x̂_n) z_{i,m} } = 0 for m ≥ n (noise added to the controller on or after step n is not correlated to the cost at step n). Similarly, random
changes to a weight that is not used during the nth step (bi,m = 0) have zero expectation,
and can be excluded from the sum.
Observe that the variance of this gradient estimate grows without bound as N → ∞ [Baxter and Bartlett, 2001]. To bound the variance, we use a biased estimate of this gradient which artificially discounts the eligibility trace:

\frac{\partial}{\partial w_i} E\{C(\hat{X})\} \approx \frac{1}{N\sigma^2} E\Big\{ \sum_{n=0}^{N} g(\hat{x}_n) \sum_{m=0}^{n} \gamma^{n-m} b_{i,m} z_{i,m} \Big\} = \frac{1}{N\sigma^2} E\Big\{ \sum_{n=0}^{N} g(\hat{x}_n)\, e_{i,n} \Big\},

with 0 < γ < 1. The discount factor γ parameterizes the bias-variance trade-off.
Next, observe that we can subtract any mean zero baseline from this quantity without affecting the expected value of our estimate [Williams, 1992]. Including this baseline can
dramatically improve the performance of our algorithm because it can reduce the variance of
our gradient estimate. In particular, we subtract a mean-zero term containing an estimate
of the value function as recommended by [Kimura and Kobayashi, 1998]:
\lim_{N \to \infty} \frac{\partial}{\partial w_i} E\{C(\hat{X})\} \approx \lim_{N \to \infty} \frac{1}{N\sigma^2} E\Big\{ \sum_{n=0}^{N} \big[ g(\hat{x}_n)\, e_{i,n} - \hat{J}_v(\hat{x}_n)\, z_{i,n} \big] \Big\}.
The terms Ĵ_v(x̂_n) z_{i,n} have zero expectation because the value estimate of the state x̂_n is built up from previous trials, and is uncorrelated with the noise that is being injected on this particular step. Through some algebraic manipulation, we convert this equation into an efficient online update:
\lim_{N \to \infty} \frac{\partial}{\partial w_i} E\{C(\hat{X})\} \approx \lim_{N \to \infty} \frac{1}{N\sigma^2} E\Big\{ \sum_{n=0}^{N} g(\hat{x}_n)\, e_{i,n} - \sum_{n=0}^{N} \hat{J}_v(\hat{x}_n)\, z_{i,n} \Big\}

= \lim_{N \to \infty} \frac{1}{N\sigma^2} E\Big\{ \sum_{n=0}^{N} g(\hat{x}_n)\, e_{i,n} + \sum_{n=0}^{N} b_{i,n} z_{i,n} \sum_{m=n}^{N-1} \gamma^{m-n} \big[ \gamma \hat{J}_v(\hat{x}_{m+1}) - \hat{J}_v(\hat{x}_m) \big] \Big\}

= \lim_{N \to \infty} \frac{1}{N\sigma^2} E\Big\{ \sum_{n=0}^{N} d_n\, e_{i,n} \Big\}.
By this derivation, we can see that the average of the weight update given in equations
5.10-5.13 is in the direction of the performance gradient:
\lim_{N \to \infty} \frac{1}{N} E\Big\{ \sum_{n=0}^{N} \Delta w_{i,n} \Big\} = -\eta_w \sigma^2 \lim_{N \to \infty} \frac{\partial}{\partial w_i} E\{C(\hat{X})\}.
The value estimate update given in equation 5.13 is the well-known update for the temporal difference learning algorithm, TD(λ), with λ = 0. This update performs approximate gradient descent on the value parameters to minimize the squared temporal difference error:
\frac{\partial}{\partial v_i} \frac{1}{2} d_n^2 = d_n \frac{\partial}{\partial v_i} \big[ \gamma \hat{J}_v(\hat{x}_{n+1}) - \hat{J}_v(\hat{x}_n) \big] \approx -d_n \frac{\partial \hat{J}_v(\hat{x}_n)}{\partial v_i} = -d_n \phi_i(\hat{x}_n),

so that the update Δv_{i,n} = η_v d_n φ_i(x̂_n) of equation 5.13 follows the negative of this gradient.
By neglecting the contribution of the Ĵ_v(x̂_{n+1}) term to the gradient, this algorithm loses
guarantees of convergence (it is no longer performing true gradient descent). Nevertheless,
there is significant experimental evidence in the literature suggesting that this assumption,
which makes the update a backup of future rewards, can significantly improve the rate of
convergence in practice [Sutton and Barto, 1998].
The algorithm runs online, is robust to noise, and provably follows the performance gradient on average.
Another feature of the algorithm is that it performs well even when we limit the represen-
tational power of the function approximators for the policy and the value estimate. The
policy gradient update will attempt to find a local minimum in performance error within
the function class of the approximator. The convergence of the algorithm does not depend
on the quality of the value estimate, but the better our estimate, the faster the convergence.
The focus of this work is on an algorithm that works quickly enough to be implemented
on the real robot. For that reason, we are willing to make some sacrifices on the optimality
of the learned controller in order to improve the learning speed. This can be accomplished
by constraining the feedback policy to be a function of only two variables: θ_roll and θ̇_roll.
The choice of these two variables is not arbitrary; they are the only variables that we used
when writing a non-learning feedback controller that stabilizes the oscillation. The policy is
also constrained to be symmetric - the controller for the left ankle is simply a mirror image
of the controller for the right ankle. With these reductions, the learned control policy has
only two inputs and only a single output. The simplifications made in the value estimate
are even more severe - we approximate the value as a function of only a single variable: θ̇_roll.
This extremely low dimensionality allows the algorithm to very quickly learn a very coarse
approximation to the real value function, and turns out to be good enough to significantly
reduce the variance of our policy update and allow for very fast learning.
The control policy and value functions are both represented using linear function approx-
imators of the form described in Equations 5.3 and 5.9, which are fast and very convenient
to initialize. We use an overlapping tile-coding for our approximator basis functions: 35 tiles for the policy (5 in θ_roll × 7 in θ̇_roll) and 11 tiles for the value function.
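For illustration, the sketch below builds feature vectors with a simple grid tiling. The thesis uses overlapping tiles, so this non-overlapping coder, the state ranges, and the class name are all illustrative assumptions; only the 5 × 7 and 11 tile counts come from the text.

```python
# Sketch: grid tile-coding features for the linear policy and value estimate.
import numpy as np

class TileCoder:
    """Non-overlapping grid tiling (a stand-in for the overlapping tiles)."""
    def __init__(self, lo, hi, bins):
        self.lo = np.asarray(lo, dtype=float)
        self.hi = np.asarray(hi, dtype=float)
        self.bins = np.asarray(bins, dtype=int)
        self.n = int(np.prod(self.bins))
    def __call__(self, x):
        frac = (np.atleast_1d(np.asarray(x, dtype=float)) - self.lo) / (self.hi - self.lo)
        idx = np.clip(np.floor(frac * self.bins).astype(int), 0, self.bins - 1)
        phi = np.zeros(self.n)
        phi[np.ravel_multi_index(tuple(idx), tuple(self.bins))] = 1.0
        return phi

policy_features = TileCoder(lo=[-0.1, -1.5], hi=[0.1, 1.5], bins=[5, 7])  # 35 tiles, ranges assumed
value_features = TileCoder(lo=[-1.5], hi=[1.5], bins=[11])                # 11 tiles, range assumed
policy_output = lambda w, x: float(w @ policy_features(x))                 # u = sum_i w_i phi_i(x)
```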
In order to make the robot explore the state space during learning, we hand-designed a
simple controller to place the robot in random initial conditions on the return map. These
initial conditions are selected from a random distribution that is biased according to the
distribution of points that the robot has already experienced on the return map - the most
likely initial condition is the state that the robot experienced least often. We use this
controller to randomly reinitialize the robot every time that it comes to a halt standing
still, or every 10 seconds, whichever comes first. This heuristic makes the distribution on
the return map more uniform, and increases the likelihood of the algorithm converging on
the same policy each time that it learns from a blank slate.
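One simple way to realize this biased reinitialization is sketched below: return-map crossings are histogrammed, and new initial conditions are drawn with weights inversely related to the visit counts. The binning, the range, and the exact weighting are assumptions; the text only specifies that the least-visited states are the most likely initial conditions.

```python
# Sketch: sample initial roll-velocity conditions biased toward unvisited states.
import numpy as np

rng = np.random.default_rng()
bin_edges = np.linspace(-1.5, 1.5, 12)        # bins over the section state (assumed range)
visit_counts = np.zeros(len(bin_edges) - 1)

def record_crossing(dtheta_roll):
    i = np.clip(np.digitize(dtheta_roll, bin_edges) - 1, 0, len(visit_counts) - 1)
    visit_counts[i] += 1

def sample_initial_condition():
    weights = 1.0 / (1.0 + visit_counts)      # least-visited bins are most likely
    weights /= weights.sum()
    i = rng.choice(len(weights), p=weights)
    return rng.uniform(bin_edges[i], bin_edges[i + 1])
```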
Within a few minutes of learning from a blank slate, the robot acquires a stable policy for walking on flat terrain. Figure 5-6 displays the final policy.
Figure 5-4: A typical learning curve, plotted as the average error on each step.
Figure 5-5: θ_roll trajectory of the robot starting from standing still.
Figure 5-6: The final learned policy (actor).
In Figure 5-7 we plot the return maps of the system before learning (w = 0) and after 1000 steps. Recall that we are fitting a return map in 9 dimensions (5 states + 5 derivatives − 1 Poincare section). In general, the projection of these dynamics onto a single dimension is difficult to interpret. The plots in Figure 5-7 were made with the robot walking in place on flat terrain. In this particular situation, most of the return map variables are close to zero throughout the dynamics, and a two dimensional return map captures the desired dynamics. As expected, before learning the return map illustrates a single fixed point at θ̇_roll = 0, which means the robot is standing still. After learning, we obtain a single fixed point at the desired value (θ̇_roll = 1.0 radians/second), and the basin of attraction of this
fixed point extends over the entire domain that we tested. On the rare occasion that the
robot falls over, the system does not return to the map and stops producing points on this
graph.
Looking at Figure 5-7, we might guess that the controller has a single eigenvalue of 0.5.
Figure 5-7: Experimental return maps, before (left) and after (right) learning. Fixed points
exist at the intersections of the return map data and the diagonal line of slope one.
The linear least-squares eigenvalue analysis results for the learned controller, and all of the other controllers, are:

Controller (trials)                     Eigenvalues
Passive walking (63 trials)             0.88 ± 0.01i, 0.75, 0.66 ± 0.03i, 0.54, 0.36, 0.32 ± 0.13i
Hand-designed feed-forward (89 trials)  0.80, 0.60, 0.49 ± 0.04i, 0.36, 0.25, 0.20 ± 0.01i, 0.01
Hand-designed feedback (58 trials)      0.78, 0.69 ± 0.03i, 0.36, 0.25, 0.20 ± 0.01i, 0.01
Learned feedback (42 trials)            0.74 ± 0.05i, 0.53 ± 0.09i, 0.43, 0.30 ± 0.02i, 0.15, 0.07
All of these experiments were on flat terrain except the passive walking, which was on a
slope of 0.027 radians. The convergence of the system to the nominal trajectory is largely
governed by the largest eigenvalues. This analysis suggests that our learned controller
converges to the steady state trajectory more quickly than the passive walker on a ramp
and more quickly than any of our hand-designed controllers.
The results for walking in place on flat terrain were encouraging, so we tested the
controller walking at different velocities and over different terrain. While it took a few
minutes to learn a controller from a blank slate, adjusting the learned controller to adapt to
slow changes in the joystick velocity command or to small changes in the terrain appears to
happen very quickly - often within a few steps. While the non-learning controllers require constant attention and small manual changes to the parameters as the robot walks down the hall, on tiles, and on carpet, the learning controller easily adapts to each of these situations.
5.8 Discussion
Designing our robot like a passive dynamic walker changed the learning problem in a number
of ways. Because each of the dynamic variables for the robot is coupled through the passive
dynamics and because these dynamics already solve a large portion of the control problem,
it was possible for us to learn a policy with only a single output which controlled a 9 degree
of freedom system. Unlike most bipeds, most of the policies in this class of single output
functions produce some form of stable behavior - whether it is walking or standing. With
a relatively low cost for experimenting with different policy parameters on the real robot,
we were able to blindly inject noise into our policy for the actor-critic update. Another
benefit of the passive dynamic design of this robot is that we were able to obtain stable
periodic trajectories even when the policy was initialized to a blank slate. This allowed us
to formulate the learning problem on the return map dynamics with the confidence that
most trajectories will return to the Poincare section.
The return map formulation of the learning problem was a major factor in the success
of this algorithm, but it has a number of implications. The return map dynamics are
modelled as a Markov chain that is parameterized by the policy parameters instead of as
a Markov decision process (MDP). This distinction is necessary because we evaluate the
policy many times within a single step, and we cannot directly apply a large variety of
reinforcement learning algorithms which rely on the structure of an MDP (such as dynamic
programming or value iteration). The return map formulation also has implications for
the solution of the delayed reward problem. Our algorithm solves the delayed reward
problem by accumulating the eligibility within a step and discounting eligibility between
steps. Interestingly, our algorithm performs best with heavy discounting between steps (0 < γ < 0.2). This is probably a result of the particular choice of cost function, which
implicitly rewards monotonic convergence to the fixed point on the return map, and the fact
that the robot was walking on flat terrain where multi-step maneuvers were not required
to improve stability.
Another important factor in the success of this algorithm is the variance reduction
using a coarse estimate of the value function. Before implementing this component, the
learning took so many trials that we were not confident that the policy changes were actually
improving performance. To demonstrate the importance of the learned value estimate,
we can run the learning with the policy parameters reset to zero, but the value estimate parameters intact from the previous learning session. When we do this, the robot achieves
good performance in just a few steps, and the learning actually converges in just one or two
minutes.
The learning algorithm works well on this simple robot, but will the technique scale to
more complicated robots? As the number of degrees of freedom increases, the actor-critic
algorithm implemented here may have problems with scaling. The algorithm correlates
changes in the policy parameters with changes in the performance on the return map. As
we add degrees of freedom, the assignment of credit to a particular actuator will become
more difficult, requiring more learning trials to obtain a good estimate of the correlation.
This scaling problem is an open and interesting research question and a primary focus of
our current research.
Although it was accomplished on a simple biped, this demonstration of online learning
is important for a number of reasons. It suggests that principles from passive dynamic
walking can and should be used on actuated robots. It provides a real demonstration of
policy gradient learning implemented on a real robot with sensor noise and stochasticity in
the environment, and experimental evidence for the importance of variance reduction using
a value estimate. Finally, it provides a fresh idea on the solution to the delayed reward
problem for walking robots - accumulating eligibility evenly over one step and discounting
heavily between steps.
Chapter 6
In the early 1980's, Marc Raibert's Leg Laboratory built a series of extremely successful
hopping robots [Raibert, 1986] which redefined people's expectations for robots executing
extremely dexterous and dynamic movements [Koditschek and Buehler, 1991]. The first of
these robots, a planar one-leg hopper, is a simple legged system which captures the essence
of dynamic stability but avoids the complication of coordinating multiple legs or solving
complex leg kinematics.
Due to the carefully designed dynamics of the robot, Raibert's group was able to design
a surprisingly simple controller for stable hopping using only their engineering intuition.
Twenty years later, this controller is still one of the most prominent controllers for robotic
running, even though surprisingly little is understood about it analytically. The only real
analysis was done by [Koditschek and Buehler, 1991], who provided more insight into the
height controller by providing a Lyapunov function for a reduced physical model.
Raibert's original controller works well for steady-state hopping, but is not able to
recover from large disturbances. In this chapter, I will present Raibert's original controller
as an approximate solution to an optimal control problem, and then apply a policy gradient
algorithm to systematically improve upon it. The learned controller is able to recover from
a dramatically larger region of initial conditions than the original controller.
The one-leg hopper has a total of ten state variables x = [q, q̇]^T, where q = [x_ft, y_ft, θ, φ, r]^T as defined in Figure 6-1, and two control variables u = [τ_hip, F_leg]^T, which are the torque at the hip and the linear force from the pneumatic spring in the leg. The equations of motion use the following constants: m and J are the mass and moment of inertia of the body, m_l and J_l are the mass and moment of inertia of the leg, l_1 is the distance from the foot to the center of mass of the leg, l_2 is the distance from the hip to the center of mass of the body, and R = r − l_1.
Figure 6-1: The planar one-legged hopper. (x_ft, y_ft) represents the position of the point foot in world coordinates, θ represents the absolute angle of the leg, φ represents the absolute angle of the body, and r represents the length of the leg.

The equations of motion for the hopper, taken with corrections from [Raibert, 1986], are:
(J_l - m_l R l_1)\,\ddot\theta\cos\theta - m_l R\,\ddot{x}_{ft} = \cos\theta\,(l_1 F_y \sin\theta - l_1 F_x \cos\theta - \tau_{hip}) - R\,(F_x - F_{leg}\sin\theta - m_l l_1 \dot\theta^2 \sin\theta)

(J_l - m_l R l_1)\,\ddot\theta\sin\theta + m_l R\,\ddot{y}_{ft} = \sin\theta\,(l_1 F_y \sin\theta - l_1 F_x \cos\theta - \tau_{hip}) + R\,(m_l l_1 \dot\theta^2 \cos\theta + F_y - F_{leg}\cos\theta - m_l g)

(J_l + m R r)\,\ddot\theta\cos\theta + m\,\ddot{x}_{ft} + m R\,\ddot{r}\sin\theta + m R l_2\,\ddot\phi\cos\phi = \cos\theta\,(l_1 F_y \sin\theta - l_1 F_x \cos\theta - \tau_{hip}) + R F_{leg}\sin\theta + m R\,(r\dot\theta^2 \sin\theta + l_2 \dot\phi^2 \sin\phi - 2\dot{r}\dot\theta\cos\theta)

(J_l + m R r)\,\ddot\theta\sin\theta - m\,\ddot{y}_{ft} - m R\,\ddot{r}\cos\theta + m R l_2\,\ddot\phi\sin\phi = \sin\theta\,(l_1 F_y \sin\theta - l_1 F_x \cos\theta - \tau_{hip})
The ground is modelled as a piecewise linear spring-damper system, which produces the horizontal and vertical reaction forces F_x and F_y at the foot.

Raibert's control strategy takes advantage of the fact that the moment of inertia of the body is much larger than the moment of inertia of the leg. During the flight phase, this allows us to
assume that hip torque causes the leg to move without altering the body attitude. The
forward velocity is then controlled by the foot placement: using a symmetry argument to
estimate the position on the ground that the foot should be placed for steady state hopping,
a simple servo controller is used to move the leg to the correct angle.
When the leg is on the ground, we assume that the hip torque causes the body to
rotate, but that the ground reaction forces keep the leg angle constant. Again, a simple
servo controller was used to keep the body angle near the desired position. Finally, hopping
height is controlled by injecting a fixed amount of energy during the thrust phase. The
amount of energy injected is monotonically related to the resulting hopping height.
On the actual robot, these controllers were switched in and out using a simple state ma-
chine. It is possible to approximate Raibert's state-machine controller using an autonomous
feedback policy u = lraibert(x), as in [Buehler and Koditschek, 1988]. The robot is con-
sidered to be in the FLIGHT phase when the foot is above the ground (yft > 0), in the
COMPRESSION phase when the foot is on the ground and the leg is compressing (yft = 0
and ṙ < 0), or in the THRUST phase any other time the leg is compressed (r < r_s0, where r_s0 is the rest length of the spring in the leg). This allows us to write Raibert's control law
in the following autonomous equations. The hip torque τ_hip(x) is a servo that places the leg during FLIGHT, a servo that regulates the body attitude during stance, and 0 otherwise; the leg force is

F_{leg}(x) = \begin{cases} V_{retract} & \text{if FLIGHT} \\ V_{thrust} & \text{if THRUST} \\ 0 & \text{otherwise.} \end{cases}
ẋ is the forward velocity of the center of mass of the body and ẋ_d is the desired forward velocity. k_fp, b_fp, and k_ẋ are the gains of the foot placement controller, and T_s is the duration of the last stance phase. Finally, V_thrust is a constant that is monotonically related to the hopping height. In my simulation, this controller works very nearly as well as the state machine controller.
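A sketch of this autonomous controller is given below. The FLIGHT/COMPRESSION/THRUST tests follow the text, but the specific servo forms, the sign conventions, and all numerical gains are assumptions; only the names k_fp, b_fp, k_ẋ, T_s, V_thrust, and V_retract come from the text.

```python
# Sketch: an autonomous Raibert-style hopping controller (illustrative only).
def raibert_controller(y_ft, r, dr, theta, dtheta, phi, dphi, xdot,
                       xdot_d=0.0, phi_d=0.0, T_s=0.2, r_s0=0.5,
                       k_fp=5.0, b_fp=0.5, k_xdot=0.05,
                       k_phi=20.0, b_phi=2.0,
                       V_thrust=400.0, V_retract=-40.0):
    flight = y_ft > 0.0
    compression = (not flight) and (dr < 0.0)
    thrust = (not flight) and (r < r_s0) and (not compression)

    if flight:
        # foot placement: neutral point plus a velocity-error correction
        x_fd = xdot * T_s / 2.0 + k_xdot * (xdot - xdot_d)
        theta_d = -x_fd / r                      # small-angle approximation (assumed)
        tau_hip = -k_fp * (theta - theta_d) - b_fp * dtheta
        F_leg = V_retract
    else:
        # stance: servo the body attitude through the hip (assumed PD form)
        tau_hip = -k_phi * (phi - phi_d) - b_phi * dphi
        F_leg = V_thrust if thrust else 0.0
    return tau_hip, F_leg
```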
As illustrated in Figure 6-2, when the robot is initialized with |φ| ≫ 0, these original
control equations cannot prevent the robot from falling down. The large moment of inertia of
the body serves to minimize the effect of the swing leg during the aerial phase of the steady
state trajectory, but consequently complicates the problem of dealing with disturbances
around the hip. The original control equations were based on an analysis of steady-state
hopping conditions augmented with simple linear feedback, and they simply can't recover
from these disturbances. [Vermeulen et al., 2000] reported improved robustness using a
more comprehensive model of hip displacements, suggesting that a nonlinear controller
may improve the situation.
69
-. __ __ - r- - imemalm"
-
3-
2.5-
2-
1.5
0.5-
-1 0 1 2 3 4 5 6 7 8
Figure 6-2: The original controller cannot recover from perturbations at the hip. The
shading of the robot gets darker as time progresses.
Although Raibert never mentions optimization, I believe that his intuitive solution
to the control problem can be interpreted as an approximation to an optimal controller.
The control objectives, as described by Raibert, are to regulate the hopping height (y),
body attitude (#), and forward speed ( ). The simplest cost function which quantifies
Raibert's control objectives is
g(x, u) = \frac{1}{2}(y - y_d)^2 + \frac{1}{2}(\phi - \phi_d)^2 + \frac{1}{2}(\dot{x} - \dot{x}_d)^2,

where y_d is the desired (average) hopping height, φ_d is the desired body attitude, and ẋ_d is the desired forward velocity.
The goal of the original control equations is to regulate these variables over time. In
order to efficiently simulate many trials with the robot, we optimize the finite-horizon cost:
C = \int_0^T [\alpha\, g(x, u) - 1]\, dt,

where α is a constant scalar set to keep αg < 1 and T is either the time that the robot falls down or 5 seconds, whichever comes first. By defining a finite-horizon cost instead of an infinite-horizon discounted cost, we can evaluate the cost exactly in T seconds of simulation time.
Cutting the simulations short when the robot falls over is another trick for dramatically
reducing the computation time¹. In order to ensure that simulations where the robot falls
down are penalized more than those where the robot remains balanced, we optimize αg − 1
instead of just g. 5 seconds is enough time for the hopper to land and hop at least 5 times.
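The cost evaluation for one trial can be sketched as below. The simulator interface (step, fallen) and the controller are assumed functions; only the αg − 1 integrand, the 5 second horizon, and the 1e-4 second time step come from the text.

```python
# Sketch: finite-horizon cost C = int_0^T (alpha*g(x,u) - 1) dt for one trial.
def evaluate_trial(x0, controller, step, g, fallen,
                   alpha=0.01, dt=1e-4, T_max=5.0):
    x, t, C = x0, 0.0, 0.0
    while t < T_max and not fallen(x):
        u = controller(x)
        C += (alpha * g(x, u) - 1.0) * dt   # stopping early forfeits the -1 per unit time
        x = step(x, u, dt)
        t += dt
    return C
```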
Although the cost function is quadratic, the complicated equations of motion make it
too difficult to solve the optimal control problem analytically. Instead, we will apply a
policy gradient algorithm to improve the performance of the controller. We do this by
augmenting the existing controller with a feedback controller stored in an artificial neural
network parameterized by the vector w:

u = \pi_{raibert}(x) + \pi_w(x).

¹Not only do we simulate less time, but the equations of motion typically become stiff when modelling many successive collisions, forcing us to dramatically reduce the step-size. By avoiding these situations, we can keep the simulation time step at a reasonable number (typically 1e-4 seconds).
We use a small 3-layer neural network with 9 inputs, 25 hidden units, and 2 outputs. The
inputs are (y_ft, θ, φ, r, ẋ_ft, ẏ_ft, θ̇, φ̇, ṙ), scaled and shifted linearly so that most inputs are within the range [−1.0, 1.0]. The outputs are (u_0, u_1), which are computed from the network using the standard sigmoid activation function and then linearly scaled and shifted from the range [0, 1] to the desired actuator range. The parameter vector w contains the connection
strengths between layers and the threshold biases for each unit, a total of 302 values.
This representation is very common for standard function approximation. In many of my
simulations, I use a more local representation of the state space, as do [Miller et al., 1990,
Schaal et al., 2001]. A global representation was chosen in this problem to deal with the
large, continuous state space with a relatively small number of learning parameters, because
the convergence rate of this policy gradient algorithm scales with the number of learning
parameters.
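The 9-25-2 network with its 302 parameters can be written out directly in NumPy, as sketched below. The parameter packing order and the actuator ranges are assumptions; only the layer sizes, the sigmoid activation, and the [0, 1]-to-actuator-range scaling come from the text.

```python
# Sketch: the 9-input, 25-hidden-unit, 2-output policy network (302 parameters).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def policy_network(x9, w):
    """x9: the 9 scaled inputs; w: flat parameter vector of length 302."""
    W1 = w[:225].reshape(25, 9)          # input-to-hidden weights
    b1 = w[225:250]                      # hidden biases
    W2 = w[250:300].reshape(2, 25)       # hidden-to-output weights
    b2 = w[300:302]                      # output biases
    h = sigmoid(W1 @ x9 + b1)
    y = sigmoid(W2 @ h + b2)             # outputs in [0, 1]
    u_lo = np.array([-50.0, 0.0])        # example actuator ranges (assumed)
    u_hi = np.array([50.0, 1000.0])
    return u_lo + y * (u_hi - u_lo)
```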
The total value of the policy is estimated by averaging the finite-horizon cost over a set X_0 of sampled initial conditions:

E\{C_w\} \approx \frac{1}{|X_0|} \sum_{x_0 \in X_0} C(x_0).
In order to perform gradient descent on the total policy value, E{Cw }, I implemented the
Downhill Simplex method [Press et al., 1992, section 10.4]. This algorithm is just a fancy
way of estimating a multi-dimensional gradient by sampling without having to resample
every dimension on every iteration. w was always initialized to all zeros, so that the initial policy was simply Raibert's original controller.
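The optimization loop itself is a few lines with SciPy's Nelder-Mead implementation, as sketched below. The evaluate_policy stub is a placeholder for averaging C(x_0) over sampled initial conditions with the simulator, and the option values are illustrative.

```python
# Sketch: Downhill Simplex (Nelder-Mead) optimization of the 302 policy parameters.
import numpy as np
from scipy.optimize import minimize

def evaluate_policy(w):
    # placeholder: average the finite-horizon cost C(x0) over sampled initial
    # conditions using the simulator; a stand-in objective keeps this runnable
    return float(np.sum(w**2))

w0 = np.zeros(302)                       # start from Raibert's controller alone (pi_w = 0)
result = minimize(evaluate_policy, w0, method="Nelder-Mead",
                  options={"maxiter": 200000, "maxfev": 200000})
w_opt = result.x
```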
Due to the large moment of inertia of the body relative to the hip, the robot is most sensitive to initial conditions where |φ| ≫ 0. For this reason, random initial conditions for φ were selected from a uniform distribution over [−π, π] radians. Similarly, the internal angle between the body and the leg (φ − θ) was selected from a uniform distribution over [−π/2 + 0.25, π/2 − 0.25] radians. The remaining state variables were set to ensure that the center of mass of the robot was stationary at the point x = 0 m, y = 2.5 m. These random
body angles.
6.5 Results
After a very large number of simulations (~10^6), the learning algorithm had converged on
a surprising controller. Because our simulation did not model collisions between the body
and the leg, the final controller recovered from large hip angles by spinning the body nearly
360 degrees around the hip to return to vertical (see Figure 6-3).
Figure 6-3: The controller augmented with a learning component is able to recover from an initial condition where both the leg and body angles are very far from vertical.
The goal of the optimization is to maximize the region of initial conditions from which
the robot can recover. For comparison, Figure 6-4 plots the region of initial conditions
for the three controllers. The initial conditions used for learning were taken as a uniform
distribution from the range shown in this plot, also with some initial velocity.
Figure 6-4: Regions of stability for the optimized controller, u = π_raibert + π_w, in gray, and for the original policy, u = π_raibert, in black.
6.6 Discussion
Using brute force optimization techniques, we were able to train a very small neural network
controller that notably expanded the region of stability of the hopping robot. This was
accomplished by evaluating the robot from a relatively small sample (50-100) of random
initial conditions pulled from a very broad distribution. Although we did not attempt to
optimize any energetic criteria, nor to drive the robot along a specific trajectory, these could
easily be taken into account by adding terms to the cost function. Furthermore, none of
the algorithms presented here depend on the robot model and could be applied directly to
a simulation of a different robot.
More importantly, this procedure was used to augment an existing controller. Because
we were able to interpret the intentions of the original control equations in terms of optimal
control, we were able to use a numerical algorithm in such a way that the performance
could only improve. This particular policy gradient algorithm was optimized for efficient
parallel computations, but the algorithm presented in the last chapter would have worked
(more slowly).
Chapter 7
Optimal control is a powerful framework for designing control systems for dynamically sta-
ble legged locomotion. It allows us to maximize the performance of our control algorithms
during the design phase, and to derive rules for continually adapting the controller after
the robot has left the laboratory. Finding truly optimal controllers for complicated legged
systems is currently analytically and computationally intractable. In this thesis, I have demon-
strated that by combining existing ideas from legged robots with computational tools from
the optimal control literature, we can derive controllers that outperform their non-learning
counterparts for at least two of the simplest legged robots.
The Toddler robot was designed to be a minimal 3D bipedal walker. The robot is
under-actuated, with only 4 actuators to control 9 degrees of freedom, making it difficult to
analytically derive controllers with theoretical performance guarantees. The passive walking
capabilities of the robot made it relatively easy to derive intuitive control strategies for stable
flat-ground walking, but these controllers required many steps to return to the nominal
trajectory after a small perturbation and could not recover from a large one. By applying
an online actor-critic algorithm, I was able to quickly and reliably obtain a quantifiably
more robust controller for walking on flat terrain. As the robot moves, the algorithm is constantly refining the policy, allowing the robot to quickly adapt to small changes in the
terrain.
Although it was accomplished on a very simple biped, this demonstration is important
for a number of reasons. It suggests that passive walking principles used in the mechanical
design of the robot can be used to improve the performance of our online learning algorithms
by mechanically biasing the system toward a walking solution. It provides hope for policy
gradient algorithms, which require us to perturb the policy parameters as the robot walks,
by demonstrating that it was possible to perturb the parameters enough to estimate the
performance gradient without visibly changing the walking gait. It also argues for the
utility of value function methods, by demonstrating that even an extremely coarse estimate
of the value function could be used to dramatically improve the performance of the learning
system. Toddler also gives us some priceless hands-on experience with learning on a real
walking robot.
7.1 Scaling Up
The major obstacle for applying these ideas to more complicated walkers is scaling the
performance of the learning algorithm. As the number of degrees of freedom increases,
so will the number of trials necessary to acquire the controller. The algorithm correlates
changes in the policy parameters with changes in the performance on the return map. As
we add degrees of freedom, the assignment of credit to a particular actuator will become
more difficult, requiring more learning trials to obtain a good estimate of the correlation.
The scaling of the reinforcement learning algorithms is an open and interesting research
question and a primary focus of my current research. Most existing approaches to scaling in
reinforcement learning involve hierarchical methods [Barto and Mahadevan, 2003]. I believe
that there is also hope for improving the non-hierarchical methods. There is certainly room
for improvement on the convergence and performance guarantees of the existing algorithms,
and to come up with new algorithms. There is also a large body of domain specific knowledge
in the robotics literature that can and should be exploited in our learning algorithms.
Passive dynamic walking is another piece of domain specific knowledge that we have used
in this thesis. To address scaling on the Toddler robot, we have built a new version of the
robot with knees. The version of the robot described in chapter 5 has an efficient gait that
is robust enough to accommodate small changes in terrain, but it cannot traverse obstacles.
This limitation is primarily due to the mechanics of the robot, which can only achieve foot
clearance by leaning sideways. The kneed version of the robot will have much more foot
clearance and a much more anthropomorphic gait.
Figure 7-1: A prototype knee for the next version of the Toddler robot
In order to exploit the efficiency of the passive dynamics in the knees, we have developed
a simple clutch mechanism at the knee (see Figure 7-1) which will allow the knee actuator
to apply torque when engaged and allow the knee to swing freely when disengaged. Figure
7-2 shows Toddler with the knees attached.
The new robot will also have foot contact sensors built into the ankles and actuators
attached to the arms. The curvature of the feet is a very important component to both the
stability and efficiency of the current robot. It is not trivial to build in a contact sensor
in the foot without disturbing this geometry. Our solution is to add a small spring-loaded
mechanism between the ankle and the foot that will close whenever the foot is on the
ground. This sensor will improve our return map estimation as well as aid in the timing
for the clutch mechanism on the kneed robot. The arm actuators will be added in series
with the existing arm design, so that the basic mechanical coupling of each arm to the
opposite leg will remain intact. When the motor is holding its zero position, the arm will mimic the passive gait. When actuated, the arm will provide active control over the yaw
compensation, which will help to control the direction of the robot.
This version of the robot will not only walk better, but it will make a significantly
more compelling argument that the learning ideas presented in this thesis can be applied
to humanoid walking.
As we begin to address the scaling problem, we plan to apply these algorithms to increasingly
sophisticated robots, ultimately working up to a full-scale humanoid robot like Honda's
ASIMO [Honda Motor Co., 2004]. Figure 7-3 shows Pete Dilworth's new Protoceratops
robot, which is just beginning to walk using a hand-designed controller. This robot has 14
primary degrees of freedom (not including the animatronics in the head). We plan to
test our controller augmentation ideas on this substantially more complicated robot in the
coming months and hope to demonstrate online policy improvement.
The algorithms considered in this thesis could also be applied to physical simulations
of walking characters for computer graphics. These simulations are a special case of the
robot problem where we have perfect knowledge of the robot dynamics and perfect sensors.
Scaling is particularly important in this domain because animated characters typically have
many more degrees of freedom than robots.
Bibliography
[Anderson and Pandy, 2001] Anderson, F. C. and Pandy, M. G. (2001). Dynamic optimiza-
tion of human walking. Journal of Biomechanical Engineering (ASME), 123(5):381-390.
[Barto and Mahadevan, 2003] Barto, A. G. and Mahadevan, S. (2003). Recent advances in
hierarchical reinforcement learning. Discrete Event Systems Journal, 13:41-77.
[Baxter and Bartlett, 2001] Baxter, J. and Bartlett, P. (2001). Infinite-horizon policy-
gradient estimation. Journal of Artificial Intelligence Research, 15:319-350.
[Boyan and Moore, 1995] Boyan, J. A. and Moore, R. W. (1995). Generalization in rein-
forcement learning: Safely approximating the value function. Advances in Neural Infor-
mation Processing Systems (NIPS), 7.
[Camp, 1997] Camp, J. (1997). Powered "passive" dynamic walking. Master's thesis, Cor-
nell University.
[Channon et al., 1990] Channon, P., Hopkins, S., and Pham, D. (1990). Simulation and
optimisation of gait for a bipedal robot. Mathematical and Computer Modelling, 14:463-
467.
[Chew, 2000] Chew, C.-M. (2000). Dynamic Bipedal Walking Assisted by Learning. PhD
thesis, Massachusetts Institute of Technology.
[Coleman and Ruina, 1998] Coleman, M. J. and Ruina, A. (1998). An uncontrolled toy
that can walk but cannot stand still. Physical Review Letters, 80(16):3658-3661.
[Collins et al., 2004] Collins, S., Ruina, A., Tedrake, R., and Wisse, M. (2004). Powered
bipedal robots based on ramp-walking toys. Submitted to Nature.
[Collins et al., 2001] Collins, S. H., Wisse, M., and Ruina, A. (2001). A three-dimensional
passive-dynamic walking robot with two legs and knees. International Journal of Robotics
Research, 20(7):607-615.
[Doya, 1999] Doya, K. (1999). Reinforcement learning in continuous time and space. Neural
Computation, 12:243-269.
[Goswami, 1999] Goswami, A. (1999). Postural stability of biped robots and the foot rota-
tion indicator (FRI) point. International Journal of Robotics Research, 18(6).
[Goswami et al., 1996] Goswami, A., Thuilot, B., and Espiau, B. (1996). Compass-like
biped robot part I: Stability and bifurcation of passive gaits. Technical Report RR-2996,
INRIA.
[Hardt et al., 1999] Hardt, M., Kreutz-Delgado, K., and Helton, J. W. (1999). Modelling
issues and optimal control of minimal energy biped walking. In Proceedings of the In-
ternational Conference on Climbing and Walking Robots (CLAWAR), pages 239-251.
Professional Engineering Publishing Limited.
[Hirai et al., 1998] Hirai, K., Hirose, M., Haikawa, Y., and Takenaka, T. (1998). The devel-
opment of Honda humanoid robot. In Proceedings of the IEEE International Conference
on Robotics and Automation (ICRA), pages 1321-1326.
[Huber and Grupen, 1998] Huber, M. and Grupen, R. A. (1998). A control structure for
learning locomotion gaits. In 7th International Symposium on Robotics and Applications.
[Juang and Lin, 1996] Juang, J.-G. and Lin, C.-S. (1996). Gait synthesis of a biped robot
using backpropagation through time algorithm. In Proceedings of the IEEE International
Conference on Neural Networks, pages 1710-1715.
[Kajita et al., 1992] Kajita, S., Yamaura, T., and Kobayashi, A. (1992). Dynamic walking
control of a biped robot along a potential energy conserving orbit. IEEE Transactions
on Robotics and Automation, 8(4):431-438.
[Kato et al., 1983] Kato, T., Takanishi, A., Jishikawa, H., and Kato, I. (1983). The re-
alization of the quasi-dynamic walking by the biped walking machine. In Morecki, A.,
Bianchi, G., and Kedzior, K., editors, Fourth Symposium on Theory and Practice of
Walking Robots, pages 341-351. Warsaw: Polish Scientific Publishers.
[Kimura and Kobayashi, 1998] Kimura, H. and Kobayashi, S. (1998). An analysis of ac-
tor/critic algorithms using eligibility traces: Reinforcement learning with imperfect value
functions. In Proceedings of the International Conference on Machine Learning (ICML),
pages 278-286.
[Kohl and Stone, 2004] Kohl, N. and Stone, P. (2004). Policy gradient reinforcement learn-
ing for fast quadrupedal locomotion. In Proceedings of the IEEE International Conference
on Robotics and Automation (ICRA).
[Konda and Tsitsiklis, 1999] Konda, V. R. and Tsitsiklis, J. N. (1999). Actor-critic algo-
rithms. Advances in Neural Information Processing Systems, 12:1008-1014.
[Kun and Miller, 1996] Kun, A. L. and Miller, III, W. (1996). Adaptive dynamic balance of
a biped robot using neural networks. In Proceedings of the IEEE International Conference
on Robotics and Automation (ICRA), pages 240-245.
[Kuo, 1995] Kuo, A. D. (1995). An optimal control model for analyzing human postural
balance. IEEE Transactions on Biomedical Engineering, 42(1):87-101.
[Kuo, 2002] Kuo, A. D. (2002). Energetics of actively powered locomotion using the simplest
walking model. Journal of Biomechanical Engineering, 124:113-120.
[Li et al., 2001] Li, C.-S. R., Padoa-Schioppa, C., and Bizzi, E. (2001). Neuronal corre-
lates of motor performance and motor learning in the primary motor cortex of monkeys
adapting to an external force field. Neuron, 30:593-607.
[Lohmiller and Slotine, 1998] Lohmiller, W. and Slotine, J. (1998). On contraction analysis
for non-linear systems. Automatica, 34(6):683-696.
[Maier et al., 1999] Maier, K., Blickman, R., Glauche, V., and Beckstein, C. (1999). Control
of legged planar hopping with radial basis function neural networks. In Proceedings of the
International Conference on Climbing and Walking Robots (CLAWAR), pages 133-141.
Professional Engineering Publishing Limited.
[Miller et al., 1990] Miller, III, W., Glanz, F., and Kraft III, L. (1990). CMAC: an as-
sociative neural network alternative to backpropagation. Proceedings of the IEEE,
78(10):1561-1567.
[Miller, 1994] Miller, III, W. T. (1994). Real-time neural network control of a biped walking
robot. IEEE Control Systems Magazine, 14(1):41-48.
[Morimoto and Atkeson, 2002] Morimoto, J. and Atkeson, C. (2002). Minimax differential
dynamic programming: An application to robust biped walking. Advances in Neural
Information Processing Systems.
[Muybridge and Mozley, 1979] Muybridge, E. and Mozley, A. V. (1979). Muybridge's Com-
plete Human and Animal Locomotion: All 781 Plates from the 1887 Animal Locomotion.
3 Vols. Dover Publications, Incorporated.
[Ng and Jordan, 2000] Ng, A. Y. and Jordan, M. (2000). PEGASUS: A policy search
method for large MDPs and POMDPs. In Uncertainty in Artificial Intelligence.
[Ng et al., 2003] Ng, A. Y., Kim, H. J., Jordan, M. I., and Sastry, S. (2003). Autonomous
helicopter flight via reinforcement learning. Advances in Neural Information Processing
Systems (NIPS), 16.
[Pratt, 2000] Pratt, J. (2000). Exploiting Inherent Robustness and Natural Dynamics in
the Control of Bipedal Walking Robots. PhD thesis, Computer Science Department,
Massachusetts Institute of Technology.
[Pratt and Pratt, 1998] Pratt, J. and Pratt, G. (1998). Intuitive control of a planar bipedal
walking robot. In Proceedings of the IEEE International Conference on Robotics and
Automation (ICRA).
[Pratt and Pratt, 1999] Pratt, J. and Pratt, G. (1999). Exploiting natural dynamics in the
control of a 3d bipedal walking simulation. In Proceedings of the International Conference
on Climbing and Walking Robots (CLAWAR).
[Press et al., 1992] Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P.
(1992). Numerical Recipes in C: The Art of Scientific Computing. Cambridge University
Press, second edition.
[Raibert, 1986] Raibert, M. H. (1986). Legged Robots That Balance. The MIT Press.
[Schaal et al., 2001] Schaal, S., Atkeson, C., and Vijayakumar, S. (2001). Scalable tech-
niques from nonparametric statistics for real-time robot learning. Applied Intelligence,
17(1):49-60.
[Seyfarth et al., 2002] Seyfarth, A., Geyer, H., Gunther, M., and Blickhan, R. (2002). A
movement criterion for running. Journal of Biomechanics, 35(5):649-655.
[Slotine and Li, 1990] Slotine, J.-J. E. and Li, W. (1990). Applied Nonlinear Control. Pren-
tice Hall.
[Slotine and Wang, 2003] Slotine, J.-J. E. and Wang, W. (2003). A study of synchroniza-
tion and group cooperation using partial contraction theory. In Block Island Workshop on
Cooperative Control (V. Kumar, editor). Springer-Verlag.
[Smith, 1998] Smith, R. (1998). Intelligent Motion Control with an Artificial Cerebellum.
PhD thesis, University of Auckland, New Zealand.
[Spong, 1994] Spong, M. W. (1994). Swing up control of the acrobot. In Proceedings of the
IEEE International Conference on Robotics and Automation (ICRA), pages 2356-2361.
[Spong and Bhatia, 2003] Spong, M. W. and Bhatia, G. (2003). Further results on control
of the compass gait biped. In Proceedings of the IEEE International Conference on
Intelligent Robots and Systems (IROS), pages 1933-1938.
[Strogatz, 1994] Strogatz, S. H. (1994). Nonlinear Dynamics and Chaos: With Applications
to Physics, Biology, Chemistry, and Engineering. Perseus Books.
[Sutton and Barto, 1998] Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning:
An Introduction. MIT Press.
[Sutton et al., 1999] Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. (1999). Pol-
icy gradient methods for reinforcement learning with function approximation. Advances
in Neural Information Processing Systems (NIPS).
[Taga, 1995] Taga, G. (1995). A model of the neuro-musculo-skeletal system for human
locomotion. I. Emergence of basic gait. Biological Cybernetics, 73(2):97-111.
[Tedrake and Seung, 2002] Tedrake, R. and Seung, H. S. (2002). Improved dynamic sta-
bility using reinforcement learning. In 5th International Conference on Climbing and
Walking Robots and the Support Technologies for Mobile Machines (CLAWAR). Profes-
sional Engineering Publishing Limited.
[Tedrake et al., 2004a] Tedrake, R., Zhang, T. W., Fong, M., and Seung, H. S. (2004a).
Actuating a simple 3D passive dynamic walker. In Proceedings of the IEEE International
Conference on Robotics and Automation (ICRA).
[Tedrake et al., 2004b] Tedrake, R., Zhang, T. W., and Seung, H. S. (2004b). Stochastic
policy gradient reinforcement learning on a simple 3D biped. In Proceedings of the IEEE
International Conference on Intelligent Robots and Systems (IROS).
[Tesauro, 1995] Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Com-
munications of the ACM, 38(3):58-68.
[Todorov, 2002] Todorov, E. (2002). Optimal feedback control as a theory of motor coor-
dination. Nature Neuroscience, 5(11):1226.
[Tsitsiklis and Roy, 1997] Tsitsiklis, J. and Roy, B. V. (1997). An analysis of temporal-
difference learning with function approximation. IEEE Transactions on Automatic Con-
trol, 42(5):674-690.
[van der Linde, 1998] van der Linde, R. Q. (1998). Active leg compliance for passive walk-
ing. In Proceedings of the IEEE International Conference on Robotics and Automation
(ICRA), pages 2339-2345.
[Vaughan et al., 2004] Vaughan, E., Di Paolo, E., and Harvey, I. (2004). The evolution of
control and adaptation in a 3D powered passive dynamic walker. In Proceedings of the
International Conference on the Simulation and Synthesis of Living Systems (ALIFE).
MIT Press.
[Vermeulen et al., 2000] Vermeulen, J., Lefeber, D., and Man, H. D. (2000). A control
strategy for a robot with one articulated leg hopping on irregular terrain. In Proceedings
of the International Conference on Climbing and Walking Robots (CLAWAR), pages 399-
406. Professional Engineering Publishing.
[Westervelt and Grizzle, 2002] Westervelt, E. and Grizzle, J. (2002). Design of asymptoti-
cally stable walking for a 5-link planar biped walker via optimization. In Proceedings of
the IEEE International Conference on Robotics and Automation (ICRA).
[Williams and Zipser, 1989] Williams, R. J. and Zipser, D. (1989). A learning algorithm
for continually running fully recurrent neural networks. Neural Computation, 1:270-280.
[Wisse and van Frankenhuyzen, 2003] Wisse, M. and van Frankenhuyzen, J. (2003). Design
and construction of Mike; a 2D autonomous biped based on passive dynamic walking.
In Proceedings of the International Symposium on Adaptive Motion of Animals and Ma-
chines.
[Yamasaki et al., 2002] Yamasaki, F., Endo, K., Kitano, H., and Asada, M. (2002). Acqui-
sition of humanoid walking motion using genetic algorithms: Considering characteristics
of servo modules. In Proceedings of the IEEE International Conference on Robotics and
Automation (ICRA), pages 3123-3128.
[Yan and Agrawal, 2004] Yan, J. and Agrawal, S. K. (2004). Rimless wheel with radially
expanding spokes: Dynamics, impact, and stable gait. In Proceedings of the IEEE Inter-
national Conference on Robotics and Automation (ICRA).