1Google DeepMind, London, UK. 2University College London, London, UK. 3Proactive Global, London, UK.
*Corresponding author. Email: tuomash@google.com (T.H.); guylever@google.com (G.L.)
†These authors contributed equally to this work.
‡Present address: University of Oxford, Oxford, UK.
§Present address: Google, Mountain View, CA, USA.
¶Present address: Isomorphic Labs, London, UK.

INTRODUCTION
Creating general embodied intelligence, that is, creating agents that can act in the physical world with agility, dexterity, and understanding—as animals or humans do—is one of the long-standing goals of artificial intelligence (AI) researchers and roboticists alike. Animals and humans are not just masters of their bodies, able to perform and combine complex movements fluently and effortlessly, but they also perceive and understand their environment and use their bodies to effect complex outcomes in the world.

Attempts at creating intelligent embodied agents with sophisticated motor capabilities go back many years, both in simulation (1) and in the real world (2, 3). Progress has recently accelerated considerably, and learning-based approaches have contributed substantially to this acceleration (4–6). In particular, deep reinforcement learning (deep RL) has proven capable of solving complex motor control problems for both simulated characters (7–11) and physical robots. High-quality quadrupedal legged robots have become widely available and have been used to demonstrate behaviors ranging from robust (12, 13) and agile (14, 15) locomotion to fall recovery (16); climbing (17); basic soccer skills such as dribbling (18, 19), shooting (20), intercepting (21), or catching (22) a ball; and simple manipulation with legs (23). On the other hand, much less work has been dedicated to the control of humanoids and bipeds, which impose additional challenges around stability, robot safety, number of degrees of freedom, and availability of suitable hardware. The existing learning-based work has been more limited and focused on learning and transfer of distinct basic skills, such as walking (24), running (25), stair climbing (26), and jumping (27). The state of the art in humanoid control uses targeted model-based predictive control (28), thus limiting the generality of the method.

Our work focuses on learning-based full-body control of humanoids for long-horizon tasks. In particular, we used deep RL to train low-cost off-the-shelf robots to play multi-robot soccer well beyond the level of agility and fluency that is intuitively expected from this type of robot. Sports like soccer showcase many of the hallmarks of human sensorimotor intelligence, which has been recognized in the robotics community, especially through the RoboCup (29, 30) initiative. We considered a subset of the full soccer problem and trained an agent to play simplified one-versus-one (1v1) soccer in simulation and directly deployed the learned policy on real robots (Fig. 1). We focused on sensorimotor full-body control from proprioceptive and motion capture observations.

The trained agent exhibited agile and dynamic movements, including walking, side stepping, kicking, fall recovery, and ball interaction, and composed these skills smoothly and flexibly. The agent discovered unexpected strategies that made more use of the full capabilities of the system than scripted alternatives and that we may not have even conceived of. An example of this is the emergent turning behavior, in which the robot pivots on the corner of a foot and spins, which would be challenging to script and which outperformed the more conservative baseline (see the "Comparison with scripted baseline controllers" section). One further problem in robotics in general and in robot soccer in particular is that optimal behaviors are often context dependent in a way that can be hard to predict and manually implement. We demonstrate that the learning approach can discover behaviors that are optimized to the specific game situation. Examples include context-dependent agile skills, such as kicking a moving ball; emergent tactics, such as subtle defensive running patterns; and footwork that adapts to the game situation, such as taking shorter steps when approaching an attacker in
possession of the ball compared with when chasing a loose ball (see the "Behavior analysis" section). The agent learned to make predictions about the ball and the opponent, to adapt movements to the game context, and to coordinate them over long time scales for scoring while being reactive to ensure dynamic stability. Our results also indicate that, with appropriate regularization, domain randomization, and noise injected during training, safe sim-to-real transfer is possible even for low-cost robots. Example behaviors and gameplay can be seen in the accompanying movies and on the website https://sites.google.com/view/op3-soccer.

Our training pipeline consists of two stages. In the first stage, we trained two skill policies: one for getting up from the ground and another for scoring a goal against an untrained opponent. In the second stage, we trained agents for the full 1v1 soccer task by distilling the skills and using multiagent training in a form of self-play, where the opponent was drawn from a pool of partially trained copies of the agent itself. Thus, in the second stage, the agent learned to combine previously learned skills, refine them to the full soccer task, and predict and anticipate the opponent's behavior. We used a small set of shaping rewards, domain randomization, and random pushes and perturbations to improve exploration and to facilitate safe transfer to real robots. An overview of the learning method is shown in Fig. 2 and Movie 1 and discussed in Materials and Methods.

We found that pretraining separate soccer and get-up skills was the minimal set needed to succeed at the task. Learning end to end without separate soccer and get-up skills resulted in two degenerate solutions depending on the exact setup: converging to either a poor locomotion local optimum of rolling on the ground or focusing on standing upright and failing to learn to score (see the "Ablations" section for more details). Using a pretrained get-up skill simplified the reward design and exploration problem and avoided poor locomotion local optima. More skills could be used, but we used pretrained skills only when necessary because using a minimal set of skills allows emergent behaviors to be discovered by the agent and optimized for specific contexts (highlighted in the "Comparison with scripted baseline controllers" and "Behavior analysis" sections) rather than learning to sequence prespecified behaviors.

RESULTS
We evaluated the agent in a 1v1 soccer match on physical Robotis OP3 miniature humanoid robots (31) and analyzed the emergent behaviors. We isolated certain behaviors (walking, turning, getting up, and kicking), compared them with corresponding scripted baseline controllers (see the "Comparison with scripted baseline controllers" section), and qualitatively analyzed the behaviors in a latent space (see the "Behavior embeddings" section). To assess reliability, gauge performance gaps between simulation and reality and study the sensitivity of the policy to game state, we also investigated selected set pieces (see the "Behavior analysis" section). Last, we investigated the agent's sensitivity to the observations of the ball, goal, and the opponent using a value function analysis (see the "Value function analysis" section).

Selected extracts from the 1v1 matches can be seen in Fig. 3 and Movie 2. The agent exhibited a variety of emergent behaviors, including agile movement behaviors such as getting up from the ground, quick recovery from falls, running, and turning; object interaction such as ball control and shooting, kicking a moving ball, and blocking shots; and strategic behaviors such as defending by consistently placing itself between the attacking opponent and its own goal and protecting the ball with its body. During play, the agents transitioned between all of these behaviors fluidly.
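For readers who want a concrete picture of the second training stage summarized in the Introduction (self-play against a pool of partially trained copies of the agent), the sketch below shows the basic bookkeeping involved. It is illustrative only: the pool size, the snapshot frequency, and the `train_episode` callback are hypothetical stand-ins, not the paper's implementation.

```python
import copy
import random


class SelfPlayPool:
    """Pool of frozen policy snapshots used as opponents during stage 2 (illustrative)."""

    def __init__(self, max_size=50):
        self.snapshots = []
        self.max_size = max_size

    def add_snapshot(self, policy):
        # Freeze a copy of the current learner policy as a future opponent.
        self.snapshots.append(copy.deepcopy(policy))
        if len(self.snapshots) > self.max_size:
            self.snapshots.pop(0)  # keep the pool bounded

    def sample_opponent(self):
        # Early in training the pool may be empty; otherwise pick a random earlier copy.
        return random.choice(self.snapshots) if self.snapshots else None


def stage2_training(learner_policy, train_episode, num_episodes=100_000, snapshot_every=1_000):
    """Play 1v1 episodes against opponents drawn from the snapshot pool.

    `train_episode(learner, opponent)` is a hypothetical callback that runs one
    1v1 soccer episode and applies an RL update to the learner.
    """
    pool = SelfPlayPool()
    pool.add_snapshot(learner_policy)
    for episode in range(num_episodes):
        opponent = pool.sample_opponent()
        train_episode(learner_policy, opponent)
        if (episode + 1) % snapshot_every == 0:
            pool.add_snapshot(learner_policy)
    return learner_policy
```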
Fig. 2. Agent training setup. We trained agents in two stages. (Left) In stage 1, we trained a separate soccer skill and get-up skill ("Stage 1: Skill training" section). (Right)

Comparison with scripted baseline controllers
Certain key locomotion behaviors, including getting up, kicking, walking, and turning, are available for the OP3 robot (32), and we used these as baselines. The baselines are parameterized open-loop trajectories. For example, the walking controller, which can also perform turning, has tunable step length, step angle, step time, and joint offsets. We optimized the behaviors by performing a grid search over step length (for walking), step angle (for turning), and step time (for both) on a real robot. We adjusted the joint offsets where necessary to prevent the robot from losing balance or the feet colliding with each other. The kick and the get-up controllers were specifically designed for this robot and have no tunable parameters. To measure how well the learned deep RL agent performed on these key behaviors, we compared it both quantitatively and qualitatively against the baselines. See Movie 3 for an illustration of the baselines and a side-by-side comparison with the corresponding learned behaviors.

Fig. 3. Gallery of robot behaviors. Each row gives an example of a type of behavior that was observed when the trained policy was deployed on real robots. The top six rows are photographs taken from five consecutive matches played on the same day, using the same policy. The bottom row of photographs demonstrate the same policy but were not taken from match play.
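As an illustration of the baseline tuning described above, a grid search over a scripted controller's parameters can be written in a few lines. The candidate values and the `measure_walking_speed` surrogate below are placeholders; the paper does not report the exact grids or evaluation harness used.

```python
import itertools

# Hypothetical candidate values; the paper does not report the exact grids searched.
STEP_LENGTHS = [0.01, 0.02, 0.03, 0.04]  # m
STEP_TIMES = [0.20, 0.25, 0.30, 0.35]    # s


def measure_walking_speed(step_length, step_time):
    """Stand-in for running the scripted open-loop gait on the robot and timing it.

    A real version would execute the gait with these parameters and return the
    measured forward speed in m/s; the toy surrogate below just lets the sketch run.
    """
    return step_length / step_time - 2.0 * (step_length - 0.03) ** 2


def grid_search_walk():
    best_params, best_speed = None, float("-inf")
    for step_length, step_time in itertools.product(STEP_LENGTHS, STEP_TIMES):
        speed = measure_walking_speed(step_length, step_time)
        if speed > best_speed:
            best_params, best_speed = (step_length, step_time), speed
    return best_params, best_speed


print(grid_search_walk())  # ((0.04, 0.2), ...) with the toy surrogate above
```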
Details of the comparison experiments are given in the "Baseline behavior comparisons: Experiment details" section in the Supplementary Materials, and the results are given in Table 1. The learned policy performed better than the specialized manually designed controller: It walked 181% faster, turned 302% faster, and took 63% less time to get up. When initialized near the ball, the learned policy kicked the ball with 3% less speed; both achieved a ball speed of around 2 m/s. However, with an additional run-up approach to the ball, the learned policy's mean kicking speed was 2.8 m/s (34% faster than the scripted controller), and the maximum kicking speed across episodes was 3.4 m/s. As well as outperforming the scripted get-up behavior, in practice, the learned policy also reacts to prevent falling in the first instance (see movie S4).

Close observation of the learned policy (as shown in Movies 3 and 4 and figs. S5 and S6) reveals that it has learned to use a highly dynamic gait. Unlike the scripted controller, which centers the robot's weight over the feet and keeps the foot plates almost parallel to the ground, the learned policy leans forward and actively pushes off from the edges of the foot plate at each step, landing on the heels. The forward running speed of 0.57 m/s and turning speed of
2.85 rad/s achieved by the learned policy on the real OP3 compare
favorably with the values of 0.45 m/s and 2.01 rad/s reported in (33)
on simulated OP3 robots. This latter work optimized parametric
controllers of the type used by top-performing RoboCup teams. It
was evaluated only in the RoboCup simulation, and the authors
note that the available OP3 model featured unrealistically powerful
motors. Although those results are not precisely comparable because
of methodological differences, this gives an indication of how our
learned policy compares with parameterized controllers implemented
on the OP3 in simulation.
Table 1. Performance at specific behaviors. The learned behavior is compared with the scripted baseline at the four behaviors. The learned policy's mean kicking power was roughly equivalent to the scripted behavior from a standing pose, but with an additional run-up approach to the ball, the learned policy achieved a more powerful kick.

                             Walking speed,          Turning speed,             Get-up time,         Kicking speed,
                             mean (SD)               mean (SD)                  mean (SD)            mean (SD)
Scripted baseline            0.20 m/s (0.005 m/s)    0.71 rad/s (0.04 rad/s)    2.52 s (0.006 s)     2.07 m/s (0.05 m/s)
Learned policy (real robot)  0.57 m/s (0.003 m/s)    2.85 rad/s (0.19 rad/s)    0.93 s (0.12 s)      2.02 m/s (0.26 m/s)
  With run-up                –                       –                          –                    2.77 m/s (0.11 m/s)
Learned policy (simulation)  0.51 m/s (0.01 m/s)     3.19 rad/s (0.12 rad/s)    0.73 s (0.01 s)      2.12 m/s (0.07 m/s)
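As a quick sanity check, the relative-performance figures quoted in the text follow directly from the Table 1 means; small discrepancies (for example, 185% here versus the quoted 181%) are consistent with the text's figures being computed from unrounded measurements.

```python
# Means from Table 1: scripted baseline vs. learned policy on the real robot.
scripted = {"walk": 0.20, "turn": 0.71, "get_up": 2.52, "kick": 2.07}  # m/s, rad/s, s, m/s
learned = {"walk": 0.57, "turn": 2.85, "get_up": 0.93, "kick": 2.02}
run_up_kick = 2.77  # m/s

print(f"walking:  {100 * (learned['walk'] / scripted['walk'] - 1):.0f}% faster")        # ~185 (text: 181)
print(f"turning:  {100 * (learned['turn'] / scripted['turn'] - 1):.0f}% faster")        # ~301 (text: 302)
print(f"get-up:   {100 * (1 - learned['get_up'] / scripted['get_up']):.0f}% less time")  # ~63
print(f"kicking:  {100 * (1 - learned['kick'] / scripted['kick']):.0f}% less speed")     # ~2 (text: 3)
print(f"run-up kick: {100 * (run_up_kick / scripted['kick'] - 1):.0f}% faster")          # ~34
```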
Movie 4. Slow-motion highlight reel of turning and kicking behaviors.

Fig. 4. Joint angle embeddings. Embedding of the joint angles recorded while executing different policies, as described in the "Behavior embeddings" section. (A) The embedding for the scripted baseline walking policy. (B) The embedding for the soccer skill. (C) The embedding for the full 1v1 agent.

Behavior analysis
Reliability and sim-to-real analysis
To gauge the reliability of the learned agent, we designed a get-up-and-shoot set piece, implemented in both the simulation (training) environment and the real environment. This set piece is a short episode of 1v1 soccer in which the agent must get up from the ground
and score within 10 s; see the "Set piece: Experiment details" section in the Supplementary Materials for full details.

We played 50 episodes of the set piece each in the simulation and the real environment. In the real environment, the robot scored 29 of 50 (58%) goals and was able to get up from the ground and kick the ball every time. In simulation, the agent scored more consistently, scoring 35 of 50 (70%) goals. This indicates a drop in performance due to transfer to the real environment, but the robot was still able to reliably get up, kick the ball, and score the majority of the time. Results are given in Fig. 5A and Table 2.

To further gauge the sim-to-real gap, we also analyzed the four behaviors discussed in the "Comparison with scripted baseline controllers" section (walking, turning, getting up, and kicking) in simulation and compared the results with those obtained using the real OP3. Results are shown in Table 1: When implemented on the real robot, the learned policy walked 13% faster, turned 11% more slowly, took 28% more time to get up, and kicked 5% more slowly than when implemented in simulation. These results indicate no extreme sim-to-real gap in the execution of any behavior. The gap with the baseline behavior performance is substantially larger; for instance, the turning behavior is highly optimized (pivoting on a corner of the foot) and can be seen both in simulation and on the real robot in Movie 4.
Fig. 5. Behavior analysis. (A to C) Set pieces. Top rows: Example initializations for the set piece tasks in simulation and on the real robot. Second rows: Overlayed plots of
the 10 trajectories collected from the set piece experiments showing the robot trajectory before kicking (solid lines) and after kicking (dotted lines), the ball trajectory
(dashed lines), final ball position (white circle), final robot position (red-pink circles), and opponent position (blue circle). Each red-pink shade corresponds to one of the
10 trajectories. (D) Adaptive footwork set piece. Right foot trajectory (orange), left foot trajectory (red), ball trajectory (white), point of the kick (yellow), and footsteps
highlighted with dots. (E) Turn-and-kick set piece. Right three panels: A sequence of frames from the set piece. Left: A plot of the footsteps from the corresponding trajec-
tory. The agent turned, walked ~2 m, turned, kicked, and lastly balanced using 10 footsteps. Please refer to the “Behavior analysis” section for a discussion of these results.
Table 2. Performance in the get-up-and-shoot set piece. Performance in simulation and on the real robot are compared. Values in parentheses are standard
errors.
Scoring success rate Mean time to first touch
reactivity. Although the reason for using short footsteps is unclear in our environment (it could be, for example, that the agent takes extra care to stay on the path between ball and goal and so moves more slowly), this result demonstrates that the agent adapts its gait to the specific context of the game. Results are illustrated in Fig. 5D.

To further demonstrate the efficiency and fluidity of the discovered gait, we analyzed the footstep pattern of the agent in a turn-and-kick set piece. In this task, the agent is initialized near the sideline and facing parallel with it, with the ball in the center. The
Fig. 6. Critic's predicted values. Projections of the expected values learned by the critic in selected game states. (In each panel, the ball is marked in white, the opponent in black, and the agent in gray, and brighter colors indicate preferred states.) (A) Expected values versus ball velocity: varying the ball (x, y) velocity (instantaneous velocity in meters per second, mapped to the position after 1 s). High-value states are those in which the ball is either traveling toward the goal or remaining near the agent. (B) Expected values versus opponent position: varying the opponent position around the pitch. The predicted value is lower when the opponent is located between the ball and the target goal. (C) Varying the agent position when the opponent is far from the ball (high value at positions near where the agent shoots, low value around the opponent reflecting the interference penalty). (D) Varying the agent position when the opponent is close to the ball. (The value function has higher value ridges at locations blocking the opponent's shot.)
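Each panel of Fig. 6 amounts to holding the rest of the game state fixed, sweeping one quantity (ball velocity, opponent position, or agent position) over a grid, and querying the learned critic at every grid point. The sketch below illustrates such a sweep; `critic_value`, `build_observation`, and `base_state` are hypothetical stand-ins for the paper's critic network and observation pipeline.

```python
import numpy as np


def value_map_over_opponent(critic_value, build_observation, base_state,
                            x_range=(-3.0, 3.0), y_range=(-2.0, 2.0), resolution=50):
    """Sweep the opponent (x, y) position over a grid and record the critic's value.

    `base_state` is a dict holding the fixed parts of the game state (agent pose,
    ball, etc.); `build_observation(state)` maps it to the agent's observation; and
    `critic_value(obs)` returns the trained critic's expected return. All three are
    placeholders for the actual pipeline.
    """
    xs = np.linspace(*x_range, resolution)
    ys = np.linspace(*y_range, resolution)
    values = np.zeros((resolution, resolution))
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            state = dict(base_state, opponent_position=(x, y))
            values[j, i] = critic_value(build_observation(state))
    # e.g., plot `values` with matplotlib's pcolormesh to obtain a Fig. 6-style panel
    return xs, ys, values
```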
development of better hardware and algorithms (4, 37). In particular, high-quality quadrupedal robots have become widely available, which have been used to demonstrate robust, efficient, and practical locomotion in a variety of environments (12, 13, 38, 39). For example, Lee et al. (12) applied zero-shot sim-to-real deep RL to deploy learned locomotion policies in natural environments, including mud, snow, vegetation, and streaming water. Our work similarly relies on zero-shot sim-to-real transfer and model randomization but instead focuses on a range of dynamic motions, stability, long horizon tasks, object manipulation, and multiagent competitive play. Most recent work in this area relies on some form of sim-to-real transfer (14, 15, 17, 19, 27, 40–42), which can help to reduce the safety and data efficiency concerns associated with training directly on hardware. A common theme is that an unexpectedly small number of techniques can be sufficient to reduce the sim-to-real gap (37, 43), which is also supported by our results. However, there have also been successful attempts at training legged robots to walk with deep RL directly on hardware (44–48). Training on hardware can lead to better performance, but the range of behaviors that can be learned has so far been limited because of safety and data efficiency concerns. Similar to our work, prior work has shown that learned gaits can achieve higher velocities compared with scripted gaits (49–51). However, the gaits have been specifically trained to attain high speeds instead of emerging as a result of optimizing for a higher level goal.
Quadrupedal platforms constitute most legged locomotion research, but an increasing number of works consider bipedal platforms. Recent works have produced behaviors including walking and running (24, 25), stair climbing (26), and jumping (27). Most recent works have focused on high-quality, full-sized bipeds and humanoids, with a much smaller number (48, 52–54) targeting more basic platforms whose simpler and less precise actuators and sensors pose additional challenges in terms of sim-to-real transfer. In addition, there is a growing interest in whole-body control, that is, tasks in which the whole body is used in flexible ways to interact with the environment. Examples include getting up from the ground (55) and manipulation of objects with legs (23, 56). Recently, RL has been applied to learn simple soccer skills, including goalkeeping (21), ball manipulation (18, 19), and shooting (20). These works focus on a narrower set of skills than the 1v1 soccer game, and the quadrupedal platform is inherently more stable and therefore presents an easier learning challenge.

Comparison with RoboCup
Robot soccer has been a longstanding grand challenge for AI and

our learning pipeline relied on some domain-specific knowledge and domain randomization, as is common in the robot learning literature (5, 6, 12, 37, 43). Domain-specific knowledge was used for reward function design and for training the get-up skill, which requires access to hand-designed key poses, which can be difficult or impractical to choose for more dynamic platforms. In addition, the distillation step assumed that we could manually choose the correct skill (either get-up or soccer) for each state, although a method in which the distillation target is automatically selected has been demonstrated in prior work (11), which we anticipate would work in this application. Second, we did not leverage real data for transfer; instead, our approach relied solely on sim-to-real transfer. Fine-tuning on real robots or mixing in real data during training in simulation could help improve transfer and enable an even wider spectrum of stable behaviors. Third, we applied our method to a small robot and did not consider additional challenges that would be associated with a larger form factor.

Our current system could be improved in a number of ways. We found that tracking a ball with motion capture was particularly challenging: Detection of the reflective tape markers is sensitive to the
agent did not approach the ball. However, it also learned fewer agile behaviors. Insights from prior work in simulation (11) could be applied to improve performance in this setting.

Playing soccer from raw vision
Another important direction for future work is learning from onboard sensors only, without external state information from a motion capture system. In comparison with state-based agents that have direct access to the ball, goal, and opponent locations, vision-based agents need to infer information from a limited history of high-dimensional egocentric camera observations and integrate the partial state information over time, which makes the problem significantly harder (73).

As a first step, we investigated how to train vision-based agents that only use an onboard RGB camera and proprioception. We created a visual rendering of our lab using a neural radiance field model (74) based on the approach introduced by Byravan et al. (75). The robot learned behaviors including ball tracking and situational awareness of the opponent and goal. See the "Playing soccer from raw vision" section in the Supplementary Materials for our preliminary results with this approach.
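Acting from onboard sensing as described above means the policy sees only a short history of egocentric frames and proprioception rather than ground-truth state. The sketch below shows one generic way to maintain such an observation stack; the history length, frame size, and proprioception dimensionality are illustrative choices, not the configuration used in the paper.

```python
from collections import deque

import numpy as np


class EgocentricObservationStack:
    """Keeps the last `history` egocentric RGB frames and proprioceptive vectors (illustrative)."""

    def __init__(self, history=4, frame_shape=(40, 30, 3), proprio_dim=56):
        self.frames = deque(maxlen=history)
        self.proprio = deque(maxlen=history)
        self.frame_shape = frame_shape
        self.proprio_dim = proprio_dim

    def reset(self):
        self.frames.clear()
        self.proprio.clear()

    def add(self, rgb_frame, proprio_vector):
        # Normalize pixel values and store the newest observation pair.
        self.frames.append(np.asarray(rgb_frame, dtype=np.float32) / 255.0)
        self.proprio.append(np.asarray(proprio_vector, dtype=np.float32))

    def observation(self):
        # Pad with zeros until enough history has been collected.
        while len(self.frames) < self.frames.maxlen:
            self.frames.appendleft(np.zeros(self.frame_shape, dtype=np.float32))
            self.proprio.appendleft(np.zeros(self.proprio_dim, dtype=np.float32))
        return {
            "pixels": np.stack(list(self.frames)),           # (history, H, W, 3)
            "proprioception": np.stack(list(self.proprio)),  # (history, proprio_dim)
        }
```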
reduce the size of the observation space. However, including a full observation history could be important for improving the agent's ability to react and adapt to the opponent in more subtle ways. A more detailed description of the observations is given in the "Environment Details" section in the Supplementary Materials.

Robot hardware and motion capture
We used the Robotis OP3 robot (31), which is a low-cost, battery-powered, miniature humanoid platform. It is 51 cm tall, weighs 3.5 kg, and is actuated by 20 Robotis Dynamixel XM430-350-R servomotors. We controlled the servos by sending target angles using position control mode with only proportional gain (in other words, without any integral or derivative terms). Each actuator has a magnetic rotary encoder that provides the joint position observations to the agent. The robot also has an inertial measurement unit (IMU), which provides angular velocity and linear acceleration measurements. We found that the default robot control software was sometimes unreliable and caused nondeterministic control latency, so we wrote a custom driver that allows the agent to communicate directly and reliably with the servos and IMU via the
μ, gives rise to a distribution over trajectories: $\mu_\pi(\xi) = \mu_0(s_0)\prod_{t=0}^{\infty} \pi(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)$. The aim is to obtain a policy π that maximizes the expected discounted cumulative reward or return

$$J(\pi) \triangleq \mathbb{E}_{\xi \sim \mu_\pi}\left[\sum_{t=0}^{T} \gamma^t r(s_t)\right] \quad (1)$$

We parameterized the policy as a deep feed-forward neural network with parameters θ that outputs the mean and diagonal covariance of a multivariate Gaussian. We trained this policy to optimize Eq. 1 using maximum a posteriori policy optimization (MPO) (79), which is an off-policy actor-critic RL algorithm. MPO alternates between policy evaluation and policy improvement. In the policy evaluation step, the critic (or Q function) is trained to estimate $Q^{\pi_\theta}(s, a)$, which describes the expected return from taking action a in state s and then following policy $\pi_\theta$: $Q^{\pi_\theta}(s, a) \triangleq \mathbb{E}_{\xi \sim \mu_\theta(\xi \mid s_0 = s,\, a_0 = a)}\left[\sum_{t=0}^{T} \gamma^t r(s_t)\right]$. We used a distributional critic (80) and refer to the overall algorithm as distributional MPO (DMPO).
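The policy parameterization described above, a feed-forward network that outputs the mean and diagonal covariance of a Gaussian over actions, can be sketched as follows. The hidden-layer sizes, initialization, and observation dimensionality are illustrative placeholders; only the 20-dimensional action (one target angle per joint) is taken from the paper.

```python
import numpy as np


class DiagonalGaussianPolicy:
    """Feed-forward policy producing the mean and diagonal covariance of a Gaussian.

    Illustrative only: the layer sizes and random initialization below are stand-ins,
    not the trained network from the paper.
    """

    def __init__(self, obs_dim, action_dim, hidden=(256, 256), seed=0):
        rng = np.random.default_rng(seed)
        sizes = (obs_dim, *hidden, 2 * action_dim)  # final layer outputs [mean, log_std]
        self.layers = [
            (rng.normal(scale=1.0 / np.sqrt(n_in), size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])
        ]
        self.action_dim = action_dim

    def __call__(self, obs):
        h = np.asarray(obs, dtype=np.float64)
        for i, (w, b) in enumerate(self.layers):
            h = h @ w + b
            if i < len(self.layers) - 1:
                h = np.tanh(h)  # hidden-layer nonlinearity
        mean, log_std = h[: self.action_dim], h[self.action_dim:]
        return mean, np.exp(log_std)  # mean and diagonal standard deviations

    def sample(self, obs, rng=None):
        rng = rng or np.random.default_rng()
        mean, std = self(obs)
        return mean + std * rng.standard_normal(self.action_dim)
```

A call such as `DiagonalGaussianPolicy(obs_dim=86, action_dim=20).sample(np.zeros(86))` returns one sampled 20-dimensional action vector; the observation size here is arbitrary.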
In the policy improvement step, the actor (or policy) is

reward components. This included components to encourage forward velocity and ball interaction and to make exploration easier as well as components to improve sim-to-real transfer and reduce robot breakages, as discussed in the "Regularization for safe behaviors" section.

Get-up skill training
The get-up skill was trained using a sequence of target poses to bias the policy toward a stable and collision-free trajectory. We used the preprogrammed get-up trajectory (32) to extract three key poses for getting up from either the front or the back (see the "Get-up skill training" section in the Supplementary Materials for an illustration of the key poses).

We trained the get-up skill to reach any target pose interpolated between the key poses. We conditioned both the actor and critic on the target pose, which consists of target joint angles $p_{\mathrm{target}}$ and target torso orientation $g_{\mathrm{target}}$. The target torso orientation is expressed as the gravity direction in the egocentric frame. This is independent of the robot yaw angle (heading), which is irrelevant for the get-up task. Conditioning on the joint angles steers the agent toward collision-free and stable poses, whereas conditioning on the gravity direction
Unlike most prior work, in our setting, the skill policies are useful in mutually exclusive sets of states: The soccer skill is useful only when the agent is standing up; otherwise, the get-up skill is more useful. Thus, in each state, we regularized the agent's policy $\pi_\theta$ to only one of the two skills. We achieved this by replacing the critic's predicted Q values used in the policy improvement step with a weighted sum of the predicted Q values and Kullback-Leibler (KL) regularization to the relevant skill policy:

$$
\begin{cases}
(1 - \lambda_s)\, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[Q^{\pi_\theta}(s, a)\right] - \lambda_s\, \mathrm{KL}\!\left(\pi_\theta(\cdot \mid s) \,\|\, \pi_s(\cdot \mid s)\right) & \text{if } s \in \mathcal{U} \\
(1 - \lambda_g)\, \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[Q^{\pi_\theta}(s, a)\right] - \lambda_g\, \mathrm{KL}\!\left(\pi_\theta(\cdot \mid s) \,\|\, \pi_g(\cdot \mid s)\right) & \text{if } s \notin \mathcal{U}
\end{cases}
\quad (2)
$$

where $\mathcal{U}$ is the set of all states in which the agent is upright. To enable the agent to outperform the skill policies, the weights $\lambda_s$ and $\lambda_g$ are adaptively adjusted such that there is no regularization once the predicted Q values are above the preset thresholds $Q_s$ and $Q_g$, respectively. This approach was proposed by Abdolmaleki et al. (88) for a similar setting and is closely related to the Lagrangian multiplier method used in constrained RL (89).
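Schematically, the regularized policy improvement objective of Eq. 2, together with the threshold-based adaptation of the weights, can be written as follows. The scalar interface and the simple on/off adaptation rule are deliberate simplifications; the paper adjusts the weights with the multi-objective machinery of (88), which is related to Lagrangian methods in constrained RL (89).

```python
def regularized_improvement_objective(q_value, kl_to_getup, kl_to_soccer,
                                      upright, lambda_g, lambda_s):
    """Eq. 2, schematically: mix the critic's value with a KL term to one skill.

    `q_value` stands for E_{a~pi(.|s)}[Q(s, a)] in the current state, the KL
    arguments are KL(pi_theta || pi_skill) for the two skill policies, and
    `upright` selects which skill is relevant in this state.
    """
    if upright:
        return (1.0 - lambda_s) * q_value - lambda_s * kl_to_soccer
    return (1.0 - lambda_g) * q_value - lambda_g * kl_to_getup


def adapt_weight(lambda_current, q_value, q_threshold, step=0.01):
    """Illustrative adaptation: drive lambda toward 0 once the predicted value
    clears its preset threshold, and back up otherwise."""
    if q_value >= q_threshold:
        return max(0.0, lambda_current - step)
    return min(1.0, lambda_current + step)
```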
Specifically, $\lambda_s$ (or

included shaping reward terms to obtain behaviors that are less likely to damage the robot.

System identification
We identified the actuator parameters by applying a sinusoidal control signal of varying frequencies to a motor with a known load attached to it and optimized over the actuator model parameters in simulation to match the resulting joint angle trajectory. For simplicity, we chose a position-controlled actuator model with torque feedback and with only damping [1.084 N·m/(rad/s)], armature (0.045 kg m²), friction (0.03), maximum torque (4.1 N·m), and proportional gain (21.1 N/rad) as free parameters. The values in parentheses correspond to the final values after applying this process. This model does not exactly correspond to the servos' operating mode, which controls the coil voltage instead of output torque, but we found that it matched the training data sufficiently well. We believe that this is because using a position control mode hides model mismatch from the agent by applying fast stabilizing feedback at a high frequency. We also experimented with direct current control but found that the sim-to-real gap was too large, which caused zero-shot
Fig. 7. Comparison against ablations. (A) The plot compares our full method (1) against ablations: At regular intervals across training, each policy was evaluated against
4. J. Peters, S. Schaal, Reinforcement learning of motor skills with policy gradients. Neural Netw. 21, 682–697 (2008).
5. M. P. Deisenroth, G. Neumann, J. Peters, "A survey on policy search for robotics" in Foundations and Trends in Robotics, vol. 2, no. 1–2 (Now Publishers, 2013), pp. 1–142.
6. J. Kober, J. A. Bagnell, J. Peters, Reinforcement learning in robotics: A survey. Int. J. Robot. Res. 32, 1238–1274 (2013).
7. N. Heess, D. Tirumala, S. Sriram, J. Lemmon, J. Merel, G. Wayne, Y. Tassa, T. Erez, Z. Wang, A. Eslami, M. Riedmiller, D. Silver, Emergence of locomotion behaviours in rich environments. arXiv:1707.02286 (2017).
8. T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, I. Mordatch, "Emergent complexity via multi-agent competition" in 6th International Conference on Learning Representations (ICLR, 2018).
9. X. B. Peng, P. Abbeel, S. Levine, M. van de Panne, DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transac. Graph. 37, 1–14 (2018).
10. J. Merel, S. Tunyasuvunakool, A. Ahuja, Y. Tassa, L. Hasenclever, V. Pham, T. Erez, G. Wayne, N. Heess, Catch & Carry: Reusable neural controllers for vision-guided whole-body tasks. ACM Transac. Graph. 39, 1–14 (2020).
11. S. Liu, G. Lever, Z. Wang, J. Merel, S. M. A. Eslami, D. Hennes, W. M. Czarnecki, Y. Tassa, S. Omidshafiei, A. Abdolmaleki, N. Y. Siegel, L. Hasenclever, L. Marris, S. Tunyasuvunakool, H. F. Song, M. Wulfmeier, P. Muller, T. Haarnoja, B. D. Tracey, K. Tuyls, T. Graepel, N. Heess, From motor control to team play in simulated humanoid football. Sci. Robot. 7, eabo0235 (2022).
12. J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, M. Hutter, Learning quadrupedal locomotion
32. Robotis, "Robotis OP3 source code," April 2023; https://github.com/ROBOTIS-GIT/ROBOTIS-OP3.
33. M. Bestmann, J. Zhang, "Bipedal walking on humanoid robots through parameter optimization" in RoboCup 2022: Robot World Cup XXV, vol. 13561 of Lecture Notes in Computer Science, A. Eguchi, N. Lau, M. Paetzel-Prüsmann, T. Wanichanon, Eds. (Springer, 2022), pp. 164–176.
34. B. D. DeAngelis, J. A. Zavatone-Veth, D. A. Clark, The manifold structure of limb coordination in walking Drosophila. eLife 8, e46409 (2019).
35. L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426 (February 2018).
36. T. Röfer, T. Laue, A. Baude, J. Blumenkamp, G. Felsch, J. Fiedler, A. Hasselbring, T. Haß, J. Oppermann, P. Reichenberg, N. Schrader, D. Weiß, "B-Human team report and code release 2019," 2019; http://b-human.de/downloads/publications/2019/CodeRelease2019.pdf.
37. J. Ibarz, J. Tan, C. Finn, M. Kalakrishnan, P. Pastor, S. Levine, How to train your robot with deep reinforcement learning: Lessons we have learned. Int. J. Robot. Res. 40, 698–721 (2021).
38. T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, M. Hutter, Learning robust perceptive locomotion for quadrupedal robots in the wild. Sci. Robot. 7, eabk2822 (2022).
39. A. Agarwal, A. Kumar, J. Malik, D. Pathak, "Legged locomotion in challenging terrains using egocentric vision" in Conference on Robot Learning (MLResearchPress, 2023), pp. 403–415.
40. I. Radosavovic, T. Xiao, B. Zhang, T. Darrell, J. Malik, K. Sreenath, Learning humanoid locomotion with transformers. arXiv:2303.03381 [cs.RO] (14 December 2023).
60. P. Stone, R. S. Sutton, G. Kuhlmann, Reinforcement learning for RoboCup-soccer keepaway. Adapt. Behav. 13, 165–188 (2005).
61. S. Kalyanakrishnan, P. Stone, "Learning complementary multiagent behaviors: A case study" in RoboCup 2009: Robot Soccer World Cup XIII, vol. 5949 of Lecture Notes in Computer Science, J. Baltes, M. G. Lagoudakis, T. Naruse, S. S. Ghidary, Eds. (Springer, 2010), pp. 153–165.
62. S. Kalyanakrishnan, Y. Liu, P. Stone, "Half field offense in RoboCup soccer: A multiagent reinforcement learning case study" in RoboCup-2006: Robot Soccer World Cup X, vol. 4434 of Lecture Notes in Artificial Intelligence, G. Lakemeyer, E. Sklar, D. Sorenti, T. Takahashi, Eds. (Springer, 2007), pp. 72–85.
63. P. Stone, M. Veloso, "Layered learning" in European Conference on Machine Learning (Springer, 2000), pp. 369–381.
64. P. MacAlpine, P. Stone, Overlapping layered learning. Artif. Intell. 254, 21–43 (2018).
65. M. Abreu, L. P. Reis, N. Lau, "Learning to run faster in a humanoid robot soccer environment through reinforcement learning" in Robot World Cup (Springer, 2019), pp. 3–15.
66. L. C. Melo, D. C. Melo, M. R. Maximo, Learning humanoid robot running motions with symmetry incentive through proximal policy optimization. J. Intell. Robot. Syst. 102, 54 (2021).
67. M. Saggar, T. D'Silva, N. Kohl, P. Stone, "Autonomous learning of stable quadruped locomotion" in RoboCup-2006: Robot Soccer World Cup X, vol. 4434 of Lecture Notes in Artificial Intelligence, G. Lakemeyer, E. Sklar, D. Sorenti, T. Takahashi, Eds. (Springer, 2007), pp. 98–109.
68. M. Hausknecht, P. Stone, "Learning powerful kicks on the Aibo ERS-7: The quest for a
85. Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, R. Pascanu, Distral: Robust multitask reinforcement learning. Adv. Neural Inf. Process. Syst. 30 (2017).
86. A. Galashov, S. Jayakumar, L. Hasenclever, D. Tirumala, J. Schwarz, G. Desjardins, W. M. Czarnecki, Y. W. Teh, R. Pascanu, N. Heess, "Information asymmetry in KL-regularized RL" in International Conference on Learning Representations, New Orleans, LA, 6 to 9 May 2019.
87. S. Schmitt, J. J. Hudson, A. Žídek, S. Osindero, C. Doersch, W. M. Czarnecki, J. Z. Leibo, H. Küttler, A. Zisserman, K. Simonyan, S. M. A. Eslami, Kickstarting deep reinforcement learning. arXiv:1803.03835 (2018).
88. A. Abdolmaleki, S. H. Huang, G. Vezzani, B. Shahriari, J. T. Springenberg, S. Mishra, D. TB, A. Byravan, K. Bousmalis, A. Gyorgy, C. Szepesvari, R. Hadsell, N. Heess, M. Riedmiller, On multi-objective policy optimization as a tool for reinforcement learning. arXiv:2106.08199 (2021).
89. A. Stooke, J. Achiam, P. Abbeel, "Responsive safety in reinforcement learning by PID Lagrangian methods" in Proceedings of the 37th International Conference on Machine Learning (ICML, 2020), pp. 9133–9143.
90. S. Liu, G. Lever, J. Merel, S. Tunyasuvunakool, N. Heess, T. Graepel, "Emergent coordination through competition" in International Conference on Learning Representations, New Orleans, LA, 6 to 9 May 2019.
91. S. Thrun, A. Schwartz, Finding structure in reinforcement learning. Adv. Neural Inf. Process. Syst. 7 (1994).
92. M. Bowling, M. Veloso, "Reusing learned policies between similar problems" in Proceedings of the AI*IA-98 Workshop on New Trends in Robotics (1998); https://cs.cmu.edu/afs/cs/user/mmv/www/papers/rl-reuse.pdf.
93. X. B. Peng, M. Chang, G. Zhang, P. Abbeel, S. Levine, "MCP: learning composable
107. O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, J. Oh, D. Horgan, M. Kroiss, I. Danihelka, A. Huang, L. Sifre, T. Cai, J. P. Agapiou, M. Jaderberg, A. S. Vezhnevets, R. Leblond, T. Pohlen, V. Dalibard, D. Budden, Y. Sulsky, J. Molloy, T. L. Paine, C. Gulcehre, Z. Wang, T. Pfaff, Y. Wu, R. Ring, D. Yogatama, D. Wunsch, K. McKinney, O. Smith, T. Schaul, T. P. Lillicrap, K. Kavukcuoglu, D. Hassabis, C. Apps, D. Silver, Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575, 350–354 (2019).
108. B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, I. Mordatch, "Emergent tool use from multi-agent autocurricula" in 8th International Conference on Learning Representations (ICLR, 2020).
109. R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction (MIT Press, 2018).
110. J. Schulman, S. Levine, P. Abbeel, M. Jordan, P. Moritz, "Trust region policy optimization" in Proceedings of the 32nd International Conference on Machine Learning (ICML) (ACM, 2015), pp. 1889–1897.
111. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, Human-level control through deep reinforcement learning. Nature 518, 529 (2015).
112. T. Haarnoja, B. Moran, G. Lever, S. H. Huang, D. Tirumala, J. Humplik, M. Wulfmeier, S. Tunyasuvunakool, N. Y. Siegel, R. Hafner, M. Bloesch, K. Hartikainen, A. Byravan, L. Hasenclever, Y. Tassa, F. Sadeghi, N. Batchelor, F. Casarini, S. Saliceti, C. Game, N. Sreendra, K. Patel, M. Gwira, A. Huber, N. Hurley, F. Nori, R. Hadsell, N. Heess, Data release for: Learning agile soccer skills for a bipedal robot with deep reinforcement learning [data set], 2024; https://doi.org/10.5281/zenodo.10793725.

Acknowledgements: We thank D. Hennes at Google DeepMind for developing the plotting tools used for the soccer matches and M. Riedmiller and M. Neunert at Google DeepMind for their helpful comments. Funding: This research was funded by Google DeepMind. Author contributions: Algorithm development: T.H., B.M., G.L., S.H.H., D.T., J.H., M.W., S.T., N.Y.S., R.H., M.B., K.H., A.B., L.H., Y.T., F.S.; Software infrastructure and environment development: T.H., B.M., G.L., S.H.H., J.H., S.T., N.Y.S., R.H., M.B., Y.T.; Agent analysis: T.H., B.M., G.L., S.H.H.; Experimentation: T.H., B.M., G.L., S.H.H., D.T., J.H., N.Y.S.; Article writing: T.H., B.M., G.L., S.H.H., D.T., M.W., N. Heess; Infrastructure support: N.B., F.C., S.S., C.G., N.S., K.P., M.G.; Management support: N.B., F.C., A.H., N. Hurley; Project supervision: F.N., R.H., N. Heess; Project design: T.H., N. Heess. Data availability: The data used for our quantitative figures and tables have been made available for download at https://zenodo.org/records/10793725 (112).

Submitted 31 May 2023
Accepted 14 March 2024
Published 10 April 2024
10.1126/scirobotics.adi8022