ROBEL - Robotics Benchmarks For Learning With Low-Cost Robots
UC Berkeley, USA Google Research, USA
Figure 1: ROBEL robots: D’Kitty (left) and D’Claw (middle and right)
1 Introduction
Learning-based methods for solving robotic control problems have recently seen significant mo-
mentum, driven by the widening availability of simulated benchmarks [1, 2, 3] and advancements
in flexible and scalable reinforcement learning [4, 5, 6, 7]. While learning through simulation is
relatively inexpensive and scalable, developments on these simulated environments often encounter
difficulty in deploying to real-world robots due to factors such as inaccurate modeling of physical
phenomena and domain shift. This motivates the need to develop robotic control solutions directly
in the real world on physical hardware.
Modern advancements in reinforcement learning have shown some success in the real world
[8, 9, 10]. However, learning on real robots generally does not take into account physical limitations
2 Related Work
Although not posed as benchmarks, the idea of comparing progress in the real world via shared
datasets [28], testbeds [29], and hardware designs [24, 30, 31] has been around for a while. Recently,
benchmarking in the real world using commercially available platforms has also been proposed
[32, 33]. These benchmarks include robot-centric tasks such as end-effector reaching, joint angle tracking, and grasping via parallel-jaw grippers. To further diversify the benchmarking scene, ROBEL
presents a wide variety of high DoF tasks spanning dexterous manipulation as well as quadruped
locomotion.
With learning-based methods [4], it is common to measure the average episodic return to evaluate
the performance of an agent. These returns are task-specific and often ignore the challenges of
the real world, such as unsafe exploration, movement quality, hardware risks, energy expenditure,
etc. These challenges were highlighted by the DARPA Robotics Challenge [20, 21, 22], where many robots failed to achieve their task objective because safety objectives were undervalued, indicating that real-world considerations such as safety deserve explicit priority. Hardware safety has previously been posed as explicit constraints (position, velocity, acceleration, and jerk limits) as well as regularization terms (energy, control cost), but has not found appropriate emphasis in existing learning benchmarks [3, 1, 12]. Addressing this, ROBEL provides three signals (dense reward, sparse score, and hardware safety) to facilitate the study of these challenges.
3 ROBEL
Hardware Platforms
Figure 2: Cost comparison (cost in USD per actuated DOF vs. number of actuated DOFs) of ROBEL with other commonly used manipulation and locomotion platforms such as the Shadow and Allegro hands, Robotiq, OnRobot, and Weiss grippers, and the Vision60, Laikago, and Minitaur quadrupeds. We note that (a) ROBEL platforms have the most economical price point, thereby facilitating experiment scalability, and (b) prices scale linearly with the number of DOFs, thanks to the modular design, thereby facilitating experiment complexity.

As the number of actuated DOFs of a system grows, we tend to see a proportional increase in cost and decrease in reliability. The modularity of ROBEL allows us to build reasonably high-DOF robots while remaining low-cost and easily maintainable. The robots use only off-the-shelf components and commonly-available prototyping tools (3D printers, laser cutters), and require only a few hours to build (Table 1). ROBEL robots are actuated at the joint level (i.e. no transmission between joint and actuator) via Dynamixel smart actuators [34] that feature fully integrated motors with an embedded controller, reduction drive, and high-baudrate communication. Multiple actuators can be daisy-chained together to increase the number of DOFs in the system, which makes ROBEL robots easy to build (Table 1) and extend. For the context of this work we use a USB-serial bus [35] for communication with the robots. A 12V power supply powers the platforms. ROBEL platforms also support a wide variety of choices in sensing and actuation modes, which are summarized in Table 2.
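As an illustration of this joint-level communication path, the following is a minimal sketch (not part of the ROBEL codebase) of reading the present-position register from a chain of Dynamixel actuators using the ROBOTIS DynamixelSDK Python bindings. The device path, baud rate, actuator IDs, and the control-table address (132, valid for X-series actuators under Protocol 2.0) are assumptions made for illustration.

from dynamixel_sdk import PortHandler, PacketHandler  # ROBOTIS DynamixelSDK Python bindings

ADDR_PRESENT_POSITION = 132   # control-table address; model-specific (X-series, Protocol 2.0)
DXL_IDS = [10, 11, 12]        # hypothetical IDs of daisy-chained actuators on one bus

port = PortHandler('/dev/ttyUSB0')   # USB-serial bus device path (assumption)
packet = PacketHandler(2.0)          # Dynamixel Protocol 2.0

if port.openPort() and port.setBaudRate(1000000):
    for dxl_id in DXL_IDS:
        # Read the 4-byte present-position register of each actuator on the chain.
        position, comm_result, dxl_error = packet.read4ByteTxRx(
            port, dxl_id, ADDR_PRESENT_POSITION)
        print(f'id={dxl_id} raw_position={position}')
    port.closePort()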
The schematic details of ROBEL platforms are summarized in Figure 3. Detailed CAD models and
bill of materials (BOM) with step-by-step assembly instructions are included in the supplementary
materials package. ROBEL platforms have also been independently replicated and tested for relia-
bility (subsection 5.3) at a geographically remote location which demonstrates the reproducibility
(details in subsection 5.2) of the ROBEL platforms and associated results.
The combination of reproducibility and scalability exhibited by the ROBEL platforms presents the field of robotics with a compelling proposition: a standard set of benchmarks (proposed in section 4) to facilitate sharing and collaborative comparison of results. ROBEL consists of two platforms: D'Claw, a nine-DOF manipulation platform, and D'Kitty, a twelve-DOF locomotion platform (Figure 1).
4 Benchmark Tasks
ROBEL proposes a collection of tasks for D’Claw and D’Kitty to serve as a foundation for real-world
benchmarking for continuous control problems in robotics. We first outline the formulations of these
benchmark tasks, and then provide details of the tasks grouped into manipulation and locomotion.
ROBEL tasks are formulated in a standard Markov decision process (MDP) setting [37], in which each step, corresponding to a time t in the environment, consists of a state observation s, an input action a, a resulting reward rd, and a resulting next state s′. In addition to the reward rd, which is usually dense, ROBEL also provides a sparse signal called the score rs, which can be interpreted as a sparse task objective without any shaping. To standardize quantification of a policy π's performance, ROBEL provides a success evaluator metric φse(π) and a hardware safety metric φhs(π).
To implement the MDP setting, we employ the commonly-adopted OpenAI Gym [2] API. ROBEL is presented as an open-source Python library consisting of modular, reusable software components that enable a common interface to interact with hardware and simulation. Figure 4 provides an architectural outline of ROBEL.

Figure 4: ROBEL software architecture. An RL agent exchanges actions and observations with a ROBEL environment, which interfaces either with simulated robots (MuJoCo, Bullet) or with physical robots via the Dynamixel hardware SDK.
ROBEL environments are also available in simulation through commonly used physics simulation engines. Figure 3b and Figure 3e show the simulated robots modelled in MuJoCo [38]. We encourage the usage of simulation primarily as a rapid prototyping tool and promote purely real-world hardware results as ROBEL benchmarks.
The reward rd is the most commonly used signal in reinforcement learning and is what agents directly optimize. Since the reward often consists of multiple sub-goals and regularization terms, the score rs provides a more direct, task-specific sparse objective. The success evaluator φse(π) is defined to be reward (and score) agnostic: it evaluates the task-specific success percentage of a policy over multiple runs. Unlike the reward and score, which are provided at each step, the hardware safety metric φhs(π) is an array of counters that evaluates a policy over the specified horizon to measure the number of safety violations. We include the following violations in our safety measure: joint limits, velocity limits, and current limits.
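To make the interface concrete, below is a minimal Python sketch of interacting with a ROBEL task through the Gym API. The import name, the '-v0' registration ID, and the placement of the score and safety diagnostics in the info dictionary are assumptions made for illustration; consult the code repository for the exact names.

import gym
import robel  # assumed import that registers ROBEL environments with Gym

env = gym.make('DClawTurnFixed-v0')  # assumed registration ID for the DClawTurnFixed task

obs = env.reset()
episode_reward, episode_score = 0.0, 0.0
done = False
while not done:
    action = env.action_space.sample()          # random policy, for illustration only
    obs, reward, done, info = env.step(action)  # reward is the dense signal r_d
    episode_reward += reward
    # The sparse score r_s and safety counters are expected as diagnostics;
    # the key name below is hypothetical.
    episode_score += info.get('score', 0.0)

print('return:', episode_reward, 'score:', episode_score)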
We propose an initial set of ROBEL benchmark tasks to tackle a variety of challenges involving manipulation and locomotion. We summarize the tasks below and encourage readers to refer to Appendix A and the supplementary material1 for task details.
D’Claw is a 9-DoF dexterous manipulator capable of diverse contact-rich behaviors. We structure our first group of benchmark tasks around fundamental manipulation behaviors.
(a) Pose: conform to a shape (b) Turn: rotate to a fixed target (c) Screw: rotate to a moving target
Figure 5: D’Claw manipulation benchmarks: Pose, Turn and Screw are motivated by commonly
observed manipulation behaviors in daily life
a) Pose (conform to the shape of the environment): This task is motivated by the primary objective of a manipulator to conform to its surroundings in preparation for upcoming maneuvers, commonly observed as various pre-grasp and latching maneuvers (Figure 5a). This set of tasks is posed as trying to match randomly selected joint angle targets. Successful completion of this task demonstrates the capability of a manipulator to have controlled access to all its joints. This set of tasks is comparatively easy to train, thereby facilitating fast iteration cycles and a gradual transition to the rest of the tasks. Two variants of this task are provided: a static variant DClawPoseFixed
where the desired joint angles remain constant, and a dynamic variant DClawPoseRandom where
the desired joint angle is time-dependent and oscillates between two goal positions that are sampled
at the beginning of the episode.
b) Turn (rotate to a fixed target angle): This task encapsulates the ability of a manipulator to reposition unactuated DoFs present in the environment to target configurations, commonly observed as turning various knobs, latches, and handles. This set of tasks is posed as trying to match randomly selected joint angle targets for the unactuated object(s). Successful completion of this task demonstrates the ability of a manipulator to bring about desired changes on external targets. In order to succeed, the manipulator requires not only coordination between its internal DoFs, but also an understanding of the environment dynamics perceived through contact interactions. Three variants of this task are provided: DClawTurnFixed, where the initial and target angles are constant; DClawTurnRandom, where both initial and target angles are randomly selected; and DClawTurnRandomDynamics, where the initial and target angles are randomly selected and the environment (object size, surface, and dynamics properties) is also randomized.
1 Code repository, detailed documentation, and task videos are available at www.roboticsbenchmarks.org
c) Screw (rotate to a moving target angle): This task focuses on the ability of a manipulator to continuously rotate an unactuated object at a constant velocity. This set of tasks is posed as trying to match joint angle targets that are themselves moving. Although very similar to the Turn tasks, the nuance of a moving target requires the manipulator's strategy to constantly evolve as the target drifts. Fingers often enter singular positions as the rotation progresses. A successful strategy needs to learn coordinated finger gaiting to make steady progress while staying out of local minima. Three variants of this task are provided: DClawScrewFixed, where the target velocity is constant; DClawScrewRandom, where the initial angle and target velocity are randomly selected; and DClawScrewRandomDynamics, where the initial angle and target velocity are randomly selected and the environment (object size, surface, and dynamics properties) is also randomized.
The twelve-DoF locomotion platform D'Kitty is capable of exhibiting diverse behaviors. We structure this group of benchmark tasks around simple locomotion behaviors exhibited by quadrupeds.
(a) Stand: getting upright (b) Orient: align heading (c) Walk: get to target
Figure 6: D’Kitty locomotion benchmarks
a) Stand: Standing upright is one of the most fundamental behaviors exhibited by animals. This task involves reaching a pose while being upright. A successful strategy requires maintaining the stability of the torso via the ground reaction forces. Three variants of this task are provided: DKittyStandFixed, standing up from a fixed initial configuration; DKittyStandRandom, standing up from a random initial configuration; and DKittyStandRandomDynamics, standing up from a random initial configuration where the environment (surface, dynamics properties of D'Kitty, and ground height map) is randomized. See the supplementary materials1 for full details.
b) Orient: This task involves D'Kitty changing its orientation from an initial facing direction to a desired facing direction. This set of tasks is posed as matching the target configuration of the torso. A successful strategy requires maneuvering the torso via the ground reaction forces while maintaining balance. Three variants of this task are provided: DKittyOrientFixed, which maneuvers to a fixed target orientation; DKittyOrientRandom, which maneuvers to a random target orientation; and DKittyOrientRandomDynamics, which maneuvers to a random target orientation where the environment (surface, dynamics properties of D'Kitty, and ground height map) is randomized. See the supplementary materials1 for full details.
c) Walk: This task involves the D'Kitty moving from an initial Cartesian position to a desired Cartesian position while maintaining a desired facing direction. This task is posed as matching the Cartesian position of the torso with a distant target. A successful strategy needs to exhibit locomotion gaits while maintaining the heading. Three variants of this task are provided: DKittyWalkFixed, walking to a fixed target location; DKittyWalkRandom, walking to a randomly selected target location; and DKittyWalkRandomDynamics, walking to a randomly selected target location where the environment (surface, dynamics properties of D'Kitty, and ground height map) is randomized. See the supplementary materials1 for full details.
ROBEL task variants are carefully designed to represent a wide task spectrum. The fixed variants (task-name suffix "Fixed") are fast to iterate on and are helpful for getting started. The random variants (task-name suffix "Random") present a wide initial and goal distribution to study task generalization. In addition to the wider distribution, the random dynamics variants (task-name suffix "RandomDynamics") also present variability in the various environment properties. This variant is the hardest to solve and is well suited for the sim2real line of research.
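As an illustration of this naming convention, the sketch below sweeps the three variants of one task family; as in the earlier example, the import name and the '-v0' registration suffix are assumptions.

import gym
import robel  # assumed import that registers ROBEL environments

for suffix in ('Fixed', 'Random', 'RandomDynamics'):
    env = gym.make(f'DClawTurn{suffix}-v0')  # assumed registration ID pattern
    # Observation/action spaces are shared across variants of a task family.
    print(suffix, env.observation_space.shape, env.action_space.shape)
    env.close()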
Figure 7: Success percentage for D’Claw and D’Kitty tasks trained on a physical D’Claw robot and
a simulated D’Kitty robot using several agents: Soft Actor Critic (SAC)[7], Natural Policy Gradient
(NPG)[39], Demo-augmented Policy Gradient (DAPG)[40], and Behavior Cloning (BC) over 20
trajectories. Success is measured via the success evaluator φse (π) of the task (See Appendix A for
details). Each timestep corresponds to 0.1 real-world seconds
5 Experiments
We first summarize on-hardware training runs of various reinforcement learning algorithms that are included as ROBEL baselines. We then evaluate ROBEL for its reproducibility, both within the same location and at a geographically separated location, and for its reliability over extended usage. We conclude by presenting the performance of our baselines on the proposed safety metrics.
5.1 Baselines
ROBEL has been tested to meet the rigor of a wide variety of learning algorithms. One candidate from each algorithmic class was added to the spectrum of baselines (Figure 7). We include Natural Policy Gradient [39] as an on-policy method, Soft Actor Critic [7] as an off-policy method, Demo Augmented Policy Gradient [40] as a demonstration-accelerated method, and behavior cloning as a supervised learning baseline. Simulated-robot versions of the dynamics-randomized variants of all tasks (referred to as RandomDynamics) are also included in the package to facilitate the sim2real research direction. We also invite the open source community to add to our family of baselines via our open source repository.
5.2 Reproducibility
Figure 8: Left three panels: Training reproducibility between two real D'Claw robots, developed at different laboratory locations, over the benchmark tasks. Right: Effectiveness of a policy on different hardware than it was trained on. Score denotes closeness to the goal.
We test ROBEL reproducibility on multiple platforms independently developed at different locations (60 miles apart) by different groups (no in-person visits) using only the ROBEL documentation2. We evaluate ROBEL's reproducibility by studying the effectiveness of policies on hardware different from that on which they were trained. Figure 8 outlines the effectiveness of a policy on multiple hardware units across two different sites.
2 Occasional minor clarifications over email were later adopted into the documentation.
5.3 Reliability
We provide a qualitative measure of the reliability of the system in Table 3. It should be noted
that these metrics include data gathered while the system was under development. The matured
system reported in this paper is much more reliable. Figure 9 provides a qualitative depiction of
the robustness of the system using side by side comparison of a new and used D’Claw assembly.
The system is fairly robust in facilitating multiple day real-world experimentation on the hardware.
Occasional maintenance needs primarily involve screws becoming loose, which we attribute to the vibrations caused by recurring collision impacts during manipulation and locomotion. We also observe occasional motor failures (Table 3). Owing to the modularity of ROBEL, failed motors are easy to replace3.
Table 3: (Approximate) usage statistics of ROBEL over 12 months at the two sites (A and B). Note that the statistics include data from when ROBEL was still under development.

                  site A    site B
  training hours    9000      5000
  motors bought      150        40
  motors broken       19        10

Figure 9: Change in physical appearance depicting D'Claw resilience to extreme usage (left: new D'Claw; right: operational for ~6 months).
Figure 10: Safety violations observed during the training of the DClawScrewFixed task.
5.4 Safety
Smooth, elegant behavior has been a desirable but hard-to-define trait for all continuous control
problems. Various forms of regularization on control, velocity, acceleration, jerks, and energy are
often used to induce such properties. While there is not a universally accepted definition for smooth-
ness, a few metrics for safe behavior can be defined in terms of hardware safety limits. In addition to the dense and sparse objectives, ROBEL also provides hardware safety objectives, which have been largely ignored in available benchmarks [3][12][1]. ROBEL defines safety objectives over position, velocity, and torque (current) violations calculated over a finite-horizon trajectory. The success evaluator, provided with all benchmarks, not only reports the average task success metric, but also reports the average number of safety violations. A benchmark challenge is considered successful when there are no safety violations. Figure 10 shows the average number of joints under safety violations per episode for two RL agents. We observe that these policies, while successful in solving the task, exhibit significant safety violations. While safety is desirable, it has largely been ignored in existing RL benchmarks, resulting in limited progress. We hope that the safety metrics included in ROBEL will spur research in this direction.
6 Conclusion
This work proposes ROBEL, an open source platform of cost-effective robots designed for on-device reinforcement learning experimentation needs. ROBEL platforms are robust and have sustained over 14000 hours of real-world training to date. ROBEL features a 9-DOF manipulation platform, D'Claw, and a 12-DOF locomotion platform, D'Kitty, with a set of prepackaged benchmark tasks around them. We show the performance of these benchmarks with a variety of learning-based agents: on-policy (NPG), off-policy (SAC), demo-accelerated (DAPG), and supervised (BC). We provide these results as baselines for ease of comparison and extensibility. We show the reproducibility of ROBEL's benchmarks by independently reproducing results at a remote site.
We are excited to bring ROBEL to the larger robotics community and look forward to the possibilities
it presents towards the evolving experimentation needs of learning-based methods, and robotics in
general.
3 Broken motors are repairable via the manufacturer's RMA process. Motor sub-assemblies are available online as well.
Acknowledgments
We thank Aravind Rajeswaran, Emo Todorov, Vincent Vanhoucke, Matt Neiss, Chad Richards,
Thinh Nguyen, Byron David, Garrett Peake, Krista Reymann, and the rest of Robotics at Google
for their contributions and discussions all along the way.
References
[1] Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki,
J. Merel, A. Lefrancq, et al. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
[2] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba.
Openai gym. arXiv preprint arXiv:1606.01540, 2016.
[3] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. To-
bin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba. Multi-goal reinforcement learning:
Challenging robotics environments and request for research, 2018.
[4] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
[5] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra.
Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[6] X. B. Peng, G. Berseth, K. Yin, and M. Van De Panne. Deeploco: Dynamic locomotion skills
using hierarchical deep reinforcement learning. ACM Transactions on Graphics, 2017.
[7] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta,
P. Abbeel, et al. Soft actor-critic algorithms and applications. arXiv:1812.05905, 2018.
[8] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination
for robotic grasping with deep learning and large-scale data collection. The International
Journal of Robotics Research, 37(4-5):421–436, 2018.
[9] L. Pinto and A. Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700
robot hours. In 2016 IEEE international conference on robotics and automation (ICRA), 2016.
[10] H. Zhu, A. Gupta, A. Rajeswaran, S. Levine, and V. Kumar. Dexterous manipulation with deep
reinforcement learning: Efficient, general, and low-cost. preprint arXiv:1810.06045, 2018.
[11] F. Allgöwer and A. Zheng. Nonlinear model predictive control, volume 26. Birkhäuser, 2012.
[12] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel. Benchmarking deep reinforce-
ment learning for continuous control. In International Conference on Machine Learning, 2016.
[13] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization
for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ
International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE, 2017.
[14] F. Sadeghi and S. Levine. Cad2rl: Real single-image flight without a single real image. arXiv
preprint arXiv:1611.04201, 2016.
[15] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron,
M. Plappert, G. Powell, A. Ray, et al. Learning dexterous in-hand manipulation. arXiv preprint
arXiv:1808.00177, 2018.
[16] J. Matas, S. James, and A. J. Davison. Sim-to-real reinforcement learning for deformable
object manipulation. arXiv preprint arXiv:1806.07851, 2018.
[17] D. Kalashnikov, A. Irpan, P. P. Sampedro, J. Ibarz, A. Herzog, E. Jang, D. Quillen, E. Holly,
M. Kalakrishnan, V. Vanhoucke, and S. Levine. Qt-opt: Scalable deep reinforcement learning
for vision-based robotic manipulation. 2018.
[18] F. Ramos, R. C. Possas, and D. Fox. Bayessim: adaptive domain randomization via probabilis-
tic inference for robotics simulators. arXiv preprint arXiv:1906.01728, 2019.
[19] B. Mehta, M. Diaz, F. Golemo, C. J. Pal, and L. Paull. Active domain randomization. arXiv
preprint arXiv:1904.04762, 2019.
[20] M. Johnson, B. Shrewsbury, S. Bertrand, T. Wu, D. Duran, M. Floyd, P. Abeles, D. Stephen,
N. Mertins, A. Lesman, et al. Team ihmc’s lessons learned from the darpa robotics challenge
trials. Journal of Field Robotics, 32(2):192–208, 2015.
[21] S. Behnke. Robot competitions-ideal benchmarks for robotics research. In Proc. of IROS-
2006 Workshop on Benchmarks in Robotics Research. Institute of Electrical and Electronics
Engineers (IEEE), 2006.
[22] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale,
M. Halpenny, G. Hoffmann, et al. Stanley: The robot that won the darpa grand challenge.
Journal of field Robotics, 23(9):661–692, 2006.
[23] A. M. Dollar and R. D. Howe. The highly adaptive sdm hand: Design and performance evalu-
ation. The international journal of robotics research, 29(5):585–597, 2010.
[24] Y. She, C. Li, J. Cleary, and H.-J. Su. Design and fabrication of a soft robotic hand with
embedded actuators and sensors. Journal of Mechanisms and Robotics, 7(2):021007, 2015.
[25] Z. Xu and E. Todorov. Design of a highly biomimetic anthropomorphic robotic hand towards
artificial limb regeneration. In 2016 IEEE International Conference on Robotics and Automa-
tion (ICRA), pages 3485–3492. IEEE, 2016.
[26] S. H. Collins, M. Wisse, and A. Ruina. A three-dimensional passive-dynamic walking robot
with two legs and knees. The International Journal of Robotics Research, 2001.
[27] W. Bosworth, S. Kim, and N. Hogan. The mit super mini cheetah: A small, low-cost
quadrupedal robot for dynamic locomotion. In 2015 IEEE International Symposium on Safety,
Security, and Rescue Robotics (SSRR), pages 1–8. IEEE, 2015.
[28] B. Calli, A. Singh, A. Walsman, S. Srinivasa, P. Abbeel, and A. M. Dollar. The ycb object and
model set: Towards common benchmarks for manipulation research. In 2015 international
conference on advanced robotics (ICAR), pages 510–517. IEEE, 2015.
[29] D. Pickem, P. Glotfelter, L. Wang, M. Mote, A. Ames, E. Feron, and M. Egerstedt. The rob-
otarium: A remotely accessible swarm robotics research testbed. In 2017 IEEE International
Conference on Robotics and Automation (ICRA), pages 1699–1706. IEEE, 2017.
[30] K. A. Wyrobek, E. H. Berger, H. M. Van der Loos, and J. K. Salisbury. Towards a personal
robotics development platform: Rationale and design of an intrinsically safe personal robot. In
2008 IEEE International Conference on Robotics and Automation, pages 2165–2170. IEEE,
2008.
[31] M. J. Lum, D. C. Friedman, G. Sankaranarayanan, H. King, K. Fodero, R. Leuschke, B. Han-
naford, J. Rosen, and M. N. Sinanan. The raven: Design and validation of a telesurgery system.
The International Journal of Robotics Research, 28(9):1183–1197, 2009.
[32] A. R. Mahmood, D. Korenkevych, G. Vasan, W. Ma, and J. Bergstra. Benchmarking reinforce-
ment learning algorithms on real-world robots. arXiv preprint arXiv:1809.07731, 2018.
[33] B. Yang, J. Zhang, V. Pong, S. Levine, and D. Jayaraman. Replab: A reproducible low-cost
arm benchmark platform for robotic learning. arXiv preprint arXiv:1905.07447, 2019.
[34] Dynamixel smart actuator. http://www.robotis.us/dynamixel/. Accessed: 2019-
07-02.
[35] Usb-serial bus. http://www.robotis.us/u2d2/. Accessed: 2019-07-02.
[36] A. X. Lee, A. Nagabandi, P. Abbeel, and S. Levine. Stochastic latent actor-critic: Deep rein-
forcement learning with a latent variable model. arXiv preprint arXiv:1907.00953, 2019.
[37] M. L. Puterman. Markov decision processes: Discrete stochastic dynamic programming. 1994.
[38] E. Todorov, T. Erez, and Y. Tassa. Mujoco: A physics engine for model-based control. In 2012
IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
[39] S. M. Kakade. A natural policy gradient. In Advances in neural information processing sys-
tems, pages 1531–1538, 2002.
[40] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine.
Learning complex dexterous manipulation with deep reinforcement learning and demonstra-
tions. arXiv preprint arXiv:1709.10087, 2017.
Appendix
A ROBEL task details
In this section, we outline details of the benchmark tasks presented in section 4.
The action space of all D’Claw tasks (subsection 4.1) is a 1D vector of 9 D’Claw joint positions.
(i) Pose: This task involves posing the D'Claw by driving its joints θt to desired joint angles θgoal sampled randomly from the feasible joint angle space at the beginning of the episode. The
observation space st is a 36-size 1D vector that consists of the current joint angles θt , the joint
velocities θ̇t , the error between the goal and current joint angles, and the last action. The reward
function is defined as:
rt = −‖θgoal − θt‖ − 0.1 ‖θ̇t ∗ 1(|θ̇t| > 0.5)‖
(ii) Turn: This task involves rotating an object from an initial angle θ0,obj to a goal angle θgoal,obj .
The observation space is a 21-size 1D vector of the current joint angles θt , the joint velocities
θ̇t , the sine and cosine values of the object’s angle θt,obj , the last action, and the error between
the goal and the current object angle ∆θt,obj = θt,obj − θgoal,obj . The reward function is defined
as
rt = −5 |∆θt,obj| − ‖θnominal − θt‖ − ‖θ̇t‖ + 10 · 1(|∆θt,obj| < 0.25) + 50 · 1(|∆θt,obj| < 0.1)
(iii) Screw: This task involves rotating an object at a desired velocity θ̇desired from an initial angle.
This is represented by a θt,goal that is updated every step as θt,goal = θt−1,goal + θ̇desired ∗ dt.
Screw tasks have the same observation space and reward definitions as the Turn tasks. Three
variants of this task are provided:
(a) DClawScrewFixed: constant initial angle (0°) and desired velocity (0.5 rad/sec)
(b) DClawScrewRandom: random initial angle ([−180°, 180°]) and desired velocity ([−0.75 rad/sec, 0.75 rad/sec])
(c) DClawScrewRandomDynamics: same as previous. The position of the D’Claw relative to
the object, the object’s size, the joint damping, and the joint friction loss are randomized at
the beginning of every episode.
The success evaluator metric φse(π) of a policy π is defined using the mean absolute tracking error being within the threshold β = 0.1:
φse(π) = Eτ∼π [ (1/T) Σ_{t=0}^{T} |∆θ(τ)t,obj| < β ]
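As a concrete reading of this evaluator, the following is a small numpy sketch (not from the ROBEL codebase) that estimates φse from logged rollouts, where each rollout is represented simply as an array of per-step object-angle errors ∆θt,obj; this data layout is an assumption made for illustration.

import numpy as np

def tracking_success_rate(rollouts, beta=0.1):
    """Fraction of rollouts whose mean absolute object-angle error is below beta.

    `rollouts` is assumed to be a list of 1D arrays, each containing the per-step
    errors delta_theta_obj of one episode.
    """
    successes = [np.mean(np.abs(errors)) < beta for errors in rollouts]
    return float(np.mean(successes))

# Example with two fabricated rollouts (for illustration only):
good = np.full(50, 0.05)   # stays within the 0.1 threshold on average
bad = np.full(50, 0.4)     # large tracking error throughout
print(tracking_success_rate([good, bad]))  # -> 0.5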
The action space of all of the D’Kitty tasks is a 1D vector of 12 joint positions. The observation
space shares 49 common entries: the Cartesian position (3), Euler orientation (3), velocity (3), and
angular velocity (3) of the D’Kitty torso, the joint positions θ (12) and velocities θ̇ (12) of the 12
joints, the previous action (12), and 'uprightness' ut,kitty (1). The uprightness ut,kitty of the D'Kitty is measured as its orientation projected onto the global vertical axis:
ut,kitty = Rẑ,t,kitty · Ẑ
The D’Kitty tasks share a common term in the reward function rt,upright regarding uprightness
defined as:
rt,upright = αupright (ut,kitty − β) / (1 − β) + αfalling · 1(ut,kitty < β)
where β is the cosine-similarity threshold with the global z-axis below which we consider the D'Kitty to have fallen. When perfectly upright, the αupright reward is collected; when the alignment ut,kitty falls below the threshold β, the episode terminates early and αfalling is collected.
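To make the uprightness term concrete, here is a brief numpy sketch (an illustration, not ROBEL code) that computes ut,kitty as the dot product between the torso frame's z-axis, taken from a 3x3 rotation matrix, and the global Ẑ, and then evaluates rt,upright with the constants used by the Stand task below (αupright = 2, αfalling = −100, β = cos(90°)).

import numpy as np

def uprightness(torso_rotation):
    """u = R_zhat · Zhat: cosine between the torso z-axis and the global up axis.

    `torso_rotation` is assumed to be a 3x3 rotation matrix whose columns are the
    torso frame axes expressed in world coordinates.
    """
    torso_z_axis = torso_rotation[:, 2]
    return float(np.dot(torso_z_axis, np.array([0.0, 0.0, 1.0])))

def upright_reward(u, alpha_upright=2.0, alpha_falling=-100.0, beta=np.cos(np.pi / 2)):
    """r_upright = alpha_upright * (u - beta) / (1 - beta) + alpha_falling * 1(u < beta)."""
    return alpha_upright * (u - beta) / (1.0 - beta) + alpha_falling * float(u < beta)

# A perfectly upright torso (identity rotation) collects the full alpha_upright reward:
print(upright_reward(uprightness(np.eye(3))))  # -> 2.0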
(i) Stand: This task involves D’Kitty coordinating its 12 joints θt to stand upright maintaining a pose
specified by θgoal . The observation space is a 61-size 1D vector of the shared observation space
entries and pose error et,pose = (θgoal − θt ). The reward function is defined as:
rt = rt,upright − 4 ēt,pose − 2 ‖pt,kitty‖2 + 5 ut,kitty · 1(ēt,pose < π/6) + 10 ut,kitty · 1(ēt,pose < π/12)
where ēt,pose is the mean absolute pose error, pt,kitty is the Cartesian position of the D'Kitty on the horizontal plane, and the shared reward function constants are αupright = 2, αfalling = −100, β = cos(90°).
Three variants of this task are provided:
(a) DKittyStandFixed: constant initial pose.
(b) DKittyStandRandom: random initial pose.
(c) DKittyStandRandomDynamics: same as previous. The joint gains, damping, friction loss,
geometry friction coefficients, and masses are randomized. In addition, a randomized height
field is generated with heights up to 0.05m
The success evaluator indicates success if the mean pose error is within the goal threshold β = π/12 and the D'Kitty is sufficiently upright at the last step (t = T) of the episode:
φse(π) = Eτ∼π [1(ē(τ)T,pose < β) ∗ 1(u(τ)T,kitty > 0.9)]
(ii) Orient: This task involves the D'Kitty matching its current facing direction ωt with a goal facing direction ωgoal, thus minimizing the facing angle error et,facing between ωgoal and ωt. The observation space is a 53-size 1D vector of the shared observation space entries, ωt and ωgoal represented as unit vectors on the (X,Y) plane, and the angle error et,facing. The reward function is
defined as:
rt = rt,upright − 4 et,facing − 4 ‖pt,kitty‖2 + rbonus,small + rbonus,big
rbonus,small = 5 · 1(et,facing < 15° or ut,kitty > cos(15°))
rbonus,big = 10 · 1(et,facing < 5° and ut,kitty > cos(15°))
where the shared reward function constants are αupright = 2, αfalling = −500, β = cos(25°).
Three variants of this task are provided:
(a) DKittyOrientFixed: constant initial facing (0◦ ) and goal facing (180◦ ).
(b) DKittyOrientRandom: random initial facing ([−60◦ , 60◦ ]) and goal facing ([120◦ , 240◦ ])
(c) DKittyOrientRandomDynamics: same as previous. The joint gains, damping, friction loss,
geometry friction coefficients, and masses are randomized. In addition, a randomized height
field is generated with heights up to 0.05m
The success evaluator indicates success if the facing angle error is within the goal threshold and the D'Kitty is sufficiently upright at the last step (t = T) of the episode:
φse(π) = Eτ∼π [1(e(τ)T,facing < 5°) ∗ 1(u(τ)T,kitty > cos(15°))]
(iii) Walk: This task has the D'Kitty move its current Cartesian position pt,kitty to a desired Cartesian position pgoal, minimizing the distance dt,goal = ‖pgoal − pt,kitty‖2. Additionally, the D'Kitty is incentivized to face towards the goal. The heading alignment is calculated as ht,goal = Rŷ,t,kitty · (pgoal − pt,kitty) / dt,goal. The observation space is a 52-size 1D vector of the shared observation space entries, ht,goal, and pgoal − pt,kitty.
The reward function is defined as:
rt = rt,upright − 4 dt,goal + 2 ht,goal + rbonus,small + rbonus,big
rbonus,small = 5 · 1(dt,goal < 0.5 or ht,goal > cos(25°))
rbonus,big = 10 · 1(dt,goal < 0.5 and ht,goal > cos(25°))
and the shared reward function constants are αupright = 1, αfalling = −500, β = cos(25°).
Three variants of this task are provided:
(a) DKittyWalkFixed: constant distance (2m) towards 0◦ .
(b) DKittyWalkRandom: random distance ([1, 2]) towards random angle ([−60◦ , 60◦ ])
(c) DKittyWalkRandomDynamics: same as previous. The joint gains, damping, friction loss,
geometry friction coefficients, and masses are randomized. In addition, a randomized height
field is generated with heights up to 0.05m
The success evaluator indicates success if the goal distance is within a threshold and the D'Kitty is sufficiently upright at the last step of the episode:
φse(π) = Eτ∼π [1(d(τ)T,goal < 0.5) ∗ 1(u(τ)T,kitty > cos(25°))]
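For the Walk task specifically, the goal distance and heading alignment above reduce to a few lines of numpy. The sketch below is illustrative only; it assumes Rŷ,t,kitty denotes the torso frame's y-axis and that the torso rotation matrix and positions are available in world coordinates.

import numpy as np

def walk_goal_terms(torso_rotation, p_kitty, p_goal):
    """Compute d_goal = ||p_goal - p_kitty||_2 and h_goal = R_yhat · (p_goal - p_kitty) / d_goal."""
    delta = p_goal - p_kitty
    d_goal = float(np.linalg.norm(delta))
    torso_y_axis = torso_rotation[:, 1]   # ŷ axis of the torso frame in world coordinates
    h_goal = float(np.dot(torso_y_axis, delta / d_goal))
    return d_goal, h_goal

# A D'Kitty facing its +y axis straight at a goal 2 m away is perfectly aligned:
print(walk_goal_terms(np.eye(3), np.zeros(3), np.array([0.0, 2.0, 0.0])))  # -> (2.0, 1.0)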
The hardware safety metric φhs(π) counts the following per-step violations (section 4):
(i) Position violations: This score indicates that the joint positions are near their operating bounds. For the N joints of the robot, this is defined as:
sposition = Σ_{i=1}^{N} [1(|θi − βi,lower| < ε) + 1(|θi − βi,upper| < ε)]
where βi,lower and βi,upper are the respective lower and upper joint position bounds for the i-th joint, and ε is the threshold within which the joint position is considered to be near the bound.
(ii) Velocity violations: This score indicates that the joint velocities exceed a safety limit. For the N joints of the robot, this is defined as:
svelocity = Σ_{i=1}^{N} 1(|θ̇i| > αi)
where αi is the speed limit for the i-th joint.
(iii) Current violations: This score indicates that the joints are exerting forces that exceed a safety limit. For the N joints of the robot, this is defined as:
scurrent = Σ_{i=1}^{N} 1(|ki| > γi)
where γi is the current limit for the i-th joint.
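A compact numpy sketch of these three counters follows (an illustration under the definitions above, not ROBEL library code); joint positions, velocities, currents, and the corresponding limits are passed in as arrays, and the specific limit values in the example are placeholders.

import numpy as np

def safety_violations(theta, theta_dot, current,
                      pos_lower, pos_upper, eps,
                      speed_limit, current_limit):
    """Per-step counts of position, velocity, and current safety violations."""
    # s_position: number of joints near either position bound (within eps).
    s_position = int(np.sum(np.abs(theta - pos_lower) < eps) +
                     np.sum(np.abs(theta - pos_upper) < eps))
    # s_velocity: number of joints exceeding their speed limit.
    s_velocity = int(np.sum(np.abs(theta_dot) > speed_limit))
    # s_current: number of joints exceeding their current limit.
    s_current = int(np.sum(np.abs(current) > current_limit))
    return s_position, s_velocity, s_current

# Example with 3 joints and placeholder limits:
print(safety_violations(theta=np.array([0.1, 1.45, -1.49]),
                        theta_dot=np.array([0.0, 3.2, 0.1]),
                        current=np.array([0.2, 0.1, 2.5]),
                        pos_lower=np.array([-1.5, -1.5, -1.5]),
                        pos_upper=np.array([1.5, 1.5, 1.5]),
                        eps=0.1,
                        speed_limit=np.array([3.0, 3.0, 3.0]),
                        current_limit=np.array([2.0, 2.0, 2.0])))  # -> (2, 1, 1)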
B Locomotion benchmark performance on D’Kitty
Figure 11: Success percentage (3 seeds) for all D’Kitty tasks trained on a simulated D’Kitty robot
using Soft Actor Critic (SAC), Natural Policy Gradient (NPG), Demo-Augmented Policy Gradi-
ent (DAPG), and Behavior Cloning (BC) over 20 trajectories. Each timestep corresponds to 0.1
simulated seconds.
C ROBEL reproducibility
Figure 12: SAC training performance of D’Claw tasks on two real D’Claw robots each at different
laboratory locations. Score denotes the closeness to the goal. Each timestep corresponds to 0.1
simulated seconds. Each task is trained over two different task objects: a 3-prong valve and a 4-
prong valve.