Solving Rubik’s Cube With A Robot Hand
We’ve trained a pair of neural networks to solve the Rubik’s Cube with a human-like
robot hand. The neural networks are trained entirely in simulation, using the same
reinforcement learning code as OpenAI Five paired with a new technique called
Automatic Domain Randomization (ADR). The system can handle situations it
never saw during training, such as being prodded by a stuffed giraffe. This shows
that reinforcement learning isn’t just a tool for virtual tasks, but can solve physical-
world problems requiring unprecedented dexterity.
Human hands let us solve a wide variety of tasks. For the past 60 years of robotics, hard
tasks that humans accomplish with their hands have required designing a custom robot
for each task. As an alternative, people have spent many decades trying to use
general-purpose robotic hardware, but with limited success due to its high number of
degrees of freedom. The hardware we use here is not new: the robot hand we use has
been around for the last 15 years. What is new is the software approach.
Since May 2017, we’ve been trying to train a human-like robotic hand to solve the
Rubik’s Cube. We set this goal because we believe that successfully training such a
robotic hand to do complex manipulation tasks lays the foundation for general-purpose
robots. We solved the Rubik’s Cube in simulation in July 2017. But as of July 2018, we
could only manipulate a block on the robot. Now, we’ve reached our initial goal.
A full solve of the Rubik’s Cube. This video plays in real time and was not edited in any way.
Solving a Rubik’s Cube one-handed is a challenging task even for humans, and it takes
children several years to gain the dexterity required to master it. Our robot still hasn’t
perfected its technique though, as it solves the Rubik’s Cube 60% of the time (and only
20% of the time for a maximally difficult scramble).
Our approach
We train neural networks to solve the Rubik’s Cube in simulation using reinforcement
learning and Kociemba’s algorithm for picking the solution steps. [1] Domain
randomization enables networks trained solely in simulation to transfer to a real robot.
Domain randomization exposes the neural network to many different variants of the same problem, in this
case solving a Rubik’s Cube.
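To make the split of responsibilities concrete: the solver produces a symbolic move sequence, and the policy’s job is purely physical execution. A minimal sketch of that solver layer, assuming the open-source `kociemba` Python package (named here only as an illustration; the scramble string is an example state in the standard URFDLB facelet encoding):

```python
# A minimal sketch of the solver layer, using the open-source
# `kociemba` package (pip install kociemba).
import kociemba

# Example scrambled state in the standard URFDLB facelet encoding.
scramble = "DRLUUBFBRBLURRLRUBLRDDFDLFUFUFFDBRDUBRUFLLFDDBFLUBLRBD"

# Kociemba's two-phase algorithm returns a space-separated sequence
# of face rotations such as "D2 R' D' F2 B D R2 D2 R'".
solution = kociemba.solve(scramble)

# Each move becomes a subgoal for the policy: one 90- or 180-degree
# rotation of a single face, plus the cube flips needed to bring that
# face under the fingertips.
for move in solution.split():
    print(move)
```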
The biggest challenge we faced was to create environments in simulation diverse enough
to capture the physics of the real world. Factors like friction, elasticity, and dynamics are
incredibly difficult to measure and model for objects as complex as Rubik’s Cubes or
robotic hands, and we found that domain randomization alone is not enough.
ADR starts with a single, nonrandomized environment, in which a neural network learns
to solve the Rubik’s Cube. As the neural network gets better at the task and reaches a
performance threshold, the amount of domain randomization is increased automatically.
This makes the task harder, since the neural network must now learn to generalize to
more randomized environments. The network keeps learning until it again exceeds the
performance threshold, at which point more randomization kicks in, and the process
repeats. [2]
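In pseudocode form, ADR amounts to a loop that trains on environments sampled from per-parameter ranges and widens a range whenever the policy performs well at its boundary. The following is a minimal sketch under that reading; `train_step` and `evaluate_policy` are hypothetical stubs, and the threshold, step size, and parameter values are illustrative, not the ones used in training:

```python
# A minimal sketch of the ADR loop; stubs and constants are illustrative.
import random

def train_step(env_params):
    """Placeholder for one reinforcement learning update in the sampled environment."""

def evaluate_policy(env_params):
    """Placeholder for measuring the policy's success rate; stubbed with noise."""
    return random.random()

THRESHOLD = 0.8  # success rate required before a range is widened
STEP = 0.02      # amount by which a boundary is pushed outward

# Each physics parameter starts as a single calibrated value,
# i.e. a zero-width randomization range.
ranges = {"cube_size_cm": [5.7, 5.7], "finger_friction": [1.0, 1.0]}

for _ in range(10_000):
    # Sample one environment variant from the current ranges and train on it.
    params = {k: random.uniform(lo, hi) for k, (lo, hi) in ranges.items()}
    train_step(params)

    # Pin one parameter to a boundary of its range and evaluate there;
    # if the policy already clears the threshold, push that boundary
    # outward, making the distribution (and the task) harder.
    name = random.choice(list(ranges))
    lo, hi = ranges[name]
    if evaluate_policy({**params, name: hi}) >= THRESHOLD:
        ranges[name][1] = hi + STEP
    if evaluate_policy({**params, name: lo}) >= THRESHOLD:
        ranges[name][0] = lo - STEP
```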
One of the parameters we randomize is the size of the Rubik’s Cube. ADR begins with a
fixed cube size and gradually increases the randomization range as training progresses.
We apply the same technique to all other parameters, such as the mass of the cube, the
friction of the robot fingers, and the visual surface materials of the hand. The neural
network thus has to learn to solve the Rubik’s Cube under all of these increasingly
difficult conditions.
Number of successes on the real robot for ADR versus manual domain randomization, plotted
against ADR entropy.
We compared ADR to manual domain randomization on the block flipping task, where
we already had a strong baseline. In the beginning, ADR performs worse in terms of the
number of successes on the real robot. But as ADR increases the entropy, which is a
measure of the complexity of the environment, the transfer performance eventually
doubles over the baseline, without any human tuning.
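Entropy here quantifies how spread out the randomization distribution is. One consistent way to compute it, assuming each parameter is drawn uniformly from its range, is the mean log range width in nats per dimension, as in this sketch (the ranges shown are illustrative):

```python
# A sketch of one natural entropy measure for the randomization
# distribution: if each parameter is drawn uniformly from [lo, hi],
# the differential entropy per dimension is the mean log range width,
# in nats. Wider ranges mean a more complex distribution of environments.
import math

def adr_entropy(ranges):
    widths = [hi - lo for lo, hi in ranges.values()]
    return sum(math.log(w) for w in widths) / len(widths)

# Illustrative ranges after some training; zero-width ranges would have
# entropy of negative infinity, so this applies once randomization begins.
ranges = {"cube_size_cm": (5.4, 6.0), "finger_friction": (0.7, 1.3)}
print(f"ADR entropy: {adr_entropy(ranges):.3f} nats per dimension")
```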
Analysis
Perturbations that we apply to the real robot hand while it solves the Rubik’s Cube. All videos play in
real time.
To test the limits of our method, we experiment with a variety of perturbations while the
hand is solving the Rubik’s Cube. This tests not only the robustness of our control
network but also that of our vision network, which we use here to estimate the cube’s
position and orientation.
We find that our system trained with ADR is surprisingly robust to perturbations even
though we never trained with them: The robot can successfully perform most flips and
face rotations under all tested perturbations, though not at peak performance.
Emergent meta-learning
We believe that meta-learning, or learning to learn, is an important prerequisite for
building general-purpose systems, since it enables them to quickly adapt to changing
conditions in their environments. The hypothesis behind ADR is that a memory-
augmented network combined with a sufficiently randomized environment leads to
emergent meta-learning, where the network implements a learning algorithm that allows
it to rapidly adapt its behavior to the environment it is deployed in. [3]
To test this systematically, we measure the time to success per cube flip (rotating the
cube such that a different color faces up) for our neural network under different
perturbations, such as resetting the network’s memory, resetting the dynamics, or
breaking a joint. We perform these experiments in simulation, which allows us to
average performance over 10,000 trials in a controlled setting.
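A sketch of this measurement protocol might look as follows; `FlipEnv` and `policy` are hypothetical stand-ins for the simulated hand environment and the trained memory-augmented policy, and the timing stub is random noise:

```python
# A sketch of the measurement protocol; environment and policy are
# hypothetical stand-ins, not the actual training code.
import random
import statistics

class FlipEnv:
    """Placeholder for the simulated hand-and-cube environment."""
    def perturb(self, kind):
        pass  # e.g. "reset_memory", "reset_dynamics", "break_joint"
    def time_one_flip(self, policy):
        return random.uniform(2.0, 6.0)  # stub: seconds to complete a flip

def policy(observation=None):
    """Placeholder for the trained memory-augmented policy."""

N_TRIALS = 10_000
N_FLIPS = 20
PERTURB_AT = 10  # inject the perturbation at this flip index

def run_trial(env):
    times = []
    for flip_idx in range(N_FLIPS):
        if flip_idx == PERTURB_AT:
            env.perturb("reset_memory")
        times.append(env.time_one_flip(policy))
    return times

all_times = [run_trial(FlipEnv()) for _ in range(N_TRIALS)]
# Average the time to success at each flip index across trials; a spike
# at PERTURB_AT followed by a decline back toward baseline indicates
# that the policy re-adapts after the perturbation.
mean_per_flip = [statistics.mean(t[i] for t in all_times)
                 for i in range(N_FLIPS)]
```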
Time to success (in seconds) per cube flip for the perturbed policy versus the baseline, by flip number.
Vertical gray lines mark when perturbations are applied.
In the beginning, as the neural network successfully achieves more flips, each successive
time to success decreases because the network learns to adapt. When perturbations are
applied (vertical gray lines in the above chart), we see a spike in time to success. This is
because the strategy the network is employing doesn’t work in the changed environment.
The network then adapts to the new environment, and we again see the time to success
decrease to the previous baseline.
We also measure failure probability and perform the same experiments for face
rotations (rotating the top face 90 degrees clockwise or counterclockwise), and we find
the same pattern of adaptation. [4]
The memory of our neural network is visualized above. We use a building block from the
interpretability toolbox, namely non-negative matrix factorization, to condense this
high-dimensional vector into 6 groups and assign each a unique color. We then display
the color of the currently dominant group for every timestep.
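A hedged sketch of that visualization step, using scikit-learn’s NMF on random stand-in data in place of the policy’s recorded hidden states:

```python
# A sketch of the memory visualization, using scikit-learn's NMF on
# random stand-in data (NMF requires non-negative input, hence abs()).
import numpy as np
from sklearn.decomposition import NMF

T, H = 500, 1024                         # timesteps x hidden units
states = np.abs(np.random.randn(T, H))   # stand-in for recorded memory states

# Factor the T x H state matrix into 6 non-negative components.
nmf = NMF(n_components=6, init="nndsvd", max_iter=500)
weights = nmf.fit_transform(states)      # T x 6 per-timestep activations

# At each timestep, display the color of the currently dominant group.
dominant_group = weights.argmax(axis=1)  # e.g. array([3, 3, 3, 1, 1, ...])
```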
We find that each memory group has a semantically meaningful behavior associated
with it. For example, just by looking at the dominant group in the network’s memory, we
can tell whether it is about to spin the cube or rotate the top clockwise, before it happens.
Challenges
Solving the Rubik’s Cube with a robot hand is still not easy. Our method currently solves
the Rubik’s Cube 20% of the time when applying a maximally difficult scramble that
requires 26 face rotations. For simpler scrambles that require 15 rotations to undo, the
success rate is 60%. When the Rubik’s Cube is dropped or a timeout is reached, we
consider the attempt failed. However, our network is capable of solving the Rubik’s Cube
from any initial condition. So if the cube is dropped, it is possible to put it back into the
hand and continue solving.
We generally find that our neural network is much more likely to fail during the first few
face rotations and flips. This is the case because the neural network needs to balance
solving the Rubik’s Cube with adapting to the physical world during those early rotations
and flips.
Rubik’s Cube prototypes, from left to right: Locked cube, Face cube, Full cube, Giiker cube, regular
Rubik’s Cube. [5]
The prototypes were compared on support for tracking position and orientation and for sensing
internal degrees of freedom.
Next steps
We believe that human-level dexterity is on the path towards building general-purpose
robots and we are excited to push forward in this direction.
If you want to help make increasingly general AI systems, whether robotic or virtual,
we’re hiring!
Footnotes
1. We focus on the problems that are currently difficult for machines to master: perception and dexterous
manipulation. We therefore train our neural networks to achieve the required face rotations and cube flips
as generated by Kociemba’s algorithm.
2. Our work is strongly related to POET, which automatically generates 2D environments. However, our work
learns a joint policy over all environments, which transfers to any newly generated environment.
3. More concretely, we hypothesize that training a neural network with finite capacity on environments of
unbounded complexity forces it to learn a special-purpose learning algorithm, since it cannot memorize
solutions for each individual environment and no single robust policy works under all randomizations.
4. Please refer to our paper for full results.
5. The only modification we made was cutting out a small piece of each center cubelet’s colorful sticker. This
was necessary to break rotational symmetry.
Authors
OpenAI, Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur
Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, Jonas Schneider, Nikolas Tezak, Jerry
Tworek, Peter Welinder, Lilian Weng, Qiming Yuan, Wojciech Zaremba & Lei Zhang
Acknowledgments
Thanks to the following for feedback on drafts of this post and paper: Josh Achiam, Greg Brockman, Nick
Cammarata, Jack Clark, Jeff Clune, Ruben D’Sa, Harri Edwards, David Farhi, Ken Goldberg, Leslie P.
Kaelbling, Hyeonwoo Noh, Lerrel Pinto, John Schulman, Ilya Sutskever & Tao Xu.
Editor
Ashley Pilipiszyn
Design
Justin Jay Wang & Ben Barry
Photography
Eric Haines
Filed Under
Research, Milestones