
π0: A Vision-Language-Action Flow Model for

General Robot Control


Physical Intelligence
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai,
Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine,
Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong,
Anna Walling, Haohuan Wang, Ury Zhilinsky
https://physicalintelligence.company/blog/pi0

Fig. 1: Our generalist robot policy uses a pre-trained vision-language model (VLM) backbone, as well as a diverse cross-
embodiment dataset with a variety of dexterous manipulation tasks. The model is adapted to robot control by adding a separate
action expert that produces continuous actions via flow matching, enabling precise and fluent manipulation skills. The model
can then be prompted for zero-shot control or fine-tuned on high-quality data to enable complex multi-stage tasks, such as
folding multiple articles of laundry or assembling a box.

Abstract—Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people and from a high-level VLM policy, and its ability to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.

Physical Intelligence, San Francisco, California, USA. Correspondence to: [email protected]
Fig. 2: π0 controls a mobile manipulator to fold laundry. Our model is pre-trained on diverse data from 7 distinct robot
configurations and 68 tasks, and can then be used in zero-shot or fine-tuned to complex downstream tasks, as in the case of
this laundry folding policy, which fetches laundry from a dryer, packs it into a hamper, brings the hamper to a folding table,
and then folds each article of clothing.

I. INTRODUCTION

A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyze a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects.
Robert A. Heinlein, Time Enough for Love

Artificial intelligence systems come in all shapes and sizes, from highly specialized systems that solve complex problems inaccessible to the human mind, such as predicting the conformation of a protein [21], to systems that can produce lifelike high-resolution images or videos based on textual prompts [40]. However, the axis along which human intelligence most outpaces machine intelligence is versatility: the ability to solve diverse tasks situated in varied physical environments, while responding intelligently to environmental constraints, language commands, and unexpected perturbations. Perhaps the most tangible progress toward this kind of versatility in AI can be seen in large language- and vision-language models [1, 48]: systems that are pre-trained on large and very diverse corpora of images and text from the web, and then fine-tuned (“aligned”) using more carefully curated datasets meant to induce the desired pattern of behavior and responsiveness. While such models have been shown to exhibit broad instruction-following and problem-solving abilities [53, 27], they are not truly situated in a physical world the way that people are, and their understanding of physical interaction is based entirely on abstract descriptions. If such methods are to make tangible progress toward AI systems that exhibit the kind of physically situated versatility that people possess, we will need to train them on physically situated data — that is, data from embodied robot agents.

Flexible and general-purpose models that can be tasked to perform a variety of robot behaviors have tremendous practical ramifications, but they may also offer solutions to some of the toughest challenges facing robot learning today, such as availability of data, generalization, and robustness. In natural language [1] and computer vision [39], general-purpose foundation models that are pre-trained on diverse multi-task data tend to outperform narrowly tailored and specialized solutions. For example, if the goal is to recognize birds in photographs, it is likely more expedient to pre-train on many different image-language associations and then fine-tune or prompt for the bird recognition task, than it is to train on only bird recognition data. Similarly, we may find that for effective specialized robot systems, it is more effective to first pre-train on highly diverse robot data, and then fine-tune or prompt for the desired task. This can resolve the data scarcity challenge, because many more sources of data are available to a generalist model — including data from other tasks, other robots, or even non-robot sources — and it may resolve robustness and generalization challenges, because the diverse data exhibits a greater coverage of observations and actions, providing a variety of scenes, corrections, and recovery behaviors that might not be present in more narrow specialized data. Thus, adopting a large-scale pre-training approach to robot learning has the potential to address many of the field’s challenges and make practical learning-enabled robots a reality, while at the same time furthering our understanding of the deepest problems in artificial intelligence.

However, developing such generalist robot policies — i.e., robot foundation models — involves a number of major challenges. First, any such research must be done at a very large scale, because the full benefits of large-scale pre-training are often not present at smaller scales [54]. Second, it requires developing the right model architectures that can effectively make use of diverse data sources, while at the same time being able to represent the intricate and subtle behaviors necessary to interact with complex physical scenes. Third, it requires the right training recipe. This is perhaps the most important ingredient, as much of the recent progress with large models in NLP and computer vision has relied heavily on delicate strategies for curating pre-training and post-training data [35].

In this paper, we present a prototype model and learning framework, which we call π0, that illustrates how each of these three bottlenecks could be tackled. We illustrate our model and system in Figure 1. To incorporate diverse data sources, we begin by utilizing a pre-trained vision-language model (VLM) to import Internet-scale experience. By basing our model on a VLM, we inherit the general knowledge, semantic reasoning, and problem-solving abilities of language- and vision-language models. We then further train our model to incorporate robot actions, turning it into a vision-language-action (VLA) model [7].
In order to make it feasible to utilize a variety of diverse robot data sources, we employ cross-embodiment training [10], where data from many robot types is combined into the same model. These different robot types have different configuration spaces and action representations, including single and dual-arm systems, as well as mobile manipulators. Additionally, in order to make it possible to perform highly dexterous and intricate physical tasks, we use an action chunking architecture [57] with flow matching (a variant of diffusion) to represent complex continuous action distributions [28, 32]. This enables our model to control robots at frequencies of up to 50 Hz for dexterous tasks such as laundry folding (see Figure 1). To combine flow matching with VLMs, we use a novel action expert that augments the standard VLM with flow-based outputs.

As with language models, the architecture of our model is only part of our method. In order to flexibly and robustly perform complex tasks, we need the right training recipe. Our recipe mirrors the pre-training/post-training separation commonly seen in exascale language- and image-language models [1, 48], where the model is first pre-trained on a very large and diverse corpus, and then fine-tuned on more narrow and more carefully curated data to induce the desired pattern of behavior — in our case, dexterity, efficiency, and robustness. Intuitively, training only on high-quality data does not teach the model how to recover from mistakes, since mistakes are rarely seen in such data. Training on only lower-quality pre-training data does not teach the model to act efficiently and robustly. Combining both provides the desired behavior: the model attempts insofar as possible to act in a manner similar to the high-quality data, but still has a repertoire of recoveries and corrections that it can deploy in the case of a mistake.

The contributions of our work consist of a novel generalist robot policy architecture based on VLM pre-training and flow matching, and an empirical investigation of pre-training/post-training recipes for such robot foundation models. We evaluate our model for zero-shot control with language commands, with fine-tuning to downstream tasks, and in combination with a high-level semantic policy that outputs intermediate language commands to perform complex and temporally extended tasks. While our model and system make use of a variety of ideas presented in recent work, the combination of ingredients is novel, and the empirical evaluation demonstrates a level of dexterity and generality that goes significantly beyond previously demonstrated robot foundation models. We evaluate our approach by pre-training on over 10,000 hours of robot data, and fine-tuning to a variety of dexterous tasks, including laundry folding (see Figure 2), clearing a table, putting dishes in a microwave, stacking eggs into a carton, assembling a box, and bagging groceries.

II. RELATED WORK

Our work builds on recently proposed methods in large-scale robot learning, as well as multimodal language models. Our work is most closely related to recently proposed vision-language action (VLA) models, which use pre-trained VLMs that are fine-tuned for robot control [7, 24, 55]. Such models employ autoregressive discretization to represent actions in a manner analogous to text tokens. In contrast, our model employs a novel design that fine-tunes a VLM to produce actions via flow matching [32, 28], a variant of diffusion [20, 46]. This allows us to handle high-frequency action chunks [57] (up to 50 Hz) and highly dexterous tasks, which we show pose a major challenge for prior autoregressive VLAs [7]. This resembles a number of recent works on diffusion models for action generation [9, 60]. In contrast to these works, our model uses a pre-trained VLM backbone [5]. Our contribution is also fundamentally integrative, focusing on a framework for robot foundation models, including not only the model architecture itself but also a pre-training recipe, pre-training and post-training phases, and a range of real-world experiments.

Outside of robot control, many models have been proposed that combine pre-trained language models with diffusion [40, 41, 14], including models that specifically hybridize diffusion and autoregressive large language models [19, 29, 59]. Such models are typically concerned with image generation, but our action generation model builds on a number of previously proposed concepts. Like Zhou et al. [59], we train our model via a diffusion-style (flow matching) loss applied on individual sequence elements, in lieu of the standard cross-entropy loss for decoder-only transformers. Like Liu et al. [29], we use a separate set of weights for the tokens corresponding to diffusion. Incorporating these concepts into a VLA model, we introduce what to our knowledge is the first flow matching VLA that produces high-frequency action chunks for dexterous control.

Our work also builds on a rich history of prior works on large-scale robot learning. Early work in this area often utilized self-supervised or autonomous data collection [26, 22, 8], providing a tractable data source for simple tasks such as grasping [18, 37] or pushing [56], but without the complexity of more dexterous behaviors. More recently, a number of high-quality datasets have been collected for robot control that enable broad generalization [23, 10, 52, 33, 34, 43, 13, 6], but typically for simpler tasks that consist of object relocation and rudimentary furniture manipulation (e.g., drawer opening) [31, 15]. More dexterous tasks have been studied at a smaller scale, typically with 10s or 100s of training trajectories [57], equivalent to 10 or less hours. Since one of our aims is to study complex and dexterous behaviors, we utilize a much larger dataset, with about 10,000 hours of demonstrations, complemented by the open-source OXE dataset [10]. To our knowledge, this represents by far the largest robot learning experiment in terms of the amount of robot data. At this scale, we show that a more sophisticated pre-training/post-training recipe is highly effective — analogously to the recipes used for large language models, a pre-training phase endows our model with a broad base of knowledge, which is then refined in a post-training phase with higher-quality curated data to achieve the desired behavior.
Fig. 3: Overview of our framework. We start with a pre-training mixture, which consists of both our own dexterous
manipulation datasets and open-source data. We use this mixture to train our flow matching VLA model, which consists
of a larger VLM backbone and a smaller action expert for processing robot states and actions. The VLM backbone weights
are initialized from PaliGemma [5], providing representations learned from large-scale Internet pre-training. The resulting π0
model can be used to control multiple robot embodiments with differing action spaces to accomplish a wide variety of tasks.

The complexity of the tasks we illustrate goes significantly beyond prior work. While recent work has illustrated a number of more complex and dexterous behaviors, such as tying shoelaces [58] or cooking shrimp [17], we show that our framework can train very long tasks, sometimes tens of minutes in length, for behaviors that combine both physical dexterity and combinatorial complexity. For example, our laundry folding task requires the robot to manipulate a variety of clothing items that can start in any configuration, and fold multiple items in sequence. Our table bussing task requires discerning the class of novel objects (trash or dishes). We show that a single cross-embodiment model can be used as the base model for these tasks. To our knowledge, our work demonstrates the longest dexterous tasks in the end-to-end robot learning literature.

III. OVERVIEW

We provide an outline of our model and training procedure in Figure 3. In our training framework, we first assemble a pre-training mixture consisting of a weighted combination of our own dexterous manipulation datasets (Section V-C), collected on 7 different robot configurations for 68 different tasks, and the entire OXE dataset [10], which contains data from 22 robots. The pre-training phase (Section V-A) also uses diverse language labels, combining task names and segment annotations (fine-grained labels for sub-trajectories, typically about 2 seconds in length). The purpose of the pre-training phase is to train a base model that exhibits broad capabilities and generalization, but is not necessarily specialized for high performance on any one task. This base model can follow language commands and perform a variety of tasks at rudimentary proficiency. For complex and dexterous tasks, we then employ a post-training procedure (Section V-A), which uses high-quality curated data to adapt the model to specific downstream tasks. We study both efficient post-training with small to moderate amounts of data, and high-quality post-training with larger datasets for complex tasks such as laundry folding and mobile manipulation.

Our model, which we describe in Section IV, is based on the PaliGemma vision-language model [5], which we then further train with our data mixture. To turn the base PaliGemma VLM into π0, we add action outputs that use flow matching [32, 28] to generate continuous action distributions. We describe this design in detail in the following section. Note that we use PaliGemma for convenience and because of its comparatively small size (which is useful for real-time control), but our framework is compatible with any base pre-trained VLM.

IV. THE π0 MODEL

The π0 model, illustrated in Figure 3, consists primarily of a language model transformer backbone. Following the standard late fusion VLM recipe [3, 11, 30], image encoders embed the robot’s image observations into the same embedding space as language tokens. We further augment this backbone with robotics-specific inputs and outputs — namely, proprioceptive state and robot actions. π0 uses conditional flow matching [28, 32] to model the continuous distribution of actions. Flow matching provides our model with high precision and multimodal modeling capability, making it especially well suited to high-frequency dexterous tasks. Our architecture is inspired by Transfusion [59], which trains a single transformer using multiple objectives, with tokens¹ corresponding to continuous outputs supervised via a flow matching loss and tokens corresponding to discrete outputs supervised via a cross-entropy loss.

¹In this paper, we use the word “token” to refer to an input/output slot along the sequence dimension, whether the slot corresponds to a discrete variable (e.g., a language token) or a continuous variable (e.g., an image patch or a robot action).

Building on Transfusion, we additionally found that using a separate set of weights for the robotics-specific (action and state) tokens led to an improvement in performance. This design is analogous to a mixture of experts [45, 25, 12, 16] with two mixture elements, where the first element is used for image and text inputs, and
the second is used for robotics-specific inputs and outputs. We refer to the second set of weights as the action expert.

Formally, we want to model the data distribution p(A_t | o_t), where A_t = [a_t, a_{t+1}, ..., a_{t+H−1}] corresponds to an action chunk of future actions (we use H = 50 for our tasks), and o_t is an observation. The observation consists of multiple RGB images, a language command, and the robot’s proprioceptive state, such that o_t = [I_t^1, ..., I_t^n, ℓ_t, q_t], where I_t^i is the i-th image (with 2 or 3 images per robot), ℓ_t is a sequence of language tokens, and q_t is a vector of joint angles. The images I_t^i and state q_t are encoded via corresponding encoders and then projected via a linear projection layer into the same embedding space as the language tokens.
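To make these shapes concrete, the following is a minimal Python sketch of one training tuple (o_t, A_t) as described above; the container classes, array sizes, and example values are our own illustration rather than the authors' implementation.

```python
from dataclasses import dataclass
import numpy as np

H = 50  # action horizon: length of the action chunk A_t used in the paper

@dataclass
class Observation:
    """o_t = [I_t^1..I_t^n, l_t, q_t]: images, language command, proprioception."""
    images: list          # n RGB images (2 or 3 per robot), each an HxWx3 array
    prompt: str           # language command, later tokenized into l_t
    state: np.ndarray     # q_t, vector of joint angles (plus gripper/base dims)

@dataclass
class ActionChunk:
    """A_t = [a_t, ..., a_{t+H-1}]: H future actions predicted jointly."""
    actions: np.ndarray   # shape (H, action_dim)

# Example (o_t, A_t) pair for a hypothetical 14-DoF bimanual robot with 3 cameras.
obs = Observation(
    images=[np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(3)],
    prompt="fold the shirt",
    state=np.zeros(14, dtype=np.float32),
)
target = ActionChunk(actions=np.zeros((H, 14), dtype=np.float32))
```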
For each action a_{t'} in the action chunk A_t, we have a corresponding action token that we feed through the action expert. During training, we supervise these action tokens using a conditional flow matching loss [28, 32],

$L^\tau(\theta) = \mathbb{E}_{p(A_t \mid o_t),\, q(A_t^\tau \mid A_t)} \big[\, \| v_\theta(A_t^\tau, o_t) - u(A_t^\tau \mid A_t) \|^2 \,\big],$

where subscripts denote robot timesteps and superscripts denote flow matching timesteps, with τ ∈ [0, 1]. Recent work in high-resolution image [14] and video [38] synthesis has shown that flow matching can achieve strong empirical performance when combined with a simple linear-Gaussian (or optimal transport) probability path [28], given by q(A_t^τ | A_t) = N(τ A_t, (1 − τ)I). In practice, the network is trained by sampling random noise ε ∼ N(0, I), computing the “noisy actions” A_t^τ = τ A_t + (1 − τ)ε, and then training the network outputs v_θ(A_t^τ, o_t) to match the denoising vector field u(A_t^τ | A_t) = ε − A_t. The action expert uses a full bidirectional attention mask, so that all action tokens attend to each other. During training, we sample the flow matching timestep τ from a beta distribution that emphasizes lower (noisier) timesteps. See Appendix B for more details.
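The training target can be summarized in a short sketch. The NumPy snippet below is a minimal illustration of the loss above with the action expert stubbed out; the specific Beta(1.5, 1) shape is an assumption on our part, since the text only states that the timestep distribution emphasizes noisier timesteps and defers the details to Appendix B.

```python
import numpy as np

rng = np.random.default_rng(0)
H, action_dim = 50, 14

def v_theta(noisy_actions, obs, tau):
    """Stand-in for the action expert's predicted vector field v_theta(A_t^tau, o_t)."""
    return np.zeros_like(noisy_actions)

def flow_matching_loss(actions, obs):
    # Sample a flow-matching timestep; flipping a Beta(1.5, 1) draw puts more mass
    # near tau = 0, i.e. the noisier end of the path (exact shape is an assumption).
    tau = 1.0 - rng.beta(1.5, 1.0)
    eps = rng.standard_normal(actions.shape)        # epsilon ~ N(0, I)
    noisy = tau * actions + (1.0 - tau) * eps       # A_t^tau = tau*A_t + (1-tau)*eps
    target = eps - actions                          # u(A_t^tau | A_t) = eps - A_t
    pred = v_theta(noisy, obs, tau)
    return np.mean((pred - target) ** 2)            # ||v_theta - u||^2

loss = flow_matching_loss(np.zeros((H, action_dim)), obs=None)
```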
At inference time, we generate actions by integrating the learned vector field from τ = 0 to τ = 1, starting with random noise A_t^0 ∼ N(0, I). We use the forward Euler integration rule:

$A_t^{\tau + \delta} = A_t^\tau + \delta\, v_\theta(A_t^\tau, o_t),$

where δ is the integration step size. We use 10 integration steps (corresponding to δ = 0.1) in our experiments. Note that inference can be implemented efficiently by caching the attention keys and values for the prefix o_t and only recomputing the suffix corresponding to the action tokens for each integration step. We provide more details regarding the inference procedure, including the inference time for each part of the model, in Appendix D.
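A minimal sketch of this sampling loop is shown below, again with the network stubbed out; the prefix caching is indicated only by a comment, since the paper defers the actual implementation to Appendix D.

```python
import numpy as np

rng = np.random.default_rng(0)
H, action_dim, n_steps = 50, 14, 10
delta = 1.0 / n_steps   # delta = 0.1 for 10 integration steps

def v_theta(noisy_actions, obs_cache):
    """Stand-in for the action expert evaluated against the (cached) o_t prefix."""
    return -noisy_actions  # placeholder so the loop runs end to end

def sample_action_chunk(obs_cache):
    A = rng.standard_normal((H, action_dim))   # A_t^0 ~ N(0, I)
    for _ in range(n_steps):
        # Conceptually, only the action-token suffix is recomputed each step;
        # the attention keys/values for the observation prefix are reused.
        A = A + delta * v_theta(A, obs_cache)  # forward Euler: A^{tau + delta}
    return A                                   # actions at tau = 1

chunk = sample_action_chunk(obs_cache=None)    # shape (H, action_dim)
```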
While in principle our model can be initialized from scratch or fine-tuned from any VLM backbone, in practice we use PaliGemma [5] as our base model. PaliGemma is an open-source 3 billion parameter VLM that offers a convenient trade-off between size and performance. We add 300M parameters for the action expert (which is initialized from scratch) for a total of 3.3 billion parameters. We provide a full description of the model architecture in Appendix B.

Non-VLM baseline model. In addition to our main VLA model, we also trained a similar baseline model that did not use a VLM initialization for ablation experiments. This model, which we refer to as π0-small, has 470M parameters, does not use VLM initialization, and has a number of small differences that we found to be helpful for training on our data without VLM initialization, which are summarized in Appendix C. This model is used in our comparisons to evaluate the benefits of incorporating VLM pre-training.

V. DATA COLLECTION AND TRAINING RECIPE

Broadly capable robot foundation models require not only an expressive and powerful architecture, but also the right dataset and, more importantly, the right training recipe. In the same way that LLM training is typically divided into pre-training and post-training phases, we employ a multi-stage training procedure for our model. The goal of the pre-training phase is to expose the model to a diverse range of tasks so that it can acquire broadly applicable and general physical capabilities, while the goal of the post-training phase is to provide the model with the ability to skillfully and fluently execute the desired downstream task. Because of this, the requirements for the pre-training and post-training datasets are distinct: the pre-training dataset should cover as many tasks as possible, and within each of those tasks should cover a diversity of behaviors. The post-training dataset should instead cover behaviors that are conducive to effective task execution, which should exhibit a consistent and fluent strategy. Intuitively, the diverse (but lower quality) pre-training data allows the model to recover from mistakes and handle highly varied situations, which might not otherwise occur in the high-quality post-training data, while the post-training data teaches the model to perform the task well.

A. Pre-training and post-training

Fig. 4: Overview of our dataset: The pre-training mixture consists of a subset of OXE [10] and the π dataset. We use a subset of OXE, which we refer to as OXE Magic Soup [24]. The right figure illustrates the weight of the different datasets in the pre-training mixture. The left figure illustrates their relative sizes as measured by the number of steps.

We provide an overview of our pre-training mixture in Figure 4. Since each training example corresponds to a timestep — i.e., a tuple (o_t, A_t) — we will quantify data in terms of timesteps in this discussion. 9.1% of the training mixture consists of open-source datasets, including OXE [10], Bridge
v2 [52], and DROID [23]. The robots and tasks in these datasets typically have one or two cameras and use low-frequency control, between 2 and 10 Hz. However, these datasets cover a wide range of objects and environments. To learn dexterous and more complex tasks, we also use 903M timesteps of data from our own datasets, where 106M steps are from single-arm robots and 797M are from dual-arm robots. This data has 68 tasks, where each task is composed of complex behaviors — e.g., the “bussing” task involves putting a wide range of different dishes, cups, and utensils into a bussing bin, and a wide array of trash items into the garbage. Note that this definition of task is significantly different from prior work, which typically uses any combination of noun and verb (e.g., “pick up the cup” vs. “pick up the plate”) to constitute a distinct task. Therefore, the actual range of behaviors in our dataset is significantly broader than this number of “tasks” would imply. We discuss the specific robots and tasks in our dataset in more detail in Section V-C.

Since the datasets are somewhat imbalanced in size (e.g., the more difficult laundry folding tasks are overrepresented), we weight each task-robot combination by n^0.43, where n is the number of samples for that combination, such that over-represented combinations are down-weighted. The configuration vector q_t and action vectors a_t always have the dimensionality of the largest robot in the dataset (18 in our case, to accommodate two 6-DoF arms, 2 grippers, a mobile base, and a vertically actuated torso). For robots with lower-dimensional configuration and action spaces, we zero-pad the configuration and action vectors. For robots with fewer than three images, we also mask out the missing image slots.

In the post-training phase, we fine-tune our model with a smaller task-specific dataset to specialize it to particular downstream applications. As mentioned previously, our definition of “task” is fairly broad — e.g., the “bussing” task requires manipulating a wide range of different objects. Different tasks require very different datasets, with the simplest of the tasks necessitating only 5 hours and the most complex tasks using 100 or more hours of data.
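As an illustration of the weighting and padding just described, the helper functions below compute the n^0.43 sampling weight, zero-pad a configuration or action vector to the shared 18-dimensional space, and build an image-slot mask. Only the constants come from the text; the function names and structure are our own sketch.

```python
import numpy as np

MAX_DIM = 18      # largest configuration/action dimensionality in the mixture
MAX_IMAGES = 3    # maximum number of camera images per robot

def mixture_weight(num_samples: int, alpha: float = 0.43) -> float:
    """Down-weight over-represented task-robot combinations by n**alpha."""
    return num_samples ** alpha

def pad_vector(x: np.ndarray, dim: int = MAX_DIM) -> np.ndarray:
    """Zero-pad a configuration or action vector to the shared dimensionality."""
    out = np.zeros(dim, dtype=x.dtype)
    out[: x.shape[0]] = x
    return out

def image_mask(num_images: int, max_images: int = MAX_IMAGES) -> np.ndarray:
    """Boolean mask marking which image slots are real; missing slots are masked."""
    mask = np.zeros(max_images, dtype=bool)
    mask[:num_images] = True
    return mask

# e.g. a 7-dimensional single-arm UR5e sample observed through two cameras:
q = pad_vector(np.zeros(7, dtype=np.float32))   # -> shape (18,)
m = image_mask(2)                               # -> [True, True, False]
w = mixture_weight(1_000_000)                   # relative sampling weight
```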
B. Language and high-level policies

More complex tasks that require semantic reasoning and high-level strategy, such as table bussing, can also benefit from a high-level policy that decomposes high-level tasks (such as “bus the table”) into more immediate subtasks (such as “pick up the napkin” or “throw the napkin into the trash”). Since our model is trained to process language inputs, we can use a high-level VLM to make these semantic inferences, a method that is analogous to LLM/VLM planning methods such as SayCan [2]. We use such a high-level policy to assist our model with high-level strategy for several of our experimental tasks, as we will discuss in Section VI.
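A minimal sketch of such a high-level/low-level loop is given below; the interface (function names, the prompt handling, and the "done" convention) is hypothetical, since the paper does not specify how the high-level VLM and π0 communicate beyond passing language subtasks.

```python
def query_high_level_vlm(image, task: str) -> str:
    """Hypothetical call to a high-level VLM that proposes the next subtask."""
    # e.g. returns "pick up the napkin" or "throw the napkin into the trash"
    return "pick up the napkin"

def run_episode(pi0_policy, robot, task: str, max_subtasks: int = 50):
    """Decompose a high-level task into language subtasks executed by pi0."""
    for _ in range(max_subtasks):
        obs = robot.get_observation()
        subtask = query_high_level_vlm(obs.images[0], task)
        if subtask == "done":
            break
        # The subtask string is fed to pi0 as its language command l_t.
        actions = pi0_policy.sample_action_chunk(obs, prompt=subtask)
        robot.execute(actions)
```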
C. Robot system details

Our dexterous manipulation datasets include 7 different robot configurations and 68 tasks. We summarize these platforms in Figure 5, and discuss them below:

Fig. 5: The robots used in our experiments (panels: UR5e, Bimanual UR5e, Franka, Bimanual Trossen, Bimanual ARX, Mobile Trossen, Mobile Fibocom). These include single and dual-arm manipulators with 6-DoF and 7-DoF arms, as well as holonomic and nonholonomic mobile manipulators. π0 is trained jointly on all of these platforms.

UR5e. An arm with a parallel jaw gripper, with a wrist-mounted and over-the-shoulder camera, for a total of two camera images and a 7-dimensional configuration and action space.
Bimanual UR5e. Two UR5e setups, for a total of three camera images and a 14-dimensional configuration and action space.
Franka. The Franka setup has two cameras and an 8-dimensional configuration and action space.
Bimanual Trossen. This setup has two 6-DoF Trossen ViperX arms in a configuration based on the ALOHA setup [4, 57], with two wrist cameras and a base camera, and a 14-dimensional configuration and action space.
Bimanual ARX & bimanual AgileX. This setup uses two 6-DoF arms, and supports either ARX or AgileX arms, with three cameras (two wrist and one base) and a 14-dimensional configuration and action space. This class encompasses two distinct platforms, but we categorize them together because of their similar kinematic properties.
Mobile Trossen & mobile ARX. This setup is based on the Mobile ALOHA [57] platform, with two 6-DoF arms on a mobile base, which are either ARX arms or Trossen ViperX arms. The nonholonomic base adds two action dimensions, for a 14-dimensional configuration and 16-dimensional action space. There are two wrist cameras and a base camera. This class encompasses two distinct platforms, but we categorize them together because of their similar kinematic properties.
Mobile Fibocom. Two 6-DoF ARX arms on a holonomic base. The base adds three action dimensions (two for translation and one for orientation), for a 14-dimensional configuration and 17-dimensional action space.

We summarize the proportion of our dataset from each robot in Figure 4.

VI. EXPERIMENTAL EVALUATION
Fig. 6: Zero-shot evaluation tasks: To evaluate our base model, we run it after pre-training on five tasks: shirt folding, bussing easy, bussing hard, grocery bagging, and toast out of toaster. The tasks require a combination of dexterous manipulation, multi-stage behaviors, and semantic recognition.

[Figure 7 plot: average task progress (0.0 to 1.0) for π0, π0 (parity), π0-small, OpenVLA, OpenVLA (UR5e only), and Octo on Shirt Folding (Bi-ARX), Bussing Easy (UR5e), Bussing Hard (UR5e), Grocery Bagging (UR5e), and Toast (Bi-Trossen).]

Fig. 7: Zero-shot evaluation results: We evaluate π0 trained for the full 700k steps, a version trained for 160k steps that matches the number of updates for baseline models, π0-small, and three baselines: OpenVLA and Octo trained on all of our data, and OpenVLA trained only on the UR5e tasks (which we found to work better on UR5e tasks). Across all tasks and all comparisons, even the “parity” version of our model outperforms all baselines, and the full version of our model achieves the best results by a large margin.
Our experimental evaluation consists of zero-shot evaluation experiments that compare our base (pre-trained) model to alternative model designs, as well as detailed fine-tuning experiments that evaluate our model on challenging downstream tasks, comparing it to other methods that have been proposed for dexterous manipulation. We study the following research questions:

How well does π0 perform after pre-training on a variety of tasks that are present in the pre-training data? We study this question by directly evaluating π0, with comparisons to other robot foundation models.

How well does π0 follow language commands? These experiments compare π0 to π0-small, a smaller version of our model without VLM initialization, to evaluate its performance on following language commands. We evaluate with both human-provided commands and commands specified by a high-level VLM policy, as discussed in Section V-B.

How does π0 compare to methods that have been proposed specifically for addressing dexterous manipulation tasks? These experiments study downstream tasks for which we can either fine-tune our model from the pre-trained initialization, or train it from scratch on task-specific data, comparing to prior methods that were proposed for dexterous manipulation. We aim to evaluate both the benefits of our architecture and our pre-training procedure.

Can π0 be adapted to complex, multi-stage tasks? In our final set of experiments, we fine-tune π0 to a set of particularly complex tasks, including folding laundry and bussing a table. These tasks take between 5 and 20 minutes to complete. Some require guidance from a high-level policy.

A. Evaluating the base model

In our first set of experiments, we evaluate the model after pre-training on our full mixture, without any post-training, to evaluate how well our base model can perform a variety of tasks. We compare to other robot foundation models in the literature: both VLAs and smaller models that are trained from scratch on the same pre-training mixture. We evaluate on the following tasks, visualized in Figure 6, with each task commanded to the same base model via a language command.

Shirt folding: the robot must fold a t-shirt, which starts flattened.
Bussing easy: the robot must clean a table, putting trash in the trash bin and dishes into the dish bin. The score indicates the number of objects that were placed in the correct receptacle.
Bussing hard: a harder version of the bussing task, with more objects and more challenging configurations, such as utensils intentionally placed on top of trash objects, objects obstructing each other, and some objects that are not in the pre-training dataset.
Grocery bagging: the robot must bag all grocery items, such as potato chips, marshmallows, and cat food.
Toast out of toaster: the robot removes toast from a toaster.

Providing comparisons for these experiments is challenging because very few prior models can operate at this scale. We compare to OpenVLA [24], a 7B parameter VLA model that was originally trained on the OXE dataset [10]. We train OpenVLA on our full mixture. This is a very difficult mixture for OpenVLA, which does not support action chunking or high-frequency control. We also compare to Octo [50], a smaller 93M parameter model. While Octo is not a VLA, it does use a diffusion process to generate actions, providing a valuable point of comparison for our flow matching VLA. We also train Octo on the same mixture as our model. Due to time constraints, we were unable to train OpenVLA and Octo for the same number of epochs as our full model. We therefore also compare to a “compute parity” version of our model, which is trained for only 160k steps (as opposed to 700k steps for our main model), which is equal to or lower than the
number of steps provided to the baselines (160k for OpenVLA, 320k for Octo). We also include a version of the OpenVLA model that we fine-tuned only on the UR5e data, without cross-embodiment training, in the hopes of providing an even stronger baseline on the UR5e tasks. Finally, we include a comparison to the π0-small model described in Section IV, which can be viewed as a scaled-down version of our model without VLM pre-training.

The evaluation metric uses a normalized score averaged over 10 episodes per task and method, where an episode receives a score of 1.0 for a full success, and a fractional score for partial success. For example, the score for bussing is the fraction of objects that are correctly placed in the proper receptacle. We describe the scoring rubrics in Appendix E. The results, shown in Figure 7, show that π0 attains by far the best results across the board on all the zero-shot tasks, with near perfect success rates on shirt folding and the easier bussing tasks, and large improvements over all baselines. The “parity” version of π0, which is trained for only 160k steps, still outperforms all the baselines, and even π0-small outperforms OpenVLA and Octo. OpenVLA struggles on these tasks because its autoregressive discretization architecture does not support action chunks. The UR5e-only OpenVLA model performs better, but is still far below the performance of π0. Octo does support action chunks, but has a comparatively limited representational capacity. This comparison illustrates the importance of combining large, expressive architectures with the ability to model complex distributions via flow matching or diffusion. Additionally, the comparison to π0-small illustrates the importance of incorporating VLM pre-training. Unfortunately, it is hard to make this last comparison fair: π0-small uses fewer parameters, but larger models are difficult to use without pre-training. Overall, these experiments show that π0 provides a powerful pre-trained model with the ability to effectively perform a variety of tasks with a variety of robots, with much better performance than prior models.

Fig. 8: The tasks in our language evaluation. We evaluate our model on 3 different language-conditioned tasks, each of which requires following a sequence of intermediate language commands. The tasks involve bussing a table (top) to put dishes in a bin and garbage in a trash bin, setting a table (middle) by taking items out of a bin, and packing a shopping bag (bottom).

Fig. 9: Language evaluation. We compare “flat” versions of our policies, −flat, which receive only the overall task command (e.g., “bag the groceries”) with a method that receives intermediate commands from a human expert, −human, or a high-level VLM policy, −HL. We also compare our model to a small non-VLM variant under the “expert” condition, π0 and π0-small, in terms of language following accuracy. The results show a significant improvement with π0 from intermediate language commands provided by a human expert and to a lesser degree by an autonomous high-level policy. Notably, due to π0-small’s limited language following ability, overall it does not gain with the addition of a high-level expert.

B. Following language commands

In the next set of experiments, we fine-tune the base π0 model to follow language commands in a set of evaluation domains. We compare this fine-tuned π0 model with the π0-small model described in Section IV, which we found to be the strongest baseline in the previous section. Recall that π0-small does not use a VLM initialization. This experiment therefore aims to measure how much VLM pre-training boosts our model’s ability to follow language instructions. Note that π0-small is also a significantly smaller model — unfortunately, it is difficult to remove this confounder, because VLM initialization serves both to make it practical to train a much larger model without overfitting, and to improve language instruction following. We nonetheless hope that this experiment sheds light on the language capabilities of π0.

The language instructions for each task consist of objects to pick up and locations to place those objects, with language-labeled segments that are about 2 seconds in length. Each full task consists of numerous such segments. The tasks in this evaluation consist of:

Bussing: the robot must clean a table, placing dishes and cutlery in a bin, and trash into a trash bin.
Table setting: the robot must take out items from a bin to set a table, including a place mat, dishes, silverware, napkin, and cups, and adjust them according to language instructions.
Grocery bagging: the robot must pack grocery items, such as bags of coffee beans, barley, marshmallow, seaweed, almonds, spaghetti, and cans into a bag.

In Figure 8, we show the language-conditioned tasks in our evaluation and present the evaluation results. We evaluate five different conditions. π0-flat (and π0-small-flat) corresponds to directly commanding the model with the task description (e.g., “bag the groceries”), without intermediate language commands. π0-human (and π0-small-human) provides intermediate
step commands (e.g., which object to pick and where to place it) from an expert human user. These conditions evaluate each model’s ability to follow more detailed language commands: while these intermediate commands provide considerable information for how to perform the task, the model must be able to understand and follow those commands to benefit from them. Finally, π0-HL evaluates π0 with high-level commands provided by a high-level VLM, as discussed in Section V-B. This condition is also autonomous, without any human expert.

The results in Figure 9, averaging over 10 trials per task, show that the language following accuracy of π0 is significantly better than that of π0-small. This suggests a significant improvement from the larger pre-trained VLM initialization. This capability translates to an improvement in performance with expert human guidance (π0-human) and with high-level model guidance (π0-HL). The results indicate that π0’s language following ability directly translates into better autonomous performance on complex tasks with high-level guidance.

Fig. 10: Fine-tuning evaluation tasks: We fine-tune our model to a variety of downstream tasks that are distinct from the tasks seen in pre-training. Our tasks represent a range of similarity from the pre-training tasks, with tasks that are most similar to pre-training (stack bowls and towel folding), a task that introduces an unseen new element (a microwave), and tasks that require new motions and new object types (Franka items in drawer and paper towel replacement).

C. Learning new dexterous tasks

In the next set of experiments, we evaluate our model on new tasks that differ significantly from the pre-training data, requiring entirely new behaviors. For these evaluations, we fine-tune the model using various amounts of data for each new task. While each task is new, we partition the tasks into “tiers” depending on how much they differ from tasks in the pre-training data. The tasks, shown in Figure 10, are:

UR5e stack bowls. This task requires stacking bowls, with four bowls of different sizes. Since this task requires grasping and moving dishes like the bussing task in the pre-training data, we place it in the “easy” tier. The training data contains a variety of bowls, and the evaluations use a mix of seen and unseen bowls.
Towel folding. This task requires folding a towel. Since this is similar to shirt folding, which is present in pre-training, we place it in the “easy” tier.
Tupperware in microwave. This task requires opening a microwave, putting a plastic container inside it, and closing it. The containers come in different shapes and colors, and the evaluations use a mix of seen and unseen containers. The container manipulation resembles pre-training data, but the microwave is not found in pre-training.
Paper towel replacement. This task requires removing an old cardboard paper towel tube from a holder and replacing it with a fresh paper towel roll. Because no such items are found in pre-training, we consider this “hard.”
Franka items in drawer. This task requires opening a drawer, packing items into a drawer, and closing it. Because there is no similar task with the Franka robot in pre-training, we consider this “hard.”

We compare our model after fine-tuning both to OpenVLA [24] and Octo [50], which also employ a pre-training and fine-tuning recipe. Since our aim is to evaluate the specific models (rather than the architectures), we use the publicly available pre-trained checkpoints for these models, which are trained on OXE [10], and then fine-tune them to each task. We also compare to ACT [57] and Diffusion Policy [9], which are designed specifically for learning dexterous tasks from smaller datasets. ACT and Diffusion Policy are trained only on the fine-tuning datasets, which are of similar size to the individual datasets used in the ACT and Diffusion Policy experiments [9, 57]. We evaluate π0 by fine-tuning from our pre-trained base model, as well as by training from scratch. This comparison is meant to evaluate the individual benefits of the π0 architecture and our pre-training procedure. We hypothesize that the π0 architecture with VLM initialization should already provide a stronger starting point for the individual tasks, while the pre-training procedure should further improve its performance, especially with smaller fine-tuning datasets.

Figure 11 shows the performance across all of the tasks for a variety of methods, averaging over 10 trials per task, with different amounts of fine-tuning data on each task. We include all of the baselines on the stack bowls and Tupperware in microwave tasks. Since OpenVLA and Octo attain significantly worse performance, we only run these for one of the dataset sizes, due to the time cost of evaluating so many models in the real world. The results show that π0 generally outperforms other methods. Interestingly, the strongest prior models are the ones that are trained entirely from scratch on the target tasks, suggesting that leveraging pre-training in these domains presents a major challenge for prior approaches. While the 5-hour policy for π0 on the Tupperware task performs similarly to the baselines, the 1-hour version is significantly better. As expected, pre-training leads to larger improvement for tasks that are more similar to the pre-training data, though the pre-trained model is frequently better than the non-pre-trained model, sometimes by as much as 2x.
[Figure 11 plots: average task progress vs. fine-tuning data (1, 5, and 10 hours) for π0, π0 (scratch), Diffusion Policy (DP), ACT, OpenVLA, and Octo, across six panels: Average Across All Tasks, Paper Towel Replacement (Bi-UR5e), Items in Drawer (Franka), Towel Folding (Bi-ARX), Stack Bowls (UR5e), and Tupperware in Microwave (Bi-ARX).]
Fig. 11: Fine-tuning with varying amounts of data. π0 can learn some easier tasks even with smaller amounts of data, and
the pre-trained model often attains a larger improvement over the model trained from scratch.

D. Mastering complex multi-stage tasks

In our final set of experiments, we tackle a range of challenging multi-stage tasks via a combination of fine-tuning and language. For some of these tasks, data is present in pre-training, but fine-tuning is required to attain mastery. For some, no data is present in pre-training. The tasks in this evaluation, shown in Figure 12, are:

Laundry folding: This task requires a static (non-mobile) bimanual system to fold articles of clothing. The clothing items start in a randomized crumpled state in a bin, and the goal is to take out the item, fold it, and place it on top of a stack of previously folded items. The randomized initial configuration of the crumpled laundry presents a major challenge, since the policy needs to generalize to any configuration. This task is present in pre-training.
Mobile laundry: Here, the Fibocom mobile robot in Figure 5 has to fold laundry, facing many of the same challenges while controlling orientation and translation. This task is present in pre-training.
Dryer unloading: Here, the Fibocom mobile robot has to take laundry out of a dryer and place it into a hamper. This task is present in pre-training.
Table bussing: This task requires bussing a table with a diverse array of novel objects in a cluttered scene, presenting a much greater challenge than the benchmark in our zero-shot evaluation: the policy must generalize to unseen objects of varying shapes and sizes, and perform complex dexterous motions, such as twisting the gripper to pick up large plates and carefully grasping thin, delicate items such as glasses. The robot must handle dense clutter and intelligently sequence various behaviors — for example, to clean off a plate with trash, it must first pick up the plate, then shake its contents into the garbage, and then place the plate in the bin. This task is not present in pre-training.
Box building: The robot has to assemble a cardboard box that starts in a flattened state. This task presents a number of major challenges: the box needs to be bent in the right way, and the robot needs to hold down parts of the box while folding others, utilizing both arms and even the surface of the table to brace during folding motions. The robot might need to retry some folds, requiring a reactive and intelligent strategy. This task is not present in the pre-training data.
To-go box: This task requires moving several food items from a plate into a to-go box, requiring packing the items into the box so that they do not stick out, and then closing the box with both arms. This task is not present in the pre-training data.
Packing eggs: The robot needs to take six eggs out of a bowl and pack them into an egg carton, and then close the carton. The eggs need to be grasped in a manner appropriate to their pose inside the bowl, and then placed into open slots in the carton. This presents challenges due to the egg shape, slipperiness, and the need for careful placement. Closing the box requires the use of both arms. This task is not present in the pre-training data.
Fig. 12: We evaluate a range of complex and temporally extended tasks. This includes: folding laundry from a bin with a stationary (a) or mobile (b) robot, bussing a real lunch table (c), assembling a box (d), packing eggs into a carton (e), and packing food into a to-go box (f). These tasks require combining dozens of individual behaviors, such as grasping, stacking, folding, and flattening, generalization to a huge variety of object configurations, and complex physical properties, such as deformable objects or flexible cardboard.

Fig. 13: Post-training results on complex tasks in terms of average scores over 10 trials. The full pre-trained π0 model attains more than 50% of the maximum score across all of the tasks, and typically outperforms the ablations, with especially significant improvements on the hardest tasks.

The results, showing average scores per task over 10 trials, are presented in Figure 13. The scoring rubrics are in Appendix E. A score of 1.0 represents a perfect execution, while partial scores correspond to partially completed tasks (e.g., 0.5 indicates that half the objects were bussed correctly). These tasks are very difficult, and we were not able to solve them with other methods. We therefore use these tasks to compare to ablations of our approach, evaluating π0 after pre-training and fine-tuning, zero-shot evaluation after pre-training only (“zero-shot”), and training on the fine-tuning data without any pre-training (“scratch”). The results show that π0 can solve many of these tasks, with our full pre-training and fine-tuning recipe performing best across the board. Note that many of these more difficult tasks show a very large improvement from using the pre-trained model, indicating that pre-training is especially useful with harder tasks. The absolute performance of π0 varies across the tasks, likely due to differences in task difficulty and the degree to which the tasks are represented in pre-training. We recommend that readers watch the task videos on the accompanying website for a more complete impression of these tasks and their complexity. We believe that this level of autonomous performance on such challenging tasks represents a new state of the art in dexterous robot manipulation with learned policies.

VII. DISCUSSION, LIMITATIONS, AND FUTURE WORK

We presented a framework for training a robot foundation model, which we refer to as π0, that consists of pre-training on highly diverse data, followed by either zero-shot evaluation or fine-tuning to complex downstream tasks. Our empirical evaluation studies tasks that combine dexterity, generalization, and temporally extended multi-stage behaviors. Our model incorporates Internet-scale vision-language model (VLM) pre-training with flow matching for representing complex high-frequency action chunks. Our pre-training mixture consists of 10,000 hours of dexterous manipulation data from 7 different robot configurations and 68 tasks, in addition to large amounts of previously collected robot manipulation data from OXE [10], DROID [23], and Bridge [52]. To our knowledge, this represents the largest pre-training mixture ever used for a robot manipulation model. Our fine-tuning experiments include over 20 tasks, where we show that our model outperforms a variety of baselines, including prior VLA models [24] and models designed specifically for dexterous manipulation [57, 9]. We also examine how our post-training recipe can enable highly complex tasks, such as folding multiple articles of clothing from arbitrary initial configurations or assembling boxes.

Our framework broadly resembles the training procedures employed for large language models, which typically consist of pre-training a base model on very large datasets scraped from the web, followed by a post-training procedure that aims to “align” the model to enable it to follow instructions and perform user commands. It is generally recognized that most of the “knowledge” in such models is acquired in the pre-training phase, while the post-training phase serves to tell the model how it should leverage that knowledge to fulfill user commands. Our experiments imply that an analogous phenomenon might take place with robot foundation models,
where pre-trained models have some zero-shot capabilities, [4] Jorge Aldaco, Travis Armstrong, Robert Baruch,
but complex tasks like laundry following require fine-tuning Jeff Bingham, Sanky Chan, Kenneth Draper, De-
with high-quality data. Training on only this high-quality data bidatta Dwibedi, Chelsea Finn, Pete Florence, Spencer
results in a brittle model that does not reliably recover from Goodrich, et al. Aloha 2: An enhanced low-cost
mistakes, while running the pre-trained model in zero shot hardware for bimanual teleoperation. arXiv preprint
does not always exhibit the fluent strategies demonstrated in arXiv:2405.02292, 2024.
the post-training data. [5] Lucas Beyer, Andreas Steiner, André Susano Pinto,
We hope that our results will serve as a stepping stone to- Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim
ward general and broadly applicable robot foundation models. Neumann, Ibrahim Alabdulmohsin, Michael Tschannen,
Our experiments suggest that such models may soon be a Emanuele Bugliarello, et al. Paligemma: A versatile 3b
reality, but there are a number of limitations and ample room vlm for transfer. arXiv preprint arXiv:2407.07726, 2024.
for future work. First, our experiments do not yet provide a [6] Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Ab-
comprehensive understanding of how the pre-training datasets hinav Gupta, Shubham Tulsiani, and Vikash Kumar.
should be composed: we combined all data available to us, but RoboAgent: Generalization and efficiency in robot ma-
understanding what type of data is more helpful to add and nipulation via semantic augmentations and action chunk-
how it should be weighted remains an open problem. Not all ing. In 2024 IEEE International Conference on Robotics
tasks in our evaluation work reliably, and it remains unclear and Automation (ICRA), pages 4788–4795. IEEE, 2024.
how to predict how much and what kind of data is needed [7] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen
to attain near-perfect performance. Finally, it remains to be Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding,
seen how much positive transfer there is in combining highly Danny Driess, Avinava Dubey, Chelsea Finn, Pete Flo-
diverse data, particularly from different tasks and different rence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana
robots: although our results suggest that universal pre-trained Gopalakrishnan, Kehang Han, Karol Hausman, Alexan-
robot foundation models might become a reality, it is left for der Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil
future work to understand whether this universality extends Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang,
to much more distinct domains, such as autonomous driving, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey
navigation, and legged locomotion. Levine, Yao Lu, Henryk Michalewski, Igor Mordatch,
ACKNOWLEDGEMENTS

We thank Laura Smith and Dibya Ghosh for feedback on the paper and assistance with figures and videos, Philip Clark, Kelly Sims, and Saunaz Moradi for feedback on writing, and Evan Pokrandt, Joakim Keussen, Dan Philibin, Eitan Penner, Adam Lisagor, and Greg Miller for help with illustrations, design, and videos. We also thank Lili Yu for helpful technical discussion. We are tremendously grateful to all of the robot operators for tirelessly collecting robot manipulation data. For a full contribution statement, see Appendix A.

REFERENCES
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[2] Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.
[3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
[4] Jorge Aldaco, Travis Armstrong, Robert Baruch, Jeff Bingham, Sanky Chan, Kenneth Draper, Debidatta Dwibedi, Chelsea Finn, Pete Florence, Spencer Goodrich, et al. ALOHA 2: An enhanced low-cost hardware for bimanual teleoperation. arXiv preprint arXiv:2405.02292, 2024.
[5] Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. PaliGemma: A versatile 3B VLM for transfer. arXiv preprint arXiv:2407.07726, 2024.
[6] Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. RoboAgent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4788–4795. IEEE, 2024.
[7] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
[8] Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning. arXiv preprint arXiv:1909.12200, 2019.
[9] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, page 02783649241273668, 2023.
[10] OX-Embodiment Collaboration, A Padalkar, A Pooley, A Jain, A Bewley, A Herzog, A Irpan, A Khazatsky, A Rai, A Singh, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. arXiv preprint arXiv:2310.08864, 1(2), 2023.
[11] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
[12] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. GLaM: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pages 5547–5569. PMLR, 2022.
[13] Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021.
[14] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
[15] Haritheja Etukuru, Norihito Naka, Zijin Hu, Seungjae Lee, Julian Mehu, Aaron Edsinger, Chris Paxton, Soumith Chintala, Lerrel Pinto, and Nur Muhammad Mahi Shafiullah. Robot utility models: General policies for zero-shot deployment in new environments. arXiv preprint arXiv:2409.05865, 2024.
[16] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
[17] Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. In Conference on Robot Learning (CoRL), 2024.
[18] Abhinav Gupta, Adithyavairavan Murali, Dhiraj Prakashchand Gandhi, and Lerrel Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. Advances in Neural Information Processing Systems, 31, 2018.
[19] Wanggui He, Siming Fu, Mushui Liu, Xierui Wang, Wenyi Xiao, Fangxun Shu, Yi Wang, Lei Zhang, Zhelun Yu, Haoyuan Li, et al. MARS: Mixture of auto-regressive models for fine-grained text-to-image synthesis. arXiv preprint arXiv:2407.07614, 2024.
[20] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[21] John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873):583–589, 2021.
[22] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. Scalable deep reinforcement learning for vision-based robotic manipulation. In Conference on Robot Learning, pages 651–673. PMLR, 2018.
[23] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024.
[24] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024.
[25] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
[26] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421–436, 2018.
[27] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d'Autume, Igor Babuschkin, Xinyun Chen, Po-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022.
[28] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
[29] Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Joao Souza, Suhail Doshi, and Daiqing Li. Playground v3: Improving text-to-image alignment with deep-fusion large language models. arXiv preprint arXiv:2409.10695, 2024.
[30] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
[31] Peiqi Liu, Yaswanth Orru, Jay Vakil, Chris Paxton, Nur Muhammad Mahi Shafiullah, and Lerrel Pinto. OK-Robot: What really matters in integrating open-knowledge models for robotics. arXiv preprint arXiv:2401.12202, 2024.
[32] Qiang Liu. Rectified flow: A marginal preserving approach to optimal transport. arXiv preprint arXiv:2209.14577, 2022.
[33] Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879–893. PMLR, 2018.
[34] Ajay Mandlekar, Soroush Nasiriany, Bowen Wen, Iretiayo Akinola, Yashraj Narang, Linxi Fan, Yuke Zhu, and Dieter Fox. MimicGen: A data generation system for scalable robot learning using human demonstrations. arXiv preprint arXiv:2310.17596, 2023.
[35] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
[36] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
[37] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3406–3413. IEEE, 2016.
[38] Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie Gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024.
[39] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[40] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[41] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.
[42] V. Sanh. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.
[43] Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Lerrel Pinto. On bringing robots home. arXiv preprint arXiv:2311.16098, 2023.
[44] Noam Shazeer. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150, 2019.
[45] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
[46] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[47] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021.
[48] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[49] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
[50] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024.
[51] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.
[52] Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, Andre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. BridgeData v2: A dataset for robot learning at scale. In Conference on Robot Learning, pages 1723–1736. PMLR, 2023.
[53] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
[54] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.
[55] Junjie Wen, Yichen Zhu, Jinming Li, Minjie Zhu, Kun Wu, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, and Jian Tang. TinyVLA: Towards fast, data-efficient vision-language-action models for robotic manipulation. arXiv preprint arXiv:2409.12514, 2024.
[56] Kuan-Ting Yu, Maria Bauza, Nima Fazeli, and Alberto Rodriguez. More than a million ways to be pushed: A high-fidelity experimental dataset of planar pushing. In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 30–37. IEEE, 2016.
[57] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.
[58] Tony Z Zhao, Jonathan Tompson, Danny Driess, Pete Florence, Kamyar Ghasemipour, Chelsea Finn, and Ayzaan Wahid. ALOHA Unleashed: A simple recipe for robot dexterity. arXiv preprint arXiv:2410.13126, 2024.
[59] Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024.
[60] Minjie Zhu, Yichen Zhu, Jinming Li, Junjie Wen, Zhiyuan Xu, Ning Liu, Ran Cheng, Chaomin Shen, Yaxin Peng, Feifei Feng, et al. Scaling diffusion policy in transformer to 1 billion parameters for robotic manipulation. arXiv preprint arXiv:2409.14411, 2024.
APPENDIX

A. Contributions

The authors contributed to the following areas (listed alphabetically):
Data and operations: Noah Brown, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Liyiming Ke, Suraj Nair, Lucy Shi, and Anna Walling.
Evaluation experiments: Kevin Black, Michael Equi, Chelsea Finn, Brian Ichter, Liyiming Ke, Adrian Li-Bell, Suraj Nair, Karl Pertsch, and Lucy Shi.
Model design: Kevin Black, Brian Ichter, Sergey Levine, Karl Pertsch, Lucy Shi, and Quan Vuong.
Post-training: Michael Equi, Chelsea Finn, Liyiming Ke, Adrian Li-Bell, Suraj Nair, and Lucy Shi.
Pre-training: Kevin Black, Danny Driess, Brian Ichter, Sergey Levine, Karl Pertsch, Lucy Shi, and Quan Vuong.
Robot hardware: Noah Brown, Adnan Esmail, Chelsea Finn, Tim Jones, and Mohith Mothukuri.
Robot software: Karol Hausman, Szymon Jakubczak, Sergey Levine, James Tanner, and Haohuan Wang.
Training infrastructure: Kevin Black, Michael Equi, Sergey Levine, Adrian Li-Bell, Suraj Nair, Quan Vuong, Haohuan Wang, and Ury Zhilinsky.
Writing and illustration: Kevin Black, Chelsea Finn, Lachy Groom, Karol Hausman, Brian Ichter, Sergey Levine, and Quan Vuong.
B. Model Architecture Details

In this section, we provide a full description of the model architecture. We follow the PaliGemma VLM [5] design, with the following differences: (1) additional input and output projections for the robotics-specific tokens, including the state vector q_t and action vectors A_t = [a_t, ..., a_{t+H−1}], (2) an additional MLP for incorporating the flow matching timestep information τ, and (3) a second, smaller set of weights for the action expert.

Additional inputs and outputs. The standard PaliGemma architecture takes in a sequence of images [I_t^1, ..., I_t^n] followed by a language prompt ℓ_t. We add an input q_t for the robot's proprioceptive state, which is mapped to the transformer embedding dimension using a linear projection. The final set of input tokens correspond to the noisy action chunk A_t^τ = [a_t^τ, ..., a_{t+H−1}^τ], with the number of tokens equal to the action horizon (H = 50 for our tasks). We only use the transformer outputs corresponding to the H noisy actions, which are decoded into v_θ(A_t^τ, o_t) using a linear projection.

Incorporating the flow matching timestep. The noisy action chunk A_t^τ is mapped to the transformer's embedding dimension using an MLP that also incorporates the flow matching timestep τ. For each noisy action a_{t'}^τ, the expression for the corresponding embedding that is fed into the transformer is W_3 · swish(W_2 · concat(W_1 · a_{t'}^τ, φ(τ))), where φ : R → R^w is a sinusoidal positional encoding function [51], W_1 ∈ R^{w×d}, W_2 ∈ R^{w×2w}, W_3 ∈ R^{w×w}, d is the action dimension, and w is the embedding dimension (or width) of the action expert.
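As a concrete illustration of this embedding, the NumPy sketch below implements the expression above. It is a minimal sketch, not the released implementation: the weight matrices W1, W2, and W3 are random placeholders with the shapes stated in the text, and φ is assumed to be the standard sinusoidal positional encoding applied to the scalar timestep τ.

```python
import numpy as np

def sinusoidal_encoding(tau, width):
    """Standard sinusoidal features of the scalar flow-matching timestep tau."""
    half = width // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = tau * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])  # shape (width,)

def swish(x):
    return x * (1.0 / (1.0 + np.exp(-x)))  # x * sigmoid(x)

def embed_noisy_action(a_tau, tau, W1, W2, W3):
    """W3 . swish(W2 . concat(W1 . a, phi(tau))), as described in Appendix B."""
    w = W1.shape[0]
    x = np.concatenate([W1 @ a_tau, sinusoidal_encoding(tau, w)])  # shape (2w,)
    return W3 @ swish(W2 @ x)                                      # shape (w,)

# Example with hypothetical sizes: action dim d = 32, action-expert width w = 1024.
d, w = 32, 1024
rng = np.random.default_rng(0)
W1, W2, W3 = rng.normal(size=(w, d)), rng.normal(size=(w, 2 * w)), rng.normal(size=(w, w))
token = embed_noisy_action(rng.normal(size=d), tau=0.3, W1=W1, W2=W2, W3=W3)
print(token.shape)  # (1024,)
```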
Attention mask. π0 uses a blockwise causal attention mask with 3 blocks: [I_t^1, ..., I_t^n, ℓ_t], [q_t], and [a_t^τ, ..., a_{t+H−1}^τ]. Within each block, there is full bidirectional attention, whereas the tokens in each block cannot attend to the tokens in future blocks. The first block includes the input modalities from PaliGemma's VLM pre-training, which are prevented from attending to future blocks (which include new inputs) to minimize distribution shift from said pre-training. The robot state q_t is its own block because it does not change with each flow matching integration step; preventing it from attending to the final block allows its corresponding keys and values to be cached during sampling. The final block corresponds to the noisy actions A_t^τ, which can attend to the full input sequence.
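The sketch below builds this blockwise causal mask as a boolean matrix (True meaning the query token may attend to the key token), given only the three block sizes. The token counts in the example (three images of 256 patches plus 20 prompt tokens, one state token, and H = 50 action tokens) are hypothetical.

```python
import numpy as np

def blockwise_causal_mask(block_sizes):
    """Boolean attention mask: full attention within a block and to earlier
    blocks, no attention to later blocks (True = query may attend to key)."""
    # Block index of every token position.
    block_ids = np.concatenate([np.full(n, i) for i, n in enumerate(block_sizes)])
    # A query in block i may attend to keys in blocks j <= i.
    return block_ids[:, None] >= block_ids[None, :]

# Hypothetical sizes: 3 images x 256 patches + 20 prompt tokens, 1 state token,
# and H = 50 action tokens.
mask = blockwise_causal_mask([3 * 256 + 20, 1, 50])
print(mask.shape)                # (839, 839)
print(mask[0, -1], mask[-1, 0])  # False (prefix cannot see actions), True
```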
Action expert. π0 is implemented as a single transformer with two sets of weights (also known as experts [45]), where each token is routed to one of the experts; the weights interact only through the transformer's self-attention layers. The images and language prompt, [I_t^1, ..., I_t^n, ℓ_t], are routed to the larger VLM backbone, which we initialize from PaliGemma. The inputs not seen during VLM pre-training, [q_t, A_t^τ], are routed to the action expert. PaliGemma is based on the Gemma 2B [49] language model, which uses multi-query attention [44] and a configuration of {width=2048, depth=18, mlp_dim=16,384, num_heads=8, num_kv_heads=1, head_dim=256}. Since the experts interact only in the self-attention layers, width and mlp_dim do not necessarily need to match between experts. To speed up inference (which requires multiple forward passes of the action expert), we downsize the action expert to {width=1024, mlp_dim=4096}, resulting in a parameter count of ∼300M.
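To make the two-expert design concrete, the following simplified sketch shows a single attention head shared by two experts of different widths: each token is projected by its own expert's weights into a common head dimension, attention is computed over the concatenated sequence, and the outputs are projected back with per-expert weights. This is an illustration of the idea only (one head, no RoPE, layer norms, or MLPs, and a trivial all-True mask for brevity), not the actual implementation; all names and sizes are placeholders.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def joint_attention(prefix, suffix, p, mask):
    """One single-head self-attention layer shared by two experts.

    prefix: (Np, w_vlm) image/language tokens, processed by the VLM expert.
    suffix: (Ns, w_act) state/action tokens, processed by the action expert.
    The experts only interact through attention over the joint sequence.
    """
    q = np.concatenate([prefix @ p["wq_vlm"], suffix @ p["wq_act"]])  # (N, h)
    k = np.concatenate([prefix @ p["wk_vlm"], suffix @ p["wk_act"]])  # (N, h)
    v = np.concatenate([prefix @ p["wv_vlm"], suffix @ p["wv_act"]])  # (N, h)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(np.where(mask, scores, -1e9)) @ v
    n_prefix = prefix.shape[0]
    out_prefix = attn[:n_prefix] @ p["wo_vlm"]   # back to width w_vlm
    out_suffix = attn[n_prefix:] @ p["wo_act"]   # back to width w_act
    return out_prefix, out_suffix

# Example with hypothetical sizes: w_vlm = 2048, w_act = 1024, shared head_dim = 256,
# 788 prefix tokens and 51 suffix tokens (state + 50 actions).
rng = np.random.default_rng(0)
w_vlm, w_act, h = 2048, 1024, 256
p = {"wq_vlm": rng.normal(size=(w_vlm, h)), "wk_vlm": rng.normal(size=(w_vlm, h)),
     "wv_vlm": rng.normal(size=(w_vlm, h)), "wo_vlm": rng.normal(size=(h, w_vlm)),
     "wq_act": rng.normal(size=(w_act, h)), "wk_act": rng.normal(size=(w_act, h)),
     "wv_act": rng.normal(size=(w_act, h)), "wo_act": rng.normal(size=(h, w_act))}
prefix, suffix = rng.normal(size=(788, w_vlm)), rng.normal(size=(51, w_act))
mask = np.ones((839, 839), dtype=bool)  # π0 would use the blockwise mask shown above
out_p, out_s = joint_attention(prefix, suffix, p, mask)
print(out_p.shape, out_s.shape)  # (788, 2048) (51, 1024)
```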
Sampling the flow matching timestep. The original flow matching papers [28, 32] sample the flow matching timestep from a uniform distribution: τ ∼ U(0, 1). Esser et al. [14] instead propose sampling from a logit-normal distribution that emphasizes the middle timesteps; the authors posit that at high timesteps (low noise levels), the model needs only to learn the identity function, and at low timesteps (high noise levels), the model needs only to learn the mean of the data distribution. However, we hypothesize that the task of action prediction is subtly different from high-resolution image synthesis: while it may be relatively easy to predict the mean image conditioned on a text label, predicting the mean action conditioned on a robot observation (i.e., learning E[A_t | o_t]) is a much harder problem; this is because the observation o_t is very informative in that it should constrain the distribution of possible actions much more than a text label constrains the distribution of possible images. As a result, we designed a timestep sampling distribution that emphasizes low timesteps (high noise levels); additionally, timesteps above a given threshold s are not sampled at all, since they are not needed so long as the integration step δ is greater than 1 − s. The distribution is given by p(τ) = Beta((s − τ)/s; 1.5, 1) and is visualized in Figure 14. We use s = 0.999 in our experiments, which allows for δ > 1/1000, or up to 1,000 integration steps.

Fig. 14: Flow matching timestep sampling distribution (the figure plots the density p(τ) against τ, with the cutoff s marked on the horizontal axis). We sample τ from a shifted beta distribution that emphasizes lower timesteps (corresponding to noisier actions), and does not sample timesteps at all above a cutoff value s. We use s = 0.999 in our experiments.
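One way to draw samples with density p(τ) = Beta((s − τ)/s; 1.5, 1) is to sample x ~ Beta(1.5, 1) and set τ = s(1 − x). The sketch below does this in NumPy; it is an illustrative recreation of the stated distribution rather than the training code.

```python
import numpy as np

def sample_flow_timesteps(batch_size, s=0.999, rng=None):
    """Sample tau with density Beta((s - tau)/s; 1.5, 1), supported on [0, s].

    Substituting x = (s - tau)/s with x ~ Beta(1.5, 1) gives tau = s * (1 - x),
    which concentrates mass at low timesteps (high noise levels) and never
    exceeds the cutoff s.
    """
    rng = rng or np.random.default_rng()
    x = rng.beta(1.5, 1.0, size=batch_size)
    return s * (1.0 - x)

taus = sample_flow_timesteps(100_000)
print(taus.max() <= 0.999, np.mean(taus < 0.5))  # True, roughly 0.65
```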
C. Non-VLM Baseline Architecture

Our baseline architecture π0-small is not based on a VLM backbone. Hence, we use it to evaluate the benefits of VLM pre-training. We design it to be sufficiently expressive to fit our large dataset while still providing good performance when trained from scratch. This model has about 470M parameters, and differs from our main model in the following ways: (1) We use DistilBERT [42] to encode the language tokens of the language command ℓ_t, since this model does not use a language model backbone; (2) The action expert cross-attends to the outputs of the observation encoder, akin to a traditional encoder-decoder transformer [51], rather than our main model which is more like a decoder-only mixture of experts [45]; (3) The images are encoded with a smaller pre-trained ViT encoder (specifically, the R26-S-32 ResNet-ViT hybrid from Steiner et al. [47]); (4) The ViT image encoders do not share weights; (5) The transformer backbone that encodes the observations (which comes after the ViT image encoders) is not pre-trained on Internet data; (6) The action expert uses the DiT architecture [36] rather than the Gemma architecture, and hence incorporates the flow-matching timestep τ using AdaLN-Zero layers. Besides this, the models are broadly similar: both use pre-trained ViT image encoders, both use separate weights for the observation encoder and the action expert, both take in the same observation format, and both perform 10 steps of flow matching to predict the action chunk.

D. Inference

Recall that our model takes an observation o_t = [I_t^1, ..., I_t^n, ℓ_t, q_t] and the noisy actions A_t^τ and outputs the vector field that needs to be integrated to obtain the next flow matching step, v_t^τ. Each time we predict a new action chunk A_t, we must encode each of the images I_t^1, ..., I_t^n, run a forward pass on the tokens corresponding to o_t, and then run 10 steps of flow matching, where each step requires running a forward pass on the tokens corresponding to A_t^τ (the keys and values corresponding to o_t are cached). Table I summarizes the computation time for this operation with 3 camera images. The operations were timed on an NVIDIA GeForce RTX 4090 consumer-grade GPU. For the mobile robot, inference was done off-board over a Wi-Fi connection, adding a small amount of network latency. Further optimizations, quantization, and other improvements might further reduce inference times.
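The sketch below spells out this inference procedure: the observation prefix is assumed to be encoded once (standing in for the cached keys and values), the action chunk is initialized from Gaussian noise, and ten Euler steps integrate the predicted vector field from τ = 0 to τ = 1. The function model_vector_field is a placeholder for the actual network.

```python
import numpy as np

def generate_action_chunk(model_vector_field, obs_cache, horizon=50, action_dim=32,
                          num_steps=10, rng=None):
    """Flow matching inference: integrate the learned vector field with Euler steps.

    model_vector_field(a_tau, tau, obs_cache) -> array of shape (horizon, action_dim);
    obs_cache stands in for the cached keys/values of the observation prefix.
    """
    rng = rng or np.random.default_rng()
    delta = 1.0 / num_steps
    a = rng.standard_normal((horizon, action_dim))  # A^0 ~ N(0, I)
    tau = 0.0
    for _ in range(num_steps):
        a = a + delta * model_vector_field(a, tau, obs_cache)  # Euler step
        tau += delta
    return a  # the final action chunk A_t

# Toy stand-in for the network: a vector field that pushes the actions toward zero.
dummy_field = lambda a, tau, cache: -a
chunk = generate_action_chunk(dummy_field, obs_cache=None)
print(chunk.shape)  # (50, 32)
```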
Since the model generates an entire H-step action chunk at once, we can execute up to H actions before we need to run inference again. However, we may run inference more often than that, as well as combine actions from different inference calls using various aggregation strategies. We tried temporal ensembling [57] early on and found that it hurt policy performance, so we opted not to aggregate actions and instead execute action chunks open-loop. For the 20Hz UR5e and Franka robots, we run inference every 0.8 seconds (after executing 16 actions), and for all other robots, which run at 50Hz, we run inference every 0.5 seconds (after executing 25 actions).

model part                        inference time
image encoders                    14 ms
observation forward pass          32 ms
x10 action forward pass (flow)    27 ms
network latency (if off-board)    13 ms
total on-board inference          73 ms
total off-board inference         86 ms

TABLE I: Inference time of our model on an NVIDIA GeForce RTX 4090 GPU.

E. Evaluation Details

For each task, we design a score rubric that measures progress on the task, and use this for our quantitative results. We describe this rubric for each task below:
A. Evaluating the base model

Shirt folding: Shirt folding is recorded as either success or failure. We begin each shirt folding eval by laying the shirt flat on the table. Success is defined as having folded in the sleeves and performed one half-fold along the length of the shirt. Our eval includes 4 small t-shirts and 1 medium t-shirt. We run 2 evals for each item, each for a maximum of 15,000 steps or approximately 5 minutes.

Bussing easy: This task is scored out of 7, where there are 7 different objects on the table, and 1 point is given for each correctly sorted object.

Bussing hard: This task is scored out of 12, where there are 12 different objects on the table, and 1 point is given for each correctly sorted object. This version of the task includes particularly challenging settings, like a chopstick on top of a piece of trash.

Grocery bagging: This task is scored out of 7. For each of the 7 grocery items, a point is given for putting it in the bag.

Toast out of toaster: This task is scored out of 4. For each piece of toast, 1 point is given for picking it from the toaster and another for putting it on the plate.

B. Language instruction following

The policy is scored on successfully repositioning each object and on whether it follows instructions.

Bussing: The robot has to follow the command to pick up the correct object and place each of them into the correct receptacle. The robot receives 12 objects in total and around 30 instructions in one episode.

Table setting: The robot arranges all dishes, utensils, and napkins and makes adjustments according to language specification. The robot receives 7 objects in total and around 20 instructions in one episode.

Grocery bagging: The robot picks up the correct item (among a bag of coffee beans, a bag of barley, a bag of marshmallows, cat food, spaghetti, a bag of seaweed, and a bag of almonds) and bags it into a paper bag. The robot receives 7 objects in total and around 14 instructions in one episode.

C. Learning new dexterous tasks

Stack bowls: This task is scored out of 3. One point for each of two bowls stacked in larger bowls, and one for the neatness of the final product.

Towel folding: This task is scored out of 3. One point for the first half-fold of the towel, one point for the second half-fold of the towel, and one point for neatness of the final product.

Tupperware in microwave: This task is scored out of 4. One point for opening the microwave, one point for picking up the Tupperware, one point for putting the Tupperware in the microwave, and one point for closing the microwave.

Paper towel replacement: This task is scored out of 4. One point is given for grasping the old roll, and another point is given for removing it. Then, one point is given for grasping the new paper towel roll, and the final point is given for placing it on the dispenser.

Items in drawer: This task is scored out of 5. One point for opening the drawer, one point for each of 3 items picked and placed into the drawer, and one point for closing the drawer.

D. Mastering complex multi-stage tasks

Laundry folding: This task is scored out of 4. Our evaluation includes five items: three shirts of size M, L, and XL and two shorts of size 28 and 36. We perform two trials for each item, and the items left to be evaluated start randomly crumpled in a laundry bin (while previously evaluated items start in a folded stack). One point is given for picking an item out of the bin and putting it on the table. Another point is given for flattening the shirt or shorts. A third point is granted for folding the shirt or shorts. A final point is given for either placing the item in the corner of the table (if it is the first item evaluated), or stacking it onto an existing stack of folded clothes. We run each eval for a maximum of 15,000 steps or approximately 5 minutes.

Mobile laundry: This evaluation follows the same protocol as laundry folding. The three shirts are sized M, M, and XL, and the shorts are sized 32 and 31 W.

Table bussing: This task is scored out of 12, where there are 12 different objects on the table, and 1 point is given for each correctly sorted object. This version of the task includes particularly challenging settings, like a chopstick on top of a piece of trash.

Box building: This task is scored out of 5. One point is given for successfully picking up the box to begin the task. One point is given for folding the box in half, so the flaps can be closed. One point is given for closing the right flap. One point is given for closing the left flap. The final point is given for neatly centering the final product.

Packing eggs: This task is scored out of 7. One point for each egg placed in the correct slot in the carton, and one point for closing the lid.

Packing food: This task is scored out of 5. One point for picking up the plate of food, one point for each of 3 food items placed in the to-go box, and one point for closing the to-go box.

Dryer unloading: This task involves having the robot approach a dryer with a laundry basket and unload the clothes into the basket. We score this eval out of five, where one point is given for properly approaching the dryer, another for placing the laundry basket on the stool, a third for opening the dryer, a fourth for putting all the clothes in the basket, and a fifth point for closing the dryer. We eval with 3 shirts and 2 shorts that start in a random configuration inside the dryer.