π0
Fig. 1: Our generalist robot policy uses a pre-trained vision-language model (VLM) backbone, as well as a diverse cross-
embodiment dataset with a variety of dexterous manipulation tasks. The model is adapted to robot control by adding a separate
action expert that produces continuous actions via flow matching, enabling precise and fluent manipulation skills. The model
can then be prompted for zero-shot control or fine-tuned on high-quality data to enable complex multi-stage tasks, such as
folding multiple articles of laundry or assembling a box.
Abstract—Robot learning holds tremendous promise to unlock the full potential of flexible, general, and dexterous robot systems, as well as to address some of the deepest questions in artificial intelligence. However, bringing robot learning to the level of generality required for effective real-world systems faces major obstacles in terms of data, generalization, and robustness. In this paper, we discuss how generalist robot policies (i.e., robot foundation models) can address these challenges, and how we can design effective generalist robot policies for complex and highly dexterous tasks. We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge. We then discuss how this model can be trained on a large and diverse dataset from multiple dexterous robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators. We evaluate our model in terms of its ability to perform tasks in zero shot after pre-training, follow language instructions from people and from a high-level VLM policy, and its ability to acquire new skills via fine-tuning. Our results cover a wide variety of tasks, such as laundry folding, table cleaning, and assembling boxes.
Fig. 3: Overview of our framework. We start with a pre-training mixture, which consists of both our own dexterous
manipulation datasets and open-source data. We use this mixture to train our flow matching VLA model, which consists
of a larger VLM backbone and a smaller action expert for processing robot states and actions. The VLM backbone weights
are initialized from PaliGemma [5], providing representations learned from large-scale Internet pre-training. The resulting π0
model can be used to control multiple robot embodiments with differing action spaces to accomplish a wide variety of tasks.
a number of more complex and dexterous behaviors, such as tying shoelaces [58] or cooking shrimp [17], we show that our framework can train very long tasks, sometimes tens of minutes in length, for behaviors that combine both physical dexterity and combinatorial complexity. For example, our laundry folding task requires the robot to manipulate a variety of clothing items that can start in any configuration, and fold multiple items in sequence. Our table bussing task requires discerning the class of novel objects (trash or dishes). We show that a single cross-embodiment model can be used as the base model for these tasks. To our knowledge, our work demonstrates the longest dexterous tasks in the end-to-end robot learning literature.

III. OVERVIEW

We provide an outline of our model and training procedure in Figure 3. In our training framework, we first assemble a pre-training mixture consisting of a weighted combination of our own dexterous manipulation datasets (Section V-C), collected on 7 different robot configurations for 68 different tasks, and the entire OXE dataset [10], which contains data from 22 robots. The pre-training phase (Section V-A) also uses diverse language labels, combining task names and segment annotations (fine-grained labels for sub-trajectories, typically about 2 seconds in length). The purpose of the pre-training phase is to train a base model that exhibits broad capabilities and generalization, but is not necessarily specialized for high performance on any one task. This base model can follow language commands and perform a variety of tasks at rudimentary proficiency. For complex and dexterous tasks, we then employ a post-training procedure (Section V-A), which uses high-quality curated data to adapt the model to specific downstream tasks. We study both efficient post-training with small to moderate amounts of data, and high-quality post-training with larger datasets for complex tasks such as laundry folding and mobile manipulation.

Our model, which we describe in Section IV, is based on the PaliGemma vision-language model [5], which we then further train with our data mixture. To turn the base PaliGemma VLM into π0, we add action outputs that use flow matching [32, 28] to generate continuous action distributions. We describe this design in detail in the following section. Note that we use PaliGemma for convenience and because of its comparatively small size (which is useful for real-time control), but our framework is compatible with any base pre-trained VLM.

IV. THE π0 MODEL

The π0 model, illustrated in Figure 3, consists primarily of a language model transformer backbone. Following the standard late fusion VLM recipe [3, 11, 30], image encoders embed the robot's image observations into the same embedding space as language tokens. We further augment this backbone with robotics-specific inputs and outputs — namely, proprioceptive state and robot actions. π0 uses conditional flow matching [28, 32] to model the continuous distribution of actions. Flow matching provides our model with high precision and multimodal modeling capability, making it especially well suited to high-frequency dexterous tasks. Our architecture is inspired by Transfusion [59], which trains a single transformer using multiple objectives, with tokens¹ corresponding to continuous outputs supervised via a flow matching loss and tokens corresponding to discrete outputs supervised via a cross-entropy loss. Building on Transfusion, we additionally found that using a separate set of weights for the robotics-specific (action and state) tokens led to an improvement in performance. This design is analogous to a mixture of experts [45, 25, 12, 16] with two mixture elements, where the first element is used for image and text inputs, and the second is used for robotics-specific inputs and outputs. We refer to the second set of weights as the action expert.

¹In this paper, we use the word "token" to refer to an input/output slot along the sequence dimension, whether the slot corresponds to a discrete variable (e.g., a language token) or a continuous variable (e.g., an image patch or a robot action).
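To make the two-expert design concrete, here is a minimal sketch (our own illustration, not the paper's implementation; names such as `vlm_w` and `expert_w` are hypothetical) of a single self-attention layer in which image/text tokens and state/action tokens use separate projection weights but share one attention operation, so information can flow between the backbone and the action expert. The real model additionally uses the attention masking and expert widths described in its Appendix B; this sketch only illustrates how tokens are routed to separate weights.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def two_expert_attention(prefix, suffix, vlm_w, expert_w):
    """One self-attention layer with separate weights per token group.

    prefix: (P, d) image/text tokens, processed with the VLM weights.
    suffix: (S, d) state/action tokens, processed with the action-expert weights.
    Both groups share a single attention operation over the full sequence.
    """
    def proj(tokens, w):
        return tokens @ w["q"], tokens @ w["k"], tokens @ w["v"]

    q_p, k_p, v_p = proj(prefix, vlm_w)
    q_s, k_s, v_s = proj(suffix, expert_w)

    q = np.concatenate([q_p, q_s], axis=0)            # (P+S, d)
    k = np.concatenate([k_p, k_s], axis=0)
    v = np.concatenate([v_p, v_s], axis=0)

    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # full attention for simplicity;
                                                      # the paper uses a blockwise mask
    out = attn @ v

    # Each group is post-processed by its own output projection.
    out_prefix = out[: len(prefix)] @ vlm_w["o"]
    out_suffix = out[len(prefix):] @ expert_w["o"]
    return out_prefix, out_suffix

# Tiny usage example with random weights.
d = 8
rng = np.random.default_rng(0)
make_w = lambda: {n: rng.standard_normal((d, d)) / np.sqrt(d) for n in "qkvo"}
prefix_out, suffix_out = two_expert_attention(
    rng.standard_normal((5, d)), rng.standard_normal((3, d)), make_w(), make_w())
```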
Formally, we want to model the data distribution p(A_t | o_t), where A_t = [a_t, a_{t+1}, ..., a_{t+H-1}] corresponds to an action chunk of future actions (we use H = 50 for our tasks), and o_t is an observation. The observation consists of multiple RGB images, a language command, and the robot's proprioceptive state, such that o_t = [I_t^1, ..., I_t^n, ℓ_t, q_t], where I_t^i is the i-th image (with 2 or 3 images per robot), ℓ_t is a sequence of language tokens, and q_t is a vector of joint angles. The images I_t^i and state q_t are encoded via corresponding encoders and then projected via a linear projection layer into the same embedding space as the language tokens.
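For concreteness, the following minimal sketch (ours; the field names are hypothetical) packages the observation o_t and action chunk A_t as defined above:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Observation:
    """o_t = [I_t^1, ..., I_t^n, l_t, q_t] as defined above."""
    images: np.ndarray           # (n, H, W, 3) RGB images, n = 2 or 3 per robot
    language_tokens: np.ndarray  # (L,) tokenized language command l_t
    state: np.ndarray            # (d,) proprioceptive state q_t (e.g., joint angles)

@dataclass
class ActionChunk:
    """A_t = [a_t, ..., a_{t+H-1}], a chunk of H future actions."""
    actions: np.ndarray          # (H, d), with H = 50 in the paper

obs = Observation(
    images=np.zeros((3, 224, 224, 3), dtype=np.uint8),
    language_tokens=np.zeros(16, dtype=np.int32),
    state=np.zeros(14, dtype=np.float32),
)
chunk = ActionChunk(actions=np.zeros((50, 14), dtype=np.float32))
```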
For each action a_{t'} in the action chunk A_t, we have a corresponding action token that we feed through the action expert. During training, we supervise these action tokens using a conditional flow matching loss [28, 32],

\[ L^\tau(\theta) = \mathbb{E}_{p(A_t \mid o_t),\, q(A_t^\tau \mid A_t)} \left\| v_\theta(A_t^\tau, o_t) - u(A_t^\tau \mid A_t) \right\|^2 , \]

where subscripts denote robot timesteps and superscripts denote flow matching timesteps, with τ ∈ [0, 1]. Recent work in high-resolution image [14] and video [38] synthesis has shown that flow matching can achieve strong empirical performance when combined with a simple linear-Gaussian (or optimal transport) probability path [28], given by q(A_t^τ | A_t) = N(τ A_t, (1 − τ)I). In practice, the network is trained by sampling random noise ε ∼ N(0, I), computing the "noisy actions" A_t^τ = τ A_t + (1 − τ)ε, and then training the network outputs v_θ(A_t^τ, o_t) to match the denoising vector field u(A_t^τ | A_t) = ε − A_t. The action expert uses a full bidirectional attention mask, so that all action tokens attend to each other. During training, we sample the flow matching timestep τ from a beta distribution that emphasizes lower (noisier) timesteps. See Appendix B for more details.
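A minimal sketch of this training target (ours, not the paper's code): `model_v` stands in for the network v_θ, and the beta-distribution parameters are placeholders chosen only to put more mass near τ = 0; the paper's exact timestep distribution is given in its Appendix B.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model_v, obs, actions, a=1.0, b=1.5):
    """Conditional flow matching loss for one action chunk.

    actions: (H, d) ground-truth chunk A_t.
    model_v: callable (noisy_actions, obs, tau) -> (H, d) predicted vector field.
    Beta(a, b) with a < b puts more mass near tau = 0 (noisier timesteps);
    the exact distribution used in the paper is a placeholder here.
    """
    tau = rng.beta(a, b)                          # flow matching timestep in [0, 1]
    eps = rng.standard_normal(actions.shape)      # random noise epsilon ~ N(0, I)
    noisy = tau * actions + (1.0 - tau) * eps     # A_t^tau on the linear-Gaussian path
    target = eps - actions                        # denoising vector field u(A_t^tau | A_t)
    pred = model_v(noisy, obs, tau)
    return np.mean((pred - target) ** 2)          # squared error, averaged over the chunk

# Usage with a stand-in "network" that ignores its inputs.
H, d = 50, 18
loss = flow_matching_loss(lambda a_, o_, t_: np.zeros_like(a_), obs=None,
                          actions=rng.standard_normal((H, d)))
```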
At inference time, we generate actions by integrating the learned vector field from τ = 0 to τ = 1, starting with random noise A_t^0 ∼ N(0, I). We use the forward Euler integration rule

\[ A_t^{\tau+\delta} = A_t^\tau + \delta\, v_\theta(A_t^\tau, o_t), \]

where δ is the integration step size. We use 10 integration steps (corresponding to δ = 0.1) in our experiments. Note that inference can be implemented efficiently by caching the attention keys and values for the prefix o_t and only recomputing the suffix corresponding to the action tokens for each integration step. We provide more details regarding the inference procedure, including the inference time for each part of the model, in Appendix D.
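A corresponding inference sketch (again ours, with the same stand-in `model_v`): starting from pure noise, it integrates the learned vector field with 10 forward Euler steps of size δ = 0.1.

```python
import numpy as np

def sample_action_chunk(model_v, obs, horizon=50, action_dim=18, n_steps=10, seed=0):
    """Generate an action chunk by Euler integration of the learned flow.

    model_v: callable (noisy_actions, obs, tau) -> (horizon, action_dim) vector field.
    In a real implementation the observation prefix would be encoded once and its
    attention keys/values cached, so each step only recomputes the action suffix.
    """
    rng = np.random.default_rng(seed)
    delta = 1.0 / n_steps                            # integration step size (0.1 for 10 steps)
    a = rng.standard_normal((horizon, action_dim))   # A_t^0 ~ N(0, I)
    tau = 0.0
    for _ in range(n_steps):
        a = a + delta * model_v(a, obs, tau)         # A_t^{tau+delta} = A_t^tau + delta * v_theta
        tau += delta
    return a                                         # approximate sample from p(A_t | o_t)

actions = sample_action_chunk(lambda a_, o_, t_: -a_, obs=None)  # toy vector field
```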
While in principle our model can be initialized from scratch or fine-tuned from any VLM backbone, in practice we use PaliGemma [5] as our base model. PaliGemma is an open-source 3 billion parameter VLM that offers a convenient trade-off between size and performance. We add 300M parameters for the action expert (which is initialized from scratch) for a total of 3.3 billion parameters. We provide a full description of the model architecture in Appendix B.

Non-VLM baseline model. In addition to our main VLA model, we also trained a similar baseline model that did not use a VLM initialization for ablation experiments. This model, which we refer to as π0-small, has 470M parameters, does not use VLM initialization, and has a number of small differences that we found to be helpful for training on our data without VLM initialization, which are summarized in Appendix C. This model is used in our comparisons to evaluate the benefits of incorporating VLM pre-training.

V. DATA COLLECTION AND TRAINING RECIPE

Broadly capable robot foundation models require not only an expressive and powerful architecture, but also the right dataset and, more importantly, the right training recipe. In the same way that LLM training is typically divided into pre-training and post-training phases, we employ a multi-stage training procedure for our model. The goal of the pre-training phase is to expose the model to a diverse range of tasks so that it can acquire broadly applicable and general physical capabilities, while the goal of the post-training phase is to provide the model with the ability to skillfully and fluently execute the desired downstream task. Because of this, the requirements for the pre-training and post-training datasets are distinct: the pre-training dataset should cover as many tasks as possible, and within each of those tasks should cover a diversity of behaviors. The post-training dataset should instead cover behaviors that are conducive to effective task execution, which should exhibit a consistent and fluent strategy. Intuitively, the diverse (but lower quality) pre-training data allows the model to recover from mistakes and handle highly varied situations, which might not otherwise occur in the high-quality post-training data, while the post-training data teaches the model to perform the task well.

A. Pre-training and post-training

Fig. 4: Overview of our dataset: The pre-training mixture consists of a subset of OXE [10] and the π dataset. We use a subset of OXE, which we refer to as OXE Magic Soup [24]. The right figure illustrates the weight of the different datasets in the pre-training mixture. The left figure illustrates their relative sizes as measured by the number of steps.

We provide an overview of our pre-training mixture in Figure 4. Since each training example corresponds to a timestep — i.e., a tuple (o_t, A_t) — we will quantify data in terms of timesteps in this discussion. 9.1% of the training mixture consists of open-source datasets, including OXE [10], Bridge v2 [52], and DROID [23]. The robots and tasks in these datasets typically have one or two cameras and use low-frequency control, between 2 and 10 Hz. However, these datasets cover a wide range of objects and environments. To learn dexterous and more complex tasks, we also use 903M timesteps of data from our own datasets, where 106M steps are from single-arm robots and 797M are from dual-arm robots. This data has 68 tasks, where each task is composed of complex behaviors — e.g., the "bussing" task involves putting a wide range of different dishes, cups, and utensils into a bussing bin, and a wide array of trash items into the garbage. Note that this definition of task is significantly different from prior work, which typically uses any combination of noun and verb (e.g., "pick up the cup" vs. "pick up the plate") to constitute a distinct task. Therefore, the actual range of behaviors in our dataset is significantly broader than this number of "tasks" would imply. We discuss the specific robots and tasks in our dataset in more detail in Section V-C.
Since the datasets are somewhat imbalanced in size (e.g., the more difficult laundry folding tasks are overrepresented), we weight each task-robot combination by n^0.43, where n is the number of samples for that combination, such that over-represented combinations are down-weighted. The configuration vector q_t and action vectors a_t always have the dimensionality of the largest robot in the dataset (18 in our case, to accommodate two 6-DoF arms, 2 grippers, a mobile base, and a vertically actuated torso). For robots with lower-dimensional configuration and action spaces, we zero-pad the configuration and action vectors. For robots with fewer than three images, we also mask out the missing image slots.
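The sketch below illustrates this sampling-weight and padding scheme (our own code; the constants, dataset names, and helper functions are hypothetical): each task-robot combination is sampled proportionally to n^0.43, and per-robot states and actions are zero-padded to the maximum dimensionality, with missing camera slots masked.

```python
import numpy as np

MAX_DIM = 18      # largest configuration/action dimensionality in the mixture
MAX_IMAGES = 3    # maximum number of camera images per robot

def mixture_weights(sample_counts, exponent=0.43):
    """Down-weight over-represented task-robot combinations.

    sample_counts: dict mapping (task, robot) -> number of samples n.
    Returns sampling probabilities proportional to n ** exponent.
    """
    keys = list(sample_counts)
    raw = np.array([sample_counts[k] ** exponent for k in keys], dtype=np.float64)
    return dict(zip(keys, raw / raw.sum()))

def pad_example(state, actions, images):
    """Zero-pad state/action vectors and mask missing camera slots."""
    state_p = np.zeros(MAX_DIM)
    state_p[: state.shape[0]] = state
    actions_p = np.zeros((actions.shape[0], MAX_DIM))
    actions_p[:, : actions.shape[1]] = actions
    image_mask = np.array([i < len(images) for i in range(MAX_IMAGES)])
    return state_p, actions_p, image_mask

# Hypothetical counts: a heavily represented task is down-weighted relative to its raw size.
weights = mixture_weights({("laundry", "bi-ARX"): 50_000_000, ("bussing", "UR5e"): 1_000_000})
```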
In the post-training phase, we fine-tune our model with a smaller task-specific dataset to specialize it to particular downstream applications. As mentioned previously, our definition of "task" is fairly broad — e.g., the "bussing" task requires manipulating a wide range of different objects. Different tasks require very different datasets, with the simplest of the tasks necessitating only 5 hours and the most complex tasks using 100 or more hours of data.
B. Language and high-level policies

More complex tasks that require semantic reasoning and high-level strategy, such as table bussing, can also benefit from a high-level policy that decomposes high-level tasks (such as "bus the table") into more immediate subtasks (such as "pick up the napkin" or "throw the napkin into the trash"). Since our model is trained to process language inputs, we can use a high-level VLM to make these semantic inferences, a method that is analogous to LLM/VLM planning methods such as SayCan [2]. We use such a high-level policy to assist our model with high-level strategy for several of our experimental tasks, as we will discuss in Section VI.
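As a rough illustration of this two-level setup (entirely our own sketch; `query_high_level_vlm` and `low_level_policy` are hypothetical stand-ins, not the paper's API), a high-level model maps the task command and current observation to an intermediate language subtask, which is then passed to the low-level policy as its language input:

```python
def run_with_high_level_policy(task_command, get_observation, query_high_level_vlm,
                               low_level_policy, execute, n_steps=1000, replan_every=50):
    """Decompose a high-level command into language subtasks for the low-level policy.

    query_high_level_vlm: (observation, task_command) -> subtask string,
        e.g. "bus the table" -> "pick up the napkin".
    low_level_policy: (observation, subtask) -> action chunk to execute on the robot.
    """
    subtask = None
    for step in range(n_steps):
        obs = get_observation()
        if step % replan_every == 0:               # periodically re-query the high-level VLM
            subtask = query_high_level_vlm(obs, task_command)
        actions = low_level_policy(obs, subtask)   # the VLA conditioned on the subtask text
        execute(actions)
```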
C. Robot system details

Our dexterous manipulation datasets include 7 different robot configurations and 68 tasks. We summarize these platforms in Figure 5, and discuss them below:

Fig. 5: The robots used in our experiments. These include single and dual-arm manipulators with 6-DoF and 7-DoF arms, as well as holonomic and nonholonomic mobile manipulators. π0 is trained jointly on all of these platforms.

UR5e. An arm with a parallel jaw gripper, with a wrist-mounted and over-the-shoulder camera, for a total of two camera images and a 7-dimensional configuration and action space.

Bimanual UR5e. Two UR5e setups, for a total of three camera images and a 14-dimensional configuration and action space.

Franka. The Franka setup has two cameras and an 8-dimensional configuration and action space.

Bimanual Trossen. This setup has two 6-DoF Trossen ViperX arms in a configuration based on the ALOHA setup [4, 57], with two wrist cameras and a base camera, and a 14-dimensional configuration and action space.

Bimanual ARX & bimanual AgileX. This setup uses two 6-DoF arms, and supports either ARX or AgileX arms, with three cameras (two wrist and one base) and a 14-dimensional configuration and action space. This class encompasses two distinct platforms, but we categorize them together because of their similar kinematic properties.

Mobile Trossen & mobile ARX. This setup is based on the Mobile ALOHA [57] platform, with two 6-DoF arms on a mobile base, which are either ARX arms or Trossen ViperX arms. The nonholonomic base adds two action dimensions, for a 14-dimensional configuration and 16-dimensional action space. There are two wrist cameras and a base camera. This class encompasses two distinct platforms, but we categorize them together because of their similar kinematic properties.

Mobile Fibocom. Two 6-DoF ARX arms on a holonomic base. The base adds three action dimensions (two for translation and one for orientation), for a 14-dimensional configuration and 17-dimensional action space.

We summarize the proportion of our dataset from each robot in Figure 4.

VI. EXPERIMENTAL EVALUATION

Our experimental evaluation consists of zero-shot evaluation experiments that compare our base (pre-trained) model to
[Figure: zero-shot performance comparison across tasks (Shirt Folding (Bi-ARX), Bussing Easy (UR5e), Bussing Hard (UR5e), Grocery Bagging (UR5e), Toast (Bi-Trossen)) for π0, π0 (parity), π0-small, OpenVLA, OpenVLA (UR5e only), and Octo.]
Fig. 11: Fine-tuning with varying amounts of data. π0 can learn some easier tasks even with smaller amounts of data, and
the pre-trained model often attains a larger improvement over the model trained from scratch.
D. Mastering complex multi-stage tasks

In our final set of experiments, we tackle a range of challenging multi-stage tasks via a combination of fine-tuning and language. For some of these tasks, data is present in pre-training, but fine-tuning is required to attain mastery. For some, no data is present in pre-training. The tasks in this evaluation, shown in Figure 12, are:

Laundry folding: This task requires a static (non-mobile) bimanual system to fold articles of clothing. The clothing items start in a randomized crumpled state in a bin, and the goal is to take out the item, fold it, and place it on top of a stack of previously folded items. The randomized initial configuration of the crumpled laundry presents a major challenge, since the policy needs to generalize to any configuration. This task is present in pre-training.

Mobile laundry: Here, the Fibocom mobile robot in Figure 5 has to fold laundry, facing many of the same challenges while controlling orientation and translation. This task is present in pre-training.

Dryer unloading: Here, the Fibocom mobile robot has to take laundry out of a dryer and place it into a hamper. This task is present in pre-training.

Table bussing: This task requires bussing a table with a diverse array of novel objects in a cluttered scene, presenting a much greater challenge than the benchmark in our zero-shot evaluation: the policy must generalize to unseen objects of varying shapes and sizes, and perform complex dexterous motions, such as twisting the gripper to pick up large plates and carefully grasping thin, delicate items such as glasses. The robot must handle dense clutter and intelligently sequence various behaviors — for example, to clean off a plate with trash, it must first pick up the plate, then shake its contents into the garbage, and then place the plate in the bin. This task is not present in pre-training.

Box building: The robot has to assemble a cardboard box that starts in a flattened state. This task presents a number of major challenges: the box needs to be bent in the right way, and the robot needs to hold down parts of the box while folding others, utilizing both arms and even the surface of the table to brace during folding motions. The robot might need to retry some folds, requiring a reactive and intelligent strategy. This task is not present in pre-training.

To-go box: This task requires moving several food items from a plate into a to-go box, requiring packing the items into the box so that they do not stick out, and then closing the box with both arms. This task is not present in pre-training.

Packing eggs: The robot needs to take six eggs out of a bowl and pack them into an egg carton, and then close the carton. The eggs need to be grasped in a manner appropriate to their pose inside the bowl, and then placed into open slots in the carton. This presents challenges due to the egg shape, slipperiness, and the need for careful placement. Closing the box requires the use of both arms. This task is not present in pre-training.

The results, showing average scores per task over 10 trials, are presented in Figure 13. The scoring rubrics are in Ap-
Fig. 12: We evaluate a range of complex and temporally extended tasks. This includes: folding laundry from a bin with a stationary (a) or mobile (b) robot, bussing a real lunch table (c), assembling a box (d), packing eggs into a carton (e), and packing food into a to-go box (f). These tasks require combining dozens of individual behaviors (such as grasping, stacking, folding, and flattening), generalization to a huge variety of object configurations, and handling complex physical properties, such as deformable objects and flexible cardboard.

Fig. 13: Post-training results on complex tasks in terms of average scores over 10 trials. The full pre-trained π0 model attains more than 50% of the maximum score across all of the tasks, and typically outperforms the ablations, with especially significant improvements on the hardest tasks.