
DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

arXiv:2403.12945v2 [cs.RO] 22 Apr 2025

https://droid-dataset.github.io

Alexander Khazatsky∗,1 , Karl Pertsch∗,1,2 , Suraj Nair1,3 , Ashwin Balakrishna3 , Sudeep Dasari4 ,
Siddharth Karamcheti1 , Soroush Nasiriany5 , Mohan Kumar Srirama4 , Lawrence Yunliang Chen2 , Kirsty Ellis6 ,
Peter David Fagan7 , Joey Hejna1 , Masha Itkina3 , Marion Lepert1 , Jason Ma14 , Patrick Tree Miller3 ,
Jimmy Wu8 , Suneel Belkhale1 , Shivin Dass5 , Huy Ha1 , Abraham Lee2 , Youngwoon Lee2,16 , Arhan Jain9 ,
Marius Memmel9 , Sungjae Park10 , Ilija Radosavovic2 , Kaiyuan Wang11 , Albert Zhan6 , Kevin Black2 ,
Cheng Chi1 , Kyle Hatch3 , Shan Lin11 , Jingpei Lu11 , Abdul Rehman7 , Pannag R Sanketi12 ,

Archit Sharma1 , Cody Simpson3 , Quan Vuong12 , Homer Walke2 , Blake Wulfe3 , Ted Xiao12 , Jonathan Yang1 ,
Arefeh Yavary13 , Tony Z. Zhao1 , Christopher Agia1 , Rohan Baijal9 , Mateo Guaman Castro9 , Daphne Chen9 ,
Qiuyu Chen9 , Trinity Chung2 , Jaimyn Drake2 , Ethan Paul Foster1 , Jensen Gao1 , Vitor Guizilini3 ,
David Antonio Herrera1 , Minho Heo10 , Kyle Hsu1 , Jiaheng Hu5 , Muhammad Zubair Irshad3 , Donovon Jackson3 ,
Charlotte Le2 , Yunshuang Li14 , Kevin Lin1 , Roy Lin2 , Zehan Ma2 , Abhiram Maddukuri5 , Suvir Mirchandani1 ,
Daniel Morton1 , Tony Nguyen3 , Abby O’Neill2 , Rosario Scalise9 , Derick Seale3 , Victor Son1 , Stephen Tian1 ,
Andrew Wang2 , Yilin Wu4 , Annie Xie1 , Jingyun Yang1 , Patrick Yin9 , Yunchu Zhang9 ,
Osbert Bastani14 , Glen Berseth6 , Jeannette Bohg1 , Ken Goldberg2 , Abhinav Gupta4 , Abhishek Gupta9 ,
Dinesh Jayaraman14 , Joseph J. Lim10 , Jitendra Malik2 , Roberto Martín-Martín5 , Subramanian Ramamoorthy7 ,
Dorsa Sadigh1 , Shuran Song1,15 , Jiajun Wu1 , Yuke Zhu5 , Thomas Kollar3 , Sergey Levine2 , Chelsea Finn1
[Figure 1: DROID (Distributed Robot Interaction Dataset) overview, showing example scenes (Bathroom, Kitchen, Dining Room, Bedroom, Laboratory, Laundry Room, Office) and summary statistics: 76k episodes, 564 scenes, 52 buildings, 13 institutions, 86 tasks/verbs.]

Fig. 1: We introduce DROID (Distributed Robot Interaction Dataset), an “in-the-wild” robot manipulation dataset with 76k trajectories or
350 hours of interaction data, collected across 564 scenes, 86 tasks, and 52 buildings over the course of 12 months. Each DROID episode
contains three synchronized RGB camera streams, camera calibration, depth information, and natural language instructions. We demonstrate
that training with DROID leads to policies with higher performance, greater robustness, and improved generalization ability. We open source
the full dataset, pre-trained model checkpoints, and a detailed guide for reproducing our robot setup.

Abstract— The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 86 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.

* Project Co-leads, correspondence to [email protected], [email protected]

Affiliations: 1 Stanford University; 2 University of California, Berkeley; 3 Toyota Research Institute; 4 Carnegie Mellon University; 5 University of Texas, Austin; 6 University of Montreal; 7 University of Edinburgh; 8 Princeton University; 9 University of Washington; 10 Korea Advanced Institute of Science & Technology (KAIST); 11 University of California, San Diego; 12 Google DeepMind; 13 University of California, Davis; 14 University of Pennsylvania; 15 Columbia University; 16 Yonsei University

I. INTRODUCTION

A key feature of robot manipulation policies is their ability to generalize, i.e., their ability to perform a desired manipulation task under new lighting conditions, in new environments, or with new objects. Training policies that are robust to such variations is a crucial step towards the deployment of robots in everyday environments and may bring us closer to every roboticist's dream: robot models that can be downloaded and "just work" when tested on a new robot setup. A central ingredient for training such generalizable policies is diverse training data: in computer vision and natural language processing, training on large and diverse datasets scraped from the internet yields models that work in a wide range of new tasks. Similarly, in robot manipulation, a number of recent works have demonstrated that larger, more diverse robot training datasets enable us to push the envelope on policy generalization, including positive transfer to new objects, instructions, scenes, and embodiments [1, 2, 14, 36, 38, 39, 47, 63]. This suggests that an important stepping stone on the path toward more capable and robust robotic manipulation policies is the creation of large, diverse, high-quality robot manipulation datasets.

However, creating such datasets is challenging: in contrast to vision or language data, training manipulation policies typically requires robot manipulation data with recorded observations and actions, which cannot be easily scraped from the internet. Collecting robot manipulation data in diverse environments poses logistical and safety challenges when moving robots outside of controlled lab environments. Additionally, collecting data at scale requires substantial investments in hardware and human labour for supervision, particularly for collecting demonstration data. As a result, even the most general robot manipulation policies today are mostly trained on data collected in controlled, lab-like environments with limited scene and task diversity. To enable the next level of generalizable robot manipulation policy learning, the robot manipulation community needs more diverse datasets, collected across a wide range of environments and tasks.

In this work, we introduce DROID (Distributed Robot Interaction Dataset), a robot manipulation dataset of unprecedented diversity (see Fig. 1). DROID consists of 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes, 52 buildings and 86 tasks. DROID was collected by 18 research labs in North America, Asia, and Europe over the course of 12 months. To streamline distributed data collection and ensure applicability of the final dataset to a wide range of research settings, all data is collected on the same robot hardware stack, based on the popular Franka Panda robot arm. Each episode contains three camera views, depth information, camera calibration, and language annotations. In experiments across 6 tasks and 4 locations, from labs to offices and real households, we find that DROID boosts policy performance, robustness, and generalizability by 20% on average over state-of-the-art approaches that leverage existing large-scale robot manipulation datasets [39]. We open-source the full DROID dataset under a CC-BY 4.0 license, code for training policies using the dataset, and a detailed guide for reproducing our complete robot software and hardware setup.

II. RELATED WORK

a) Large datasets in machine learning: The rapid progress in machine learning has been closely tied to the construction of large and diverse datasets. Examples include ImageNet [11], Kitti [18], Ego4D [19] and LAION [46] in computer vision, Common Crawl [43] and The Pile [17] in natural language processing, and ShapeNet [5] and Objaverse [9, 10] in 3D modeling. Key to their impact is their size and diversity: by enabling training on larger and more diverse data, they push the capabilities and robustness of machine learning models. With DROID we aim to continue this trend for robot manipulation and provide a large and diverse robot manipulation dataset to spur progress on generalizable policy learning.

b) Robot learning datasets: A number of prior works introduce datasets for robot learning of various sizes and diversity levels (see Table I). Broadly, these can be categorized into datasets collected autonomously via scripted and semi-random behaviors or learned agents [3, 8, 20, 26, 27, 32, 40], and datasets collected via human teleoperation [1, 2, 13, 14, 24, 36, 50, 59]. Multiple works focus on increasing dataset diversity: RH20T [14] collects data across 33 tasks in 7 table-top scenes and BridgeV2 [59] collects data in 24 scenes.¹ While these datasets increase diversity, most of their data is collected in a small number of scenes in a single research lab or building.

More recently, there has been a larger effort on pooling existing robot datasets into a coherent format, the Open X-Embodiment dataset (OXE) [39]. Albeit larger in scale than prior robot datasets, the OXE dataset still consists of individual datasets with few scenes, thus totalling around 300 scenes at the time of writing.

¹ Note that prior works use various definitions for what constitutes a "task" and what constitutes a "scene". In this work, we use the number of unique verbs extracted from the language instructions to represent the number of tasks, which is more scalable than manually defining tasks [59] yet often more reflective of the behavior diversity than e.g., counting the number of verb-object combinations [2] (see Fig. 3 for DROID's verb distribution as an example). For scenes, we only count a scene as new if there is a substantial change of the robot's workspace, e.g., if it gets transported to a new corner of the kitchen or a new room altogether, but not if only the arrangement of objects in front of the robot or the table cloth changes.
Dataset | # Traj. | # Verbs | # Scenes | Lang. Instruct. | Cam. Calibration | Public Robot | Collection
MIME [50] | 8.3k | 20 | 1 | ✗ | ✗ | ✓ | human teleop
RoboTurk [36] | 2.1k | 2 | 1 | ✗ | ✗ | ✓ | human teleop
RoboNet [8] | 162k | n/a | 10 | ✗ | ✗ | ✓ | scripted
MT-Opt [26, 27] | 800k | 2 | 1 | ✗ | ✗ | ✓ | scripted & learned
BridgeData [13] | 7.2k | 4 | 12 | ✓ | ✗ | ✓ | human teleop
BC-Z [24] | 26k | 3 | 1 | ✓ | ✗ | ✗ | human teleop
RT-1 [2] | 130k | 2 | 2 | ✓ | ✗ | ✗ | human teleop
RH20T [14] | 13k² | 33 | 7 | ✓ | ✓ | ✓ | human teleop
RoboSet [1] | 98.5k | 9 | 11 | ✓ | ✗ | ✓ | 30% human / 70% scripted
BridgeData V2 [59] | 60.1k | 82 | 24 | ✓ | ✗ | ✓ | 85% human / 15% scripted
DobbE [47]* | 5.6k | 6 | 216 | ✓ | n/a | (✓) | human tool-based
Open X-Embodiment [39]† | 1.4M | 217 | 311 | (✓) | ✗ | (✓) | dataset aggregation
DROID (ours) | 76k | 86 | 564 | ✓ | ✓ | ✓ | human teleop

Table I: Comparison to existing datasets for robot manipulation. "# Scenes" refers to the number of unique robot workspaces, e.g., different kitchens count as different scenes, but rearrangement of objects does not. See Section II for a detailed discussion of the definitions of "Tasks" and "Scenes". DROID offers high diversity in both the number of verbs and the number of scenes. * non-robot, tool-based data collection; † not a dataset in itself, but an aggregation of existing datasets, including most previous rows in this table.

² Fang et al. [14] report 110k trajectories for RH20T, but count each camera stream separately – here we report the number of unique multi-view trajectories, to compare fairly to all other datasets.

Our goal with the DROID dataset is to significantly increase the scene diversity as well as scene realism by collecting data across a wide array of real-world buildings in a diverse set of geographic locations. As a result, DROID contains data from 564 scenes across 52 buildings, a substantial increase compared to any existing robot manipulation dataset.

Collecting such data "in-the-wild" is more common for robot navigation and autonomous driving [4, 18, 28, 48, 49, 55, 57, 64] and enables training of policies that generalize zero-shot to new environments and even embodiments [48, 49]. With DROID, we take a step towards enabling similar generalization for robotic manipulation policies. Finally, there are some works that leverage cheap, off-the-shelf tools, such as reacher-grabber tools, for data collection, equipping robots with the same tools to allow for zero-shot transfer to the robot [47, 53, 63]. While this simplifies the data collection process, it limits the data to wrist camera viewpoints and may suffer from morphology differences when transferring from human-arm-collected data to robot arm execution. Additionally, DROID has larger scene and task diversity than prior tool-based collection datasets [47].

c) Scalable robot policy learning: Learning robot policies from increasingly large and diverse datasets has been the focus of numerous efforts over the last few years. Initially, these efforts focused in large part on learning from scripted or autonomously collected data [8, 12, 20, 26, 32, 40]. The success of transformer models [58] in natural language processing and computer vision motivated a number of recent works that collected large-scale demonstration datasets and trained transformer-based policies on them [2, 16, 25, 38, 39, 42, 49, 51, 65, 67]. Additionally, recent works suggest that diffusion denoising models [22] are a powerful parametrization for multi-modal action output distributions that combines expressivity with scalability [7, 16, 21, 38, 54]. Our focus with DROID is on introducing a new dataset, not a new policy learning algorithm. As such, we build on existing state-of-the-art diffusion policies [7] for all of our policy learning experiments.

III. DROID DATA COLLECTION SETUP

In this work, we introduce DROID (Distributed Robot Interaction Dataset), an open-source robot manipulation dataset that provides very high diversity and variability of scenes, tasks, and objects (see Table I). Diverse and high-quality data is a key ingredient for training generalizable policies, and DROID is designed to deliver both quantity and quality: it contains 76k robot demonstration trajectories, spanning 86 tasks and 564 scenes. It was collected over the course of 12 months in a large, cross-institutional effort with 18 robots and 50 data collectors across 13 institutions. All data is collected on a shared, open-source robot platform.

We are releasing all resources to enable researchers to build upon DROID at https://droid-dataset.github.io. This includes the full dataset under a CC-BY 4.0 license, an interactive dataset visualizer, code for training generalizable policies on DROID, pre-trained policy checkpoints, and a detailed guide for reproducing our robot hardware setup and control stack. In this section, we introduce our hardware setup and the data collection protocol.

A. DROID Robot Platform

A crucial component of building the DROID dataset was distributed data collection at 13 institutions around the world: it is what enabled us to collect manipulation data across a large diversity of scenes and tasks. A key challenge in this distributed setup is robot hardware: how can we ensure consistent and reproducible robot control across so many setups, locations and time zones? To streamline the distributed data collection process we designed the DROID robot platform (see Fig. 2), a hardware platform for data collection that is shared between all institutions, allowing us to quickly set up new data collection units and roll out updates across the whole data collection fleet. It is designed to support easy transportation between scenes and quick adjustment to new scenes and tasks.
[Figure 2: annotated photo of the DROID robot platform and its components.]

Fig. 2: The DROID robot platform. We use the same hardware setup across all 13 institutions to streamline data collection while maximizing portability and flexibility. The setup consists of a Franka Panda 7DoF robot arm, two adjustable Zed 2 stereo cameras, a wrist-mounted Zed Mini stereo camera, and an Oculus Quest 2 headset with controllers for teleoperation. Everything is mounted on a portable, height-adjustable desk for quick scene changes.

We chose the Franka Emika Panda 7 DoF robot arm as the base of our setup since it is widely adopted in the robot research community, reliable, relatively affordable, and was available at most participating institutions. The robot arm is equipped with a Robotiq 2F-85 gripper and is mounted on a height-adjustable standing desk with wheels so it can easily move between scenes and buildings. We record image observations with three synchronized stereo camera streams: two exterior Zed 2 cameras, table-mounted on adjustable tripods to quickly adapt to a new scene layout, and a wrist-mounted Zed Mini camera. We use the Polymetis controller [33] and record actions both in robot joint space and in end-effector space at a control frequency of 15Hz. The setup is completed with the Franka robot control box, a NUC that hosts the Polymetis server, and an Alienware laptop that runs our data collection GUI (see Section III-B). Everything is powered with a single power cable to further simplify changes in location.

For teleoperation, we use the controllers of a Meta Quest 2 headset to control the pose of the arm in 6D space as well as the gripper in continuous space. Over the course of this project we have replicated this setup 18 times across various locations in North America, Asia, and Europe. We provide a thoroughly tested guide to replicate the hardware and software of our setup. We found that the setup is well-suited for data collection and policy learning across a wide range of scenes and tasks.

B. Data Collection Protocol

Our dataset is collected by 50 data collectors across various research institutions. A shared data collection protocol helps streamline data collection, particularly for inexperienced data collectors. When designing the collection protocol for DROID, we focused on the following objectives: (1) preventing common data collection mistakes like "camera cannot see robot" or "teleoperator in camera view", (2) encouraging collection of diverse data, and (3) allowing data collectors to creatively choose scenes and tasks.

Every data collection session starts with moving the robot to a new scene. Data collectors were encouraged to choose scenes that include multiple interesting tasks, numerous interaction objects, and a healthy amount of clutter (see example scenes in Fig. 12). After setting up the robot in the new scene, the data collector chooses views for the 3rd person cameras that can capture a wide range of interesting behaviors in the scene. Then they perform extrinsic camera calibration using a checkerboard and the OpenCV calibration algorithm. Next, the data collector enters all potential tasks for the current scene into a data collection GUI on the laptop attached to the robot, either by selecting from a list of task options or by typing in free-form task instructions (see Fig. 11 for screenshots of the GUI). During data collection the GUI prompts the data collector with a randomly sampled task from this list for each new episode. This way we ensure that there is high coverage of diverse tasks and collection is not biased towards easier tasks or closer objects. Additionally, the GUI periodically prompts the data collector to perform randomly sampled "scene augmentations" like nudges to the mobile base, moving and re-calibrating the 3rd person cameras, changing the room lighting, and adding or removing items within the scene. For each trajectory, we record the output of all RGB cameras, relevant low-level state information from the robot, equivalent robot control commands from various popular action spaces, a data collector ID, and the metadata entered in the GUI (see Section B for a detailed list of all features we record). The data collector also marks whether the collected sequence was a success, which we log as part of the metadata. DROID consists of 76k successful episodes; roughly 16k trajectories in our data collection were labeled as "not successful", which we include in our dataset release but do not count towards the size of DROID. A data collector will typically collect up to 100 trajectories or about 20 minutes of interaction data per scene before moving on to a new scene.

During post-processing, we label each episode with natural language commands using crowdsourcing via the tasq.ai data labeling platform. We provide up to three independently labeled instructions per episode from different crowd workers to ensure diversity of annotations.

Since the initial extrinsic calibration parameters, provided through the conventional calibration detailed above, may not always be accurate due to factors such as checkerboard misalignment, inconsistent lighting, or errors inherent to the OpenCV calibration method, we address these inaccuracies in Section G. We discuss in detail the automatic post-hoc calibration process and provide three comprehensive sets of camera calibration matrices for the DROID dataset, each accompanied by respective quality assessment metrics. These include camera-to-base calibrations for around 36k unique scenes with one camera calibrated relative to the base, camera-to-camera calibrations for all scenes, and a curated superset of 24k scenes covering all three calibration methods with both cameras calibrated relative to the base. These refined calibrations enhance the dataset's suitability for robust geometric understanding in robotics and 3D perception tasks. For more details, please see Section G.
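To make the extrinsic calibration step concrete, the sketch below estimates a camera pose from a single checkerboard view with OpenCV. This is a minimal illustration under assumed parameters (board size, square size, intrinsics), not the DROID calibration code; chaining the recovered board pose with a known base-to-board transform, which is omitted here, would yield the camera-to-base extrinsics described above.

```python
import cv2
import numpy as np

# Assumed checkerboard geometry: inner-corner grid and square size in meters.
BOARD_SIZE = (7, 6)
SQUARE_SIZE = 0.030

def estimate_board_pose(image_bgr, camera_matrix, dist_coeffs):
    """Return a 4x4 camera-from-board transform, or None if no board is found."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, BOARD_SIZE)
    if not found:
        return None

    # Refine detected corners to sub-pixel accuracy.
    corners = cv2.cornerSubPix(
        gray, corners, (11, 11), (-1, -1),
        (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))

    # 3D corner coordinates in the board frame (z = 0 plane), scaled to meters.
    obj_points = np.zeros((BOARD_SIZE[0] * BOARD_SIZE[1], 3), np.float32)
    obj_points[:, :2] = np.mgrid[0:BOARD_SIZE[0], 0:BOARD_SIZE[1]].T.reshape(-1, 2)
    obj_points *= SQUARE_SIZE

    # Solve the PnP problem for the board pose in the camera frame.
    ok, rvec, tvec = cv2.solvePnP(obj_points, corners, camera_matrix, dist_coeffs)
    if not ok:
        return None

    T = np.eye(4)
    T[:3, :3], _ = cv2.Rodrigues(rvec)
    T[:3, 3] = tvec.ravel()
    return T
```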

[Figure 3: verb distribution for DROID with comparison insets for Bridge V2, RH20T, and RT-1 (top), and the distribution of interacted objects grouped into categories such as Food, Furniture, Toys, Hardware, Utensils, Textile, Sports, Clothes, Personal Care, Stationary, Kitchen Tools, Containers, Appliances, and Accessories (bottom).]

Fig. 3: Distribution of verbs and objects in DROID. Top: Distribution of verbs after de-duplication with GPT-4. DROID has a long tail of diverse tasks that span a wide range of behaviors. We also visualize the verb distributions for existing large manipulation datasets and find that only Bridge V2 [59] has a comparable long tail of skills (for a detailed view of verb distributions for all datasets, see Appendix, Fig. 18). Bottom: Distribution of objects the robot interacts with in DROID, sorted by category (best viewed zoomed in; for a detailed view, see Fig. 19).
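The verb statistics shown in Fig. 3 are derived from the language instructions; Section IV describes the pipeline (a semantic parser [23] for verbs and objects, GPT-4 for de-duplication). The sketch below is a rough, assumed stand-in for that pipeline: it uses spaCy part-of-speech tags for verb extraction and a hand-written synonym map in place of the GPT-4 de-duplication step, and the instructions shown are made up.

```python
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

# Stand-in for the GPT-4 de-duplication step: map synonyms/typos to a
# canonical verb. The real mapping in DROID is produced by GPT-4.
CANONICAL = {"put": "place", "shut": "close", "grab": "pick"}

def extract_verbs(instruction: str) -> list[str]:
    """Return lemmatized, canonicalized verbs from a language instruction."""
    doc = nlp(instruction)
    verbs = [tok.lemma_.lower() for tok in doc if tok.pos_ == "VERB"]
    return [CANONICAL.get(v, v) for v in verbs]

# Toy usage with made-up instructions.
instructions = [
    "pick up the apple and place it in the pot",
    "shut the waffle maker",
    "put the chips on the plate",
]
verb_counts = Counter(v for text in instructions for v in extract_verbs(text))
print(verb_counts.most_common())
```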

IV. DROID DATASET ANALYSIS

While we have so far referred to DROID and other large-scale robot manipulation datasets as "diverse," there is nuance in what constitutes a diverse robot dataset. Different axes of data diversity will affect the generalization abilities of models trained on the data differently: scene diversity may facilitate generalization to new scenes, while task or camera viewpoint diversity allows for greater generalization to new instructions and camera angles. We will analyze DROID along multiple important axes of diversity and compare it to existing large robot manipulation datasets.

When deciding which axes of generalization to inspect for robot manipulation datasets, it is important to consider which aspects of the problem may change between the training and downstream usage scenarios, i.e., which axes we want manipulation policies to generalize over. This may involve aspects of the scene, task, and robot setup. We identify the following important axes of diversity for closer analysis: task diversity, object diversity, scene diversity, viewpoint diversity, and interaction location diversity. The latter refers to the diversity of 3D locations relative to the robot's base at which interactions with objects occur, an important factor when generalizing to new scene layouts where interactions often need to generalize to new table heights or new parts of the robot's workspace.

We analyze DROID along these axes and compare it to existing large-scale robot manipulation datasets [2, 14, 59]. For each dataset, we run our analysis using one randomly sampled third-person camera frame per episode and the provided language instruction annotations. We find that results are consistent across randomly sampled frames.

We visualize the results of our analysis in Figs. 3 to 6. Overall, we find that DROID significantly increases diversity in tasks, objects, scenes, viewpoints, and interaction locations over existing large-scale robot manipulation datasets. A key reason is DROID's data collection protocol (see Section III-B): by collecting data with 50 data collectors in 52 buildings across three continents, switching scenes approximately every 20 minutes during collection, and giving collectors the freedom to freely choose scene-appropriate tasks, we can substantially increase the diversity of scenes, tasks, and objects featured in the dataset. Next, we will describe our analysis for each category in more detail.
[Figures 4 and 5: bar chart of scenes per scene type for DROID, Bridge V2, RT-1, and RH20T, and a side view of third-person camera viewpoint density.]

Fig. 4: Number of scenes per scene type. DROID has an order of magnitude more scenes than other large robot manipulation datasets, spanning a much wider range of scene types. We manually verified or confirmed with the authors that the scene count and type for prior datasets is accurately reported.

Fig. 5: Third-person camera viewpoints in DROID (subsampled). DROID episodes cover a total of 1417 camera viewpoints along with intrinsic and extrinsic stereo camera calibration. Brighter colors indicate regions of higher viewpoint density.
a) Task diversity: As explained in Section II, we use the distribution of de-duplicated verbs in a dataset's instructions as a scalable indicator for behavioral diversity. We use a semantic parsing algorithm [23] to extract verbs and referenced objects from the language instructions. We then use GPT-4 to de-duplicate the verbs, i.e., remove synonyms and typos. We plot the distribution of verbs for DROID in Fig. 3, top. DROID features a wide variety of verbs with a long-tailed distribution. We use a logarithmic scale for these visualizations, since diversity is about covering a wide range of tasks, rather than having a high concentration of many episodes on only a handful of tasks – that is, it is less important whether a task has 1000 vs. 2000 trajectories than whether it has 0 vs. 10. We also visualize the corresponding verb distributions for existing large manipulation datasets [2, 14, 59], and find that only Bridge V2 [59] has a comparable long tail of verb classes, although in a more restricted set of scenes (see the scene diversity analysis below). Fig. 18 shows a detailed view of the verb distributions for all datasets.

b) Object diversity: A dataset that includes manipulations of a large variety of objects facilitates generalization to new objects downstream. We analyze the objects the robot manipulates in each episode of DROID from the language instruction labels using the same semantic parsing pipeline [23] and show the distribution in Fig. 3, bottom (best viewed zoomed in, or see Fig. 19 for an enlarged version). DROID contains interactions with a wide range of everyday objects, spanning a diverse set of categories. We also plot the joint distribution of the most common verbs and interacted objects in Fig. 20. It shows that DROID not only contains diverse objects, but also a diverse range of interactions with most objects.

c) Scene diversity: We define 10 scene types (see Fig. 4) and use GPT-4V to determine the scene type for a given episode in DROID (see Appendix C for the used prompt). We find that this leads to high-quality scene type annotations (see Fig. 12 for example scenes and their categorization). For existing robot datasets we manually determine the scene types for each scene due to the small number of total scenes. DROID contains 564 unique scenes, an order of magnitude more than existing large robot manipulation datasets. The scenes cover a wide spectrum of scene types, from office environments to households. Qualitatively, the scenes in DROID reflect realistic real-world scenarios with naturally occurring objects and backgrounds. We highly encourage the reader to inspect qualitative examples of scenes in Fig. 12 and the supplementary videos.

d) Viewpoint diversity: Existing large-scale robot learning datasets often only contain a limited set of camera viewpoints because the cameras are mounted in a fixed location relative to the scene or robot. In contrast, DROID varies camera viewpoints significantly during data collection and thus has a broad coverage of viewpoints (see Fig. 5), with 1417 unique viewpoints in the dataset.

e) Interaction location diversity: Another subtle yet important aspect of robot datasets is the diversity of interaction locations: are tasks always executed in the same narrow slice of the workspace, e.g., at the same table height, or does the data cover interactions across a large fraction of the workspace? We use the point of first gripper closing in every episode as a proxy for interactions in the dataset and visualize the 3D location of these interaction points for different datasets in Fig. 6. DROID features interactions in a wider range of the workspace than existing robot manipulation datasets, which typically focus interactions on a table surface in front of the robot.
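A minimal sketch of the interaction-location proxy described above: take the end-effector position at the first timestep where the gripper closes. The array layout and the gripper-command convention (0 = open, 1 = closed) are assumptions for illustration, not the DROID data schema.

```python
import numpy as np

def first_interaction_point(ee_positions: np.ndarray,
                            gripper_commands: np.ndarray,
                            close_threshold: float = 0.5):
    """Return the end-effector xyz (in the robot base frame) at the first
    timestep where the gripper command crosses the closing threshold,
    or None if the gripper never closes.

    ee_positions:     (T, 3) end-effector positions per timestep.
    gripper_commands: (T,) continuous gripper commands, assumed in [0, 1].
    """
    closed = np.where(gripper_commands > close_threshold)[0]
    if closed.size == 0:
        return None
    return ee_positions[closed[0]]

# Aggregating these points over a dataset and plotting them in 3D gives the
# kind of workspace-coverage comparison shown in Fig. 6.
```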
[Figure 6: 3D interaction-point clouds for DROID, Bridge V2, RT-1, and RH20T.]

Fig. 6: Visualization of 3D interaction points relative to the robot base. We visualize the 3D location at which the gripper first closes in each trajectory, since closing the gripper often indicates meaningful object interactions. DROID's interactions cover a larger part of the robot's workspace, since the robot is moved freely between collection sessions instead of being placed in front of repetitive tabletop scenes.

V. EXPERIMENTS

The analysis in the previous section highlighted the diversity of tasks, objects, scenes, and viewpoints in the DROID dataset. In this section, we investigate whether this diverse data resource can be used to boost policy performance and robustness across a wide spectrum of robot manipulation tasks and environments. To this end, we train policies across 6 tasks in 4 different locations including lab, office, and household settings, to reflect the diversity of real-world robotic research use cases (see Fig. 7). All experiments use representative, state-of-the-art robot policy learning approaches [7]. Across the board, we find that DROID improves policy success rate while increasing robustness to scene changes like distractors or novel object instances.

A. Experimental Setup

a) Tasks: As illustrated in Fig. 7, we choose 6 tasks in 4 locations that span a representative range of real robot learning use cases: from simple pick-place tasks to multi-stage cooking tasks; from clean lab settings to real households. All experiments use the DROID hardware stack for policy evaluations. Concretely, we evaluate on the following 6 tasks, each with their own out-of-distribution variants:

Closing Waffle Maker: A short-horizon task in a lab setting (70 demonstrations), where the task is to close a waffle maker. The waffle maker position is randomized between episodes. The out-of-distribution variant consists of adding several distractor objects on the table.

Place Chips on Plate: A short-horizon task in a lab setting (50 demonstrations), where the task is to pick and place a bag of Doritos chips onto a plate, with two distractor objects on the table. All objects and the plate position are randomized between episodes on the table. The out-of-distribution variant consists of (a) changing the type of chips or (b) adding more distractor objects to the table.

Put Apple in Pot: A medium-horizon task in a lab setting (60 demonstrations), where the task is to pick and place an apple into a pot and then put a lid on the pot. The apple, pot, and lid positions are randomized between episodes on the table. The out-of-distribution variant involves placing a distractor plate on the table.

Toasting: A medium-horizon task in a lab setting (150 demonstrations), where the task is to put an object on a toaster oven tray, then close the toaster oven. The object and toaster positions are randomized between episodes on the table. The out-of-distribution variant consists of toasting novel objects.

Clean up Desk: A long-horizon task in an office setting (50 demonstrations), where the task is to open a drawer, pick and place an eraser into the drawer, and then close the drawer. The eraser position is fixed. The out-of-distribution variant consists of adding distractor objects on the desk and in the drawer.

Cook Lentils: A long-horizon task in a kitchen setting (50 demonstrations), where the task is to remove the lid from a pan, pour lentils into the pan, and turn on the stove. The object positions are fixed. The out-of-distribution variant consists of adding several distractor objects and a camera shift.

Additional details about each evaluation task can be found in Appendix E. All data is collected using the DROID teleoperation setup and training uses the same standardized policy learning backbone.

b) Policy training: The goal of this work is to introduce a new robot manipulation dataset, not to introduce a new policy learning method. Thus, during experimental evaluations we aim to leverage a well-adopted, state-of-the-art policy learning pipeline. To this end, we use diffusion policies [7, 16, 21, 38, 54], which leverage denoising diffusion models for action prediction and have recently demonstrated strong performance across a range of applications. We build on the implementation of diffusion policies in Robomimic [37], which provides high-quality open-source implementations of a number of different imitation learning and offline RL algorithms. Concretely, all of our policies are conditioned on a language instruction, use the RGB camera streams from the two external cameras and the robot proprioception as input, and produce absolute robot end-effector translation, rotation, and gripper actions. We first downsample the camera observations to a resolution of 128 × 128 and use a ResNet-50 visual encoder pre-trained on ImageNet [11] to encode both visual inputs. We then concatenate these visual embeddings with a frozen DistilBERT [45] language embedding and the robot's proprioceptive state. These concatenated features are then fed through an MLP and passed to a U-Net diffusion head which generates action trajectories. In line with prior work [7], we train the diffusion policy to generate 16-step action sequences, and during rollouts, step 8 actions open loop before re-running policy inference. For leveraging DROID during policy training, we simply mix training batches at a 50/50 ratio between the small in-domain dataset and the complete DROID dataset, excluding trajectories marked as "not successful", which we find to work well in practice. Additional details about the policy training can be found in Appendix F.
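The chunked, receding-horizon execution described above (predict a 16-step action sequence, execute 8 steps open loop, then re-plan) can be sketched as follows. The policy and environment interfaces are placeholders, not the Robomimic-based implementation used in the paper.

```python
PRED_HORIZON = 16   # actions predicted per diffusion-policy call
EXEC_HORIZON = 8    # actions executed open loop before re-planning

def rollout(policy, env, instruction, max_steps=400):
    """Receding-horizon rollout of a chunked action policy.

    `policy.predict(obs, instruction)` is assumed to return a list of
    PRED_HORIZON end-effector actions conditioned on the two external
    camera images, proprioception, and the language instruction.
    """
    obs = env.reset()
    steps = 0
    while steps < max_steps:
        action_chunk = policy.predict(obs, instruction)   # predict 16 actions
        for action in action_chunk[:EXEC_HORIZON]:        # execute the first 8
            obs, done = env.step(action)                  # open loop
            steps += 1
            if done or steps >= max_steps:
                return done
    return False
```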
[Figure 7: evaluation scenes for Close Waffle Maker, Place Chips on Plate, Put Apple in Pot, Toasting, Clean up Desk, and Cook Lentils, grouped into lab, office, and household evaluations.]

Fig. 7: Robot setups for policy evaluation. We cover a wide range of tasks and scenes, from lab evaluations to offices and real households, to reflect the diversity of use cases in real robot research. Depending on the task we collect between 50 and 150 demonstrations. We describe each task with out-of-distribution evaluation modifications in parentheses, left to right: Close Waffle Maker: The robot needs to close a waffle maker (distractor objects). Place Chips on Plate: The robot needs to pick up the chips bag and place it on the provided plate (unseen chips bag and distractor objects). Put Apple in Pot: The robot needs to pick up the apple, place it in the pot, and close the lid (unseen distractor object). Toasting: The robot needs to pick up the object, place it in the toaster oven, and close the oven (toast a novel object). Clean up Desk: The robot needs to open the drawer, place the eraser that is on top of the desk inside the drawer, and close it (distractor objects on desk and in drawer). Cook Lentils: The robot needs to remove the pan lid, pick up and pour lentils into the pan, and turn on the stove (add distractor objects).

B. Does DROID Improve Policy Performance and Robustness?

To study if co-training with DROID can enable improved policy learning, we train separate policies for each evaluation task and compare all policies head-to-head in A/B evaluations using 10 rollouts for each task setting and method. To test how DROID and existing datasets affect policy robustness, we evaluate each task and method in two settings: "in-distribution," which reflects the distribution of tasks in the in-domain demonstrations with noise added to the initial robot and object positions, and "out-of-distribution" (OOD), which tests policy robustness, e.g., by introducing distractor objects or switching the manipulated object. We evaluate the following approaches:
• No Co-training: Trains a diffusion policy [7] using the in-domain demonstrations only.
• DROID (Ours): Trains a diffusion policy, but mixes batches 50/50 between in-domain demonstrations and DROID demonstrations.
• OXE [39]: Trains a diffusion policy, but mixes batches 50/50 between in-domain demonstrations and trajectories from the Open X-Embodiment dataset [39] (OXE). OXE contains most of the existing large robot manipulation datasets we compared DROID to in Section IV, as well as a large number of other robot datasets, spanning 22 robot embodiments and approximately 300 scenes total.³

³ We use a curated split of OXE based on Octo Model Team et al. [38], which has been shown to work well for policy learning in prior work [38]. We remove the Language Table dataset [35], equivalent to 5% of the Octo training mix, due to its repetitive scene layouts and tasks, and its raw size, which proved challenging to handle for our training infrastructure.

We present the results of our policy evaluations in Fig. 8. Across all tasks, we find that DROID substantially improves policy performance compared to the diffusion policy trained on in-domain data only. Policies co-trained with DROID also perform better than policies that leverage the diverse, existing robot datasets in Open X-Embodiment (OXE). Notably, when testing out-of-distribution performance, the No Co-training baseline performs quite poorly while the co-trained policies are much more effective. This difference is especially notable when co-training with DROID, which has the strongest overall performance.

Qualitatively, we find that policies that leverage DROID during training are notably smoother and more precise than the other comparisons, particularly in the more challenging out-of-distribution evaluations. For instance, in the OOD setting of the Waffle Closing task, DROID is the only method that consistently reaches for the waffle maker, while the other methods get confused about the task. Similarly, in the multi-step Cook Lentils task, baselines tend to fail after two or sometimes just one step, while co-training with DROID is the only method able to consistently finish all three steps. See Fig. 9 for examples of qualitative task rollouts.

C. How important is the scene diversity in DROID?

One of the unique benefits of DROID compared to existing robot datasets is its amount of scene diversity. Indeed, we see in Figure 4 that DROID contains far more scene diversity than the next most diverse robot manipulation dataset. While we've seen the benefits of co-training with DROID, can we quantify how much of a role scene diversity plays in improved policy robustness?

To test this, we design an experiment that uses the challenging OOD versions of the evaluation tasks from Section V-A, but compares the following two data subsets (one possible construction is sketched after this subsection):
• DROID (7k, 20 Scenes): Selects the 20 scenes from DROID with the most demonstrations each, resulting in 7362 trajectories with comparatively little scene diversity.
• DROID (7k, Diverse Scenes): A uniform random sample of 7362 successful demonstrations from the DROID dataset, which matches dataset size to the previous method while retaining high scene diversity.

These comparisons use the same 50/50 co-training paradigm with the individual task data used in the previous experiment. Hence, this helps establish whether the scene diversity of DROID results in better policy performance than just using 20 scenes while controlling for dataset size.
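A sketch of how the two subsets and the 50/50 co-training mix could be assembled. The episode metadata fields (scene_id, success) and the sampling utilities are assumptions for illustration; the paper's actual data pipeline is not shown.

```python
import random
from collections import Counter

def top_k_scene_subset(episodes, k=20):
    """'DROID (7k, 20 Scenes)': keep only episodes from the k scenes
    with the most demonstrations."""
    counts = Counter(ep["scene_id"] for ep in episodes)
    top_scenes = {scene for scene, _ in counts.most_common(k)}
    return [ep for ep in episodes if ep["scene_id"] in top_scenes]

def uniform_subset(episodes, n=7362, seed=0):
    """'DROID (7k, Diverse Scenes)': a uniform random sample matching the
    size of the low-diversity subset while keeping its scene diversity."""
    rng = random.Random(seed)
    return rng.sample(episodes, n)

def sample_cotraining_batch(in_domain, droid_subset, batch_size=128,
                            rng=random):
    """50/50 co-training: half of each batch from the small in-domain demos,
    half from the chosen DROID subset (successful episodes only)."""
    droid_success = [ep for ep in droid_subset if ep["success"]]
    half = batch_size // 2
    return (rng.choices(in_domain, k=half)
            + rng.choices(droid_success, k=batch_size - half))
```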
Fig. 8: Does DROID Improve Policy Performance and Robustness? We find that across all our evaluation tasks, co-training with DROID
significantly improves both in distribution and OOD performance over both no co-training and co-training with the Open-X dataset. We
compare success rate averaged across all tasks with standard error, and find DROID outperforms the next best method by 22% absolute
success rate in-distribution and by 17% out of distribution.
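The aggregate numbers in Fig. 8 ("success rate averaged across all tasks with standard error") correspond to a computation like the following; the per-task success counts below are made up for illustration.

```python
import numpy as np

# Hypothetical per-task success rates out of 10 rollouts each.
per_task_successes = np.array([7, 9, 6, 8, 5, 9]) / 10.0

mean = per_task_successes.mean()
std_error = per_task_successes.std(ddof=1) / np.sqrt(len(per_task_successes))
print(f"average success rate: {mean:.2f} +/- {std_error:.2f}")
```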

In Figure 10 we observe that using the split of the dataset with more diverse scenes yields better performance in the OOD evaluation setting. By comparing Figure 10's individual task performances with the corresponding tasks in Figure 8, we also see that the performance of co-training with the full DROID dataset matches or outperforms the performance with the subsampled dataset on all three tasks. These results suggest that the strength of DROID lies in its size and especially in its diversity.

VI. DISCUSSION

In this work, we introduced DROID (Distributed Robot Interaction Dataset), a new robot manipulation dataset with a large diversity of scenes, tasks, objects and viewpoints. Our dataset analysis in Section IV showed that DROID has an order of magnitude larger scene diversity than existing large robot manipulation datasets, a wide range of tasks, many interaction objects, and diverse viewpoints. Our policy learning evaluations show that DROID is a valuable data resource for improving policy performance and robustness, even in comparison to existing large robot data sources like the Open X-Embodiment dataset [39].

We hope that DROID can be a catalyst for research on general-purpose robot manipulation policies that are able to generalize to a broad range of tasks and scenes. In this work, we showed one example of leveraging DROID to boost policy performance, but there are many open questions about how to best make use of such diverse data: how should we combine DROID with existing large-scale robot datasets, and how can we train policies that perform tasks in new scenes without any in-domain data? Can the diverse interaction data in DROID be used to learn better visual representations for robotic control? And in what situations is it helpful to train on the full dataset vs. slices of the data? We hope that DROID can accelerate research on these questions and are excited to see how the community will leverage the dataset! We also hope that our open-sourced hardware platform, which already exists in 18 labs around the globe and is easy to reproduce, can improve the reproducibility of robot learning research and facilitate future additions to the DROID dataset.
[Figure 9: rollout frames for the Remove Lid, Pick up Lentils, Pour Lentils, and Turn on Stove stages, comparing DROID (ours), OXE, and No Co-Train policies.]

Fig. 9: Representative policy rollout examples on the most challenging "Cook Lentils" task in the kitchen of one of the authors. Qualitatively, we find that policies co-trained with DROID perform smoother, more precise motions, allowing them to solve long-horizon tasks like lentil cooking even in the presence of unseen distractor objects. In contrast, policies trained only with in-domain demonstrations or co-trained with Open-X data [39] struggle with long-horizon tasks and out-of-distribution evaluation settings. See https://droid-dataset.github.io for rollout videos.

Fig. 10: How important is the scene diversity in DROID? We find that co-training on a subset of DROID with diverse scenes has higher OOD performance than co-training on a subset of DROID with only 20 scenes, suggesting the scene diversity of DROID is one of the driving factors behind strong policy performance when co-training.

ACKNOWLEDGMENT

We thank the Toyota Research Institute (TRI) for their support in various aspects of this project, from data collection to compute for policy training. This work was supported by the Google TPU Research Cloud. We further acknowledge the following funding sources: Chelsea Finn's group was supported by TRI and ONR grants N00014-20-1-2675 and N00014-22-1-2621; Sergey Levine's group was supported by TRI, NSF FRR IIS-2150826, and ONR N00014-20-1-2383; Ram Ramamoorthy's group was supported by the United Kingdom Research and Innovation through grant EP/S023208/1 to the EPSRC Centre for Doctoral Training in Robotics and Autonomous Systems (RAS) and grant EP/V026607/1 to the UKRI Research Node on Trustworthy Autonomous Systems Governance and Regulation; Dorsa Sadigh's group was supported by TRI and ONR grant N00014-22-1-2293; Glen Berseth's group acknowledges funding support from NSERC and CIFAR and compute support from the Digital Research Alliance of Canada, Mila IDT and NVidia; Jeannette Bohg's group was supported by TRI, Intrinsic, Toshiba and the National Science Foundation under Grant 2327974; Joseph Lim's group was supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grants (No.2019-0-00075, Artificial Intelligence Graduate School Program, KAIST; No.2022-0-00077, AI Technology Development for Commonsense Extraction, Reasoning, and Inference from Heterogeneous Data), and a National Research Foundation of Korea (NRF) grant (NRF-2021H1D3A2A03103683) funded by the Korean government (MSIT).

REFERENCES

[1] Homanga Bharadhwaj, Jay Vakil, Mohit Sharma, Abhinav Gupta, Shubham Tulsiani, and Vikash Kumar. RoboAgent: Generalization and efficiency in robot manipulation via semantic augmentations and action chunking. arXiv preprint arXiv:2309.01918, 2023.
[2] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
[3] Serkan Cabi, Sergio Gomez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, Oleg Sushkov, David Barker, Jonathan Scholz, Misha Denil, Nando de Freitas, and Ziyu Wang. Scaling data-driven robotics with reward sketching and batch reinforcement learning. RSS, 2019.
[4] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.
[5] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.
[6] Lawrence Yunliang Chen, Chenfeng Xu, Karthik Dharmarajan, Muhammad Zubair Irshad, Richard Cheng, Kurt Keutzer, Masayoshi Tomizuka, Quan Vuong, and Ken Goldberg. RoVi-Aug: Robot and viewpoint augmentation for cross-embodiment robot learning. In Conference on Robot Learning (CoRL), Munich, Germany, 2024.
[7] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023.
[8] Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. RoboNet: Large-scale multi-robot learning. CoRL, 2019.
[9] Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-XL: A universe of 10M+ 3D objects. arXiv preprint arXiv:2307.05663, 2023.
[10] Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3D objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.
[11] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[12] Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control. arXiv:1812.00568, 2018.
[13] Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021.
[14] Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Chenxi Wang, Junbo Wang, Haoyi Zhu, and Cewu Lu. RH20T: A comprehensive robotic dataset for learning diverse skills in one-shot. Towards Generalist Robots: Learning Paradigms for Scalable Skill Acquisition @ CoRL 2023, 3:5, 2023.
[15] Martin A Fischler and Robert C Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
[16] Zipeng Fu, Tony Z Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation with low-cost whole-body teleoperation. arXiv preprint arXiv:2401.02117, 2024.
[17] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
[18] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.
[19] Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022.
[20] Abhinav Gupta, Adithyavairavan Murali, Dhiraj Prakashchand Gandhi, and Lerrel Pinto. Robot learning in homes: Improving generalization and reducing dataset bias. Advances in Neural Information Processing Systems, 31, 2018.
[21] Huy Ha, Pete Florence, and Shuran Song. Scaling up and distilling down: Language-guided robot skill acquisition. In Conference on Robot Learning, pages 3766–3777. PMLR, 2023.
[22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
[23] Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017.
[24] Eric Jang, Alex Irpan, Mohi Khansari, Daniel Kappler, Frederik Ebert, Corey Lynch, Sergey Levine, and Chelsea Finn. BC-Z: Zero-shot task generalization with robotic imitation learning. In Conference on Robot Learning, pages 991–1002. PMLR, 2022.
[25] Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. VIMA: Robot manipulation with multimodal prompts. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 14975–15022. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/jiang23b.html.
[26] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, et al. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. arXiv preprint arXiv:1806.10293, 2018.
[27] Dmitry Kalashnkov, Jake Varley, Yevgen Chebotar, Ben Swanson, Rico Jonschkowski, Chelsea Finn, Sergey Levine, and Karol Hausman. MT-Opt: Continuous multi-task robotic reinforcement learning at scale. arXiv, 2021.
[28] Haresh Karnan, Anirudh Nair, Xuesu Xiao, Garrett Warnell, Sören Pirk, Alexander Toshev, Justin Hart, Joydeep Biswas, and Peter Stone. Socially compliant navigation dataset (SCAND): A large-scale dataset of demonstrations for social navigation. IEEE Robotics and Automation Letters, 7(4):11807–11814, 2022.
[29] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment Anything, 2023. URL https://arxiv.org/abs/2304.02643.
[30] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything, April 2023.
[31] Vincent Lepetit, Francesc Moreno-Noguer, and Pascal Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81:155–166, 2009.
[32] Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with large-scale data collection. In International Symposium on Experimental Robotics. Springer, 2016.
[33] Yixin Lin, Austin S. Wang, Giovanni Sutanto, Akshara Rai, and Franziska Meier. Polymetis. https://facebookresearch.github.io/fairo/polymetis/, 2021.
[34] Jingpei Lu, Zekai Liang, Tristin Xie, Florian Ritcher, Shan Lin, Sainan Liu, and Michael C Yip. CtRNet-X: Camera-to-robot pose estimation in real-world conditions using a single camera. arXiv preprint arXiv:2409.10441, 2024.
[35] Corey Lynch, Ayzaan Wahid, Jonathan Tompson, Tianli Ding, James Betker, Robert Baruch, Travis Armstrong, and Pete Florence. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023.
[36] Ajay Mandlekar, Yuke Zhu, Animesh Garg, Jonathan Booher, Max Spero, Albert Tung, Julian Gao, John Emmons, Anchit Gupta, Emre Orbay, et al. RoboTurk: A crowdsourcing platform for robotic skill learning through imitation. In Conference on Robot Learning, pages 879–893. PMLR, 2018.
[37] Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. arXiv preprint arXiv:2108.03298, 2021.
[38] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. https://octo-models.github.io, 2023.
[39] Open X-Embodiment Collaboration, Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864, 2023.
[40] Lerrel Pinto and Abhinav Gupta. Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pages 3406–3413. IEEE, 2016.
[51] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-Actor: A multi-task transformer for robotic manipulation. In Proceedings of the 6th Conference on Robot Learning (CoRL), 2022.
[52] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-Actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–799. PMLR, 2023.
[53] Shuran Song, Andy Zeng, Johnny Lee, and Thomas
2016. Funkhouser. Grasping in the wild: Learning 6dof closed-
[41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya loop grasping from low-cost demonstrations. IEEE
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Robotics and Automation Letters, 5(3):4978–4985, 2020.
Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen [54] Ajay Sridhar, Dhruv Shah, Catherine Glossop, and Sergey
Krueger, and Ilya Sutskever. Learning transferable visual Levine. Nomad: Goal masked diffusion policies for navi-
models from natural language supervision, 2021. URL gation and exploration. arXiv preprint arXiv:2310.07896,
https://arxiv.org/abs/2103.00020. 2023.
[42] Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, [55] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien
Trevor Darrell, and Jitendra Malik. Robot learning Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin
with sensorimotor pre-training. In Conference on Robot Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan,
Learning, 2023. Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev,
[43] Ahad Rana. Common crawl – building an open web-scale Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi,
crawl using hadoop, 2010. URL https://www.slideshare. Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir
net/hadoopusergroup/common-crawlpresentation. Anguelov. Scalability in perception for autonomous
[44] Nikhila Ravi, Jeremy Reizenstein, David Novotny, Taylor driving: Waymo open dataset, 2019.
Gordon, Wan-Yen Lo, Justin Johnson, and Georgia [56] Stephen Tian, Blake Wulfe, Kyle Sargent, Katherine
Gkioxari. Accelerating 3d deep learning with pytorch3d. Liu, Sergey Zakharov, Vitor Guizilini, and Jiajun Wu.
arXiv:2007.08501, 2020. View-invariant policy learning via zero-shot novel view
[45] Victor Sanh, Lysandre Debut, Julien Chaumond, and synthesis. arXiv, 2024.
Thomas Wolf. Distilbert, a distilled version of bert: [57] Samuel Triest, Matthew Sivaprakasam, Sean J Wang,
smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108, Wenshan Wang, Aaron M Johnson, and Sebastian Scherer.
2019. URL https://api.semanticscholar.org/CorpusID: Tartandrive: A large-scale dataset for learning off-road
203626972. dynamics models. In 2022 International Conference on
[46] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Robotics and Automation (ICRA), pages 2546–2552. IEEE,
Cade Gordon, Ross Wightman, Mehdi Cherti, Theo 2022.
Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts- [58] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
man, et al. Laion-5b: An open large-scale dataset for Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser,
training next generation image-text models. Advances in and Illia Polosukhin. Attention is all you need. In
Neural Information Processing Systems, 35:25278–25294, Advances in neural information processing systems, pages
2022. 5998–6008, 2017.
[47] Nur Muhammad Mahi Shafiullah, Anant Rai, Haritheja [59] Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim,
Etukuru, Yiqian Liu, Ishan Misra, Soumith Chintala, and Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-
Lerrel Pinto. On bringing robots home, 2023. Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan
[48] Dhruv Shah, Ajay Sridhar, Arjun Bhorkar, Noriaki Hirose, Fang, Chelsea Finn, and Sergey Levine. Bridgedata v2:
and Sergey Levine. Gnm: A general navigation model to A dataset for robot learning at scale, 2023.
drive any robot. In 2023 IEEE International Conference [60] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea
on Robotics and Automation (ICRA), pages 7226–7233. Vedaldi, Christian Rupprecht, and David Novotny. Vggt:
IEEE, 2023. Visual geometry grounded transformer. In Proceedings
[49] Dhruv Shah, Ajay Sridhar, Nitish Dashora, Kyle Stachow- of the IEEE/CVF Conference on Computer Vision and
icz, Kevin Black, Noriaki Hirose, and Sergey Levine. Pattern Recognition, 2025.
ViNT: A foundation model for visual navigation. In [61] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris
7th Annual Conference on Robot Learning, 2023. URL Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d
https://arxiv.org/abs/2306.14846. vision made easy, 2024. URL https://arxiv.org/abs/2312.
[50] Pratyusha Sharma, Lekha Mohan, Lerrel Pinto, and Abhi- 14132.
nav Gupta. Multiple interactions made easy (mime): Large [62] Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael
scale demonstrations data for imitation. In Conference Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier,
on robot learning, pages 906–915. PMLR, 2018. and Matt Feiszli. Fast3r: Towards 3d reconstruction of
1000+ images in one forward pass. In Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2025.
[63] Sarah Young, Dhiraj Gandhi, Shubham Tulsiani, Abhinav
Gupta, Pieter Abbeel, and Lerrel Pinto. Visual imitation
made easy, 2020.
[64] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian,
Yingying Chen, Fangchen Liu, Vashisht Madhavan, and
Trevor Darrell. Bdd100k: A diverse driving dataset for
heterogeneous multitask learning. In Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition, pages 2636–2645, 2020.
[65] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea
Finn. Learning fine-grained bimanual manipulation with
low-cost hardware. arXiv preprint arXiv:2304.13705,
2023.
[66] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang,
Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-
vla: A 3d vision-language-action generative world model.
arXiv preprint arXiv:2403.09631, 2024.
[67] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted
Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker,
Ayzaan Wahid, et al. Rt-2: Vision-language-action models
transfer web knowledge to robotic control. In 7th Annual
Conference on Robot Learning, 2023.
APPENDIX A
CONTRIBUTIONS

Project Leads: Alexander Khazatsky, Karl Pertsch

Research Leads (contributed significantly to development of data collection setup, data post-processing and policy training): Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama

Engineers (helped implement data collection, postprocessing and policy learning infrastructure): Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Masha Itkina, Marion Lepert, Jason Ma, Patrick Tree Miller, Jimmy Wu, Huy Ha, Youngwoon Lee, Kaiyuan Wang, Kevin Black, Cheng Chi, Kyle Hatch, Shan Lin, Jingpei Lu, Abdul Rehman, Pannag R Sanketi, Cody Simpson, Quan Vuong, Blake Wulfe, Ted Xiao, Jonathan Yang, Arefeh Yavary, Tony Z. Zhao

Lab Leads (coordinated data collection in their respective labs): Suraj Nair, Ashwin Balakrishna, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Abraham Lee, Youngwoon Lee, Arhan Jain, Marius Memmel, Sungjae Park, Ilija Radosavovic, Kaiyuan Wang, Albert Zhan, Archit Sharma, Homer Walke

Policy Evaluators (ran robot evaluations for policy learning experiments): Alexander Khazatsky, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Mohan Kumar Srirama, Joey Hejna, Donovon Jackson, Tony Nguyen, Derick Seale

Data Collectors: Alexander Khazatsky, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Masha Itkina, Marion Lepert, Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Abraham Lee, Arhan Jain, Marius Memmel, Sungjae Park, Ilija Radosavovic, Albert Zhan, Christopher Agia, Rohan Baijal, Mateo Guaman Castro, Daphne Chen, Qiuyu Chen, Trinity Chung, Jaimyn Drake, Ethan Paul Foster, Jensen Gao, David Antonio Herrera, Minho Heo, Kyle Hsu, Jiaheng Hu, Donovon Jackson, Charlotte Le, Yunshuang Li, Kevin Lin, Roy Lin, Zehan Ma, Abhiram Maddukuri, Suvir Mirchandani, Daniel Morton, Tony Nguyen, Abby O'Neill, Rosario Scalise, Derick Seale, Victor Son, Stephen Tian, Andrew Wang, Yilin Wu, Annie Xie, Jingyun Yang, Patrick Yin, Yunchu Zhang

Lead Advisor: Chelsea Finn

Advisors: Osbert Bastani, Glen Berseth, Jeannette Bohg, Ken Goldberg, Abhinav Gupta, Abhishek Gupta, Dinesh Jayaraman, Joseph J. Lim, Jitendra Malik, Roberto Martín-Martín, Subramanian Ramamoorthy, Dorsa Sadigh, Shuran Song, Jiajun Wu, Yuke Zhu, Thomas Kollar, Sergey Levine

APPENDIX B
DROID DATA FEATURES

All DROID data is recorded at 15Hz. Each DROID trajectory contains the following elements:
• 3 stereo RGB camera streams at 1280x720 resolution
• robot joint positions and velocities (7D)
• robot end-effector pose and velocity in robot base frame (6D)
• robot gripper position and velocity (1D)

Additionally, each trajectory has the following metadata:
• 1-3 natural language instructions describing the task performed in the trajectory, collected via crowdsourcing
• extrinsic camera calibration matrices for both exterior cameras
• building name and data collector user ID
• scene type, as classified by GPT4V (see Section C)
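To make the per-trajectory schema above concrete, the following sketch shows one way a single 15 Hz DROID step could be represented in memory. The field names and nesting are illustrative assumptions for this appendix, not the exact keys of the released data format.

    # Hypothetical in-memory layout for one DROID timestep (15 Hz); field names
    # are illustrative and do not necessarily match the released storage format.
    import numpy as np

    def make_example_step():
        return {
            "observation": {
                # three stereo RGB streams, 1280x720 each (H x W x 3 per view)
                "exterior_image_1_left": np.zeros((720, 1280, 3), dtype=np.uint8),
                "exterior_image_2_left": np.zeros((720, 1280, 3), dtype=np.uint8),
                "wrist_image_left": np.zeros((720, 1280, 3), dtype=np.uint8),
                "joint_position": np.zeros(7, dtype=np.float32),      # 7-DoF arm
                "joint_velocity": np.zeros(7, dtype=np.float32),
                "cartesian_position": np.zeros(6, dtype=np.float32),  # EE pose in base frame
                "cartesian_velocity": np.zeros(6, dtype=np.float32),
                "gripper_position": np.zeros(1, dtype=np.float32),
                "gripper_velocity": np.zeros(1, dtype=np.float32),
            },
            # 1-3 crowd-sourced instructions are attached per trajectory
            "language_instruction": "put the apple in the pot",
        }

    step = make_example_step()
    print(step["observation"]["joint_position"].shape)  # (7,)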
Fig. 11: DROID data collection GUI. Top left: Screen for entering feasible tasks for the current scene. Tasks can either be selected from a list of suggestions or typed as free-form instructions. Top right: Instruction screen – the GUI samples a task at random from the entered list of feasible tasks and instructs the data collector to record a demonstration trajectory. This ensures wide coverage of possible tasks in each scene and avoids bias towards easy or familiar tasks. Bottom left: Data collection screen – displays RGB and depth camera live streams. Bottom right: The GUI periodically suggests scene changes between demonstration collections to ensure high scene diversity.

APPENDIX C
SCENE TYPE CLASSIFICATION

We labeled scene types in an automated fashion using the GPT4V API. For each scene, we sampled a random episode from that scene and a random image from that episode. That image, along with the prompt shown in Listing C.1, was sent for labeling. We then reviewed samples assigned "Other" to confirm that we were not missing any major categories, and then reassigned those labels manually.
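As a rough illustration of the labeling call described above, the sketch below sends one sampled frame together with the Listing C.1 prompt to a vision-capable GPT model via the OpenAI Python client. The client usage, the model name, and the SCENE_PROMPT placeholder are assumptions; the exact pipeline used for DROID may differ.

    # Hypothetical scene-type labeling call; model name and prompt handling are
    # assumptions, not the exact pipeline used for DROID.
    import base64
    from openai import OpenAI

    SCENE_PROMPT = "..."  # the classification prompt from Listing C.1

    def classify_scene(image_path: str) -> str:
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        response = client.chat.completions.create(
            model="gpt-4-vision-preview",  # assumed vision-capable model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": SCENE_PROMPT},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
            max_tokens=10,
        )
        # The prompt asks for just the category name, e.g. "Home kitchen".
        return response.choices[0].message.content.strip()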
APPENDIX D
IDENTIFYING UNIQUE SCENES

As mentioned in the main body of the paper, we define a unique scene as a substantial change to the robot's workspace. For example, a home kitchen may have multiple unique scenes associated with it in which the robot is placed in front of the refrigerator, sink, stove top, or different sections of the counter. We do not consider changes in the objects being interacted with or changes to the poses of external cameras sufficient to constitute a unique scene.
We label unique scenes as follows. During data collection, a scene ID number is generated each time the user indicates that the robot or external cameras are moved. In total, there are 2,080 unique scene IDs in the dataset. Many of these scene IDs correspond to the same scene based on the definition provided above, since users mislabel scene changes or move the robot back to the same scene after moving it somewhere else.
In order to identify these duplicates, we collect scenes into groups that share the same robot serial number, name of the lab collecting the data, and building name. Within each group, we order the scenes by timestamp. We then go through the scenes sequentially, identifying cases where the scene did not change sufficiently to constitute a unique scene. Finally, we search across the remaining set of scenes within each group to identify cases where a robot was placed at the same scene
twice (though not sequentially), and also remove these from the set of unique scenes.
This labeling approach has some limitations. For example, because we group scenes based on robot serial number and only identify duplicates within that group, if two different robots are placed at the same scene then that scene would be counted twice. Nevertheless, during labeling we were conservative in our estimate of what constituted a unique scene, and as a result believe that the number reported in the paper represents a conservative estimate.
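The grouping-and-deduplication pass described above can be summarized by the following sketch. The record fields and the same_scene comparison hook are assumptions that stand in for the manual inspection step.

    # Simplified deduplication pass over scene records; `same_scene` is an
    # assumed hook standing in for the manual/visual comparison described above.
    from collections import defaultdict

    def deduplicate_scenes(records, same_scene):
        """records: dicts with 'robot_serial', 'lab', 'building', 'timestamp'."""
        groups = defaultdict(list)
        for r in records:
            groups[(r["robot_serial"], r["lab"], r["building"])].append(r)

        unique = []
        for group in groups.values():
            group.sort(key=lambda r: r["timestamp"])
            kept = []
            for r in group:
                # sequential check: drop if unchanged w.r.t. the last kept scene
                if kept and same_scene(kept[-1], r):
                    continue
                # non-sequential check: drop repeats anywhere in the kept set
                if any(same_scene(k, r) for k in kept):
                    continue
                kept.append(r)
            unique.extend(kept)
        return unique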
Fig. 12: Qualitative examples of scenes in DROID. We use GPT-4V to categorize scenes into 9 scene types (Bathroom, Bedroom, Closet, Dining Room, Kitchen, Laboratory, Laundry Room, Living Room, Office). DROID contains robot manipulation demonstrations in a wide range of "in-the-wild" scenes across 52 buildings. Please check out the interactive dataset viewer included in the supplementary material to browse the dataset videos.

APPENDIX E
EVALUATION PROCEDURE

We evaluate learned policies on the following 6 tasks, each with their own out of distribution variants. For each evaluation, we ensure that each of the policies sees a similar initial distribution of object locations across trials.
Place Chips on Plate: A short horizon task in a lab setting, where the task is to pick and place a bag of Doritos chips onto a plate, with two distractor objects on the table. All objects and the plate position are randomized between episodes on the table. We collect 50 demonstrations, and mark success if the chips are on the plate. We also consider two out of distribution variants: (1) changing the type of chips to Sun Chips (different size and color) and (2) putting two additional distractor objects (an apple and an orange) on the table.
Put Apple in Pot: A medium horizon task in a lab setting, where the task is to pick and place an apple into a pot and then put a lid on the pot. The apple, pot, and lid position are randomized between episodes on the table. We collect 60 demonstrations, and mark success if the apple is in the pot and the lid is on the pot. The out of distribution variant involves placing an additional plate on the table as a distractor.
Toasting: A medium horizon task in a lab setting, where the task is to put an object on a toaster oven tray, then close the toaster oven. The object and toaster position are randomized between episodes on the table. We collect 150 demonstrations, and mark success if the object is in the toaster oven and the toaster oven is closed. The out of distribution variant consists of considering novel objects to toast.
Closing Waffle Maker: A short horizon task in a lab setting, where the task is to close a waffle maker. The waffle maker position is randomized between episodes. We collect 70 demonstrations, and mark success if the waffle maker is closed. The out of distribution variant consists of adding several distractor objects on the table.
Clean up Desk: A long horizon task in an office setting, where the task is to open a drawer, pick and place an eraser into the drawer, and then close the drawer. The eraser position is varied at the start of each episode at a set schedule of different positions and orientations. We collect 50 demonstrations, and
mark success if the drawer is closed with the eraser in it. The out of distribution variant consists of adding distractor objects on the desk, specifically a calculator, three whiteboard markers, and a clip. We found that adding distractor objects inside the desk caused all policies to fail.
Cook Lentils: A long horizon task in a kitchen setting, where the task is to remove the lid off a pan, pour lentils into the pan, and turn on the stove. The object positions are fixed. We collect 50 demonstrations, and mark success if all 3 stages of the task are successfully completed. The out of distribution variant consists of adding several distractor objects and a camera shift.
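For reference, the evaluation settings above can be collected into a single configuration structure; the sketch below restates them with field names of our own choosing.

    # Per-task evaluation settings restated from the text; keys are our own naming.
    EVAL_TASKS = {
        "place_chips_on_plate": {"horizon": "short",  "setting": "lab",     "num_demos": 50,
                                 "ood": ["sun_chips", "extra_distractors"]},
        "put_apple_in_pot":     {"horizon": "medium", "setting": "lab",     "num_demos": 60,
                                 "ood": ["extra_plate_distractor"]},
        "toasting":             {"horizon": "medium", "setting": "lab",     "num_demos": 150,
                                 "ood": ["novel_objects"]},
        "close_waffle_maker":   {"horizon": "short",  "setting": "lab",     "num_demos": 70,
                                 "ood": ["extra_distractors"]},
        "clean_up_desk":        {"horizon": "long",   "setting": "office",  "num_demos": 50,
                                 "ood": ["desk_distractors"]},
        "cook_lentils":         {"horizon": "long",   "setting": "kitchen", "num_demos": 50,
                                 "ood": ["distractors_and_camera_shift"]},
    }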
APPENDIX F
DIFFUSION POLICY DETAILS

In Section F-A we discuss the policy architecture and hyperparameters used for all policy learning experiments. Then in Section F-B, we describe how the various datasets in the paper are used to construct training batches for policy learning.
Please classify the image into one of the following categories.
Respond with just the category name (do not include the category number).
1. Industrial office: industrial office tables and chairs, conference rooms,
conference TVs
2. Industrial kitchen: industrial refrigerator, sink, coffee maker
3. Industrial dining room: industrial setting with dining tables
4. Home office: desk or desk chairs in a home setting
5. Home kitchen: refrigerator, kitchen sink, kitchen tabletop in a home setting
6. Home dining room: dining table, dining chairs, in a home setting
7. Bedroom: room with a bed
8. Bathroom: Showers, baths, toilets, bathroom sinks
9. Living room: places with couches, armchairs, coffee tables, tvs in a home
setting
10. Hallway / closet: areas between rooms, situations where the robot is
interacting with a door or objects in a closet
11. Other: any other location that does not fit into those categories
12: Unknown: a scene that’s too hard to classify because the image is dark or too
close up

Listing C.1: The prompt provided to GPT4V in order to classify scene types.

Table II: Training Hyperparameters

Hyperparameter                 Value
Batch Size                     128
Optimizer                      Adam
Learning Rate                  1e-4
Learning Rate Scheduler        Linear
Train Steps                    25000
Observation Processing MLP     [1024, 512, 512]
Image Resolution               (128, 128)
Crop Height                    116
Crop Width                     116
Diffusion Method               DDIM
EMA Power                      0.75
U-Net Hidden Layer Sizes       [256, 512, 1024]
Observation Horizon            2
Prediction Horizon             16
Action Horizon                 8

A. Diffusion Policy Architecture and Hyperparameters

We build our diffusion policy [7] training pipeline on the Robomimic codebase [37], which provides high quality implementations of a number of different imitation learning and offline RL algorithms. Given camera observations and a language instruction for the task, within Robomimic, we define observation keys corresponding to each of the two external camera observations, a frozen DistilBERT [45] language embedding, the 3D cartesian position of the gripper, and the gripper state, which measures the degree to which the gripper is closed.
For each of the camera observations, we first downsample each image to a resolution of 128 × 128 and apply color jitter and random cropping as a form of data augmentation. We then use a ResNet-50 visual encoder pre-trained on ImageNet [11] to produce embeddings for each of the visual inputs. These embeddings are directly concatenated with all of the other observation keys. These concatenated features are then fed through an Observation Processing MLP with layers defined in Table II. The output of this MLP is then passed to a U-Net diffusion head which generates action trajectories. We use an observation horizon of 2 to condition the diffusion head, diffuse out 16-step action sequences, and step the first 8 actions open loop before re-running policy inference. All relevant hyperparameters are defined in Table II. In line with prior work [7], we use DDIM to diffuse out action trajectories for improved efficiency.
All experiments use the training hyperparameters in Table II with one exception: for the Cook Lentils task OOD experiment, we train all policies with 50000 training steps due to the increased complexity of the task.
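A condensed, PyTorch-style sketch of the conditioning path described above is given below: per-camera ResNet-50 features are concatenated with the frozen language embedding and proprioceptive inputs and passed through the observation-processing MLP. Module names and any dimensions not listed in Table II are assumptions, and the U-Net diffusion head that consumes the resulting features is omitted.

    # Condensed sketch of the observation-conditioning path; names are
    # illustrative and the U-Net diffusion head that consumes `cond` is omitted.
    import torch
    import torch.nn as nn
    from torchvision.models import resnet50, ResNet50_Weights

    class ObsEncoder(nn.Module):
        def __init__(self, lang_dim=768, proprio_dim=4, hidden=(1024, 512, 512)):
            super().__init__()
            backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
            backbone.fc = nn.Identity()              # 2048-d features per image
            self.backbone = backbone
            in_dim = 2 * 2048 + lang_dim + proprio_dim   # two external cameras
            layers, prev = [], in_dim
            for h in hidden:                          # MLP sizes from Table II
                layers += [nn.Linear(prev, h), nn.ReLU()]
                prev = h
            self.mlp = nn.Sequential(*layers)

        def forward(self, cam1, cam2, lang_emb, proprio):
            # cam*: (B, 3, 116, 116) after resize + random crop; lang_emb: frozen
            # DistilBERT embedding; proprio: 3D gripper position + gripper state.
            feats = [self.backbone(cam1), self.backbone(cam2), lang_emb, proprio]
            return self.mlp(torch.cat(feats, dim=-1))  # conditioning for the U-Net

    enc = ObsEncoder()
    cond = enc(torch.zeros(1, 3, 116, 116), torch.zeros(1, 3, 116, 116),
               torch.zeros(1, 768), torch.zeros(1, 4))
    print(cond.shape)  # torch.Size([1, 512])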
B. Training Batch Construction

For each evaluation task, we train policies with 3 different methods of constructing training batches:
• No Co-training: trains a state of the art diffusion policy [7] using samples from the in-domain demonstrations only.
• DROID (Ours): Trains a diffusion policy, but mixes batches 50/50 between in-domain demonstrations and DROID trajectories. For this experiment, we consider the first 40K successful trajectories in DROID for which language annotations were available at the time of policy training. For the scene diversity experiments, we use a 7362 trajectory subset of these 40K trajectories.
• OXE [39]: Trains a diffusion policy, but mixes batches 50/50 between in-domain demonstrations and trajectories from a curated subset of the Open X-Embodiment
dataset [39] (OXE) used in Octo Model Team et al. [38]. We also omitted data from the language table split of OXE to bring down the number of trajectories to a manageable scale (400K trajectories).
Each of the above settings differ only in the data used to construct each training batch: otherwise all policies have identical architectures and are trained with the same training parameters specified in Section F-A.
The in-domain demonstrations used for policy training consist of only the demonstrations collected for each evaluation task in Section E with one exception: for the Toasting and Close Waffle Maker tasks, one multi-task policy is trained on the combination of their demonstrations. Thus, in this case, the No Co-training policy defined above trains one diffusion policy on the combined in-domain demonstrations, while the Co-training experiments sample batches via a 50/50 split of data from these combined in-domain demonstrations and data from either DROID or OXE [39].
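A minimal sketch of the 50/50 batch mixing used for co-training is shown below; the dataset objects and their indexing interface are assumptions.

    # Minimal 50/50 co-training batch sampler; `in_domain` and `cotrain` are
    # assumed to be indexable datasets of training samples.
    import random

    def sample_cotraining_batch(in_domain, cotrain, batch_size=128, mix=0.5, rng=random):
        n_co = int(batch_size * mix)        # half the batch from DROID / OXE
        n_in = batch_size - n_co            # half from in-domain demonstrations
        batch = [in_domain[rng.randrange(len(in_domain))] for _ in range(n_in)]
        batch += [cotrain[rng.randrange(len(cotrain))] for _ in range(n_co)]
        rng.shuffle(batch)
        return batch

    # Example with toy data:
    batch = sample_cotraining_batch(list(range(100)), list(range(1000, 1100)), batch_size=8)
    print(batch)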
Fig. 13: Camera-to-robot base calibration qualitative results showing randomly picked scenes with synthetically rendered robot masks using PyTorch3D [44]. Renderings are generated by importing the robot's URDF-defined mesh and kinematic structure, applying joint angles to compute the articulated pose, and transforming the mesh to the camera frame using the extrinsic T_cam→base. The extrinsic results are a combination of results from automatic quality assessment-based filtering, outlined in Sec. G-A, and running a tuned CtRNet-X model [34], outlined in Sec. G-B. We provide quality assessment metrics for both approaches in our released extrinsics.

APPENDIX G
AUTOMATIC CAMERA CALIBRATION

In this section, we provide three comprehensive sets of camera calibration matrices for the DROID dataset with their respective quality assessment metrics, including camera-to-base calibrations for 36k unique scenes with one of the cameras calibrated with respect to base, camera-to-camera calibrations for all scenes, and a curated superset of 24k scenes encompassing all three methods and with both cameras calibrated with respect to base, facilitating downstream robust geometric understanding in robotics and 3D perception tasks.
Accurate camera calibration can be very useful in robotics and 3D perception, as it enables the consistent encoding of spatial geometry from visual data. It serves as the backbone for various downstream tasks in robotics manipulation, such as learning viewpoint invariant representations [6, 56] or grounding actions through 3D vision-and-language models [52, 66], thus enabling robotics agents to achieve geometric and visual generalization. In robotic applications, calibration allows for precise scene understanding and interaction by aligning sensor observations to a shared spatial frame.
The DROID dataset provides initial extrinsic parameters (Sec. III) that transform coordinates from the camera frame to the robot base frame. However, these calibrations are not always accurate, primarily due to slight errors that can arise during the manual calibration process, such as imperfect checkerboard placements, variations in lighting conditions, or inaccuracies in the OpenCV calibration procedure performed at the start of each data collection session. Following the data collection efforts outlined in Sec. III, we additionally focus on providing robust calibration values for the collected dataset in an off-line, post-hoc manner. This process utilizes recent advances in deep-learning based perception systems [30, 34, 41, 61] to automatically calibrate the relevant cameras and provide quality metrics in a post-hoc manner.
The following sections detail the automatic post-hoc calibration of the pre-collected DROID dataset. It focuses on two key types of calibration: (i) camera-to-robot base extrinsic calibration, which computes the transformation between a fixed camera and the robot's kinematic base; and (ii) camera-to-camera extrinsic calibration, which estimates the relative pose, i.e., orientation and translation, between two external cameras. Both are essential for fusing multi-view observations as well as for grounding robot actions in 3D, thus enabling spatially grounded robotic behaviors. This section is divided into three sub-sections. We first detail the quality metric assessment of the existing camera-to-robot base calibration (Sec. G-A) provided after the data collection phase in Sec. III. This allows us to filter the extrinsics provided during data collection and provide certain guarantees regarding the calibration already provided. We then explain how to calibrate additional cameras with respect to the base using a fully automatic pipeline [34] while also providing guarantees in terms of a quality metric, i.e., reprojection error (Sec. G-B). Furthermore, we discuss calibrating external cameras with respect to each other in Sec. G-C. Finally, we include a discussion of limitations and future work in Sec. G-D.

Fig. 14: Camera-to-Camera calibration qualitative results showing images, camera poses and pointclouds after our improved off-line and post-hoc camera calibration as discussed in Sec. G-C. Scenes are picked from the top 30% quantile based on the number of matches after calibration (see Fig. 16). External cameras are shown in red and blue. Here, accumulated pointclouds from both views are shown after deprojecting the depth maps using camera intrinsics and accumulating them using the relative camera poses between the two cameras.

A. Quality Assessment of Existing Camera-to-Robot Base Calibration

To evaluate the quality of the existing camera-to-robot base transformation (T_cam→base), we project known 3D keypoints X ∈ R^{N×3}, obtained via forward kinematics for given joint angles θ, into the image plane using the extrinsic matrix T_cam→base and camera intrinsics K. The 2D projections x ∈ R^{N×2} are computed as x = π(K · T_cam→base · X), where π(·) denotes perspective projection followed by normalization.
These 2D keypoints are used to guide a Segment-Anything (SAM) [29] instance segmentation model, which predicts masks M_SAM. Simultaneously, synthetic robot masks M_GT are rendered using PyTorch3D by importing the robot's mesh geometry and kinematic structure defined in its URDF. Each joint angle configuration θ is applied to the URDF to compute the articulated 3D mesh pose of the robot. The resulting mesh is transformed to the camera frame using the same extrinsic transformation T_cam→base. The posed mesh is then rasterized into a binary silhouette using a differentiable renderer with the corresponding camera intrinsics K. This rendered mask serves as the ground-truth projection for evaluating the alignment quality of the predicted segmentation. We compute the Intersection-over-Union (IoU) between the predicted and ground-truth masks as IoU = |M_SAM ∩ M_GT| / |M_SAM ∪ M_GT|. Only SAM masks with confidence scores greater than 0.65 are retained. A final threshold of IoU ≥ 0.7 is used to identify high-quality projections, filtering out poorly aligned frames. We report the mean IoU across 5 equally subsampled frames in a video sequence as a measure of calibration quality. Using this process, we identified a total of around 30k scenes with either the left or right camera well calibrated with respect to the scene. The whole process took around 1 day on 8 A100 Nvidia GPUs.
discussion on limitations and future work in Sec. G-D. Nvidia-GPUs.

A. Quality Assessment of Existing Camera-to-Robot Base B. Automatic Camera-to-Robot Base Calibration


Calibration To supplement the filtering strategy outlined in Sec. G-A
To evaluate the quality of the existing camera-to-robot base and bring in additional cameras for the camera-to-robot base
transformation (Tcam→base ), we project known 3D keypoints calibration, we additionally ran a tuned version of CtRNet-
X ∈ RN ×3 , obtained via forward kinematics for given X [34] out-of-the-box on all of DROID dataset. We used
joint angles θ, into the image plane using the extrinsic the original codebase provided by the authors as well as
matrix Tcam→base and camera intrinsics K. The 2D projections the hyperparameters tuned for the DROID dataset. CtRNet-X
is a feed-forward approach that detects keypoints on the robot using a neural network and matches them with the ground-truth 3D keypoint trajectory in the video. Additionally, it utilizes CLIP [41]-guided robot part detection to dynamically select visible keypoints. Following the authors' implementation, we use a keypoint confidence threshold of 0.08, a CLIP [41] end-effector confidence of 0.1, and a robot base confidence of 0.05. To evaluate the quality of our camera-to-base calibration, we compute the reprojection error between detected 2D keypoints and their corresponding 3D projections using the estimated camera pose and intrinsics. To ensure robustness, we first discard low-confidence 2D observations based on a fixed threshold. We then apply a Median Absolute Deviation (MAD) based outlier rejection strategy: we calculate the median of all reprojection errors, compute the absolute deviation of each error from the median, and identify inliers as those within 2.5 times the MAD. This robust statistical filtering helps suppress the influence of large outliers, leading to a more reliable estimate of the mean reprojection error. For the final filtering, we select scenes with a mean reprojection error of 20 or less.
Through this process, we identified a total of around 12k scenes which have either the left or right camera correctly calibrated with respect to the base. This process took around 5 days on 8 A100 Nvidia GPUs. Since the number of well-calibrated scenes overlapped between the two strategies, we are able to calibrate around 36k unique scenes with either the left or right camera well calibrated with respect to the scene using the strategies outlined in Sec. G-A and Sec. G-B. Figure 13 shows randomly selected scenes with synthetically rendered robot masks using PyTorch3D [44], qualitatively demonstrating the high accuracy of our filtered camera-to-robot base calibration from both the above-mentioned approaches. Furthermore, we present the distribution of IoU and reprojection errors after applying the improved calibration strategy and filtering. These results are shown in Fig. 17.
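A sketch of the MAD-based inlier selection and mean reprojection error used for this filtering step is given below; array shapes and the confidence input are assumptions.

    # MAD-based outlier rejection over per-keypoint reprojection errors (Sec. G-B).
    import numpy as np

    def mean_reprojection_error(detected_2d, projected_2d, confidence,
                                conf_thresh=0.08, mad_factor=2.5):
        """detected_2d, projected_2d: (N, 2); confidence: (N,). Returns mean inlier error."""
        keep = confidence > conf_thresh                    # drop low-confidence detections
        errors = np.linalg.norm(detected_2d[keep] - projected_2d[keep], axis=1)
        if errors.size == 0:
            return np.inf
        med = np.median(errors)
        mad = np.median(np.abs(errors - med))
        inliers = np.abs(errors - med) <= mad_factor * mad  # within 2.5x the MAD
        return float(errors[inliers].mean()) if inliers.any() else float(errors.mean())

    # A scene is kept when the mean reprojection error is 20 pixels or less.
    def scene_passes(mean_err, thresh=20.0):
        return mean_err <= thresh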
Fig. 17: Distribution of the respective metrics, i.e., IoU and mean reprojection error, after thresholding and filtering with the strategies outlined in Sec. G-A and Sec. G-B, respectively (left: results after Sec. G-A; right: results after Sec. G-B).

C. Automatic Camera-to-Camera Calibration

We utilize the recently released DUSt3R [61] framework for improved camera-to-camera calibration. DUSt3R [61] supports both relative and absolute pose estimation. For relative pose estimation, DUSt3R proposes obtaining 2D–3D correspondences between a query image I_Q and a reference image I_B, followed by PnP-RANSAC [15, 31] using known or estimated intrinsics. The relative pose between I_Q and I_B can also be converted to an absolute pose in world coordinates by aligning predicted pointmaps to a known scale, typically via a ground truth pointmap for I_B. However, this approach still requires scale alignment post-optimization and can suffer from ambiguities due to noise or uncertainty in the predicted geometry.
We modify the pose optimization pipeline of DUSt3R [61] to utilize depth maps and known camera intrinsics as inputs to the optimization pipeline to recover absolute poses at a consistent metric scale. Specifically, we begin by running DUSt3R inference on image pairs to extract dense 3D pointmaps. These predicted pointmaps are aligned to ground truth 3D point clouds (constructed from depth and intrinsics) to compute a global scale factor. We then perform a global optimization step, where the ground truth depth and intrinsics are fixed, and camera poses are refined to minimize the 3D alignment error across the scene. This approach enables accurate, scale-aware absolute pose estimation without relying on post-hoc scale alignment.
By fixing the depth and intrinsics during optimization, we ensure that the recovered poses are globally consistent and metrically accurate. Importantly, our method operates directly on unmodified DUSt3R [61] outputs, requiring no additional training or manual scale correction. Figures 14 and 15 show qualitative improvements in point cloud alignment following this optimization step.

Fig. 15: Camera-to-camera calibration comparison showing images (left), pointclouds from the existing calibration (middle), and pointclouds after our improved calibration (right), as described in Sec. G-C. Our improved calibration is able to handle challenging scenes and produces well-aligned pointclouds from both cameras. Note that the depth maps used to deproject the pointclouds with the camera intrinsics and extrinsics are not shown here.

To assess the quality of the recovered camera-to-camera calibration, we report the number of matched points between views, following the original implementation by [61]. For each image pair, we extract high-confidence 3D points, project them into 2D using known intrinsics, and identify reciprocal nearest neighbors in 3D space as reliable matches. Formally, given pointmaps P_0 and P_1, we define the match set M = {(i, j) | NN(P_0[i]) = j and NN(P_1[j]) = i}. In practice, we qualitatively observe that a higher number of reciprocal 3D matches visually correlates with the geometric quality of the estimated poses (see Fig. 14). Although the number of matches serves as a reasonable proxy for assessing the quality of estimated poses, we observed some false positives in visually cluttered scenes. To enhance robustness, one could lower the filtering threshold or incorporate the quality assessment described in Sec. G-A to further refine the filtering process. Figure 16 shows the distribution of match counts across labs. While some labs exhibit strong geometric consistency, others struggle due to challenging conditions like clutter or poor lighting. For all videos, the first frame was used for pose refinement via our modified DUSt3R [61] pipeline. Future improvements could include ensembling predictions across frames or leveraging temporal consistency to further stabilize pose estimation, or finetuning DUSt3R-like methods on the table-top, cluttered datasets observed in robotics manipulation settings.

Fig. 16: Distribution of matched points after camera-to-camera calibration for each unique lab, along with the cumulative distribution (x-axis: number of matched points; y-axis: count / cumulative). While some labs achieve high-quality correspondence, others struggle to reach the same level, often due to challenging lighting or clutter. The cumulative curve (solid line) highlights the accumulation of matched points across all scenes, helping to identify the top quantile of well-calibrated camera pairs within each lab. These high-confidence matches are especially important as they inform downstream selection of reliable scenes. Note that the first image from each video was used for pose refinement using the modified DUSt3R [61] pipeline described in Sec. G-C.
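The reciprocal nearest-neighbor match count used as the quality proxy above can be computed as in the following sketch; a brute-force search is used here purely for clarity.

    # Reciprocal 3D nearest-neighbor matches between two pointmaps (quality proxy).
    import numpy as np

    def reciprocal_matches(P0, P1):
        """P0: (N, 3), P1: (M, 3) high-confidence 3D points from the two views.
        Returns index pairs (i, j) with NN(P0[i]) = j and NN(P1[j]) = i."""
        d = np.linalg.norm(P0[:, None, :] - P1[None, :, :], axis=-1)  # (N, M) distances
        nn01 = d.argmin(axis=1)      # nearest neighbor of each P0 point in P1
        nn10 = d.argmin(axis=0)      # nearest neighbor of each P1 point in P0
        return [(i, j) for i, j in enumerate(nn01) if nn10[j] == i]

    # The number of matches, len(reciprocal_matches(P0, P1)), is the per-scene
    # proxy; higher counts correlate with better-aligned camera poses.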

D. Limitations and Future Work

Calibrating a large-scale dataset like DROID is a challenging task. To ensure accuracy and provide guarantees at each step, we divided the calibration process into three distinct stages. Since the complete process is fully automatic, there are still some false positives, and future work could look at further improving on these inconsistencies. Part of our camera-to-base calibration relied on running an out-of-the-box model which was trained on the Franka Panda robot. While successful, its zero-shot generalizability to other robots (without requiring further training or finetuning) remains to be seen.
Future work could look at using foundation models to segment out the robot or gripper and estimating keypoints on specific parts of the robot. This could provide a more generalizable solution that could be readily applied to any robot-collected data in-the-wild. Our camera-to-camera calibration relied on a recent paradigm in 3D deep learning, namely the prediction of point-maps. Despite using a modified version of DUSt3R [61], which utilized privileged depth information for pose optimization, it relied on the original checkpoints provided by the authors and hence also inherited the limitations of the original model. Despite decent success, the model at times fails on scenes with clutter and challenging table-top settings with little to no overlap between images. As the quality of these models keeps improving [60, 62], we believe it would be a valuable direction to leverage these improved models for more robust camera-to-camera calibration, particularly in cluttered and low-overlap scenarios where traditional feature matching or earlier models struggle.

E. Conclusion

The approaches outlined in each of the aforementioned sections, i.e., Sec. G-A, Sec. G-B and Sec. G-C, have different guarantees in terms of quality metrics. Hence, for the final calibration release we offer three different sets of camera calibration matrices. The first contains camera-to-robot base calibrations for 36k unique scenes with either the left or right camera calibrated with respect to the robot base; this includes results from the combined calibration methods outlined in Sec. G-A and Sec. G-B. The second set includes all camera-to-camera calibration matrices, i.e., the relative transformations for all scenes in the DROID dataset, computed using the approach outlined in Sec. G-C. Finally, we also release a third set of calibrations which includes a superset of all methods, totaling around 24k scenes with both cameras calibrated with respect to the base and with a mix of the three different approaches and their individual guarantees. In this superset, we use an IoU threshold of 0.6, a reprojection error of 20, and the top 30% quantile based on the number of matches for each stage of calibration described earlier in Sections G-A, G-B and G-C, respectively. We hope this effort is useful for 3D vision and robotics manipulation research and also serves as inspiration for off-line automatic camera calibration of in-the-wild robotics manipulation datasets.
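To illustrate how the released camera-to-base matrices can be consumed, the sketch below applies a 4x4 homogeneous extrinsic to points expressed in the camera frame; loading the released files and the exact matrix convention used in the release are outside the scope of this sketch.

    # Applying a camera-to-robot-base extrinsic (assumed 4x4 homogeneous matrix
    # mapping camera-frame points into the base frame); file loading is omitted.
    import numpy as np

    def camera_to_base(points_cam, T_base_from_cam):
        """points_cam: (N, 3) in camera frame; returns (N, 3) in robot base frame."""
        ph = np.hstack([points_cam, np.ones((points_cam.shape[0], 1))])  # (N, 4)
        return (T_base_from_cam @ ph.T).T[:, :3]

    # Example: an identity extrinsic leaves the points unchanged.
    pts = np.array([[0.1, 0.0, 0.5]])
    print(camera_to_base(pts, np.eye(4)))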
Fig. 18: Distribution of skills, i.e., verbs, for DROID and existing large robot manipulation datasets. Top to bottom: DROID, Bridge V2 [59],
RH20T [14], RT-1 [2]. DROID features a long tail of diverse verb classes that is only matched by Bridge V2, while the RH20T and RT-1
datasets have a more constrained set of skills.
Fig. 19: Distribution of interacted objects in DROID, grouped by category (Furniture, Utensils, Containers, Textile, Stationary, Clothes, Appliances, Personal Care, Hardware, Kitchen Tools, Food, Sports, Toys, Accessories). The robot interacts with a wide range of everyday objects.
Fig. 20: Joint distribution of verbs and interacted objects in DROID. Most objects have a diverse range of interactions that are performed on
them.
