DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset
Alexander Khazatsky∗,1 , Karl Pertsch∗,1,2 , Suraj Nair1,3 , Ashwin Balakrishna3 , Sudeep Dasari4 ,
Siddharth Karamcheti1 , Soroush Nasiriany5 , Mohan Kumar Srirama4 , Lawrence Yunliang Chen2 , Kirsty Ellis6 ,
Peter David Fagan7 , Joey Hejna1 , Masha Itkina3 , Marion Lepert1 , Jason Ma14 , Patrick Tree Miller3 ,
Jimmy Wu8 , Suneel Belkhale1 , Shivin Dass5 , Huy Ha1 , Abraham Lee2 , Youngwoon Lee2,16 , Arhan Jain9 ,
Marius Memmel9 , Sungjae Park10 , Ilija Radosavovic2 , Kaiyuan Wang11 , Albert Zhan6 , Kevin Black2 ,
Cheng Chi1 , Kyle Hatch3 , Shan Lin11 , Jingpei Lu11 , Abdul Rehman7 , Pannag R Sanketi12 ,
Archit Sharma1 , Cody Simpson3 , Quan Vuong12 , Homer Walke2 , Blake Wulfe3 , Ted Xiao12 , Jonathan Yang1 ,
Arefeh Yavary13 , Tony Z. Zhao1 , Christopher Agia1 , Rohan Baijal9 , Mateo Guaman Castro9 , Daphne Chen9 ,
Qiuyu Chen9 , Trinity Chung2 , Jaimyn Drake2 , Ethan Paul Foster1 , Jensen Gao1 , Vitor Guizilini3 ,
David Antonio Herrera1 , Minho Heo10 , Kyle Hsu1 , Jiaheng Hu5 , Muhammad Zubair Irshad3 , Donovon Jackson3 ,
Charlotte Le2 , Yunshuang Li14 , Kevin Lin1 , Roy Lin2 , Zehan Ma2 , Abhiram Maddukuri5 , Suvir Mirchandani1 ,
Daniel Morton1 , Tony Nguyen3 , Abby O’Neill2 , Rosario Scalise9 , Derick Seale3 , Victor Son1 , Stephen Tian1 ,
Andrew Wang2 , Yilin Wu4 , Annie Xie1 , Jingyun Yang1 , Patrick Yin9 , Yunchu Zhang9 ,
Osbert Bastani14 , Glen Berseth6 , Jeannette Bohg1 , Ken Goldberg2 , Abhinav Gupta4 , Abhishek Gupta9 ,
Dinesh Jayaraman14 , Joseph J. Lim10 , Jitendra Malik2 , Roberto Martín-Martín5 , Subramanian Ramamoorthy7 ,
Dorsa Sadigh1 , Shuran Song1,15 , Jiajun Wu1 , Yuke Zhu5 , Thomas Kollar3 , Sergey Levine2 , Chelsea Finn1
[Fig. 1 overview graphic: example DROID scenes (e.g., bathroom, kitchen) and headline statistics: 76k episodes, 564 scenes, 52 buildings, 13 institutions, 86 tasks/verbs.]
Fig. 1: We introduce DROID (Distributed Robot Interaction Dataset), an “in-the-wild” robot manipulation dataset with 76k trajectories or
350 hours of interaction data, collected across 564 scenes, 86 tasks, and 52 buildings over the course of 12 months. Each DROID episode
contains three synchronized RGB camera streams, camera calibration, depth information, and natural language instructions. We demonstrate
that training with DROID leads to policies with higher performance, greater robustness, and improved generalization ability. We open source
the full dataset, pre-trained model checkpoints, and a detailed guide for reproducing our robot setup.
Abstract— The creation of large, diverse, high-quality robot manipulation datasets is an important stepping stone on the path toward more capable and robust robotic manipulation policies. However, creating such datasets is challenging: collecting robot manipulation data in diverse environments poses logistical and safety challenges and requires substantial investments in hardware and human labour. As a result, even the most general robot manipulation policies today are mostly trained on data collected in a small number of environments with limited scene and task diversity. In this work, we introduce DROID (Distributed Robot Interaction Dataset), a diverse robot manipulation dataset with 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 86 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.

∗ Project Co-leads, correspondence to [email protected], [email protected]

Affiliations: 1 Stanford University; 2 University of California, Berkeley; 3 Toyota Research Institute; 4 Carnegie Mellon University; 5 University of Texas, Austin; 6 University of Montreal; 7 University of Edinburgh; 8 Princeton University; 9 University of Washington; 10 Korea Advanced Institute of Science & Technology (KAIST); 11 University of California, San Diego; 12 Google DeepMind; 13 University of California, Davis; 14 University of Pennsylvania; 15 Columbia University; 16 Yonsei University

I. INTRODUCTION

A key feature of robot manipulation policies is their ability to generalize, i.e., their ability to perform a desired manipulation task under new lighting conditions, in new environments, or with new objects. Training policies that are robust to such variations is a crucial step towards the deployment of robots in everyday environments and may bring us closer to every roboticist's dream: robot models that can be downloaded and "just work" when tested on a new robot setup. A central ingredient for training such generalizable policies is diverse training data: in computer vision and natural language processing, training on large and diverse datasets scraped from the internet yields models that work on a wide range of new tasks. Similarly, in robot manipulation, a number of recent works have demonstrated that larger, more diverse robot training datasets enable us to push the envelope on policy generalization, including positive transfer to new objects, instructions, scenes, and embodiments [1, 2, 14, 36, 38, 39, 47, 63]. This suggests that an important stepping stone on the path toward more capable and robust robotic manipulation policies is the creation of large, diverse, high-quality robot manipulation datasets.

However, creating such datasets is challenging: in contrast to vision or language data, training manipulation policies typically requires robot manipulation data with recorded observations and actions, which cannot be easily scraped from the internet. Collecting robot manipulation data in diverse environments poses logistical and safety challenges when moving robots outside of controlled lab environments. Additionally, collecting data at scale requires substantial investments in hardware and human labour for supervision, particularly for collecting demonstration data. As a result, even the most general robot manipulation policies today are mostly trained on data collected in controlled, lab-like environments with limited scene and task diversity. To enable the next level of generalizable robot manipulation policy learning, the robot manipulation community needs more diverse datasets, collected across a wide range of environments and tasks.

In this work, we introduce DROID (Distributed Robot Interaction Dataset), a robot manipulation dataset of unprecedented diversity (see Fig. 1). DROID consists of 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes, 52 buildings and 86 tasks. DROID was collected by 18 research labs in North America, Asia, and Europe over the course of 12 months. To streamline distributed data collection and ensure applicability of the final dataset to a wide range of research settings, all data is collected on the same robot hardware stack based on the popular Franka Panda robot arm. Each episode contains three camera views, depth information, camera calibration, and natural language instructions.

In experiments across 6 tasks and 4 locations, from labs to offices and real households, we find that DROID boosts policy performance, robustness and generalizability by 20% on average over state-of-the-art approaches that leverage existing large-scale robot manipulation datasets [39]. We open-source the full DROID dataset under CC-BY 4.0 license, code for training policies using the dataset, and a detailed guide for reproducing our complete robot software and hardware setup.

II. RELATED WORK

a) Large datasets in machine learning: The rapid progress in machine learning has been closely tied to the construction of large and diverse datasets. Examples include ImageNet [11], Kitti [18], Ego4D [19] and LAION [46] in computer vision, Common Crawl [43] and The Pile [17] in natural language processing, and ShapeNet [5] and Objaverse [9, 10] in 3D modeling. Key to their impact is their size and diversity: by enabling training on larger and more diverse data, they push the capabilities and robustness of machine learning models. With DROID we aim to continue this trend for robot manipulation and provide a large and diverse robot manipulation dataset to spur progress on generalizable policy learning.

b) Robot learning datasets: A number of prior works introduce datasets for robot learning of various sizes and diversity levels (see Table I). Broadly, these can be categorized into datasets collected autonomously via scripted and semi-random behaviors or learned agents [3, 8, 20, 26, 27, 32, 40], and datasets collected via human teleoperation [1, 2, 13, 14, 24, 36, 50, 59]. Multiple works focus on increasing dataset diversity: RH20T [14] collects data across 33 tasks in 7 table-top scenes and BridgeV2 [59] collects data in 24 scenes.¹ While these datasets increase diversity, most of their data is collected in a small number of scenes in a single research lab or building.

More recently, there has been a larger effort on pooling existing robot datasets into a coherent format, the Open X-Embodiment dataset (OXE) [39]. Albeit larger in scale than prior robot datasets, the OXE dataset still consists of individual datasets with few scenes, thus totalling around 300 scenes at the time of writing. Our goal with the DROID dataset is to significantly increase the scene diversity as well as scene realism by collecting data across a wide array of real-world buildings in a diverse set of geographic locations. As a result, DROID contains data from 564 scenes across 52 buildings, a substantial increase compared to any existing robot manipulation dataset.

¹ Note that prior works use various definitions for what constitutes a "task" and what constitutes a "scene". In this work, we use the number of unique verbs extracted from the language instructions to represent the number of tasks, which is more scalable than manually defining tasks [59] yet often more reflective of the behavior diversity than e.g., counting the number of verb-object combinations [2] (see Fig. 3 for DROID's verb distribution as an example). For scenes, we only count a scene as new if there is a substantial change of the robot's workspace, e.g., if it gets transported to a new corner of the kitchen or a new room altogether, but not if only the arrangement of objects in front of the robot or the table cloth changes.
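To illustrate the verb-based task count described in the footnote above, the snippet below sketches one way to extract and de-duplicate verbs from instruction strings. It uses spaCy lemmatization as a stand-in for the GPT-4-based de-duplication mentioned in Fig. 3, so the resulting counts are only an approximation of the paper's numbers; the example instructions are made up.

```python
# Minimal sketch: estimate "task" diversity by counting unique instruction verbs.
# Assumes spaCy and its small English model are installed
# (pip install spacy && python -m spacy download en_core_web_sm).
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def count_unique_verbs(instructions: list[str]) -> Counter:
    verb_counts: Counter = Counter()
    for text in instructions:
        doc = nlp(text.lower())
        # Lemmatize verbs so "picks" / "picking" both collapse to "pick".
        verb_counts.update(tok.lemma_ for tok in doc if tok.pos_ == "VERB")
    return verb_counts

instructions = [
    "Pick up the apple and place it in the pot",
    "Close the waffle maker",
    "Open the drawer and put the eraser inside",
]
counts = count_unique_verbs(instructions)
print(len(counts), counts.most_common(5))  # number of unique verbs ~ task count
```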
Dataset | # Traj. | # Verbs | # Scenes | Lang. Instruct. | Cam. Calibration | Public | Robot Collection
MIME [50] | 8.3k | 20 | 1 | ✗ | ✗ | ✓ | human teleop
RoboTurk [36] | 2.1k | 2 | 1 | ✗ | ✗ | ✓ | human teleop
RoboNet [8] | 162k | n/a | 10 | ✗ | ✗ | ✓ | scripted
MT-Opt [26, 27] | 800k | 2 | 1 | ✗ | ✗ | ✓ | scripted & learned
BridgeData [13] | 7.2k | 4 | 12 | ✓ | ✗ | ✓ | human teleop
BC-Z [24] | 26k | 3 | 1 | ✓ | ✗ | ✗ | human teleop
RT-1 [2] | 130k | 2 | 2 | ✓ | ✗ | ✗ | human teleop
RH20T [14] | 13k² | 33 | 7 | ✓ | ✓ | ✓ | human teleop
RoboSet [1] | 98.5k | 9 | 11 | ✓ | ✗ | ✓ | 30% human / 70% scripted
BridgeData V2 [59] | 60.1k | 82 | 24 | ✓ | ✗ | ✓ | 85% human / 15% scripted
DobbE [47]∗ | 5.6k | 6 | 216 | ✓ | n/a | (✓) | human tool-based
Open X-Embodiment [39]† | 1.4M | 217 | 311 | (✓) | ✗ | (✓) | dataset aggregation
DROID (ours) | 76k | 86 | 564 | ✓ | ✓ | ✓ | human teleop
Table I: Comparison to existing datasets for robot manipulation. "# Scenes" refers to the number of unique robot workspaces, e.g., different kitchens count as different scenes, but rearrangement of objects does not. See Section II for a detailed discussion of the definition of "Tasks" and "Scenes". DROID offers high diversity in both the number of verbs and the number of scenes. ∗ non-robot, tool-based data collection; † not a dataset in itself, but an aggregation of existing datasets, including most previous rows in this table.
Collecting such data "in-the-wild" is more common for robot navigation and autonomous driving [4, 18, 28, 48, 49, 55, 57, 64] and enables training of policies that generalize zero-shot to new environments and even embodiments [48, 49]. With DROID, we take a step towards enabling similar generalization for robotic manipulation policies. Finally, there are some works that leverage cheap, off-the-shelf tools, such as reacher-grabber tools, for data collection, equipping robots with the same tools to allow for zero-shot transfer to the robot [47, 53, 63]. While this simplifies the data collection process, it limits the data to wrist camera viewpoints and may suffer from morphology differences when transferring from human-arm-collected data to robot arm execution. Additionally, DROID has larger scene and task diversity than prior tool-based collection datasets [47].

c) Scalable robot policy learning: Learning robot policies from increasingly large and diverse datasets has been the focus of numerous efforts over the last few years. Initially, these efforts focused in large part on learning from scripted or autonomously collected data [8, 12, 20, 26, 32, 40]. The success of transformer models [58] in natural language processing and computer vision motivated a number of recent works that collected large-scale demonstration datasets and trained transformer-based policies on them [2, 16, 25, 38, 39, 42, 49, 51, 65, 67]. Additionally, recent works suggest that diffusion denoising models [22] are a powerful parametrization for multi-modal action output distributions that combine expressivity with scalability [7, 16, 21, 38, 54]. Our focus with DROID is on introducing a new dataset, not a new policy learning algorithm. As such, we build on existing state-of-the-art diffusion policies [7] for all of our policy learning experiments.

² Fang et al. [14] report 110k trajectories for RH20T, but count each camera stream separately; here we report the number of unique multi-view trajectories, to compare fairly to all other datasets.

III. DROID DATA COLLECTION SETUP

In this work, we introduce DROID (Distributed Robot Interaction Dataset), an open-source robot manipulation dataset that provides very high diversity and variability of scenes, tasks, and objects (see Table I). Diverse and high-quality data is a key ingredient for training generalizable policies, and DROID is designed to deliver both quantity and quality: it contains 76k robot demonstration trajectories, spanning 86 tasks and 564 scenes. It was collected over the course of 12 months in a large, cross-institutional effort with 18 robots and 50 data collectors across 13 institutions. All data is collected on a shared, open-source robot platform.

We are releasing all resources to enable researchers to build upon DROID at https://droid-dataset.github.io. This includes the full dataset under CC-BY 4.0 license, an interactive dataset visualizer, code for training generalizable policies on DROID, pre-trained policy checkpoints, and a detailed guide for reproducing our robot hardware setup and control stack. In this section, we introduce our hardware setup and the data collection protocol.

A. DROID Robot Platform

A crucial component of building the DROID dataset was distributed data collection at 13 institutions around the world: it is what enabled us to collect manipulation data across a large diversity of scenes and tasks. A key challenge in this distributed setup is robot hardware: how can we ensure consistent and reproducible robot control across so many setups, locations and time zones? To streamline the distributed data collection process, we designed the DROID robot platform (see Fig. 2), a hardware platform for data collection that is shared between all institutions, allowing us to quickly set up new data collection units and roll out updates across the whole data collection fleet. It is designed to support easy transportation between scenes and quick adjustment to new scenes and tasks.
[Fig. 2 graphic with labeled components: adjustable Zed 2 stereo cameras, Zed Mini wrist stereo camera, control laptop, Oculus Quest 2 headset for teleop, Robotiq 2F-85 gripper, Franka Panda 7DoF robot arm, portable standing desk.]

Fig. 2: The DROID robot platform. We use the same hardware setup across all 13 institutions to streamline data collection while maximizing portability and flexibility. The setup consists of a Franka Panda 7DoF robot arm, two adjustable Zed 2 stereo cameras, a wrist-mounted Zed Mini stereo camera, and an Oculus Quest 2 headset with controllers for teleoperation. Everything is mounted on a portable, height-adjustable desk for quick scene changes.

We chose the Franka Emika Panda 7 DoF robot arm as the base of our setup since it is widely adopted in the robot research community, reliable, relatively affordable, and was available at most participating institutions. The robot arm is equipped with a Robotiq 2F-85 gripper and is mounted on a height-adjustable standing desk with wheels so it can easily move between scenes and buildings. We record image observations with three synchronized stereo camera streams: two exterior Zed 2 cameras, table-mounted on adjustable tripods to quickly adapt to a new scene layout, and a wrist-mounted Zed Mini camera. We use the Polymetis controller [33] and record actions both in robot joint space and in end-effector space at a control frequency of 15Hz. The setup is completed with the Franka robot control box, a NUC that hosts the Polymetis server, and an Alienware laptop that runs our data collection GUI (see Section III-B). Everything is powered with a single power cable to further simplify changes in location.

For teleoperation, we use the controllers of a Meta Quest 2 headset to control the pose of the arm in 6D space as well as the gripper in continuous space. Over the course of this project we have replicated this setup 18 times across various locations in North America, Asia, and Europe. We provide a thoroughly tested guide to replicate the hardware and software of our setup. We found that the setup is well-suited for data collection and policy learning across a wide range of scenes and tasks.

B. Data Collection Protocol

Our dataset is collected by 50 data collectors across various research institutions. A shared data collection protocol helps streamline data collection, particularly for inexperienced data collectors. When designing the collection protocol for DROID, we focused on the following objectives: (1) preventing common data collection mistakes like "camera cannot see robot" or "teleoperator in camera view", (2) encouraging collection of diverse data, and (3) allowing data collectors to creatively choose scenes and tasks.

Every data collection session starts with moving the robot to a new scene. Data collectors were encouraged to choose scenes that include multiple interesting tasks, numerous interaction objects, and a healthy amount of clutter (see example scenes in Fig. 12). After setting up the robot in the new scene, the data collector chooses views for the 3rd person cameras that can capture a wide range of interesting behaviors in the scene. Then they perform extrinsic camera calibration using a checkerboard and the OpenCV calibration algorithm. Next, the data collector will enter all potential tasks for the current scene into a data collection GUI on the laptop attached to the robot, either by selecting from a list of task options or by typing in free-form task instructions (see Fig. 11 for screenshots of the GUI). During data collection, the GUI will prompt the data collector with a randomly sampled task from this list for each new episode. This way we ensure that there is high coverage of diverse tasks and collection is not biased to easier tasks or closer objects. Additionally, the GUI periodically prompts the data collector to perform randomly sampled "scene augmentations" like nudges to the mobile base, moving and re-calibrating the 3rd person cameras, changing the room lighting, and adding or removing items within the scene. For each trajectory, we record the output of all RGB cameras, relevant low-level state information from the robot, equivalent robot control commands from various popular action spaces, a data collector ID, and the metadata entered in the GUI (see Section B for a detailed list of all features we record). The data collector also marks whether the collected sequence was a success, which we log as part of the metadata. DROID consists of 76k successful episodes; roughly 16k trajectories in our data collection were labeled as "not successful", which we include in our dataset release but do not count towards the size of DROID. A data collector will typically collect up to 100 trajectories or about 20 minutes of interaction data per scene before moving on to a new scene.

During post-processing, we label each episode with natural language commands using crowdsourcing via the tasq.ai data labeling platform. We provide up to three independently labeled instructions per episode from different crowd workers to ensure diversity of annotations.
[Fig. 3 graphic: verb distributions for DROID, Bridge V2, RH20T, and RT-1 (top) and DROID interaction-object counts grouped by category (bottom).]
Fig. 3: Distribution of verbs and objects in DROID. Top: Distribution of verbs after de-duplication with GPT-4. DROID has a long tail of
diverse tasks that span a wide range of behaviors. We also visualize the verb distributions for existing large manipulation datasets and find
that only Bridge V2 [59] has a comparable long tail of skills (for a detailed view of verb distributions for all datasets, see Appendix, Fig. 18).
Bottom: Distribution of objects the robot interacts with in DROID, sorted by category (best viewed zoomed in; for a detailed view, see
Fig. 19).
Since the initial extrinsic calibration parameters, provided through conventional calibration detailed above, may not always be accurate due to factors such as checkerboard misalignment, inconsistent lighting, or errors inherent to the OpenCV calibration method, we address these inaccuracies in Section G. There, we discuss in detail the automatic post-hoc calibration process and provide three comprehensive sets of camera calibration matrices for the DROID dataset, each accompanied by respective quality assessment metrics. These include camera-to-base calibrations for around 36k unique scenes with one camera calibrated relative to the base, camera-to-camera calibrations for all scenes, and a curated superset of 24k scenes covering all three calibration methods with both cameras calibrated relative to the base. These refined calibrations enhance the dataset's suitability for robust geometric understanding in robotics and 3D perception tasks. For more details, please see Section G.

IV. DROID DATASET ANALYSIS

While we have so far referred to DROID and other large-scale robot manipulation datasets as "diverse," there is nuance in what constitutes a diverse robot dataset. Different axes of data diversity will affect the generalization abilities of models trained on the data differently: scene diversity may facilitate generalization to new scenes, while task or camera viewpoint diversity allows for greater generalization to new instructions and camera angles. We will analyze DROID along multiple important axes of diversity and compare it to existing large robot manipulation datasets.

When deciding which axes of generalization to inspect for robot manipulation datasets, it is important to consider which aspects of the problem may change between the training and downstream usage scenarios, i.e., which axes we want manipulation policies to generalize over. This may involve aspects of the scene, task, and robot setup. We identify the following important axes of diversity for closer analysis: task diversity, object diversity, scene diversity, viewpoint diversity, and interaction location diversity. The latter refers to the diversity of 3D locations relative to the robot's base at which interactions with objects occur, an important factor when generalizing to new scene layouts where interactions often need to generalize to new table heights or new parts of the robot's workspace.

We analyze DROID along these axes and compare it to existing large-scale robot manipulation datasets [2, 14, 59]. For each dataset, we run our analysis using one randomly sampled third-person camera frame per episode and the provided language instruction annotations. We find that results are consistent across randomly sampled frames.

We visualize the results of our analysis in Figs. 3 to 6. Overall, we find that DROID significantly increases diversity in tasks, objects, scenes, viewpoints and interaction locations over existing large-scale robot manipulation datasets. A key reason is DROID's data collection protocol (see Section III-B): by collecting data with 50 data collectors in 52 buildings across three continents, switching scenes approximately every 20 minutes during collection, and giving collectors the freedom to freely choose scene-appropriate tasks, we can substantially increase the diversity of scenes, tasks, and objects featured in the dataset. Next, we will describe our analysis for each category in more detail.

a) Task diversity: As explained in Section II, we use the distribution of de-duplicated verbs in a dataset's instructions
[Figure panels (side view): DROID, Bridge V2, RT-1, RH20T.]
Fig. 7: Robot setups for policy evaluation. We cover a wide range of tasks and scenes, from lab evaluations to offices and real households, to
reflect the diversity of use cases in real robot research. Depending on the task we collect between 50 and 150 demonstrations. We describe
each task with out-of-distribution evaluation modifications in parentheses, left to right: Close Waffle Maker: The robot needs to close a
waffle maker (distractor objects). Place Chips on Plate: The robot needs to pick up the chips bag and place it on the provided plate (unseen
chips bag and distractor objects). Put Apple in Pot: The robot needs to pick up the apple, place it in the pot, and close the lid (unseen
distractor object). Toasting: The robot needs to pick up the object, place it in the toaster oven, and close the oven (toast a novel object).
Clean up Desk: The robot needs to open the drawer, place the eraser that is on top of the desk inside the drawer, and close it (distractor
objects on desk and in drawer). Cook Lentils: The robot needs to remove the pan lid, pick up and pour lentils into the pan, and turn on the
stove (add distractor objects).
B. Does DROID Improve Policy Performance and Robustness?

To study if co-training with DROID can enable improved policy learning, we train separate policies for each evaluation task and compare all policies head-to-head in A/B evaluations using 10 rollouts for each task setting and method. To test how DROID and existing datasets affect policy robustness, we evaluate each task and method in two settings: "in-distribution," which reflects the distribution of tasks in the in-domain demonstrations with noise added to the initial robot and object positions, and "out-of-distribution" (OOD), which tests policy robustness, e.g., by introducing distractor objects or switching the manipulated object. We evaluate the following approaches:
• No Co-training: Trains a diffusion policy [7] using the in-domain demonstrations only.
• DROID (Ours): Trains a diffusion policy, but mixes batches 50/50 between in-domain demonstrations and DROID demonstrations.
• OXE [39]: Trains a diffusion policy, but mixes batches 50/50 between in-domain demonstrations and trajectories from the Open X-Embodiment dataset [39] (OXE). OXE contains most of the existing large robot manipulation datasets we compared DROID to in Section IV, as well as a large number of other robot datasets, spanning 22 robot embodiments and approximately 300 scenes total.³

We present the results of our policy evaluations in Fig. 8. Across all tasks, we find that DROID substantially improves policy performance compared to the diffusion policy trained on in-domain data only. Policies co-trained with DROID also perform better than policies that leverage diverse, existing robot datasets in Open X-Embodiment (OXE). Notably, when testing out-of-distribution performance, the No Co-training baseline performs quite poorly while the co-trained policies are much more effective. This difference is especially notable when co-training with DROID, which has the strongest overall performance.

Qualitatively, we find that policies that leverage DROID during training are notably smoother and more precise than other comparisons, particularly in the more challenging out-of-distribution evaluations. For instance, in the OOD setting of the Waffle Closing task, DROID is the only method that consistently reaches for the waffle maker, while the other methods get confused about the task. Similarly, in the multi-step Cook Lentils task, baselines tend to fail after two or sometimes just one step, while co-training with DROID is the only method able to consistently finish all three steps. See Fig. 9 for examples of qualitative task rollouts.

³ We use a curated split of OXE based on Octo Model Team et al. [38], which has been shown to work well for policy learning in prior work [38]. We remove the Language Table dataset [35], equivalent to 5% of the Octo training mix, due to its repetitive scene layouts and tasks, and its raw size, which proved challenging to handle for our training infrastructure.

C. How important is the scene diversity in DROID?

One of the unique benefits of DROID compared to existing robot datasets is its amount of scene diversity. Indeed, we see in Figure 4 that DROID contains far more scene diversity than the next most diverse robot manipulation dataset. While we've seen the benefits of co-training with DROID, can we quantify how much of a role scene diversity plays in improved policy robustness?

To test this, we design an experiment that uses the challenging OOD versions of the evaluation tasks from Section V-A, but compares:
• DROID (7k, 20 Scenes): Selects the 20 scenes from DROID with the most demonstrations each, resulting in 7362 trajectories with comparatively little scene diversity.
• DROID (7k, Diverse Scenes): Uniform random sample of 7362 successful demonstrations from the DROID dataset, which matches dataset size to the previous method while retaining high scene diversity.

These comparisons use the same 50/50 co-training paradigm with individual task data used in the previous experiment. Hence, this helps establish whether the scene diversity of DROID results in better policy performance than just using 20 scenes while controlling for dataset size.
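A minimal sketch of the 50/50 co-training batch construction described above follows; the sampling helper is illustrative and not the authors' released training code.

```python
# Illustrative 50/50 co-training batch sampler (not the released DROID code).
import random
from typing import List, Sequence

def sample_cotrain_batch(in_domain: Sequence[dict],
                         cotrain: Sequence[dict],
                         batch_size: int = 256,
                         cotrain_fraction: float = 0.5) -> List[dict]:
    """Mix each batch 50/50 between in-domain demos and a co-training dataset
    (DROID or OXE), regardless of the two datasets' relative sizes."""
    n_cotrain = int(round(batch_size * cotrain_fraction))
    n_in_domain = batch_size - n_cotrain
    batch = [random.choice(in_domain) for _ in range(n_in_domain)]
    batch += [random.choice(cotrain) for _ in range(n_cotrain)]
    random.shuffle(batch)
    return batch
```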
Fig. 8: Does DROID Improve Policy Performance and Robustness? We find that across all our evaluation tasks, co-training with DROID
significantly improves both in distribution and OOD performance over both no co-training and co-training with the Open-X dataset. We
compare success rate averaged across all tasks with standard error, and find DROID outperforms the next best method by 22% absolute
success rate in-distribution and by 17% out of distribution.
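The aggregate numbers in Fig. 8 are averages of per-task success rates with a standard error; a small sketch of that bookkeeping is shown below, with made-up task names and counts rather than the paper's raw results.

```python
# Sketch: average success rate and standard error of the mean across tasks.
import numpy as np

successes_per_task = {            # made-up successes out of 10 rollouts per task
    "close_waffle_maker": 8,
    "place_chips_on_plate": 7,
    "put_apple_in_pot": 6,
}
rates = np.array([s / 10 for s in successes_per_task.values()])
mean = rates.mean()
stderr = rates.std(ddof=1) / np.sqrt(len(rates))   # standard error of the mean
print(f"avg success: {mean:.2f} +/- {stderr:.2f}")
```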
In Figure 10 we observe that using the split of the dataset with more diverse scenes yields better performance in the OOD evaluation setting. By comparing Figure 10's individual task performances with the corresponding tasks in Figure 8, we also see that the performance of co-training with the full DROID dataset matches or outperforms the performance with the subsampled dataset on all three tasks. These results suggest that the strength of DROID lies in its size and especially in its diversity.

VI. DISCUSSION

In this work, we introduced DROID (Distributed Robot Interaction Dataset), a new robot manipulation dataset with a large diversity of scenes, tasks, objects and viewpoints. Our dataset analysis in Section IV showed that DROID has an order of magnitude larger scene diversity than existing large robot manipulation datasets, a wide range of tasks, many interaction objects, and diverse viewpoints. Our policy learning evaluations show that DROID is a valuable data resource for improving policy performance and robustness, even in comparison to existing large robot data sources like the Open X-Embodiment dataset [39].

We hope that DROID can be a catalyst for research on general-purpose robot manipulation policies that are able to generalize to a broad range of tasks and scenes. In this work, we showed one example for leveraging DROID to boost policy performance, but there are many open questions about how to best make use of such diverse data: how should we combine DROID with existing large-scale robot datasets and how can we train policies that perform tasks in new scenes without any in-domain data? Can the diverse interaction data in DROID be used to learn better visual representations for robotic control? And in what situations is it helpful to train on the full dataset vs. slices of the data? We hope that DROID can accelerate research on these questions and are excited for how the community will leverage the dataset! We also hope that our open-sourced hardware platform, which already exists in 18 labs around the globe and is easy to reproduce, can improve reproducibility of robot learning research and facilitate future additions to the DROID dataset.

ACKNOWLEDGMENT

We thank the Toyota Research Institute (TRI) for their support in various aspects of this project, from data collection
[Fig. 9 graphic: rollout frames for the Remove Lid, Pick up Lentils, Pour Lentils, and Turn on Stove stages, comparing DROID (ours), OXE, and No Co-Train policies.]
Fig. 9: Representative policy rollout examples on the most challenging “Cook Lentils” task in the kitchen of one of the authors. Qualitatively,
we find that policies co-trained with DROID perform smoother, more precise motions, allowing them to solve long-horizon tasks like lentil
cooking even in the presence of unseen distractor objects. In contrast, policies trained only with in-domain demonstrations or co-trained with
Open-X data [39] struggle with long-horizon tasks and out-of-distribution evaluation settings. See https://droid-dataset.github.io for rollout
videos.
twice (though not sequentially), and also remove these from the set of unique scenes.

This labeling approach has some limitations. For example, because we group scenes based on robot serial number and only identify duplicates within that group, if two different robots are placed at the same scene then that scene would be counted twice. Nevertheless, during labeling we were conservative in our estimate of what constituted a unique scene, and as a result believe that the number reported in the paper represents a conservative estimate.

APPENDIX E
EVALUATION PROCEDURE

We evaluate learned policies on the following 6 tasks, each with their own out-of-distribution variants. For each evaluation, we ensure that each of the policies sees a similar initial distribution of object locations across trials.

Place Chips on Plate: A short horizon task in a lab setting, where the task is to pick and place a bag of Doritos chips onto a plate, with two distractor objects on the table. All objects and the plate position are randomized between episodes on the table. We collect 50 demonstrations, and mark success if the chips are on the plate. We also consider two out of distribution variants: (1) changing the type of chips to Sun Chips (different size and color) and (2) putting two additional distractor objects (an apple and an orange) on the table.

Put Apple in Pot: A medium horizon task in a lab setting, where the task is to pick and place an apple into a pot and then put a lid on the pot. The apple, pot, and lid position are randomized between episodes on the table. We collect 60 demonstrations, and mark success if the apple is in the pot and the lid is on the pot. The out of distribution variant involves placing an additional plate on the table as a distractor.

Toasting: A medium horizon task in a lab setting, where the task is to put an object on a toaster oven tray, then close the toaster oven. The object and toaster position are randomized between episodes on the table. We collect 150 demonstrations, and mark success if the object is in the toaster oven and the toaster oven is closed. The out of distribution variant consists of considering novel objects to toast.

Closing Waffle Maker: A short horizon task in a lab setting, where the task is to close a waffle maker. The waffle maker position is randomized between episodes. We collect 70 demonstrations, and mark success if the waffle maker is closed. The out of distribution variant consists of adding several distractor objects on the table.

Clean up Desk: A long horizon task in an office setting, where the task is to open a drawer, pick and place an eraser into the drawer, and then close the drawer. The eraser position is varied at the start of each episode at a set schedule of different positions and orientations. We collect 50 demonstrations, and
[Fig. 12 graphic: example scenes including a bathroom, a bedroom, and a closet.]
Fig. 12: Qualitative examples of scenes in DROID. We use GPT-4V to categorize scenes into 9 scene types. DROID contains robot manipulation
demonstrations in a wide range of “in-the-wild” scenes across 52 buildings. Please check out the interactive dataset viewer included in the
supplementary material to browse the dataset videos.
mark success if the drawer is closed with the eraser in it. The out of distribution variant consists of adding distractor objects on the desk, specifically a calculator, three whiteboard markers, and a clip. We found that adding distractor objects inside the desk caused all policies to fail.

Cook Lentils: A long horizon task in a kitchen setting, where the task is to remove the lid off a pan, pour lentils into the pan, and turn on the stove. The object positions are fixed. We collect 50 demonstrations, and mark success if all 3 stages of the task are successfully completed. The out of distribution variant consists of adding several distractor objects and a camera shift.

APPENDIX F
DIFFUSION POLICY DETAILS

In Section F-A we discuss the policy architecture and hyperparameters used for all policy learning experiments. Then in Section F-B, we describe how the various datasets in the paper are used to construct training batches for policy learning.
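All policies in these experiments are diffusion policies [7]. As a rough illustration of that policy class only (not the paper's specific architecture, noise schedule, or hyperparameters), a DDPM-style reverse-diffusion action sampler can be sketched as follows; `eps_model` is a placeholder noise-prediction network conditioned on the observation.

```python
# Conceptual DDPM-style action sampling loop for a diffusion policy (illustrative).
import torch

def sample_action(eps_model, obs_embedding, action_dim=7, horizon=16, T=100):
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Start from Gaussian noise over an action chunk of shape (horizon, action_dim).
    a = torch.randn(horizon, action_dim)
    for t in reversed(range(T)):
        eps = eps_model(a, obs_embedding, t)          # predict the noise added at step t
        coef = (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t])
        a = (a - coef * eps) / torch.sqrt(alphas[t])  # posterior mean
        if t > 0:
            a = a + torch.sqrt(betas[t]) * torch.randn_like(a)  # add noise except at t = 0
    return a  # denoised action sequence to execute on the robot
```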
Please classify the image into one of the following categories.
Respond with just the category name (do not include the category number).
1. Industrial office: industrial office tables and chairs, conference rooms,
conference TVs
2. Industrial kitchen: industrial refrigerator, sink, coffee maker
3. Industrial dining room: industrial setting with dining tables
4. Home office: desk or desk chairs in a home setting
5. Home kitchen: refrigerator, kitchen sink, kitchen tabletop in a home setting
6. Home dining room: dining table, dining chairs, in a home setting
7. Bedroom: room with a bed
8. Bathroom: Showers, baths, toilets, bathroom sinks
9. Living room: places with couches, armchairs, coffee tables, tvs in a home
setting
10. Hallway / closet: areas between rooms, situations where the robot is
interacting with a door or objects in a closet
11. Other: any other location that does not fit into those categories
12. Unknown: a scene that’s too hard to classify because the image is dark or too
close up
Listing C.1: The prompt provided to GPT4V in order to classify scene types.
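A sketch of how such a prompt could be sent to a vision-language model for scene classification is shown below. The model name, image path, and the `SCENE_PROMPT` constant are assumptions, and the call uses the OpenAI Python client's chat-completions interface as a stand-in rather than the authors' exact pipeline.

```python
# Hypothetical scene-type classification call (not the authors' exact pipeline).
import base64
from openai import OpenAI

SCENE_PROMPT = "..."  # the category prompt from Listing C.1

def classify_scene(image_path: str, model: str = "gpt-4o") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": SCENE_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    # The prompt asks for just the category name, e.g., "Home kitchen".
    return response.choices[0].message.content.strip()
```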
dataset [39] (OXE) used in Octo Model Team et al. [38]. We also omitted data from the language table split of OXE to bring down the number of trajectories to a manageable scale (400K trajectories).

The above settings differ only in the data used to construct each training batch: otherwise all policies have identical architectures and are trained with the same training parameters specified in Section F-A.

The in-domain demonstrations used for policy training consist of only the demonstrations collected for each evaluation task in Section E, with one exception: for the Toasting and Close Waffle Maker tasks, one multi-task policy is trained on the combination of their demonstrations. Thus, in this case, the No Co-training policy defined above trains one diffusion policy on the combined in-domain demonstrations, while the Co-training experiments sample batches via a 50/50 split of data from these combined in-domain demonstrations and data from either DROID or OXE [39].

APPENDIX G
AUTOMATIC CAMERA CALIBRATION

In this section, we provide three comprehensive sets of camera calibration matrices for the DROID dataset with their respective quality assessment metrics, including camera-to-base calibrations for 36k unique scenes with one of the cameras calibrated with respect to the base, camera-to-camera calibrations for all scenes, and a curated superset of 24k scenes encompassing all three methods with both cameras calibrated with respect to the base, facilitating downstream robust geometric understanding in robotics and 3D perception tasks.

Accurate camera calibration can be very useful in robotics and 3D perception, as it enables the consistent encoding of spatial geometry from visual data. It serves as the backbone for various downstream tasks in robotics manipulation, such as learning viewpoint-invariant representations [6, 56] or grounding actions through 3D vision-and-language models [52, 66], thus enabling robotics agents to achieve geometric and visual generalization. In robotic applications, calibration allows for precise scene understanding and interaction by aligning sensor observations to a shared spatial frame.

The DROID dataset provides initial extrinsic parameters (Sec. III) that transform coordinates from the camera frame to the robot base frame. However, these calibrations are not always accurate, primarily due to slight errors that can arise during the manual calibration process, such as imperfect checkerboard placements, variations in lighting conditions, or inaccuracies in the OpenCV calibration procedure performed at the start of each data collection session.
Fig. 14: Camera-to-camera calibration qualitative results showing images, camera poses, and pointclouds after our improved off-line, post-hoc camera calibration as discussed in Sec. G-C. Scenes are picked from the top 30% quantile based on the number of matches after calibration (see Fig. 16). External cameras are shown in red and blue. The pointclouds from both views are obtained by deprojecting the depth maps using the camera intrinsics and accumulated using the relative camera pose between the two cameras.
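A minimal sketch of the deprojection-and-accumulation step the caption describes: back-projecting a depth map through the camera intrinsics and transforming the resulting points with a relative camera pose. The variable names and the pinhole model are assumptions, not the exact pipeline.

```python
# Sketch: deproject a depth map to a point cloud and fuse two camera views.
import numpy as np

def deproject_depth(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project an (H, W) depth map into an (H*W, 3) point cloud using
    pinhole intrinsics K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)

def accumulate(points_cam2: np.ndarray, T_cam1_from_cam2: np.ndarray,
               points_cam1: np.ndarray) -> np.ndarray:
    """Express camera-2 points in camera-1's frame via a 4x4 relative pose,
    then concatenate them with camera-1's own points."""
    homog = np.hstack([points_cam2, np.ones((len(points_cam2), 1))])
    points_in_cam1 = (T_cam1_from_cam2 @ homog.T).T[:, :3]
    return np.vstack([points_cam1, points_in_cam1])
```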
Following the data collection efforts outlined in Sec. III, we additionally focus on providing robust calibration for the collected dataset in an off-line, post-hoc manner. This process utilizes recent advances in deep-learning-based perception systems [30, 34, 41, 61] to automatically calibrate the relevant cameras and provide quality metrics in a post-hoc manner.

The following sections detail the automatic post-hoc calibration of the pre-collected DROID dataset. It focuses on two key types of calibration: (i) camera-to-robot base extrinsic calibration, which computes the transformation between a fixed camera and the robot's kinematic base; and (ii) camera-to-camera extrinsic calibration, which estimates the relative pose, i.e., orientation and translation, between two external cameras. Both are essential for fusing multi-view observations as well as allowing the grounding of robotic actions in 3D, thus enabling spatially grounded robotic behaviors. This section is divided into 3 sub-sections. We first detail the quality metric assessment of the existing camera-to-robot base calibration (Sec. G-A) provided after the data collection phase in Sec. III. This allows us to filter the extrinsics provided during data collection and provide certain guarantees regarding the calibration already provided. We then explain how to calibrate additional cameras with respect to the base using a fully automatic pipeline [34] while also providing guarantees in terms of a quality metric, i.e., reprojection error (Sec. G-B). Furthermore, we discuss calibrating external cameras with respect to each other in Sec. G-C. Finally, we include a discussion on limitations and future work in Sec. G-D.

The projected 2D keypoints x ∈ R^{N×2} are computed as x = π(K · T_{cam→base} · X), where π(·) denotes perspective projection followed by normalization. These 2D keypoints are used to guide a Segment-Anything (SAM) [29] instance segmentation model, which predicts masks M_SAM. Simultaneously, synthetic robot masks M_GT are rendered using PyTorch3D by importing the robot's mesh geometry and kinematic structure defined in its URDF. Each joint angle configuration θ is applied to the URDF to compute the articulated 3D mesh pose of the robot. The resulting mesh is transformed to the camera frame using the same extrinsic transformation T_{cam→base}. The posed mesh is then rasterized into a binary silhouette using a differentiable renderer with the corresponding camera intrinsics K. This rendered mask serves as the ground-truth projection for evaluating the alignment quality of the predicted segmentation. We compute the Intersection-over-Union (IoU) between the predicted and ground-truth masks as IoU = |M_SAM ∩ M_GT| / |M_SAM ∪ M_GT|. Only SAM masks with confidence scores greater than 0.65 are retained. A final threshold of IoU ≥ 0.7 is used to identify high-quality projections, filtering out poorly aligned frames. We report the mean IoU across 5 equally subsampled frames in a video sequence as a measure of calibration quality. Using this process, we identified a total of around 30k scenes with either the left or right camera well calibrated with respect to the scene. The whole process took around 1 day on 8 A100 NVIDIA GPUs.
Fig. 16: Distribution of matched points after camera-to-camera calibration for each unique lab along with the cumulative distribution. While
some labs achieve high-quality correspondence, others struggle to reach the same level — often due to challenging lighting or clutter. The
cumulative curve (solid line) highlights the accumulation of matched points across all scenes, helping to identify the top quantile of
well-calibrated camera pairs within each lab. These high-confidence matches are especially important as they inform downstream selection of
reliable scenes. Note that the first image from each video was used for pose refinement using the modified DUSt3R [61] pipeline described in
Sec. G-C.
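As a small illustration of the selection the caption describes (keeping camera pairs above a match-count quantile), one could filter scenes as below; the quantile value mirrors the 30% figure mentioned for Fig. 14 and the match counts are made up.

```python
# Sketch: keep the top 30% of scenes by number of cross-camera matches.
import numpy as np

matches_per_scene = {"scene_001": 850, "scene_002": 120, "scene_003": 430}  # made-up counts
counts = np.array(list(matches_per_scene.values()))
threshold = np.quantile(counts, 0.70)   # cutoff for the top 30% quantile
well_calibrated = [s for s, n in matches_per_scene.items() if n >= threshold]
print(well_calibrated)
```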
Fig. 18: Distribution of skills, i.e., verbs, for DROID and existing large robot manipulation datasets. Top to bottom: DROID, Bridge V2 [59],
RH20T [14], RT-1 [2]. DROID features a long tail of diverse verb classes that is only matched by Bridge V2, while the RH20T and RT-1
datasets have a more constrained set of skills.
[Fig. 19 graphic: object counts grouped into categories: Furniture, Utensils, Containers, Textile, Stationary, Clothes, Appliances, Personal Care, Hardware, Kitchen Tools, Food, Sports, Toys, Accessories.]
Fig. 19: Distribution of interacted objects in DROID, grouped by category. The robot interacts with a wide range of everyday objects.
Fig. 20: Joint distribution of verbs and interacted objects in DROID. Most objects have a diverse range of interactions that are performed on
them.