Karl Pertsch

I am a postdoc at UC Berkeley and Stanford University, where I work with Sergey Levine and Chelsea Finn on training robot foundation models. I'm also a member of the technical staff at Physical Intelligence.

I completed my PhD at the University of Southern California (USC), working with Joseph Lim. During my PhD, I was fortunate to intern at Meta AI and spend time as a student researcher at Google Brain with Karol Hausman. Before my PhD, I spent one year as a Fulbright Scholar at the University of Pennsylvania, working with Kostas Daniilidis.

Email  /  Twitter  /  Google Scholar  /  CV  /  LinkedIn

Research

I'm interested in machine learning, reinforcement learning and robotics. At the moment, I am working on training foundation models for robotics. Towards this goal, I focus on three key challenges: (1) building diverse robot datasets, (2) training large-scale robot policies on this data, and (3) developing approaches for scalably evaluating robot foundation models.

Re-Mix: Optimizing Data Mixtures for Large Scale Imitation Learning
Joey Hejna, Chethan Bhateja, Yichen Jian, Karl Pertsch, Dorsa Sadigh
Conference on Robot Learning (CoRL), 2024
paper / code

We develop a scalable approach for optimizing data mixtures for large-scale robot imitation learning, using group distributionally robust optimization. Our approach generates dataset weights for the RT-X data mixture that outperform weights tuned by human experts.
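
For intuition, below is a minimal sketch of a Group-DRO-style mixture-weight update; the function name, the excess-loss formulation, and the example numbers are illustrative and simplified relative to the paper:

    import numpy as np

    def update_mixture_weights(weights, dataset_losses, reference_losses, eta=0.1):
        """One exponentiated-gradient step of a Group-DRO-style update.

        weights:          current mixture weights over datasets (sums to 1)
        dataset_losses:   current training loss of the policy on each dataset
        reference_losses: losses of a fixed reference model, so datasets with
                          high *excess* loss are up-weighted, not just noisy ones
        """
        excess = np.asarray(dataset_losses) - np.asarray(reference_losses)
        new_w = np.asarray(weights) * np.exp(eta * excess)
        return new_w / new_w.sum()  # re-normalize to a valid mixture

    # Hypothetical usage: three datasets, uniform initial mixture.
    w = np.ones(3) / 3
    w = update_mixture_weights(w, dataset_losses=[1.2, 0.8, 0.5],
                               reference_losses=[0.9, 0.7, 0.6])
    print(w)  # datasets with larger excess loss receive more weight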

Robotic Control via Embodied Chain-of-Thought Reasoning
Michal Zawalski*, William Chen*, Karl Pertsch, Oier Mees, Chelsea Finn, Sergey Levine
Conference on Robot Learning (CoRL), 2024
project page / paper / code / models

We propose embodied chain-of-thought reasoning for vision-language-action models (VLAs). By training VLAs to "look and think" before acting, i.e., to predict intermediate "grounded reasoning steps" such as subtasks and object bounding boxes, we enable substantially improved generalization. Our approach increases the performance of OpenVLA on challenging generalization evaluations by 30% without any additional robot data.
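
As a rough illustration, a training target for such a policy interleaves the grounded reasoning steps with the action tokens; the field names, schema, and token values below are hypothetical, not the exact ECoT format:

    def build_ecot_target(instruction, subtask, object_bboxes, gripper_plan, action_tokens):
        """Assemble one training target string for embodied chain-of-thought.

        The policy is trained to emit the intermediate reasoning steps *before*
        the low-level action tokens, so at test time it "looks and thinks" first.
        All field names here are illustrative placeholders.
        """
        bbox_str = "; ".join(f"{name}: {bbox}" for name, bbox in object_bboxes.items())
        reasoning = (
            f"TASK: {instruction}\n"
            f"SUBTASK: {subtask}\n"
            f"VISIBLE OBJECTS: {bbox_str}\n"
            f"MOVE: {gripper_plan}\n"
        )
        return reasoning + "ACTION: " + " ".join(map(str, action_tokens))

    # Hypothetical example (bounding boxes and action tokens are made up):
    example = build_ecot_target(
        instruction="put the carrot in the bowl",
        subtask="grasp the carrot",
        object_bboxes={"carrot": (112, 80, 160, 120), "bowl": (200, 150, 280, 220)},
        gripper_plan="move right and down, then close gripper",
        action_tokens=[31872, 31880, 31865, 31900, 31871, 31902, 31744],
    )
    print(example)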

OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim*, Karl Pertsch*, Siddharth Karamcheti*, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, Chelsea Finn
Conference on Robot Learning (CoRL), 2024
project page / paper / code / models

We introduce OpenVLA, a 7B-parameter open-source vision-language-action model (VLA), pretrained on 970k robot episodes from the Open X-Embodiment dataset. OpenVLA sets a new state of the art for generalist robot manipulation policies. It supports controlling multiple robots out of the box and can be quickly adapted to new robot setups via parameter-efficient fine-tuning. OpenVLA models, code, and training data are fully open-source.
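
As a sketch of what parameter-efficient adaptation can look like, the snippet below attaches LoRA adapters to the released checkpoint via Hugging Face peft; the hyperparameters and target-module choice are assumptions rather than the exact recipe in the code release:

    import torch
    from transformers import AutoModelForVision2Seq, AutoProcessor
    from peft import LoraConfig, get_peft_model

    processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)
    vla = AutoModelForVision2Seq.from_pretrained(
        "openvla/openvla-7b", torch_dtype=torch.bfloat16, trust_remote_code=True
    )

    lora_cfg = LoraConfig(
        r=32,                         # low-rank adapter dimension (illustrative)
        lora_alpha=16,
        lora_dropout=0.0,
        target_modules="all-linear",  # adapt all linear layers of the backbone
    )
    vla = get_peft_model(vla, lora_cfg)
    vla.print_trainable_parameters()  # only a small fraction of weights are trained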

Evaluating Real-World Robot Manipulation Policies in Simulation
Xuanlin Li*, Kyle Hsu*, Jiayuan Gu*, Karl Pertsch, Oier Mees, ..., Sergey Levine, Jiajun Wu, Chelsea Finn, Hao Su, Quan Vuong, Ted Xiao
Conference on Robot Learning (CoRL), 2024
project page / paper / code

We introduce SIMPLER, a collection of simulated environments for manipulation policy evaluation on common real robot setups. We demonstrate strong correlation between policy performance in SIMPLER environments and in the real world through paired sim-and-real evaluations of open-source manipulation policies.
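
The core consistency check is simple: compare per-policy success rates in simulation against the real robot. A minimal sketch with hypothetical numbers:

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical per-policy success rates from paired evaluations.
    real_success = np.array([0.62, 0.45, 0.30, 0.71, 0.15])  # real robot
    sim_success = np.array([0.58, 0.40, 0.33, 0.69, 0.10])   # SIMPLER environments

    r, p_value = pearsonr(real_success, sim_success)
    print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
    # A high correlation indicates that simulated evaluation ranks policies
    # similarly to real-world evaluation.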

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset
Alexander Khazatsky*, Karl Pertsch*, Suraj Nair, ..., Thomas Kollar, Sergey Levine, Chelsea Finn
Robotics: Science and Systems (RSS), 2024
project page / paper / dataset visualizer

We introduce DROID, the most diverse robot manipulation dataset to date. It contains 76k demonstration trajectories or 350 hours of interaction data, collected across 564 scenes and 84 tasks by 50 data collectors in North America, Asia, and Europe over the course of 12 months. We demonstrate that training with DROID leads to policies with higher performance and improved generalization ability. We open source the full dataset, policy learning code, and a detailed guide for reproducing our robot hardware setup.

Octo: An Open-Source Generalist Robot Policy
Dibya Ghosh*, Homer Walke*, Karl Pertsch*, Kevin Black*, Oier Mees*, ..., Dorsa Sadigh, Chelsea Finn, Sergey Levine
Robotics: Science and Systems (RSS), 2024
project page / tech report / code

We introduce Octo, an open-source generalist robot policy trained on 800k robot trajectories. Octo is a large, transformer-based diffusion policy that supports flexible task specifications, observation spaces, and action spaces. It can control a diverse range of robots out of the box and supports efficient fine-tuning to new robot configurations. We release pre-trained checkpoints and our full training and fine-tuning pipelines.
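
For readers unfamiliar with diffusion policies, the sketch below shows a generic DDPM-style denoising loop for sampling an action chunk conditioned on an observation embedding; it is schematic and does not match Octo's exact noise schedule or architecture:

    import torch

    @torch.no_grad()
    def sample_actions(denoise_net, obs_embedding, action_dim, horizon, n_steps=20):
        """Sample an action chunk from a diffusion head by iterative denoising.

        `denoise_net(noisy_actions, t, obs_embedding)` is assumed to predict the
        noise added at diffusion step t (epsilon-prediction); this is a generic
        DDPM ancestral sampler, not Octo's exact implementation.
        """
        betas = torch.linspace(1e-4, 0.02, n_steps)
        alphas = 1.0 - betas
        alpha_bars = torch.cumprod(alphas, dim=0)

        x = torch.randn(1, horizon, action_dim)          # start from pure noise
        for t in reversed(range(n_steps)):
            eps = denoise_net(x, torch.tensor([t]), obs_embedding)
            coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
            x = (x - coef * eps) / torch.sqrt(alphas[t])  # posterior mean
            if t > 0:
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        return x  # predicted action chunk

    # Hypothetical usage with a stand-in network:
    dummy_net = lambda x, t, obs: torch.zeros_like(x)
    actions = sample_actions(dummy_net, obs_embedding=None, action_dim=7, horizon=4)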

Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Open X-Embodiment Collaboration
(Project co-leads: Quan Vuong, Karl Pertsch)
International Conference on Robotics and Automation (ICRA), 2024 (Best Conference Paper Award)
project page / arXiv / dataset

We introduce the Open X-Embodiment Dataset, the largest robot learning dataset to date with 1M+ real robot trajectories, spanning 22 robot embodiments. We train large, transformer-based policies on the dataset (RT-1-X, RT-2-X) and show that co-training with our diverse dataset substantially improves performance.

Cross-Domain Transfer via Semantic Skill Imitation
Karl Pertsch, Ruta Desai, Vikash Kumar, Franziska Meier, Joseph J. Lim, Dhruv Batra, Akshara Rai
Conference on Robot Learning (CoRL), 2022
project page / arXiv / code

We learn a semantic skill policy that enables cross-domain imitation: between robots in different environments, and even from human video to robot. We show that we can learn long-horizon robotic manipulation tasks in a simulated kitchen environment using only three minutes of human video, recorded in my kitchen with a GoPro strapped to my head.

Assisted Teleoperation for Scalable Robot Data Collection
Shivin Dass*, Karl Pertsch*, Hejia Zhang, Youngwoon Lee, Joseph J. Lim, Stefanos Nikolaidis
project page / arXiv / code

We enable scalable robot data collection by assisting human teleoperators with a learned policy. Our approach estimates its uncertainty over future actions to determine when to request user input. In real-world user studies we demonstrate that our system enables more efficient teleoperation with reduced mental load, allowing a single operator to supervise up to four robots in parallel.
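
A minimal sketch of the uncertainty gating idea, using ensemble disagreement as the uncertainty signal (the threshold, ensemble size, and function names are illustrative, not the paper's exact mechanism):

    import numpy as np

    def request_human_input(ensemble_actions, variance_threshold=0.05):
        """Decide whether to hand control back to the human teleoperator.

        `ensemble_actions` holds the actions proposed by an ensemble of policies
        for the current observation (shape: [n_models, action_dim]). High
        disagreement is treated as uncertainty about what to do next.
        """
        variance = np.var(ensemble_actions, axis=0).mean()
        return variance > variance_threshold

    # Hypothetical usage: five ensemble members, 7-DoF action.
    proposals = np.random.normal(loc=0.0, scale=0.1, size=(5, 7))
    if request_human_input(proposals):
        print("Uncertain: pause autonomous assistance and ask the operator.")
    else:
        print("Confident: continue executing the assistive policy.")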

Task-Induced Representation Learning
Jun Yamada, Karl Pertsch, Anisha Gunjal, Joseph J. Lim
International Conference on Learning Representations (ICLR), 2022
project page / arXiv / code

We evaluate the effectiveness of representation learning approaches in visually complex environments with substantial distractors. We compare common unsupervised representation learning approaches to task-induced representations, which leverage task information from prior tasks to learn which parts of the scene are important to model and which can be ignored.

Skill-based Meta-Reinforcement Learning
Taewook Nam, Shao-Hua Sun, Karl Pertsch, Sung Ju Hwang, Joseph J. Lim
International Conference on Learning Representations (ICLR), 2022
project page / arXiv / code

We perform meta-RL on top of skills extracted from large, task-agnostic offline datasets. By combining meta-training tasks with offline data, we meta-learn policies that quickly learn new long-horizon, sparse-reward tasks.

Demonstration-Guided Reinforcement Learning with Learned Skills
Karl Pertsch, Youngwoon Lee, Yue Wu, Joseph J. Lim
Conference on Robot Learning (CoRL), 2021
project page / arXiv / code

We follow long-horizon demonstrations by imitating the demonstrated skills instead of the primitive actions. By using skills learned from large, task-agnostic experience datasets for imitation, our approach SkiLD can seamlessly integrate task-agnostic data & demonstrations via a skill-based learning framework.

Accelerating Reinforcement Learning with Learned Skill Priors
Karl Pertsch, Youngwoon Lee, Joseph J. Lim
Conference on Robot Learning (CoRL), 2020 (Plenary Talk, top 4%)
Workshop on Robot Learning @ NeurIPS, 2020 (Best Paper Runner-up Award)
Deep RL Workshop @ NeurIPS, 2020 (Oral)
project page / arXiv / code

We jointly learn an embedding space of skills and a prior over skills. This skill prior tells us when to use which skill and guides learning on new tasks for effective skill transfer from large offline datasets.
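
In slightly simplified notation, the downstream RL objective replaces the usual entropy bonus with a KL divergence to the learned skill prior p_a(z|s), where z denotes a latent skill:

    \max_{\pi}\; \mathbb{E}_{\pi}\Big[\, \sum_{t} r(s_t, z_t) \;-\; \alpha\, D_{\mathrm{KL}}\big(\pi(z \mid s_t)\,\|\, p_{\mathbf{a}}(z \mid s_t)\big) \Big]

The temperature alpha trades off reward maximization against staying close to skills that the prior deems likely in the current state.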

Motion Planner Augmented Reinforcement Learning for Robot Manipulation in Obstructed Environments
Jun Yamada*, Youngwoon Lee*, Gautam Salhotra, Karl Pertsch, Max Pflueger, Gaurav S. Sukhatme, Joseph J. Lim, Peter Englert
Conference on Robot Learning (CoRL), 2020
project page / arXiv / code

Our approach augments model-free RL agents with motion planning capabilities, enabling them to solve long-horizon manipulation tasks in cluttered environments.

Long-Horizon Visual Planning with Goal-Conditioned Hierarchical Predictors
Karl Pertsch*, Oleh Rybkin*, Frederik Ebert, Chelsea Finn, Dinesh Jayaraman, Sergey Levine
Conference on Neural Information Processing Systems (NeurIPS), 2020
project page / arXiv / video / code

We propose a hierarchical prediction model that predicts sequences by recursive infilling. We use this model to devise a hierarchical planning approach that scales visual MPC to long-horizon tasks with hundreds of time steps.
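
A schematic of the recursive infilling idea (the stand-in model below just averages its inputs; the real model is a learned latent-variable video predictor):

    def predict_by_infilling(infill_model, start_frame, goal_frame, depth):
        """Recursively predict a sequence by infilling midpoints between frames.

        `infill_model(left, right)` is assumed to predict the frame halfway
        between two given frames; with depth d we obtain 2^d - 1 intermediate
        frames. This mirrors the hierarchical prediction idea only schematically.
        """
        if depth == 0:
            return []
        mid = infill_model(start_frame, goal_frame)
        left = predict_by_infilling(infill_model, start_frame, mid, depth - 1)
        right = predict_by_infilling(infill_model, mid, goal_frame, depth - 1)
        return left + [mid] + right

    # Hypothetical usage with a stand-in model that averages its inputs:
    blend = lambda a, b: 0.5 * (a + b)
    frames = predict_by_infilling(blend, start_frame=0.0, goal_frame=1.0, depth=3)
    print(frames)  # 7 intermediate "frames", ordered in time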

Keyframing the Future: Keyframe Discovery for Visual Prediction and Planning
Karl Pertsch*, Oleh Rybkin*, Jingyun Yang, Shenghao Zhou, Kosta Derpanis, Joseph Lim, Kostas Daniilidis, Andrew Jaegle
Conference on Learning for Dynamics and Control, 2020
project page / arXiv / video / poster

We propose a keyframe-based video prediction model that discovers, without supervision, the moments of interesting change (keyframes) in the data. We show that using the predicted keyframes as subgoals for planning improves performance on a simulated pushing task.

Learning what you can do before doing anything
Oleh Rybkin*, Karl Pertsch*, Kosta Derpanis, Kostas Daniilidis, Andrew Jaegle
International Conference on Learning Representations (ICLR), 2019
project page / arXiv / poster

We learn an agent's action space, together with a predictive model, from purely visual observations. The learned model can then be used for model predictive control, requiring orders of magnitude fewer action-annotated videos.

iPose: Instance-Aware 6D Pose Estimation of Partly Occluded Objects
Omid Hosseini Jafari*, Siva Karthik Mustikovela*, Karl Pertsch, Eric Brachmann, Carsten Rother
Asian Conference on Computer Vision (ACCV), 2018

We combine CNN-based regression of dense on-object surface labels with RANSAC-based pose fitting for accurate 6DoF pose estimation of texture-less objects under heavy occlusion.


I borrowed this website layout from here!