An Interactive Agent Foundation Model
Zane Durante * 1 2 § , Bidipta Sarkar * 1 2 § , Ran Gong * 2 3 § , Rohan Taori 1 2 § , Yusuke Noda 2 ,
Paul Tang 1 , Ehsan Adeli 1 , Shrinidhi Kowshika Lakshmikanth 1 , Kevin Schulman 1 , Arnold Milstein 1 ,
Demetri Terzopoulos 3 , Ade Famoti 2 , Noboru Kuno 2 , Ashley Llorens 2 , Hoi Vo 2 † ,
Katsu Ikeuchi 2 † , Li Fei-Fei 1 † , Jianfeng Gao 2 † , Naoki Wake * 2 ▶ , Qiuyuan Huang * 2 ▶
Figure 1. Overview of an Agent AI system that can perceive and act in different domains and applications. Agent AI is emerging as a
promising avenue toward Artificial General Intelligence (AGI). Our model represents an initial step in the development of a model that is
highly capable of human-level reasoning across many tasks and levels of granularity.
*Equal Contribution. ▶Project Lead. †Equal Advisor. §Work done while interning or researching part-time at Microsoft Research, Redmond. 1 Stanford University; 2 Microsoft Research, Redmond; 3 University of California, Los Angeles.

Abstract

The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate the performance of our framework across three separate domains—Robotics, Gaming AI, and Healthcare. Our model demonstrates its ability to generate meaningful and contextually relevant outputs in each area. The strength of our approach lies in its generality, leveraging a variety of data sources such as robotics sequences, gameplay data, large-scale video datasets, and textual information for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.

1. Introduction

The development of AI systems that can not only gather useful sensory information, but also interact with their environments in meaningful ways has been a long-time goal for AI researchers. One key advantage of developing generalist AI systems is that of training a single neural model across many tasks and data modalities, an approach which is highly scalable via data, compute, and model parameters (Reed et al., 2022). With recent significant advances surrounding general-purpose foundation models (Bommasani et al., 2021), the AI community has a new set of tools for developing generalist, action-taking AI systems en route to artificial general intelligence. Despite their impressive results across various AI benchmarks, large foundation models frequently hallucinate the presence of objects and actions in scenes and infer factually incorrect information (Rawte et al., 2023; Peng et al., 2023). We posit that one of the key reasons these foundation models hallucinate is their lack of grounding in the environments in which they are trained (e.g., large-scale internet data instead of physical or virtual environments). Furthermore, the dominant approach for building multimodal systems is to leverage frozen pre-trained foundation models for each modality and to train smaller layers that allow for cross-modal information passing (Alayrac et al., 2022; Li et al., 2022; 2023d; Dai et al., 2023; Liu et al., 2023). Since the visual- and language-specific submodules are not tuned during multimodal training, any hallucination errors in the submodules will likely be present in the resulting multimodal system. Additionally, the lack of cross-modal pre-training could make grounding information across modalities challenging.
Towards such a generalist model that is grounded and pre-trained within physical or virtual environments, we propose a unified pre-training framework for handling text, visual data, and actions as input. We treat each input type as separate tokens and pre-train our model to predict masked tokens across all three modalities. Our approach uses pre-trained language models and pre-trained visual-language models to effectively initialize our model with pre-trained submodules, which we jointly train in our unified framework. We call our approach and resulting model an Interactive Agent Foundation Model, due to its ability to interact with humans and its environment, as well as its visual-language understanding ability, as shown in Figure 1.

In this paper, we show that a 277M-parameter model¹ that is jointly pre-trained across 13.4M video frames from several distinct domains and data sources can effectively engage in interactive multi-modal settings using text, video, images, dialogue, captioning, visual question answering, and embodied actions within four disparate virtual environments. In order to effectively evaluate the broad range of capabilities and generalization abilities of our model, we show results across distinct domains: (1) Robotics, (2) Gaming AI, and (3) Healthcare. Despite using domain-specific visual inputs, text descriptions, and action spaces, our model is effectively able to generalize across all three domains. To facilitate research in this discipline, we plan to release our code and models publicly.

¹We are currently developing an even larger model.

2. Related Work

2.1. Foundation Models

A large number of works have sought to develop general-purpose foundation models based on large-scale pre-training on broad-scale internet data from a variety of sources (Bommasani et al., 2021). Within the field of Natural Language Processing, this generally consists of larger proprietary LLMs (Wang et al., 2022) such as the GPT series (Brown et al., 2020; Min et al., 2022), smaller open-source models such as the LLaMA series (Touvron et al., 2023), or instruction-tuned variants such as Alpaca (Taori et al., 2023) and Vicuna (Zheng et al., 2023). Within the field of computer vision, strategies such as masked auto-encoders (He et al., 2022) and contrastive learning (Radford et al., 2021) are two popular methods for self-supervised learning.

2.2. Multimodal Understanding

Recently, many multimodal models have been developed that seek to learn a relatively small number of parameters to connect large pre-trained visual encoders and language model decoders (that are generally frozen), with representative models including Flamingo (Alayrac et al., 2022), the BLIP series (Li et al., 2022; 2023d; Dai et al., 2023), and LLaVA (Liu et al., 2023). These models are generally trained using the standard language modeling cross-entropy loss on large-scale internet data consisting of visual-text pairs, using a source of data similar to that used to train contrastive dual-encoder models (Radford et al., 2021; Bain et al., 2021; Sun et al., 2023b). Unlike most previous work, we explore training models to predict visual tokens and action tokens in addition to language tokens, and we explicitly train our model for agentic tasks.

2.3. Agent-Based AI

Agent-based AI is distinguished from traditional AI by its need to generate dynamic behaviors that are grounded in an understanding of environmental contexts. Recent research has focused on employing advanced large foundation models to create agent-based AI systems, as surveyed in (Durante et al., 2024). In the field of robotics, for instance, recent studies have highlighted the potential of LLMs/VLMs for enhancing multimodal interactions between robots, environments, and humans. This applies to both manipulation (Jiang et al., 2022; Brohan et al., 2023; 2022; Li et al., 2023e; Ahn et al., 2022; Shah et al., 2023b; Li et al., 2023c; Wake et al., 2023a; Gong et al., 2023a) and navigation (Gadre et al., 2023; Dorbala et al., 2023; Cai et al., 2023; Shah et al., 2023a; Zhou et al., 2023; Dorbala et al., 2022; Liang et al., 2023; Huang et al., 2023). Additionally, significant advances in reinforcement learning have improved agent policy training on top of VLMs/LLMs. Key advancements have been made in areas such as reward design (Yu et al., 2023; Katara et al., 2023; Ma et al., 2023), efficient data collection (Kumar et al., 2023; Du et al., 2023), and the management of long-horizon steps (Xu et al., 2023; Sun et al., 2023a; Li et al., 2023a; Parakh et al., 2023; Wake et al., 2023b). Similarly to robotics, gaming agents require an understanding of visual scenes and textual instructions/feedback (Puig et al., 2023; Li et al., 2021; Srivastava et al., 2022; Gong et al., 2023b). Agent AI in the context of healthcare has focused on text-based interaction with humans by utilizing the capabilities of LLMs/VLMs. Representative applications include diagnostic assistance (Lee et al., 2023; Li et al., 2023b), knowledge retrieval (Peng et al., 2023; Guu et al., 2020), and remote monitoring (Amjad et al., 2023).

3. Agent Paradigm

Recent advancements in AI technology have been remarkable, enabling a reasonable understanding of linguistic and visual information acquired in open-world environments. At this pivotal historical juncture, public interest in embodied agent technology is shifting from research confined to simulations and controlled environments to practical applications in highly uncertain environments.
Figure 2. We propose an Agent AI paradigm for supporting interactive multi-modal generalist agent systems. There are 5 main modules
as shown: (1) Agent in Environment and Perception with task-planning and observation, (2) Agent learning, (3) Memory, (4) Action, and
(5) Cognition and Consciousness (we use “consciousness” to imply a degree of awareness of an agent’s state and surroundings). A key
difference between our approach and some previous interactive strategies is that, after training, the agent’s action will directly impact task
planning, as the agent does not need to receive feedback from the environment to plan its next actions.
For example, consider a scenario where a robot, upon being unboxed, can instantly start communicating with non-expert humans and swiftly adapt to performing household tasks in the home environment. In this section, we define a new paradigm for embodied agents to position our proposed Interactive Agent Foundation Model within the context of this new paradigm.

We define the embodied agent paradigm as "any intelligent agent capable of autonomously taking suitable and seamless action based on sensory input, whether in the physical world or in a virtual or mixed-reality environment representing the physical world" (Figure 2). Importantly, an embodied agent is conceptualized as a member of a collaborative system, where it communicates with humans with its vision-language capabilities and employs a vast set of actions based on the humans' needs. In this manner, embodied agents are expected to mitigate cumbersome tasks in virtual reality and the physical world.

We believe such a system of embodied agents requires at least three key components:

1. Perception that is multi-sensory with fine granularity. Like humans, multi-sensory perception is crucial for agents to understand their environment, such as gaming environments, to accomplish various tasks. In particular, visual perception is useful for agents that can parse the visual world (e.g., images, videos, gameplay).

3. Interaction with humans and environments. Many tasks require multiple rounds of interactions between AI and humans or the environment. Enabling fluent interactions between them would improve the effectiveness and efficiency of completing tasks for AI.

In light of these principles, our proposed Interactive Agent Foundation Model represents preliminary research that focuses on these critical aspects, aiming to develop an embodied agent that functions as a practical assistance system. For an overview of our goals for developing an embodied agent, see Figure 2.

Achieving an embodied agent is not easy, especially considering the complex dynamics of systems with multi-modal observations in the physical world. Despite the advancement of recent LLMs/VLMs, many challenges must be addressed, including but not limited to: 1) unstructured environments, where current visual inputs affect both high-level and low-level actions of the embodied agent given the same goal instruction; 2) open sets of objects, which require the agent's decision-making module to use common sense knowledge that is hard to encode manually; 3) natural language interactions, which require the agent to understand and operate on more than just template-based commands, but also a context of goals, constraints, and partial plans expressed in everyday language. To enable a more comprehensive approach to these complex challenges, the inclusion of researchers and practitioners from a broader range of fields is critical.
(Figure 3 schematic: training data spanning actions and cognition, videos and frames/images without annotation, and language and knowledge, mapped to task-specific outputs such as action prediction, visual question answering, action recognition, and visual captioning.)
Figure 3. Overview of our Interactive Agent framework. Our foundation model is designed to process multi-modal information that
conveys various levels of abstraction. This approach facilitates a comprehensive understanding of the context and environment, thus
ensuring that actions are coherent. By training on a variety of task domains and applications, we develop a versatile foundation model that
can be fine-tuned for executing optimal actions in a variety of contexts, paving the way towards generally intelligent agents.
and better contextual reasoning. Our current work focuses on developing a joint image and video encoder and aligning this joint encoder to existing foundation models. This has several notable benefits: firstly, it allows for the use of action, image, and video data together with language datasets for pre-training. Secondly, it increases the capabilities of the model across a variety of downstream tasks (e.g., video understanding, temporal reasoning, action prediction, interaction with human feedback, etc.). Finally, by using a joint encoder, we can reduce the overall model size (instead of using two separate encoders), which can be useful for edge deployments or in limited computing scenarios such as robotics, gaming, and interactive healthcare tasks.

4.1. Model Architecture

To effectively initialize our model to handle text, visual, and agent tokens as input, we initialize our architecture with two pre-trained submodules. First, we use CLIP ViT-B16 from (Radford et al., 2021) to initialize our visual encoder, denoted Eθ, and we initialize our action and language model, Fϕ, from OPT-125M (Zhang et al., 2022). We encode each frame in a video Vi as visual features Zi = Eθ(Vi). We enable cross-modal information sharing by training an additional linear layer ℓ that transforms the embeddings of our visual encoder Eθ into the token embedding space of our transformer model Fϕ. Thus, given a text prompt W and a single video frame Vi, we can obtain Â, a text-token or action-token prediction, via Â = Fϕ(W, ℓ(Eθ(Vi))). To incorporate prior time steps into our model, we also include the previous actions and visual frames as input during pre-training. For a given time step t, we predict Ât as

\hat{A}_t = F_\phi\big(W, \ell(E_\theta(V_1)), A_1, \ell(E_\theta(V_2)), A_2, \ldots, \ell(E_\theta(V_{t-1})), A_{t-1}, \ell(E_\theta(V_t))\big).    (1)

In practice, due to memory constraints, we only handle the previous M actions and frames, and we update the previous Vi and Ai as a sliding window. In order to more effectively train our visual encoder to predict masked visual tokens, we use sinusoidal positional embeddings, as in (He et al., 2022), instead of the positional embeddings of CLIP. Since we are using relatively small checkpoints, we are able to jointly train our entire model during pre-training, unlike previous visual-language models that largely rely upon frozen submodules and seek to learn an adaptation network for cross-modal alignment (Alayrac et al., 2022; Li et al., 2022; Liu et al., 2023). We show our general process for formatting our input tokens in Figure 4, and describe our pre-training strategy in Section 4.2. For additional details, see Appendix A.
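To make the interleaving in Eq. (1) concrete, the sketch below shows one way the model inputs could be assembled from a text prompt and a sliding window of the M most recent frames and actions. It is a minimal illustration under assumptions, not the released training code; the argument names (text_embeds, frame_embeds, action_embeds) are placeholders standing in for the OPT token embeddings of W, the projected CLIP features ℓ(Eθ(Vi)), and the embedded action tokens Ai.

import torch

def build_agent_inputs(text_embeds, frame_embeds, action_embeds, window=4):
    # text_embeds:   (L_text, D) token embeddings of the instruction W
    # frame_embeds:  list of (L_frame, D) projected visual embeddings l(E(V_i))
    # action_embeds: list of (L_act, D) embeddings of past action tokens A_i
    # window:        M, the number of most recent frames kept in context
    frames = frame_embeds[-window:]
    # Actions lag frames by one step: the action paired with the latest frame
    # is exactly what the model is asked to predict, so it is not included.
    actions = action_embeds[-(window - 1):] if window > 1 else []
    pieces = [text_embeds]
    for i, v in enumerate(frames):
        pieces.append(v)                    # l(E_theta(V_i))
        if i < len(actions):
            pieces.append(actions[i])       # A_i for all but the newest frame
    return torch.cat(pieces, dim=0)         # sequence fed to F_phi to predict A_t

During pre-training, a sequence assembled this way would be passed through Fϕ with a causal mask, so the action tokens for the current frame are predicted from everything that precedes them.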
4.2. Pre-Training Strategy

We pre-train our model on a wide range of robotics and gaming tasks, with each input sample containing text instructions, videos, and action tokens. We notate each sample as a
sequence S = (W, V1, A1, V2, A2, . . . , VT, AT), where W is the sequence of tokens corresponding to the text instruction, Vi is the sequence of image patches corresponding to frame i, and Ai is the sequence of action tokens corresponding to frame i of a video sequence of T frames. We denote wj as the tokens of the text prompt W, and denote the parameters of our model as θ. For each sample, there are three components to the loss function: language modeling, masked image auto-encoding, and action modeling.

Figure 4. Our Unified Tokenization Framework. We propose a general pre-training strategy for predicting input tokens. For text tokens, we use the standard language modeling task with next-token prediction. For actions, we expand the vocabulary of the language model to include special “agent” tokens that represent each of the actions available to the language model. Finally, we incorporate visual tokens into our framework by training a visual encoder to predict masked visual tokens.
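The "agent" tokens described in Figure 4 can be realized by enlarging the language model's vocabulary. The sketch below uses the Hugging Face transformers API with the public OPT-125M checkpoint; the specific action names and bin counts are illustrative placeholders (the actual token inventory is domain-specific, e.g., Minecraft key presses versus quantized gamepad controls), not the paper's exact vocabulary.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder action vocabulary; the real set differs per domain.
ACTION_TOKENS = ["[STARTACTION]", "[ENDOFACTION]", "[lockon]", "[meleeattack]", "[evade]"]
ACTION_TOKENS += [f"[lrot{i}]" for i in range(256)] + [f"[lmag{i}]" for i in range(5)]

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Register action tokens as additional special tokens so the BPE tokenizer
# never splits them, then grow the embedding matrix to match the new vocabulary.
tokenizer.add_special_tokens({"additional_special_tokens": ACTION_TOKENS})
model.resize_token_embeddings(len(tokenizer))

ids = tokenizer("[STARTACTION] [lockon] [meleeattack] [ENDOFACTION]")["input_ids"]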
The language modeling loss is a standard causal language modeling loss that minimizes the negative log-likelihood of each token in the instruction conditioned on prior tokens. The language modeling loss for a particular sample S is

L_{lang}(S) = -\sum_{j=1}^{|W|} \log p_\theta(w_j \mid w_{<j}).    (2)

The masked image auto-encoding loss is generated by randomly masking 75% of the image patches and calculating the mean-squared error between the reconstructed image and the original image in pixel space for the masked image patches. The masked auto-encoder loss for a particular sample S is

L_{mae}(S) = \sum_{t=1}^{T} \lVert U(V_t) - U(D_\theta(E_\theta(M(V_t)))) \rVert_2^2.    (3)

The total loss for a sample combines the language modeling, masked auto-encoding, and action modeling terms, normalized by the total number of tokens in the sample:

L(S) = \frac{L_{lang}(S) + L_{mae}(S) + L_{act}(S)}{|W| + \sum_{t=0}^{T} (|V_t| + |A_t|)}.    (5)

On robotics data, we only use T = 4 frames of video as input, since the tasks are Markovian and therefore do not require long histories to accurately predict the next action. Our gaming data samples use T = 9 frames of video as input, since an observation history is necessary for the partially-observable gaming tasks.
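As a rough illustration of how the per-sample objective in Eq. (5) could be assembled, the sketch below sums a causal language-modeling term, a masked-patch reconstruction term, and an action-modeling term, then normalizes by a token count. It assumes the individual terms are computed with summed (not averaged) reductions and that the visual contribution to the denominator is counted over the masked patches; both are bookkeeping assumptions for illustration, not statements about the released code.

import torch
import torch.nn.functional as F

def sample_loss(text_logits, text_targets,      # (|W|, vocab), (|W|,)
                recon_patches, target_patches,  # (N_masked, patch_dim) each
                act_logits, act_targets):       # (|A|, vocab), (|A|,)
    # Summed losses in the spirit of Eqs. (2) and (3), plus the action term.
    l_lang = F.cross_entropy(text_logits, text_targets, reduction="sum")
    l_mae = F.mse_loss(recon_patches, target_patches, reduction="sum")
    l_act = F.cross_entropy(act_logits, act_targets, reduction="sum")
    # Normalize by the total token count, as in Eq. (5).
    n_tokens = text_targets.numel() + target_patches.shape[0] + act_targets.numel()
    return (l_lang + l_mae + l_act) / n_tokens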
5. Tasks

We believe that a foundation model trained on visual, language, and agent capabilities leads to a powerful and general-purpose tool that significantly impacts a variety of interactive tasks. To evaluate the effectiveness of our approach, we applied the model to three major agent-AI scenarios, encompassing representative downstream tasks: 1) Robotics: human-machine manipulation in the physical world; 2) Gaming: human-machine embodiment in virtual reality; and 3) Healthcare: augmented human-machine interaction in traditional multimodal tasks. For these tasks, the pre-trained model was fine-tuned with specific datasets. As a result, the model demonstrated reasonable and competitive performance in terms of action prediction, visual understanding, natural-language-driven human-machine interaction, gaming, and hospital scene understanding. We outline the task definitions and specific datasets used below.

5.1. Robotics Tasks

For the robotics scenario, we tested the model on language-guided manipulation tasks. To this end, we selected two distinct robotics manipulation datasets: Language-Table (Lynch et al., 2023) and CALVIN (Mees et al., 2022). In the Language-Table dataset, a robot gripper rearranged tabletop objects following language commands. The data were
collected through teleoperation in a simulation, totaling 4.93 million frames. In the CALVIN dataset, a 7-DOF robot manipulator performed manipulation tasks following relatively abstract instructions linked with a series of language commands. We utilized only the data containing language instructions, which amounted to 1.44 million frames. We chose these two datasets to gain insights into the model's performance across two dimensions: language-instruction abstraction and task-step length.

Figure 5. Our robotics and gaming pre-training pipeline. For simplicity, we use the same notation as in Sections 4.1 and 4.2; we represent our text instruction as W, input frames as Vt, our visual encoder and linear projection layer as Eθ and ℓ, respectively, our action and language transformer model as Fϕ, and the predicted actions at time step t as Ât. (The example shown uses the text instruction "separate the blue cube from the green star", processed by the transformer model Fϕ(W, Eθ(V1), Eθ(V2), . . . , Eθ(VT)).)

5.2. Gaming Tasks

Our primary gaming dataset consists of the Minecraft demonstrations collected by contractors in (Baker et al., 2022). In the original dataset, contractors were simply instructed to play Minecraft with no specific goal, and the dataset provided video gameplay synchronized with player actions and inventory metadata. However, since our architecture can leverage text instructions, we use GPT-4V to label videos with more specific instructions. Our prompt to GPT-4V also includes changes in the player's inventory over the video, which we found helped to reduce misclassifications of objects and actions in the video. In total, the Minecraft portion of our pre-training dataset consists of 4.7 million frames.

In addition to Minecraft, we also used a dataset of gameplay from Bleeding Edge, a team-based multiplayer game, which consists of video and synchronized player actions. Similarly, there are no specific instructions provided with the video, so we use GPT-4V to label the videos in our dataset. The Bleeding Edge portion of our pre-training dataset consists of 2.3 million frames across 7 different settings in the game.

Figure 6. A High-level Overview of our Healthcare Tasks. We leveraged nurse-labeled annotations to train our multimodal agent on healthcare data. To adapt our model for visual question answering, we generated additional training data with GPT-4 using the PHI-safe process shown in Appendix B. (The figure shows video input with nurse-labeled annotations and PHI-safe GPT-4-generated data as the two training sources.)

5.3. Healthcare Tasks

Experienced ICU nurses generated captions of extracted 5-10 second video clips depicting common nursing activities in the ICU. We also included routine nursing documentation of important observations based on longer 5-30 minute windows, which included common clinical measures that assist with assessment and treatment of the patient's condition. For the analysis described in this paper, we focused on the RASS (Richmond Agitation-Sedation Scale) score, used to assess the patient's state of agitation and sedation (Sessler et al., 2002), and the bed position, to confirm that the head of the bed is at the proper angle to decrease the chance of acquiring a ventilator-associated pneumonia (Keeley, 2007). Both assessments are recorded frequently in the medical record, and automated documentation has the potential to optimize caretaker time.

In order to fine-tune our model for human interactions in our ICU use case, we leveraged the nurse-provided video-clip captions and clinical documentation to have GPT-4 generate a synthetic video question-answer dataset that was used to expand the capabilities of our model after healthcare fine-tuning. A definite advantage of the GPT-4-generated derivative dataset is that it did not use any confidential patient data and consequently can be made publicly available to train any language-grounded clinical model. Figure 6 provides an overview of the healthcare tasks we evaluated: (1) video captioning, (2) video question answering, and (3) RASS score prediction (which we formulate as an activity recognition problem). For more information about our GPT-4 based question-answer generation procedure, see Appendix B.
Table 1. Results for robotics fine-tuning across tasks on CALVIN and Language-Table, along with their corresponding evaluation metrics.

Model                 | CALVIN 1 Step | 2 Step | 3 Step | 4 Step | 5 Step | Avg. Len. | Language-Table Success Rate
MCIL                  | 37.3          | 2.7    | 0.2    | 0.0    | 0.0    | 0.4       | —
Ours (from scratch)   | 20.6          | 0.8    | 0.0    | 0.0    | 0.0    | 0.214     | 40.0
Ours                  | 64.8          | 29.0   | 12.3   | 4.7    | 1.9    | 1.127     | 42.0
Task: Bleeding Edge
Text instruction: the player is controlling a red robot ... fighting other characters
Start frame: (image)
Predicted Action:    [STARTACTION] [lockon][meleeattack] [lrot162] [lmag4] [ENDOFACTION]
Ground Truth Action: [STARTACTION] [lockon][meleeattack] [lrot160] [lmag4] [ENDOFACTION]

Table 3. Examples of actions predicted by our fine-tuned models for Minecraft (above) and Bleeding Edge (below). More examples are presented in Appendix E.
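The action strings in Table 3 suggest a simple text protocol: a [STARTACTION] ... [ENDOFACTION] span containing discrete button tokens plus quantized joystick rotation and magnitude bins. A hypothetical decoder for evaluation or environment playback might look like the sketch below; the bin-to-angle conversion (256 bins over 360 degrees) is an illustrative assumption rather than a detail taken from the paper.

import re

def parse_action(text):
    # Extract button and quantized joystick tokens from a predicted action string.
    m = re.search(r"\[STARTACTION\](.*?)\[ENDOFACTION\]", text, re.DOTALL)
    if m is None:
        return None
    tokens = re.findall(r"\[([a-z]+\d*)\]", m.group(1))
    action = {"buttons": [], "lrot": None, "lmag": None}
    for tok in tokens:
        if tok.startswith("lrot"):
            action["lrot"] = int(tok[4:]) * 360.0 / 256.0   # hypothetical bin-to-degrees mapping
        elif tok.startswith("lmag"):
            action["lmag"] = int(tok[4:])                   # quantized stick magnitude
        else:
            action["buttons"].append(tok)                   # e.g., lockon, meleeattack
    return action

print(parse_action("[STARTACTION] [lockon][meleeattack] [lrot162] [lmag4] [ENDOFACTION]"))
# {'buttons': ['lockon', 'meleeattack'], 'lrot': 227.8125, 'lmag': 4}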
Table 4. Performance on healthcare text generation and RASS score action recognition, along with the corresponding evaluation metrics. Agent pre-training on robotics and gaming data improves performance for action recognition, but does not improve text generation abilities.

Model                     | Perplexity ↓ | RASS Acc. ↑
CLIP + OPT (frozen)       | 93.3         | 55.4
CLIP + OPT (unfrozen)     | 102.7        | 92.6
Ours (from scratch)       | 100.0        | 70.3
Ours (agent pre-trained)  | 106.3        | 95.7

level controls. While our model is able to output precise movements and actions, GPT-4V only outputs high-level instructions.

Effects of Agent Pre-Training: In Table 2 and Table 4, we demonstrate the effectiveness of our agent pre-training strategy compared to training from scratch and training against an equivalent visual-language baseline. In particular, we show that a commonly used approach for fine-tuning visual-language models by using frozen visual encoders, similar to LLaVA (Liu et al., 2023) or Mini-GPT-4 (Zhu et al., 2023), performs worse than joint fine-tuning for action recognition on our healthcare dataset. Furthermore, our agent pre-training boosts performance for action prediction across all gaming and robotics datasets.
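The frozen-versus-unfrozen comparison in Table 4 amounts to toggling whether the visual encoder receives gradients during fine-tuning. A minimal sketch of that switch is shown below; it assumes a PyTorch model exposing a visual_encoder submodule, which is a naming assumption for illustration only.

def set_visual_encoder_trainable(model, trainable: bool):
    # Freeze (trainable=False) or jointly fine-tune (trainable=True) the visual
    # encoder, mirroring the CLIP + OPT (frozen/unfrozen) baselines above.
    for p in model.visual_encoder.parameters():
        p.requires_grad = trainable
    # Keep frozen modules in eval mode so that, e.g., dropout stays disabled.
    model.visual_encoder.train(trainable)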
8. Conclusion

We introduced an Interactive Agent Foundation Model designed to take text, action, and visual inputs. We found that by pre-training on a mixture of robotics and gaming data, our model is effective in modeling actions across a variety of domains, even showing positive transfer when fine-tuning in unseen domains such as healthcare. The generality of our framework allows it to be broadly applicable across decision-making settings, unlocking new possibilities for generalist agents in multimodal systems.

9. Impact Statement

This paper presents the initial steps on making interactive agents possible through an Interactive Agent Foundation Model. We do not foresee negative societal consequences from presenting and open-sourcing our current work. In particular, the main output of our model is domain-specific actions, such as button inputs for gaming data, making the downstream applications of our model different from those of standard LLMs and VLMs.

In the domain of robotics, we wish to emphasize that our model should not be deployed on real robots without more training and additional safety filters.

In the domain of gaming, downstream applications of our foundation model may have some societal consequences. Smarter, more realistic AI characters could lead to more immersive worlds, which can increase players' enjoyment in games, but may also lead to social withdrawal if not used appropriately. Specifically, more realistic AI characters could potentially lead to video game addiction and players anthropomorphising artificial players. We encourage game developers who build AI agents using our models to mitigate these potential harms by encouraging social interactions between human players and applying appropriate content filters to AI agents.

In the domain of healthcare, we emphasize that our models are not official medical devices and have not gone through rigorous testing in live settings. We strongly discourage using our models for self-prescription. Even as our models improve in future iterations, we strongly encourage keeping a medical practitioner in the loop to ensure that unsafe actions are avoided. As our models continue to develop, we believe that they will be useful to caretakers, especially by automatically forming drafts of documentation and notifying caretakers when patients may need urgent attention.
Finally, we note that the capabilities of agent AI models may significantly change at scale. As we scale our model in terms of architecture, compute, and training data, we will actively monitor its capabilities before releasing new versions publicly.

Acknowledgements

We are especially grateful to Desney Tan, Peter Lee, Doug Burger, Ryen White, Ece Kamar, John Langford, Jonathan Carlson and Microsoft's Office of the CTO (OCTO) for their advice, enormous support, and encouragement. We appreciate the Microsoft gaming team, Microsoft Xbox team, Microsoft 343 team, Kareem Choudhry, Haiyan Zhang, Spencer Perreault, Dave Bignell, Katja Hofmann, Sam Devlin, Shanzheng Tan, and Raluca Georgescu for the gaming data collection and sharing. We thank Bill Dolan, Nebojsa Jojic, Sudha Rao, Adrian Brown, Andrzej Banburski-Fahey, and Jianwei Yang for their early insightful discussions and help with the gaming aspects of our project. We appreciate Kiran Muthabatulla and the MSR Central Engineering (CE) team for their discussion and feedback for the project. The authors gratefully acknowledge the Microsoft HoloLens team, Microsoft Mesh team, and Antonio Criminisi for their generous provision of equipment and project discussions. Finally, we would like to express our genuine appreciation for Jim Jernigan, Ben Huntley, Oleg Losinets, the Microsoft AOAI team, and the GCR team for their Azure-OpenAI endpoint support and their pointers to the literature.

We would also like to thank our colleagues from Stanford's Partnership in AI-assisted Care, who helped inform the medical applications explored in this work. In particular, we would like to thank Amit Kaushal and Roger Bohn for their clinical expertise and guidance. Additionally, we greatly appreciate Zelun Luo, David Dai, and Dev Dash for their participation as actors for our hospital dataset.

This research was supported by Microsoft Research Project Green 2024, Microsoft Research Project Fair 2023, Stanford University, University of California at Los Angeles, the MSR Accelerator team, and the Microsoft OCTO team.

References

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.

Amjad, A., Kordel, P., and Fernandes, G. A review on innovation in healthcare sector (telehealth) through artificial intelligence. Sustainability, 15(8):6655, 2023.

Bain, M., Nagrani, A., Varol, G., and Zisserman, A. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1728–1738, 2021.

Baker, B., Akkaya, I., Zhokov, P., Huizinga, J., Tang, J., Ecoffet, A., Houghton, B., Sampedro, R., and Clune, J. Video PreTraining (VPT): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654, 2022.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. RT-1: Robotics Transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Cai, W., Huang, S., Cheng, G., Long, Y., Gao, P., Sun, C., and Dong, H. Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill. arXiv preprint arXiv:2309.10309, 2023.

Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., Li, B., Fung, P., and Hoi, S. InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.

Dorbala, V. S., Sigurdsson, G., Piramuthu, R., Thomason, J., and Sukhatme, G. S. CLIP-Nav: Using CLIP for zero-shot vision-and-language navigation. arXiv preprint arXiv:2211.16649, 2022.

Dorbala, V. S., Mullen Jr., J. F., and Manocha, D. Can an embodied agent find your "cat-shaped mug"? LLM-based zero-shot object navigation. arXiv preprint arXiv:2303.03480, 2023.
Du, Y., Yang, M., Florence, P., Xia, F., Wahid, A., Ichter, B., Sermanet, P., Yu, T., Abbeel, P., Tenenbaum, J. B., et al. Video language planning. arXiv preprint arXiv:2310.10625, 2023.

Durante, Z., Huang, Q., Wake, N., Gong, R., Park, J. S., Sarkar, B., Taori, R., Noda, Y., Terzopoulos, D., Choi, Y., et al. Agent AI: Surveying the horizons of multimodal interaction. arXiv preprint arXiv:2401.03568, 2024.

Gadre, S. Y., Wortsman, M., Ilharco, G., Schmidt, L., and Song, S. CoWs on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23171–23181, 2023.

Gong, R., Gao, X., Gao, Q., Shakiah, S., Thattai, G., and Sukhatme, G. S. LEMMA: Learning language-conditioned multi-robot manipulation. IEEE Robotics and Automation Letters, 2023a.

Gong, R., Huang, Q., Ma, X., Vo, H., Durante, Z., Noda, Y., Zheng, Z., Zhu, S.-C., Terzopoulos, D., Fei-Fei, L., et al. MindAgent: Emergent gaming interaction. arXiv preprint arXiv:2309.09971, 2023b.

Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pp. 3929–3938. PMLR, 2020.

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. CVPR, 2022.

Huang, C., Mees, O., Zeng, A., and Burgard, W. Visual language maps for robot navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 10608–10615. IEEE, 2023.

Jiang, Y., Gupta, A., Zhang, Z., Wang, G., Dou, Y., Chen, Y., Fei-Fei, L., Anandkumar, A., Zhu, Y., and Fan, L. VIMA: General robot manipulation with multimodal prompts. arXiv, 2022.

Katara, P., Xian, Z., and Fragkiadaki, K. Gen2Sim: Scaling up robot learning in simulation with generative models. arXiv preprint arXiv:2310.18308, 2023.

Keeley, L. Reducing the risk of ventilator-acquired pneumonia through head of bed elevation. Nursing in Critical Care, 12(6):287–294, 2007.

Kumar, K. N., Essa, I., and Ha, S. Words into action: Learning diverse humanoid robot behaviors using language guided iterative motion refinement. arXiv preprint arXiv:2310.06226, 2023.

Lee, P., Bubeck, S., and Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine, 388(13):1233–1239, 2023.

Li, B., Wu, P., Abbeel, P., and Malik, J. Interactive task planning with language models. arXiv preprint arXiv:2310.10645, 2023a.

Li, C., Xia, F., Martín-Martín, R., Lingelbach, M., Srivastava, S., Shen, B., Vainio, K., Gokmen, C., Dharan, G., Jain, T., et al. iGibson 2.0: Object-centric simulation for robot learning of everyday household tasks. arXiv preprint arXiv:2108.03272, 2021.

Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., and Gao, J. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023b.

Li, J., Li, D., Xiong, C., and Hoi, S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022.

Li, J., Gao, Q., Johnston, M., Gao, X., He, X., Shakiah, S., Shi, H., Ghanadan, R., and Wang, W. Y. Mastering robot manipulation with multimodal prompts through pretraining and multi-task fine-tuning. arXiv preprint arXiv:2310.09676, 2023c.

Li, J., Li, D., Savarese, S., and Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023d.

Li, X., Liu, M., Zhang, H., Yu, C., Xu, J., Wu, H., Cheang, C., Jing, Y., Zhang, W., Liu, H., et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023e.

Liang, X., Ma, L., Guo, S., Han, J., Xu, H., Ma, S., and Liang, X. MO-VLN: A multi-task benchmark for open-set zero-shot vision-and-language navigation. arXiv preprint arXiv:2306.10322, 2023.

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning, 2023.

Lynch, C. and Sermanet, P. Language conditioned imitation learning over unstructured data. Robotics: Science and Systems, 2021. URL https://arxiv.org/abs/2005.07648.

Lynch, C., Wahid, A., Tompson, J., Ding, T., Betker, J., Baruch, R., Armstrong, T., and Florence, P. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023.
Ma, Y. J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., and Anandkumar, A. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.

Mees, O., Hermann, L., Rosete-Beas, E., and Burgard, W. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022.

Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022.

Parakh, M., Fong, A., Simeonov, A., Gupta, A., Chen, T., and Agrawal, P. Human-assisted continual robot learning with foundation models. arXiv preprint arXiv:2309.14321, 2023.

Peng, B., Galley, M., He, P., Cheng, H., Xie, Y., Hu, Y., Huang, Q., Liden, L., Yu, Z., Chen, W., et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813, 2023.

Puig, X., Undersander, E., Szot, A., Cote, M. D., Yang, T.-Y., Partsey, R., Desai, R., Clegg, A. W., Hlavac, M., Min, S. Y., et al. Habitat 3.0: A co-habitat for humans, avatars and robots. arXiv preprint arXiv:2310.13724, 2023.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Rawte, V., Sheth, A., and Das, A. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922, 2023.

Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Giménez, M., Sulsky, Y., Kay, J., Springenberg, J. T., et al. A generalist agent. Transactions on Machine Learning Research, 2022.

Sessler, C. N., Gosnell, M. S., Grap, M. J., Brophy, G. M., O'Neal, P. V., Keane, K. A., Tesoro, E. P., and Elswick, R. The Richmond Agitation–Sedation Scale: Validity and reliability in adult intensive care unit patients. American Journal of Respiratory and Critical Care Medicine, 166(10):1338–1344, 2002.

Shah, D., Osiński, B., Levine, S., et al. LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on Robot Learning, pp. 492–504. PMLR, 2023a.

Shah, R., Martín-Martín, R., and Zhu, Y. MUTEX: Learning unified policies from multimodal task specifications. arXiv preprint arXiv:2309.14320, 2023b.

Srivastava, S., Li, C., Lingelbach, M., Martín-Martín, R., Xia, F., Vainio, K. E., Lian, Z., Gokmen, C., Buch, S., Liu, K., et al. BEHAVIOR: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In Conference on Robot Learning, pp. 477–490. PMLR, 2022.

Sun, J., Zhang, Q., Duan, Y., Jiang, X., Cheng, C., and Xu, R. Prompt, plan, perform: LLM-based humanoid control via quantized imitation learning. arXiv preprint arXiv:2309.11359, 2023a.

Sun, Q., Fang, Y., Wu, L., Wang, X., and Cao, Y. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023b.

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Wake, N., Kanehira, A., Sasabuchi, K., Takamatsu, J., and Ikeuchi, K. GPT models meet robotic applications: Co-speech gesturing chat system. arXiv preprint arXiv:2306.01741, 2023a.

Wake, N., Kanehira, A., Sasabuchi, K., Takamatsu, J., and Ikeuchi, K. ChatGPT empowered long-step robot control in various environments: A case application. IEEE Access, 11:95060–95078, 2023b. doi: 10.1109/ACCESS.2023.3310935.

Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.

Xu, M., Huang, P., Yu, W., Liu, S., Zhang, X., Niu, Y., Zhang, T., Xia, F., Tan, J., and Zhao, D. Creative robot tool use with large language models. arXiv preprint arXiv:2310.13065, 2023.

Yu, W., Gileadi, N., Fu, C., Kirmani, S., Lee, K.-H., Arenas, M. G., Chiang, H.-T. L., Erez, T., Hasenclever, L., Humplik, J., et al. Language to rewards for robotic skill synthesis. arXiv preprint arXiv:2306.08647, 2023.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.

Zhou, G., Hong, Y., and Wu, Q. NavGPT: Explicit reasoning in vision-and-language navigation with large language models. arXiv preprint arXiv:2305.16986, 2023.

Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models, 2023.
Appendix
A. Architecture Details
To effectively handle images and video inputs jointly, we use a divided space-time attention similar to (Bain et al., 2021).
We initialize our visual encoder from CLIP ViT-B16 (Radford et al., 2021), and learn temporal attention layers after each
spatial attention layer. We further mask 75% of the image patches (using tubelet masking for videos) during training, and
use a MAE-decoder similar to (He et al., 2022). Gaming and robotics use a frame-level visual encoder so that the agent is
able to observe a continuous stream of tokens and act after every frame. For healthcare, we leverage the video understanding
capabilities of our visual encoder since the tasks are video-level.
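As a concrete, hypothetical illustration of the 75% tubelet masking mentioned above, the sketch below drops the same random subset of spatial patches across every frame of a clip, so a masked patch is hidden for the entire video rather than for a single frame. It is a sketch under assumptions about tensor layout, not the paper's implementation.

import torch

def tubelet_mask(video_patches, mask_ratio=0.75):
    # video_patches: (T, N, D) patch embeddings for T frames with N patches each.
    # Returns the visible patches (T, N_keep, D) and a boolean mask (N,) that is
    # applied identically to every frame (a "tubelet" spans the full clip).
    T, N, D = video_patches.shape
    n_keep = int(N * (1 - mask_ratio))
    keep = torch.randperm(N)[:n_keep]       # same spatial patches kept in all frames
    mask = torch.ones(N, dtype=torch.bool)
    mask[keep] = False                      # True = masked / to be reconstructed
    return video_patches[:, keep, :], mask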
B. GPT-4 Prompting
User prompt: I will give you a caption, a bed angle, and an associated RASS score for a patient. The
caption describes what is happening during the local segment of a video clip (5-10 seconds).
The bed angle describes the position of the bed during the video clip. The RASS score describes the level of
sedation of the patient over a larger 5 minute window.
The RASS score is an integer between -5 and +4. Negative numbers indicate sedation, positive numbers
indicate agitation, and zero indicates a calm, alert patient.
Your task is to create a question and answer pair that is relevant to the caption, the bed angle, and/or the
RASS score. The question should be answerable given the live video feed of the patient. To generate the
question/answer pairs, you must use the caption, bed angle, and RASS score. Please generate your
questions and answers in json format from the RASS score and captions as follows. It is preferable to NOT
ask questions directly related to the bed angle. Do not add any additional text, only the part starting with {
and ending with }.
Output:
{
"question": "What is the clinician doing with the patient?",
"answer": "The clinician is helping the patient up from the bed
and assisting them in walking across the room."
}
Figure 8. Our PHI-safe GPT-4 Prompt for Generating Healthcare QA Examples. By ensuring the usage of non-identifying video
captions and documentation data, we prevent any identifiable patient data leakage to GPT-4 while simultaneously generating additional
visual-language training data. For the particular example shown, we use a RASS score of “0 - Alert and calm”, a caption of “The clinician
is helping the patient up from the bed and then helping them walk across the room.”, and a bed angle of “ > 45°”.
We show our GPT-4 Prompt for Healthcare Visual Question Answering generation in Figure 8, and our GPT-4V Prompt for
gaming instruction generation in Figure 9.
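A hypothetical sketch of how the prompt in Figure 8 could be turned into training pairs is shown below. Here, call_gpt4 is a placeholder for whatever chat-completion client is available (it is not an API defined in the paper), and the JSON handling simply mirrors the "only output the part starting with { and ending with }" instruction in the prompt.

import json

PROMPT_TEMPLATE = (
    "I will give you a caption, a bed angle, and an associated RASS score for a patient. ..."
    "\nCaption: {caption}\nBed angle: {bed_angle}\nRASS score: {rass}"
)

def make_qa_example(caption, bed_angle, rass, call_gpt4):
    # call_gpt4: placeholder function str -> str returning the model's raw reply.
    reply = call_gpt4(PROMPT_TEMPLATE.format(caption=caption, bed_angle=bed_angle, rass=rass))
    # The prompt asks for a bare JSON object, so trim any stray text around it.
    start, end = reply.find("{"), reply.rfind("}")
    qa = json.loads(reply[start:end + 1])
    return {"question": qa["question"], "answer": qa["answer"],
            "caption": caption, "rass": rass}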
GPT-4-Vision Prompt:
These are frames of a video of a Bleeding Edge player ordered from
left to right and top to bottom as a grid. Give a simple, but precise
description of what the player is doing in 1 sentence. Be specific
about important items, entities, and actions. In your description do
not mention specific frame numbers or the name of the game.
Video input:
Output:
The player begins by running around the
map, passing through different
checkpoints and interacting with several
capture points, then fights against an
enemy player, and finally captures an
objective while being attacked by another
enemy.
Figure 9. Our GPT-4V prompt for games like Bleeding Edge that have 3rd person viewpoints and visually complex scenes. In order to
input a large number of frames (48) to GPT-4V, we input the frames as a grid with frame numbers overlaid on each frame (as shown
above).
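Tiling sampled frames into one numbered grid, as described in the Figure 9 caption, can be done with basic image tooling. The sketch below uses Pillow and assumes uniformly sized frames and an 8-column layout, both of which are illustrative choices rather than details taken from the paper.

from PIL import Image, ImageDraw

def frames_to_grid(frames, cols=8):
    # frames: list of equally sized PIL images (e.g., 48 sampled frames).
    # Returns a single grid image with the frame index drawn on each tile.
    w, h = frames[0].size
    rows = (len(frames) + cols - 1) // cols
    grid = Image.new("RGB", (cols * w, rows * h))
    draw = ImageDraw.Draw(grid)
    for i, frame in enumerate(frames):
        x, y = (i % cols) * w, (i // cols) * h
        grid.paste(frame, (x, y))
        draw.text((x + 5, y + 5), str(i), fill="white")  # overlay the frame number
    return grid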
E. Example Outputs
We show examples of our model predicting actions on unseen robotics simulation data in Tables 5 and 6. We show example outputs for healthcare in Table 7, and example outputs for gaming in Tables 8 and 9.
Figure 10. When using GPT-4V to choose actions given a history of frames, we find that it gives reasonable high-level actions but does
not choose precise low-level actions, highlighting the importance of our pre-trained model.
Figure 11. Plot of all components of the training loss over the 100 epochs of pre-training.
Figure 12. Our gaming pre-training pipeline. For simplicity, we use the same notation as in Sections 4.1 and 4.2; we represent our text
instruction as W , input frames as Vt , our visual encoder and linear projection layer as Eθ and ℓ, respectively, our action and language
transformer model as Fϕ and the predicted actions at time step t as Ât .
Table 5. We show 5 unique demonstrations from Language Table, where our model successfully follows the text instruction. In addition to
the high level instruction, we also show the low-level predicted actions of our agent above each frame.
Table 6. We show 5 unique demonstrations from CALVIN, where our model successfully follows the text instruction. In addition to the
high level instruction, we also show the low-level predicted actions of our agent above each frame.
Action Recognition (RASS): 0 - Alert and calm
Table 7. We show 4 demonstrations of our agent model’s outputs on a held-out Healthcare dataset that uses actors instead of actual patients.
We demonstrate our model’s outputs across 3 different tasks: video captioning, visual question answering, and RASS score prediction
(action recognition). Due to the nature of our actor-collected example videos, the model predicts that the patient is awake and calm (RASS
score of 0) for most video clips, despite only 60% of the training data containing RASS score of 0.
Table 8. We show 5 demonstrations from a held-out Minecraft dataset. In addition to the high level instruction, we show the low-level
predicted actions and ground truth actions. We truncate the instructions to show only the parts relevant to the current frames. The most
common errors are slight differences in camera movements and occasionally performing unnecessary actions. Note that sometimes the
ground truth values are not the only valid actions; for instance, the fourth example predicts that the player will click the bottle, which
happens a few frames later in the ground truth trajectory.
Text instruction: a bleeding edge player is controlling a robot character with a sword ... engaging in combat with enemy players ...
Predicted Action:    [STARTACTION] [evade] [lrot236] [lmag4] [ENDOFACTION]
Ground Truth Action: [STARTACTION] [evade] [lrot236] [lmag4] [ENDOFACTION]
Table 9. We show 5 unique demonstrations from a held-out Bleeding Edge dataset. In addition to the high level instruction, we show the
low-level predicted actions and ground truth actions. We truncate the instructions to show only the parts relevant to the current frames.
The most common errors are slight deviations from the precise value of the joysticks, which are naturally noisy. Some other errors include
predicting the wrong type of attack, though this typically happens in situations where multiple attacks are still valid.