An Interactive Agent Foundation Model
Zane Durante * 1 2 § , Bidipta Sarkar * 1 2 § , Ran Gong * 2 3 § , Rohan Taori 1 2 § , Yusuke Noda 2 ,
Paul Tang 1 , Ehsan Adeli 1 , Shrinidhi Kowshika Lakshmikanth 1 , Kevin Schulman 1 , Arnold Milstein 1 ,
Demetri Terzopoulos 3 , Ade Famoti 2 , Noboru Kuno 2 , Ashley Llorens 2 , Hoi Vo 2 † ,
Katsu Ikeuchi 2 † , Li Fei-Fei 1 † , Jianfeng Gao 2 † , Naoki Wake * 2 ▶ , Qiuyuan Huang * 2 ▶
Figure 1. Overview of an Agent AI system that can perceive and act in different domains and applications. Agent AI is emerging as a
promising avenue toward Artificial General Intelligence (AGI). Our model represents an initial step in the development of a model that is
highly capable of human-level reasoning across many tasks and levels of granularity.
*Equal Contribution. ▶Project Lead. †Equal Advisor. §Work done while interning or researching part-time at Microsoft Research, Redmond. 1 Stanford University; 2 Microsoft Research, Redmond; 3 University of California, Los Angeles.

Abstract

The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate the performance of our framework across three separate domains—Robotics, Gaming AI, and Healthcare. Our model demonstrates its ability to generate meaningful and contextually relevant outputs in each area. The strength of our approach lies in its generality, leveraging a variety of data sources such as robotics sequences, gameplay data, large-scale video datasets, and textual information for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.

1. Introduction

The development of AI systems that can not only gather useful sensory information, but also interact with their environments in meaningful ways has been a long-time goal for AI researchers. One key advantage of developing generalist AI systems is that of training a single neural model across many tasks and data modalities, an approach which is highly scalable via data, compute, and model parameters (Reed et al., 2022). With recent significant advances surrounding general-purpose foundation models (Bommasani et al., 2021), the AI community has a new set of tools for developing generalist, action-taking AI systems en route to artificial general intelligence. Despite their impressive results across various AI benchmarks, large foundation models frequently hallucinate the presence of objects and actions in scenes and infer factually incorrect information (Rawte et al., 2023; Peng et al., 2023). We posit that one of the key reasons these foundation models hallucinate is their lack of grounding in the environments in which they are trained (e.g., large-scale internet data instead of physical or virtual environments). Furthermore, the dominant approach for building multimodal systems is to leverage frozen pre-trained foundation models for each modality and to train smaller layers that allow for cross-modal information passing (Alayrac et al., 2022; Li et al., 2022; 2023d; Dai et al., 2023; Liu et al., 2023). Since the visual- and language-specific submodules are not tuned during multimodal training, any hallucination errors in the submodules will likely be present in the resulting multimodal system. Additionally, the lack of cross-modal pre-training could make grounding information across modalities challenging.
Towards such a generalist model that is grounded and pre-trained within physical or virtual environments, we propose a unified pre-training framework for handling text, visual data, and actions as input. We treat each input type as separate tokens and pre-train our model to predict masked tokens across all three modalities. Our approach uses pre-trained language models and pre-trained visual-language models to effectively initialize our model with pre-trained submodules, which we jointly train in our unified framework. We call our approach and resulting model an Interactive Agent Foundation Model, due to its ability to interact with humans and its environment, as well as its visual-language understanding ability, as shown in Figure 1.

In this paper, we show that a 277M-parameter model¹ that is jointly pre-trained across 13.4M video frames from several distinct domains and data sources can effectively engage in interactive multi-modal settings using text, video, images, dialogue, captioning, visual question answering, and embodied actions within four disparate virtual environments. In order to effectively evaluate the broad range of capabilities and generalization abilities of our model, we show results across distinct domains: (1) Robotics, (2) Gaming AI, and (3) Healthcare. Despite using domain-specific visual inputs, text descriptions, and action spaces, our model is effectively able to generalize across all three domains. To facilitate research in this discipline, we plan to release our code and models publicly.

¹We are currently developing an even larger model.

2. Related Work

2.1. Foundation Models

A large number of works have sought to develop general-purpose foundation models based on large-scale pre-training on broad-scale internet data from a variety of sources (Bommasani et al., 2021). Within the field of Natural Language Processing, this generally consists of larger proprietary LLMs (Wang et al., 2022) such as the GPT series (Brown et al., 2020; Min et al., 2022), smaller open-source models such as the LLaMA series (Touvron et al., 2023), or instruction-tuned variants such as Alpaca (Taori et al., 2023) and Vicuna (Zheng et al., 2023). Within the field of computer vision, strategies such as masked auto-encoders (He et al., 2022) and contrastive learning (Radford et al., 2021) are two popular methods for self-supervised learning.

2.2. Multimodal Understanding

Recently, many multimodal models have been developed that seek to learn a relatively small number of parameters to connect large pre-trained visual encoders and language model decoders (that are generally frozen), with representative models including Flamingo (Alayrac et al., 2022), the BLIP series (Li et al., 2022; 2023d; Dai et al., 2023), and LLaVA (Liu et al., 2023). These models are generally trained using the standard language modeling cross-entropy loss on large-scale internet data consisting of visual-text pairs, using a source of data similar to that used to train contrastive dual-encoder models (Radford et al., 2021; Bain et al., 2021; Sun et al., 2023b). Unlike most previous work, we explore training models to predict visual tokens and action tokens in addition to language tokens, and we explicitly train our model for agentic tasks.

2.3. Agent-Based AI

Agent-based AI is distinguished from traditional AI by its need to generate dynamic behaviors that are grounded in an understanding of environmental contexts. Recent research has focused on employing advanced large foundation models to create agent-based AI systems, as surveyed in (Durante et al., 2024). In the field of robotics, for instance, recent studies have highlighted the potential of LLMs/VLMs for enhancing multimodal interactions between robots, environments, and humans. This applies to both manipulation (Jiang et al., 2022; Brohan et al., 2023; 2022; Li et al., 2023e; Ahn et al., 2022; Shah et al., 2023b; Li et al., 2023c; Wake et al., 2023a; Gong et al., 2023a) and navigation (Gadre et al., 2023; Dorbala et al., 2023; Cai et al., 2023; Shah et al., 2023a; Zhou et al., 2023; Dorbala et al., 2022; Liang et al., 2023; Huang et al., 2023). Additionally, significant advances in reinforcement learning have improved agent policy training on top of VLMs/LLMs. Key advancements have been made in areas such as reward design (Yu et al., 2023; Katara et al., 2023; Ma et al., 2023), efficient data collection (Kumar et al., 2023; Du et al., 2023), and the management of long-horizon steps (Xu et al., 2023; Sun et al., 2023a; Li et al., 2023a; Parakh et al., 2023; Wake et al., 2023b). Similarly to robotics, gaming agents require an understanding of visual scenes and textual instructions/feedback (Puig et al., 2023; Li et al., 2021; Srivastava et al., 2022; Gong et al., 2023b). Agent AI in the context of healthcare has focused on text-based interaction with humans by utilizing the capabilities of LLMs/VLMs. Representative applications include diagnostic assistance (Lee et al., 2023; Li et al., 2023b), knowledge retrieval (Peng et al., 2023; Guu et al., 2020), and remote monitoring (Amjad et al., 2023).

3. Agent Paradigm

Recent advancements in AI technology have been remarkable, enabling a reasonable understanding of linguistic and visual information acquired in open-world environments. At this pivotal historical juncture, public interest in embodied agent technology is shifting from research confined to simulations and controlled environments to practical applications in highly uncertain environments.
Figure 2. We propose an Agent AI paradigm for supporting interactive multi-modal generalist agent systems. There are 5 main modules
as shown: (1) Agent in Environment and Perception with task-planning and observation, (2) Agent learning, (3) Memory, (4) Action, and
(5) Cognition and Consciousness (we use “consciousness” to imply a degree of awareness of an agent’s state and surroundings). A key
difference between our approach and some previous interactive strategies is that, after training, the agent’s action will directly impact task
planning, as the agent does not need to receive feedback from the environment to plan its next actions.
For example, consider a scenario where a robot, upon being unboxed, can instantly start communicating with non-expert humans and swiftly adapt to performing household tasks in the home environment. In this section, we define a new paradigm for embodied agents to position our proposed Interactive Agent Foundation Model within the context of this new paradigm.

We define the embodied agent paradigm as "any intelligent agent capable of autonomously taking suitable and seamless action based on sensory input, whether in the physical world or in a virtual or mixed-reality environment representing the physical world" (Figure 2). Importantly, an embodied agent is conceptualized as a member of a collaborative system, where it communicates with humans with its vision-language capabilities and employs a vast set of actions based on the humans' needs. In this manner, embodied agents are expected to mitigate cumbersome tasks in virtual reality and the physical world.

We believe such a system of embodied agents requires at least three key components:

1. Perception that is multi-sensory with fine granularity. Like humans, multi-sensory perception is crucial for agents to understand their environment, such as gaming environments, to accomplish various tasks. In particular, visual perception is useful for agents that can parse the visual world (e.g., images, videos, gameplay).

3. Interaction with humans and environments. Many tasks require multiple rounds of interactions between AI and humans or the environment. Enabling fluent interactions between them would improve the effectiveness and efficiency of completing tasks for AI.

In light of these principles, our proposed Interactive Agent Foundation Model represents preliminary research that focuses on these critical aspects, aiming to develop an embodied agent that functions as a practical assistance system. For an overview of our goals for developing an embodied agent, see Figure 2.

Achieving an embodied agent is not easy, especially considering the complex dynamics of systems with multi-modal observations in the physical world. Despite the advancement of recent LLMs/VLMs, many challenges must be addressed, including but not limited to: 1) unstructured environments, where current visual inputs affect both high-level and low-level actions of the embodied agent given the same goal instruction; 2) open sets of objects, which require the agent's decision-making module to use common sense knowledge that is hard to encode manually; 3) natural language interactions, which require the agent to understand and operate on more than just template-based commands, but also a context of goals, constraints, and partial plans expressed in everyday language. To enable a more comprehensive approach to these complex challenges, the inclusion of researchers and practitioners from a broader range of fields is critical.
(Figure 3 schematic: training data spanning actions and cognition, videos and frames/images without annotation, and language and knowledge, mapped to task-specific outputs such as action prediction, visual question answering, action recognition, and visual captioning.)
Figure 3. Overview of our Interactive Agent framework. Our foundation model is designed to process multi-modal information that
conveys various levels of abstraction. This approach facilitates a comprehensive understanding of the context and environment, thus
ensuring that actions are coherent. By training on a variety of task domains and applications, we develop a versatile foundation model that
can be fine-tuned for executing optimal actions in a variety of contexts, paving the way towards generally intelligent agents.
and better contextual reasoning. Our current work focuses on developing a joint image and video encoder and aligning this joint encoder to existing foundation models. This has several notable benefits: firstly, it allows for the use of action, image, and video data together with language datasets for pre-training. Secondly, it increases the capabilities of the model across a variety of downstream tasks (e.g., video understanding, temporal reasoning, action prediction, interaction with human feedback, etc.). Finally, by using a joint encoder, we can reduce the overall model size (instead of using two separate encoders), which can be useful for edge deployments or in limited computing scenarios such as robotics, gaming, and interactive healthcare tasks.

4.1. Model Architecture

To effectively initialize our model to handle text, visual, and agent tokens as input, we initialize our architecture with two pre-trained submodules. First, we use CLIP ViT-B16 from (Radford et al., 2021) to initialize our visual encoder, denoted Eθ, and we initialize our action and language model, Fϕ, from OPT-125M (Zhang et al., 2022). We encode each frame in a video Vi as visual features Zi = Eθ(Vi). We enable cross-modal information sharing by training an additional linear layer ℓ that transforms the embeddings of our visual encoder Eθ into the token embedding space of our transformer model Fϕ. Thus, given a text prompt W and a single video frame Vi, we can obtain Â, a text-token or action-token prediction, via Â = Fϕ(W, ℓ(Eθ(Vi))). To incorporate prior time steps into our model, we also include the previous actions and visual frames as input during pre-training. For a given time step t, we predict Ât as

\hat{A}_t = F_\phi\big(W, \ell(E_\theta(V_1)), A_1, \ell(E_\theta(V_2)), A_2, \ldots, \ell(E_\theta(V_{t-1})), A_{t-1}, \ell(E_\theta(V_t))\big).    (1)

In practice, due to memory constraints, we only handle the previous M actions and frames, and we update the previous Vi and Ai as a sliding window. In order to more effectively train our visual encoder to predict masked visual tokens, we use sinusoidal positional embeddings, as in (He et al., 2022), instead of the positional embeddings of CLIP. Since we are using relatively small checkpoints, we are able to jointly train our entire model during pre-training, unlike previous visual-language models that largely rely upon frozen submodules and seek to learn an adaptation network for cross-modal alignment (Alayrac et al., 2022; Li et al., 2022; Liu et al., 2023). We show our general process for formatting our input tokens in Figure 4, and describe our pre-training strategy in Section 4.2. For additional details, see Appendix A.
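To make the interleaving in Eq. (1) concrete, the sketch below shows one way the model inputs could be assembled from a text prompt and a sliding window of the M most recent frames and actions. It is a minimal illustration under assumptions, not the released training code; the argument names (text_embeds, frame_embeds, action_embeds) are placeholders standing in for the OPT token embeddings of W, the projected CLIP features ℓ(Eθ(Vi)), and the embedded action tokens Ai.

import torch

def build_agent_inputs(text_embeds, frame_embeds, action_embeds, window=4):
    # text_embeds:   (L_text, D) token embeddings of the instruction W
    # frame_embeds:  list of (L_frame, D) projected visual embeddings l(E(V_i))
    # action_embeds: list of (L_act, D) embeddings of past action tokens A_i
    # window:        M, the number of most recent frames kept in context
    frames = frame_embeds[-window:]
    # Actions lag frames by one step: the action paired with the latest frame
    # is exactly what the model is asked to predict, so it is not included.
    actions = action_embeds[-(window - 1):] if window > 1 else []
    pieces = [text_embeds]
    for i, v in enumerate(frames):
        pieces.append(v)                    # l(E_theta(V_i))
        if i < len(actions):
            pieces.append(actions[i])       # A_i for all but the newest frame
    return torch.cat(pieces, dim=0)         # sequence fed to F_phi to predict A_t

During pre-training, a sequence assembled this way would be passed through Fϕ with a causal mask, so the action tokens for the current frame are predicted from everything that precedes them.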
4.2. Pre-Training Strategy

We pre-train our model on a wide range of robotics and gaming tasks, with each input sample containing text instructions, videos, and action tokens. We notate each sample as a
sequence S = (W, V1, A1, V2, A2, . . . , VT, AT), where W is the sequence of tokens corresponding to the text instruction, Vi is the sequence of image patches corresponding to frame i, and Ai is the sequence of action tokens corresponding to frame i of a video sequence of T frames. We denote wj as the tokens of the text prompt W, and denote the parameters of our model as θ. For each sample, there are three components to the loss function: language modeling, masked image auto-encoding, and action modeling.

Figure 4. Our Unified Tokenization Framework. We propose a general pre-training strategy for predicting input tokens. For text tokens, we use the standard language modeling task with next-token prediction. For actions, we expand the vocabulary of the language model to include special “agent” tokens that represent each of the actions available to the language model. Finally, we incorporate visual tokens into our framework by training a visual encoder to predict masked visual tokens.
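The "agent" tokens described in Figure 4 can be realized by enlarging the language model's vocabulary. The sketch below uses the Hugging Face transformers API with the public OPT-125M checkpoint; the specific action names and bin counts are illustrative placeholders (the actual token inventory is domain-specific, e.g., Minecraft key presses versus quantized gamepad controls), not the paper's exact vocabulary.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder action vocabulary; the real set differs per domain.
ACTION_TOKENS = ["[STARTACTION]", "[ENDOFACTION]", "[lockon]", "[meleeattack]", "[evade]"]
ACTION_TOKENS += [f"[lrot{i}]" for i in range(256)] + [f"[lmag{i}]" for i in range(5)]

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Register action tokens as additional special tokens so the BPE tokenizer
# never splits them, then grow the embedding matrix to match the new vocabulary.
tokenizer.add_special_tokens({"additional_special_tokens": ACTION_TOKENS})
model.resize_token_embeddings(len(tokenizer))

ids = tokenizer("[STARTACTION] [lockon] [meleeattack] [ENDOFACTION]")["input_ids"]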
The language modeling loss is a standard causal language modeling loss that minimizes the negative log-likelihood of each token in the instruction conditioned on prior tokens. The language modeling loss for a particular sample S is

L_{lang}(S) = -\sum_{j=1}^{|W|} \log p_\theta(w_j \mid w_{<j}).    (2)

The masked image auto-encoding loss is generated by randomly masking 75% of the image patches and calculating the mean-squared error between the reconstructed image and the original image in pixel space for the masked image patches. The masked auto-encoder loss for a particular sample S is

L_{mae}(S) = \sum_{t=1}^{T} \lVert U(V_t) - U(D_\theta(E_\theta(M(V_t)))) \rVert_2^2.    (3)

The total loss for a sample combines the language modeling, masked auto-encoding, and action modeling terms, normalized by the total number of tokens in the sample:

L(S) = \frac{L_{lang}(S) + L_{mae}(S) + L_{act}(S)}{|W| + \sum_{t=0}^{T} (|V_t| + |A_t|)}.    (5)

On robotics data, we only use T = 4 frames of video as input, since the tasks are Markovian and therefore do not require long histories to accurately predict the next action. Our gaming data samples use T = 9 frames of video as input, since an observation history is necessary for the partially-observable gaming tasks.
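As a rough illustration of how the per-sample objective in Eq. (5) could be assembled, the sketch below sums a causal language-modeling term, a masked-patch reconstruction term, and an action-modeling term, then normalizes by a token count. It assumes the individual terms are computed with summed (not averaged) reductions and that the visual contribution to the denominator is counted over the masked patches; both are bookkeeping assumptions for illustration, not statements about the released code.

import torch
import torch.nn.functional as F

def sample_loss(text_logits, text_targets,      # (|W|, vocab), (|W|,)
                recon_patches, target_patches,  # (N_masked, patch_dim) each
                act_logits, act_targets):       # (|A|, vocab), (|A|,)
    # Summed losses in the spirit of Eqs. (2) and (3), plus the action term.
    l_lang = F.cross_entropy(text_logits, text_targets, reduction="sum")
    l_mae = F.mse_loss(recon_patches, target_patches, reduction="sum")
    l_act = F.cross_entropy(act_logits, act_targets, reduction="sum")
    # Normalize by the total token count, as in Eq. (5).
    n_tokens = text_targets.numel() + target_patches.shape[0] + act_targets.numel()
    return (l_lang + l_mae + l_act) / n_tokens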
5. Tasks

We believe that a foundation model trained on visual, language, and agent capabilities leads to a powerful and general-purpose tool that significantly impacts a variety of interactive tasks. To evaluate the effectiveness of our approach, we applied the model to three major agent-AI scenarios, encompassing representative downstream tasks: 1) Robotics: human-machine manipulation in the physical world; 2) Gaming: human-machine embodiment in virtual reality; and 3) Healthcare: augmented human-machine interaction in traditional multimodal tasks. For these tasks, the pre-trained model was fine-tuned with specific datasets. As a result, the model demonstrated reasonable and competitive performance in terms of action prediction, visual understanding, natural-language-driven human-machine interaction, gaming, and hospital scene understanding. We outline the task definitions and specific datasets used below.

5.1. Robotics Tasks

For the robotics scenario, we tested the model on language-guided manipulation tasks. To this end, we selected two distinct robotics manipulation datasets: Language-Table (Lynch et al., 2023) and CALVIN (Mees et al., 2022). In the Language-Table dataset, a robot gripper rearranged tabletop objects following language commands. The data were
collected through teleoperation in a simulation, totaling 4.93 million frames. In the CALVIN dataset, a 7-DOF robot manipulator performed manipulation tasks following relatively abstract instructions linked with a series of language commands. We utilized only the data containing language instructions, which amounted to 1.44 million frames. We chose these two datasets to gain insights into the model's performance across two dimensions: language-instruction abstraction and task-step length.

Figure 5. Our robotics and gaming pre-training pipeline. For simplicity, we use the same notation as in Sections 4.1 and 4.2; we represent our text instruction as W, input frames as Vt, our visual encoder and linear projection layer as Eθ and ℓ, respectively, our action and language transformer model as Fϕ, and the predicted actions at time step t as Ât. (The example shown uses the text instruction "separate the blue cube from the green star", processed by the transformer model Fϕ(W, Eθ(V1), Eθ(V2), . . . , Eθ(VT)).)

5.2. Gaming Tasks

Our primary gaming dataset consists of the Minecraft demonstrations collected by contractors in (Baker et al., 2022). In the original dataset, contractors were simply instructed to play Minecraft with no specific goal, and the dataset provided video gameplay synchronized with player actions and inventory metadata. However, since our architecture can leverage text instructions, we use GPT-4V to label videos with more specific instructions. Our prompt to GPT-4V also includes changes in the player's inventory over the video, which we found helped to reduce misclassifications of objects and actions in the video. In total, the Minecraft portion of our pre-training dataset consists of 4.7 million frames.

In addition to Minecraft, we also used a dataset of gameplay from Bleeding Edge, a team-based multiplayer game, which consists of video and synchronized player actions. Similarly, there are no specific instructions provided with the video, so we use GPT-4V to label the videos in our dataset. The Bleeding Edge portion of our pre-training dataset consists of 2.3 million frames across 7 different settings in the game.

Figure 6. A High-level Overview of our Healthcare Tasks. We leveraged nurse-labeled annotations to train our multimodal agent on healthcare data. To adapt our model for visual question answering, we generated additional training data with GPT-4 using the PHI-safe process shown in Appendix B. (The figure shows video input with nurse-labeled annotations and PHI-safe GPT-4-generated data as the two training sources.)

5.3. Healthcare Tasks

Experienced ICU nurses generated captions of extracted 5-10 second video clips depicting common nursing activities in the ICU. We also included routine nursing documentation of important observations based on longer 5-30 minute windows, which included common clinical measures that assist with assessment and treatment of the patient's condition. For the analysis described in this paper, we focused on the RASS (Richmond Agitation-Sedation Scale) score, used to assess the patient's state of agitation and sedation (Sessler et al., 2002), and the bed position, to confirm that the head of the bed is at the proper angle to decrease the chance of acquiring a ventilator-associated pneumonia (Keeley, 2007). Both assessments are recorded frequently in the medical record, and automated documentation has the potential to optimize caretaker time.

In order to fine-tune our model for human interactions in our ICU use case, we leveraged the nurse-provided video-clip captions and clinical documentation to have GPT-4 generate a synthetic video question-answer dataset that was used to expand the capabilities of our model after healthcare fine-tuning. A definite advantage of the GPT-4-generated derivative dataset is that it did not use any confidential patient data and consequently can be made publicly available to train any language-grounded clinical model. Figure 6 provides an overview of the healthcare tasks we evaluated: (1) video captioning, (2) video question answering, and (3) RASS score prediction (which we formulate as an activity recognition problem). For more information about our GPT-4 based question-answer generation procedure, see Appendix B.
Table 1. Results for robotics fine-tuning across tasks on CALVIN and Language-Table, along with their corresponding evaluation metrics.

Model                 | CALVIN 1 Step | 2 Step | 3 Step | 4 Step | 5 Step | Avg. Len. | Language-Table Success Rate
MCIL                  | 37.3          | 2.7    | 0.2    | 0.0    | 0.0    | 0.4       | —
Ours (from scratch)   | 20.6          | 0.8    | 0.0    | 0.0    | 0.0    | 0.214     | 40.0
Ours                  | 64.8          | 29.0   | 12.3   | 4.7    | 1.9    | 1.127     | 42.0
Task: Bleeding Edge
Text instruction: the player is controlling a red robot ... fighting other characters
Start frame: (image)
Predicted Action:    [STARTACTION] [lockon][meleeattack] [lrot162] [lmag4] [ENDOFACTION]
Ground Truth Action: [STARTACTION] [lockon][meleeattack] [lrot160] [lmag4] [ENDOFACTION]

Table 3. Examples of actions predicted by our fine-tuned models for Minecraft (above) and Bleeding Edge (below). More examples are presented in Appendix E.
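The action strings in Table 3 suggest a simple text protocol: a [STARTACTION] ... [ENDOFACTION] span containing discrete button tokens plus quantized joystick rotation and magnitude bins. A hypothetical decoder for evaluation or environment playback might look like the sketch below; the bin-to-angle conversion (256 bins over 360 degrees) is an illustrative assumption rather than a detail taken from the paper.

import re

def parse_action(text):
    # Extract button and quantized joystick tokens from a predicted action string.
    m = re.search(r"\[STARTACTION\](.*?)\[ENDOFACTION\]", text, re.DOTALL)
    if m is None:
        return None
    tokens = re.findall(r"\[([a-z]+\d*)\]", m.group(1))
    action = {"buttons": [], "lrot": None, "lmag": None}
    for tok in tokens:
        if tok.startswith("lrot"):
            action["lrot"] = int(tok[4:]) * 360.0 / 256.0   # hypothetical bin-to-degrees mapping
        elif tok.startswith("lmag"):
            action["lmag"] = int(tok[4:])                   # quantized stick magnitude
        else:
            action["buttons"].append(tok)                   # e.g., lockon, meleeattack
    return action

print(parse_action("[STARTACTION] [lockon][meleeattack] [lrot162] [lmag4] [ENDOFACTION]"))
# {'buttons': ['lockon', 'meleeattack'], 'lrot': 227.8125, 'lmag': 4}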
Table 4. Performance on healthcare text generation and RASS score action recognition, along with the corresponding evaluation metrics. Agent pre-training on robotics and gaming data improves performance for action recognition, but does not improve text generation abilities.

Model                     | Perplexity ↓ | RASS Acc. ↑
CLIP + OPT (frozen)       | 93.3         | 55.4
CLIP + OPT (unfrozen)     | 102.7        | 92.6
Ours (from scratch)       | 100.0        | 70.3
Ours (agent pre-trained)  | 106.3        | 95.7

level controls. While our model is able to output precise movements and actions, GPT-4V only outputs high-level instructions.

Effects of Agent Pre-Training: In Table 2 and Table 4, we demonstrate the effectiveness of our agent pre-training strategy compared to training from scratch and training against an equivalent visual-language baseline. In particular, we show that a commonly used approach for fine-tuning visual-language models by using frozen visual encoders, similar to LLaVA (Liu et al., 2023) or Mini-GPT-4 (Zhu et al., 2023), performs worse than joint fine-tuning for action recognition on our healthcare dataset. Furthermore, our agent pre-training boosts performance for action prediction across all gaming and robotics datasets.
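The frozen-versus-unfrozen comparison in Table 4 amounts to toggling whether the visual encoder receives gradients during fine-tuning. A minimal sketch of that switch is shown below; it assumes a PyTorch model exposing a visual_encoder submodule, which is a naming assumption for illustration only.

def set_visual_encoder_trainable(model, trainable: bool):
    # Freeze (trainable=False) or jointly fine-tune (trainable=True) the visual
    # encoder, mirroring the CLIP + OPT (frozen/unfrozen) baselines above.
    for p in model.visual_encoder.parameters():
        p.requires_grad = trainable
    # Keep frozen modules in eval mode so that, e.g., dropout stays disabled.
    model.visual_encoder.train(trainable)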
8. Conclusion

We introduced an Interactive Agent Foundation Model designed to take text, action, and visual inputs. We found that by pre-training on a mixture of robotics and gaming data, our model is effective in modeling actions across a variety of domains, even showing positive transfer when fine-tuning in unseen domains such as healthcare. The generality of our framework allows it to be broadly applicable across decision-making settings, unlocking new possibilities for generalist agents in multimodal systems.

9. Impact Statement

This paper presents the initial steps on making interactive agents possible through an Interactive Agent Foundation Model. We do not foresee negative societal consequences from presenting and open-sourcing our current work. In particular, the main output of our model is domain-specific actions, such as button inputs for gaming data, making the downstream applications of our model different from those of standard LLMs and VLMs.

In the domain of robotics, we wish to emphasize that our model should not be deployed on real robots without more training and additional safety filters.

In the domain of gaming, downstream applications of our foundation model may have some societal consequences. Smarter, more realistic AI characters could lead to more immersive worlds, which can increase players' enjoyment in games, but may also lead to social withdrawal if not used appropriately. Specifically, more realistic AI characters could potentially lead to video game addiction and players anthropomorphising artificial players. We encourage game developers who build AI agents using our models to mitigate these potential harms by encouraging social interactions between human players and applying appropriate content filters to AI agents.

In the domain of healthcare, we emphasize that our models are not official medical devices and have not gone through rigorous testing in live settings. We strongly discourage using our models for self-prescription. Even as our models improve in future iterations, we strongly encourage keeping a medical practitioner in the loop to ensure that unsafe actions are avoided. As our models continue to develop, we believe that they will be useful to caretakers, especially by automatically forming drafts of documentation and notifying caretakers when patients may need urgent attention.
Finally, we note that the capabilities of agent AI models may significantly change at scale. As we scale our model in terms of architecture, compute, and training data, we will actively monitor its capabilities before releasing new versions publicly.

Acknowledgements

We are especially grateful to Desney Tan, Peter Lee, Doug Burger, Ryen White, Ece Kamar, John Langford, Jonathan Carlson and Microsoft's Office of the CTO (OCTO) for their advice, enormous support, and encouragement. We appreciate the Microsoft gaming team, Microsoft Xbox team, Microsoft 343 team, Kareem Choudhry, Haiyan Zhang, Spencer Perreault, Dave Bignell, Katja Hofmann, Sam Devlin, Shanzheng Tan, and Raluca Georgescu for the gaming data collection and sharing. We thank Bill Dolan, Nebojsa Jojic, Sudha Rao, Adrian Brown, Andrzej Banburski-Fahey, and Jianwei Yang for their early insightful discussions and help with the gaming aspects of our project. We appreciate Kiran Muthabatulla and the MSR Central Engineering (CE) team for their discussion and feedback for the project. The authors gratefully acknowledge the Microsoft HoloLens team, Microsoft Mesh team, and Antonio Criminisi for their generous provision of equipment and project discussions. Finally, we would like to express our genuine appreciation for Jim Jernigan, Ben Huntley, Oleg Losinets, the Microsoft AOAI team, and the GCR team for their Azure-OpenAI endpoint support and their pointers to the literature.

We would also like to thank our colleagues from Stanford's Partnership in AI-assisted Care, who helped inform the medical applications explored in this work. In particular, we would like to thank Amit Kaushal and Roger Bohn for their clinical expertise and guidance. Additionally, we greatly appreciate Zelun Luo, David Dai, and Dev Dash for their participation as actors for our hospital dataset.

This research was supported by Microsoft Research Project Green 2024, Microsoft Research Project Fair 2023, Stanford University, University of California at Los Angeles, the MSR Accelerator team, and the Microsoft OCTO team.

References

Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., et al. Do as I can, not as I say: Grounding language in robotic affordances. arXiv preprint arXiv:2204.01691, 2022.

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.

Amjad, A., Kordel, P., and Fernandes, G. A review on innovation in healthcare sector (telehealth) through artificial intelligence. Sustainability, 15(8):6655, 2023.

Bain, M., Nagrani, A., Varol, G., and Zisserman, A. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1728–1738, 2021.

Baker, B., Akkaya, I., Zhokov, P., Huizinga, J., Tang, J., Ecoffet, A., Houghton, B., Sampedro, R., and Clune, J. Video PreTraining (VPT): Learning to act by watching unlabeled online videos. Advances in Neural Information Processing Systems, 35:24639–24654, 2022.

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Dabis, J., Finn, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Hsu, J., et al. RT-1: Robotics Transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.

Brohan, A., Brown, N., Carbajal, J., Chebotar, Y., Chen, X., Choromanski, K., Ding, T., Driess, D., Dubey, A., Finn, C., et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

Cai, W., Huang, S., Cheng, G., Long, Y., Gao, P., Sun, C., and Dong, H. Bridging zero-shot object navigation and foundation models through pixel-guided navigation skill. arXiv preprint arXiv:2309.10309, 2023.

Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., Li, B., Fung, P., and Hoi, S. InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.

Dorbala, V. S., Sigurdsson, G., Piramuthu, R., Thomason, J., and Sukhatme, G. S. CLIP-Nav: Using CLIP for zero-shot vision-and-language navigation. arXiv preprint arXiv:2211.16649, 2022.

Dorbala, V. S., Mullen Jr., J. F., and Manocha, D. Can an embodied agent find your "cat-shaped mug"? LLM-based zero-shot object navigation. arXiv preprint arXiv:2303.03480, 2023.
Du, Y., Yang, M., Florence, P., Xia, F., Wahid, A., Ichter, B., Sermanet, P., Yu, T., Abbeel, P., Tenenbaum, J. B., et al. Video language planning. arXiv preprint arXiv:2310.10625, 2023.

Durante, Z., Huang, Q., Wake, N., Gong, R., Park, J. S., Sarkar, B., Taori, R., Noda, Y., Terzopoulos, D., Choi, Y., et al. Agent AI: Surveying the horizons of multimodal interaction. arXiv preprint arXiv:2401.03568, 2024.

Gadre, S. Y., Wortsman, M., Ilharco, G., Schmidt, L., and Song, S. CoWs on pasture: Baselines and benchmarks for language-driven zero-shot object navigation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 23171–23181, 2023.

Gong, R., Gao, X., Gao, Q., Shakiah, S., Thattai, G., and Sukhatme, G. S. LEMMA: Learning language-conditioned multi-robot manipulation. IEEE Robotics and Automation Letters, 2023a.

Gong, R., Huang, Q., Ma, X., Vo, H., Durante, Z., Noda, Y., Zheng, Z., Zhu, S.-C., Terzopoulos, D., Fei-Fei, L., et al. MindAgent: Emergent gaming interaction. arXiv preprint arXiv:2309.09971, 2023b.

Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. Retrieval augmented language model pre-training. In International Conference on Machine Learning, pp. 3929–3938. PMLR, 2020.

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., and Girshick, R. Masked autoencoders are scalable vision learners. CVPR, 2022.

Huang, C., Mees, O., Zeng, A., and Burgard, W. Visual language maps for robot navigation. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 10608–10615. IEEE, 2023.

Jiang, Y., Gupta, A., Zhang, Z., Wang, G., Dou, Y., Chen, Y., Fei-Fei, L., Anandkumar, A., Zhu, Y., and Fan, L. VIMA: General robot manipulation with multimodal prompts. arXiv, 2022.

Katara, P., Xian, Z., and Fragkiadaki, K. Gen2Sim: Scaling up robot learning in simulation with generative models. arXiv preprint arXiv:2310.18308, 2023.

Keeley, L. Reducing the risk of ventilator-acquired pneumonia through head of bed elevation. Nursing in Critical Care, 12(6):287–294, 2007.

Kumar, K. N., Essa, I., and Ha, S. Words into action: Learning diverse humanoid robot behaviors using language guided iterative motion refinement. arXiv preprint arXiv:2310.06226, 2023.

Lee, P., Bubeck, S., and Petro, J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. New England Journal of Medicine, 388(13):1233–1239, 2023.

Li, B., Wu, P., Abbeel, P., and Malik, J. Interactive task planning with language models. arXiv preprint arXiv:2310.10645, 2023a.

Li, C., Xia, F., Martín-Martín, R., Lingelbach, M., Srivastava, S., Shen, B., Vainio, K., Gokmen, C., Dharan, G., Jain, T., et al. iGibson 2.0: Object-centric simulation for robot learning of everyday household tasks. arXiv preprint arXiv:2108.03272, 2021.

Li, C., Wong, C., Zhang, S., Usuyama, N., Liu, H., Yang, J., Naumann, T., Poon, H., and Gao, J. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. arXiv preprint arXiv:2306.00890, 2023b.

Li, J., Li, D., Xiong, C., and Hoi, S. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. arXiv preprint arXiv:2201.12086, 2022.

Li, J., Gao, Q., Johnston, M., Gao, X., He, X., Shakiah, S., Shi, H., Ghanadan, R., and Wang, W. Y. Mastering robot manipulation with multimodal prompts through pretraining and multi-task fine-tuning. arXiv preprint arXiv:2310.09676, 2023c.

Li, J., Li, D., Savarese, S., and Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023d.

Li, X., Liu, M., Zhang, H., Yu, C., Xu, J., Wu, H., Cheang, C., Jing, Y., Zhang, W., Liu, H., et al. Vision-language foundation models as effective robot imitators. arXiv preprint arXiv:2311.01378, 2023e.

Liang, X., Ma, L., Guo, S., Han, J., Xu, H., Ma, S., and Liang, X. MO-VLN: A multi-task benchmark for open-set zero-shot vision-and-language navigation. arXiv preprint arXiv:2306.10322, 2023.

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning, 2023.

Lynch, C. and Sermanet, P. Language conditioned imitation learning over unstructured data. Robotics: Science and Systems, 2021. URL https://arxiv.org/abs/2005.07648.

Lynch, C., Wahid, A., Tompson, J., Ding, T., Betker, J., Baruch, R., Armstrong, T., and Florence, P. Interactive language: Talking to robots in real time. IEEE Robotics and Automation Letters, 2023.
Ma, Y. J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., and Anandkumar, A. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.

Mees, O., Hermann, L., Rosete-Beas, E., and Burgard, W. CALVIN: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7(3):7327–7334, 2022.

Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022.

Parakh, M., Fong, A., Simeonov, A., Gupta, A., Chen, T., and Agrawal, P. Human-assisted continual robot learning with foundation models. arXiv preprint arXiv:2309.14321, 2023.

Peng, B., Galley, M., He, P., Cheng, H., Xie, Y., Hu, Y., Huang, Q., Liden, L., Yu, Z., Chen, W., et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813, 2023.

Puig, X., Undersander, E., Szot, A., Cote, M. D., Yang, T.-Y., Partsey, R., Desai, R., Clegg, A. W., Hlavac, M., Min, S. Y., et al. Habitat 3.0: A co-habitat for humans, avatars and robots. arXiv preprint arXiv:2310.13724, 2023.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Rawte, V., Sheth, A., and Das, A. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922, 2023.

Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., Giménez, M., Sulsky, Y., Kay, J., Springenberg, J. T., et al. A generalist agent. Transactions on Machine Learning Research, 2022.

Sessler, C. N., Gosnell, M. S., Grap, M. J., Brophy, G. M., O'Neal, P. V., Keane, K. A., Tesoro, E. P., and Elswick, R. The Richmond Agitation–Sedation Scale: Validity and reliability in adult intensive care unit patients. American Journal of Respiratory and Critical Care Medicine, 166(10):1338–1344, 2002.

Shah, D., Osiński, B., Levine, S., et al. LM-Nav: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on Robot Learning, pp. 492–504. PMLR, 2023a.

Shah, R., Martín-Martín, R., and Zhu, Y. MUTEX: Learning unified policies from multimodal task specifications. arXiv preprint arXiv:2309.14320, 2023b.

Srivastava, S., Li, C., Lingelbach, M., Martín-Martín, R., Xia, F., Vainio, K. E., Lian, Z., Gokmen, C., Buch, S., Liu, K., et al. BEHAVIOR: Benchmark for everyday household activities in virtual, interactive, and ecological environments. In Conference on Robot Learning, pp. 477–490. PMLR, 2022.

Sun, J., Zhang, Q., Duan, Y., Jiang, X., Cheng, C., and Xu, R. Prompt, plan, perform: LLM-based humanoid control via quantized imitation learning. arXiv preprint arXiv:2309.11359, 2023a.

Sun, Q., Fang, Y., Wu, L., Wang, X., and Cao, Y. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023b.

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

Wake, N., Kanehira, A., Sasabuchi, K., Takamatsu, J., and Ikeuchi, K. GPT models meet robotic applications: Co-speech gesturing chat system. arXiv preprint arXiv:2306.01741, 2023a.

Wake, N., Kanehira, A., Sasabuchi, K., Takamatsu, J., and Ikeuchi, K. ChatGPT empowered long-step robot control in various environments: A case application. IEEE Access, 11:95060–95078, 2023b. doi: 10.1109/ACCESS.2023.3310935.

Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. Self-Instruct: Aligning language models with self-generated instructions. arXiv preprint arXiv:2212.10560, 2022.

Xu, M., Huang, P., Yu, W., Liu, S., Zhang, X., Niu, Y., Zhang, T., Xia, F., Tan, J., and Zhao, D. Creative robot tool use with large language models. arXiv preprint arXiv:2310.13065, 2023.

Yu, W., Gileadi, N., Fu, C., Kirmani, S., Lee, K.-H., Arenas, M. G., Chiang, H.-T. L., Erez, T., Hasenclever, L., Humplik, J., et al. Language to rewards for robotic skill synthesis. arXiv preprint arXiv:2306.08647, 2023.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.

Zhou, G., Hong, Y., and Wu, Q. NavGPT: Explicit reasoning in vision-and-language navigation with large language models. arXiv preprint arXiv:2305.16986, 2023.

Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models, 2023.
Appendix
A. Architecture Details
To effectively handle images and video inputs jointly, we use a divided space-time attention similar to (Bain et al., 2021).
We initialize our visual encoder from CLIP ViT-B16 (Radford et al., 2021), and learn temporal attention layers after each
spatial attention layer. We further mask 75% of the image patches (using tubelet masking for videos) during training, and
use a MAE-decoder similar to (He et al., 2022). Gaming and robotics use a frame-level visual encoder so that the agent is
able to observe a continuous stream of tokens and act after every frame. For healthcare, we leverage the video understanding
capabilities of our visual encoder since the tasks are video-level.
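As a concrete, hypothetical illustration of the 75% tubelet masking mentioned above, the sketch below drops the same random subset of spatial patches across every frame of a clip, so a masked patch is hidden for the entire video rather than for a single frame. It is a sketch under assumptions about tensor layout, not the paper's implementation.

import torch

def tubelet_mask(video_patches, mask_ratio=0.75):
    # video_patches: (T, N, D) patch embeddings for T frames with N patches each.
    # Returns the visible patches (T, N_keep, D) and a boolean mask (N,) that is
    # applied identically to every frame (a "tubelet" spans the full clip).
    T, N, D = video_patches.shape
    n_keep = int(N * (1 - mask_ratio))
    keep = torch.randperm(N)[:n_keep]       # same spatial patches kept in all frames
    mask = torch.ones(N, dtype=torch.bool)
    mask[keep] = False                      # True = masked / to be reconstructed
    return video_patches[:, keep, :], mask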
B. GPT-4 Prompting
User prompt: I will give you a caption, a bed angle, and an associated RASS score for a patient. The
caption describes what is happening during the local segment of a video clip (5-10 seconds).
The bed angle describes the position of the bed during the video clip. The RASS score describes the level of
sedation of the patient over a larger 5 minute window.
The RASS score is an integer between -5 and +4. Negative numbers indicate sedation, positive numbers
indicate agitation, and zero indicates a calm, alert patient.
Your task is to create a question and answer pair that is relevant to the caption, the bed angle, and/or the
RASS score. The question should be answerable given the live video feed of the patient. To generate the
question/answer pairs, you must use the caption, bed angle, and RASS score. Please generate your
questions and answers in json format from the RASS score and captions as follows. It is preferable to NOT
ask questions directly related to the bed angle. Do not add any additional text, only the part starting with {
and ending with }.
Output:
{
"question": "What is the clinician doing with the patient?",
"answer": "The clinician is helping the patient up from the bed
and assisting them in walking across the room."
}
Figure 8. Our PHI-safe GPT-4 Prompt for Generating Healthcare QA Examples. By ensuring the usage of non-identifying video
captions and documentation data, we prevent any identifiable patient data leakage to GPT-4 while simultaneously generating additional
visual-language training data. For the particular example shown, we use a RASS score of “0 - Alert and calm”, a caption of “The clinician
is helping the patient up from the bed and then helping them walk across the room.”, and a bed angle of “ > 45°”.
We show our GPT-4 Prompt for Healthcare Visual Question Answering generation in Figure 8, and our GPT-4V Prompt for
gaming instruction generation in Figure 9.
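A hypothetical sketch of how the prompt in Figure 8 could be turned into training pairs is shown below. Here, call_gpt4 is a placeholder for whatever chat-completion client is available (it is not an API defined in the paper), and the JSON handling simply mirrors the "only output the part starting with { and ending with }" instruction in the prompt.

import json

PROMPT_TEMPLATE = (
    "I will give you a caption, a bed angle, and an associated RASS score for a patient. ..."
    "\nCaption: {caption}\nBed angle: {bed_angle}\nRASS score: {rass}"
)

def make_qa_example(caption, bed_angle, rass, call_gpt4):
    # call_gpt4: placeholder function str -> str returning the model's raw reply.
    reply = call_gpt4(PROMPT_TEMPLATE.format(caption=caption, bed_angle=bed_angle, rass=rass))
    # The prompt asks for a bare JSON object, so trim any stray text around it.
    start, end = reply.find("{"), reply.rfind("}")
    qa = json.loads(reply[start:end + 1])
    return {"question": qa["question"], "answer": qa["answer"],
            "caption": caption, "rass": rass}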
GPT-4-Vision Prompt:
These are frames of a video of a Bleeding Edge player ordered from
left to right and top to bottom as a grid. Give a simple, but precise
description of what the player is doing in 1 sentence. Be specific
about important items, entities, and actions. In your description do
not mention specific frame numbers or the name of the game.
Video input:
Output:
The player begins by running around the
map, passing through different
checkpoints and interacting with several
capture points, then fights against an
enemy player, and finally captures an
objective while being attacked by another
enemy.
Figure 9. Our GPT-4V prompt for games like Bleeding Edge that have 3rd person viewpoints and visually complex scenes. In order to
input a large number of frames (48) to GPT-4V, we input the frames as a grid with frame numbers overlaid on each frame (as shown
above).
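Tiling sampled frames into one numbered grid, as described in the Figure 9 caption, can be done with basic image tooling. The sketch below uses Pillow and assumes uniformly sized frames and an 8-column layout, both of which are illustrative choices rather than details taken from the paper.

from PIL import Image, ImageDraw

def frames_to_grid(frames, cols=8):
    # frames: list of equally sized PIL images (e.g., 48 sampled frames).
    # Returns a single grid image with the frame index drawn on each tile.
    w, h = frames[0].size
    rows = (len(frames) + cols - 1) // cols
    grid = Image.new("RGB", (cols * w, rows * h))
    draw = ImageDraw.Draw(grid)
    for i, frame in enumerate(frames):
        x, y = (i % cols) * w, (i // cols) * h
        grid.paste(frame, (x, y))
        draw.text((x + 5, y + 5), str(i), fill="white")  # overlay the frame number
    return grid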
E. Example Outputs
We show examples of our model predicting actions on unseen robotics simulation data in Tables 5 and 6. We show example outputs for healthcare in Table 7, and example outputs for gaming in Tables 8 and 9.
Figure 10. When using GPT-4V to choose actions given a history of frames, we find that it gives reasonable high-level actions but does
not choose precise low-level actions, highlighting the importance of our pre-trained model.
Figure 11. Plot of all components of the training loss over the 100 epochs of pre-training.
Figure 12. Our gaming pre-training pipeline. For simplicity, we use the same notation as in Sections 4.1 and 4.2; we represent our text
instruction as W , input frames as Vt , our visual encoder and linear projection layer as Eθ and ℓ, respectively, our action and language
transformer model as Fϕ and the predicted actions at time step t as Ât .
Table 5. We show 5 unique demonstrations from Language Table, where our model successfully follows the text instruction. In addition to
the high level instruction, we also show the low-level predicted actions of our agent above each frame.
Table 6. We show 5 unique demonstrations from CALVIN, where our model successfully follows the text instruction. In addition to the
high level instruction, we also show the low-level predicted actions of our agent above each frame.
Action Recognition (RASS): 0 - Alert and calm
Table 7. We show 4 demonstrations of our agent model’s outputs on a held-out Healthcare dataset that uses actors instead of actual patients.
We demonstrate our model’s outputs across 3 different tasks: video captioning, visual question answering, and RASS score prediction
(action recognition). Due to the nature of our actor-collected example videos, the model predicts that the patient is awake and calm (RASS
score of 0) for most video clips, despite only 60% of the training data containing RASS score of 0.
Table 8. We show 5 demonstrations from a held-out Minecraft dataset. In addition to the high level instruction, we show the low-level
predicted actions and ground truth actions. We truncate the instructions to show only the parts relevant to the current frames. The most
common errors are slight differences in camera movements and occasionally performing unnecessary actions. Note that sometimes the
ground truth values are not the only valid actions; for instance, the fourth example predicts that the player will click the bottle, which
happens a few frames later in the ground truth trajectory.
Text instruction: a bleeding edge player is controlling a robot character with a sword ... engaging in combat with enemy players ...
Predicted Action:    [STARTACTION] [evade] [lrot236] [lmag4] [ENDOFACTION]
Ground Truth Action: [STARTACTION] [evade] [lrot236] [lmag4] [ENDOFACTION]
Table 9. We show 5 unique demonstrations from a held-out Bleeding Edge dataset. In addition to the high level instruction, we show the
low-level predicted actions and ground truth actions. We truncate the instructions to show only the parts relevant to the current frames.
The most common errors are slight deviations from the precise value of the joysticks, which are naturally noisy. Some other errors include
predicting the wrong type of attack, though this typically happens in situations where multiple attacks are still valid.