AIOS: LLM Agent Operating System
Abstract
LLM-based intelligent agents face significant deployment challenges, particularly related to resource management.
Allowing unrestricted access to LLM or tool resources can lead to inefficient or even potentially harmful resource
allocation and utilization for agents. Furthermore, the absence of proper scheduling and resource management
mechanisms in current agent designs hinders concurrent processing and limits overall system efficiency. As the
diversity and complexity of agents continue to grow, addressing these resource management issues becomes
increasingly critical to LLM-based agent systems. To address these challenges, this paper proposes the architecture
of AIOS (LLM-based AI Agent Operating System) under the context of managing LLM-based agents. It introduces
a novel architecture for serving LLM-based agents by isolating resources and LLM-specific services from agent
applications into an AIOS kernel. This AIOS kernel provides fundamental services (e.g., scheduling, context
management, memory management, storage management, access control) and efficient management of resources
(e.g., LLM and external tools) for runtime agents. To enhance usability, AIOS also includes an AIOS-Agent SDK,
a comprehensive suite of APIs designed for utilizing functionalities provided by the AIOS kernel. Experimental
results demonstrate that using AIOS can achieve up to 2.1× faster execution for serving agents built by various
agent frameworks. The source code is available at https://github.com/agiresearch/AIOS.
User: I'm flying from San Francisco to New York for business next month, please help organize the trip.
Travel Agent: Understood. I'll plan and book your itinerary according to your previous preferences.

Figure 1. A motivating example of how an agent (i.e., a travel agent) requires both LLM-related and non-LLM-related (i.e., OS) services to complete a task. In the original figure, red marks services related to the LLM (e.g., text generation and LLM-managed storage and tool APIs) and blue marks services not related to the LLM (e.g., disk, software, and tool APIs managed by the OS).
age resources such as LLMs, this also inhibits system efficiency. For example, calling LLMs by prompts in existing agent frameworks (e.g., Autogen, Langchain) under the concurrent setting predominantly employs a trial-and-error approach: prompts are fed into the LLM, converted to tensors, and loaded into GPU memory for execution. When CUDA memory capacity is exceeded, the system triggers an out-of-memory exception, deallocates the tensors, and signals failure to the requesting agent, necessitating repeated retry attempts until successful execution. This strategy significantly impacts system throughput and increases agent response latency, particularly in environments where multiple agents compete for limited GPU resources during inference.

To mitigate the limitations of deploying and running LLM-based agents, we introduce AIOS, an architecture designed to serve LLM-based agents more efficiently. Our contributions can be summarized in four parts.

◦ New Agent-serving Architecture. We introduce AIOS, a novel architecture for serving LLM-based agents. This architecture divides agent applications and their accessible resources, such as LLMs and tools, into distinct layers, i.e., the application layer and the kernel layer. This separation enables more systematic resource management, efficiency optimization, and safety enhancement.

◦ AIOS Kernel Design and Implementation. At the core of AIOS, we design and implement an AIOS kernel. In this kernel, agent primitives are designed to decompose LLM-related queries into sub-execution units to enhance concurrency. To orchestrate the execution of these agent primitives, we develop an agent scheduler for scheduling and dispatching primitives to appropriate execution modules. Additionally, we implement memory, storage, and tool managers, along with the LLM core(s), to handle the execution of dispatched primitives. To prevent long-context requests from monopolizing the LLM resource, we design a context manager to handle context interruptions and recoveries in the LLM core(s), especially in long-context scenarios. Moreover, an access manager is implemented to verify agent access rights before executing operations.

◦ AIOS-Agent SDK Development. We develop the AIOS-Agent SDK, which provides a higher-level abstraction of kernel functionalities, allowing developers to focus on application logic and higher-level functionalities without being burdened by the implementation details in the kernel.

◦ Empirical Results. We conduct extensive evaluations of AIOS on agents developed using various agent frameworks. The experimental results demonstrate that AIOS maintains the performance of agents across a wide range of standard benchmarks and can even enhance performance on benchmarks that involve calling external tools under concurrent execution conditions. Furthermore, AIOS significantly improves execution efficiency, achieving up to a 2.1× increase in execution speed for serving agents across different frameworks. These results underscore the effectiveness of AIOS in optimizing both agent performance and execution speed in supporting diverse agent frameworks in resource-restricted environments.

2 The Architecture of AIOS

As depicted in Figure 2, the AIOS architecture is divided into three distinct layers: the application, kernel, and hardware layers. This layered design is intended to establish a clear separation of concerns within the system. Higher-level applications abstract the complexities of the underlying layers, interacting with them through well-defined interfaces such as software development kits (SDKs) and system calls.
Figure 2. An overview of the AIOS architecture, where responsibilities are isolated across different layers. The application layer (hosting agent applications such as travel, recommendation, coding, math, and narrative agents) facilitates the design and development of agent applications. The kernel layer manages core functionalities and resources to serve agent applications. The hardware layer controls and manages physical computing resources and devices to support kernel-layer functionalities.
Application Layer. At the application layer, agent applications are developed using the AIOS-Agent SDK, which provides the interface for requesting system resources through invoking system calls. On one hand, by using the SDK to request resources, agents are relieved from the burden of handling resource management. On the other hand, the SDK also facilitates isolation, ensuring that system resources cannot be directly manipulated by agents. The AIOS-Agent SDK is designed not only to support agents developed using the native SDK functions, but also to facilitate the integration of non-native agents built with various agent creation frameworks, such as ReAct (Yao et al., 2023), Reflexion (Shinn et al., 2023), Autogen (Wu et al., 2023), Open-Interpreter (Lucas, 2024), and MetaGPT (Hong et al., 2023). By providing an agent adapter function, the SDK supports non-native agents by allowing them to interact with AIOS kernel resources. For native agent development, the SDK simplifies the creation of agents by offering predefined modules and APIs that achieve functionality through invoking system calls, so that agents can request resources provided and managed by the AIOS kernel. This helps developers focus on the agent's primary workflows and logic rather than low-level implementation details.

Kernel Layer. The kernel layer is composed of two distinct yet synergistic components: the traditional OS kernel and the specialized AIOS kernel, each fulfilling unique roles within the system's functionality. The OS kernel retains its conventional architecture to manage non-LLM-related computing tasks, while our core innovation centers around the AIOS kernel. Within the AIOS kernel, several modules are designed to facilitate agent requests through AIOS system calls. A scheduler is designed to dispatch these system calls to appropriate modules and employ strategies for scheduling AIOS system calls, which we discuss in detail in Section 3.3. To facilitate the integration of diverse LLM endpoints, we design a unified interface that encapsulates LLMs as cores, akin to CPU cores, thereby allowing the integration of various LLM endpoints via a single interface. Additionally, to support context switching for LLMs, a context manager is introduced with mechanisms for context snapshot and restoration, further detailed in Section 3.4. To optimize agent memory handling, we develop a memory manager for managing agent memory operations and a storage manager for persistent storage operations, which are explained further in Section 3.5 and Section 3.6, respectively. In addition, a tool manager is designed to load tools and manage tool-call conflicts for the tools supported in the AIOS-Agent SDK, which is covered in Section 3.7. Lastly, an access manager is designed with access control and user intervention, which we elaborate on in Section 3.8.

Hardware Layer. The hardware layer consists of the physical components of the system, such as the CPU, GPU, memory, disk, and peripheral devices. The hardware layer is not the main focus of this work: the AIOS kernel does not directly interact with the hardware but relies on OS system calls to access the physical resources in the hardware layer.

3 AIOS Kernel

In this section, we start with an overview of the AIOS kernel, highlighting how each module collaborates with the other modules to support integrated functionalities. Following this, we provide an in-depth look into the design and implementation of each module, discussing their roles and contributions to the overall AIOS architecture.
Figure 3. How agent queries are decomposed into AIOS system calls and how AIOS system calls are dispatched and scheduled. Agent queries of different types (e.g., chat, file operation, tool use) are decomposed into system calls that are routed to the queues of the LLM core(s), memory manager, storage manager, and tool manager. We omit the access manager module here, as access-related system calls are not dispatched by the scheduler.
Table 1. AIOS system calls that are dispatched in the scheduler.

Module            | AIOS System Call
LLM Core(s)       | llm_generate
Memory Manager    | mem_alloc, mem_read, mem_write, mem_clear
Storage Manager   | sto_create, sto_read, sto_write, sto_clear, sto_retrieve
Tool Manager      | tool_run

Table 2. Supported LLM instances in AIOS and the corresponding deployment options (no offline option for closed-source LLMs).

                | Online                     | Offline
Open-source     | Bedrock                    | Huggingface, vllm, Ollama
Closed-source   | GPT, Claude, Gemini, Grok  | -
3.1 Relationship and Connection between Modules

In the AIOS kernel, queries from agent applications are decomposed into distinct AIOS system calls, each categorized by functionality, such as LLM processing, memory access, storage operations, or tool usage, as illustrated in Figure 3. A subset of these system calls is shown in Table 1, while a comprehensive list can be found in Appendix A.1.

After decomposition, each system call is bound to an execution thread and subsequently dispatched by the scheduler. The scheduler centralizes and manages multiple queues for various modules, such as the LLM core(s), memory manager, storage manager, and tool manager. As a result, system calls are directed to the appropriate queue based on a specific attribute set assigned to each call. Each module listens to its corresponding queue in the scheduler and fetches the system calls scheduled for processing. Among these processing modules, the context manager is responsible for handling interruptions that may occur during the execution of system calls in the LLM core(s) (Section 3.4). Additionally, there is internal data swap between the memory manager and storage manager due to memory limitations. This modular architecture enables key components, such as the LLM core(s), memory manager, storage manager, and tool manager, to process requests concurrently within dedicated queues, enhancing isolation and parallelism. Thread binding implementations are detailed in Appendix A.1, while the data swap between the memory and storage managers is covered in Section 3.5.

3.2 LLM Core(s)

Due to the various deployment options of LLMs, e.g., which LLM is used, whether the LLM is hosted on the cloud or on a local device, what hardware conditions the LLM requires, or which inference framework is used, we encapsulate each LLM instance adopting different deployment options as a core, akin to a CPU core in a traditional operating system. This design allows us to treat each LLM instance as a dedicated processing unit, enhancing the modularity and extensibility within the AIOS architecture. To accommodate different LLM instances, we introduce a wrapper for each LLM instance and design unified system calls within this wrapper specifically for LLM inference. By abstracting an LLM instance as a core and implementing standardized system calls, AIOS provides a flexible way to integrate LLM instances under different deployment options, attributed to the modular design of the LLM core(s). Detailed information on the LLM core(s) is provided in Appendix A.2.
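To make the core abstraction concrete, the following is a minimal sketch of how heterogeneous deployment options from Table 2 could be registered behind one inference interface. The registry, class names, and make_core helper are illustrative assumptions rather than the actual AIOS implementation; the real wrapper interface appears in Appendix A.2.

# Illustrative sketch: registering deployment options behind one interface.
CORE_BACKENDS = {}

def register_backend(name):
    def decorator(cls):
        CORE_BACKENDS[name] = cls
        return cls
    return decorator

@register_backend("huggingface")
class HuggingfaceCore:
    def __init__(self, model_name):
        self.model_name = model_name

    def llm_generate(self, prompt, temperature=0.0):
        # Locally hosted inference would go here (e.g., via transformers).
        raise NotImplementedError

@register_backend("gpt")
class GPTCore:
    def __init__(self, model_name):
        self.model_name = model_name

    def llm_generate(self, prompt, temperature=0.0):
        # A call to the hosted, closed-source API would go here.
        raise NotImplementedError

def make_core(backend: str, model_name: str):
    """Instantiate an LLM core for the requested deployment option."""
    return CORE_BACKENDS[backend](model_name)

A runtime could then create cores uniformly, e.g., core = make_core("huggingface", "llama-3.1-8b"), and issue llm_generate calls without knowing which backend serves them.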
3.3 Scheduler

Instead of placing separate queues within each processing module (e.g., the LLM core(s) or memory manager), we centralize all the queues within the scheduler module. This approach isolates the responsibility for request management from the individual modules, allowing each processing module to focus on its execution. Besides, centralizing queue management in the scheduler simplifies the coordination of tasks across modules and provides a unified framework for scheduling. To schedule the AIOS system calls dispatched for each module in queues, we utilize two classic scheduling algorithms, First-In-First-Out (FIFO) and Round Robin (RR), due to their effectiveness and simplicity. The FIFO strategy processes system calls in the order they arrive, ensuring a straightforward handling sequence but potentially leading to increased waiting times for system calls queued later. In contrast, the RR strategy cycles through system calls in a time-sliced manner, allowing for more balanced resource distribution and reduced waiting times under high load conditions. To support time-slicing for the RR scheduling strategy, we introduce the context interrupt mechanism for LLM inference, which will be introduced in Section 3.4. Our centralized queue architecture provides a flexible foundation that accommodates diverse scheduling optimizations, from basic to sophisticated strategies. We provide the detailed implementation in Appendix A.3.
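To illustrate how the RR time slice interacts with the context-interrupt mechanism of Section 3.4, consider the sketch below. The quantum value and the run_for, is_done, snapshot, and finish methods are assumed names for illustration, not the actual AIOS API.

import queue

def rr_loop(llm_queue: queue.Queue, llm_core, quantum: float = 1.0):
    """Illustrative Round-Robin loop over LLM system calls. Each call runs
    for at most `quantum` seconds before being suspended via a context
    snapshot and re-enqueued, so no single call monopolizes the LLM core."""
    while True:
        syscall = llm_queue.get()            # next dispatched LLM system call
        llm_core.run_for(syscall, quantum)   # execute for one time slice
        if syscall.is_done():
            syscall.finish()                 # deliver the response to the agent
        else:
            syscall.snapshot()               # context manager saves intermediate state
            llm_queue.put(syscall)           # re-enqueue; resumes from the snapshot later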
3.4 Context Manager

The inference time of LLMs is a critical bottleneck that can lead to long-running system calls, potentially monopolizing system resources. To address this issue, the context manager ensures the efficient suspension and resumption of tasks through context snapshot and restoration operations, preventing prolonged system calls from dominating the LLM inference process.

The context manager designs two methods to capture and restore context based on different decoding strategies: text-based and logits-based approaches. For closed-source LLMs without logits access, the text-based approach directly saves the decoded text outputs and follows the previous decoding strategy at intermediate stages. Conversely, the logits-based approach preserves the structure of the intermediate search tree generated during inference, allowing for more fine-grained restoration of the computational state. This approach can be particularly advantageous for maintaining continuity in tasks requiring complex decoding strategies. The detailed procedure for the logits-based method is illustrated in Figure 4. We use the beam search process, a typical practice in LLMs (Touvron et al., 2023b; Jiang et al., 2023; Biderman et al., 2023), to illustrate the generative decoding process. For simplicity of illustration, we set the beam width to 1. Specifically, consider the prompt to the LLM as: Determine whether there will be rain in the destination of flight UA057. At each step, the LLM evaluates multiple candidate tokens, with the most promising paths kept for further expansion based on the predefined beam width. When the generation process is suspended by the scheduler at an intermediate step, the context manager uses the snapshot function to capture and store the current intermediate outputs of the LLM. Upon resumption, the restoration function is employed to reload the saved output from the snapshot, allowing the LLM to continue its generation process exactly from the point of suspension to reach the final answer: Search weather in Paris. In this way, the context manager ensures that the temporary suspension of one agent's request does not lead to a loss of progress, thereby improving efficiency since generation need not restart from scratch.

Figure 4. Illustration of the logits-based context snapshot and restoration process. We use the beam search algorithm with the beam width set to 1 as an example.

3.5 Memory Manager

Unlike a traditional OS memory manager that handles physical memory such as RAM, the "memory" in the context of an LLM-based agent refers to the agent's interaction history during the agent's runtime (Lerman & Galstyan, 2003; Zhang et al., 2024), such as the agent's conversation history with the LLM and the execution results of tool calling. As a result, the memory manager in AIOS handles the management of these agent memories during the agent's runtime, such as memory structure, allocation, read, write, deletion, update, and compression. Agent memory is stored and managed in RAM by default, but when the agent's allocated RAM space is used up, the memory manager swaps the agent's memory between RAM and disk through an eviction policy. More details are in the following.
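As a minimal sketch of such an eviction path (assuming an LRU ordering and the sto_write call from Table 1; the function and parameter names are otherwise illustrative assumptions):

from collections import OrderedDict

def evict_if_needed(memory_blocks: OrderedDict, memory_limit: int,
                    eviction_k: int, storage_manager):
    """Illustrative eviction: when an agent's in-RAM blocks exceed the limit,
    spill the k least recently used blocks to persistent storage."""
    if len(memory_blocks) <= memory_limit:
        return
    for _ in range(min(eviction_k, len(memory_blocks))):
        block_id, block = memory_blocks.popitem(last=False)  # oldest entry first
        storage_manager.sto_write(block_id, block)           # swap from RAM to disk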
Table 3. Evaluation of agent performance on benchmarks without and with AIOS, respectively. Success rate (SR%) is used as the metric for all the benchmarks. "-" represents methods that failed GAIA benchmark tasks due to lack of API support.

Method                     | HumanEval | MINT (Code) | GAIA | SWE-Bench-Lite
ReAct w/o AIOS             | 48.8      | 29.4        | 5.5  | 3.9
ReAct w/ AIOS              | 50.6      | 30.1        | 7.3  | 4.3
Reflexion w/o AIOS         | 50.6      | 32.4        | 6.7  | 4.7
Reflexion w/ AIOS          | 51.8      | 33.8        | 7.8  | 5.1
Autogen w/o AIOS           | 87.8      | 42.5        | 7.3  | 4.3
Autogen w/ AIOS            | 87.8      | 42.5        | 9.7  | 4.3
Open-Interpreter w/o AIOS  | 85.4      | 45.9        | -    | 4.7
Open-Interpreter w/ AIOS   | 86.0      | 48.7        | -    | 5.1
MetaGPT w/o AIOS           | 82.9      | 41.1        | -    | 5.9
MetaGPT w/ AIOS            | 82.9      | 41.8        | -    | 5.9

Figure 7. Efficiency analysis on different agent frameworks evaluated on the Llama-3.1-8b model on the HumanEval benchmark. (a) Normalized throughput; higher is better. (b) Normalized latency; lower is better.
Agent Frameworks. We conduct evaluation by running agents built from various popular agent frameworks: ReAct (Yao et al., 2023), Reflexion (Shinn et al., 2023), Autogen (Wu et al., 2023), Open-Interpreter (Lucas, 2024), and MetaGPT (Hong et al., 2023). Details of these agent frameworks are introduced in Appendix B.4.

Workloads. We evaluate a resource-constrained scenario in which agents run concurrently with a single deployed LLM that can process only one prompt request at a time. To create these concurrent conditions, we set the maximum number of working threads to 250 by default, i.e., at most 250 agents can run concurrently at the same time. The impact of increasing the number of agents is analyzed in Section 4.4. By default, we use RR as the scheduling strategy for AIOS to run agents. The impact of using the other strategy (i.e., FIFO) is reported in Section 4.3.

4.2 Agent Performance (RQ1)

To evaluate whether using AIOS can maintain or even improve agent performance on standard benchmarks, we adopt four agent benchmarks, i.e., HumanEval (Chen et al., 2021a), MINT (the code subset) (Wang et al., 2023b), GAIA (Mialon et al., 2023), and SWE-Bench-Lite (Jimenez et al., 2024), to run agents without and with AIOS, respectively. We use the success rate (SR%) as the metric, consistent with the original benchmarks, and use GPT-4o-mini as the LLM core to run all the agents. To eliminate randomness, we set the temperature to 0 for GPT-4o-mini in all experiments. Detailed descriptions of the benchmark setups and configurations can be found in Appendix C.

As shown in Table 3, incorporating AIOS consistently maintains agent performance across standard benchmarks. In some cases, AIOS also contributes to agent performance improvements. For example, in code generation benchmarks such as MINT, HumanEval, and SWE-Bench-Lite, AIOS boosts agent performance through prompt enhancement, which embeds the system prompts with more structural input and output within the LLM wrapper. These enhanced prompts provide the LLM with additional context and structural guidance for higher-quality code generation. In tool-calling benchmarks like GAIA, agent performance is boosted for two main reasons. First, the tool manager implements a post-verification process using structural regex to ensure that the input parameters for tool calls conform to the correct format. This extra validation step helps prevent errors by catching incorrect tool names or parameters generated by the LLM before the tool call is executed. Second, AIOS employs conflict resolution to manage tool calls, preventing conflicts that might otherwise cause successful tool calls to fail. By mitigating issues from concurrent tool access, AIOS ensures stable operation for agents during execution.

4.3 Efficiency Analysis (RQ2)

In our efficiency experiments, we evaluate system performance using two key metrics: throughput and latency. Throughput is measured by counting the number of AIOS system calls executed per second, indicating the system's capacity to handle multiple requests in parallel. Latency, on the other hand, is measured as the average waiting time experienced by agents, from the moment a query is submitted to the completion of the response, reflecting the system's responsiveness. To ensure a controlled and consistent testing environment, we conduct these evaluations using two open-source models, Llama-3.1-8b and Mistral-7b, both hosted locally. Hosting these models locally reduces potential variability in LLM API response times due to network-related latency issues. As shown in Figure 7a and Figure 8a, the results demonstrate that AIOS achieves significantly higher throughput across different agent frameworks, up to a 2.1× increase in throughput when using Reflexion-based agents on Llama-3.1-8b.
This improvement is attributed to the scheduling employed in the AIOS kernel, which prevents unnecessary trial-and-error attempts by avoiding prompts that cannot be loaded onto the GPU for execution. In terms of latency, as illustrated in Figure 7b and Figure 8b, the average waiting time for agents is also substantially reduced. This reduction highlights the efficiency of AIOS in serving LLM-based agents.

Impact of Different Scheduling Strategies. To further analyze the impact of different scheduling strategies on system efficiency, we conduct an ablation study using agents built with ReAct on the HumanEval benchmark with the Llama-3.1-8b model. We test three strategies: without AIOS, FIFO, and Round Robin (RR), and measure the overall execution time and agent waiting time (average and p90).

Table 4. Impact of using different scheduling strategies, where None represents without using AIOS, and FIFO and RR represent using AIOS with the two different scheduling strategies. All metrics are reported in minutes, including overall execution time and agent waiting time (average and p90).

Strategy | Overall execution time | Agent waiting time (Avg.) | Agent waiting time (p90)
None     | 152.1                  | 9.8                       | 11.0
FIFO     | 74.2                   | 3.0                       | 5.0
RR       | 77.3                   | 3.2                       | 4.2

As shown in Table 4, the FIFO strategy achieves the shortest overall execution time compared to the other strategies. RR comes second in terms of overall execution time and average agent waiting time, as its context switching introduces additional overhead. However, RR performs better on the p90 metric (i.e., the value below which 90% of waiting times fall) due to its fairer scheduling approach, which reduces the likelihood of later tasks having longer waiting times, a pattern that typically occurs under FIFO.

4.4 Scalability Analysis (RQ3)

In this section, we evaluate the scalability of AIOS by progressively increasing the number of active agents from 250 to 2000. These experiments were conducted using the Llama-3.1-8b and Mistral-7b models on the HumanEval benchmark. Since the HumanEval dataset contains only 164 samples, we scaled up the dataset by duplicating samples to match the increasing number of agents, enabling large-scale concurrent execution of agent instances.

Figure 9. Overall execution time and average agent waiting time when the agent number increases from 250 to 2000. (a) Overall execution time vs. agent number. (b) Average agent waiting time vs. agent number.

As shown in Figure 9a and Figure 9b, the results demonstrate that with AIOS, both the overall execution time and the average agent waiting time maintain an approximately linear relationship with the number of agents. This predictable, linear scaling illustrates AIOS's ability to handle increasing workloads efficiently, even as demand intensifies. In contrast, without AIOS, the execution and waiting times increase at a faster rate.
The gap between using AIOS and not using AIOS widens as the number of agents increases, underscoring AIOS's effectiveness in managing concurrent operations. As workloads scale, AIOS maintains system stability and responsiveness, reducing both execution and waiting times compared to configurations without AIOS. This growing performance advantage highlights AIOS's suitability for environments with high or fluctuating workloads, demonstrating its potential to serve a large number of agents.

5 Related Work

5.1 Evolution of Operating Systems

The evolution of operating systems (OS) has unfolded progressively, from rudimentary systems to the complex and interactive OS of today. Their evolution saw a transition from simple batch job processing (IBM, 2010) to more advanced process management techniques like time-sharing (Ritchie & Thompson, 1974) and multi-task processing (Hoare, 1974; Engler et al., 1995), which facilitated the handling of increasingly complex tasks. The progress moved toward modularization within the OS, delineating specific responsibilities such as process scheduling (Liu & Layland, 1973; Dijkstra, 2002), memory management (Denning, 1968; Daley & Dennis, 1968), and filesystem management (Rosenblum & Ousterhout, 1992; McKusick et al., 1984), enhancing efficiency and manageability. The further advent of graphical user interfaces (GUIs), e.g., Macintosh, Windows, and GNOME, made operating systems more interactive and user-centric. Meanwhile, the operating system ecosystem has also expanded, offering a comprehensive suite of developer tools (OS SDKs) and runtime libraries. These tools enable application developers to design, implement, and run their applications efficiently within the OS environment (Ge et al., 2023b). Notable examples of OS ecosystems include Android Studio, XCode, and Cloud SDK. In these ecosystems, the OS provides numerous resources to facilitate software development and serves as a platform for deploying and hosting software applications, leading to a thriving OS-application ecosystem. Recently, the community is seeing AI models such as LLMs sinking from the application layer down to the system layer to provide standard services to various applications. With the incorporation of large language models (LLMs), these advanced systems promise to further narrow the communication gap between humans and machines, ushering in a new era of user-computer interaction.

5.2 Large Language Model Agents

LLM-based single-agent systems (SAS) use a single LLM agent for complex task solving, such as travel planning (Xie et al., 2024), personalized recommendation, and artistic design (Ge et al., 2023a). The agent takes natural language instruction from users as input and decomposes the task into a multistep plan for task solving, where each step may call external tools to be completed, such as collecting information, executing specialized models, or interacting with the external world. Single-agent applications may engage with either the digital environment or the physical environment or both, depending on the task to solve. For example, agents in the virtual or digital environment may invoke APIs (Ge et al., 2023a; Schick et al., 2023; Yao & Narasimhan, 2023; Parisi et al., 2022; Tang et al., 2023; Xie et al., 2024), browse websites (Nakano et al., 2022; Deng et al., 2023; Wu et al., 2024), or execute code (Zhang et al., 2023; Yang et al.), while agents in the physical environment may manipulate objects (Brohan et al., 2023; Fan et al., 2022; Wang et al., 2023a), carry out lab experiments (Boiko et al., 2023; Bran et al., 2023), or make actionable decisions (Huang et al., 2022; Xiang et al., 2023). LLM-based multi-agent systems (MAS) leverage the interaction among multiple agents for problem solving. The relationship among the multiple agents could be cooperative (Wang et al., 2023c; Mandi et al., 2023), competitive (Chan et al., 2023; Du et al., 2023), or a mixture of cooperation and competition (Ge et al., 2023b). In cooperative multi-agent systems, each agent takes and assesses the information provided by other agents, thereby working together to solve complex tasks, such as role playing (Li et al., 2023; Chen et al., 2023; Zhu et al., 2023), social simulation (Park et al., 2023), and software development (Hong et al., 2023; Qian et al., 2023; Wu et al., 2023; Josifoski et al., 2023). In competitive multi-agent systems, agents may debate, negotiate, and compete with each other in a game environment to achieve their goals, such as improving negotiation skills (Fu et al., 2023) and debating about the correct answer (Du et al., 2023; Chan et al., 2023; Liang et al., 2023; Hua et al., 2023).

6 Conclusion and Future Work

This paper introduces AIOS, a novel architecture designed to serve LLM-based agents. Within this architecture, we design and implement an AIOS kernel that isolates resources and LLM-specific services from agent applications for management. Additionally, we develop the AIOS-Agent SDK to facilitate the usage of the functionalities provided by the AIOS kernel for agent applications. Experimental results demonstrate that AIOS not only maintains, but can also improve agent performance on standard benchmarks. Furthermore, AIOS significantly accelerates overall execution time, improves system throughput, and exhibits scalability as the number of concurrent agents increases. We hope that the insights and methodologies shared in this work will contribute to both AI and systems research, fostering a more cohesive, effective, and efficient ecosystem for serving LLM-based agents. We believe future research can explore innovative directions built upon AIOS to refine and expand the AIOS architecture to better meet the evolving requirements of developing and deploying LLM-based AI agents.
Boiko, D. A., MacKnight, R., and Gomes, G. Emergent autonomous scientific research capabilities of large language models. arXiv preprint arXiv:2304.05332, 2023.

Bran, A. M., Cox, S., White, A. D., and Schwaller, P. Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376, 2023.

Bresciani, P., Perini, A., Giorgini, P., Giunchiglia, F., and Mylopoulos, J. Tropos: An agent-oriented software development methodology. Autonomous Agents and Multi-Agent Systems, 8:203–236, 2004.

Brohan, A., Chebotar, Y., Finn, C., Hausman, K., Herzog, A., Ho, D., Ibarz, J., Irpan, A., Jang, E., Julian, R., et al. Do as i can, not as i say: Grounding language in robotic affordances. In Conference on Robot Learning, pp. 287–318. PMLR, 2023.

Chan, C.-M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., Fu, J., and Liu, Z. Chateval: Towards better llm-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations, 2023.

Chen, M., Tworek, J., Jun, H., Yuan, Q., de Oliveira Pinto, H. P., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., Ray, A., Puri, R., Krueger, G., Petrov, M., Khlaaf, H., Sastry, G., Mishkin, P., Chan, B., Gray, S., Ryder, N., Pavlov, M., Power, A., Kaiser, L., Bavarian, M., Winter, C., Tillet, P., Such, F. P., Cummings, D., Plappert, M., Chantzis, F., Barnes, E., Herbert-Voss, A., Guss, W. H., Nichol, A., Paino, A., Tezak, N., Tang, J., Babuschkin, I., Balaji, S., Jain, S., Saunders, W., Hesse, C., Carr, A. N., Leike, J., Achiam, J., Misra, V., Morikawa, E., Radford, A., Knight, M., Brundage, M., Murati, M., Mayer, K., Welinder, P., McGrew, B., Amodei, D., McCandlish, S., Sutskever, I., and Zaremba, W. Evaluating large language models trained on code. 2021a.

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021b.

Daley, R. C. and Dennis, J. B. Virtual memory, processes, and sharing in multics. Communications of the ACM, 11(5):306–312, 1968.

Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., and Su, Y. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2023.

Denning, P. J. The working set model for program behavior. Communications of the ACM, 11(5):323–333, 1968.

Dijkstra, E. W. Cooperating sequential processes. In The Origin of Concurrent Programming: From Semaphores to Remote Procedure Calls, pp. 65–138. Springer, 2002.

Driess, D., Xia, F., Sajjadi, M. S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al. Palm-e: An embodied multimodal language model. In Proceedings of the 40th International Conference on Machine Learning, pp. 8469–8488, 2023.

Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023.

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Engler, D. R., Kaashoek, M. F., and O'Toole Jr, J. Exokernel: An operating system architecture for application-level resource management. ACM SIGOPS Operating Systems Review, 29(5):251–266, 1995.

Fan, L., Wang, G., Jiang, Y., Mandlekar, A., Yang, Y., Zhu, H., Tang, A., Huang, D.-A., Zhu, Y., and Anandkumar, A. Minedojo: Building open-ended embodied agents with internet-scale knowledge. Advances in Neural Information Processing Systems, 35:18343–18362, 2022.

Fu, Y., Peng, H., Khot, T., and Lapata, M. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv preprint arXiv:2305.10142, 2023.

Ge, Y., Hua, W., Mei, K., Tan, J., Xu, S., Li, Z., and Zhang, Y. OpenAGI: When LLM meets domain experts. Advances in Neural Information Processing Systems, 36, 2023a.

Ge, Y., Ren, Y., Hua, W., Xu, S., Tan, J., and Zhang, Y. LLM as OS, Agents as Apps: Envisioning AIOS, Agents and the AIOS-Agent Ecosystem. arXiv:2312.03815, 2023b.

Geng, S., Liu, S., Fu, Z., Ge, Y., and Zhang, Y. Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p5). In Proceedings of the 16th ACM Conference on Recommender Systems, pp. 299–315, 2022.

Hao, S., Gu, Y., Ma, H., Hong, J., Wang, Z., Wang, D., and Hu, Z. Reasoning with language model is planning with world model. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 8154–8173, 2023.

Hoare, C. A. R. Monitors: An operating system structuring concept. Communications of the ACM, 17(10):549–557, 1974.

Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Wang, J., Zhang, C., Wang, Z., Yau, S. K. S., Lin, Z., et al. Metagpt: Meta programming for multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2023.

Hua, W., Fan, L., Li, L., Mei, K., Ji, J., Ge, Y., Hemphill, L., and Zhang, Y. War and peace (waragent): Large language model-based multi-agent simulation of world wars. arXiv preprint arXiv:2311.17227, 2023.

Huang, W., Abbeel, P., Pathak, D., and Mordatch, I. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International Conference on Machine Learning, pp. 9118–9147. PMLR, 2022.

IBM, C. What is batch processing? z/OS Concepts, 2010.

Jennings, N. R., Sycara, K., and Wooldridge, M. A roadmap of agent research and development. Autonomous Agents and Multi-Agent Systems, 1:7–38, 1998.

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.

Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. R. SWE-bench: Can language models resolve real-world github issues? In The Twelfth International Conference on Learning Representations, 2024.

Josifoski, M., Klein, L., Peyrard, M., Li, Y., Geng, S., Schnitzler, J. P., Yao, Y., Wei, J., Paul, D., and West, R. Flows: Building blocks of reasoning and collaborating ai. arXiv preprint arXiv:2308.01285, 2023.

Kim, G., Baldi, P., and McAleer, S. Language models can solve computer tasks. Advances in Neural Information Processing Systems, 36, 2023.

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.

Lerman, K. and Galstyan, A. Agent memory and adaptation in multi-agent systems. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 797–803, 2003.

Li, G., Hammoud, H., Itani, H., Khizbullin, D., and Ghanem, B. Camel: Communicative agents for "mind" exploration of large language model society. Advances in Neural Information Processing Systems, 36, 2023.

Liang, T., He, Z., Jiao, W., Wang, X., Wang, Y., Wang, R., Yang, Y., Tu, Z., and Shi, S. Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118, 2023.

Liu, C. L. and Layland, J. W. Scheduling algorithms for multiprogramming in a hard-real-time environment. Journal of the ACM (JACM), 20(1):46–61, 1973.

Lucas, K. Open interpreter. https://github.com/OpenInterpreter/open-interpreter, 2024.

Mandi, Z., Jain, S., and Song, S. Roco: Dialectic multi-robot collaboration with large language models. arXiv preprint arXiv:2307.04738, 2023.

McKusick, M. K., Joy, W. N., Leffler, S. J., and Fabry, R. S. A fast file system for unix. ACM Transactions on Computer Systems (TOCS), 2(3):181–197, 1984.

Mialon, G., Fourrier, C., Swift, C., Wolf, T., LeCun, Y., and Scialom, T. Gaia: A benchmark for general ai assistants. arXiv preprint arXiv:2311.12983, 2023.

Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., Button, K., Knight, M., Chess, B., and Schulman, J. Webgpt: Browser-assisted question-answering with human feedback, 2022.

Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S., and Xiong, C. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.

Parisi, A., Zhao, Y., and Fiedel, N. Talm: Tool augmented language models. arXiv preprint arXiv:2205.12255, 2022.

Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–22, 2023.

Qian, C., Cong, X., Yang, C., Chen, W., Su, Y., Xu, J., Liu, Z., and Sun, M. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.

Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. ICLR, 2024.

Ritchie, D. M. and Thompson, K. The unix time-sharing system. Communications of the ACM, 17(7):365–375, 1974.

Rosenblum, M. and Ousterhout, J. K. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems (TOCS), 10(1):26–52, 1992.

Ross, S. I., Martinez, F., Houde, S., Muller, M., and Weisz, J. D. The programmer's assistant: Conversational interaction with a large language model for software development. In Proceedings of the 28th International Conference on Intelligent User Interfaces, pp. 491–514, 2023.

Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2023.

Tang, Q., Deng, Z., Lin, H., Han, X., Liang, Q., and Sun, L. Toolalpaca: Generalized tool learning for language models with 3000 simulated cases. arXiv preprint arXiv:2306.05301, 2023.

Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., Poulton, A., Kerkez, V., and Stojnic, R. Galactica: A large language model for science. arXiv preprint arXiv:2211.09085, 2022.

Team, G., Anil, R., Borgeaud, S., Wu, Y., Alayrac, J.-B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A. M., Hauth, A., et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.

Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. In Intrinsically-Motivated and Open-Ended Learning Workshop @ NeurIPS 2023, 2023a.

Wang, X., Wang, Z., Liu, J., Chen, Y., Yuan, L., Peng, H., and Ji, H. Mint: Evaluating llms in multi-turn interaction with tools and language feedback. arXiv preprint arXiv:2309.10691, 2023b.

Wang, Z., Mao, S., Wu, W., Ge, T., Wei, F., and Ji, H. Unleashing cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration. arXiv preprint arXiv:2307.05300, 1(2):3, 2023c.

Wooldridge, M. and Jennings, N. R. Intelligent agents: Theory and practice. The Knowledge Engineering Review, 10(2):115–152, 1995.

Wu, Q., Bansal, G., Zhang, J., Wu, Y., Zhang, S., Zhu, E., Li, B., Jiang, L., Zhang, X., and Wang, C. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.

Wu, Z., Han, C., Ding, Z., Weng, Z., Liu, Z., Yao, S., Yu, T., and Kong, L. Os-copilot: Towards generalist computer agents with self-improvement. arXiv preprint arXiv:2402.07456, 2024.

Xiang, J., Tao, T., Gu, Y., Shu, T., Wang, Z., Yang, Z., and Hu, Z. Language models meet world models: Embodied experiences enhance language models. Advances in Neural Information Processing Systems, 36, 2023.
APPENDIX

This appendix contains additional details for the paper "AIOS: LLM Agent Operating System". The appendix is organized as follows:
• Section §A provides AIOS kernel implementation details.
• Section §B reports more details about the AIOS-Agent SDK.
• Section §C reports more details of the agent benchmarks.
• Section §D presents additional experimental results.
• Section §E provides further discussion.
Thread Binding. Each system call within AIOS is bound to a separate thread for execution, allowing for concurrent processing. Thread binding is implemented by inheriting the Thread class and overriding its __init__ and run methods.
import threading
from threading import Thread

class SysCall(Thread):
    def __init__(self, agent_name, request_data):
        super().__init__()
        self.agent_name = agent_name
        self.request_data = request_data
        self.event = threading.Event()  # signaled by the scheduler when the call may proceed
        self.pid = None
        self.status = None
        self.response = None
        self.time_limit = None
        self.created_time = None
        self.start_time = None
        self.end_time = None

    def set_pid(self, pid):
        self.pid = pid

    def run(self):
        # Record the OS-assigned native thread id as this call's pid,
        # then block until the scheduler signals execution.
        self.set_pid(self.native_id)
        self.event.wait()
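For context, a brief usage sketch follows; the request payload and the point at which the scheduler signals the event are assumptions for illustration, not the full AIOS flow.

# Create and start a thread-bound system call; run() blocks on the event.
syscall = SysCall(agent_name="travel_agent", request_data={"prompt": "..."})
syscall.start()
# ... later, after dispatching and processing, the scheduler wakes the caller:
syscall.event.set()
syscall.join()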
from abc import ABC, abstractmethod

class LLMCore(ABC):
    def __init__(self,
                 llm_name: str,
                 max_gpu_memory: dict = None,
                 eval_device: str = None,
                 max_new_tokens: int = 256,
                 log_mode: str = "console"):
        """ Initialize LLMCore with model configurations.
        """
        self.llm_name = llm_name
        self.max_gpu_memory = max_gpu_memory
        self.eval_device = eval_device
        self.max_new_tokens = max_new_tokens
        self.log_mode = log_mode

    @abstractmethod
    def load_llm_and_tokenizer(self) -> None:
        """ Load the LLM model and tokenizer.
        """
        pass

    def tool_calling_input_format(self,
                                  prompt: list,
                                  tools: list) -> list:
        """ Format prompts to include tool information.
        """
        pass

    @abstractmethod
    def address_request(self,
                        llm_request,
                        temperature=0.0):
        """ Process the request sent to the LLM.
        """
        pass

    @abstractmethod
    def llm_generate(self,
                     prompt,
                     temperature=0.0):
        """ Generate a response based on the provided prompt.
        """
        pass
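As a hedged illustration of a concrete core, the subclass below backs the interface with a local Huggingface text-generation pipeline. Only the transformers pipeline usage is standard library API; the subclass itself and its request format are assumptions, not the actual AIOS implementation.

class HuggingfaceLLMCore(LLMCore):
    """Illustrative core: a locally hosted model behind the LLMCore interface."""

    def load_llm_and_tokenizer(self) -> None:
        from transformers import pipeline
        # The text-generation pipeline bundles model and tokenizer loading.
        self.generator = pipeline("text-generation", model=self.llm_name)

    def address_request(self, llm_request, temperature=0.0):
        # Assumed request format: a dict carrying the prompt text.
        return self.llm_generate(llm_request["prompt"], temperature)

    def llm_generate(self, prompt, temperature=0.0):
        outputs = self.generator(prompt,
                                 max_new_tokens=self.max_new_tokens,
                                 do_sample=temperature > 0)
        return outputs[0]["generated_text"]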
A.3 Scheduler

Different scheduling strategies (e.g., FIFO, RR) are implemented as specific schedulers that inherit from the base scheduler defined below. This approach ensures that new scheduling strategies can be added without interfering with existing schedulers, maintaining isolation and flexibility.
class Scheduler:
    def __init__(
        self,
        llm,
        memory_manager,
        storage_manager,
        tool_manager,
        get_llm_request: LLMRequestQueueGetMessage,
        get_memory_request: MemoryRequestQueueGetMessage,
        get_storage_request: StorageRequestQueueGetMessage,
        get_tool_request: ToolRequestQueueGetMessage,
        log_mode: str = "console",
    ):
        """ Initializes the Scheduler with managers, request handlers, and
        threads for processing.
        """
        self.get_llm_request = get_llm_request
        self.get_memory_request = get_memory_request
        self.get_storage_request = get_storage_request
        self.get_tool_request = get_tool_request
        self.active = False  # start/stop the scheduler
        self.log_mode = log_mode
        # Each queue gets a dedicated processor thread bound to its run_*_syscall method.
        self.request_processors = {
            "llm_syscall_processor": Thread(target=self.run_llm_syscall),
            "mem_syscall_processor": Thread(target=self.run_memory_syscall),
            "sto_syscall_processor": Thread(target=self.run_storage_syscall),
            "tool_syscall_processor": Thread(target=self.run_tool_syscall)
        }
        self.llm = llm
        self.memory_manager = memory_manager
        self.storage_manager = storage_manager
        self.tool_manager = tool_manager

    def start(self):
        """ Starts the scheduler and runs all request processor threads.
        """
        self.active = True
        for name, thread_value in self.request_processors.items():
            thread_value.start()

    def stop(self):
        """ Stops the scheduler and joins all processor threads.
        """
        self.active = False
        for name, thread_value in self.request_processors.items():
            thread_value.join()

    def run_llm_syscall(self):
        """ Handles LLM system call requests.
        """
        pass

    def run_memory_syscall(self):
        """ Handles memory system call requests.
        """
        pass

    def run_storage_syscall(self):
        """ Handles storage system call requests.
        """
        pass

    def run_tool_syscall(self):
        """ Handles tool system call requests.
        """
        pass
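A minimal sketch of what one of the run_*_syscall loops might look like follows, assuming each dequeued item is a thread-bound SysCall (Appendix A.1) and that the queue getter raises queue.Empty when idle; both assumptions are illustrative rather than the actual implementation.

import queue
import time

# Illustrative body for the LLM processing loop of the Scheduler above.
def run_llm_syscall(self):
    while self.active:
        try:
            syscall = self.get_llm_request()   # next call from the LLM queue
        except queue.Empty:
            continue
        syscall.start_time = time.time()
        syscall.response = self.llm.address_request(syscall.request_data)
        syscall.end_time = time.time()
        syscall.event.set()                    # wake the agent thread blocked in run()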
class MemoryManager:
    def __init__(self,
                 memory_limit,
                 eviction_k,
                 storage_manager):
        """ Initialize the memory manager with limits and a storage manager.
        """
        self.memory_blocks = dict()  # per-agent memory blocks, keyed by agent id (aid)
        self.memory_limit = memory_limit
        self.eviction_k = eviction_k
        self.storage_manager = storage_manager

    def mem_write(self, aid, rid, compressed_data):
        """ Write a compressed memory block for agent `aid` under key `rid`.
        Re-inserting an existing block moves it to the most recent position.
        (The original method header was lost in extraction; this write path is
        an assumed reconstruction around the surviving lines.)
        """
        if rid in self.memory_blocks[aid]:
            self.memory_blocks[aid].pop(rid)
        self.memory_blocks[aid][rid] = compressed_data

    def _total_memory_count(self):
        """ Calculate total memory block count across all agents.
        """
        return sum(len(blocks) for blocks in self.memory_blocks.values())
import os
import pickle
import zlib

class StorageManager:
    def __init__(self, storage_path, vector_db=None):
        """ Initializes storage path and optional vector database.
        """
        self.storage_path = storage_path
        os.makedirs(self.storage_path, exist_ok=True)
        self.vector_db = vector_db

    def sto_create(self, file_path, aid=None, rid=None, aname=None):
        """ Create an empty storage file (and vector collection, if configured).
        (Method headers were lost in extraction; they are assumed reconstructions.)
        """
        if not os.path.exists(file_path):
            with open(file_path, "wb") as file:
                file.write(b"")
        if self.vector_db:
            self.vector_db.create_collection(f"{aid}_{rid}" if aid and rid else aname)

    def sto_read(self, file_path):
        """ Read and decompress a stored object, or return None if absent.
        """
        if os.path.exists(file_path):
            with open(file_path, "rb") as file:
                compressed_data = file.read()
            return pickle.loads(zlib.decompress(compressed_data)) if compressed_data else None
        return None

    def sto_clear(self, file_path, aid=None, rid=None, aname=None):
        """ Delete a stored file (and its vector collection, if configured).
        """
        if os.path.exists(file_path):
            os.remove(file_path)
        if self.vector_db:
            self.vector_db.delete(f"{aid}_{rid}" if aid and rid else aname)
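The sto_retrieve path does not survive in the fragments above; the following is a hedged sketch under the assumption that the optional vector database exposes a query(collection, text, top_k) method. The actual interface depends on the backend used.

    # Illustrative StorageManager method (assumed vector-store interface):
    def sto_retrieve(self, aid, rid, query_text, top_k=3):
        """Semantic retrieval over an agent's stored memory (sketch)."""
        if self.vector_db is None:
            return []
        collection = f"{aid}_{rid}"
        return self.vector_db.query(collection, query_text, top_k=top_k)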
import importlib

class ToolManager:
    """ Fragments of the tool manager; method headers are assumed
    reconstructions around the surviving lines, which were orphaned
    in extraction. """

    def tool_run(self, tool_class, tool_org_and_name, tool_params):
        """ Run a tool instance, then release its conflict-map entry so other
        agents' calls to the same tool no longer conflict with it. """
        tool = tool_class(
            tool_org_and_name=tool_org_and_name.split("/")[1]
        )
        tool.run(
            params=tool_params
        )
        with self.lock:
            self.tool_conflict_map.pop(tool_org_and_name)

    def load_tool_instance(self, module_name, class_name):
        """ Dynamically import a tool class by module and class name. """
        tool_module = importlib.import_module(module_name)
        tool_instance = getattr(tool_module, class_name)
        return tool_instance
class AccessManager:
    def __init__(self):
        self.privilege_map = {}

# The decorator and signature of the following hook were lost in extraction;
# they are reconstructed by analogy with the hooks below (an assumption).
@validate(LLMParams)
def useLLM(params: LLMParams) -> LLM:
    """ Initialize and return an LLM instance.
    Args:
        params (LLMParams): Parameters required for LLM initialization.
    Returns:
        LLM: An instance of the initialized LLM.
    """
    return LLM(**params.model_dump())
@validate(MemoryManagerParams)
def useMemoryManager(params: MemoryManagerParams) -> MemoryManager:
    """ Initialize and return a Memory Manager instance.
    Args:
        params (MemoryManagerParams): Parameters required for Memory Manager initialization.
    Returns:
        MemoryManager: An instance of the initialized Memory Manager.
    """
    return MemoryManager(**params.model_dump())
@validate(StorageManagerParams)
def useStorageManager(params: StorageManagerParams) -> StorageManager:
    """ Initialize and return a Storage Manager instance.
    Args:
        params (StorageManagerParams): Parameters required for Storage Manager initialization.
    Returns:
        StorageManager: An instance of the initialized Storage Manager.
    """
    return StorageManager(**params.model_dump())
@validate(ToolManagerParams)
def useToolManager(params: ToolManagerParams) -> ToolManager:
    """ Initialize and return a Tool Manager instance.
    Args:
        params (ToolManagerParams): Parameters required for Tool Manager initialization.
    Returns:
        ToolManager: An instance of the initialized Tool Manager.
    """
    return ToolManager(**params.model_dump())
@validate(SchedulerParams)
def useScheduler(params: SchedulerParams) -> Tuple[Callable[[], None], Callable[[], None]]:
    """ Initialize and return a scheduler with start and stop functions.
    Args:
        params (SchedulerParams): Parameters required for the scheduler.
    Returns:
        Tuple: A tuple containing the start and stop functions for the scheduler.
    """
    if params.get_llm_request is None:
        from aios.hooks.stores._global import global_llm_req_queue_get_message
        params.get_llm_request = global_llm_req_queue_get_message
    if params.get_memory_request is None:
        from aios.hooks.stores._global import global_memory_req_queue_get_message
        params.get_memory_request = global_memory_req_queue_get_message
    if params.get_storage_request is None:
        from aios.hooks.stores._global import global_storage_req_queue_get_message
        params.get_storage_request = global_storage_req_queue_get_message
    if params.get_tool_request is None:
        from aios.hooks.stores._global import global_tool_req_queue_get_message
        params.get_tool_request = global_tool_req_queue_get_message
    scheduler = Scheduler(**params.model_dump())
    return scheduler.start, scheduler.stop
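Putting the hooks together, a hedged end-to-end sketch follows; the parameter names and values are placeholders, since the exact fields of each Params model are not shown in this appendix.

# Illustrative composition of the hooks above (placeholder parameters):
llm = useLLM(LLMParams(llm_name="llama-3.1-8b"))
storage_manager = useStorageManager(StorageManagerParams(storage_path="./storage"))
memory_manager = useMemoryManager(MemoryManagerParams(
    memory_limit=1024, eviction_k=4, storage_manager=storage_manager))
tool_manager = useToolManager(ToolManagerParams())
start, stop = useScheduler(SchedulerParams(
    llm=llm,
    memory_manager=memory_manager,
    storage_manager=storage_manager,
    tool_manager=tool_manager,
))
start()   # begin draining the system call queues
# ... run agents ...
stop()    # join all processor threads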
B AIOS-Agent SDK

B.1 Query and Response

In the AIOS-Agent SDK, two main data structures, Query and Response, are defined to facilitate agent interactions with the AIOS kernel by structuring input requests and output responses. The Query class serves as the input structure for agents to perform various actions within AIOS; its fields are documented below. The Response class represents the output structure that agents receive after the AIOS kernel processes a query.
from typing import Any, Dict, List, Optional
from pydantic import BaseModel

class Query(BaseModel):
    """ Query class represents the input structure for performing various actions.
    Attributes:
        messages: A list of dictionaries where each dictionary represents a
            message containing 'role' and 'content' or other key-value pairs.
        tools: An optional list of JSON-like objects (dictionaries) representing
            tools and their parameters. Default is an empty list.
        action_type: A string that must be one of "chat", "call_tool", or
            "operate_file". This restricts the type of action the query performs.
        message_return_type: The type of the response message. Default is "text".
    """
    messages: List[Dict[str, Any]]
    tools: Optional[List[Dict[str, Any]]] = []
    action_type: str = "chat"
    message_return_type: str = "text"

class Response(BaseModel):
    """ Response class represents the output structure after performing actions.
    Attributes:
        response_message (Optional[str]): The generated response message. Default is None.
        tool_calls (Optional[List[Dict[str, Any]]]): An optional list of JSON-like
            objects (dictionaries) representing the tool calls made during
            processing. Default is None.
    """
    response_message: Optional[str] = None  # the generated response message
    tool_calls: Optional[List[Dict[str, Any]]] = None  # tool calls made during processing
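A brief usage sketch follows; send_query stands in for whatever entry point the SDK actually exposes for submitting a Query to the AIOS kernel, and is an assumed name.

# Illustrative usage (send_query is a hypothetical SDK entry point):
query = Query(
    messages=[{"role": "user", "content": "Summarize today's weather in Paris."}],
    tools=[],
    action_type="chat",
)
response: Response = send_query("example_agent", query)
if response.response_message is not None:
    print(response.response_message)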
Creation Agent: The creation agent is tailored for content generation tasks, such as writing, graphic design, or even video editing. By accessing creative tools and resources through the AIOS-Agent SDK, the creation agent can assist with generating textual content, designing visuals, or assembling multimedia elements, enabling users to produce high-quality content efficiently.

Academic Agent: The academic agent is designed to support research and learning, utilizing the SDK to access scholarly articles to assist with literature reviews and even provide explanations of complex academic topics.
TravelAgent Profile
RecAgent Profile
Description: You are an expert who is good at recommending TV series and movies.
Workflow:
1. Identify the tool that you need to call to obtain information.
2. Based on the information, give recommendations for the user based on the constraints.
Available tools:
1. TripAdvisor
2. Wikipedia
Example of task inputs: Recommend three action movies from the past five years ranked between 1 and 20 with
ratings above 8.0.
CreationAgent Profile
MathAgent Profile
AcademicAgent Profile
Description: You are an expert who is good at looking up and obtaining information from academic articles.
Workflow:
1. Identify the tool to call based on the academic requirements and call the tool.
2. Gather the information obtained from the tool to write an outline or summarization.
Available tools:
1. Arxiv API
Example of task inputs: Summarize recent studies on the role of artificial intelligence in drug discovery from 2018
to 2023.
ReAct (Yao et al., 2023). ReAct prompts LLMs to generate intermediate reasoning traces alongside actionable steps for complex task completion. This dual approach helps models not only plan and track their thought process but also interact with external tools, improving performance on tasks like question answering, game environments, and decision-making problems that require multi-step reasoning and adaptability. By alternating between reasoning and action, ReAct reduces errors from solely predictive responses and enables more accurate, contextually aware task completion.
Reflexion (Shinn et al., 2023). The Reflexion framework enhances language agents with a feedback-driven mechanism,
allowing them to learn from mistakes and adapt behavior through self-reflective feedback loops. By leveraging verbal
reinforcement learning, agents assess and adjust their actions, which improves performance on complex tasks through
iterative learning. This approach makes language agents more resilient and adaptive, enabling them to handle tasks with
evolving requirements and uncertainty.
Autogen (Wu et al., 2023). AutoGen introduces a framework that leverages multiple language model agents with distinct
roles (such as Planner, Executor, and Reflector) to collaboratively solve complex tasks through structured, goal-oriented
conversations. By enabling agents to communicate and share intermediate results, AutoGen coordinates multi-step processes
like data analysis, decision-making, and iterative problem-solving, significantly enhancing efficiency and accuracy beyond
a single model’s capabilities. This approach empowers next-generation applications, allowing LLMs to tackle dynamic
workflows, adapt to task-specific nuances, and achieve higher performance in real-world scenarios. Below is the code for
adapting AutoGen to AIOS. Due to ongoing refactoring work by the AutoGen team, only AutoGen 0.2 (the latest stable
version at the time of writing) is supported.
@add_framework_adapter("AutoGen~0.2")
def prepare_autogen_0_2():
"""
Replace OpenAIWrapper and ConversableAgent methods with aios’s implementation.
This function is used to adapt autogen’s API to aios’s API, and it is used
internally by aios.
"""
# Replace OpenAIWrapper method
OpenAIWrapper.__init__ = adapter_autogen_client_init
OpenAIWrapper.create = adapter_client_create
OpenAIWrapper.extract_text_or_completion_object =
adapter_client_extract_text_or_completion_object
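The add_framework_adapter decorator is what registers each preparation function under its framework name. The following is a minimal sketch of how such a registry could be implemented; the registry dictionary and the prepare_framework lookup helper are assumptions for illustration, not the actual AIOS internals.

from typing import Callable, Dict

# Hypothetical registry mapping framework names to their adapter setup functions.
_FRAMEWORK_ADAPTERS: Dict[str, Callable[[], None]] = {}

def add_framework_adapter(framework_name: str):
    """Register a function that monkey-patches a framework's LLM entry points."""
    def decorator(prepare_fn: Callable[[], None]) -> Callable[[], None]:
        _FRAMEWORK_ADAPTERS[framework_name] = prepare_fn
        return prepare_fn
    return decorator

def prepare_framework(framework_name: str) -> None:
    """Look up and run the registered adapter for the requested framework."""
    if framework_name not in _FRAMEWORK_ADAPTERS:
        raise KeyError(f"No AIOS adapter registered for {framework_name!r}")
    _FRAMEWORK_ADAPTERS[framework_name]()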
Open-Interpreter (Lucas, 2024). Open Interpreter is an open-source framework that lets users interact with LLMs through a
ChatGPT-like interface directly in the terminal, interpreting and executing instructions across programming languages. It
supports both locally hosted and cloud-based LLMs, allowing for streamlined code execution and debugging in natural
language. By translating natural language instructions into executable code, Open Interpreter offers an intuitive environment
that simplifies development workflows and facilitates learning through detailed explanations and interactive support, making
it suitable for developers at all skill levels. Below is the core function to be adapted for Open-Interpreter.
@add_framework_adapter("Open-Interpreter")
def prepare_interpreter():
""" Prepare the interpreter for running LLM in aios.
"""
MetaGPT (Hong et al., 2023). MetaGPT proposes a meta-programming approach that optimizes LLM-driven multi-agent
systems by integrating task-oriented programming paradigms for complex, collaborative problem-solving. MetaGPT encodes
Standardized Operating Procedures (SOPs) directly into structured prompt sequences, creating streamlined workflows that
empower agents with human-like domain expertise to systematically verify intermediate outputs and proactively mitigate
errors. Along this line, MetaGPT addresses the limitations of existing LLM-based frameworks, such as hallucination and
cascading errors during agent chaining. This framework facilitates the decomposition of complex tasks into manageable,
interdependent subtasks, improving overall system robustness, especially in high-stakes, iterative processes where reliability
across agent interactions is crucial. Below is the core function to be adapted for MetaGPT.
@add_framework_adapter("MetaGPT")
def prepare_metagpt():
"""
Prepare the metagpt module to run on aios.
BaseLLM.aask = adapter_aask
C.2 MINT
MINT (Wang et al., 2023b)3 introduced a benchmark to evaluate LLMs' ability to solve challenging tasks through multi-turn
interactions. The benchmark focuses on code generation, decision-making, and reasoning tasks that require LLMs to utilize
tools and incorporate natural language feedback. MINT was constructed by curating multiple single-turn datasets, reducing
an original collection of 29,307 instances to 586 carefully selected examples. The benchmark uses success rate (SR) as its
primary evaluation metric, measuring the percentage of successfully completed tasks. For a given interaction limit k ranging
from 1 to 5, each LLM is allowed up to k turns of interaction, with performance measured as SR_k. In our experiments, we
set k = 5 and focus exclusively on MINT's code generation subset.
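To make the metric concrete, here is a minimal sketch of computing SR_k from per-task outcomes; the data layout (turns needed per task, None for unsolved) is an assumption for illustration.

from typing import List, Optional

def success_rate_at_k(turns_to_success: List[Optional[int]], k: int) -> float:
    """SR_k: fraction of tasks solved within at most k interaction turns.

    Each entry is the number of turns the agent needed to solve a task,
    or None if the task was never solved within the interaction budget.
    """
    solved = sum(1 for turns in turns_to_success if turns is not None and turns <= k)
    return solved / len(turns_to_success)

# Toy example with seven tasks (not real MINT results):
outcomes = [1, 3, None, 5, 2, None, 4]
for k in range(1, 6):
    print(f"SR_{k} = {success_rate_at_k(outcomes, k):.2f}")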
2 The dataset can be found at https://www.github.com/openai/human-eval.
3 https://xwang.dev/mint-bench/
Figure 10. Efficiency analysis on different agent frameworks (ReAct, Reflexion, Autogen, Open-Interpreter, MetaGPT) evaluated on the Llama-3.1-8b model on the MINT benchmark. (a) Normalized throughput, higher is better. (b) Normalized latency, lower is better.
C.3 GAIA
General AI Assistant (GAIA) (Mialon et al., 2023)4 is a benchmark designed to represent a significant milestone in AI
research by evaluating fundamental capabilities essential for general intelligence. Unlike traditional benchmarks that
focus on specialized professional knowledge, GAIA emphasizes everyday tasks that require core abilities including logical
reasoning, multi-modal processing, web navigation, and effective tool utilization. GAIA comprises 466 questions that
evaluate AI assistants across multiple capabilities including reasoning, multi-modal understanding, coding, and tool usage
(particularly web browsing), with tasks involving various data formats like PDFs, spreadsheets, images, videos, and audio.
The benchmark organizes questions into three difficulty levels based on the number of required steps and tools: Level 1
requires minimal tool usage (≤ 5 steps), Level 2 demands multiple tools and 5-10 steps, while Level 3 tests advanced
general assistance capabilities through complex, multi-step sequences requiring diverse tool combinations. Additionally,
while web browsing is central to GAIA, the benchmark deliberately excludes complex web interactions like file uploads or
posting comments, leaving such evaluations for future research.
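As a rough illustration of these level boundaries, the toy classifier below maps step and tool counts to a difficulty level; the actual GAIA level assignment is curated by the benchmark authors, so this sketch is indicative only.

def gaia_difficulty_level(num_steps: int, num_tools: int) -> int:
    """Approximate GAIA level from required steps and tools (illustrative only)."""
    if num_steps <= 5 and num_tools <= 1:
        return 1  # Level 1: minimal tool usage, at most ~5 steps
    if num_steps <= 10:
        return 2  # Level 2: multiple tools, 5-10 steps
    return 3      # Level 3: long, complex sequences with diverse tool combinations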
C.4 SWE-Bench-Lite
SWE-bench (Jimenez et al., 2024)5 is a software engineering benchmark constructed through a rigorous three-stage pipeline
that processes GitHub pull requests (PRs) from 12 popular Python repositories. The pipeline filters approximately 90,000
PRs based on attributes (issue resolution and test contribution) and execution criteria (successful installation and fail-to-pass
test transitions), resulting in 2,294 high-quality task instances. Each task requires models to generate patch files that resolve
software issues, with success determined by comprehensive test coverage. The benchmark distinguishes itself through
real-world challenges, extensive input context (averaging 195 words per issue), cross-context editing requirements (typically
spanning 1.7 files and 32.8 lines per solution), and robust test-based evaluation. Notably, SWE-bench's automated collection
process enables continuous updates with new task instances from GitHub repositories, ensuring benchmark relevance over
time. Our experiments use SWE-bench Lite, a curated subset of 300 instances selected for more tractable, self-contained
evaluation.
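For concreteness, here is a minimal sketch of the fail-to-pass resolution check described above; the set-based representation and parameter names are assumptions for this illustration.

def patch_resolves_instance(passing_after: set,
                            fail_to_pass: set,
                            pass_to_pass: set) -> bool:
    """A generated patch resolves a task instance when every designated
    fail-to-pass test now passes and no previously passing test regresses."""
    return fail_to_pass <= passing_after and pass_to_pass <= passing_after

# Example: the patch fixes both target tests and keeps the regression suite green.
print(patch_resolves_instance(
    passing_after={"test_a", "test_b", "test_c"},
    fail_to_pass={"test_a", "test_b"},
    pass_to_pass={"test_c"},
))  # True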
Figure 11. Efficiency analysis on different agent frameworks (ReAct, Reflexion, Autogen, Open-Interpreter, MetaGPT) evaluated on the Mistral-7b model on the MINT benchmark. (a) Normalized throughput, higher is better. (b) Normalized latency, lower is better.
Figure 12. Efficiency analysis on different agent frameworks (ReAct, Reflexion, Autogen, Open-Interpreter, MetaGPT) evaluated on the Llama-3.1-8b model on the GAIA benchmark. (a) Normalized throughput, higher is better. (b) Normalized latency, lower is better.
E DISCUSSION
E.1 Ethical Consideration
In this section, we discuss both potential positive and negative societal impacts of the work.
The potential positive societal impacts include: 1) Enhanced efficiency and productivity: AIOS can automate routine
tasks, achieve more efficient operations, optimize resource allocation, and reduce bottlenecks, leading to better service
and improved efficiency for agent developers; 2) Improved user experience: with better context, memory, and storage
management, AIOS can offer more personalized and responsive interactions, enhancing user satisfaction across various
applications; 3) Innovation ecosystem: the creation of AIOS could foster a vibrant ecosystem of agent developers and
researchers, driving innovation in AI technologies and applications.
The potential negative societal impacts include: 1) Privacy concerns: the integration of LLMs into operating systems may
raise privacy concerns, as AI models such as LLMs may require access to personal data to provide effective services; 2)
Security risks: as AI systems become more integral to critical infrastructure, they could become targets for cyberattacks,
potentially compromising sensitive data and operations; 3) System failures: the failure of integrated systems could have
widespread consequences, affecting multiple sectors simultaneously and causing disruptions.
Balancing the impacts: To maximize the positive impacts and mitigate the negative ones, it is crucial to adopt a balanced
approach to the development and deployment of AIOS, such as 1) Rules and standards: implementing responsible
development rules and standards to ensure data privacy, security, and ethical use of AI; 2) Robust design: implementing
robust system design, regular maintenance, comprehensive testing, continuous monitoring, backup and recovery plans,
developer training, careful documentation, clear communication, and leveraging AI for predictive maintenance and automated
recovery; 3) Public engagement: engaging with the public to raise awareness about the benefits and challenges of AI,
ensuring that societal concerns are addressed in the development process.
Table 7. Correctness of context switch (text-based and logits-based), which checks the similarity between the generated final outputs with context switch enabled and disabled.
Figure 13. Efficiency analysis on different agent frameworks (ReAct, Reflexion, Autogen, Open-Interpreter, MetaGPT) evaluated on the Mistral-7b model on the GAIA benchmark. (a) Normalized throughput, higher is better. (b) Normalized latency, lower is better.
Figure 14. Efficiency analysis on different agent frameworks (ReAct, Reflexion, Autogen, Open-Interpreter, MetaGPT) evaluated on the Llama-3.1-8b model on the SWE-Bench-Lite benchmark. (a) Normalized throughput, higher is better. (b) Normalized latency, lower is better.
By addressing these considerations, society can harness the potential of AIOS while mitigating its risks, leading to a more
equitable and prosperous future.
Figure 15. Efficiency analysis on different agent frameworks (ReAct, Reflexion, Autogen, Open-Interpreter, MetaGPT) evaluated on the Mistral-7b model on the SWE-Bench-Lite benchmark. (a) Normalized throughput, higher is better. (b) Normalized latency, lower is better.
In the realm of safety, a key direction lies in ensuring the system's resilience against malicious attacks, such as jailbreaking of the LLM or unauthorized access to other agents' memory. In the realm of privacy, the exploration of advanced encryption techniques is vital for safeguarding data
agents’ memory. In the realm of privacy, the exploration of advanced encryption techniques is vital for safeguarding data
transmission within AIOS, thus maintaining the confidentiality of agent communications. Furthermore, the implementation
of watermarking techniques could serve to protect the intellectual property of agent developers by embedding unique
identifiers in outputs, facilitating the tracing of data lineage.
In a nutshell, AIOS stands as a motivating body of work that opens up a broad spectrum of research opportunities. Each
outlined direction can not only build upon the foundational elements of AIOS but also contribute to the advancement of
the field at large.