Nestful Nested API Calls Benchmark
Abstract
1 Introduction
Large language models (LLMs) are increasingly being used as the foundation of agentic
systems that can be used to address complex, real-world problems Yao et al. (2023); Deng
et al. (2024). In agentic problem settings, a language model interacts with both the user
as well as the environment to collect information or execute tasks that allow the agent
to carry out the user’s request. This ability to interface with the broader environment
is enabled through calls to tools or application programming interfaces (APIs), and has
resulted in systems with diverse applications ranging from software assistants Jimenez et al. (2024),
to diagnostic systems Roy et al. (2024), to even formal theorem provers Thakur et al. (2023).
For LLMs to be able to utilize APIs1 properly, they must be capable of executing the
following tasks based on a user’s query: (a) API detection, i.e., from a list of available APIs,
choose which APIs need to be called, (b) slot filling, i.e., given an API, identify the correct
parameters and fill in values, and (c) sequencing, i.e., list the sequence of full API calls
needed to respond to the user query. Of the three categories, sequencing is often viewed as
the most challenging as it requires both API detection and slot-filling to create well-formed
API calls.
Unfortunately, while existing datasets used for evaluating API calling capabilities will test
each of the three categories, the way in which they evaluate sequencing is incomplete. That
is, existing evaluation benchmarks pose sequencing as the prediction of single or multiple
1 API and function are used interchangeably throughout the paper.
[Figure 1 (panels: User Query, Relevant APIs, API Execution). User query: "Let me know the COVID-19 statistics of India and get me the latest articles about the politics of India." The figure shows the specifications of the three relevant APIs (Get_Country_Details_By_Country_Name, Coronavirus_Smartable_GetStats, and NewsAPISearchByKeyWord) and the three-step nested execution that answers the query.]
Figure 1: Example of a nested sequence of function calls from NESTFUL. Based on the
documentation, the APIs "Coronavirus_Smartable_GetStats" and "NewsAPISearchByKeyWord"
take a country code as input (via their location and region parameters, respectively). The example
also demonstrates how the function "Get_Country_Details_By_Country_Name" is implicitly
required to retrieve the country code, despite not being stated explicitly in the user query.
isolated API calls, where the output of any particular API call within that sequence is
considered irrelevant. In contrast, for many real-world tasks, a sequence of API calls may
be nested, i.e., the output of some API calls may be used in the arguments to subsequent
API calls. Figure 1 shows an example of a nested sequence of APIs, where the first API has
to be executed first and its output is used as an argument for the next two API calls.
In this paper, we present NESTFUL, a benchmark specifically designed to evaluate the capabilities
of models on nested API calls. NESTFUL has over 300 high-quality, human-annotated examples
split into two categories, executable and non-executable API calls. The executable samples are
curated manually by crawling RapidAPI, whereas the non-executable samples are handpicked by
human annotators from examples generated synthetically with a state-of-the-art LLM. Table 1 shows
how NESTFUL compares against existing function calling benchmarks. We also evaluate various
standard models on NESTFUL and show that existing models struggle to perform well on the nested
sequencing task, thus providing a useful avenue for the community to test advancements in API
calling capabilities. As our main contribution, we provide the dataset in a public GitHub repository2,
made available under a permissive, open-source license.
2 Related Work
How best to enable API function calling from LLMs is an active area of research. Methods
that utilize large, general-purpose proprietary models (e.g., Gemini (Team et al., 2023) or GPT
(Achiam et al., 2023)) typically make use of carefully constructed prompts and in-context
learning examples, e.g., Song et al. (2023). Smaller, more specialized models often start
from a strong-performing code model (e.g., DeepSeek-Coder (Guo et al., 2024), CodeLlama
(Roziere et al., 2023), or Granite Code (Mishra et al., 2024)) and fine-tune primarily on highly
curated datasets Srinivasan et al. (2023); Ji et al. (2024); Abdelaziz et al. (2024) that have been
extended with synthetic data Zhang et al. (2024).
In addition to prompting strategies and models, there have also been numerous recent
works releasing training and benchmarking data in service of API function calling. ToolLLM
Qin et al. (2023) produced multi-sequence REST-API data generated using GPT-4 (Achiam
et al., 2023). Similarly, APIBench Patil et al. (2023) is a synthetic dataset of single-sequence
API data, specifically from ML libraries, generated with GPT-4. Another
work focusing on synthetic data generation was APIGen Liu et al. (2024), which proposed a
multi-stage, hierarchical verification approach to ensure all generated data was of sufficient
quality. Lastly, API-BLEND (Basu et al., 2024) introduced a large corpus for training and
systematic testing of tool-augmented LLMs in real-world scenarios. In this work, we focus
on tasks that need an interdependent sequence of API calls, which is a necessity for many
real-world, multi-step problems. This differentiates our approach from the existing
evaluation benchmarks, each of which focuses on single or multiple isolated API calls.
3 Data Schema
Each data instance in the NESTFUL dataset consists of a question-answer pair, where the
answer is a sequence of API calls represented as a list of JSON objects. Each JSON object
corresponds to an API, including its ‘name’ and ‘arguments’. Additionally, a unique variable
name is assigned to each JSON object under the key ‘label’, which distinctly identifies each
API call, even when two identical APIs with different arguments appear in the same sequence
(parallel API calls). Argument values that need to be grounded with results from previous
function calls are enclosed in $ signs and formatted as ${variable_name}.{parameter}$,
where {variable_name} refers to the API whose results will be used, and {parameter}
specifies the output parameter of that API's response. Below is the template for the data
schema:
{
    "input": <User Query>,
    "output": [
        {
            "name": <API Name>,
            "arguments": {
                <arg_1>: <value from user query>,
                <arg_2>: <value from user query>,
                ...
            },
            "label": <variable_name>
        },
        {
            "name": <API Name>,
            "arguments": {
                <arg_1>: <value from user query>,
                <arg_2>: ${variable_name}.{parameter}$,
                <arg_3>: <value from user query>,
                ...
            },
            "label": <variable_name>
        }
        ...
    ]
}
2 The dataset will be released soon; we are working on obtaining the required legal clearances.
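To make the grounding convention concrete, the following minimal Python sketch (not part of the benchmark tooling; the regular expression and function name are ours) scans an output sequence in this schema and lists which calls depend on which earlier calls.

import re
from typing import Dict, List, Tuple

# A grounded value looks like "$var1.short_name$": the output parameter
# short_name of the call labeled var1 feeds this argument.
GROUNDED_VALUE = re.compile(r"^\$(?P<label>\w+)\.(?P<param>\w+)\$$")

def extract_dependencies(output: List[Dict]) -> List[Tuple[str, str, str, str]]:
    """Return (dependent_label, argument, source_label, source_parameter) edges."""
    edges = []
    for call in output:
        for arg, value in call.get("arguments", {}).items():
            m = GROUNDED_VALUE.match(value) if isinstance(value, str) else None
            if m:
                edges.append((call["label"], arg, m.group("label"), m.group("param")))
    return edges

# Two calls in the style of Figure 1: the country-details call feeds the
# COVID-statistics call through its short_name output parameter.
output = [
    {"name": "Get_Country_Details_By_Country_Name",
     "arguments": {"name": "India"}, "label": "var1"},
    {"name": "Coronavirus_Smartable_GetStats",
     "arguments": {"location": "$var1.short_name$"}, "label": "var2"},
]
print(extract_dependencies(output))   # [('var2', 'location', 'var1', 'short_name')]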
4 Dataset Collection
The NESTFUL dataset comprises 300 manually curated instances designed for bench-
marking tool-augmented large language models, with a focus on nested sequencing. Each
instance consists of a user query paired with an answer, represented as a sequence of API
calls, where the output of at least one API is used as the input for subsequent API(s). Based
on whether they can be executed, these 300 instances are categorized into two groups: executable
and non-executable.
The executable portion of the NESTFUL dataset is curated using APIs sourced from RapidAPI.
We manually gather 39 different APIs across various domains, including flight
booking, Instagram information, restaurant and hotel searches, music, finance, and more.
For each API, we collect essential specifications such as API names, query/path param-
eters, output parameters, host, endpoint, etc. We also write descriptions by hand for all
parameters (query, path, and output). The following is a template of the specification:
{
    "name": <API name>,
    "description": <API description>,
    "method": <API method, such as GET, POST, ...>,
    "endpoint": <API endpoint path>,
    "host": <API host path>,
    "url": <API URL from RapidAPI for reference>,
    "query_parameters": <dictionary of query parameters (if any), with parameter name, description, type, and required field (boolean)>,
    "path_parameters": <dictionary of path parameters (if any), with parameter name, description, type, and required field (boolean)>,
    "output_parser": <location of the output parameters in the API response object>,
    "output_parameters": <the parameters in the API response>
}
Next, based on the gathered API specifications, we construct the executable dataset as a
collection of query-answer pairs, where the answers consist of sequences of executable APIs.
Our code processes the outputs, calling each API sequentially to achieve the final result.
The questions are human-annotated, ensuring that the final answer can only be obtained by
executing the APIs in a nested manner—where the output of one API is used as the input
for the subsequent API. An example is provided below:
{
    "input": "What is the time difference between Morocco and New York?",
    "output": [
        ...
    ]
}
For non-executable data curation, we begin by collecting API specifications from the Glaive5
and Schema Guided Dialog (SGD)6 datasets. The SGD dataset has a limited set of APIs, but it
provides full specifications, including input and output parameters. Glaive APIs, on the other
hand, do not include output parameters, so we created the output parameters manually.
We then used the DiGiT synthetic data generation framework7 to systematically create a set of
nested-sequence data. This involves prompting the Mixtral-8x22b-Instruct model8 with seed
examples along with detailed instructions. Finally, we perform a two-step filtration
process to refine the dataset. First, we programmatically validate the samples to check
for hallucinations and to ensure that the output APIs adhere to the specifications, i.e., that
required parameters are specified and output parameters are correct (a sketch of such a check
follows the example below). Then, we manually review and exclude any examples with semantic
errors, incorrect API sequence order, or improper variable assignments. Below is an example
based on the Glaive API list:
{
    "input": "Encrypt the email address '[email protected]' with the key 'abc123' and then add a new contact with the encrypted email, name 'John Doe', and phone '123-456-7890'",
    "output": [
        {
            "name": "encrypt_data",
            "arguments": {
                "data": "[email protected]",
                "encryption_key": "abc123"
            },
            "label": "var1"
        },
        {
            "name": "add_contact",
            "arguments": {
                "email": "$var1.encrypted_data$",
                "name": "John Doe",
                "phone": "123-456-7890"
            },
            "label": "var2"
        }
    ]
}
5 https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2
6 https://github.com/google-research-datasets/dstc8-schema-guided-dialogue
7 https://github.com/foundation-model-stack/fms-dgt
8 https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1
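The programmatic validation step described above can be approximated along the following lines. This is an illustrative sketch rather than the exact filtering script; it assumes a simplified specification dictionary with merged "parameters" and an "output_parameters" entry, and the helper name is ours.

import re
from typing import Dict, List

GROUNDED = re.compile(r"^\$(\w+)\.(\w+)\$$")

def is_valid_sample(output: List[Dict], specs: Dict[str, Dict]) -> bool:
    """Reject samples with hallucinated APIs or arguments, missing required
    parameters, or grounded references to unknown labels/output parameters."""
    labels_so_far: Dict[str, str] = {}            # label -> API name
    for call in output:
        spec = specs.get(call["name"])
        if spec is None:                          # hallucinated API name
            return False
        params = spec.get("parameters", {})       # simplified: one merged parameter dict
        for p, meta in params.items():            # every required parameter must be filled
            if meta.get("required") and p not in call["arguments"]:
                return False
        for arg, value in call["arguments"].items():
            if arg not in params:                 # hallucinated argument name
                return False
            m = GROUNDED.match(value) if isinstance(value, str) else None
            if m:
                src_label, src_param = m.groups()
                if src_label not in labels_so_far:   # reference to a later or unknown call
                    return False
                src_outputs = specs[labels_so_far[src_label]].get("output_parameters", {})
                if src_param not in src_outputs:     # incorrect output parameter
                    return False
        labels_so_far[call["label"]] = call["name"]
    return True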
5 Evaluation
5.1 Baselines
In our experiments, we use six open-source models as baselines: (1) xLAM-1b-fc-r
(Liu et al., 2024); (2) Mistral-7B-Instruct-v0.3 (Jiang et al., 2024); (3) Hermes-2-Pro-Mistral-
7B9; (4) Granite-20B-FunctionCalling (Abdelaziz et al., 2024); (5) Mixtral-8x7b-Instruct-v01
(Jiang et al., 2024); and (6) Llama-3-70b-Instruct (Dubey et al., 2024). All six models
are selected from the Berkeley Function-Calling Leaderboard (BFCL)10, which captures the
API/function calling abilities of different proprietary and open models. We also considered
evaluating other models such as Gorilla-openfunctions-v211 and xLAM-7b-fc-r12. However,
these models have a limited context length (less than 4,096 tokens), whereas NESTFUL
dataset examples require at least 8,000 tokens.
5.2 Experimental Setup
The experiments are carried out in one-shot and three-shot settings, where the prompt
contains one or three in-context learning examples, respectively. For each model, we
use the model-specific prompt format along with its special tags. Due to context-length
limitations, we cannot include the entire API library in the prompt for each sample. Instead,
we pre-process the data to create a shorter API list for each example. This list always
includes the gold APIs and the APIs used in the ICL examples, plus some random APIs, keeping
the total prompt length under 8,000 tokens. The API calls are then extracted from the
model's response as a list of JSON objects, taking into account that each model has a specific
format for generating API calls in its response.
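For illustration, the per-example API list construction can be sketched as follows, assuming a count_tokens helper and a pool of distractor APIs; the exact preprocessing script differs in detail.

import json
import random
from typing import Callable, Dict, List

def build_api_list(gold_apis: List[Dict], icl_apis: List[Dict], api_pool: List[Dict],
                   count_tokens: Callable[[str], int], budget: int = 8000) -> List[Dict]:
    """Always keep the gold and ICL-example APIs, then pad with random
    distractor APIs while the serialized list stays under the token budget."""
    selected = {api["name"]: api for api in gold_apis + icl_apis}
    distractors = [api for api in api_pool if api["name"] not in selected]
    random.shuffle(distractors)
    for api in distractors:
        trial = list(selected.values()) + [api]
        if count_tokens(json.dumps(trial)) > budget:
            break
        selected[api["name"]] = api
    return list(selected.values())

In practice the budget must also leave room for the instructions and the ICL examples themselves, so the effective limit on the API list is lower than 8,000 tokens.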
5.3 Metrics
For a detailed evaluation of the generated responses, we calculate three metrics: Partial
and Full Sequence Match for both non-executable and executable experiments, and API
Execution Pass Rate specifically for executable scenarios. The following sections provide an
in-depth explanation of each metric with examples.
Partial and Full Sequence Match A generated response from the model is a sequence
of API calls, each consisting of an API name and its argument-value pairs. The
Partial Sequence Match metric measures how many predicted APIs (with their
argument-value pairs) in a sequence match the gold API sequence. In contrast, the
Full Sequence Match metric evaluates whether the model predicts the exact full sequence
of APIs, including both the API names and their argument-value pairs, compared to
the gold API sequence; that is, it checks whether the predicted API sequence is an exact
match with the gold sequence. In both cases, we calculate the scores for each data instance and
then compute the mean across the entire dataset as the final score.
Suppose, for the user query "Find me a restaurant in Miami that serves Mexican
food and reserve a table for 4 people on 2024-04-22 at 7 PM", the gold API
sequence consists of two APIs: FindRestaurants(cuisine=Mexican, city=Miami)
and ReserveRestaurant(restaurant_name=$FindRestaurants.restaurant_name$,
city=Miami, time=7 PM, date=2024-04-22, party_size=4), where we assume that
FindRestaurants will return one restaurant. Now, suppose the model predicts the
following: FindRestaurants(cuisine=Mexican, city=Miami) and ReserveRestaurant(
restaurant_name=$FindRestaurants.restaurant_name$, city=Miami, date=2024-04-22,
party_size=4). It accurately predicts the first API but misses the time argument in the
second API. As a result, the Partial Sequence Match score will be 0.5 because one API
is correct, while the Full Sequence Match score will be 0 since the entire sequence does
not match. These are strict metrics, but we use them to capture whether the model actually
responds to the user query as a whole.
9 https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B
10 https://gorilla.cs.berkeley.edu/leaderboard.html
11 https://huggingface.co/gorilla-llm/gorilla-openfunctions-v2
12 https://huggingface.co/Salesforce/xLAM-7b-fc-r
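Read as code, one way to implement the two metrics is the sketch below; it reflects our reading of the definitions (a predicted call counts only if both its name and its full argument-value map match a gold call) rather than the exact evaluation script.

from typing import Dict, List

def sequence_match_scores(pred: List[Dict], gold: List[Dict]) -> Dict[str, float]:
    """Partial: fraction of gold calls matched by a predicted call with the same
    name and identical argument-value pairs. Full: 1.0 only if the entire
    sequence matches, 0.0 otherwise."""
    matched = 0
    remaining = list(pred)
    for gold_call in gold:
        for i, pred_call in enumerate(remaining):
            if (pred_call.get("name") == gold_call["name"]
                    and pred_call.get("arguments") == gold_call["arguments"]):
                matched += 1
                del remaining[i]
                break
    partial = matched / len(gold) if gold else 0.0
    full = 1.0 if matched == len(gold) and len(pred) == len(gold) else 0.0
    return {"partial_sequence_match": partial, "full_sequence_match": full}

# The restaurant example above: the second predicted call misses the time argument.
gold = [
    {"name": "FindRestaurants",
     "arguments": {"cuisine": "Mexican", "city": "Miami"}},
    {"name": "ReserveRestaurant",
     "arguments": {"restaurant_name": "$FindRestaurants.restaurant_name$", "city": "Miami",
                   "time": "7 PM", "date": "2024-04-22", "party_size": "4"}},
]
pred = [gold[0],
        {"name": "ReserveRestaurant",
         "arguments": {"restaurant_name": "$FindRestaurants.restaurant_name$", "city": "Miami",
                       "date": "2024-04-22", "party_size": "4"}}]
print(sequence_match_scores(pred, gold))   # {'partial_sequence_match': 0.5, 'full_sequence_match': 0.0}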
API Execution Pass Rate For the executable portion of NESTFUL, we also report the API
Execution Pass Rate, which measures whether the predicted APIs can be executed (using
RapidAPI) sequentially. To calculate this metric, we first check whether the predicted API names
match the gold API names in the correct order, and then execute the APIs in sequence.
For nesting scenarios, where an argument value requires grounding, we resolve it dynamically
using the responses generated by the prior API calls. If, for a given data instance, all API
names and their order match the gold and all APIs execute without any errors, it is considered a
pass. The final score is reported as the percentage of predicted API sequences that pass.
It is important to note that this metric does not guarantee a successful final outcome; it only
measures whether the APIs are executable, that is why we refer to it as a pass rate rather
than a success or win rate. Most of our data involves open-ended queries that result in
dynamic answers from real-world API executions (e.g., retrieving weather details, searching
for hotels, etc.), making it challenging to measure accuracy based on the final output of
the API sequence.
As an example, Figure 1 showcases an executable API scenario. We consider it a
pass when the model predicts the gold APIs (i.e., Get_Country_Details_By_Country_Name,
Coronavirus_Smartable_GetStats, and NewsAPISearchByKeyWord) in the correct order, calls
Get_Country_Details_By_Country_Name to obtain the short_name of a country, automati-
cally passes it to the subsequent APIs (i.e., Coronavirus_Smartable_GetStats and News-
APISearchByKeyWord), and finally executes both of those APIs successfully.
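The pass-rate check itself can be sketched as below, assuming an execute(name, arguments) helper that issues the actual RapidAPI request and returns the parsed output parameters; the grounding substitution mirrors the ${variable_name}.{parameter}$ convention from Section 3, while the real harness differs in detail.

import re
from typing import Callable, Dict, List

GROUNDED = re.compile(r"^\$(\w+)\.(\w+)\$$")

def execution_pass(pred: List[Dict], gold: List[Dict],
                   execute: Callable[[str, Dict], Dict]) -> bool:
    """A sequence passes if the predicted API names appear in the gold order and
    every call executes without error, with grounded arguments filled in from
    the responses of earlier calls."""
    if [c["name"] for c in pred] != [c["name"] for c in gold]:
        return False
    results: Dict[str, Dict] = {}                 # label -> parsed API response
    for call in pred:
        args = {}
        for arg, value in call["arguments"].items():
            m = GROUNDED.match(value) if isinstance(value, str) else None
            if m:
                label, param = m.groups()
                if label not in results or param not in results[label]:
                    return False                  # unresolved grounding
                value = results[label][param]
            args[arg] = value
        try:
            results[call["label"]] = execute(call["name"], args)
        except Exception:
            return False                          # execution error
    return True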
5.4 Results
Table 2 presents a comparison of the different baselines on the NESTFUL dataset in one-shot
and three-shot settings. As anticipated, in most of our experiments the models perform
better when provided with three in-context learning examples in the prompt rather
than one.
Across all models, Partial Sequence Match scores are consistently higher than Full Sequence
Match scores, which is expected, as Full Sequence Match is a stricter metric than Partial
Sequence Match. We inspected the outputs generated by the models and identified several
common issues across them. None of the baseline models has been trained with the
robust data schema discussed in Section 3. So, as expected, these models struggle with tasks
such as assigning variables, utilizing output-parameter details from the API specifications,
and correctly passing variable names and corresponding output parameters to subsequent
APIs, even when provided with in-context learning examples. Models like xLAM-1b-fc-r
and Mixtral-8x7b-Instruct-v01 also struggle with hallucination: they sometimes predict
argument values that are not present in the user query, generate natural-language
text instead of API calls, or keep generating incorrect API sequences until they
reach the maximum token limit. In many cases, they also fail to make the variable assignments correctly.
Llama-3-70b-Instruct outperforms the other models in terms of both Partial and Full Sequence
Match scores on both the executable and non-executable sections. Mixtral-8x7b-Instruct-v01
and Granite-20B-FunctionCalling score just behind Llama-3-70b-Instruct on both
portions of the dataset. xLAM-1b-fc-r, Mistral-7B-Instruct-v0.3, and Hermes-2-Pro-Mistral-7B
perform poorly (below 10%) across the dataset, as they struggle to produce the correct
sequence and the appropriate variable mappings. On the API Execution Pass Rate
metric, Llama-3-70b-Instruct achieves the highest score of 41% (with three-shot
ICL), followed by Mixtral-8x7b-Instruct-v01, Granite-20B-FunctionCalling, and Hermes-2-Pro-
Mistral-7B at 27%, 16%, and 9% (with three-shot ICL), respectively.
                                          Non-Executable                Executable
Models                           Partial Sequence  Full Sequence  Partial Sequence  Full Sequence  API Execution
                                 Match             Match          Match             Match          Pass Rate (%)
One Shot
  xLAM-1b-fc-r                   0.20              0.00           0.04              0.00            0.00
  Mistral-7B-Instruct-v0.3       0.36              0.04           0.02              0.00            0.00
  Hermes-2-Pro-Mistral-7B        0.06              0.00           0.03              0.00            4.88
  Granite-20B-FunctionCalling    0.52              0.26           0.10              0.04           15.85
  Mixtral-8x7b-Instruct-v01      0.46              0.22           0.13              0.04           25.61
  Llama-3-70b-Instruct           0.56              0.29           0.21              0.04           35.37
Three Shots
  xLAM-1b-fc-r                   0.24              0.00           0.06              0.00            0.00
Table 2: Evaluation results on NESTFUL with different state-of-the-art LLMs. Models
are sorted by size. Experiments are done in one-shot and three-shot settings.
The best performance is highlighted in bold, while the second best is underlined. Partial
Sequence Match denotes the fraction of correct API calls in the sequence (API names and
arguments), while Full Sequence Match counts the fraction of instances where the model
gets the entire sequence of APIs correct. Both scores are reported on a 0-1 scale. We
also report the API Execution Pass Rate (in %) for executable APIs, which measures
whether all of the APIs predicted by the model are executable in sequence.
In contrast, Mistral-7B-Instruct-v0.3 and xLAM-1b-fc-r are not able to get any sequence correct with
proper variable assignments for the nesting scenarios, and as a result they score zero on this
metric.
6 Challenges
We consider NESTFUL a challenging benchmark for any LLM, for several reasons. In
this section, we discuss these challenges in detail.
Data-type and Required Parameter Adherence In the API specification, we define the data
type for all parameters (query, path, and output). The type field specifies the data type, such
as string, number, list, etc. Since APIs follow a strict structure for both input and output,
it is crucial for the model to adhere to these specified formats. If the model fails to do so,
particularly in nesting cases where the output of one API is passed as input to another,
the process will fail whenever the output type does not match the expected input type. Similarly, we
specify the required field for all query and path parameters (in the API specification) to
indicate whether a parameter is optional or mandatory. It is crucial that a model takes these
required parameters into account when using an API, as their inclusion is necessary for
successful execution. Ignoring required parameters can lead to incomplete or incorrect API
calls, hurting the model's performance.
Implicit API calling Implicit function calling refers to a scenario where the system must
invoke a specific API, along with potentially other APIs, to fulfill a user query, even though
the query does not explicitly mention the task that requires that particular API. Figure 1
illustrates an example of implicit function calling: the user query only mentions retrieving
COVID-19 statistics and news articles, yet Get_Country_Details_By_Country_Name must be
called implicitly to obtain the country code required by the other two APIs.
7 Conclusion
In this work we introduced NESTFUL, a new benchmark for evaluating the performance
of LLMs on API function calling with nested sequences of function calls (see Sections 3
and 4). We showed that existing LLMs perform poorly on this dataset compared to
their performance on existing benchmarks, and we identified several of their failure modes
(see Section 5). In addition, we outlined the many challenges this dataset poses to LLM
function calling approaches (see Section 6). By making this dataset publicly available under
a permissive open-source license, we aim to push the capabilities of API function calling in
new directions and unlock solutions to more realistic, challenging tasks.
References
Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matthew Stal-
lone, Rameswar Panda, Yara Rizk, GP Bhargav, Maxwell Crouse, Chulaka Gunasekara,
et al. Granite-function calling model: Introducing function calling abilities via multi-task
learning of granular tasks. arXiv preprint arXiv:2407.00121, 2024.
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al.
Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Kinjal Basu, Ibrahim Abdelaziz, Subhajit Chaudhury, Soham Dan, Maxwell Crouse, Asim
Munawar, Sadhana Kumaravel, Vinod Muthusamy, Pavan Kapanipathi, and Luis A.
Lastras. Api-blend: A comprehensive corpora for training and benchmarking api llms,
2024. URL https://arxiv.org/abs/2402.15491.
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and
Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information
Processing Systems, 36, 2024.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle,
Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3
herd of models. arXiv preprint arXiv:2407.21783, 2024.
Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen,
Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets
programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.
Charlie Cheng-Jie Ji, Huanzhi Mao, Fanjia Yan, Shishir G. Patil, Tianjun Zhang, Ion Stoica,
and Joseph E. Gonzalez. Gorilla openfunctions v2. 2024.
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary,
Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian
Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and
Karthik R Narasimhan. Swe-bench: Can language models resolve real-world github
issues? In The Twelfth International Conference on Learning Representations.
Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li,
Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented
llms, 2023.
Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan,
Weiran Yao, Zhiwei Liu, Yihao Feng, et al. Apigen: Automated pipeline for generating
verifiable and diverse function-calling datasets. arXiv preprint arXiv:2406.18518, 2024.
Mayank Mishra, Matt Stallone, Gaoyuan Zhang, Yikang Shen, Aditya Prasad, Adriana Meza
Soria, Michele Merler, Parameswaran Selvam, Saptha Surendran, Shivdeep Singh, et al.
Granite code models: A family of open foundation models for code intelligence. arXiv
preprint arXiv:2405.04324, 2024.
Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language
model connected with massive apis. arXiv preprint arXiv:2305.15334, 2023.
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong,
Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master
16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.
Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca,
and Saravan Rajmohan. Exploring llm-based agents for root cause analysis. In Com-
panion Proceedings of the 32nd ACM International Conference on the Foundations of Software
Engineering, pp. 208–219, 2024.
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan,
Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation
models for code. arXiv preprint arXiv:2308.12950, 2023.
Yifan Song, Weimin Xiong, Dawei Zhu, Wenhao Wu, Han Qian, Mingbo Song, Hailiang
Huang, Cheng Li, Ke Wang, Rong Yao, et al. Restgpt: Connecting large language models
with real-world restful apis. arXiv preprint arXiv:2306.06624, 2023.
Venkat Krishna Srinivasan, Zhen Dong, Banghua Zhu, Brian Yu, Damon Mosk-Aoyama,
Kurt Keutzer, Jiantao Jiao, and Jian Zhang. Nexusraven: a commercially-permissive
language model for function calling. In NeurIPS 2023 Foundation Models for Decision
Making Workshop, 2023.
Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. Toolalpaca:
Generalized tool learning for language models with 3000 simulated cases. arXiv preprint
arXiv:2306.05301, 2023.
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui
Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family
of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Amitayush Thakur, Yeming Wen, and Swarat Chaudhuri. A language-agent approach
to formal theorem-proving. In The 3rd Workshop on Mathematical Reasoning and AI at
NeurIPS’23, 2023.
Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, and Jian Zhang. On
the tool manipulation capability of open-source large language models. arXiv preprint
arXiv:2305.16504, 2023.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and
Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh
International Conference on Learning Representations, 2023.
Jianguo Zhang, Tian Lan, Rithesh Murthy, Zhiwei Liu, Weiran Yao, Juntao Tan, Thai Hoang,
Liangwei Yang, Yihao Feng, Zuxin Liu, et al. Agentohana: Design unified data and
training pipeline for effective agent learning. arXiv preprint arXiv:2402.15506, 2024.