0% found this document useful (0 votes)

21 views10 pages

Nestful Nested API Calls Benchmark

The document introduces NEST FUL, a benchmark designed to evaluate large language models (LLMs) on nested sequences of API calls, where outputs from one API call serve as inputs for subsequent calls. It consists of 300 human-annotated examples categorized into executable and non-executable samples, highlighting the challenges LLMs face with nested API sequences compared to simpler benchmarks. The dataset aims to provide a resource for testing and improving API calling capabilities in real-world applications.

Uploaded by

chakumchukum

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

21 views10 pages

Nestful Nested API Calls Benchmark

Uploaded by

chakumchukum

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 10

Preprint

NEST FUL: A Benchmark for Evaluating LLMs on Nested

Sequences of API Calls
Kinjal Basu1,‡ Ibrahim Abdelaziz1,‡ Kelsey Bradford2 Maxwell Crouse1
Kiran Kate1 Sadhana Kumaravel1 Saurabh Goyal1 Asim Munawar1 Yara Rizk1
Xin Wang1 Luis Lastras1 Pavan Kapanipathi1,‡
1 IBM Research, 2 Georgia Institute of Technology
‡ Corresponding
Authors
{kinjal.basu, ibrahim.abdelaziz1}@ibm.com, [email protected]

Abstract
arXiv:2409.03797v1 [cs.AI] 4 Sep 2024

Autonomous agent applications powered by large language models (LLMs)

have recently risen to prominence as effective tools for addressing com-
plex real-world tasks. At their core, agentic workflows rely on LLMs to
plan and execute the use of tools and external Application Programming
Interfaces (APIs) in sequence to arrive at the answer to a user’s request.
Various benchmarks and leaderboards have emerged to evaluate an LLM’s
capabilities for tool and API use; however, most of these evaluations only
track single or multiple isolated API calling capabilities. In this paper, we
present NEST FUL, a benchmark to evaluate LLMs on nested sequences
of API calls, i.e., sequences where the output of one API call is passed as
input to a subsequent call. NEST FUL has a total of 300 human annotated
samples divided into two types - executable and non-executable. The ex-
ecutable samples are curated manually by crawling Rapid-APIs whereas
the non-executable samples are hand picked by human annotators from
data synthetically generated using an LLM. We evaluate state-of-the-art
LLMs with function calling abilities on NEST FUL. Our results show that
most models do not perform well on nested APIs in NEST FUL as compared
to their performance on the simpler problem settings available in existing
benchmarks.

1 Introduction

Large language models (LLMs) are increasingly being used as the foundation of agentic
systems that can be used to address complex, real-world problems Yao et al. (2023); Deng
et al. (2024). In agentic problem settings, a language model interacts with both the user
as well as the environment to collect information or execute tasks that allow the agent
to carry out the user’s request. This ability to interface with the broader environment
is enabled through calls to tools or application programming interfaces (APIs), and has
resulted in systems with diverse applications ranging from software assistants Jimenez et al.,
to diagnostic systems Roy et al. (2024), to even formal theorem provers Thakur et al. (2023).
For LLMs to be able to utilize APIs1 properly, they must be capable of executing the
following tasks based on a user’s query: (a) API detection, i.e., from a list of available APIs,
choose which APIs need to be called, (b) slot filling, i.e., given an API, identify the correct
parameters and fill in values, and (c) sequencing, i.e., list the sequence of full API calls
needed to respond to the user query. Of the three categories, sequencing is often viewed as
the most challenging as it requires both API detection and slot-filling to create well-formed
API calls.
Unfortunately, while existing datasets used for evaluating API calling capabilities will test
each of the three categories, the way in which they evaluate sequencing is incomplete. That
is, existing evaluation benchmarks pose sequencing as the prediction of single or multiple
1API and function are used interchangeably throughout the paper

1
Preprint

User Query Let me know the COVID-19 statistics of India and get me the latest articles about the politics of India.

Relevant APIs
{ {
{
"name": "Coronavirus_ "name": "Get_Country_Details_By_
Smartable_GetStats", "name": "NewsAPISearchByKeyWord",
Country_Name",
"query_parameters": { "query_parameters": {
"location": { "query_parameters": {
"region": {
"description": "ISO 3166-2 location "name": ...
"description": "Region for the
code...." },
} search results (e.g., 'US'...."
"output_parameters": {
}, }
"output_parameters": { "short_name": {
},
"stats": { "description": "Short name of
"output_parameters": {
"totalConfirmedCases": ... the country"
"newlyConfirmedCases": ... "title": ...
}
... "link": ...
...
} ...
... }
}
} ...
}
} }

API Execution
{
{
"name": "Get_Country_Details_By_Country_Name",

Step - 1
"name":"India",
"arguments": { "short_name":"IN",
"name": "India" ...
} Execute
}
}

{
{
"totalConfirmedCases":
"name": "Coronavirus_Smartable_GetStats",

Step - 2
11063491,
"arguments": {
"newlyConfirmedCases":
"location": "IN"
16577,
} Execute
...
}
} Answer

{
{
"title": "Decode
"name": "NewsAPISearchByKeyWord",
Politics ...",

Step - 3
"arguments": {
"publisher": "The
"query": "politics",
Indian Express",
"region": "IN" Execute "link": ...
}
...
}
} Answer

Figure 1: Example of a nested sequence of function calls from NEST FUL. Based on the
documentation, the APIs "Coronavirus_Smartable_GetStats" and "NewsAPISearchByKey-
Word" take as input a country code; location and region, respectively. The example also
demonstrates how the function "Get_Country_Details_By_Country_Name" is implicitly
required to retrieve the country code, despite not stated explicitly in the user query.

isolated API calls, where the output of any particular API call within that sequence is
considered irrelevant. In contrast, for many real-world tasks, a sequence of API calls may
be nested, i.e., the output of some API calls may be used in the arguments to subsequent
API calls. Figure 1 shows an example of a nested sequence of APIs, where the first API has
to be executed first and its output is used as an argument for the next two API calls.
In this paper, we present NEST FUL, a benchmark specifically designed to evaluate the capa-
bilities of the models on nested API calls. NEST FUL has over 300 human-annotated high
quality examples that have been split into two categories, executable and non-executable
API calls. The executable samples are curated manually by crawling Rapid-APIs whereas
the non-executable samples are handpicked by human annotators from synthetically gener-
ated examples using a state-of-the-art LLM. Table 1 shows howNEST FUL compares against
existing function calling benchmarks. We also evaluate various standard models on NEST-
FUL and show that existing models struggle to perform well on the nested sequencing task,
thus providing a useful avenue for the community to test advancements in API calling

2
Preprint

Human API Name Nested Executable Real World

Dataset Instances
Annotated & Args Sequencing APIs APIs
ToolLLM (Qin et al., 2023) 500 x x x x ✔
RestBench (Song et al., 2023) 157 ✔ x x x ✔
NexusRaven3 318 x ✔ x x ✔
API-Bank (Li et al., 2023) 314 ✔ ✔ x ✔ ✔
ToolBench Xu et al. (2023) 795 ✔ ✔ x ✔ ✔
ToolAlpaca (Tang et al., 2023) 214 x ✔ x x ✔
BFCL4 1700 x ✔ x x x
NEST FUL 300 ✔ ✔ ✔ ✔ ✔

Table 1: A comparison of our NEST FUL dataset to notable Tool-learning benchmarks

capabilities. As our main contribution, we provide the dataset in a public github repository2 ,
made available under a permissive, open-source license.

2 Related Work

How best to enable API function calling from LLMs is an active area of research. Methods
that utilize large, general-purpose proprietary models (e.g., Gemini (Team et al., 2023) or GPT
(Achiam et al., 2023)) typically make use of carefully constructed prompts and in-context
learning examples, e.g., Song et al. (2023). Smaller, more specialized models often start
from a strong-performing code model (e.g., DeepSeek-Coder (Guo et al., 2024), CodeLlama
(Roziere et al., 2023), or Granite Code (Mishra et al., 2024)) and fine-tune primarily on highly
curated datasets Srinivasan et al. (2023); Ji et al. (2024); Abdelaziz et al. (2024) that have been
extended with synthetic data Zhang et al. (2024).
In addition to prompting strategies and models, there have also been numerous recent
works releasing training and benchmarking data in service of API function calling. ToolLLM
Qin et al. (2023) produced multi-sequence REST-API data generated using GPT4 Achiam
et al. (2023). Similarly, APIBench Patil et al. (2023) is a synthetic dataset of single-sequence
API data specifically from ML libraries generated based on GPT-4 language models. Another
work focusing on synthetic data generation was APIGen Liu et al. (2024), which proposed a
multi-stage, hierarchical verification approach to ensure all data generated was of sufficient
quality. Lastly, API-BLEND (Basu et al., 2024) introduced a large corpora for training and
systematic testing of tool-augmented LLMs in real-world scenarios. In this work, we focus
on tasks that need an interdependent sequence of API calls, which is a necessity for many
real-world, multi-step problems. This thus differentiates our approach from the existing
evaluation benchmarks, each of which focus on single or multiple isolated API calling
functionality.

3 Data Schema

Each data instance in the NEST FUL dataset consists of a question-answer pair, where the
answer is a sequence of API calls represented as a list of JSON objects. Each JSON object
corresponds to an API, including its ‘name’ and ‘arguments’. Additionally, a unique variable
name is assigned to each JSON object under the key ‘label’, which distinctly identifies each
API, even when two identical APIs with different arguments appear in the same sequence
(parallel API calls). Argument values that need to be grounded with results from previous
function calls are enclosed in a $ sign and formatted as ${variable_name}.{parameter}$,
where {variable_name} refers to the API whose results will be used, and {parameter}
specifies the output parameter of that API response. Below is the template for the data
schema:
{
" input " : < User Query > ,
" output " : [

2 The dataset will be released soon, we are working on getting the required legal clearances

3
Preprint

{
" name " : < API Name > ,
" arguments " : {
< arg_ 1 >: < value from user query > ,
< arg_ 2 >: < value from user query > ,
...
},
" label " : < variable_name >
},
{
" name " : < API Name > ,
" arguments " : {
< arg_ 1 >: < value from user query > ,
< arg_ 2 >: $ { variable_name } . { parameter } $ ,
< arg_ 3 >: < value from user query > ,
...
},
" label " : < variable_name >
}
...
]
}

4 Dataset Collection

The NEST FUL dataset comprises of 300 manually curated instances designed for bench-
marking tool-augmented large language models, with a focus on nested sequencing. Each
instance consists of a user query paired with an answer, represented as a sequence of API
calls, where the output of at least one API is used as the input for subsequent API(s). Based
on the ability to be executed, these 300 instances are categorized into two groups - executable
and non-executable.

4.1 Executable APIs

The executable portion of the NEST FUL dataset is curated using APIs sourced from Rap-
idAPI. We manually gather 39 different APIs across various domains, including flight
booking, Instagram information, restaurant and hotel searches, music, finance, and more.
For each API, we collect essential specifications such as API names, query/path param-
eters, output parameters, host, endpoint, etc. We also write descriptions by hand for all
parameters (query, path, and output). Following is a template of the specification.
{
" name " : < API name > ,
" description " : < API description > ,
" method " : < API methods , such as GET , POST , ... > ,
" endpoint " : < API endpoint path > ,
" host " : < API host path > ,
" url " : < API URL from RapidAPI for reference > ,
" query_parameters " : < dictionary of query parameters ( if any ) > with parameter name ,
description , type , required field ( boolean ) >,
" path_parameters " : < dictionary of path parameters ( if any ) > with parameter name ,
description , type , required field ( boolean ) >,
" output_parser " : < location of the output parameters in the API response object > ,
" output_parameters " : < the parameters in the API response >
}

Next, based on the gathered API specifications, we construct the executable dataset as a
collection of query-answer pairs, where the answers consist of sequences of executable APIs.
Our code processes the outputs, calling each API sequentially to achieve the final result.
The questions are human-annotated, ensuring that the final answer can only be obtained by
executing the APIs in a nested manner—where the output of one API is used as the input
for the subsequent API. An example is provided below:
{
" input " : " What is the time difference between Morocco and New York ?" ,
" output " : [
{

4
Preprint

" name " : " WeatherAPI . com_Time_Zone_API " ,

" arguments " : {
" q " : " Morocco "
},
" label " : " var 1 "
},
{
" name " : " WeatherAPI . com_Time_Zone_API " ,
" arguments " : {
" q " : " New York "
},
" label " : " var 2 "
},
{
" name " : " CipherCircuit_Math_Assistant_CalculateAllArithmeticOperations " ,
" arguments " : {
" numbers " : " $var 1 . localtime$ - $var 2 . localtime$ "
},
" label " : " var 3 "
},
{
" name " : " var_result " ,
" arguments " : {
" time_difference " : " $var 3 . result$ "
}
}
]
}

4.2 Non-Executable APIs

For non-executable data curation, we begin by collecting API specifications from the Glaive5
and Schema Guided Dialog (SGD)6 datasets. SGD dataset has a limited set of APIs, but it
has full specifications, including input and output parameters. On the other hand, Glaive
APIs do not have output parameters, for which, we created the output parameters manually.
We then used DiGiT synthetic data generation framework7 to systematically create a set of
nested sequence data. This involves using seed examples along with detailed instructions
to prompt the Mixtral-8x22b-Instruct model8 . Finally, we perform a two-step filtration
process to refine the dataset. First, we programmatically validate the samples to check
for hallucinations and ensure that the output APIs adhere to the specifications; required
parameters are specified and output parameters are correct. Then, we manually review
and exclude any examples with semantic errors, incorrect API sequence order, or improper
variable assignments. Below is an example based on Glaive API list:
{
" input " : " Encrypt the email address ' john . doe@gmail . com ' with the key ' abc 1 2 3 ' and then
add a new contact with the encrypted email , name ' John Doe ', and phone '1 2 3 -4 5 6 -7 8 9 0
'" ,
" output " : [
{
" name " : " encrypt_data " ,
" arguments " : {
" data " : " john . doe@gmail . com " ,
" encryption_key " : " abc 1 2 3 "
},
" label " : " var 1 "
},
{
" name " : " add_contact " ,
" arguments " : {
" email " : " $var 1 . encrypted_data$ " ,
" name " : " John Doe " ,
" phone " : " 1 2 3 -4 5 6 -7 8 9 0 "
},
" label " : " var 2 "

5 https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2
6 https://github.com/google-research-datasets/dstc8-schema-guided-dialogue
7 https://github.com/foundation-model-stack/fms-dgt
8 https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1

5
Preprint

}
]
}

5 Evaluation

5.1 Baselines

In our experiments, we have used 6 open sourced models as baseline: (1) xLAM-1b-fc-r
(Liu et al., 2024); (2) Mistral-7B-Instruct-v0.3 (Jiang et al., 2024); (3) Hermes-2-Pro-Mistral-
7B9 ; (4) Granite-20B-FunctionCalling (Abdelaziz et al., 2024); (5) Mixtral-8x7b-Instruct-v01
(Jiang et al., 2024), and (6) Llama-3-70b-Instruct (Dubey et al., 2024). All these 6 models
are selected from the Berkeley Function-Calling Leaderboard (BFCL)10 , which captures the
API/Function calling abilities of different proprietary and open models. We also considered
evaluating other models like Gorilla-openfunctions-v211 and xLAM-7b-fc-r12 . However,
these models have a limited context length (less than 4,096 tokens), whereas NEST FUL
dataset examples require at least 8,000 tokens.

5.2 Experimental Settings

The experiments are carried out in one-shot and three shots setting, where in the prompt
we provide one or three in-context learning examples, respectively. For each model, we
have used the model specified prompt along with the special tags. Due to context length
limitations, we cannot include the entire API library in the prompt for each sample. Instead,
we pre-process the data to create a shorter API list for each example. This list ensures
inclusion of the gold APIs, the APIs used in ICL examples, and some random APIs, keeping
the total prompt length under 8,000 tokens. Also, the API calls are extracted from the
model’s response as a list of JSON objects, taking into account that each model has a specific
way to generate the API calls in the response.

5.3 Metrics

For a detailed evaluation of the generated responses, we calculate three metrics: Partial
and Full Sequence Match for both non-executable and executable experiments, and API
Execution Pass Rate specifically for executable scenarios. The following sections provide an
in-depth explanation of each metric with examples.

Partial and Full Sequence Match A generated response from the model is a sequence
of API calls, with each call consisting of an API name and its argument-value pairs. We
use the Partial Sequence Matching metric to determine how many predicted APIs (with its
argument-value pairs) in a sequence match with the Gold API sequence. In contrast, the
Full Sequence Matching metric evaluates whether the model predicts the exact full sequence
of APIs, including both the API names and their argument-value pairs, when compared to
the Gold API sequence. This metric checks whether the predicted API sequence is an exact
match with the gold or not. In both cases, we calculate the scores for each data instance and
then compute the statistical mean across the entire dataset as the final score.
Suppose for a user query - "Find me a restaurant in Miami that serves Mexican
food and reserve a table for 4 people on 2024-04-22 at 7 PM", where the gold API
sequence consists of two APIs: FindRestaurants(cuisine=Mexican, city=Miami)
and ReserveRestaurant(restaurant_name=$FindRestaurants.restaurant_name$,
city=Miami, time=7 PM, date=2024-04-22, party_size=4), where we assume that
9 https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B
10 https://gorilla.cs.berkeley.edu/leaderboard.html
11 https://huggingface.co/gorilla-llm/gorilla-openfunctions-v2
12 https://huggingface.co/Salesforce/xLAM-7b-fc-r

6
Preprint

the FindRestaurants will return one restaurant. Now, the model predicts the
following - FindRestaurants(cuisine=Mexican, city=Miami) and ReserveRestau
rant(restaurant_name=$FindRestaurants.restaurant_name$, city=Miami, date=2024-04-
22, party_size=4). It accurately predicts the first API but misses the time argument in the
second API. As a result, the Partial Sequence Match score will be 0.5 because one API
is correct, while the Full Sequence Match score will be 0 since the entire sequence does
not match. It is worth mentioning that these are stricter metrics, but we used it to capture
whether the model is actually responding the user query overall or not.
API Execution Pass Rate For the executable portion of NEST FUL, we also report the API
Execution Pass Rate, which measures whether the predicted APIs can be executed (using
RapidAPI) sequentially. To calculate this metric, we first check if the predicted API names
match the gold API names and they are in correct order, then execute the APIs in sequence.
For the nesting scenarios, where a argument value requires grounding, we do it dynamically
using the responses generated from the prior API calls. If, for a given data instance, all API
names and order match the gold and all APIs execute without any errors, it is considered a
pass. The final score is reported as the percentage of predicted API sequences that pass.
It is important to note that this metric does not guarantee a successful final outcome; it only
measures whether the APIs are executable, that is why we refer to it as a pass rate rather
than a success or win rate. Most of our data involves open-ended queries that result in
dynamic answers from real-world API executions (e.g., retrieving weather details, searching
for hotels, etc.), making it challenging to measure an accuracy based on the final output of
the API sequence.
As an example, Figure 1 showcases an executable API scenario. We consider it as a
pass, when the model predicts the gold APIs (i.e., Get_Country_Details_By_Country_Name,
Coronavirus_Smartable_GetStats, and NewsAPISearchByKeyWord) in the correct order, calls
Get_Country_Details_By_Country_Name to obtain the short_name of a country, automati-
cally passes it to the subsequent APIs (i.e., Coronavirus_Smartable_GetStats and News
APISearchByKeyWord), and, finally, successfully executes both the APIs.

5.4 Results

Table 2 presents a comparison of different baselines on the NEST FUL dataset with one-shot
and three-shots settings. As anticipated, in most of our experiments, the models are doing
better when they are provided with three shot in-context learning examples in the prompt
instead of one-shot example.
Across all models, Partial Sequence Match scores are consistently higher than and Full Sequence
Match scores, which is expected, as the Full Sequence Match is more stricter metric than the
Partial. We looked into the outputs generated by the models and have identified several
common issues across them. None of the five baseline models have been trained with the
robust data schema discussed in Section 3. So, as expected, these models struggle with tasks
such as assigning variables, utilizing output parameter details from the API specifications,
and correctly passing variable names and corresponding output parameters to subsequent
APIs, even when provided with in-context learning examples. Models like xLAM-1b-fc-r
and Mixtral-8x7b-Instruct-v01 also struggles with hallucination, as it sometimes predict
argument values that are not present in the user query or generates natural languange
texts instead of APIs, and in some cases it keeps on generating wrong API sequence until it
reaches to the max token. Also, in many cases they misse the variable assignments correctly.
Llama-3-70b-Instruct outperforms other models in terms of both Partial and Full Sequence
Match scores for both executable and non-executable sections. Mixtral-8x7b-Instruct-v01
and Granite-20B-FunctionCalling are scoring just after Llama-3-70b-Instruct on both the
portion of the dataset. xLAM-1b-fc-r, Mistral-7B-Instruct-v0.3, and Hermes-2-Pro-Mistral-7B
perform poorly (getting < 10%) across the dataset, as it is challenging to get the correct
sequence and doing the appropriate variable mappings. On the API Execution Pass Rate
metric, the Llama-3-70b-instruct is achieving the highest score of 41% (with three-shots
ICL). After that the Mixtral-8x7b-Instruct-v01, Granite-20B-FunctionCalling, Hermes-2-Pro-

7
Preprint

Non-Executable Executable
Models Partial Sequence Full Sequence Partial Sequence Full Sequence API Execution
Match Match Match Match Pass Rate (%)
One Shot xLAM-1b-fc-r 0.20 0.00 0.04 0.00 0.00
Mistral-7B-Instruct-v0.3 0.36 0.04 0.02 0.00 0.00
Hermes-2-Pro-Mistral-7B 0.06 0.00 0.03 0.00 4.88
Granite-20B-FunctionCalling 0.52 0.26 0.10 0.04 15.85
Mixtral-8x7b-Instruct-v01 0.46 0.22 0.13 0.04 25.61
Llama-3-70b-Instruct 0.56 0.29 0.21 0.04 35.37
xLAM-1b-fc-r 0.24 0.00 0.06 0.00 0.00
Three Shots

Mistral-7B-Instruct-v0.3 0.36 0.00 0.03 0.00 0.00

Hermes-2-Pro-Mistral-7B 0.13 0.03 0.07 0.02 8.54
Granite-20B-FunctionCalling 0.54 0.30 0.18 0.09 15.85
Mixtral-8x7b-Instruct-v01 0.51 0.28 0.14 0.06 26.83
Llama-3-70b-Instruct 0.60 0.34 0.25 0.10 41.46

Table 2: Evaluation Result on NEST FUL with different state-of-the-art LLMs. Models
are sorted based on the size. Experiments are done in one-shot and three shots settings.
Best performance is highlighted in bold, while the second best is underlined. Partial
Sequence Match denotes the percentage of calling the correct API sequence (API names and
arguments) while Full Sequence Match counts the percentage of times where the model
gets the entire sequence of APIs correctly. Both the scores are reported in 0 to 1 range. We
also report API Execution Pass Rate (reported in %) for executable APIs which measures
whether all the predicted APIs by the model are executable in sequence or not.

Mistral-7B models achieve 27%, 16%, and 9% (with three-shots ICL) respectively. In contrast,
Mistral-7B-Instruct-v0.3 and xLAM-1b-fc-r are not able to get any sequence correctly with
proper variable assignments for the nesting scenarios, as a result they score zero on this
metric.

6 Challenges

We consider the NEST FUL as a challenging benchmark for any LLMs for several reasons. In
this section, we will discuss these challenges in detail.

Data-type and Required Parameter Adherence In the API Specification, we define the data
type for all parameters—query, path, and output. The type field specifies the data type, such
as string, number, list, etc. Since APIs follow a strict structure for both input and output,
it is crucial for the model to adhere to these specified format. If the model fails to do so,
particularly in the nesting cases where the output of one API is passed as input to another,
the process will fail if the output type does not match the expected input type. Similarly, we
specify the required fields for all query and path parameters (in the API Specification) to
indicate whether a parameter is optional or mandatory. It is crucial that any model take into
account these required parameters when using an API, as their inclusion is necessary for
successful execution. Ignoring required parameters can lead to incomplete or incorrect API
calls, affecting the model’s performance.

Variable Assignments As discussed in Section 3, we add variable assignments

for each API in the output to manage parallel function calls, which is very com-
mon in real life applications. An example of parallel nested function calling has
been provided in Section 4.1, where the first two APIs are the same (Weather
API.com_Time_Zone_API) creating parallel functions. However, for the third API (CipherCir
cuit_Math_Assistant_CalculateAllArithmeticOperations), it is crucial to distinguish the
outputs of the first two APIs to obtain the correct result. This adds complexity to the dataset,
as the models are not trained with this schema and must follow the instructions precisely.
Our qualitative studies (discussed in Section 5.4) suggest the same as well.

Implicit API calling Implicit function calling refers to a scenario where the system must
invoke a specific API, along with potentially other APIs, to fulfill a user query, even though
the query does not explicitly mention the task that requires that particular API. Figure 1
illustrates an example of implicit function calling, where the user query only mentions task

8
Preprint

for two APIs — Coronavirus_Smartable_GetStats and NewsAPISearchByKeyWord. However,

when the model analyzes the specifications of both APIs, it identifies that the query parame-
ters — location for the first API and region for the second — require the country name to be
provided in its short form (e.g., IN for India). To fulfill this requirement, the model implicitly
calls the Get_Country_Details_By_Country_Name API, which converts the full country name
into its short version. This benchmark includes multiple scenarios of implicit function
calling, adding to the dataset’s complexity and making it more challenging for models to
handle.

7 Conclusion

In this work we introduced NEST FUL, a new benchmark for evaluating the performance
of LLMs on API function calling with nested sequences of function calls (see Sections 3
and 4). We showed that existing LLMs perform poorly on this dataset as compared to
their performance on existing benchmarks and identified their several modes of failure
(see Section 5). In addition, we outlined the many challenges this dataset poses to LLM
function calling approaches (see Section 6). By making this dataset available publicly under
a permissive open-source license, we aim to push the capabilities of API function calling in
new directions and unlock solutions to more realistic, challenging tasks.

References
Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matthew Stal-
lone, Rameswar Panda, Yara Rizk, GP Bhargav, Maxwell Crouse, Chulaka Gunasekara,
et al. Granite-function calling model: Introducing function calling abilities via multi-task
learning of granular tasks. arXiv preprint arXiv:2407.00121, 2024.

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni
Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al.
Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

Kinjal Basu, Ibrahim Abdelaziz, Subhajit Chaudhury, Soham Dan, Maxwell Crouse, Asim
Munawar, Sadhana Kumaravel, Vinod Muthusamy, Pavan Kapanipathi, and Luis A.
Lastras. Api-blend: A comprehensive corpora for training and benchmarking api llms,
2024. URL https://arxiv.org/abs/2402.15491.

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and
Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information
Processing Systems, 36, 2024.

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle,
Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3
herd of models. arXiv preprint arXiv:2407.21783, 2024.

Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen,
Xiao Bi, Yu Wu, YK Li, et al. Deepseek-coder: When the large language model meets
programming–the rise of code intelligence. arXiv preprint arXiv:2401.14196, 2024.

Charlie Cheng-Jie Ji, Huanzhi Mao, Fanjia Yan, Shishir G. Patil, Tianjun Zhang, Ion Stoica,
and Joseph E. Gonzalez. Gorilla openfunctions v2. 2024.

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary,
Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian
Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.

Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and
Karthik R Narasimhan. Swe-bench: Can language models resolve real-world github
issues? In The Twelfth International Conference on Learning Representations.

9
Preprint

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li,
Fei Huang, and Yongbin Li. Api-bank: A comprehensive benchmark for tool-augmented
llms, 2023.
Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan,
Weiran Yao, Zhiwei Liu, Yihao Feng, et al. Apigen: Automated pipeline for generating
verifiable and diverse function-calling datasets. arXiv preprint arXiv:2406.18518, 2024.
Mayank Mishra, Matt Stallone, Gaoyuan Zhang, Yikang Shen, Aditya Prasad, Adriana Meza
Soria, Michele Merler, Parameswaran Selvam, Saptha Surendran, Shivdeep Singh, et al.
Granite code models: A family of open foundation models for code intelligence. arXiv
preprint arXiv:2405.04324, 2024.
Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language
model connected with massive apis. arXiv preprint arXiv:2305.15334, 2023.
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong,
Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master
16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.
Devjeet Roy, Xuchao Zhang, Rashi Bhave, Chetan Bansal, Pedro Las-Casas, Rodrigo Fonseca,
and Saravan Rajmohan. Exploring llm-based agents for root cause analysis. In Com-
panion Proceedings of the 32nd ACM International Conference on the Foundations of Software
Engineering, pp. 208–219, 2024.
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan,
Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation
models for code. arXiv preprint arXiv:2308.12950, 2023.
Yifan Song, Weimin Xiong, Dawei Zhu, Wenhao Wu, Han Qian, Mingbo Song, Hailiang
Huang, Cheng Li, Ke Wang, Rong Yao, et al. Restgpt: Connecting large language models
with real-world restful apis. arXiv preprint arXiv:2306.06624, 2023.
Venkat Krishna Srinivasan, Zhen Dong, Banghua Zhu, Brian Yu, Damon Mosk-Aoyama,
Kurt Keutzer, Jiantao Jiao, and Jian Zhang. Nexusraven: a commercially-permissive
language model for function calling. In NeurIPS 2023 Foundation Models for Decision
Making Workshop, 2023.
Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. Toolalpaca:
Generalized tool learning for language models with 3000 simulated cases. arXiv preprint
arXiv:2306.05301, 2023.
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui
Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family
of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
Amitayush Thakur, Yeming Wen, and Swarat Chaudhuri. A language-agent approach
to formal theorem-proving. In The 3rd Workshop on Mathematical Reasoning and AI at
NeurIPS’23, 2023.
Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, and Jian Zhang. On
the tool manipulation capability of open-source large language models. arXiv preprint
arXiv:2305.16504, 2023.
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and
Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh
International Conference on Learning Representations, 2023.
Jianguo Zhang, Tian Lan, Rithesh Murthy, Zhiwei Liu, Weiran Yao, Juntao Tan, Thai Hoang,
Liangwei Yang, Yihao Feng, Zuxin Liu, et al. Agentohana: Design unified data and
training pipeline for effective agent learning. arXiv preprint arXiv:2402.15506, 2024.

Tales of The Sinful Stars
100% (5)
Tales of The Sinful Stars
199 pages
Lecture17 18 LLM Based Agent
No ratings yet
Lecture17 18 LLM Based Agent
64 pages
Cns-Atm Resource Guide
100% (1)
Cns-Atm Resource Guide
131 pages
New Multiple Choice Question
No ratings yet
New Multiple Choice Question
262 pages
AI Agent Workflow Vs Agent Part 5 by Vipra Singh Mar, 2025 Medium
No ratings yet
AI Agent Workflow Vs Agent Part 5 by Vipra Singh Mar, 2025 Medium
25 pages
Problem Solution and Tech Stack
No ratings yet
Problem Solution and Tech Stack
22 pages
A History of UNESCO
No ratings yet
A History of UNESCO
469 pages
Hands-On Lab With LLMs and Gen AI Within IDC
No ratings yet
Hands-On Lab With LLMs and Gen AI Within IDC
57 pages
Few-Shot API Attack Anomaly Detection in A Classification-by-Retrieval Framework
No ratings yet
Few-Shot API Attack Anomaly Detection in A Classification-by-Retrieval Framework
13 pages
T LLM: F L L M M 16000+ R - API: OOL Acilitating Arge Anguage Odels To Aster EAL World S
No ratings yet
T LLM: F L L M M 16000+ R - API: OOL Acilitating Arge Anguage Odels To Aster EAL World S
23 pages
Paper 5
No ratings yet
Paper 5
23 pages
NL2API - A Framework For Bootstrapping Service Recommendation Using Natural Language Queries
No ratings yet
NL2API - A Framework For Bootstrapping Service Recommendation Using Natural Language Queries
9 pages
Apigen: Automated Pipeline For Generating Verifiable and Diverse Function-Calling Datasets
No ratings yet
Apigen: Automated Pipeline For Generating Verifiable and Diverse Function-Calling Datasets
20 pages
Chap.5 FINANCIAL ASSET Valuation
No ratings yet
Chap.5 FINANCIAL ASSET Valuation
39 pages
Toolformer - Language Models Can Teach Themselves To Use Tools
No ratings yet
Toolformer - Language Models Can Teach Themselves To Use Tools
17 pages
API Automation Tool Selection Draft
No ratings yet
API Automation Tool Selection Draft
8 pages
Edoc 2018
No ratings yet
Edoc 2018
10 pages
The Complete Guide To The TOEFL PBT Test Class 1
No ratings yet
The Complete Guide To The TOEFL PBT Test Class 1
3 pages
This Isa New One
No ratings yet
This Isa New One
12 pages
Modul 3 Data Science
No ratings yet
Modul 3 Data Science
10 pages
Interview Prepration
No ratings yet
Interview Prepration
4 pages
Gorilla - Large Language Model Connected With Massive APIs
No ratings yet
Gorilla - Large Language Model Connected With Massive APIs
18 pages
Deep Learning-Based Prediction of Test Input Validity For RESTful APIs
No ratings yet
Deep Learning-Based Prediction of Test Input Validity For RESTful APIs
8 pages
Morest Model-Based RESTful API Testing With Execution Feedback
No ratings yet
Morest Model-Based RESTful API Testing With Execution Feedback
12 pages
Multithreading Example
No ratings yet
Multithreading Example
7 pages
LLM Frameworks
No ratings yet
LLM Frameworks
8 pages
Share Capstone - Mark1
No ratings yet
Share Capstone - Mark1
16 pages
Big Code Bench
No ratings yet
Big Code Bench
62 pages
Qin Et Al - 2023 - ToolLLM
No ratings yet
Qin Et Al - 2023 - ToolLLM
24 pages
Open-Source and Science in The Era of Foundation Models
No ratings yet
Open-Source and Science in The Era of Foundation Models
88 pages
Taskweaver: A Code-First Agent Framework: Equal Contribution
No ratings yet
Taskweaver: A Code-First Agent Framework: Equal Contribution
23 pages
Functionality
No ratings yet
Functionality
10 pages
Team13 DevRev Report
No ratings yet
Team13 DevRev Report
14 pages
Web API Search
No ratings yet
Web API Search
18 pages
API Testing
No ratings yet
API Testing
20 pages
Best Python Project With API
No ratings yet
Best Python Project With API
259 pages
Function Calling at Edge
No ratings yet
Function Calling at Edge
9 pages
Testing Objective
No ratings yet
Testing Objective
1 page
Proyecto Arquitectura AISS RESTest v2
No ratings yet
Proyecto Arquitectura AISS RESTest v2
19 pages
Article Text June 26 2023
No ratings yet
Article Text June 26 2023
3 pages
Understanding The Effectiveness of Large Language Models in Detecting Security Vulnerabilities
No ratings yet
Understanding The Effectiveness of Large Language Models in Detecting Security Vulnerabilities
18 pages
Testing of RESTful Web APIs
No ratings yet
Testing of RESTful Web APIs
3 pages
API Call Based Malware Detection Approach Using Recurrent Neural Network-LSTM
No ratings yet
API Call Based Malware Detection Approach Using Recurrent Neural Network-LSTM
13 pages
IBM WhitePaper
No ratings yet
IBM WhitePaper
9 pages
Function Calling
No ratings yet
Function Calling
16 pages
Pytorch Paper
No ratings yet
Pytorch Paper
12 pages
AgentScope: A Flexible Yet Robust Multi-Agent Platform
No ratings yet
AgentScope: A Flexible Yet Robust Multi-Agent Platform
24 pages
C L L M F F V S ?: AN Arge Anguage Odels Ind and IX Ulnerable Oftware
No ratings yet
C L L M F F V S ?: AN Arge Anguage Odels Ind and IX Ulnerable Oftware
18 pages
SWE-bench: Can Language Models Resolve Real-World Github Issues?
No ratings yet
SWE-bench: Can Language Models Resolve Real-World Github Issues?
51 pages
An Approach To Generating API Test Scripts Using GPT
No ratings yet
An Approach To Generating API Test Scripts Using GPT
9 pages
Formulario de Mantenimiento 1
No ratings yet
Formulario de Mantenimiento 1
2 pages
5474-Article Text-8699-1-10-20200511
No ratings yet
5474-Article Text-8699-1-10-20200511
8 pages
ML Miniproject Final Report
No ratings yet
ML Miniproject Final Report
36 pages
Api MD
No ratings yet
Api MD
13 pages
An Empirical Study On API Usages
No ratings yet
An Empirical Study On API Usages
14 pages
Leveraging Large Language Models To Improve REST API Testing
No ratings yet
Leveraging Large Language Models To Improve REST API Testing
5 pages
API Testing Crash Course 1685052723
No ratings yet
API Testing Crash Course 1685052723
10 pages
Osdi24 Lin Chaofan
No ratings yet
Osdi24 Lin Chaofan
18 pages
An Effective Query System Using Llms and Langchain IJERTV12IS060161
No ratings yet
An Effective Query System Using Llms and Langchain IJERTV12IS060161
4 pages
Toolshed LLM Agents
No ratings yet
Toolshed LLM Agents
38 pages
NeurIPS 2024 Apigen Automated Pipeline For Generating Verifiable and Diverse Function Calling Datasets Paper Datasets and Benchmarks Track
No ratings yet
NeurIPS 2024 Apigen Automated Pipeline For Generating Verifiable and Diverse Function Calling Datasets Paper Datasets and Benchmarks Track
20 pages
Discovering API Usage Specifications For Security Detection Using Two-Stage Code Mining
No ratings yet
Discovering API Usage Specifications For Security Detection Using Two-Stage Code Mining
23 pages
Q D F C: Uerying Atabases With Unction Alling
No ratings yet
Q D F C: Uerying Atabases With Unction Alling
23 pages
A Malware-Detection Method Using Deep Learning To
No ratings yet
A Malware-Detection Method Using Deep Learning To
24 pages
Extrajudicial Settlement by Agreement Between Heirs - PEREIRA v. CA
100% (1)
Extrajudicial Settlement by Agreement Between Heirs - PEREIRA v. CA
2 pages
Swetha-Kuncham-Profile - New Updated - 2024
No ratings yet
Swetha-Kuncham-Profile - New Updated - 2024
12 pages
9959 The Berkeley Function Cal
No ratings yet
9959 The Berkeley Function Cal
22 pages
Aircon Heat Load Calculation Sheet: Project Name: Latihan Address: Pancoran
No ratings yet
Aircon Heat Load Calculation Sheet: Project Name: Latihan Address: Pancoran
21 pages
Enhancing Function-Calling Capabilities in LLMs
No ratings yet
Enhancing Function-Calling Capabilities in LLMs
13 pages
12 Accountancy
No ratings yet
12 Accountancy
4 pages
Scavenger Hunt Lesson Plan
No ratings yet
Scavenger Hunt Lesson Plan
3 pages
Tracer Mt09tra
No ratings yet
Tracer Mt09tra
114 pages
Install Ubuntu Server 18
No ratings yet
Install Ubuntu Server 18
11 pages
BES Quality Teaching Diverse Students
No ratings yet
BES Quality Teaching Diverse Students
103 pages
Management Micro Project For Last Year Student
No ratings yet
Management Micro Project For Last Year Student
10 pages
09 Davao Freeworkers V Cir
No ratings yet
09 Davao Freeworkers V Cir
5 pages
Mapping Corporate Social Responsibility & Un's Sustainable Development Goals A Case 31-Chauhan Et Al - pp189-194
No ratings yet
Mapping Corporate Social Responsibility & Un's Sustainable Development Goals A Case 31-Chauhan Et Al - pp189-194
6 pages
Ashley Dohr Cover Letter
No ratings yet
Ashley Dohr Cover Letter
2 pages
Cultural
No ratings yet
Cultural
9 pages
Little - A Little - Few - A Few - GrammarBank
No ratings yet
Little - A Little - Few - A Few - GrammarBank
3 pages
Health Effects of Voluntary Exposure To Cold Water A Continuing Subject of Debate
No ratings yet
Health Effects of Voluntary Exposure To Cold Water A Continuing Subject of Debate
17 pages
C2 Wordlist Unit 2
No ratings yet
C2 Wordlist Unit 2
8 pages
Basics of Jyotish Science
No ratings yet
Basics of Jyotish Science
2 pages
European Portfolio For Student Teachers of Languages (EPOSTL)
No ratings yet
European Portfolio For Student Teachers of Languages (EPOSTL)
4 pages
Midlife
No ratings yet
Midlife
8 pages
311 Application SC Rout
No ratings yet
311 Application SC Rout
5 pages
FprEN - 1992 1 1 BD
No ratings yet
FprEN - 1992 1 1 BD
4 pages
Conduent Applicant Adaaa Referral Form PDF
No ratings yet
Conduent Applicant Adaaa Referral Form PDF
1 page
Small Claims and Collection of Sum of Money
No ratings yet
Small Claims and Collection of Sum of Money
4 pages

Nestful Nested API Calls Benchmark

Uploaded by

Nestful Nested API Calls Benchmark

Uploaded by

Preprint

NEST FUL: A Benchmark for Evaluating LLMs on Nested

Autonomous agent applications powered by large language models (LLMs)

Human API Name Nested Executable Real World

Table 1: A comparison of our NEST FUL dataset to notable Tool-learning benchmarks

4.1 Executable APIs

" name " : " WeatherAPI . com_Time_Zone_API " ,

4.2 Non-Executable APIs

5.2 Experimental Settings

Mistral-7B-Instruct-v0.3 0.36 0.00 0.03 0.00 0.00

Variable Assignments As discussed in Section 3, we add variable assignments

for two APIs — Coronavirus_Smartable_GetStats and NewsAPISearchByKeyWord. However,

You might also like