
Chain of Tools: Large Language Model is an Automatic Multi-tool Learner

Zhengliang Shi1∗  Shen Gao2  Xiuyi Chen3  Yue Feng4  Lingyong Yan3
Haibo Shi3  Dawei Yin3  Zhumin Chen1  Suzan Verberne5  Zhaochun Ren5†
1 Shandong University   2 University of Electronic Science and Technology of China
3 Baidu Inc., Beijing, China   4 University of Birmingham, Birmingham, UK
5 Leiden University, Leiden, The Netherlands
[email protected]  [email protected]

arXiv:2405.16533v1 [cs.CL] 26 May 2024

Abstract
Augmenting large language models (LLMs) with external tools has emerged as a
promising approach to extend their utility, empowering them to solve practical tasks.
Existing work typically empowers LLMs as tool users with a manually designed
workflow, where the LLM plans a series of tools in a step-by-step manner, and
sequentially executes each tool to obtain intermediate results until deriving the final
answer. However, they suffer from two challenges in realistic scenarios: (1) The
handcrafted control flow is often ad-hoc and constrains the LLM to local planning;
(2) The LLM is instructed to use only manually demonstrated tools or well-trained
Python functions, which limits its generalization to new tools. In this work, we first
propose Automatic Tool Chain (ATC), a framework that enables the LLM to act
as a multi-tool user, which directly utilizes a chain of tools through programming.
To scale up the scope of the tools, we next propose a black-box probing method.
This further empowers the LLM as a tool learner that can actively discover and
document tool usages, teaching itself to properly master new tools. For a
comprehensive evaluation, we build a challenging benchmark named ToolFlow,
which diverges from previous benchmarks by its long-term planning scenarios and
complex toolset. Experiments on both existing datasets and ToolFlow illustrate
the superiority of our framework. Analyses of different settings also validate the
effectiveness and utility of our black-box probing algorithm.

1 Introduction
Large language models (LLMs) have shown promising capabilities such as in-context learning and
real-world planning [1–3]. To further increase their utility, the tool learning task [4, 5] is proposed to
augment LLMs with external tools, e.g., a Weather App, enabling them to interact with the physical
world [6–8]. With the assistance of tools, LLMs can serve as agents to automatically solve practical
tasks [9–11], such as checking the weather in London. More recent work further uses Python code as a
unified interface to access diverse tools, which comes with advantages like seamlessly reusing massive
built-in functions and performing for-loop operations [12–15].
Given a practical task, prior work grounds the tool-use process in an iterative plan-execute-observe
pipeline [16–19]. As shown in Figure 1(a), the LLM first plans a series of tools in a step-by-step
manner [20–22]. For each step, the LLM generates arguments in a handcrafted format [23, 24] or
code snippets [12, 19] for execution, continuously incorporating intermediate results into the context
for subsequent actions. However, these methods suffer from two challenges in realistic scenarios. First,
their workflow for interacting with tools is typically manually designed and ad-hoc, struggling to
generalize to different tool-use scenarios. The handcrafted workflow also constrains the LLM to local
planning, leading to diminished performance on long-term planning tasks. Second, the LLM can only
use manually demonstrated tools or built-in Python functions, which limits the scope of the toolset.
To this end, we address the above challenges with a focus on two research objectives: (1) empower
the LLM as an automatic multi-tool user to generate a tool-use chain; (2) further empower the
LLM as an active multi-tool learner that teaches itself to master new tools.

∗Work done during an internship at Baidu.
†Corresponding author.

Preprint. Under review.

[Figure 1: The comparison between the previous plan-execute-observe pipeline (left) and our proposed
framework (right). For the query "Will it rain in London from 21 to 27 in April? If not, help me book
flights and accommodations from Los Angeles to London for April 21.", the previous pipeline
iteratively plans, executes, and observes single tools (GET_IP, GET_Weather, ...), accumulating
context and failing due to local planning, while our framework generates one program that chains
GET_IP, GET_Weather, Book_Flight, and Book_Hotel into a concise solution.]
To address the first objective, we propose Automatic Tool Chain (ATC), a framework that enables the
LLM to utilize a chain of tools through programming. As shown in Figure 1(b), the LLM directly
learns the input-output schema and data flow dependency of various tools from tool protocols (a.k.a.,
tool documentation). Different from the short-form docstring of simple Python functions in previous
work [13, 25, 15], the protocol comprehensively outlines meta-information about a complex tool,
such as argument requirements, structural response specifications (i.e., a general schema elaborating
the expected execution results), and possible execution statuses. With the assistance of the provided
protocols, we instruct the LLM to generate a program that sequentially calls a chain of tools, parses
the tool response to cache useful information and derives the final answer. To correct runtime errors
in the generated programs, we introduce an attributable reflection mechanism, which allows the LLM
to track faulty snippets, pinpoint incorrect tool usage, and calibrate the programs accordingly.
In realistic scenarios, a potential challenge that limits the scope of the toolset in our framework is
the continuous crafting of documented protocols for diverse and fast-paced tools, which is typically
done by software developers [26, 27]. Therefore, we propose a black-box probing method to address
the second objective. This approach enables the LLM to be an active tool learner that can probe the
input-output schema of new tools and teach itself how to use them. Initially, the LLM is instructed
to generate testing instances that target the functionality of a tool, including relevant tasks and
tool-use program solutions. While executing the generated program, we transform the task-specific
tool response into a general schema and leverage these instances as practical usage demonstrations,
thereby documenting the tool protocol. Considering that a single tool may fail to be probed due to the
absence of private arguments, which can only be acquired through other tools, we introduce a chain of
probing algorithm. This algorithm effectively optimizes the cooperation among tools that have a
strong input-output dependency.
We first investigate the capability of LLMs to generate a chain of tools on two well-established
datasets from RestBench [14]. For a comprehensive evaluation, we also create a new benchmark
testbed named ToolFlow, including 224 tasks across 107 real-world tools. ToolFlow diverges from
the existing benchmarks through its longer-term planning tasks, the thorough protocols of the toolset,
and the complex data flow interdependencies among tools, evaluating our method under more
challenging scenarios. The results show that (1) the LLM can well understand the tool protocol; (2)
the LLM exhibits strong capability in planning a chain of tools programmatically; and (3) despite
the straightforward design, our framework substantially surpasses previous baselines with higher
efficiency. In addition, the proposed black-box probing method effectively instructs LLMs to probe
tool protocols and teach themselves to master new tools, extending the scope of the tools in our ATC.
Our contributions are summarized as follows: (i) We propose the Automatic Tool Chain (ATC), a
framework to empower the LLM as a multi-tool user. (ii) We introduce a black-box probing method,
which further enables the LLM to act as an active tool learner and extends the scope of the toolset in our ATC.
(iii) We release a new benchmark, ToolFlow, to evaluate tool learning methods in more challenging
scenarios. (iv) Extensive experiments on three datasets validate the superiority of our method.

2 Related Work
Tool learning with foundation models. Augmenting LLMs with external tools has been proven a
promising method for enhancing their utility and enabling interactions with the physical world [8,
6, 28, 29]. In commonly used methods, the LLM first breaks down a complex task and plans a
series of tools in a step-by-step manner [24, 30, 19]. For each step, the LLM separately executes the
tools and incorporates the full response into context, which contains the required arguments to invoke
subsequent tools due to the data flow dependency [20, 31, 22]. Despite advancements, this iterative
workflow is typically manually designed and ad-hoc, struggling to generalize across various tool-use
scenarios. In this work, we propose the ATC, enabling the LLM as an automatic multi-tool learner to
directly integrate a chain of tools.

Programming-enhanced LLMs. Recent work has shown the potential of using programming
languages (PLs) to enhance the planning and reasoning capability of LLMs [32–34]. For example,
previous work enables LLMs to generate a programmatic chain of thought to solve complex numeric
reasoning tasks [35, 36], which exhibits remarkable performance. In the tool learning task, compared
with natural languages (NLs), recent work also shows that LLMs can generate Python code snippets
as actions, with advantages like integrating widely used Python functions and simplifying lengthy
for-loop operations [12]. However, previous work limits the LLM to only using well-documented
tools [14, 27, 25] or Python functions learned during the pre-training stage [12]. In this work, we
further investigate the LLM as a multi-tool learner, teaching itself to master new tools.

Learning from external feedback. Learning from feedback is a prevailing strategy to mitigate
undesired behaviors of LLMs [37, 38], mirroring a typical human learning strategy where individ-
uals refine their behaviors through trial, error, and correction [39–42]. Previous studies such as
Reflexion [43] show the capability of LLMs to reflect verbal signals from the environment and revise
their mistakes [44]. Recent work prompts LLMs to use automated feedback (e.g., runtime errors)
implemented in software to self-debug its generated code in each step [44, 45, 17]. Despite the
progress, this feedback typically reflects straightforward faults while failing to address the snowballing
issue [46] in multi-step planning, where an initial error can lead to a series of subsequent
accumulated errors [47]. In the tool learning task, pinpointing the exact tool triggering the error is
crucial [48]. In this work, the proposed attributable reflection mechanism guides LLMs to track the
faulty program snippet, attribute it to a specific tool calling, and revise generated programs.

3 Automatic Tool Chain

3.1 Preliminaries

Solving practical tasks with the assistance of tools can be conceptualized as a planning process.
Formally, the LLM, denoted as Mθ , is equipped with access to a set of tools T = {t1 , t2 , ..., t|T | }
and corresponding documented protocols D = {d1 , d2 , ..., d|D| }. The protocol di provides detailed
meta information about tool ti such as argument requirements, tool description, and the specification
of execution results (a.k.a., schema). Given a natural language task x ∈ X from the task space X, the
objective is to generate a sequence of tool callings paired with corresponding arguments to derive the
final answer. Previous work configures the LLM with customized control flow and tool-use templates,
whereby the LLM iteratively interacts with single tools following a plan-execute-observe pipeline. In
this work, we enable the LLM to automatically utilize a chain of tools by generating a program C.

[Figure 2: Left: An overview of our framework with the proposed attributable reflection mechanism
(§ 3.3). Right: Our black-box probing method (§ 4.1), consisting of (1) instance discovery and
(2) protocol documenting.]

3.2 Chain of Tools Generation


Given a task x, we first provide the LLM with the documented protocol di ∈ D for each tool ti in the
candidate toolset T . The documented protocol di records meta information, including the description
to outline what the tool ti can be used for, the argument requirements to describe how to invoke
it, and the response schema to specify the expected type of execution results. With the assistance
of this meta information, the LLM can automatically learn the tool usage, and master the detailed
input-output schema as well as the data flow relation among various tools. Then, we instruct the LLM
Mθ to directly generate an executable program C to utilize multiple tools and solve the input task x.
Formally, it can be formulated as:
C = Mθ (x, T , D, Ic ). (1)
Here, Ic indicates a concise instruction for the program generation operation, which is provided
in Appendix A.7. The T and D represent the candidate toolset and corresponding tool protocols,
respectively. The generated program sequentially calls multiple tools to acquire useful information,
parses lengthy execution results for subsequent utilization, and simplifies the lengthy task-solving
trajectory with concise programmatic planning. The final result r is derived by executing the generated
program through a code interpreter, which can be formulated as r = Execute(C).
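As a rough illustration of Eq. (1) and the execution step r = Execute(C), the following sketch wires the generation and execution together; build_prompt, llm_generate, and the unsandboxed exec-based interpreter are simplifying assumptions, not the paper's exact implementation.

```python
import contextlib
import io
import json
from typing import Callable, Dict, List

def build_prompt(task: str, protocols: List[dict], instruction: str) -> str:
    """Assemble the task x, the tool protocols D, and the instruction I_c into one prompt."""
    return (
        f"{instruction}\n\nTool protocols:\n{json.dumps(protocols, indent=2)}\n\n"
        f"Task: {task}\nWrite a Python program that calls the tools and prints the final answer."
    )

def chain_of_tools(
    task: str,
    protocols: List[dict],
    tools: Dict[str, Callable],
    llm_generate: Callable[[str], str],  # wraps the backbone LLM M_theta
    instruction: str = "Solve the task by programming with the provided tools.",
) -> str:
    # C = M_theta(x, T, D, I_c): the LLM writes one program that chains the tools.
    program = llm_generate(build_prompt(task, protocols, instruction))
    # r = Execute(C): run the program with the callable tools in scope and capture its output.
    namespace = dict(tools)
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(program, namespace)  # a production system would sandbox this call
    return buffer.getvalue().strip()
```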

3.3 Programming with Attributable Reflection


The generated program effectively integrates multi-step tool utilization. However, runtime errors,
such as passing redundant arguments, are frequently observed during the execution. Therefore, we
introduce an attributable reflection mechanism, guiding the LLM to first attribute the raised error to a
specific tool, and then adaptively revise the generated program. As shown in Figure 2, if a runtime
error is raised, we capture the error message in the result r, including both the faulty code snippet
and the error trace. Then, we instruct the LLM to localize the specific tool calling which triggers the
error and generate the exact tool name, represented as tj = Mθ(x, T, Ia, rj). Here, j indicates
the j-th reflection iteration and Ia indicates the instruction for this error attribution operation. The
identified tool tj paired with its documentation dj as well as the error message is taken as input,
assisting the LLM to revise the generated program, which can be formulated as:
Cj = Mθ (x, T , D, Ic , {(C<j , r<j )}, dj ). (2)
Our attributable reflection mechanism operates until the generated program executes successfully
or the maximum number of iterations α is reached.
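A minimal sketch of this reflection loop, assuming hypothetical helpers llm_attribute_error (which maps an error trace to the faulty tool, mirroring tj = Mθ(x, T, Ia, rj)) and llm_revise_program (mirroring Eq. (2)); error handling is simplified.

```python
import traceback
from typing import Callable, Dict, Optional

def run_with_attributable_reflection(
    program: str,
    tools: Dict[str, Callable],
    protocols: Dict[str, str],                            # tool name -> documented protocol d
    llm_attribute_error: Callable[[str, str], str],       # (program, error trace) -> faulty tool name
    llm_revise_program: Callable[[str, str, str], str],   # (program, error trace, protocol) -> revised program
    max_iterations: int = 3,                              # alpha in Section 3.3
) -> Optional[dict]:
    # One initial execution plus up to `max_iterations` revision rounds.
    for _ in range(max_iterations + 1):
        namespace = dict(tools)
        try:
            exec(program, namespace)                      # r = Execute(C)
            return {"program": program, "namespace": namespace}   # executed without runtime errors
        except Exception:
            error_trace = traceback.format_exc()
            # Attribute the error to a specific tool call, then revise with its protocol.
            faulty_tool = llm_attribute_error(program, error_trace)
            protocol = protocols.get(faulty_tool, "")
            program = llm_revise_program(program, error_trace, protocol)
    return None   # give up after alpha reflection iterations
```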

4 Black-box Probing Enables Toolset Extension


Our framework ATC enables the LLM to directly operate a chain of well-documented tools through
programming. However, manually crafting and maintaining documented protocols for diverse and
fast-paced tools is cost-intensive, which poses a potential limitation to the scope of the toolset in our
framework. Therefore, we propose a black-box probing method, which enables the LLM to act as an
active tool learner, teaching itself to master new tools. Due to the data flow relations among tools,
we also introduce a chain of probing algorithm to enhance the probing process.

4.1 Tool Probing

As shown in Figure 2(b), our probing method contains two phases, including Instance discovery
and Protocol documenting. The core idea of the former is to generate tool-use instances through
self-exploration, examining the expected input-output mechanism for each tool, while the latter
transforms specific instances into a general tool protocol.
Instance discovery. We instruct the LLM to formulate a question q targeting the functionality of
a tool t and generate a program utilizing the tool t to solve the formulated question. Formally, it
can be represented as (q, C) = Mθ (t, Ip ), where Ip is the instruction for our instance discovery
operation. The response r of the tool t can be examined while executing the generated program C as
r = Execute(C), which represents a specific instance to demonstrate the output of the tool t. Since
the LLM may hallucinate to formulate unsolvable questions or fail to generate a correct program, we
repeat the above operation until the response r can be examined correctly, up to a maximum of N
sampling attempts. Thus, we obtain a tool-use instance denoted as (q, C, r).
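A sketch of the instance discovery loop, where llm_formulate_instance stands in for the prompting step (q, C) = Mθ(t, Ip); the convention that the generated program stores the tool output in a variable named response is an assumption for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

def discover_instance(
    tool_name: str,
    tools: Dict[str, Callable],
    llm_formulate_instance: Callable[[str], Tuple[str, str]],  # tool -> (question q, program C)
    max_samples: int = 3,                                      # N in Section 4.1
) -> Optional[Tuple[str, str, object]]:
    """Repeatedly sample (q, C) and execute C until a valid tool response r is observed."""
    for _ in range(max_samples):
        question, program = llm_formulate_instance(tool_name)
        namespace = dict(tools)
        try:
            exec(program, namespace)                 # r = Execute(C)
            response = namespace.get("response")     # assume the program stores the tool output here
            if response is not None:
                return question, program, response   # a tool-use instance (q, C, r)
        except Exception:
            continue                                 # unsolvable question or faulty program; resample
    return None
```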
Protocol documenting. On top of the sampled instance (q, C, r), we construct the tool protocol.
Since the response of real-world tools is typically lengthy with intricate structures, we first transform
the query-specific response r into a general schema s to demonstrate the expected output specification.
This process is automatically performed by recursively decomposing each element in r, representing
the hierarchical structure of r, and listing the type of the corresponding value. Then, we utilize the
question-program pair (q, C) as a usage demonstration of the tool t, pairing it with s to construct the
documented protocol d, denoted as d = ((q, C), s, t). We provide an example and the detailed procedure
of the above transformation process in Appendix A.6 and Algorithm 2.
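Putting the two phases together, a documented protocol d = ((q, C), s, t) could be assembled roughly as follows; schema_transformation refers to the recursive procedure of Appendix Algorithm 2, and the dictionary layout is illustrative.

```python
from typing import Any, Callable, Dict

def document_protocol(
    tool_name: str,
    question: str,
    program: str,
    response: Any,
    schema_transformation: Callable[[Any], Any],   # recursive schema builder (Appendix Alg. 2)
) -> Dict[str, Any]:
    """Turn one probed instance (q, C, r) into a synthetic tool protocol d = ((q, C), s, t)."""
    schema = schema_transformation(response)       # general response schema s
    return {
        "tool": tool_name,                                              # t
        "usage_example": {"question": question, "program": program},   # (q, C)
        "response_schema": schema,                                      # s
    }
```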

4.2 Chain of Probing

During the probing, some tools may not be callable on their own due to the absence of specific private
arguments, which are only accessible through other tools, e.g., a unique ID. To address this strong
data flow interconnection, we propose the chain of probing algorithm that enables the cooperation
of tools. Formally, we denote the black-box toolset as B, which contains unprobed tools and is
initialized with the entire candidate toolset. The successfully probed tools are cached in the list H,
which is initialized with an empty list.

Initial Iteration for single tool probing. As illustrated in Figure 3, our initial iteration starts by
probing each single tool t within the black-box toolset B, represented as d = LLMProb(t). The
LLMProb(∗) indicates the tool probing operation in § 4.1. If a tool t is successfully probed, i.e., no
exceptional errors are raised, it is moved from black-box toolset B to list H, formulated as:
B = B \ {t}, H = H ∪ {t} (3)
After the initial iteration, H contains tools that are directly callable, such as tools C and D in Figure 3,
while the remaining tools in B are interconnected with other tools.

Probing with dependency chain. For the remaining tools t in B, we probe them with the assistance
of tools from H. Specifically, we instruct the LLM to select a subset of tools from H based on their
relevance to the tool t, which can be formulated as T̂ = Mθ(t, H, Is). Here, Is denotes the instruction
for tool selection. The selected subset T̂ serves as the prerequisite that facilitates the acquisition of the
arguments necessary to invoke the tool t during the probing process, thereby deriving the tool protocol,
represented as d = LLMProb(T̂ ∪ {t}). As shown in Figure 3, the tool C is selected to assist in
probing the tool B. Our probing process continues for multiple iterations until all tools in B are
successfully probed or the maximum number of iterations β is reached. A detailed pseudo algorithm
of the overall process is provided in Appendix Alg. 1.

[Figure 3: Tool probing with the dependency chain. In the initial iteration, only the directly callable
tools C and D can be probed; in later iterations, a remaining tool (e.g., A) is probed with the assistance
of its prerequisite tool (e.g., C), until all tools are probed.]

5 Dataset and Evaluation Setup
Existing Datasets. We first conduct experiments on the widely used RestBench [14], which includes two
datasets: TMDB and Spotify. TMDB contains 100 tasks across 54 tools for movie scenarios,
while Spotify contains 57 tasks across 40 tools for music scenarios. Each tool in RestBench
is paired with a lengthy documented protocol, making it inherently appropriate for benchmarking the
protocol utilization capability of LLMs.

A new benchmark – ToolFlow. As shown in Appendix Table 8, to the best of our knowledge, no
existing benchmark contains complex tools with comprehensive tool protocols (e.g., argument
requirements and input-output schemas) while also involving long-term planning tool-use tasks. Therefore,
we build a new test set named ToolFlow to fill this gap. We first collect 107 tools with long protocols
across 4 real-world domains, e.g., Weather and Game, from the 16k public tools of the ToolBench [9]
dataset. Then, we invite 7 well-trained experts working on NLP research to provide solutions for 224
complex tasks in the form of tool interaction sequences, including the tool names and corresponding
arguments. Each task requires long-term reasoning and at least 7 interactions with tools.
ToolFlow also diverges from existing benchmarks by its strong interconnection among the tools (the
arguments of subsequent tools can only be extracted from the responses of previous tools) and its stability
(the task solutions are not time-varying). We provide more details of ToolFlow in Appendix A.2.

Evaluation metrics. Following previous work [49, 50], we use three evaluation metrics, including:
(1) Success Rate (Success%), which measures the proportion of successful query completions; (2)
Correct Path Rate (Path%), which calculates the proportion of ground truth tools in model-generated
tool callings; (3) Correct Path Precision (Prec%), which calculates the precision score between the
model-generated tool callings and the ground truth tool sequence. We also conduct a human evaluation
of our method; the details can be found in Appendix A.1.2.

Baselines. We mainly compare our method with the well-known baselines, including: (1) Re-
Act [18], which prompts LLM to generate the chain-of-thought and actions in an interleaved manner;
(2) CodeAct [12], which prompts LLM to iteratively generate code snippets as actions to call external
tools; (3) ToolLLM-DFSDT[9], which enhances LLMs with the Depth First Search-based Decision
Tree (DFSDT) to select tools to solve a task; (4) RestGPT [14], which includes a coarse-to-fine
planning module and a tool executor; (5) ConAgents [26], which enables the cooperation of three
specialized LLMs to solve complex tasks. For further comparison, we also establish two baselines,
i.e., ReAct@3 and ToolLLM@3, which run the vanilla method (ReAct or ToolLLM) up to three times
until the input task is successfully completed.

6 Experiment Results
6.1 Results of RQ1 – Enable the LLM as an automatic multi-tool user

We utilize three widely used LLMs for different baselines and our method: OpenAI’s gpt-3.5-turbo-
16k and gpt-4-turbo, and the open-source model Mixtral-8x7B [51]3 . The decoding temperature is set
to 0 for deterministic generation. The trial number α in our reflection mechanism (§ 3.3) is set to 3.
Following previous work [26, 14], we provide all the methods with 20 candidate tools for each task
in the test set, which contains the required tools and randomly sampled tools.

Results on RestBench. As shown in Table 1, the LLM, when equipped with our framework,
surpasses all the baselines on the RestBench benchmark in terms of all metrics. For example, our
method achieves a success rate of 89.00 on the RestBench-TMDB dataset, which substantially
improves over the commonly used baselines ReAct and ToolLLM. Table 2 and Table 3 further illustrate
that our framework achieves the best performance with various backbone LLMs, i.e., Mixtral-8x7B
and GPT-4. These results indicate that our framework effectively enables the LLM to master external
tools and directly generate a program for utilization. The performance of two runs is tested using a
two-tailed paired t-test where no significant difference is found (p > 0.05), showing the stability of
our method. In addition, human evaluation indicates that our method performs substantially better on
executability and utility than strong baselines. See Appendix A.1.2 for details.
3 https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1

Table 1: Experiment results on three datasets with gpt-3.5-turbo as backbone. The Path%, Prec%
and Success% indicate the Correct Path Rate, Correct Path Precision, and Success Rate metrics.

Method              RestBench-TMDB           RestBench-Spotify        ToolFlow
                    Succ%   Path%   Prec%    Succ%   Path%   Prec%    Succ%   Path%   Prec%
gpt-3.5-turbo
ReAct [18]          61.00   77.13   52.30    50.88   74.64   44.79    22.76   60.75   68.03
CodeAct [12]        63.00   80.91   83.72    54.30   76.64   79.81    27.82   57.93   66.23
ToolLLM [9]         72.00   78.29   49.41    61.40   82.82   25.33    42.14   71.02   65.24
RestGPT [14]        65.00   77.49   80.15    64.91   73.94   88.71    26.83   40.95   62.21
ConAgents [26]      76.00   78.29   82.31    63.16   78.21   82.71    -       -       -
ReAct@3             70.00   80.96   48.01    59.65   81.80   30.48    28.35   66.66   66.21
ToolLLM@3           74.00   83.29   45.41    66.67   83.41   23.73    44.70   73.85   60.77
ATC (ours)          89.00   84.71   83.87    78.95   78.54   91.46    60.21   78.31   72.45

Table 2: Experiment with the Mixtral-8x7B (mixtral-8x7B-instruct-v0.1).

Method        TMDB              ToolFlow
              Succ%   Path%     Succ%   Path%
ReAct         24.74   73.34     10.53   41.37
ReAct@3       37.88   76.85     18.95   52.40
ToolLLM@3     45.00   74.40     22.54   51.85
Ours          58.00   78.17     29.87   59.14

Table 3: Experiment with the GPT-4 (gpt-4-turbo).

Method        TMDB              ToolFlow
              Succ%   Path%     Succ%   Path%
ReAct         77.00   86.05     25.99   65.98
ReAct@3       80.00   89.21     30.98   67.55
ToolLLM@3     82.00   90.62     50.46   76.73
Ours          94.00   92.68     65.74   83.54

Results on ToolFlow. Table 1 presents the results on our ToolFlow benchmark. We find that our
ToolFlow poses a significant challenge for previous baselines, where the best performance only
achieves a 44.70 success rate with GPT-3.5 as the backbone. Our method pushes the success rate to
60.21 with a 15.51 point improvement. The potential reason for our improvement is that our ATC
enables the LLM to generate a chain of tools programmatically, which is more effective in controlling
workflow and consolidating lengthy task-solving trajectories into a concise program.
Ablation for our attributable reflection. We compare our attributable reflection method with two
ablative variants: (1) w/o reflect, which allows the LLMs to generate a program as the final solution
without further revision, and (2) w/ naive reflection, which enables the LLMs to revise generated
programs directly using error messages from the code interpreter. The results are shown in Table 4.
We observe that our attributable reflection outperforms the two variants with a 5–10 point improvement,
which demonstrates the superiority of the reflection mechanism and the effectiveness of our error
attribution strategy.

Table 4: Comparison with two variants of our attributable reflection mechanism.

Method          TMDB                   NovelTools
                Succ%      Path%       Succ%        Path%
Ours            89.00      84.71       59.12        77.24
- w/o reflect   83.00↓6.0  82.25↓2.5   47.31↓11.81  73.35↓3.89
- w/ naive      85.00↓4.0  82.74↓2.0   55.44↓3.68   75.40↓1.84

6.2 Results of RQ2 – Enable the LLM as an active multi-tool learner

We evaluate our black-box probing method on three datasets, i.e., RestBench-TMDB, RestBench-Spotify,
and ToolFlow, using different backbone LLMs. The sampling number N (§ 4) is set to 3 and the
maximum iteration number β (§ 4.2) is set to 4. We mainly evaluate our probing method by computing
the number of successfully probed tools. To evaluate the utility of the auto-documented protocol, we
compare the performance of our ATC supported by the standard protocol (std protocol) and the
synthetic protocol (auto protocol). Considering that our synthetic documentation contains a usage
example for each tool, we further set up a zero-shot experiment, which only retains the transformed
schema (auto schema).
Success rate of tool probing. Table 5 shows the number of successfully probed tools. We find that
the open-source model Mixtral-8x7B, when equipped with our probing method, can probe 82.5% to
88.2% of the tools and synthesize their tool documentation. The number of successfully probed tools
further increases with more powerful backbone LLMs, specifically GPT-4. These results validate the
effectiveness of our tool probing method. We further analyse the cases where the LLM fails to probe
successfully. A potential reason is that the code interpreter only examines compile errors and runtime
faults, failing to calibrate the correctness of the program's output. This limitation can lead to the
False Success phenomenon.

[Figure 4: The comparison of our framework when equipped with different documentation. Bars
compare the best baseline performance with Ours w/ auto schema, Ours w/ auto protocol, and Ours
w/ std protocol, in terms of success rate and correct path rate on TMDB and ToolFlow, with
mixtral-8x7B, gpt-3.5-turbo, and gpt-4-turbo as backbones.]

Ablation for the black-box probing. We compare our tool probing with two ablative variants:
(1) w/o multi-sample, which replaces the multiple sampling strategy in § 4 with sampling only one
instance; and (2) w/o chain, which ignores the dependency chain among tools in § 4.2 and probes
single tools separately. As shown in Table 5, in terms of the number of successfully probed tools, we
observe a 3–7 point decrease for w/o multi-sample, which indicates that the LLMs may fail to generate
a correct program in one pass. We also find a substantial decrease between our vanilla probing method
and the w/o chain variant. These results demonstrate the necessity of optimizing the combination of
tools with strong interconnection.

Table 5: The number of successfully probed tools using our vanilla probing method and two variants.

Method               TMDB   Spotify   NovelTools
Totally              54     40        107
Probing (mixtral)    47     33        90
Probing (gpt-4)      54     38        102
Probing (gpt-3.5)    54     38        98
- w/o multi-sample   50↓4   35↓3      91↓7
- w/o chain          47↓7   17↓21     87↓11

Utility of Auto-documented Protocol. Figure 4 shows the performance of our proposed framework
in different settings. Compared with using standard protocol crafted manually (i.e., Ours w/ std
protocol), the LLM achieves comparable performance with the assistance of auto-documented
protocol (i.e., Ours w/ auto protocol), which illustrates the utility of our synthetic protocol. We also
observe that our framework substantially outperforms the best baseline even when only using the
transformed schema, i.e., Ours w/ auto schema. This result further demonstrates the effectiveness
of our tool probing and protocol documenting methods which can extend our proposed framework
into diverse new tools without handcrafted protocols. In addition, we also conduct a case study to
evaluate the quality of the synthetic protocol and show a concrete example in Appendix A.3.

7 Discussion
The impact of iteration count in our attributable reflection. Our attributable reflection mechanism
enables LLMs to adaptively revise their generated programs according to error messages raised by the
program interpreter. We further vary the maximum reflection count α from 1 to 5 and evaluate the
Success Rate with the same setting as Table 1 (ours). As shown in Figure 5, we observe an increasing
Success Rate when α shifts from 1 to 3, which illustrates that LLMs can adapt their generation
accordingly. We also find a relatively stable trend when α keeps increasing (from 3 to 5), which
indicates that the LLMs can revise most of the errors within 3 iterations. We also analyse the cases of
unsuccessful corrections and find that the generated program may be grammatically correct but yield
incorrect answers and therefore cannot be detected by the compiler (i.e., the False Success phenomenon).

[Figure 5: The qualitative analysis of the maximum reflection count (§ 3.3): success rate against the
number of reflection iterations on ToolFlow and TMDB.]

[Figure 6: The statistics of the errors of our framework. Left: the success and failure rates for tasks
that require different numbers of tools — tasks with 7 ≤ tool num.: 27 (73%) success, 10 (27%) failure;
tasks with 4 ≤ tool num. ≤ 6: 46 (83%) success, 9 (16%) failure; tasks with tool num. ≤ 3: 30 (79%)
success, 8 (21%) failure. Right: the percentage of different types of errors.]
Statistics of error cases. To further evaluate the potential advantages and drawbacks of our
method (Table 1), we count the success and failure rates for tasks with different complexity. We first
randomly sample a total of 130 tasks from RestBench and our ToolFlow. Following previous
work [14, 9], we assess task complexity using the number of tool calls in the ground truth
solution. Figure 6(a) presents the results. We find that the LLM, when equipped with our framework,
can effectively solve both short-term planning tasks (i.e., tool num. ≤ 3) and long-term planning tasks
(i.e., tool num. ≥ 7). We also analyse the types of failed tasks and divide them into four categories, shown
in Figure 6(b). Most of the errors derive from misunderstanding the tool documentation or a
mismatch between a task and the selected tools. A potential solution is to enrich the tool documentation,
further clarify the distinction among similar tools, and append negative examples as prior experience
in the tool documentation to help LLMs better master a tool. We leave this for future work.
Efficiency at inference. The intensive parameters of LLMs typically raise concerns about inference
cost. Thus, we compare the token consumption between our framework (auto doc) and strong baselines
on the TMDB and ToolFlow datasets and show the results in Figure 7. We observe that although our
framework achieves better performance, it spends fewer tokens than all baselines. The potential reason
is that our framework benefits from the inherent advantage of programming languages, which support
control of the workflow and allow the composition of multiple tools to perform complex logical
operations. By contrast, the previous baselines interact with tools in a step-by-step manner, leading to
long task-solving trajectories with substantial inference costs. We also compute the token consumption
for our probing process, where each tool costs 2703 tokens to probe on average. More details can be
found in Appendix A.1.

[Figure 7: Consumed tokens for different methods (Ours, ReAct, ReAct@3, ToolLLM, ToolLLM@3)
on TMDB and ToolFlow.]
Case study. We conduct case studies and find that our proposed framework is more effective
at utilizing various tools to solve complex tasks. We also provide concrete examples to intuitively
explain each component of our method in Appendix A.3.

8 Conclusions
We presented Automatic Tool Chain (ATC), a framework that enables the LLM to act as a multi-tool
user. ATC enables LLMs to learn input-output schemas and data flow dependency of various tools
from documented tool protocols, programmatically generating a chain of tools to solve complex tasks.
ATC overcomes the limitations of existing tool learning methods, including relying on manually
designed workflows and lengthy inference steps. On top of ATC, we propose a black-box probing
method, empowering the LLM to act as a multi-tool learner that can automatically discover tool
protocols and teach itself to master new tools. Extensive experiments conducted on existing datasets
and a newly created challenging benchmark demonstrate that an LLM, when equipped with our
framework, achieves the best performance compared with all the baselines. We expect future research
to further calibrate the output of generated programs, mitigating the false success phenomenon, i.e.,
the program triggers no runtime error but still gives an incorrect answer. We are also interested in
exploring the integration of our framework into vision foundation models, to develop a multi-modal
agent to solve complex practical tasks.

References
[1] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi,
and Hannaneh Hajishirzi. Self-Instruct: Aligning Language Models with Self-Generated
Instructions. In Association for Computational Linguistics: ACL, 2023.
[2] Xiaohan Xu, Ming Li, Chongyang Tao, Tao Shen, Reynold Cheng, Jinyang Li, Can Xu, Dacheng
Tao, and Tianyi Zhou. A Survey on Knowledge Distillation of Large Language Models. arXiv,
2024.
[3] Saaket Agashe, Yue Fan, and Xin Eric Wang. Evaluating multi-agent coordination abilities in
large language models. arXiv preprint arXiv:2310.03903, 2023.
[4] Yujia Qin, Shengding Hu, Yankai Lin, Weize Chen, Ning Ding, Ganqu Cui, Zheni Zeng, Yufei
Huang, Chaojun Xiao, Chi Han, et al. Tool learning with foundation models. arXiv preprint
arXiv:2304.08354, 2023.
[5] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettle-
moyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language Models Can Teach
Themselves to Use Tools. Neural Information Processing Systems: NeurIPS, 2023.
[6] Yujia Qin, Zihan Cai, Dian Jin, Lan Yan, Shihao Liang, Kunlun Zhu, Yankai Lin, Xu Han, Ning
Ding, Huadong Wang, Ruobing Xie, Fanchao Qi, Zhiyuan Liu, Maosong Sun, and Jie Zhou.
WebCPM: Interactive web search for Chinese long-form question answering. In Association
for Computational Linguistics: ACL, 2023.
[7] Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann,
Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language
model for finance. arXiv preprint arXiv:2303.17564, 2023.
[8] Andres M Bran, Sam Cox, Andrew D White, and Philippe Schwaller. ChemCrow: Augmenting
large-language models with chemistry tools. arXiv preprint arXiv:2304.05376, 2023.
[9] Yujia Qin, Shi Liang, Yining Ye, Kunlun Zhu, Lan Yan, Ya-Ting Lu, Yankai Lin, Xin Cong,
Xiangru Tang, Bill Qian, Sihan Zhao, Runchu Tian, Ruobing Xie, Jie Zhou, Marc H. Gerstein,
Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating Large Language Models
to Master 16000+ Real-world APIs. International Conference on Learning Representations:
ICLR, 2023.
[10] Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu
Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with
self-improvement. arXiv preprint arXiv:2402.07456, 2024.
[11] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christo-
pher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted
question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
[12] Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji.
Executable code actions elicit better llm agents. arXiv preprint arXiv:2402.01030, 2024.
[13] Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large Language
Model Connected with Massive APIs. arXiv preprint arXiv:2305.15334, 2023.
[14] Yifan Song, Weimin Xiong, Dawei Zhu, Chengzu Li, Ke Wang, Ye Tian, and Sujian Li.
RestGPT: Connecting Large Language Models with Real-World Applications via RESTful
APIs. arXiv, 2023.
[15] Lifan Yuan, Yangyi Chen, Xingyao Wang, Yi R Fung, Hao Peng, and Heng Ji. Craft: Cus-
tomizing llms by creating and retrieving from specialized toolsets. International Conference on
Learning Representations: ICLR, 2024.
[16] Archiki Prasad, Alexander Koller, Mareike Hartmann, Peter Clark, Ashish Sabharwal, Mohit
Bansal, and Tushar Khot. Adapt: As-needed decomposition and planning with language models.
arXiv preprint arXiv:2311.05772, 2023.

[17] Cheng Qian, Chi Han, Yi Fung, Yujia Qin, Zhiyuan Liu, and Heng Ji. Creator: Tool creation
for disentangling abstract and concrete reasoning of large language models. In Findings of the
Association for Computational Linguistics: EMNLP, 2023.
[18] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and
Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models. In International
Conference on Learning Representations: ICLR, 2023.
[19] Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer,
and Marco Tulio Ribeiro. Art: Automatic multi-step reasoning and tool-use for large language
models. arXiv preprint arXiv:2303.09014, 2023.
[20] Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin
Choi, and Bill Yuchen Lin. Lumos: Learning agents with unified data, modular design, and
open-source llms. arXiv preprint arXiv:2311.05657, 2023.
[21] Boshi Wang, Hao Fang, Jason Eisner, Benjamin Van Durme, and Yu Su. LLMs in the Imag-
inarium: tool learning through simulated trial and error. arXiv preprint arXiv:2403.04746,
2024.
[22] Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun
Zhu, and Jianfeng Gao. Chameleon: Plug-and-play compositional reasoning with large language
models. Neural Information Processing Systems: NeurIPS, 2023.
[23] Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, and Le Sun. Toolalpaca:
Generalized tool learning for language models with 3000 simulated cases. arXiv preprint
arXiv:2306.05301, 2023.
[24] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed,
Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for
multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.
[25] Cheng-Yu Hsieh, Si-An Chen, Chun-Liang Li, Yasuhisa Fujii, Alexander Ratner, Chen-Yu Lee,
Ranjay Krishna, and Tomas Pfister. Tool documentation enables zero-shot tool-usage with large
language models. arXiv preprint arXiv:2308.00675, 2023.
[26] Zhengliang Shi, Shen Gao, Xiuyi Chen, Lingyong Yan, Haibo Shi, Dawei Yin, Zhumin Chen,
Pengjie Ren, Suzan Verberne, and Zhaochun Ren. Learning to use tools via cooperative and
interactive agents. arXiv preprint arXiv:2403.03031, 2024.
[27] Siyu Yuan, Kaitao Song, Jiangjie Chen, Xu Tan, Yongliang Shen, Ren Kan, Dongsheng Li,
and Deqing Yang. Easytool: Enhancing llm-based agents with concise tool instruction. arXiv
preprint arXiv:2401.06201, 2024.
[28] Qiao Jin, Yifan Yang, Qingyu Chen, and Zhiyong Lu. Genegpt: Augmenting large language
models with domain tools for improved access to biomedical information. Bioinformatics,
2024.
[29] Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen,
Ji Zhang, and Fei Huang. Small llms are weak tool learners: A multi-llm agent. arXiv preprint
arXiv:2401.07324, 2024.
[30] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang.
Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural
Information Processing Systems, 2024.
[31] Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Eleanor
Jiang, Chengfei Lv, and Huajun Chen. AUTOACT: Automatic Agent Learning from Scratch
via Self-Planning. arXiv preprint arXiv:2401.05268, 2024.
[32] Ke Yang, Jiateng Liu, John Wu, Chaoqi Yang, Yi R Fung, Sha Li, Zixuan Huang, Xu Cao,
Xingyao Wang, Yiquan Wang, et al. If llm is the wizard, then code is the wand: A survey
on how code empowers large language models to serve as intelligent agents. arXiv preprint
arXiv:2401.00812, 2024.

[33] Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri
Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, et al. Dspy:
Compiling declarative language model calls into self-improving pipelines. arXiv preprint
arXiv:2310.03714, 2023.
[34] Zihao Wang, Shaofei Cai, Guanzhou Chen, Anji Liu, Xiaojian Ma, and Yitao Liang. Describe,
explain, plan and select: Interactive planning with large language models enables open-world
multi-task agents. arXiv preprint arXiv:2302.01560, 2023.
[35] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts
prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv
preprint arXiv:2211.12588, 2022.
[36] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan,
and Graham Neubig. PAL: Program-aided language models. In Proceedings of Machine
Learning Research: PMLR, 2023.
[37] Chen Qian, Yufan Dang, Jiahao Li, Wei Liu, Weize Chen, Cheng Yang, Zhiyuan Liu, and
Maosong Sun. Experiential co-learning of software-developing agents. arXiv preprint
arXiv:2312.17025, 2023.
[38] Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. Expel:
Llm agents are experiential learners. In Proceedings of the AAAI Conference on Artificial
Intelligence, pages 19632–19642, 2024.
[39] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili
Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for
multi-agent collaborative framework. arXiv, 2023.
[40] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improv-
ing factuality and reasoning in language models through multiagent debate. arXiv preprint
arXiv:2305.14325, 2023.
[41] Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. Improving language model negotiation
with self-play and in-context learning from ai feedback. arXiv preprint arXiv:2305.10142,
2023.
[42] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min,
Beichen Zhang, Junjie Zhang, Zican Dong, et al. A survey of large language models. arXiv
preprint arXiv:2303.18223, 2023.
[43] Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and
Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Neural
Information Processing Systems: NeurIPS, 2023.
[44] Xingyao Wang, Hao Peng, Reyhaneh Jabbarvand, and Heng Ji. Leti: Learning to generate from
textual interactions. arXiv preprint arXiv:2305.10314, 2023.
[45] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language
models to self-debug. arXiv preprint arXiv:2304.05128, 2023.
[46] Haoqiang Kang, Juntong Ni, and Huaxiu Yao. Ever: Mitigating hallucination in large language
models through real-time verification and rectification. arXiv preprint arXiv:2311.09114, 2023.
[47] Muru Zhang, Ofir Press, William Merrill, Alisa Liu, and Noah A Smith. How language model
hallucinations can snowball. arXiv preprint arXiv:2305.13534, 2023.
[48] Dheeraj Mekala, Jason Weston, Jack Lanchantin, Roberta Raileanu, Maria Lomeli, Jingbo
Shang, and Jane Dwivedi-Yu. Toolverifier: Generalization to new tools via self-verification.
arXiv preprint arXiv:2402.14158, 2024.
[49] Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. GPT4Tools:
Teaching Large Language Model to Use Tools via Self-instruction. Neural Information
Processing Systems: NeurIPS, 2023.

[50] Shen Gao, Zhengliang Shi, Minghang Zhu, Bowen Fang, Xin Xin, Pengjie Ren, Zhumin Chen,
Jun Ma, and Zhaochun Ren. Confucius: Iterative tool learning from introspection feedback by
easy-to-difficult curriculum. In Proceedings of the AAAI Conference on Artificial Intelligence:
AAAI, 2024.
[51] Albert Qiaochu Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh
Chaplot, Diego de Las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile
Saulnier, L’elio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut
Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b. arXiv preprint
arXiv:2310.06825, 2023.
[52] Qiantong Xu, Fenglu Hong, Bo Li, Changran Hu, Zhengyu Chen, and Jian Zhang. On
the tool manipulation capability of open-source large language models. arXiv preprint
arXiv:2305.16504, 2023.
[53] Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei
Huang, and Yongbin Li. API-bank: A comprehensive benchmark for tool-augmented LLMs. In
Association for Computational Linguistics: EMNLP, 2023.
[54] Junjie Ye, Guanyu Li, Songyang Gao, Caishuang Huang, Yilong Wu, Sixian Li, Xiaoran Fan,
Shihan Dou, Qi Zhang, Tao Gui, et al. Tooleyes: Fine-grained evaluation for tool learning
capabilities of large language models in real-world scenarios. arXiv preprint arXiv:2401.00741,
2024.
[55] Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu,
Maosong Sun, and Yang Liu. Stabletoolbench: Towards stable large-scale benchmarking on
tool learning of large language models. arXiv preprint arXiv:2403.07714, 2024.
[56] Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma,
Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. Advances in Neural
Information Processing Systems, 36, 2024.

A Appendix

A.1 Implementation details

The black-box tool probing. In our experiments, we evaluate our probing method on three datasets,
i.e., RestBench-TMDB, RestBench-Spotify, and ToolFlow. The maximum number of probing
iterations is set to 4. We provide the details of our tool probing process in Table 6.

Table 6: The implementation details of our black-box tool probing process.

                                                        RestBench-TMDB   RestBench-Spotify   ToolFlow (ours)
Maximum number of probing iterations                    4                4                   4
Runtime probing iterations during the experiment        2                4                   4
Avg. number of probing times for each tool              1.31             3.53                1.27
Total token consumption for the entire probing process  57574            240478              110386
Number of total tools / successfully probed tools       54               40                  107
Avg. token consumption for each tool                    1066             6012                1032

Efficiency analysis. The intensive parameters of LLMs typically raise concerns about inference
cost. Thus, we compare the token consumption and the number of inference steps between our
framework (auto doc) and strong baselines on the TMDB and ToolFlow datasets. The results are
shown in Figure 8. We observe that although our framework achieves better performance, it spends
fewer tokens than all baselines. The potential reason is that our framework benefits from the inherent
advantage of programming languages, which support control of the workflow and allow the composition
of multiple tools to perform complex logical operations. In addition, the previous baselines interact
with tools in a step-by-step manner, leading to long task-solving trajectories with many inference
steps. By contrast, our method composes a chain of tools in a single programmatic plan, simplifying
the repetitive workflow into a concise program. These results further validate the superiority of our
framework.

[Figure 8: Average token consumption (top) and average inference steps (bottom) of Ours, ReAct,
ReAct@3, ToolLLM, and ToolLLM@3 on the TMDB and ToolFlow datasets.]

The runtime consistency of our experiment. Since the generation of LLMs is non-deterministic
by nature, we further explore the consistency and stability of our framework. We repeat our method
(ATC (ours)) with the same setting as Table 1 on RestBench. The statistical significance of differences
observed between the performance of the two runs is tested using a two-tailed paired t-test. We find no
significant difference between the results of the two randomly conducted experiments (p > 0.05).

Pseudo algorithm for the chain of probing. We provide a pseudo algorithm in Algorithm 1 to
explain the process of our chain of probing algorithm.

Algorithm 1 The pseudo code of our chain of probing algorithm.

Require: Function LLMProb(tool, helpers) → doc: probe a tool, optionally with helper tools, to obtain its documentation (Section 4.1)
Require: Function LLMSample(tool, probed) → subset: sample related probed tools to assist the probing (Section 4.2)
1: Input: T : the toolset, maxNum: the maximum number of iterations
2: Output: the list of probed tools H and their synthetic protocols D
3: function TOPOPROBING(T , maxNum)
4:     B ← T ▷ initially, all the tools are unprobed and stored in a black-box list B
5:     H ← [ ] ▷ initialize the list of probed tools with an empty list
6:     D ← [ ] ▷ a list used to store the documentation of probed tools
7:     for t in T do
8:         d ← LLMProb(t, [ ]) ▷ probe each single tool on its own
9:         if d ̸= None then ▷ if the tool can be probed successfully
10:            B.remove(t) ▷ remove the probed tool from the black-box list B
11:            H.append(t) ▷ add the probed tool to H
12:            D.append(d) ▷ add the synthetic documentation to D
13:        end if
14:    end for
15:    cnt ← 0 ▷ initialize a counter for the number of completed iterations
16:    while len(B) ̸= 0 and cnt < maxNum do ▷ continue to probe tools in the black-box list
17:        for t in B do
18:            subset ← LLMSample(t, H) ▷ sample related tools for assistance from H
19:            d ← LLMProb(t, subset) ▷ probe the current tool with the assistance of related tools
20:            if d ̸= None then ▷ if the tool can be probed successfully
21:                B.remove(t) ▷ remove the probed tool from the black-box list B
22:                H.append(t) ▷ add the probed tool to H
23:                D.append(d) ▷ add the synthetic protocol to D
24:            end if
25:        end for
26:        cnt ← cnt + 1 ▷ increment the counter after each iteration
27:    end while
28:    return H, D ▷ return the probed tools and their synthetic protocols
29: end function
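For readers who prefer an executable form, the following is a minimal Python sketch of Algorithm 1; llm_probe and llm_sample are hypothetical stand-ins for the LLMProb and LLMSample operations.

```python
from typing import Callable, Dict, List, Optional

def chain_of_probing(
    toolset: List[str],
    llm_probe: Callable[[str, List[str]], Optional[dict]],   # returns a synthetic protocol or None
    llm_sample: Callable[[str, List[str]], List[str]],        # selects related probed tools
    max_iterations: int = 4,
) -> Dict[str, dict]:
    """Probe tools independently first, then with the help of already-probed tools."""
    black_box = list(toolset)        # tools that still need probing (the list B)
    probed: Dict[str, dict] = {}     # tool name -> synthetic protocol (H and D together)

    # Initial iteration: probe each tool on its own.
    for tool in list(black_box):
        protocol = llm_probe(tool, [])
        if protocol is not None:
            probed[tool] = protocol
            black_box.remove(tool)

    # Subsequent iterations: probe remaining tools with prerequisite tools from `probed`.
    for _ in range(max_iterations):
        if not black_box:
            break
        for tool in list(black_box):
            helpers = llm_sample(tool, list(probed))
            protocol = llm_probe(tool, helpers)
            if protocol is not None:
                probed[tool] = protocol
                black_box.remove(tool)

    return probed
```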

Procedure for schema transformation. We automatically transform a question-specific tool response
into a general response schema to demonstrate the input-output specification of a tool. This
process is conducted by recursively decomposing each element in the lengthy execution results,
representing the hierarchical structure among elements, and listing the type of the corresponding value
as the schema. The pseudo code of the overall procedure is given in Algorithm 2. To explain more
intuitively, we also illustrate our algorithm with an example in Appendix A.6, which demonstrates
how to construct the schema based on lengthy execution results.

A.1.1 Automatic evaluation


We mainly use Success Rate, Correct Path Rate, and Correct Path Precision as automatic evaluation
metrics following previous works [49, 50]. The Success Rate (Success%) computes the proportion
of successful query completions. Specifically, if all the ground-truth tools are executed correctly, the
success rate is 1, otherwise 0. The Correct Path Rate (Path%) calculates the proportion of ground
truth tools in model-generated tool callings. The Correct Path Precision (Prec%) calculates the
precision score between the model-generated tool callings and ground truth tool sequence.
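A minimal sketch of how these metrics could be computed for a single task from lists of tool names; note that Success Rate is simplified here to coverage of the ground-truth tools rather than checking that each call executed correctly, and the function names are illustrative.

```python
from typing import List

def success_rate(predicted: List[str], ground_truth: List[str]) -> float:
    """1.0 if every ground-truth tool appears among the predicted tool callings, else 0.0."""
    return 1.0 if all(t in predicted for t in ground_truth) else 0.0

def correct_path_rate(predicted: List[str], ground_truth: List[str]) -> float:
    """Proportion of ground-truth tools covered by the model-generated tool callings."""
    if not ground_truth:
        return 1.0
    return sum(1 for t in ground_truth if t in predicted) / len(ground_truth)

def correct_path_precision(predicted: List[str], ground_truth: List[str]) -> float:
    """Proportion of model-generated tool callings that belong to the ground-truth sequence."""
    if not predicted:
        return 0.0
    return sum(1 for t in predicted if t in ground_truth) / len(predicted)

# Example usage with hypothetical tool sequences.
pred = ["GET_IP", "GET_Weather", "GET_Weather", "Book_Hotel"]
gold = ["GET_IP", "GET_Weather", "Book_Flight", "Book_Hotel"]
print(success_rate(pred, gold), correct_path_rate(pred, gold), correct_path_precision(pred, gold))
# 0.0 0.75 1.0
```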

Algorithm 2 Schema transformation algorithm, which automatically transforms a specific tool
response (i.e., execution results of the tool) into a general response schema.
Require: Function IsInstanceOf(v, T) → bool: returns whether the input variable ‘v’ is an instance of type T
1: function SchemaTransformation(response)
2: if IsInstanceOf(response, Atomic) then
3: return response.type
4: else if IsInstanceOf(response, List) and len(response) > 0 then
5: tmp ← SchemaTransformation(response[0])
6: return List[tmp] ▷ only keep the first element to simplify the result
7: else if IsInstanceOf(response, Dict) then
8: schema ← Dict()
9: for all key, child in response do ▷ recursively simplify each child element
10: schema[key] ← SchemaTransformation(child)
11: end for
12: return schema
13: end if
14: end function
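A minimal Python counterpart of Algorithm 2 is sketched below, using Python's built-in types. Emitting type names such as "int" and "str" matches the schemas shown in Appendix A.6, though the exact string conventions are our own choice.

```python
from typing import Any

def schema_transformation(response: Any) -> Any:
    """Recursively replace concrete values with their type names to form a response schema."""
    if isinstance(response, dict):
        # Keep the key structure and recurse into each child element.
        return {key: schema_transformation(child) for key, child in response.items()}
    if isinstance(response, list):
        # Only keep the first element to simplify the result (empty lists stay empty).
        return [schema_transformation(response[0])] if response else []
    # Atomic value: return its type name, e.g., "int", "str", "float", "bool".
    return type(response).__name__

# Example: a (truncated) tool response is reduced to its schema.
response = {"id": 9266, "name": "pineapples", "possibleUnits": ["piece", "g"],
            "estimatedCost": {"value": 33.04, "unit": "US Cents"}}
print(schema_transformation(response))
# {'id': 'int', 'name': 'str', 'possibleUnits': ['str'], 'estimatedCost': {'value': 'float', 'unit': 'str'}}
```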

A.1.2 Human evaluation


Metrics of human evaluation. Following previous work [26], we conduct a human evaluation on two metrics: (1) Executability (Exec): whether multiple tools are invoked in a correct logical order to complete the task; (2) Tool utilization (Utility): whether the model can observe the relevant values in lengthy execution results and incorporate them to predict the next action. We invite three well-educated volunteers to evaluate 30 cases randomly sampled from each of the RestBench-TMDB, RestBench-Spotify, and ToolFlow datasets, using a three-point rating scale.

Details of human evaluation. Specifically, the annotators manually evaluate each task-solving trajectory step-by-step for the Utility and Executability metrics, using the ground truth solution as a reference. To guarantee annotation quality, we ask at least two annotators to evaluate the same example. If there is a discrepancy between the two annotators (i.e., the two annotators give different scores), we ask a third annotator to recheck it. The Kappa statistics for the Executability and Tool utilization metrics are 0.70 and 0.69, respectively, indicating agreement among the annotators.

Results of human evaluation. The results are shown in Table 7. We find that our method achieves the best Executability scores, with an absolute improvement of 0.16 to 0.28 over strong baselines, e.g., ToolLLM@3. We also observe that our method achieves higher Utility scores. A potential reason for this superiority is that our framework enables the LLM to operate a chain of tools through programming, which is more executable than the manually designed workflows in previous work.

Table 7: The human evaluation on three datasets for executability and utility. Scores are on a scale of 1-3.

Method            | RestBench-TMDB   | RestBench-Spotify | ToolFlow
                  | Exec    Utility  | Exec    Utility   | Exec    Utility
ReAct [18]        | 1.73    1.83     | 1.63    2.07      | 1.48    1.67
CodeAct [18]      | 1.87    2.07     | 1.83    1.98      | 1.67    1.87
ToolLLM@3 [9]     | 2.30    2.23     | 2.21    2.30      | 2.05    2.04
Ours (auto doc)   | 2.52    2.47     | 2.37    2.54      | 2.33    2.45

Quality improvement. We also employ the following strategies to ensure the reliability of our
human evaluation.

• Detailed annotator training. We hold regular meetings to ensure that each evaluator has no questions about the evaluation metrics. We also design pre-annotation tests, where each evaluator undergoes detailed training to familiarize themselves with our evaluation task. The training is designed to minimize subjective interpretations and ensure that each annotator applies the rules consistently.
• Cross-check for potential discrepancies. Each evaluator is invited to manually evaluate the task-solving trajectory step-by-step using the ground truth solution provided by the datasets as a reference. To guarantee evaluation quality, we ask at least two evaluators to evaluate the same questions. If there is a discrepancy between the two evaluators (i.e., the two evaluators give different correctness scores), we ask a third evaluator to recheck it.
Table 8: The comparison with existing benchmarks. We highlight our collected benchmark in boldface and underline the benchmark used in our experiments.

Dataset                | Total task number | Avg. tools per task | Total tool number | Avg. tokens of documented protocol | Tool calling interface
ToolFlow (ours)        | 224               | 7.31                | 107               | 552.92                             | RESTful API
RestBench-TMDB [14]    | 100               | 2.26                | 54                | 673.67                             | RESTful API
RestBench-Spotify [14] | 57                | 2.56                | 40                | 792.15                             | RESTful API
ToolBench-ToolEval [9] | 600               | 2.56                | 1806              | 159.47                             | RESTful API
ToolBench-sam [52]     | 895               | 5.35                | 232               | 66.98                              | Private
API-bank [53]          | 272               | 1.99                | 101               | 75.85                              | Python function
ToolEyes [54]          | 382               | 2                   | 568               | 72.06                              | Python function

Table 9: The statistics of our collected ToolFlow benchmark, where we show the tool number and example number for each domain.

Domain       | Food Recipe | Open Weather | Game | Movie | Total
Task number  | 64          | 50           | 50   | 60    | 224
Tool number  | 22          | 11           | 20   | 54    | 107


A.2 A new benchmark – ToolFlow

Our ToolFlow benchmark is proposed to evaluate tool-use LLMs with more challenging tasks. Compared with existing benchmarks, ToolFlow has the following advantages.

• Long-term planning. Most existing tool learning benchmarks are relatively simple, with each task being solvable in 2 or 3 steps. However, real-world tasks often require complex workflows, such as computing the rating scores for the top 10 newly released movies. To reflect the tool learning capability of LLMs in realistic scenarios, each task in our ToolFlow benchmark is designed to require a long-horizon workflow, with an average of 7.31 tool calls per task (see Table 8).
• Connected reasoning. Each task in our benchmark requires the model to interact with tools
multiple times. To increase the challenge of the task, there is a strong interdependency among the
tools, meaning that the argument of the current tool can only be extracted from the execution results
of previous tools. This interdependent nature forces the models to connect information across all
execution results of tools to solve a complex task, instead of simply making multiple calls without
further reasoning.
• Consistency and stability. For high reproducibility, no task in our benchmark involves time-specific information, and the outputs of the tools are not time-varying.

We also compare our ToolFlow with existing benchmarks in Table 8.

A.2.1 Details for benchmark construction


Previous work like ToolBench [9] directly employs LLMs to generate datasets. However, such automatically generated datasets have been shown to be less diverse and to contain unsolvable tasks [55, 56], raising concerns about the scope and effectiveness of the evaluation. In this work, we adopt a bottom-up, manually driven task collection approach. Specifically, we employ 7 experts (i.e., annotators) who work on NLP research to brainstorm tasks for different combinations of tools. Each expert is encouraged to integrate various tools to formulate a challenging task.

Figure 9: The distribution of the number of required tools per task in our collected ToolFlow benchmark: 1-3 tools (16.38%), 4-5 tools (31.97%), 6-7 tools (29.51%), 8-9 tools (12.30%), and 10 or more tools (9.84%).

Next, the experts manually solve these tasks with the assistance of the candidate tools and annotate the ground truth solution, which includes the path of required tools and the corresponding arguments for each tool call. To establish a benchmark for highly consistent evaluations, we exclude any task whose solution varies over time. Specifically, a task is filtered out if its ground-truth tool-calling path differs between two runs. Ultimately, we construct 224 examples across 107 tools from four domains. Table 10 shows an example of our collected
benchmark. Compared with existing benchmarks which only list the required tools for each task, we
further provide a ground truth solution for reference, including the required tools and corresponding
arguments. Although the dataset is not large, each task in our benchmark is of high quality and
represents the types of requests frequently made by users. The statistics of our benchmark are shown
in Figure 9 and Table 9.

Table 10: An example of our collected ToolFlow benchmark.


Example of our ToolFlow benchmark (Food domain)
Task:
Please help me find a steak recipe and a pasta recipe. These recipes should have a carbohydrate
content no higher than 80 grams per 100 grams, no lower than 5 grams per 100 grams. The protein
content should be at least 5 grams per 100 grams for each recipe. Among them, which recipe requires
fewer pieces of equipment, and how many ingredients does the recipe with fewer equipment contain?

Base url for tool:


https://spoonacular-recipe-food-nutrition-v1.p.rapidapi.com/

Ground truth solution:


1. GET /recipes/complexSearch
- arguments: {"query": "steak", "minCarbs":5, "maxCarbs": 80, "minProtein": 5, "number": 1}
2. GET /recipes/complexSearch
- arguments: {"query": "pasta", "minCarbs":5, "maxCarbs": 80, "minProtein": 5, "number": 1}
3. GET /recipes/recipe_id/equipmentWidget.json
- arguments:{"recipe_id": 1094259}
4. GET /recipes/recipe_id/ingredientWidget.json
- arguments: {"recipe_id": 1094259}
5. GET /recipes/recipe_id/equipmentWidget.json
- arguments: {"recipe_id": 532245}
6. GET /recipes/recipe_id/ingredientWidget.json
- arguments: {"recipe_id": 532245}

Ground truth tools:


1. GET /recipes/complexSearch
2. GET /recipes/{recipe_id}/equipmentWidget.json
3. GET /recipes/{recipe_id}/ingredientWidget.json
4. GET /recipes/{recipe_id}/equipmentWidget.json
5. GET /recipes/{recipe_id}/ingredientWidget.json
6. GET /recipes/{recipe_id}/similar
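To make the RESTful tool-calling interface concrete, below is a minimal sketch of how the first ground-truth call in this example could be issued. The RapidAPI key is a placeholder, and the header names follow the usual RapidAPI convention rather than anything specified in the benchmark itself.

```python
import requests

BASE_URL = "https://spoonacular-recipe-food-nutrition-v1.p.rapidapi.com"
headers = {
    "X-RapidAPI-Key": "<your_rapidapi_key>",   # placeholder credential
    "X-RapidAPI-Host": "spoonacular-recipe-food-nutrition-v1.p.rapidapi.com",
}

# Step 1 of the ground-truth solution: search for one steak recipe under the nutrient constraints.
params = {"query": "steak", "minCarbs": 5, "maxCarbs": 80, "minProtein": 5, "number": 1}
response = requests.get(f"{BASE_URL}/recipes/complexSearch", headers=headers, params=params)
print(response.json())
```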

A.2.2 Strategy for quality improvement

To ensure the quality of our constructed benchmark, we employ the following strategies.

18
• Detailed annotator training. We hold regular meetings to ensure that each expert has no questions
about the annotation criteria. We also design pre-annotation tests, where each expert undergoes
detailed training to familiarize themselves with our annotation task.
• Cross-check for potential discrepancies. To guarantee annotation quality, we ask at least two experts to annotate the same task. If there is a discrepancy between the two experts, i.e., the two experts give different solutions for the same task, we ask a third expert to recheck it. We also filter out ambiguous tasks to improve the reliability of our benchmark.
• Periodic audits. We conduct periodic audits of the annotations. These audits involve cross-checking a subset of annotated examples to verify compliance with the established criteria. We also hold regular review meetings where the annotation experts discuss challenging cases, ensuring a common understanding and application of the rules.

A.3 Case Study

We conduct comprehensive case studies and find that our framework ATC is effective at coordinating various tools to solve complex tasks, and that our probing method can instruct the LLM to probe the input-output mechanisms of tools and automatically synthesize documentation. We provide the following cases to intuitively explain the details of our method.

Output example of our ATC. Figure 10 and Figure 11 show concrete output examples of our method on the RestBench-TMDB and our collected ToolFlow benchmarks, respectively.

Query: Give me some cover images of movies directed by Christopher Nolan.

# Generated program of our method
import requests
headers = {"Authorization": f"Bearer eyJhbGciOiJIUzI1NiJ9.eyJhdWQiOiIwZ"}

# Call GET_search_person API to search for Christopher Nolan
params = {
    "query": "Christopher Nolan",
    "include_adult": False
}
url = "https://api.themoviedb.org/3/search/person"
search_person_response = requests.get(url=url, headers=headers, params=params)
search_person_results = search_person_response.json()["results"]
if len(search_person_results) > 0:
    # Get the person id of Christopher Nolan
    person_id = search_person_results[0]["id"]

    # Call GET_person_person_id_movie_credits API to get the movie credits for Christopher Nolan
    person_movie_credits_url = f"https://api.themoviedb.org/3/person/{person_id}/movie_credits"
    person_movie_credits_response = requests.get(person_movie_credits_url, headers=headers)
    person_movie_credits_results = person_movie_credits_response.json()

    # Get the cover images of movies directed by Christopher Nolan
    cover_images = []
    for movie in person_movie_credits_results["crew"]:
        if movie["job"] == "Director" and movie["poster_path"]:
            cover_images.append(movie["poster_path"])

    print(cover_images)
else:
    print("Christopher Nolan not found.")

Output: ['/cXDFFv6nL0So.jpg', '/rQ884232vY6Zp.jpg', '/riVXh3EimGO5Itg.jpg']

Figure 10: A concrete example on the RestBench-TMDB dataset, including the input query, the
program generated by our method, and the corresponding output result.

Example for attributable reflection. Figure 12 shows the details of our attributable reflection mechanism. Specifically, when an exceptional error, i.e., a KeyError, is raised while running the generated program, the LLM is first instructed to attribute the error to a specific tool call and to output the name of the tool, GET_movie_credits, that accounts for the raised error. Then, the selected tool name, paired with the error message, is fed back into the LLM, guiding it to adaptively revise the error and re-generate a correct program. The final answer "GoodFellas" is obtained by running the newly generated program.
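To make the control flow concrete, the following is a minimal sketch of this attributable reflection loop. It assumes `llm` is a text-completion callable and that the instructions Ic and Ia from Appendix A.7 are available as template strings with the placeholders shown there; the wording of the re-generation prompt and the helper name are our own simplification.

```python
import traceback

def run_with_reflection(llm, query, tool_docs, ic_template, ia_template, max_iters=3):
    """Generate a chain-of-tools program, execute it, and let the LLM attribute and fix runtime errors."""
    # Fill the program-generation instruction Ic with the documented protocols and the query.
    gen_prompt = ic_template.replace("{documented protocol}", tool_docs).replace("{query}", query)
    program = llm(gen_prompt)

    for _ in range(max_iters):
        try:
            exec(program, {})   # run the generated program; it prints the final answer
            return program      # stop once no runtime error is raised
        except Exception:
            bug = traceback.format_exc()
            # Fill the reflection instruction Ia so the LLM attributes the error to a specific tool.
            attr_prompt = (ia_template.replace("{code}", program).replace("{question}", query)
                           .replace("{docs}", tool_docs).replace("{bug}", bug))
            culprit_tool = llm(attr_prompt)
            # Re-generate the program, conditioning on the culprit tool and the error message.
            program = llm(gen_prompt + f"\nThe previous program failed when calling {culprit_tool}."
                                       f"\nError message:\n{bug}\nPlease generate a corrected program.")
    return program
```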

Query: I am looking for pizza recipes. Could you please find me one each for Italian, Chinese, british, irish, french, and Japanese
cuisines, and tell me their title? Each recipe should include tomatoes and cheese as ingredients. Among them, can you recommend
three names of the recipes that are similar to the Italian pizza recipe?

# Generated program of our method
import requests
headers = {"X-RapidAPI-Host": "spoonacular-recipe-food-nutrition-v1.p.rapidapi.com"}  # Headers
pizza_recipe_titles = {}  # Dictionary to store the first pizza recipe per cuisine

cuisines = ["Italian", "Chinese", "British", "Irish", "French", "Japanese"]  # List of cuisines
for cuisine in cuisines:
    url = "https://spoonacular-recipe-food-nutrition-v1.p.rapidapi.com/recipes/complexSearch"
    params = {"query": "pizza", "number": 1, "cuisine": cuisine, "include": "tomatoes,cheese"}
    results = requests.get(url, headers=headers, params=params).json().get("results", [])
    if results:  # Store the recipe title
        recipe = results[0]
        pizza_recipe_titles[cuisine] = recipe["title"]
    else:
        pizza_recipe_titles[cuisine] = "No pizza recipe found for this cuisine"

for cuisine, title in pizza_recipe_titles.items():  # Output the found recipe titles
    print(f"{cuisine} pizza recipe title: {title}")

# Find similar recipes to the Italian pizza recipe
italian_pizza_recipes = list(filter(lambda x: x[0] == "Italian", pizza_recipe_titles.items()))
if italian_pizza_recipes:
    italian_pizza_title = italian_pizza_recipes[0][1]
    url = "https://spoonacular-recipe-food-nutrition-v1.p.rapidapi.com/recipes/findByIngredients"
    params = {"ingredients": italian_pizza_title, "number": 3}
    similar_results = requests.get(url, headers=headers, params=params).json()
    similar_recipes = [recipe['title'] for recipe in similar_results]
    print(f"Three similar recipes to the Italian pizza recipe are: {similar_recipes}")
else:
    print("No Italian pizza recipe was found to find similar recipes.")

Output: Italian pizza recipe title: Pizza
        Chinese pizza recipe title: Pizza
        ...
        Three similar recipes to the Italian pizza recipe are: [Pizzuchini, Cheesy Pizza Panini, Portobello]

Figure 11: A concrete example on the ToolFlow benchmark, including the input query, the program
generated by our method, and the corresponding output result.

Example for tool probing. We illustrate an example in Figure 13 to demonstrate the details of our tool probing process. Given a black-box tool, i.e., GET_search, we start with its basic information, which includes its description and parameter requirements. Using this information, we instruct the LLM to formulate a query that targets the functionality of the tool, i.e., Find the album related to the keyword 'pop' in the US market. Subsequently, we generate a complete program to execute the query and record the results. This process probes the input-output mechanism of a black-box tool and synthesizes its protocol, which can then be used in our ATC framework. The proposed probing allows the LLM to teach itself to master new tools, extending our framework to diverse and fast-evolving tools without manually crafted documentation.

Example for tool probing with the chain of probing algorithm. Our chain of probing algorithm is designed to optimize the probing order and facilitate coordination among tools that exhibit strong interdependencies. The core idea is to probe complex tools with the assistance of other, already probed tools. Figure 14 shows a specific example: the probed tool “GET_search” is first used to obtain the id of an album; on top of the obtained album id, the black-box tool “GET_albums” can then be called to acquire information about the targeted album. This dependency illustrates the necessity of our chain of probing strategy.
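As a concrete illustration of this coordination, the sketch below first calls the probed tool GET_search to obtain an album id and then feeds that id to the black-box tool GET_albums. The Bearer token is a placeholder, and the specific query parameters are chosen for illustration only.

```python
import requests

headers = {"Authorization": "Bearer <spotify_access_token>"}  # placeholder token

# Step 1: call the already-probed tool GET_search to obtain an album id.
search_response = requests.get(
    "https://api.spotify.com/v1/search",
    headers=headers,
    params={"q": "pop", "type": "album", "market": "US", "limit": 1},
)
album_id = search_response.json()["albums"]["items"][0]["id"]

# Step 2: probe the black-box tool GET_albums with the id produced by GET_search.
album_response = requests.get(
    f"https://api.spotify.com/v1/albums/{album_id}",
    headers=headers,
    params={"market": "US"},
)
print(album_response.json())  # record the execution result to synthesize the tool's protocol
```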

A.4 Limitations

The main limitation of our work is that our framework uses an LLM as the backbone, which is limited when perceiving multi-modal tasks. When executing the tools, we represent image and speech inputs with URLs, following previous work. In the future, we plan to extend our framework to multi-modal foundation models. In addition, we stop the iteration of our attributable reflection mechanism once no runtime errors are raised by the code interpreter. In the future, we will further calibrate the output of generated programs to mitigate the false-success phenomenon, i.e., a program that triggers no runtime error but gives an incorrect answer, with the goal of developing a more reliable tool-use agent.
Query: tell me the highest rated movie directed by Martin Scorsese

# Generated program of our method (iteration 1)
import requests
headers = {"Authorization": "Bearer eyJhbGciOiJIUzI1NiJ9"}  # Define the headers

url = "https://api.themoviedb.org/3/search/person"  # Step 1: Search for person Martin Scorsese
data = requests.get(url, headers=headers, params={"query": "Martin Scorsese", "page": 1}).json()
person_id = data['results'][0]['id']  # get Martin Scorsese's person_id

credits_url = f"https://api.themoviedb.org/3/person/{person_id}/movie_credits"  # Step 2: Get movie credits
data = requests.get(credits_url, headers=headers).json()

movies = data['cast'] + data['crew']
highest_rated_movie, highest_rating = None, 0.0
for movie in movies:  # Step 3: Find the highest rated movie
    if 'director' in movie['department'].lower() and movie['vote_average'] > highest_rating:
        highest_rating = movie['vote_average']
        highest_rated_movie = movie

print(highest_rated_movie['title'])  # Print the highest rated movie

Output:
25     highest_rating = 0.0
26     for movie in movies:
---> 27     if 'director' in movie['department'].lower() and movie['vote_average'] > highest_rating:
28         highest_rating = movie['vote_average']
29         highest_rated_movie = movie

KeyError: 'department' ❌

Attributable reflection: The error is triggered when calling the GET_movie_credits tools -> ['GET_movie_credits tools']

# Generated program of our method (iteration 2)
import requests
headers = {"Authorization": "Bearer eyJhbGciOiJIUzI1NiJ9"}  # Define the headers

url = "https://api.themoviedb.org/3/search/person"  # Step 1: Search for person Martin Scorsese
data = requests.get(url, headers=headers, params={"query": "Martin Scorsese", "page": 1}).json()
person_id = data['results'][0]['id']  # get Martin Scorsese's person_id

credits_url = f"https://api.themoviedb.org/3/person/{person_id}/movie_credits"  # Step 2: Get movie credits
data = requests.get(credits_url, headers=headers).json()

movies = data['crew']
highest_rated_movie, highest_rating = None, 0.0
for movie in movies:  # Step 3: Find the highest rated movie
    if 'director' in movie['job'].lower() and movie['vote_average'] > highest_rating:
        highest_rating = movie['vote_average']
        highest_rated_movie = movie

print(highest_rated_movie['title'])  # Print the highest rated movie

Output: GoodFellas ✅

Figure 12: A concrete example of our attributable reflection mechanism. We highlight the program
snippet that is corrected in the 2nd iteration.


A.5 Ethics Statement

The research conducted in this paper aims to empower large language models (LLMs) as multi-tool learners that can programmatically utilize a chain of tools and instruct themselves to master new tools. In addition to state-of-the-art commercial LLMs, we have experimented with an open-source LLM, for reproducibility reasons and to allow the use of our method in lower-resource contexts. In the process of conducting this research, we have adhered to ethical standards to ensure the integrity and validity of our work. All the tools used in this study were obtained from existing public platforms, ensuring a high level of transparency and reproducibility in our experimental procedure. In addition, we have made every effort to ensure that our research does not harm individuals or groups, nor does it involve any form of deception or potential misuse of information.

A new tool: GET_search
# Description
Get Spotify catalog information about albums, artists, playlists, tracks, shows, episodes or audiobooks. (GET https://api.spotify.com/v1/search)
# Parameter
- q: Your search query. (type: string)
- type: A comma-separated list of item types to search across. Search results include hits (type: array)
- market: An [ISO 3166-1 alpha-2 country code]. (type: string)
- limit: The maximum number of results to return in each item type. (type: integer)
- offset: The index of the first result to return. Use (type: integer)
- include_external: If `include_external=audio` is specified it signals that the client can play externally hosted audio content, and marks (type: string)

Instance generation
- Formulated query: Find the albums, artists, and playlists related to the keyword "pop" in the US market.
- Test code

import requests
import json

url = "https://api.spotify.com/v1/search"
params = {
    "q": "pop",
    "type": "album,artist,playlist",
    "market": "US"
}

response = requests.get(url, headers=headers, params=params)
print(json.dumps(response.json(), indent=0))

Tool response: {"albums": {"href": "https://api.spotify.com/v1/search?query=pop&type=album&market=US&offset=0&limit=20", "items": [{"album_type": "single", "artists": [{"external_urls": {"spotify": "https://open.spotify.com/artist/1Xyo4u8uXC1ZmMpatF05PJ"}, "href": "https://api.spotify.com/v1/artists/1Xyo4u8uXC1ZmMpatF05PJ", "id": "1Xyo4u8uXC1ZmMpatF05PJ", "name": "The Weeknd", "type": "artist", "uri" ...

Figure 13: A concrete example of our tool probing process, where we highlight the question
formulated by the LLM and the program snippet used to call the tool.

A new tool: GET_albums
# Description
Get Spotify catalog information for a single album. (GET https://api.spotify.com/v1/albums/{id})
# Parameter
- id: The [Spotify ID] of the album. (type: string)
- market: An [ISO 3166-1 alpha-2 country code]. (type: string)
The `{variable}` in the url path should also be replaced with actual value.

Related tools:
[1] GET_search:
- description: Get Spotify catalog information about albums, artists, playlists…
- Parameter:
- q: Your search query. (type: string)
- type: …
- Response schema: {"albums": {"href": str, ...

Instance generation

- Formulated query: Find the albums, artists, and playlists related to the keyword "pop" in the US market.

- Test code

import requests
import json

url = "https://api.spotify.com/v1/search"
params = {
    "q": "pop",
    "type": "album,artist,playlist",
    "market": "US"
}

response = requests.get(url, headers=headers, params=params)
print(json.dumps(response.json(), indent=0))

Tool response: {"albums": {"href": "https://api.spotify.com/v1/search?query=pop&type=album&market=US&offset=0&limit=20", "items": [{"album_type": "single", "artists": [{"external_urls": {"spotify": "https://open.spotify.com/artist/1Xyo4u8uXC1ZmMpatF05PJ"}, "href": "https://api.spotify.com/v1/artists/1Xyo4u8uXC1ZmMpatF05PJ", "id": "1Xyo4u8uXC1ZmMpatF05PJ", "name": "The Weeknd", "type": "artist", "uri": "spotify:artist:1Xyo4u8uXC1ZmMpatF05PJ"}, ...

Figure 14: A concrete example to illustrate the coordination among tools during our tool probing process. Initially, the probed tool “GET_search” is used to obtain the id of an album. Using this obtained album id, the black-box tool “GET_albums” can then be called to acquire information about the targeted album.

A.6 The synthetic tool response schema and tool protocol

The probed execution results of a tool (GET_ingredient_information) from our ToolFlow


{" id ": 9266 , " original ": " pineapples " , " originalName ":
" pineapples " , " name ": " pineapples " , " amount ": 100 , " unit ": " grams " ,
" unitShort ": " g " , " unitLong ": " grams " , " possibleUnits ": [" piece " ,
" slice " , " fruit " , " g " , " oz " , " cup " , " serving "] , " estimatedCost ":
{" value ": 33.04 , " unit ": " US Cents "} , " consistency ": " solid " ,
" shoppingListUnits ": [" pieces "] , " aisle ": " Produce " , " image ":
" pineapple . jpg " , " meta ": [] , " nutrition ": {" nutrients ": [{" name ":
" Phosphorus " , " amount ": 8 , " unit ": " mg " , " percentOfDailyNeeds ":
0.8} , {" name ": " Copper " , " amount ": 0.11 , " unit ": " mg " ,
" p ercentOfDailyNeeds ": 5.5} , {" name ": " Protein " , " amount ": 0.54 ,
" unit ": " g " , " percentOfDa ilyNeeds ": 1.08} , {" name ":
" Carbohydrates " , " amount ": 13.1 , " unit ": " g " ,
" p ercentOfDailyNeeds ": 4.37} , {" name ": " Vitamin B12 " , " amount ": 0 ,
" unit ": "\ u00b5g " , " perce ntOfDailyNeeds ": 0} , {" name ": " Mono
Unsaturated Fat " , " amount ": 0.01 , " unit ": " g " ,
" p ercentOfDailyNeeds ": 0} , {" name ": " Alcohol " , " amount ": 0 , " unit ":
" g " , " percentOfDailyNeeds ": 0} , {" name ": " Fat " , " amount ": 0.12 ,
" unit ": " g " , " percentOfDa ilyNeeds ": 0.18} , {" name ": " Zinc " ,
" amount ": 0.12 , " unit ": " mg " , " percentOfDailyNeeds ": 0.8} , {" name ":
" Poly Unsaturated Fat " , " amount ": 0.04 , " unit ": " g " ,
" p ercentOfDailyNeeds ": 0} , {" name ": " Vitamin B6 " , " amount ": 0.11 ,
" unit ": " mg " , " percentOfDailyNeeds ": 5.6} , {" name ": " Caffeine " ,
" amount ": 0 , " unit ": " mg " , " percentOfDailyNeeds ": 0} , {" name ":
" Iron " , " amount ": 0.29 , " unit ": " mg " , " percentOfDailyNeeds ": 1.61} ,
{" name ": " Sugar " , " amount ": 9.85 , " unit ": " g " ,
" p ercentOfDailyNeeds ": 10.94} , {" name ": " Vitamin K " , " amount ": 0.7 ,
" unit ": "\ u00b5g " , " perce ntOfDailyNeeds ": 0.67} , {" name ":
" Lycopene " , " amount ": 0 , " unit ": "\ u00b5g " , " percentOfDailyNeeds ":
0} , {" name ": " Folate " , " amount ": 18 , " unit ": "\ u00b5g " ,
" p ercentOfDailyNeeds ": 4.5} , {" name ": " Vitamin B1 " , " amount ": 0.08 ,
" unit ": " mg " , " percentOfDailyNeeds ": 5.27} , {" name ": " Vitamin B2 " ,
" amount ": 0.03 , " unit ": " mg " , " percentOfDailyNeeds ": 1.88} ,
{" name ": " Magnesium " , " amount ": 12 , " unit ": " mg " ,

The synthetic response schema based on the probed execution result for the tool
GET_ingredient_information
{
    "id": "int",
    "original": "str",
    "originalName": "str",
    "name": "str",
    "amount": "int",
    "unit": "str",
    "unitShort": "str",
    "unitLong": "str",
    "possibleUnits": ["str"],
    "estimatedCost": {"value": "float", "unit": "str"},
    "consistency": "str",
    "shoppingListUnits": ["str"],
    "aisle": "str",
    "image": "str",
    "meta": [],
    "nutrition": {
        "nutrients": [{"name": "str", "amount": "int", "unit": "str", "percentOfDailyNeeds": "float"}],
        "properties": [{"name": "str", "amount": "float", "unit": "str"}],
        "flavonoids": [{"name": "str", "amount": "int", "unit": "str"}],
        "caloricBreakdown": {"percentProtein": "float", "percentFat": "float", "percentCarbs": "float"},
        "weightPerServing": {"amount": "int", "unit": "str"}
    },
    "categoryPath": ["str"]
}

The auto-documented protocol during our tool probing process for the tool
GET_ingredient_information
API name: GET_ingredient_information
### API url
https://spoonacular-recipe-food-nutrition-v1.p.rapidapi.com/food/ingredients/{ingredient_id}/information
### Request type
GET
### Description
Use an ingredient id to get all available information about an ingredient, such as its image and supermarket aisle.
### Parameter
- ingredient_id: The id of the food (ingredient). (type: integer)
- amount: The amount of that food. (type: integer)
- unit: The unit for the given amount. (type: String)
The `{variable}` in the url path should also be replaced with actual value.
### Execution result specification
{
    "id": "int",
    "original": "str",
    "originalName": "str",
    "name": "str",
    "amount": "int",
    "unit": "str",
    "unitShort": "str",
    "unitLong": "str",
    "possibleUnits": ["str"],
    "estimatedCost": {"value": "float", "unit": "str"},
    "consistency": "str",
    "shoppingListUnits": ["str"],
    "aisle": "str",
    "image": "str",
    "meta": [],
    "nutrition": {
        "nutrients": [{"name": "str", "amount": "int", "unit": "str", "percentOfDailyNeeds": "float"}],
        "properties": [{"name": "str", "amount": "float", "unit": "str"}],
        "flavonoids": [{"name": "str", "amount": "int", "unit": "str"}],
        "caloricBreakdown": {"percentProtein": "float", "percentFat": "float", "percentCarbs": "float"},
        "weightPerServing": {"amount": "int", "unit": "str"}
    },
    "categoryPath": ["str"]
}
### Request body
"This API do not need the request body when calling."

A.7 Instructions

The instruction Ic to generate chain of tools programmatically (See § 3.2 for more details).

Here are some real-world tools (APIs). You need to answer my question by writing Python programs to call a series of tools and `print` the final answer. The tools adhere to the RESTful format and can be accessed via HTTP request.

Here is the **OpenAPI Protocol** of the given tools, including their HTTP URL, description, arguments, and execution results (schema).

{documented protocol}

Try to write a correct Python program and avoid grammar errors, e.g., `variable is not defined`. You need to provide Python code that can be executed directly. Please add the name of the used APIs in Python comments for the attribution consideration.

Query: {query}
Your output:
```python
[Program]
```

The instruction Ia to enable the attributable reflection mechanism (See § 3.3 for more details).

In this task, you are a code reviewer. You should read my code and the error message to identify the reason for my bug. Here is my code:
```python
{code}
```

Specifically, my code is to call the following APIs to solve the question: "{question}". Here are the details of the APIs.
{docs}

However, my code encounters the bug:
=================
{bug}
=================

Please help me to analyze my code and attribute the error message to the specific API callings. Note that you should select the tool name that triggers the error from the above tool list as your output.

Your output: [the tool name]

The instruction Ip for our black-box probing process (See § 4 for more details).
In this task, you are a programming engineer. Your task is to test the provided APIs, which are provided by the OpenAPI platform to access the web information, e.g., movie and music. Specifically, you need to write Python code to pass the required arguments to call the APIs, getting the execution results of the APIs.

Here are the OpenAPI Specifications of the given APIs, including their HTTP URL, functional description, and argument details.

{docs}

To test the APIs, please first propose a clear and detailed query to specify the purpose of your test code, e.g., find a movie named The Matrix. Then provide Python code that can be executed directly to finish your query and get the execution results.

Note: please always transform the raw execution results into `str` (e.g., json.dumps(response.json(), indent=0)) and print them in your code via `print()`.

Starting below, you should follow this format:

Your query: "propose your query"

Test code:
```python
def test_code(param...):
    ... implement details
# call your `test_code` function
...
# transform the raw execution results into `str` and print it
...
```

