
DevRev’s AI Agent 007: Tooling up for Success

Team 15

Abstract

We aim to develop a Large Language Model (LLM) powered chatbot, augmented by a set of tools, each accompanied by its detailed description. The chatbot should intelligently recommend a subset of these tools, specify the arguments for their utilization, and provide guidance on how to combine them effectively to address user queries. Additionally, our solution should incorporate features facilitating the seamless addition and modification of tools within our toolset.

In the following sections, we present the current direction of our work, which is backed by an extensive literature review and experimentation.

1. Literature Review

The challenge associated with this problem statement necessitates the strategic specialization of Large Language Models (LLMs) within specific domains. This targeted approach ensures optimal utilization of tools, enabling comprehensive responses to a diverse range of queries. This has been addressed in the literature under terms such as Tool Ordering, Task/Tool Planning, Task/Tool Scheduling, Tool Learning, Tool Augmented Large Language Models, Tool Integration with LLMs, Autonomous Agents backed by LLMs, and other similar terms. We also explore how we can incorporate work done in the field of Domain Specific Question Answering using LLMs.

In the following subsections, we narrow our focus to the specific section of this extensive literature that aligns with our interests and pertains to the particular task at hand.

1.1. API Interaction

The majority of the existing literature in this field revolves around actually utilizing the APIs or external tools Qin et al. (2023), Patil et al. (2023), Schick et al. (2023), Yang et al. (2023) and integrating the results from these interactions as observations into the context.

However, our unique problem statement necessitates working solely with API descriptions, excluding the actual execution of API calls. Notable works that share a similar approach include the Reverse Chain method proposed by Zhang et al. (2023b) and the task planning step in HuggingGPT introduced by Shen et al. (2023).

1.2. Singular vs Multiple Tool Integration

Much of the early literature in the tool learning domain was based on single tool usage, such as Toolformer (Schick et al., 2023) and Gorilla (Patil et al., 2023). However, our task involves queries that might require multiple API calls. The solution may require multiple iterations with a tool, and there can be multiple tools involved, each with its own set of parameters.

1.3. Open Source vs Closed Source LLMs

The research community suggests that there is a significant gap between open-source LLMs and closed-source LLMs when it comes to tool-use and tool-planning capabilities Qin et al. (2023), Xu et al. (2023b).

While closed-source LLMs may initially perform adequately "off the shelf" for this task, they lack the inherent advantages offered by open-source alternatives, such as cost-effectiveness, flexibility, security, and reduced dependency on external providers. It is therefore crucial to engage in extensive experimentation with open-source LLMs as well. We are looking into several approaches that can help us align open-source LLMs to this task, possibly surpassing the capabilities of closed-source LLMs while giving us full control over the model. We are also looking into ways to make model serving and inference fast and cheap, to justify the viability of deploying a model of our own (discussed in the Latency Section 5).

1.4. Number of Trainable Parameters

Based on the count of trainable parameters, our methodologies can be broadly categorized into three groups: Hard Prompting, which involves no trainable parameters; Parameter Efficient Fine Tuning (PEFT), where only a subset of parameters is tuned compared to the entire set; and Full Fine Tuning. Presently, our primary focus has been on maximizing efficiency through Hard Prompting, the most resource-friendly category among the three.
Figure 1. An overview of the inference pipeline, augmented with synthetic data generation based on a human-crafted seed set, instruct fine-tuning using a scaled-up version of the seed set, post-processing of the raw outputs for tool verification and hallucination checks, and an interface for tool updates.

Concurrently, recognizing the constraints on resources, we are actively engaged in two key initiatives. Firstly, we are dedicated to constructing an instruct dataset Wang et al. (2022), Zhuang et al. (2023), Qin et al. (2023) (details to be discussed later). Secondly, we are exploring various PEFT techniques Li and Liang (2021), Jain et al. (2023), Dettmers et al. (2023), Hao et al. (2023), Zhang et al. (2023a) to enhance the alignment of our model, particularly smaller open-source Large Language Models (LLMs), with the designated task. The optimization of smaller LLMs for this task not only contributes to resource conservation during deployment but also facilitates quicker inference.

1.5. Domain Specific Question Answering

In cases where the toolset is extensive and the documentation exceeds the capacity of in-context learning methods, researchers have experimented with diverse retrieval techniques Qin et al. (2023), Yuan et al. (2023), Patil et al. (2023). These methods retrieve pertinent tools for a user query from a tool database and integrate them into the chatbot's context using Retrieval Augmented Generation (RAG) pipelines. However, following our discussion with the company, it appears that, at present, there is no immediate necessity for the incorporation of retrieval techniques. Most of the needed domain knowledge can be added in context.

2. Experimentation

In this section, we discuss details of our current experimentation and the corresponding evaluations. We have highlighted some findings in the report, with additional detailed results available in the accompanying deliverables.

2.1. Evaluation Metrics

In our current evaluation scheme, we focus on aspects such as API Selection, Argument Completion, API Ordering, Hallucinations, and Structure of the Output to assess the quality of the output, and on token usage and chain latency to assess efficiency.

API Selection assesses the ability of the model to correctly identify the necessary APIs required to solve the user query.

Argument Completion refers to the ability of the model to provide the correct argument names and values for each API being called.

API Ordering is the ability of the model to arrange the selected APIs in the correct order.

We also test for aspects such as the model's tendency to hallucinate and how the output is structured.

2.2. Hard Prompting Techniques

Hard Prompting, also referred to as Prompt Engineering, involves manually handcrafting text prompts with discrete input tokens. Developing effective prompts for any task requires extensive literature review and an iterative prompt development process following the guidelines of good prompt engineering.

The key advantages of Hard Prompting are:

• No training is involved.
• It allows for easy addition and modification of tools in the toolset (either directly in context via the system prompt, or in the database in the case of a large toolset).
Figure 2. Overview of some prompting techniques Yao et al. (2023a). Here we compare the structure of Tree of Thoughts prompting with Zero-Shot and Chain of Thought prompting.

Some disadvantages of Hard Prompting are:

• Limitation of context length (this can be mitigated by using retrieval techniques).
• Mastering a complex task that involves learning new and more complex tools is challenging.
• The model's existing vocabulary and knowledge are not aligned with the task at hand.

We have tried the following prompting techniques to tackle this task.

2.2.1. Zero Shot

Large LLMs today are tuned to follow instructions and are trained on large amounts of data, so they are capable of performing some tasks "zero-shot." As expected, for a complex task such as this, there are definite limitations to Zero-Shot Prompting.

We noticed a lot of variability in responses, hallucinations, and issues with output structuring. The outputs were also very vulnerable to small changes in the prompts.

We tried multiple variations of zero-shot approaches, which also served as baselines for our other approaches: Single Prompt; Two Prompt (system prompt + JSON structuring prompt) with different variations of the structuring prompt; Single Prompt + Few-Shot structuring prompt; Subtask Decomposition + Subtask Answering (+ Structuring); Subtask Decomposition + Subtask Answering + Follow-up Structure Prompt; and others.

Instruction tuning has been shown to improve zero-shot learning Wei et al. (2022) and will be discussed in later sections.

2.2.2. Few Shot

Zero-shot results showed that GPT-3.5 could not internalize the usage of APIs with no arguments and also confused words like 'summarize' within queries with natural language text, so providing it with in-context demonstrations covering as many tools as possible was necessary.

Thus, we tried a few-shot prompting approach, which trains the model to execute tasks from minimal instances and is, therefore, a more flexible and adaptive approach than zero-shot prompting. The evaluation of the few-shot approach is based on limited exposure, with the assumption that the model's capacity to generalize from a few examples reflects its overall capability.

In our few-shot approach, we tested various prompts for output API list generation and structuring. However, we encountered issues with the model using natural text instead of indexing the outputs of previous API calls. To address this, we had to explicitly instruct the model in the output structuring prompt. The model also tended to miss certain steps, such as creating a summary before generating actionable objects. In longer queries, it often failed to complete all tasks, providing an incomplete list of APIs.
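To make the few-shot setup concrete, the sketch below assembles a single tool-selection prompt from tool descriptions and one worked example. The tool names, the `$$PREV[i]` indexing convention, and the example query are illustrative assumptions rather than our exact prompt.

```python
# A minimal sketch of assembling a few-shot tool-selection prompt.
# Tool names, descriptions, and the worked example are illustrative only.

TOOLS = {
    "who_am_i": "Returns the ID of the current user. Takes no arguments.",
    "works_list": "Returns work items matching the given filters "
                  "(e.g. owned_by, stage, type).",
    "summarize_objects": "Summarizes a list of objects passed as 'objects'.",
}

FEW_SHOT_EXAMPLES = [
    {
        "query": "Summarize the work items owned by me.",
        # Outputs of earlier calls are referenced by index ($$PREV[i])
        # instead of natural-language text, as instructed in the prompt.
        "answer": [
            {"tool_name": "who_am_i", "arguments": []},
            {"tool_name": "works_list",
             "arguments": [{"argument_name": "owned_by", "argument_value": "$$PREV[0]"}]},
            {"tool_name": "summarize_objects",
             "arguments": [{"argument_name": "objects", "argument_value": "$$PREV[1]"}]},
        ],
    },
]


def build_few_shot_prompt(user_query: str) -> str:
    """Concatenate tool documentation, worked examples, and the new query."""
    tool_block = "\n".join(f"- {name}: {desc}" for name, desc in TOOLS.items())
    example_block = "\n\n".join(
        f"Query: {ex['query']}\nAnswer: {ex['answer']}" for ex in FEW_SHOT_EXAMPLES
    )
    return (
        "You can only use the tools listed below. Answer with a JSON list of tool calls.\n"
        f"Tools:\n{tool_block}\n\nExamples:\n{example_block}\n\n"
        f"Query: {user_query}\nAnswer:"
    )


if __name__ == "__main__":
    print(build_few_shot_prompt("Summarize all work items in the Testing stage."))
```

Keeping the examples in this fixed structure is what makes it straightforward to add or edit demonstrations as the toolset changes.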

2.2.3. Chain of Thought (CoT)

Chain of Thought prompting enables complex reasoning capabilities through intermediate reasoning steps. It allows language models to break down complex problems into manageable intermediate steps, allocating additional computation where more reasoning is required.

The transparency of CoT provides insights into the model's decision-making process, aiding in debugging and in understanding how specific answers are reached. This approach is versatile and potentially applicable to various tasks requiring complex reasoning, including this one.

Experimental Setup: Two distinct CoT prompting techniques were explored: Zero-Shot CoT and Few-Shot CoT.

1. Zero-Shot CoT Prompting:

   • Leveraging the approach outlined by Kojima et al. (2022), the "Let's think step by step" prompt was appended to the original query.
   • Trials were conducted on a self-curated dataset involving prompts that included lists of APIs with descriptions, explanations of each API, and the desired solution structure.
   • While results were generally good, a notable failure occurred in a query involving the "whoami" tool name, indicating a need for additional examples to aid the model's reasoning.

2. Few-Shot CoT Prompting:

   • For Few-Shot CoT trials, examples with reasoning were manually handcrafted. The prompt was similar to the one in Zero-Shot CoT, except that two examples were given at the end. In our experiments, Two-Shot CoT performed well while One-Shot CoT struggled.
   • We ensured that the few-shot examples followed a specific format, to keep some uniformity and make it easier to add more examples or edit existing ones.

Exploration of advanced CoT techniques revealed a promising contender: Plan-and-Solve Prompting. This approach introduces a dual-phase methodology, requiring the formulation of a strategic plan that breaks the overarching task down into smaller, actionable subtasks, followed by the systematic execution of these subtasks in alignment with the devised plan. Notably, experimental results showcased its superiority over Zero-Shot CoT and demonstrated comparable performance to Few-Shot CoT. Given these encouraging findings, there is strong motivation to conduct more in-depth experiments to unlock the full capabilities of this approach.

Model Name           API Selection   Argument Completion   API Ordering   Hallucination
gpt-3.5-turbo        13/100          18/100                6/100          21/100
gpt-4                13/100          3/100                 1/100          7/100
Claude Instant 1.2   29/100          32/100                3/100          19/100
Claude 2             36/100          41/100                33/100         41/100
LLaMa-7b-chat        79/100          86/100                51/100         61/100

Table 1. Error Statistics for Few-Shot CoT

Figure 3. Few-Shot CoT Error Statistics. GPT-4 performs the best while LLaMa has the worst performance.
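A minimal sketch of the Zero-Shot CoT setup follows: the API list and desired output structure are placed in the prompt and the trigger phrase from Kojima et al. (2022) is appended. The API descriptions shown are illustrative placeholders, not our full tool documentation.

```python
# A minimal sketch of Zero-Shot CoT prompting: API list + output structure,
# with the "Let's think step by step" trigger appended to the query.

API_DESCRIPTIONS = """\
works_list(owned_by, stage, type): returns work items matching the filters.
summarize_objects(objects): summarizes a list of objects.
who_am_i(): returns the ID of the current user."""

OUTPUT_STRUCTURE = "Return a JSON list of {tool_name, arguments} objects."


def zero_shot_cot_prompt(query: str) -> str:
    return (
        f"Available APIs:\n{API_DESCRIPTIONS}\n\n"
        f"{OUTPUT_STRUCTURE}\n\n"
        f"Query: {query}\n"
        "Let's think step by step."
    )


if __name__ == "__main__":
    print(zero_shot_cot_prompt("Summarize the work items owned by me."))
```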
2.2.4. Tree of Thoughts

The adoption of the Tree of Thoughts (ToT) technique is motivated by the limitations of linear chaining. ToT enables collaborative decision-making and iterative refinement, allowing agents to consider diverse reasoning paths and backtrack when necessary. This approach is crucial for handling complex queries that may demand multiple API calls, intricate logic, and conditional reasoning.

The experiment utilizes a carefully crafted prompt that simulates a collaborative effort among three customer service agents. Equipped with exceptional logical thinking skills, these agents respond to a customer query using the Tree of Thoughts method, detailing the utilization of a set of listed APIs at each step.

Model Name           API Selection   Argument Completion   API Ordering   Hallucination
gpt-3.5-turbo        29/100          19/100                7/100          16/100
gpt-4                14/100          6/100                 3/100          5/100
Claude Instant 1.2   36/100          39/100                8/100          23/100
Claude 2             17/100          34/100                7/100          9/100
LLaMa-7b-chat        76/100          89/100                56/100         79/100

Table 2. Error Statistics for ToT
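The sketch below shows one way the collaborative-agents prompt described above can be templated; the exact wording we used differs, and the listed APIs are placeholders.

```python
# A minimal sketch of a Tree-of-Thoughts style prompt that frames the task as
# three collaborating agents who may backtrack or declare the query unanswerable.
# The wording is illustrative, not our exact prompt.

TOT_TEMPLATE = """Imagine three customer service agents with excellent logical
reasoning skills are answering this query together.

At each step, every agent writes down one step of their reasoning and the API
(from the list below) they would call, then shares it with the group. If any
agent realises their branch is wrong, or that the query cannot be solved with
the listed APIs, they say so and the group backtracks to the last valid step.

Available APIs:
{api_list}

Query: {query}

Continue until the group agrees on a final ordered list of API calls (or on an
empty list if the query cannot be answered)."""


def build_tot_prompt(api_list: str, query: str) -> str:
    return TOT_TEMPLATE.format(api_list=api_list, query=query)


if __name__ == "__main__":
    apis = "works_list(...), summarize_objects(objects), who_am_i()"
    print(build_tot_prompt(apis, "Summarize the issues in the Testing stage."))
```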

Performance Analysis: The Tree of Thoughts methodology inherently integrates an error-aware approach. At any stage, an agent may conclude that the given query cannot be resolved using the provided APIs. This allows for self-evaluation, acknowledgment of errors, and a thoughtful decision-making process. However, the approach was prone to overthinking, as the model sometimes returned additional tools alongside those necessary to answer the query.

Figure 4. ToT Error Statistics

2.2.5. ReAct

ReAct Yao et al. (2023b) overcomes prevalent issues of hallucination and error propagation in chain-of-thought reasoning by interacting with simple APIs and generating human-like task-solving trajectories that are more interpretable than baselines without reasoning traces. ReAct prompts LLMs to generate both verbal reasoning traces and actions pertaining to a task in an interleaved manner, which allows the model to perform reasoning to create, maintain, and adjust plans for acting (reason to act), while also interacting with the external environments (tools) to incorporate additional information into its reasoning (act to reason).

Experimental Setup: We instructed the model to approach the query in an iterative manner while interleaving thought-action-observation steps. In each iteration, the model performed the following steps:

1. Thought:
   • What is the task according to previous observations?
   • What do you want to accomplish at this step?
   • What information is needed?

2. Observation:
   • What new information did you gain after this step?

3. Action:
   • Pick: Pick from the APIs that will work for you.
   • Call: Identify the correct arguments for the selected API.
   • Finish: Finish if you have all the APIs needed and the query can be answered.

We manually composed ReAct-format trajectories to use as few-shot exemplars in the prompts. Each trajectory consisted of multiple thought-action-observation steps.

Model Name           API Selection   Argument Completion   API Ordering   Hallucination
gpt-3.5-turbo        21/100          17/100                11/100         8/100
gpt-4                12/100          6/100                 5/100          6/100
Claude Instant 1.2   16/100          16/100                6/100          12/100
Claude 2             12/100          14/100                6/100          14/100
LLaMa-7b-chat        74/100          82/100                53/100         65/100

Table 3. Error Statistics for ReAct

Figure 5. ReAct Error Statistics
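The following sketch illustrates the interleaved thought-action-observation loop under simplified assumptions: `call_llm` stands in for whichever chat model is being evaluated, and the parsing and stop condition are reduced to simple string checks.

```python
# A minimal sketch of an iterative ReAct-style loop. `call_llm` is a
# placeholder for the model under test; the stop condition is simplified.

from typing import Callable, List

REACT_SYSTEM = """Solve the query step by step. In every iteration output:
Thought: what the task requires next and what information is needed.
Action: Pick <api_name> | Call <api_name>(<arguments>) | Finish
Observation: what new information this step produced."""


def react_plan(query: str, call_llm: Callable[[str], str], max_steps: int = 8) -> List[str]:
    """Accumulate a ReAct-style trajectory; stop when the model emits 'Finish'."""
    trajectory: List[str] = []
    context = f"{REACT_SYSTEM}\n\nQuery: {query}\n"
    for _ in range(max_steps):
        step = call_llm(context + "\n".join(trajectory))
        trajectory.append(step)
        if "Action: Finish" in step:
            break
    return trajectory


if __name__ == "__main__":
    # Canned model stub so the sketch runs without an API key.
    canned = iter([
        "Thought: I need the current user.\nAction: Call who_am_i()\n"
        "Observation: the user id is available as $$PREV[0].",
        "Thought: All required APIs are identified.\nAction: Finish",
    ])
    print("\n".join(react_plan("Summarize my work items", lambda _ctx: next(canned))))
```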

2.2.6. Reverse Chain

Reverse Chain Zhang et al. (2023b) is a target-driven prompting approach designed for multi-API planning using LLMs. This method decomposes the multi-API planning task into API selection and argument completion based on the API descriptions. Specifically, Reverse Chain performs planning in a reverse manner, starting from the final API and then inferring each preceding task backwards.

Given a user query and a set of APIs along with their descriptions, the first step of Reverse Chain is to ask the LLM to select an appropriate API to address the query. The selected API will be the final API used in the proposed solution to the query.

The second step is to provide the arguments and argument descriptions of the selected API to the language model and ask it to complete each argument, either by fetching an appropriate value from the user query, by specifying another API whose output can supply the value, or by leaving the argument empty if it is not needed.

The argument completion step is repeated for every API needed to complete an argument.

This method breaks down a complex task into simpler sub-tasks, and since each step is inferred backwards from the target, it is less prone to deviation from the intended path. However, because each step is inferred backwards, an error at any step leads to a completely incorrect solution path.

On paper, this prompting technique seems to have the most potential, but we have encountered some difficulties in tuning it to obtain its most effective implementation.
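As an illustration of the backward planning described above, the sketch below selects the final API first and then recursively fills each argument, either from the query or with another API call. The `select_final_api` and `complete_argument` callables stand in for the two LLM prompts, and the tool schema is a hypothetical toolset.

```python
# A minimal sketch of Reverse-Chain style backward planning. LLM prompts are
# stubbed out so the control flow is visible; the tool schema is illustrative.

TOOL_SCHEMA = {
    "summarize_objects": ["objects"],
    "works_list": ["owned_by"],
    "who_am_i": [],
}


def reverse_chain(query, select_final_api, complete_argument):
    """Return API calls in execution order (planned back-to-front)."""
    plan = []

    def expand(api_name):
        call = {"tool_name": api_name, "arguments": {}}
        for arg in TOOL_SCHEMA[api_name]:
            # The LLM either returns a literal value from the query, the name of
            # another API whose output fills the argument, or None (leave empty).
            filler = complete_argument(query, api_name, arg)
            if filler in TOOL_SCHEMA:                # another API feeds this argument
                call["arguments"][arg] = f"$$OUTPUT[{filler}]"
                expand(filler)                       # plan that API first
            else:
                call["arguments"][arg] = filler
        plan.append(call)                            # appended after its dependencies
    expand(select_final_api(query))
    return plan


if __name__ == "__main__":
    # Stubs emulate the two prompting steps so the sketch runs offline.
    final_api = lambda q: "summarize_objects"
    fill = lambda q, api, arg: {"objects": "works_list", "owned_by": "who_am_i"}.get(arg)
    print(reverse_chain("Summarize the work items owned by me", final_api, fill))
```

Because each call is appended only after its dependencies, the returned list is already in execution order even though planning proceeds from the target backwards.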

2.2.7. Decomposed Prompting

Decomposed Prompting Khot et al. (2023) is an approach that involves breaking down intricate tasks into more manageable sub-tasks and assigning them to a shared library of tools designed for prompting. The modular structure of Decomposed Prompting allows for optimization, further decomposition, and the easy replacement of prompts or functions.

Figure 6. Decomposed Prompting, a method for solving complex tasks by breaking them down into simpler sub-tasks. It involves decomposing the original query into sub-queries and then combining the results.

Notably, the flexibility of this method surpasses previous attempts with GPT-3.5 in few-shot prompting. In tasks involving symbolic reasoning, sub-tasks that pose challenges for LLMs are subjected to further decomposition. When faced with complexity due to input length, tasks are recursively broken down into smaller inputs.

On the other hand, GPT-3.5 encounters difficulties when executing decomposed prompting in zero-shot scenarios. To overcome this limitation, a shift towards a few-shot decomposed prompting approach is essential. In this methodology, explicit examples of decomposed queries derived from the parent query are provided, resulting in improved outcomes.

However, the model exhibits inconsistent output formats, presenting decomposed queries in varying structural formats with each iteration. Additionally, the model tends to display the few-shot examples alongside the primary output.
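A minimal sketch of few-shot decomposed prompting follows, under these assumptions: `call_llm` and `answer_subquery` are placeholder callables, and the decomposition example embedded in the prompt is illustrative.

```python
# A minimal sketch of few-shot decomposed prompting: the parent query is split
# into sub-queries (with an explicit decomposition example in the prompt), each
# sub-query is answered separately, and the partial tool lists are concatenated.

import json

DECOMPOSE_PROMPT = """Break the query into independent sub-queries. Reply with a JSON list.
Example:
Query: "List my P0 issues and summarize them."
Sub-queries: ["List the P0 issues owned by the current user",
              "Summarize the issues returned by the previous step"]

Query: "{query}"
Sub-queries:"""


def decomposed_answer(query, call_llm, answer_subquery):
    sub_queries = json.loads(call_llm(DECOMPOSE_PROMPT.format(query=query)))
    plan = []
    for sub in sub_queries:
        plan.extend(answer_subquery(sub))   # each sub-query is solved in isolation
    return plan


if __name__ == "__main__":
    fake_llm = lambda _p: '["List the P0 issues owned by the current user", "Summarize them"]'
    fake_answer = lambda sub: [{"tool_name": "works_list" if "List" in sub else "summarize_objects"}]
    print(decomposed_answer("List my P0 issues and summarize them.", fake_llm, fake_answer))
```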

2.2.8. ExpertPrompting

ExpertPrompting Xu et al. (2023a) is an augmented strategy for instructing LLMs. For each specific instruction, ExpertPrompting first envisions a distinguished expert agent that is best suited for the instruction, and then asks the LLM to answer the instruction conditioned on that expert identity.

Figure 7. ExpertPrompting, an augmented strategy for instructing LLMs. For each specific instruction, ExpertPrompting first envisions a distinguished expert agent that is best suited for the instruction, and then asks the LLM to answer the instruction conditioned on that expert identity.

The expert identity is produced with In-Context Learning. Each expert identity is defined at a fine granularity using a detailed and elaborate description. ExpertPrompting is also simple to implement, requiring no sophisticated crafting of prompt templates or iterative processes.

We employ an expert software engineer persona with ExpertPrompting. Our trials show that language models exhibit improved performance with the inclusion of ExpertPrompting for our task of addressing user queries.
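The sketch below shows the conditioning step under an assumed software-engineer persona; the persona text is illustrative and not the exact identity description used in our runs.

```python
# A minimal sketch of ExpertPrompting: a (here fixed) expert identity is
# prepended to the actual instruction. The persona wording is illustrative.

EXPERT_IDENTITY = (
    "You are a senior software engineer at a developer-tools company. You have "
    "deep knowledge of REST APIs and you are meticulous about choosing the "
    "smallest set of API calls, with exactly the right arguments, for a task."
)


def expert_prompt(instruction: str, identity: str = EXPERT_IDENTITY) -> str:
    """Condition the instruction on the expert identity, per Xu et al. (2023a)."""
    return f"{identity}\n\nAnswer the following as that expert.\n\n{instruction}"


if __name__ == "__main__":
    print(expert_prompt("Which APIs are needed to summarize work items owned by Eve?"))
```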

2.3. Output Structuring

• Using JSON Mode in GPT: The GPT models can structure their outputs in JSON format. This mode can be activated either through explicit instructions in the initial prompt or via follow-up prompts during the interaction (a brief sketch of this and the post-processing option follows this list).

• Few-Shot Demonstrations: To ensure that the output is structured correctly, it is beneficial to provide few-shot demonstrations. These examples guide the model in formatting its responses in the desired JSON structure.

• Post-Processing: After receiving the response from the model, the output string can be parsed to extract individual JSON objects.

• Creating a Fixed JSON Schema based on the model response: By analyzing the count and arguments of the APIs involved, a fixed JSON schema can be developed. This schema facilitates the use of standard processing techniques, such as JSONFormer, to handle the structured data efficiently.

• Handling Complex Logic Queries: The output structure for complex logic queries requires a different approach, which is discussed in a later section. These queries often involve more intricate data relationships and may necessitate a more flexible or elaborate JSON structure.
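The sketch below illustrates two of the options above: requesting JSON mode from the OpenAI chat API (assuming the openai>=1.x Python client and a model version that supports `response_format`) and a regex-based fallback that extracts the first JSON array during post-processing.

```python
# A minimal sketch of JSON-mode requests plus a post-processing fallback.
# Assumes the openai>=1.x client and a JSON-mode-capable model.

import json
import re
from openai import OpenAI


def ask_with_json_mode(prompt: str, model: str = "gpt-3.5-turbo-1106") -> dict:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        response_format={"type": "json_object"},  # activates JSON mode
        messages=[
            {"role": "system", "content": "Reply only with a JSON object "
                                          "containing a 'tool_calls' list."},
            {"role": "user", "content": prompt},
        ],
    )
    return json.loads(response.choices[0].message.content)


def extract_json_array(raw_output: str):
    """Fallback post-processing: pull the first [...] block out of free text."""
    match = re.search(r"\[.*\]", raw_output, flags=re.DOTALL)
    return json.loads(match.group(0)) if match else []


if __name__ == "__main__":
    print(extract_json_array('Here is the plan: [{"tool_name": "who_am_i", "arguments": []}]'))
```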
2.4. Addition and Modification of Tools

• Real-time Integration within the System Prompt: If tool addition or modification is an integral part of the system prompt, regex methods can be applied to identify and manipulate tool-related patterns in the text.

• In-Context Learning: The model learns to recognize and generate tools based on examples and demonstrations provided in context.

• ToolkenGPT: The central concept of ToolkenGPT is representing each tool as a unique token. Each tool is represented as a "toolken," and its embedding is inserted into the LLM head, mitigating the problem of hallucination.

• Retrieval-based Modification: Employing a database for storing and retrieving tools. Given an input prompt, a retriever first fetches relevant documents from a corpus of knowledge; these documents, along with the original prompt, are then fed to an LLM, which generates a response.

• Separate Interface for Tool Management: Developing a dedicated interface for tool addition and modification. This interface provides a controlled environment for adding, modifying, or removing tools, allowing users to make changes systematically.

2.5. NER for Query Enhancement

Named entity recognition (NER) is a component of natural language processing (NLP) that identifies predefined categories of objects in a body of text.

We employ NER to identify a carefully curated set of entities from the user query. Our entity set currently includes:

• Rev Org        • Issue
• User           • Stage
• Source Channel • Part

Our prompt instructs the language model to perform named entity recognition prior to answering the query. The query, tagged with the identified entities, is then utilized by the language model to address the user query.

The inclusion of NER increases the robustness of the model by preventing important information from being overlooked, thus improving overall model performance. Furthermore, our observations indicate that without an instruction to perform NER, the model does not consistently utilize the "who am i" tool when necessary. NER addresses this issue by identifying the current user as an entity, enabling the model to use this key information when answering the query.
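A minimal sketch of the NER instruction follows; the tag format and wording are illustrative assumptions, not the exact prompt we use.

```python
# A minimal sketch of prepending an NER instruction to the user query.
# The entity tag format and wording are illustrative.

ENTITY_TYPES = ["Rev Org", "User", "Source Channel", "Issue", "Stage", "Part"]

NER_INSTRUCTION = (
    "Before answering, perform named entity recognition on the query. "
    f"Tag every mention of these entity types: {', '.join(ENTITY_TYPES)}. "
    "Treat first-person references ('I', 'me', 'my') as a User entity for the "
    "current user, so that the 'who am i' tool is used when needed. "
    "Rewrite the query with inline tags, e.g. <User>Eve</User>, "
    "then answer using the tagged query."
)


def build_ner_prompt(query: str) -> str:
    return f"{NER_INSTRUCTION}\n\nQuery: {query}"


if __name__ == "__main__":
    print(build_ner_prompt("Summarize high severity tickets from the UI part owned by me."))
```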

2.6. Tool Documentation over Description

In our prompting techniques, providing a few-shot examples seemed to work best, but even these resulted in many errors with undesirable biased usage. As queries grew more complex, the selection search grew combinatorially and the provided demonstrations failed to generalize to these more complex tasks. We therefore used an alternative: tool documentation. We observed that zero-shot prompts with only tool documentation achieved performance on par with few-shot CoT. Tool documentation is significantly more valuable than zero-shot demonstrations, generalizing well to more complex tasks.

2.7. Post Processing

We explore further post-processing of the model's response to deal with the following issues:

• Output Structuring: Our experiments show that, even when instructed, LLMs may struggle to produce output in the desired format. This issue can be solved by processing the model's output into the desired JSON schema.

• Hallucinations: Language models have the tendency to hallucinate tools and their arguments when no tool apt to address the user query exists. Hallucinations can be tackled in post-processing, where any tool that the LLM has hallucinated can be removed. This approach also allows us to handle cases where the response should be an empty list (a sketch of this filtering step follows the list).
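The sketch below shows a minimal version of this post-processing step: parse the raw output, drop hallucinated tools and unknown arguments against a registered toolset (the toolset shown is illustrative), and return an empty list when nothing valid remains.

```python
# A minimal sketch of post-processing: remove hallucinated tools/arguments and
# fall back to an empty answer. The toolset contents are illustrative.

import json

TOOLSET = {
    "who_am_i": set(),
    "works_list": {"owned_by", "stage", "type"},
    "summarize_objects": {"objects"},
}


def postprocess(raw_output: str):
    try:
        calls = json.loads(raw_output)
    except json.JSONDecodeError:
        return []                                   # unparseable output -> empty answer

    cleaned = []
    for call in calls:
        name = call.get("tool_name")
        if name not in TOOLSET:
            continue                                # hallucinated tool: drop it
        valid_args = [a for a in call.get("arguments", [])
                      if a.get("argument_name") in TOOLSET[name]]
        cleaned.append({"tool_name": name, "arguments": valid_args})
    return cleaned


if __name__ == "__main__":
    raw = '[{"tool_name": "who_am_i", "arguments": []}, {"tool_name": "send_email", "arguments": []}]'
    print(postprocess(raw))   # the hallucinated "send_email" call is removed
```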

2.8. Complex Queries

We experimented with the model using a diverse set of queries, some of which cannot be solved simply by composing the available functions and might need some additional logic around combining the outputs of those functions.

Notably, certain queries demanded that the model execute complex iterative processes to reach the desired output. One illustrative example was the instruction: "Begin by listing the first 15 work items owned by Eve in the Testing stage. Then, iteratively filter the list to include only those of type issue and task separately, and summarize each subset."

This particular scenario required the model to engage in multiple tool calls, sequentially filtering the list based on specific criteria and summarizing the results at each step. However, the model struggled to recognize the iterative nature of the task and consequently performed sub-optimally in such cases.

Conversely, the model demonstrated competence in queries involving Boolean logic. These queries leveraged the model's ability to comprehend and process logical conditions effectively.

This discrepancy in performance highlights the need to refine the model's understanding of iterative processes and complex logic, enabling it to recognize and execute multi-step tasks more efficiently. Further fine-tuning using diverse examples featuring intricate logic could enhance the model's performance on a broader scope of complicated queries.

3. Dataset Generation

Dataset generation is essential for this problem statement, both for evaluation and for fine-tuning. We tried multiple preliminary approaches to automate the generation of queries in zero-shot, few-shot, and other prompting setups, especially those that have an intermediate query-generation process. However, we quickly realized the need for a small, manually or semi-manually curated seed dataset before scaling up to provide diversity and complexity.

Figure 8. Dataset Generation Pipeline. Creating an instruct dataset using automatic (Self-Instruct), semi-automatic (human-guided), and LLM-based query generation.

3.1. Seed Dataset Generation

We provided the model with a set of APIs and corresponding examples to generate additional queries using a few-shot learning approach. The queries generated, while acceptable, lacked the desired complexity and diversity.

To enhance the output, we directed the model to explore unused tools, embody a customer persona for a chatbot interface, and implement complex logic, including iterative and conditional structures. While the model produced more human-sounding responses, there were challenges in interpreting them. In particular, the model often produced queries that incorporated time-related terms; this issue was effectively resolved by directing the model to avoid such terms. This comprehensive approach aimed to enrich the diversity and complexity of the generated queries.

Additionally, a curated selection of well-formed queries produced by the model was added to the few-shot examples for improved performance.

Ensuring the complexity of the generated queries is crucial. We have categorized the dataset generation seed into three levels: Low, Medium, and High. This categorization is determined by the number of tasks within each query. Initially, the zero-shot generated queries were brief and typically focused on a single task. However, through more sophisticated prompting methods, the number of tasks per query increased. Therefore, the complexity assessment is based solely on the quantity of tasks involved.

3.2. Human-Guided Query Generation

While input from human experts yields high-quality queries, the process is laborious and time-intensive. To expedite and scale dataset creation, we are looking into utilizing LLMs like GPT-4 to synthesize instructional data for fine-tuning. However, relying solely on language models can present challenges, such as generating unanswerable or factually incorrect questions, and posing overly simplistic queries answerable through the model's internal knowledge.

We propose a human-guided approach combining human expertise with automated language model generation to tackle these challenges. Human-written queries are used to generate query templates with GPT-4, which are further enhanced by automatically populating them with values sampled from the reference data, amplifying the dataset size.

3.3. Tools Synthesis

Our dataset expansion involved synthesizing tools by feeding documentation inputs through Langchain into GPT-4. This process allowed the model to create new tools based on the structured data from the documentation. Additionally, we manually generated diverse queries using these newly created tools. Our goal is thus to broaden our array of tools and expand the dataset's scope.

Method      Number of Tools
Zero-Shot   20
Few-Shot    15
CoT         13

Table 4. Tool Dataset Statistics

3.4. Self-Instruct Dataset

In the Self-Instruct framework Wang et al. (2022), the process starts with a compact seed set of manually crafted queries from our prior approaches. This set acts as the task pool, from which queries are randomly sampled to prompt the language model (LM). The LM generates new queries and their corresponding outcomes. The generated outputs go through a careful filtering process to eliminate low-quality or redundant content, and the refined data is integrated back into the initial repository. This iterative bootstrapping progressively sharpens the LM's ability to follow query patterns, establishing a self-improving cycle that leverages its own generated data for subsequent fine-tuning. This approach is expected to outperform the programmatic generation of new queries from predefined templates, offering the possibility of a more diverse dataset.

Method                      Number of Queries   Tools Utilized   Arguments Utilized   Complexity (Low / Med / High)   Executable
Manual with Few-Shot Seed   90                  3                4                    22 / 56 / 12                    81
Few-Shot                    50                  2                3                    27 / 15 / 8                     48
Few-Shot with CoT           25                  3                5                    8 / 12 / 5                      23

Table 5. Query Dataset Statistics
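A minimal sketch of such a bootstrapping loop is shown below, assuming a placeholder `call_llm` callable and a simple string-similarity filter in place of the more careful filtering described above.

```python
# A minimal sketch of a Self-Instruct-style loop: sample seed queries, prompt
# the model for new ones, filter near-duplicates, and grow the task pool.

import difflib
import random


def too_similar(candidate, pool, threshold=0.8):
    return any(difflib.SequenceMatcher(None, candidate, q).ratio() > threshold
               for q in pool)


def self_instruct(seed_queries, call_llm, rounds=3, per_round=4):
    pool = list(seed_queries)
    for _ in range(rounds):
        examples = "\n".join(random.sample(pool, k=min(3, len(pool))))
        prompt = ("Here are example user queries for a developer-tools chatbot:\n"
                  f"{examples}\n"
                  f"Write {per_round} new, more complex queries, one per line.")
        for candidate in call_llm(prompt).splitlines():
            candidate = candidate.strip("-• ").strip()
            if candidate and not too_similar(candidate, pool):
                pool.append(candidate)              # refined data re-enters the task pool
    return pool


if __name__ == "__main__":
    seeds = ["Summarize my open issues.",
             "List work items owned by Eve in the Testing stage."]
    fake_llm = lambda _p: ("Prioritize my P1 tickets and add them to the current sprint.\n"
                           "Summarize similar work items for ticket TKT-123.")
    print(self_instruct(seeds, fake_llm, rounds=1))
```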

4. Parameter Efficient Fine Tuning

As we test various LLMs on their capability to achieve the task, it is clear that fine-tuning the model is a must for more robust and reliable results. Tuning all parameters of a model is not feasible due to resource constraints, and it is not always an effective approach, since it can degrade the general reasoning abilities of the original model and lead to catastrophic forgetting Lin et al. (2023).

Hence, we leverage Parameter Efficient Fine Tuning (PEFT) methods to fine-tune only a small subset of the LLM's parameters while freezing most of them. It is important to recognize which parameters to tune to get optimal results, so we plan to explore several techniques that may prove effective for this particular task.

Figure 9. PEFT techniques: fine-tuning the embedding space using NEFTune and toolkens, updating inner-layer parameters using LoRA, and adding additional layers with LLaMA Adapters.

4.1. LLM Inner Layer Finetuning

4.1.1. LoRA

Low-rank adaptation, or LoRA, updates the parameters of specific dense layers by injecting trainable rank decomposition matrices into the architecture. Common practice is to keep the ratio lora_r / lora_alpha = 1:1 so as not to overpower the base model, but we can try to deviate from this, our task being very specific to developer queries. LoRA fine-tuning is orthogonal to many other fine-tuning methods, which means it can be combined with them, and it introduces no inference latency compared to a fully fine-tuned model, which is one of the important factors for this task.

LoRA has proven to be effective in learning tool capabilities, as evident from the works of Qiao et al. (2023) and Yang et al. (2023), making this technique worth exploring.

4.1.2. Prefix-Tuning

Since hard prompting plays a big role in correctly addressing the input queries, prefix-tuning Li and Liang (2021) allows us to also optimize soft prompts by adding learnable task-specific prefixes, which may enhance the model's ability to interpret the different tasks in a user query. It is possible that by learning only 0.1% of the parameters, prefix-tuning can obtain reasonable performance and extrapolate it equally well to unseen queries. We recognize that the newly added parameters might introduce inference latency, but the speed-quality trade-off is worth investigating for this PEFT technique.

4.1.3. LLaMA Adapters

Extending the ideas of prefix tuning and the original adapter method by Houlsby et al. (2019), researchers proposed LLaMA-Adapter Zhang et al. (2023a), a model-agnostic fine-tuning method whose zero-init attention makes the initial tuning more stable without disrupting the model's linguistic knowledge. Since we want our base model to retain its reasoning abilities while improving its tool-learning capabilities, we can exploit this method to try to achieve that. Also, this method adds additional layers to only a few top and bottom layers, which works to our advantage when it comes to inference latency.

4.2. LLM Embedding Layer Finetuning

4.2.1. NEFTune

The Noisy Embeddings Improve Instruction Finetuning (NEFTune) PEFT method by Jain et al. (2023) is fairly recent and shows that an augmentation as simple as adding noise to embedding vectors can considerably improve model fine-tuning. Our model must be able to handle conversational queries while preserving its technical abilities and correctly identifying arguments to give formatted output. The research shows this method to be effective in improving the quality of chatbots, and it works well with other PEFT methods like QLoRA (Quantized LoRA) by Dettmers et al. (2023). NEFTune avoids overfitting to the specifics of the instruction-tuning dataset, especially when the dataset is small, and adds no inference latency because no new parameter layers are introduced.

4.2.2. ToolkenGPT

Fine-tuning model embeddings to learn separate tokens for tools is the approach utilized in ToolkenGPT by Hao et al. (2023). Like NEFTune, this fine-tuning method also focuses on the embeddings. The technique arms the model with a special vocabulary for the available tools by adding special tokens called "toolkens" to the embedding space, expanding its understanding of the tools available. New tools can be added to the toolset easily, and performance can be improved by fine-tuning the model on tool-specific tasks while keeping inference latency unchanged and the number of updated parameters small. This technique introduces a way to handle newly introduced tools while performing well overall.

4.3. Additional Methods

If time and resources allow, we will explore further methods like Prompt Tuning Lester et al. (2021), Diff-Pruning Guo et al. (2021), BitFit Zaken et al. (2022), and IA3 Liu et al. (2022). None of these PEFT methods has been used for a task exactly like ours, so we may have to apply variations of these techniques or even come up with a different one to get the results we need. It would be interesting to see how each of these impacts the model, whether by inducing new capabilities or by bringing out ones that already exist in our base model.

Technique              % of parameters
In-Context Learning    0%
Prefix Tuning          0.1%
LoRA                   0.1 to 1%
LLaMA Adapter          0.01 to 0.1%
Adapter Tuning         3 to 4%
Tuning top k layers    20%
Full Fine Tuning       100%

Table 6. Parameter percentages for different techniques
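As an illustration, the sketch below attaches LoRA adapters with the Hugging Face peft library, keeping the lora_r / lora_alpha ratio at 1:1 as discussed above. The base checkpoint and target modules are illustrative assumptions, not our final configuration.

```python
# A minimal sketch of attaching LoRA adapters with Hugging Face peft.
# The base checkpoint and target modules are illustrative assumptions.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

lora_config = LoraConfig(
    r=16,                                  # rank of the injected decomposition matrices
    lora_alpha=16,                         # kept equal to r (1:1 ratio)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections only
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # typically well under 1% of all parameters
```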

5. Latency and Inference

When serving user queries, latency and response times are especially important factors that impact the user experience. We are exploring various approaches aimed at reducing latency, accelerating inference, and reducing memory consumption: precision reduction (float16 or bfloat16) to speed up the model; 8-bit or 4-bit quantization to reduce memory consumption by 2-3x; fine-tuning with adapters (LoRA, QLoRA), which improves prediction accuracy on our data and combines well with subsequent quantization; and tensor parallelism for faster inference of large models on multiple GPUs.

We are also looking at libraries for LLM inference and serving, such as Text Generation Inference, DeepSpeed, and vLLM. These already include various optimization techniques: tensor parallelism, quantization, continuous batching of incoming requests, optimized CUDA kernels, and more.

We are tracking the chain latency and token usage associated with our experiments to evaluate different models. We have tabulated some of those results in the experiments section of our deliverables.
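For example, the sketch below loads a chat model with 4-bit quantization via transformers and bitsandbytes; the checkpoint name is an illustrative assumption and a CUDA-capable machine is assumed.

```python
# A minimal sketch of 4-bit quantized loading for inference (transformers +
# bitsandbytes). The checkpoint name is an illustrative assumption.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # roughly 3-4x lower memory use
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute still runs in bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Which APIs are needed to summarize my work items?",
                   return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```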

6. Resources

We conducted our assessment of various language models using several conversational AI platforms, including claude.ai, the free version of ChatGPT, the LMSYS Chat interface, and Forefront.ai. We also leveraged Langchain to implement various complex conversational chains and analyze the performance of multiple pre-trained language models. In addition, we utilized the OpenAI API to access their language models for evaluation.

Forefront.ai hosts the GPT-3.5 and Claude Instant 1.2 models with free public access. A key benefit is the ability to adjust the response temperature setting, improving the reproducibility of test results. Specialized assistants are available on this platform to assist with unique domains, helping maximize the benefits of ExpertPrompting Xu et al. (2023a). Leveraging the software engineer assistant allows for further refinement in handling user queries.

Evaluation on ChatGPT lacked robustness and reproducibility, as the platform does not permit customizing the temperature parameter. Similar constraints with claude.ai have made thorough testing of Claude 2 challenging, highlighting the need for access to the Anthropic API to facilitate more robust evaluation (we are yet to receive access).

The LMSYS Chat interface provides access to various language models, but usability is impacted by a character limit for prompts. Multi-part prompts are necessary, and thus consistency across responses cannot be guaranteed.

We have also recently secured access to ChatGPT Plus, which would facilitate the dataset generation process as well as allow us to try some AI-based evaluation techniques.

Limited compute resources have posed challenges for thorough testing methodologies. Due to computing constraints, testing on GPT-4 was restricted in scope. We have additionally applied for access to alternative APIs, including Anthropic, which would expand our capabilities for evaluation. Broader access to application programming interfaces (APIs) and additional computational resources for dataset generation and training would facilitate a more robust assessment and allow for a more comprehensive examination of approaches.

7. Deliverables

Drive link for experimentation, prompts, and seed dataset: Team 15 Mid Eval.

References

T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs, 2023.

D. Guo, A. M. Rush, and Y. Kim. Parameter-efficient transfer learning with diff pruning, 2021.

S. Hao, T. Liu, Z. Wang, and Z. Hu. ToolkenGPT: Augmenting frozen language models with massive tools via tool embeddings, 2023.

N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for NLP, 2019.

N. Jain, P. yeh Chiang, Y. Wen, J. Kirchenbauer, H.-M. Chu, G. Somepalli, B. R. Bartoldson, B. Kailkhura, A. Schwarzschild, A. Saha, M. Goldblum, J. Geiping, and T. Goldstein. NEFTune: Noisy embeddings improve instruction finetuning, 2023.

T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal. Decomposed prompting: A modular approach for solving complex tasks, 2023.

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems, volume 35, pages 22199-22213, 2022.

B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning, 2021.

X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation, 2021.

Y. Lin, L. Tan, H. Lin, Z. Zheng, R. Pi, J. Zhang, S. Diao, H. Wang, H. Zhao, Y. Yao, and T. Zhang. Speciality vs generality: An empirical study on catastrophic forgetting in fine-tuning foundation models, 2023.

H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022.

S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez. Gorilla: Large language model connected with massive APIs, 2023.

S. Qiao, H. Gui, H. Chen, and N. Zhang. Making language models better tool learners with execution feedback, 2023.

Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023.

T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools, 2023.

Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face, 2023.

Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-Instruct: Aligning language models with self-generated instructions, 2022.

J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le. Finetuned language models are zero-shot learners, 2022.

B. Xu, A. Yang, J. Lin, Q. Wang, C. Zhou, Y. Zhang, and Z. Mao. ExpertPrompting: Instructing large language models to be distinguished experts, 2023a.

Q. Xu, F. Hong, B. Li, C. Hu, Z. Chen, and J. Zhang. On the tool manipulation capability of open-source large language models, 2023b.

R. Yang, L. Song, Y. Li, S. Zhao, Y. Ge, X. Li, and Y. Shan. GPT4Tools: Teaching large language model to use tools via self-instruction, 2023.

S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of Thoughts: Deliberate problem solving with large language models, 2023a.

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing reasoning and acting in language models, 2023b.

L. Yuan, Y. Chen, X. Wang, Y. R. Fung, H. Peng, and H. Ji. CRAFT: Customizing LLMs by creating and retrieving from specialized toolsets, 2023.

E. B. Zaken, S. Ravfogel, and Y. Goldberg. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models, 2022.

R. Zhang, J. Han, C. Liu, P. Gao, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, and Y. Qiao. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention, 2023a.

Y. Zhang, H. Cai, Y. Chen, R. Sun, and J. Zheng. Reverse Chain: A generic rule for LLMs to master multi-API planning, 2023b.

Y. Zhuang, Y. Yu, K. Wang, H. Sun, and C. Zhang. ToolQA: A dataset for LLM question answering with external tools, 2023.