Inter IIT Mid Eval Report
Team 15
Figure 1. An overview of the inference pipeline, augmented with synthetic data generation based on a human-crafted seed set, instruction fine-tuning using a scaled-up version of the seed set, post-processing of the raw outputs to verify tool calls and check for hallucinations, and an interface for tool updates.
We are actively engaged in two key initiatives. First, we are dedicated to constructing an instruct dataset Wang et al. (2022), Zhuang et al. (2023), Qin et al. (2023) (details to be discussed later). Second, we are exploring various PEFT techniques Li and Liang (2021), Jain et al. (2023), Dettmers et al. (2023), Hao et al. (2023), Zhang et al. (2023a) to enhance the alignment of our model, particularly smaller open-source Large Language Models (LLMs), with the designated task. Optimizing smaller LLMs for this task not only conserves resources during deployment but also facilitates quicker inference.

1.5. Domain Specific Question Answering

In cases where the toolset is extensive and the documentation exceeds the capacity of in-context learning methods, researchers have experimented with diverse retrieval techniques Qin et al. (2023), Yuan et al. (2023), Patil et al. (2023). These methods retrieve the tools pertinent to a user query from a tool database and integrate them into the chatbot's context using Retrieval Augmented Generation (RAG) pipelines. However, following our discussion with the company, there appears to be no immediate need to incorporate retrieval techniques at present: most of the required domain knowledge can be added in context.

2. Experimentation

In this section, we discuss details of our current experimentation and the corresponding evaluations. We have highlighted some findings in the report, with additional detailed results available in the accompanying deliverables.

2.1. Evaluation Metrics

In our current evaluation scheme, we focus on aspects such as API Selection, Argument Completion, API Ordering, Hallucinations, and Structure of the Output to assess the quality of the output, and on token usage and chain latency to assess efficiency.

API Selection assesses the ability of the model to correctly identify the APIs required to solve the user query.

Argument Completion refers to the ability of the model to provide the correct argument names and values for each API being called.

API Ordering is the ability of the model to arrange the selected APIs in the correct order.

We also test for aspects such as the model's tendency to hallucinate and how the output is structured.
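To make these criteria concrete, the sketch below scores a predicted API call chain against a gold reference. The ApiCall container, the example tool name, and the exact metric definitions are simplified illustrations, not our full evaluation harness.

from dataclasses import dataclass, field

@dataclass
class ApiCall:
    name: str                                  # e.g. "works.list" (illustrative)
    arguments: dict = field(default_factory=dict)

def api_selection_score(pred: list, gold: list) -> float:
    # Fraction of the required APIs that the model selected.
    pred_names = {c.name for c in pred}
    gold_names = {c.name for c in gold}
    return len(pred_names & gold_names) / max(len(gold_names), 1)

def argument_completion_score(pred: list, gold: list) -> float:
    # Fraction of gold (api, argument name, argument value) triples reproduced.
    gold_args = {(c.name, k, str(v)) for c in gold for k, v in c.arguments.items()}
    pred_args = {(c.name, k, str(v)) for c in pred for k, v in c.arguments.items()}
    return len(gold_args & pred_args) / max(len(gold_args), 1)

def ordering_correct(pred: list, gold: list) -> bool:
    # True when the correctly selected APIs appear in the gold order.
    selected = {c.name for c in pred}
    expected = [c.name for c in gold if c.name in selected]
    observed = [c.name for c in pred if c.name in set(expected)]
    return observed == expected

def hallucination_rate(pred: list, toolset: set) -> float:
    # Fraction of predicted calls naming an API that does not exist.
    return sum(c.name not in toolset for c in pred) / max(len(pred), 1)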
2.2. Hard Prompting Techniques

Hard Prompting, also referred to as Prompt Engineering, involves manually handcrafting text prompts with discrete input tokens. Developing effective prompts for any task requires an extensive literature review and an iterative prompt development process following the guidelines of good prompt engineering.

The key advantages of Hard Prompting are:

• No training is involved.
• It allows for easy addition and modification of tools in the toolset (either directly in context via the system prompt, or in the database in the case of a large toolset).
Figure 2. Overview of some prompting techniques Yao et al. (2023a). Here we have compared the structure of Tree of Thoughts prompting with Zero-Shot and Chain of Thought prompting.
Some disadvantages of Hard Prompting are:

• The limitation of context length (which can be mitigated by using retrieval techniques).
• Mastering a complex task of this kind, which involves learning new and more complex tools, is challenging.
• The model's existing vocabulary and knowledge are not aligned with the task at hand.

We have tried the following prompting techniques to tackle this task:

2.2.1. Zero Shot

Large LLMs today are tuned to follow instructions and are trained on large amounts of data, so they are capable of performing some tasks "zero-shot." As expected, for a complex task such as this one, Zero-Shot Prompting has definite limitations.

We noticed a lot of variability in responses, hallucinations, and issues with output structuring. The outputs were also very vulnerable to small changes in the prompts.

We tried multiple variations of zero-shot approaches, which also served as baselines for our other approaches: Single Prompt; Two Prompt (System prompt + JSON structuring prompt) with different variations of the structuring prompt; Single Prompt + Few-Shot structuring prompt; Subtask decomposition + Subtask Answering (+ Structuring); Subtask decomposition + Subtask Answering + Follow-up Structure Prompt; and so on.
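As an illustration of the Two Prompt variant, the sketch below assumes an OpenAI-style chat-completion client; the prompt wording and model name are placeholders rather than our exact prompts.

from openai import OpenAI

client = OpenAI()

def two_prompt_zero_shot(query: str, tool_docs: str, model: str = "gpt-3.5-turbo") -> str:
    # Call 1: solve the query with the tool documentation placed in context.
    raw = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You can use these tools:\n" + tool_docs},
            {"role": "user", "content": query},
        ],
    ).choices[0].message.content
    # Call 2: restructure the free-form answer into JSON only.
    structured = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Rewrite the given solution as a JSON list "
                'of {"tool_name": ..., "arguments": ...} objects. Output JSON only.'},
            {"role": "user", "content": raw},
        ],
    ).choices[0].message.content
    return structured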
Instruction tuning has been shown to improve zero-shot learning Wei et al. (2022) and will be discussed in later sections.

2.2.2. Few Shot

Since zero-shot results showed that GPT-3.5 could not internalize the usage of APIs with no arguments, and also confused words like 'summarize' within queries with natural-language text, it was necessary to provide it with in-context demonstrations covering as many tools as possible.

Thus, we tried a few-shot prompting approach, which primes the model to execute tasks from minimal instances and is therefore a more flexible and adaptive approach than zero-shot prompting. Our evaluation of the few-shot approach is based on limited exposure, under the assumption that the model's capacity to generalize from a few examples reflects its overall capability.

In our few-shot approach, we tested various prompts for output API list generation and structuring. However, we encountered issues with the model using natural text instead of indexing outputs from previous API tools. To address this, we had to instruct the model explicitly in the output structuring prompt. The model also tended to miss certain steps, such as creating a summary before generating actionable objects. In longer queries, it often failed to complete all tasks, providing an incomplete list of APIs.
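A minimal sketch of how such demonstrations can be assembled is shown below; the example query, tool names, and the $$PREV[i] back-referencing notation are illustrative stand-ins for the actual toolset.

FEW_SHOT_EXAMPLES = [
    {
        "query": "Summarize the issues similar to issue-123",
        "solution": '[{"tool_name": "get_similar_work_items", '
                    '"arguments": {"work_id": "issue-123"}}, '
                    '{"tool_name": "summarize_objects", '
                    '"arguments": {"objects": "$$PREV[0]"}}]',
    },
    # Further demonstrations are appended here in exactly the same shape.
]

def build_few_shot_messages(query: str, tool_docs: str) -> list:
    messages = [{"role": "system", "content": "Available tools:\n" + tool_docs}]
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": ex["query"]})
        messages.append({"role": "assistant", "content": ex["solution"]})
    messages.append({"role": "user", "content": query})
    return messages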
2.2.3. Chain of Thought (CoT)

Chain of Thought prompting enables complex reasoning capabilities through intermediate reasoning steps. It allows language models to break down complex problems into manageable intermediate steps, allocating additional computation where more reasoning is required.

The transparency of CoT provides insight into the model's decision-making process, aiding in debugging and in understanding how specific answers are reached. The approach is versatile and potentially applicable to various tasks requiring complex reasoning, including this one.

Exploration of advanced CoT techniques revealed a promising contender: Plan and Solve Prompting. This approach introduces a dual-phase methodology: first, formulate a strategic plan that breaks the overarching task into smaller, actionable subtasks; then, systematically execute these subtasks in alignment with the devised plan. Notably, experimental results showcased its superiority over zero-shot CoT and comparable performance to Few-Shot CoT. Given these encouraging findings, there is strong motivation to conduct more in-depth experiments to unlock the full capabilities of this approach.
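The two phases can be expressed as a pair of chained prompts. In this sketch, llm stands for any text-in, text-out completion function, and the prompt wording is illustrative.

PLAN_PROMPT = ("Let's first understand the query and devise a plan: list the "
               "small subtasks, one per line, that the available tools can solve.")
SOLVE_PROMPT = ("Now carry out the plan step by step, mapping each subtask to a "
                "tool call with its arguments, and combine the results.")

def plan_and_solve(query: str, llm) -> str:
    # Phase 1: elicit a plan of subtasks for the query.
    plan = llm(PLAN_PROMPT + "\n\nQuery: " + query)
    # Phase 2: execute the subtasks in alignment with the devised plan.
    return llm(SOLVE_PROMPT + "\n\nQuery: " + query + "\n\nPlan:\n" + plan)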
• ...end, two examples were given. From our experiments, Two-Shot CoT seemed to perform well, while One-Shot CoT struggled.
• We ensured that the few-shot examples were in a specific format, so as to keep some uniformity and make it easier to add new examples or edit existing ones.

Figure 6. Decomposed Prompting, a method for solving complex tasks by breaking them down into simpler sub-tasks. It involves decomposing the original query into sub-queries and then combining the results.

We employ an expert software engineer persona with ExpertPrompting. Our trials show that language models exhibit improved performance when ExpertPrompting is included for our task of addressing user queries.
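A sketch of the persona prepend follows; the persona text is illustrative rather than our exact production prompt.

EXPERT_PERSONA = ("You are an expert software engineer who is deeply familiar "
                  "with this API toolset and routinely composes tool calls to "
                  "resolve user requests.")

def with_expert_persona(system_prompt: str) -> str:
    # The persona is simply prepended to the task-specific system prompt.
    return EXPERT_PERSONA + "\n\n" + system_prompt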
• Retrieval-based Modification: Employing a database for storing and retrieving tools. First, given an input prompt, a retriever fetches relevant documents from a corpus of knowledge. Then, these documents, along with the original prompt, are fed to an LLM, which generates the response (a minimal sketch follows this item).
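A minimal retrieve-then-generate sketch, assuming an embedding function and a precomputed matrix of tool-document vectors (both hypothetical stand-ins for a real retriever):

import numpy as np

def retrieve_tools(query_vec, tool_vecs, tool_docs, k=5):
    # Rank tool documents by cosine similarity to the query embedding.
    sims = tool_vecs @ query_vec / (
        np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    return [tool_docs[i] for i in np.argsort(-sims)[:k]]

def rag_answer(query, embed, llm, tool_vecs, tool_docs):
    # Feed the retrieved documents, plus the original query, to the LLM.
    relevant = retrieve_tools(embed(query), tool_vecs, tool_docs)
    context = "\n\n".join(relevant)
    return llm("Tools:\n" + context + "\n\nUser query: " + query)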
We further explore post-processing the model's response to deal with the following issues:

• Output Structuring: Our experiments show that even when instructed, LLMs may struggle to produce output in the desired format. This issue can be solved by ...
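As one generic remedy, the raw output can be validated against the expected shape and the model re-prompted on failure; the sketch below assumes the target format is a JSON list of tool calls.

import json

def parse_tool_calls(raw: str, llm, retries: int = 1):
    for attempt in range(retries + 1):
        try:
            calls = json.loads(raw)
            if isinstance(calls, list) and all(
                    isinstance(c, dict) and "tool_name" in c for c in calls):
                return calls
        except json.JSONDecodeError:
            pass
        if attempt < retries:
            # Ask the model to reformat its own output, then try parsing again.
            raw = llm('Reformat the following as a JSON list of {"tool_name": ..., '
                      '"arguments": ...} objects and output nothing else:\n' + raw)
    raise ValueError("output could not be coerced into the expected format")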
3.4. Self-Instruct Dataset

In the Self-Instruct framework Wang et al. (2022), the process starts with a compact seed set of manually crafted queries from prior approaches. This set acts as the task pool, from which queries are randomly sampled to stimulate the language model (LM). The LM generates new queries and their corresponding outcomes. The generated outputs go through a careful filtering process to eliminate low-quality or redundant content, and the refined data is integrated back into the initial repository. This iterative bootstrapping algorithm progressively sharpens the LM's ability to follow query patterns, establishing a self-improving cycle that leverages its own generated data for subsequent fine-tuning. We expect this approach to outperform the programmatic generation of new queries from predefined templates, opening up possibilities for a more diverse dataset.
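The bootstrapping loop can be summarized as follows; the prompt wording, sampling sizes, and the simple novelty filter are simplified placeholders for the real pipeline.

import random

def self_instruct(seed_queries, llm, rounds=3, per_round=8):
    task_pool = list(seed_queries)
    for _ in range(rounds):
        # Randomly sample demonstrations from the task pool to stimulate the LM.
        demos = random.sample(task_pool, k=min(4, len(task_pool)))
        prompt = ("Here are example developer queries:\n- " + "\n- ".join(demos)
                  + "\nWrite " + str(per_round) + " new, diverse queries in the same style.")
        candidates = [line.strip("- ").strip()
                      for line in llm(prompt).splitlines() if line.strip()]
        # Filter out empty or duplicate generations before pooling them back.
        task_pool.extend(q for q in candidates if q and q not in task_pool)
    return task_pool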
Figure 9. PEFT techniques: finetuning the embedding space using NEFTune and toolkens, updating inner-layer parameters using LoRA, and adding additional layers with LLaMA Adapters.

4.1. LLM Inner Layers Finetuning

4.1.1. LoRA

Low-rank adaptation, or LoRA, updates the parameters of specific dense layers by injecting trainable rank-decomposition matrices into the architecture. Common practice is to keep the ratio lora_r : lora_alpha at 1:1 so as not to overpower the base model, but we may deviate from this, our task being very specific to developer queries. LoRA fine-tuning is orthogonal to many other fine-tuning methods, which means it can be combined with them, and it introduces no inference latency compared to a fully fine-tuned model, which is one of the important factors for this task.
LoRA has proven effective in learning tool capabilities, as evident from the works of Qiao et al. (2023) and Yang et al. (2023), making this technique worth exploring.
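A minimal LoRA setup with the HuggingFace PEFT library is sketched below; the base model and target modules are illustrative choices, with lora_alpha / lora_r kept at 1:1 as discussed.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=16, lora_alpha=16,                  # 1:1 ratio; deviating from it is an option
    target_modules=["q_proj", "v_proj"],  # which dense layers get low-rank updates
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()        # typically well under 1% of all weights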
4.1.2. Prefix-Tuning

Since hard prompting plays a big role in correctly addressing the input queries, prefix-tuning by Li and Liang (2021) allows us to optimize soft prompts as well, by adding learnable task-specific prefixes that may enhance the model's ability to interpret the different tasks in a user query. By learning only about 0.1% of the parameters, prefix-tuning can potentially obtain reasonable performance and extrapolate it to unseen queries. We recognize that the newly added parameters might introduce inference latency, but the speed-quality trade-off is worth investigating for this PEFT technique.
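Prefix-tuning is also available through PEFT; in this sketch the prefix length (num_virtual_tokens) and base model are illustrative choices.

from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,   # length of the learned prefix; a tunable choice
)
model = get_peft_model(model, config)   # only the prefix parameters are trainable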
4.1.3. LLaMA Adapters

Extending the ideas of prefix tuning and the original adapter method by Houlsby et al. (2019), researchers proposed LLaMA-Adapter Zhang et al. (2023a), a model-agnostic fine-tuning method whose zero-init attention makes the initial tuning more stable without disrupting the model's linguistic knowledge. We want our base model to retain its reasoning abilities while improving its tool-learning capabilities, and we can exploit this method to try to achieve that. Moreover, this method adds additional layers to only a few top and bottom layers, which works to our advantage when it comes to inference latency.
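The core of the zero-init idea can be caricatured in a few lines: the adapter's attention contribution is scaled by a learnable gate initialized to zero, so training starts from the unmodified base model. This is a simplification of the mechanism in Zhang et al. (2023a).

import torch
import torch.nn as nn

class ZeroInitGate(nn.Module):
    # Fuses base attention with adapter attention through a gate that starts
    # at zero, so the base model's behaviour is initially untouched.
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, base_attn, adapter_attn):
        # The adapter signal fades in only as the gate learns to open.
        return base_attn + torch.tanh(self.gate) * adapter_attn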
4.2. LLM Embedding Layer Finetuning

4.2.1. NEFTune

NEFTune Jain et al. (2023) adds noise to the embedding vectors during instruction fine-tuning, which improves conversational quality and helps the model produce well-formatted output. The research shows this method to be effective in improving the quality of chatbots, and it works well with other PEFT methods such as QLoRA (Quantized LoRA) by Dettmers et al. (2023). NEFTune avoids overfitting to the specifics of the instruction-tuning dataset, especially when the dataset is small, and adds no inference latency because no new parameter layers are introduced.
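In NEFTune, uniform noise scaled by alpha / sqrt(L * d), for sequence length L and embedding dimension d, is added to the embedded inputs during training only, as in this sketch:

import torch

def neftune_embed(embeddings, alpha=5.0, training=True):
    # embeddings: (batch, sequence length L, embedding dimension d)
    if not training:
        return embeddings          # unchanged at inference: no added latency
    L, d = embeddings.shape[1], embeddings.shape[2]
    scale = alpha / (L * d) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1, 1) * scale
    return embeddings + noise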
4.2.2. ToolkenGPT

Fine-tuning model embeddings to learn separate tokens for tools is the approach taken by ToolkenGPT Hao et al. (2023). Like NEFTune, this fine-tuning method focuses on the embeddings. The technique arms the model with a special vocabulary for the available tools by adding special tokens called "toolkens" to the embedding space, expanding its understanding of the tools available. New tools can be added to the toolset easily, and performance can be improved by fine-tuning the model on tool-specific tasks while keeping inference latency unchanged and the number of updated parameters small. This technique thus offers a way to handle newly introduced tools while performing well overall.
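A simplified version of the idea: add one new token per tool and train only the embedding matrix. The tool names and the coarse freezing mask below are illustrative; Hao et al. (2023) train just the newly added rows.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

toolkens = ["<tool:search_issues>", "<tool:summarize>", "<tool:create_ticket>"]
tokenizer.add_tokens(toolkens)                    # one new token per tool
model.resize_token_embeddings(len(tokenizer))     # grow the embedding matrix

# Freeze everything except the embedding matrix; a finer-grained mask would
# restrict training to just the new rows, as in the original method.
for name, param in model.named_parameters():
    param.requires_grad = "embed_tokens" in name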
4.3. Additional Methods

If time and resources allow, we will further explore methods such as Prompt Tuning Lester et al. (2021), Diff-Pruning Guo et al. (2021), BitFit Zaken et al. (2022), and IA3 Liu et al. (2022). None of these PEFT methods has been applied to a task exactly like ours, so we may have to apply variations of these techniques, or even devise a new one, to get the results we need. It would be interesting to see how each of them affects the model, whether by inducing new capabilities or by bringing out ones already present in our base model.

[Table: Technique vs. % of parameters]
5. Latency and Inference

When serving user queries, latency and response times are especially important factors impacting the user experience. We are exploring various approaches aimed at reducing latency, accelerating inference, and cutting memory consumption:

• Precision reduction (float16 or bfloat16) to speed up the model.
• 8-bit or 4-bit quantization to reduce memory consumption by 2x or 3x (a loading sketch follows this list).
• Fine-tuning with adapters (LoRA, QLoRA) to improve prediction accuracy on our data, which combines well with quantization.
• Tensor parallelism for faster inference with large models on multiple GPUs.
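For instance, a model can be loaded in 4-bit via bitsandbytes through transformers and then combined with LoRA adapters (the QLoRA recipe); the model name here is a placeholder.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the QLoRA 4-bit data type
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)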
We are also looking at libraries for LLM inference and serving, such as Text Generation Inference, DeepSpeed, and vLLM. These already include various optimization techniques: tensor parallelism, quantization, continuous batching of incoming requests, optimized CUDA kernels, and more; a minimal vLLM example follows.
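In this sketch, the model name and parallelism degree are placeholders; vLLM handles continuous batching and optimized kernels internally.

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=2)
params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(["First user query ...", "Second user query ..."], params)
for out in outputs:
    print(out.outputs[0].text)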
We are tracking the chain latency and token usage associated with our experiments to evaluate different models, and have tabulated some of those results in the experiments section of our deliverables.

We have also recently secured access to ChatGPT Plus, which will facilitate the dataset generation process as well as allow us to try some AI-based evaluation techniques.

Limited compute resources have posed challenges for thorough testing. Due to computing constraints, testing on GPT-4 was restricted in scope. We have additionally applied for access to alternative APIs, including Anthropic's, which would expand our capabilities for evaluation. Broader access to APIs, along with additional computational resources for dataset generation and training, would enable a more robust assessment and a more comprehensive examination of approaches.

7. Deliverables

Drive link for experimentation, prompts, and seed dataset: Team 15 Mid Eval.
References

T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs, 2023.

Y. Lin, L. Tan, H. Lin, Z. Zheng, R. Pi, J. Zhang, S. Diao, H. Wang, H. Zhao, Y. Yao, and T. Zhang. Speciality vs generality: An empirical study on catastrophic forgetting in fine-tuning foundation models, 2023.

H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, 2022.

S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez. Gorilla: Large language model connected with massive APIs, 2023.

S. Qiao, H. Gui, H. Chen, and N. Zhang. Making language models better tool learners with execution feedback, 2023.

Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023.

L. Yuan, Y. Chen, X. Wang, Y. R. Fung, H. Peng, and H. Ji. CRAFT: Customizing LLMs by creating and retrieving from specialized toolsets, 2023.

E. B. Zaken, S. Ravfogel, and Y. Goldberg. BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models, 2022.

R. Zhang, J. Han, C. Liu, P. Gao, A. Zhou, X. Hu, S. Yan, P. Lu, H. Li, and Y. Qiao. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention, 2023a.

Y. Zhang, H. Cai, Y. Chen, R. Sun, and J. Zheng. Reverse chain: A generic-rule for LLMs to master multi-API planning, 2023b.

Y. Zhuang, Y. Yu, K. Wang, H. Sun, and C. Zhang. ToolQA: A dataset for LLM question answering with external tools, 2023.