Inference Efficiency by Learning Task Complexity
Henry Bae & Kehang Zhu*
Department of Physics
Harvard University
Cambridge, MA 02138, USA
[email protected]
kehang_zhu@g.harvard.edu

Aghyad Deeb & Alex Fleury
Department of Computer Science
Harvard University
Cambridge, MA 02138, USA

* We give thanks to Professor H.T. Kung for providing helpful discussions.
Abstract
1 Introduction
The advent of large language models (LLMs) like GPT-4 signifies a remarkable advancement
in the field of artificial intelligence (OpenAI, 2023; Touvron et al., 2023), providing levels of
natural language processing and generation that were previously unattainable. However,
this progress comes with its own set of challenges, particularly in the realm of inference
costs. As these models become more advanced, the computational resources required to
operate them increase significantly (Li et al., 2020). For instance, it is speculated that GPT-4
uses roughly ten times more computation than its predecessor, GPT-3.5 (Brown et al., 2020),
as estimated from API pricing, highlighting a steep upward trend in resource demands.
Moreover, there is a uniform cost per token regardless of how difficult the task is.
This escalation in computational needs raises concerns about the sustainability and
accessibility of these technologies (Li et al., 2020; Corro et al., 2023), especially for smaller
entities or individual researchers. As more users gravitate towards these advanced models
regardless of task simplicity, there is a significant waste of computational resources. While
mainstream platforms offer manual model selection, allowing users to choose their preferred
model (OpenAI, 2023), there is no systematic approach for automatically selecting the most
appropriate model for a given task. This inefficiency becomes particularly apparent in tasks
such as replicating code from documentation, where the output is extensive but the task's
complexity does not justify the high computational expenditure. As a result, in the ongoing
evolution of LLMs, it becomes critical to research and establish strategies to reduce
these costs. Key questions arise, such as how to more effectively harness the capabilities
of smaller models, and how to accurately assess task complexity to appropriately allocate
tasks across various models. Addressing these considerations is vital not just for enhancing
the financial viability of these models, but also for expanding their utility and accessibility.
Resolving these concerns is an important step towards unlocking the full spectrum of
possibilities offered by LLMs in a wide array of applications.
To resolve the challenges posed above, we introduce the following question:
For a given problem, what is the smallest model class that returns a correct answer?
To quantify this question, we define the complexity of a problem in terms of the simplest,
least capable LLM that is able to correctly accomplish the task given in the prompt. Smaller
and less capable models will generate correct solutions for problems of low complexity but
will fail to do so for problems of higher complexity.
This notion of complexity can be difficult to define with respect to LLMs, especially when
the evaluation of the answer is subjective (for example, writing a meaningful poem).
Therefore, we narrow the scope of problems to those that have definitive correct answers,
such as mathematics problems (e.g., "What is the sum of two numbers?") or concrete
programming problems (e.g., a function to generate an array of n prime numbers).
To simplify the problem further, we restrict ourselves to three language models with clear
distinctions in parameter count and performance: Code Llama 7B (Rozière et al., 2023),
GPT-3.5, and GPT-4¹ (OpenAI, 2023). A comparison table of these models is shown below.²
Based on these three models, we aim to create a classification model that takes the prompt
as input, outputs its complexity as defined above, and chooses the appropriate model based
on that score. Figure 1 illustrates this setup.
2 Related Work
Metrics for Determining the Capabilities of LLMs: The evaluation of LLMs has seen
the development of diverse methodologies, each tailored to test specific capabilities. For
instance, MMLU (Hendrycks et al., 2020) uses multiple-choice questions across a wide range
of 57 subjects, from professional to academic, assessing the LLMs’ understanding in these
varied domains. Another approach, GSM8K (Cobbe et al., 2021), zeroes in on grade-school
math problems to evaluate numerical reasoning. Similarly, MATH (Hendrycks et al., 2021b)
challenges LLMs with math problems across five difficulty levels and seven sub-disciplines,
providing a comprehensive metric for their mathematical capabilities. In addition,
HumanEval (Chen et al., 2022) focuses on Python coding tasks to assess programming skills.
Reading comprehension and arithmetic are evaluated in DROP (Dua et al., 2019), measured
using the F1-score, and common-sense reasoning is tested in Zellers et al. (2019) through
multiple-choice questions.
¹ In this work, we use the "GPT-4 Turbo" model, but we will simply denote it as "GPT-4".
² We listed the data obtained at the time we ran this study, December 1st, 2023.
Figure 1: Overview of the problem: the prompt is first fed through the complexity model,
then to one of the three models. We want to train a complexity model that picks the
lowest-cost model that successfully accomplishes the task.
These methodologies, while producing aggregate percentage scores on various tasks, do not
provide a framework for determining whether a model is capable of correctly completing a
specific task, which is needed to efficiently utilize each model's abilities with minimal
computational resources.
LLM Autonomous Agents: The concept of autonomous agents in AI involves utilizing
models that can independently manage complex tasks. Research in this area includes the
HuggingGPT project and various studies advocating for the use of LLMs as controllers to
manage existing AI models. For example, a paper from Microsoft Research (Shen et al., 2023)
suggests using central agents to enhance multi-modal agent functionalities. Moreover, Liu
et al. (2023) propose using LLMs as a central controller to manage communication among
multiple agents, each focusing on specific types of actions. Another significant contribution,
by Qin et al. (2023), involves fine-tuning LLaMA into ToolLLaMA, equipped with a neural
API retriever, and its evaluation through ToolEval.
These advances point to a need for smarter model selection processes, moving beyond
selection based on function descriptions, which often leads to inefficient computational
resource usage.
3 Methods
Our approach first involves fine-tuning a small language model to output complexity levels
based on the task given in the prompt. To do so, we need to create a dataset with complexity
labels for the prompts. These labels cannot be assigned manually, as the definition of
complexity we present here makes them entirely dependent on the LLMs' outputs. Therefore,
we adopt an empirical definition of the complexity level based on the relative success rate of
each model on each task. Here, we focus only on tasks with clear answers, and the success
condition is simply that the LLM's output (extracted after any explanation) matches the
solution. A visualization of our approach is depicted in Figure 2.
Figure 2: Overview of our approach. Each row of the dataset is fed through the three
language models, and we store the success rate of each model. These success rates are used
to generate a single complexity value for each prompt.
Figure 3: An example of the ordering-based mapping from each model's success rate on a
task to a complexity level.
Due to the stochastic nature of the outputs, we query each of the $K$ primary LLMs
$L_1, L_2, \ldots, L_K$ multiple times ($M$) for each task $i$ ($i = 1, \ldots, N$) at a non-zero
temperature. The number of successful trials is denoted $X_{L_k,i}$, where $k \in [1, K]$,
$i \in [1, N]$, and $X_{L_k,i} \in [0, M]$. In the following, we denote Code Llama, GPT-3.5,
and GPT-4 as $L_1$, $L_2$, and $L_3$, respectively.
We then classify a prompt's complexity into one of five classes, each represented by an
integer in $\{1, 2, 3, 4, 5\}$, where 1 represents a very simple task and 5 a highly complex
task. A task's complexity is determined through a mapping based on the results of the
$K \times M$ total tests described above.
One example of the ordering mapping from the success rates of the models to the 5
complexity classes is defined in Figure 3 for task repetition count $M = 5$. For task $i$,
level 1 corresponds to the case where $X_{L_1,i} = 5$ or $X_{L_1,i} + X_{L_2,i} \geq 7$, in which
case we assume Code Llama handles the task well. Level 2 maps to the case where the
level 1 condition fails but $X_{L_2,i} = 5$ holds. Level 3 corresponds to the scenario where the
level 1 and 2 conditions fail but $X_{L_3,i} = 5$ holds. If the level 1, 2, and 3 conditions fail
and $X_{L_2,i} \geq 2$ or $X_{L_3,i} \geq 2$ holds, we label the task level 4. If all of these
conditions fail, we assign the task to level 5.
This mapping is based on the assumption that a model can reasonably solve a task that it
solved correctly at least twice over five trials, allowing for some ambiguity in the prompt,
parameters, and verification conditions. We do not seek a guaranteed correct answer (i.e.,
5/5 correct solutions); instead, we allocate the optimal model given the empirical results of
the 15 total trials. We note, however, that more elaborate mappings can be applied.
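As a concrete illustration, a minimal sketch of the Figure 3 mapping is given below; the function name and argument order are our own choices, not part of the original pipeline.

```python
# Minimal sketch of the Figure 3 mapping for M = 5 trials per model.
# x1, x2, x3 are the success counts of Code Llama, GPT-3.5, and GPT-4 on one task.
def complexity_level(x1: int, x2: int, x3: int, M: int = 5) -> int:
    if x1 == M or x1 + x2 >= 7:
        return 1  # Code Llama handles the task reliably
    if x2 == M:
        return 2  # GPT-3.5 solves it every time
    if x3 == M:
        return 3  # only GPT-4 solves it every time
    if x2 >= 2 or x3 >= 2:
        return 4  # solvable, but only unreliably, by the larger models
    return 5      # none of the models solves it reliably
```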
The method can in principle be applied to any task class. For demonstration purposes, we
chose to start with a common use case, Python coding tasks, and picked the Mostly Basic
Python Problems (MBPP) dataset (Austin et al., 2021). Each row of the dataset consists of a
task description that usually starts with "Write a function ...", the Python code (solution)
that accomplishes the task, and a series of Python assertion statements that can be used to
verify the model's output and obtain the success rate. The format and structure of the
dataset allowed us to construct a single pipeline for inference and verification.
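For reference, a single MBPP-style row looks roughly as follows; the field names follow the Hugging Face release of the dataset, and the content is abridged and illustrative.

```python
# Abridged, illustrative MBPP-style row; field names follow the Hugging Face
# release of the dataset.
mbpp_row = {
    "task_id": 2,
    "text": "Write a function to find the shared elements from the given two lists.",
    "code": "def similar_elements(test_tup1, test_tup2):\n"
            "    return tuple(set(test_tup1) & set(test_tup2))",
    "test_list": [
        "assert set(similar_elements((3, 4, 5, 6), (5, 7, 4, 10))) == set((4, 5))",
    ],
}
```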
To create the dataset, we constructed an automated pipeline to query the LLMs using tasks
from the MBPP dataset. We chose the set of LLMs to be GPT-4, GPT-3.5, and Code Llama
with 7 billion parameters. The first two were called via the OpenAI API, and Code Llama
was downloaded from Hugging Face and run locally. We also provided a predetermined
system prompt to ensure that the model outputs only code, without explanation, and follows
the same function definition as the code in the assertion statements. Code Llama performed
poorly with the prompt we engineered for the two larger models, so for it we cut the prompt
down to only the description of the function from the MBPP dataset and the function format
specifying the arguments.
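The exact prompts are not reproduced here; the strings below are illustrative paraphrases of the two prompt variants described above, not the ones actually used.

```python
# Illustrative paraphrases of the two prompt variants; not the exact prompts used.
SYSTEM_PROMPT_GPT = (
    "You are a Python coding assistant. Output only the code for the requested "
    "function, with no explanation, and use exactly the function name and "
    "arguments given in the task."
)

def code_llama_prompt(task_text: str, signature: str) -> str:
    """Reduced prompt for Code Llama: just the MBPP task description plus the
    required function signature."""
    return f"{task_text}\n{signature}"
```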
The tasks were fed to the three language models with the predetermined system prompt,
and the output of each model was checked against the assertion statements from the dataset
to verify the correctness of each LLM's answer. This was repeated 5 times (M = 5) to reduce
random noise in the LLMs' outputs and improve the robustness of the method. We then
adopted the mapping scheme shown in Figure 3 to assign these tasks to different complexity
levels.
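A minimal sketch of this labeling loop is given below. The helper query_model is a hypothetical stand-in for the OpenAI API and local Code Llama calls, and the row fields follow the MBPP example shown earlier; the counts it returns are the $X_{L_k,i}$ fed into the mapping above.

```python
# Sketch of the labeling loop (M = 5 trials per model). `query_model(model, task)`
# is a hypothetical stand-in for the OpenAI API / local Code Llama calls and is
# assumed to return generated Python code as a string.
def passes_assertions(generated_code: str, test_list: list) -> bool:
    """Execute the generated code, then the dataset's assert statements."""
    namespace = {}
    try:
        exec(generated_code, namespace)   # define the candidate function
        for assertion in test_list:
            exec(assertion, namespace)    # raises AssertionError on failure
        return True
    except Exception:
        return False

def success_counts(row, models=("code-llama-7b", "gpt-3.5", "gpt-4"), M=5):
    """Return the per-model success counts X_{L_k, i} for one task."""
    return {
        model: sum(
            passes_assertions(query_model(model, row["text"]), row["test_list"])
            for _ in range(M)
        )
        for model in models
    }
```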
Before fine-tuning, we also "cleaned" the dataset by removing all datapoints where each of
Code Llama, GPT-3.5, and GPT-4 answered correctly 0/5 times. We observed that in each of
these cases, the models failed due to a mismatch between the function definition in the
assertion code and the function definition in the prompt (an error in the MBPP dataset).
Once this dataset is established, we can fine-tune a small language model to analyze the
complexity of the prompts; we refer to this as the Complexity model. In this work, we chose
DaVinci-002 (Brown et al., 2020) as the Complexity model, as the OpenAI API allows it to be
fine-tuned easily.
We split the complexity-labeled dataset into a training set and a test set. Once the
Complexity model is fine-tuned on the training set, it can assign out-of-sample tasks to
different LLMs according to the predicted task complexity level.
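A minimal sketch of this fine-tuning step with the OpenAI Python client is shown below. The JSONL file name and the exact prompt/completion formatting are our assumptions; davinci-002 is a base completion model, so the legacy prompt/completion training format applies.

```python
# Sketch of fine-tuning the Complexity model via the OpenAI API (v1 Python client).
# Each line of the (assumed) JSONL file pairs a task description with its label, e.g.
#   {"prompt": "Write a function to ... ->", "completion": " 3"}
from openai import OpenAI

client = OpenAI()

training_file = client.files.create(
    file=open("complexity_train.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    model="davinci-002",
    training_file=training_file.id,
)
print(job.id, job.status)
```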
In this study, we adopted a simple rule for assigning tasks to different LLMs by their
complexity labels. The scheme maps each complexity class to one of Code Llama, GPT-3.5,
or GPT-4: we assigned complexity levels 1 and 2 to Code Llama, levels 3 and 4 to GPT-3.5,
and level 5 to GPT-4. Further studies are needed to improve the task-level mapping and
assignment rules.
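In code, this assignment rule reduces to a simple lookup; the model identifiers and the predict_complexity wrapper around the fine-tuned Complexity model are illustrative placeholders.

```python
# Level-to-model assignment rule described above. Model identifiers and
# `predict_complexity` (a wrapper around the fine-tuned Complexity model that
# returns an integer in 1..5) are illustrative placeholders.
LEVEL_TO_MODEL = {
    1: "code-llama-7b",
    2: "code-llama-7b",
    3: "gpt-3.5",
    4: "gpt-3.5",
    5: "gpt-4",
}

def route(prompt: str) -> str:
    level = predict_complexity(prompt)
    return LEVEL_TO_MODEL[level]
```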
Figure 4: Comparison of the prediction accuracy of the task complexity levels.
4 Results
The complexity dataset contains N = 180 labels. We use N1 = 144 (80%) for training and
N2 = 36 (20%) for testing. When running evaluations, we set the temperature of our LLMs
to 1. The results indicate that a non-fine-tuned Davinci-002 achieves 34% accuracy. After
fine-tuning, the model achieves 79% accuracy, a significant improvement of 45 percentage
points. A comparison of the test-set accuracy is shown in Figure 4.
Initially, we ran each task in the MBPP dataset through each LLM for only one trial and
classified complexity into the classes 0, 1, and 2, which map to Code Llama, GPT-3.5, and
GPT-4, respectively. Level 0 means Code Llama succeeded at the task, and level 1 means the
task can be done by GPT-3.5 and GPT-4 but not by Code Llama. The highest level (level 2)
means Code Llama and GPT-3.5 both failed but GPT-4 succeeded. Tasks that all three
models failed were discarded. We followed the same procedure to fine-tune the Davinci-002
model, and the resulting accuracy was 62%. The non-fine-tuned model shows no significant
change from the additional 4 trials, achieving a similar 32% as before. The 5-trial labeling
method thus yields a 17-percentage-point benefit over the single-trial labeling method.
This low accuracy reveals shortcomings of the single-trial labeling method. Providing an
LLM one opportunity to solve a Python task does not effectively represent that model’s
capability to solve a problem of that complexity. We also observed inflated type II error
rates that corroborated our skepticism.
We now discuss the cost savings and the accuracy trade-offs associated with the utilization
of the complexity model. To obtain a numerical estimate we need to make a series of
assumptions. First, we approximate the cost of each model by observing its usage cost via
API calls. From this, we assign a unit cost of 1 to the Code Llama 7B model, a cost of 10 to
GPT-3.5, and a cost of 100 to GPT-4. Second, we assume that the increase in computation
from the (five) calls to the Complexity model is negligible, considering that its output is a
single token and that its size is on par with the least complex model. Lastly, we assume a
uniform distribution of tasks across the different complexity levels.
Below are cost-savings estimates based on the empirical distribution of complexity observed
in our N = 180 dataset and on the observed over- and underestimations of complexity
during inference. We benchmark our performance and measure savings against the
assumption that users would otherwise use GPT-4 for all of their tasks.
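As a rough illustration of the accounting, the sketch below computes the expected per-task cost of routing relative to the all-GPT-4 baseline. The complexity distribution used here is a uniform placeholder, not the empirical distribution behind the reported savings.

```python
# Illustration of the cost accounting under the stated unit costs
# (Code Llama = 1, GPT-3.5 = 10, GPT-4 = 100). The complexity distribution
# below is a uniform placeholder; the paper's reported savings use the
# empirical distribution of the N = 180 tasks and the model's predictions.
MODEL_COST = {"code-llama-7b": 1, "gpt-3.5": 10, "gpt-4": 100}
LEVEL_TO_MODEL = {1: "code-llama-7b", 2: "code-llama-7b",
                  3: "gpt-3.5", 4: "gpt-3.5", 5: "gpt-4"}

def expected_cost(level_distribution: dict) -> float:
    """Expected per-task cost when routing each level to its assigned model."""
    return sum(p * MODEL_COST[LEVEL_TO_MODEL[level]]
               for level, p in level_distribution.items())

uniform = {level: 0.2 for level in range(1, 6)}   # placeholder distribution
routed = expected_cost(uniform)                   # (1 + 1 + 10 + 10 + 100) / 5 = 24.4
baseline = MODEL_COST["gpt-4"]                    # always using GPT-4
print(f"relative savings: {1 - routed / baseline:.1%}")  # 75.6% under this placeholder
```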
5 Discussion
The high classification accuracy of the fine-tuned model with the 5-trial labeling method
and the resulting 90% cost savings underscore a significant inefficiency in prevailing
approaches to LLM utilization, i.e., blindly using the most capable model for every task.
Our findings reveal the possibility of employing a top-down strategy for refining LLM
inference processes that does not involve modifications at the model level.
Such findings echo the study of machine behavior (Rahwan et al., 2019), raising the question
of the ongoing relevance of smaller, older models in the face of advancements of larger and
more capable models. However, akin to the diverse roles observed
in natural ecosystems, where not only the largest or most advanced species thrive, these
“vintage” LLMs may carve out unique niches. This suggests a dynamic, multi-faceted
ecosystem of LLMs, where diversity rather than dominance dictates ecological balance.
In this work, we analyzed the complexity of prompts involving coding tasks that correspond
to a deterministic answer. This opens questions about the feasibility of determining
complexity for more abstract tasks such as the generation of essays or poems. To do so, we
can utilize different datasets such as MMLU (Hendrycks et al., 2021a), a multi-task language
understanding benchmark, and determine whether our approach to assessing complexity
generalizes. It is important to note that our overall procedure can be generalized to any
dataset that has assertions or validation statements, and we speculate that training the
model on a combination of many datasets can help capture the complexity of a wide range
of tasks and prompts.
6 Conclusion
In this work, we introduced an approach that learns to predict a task's complexity and then
match it with the best-suited, size-appropriate language model.
We implemented this approach in code generation tasks across three LLMs of different
capabilities, achieving a remarkable 90% reduction in inference costs while maintaining an
accuracy rate of 86.7%. It’s important to highlight that these results were achieved with
substantial room for improvement in areas such as mapping, task assignment rules, and
the fine-tuning process. Next steps include exploring how this approach might work for
a broader variety of tasks, especially those that are more complex or less defined than
generating code.
References
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David
Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with
large language models. arXiv preprint arXiv:2108.07732, 2021.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla
Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini
Agarwal, Ariel Herbert-Voss, Gretchen Krueger, T. J. Henighan, Rewon Child, Aditya
Ramesh, Daniel M. Ziegler, Jeff Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric
Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam
McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are
few-shot learners. ArXiv, abs/2005.14165, 2020. URL https://api.semanticscholar.
org/CorpusID:218971783.
Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou,
and Weizhu Chen. CodeT: Code generation with generated tests. arXiv preprint
arXiv:2207.10397, 2022.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser,
Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers
to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
Luciano Del Corro, Allison Del Giorno, Sahaj Agarwal, Ting Yu, Ahmed Hassan Awadallah,
and Subhabrata Mukherjee. SkipDecode: Autoregressive skip decoding with batching
and caching for efficient LLM inference. ArXiv, abs/2307.02628, 2023. URL
https://api.semanticscholar.org/CorpusID:259360560.
Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt
Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over
paragraphs. arXiv preprint arXiv:1903.00161, 2019.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint
arXiv:2009.03300, 2020.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and
Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of
the International Conference on Learning Representations (ICLR), 2021a.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang,
Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the
math dataset. arXiv preprint arXiv:2103.03874, 2021b.
Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein, and Joey
Gonzalez. Train big, then compress: Rethinking model size for efficient training and
inference of transformers. In International Conference on Machine Learning, 2020. URL
https://api.semanticscholar.org/CorpusID:263868979.
Zhiwei Liu, Weiran Yao, Jianguo Zhang, Le Xue, Shelby Heinecke, Rithesh Murthy, Yihao
Feng, Zeyuan Chen, Juan Carlos Niebles, Devansh Arpit, et al. BOLAA: Benchmarking and
orchestrating LLM-augmented autonomous agents. arXiv preprint arXiv:2308.05960, 2023.